Public summary. This page summarizes the recurring structure and recent direction of the course. Semester-specific classroom links, deadlines, rooms, office hours, and exact grading rules are announced in the official learning management system.

Course Overview

Data Science and Analytics introduces students to the full data science pipeline: understanding a problem domain, examining raw data, cleaning and transforming data, performing exploratory and descriptive analytics, building predictive models, evaluating results, and communicating findings. The course is intentionally practical and interdisciplinary. Recent project groups include students from multiple departments so that domain expertise and computational skills can be combined in the same team.

The course is closely aligned with data mining practice. Students study data preprocessing, data warehousing and OLAP, pattern mining, clustering, classification, evaluation, visualization, and project reporting. Recent course projects have emphasized medical data mining, especially early hospital readmission analysis using the UCI Diabetes 130-US Hospitals for Years 1999–2008 dataset.

Data preprocessing · Exploratory data analysis · Clustering · Association rules · Classification · Feature selection · Model evaluation · Medical data mining · Reproducible analytics

Learning Outcomes

Data Understanding and Preparation

  • Identify attribute types, measurement scales, missing values, noise, outliers, and data-quality problems.
  • Apply appropriate data cleaning, integration, encoding, transformation, normalization, sampling, and dimensionality-reduction methods.
  • Prepare a final analytical dataset and clearly justify preprocessing decisions.

Descriptive and Predictive Analytics

  • Use clustering and association rule mining to discover interpretable patterns.
  • Formulate classification and optional regression tasks from a real dataset.
  • Compare multiple models using appropriate metrics, experiments, visualizations, and statistical reasoning.

Experimentation and Reproducibility

  • Design 6–10 controlled predictive experiments that vary algorithms, hyperparameters, feature subsets, and preprocessing choices.
  • Use GitHub, notebooks, scripts, clear README files, and organized repositories for reproducible work.
  • Document datasets, transformations, assumptions, and evaluation procedures.

Communication and Teamwork

  • Present technical work clearly to a mixed audience of computer engineering and non-computer-engineering students.
  • Use visualizations and tables to communicate patterns, model behavior, limitations, and practical insights.
  • Demonstrate balanced team contribution and the ability to answer project questions individually.

Core Topic Map

The exact order may vary by semester, but recent offerings focus on the following themes.

1. Introduction to Data Mining and Data Science

Knowledge discovery from data, data mining tasks, descriptive vs. predictive analytics, supervised vs. unsupervised learning, applications, data science lifecycle, and ethical issues.

2. Data, Measurements, and Preprocessing

Data objects, attributes, attribute types, summary statistics, distribution shape, covariance, correlation, similarity and distance measures, dirty data, missing values, noise, normalization, discretization, PCA, sampling, and dimensionality reduction.
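
Two of the ideas listed above, min-max normalization and Minkowski distance, can be illustrated in a few lines. This is a minimal sketch in plain Python; the function names `min_max_normalize` and `minkowski_distance` are illustrative choices, not part of the course material.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float], new_min: float = 0.0,
                      new_max: float = 1.0) -> List[float]:
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def minkowski_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


normalized = min_max_normalize([2, 4, 6])
euclidean = minkowski_distance((0, 0), (3, 4), p=2)
manhattan = minkowski_distance((0, 0), (3, 4), p=1)
```

Because distance measures add up per-attribute differences, an attribute with a large numeric range can dominate the result unless it is rescaled first, which is one reason normalization usually precedes distance-based methods.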

3. Data Warehousing and OLAP

Data warehouse architecture, ETL, multidimensional data models, fact and dimension tables, star and snowflake schemas, cubes, roll-up, drill-down, slice, dice, pivot, and OLAP-style analysis.
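
The OLAP operations named above can be imitated on a small table with pandas. The sketch below uses a made-up fact table; the column names and counts are purely illustrative and do not come from any course dataset.

```python
import pandas as pd

# Toy fact table: each row holds an encounter count for one combination of dimensions.
facts = pd.DataFrame(
    {
        "year": [2007, 2007, 2007, 2008, 2008, 2008],
        "department": ["Cardiology", "Cardiology", "Surgery",
                       "Cardiology", "Surgery", "Surgery"],
        "readmitted": ["yes", "no", "no", "yes", "yes", "no"],
        "encounters": [10, 40, 30, 15, 5, 45],
    }
)

# Roll-up: aggregate away department and readmission status, keeping only year.
rollup_by_year = facts.groupby("year")["encounters"].sum()

# Drill-down: return to a finer level of detail (year and department).
drilldown = facts.groupby(["year", "department"])["encounters"].sum()

# Slice: fix one dimension to a single value.
slice_2008 = facts[facts["year"] == 2008]

# Dice: select a sub-cube by restricting two dimensions at once.
dice = facts[(facts["year"] == 2008) & (facts["department"] == "Surgery")]

# Pivot: rotate the view so departments are rows and readmission status is columns.
pivot = facts.pivot_table(index="department", columns="readmitted",
                          values="encounters", aggfunc="sum", fill_value=0)
```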

4. Pattern Mining and Association Rules

Frequent itemsets, support, confidence, lift, anti-monotonicity, Apriori, candidate generation, rule interpretation, interestingness, and pattern discovery in structured or discretized data.
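
Support, confidence, and lift follow directly from counting transactions. The sketch below computes them on five made-up transactions; the item names are hypothetical discretized attributes chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List

# Toy transactions; each is a set of (hypothetical) discretized items.
transactions: List[FrozenSet[str]] = [
    frozenset({"insulin", "long_stay", "readmitted"}),
    frozenset({"insulin", "long_stay"}),
    frozenset({"insulin", "readmitted"}),
    frozenset({"long_stay"}),
    frozenset({"insulin", "long_stay", "readmitted"}),
]


def support(itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Estimated P(consequent | antecedent)."""
    antecedent, consequent = set(antecedent), set(consequent)
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Confidence divided by the consequent's support; 1.0 suggests independence."""
    return confidence(antecedent, consequent) / support(consequent)


rule_support = support({"insulin", "long_stay", "readmitted"})
rule_confidence = confidence({"insulin", "long_stay"}, {"readmitted"})
rule_lift = lift({"insulin", "long_stay"}, {"readmitted"})
```

Anti-monotonicity is visible here: the three-item set cannot have higher support than its two-item subset, which is the property Apriori uses to prune candidates.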

5. Clustering and Descriptive Analytics

k-means, hierarchical clustering, DBSCAN, distance choices, scaling effects, cluster validity, silhouette analysis, cluster interpretation, and descriptive pattern discovery.
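
Scaling effects and silhouette analysis can be tried together on synthetic data. The following sketch assumes scikit-learn and NumPy are installed and uses generated blobs rather than any course dataset; one feature is deliberately placed on a much larger scale so that standardization matters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]),
    cluster_std=1.0,
    random_state=0,
)
# Inflate the second feature's units; unscaled, it would dominate Euclidean distance.
X[:, 1] *= 1000.0

# Standardize so both features contribute comparably to the distance.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for several k and record the mean silhouette coefficient of each result.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
```

On real data the silhouette curve is rarely this clear-cut, so the chosen number of clusters should still be checked against domain interpretation.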

6. Classification and Predictive Analytics

Training/test/validation splits, cross-validation, decision trees, k-NN, Naive Bayes, SVM, random forests, gradient boosting, ANN/MLP, feature selection, hyperparameter tuning, and model comparison.
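
One way to combine a held-out test set with cross-validation is to wrap preprocessing and the classifier in a single scikit-learn pipeline, so that scaling is refit inside each training fold. This is a sketch on synthetic data; the choice of k-NN, five folds, and the F1 score is an assumption made for illustration, not a course requirement.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, moderately imbalanced binary problem (illustrative only).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# Hold out a test set that is not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler is part of the pipeline, so it is fit only on each training fold.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", KNeighborsClassifier(n_neighbors=7)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")

# After model selection is finished, fit once on all training data and evaluate once.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```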

7. Evaluation, Visualization, and Reporting

Confusion matrix, accuracy, precision, recall, F1, ROC curve, AUC, class imbalance, baseline models, statistical significance, experiment tables, visualization quality, conclusions, limitations, and reproducible reporting.
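
The relationship between the confusion matrix and the derived metrics can be made concrete with a small calculation. The counts below are invented to show why accuracy alone can mislead under class imbalance; `classification_metrics` is an illustrative helper name.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical test set: 1000 cases, 100 of them positive.
model_metrics = classification_metrics(tp=30, fp=20, fn=70, tn=880)

# Baseline that always predicts the negative (majority) class.
baseline_metrics = classification_metrics(tp=0, fp=0, fn=100, tn=900)
```

In this made-up case the model's accuracy (0.91) barely exceeds the majority baseline (0.90), while recall shows that the baseline finds no positive cases at all.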

Textbook and Reading Structure

The main textbook used in recent offerings is Jiawei Han, Jian Pei, and Hanghang Tong, Data Mining: Concepts and Techniques, 4th ed., Morgan Kaufmann / Elsevier. The official companion site provides chapter slides used as course material.

Core Chapters

  • Chapter 1: Introduction
  • Chapter 2: Data, measurements, and data preprocessing
  • Chapter 3: Data warehousing and online analytical processing
  • Chapter 4: Pattern mining: basic concepts and methods
  • Chapter 6: Classification: basic concepts and methods
  • Chapter 8: Cluster analysis: basic concepts and methods

Supporting Practice

  • Python data analysis with pandas, NumPy, scikit-learn, and visualization libraries.
  • Notebook-based exploratory analysis and reproducible modeling pipelines.
  • GitHub repositories for version control, documentation, and project delivery.
  • Medical data mining examples based on public datasets.

Assessment and Course Work

Assessment varies by semester. A recent Spring 2026 grading pattern used quizzes, a midterm exam, a project, and a final exam.

  • Quiz (15%): Reading-based checks and short technical questions.
  • Midterm (25%): Data, preprocessing, warehousing, OLAP, and pattern mining foundations.
  • Project (20%): Team-based descriptive and predictive analytics study.
  • Final (40%): Broad coverage including clustering, classification, evaluation, and project-related concepts.

This breakdown is shown to explain the course style. The official grading scheme for a given semester is announced separately.

Recent Term Project: Medical Data Mining, Descriptive Analytics, and Predictive Analytics

In the recent project structure, all groups work on a common public medical dataset and build a complete data science pipeline. The project is simplified into two major graded milestones: a progress presentation and a final presentation. The emphasis is on technical correctness, interpretation, visualization quality, reproducibility, and clear communication.

Common Dataset

  • Hospital encounters: 101,766 records from diabetic inpatient encounters.
  • Features: 47 categorical and integer attributes with missing values.
  • Clinical period: 1999–2008, ten years of care from 130 US hospitals and integrated delivery networks.

The UCI dataset is appropriate for classification and clustering. Early readmission is a natural classification target, and variables such as time_in_hospital may support optional regression analysis. Because it includes sensitive demographic and clinical attributes, students should treat it as a serious healthcare analytics dataset and report limitations carefully.
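
A binary early-readmission target can be derived from the dataset's readmission column. The sketch below assumes that column takes the values `<30`, `>30`, and `NO`, and treats only `<30` as the positive class; both points should be checked against the downloaded data, and the sample values here are made up.

```python
from collections import Counter


def encode_early_readmission(value: str) -> int:
    """Return 1 for readmission within 30 days, 0 otherwise (assumed coding)."""
    return 1 if value.strip() == "<30" else 0


# Made-up sample of readmission labels, not real records.
sample = ["NO", ">30", "<30", "NO", "NO", ">30", "<30", "NO"]

y = [encode_early_readmission(v) for v in sample]
class_counts = Counter(y)
positive_rate = class_counts[1] / len(y)
```

Reporting the positive rate early makes class imbalance visible before any model is trained.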

Required Project Components

1. Data Understanding and Preprocessing

  • Inspect rows, columns, feature types, and target candidates.
  • Analyze missing values, invalid values, duplicates, imbalance, and data-quality problems.
  • Justify feature removal, feature transformation, encoding, scaling, discretization, and final dataset construction.

2. Descriptive Analytics

  • Apply k-means clustering.
  • Apply hierarchical clustering.
  • Apply DBSCAN.
  • Apply Apriori association rule mining.
  • Interpret discovered patterns in a healthcare or hospital-management context.

3. Predictive Analytics

  • Formulate a classification task, preferably early readmission.
  • Use feature selection and ranked feature tables.
  • Compare at least three classification algorithms.
  • Run approximately 6–10 experiments.
  • Optionally formulate a regression task such as predicting time_in_hospital.

Recommended Algorithms and Metrics

Task | Methods | Recommended outputs
Preprocessing | Missing-value handling, categorical encoding, scaling, discretization, feature selection, train/test split | Preprocessing diagram, feature table, missing-value table, final dataset shape
Clustering | k-means, hierarchical clustering, DBSCAN | Cluster profiles, visualizations, silhouette or other validity scores, interpretation of patient/hospital patterns
Association rules | Apriori with support, confidence, lift | Top rules, filtered meaningful rules, rule interpretation, warnings about spurious rules
Classification | k-NN, Naive Bayes, SVM, decision trees, random forest, gradient boosting/XGBoost, ANN/MLP | Metric comparison table, confusion matrix, ROC curve/AUC if applicable, significance analysis
Regression (optional) | Linear models, trees, random forest, gradient boosting, MLP | MAE/RMSE/R² table, residual analysis, interpretation of important predictors

Project Milestones and Deliverables

Progress Presentation

By the progress milestone, groups should have completed repository setup, dataset inspection, a preprocessing plan, initial EDA, first clustering results, first association-rule results, an initial predictive-task formulation, and a plan for the remaining work.

  • Recommended length: around 10–15 slides.
  • Include group number, full member list, student numbers, departments, and e-mail addresses.
  • Show dataset overview, missing values, feature types, completed preprocessing, early insights, and work distribution.

Final Presentation

By the final milestone, groups should complete the full data science study and present their descriptive analytics, predictive analytics, conclusions, limitations, and reproducibility package.

  • Recommended maximum: about 20 slides.
  • Include clustering comparisons, Apriori results, feature selection, 6–10 predictive experiments, best-model confusion matrix, ROC/AUC if applicable, and statistical significance analysis.
  • Explain practical insights and limitations in a healthcare context.
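
The statistical significance analysis listed above can take several forms. One common option for two classifiers evaluated on the same test set is McNemar's test, sketched here in its exact (binomial) form. This is an illustration of one possible approach, not a method the course prescribes, and the helper name `mcnemar_exact_p` is hypothetical.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: test cases model A classified correctly and model B incorrectly
    c: test cases model A classified incorrectly and model B correctly
    """
    n = b + c
    if n == 0:
        # The models never disagree, so there is no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each disagreement favors either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_clear_difference = mcnemar_exact_p(1, 9)
p_no_difference = mcnemar_exact_p(5, 5)
```

Only the cases where the two models disagree enter the test, so two models with similar accuracy can still differ significantly if their errors fall on different cases.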

Final Project Package

  • Code package: A cleaned ZIP export of the GitHub repository with notebooks, scripts, README, and documentation.
  • Preprocessed data: The final preprocessed analytical dataset, preferably in XLSX or CSV, compressed if necessary. Raw data should not be uploaded to GitHub.
  • Final report: A PDF summary of descriptive analytics, predictive analytics, insights, conclusions, and important visual outputs.
  • Teamwork report: At least one page explaining individual contributions, collaboration process, and what the group learned from multidisciplinary teamwork.

Project Grade Pattern

  • Progress Presentation (40%): Technical progress, explanation quality, visualizations, early results, and ability to answer questions.
  • Final Presentation (60%): Completeness, depth of analysis, experimental quality, insight, reproducibility, and communication.

Suggested Project Pipeline

Step 1 — Problem framing

Define the analytical questions: What patient/hospital patterns are you trying to discover? What predictive target will you model? Why would the result matter in a healthcare or hospital-management context?

Step 2 — Data audit

Load the dataset, check shape, inspect IDs, target variables, feature types, missing-value codes, rare categories, duplicated rows, and suspicious values.
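
A data audit of this kind can be sketched with pandas. The example assumes missing values may appear both as true NaN and as a special code such as a question mark; the toy table, the `race` column, and the `audit` helper are illustrative stand-ins rather than the real file.

```python
import pandas as pd


def audit(df: pd.DataFrame, missing_codes=("?",)) -> pd.DataFrame:
    """Summarize dtype, distinct values, and missingness for every column."""
    # Treat both NaN and the listed special codes as missing.
    is_missing = df.isna() | df.isin(list(missing_codes))
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=True),
            "n_missing": is_missing.sum(),
            "pct_missing": (is_missing.mean() * 100).round(1),
        }
    )


# Toy table standing in for the loaded dataset.
toy = pd.DataFrame(
    {
        "encounter_id": [1, 2, 3, 4],
        "patient_nbr": [10, 10, 11, 12],
        "race": ["Caucasian", "?", "AfricanAmerican", "?"],
        "time_in_hospital": [3, 5, None, 2],
    }
)

report = audit(toy)
n_duplicate_rows = int(toy.duplicated().sum())
# Patients appearing more than once matter later for a leakage-free split.
n_repeat_patients = int(toy["patient_nbr"].duplicated().sum())
```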

Step 3 — Preprocessing design

Choose how to handle IDs, missing values, categorical features, ordinal features such as age bands, scaling, discretization, class imbalance, train/test split, and leakage risk.
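
Two of these decisions, ordinal encoding of age bands and a leakage-aware split, can be sketched in plain Python. The example assumes age bands are written like `[70-80)` and that a patient identifier is available for grouping; both assumptions should be checked against the actual data, and the helper names are hypothetical.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def age_band_to_ordinal(band: str) -> int:
    """Map a band such as '[70-80)' to an ordered integer (here 7)."""
    lower_bound = band.strip("[]()").split("-")[0]
    return int(lower_bound) // 10


def group_train_test_split(groups: Sequence[Hashable], test_fraction: float = 0.25,
                           seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split row indices so every group (e.g. patient) lands entirely on one side."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Made-up patient identifiers; some patients have several encounters.
patient_ids = [10, 10, 11, 12, 12, 13, 14, 14]
train_idx, test_idx = group_train_test_split(patient_ids, test_fraction=0.25, seed=0)
```

Splitting by patient rather than by row keeps repeat encounters of the same person out of both sets at once, which would otherwise inflate test performance. scikit-learn's `GroupShuffleSplit` offers the same idea with more options.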

Step 4 — Exploratory analysis

Produce descriptive tables and visualizations for demographic attributes, encounter characteristics, medication-related variables, readmission status, and clinically meaningful feature groups.

Step 5 — Descriptive analytics

Run clustering and association rule mining, then interpret the results in terms of patient profiles, hospital utilization, medication patterns, or readmission-related patterns.

Step 6 — Predictive modeling

Train several models under controlled experimental settings. Compare feature sets and algorithms using appropriate metrics and avoid using test data during model selection.
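
A controlled set of experiments can be organized as a grid over algorithms and feature subsets, with cross-validation on the development data and a single final check on the test set. The sketch below uses synthetic data with scikit-learn; the three algorithms, the two feature subsets, and the F1 metric are illustrative assumptions.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the final analytical dataset.
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Factories return a fresh, unfitted estimator for each experiment.
models = {
    "knn": lambda: make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
    "naive_bayes": lambda: GaussianNB(),
    "decision_tree": lambda: DecisionTreeClassifier(max_depth=5, random_state=0),
}
feature_sets = {"all": list(range(12)), "first_six": list(range(6))}

# Same folds for every experiment, so that only the studied factors vary.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = []
for (model_name, make_model), (fs_name, cols) in product(models.items(), feature_sets.items()):
    scores = cross_val_score(make_model(), X_dev[:, cols], y_dev, cv=cv, scoring="f1")
    results.append({"model": model_name, "features": fs_name,
                    "mean_f1": scores.mean(), "std_f1": scores.std()})

# Rank by cross-validated score; the test set plays no role in this choice.
results.sort(key=lambda r: r["mean_f1"], reverse=True)
best = results[0]

# Evaluate the selected configuration on the test set exactly once.
best_cols = feature_sets[best["features"]]
best_model = models[best["model"]]()
best_model.fit(X_dev[:, best_cols], y_dev)
test_accuracy = best_model.score(X_test[:, best_cols], y_test)
```

The `results` list maps directly onto an experiment table with one row per configuration.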

Step 7 — Reporting and reproducibility

Document all code, data transformations, experiments, results, limitations, and teamwork contributions. Make sure another reader can reproduce the main results.

Study Questions

The following questions summarize exam-preparation and project-preparation themes. They are intentionally extensive so students and external readers can understand the knowledge expected in the course.

Chapter 1 — Introduction to Data Mining and Data Science
  1. Define data mining and explain how it relates to knowledge discovery from data.
  2. What is the difference between descriptive analytics and predictive analytics? Give one example of each.
  3. Compare supervised, unsupervised, and semi-supervised learning using data science examples.
  4. What are the major stages of a data science project lifecycle?
  5. Why is problem formulation as important as algorithm selection in real-world data science?
  6. Explain the difference between data, information, knowledge, and actionable insight.
  7. Give examples of data mining applications in healthcare, finance, e-commerce, and education.
  8. What does it mean for a data mining result to be valid, novel, useful, and understandable?
  9. Why can a highly accurate model still be practically useless or ethically problematic?
  10. What is data leakage? Give an example from a medical prediction task.
  11. Why should baseline models be reported in predictive analytics?
  12. How do domain experts improve the quality of a data science project?
Chapter 2 — Data, Measurements, and Data Preprocessing
  1. What are the main types of data sets, and what are the distinguishing features of each type? Give examples.
  2. Describe the curse of dimensionality. Why is dimensionality reduction important?
  3. Define sparsity in datasets. How does sparsity affect analysis?
  4. What are attributes, and what are the main attribute types? Give examples of nominal, ordinal, interval, and ratio attributes.
  5. Explain the difference between interval-scaled and ratio-scaled numeric attributes.
  6. Compare discrete and continuous attributes with examples.
  7. What are common measures of central tendency? When should the median be preferred over the mean?
  8. Describe symmetric and skewed distributions. How do mean, median, and mode behave in a positively skewed distribution?
  9. What is covariance, and how is it related to correlation?
  10. What do positive, negative, and near-zero correlation coefficients indicate?
  11. Describe Minkowski distance and identify its special cases.
  12. What is the Jaccard coefficient, and why is it useful for asymmetric binary attributes?
  13. What are common dimensions of data quality?
  14. Define noise and list three common causes of noisy data.
  15. How can noisy data be handled? Explain binning as one option.
  16. What is data integration, and what problems does it address?
  17. What is normalization? Explain min-max normalization with an example.
  18. What is data discretization, and why might it be useful?
  19. Explain Principal Component Analysis and why it is used for dimensionality reduction.
  20. Compare supervised and unsupervised discretization methods.
  21. Explain simple random sampling and stratified sampling.
  22. What is cosine similarity, and where is it useful?
  23. What is Kullback-Leibler divergence? Why is it not considered a true metric?
  24. What is dirty data? Give examples of missing, inconsistent, duplicate, and invalid values.
  25. Why does data cleaning often take more time than model training?
  26. What is data scrubbing, and how can domain knowledge help?
  27. Compare deletion, imputation, and model-based strategies for missing values.
  28. What challenges arise when integrating data from multiple sources, and how does source trustworthiness matter?
Chapter 3 — Data Warehousing and OLAP
  1. What is a data warehouse, and how does it differ from an operational database?
  2. Explain the ETL process: extraction, transformation, and loading.
  3. What are fact tables and dimension tables?
  4. Compare star schema and snowflake schema designs.
  5. What is a data cube?
  6. Define roll-up, drill-down, slice, dice, and pivot operations.
  7. What is OLAP, and why is it useful for decision support?
  8. Compare MOLAP, ROLAP, and HOLAP at a conceptual level.
  9. How can aggregation level affect the interpretation of results?
  10. What is a concept hierarchy, and how does it support summarization?
  11. What data-quality issues can arise during ETL?
  12. How can a healthcare organization use OLAP-style analysis?
  13. Why are time dimensions important in warehousing and analytics?
  14. What is the difference between a data mart and an enterprise data warehouse?
  15. How can a data warehouse support both descriptive reporting and predictive modeling?
Chapter 4 — Pattern Mining and Association Rule Mining
  1. Define itemset, transaction, support, confidence, and lift.
  2. What is a frequent itemset?
  3. Explain the Apriori principle and why it reduces search space.
  4. What is anti-monotonicity in frequent pattern mining?
  5. How does Apriori generate and prune candidate itemsets?
  6. Why can a rule with high confidence still be uninteresting?
  7. How does lift help interpret association rules?
  8. What happens when minimum support is set too high or too low?
  9. What happens when minimum confidence is set too high or too low?
  10. Why is discretization often needed before association rule mining on numerical data?
  11. Give an example of a meaningful association rule in hospital data.
  12. How can redundant or trivial rules be filtered?
  13. What is the difference between closed frequent itemsets and maximal frequent itemsets?
  14. Why can association rules reveal correlation but not causation?
  15. How can domain experts help interpret association rules?
  16. What are rare but potentially important patterns?
  17. How can class labels be used when mining rules for predictive insight?
  18. What are possible privacy or fairness concerns in medical association rules?
  19. How would you compare rule sets generated under different support thresholds?
  20. What visualizations can help communicate association-rule results?
Chapter 6 — Classification and Predictive Modeling
  1. Define classification and give examples of binary, multiclass, and multilabel classification.
  2. What is the difference between a training set, validation set, and test set?
  3. Why should the test set remain untouched during model selection?
  4. Explain overfitting and underfitting.
  5. What is cross-validation, and why is it useful?
  6. How does k-nearest neighbors classify a new instance?
  7. What assumptions does Naive Bayes make?
  8. How does a decision tree choose a split? Explain information gain or Gini impurity.
  9. Why are random forests usually more robust than a single decision tree?
  10. What is the main idea behind support vector machines?
  11. What is gradient boosting, and how does it differ from bagging?
  12. What is an artificial neural network / MLP in tabular classification?
  13. Why can class imbalance be a serious problem in medical prediction?
  14. Define accuracy, precision, recall, specificity, F1-score, ROC curve, and AUC.
  15. When is recall more important than precision? Give a healthcare example.
  16. What information is contained in a confusion matrix?
  17. What is feature selection, and why can it improve interpretability and generalization?
  18. Compare filter, wrapper, and embedded feature selection methods.
  19. What is hyperparameter tuning?
  20. Why should preprocessing be fit only on the training data in a machine learning pipeline?
  21. How can statistical significance be tested between two models?
  22. Why is calibration important when predicted probabilities are used for decisions?
  23. How would you design 6–10 experiments for readmission classification?
  24. What does reproducibility mean in predictive modeling?
  25. How can feature importance be interpreted cautiously in a healthcare model?
Chapter 8 — Cluster Analysis
  1. Define clustering and explain how it differs from classification.
  2. What makes a clustering result good?
  3. Explain the k-means algorithm step by step.
  4. Why is k-means sensitive to initialization and scaling?
  5. How can the number of clusters be selected?
  6. What is the elbow method?
  7. What is the silhouette coefficient?
  8. Explain agglomerative hierarchical clustering.
  9. Compare single, complete, average, and Ward linkage.
  10. What is a dendrogram, and how is it interpreted?
  11. Explain the DBSCAN algorithm.
  12. What are core points, border points, and noise points in DBSCAN?
  13. How do eps and min_samples affect DBSCAN results?
  14. Why can DBSCAN discover non-spherical clusters?
  15. Why can DBSCAN fail when clusters have different densities?
  16. How do categorical variables complicate clustering?
  17. What preprocessing decisions are especially important before clustering?
  18. How can PCA or t-SNE-like visualization help communicate clusters?
  19. What is the difference between internal and external cluster validation?
  20. How would you interpret clusters in a medical dataset?
  21. Why should clusters not be automatically treated as causal or clinically valid groups?
  22. How can domain experts validate discovered clusters?
  23. What are outliers, and how can clustering methods reveal them?
  24. How would you compare k-means, hierarchical clustering, and DBSCAN on the same dataset?
  25. What tables and plots would you include in a final clustering presentation?
Project and Exam Preparation Questions
  1. For the diabetes readmission dataset, what should be the target variable for a classification task, and how would you encode it?
  2. Which features might create leakage if they encode information unavailable at prediction time?
  3. How would you handle the encounter_id and patient_nbr columns?
  4. How would you handle missing values represented by special codes such as question marks?
  5. Which features are nominal, ordinal, binary, or numeric?
  6. How would you handle age intervals during preprocessing?
  7. What visualizations would you use to summarize class imbalance?
  8. What visualizations would you use to compare readmitted and non-readmitted patients?
  9. How would you build a fair train/test split if patients appear more than once?
  10. Which feature selection methods would you try, and how would you present ranked features?
  11. Which three classifiers would you choose for a first baseline comparison, and why?
  12. How would you decide whether accuracy is an appropriate metric?
  13. What should be included in the confusion matrix interpretation of the best model?
  14. How would you explain an ROC curve to a non-technical teammate?
  15. How would you compare two models statistically?
  16. How would you turn continuous or high-cardinality variables into categories for Apriori?
  17. What makes an association rule medically or operationally meaningful?
  18. How can a group demonstrate balanced contribution using GitHub history?
  19. What should be in the final README so the project is reproducible?
  20. What limitations should be discussed when using historical hospital data for prediction?
