Introduction to Machine Learning

Public summary. This page summarizes the recurring structure and recent direction of the course. Semester-specific classroom links, deadlines, rooms, attendance lists, office hours, and exact grading rules are announced in the official learning management system.

Course Overview

The course introduces the main concepts, algorithms, mathematical foundations, implementation practices, and evaluation procedures used in modern machine learning. It combines classical supervised learning methods with introductory neural-network concepts and selected unsupervised and descriptive learning topics.

Recent offerings emphasize hands-on implementation, step-by-step algorithmic understanding, exam-style numerical reasoning, and project-based learning. Students are expected to understand not only how to call machine learning libraries, but also how major algorithms compute predictions, losses, gradients, model updates, and evaluation metrics.

Decision trees, Naive Bayes, k-nearest neighbors, Logistic regression, Gradient descent, Neural networks, Model evaluation, Overfitting, Clustering, Frequent pattern mining

Learning Outcomes

Foundations and Modeling

Explain the difference between supervised, unsupervised, and descriptive learning tasks.
Use vectors, matrices, probability, likelihood, loss functions, and gradients in basic ML derivations.
Recognize assumptions behind algorithms such as Naive Bayes, k-NN, logistic regression, and neural networks.

Algorithmic Understanding

Build and interpret decision trees using impurity-based or information-gain criteria.
Compute Naive Bayes probabilities, including class priors, likelihoods, and smoothing.
Perform forward-pass, loss, gradient, and update calculations for logistic regression and simple neural networks.

Implementation Practice

Implement simple machine learning algorithms in Python, including Naive Bayes and k-nearest neighbors.
Organize project code and experiments in GitHub repositories with reproducible instructions.
Use training/test splits, validation strategies, and evaluation metrics correctly.

Critical Evaluation

Diagnose overfitting, underfitting, bias-variance behavior, and data leakage.
Compare models using accuracy, precision, recall, F1, confusion matrices, ROC/AUC, and error analysis where appropriate.
Discuss the No Free Lunch theorem and explain why model choice depends on data, assumptions, and task objectives.

Core Topic Map

The exact order may vary by semester, but recent offerings cover the following themes.

1. Introduction and Mathematical Review

Machine learning tasks, examples, training data, features, labels, inductive bias, vectors and matrices, basic linear algebra, probability theory, conditional probability, Bayes rule, and common notation.

2. Decision Trees

Tree induction, attribute selection, entropy, information gain, impurity, recursive partitioning, stopping criteria, overfitting, pruning, interpretability, and exam-style construction of small trees.

3. Naive Bayes and Text Classification

Class priors, conditional likelihoods, conditional independence assumption, Laplace smoothing, multinomial Naive Bayes, classification by posterior scores, log-space computation, and simple text classification examples.

4. Instance-Based Learning and k-NN

Nearest-neighbor classification, distance functions, normalization, choice of k, decision boundaries, tie-breaking, computational cost, lazy learning, and Python implementation of a simple classifier.

5. Classification, Regression, and Evaluation

Supervised learning pipeline, regression vs. classification, holdout evaluation, cross-validation, confusion matrix, precision, recall, F1, ROC/AUC, class imbalance, baseline models, and experiment design.
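
As an illustration of cross-validation, the following is a minimal Python sketch of how k-fold train/test index splits can be generated. The function name k_fold_indices is hypothetical, and the sketch assumes the data has already been shuffled, since it cuts contiguous blocks.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k folds.

    Assumption: rows are already in random order, so contiguous blocks are fine.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train_idx, test_idx in folds:
    print(test_idx)
# prints [0, 1, 2, 3], then [4, 5, 6], then [7, 8, 9]
```

Each instance appears in exactly one test fold, and the model is retrained on the remaining folds each time; the k scores are then averaged.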

6. Logistic Regression and Gradient Descent

Linear scores, sigmoid function, binary cross-entropy, regularization, gradient computation, learning rate, parameter updates, softmax intuition, and numerical examples using calculators.
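
For reference, one common way to write the quantities listed above for two input features is sketched below. The notation is an assumption rather than the official course notation: the loss is averaged over n examples, the L2 penalty uses the (lambda/2) convention and excludes the bias, and eta denotes the learning rate.

```latex
\begin{aligned}
z_i &= w_1 x_{i1} + w_2 x_{i2} + b, \qquad
\hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
J(w, b) &= -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
  + \frac{\lambda}{2} \sum_{j} w_j^2 \\
\frac{\partial J}{\partial w_j} &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_{ij} + \lambda w_j, \qquad
\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \\
w_j &\leftarrow w_j - \eta\, \frac{\partial J}{\partial w_j}, \qquad
b \leftarrow b - \eta\, \frac{\partial J}{\partial b}
\end{aligned}
```

Other conventions, such as a penalty of lambda times the squared norm without the factor 1/2, change the gradient term to 2 * lambda * w_j.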

7. Artificial Neural Networks

Perceptron-style units, multilayer perceptrons, activation functions, forward propagation, backpropagation, hidden layers, gradient flow, deep feedforward networks, and practical issues in training.

8. Generalization and Optimization

Overfitting, underfitting, bias-variance trade-off, No Free Lunch theorem, maximum likelihood vs. MAP intuition, numerical computation, gradient-based optimization, and regularization.

9. Clustering and Frequent Pattern Mining

Hierarchical clustering, single-link distance, Euclidean distance, clustering interpretation, frequent pattern mining, support, confidence, and introductory pattern discovery concepts.

Reading and Material Structure

Recent offerings use instructor slides and selected readings organized around algorithmic modules. The course also incorporates mathematical refreshers and selected modern AI resources when useful for context.

Reading Assignments

Decision trees
Naive Bayes
Instance-based learning
Logistic regression
Artificial neural networks and deep feedforward networks
Numerical computation and gradient-based optimization
Overfitting, underfitting, bias/variance, MAP, and the No Free Lunch theorem

Lecture Materials

Introduction overview
Review of matrices and vectors
Probability theory for machine learning
Classification/regression algorithms and evaluation
Naive Bayes and text classification
Logistic regression and gradient descent
Artificial neural networks
Clustering and frequent pattern mining

Programming Assignments

Programming assignments are designed to make students implement basic algorithms rather than only use high-level APIs. Typical assignments include:

Simple Naive Bayes Implementation

Students implement a basic Naive Bayes classifier, estimate class priors and likelihoods, handle unseen features with smoothing, classify test instances, and report evaluation results.

Simple k-Nearest Neighbor Classifier

Students implement a k-NN classifier in Python, compute distances, choose k, handle ties, evaluate predictions, and observe the effect of feature scaling and distance functions.

Project Component

The course includes a team project that requires students to apply machine learning to a concrete dataset or application problem. Recent projects use teams of five students and GitHub repositories for collaboration. Students are expected to include the instructor and teaching assistant as collaborators when required by the semester instructions.

Typical Project Expectations

Define a supervised learning problem with clear input features and target variable.
Prepare the dataset with appropriate cleaning, encoding, scaling, and splitting procedures.
Compare several algorithms and justify the choice of models and hyperparameters.
Report results using suitable metrics and visualizations.
Explain errors, limitations, and possible improvements.
Maintain a reproducible GitHub repository with code, documentation, and presentation materials.

Progress and Final Presentation Style

Recent offerings use project proposal submissions, progress presentations, demos, final presentations, and repository-based evaluation. All group members are expected to understand the entire project and answer questions, not only the part they personally implemented.

Assessment Patterns

Exact weights vary by semester. Recent grading patterns included homework/programming assignments, a project, a midterm exam, and a final exam. A representative recent structure was:

Homework / programming assignments: 5%
Project: 20%
Midterm exam: 35%
Final exam: 40%

Exams may include multiple-choice questions, conceptual questions, and classical calculation questions. Students should be comfortable with calculator-based computations such as logarithms, sigmoid values, cross-entropy losses, Euclidean distances, simple gradients, and parameter updates.

Representative Exam and Study Skills

Numerical Skills

Compute entropy, information gain, class probabilities, likelihoods, and smoothed estimates.
Evaluate logistic regression predictions using sigmoid and cross-entropy.
Carry out one gradient descent update for logistic regression with L2 regularization.
Compute Euclidean distances and perform single-link hierarchical clustering steps.
Perform forward and backward calculations for a small MLP.

Conceptual Skills

Explain algorithm assumptions and when they fail.
Compare eager vs. lazy learning, generative vs. discriminative models, and parametric vs. non-parametric models.
Diagnose overfitting, underfitting, data leakage, and class imbalance.
Interpret confusion matrices and choose metrics appropriate to the application.

Study Questions

The following questions summarize the style of preparation expected in the course. They are useful for students, prospective students, and readers who want to understand the scope of the course.

Introduction, Probability, Vectors, and Matrices

Define supervised, unsupervised, and reinforcement learning. Give one realistic example of each.
What are features, labels, instances, training data, test data, and a hypothesis/model?
Explain the difference between classification and regression.
What is inductive bias, and why is it unavoidable in machine learning?
Explain the difference between training error, validation error, and test error.
What is Bayes' rule? How is it used in probabilistic classification?
What is conditional independence? Why is it important for Naive Bayes?
Given two vectors, compute their dot product, Euclidean distance, and cosine similarity.
Why are matrices and vectorized operations important in machine learning?
Explain why feature scaling may affect distance-based and gradient-based algorithms.
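
The vector question above can be checked with a few lines of plain Python. This is a minimal sketch; the two example vectors u and v are made up for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Illustrative vectors (not taken from course material).
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

print(dot(u, v))                           # 4.0
print(round(euclidean_distance(u, v), 3))  # 2.449
print(round(cosine_similarity(u, v), 3))   # 0.596
```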

Decision Trees

What is the main idea of decision-tree learning?
Define entropy and explain how it measures class impurity.
What is information gain? How is it used to select a split?
Compare information gain, gain ratio, and Gini impurity at a high level.
Why can decision trees overfit training data?
What are pre-pruning and post-pruning?
How can decision trees handle numeric attributes?
Why are decision trees considered interpretable models?
Construct a small decision tree from a given table and show all impurity calculations.
Discuss one advantage and one disadvantage of decision trees compared with logistic regression.
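
The entropy and information-gain calculations asked for above can be sketched in Python as follows. The four-row table and the attribute names outlook and windy are hypothetical, chosen so that one attribute separates the classes perfectly and the other not at all.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy table: two attributes, binary class label.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

print(entropy(labels))                            # 1.0
print(information_gain(rows, labels, "outlook"))  # 1.0
print(information_gain(rows, labels, "windy"))    # 0.0
```

Under this criterion the tree would split on outlook first, because it removes all class uncertainty while windy removes none.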

Naive Bayes

State the Naive Bayes classification rule.
What is the naive conditional independence assumption?
Why is Laplace smoothing needed?
Explain the difference between Bernoulli and multinomial Naive Bayes.
Why are log probabilities often used in Naive Bayes implementation?
Given class counts and feature counts, compute priors, likelihoods, and a posterior score.
How does Naive Bayes work for text classification?
Why can Naive Bayes perform well even when its independence assumption is not fully true?
What happens when a feature value appears in the test set but never appears in training?
Compare Naive Bayes with k-NN in terms of training cost, prediction cost, and assumptions.
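
To make the prior, likelihood, smoothing, and log-score questions concrete, here is a minimal sketch of multinomial Naive Bayes with Laplace (add-alpha) smoothing. It is not the assignment solution: the function names and the tiny spam/ham corpus are hypothetical, and words that never occurred in training are simply skipped at prediction time, which is one common convention among several.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-alpha smoothing: every vocabulary word gets alpha pseudo-counts.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, words):
    """Return the class with the highest log posterior score, plus all scores."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in words if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus.
docs = [["free", "prize", "now"], ["free", "offer"],
        ["meeting", "today"], ["project", "meeting", "now"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(docs, labels)
label, scores = predict(model, ["free", "prize"])
print(label)  # spam
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents, and smoothing keeps a single unseen word from forcing a class score to zero.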

k-Nearest Neighbors

What makes k-NN a lazy learning algorithm?
How does the choice of k affect bias and variance?
Why is normalization important for k-NN?
Compare Euclidean, Manhattan, and cosine distance for different types of data.
How can ties be handled in k-NN classification?
What is weighted k-NN?
Why can k-NN be computationally expensive at prediction time?
Explain how irrelevant features can harm k-NN.
Given a small labeled dataset and a test point, compute the k-NN prediction manually.
Discuss one case where k-NN is a reasonable baseline model.
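
A minimal k-NN sketch in Python is shown below for illustration; it is not the assignment solution. It assumes Euclidean distance, features that are already on comparable scales, and one particular tie-breaking rule (prefer the tied class of the closest neighbor). The function name and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: all work happens at prediction time.
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    neighbors = distances[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Tie-break (an assumption): pick the tied class whose member is closest.
    for _, label in neighbors:
        if label in tied:
            return label


# Hypothetical 2-D training set with two classes.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2), k=3))  # a
print(knn_predict(train_points, train_labels, (6, 5), k=3))  # b
```

Every prediction computes a distance to every training point, which is why prediction cost grows with the size of the training set.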

Logistic Regression and Gradient Descent

What is the logistic/sigmoid function, and why is it used for binary classification?
Write the logistic regression prediction equation for two input features.
What is binary cross-entropy loss?
Why is squared error usually not preferred for logistic regression classification?
What does L2 regularization do to the learned weights?
Explain the role of the learning rate in gradient descent.
Given weights, bias, and feature values, compute the predicted probability.
Given a small dataset, compute the average cross-entropy loss plus an L2 penalty.
Compute one gradient descent update for logistic regression with two features.
Explain the difference between batch, stochastic, and mini-batch gradient descent.
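
The probability, loss, and update questions above can be checked with a short script. The sketch below runs one batch gradient descent step on the four-example dataset given under Representative Calculation Exercises (T1 to T4), starting from zero parameters with learning rate 0.1 and regularization strength 0.2. It assumes an averaged cross-entropy loss and an L2 penalty of (lambda/2) times the squared weight norm with the bias left unpenalized; the course may use a different convention, and the function name logistic_step is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_step(X, y, w, b, lr, lam):
    """One batch gradient descent step for L2-regularized logistic regression."""
    n = len(X)
    probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
    data_loss = -sum(
        yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi, p in zip(y, probs)
    ) / n
    # Assumed penalty convention: (lam / 2) * ||w||^2, bias not penalized.
    loss = data_loss + 0.5 * lam * sum(wj ** 2 for wj in w)
    grad_w = [
        sum((p - yi) * x[j] for x, yi, p in zip(X, y, probs)) / n + lam * w[j]
        for j in range(len(w))
    ]
    grad_b = sum(p - yi for yi, p in zip(y, probs)) / n
    new_w = [wj - lr * g for wj, g in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return probs, loss, grad_w, grad_b, new_w, new_b


X = [(0, 0), (0, 1), (1, 0), (1, 1)]  # T1..T4 features
y = [0, 0, 0, 1]                      # T1..T4 labels

probs, loss, grad_w, grad_b, new_w, new_b = logistic_step(
    X, y, w=[0.0, 0.0], b=0.0, lr=0.1, lam=0.2
)
print(probs)           # [0.5, 0.5, 0.5, 0.5]
print(round(loss, 4))  # 0.6931
print(grad_w, grad_b)  # [0.0, 0.0] 0.25
print(new_w, round(new_b, 3))  # [0.0, 0.0] -0.025
```

Under these assumptions only the bias moves in the first step: every prediction is 0.5, the positive and negative errors cancel for each feature, and the penalty gradient is zero because the weights start at zero.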

Artificial Neural Networks

What is a neuron/unit in an artificial neural network?
What is the purpose of an activation function?
Compare sigmoid, tanh, and ReLU activations.
What is a hidden layer, and what does it allow the model to learn?
Explain the forward pass in a single-hidden-layer MLP.
What is backpropagation?
Why are gradients computed using the chain rule?
Given a small MLP with numeric weights, compute the forward pass.
For a simple output error, compute the gradient direction for one weight.
What are vanishing gradients, and why can they make deep networks difficult to train?
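
The forward-pass and gradient questions above can be sketched for a single-hidden-layer MLP with two sigmoid hidden units and one sigmoid output. Everything numeric here is an assumption for illustration: the weights are made up, and the loss is taken to be binary cross-entropy, which gives the simple output error y_hat - y.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Return hidden activations h and output probability y_hat."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, y_hat


def backward(x, y, h, y_hat, w2):
    """Gradients of binary cross-entropy via the chain rule."""
    delta_out = y_hat - y                    # dL/dz at the output unit
    grad_w2 = [delta_out * hj for hj in h]   # output-layer weights
    grad_b2 = delta_out
    # Propagate the error back through each hidden sigmoid: h * (1 - h).
    delta_hidden = [delta_out * w2j * hj * (1 - hj) for w2j, hj in zip(w2, h)]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2, grad_b2


# Hypothetical weights and one training instance.
W1 = [[0.5, -0.5], [0.3, 0.8]]  # one row per hidden unit
b1 = [0.0, -0.1]
w2 = [1.0, -1.0]
b2 = 0.2
x, y = (1, 1), 1

h, y_hat = forward(x, W1, b1, w2, b2)
grad_W1, grad_b1, grad_w2, grad_b2 = backward(x, y, h, y_hat, w2)
print(h, y_hat)
print(grad_w2, grad_W1)
```

The factor h * (1 - h) is at most 0.25 for a sigmoid, so products of such factors across many layers shrink quickly, which is the mechanism behind vanishing gradients.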

Generalization, Overfitting, and Evaluation

Define overfitting and underfitting.
Explain the bias-variance trade-off with examples.
What does the No Free Lunch theorem imply for model selection?
What is data leakage? Give two examples.
Why is a validation set needed?
Explain k-fold cross-validation.
Define accuracy, precision, recall, and F1 score.
When is accuracy misleading?
What is a confusion matrix, and how can it reveal error types?
Explain ROC curve and AUC conceptually.
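
The metric definitions above can be illustrated with a small script. This is a sketch with hypothetical function names and made-up labels for an imbalanced binary problem; returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up imbalanced example: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # (1, 1, 1, 7)
metrics = classification_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.8, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Accuracy looks acceptable at 0.8 because negatives dominate, while precision and recall show that only half of the positive predictions and half of the actual positives are handled correctly.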

Clustering and Frequent Pattern Mining

What is the difference between classification and clustering?
Explain bottom-up hierarchical clustering.
What is single-link distance between two clusters?
Given six points in two dimensions, perform the first three merges of single-link hierarchical clustering.
How do complete-link and average-link clustering differ from single-link clustering?
What is a dendrogram?
Define support and confidence in frequent pattern mining.
What is the Apriori principle?
What is lift, and why can it be more informative than confidence alone?
Give an example of an association rule and interpret it in plain language.
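
Support, confidence, and lift from the questions above can be computed directly from their definitions. The sketch below uses five made-up market-basket transactions; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Made-up transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(round(support(transactions, {"bread", "milk"}), 3))       # 0.6
print(round(confidence(transactions, {"bread"}, {"milk"}), 3))  # 0.75
print(round(lift(transactions, {"bread"}, {"milk"}), 4))        # 0.9375
```

The rule bread -> milk has a confidence of 0.75, which sounds strong, but milk already appears in 80% of all baskets. A lift below 1 shows that buying bread does not make milk more likely than its baseline, which is why lift can be more informative than confidence alone.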

Representative Calculation Exercises

Given points A=(0,0), B=(1,0), C=(2,2), D=(5,2), E=(4,5), F=(6,5), cluster them bottom-up using Euclidean distance and single-link hierarchical clustering. Show each merge and distance.
Given four examples with binary features (x1, x2) and label y, T1=(0,0,y=0), T2=(0,1,y=0), T3=(1,0,y=0), T4=(1,1,y=1), run one gradient descent update for logistic regression starting from w1=0, w2=0, b=0 with learning rate 0.1 and L2 regularization strength 0.2. Compute z, the predicted probabilities, the loss, the gradients, and the updated parameters.
Using the same dataset, compute the forward pass of a single-hidden-layer MLP with two hidden neurons and sigmoid activations for each training instance.
For a tiny text classification dataset, compute Naive Bayes class scores using Laplace smoothing and classify a new document.
For a small decision-tree dataset, compute entropy before and after a candidate split and decide which attribute should be selected.
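
Hand calculations for the clustering exercise can be checked with a short script. The following is a minimal sketch of bottom-up single-link clustering on the six points A to F given above, using Euclidean distance; the function name single_link_merges is hypothetical and the brute-force search is written for clarity, not efficiency.

```python
import math


def single_link_merges(points):
    """Merge clusters bottom-up; return (cluster_a, cluster_b, distance) per step."""
    clusters = [[name] for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = {"A": (0, 0), "B": (1, 0), "C": (2, 2),
          "D": (5, 2), "E": (4, 5), "F": (6, 5)}

merges = single_link_merges(points)
for left, right, d in merges:
    print(sorted(left), sorted(right), round(d, 3))
# ['A'] ['B'] 1.0
# ['E'] ['F'] 2.0
# ['C'] ['A', 'B'] 2.236
# ['D'] ['A', 'B', 'C'] 3.0
# ['E', 'F'] ['A', 'B', 'C', 'D'] 3.162
```

The merge distances are 1, 2, sqrt(5), 3, and sqrt(10). Note that D joins the {A, B, C} cluster through its distance of 3 to C, before the {E, F} cluster at sqrt(10).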

Useful Resources for Readers

Practice implementing simple algorithms from scratch before using advanced libraries.
Keep a formula sheet for entropy, information gain, Bayes rule, sigmoid, cross-entropy, gradients, and evaluation metrics.
For exams, practice small hand-calculation examples rather than only conceptual summaries.
For projects, maintain reproducible notebooks/scripts, commit regularly, and document all preprocessing decisions.