Machine Learning for Prediction
Learning Objectives
- Distinguish between supervised and unsupervised learning approaches
- Apply classification algorithms to predict AQI categories
- Build and evaluate regression models for concentration prediction
- Understand train-test splits and cross-validation
- Interpret model performance metrics (accuracy, RMSE, confusion matrix)
Machine Learning Overview
Supervised Learning
Learn from labeled examples to predict outcomes
- Classification: Predict categories (Good/Moderate/Unhealthy AQI)
- Regression: Predict continuous values (PM2.5 concentration)
Unsupervised Learning
Find patterns without labeled outcomes
- Clustering: Group similar observations (source apportionment)
- Dimensionality reduction: Simplify complex data (PCA)
The ML Workflow
- Data preparation: Clean, handle missing values, feature engineering
- Train-test split: Reserve data for unbiased evaluation (typically 80/20)
- Model selection: Choose algorithm appropriate to problem
- Training: Fit model to training data
- Hyperparameter tuning: Optimize model settings (via cross-validation)
- Evaluation: Assess performance on held-out test data
- Deployment: Apply to new data for predictions
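The core of this workflow (split, train, evaluate) can be sketched in a few lines. This is a minimal illustration using synthetic data and ordinary least squares standing in for a real model; the variable names and the simple linear relationship are assumptions for the example, not a real dataset.

```python
import numpy as np

# Synthetic "year" of daily PM2.5 driven by temperature and wind speed
# (hypothetical relationship chosen only to illustrate the workflow)
rng = np.random.default_rng(42)
n = 365
temp = rng.normal(15, 8, n)          # deg C
wind = rng.gamma(2.0, 2.0, n)        # m/s
pm25 = 20 + 0.5 * temp - 1.5 * wind + rng.normal(0, 3, n)  # ug/m3

X = np.column_stack([np.ones(n), temp, wind])  # design matrix with intercept

# Train-test split: reserve the last 20% for unbiased evaluation
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = pm25[:split], pm25[split:]

# Training: fit the model to the training data only
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluation: assess performance on the held-out test data
pred = X_test @ coef
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
print(f"Test RMSE: {rmse:.2f} ug/m3")
```

The same split-then-evaluate pattern applies regardless of which algorithm replaces the least-squares fit.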
Common Algorithms for Air Quality
| Algorithm | Type | Air Quality Application | Key Feature |
|---|---|---|---|
| Decision Tree | Classification/Regression | AQI category prediction | Interpretable rules |
| Random Forest | Ensemble | PM2.5 forecasting | Handles nonlinearity |
| Gradient Boosting | Ensemble | Concentration prediction | High accuracy |
| Neural Networks | Deep learning | Complex pattern recognition | Flexible but opaque |
| k-Nearest Neighbors | Instance-based | Analog forecasting | No training needed |
| k-Means | Clustering | Source identification | Unsupervised grouping |
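The "analog forecasting" use of k-Nearest Neighbors in the table can be made concrete: predict today's PM2.5 as the average over the k past days with the most similar meteorology. The toy data below (five historical days with temperature and wind speed) is invented purely for illustration.

```python
import numpy as np

def knn_predict(X_hist, y_hist, x_query, k=3):
    # Standardize features so temperature and wind contribute comparably
    mu, sigma = X_hist.mean(axis=0), X_hist.std(axis=0)
    Z = (X_hist - mu) / sigma
    z = (x_query - mu) / sigma
    dist = np.sqrt(((Z - z) ** 2).sum(axis=1))  # Euclidean distance to each past day
    nearest = np.argsort(dist)[:k]              # indices of the k closest analogs
    return y_hist[nearest].mean()               # average their observed PM2.5

# Hypothetical history: [temperature (deg C), wind speed (m/s)] per day
X_hist = np.array([[10.0, 2.0], [12.0, 2.5], [25.0, 6.0], [24.0, 5.5], [11.0, 1.8]])
y_hist = np.array([35.0, 32.0, 8.0, 9.0, 38.0])  # PM2.5 (ug/m3) on those days

# A cool, calm query day matches the three cool, calm analogs
print(knn_predict(X_hist, y_hist, np.array([11.0, 2.1]), k=3))  # → 35.0
```

Note the "No training needed" entry in the table: all the work happens at prediction time, which is why kNN is called instance-based.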
Model Evaluation Metrics
Classification Metrics
- Accuracy: Correct predictions / total predictions
- Precision: True positives / predicted positives
- Recall: True positives / actual positives
- F1 Score: Harmonic mean of precision and recall
- Confusion matrix: Cross-tabulation of predictions vs. actual
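The classification metrics above can all be derived from the four cells of a binary confusion matrix. A sketch, using a made-up set of ten daily AQI predictions ("U" = Unhealthy, "N" = not):

```python
# Toy labels: actual vs. predicted Unhealthy-day flags (hypothetical data)
actual    = ["U", "U", "N", "N", "U", "N", "N", "U", "N", "N"]
predicted = ["U", "N", "N", "N", "U", "U", "N", "U", "N", "N"]

tp = sum(a == "U" and p == "U" for a, p in zip(actual, predicted))  # true positives
fp = sum(a == "N" and p == "U" for a, p in zip(actual, predicted))  # false positives
fn = sum(a == "U" and p == "N" for a, p in zip(actual, predicted))  # false negatives
tn = sum(a == "N" and p == "N" for a, p in zip(actual, predicted))  # true negatives

accuracy  = (tp + tn) / len(actual)         # correct / total
precision = tp / (tp + fp)                  # of predicted Unhealthy days, fraction real
recall    = tp / (tp + fn)                  # of real Unhealthy days, fraction caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(accuracy, precision, recall, f1)      # → 0.8 0.75 0.75 0.75
```

For multi-class AQI categories the same idea extends to a full cross-tabulation, with precision and recall computed per category.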
Regression Metrics
- R-squared: Fraction of variance explained (1 = perfect; can go negative on test data if the model does worse than predicting the mean)
- RMSE: Root mean squared error
- MAE: Mean absolute error
- Bias: Mean error (positive = systematic overprediction, negative = underprediction)
- IOA: Index of agreement (Willmott's d, ranging 0-1)
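The regression metrics translate directly into a few numpy lines. The observed/predicted values below are invented for illustration; the IOA formula is Willmott's index of agreement.

```python
import numpy as np

# Toy observed vs. predicted PM2.5 (ug/m3) - hypothetical values
obs  = np.array([12.0, 18.0, 25.0, 9.0, 30.0, 15.0])
pred = np.array([14.0, 16.0, 22.0, 11.0, 27.0, 17.0])

err  = pred - obs
rmse = np.sqrt(np.mean(err ** 2))   # root mean squared error
mae  = np.mean(np.abs(err))         # mean absolute error
bias = np.mean(err)                 # positive = overprediction on average
r2   = 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
# Willmott's index of agreement: 1 = perfect, bounded at 0
ioa  = 1 - np.sum(err ** 2) / np.sum(
    (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
print(rmse, mae, bias, r2, ioa)
```

Reporting several of these together is good practice: RMSE penalizes large misses more than MAE, and bias reveals systematic error that RMSE alone hides.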
Cross-Validation
Cross-validation provides robust performance estimates by training and testing on multiple subsets:
k-Fold Cross-Validation
- Split data into k equal folds (typically k=5 or 10)
- For each fold: train on k-1 folds, test on remaining fold
- Average performance across all k iterations
For time series: Use time-series cross-validation where training always precedes test data to avoid "data leakage" from the future.
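The difference between the two schemes is purely in the index bookkeeping, which this sketch makes explicit (function names are ours; libraries such as scikit-learn provide equivalents):

```python
import numpy as np

def kfold_indices(n, k=5):
    # Shuffle once, cut into k roughly equal folds; each fold serves as
    # the test set exactly once while the rest form the training set
    idx = np.random.default_rng(0).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

def timeseries_indices(n, k=5):
    # Expanding-window splits: training data always precedes test data,
    # so nothing leaks backward from the future
    folds = np.array_split(np.arange(n), k + 1)
    for i in range(1, k + 1):
        yield np.concatenate(folds[:i]), folds[i]

for train, test in timeseries_indices(30, k=3):
    print(len(train), "->", len(test))  # training window grows each split
```

Averaging the chosen metric over the splits gives the cross-validated performance estimate.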
Activity: Build a PM2.5 Forecaster
Using a year of daily PM2.5 data with meteorological covariates:
- Create features: temperature, wind speed, humidity, day of week, month, previous day PM2.5
- Split data: first 10 months for training, last 2 months for testing
- Train three models: linear regression, decision tree, random forest
- Calculate RMSE and R-squared on test data for each model
- Create scatter plots of predicted vs. observed for each model
- Which model performs best? Why might that be?
Extension: Add lagged variables (PM2.5 from 2, 3 days ago) as features. Does this improve predictions?
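A possible starting point for the activity, shown for the linear-regression model only and with synthetic data standing in for the real year of observations (the generating process and coefficients are assumptions, not real air-quality physics):

```python
import numpy as np

# Synthetic year: PM2.5 with day-to-day persistence plus meteorology
rng = np.random.default_rng(7)
n = 365
temp = rng.normal(15, 8, n)
wind = rng.gamma(2.0, 2.0, n)
pm25 = np.empty(n)
pm25[0] = 20.0
for t in range(1, n):
    pm25[t] = 5 + 0.6 * pm25[t - 1] + 0.2 * temp[t] - 1.0 * wind[t] + rng.normal(0, 2)

# Features for day t: meteorology plus previous-day PM2.5 (lag 1)
X = np.column_stack([np.ones(n - 1), temp[1:], wind[1:], pm25[:-1]])
y = pm25[1:]

# Time-based split: roughly first 10 months train, last 2 months test
split = 304
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ coef
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"linear regression: RMSE={rmse:.2f}, R2={r2:.2f}")
```

Swapping in the tree-based models, adding day-of-week and month features, and appending lag-2/lag-3 columns to `X` follows the same pattern; note the split is by time, not random, matching the time-series caution above.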
Key Takeaway
Machine learning enables powerful predictive models for air quality, from forecasting next-day concentrations to classifying pollution sources. The key to successful ML is rigorous evaluation: using held-out test data, appropriate metrics, and cross-validation to ensure models generalize beyond training data. While complex algorithms can achieve high accuracy, simpler models are often more interpretable and reliable for real-world applications.