3
Explain

Machine Learning for Prediction

Learning Objectives

Machine Learning Overview

Supervised Learning

Learn from labeled examples to predict outcomes

  • Classification: Predict categories (Good/Moderate/Unhealthy AQI)
  • Regression: Predict continuous values (PM2.5 concentration)

Unsupervised Learning

Find patterns without labeled outcomes

  • Clustering: Group similar observations (source apportionment)
  • Dimensionality reduction: Simplify complex data (PCA)

The ML Workflow

  1. Data preparation: Clean the data, handle missing values, and engineer features
  2. Train-test split: Reserve data for unbiased evaluation (typically 80/20)
  3. Model selection: Choose algorithm appropriate to problem
  4. Training: Fit model to training data
  5. Hyperparameter tuning: Optimize model settings (via cross-validation)
  6. Evaluation: Assess performance on held-out test data
  7. Deployment: Apply to new data for predictions
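
The workflow steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a recommended production setup: the data are synthetic and made up (PM2.5 falling with wind speed plus noise), the model is a one-feature k-nearest-neighbors regressor, and the helper names (`knn_predict`, `knn_rmse`, `cv_rmse`) are hypothetical. In practice a library such as scikit-learn would normally handle these steps.

```python
import math
import random

random.seed(0)  # reproducible synthetic data

# Step 1 (stand-in): made-up data where PM2.5 falls as wind speed rises.
wind = [random.uniform(0.5, 10.0) for _ in range(200)]
pm25 = [40.0 / (1.0 + w) + random.gauss(0.0, 1.5) for w in wind]
data = list(zip(wind, pm25))

# Step 2: 80/20 train-test split (random here; split by time for time series).
random.shuffle(data)
n_train = int(0.8 * len(data))
train, test = data[:n_train], data[n_train:]


def knn_predict(train_rows, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def knn_rmse(train_rows, eval_rows, k):
    """RMSE of k-NN predictions for eval_rows, using train_rows as neighbors."""
    sq_errors = [(knn_predict(train_rows, x, k) - y) ** 2 for x, y in eval_rows]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def cv_rmse(rows, k, n_folds=5):
    """Mean validation RMSE over n_folds folds of the training rows only."""
    scores = []
    for f in range(n_folds):
        val = rows[f::n_folds]  # every n_folds-th row, starting at f
        fit = [r for i, r in enumerate(rows) if i % n_folds != f]
        scores.append(knn_rmse(fit, val, k))
    return sum(scores) / n_folds


# Steps 3-5: choose the model family (k-NN) and tune k with cross-validation
# on the training data; the test set is not touched here.
best_k = min([1, 3, 5, 10, 20], key=lambda k: cv_rmse(train, k))

# Step 6: evaluate once on the held-out test data.
test_rmse = knn_rmse(train, test, best_k)
print(f"best k = {best_k}, test RMSE = {test_rmse:.2f}")
```

The point of the structure is that the test rows are used exactly once, after every modeling choice has been made on the training rows.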

Common Algorithms for Air Quality

Algorithm           | Type                      | Air Quality Application     | Key Feature
Decision Tree       | Classification/Regression | AQI category prediction     | Interpretable rules
Random Forest       | Ensemble                  | PM2.5 forecasting           | Handles nonlinearity
Gradient Boosting   | Ensemble                  | Concentration prediction    | High accuracy
Neural Networks     | Deep learning             | Complex pattern recognition | Flexible but opaque
k-Nearest Neighbors | Instance-based            | Analog forecasting          | No training needed
k-Means             | Clustering                | Source identification       | Unsupervised grouping

Model Evaluation Metrics

Classification Metrics

  • Accuracy: Correct predictions / total
  • Precision: True positives / predicted positives
  • Recall: True positives / actual positives
  • F1 Score: Harmonic mean of precision and recall
  • Confusion matrix: Cross-tabulation of predictions vs. actual
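
As a minimal sketch of the definitions above, the following computes each metric for a small, made-up set of AQI category labels, treating "Unhealthy" as the positive class. The function name `classification_metrics` and the example labels are hypothetical; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
from collections import Counter


def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall, and F1 for one chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for eight days.
y_true = ["Unhealthy", "Good", "Unhealthy", "Moderate",
          "Good", "Unhealthy", "Good", "Moderate"]
y_pred = ["Unhealthy", "Good", "Moderate", "Moderate",
          "Good", "Unhealthy", "Unhealthy", "Moderate"]

metrics = classification_metrics(y_true, y_pred, positive="Unhealthy")

# Confusion matrix as counts of (actual, predicted) pairs.
confusion = Counter(zip(y_true, y_pred))

print(metrics)
print(confusion)
```

For a multi-class problem such as AQI categories, precision and recall are computed per class in this way and then, if a single number is wanted, averaged across classes.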

Regression Metrics

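  • R-squared: Proportion of variance explained (at most 1; can be negative on test data when the model does worse than predicting the mean)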
  • RMSE: Root mean squared error
  • MAE: Mean absolute error
  • Bias: Mean error (over/underprediction)
  • IOA: Index of agreement
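
The regression metrics above can be written out directly, as in the sketch below. It rests on some assumptions: R-squared is taken as the coefficient of determination (1 minus residual over total sum of squares) rather than a squared correlation, bias is defined as predicted minus observed so that positive values mean overprediction, and IOA uses Willmott's original index of agreement. The function name `regression_metrics` and the example values are made up for illustration.

```python
import math


def regression_metrics(obs, pred):
    """R-squared, RMSE, MAE, mean bias, and Willmott's index of agreement."""
    n = len(obs)
    mean_obs = sum(obs) / n
    errors = [p - o for o, p in zip(obs, pred)]  # predicted minus observed
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    # IOA denominator: potential error relative to the observed mean.
    ioa_denom = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                    for o, p in zip(obs, pred))
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "bias": sum(errors) / n,
        "ioa": 1 - ss_res / ioa_denom,
    }


# Made-up observed and predicted PM2.5 values.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 41.0]

scores = regression_metrics(observed, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few large misses dominate the error.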

Cross-Validation

Cross-validation provides robust performance estimates by training and testing on multiple subsets:

k-Fold Cross-Validation

  1. Split data into k folds of roughly equal size (typically k=5 or 10)
  2. For each fold: train on k-1 folds, test on remaining fold
  3. Average performance across all k iterations
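
The three steps can be sketched as an index generator. This is an illustrative helper (`k_fold_indices` is a hypothetical name) that uses contiguous folds; for data that are not a time series, the rows are usually shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each observation appears in exactly one test fold, so averaging a metric over the k folds uses every row for evaluation once.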

For time series: Use time-series cross-validation, in which the training data always precede the test data, so that no information from the future leaks into training ("data leakage").
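
One common way to do this is an expanding-window split, sketched below. The helper name `expanding_window_splits` and its parameters are hypothetical, and this is only one of several schemes (a sliding window of fixed length is another).

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) with all training indices before the test block."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(test_start))  # everything before the test block
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=12, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train up to index", train_idx[-1], "-> test", test_idx)
```

The training window grows with each split, while every test block lies strictly after the data used to fit the model.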

Activity: Build a PM2.5 Forecaster

Using a year of daily PM2.5 data with meteorological covariates:

  1. Create features: temperature, wind speed, humidity, day of week, month, previous day PM2.5
  2. Split data: first 10 months for training, last 2 months for testing
  3. Train three models: linear regression, decision tree, random forest
  4. Calculate RMSE and R-squared on test data for each model
  5. Create scatter plots of predicted vs. observed for each model
  6. Which model performs best? Why might that be?

Extension: Add lagged variables (PM2.5 from 2, 3 days ago) as features. Does this improve predictions?
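
As a possible starting point for steps 1 and 2 and the extension, the sketch below builds calendar and lagged features and then splits chronologically. Everything in it is an assumption for illustration: the year (2023), the random stand-in PM2.5 values, the field names, and the `build_rows` helper. Meteorological covariates are omitted and would be added to each row in the same way.

```python
import random
from datetime import date, timedelta

random.seed(1)

# Stand-in for a real dataset: one made-up PM2.5 value per day of a
# hypothetical year.
start = date(2023, 1, 1)
days = [start + timedelta(days=i) for i in range(365)]
pm25 = [max(0.0, random.gauss(12.0, 4.0)) for _ in days]


def build_rows(days, pm25, lags=(1,)):
    """Build one feature row per day, skipping days without a full lag history."""
    rows = []
    for i in range(max(lags), len(days)):
        row = {
            "date": days[i],
            "day_of_week": days[i].weekday(),  # 0 = Monday
            "month": days[i].month,
            "pm25": pm25[i],  # target
        }
        for lag in lags:
            row[f"pm25_lag{lag}"] = pm25[i - lag]
        rows.append(row)
    return rows


rows = build_rows(days, pm25, lags=(1, 2, 3))

# Chronological split: first 10 months for training, last 2 for testing.
train = [r for r in rows if r["month"] <= 10]
test = [r for r in rows if r["month"] > 10]

print(len(train), "training rows;", len(test), "test rows")
```

Because each lag looks only backward in time, no test-period value is ever used as a feature for a training row.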

Key Takeaway

Machine learning enables powerful predictive models for air quality, from forecasting next-day concentrations to classifying pollution sources. The key to successful ML is rigorous evaluation: using held-out test data, appropriate metrics, and cross-validation to ensure models generalize beyond training data. While complex algorithms can achieve high accuracy, simpler models are often more interpretable and reliable for real-world applications.
