Machine Learning for Prediction
Learning Objectives
- Distinguish between supervised and unsupervised learning approaches
- Apply classification algorithms to predict AQI categories
- Build and evaluate regression models for concentration prediction
- Understand train-test splits and cross-validation
- Interpret model performance metrics (accuracy, RMSE, confusion matrix)
Machine Learning Overview
Supervised Learning
Learn from labeled examples to predict outcomes
- Classification: Predict categories (Good/Moderate/Unhealthy AQI)
- Regression: Predict continuous values (PM2.5 concentration)
Unsupervised Learning
Find patterns without labeled outcomes
- Clustering: Group similar observations (source apportionment)
- Dimensionality reduction: Simplify complex data (PCA)
The ML Workflow
- Data preparation: Clean, handle missing values, feature engineering
- Train-test split: Reserve data for unbiased evaluation (typically 80/20)
- Model selection: Choose algorithm appropriate to problem
- Training: Fit model to training data
- Hyperparameter tuning: Optimize model settings (via cross-validation)
- Evaluation: Assess performance on held-out test data
- Deployment: Apply to new data for predictions
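The core of this workflow (split, train, evaluate) can be sketched in a few lines. This is a minimal illustration using synthetic data and ordinary least squares standing in for a real model; the variable names and the simple linear relationship are assumptions for the example, not a real dataset.

```python
import numpy as np

# Synthetic "year" of daily PM2.5 driven by temperature and wind speed
# (hypothetical relationship chosen only to illustrate the workflow)
rng = np.random.default_rng(42)
n = 365
temp = rng.normal(15, 8, n)          # deg C
wind = rng.gamma(2.0, 2.0, n)        # m/s
pm25 = 20 + 0.5 * temp - 1.5 * wind + rng.normal(0, 3, n)  # ug/m3

X = np.column_stack([np.ones(n), temp, wind])  # design matrix with intercept

# Train-test split: reserve the last 20% for unbiased evaluation
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = pm25[:split], pm25[split:]

# Training: fit the model to the training data only
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluation: assess performance on the held-out test data
pred = X_test @ coef
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
print(f"Test RMSE: {rmse:.2f} ug/m3")
```

The same split-then-evaluate pattern applies regardless of which algorithm replaces the least-squares fit.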
Common Algorithms for Air Quality
| Algorithm | Type | Air Quality Application | Key Feature |
|---|---|---|---|
| Decision Tree | Classification/Regression | AQI category prediction | Interpretable rules |
| Random Forest | Ensemble | PM2.5 forecasting | Handles nonlinearity |
| Gradient Boosting | Ensemble | Concentration prediction | High accuracy |
| Neural Networks | Deep learning | Complex pattern recognition | Flexible but opaque |
| k-Nearest Neighbors | Instance-based | Analog forecasting | No training needed |
| k-Means | Clustering | Source identification | Unsupervised grouping |
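The "analog forecasting" use of k-Nearest Neighbors in the table can be made concrete: predict today's PM2.5 as the average over the k past days with the most similar meteorology. The toy data below (five historical days with temperature and wind speed) is invented purely for illustration.

```python
import numpy as np

def knn_predict(X_hist, y_hist, x_query, k=3):
    # Standardize features so temperature and wind contribute comparably
    mu, sigma = X_hist.mean(axis=0), X_hist.std(axis=0)
    Z = (X_hist - mu) / sigma
    z = (x_query - mu) / sigma
    dist = np.sqrt(((Z - z) ** 2).sum(axis=1))  # Euclidean distance to each past day
    nearest = np.argsort(dist)[:k]              # indices of the k closest analogs
    return y_hist[nearest].mean()               # average their observed PM2.5

# Hypothetical history: [temperature (deg C), wind speed (m/s)] per day
X_hist = np.array([[10.0, 2.0], [12.0, 2.5], [25.0, 6.0], [24.0, 5.5], [11.0, 1.8]])
y_hist = np.array([35.0, 32.0, 8.0, 9.0, 38.0])  # PM2.5 (ug/m3) on those days

# A cool, calm query day matches the three cool, calm analogs
print(knn_predict(X_hist, y_hist, np.array([11.0, 2.1]), k=3))  # → 35.0
```

Note the "No training needed" entry in the table: all the work happens at prediction time, which is why kNN is called instance-based.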
Model Evaluation Metrics
Classification Metrics
- Accuracy: Correct predictions / total predictions
- Precision: True positives / predicted positives
- Recall: True positives / actual positives
- F1 Score: Harmonic mean of precision and recall
- Confusion matrix: Cross-tabulation of predictions vs. actual
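The classification metrics above can all be derived from the four cells of a binary confusion matrix. A sketch, using a made-up set of ten daily AQI predictions ("U" = Unhealthy, "N" = not):

```python
# Toy labels: actual vs. predicted Unhealthy-day flags (hypothetical data)
actual    = ["U", "U", "N", "N", "U", "N", "N", "U", "N", "N"]
predicted = ["U", "N", "N", "N", "U", "U", "N", "U", "N", "N"]

tp = sum(a == "U" and p == "U" for a, p in zip(actual, predicted))  # true positives
fp = sum(a == "N" and p == "U" for a, p in zip(actual, predicted))  # false positives
fn = sum(a == "U" and p == "N" for a, p in zip(actual, predicted))  # false negatives
tn = sum(a == "N" and p == "N" for a, p in zip(actual, predicted))  # true negatives

accuracy  = (tp + tn) / len(actual)         # correct / total
precision = tp / (tp + fp)                  # of predicted Unhealthy days, fraction real
recall    = tp / (tp + fn)                  # of real Unhealthy days, fraction caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(accuracy, precision, recall, f1)      # → 0.8 0.75 0.75 0.75
```

For multi-class AQI categories the same idea extends to a full cross-tabulation, with precision and recall computed per category.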
Regression Metrics
- R-squared: Fraction of variance explained (1 = perfect; can go negative on test data if the model does worse than predicting the mean)
- RMSE: Root mean squared error
- MAE: Mean absolute error
- Bias: Mean error (positive = systematic overprediction, negative = underprediction)
- IOA: Index of agreement (Willmott's d, ranging 0-1)
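The regression metrics translate directly into a few numpy lines. The observed/predicted values below are invented for illustration; the IOA formula is Willmott's index of agreement.

```python
import numpy as np

# Toy observed vs. predicted PM2.5 (ug/m3) - hypothetical values
obs  = np.array([12.0, 18.0, 25.0, 9.0, 30.0, 15.0])
pred = np.array([14.0, 16.0, 22.0, 11.0, 27.0, 17.0])

err  = pred - obs
rmse = np.sqrt(np.mean(err ** 2))   # root mean squared error
mae  = np.mean(np.abs(err))         # mean absolute error
bias = np.mean(err)                 # positive = overprediction on average
r2   = 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
# Willmott's index of agreement: 1 = perfect, bounded at 0
ioa  = 1 - np.sum(err ** 2) / np.sum(
    (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
print(rmse, mae, bias, r2, ioa)
```

Reporting several of these together is good practice: RMSE penalizes large misses more than MAE, and bias reveals systematic error that RMSE alone hides.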
Cross-Validation
Cross-validation provides robust performance estimates by training and testing on multiple subsets:
k-Fold Cross-Validation
- Split data into k equal folds (typically k=5 or 10)
- For each fold: train on k-1 folds, test on remaining fold
- Average performance across all k iterations
For time series: Use time-series cross-validation where training always precedes test data to avoid "data leakage" from the future.
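The difference between the two schemes is purely in the index bookkeeping, which this sketch makes explicit (function names are ours; libraries such as scikit-learn provide equivalents):

```python
import numpy as np

def kfold_indices(n, k=5):
    # Shuffle once, cut into k roughly equal folds; each fold serves as
    # the test set exactly once while the rest form the training set
    idx = np.random.default_rng(0).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

def timeseries_indices(n, k=5):
    # Expanding-window splits: training data always precedes test data,
    # so nothing leaks backward from the future
    folds = np.array_split(np.arange(n), k + 1)
    for i in range(1, k + 1):
        yield np.concatenate(folds[:i]), folds[i]

for train, test in timeseries_indices(30, k=3):
    print(len(train), "->", len(test))  # training window grows each split
```

Averaging the chosen metric over the splits gives the cross-validated performance estimate.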
Activity: Build a PM2.5 Forecaster
Using a year of daily PM2.5 data with meteorological covariates:
- Create features: temperature, wind speed, humidity, day of week, month, previous day PM2.5
- Split data: first 10 months for training, last 2 months for testing
- Train three models: linear regression, decision tree, random forest
- Calculate RMSE and R-squared on test data for each model
- Create scatter plots of predicted vs. observed for each model
- Which model performs best? Why might that be?
Extension: Add lagged variables (PM2.5 from 2, 3 days ago) as features. Does this improve predictions?
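A possible starting point for the activity, shown for the linear-regression model only and with synthetic data standing in for the real year of observations (the generating process and coefficients are assumptions, not real air-quality physics):

```python
import numpy as np

# Synthetic year: PM2.5 with day-to-day persistence plus meteorology
rng = np.random.default_rng(7)
n = 365
temp = rng.normal(15, 8, n)
wind = rng.gamma(2.0, 2.0, n)
pm25 = np.empty(n)
pm25[0] = 20.0
for t in range(1, n):
    pm25[t] = 5 + 0.6 * pm25[t - 1] + 0.2 * temp[t] - 1.0 * wind[t] + rng.normal(0, 2)

# Features for day t: meteorology plus previous-day PM2.5 (lag 1)
X = np.column_stack([np.ones(n - 1), temp[1:], wind[1:], pm25[:-1]])
y = pm25[1:]

# Time-based split: roughly first 10 months train, last 2 months test
split = 304
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ coef
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"linear regression: RMSE={rmse:.2f}, R2={r2:.2f}")
```

Swapping in the tree-based models, adding day-of-week and month features, and appending lag-2/lag-3 columns to `X` follows the same pattern; note the split is by time, not random, matching the time-series caution above.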
Key Takeaway
Machine learning enables powerful predictive models for air quality, from forecasting next-day concentrations to classifying pollution sources. The key to successful ML is rigorous evaluation: using held-out test data, appropriate metrics, and cross-validation to ensure models generalize beyond training data. While complex algorithms can achieve high accuracy, simpler models are often more interpretable and reliable for real-world applications.