Lesson 2: Statistical Analysis Methods

Descriptive Statistics for Air Quality

Key Measures

| Measure | Formula | Air Quality Application |
| --- | --- | --- |
| Arithmetic Mean | sum(x)/n | Annual average for NAAQS comparison |
| Geometric Mean | exp(mean(ln(x))) | Better for log-normal distributions |
| Percentiles | Value at p% of distribution | 98th percentile for 24-hr standards |
| Design Value | 3-year average of 98th percentile | Regulatory attainment determination |
| Exceedance Days | Count where C > standard | Public health communication |

Note: Air quality data are often right-skewed (approximately log-normal), so the median and geometric mean describe typical concentrations better than the arithmetic mean, which a few high values can pull upward.
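The measures in the table can be sketched with the Python standard library. This is a minimal illustration, not a regulatory calculation: the function name `summarize`, the sample values, and the threshold of 35 are assumptions made for the example, and the percentile uses the interpolation built into `statistics.quantiles` rather than any agency-specific ranking procedure.

```python
import math
import statistics


def summarize(conc, standard):
    """Summarize daily concentrations against a threshold in the same units."""
    n = len(conc)
    arithmetic_mean = sum(conc) / n
    # Geometric mean: exp of the mean of the logs (needs strictly positive values).
    geometric_mean = math.exp(sum(math.log(c) for c in conc) / n)
    # quantiles(n=100) returns 99 cut points; index 97 is the 98th percentile.
    p98 = statistics.quantiles(conc, n=100, method="inclusive")[97]
    exceedance_days = sum(1 for c in conc if c > standard)
    return {
        "arithmetic_mean": arithmetic_mean,
        "geometric_mean": geometric_mean,
        "median": statistics.median(conc),
        "p98": p98,
        "exceedance_days": exceedance_days,
    }


# Illustrative, made-up daily PM2.5 values with one high day (right-skewed).
daily_pm25 = [5.0, 8.0, 12.0, 20.0, 80.0]
summary = summarize(daily_pm25, standard=35.0)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With a single high day in the sample, the arithmetic mean sits well above both the median and the geometric mean, which is the skewness effect described in the note.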

Linear Regression

y = beta_0 + beta_1 * x + epsilon

  • beta_0 (intercept): y value when x = 0
  • beta_1 (slope): Change in y per unit change in x
  • epsilon: Random error term

Key Diagnostics

  • R-squared: Proportion of variance explained (0 to 1)
  • p-value: Probability of observing a slope at least this far from zero if the true slope were zero (p < 0.05 is the conventional threshold)
  • Residuals: Should be normally distributed with constant variance
  • RMSE: Root mean squared error (prediction accuracy)
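The model and two of its diagnostics (R-squared and RMSE) can be computed directly from their definitions. The sketch below assumes ordinary least squares with a single predictor; the function name `fit_line` and the wind speed and PM2.5 values are hypothetical and chosen only for illustration. It does not compute the p-value, which needs the t distribution.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = beta_0 + beta_1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals: observed minus fitted values.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "residuals": residuals,
    }


# Made-up example: PM2.5 (ug/m3) tends to fall as wind speed (m/s) rises.
wind = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pm25 = [30.0, 26.0, 21.0, 19.0, 14.0, 12.0]
fit = fit_line(wind, pm25)
print(f"slope={fit['slope']:.2f} intercept={fit['intercept']:.2f}")
print(f"R-squared={fit['r_squared']:.3f} RMSE={fit['rmse']:.2f}")
```

The returned residuals can be plotted against the fitted values to check the constant-variance assumption.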

Multiple Regression

PM2.5 = beta_0 + beta_1*Temperature + beta_2*Wind + beta_3*Traffic + epsilon

Multiple regression models pollutant concentrations as functions of several predictors:

  • Meteorological variables (temperature, wind, humidity, mixing height)
  • Temporal factors (hour of day, day of week, season)
  • Source indicators (traffic counts, industrial activity)
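To show what fitting such a model involves, the sketch below estimates the coefficients by solving the normal equations (X'X) beta = X'y with Gaussian elimination. The helper names `solve` and `fit_ols` and the data are hypothetical; the response is generated from known coefficients so the fit can be checked. In practice a numerical library would normally be used, since it handles ill-conditioned problems better than the normal equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [beta_0, beta_1, ...] for predictor rows and response y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up predictors: (temperature in C, wind in m/s, traffic in vehicles/hour).
rows = [(20, 3, 100), (25, 1, 300), (15, 5, 200),
        (30, 2, 500), (10, 4, 400), (22, 6, 150)]
# Synthetic response built from known coefficients, with no noise added.
pm25 = [10 + 0.5 * t - 2.0 * w + 0.01 * v for t, w, v in rows]
print([round(beta, 4) for beta in fit_ols(rows, pm25)])
```

Because the synthetic response contains no error term, the fit recovers the generating coefficients up to rounding; real data would not behave this cleanly.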

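Multicollinearity: When predictors are correlated with each other, coefficient estimates become unstable and their standard errors grow. A common check is the variance inflation factor (VIF); values below about 5 are usually treated as acceptable.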
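The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. With exactly two predictors this reduces to 1 / (1 - r²), with r the Pearson correlation between them. The sketch below covers only that two-predictor case; the function names and the temperature and mixing height values are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up values: mixing height (m) rising with temperature (C).
temperature = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
mixing_height = [400.0, 650.0, 800.0, 1200.0, 1300.0, 1700.0]
print(f"VIF = {vif_two_predictors(temperature, mixing_height):.1f}")
```

For three or more predictors, each VIF needs its own auxiliary regression, which a statistics library would normally provide.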

Time Series Analysis

Decomposition

Time series = Trend + Seasonality + Residual

  • Trend: Long-term increase/decrease
  • Seasonality: Regular periodic patterns
  • Residual: Random variation
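One simple way to carry out an additive decomposition is the classical approach: estimate the trend with a centered moving average, average the detrended values at each position in the cycle to get the seasonal component, and treat what is left as the residual. The sketch below assumes an odd period (for example 7 for a weekly cycle in daily data) and a series at least two periods long; `decompose_additive` and the synthetic series are hypothetical.

```python
def decompose_additive(series, period):
    """Classical additive decomposition; this sketch assumes an odd period."""
    n = len(series)
    half = period // 2
    # Trend: centered moving average, undefined near both ends of the series.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period
    # Seasonality: mean detrended value at each position within the cycle.
    by_phase = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            by_phase[i % period].append(series[i] - trend[i])
    phase_means = [sum(values) / len(values) for values in by_phase]
    # Shift so the seasonal component averages to zero over one period.
    offset = sum(phase_means) / period
    seasonal = [phase_means[i % period] - offset for i in range(n)]
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual


# Synthetic daily series: slow linear rise plus a made-up weekly pattern.
weekly = [2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 1.0]
series = [0.1 * day + weekly[day % 7] for day in range(28)]
trend, seasonal, residual = decompose_additive(series, period=7)
print([round(s, 2) for s in seasonal[:7]])
```

Even periods (such as 12 months or 24 hours) need a slightly different moving average that half-weights the two end points, which this sketch leaves out.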

Trend Detection

  • Mann-Kendall test: Non-parametric trend test
  • Sen's slope: Robust trend magnitude
  • Linear regression: Year as predictor
  • Moving average: Smoothing for visualization
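The first two methods can be written in a few lines. The sketch below computes the Mann-Kendall S statistic with a normal-approximation p-value and Sen's slope as the median of all pairwise slopes. It assumes equally spaced observations, no tied values (the variance formula omits the tie correction), and no serial correlation; the annual means are made-up numbers and the function names are hypothetical.

```python
import math
import statistics


def mann_kendall(values):
    """Return (S, z, two-sided p) using the normal approximation without ties."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each later-minus-earlier pair
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity correction: shrink S by one unit toward zero.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = 2.0 * (1.0 - statistics.NormalDist().cdf(abs(z)))
    return s, z, p


def sens_slope(values):
    """Median of pairwise slopes, in units of value per time step."""
    n = len(values)
    slopes = [
        (values[j] - values[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return statistics.median(slopes)


# Made-up annual mean PM2.5 values (ug/m3) for ten consecutive years.
annual = [12.0, 11.5, 11.8, 10.9, 10.2, 10.4, 9.8, 9.1, 9.3, 8.7]
s, z, p = mann_kendall(annual)
print(f"S={s} z={z:.2f} p={p:.4f} slope={sens_slope(annual):.2f} per year")
```

A negative S with a small p-value suggests a decreasing trend, and Sen's slope gives its size without being dominated by a single unusual year.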

Hypothesis Testing

Common Tests for Air Quality

| Test | Use Case | Example |
| --- | --- | --- |
| t-test | Compare two means | Before/after intervention |
| Paired t-test | Matched observations | Indoor vs outdoor at same time |
| ANOVA | Compare multiple groups | Across different land use types |
| Chi-squared | Categorical comparisons | Exceedance frequency by season |
| Mann-Whitney U | Non-parametric comparison | Skewed distributions |

Remember: Statistical significance (p < 0.05) does not imply practical significance. Always report effect sizes.
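As an illustration of reporting an effect size alongside a test statistic, the sketch below computes Welch's t statistic (a two-sample t-test that does not assume equal variances) and Cohen's d for a before/after comparison. The function names and concentrations are made up. The p-value is left out because it requires the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / se


def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Made-up daily PM2.5 (ug/m3) before and after a hypothetical intervention.
before = [18.0, 22.0, 19.5, 25.0, 21.0, 23.5, 20.0, 24.0]
after = [16.5, 20.0, 18.0, 22.5, 19.0, 21.0, 18.5, 22.0]
print(f"t = {welch_t(before, after):.2f}")
print(f"d = {cohens_d(before, after):.2f}")
```

The t statistic grows with sample size even when the mean difference stays the same, whereas Cohen's d does not, which is why the two should be reported together.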

Activity: Regression Analysis

Using one year of daily PM2.5 and meteorological data from your city:

  1. Create scatter plots of PM2.5 vs. temperature, wind speed, and relative humidity
  2. Calculate Pearson correlation coefficients for each pair
  3. Build a simple linear regression: PM2.5 ~ Temperature
  4. Build a multiple regression: PM2.5 ~ Temperature + Wind + Humidity
  5. Compare R-squared values. How much variance is explained?
  6. Plot residuals. Are model assumptions met?

Discussion: What factors beyond meteorology influence PM2.5? Why is R-squared not 1.0?

Key Takeaway

Statistical methods provide the foundation for rigorous air quality analysis. Descriptive statistics characterize distributions, regression models relationships between variables, and hypothesis tests enable formal comparisons. Understanding these tools - including their assumptions and limitations - is essential for drawing valid conclusions from air quality data.
