Statistical Analysis Methods
Learning Objectives
- Apply descriptive statistics to characterize air quality distributions
- Perform linear regression to model pollutant relationships
- Conduct time series analysis including trend detection and seasonality
- Use hypothesis testing to compare air quality across conditions
- Interpret correlation coefficients and R-squared values
Descriptive Statistics for Air Quality
Key Measures
| Measure | Formula | Air Quality Application |
|---|---|---|
| Arithmetic Mean | sum(x)/n | Annual average for NAAQS comparison |
| Geometric Mean | exp(mean(ln(x))) | Better for log-normal distributions |
| Percentiles | Value below which p% of observations fall | 98th percentile for the 24-hr PM2.5 standard |
| Design Value | 3-year average of annual 98th percentiles (24-hr PM2.5 form) | Regulatory attainment determination |
| Exceedance Days | Count where C > standard | Public health communication |
Note: Air quality data are often right-skewed (approximately log-normal), so the median and geometric mean describe typical conditions better than the arithmetic mean, which is pulled upward by a few high values.
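The measures in the table can be sketched with the Python standard library. The concentrations and the 35.0 threshold below are hypothetical values chosen for illustration, and `percentile` is a helper written for this example using simple linear interpolation; regulatory percentile calculations may prescribe a different ranking rule.

```python
import math
import statistics

# Hypothetical daily PM2.5 concentrations in ug/m3 (illustrative values only).
pm25 = [4.2, 5.1, 6.3, 7.0, 7.8, 8.4, 9.9, 12.5, 18.7, 41.0]
STANDARD = 35.0  # assumed 24-hr threshold for this example


def percentile(values, p):
    """Percentile by linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


arithmetic_mean = statistics.mean(pm25)
# Geometric mean as exp(mean(ln(x))), matching the formula in the table.
geometric_mean = math.exp(statistics.mean([math.log(x) for x in pm25]))
median = statistics.median(pm25)
p98 = percentile(pm25, 98)
exceedance_days = sum(1 for x in pm25 if x > STANDARD)

print("arithmetic mean:", round(arithmetic_mean, 2))
print("geometric mean:", round(geometric_mean, 2))
print("median:", round(median, 2))
print("98th percentile:", round(p98, 2))
print("exceedance days:", exceedance_days)
```

With one high value in the sample, the arithmetic mean sits well above both the median and the geometric mean, which is the skew effect described in the note.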
Linear Regression
y = beta_0 + beta_1 * x + epsilon
- beta_0 (intercept): y value when x = 0
- beta_1 (slope): Change in y per unit change in x
- epsilon: Random error term
Key Diagnostics
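- R-squared: Proportion of the variance in y explained by the model (0 to 1)
- p-value: Probability of observing a slope at least this far from zero if the true slope were zero (p < 0.05 is the conventional threshold)
- Residuals: Should be independent and approximately normally distributed with constant variance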
- RMSE: Root mean squared error (prediction accuracy)
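A minimal sketch of fitting y = beta_0 + beta_1 * x by ordinary least squares and computing R-squared, RMSE, and the residuals. The wind and PM2.5 values are hypothetical, and `fit_line` and `diagnostics` are names introduced for this example. The slope p-value is not computed here because it needs a t distribution, which the standard library does not provide; `scipy.stats.linregress` reports it if SciPy is available.

```python
import math

# Hypothetical paired observations: wind speed (m/s) and PM2.5 (ug/m3).
wind = [1.0, 2.0, 3.0, 4.0, 5.0]
pm25 = [30.0, 26.0, 23.0, 17.0, 14.0]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def diagnostics(x, y, intercept, slope):
    """Return (R-squared, RMSE, residuals) for a fitted line."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mean_y = sum(y) / len(y)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(y))
    return r_squared, rmse, residuals


intercept, slope = fit_line(wind, pm25)
r_squared, rmse, residuals = diagnostics(wind, pm25, intercept, slope)

print("intercept:", round(intercept, 3))
print("slope:", round(slope, 3))
print("R-squared:", round(r_squared, 4))
print("RMSE:", round(rmse, 4))
```

Plotting `residuals` against the predictor or the fitted values is the usual way to check the constant-variance assumption by eye.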
Multiple Regression
PM2.5 = beta_0 + beta_1*Temperature + beta_2*Wind + beta_3*Traffic + epsilon
Multiple regression models pollutant concentrations as functions of several predictors:
- Meteorological variables (temperature, wind, humidity, mixing height)
- Temporal factors (hour of day, day of week, season)
- Source indicators (traffic counts, industrial activity)
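Multicollinearity: When predictors are correlated with one another, coefficient estimates become unstable and their standard errors inflate. Check the variance inflation factor (VIF) for each predictor; values below about 5 are a common rule of thumb for acceptable collinearity.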
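The sketch below fits the multiple regression above via the normal equations and computes a VIF for each predictor as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. All data are hypothetical, and the traffic series was deliberately constructed to track temperature so that the collinearity shows up. The helpers `solve`, `ols`, `r_squared`, and `vif` are written for this example; in practice a library such as statsmodels would normally be used.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(columns, y):
    """R-squared of the least-squares fit of y on the given predictor columns."""
    coef = ols(columns, y)
    fitted = [coef[0] + sum(c * v for c, v in zip(coef[1:], row))
              for row in zip(*columns)]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def vif(columns, j):
    """Variance inflation factor of predictor j given the other predictors."""
    others = [c for i, c in enumerate(columns) if i != j]
    return 1.0 / (1.0 - r_squared(others, columns[j]))


# Hypothetical daily values; traffic is built to follow temperature closely.
temperature = [10, 15, 20, 25, 30, 18, 22, 12]
wind = [3.0, 2.5, 4.0, 1.5, 2.0, 5.0, 1.0, 3.5]
traffic = [120, 150, 200, 260, 310, 170, 240, 130]
pm25 = [18.0, 21.5, 20.0, 30.5, 33.0, 16.5, 31.0, 17.5]

predictors = [temperature, wind, traffic]
coefficients = ols(predictors, pm25)
model_r2 = r_squared(predictors, pm25)
vifs = [vif(predictors, j) for j in range(len(predictors))]

for name, value in zip(["temperature", "wind", "traffic"], vifs):
    print(name, "VIF:", round(value, 1))
```

Because traffic and temperature move together in this made-up data set, both receive VIF values far above 5, which signals that their individual coefficients should not be interpreted separately.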
Time Series Analysis
Decomposition
Time series = Trend + Seasonality + Residual (additive model)
- Trend: Long-term increase or decrease
- Seasonality: Regular periodic patterns (diurnal, weekly, annual)
- Residual: Variation remaining after trend and seasonality are removed
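One way to carry out the decomposition is the classical additive approach: estimate the trend with a centered moving average spanning one full period, average the detrended values by position in the cycle to get the seasonal component, and treat what is left as the residual. The sketch below assumes a monthly series with a 12-month cycle; the series is synthetic (a known linear trend plus a known seasonal pattern), and the function names are introduced for this example.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half: t + half + 1]
        if period % 2 == 0:
            # Even period: half weight on both end points keeps the window centered.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def decompose_additive(series, period):
    """Return (trend, seasonal, residual); needs at least two full cycles."""
    trend = centered_moving_average(series, period)
    buckets = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[t % period].append(value - level)
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # force the seasonal terms to sum to zero
    seasonal = [m - offset for m in means]
    residual = [
        None if level is None else value - level - seasonal[t % period]
        for t, (value, level) in enumerate(zip(series, trend))
    ]
    return trend, seasonal, residual


# Synthetic monthly PM2.5: declining trend plus a winter-high seasonal cycle.
pattern = [6, 4, 1, -2, -4, -5, -5, -4, -2, 1, 4, 6]
series = [20 - 0.1 * t + pattern[t % 12] for t in range(36)]

trend, seasonal, residual = decompose_additive(series, 12)
print("seasonal component:", [round(s, 2) for s in seasonal])
```

Because the synthetic series contains no noise, the recovered seasonal component matches `pattern` and the residuals are essentially zero; real monitoring data will leave a visible residual.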
Trend Detection
- Mann-Kendall test: Non-parametric test for a monotonic trend
- Sen's slope: Robust trend magnitude (median of all pairwise slopes)
- Linear regression: Year as predictor; assumes a linear trend and well-behaved residuals
- Moving average: Smoothing for visualization
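The first two methods can be sketched in a few lines. The version below is a simplified Mann-Kendall test that uses the normal approximation and applies no correction for tied values or serial correlation, paired with Sen's slope as the median of all pairwise slopes. The annual means are hypothetical, and equal spacing of one year between observations is assumed.

```python
import math
from statistics import NormalDist, median


def mann_kendall(values):
    """Return (S, Z, two-sided p-value); assumes no ties and independent data."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each later-minus-earlier pair
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p_value


def sens_slope(values):
    """Median of pairwise slopes, in units per time step."""
    n = len(values)
    slopes = [
        (values[j] - values[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return median(slopes)


# Hypothetical annual mean PM2.5 (ug/m3) for ten consecutive years.
annual_mean = [12.0, 11.6, 11.8, 11.1, 10.7, 10.9, 10.2, 9.8, 9.9, 9.1]

s_stat, z_stat, p_value = mann_kendall(annual_mean)
slope = sens_slope(annual_mean)

print("S:", s_stat, "Z:", round(z_stat, 3), "p:", round(p_value, 5))
print("Sen's slope (ug/m3 per year):", round(slope, 4))
```

A negative S with a small p-value indicates a decreasing monotonic trend, and Sen's slope gives its size in concentration units per year.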
Hypothesis Testing
Common Tests for Air Quality
| Test | Use Case | Example |
|---|---|---|
| t-test | Compare the means of two independent groups | Before/after intervention |
| Paired t-test | Compare means of matched observations | Indoor vs outdoor at same time |
| ANOVA | Compare means across three or more groups | Across different land use types |
| Chi-squared | Compare frequencies across categories | Exceedance frequency by season |
| Mann-Whitney U | Non-parametric comparison of two independent groups | Skewed distributions |
Remember: Statistical significance (p < 0.05) does not imply practical significance. Always report effect sizes.
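As an illustration of reporting an effect size alongside a p-value, the sketch below runs a Mann-Whitney U test using the normal approximation (no tie or continuity correction) and reports the rank-biserial correlation, (U1 - U2) / (n1 * n2), as the effect size. The before and after samples are hypothetical. For small samples or tied data, an exact method such as `scipy.stats.mannwhitneyu` is preferable.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U1, two-sided p-value, rank-biserial effect size)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u1 = rank_sum_a - n1 * (n1 + 1) / 2  # pairs where a exceeds b
    u2 = n1 * n2 - u1
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    effect = (u1 - u2) / (n1 * n2)  # +1: a always higher, -1: b always higher
    return u1, p_value, effect


# Hypothetical daily PM2.5 (ug/m3) before and after an intervention.
before = [22.1, 18.4, 30.2, 25.7, 41.3, 19.8, 27.5, 35.0]
after = [15.2, 20.3, 17.9, 12.4, 24.6, 16.1, 28.8, 14.7]

u1, p_value, effect = mann_whitney_u(before, after)
print("U1:", u1, "p:", round(p_value, 4), "rank-biserial r:", round(effect, 4))
```

The p-value says whether the difference is distinguishable from chance; the effect size says how consistently the "before" days were higher, which is what matters for judging practical significance.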
Activity: Regression Analysis
Using one year of daily PM2.5 and meteorological data from your city:
- Create scatter plots of PM2.5 vs. temperature, wind speed, and relative humidity
- Calculate Pearson correlation coefficients for each pair
- Build a simple linear regression: PM2.5 ~ Temperature
- Build a multiple regression: PM2.5 ~ Temperature + Wind + Humidity
- Compare R-squared values. How much variance is explained?
- Plot residuals. Are model assumptions met?
Discussion: What factors beyond meteorology influence PM2.5? Why is R-squared not 1.0?
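For the correlation step of the activity, a starting point might look like the sketch below. It computes Pearson correlation coefficients between PM2.5 and each predictor using a hand-written `pearson_r` helper. The five observations per variable are placeholder values to be replaced with the city data set.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Placeholder observations; substitute one year of daily data.
pm25 = [30.0, 26.0, 23.0, 17.0, 14.0]
predictors = {
    "temperature": [12.0, 15.0, 11.0, 20.0, 18.0],
    "wind": [1.0, 2.0, 3.0, 4.0, 5.0],
    "humidity": [60.0, 55.0, 70.0, 45.0, 50.0],
}

correlations = {name: pearson_r(values, pm25) for name, values in predictors.items()}

for name, r in correlations.items():
    # For a one-predictor regression, R-squared equals r squared.
    print(name, "r:", round(r, 3), "r^2:", round(r ** 2, 3))
```

The sign of r gives the direction of the relationship and r squared gives the share of variance a single-predictor regression on that variable would explain.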
Key Takeaway
Statistical methods provide the foundation for rigorous air quality analysis. Descriptive statistics characterize distributions, regression models relationships between variables, and hypothesis tests enable formal comparisons. Understanding these tools, including their assumptions and limitations, is essential for drawing valid conclusions from air quality data.