Providing Physical Insights into the Morphology of Spatial and Temporal Distributions of Atmospheric Aerosols Using Machine Learning








The concentration of airborne particulate matter (PM2.5) is a significant environmental and health issue. Many tools have been used to examine the relationship between PM2.5 abundance and meteorological variables. Some of these relationships are non-linear, non-Gaussian, or even unknown. Machine learning provides a broad range of practical solutions to help examine these relationships and provide physical insights into them. In this thesis we have used a variety of machine learning approaches. Unsupervised machine learning was used to classify the morphology of PM2.5 seasonal cycles in East Asia. It classifies the seasonal cycles objectively and, without a priori assumptions, clearly distinguishes between urban and rural areas. We show an example of this in the Sichuan Basin of China. Further, a supervised machine learning approach, random forest, is able to identify the key factors associated with each distinct shape of the seasonal cycle, such as the key role played by the surface type and the built environment.

Random forests can be improved upon by an optimized ensemble of machine learning approaches (boosting and bagging), which explores a variety of ensemble methods and selects the best-performing algorithm with tuned hyperparameters. This optimized approach automatically identifies the most important meteorological and surface variables associated with PM2.5 concentration. The variables highlighted by optimized machine learning were then examined together with five traditional meteorological features via multiple linear regression (MLR) models, which provide comprehensive mechanistic insights into the effect of these variables on the variation of the PM2.5 annual cycles, e.g., how these environmental variables interact with PM2.5 in specific areas.

Lastly, SHapley Additive exPlanation (SHAP) values, a consistent measure of individualized feature attribution in ensemble tree models, were employed to obtain more information about the impacts of these environmental variables. SHAP provides individualized attributions of each predictor to the final output. Because the SHAP values were calculated from the ensemble tree models, they do not assume a linear relationship between the predictors and PM2.5 concentration, as MLR does. The impacts given by SHAP were consistent with those from MLR, but SHAP is more generally applicable.
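
The idea of an individualized, additive attribution can be illustrated with a minimal sketch that computes exact Shapley values by brute force for one sample. This is not the thesis's implementation: the thesis computed SHAP values on ensemble tree models, whereas the sketch below uses a hypothetical hand-written response function (`toy_model`), hypothetical feature names and values, and a simple baseline-replacement scheme for "missing" features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions for one sample x relative to a baseline.

    A feature that is "absent" from a coalition takes its baseline value.
    The cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` come from x; all others come from the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                coalition = set(s)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical non-linear response, NOT a fitted PM2.5 model.
    # z = [wind speed, relative humidity, temperature] (assumed features).
    wind, humidity, temperature = z
    cold_factor = 1.0 if temperature < 10 else 0.5
    return 50.0 - 4.0 * wind + 0.3 * humidity * cold_factor


x = [2.0, 80.0, 5.0]          # sample to explain (made-up values)
baseline = [5.0, 50.0, 15.0]  # reference conditions (made-up values)
phi = shapley_values(toy_model, x, baseline)

for name, contribution in zip(["wind", "humidity", "temperature"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The attributions sum to the difference between the model output for the sample and for the baseline, and no linear relationship between predictors and output is assumed. The humidity and temperature terms interact in `toy_model`, and the Shapley weighting splits that interaction between the two features.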



Air—Pollution—Research, Machine learning, Air—Pollution—Meteorological aspects, Self-organizing maps


Copyright ©2018 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.