Cloud Detection and PM2.5 Estimation Using Machine Learning







Earth observation (EO) is the gathering of information about the physical, chemical, and biological systems of the planet via remote-sensing technologies, supplemented by Earth-surveying techniques; it encompasses the collection, analysis, and presentation of data. Research on effective methods for analyzing Earth observation data has grown over the years because of the increasing volume of data generated by Earth observation systems, such as remote sensing imagery and weather radars. Researchers have therefore taken an interest in machine learning, a technique that allows computer algorithms to learn from samples. In general, the more comprehensive the training samples are, the better the machine learning performance will be. This property makes machine learning well suited to analyzing Earth observation data. Fine particulate matter, such as particulate matter 2.5 (PM2.5), poses a severe health risk to humans and is associated with many health problems. PM2.5 concentrations are influenced by factors such as meteorological conditions, local population density, and geographic context. Because Earth observation data provide a large quantity of information, they are a valuable tool for studying PM2.5. However, these data are voluminous and come from different platforms, with different spatial and temporal resolutions and in different formats, which challenges the approaches used in PM2.5 studies. This dissertation shows how machine learning methods can be used to address these challenges in three subtopics connected to PM2.5 modeling and estimation. Satellite-based remote sensing products provide important variables for studying regional and global PM2.5, such as the Aerosol Optical Depth (AOD). Nevertheless, AOD cannot be retrieved in cloudy areas, and the quality of AOD data near clouds cannot be guaranteed. Accordingly, the first study aims to detect cloud pixels in remote sensing images.
This study investigates cloud detection with a set of machine learning models on four subsets of 88 Landsat 8 images that have been carefully labelled by analysts. The four training subsets are used to train 16 machine learning models with different input feature selections. The performance of these models is then compared with that of the Fmask algorithm, which is widely used for cloud detection. When tested on the 88 annotated images, the best performance was observed with a model that incorporates unsupervised self-organizing map (SOM) classification results among the input features. In comparison with Fmask 4.0, this model improves the correctness by 10.11% and reduces the cloud omission error by 6.39%. On the 8 other independent validation images, which were never sampled during model training, the model trained on the second-largest training subset with 5 additional input features has the best overall performance. Compared with Fmask 4.0, this model improves the overall correctness by 3.26% and reduces the cloud omission error by 1.28%. In the second study, high temporal resolution PM2.5 models are developed based on data from weather radar systems and meteorological data from the European Centre for Medium-Range Weather Forecasts (ECMWF). A dataset covering the period from July 2019 to June 2021 was collected for model training. It includes Next Generation Weather Radar (NEXRAD) data retrieved from a repository on Amazon Web Services (AWS), meteorological data from ECMWF, and PM2.5 ground observations from 31 sensors deployed across Dallas County, Collin County, and Tarrant County. The models are organized into groups to demonstrate the effectiveness of NEXRAD in high temporal resolution PM2.5 modeling. For PM2.5, the model using NEXRAD data achieves a coefficient of determination (R²) of 0.855, while the model without NEXRAD achieves an R² of 0.7.
The third study establishes a nationwide PM2.5 estimation model using high temporal resolution AOD data from the GOES-16 geostationary satellite, meteorological variables from ECMWF, and a set of ancillary data from a variety of sources. The model achieves a mean absolute error (MAE) of 3.0 µg/m³ and a root mean square error (RMSE) of 5.8 µg/m³. Model performance is then further evaluated by time, elevation, soil order, population density, and lithology. Historical PM2.5 estimation surfaces are then reconstructed, and the PM2.5 surfaces during the California Santa Clara Unit (SCU) Lightning Complex fires are demonstrated.
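For readers unfamiliar with the evaluation metrics reported above (MAE, RMSE, and the coefficient of determination), the following is a minimal sketch of how they are conventionally computed from paired observed and estimated PM2.5 values. It is not the dissertation's code: the function names and the four sample values are hypothetical and chosen only for illustration.

```python
import math


def mae(observed, predicted):
    """Mean absolute error, in the same units as the inputs (e.g. µg/m³)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error; penalizes large errors more heavily than MAE."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r2(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical PM2.5 values in µg/m³, for illustration only (not from the study).
observed = [8.0, 12.0, 15.0, 20.0]
predicted = [9.0, 11.0, 17.0, 18.0]

print(f"MAE  = {mae(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"R2   = {r2(observed, predicted):.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE; a wide gap between the two, as with the 3.0 and 5.8 µg/m³ values reported for the nationwide model, generally indicates that a subset of estimates carries comparatively large errors.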



Geography, Atmospheric Sciences, Environmental Sciences