https://doi.org/10.5194/essd-2024-170
15 May 2024
Status: this preprint is currently under review for the journal ESSD.

Retrieving Ground-Level PM2.5 Concentrations in China (2013–2021) with a Numerical Model-Informed Testbed to Mitigate Sample Imbalance-Induced Biases

Siwei Li, Yu Ding, Jia Xing, and Joshua S. Fu

Abstract. Ground-level PM2.5 data derived from satellites with machine learning are crucial for health and climate assessments; however, uncertainties persist due to the absence of spatially complete observations. To address this, we propose a novel testbed using non-traditional numerical simulations to evaluate PM2.5 estimation across the entire spatial domain. The testbed emulates the general machine-learning approach by training the model on grid cells corresponding to ground monitoring sites and then testing its predictive accuracy at other locations. For the first time, this approach enables comprehensive evaluation of various machine-learning methods' performance in estimating PM2.5 across the spatial domain. The application to China yields unexpected results: across all benchmark models, larger PM2.5 biases are found in densely populated regions with abundant ground observations, a finding that challenges conventional expectations and has not been explored in the recent literature. The main reason is the imbalance in training samples, which come mostly from urban areas with high emissions. This leads to significant overestimation because monitors are lacking in downwind areas, where PM2.5 is transported from urban areas with varying vertical profiles. Our proposed testbed also provides an efficient strategy for optimizing model structure or training samples to enhance satellite-retrieval model performance. Integrating spatiotemporal features, especially with CNN-based deep-learning approaches such as the ResNet model, successfully mitigates PM2.5 overestimation (by 5–30 µg m⁻³) and the corresponding exposure (by 3 million people·µg m⁻³) in the downwind area over the past nine years (2013–2021) compared with the traditional approach. Furthermore, incorporating 600 strategically positioned ground-measurement sites identified through the testbed is essential to achieve a more balanced distribution of training samples, thereby ensuring precise PM2.5 estimation and facilitating the assessment of associated impacts in China. In addition to presenting the retrieved surface PM2.5 concentrations in China from 2013 to 2021, this study provides the community with a testbed dataset derived from physical modeling simulations, which can be used to evaluate the performance of data-driven methods, such as machine learning, in estimating spatial PM2.5 concentrations.
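The evaluation design described in the abstract (train on cells that coincide with monitors, test on all other cells of a simulated field) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' testbed: the data are synthetic, the single predictor `aod`, the PM2.5-to-AOD ratios (80 and 50), the choice to place monitors only in "urban" cells, and the linear model are all hypothetical stand-ins.

```python
import random


def make_testbed(n_cells=200, seed=0):
    """Build a synthetic 'simulated' domain (hypothetical, not real model output)."""
    rng = random.Random(seed)
    cells = []
    for i in range(n_cells):
        urban = i < n_cells // 4  # assumption: first quarter of cells are urban
        aod = rng.uniform(0.2, 1.0)  # assumption: one satellite-like predictor
        # Assumption: surface PM2.5 per unit AOD is higher in urban cells,
        # mimicking different vertical profiles between urban and downwind areas.
        ratio = 80.0 if urban else 50.0
        pm = ratio * aod + rng.gauss(0.0, 1.0)
        # Assumption: monitors exist only in urban cells (imbalanced sampling).
        cells.append({"aod": aod, "pm": pm, "monitored": urban})
    return cells


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def evaluate(cells):
    """Train on monitored cells, then report mean bias on both groups of cells."""
    train = [c for c in cells if c["monitored"]]
    test = [c for c in cells if not c["monitored"]]
    a, b = fit_line([c["aod"] for c in train], [c["pm"] for c in train])

    def mean_bias(group):
        # Bias = predicted minus 'true' (simulated) PM2.5, averaged over the group.
        return sum(a + b * c["aod"] - c["pm"] for c in group) / len(group)

    return {"bias_monitored": mean_bias(train), "bias_unmonitored": mean_bias(test)}


result = evaluate(make_testbed())
print(result)
```

In this toy setup the bias at monitored cells is essentially zero by construction, while the unmonitored cells show a positive bias (overestimation). Because the simulated field is known everywhere, the bias can be computed at cells that have no monitor, which is the point of a simulation-based testbed.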

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: open (until 21 Jun 2024)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor

Data sets

Numerical model-informed testbed for surface PM2.5 concentration over China and its estimates during 2013–2021. S. Li et al., https://doi.org/10.5281/zenodo.11122294

Model code and software

Numerical model-informed testbed for surface PM2.5 concentration over China and its estimates during 2013–2021. S. Li et al., https://doi.org/10.5281/zenodo.11122294


Viewed

Total article views: 185 (including HTML, PDF, and XML)
  • HTML: 141
  • PDF: 33
  • XML: 11
  • Total: 185
  • Supplement: 19
  • BibTeX: 7
  • EndNote: 7
Views and downloads (calculated since 15 May 2024)

Viewed (geographical distribution)

Total article views: 178 (including HTML, PDF, and XML), thereof 178 with geography defined and 0 with unknown origin.
Latest update: 31 May 2024
Short summary
Surface PM2.5 data have gained widespread application in health assessments and related fields, but inherent uncertainties persist because ground-truth data are lacking across space. This study provides a novel testbed that enables comprehensive evaluation across the entire spatial domain. The optimized deep-learning model with spatiotemporal features successfully retrieved surface PM2.5 concentrations in China (2013–2021) with reduced biases induced by sample imbalance.