ClimateBench2.0: Probabilistic Climate Model Scoring

2025

Duncan Watson-Parris, Venkatramani Balaji, Christopher S Bretherton, William Chapman, Gregory S Elsaesser, Pierre Gentine, Ralph F Keeling, David Lawrence, J David Neelin, Sarah G Purkey, Tapio Schneider, Isla Simpson, Graeme L Stephens, Willa Tobin, Laure Zanna, Kevin W Bowman, Peter Martin Caldwell, William Drew Collins, Veronika Eyring, Stephan Hoyer, Nikolay Koldunov, Christian Lessig, Mike S Pritchard, Gavin A Schmidt, Michael Schulz, Tiffany Shaw, Joao P Teixeira, Andrew Williams, and Rose Yu

Goddard Space Flight Center

Despite their central role in climate science and policy, Earth system models (ESMs) remain difficult to compare in any rigorous or transparent way. Most existing evaluations either emphasize specific processes or rely on qualitative assessments across diverse metrics, making it nearly impossible to rank models by their predictive skill. ClimateBench2.0 introduces a probabilistic scoring framework that focuses instead on what matters most: a model’s ability to accurately simulate the historical climate and project future multi-decadal change.

The benchmark leverages high-quality observations from the satellite era (1980–present), with a particular focus on the present-day metrics for which observational constraints are strongest: top-of-atmosphere (TOA) energy balance, seasonal cycle fidelity, and variability in clouds, aerosols, precipitation, and ocean heat uptake. Paleoclimate reconstructions of the Last Glacial Maximum (LGM), Last Interglacial (LIG), and Mid-Holocene are incorporated as out-of-distribution tests to evaluate models beyond the narrow window of recent data. Scoring is based on robust probabilistic metrics such as the continuous ranked probability score (CRPS) and the Brier score, designed to assess ensemble skill and uncertainty quantification.
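As a rough illustration of the two scores named above, the sketch below computes the standard empirical (ensemble) CRPS estimator and the Brier score in plain Python. The function names and the sample numbers are hypothetical and are not taken from ClimateBench2.0; the benchmark's actual scoring code, weighting, and aggregation may differ.

```python
from itertools import product


def crps_ensemble(members, obs):
    """Empirical CRPS for one scalar observation and a finite ensemble.

    Uses the standard estimator
        mean|x_i - y| - 0.5 * mean|x_i - x_j|
    Lower is better; for a single member it reduces to absolute error.
    """
    m = len(members)
    # Average distance of the ensemble members from the observation.
    term_obs = sum(abs(x - obs) for x in members) / m
    # Average distance between all ordered member pairs (ensemble spread).
    term_spread = sum(abs(a - b) for a, b in product(members, repeat=2)) / (m * m)
    return term_obs - 0.5 * term_spread


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical ensemble of a scalar quantity and one observed value.
ensemble = [0.0, 2.0]
observed = 1.0
score = crps_ensemble(ensemble, observed)

# Hypothetical event probabilities (e.g. "anomaly exceeds a threshold")
# paired with whether the event occurred.
bs = brier_score([0.5, 0.5], [1, 0])
```

Both scores are negatively oriented. The CRPS rewards ensembles that are close to the observation without being overconfident, which is why it suits the assessment of ensemble skill and uncertainty quantification together.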

Crucially, statistical performance alone is not sufficient. ClimateBench2.0 will also introduce a dedicated Physical Consistency category, evaluating properties such as global energy balance closure, conservation of water and carbon, and realistic land-ocean-atmosphere energy exchanges. These physical integrity checks are essential for trusting a model’s out-of-distribution predictions, especially under strong forcings not seen in the historical record.
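One way an energy balance closure check of this kind might look is sketched below. It is an assumption-laden illustration, not the benchmark's definition: the function names, the cosine-latitude weighting of zonal means, the sample fluxes, and the 0.2 W m^-2 tolerance are all hypothetical.

```python
import math


def global_mean(zonal_means, lats_deg):
    """Area-weighted global mean of zonal-mean values using cos(latitude) weights."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    return sum(w * v for w, v in zip(weights, zonal_means)) / sum(weights)


def energy_closure_residual(toa_net_down, surface_net_down, atmos_storage_tendency=0.0):
    """Residual (W m^-2) of the global atmospheric column energy budget.

    In a long-term mean, net downward flux at the top of the atmosphere minus
    net downward flux at the surface should equal the (near-zero) tendency of
    energy stored in the atmosphere. A persistent nonzero residual points to a
    spurious energy source or sink.
    """
    return toa_net_down - surface_net_down - atmos_storage_tendency


# Hypothetical zonal-mean net TOA fluxes (W m^-2) at three latitudes.
toa = global_mean([-60.0, 40.0, -60.0], [-60.0, 0.0, 60.0])

# Hypothetical global-mean fluxes and an assumed tolerance for "closed".
residual = energy_closure_residual(toa_net_down=0.8, surface_net_down=0.7)
TOLERANCE_W_M2 = 0.2  # assumed threshold, for illustration only
is_closed = abs(residual) <= TOLERANCE_W_M2
```

Analogous residuals could be formed for water and carbon by replacing the energy fluxes with the corresponding mass fluxes and storage terms.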

By combining empirical benchmarks with physically grounded constraints, ClimateBench2.0 transforms evaluation into a reproducible, quantitative, and outcome-driven ranking framework. It applies across model types, from physical to hybrid to ML-based, and integrates with existing efforts (e.g., CMIP, Obs4MIPs) to ensure transparency and broad adoption.

