Model evaluation, multi-model ensembles and structural error


Reto Knutti, IAC, ETH Zurich

Toy model: obs = linear trend + noise (variance, spectrum). Short-term predictability, separation of trend and noise, calibration, structure of the model.
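A minimal sketch of such a toy model, assuming AR(1) noise (the slide does not specify the noise spectrum) and invented parameter values: generate observations as a linear trend plus autocorrelated noise, then recover the trend by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

n_years = 100
t = np.arange(n_years)
true_trend = 0.02        # assumed trend per year (illustrative only)
sigma, phi = 0.15, 0.6   # assumed noise standard deviation and AR(1) autocorrelation

# AR(1) noise: each value is a damped copy of the previous one plus white noise
noise = np.zeros(n_years)
white = rng.normal(size=n_years)
for i in range(1, n_years):
    noise[i] = phi * noise[i - 1] + np.sqrt(1.0 - phi**2) * sigma * white[i]

obs = true_trend * t + noise   # obs = linear trend + noise

# Separate trend and noise by ordinary least squares
design = np.vstack([t, np.ones_like(t)]).T
slope, intercept = np.linalg.lstsq(design, obs, rcond=None)[0]
print(f"true trend: {true_trend:.3f} per year, estimated trend: {slope:.3f} per year")
```

The questions on the slide then become concrete: how long a record is needed before the estimated trend is reliable, and how the answer depends on the assumed noise structure.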

RCP4.5 surface warming end of the century

RCP4.5 surface warming at the end of the century. Which model is the best? What makes a model a good model? Is a physical model better than a statistical model? Is a more complex model better? What is the purpose of a model? Does this sample characterize uncertainty? Can we interpret this as probabilities? Why more than one model? Do more models make us more confident?

Should we weight models? How? Very different results, depending on the statistical method and the constraints/weighting. (Tebaldi and Knutti 2007)
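As a purely illustrative sketch of why the choice of weighting matters (this is not the specific method of Tebaldi and Knutti 2007), one can weight each model by a Gaussian function of its distance to observations and compare weighted and unweighted projections; the result depends strongly on the assumed scale parameter sigma_d, which is hypothetical here, as are the input numbers.

```python
import numpy as np

# Hypothetical inputs: each model's error against present-day observations
# (e.g. an RMSE) and its projected end-of-century warming in deg C.
model_error = np.array([0.8, 1.1, 0.5, 1.6, 0.9])
model_projection = np.array([2.4, 3.1, 2.0, 3.8, 2.7])

for sigma_d in (0.5, 1.0, 2.0):                   # assumed strength of the weighting
    w = np.exp(-(model_error / sigma_d) ** 2)     # Gaussian performance weights
    w /= w.sum()
    weighted = np.sum(w * model_projection)
    print(f"sigma_d={sigma_d:.1f}: weighted mean {weighted:.2f} deg C "
          f"(unweighted {model_projection.mean():.2f} deg C)")
```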

What do we learn from more models? The assumption that models are independent and distributed around the true climate implies that the uncertainty in our projection decreases as more models are added ("truth plus error"). Alternatively, one may assume that models and observations are sampled from the same distribution ("indistinguishable"). (Knutti et al. 2010)
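A hedged numerical illustration of the two interpretations, with invented numbers: under "truth plus error" the ensemble-mean error shrinks roughly as 1/sqrt(N), whereas under the "indistinguishable" interpretation the truth is just another draw from the model distribution and the error levels off at roughly the ensemble spread.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, spread = 3.0, 0.8      # hypothetical model distribution (e.g. warming in deg C)
n_trials = 2000

for n in (5, 10, 20, 40, 80):
    # "Truth plus error": the truth is fixed and models scatter independently
    # around it, so the error of the ensemble mean decreases with N.
    models = mu + spread * rng.normal(size=(n_trials, n))
    err_tpe = np.abs(models.mean(axis=1) - mu).mean()

    # "Indistinguishable": truth and models are exchangeable draws from the
    # same distribution, so the error of the ensemble mean does not go to zero.
    truth_draw = mu + spread * rng.normal(size=n_trials)
    models = mu + spread * rng.normal(size=(n_trials, n))
    err_ind = np.abs(models.mean(axis=1) - truth_draw).mean()

    print(f"N={n:2d}  truth+error: {err_tpe:.3f}  indistinguishable: {err_ind:.3f}")
```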

Contents: motivation; the idea of model evaluation; the prior distribution in a multi-model ensemble; model independence; model averaging; relating past/current and future model performance; model tuning, evaluation and overconfidence; conclusions and open questions.

Types of models: empirical, data-based, statistical models assuming little in advance (e.g. time series models, regressions, power laws, neural nets); stochastic, general-form but highly structured models which can incorporate prior knowledge (e.g. state-space models and hidden Markov models); specific theory- or process-based models, often termed deterministic (e.g. specific types of partial or ordinary differential equations); conceptual models based on assumed structural similarities to the system (e.g. Bayesian (decision) networks, compartmental models, cellular automata); agent-based models allowing locally structured emergent behavior, as distinct from models representing regular behavior that is averaged or summed over large parts of the system; rule-based models (e.g. expert systems, decision trees). (Jakeman et al. 2006)

Models have different purposes: data assessment (discovering inconsistencies and limitations, data reduction, interpolation); understanding of the system and hypothesis testing; prediction, both extrapolation from the past and "what if" exploration; providing guidance for management and decision-making. "Do I believe my model prediction?" is equivalent to "Can I quantify the uncertainty in my model prediction with reasonable confidence/accuracy?"

Basic questions in model evaluation: Has the model been constructed of approved materials, i.e. approved constituent hypotheses (in scientific terms)? Does its behavior approximate well that observed in respect of the real thing? Does it work, i.e. does it fulfill its designated task or serve its intended purpose? (Jakeman et al. 2006)

Development and evaluation of models (Jakeman et al. 2006)

Why do we trust climate models? Physical principles; reproduction of observed climate; reproduction of trends; processes; weather; past climate; robustness. (Knutti, 2008)

Model confirmation. Do we confirm the model (just a set of rules), or that the world has a similar causal structure? Evaluate that each part/process works well, and from that conclude (or hope?) that the model is good. Statistical evaluation on all datasets: if it fits, it has converged to reality. Emergent constraints: relating past and future observables across models.

Model confirmation: for the particular purpose of interest, 1) the relevant quantitative relationships or interactions between different parts or variables that emerge from the inner structure of the model are sufficiently similar to those in the target system, 2) they will remain so over time and beyond the range where data are available for evaluation, and 3) no important part or interaction, either known or unknown, is missing.

My model is better than your model. What is the purpose, and is the model adequate for that purpose? What does "best" mean anyway? What is the evidence that a model is doing the right thing? How can we quantify uncertainty beyond ensemble spread? How do we combine evidence from different models and observations? Why is it so hard, and are we making progress?

Model performance and quality. Performance metric: a measure of agreement between model and observation. Model quality metric: a measure designed to infer the skill of a model for a specific purpose. (Gleckler et al., 2008)

Metrics and model quality: An infinite number of metrics can be defined. Many metrics are dependent. Observational datasets and their uncertainty matter. The concept of a best model is ill-defined. There may be a best model for a particular purpose, where "best" is measured in a specific way, but determining that is hard.
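As one concrete example of a performance metric (an assumption for illustration, not the metric of Gleckler et al. 2008): an area-weighted, centred RMSE between a model field and an observed field. Many other choices are possible, which is exactly the point of the slide.

```python
import numpy as np

def centred_rmse(model_field, obs_field, lats_deg):
    """Area-weighted centred RMSE between model and observed lat-lon fields.
    'Better' means better only with respect to this particular metric."""
    w = np.cos(np.deg2rad(lats_deg))[:, None]        # latitude (area) weights
    w = np.broadcast_to(w, model_field.shape).copy()
    w /= w.sum()
    # Remove the area-weighted means so the metric measures the pattern error
    m = model_field - np.sum(w * model_field)
    o = obs_field - np.sum(w * obs_field)
    return float(np.sqrt(np.sum(w * (m - o) ** 2)))

# Hypothetical 3 x 4 lat-lon fields of annual-mean temperature (K)
lats = np.array([-60.0, 0.0, 60.0])
obs = np.array([[280.0, 282.0, 281.0, 279.0],
                [298.0, 299.0, 300.0, 298.0],
                [275.0, 276.0, 274.0, 275.0]])
model = obs + np.random.default_rng(2).normal(0.0, 1.0, size=obs.shape)
print(f"centred RMSE: {centred_rmse(model, obs, lats):.2f} K")
```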

Models improve. (Figure: model performance, from worse to better; Reichler and Kim 2007)

Models improve (Knutti et al. 2013)

Why multiple models? To quantify uncertainty in a prediction we need to sample the space of plausible models. This can be achieved by perturbing parameters/parts of a single model or by building families of models (multi-model ensembles). When two incompatible theories are available, we try to reject one. This is often impossible with environmental models. Several models are plausible given the limited understanding, the uncertainties in data, the lack of an overall measure of skill and the lack of verification. Models are seen as complementary. (Knutti, 2008)

The multi model ensemble Is B1 more uncertain than A2?

The multi-model ensemble: 11 models for which all scenarios are available. The prior distribution of models in the multi-model ensemble is arbitrary. (Knutti et al. 2010)

Multi-model averages: we average models because a model average is better than a single model. But is it really? IPCC AR4 WGI Figure SPM-7: relative changes in precipitation (in percent) for the period 2090-2099, relative to 1980-1999. Values are multi-model averages based on the SRES A1B scenario for December to February (left) and June to August (right). White areas are where less than 66% of the models agree on the sign of the change, and stippled areas are where more than 90% of the models agree on the sign of the change.
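A minimal sketch of the sign-agreement criterion described in that caption, with invented projection data: at each grid point, count the fraction of models agreeing with the sign of the multi-model mean change and classify the point using the 66% and 90% thresholds.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, n_points = 21, 8

# Hypothetical projected precipitation changes (percent) per model and grid point
dprecip = rng.normal(loc=np.linspace(-8.0, 8.0, n_points), scale=6.0,
                     size=(n_models, n_points))

mm_mean = dprecip.mean(axis=0)
# Fraction of models agreeing with the sign of the multi-model mean change
agree = (np.sign(dprecip) == np.sign(mm_mean)).mean(axis=0)

for i in range(n_points):
    if agree[i] < 0.66:
        tag = "white (less than 66% agree)"
    elif agree[i] > 0.90:
        tag = "stippled (more than 90% agree)"
    else:
        tag = "plain"
    print(f"point {i}: mean change {mm_mean[i]:+5.1f}%  agreement {agree[i]:.2f}  {tag}")
```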

Averaging can help. (Figure: model performance, from worse to better; Reichler and Kim 2007)

All models are wrong. (Figure: error of the average of N models and of the average of the best N models; the error does not decrease as 1/sqrt(N) but roughly as sqrt(b/N + c), black dashed.) Less than half of the temperature errors disappear for an average of an infinite number of models of the same quality. (Knutti et al. 2010)
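A sketch of why the error of the multi-model mean levels off, under the simple assumption that each model's error is the sum of a component shared by all models (structural error) and an independent, model-specific component; the RMSE of the N-model average then behaves like sqrt(b/N + c), where c is set by the shared error. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n_trials = 5000
sigma_indep  = 1.0   # independent, model-specific error: averages out
sigma_shared = 0.7   # error shared by all models: does not average out

for n in (1, 2, 5, 10, 20, 50, 100):
    shared = sigma_shared * rng.normal(size=(n_trials, 1))   # same for all n models
    indep  = sigma_indep  * rng.normal(size=(n_trials, n))   # different per model
    ensemble_mean_error = (shared + indep).mean(axis=1)
    rmse = np.sqrt((ensemble_mean_error ** 2).mean())
    expected = np.sqrt(sigma_indep**2 / n + sigma_shared**2)  # sqrt(b/N + c)
    print(f"N={n:3d}  simulated RMSE: {rmse:.3f}  sqrt(b/N + c): {expected:.3f}")
```

With these assumed numbers, even an infinite ensemble cannot reduce the error below the shared component, mirroring the statement that less than half of the temperature errors disappear.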

A statistical framework for an ensemble. Probabilistic interpretation of an ensemble requires a statistical framework: What is my sample? What causes the variation across the sample? How do I attach weights to members? What do the ensemble members represent in relation to the truth that we are after? Each member can be seen as sampled from a distribution (eventually) centered around the truth: the "truth plus error" view. This use of the ensemble seeks some form of consensus and would characterize the uncertainty of this consensus estimate as decreasing with increasing ensemble size. Alternatively, each member is (eventually) considered indistinguishable from the truth and from any other member. The range of the ensemble then corresponds to the range of uncertainty, and the truth is not a synthesis but falls somewhere among the members (the weather forecasting view of ensemble forecasting).

Loss of signal by averaging: most models show areas of strong drying, but the multi-model average does not. (Knutti et al. 2010)
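A toy illustration of this loss of signal, with an invented one-dimensional field: every model has a narrow band of strong drying, but because the band sits at a different location in each model, the multi-model average shows only weak drying everywhere.

```python
import numpy as np

n_models, n_lon = 10, 36
lon = np.arange(n_lon)

# Each model dries strongly (-30%) over a narrow band, but the band sits at a
# different longitude in each model (a stand-in for structural differences).
changes = np.zeros((n_models, n_lon))
for m in range(n_models):
    centre = 8 + 2 * m
    changes[m, np.abs(lon - centre) <= 2] = -30.0

print(f"strongest drying in any single model: {changes.min():.0f}%")
print(f"strongest drying in the multi-model mean: {changes.mean(axis=0).min():.0f}%")
```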

What does a passenger jet look like?

What does a passenger jet look like? The "average jet" (idea stolen from Doug Nychka).

What does a passenger jet look like? Is the average meaningful? Not independent information; better and worse information. Does it reflect what we think the uncertainty is? Two issues: sampling and weighting.

Climate model genealogy (Edwards, 2011)

Climate model genealogy Dissimilarity for surface temperature and precipitation (Knutti et al. 2013)

Climate model genealogy (Knutti et al. 2013, Masson and Knutti 2011)

How should we evaluate climate models? What is a good model? "There is considerable confidence that climate models provide credible quantitative estimates of future climate change, particularly at continental scales and above. This confidence comes from the foundation of the models in accepted physical principles and from their ability to reproduce observed features of current climate and past climate changes." (IPCC AR4 FAQ 8.1) So people have attached weights based on current climate. "Aspects of observed climate that must be simulated to ensure reliable future predictions are unclear. For example, models that simulate the most realistic present-day temperatures for North America may not generate the most reliable projections of future temperature changes." (US CCSP report 3.1)

What is a good model? Does model performance on the mean state tell us much about the ability to predict future trends? (Figure: ability to simulate the observed pattern of the warming trend versus ability to simulate the observed pattern of mean climate; correlations R = 0.27 and R = -0.21.) (Jun et al. 2008)

Which model should we trust? Use statistical methods and physical understanding to identify model evaluation metrics that demonstrably constrain the model response in the future. (Knutti 2008)

What is a good model?

What is a good model? Models continue to improve on present day climatology, but uncertainty in projections is not decreasing. We may be looking at the wrong thing, i.e. climatology provides no strong constraint on projections. We cannot verify our projections, but only test models indirectly.

Relating model performance to projections: land-ocean contrast in surface longwave downward all-sky radiation. (Huber et al. 2011)

Relating past changes to projections (Mahlstein and Knutti 2012)

Relating past changes to projections (Mahlstein and Knutti 2012)
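A generic, hedged sketch of the emergent-constraint idea behind studies such as Mahlstein and Knutti (2012), not their specific analysis: regress a projected change on an observable past change across models, then evaluate the fit at the observed value. All numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
n_models = 25

# Hypothetical across-model relationship: models with a stronger trend over the
# observed period also project a larger future change, plus model scatter.
past_trend = rng.uniform(0.1, 0.3, n_models)                        # deg C / decade
future_change = 10.0 * past_trend + rng.normal(0.0, 0.3, n_models)  # deg C

# Fit the across-model (emergent) relationship
slope, intercept = np.polyfit(past_trend, future_change, 1)

# Constrain the projection with a (hypothetical) observed past trend
obs_trend = 0.18
constrained = slope * obs_trend + intercept
print(f"raw multi-model mean projection: {future_change.mean():.2f} deg C")
print(f"observationally constrained estimate: {constrained:.2f} deg C")
```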

Why do the GCMs reproduce the observed warming so well? (Figure: simulations with natural forcings only, with natural and anthropogenic forcings, and observations; IPCC, 2007)

Agreement in 20th century warming trends: Climate sensitivity and radiative forcing across models are correlated; high sensitivity is compensated by high aerosol forcing. (IPCC AR4 TS Fig. 23a) Models do not sample the full range of uncertainty (in particular in forcing). Is the agreement a problem? If we have used the observations in model development (and it seems like we have), agreement tells us only that the assumed forcing is consistent with observed changes in that model. It is not a proof that the model is correct, only that it is a plausible one given the uncertainties.

Agreement in 20th century warming trends: Model development and evaluation use the same datasets. Quotes from various people in a recent discussion about 20th century agreement (shortened): We value models more if they seem to be "right" even without tuning, so to an extent we may have tuned them unconsciously. The only way of having confidence in projections is how well we can simulate the past using models built up with basic physical principles. The tuning of a single model to match observed processes of change, and the constraint or weighting of an ensemble of models using observed climate change, share a common idea: to reduce uncertainty in projections. We made stronger statements in IPCC AR4 about climate sensitivity, transient climate response and SRES ranges not because the models were any more certain than before, but because observed climate change had also been used to constrain projections. If we are prepared to use the evidence of climate change in simple models, why not use it for AOGCMs? Indeed, observationally constrained projections do this by posterior scaling, but that's not so different from prior tuning. I am not advocating trying to tune and tweak to reproduce exactly what happened in the past; I am sure we wouldn't be able to do that anyway. I am suggesting that we should not ignore important changes that have happened in the past but are not simulated in the models. In a Bayesian approach the use of past trends to constrain the future is fine, so agreement of models and data is natural and expected. But there is a danger of using information more than once.

Summary and open questions. Despite some disturbing slides: for some variables and scales, model projections are remarkably robust and unlikely to be entirely wrong. Climate is changing, we are responsible, and future changes will be larger than those observed. Out-of-sample prediction or extrapolation: the life cycle of a model is much shorter than the timescale over which a prediction can be checked against observations. Model sampling is neither systematic nor random (an arbitrary prior); CMIP is a collection of best guesses rather than an ensemble designed to span the full uncertainty range (e.g. in sensitivity). Model performance varies, but we don't know how to make use of that. Implicitly we weight models by using only the latest ones, but we are not prepared to do it formally, e.g. in IPCC reports. What is a good model? Metrics are a thorny issue, and most metrics of present-day climate provide only a weak constraint on the future.

Summary and open questions (cont.). Model averaging may help in some cases but creates problems, e.g. a loss of signal. Models are developed, evaluated (and in some cases a posteriori weighted) on the same datasets. Climatology often correlates poorly with predicted change. Are we looking at the wrong metric? Are we starting with a sample that is too tight? Models are not independent, nor distributed around the truth (structural error). Common metrics could lead to overconfident prior sets of models. Sampling extreme behavior is important. How many models do we need? Massive ensembles to quantify uncertainty? Structurally different models? Weight them equally? How should we sample models, and how should we aggregate them? Some papers: http://www.iac.ethz.ch/group/climate-physics/knutti/publications.html