STATISTICAL DATA ANALYSIS FOR THE PHYSICAL SCIENCES


Data analysis lies at the heart of every experimental science. Providing a modern introduction to statistics, this book is ideal for undergraduates in physics. It introduces the necessary tools required to analyse data from experiments across a range of areas, making it a valuable resource for students. In addition to covering the basic topics, the book also takes in advanced and modern subjects, such as neural networks, decision trees, fitting techniques and issues concerning limit or interval setting. Worked examples and case studies illustrate the techniques presented, and end-of-chapter exercises help test the reader's understanding of the material.

Adrian Bevan is a Reader in Particle Physics in the School of Physics and Astronomy, Queen Mary, University of London. He is an expert in quark flavour physics and has been analysing experimental data for over 15 years.

STATISTICAL DATA ANALYSIS FOR THE PHYSICAL SCIENCES ADRIAN BEVAN Queen Mary, University of London

Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Mexico City

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

Information on this title: /9781107670341

© A. Bevan 2013

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2013

Printed and bound in the United Kingdom by the MPG Books Group

A catalogue record for this publication is available from the British Library

ISBN 978-1-107-03001-5 Hardback
ISBN 978-1-107-67034-1 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Preface

1 Introduction
  1.1 Measuring g, the coefficient of acceleration due to gravity
  1.2 Verification of Ohm's law
  1.3 Measuring the half-life of an isotope
  1.4 Summary

2 Sets
  2.1 Relationships between sets
  2.2 Summary
  Exercises

3 Probability
  3.1 Elementary rules
  3.2 Bayesian probability
  3.3 Classic approach
  3.4 Frequentist probability
  3.5 Probability density functions
  3.6 Likelihood
  3.7 Case studies
  3.8 Summary
  Exercises

4 Visualising and quantifying the properties of data
  4.1 Visual representation of data
  4.2 Mode, median, mean
  4.3 Quantifying the spread of data
  4.4 Presenting a measurement
  4.5 Skew
  4.6 Measurements of more than one observable
  4.7 Case study
  4.8 Summary
  Exercises

5 Useful distributions
  5.1 Expectation values of probability density functions
  5.2 Binomial distribution
  5.3 Poisson distribution
  5.4 Gaussian distribution
  5.5 χ² distribution
  5.6 Computational issues
  5.7 Summary
  Exercises

6 Uncertainty and errors
  6.1 The nature of errors
  6.2 Combination of errors
  6.3 Binomial error
  6.4 Averaging results
  6.5 Systematic errors and systematic bias
  6.6 Blind analysis technique
  6.7 Case studies
  6.8 Summary
  Exercises

7 Confidence intervals
  7.1 Two-sided intervals
  7.2 Upper and lower limit calculations
  7.3 Limits for a Gaussian distribution
  7.4 Limits for a Poisson distribution
  7.5 Limits for a binomial distribution
  7.6 Unified approach to analysis of small signals
  7.7 Monte Carlo method
  7.8 Case studies
  7.9 Summary
  Exercises

8 Hypothesis testing
  8.1 Formulating a hypothesis
  8.2 Testing if the hypothesis agrees with data
  8.3 Testing if the hypothesis disagrees with data
  8.4 Hypothesis comparison
  8.5 Testing the compatibility of results
  8.6 Establishing evidence for, or observing a new effect
  8.7 Case studies
  8.8 Summary
  Exercises

9 Fitting
  9.1 Optimisation
  9.2 The least squares or χ² fit
  9.3 Linear least-squares fit
  9.4 Maximum-likelihood fit
  9.5 Combination of results
  9.6 Template fitting
  9.7 Case studies
  9.8 Summary
  Exercises

10 Multivariate analysis
  10.1 Cutting on variables
  10.2 Bayesian classifier
  10.3 Fisher discriminant
  10.4 Artificial neural networks
  10.5 Decision trees
  10.6 Choosing an MVA technique
  10.7 Case studies
  10.8 Summary
  Exercises

Appendix A Glossary
Appendix B Probability density functions
Appendix C Numerical integration methods
Appendix D Solutions
Appendix E Reference tables
References
Index

Preface

The foundations of science are built upon centuries of careful observation. These constitute measurements that are interpreted in terms of hypotheses, models, and ultimately well-tested theories that may stand the test of time for only a few years or for centuries. In order to understand what a single measurement means we need to appreciate a diverse range of statistical methods. Without such an appreciation it would be impossible for scientific method to turn observations of nature into theories that describe the behaviour of the Universe from sub-atomic to cosmic scales. In other words, science would be impracticable without statistical data analysis.

The data analysis principles underpinning scientific method pervade our everyday lives, from the use of statistics we are subjected to through advertising to the smooth operation of spam filters that we take for granted as we read our e-mail. These methods also impact upon the wider economy, as some areas of the financial industry use data mining and other statistical techniques to predict trading performance or to perform risk analysis for insurance purposes.

This book evolved from a one-semester advanced undergraduate course on statistical data analysis for physics students at Queen Mary, University of London, with the aim of covering the rudimentary techniques required for many disciplines, as well as some of the more advanced topics that can be employed when dealing with limited data samples. It has been written by a physicist with a non-specialist audience in mind. This is not a statistics book for statisticians, and references have been provided for the interested reader to refer to for more rigorous treatment of the techniques discussed here. As a result this book provides an up-to-date introduction to a wide range of methods and concepts that are needed in order to analyse data. Thus this book is a mixture of a traditional textbook approach and a teach-by-example approach.
By providing these opposing viewpoints it is hoped that the reader will find the material more accessible. Throughout the book, a number of case studies are presented with possible solutions discussed in detail. The purpose of these sections is to consolidate the more abstract notions discussed in the book and apply them to an example. In some instances the case study may appear somewhat abstract and specific to scientific research; however, where possible more widely applicable problems have been included. At the end of each chapter there is a summary of the main issues raised, followed by a number of example questions to help the reader practise and gain a deeper understanding of the material included. Solutions to questions are presented at the end of the book.

The Introduction motivates the importance of studying statistical methods when analysing data by looking at three common problems encountered early within the life of a physicist: measuring g, testing Ohm's law and studying the law of radioactive decay. Following this motivational introduction the book is divided into two parts: (i) the foundations of statistical data analysis from set notation through to confidence intervals, and (ii) discussion of more advanced topics in the form of optimisation, fitting, and data mining. The material in the first part of the book is ordered logically so that successive sections build on material discussed in the earlier ones, while the second part of the book contains stand-alone chapters that depend on concepts developed in the first part. These later chapters can be read in any order.

The first part of this book starts with an introduction to sets and Venn diagrams that provide some of the language that we use to discuss data. Having developed this language, the concept of probability is formally introduced in Chapter 3. Readers who are familiar with these concepts already may wish to skip over the first two chapters and proceed straight to the discussion in Chapter 4 on how to visualise and quantify data. Distributions of data are often described by simple functions that are used to represent the probability of observing data with a certain value.
A number of useful distributions are described in Chapter 5, and Appendix B builds on this topic by discussing a number of additional functions that may be of use. Measurements are based on the determination of some central value of an observable quantity, with an uncertainty or error on that observable. Issues surrounding uncertainties and errors are introduced in Chapter 6, and this topic is further developed in Chapter 7. Chapter 8 discusses hypothesis testing and brings together many of the earlier concepts in the book.

The second part of the book presents more advanced topics. Chapter 9 discusses fitting data given some assumed model using χ² and likelihood methods. This relies heavily on concepts developed in Chapters 5 and 6, and Appendix B. Chapter 10 discusses data mining, or how to efficiently separate two classes of data, for example signal from background, using numerical methods. The methods discussed include the use of cut-based selection, the Bayesian classifier, Fisher's linear discriminant, artificial neural networks, and decision trees.

To avoid interrupting the flow of the text, a number of detailed appendices have been prepared. The most important of these appendices is a collection of probability tables, which is conveniently located at the end of the book in order to provide a quick reference to the reader. There is also a glossary of terms intended to help the reader when referring back to the book some time after an initial reading. Appendices listing a number of commonly used probability density functions and elementary numerical integration techniques have also been provided. While these are not strictly required in order to understand the concepts introduced in the book, they have been included in order to make this a more complete resource for readers who wish to study this topic beyond an undergraduate course.

There are a number of technical terms introduced throughout this book. When a new term is introduced, that term is highlighted in bold-italic text to help the reader refer back to this description at a later time.

I would like to thank colleagues who have provided me with feedback on the draft of this book, and in particular Peter Crew.