The Role of Statistics in Data Science, and Vice Versa
  Jessica Utts, Professor of Statistics, University of California, Irvine; President, American Statistical Association
  Nicholas Horton, Professor of Statistics, Amherst College
Some Issues for Discussion
  How does statistics (as a discipline) view the emerging field of data science?
  What can statisticians contribute to data science?
  What elements of statistics are essential for data science education?
Overview and History
  Statistics has evolved along with technology and the growth of data
  Statistics from the 1990s
  Statistics today!
  Foundational goal is the same
  ASA's vision statement says it well: "A world that relies on data and statistical thinking to drive discovery and inform decisions"
  But methods for achieving that goal have changed and expanded
A Very Early Adopter: John Tukey
  1962, Annals of Mathematical Statistics
  Identified four driving forces in the "new science":
  1. The formal theories of statistics
  2. Accelerating developments in computers and display devices
  3. The challenge, in many fields, of more and ever larger bodies of data
  4. The emphasis on quantification in an ever wider variety of disciplines
A Less Early Adopter: Leo Breiman, 2001
  "Statistical Modeling: The Two Cultures"
  "There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
  (Statistical Science, 2001, with discussants)
A Side Comment
  David Donoho's "50 Years of Data Science" (2015) is worth reading. His version of Breiman's 2 cultures:
  The Generative [stochastic data] Modeling culture seeks to develop stochastic models which fit the data, and then make inferences about the data-generating mechanism based on the structure of those models.
  The Predictive [algorithmic] Modeling culture prioritizes prediction, is effectively silent about the underlying mechanism generating the data, and allows for many different predictive algorithms, preferring to discuss only accuracy of prediction made by different algorithms on various datasets.
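The contrast between the two cultures can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the data are synthetic (a line plus Gaussian noise), and the names `fit_line`, `knn_predict`, and `mse` are hypothetical helpers, not anything taken from Breiman or Donoho. The first function stands in for the generative view (posit a model, then interpret its estimated slope); the second stands in for the predictive view (a k-nearest-neighbor average judged only by held-out accuracy).

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Synthetic data, assumed for illustration only: y = 2x + Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]


def fit_line(x, y):
    """Generative view: assume y = a + b*x + error; estimate (a, b) by least squares."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def knn_predict(x0, x, y, k=5):
    """Predictive view: average the k nearest training responses; no mechanism assumed."""
    nearest = sorted(zip(x, y), key=lambda pair: abs(pair[0] - x0))[:k]
    return mean(yi for _, yi in nearest)


def mse(predictions, actual):
    """Mean squared prediction error on held-out data."""
    return mean((p - a) ** 2 for p, a in zip(predictions, actual))


intercept, slope = fit_line(train_x, train_y)
line_mse = mse([intercept + slope * x for x in test_x], test_y)
knn_mse = mse([knn_predict(x, train_x, train_y) for x in test_x], test_y)

print(f"estimated slope (a statement about the mechanism): {slope:.2f}")
print(f"held-out MSE, fitted line: {line_mse:.2f}; k-NN: {knn_mse:.2f}")
```

The fitted line yields a quantity that can be interpreted (the slope), while the k-NN predictor yields only predictions and is compared purely on held-out error.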
Fast Forward 14 Years: ASA Statement on the Role of Statistics in Data Science, 2015
  Identifies foundational data science fields:
    Database management
    Statistics and machine learning
    Distributed and parallel systems
  Encourages greater, mutually beneficial collaboration across these three fields
  Intersects with numerous disciplines and related research areas
Many ongoing disciplinary collaborations
  Some examples:
    Genomics (and personalized medicine)
    Health services research (electronic medical records)
    Business analytics (customer tracking)
    Smart cities (and sensor networks)
    Astronomy (data streams)
    And others
ASA Statement, Continued
  Notes that statistics education must evolve to meet needs
    For example, address inclusion of data science in K-12, community college
    More later on other aspects of education
  Elucidates role of statistics in data science
From the ASA Statement: The Role of Statistics
  "Framing questions statistically allows researchers to leverage data resources to extract knowledge and obtain better answers. The central dogma of statistical inference, that there is a component of randomness in data, enables researchers to formulate questions in terms of underlying processes and to quantify uncertainty in their answers. A statistical framework allows researchers to distinguish between causation and correlation and thus to identify interventions that will cause changes in outcomes."
The ASA Statement, continued
  "It also allows them to establish methods for prediction and estimation, to quantify their degree of certainty, and to do all of this using algorithms that exhibit predictable and reproducible behavior. In this way, statistical methods aim to focus attention on findings that can be reproduced by other researchers with different data resources. Simply put, statistical methods allow researchers to accumulate knowledge."
The Statistical Inquiry Cycle
  Wild and Pfannkuch, 1999, International Statistical Review
  Problem, Plan, Data, Analysis, Conclusions (PPDAC)
  [Figure: the PPDAC cycle]
    PROBLEM: Grasping system dynamics; Defining problem
    PLAN: Measurement system; Sampling design; Data management; Piloting and analysis
    DATA: Data collection; Data management; Data cleaning
    ANALYSIS: Data exploration; Planned analyses; Unplanned analyses; Hypothesis generation
    CONCLUSIONS: Interpretation; Conclusions; New ideas; Communication
How to carry out PPDAC?
  "This scientific approach to statistical problem-solving is important for all data analysts. It needs to start in the first course and be a consistent theme in all subsequent courses."
  - American Statistical Association Guidelines for Undergraduate Programs in Statistics (2014), http://www.amstat.org/asa/education/curriculum-guidelines-for-Undergraduate-Programs-in-Statistical-Science.aspx
How to carry out PPDAC?
  "Working with data requires extensive computing skills. To be prepared for statistics and data science careers, students need facility with professional statistical analysis software, the ability to wrangle data in various ways, and algorithmic problem-solving. Students should be fluent in higher-level programming languages and facile with database systems."
  - American Statistical Association Guidelines for Undergraduate Programs in Statistics (2014), http://www.amstat.org/asa/education/Curriculum-Guidelines-for-Undergraduate-Programs-in-Statistical-Science.aspx
How to carry out PPDAC?
  Statistical Methods and Theory: Need to
    understand issues of design, confounding, and bias,
    have a foundation in theoretical statistics principles for sound analyses,
    develop knowledge and gain experience applying a variety of statistical methods,
    assess appropriateness of methods, and
    communicate results
How to carry out PPDAC?
  Data Wrangling and Computation: Need to
    be facile with professional statistical software,
    program in a higher-level language and think algorithmically,
    use simulation-based statistical techniques and undertake simulation studies,
    manage and wrangle data, and
    undertake analyses in a reproducible manner
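As one possible illustration of a simulation-based technique carried out reproducibly, the sketch below runs a two-sided permutation test for a difference in group means. It is a minimal example under assumptions, not something prescribed by the guidelines: the function name `permutation_test`, its parameters, and the two small samples are all made up for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the analysis can be rerun exactly
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up numbers for illustration; not data from the talk.
treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [11.0, 12.2, 11.8, 12.9, 11.5, 12.4]

observed_diff, p_value = permutation_test(treatment, control)
print(f"observed difference in means = {observed_diff:.2f}, p = {p_value:.4f}")
```

Passing the seed explicitly, rather than relying on global random state, is one simple way to make a simulation rerunnable by someone else.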
How to carry out PPDAC?
  Statistical Practice and Communication: Need to
    write clearly, speak fluently, and construct effective visual displays and compelling summaries,
    demonstrate ability to collaborate in teams and to organize and manage projects,
    incorporate ethical precepts into all aspects of their work, and
    communicate complex statistical methods in basic terms to managers and other audiences
How to carry out PPDAC?
  Discipline-Specific Knowledge: Need to
    apply statistical reasoning to domain-specific questions,
    translate research questions into statistical questions, and
    communicate results appropriate to different disciplinary audiences.
  Skills taken from undergraduate guidelines, but relevant at other levels as well
Park City Group Report (2016)
  Curriculum Guidelines for Undergraduate Programs in Data Science (DeVeaux + 24 other authors)
    Data science as science
    Interdisciplinary nature
    Data at the core
    Analytical (computational and statistical) thinking and problem-solving
    (New pathways for) mathematical foundations
    Flexibility
  http://www.amstat.org/asa/files/pdfs/edu-datascienceguidelines.pdf
What do statisticians bring to the table?
  Importance of context
  Accounting for variability
  Design, confounding, and analysis of found (observational) data
  Understanding of inference, multiplicity and reproducibility issues
  Statistical analysis (PPDAC) cycle
  Long history of making decisions with data
  Experience working on multidisciplinary teams
Some Issues for Discussion
  How does statistics (as a discipline) view the emerging field of data science?
  What can statisticians contribute to data science?
  What elements of statistics are essential for data science education?