The 5 Most Important Things in Data Science. KIRK BORNE Principal Data Scientist, Booz Allen Hamilton



THE 5 MOST IMPORTANT THINGS IN DATA SCIENCE
Kirk Borne [@KirkDBorne]
Principal Data Scientist and Executive Advisor
Booz Allen Innovation Center, Washington DC
PRESENTED FOR METIS, DEMYSTIFY DATA SCIENCE CONF, SEPTEMBER 27, 2017
http://www.boozallen.com/datascience

SUMMARY
I will highlight the five most important things in data science, with a short illustrative (hopefully enlightening and informative) example from my own experience for each: The Data, The Science, Data Storytelling, Data Ethics, and Data Literacy. Since the primary focus of data science is discovery (new insights, better decisions, and value-added innovations), I will include an overview of the different flavors of machine learning for discovery in big data, plus a summary of the different levels of analytics maturity and what they mean for real-world data science applications. I will finish with a review of the top characteristics of leading candidates for data scientist positions within my organization.

THE 5 MOST IMPORTANT THINGS IN DATA SCIENCE
1. THE DATA
One example of one type of data in the world: [figure] = 1 Zettabyte

BIG DATA VOLUME IS BIG, BUT BIG DATA VARIETY IS THE BIGGEST ENABLER OF DISCOVERY
(Welcome to the IoT of Big Discovery!!)
The 3+1 V's of Big Data:
- Volume = the most annoying V
- Velocity = the most challenging V
- Variety = the richest V for discovery
- Value = the most important V
Growth rates:
- y ~ 2 * x (linear growth)
- y ~ 2 ^ x (exponential growth)
- y ~ x! ~ x ^ x (combinatorial growth: all possible interconnections, linkages, and interactions; high variety for discovery!)
https://www.linkedin.com/pulse/exponential-growth-isnt-cool-combinatorial-tor-bair
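To make the gap between the three growth rates on this slide concrete, here is a minimal Python sketch that tabulates them side by side. The function name `growth_table` and the sample values of x are illustrative assumptions, not part of the talk.

```python
import math


def growth_table(xs):
    """Return (x, linear, exponential, combinatorial) tuples for each x.

    linear        -> 2 * x
    exponential   -> 2 ** x
    combinatorial -> x!  (number of orderings of x items)
    """
    return [(x, 2 * x, 2 ** x, math.factorial(x)) for x in xs]


# Sample sizes chosen only for illustration.
for x, linear, exponential, combinatorial in growth_table([1, 5, 10, 20]):
    print(f"x={x:>2}  2*x={linear:>3}  2^x={exponential:>8}  x!={combinatorial}")
```

Even at x = 20 the factorial column is many orders of magnitude larger than the exponential one, which is the sense in which variety (combinations of data sources) outgrows volume.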

BIG DATA USE CASE IN ENVIRONMENTAL SCIENCE: From Data to Information to Knowledge to Understanding: The science of X-informatics is born!

BIG DATA ANSWERS OUR BIG QUESTIONS: WHO, WHAT, WHEN, WHERE, HOW, AND MAYBE WHY
If we collect a thorough set of data (high-dimensional, with many attributes and features) for a complete set of items within our domain of study, then we would have a perfect statistical model for that domain.
In other words, Big Data becomes the descriptive, predictive, and explanatory model for a domain X => X-informatics:
- Environmental, Climate, Bio-, Geo-, Astro-, Urban, Health, Security, Biomedical informatics, and more
Anything we want to know about that domain is specified and encoded within the data.
The goal of Data Science is to decode all of the signals and discover the knowledge that the data represent, using the Scientific Method.

THE 5 MOST IMPORTANT THINGS IN DATA SCIENCE
2. THE SCIENCE
[Figure: a decision boundary separating Class A from Not Class A, with regions labeled True Positives, False Positives, True Negatives, and False Negatives]
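The four regions named in the figure can be counted directly once a decision boundary is fixed. The following is a minimal sketch, assuming a one-dimensional score with a simple threshold as the boundary; the function name, the threshold of 0.5, and the sample scores are all made up for illustration.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP, FP, TN, FN for a threshold decision boundary.

    scores: model scores, higher means "more likely Class A"
    labels: True if the item really is Class A, else False
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, is_class_a in zip(scores, labels):
        predicted_a = score >= threshold
        if predicted_a and is_class_a:
            counts["TP"] += 1   # correctly called Class A
        elif predicted_a and not is_class_a:
            counts["FP"] += 1   # called Class A, but is not
        elif not predicted_a and is_class_a:
            counts["FN"] += 1   # missed a real Class A item
        else:
            counts["TN"] += 1   # correctly called Not Class A
    return counts


# Hypothetical scores and ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
print(confusion_counts(scores, labels))
```

Moving the threshold trades false positives against false negatives, which is why the choice of boundary is itself a hypothesis to be tested.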

DATA SCIENCE FOLLOWS THE SCIENTIFIC METHOD CYCLE
1. Data Collection: observation and characterization
2. Formulation of a hypothesis: diagnosis & classification
3. Deduction: formulation of a predictive test
4. Experimental design and testing
5. Evaluation: error characterization and minimization
6. Review results: validate or revise hypothesis
https://www.oreilly.com/ideas/10-signs-of-data-science-maturity
http://www.boozallen.com/datascience
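As a rough illustration of how the six steps of the cycle map onto a small modeling task, here is a sketch that fits a straight line to some observations and checks it against held-out data. The observations, the linear hypothesis, and the error tolerance of 0.5 are all assumptions invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(predicted, actual):
    """Average absolute difference between predictions and observations."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# 1. Data collection (made-up observations, roughly y = 2x + 1).
observations = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8),
                (4, 9.1), (5, 11.0), (6, 12.9), (7, 15.2)]
train, test = observations[:6], observations[6:]

# 2. Hypothesis: y depends linearly on x.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# 3. Deduction: the hypothesis predicts y for unseen x.
predictions = [slope * x + intercept for x, _ in test]

# 4. Experiment: compare predictions with held-out observations.
# 5. Evaluation: characterize the error.
error = mean_abs_error(predictions, [y for _, y in test])

# 6. Review: validate or revise the hypothesis (tolerance is an assumption).
hypothesis_holds = error <= 0.5
print(f"slope={slope:.2f} intercept={intercept:.2f} error={error:.3f} holds={hypothesis_holds}")
```

If the held-out error exceeded the tolerance, the cycle would return to step 2 with a revised hypothesis rather than stop.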

DATA SCIENCE MATURITY
Booz Allen Hamilton
Data Science:
1) is curiosity in action that creates & organizes knowledge
2) embraces & personifies a culture of experimentation
3) follows rigorous scientific methodology (i.e., experimental, disciplined, ...)
4) systematically explores the world through observation & experiment
5) relentlessly asks the right questions, and searches for the next one
6) is testable and repeatable
7) adopts a fast-fail collaborative culture (knowing that errors are informative)
8) attracts & retains diverse participants (granting them freedom to explore)
9) is a way of doing things, not a thing to do
10) presents insights by illustrating and telling the data's story.
https://www.oreilly.com/ideas/10-signs-of-data-science-maturity
http://www.boozallen.com/datascience

THE 5 MOST IMPORTANT THINGS IN DATA SCIENCE
3. DATA STORYTELLING

KNOWING THE KNOWABLE THROUGH DATA SCIENCE
Don't just explain to us how you used Machine Learning (= algorithms that learn from experience), but tell us what you discovered, why you did it, and what it now means!
1) Class Discovery: Find the categories of objects (population segments), events, and behaviors in your data. Also learn the rules that constrain the class boundaries (that uniquely distinguish them).
2) Correlation (Predictive and Prescriptive Power) Discovery: Find trends, patterns, and dependencies in data, which reveal new governing principles or behavioral patterns (the "DNA").
3) Novelty (Surprise!) Discovery: Find new, rare, one-in-a-[million / billion / trillion] objects, events, and behaviors.
4) Association (or Link) Discovery (Graph and Network Analytics): Find the unusual (interesting) co-occurring associations / links / connections.
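Of the four discovery types, novelty discovery is the easiest to sketch in a few lines. The example below flags values that sit far from the rest of a sample using a z-score rule; the z-score method, the threshold of 3.0, and the sample readings are assumptions chosen for illustration, and the talk does not prescribe any particular technique.

```python
import statistics


def find_novelties(values, z_threshold=3.0):
    """Return values lying more than z_threshold standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > z_threshold]


# Hypothetical sensor readings with one surprising value.
readings = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
            10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
print(find_novelties(readings))
```

A z-score rule is sensitive to the outliers it is trying to find, so robust alternatives (for example, rules based on the median) are often preferred on real data.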

THE 5 LEVELS OF ANALYTICS MATURITY
Explain the level of analytics maturity that your Data Science is attempting to achieve.
1) Descriptive Analytics: Hindsight (What happened?)
2) Diagnostic Analytics: Oversight (real-time / What is happening? Why did it happen?)
3) Predictive Analytics: Foresight (What will happen?)
4) Prescriptive Analytics: Insight (How can we optimize what happens?) (Follow the dots / connections in the graph!)
5) Cognitive Analytics: Right Sight (the 360° view: what is the right question to ask for this set of data in this context? = Game of Jeopardy)
   Finds the right insight, the right action, the right decision, right now! = Next Best Action!
   Moves beyond simply providing answers, to generating new questions and hypotheses.
As data scientists, we must not only Walk The Talk, but we must also Talk The Walk.

THE 5 MOST IMPORTANT THINGS IN DATA SCIENCE
4. DATA ETHICS
https://weaponsofmathdestructionbook.com/

Quote from H.G. Wells (1903; writer)
"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."
Well, that day is here now! Statistical & Data Literacy Matters!

Quote from Ronald Coase (economist)
"If you torture your data long enough, it will confess to anything."
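One common way data gets "tortured" is by running many statistical tests and reporting whichever one comes out significant. As a rough illustration (added here as an assumption, not taken from the talk), the sketch below computes the chance of at least one false positive when every tested effect is actually absent, assuming independent tests at significance level alpha.

```python
def chance_of_false_discovery(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    so each test stays "quiet" with probability (1 - alpha).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


for num_tests in (1, 5, 20, 100):
    print(f"{num_tests:>3} tests -> {chance_of_false_discovery(num_tests):.1%}")
```

With 20 independent tests at alpha = 0.05, the chance of at least one spurious "confession" is already well above one half.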

Quote from somebody (?)
"It is now beyond any doubt that cigarettes are the biggest cause of statistics."

THE 5 MOST IMPORTANT THINGS IN DATA SCIENCE
5. DATA LITERACY
https://cdn.andertoons.com/img/toons/cartoon6517t.png

Quote from a famous politician in the 1990s
"I am shocked that half the students in this country score below average on our standardized tests."
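The joke in this quote depends on what "average" means. If it means the median, about half the scores fall below it by definition; if it means the mean, the fraction can be quite different. A minimal sketch with made-up test scores (the data and the function name are assumptions for illustration):

```python
import statistics


def fraction_below(scores, center):
    """Fraction of scores strictly below the given center value."""
    return sum(score < center for score in scores) / len(scores)


# Hypothetical scores with one very low value pulling the mean down.
scores = [20, 70, 72, 75, 78, 80, 82, 85, 88, 90]

median = statistics.median(scores)
mean = statistics.fmean(scores)

print(f"median={median}  below median: {fraction_below(scores, median):.0%}")
print(f"mean={mean}  below mean: {fraction_below(scores, mean):.0%}")
```

Here half the students are below the median, but only three in ten are below the mean, so the "shocking" statistic is shocking only if the two averages are confused.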

Data Literacy in 2 parts: Data Science and Data Ethics
http://www.kirkborne.net/cds151/
1) How to use data
2) How to use data correctly
http://dilbert.com/strip/2000-11-13
http://dilbert.com/strip/2008-05-07

Data Literacy For All: A Reading List
http://rocketdatascience.org/?p=356
My journey from Astrophysics into Data Science was motivated significantly by a strong desire to build Data Literacy in the next-generation workforce!
Learn about my journey here: https://youtu.be/w19sguvx7lw

THE 6 MOST IMPORTANT THINGS IN DATA SCIENCE
6. (BONUS) THE DATA SCIENTIST
http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined/

THE MOST IMPORTANT V OF BIG DATA = VALUE!
https://twitter.com/dez_blanchfield/status/645139875440668672

SAILING THE 7 SEAS OF DATA: THE INDIVIDUAL'S JOURNEY TO DATA SCIENCE MATURITY
The Seven Seas (C's) of Data Scientists:
1) Cognitively Curious (ask questions: the right questions!)
2) Creative (design thinker)
3) Courageous problem-solver (rocks the culture, willingness to fail)
4) Cool under pressure (tolerance for ambiguity)
5) Continuous life-long learner (hackathons, online classes, ...)
6) Communicator (data storyteller)
7) Collaborative ("data science is a team sport")
+ 3 more:
8) Critical Thinker
9) Computational
10) Consultative

DATA SCIENTISTS ARE EXPLORERS EXPLORING VAST AND ENDLESS SEAS OF DATA!
https://www.pinterest.com/pin/377106168772298092/
"If you want to build a ship, don't drum up people to gather wood and don't assign them tasks and work, but rather teach them to yearn for the vast and endless sea." - Antoine de Saint-Exupery

THANK YOU! LET US EXPLORE & BUILD A BETTER WORLD WITH DATA SCIENCE!
Booz Allen Hamilton
LISTEN: @KirkDBorne, @BoozDataScience
READ, BUILD, and EXPLORE: www.boozallen.com/datascience
- Tips for Building a Data Science Capability
- The Mathematical Corporation
- 10 Signs of Data Science Maturity
- The Field Guide to Data Science
- The Data and Analytics Catalyst
Explore: sailfish.boozallen.com
Machine Intelligence: Learn how AI and Machine Intelligence empower The Mathematical Corporation
PARTICIPATE: datasciencebowl.com
These slides here: http://www.kirkborne.net/demystifyds2017/