A little philosophy, ramblings, and a preview of coming events


http://www.stat.yale.edu/~jay/
Associate Professor of Statistics, Yale University
(Professor Emerson prefers to be called "Jay")
Please feel free to ask questions along the way!
http://www.stat.yale.edu/~jay/brazil/campinas/

Outline
1. Why I Do What I Do

Statistics? Computer Science? Bioinformatics? Sports? What's up with this guy?
I love my job! The teaching, the research, the wide range of problems I see every week...
Example: Yale's Statistical Clinics, http://www.stat.yale.edu/clinic/clinic.html
It's all about the data and real-world problems; statistics should be data-driven, or at least problem-driven.
Data analysis should not simply be an excuse to exercise new theory.
What do you do for work? I am a professor. I am a student. Both!


Favorite quotes
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." - Sir Ronald A Fisher
"The plural of anecdote is not data." - Roger Brinner
"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." - John Tukey

Monday Morning A speedy introduction to R with some entertaining data analysis along the way.

Why R?
R is the lingua franca of statistics. It is a language and environment for statistical programming that is ideal for interactive data analysis and graphics, and much, much more.
It is extended by a large collection of packages.
If you want a GUI, there are some options. But that misses the point: a GUI is not reproducible research. Don't go there.

Preliminaries: available resources
The R Project: http://www.r-project.org/
Official Documentation: http://cran.r-project.org/manuals.html
Contributed Documentation: http://cran.r-project.org/other-docs.html
Other resources linked on CRAN: Frequently Asked Questions (FAQs), the R Journal, a Wiki, Books, etc.
Another R community site: http://crantastic.org/
Sweave: http://www.statistik.uni-muenchen.de/~leisch/sweave/
Reproducible research: http://cran.r-project.org/web/views/reproducibleresearch.html

About the talk
Not for a liberal arts audience: no graphical user interface (GUI).
Good for people newish to R who want to learn more (or hear a different perspective).
Good for people who have never used R but who have a fairly solid programming/scripting background.
Will be about 30-40 minutes of formal introduction to the language fundamentals.
Will include about 30-40 minutes of data analysis on a real-world problem (judging bias in Olympic diving) to reinforce these fundamentals: engaging data with R.

Monday Afternoon The R package management system, the C/C++ interface, and an introduction to parallel programming via foreach, all in the context of Bayesian change point analysis.

Why R?
R is the lingua franca of statistics:
The syntax is simple and well-suited for data exploration and analysis.
It has excellent graphical capabilities.
It is extensible, with over 2500 packages available on CRAN alone.
It is open source and freely available for Windows/MacOS/Linux platforms.
This talk emphasizes the importance of the package management system. Much of the success of R should be attributed to Ross & Robert's early decision to go open-source and encourage collaboration, and to the growth of CRAN and the success of the package management system.

Example: Coriell cell lines (raw data)
[Figure: log2ratio plotted against position on chromosome 11.]

foreach
The user may register any one of several parallel backends like doMC or doSNOW, or none at all. The code will either run sequentially or will make use of the parallel backend, if specified, without code modification.

    > library(foreach)
    > library(doMC)
    > registerDoMC(2)
    >
    > a <- 10
    > ans <- foreach(i = 1:5, .combine = c) %dopar%
    + {
    +   a + i^2
    + }
    >
    > ans
    [1] 11 14 19 26 35
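The "without code modification" point can be sketched by running the same loop against the sequential backend that ships with foreach. This is a minimal illustration, not code from the talk: the choice of registerDoSEQ() is an assumption made so that the example does not depend on a parallel backend being installed.

```r
library(foreach)

# Assumption for illustration: register foreach's built-in sequential
# backend. Replacing this one line with a parallel registration such as
# doMC::registerDoMC(2) leaves the loop below unchanged.
registerDoSEQ()

a <- 10
ans <- foreach(i = 1:5, .combine = c) %dopar% {
  a + i^2   # each iteration returns one value; .combine = c joins them
}
print(ans)
```

Only the registration line differs between a sequential and a parallel run, which is what lets the same script move from a laptop to a multicore machine.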

Tuesday Morning An introduction to the Bigmemory Project, covering pitfalls and solutions for working with massive data.

A new era
The analysis of very large data sets has recently become an active area of research in statistics and machine learning. Many new computational challenges arise when managing, exploring, and analyzing these data sets, challenges that effectively put the data beyond the reach of researchers who lack specialized software development skills or expensive hardware.
"We have entered an era of massive scientific data collection, with a demand for answers to large-scale inference problems that lie beyond the scope of classical statistics." - Efron (2005)
"Classical statistics should include mainstream computational statistics." - Kane, Emerson, and Weston (in preparation, in reference to Efron's quote)

Example data sets

Airline on-time data
2009 JSM Data Expo (thanks, Hadley!)
About 120 million commercial US airline flights over 20 years
29 variables, integer-valued or categorical (recoded as integer)
About 12 gigabytes (GB)
http://stat-computing.org/dataexpo/2009/

Netflix data
About 100 million ratings from 500,000 customers for 17,000 movies
About 2 GB stored as integers
No statisticians on the winning team; hard to find statisticians on the leaderboard
Top teams: access to expensive hardware; professional computer science and programming expertise
http://www.netflixprize.com/
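A rough back-of-the-envelope check of the airline figures, as a sketch only: the row and column counts are the approximate values quoted above, and the byte sizes are R's storage sizes for integer (4 bytes) and double (8 bytes) values. The estimate is of the same order as the quoted size and is not meant to reproduce it exactly.

```r
# Approximate figures from the slide (not exact counts).
n_rows  <- 120e6
n_cols  <- 29
n_cells <- n_rows * n_cols

# In-memory size if every cell is stored as a 4-byte integer
# versus an 8-byte double.
gib_as_integer <- n_cells * 4 / 2^30
gib_as_double  <- n_cells * 8 / 2^30

cat(sprintf("cells:      %.2e\n", n_cells))
cat(sprintf("as integer: %.1f GiB\n", gib_as_integer))
cat(sprintf("as double:  %.1f GiB\n", gib_as_double))

# The cell count also exceeds the largest value an R integer can hold.
n_cells > .Machine$integer.max
```

Storing the data as doubles roughly doubles the footprint, which is one reason the integer recoding mentioned above matters.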

Why R?
R is the lingua franca of statistics! (Did I say that earlier?)
Currently, the Bigmemory Project is designed to extend the R programming environment through a set of packages (bigmemory, bigtabulate, biganalytics, synchronicity, and bigalgebra), but it could also be used as a standalone C++ library or with other languages and programming environments.
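To give a feel for the core package, here is a minimal sketch of creating and indexing a big.matrix. It assumes the bigmemory package is installed; the dimensions and values are made up for illustration and are far smaller than anything that would need the package.

```r
library(bigmemory)

# Illustrative dimensions only; a real use case would be far larger.
x <- big.matrix(nrow = 1000, ncol = 3, type = "integer", init = 0)

x[, 1] <- 1:1000   # assign a column with ordinary matrix indexing
dim(x)
sum(x[, 1])        # x[, 1] extracts a regular R integer vector
```

The indexing syntax matches that of an ordinary R matrix, so existing code needs little change to work with the larger structure.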

The Bigmemory Project: http://www.bigmemory.org/

[Figure: package dependency diagram. Packages shown, with descriptions:]
NMF: non-negative matrix factorization
DNAcopy: DNA copy number data analysis
irlba: truncated SVDs on (big.)matrices
biglm: regressions for data too big to fit in memory
biganalytics: statistical analyses with big.matrices
bigalgebra: linear algebra for (big.)matrices
bigmemory: core big.matrix creation and manipulation
bigtabulate: fast tabulation and summaries
synchronicity: mutual exclusions
foreach: concurrent-enabled loops

R base packages (not all are shown)
methods: generic functions
stats: statistical functions
utils: utility functions

Low-level parallel support
doMC: parallel backend for SMP unix
doNWS: concurrent backend for NetworkSpaces
doSNOW: concurrent backend for snow
doSMP: concurrent backend for SMP machines
doRedis: concurrent backend for redis

Legend: each box is an R package and its description; an arrow from A to B means A depends on (or imports) B; an arrow from B to C means B suggests C.

In a nutshell...
The approaches adopted by statisticians in analyzing small data sets don't scale to massive ones.
Statisticians who want to explore massive data must be aware of the various pitfalls and adopt new approaches to avoid them.
We will illustrate common challenges in dealing with massive data and provide general solutions for avoiding the pitfalls.

Examples Some examples, time permitting.

I hope that some of you will enjoy a more in-depth discussion of some of the topics I will be talking about. Please don't be shy about asking questions during, before, or after any of the talks. This is particularly true for the Bigmemory Project, where some of you may already be using it and have specific questions.