Sanjoy Dasgupta Professor, Computer Science and Engineering Faculty-Affiliate, Calit2

Prior to joining the UCSD Jacobs School in 2002, Sanjoy Dasgupta was a senior member of the technical staff at AT&T Labs-Research, where his work focused on algorithms for data mining, with applications to speech recognition and to the analysis of business data. Prof. Dasgupta received a Ph.D. in Computer Science from UC Berkeley in 2000 and a B.A. in Computer Science from Harvard in 1993. He is a member of the editorial boards of the Journal of Machine Learning Research, the Journal of Artificial Intelligence Research, and the Machine Learning Journal.

High-dimensional statistics, clustering, algorithms for finding underlying patterns in high-dimensional data, machine learning

Professor Sanjoy Dasgupta develops algorithms for the statistical analysis of high-dimensional data. Such data is now widespread, in domains ranging from environmental modeling to genomics to web search. The geometry of high-dimensional spaces presents unusual challenges: many traditional statistical procedures were developed with one- or two-dimensional data in mind and do not scale well to this modern context. Some of them are very inefficient; others give poor results because of counter-intuitive effects in high dimension. Dasgupta has developed the first provably correct, efficient algorithms for a variety of canonical statistical tasks, especially those related to clustering (grouping) data. He is one of the few machine learning researchers whose work combines algorithmic theory with geometry and mathematical statistics. He adds a strong theoretical focus to UCSD's CSE artificial intelligence and bioinformatics groups.

DATA SCIENCE IN THE JACOBS SCHOOL OF ENGINEERING
Sanjoy Dasgupta
Computer Science and Engineering

Data + Methods
The data: from all over campus
- Neural, atmospheric/oceanic, medical, personal health, internet, genetic
The methods: concentrated in the Jacobs School

Research in core methodologies
- Machine learning
- Big data algorithmics
- Security and privacy
- Interpretability and confidence assessment
Faculty: Yoav Freund, Daniel Kane, Mihir Bellare, Kamalika Chaudhuri

Goals
1. Spread the expertise
2. Simplify the interface between domain experts and methods experts
3. The view beyond campus

Spreading the expertise
- Starting fall 2017: Undergraduate major in data science
- Starting fall 2017: MSc in data science (through ECE dept)
- Starting summer 2017: Micro-MSc in data science

Undergraduate data science
- Application domains
- Machine learning / data mining
- Algorithms
- Visualization
- Database management
- Distributed computing
- Linear algebra
- Probability and statistics
- Programming
- Discrete structures

Major in data science
Components: core classes (lower division), core classes (upper division), electives, senior project
Core classes:
- Overview of data science
- Introduction to programming
- Introduction to data structures
- Representations of data
- Linear algebra
- Discrete math for data science
- Networked life
- Calculus, Physics/Chemistry/Biology
- Probability and statistics
- Exploratory data analysis
- Databases
- Distributed computation
- Data visualization
- Probabilistic reasoning and decision making
- Machine learning
- Data mining
Electives: 8 classes; ideally, develop a domain of specialization

Domains of specialization
- Computer science
- Cognitive science
- Signal processing
- Theory
In planning:
- Computational social science
- Digital humanities / arts / music
- Neuroscience
- Biology/medicine
- Business analytics
- Climate and environmental science

Interface: {domain,methods} experts
- Recent faculty hires: engineering + application domain
- Another idea: help desk, with drop-in consultation with methods experts

Looking beyond campus
Two-year Master of Advanced Study program (since 2014)
- Full-day classes, every second Friday and Saturday
- Taught mostly by UCSD faculty
- Small class sizes (under 30)
- Significant TA support outside class
- Total cost (for two years): roughly $36K

MAS: the curriculum
Terms, in order: Fall, Winter, Spring, Fall, Winter, Spring
Courses, in order:
- Python for data analysis
- Case studies in data science
- Probability and statistics using Python
- Data management systems
- Machine learning
- Big data analysis using Hadoop and Spark
- Beyond relational data models
- Unsupervised learning
- Data visualization
- Capstone project
- Capstone project