GIE - Management of Statistical Information

Coordinating unit: 200 - FME - School of Mathematics and Statistics
Teaching unit: 707 - ESAII - Department of Automatic Control
               723 - CS - Department of Computer Science
               1004 - UB - Universitat de Barcelona
               715 - EIO - Department of Statistics and Operations Research
Academic year: 2016
ECTS credits: 5
Teaching languages: English

Teaching staff
Coordinator: PEDRO FRANCISCO DELICADO USEROS
Others: Second semester:
    PEDRO FRANCISCO DELICADO USEROS - A, B
    JOAQUIN GABARRÓ VALLÉS - A, B
    ALEXANDRE PERERA LLUNA - A, B
    ÀLEX SÁNCHEZ PLA - A, B

Prior skills
Compulsory subject for all students. Students have already developed several skills in Statistics and/or Operations Research in the previous semester. They must be familiar with a basic computing environment and have programming skills such as those developed in the mandatory course "Statistical Computation and Optimization". A B2 level of English (Cambridge First Certificate, TOEFL PBT >550) is required.

Degree competences to which the subject contributes
Specific:
3. CE-1. Ability to design and manage the collection of information, and to code, handle, store and process it.
4. CE-4. Ability to use different inference procedures to answer questions, identifying the properties of different estimation methods and their advantages and disadvantages, tailored to a specific situation and a specific context.
5. CE-5. Ability to formulate and solve real decision-making problems in different application areas, choosing the most suitable statistical method and optimization algorithm in each case.
6. CE-6. Ability to use appropriate software to perform the calculations needed to solve a problem.
7. CE-7. Ability to understand advanced statistical and operations research papers. Knowledge of the research procedures for both producing new knowledge and transmitting it.
8. CE-8. Ability to discuss the validity, scope and relevance of these solutions and be able to present and defend their conclusions.

Transversal:
1. ENTREPRENEURSHIP AND INNOVATION: Being aware of and understanding how companies are organised and the principles that govern their activity, and being able to understand employment regulations and the relationships between planning, industrial and commercial strategies, quality and profit.
2. EFFECTIVE USE OF INFORMATION RESOURCES: Managing the acquisition, structuring, analysis and display of data and information in the chosen area of specialisation and critically assessing the results obtained.
10. FOREIGN LANGUAGE: Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.
11. TEAMWORK: Being able to work in an interdisciplinary team, whether as a member or as a leader, with the aim of contributing to projects pragmatically and responsibly and making commitments in view of the resources that are available.

Teaching methodology
The course is divided into 3 modules that are taught in succession. Each module takes up one third of the sessions. All classes combine theory and practice; in them, teachers present and discuss the basic concepts of each module. The support material will be published beforehand in Athena (teaching guide, contents, course slides, examples, evaluation activities schedule, bibliography, ...). Students should devote the autonomous learning hours to studying the course subjects, reading further from the bibliography, and following up on the laboratory practices.

Learning objectives of the subject
This course presents and discusses tools and techniques to prepare students for their professional development. The course consists of three main modules.

MODULE 1: The first module is a crash course (around 15h) in scientific Python for data analysis. It has three main stages:
* Introduction to the Python language as a tool. Workflow, IPython, IPython notebook (Jupyter), basic types, mutability and immutability, and object-oriented programming.
* Short introduction to numerical Python and matplotlib for graphical visualization.
* Introduction to scientific kits for data analysis with machine learning. Principal component analysis, clustering and supervised analysis with multivariate data.

MODULE 2: The second module covers relational databases. At the end of this module, students should be able to work fluently with a client/server relational DB system such as PostgreSQL or MariaDB. More specifically, they should be able to:
* Query an existing DB.
* Update an existing DB and create a (small) DB.
* Work with tools such as triggers and stored procedures.
* Understand the problems of concurrent access and their solutions.

MODULE 3: Data are often found on the web in formats that require some preprocessing before they can be analyzed. This module explores techniques for understanding these formats so that students can retrieve data from the web and extract the desired information. The first part of the module introduces the most common web technologies, how they relate to each other, and some tools to manipulate and extract information. Then the most common formats for storing web information (HTML, XML, JSON) are presented, as well as tools to extract it, such as XPath and CSS selectors. Finally, we introduce some R packages suitable for processing web information. Specifically, at the end of the module students should:
* Be familiar with the main technologies used to store information on the web.
* Recognize the different formats that can be used for storage.
* Know how to extract information from these formats using specific R packages.
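As an illustration of the formats and query tools named in Module 3, here is a minimal sketch. The module itself works in R; this example uses Python's standard library instead (the language of Module 1). The sample documents, field names and helper functions are hypothetical, invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
JSON_TEXT = '{"courses": [{"code": "GIE", "ects": 5}, {"code": "XYZ", "ects": 6}]}'
XML_TEXT = """
<courses>
  <course code="GIE"><title>Management of Statistical Information</title></course>
  <course code="XYZ"><title>Another Course</title></course>
</courses>
"""


def ects_by_code(json_text):
    """Parse a JSON document and map each course code to its ECTS credits."""
    data = json.loads(json_text)
    return {course["code"]: course["ects"] for course in data["courses"]}


def title_for(xml_text, code):
    """Look up a course title with an XPath-style query.

    ElementTree supports only a subset of XPath, which is enough here.
    """
    root = ET.fromstring(xml_text.strip())
    node = root.find(f"./course[@code='{code}']/title")
    return None if node is None else node.text


print(ects_by_code(JSON_TEXT))
print(title_for(XML_TEXT, "GIE"))
```

The same two steps (parse the document, then select nodes by path or key) carry over to HTML with CSS selectors, which need a third-party parser in either language.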

Study load
Total learning time: 125h
    Hours large group:   30h   24.00%
    Hours medium group:   0h    0.00%
    Hours small group:   15h   12.00%
    Guided activities:    0h    0.00%
    Self study:          80h   64.00%

Content

Introduction to Python
    a. Why Python?
    b. Python History
    c. Installing Python
    d. Python resources

Working with Python
    a. Workflow
    b. IPython vs. CLI
    c. Text Editors
    d. IDEs
    e. Notebook

Getting started with Python
    a. Introduction
    b. Getting Help
    c. Basic types
    d. Mutable and immutable
    e. Assignment operator
    f. Controlling execution flow
    g. Exception handling
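Three of the topics listed under "Getting started with Python" (mutable and immutable types, the assignment operator, and exception handling) can be sketched together in a few lines. This is an illustrative example rather than course material; the function and variable names are hypothetical.

```python
def append_item(items, value):
    """Append in place: the caller's list changes because lists are mutable."""
    items.append(value)
    return items


original = [1, 2]
alias = original          # assignment binds a second name to the same list object
append_item(alias, 3)     # the change is visible through both names

point = (1, 2)            # tuples are immutable
try:
    point[0] = 10         # item assignment on a tuple raises TypeError
except TypeError as err:
    message = f"cannot modify tuple: {err}"

print(original)
print(alias is original)
print(message)
```

Because assignment never copies, a function that receives a list can change the caller's data; passing a copy such as `list(original)` avoids that when it is not wanted.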

Functions and Object Oriented Programming
    a. Defining Functions
    b. Input and Output
    c. Standard Library
    d. Object-oriented programming

Introduction to NumPy
    a. Overview
    b. Arrays
    c. Operations on arrays
    d. Advanced arrays (ndarrays)
    e. Notes on Performance (%timeit in IPython)

Matplotlib
    a. Introduction
    b. Figures and Subplots
    c. Axes and Further Control of Figures
    d. Other Plot Types
    e. Animations

Python scikits
    a. Introduction
    b. scikit-timeseries

scikit-learn
    a. Datasets
    b. Sample generators
    c. Unsupervised Learning
    d. Supervised Learning
        i. Linear and Quadratic Discriminant Analysis
        ii. Nearest Neighbors
        iii. Support Vector Machines
    e. Feature Selection

Practical Introduction to Scikit-learn
    a. Solving an eigenfaces problem
        i. Goals
        ii. Data description
        iii. Initial Classes
        iv. Importing data
    b. Unsupervised analysis
        i. Descriptive Statistics
        ii. Principal Component Analysis
        iii. Clustering
    c. Supervised Analysis
        i. k-nearest Neighbors
        ii. Support Vector Classification
        iii. Cross validation

Introduction to relational databases
Learning time: 5h
    Theory classes: 2h
    Laboratory classes: 3h
Basic DB concepts such as tables and tuples. First steps in PostgreSQL.
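The basic concepts named above (a table, its tuples, and a first query) can be sketched as follows. This is a minimal illustration that assumes Python's built-in sqlite3 module as a stand-in for the PostgreSQL server used in the course; the table, columns and rows are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one small table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, ects INTEGER)"
    )
    # Each Python tuple becomes one row (tuple) of the relation.
    conn.executemany(
        "INSERT INTO student (id, name, ects) VALUES (?, ?, ?)",
        [(1, "Ana", 5), (2, "Marc", 10), (3, "Laia", 5)],
    )
    conn.commit()
    return conn


conn = build_demo_db()
# A parameterized query: select and order the rows that match a condition.
rows = conn.execute(
    "SELECT name FROM student WHERE ects = ? ORDER BY name", (5,)
).fetchall()
print(rows)
```

The SQL statements are standard enough to run largely unchanged on a client/server system; what differs there is the connection step and the driver's placeholder syntax.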

SQL and relational algebra
Learning time: 5h
    Theory classes: 2h
    Laboratory classes: 3h
Queries, insertions and deletions, joins. Elements of relational algebra. Ordering, grouping, averages.

Transactions
Learning time: 5h
    Theory classes: 2h
    Laboratory classes: 3h
Problems of concurrent access. ACID properties. Different levels of isolation.

Web data processing
Learning time: 15h
    Theory classes: 15h
1. Introduction to technology for web data (1.5h)
2. The languages and formats for the web: HTML, XML, JSON, XPath, CSS (4.5h)
3. Programs and communication protocols: HTTP (1.5h)
4. Retrieving web data: web scraping and text mining (4.5h)
5. Data project management and case studies (3h)

Qualification system
There will be a grade for each module, derived from an exam or from a final project, depending on the module. The final grade will be the average of the grades of the 3 modules.

Bibliography

Basic:
Langtangen, H.P. A Primer on Scientific Programming with Python [on line]. Springer, 2011. Available on: <https://hplgit.github.io/primer /doc/pub/half/book.pdf>. ISBN 978-3-642-18365-2.
Munzert, S.; Rubba, C.; Meißner, P.; Nyhuis, D. Automated data collection with R: A practical guide to web scraping and text mining. Wiley, 2015. ISBN 978-1118834817.
Nolan, D.; Lang, D.T. XML and web technologies for data sciences with R. Springer, 2014. ISBN 978-1-4614-7899-7.
Shapiro, B.E. Scientific Computation: Python Hacking for Math Junkies. Sherwood Forest Books, 2015. ISBN 9780692366936.
Stones, Richard; Matthew, Neil. Beginning databases with PostgreSQL: from novice to professional [on line]. 2nd ed. USA: Apress, 2005. Available on: <http://site.ebrary.com/lib/upcatalunya/docdetail.action?docid=10150839>. ISBN 978-1-59059-478-0.
Suehring, S.; Valade, J. PHP, MySQL, & HTML5 All-in-One For Dummies. Wiley, 2013. ISBN 978-1-118-21370-4.

Complementary:
Garcia-Molina, Hector; Ullman, Jeffrey D.; Widom, Jennifer. Database Systems: the complete book. 2nd ed. USA: Pearson, 2009. ISBN 0131873253.
Spector, P. Concepts in computing with data (Stat 133, UC Berkeley) [on line]. Berkeley, 2011. Available on: <http://www.stat.berkeley.edu/ spector/s133/index >.