I590 Data Science Onramp Basics

Data Science Onramp consists of mini courses designed to build and strengthen the data science skills that are often required or desired in data science jobs. Each mini course counts as one credit hour. Each time you enroll, you can select 1-3 credit hours, that is, one, two, or three mini courses. You may enroll in this course at most twice. Most of the mini courses are in text format; a few are in video format.

We provide Teaching Assistant (TA) support and office hours. If you encounter any problems, please feel free to reach our TAs during their office hours or schedule an appointment that better fits your schedule.

You can take your selected mini courses in sequence or in parallel. We HIGHLY recommend parallel learning because: 1) you can participate in the online discussions with your classmates; 2) our TAs hold weekly office hours and monthly live demos based on the weekly and monthly contents of the mini courses; and 3) you have a good reason to finish your assignments on time rather than rushing to complete them before the end of the semester.

Each mini course has its own grading policy. In general, grading is based on assignments/projects, online discussions, and quizzes. If you select more than one mini course, the average of your mini course grades will count toward your final grade.

Tableau

Tableau is a leading data analysis software used by analytics, banking and consulting organizations. Tableau helps users design, develop, and deploy data science algorithms without writing large amounts of code. Its visual handling of data joins and merges gives non-technical users an easy way to work with data without coding in traditional scripting languages. In this course, we will learn Tableau visualization from scratch to a professional level of understanding. We will also learn techniques for building effective visualizations on various public data sets. The course consists of bi-weekly assignments, each focused on a target problem and on building visualizations to discover significant insights. There is also a final project in which students apply their knowledge to a practical dataset and present their storytelling skills through interesting data visualizations. This course will help students gain the important skills needed for building data visualizations and effective storytelling. It will make students proficient in the Tableau visualization tool and able to build impressive visualization storyboards in their professional careers.

Introduction to data visualization and its usage
Familiarizing with the Tableau visualization tool
Importing data in Tableau, working with sample data sets, exploring features in Tableau
Building simple visualizations in Tableau
Working on features like filters
Effective use of the Details feature, sorting options, view toolbar, worksheet options
Creating dashboards and worksheets
Creating calculated fields, grouping set, creating hierarchy
Working with time series data sets
Building effective geo maps and other custom visualizations
Implementing K-means clustering, classification, and prediction in Tableau
Final Project

Machine Learning with Python

Machine learning is a technique for teaching computers to perform tasks without being explicitly programmed. In this course, you will learn the basics of Python, extend them with important packages such as Matplotlib and Scikit-learn, and learn about the different kinds of classification and classifiers used in machine learning. We will begin the course with basic Python programs, because it is good to have some basic Python experience before moving on to advanced concepts like machine learning. In industry, programmers commonly use two important approaches to write complex applications: the recursive approach and the iterative approach. You will learn about these concepts from the modules as well as from the programming assignments. Scikit-learn is one of the best open source machine learning packages in Python, with a large and active open source community. We will use this package to learn machine learning in an applied fashion. Finally, we will show you how to build a recommendation system using the Scikit-learn package.
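The recursive and iterative approaches mentioned above can be contrasted with a small sketch. This example is not taken from the course materials: the function names `factorial_recursive` and `factorial_iterative` are illustrative assumptions, and factorial is used only because it is short enough to show both styles side by side.

```python
def factorial_recursive(n: int) -> int:
    """Compute n! by having the function call itself on a smaller input."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:
        # Base case: stops the chain of recursive calls.
        return 1
    return n * factorial_recursive(n - 1)


def factorial_iterative(n: int) -> int:
    """Compute n! with an explicit loop and an accumulator."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


if __name__ == "__main__":
    print(factorial_recursive(5), factorial_iterative(5))
```

Both functions return the same values. The recursive version mirrors the mathematical definition, while the iterative version avoids Python's recursion depth limit for large `n`.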
Introduction to Python
Installing Python and setting up PyCharm IDE and Anaconda
Python strings, constants, variables and scope
Arithmetic and binary operations
Control structures, functions and exception handling
Using the NumPy and Pandas libraries in Python
Introduction to Matplotlib in Python
Machine learning with Scikit-learn and SciPy
Concepts and implementation of linear regression using NumPy and Logistic Regression
Introduction to SciPy
Overfitting of curve and Ridge Regression using Python
K-Means algorithm and its implementation using Scikit-learn
Implementation of SVM and Decision tree using Scikit-learn
Expectation and Maximization Algorithm and implementing it using Scikit-learn
Principal Component Analysis (PCA) and its implementation using Scikit-learn
Neural Networks and their implementation using Scikit-learn

NLP in Python

Text mining generally starts with information retrieval. We need to identify the source of the data and then collect data from it. Common sources are the web, blogs, social media platforms, reviews and comments, etc. Once we have collected the data, we need to clean out the noise in it, for example by removing duplicate entries and unwanted information such as URLs and image links. Denoising involves a number of steps, which depend on the kind of data at hand. Once the text data is clean, we can apply natural language processing techniques such as parsing and part-of-speech (POS) tagging. The whole idea is to convert something not so structured into something meaningful and structured. Once we have such structured output, we can perform various tasks such as:

Sentiment analysis
Topic detection
Document summarization
Entity relational modelling
Pattern recognition
Predictive analytics
Text categorization

In this course we will cover some primary concepts in sentiment analysis. The tasks listed above are extremely useful for gaining insights into textual data. We will explore the topics in detail.

Use regular expressions to match string patterns
Basic Linux command line functions
Set up Python and install packages using pip
Basic functions in Python
Basic Python functions working on strings
Twitter API to grab tweets
Handle the JSON format in Python
Clean a tweet's content by removing non-useful characters
Use NLTK to run semantic analysis on sentences
Two projects

Machine Learning with Java

Machine learning is currently one of the hottest topics. Because data science is deeply involved with machine learning algorithms and programming languages, it is important to master at least one programming language in order to work with machine learning algorithms or solve real-life problems. In this course, we introduce how to use Java to build machine learning models for regression, classification, and clustering problems. We also introduce how to evaluate machine learning models and interpret the results. Although many Java packages support machine learning algorithms, in this course we focus only on the most popular one, Weka, a Java package that contains many fancy algorithms and has been widely used in recent years. One nice thing about Weka is that it offers not only a Java library, so that you can write your own code to build a model, but also a well-developed GUI tool, so that even people who are not familiar with Java can build a machine learning model very quickly by clicking a few buttons.

What is Weka, and installing the GUI on your local computer
What is Java, and installing the Java JRE and JDK so that you can run your code in the later sessions
Input file formats in Weka such as ARFF and XRFF
Generating artificial data in the Weka Java package
Filtering data in Weka
Some basic classification methods in Weka
Tree-based classification methods in Weka
Advanced classification methods in Weka
Basic clustering methods in Weka
Learning how to visualize your results in Weka

Machine Learning with R

In this course you are expected to know the basic functionality of R coding. We will cover the machine learning topics, how to implement them, the well-known packages in the R community, how to use those packages, and how to adjust the different package parameters that affect the results. This will be a short course with 10 modules that covers almost all widely used machine learning algorithms.
Don't worry, I won't be adding a lot of theory; instead I will be adding a lot of screenshots and code to give you a much better experience. Whatever task we perform, I expect you to try it hands-on, side by side, on your own system. Don't take this class as a theory lecture; take it as a lab session.

Getting started
Principal Component Analysis
Linear Regression
Logistic Regression
Clustering
Decision Trees
Neural Networks
Support Vector Machines
Text Mining
Time Series Analysis

Web Scraping

As a Data Scientist, one is responsible for crunching humongous amounts of data to extract insights and streamline businesses based on the results. But the role of a Data Scientist doesn't start with understanding and analyzing data. Before we do any analysis, we must have data at hand. The first step to solving any data problem is to identify the problem, followed by collecting relevant data, and cleaning and representing the data in a functional form. Then we can use visualization and other analytical techniques to glean useful insights. It is essential that data scientists can collect data from various sources. Data could be available in a structured form via well-defined REST APIs, as unstructured (raw) data from websites, or as anything in between. The Web Scraping course is all about extracting data of interest from any source.

The course is divided into 5 parts. The first part covers the basics of Python and is optional for students with prior Python experience. However, I recommend taking a quick glance at it unless you use Python on a day-to-day basis. The second part covers the advanced Python coding necessary for web scraping. The third covers extracting structured data using APIs. The fourth introduces basic Python tools and packages for the web, along with the Chrome developer tools. The fifth and final part covers extracting raw data from web pages using the Scrapy package.

Part 1: Fundamentals of Python (Optional)
1. Using IPython notebooks
2. Control flow
3. Functions
4. Data Structures: Lists, tuples, dictionaries
5. Iterables and generators

Part 2: Essentials of Python
1. Object oriented programming using Python
2. Error and Exception handling
3. File Input / Output
4. CSV files
5. JSON files
6. Strings and Regular Expressions

Part 3: Structured Data Extraction
1. REST APIs
2. Twitter API

Part 4: Fundamentals of Web Data and Developer Tools
1. HTML
2. XML
3. Chrome dev tools
4. urllib package
5. BeautifulSoup package

Part 5: Building Spiders using Scrapy
1. Scrapy package

Machine Learning Principles

The goal of this course is to provide students with broad knowledge of Machine Learning. This includes crucial paradigms in the field such as the anatomy of Machine Learning problems, Gradient Descent, Regularization, Cross-Validation, Overfitting, Bias/Variance tradeoffs and more. Other topics covered are practical algorithms used in Machine Learning, such as Supervised Learning Problems using Linear Regression, Decision Trees, SVMs, Naive Bayes, and Logistic Regression; Unsupervised Learning Problems using K-Means and K-Nearest Neighbors; and Semi-supervised learning. Other miscellaneous topics are covered as well, such as Deep Learning, Reinforcement Learning, Ensemble Methods, and more. There are also portions where the content is chosen based on student demand, so that students can learn about what is currently trending and popular in the field of Machine Learning. All of these topics will be learned through readings, quizzes, surveys, discussions, and of course coding assignments using Python and Jupyter Notebook!

Course Introduction and Overview
Introduction to Machine Learning and the Development Environment for This Course
Linear Regression
Overfitting, Underfitting, the Bias/Variance Tradeoff, and Regularization
Decision Trees
Cross Validation
Support Vector Machines
Maximum Likelihood Estimation (MLE), Maximum A Posteriori Estimation (MAP) and Gradient Descent
Naive Bayes
Logistic Regression
Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, Reinforcement Learning, and Deep Learning
K-Means Clustering
Miscellaneous Topics 1
Miscellaneous Topics 2
Wrap-up week
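
As a minimal sketch of two topics from the Machine Learning Principles outline, gradient descent and regularization, the following pure-Python example fits a one-variable linear regression. It is not taken from the course materials: the function name `fit_line`, the toy data, and the default learning rate, penalty strength, and step count are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, lr=0.05, l2=0.0, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error.

    l2 is the strength of an optional ridge (L2) penalty on the slope w.
    All default values are illustrative assumptions, not course settings.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(error^2) + l2 * w^2 with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs)) + 2.0 * l2 * w
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1 (hypothetical example data).
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    print(fit_line(xs, ys))          # unregularized fit
    print(fit_line(xs, ys, l2=1.0))  # ridge-penalized fit
```

With `l2=0.0` the fit recovers the line used to generate the toy data. A positive `l2` pulls the slope `w` toward zero, which is the basic effect of ridge regularization.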