Some Tips on Project Proposal. April 15, 2010

Course Project
1. Start with an interesting task and find real-world data
2. Research appropriate data mining / machine learning algorithms
3. Implement several different algorithms
4. Evaluate the performance of the algorithms on data (if unsuccessful, return to step 3, 2, or 1)
5. Write up results
Please be proactive! Please be creative!

Components of Proposal
- Introduce task(s)
- Describe data set(s)
- Propose potential algorithm(s)
- Discuss potential evaluation strategies
- Review related works
- Propose plan for completion

Introduce task(s)
- What real-world problem are you tackling?
- Why is solving this problem important?
  - In the proposal you need to provide a motivation for performing this project.
- Specifically, what machine learning task(s) are you proposing?
  - Classification
  - Regression
  - Clustering

Describe data set(s)
- Every project needs to be centered around data set(s)
- If you have the data set(s) already, please describe the details:
  - How many training examples? (the bigger the data set, the better)
  - How many features? What type of features?
  - Is there missing data?
  - What is unique about this data?
  - You can perform a quick exploratory data analysis (EDA)
- If you don't have a data set yet, you must provide a concrete plan for obtaining the data in the proposal
  - e.g. if you wish to scrape data from the web, mention the URLs
- By the first progress report (Week 5) you must have a data set.
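The questions on this slide (number of examples, number and type of features, missing data) can be answered with a few lines of code. Below is a minimal sketch in Python using only the standard library. It assumes the data is a CSV file with a header row and that an empty cell, "NA", or "?" marks a missing value; the function name `summarize_csv`, the missing-value markers, and the sample data are illustrative assumptions, not part of the course material.

```python
import csv
import io

MISSING = {"", "NA", "?"}  # assumed missing-value markers; adjust for your data


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_csv(handle):
    """Count rows and, per column, report missing cells and a guessed type."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    numeric = {name: True for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            value = (row[name] or "").strip()
            if value in MISSING:
                missing[name] += 1
            elif not is_number(value):
                numeric[name] = False  # one non-numeric cell makes it categorical
    return {
        "n_examples": n_rows,
        "n_columns": len(columns),
        "missing": missing,
        "types": {n: "numeric" if numeric[n] else "categorical" for n in columns},
    }


# Tiny made-up data set, for illustration only.
sample = io.StringIO("age,city,income\n34,Irvine,52000\n,Davis,NA\n51,Irvine,61000\n")
summary = summarize_csv(sample)
print(summary)
```

For a real file, replace the `io.StringIO` object with `open("your_data.csv", newline="")`.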

Propose Potential Algorithms
- We haven't covered many data mining / machine learning algorithms yet, so please just do your best on this section.
- This is where reading various related papers will help
- Propose more than one algorithm:
  - Baseline algorithms: kNN, logistic regression, linear regression
  - More complicated algorithms: SVMs, neural networks, etc.
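To show how little code a baseline can take, here is a minimal sketch of a k-nearest-neighbors (kNN) classifier in plain Python. The Euclidean distance, the default `k=3`, the function names, and the toy data are all assumptions made for illustration; a real project would more likely use a third-party tool.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy classes, for illustration only.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, [1.1, 1.0]))
print(knn_predict(train_X, train_y, [5.1, 5.0]))
```

A baseline like this gives a reference score that the more complicated algorithms have to beat.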

A quick listing of algorithms
- Classification: Nearest neighbors, logistic regression (binary), multinomial logit (extension of logistic regression), neural networks, naïve Bayes, support vector machines (SVMs), decision trees, etc.
- Regression: Linear regression, generalized linear regression, regression trees, etc.
- Clustering: K-means, mixture of Gaussians, agglomerative/divisive clustering, etc.
- Dimensionality reduction: Principal component analysis (PCA), singular value decomposition (SVD), topic models (for text), etc.
- Some algorithms only work with certain types of data (e.g. interval data)
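As one concrete instance from the clustering row, the following is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The fixed number of iterations, the choice of initial centers, and the toy points are assumptions made for illustration; practical implementations usually add random restarts and a convergence check.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


# Toy points forming two obvious groups, for illustration only.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
centers, assignment = kmeans(points, centers=[points[0], points[3]])
print(centers)
print(assignment)
```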

A quick listing of algorithmic techniques
- Ensemble methods:
  - Bagging (take average of many different classifiers/regressors)
  - Boosting (adaptively weight each data case, e.g. AdaBoost)
  - Random forests (combining multiple decision trees)
- Regularization (for regression):
  - L2 regularization ("ridge regression", "Tikhonov regularization")
  - L1 regularization ("LASSO"); this is the harder case
- Manipulating features:
  - Centering, standardizing, converting categorical to binary
  - Feature selection techniques, adding nonlinear features (e.g. x1 * x2)
- Optimization:
  - Gradient descent
  - Newton's method
  - Conjugate gradient
  - Grid search
  - Stochastic search
  - Beam search
- There are many papers, tutorials, slides, and Wikipedia pages available on these topics
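Three of the techniques above (standardizing a feature, L2 regularization, and gradient descent) fit together in a few lines. The sketch below fits a one-feature linear model by gradient descent on mean squared error plus an L2 penalty on the weight. The learning rate, step count, penalty values, function names, and toy data are assumptions chosen for illustration, and the intercept is left unpenalized, which is a common convention rather than something stated on the slide.

```python
def standardize(column):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = var ** 0.5 or 1.0  # guard against constant columns
    return [(v - mean) / std for v in column]


def ridge_gradient_descent(xs, ys, lam=0.1, lr=0.1, steps=500):
    """Fit y ~ w*x + b by gradient descent on MSE + lam * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n + 2 * lam * w
        grad_b = 2 * sum(errors) / n  # the intercept is not penalized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data, for illustration only.
xs = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w_plain, b_plain = ridge_gradient_descent(xs, ys, lam=0.0)
w_ridge, b_ridge = ridge_gradient_descent(xs, ys, lam=1.0)
print(w_plain, b_plain)
print(w_ridge, b_ridge)
```

Comparing the two fits shows the effect of the penalty: the regularized weight is pulled toward zero, while the intercept stays at the mean of the targets because the standardized feature has mean zero.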

Proposed evaluation
- For classification/regression:
  - Learn model on training set, and calculate accuracy/error on test set
    - Can be a function of many different quantities (e.g. amount of training data, number of features, complexity of model, characteristics of data set)
    - Checks to see if you are overfitting/underfitting
  - Cross-validation
  - More specialized metrics for various tasks, e.g. root mean squared error (RMSE), expected reciprocal rank (ERR)
- For clustering:
  - Visualize clusters and see if they look correct
  - If probabilistic model, evaluate likelihood on test data
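The train/test idea, cross-validation, and RMSE can be combined as in the following minimal sketch. It assumes a one-feature regression problem fit by closed-form least squares, and it forms folds by taking every k-th example, so real data should be shuffled first. The function names and the noise-free toy data are illustrative assumptions.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between predictions and true targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))


def fit_line(xs, ys):
    """Closed-form least squares for y ~ w*x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return w, my - w * mx


def k_fold_rmse(xs, ys, k=5):
    """Average held-out RMSE over k folds (fold i takes every k-th example)."""
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        w, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [w * xs[i] + b for i in test_idx]
        scores.append(rmse(preds, [ys[i] for i in test_idx]))
    return sum(scores) / k


# Noise-free toy data on the line y = 3x + 1, for illustration only.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
cv_error = k_fold_rmse(xs, ys, k=5)
print(cv_error)
```

Each example is used for testing exactly once and is never in the training set of the fold that tests it, which is what makes the averaged error an estimate of performance on unseen data.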

Related work
- Read 3 papers related to your project
  - Can be related to the domain (e.g. sports statistics + machine learning)
  - Can be related to the algorithms that you potentially would use (e.g. ranking algorithms for the Yahoo challenge)
  - Can be tutorial or survey papers
- How to find these papers?
  - Google Scholar, ACM Digital Library, CiteSeer, etc.
  - If you find 1 good paper, do a breadth-first search on the references
  - Skim papers (read the abstract); if a paper is not related, move to the next one
- You need to summarize these papers (e.g. what were their results and how can each paper be useful to your project) and cite them in your bibliography

Plan for completion
- Suggest a rough timeline for completing the project:
  - What do you plan to accomplish each week?
  - Will you have weekly team meetings?
  - How do you plan to divide work among the team?
  - Will you use third-party tools or code the algorithms yourself?
  - What bottlenecks do you foresee?
- You are allowed to use third-party tools such as SVM-light, Weka, etc.

Format of Proposal
- Use NIPS conference format: http://nips.cc/paperinformation/stylefiles
- Should be 4 pages
- You can use the Word (.rtf) template
- 5% extra credit for using the LaTeX system
  - Very easy to learn: http://heather.cs.ucdavis.edu/~matloff/latex.html
  - Requires unix/linux environment to compile (for Windows, use Cygwin)

Quick Presentation!
- On Tuesday, April 20, one member of your team will be required to give a 2-minute lightning presentation of your project proposal
- Please prepare one PowerPoint slide (.ppt) to augment your presentation
- Please be ready with a USB key