Homework 1: Regular expressions (due Sept 24 at midnight)


1. Read chapters 1 and 2 from JM.

2. From the book, JM: 2.1, 2.4, 2.8.

3. Exploratory data analysis is a common thing to do with numbers: histograms, box plots, etc. But much of this isn't very interesting when working with words. So this problem asks you to explore what a regular expression actually does by simply running it against a bunch of text. Consider the following regular expression:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\ ...

I posted a link to the above regex on the index.html page for the class (at the due date in the schedule). That may be easier to work with if cut-and-paste from the pdf doesn't work. (You can also grab it out of the Rnw file.) Darian suggests that perl = TRUE is helpful for getting it to run. (Note: if you have trouble getting that to run in R, try the following, which is a much watered-down version:

> reg1 <- "[a-z0-9]+(\\.[a-z0-9]+)*(@[a-z0-9\\.]+)?"

) Now, you could just read the regex and understand it. But that would be cheating: this is a statistics course! So instead, test it against a bunch of strings and see if you can figure out which strings are legal and which aren't. Try it on a large corpus, for example http://www.cs.cmu.edu/~enron/. When you look at what it matches, make a guess as to what the pattern is supposed to do. Can you test this guess more accurately?
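To see the testing idea in action, here is a minimal sketch using the watered-down regex above. The test strings and the file name are made up for illustration; grepl and regmatches are just one way (of several) to inspect matches in R:

> reg1 <- "[a-z0-9]+(\\.[a-z0-9]+)*(@[a-z0-9\\.]+)?"
> tests <- c("alice@example.com", "not an address", "a.b.c@mail.co", "@@@")
> grepl(reg1, tests, perl = TRUE)        # which strings contain a match at all?
> regmatches(tests, regexpr(reg1, tests, perl = TRUE))  # what exactly matched?
> # For a corpus: read lines from a (hypothetical) file and tally the matches.
> lines <- readLines("enron_sample.txt")
> hits <- unlist(regmatches(lines, gregexpr(reg1, lines, perl = TRUE)))
> head(sort(table(hits), decreasing = TRUE), 20)

Looking at the most frequent matches is usually a faster route to a good guess than staring at the pattern itself.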

Homework 2: N-grams

1. Pick whether you want to do a paper replication or two Language Log-like posts for your final project. If you want to do a paper replication, let me know what area you might want to do it in.

2. Read chapters 3-6 of JM.

3. Read the spectral paper on HMMs.

4. JM: 3.6.

5. JM: 4.4, 4.10.

6. We will analyse the text Alice in Wonderland. First we want to grab it from the Gutenberg project. They have collected over 30,000 books that you can read or play with. So surf for the Gutenberg project and find an ascii version of Alice in Wonderland that you can download. If that doesn't work, you can just click on http://www.gutenberg.org/files/11/11.txt, but that would be cheating.

(a) After you download it, you can read it into R with the scan command. Or you can read it directly via the command:

> alice <- scan("http://www.gutenberg.org/files/11/11.txt", what = "character",
+     quote = "", skip = 25)

This command reads the whole file in as a vector of words. If you try it without the quote = "", it will read all the quoted material as single words, which is probably not what we want. The file starts with a blurb about this file not being copyrighted, so we should skip the first 25 lines or so.

(b) First we will look at the frequencies of the words themselves. (Minimal sketches for parts i-iii follow part iv below.)

i. Using the table command, get the counts of the various words, then sort them by frequency. What are the 10 most common words? Are they significantly different from the 10 most common words in the Federalist Papers? Does this seem reasonable?

ii. Now we will make the classic Zipf plot: a plot of the log of the frequency of the word (or just the log count) vs. the log of the index of the word.

iii. Add a regression line to this plot. Yikes! It seems to miss most of the data. We can eliminate the first 10 words, since they don't fit the line all that well, and also the last bunch. So fit a line which uses something like the 10th through the 1000th observation.

iv. Is the slope you compute similar to the one computed for the Federalist Papers? How about the Wikipedia Zipf slope? (You will have to read off the slope by hand.) Is there a story here?
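For part (b)i, a minimal sketch, assuming the alice vector from part (a):

> counts <- sort(table(alice), decreasing = TRUE)   # word frequencies, biggest first
> head(counts, 10)                                  # the 10 most common words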

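And for parts (b)ii-iii, one way to draw the Zipf plot and fit the trimmed line, continuing from the counts vector in the previous sketch (the 10:1000 range is the one suggested above):

> x <- log(seq_along(counts))        # log rank
> y <- log(as.numeric(counts))       # log count
> plot(x, y, xlab = "log rank", ylab = "log count")
> keep <- 10:1000                    # drop the first 10 words and the long tail
> fit <- lm(y[keep] ~ x[keep])
> abline(fit)                        # the regression line over the middle
> coef(fit)                          # the slope to compare in part iv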
(c) I mentioned in class that prediction and data compression are the same thing. So this part will have you consider three different compression schemes: by hand, by ZIP, and by google 2-grams. Basically you will need to look at a sequence of words, say "I'm late! I'm late! For a very important", and fill in the missing word. You will do it first by hand, then by compression, and finally by google.

i. First pick a location at random in the text. (To do this, determine how many words there are and then generate a random index from 1 up to the last possible word.) Now print out the previous 20 words or so. Write down several possible next words. What probability do you give to each of these words? Now look at which word actually occurred. What probability did you give to this word? Here are the R commands to do an example. Choose the index:

> index <- round(runif(1) * 24384)
> index

Then look at the preceding words:

> cat(alice[(index - 20):(index - 11)], "\n", alice[(index - 10):(index - 1)])

Now guess the next word. The correct answer is:

> alice[index]

In the example I started with: "I'm late! I'm late! For a very important". (For the purists: this phrase doesn't actually occur in the original text, only in the Disney version.) One might guess the words: event, date, meeting, activity. Now you give probabilities to each of these, say P(event) = .1, P(date) = .4, P(meeting) = .2, P(activity) = .1. Note these probabilities don't add up to one, since I should also have probabilities for other words that I haven't bothered to write down. Now look and see what the correct word is. For the example I'm using, the correct word is "date", which I assigned a probability of .4. Repeat this with 10 different words. How often was one of the words you guessed the correct word?

ii. Compute the average of the log probabilities for your guesses. Assume that any time you missed the word altogether, you should in fact have used a longer list which eventually would have included the correct word; so give yourself a probability of, say, 1/24384 for that word. The negative of the average of your log probabilities is called the entropy. Entropy is usually measured in bits, which means base 2, so use log base 2 for this step. (You can set the base in R, or use the formula log2(x) = log(x)/log(2).) A sketch of this calculation follows problem 8 below.

iii. Compare your entropy to the LZ compression scheme. You can do this by noting that the ZIP version at Gutenberg is 59k bytes. What does your total entropy look like? (Use the total number of words times the average entropy per word as an estimate.)

iv. Now we want to make a prediction of the next word based on the google n-gram data set (http://gosset.wharton.upenn.edu/~foster/teaching/471/google). I have made up an easier set of data to work with, so you don't have to process those gigabytes of compressed data. (If you want to read this into R, you will find the files google/easy one and google/easy two will read in with less trouble. Or you can use Sivan's magic of:

> one.gram <- read.delim("easy_google", nrows = 3160, header = FALSE)

or, for two-grams,

> two.gram <- read.delim("easy_google", skip = 3160, header = FALSE)

) First look up the previous word in the google 1-gram file. For my example, it is "important", which occurs 119695314 times. Now look up the actual two-gram word pair that occurred. For my example, it is "important date", which occurred 35885 times. So the probability is 35885/119695314, or about 1/3000. How does google do on forecasting the probability of the 10 words you came up with? How would you estimate the entropy google would get on the entire file? (A second sketch below illustrates the lookup.)

v. (Bonus) Write an R script that will compute the -log(probability) of each word based on the google 2-gram data set. What is the final entropy? Does it do better than LZ compression?

7. (Chapter 4) Find an approximation to the perplexity based on the entropy. (See page JM:96 for the PP measure.)

8. (n-grams) Estimate how many words a day you hear or read. From this, estimate how many words you will process in your lifetime. How many lifetimes worth of data are in the google n-gram database?
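For part (c)ii, a minimal sketch of the entropy calculation. The probability values below are made up for illustration; 1/24384 is the fallback from the problem statement for words you missed entirely:

> p <- c(.4, 1/24384, .2, .1, 1/24384, .3, 1/24384, .05, .2, .1)  # your 10 guesses
> mean(-log2(p))          # entropy, in bits per word
> 24384 * mean(-log2(p))  # rough total bits for the whole text, for part (c)iii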

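For part (c)iv, a sketch of the table lookup. I'm assuming each row of the 1-gram table is (word, count) and each row of the 2-gram table is (word1, word2, count), so the default read.delim column names V1, V2, V3 apply; check the actual layout of easy_google before relying on this:

> prev <- "important"; nxt <- "date"
> c1 <- one.gram$V2[one.gram$V1 == prev]                       # count of "important"
> c2 <- two.gram$V3[two.gram$V1 == prev & two.gram$V2 == nxt]  # count of "important date"
> c2 / c1    # about 35885/119695314, i.e. roughly 1/3000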
Homework 3: Speech recognition

Page 247: 7.2. Listen to a few people from the accent archive. Pick one word (e.g., snake) and listen to how different people pronounce it. See if you can figure out how a new person will pronounce it before you hit play. Try it for 3 new people and tell me if you feel you can get them right.

Homework 4: Speaking or not?

(I'm still writing this, but if you want to get started early, here are the basic instructions. I've also updated the final project.)

This homework will have you run some big regressions. Each row of the data table consists of a recording. It has been processed, so you will not have to deal with .wav files and such. The puzzle is to figure out whether someone is speaking or whether it is just background noise.

Introduction to the data: Start out by reading Neville Ryant's description of the data. (Note: he generated this dataset for us to play with, so if you run into him, thank him!) Then open up the data in R:

Neville's readme.pdf file.
Neville's gzipped text datafile.
Neville's R binary file. (smaller and faster)

As an exploratory data analysis, see how well you can predict whether there is a speaker in the overall data set. You can use whatever method you like best. (I'll assume you are using stepwise regression, since that is the easiest.) This then is a single model which predicts everyone.

1. What is your RMSE for identifying whether someone is speaking?

If these 19 domains were all that existed, this one regression might be a fine thing to do. But in fact, we want to predict on new domains, not on the ones we have already seen. So take the script you wrote to generate the fit on the entire dataset and run it on each of the 19 domains separately. (A minimal sketch follows.)
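Here is one way the overall fit and the per-domain refits might look. The names d, speaking, and domain are placeholders of mine, not from the assignment; match them to whatever Neville's readme actually calls the data frame, the response, and the domain label:

> fit.all <- step(lm(speaking ~ ., data = d), trace = 0)   # stepwise fit on everyone
> rmse.all <- sqrt(mean(residuals(fit.all)^2))
> rmse <- sapply(split(d, d$domain), function(dd) {
+     f <- step(lm(speaking ~ ., data = dd), trace = 0)    # refit within one domain
+     sqrt(mean(residuals(f)^2))
+ })
> boxplot(rmse)    # the 19 per-domain RMSEs, for the histogram/boxplot question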

1. Make a histogram/boxplot of the 19 RMSEs. Which domain is the easiest? Which is the hardest? Is RMSE a fair comparison across these domains? (Consider the base rate of each domain.) What is the average RMSE? Why is it lower than the RMSE of your original model?

The key issue: now we want to use a model generated on one dataset to predict a different dataset. So ...

A possible cure: ...

Final Project

Here are the deadlines for the paper replication.

(Nov 12) What paper do you want to replicate? Give me a link to the paper, and a link to the data you plan to use in replicating it.

(Nov 19) Freshen the references of this paper. What has been done that is more current? Write a one-sentence summary of each paper that is more current than those cited in your replication paper.

(Dec 10) If you want me to read a first draft, turn one in on this date. If you are feeling brave, wait until the 17th.

(Dec 17) Final draft due.

If you are doing original research, please follow this schedule instead:

(Nov 12, 2010) One-page description of what you propose to do. Provide a thesis sentence. This is a single statement of what it is you would like to show. It might be "I will replicate the analysis in such-and-such a paper." Or it might be "I will investigate the CCA connection between abstracts and references of papers taken from NIPS 2008." Provide a paragraph of background material. Provide a paragraph of what you will be doing. Provide a link as to where you will find data, if you will need it.

(Nov 19) References section. What papers are related to your project?

Identify the two or three papers that are closest to your work and also give the 10 or so other papers that you ran into along the way. In either case, give a one sentence statement of what you got from each paper. Write this with yourself as the target audience not me. So if I find these statements confusing that is fine. As long as you don t! (Nov 26) If you are doing data, show some parsed examples of your text that you can read in. (Dec 3) If you are doing data, give your statistical analysis of your data. We should sit down together and go through your data analysis. (Dec 10) If you want me to read a first draft, then turn one in on this date. If you are feeling brave, wait until the 17th. (Dec 17) Final draft due. 7