Homework 1: Regular expressions (due Sept 24 at midnight)

1. Read chapters 1 and 2 from JM.

2. From the book JM: 2.1, 2.4, 2.8

3. Exploratory data analysis is a common thing to do with numbers: histograms, box plots, etc. But much of this isn't that interesting when using words. So this problem asks you to explore what a regular expression actually does by simply running it against a bunch of text. Consider the following regular expression:

   (?:[a-z0-9!#$%&'*+/=?^_`{ }~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{ }~-]+)* "(?:[\x01-\x08\

   I posted a link to the above regex on the index.html page for the class (at the due date in the schedule). See if that is easier to work with when cut and paste from the pdf doesn't work. (You can also grab it out of the Rnw file.) Darian suggests that perl = TRUE is helpful for getting it to run. (Note: if you have trouble getting it to run in R, try the following, which is a much watered down version:

   > reg1 <- "[a-z0-9]+(\\.[a-z0-9]+)*(@[a-z0-9\\.]+)?"

   )

   Now you could just read it and understand it. But that would be cheating: this is a statistics course! So test it against a bunch of strings and see if you can figure out which strings are legal and which aren't. Try it on, say, a large corpus, for example http://www.cs.cmu.edu/~enron/. When you look at what it matches, make a guess as to what the pattern is supposed to do. Can you test this guess more accurately?

Homework 2: N-grams

1. Pick whether you want to do a paper replication or 2 Language Log like posts for your final project. If you want to do a paper replication, let me know what area you might want to do it in.

2. Read chapters 3-6 of JM.

3. Read the spectral paper on HMMs.
4. JM: 3.6

5. JM: 4.4, 4.10.

6. We will analyze the text called Alice in Wonderland. First we want to grab it down from the Gutenberg project. They have collected over 30,000 books that you can read or play with. So surf for the Gutenberg project and find an ascii version of Alice in Wonderland that you can download. If that doesn't work, you can just click on http://www.gutenberg.org/files/11/11.txt, but that would be cheating.

   (a) After you download it, you can read it into R with the scan command. Or you can read it directly via the command:

       > alice <- scan("http://www.gutenberg.org/files/11/11.txt", what = "character",
       +     quote = "", skip = 25)

       This command reads the whole file in as a vector of words. If you try it without the quote = "" it will read all the quoted material as single words. Probably not what we want. The file starts with a blurb about this file not being copyrighted, so we should skip the first 25 lines or so.

   (b) First we will look at the frequency of the words themselves.

       i. Using the table command, get the counts of the various words. Now sort them by frequency. What are the 10 most common words? Are they significantly different from the 10 most common words in the Federalist Papers? Does this seem reasonable?

       ii. Now we will make the classic Zipf plot. We want a plot of the log of the frequency of the word (or just the log count) vs. the log of the index of the word.

       iii. Add a regression line to this plot. Yikes! It seems to miss most of the data. We can eliminate the first 10 words, since they don't fit the line all that well, and the last bunch. So fit a line which uses something like the 10th through the 1000th observation.

       iv. Is the slope you compute similar to the one computed for the Federalist Papers? How about the Wikipedia Zipf slope? (You will have to read off the slope by hand.) Is there a story here?
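The counting and plotting steps in part (b) can be sketched in R as follows. This is a sketch, not the intended solution: to keep the example self-contained, a synthetic word vector stands in for the alice vector read in above, and the 10th-through-1000th fitting window is clipped to the number of distinct words.

```r
# Stand-in for the alice word vector (Zipf-ish: word i has weight 1/i).
set.seed(1)
words <- sample(letters, 5000, replace = TRUE, prob = 1 / seq_along(letters))

counts <- sort(table(words), decreasing = TRUE)   # frequency of each word
head(counts, 10)                                  # the 10 most common words

rank <- seq_along(counts)
plot(log(rank), log(as.numeric(counts)))          # the classic Zipf plot

keep <- 10:min(1000, length(counts))              # drop the first 10 words and the tail
fit <- lm(log(as.numeric(counts))[keep] ~ log(rank)[keep])
abline(fit)                                       # regression line on the plot
coef(fit)[2]                                      # the Zipf slope
```

With the real text, replace the stand-in `words` vector with `alice` and the slope should come out near the usual Zipf value.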
   (c) I mentioned in class that prediction and data compression are the same thing. So this part will have you consider three different compression schemes: by hand, by ZIP, and by google 2-grams. Basically you will need to look at a sequence of words: "I'm late! I'm late! For a very important ..." and fill in the missing word. You will first do it by hand, then by compression, and finally by google.

       i. First pick a location at random in the text. (Determine how many words there are and then generate a random index from 1 up to the last possible word.) Now print out the previous 20 words or so. Write down several possible next words. What probability do you give to each of these words? Now look at which word actually occurred. What probability did you give to this word? Here are the R commands to do an example. Choose the index:

          > index <- round(runif(1) * 24384)
          > index

          Then look at what the preceding words are:

          > cat(alice[(index - 20):(index - 11)], "\n", alice[(index - 10):(index -
          +     1)])

          Now guess the next word. The correct answer is:

          > alice[index]

          In the example I started with: "I'm late! I'm late! For a very important". (For the purists: this phrase doesn't actually occur in the original text, only in the Disney version.) One might guess the words: event, date, meeting, activity. Now you give probabilities to each of these, say P(event) = .1, P(date) = .4, P(meeting) = .2, P(activity) = .1. Note these probabilities don't add up to one, since I should also have probabilities for other words that I haven't bothered to write down. Now look and see what the correct word is. For the example I'm using, the correct word is "date", which I assigned a probability of .4. Repeat this with 10 different words. How often was one of the words you guessed the correct word?

       ii. Compute the average of the log probabilities for your guesses. Assume that any time you missed the word altogether, you
should have in fact used a longer list which eventually would have included the correct word. So give yourself a probability of, say, 1/24384 for that word. The negative of the average of your log probabilities is called the entropy. Entropy is usually measured in bits, which means base 2. So use log base 2 for this step. [3]

       iii. Compare your entropy to the LZ compression scheme. You can do this by noting that the ZIP version is 59k bytes at Gutenberg. What does your total entropy look like? (Use the total number of words times the average entropy per word as an estimate.)

       iv. Now we want to make a prediction of the next word based on the google n-gram data set at http://gosset.wharton.upenn.edu/~foster/teaching/471/google. I have made up an easier set of data to work with so you don't have to process those gigabytes of compressed data. [4] First look up the previous word in the google 1-gram file. For my example, it is "important", which occurs 119695314 times. Now look up the actual two-gram word pair that occurred. For my example, it is "important date", which occurred 35885 times. So the probability is 35885/119695314, or about 1/3000. How does google do on forecasting the probability of the 10 words you came up with? How would you estimate the entropy google would achieve for the entire file?

       v. (Bonus) Write an R script that will compute the -log(probability) of each word based on the google 2-gram data set. What is the final entropy? Does it do better than LZ compression?

7. (Chapter 4) Find an approximation to the perplexity based on the entropy. (See page JM:96 for the PP measure.)

8. (n-grams) Estimate how many words a day you hear or read. From this, estimate how many words you will process in your lifetime. How many lifetimes worth of data are in the google n-gram database?

[3] You can assign the base in R, or you can use the formula log2(x) = log(x)/log(2).

[4] If you want to read this into R, you will find that the files google/easy one and google/easy two will read in with less trouble.
Or you can use Sivan's magic of:

    > one.gram <- read.delim("easy_google", nrows = 3160, header = FALSE)

or, for two grams:

    > two.gram <- read.delim("easy_google", skip = 3160, header = FALSE)
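The google lookups in parts iv and v can be sketched as follows. This is only a sketch: the single-row data frames below are made-up stand-ins for the 1-gram and 2-gram tables (only the counts for "important" and "important date" come from the text above, and the column layout of the real easy_google file may well differ from this).

```r
# Stand-ins for the google 1-gram and 2-gram tables.
one.gram <- data.frame(word  = "important",
                       count = 119695314)
two.gram <- data.frame(w1    = "important",
                       w2    = "date",
                       count = 35885)

# P(w2 | w1) = count(w1 w2) / count(w1)
bigram.prob <- function(w1, w2) {
  num <- two.gram$count[two.gram$w1 == w1 & two.gram$w2 == w2]
  den <- one.gram$count[one.gram$word == w1]
  if (length(num) == 0 || length(den) == 0) return(NA)  # unseen pair: needs smoothing
  num / den
}

p <- bigram.prob("important", "date")
p            # about 1/3000, as in the example above
-log2(p)     # bits google "pays" for this word (about 11.7)
```

For the bonus part, looping this over every adjacent word pair in alice and averaging the -log2 probabilities gives the google entropy estimate; note you will need some fallback (smoothing) for pairs that never appear in the 2-gram table.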
Homework 3: Speech recognition

Page 247: 7.2. Listen to a few people from the accent archive. Pick one word (e.g. "snake") and listen to how different people pronounce it. See if you can figure out how a new person will pronounce it before you hit play. Try it for 3 new people and tell me if you feel you can get them right.

Homework 4: Speaking or not?

(I'm still writing this, but if you want to get started early, here are the basic instructions. I've also updated the final project.)

This homework will have you run some big regressions. Each row of the data table consists of a recording. It has been processed so you will not have to deal with .wav files and such. The puzzle is to figure out whether someone is speaking or whether it is just background noise.

Introduction to the data: Start out by reading Neville Ryant's description of the data. (Note: he generated this dataset for us to play with. So if you run into him, thank him!) Then open up the data in R:

    Neville's readme.pdf file.
    Neville's gzipped text datafile.
    Neville's R binary file. (smaller and faster)

As an exploratory data analysis, see how well you can predict whether there is a speaker in the overall data set. You can use whatever method you like best. (I'll assume you are using stepwise regression, since that is the easiest.) This then is a single model which predicts everyone.

1. What is your RMSE for identifying whether someone is speaking?

If these 19 domains were all that existed, this one regression might be a fine thing to do. But in fact, we want to predict on new domains, not on the ones we have already seen. So run the script you wrote to generate the fit on the entire dataset, but now on each of the 19 domains separately.

1. Make a histogram/boxplot of the 19 RMSEs. Which domain is the easiest? Which is the hardest? Is RMSE a fair comparison across
these domains? (Consider the base rate of each domain.) What is the average RMSE? Why is it lower than the RMSE of your original model?

The key issue: Now we want to use a model generated for one dataset to predict a different dataset. So

A possible cure:

Final Project

Here are the deadlines for the paper replication.

(Nov 12) What paper do you want to replicate? Give me a link to the paper, and a link to the data which you plan to use in replicating it.

(Nov 19) Freshen the references of this paper. What has been done that is more current? Write a one-sentence summary of each paper which is more current than those cited in your replication paper.

(Dec 10) If you want me to read a first draft, then turn one in on this date. If you are feeling brave, wait until the 17th.

(Dec 17) Final draft due.

If you are doing original research, please follow the following schedule:

(Nov 12, 2010) One page description of what you propose to do.

    Provide a thesis sentence. This is a single statement of what it is you would like to show. It might be "I will replicate the analysis in such-and-such a paper." Or it might be "I will investigate the CCA connection between abstracts and references of papers taken from NIPS 2008."

    Provide a paragraph of background material.

    Provide a paragraph of what you will be doing.

    Provide a link as to where you will find data, if you will need it.

(Nov 19) References section. What papers are related to your project?
Identify the two or three papers that are closest to your work, and also give the 10 or so other papers that you ran into along the way. In either case, give a one sentence statement of what you got from each paper. Write this with yourself as the target audience, not me. So if I find these statements confusing, that is fine. As long as you don't!

(Nov 26) If you are doing data, show some parsed examples of your text that you can read in.

(Dec 3) If you are doing data, give your statistical analysis of your data. We should sit down together and go through your data analysis.

(Dec 10) If you want me to read a first draft, then turn one in on this date. If you are feeling brave, wait until the 17th.

(Dec 17) Final draft due.