Quantitative Methods I: Laboratory 2

Prepared by Sebastian de Ramon
Revised by Brian Wallace & Hui-Fai Shing
HS 24/01/06

Carefully read the instructions and follow the commands you are asked to carry out. Do not carry on reading until you have successfully completed all previous instructions. Answer all the questions you are asked. To make the best possible use of the laboratory time, write down your answers, or at least a brief summary of what you think the answer is (or save it to a Microsoft Word file). Remember to save your work to your Y: drive (or, less preferably, your A: drive) regularly. You may work in pairs or by yourself.

For basic definitions of Excel's purpose and elements, refer to the first handout, Laboratory 1, Quantitative Methods I. In this worksheet we will examine cross-section data and categorical data.

This week we will work with cross-section data from a census of states in the U.S. The data consist of fifty observations, one for each state, corresponding to totals and averages. In this case there is no temporal (time) component associated with the observations (as opposed to last week, when we dealt with currency price fluctuations over a period of time). The data have been recorded for each state at the same moment in time; time does not change across observations.

Before you start, a word of warning: some of the machines in the lab are unstable and may crash. To ensure you don't lose lots of work, you should save your work periodically throughout the session.

To begin working with Excel, go to the Start menu and then Excel XP. Wait for the program to start. Before doing any calculations, you may need to alter some settings in Excel, as the default settings (as set by the computer centre) do not allow certain operations. Go to the menu Tools, then click Add-Ins (this may cause the machine to freeze temporarily), and make sure that the top two options, Analysis ToolPak and Analysis ToolPak - VBA, are ticked. Then click OK.

Now go to the menu File on the top bar and click Open. Select the R: drive, the folder Economics, the folder QM1 and, from there, click the file LabData2.xls and then click Open.

Study the file you opened. The first row contains the names of the series, and there are four columns of data. Column A contains the name of each state surveyed. Column B contains the name of the region associated with each state. Column C contains the median age of the population in each state. Finally, Column D records the number of deaths per 100 inhabitants.

As you can see, it does not matter which observation comes first, second or last. With a time series, the temporal component defines an order for the observations, and it always makes sense to see how the data move over time. Because these data have no temporal component, it does not make sense to plot one of the columns alone. Such a picture, apart from indicating the mean value and dispersion of the data, could not tell us anything about a general tendency or trend in the data, because there isn't one! What we can do instead is compare two of these cross-section series in order to find common patterns (correlation) across observations. Given all these considerations, the analysis we make will not depend on any particular order of the observations.

To begin, we will compare the two series medage and death per 100 with a graph. First we have to tell the program where the data we want to plot are. Find the first cell of the data; it should be cell C1, which contains the series name medage. Click on it and keep holding the left mouse button while you move to the right to cover cell D1, which contains the series name death per 100. Do not release the mouse button. Keep holding it while you move down to cover all the observations. When you get to the bottom of the Excel screen, keep holding the button so that the spreadsheet scrolls down to show you the rest of the data. The further you go off the window, the faster the screen will move down, so keep the mouse just over the edge of the Excel window until all the data have been shown. Bring the mouse over cell D51 and, only then, release the button.

Now we will again use Excel's graph creation facility. With the data selected as described above, click on the menu Insert (one of the top bar menus) and then select Chart. The program will offer you four dialogue windows with different options for creating a chart. The first dialogue offers you different chart types for plotting the data. Select the type Scatter from the list and the first chart type listed to the right, then click Next. You can make changes to the data range in the second menu. This time we will leave it as it is, so click Next. In the third menu you can add a title for the chart and/or for the axes. Study the menu and click Next. Finally, the last menu allows you to create a new sheet for the plot or to leave it as part of the current sheet.
Click on the small circle to the left of As new sheet, and then click Finish. Your chart will now appear on a separate sheet from your data input sheet. To get back to your data sheet, go to the tabs at the bottom of the screen and click on data. To look at the chart again, click on the tab Chart 1.

On your chart, notice that all the data are bunched up on the right-hand side. We can change the scale on the X-axis as follows. Double click on the X-axis and a menu Format Axis should appear; then click on the Scale tab. Change the Minimum to 20 and the Maximum to 35, then click OK. You should see a new chart with the data more spread out. The horizontal axis registers the values of the variable medage (the median population age), while the vertical axis shows the number of deaths per 100 inhabitants. Through this figure we are able to detect common patterns in the two series. In this case, we see that larger median ages tend to go with larger numbers of deaths per 100 inhabitants.

Q1) Study the graph you made and describe the most significant features of the two series. Do you think there is a strong link between the two series?

Q2) Use the functions taught last week to compute the average (=average(...)), median (=median(...)) and standard deviation (=stdev(...)) for each series. For example, if you select cell C53 and write =average(c2:c51) you will obtain the average of the median ages for all 50 states. Also compute the minimum, maximum, inter-quartile range [remember that the inter-quartile range is the difference between the 3rd quartile (=quartile(..., 3)) and the 1st quartile (=quartile(..., 1))] and range (maximum minus minimum) for each series.
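If you want to cross-check your Q2 answers outside Excel, the same statistics can be sketched in Python. This is only an illustrative sketch: the function name summarise is hypothetical, the sample values are made up (they are NOT taken from LabData2.xls), and it assumes that the "inclusive" quartile method of Python's statistics module interpolates in the same way as Excel's QUARTILE function.

```python
import statistics


def summarise(values):
    """Return the Q2 summary statistics for one series of numbers."""
    # Assumption: method="inclusive" mirrors Excel's QUARTILE interpolation.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation (divides by n - 1), like Excel's STDEV.
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }


# Made-up median ages for illustration only, not the lab data.
sample = [26.0, 28.0, 29.0, 30.0, 32.0]
for name, value in summarise(sample).items():
    print(f"{name}: {value:.3f}")
```

To use it on the lab data you would replace the sample list with the fifty values from cells C2 to C51 (or D2 to D51).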

The statistics you estimated above give you a summary of the data. In particular, they tell you about the central tendency and dispersion of the data. Another summary of the data that can be very informative is what we call a histogram. In a histogram we can see a detailed picture of the distribution of the data: we can determine how many observations, or what proportion of the total, fall in any category we want.

Excel has a built-in routine to help us prepare a histogram. To use it, while you are on the data sheet go to the menu Tools and then Data Analysis. You will see a window with a list of statistical analysis tools; from the list select Histogram and then click OK. In this window you have to indicate which data you want to analyse, what kind of output you want and where you want it to be displayed. Click in the box labelled Input Range and write c2:c51; this range contains the data we want to analyse, the median age of the population. In the area labelled Output Options, click on the little circle next to Output Range, then click in the box to the right of it, and finally type in j1. Starting from cell j1, the program will produce a table with the number of observations in each category. To finish, check the little box next to Chart Output to ask the program to produce a histogram chart. Click OK and wait for the program to process your request.

Now find your new chart, which should be near the data table in cells J1 to K9. Left click on the chart to select it, then right click to bring up some options. From these options click Location, then click the circle next to As new sheet:.

FROM NOW ON, please use a different sheet for each chart, selecting the As new sheet: option and labelling your sheets Chart 1, Chart 2, etc. It makes it easier to switch between charts.

The table in cells J1 to K9 (you may need to move any charts out of the way to see these cells) gives the number of observations in each category.
The categories have been determined automatically by the program and they are recorded in column J under the label Bin; they are: 24.2, 25.7, 27.2, 28.7, 30.2, 31.7, 33.2 and More. The numbers in column K record the number of observations falling within each category. The number one next to 24.2 indicates that only one observation was smaller than or equal to 24.2. Zero observations were found between 24.2 and 25.7, two observations fell between 25.7 and 27.2, and so on. The last category, labelled More, records all the observations falling to the right of 33.2. So in the histogram, the bar above a number corresponds to the number of items less than or equal to that number but greater than the previous one.

Q3) You will find a glossary of terms on the back page of this manual. Study the histogram you made and decide whether this is a skewed to the right histogram, a skewed to the left histogram, a symmetric histogram or a uniform histogram. [See the glossary for the definition of skewness.]

Q4) Compare the histogram with the summary of the data given by the average, median and standard deviation. What is the modal (most frequent) category? What value would you choose as a measure of centrality of the data, and what would be a measure of dispersion?
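The counting rule described above (each bin counts the values that are greater than the previous boundary and less than or equal to its own, with More catching everything beyond the last boundary) can be sketched in Python as follows. This is an illustrative sketch rather than Excel's actual implementation; the function name bin_counts is hypothetical and the sample values are made up, not taken from LabData2.xls.

```python
from bisect import bisect_left


def bin_counts(values, boundaries):
    """Count values per bin; the final count is the 'More' category.

    A value v goes into the first bin whose boundary is >= v, so each bin
    holds values greater than the previous boundary and less than or
    equal to its own boundary.
    """
    boundaries = sorted(boundaries)
    counts = [0] * (len(boundaries) + 1)  # one extra slot for 'More'
    for v in values:
        # bisect_left returns the index of the first boundary >= v,
        # or len(boundaries) when v exceeds every boundary.
        counts[bisect_left(boundaries, v)] += 1
    return counts


# Made-up values for illustration only.
sample = [24.2, 26.0, 27.2, 27.3, 35.0]
bins = [24.2, 25.7, 27.2, 28.7]
labels = [str(b) for b in bins] + ["More"]
for label, count in zip(labels, bin_counts(sample, bins)):
    print(label, count)
```

Note that the value 27.2 is counted in the 27.2 bin, not the next one, which is the "less than or equal to" behaviour described in the text.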

One problem with the previous histogram was Excel's choice of boundaries. Normally we would have chosen integers for the boundaries instead of 24.2 or 25.7. In addition, the program chose too many categories on the left side of the data and too few on the right side. We can take more control of the boundaries and other features of the histogram.

Let us create our own boundaries: double click on cell i14, type in 27 and press Enter. In cell i15 type in 28 and press Enter. Carry on doing this until you type in 33 in cell i20. These will be the new boundaries for the histogram.

Now go to the menu Tools and then Data Analysis. From the list select Histogram and then click OK. Once again, click in the box labelled Input Range and type in c2:c51. This time we will indicate the boundaries we want the program to use: in the box labelled Bin Range, type in i14:i20. We will also make a change in the area labelled Output Options. Click the circle to the left of Output Range and in the box to the right of it type in j13 (to place the table somewhere else). Finally, check the boxes Chart Output and Cumulative Percentage and click OK. Wait for the program to do the analysis you requested. Remember to move your histogram to a separate sheet.

As before, the first and second columns contain the boundaries and the number of observations in each category. The first class (boundary 27) counts how many states have a median age of 27 or less. The second class counts how many observations fall between 27 and 28, and so on, until the last class records the number of observations falling to the right of 33. This time, however, there is a third column of results under the label Cumulative %. This column gives the accumulated number of observations less than or equal to the boundary, as a percentage of the total. The graph now has two vertical axes. On the left-hand side the values correspond, as before, to the total frequencies falling in each class.
The right-hand side plots an Ogive of the accumulated percentage frequencies.

Q5) Compare this new histogram with the previous analysis. What is the modal class now? From the Ogive, infer which value corresponds to the median of the data. How does it compare with the values you found before?

Let us now analyse the data on deaths per 100 people. While on the spreadsheet window, go to the menu Tools and then Data Analysis. Select Histogram and click OK. This time, in the box labelled Input Range type in d2:d51; these are the cells containing deaths per 100. We do not yet know which boundaries are appropriate for this analysis, so leave the box labelled Bin Range empty. In the area labelled Output Options, click the circle to the left of Output Range and in the box to the right of it type in j25. Finally, check the boxes Chart Output and Cumulative Percentage and click OK. When the program finishes the analysis you requested, observe carefully the main features of the two plots.

Q6) Describe the main features of the data. Compare this summary with the mean, median and standard deviation you obtained before. Is this a skewed to the right histogram, a skewed to the left histogram, a symmetric histogram or a uniform histogram?

Q7) As an extra exercise, define new boundaries for the deaths per 100 data (as we did before) and redo the histogram. For example, use boundaries like 0.4, 0.475, 0.55 and so on, so that the top classes in the previous plot are broken into two or more.
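The Cumulative % column and the way the median is read off the Ogive (Q5) can be sketched in Python. This is a minimal sketch with hypothetical function names (cumulative_percent, median_class) and made-up class counts; it is not Excel's own code and the numbers are not the lab results.

```python
def cumulative_percent(counts):
    """Turn class counts into cumulative percentages of the total."""
    total = sum(counts)
    running = 0
    result = []
    for c in counts:
        running += c  # accumulated number of observations so far
        result.append(100.0 * running / total)
    return result


def median_class(counts):
    """Index of the first class whose cumulative percentage reaches 50."""
    for i, pct in enumerate(cumulative_percent(counts)):
        if pct >= 50.0:
            return i
    raise ValueError("counts must contain at least one observation")


# Made-up class counts for illustration only.
counts = [1, 0, 2, 1, 1]
print(cumulative_percent(counts))
print(median_class(counts))
```

The class returned by median_class is the one in which the Ogive crosses the 50% line, which is where you would look for the median on the chart.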

There is some extra information in this data set: the region associated with each state. Because these data are non-numerical, we cannot use a histogram in the way we did before. We cannot define boundaries for this variable, and the order of the regions does not matter. We cannot say that one region is greater than another, whereas with the median age and deaths per 100 the order did matter. Therefore, if we are going to plot these categorical data, the approach has to be quite different.

Let us first compute the number of observations per region. Locate cell F2, double click on it, write NE and press Enter. Move to cell F3, double click on it, write South and press Enter. Now go to cell F4, double click on it, write West and press Enter. Select cell F5, double click on it, write N Cntrl and press Enter.

Now we will compute the number of observations in each category. First double click on cell G2 and write:

    =countif(b2:b51, "=NE")

Cells B2 to B51 contain the data on regions, so the formula =countif(b2:b51, "=NE") counts the number of observations in the range of cells B2:B51 that have the value NE.

Next, select cell G3 and, after double clicking, write the formula:

    =countif(b2:b51, "=South")

When you finish typing the formula, press Enter. As before, this formula computes the number of observations with the value South in the range of cells. To compute the number of observations from the West and N Cntrl regions, type in the formulas =countif(b2:b51, "=West") in cell G4 and =countif(b2:b51, "N Cntrl") in cell G5. The variable Region takes only these four values, so the sum of all four counts should be equal to fifty, the total number of observations.

Now we can prepare a bar chart with these data. Select cells F2 to G5. Click on the menu Insert (one of the top bar menus) and then select Chart. Follow the four dialogue windows to create a chart.
Select the type Column from the list and the first chart type listed to the right, then click Next. Leave this menu unchanged and click Next. In the third menu add a title for the chart: under Chart title type in Observations per Region. Finally, in the last menu click on the small circle to the left of As new sheet: and then click Finish.

Q8) Describe the distribution of regions. For this plot, does it make any sense to talk about symmetry or skewness?

That's all for this week. Next week we will look at the statistical and econometrics package called Stata. Using Stata is very difficult if you haven't seen it before, but it is absolutely necessary for your project (and beyond, if you are taking QM2 and the Economics Dissertation!), so attendance is vital.

You can find more information about this course, including answers to the questions, on the course web site, available at: http://personal.rhul.ac.uk/pmte/165/qm1c/qm1c.html

GLOSSARY

Bar graph: A graph made of bars whose heights represent the frequencies of respective categories.

Class: An interval that includes all the values in a (quantitative) data set that fall within two numbers, the lower and upper limits of the class.

Class boundary: The midpoint of the upper limit of one class and the lower limit of the next class.

Class frequency: The number of values in a data set that belong to a certain class.

Class midpoint or mark: Obtained by dividing the sum of the lower and upper limits (or boundaries) of a class by 2.

Class width or size: The difference between the two boundaries of a class.

Cross-section data: Data collected on different elements at the same point in time or for the same period of time.

Cumulative frequency: The frequency of a class that includes all values in a data set that fall below the upper boundary of that class.

Cumulative frequency distribution: A table that lists the total number of values that fall below the upper boundary of each class.

Cumulative relative frequency: The cumulative frequency of a class divided by the total number of observations.

Cumulative percentage: The cumulative relative frequency multiplied by 100.

Frequency distribution: A table that lists all the categories or classes and the number of values that belong to each of these categories or classes.

Grouped data: A data set presented in the form of a frequency distribution.

Histogram: A graph in which classes are marked on the horizontal axis and either frequencies, relative frequencies, or percentages are marked on the vertical axis. The frequencies, relative frequencies, or percentages of various classes are represented by bars that are drawn adjacent to each other.

Ogive: A curve drawn for a cumulative frequency distribution.

Percentage: The percentage for a class or category is obtained by multiplying the relative frequency of that class or category by 100.

Pie chart: A circle divided into portions that represent the relative frequencies or percentages of different categories or classes.

Polygon: A graph formed by joining the midpoints of the tops of successive bars in a histogram by straight lines.

Raw data: Data recorded in the sequence in which they are collected and before they are processed.

Relative frequency: The frequency of a class or category divided by the sum of all frequencies.

Skewed to the left histogram: A histogram with a longer tail on the left side.

Skewed to the right histogram: A histogram with a longer tail on the right side.

Stem-and-leaf display: A display of data in which each value is divided into two portions, a stem and a leaf.

Symmetric histogram: A histogram that is identical on both sides of its central point.

Time-series data: Data that give the values for the same variable for the same element at different points in time or for different periods of time.

Uniform or rectangular histogram: A histogram with the same frequency for all classes.