Bootstrapping: described and illustrated. Comparing the standard mean, 10% and 20% trimmed means, and median


Samson Powers
6 years ago

Before discussing the main topic, let us quickly review sampling distributions so that everyone is clear on the major theoretical background. Because the concept of a sampling distribution of a statistic (especially a mean) is so fundamental to bootstrapping (what it's about, why it works as it does), I want to review the following.

The sampling distribution of the mean has three principal characteristics you should remember:

(1) For any sample size n, the mean of all (!) sample means [drawn from a finite population] is necessarily equal to the population mean (such a statistic is said to be unbiased);
(2) The variance of the distribution of all means equals the population variance divided by n (when sampling is done with replacement, or the population is effectively infinite); and (perhaps surprisingly),
(3) As sample size, n, grows larger, the shape of the sampling distribution of the mean tends toward that of a normal curve, regardless of the shape or form of the parent population.

Thus, it is properly said that the distribution of the sample mean necessarily has mean $\mu$ and variance $\sigma^2/n$, where these Greek letters stand for the population mean and variance respectively. Moreover, such a sampling distribution approaches normal form as the sample size n grows larger for [virtually] every population! It is the generality of the last point that is so distinctive. The square root of the variance $\sigma^2/n$ (written as $\sigma/\sqrt{n}$, so we may write $\sigma_{\text{mean}} = \sigma/\sqrt{n}$) is called the standard error of the mean (i.e., the standard deviation of the sampling distribution of the mean, $\sigma_{\text{mean}}$), a term that is frequently encountered in statistical practice. Point (3) is the central limit theorem; points (1) and (2) hold for any sample size n.
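The three characteristics of the sampling distribution of the mean can be checked by simulation. The following is a minimal sketch, not part of the original handout: it assumes an exponential parent population with rate 1 (so the population mean and variance are both 1), and the names `sim_means` and `skew` are hypothetical helpers introduced only for this illustration.

```r
# Simulate the sampling distribution of the mean for a skewed parent population.
# Assumption: exponential(rate = 1) parent, so mu = 1 and sigma^2 = 1.
set.seed(1)

sim_means <- function(n, n_samples = 10000) {
  # draw n_samples independent samples of size n; return their means
  replicate(n_samples, mean(rexp(n, rate = 1)))
}

skew <- function(v) mean(((v - mean(v)) / sd(v))^3)  # simple skewness measure

m25  <- sim_means(25)
m400 <- sim_means(400)

# (1) the mean of the means is close to mu = 1
# (2) the variance of the means is close to sigma^2 / n
# (3) skewness shrinks toward 0 (normal shape) as n grows
round(rbind(
  n25  = c(mean = mean(m25),  var = var(m25),  theory_var = 1 / 25,  skew = skew(m25)),
  n400 = c(mean = mean(m400), var = var(m400), theory_var = 1 / 400, skew = skew(m400))
), 4)
```

The simulated means are only a large subset of all possible samples, so the agreement with theory is approximate, which is exactly the situation one is in with a bootstrap distribution as well.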
(Stop to note that the sampling distribution [of any statistic] is itself a population; but such a distribution is wholly distinct from the distribution of the parent population, or from the distribution of any particular sample. Because of its status as a population, the characteristics of a sampling distribution are generally denoted by Greek letters [consider that all possible samples of a given size were the sources of the statistics]. But don't confuse the sampling distribution (of any statistic; there is generally a different one for each statistic) with the parent population from which samples were drawn. The preceding points are fundamental to the bootstrap method (see below). But note that when we speak about a bootstrap distribution of a statistic we are talking about an approximate sampling distribution of a particular statistic (not just the mean!), based on a large number of bootstrap samples; for each sample, the sampling is done with replacement from a particular sample-as-population, and each sample is of the same size as the original, i.e., n. Still, that large number [1000 below] of statistics is far smaller than the total number of all possible samples, which is generally n to the power n ($n^n$) in a bootstrap context, or N to the power n ($N^n$) for a finite population of size N.)

Bootstrapping entails sampling with replacement from a vector (or from the rows of a matrix or data frame; see below for how to do this in R), so that each bootstrap sample is always the same size as the original. But don't confine your thinking to just the mean as we begin to consider bootstrapping; in general, bootstrap distributions can be created for any statistic that can be computed, and each statistic is based on a set of resampled data points.

The following illustration begins from a vector y that contains n = 100 values, originally generated as a random sample from the $t_3$ distribution, i.e., t with 3 degrees of freedom, and then scaled to have a mean of 20 and a standard deviation of about 3. This accounts for the relatively long tails of y, compared with a Gaussian (normal) distribution, that you see below. See the plot of y

(which is both a sample and a population, depending on your point of view; both ideas are relevant); its summary statistics (parameters?) are given below.

Here is the R function I used to obtain four central tendency estimates for each of 1000 bootstrap samples (copy and paste means4 into your R session):

    means4 <- function(x, tr1 = .1, tr2 = .2) {
      # function that, given vector x, computes FOUR statistics to assess central tendency
      xm1   <- mean(x)
      xmt.1 <- mean(x, trim = tr1)   # 10% trimmed mean
      xmt.2 <- mean(x, trim = tr2)   # 20% trimmed mean
      xm.5  <- median(x)             # 50% trimmed mean = median
      xms   <- c(xm1, xmt.1, xmt.2, xm.5)
      xms   <- round(xms, 2)
      list(xms = xms)
    }
    # the four means above are given as mean1 ... mean4 below

Now we use the bootstrap function from the library bootstrap:

    mns4.y <- bootstrap(y, nboot = 1000, theta = means4)

This is the command used for the main bootstrap run (1000 replicates) [nboot >= 1000 for good C.I.s]. I generated 1000 bootstrap replications of the four statistics [for library: bootstrap in R]. All R commands are given further below.

Numerical summary of bootstrap results:

    > cbind(my.summary(x), my.summary(mns4.500))

[Table: columns pop., mean1, mean2, mean3, mean4; rows means, s.d.s, skewns. The numerical entries are not preserved in this copy.]

Notes on the table: mean1 is just the conventional mean. In the means row, departures from 20 indicate bias. In the s.d.s row, the first value is the population s.d.; the rest of the s.d.s (bold values) are bootstrap s.e.s.

(The skewness values will be discussed; see the plot on the next page. Key results are in bold italics above and below.)

A second run, again with 1000 bootstrap replications, gave means, s.d.s and skewness values that were, for practical purposes, identical to those of the first run. (I ignore the population values here, as they did not change.)

Following are the four bootstrap distributions for the first set:

[Figure: bootstrap distributions of the four statistics, mean1 to mean4.]

NB: The initial population was a long-tailed sample. Its use affords an opportunity to study the conventional sample mean as an estimate of the center of a distribution when normality does not hold. We in fact see that the conventional mean is the worst of the four estimators of the center of the distribution of the parent population, based on 1000 bootstrap samples. Remember that the initial sample-as-population had n = 100 scores, so that n = 100 for each bootstrap sample. Thus, the first s.e., for mean1, can be calculated by theory; that theory says to divide the population s.d. by sqrt(n); here 3.03/sqrt(100) = .303. We are most informed by the computed standard error estimates; these quantify how well the different estimators of the center of this distribution work in relation to one another.

To repeat, each bootstrap sample entails sampling WITH replacement from the elements of the initial data vector y. Each of the B = 1000 bootstrap samples contains n = 100 resampled scores, and all four statistics ("means"; three trimmed) were computed for each bootstrap sample. The summary results, and especially the standard error estimates, based on the bootstrap replicates are the principal results on which one will usually focus in studies like this. See the documentation for bootstrap for more information as to what this function does, or can do.
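The comparison between the theoretical s.e. of the mean and the bootstrap s.e.s can be sketched in base R alone. This is an illustration under stated assumptions, not the author's run: the original vector y is not available, so a fresh $t_3$ sample is generated and rescaled to mean 20 and s.d. 3, and `four_centers` is a hypothetical stand-in for means4 that returns a plain named vector.

```r
# Hand-rolled bootstrap of four estimators of center (base R only).
# Assumption: y is regenerated here; it is NOT the author's original data.
set.seed(2)
n <- 100
y <- rt(n, df = 3)
y <- 20 + 3 * (y - mean(y)) / sd(y)   # rescale to mean 20, s.d. 3

four_centers <- function(x) {
  c(mean1 = mean(x),                # conventional mean
    mean2 = mean(x, trim = 0.1),    # 10% trimmed mean
    mean3 = mean(x, trim = 0.2),    # 20% trimmed mean
    mean4 = median(x))              # median
}

B <- 1000
# each row of reps holds the four statistics for one bootstrap sample
reps <- t(replicate(B, four_centers(sample(y, replace = TRUE))))

boot_se   <- apply(reps, 2, sd)     # bootstrap standard errors
theory_se <- sd(y) / sqrt(n)        # theory applies to mean1 only

round(c(theory_se = theory_se, boot_se), 3)
```

The bootstrap s.e. for mean1 should land close to the theoretical value; which of the four estimators has the smallest s.e. depends on the particular sample drawn.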

The first major book on the bootstrap was written by Bradley Efron, inventor of the bootstrap, and Tibshirani: An Introduction to the Bootstrap. There are now at least a dozen books, many of them technical, about bootstrapping. (The May 2003 issue of Statistical Science is devoted exclusively to articles on bootstrapping, for its 25th anniversary.)

Some things you may find useful about bootstrapping within the world of R:

1. A vector such as y can be regarded as y[1:n], where one controls the contents: e.g., y[c(1,3)] gives the 1st and 3rd elements of y; y[n:1] presents the y values in reverse order; and y[sample(1:n, n, replace=TRUE)] yields a bootstrap sample of y, of size n. The latter, repeated (using sampling WITH replacement), becomes a basis for bootstrap analysis.
2. A matrix such as yy, regarded as yy[1:n, 1:p] (of order n x p), can be examined in parts using bracket notation; e.g., yy[1:3, ] displays the first 3 rows of yy. To sample the rows of yy, use yy[sample(1:n, n, replace=TRUE), ], where the comma in [ , ] separates row and column designations.

R commands used to get the preceding numerical results:

    > bt.mn4 = bootstrap(x, nboot=1000, theta=means4)   # x is called xt3 in Fig1
    > bt.mns4 = as.data.frame(bt.mn4$thetastar)         # the output thetastar is of class list;
                                                        # needs to be a data.frame or a matrix for what follows
    > bt.mns4 = t(bt.mns4)    # transpose of matrix (d.frame) bt.mns4 is taken for convenience (below)
    > gpairs(bt.mns4)         # gpairs function is from package YaleToolkit
    > my.summary(bt.mns4)     # I wrote my.summary. Copy it into your R session, or just use summary from R.

    my.summary <- function(xxx, dig = 3) {
      # xxx is taken to be an input data.frame or matrix
      xxx <- as.matrix(xxx)
      xm  <- apply(xxx, 2, mean)
      s.d <- sqrt(apply(xxx, 2, var))
      xs  <- scale(xxx)
      sk  <- apply(xs^3, 2, mean)        # skewness
      kr  <- apply(xs^4, 2, mean) - 3    # excess kurtosis
      rg  <- apply(xxx, 2, range)        # two rows: low and high
      sumry <- rbind(xm, s.d, sk, kr, rg)
      dimnames(sumry)[1] <- list(c("means", "s.d.s", "skewns", "krtsis", "low", "high"))
      sumry <- round(sumry, dig)         # round once, to dig digits
      sumry
    }

Bootstrapping sources in R

R functions for bootstrapping can be found in the bootstrap and the boot libraries, so you should examine the help files for several of the functions in these libraries to see how to proceed. Note that bootstrap is a much smaller library than boot, and generally easier to use effectively. I recommend that you begin with the function bootstrap in the library of the same name, but the greater generality of boot will be most helpful in some more advanced situations. The help.search function will yield more packages that reference bootstrapping, so give that a try too. Naturally, many introductions and discussions can be found on the web; let's see what you like, and post a URL or two.
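The bracket-notation idioms for resampling a vector and the rows of a matrix can be tried on toy data. A minimal sketch follows; the values in `y` and `yy` are made up for illustration, and the names `idx`, `y_star` and `yy_star` are hypothetical.

```r
# Bootstrap resampling by index, for a vector and for the rows of a matrix.
set.seed(3)
n  <- 6
y  <- c(11, 12, 13, 14, 15, 16)          # toy vector (made-up values)
yy <- cbind(id = 1:n, score = 10 * y)    # toy n x 2 matrix; row i belongs to y[i]

idx <- sample(1:n, n, replace = TRUE)    # n indices drawn WITH replacement

y_star  <- y[idx]      # bootstrap sample of the vector, same size as y
yy_star <- yy[idx, ]   # bootstrap sample of whole rows; columns stay paired

y_star
yy_star
```

Using one index vector for both objects shows the key design point: resampling rows keeps each case's values together, whereas resampling each column separately would break the pairing between variables.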

These pages summarize key points, and offer more general principles. Of particular interest is the concept of confidence intervals, and ways to use bootstrap methods to generate limits for CIs. Note the (deliberate) redundancy in what follows and what you have read above.

The essence of bootstrapping: We begin by assuming a quantitative variable, which we want to characterize or describe using numerous different statistics (means, medians, trimmed means, variances, s.d.s, skewness, etc.). Our goal is to make inferences about parent population parameters using confidence intervals that have in turn been constructed using the information from a reasonably large number of computer-generated bootstrap samples. (Take note: we will NOT introduce any mathematical theory here; all that follows involves computer-intensive computation, but no theory as such.)

1. Begin from an initial sample, not too small (say, 30 or 40 cases at a minimum); this should be a random sample, or a sample for which we can reasonably argue that it represents some larger universe of scores that we shall think of as our parent population. Let y represent the sample data.
2. Decide what feature(s) of the parent population we would like to make inferences about (center, spread, skewness, etc.); then, given one or two choices, say center and spread, decide on what statistics we want to use for inferences. We might have two, three or more alternative measures of each feature (e.g., four means for center; s.d.s and IQRs to assess spread, etc.), a total of S statistics, say. One goal here is to compare various estimators with one another with respect to their purposes in helping to make inferences. We might also have begun with difference scores in our initial vector.
3. Compute and save each of these statistics for our initial sample of y values; we shall call them by a special name: bootstrap parameters (which are also statistics, see below). Reflect on this point since it is easy to be confused here.
4. Choose or write functions that we will be able to apply to each bootstrap sample, where each bootstrap sample is simply a sample drawn with replacement from the initial sample. The initial sample will now be regarded, for the purposes of bootstrapping, as our bootstrap population. (Note carefully that we must take care in what follows to distinguish the parent population from the bootstrap population. The latter population can be said to have bootstrap parameters that are also properly labeled as conventional sample statistics.)
5. Generate bootstrap samples a substantial number of times (say B = 500 to 1000 of these), and save the bootstrap replicates (those that measure center, spread, skewness) for each of the bootstrap samples. It is best to generate an array (a matrix of order B x S) that contains all of these; they shall be called replicate values for the respective statistics, and they will be the basis for the ultimate inferences.
6. Summarize the preceding table by columns, one for each statistic that relates to a particular feature of the initial bootstrap population (recalling that our bootstrap population began as our initial sample). Both numeric and graphical methods should usually be employed.
7. Compute and compare the (conventional) means of the replicate statistics (columns) with the bootstrap population parameters; the differences may be positive or negative, and these differences measure bias. Ideally, we might seek zero bias, but small amounts of bias are usually tolerated, particularly if the biased statistics have compensating virtues, especially relatively small variation across the set of bootstrap samples.
8. Then compute and compare the s.d.s of the respective statistics; often the main goal of the entire bootstrapping study is to find which statistics have the smallest s.d.s (which is to say, bootstrap standard errors), since these are the statistics that will have the narrowest confidence intervals. If a statistic is found to be notably biased, we may want to adjust the statistics (nominally used as centers of our ultimate confidence intervals).
9. Generate the density distributions (histograms are OK) and, more importantly, selected quantiles of any bootstrap statistics (the bootstrap replicates) we generated. For example, if we aim to get a 95% interval for a trimmed mean, we find the 2.5% and the 97.5% quantile points of the distribution of that trimmed mean, and (supposing it has minimal bias) these become our confidence limits for a 95% interval. We will surely want to compare these limits with those for the conventional mean. Statistics with the narrowest CIs can usually be said to be best, particularly if they were found to be minimally biased. Similar methods are used for 99% CIs, etc. Graphics can be useful in this context, but be sure to note that all the information is based on a (rather arbitrary) initial sample, so care has to be taken not to misinterpret, or over-interpret, results.
10. Summarize by comprehensively describing the main results, also noting that this methodology has bypassed normal theory methods that, strictly speaking, apply only when normality assumptions can be invoked; moreover, we have made no assumptions about shapes or other features of the (putative [look it up!]) parent population. In particular, we have not assumed normality at any point, and we have gone (well) beyond the mean to consider virtually any statistic of interest (review these; add others).
11. Finally, recognize that the interpretation of any such bootstrap CI is essentially the same as that for a conventional CI gotten by normal theory methods.

These ideas readily generalize to statistics that are vectors, such as vectors of regression coefficients. This means we are all free to invent and use bootstrap methods to study the comparative merits and demerits of a wide variety of statistics, without regard for whether they are supported by normal theory. We need not invoke normality assumptions, nor make any other so-called parametric assumptions in the process. The main thing to note is that any bootstrap sample drawn from a vector or matrix is just a sample drawn with replacement from the ROWS of an initial sample data matrix; and (vector) bootstrap statistics are computed for each bootstrap matrix, analogous to what has been described above. The computation may be intense, but with modern computers such operations are readily carried out for rather large matrices (thousands of rows, hundreds of columns) if efficient computational routines are used. Conventional statistics can be notably inferior to certain new counterparts, a point that often needs to be considered seriously.
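The generalization to vector statistics can be sketched with a regression example in base R. This is an illustration under stated assumptions rather than anything from the handout: the data frame `dat` is simulated (a straight line plus long-tailed $t_3$ noise), and `coef_fun` is a hypothetical helper. It resamples rows with replacement, then computes bias, bootstrap standard errors and 95% percentile limits for each coefficient.

```r
# Row-resampling bootstrap for a vector statistic (regression coefficients).
# Assumption: dat is simulated here purely for illustration.
set.seed(4)
n   <- 60
x   <- runif(n, 0, 10)
dat <- data.frame(x = x, y = 2 + 0.5 * x + rt(n, df = 3))

coef_fun <- function(d) coef(lm(y ~ x, data = d))   # returns c(intercept, slope)

theta_hat <- coef_fun(dat)   # the "bootstrap parameters" from the initial sample

B <- 1000
# each bootstrap sample is n ROWS of dat drawn with replacement;
# reps is the B x S matrix of replicate values (here S = 2)
reps <- t(replicate(B, coef_fun(dat[sample(nrow(dat), replace = TRUE), ])))

bias <- colMeans(reps) - theta_hat                          # mean of replicates minus parameter
se   <- apply(reps, 2, sd)                                  # bootstrap standard errors
ci95 <- apply(reps, 2, quantile, probs = c(0.025, 0.975))   # percentile limits

round(rbind(estimate = theta_hat, bias = bias, se = se, ci95), 3)
```

The percentile limits used here are the simplest choice and match the quantile approach described above; other interval types exist (for instance in the boot library) and may behave better when bias is not negligible.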
