
WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON'T BELIEVE WHAT HAPPENED NEXT

PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN EDISCOVERY

By Matthew Verga, J.D.

INTRODUCTION

Anyone who spends ample time working in ediscovery knows that the topic of sampling comes up constantly, whether in reference to collections, early case assessment, or review, both human and technology-assisted. Long before modern review tools incorporated sophisticated sampling calculators, attorneys were manually taking samples, perhaps re-reviewing every 10th document. Sampling is name-checked repeatedly in ediscovery orders and decisions. Despite the ubiquity of the concept, most discussions do not stop to explain the basic mechanics of sampling, the basic calculations used to leverage it, or the practical, step-by-step ways to apply it in your ediscovery projects. Terms that I do not recall hearing in any of my law school classes are peppered liberally throughout the discourse.

WOULD YOU LIKE A SAMPLE?

Having always favored words over numbers (like most attorneys and paralegals), I became involved in ediscovery 7½ years ago with little to no knowledge of sampling techniques, confidence levels, or recall and precision. The best wisdom of the day:

- Included iterative testing of search strings by partners or senior attorneys, who would informally sample the results of each revised search string to inform their next revision.
- Suggested employing a 3-pass document review process with successively more senior attorneys performing each pass:
  o The first pass reviewed everything
  o The second pass re-reviewed a random 10% sample
  o And the third pass re-reviewed a random 5% sample

SOUNDS REASONABLE, RIGHT?

It did to me too, until I started asking myself and others: why? Why is a search that returns more documents than expected invalid? How many search results are enough to sample? Why re-review 10% and 5%? What basis do we have to believe these processes are sufficient or reliable? I keenly felt a gap in my knowledge, and in the knowledge of my peers. Surely there were better ways to accomplish these goals and more defensible bases for decision making of this kind. Surely other professionals in other fields addressed these questions all the time and dispatched them using something more than their guts.

FOUR PRACTICAL APPLICATIONS

I found that the solution lies in the basic statistics course that some of you took, and that the rest of us should have taken. As it turns out, there is math for that. Some of it is moderately complicated and requires a specialized calculator. Some of it is very complicated and requires an expert with different letters after their name than mine. And some of it is so simple you can do it with pen and paper.

The purpose of this white paper is to illustrate a few practical applications of random sampling in ediscovery, including key concepts, key vocabulary, and illustrations of the basic math. At the very least, this information will equip you to have a more productive conversation with your service providers about what you want to accomplish and how they can help. It may even provide you with the confidence to begin experimenting with some of these more concrete methods in your own ediscovery projects. The four practical applications I will cover are:

1. Estimating Prevalence - finding out what's in a new, unknown dataset
2. Testing Classifiers - finding out how good a search string is
3. Quality Control of Human Document Review - finding out how good your reviewers are
4. Elusion and Overall Completeness - finding out how much stuff you missed

ESTIMATING PREVALENCE: FINDING OUT WHAT'S IN A NEW, UNKNOWN DATASET

The first important application of simple random sampling in ediscovery is for the estimation of prevalence. Prevalence is the portion of a dataset that is relevant to a particular information need. For example, if one third of a dataset was relevant in a case, the prevalence of relevant materials would be 33%. Prevalence is always known by the end of a document review project; hindsight is 20/20. But would there be value in knowing the prevalence at the start of a document review project? Certainly there is:

- Knowing the prevalence of relevant materials can guide the selection of culling and review techniques to be employed and other next steps to be taken
  o It can also provide a measuring stick for overall progress
- Knowing the prevalence of different subclasses of materials can guide decisions about resource allocation (e.g., associates vs. contract attorneys vs. LPO) or prioritization
- Knowing the prevalence of specific features facilitates more accurate estimation of project costs:
  o How much material is likely to need to be reviewed
  o How much privilege quality control review and logging is likely to be needed
  o How much redaction is likely to be needed

In each of these examples, the application of this sampling process provides valuable discovery intelligence that can serve as the basis of data-driven decision making, replacing gut feelings with knowledge.

When utilizing simple random sampling to estimate prevalence, the first question is: from what pool of materials should the sample be taken? The answer to that question is dictated by the specific prevalence you are attempting to estimate. For the purposes of this discussion, let's assume we are simply trying to estimate the overall prevalence of relevant materials.

RANDOM SAMPLING

Since we are looking for potentially relevant materials, the pool from which the sample should be taken is the same as the pool that would normally be submitted for review:

- A pool with system files removed (de-NISTed)
- A pool with documents outside of any applicable date range removed
- A pool that has been de-duplicated
- A pool to which any other obvious, objective culling criteria have been applied
  o (e.g., court-mandated keyword or custodian filtering)

Once this pool of materials has been isolated, it will become your sampling frame. Your simple random sample will be taken from within this frame. A simple random sample is one in which every document has an equal chance of being selected. To accomplish this, a random number generator is used. [1] Most modern review programs have sampling tools built in, which will be based on an acceptable pseudo-random number generator, such as the one included in the Microsoft .NET development framework. If experimenting with simple random sampling manually, you can also utilize spreadsheet programs like Microsoft Excel to generate lists of random numbers. The size of the sample you should take is dictated by the strength of the measurement you want to achieve, the size of your dataset, and the expected prevalence of relevant material within the dataset.

[1] Technically, all software tools generate pseudo-random numbers. This means that if they were used to generate extremely large sets of random numbers, there would eventually be identifiable patterns or repetition, but for our purposes, we can treat them as random number generators.
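For those experimenting manually, a few lines of code will also do the job. The following is a minimal sketch in Python (one convenient option, not a tool referenced here), drawing a simple random sample of purely hypothetical document IDs from a 1,000,000-document frame; the sample size of 2,396 anticipates the calculation discussed below.

    # Minimal sketch: drawing a simple random sample of hypothetical document IDs.
    import random

    doc_ids = [f"DOC-{i:07d}" for i in range(1, 1_000_001)]  # hypothetical sampling frame

    random.seed(42)                        # fixed seed only so the example is repeatable
    sample = random.sample(doc_ids, 2396)  # every document has an equal chance of selection

    print(len(sample), sample[:3])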

The strength of the measurement is expressed through two values: confidence level and confidence interval.

- Confidence Level is expressed as a percentage and is a measure of how certain you are about the results you get. Or, said another way: if you took the same size sample the same way 100 times, how many times out of 100 would you get the same results? Typically, you will be seeking a confidence level of 90%, 95%, or 99%.
- Confidence Interval is also expressed as a percentage and is a measure of how precise your results are, or, said differently, how much uncertainty there is in your results. Typically you will be seeking a confidence interval between +/-2% (which is a total range of 4%) and +/-5% (which is a total range of 10%).
  o The term confidence interval is sometimes used interchangeably with the term margin of error. The margin of error, however, is stated as one half of the confidence interval, just as a radius is one half of a diameter. For example, a margin of error of 2% refers to a confidence interval of +/-2% (a 4% range).

For example, you might choose to take a measurement with a confidence level of 95% and a confidence interval of +/-2% to estimate prevalence. That measurement strength has been referenced in a variety of cases and articles as a potentially acceptable standard. [2] If review of your sample revealed a prevalence of 50%, you would know that if you repeated the test another 100 times, 95 of those tests would also have results that fall between 48% and 52% prevalence.

Strength of measurement affects sample sizes in two ways:

1. First, the higher the confidence level you desire, the larger the sample you will need to take.
2. Second, the lower the margin of error you desire, the larger the sample you will need to take.

See Figure 1, illustrating how sample sizes increase with confidence level and interval.

[2] For example, in the widely read and discussed Monique da Silva Moore, et al. v. Publicis Groupe & MSL Group.

FIGURE 1 - SAMPLE SIZE VARIABILITY WITH CONFIDENCE LEVEL AND INTERVAL

Sample sizes also increase with the size of the sampling frame, but only up to a point. Beyond that point, the required sample size levels off. For example, the sample size needed for 100,000 documents is roughly the same as the sample size needed for 1,000,000 documents. Figure 2 illustrates how sample size increases with sampling frame size.

FIGURE 2 - SAMPLE SIZE VARIABILITY WITH SAMPLE FRAME SIZE

Understanding this can produce significant cost savings. A traditional 5% sample of 1,000,000 documents would be 50,000 documents, but a simple random sample of only about 2,400 documents is actually sufficient to estimate prevalence and accomplish other useful investigatory tasks.

PREVALENCE

Prevalence also affects the required sample size; however, it will not yet be known when prevalence itself is what you are sampling to estimate. In that case, you should use the most conservative value, the one resulting in the largest sample size. Assuming a prevalence of 50%, i.e., that half of the sampling frame is relevant and half is not, requires the largest sample size. Sample size decreases as prevalence increases or decreases from 50%. See Figure 3 for a visualization of how sample size fluctuates with prevalence.

FIGURE 3 - SAMPLE SIZE VARIABILITY WITH PREVALENCE

When estimating prevalence, there is no correct strength of measurement to take. As noted above, several orders and articles have referenced a 95% confidence level and a +/-2% confidence interval, but that is persuasive authority at best. You may not feel comfortable with anything less than 99% +/-1%, or you may be fine at 90% +/-5%. It depends on your specific circumstances.
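In practice you will usually rely on a sampling calculator (discussed next), but the relationships shown in Figures 1 through 3 all follow from the standard sample-size formula for estimating a proportion. Below is a minimal sketch of that formula in Python, using the normal approximation with a finite population correction; the function name and default values are illustrative assumptions, not part of any particular review tool.

    # Minimal sketch of the standard sample-size formula for estimating a proportion,
    # using the normal approximation with a finite population correction.
    import math
    from statistics import NormalDist

    def sample_size(frame_size, confidence=0.95, margin_of_error=0.02, prevalence=0.5):
        z = NormalDist().inv_cdf(0.5 + confidence / 2)    # ~1.96 for a 95% confidence level
        n0 = (z ** 2) * prevalence * (1 - prevalence) / margin_of_error ** 2
        return math.ceil(n0 / (1 + (n0 - 1) / frame_size))  # finite population correction

    print(sample_size(1_000_000))                   # ~2,396 documents at 95% +/- 2%
    print(sample_size(100_000))                     # roughly the same, despite a 10x smaller frame
    print(sample_size(1_000_000, prevalence=0.25))  # smaller sample when prevalence is far from 50%

With these hypothetical inputs, the formula reproduces the 2,396-document sample size used below and shows both the leveling-off with frame size and the peak at 50% prevalence.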

Assuming you have settled on a measurement strength of 95% +/-2%, how do you calculate your sample size? In most instances, your review tool will have a built-in sampling calculator that you can use. If not, sampling calculators are available online on a variety of websites. [3] In either case, you will input your desired confidence level (95%), your desired confidence interval or margin of error (+/-2% or 2%), your sampling frame size (e.g., 1,000,000), and, depending on the calculator, the expected prevalence (50%; some calculators always assume 50% by default and do not allow for customization of this variable). If you were to enter this set of hypothetical variables into a sampling calculator, you would learn that a simple random sample of 2,396 documents from your sampling frame of 1,000,000 will allow you to estimate prevalence with a confidence level of 95% +/-2%. For example:

- If you took such a sample and reviewed it,
- And the review identified 599 relevant documents,
- You would have 95% confidence
  o That the overall prevalence of relevant documents is between 23% and 27%,
  o Or between 230,000 and 270,000 of your 1,000,000 hypothetical documents

When reviewing random samples to estimate prevalence, whether of general relevance or of more specific features, it is important to ensure the highest quality review possible, as any errors in the review of the sample will be effectively amplified in the estimations based on that review. For this reason, reviews of such samples should be conducted by one or more members of the project team with direct, substantial knowledge of the matter.

Estimating prevalence in this way can reveal a variety of valuable, specific information about an unknown dataset, information that can be used to guide many critical project decisions. Moreover, this process can be completed using smaller numbers of documents than traditional methods. And, once completed, this reviewed sample can serve additional purposes as a control set for testing classifiers.

[3] For example, www.raosoft.com/samplesize.html.
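To make the arithmetic in the bullets above concrete, here is a minimal sketch of the same prevalence estimate, using only the hypothetical numbers already given (2,396 documents sampled, 599 found relevant, a 1,000,000-document frame, and a +/-2% interval).

    # Minimal sketch: turning the reviewed sample into a prevalence estimate.
    reviewed_sample_size = 2_396
    relevant_in_sample = 599
    frame_size = 1_000_000
    margin_of_error = 0.02               # the +/- 2% interval chosen above

    prevalence = relevant_in_sample / reviewed_sample_size          # ~0.25
    low, high = prevalence - margin_of_error, prevalence + margin_of_error

    print(f"Point estimate of prevalence: {prevalence:.0%}")         # ~25%
    print(f"95% confidence interval: {low:.0%} to {high:.0%}")       # ~23% to 27%
    print(f"Estimated relevant documents: {low * frame_size:,.0f} to {high * frame_size:,.0f}")
    # ~230,000 to 270,000 of the 1,000,000 hypothetical documents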

TESTING CLASSIFIERS: FINDING OUT HOW GOOD A SEARCH STRING IS

The second important application of simple random sampling in ediscovery is for the testing of classifiers. In the context of ediscovery, classifiers are tools, mechanisms, or processes by which documents from a dataset are classified into categories like responsive and non-responsive or privileged and non-privileged. The tools, mechanisms, and processes employed could include:

- Keyword searching
- Individual human reviewers
- Overall human review processes
- Machine categorization by latent semantic indexing
- Predictive coding by probabilistic latent semantic analysis

Testing classifiers has significant value as a source of discovery intelligence to guide data-driven decision making about the methodologies employed on your matters:

- Search strings and other classifiers can be refined through iterative testing
- Testing provides strong bases to argue for or against particular classifiers during negotiations

When classifiers are tested, their efficacy is expressed through two values: recall and precision. The higher the recall, the more comprehensive a search's results will be; the higher the precision, the more efficient any subsequent review process will be.

- Recall is expressed as a percentage and is a measure of how much of the material sought was returned by the classifier. For example, if 250,000 relevant documents exist and a search returns 125,000 of them, it has a recall of 50%.
- Precision is also expressed as a percentage and is a measure of how much unwanted material was returned by the classifier. For example, if a search returns 150,000 documents of which 75,000 are irrelevant, it has a precision of 50%.

Testing classifiers before applying them to a full dataset requires the creation of a control set against which they can be tested. A control set is a representative sample of the full dataset that has already been classified by the best reviewers possible so that it can function as a gold standard. If you have already estimated prevalence, the sample reviewed for that estimation will generally also work as a gold standard control set for testing classifiers.

Assuming you estimated prevalence as described above, you would have a ready-made control set of 2,396 documents that could be used to test search strings for recall and precision. Search strings would be tested by running them against the 2,396-document sample and comparing the results of the search to the results of the prior review by subject matter experts. That comparison facilitates the calculation of recall and precision.

To demonstrate how this comparison is used to perform these calculations, we will use contingency tables (sometimes referred to as cross-tabulations). A contingency table provides an easy breakdown of the comparison between the results of a classifier being tested and the prior review by subject matter experts. It breaks the comparison down into four categories:

1. True Positives
   a. Documents BOTH returned by the search AND previously reviewed as relevant
2. False Positives
   a. Documents returned by the search BUT previously reviewed as NOT relevant
3. False Negatives
   a. Documents NOT returned by the search BUT previously reviewed as relevant
4. True Negatives
   a. Documents BOTH NOT returned by the search AND previously reviewed as NOT relevant

For this example, let's assume that you tested a search string against your 2,396-document control set, and, to keep the math simple, let's assume that the comparison of the search string to the prior review resulted in an even split of 599 documents in each of those four categories.

Figure 4 shows what this contingency table would look like.

FIGURE 4 - RESULTS OF SEARCH STRING TESTED AGAINST CONTROL SET

                                      Relevant (Prior Review)   Not Relevant (Prior Review)
  Relevant/Returned (Search)                    599                         599
  Not Relevant/Not Returned (Search)            599                         599

On the contingency table in Figure 4:

- There are 599 True Positives
  o Top left box
  o Documents BOTH returned by the search AND previously reviewed as relevant
- There are 599 False Positives
  o Top right box
  o Documents returned by the search BUT previously reviewed as NOT relevant
- There are 599 False Negatives
  o Bottom left box
  o Documents NOT returned by the search BUT previously reviewed as relevant
- There are 599 True Negatives
  o Bottom right box
  o Documents BOTH NOT returned by the search AND previously reviewed as NOT relevant

With the results broken out in this way in a contingency table, it is straightforward to perform the calculations of recall and precision for the hypothetical search string being tested.

- As noted, recall is the percentage of all relevant documents returned by the tested classifier
  o In this hypothetical, the search string correctly returned 599 out of 1,198 relevant documents
  o 599 / 1,198 = 0.50, or 50%
  o The hypothetical search string has a recall of 50%
- As noted, precision is the percentage of documents returned by the classifier that are actually relevant
  o In this hypothetical, the search string returned a total of 1,198 documents, of which 599 were actually relevant
  o 599 / 1,198 = 0.50, or 50%
  o The hypothetical search string has a precision of 50%

Calculating recall and precision provides us with an excellent assessment of the strength and efficacy of a particular search string or other classifier, but it's important to remember that the same confidence level and interval do not automatically apply to these numbers. These numbers have not been calculated based on a sample size of 2,396. Rather:

- Recall has been calculated based on the total number of relevant documents in the sample of 2,396
  o In this hypothetical, that is 1,198 documents
  o 1,198 documents is, thus, the effective sample size for this calculation, with the sampling frame being the total universe of relevant documents
  o Of that sample of the total universe of relevant documents, this search can recall half
- Precision has been calculated based on the total number of returned documents
  o In this hypothetical, that is also 1,198 documents
  o 1,198 documents is, thus, the effective sample size for this calculation, with the sampling frame being the total universe of documents the search would return
  o Of this sample of the total universe of documents this search would return, half were relevant
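The recall and precision calculations above reduce to a few lines of code. A minimal sketch using the hypothetical Figure 4 counts:

    # Minimal sketch: recall and precision from the Figure 4 contingency table,
    # in which each of the four cells holds 599 documents.
    true_pos, false_pos, false_neg, true_neg = 599, 599, 599, 599

    recall = true_pos / (true_pos + false_neg)      # share of all relevant documents returned
    precision = true_pos / (true_pos + false_pos)   # share of returned documents that are relevant

    print(f"Recall:    {recall:.0%}")    # 50%
    print(f"Precision: {precision:.0%}") # 50%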

Some sampling calculators will allow you to input these variables and work backwards to determine the new confidence level and interval for these measurements; the level will be somewhat lower, and the interval somewhat wider, than for the original 2,396-document sample.

Testing classifiers in this manner, by calculating recall and precision (and determining the reliability of those calculations), offers excellent return on effort, replacing blind sampling and anecdotal evidence with a repeatable process and reliable results: valuable discovery intelligence that can be leveraged for data-driven decision making.

QUALITY CONTROL OF HUMAN DOCUMENT REVIEW: FINDING OUT HOW GOOD YOUR REVIEWERS ARE

In addition to being leveraged for estimating prevalence and testing classifiers, simple random sampling can also be leveraged for quality control of human document review. In the traditional approach to document review, quality control is maintained by a multi-pass system of review that includes extensive re-review of documents by successively more senior attorneys. Often these later passes will involve the review of a flat percentage of the documents from the pass below; sometimes important categories of materials will be entirely re-reviewed.

Instead of such extensive, brute-force re-review, simple random sampling can be employed to streamline the quality control process while simultaneously increasing its precision. In such a scenario, the reviewer doing the initial work is the classifier being tested, and the control set is the decisions of the more senior attorney reviewing the random sample and agreeing or disagreeing with the initial reviewer. After the more senior attorney completes quality control review of an appropriately-sized random sample of the initial reviewer's work (or of a team's combined work), the differences between the initial reviewer's classifications and the more senior attorney's classifications can be used to create a contingency table like those discussed above. If an appropriate tagging palette is employed for documenting quality control decisions, this is a simple matter.

With such a contingency table, you can easily calculate the two values used to assess the performance of a reviewer: accuracy and error rate. As the names suggest, the higher a reviewer's (or a review team's) accuracy rate the better, and the lower their error rate the better.

- Accuracy is expressed as a percentage and is a measure of how many initial reviewer determinations were correct, out of all determinations made. The closer to 100%, the better your reviewers are doing.
- Error Rate is expressed as a percentage and is a measure of how many initial reviewer determinations were incorrect, out of all determinations made. The closer to 0%, the better your reviewers are doing.
  o Error rate and accuracy together should always total 100%.

To create a contingency table for this purpose, the four categories would be similar to those used above:

1. True Positives
   a. Documents deemed relevant by BOTH the initial reviewer AND the QC reviewer
2. False Positives
   a. Documents deemed relevant by the initial reviewer BUT NOT by the QC reviewer
3. False Negatives
   a. Documents deemed NOT relevant by the initial reviewer BUT relevant by the QC reviewer
4. True Negatives
   a. Documents deemed NOT relevant by BOTH the initial reviewer AND the QC reviewer

In Figure 5, you can see such a contingency table created for the quality control review of a random sample of 1,000 documents taken from the thousands completed by a particular, hypothetical reviewer.

FIGURE 5 - RESULTS OF INITIAL REVIEWER'S DETERMINATIONS TESTED AGAINST QC REVIEWER'S DETERMINATIONS

                                                Relevant (QC Reviewer)   Not Relevant (QC Reviewer)
  Relevant/Returned (Initial Reviewer)                    250                       200
  Not Relevant/Not Returned (Initial Reviewer)            100                       450

On the contingency table in Figure 5:

- There are 250 True Positives
  o Top left box
  o Documents deemed relevant by BOTH the initial reviewer AND the QC reviewer

- There are 200 False Positives
  o Top right box
  o Documents deemed relevant by the initial reviewer BUT NOT by the QC reviewer
- There are 100 False Negatives
  o Bottom left box
  o Documents deemed NOT relevant by the initial reviewer BUT relevant by the QC reviewer
- There are 450 True Negatives
  o Bottom right box
  o Documents deemed NOT relevant by BOTH the initial reviewer AND the QC reviewer

With the results broken out in this way in a contingency table, it is straightforward to perform the calculations of accuracy and error rate for the hypothetical reviewer being tested.

- As noted above, accuracy is the percentage of correct determinations out of all those made
  o In this hypothetical, the reviewer made 700 correct determinations (True Positives + True Negatives) out of 1,000 total
  o 700 / 1,000 = 0.70, or 70%
  o The hypothetical reviewer has 70% accuracy
- As noted above, error rate is the percentage of incorrect determinations out of all those made
  o In this hypothetical, the reviewer made 300 incorrect determinations (False Positives + False Negatives) out of 1,000 total
  o 300 / 1,000 = 0.30, or 30%
  o The hypothetical reviewer has a 30% error rate

As with estimations of prevalence and testing of classifiers, the reliability of these measurements will depend on the overall sampling frame (e.g., all of a reviewer's work) and the sample size taken, with larger samples giving more reliable results. For ongoing quality control review, it is not practical to attempt to attain the same confidence levels and intervals for these measurements that can be achieved for prevalence, recall, and precision. Most projects will not have sufficient scale for the sampling frame and sample sizes for individual reviewers to grow very large.
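As with recall and precision, the accuracy and error-rate calculations above reduce to a few lines of code. A minimal sketch using the hypothetical Figure 5 counts:

    # Minimal sketch: accuracy and error rate from the Figure 5 contingency table.
    true_pos, false_pos, false_neg, true_neg = 250, 200, 100, 450
    total = true_pos + false_pos + false_neg + true_neg      # 1,000 QC-reviewed documents

    accuracy = (true_pos + true_neg) / total        # correct initial determinations
    error_rate = (false_pos + false_neg) / total    # incorrect initial determinations

    print(f"Accuracy:   {accuracy:.0%}")   # 70%
    print(f"Error rate: {error_rate:.0%}") # 30%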

LOT ACCEPTANCE SAMPLING

Another approach that can be taken to offset this scale limitation is lot acceptance sampling. Lot acceptance sampling is a methodology employed in pharmaceutical manufacturing, military contract fulfillment, and many other high-volume, quality-focused processes. When employing lot acceptance sampling, a maximum acceptable error threshold is established, as is a sampling protocol. Each lot has a random sample taken from it for testing. If the established acceptable error rate is exceeded, the entire lot is rejected without further evaluation.

In ediscovery, the lot would correspond most readily to the individual review batch. In a high-volume document review project with a large review team, some form of batch acceptance sampling could present an efficient quality control solution. Each completed batch could be randomly sampled to test for batch acceptance. If the established maximum acceptable error rate is exceeded, the entire batch is rejected and sent back for re-review. Statistics on batch rejection could be tracked by reviewer, by source material, or by other useful properties. [4] A minimal sketch of this batch-acceptance check appears below.

Simple random sampling can be leveraged in a variety of ways for the quality control of human document review and can be adapted for use in both small and large projects. Employing it gives you the ability to measure quality precisely and to speak with certainty about individual reviewers' relative performance, once again replacing gut feelings and anecdotal evidence with concrete measurements.

[4] Although some practitioners experience discomfort at the thought of positively identifying an acceptable error rate, it is important to remember two things: first, choosing not to acknowledge or measure the error rate in a document review project does not mean that it does not exist; and second, reasonableness is the standard, not perfection.
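Here is that batch-acceptance sketch, with the QC function, the 50-document sample size, and the 5% threshold all being illustrative assumptions rather than recommendations:

    # Minimal sketch: accept or reject a completed review batch based on the error
    # rate observed in a random sample of its documents. All parameters are illustrative.
    import random

    def accept_batch(batch_doc_ids, qc_is_correct, sample_size=50, max_error_rate=0.05):
        sample = random.sample(batch_doc_ids, min(sample_size, len(batch_doc_ids)))
        errors = sum(1 for doc_id in sample if not qc_is_correct(doc_id))
        return (errors / len(sample)) <= max_error_rate

    # A rejected batch would be sent back for re-review; acceptance statistics could
    # then be tracked by reviewer, by source material, or by other useful properties.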

ELUSION AND OVERALL COMPLETENESS: FINDING OUT HOW MUCH STUFF YOU MISSED

Finally, simple random sampling can be employed to ascertain the overall completeness of a review effort. When engaged in large-scale document review, some initial classifier, an iteratively-refined search string or a predictive coding tool, will almost certainly be used to cut down the total processed dataset to a smaller subset that will actually be reviewed for potential production. The remainder of the materials, those not returned by this classifier, often are not reviewed at all. If the classifier used was court ordered or was agreed to by the opposing party, then it is acceptable to take no further action with regard to that remainder. If, however, the classifier was of your own design or selection, you may want to validate that classification after the fact. If you employed a predictive coding solution as a classifier, you may want to validate the irrelevance of the excluded materials as part of ensuring the defensibility of your overall process.

Like the measurements of recall and precision above, this measurement is related to the performance of a classifier. In this scenario, the classifier is the one used to separate the total dataset into the pool to be reviewed and the pool to be ignored. The pool to be reviewed is composed entirely of True Positives and False Positives. The pool to be ignored is composed entirely of True Negatives and False Negatives. What you want to know is what percentage of the pool to be ignored are False Negatives that should have been included in review and production. This measurement is sometimes referred to as elusion. Another way to frame this inquiry is as another estimation of prevalence, made regarding just this remainder, the pool to be ignored. Whether considered elusion or prevalence, the calculation is the same: a simple random sample of the remainder (at a size determined by desired measurement strength) is reviewed for remaining relevant documents (False Negatives).

There is no way to perfectly identify and produce all relevant materials in the age of high-volume ediscovery [5], but there can be great value in being able to say, for example, that you have 99% confidence that no more than 3-5% of the remainder is potentially relevant. The 3-5% of that hypothetical sample that was relevant could also then be used to illustrate the types of relevant materials remaining, and, knowing their prevalence and the size of the pool, you could also estimate with great accuracy the cost required to find each additional document, which is a powerful position from which to argue regarding reasonability and proportionality as you near the end of a long ediscovery effort.

[5] Even total human review has been shown, repeatedly, to be inconsistent and incomplete, typically achieving only 70-80% recall.
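To illustrate that last point with numbers, here is a minimal sketch in which the ignored-pool size, the elusion rate observed in the sample, and the per-document review cost are all assumed values, not figures from this paper:

    # Minimal sketch: estimating remaining relevant documents and the marginal cost
    # of finding each one. All inputs below are illustrative assumptions.
    ignored_pool_size = 500_000
    elusion_rate = 0.04              # relevant share observed in the elusion sample
    cost_per_doc_reviewed = 1.50     # assumed linear review cost, in dollars

    estimated_remaining_relevant = ignored_pool_size * elusion_rate
    cost_to_review_remainder = ignored_pool_size * cost_per_doc_reviewed
    cost_per_additional_relevant = cost_to_review_remainder / estimated_remaining_relevant

    print(f"Estimated relevant documents remaining: {estimated_remaining_relevant:,.0f}")           # 20,000
    print(f"Cost to find each additional relevant document: ${cost_per_additional_relevant:,.2f}")  # $37.50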

CONCLUSION: WE GAVE BASIC MATH SKILLS TO A LAWYER, AND IT MADE HIM A MORE EFFECTIVE EDISCOVERY PRACTITIONER

As noted in the subheading above, mastering this math can do the same for you. Random sampling is an essential tool for many activities throughout the discovery lifecycle, and it can be done more effectively than with arbitrarily-selected percentages. As demonstrated throughout this paper, it can be used to find out, in a precise fashion, how much relevant material you have, how effective your searches will be, and how effective your reviewers are. Performing these calculations is not something all practitioners will want to do themselves, but all practitioners will benefit from a greater understanding of these concepts in their conversations with service providers, opposing parties, and other fellow practitioners.

ABOUT THE AUTHOR

Matthew Verga is an electronic discovery consultant and practitioner proficient at leveraging a combination of legal, technical, and logistical expertise to develop pragmatic solutions for electronic discovery problems. Matthew has spent the past seven years working in electronic discovery: four years as a practicing attorney with an AmLaw 100 firm and three years as a consultant with electronic discovery service providers. Matthew has personally designed and managed many large-scale electronic discovery efforts and has overseen the design and management of numerous other efforts as an attorney and a consultant. He has provided consultation and training for AmLaw 100 firms and Fortune 100 companies, as well as written and spoken widely on electronic discovery issues.

Matthew is currently the Director, Content Marketing and ediscovery Strategy, for Modus ediscovery Inc. Matthew is responsible for managing assessments of law firms' and corporations' electronic discovery systems and processes. In this role, he focuses his expertise on assessing organizations' readiness and capability to handle ediscovery matters across each segment of the EDRM. Additionally, Matthew is responsible for the creation of articles, white papers, presentations, and other substantive content in support of Modus's marketing, branding, and thought leadership efforts.