Brent Fitzgerald. CS224N Final Project - June 1, 2000
IMPLEMENTATION OF AN AUTOMATED TEXT SEGMENTATION SYSTEM USING HEARST'S TEXTTILING ALGORITHM

Brent Fitzgerald
brentf@stanford.edu

ABSTRACT

This paper describes the implementation of a text segmentation system based on Hearst's TextTiling algorithm. Hearst is a pioneer in the field of text segmentation, and her algorithm has already been shown to provide good results. The algorithm uses lexical frequency and distribution information to estimate the level of cohesion between blocks of text, and then uses these cohesion estimates to judge where the text is likely to shift from one topic to another.

INTRODUCTION

Most texts one comes across are composed of a number of topics, which vary in scope and in their relevance to one another. A system that could automatically detect these subtopics would be useful, allowing the reader to skip quickly to the topics most relevant to her purpose. Segmentation might also aid information extraction and summarization, since it provides structural semantic information about the document. The ability to identify the various subtopics could let one quickly build outlines of the essential points.

More recently, the web's proliferation has led to an overwhelming increase in readily available information, but finding the information one needs can be difficult. Search engines and directories classify and organize this information at the multi-document level, but there is still a need for a system that can provide organization within long, information-rich documents. A good segmentation system, perhaps combined with summarization and information extraction technologies, could fill this niche. A highly accurate segmentation system would therefore be useful in these times of overly abundant, undocumented data.
The system described in this paper is not yet up to this daunting task, but it is an interesting experiment in building a system that automatically locates topic boundaries. This paper will review the algorithm
behind the system as well as some of the practical aspects of the implementation, and will conclude with a discussion of the results and some possible extensions of the current system.

ALGORITHM AND IMPLEMENTATION

Several approaches to text segmentation have been presented in the literature. The approach used in this paper is based on Hearst's TextTiling algorithm, a moving-window approach that uses lexical overlap as a means of detecting topic coherence. Another approach, called dotplotting, presented by Reynar (1994) and furthered by Choi (2000), finds the similarity between every pair of sentences in the document and uses these results to identify chunks of cohesive sentences. A very different strategy, called lexical chaining, uses lexical semantic similarity information to create chains of related words. Generally, a document will have several of these chains, allowing one to segment the document based on features of the chains, such as their start and end points. Hearst's algorithm is used in this system because it is relatively straightforward and well documented.

Hearst defines three main components of the TextTiling algorithm. First, it divides the input text into sequences of relevant tokens and calculates the cohesion at each potential boundary point. It then uses these cohesion scores to produce depth scores for each potential boundary point that has a lower cohesion than the neighboring boundary points. Using these depth scores, the algorithm selects boundary points where the depth is high relative to the other depth scores, indicating that the gap represents a topic shift in the text. The output is the text file with boundaries inserted at the gaps with sufficiently high depth scores, delineating the various topics by breaking at the least cohesive points.

The first task of this system, then, is to calculate the gap scores. In order to do so, it is first necessary to break the document into appropriately sized sequences of text.
Gap cohesion is computed between a group of text sequences immediately before the gap and a group of text sequences immediately after it. Hearst describes several strategies for breaking the text into sequences. One method is to use chunks of text that contain a fixed number of valuable tokens; for this approach, Hearst recommends 20 tokens per sequence. The benefit of this approach is that each sequence carries the same amount of information as the others. The other method is to assign each sentence in the document to its own sequence. One advantage of this approach is that the boundaries tested are sentence boundaries rather than mid-sentence boundaries, and thus better represent where a change in topic is most likely to occur. The other, more practical advantage is that if the system finds the gap scores at sentence boundaries, then it is straightforward to insert the segmentation break points; the fixed-token method requires deciding upon the nearest sentence boundary. This system uses the one-sentence-per-sequence approach.

The system also takes a list of stop words, which are words that are uninformative regarding the topic of a particular passage, such as the, and, they, we, a, will, can, and have. Eliminating these stop words prevents the system from getting distracted by irrelevant data.

The gap cohesion score is found by creating a vector from the token counts found in some fixed number n of sentence sequences immediately before the gap, and another vector from the token counts found in the same number n of sequences immediately following the gap. Hearst suggests a number of sequences approximately equal to the average paragraph length in sentences. A vector similarity metric, such as the cosine similarity, is then applied to these two vectors to obtain an estimate of the cohesiveness between the two sections. For count vectors a and b, the cosine similarity can be computed as

$$\cos(a, b) = \frac{\sum_t a_t b_t}{\sqrt{\sum_t a_t^2 \, \sum_t b_t^2}}$$

where t ranges over the tokens. This number is called the gap score, and it is calculated at each potential boundary location, yielding a distribution of gap scores of the form seen in Figure 1.

FIGURE 1: Gap score results from analysis of a concatenation of 10 New York Times articles. The horizontal axis is the gap number; the vertical axis is the gap score, measured by the cohesion of adjacent blocks. Greater vertical axis values indicate higher levels of cohesion. The breaks between the various articles tend to correlate with the low points in the graph.

The next step is the smoothing process. As Figure 1 shows, the initial computation of gap scores leaves one wanting clearer boundary markers, since many small local minima might lead to too many small segments in the output. The system lessens the effect of these small local extrema using an average smoothing technique with a flexible window size. Under this scheme, gap score $s_i$ is replaced by

$$\frac{s_{i-k/2} + \dots + s_i + \dots + s_{i+k/2}}{k + 1}$$

for some optimally configured k.
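As a minimal sketch of the gap score computation described above, the following Python assumes one sentence per sequence, a simple regular-expression tokenizer, and a short illustrative stop word list. The names tokenize, cosine and gap_scores are hypothetical and are not taken from the original system, which reads its stop words from a file.

```python
import math
import re
from collections import Counter

# Hypothetical, abbreviated stop word list (the original system loads one from a file).
STOP_WORDS = {"the", "and", "they", "we", "a", "will", "can", "have"}


def tokenize(sentence, stop_words=STOP_WORDS):
    """Lowercase a sentence, split it into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in stop_words]


def cosine(a, b):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(sentences, n=3):
    """Score every gap between adjacent sentences.

    Gap i lies between sentences[i] and sentences[i + 1]. The blocks on either
    side hold up to n sentences (fewer near the ends of the text, an assumption
    of this sketch).
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for gap in range(len(bags) - 1):
        before = sum(bags[max(0, gap + 1 - n):gap + 1], Counter())
        after = sum(bags[gap + 1:gap + 1 + n], Counter())
        scores.append(cosine(before, after))
    return scores
```

Calling gap_scores on a list of sentences returns one score per sentence boundary, so a text of m sentences yields m - 1 gap scores. A score of 0 means the two blocks share no non-stop-word tokens.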
The size of k, of course, should depend on the type of document being segmented and the granularity of segmentation desired. A smaller k value leaves more of the original information intact, making it a good choice for shorter texts like newspaper articles, but it can lead to too much fragmentation by failing to eliminate enough noise. Larger values of k eliminate the subtleties in the data, and thus are useful when segmenting a larger text. Note that in this implementation, if there are not enough gap scores to smooth using the chosen k value, the window size collapses to a suitable value. This allows the system to smooth the score distribution near the beginning of the text. See Figure 2 for a visual representation of the effects of smoothing on the gap scores.
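The smoothing step can be sketched as follows. The paper does not say exactly how the window collapses near the ends of the text, so the rule used here (shrink the window symmetrically until it fits) is an assumption, and the function name smooth is hypothetical.

```python
def smooth(scores, k=4):
    """Replace each score by the mean of itself and up to k/2 neighbours on each side."""
    half = k // 2
    smoothed = []
    for i in range(len(scores)):
        # Assumed collapse rule: shrink the half-width so the window stays
        # symmetric and never runs past either end of the list.
        h = min(half, i, len(scores) - 1 - i)
        window = scores[i - h:i + h + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With k = 0 the scores are returned unchanged, which corresponds to the unsmoothed case.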
Now that the correlation scores have been calculated and smoothed, the next step is to locate the high and low points in this set of data. A list of the peaks is obtained by culling the scores for local maxima, and then each pair of adjacent peaks is used to find the lowest gap score in the valley between them. Using these local minima and their neighboring local maxima, it is straightforward to calculate the depth score, which is the difference in height between the left peak and the low point, plus the difference between the right peak and the low point. The depth score indicates the lack of correlation at that gap relative to the correlation at the nearest maxima. Thus, if the depth score is high, then the correlation is particularly low relative to the nearby preceding and succeeding gaps. If the depth score of a gap is low, then the gap is most likely not a break, since its gap score does not differ from those of its neighbors as much as at the other candidate gaps.

FIGURE 2: SMOOTHING OF NEW YORK TIMES GAP SCORES. The four panels show the effect of smoothing on the New York Times data with various window sizes: concatenated Times articles with no smoothing (k = 0), and the same data smoothed with window sizes 10 (k = 10), 20 (k = 20), and 30 (k = 30).

To find the boundary points, the system finds the depth scores that are sufficiently large relative to the other candidate depth scores in the document. This is accomplished by including only those depth scores that exceed mean - c × (standard deviation), for some optimally configured value c. Hearst recommends a value of 1/2 based on her experiments. Larger values of c increase the number of inserted boundary points.

EVALUATION AND RESULTS

Evaluation of the system's performance consists of running the system on a concatenation of newspaper articles.
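Looking back at the depth scoring and boundary selection steps of the algorithm, a minimal sketch might look like the following. The treatment of end points and plateaus as peaks, the use of the population standard deviation, and the names depth_scores and select_boundaries are all assumptions of this sketch, not details confirmed by the paper.

```python
from statistics import mean, pstdev


def depth_scores(scores):
    """Return {gap index: depth} for the lowest gap between each pair of adjacent peaks."""
    last = len(scores) - 1
    # Assumption: a peak is any score not lower than its neighbours; the two end
    # points may count as peaks so valleys near the edges of the text are kept.
    peaks = [i for i in range(len(scores))
             if (i == 0 or scores[i] >= scores[i - 1])
             and (i == last or scores[i] >= scores[i + 1])]
    depths = {}
    for left, right in zip(peaks, peaks[1:]):
        low = min(range(left, right + 1), key=scores.__getitem__)
        depth = (scores[left] - scores[low]) + (scores[right] - scores[low])
        if depth > 0:
            depths[low] = depth
    return depths


def select_boundaries(depths, c=0.5):
    """Keep the gaps whose depth exceeds mean - c * (standard deviation)."""
    if not depths:
        return []
    values = list(depths.values())
    cutoff = mean(values) - c * pstdev(values)
    return sorted(gap for gap, depth in depths.items() if depth > cutoff)
```

Raising c lowers the cutoff, so more gaps qualify as boundaries, which matches the behaviour described for the threshold coefficient.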
Newspaper articles seem a decent choice of data because they are readily available and reasonably short, so they can be concatenated to obtain longer documents whose topic structure is already known. One potential problem with the use of newspaper articles is that they don't necessarily contain only one major topic. An article might contain several subtopics that are no more relevant to one another than to the other articles in the data, which could lead to boundaries being inserted mid-article.

It would have been informative and worthwhile to test the system against the segmentation choices made by human judges, as Hearst did in the original evaluation of her system. Hearst's evaluation compared her implementation's performance to human judgement, and it fared relatively well, with an average precision of 0.66 compared to the judges' 0.81, and an average recall of 0.61, likewise measured against the judges' recall. Indeed, when run on non-test data, the segmentation of this system seems quite reasonable.

The tests were run with a variety of parameter specifications. The default parameters of the system were determined by taking the parameters that yielded the highest combined level of precision and recall. In the initial tests, the smoothing window sizes 10 and 20 were found to be too large and significantly hurt both precision and recall. In the second round of tests, the parameters were kept much more moderate. The results of these tests are attached to this document. The best precision score was 0.77 when run on the New York Times texts, and it was accompanied by a recall score of 0.77 as well. While these scores may sound impressive, it is important to note that they were numerically evaluated on only this one set of data, so it is unlikely that those parameters would return such high scores in all circumstances.

FURTHER RESEARCH

This implementation makes no use of structural cues in the text, and it would be interesting, and likely beneficial, to take this structure into account.
This could be done by modifying the algorithm to assign breaks only at the nearest paragraph boundary, rather than ignoring white space as this implementation does. The choice was made to ignore white space information in order to allow for greater flexibility in the text data to be segmented. However, if the system were operating within a narrower domain, it would be advantageous to tune it to take advantage of the available cues. For example, if the system were applied to HTML-tagged web page texts, then it would probably be useful to weight the segmentation scheme to break at <P> paragraph boundary tags or <BR> break tags.

Another avenue of research is key word and sentence extraction from the sections obtained using this segmenting system, producing a summary or outline of the topics covered in the document. This might be done using a key sentence extraction technique such as those used in summarization systems. It would be an interesting research topic to try to improve summarization systems by using a segmentation system to break the text into its subtopics, then finding the key sentence summaries for each topic.

Other segmentation systems use a stemming routine in the preprocessing stage. Hearst ignores stem values and uses the bare words, but it would be worthwhile to see how using the stems in the similarity measure might affect the segments produced.

Finally, TextTiling is language independent, and uses no semantic information in measuring cohesiveness. Rather than basing the similarity measure on the number of occurrences of words in the sequence, it might be beneficial to base it on the occurrences of semantic classes of words. This might be done using the synonyms provided by WordNet, perhaps in combination with a sense disambiguator to determine the intended sense.

SUMMARY

This paper describes research in text segmentation, specifically Hearst's text segmentation algorithm TextTiling. The system presented in this paper uses the TextTiling algorithm to compute the cohesion between blocks of text and determine the most likely boundary locations. While this system fails to perform as well as many of the other segmentation systems recently presented in the literature, it is on the right path and can produce good results with the proper parameters.

REFERENCES

Choi, F. 2000. Advances in domain independent linear text segmentation. To appear in Proceedings of NAACL'00, Seattle, USA.

Hearst, M. 1993. TextTiling: A quantitative approach to discourse segmentation. Technical Report 93/24, U. of California, Berkeley.

Hearst, M. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), New Mexico, USA.
Ponte, J. M. and Croft, W. B. 1997. Text segmentation by topic. In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries.

Reynar, J. C. 1994. An automatic method of finding topic boundaries. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), New Mexico, USA.

Richmond, K., Smith, A. and Amitay, E. 1997. Detecting subject boundaries within text: A language independent statistical approach. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-97), Providence, Rhode Island, August 1997.
ABOUT THE SOFTWARE

The programs included are everything one needs to get started segmenting text. Several properly formatted documents are already included, but it is straightforward to make new ones as well. To segment a text document using this segmentation system:

1. Run sentencesnipper on the text. Sentencesnipper is a quick and dirty sentence boundary detection system. It takes ASCII text as input along with an optional (but highly recommended) list of common abbreviations. The output of sentencesnipper is a printout of each sentence separated by two newline characters. An example is as follows:

    %> sentencesnipper/sentencesnipper ../raw_data/basketball abbreviations

    The players dispersed after a tense timeout, but a frantic Coach Jeff Van Gundy was still standing on the court.

    There were just 12.4 seconds on the clock, and all his team needed was one last defensive stop to leave the Miami Heat in pieces once again.

    ...

Note that sentencesnipper is not a full-fledged sentence boundary detection program. It sometimes has problems with abbreviations (even with the abbreviations file included), it commonly inserts two spaces instead of one, and there may well be other quirks yet to be discovered. Generally, though, it does a good job of splitting the sentences apart, and is quite appropriate for this particular task.

2. Run segment on the snipped text. Segment is the actual text segmentation program. It requires only one command argument, the document to be segmented. It also takes four optional arguments: a list of stopwords, the threshold coefficient, the comparison size, and the smoothing window size. For example:

    %> segment data/unmarked_data/nytimes.unmarked stopwords 1 10 6

This command runs segment on the nytimes.unmarked data file, with the stopwords file, a threshold coefficient of 1 (a higher number translates to an increased tendency to break at less salient gaps), a comparison size of 10 (10 sentences before the gap compared to 10 after), and a smoothing window size of 6 (the average of the 6 surrounding gap scores plus the one to be replaced).

3. Evaluate using evaluation.pl. This is the third component of the package, and it is used to test the accuracy of the segment program's output against a marked version of the same text. The marked text file should be chopped into sentences using sentencesnipper, with each segment boundary marked with a <--BREAK--> statement, with one newline character between the statement and both the preceding and following sentences. evaluation.pl takes the name of the data to be tested, the name of the previously marked data, and an integer indicating the leniency. Here is an example of how to run it:

    %> evaluation/evaluation.pl ../.../nytimes.results ../.../nytimes.marked 2

This compares the nytimes.results file with the nytimes.marked file, and counts a successful boundary identification even if the break is two or fewer sentences from the actual break. Here is an example of the output of the program:

                      Actual
    System       target    !target
    selected     8         25
    !selected
    Precision =
    Recall =

If any of this doesn't work right or if you have questions, please contact brentf@stanford.edu.
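The lenient scoring performed in the evaluation step can be illustrated with a short Python sketch. This is not a port of evaluation.pl: the function name evaluate, the representation of breaks as sentence indices, and the rule that each actual break may be credited only once are assumptions made for illustration.

```python
def evaluate(system_breaks, actual_breaks, leniency=0):
    """Return (precision, recall) for two lists of break positions.

    Positions are sentence indices. A system break counts as correct when an
    unmatched actual break lies within `leniency` sentences of it.
    """
    unmatched = sorted(actual_breaks)
    hits = 0
    for guess in sorted(system_breaks):
        near = [b for b in unmatched if abs(b - guess) <= leniency]
        if near:
            # Assumption: each actual break can be credited to one guess only.
            unmatched.remove(min(near, key=lambda b: abs(b - guess)))
            hits += 1
    precision = hits / len(system_breaks) if system_breaks else 0.0
    recall = hits / len(actual_breaks) if actual_breaks else 0.0
    return precision, recall
```

For instance, with system breaks at sentences 10, 21 and 40, actual breaks at 9, 20, 30 and 50, and a leniency of 2, two of the three system breaks are credited, giving a precision of 2/3 and a recall of 1/2.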
These are the results of the second set of tests. The left field is the name of the output file, where the first number in the name is the threshold coefficient, the second is the comparison size, and the third is the smoothing window size. Notice that as we decrease our threshold, disallowing the less pronounced breaks, precision increases as recall decreases. Also, notice that a smoothing window size of 4 usually gives better results than the other window sizes, and a comparison size of 7 also seems to give better results. According to this data, the magic numbers are a threshold of 0.5, a comparison size of 7 sentences, and a smoothing window of size 4, since these figures yield the highest precision score of 0.77 and a decent recall score of 0.77 as well. However, to maintain some degree of generality and ensure that these good results are not specific only to this data, the default values of the actual system will have a weaker threshold of 0 rather than 0.5, ensuring that some segmentation will occur in most texts.

[Table: Output file, Precision, Recall. Rows are named nytimes_<threshold>_<comparison size>_<window size>, with three rows for each combination of threshold coefficient (0.5, 0.25, 0, -0.25, -0.5) and comparison size (3, 5, 7).]
More informationNotes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1
Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial
More informationGetting Started with Deliberate Practice
Getting Started with Deliberate Practice Most of the implementation guides so far in Learning on Steroids have focused on conceptual skills. Things like being able to form mental images, remembering facts
More informationSouth Carolina English Language Arts
South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationHuman Emotion Recognition From Speech
RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati
More informationLecture 1: Machine Learning Basics
1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3
More informationMETHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS
METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar
More informationPowerTeacher Gradebook User Guide PowerSchool Student Information System
PowerSchool Student Information System Document Properties Copyright Owner Copyright 2007 Pearson Education, Inc. or its affiliates. All rights reserved. This document is the property of Pearson Education,
More informationThe Effect of Extensive Reading on Developing the Grammatical. Accuracy of the EFL Freshmen at Al Al-Bayt University
The Effect of Extensive Reading on Developing the Grammatical Accuracy of the EFL Freshmen at Al Al-Bayt University Kifah Rakan Alqadi Al Al-Bayt University Faculty of Arts Department of English Language
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationDefragmenting Textual Data by Leveraging the Syntactic Structure of the English Language
Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Nathaniel Hayes Department of Computer Science Simpson College 701 N. C. St. Indianola, IA, 50125 nate.hayes@my.simpson.edu
More informationWhat is beautiful is useful visual appeal and expected information quality
What is beautiful is useful visual appeal and expected information quality Thea van der Geest University of Twente T.m.vandergeest@utwente.nl Raymond van Dongelen Noordelijke Hogeschool Leeuwarden Dongelen@nhl.nl
More informationStrategic Practice: Career Practitioner Case Study
Strategic Practice: Career Practitioner Case Study heidi Lund 1 Interpersonal conflict has one of the most negative impacts on today s workplaces. It reduces productivity, increases gossip, and I believe
More informationLoughton School s curriculum evening. 28 th February 2017
Loughton School s curriculum evening 28 th February 2017 Aims of this session Share our approach to teaching writing, reading, SPaG and maths. Share resources, ideas and strategies to support children's
More informationTrend Survey on Japanese Natural Language Processing Studies over the Last Decade
Trend Survey on Japanese Natural Language Processing Studies over the Last Decade Masaki Murata, Koji Ichii, Qing Ma,, Tamotsu Shirado, Toshiyuki Kanamaru,, and Hitoshi Isahara National Institute of Information
More informationPhysics 270: Experimental Physics
2017 edition Lab Manual Physics 270 3 Physics 270: Experimental Physics Lecture: Lab: Instructor: Office: Email: Tuesdays, 2 3:50 PM Thursdays, 2 4:50 PM Dr. Uttam Manna 313C Moulton Hall umanna@ilstu.edu
More informationA GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING
A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING Yong Sun, a * Colin Fidge b and Lin Ma a a CRC for Integrated Engineering Asset Management, School of Engineering Systems, Queensland
More informationTU-E2090 Research Assignment in Operations Management and Services
Aalto University School of Science Operations and Service Management TU-E2090 Research Assignment in Operations Management and Services Version 2016-08-29 COURSE INSTRUCTOR: OFFICE HOURS: CONTACT: Saara
More informationLQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization
LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization Annemarie Friedrich, Marina Valeeva and Alexis Palmer COMPUTATIONAL LINGUISTICS & PHONETICS SAARLAND UNIVERSITY, GERMANY
More informationDiagnostic Test. Middle School Mathematics
Diagnostic Test Middle School Mathematics Copyright 2010 XAMonline, Inc. All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by
More informationUnvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition
Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Hua Zhang, Yun Tang, Wenju Liu and Bo Xu National Laboratory of Pattern Recognition Institute of Automation, Chinese
More informationEvaluation of Usage Patterns for Web-based Educational Systems using Web Mining
Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl
More informationEvaluation of Usage Patterns for Web-based Educational Systems using Web Mining
Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl
More informationLearning From the Past with Experiment Databases
Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University
More informationEdexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE
Edexcel GCSE Statistics 1389 Paper 1H June 2007 Mark Scheme Edexcel GCSE Statistics 1389 NOTES ON MARKING PRINCIPLES 1 Types of mark M marks: method marks A marks: accuracy marks B marks: unconditional
More informationMaster Program: Strategic Management. Master s Thesis a roadmap to success. Innsbruck University School of Management
Master Program: Strategic Management Department of Strategic Management, Marketing & Tourism Innsbruck University School of Management Master s Thesis a roadmap to success Index Objectives... 1 Topics...
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationLinking the Common European Framework of Reference and the Michigan English Language Assessment Battery Technical Report
Linking the Common European Framework of Reference and the Michigan English Language Assessment Battery Technical Report Contact Information All correspondence and mailings should be addressed to: CaMLA
More informationScoring Guide for Candidates For retake candidates who began the Certification process in and earlier.
Adolescence and Young Adulthood SOCIAL STUDIES HISTORY For retake candidates who began the Certification process in 2013-14 and earlier. Part 1 provides you with the tools to understand and interpret your
More informationMeasurement. Time. Teaching for mastery in primary maths
Measurement Time Teaching for mastery in primary maths Contents Introduction 3 01. Introduction to time 3 02. Telling the time 4 03. Analogue and digital time 4 04. Converting between units of time 5 05.
More informationHLTCOE at TREC 2013: Temporal Summarization
HLTCOE at TREC 2013: Temporal Summarization Tan Xu University of Maryland College Park Paul McNamee Johns Hopkins University HLTCOE Douglas W. Oard University of Maryland College Park Abstract Our team
More informationEvidence for Reliability, Validity and Learning Effectiveness
PEARSON EDUCATION Evidence for Reliability, Validity and Learning Effectiveness Introduction Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies
More informationThe Role of String Similarity Metrics in Ontology Alignment
The Role of String Similarity Metrics in Ontology Alignment Michelle Cheatham and Pascal Hitzler August 9, 2013 1 Introduction Tim Berners-Lee originally envisioned a much different world wide web than
More informationSoftware Maintenance
1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories
More informationAssessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2
Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2 Ted Pedersen Department of Computer Science University of Minnesota Duluth, MN, 55812 USA tpederse@d.umn.edu
More informationDigital Fabrication and Aunt Sarah: Enabling Quadratic Explorations via Technology. Michael L. Connell University of Houston - Downtown
Digital Fabrication and Aunt Sarah: Enabling Quadratic Explorations via Technology Michael L. Connell University of Houston - Downtown Sergei Abramovich State University of New York at Potsdam Introduction
More informationA Graph Based Authorship Identification Approach
A Graph Based Authorship Identification Approach Notebook for PAN at CLEF 2015 Helena Gómez-Adorno 1, Grigori Sidorov 1, David Pinto 2, and Ilia Markov 1 1 Center for Computing Research, Instituto Politécnico
More informationBeyond the Pipeline: Discrete Optimization in NLP
Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We
More informationIndividual Component Checklist L I S T E N I N G. for use with ONE task ENGLISH VERSION
L I S T E N I N G Individual Component Checklist for use with ONE task ENGLISH VERSION INTRODUCTION This checklist has been designed for use as a practical tool for describing ONE TASK in a test of listening.
More informationTRAITS OF GOOD WRITING
TRAITS OF GOOD WRITING Each paper was scored on a scale of - on the following traits of good writing: Ideas and Content: Organization: Voice: Word Choice: Sentence Fluency: Conventions: The ideas are clear,
More informationCHAPTER 4: REIMBURSEMENT STRATEGIES 24
CHAPTER 4: REIMBURSEMENT STRATEGIES 24 INTRODUCTION Once state level policymakers have decided to implement and pay for CSR, one issue they face is simply how to calculate the reimbursements to districts
More informationAlgebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview
Algebra 1, Quarter 3, Unit 3.1 Line of Best Fit Overview Number of instructional days 6 (1 day assessment) (1 day = 45 minutes) Content to be learned Analyze scatter plots and construct the line of best
More informationArizona s English Language Arts Standards th Grade ARIZONA DEPARTMENT OF EDUCATION HIGH ACADEMIC STANDARDS FOR STUDENTS
Arizona s English Language Arts Standards 11-12th Grade ARIZONA DEPARTMENT OF EDUCATION HIGH ACADEMIC STANDARDS FOR STUDENTS 11 th -12 th Grade Overview Arizona s English Language Arts Standards work together
More informationActivities, Exercises, Assignments Copyright 2009 Cem Kaner 1
Patterns of activities, iti exercises and assignments Workshop on Teaching Software Testing January 31, 2009 Cem Kaner, J.D., Ph.D. kaner@kaner.com Professor of Software Engineering Florida Institute of
More informationNCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches
NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationUniversity of Groningen. Systemen, planning, netwerken Bosman, Aart
University of Groningen Systemen, planning, netwerken Bosman, Aart IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document
More informationLeader s Guide: Dream Big and Plan for Success
Leader s Guide: Dream Big and Plan for Success The goal of this lesson is to: Provide a process for Managers to reflect on their dream and put it in terms of business goals with a plan of action and weekly
More informationSession 2B From understanding perspectives to informing public policy the potential and challenges for Q findings to inform survey design
Session 2B From understanding perspectives to informing public policy the potential and challenges for Q findings to inform survey design Paper #3 Five Q-to-survey approaches: did they work? Job van Exel
More informationLeveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word Similarity Balamurali A.R., Subhabrata Mukherjee, Akshat Malu and Pushpak Bhattacharyya Dept. of Computer Science and Engineering, IIT Bombay 6th International Global
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More information