Concept-Instance Relation Extraction from Simple Noun Sequences Using a Full-Text Search Engine


Asuka Sumida (1), Kentaro Torisawa (1), and Keiji Shinzato (2)
(1) Graduate School of Information Science, Japan Advanced Institute of Science and Technology
(2) Graduate School of Informatics, Kyoto University

Abstract. This paper describes a simple method for acquiring concept-instance relations from simple noun sequences that frequently appear in Japanese Web documents. In Japanese, many noun sequences consist of two NPs that stand in a concept-instance relation. This phenomenon is similar to apposition in English but differs in that many of these noun sequences provide no explicit clues, such as the proper-noun capitalization or commas used in English apposition, that indicate the boundary between the concept name and the instance name. We developed a method that detects such implicit boundaries between concept names and instance names and filters out erroneous concept-instance relations by using a search engine.

1 Introduction

Lexico-syntactic patterns have been used to automatically acquire hyponymy relations or concept-instance relations from particular types of expressions [1-6]. Such relations are crucial in many NLP applications. For instance, Q&A systems require such relations to answer "Who/what is" questions [2] (e.g., "What is Rakuten Market?" "It is a virtual shopping center in Japan."). However, the relations acquired by previous methods do not have sufficient coverage for practical applications such as Q&A, and a means of acquiring hyponymy relations and concept-instance relations from a wider range of expressions is still needed.

The goal of this study has been to acquire a large number of concept-instance relations from simple noun sequences in Japanese, to which existing pattern-based methods cannot be applied. In Japanese, many noun sequences consist of two NPs that have a concept-instance relation.
There are two types of such noun sequences, as shown in Figure 1. The first type consists of sequences in which the boundary between the concept name and the instance name is explicitly marked by quotation marks. Example (A) in Figure 1 shows a case where the concept-instance boundary is specified by quotation marks. The second type consists of sequences in which no evident clue indicates the boundary. Example (B) shows such a sequence.

(A) movie "monk story"    concept name: movie    instance name: Monk Story
(B) hotel Maruei          concept name: hotel    instance name: Maruei
Fig. 1. Examples of Japanese noun sequences

Note that simple pattern-based methods can be applied to the first type by including the quotation marks in the patterns, but they cannot be applied to the second type. This also means there is no evident way to distinguish the noun sequences representing concept-instance relations from others. In addition, unlike English, Japanese does not have a system of capitalization for proper nouns. This makes it difficult to detect the concept-instance boundary.

Our objective, therefore, is to acquire concept-instance relations from noun sequences that have no quotation marks. To achieve this, we use knowledge obtained from the concept-instance relations acquired from sequences that do include quotation marks. We also utilize a full-text search engine built on a large Web repository. Another important point is that although we designed our method for Japanese, we have also found similar noun sequences in languages such as Chinese and believe that our method can be extended to handle those languages. In addition, as a byproduct of our work, we have acquired a large number of highly precise relations from the noun sequences with quotation marks.

A similar task has been addressed by Fleischman et al. for English [2]. Our method differs from theirs in the following ways.
I) Their method relies on explicit clues such as commas and capitalization.
We do not use such clues since a considerable number of concept-instance relations can be expressed without them in Japanese and, in particular, Japanese does not have any orthographical system corresponding to capitalization in English.
II) Though Fleischman's method relies heavily on the distinction between proper nouns and other nouns provided by a morphological analyzer, we do not rely on this distinction. This is because the distinction provided by Japanese morphological analyzers is highly unreliable, mainly because of the lack of capitalization.
III) We restrict the concepts in a relation through a certain condition, but the concepts are not limited to particular semantically coherent domains, as they were in the work of Fleischman et al. We do not use any domain-dependent knowledge, which was encoded in the feature sets for machine learning in their work.

2 Approach

We have taken a twofold approach. First, to distinguish noun sequences expressing concept-instance relations from other noun sequences, we use a set of concept names that appear frequently in the concept-instance relations acquired from noun sequences where the concept names and instance names are separated by quotation marks. We regard only noun sequences prefixed by such a concept name as candidates for sequences expressing concept-instance relations. This is based on our intuition that some NPs are likely to refer to concept names and that the noun sequences prefixed by such NPs are likely to represent concept-instance relations. In addition, we have manually checked the head nouns of frequently appearing concept names and removed any that were inappropriate. This increases precision.

Second, we use a full-text search engine as a filter to judge whether extracted instance name candidates are actually instance names. Since instance names are basically proper nouns, we can assume that a phrase of the form "which <instance name>" will not appear even in huge corpora. (We actually gave such a phrase to a search engine as a query and regarded an instance name candidate as a proper instance name if the phrase was not found in any document.) This prevents our procedure from producing inappropriate instance names. In addition, we use other statistics obtained from the search engine to further refine the resulting concept-instance relations.

In our experiments, our procedure produced 4,276 concept-instance relations (without duplicates) from our test set of 6.5 GB of HTML documents. Although our method is quite simple and does not use any advanced learning methods such as Support Vector Machines, the precision of the extracted relations reached 83.5%. As preparation for this series of experiments, we extracted 28,083 candidate concept-instance relations (without duplicates) from noun sequences with quotation marks in our development set (13.1 GB of HTML documents); we obtained 11,154 concept names from the extracted relations and used these names to acquire relations from noun sequences without quotation marks.

Note that we also extracted concept-instance relations from the noun sequences with quotation marks that were found in the same test set (i.e., the 6.5 GB of HTML documents) for the same concept names.
In this case, the number of relations was about 16,000, about four times as many as our method extracted. The precision was also slightly higher than with our method. However, the two methods produced no relations in common, so we think that noun sequences without quotation marks can serve as a knowledge source that augments the relations extracted from other knowledge sources. In addition, although a larger number of concept names might be beneficial, the number is limited by the cost of manually checking concept names and by data sparseness. If we can use a larger Web repository and dedicate more time to checking concept names, this limitation will be considerably relaxed.

3 Method

Our procedure for concept-instance relation acquisition consists of two steps:
Step 1 Acquire concept names from the noun sequences with quotation marks.
Step 2 Acquire concept-instance relations from the noun sequences without quotation marks by using a full-text search engine.
Each step is described below.

3.1 Step 1: Acquisition of Concept Names from the Noun Sequences with Quotation Marks

The purpose of this step is to acquire a set of concept names from the noun sequences with explicit clues (i.e., quotation marks) for use in the concept-instance acquisition from sequences without quotation marks. We first collect a large number of concept-instance relation candidates by using the pattern with quotation marks. The concept name candidates acquired with the pattern are classified according to their head nouns. We then check whether some of the concept-instance relation candidates that include each head noun are proper ones. If the checked candidates include a relatively small number of errors, we regard all the concept names that include that head noun as proper ones, and these are used in Step 2.

As shown in Figure 1 (A), the noun sequences we use in this step have the form

$$N_1 N_2 \cdots N_{i-1}\ \text{“}\, N_i N_{i+1} \cdots N_m \,\text{”} \tag{1}$$

The prefix $N_1 N_2 \cdots N_{i-1}$ before the opening quotation mark becomes a concept name candidate, and the sequence $N_i N_{i+1} \cdots N_m$ inside the quotation marks becomes an instance name candidate. Since Japanese does not put spaces between words, we first need to tokenize a string into a word sequence by using a morphological analyzer (we used MeCab; see footnote 3). Second, we extract noun sequences that match the above pattern and obtain a set of concept-instance relation candidates. Note that we assume all the words labeled by the morphological analyzer as unknown words are nouns. In addition, we have found some problematic subclasses of nouns, as specified in the output of the morphological analyzer: sequences that include such subclasses are less likely to represent concept-instance relations.
We exclude these sequences according to the rules concerning the subclasses shown in Figure 2.

1. If an instance name is a single-word expression and the part-of-speech subclass of the word is either noun suffix, functional noun, number or pronoun, remove the relation from the candidate set.
2. If a concept name is a sequence of either subclass noun suffix or subclass number, remove the relation from the candidate set.
3. If a concept name includes either subclass district/area name, subclass adverbial noun, subclass adjectival noun, or subclass proper noun, remove the relation from the candidate set.
Note that, in MeCab output, the noun class includes the subclasses noun suffix, functional noun, number, pronoun, district/area name, adverbial noun, adjectival noun and proper noun.
Fig. 2. Rules for subclasses of parts-of-speech for filtering in Step 1

3 We used MeCab, which is available from
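The extraction with pattern (1) and the rules of Figure 2 can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the `Token` type, the English subclass labels, and the use of corner brackets as the quotation marks are all hypothetical stand-ins for real morphological analyzer output, and Rule 2 is read here as "every word of the concept name is a noun suffix or a number".

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    surface: str   # the word as written
    pos: str       # coarse part of speech: "noun", "unknown", "symbol", ...
    subclass: str  # noun subclass label (hypothetical English labels)


# Assumption: corner brackets stand in for the quotation marks of pattern (1).
OPEN_QUOTE, CLOSE_QUOTE = "「", "」"

INSTANCE_SINGLE_BAD = {"noun suffix", "functional noun", "number", "pronoun"}
CONCEPT_ALL_BAD = {"noun suffix", "number"}
CONCEPT_ANY_BAD = {"district/area name", "adverbial noun",
                   "adjectival noun", "proper noun"}


def is_noun(tok: Token) -> bool:
    # Unknown words are treated as nouns, as stated in the text.
    return tok.pos in ("noun", "unknown")


def split_quoted(tokens: List[Token]) -> Optional[Tuple[List[Token], List[Token]]]:
    """Match N_1..N_{i-1} OPEN N_i..N_m CLOSE and return (concept, instance)."""
    surfaces = [t.surface for t in tokens]
    if OPEN_QUOTE not in surfaces or surfaces[-1] != CLOSE_QUOTE:
        return None
    i = surfaces.index(OPEN_QUOTE)
    concept, instance = tokens[:i], tokens[i + 1:-1]
    if not concept or not instance:
        return None
    if not all(is_noun(t) for t in concept + instance):
        return None
    return concept, instance


def passes_figure2(concept: List[Token], instance: List[Token]) -> bool:
    # Rule 1: single-word instance name with a problematic subclass.
    if len(instance) == 1 and instance[0].subclass in INSTANCE_SINGLE_BAD:
        return False
    # Rule 2 (one reading): concept name made only of noun suffixes or numbers.
    if all(t.subclass in CONCEPT_ALL_BAD for t in concept):
        return False
    # Rule 3: concept name containing any of the listed subclasses.
    if any(t.subclass in CONCEPT_ANY_BAD for t in concept):
        return False
    return True


# Example (A) of Figure 1, written with English glosses.
example = [
    Token("movie", "noun", "general"),
    Token(OPEN_QUOTE, "symbol", "bracket"),
    Token("monk", "noun", "general"),
    Token("story", "noun", "general"),
    Token(CLOSE_QUOTE, "symbol", "bracket"),
]
result = split_quoted(example)
```

For the example, `result` holds the concept name candidate `movie` and the instance name candidate `monk story`, and `passes_figure2` keeps the pair because none of the three rules fires.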

Through the above automatic procedure, we extracted 69,665 concept-instance relation candidates from the 13.1 GB of HTML documents downloaded from the WWW. The precision of the relations was 58% (for 100 randomly selected samples).

As a next step, for each head noun of the concept names in the extracted relations (i.e., the final noun $N_{i-1}$ in the above pattern), we counted the distinct concept-instance relations whose concept name had that noun in the head-word position, and sorted the head nouns by this count. We found that the top 409 head nouns covered about 60% of the relation candidates and that those head nouns appeared in the head-word positions of 13,381 distinct concept names. Then, for each head noun, we randomly picked five concept-instance relations that included the noun and checked whether the relations were valid. We excluded the head nouns that had more than three erroneous candidates and obtained 301 head nouns. We also removed three useless concept names, (name), (thing) and (name), from the list of concept names that included these head nouns. As a result, the remaining head nouns covered 11,154 concept names. Examples of the head nouns are listed in Figure 3. We use these concept names in Step 2.

When we limited the relation candidates to those including the concept names obtained after the manual check, the number of candidates was 28,135 and the precision of the relations rose to 89% (for 100 randomly selected candidates). This means that the manual filtering of concept-name head nouns improved the precision of the extracted relations and that we could acquire a large number of highly precise concept-instance relations from the noun sequences with quotation marks.

Note that hyponymy relation acquisition from noun sequences with quotation marks for Japanese has already been addressed by Imasumi [6].
(Note that Imasumi's hyponymy relations include both concept-instance relations and hyponymy relations.) He reported more than 90% precision without the manual checking of hypernyms that we performed. This precision was much higher than that of our method. One possible reason is that he acquired both hyponymy relations and concept-instance relations, while we acquired only concept-instance relations. However, we could not obtain such high accuracy even after we relaxed the relations to be acquired to the union of hyponymy relations and concept-instance relations. We now think this higher precision was due to the corpus used. He used newspaper articles, which are written by trained writers who as a group show relatively little diversity compared to the authors of the HTML documents we used.

(food stuffs) (flower) (functionality) (symposium) (club) (ballet) (movie) (shop/restaurant) (software) (weapon)
Fig. 3. Examples of head words in concept names
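The head-noun selection in Step 1 can be illustrated with a small sketch. It assumes that a relation is represented as a pair of a concept name (a tuple of words) and an instance name, and that the manual validity judgements for the sampled relations are supplied as booleans; both representations and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (concept name as a tuple of words, instance name); a hypothetical representation.
Relation = Tuple[Tuple[str, ...], str]


def rank_head_nouns(relations: Iterable[Relation]) -> List[Tuple[str, int]]:
    """Count distinct relations per head noun (last word of the concept name)
    and sort the head nouns by that count, largest first."""
    by_head: Dict[str, Set[Relation]] = defaultdict(set)
    for concept, instance in relations:
        by_head[concept[-1]].add((concept, instance))
    counts = [(head, len(rels)) for head, rels in by_head.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


def keep_head_noun(judgements: List[bool], max_errors: int = 3) -> bool:
    """judgements holds the manual labels for the sampled relations (five per
    head noun in the text). A head noun is excluded when more than max_errors
    of the samples are erroneous."""
    errors = sum(1 for ok in judgements if not ok)
    return errors <= max_errors


sample = [
    (("movie",), "Monk Story"),
    (("horror", "movie"), "Some Title"),   # invented example relation
    (("hotel",), "Maruei"),
    (("movie",), "Monk Story"),            # duplicate, counted once
]
ranking = rank_head_nouns(sample)
```

With the sample above, `movie` ranks first with two distinct relations and `hotel` second with one. Whether ties should be broken alphabetically, as done here, is not specified in the text.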

3.2 Step 2: Acquisition of Concept-Instance Relations from the Noun Sequences without Quotation Marks

Step 2 proceeds as follows.
Step 2.A Extract noun sequences from WWW documents.
Step 2.B Filter the sequences and determine the boundary between concept names and instance names by using the concept names obtained in Step 1.
Step 2.C Filter the sequences by using parts-of-speech.
Step 2.D Filter by using a full-text search engine and the length of concept names.

In Step 2.A, we simply extract sequences of nouns and unknown words from WWW documents. In Step 2.B, we pick out only the noun sequences that are prefixed by one of the concept names obtained in Step 1, and assume that the end of the concept name is the boundary between the concept name and an instance name. Note that instance names are currently limited to those consisting of a single word or two words. This is for the sake of simplicity in this discussion. Step 2.C filters out noun sequences according to rules concerning subclasses of parts of speech. These rules were developed during our experiments using our development set and are listed in Figure 4. The output of Step 2.C is a set of noun sequences that have the following form, where $C_j$ and $I_j$ denote nouns and the braces mark the optional second word of the instance name.

$$\underbrace{C_1 C_2 \cdots C_i}_{\text{concept name candidate}}\ \underbrace{I_1 \{I_2\}}_{\text{instance name candidate}} \tag{2}$$

Step 2.D is the key part of the Step 2 procedure. We use hit counts obtained by a full-text search engine on a WWW repository, together with the length of a concept name, according to the following four heuristics. Here, hit(query) denotes the hit count of the string query obtained by the search engine.

Heuristic 1 Attach the Japanese word "dono", which corresponds to the interrogative adjective "which (of)" in English, to the instance name candidate and give it to a search engine as a query. If hit("dono <instance name>") > θ, remove the sequence from the candidates.
This is based on our intuition that instance names behave as proper nouns and the phrase "which <instance name>" will not appear frequently in the WWW repository. If the hit count is large, the instance name is likely to actually be a generic concept.

1. If an instance name contains verbal nouns or noun suffix, remove the relation from the candidate set.
2. If the first word of an instance name is labeled as noun suffix (generic), remove the relation from the candidate set.
Fig. 4. Rules for subclasses of parts-of-speech for filtering in Step 2.C
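Steps 2.B and 2.C, including the rules of Figure 4, might be sketched as below. The word representation, the subclass label strings, and the choice to return every split that matches a known concept name are assumptions made for illustration; the text does not say how competing prefixes are resolved.

```python
from typing import List, Sequence, Set, Tuple

Word = Tuple[str, str]  # (surface form, part-of-speech subclass label)


def split_by_concept_prefix(
    words: Sequence[Word],
    concept_names: Set[Tuple[str, ...]],
) -> List[Tuple[Tuple[str, ...], List[Word]]]:
    """Step 2.B: find splits where a known concept name is the prefix and the
    remaining instance name candidate has one or two words."""
    surfaces = tuple(surface for surface, _ in words)
    splits = []
    for instance_len in (1, 2):
        boundary = len(words) - instance_len
        if boundary >= 1 and surfaces[:boundary] in concept_names:
            splits.append((surfaces[:boundary], list(words[boundary:])))
    return splits


def passes_figure4(instance: Sequence[Word]) -> bool:
    """Step 2.C: apply the two rules of Figure 4 to an instance name candidate."""
    # Rule 1: the instance name contains a verbal noun or a noun suffix.
    if any(sub in ("verbal noun", "noun suffix") for _, sub in instance):
        return False
    # Rule 2: the first word is labeled as noun suffix (generic).
    if instance[0][1] == "noun suffix (generic)":
        return False
    return True


# Example (B) of Figure 1, written with English glosses.
known_concepts = {("hotel",), ("movie",)}
sequence = [("hotel", "general"), ("Maruei", "unknown")]
candidates = split_by_concept_prefix(sequence, known_concepts)
```

Here `candidates` contains a single split with the boundary after `hotel`, and `passes_figure4` keeps the instance name candidate `Maruei`.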

Heuristic 2 Assume that a noun sequence consists of a concept name denoted by $C_1 C_2 \cdots C_i$ and an instance name that consists of two words and is denoted by $I_1 I_2$. If $\mathrm{hit}(C_1 \cdots C_i I_1) / \mathrm{hit}(C_1 \cdots C_i I_1 I_2) > \sigma$, remove the sequence from the candidates. This heuristic checks whether the sequence $I_1 I_2$ is a proper compound noun, as explained later. If the instance name consists of a single word, do nothing.

Heuristic 3 If hit(instance name) > ξ, remove the sequence from the candidates. This is based on our observation that if an instance name is frequently observed in the Web repository, it is likely to be a generic concept.

Heuristic 4 If the concept name is a single-character word, remove the sequence from the candidates. This is based on our observation that the morphological analyzer sometimes fails to recognize a long proper noun and decomposes it into a sequence of single-character words, and such words sometimes coincide with concept names.

Note that these heuristics are applied to the candidate set sequentially in the above order and that the parameters in the heuristics were set as θ = 0, σ = 1.5 and ξ = according to our observations in experiments using our development set. (Note that we used a search engine on a 0.7 TB WWW repository.)

Heuristic 2 needs a more detailed explanation. Figure 5 shows an example, the Japanese translation of "opera house orchestra". Assume the following.
1. "opera" is among the concept names obtained in Step 1.
2. "house orchestra" is not a frequently appearing compound noun (though this might not be the case in English).
3. "which house orchestra" does not appear in the WWW repository.
In this situation, "house orchestra" is likely to be regarded as a proper instance name of the concept "opera". Since "which house orchestra" has a hit count of 0, it is not eliminated by Heuristic 1. Heuristic 3 allows the sequence to remain in the candidate set since "house orchestra" does not appear frequently.
Heuristic 4 does not eliminate the sequence because the concept name "opera" is a two-character word in Japanese. This unrealistic situation is caused by the assumption that the boundary of the two NPs lies between "opera" and "house", while it actually lies between "house" and "orchestra". More precisely, Heuristic 1 fails to recognize the instance name as a generic concept because of this wrong assumption: it queries the phrase consisting of "which" and the rarely appearing compound noun "house orchestra". Heuristic 2 is used to avoid detecting such wrong NP boundaries. Usually "opera house orchestra" is interpreted as "orchestra of an opera house", and the boundary between the two NPs should be between "opera house" and "orchestra". Heuristic 2 detects this true boundary by obtaining the hit count $\mathrm{hit}(C_1 \cdots C_i I_1)$ = hit("opera house"). If this figure is large enough compared to the hit count of the whole sequence $\mathrm{hit}(C_1 \cdots C_i I_1 I_2)$ = hit("opera house orchestra") (i.e., if $\mathrm{hit}(C_1 \cdots C_i I_1) / \mathrm{hit}(C_1 \cdots C_i I_1 I_2)$ is large enough), "opera house" often appears as an independent and acceptable compound noun. This suggests that the boundary is between "house" and "orchestra", not between "opera" and "house".

opera    place/house    orchestra
Fig. 5. An example of an erroneously decomposed sequence

The final output of our procedure is the set of concept-instance relations remaining in the candidate set after the heuristics are applied.

4 Experiments

We first set up a search engine on a 0.7 TB WWW repository (footnote 4), obtained by downloading documents from the WWW. Then we applied our method for acquiring concept-instance relations to 6.5 GB of unseen HTML documents randomly selected from the WWW repository. Table 1 shows A) the number of noun sequences that were produced at each step, B) the precision of the concept-instance relations extracted from 100 samples randomly picked from the set of noun sequences produced by each step (as mentioned later, the precision after Step 2.D was measured from 200 samples), and C) the expected number of correct concept-instance relations estimated from the precision and the number of extracted relations. Note that all the numbers were obtained after removing duplicates from the candidate sets. While we would have liked to measure the recall of the output as well, this would be difficult because of the large number of candidates.

Table 1. Results obtained after each step and heuristic
Step             # of rels.   precision(%)   estimated # of correct rels.
Step 2.A         4,107,
Step 2.B         117,                        ,744
Step 2.C         58,                         ,081
Step 2.D         4,                          ,570
  Heuristic 1    14,                         ,785
  Heuristic 2    6,                          ,052
  Heuristic 3    5,                          ,029
  Heuristic 4    4,                          ,570
Final Results    4,276        83.5           3,570
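Step 2.D, whose per-heuristic results are reported in Table 1, could be sketched as a single filter function. This is a minimal sketch under stated assumptions: `hit` is a caller-supplied function standing in for the search engine, the query strings are built by simple concatenation, the guard against a zero hit count for the whole sequence is an addition not mentioned in the text, and `xi` has no default because the value of ξ is not given in this copy of the paper.

```python
from typing import Callable, Sequence


def passes_step_2d(
    concept: Sequence[str],
    instance: Sequence[str],
    hit: Callable[[str], int],
    xi: int,                      # threshold for Heuristic 3 (value not in the text)
    theta: int = 0,
    sigma: float = 1.5,
    which_prefix: str = "dono",   # assumed query prefix for Heuristic 1
    joiner: str = "",             # Japanese text has no spaces between words
) -> bool:
    """Apply Heuristics 1 to 4 in order; return True if the candidate survives."""
    instance_name = joiner.join(instance)

    # Heuristic 1: "which <instance name>" should be (almost) absent.
    if hit(which_prefix + instance_name) > theta:
        return False

    # Heuristic 2: only for two-word instance names.
    if len(instance) == 2:
        part = hit(joiner.join(list(concept) + [instance[0]]))
        whole = hit(joiner.join(list(concept) + list(instance)))
        if whole > 0 and part / whole > sigma:
            return False

    # Heuristic 3: very frequent instance names are likely generic concepts.
    if hit(instance_name) > xi:
        return False

    # Heuristic 4: a concept name that is a single one-character word.
    if len(concept) == 1 and len(concept[0]) == 1:
        return False

    return True


# Invented hit counts for the "opera house orchestra" example, in English glosses.
fake_counts = {
    "which house orchestra": 0,
    "opera house": 300,
    "opera house orchestra": 10,
    "house orchestra": 5,
    "Maruei": 7,
}


def fake_hit(query: str) -> int:
    return fake_counts.get(query, 0)


opera_ok = passes_step_2d(["opera"], ["house", "orchestra"], fake_hit,
                          xi=1000, which_prefix="which ", joiner=" ")
hotel_ok = passes_step_2d(["hotel"], ["Maruei"], fake_hit,
                          xi=1000, which_prefix="which ", joiner=" ")
```

With these invented counts, the `opera` candidate passes Heuristic 1 but is removed by Heuristic 2, since 300 / 10 exceeds σ = 1.5, while the `hotel` candidate survives all four heuristics. The value `xi=1000` is purely illustrative.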
As a substitute for the recall measure, we used the expected number of correct concept-instance relations. As the table shows, each step improved the precision. The table also shows the results after each heuristic was applied in Step 2.D. These results show that each heuristic also contributed to a higher precision. (For the heuristics, the precision was measured from 200 randomly picked samples.) As the final output, we obtained 4,276 concept-instance relations, and their precision was 83.5%. The relations included 685 concept names. Examples of recognized concept-instance relations are shown in Figure 6. The concept-instance boundaries detected by our procedure are marked by the symbol /.

4 We used freyasx as the search engine software, which is available from

  corporation / Senba (corporation "Seneba")
  sightseeing place / So doumon (sightseeing place "Sodoumon")
  novel / Yoshida school (novel "Yoshida's School")
* prelude tune / G-Major ("G-Major" itself is not a prelude.)
* CD / Romrom ("CD Romrom" is the title of a TV game.)
* indicates an incorrectly recognized relation.
Fig. 6. Examples of noun sequences and the concept-instance relations acquired from them

Table 2. Results obtained from noun sequences with quotation marks
Step                          # of rels.   precision(%)   estimated correct rels.
Before concept restriction    38,                         ,442
After concept restriction     15,                         ,195

Table 2 shows the results of applying concept-instance relation acquisition from the noun sequences with quotation marks (i.e., the same procedure as Step 1 in our method) to the same document set. This table shows the performance before and after we restricted the concept names to the set of concept names used in our procedure. (All the numbers were obtained after removing all duplicates.) The final results of this method included 6,056 concept names. These results were actually better than those of our method. The difference between the precision after restricting the concepts and that of our method is not large, but the number of relations obtained by our method was only 28% of the number acquired from the sequences with quotation marks. However, we could not find any common items between the relations extracted from the sequences with quotation marks and those obtained by our method.
Thus, our method found a significant number of relations that could not be acquired from sequences with quotation marks, so it can be used to augment the relations obtained by using quotation marks and to help alleviate the shortage of relations expected in real-world applications.

Finally, we evaluated other patterns for Japanese used in existing research on hyponymy relation acquisition [6, 5] in the context of concept-instance relation acquisition. The patterns are listed in Figure 7. We applied the patterns to 4.4 GB of HTML documents and obtained 147,056 concept-instance relations. The precision was 14.5% (obtained from 200 random samples). The precision was low because we did not use any optimization techniques. If we scale up the document size to the size of our test set, 6.5 GB, the estimated number of correct relations is about 31,500, and each pattern acquires around 3,937 correct relations on average. This means that our method obtained almost the same number of concept-instance relations as each of the other patterns, with a high precision of 83.5% and without any explicit clues in the surface strings.
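The estimates quoted above can be reproduced with a few lines of arithmetic. The sketch assumes, as the text does, that the number of extracted relations scales linearly with corpus size, and it takes the number of patterns to be eight, which is the count of entries in Figure 7; the variable names are illustrative.

```python
def estimated_correct(n_relations: int, precision: float) -> float:
    """Expected number of correct relations: extracted count times precision."""
    return n_relations * precision


# Our method: 4,276 relations at 83.5% precision on the 6.5 GB test set.
ours = estimated_correct(4276, 0.835)

# Other patterns: 147,056 relations at 14.5% precision on 4.4 GB of documents.
patterns_on_4_4gb = estimated_correct(147056, 0.145)

# Linear scale-up from 4.4 GB to the 6.5 GB test set size.
patterns_on_6_5gb = patterns_on_4_4gb * 6.5 / 4.4

# Average over the eight patterns listed in Figure 7.
per_pattern = patterns_on_6_5gb / 8

print(f"our method:         {ours:,.0f}")
print(f"patterns at 6.5 GB: {patterns_on_6_5gb:,.0f}")
print(f"per pattern:        {per_pattern:,.1f}")
```

The three values come out at roughly 3,570, 31,500 and 3,937.5, which agrees with the figures given in the text and is why our method's yield is described as comparable to that of a single pattern.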

NP NP (NP such as NP)
NP NP (NP such as NP)
NP NP (NP similar to NP)
NP NP (NP such as NP)
NP NP (NP other than NP)
NP NP (NP called NP)
NP NP (NP called NP)
NP NP (NP called NP)
Fig. 7. Other patterns for hyponymy relation acquisition

5 Conclusion and Future Work

We have described a method for extracting concept-instance relations from simple noun sequences that are frequently found in corpora. Our method found 4,276 concept-instance relations in 6.5 GB of HTML documents with 83.5% precision. The method is based on our intuitions that some NPs are likely to refer to concept names and that noun sequences prefixed by such NPs are likely to represent concept-instance relations. In addition, our method uses a search engine as a filter to remove erroneously extracted concept-instance relations.

We plan to apply our method to a larger WWW repository and to use the resulting concept-instance relations in practical applications such as Q&A, IR and IE. Introducing compound-noun analysis techniques [7] into our method will also be an important research direction. Another direction will be to extend the concept-instance relations obtained by our method with relations extracted from itemizations or lists in HTML documents, which some previous studies have used to extract concept-instance relations [8, 9].

References

1. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th International Conference on Computational Linguistics. (1992)
2. Fleischman, M., Hovy, E., Echihabi, A.: Offline strategies for online question answering: Answering questions before they are asked. In: ACL2003. (2003)
3. Caraballo, S.A.: Automatic construction of a hypernym-labeled noun hierarchy from text. In: ACL1999. (1999)
4. Pantel, P., Ravichandran, D., Hovy, E.: Towards terascale knowledge acquisition. In: COLING (2004)
5. Ando, M., Sekine, S., Ishizaki, S.: Automatic extraction of hyponyms from newspaper using lexicosyntactic patterns. In: IPSJ SIG Technical Report 2003-NL-157. (2003)
6. Imasumi, K.: Automatic acquisition of hyponymy relations from coordinated noun phrases and appositions. Master's thesis, Kyushu Institute of Technology (2001)
7. Kobayasi, Y., Tokunaga, T., Tanaka, H.: Analysis of Japanese compound nouns using collocational information. In: COLING1994. (1994)
8. Shinzato, K., Torisawa, K.: Acquiring hyponymy relations from web documents. In: Proceedings of HLT-NAACL2004. (2004)
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165(1) (2005)

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

The Internet as a Normative Corpus: Grammar Checking with a Search Engine The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

A heuristic framework for pivot-based bilingual dictionary induction

A heuristic framework for pivot-based bilingual dictionary induction 2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

Exploiting Wikipedia as External Knowledge for Named Entity Recognition

Exploiting Wikipedia as External Knowledge for Named Entity Recognition Exploiting Wikipedia as External Knowledge for Named Entity Recognition Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292

More information

A Semantic Similarity Measure Based on Lexico-Syntactic Patterns

A Semantic Similarity Measure Based on Lexico-Syntactic Patterns A Semantic Similarity Measure Based on Lexico-Syntactic Patterns Alexander Panchenko, Olga Morozova and Hubert Naets Center for Natural Language Processing (CENTAL) Université catholique de Louvain Belgium

More information

PAGE(S) WHERE TAUGHT If sub mission ins not a book, cite appropriate location(s))

PAGE(S) WHERE TAUGHT If sub mission ins not a book, cite appropriate location(s)) Ohio Academic Content Standards Grade Level Indicators (Grade 11) A. ACQUISITION OF VOCABULARY Students acquire vocabulary through exposure to language-rich situations, such as reading books and other

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

The College Board Redesigned SAT Grade 12

The College Board Redesigned SAT Grade 12 A Correlation of, 2017 To the Redesigned SAT Introduction This document demonstrates how myperspectives English Language Arts meets the Reading, Writing and Language and Essay Domains of Redesigned SAT.

More information

Multilingual Sentiment and Subjectivity Analysis

Multilingual Sentiment and Subjectivity Analysis Multilingual Sentiment and Subjectivity Analysis Carmen Banea and Rada Mihalcea Department of Computer Science University of North Texas rada@cs.unt.edu, carmen.banea@gmail.com Janyce Wiebe Department

More information

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion

The Smart/Empire TIPSTER IR System The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se

Proof Theory for Syntacticians Department of Linguistics Ohio State University Syntax 2 (Linguistics 602.02) January 5, 2012 Logics for Linguistics Many different kinds of logic are directly applicable to formalizing theories in syntax

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,

Chunk Parsing for Base Noun Phrases using Regular Expressions. Let's first let the variable s0 be the sentence tree of the first sentence. NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

Coupling Semi-Supervised Learning of Categories and Relations Coupling Semi-Supervised Learning of Categories and Relations Andrew Carlson 1, Justin Betteridge 1, Estevam R. Hruschka Jr. 1,2 and Tom M. Mitchell 1 1 School of Computer Science Carnegie Mellon University

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

The Ups and Downs of Preposition Error Detection in ESL Writing The Ups and Downs of Preposition Error Detection in ESL Writing Joel R. Tetreault Educational Testing Service 660 Rosedale Road Princeton, NJ, USA JTetreault@ets.org Martin Chodorow Hunter College of CUNY

LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization Annemarie Friedrich, Marina Valeeva and Alexis Palmer COMPUTATIONAL LINGUISTICS & PHONETICS SAARLAND UNIVERSITY, GERMANY

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

ScienceDirect. Malayalam question answering system Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

Comprehension Recognize plot features of fairy tales, folk tales, fables, and myths. 4 th Grade Language Arts Scope and Sequence 1 st Nine Weeks Instructional Units Reading Unit 1 & 2 Language Arts Unit 1& 2 Assessments Placement Test Running Records DIBELS Reading Unit 1 Language Arts

Memory-based grammatical error correction Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,

Short Text Understanding Through Lexical-Semantic Analysis Short Text Understanding Through Lexical-Semantic Analysis Wen Hua #1, Zhongyuan Wang 2, Haixun Wang 3, Kai Zheng #4, Xiaofang Zhou #5 School of Information, Renmin University of China, Beijing, China

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

Reading Grammar Section and Lesson Writing Chapter and Lesson Identify a purpose for reading W1-LO; W2- LO; W3- LO; W4- LO; W5- New York Grade 7 Core Performance Indicators Grades 7 8: common to all four ELA standards Throughout grades 7 and 8, students demonstrate the following core performance indicators in the key ideas of reading,

Informatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy Informatics 2A: Language Complexity and the Chomsky Hierarchy September 28, 2010 Starter 1 Is there a finite state machine that recognises all those strings s from the alphabet {a, b} where the difference

On document relevance and lexical cohesion between query terms Information Processing and Management 42 (2006) 1230 1247 www.elsevier.com/locate/infoproman On document relevance and lexical cohesion between query terms Olga Vechtomova a, *, Murat Karamuftuoglu b,

BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Daffodil International University Institutional Repository DIU Journal of Science and Technology Volume 8, Issue 1, January 2013 2013-01 BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Uddin, Sk.

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and

arxiv: v1 [cs.cl] 2 Apr 2017 Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,

First Grade Curriculum Highlights: In alignment with the Common Core Standards First Grade Curriculum Highlights: In alignment with the Common Core Standards ENGLISH LANGUAGE ARTS Foundational Skills Print Concepts Demonstrate understanding of the organization and basic features

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

Vocabulary Usage and Intelligibility in Learner Language Vocabulary Usage and Intelligibility in Learner Language Emi Izumi, 1 Kiyotaka Uchimoto 1 and Hitoshi Isahara 1 1. Introduction In verbal communication, the primary purpose of which is to convey and understand

Grade 4. Common Core Adoption Process. (Unpacked Standards) Grade 4 Common Core Adoption Process (Unpacked Standards) Grade 4 Reading: Literature RL.4.1 Refer to details and examples in a text when explaining what the text says explicitly and when drawing inferences

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

Leveraging Sentiment to Compute Word Similarity Leveraging Sentiment to Compute Word Similarity Balamurali A.R., Subhabrata Mukherjee, Akshat Malu and Pushpak Bhattacharyya Dept. of Computer Science and Engineering, IIT Bombay 6th International Global

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

ELD CELDT 5 EDGE Level C Curriculum Guide LANGUAGE DEVELOPMENT VOCABULARY COMMON WRITING PROJECT. ToolKit Unit 1 Language Development Express Ideas and Opinions Ask for and Give Information Engage in Discussion ELD CELDT 5 EDGE Level C Curriculum Guide 20132014 Sentences Reflective Essay August 12 th September

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

A Named Entity Recognition Method using Rules Acquired from Unlabeled Data A Named Entity Recognition Method using Rules Acquired from Unlabeled Data Tomoya Iwakura Fujitsu Laboratories Ltd. 1-1, Kamikodanaka 4-chome, Nakahara-ku, Kawasaki 211-8588, Japan iwakura.tomoya@jp.fujitsu.com

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 141 ( 2014 ) 124 128 WCLTA 2013 Using Corpus Linguistics in the Development of Writing Blanka Frydrychova

Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Andreas Vlachos Computer Laboratory University of Cambridge Cambridge, CB3 0FD, UK av308@cl.cam.ac.uk Caroline Gasperin Computer

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature 1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

ELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading ELA/ELD Correlation Matrix for ELD Materials Grade 1 Reading The English Language Arts (ELA) required for the one hour of English-Language Development (ELD) Materials are listed in Appendix 9-A, Matrix

Using Semantic Relations to Refine Coreference Decisions Using Semantic Relations to Refine Coreference Decisions Heng Ji David Westbrook Ralph Grishman Department of Computer Science New York University New York, NY, 10003, USA hengji@cs.nyu.edu westbroo@cs.nyu.edu

Prediction of Maximal Projection for Semantic Role Labeling Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba

Parsing of part-of-speech tagged Assamese Texts IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal

Distant Supervised Relation Extraction with Wikipedia and Freebase Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz

Extracting and Ranking Product Features in Opinion Documents Extracting and Ranking Product Features in Opinion Documents Lei Zhang Department of Computer Science University of Illinois at Chicago 851 S. Morgan Street Chicago, IL 60607 lzhang3@cs.uic.edu Bing Liu

Using dialogue context to improve parsing performance in dialogue systems Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,

Project in the framework of the AIM-WEST project Annotation of MWEs for translation Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

Task Tolerance of MT Output in Integrated Text Processes Task Tolerance of MT Output in Integrated Text Processes John S. White, Jennifer B. Doyon, and Susan W. Talbott Litton PRC 1500 PRC Drive McLean, VA 22102, USA {white_john, doyon jennifer, talbott_susan}@prc.com

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation Statistical Parsing (Following slides are modified from Prof. Raymond Mooney's slides.) Statistical Parsing Statistical parsing uses a probabilistic model of syntax in order to assign probabilities to each

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

Postprint. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at CLEF 2013 Conference and Labs of the Evaluation Forum Information Access Evaluation meets Multilinguality, Multimodality,

A Bayesian Learning Approach to Concept-Based Document Classification Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors

LEXICAL COHESION ANALYSIS OF THE ARTICLE WHAT IS A GOOD RESEARCH PROJECT? BY BRIAN PALTRIDGE A JOURNAL ARTICLE LEXICAL COHESION ANALYSIS OF THE ARTICLE WHAT IS A GOOD RESEARCH PROJECT? BY BRIAN PALTRIDGE A JOURNAL ARTICLE Submitted in partial fulfillment of the requirements for the degree of Sarjana Sastra (S.S.)

Trend Survey on Japanese Natural Language Processing Studies over the Last Decade Trend Survey on Japanese Natural Language Processing Studies over the Last Decade Masaki Murata, Koji Ichii, Qing Ma,, Tamotsu Shirado, Toshiyuki Kanamaru,, and Hitoshi Isahara National Institute of Information

Improving Conceptual Understanding of Physics with Technology INTRODUCTION Improving Conceptual Understanding of Physics with Technology Heidi Jackman Research Experience for Undergraduates, 1999 Michigan State University Advisors: Edwin Kashy and Michael Thoennessen

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

Modeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures Modeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures Ulrike Baldewein (ulrike@coli.uni-sb.de) Computational Psycholinguistics, Saarland University D-66041 Saarbrücken,

5 th Grade Language Arts Curriculum Map 5 th Grade Language Arts Curriculum Map Quarter 1 Unit of Study: Launching Writer s Workshop 5.L.1 - Demonstrate command of the conventions of Standard English grammar and usage when writing or speaking.

Online Updating of Word Representations for Part-of-Speech Tagging Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org

5 Star Writing Persuasive Essay 5 Star Writing Persuasive Essay Grades 5-6 Intro paragraph states position and plan Multiparagraphs Organized At least 3 reasons Explanations, Examples, Elaborations to support reasons Arguments/Counter

Parallel Evaluation in Stratal OT * Adam Baker University of Arizona Parallel Evaluation in Stratal OT * Adam Baker University of Arizona tabaker@u.arizona.edu 1.0. Introduction The model of Stratal OT presented by Kiparsky (forthcoming), has not and will not prove uncontroversial

The taming of the data: The taming of the data: Using text mining in building a corpus for diachronic analysis Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich Background Big data

University of Alberta. Large-Scale Semi-Supervised Learning for Natural Language Processing. Shane Bergsma University of Alberta Large-Scale Semi-Supervised Learning for Natural Language Processing by Shane Bergsma A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

Emmaus Lutheran School English Language Arts Curriculum Emmaus Lutheran School English Language Arts Curriculum Rationale based on Scripture God is the Creator of all things, including English Language Arts. Our school is committed to providing students with

A Computational Evaluation of Case-Assignment Algorithms A Computational Evaluation of Case-Assignment Algorithms Miles Calabresi Advisors: Bob Frank and Jim Wood Submitted to the faculty of the Department of Linguistics in partial fulfillment of the requirements

Data-driven Type Checking in Open Domain Question Answering Data-driven Type Checking in Open Domain Question Answering Stefan Schlobach a,1 David Ahn b,2 Maarten de Rijke b,3 Valentin Jijkoun b,4 a AI Department, Division of Mathematics and Computer Science, Vrije

METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar

Search right and thou shalt find... Using Web Queries for Learner Error Detection Search right and thou shalt find... Using Web Queries for Learner Error Detection Michael Gamon Claudia Leacock Microsoft Research Butler Hill Group One Microsoft Way P.O. Box 935 Redmond, WA 981052, USA

Outline. Web as Corpus. Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University. nclt Outline Using Web Data for Linguistic Purposes NCLT, Dublin City University Outline Outline 1 Corpora as linguistic tools 2 Limitations of web data Strategies to enhance web data 3 Corpora as linguistic

Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Marek Jaszuk, Teresa Mroczek, and Barbara Fryc University of Information Technology and Management, ul. Sucharskiego

Abstractions and the Brain Abstractions and the Brain Brian D. Josephson Department of Physics, University of Cambridge Cavendish Lab. Madingley Road Cambridge, UK. CB3 OHE bdj10@cam.ac.uk http://www.tcm.phy.cam.ac.uk/~bdj10 ABSTRACT

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

Accuracy (%) # features Question Terminology and Representation for Question Type Classification Noriko Tomuro DePaul University School of Computer Science, Telecommunications and Information Systems 243 S. Wabash Ave. Chicago,
