CIS 630 Class Project
Szu-ting Yi and Susan Converse
18 December 2000

I. Introduction

Compound nouns, or noun-noun compounds, are prevalent in both English and Chinese, and handling them properly poses several challenges for NLP systems. First, there is the problem of identifying the compounds themselves: recognizing their boundaries and their components. For both English and Chinese, determining the components is related to the problem of deciding what unit size should be stored in the lexicon; in the Chinese case it is part of the problem (or solution) of segmenting the text into "words." Second, there is the problem of determining the syntactic structure of the compound itself (assuming a binary-branching parse structure). As with the prepositional-phrase attachment problem, when there are three or more nouns in a compound there is a choice of which components bind more closely to one another. That is, given N1 N2 N3, should the modification relations be left-bracketed as [[N1 N2] N3] or right-bracketed as [N1 [N2 N3]]? Finally, there is the challenge of determining the semantic relation(s) among the components and interpreting the whole compound according to the choices made while partitioning and parsing it.

In a 1995 ACL paper, Mark Lauer (1) proposed some statistical methods for the syntactic analysis of compound nouns, addressing the second of the three problems just outlined. Although Lauer was working on English compounds, his methods are not in principle language-specific, and in this paper we have attempted to apply a similar approach to Chinese noun compounds.

II. Goal and Methods

The goal of our system is to decide the syntactic structure of a three-noun sequence. That is, given a noun triple, our system determines whether it is left-branching or right-branching. To achieve this goal, we tested two different models using four different statistical methods.
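The two candidate structures for a triple can be written down directly as nested pairs; a minimal sketch (the nouns are just illustrative strings):

```python
def bracketings(n1, n2, n3):
    """Return the two possible binary-branching parses of a noun triple."""
    left = ((n1, n2), n3)    # left-bracketed:  [[N1 N2] N3]
    right = (n1, (n2, n3))   # right-bracketed: [N1 [N2 N3]]
    return left, right

left, right = bracketings("calcium", "ion", "exchange")
```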
The two models, called the _adjacency model_ and the _dependency model_, are taken from Lauer's paper. In each model, the strength of the association between two nouns in the triple (e.g., n1 and n2) is compared to the strength of the other possible pairing. For the _adjacency model_ (the model most often found in the literature) the pairings are the two adjacent pairs (n1,n2) and (n2,n3). If the (n1,n2) bond is stronger than the (n2,n3) bond (what "stronger" means will be discussed presently), then the compound is left-branching; otherwise it is assumed to be right-branching. Lauer argues that the _dependency model_ he describes in his paper is a better predictor of branching behavior. The _dependency model_ compares the strength of the (n1,n2) bond against that of the (n1,n3) bond instead of the (n2,n3) bond.

[Figure 1: schematic of the two models. The adjacency model compares the (NN1,NN2) pairing against (NN2,NN3); the dependency model compares (NN1,NN2) against (NN1,NN3).]

Lauer uses the example "calcium ion exchange" to illustrate the motivation for the dependency model. While there appears to be no a priori reason that either "calcium ion" or "ion exchange" would be more frequent than the other, the general understanding of the compound is that calcium characterizes the ions, not the exchange (we don't say "calcium exchange," except in the rare nutritional setting), and we would therefore bracket the compound as follows:

[[calcium ion] exchange]

On the flip side, given the compound "IRCS Friday colloquium," the association between "IRCS" and "colloquium" is far greater than the association between "IRCS" and "Friday," and we would bracket the compound

[IRCS [Friday colloquium]]

Given these two models, we measured the strength of the association between the two nouns in a pair using four different statistical functions. For example, in the adjacency model, given a noun triple (n1 n2 n3), we would calculate function(n1,n2) and function(n2,n3) and compare the values returned for the two pairs. If the former value is greater than the latter, then we say that (n1,n2) are more likely to be put together than (n2,n3) and thus (n1 n2 n3) should have a left-branching structure; otherwise we say the likelihood of (n2,n3) being together is larger, and we assign the noun triple a right-branching structure. We have tried four statistical methods for the function mentioned above.
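The shared decision procedure, together with the four association functions described below, can be sketched as follows. This is only an illustration, not our code: the count dictionaries, the corpus size N, and the particular estimators (e.g., the 2x2-table shortcut formula for chi-square, and the usual sample-proportion approximation for the t score's variance) are assumptions made for the sketch.

```python
import math

def assoc_freq(pair, pairs, singles, N):
    # a) pure frequency of the pair in the corpus
    return pairs.get(pair, 0)

def assoc_pmi(pair, pairs, singles, N):
    # b) pointwise mutual information: log2 of P(n1,n2) / (P(n1) P(n2))
    c12 = pairs.get(pair, 0)
    if c12 == 0:
        return float("-inf")
    return math.log2((c12 / N) / ((singles[pair[0]] / N) * (singles[pair[1]] / N)))

def assoc_t(pair, pairs, singles, N):
    # c) t score: (observed - expected proportion) / sqrt(s^2 / N),
    # approximating the sample variance by the observed proportion
    c12 = pairs.get(pair, 0)
    if c12 == 0:
        return float("-inf")
    p = c12 / N
    mu = (singles[pair[0]] / N) * (singles[pair[1]] / N)
    return (p - mu) / math.sqrt(p / N)

def assoc_chi2(pair, pairs, singles, N):
    # d) chi-square on the 2x2 contingency table of the two nouns
    o11 = pairs.get(pair, 0)
    c1, c2 = singles[pair[0]], singles[pair[1]]
    num = N * (o11 * (N - c1 - c2 + o11) - (c1 - o11) * (c2 - o11)) ** 2
    den = c1 * c2 * (N - c1) * (N - c2)
    return num / den if den else 0.0

def bracket(n1, n2, n3, score, model="adjacency"):
    """Return 'left' or 'right' for the triple under the given model."""
    rival = (n2, n3) if model == "adjacency" else (n1, n3)
    return "left" if score((n1, n2)) > score(rival) else "right"
```

With pair and single-noun counts loaded from the two training files, something like `score = lambda p: assoc_pmi(p, pairs, singles, N)` followed by `bracket(n1, n2, n3, score, model="dependency")` would reproduce the dependency-model comparison of (n1,n2) against (n1,n3).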
The four methods are a pure frequency method, mutual information, the t test, and the chi-square test. The details are as follows.

a) pure word frequency method

function(n1,n2) returns the frequency of the input noun pair (n1,n2) in the corpus, so the frequencies of (n1,n2) and (n2,n3) determine the final decision. This is, of course, the simplest method.

The following three methods are all well known in the statistical natural language processing literature on identifying collocations. The goals and the logic of applying these statistical tests are quite similar; only the motivations behind what triggers the dependency differ. When executing statistical-function(n1,n2), we first assume the null hypothesis -- that if two things happen together it is a coincidence and not because of some dependency between the two events -- and see whether the values we get from applying the statistical methods are large enough to reject the null hypothesis. If they are, then we can positively state that the two things are dependent. The null hypothesis used in our experiments is that the probability of seeing the pair is the probability of seeing n1 times that of seeing n2, as it would be if the two events were independent.

b) mutual information method

The measure we use is pointwise mutual information. Mutual information is an information-theoretically motivated measure for discovering interesting collocations; roughly, it measures how much one word tells us about the other. The mutual information score depends heavily on the frequency of the individual words, however, so instead of saying it is a good measure of dependency, we would rather say it is a good measure of _independence_. In addition, our data do not present good distributions. We therefore also tried the t test and the chi-square test.

c) t test

Another test that has been widely used for determining the degree of dependence between two things is the t test.
The t test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely it is to get a sample of that mean and variance assuming that the sample is drawn from a normal distribution with the expected mean.

d) chi-square test

The t test has been criticized because it assumes that probabilities are approximately normally distributed, which is not true in general. An alternative test for dependence that does not assume normally distributed probabilities is the chi-square test. But just as application of the t test is problematic because of the underlying normality assumption, application of the chi-square test is problematic in cases where the sample size is too small.

III. The Data

The Corpus

We used the Chinese Treebank, a parsed corpus of newswire articles from Xinhuashe (New China News Agency). There are 325 separate articles, with a total of 99,720 words. A "word" in this case is anything tagged with a part-of-speech tag, including punctuation (see the website for the tagging guidelines). A (non-punctuation) word is usually two characters in length, with single- and triple-character words next most common; proper names may have many characters. In this corpus there are 4,453 different noun types (unique nouns). Of these, 2,191 (just about half) occur only once, and 650 appear only twice. These statistics include all tagged text, including headlines and bylines. Because of the nature of the corpus, a few words have artificially high frequencies, such as _dian_ (equivalent to "wire" in a byline), which is the 7th most frequent noun, and Xinhuashe (New China News Agency), which also appears in bylines and is the 24th most frequent noun. There are approximately 3200 sentences, excluding headlines and bylines.

The Training Data

Two versions of the training data were extracted from the sentences (but not the headlines or bylines) of the Chinese Treebank: a syntax-blind version and an enhanced version. We used only part-of-speech tags in extracting the blind version of the training data. For the enhanced version, we also took into account the bracketing information from the Treebank parses. We called the latter training set the "enhanced" version not only because we are taking advantage of more linguistic knowledge, but also because we hoped it would give better results. Take the following sentence as an example.
((IP (LCP-TMP (NP (NT jin1nian2))                  -- this year
              (LC lai2))                           -- coming
     (PU )
     (NP-TPC (DNP (LCP (NP (NP-PN-APP (NR zhong1)  -- China
                                      (NR han2))   -- Korea
                           (QP (CD liang3))        -- two
                           (NP (NN guo2)))         -- country
                       (LC zhi1jian4))             -- between
                  (DEG de0))                       -- particle
             (NP (NN jing1mao4)                    -- economics & trade
                 (NN wang3lai2)))                  -- contacts
     (NP-SBJ (NN fa1zhan3))                        -- expansion
     (VP (VA xun4su4))                             -- rapid
     (PU )))

In the syntax-blind version, any adjacent pair of nouns was counted, ignoring syntactic bracketing. This would be equivalent to only POS-tagging a corpus and then taking any adjacent pairing of nouns. In the above sentence both the pair (jing1mao4, wang3lai2), where the two nouns appear in the same phrase, and the pair (wang3lai2, fa1zhan3), in which the nouns are in different phrases, would be extracted. (Note that in addition to producing pairs that are not motivated by constituent structure, this clearly biases against the dependency calculations, since no (n1,n3) pair would ever be counted.) In the "enhanced" training data, only the pair (jing1mao4, wang3lai2) would be counted, since it appears within a single NP.

In addition to not crossing constituent boundaries, in the "enhanced" version pairs were not extracted from noun compounds of three or more nouns. For example, no noun pair or triple was extracted from the following four-noun expression:

(IP (NP-SBJ (NN wai4zi1)    -- foreign capital/investment
            (NN qi3ye4)     -- enterprise/business
            (NN chu1kou3)   -- export
            (NN chan3pin3)) -- products/produce
    (VP (VV tui4shui4)))    -- drawback...

The blind version of the pairs file, however, would contain three pairs from this expression.

In each set of training data we have two files. One is the single-noun file, with one noun and its corpus frequency on each line. The other is the noun-pair file, in which each line contains a noun pair and its corresponding corpus frequency. Observing the training data, we found two serious limitations. First, because the data set is so small, we face a sparse-data problem. Furthermore, the distribution of our training data is not as representative as that of a much larger data set would be.
For example, in the single-noun file of the enhanced version we have 4453 entries in total; among them, 2191 nouns appear only once, 650 appear twice, and about 72.3% of the nouns appear three or fewer times. The situation is worse in the noun-pair file, where more than three quarters of the noun pairs appear only once.

The Test Data

The test data consist of a file of 1186 noun triples, extracted syntactically using the same kinds of bracketing information that were used in extracting the "enhanced" pairs. Triples were not extracted from noun compounds of four or more nouns, nor were they extracted across constituent boundaries.

IV. Evaluation

We asked three native speakers of Chinese to decide the bracketing structure of the 1186 noun triples in order to develop the gold standard. Two of the three are from Taiwan, and both majored in Linguistics; the third is from Mainland China and has no linguistic background. Since the Chinese Treebank data come from Mainland China news articles, we expected that the judge from Mainland China would make the judgements more intuitively, while the two judges from Taiwan would draw on their linguistic knowledge. The three judges spent from 2 to 7 hours to finish the whole job, with an average of 4 hours.

Several questions were raised while they were doing the judgements:
1) flat structure versus binary branching structure;
2) implicit possessive usage;
3) lexical or syntactic ambiguity;
4) not understanding the term at all.

1) flat structure versus binary branching structure

Whether a noun sequence has a flat structure or a binary-branching structure is always debatable. Sometimes a judge simply could not decide which two nouns should be paired together and had to leave all three nouns at the same level. The triple (sheng3 qu1 shi4 -- province district municipality) is one such example. Cases like these might be interpreted as really being (n1 conjunction n2 conjunction n3), with the conjunctions dropped. This is consistent with what Lauer found. He noted that some noun compounds will be ambiguous even in context: 35 of his 279 noun triples could not be assigned either left- or right-branching structure, even when context was taken into account.
2) implicit possessive usage

At times, a Chinese noun pair (n1 n2) can be interpreted as a noun compound (n1n2) that can be paraphrased as (n1's n2). This kind of usage affects whether a noun triple is judged left- or right-branching. For example, the first two nouns in (dang1di4 jing1ji4 zhi1zhu4 -- that locale economy mainstay) could be construed as "that locale's economy."

3) lexical or syntactic ambiguity

Although common sense might suggest that a word of more than two characters would be less ambiguous (since each character might be seen as refining the meaning), there are still nouns with lexical or syntactic ambiguity, and these make it hard for the annotators to make their judgements.

4) not understanding the term at all

There are just a few of these cases. Each judge had four to six terms that were unknown to them.

The Answer Keys

Based on the three sets of answers, we applied two different guidelines to produce two sets of answer keys. The first guideline is to use only the answers on which all three annotators agreed. By this standard, we got 369 left-branching answers (L), 143 right-branching answers (R), 8 undecidable answers (U), and 666 triples on which at least one annotator disagreed (N). That means we have only 369 + 143 = 512 valid answers. We called this the agreement_all_three version. The second guideline is to use the answers on which at least two annotators agreed. By this standard, we got 685 L, 365 R, 38 U, and 98 N, for a total of 1050 valid answers. This was called the agreement_above_two answer key.

V. The Results

If we use the agreement_all_three answer key and the adjacency model, we get the following results (all figures are percent correct for the method):

using the blind version of the noun pairs:
    chi-square test:     62.7%
    mutual information:  62.3%
    t test:              72.9%
    word frequency:      72.5%

using the enhanced version of the noun pairs:
    chi-square test:     68.0%
    mutual information:  75.9%
    t test:              76.0%
    word frequency:      78.9%
If we use the agreement_all_three answer key and the dependency model, the results are as follows:

using the blind version of the noun pairs:
    chi-square test:     72.5%
    mutual information:  71.9%
    t test:              73.5%
    word frequency:      72.8%

using the enhanced version of the noun pairs:
    chi-square test:     74.8%
    mutual information:  71.3%
    t test:              71.3%
    word frequency:      74.0%

If we use the agreement_above_two answer key, the accuracies are distributed similarly but are lower in general. By comparison, assigning left-branching to all triples results in an accuracy of 72.1% under the agreement_all_three standard, and 65.2% under the agreement_above_two standard.

VI. Discussion

Using the adjacency model, the "enhanced" version of the noun pairs outperformed the syntax-blind version by a wide margin: the accuracy difference ranges from 5% to 13%. If we switch to the dependency model, however, the difference is much less pronounced; it seems that the dependency model does not need as much linguistic information. The chi-square method did not perform well; its poor performance is probably due to too small a sample size. Comparing the t test with mutual information, the t test works a little better. At this point, the word frequency method is the best one. Although the simplest method can, of course, turn out to be the best solution, this result was unexpected. Possible causes bring us back to our training data: because of its over-simplified distribution and small size, the simple word-frequency method came out on top. With more data, or with good ways to normalize our training data, we would expect the t test to perform best.

VII. Future Work

As indicated above, the system has a very severe sparse-data problem.
Most of the single nouns and noun pairs appear only once. To alleviate this shortcoming, tokenizing our training set in a more coarse-grained manner might be a good approach. In other words, we would like to change the tokenization of our training data from the word level used thus far to a class or category level -- that is, to cluster our nouns into several categories. This is the approach Lauer used (borrowed from Resnik and Hearst (1993) (2)). He grouped words into classes and essentially compared the mutual information between the classes the words belonged to instead of the information between the words themselves. Lauer used Roget's Thesaurus for the classes. We will use HowNet, which can be treated as a Chinese semantic network, to define the categories we want. HowNet's noun category is a hierarchical structure with, at deepest, three levels; here we use only the information at the highest level. The algorithm for assigning a class name to a noun entry in our training data is as follows:

1. Get the noun (n1) from our single-noun file.
2. Look it up in the HowNet dictionary.
3. Extract the definition part of that dictionary entry.
4. Extract the highest-level information from the definition obtained in step 3.
5. Assign the noun the class name obtained in step 4.
6. Move on to the next noun entry.

We have done a preliminary test of this algorithm. However, not every noun in our training data has an entry as a noun in the HowNet dictionary, and the overlap between our nouns and the HowNet noun entries is surprisingly limited. This forces us to one of two strategies. We can keep the classifying guidelines derived from HowNet and assign the class name for each noun by hand. Another approach would be to see whether a Chinese Treebank noun that is not categorized as a noun in HowNet appears elsewhere in HowNet with another classification, such as a verb.
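The six steps above can be sketched as follows. The lookup table here is a stand-in: real HowNet entries have their own record format that would need parsing, so the definition strings and the "|"-delimited layout are assumptions made for the sketch.

```python
def assign_classes(nouns, hownet, sep="|"):
    """Map each noun to the top-level class of its HowNet definition.

    `hownet` is assumed (hypothetically) to map a noun to its definition
    string, with the highest-level class as the first sep-delimited field.
    Nouns with no noun entry in HowNet are left unclassified.
    """
    classes = {}
    for noun in nouns:                    # steps 1 and 6: iterate over the single-noun file
        definition = hownet.get(noun)     # steps 2-3: look up the definition part
        if definition is not None:
            classes[noun] = definition.split(sep)[0]  # steps 4-5: top level -> class name
    return classes
```

For instance, with the hypothetical entry `{"guo2": "place|country"}`, `assign_classes(["guo2"], ...)` maps "guo2" to the class "place", and any noun missing from the table is simply skipped, which mirrors the limited-overlap problem described above.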
References

(1) Mark Lauer, "Corpus Statistics Meet the Noun Compound: Some Empirical Results," pp. 47-54, _Proceedings of the Annual Meeting of the Association for Computational Linguistics_, 1995.
(2) P. Resnik and M. Hearst, "Structural Ambiguity and Conceptual Relations," pp. 58-64, _Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives_, Ohio State University, June 1993.
More informationLTAG-spinal and the Treebank
LTAG-spinal and the Treebank a new resource for incremental, dependency and semantic parsing Libin Shen (lshen@bbn.com) BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA Lucas Champollion (champoll@ling.upenn.edu)
More informationA Semantic Similarity Measure Based on Lexico-Syntactic Patterns
A Semantic Similarity Measure Based on Lexico-Syntactic Patterns Alexander Panchenko, Olga Morozova and Hubert Naets Center for Natural Language Processing (CENTAL) Université catholique de Louvain Belgium
More informationUniversity of Alberta. Large-Scale Semi-Supervised Learning for Natural Language Processing. Shane Bergsma
University of Alberta Large-Scale Semi-Supervised Learning for Natural Language Processing by Shane Bergsma A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationA Statistical Approach to the Semantics of Verb-Particles
A Statistical Approach to the Semantics of Verb-Particles Colin Bannard School of Informatics University of Edinburgh 2 Buccleuch Place Edinburgh EH8 9LW, UK c.j.bannard@ed.ac.uk Timothy Baldwin CSLI Stanford
More informationMethods for the Qualitative Evaluation of Lexical Association Measures
Methods for the Qualitative Evaluation of Lexical Association Measures Stefan Evert IMS, University of Stuttgart Azenbergstr. 12 D-70174 Stuttgart, Germany evert@ims.uni-stuttgart.de Brigitte Krenn Austrian
More informationMatching Similarity for Keyword-Based Clustering
Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web
More informationThe 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X
The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,
More informationBasic Parsing with Context-Free Grammars. Some slides adapted from Julia Hirschberg and Dan Jurafsky 1
Basic Parsing with Context-Free Grammars Some slides adapted from Julia Hirschberg and Dan Jurafsky 1 Announcements HW 2 to go out today. Next Tuesday most important for background to assignment Sign up
More informationReview in ICAME Journal, Volume 38, 2014, DOI: /icame
Review in ICAME Journal, Volume 38, 2014, DOI: 10.2478/icame-2014-0012 Gaëtanelle Gilquin and Sylvie De Cock (eds.). Errors and disfluencies in spoken corpora. Amsterdam: John Benjamins. 2013. 172 pp.
More informationWeb as a Corpus: Going Beyond the n-gram
Web as a Corpus: Going Beyond the n-gram Preslav Nakov Qatar Computing Research Institute, Tornado Tower, floor 10 P.O.box 5825 Doha, Qatar pnakov@qf.org.qa Abstract. The 60-year-old dream of computational
More informationThe Strong Minimalist Thesis and Bounded Optimality
The Strong Minimalist Thesis and Bounded Optimality DRAFT-IN-PROGRESS; SEND COMMENTS TO RICKL@UMICH.EDU Richard L. Lewis Department of Psychology University of Michigan 27 March 2010 1 Purpose of this
More informationCollocations of Nouns: How to Present Verb-noun Collocations in a Monolingual Dictionary
Sanni Nimb, The Danish Dictionary, University of Copenhagen Collocations of Nouns: How to Present Verb-noun Collocations in a Monolingual Dictionary Abstract The paper discusses how to present in a monolingual
More informationObjectives. Chapter 2: The Representation of Knowledge. Expert Systems: Principles and Programming, Fourth Edition
Chapter 2: The Representation of Knowledge Expert Systems: Principles and Programming, Fourth Edition Objectives Introduce the study of logic Learn the difference between formal logic and informal logic
More information! # %& ( ) ( + ) ( &, % &. / 0!!1 2/.&, 3 ( & 2/ &,
! # %& ( ) ( + ) ( &, % &. / 0!!1 2/.&, 3 ( & 2/ &, 4 The Interaction of Knowledge Sources in Word Sense Disambiguation Mark Stevenson Yorick Wilks University of Shef eld University of Shef eld Word sense
More informationCharacter Stream Parsing of Mixed-lingual Text
Character Stream Parsing of Mixed-lingual Text Harald Romsdorfer and Beat Pfister Speech Processing Group Computer Engineering and Networks Laboratory ETH Zurich {romsdorfer,pfister}@tik.ee.ethz.ch Abstract
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationLeveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word Similarity Balamurali A.R., Subhabrata Mukherjee, Akshat Malu and Pushpak Bhattacharyya Dept. of Computer Science and Engineering, IIT Bombay 6th International Global
More informationDerivational: Inflectional: In a fit of rage the soldiers attacked them both that week, but lost the fight.
Final Exam (120 points) Click on the yellow balloons below to see the answers I. Short Answer (32pts) 1. (6) The sentence The kinder teachers made sure that the students comprehended the testable material
More informationA Computational Evaluation of Case-Assignment Algorithms
A Computational Evaluation of Case-Assignment Algorithms Miles Calabresi Advisors: Bob Frank and Jim Wood Submitted to the faculty of the Department of Linguistics in partial fulfillment of the requirements
More informationA Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many
Schmidt 1 Eric Schmidt Prof. Suzanne Flynn Linguistic Study of Bilingualism December 13, 2013 A Minimalist Approach to Code-Switching In the field of linguistics, the topic of bilingualism is a broad one.
More informationProduct Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments
Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &
More informationA Comparison of Two Text Representations for Sentiment Analysis
010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationCombining a Chinese Thesaurus with a Chinese Dictionary
Combining a Chinese Thesaurus with a Chinese Dictionary Ji Donghong Kent Ridge Digital Labs 21 Heng Mui Keng Terrace Singapore, 119613 dhji @krdl.org.sg Gong Junping Department of Computer Science Ohio
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationBANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS
Daffodil International University Institutional Repository DIU Journal of Science and Technology Volume 8, Issue 1, January 2013 2013-01 BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Uddin, Sk.
More informationAn Introduction to the Minimalist Program
An Introduction to the Minimalist Program Luke Smith University of Arizona Summer 2016 Some findings of traditional syntax Human languages vary greatly, but digging deeper, they all have distinct commonalities:
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationProviding student writers with pre-text feedback
Providing student writers with pre-text feedback Ana Frankenberg-Garcia This paper argues that the best moment for responding to student writing is before any draft is completed. It analyses ways in which
More informationAN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282)
B. PALTRIDGE, DISCOURSE ANALYSIS: AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC. 2012. PP. VI, 282) Review by Glenda Shopen _ This book is a revised edition of the author s 2006 introductory
More informationContext Free Grammars. Many slides from Michael Collins
Context Free Grammars Many slides from Michael Collins Overview I An introduction to the parsing problem I Context free grammars I A brief(!) sketch of the syntax of English I Examples of ambiguous structures
More informationNatural Language Processing. George Konidaris
Natural Language Processing George Konidaris gdk@cs.brown.edu Fall 2017 Natural Language Processing Understanding spoken/written sentences in a natural language. Major area of research in AI. Why? Humans
More informationA corpus-based approach to the acquisition of collocational prepositional phrases
COMPUTATIONAL LEXICOGRAPHY AND LEXICOl..OGV A corpus-based approach to the acquisition of collocational prepositional phrases M. Begoña Villada Moirón and Gosse Bouma Alfa-informatica Rijksuniversiteit
More informationLexical category induction using lexically-specific templates
Lexical category induction using lexically-specific templates Richard E. Leibbrandt and David M. W. Powers Flinders University of South Australia 1. The induction of lexical categories from distributional
More informationCompositional Semantics
Compositional Semantics CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Words, bag of words Sequences Trees Meaning Representing Meaning An important goal of NLP/AI: convert natural language
More informationComputerized Adaptive Psychological Testing A Personalisation Perspective
Psychology and the internet: An European Perspective Computerized Adaptive Psychological Testing A Personalisation Perspective Mykola Pechenizkiy mpechen@cc.jyu.fi Introduction Mixed Model of IRT and ES
More informationCEFR Overall Illustrative English Proficiency Scales
CEFR Overall Illustrative English Proficiency s CEFR CEFR OVERALL ORAL PRODUCTION Has a good command of idiomatic expressions and colloquialisms with awareness of connotative levels of meaning. Can convey
More informationA Comparative Study of Research Article Discussion Sections of Local and International Applied Linguistic Journals
THE JOURNAL OF ASIA TEFL Vol. 9, No. 1, pp. 1-29, Spring 2012 A Comparative Study of Research Article Discussion Sections of Local and International Applied Linguistic Journals Alireza Jalilifar Shahid
More informationOutline. Web as Corpus. Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University. nclt
Outline Using Web Data for Linguistic Purposes NCLT, Dublin City University Outline Outline 1 Corpora as linguistic tools 2 Limitations of web data Strategies to enhance web data 3 Corpora as linguistic
More informationMandarin Lexical Tone Recognition: The Gating Paradigm
Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition
More informationWhat Can Neural Networks Teach us about Language? Graham Neubig a2-dlearn 11/18/2017
What Can Neural Networks Teach us about Language? Graham Neubig a2-dlearn 11/18/2017 Supervised Training of Neural Networks for Language Training Data Training Model this is an example the cat went to
More informationProject in the framework of the AIM-WEST project Annotation of MWEs for translation
Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment
More informationInleiding Taalkunde. Docent: Paola Monachesi. Blok 4, 2001/ Syntax 2. 2 Phrases and constituent structure 2. 3 A minigrammar of Italian 3
Inleiding Taalkunde Docent: Paola Monachesi Blok 4, 2001/2002 Contents 1 Syntax 2 2 Phrases and constituent structure 2 3 A minigrammar of Italian 3 4 Trees 3 5 Developing an Italian lexicon 4 6 S(emantic)-selection
More informationCross-Lingual Text Categorization
Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es
More informationGeneration of Referring Expressions: Managing Structural Ambiguities
Generation of Referring Expressions: Managing Structural Ambiguities Imtiaz Hussain Khan and Kees van Deemter and Graeme Ritchie Department of Computing Science University of Aberdeen Aberdeen AB24 3UE,
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationThe Discourse Anaphoric Properties of Connectives
The Discourse Anaphoric Properties of Connectives Cassandre Creswell, Kate Forbes, Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi Λ, Bonnie Webber y Λ University of Pennsylvania 3401 Walnut Street Philadelphia,
More informationMontana Content Standards for Mathematics Grade 3. Montana Content Standards for Mathematical Practices and Mathematics Content Adopted November 2011
Montana Content Standards for Mathematics Grade 3 Montana Content Standards for Mathematical Practices and Mathematics Content Adopted November 2011 Contents Standards for Mathematical Practice: Grade
More information