
Segmenting Korean Compound Nouns using Statistical Information and a Preference Rule

Bo-Hyun Yun, Min-Jeung Cho, Hae-Chang Rim
Department of Computer Science, Korea University
1, 5-ka, Anam-dong, Seoul, Korea
ybh@nlp.korea.ac.kr, cmj@nlp.korea.ac.kr, rim@nlp.korea.ac.kr

Abstract

This paper presents a method of segmenting Korean compound nouns by using statistical information and a preference rule. The statistical information is represented by CFP (Compound noun Formation Probability), which is built from the frequencies of affixes and the frequencies of two-syllabled and three-syllabled nouns. The preference rule is MNPR (Minimal Noun Preference Rule), which prefers the structure pattern of a compound noun with the minimal number of unit nouns. Moreover, we apply three kinds of heuristics in order to segment compound nouns that include unknown unit nouns. Experimental results show that the precision of the proposed method is approximately 96% on average. Furthermore, the experiments show that the proposed method can segment compound nouns that include unknown nouns and maintains a constant precision rate in segmenting compound nouns extracted from various domains.

1 Introduction

Segmenting a compound noun (CN) in a raw corpus is one of the crucial issues for natural language processing systems such as machine translation systems, information retrieval systems, and spelling checkers. Korean compound nouns must be segmented correctly in order to select the right target lexemes in machine translation, to increase the recall rate in information retrieval, and to correct spacing errors in compound nouns in spelling checking. However, segmentation is a difficult problem because a Korean compound noun consists of more than one unit noun written without blanks, and because a compound noun may admit many ambiguous segmentations.
In segmenting Korean compound nouns in a raw corpus, we have to consider the following problems:

1) A raw corpus contains various eojeols¹, such as verbals and adjectivals, that must be eliminated.
2) An eojeol that includes a compound noun carries several suffixes that must be removed.
3) Korean compound nouns admit many ambiguous segmentations that must be resolved.
4) Because not all unit nouns (UNs) can be registered in a lexicon, many compound nouns include unknown unit nouns.

¹ Eojeol is the spacing unit in Korean, like a word in English. An eojeol consists of one or more morphemes. It sometimes corresponds to a word or a phrase in English.

In this research, we solve the first and second problems by using a morphological analyzer [8] and a POS (Part-Of-Speech) tagger² [6, 7, 9], and propose solutions only for the third and fourth problems.

² The morphological analyzer and POS tagger have been developed at the NLP Lab. of Korea University.

To analyze compound nouns in Japanese, Yosiyuki et al. [11] use collocation information and a thesaurus. The accuracy of this method is about 80%. For Chinese, Nie et al. [10] first segment a text with a rule-based and dictionary-based method. A hybrid approach is then applied to locate candidates for the unknown words contained therein, and the segmentation process is run again. This method shows an accuracy of 96.51%.

For Korean compound nouns, several segmentation methods [3, 4, 12] have been proposed. Choi [4] applies structure patterns of compound nouns in order and then segments compound nouns, but this method cannot resolve ambiguous segmentations. Yun et al. [12] apply several structure patterns and resolve ambiguous segmentations by using the frequencies of head words and statistical preference rules. However, neither method can segment compound nouns that include unknown unit nouns. Chang et al. [3] construct a trie to store corpus information, insert dummy nodes to mark the end of a noun in the learning phase, and analyze the compound noun by using the constructed trie in the application phase. However, the performance of this method depends on specific domains. To solve these problems, we propose a method of segmenting compound nouns based on statistical information, CFP, and a preference rule, MNPR.

2 Statistical Information and Preference Rule

2.1 Statistical Information

To acquire statistical information, we assume that the structure of a Korean compound noun can be expressed as a binary tree consisting of a specifier and a head. That is, the structure of Korean compound nouns corresponds to the Binary Branch Structure (BBS) based on X' theory in linguistics [5]. Figure 1 shows that the specifier and the head have the recursive property, as indicated by the symbol '+' in the structure of Korean compound nouns. The specifier and the head can also have a subspecifier and a subhead, respectively. In this research, to simplify the acquisition of statistical information, we define a unit noun between the specifier and the head as an intermediate.

Based on the above structure, the frequencies of two-syllabled and three-syllabled nouns are obtained from 81,276 compound nouns registered in the dictionary of Kumsung Publishing Company as follows:

- The first unit noun $N_1$ is counted as the specifier.
- The middle unit nouns $N_2, \ldots, N_{n-1}$ are counted as intermediates.
- The last unit noun $N_n$ is counted as the head.

Because these compound nouns carry the mark '-', which indicates the correct segmentation, it is easy to distinguish the specifier, the intermediate, and the head.

$$\mathrm{CFP}(N_1, \ldots, N_n) = \log\Bigl( P(N_1 \mid S_{Pre}) \times \prod_{i=2}^{n-1} P(N_i \mid S_{In}) \times P(N_n \mid S_{Post}) \Bigr) \tag{1}$$

$$P(N_1 \mid S_{Pre}) = \frac{C(N_1; S_{Pre})}{\sum_{i=1}^{n} C(N_i; S_{Pre})} \tag{2}$$

$$P(N_i \mid S_{In}) = \frac{C(N_i; S_{In})}{\sum_{i=1}^{n} C(N_i; S_{In})} \tag{3}$$

$$P(N_n \mid S_{Post}) = \frac{C(N_n; S_{Post})}{\sum_{i=1}^{n} C(N_i; S_{Post})} \tag{4}$$
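The role-counting procedure of Section 2.1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `count_roles`, the plain-string input format, and the toy entries are all assumptions introduced here; only the rule (first unit noun is the specifier, middle ones are intermediates, last one is the head, with '-' marking the segmentation) comes from the text.

```python
from collections import Counter


def count_roles(marked_compounds):
    """Count how often each unit noun occurs as specifier, intermediate, head.

    `marked_compounds` is an iterable of dictionary entries whose correct
    segmentation is marked with '-', e.g. "aa-cc-bb" (hypothetical format).
    """
    pre, mid, post = Counter(), Counter(), Counter()
    for entry in marked_compounds:
        units = [u for u in entry.split("-") if u]
        if len(units) < 2:
            continue  # a lone unit noun has no specifier/head structure
        pre[units[0]] += 1            # first unit noun -> specifier
        for unit in units[1:-1]:      # middle unit nouns -> intermediates
            mid[unit] += 1
        post[units[-1]] += 1          # last unit noun -> head
    return pre, mid, post


# Toy entries, purely illustrative (not taken from the dictionary).
entries = ["aa-bb", "aa-cc-bb", "dd-bb", "aa-ccc"]
pre, mid, post = count_roles(entries)
print(pre["aa"], mid["cc"], post["bb"])  # prints: 3 1 3
```

The denominators in equations (2) to (4) would then be sums over these counters. If separate totals for two-syllabled and three-syllabled nouns are wanted, one option is to group the counts by `len(unit)`, which equals the syllable count for precomposed Hangul strings.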
Figure 1: The Structure of Compound Noun

The frequencies of one-syllabled affixes are acquired from 4,486 three-syllabled compound nouns of the form $N_1$-$N_2$ as follows:

- If $N_1$ is one syllable, $N_1$ is counted as a prefix.
- If $N_2$ is one syllable, $N_2$ is counted as a suffix.

By using the frequency data, we define CFP as in equation (1), where $S_{Pre}$, $S_{In}$, and $S_{Post}$ are the states of a specifier, an intermediate, and a head, respectively, and $C(N_1; S_{Pre})$, $C(N_i; S_{In})$, and $C(N_n; S_{Post})$ are the frequencies with which a unit noun is used as a specifier, an intermediate, and a head, respectively. Equation (1) multiplies the probability that $N_1$ is used as a specifier, the probabilities that $N_2, \ldots, N_{n-1}$ are used as intermediates, and the probability that $N_n$ is used as a head [2]. In other words, CFP represents the capacity of unit nouns to form a compound noun. Taking the logarithm maps the value of the probability into the range from 0 to $-\infty$. In equation (2), $P(N_1 \mid S_{Pre})$ expresses the probability that $N_1$ is used as a specifier. The probabilities in equations (3) and (4) have analogous meanings.

2.2 Preference Rule

The preference rule, MNPR, was acquired through an empirical study. Its basic principle follows the MAP (Minimal Attachment Principle) applied in syntactic analysis [1]. The MAP states that the parse tree with the fewest nodes is preferred in resolving structural ambiguity. Similarly, we define MNPR based on MAP as follows:

MNPR (Minimal Noun Preference Rule): If the number of unit nouns differs among ambiguous segmentations, we prefer the structure pattern with the minimal number of unit nouns.

3 Segmentation Algorithm

The algorithm for segmenting compound nouns is shown in Figure 2. First, we apply structure patterns of compound nouns by consulting a general noun dictionary with 50,518 entries. If exactly one result is generated, we regard it as the correct segmentation. If the given compound noun can be segmented ambiguously, we resolve the ambiguity by using CFP and MNPR, as explained in detail in Section 3.1. If a compound noun cannot be segmented, we regard it as a compound noun that includes unknown unit nouns and segment it by the method of Section 3.2.

    Segment_CN(CN) {
        Apply structure patterns of compound nouns
        if (one segmentation result)
            Print the segmentation result
        else if (several segmentation results)
            Resolve_Ambiguity()
        else if (no segmentation result)
            Segment_CN_including_Unknown_Word(CN)
    }

Figure 2: A Segmentation Algorithm

3.1 Resolving Ambiguous Segmentations

Ambiguous segmentations are resolved differently according to the number of unit nouns. If the number of segmented unit nouns is the same among the ambiguous segmentations, we apply the statistical information, CFP; otherwise, we apply the preference rule, MNPR.

First, if the number of unit nouns is the same among the ambiguous segmentations, we apply CFP to segment the compound noun. Table 1 shows the total summations of the frequency data used as a specifier, an intermediate, and a head for the calculation of CFP.
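The control flow of Figure 2 and the resolution rule above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not in the paper: candidates are enumerated by exhaustive dictionary lookup rather than by structure patterns, a single total per role replaces the per-syllable totals of Table 1, the totals of 10,000 are made up, base-10 logarithms are used, an unseen noun gets a score of negative infinity (no smoothing), and when candidates of several lengths coexist MNPR is applied first and CFP breaks the remaining ties. All function names are hypothetical.

```python
import math


def candidates(cn, lexicon):
    """Enumerate every split of `cn` whose parts are all in `lexicon`."""
    if not cn:
        return [[]]
    results = []
    for i in range(1, len(cn) + 1):
        first = cn[:i]
        if first in lexicon:
            results += [[first] + rest for rest in candidates(cn[i:], lexicon)]
    return results


def cfp(units, counts, totals):
    """Equation (1): summed log-probabilities of specifier, intermediates, head."""
    roles = ["pre"] + ["in"] * (len(units) - 2) + ["post"]
    score = 0.0
    for unit, role in zip(units, roles):
        count = counts[role].get(unit, 0)
        if count == 0:
            return float("-inf")  # assumption: unseen in this role, no smoothing
        score += math.log10(count / totals[role])  # log base is an assumption
    return score


def resolve(cands, counts, totals):
    """MNPR when lengths differ; CFP among candidates of equal length."""
    fewest = min(len(c) for c in cands)
    shortest = [c for c in cands if len(c) == fewest]
    if len(shortest) == 1:
        return shortest[0]
    return max(shortest, key=lambda c: cfp(c, counts, totals))


def segment_cn(cn, lexicon, counts, totals):
    cands = [c for c in candidates(cn, lexicon) if len(c) > 1]
    if not cands:
        return None  # unknown unit nouns: left to the heuristics of Section 3.2
    if len(cands) == 1:
        return cands[0]
    return resolve(cands, counts, totals)


# Romanized toy lexicon; the four counts are the ones quoted in Section 3.1.
lexicon = {"bujeonghap", "gukja", "bujeong", "hapgukja",
           "golf", "jangsa", "upja", "golfjang", "saupja"}
counts = {"pre": {"bujeonghap": 1, "bujeong": 87},
          "in": {},
          "post": {"gukja": 13, "hapgukja": 4}}
totals = {"pre": 10000, "in": 10000, "post": 10000}  # made-up totals

print(segment_cn("bujeonghapgukja", lexicon, counts, totals))  # ['bujeong', 'hapgukja']
print(segment_cn("golfjangsaupja", lexicon, counts, totals))   # ['golfjang', 'saupja']
```

Because the totals here are placeholders, the scores will not match the values reported in the paper; only the ranking of the two candidates is meant to be illustrative.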
Table 1: Summations of each Specifier, Intermediate, and Head

    Type                                 2-Syllable    3-Syllable
    $\sum_{i=1}^{n} C(N_i; S_{Pre})$
    $\sum_{i=1}^{n} C(N_i; S_{In})$
    $\sum_{i=1}^{n} C(N_i; S_{Post})$

For instance, the compound noun 'bujeonghapgukja (부정합격자, an illegally successful candidate)' can be segmented into both 'bujeonghap/gukja (부정합/격자, a disharmonious lattice)' and 'bujeong/hapgukja (부정/합격자, an illegally successful candidate)'. The frequencies of the unit nouns are as follows:

    C(bujeonghap; S_Pre) = 1
    C(gukja; S_Post) = 13
    C(bujeong; S_Pre) = 87
    C(hapgukja; S_Post) = 4

By using the above frequencies, we calculate the CFPs of the two candidates as follows:

$$\mathrm{CFP}(\text{bujeonghap/gukja}) = \log\bigl(P(\text{bujeonghap} \mid S_{Pre}) \times P(\text{gukja} \mid S_{Post})\bigr) = -7.9866$$

$$\mathrm{CFP}(\text{bujeong/hapgukja}) = \log\bigl(P(\text{bujeong} \mid S_{Pre}) \times P(\text{hapgukja} \mid S_{Post})\bigr) = -6.6507$$

Because CFP(bujeong/hapgukja) is larger than CFP(bujeonghap/gukja), 'bujeonghapgukja (부정합격자)' is segmented into 'bujeong/hapgukja (부정/합격자)'.

Second, if the number of unit nouns differs, we resolve the ambiguous segmentation by MNPR. For example, the compound noun 'golfjangsaupja (골프장사업자, a golf course businessman)' can be segmented into both 'golf/jangsa/upja (골프/장사/업자, a golf trade businessman)' and 'golfjang/saupja (골프장/사업자, a golf course businessman)'. The number of unit nouns in 'golf/jangsa/upja' is 3 and the number of unit nouns in 'golfjang/saupja' is 2. By MNPR, we choose 'golfjang/saupja (골프장/사업자)' as the correct segmentation because it has the smaller number of unit nouns.

3.2 Segmenting Compound Nouns including Unknown Nouns

In general, because not all unit nouns can be registered in a lexicon, many compound nouns include unknown unit nouns. Most unknown unit nouns are three-syllabled nouns, foreign nouns, or nouns of a specific area. In this research, we segment these compound nouns in three phases.

First, if a noun of three or more syllables at a specific position is a known noun, we apply the structure pattern itself. The unit nouns of a specific position are underlined as follows:

    6 syllable : 3/3, 4/2, 2/4
    7 syllable : 2/3/2, 3/4, 4/3, 5/2, 2/5
    8 syllable : 2/3/3, 3/3/2, 2/4/2, 3/5, 5/3, 6/2, 2/6
    9 syllable : 3/3/3, 2/3/4, 2/4/3, 3/4/2, 4/3/2, 2/5/2, 3/6, 6/3, 2/7, 7/2
    10 syllable : 2/4/4, 4/4/2, 2/4/3, 4/3/3, 3/4/3, 3/3/4, 3/5/2, 2/5/3

For example, the compound noun 'orengekaunti (오렌지카운티, Orange County)' has a known noun 'orenge (오렌지)' and an unknown noun 'kaunti (카운티)'. By the structure pattern '3/3', 'orengekaunti (오렌지카운티)' is correctly segmented into 'orenge/kaunti (오렌지/카운티)'.

Second, if a two-syllabled noun is registered but the three-syllabled noun is not, we apply the affix frequencies. For instance, the compound noun 'gunchuksahuphoy (건축사협회, an architect society)' is first segmented into 'gunchuk/sa/huphoy (건축/사/협회)' because 'gunchuk (건축)' and 'huphoy (협회)' are registered but 'gunchuksa (건축사)' is not. Then, to decide whether the affix 'sa (사)' is a prefix or a suffix, we use the prefix and suffix frequencies. The affix 'sa (사)' was used 29 times as a prefix and 111 times as a suffix. Therefore, 'gunchuksahuphoy (건축사협회)' is correctly segmented into 'gunchuksa/huphoy (건축사/협회)'.

Third, we assume the following default patterns, which are the patterns most frequently produced by segmentation, and apply them:

    4 syllable : 2/2
    5 syllable : 2/3
    6 syllable : 2/2/2
    7 syllable : 2/2/3
    8 syllable : 2/2/2/2
    9 syllable : 2/2/2/3
    10 syllable : 2/2/2/2/2

Figure 3: System Configuration

4 Experimental Results

The system configuration that implements the proposed algorithm is shown in Figure 3. A raw text is analyzed by a morphological analyzer and tagged by a POS tagger. Then, we extract N, N+N, N+N+N, and N+N+N+N forms from the POS-tagged corpus.
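The extraction step can be pictured with a short Python sketch. Everything about the data shape is an assumption made for illustration: each eojeol is taken to be a list of (morpheme, tag) pairs, nouns are taken to carry the tag "N", and "J" is a made-up tag for a particle. The actual output format of the authors' tagger is not described in the paper.

```python
def extract_noun_forms(tagged_eojeol, noun_tag="N", max_len=4):
    """Return the noun sequences (N, N+N, ..., up to max_len) in one eojeol.

    Each run of consecutive noun morphemes is joined into one string, which
    is the unsegmented form that a segmentation system would receive.
    """
    runs, current = [], []
    for morpheme, tag in tagged_eojeol:
        if tag == noun_tag:
            current.append(morpheme)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    # Keep only the forms N through N+N+N+N, as in the text.
    return ["".join(run) for run in runs if len(run) <= max_len]


# Romanized toy eojeol: two noun morphemes followed by a particle.
eojeol = [("golfjang", "N"), ("saupja", "N"), ("ga", "J")]
print(extract_noun_forms(eojeol))  # ['golfjangsaupja']
```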
However, an N form may be either a unit noun or a compound noun, owing to the recognition process for unknown nouns. Accordingly, we assume that unit nouns are registered in a unit noun dictionary and filter the unit nouns out of the N forms. As a result, the segmentation system receives only compound nouns as input and produces a single segmentation result for each.

We use three kinds of data to estimate the precision rate of the proposed algorithm. The first test set consists of 345 compound nouns that include many unknown unit nouns. The second test set consists of 1,200 compound nouns extracted from about 1,000 documents of KTSET 2.0, a test set for information retrieval. The KTSET 2.0 test collection consists of 44,400 documents and 50 queries, and includes the relevance judgment of each document with respect to each query. The third test set consists of 1,644 compound nouns extracted at random and in a balanced manner from corpora; they are drawn from 19,613 compound nouns that the Korean morphological analyzer cannot analyze.

We define the criteria for evaluating the segmentation algorithm as follows:

- The inclusion rate of unknown nouns: D/B × 100
- The rate of ambiguous segmentations: E/B × 100
- The precision rate: F/B × 100

where B, D, E, and F are shown in Table 2.

Table 2: Experimental Results

    Type                                              data 1    data 2    data 3
    # of CNs in the Input (A)
    # of CNs Segmented by the System (B)
    # of CNs including only Known UNs (C)
    # of CNs including at least one Unknown UN (D)
    # of Ambiguously Segmented CNs (E)
    # of CNs Correctly Segmented (F)
    Inclusion Rate of Unknown Nouns                   28.6%     12%       24.1%
    Rate of Ambiguous Segmentations                   35%       21.5%     24%
    Precision Rate                                    95.6%     96.8%     95.8%

From the results on the first and third test sets, we can say that the proposed algorithm can correctly segment compound nouns that include unknown nouns. From the result on the second test set, we find that the proposed algorithm maintains a constant precision rate in segmenting compound nouns extracted from various domains.

Table 3 shows a data analysis of CFP, MNPR, and the heuristics used in resolving ambiguous segmentations, where B and F are as defined in Table 2. The table shows that CFP and MNPR are useful information for resolving ambiguous segmentations. However, the heuristics for segmenting compound nouns that include at least one unknown unit noun show a precision of 78%, which leaves plenty of room for improvement.

Table 3: Data Analysis of CFP, MNPR, Heuristics

    Method        B    F    Precision
    CFP
    MNPR
    Heuristics

Our proposed method is compared with other research in Table 4. In this table, 'Segmentation' means the segmentation of CNs including unknown nouns and 'Resolution' means the resolution of ambiguous segmentations. The table shows that the proposed method can segment compound nouns including unknown nouns and resolve ambiguous segmentations at a better precision rate. In Table 5, we compare our method with that of Chang separately from the other existing research.
The reason is that Yun [12] and Choi [4] use dictionary-based methods, whereas Chang [3] uses a corpus-based method. In this table, 'Trained' means the data used to construct a trie and acquire statistical information, and 'Untrained' means data, other than the trained data, used to evaluate the precision rate. This result shows that the proposed method maintains a constant precision rate regardless of the specific area.

Table 4: Results of Comparison 1

    Factor          Yun95    Choi96    Proposed
    Segmentation    No       No        Yes
    Resolution      Yes      No        Yes
    Precision       82%      83%       95.6%

Table 5: Results of Comparison 2

    Data         Chang96    Proposed
    Trained      97.66%     98.0%
    Untrained    87.75%     95.6%
    KTSET                   96.8%

5 Conclusion

In this paper, we have presented four requirements necessary for segmenting Korean compound nouns in a raw corpus and suggested a method of segmenting Korean compound nouns into unit nouns. We applied structure patterns of compound nouns and resolved ambiguous segmentations by using statistical information, CFP, and a preference rule, MNPR. The experimental results have shown that the precision rate is about 96%. The experiments have shown that the proposed method can segment compound nouns that include unknown nouns and maintains a constant precision rate in segmenting compound nouns extracted from various domains. In future work, we will try to improve the accuracy of segmenting compound nouns that include unknown unit nouns. In addition, we will apply the segmentation method to compound noun indexing in order to improve the performance of an information retrieval system.

References

[1] J. Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company Inc.
[2] E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz, "Equations for Part-of-speech Tagging," Proc. of the Eleventh National Conference on Artificial Intelligence.
[3] D.H. Chang, S.H. Myaeng, "A Korean Compound Noun Analysis method for Effective indexing," Hangul and Korean Information Processing Conference. (in Korean)
[4] J.H. Choi, "A Division Method of Korean Compound Noun by number of syllable," Hangul and Korean Information Processing Conference. (in Korean)
[5] W.S. Jung, Word Formation Theory of Korean language, 1st Ed., p.267, Hansin-Culture Publishing Company. (in Korean)
[6] J.D. Kim, A Korean Part-of-Speech Tagging Model Based on Morpheme-unit with Eojeol Context, M.S. Dissertation, Korea University. (in Korean)
[7] S.Z. Lee, Two-level Korean Part-of-Speech Tagging using HMM, M.S. Dissertation, Korea University. (in Korean)
[8] H.S. Lim, Korean Morphological Analyzer based on Classification of Ambiguity pattern, M.S. Dissertation, Korea University. (in Korean)
[9] H.S. Lim, J.D. Kim, H.C. Rim, "Improvement of Transformation Rule-Based Korean Part-Of-Speech Tagger," Hangul and Korean Information Processing Conference. (in Korean)
[10] J.Y. Nie, M.L. Hannan, W. Jin, "Combining Dictionary, Rules and Statistical Information in Segmentation of Chinese," Computer Processing of Chinese and Oriental Languages, Vol. 9, No. 2, 1995.
[11] K. Yosiyuki, T. Takenobu, T. Hozumi, "Analysis of Japanese Compound Nouns using Collocation information," Proc. of the 14th Conference on Computational Linguistics (COLING-94).
[12] B.H. Yun, H.S. Lim, H.C. Rim, "Analysis of Korean Compound Nouns using Statistical Information," Proc. of the 22nd Korea Information Science Society Spring Conference, April.


More information
