Extended Similarity Test for the Evaluation of Semantic Similarity Functions


Maciej Piasecki (1), Stanisław Szpakowicz (2,3), Bartosz Broda (1)

(1) Institute of Applied Informatics, Wrocław University of Technology, Poland, {maciej.piasecki,bartosz.broda}@pwr.wroc.pl
(2) School of Information Technology and Engineering, University of Ottawa, szpak@site.uottawa.ca
(3) Institute of Computer Science, Polish Academy of Sciences

Abstract

We propose a more demanding version of a well-known WordNet-Based Similarity Test. Our work with semantic similarity functions for Polish nouns has shown that test to be insufficiently stringent. We briefly present the background, explain the extension of WBST and report on experiments that contrast the old and the new evaluation tools.

1. Introduction

Many tasks in Natural Language Processing (Word Sense Disambiguation, Text Entailment, Text Classification, to name just a few) require a measure of semantic relatedness. Automatic acquisition of lexical semantic relations, in particular, can hardly be imagined without some form of a semantic similarity function (henceforth, SSF). An SSF maps pairs of lexical units into real numbers, and is usually normalized. A lexical unit (LU) is a word type or lexeme organized, especially in inflected languages, by the values of morphological categories such as number, gender and so on.

Evaluation of the quality or effectiveness of an SSF is a non-trivial problem. Manual evaluation is barely feasible on a small scale. Not only are SSFs required to work for any pair of LUs, but people are also notoriously bad at working with real numbers. A linear ordering of dozens of LUs is nearly impossible, and even comparing two terms requires a rather complicated setup (Rubenstein and Goodenough, 1965). Given a small sample, people can easily distinguish a bad SSF from a good one; we, however, must distinguish good SSFs from those that are merely passable.
We note three forms of SSF evaluation (Budanitsky and Hirst, 2006; Zesch and Gurevych, 2006): mathematical analysis of formal properties (for example, the property of a metric distance (Lin, 1998b)), application-specific evaluation, and comparison with human judgement. Mathematical analysis gives few clues about the future uses of an SSF. Evaluation via an application may make it difficult to separate the effect of an SSF from that of other elements of the application (Zesch and Gurevych, 2006). A direct comparison to a manually created resource seems the least troublesome. The construction of such resources, however, is labour-intensive even if it only labels LU pairs as similar (maybe just related (Zesch and Gurevych, 2006)) or not similar; such labelling does not allow a fair assessment of the ordering of LUs on a continuous scale, which is what an SSF produces.

Indirect comparison with the existing resources (Grefenstette, 1993) is another possibility. For example, one could compare an SSF constructed automatically and another based on the semantic similarity across WordNet's (Fellbaum, 1998) hypernymy structure. In (Lin, 1998a; Weeds and Weir, 2005) two lists of the k LUs most similar to the given one are transformed to rank numbers of the subsequent LUs on the lists, and compared by the cosine measure. The drawback of such an evaluation is that we learn how close the two similarity functions are, but not how people perceive an SSF.

Automatic differentiation between words synonymous and not synonymous with a LU is a natural application for an SSF. In Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) the SSF constructed on the basis of a statistical analysis of a corpus was used to make decisions in a synonymy test, a component of the Test of English as a Foreign Language (TOEFL); this gave 64.4% of hits. Turney (2001) reported 73.75% hits, and Turney et al. (2003) 97.5% hits; this practically solved the TOEFL synonymy problem.
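The rank-list comparison attributed above to (Lin, 1998a; Weeds and Weir, 2005) can be illustrated with a minimal sketch. The weighting scheme is an assumption made here for illustration: the LU at position i of a list of length k receives the weight k - i, and LUs absent from a list receive 0. The function names `rank_vector` and `cosine` are hypothetical, and the cited papers may define the rank numbers differently.

```python
import math

def rank_vector(neighbours):
    """Map an ordered list of the k most similar LUs to rank-based weights.

    Assumption: the first LU gets weight k, the last gets weight 1.
    """
    k = len(neighbours)
    return {lu: k - i for i, lu in enumerate(neighbours)}

def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(lu, 0) for lu, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy neighbour lists produced by two hypothetical similarity functions.
list_a = ["office", "ministry", "board"]
list_b = ["ministry", "office", "attic"]
print(cosine(rank_vector(list_a), rank_vector(list_b)))
```

A value near 1 says only that the two functions rank neighbours alike; it says nothing about how people would judge either list, which is the drawback noted above.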
Next, Freitag et al. (2005) proposed a WordNet-Based Synonymy Test (WBST), in which WordNet is used to generate a large set of questions identical in format to those in the TOEFL. Section 2 discusses WBST. The best reported result for nouns is 75.8% (Freitag et al., 2005). A slightly modified WBST was used to evaluate an SSF for Polish nouns (Piasecki et al., 2006) with the result of 86.09%.

The evaluation of an SSF via a synonymy test shows the ability of the SSF to distinguish synonyms from non-synonyms. Since the SSF is the centrepiece of the application, the achieved results can be directly attributed to it. There was, however, a problem: WBST appeared to be too easy, as shown in Section 2, so it was no longer a useful tool in the assessment of more sophisticated SSFs for Polish nouns.

In view of these findings, we have set out to design a more demanding automatic method of SSF assessment. We want its results to be easily interpreted by people and its feasibility tested on people. We also expect that it will pick the SSF that is a better tool for the recognition of lexical semantic relations between LUs.

2. WBST and the Similarity among Polish Nouns

The application of LSA to the TOEFL became unattractive as a method of comparing SSFs once the result of 97.5% hits had been achieved (Turney et al., 2003).

Freitag et al. (2005) proposed a new test, WBST. It was seen as more difficult because it contained many more questions. An instance of the test is built thus: first, pair a LU q included in a wordnet (WordNet 2.0 in (Freitag et al., 2005)) with a randomly chosen synonym s; next, randomly draw from the wordnet three other LUs not in q's synset (detractors) to complete an answer set A in the question-answer (QA) pair $\langle q, A \rangle$. During evaluation, the SSF generates values for the pairs $\langle q, a_i \rangle$, $a_i \in A$, that are expected to favour s.

The WBST has been used, among other applications, to evaluate SSFs for Polish nouns. The underlying resource was Polish WordNet (plwn) (Derwojedowa et al., 2007a; Derwojedowa et al., 2007b), a lexical database now under construction (Piasecki et al., 2006). The test was slightly modified. In plwn, many synsets have only 1-3 LUs, in accordance with the definition of the semantic relations (Derwojedowa et al., 2007b). In order to get a better coverage of LUs by WBST questions, and not to leave LUs in singleton synsets untested, the direct hypernyms of LUs from singleton synsets were taken to form QA pairs in (Piasecki et al., 2006). We named this modification the WBST with Hypernyms (WBST+H). The inclusion of hypernyms in QA pairs did not make the test easier, as was shown in (Piasecki et al., 2006).

In (Piasecki et al., 2006) an SSF was based on adjectival modification of nouns and on noun coordination; we also ran preliminary experiments with describing a noun via its association with verbs. That work lacks an in-depth analysis of all possible lexico-syntactic markers of Polish noun meaning. After the re-implementation of the approach of (Piasecki et al., 2006) and the addition of several lexico-syntactic features (such as modification by an adjectival participle; see Section 4), the result exceeded 90%. We could, however, observe little difference in the influence of the subsequent types of features. This was contrary to our expectations.
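The construction of a WBST instance described at the start of this section can be sketched as follows. This is a toy illustration, not the generator used in the cited work: the wordnet is reduced to a list of synsets, every LU is assumed to belong to exactly one synset, and all names (`make_wbst_question`, `choose`, `accuracy`, `TOY_SYNSETS`, `oracle_ssf`) are hypothetical.

```python
import random

def make_wbst_question(q, synsets, rng):
    """Build one WBST item: q, a random synonym s, and a shuffled answer set A."""
    synset = next(syn for syn in synsets if q in syn)
    synonyms = sorted(synset - {q})
    if not synonyms:
        return None  # singleton synset: plain WBST has no synonym to ask about
    s = rng.choice(synonyms)
    # Detractors: three LUs drawn at random from outside q's synset.
    outside = sorted(set().union(*synsets) - synset)
    answers = [s] + rng.sample(outside, 3)
    rng.shuffle(answers)
    return q, s, answers

def choose(q, answers, ssf):
    """The SSF answers a question by picking the candidate most similar to q."""
    return max(answers, key=lambda a: ssf(q, a))

def accuracy(questions, ssf):
    """Fraction of questions in which the SSF picks the synonym s."""
    hits = sum(choose(q, answers, ssf) == s for q, s, answers in questions)
    return hits / len(questions)

# Toy wordnet: one two-element synset and four singletons.
TOY_SYNSETS = [
    frozenset({"car", "automobile"}),
    frozenset({"attic"}),
    frozenset({"river"}),
    frozenset({"follower"}),
    frozenset({"repatriation"}),
]

def oracle_ssf(a, b):
    """A stand-in SSF that knows the synsets; a real SSF would be corpus-based."""
    return 1.0 if any(a in syn and b in syn for syn in TOY_SYNSETS) else 0.0

question = make_wbst_question("car", TOY_SYNSETS, random.Random(0))
print(question, accuracy([question], oracle_ssf))
```

The WBST+H variant would additionally fall back to a direct hypernym when `synonyms` is empty; that step needs a hypernymy relation, which this toy structure does not model.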
In the repeated WBST+H tests with several raters, we had the result close to 100% (markedly more than the 89.29% reported in (Piasecki et al., 2006)). This may have happened because the tests were generated on the basis of a more recent, improved version of plwn. It is imperative that we construct a more difficult WBST-style test to facilitate further work on SSFs for Polish nouns.

3. Enhanced WBST

The WBST defined in (Freitag et al., 2005) stipulates that the elements of the answer set A not synonymous with q are chosen completely at random from the whole wordnet. This means that the difference in meaning between q and the detractors is in most cases obvious to test-takers. It also tends to be obvious to a good SSF. Our overall goal, however, is to construct automatically synsets of highly similar LUs, and to differentiate the LUs in a synset from all other LUs that are similar but not synonymous, among them co-hyponyms (Derwojedowa et al., 2007b). Any SSF must therefore distinguish closely related LUs, not only those with very different meanings.

We need to construct the answer set A so that non-synonyms are closer in meaning to q than is the case in WBST+H. Obviously, they cannot be synonyms of either s or q, but they ought to be related to both. We need to select the non-synonyms among LUs similar to s and to q. In order to achieve this, we have decided to leverage the structure of the wordnet in the determination of similarity. During the generation of the modified test, named Enhanced WBST (EWBST), non-synonyms are still selected randomly but only from the set of LUs broadly similar to q and s. The acceptable values of $SSF_{WN}(Q, x)$ are lower than a threshold $sim_t$; the synset Q contains q and s, and x is a detractor.
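The restricted random draw used in EWBST can be sketched as below. This is a minimal illustration under stated assumptions: the wordnet-based function is passed in as a parameter `wn_value`, the acceptance test is "value lower than the threshold" exactly as stated above, and the toy values, the threshold 1.0 and all names (`ewbst_detractors`, `TOY_VALUES`, `toy_wn_value`) are hypothetical.

```python
import random

def ewbst_detractors(synset, all_lus, wn_value, threshold, rng, n=3):
    """Draw n detractors at random, but only from LUs accepted by the
    wordnet-based function: wn_value(synset, x) must be below the threshold."""
    candidates = sorted(
        x for x in all_lus
        if x not in synset and wn_value(synset, x) < threshold
    )
    if len(candidates) < n:
        return None  # too few related LUs: no QA pair is generated
    return rng.sample(candidates, n)

# Hypothetical wordnet-based values between the synset Q and other LUs.
TOY_VALUES = {"office": 0.5, "ministry": 0.5, "charity": 0.75,
              "attic": 2.25, "follower": 2.0}

def toy_wn_value(synset, x):
    return TOY_VALUES[x]

Q = frozenset({"administration", "board"})
LUS = set(TOY_VALUES) | Q
print(ewbst_detractors(Q, LUS, toy_wn_value, 1.0, random.Random(0)))
```

Returning `None` when too few candidates survive mirrors the observation that a stricter filter reduces the number of QA pairs that can be generated.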
We tested several wordnet-based similarity functions (Agirre and Edmonds, 2006), here implemented on the basis of plwn's hypernymy structure, achieving the best result in a generated test with the following function:

\[ SSF_{WN} = \frac{p_{min}}{2d} \tag{1} \]

where $p_{min}$ is the length of a minimal path between two LUs in plwn, and d = 9 is the maximal depth of the hypernymy hierarchy in the current version of plwn. The similarity threshold $sim_t = 2$ for this function has been established experimentally.

The hypernymy structure of nouns in plwn (as of May 7, 2007) does not have a single root. Many methods of similarity computation require a root, so we have introduced one artificially and linked to it all trees in the hypernymy forest. We noticed that the random selection of LU detractors on the basis of any similarity measure tends to favour LUs in hypernymy subtrees other than that of q, if q is located near the root. The number of LUs linked by a short path across the root is much higher than the number of LUs from the subtree of q which are located at a close distance to q. The problem is especially visible for question LUs in small hypernymy subtrees with a limited number of hyponyms. As the problem appears in the case of any similarity measure based on the path length, we have heuristically modified the measure by adding the value 3 to any path going across the artificial root. Lower values gave no visible changes, while higher values caused a large reduction of the number of QA pairs.

To illustrate the difference in the level of difficulty between WBST+H and EWBST, we show an example problem generated by each method for the same QA pair administracja (administration), zarząd (board). The EWBST built the following test:

Q: administracja (administration)
A: urząd (office, department), fundacja (charity, endowment), zarząd (board, management), ministerstwo (ministry).
And the test generated by WBST+H:

Q: administracja (administration)
A: poddasze (attic), repatriacja (repatriation), zarząd (board, management), zwolennik (follower, zealot).

An example EWBST test was given to 32 native speakers of Polish, all of them Computer Science students. (This bias in the group of raters should not influence the results, because the test was composed on the basis of plwn, which at present includes only general Polish vocabulary.) The test consisted of 99 QA pairs. All LUs in the test were selected from 5706 single-word noun LUs included in plwn. In the set of question LUs, there were 42 LUs occurring more than 1000 times in the IPI PAN corpus (Przepiórkowski, 2004). This subset was distinguished in the test, because such LUs are also the basis of the comparison with the results achieved in (Freitag et al., 2005) and (Piasecki et al., 2006).

For all QA pairs the result was 70%, with the minimum 61.62%, maximum 78.79% and the standard deviation from the mean σ = 4.07%. For the subset consisting of frequent LUs, the average result was 63.24% with the minimum 52.38%, maximum 73.81% and σ = 5.37%. The results, as expected, are much lower than those achieved in WBST+H tests. We were surprised that the results the raters had for the frequent LUs were significantly lower than for all LUs. It is likely that more frequent LUs are at the same time more polysemous, and that makes them more difficult to distinguish from other similar LUs. The results for frequent LUs are lower, but still at a level similar to the results for all LUs. A situation like this was also observed in the application of the EWBST to SSFs discussed in Section 5.

4. Similarity Functions for Polish Nouns

Piasecki et al. (2006) proposed an SSF based on the frequency of modification of a noun by specific adjectives and on the frequency of coordination with specific nouns. Features based on verbs were also tested.
Following this approach, we have defined a set of noun-meaning markers identifiable via shallow morpho-syntactic processing (full parsing, unfortunately, was not an option). The abbreviations used in Table 1 are given in parentheses:

- modification by a specific adjective, from (Piasecki et al., 2006) (A),
- modification by a specific adjectival participle (Part),
- coordination with a specific noun, from (Piasecki et al., 2006) (Nc),
- occurrence of a verb for which a given noun in a specific case can be an argument (V(case)),
- modification by a specific noun in genitive (NMg),
- occurrence of a specific preposition with which a given noun in a specific case forms a prepositional phrase (Prep(case)).

An N × C matrix M is created from the IPI PAN corpus documents tagged by the TaKIPI tagger (Piasecki, 2006). C is the number of lexico-syntactic features used, N the number of nouns, and M[n, c] the number of occurrences of the n-th noun with the c-th feature (i.e., the number of times the constraint was satisfied). All features are based on the occurrences of certain lexical markers in the context (a sentence defined by TaKIPI), which satisfy certain morpho-syntactic constraints such as, for example, the presence of some syntactic configuration between the noun and the marker. The constraints are expressed in the JOSKIPI language included in TaKIPI. One such constraint is shown below (partially, enough to illustrate the variety of morpho-syntactic phenomena that we can test).

    or(
      and(
        llook(-1,begin,$c,
          and(equal(pos[$c],{conj}),
              inter(base[$c],{"ani","albo","czy","i","lub","oraz"}))),
        only($c,-1,$oa,
          in(pos[$oa],{conj,adj,pact,ppas,num,numcol,adv,qub,pcon,pant})),
        llook($-1c,begin,$s,
          and(in(pos[$s],{subst,ger,depr}),
              inter(base[$s],{"variable-n"}))),
        inter(cas[$s],cas[0]),
        only($s,$c,$ob,
          in(pos[$ob],{conj,adj,pact,ppas,num,numcol,adv,qub,pcon,pant,subst,ger,depr}))
      ),
      and(... analogically to the right)
    )

In this expression, the first operator llook looks for a conjunction to the left of the centre of the context, that is, the position of a given noun. The operator only tests the units between that conjunction and the centre; the allowed types are conjunction, adjective, adjectival participle, numeral, adverb, etc. Next, we look for the potentially coordinated noun, defined in each instance of the constraint (variable-n) for a column of the matrix, and we test case agreement with inter. Next, we consider words to the right of the conjunction, using a symmetrically inverted constraint.

The similarity between nouns is calculated on the basis of matrix rows according to the method proposed in (Piasecki et al., 2006). The central element of the method is the Rank Weight Function:

1. Weighted values of the cells are recalculated using a weight function $f_w$: for every feature c, $M[n_i, c] = f_w(M[n_i, c])$.
2. Features in a row vector $M[n_i, \cdot]$ are sorted in the ascending order on the weighted values.
3. The k highest-ranking features are selected (e.g. k = …).
4. For each selected feature $c_j$: $M[n_i, c_j] = k - rank(c_j)$.

As the weight function, we applied the t-score test tscore(n, c) (Manning and Schütze, 2001):

\[ tscore(n, c) = \frac{M[n,c] - \frac{TF_n \, TF_c}{W}}{\sqrt{\frac{TF_n \, TF_c}{W}}} \]

where $TF_n$ and $TF_c$ are the total frequencies of noun words and of constraints satisfied, respectively, and W is the number of words processed. Additionally, a threshold for the number of features common to both nouns, mcom > 1%, serves as a constraint for the similarity to be positive (otherwise it is set to 0).
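The t-score weighting and the Rank Weight Function steps listed above can be sketched for a single matrix row. This is an illustration under assumptions, not the authors' implementation: the t-score is taken as observed minus expected frequency divided by the square root of the expected frequency, rank 0 is assigned to the highest-weighted feature (so it receives the value k), unselected features are set to 0, and the names `tscore` and `rank_weight_row` as well as the toy counts are hypothetical.

```python
import math

def tscore(m_nc, tf_n, tf_c, w):
    """t-score of a noun/feature co-occurrence count m_nc.

    Assumed form: (observed - expected) / sqrt(expected),
    with expected = tf_n * tf_c / w.
    """
    expected = tf_n * tf_c / w
    if expected == 0:
        return 0.0
    return (m_nc - expected) / math.sqrt(expected)

def rank_weight_row(row, tf_n, tf_c_list, w, k):
    """Apply the rank weighting to one row of M.

    Step 1: weight every cell with the t-score.
    Steps 2-3: order features by weight and keep the k best.
    Step 4: replace each kept cell by k - rank (rank 0 = best, an assumption).
    """
    weighted = [tscore(m, tf_n, tf_c, w) for m, tf_c in zip(row, tf_c_list)]
    order = sorted(range(len(row)), key=lambda c: weighted[c], reverse=True)
    out = [0] * len(row)
    for rank, c in enumerate(order[:k]):
        out[c] = k - rank
    return out

# Toy data: one noun seen 20 times, four features, 1000 words processed.
row = [10, 0, 3, 7]
feature_totals = [50, 40, 5, 100]
print(rank_weight_row(row, tf_n=20, tf_c_list=feature_totals, w=1000, k=2))
```

In the toy row the rare third feature outranks the more frequent first one, which is the effect the t-score weighting is meant to have: raw counts are replaced by how surprising each co-occurrence is.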

5. Experiments

In order to test the influence of the constraints, for each constraint we created a separate matrix on the basis of about 254 million words from the IPI PAN corpus. The SSFs calculated from the matrices by the rank method were next tested using WBST+H and EWBST, both generated according to the present state of plwn. Most of the tests were limited to nouns in plwn that occur more than 1000 times in the corpus (6105 nouns); this is the threshold used in (Freitag et al., 2005; Piasecki et al., 2006). We generated 3025 QA pairs for the frequent nouns.

In EWBST tests run for all nouns of plwn, the results were similar to those for the frequent nouns only. This is in contrast with the WBST+H test, in which there is a big difference between the accuracy for the frequent and the infrequent nouns. For example, in (Piasecki and Broda, 2007) the best result for the frequent nouns is 81.15%, while for all nouns it is only 64.03%. WBST+H is easier for the frequent nouns, well described by the frequent occurrences of features, because it is easier to distinguish them from completely randomly selected nouns. In EWBST all nouns are compared with those similar to them. The result for infrequent nouns, not so well described, stays approximately the same, but the result for the frequent nouns is worse, as the task becomes harder. From the point of view of the automatic construction of synsets, this behaviour of EWBST is advantageous: we perform only one test and yet we get a good description of the whole SSF.
Features                 W        E        E_A
A                        88.65%   51.51%   50.97%
Part                     78.79%   43.86%   37.94%
NMg                      72.43%   44.56%   41.16%
Nc                       76.85%   47.01%   44.70%
Prep(acc)                35.14%   22.20%   20.39%
Prep(all)                50.21%   30.00%   28.33%
V(acc)                   75.36%   41.78%   40.17%
V(dat)                   48.64%   30.04%   26.25%
V(all)                   75.94%   42.04%   40.12%
A+NMg                    86.66%   52.20%   52.75%
A+NMg+Prep(all)          86.74%   51.20%   52.24%
A+NMg+Prep(all)+Part     87.40%   52.27%   52.62%
A+NMg+Part               87.29%   52.86%   53.31%
A+Nc+Part                90.92%   53.32%   52.55%
A+Nc+NMg                 88.65%   53.52%   54.25%
A+Nc+NMg+Part            88.57%   53.13%   54.25%

Table 1: Experiments with SSFs based on different constraints. Constraint names are defined in Section 4. W means WBST+H applied to frequent nouns (> 1000 occurrences). E means EWBST for frequent nouns and E_A means EWBST for all nouns.

The rank method can select an appropriate set of features for a tested noun, so the individual results of the subsequent matrices do not directly influence the results of the joint matrix. This can be seen in Table 1. For example, the individual result of the constraint based on modification by participles is only 43.86%, but when we add the Part matrix to the A+NMg matrix, which has a higher accuracy, the result goes up to 52.86%.

The results of the best SSFs are different in WBST+H and in EWBST. The best combination in WBST+H, namely A+Nc+Part, achieves a relatively low result in EWBST. The coordination with other nouns is a good factor to identify large semantic fields of related nouns, and in the WBST+H test it helps distinguish between a given noun and a completely unrelated noun. This is why the result is very good. In EWBST the situation is different. We test the ability to distinguish among semantically related nouns.

The modification by a noun in genitive is a medium-quality feature when taken alone (only 44.56% in EWBST), probably because this modification is polysemous (and vague as well); the constraint also overgenerates: it often signals nonexistent associations. There are no morpho-syntactic constraints for this type of modification.
No agreement is required, and without full parsing we can only rely on a very vague syntactic requirement of adjacency of the two words, like this:

    and(
      rlook(1,5,$a,
        and(in(pos[$a],{subst,ger,depr}),
            equal(cas[$a],{gen}),
            inter(base[$a],{"variable-n"}))
      ),
      only(1,$a,$ad,
        or(in(pos[$ad],{adv,qub,pcon,pant}),
           and(in(pos[$ad],{subst,ger,depr}),
               equal(cas[$ad],{gen})),
           and(in(pos[$ad],{adj,pact,ppas,num,numcol}),
               agrpp(0,$ad,{nmb,gnd,cas},3))
        ))
    )

agrpp is the operator of morpho-syntactic agreement on the selected attributes. In spite of the errors introduced by the constraint, the feature NMg delivers an additional source of properties that express the meaning of a noun, and a combination of the NMg matrix with other matrices of properties, A+Nc+NMg+Part, results in the best score obtained in the EWBST test. This could not be observed with the former test, WBST+H.

6. Conclusions

We have proposed an extension of the WordNet-Based Similarity Test, which appears to be more discerning. Our research goal is the application of SSFs in the automatic construction of wordnet synsets. The operation of the proposed EWBST brings us closer to that goal. The EWBST allows us to observe the ability of an SSF to make fine-grained distinctions between semantically related Lexical Units. Its results can be easily interpreted. The test can be generated on a large scale, depending only on the size of the underlying wordnet. The EWBST is challenging for people and significantly difficult for SSFs. It leaves more room for improvement, and it behaves in the same way for frequent and infrequent nouns.

The drawback of EWBST is its dependency on the existence of a semantic similarity measure generated on the basis of manually created data (in its present version, on the existence of a hypernymy hierarchy). In many, if not all, wordnets, the hypernymy hierarchy is rich only for nouns. The EWBST can work well for verbs or adjectives, but first a different similarity function should be proposed for the generation of the answer sets in tests.

Acknowledgement. Work financed by the Polish Ministry of Education and Science, project No. 3 T11C

References

Agirre, Eneko and Philip Edmonds (eds.), 2006. Word Sense Disambiguation: Algorithms and Applications. Text, Speech and Language Technology. Springer.

Budanitsky, Alexander and Graeme Hirst, 2006. Evaluating wordnet-based measures of semantic distance. Computational Linguistics, 32(1).

Derwojedowa, Magdalena, Maciej Piasecki, Stanisław Szpakowicz, and Magdalena Zawisławska, 2007a. plWordNet: the Polish wordnet. Online access to the database of plWordNet: wroc.pl.

Derwojedowa, Magdalena, Maciej Piasecki, Stanisław Szpakowicz, and Magdalena Zawisławska, 2007b. Polish WordNet on a shoestring. In Proceedings of Biannual Conference of the Society for Computational Linguistics and Language Technology, Tübingen, April. Universität Tübingen.

Fellbaum, Christiane (ed.), 1998. WordNet: An Electronic Lexical Database. The MIT Press.

Freitag, Dayne, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, and Zhiqiang Wang, 2005. New experiments in distributional representations of synonymy. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005). Ann Arbor, Michigan: Association for Computational Linguistics.

Grefenstette, G., 1993. Evaluation techniques for automatic semantic extraction: Comparing syntactic and window based approaches. In Proceedings of The Workshop on Acquisition of Lexical Knowledge from Text, Columbus, SIGLEX 93. ACL.

Landauer, T. and S. Dumais, 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition. Psychological Review, 104(2).

Lin, Dekang, 1998a. Automatic retrieval and clustering of similar words. In COLING ACL.

Lin, Dekang, 1998b. An information-theoretic definition of similarity. In Proceedings of 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.

Manning, Christopher D. and Hinrich Schütze, 2001. Foundations of Statistical Natural Language Processing. The MIT Press.

Piasecki, Maciej, 2006. Handmade and automatic rules for Polish tagger. Lecture Notes in Artificial Intelligence. Springer.

Piasecki, Maciej and Bartosz Broda, 2007. Semantic similarity measure of Polish nouns based on linguistic features. In Witold Abramowicz (ed.), Business Information Systems 10th International Conference, BIS 2007, Poznan, Poland, April 25-27, 2007, Proceedings, volume 4439 of Lecture Notes in Computer Science. Springer.

Piasecki, Maciej, Stanisław Szpakowicz, and Bartosz Broda, 2006. Automatic selection of heterogeneous syntactic features in semantic similarity of Polish nouns. In Proceedings of the Text, Speech and Dialog 2007 Conference.

Przepiórkowski, Adam, 2004. The IPI PAN Corpus Preliminary Version. Institute of Computer Science PAS.

Rubenstein, H. and J. B. Goodenough, 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10).

Turney, P.T., 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning. Berlin: Springer-Verlag.

Turney, P.T., M.L. Littman, J. Bigham, and V. Shnayder, 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings International Conference on Recent Advances in Natural Language Processing (RANLP-03). Borovets, Bulgaria.

Weeds, Julie and David Weir, 2005. Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4).

Zesch, Torsten and Iryna Gurevych, 2006. Automatically creating datasets for measures of semantic relatedness. In Proceedings of the Workshop on Linguistic Distances. Sydney, Australia: Association for Computational Linguistics.


More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

An Interactive Intelligent Language Tutor Over The Internet

An Interactive Intelligent Language Tutor Over The Internet An Interactive Intelligent Language Tutor Over The Internet Trude Heift Linguistics Department and Language Learning Centre Simon Fraser University, B.C. Canada V5A1S6 E-mail: heift@sfu.ca Abstract: This

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Robust Sense-Based Sentiment Classification

Robust Sense-Based Sentiment Classification Robust Sense-Based Sentiment Classification Balamurali A R 1 Aditya Joshi 2 Pushpak Bhattacharyya 2 1 IITB-Monash Research Academy, IIT Bombay 2 Dept. of Computer Science and Engineering, IIT Bombay Mumbai,

More information

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

The Internet as a Normative Corpus: Grammar Checking with a Search Engine The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a

More information

Combining a Chinese Thesaurus with a Chinese Dictionary

Combining a Chinese Thesaurus with a Chinese Dictionary Combining a Chinese Thesaurus with a Chinese Dictionary Ji Donghong Kent Ridge Digital Labs 21 Heng Mui Keng Terrace Singapore, 119613 dhji @krdl.org.sg Gong Junping Department of Computer Science Ohio

More information


CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis

More information



More information

Parsing of part-of-speech tagged Assamese Texts

Parsing of part-of-speech tagged Assamese Texts IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal

More information

Ensemble Technique Utilization for Indonesian Dependency Parser

Ensemble Technique Utilization for Indonesian Dependency Parser Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id

More information

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form Orthographic Form 1 Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form The development and testing of word-retrieval treatments for aphasia has generally focused

More information

Word Sense Disambiguation

Word Sense Disambiguation Word Sense Disambiguation D. De Cao R. Basili Corso di Web Mining e Retrieval a.a. 2008-9 May 21, 2009 Excerpt of the R. Mihalcea and T. Pedersen AAAI 2005 Tutorial, at: http://www.d.umn.edu/ tpederse/tutorials/advances-in-wsd-aaai-2005.ppt

More information

Reinforcement Learning by Comparing Immediate Reward

Reinforcement Learning by Comparing Immediate Reward Reinforcement Learning by Comparing Immediate Reward Punit Pandey DeepshikhaPandey Dr. Shishir Kumar Abstract This paper introduces an approach to Reinforcement Learning Algorithm by comparing their immediate

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2

Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2 Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2 Ted Pedersen Department of Computer Science University of Minnesota Duluth, MN, 55812 USA tpederse@d.umn.edu

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

Short Text Understanding Through Lexical-Semantic Analysis

Short Text Understanding Through Lexical-Semantic Analysis Short Text Understanding Through Lexical-Semantic Analysis Wen Hua #1, Zhongyuan Wang 2, Haixun Wang 3, Kai Zheng #4, Xiaofang Zhou #5 School of Information, Renmin University of China, Beijing, China

More information

Axiom 2013 Team Description Paper

Axiom 2013 Team Description Paper Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Evidence for Reliability, Validity and Learning Effectiveness

Evidence for Reliability, Validity and Learning Effectiveness PEARSON EDUCATION Evidence for Reliability, Validity and Learning Effectiveness Introduction Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies

More information

TextGraphs: Graph-based algorithms for Natural Language Processing

TextGraphs: Graph-based algorithms for Natural Language Processing HLT-NAACL 06 TextGraphs: Graph-based algorithms for Natural Language Processing Proceedings of the Workshop Production and Manufacturing by Omnipress Inc. 2600 Anderson Street Madison, WI 53704 c 2006

More information


THE VERB ARGUMENT BROWSER THE VERB ARGUMENT BROWSER Bálint Sass sass.balint@itk.ppke.hu Péter Pázmány Catholic University, Budapest, Hungary 11 th International Conference on Text, Speech and Dialog 8-12 September 2008, Brno PREVIEW

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

Introduction to HPSG. Introduction. Historical Overview. The HPSG architecture. Signature. Linguistic Objects. Descriptions.

Introduction to HPSG. Introduction. Historical Overview. The HPSG architecture. Signature. Linguistic Objects. Descriptions. to as a linguistic theory to to a member of the family of linguistic frameworks that are called generative grammars a grammar which is formalized to a high degree and thus makes exact predictions about

More information

CS 598 Natural Language Processing

CS 598 Natural Language Processing CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@

More information

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming Data Mining VI 205 Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming C. Romero, S. Ventura, C. Hervás & P. González Universidad de Córdoba, Campus Universitario de

More information

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 141 ( 2014 ) 124 128 WCLTA 2013 Using Corpus Linguistics in the Development of Writing Blanka Frydrychova

More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

Universiteit Leiden ICT in Business

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

More information

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February

More information

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

On-the-Fly Customization of Automated Essay Scoring

On-the-Fly Customization of Automated Essay Scoring Research Report On-the-Fly Customization of Automated Essay Scoring Yigal Attali Research & Development December 2007 RR-07-42 On-the-Fly Customization of Automated Essay Scoring Yigal Attali ETS, Princeton,

More information

*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN

*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN From: AAAI Technical Report WS-98-08. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Recommender Systems: A GroupLens Perspective Joseph A. Konstan *t, John Riedl *t, AI Borchers,

More information

Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features

Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features Sriram Venkatapathy Language Technologies Research Centre, International Institute of Information Technology

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA Alta de Waal, Jacobus Venter and Etienne Barnard Abstract Most actionable evidence is identified during the analysis phase of digital forensic investigations.

More information

Accuracy (%) # features

Accuracy (%) # features Question Terminology and Representation for Question Type Classication Noriko Tomuro DePaul University School of Computer Science, Telecommunications and Information Systems 243 S. Wabash Ave. Chicago,

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

Evolution of Symbolisation in Chimpanzees and Neural Nets

Evolution of Symbolisation in Chimpanzees and Neural Nets Evolution of Symbolisation in Chimpanzees and Neural Nets Angelo Cangelosi Centre for Neural and Adaptive Systems University of Plymouth (UK) a.cangelosi@plymouth.ac.uk Introduction Animal communication

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Outline. Web as Corpus. Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University. nclt

Outline. Web as Corpus. Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University. nclt Outline Using Web Data for Linguistic Purposes NCLT, Dublin City University Outline Outline 1 Corpora as linguistic tools 2 Limitations of web data Strategies to enhance web data 3 Corpora as linguistic

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume 11 : 12 December 2011 ISSN

LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume 11 : 12 December 2011 ISSN LANGUAGE IN INDIA Strength for Today and Bright Hope for Tomorrow Volume ISSN 1930-2940 Managing Editor: M. S. Thirumalai, Ph.D. Editors: B. Mallikarjun, Ph.D. Sam Mohanlal, Ph.D. B. A. Sharada, Ph.D.

More information

Online Updating of Word Representations for Part-of-Speech Tagging

Online Updating of Word Representations for Part-of-Speech Tagging Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information

Language Independent Passage Retrieval for Question Answering

Language Independent Passage Retrieval for Question Answering Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University

More information

Program Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading

Program Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading Program Requirements Competency 1: Foundations of Instruction 60 In-service Hours Teachers will develop substantive understanding of six components of reading as a process: comprehension, oral language,

More information

The Smart/Empire TIPSTER IR System

The Smart/Empire TIPSTER IR System The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of

More information

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh The Effect of Discourse Markers on the Speaking Production of EFL Students Iman Moradimanesh Abstract The research aimed at investigating the relationship between discourse markers (DMs) and a special

More information

Writing a composition

Writing a composition A good composition has three elements: Writing a composition an introduction: A topic sentence which contains the main idea of the paragraph. a body : Supporting sentences that develop the main idea. a

More information

Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data

Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Maja Popović and Hermann Ney Lehrstuhl für Informatik VI, Computer

More information

Using Web Searches on Important Words to Create Background Sets for LSI Classification

Using Web Searches on Important Words to Create Background Sets for LSI Classification Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract

More information

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

Problems of the Arabic OCR: New Attitudes

Problems of the Arabic OCR: New Attitudes Problems of the Arabic OCR: New Attitudes Prof. O.Redkin, Dr. O.Bernikova Department of Asian and African Studies, St. Petersburg State University, St Petersburg, Russia Abstract - This paper reviews existing

More information

Cross-Lingual Text Categorization

Cross-Lingual Text Categorization Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es

More information

Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities

Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB

More information

Beyond the Pipeline: Discrete Optimization in NLP

Beyond the Pipeline: Discrete Optimization in NLP Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We

More information

Memory-based grammatical error correction

Memory-based grammatical error correction Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

An Introduction to the Minimalist Program

An Introduction to the Minimalist Program An Introduction to the Minimalist Program Luke Smith University of Arizona Summer 2016 Some findings of traditional syntax Human languages vary greatly, but digging deeper, they all have distinct commonalities:

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language

Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Nathaniel Hayes Department of Computer Science Simpson College 701 N. C. St. Indianola, IA, 50125 nate.hayes@my.simpson.edu

More information

Derivational and Inflectional Morphemes in Pak-Pak Language

Derivational and Inflectional Morphemes in Pak-Pak Language Derivational and Inflectional Morphemes in Pak-Pak Language Agustina Situmorang and Tima Mariany Arifin ABSTRACT The objectives of this study are to find out the derivational and inflectional morphemes

More information



More information

1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature

1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature 1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Advanced Grammar in Use

Advanced Grammar in Use Advanced Grammar in Use A self-study reference and practice book for advanced learners of English Third Edition with answers and CD-ROM cambridge university press cambridge, new york, melbourne, madrid,

More information