METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS


Ruslan Mitkov, University of Wolverhampton
Viktor Pekar, University of Wolverhampton
Dimitar Blagoev, University of Plovdiv
Andrea Mulloni, University of Wolverhampton

Abstract

The identification of cognates has attracted the attention of researchers working in the area of Natural Language Processing, but the identification of false friends is still an under-researched area. This paper proposes two novel methods (with the first method experimented with in different variants) for the automatic identification of both cognates and false friends from corpora, which are not dependent on the existence of parallel texts or any other collections/lists. The two methods are evaluated on English, French, German and Spanish corpora in order to identify English-French, English-German, English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) an ideal environment, where the pairs to be classified were either cognates or false friends, and (ii) a real-world environment, where cognates and false friends had to be identified among all unique words found in two comparable corpora in different languages. The former task would be the same as the latter in the case of perfect pre-processing, or as the task of classifying a pair as cognates or false friends from lists of orthographically (and semantically) close words. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory levels of both recall and precision.

1 INTRODUCTION

Cognates and false friends play an important role in the teaching and learning of a foreign language, as well as in translation. The existence of cognates, which are words that have similar meaning and spelling in two or more languages (e.g. colour in English, color in Spanish and couleur in French, or Bibliothek in German, bibliothèque in French and biblioteca in Spanish), helps students' reading comprehension and contributes to the expansion of their vocabularies. False friends (faux amis), however, create problems and have the opposite effect, as they have similar spellings but do not share the same meaning (e.g. library in English as opposed to librería in Spanish or librairie in French).

The ability of a student to distinguish cognates from false friends plays a vital role in the successful learning and mastering of a foreign language. However, lists of such words can be found for a very limited number of languages, and the preparation of these lists is a labour-intensive and time-consuming task. Therefore, an attractive alternative would be to retrieve cognates or false friends from a corpus automatically. In addition, the automatic identification of cognates plays an important role in applied NLP tasks, as in the case of lexicon acquisition from comparable corpora (Koehn and Knight 2002) or word alignment in parallel corpora (Melamed 1999; Simard et al. 1992). Translation studies have also shown an interest in cognate identification (Laviosa). The identification of cognates has already attracted the attention of researchers working in the area of Natural Language Processing (Simard et al. 1992; Melamed 1999; Danielsson and Muhlenbock 2000; Kondrak 2001; Volk et al.; Mulloni and Pekar 2006), but the identification of false friends is still an under-researched area. Whereas some research has been reported on the automatic recognition of false friends, the developed methodology depends on existing lists of false friends and parallel corpora (e.g. Inkpen et al. 2005). By contrast, this paper proposes two novel methods for the automatic identification of both cognates and false friends from corpora which are not dependent on the existence of parallel texts or any other collections/lists.

2 METHODOLOGY

The methodology for the automatic identification of cognates and false friends we propose in this paper is based on a two-stage process. The first stage involves the extraction of candidate pairs from non-parallel bilingual corpora, whereas the second stage is concerned with the classification of the extracted pairs as cognates, false friends or unrelated words. This methodology has been tested on four language pairs: English-French, English-Spanish, English-German and Spanish-French. The extraction of candidate pairs is based on comparing the orthographic similarity between two words, whereas the classification of an extracted pair is performed on the basis of the semantic similarity of the two words. Semantic similarity has been computed from taxonomies, or approximated from corpus data employing distributional similarity algorithms. In the following we introduce the measures of orthographic and semantic similarity employed in this study.

2.1 Orthographic Similarity

Given that the extraction process involves the comparison of any word from the first language with any word from the second, speed and efficiency were a major consideration, and hence so was the choice of a suitable orthographic similarity measure/algorithm. In this study, two orthographic similarity measures have been experimented with. None of the methods has been reported separately in any previous publication; the first method has been experimented with in three different variants. Measuring orthographic similarity is a commonly used method for distinguishing pairs of unrelated words from pairs of cognates and false friends. Inkpen et al. (2005) present a study of different measures and their efficiency.

LCSR (Longest Common Subsequence Ratio), as proposed by Melamed (1999), is computed by dividing the length of the longest common subsequence (LCS) of the two words by the length of the longer word:

LCSR(w1, w2) = |LCS(w1, w2)| / max(|w1|, |w2|)

For example, LCSR(example, exemple) = 6/7, since their LCS is e-x-m-p-l-e.

NED (Normalised Edit Distance), as proposed in (Inkpen et al. 2005), is calculated by dividing the minimum number of edit operations needed to transform one word into the other by the length of the longer string. Edit operations include substitutions, insertions and deletions.
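To make the two measures concrete, the following is a minimal, standard-library Python sketch of LCSR and NED as defined above; the function and variable names are illustrative and not taken from the paper.

```python
# Minimal sketch of the two orthographic similarity measures, standard library only.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: |LCS| divided by the longer word's length."""
    return lcs_length(a, b) / max(len(a), len(b))

def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution (or match)
        prev = cur
    return prev[len(b)]

def ned(a: str, b: str) -> float:
    """Normalised Edit Distance: edit distance divided by the longer word's length."""
    return edit_distance(a, b) / max(len(a), len(b))

print(lcsr("example", "exemple"))   # 6/7 ~= 0.857
print(ned("example", "exemple"))    # 1/7 ~= 0.143
```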

2.2 Semantic Similarity

While a number of semantic similarity measures based on taxonomies exist (see Budanitsky and Hirst 2001 for an overview and references), in this study we have experimented with the following two measures.

Leacock and Chodorow's (1998) measure uses the normalised path length between the two concepts c1 and c2 and is computed as follows:

sim_LC(c1, c2) = -log( len(c1, c2) / (2 * MAX) )

where len(c1, c2) is the length of the shortest path between c1 and c2, and MAX is the maximum depth of the taxonomy.

Wu and Palmer's (1994) measure is based on edge distance but also takes into account the most specific node dominating the two concepts c1 and c2:

sim_WP(c1, c2) = 2 * d(c3) / (d(c1) + d(c2))

where c3 is the most specific superclass of c1 and c2, d(c3) is the depth of c3 (the distance from the root of the taxonomy), and d(c1) and d(c2) are the depths of c1 and c2.

Each word, however, can have one or more meanings (senses) mapping to different concepts in the ontology. Using s(w) to represent the set of concepts in the taxonomy that are senses of the word w, word similarity can be defined as (Resnik 1999):

wsim(w1, w2) = max [ sim(c1, c2) ]

where c1 ranges over s(w1) and c2 ranges over s(w2).

2.3 Distributional Similarity

Since taxonomies with wide coverage are not readily available, semantic similarity can also be modelled via word co-occurrences in corpora. Every word wj is represented by the set of words wi1 ... win with which it co-occurs. To derive this representation, all occurrences of wj and of all words in the context of wj are identified and counted. To account for the context of wj, two approaches have been applied: window-based and syntactic. In the first, the context is marked out by defining a window of a certain size around wj (e.g., Gale, Church, and Yarowsky (1992) used a thousand-word window). In the second approach, the context is limited to words appearing in a certain syntactic relation to wj, such as direct objects of a verb (Grefenstette 1996; Pereira, Tishby, and Lee 1993). Once the co-occurrence data is collected, the semantics of wj are modelled as a vector in an n-dimensional space, where n is the number of words co-occurring with wj and the components of the vector are the probabilities of the co-occurrences estimated from their observed frequencies:

C(wj) = ( P(wi1|wj), P(wi2|wj), ..., P(win|wj) )

Semantic similarity between words is then operationalised via the distance between their vectors. In the literature, various distance measures have been used, including Euclidean distance, the cosine (Schuetze 1992), Kullback-Leibler divergence (Pereira, Tishby, and Lee 1993) and Jensen-Shannon divergence (Dagan, Lee, and Pereira 1999).

3 EXTRACTION OF CANDIDATE PAIRS

During the extraction stage the orthographic similarity between each noun from the first language (S) and each noun from the second language (T) is computed, and a list of the most similar word pairs is compiled. This list is expected to contain pairs of cognates or false friends, but due to pre-processing errors the extracted pairs may also contain unrelated words or errors such as words which are not orthographically similar or are not of the same part of speech. While a high degree of orthographic similarity indicates that two words belonging to different languages are cognates (or, in fact, borrowings), many unrelated words may have great similarity in spelling (e.g. Eng. black and Ger. Block). And vice versa: two words may be cognates, but their spellings may have little in common (e.g. Eng. cat and Ger. Katze). Our intuition is that between a given pair of languages there are certain regularities in the way the spelling of a word changes once it is borrowed from one language into the other. In the following sections we describe an algorithm which learns orthographic transformation rules capturing such regularities from a list of known cognates (3.1) and an algorithm which applies the induced rules to the discovery of potential cognates in a corpus (3.2).

3.1 Learning algorithm

The learning algorithm involves three major steps: (a) the association of edit operations with the actual mutations that occur between two words known to be cognates (or false friends); (b) the extraction of candidate rules; (c) the assignment of a statistical score to the extracted rules signifying their reliability. Its input is a list of translation pairs, which is passed to a filtering module based on Normalised Edit Distance (NED): this allows for the identification of pairs of cognates/false friends C in two languages S and T, each pair consisting of a word w_S from S and a word w_T from T.

The output of the algorithm is a set of rules R. In the beginning, two procedures are applied to the data: (i) edit operations between the two strings of the same pair are identified; (ii) NED between the members of each pair is calculated in order to assign a score to each cognate pair. NED is calculated by dividing the edit distance (ED) by the length of the longer string. NED, and normalisation in general, allows for more consistent values, since it was noticed that when applying standard ED, word pairs of short length (up to 4 letters each) would be more prone to be included in the cognate list even if they are actually unrelated (e.g. at/an, ape/affe). Sample output of this step for three English-German pairs is shown in Figure 1.

Figure 1. Edit operation association between English and German.

At the next stage of the algorithm, a candidate rule is extracted from each edit operation of each word pair in the training data. Each candidate rule consists of two letter n-grams, the former referring to language S and the latter to its counterpart in language T. To construct it, for each edit operation detected we use k symbols on either side of the edited symbol in both words. The left-hand side is the language-S n-gram, while the right-hand side corresponds to the same n-gram in language T with the detected mutations. Figure 2 illustrates rules detected in this manner. Candidate rules are extracted using different values of k for each kind of edit operation, each value having been set experimentally. Substitution rules are created without considering the context around the letter being substituted, i.e. taking into account only the letter substitution itself, while deletions and insertions are sampled with k symbols on both sides. After extensive testing, k was set empirically, with the candidate rule varying in length on both the left-hand side and the right-hand side depending on the number of insertions and deletions it accounts for. This decision was supported by the fact that longer rules are less frequent than shorter rules, but they are nonetheless more precise. In fact, because of the task at stake and the further areas we want to apply the algorithm to, we were somewhat more inclined towards obtaining higher precision rather than higher recall.

At the final stage, a statistical score is assigned to each unique candidate rule extracted. After exploring different scoring functions (Fisher's exact test, chi-square, odds ratio and likelihood ratio), chi-square was chosen to measure the strength of the association between the left-hand side and the right-hand side of a candidate rule. Once every candidate rule has been associated with a chi-square value, candidates falling below a specific threshold on the chi-square value are filtered out, yielding the final set of rules.
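As an illustration of the learning step, the sketch below extracts candidate rules from the edit operations of a few known cognate pairs and scores each unique rule with a chi-square test over a 2x2 contingency table. It is a simplified approximation of the procedure described above: the alignment relies on Python's difflib, the k-symbol context around each edit is omitted, and the training pairs are toy examples.

```python
# Simplified sketch of the rule-learning step: candidate rules are read off the
# edit operations of known cognate pairs and scored with a chi-square test.
from collections import Counter
from difflib import SequenceMatcher

def candidate_rules(pairs):
    """Yield (source n-gram, target n-gram) candidates from aligned word pairs."""
    for src, tgt in pairs:
        for op, i1, i2, j1, j2 in SequenceMatcher(None, src, tgt).get_opcodes():
            if op != "equal":
                yield (src[i1:i2] or "-", tgt[j1:j2] or "-")

def chi_square(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def score_rules(pairs, threshold=0.0):
    """Score every unique candidate rule and keep those above the threshold."""
    rules = list(candidate_rules(pairs))
    freq, n = Counter(rules), len(rules)
    lhs = Counter(left for left, _ in rules)
    rhs = Counter(right for _, right in rules)
    scored = {}
    for (left, right), a in freq.items():
        b, c = lhs[left] - a, rhs[right] - a
        d = n - a - b - c
        score = chi_square(a, b, c, d)
        if score >= threshold:
            scored[(left, right)] = score
    return scored

training = [("electric", "elektrisch"), ("music", "musik"), ("academy", "akademie")]
for rule, score in sorted(score_rules(training).items(), key=lambda kv: -kv[1]):
    print(rule, round(score, 2))
```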

Figure 2. Top rules detected by the algorithm for the English-German language combination, along with the associated chi-square scores; the highest-scoring rules include c/k, d/t, ary/är, my/mie, hy/hie, gy/gie, ty/tät, et/ett, sh/sch and ive/iv.

3.2 Testing algorithm

The learning algorithm provides a set of rules which account for the orthographic behaviour of words between a source language and a target language. The second part of the algorithm (i.e. the testing algorithm) deploys this information in the candidate extraction process. Once the input data is made available, we proceed to apply the rules to each possible word pair, that is, to substitute the relevant n-grams on the left-hand side of the rules with their counterparts in the target language. LCSR is then computed for every pair, and the most similar pairs are added to the candidate cognate list. A case in point is the English-German entry electric/elektrisch: the original LCSR is 0.7, but if the rules c/k and ic/isch, detected earlier by the algorithm, are applied, the new LCSR is 1.0.
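A rough sketch of this extraction step is shown below. It reuses the lcsr() helper from the earlier sketch, rewrites the source word by applying learned rules exhaustively up to a fixed depth (a simplification of the procedure described above), and keeps the best LCSR over the rewritten variants; the example rules and words mirror the electric/elektrisch case.

```python
# Sketch of rule-aware candidate extraction; lcsr() is the helper defined earlier.

def variants(word, rules, depth=2):
    """All rewritings of `word` obtained by applying at most `depth` rules, one occurrence at a time."""
    results, frontier = {word}, {word}
    for _ in range(depth):
        nxt = set()
        for w in frontier:
            for src_ngram, tgt_ngram in rules:
                if src_ngram in w:
                    nxt.add(w.replace(src_ngram, tgt_ngram, 1))
        results |= nxt
        frontier = nxt
    return results

def rule_aware_lcsr(source, target, rules):
    """Best LCSR between any rewriting of the source word and the target word."""
    return max(lcsr(v, target) for v in variants(source, rules))

rules = [("c", "k"), ("ic", "isch")]                      # e.g. rules learned for English->German
print(lcsr("electric", "elektrisch"))                     # 0.7 with the raw spelling
print(rule_aware_lcsr("electric", "elektrisch", rules))   # 1.0 after the rules are applied
```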

4 CLASSIFICATION

The goal of the next stage of the methodology is to separate cognates from false friends in the lists obtained in the previous stage. Given a list of known cognates and false friends (training data) and a list of mixed pairs of words where the words in each pair are either cognates or false friends (test data), the classification task is to automatically label each test pair. To this end, the semantic similarity between the words in each pair in the training data is computed, and a numerical measure (threshold) is estimated. This threshold is later used to label the test data: if the similarity between the words in a test pair is lower than the threshold, the words are returned as false friends; otherwise, they are returned as cognates. After a threshold is estimated, all pairs in the test data are labelled in this manner. The presented methodology is independent of the language pair(s) in that it operates on any texts in any languages, not necessarily parallel ones, although better results are expected if the corpora are comparable. In order to establish the similarity between words in different languages, we experimented with four methods: Method 1, operating in three different variants (Method 1 without taxonomy, Method 1 with Leacock and Chodorow, and Method 1 with Wu and Palmer), and Method 2. These are outlined below.

4.1 Exploiting distributional similarities between words of the same language

Method 1 is based on the premise that if two words have the same meaning (and are cognates), they should be semantically close to roughly the same set of words in both (or more) languages, whereas two words which do not have the same meaning (and are false friends) will not be semantically close to the same set of words in both (or more) languages.

Method 1 can be formally described as follows. Start with two words (w_S, w_T) in the languages S and T, with w_S from S and w_T from T. Then calculate the N most similar words for each of the two words according to a chosen distributional similarity function; in this study, skew divergence (Lee 1999) was selected for Method 1 because it performed best during our pilot tests. Then build two sets of N words, W_S (w_S1, w_S2, ..., w_SN) and W_T (w_T1, w_T2, ..., w_TN), such that w_Si is the i-th most similar word to w_S and w_Ti is the i-th most similar word to w_T. Figure 3 shows the two cognates Eng. article and Fre. article along with the sets of their most similar words in the respective languages. A connection between two words is made when one of the words is listed as a translation of the other in the bilingual dictionary. A Dice coefficient function is then applied over the two sets to determine their similarity: a word from set W_S is added to the collision set only if it has at least one translation present in set W_T, and vice versa. Currently, words with multiple translations in the opposite set are not treated in any special way.

Note that N is a crucial parameter. If the value of N is too small, similar words may appear as very distant, because their common synonyms may not be present in both sets (or at all). If the value of N is too big, the sets may be filled with word pairs that are distant from the initial one, thus making the words in the initial pair appear more similar than they actually are. The dictionary used can further affect the results. In the evaluation section, the variant of Method 1 which employs the distributional similarity function skew divergence as outlined above, without a taxonomy, is simply referred to as Method 1.

Figure 3. Method 1.
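The following sketch illustrates the core of Method 1 under simplified assumptions: the neighbour lists and the bilingual dictionary are small toy stand-ins for the distributional thesaurus and the EuroWordNet-derived dictionary, and the Dice-style score counts, on each side, the neighbours that have at least one translation in the opposite set.

```python
# Toy sketch of Method 1: compare two words via the overlap of their most
# similar distributional neighbours, linked through a bilingual dictionary.

def dice_over_neighbours(neigh_s, neigh_t, dictionary):
    """Dice-style score: on each side, count neighbours with a translation in the opposite set."""
    matched_s = [w for w in neigh_s if dictionary.get(w, set()) & set(neigh_t)]
    matched_t = [w for w in neigh_t
                 if any(w in dictionary.get(s, set()) for s in neigh_s)]
    return (len(matched_s) + len(matched_t)) / (len(neigh_s) + len(neigh_t))

# N most similar words to Eng. "article" and Fre. "article" (invented values),
# plus a small English->French dictionary used to connect the two sets.
neighbours_s = ["paper", "report", "essay", "item"]
neighbours_t = ["papier", "rapport", "texte", "objet"]
dictionary = {"paper": {"papier"}, "report": {"rapport"}, "item": {"objet"}}

score = dice_over_neighbours(neighbours_s, neighbours_t, dictionary)
print(score)   # 6/8 = 0.75; a high score points to cognates, a low one to false friends
```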

4.2 Exploiting distributional similarities between words across different languages

Method 2, which is inspired by methods for the acquisition of translation lexicons from comparable corpora (Fung 1998; Rapp 1999; Tanaka and Iwasaki 1996; Gaussier et al. 2004), determines the semantic similarity between words across two languages by mapping the space of distributional features of one language onto that of the other using a bilingual dictionary. Method 2 can be described formally as follows. First, co-occurrence data on each word of interest in both languages are extracted (in the present study we model the semantics of nouns by their co-occurrence with syntactically related verbs) and feature spaces for both sets of words are created. Then, given two words (w_S, w_T) in the languages S and T, the feature vector of w_S is translated into T with the help of a dictionary and then added to the co-occurrence data for T. The result is co-occurrence data that contains vectors for all words in the target language T plus the translated vector of the source word w_S. All words in T are then ranked according to their distributional similarity to the translated vector of w_S (using skew divergence again) and the rank of w_T is noted (R1). The same measure is used to rank all words according to their similarity to w_T, this time taking only the rank of w_S (R2); this is done because skew divergence is not symmetrical. The final result is the average of the two ranks, (R1 + R2)/2. Here the quality of the dictionary used for translation is essential. Besides the source data itself, one parameter that can make a difference is the direction of translation, i.e. which of the languages is the source and which is the target.

Method 1 (without taxonomy) and Method 2 can be regarded as related in that the distributional similarity techniques they employ both rely on context, but we felt it was worth experimenting with both methods with a view to comparing performance. In addition to making use of co-occurrence data, both methods also rely on dictionaries. However, one major difference is in the type of dictionary needed: for the current task of comparing cognates/false friends which are nouns, Method 1 requires a noun dictionary, whereas Method 2 requires a verb dictionary.
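A compact sketch of Method 2 is given below. The skew divergence follows Lee (1999); the verb co-occurrence vectors, the toy dictionary and the example words are illustrative assumptions, and only one direction of translation is shown (the second rank, R2, would be obtained analogously in the opposite direction).

```python
import math

# Toy sketch of Method 2: translate the source word's verb co-occurrence vector
# into the target language and rank target words by skew divergence to it.

def skew_divergence(p, q, alpha=0.99):
    """Skew divergence (Lee 1999): KL(p || alpha*q + (1 - alpha)*p)."""
    total = 0.0
    for feat, pf in p.items():
        if pf > 0.0:
            qf = q.get(feat, 0.0)
            total += pf * math.log(pf / (alpha * qf + (1 - alpha) * pf))
    return total

def translate_vector(vec, dictionary):
    """Map source-language features (verbs) into the target language and renormalise."""
    out = {}
    for feat, weight in vec.items():
        for trans in dictionary.get(feat, ()):
            out[trans] = out.get(trans, 0.0) + weight
    norm = sum(out.values()) or 1.0
    return {feat: weight / norm for feat, weight in out.items()}

def rank(word, reference_vec, space):
    """1-based rank of `word` among the words in `space`, closest to reference_vec first."""
    ordered = sorted(space, key=lambda w: skew_divergence(space[w], reference_vec))
    return ordered.index(word) + 1

# Invented verb-object co-occurrence probabilities and an English->German verb dictionary.
en_space = {"article": {"write": 0.6, "read": 0.4}}
de_space = {"artikel": {"schreiben": 0.5, "lesen": 0.5}, "apfel": {"essen": 1.0}}
en_de = {"write": ["schreiben"], "read": ["lesen"]}

r1 = rank("artikel", translate_vector(en_space["article"], en_de), de_space)
print(r1)   # 1: "artikel" is the target word closest to the translated vector
# R2 would be obtained in the opposite direction; the final score is (R1 + R2) / 2.
```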

4.3 Exploiting taxonomic similarities between words of the same language

If a pair of cognates/false friends under consideration is present in a taxonomic thesaurus, computing the semantic similarity directly from the taxonomy promises to be the best way forward. However, the absence of words from the thesaurus could result in high precision but low recall. To overcome this limitation, a hybrid method has been developed which works as follows. For any two words, their presence in a specific taxonomy is checked and, if they are present, a taxonomic semantic similarity measure is employed to compute a similarity value. Otherwise, distributional measures are used (as in Method 1 without taxonomy) to obtain a list of the N nearest neighbours for each word; in this case the taxonomy is used only instead of a dictionary. EuroWordNet was chosen for this study because it covers a number of languages. The taxonomy-based semantic similarity measures made use of are Leacock and Chodorow (1998), which uses the normalised path length between the two concepts being compared, and Wu and Palmer (1994), which is likewise based on edge distance but takes into account the most specific node dominating the two concepts. To start with, similarities between each nearest neighbour of w_S and each nearest neighbour of w_T are computed, and for each nearest neighbour the most similar neighbour in the opposite language is found. The final value is the average of these maximal similarities over all neighbours. The variant of Method 1 which employs Leacock and Chodorow's similarity measure is referred to in the evaluation section as Method 1 with Leacock&Chodorow, whereas the variant which computes similarity according to Wu and Palmer's measure is referred to as Method 1 with Wu&Palmer.

4.4 Threshold estimation

The following threshold estimation techniques were used with the methods outlined above:

Mean Average: The distances between the words in both training sets (cognates and false friends) are measured using the chosen method. The mean of the distances of the cognates and the mean of the distances of the false friends are computed. The threshold is the average of the two means.

Median Average: As above, but the median is used instead of the mean.

Max Accuracy: All distances from the training data sets are analysed to find a distance which, when used as a threshold, would (supposedly) give maximal accuracy for the test data. The accuracy is computed as:

(TP + TN) / (TP + TN + FP + FN)

An evaluation framework was created which simplifies and speeds up the process. The evaluation process splits the list of known cognates and false friends and creates the training and test pairs for each test. The threshold estimation process uses only the training pairs to find a threshold, which is then passed to the classification process. The distance measurement process receives a word pair and returns the distance according to the method and parameters used.

4.5 Evaluation settings

The evaluation is performed using ten-fold cross-validation. During the test, the task is to classify a pair as cognates or false friends, based on the measure of similarity and the threshold obtained from the training pairs. The results were evaluated in terms of both recall and precision. An average measure of recall, defined as the average of the recall in the identification of cognates [5] and the recall in the identification of false friends [6], has been used. Similarly, precision is computed as the average of the precision in identifying cognates [7] and the precision in identifying false friends [8].

[5] Recall for the identification of cognates is computed as TP/(TP+FN).
[6] Recall for the identification of false friends is computed as TN/(TN+FP) (this is because if an item is not a cognate, then it is a false friend, and vice versa).
[7] Precision for the identification of cognates is computed as TP/(TP+FP).
[8] Precision for the identification of false friends is computed as TN/(TN+FN).
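The three threshold estimation strategies from Section 4.4 and the accuracy used by Max Accuracy can be sketched as follows; the training similarities are invented for illustration.

```python
# Sketch of the three threshold-estimation strategies described above.
from statistics import mean, median

def mean_average_threshold(cognate_sims, false_friend_sims):
    """Midpoint between the mean similarity of the cognates and of the false friends."""
    return (mean(cognate_sims) + mean(false_friend_sims)) / 2

def median_average_threshold(cognate_sims, false_friend_sims):
    """As above, but using the median instead of the mean."""
    return (median(cognate_sims) + median(false_friend_sims)) / 2

def max_accuracy_threshold(cognate_sims, false_friend_sims):
    """Threshold maximising (TP + TN) / (TP + TN + FP + FN) on the training data."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(cognate_sims) | set(false_friend_sims)):
        tp = sum(s >= t for s in cognate_sims)          # cognates labelled cognates
        tn = sum(s < t for s in false_friend_sims)      # false friends labelled false friends
        acc = (tp + tn) / (len(cognate_sims) + len(false_friend_sims))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

cognates = [0.9, 0.8, 0.75, 0.6]         # similarities of known cognate pairs
false_friends = [0.2, 0.3, 0.45, 0.5]    # similarities of known false-friend pairs
print(mean_average_threshold(cognates, false_friends))    # 0.5625
print(median_average_threshold(cognates, false_friends))  # 0.575
print(max_accuracy_threshold(cognates, false_friends))    # 0.6
```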

5 EXPERIMENTS, EVALUATION AND DISCUSSION

The experiments, extracting candidate pairs and classifying them, covered bilingual texts in four pairs of languages: English-French, English-German, English-Spanish and French-Spanish. They were performed in two settings: (i) an ideal environment, where the pairs to be classified were either cognates or false friends, and (ii) a real-world environment, where the pairs to be classified as cognates and false friends (or unrelated) were extracted from bilingual corpora. The former task would be the same as the latter in the case of perfect pre-processing, or as the task of classifying a pair as cognates or false friends from lists of orthographically (and semantically) close words. In order to conduct the experiments described below, co-occurrence statistics were needed, as well as bidirectional dictionaries. The following co-occurrence data for computing distributional similarity was used:

- verb-object co-occurrence data from the parsed version of the English WSJ corpus
- verb-object co-occurrence data from the French Le Monde corpus, which we had previously parsed with the Xerox Xelda shallow parser
- verb-object co-occurrence data from the German Tageszeitung corpus
- verb-object co-occurrence data from the Spanish EFE corpus (years 1994 and 1995)

EuroWordNet was used as the source of the four bidirectional dictionaries.

5.1 Extracting candidate pairs

For the task of extracting candidates from corpora, all combinations of word pairs for each of the four language pairs were compared in terms of LCSR orthographic similarity. The most similar pairs were chosen as sample data for each language pair. The lists were then manually annotated to provide both training data for the classifier and evaluation results for the extraction stage. The pairs were marked with four categories: Cognates, False Friends, Unrelated and Errors, where errors were considered to be pairs which are not orthographically similar, or which contain words tagged with a different or incorrect part of speech.

Table I. Pairs returned in the automatic extraction process: counts of Cognates, False Friends, Unrelated and Errors, and the resulting accuracy, for English-French, English-German, English-Spanish, French-Spanish and on average.

The accuracy of extraction of candidates was computed as the number of cognates and false friends divided by the total number of pairs, as this task is concerned with the identification of pairs that are either cognates or false friends. The results are satisfactory, with an average accuracy of about 84% (see Table I). If the task is limited to finding cognates, the employed methodology can be very effective, as the extracted cognates considerably outnumber the extracted false friends.

5.2 Classifying pairs as Cognates or False Friends

The extracted pairs were classified as either cognates or false friends. The evaluation was carried out in a parameter-driven fashion, with the impact of the following parameters on the performance investigated:

- Threshold estimation method: How does the choice of the threshold estimation method affect the results?
- The influence of N for Method 1: How many similar words should be considered when measuring the distance?
- Direction of translation for Method 2: When mapping feature vectors from one language to another, does the direction of translation have an effect on the performance?
- The effect of errors in the extraction stage: How do they affect the classification methods?
- The sample data: How do the methods behave with unbalanced class sizes?

For each evaluation parameter (method, threshold estimation, etc.) the evaluation procedure splits the two samples (cognates and false friends) into ten parts each and runs a ten-fold cross-validation using the specified parameters. As aforementioned, the evaluation was conducted in two settings: an ideal setting, where the extraction was regarded as perfect in that the pairs to be classified were either cognates or false friends, and a real-world setting, where the pairs to be classified as cognates and false friends (or unrelated) were extracted from bilingual corpora and, as a consequence of pre-processing errors, not all pairs were necessarily cognates or false friends.

5.2.1 Classification of cognates and false friends in an ideal setting of perfect extraction

The evaluation in an ideal environment was based on the assumption that the extraction process was 100% accurate. For this purpose, the lists were filtered (post-edited) to contain only cognates and false friends. In the present experiment, we tried out two evaluation options. One involved an unbalanced sample, which contained all cognates and all false friends present in our data. In each language pair, the proportion between them is highly skewed towards cognates, and this appears to reflect the typical tendency in a language pair. To gain a better insight into the relative performance of the methods, we additionally evaluated them on samples where the number of cognates was reduced by randomly removing cognates until it was the same as the number of false friends.

For both options, the baseline for recall is the case when all pairs are marked as cognates (since the number of cognates is always greater than the number of false friends), and the baseline for precision is set to random classification, both figures being 50%.

Table II contains the results for Method 1, Method 2, Method 1 with Leacock&Chodorow and Method 1 with Wu&Palmer respectively, when using unbalanced samples. The second column of the table shows the threshold estimation method. The third column represents the value of N for Method 1, or the direction of translation for Method 2 (L being left-to-right language translation and R being right-to-left). The results can be regarded as satisfactory. Method 1 scores as high as 8.6% recall and 6.4% precision, and Method 2 achieves a recall of .6% and precision of .9%. Method 1 with Leacock&Chodorow delivers a best recall of 8.8% and precision of 88.9% (for English-Spanish), whilst Method 1 with Wu&Palmer performs with a recall as high as 86.% (also English-Spanish) and a precision of 8.6%. Overall, the best performing methods are Method 1 with Wu&Palmer, followed by Method 1 with Leacock&Chodorow, Method 1 and Method 2. The best results from the point of view of language pairs are reported on English-Spanish, followed by English-German, English-French and French-Spanish. We see that while the best configurations of the methods beat the baseline by a considerable margin, in a proportion of the test cases the results are actually below the baseline. This suggests that tuning the parameters can make a very large difference to the performance of the methods.

Table III describes the results achieved on balanced samples. The best results are achieved by Method 1 with Wu&Palmer, with a highest recall of 9.% and precision of 94.%, followed closely by Method 1 with Leacock&Chodorow, with its best recall of 88.% and precision of 89.98%, Method 1 (recall 8.%, precision 8.6%) and Method 2 (recall 69.%, precision 68.4%). Again, the best performance is obtained on English-Spanish, followed by French-Spanish, English-German and English-French. The best configurations of the methods often show a greater improvement on the baseline in comparison with the unbalanced samples. However, the difference is not as large as would be expected, and in some cases the results are slightly worse (for Method 2, for example).

5.2.2 Classification of cognates and false friends in a real-world setting of fully automatic extraction from corpora

Table IV reports the classification performance on balanced samples in a fully automatic mode, involving extraction of pairs before classification and, as a result, including errors and unrelated words. The results are lower than those obtained on filtered lists, as pre-processing errors come into play, but are still satisfactory. Since the methods always label a pair of words as either cognates or false friends, precision decreased as a consequence. The recall was the same as in Table II because the threshold estimation was computed with training data for cognates and false friends only. On average, the best method was again Method 1 with Wu&Palmer (precision 8.8%), followed by Method 1 with Leacock&Chodorow (6.%), Method 1 (with a highest figure of .4%) and Method 2 (6.8%). The best results were obtained on English-Spanish, followed by English-German, French-Spanish and English-French.

As expected, the results on all pairs (unbalanced samples where the vast majority of pairs are cognates) are lower (Table IV) but still reasonable, with Method 1 with Leacock&Chodorow achieving a 6.98% precision on English-Spanish, and Method 1 scoring 6.6% on the same pair. In the following, we discuss the results in relation to each of the evaluation parameters listed above.

Table II. Classification results with corrected unbalanced lists: recall and precision of Method 1, Method 2, Method 1 with Leacock&Chodorow and Method 1 with Wu&Palmer on En-Fr, En-Ge, En-Sp and Fr-Sp, for each threshold estimation method and each value of N (or direction of translation, L/R), together with the baseline.

Table III. Classification results with corrected balanced lists: recall and precision of Method 1, Method 2, Method 1 with Leacock&Chodorow and Method 1 with Wu&Palmer on En-Fr, En-Ge, En-Sp and Fr-Sp, for each threshold estimation method and each value of N (or direction of translation, L/R), together with the baseline.

Threshold estimation: Three functions were used: Mean Average, Median Average and Max Accuracy. Max Accuracy searches for the threshold that gives maximum accuracy on the training set.

For different values of N on balanced samples with errors, one of the Average estimations gives the best recall in 6.% of the test cases, followed by the other in .4% and the Max Accuracy threshold estimation in .8% of the cases. The precision achieved by an Average estimation is highest in .64% of the cases, followed by the other Average (.%) and Max Accuracy (.8%). On balanced samples without errors, the results are somewhat different: an Average estimation and Max Accuracy both deliver the best results in 4.4% of the cases, followed by the other Average (.9% of cases). However, in terms of precision, one Average scores higher in 9.6% of the cases, followed by Max Accuracy (.% of cases) and the other Average (.% of cases). On unbalanced samples without pre-processing errors, one Average provides the best recall in .6% of the cases, the other in 46.8% and Max Accuracy in .66% of the cases. In terms of precision, one Average leads to the best performance in 4.% of the cases, with the other performing best in 6.% of the cases and Max Accuracy in .8% of the cases. For fully automatic extraction and classification (i.e. with pre-processing errors) on the whole (unbalanced) corpus, the best performance figures for recall are distributed in the same way as the figures on the whole corpus without pre-processing errors. The only difference is seen when comparing precision: one Average gives the best results in 4.% of the test cases, followed by the other Average in 9.9% and Max Accuracy in .% of the cases.

Table V summarises the best evaluation results (recall, precision, accuracy) for each method (Method 1 without taxonomy, Method 1 with Leacock&Chodorow, Method 1 with Wu&Palmer, Method 2), language pair (English-French, English-German, English-Spanish, French-Spanish), pre-processing mode (fully automatic mode with pre-processing errors, or perfect processing on filtered lists without errors), sampling mode (all pairs, balanced samples) and parameters (number of words N for Method 1, direction of translation for Method 2), with particular reference to the three threshold estimation techniques (Mean Average, Median Average, Max Accuracy).

The influence of N for Method 1: The number of most similar words (N) which Method 1 uses to measure similarity between pairs is an essential parameter for this method. Figures 4 and 5 report recall and precision for each N when evaluating English-German cognates with Max Accuracy as threshold estimation on the balanced sample without errors. (In this section we have chosen to report the evaluation results for one language pair and without pre-processing errors only, as reporting all possible combinations of language pairs and methods would result in a large number of figures within the section; all remaining tables are provided in the Appendix.) When a semantic taxonomy is not used, results gradually improve up to some value of N and then begin to deteriorate. When classification benefits from a taxonomy, the figures are different. At the smallest value of N, classification is carried out entirely according to the taxonomy similarity measure used. The precision is quite high but, because not all words are in the dictionary, the recall is below the baseline of 50%. Increasing N results in better recall, but can also degrade precision. The behaviour is similar across language pairs and pre-processing modes (see the additional figures in the Appendix).

Direction of translation for Method 2: Figures 6 and 7 show the effect of the direction of translation on the accuracy of Method 2 when using balanced samples without errors, with all three threshold estimations.

For both English-French and English-German, using English as the source language resulted in better performance than using English as the target (the only exception being when the median average is applied for English-French).

Table IV. Precision when errors are present (recall same as in Table II): precision of Method 1 with Wu&Palmer, Method 1 with Leacock&Chodorow, Method 1 and Method 2 on balanced samples and on all pairs, for En-Fr, En-Ge, En-Sp and Fr-Sp, by direction of translation (L/R), together with the baseline.

Table V. Summary of the best evaluation results (recall, precision, accuracy) for each method, language pair, pre-processing mode, sampling mode and parameters, with particular reference to the three threshold estimation techniques (Mean Average, Median Average, Max Accuracy). For balanced samples with and without errors and for all pairs with and without errors, the table gives, per language pair (En-Fr, En-Ge, En-Sp, Fr-Sp), the best estimation technique, the best-performing method, the value of N or the direction of translation, and the best value achieved.

Figure 4. The recall of Method 1, Method 1 with Leacock&Chodorow and Method 1 with Wu&Palmer for English-German, with balanced samples without pre-processing errors, using max. accuracy estimation.

Figure 5. The precision of Method 1, Method 1 with Leacock&Chodorow and Method 1 with Wu&Palmer for English-German, with balanced samples without pre-processing errors, using max. accuracy estimation.

Figure 6. Recall of Method 2 with balanced samples without pre-processing errors, by direction of translation (L/R) for En-Fr, En-Ge, En-Sp and Fr-Sp.

If Spanish is one of the languages, the results are better if it is the source language (i.e. if the translation is from right to left). Again, an exception is the case when the mean average is used for French-Spanish, with higher accuracy achieved when French is the source language.

Two further figures in the Appendix illustrate the effect of the direction of translation for Method 2 on balanced samples including pre-processing errors. The better direction for each case is the same as for the samples without errors. With pre-processing errors in the samples, for some cases (such as Max Accuracy for En-Sp) the difference in performance between the two directions is far less noticeable than when using samples without errors. We experimentally found that, for balanced samples without errors, the differences in performance within a language pair amount to up to 6.% depending on the choice of the direction of translation, and the differences in recall to up to .%.

Figure 7. Precision of Method 2 with balanced samples without pre-processing errors, by direction of translation (L/R) for En-Fr, En-Ge, En-Sp and Fr-Sp.

Method 1 vs. Method 2: In most cases, the optimal results for Method 1 without taxonomy for any of the four language pairs are higher than the optimal results for Method 2 (see Tables II-V for a subset of all evaluation results; see also Table VI). The only exceptions apply to the best result for precision on English-Spanish when evaluated on unbalanced samples (with or without pre-processing errors), and for precision on English-Spanish for unbalanced samples without errors. On balanced samples without errors, Method 1 with Leacock&Chodorow produces the best recall for English-French and the best recall and precision for French-Spanish, whereas Wu&Palmer offers the optimal values for English-German and English-Spanish. When pre-processing errors are present in the balanced samples, Method 1 holds the best recall for English-French. Leacock&Chodorow produces the highest precision for English-French and the highest recall for French-Spanish, whereas Wu&Palmer performs best for English-German and English-Spanish, and obtains the best precision for French-Spanish. On unbalanced samples without errors, Method 1 with Leacock&Chodorow performs with the highest precision for all language pairs except English-French, whilst Method 1 with Wu&Palmer records the best recall for the same language pairs. With errors present in the unbalanced samples, Method 1 with Leacock&Chodorow produces the best precision for English-Spanish, whereas Wu&Palmer gives optimal performance for English-German and the best recall for English-Spanish and French-Spanish. Method 2 produces the best recall in one case only: the unbalanced sample with errors for French-Spanish. In conclusion, Method 1 generally has the edge over Method 2, and the taxonomy can improve results even further.

Table VI. Best results for recall and precision (separately) for each method, sample and language pair. The upper part gives the best recall and precision achieved by Method 1, Method 2, Method 1 with Leacock&Chodorow and Method 1 with Wu&Palmer on all pairs and on balanced samples, with and without errors, for En-Fr, En-Ge, En-Sp and Fr-Sp; the lower part ranks the four methods from best to last under the same conditions.

The drawback of Method 1 is that there is no optimal value for N which guarantees the best results in all cases; this can sometimes lead to a performance significantly lower than that of Method 2.

The effect of errors in the extraction stage: Since errors are not currently used for training, the effect they have on the evaluation results is to lower precision. When using balanced samples, the decrease in precision can be up to .9% (for Method 1 with Leacock&Chodorow, French-Spanish, max accuracy as estimation), with an average of 9.9% over all test cases. With unbalanced samples (consisting of all extracted pairs) the precision is 8% lower on average, with the highest drop, .8%, for Method 1 without taxonomy (French-Spanish, max accuracy as estimation).

The sample data: When using balanced samples (the number of cognates being equal to the number of false friends), average precision and recall are, respectively, .% and 8.8% higher (over all test cases run) than when using unbalanced samples. The most significant improvements for precision and recall on balanced samples are 4.4% and 4.% respectively. For some parameter settings, however, better results are obtained on unbalanced samples, with the greatest difference in favour of the evaluation results on all pairs being .% for precision and 9% for recall. However, experiments on balanced samples result in higher precision in 8.6% of the combinations of parameter settings, and in higher recall in 6.% of these cases. All results reported in this paper, as well as other results obtained with different choices of parameters, can be viewed online.

6 CONCLUSION

This paper proposes two novel methods (one of which is presented in three variants) for the automatic identification of both cognates and false friends from corpora, neither of which depends on the existence of parallel texts or any other collections/lists. The extensive evaluation results cover a variety of parameters and show that the automatic identification of cognates and false friends from corpora is a feasible task, and that all the proposed methods perform in a satisfactory manner. The best results are obtained by Method 1, which is based on the premise that cognates are semantically closer than false friends, when the variant of the method using a taxonomy and Wu and Palmer's measure is employed.

References

Budanitsky, A. and G. Hirst (2001). "Semantic Distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures". Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics (NAACL-2001), Pittsburgh, PA, June 2001.

Dagan, I., L. Lee and F. Pereira (1999). "Similarity-based models of word co-occurrence probabilities". Machine Learning 34(1-3): 43-69.

Danielsson, P. and K. Muehlenbock (2000). "Small but Efficient: The Misconception of High-Frequency Words in Scandinavian Translation". Proceedings of the 4th Conference of the Association for Machine Translation in the Americas on Envisioning Machine Translation in the Information Future. Springer Verlag, London.


More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Universiteit Leiden ICT in Business

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

More information

Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation

Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation School of Computer Science Human-Computer Interaction Institute Carnegie Mellon University Year 2007 Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation Noboru Matsuda

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

Prediction of Maximal Projection for Semantic Role Labeling

Prediction of Maximal Projection for Semantic Role Labeling Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba

More information

Finding Translations in Scanned Book Collections

Finding Translations in Scanned Book Collections Finding Translations in Scanned Book Collections Ismet Zeki Yalniz Dept. of Computer Science University of Massachusetts Amherst, MA, 01003 zeki@cs.umass.edu R. Manmatha Dept. of Computer Science University

More information

A Semantic Similarity Measure Based on Lexico-Syntactic Patterns

A Semantic Similarity Measure Based on Lexico-Syntactic Patterns A Semantic Similarity Measure Based on Lexico-Syntactic Patterns Alexander Panchenko, Olga Morozova and Hubert Naets Center for Natural Language Processing (CENTAL) Université catholique de Louvain Belgium

More information

The Ups and Downs of Preposition Error Detection in ESL Writing

The Ups and Downs of Preposition Error Detection in ESL Writing The Ups and Downs of Preposition Error Detection in ESL Writing Joel R. Tetreault Educational Testing Service 660 Rosedale Road Princeton, NJ, USA JTetreault@ets.org Martin Chodorow Hunter College of CUNY

More information

The role of the first language in foreign language learning. Paul Nation. The role of the first language in foreign language learning

The role of the first language in foreign language learning. Paul Nation. The role of the first language in foreign language learning 1 Article Title The role of the first language in foreign language learning Author Paul Nation Bio: Paul Nation teaches in the School of Linguistics and Applied Language Studies at Victoria University

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data

Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Maja Popović and Hermann Ney Lehrstuhl für Informatik VI, Computer

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

The Smart/Empire TIPSTER IR System

The Smart/Empire TIPSTER IR System The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

The Role of String Similarity Metrics in Ontology Alignment

The Role of String Similarity Metrics in Ontology Alignment The Role of String Similarity Metrics in Ontology Alignment Michelle Cheatham and Pascal Hitzler August 9, 2013 1 Introduction Tim Berners-Lee originally envisioned a much different world wide web than

More information

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA Alta de Waal, Jacobus Venter and Etienne Barnard Abstract Most actionable evidence is identified during the analysis phase of digital forensic investigations.

More information

A Bayesian Learning Approach to Concept-Based Document Classification

A Bayesian Learning Approach to Concept-Based Document Classification Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

arxiv:cmp-lg/ v1 22 Aug 1994

arxiv:cmp-lg/ v1 22 Aug 1994 arxiv:cmp-lg/94080v 22 Aug 994 DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS Fernando Pereira AT&T Bell Laboratories 600 Mountain Ave. Murray Hill, NJ 07974 pereira@research.att.com Abstract We describe and

More information

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz

More information

Guidelines for Writing an Internship Report

Guidelines for Writing an Internship Report Guidelines for Writing an Internship Report Master of Commerce (MCOM) Program Bahauddin Zakariya University, Multan Table of Contents Table of Contents... 2 1. Introduction.... 3 2. The Required Components

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Using Semantic Relations to Refine Coreference Decisions

Using Semantic Relations to Refine Coreference Decisions Using Semantic Relations to Refine Coreference Decisions Heng Ji David Westbrook Ralph Grishman Department of Computer Science New York University New York, NY, 10003, USA hengji@cs.nyu.edu westbroo@cs.nyu.edu

More information

Combining a Chinese Thesaurus with a Chinese Dictionary

Combining a Chinese Thesaurus with a Chinese Dictionary Combining a Chinese Thesaurus with a Chinese Dictionary Ji Donghong Kent Ridge Digital Labs 21 Heng Mui Keng Terrace Singapore, 119613 dhji @krdl.org.sg Gong Junping Department of Computer Science Ohio

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

An Evaluation of the Interactive-Activation Model Using Masked Partial-Word Priming. Jason R. Perry. University of Western Ontario. Stephen J.

An Evaluation of the Interactive-Activation Model Using Masked Partial-Word Priming. Jason R. Perry. University of Western Ontario. Stephen J. An Evaluation of the Interactive-Activation Model Using Masked Partial-Word Priming Jason R. Perry University of Western Ontario Stephen J. Lupker University of Western Ontario Colin J. Davis Royal Holloway

More information

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form Orthographic Form 1 Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form The development and testing of word-retrieval treatments for aphasia has generally focused

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

The Strong Minimalist Thesis and Bounded Optimality

The Strong Minimalist Thesis and Bounded Optimality The Strong Minimalist Thesis and Bounded Optimality DRAFT-IN-PROGRESS; SEND COMMENTS TO RICKL@UMICH.EDU Richard L. Lewis Department of Psychology University of Michigan 27 March 2010 1 Purpose of this

More information

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

Handling Sparsity for Verb Noun MWE Token Classification

Handling Sparsity for Verb Noun MWE Token Classification Handling Sparsity for Verb Noun MWE Token Classification Mona T. Diab Center for Computational Learning Systems Columbia University mdiab@ccls.columbia.edu Madhav Krishna Computer Science Department Columbia

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Search right and thou shalt find... Using Web Queries for Learner Error Detection

Search right and thou shalt find... Using Web Queries for Learner Error Detection Search right and thou shalt find... Using Web Queries for Learner Error Detection Michael Gamon Claudia Leacock Microsoft Research Butler Hill Group One Microsoft Way P.O. Box 935 Redmond, WA 981052, USA

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Semantic Inference at the Lexical-Syntactic Level for Textual Entailment Recognition

Semantic Inference at the Lexical-Syntactic Level for Textual Entailment Recognition Semantic Inference at the Lexical-Syntactic Level for Textual Entailment Recognition Roy Bar-Haim,Ido Dagan, Iddo Greental, Idan Szpektor and Moshe Friedman Computer Science Department, Bar-Ilan University,

More information

DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS

DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS Julia Tmshkina Centre for Text Techitology, North-West University, 253 Potchefstroom, South Africa 2025770@puk.ac.za

More information

How to Judge the Quality of an Objective Classroom Test

How to Judge the Quality of an Objective Classroom Test How to Judge the Quality of an Objective Classroom Test Technical Bulletin #6 Evaluation and Examination Service The University of Iowa (319) 335-0356 HOW TO JUDGE THE QUALITY OF AN OBJECTIVE CLASSROOM

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many Schmidt 1 Eric Schmidt Prof. Suzanne Flynn Linguistic Study of Bilingualism December 13, 2013 A Minimalist Approach to Code-Switching In the field of linguistics, the topic of bilingualism is a broad one.

More information

Data Integration through Clustering and Finding Statistical Relations - Validation of Approach

Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Marek Jaszuk, Teresa Mroczek, and Barbara Fryc University of Information Technology and Management, ul. Sucharskiego

More information

Informatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy

Informatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy Informatics 2A: Language Complexity and the Chomsky Hierarchy September 28, 2010 Starter 1 Is there a finite state machine that recognises all those strings s from the alphabet {a, b} where the difference

More information

Netpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models

Netpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models 1 Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models James B.

More information

Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C

Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C Using and applying mathematics objectives (Problem solving, Communicating and Reasoning) Select the maths to use in some classroom

More information

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for

More information

Lecture 2: Quantifiers and Approximation

Lecture 2: Quantifiers and Approximation Lecture 2: Quantifiers and Approximation Case study: Most vs More than half Jakub Szymanik Outline Number Sense Approximate Number Sense Approximating most Superlative Meaning of most What About Counting?

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features

Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features Sriram Venkatapathy Language Technologies Research Centre, International Institute of Information Technology

More information

Discriminative Learning of Beam-Search Heuristics for Planning

Discriminative Learning of Beam-Search Heuristics for Planning Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

Levels of processing: Qualitative differences or task-demand differences?

Levels of processing: Qualitative differences or task-demand differences? Memory & Cognition 1983,11 (3),316-323 Levels of processing: Qualitative differences or task-demand differences? SHANNON DAWN MOESER Memorial University ofnewfoundland, St. John's, NewfoundlandAlB3X8,

More information

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu

More information