METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS

Ruslan Mitkov (R.Mitkov@wlv.ac.uk), University of Wolverhampton
Viktor Pekar (v.pekar@wlv.ac.uk), University of Wolverhampton
Dimitar Blagoev (d.blagoev@wlv.ac.uk), University of Plovdiv
Andrea Mulloni (andrea@wlv.ac.uk), University of Wolverhampton

Abstract

The identification of cognates has attracted the attention of researchers working in the area of Natural Language Processing, but the identification of false friends is still an under-researched area. This paper proposes two novel methods (the first experimented with in different variants) for the automatic identification of both cognates and false friends from corpora, which do not depend on the existence of parallel texts or any other collections/lists. The two methods are evaluated on English, French, German and Spanish corpora in order to identify English-French, English-German, English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) an ideal environment where the pairs to be classified were either cognates or false friends, and (ii) a real-world environment where cognates and false friends had to be identified among all unique words found in two comparable corpora in different languages. The former task would be the same as the latter in the case of perfect pre-processing, or as the task of classifying a pair as cognates or false friends from lists of orthographically (and semantically) close words. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory levels of both recall and precision.

INTRODUCTION

Cognates and false friends play an important role in the teaching and learning of a foreign language, as well as in translation. The existence of cognates, which are words that have similar meaning and spelling in two or more languages (e.g. colour in English, color in Spanish and couleur in French, or Bibliothek in German, bibliothèque in French and biblioteca in Spanish), helps students' reading comprehension and contributes to the expansion of their vocabularies. False friends (faux amis), however, create problems and have the opposite effect, as they have similar spellings but do not share the same meaning (e.g. library in English as opposed to librería in Spanish or librairie in French).

The ability of a student to distinguish cognates from false friends plays a vital role in the successful learning and mastering of a foreign language. However, lists of such words can be found for only a very limited number of languages, and the preparation of these lists is a labour-intensive and time-consuming task. Therefore, an attractive alternative is to retrieve cognates or false friends from a corpus automatically. In addition, the automatic identification of cognates plays an important role in applied NLP tasks such as lexicon acquisition from comparable corpora (Koehn and Knight 2002) or word alignment in parallel corpora (Melamed 1999; Simard et al. 1992). Translation studies have also shown an interest in cognate identification (Laviosa). The identification of cognates has already attracted the attention of researchers working in the area of Natural Language Processing (Simard et al. 1992; Melamed 1999; Danielsson and Muhlenbock 2000; Kondrak 2001; Volk et al.; Mulloni and Pekar 2006), but the identification of false friends is still an under-researched area. Whereas some research has been reported on the automatic recognition of false friends, the methodology developed depends on existing lists of false friends and parallel corpora (e.g. Inkpen et al. 2005). By contrast, this paper proposes two novel methods for the automatic identification of both cognates and false friends from corpora which are not dependent on the existence of parallel texts or any other collections/lists.

METHODOLOGY

The methodology for the automatic identification of cognates and false friends that we propose in this paper is based on a two-stage process. The first stage involves the extraction of candidate pairs from non-parallel bilingual corpora, whereas the second stage is concerned with the classification of the extracted pairs as cognates, false friends or unrelated words. This methodology has been tested on four language pairs: English-French, English-Spanish, English-German and Spanish-French.
The extraction of candidate pairs is based on the orthographic similarity between two words, whereas the classification of an extracted pair is performed on the basis of the semantic similarity of the two words. Semantic similarity has been computed from taxonomies, or approximated from corpus data employing distributional similarity algorithms. In the following we introduce the measures of orthographic and semantic similarity employed in this study.

2.1 Orthographic Similarity

Given that the extraction process involves the comparison of any word from the first language with any word from the second, speed and efficiency were a major consideration, and hence so was the choice of a suitable orthographic similarity measure/algorithm. In this study, two orthographic similarity measures have been experimented with; neither has been reported in any previous publication. The first measure has been experimented with in three different variants. Measuring orthographic similarity is a commonly used method for distinguishing pairs of unrelated words from pairs of cognates and false friends. Inkpen et al. (2005) present a study of different measures and their efficiency.

LCSR (Longest Common Subsequence Ratio), as proposed by Melamed (1999), is computed by dividing the length of the longest common subsequence (LCS) of two words by the length of the longer word:

LCSR(w1, w2) = |LCS(w1, w2)| / max(|w1|, |w2|)

For example, LCSR(example, exemple) = 6/7 ≈ 0.86 (their LCS is e-x-m-p-l-e).

NED (Normalised Edit Distance), as proposed in (Inkpen et al. 2005), is calculated by dividing the minimum number of edit operations needed to transform one word into another by the length of the longer string. Edit operations include substitutions, insertions and deletions.

2.2 Semantic Similarity

While a number of semantic similarity measures based on taxonomies exist (see Budanitsky and Hirst for an overview and references), in this study we have experimented with the following two measures. Leacock and Chodorow's (1998) measure uses the normalised path length between the two concepts c1 and c2 and is computed as follows:

sim_LC(c1, c2) = -log( len(c1, c2) / (2 * MAX) )

where len(c1, c2) is the length of the shortest path between the two concepts and MAX is the maximum depth of the taxonomy. Wu and Palmer's (1994) measure is based on edge distance but also takes into account the most specific node c3 dominating the two concepts c1 and c2:

sim_WP(c1, c2) = 2 * d(c3) / ( d(c1) + d(c2) )

where c3 is the maximally specific superclass of c1 and c2, d(c3) is the depth of c3 (its distance from the root of the taxonomy), and d(c1) and d(c2) are the depths of c1 and c2. Each word, however, can have one or more meanings (senses) mapping to different concepts in the ontology. Using s(w) to represent the set of concepts in the taxonomy that are senses of the word w, word similarity can be defined as (Resnik 1999):

wsim(w1, w2) = max [ sim(c1, c2) ]

where c1 ranges over s(w1) and c2 ranges over s(w2).

2.3 Distributional Similarity

Since taxonomies with wide coverage are not readily available, semantic similarity can also be modelled via word co-occurrences in corpora. Every word w_j is represented by the set of words w_i1, ..., w_in with which it co-occurs.
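As an illustration, the two orthographic measures can be implemented in a few lines. This is a minimal sketch with our own function names (the paper does not publish code):

```python
# Sketch of LCSR (Melamed 1999) and NED (Inkpen et al. 2005).
# Function names (lcs_len, edit_distance, lcsr, ned) are ours, not the paper's.

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions, cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: |LCS| divided by the longer length."""
    return lcs_len(a, b) / max(len(a), len(b))

def ned(a: str, b: str) -> float:
    """Normalised Edit Distance: edit distance divided by the longer length."""
    return edit_distance(a, b) / max(len(a), len(b))
```

On the paper's own example, `lcsr("example", "exemple")` gives 6/7, and `ned("example", "exemple")` gives 1/7 (one substitution, a/e).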
For deriving a representation of w_j, all occurrences of w_j and all words in its context are identified and counted. To account for the context of w_j, two approaches have been applied: window-based and syntactic. In the first, the context is marked out by defining a window of a certain size around w_j (e.g., Gale, Church, and Yarowsky (1992) used a thousand-word window). In the second approach, the context is limited to words appearing in a certain syntactic relation to w_j, such as direct objects of a verb (Grefenstette 1996; Pereira, Tishby, and Lee 1993). Once the co-occurrence data is collected, the semantics of w_j are modelled as a vector in an n-dimensional space, where n is the number of words co-occurring with w_j and the components of the vector are the probabilities of the co-occurrences established from their observed frequencies:

C(w_j) = ( P(w_i1 | w_j), P(w_i2 | w_j), ..., P(w_in | w_j) )

Semantic similarity between words is then operationalised via the distance between their vectors. In the literature, various distance measures have been used, including Euclidean distance, the cosine (Schuetze 1992), Kullback-Leibler divergence (Pereira, Tishby, and Lee 1993) and Jensen-Shannon divergence (Dagan, Lee, and Pereira 1999).

3 EXTRACTION OF CANDIDATE PAIRS

During the extraction stage the orthographic similarity between each noun from the first language (S) and each noun from the second language (T) is computed, and a list of the most similar word pairs is compiled. This list is expected to contain pairs of cognates or false friends, but due to pre-processing errors the extracted pairs may contain unrelated words or errors such as words which are not orthographically similar or are not of the same part of speech. While a high degree of orthographic similarity indicates that two words belonging to different languages are cognates [4], many unrelated words may have great similarity in spelling (e.g. Eng. black and Ger. Block). And vice versa, two words may be cognates, but their spellings may have little in common (e.g., Eng. cat and Ger. Katze).
Our intuition is that between two given languages there are certain regularities in the way the spelling of a word changes once it is borrowed from one language into the other. In the following sections we describe an algorithm which learns orthographic transformation rules capturing such regularities from a list of known cognates (3.1), and an algorithm which applies the induced rules to the discovery of potential cognates in a corpus (3.2).

3.1 Learning algorithm

The learning algorithm involves three major steps: (a) the association of edit operations with the actual mutations that occur between two words known to be cognates (or false friends); (b) the extraction of candidate rules; (c) the assignment of a statistical score to the extracted rules signifying their reliability. Its input is a list of translation pairs, which is passed on to a filtering module based on Normalised Edit Distance (NED): this allows for the identification of pairs of cognates/false friends.

[4] Or, in fact, borrowings.

The filtered pairs form a list C in the two languages S and T; each pair consists of a word w_S ∈ S and a word w_T ∈ T. The output of the algorithm is a set of rules R. In the beginning, two procedures are applied to the data: (i) the edit operations between the two strings of each pair are identified; (ii) the NED of each pair is calculated in order to assign a score to each cognate pair. NED is calculated by dividing the edit distance (ED) by the length of the longer string. NED, and normalisation in general, allows for more consistent values: it was noticed that with standard ED, word pairs of short length (2 to 4 letters each) were more prone to be included in the cognate list even when actually unrelated (e.g. at/an, ape/affe). Sample output of this step for three English-German pairs is shown in Figure 1.

Figure 1. Edit operation association between English and German.

At the next stage of the algorithm, a candidate rule c_r is extracted from each edit operation of each word pair in the training data. Each candidate rule consists of two letter n-grams, the former referring to language S and the latter to its counterpart in language T. To construct it, for each edit operation detected we use k symbols on either side of the edited symbol in both words. The left-hand side refers to the language S n-gram, while the right-hand side corresponds to the same n-gram in language T with the detected mutations. Figure 2 illustrates rules detected in this manner. Candidate rules are extracted using different values of k for each kind of edit operation, each value having been set experimentally. Substitution rules are created without considering the context around the letter being substituted, i.e. taking into account only the letter substitution itself, while deletions and insertions are sampled with k symbols on both sides.
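Step (a) of the learning algorithm, associating edit operations with the mutations between a known pair as in Figure 1, can be sketched via a standard Levenshtein backtrace. This is an illustrative reconstruction, not the authors' code; the function name and the operation labels are our own:

```python
# Recover the aligned edit operations (match / substitution / insertion /
# deletion) that map a source-language word onto its target-language cognate.

def edit_ops(src: str, tgt: str):
    """Backtrace a Levenshtein matrix; return (op, src_char, tgt_char) triples."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # delete src[i-1]
                          d[i][j - 1] + 1,                       # insert tgt[j-1]
                          d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]))
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append(("match" if src[i - 1] == tgt[j - 1] else "sub",
                        src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", tgt[j - 1]))
            j -= 1
    return ops[::-1]
```

For the paper's cat/Katze example (lower-cased), `edit_ops("cat", "katze")` yields a substitution c/k, two matches, and two insertions (z, e), which is the raw material from which candidate rules are cut.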
After extensive testing, k was set empirically, with the candidate rule varying in length on both the left-hand side and the right-hand side depending on the number of insertions and deletions it accounts for. This decision was supported by the fact that longer rules are less frequent than shorter rules, but are nonetheless more precise. In fact, because of the task at stake and the further areas we want to apply the algorithm to, we were somewhat more inclined towards obtaining higher precision than higher recall. At the final stage, a statistical score is assigned to each unique candidate rule extracted. After exploring different scoring functions (Fisher's exact test, chi-square, odds ratio and likelihood ratio), chi-square was chosen to measure the strength of the association between the left-hand side and the right-hand side of a candidate rule. Once every candidate rule has been associated with a chi-square value, candidates falling below a specific threshold on the chi-square value are filtered out, yielding the final set of rules.
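The rule-scoring step can be sketched as below. `chi2_2x2` is a plain 2x2 chi-square statistic; the way `score_rule` builds the contingency table (how often the source n-gram, the target n-gram, and the two together occur across the training pairs) is our own framing of the association test, not code from the paper:

```python
# Chi-square association between the left-hand side (source n-gram) and the
# right-hand side (target n-gram) of a candidate rule.

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def score_rule(pair_count: int, lhs_count: int, rhs_count: int, total: int) -> float:
    """a = pairs where both n-grams occur, b = lhs without rhs,
    c = rhs without lhs, d = everything else (counts are hypothetical)."""
    a = pair_count
    b = lhs_count - pair_count
    c = rhs_count - pair_count
    d = total - a - b - c
    return chi2_2x2(a, b, c, d)
```

A rule whose two sides always co-occur scores high (e.g. `score_rule(10, 10, 10, 100)` gives 100.0), while independent sides score near zero, so thresholding on the statistic keeps only reliable rules.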

Rule     Chi-square score
c/k      86.886
d/t      4.6994
ary/är   8.9
my/mie   8.9
hy/hie   8.9
gy/gie   .846
ty/tät   6.499
et/ett   6.9468
sh/sch   .
ive/iv   48.6

Figure 2. Top rules detected by the algorithm for the English-German language combination, along with the associated chi-square scores.

3.2 Testing algorithm

The learning algorithm provides a set of rules which account for the orthographic behaviour of words between a source language and a target language. The second part of the algorithm (the testing algorithm) deploys this information in the candidate extraction process. Once the input data is made available, the rules are applied to each possible word pair, that is, relevant n-grams are substituted with their counterparts in the target language. LCSR is then computed for every pair, and the top most similar pairs are added to the candidate cognate list. A case in point is the English-German entry electric/elektrisch: the original LCSR is 0.7, but if the rules c/k and ic/isch, detected earlier by the algorithm, are applied, the new LCSR is 1.0.

4 CLASSIFICATION

The goal of the next stage of the methodology is to separate cognates from false friends in the lists obtained at the previous stage. Given a list of known cognates and false friends (training data) and a list of mixed pairs of words where the words in each pair are either cognates or false friends (test data), the classification task is to automatically label each test pair. To this end, the semantic similarity between the words in each pair in the training data is computed, and a numerical threshold is estimated. This threshold is later used to label the test data: if the similarity between the words in a test pair is lower than the threshold, the words are returned as false friends; otherwise, they are returned as cognates. After a threshold is estimated, all pairs in the test data are labelled in this manner.
The presented methodology is independent of the language pair(s): it operates on any texts in any languages, not necessarily parallel ones, although better results are expected if the corpora are comparable. In order to establish the similarity between words in different languages, we experimented with four methods: Method 1, operating in different variants (Method 1 without taxonomy, Method 1 with Leacock

and Chodorow, and Method 1 with Wu and Palmer), and Method 2. These are outlined below.

4.1 Exploiting distributional similarities between words of the same language

Method 1 is based on the premise that if two words have the same meaning (and are cognates), they should be semantically close to roughly the same set of words in both (or more) languages, whereas two words which do not have the same meaning (and are false friends) will not be semantically close to the same set of words in both (or more) languages. Method 1 can be formally described as follows. Start with two words (w_S, w_T) in the languages S and T. Calculate the N most similar words for each of the two words according to a chosen distributional similarity function; in this study, skew divergence (Lee 1999) was selected for Method 1 because it performed best during our pilot tests. Then build two sets of N words, W_S = (w_S1, w_S2, ..., w_SN) and W_T = (w_T1, w_T2, ..., w_TN), such that w_Si is the i-th most similar word to w_S and w_Ti is the i-th most similar word to w_T. Figure 3 shows the two cognates Eng. article and Fre. article along with the sets of their most similar words in the respective languages. A connection between two words is made when one of the words is listed as a translation of the other in the bilingual dictionary. A Dice coefficient function is then applied over the two sets to determine their similarity: a word from set W_S is added to the collision set only if it has at least one translation present in set W_T, and vice versa. By way of example, the similarity between the words in Figure 3 is the proportion of such matched words over the two sets. Currently, words with multiple translations in the opposite set are not treated in any special way. Note that N is a crucial parameter. If the value of N is too small, similar words may appear very distant, because their common synonyms may not be present in both sets (or at all).
If the value of N is too big, the sets may be filled with synonym word pairs that are distant from the initial pair, thus making the words in the initial pair appear more similar than they actually are. The dictionary used can further affect the results. In the evaluation section, the variant of Method 1 which employs the distributional similarity function skew divergence as outlined above is referred to simply as Method 1 without taxonomy.

Figure 3. Method 1
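Method 1's set comparison can be sketched as follows. The Dice-style function is our reading of the description above; the neighbour sets and bilingual dictionary in the usage example are invented for illustration:

```python
# Dice coefficient over two neighbour sets in different languages: a word
# counts as matched if at least one of its dictionary translations appears
# in the set for the other language.

def dice_across_languages(set_s: set, set_t: set, dictionary: dict) -> float:
    """dictionary maps source-language words to sets of target translations."""
    matched_s = {w for w in set_s if dictionary.get(w, set()) & set_t}
    matched_t = {w for w in set_t
                 if any(w in dictionary.get(v, set()) for v in set_s)}
    return (len(matched_s) + len(matched_t)) / (len(set_s) + len(set_t))
```

For two three-word neighbour sets in which two source words each have a translation present in the target set, the function returns (2 + 2) / (3 + 3) = 2/3.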

4.2 Exploiting distributional similarities between words across different languages

Method 2, which is inspired by methods for the acquisition of translation lexicons from comparable corpora (Fung 1998; Rapp 1999; Tanaka and Iwasaki 1996; Gaussier et al. 2004), determines the semantic similarity between words across two languages by mapping the space of distributional features of one language onto that of the other using a bilingual dictionary. Method 2 can be described formally as follows. First, co-occurrence data on each word of interest in both languages are extracted (in the present study we model the semantics of nouns by their co-occurrence with syntactically related verbs) and feature spaces for both sets of words are created. Then, given two words (w_S, w_T) in the languages S and T, the feature vector of w_S is translated into T with the help of a dictionary and added to the co-occurrence data for T. The result is co-occurrence data that contains vectors for all words in the target language T plus the translated vector of the source word w_S. All words in T are then ranked according to their distributional similarity to the translated vector of w_S (using skew divergence again) and the rank of w_T is noted (R_1). The same measure is used in the opposite direction, taking the rank of w_S among the words of S (R_2). This is done because skew divergence is not symmetrical. The final result is the average of the two ranks, (R_1 + R_2)/2. Here the quality of the dictionary used for translation is essential. Besides the source data itself, one parameter that can make a difference is the direction of translation, i.e. which of the languages is the source and which is the target. Method 1 (without taxonomy) and Method 2 can be regarded as related in that the distributional similarity techniques employed both rely on context, but we felt it was worth experimenting with both methods with a view to comparing their performance.
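The distributional distance used by both methods, alpha-skew divergence (Lee 1999), can be sketched as below. Representing a feature vector as a dict of co-occurrence probabilities, the function name, and the choice alpha = 0.99 are our assumptions for illustration:

```python
import math

# Alpha-skew divergence: KL divergence from r to a mixture of q and r.
# It stays finite even when q lacks some of r's contexts, unlike plain KL.

def skew_divergence(q: dict, r: dict, alpha: float = 0.99) -> float:
    """D_KL(r || alpha*q + (1 - alpha)*r); small when q covers r's contexts."""
    total = 0.0
    for w, pr in r.items():
        if pr <= 0.0:
            continue
        mix = alpha * q.get(w, 0.0) + (1.0 - alpha) * pr
        total += pr * math.log(pr / mix)
    return total
```

Identical distributions give 0; fully disjoint ones give log(1/(1 - alpha)), so with alpha = 0.99 the divergence is bounded by log 100. Note the asymmetry in q and r, which is why the method averages the two directional ranks.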
In addition to making use of co-occurrence data, both methods also rely on dictionaries. One major difference, however, is in the type of dictionary needed: for the current task of comparing cognates/false friends which are nouns, Method 1 requires a noun dictionary, whereas Method 2 requires a verb dictionary.

4.3 Exploiting taxonomic similarities between words of the same language

If a pair of cognates/false friends under consideration is present in a taxonomical thesaurus, computing the semantic similarity directly from the taxonomy promises to be the best way forward. However, the absence of words from the thesaurus could result in high precision but low recall. To overcome this limitation, a hybrid method has been developed which works as follows. For any two words, their presence in a specific taxonomy is checked and, if they are present, a taxonomical semantic similarity measure is employed to compute a similarity value. Otherwise, distributional measures are used (as in Method 1 without taxonomy) to obtain a list of the N nearest neighbours of each word; the taxonomy is used in this case only instead of a dictionary. EuroWordNet was chosen for this study because it covers a number of languages. The taxonomy-based semantic similarity measures made

use of are Leacock and Chodorow (1998), which uses the normalised path length between the two concepts being compared, and Wu and Palmer (1994), which is based on edge distance but also takes into account the most specific node dominating the two concepts. To start with, the similarities between each nearest neighbour of w_S and each nearest neighbour of w_T are computed, and for each nearest neighbour the most similar neighbour in the opposite language is found. The final value is the average of the maximal similarities over all neighbours. The variant of Method 1 which employs Leacock and Chodorow's similarity measure is referred to in the evaluation section as Method 1 with Leacock&Chodorow, whereas the variant which computes similarity according to Wu and Palmer's measure is referred to as Method 1 with Wu&Palmer.

4.4 Threshold estimation

The following threshold estimation techniques were used with the methods outlined above.

Mean average: The distances between the words in both training sets (cognates and false friends) are measured using the chosen method. The mean of the distances of the cognates and the mean of the distances of the false friends are computed; the threshold is the average of the two means.

Median average: As above, but the median is used instead of the mean.

Max accuracy: All distances from the training data sets are analysed to find a distance which, when used as a threshold, would (supposedly) give maximal accuracy on the test data. The accuracy is computed as:

(TP + TN) / (TP + TN + FP + FN)

An evaluation framework was created which simplifies and speeds up the process. The evaluation process splits the list of known cognates and false friends and creates the training and test pairs for each test. The threshold estimation process uses only the training pairs to find a threshold, which is passed to the classification process. The distance measurement process receives a word pair and returns the distance according to the method and parameters used.
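The three estimators can be sketched as follows. Function names are ours, and the max-accuracy variant assumes, as in the classification description above, that scores at or above the threshold are labelled cognates:

```python
from statistics import mean, median

# Threshold estimation from training similarity scores of known cognates
# and known false friends.

def threshold_mean(cognate_scores, false_friend_scores) -> float:
    """Average of the two class means."""
    return (mean(cognate_scores) + mean(false_friend_scores)) / 2

def threshold_median(cognate_scores, false_friend_scores) -> float:
    """Average of the two class medians."""
    return (median(cognate_scores) + median(false_friend_scores)) / 2

def threshold_max_accuracy(cognate_scores, false_friend_scores) -> float:
    """Candidate threshold maximising (TP + TN) / (TP + TN + FP + FN)."""
    candidates = sorted(set(cognate_scores) | set(false_friend_scores))
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        tp = sum(s >= t for s in cognate_scores)      # cognates kept
        tn = sum(s < t for s in false_friend_scores)  # false friends rejected
        acc = (tp + tn) / (len(cognate_scores) + len(false_friend_scores))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

On toy scores such as cognates [0.8, 0.9, 0.7] against false friends [0.1, 0.2, 0.3], the mean-based threshold is 0.5 and the max-accuracy threshold settles on 0.7, the smallest score that separates the two classes perfectly.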
4.5 Evaluation settings

The evaluation is performed using ten-fold cross-validation. During the test, the task is to classify a pair as cognates or false friends, based on the measure of similarity and the threshold obtained from the training pairs. The results were evaluated in terms of both recall and precision. An average measure of recall, defined as the average of the recall in the identification of cognates [5] and the recall in the identification of false friends [6], has been used. Similarly, precision is computed as the average of the precision in identifying cognates [7] and the precision in identifying false friends [8].

[5] Recall for the identification of cognates is computed as TP/(TP+FN).
[6] Recall for the identification of false friends is computed as TN/(TN+FP) (this is because if an item is not a cognate, then it is a false friend, and vice versa).
[7] Precision for the identification of cognates is computed as TP/(TP+FP).
[8] Precision for the identification of false friends is computed as TN/(TN+FN).
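The averaged measures defined in the footnotes can be computed directly from the four counts. A sketch, with cognates as the positive class and false friends as the negative class; the function name is ours:

```python
# Averaged recall and precision over the two classes, as defined above.

def averaged_scores(tp: int, tn: int, fp: int, fn: int):
    """Return (recall, precision) averaged over cognates and false friends."""
    recall_cog = tp / (tp + fn)   # cognates correctly recalled
    recall_ff = tn / (tn + fp)    # false friends correctly recalled
    prec_cog = tp / (tp + fp)     # labelled cognate and actually cognate
    prec_ff = tn / (tn + fn)      # labelled false friend and actually so
    return (recall_cog + recall_ff) / 2, (prec_cog + prec_ff) / 2
```

For instance, with TP=8, TN=6, FP=4, FN=2 the averaged recall is (0.8 + 0.6)/2 = 0.7, while the averaged precision is (8/12 + 6/8)/2.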

5 EXPERIMENTS, EVALUATION AND DISCUSSION

The experiments, extracting candidate pairs and classifying them, covered bilingual texts in 4 pairs of languages: English-French, English-German, English-Spanish and French-Spanish. They were performed in two settings: (i) an ideal environment, where the pairs to be classified were either cognates or false friends, and (ii) a real-world environment, where the pairs to be classified as cognates and false friends (or unrelated) were extracted from bilingual corpora. The former task would be the same as the latter in the case of perfect pre-processing, or the same as the task of classifying a pair as cognates or false friends from lists of orthographically (and semantically) close words. In order to conduct the experiments described below, co-occurrence statistics were needed, as well as bidirectional dictionaries. The following co-occurrence data was used for computing distributional similarity:

- verb-object co-occurrence data from the parsed version of the English WSJ corpus
- verb-object co-occurrence data from the French Le Monde corpus, which we had previously parsed with the Xerox Xelda shallow parser
- verb-object co-occurrence data from the German Tageszeitung corpus
- verb-object co-occurrence data from the Spanish EFE (years 994 and 99) corpus

EuroWordNet was used as the source of the four bidirectional dictionaries.

5.1 Extracting candidate pairs

For the task of extracting candidates from corpora, all combinations of word pairs for each of the four language pairs were compared in terms of LCSR orthographic similarity. The most similar pairs were chosen as sample data for each language pair. The lists were then manually annotated to provide both training data for the classifier and evaluation results for the extraction stage.
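LCSR (Longest Common Subsequence Ratio) divides the length of the longest common subsequence of the two spellings by the length of the longer word. A minimal sketch (the similarity cut-off used to select candidates is not shown):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two strings
    (standard dynamic programming, O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    """Longest Common Subsequence Ratio: LCS length over the
    length of the longer word."""
    return lcs_len(a, b) / max(len(a), len(b))
```

For instance, lcsr("colour", "color") is 5/6 ≈ 0.83 (common subsequence "color"), while lcsr("night", "nacht") is 3/5 = 0.6 (common subsequence "nht").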
The pairs were marked into four categories: Cognates, False Friends, Unrelated and Errors, where errors were considered to be pairs which are not orthographically similar, or which contain words tagged with a different or incorrect part of speech.

Table I. Pairs returned in the automatic extraction process. Columns: Language Pair, Cognates, False Friends, Unrelated, Errors, Accuracy; one row per language pair (English-French, English-German, English-Spanish, French-Spanish) plus an average row (overall accuracy 84.%).

The accuracy of extraction of candidates was computed as the number of cognates and false friends divided by the total number of pairs (in our case ), as this task is concerned with the identification of pairs that are either cognates or false friends. The results are satisfactory, with an average accuracy of 84.% (see Table I). If the task is limited to finding cognates, the employed methodology can be very effective, as the ratio of extracted cognates to false friends is more than :.

5.2 Classifying pairs as Cognates or False Friends

The extracted pairs were classified as either cognates or false friends. The evaluation was carried out in a parameter-driven fashion, investigating the impact of the following parameters on the performance results:

- Threshold estimation method: How does the choice of the threshold estimation method affect the results?
- The influence of N for Method: How many similar words should be considered when measuring the distance?
- Direction of translation for Method: When mapping feature vectors from one language to another, does the direction of translation have an effect on the performance?
- The effect of errors in the extraction stage: How does this affect the classification methods?
- The sample data: How do the methods behave with unbalanced class sizes?

For each evaluation parameter (method, threshold estimation, etc.), the evaluation procedure splits the two samples (cognates and false friends) into 10 parts each and runs a tenfold cross-validation using the specified parameters. As aforementioned, the evaluation was conducted in two settings: an ideal setting, where the extraction was regarded as perfect in that the pairs to be classified were either cognates or false friends, and a real-world setting, where the pairs to be classified as cognates and false friends (or unrelated) were extracted from bilingual corpora and, as a consequence of pre-processing errors, not all pairs were necessarily cognates or false friends.
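The per-class tenfold split just described can be sketched as follows (a sketch; the pair lists and the seed are hypothetical, and each class is split separately so that every fold contains both cognates and false friends):

```python
import random

def tenfold_splits(cognates, false_friends, k=10, seed=0):
    """Yield (train, test) partitions for k-fold cross-validation,
    splitting each class into k parts independently."""
    rng = random.Random(seed)
    cog, ff = list(cognates), list(false_friends)
    rng.shuffle(cog)
    rng.shuffle(ff)
    for i in range(k):
        cog_test, ff_test = cog[i::k], ff[i::k]
        cog_train = [p for p in cog if p not in cog_test]
        ff_train = [p for p in ff if p not in ff_test]
        yield (cog_train, ff_train), (cog_test, ff_test)
```

Each of the k folds holds out roughly 1/k of each class for testing, and the threshold is then estimated on the training portion only, as described in Section 4.4.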
5.2.1 Classification of cognates and false friends in an ideal setting of perfect extraction

The evaluation in an ideal environment was based on the assumption that the extraction process was 100% accurate. For this purpose, the lists were filtered (post-edited) to contain only cognates and false friends. In the present experiment, we tried out two evaluation options. One involved an unbalanced sample, which contained all cognates and all false friends present in our data. In each language pair, the proportion between them is highly skewed towards cognates, and this appears to reflect the typical tendency in a language pair. To gain better insight into the relative performance of the methods, we additionally evaluated them on samples where the number of cognates was reduced, by randomly removing cognates until it equalled the number of false friends.
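Balanced samples of the kind used here can be produced by random down-sampling of the majority class (a sketch; the seed is arbitrary and the inputs are hypothetical pair lists):

```python
import random

def balance(cognates, false_friends, seed=42):
    """Randomly drop cognates until the class sizes match; cognates
    are assumed to be the majority class, as in the data described
    above."""
    rng = random.Random(seed)
    kept = rng.sample(list(cognates), len(false_friends))
    return kept, list(false_friends)
```

Fixing the seed makes the balanced sample reproducible across the different parameter settings being compared.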

For both options, the baseline for recall is the case when all pairs are marked as cognates (since the number of cognates is always greater than the number of false friends), and the baseline for precision is set to random classification, both figures being 50%. Table II contains the results, when using unbalanced samples, for the two base methods and for the taxonomy variants with Leacock&Chodorow and with Wu&Palmer. The second column of the table shows the threshold estimation method. The third column gives the value of N, or the direction of translation (L being left-to-right and R being right-to-left language translation). The results can be regarded as satisfactory. One base method scores as high as 8.6% recall and 6.4% precision, and the other achieves a recall of .6% and a precision of .9%. The variant with Leacock&Chodorow delivers a best recall of 8.8% and precision of 88.9% (for English-Spanish), whilst the variant with Wu&Palmer performs with a recall as high as 86.% (also English-Spanish) and a precision of 8.6%. Overall, the best performing method is the variant with Wu&Palmer, followed by the variant with Leacock&Chodorow and the two base methods. The best results from the point of view of language pairs are reported on English-Spanish, followed by English-German, English-French and French-Spanish. We see that while the best configurations of the methods beat the baseline by a considerable margin, in about % of the test cases the results are actually below the baseline. This suggests that tuning the parameters can make a very large difference to the performance of the methods. Table III describes the results achieved on balanced samples. The best results are achieved by the variant with Wu and Palmer, with a highest recall of 9.% and precision of 94.%, followed closely by the variant with Leacock&Chodorow, with its best recall of 88.% and precision of 89.98%, and then by the two base methods (recall 8.%, precision 8.6%; and recall 69.%, precision 68.4%).
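For reference, Wu and Palmer's measure, used by the best-performing variant, scores two concepts by the depth of the most specific node dominating both. A toy sketch (the miniature taxonomy below is hypothetical, standing in for EuroWordNet; the averaging follows the nearest-neighbour scheme described in Section 4):

```python
# Hypothetical miniature taxonomy: node -> parent (None for the root).
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal",
    "mammal": "animal", "animal": None,
}

def ancestors(node):
    """Chain from the node up to the root (the node itself first)."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    return len(ancestors(node))  # the root has depth 1

def wu_palmer(a, b):
    """Wu & Palmer (1994): 2*depth(lcs) / (depth(a) + depth(b)),
    lcs being the most specific node dominating both concepts."""
    anc_b = set(ancestors(b))
    lcs = next(n for n in ancestors(a) if n in anc_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def neighbour_similarity(neigh_a, neigh_b):
    """Average, over the first word's nearest neighbours, of the best
    Wu-Palmer match among the second word's nearest neighbours."""
    return sum(max(wu_palmer(x, y) for y in neigh_b) for x in neigh_a) / len(neigh_a)
```

In this toy taxonomy "dog" and "wolf" (sharing the specific ancestor "canine") score higher than "dog" and "cat" (sharing only "mammal"), which is the behaviour the measure is chosen for.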
Again, the best performance is obtained on English-Spanish, followed by French-Spanish, English-German and English-French. The best configurations of the methods often show a greater improvement on the baseline in comparison with the unbalanced samples. However, the difference is not as large as would be expected, and in some cases the results are slightly worse (for one of the base methods, for example).

5.2.2 Classification of cognates and false friends in a real-world setting of fully automatic extraction from corpora

Table IV reports the classification performance on balanced samples in a fully automatic mode, involving extraction of pairs before classification and, as a result, including errors and unrelated words. The results are lower than those obtained on filtered lists, as pre-processing errors come into play, but are still satisfactory. Since the methods always label a pair of words as either cognates or false friends, precision decreased as a consequence. The recall was the same as in Table II because the threshold estimation was computed with training data for cognates and false friends only. On average, the best method was again the variant with Wu&Palmer (precision 8.8%), followed by the variant with Leacock&Chodorow (6.%) and the two base methods (with a highest figure of .4%, and 6.8% respectively). The best results were obtained on English-Spanish, followed by English-German, French-Spanish and English-French. As expected, the results on all pairs (unbalanced samples where

Table II. Classification results with corrected unbalanced lists. For each method (the two base methods, Method with Leacock&Chodorow, Method with Wu&Palmer), recall and precision are reported for En-Fr, En-Ge, En-Sp and Fr-Sp, per threshold estimation technique and parameter setting (N, or direction of translation L/R); the final row gives the baseline.

the vast majority of pairs are cognates) are lower (Table IV) but still reasonable, with the variant with Leacock&Chodorow achieving a 6.98% precision on English-Spanish, and a base method scoring 6.6% on the same pair. In the following, we discuss the results in relation to each of the evaluation parameters listed above.

Table III. Classification results with corrected balanced lists. For each method (the two base methods, Method with Leacock&Chodorow, Method with Wu&Palmer), recall and precision are reported for En-Fr, En-Ge, En-Sp and Fr-Sp, per threshold estimation technique and parameter setting; the final row gives the baseline.

Threshold estimation: Three functions were used: Mean Average, Median Average and Max Accuracy. Max Accuracy searches for the threshold that gives maximum accuracy on the training set. For the different values of N tested on balanced samples with errors, one averaging estimator gives the best recall in

6.% of the test cases, followed by the other averaging estimator in .4%, and the Max Accuracy threshold estimation in .8% of the cases. The precision achieved by one averaging estimator is highest in .64% of the cases, followed by the other (.%) and Max Accuracy (.8%). On balanced samples without errors, the results are somewhat different. One averaging estimator and Max Accuracy both deliver the best results in 4.4% of the cases, followed by the other averaging estimator (.9% of cases). However, in terms of precision, one averaging estimator scores higher in 9.6% of the cases, followed by Max Accuracy (.% of cases) and the other averaging estimator (.% of cases). On unbalanced samples without pre-processing errors, one averaging estimator provides the best recall in .6% of the cases, the other in 46.8%, and Max Accuracy in .66% of the cases. In terms of precision, one averaging estimator leads to the best performance in 4.% of the cases, with the other performing best in 6.% of the cases, and Max Accuracy in .8% of the cases. For fully automatic extraction and classification (i.e. with pre-processing errors) on the whole (unbalanced) corpus, the best performance figures for recall are distributed in the same way as on the whole corpus without pre-processing errors. The only difference is seen when comparing precision: one averaging estimator gives the best results in 4.% of the test cases, followed by the other in 9.9% and Max Accuracy in .% of the cases.

Table V summarises the best evaluation results (recall, precision, accuracy) for each method (the base methods, Method with Leacock&Chodorow, Method with Wu&Palmer), language pair (English-French, English-German, English-Spanish, French-Spanish), pre-processing mode (fully automatic mode with pre-processing errors, or perfect processing on filtered lists without errors), sampling mode (all pairs, balanced samples) and parameters (number of words N, direction of translation), with particular reference to the three threshold estimation techniques (Mean Average, Median Average, Max Accuracy).
The influence of N for Method: The number of most similar words (N) which the method uses to measure similarity between pairs is an essential parameter. Figures 4 and 5 report recall and precision for each N when evaluating English-German cognates with Max Accuracy as the threshold estimation, on the balanced sample without errors. (In this section we report the evaluation results for one language pair, and without pre-processing errors only, as reporting all possible combinations of language pairs and methods would result in a high number of figures within the section. All remaining tables are provided in the Appendix.) When a semantic taxonomy is not used, results gradually improve up to some value of N and then begin to deteriorate. When classification benefits from a taxonomy, the figures are different. With N= , classification is carried out entirely according to the taxonomy similarity method used. The precision is quite high but, because not all words are in the dictionary, the recall is below the baseline of 50%. Increasing N results in a better recall, but can also degrade precision. The behaviour is similar across language pairs and pre-processing modes (see Figures 8- in the Appendix). Direction of translation for Method: Figures 6 and 7 show the effect of direction on accuracy when using balanced samples without errors with all three threshold estimations. For both

Table IV. Precision when errors are present (recall is the same as in Table II). For each method (Method with Wu&Palmer, Method with Leacock&Chodorow, the two base methods), precision is reported for En-Fr, En-Ge, En-Sp and Fr-Sp, both on balanced samples and on all pairs; the final row gives the baseline.

English-French and English-German, using English as the source resulted in better performance than using English as the target (the only exception being when median average is applied for English-French).
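The direction-of-translation parameter can be sketched as follows (a sketch with a hypothetical two-entry dictionary and toy vectors; the paper's actual features are verb-object co-occurrences, and its dictionaries come from EuroWordNet):

```python
def translate_vector(vec, dictionary):
    """Map a {feature: weight} co-occurrence vector into the other
    language via a bilingual dictionary, summing the weights of
    features that translate to the same target word."""
    out = {}
    for feat, w in vec.items():
        for t in dictionary.get(feat, []):
            out[t] = out.get(t, 0.0) + w
    return out

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def directional_similarity(vec_src, vec_tgt, dict_src2tgt):
    """Similarity after translating the source-language vector;
    swapping the arguments (and using the reverse dictionary) gives
    the other direction of translation."""
    return cosine(translate_vector(vec_src, dict_src2tgt), vec_tgt)
```

Because the two dictionaries are not symmetric and coverage differs per language, translating left-to-right and right-to-left generally yields different similarity values, which is exactly the effect examined here.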

Table V. Summary of the best evaluation results (recall, precision, accuracy) for each method, language pair, pre-processing mode, sampling mode and parameters, with particular reference to the three threshold estimation techniques (Mean Average, Median Average, Max Accuracy). Rows: balanced with errors, balanced without errors, all pairs with errors, all pairs without errors; columns: En-Fr, En-Ge, En-Sp, Fr-Sp, each giving the best estimator, best method, N/direction and the resulting value.

Figure 4. The recall of Method for English-German with balanced samples without pre-processing errors, using max. accuracy estimation.

Figure 5. The precision of Method for English-German with balanced samples without pre-processing errors, using max. accuracy estimation.

Figure 6. Recall of Method with balanced samples without pre-processing errors.

If Spanish is one of the languages, the results are better if it is the source language (i.e. if the translation is from right to left). Again, an exception is the case when mean average is used for French-Spanish, with higher accuracy achieved when French is the source language. Figures in the Appendix illustrate the effect of the direction of translation on balanced samples including pre-processing errors. The better direction for each case is the same as for the sample without errors. With pre-processing errors in the samples, for some cases (such as Max Accuracy for En-Sp) the difference in performance between the two directions is far less noticeable than when using samples without errors. We experimentally found that the differences in performance within a language pair amount to up to 6.% depending on the choice of the direction of translation, and the differences in recall to up to .% for balanced samples without errors.

Figure 7. Precision of Method with balanced samples without pre-processing errors.

Comparison of the two methods: In most cases, the optimal results for the method without taxonomy are, for any of the four language pairs, higher than the optimal results for the other method (see Tables II-V for a subset of all evaluation results; see also Table VI). The only exceptions apply to the best result for precision on English-Spanish when evaluated on unbalanced samples (with or without pre-processing errors), and for precision on English-Spanish for unbalanced samples without errors. On balanced samples without errors, the variant with Leacock&Chodorow produces the best recall for English-French and the best recall and precision for French-Spanish, whereas Wu&Palmer offers the optimal values for English-German and English-Spanish. When pre-processing errors are present in the balanced samples, a base method holds the best recall for English-French. Leacock&Chodorow produces the highest precision for English-French and the highest recall for French-Spanish, whereas Wu&Palmer performs best for English-German and English-Spanish, and obtains the best precision for French-Spanish.
On unbalanced samples without errors, the variant with Leacock&Chodorow performs with the highest precision for all language pairs except English-French, whilst the variant with Wu&Palmer records the best recall for the same language pairs. With errors present in the unbalanced samples, the variant with Leacock&Chodorow produces the best precision for English-Spanish, whereas Wu&Palmer gives optimal performance for English-German and the best recall for English-Spanish and French-Spanish. The other base method produces the best recall in one case only: the unbalanced sample with errors for French-Spanish. In conclusion, the taxonomy-capable method generally has the edge, and the taxonomy can improve its results even further. The drawback of

Table VI. Best results for recall and precision (separately) for each method, sample and language pair. The upper part reports, for En-Fr, En-Ge, En-Sp and Fr-Sp, the best recall and precision of the two base methods and the variants with Leacock&Chodorow and Wu&Palmer, on all pairs and on balanced samples, with and without errors; the lower part ranks the four methods (best to last) under each condition.

this method is that there is no optimal value for N which guarantees the best results in all cases; this can sometimes lead to a performance significantly lower than that of the other method.

The effect of errors in the extraction stage: Since errors are not currently used for training, their effect on the evaluation results is to lower precision. When using balanced samples, the decrease in precision can be up to .9% (for the variant with Leacock&Chodorow, N= , French-Spanish, max accuracy as estimation), with an average of 9.9% over all test cases. With unbalanced samples (consisting of all extracted pairs), precision is 8% lower on average, with the highest drop, .8%, for the method without taxonomy (N= , French-Spanish, max accuracy as estimation).

The sample data: When using balanced samples (the number of cognates being equal to the number of false friends), average precision and recall are, respectively, .% and 8.8% higher (over all test cases run) than when using unbalanced samples. The most significant improvements in precision and recall on balanced samples are 4.4% and 4.% respectively. For some parameters, however, better results are obtained on unbalanced samples, with the greatest difference in favour of the evaluation on all pairs being .% for precision and 9% for recall. Nonetheless, experiments on balanced samples result in higher precision in 8.6% of the combinations of parameter settings, and higher recall in 6.% of these cases. All results reported in this paper, as well as results obtained with other choices of parameters, can be viewed at http://bul.sytes.net/gefix/cognates/extract.php.

6 CONCLUSION

This paper proposes two novel methods (one of which is presented in variants) for the automatic identification of both cognates and false friends from corpora, neither of which depends on the existence of parallel texts or any other collections/lists.
The extensive evaluation results cover a variety of parameters and show that automatic identification of cognates and false friends from corpora is a feasible task, and that all proposed methods perform in a satisfactory manner. The best results are obtained by the method based on the premise that cognates are semantically closer than false friends, when the variant of the method using a taxonomy and Wu and Palmer's measure is employed.

References

Budanitsky, A. and Hirst, G. (2001). "Semantic Distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures". Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics (NAACL-2001), Pittsburgh, PA, June 2001.

Dagan, I., Lee, L. and Pereira, F. (1999). "Similarity-based models of word co-occurrence probabilities". Machine Learning 34(1-3): 43-69.

Danielsson, P. and Muehlenbock, K. (2000). "Small but Efficient: The Misconception of High-Frequency Words in Scandinavian Translation". Proceedings of the 4th Conference of the Association for Machine Translation in the Americas on Envisioning Machine Translation in the Information Future. Springer Verlag, London, 8-68.