Extended Similarity Test for the Evaluation of Semantic Similarity Functions

Maciej Piasecki 1, Stanisław Szpakowicz 2,3, Bartosz Broda 1
1 Institute of Applied Informatics, Wrocław University of Technology, Poland, {maciej.piasecki,bartosz.broda}@pwr.wroc.pl
2 School of Information Technology and Engineering, University of Ottawa, szpak@site.uottawa.ca
3 Institute of Computer Science, Polish Academy of Sciences

Abstract

We propose a more demanding version of a well-known WordNet-Based Similarity Test. Our work with semantic similarity functions for Polish nouns has shown that test to be insufficiently stringent. We briefly present the background, explain the extension of WBST and report on experiments that contrast the old and new evaluation tools.

1. Introduction

Many tasks in Natural Language Processing (Word Sense Disambiguation, Text Entailment and Text Classification, to name just a few) require a measure of semantic relatedness. Automatic acquisition of lexical semantic relations, in particular, can hardly be imagined without some form of a semantic similarity function (henceforth, SSF). A SSF maps pairs of lexical units into real numbers, and is usually normalized. A lexical unit (LU) is a word type or lexeme organized, especially in inflected languages, by the values of morphological categories such as number, gender and so on.

Evaluation of the quality or effectiveness of a SSF is a non-trivial problem. Manual evaluation is barely feasible even on a small scale. Not only are SSFs required to work for any pair of LUs, but people are also notoriously bad at working with real numbers. A linear ordering of dozens of LUs is nearly impossible, and even comparing two terms requires a fairly complicated setup (Rubenstein and Goodenough, 1965). Given a small sample, people can easily distinguish a bad SSF from a good one, but we must distinguish good SSFs from those that are merely passable.
We note three forms of SSF evaluation (Budanitsky and Hirst, 2006; Zesch and Gurevych, 2006): mathematical analysis of formal properties (for example, the property of a metric distance (Lin, 1998b)), application-specific evaluation, and comparison with human judgement. Mathematical analysis gives few clues about the future uses of a SSF. Evaluation via an application may make it difficult to separate the effect of a SSF from that of other elements of the application (Zesch and Gurevych, 2006). A direct comparison to a manually created resource seems the least troublesome. The construction of such resources, however, is labour-intensive even if it only labels LU pairs as similar (maybe just related (Zesch and Gurevych, 2006)) or not similar; such labelling does not allow a fair assessment of the ordering of LUs on a continuous scale, which is what a SSF produces.

Indirect comparison with the existing resources (Grefenstette, 1993) is another possibility. For example, one could compare a SSF constructed automatically with another based on semantic similarity across WordNet's (Fellbaum, 1998) hypernymy structure. In (Lin, 1998a; Weeds and Weir, 2005) two lists of the k LUs most similar to a given one are transformed into rank numbers of the subsequent LUs on the lists, and compared by the cosine measure. The drawback of such an evaluation is that we learn how close the two similarity functions are, but not how people perceive a SSF.

Automatic differentiation between words synonymous and not synonymous with a LU is a natural application for a SSF. In Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) a SSF constructed on the basis of a statistical analysis of a corpus was used to make decisions in a synonymy test, a component of the Test of English as a Foreign Language (TOEFL); this gave 64.4% of hits. (Turney, 2001) reported 73.75% hits, and (Turney et al., 2003) 97.5% hits; this practically solved the TOEFL synonymy problem.
Next, (Freitag et al., 2005) proposed a WordNet-Based Synonymy Test (WBST), in which WordNet is used to generate a large set of questions identical in format to those in the TOEFL. Section 2 discusses WBST. The best reported result for nouns is 75.8% (Freitag et al., 2005). A slightly modified WBST was used to evaluate a SSF for Polish nouns (Piasecki et al., 2006) with the result of 86.09%.

The evaluation of a SSF via a synonymy test shows the ability of the SSF to distinguish synonyms from non-synonyms. Since the SSF is the centrepiece of the application, the achieved results can be directly attributed to it. There was, however, a problem: WBST appeared to be too easy, as shown in Section 2, so it was no longer a useful tool in the assessment of more sophisticated SSFs for Polish nouns.

In view of these findings, we have set out to design a more demanding automatic method of SSF assessment. We want its results to be easily interpreted by people and its feasibility tested on people. We also expect that it will pick the SSF that is a better tool for the recognition of lexical semantic relations between LUs.

2. WBST and the Similarity among Polish Nouns

The application of LSA to the TOEFL became unattractive as a method of comparing SSFs once the result of 97.5% hits had been achieved (Turney et al., 2003).

(Freitag et al., 2005) proposed a new test, WBST. It was seen as more difficult because it contained many more questions. An instance of the test is built thus: first, pair a LU q included in a wordnet (WordNet 2.0 in (Freitag et al., 2005)) with a randomly chosen synonym s; next, randomly draw from the wordnet three other LUs not in q's synset (detractors) to complete an answer set A in the question-answer (QA) pair ⟨q, A⟩. During evaluation, the SSF generates values for the pairs ⟨q, a_i⟩, a_i ∈ A, that are expected to favour s.

The WBST has been used, amongst other applications, to evaluate SSFs for Polish nouns. The underlying resource was Polish WordNet (plwn) (Derwojedowa et al., 2007a; Derwojedowa et al., 2007b), a lexical database now under construction (Piasecki et al., 2006). The test was slightly modified. In plwn, many synsets have only 1-3 LUs, in accordance with the definition of the semantic relations (Derwojedowa et al., 2007b). In order to get a better coverage of LUs by WBST questions, and not to leave LUs in singleton synsets untested, the direct hypernyms of LUs from singleton synsets were taken to form QA pairs in (Piasecki et al., 2006). We named this modification the WBST with Hypernyms (WBST+H). The inclusion of hypernyms in QA pairs did not make the test easier, as was shown in (Piasecki et al., 2006).

In (Piasecki et al., 2006) a SSF was based on adjectival modification of nouns and on noun coordination; we also ran preliminary experiments with describing a noun via its association with verbs. That work lacked an in-depth analysis of all possible lexico-syntactic markers of Polish noun meaning. After the re-implementation of the approach of (Piasecki et al., 2006) and the addition of several lexico-syntactic features (such as modification by an adjectival participle; see Section 4), the result exceeded 90%. We could, however, observe little difference in the influence of the subsequent types of features. This was contrary to our expectations.
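The construction of a single WBST question described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wordnet is reduced to a list of synsets (sets of strings), the function name `build_wbst_question` and the toy vocabulary (English glosses of LUs that appear in this paper's examples) are hypothetical, and the hypernym fallback of WBST+H is not modelled.

```python
import random


def build_wbst_question(q, synsets, rng, n_detractors=3):
    """Build one QA pair in the WBST format for the question LU q.

    synsets is a list of sets of LUs, a toy stand-in for a wordnet.
    Returns (q, s, answers), where s is the correct synonym, or None
    when q has no synonym (plain WBST cannot form a question then).
    """
    # All LUs that share at least one synset with q (q included).
    own = set().union(*(syn for syn in synsets if q in syn))
    synonyms = sorted(own - {q})
    if not synonyms:
        return None
    s = rng.choice(synonyms)
    # Detractors are drawn at random from the rest of the wordnet.
    pool = sorted(set().union(*synsets) - own)
    answers = rng.sample(pool, n_detractors) + [s]
    rng.shuffle(answers)
    return q, s, answers


# Hypothetical toy data, not taken from plwn.
toy_synsets = [
    {"administration", "board"},
    {"attic"},
    {"repatriation"},
    {"follower"},
    {"office"},
]
question = build_wbst_question("administration", toy_synsets, random.Random(0))
print(question)
```

Because the detractors come from the whole vocabulary, most of them are unrelated to q, which is the property that makes the test easy for a good SSF.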
In the repeated WBST+H tests with several raters, the result was close to 100% (markedly more than the 89.29% reported in (Piasecki et al., 2006)). This may have happened because the tests were generated on the basis of a more recent, improved version of plwn. It is imperative that we construct a more difficult WBST-style test to facilitate further work on SSFs for Polish nouns.

3. Enhanced WBST

The WBST defined in (Freitag et al., 2005) stipulates that the elements of the answer set A not synonymous with q are chosen completely at random from the whole wordnet. This means that the difference in meaning between q and the detractors is in most cases obvious to test-takers. It also tends to be obvious to a good SSF. Our overall goal, however, is to construct automatically synsets of highly similar LUs, and to differentiate the LUs in a synset from all other LUs that are similar but not synonymous, among them co-hyponyms (Derwojedowa et al., 2007b). Any SSF must therefore distinguish closely related LUs, not only those with very different meaning.

We need to construct the answer set A so that the non-synonyms are closer in meaning to q and to the correct answer s than is the case in WBST+H. Obviously, they cannot be synonyms of either s or q, but they ought to be related to both. We need to select the non-synonyms among LUs similar to s and to q. In order to achieve this, we have decided to leverage the structure of the wordnet in the determination of similarity. During the generation of the modified test, named Enhanced WBST (EWBST), non-synonyms are still selected randomly, but only from the set of LUs broadly similar to q and s. The acceptable values of SSF_WN(Q, x) are lower than a threshold sim_t; the synset Q contains q and s, and x is a detractor.
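The restricted selection of detractors can be illustrated with a small sketch. It uses a wordnet-based measure of the kind this section goes on to define: the minimal path length divided by twice the maximal depth, with a penalty for paths that cross the artificial root. Several things here are assumptions made for illustration only: the hierarchy is a tree stored as a child-to-parent dictionary (so the minimal path runs through the lowest common ancestor), nodes stand for synsets, hypernyms of Q are not excluded from the candidates, and the toy hierarchy and the threshold value 0.2 are hypothetical.

```python
import random

ROOT = "ROOT"        # artificial root linking the hypernymy forest
ROOT_PENALTY = 3     # added to any path going across the artificial root
MAX_DEPTH = 9        # d, the maximal depth reported for plwn


def ancestors(node, parent):
    """Return the chain node, parent(node), ..., ROOT."""
    chain = [node]
    while chain[-1] != ROOT:
        chain.append(parent[chain[-1]])
    return chain


def path_length(a, b, parent):
    """Path length between a and b in the tree, penalised if it crosses ROOT."""
    depth_from_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for i, n in enumerate(ancestors(a, parent)):
        if n in depth_from_b:  # lowest common ancestor reached
            length = i + depth_from_b[n]
            return length + ROOT_PENALTY if n == ROOT else length
    raise ValueError("nodes are not connected")


def ssf_wn(a, b, parent):
    """Normalised path length: lower values mean closer nodes."""
    return path_length(a, b, parent) / (2 * MAX_DEPTH)


def ewbst_detractors(Q, parent, threshold, rng, n=3):
    """Draw n detractors at random among the nodes close enough to Q."""
    candidates = sorted(
        x for x in parent if x != Q and ssf_wn(Q, x, parent) < threshold
    )
    return rng.sample(candidates, n)


# Hypothetical toy hierarchy (child -> parent), not taken from plwn.
toy_parent = {
    "institution": ROOT, "place": ROOT, "person": ROOT,
    "administration": "institution", "office": "institution",
    "charity": "institution", "ministry": "institution",
    "attic": "place", "follower": "person",
}
print(ewbst_detractors("administration", toy_parent, 0.2, random.Random(0)))
```

In this toy setting the path from "administration" to "attic" crosses the root and has length 4 + 3 = 7, so "attic" is never offered as a detractor, while the co-hyponyms at distance 2 are.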
We tested several wordnet-based similarity functions (Agirre and Edmonds, 2006), here implemented on the basis of plwn's hypernymy structure, achieving the best result in a generated test with the following function:

\[ SSF_{WN} = \frac{p_{min}}{2d} \tag{1} \]

where p_min is the length of a minimal path between two LUs in plwn, and d = 9 is the maximal depth of the hypernymy hierarchy in the current version of plwn. The similarity threshold sim_t = 2 for this function has been established experimentally.

The hypernymy structure of nouns in plwn (as of May 7, 2007) does not have a single root. Many methods of similarity computation require a root, so we have introduced one artificially and linked to it all trees in the hypernymy forest. We noticed that the random selection of LU detractors on the basis of any similarity measure tends to favour LUs in hypernymy subtrees other than that of q, if q is located near the root. The number of LUs linked by a short path across the root is much higher than the number of LUs from the subtree of q which are located at a close distance to q. The problem is especially visible for question LUs in small hypernymy subtrees with a limited number of hyponyms. As the problem appears in the case of any similarity measure based on the path length, we have heuristically modified the measure by adding the value 3 to any path going across the artificial root. Lower values gave no visible changes, while higher values caused a large reduction of the number of QA pairs.

To illustrate the difference in the level of difficulty between WBST+H and EWBST, we show an example problem generated by each method for the same pair: administracja (administration), zarząd (board). The EWBST built the following test:
Q: administracja (administration)
A: urząd (office, department), fundacja (charity, endowment), zarząd (board, management), ministerstwo (ministry).
And the test generated by WBST+H:
Q: administracja (administration)
A: poddasze (attic), repatriacja (repatriation), zarząd (board, management), zwolennik (follower, zealot).

An example EWBST test was given to 32 native speakers of Polish, all of them Computer Science students. (This bias in the group of raters should not influence the results, because the test was composed on the basis of plwn, which at present includes only general Polish vocabulary.) The test consisted of 99 QA pairs. All LUs in the test were selected from the 5706 single-word noun LUs included in plwn. In the set of question LUs, there were 42 LUs occurring more than 1000 times in the IPI PAN corpus (Przepiórkowski, 2004). This subset was distinguished in the test, because such LUs are also the basis of the comparison with the results achieved in (Freitag et al., 2005) and (Piasecki et al., 2006).

For all QA pairs the average result was 70%, with the minimum 61.62%, maximum 78.79% and the standard deviation from the mean σ = 4.07%. For the subset consisting of frequent LUs, the average result was 63.24%, with the minimum 52.38%, maximum 73.81% and σ = 5.37%. The results, as expected, are much lower than those achieved in WBST+H tests. We were surprised that the results the raters had for the frequent LUs were noticeably lower than for all LUs. It is likely that more frequent LUs are at the same time more polysemous, and that makes them more difficult to distinguish from other similar LUs. The results for frequent LUs are lower, but at a level similar to the results for all LUs. A situation like this was also observed in the application of the EWBST to the SSFs discussed in Section 5.

4. Similarity Functions for Polish Nouns

(Piasecki et al., 2006) proposed a SSF based on the frequency of modification of a noun by specific adjectives and on the frequency of coordination with specific nouns. Features based on verbs were also tested.
Following this approach, we have defined a set of noun-meaning markers, listed below, identifiable via shallow morpho-syntactic processing (full parsing, unfortunately, was not an option):
- modification by a specific adjective, from (Piasecki et al., 2006) (written A in Table 1),
- modification by a specific adjectival participle, from (Piasecki et al., 2006) (Part in Table 1),
- co-ordination with a specific noun, from (Piasecki et al., 2006) (Nc),
- occurrence of a verb for which a given noun in a specific case can be an argument (V(case)),
- modification by a specific noun in genitive (NMg),
- occurrence of a specific preposition with which a given noun in a specific case forms a prepositional phrase (Prep(case)).

An N × C matrix M is created from the IPI PAN corpus documents tagged by the TaKIPI tagger (Piasecki, 2006). C is the number of lexico-syntactic features used, N the number of nouns, and M[n, c] the number of occurrences of the n-th noun with the c-th feature (i.e., the number of times the constraint was satisfied). All features are based on the occurrences of certain lexical markers in the context (a sentence defined by TaKIPI) which satisfy certain morpho-syntactic constraints, for example the presence of some syntactic configuration between the noun and the marker. The constraints are expressed in the JOSKIPI language included in TaKIPI. One such constraint is shown below (partially, enough to illustrate the variety of morpho-syntactic phenomena that we can test).

    or(
      and(
        llook(-1,begin,$c,
          and(equal(pos[$c],{conj}),
              inter(base[$c],{"ani","albo","czy","i","lub","oraz"}))),
        only($c,-1,$oa,
          in(pos[$oa],{conj,adj,pact,ppas,num,numcol,adv,qub,pcon,pant})),
        llook($-1c,begin,$s,
          and(in(pos[$s],subst,ger,depr),
              inter(base[$s],"variable-n"))),
        inter(cas[$s],cas[0]),
        only($s,$c,$ob,
          in(pos[$ob],{conj,adj,pact,ppas,num,numcol,adv,qub,pcon,pant,subst,ger,depr}))
      ),
      and(... analogically to the right)
    )

In this expression, the first operator llook looks for a conjunction to the left of the centre of the context, that is, the position of a given noun. The operator only tests the units between that conjunction and the centre; the allowed types are conjunction, adjective, adjectival participle, numeral, adverb, etc. Next, we look for the potentially coordinated noun, defined in each instance of the constraint (variable-n) for a column of the matrix, and we test case agreement with inter. Next, we consider words to the right of the conjunction, using a symmetrically inverted constraint.

The similarity between nouns is calculated on the basis of matrix rows according to the method proposed in (Piasecki et al., 2006). The central element of the method is the Rank Weight Function:
1. Weighted values of the cells are recalculated using a weight function f_w: for every c, M[n_i, c] = f_w(M[n_i, c]).
2. Features in a row vector M[n_i, ·] are sorted in ascending order of the weighted values.
3. The k highest-ranking features are selected; e.g. k = 1000.
4. For each selected feature c_j: M[n_i, c_j] = k − rank(c_j).

As the weight function, we applied the t-score test, tscore(n, c) > 2.567 (Manning and Schütze, 2001):

\[ tscore(n, c) = \frac{M[n,c] - \frac{TF_n \, TF_c}{W}}{\sqrt{\frac{TF_n \, TF_c}{W}}} \]

TF_n and TF_c are the total frequencies of the noun and of the satisfied constraint, respectively, and W is the number of words processed. Additionally, a threshold on the number of features common to both nouns, mcom > 1%, serves as a constraint for the similarity to be positive (otherwise it is set to 0).
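The four steps of the Rank Weight Function can be made concrete with a minimal sketch for one matrix row. It rests on assumptions that the text does not settle: the t-score is read as (observed − expected) / sqrt(expected) with expected = TF_n · TF_c / W, features whose t-score does not exceed 2.567 are discarded before ranking, and ranks are counted from 0 so that the best feature receives the weight k. The function names and the toy counts are hypothetical.

```python
import math


def tscore(m_nc, tf_n, tf_c, w):
    """t-score of a noun/feature cell under the reading described above."""
    expected = tf_n * tf_c / w
    return (m_nc - expected) / math.sqrt(expected)


def rank_weight_row(row, tf_n, tf_cols, w, k, min_t=2.567):
    """Apply the Rank Weight Function to one row of the matrix.

    row maps a feature to its raw count for this noun, tf_cols maps a
    feature to its total frequency. Returns feature -> rank weight.
    """
    # Step 1: recalculate the cells with the weight function.
    weighted = {c: tscore(m, tf_n, tf_cols[c], w) for c, m in row.items() if m > 0}
    # Keep only features that pass the t-score test (assumed filter).
    kept = {c: t for c, t in weighted.items() if t > min_t}
    # Steps 2 and 3: order by weighted value and take the k best features.
    best = sorted(kept, key=kept.get, reverse=True)[:k]
    # Step 4: replace each value by k - rank, with ranks starting at 0.
    return {c: k - rank for rank, c in enumerate(best)}


# Hypothetical counts for one noun and three adjectival features.
toy_row = {"big": 50, "red": 5, "old": 20}
toy_tf_cols = {"big": 1000, "red": 5000, "old": 400}
weights = rank_weight_row(toy_row, tf_n=100, tf_cols=toy_tf_cols, w=100000, k=2)
print(weights)
```

After this transformation only the order of the features matters, not the magnitude of their scores, so a few very frequent features cannot dominate the comparison of two rows.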

5. Experiments

In order to test the influence of the constraints, for each constraint we created a separate matrix on the basis of about 254 million words from the IPI PAN corpus. The SSFs calculated from the matrices by the rank method were next tested using WBST+H and EWBST, both generated according to the present state of plwn. Most of the tests were limited to nouns in plwn that occur more than 1000 times in the corpus (6105 nouns), the threshold used in (Freitag et al., 2005; Piasecki et al., 2006). We generated 3025 QA pairs for the frequent nouns.

In EWBST tests run for all nouns of plwn, the results were similar to those for the frequent nouns only. This is in contrast with the WBST+H test, in which there is a big difference between the accuracy for the frequent and the infrequent nouns. For example, in (Piasecki and Broda, 2007) the best result for the frequent nouns is 81.15%, while for all nouns it is only 64.03%. WBST+H is easier for the frequent nouns, which are well described by frequent occurrences of features, because it is easier to distinguish them from completely randomly selected nouns. In EWBST all nouns are compared with those similar to them. The result for infrequent nouns, not so well described, stays approximately the same, but the result for the frequent nouns is worse, as the task becomes harder. From the point of view of the automatic construction of synsets, this behaviour of EWBST is advantageous: we perform only one test and yet we get a good description of the whole SSF.
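Scoring a SSF with WBST+H or EWBST amounts to counting the QA pairs in which the SSF assigns its highest value to the correct answer s. The sketch below shows that computation under stated assumptions: a question is a tuple (q, s, answers), the SSF is any function of two LUs, ties are resolved in favour of the first answer in the list, and the toy similarity values are hypothetical.

```python
def synonymy_test_accuracy(ssf, questions):
    """Fraction of QA pairs in which ssf ranks the correct answer first.

    questions is an iterable of (q, s, answers) tuples, where s is the
    correct answer and answers contains s and the detractors.
    """
    hits = 0
    total = 0
    for q, s, answers in questions:
        # The answer with the highest similarity to q is the SSF's choice.
        choice = max(answers, key=lambda a: ssf(q, a))
        hits += choice == s
        total += 1
    return hits / total if total else 0.0


# Hypothetical similarity values; unseen pairs default to 0.0.
toy_scores = {
    ("administration", "board"): 0.9,
    ("administration", "office"): 0.7,
    ("administration", "attic"): 0.1,
    ("car", "automobile"): 0.2,
    ("car", "truck"): 0.6,
}


def toy_ssf(a, b):
    return toy_scores.get((a, b), 0.0)


toy_questions = [
    ("administration", "board", ["office", "attic", "board", "ministry"]),
    ("car", "automobile", ["truck", "automobile", "bus", "attic"]),
]
accuracy = synonymy_test_accuracy(toy_ssf, toy_questions)
print(f"{accuracy:.2%}")
```

The same function serves both tests; only the way the answer sets are generated differs.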
Features                 W         E         E_A
A                        88.65%    51.51%    50.97%
Part                     78.79%    43.86%    37.94%
NMg                      72.43%    44.56%    41.16%
Nc                       76.85%    47.01%    44.70%
Prep(acc)                35.14%    22.20%    20.39%
Prep(all)                50.21%    30.00%    28.33%
V(acc)                   75.36%    41.78%    40.17%
V(dat)                   48.64%    30.04%    26.25%
V(all)                   75.94%    42.04%    40.12%
A+NMg                    86.66%    52.20%    52.75%
A+NMg+Prep(all)          86.74%    51.20%    52.24%
A+NMg+Prep(all)+Part     87.40%    52.27%    52.62%
A+NMg+Part               87.29%    52.86%    53.31%
A+Nc+Part                90.92%    53.32%    52.55%
A+Nc+NMg                 88.65%    53.52%    54.25%
A+Nc+NMg+Part            88.57%    53.13%    54.25%

Table 1: Experiments with SSFs based on different constraints. Constraint names are defined in Section 4. W means WBST applied to frequent nouns (> 1000 occurrences). E means EWBST for frequent nouns and E_A means EWBST for all nouns.

The rank method can select an appropriate set of features for a tested noun, so the individual results of the subsequent matrices do not directly influence the results of the joint matrix. This can be seen in Table 1. For example, the individual result of the constraint based on modification by participles is only 43.86%, but when we add the Part matrix to the A+NMg matrix, which has a higher accuracy, the result goes up to 52.86%.

The results of the best SSFs are different in WBST+H and in EWBST. The best combination in WBST+H, namely A+Nc+Part, achieves a relatively low result in EWBST. The coordination with other nouns is a good factor to identify large semantic fields of related nouns, and in the WBST+H test it helps distinguish between a given noun and a completely unrelated noun. This is why the result is very good. In EWBST the situation is different. We test the ability to distinguish among semantically related nouns.

The modification by a noun in genitive is a medium-quality feature when taken alone (only 44.56% in EWBST), probably because this modification is polysemous (and vague as well); the constraint also overgenerates: it often signals nonexistent associations. There are no morpho-syntactic constraints for this type of modification.
No agreement is required, and without full parsing we can only rely on a very vague syntactic requirement of adjacency of the two words, like this:

    and(
      rlook(1,5,$a,
        and(in(pos[$a],{subst,ger,depr}),
            equal(cas[$a],{gen}),
            inter(base[$a],{"variable-n"}))
      ),
      only(1,$a,$ad,
        or(in(pos[$ad],{adv,qub,pcon,pant}),
           and(in(pos[$ad],{subst,ger,depr}),
               equal(cas[$ad],{gen})),
           and(in(pos[$ad],{adj,pact,ppas,num,numcol}),
               agrpp(0,$ad,{nmb,gnd,cas},3))
        ))
    )

agrpp is the operator of morpho-syntactic agreement on the selected attributes. In spite of the errors introduced by the constraint, the feature NMg delivers an additional source of properties expressed by the meaning of a noun, and a combination of the NMg matrix with other matrices of properties, A+Nc+NMg+Part, results in the best score obtained in the EWBST test. This could not be observed with the former test, WBST+H.

6. Conclusions

We have proposed an extension of the WordNet-Based Similarity Test, which appears to be more discerning. Our research goal is the application of SSFs in the automatic construction of wordnet synsets. The operation of the proposed EWBST brings us closer to that goal. The EWBST allows us to observe the ability of a SSF to make fine-grained distinctions between semantically related lexical units. Its results can be easily interpreted. The test can be generated on a large scale, depending only on the size of the underlying wordnet. The EWBST is challenging for people and considerably difficult for SSFs. It leaves more room for improvement and behaves in the same way for frequent and infrequent nouns.

The drawback of EWBST is its dependency on the existence of a semantic similarity measure generated on the basis of manually created data (in its present version, on the existence of a hypernymy hierarchy). In many, if not all, wordnets, the hypernymy hierarchy is rich only for nouns. The EWBST could work well for verbs or adjectives, but first a different similarity function would have to be proposed for the generation of the answer sets in tests.

Acknowledgement. Work financed by the Polish Ministry of Education and Science, project No. 3 T11C 018 29.

7. References

Agirre, Eneko and Philip Edmonds (eds.), 2006. Word Sense Disambiguation: Algorithms and Applications. Text, Speech and Language Technology. Springer.
Budanitsky, Alexander and Graeme Hirst, 2006. Evaluating wordnet-based measures of semantic distance. Computational Linguistics, 32(1):13-47.
Derwojedowa, Magdalena, Maciej Piasecki, Stanisław Szpakowicz, and Magdalena Zawisławska, 2007a. plWordNet: the Polish wordnet. Online access to the database of plWordNet: www.plwordnet.pwr.wroc.pl.
Derwojedowa, Magdalena, Maciej Piasecki, Stanisław Szpakowicz, and Magdalena Zawisławska, 2007b. Polish WordNet on a shoestring. In Proceedings of Biannual Conference of the Society for Computational Linguistics and Language Technology, Tübingen, April 11-13 2007. Universität Tübingen.
Fellbaum, Christiane (ed.), 1998. WordNet: An Electronic Lexical Database. The MIT Press.
Freitag, Dayne, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, and Zhiqiang Wang, 2005. New experiments in distributional representations of synonymy. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005). Ann Arbor, Michigan: Association for Computational Linguistics.
Grefenstette, G., 1993. Evaluation techniques for automatic semantic extraction: Comparing syntactic and window based approaches.
In Proceedings of The Workshop on Acquisition of Lexical Knowledge from Text, Columbus, SIGLEX 93. ACL.
Landauer, T. and S. Dumais, 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition. Psychological Review, 104(2):211-240.
Lin, Dekang, 1998a. Automatic retrieval and clustering of similar words. In COLING 1998. ACL.
Lin, Dekang, 1998b. An information-theoretic definition of similarity. In Proceedings of 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.
Manning, Christopher D. and Hinrich Schütze, 2001. Foundations of Statistical Natural Language Processing. The MIT Press.
Piasecki, Maciej, 2006. Handmade and automatic rules for Polish tagger. Lecture Notes in Artificial Intelligence. Springer.
Piasecki, Maciej and Bartosz Broda, 2007. Semantic similarity measure of Polish nouns based on linguistic features. In Witold Abramowicz (ed.), Business Information Systems, 10th International Conference, BIS 2007, Poznan, Poland, April 25-27, 2007, Proceedings, volume 4439 of Lecture Notes in Computer Science. Springer.
Piasecki, Maciej, Stanisław Szpakowicz, and Bartosz Broda, 2006. Automatic selection of heterogeneous syntactic features in semantic similarity of Polish nouns. In Proceedings of the Text, Speech and Dialog 2007 Conference.
Przepiórkowski, Adam, 2004. The IPI PAN Corpus: Preliminary Version. Institute of Computer Science PAS.
Rubenstein, H. and J. B. Goodenough, 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.
Turney, P.T., 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning. Berlin: Springer-Verlag.
Turney, P.T., M.L. Littman, J. Bigham, and V. Shnayder, 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings International Conference on Recent Advances in Natural Language Processing (RANLP-03). Borovets, Bulgaria.
Weeds, Julie and David Weir, 2005. Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4):439-475.
Zesch, Torsten and Iryna Gurevych, 2006. Automatically creating datasets for measures of semantic relatedness. In Proceedings of the Workshop on Linguistic Distances. Sydney, Australia: Association for Computational Linguistics.