Concept-Instance Relation Extraction from Simple Noun Sequences Using a Full-Text Search Engine


Concept-Instance Relation Extraction from Simple Noun Sequences Using a Full-Text Search Engine

Asuka Sumida 1, Kentaro Torisawa 1, and Keiji Shinzato 2
1 Graduate School of Information Science, Japan Advanced Institute of Science and Technology
2 Graduate School of Informatics, Kyoto University

Abstract. This paper describes a simple method for acquiring concept-instance relations from simple noun sequences that frequently appear in Japanese Web documents. In Japanese, many noun sequences can consist of two NPs that have a concept-instance relation. This phenomenon is similar to apposition in English but differs in that many of these noun sequences do not provide any explicit clues, such as the proper-noun capitalization or commas used in English apposition, that indicate the boundary between the concept name and the instance name. We developed a method to detect such implicit boundaries between concept names and instance names, and to filter out erroneous concept-instance relations by using a search engine.

1 Introduction

Lexico-syntactic patterns have been used to automatically acquire hyponymy relations or concept-instance relations from particular types of expressions [1-6]. Such relations are crucial in many NLP applications. For instance, Q&A systems require such relations to answer "Who/What is X?" questions [2] (e.g., "What is Rakuten Market?" - "It is a virtual shopping center in Japan."). However, the relations acquired by previous methods do not have sufficient coverage for practical applications such as Q&A, and a means of acquiring hyponymy relations and concept-instance relations from a wider range of expressions is still needed. The goal of this study has been to acquire a large number of concept-instance relations from simple noun sequences in Japanese, to which existing pattern-based methods cannot be applied. In Japanese, many noun sequences can consist of two NPs that have a concept-instance relation.
There are two types of such noun sequences, as shown in Figure 1. The first type consists of sequences in which the boundary between the concept name and the instance name is explicitly marked by quotation marks. Example (A) in Figure 1 shows cases where the concept-instance boundary is specified by the Japanese quotation marks (「 and 」). The second type consists of sequences in which no evident clue indicates the boundary; Example (B) shows such sequences. Note that simple pattern-based methods can be applied to the first type by including the quotation marks in the patterns, but they cannot be applied to the second type. This also means there is no evident way to distinguish the noun sequences representing concept-instance relations from others. In addition, unlike English, Japanese does not have a system of capitalization of proper nouns. This makes it difficult to detect the concept-instance boundary.

Fig. 1. Examples of Japanese noun sequences: (A) movie 「Monk Story」 (concept name: movie; instance name: Monk Story); (B) hotel Maruei (concept name: hotel; instance name: Maruei)

Our objective, therefore, is to acquire concept-instance relations from noun sequences which have no quotation marks. To achieve this, we use knowledge obtained from the concept-instance relations acquired from sequences that do include quotation marks. We also utilize a full-text search engine built on a large Web repository. Another important point is that although we designed our method for Japanese, we have also found similar noun sequences in languages such as Chinese and believe that our method can be extended to handle those languages. In addition, as a byproduct of our work, we have acquired a large number of highly precise relations from the noun sequences with quotation marks.

A similar task has been addressed by Fleischman et al. for English [2]. Our method differs from theirs in the following ways. I) Their method relies on explicit clues such as commas and capitalization.
We do not use such clues, since a considerable number of concept-instance relations are expressed without them in Japanese and, in particular, Japanese does not have any orthographical system corresponding to capitalization in English. II) Although Fleischman's method relies heavily on the distinction between proper nouns and other nouns provided by a morphological analyzer, we do not rely on this distinction. This is because the distinction provided by Japanese morphological analyzers is highly unreliable, mainly because of the lack of capitalization. III) We restrict the concept names of relations through a certain condition, but they are not limited to particular semantically coherent domains as in the work of Fleischman et al. We do not use any domain-dependent knowledge, which was encoded in the feature sets for machine learning in their work.

2 Approach

We have taken a twofold approach. First, to distinguish noun sequences expressing concept-instance relations from other noun sequences, we use a set of concept names which appear frequently in the concept-instance relations acquired from noun sequences where the concept names and instance names are separated by quotation marks. We regard only noun sequences prefixed by such a concept name as candidates for sequences expressing concept-instance relations. This is based on our intuition that some NPs are likely to refer to concept names and that noun sequences prefixed by such NPs are likely to represent concept-instance relations. In addition, we manually checked the head nouns of frequently appearing concept names and removed any that were inappropriate. This increases precision. Second, we use a full-text search engine as a filter to judge whether extracted instance-name candidates are actually instance names. Since instance names are basically proper nouns, we can assume that a phrase meaning "which <instance name>" will not appear even in huge corpora. (We actually gave such a phrase as a search engine query and regarded an instance-name candidate that was not included in any documents as a proper instance name.) This prevents our procedure from producing inappropriate instance names. In addition, we use other statistics obtained from the search engine to further refine the resulting concept-instance relations.

In our experiments, our procedure produced 4,276 concept-instance relations (without duplications) from our test set of 6.5 GB of HTML documents. Although our method is quite simple and does not use any advanced learning methods such as Support Vector Machines, the precision of the extracted relations reached 83.5%. As preparation for this series of experiments, we extracted 28,083 candidate concept-instance relations (without duplications) from noun sequences with quotation marks in our development set (13.1 GB of HTML documents); we obtained 11,154 concept names from the extracted relations and used these names to acquire relations from noun sequences without quotation marks. Note that we also extracted concept-instance relations from the noun sequences with quotation marks that were found in the same test set (i.e., the 6.5 GB of HTML documents) for the same concept names.
In this case, the number of relations was about 16,000, about four times as many as our method extracted, and the precision was also slightly higher than with our method. However, we did not find any relations common to the two methods, so we think that the noun sequences without quotation marks can be used as a knowledge source to augment the relations extracted from other knowledge sources. In addition, although a larger number of concept names might be beneficial, the number is limited by the cost of manually checking concept names and by data sparseness. If we can use a larger Web repository and dedicate more time to checking concept names, this limitation will be considerably relaxed.

3 Method

Our procedure for concept-instance relation acquisition consists of two steps:

Step 1 Acquire concept names from the noun sequences with quotation marks.
Step 2 Acquire concept-instance relations from the noun sequences without quotation marks using a full-text search engine.

Each step is described in the following.

3.1 Step 1: Acquisition of Concept Names from the Noun Sequences with Quotation Marks

The purpose of this step is to acquire a set of concept names from the noun sequences with explicit clues (i.e., quotation marks), to use in the concept-instance acquisition from sequences without quotation marks. We basically collect a large number of concept-instance relation candidates by using the pattern with the quotation marks. The concept-name candidates acquired using the pattern are classified according to their head nouns. We then check whether some concept-instance relation candidates including each head noun are proper ones. If the checked candidates include a relatively small number of errors, we regard all the concept names including that head noun as proper ones, and these are used in Step 2.

As shown in Figure 1 (A), the noun sequences we use in this step have the form

  N_1 N_2 ... N_{i-1} 「 N_i N_{i+1} ... N_m 」   (1)

The prefix N_1 N_2 ... N_{i-1} before the opening quotation mark becomes a concept-name candidate, and the sequence N_i N_{i+1} ... N_m inside the quotation marks becomes an instance-name candidate. Since Japanese does not place spaces between words, we first need to tokenize a string into a word sequence by using a morphological analyzer (we used MeCab, which is available from http://chasen.naist.jp/~taku/software/mecab/). Second, we extract noun sequences that match the above pattern and obtain a set of concept-instance relation candidates. Note that we assume that all the words labeled by the morphological analyzer as unknown words are nouns. In addition, we have found some subclasses of nouns, specified in the output of the morphological analyzer, that are problematic in the sense that sequences including them are less likely to represent concept-instance relations. We exclude these sequences according to the rules concerning the subclasses shown in Figure 2.

Fig. 2. Rules for subclasses of parts-of-speech for filtering in Step 1:
1. If an instance name is a single-word expression and the part-of-speech subclass of the word is noun suffix, functional noun, number, or pronoun, remove the relation from the candidate set.
2. If a concept name is a sequence consisting only of the subclasses noun suffix or number, remove the relation from the candidate set.
3. If a concept name includes the subclass district/area name, adverbial noun, adjectival noun, or proper noun, remove the relation from the candidate set.
Note that, in MeCab output, the noun class includes the subclasses noun suffix, functional noun, number, pronoun, district/area name, adverbial noun, adjectival noun, and proper noun.
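As a concrete illustration of the quote-bounded pattern and the Figure 2 rules, here is a minimal Python sketch. It is not the authors' code: the token format (surface, subclass) and the English subclass names are assumptions standing in for MeCab's actual output.

```python
# Subclasses that disqualify a candidate, following the rules in Figure 2.
BAD_INSTANCE_SINGLE = {"noun suffix", "functional noun", "number", "pronoun"}
BAD_CONCEPT_ALL = {"noun suffix", "number"}
BAD_CONCEPT_ANY = {"district/area name", "adverbial noun",
                   "adjectival noun", "proper noun"}

def extract_candidate(tokens):
    """tokens: list of (surface, subclass) pairs for one noun sequence,
    with 「 and 」 as quote tokens. Returns (concept, instance) or None."""
    surfaces = [s for s, _ in tokens]
    if "「" not in surfaces or "」" not in surfaces:
        return None
    i, j = surfaces.index("「"), surfaces.index("」")
    # Pattern (1): prefix, opening quote, instance, closing quote at the end.
    if not (0 < i < j == len(tokens) - 1):
        return None
    concept, instance = tokens[:i], tokens[i + 1:j]
    if not concept or not instance:
        return None
    # Rule 1: single-word instance of a disqualified subclass.
    if len(instance) == 1 and instance[0][1] in BAD_INSTANCE_SINGLE:
        return None
    # Rule 2: concept made up entirely of suffixes or numbers.
    if all(sub in BAD_CONCEPT_ALL for _, sub in concept):
        return None
    # Rule 3: concept containing any disqualified subclass.
    if any(sub in BAD_CONCEPT_ANY for _, sub in concept):
        return None
    return ("".join(s for s, _ in concept), "".join(s for s, _ in instance))
```

For example, the tokenized sequence hotel 「 Maruei 」 yields the pair ("hotel", "Maruei"), while a single-word instance tagged as a pronoun is discarded by Rule 1.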

Through the above automatic procedure, we extracted 69,665 concept-instance relation candidates from the 13.1 GB of HTML documents downloaded from the WWW. The precision of the relations was 58% (for 100 randomly selected samples). As a next step, for each head noun of the concept names in the extracted relations (i.e., the final noun N_{i-1} in the above pattern), we counted the variations of the concept-instance relations that included the head noun in the head-word position of the concept names, and sorted the head nouns according to the number of variations including them. We found that the top 409 head nouns covered about 60% of the relation candidates, and those head nouns appeared in the head-word positions of 13,381 distinct concept names. Then, for each head noun, we randomly picked out five concept-instance relations including the noun and checked whether the relations were valid. We excluded the head nouns that had more than three erroneous candidates and obtained 301 head nouns. We also removed three useless concept names, (name), (thing), and (name), from the list of the concept names including the head nouns. As a result, the remaining head nouns covered 11,154 concept names. Examples of the head nouns are listed in Figure 3. We use these concept names in Step 2.

When we limited the relation candidates to those including the concept names obtained after the manual check, the number of candidates was 28,135 and the precision of the relations rose to 89% (for 100 randomly selected candidates). This means that the manual filtering of concept-name head nouns improved the precision of the extracted relations and that we could acquire a large number of highly precise concept-instance relations from the noun sequences with quotation marks.

Note that hyponymy relation acquisition from noun sequences with quotation marks for Japanese has already been addressed by Imasumi [6].
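The head-noun counting and sorting described above can be sketched in a few lines. This is an illustrative reconstruction: the head_of function, which stands in for extracting the final noun of a Japanese compound, and the toy data are assumptions.

```python
from collections import Counter

def rank_head_nouns(relations, head_of):
    """relations: iterable of (concept, instance) pairs (duplicates removed);
    head_of: maps a concept name to its head noun (in Japanese, typically
    the final noun of the compound). Returns head nouns sorted by the
    number of distinct relations in which they appear."""
    variations = Counter(head_of(c) for c, _ in set(relations))
    return [h for h, _ in variations.most_common()]

# Toy usage (hypothetical data): "hotel" heads two distinct relations,
# "movie" heads one, so "hotel" ranks first.
rels = [("city hotel", "Maruei"), ("resort hotel", "Okura"),
        ("movie", "Monk Story")]
ranked = rank_head_nouns(rels, head_of=lambda c: c.split()[-1])
```

In the paper, the top-ranked head nouns were then checked by hand (five sampled relations each) before their concept names were admitted to Step 2.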
(Note that Imasumi's hyponymy relations include both concept-instance relations and hyponymy relations.) He reported more than 90% precision without the manual checking of hypernyms that we performed. This precision is much higher than that of our method. One possible reason is that he acquired both hyponymy relations and concept-instance relations, while we acquired only concept-instance relations. However, we could not obtain such high accuracy even after we relaxed the relations to be acquired to the union of hyponymy relations and concept-instance relations. We now think this higher precision was due to the corpus used: he used newspaper articles, which were written by trained writers who as a group show relatively little diversity compared with the authors of the HTML documents we used.

Fig. 3. Examples of head words in concept names (English glosses): food stuffs, flower, functionality, symposium, club, ballet, movie, shop/restaurant, software, weapon

3.2 Step 2: Acquisition of Concept-Instance Relations from the Noun Sequences without Quotation Marks

Step 2 proceeds as follows.

Step 2.A Extract noun sequences from WWW documents.
Step 2.B Filter the sequences and determine the boundary between concept names and instance names by using the concept names obtained in Step 1.
Step 2.C Filter the sequences by using parts of speech.
Step 2.D Filter by using a full-text search engine and the length of concept names.

In Step 2.A, we simply extract sequences of nouns and unknown words from WWW documents. In Step 2.B, we pick out only the noun sequences that are prefixed by one of the concept names obtained in Step 1, and assume that the end of the concept name is the boundary between the concept name and an instance name. Note that instance names are limited to those consisting of one or two words at this time; this is for the sake of simplicity in this discussion. The Step 2.C procedure filters out noun sequences according to rules concerning subclasses of parts of speech. These rules were developed during our experiments using our development set and are listed in Figure 4. The output of Step 2.C is a set of noun sequences that have the following form, where C_j and I_j denote nouns:

  C_1 C_2 ... C_i  I_1 [I_2]   (2)

Here C_1 C_2 ... C_i is the concept-name candidate and I_1 [I_2] (the second word being optional) is the instance-name candidate.

Step 2.D is the key part of the Step 2 procedure. We use hit counts obtained by a full-text search engine on a WWW repository and the length of a concept name according to the following four heuristics. Here, hit(query) denotes the hit count of the string query obtained by the search engine.

Heuristic 1 Attach the Japanese word "dono", which corresponds to the interrogative adjective "which" in English, to the instance-name candidate and give the result to the search engine as a query. If hit("dono" + instance name) > θ, remove the sequence from the candidates. This is based on our intuition that instance names behave as proper nouns and the phrase "which <instance name>" will not appear frequently in the WWW repository. If the hit count is large, the instance name is likely to actually be a generic concept.

Fig. 4. Rules for subclasses of parts-of-speech for filtering in Step 2.C:
1. If an instance name contains a verbal noun or noun suffix, remove the relation from the candidate set.
2. If the first word of an instance name is labeled as noun suffix (generic), remove the relation from the candidate set.

Heuristic 2 Assume that a noun sequence consists of a concept name denoted by C_1 C_2 ... C_i and a two-word instance name denoted by I_1 I_2. If hit(C_1 ... C_i I_1) / hit(C_1 ... C_i I_1 I_2) > σ, remove the sequence from the candidates. This heuristic checks whether the sequence I_1 I_2 is a proper compound noun, as explained later. If the instance name consists of a single word, do nothing.

Heuristic 3 If hit(instance name) > ξ, remove the sequence from the candidates. This is based on our observation that if an instance name is frequently observed in the Web repository, it is likely to be a generic concept.

Heuristic 4 If the concept name is a single-character word, remove the sequence from the candidates. This is based on our observation that the morphological analyzer sometimes fails to recognize a long proper noun and decomposes it into a sequence of single-character words, and such words sometimes coincide with concept names.

Note that these heuristics are applied to the candidate set sequentially in the above order and that the parameters were set as θ = 0, σ = 1.5, and ξ = 2 × 10^4 according to our observations in experiments using our development set. (Note that we used a search engine on a 0.7 TB WWW repository.)

Heuristic 2 needs a more detailed explanation. Figure 5 shows an example involving the Japanese translation of "opera house orchestra". Assume the following: "opera" is among the concept names obtained in Step 1; "house orchestra" is not a frequently appearing compound noun (though this might not be the case in English); and "which house orchestra" does not appear in the WWW repository. In this situation, "house orchestra" is likely to be regarded as a proper instance name of the concept "opera". Since "which house orchestra" has a hit count of 0, it is not eliminated by Heuristic 1. Heuristic 3 allows the sequence to remain in the candidate set since "house orchestra" does not appear frequently. Heuristic 4 does not eliminate the sequence because the concept name "opera" is a two-character word in Japanese. This unrealistic analysis is caused by the assumption that the boundary between the two NPs lies between "opera" and "house", while it is actually between "house" and "orchestra". More precisely, Heuristic 1 fails to regard the instance name as a generic concept because of this wrong assumption, since it queries the phrase consisting of "which" and the rarely appearing compound noun "house orchestra". Heuristic 2 is used to avoid detecting such wrong NP boundaries. Usually "opera house orchestra" is interpreted as "orchestra of an opera house", and the boundary between the two NPs should be between "opera house" and "orchestra". Heuristic 2 detects this true boundary by obtaining the hit count hit(C_1 ... C_i I_1) = hit("opera house"). If this figure is large enough compared with the hit count of the whole sequence hit(C_1 ... C_i I_1 I_2) = hit("opera house orchestra") (i.e., if hit(C_1 ... C_i I_1) / hit(C_1 ... C_i I_1 I_2) is large enough), "opera house" often appears as an independent and acceptable compound noun. This suggests the boundary is between "house" and "orchestra", not between "opera" and "house". The final output of our procedure is the set of concept-instance relations remaining in the candidate set after the heuristics are applied.

Fig. 5. An example of an erroneously decomposed sequence: "opera | house orchestra" instead of "opera house | orchestra"

4 Experiments

We first established a search engine on a 0.7 TB WWW repository obtained by downloading documents from the WWW. Then we applied our concept-instance relation acquisition method to 6.5 GB of unseen HTML documents randomly selected from the repository. Table 1 shows A) the number of noun sequences produced at each step, B) the precision of the extracted concept-instance relations, measured on 100 samples randomly picked from the set of noun sequences produced by each step (as mentioned later, the precision after Step 2.D was measured on 200 samples), and C) the expected number of correct concept-instance relations estimated from the precision and the number of extracted relations. Note that all the numbers were obtained after removing duplicates from the candidate sets.

Table 1. Results obtained after each step and heuristic

  Step           # of rels.   precision (%)   estimated # of correct rels.
  Step 2.A       4,107,460    -               -
  Step 2.B       117,443      10              11,744
  Step 2.C       58,322       19              11,081
  Step 2.D       4,276        83.5            3,570
   Heuristic 1   14,418       54              7,785
   Heuristic 2   6,140        66.0            4,052
   Heuristic 3   5,337        75.5            4,029
   Heuristic 4   4,276        83.5            3,570
  Final Results  4,276        83.5            3,570
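The sequential filtering of Step 2.D (Heuristics 1-4 above) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: hit stands in for the search-engine hit-count query, strings are concatenated without spaces as in Japanese, and the zero-hit-count guard in Heuristic 2 is our addition rather than something stated in the paper.

```python
THETA, SIGMA, XI = 0, 1.5, 2 * 10**4  # thresholds reported in the paper

def passes_step2d(concept, instance_words, hit):
    """concept: concept-name string; instance_words: list of 1 or 2 words;
    hit: function returning the hit count of a phrase over the Web
    repository. Returns True if the candidate survives all heuristics."""
    instance = "".join(instance_words)
    # Heuristic 1: a true instance name should not follow the
    # interrogative adjective "dono" ("which") in the repository.
    if hit("dono" + instance) > THETA:
        return False
    # Heuristic 2: for two-word instances, check that I1 I2 is a proper
    # compound; if C...I1 alone is far more frequent, the assumed
    # concept-instance boundary is in the wrong place.
    if len(instance_words) == 2:
        whole = hit(concept + instance)
        if whole == 0 or hit(concept + instance_words[0]) / whole > SIGMA:
            return False  # zero-hit guard is our assumption
    # Heuristic 3: very frequent strings are generic concepts.
    if hit(instance) > XI:
        return False
    # Heuristic 4: single-character concept names usually come from
    # mis-segmented long proper nouns.
    if len(concept) == 1:
        return False
    return True
```

With a mock hit function, the "opera | house orchestra" example of Figure 5 is rejected by Heuristic 2 because hit("opera house") dwarfs hit("opera house orchestra").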
While we would have liked to also measure the recall of the output, this would be difficult because of the large number of candidates. As a substitute for a recall measure, we used the expected number of correct concept-instance relations. As can easily be seen, each step improves the precision. The table also shows the results after each heuristic was applied in Step 2.D; these results show that each heuristic also contributed to higher precision. (The precision for the heuristics was measured on 200 randomly picked samples.) As the final output, we obtained 4,276 concept-instance relations, and their precision was 83.5%. The relations included 685 concept names. Examples of recognized concept-instance relations are shown in Figure 6, where the concept-instance boundaries detected by our procedure are marked by the symbol /. (We used freyasx as the search engine software, which is available from http://www.delegate.org/freyasx/index.shtml.)

Fig. 6. Examples of noun sequences and acquired concept-instance relations (English glosses; * indicates an incorrectly recognized relation):
  corporation / Senba
  sightseeing place / Sodoumon
  novel / Yoshida school
  * prelude tune / G-Major (G-Major itself is not a prelude)
  * CD / Romrom (CD Romrom is the title of a TV game)

Table 2 shows the results of applying concept-instance relation acquisition from noun sequences with quotation marks (i.e., the same procedure as Step 1 of our method) to the same document set. The table shows the performance before and after we restricted the concept names to the set of concept names used in our procedure. (All the numbers were obtained after removing duplicates.) The final results of this method included 6,056 concept names. These results were in fact better than those of our method: the difference in precision after restricting the concepts is not large, but the number of relations obtained by our method was just 28% of that acquired from the sequences with quotation marks. However, we could not find any items common to the relations extracted from the sequences with quotation marks and those obtained by our method.

Table 2. Results obtained from noun sequences with quotation marks

  Step                        # of rels.   precision (%)   estimated # of correct rels.
  Before concept restriction  38,987       55              21,442
  After concept restriction   15,950       89              14,195
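The "estimated # of correct rels." column in Tables 1 and 2 is simply the relation count multiplied by the sampled precision. A quick check against the table rows, using a hypothetical helper:

```python
def estimated_correct(n_relations, precision_pct):
    """Expected number of correct relations: count times sampled precision."""
    return round(n_relations * precision_pct / 100)

# Reproduces rows of Table 1: Step 2.D (4,276 at 83.5%) and Step 2.B
# (117,443 at 10%) give 3,570 and 11,744, matching the table.
final_estimate = estimated_correct(4276, 83.5)
```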
Thus, our method found a significant number of relations that could not be acquired from sequences with quotation marks, so it can be used to augment the relations obtained by using the quotation marks and to help deal with an expected shortage of relations in real-world applications.

Finally, we evaluated other patterns for Japanese used in existing research on hyponymy relation acquisition [5, 6] in the context of concept-instance relation acquisition. The patterns are listed in Figure 7. We applied the patterns to 4.4 GB of HTML documents and obtained 147,056 concept-instance relations. The precision was 14.5% (measured on 200 random samples). Since we did not use any optimization techniques, the precision was low. If we scale up the document size to the size of our test set, 6.5 GB, the estimated number of correct relations is about 31,500, and each pattern acquires around 3,937 correct relations on average. This means that our method can obtain almost the same number of concept-instance relations as each of the other patterns, with a high accuracy of 83.5%, without any explicit clues on surface strings.
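The scaling argument above can be verified with a few lines of arithmetic, assuming (as the text does) that the number of correct relations scales linearly with document size and is divided evenly over the eight patterns of Figure 7:

```python
relations, precision = 147_056, 0.145    # from the 4.4 GB pattern experiment
correct_at_4_4gb = relations * precision          # about 21,300 correct
scaled_to_6_5gb = correct_at_4_4gb * 6.5 / 4.4    # about 31,500, as stated
per_pattern = scaled_to_6_5gb / 8                 # about 3,937 per pattern
```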

Fig. 7. Other patterns for hyponymy relation acquisition (English glosses of the eight Japanese patterns): "NP such as NP" (three patterns), "NP similar to NP", "NP other than NP", "NP called NP" (three patterns)

5 Conclusion and Future Work

We have described a method for extracting concept-instance relations from the simple noun sequences frequently found in corpora. Our method found 4,276 concept-instance relations from 6.5 GB of HTML documents with 83.5% precision. The method is based on our intuitions that some NPs are likely to refer to concept names and that noun sequences prefixed by such NPs are likely to represent concept-instance relations. In addition, our method uses a search engine as a filter to remove erroneously extracted concept-instance relations. We plan to apply our method to a larger WWW repository and to use the resulting concept-instance relations in practical applications such as Q&A, IR, and IE. Introducing a compound-noun analysis technique [7] into our method will also be an important research direction. Furthermore, another direction will be to extend the concept-instance relations obtained by our method by using itemizations or lists in HTML documents, which some previous works used to extract concept-instance relations [8, 9].

References

1. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th International Conference on Computational Linguistics (COLING 1992). (1992) 539-545
2. Fleischman, M., Hovy, E., Echihabi, A.: Offline strategies for online question answering: Answering questions before they are asked. In: ACL 2003. (2003) 1-7
3. Caraballo, S.A.: Automatic construction of a hypernym-labeled noun hierarchy from text. In: ACL 1999. (1999) 120-126
4. Pantel, P., Ravichandran, D., Hovy, E.: Towards terascale knowledge acquisition. In: COLING 2004. (2004) 771-777
5. Ando, M., Sekine, S., Ishizaki, S.: Automatic extraction of hyponyms from newspapers using lexico-syntactic patterns. In: IPSJ SIG Technical Report 2003-NL-157. (2003) 77-82
6. Imasumi, K.: Automatic acquisition of hyponymy relations from coordinated noun phrases and appositions. Master's thesis, Kyushu Institute of Technology (2001)
7. Kobayasi, Y., Tokunaga, T., Tanaka, H.: Analysis of Japanese compound nouns using collocational information. In: COLING 1994. (1994) 865-869
8. Shinzato, K., Torisawa, K.: Acquiring hyponymy relations from Web documents. In: Proceedings of HLT-NAACL 2004. (2004) 73-80
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the Web: an experimental study. Artificial Intelligence 165(1) (2005) 91-134