A Study of Relation Annotation in Business Environments Using Web Mining


Qi Li, School of Information Science, University of Pittsburgh, qili@sis.pitt.edu
Daqing He, School of Information Science, University of Pittsburgh, daqing@sis.pitt.edu
Ming Mao, SAP North America Lab, ming.mao@sap.com

Abstract

Relation annotation (RA) is the process of marking up relations among a set of entities identified in plain text. RA is important to enterprise applications because of its capability to reveal semantics in business environments. However, RA in business environments differs from RA in the news domain: the entities involved in business relations often do not refer only to entities like People or Locations, and many business entities still cannot be identified by existing entity identification tools. In this paper, we explore RA in business environments using web mining techniques and propose the Relation Annotation Platform in Business Environments (RAPBE), which can automatically help information workers by annotating business relations in enterprise settings. We evaluated RAPBE on two sample relations that are common in the business domain, COMPANY-LOCATION and COMPANY-PRODUCT. Our experimental results demonstrate the usefulness of RAPBE for relation annotation. They also show that the best method for marking up relations whose entities are identifiable by existing entity identification tools is the Frequency Weight method, whereas Distance Weight is best when some entities involved in RA cannot be identified by information extraction tools.

1. Introduction

Today, every business person may have to access an overwhelming amount of potentially relevant information that is continuously produced in various media with varying interpretations. It is impossible for an information worker to manually discover and synthesize all the available information [1].
As pointed out by Mao et al. [1], sense-making focuses on making sense of ambiguous contexts and on continuously making the found knowledge more precise by disambiguating the context. Effective analysis tools are needed to find the key entities and their relations in the sense-making task. Currently, most relation identification work focuses on relations like is-a or part-of, which express connections between entities in a hierarchical structure. Relations in business environments, however, are usually not limited to hierarchical relations. For example, in customer relationship management it is important to capture the relations between a company and its products, and between a product and its customer reviews. In marketing and business intelligence, it is important to identify relations over extracted entities (e.g., People-In-Organization-In-Some-Place or New-Product-With-Some-Company). To support these goals, we propose a process called relation annotation (RA) and an associated system that automatically annotates non-hierarchical relations predefined by requests from business environments.

Relation annotation (RA) is the process of creating a markup of relations among entities in plain text. For example, as shown in Figure 1, the two entities Google and Mountain View can be annotated with the relationship Company-Based-in-Location (short: COMPANY-LOCATION) according to the ontology illustrated in Figure 1. Our approach to RA uses patterns, i.e., rules that are predefined or learned from a large corpus. The rules then guide the system in annotating relations among the extracted entities. The general idea of our approach was initially developed in the news domain, but our approach is novel in the following respects. First, methods in the news domain concentrate on entities such as people, locations, organizations, and times.
Although these entities and their relationships are important in business environments, other entities, such as products and competitors, are critical business information too. Second, most existing entity identification tools are designed for the news domain, and they are trained and tested on entities like people or locations. It is not clear whether, or how well, they can perform on entities like products or competitors.

Figure 1: Business Ontology & Instances

One possible approach to identifying relations in the business domain that involve non-identified entities is to first develop an entity identification tool and then perform relation identification. However, there are too many entities in business environments and too little labeled data to train such entity tools. We therefore fall back on more basic linguistic features and assume that noun phrases, which can be identified reliably with existing tools such as POS taggers or syntactic parsers, are possible entity candidates. The patterns we construct for the business domain then help us filter out non-entity noun phrases and unrelated entities, and classify the relations between the remaining entities.

In this paper, we choose relations that represent two different scenarios for studying our method. One type of relation involves entities that are identifiable by existing entity identification tools; this shows the connection between relation annotation in the business domain and relation extraction in the news domain. The other type of relation contains entities for which noun phrases must be used as the starting point; this shows how our relation annotation differs from news-domain relation extraction. Our experiments examine both scenarios.

Because not much training data is available, we employ a bootstrap approach, which uses a limited number of samples to start the generation and then learns more patterns along the way. With better patterns, more samples can be created for better pattern generation. Many existing pattern generation methods use the local context of the known entities as their main source of information. The patterns generated this way, however, can miss critical global syntactic clues at the sentence level. Therefore, our pattern generation method utilizes sentence-level syntactic information, i.e., Subject-Verb-Object (SVO) structure. SVO has been studied before in information extraction [2], in semantic navigation to represent the semantic structure of a sentence [3], and in many other areas. In this paper, we develop five different methods of using SVO information for pattern generation.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the two-step bootstrap that forms the core idea of our approach. Section 4 presents the Relation Annotation Platform in Business Environments (RAPBE), which automatically annotates relations using web mining techniques. Section 5 describes the experiment design and analyzes the results, and Section 6 concludes.

2. Related Work

In the relation extraction literature, most researchers have so far focused on relations like is-a [4] and part-of [5]. Hearst used patterns to extract hyponym relations [4]. Later, Berland and Charniak extended Hearst's work to part-of relation extraction [5]. Girju et al. [6] combined machine learning algorithms and WordNet to learn and disambiguate generic part-of patterns for relation extraction. Although some work, such as Blohm and Cimiano [20], has begun to address non-hierarchical relations, it still relies on local context information for pattern extraction. Our relation annotation approach also tries to extract generic patterns, but uses Subject-Verb-Object (SVO) structure instead.

Bootstrap learning is an iterative approach that alternates between learning rules from a set of instances and learning instances from rules [7]. Hearst pioneered the use of bootstrapping for extracting hyponym relations based on patterns [4]: after manually building three lexico-syntactic patterns, Hearst used them to induce further patterns. Blum and Mitchell [8] used a bootstrap method for classifying web pages.
Riloff and Jones [9] used bootstrap learning on a small corpus to iteratively learn instances of large semantic classes and patterns that can generate more instances. Ravichandran and Hovy [10] used bootstrapping to find patterns surrounding seed values for question answering, starting from a training set of question-answer pairs. Etzioni et al. [11] and [12] used bootstrapping for entity extraction. Stevenson and Greenwood [2], in the task of pattern induction, considered inducing triple patterns of subject, verb, and object instead of local context. Following them, we use SVO patterns in the bootstrap for the relation annotation task.

3. Relation Annotation Using Bootstrap

Our relation annotation approach for business environments assumes that a knowledge base is available for obtaining existing information about certain relations, so its input for training an annotation model for a given relation is the set of predicates of that relation (e.g., Company-Locate-in-Country). The output of the training is the annotation model.

Based on our assumption of having knowledge bases (KBs) as the starting point for training models in RAPBE, we design a two-step bootstrap algorithm. The first step is seed generation from knowledge bases, i.e., generating relation instances (entity pairs) from knowledge bases. It takes advantage of the structure or schema of the KBs to extract high-quality seeds. Related work includes relation extraction [8], taxonomy extraction [9], and ontology extraction [10] from KBs such as WordNet and Wikipedia. The second step is pattern generation and ranking, which forms the basis of the trained model for RA. One key problem in this step is how to accurately identify the patterns that will effectively predict the relations in the later relation annotation task.

4. Relation Annotation Platform in Business Environments (RAPBE)

Our relation annotation platform is called RAPBE (Relation Annotation Platform in Business Environments). Figure 2 shows an example workflow of RAPBE.
Figure 2: The RAPBE relation annotation framework. Step 1: extract target relations and their instances from the KB as seeds; Step 2: generate queries according to the extracted relation seeds; Step 3: issue the queries to the Web and collect the search results; Step 4: generate patterns from the search results; Step 5: weight all the patterns for later RA; Step 6: perform RA.

4.1. Seed Generation

Seed generation in RAPBE relies on a knowledge base. The current RAPBE uses Wikipedia and its Infobox. Although Wikipedia contains rich knowledge with useful structural information, it still suffers from some problems as a KB for seed generation; for example, the Wikipedia Infobox (short: Infobox) still needs further schema cleaning [18]. In this stage, we first identify whether the relation has instances in the Infobox. The second step in our seed generation is to identify the entity instances that act as the attributes of these relation instances. If the attribute fields in the Infobox are not directly extractable, we mine the associated page in order to extract the product information.

4.2. Pattern Generation and Ranking

Based on the extracted seeds, RAPBE tries to infer patterns that cover them. RAPBE uses the Web as the corpus for generating patterns and uses a Web search engine (Yahoo! BOSS) to query the Web (Step 3 in Figure 2). The results returned by the search engine usually contain several pieces of information, such as the page title and a short summary of the page content (Step 4 in Figure 2). RAPBE uses subject-verb-object (SVO) structure to extract patterns from the search results: a pattern is identified by the key verb together with the two entities in the sentence. As Brin [17] points out, the quality of the later annotated relations correlates highly with the quality of the extracted patterns. After collecting all the patterns, we need to evaluate the relevance between the patterns and the relations. Ravichandran and Hovy [10] used a frequency threshold on the patterns to select the final patterns; however, low-frequency patterns can also be good patterns. Ranking the relevance between patterns and relation instances is therefore an important task here, and we propose five different weighting schemes to rank patterns (Step 5 in Figure 2).

4.2.1. Frequency Weight (FW). Frequency Weight (FW) assumes that the higher the frequency of a pattern on the Web, the better its quality [10]. Stemming is used to improve the coverage of the method. In RAPBE, FW is defined as in formula (1):

FW = |x, p, y|  (1)

where |x, p, y| denotes the co-occurrence frequency of term x, term y, and the pattern verb p within the same window; in this paper the window is the sentence.

4.2.2. Distance Weight (DW) and Verb Distance Weight (VDW). Distance Weight (DW) weights a pattern by the word distance between the two entities, as defined in formula (2). Verb Distance Weight (VDW) is a special case of DW that measures the distance between the verb and entity y (formula (3)).

4.2.3. Frequency-Distance Weight (FDW). Frequency-Distance Weight (FDW) combines the distance weight and the frequency weight, as defined in formula (4).

4.2.4. PMI. Pointwise Mutual Information (PMI) is a commonly used metric for measuring the connection between two events. We adopt PMI as a weight for pattern ranking and, at the same time, use it as a baseline for evaluating the weights above. The PMI weight is defined in formula (5):

PMI(p) = (1/n) * Sum_i [ pmi(x_i, p, y_i) / Max_pmi ]  (5)

where Max_pmi is the maximum pmi over all patterns and all instances, and pmi is defined in formula (6):

pmi(x, p, y) = |x, p, y| / (|x, y| * |p|)  (6)

where |x, p, y| is the frequency of the pattern p instantiated with terms x and y, |x, y| is the frequency with which terms x and y co-occur, and |p| is the frequency of the pattern verb p.

4.3. Relation Annotation

The output of RA in RAPBE consists of flat lists of annotated relation instance pairs; for example, for the COMPANY-LOCATION relation, the output will be pairs such as (Google, Menlo Park). Our assumption is that the system has a named entity tool that helps identify the two entities of a relation. RAPBE then annotates whether the two entities hold the relationship according to the patterns (Step 6 in Figure 2). During RA, entities are matched against the surface text of documents. One problem in such matching is co-reference; lacking a co-reference tool, RAPBE cannot handle it directly. To overcome this problem, we developed a matching strategy that relies on matching just one entity. Our approach is motivated by Yarowsky's one-sense-per-collocation observation in word sense disambiguation [19].
Therefore, in the experiments we also compare two matching strategies:

COM matching: match the verb pattern together with both entity x and entity y.

NON-COM matching: match the verb pattern with entity y only, and use the topic entity of the document as the default entity x.
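A minimal sketch of the five weighting schemes from Section 4.2 and the two matching strategies above. The inverse-distance form of DW/VDW and the log form of pmi are assumptions on my part (the source describes the quantities involved but the original formulas are not reproduced here), and the occurrence representation is invented for illustration.

```python
import math
from collections import Counter

def score_occurrences(occurrences):
    """Score pattern occurrences with the five weighting schemes.
    Each occurrence is a tuple (x, p, y, pos_x, pos_p, pos_y) recording
    the two entities, the pattern verb, and their token positions in a
    sentence. Inverse distances and log-pmi are assumed normalisations."""
    n_xpy = Counter((x, p, y) for x, p, y, _, _, _ in occurrences)
    n_xy = Counter((x, y) for x, _, y, _, _, _ in occurrences)
    n_p = Counter(p for _, p, _, _, _, _ in occurrences)
    scored = []
    for x, p, y, px, pp, py in occurrences:
        fw = n_xpy[(x, p, y)]                # (1) Frequency Weight
        dw = 1.0 / max(abs(px - py), 1)      # (2) Distance Weight: closer entities score higher
        vdw = 1.0 / max(abs(pp - py), 1)     # (3) Verb Distance Weight: verb close to entity y
        fdw = fw * dw                        # (4) Frequency-Distance Weight
        pmi = math.log(n_xpy[(x, p, y)] / (n_xy[(x, y)] * n_p[p]))  # (6)
        scored.append({"pattern": p, "FW": fw, "DW": dw,
                       "VDW": vdw, "FDW": fdw, "pmi": pmi})
    return scored

def com_match(tokens, verb, x, y):
    """COM matching: the verb pattern plus both entities must appear."""
    return verb in tokens and x in tokens and y in tokens

def non_com_match(tokens, verb, y, topic_entity):
    """NON-COM matching: only entity y and the verb must appear; the
    document's topic entity fills in for x, so no co-reference is needed."""
    return (topic_entity, y) if verb in tokens and y in tokens else None
```

For a sentence like "It is headquartered in Mountain_View" inside an article whose topic entity is Google, non_com_match still recovers the pair (Google, Mountain_View) even though Google appears only as a pronoun, which is exactly the case COM matching misses.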

5. Experiments

Two relations, COMPANY-LOCATION (C-L) and COMPANY-PRODUCT (C-P), were used as sample business-environment relations to evaluate the RA performance of RAPBE. The C-L relation represents relations with an identifiable entity (LOCATION), and the C-P relation represents relations with a non-identifiable entity (PRODUCT). The performance of the five weights (FW, DW, VDW, FDW, and PMI) and the two matching methods (COM and NON-COM) was evaluated.

5.1. Experimental Setup

The named entity extraction tool from the Inxight LinguistX Platform was used for entity identification, and Yahoo! Search BOSS was used for querying the Web. Wikipedia articles on twenty-five target companies, distributed across five industries (according to the Fortune 500, 2008), were chosen as the test sets for both C-L and C-P relation annotation. Thirty-one companies with their C-P relation pairs from the Infobox were extracted as seed C-P relation pairs for training, and the companies in the Nasdaq-100 index with their C-L relation pairs from the Infobox served as seeds for the C-L relation. The ground truth was manually marked up by two experts, and precision and recall were used for evaluation. Precision for RA is the fraction of correctly annotated relation pairs (e.g., C-L) among all pairs produced for the relation, while recall is the ratio of correctly labeled responses to the total that should have been labeled with the predefined relation (e.g., C-L).

5.2. COMPANY-LOCATION Experiment

Since the average number of C-L relations per document is about 3, only the top 5 locations are evaluated. For NON-COM matching, there is no significant difference between PMI and VDW, or between FDW and FW, in either precision or recall. DW is significantly better than VDW in both precision and recall. The recall of FDW is significantly better than that of DW, and the precision and recall of PMI are worse than those of the other four weights.
Similar experiments were conducted with COM matching, and a t-test showed no significant difference between COM and NON-COM matching for the five groups in either precision or recall. Therefore, FW and FDW are better than DW and VDW for relations with an identifiable entity, all four weights (FW, DW, VDW, and FDW) are better than PMI, and the matching method (COM vs. NON-COM) has no effect for relations with identifiable entities.

5.3. COMPANY-PRODUCT Experiment

For NON-COM matching, there is no significant difference between FW and FDW in precision or recall. VDW is significantly better than FDW in both precision and recall. DW is significantly better than VDW in precision but not in recall, and there is no significant difference between PMI and VDW. For COM matching, FDW is significantly better than FW; VDW and FDW show no significant difference; and PMI and DW show no significant difference either, but both are better than VDW. In precision, FW and FDW with COM matching are significantly better than with NON-COM matching, whereas for DW and VDW, NON-COM matching is significantly better than COM matching.

The difference between frequency weighting and distance weighting is that distance weighting considers the distance between the verb and the other entity, i.e., it includes sentence-level syntactic information, while frequency weighting considers only the frequency of the verb, with no syntactic information at all. Syntactic information is therefore very useful for relations with a non-identifiable entity. None of the five weighting methods improved recall except VDW. DW and VDW performed better than FW and FDW on relations with a non-identifiable entity, and PMI was comparable to FW and FDW. The matching strategy does affect the results for relations with a non-identifiable entity.
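The precision and recall used above can be computed as follows. The pairs in the example are hypothetical and for illustration only, not the paper's actual data.

```python
def precision_recall(predicted, gold):
    """Precision: correctly annotated pairs / all pairs produced.
    Recall: correctly annotated pairs / all pairs that should have
    been labeled with the relation."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical C-L pairs: two of the three system outputs are correct,
# and both gold pairs are found.
predicted = {("Google", "Menlo Park"), ("SAP", "Walldorf"), ("SAP", "Berlin")}
gold = {("Google", "Menlo Park"), ("SAP", "Walldorf")}
p, r = precision_recall(predicted, gold)   # p = 2/3, r = 1.0
```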
6. Conclusion and Discussion

This paper describes relation annotation in business environments and proposes the Relation Annotation Platform in Business Environments (RAPBE) for relation annotation using web mining techniques. The core of RAPBE is a two-step bootstrap: seed generation and pattern generation. Seed generation extracts clean seeds from a knowledge base, such as the Infobox from Wikipedia; our studies show that the Infobox is a good source for seed generation in relation annotation. Pattern generation produces the patterns used to build the models for the later relation annotation tasks. To find good-quality patterns, different weighting schemes (FW, FDW, DW, VDW, and PMI) were investigated as methods for ranking relation patterns.

Our experiments focused on two relations: COMPANY-PRODUCT (C-P), which represents relations with entities that are not identifiable using existing entity identification tools, and COMPANY-LOCATION (C-L), which represents relations with identifiable entities. For relations with non-identifiable entities (C-P), syntactic information is critical for RA, while for relations with identifiable entities (C-L), frequency is more important. NON-COM matching compensates for the absence of co-reference resolution. As the results show, NON-COM and COM matching exhibit no significant difference for relations with an identifiable entity, but NON-COM matching is better than COM matching under FW, VDW, and PMI for relations with a non-identifiable entity.

Although many experiments have been run on the RAPBE system, the precision and recall of RA are still not good enough for annotating relations with non-identifiable entities. One direction for future work is therefore to develop methods for filtering out irrelevant entities. Another is to extend the pattern extraction method in RAPBE to handle relations involving more than two entities.

7. Acknowledgements

We thank the SAP Continuous Sensemaking teams for their great support, especially Keith Klemba and Thomas Heinzel.

8. References

[1] Ming Mao, T. Heinzel, Keith Klemba, and Qi Li. A Sensemaking-based Information Foraging and Summarization System in Business Environments. Proceedings of EEE09.
[2] Mark Stevenson and Mark A. Greenwood. A Semantic Approach to IE Pattern Induction. ACL05.
[3] Robin Stewart, Gregory Scott, and Vladimir Zelevinsky. Idea Navigation: Structured Browsing for Unstructured Text. CHI 2008 Proceedings, April 2008.
[4] Hearst, M.
Automatic acquisition of hyponyms from large text corpora. COLING-92, pp. 539-545. Nantes, France, 1992.
[5] M. Berland and E. Charniak. Finding Parts in Very Large Corpora. ACL99, pp. 57-64. College Park, MD, 1999.
[6] Girju, R., Badulescu, A., & Moldovan, D. Automatic Discovery of Part-Whole Relations. Computational Linguistics, 83-135, 2006.
[7] Jones, R., McCallum, A. M., Nigam, K., & Riloff, E. Bootstrapping for Text Learning Tasks. IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, 1999.
[8] Avrim Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the 1998 Conference on Computational Learning Theory, 1998.
[9] Riloff, E., & Jones, R. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[10] D. Ravichandran and E. H. Hovy. Learning Surface Text Patterns for a Question Answering System. ACL02, Philadelphia, 2002.
[11] O. Etzioni, M. J. Cafarella, D. Downey, A. M. Popescu, et al. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1):91-134, 2005.
[15] Suchanek, F. M., Kasneci, G., & Weikum, G. Yago: A Core of Semantic Knowledge - Unifying WordNet and Wikipedia. WWW07.
[16] M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge. AAAI06, pp. 1400-1405, 2006.
[17] Brin, S. Extracting Patterns and Relations from the World Wide Web. Lecture Notes in Computer Science, 1999.
[18] Wu, F., & Weld, D. S. Automatically Refining the Wikipedia Infobox Ontology. WWW08. Beijing, China, 2008.
[19] Yarowsky, D. One Sense Per Collocation. Proceedings of the ARPA Human Language Technology Workshop,
1993.
[20] Sebastian Blohm and Philipp Cimiano. Scaling up Pattern Induction for Web Relation Extraction through Frequent Itemset Mining. Proceedings of the KI 2008 Workshop on Ontology-Based Information Extraction Systems, September 2008.