A Study of Relation Annotation in Business Environments Using Web Mining
Qi Li, School of Information Science, University of Pittsburgh
Daqing He, School of Information Science, University of Pittsburgh
Ming Mao, SAP North America Lab

Abstract

Relation annotation (RA) is the process of marking up relations among a set of entities identified in plain text. RA is important to enterprise applications because it can reveal semantics in business environments. However, RA in business environments differs from RA in the news domain: the entities involved in business relations often go beyond types such as People or Locations, and many business entities still cannot be identified by existing entity identification tools. In this paper, we explore RA in business environments using web mining techniques, and propose the Relation Annotation Platform in Business Environments (RAPBE), which automatically helps information workers by annotating business relations in enterprise settings. We evaluated RAPBE using two sample relations that are common in the business domain: COMPANY-LOCATION and COMPANY-PRODUCT. Our experimental results demonstrate the usefulness of RAPBE for relation annotation. They also show that the Frequency Weight method is the best for marking up relations whose entities are identifiable by existing entity identification tools, whereas Distance Weight is the best when some entities involved in RA cannot be identified by the information extraction tools.

1. Introduction

Today, every business person may have to access an overwhelming amount of potentially relevant information that is continuously produced in various media with varying interpretations. It is impossible for an information worker to manually discover and synthesize all the available information [1].
As pointed out by Mao et al. [1], sensemaking focuses on making sense of ambiguous contexts and on continuously refining the knowledge found by disambiguating the context. Effective analysis tools are needed to find the key entities and their relations in a sensemaking task. Currently, most relation identification work focuses on relations such as is-a or part-of, which express connections between entities in a hierarchical structure. However, relations in business environments are usually not limited to hierarchical ones. For example, in customer relationship management it is important to capture the relations between a company and its products, and between a product and its customer reviews. In marketing and business intelligence, it is important to identify relations based on extracted entities (e.g., People-In-Organization-In-Some-Place or New-Product-With-Some-Company). To support these goals, we propose a process called relation annotation (RA) and an associated system that automatically annotates non-hierarchical relations predefined according to the needs of business environments.

Relation annotation (RA) is the process of creating a markup of relations among entities in plain text. For example, as shown in Figure 1, two entities, Google and Mountain View, can be annotated with the relation Company-Base-on-Location (COMPANY-LOCATION for short) according to the ontology illustrated in Figure 1. Our approach to RA uses patterns, which are rules that are either predefined or learned from a large corpus. These rules then guide the system in annotating relations among the extracted entities.

The general idea of our approach was initially developed in the news domain. However, our approach is novel in the following respects. First, methods in the news domain concentrate on entities such as people, locations, organizations, and times.
Although these entities and their relationships are important in business environments, other entities, such as products and competitors, are critical in business too. Second, most existing entity identification tools are designed for the news domain, and they are trained and tested on entities like people or locations. It is not clear whether, or how well, they perform on entities like products or competitors.

Figure 1. Business Ontology & Instances

One possible approach to identifying relations in the business domain that involve non-identified entities is to develop an entity identification tool first and then perform relation identification. However, there are too many entity types in business environments and too little labeled data to train such tools. We therefore fall back on more basic linguistic features, and assume that noun phrases, which can be identified reliably with existing tools such as POS taggers or syntactic parsers, are possible entity candidates. The patterns we construct for the business domain then help us filter out non-entity noun phrases and unrelated entities, and classify the relations between the entities.

In this paper, we choose relations that represent two different scenarios for studying our method. One type of relation involves entities that are identifiable by existing entity identification tools. This shows the connection between relation annotation in the business domain and relation extraction in the news domain. The other type of relation contains entities for which noun phrases must be used as the starting point. This shows how our relation annotation differs from relation extraction in the news domain. Our experiments examine the results of these two scenarios.

Because not much training data is available, we employ a bootstrap approach, which uses a limited set of samples to start pattern generation and then learns more patterns along the way. With better patterns, more samples can be created for better pattern generation. A common source of information in many existing pattern generation methods is the local context of the known entities. The patterns generated, however, can miss critical global syntactic clues at the sentence level. Therefore, our pattern generation method utilizes sentence-level syntactic information, i.e.
the Subject-Verb-Object (SVO) structure. SVO has been studied before in information extraction [2], in semantic navigation to represent the semantic structure of a sentence [3], and in many other areas. In this paper, we developed five different methods of using SVO information for pattern generation.

The remainder of the paper is organized as follows. Section 2 reviews related work. The core idea of the Relation Annotation Platform in Business Environments (RAPBE), which automatically annotates relations using Web mining techniques, is a two-step bootstrap, so we first briefly introduce the two-step bootstrap for RA in Section 3. We then present RAPBE in Section 4, the experiment design and result analysis in Section 5, and conclusions in Section 6.

2. Related Work

In the literature on relation extraction, most researchers so far have focused on relations like is-a [4] and part-of [5]. Hearst used patterns to extract hyponym relations [4]. Later, Berland and Charniak extended Hearst's work to part-of relation extraction [5]. Girju [6] combined machine learning algorithms and WordNet to discover and disambiguate generic part-of patterns for relation extraction (RE). Although some papers, such as Blohm and Cimiano [20], have begun to work on non-hierarchical relations, they still focus on context information for pattern extraction. Our relation annotation approach also tries to extract generic patterns, but it uses the Subject-Verb-Object (SVO) structure.

Bootstrap learning is an iterative approach that alternates between learning rules from a set of instances and learning instances from rules [7]. Hearst pioneered the use of the bootstrap method for extracting hyponym relations based on patterns [4]. Starting from three manually built lexico-syntactic patterns, Hearst used them to induce further patterns. Blum and Mitchell [8] used the bootstrap method for classifying webpages.
Riloff and Jones [9] used bootstrap learning on a small corpus to iteratively learn instances of large semantic classes and four patterns that can generate more instances. Ravichandran and Hovy [10] used bootstrapping to find patterns surrounding seed values for question answering from a training set of question and answer pairs. Etzioni et al. [11] and [12] used bootstrapping for entity extraction. Stevenson and Greenwood [2], in the task of pattern induction, proposed inducing patterns from subject, verb, and object triples instead of the local context. Therefore, we use the SVO pattern in the bootstrap for the relation annotation task.

3. Relation Annotation Using Bootstrap

Our relation annotation approach for business environments assumes that there is a knowledge base from which existing information about certain relations can be obtained, so the input for training an annotation model for a given relation is the set of predicates of that relation (e.g., Company-Locate-in-Country). The output of the training, therefore, is the annotation model.

Based on our assumption that knowledge bases (KBs) are the starting point for training models in RAPBE, we design a two-step bootstrap algorithm. The first step is seed generation from knowledge bases, that is, generating relation instances (entity pairs) from knowledge bases. It takes advantage of the structure or schema of the KBs to extract high-quality seeds. Related work includes relation extraction [8], taxonomy extraction [9], and ontology extraction [10] from KBs like WordNet and Wikipedia. The second step is pattern generation and ranking, which is the basis for building a trained model for RA. One key problem in this step is how to accurately identify the patterns that will effectively predict the relations in the later relation annotation task.

4. Relation Annotation Platform in Business Environments (RAPBE)

Our relation annotation platform is called RAPBE (Relation Annotation Platform in Business Environments). Figure 2 shows an example workflow of RAPBE.
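As a rough illustration of the two-step bootstrap introduced in Section 3, the following Python sketch generates seeds from a toy knowledge base and then counts the verbs that connect each seed pair in retrieved SVO triples. It is a minimal sketch under stated assumptions, not the RAPBE implementation: the names generate_seeds, generate_patterns, toy_search, TOY_KB, and TOY_RESULTS are hypothetical, the knowledge base and search results are made-up stand-ins for Infobox and a Web search engine, and the search step is assumed to return already parsed (subject, stemmed verb, object) triples.

```python
from collections import Counter
from typing import Callable, Iterable

Seed = tuple[str, str]           # (entity x, entity y)
Triple = tuple[str, str, str]    # (subject, stemmed verb, object)


def generate_seeds(kb: dict[str, dict[str, str]], attribute: str) -> list[Seed]:
    """Step 1: read (entity, attribute value) pairs from a toy knowledge base."""
    return [(entity, fields[attribute])
            for entity, fields in kb.items() if attribute in fields]


def generate_patterns(seeds: Iterable[Seed],
                      search: Callable[[str], list[Triple]]) -> Counter:
    """Step 2: count the verbs that link x and y in the retrieved SVO triples."""
    counts: Counter = Counter()
    for x, y in seeds:
        for subj, verb, obj in search(f'"{x}" "{y}"'):
            if subj == x and obj == y:   # keep only triples that cover the seed
                counts[verb] += 1
    return counts


# Hypothetical toy data standing in for Infobox and for Web search results.
TOY_KB = {
    "Google": {"headquarters": "Mountain View"},
    "SAP": {"headquarters": "Walldorf"},
}
TOY_RESULTS = {
    '"Google" "Mountain View"': [("Google", "headquarter", "Mountain View"),
                                 ("Google", "base", "Mountain View")],
    '"SAP" "Walldorf"': [("SAP", "headquarter", "Walldorf"),
                         ("SAP", "sponsor", "a conference")],
}


def toy_search(query: str) -> list[Triple]:
    """Return canned SVO triples for a query (assumption: parsing is done upstream)."""
    return TOY_RESULTS.get(query, [])


seeds = generate_seeds(TOY_KB, "headquarters")
patterns = generate_patterns(seeds, toy_search)
print(patterns.most_common())  # verbs ordered by raw frequency
```

Ranking by raw counts here corresponds only loosely to a frequency-based weight; the distance-based and PMI weights described in Section 4.2 would replace the final ranking step.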
Figure 2. RAPBE relation annotation framework. Step 1: extract target relations and their instances from the KB as seeds; Step 2: generate queries according to the extracted relation seeds; Step 3: issue the queries to the Web and get the search results; Step 4: generate patterns from the search results; Step 5: weight all the patterns for further RA; Step 6: RA.

4.1. Seed Generation

Seed generation in RAPBE relies on a knowledge base. The current RAPBE uses Wikipedia and its Infobox. Although Wikipedia contains rich knowledge with useful structural information, it still has some problems as a KB for seed generation. For example, the Wikipedia Infobox (Infobox for short) still needs further schema cleaning [18]. In this stage, we first identify whether the relation has instances in Infobox. The second step in seed generation is to identify the entity instances acting as the attributes of these relation instances. If the attribute fields in Infobox are not directly extractable, we mine the associated page to extract the information (e.g., product information).

4.2. Pattern Generation and Ranking

Based on the extracted seeds, RAPBE tries to infer patterns that cover those seeds. RAPBE uses the Web as the corpus for generating patterns and uses a Web search engine (Yahoo! BOSS) to query the Web. The results returned from the search engine usually cover several pieces of information, such as the page title and a short summary of the page content (the query and search steps in Figure 2). RAPBE uses the subject-verb-object (SVO) structure to extract patterns from the search results. A pattern is identified by the key verb that connects two entities in a sentence. As Brin [17] points out, the quality of the relations annotated later correlates strongly with the quality of the extracted patterns. After collecting all the patterns, we need to evaluate the relevance of those patterns to the relations. Ravichandran and Hovy [10] used a frequency threshold on the patterns to select the final patterns. However, low-frequency patterns can also be good patterns. Therefore, ranking the relevance between patterns and relation instances is an important task here, and we propose five different weighting schemes to rank patterns. This is shown as Step 5 in Figure 2.

4.2.1. Frequency Weight (FW)

Frequency Weight (FW) assumes that the higher the frequency of a pattern on the Web, the better its quality (Ravichandran and Hovy [10]). Stemming is used to improve the coverage of the method. In RAPBE, FW is defined as in formula (1), where x, p, y denotes the co-occurrence frequency of terms x and y and the pattern verb p within the same window. In this paper, the window is a single sentence.

4.2.2. Distance Weight (DW) and Verb Distance Weight (VDW)

Distance Weight (DW) denotes the word distance between two entities, which is defined as in formula (2).
Verb Distance Weight (VDW) is a special case of DW that examines the distance between the verb and entity y (see formula (3)).

4.2.3. Frequency-Distance Weight (FDW)

Frequency-Distance Weight (FDW) combines the distance weight and the frequency weight, as defined in formula (4).

4.2.4. PMI

Pointwise Mutual Information (PMI) is a commonly used metric for measuring the association between two events. We adopt PMI as a weight for pattern ranking, and PMI also serves as a baseline for evaluating the weights described above. PMI is defined as in formula (5), where Max pmi is the maximum PMI over all patterns and all instances. In turn, pmi is defined as in formula (6), where xi,p,yi is the frequency of the pattern p instantiated with terms xi and yi; xi,yi is the co-occurrence frequency of terms xi and yi; and p is the frequency of the verb p.

4.3. Relation Annotation

The output of RA in RAPBE consists of flat lists of annotations for relation instance pairs. For example, for the COMPANY-LOCATION relation, the output would include the pair (Google, Menlo Park). We assume that the system has a named entity tool to help identify the two entities in a relation. RAPBE then annotates whether the two entities hold the relation according to the patterns, as shown in Step 6 of Figure 2. During RA, entities are matched against the surface text of documents. One problem in such matching is co-reference: lacking a co-reference resolution tool, RAPBE cannot handle co-referring mentions. To overcome this problem, we developed a matching strategy that relies on matching just one entity. Our approach is motivated by Yarowsky's work on word sense disambiguation, which proposed the one-sense-per-collocation hypothesis [19]. Therefore, in the experiments, we also compare the following two matching strategies.
COM matching: match the verb pattern with both entity x and entity y.
NON-COM matching: match the verb pattern with entity y only, and use the topic entity as the default entity x.
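The two matching strategies defined above could be sketched as follows. This is only an illustrative reading of those definitions, not the RAPBE code: the SVO tuple, the annotate function, and the toy article are hypothetical, sentences are assumed to be already parsed into (subject, stemmed verb, object) triples, and the topic entity is assumed to be the company the article is about.

```python
from typing import NamedTuple


class SVO(NamedTuple):
    subject: str
    verb: str   # stemmed key verb of the sentence
    obj: str


def annotate(triples, patterns, entities_x, entities_y, topic_entity, mode="COM"):
    """Return (x, y) pairs whose sentence matches one of the verb patterns."""
    if mode not in ("COM", "NON-COM"):
        raise ValueError(f"unknown matching mode: {mode}")
    pairs = []
    for t in triples:
        if t.verb not in patterns or t.obj not in entities_y:
            continue                      # verb pattern and entity y must match
        if mode == "COM":
            if t.subject in entities_x:   # entity x must also appear explicitly
                pairs.append((t.subject, t.obj))
        else:
            pairs.append((topic_entity, t.obj))  # topic entity is the default x
    return pairs


# Made-up triples from a hypothetical article whose topic entity is Google.
ARTICLE = [
    SVO("Google", "headquarter", "Mountain View"),
    SVO("It", "base", "Dublin"),              # co-referent subject
    SVO("The company", "acquire", "YouTube"),  # verb is not a known pattern
]
PATTERNS = {"headquarter", "base"}

com = annotate(ARTICLE, PATTERNS, {"Google"}, {"Mountain View", "Dublin"},
               "Google", mode="COM")
non_com = annotate(ARTICLE, PATTERNS, {"Google"}, {"Mountain View", "Dublin"},
                   "Google", mode="NON-COM")
print(com)
print(non_com)
```

In this toy case the sentence with the pronoun subject is annotated only under NON-COM matching, which is the situation the strategy is meant to cover when no co-reference tool is available. The same default also means that NON-COM may attach the topic entity to sentences that are actually about a different company.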
5. Experiments

Two relations, COMPANY-LOCATION (C-L) and COMPANY-PRODUCT (C-P), were used as sample relations in business environments to evaluate the performance of RAPBE for RA. C-L was chosen to represent relations with an identifiable entity (e.g., LOCATION), and C-P to represent relations with a non-identifiable entity (e.g., PRODUCT). The performance of five weights (FW, DW, VDW, FDW, and PMI) and two matching methods (COM and NON-COM) was evaluated.

5.1. Experimental Setup

The named entity extraction tool from the Inxight LinguistX Platform was used for entity identification. Yahoo! Search BOSS was used for querying the Web. Wikipedia articles on twenty-five target companies, distributed across five industries (according to the 2008 Fortune 500), were chosen as the test set for both C-L and C-P relation annotation. Thirty-one companies with their C-P relation pairs from Infobox were extracted as seed C-P pairs for training. The companies in the Nasdaq-100 index with their C-L relation pairs from Infobox were used as seeds for C-L. Ground truth was manually marked up by two experts, and precision and recall were used for evaluation. Precision for RA in this paper is the ratio of correctly annotated relation pairs (e.g., C-L) to the total number of pairs annotated with that relation, while recall is the ratio of correctly annotated pairs to the total number of pairs that should have been annotated with the predefined relation (e.g., C-L).

5.2. COMPANY-LOCATION Experiment

Since the average number of C-L pairs per document is about 3, only the top 5 locations are evaluated. For NON-COM matching, there is no significant difference between PMI and VDW, or between FDW and FW, in either precision or recall. DW is significantly better than VDW in both precision and recall. The recall of FDW is significantly better than that of DW, and the precision and recall of PMI are worse than those of the other four weights.
Similar experiments were conducted with COM matching, and a t-test found no significant difference between COM and NON-COM matching in precision or recall for any of the five weights. Therefore, FW and FDW are better than DW and VDW for relations with an identifiable entity, and all four weights (FW, DW, VDW, and FDW) are better than PMI. The matching method (COM or NON-COM) has no effect for relations with identifiable entities.

5.3. COMPANY-PRODUCT Experiment

For NON-COM matching, there is no significant difference between FW and FDW in precision or recall. VDW is significantly better than FDW in both precision and recall. DW is significantly better than VDW in precision but not in recall. There is no significant difference between PMI and VDW. For COM matching, FDW is significantly better than FW; VDW and FDW show no significant difference; PMI and DW also show no significant difference, but both are better than VDW. With FW and FDW, COM matching is significantly better than NON-COM matching in precision. With DW and VDW, however, NON-COM matching is significantly better than COM matching in precision.

The difference between frequency weighting and distance weighting is that distance weighting considers the distance between the verb and the other entity, that is, it includes sentence-level syntactic information, while frequency weighting considers only the frequency of the verb and no syntactic information at all. Therefore, syntactic information is very useful for relations with a non-identifiable entity. None of the five weighting methods except VDW improved recall. DW and VDW perform better than FW and FDW for relations with a non-identifiable entity, and PMI is comparable to FW and FDW. The matching method can affect the results for relations with a non-identifiable entity.

6. Conclusion and Discussion

This paper describes relation annotation in business environments and proposes a Relation Annotation Platform in Business Environments (RAPBE) for relation annotation using web mining techniques. The core idea of RAPBE is a two-step bootstrap consisting of seed generation and pattern generation. Seed generation extracts clean seeds from a knowledge base, such as Infobox from Wikipedia. Our studies show that Infobox is a good source for seed generation in relation annotation. Pattern generation generates the
patterns used to build models for later relation annotation tasks. To find good-quality patterns, different weighting schemes (FW, FDW, DW, VDW, and PMI) were investigated as methods for ranking the relation patterns.

Our experiments focused on two relations: COMPANY-PRODUCT (C-P), which represents relations with entities that existing entity identification tools cannot identify, and COMPANY-LOCATION (C-L), which represents relations with identifiable entities. For relations with non-identifiable entities (C-P), syntactic information is critical for RA, while for relations with identifiable entities (C-L), frequency is more important.

NON-COM matching compensates for the lack of co-reference resolution. As shown in the results, NON-COM and COM matching show no significant difference for relations with an identifiable entity. However, NON-COM matching is better than COM matching with FW, VDW, and PMI for relations with a non-identifiable entity.

Many experiments have been run on the RAPBE system, but the precision and recall for RA are still not good enough for annotating relations with a non-identified entity. One direction of future work, therefore, is to focus on methods for filtering out irrelevant entities. Another is to enable the pattern extraction method in RAPBE to handle relations involving more than two entities.

7. Acknowledgements

We thank the SAP Continuous Sensemaking teams, especially Keith Klemba and Thomas Heinzel, for their great support.

8. References

[1] Ming Mao, T. Heinzel, Keith Klemba, and Qi Li. A Sensemaking-based Information Foraging and Summarization System in Business Environments. Proceedings of EEE09.
[2] Stevenson, M., & Greenwood, M. A. A Semantic Approach to IE Pattern Induction. ACL05.
[3] Robin Stewart, Gregory Scott, Vladimir Zelevinsky. Idea Navigation: Structured Browsing for Unstructured Text. CHI 2008 Proceedings, April 2008.
[4] Hearst, M. Automatic acquisition of hyponyms from large text corpora. COLING-92, Nantes, France.
[5] M. Berland and E. Charniak. Finding Parts in Very Large Corpora. ACL99, College Park, MD.
[6] Girju, R., Badulescu, A., & Moldovan, D. Automatic Discovery of Part-Whole Relations. Computational Linguistics.
[7] Jones, R., McCallum, A. M., Nigam, K., & Riloff, E. Bootstrapping for Text Learning Tasks. IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications.
[8] Avrim Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. The 1998 Conference on Computational Learning Theory.
[9] Riloff, E., & Jones, R. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence.
[10] D. Ravichandran and E. H. Hovy. Learning Surface Text Patterns for a Question Answering System. ACL02, Philadelphia.
[11] O. Etzioni, M. J. Cafarella, D. Downey, A. M. Popescu, et al. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1).
[15] Suchanek, F. M., Kasneci, G., & Weikum, G. Yago: A Core of Semantic Knowledge - Unifying WordNet and Wikipedia. WWW07.
[16] M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and Searching the World Wide Web of Facts - Step One: the One-Million Fact Extraction Challenge. AAAI06.
[17] Brin, S. Extracting Patterns and Relations from the World Wide Web. Lecture Notes in Computer Science.
[18] Wu, F., & Weld, D. S. Automatically Refining the Wikipedia Infobox Ontology. WWW08, Beijing, China.
[19] Yarowsky, D. One Sense Per Collocation. Proceedings of the ARPA Human Language Technology Workshop.
[20] Sebastian Blohm, Philipp Cimiano. Scaling up Pattern Induction for Web Relation Extraction through Frequent Itemset Mining. Proceedings of the KI 2008 Workshop on Ontology-Based Information Extraction Systems, September 2008.
More informationProduct Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments
Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &
More informationThe stages of event extraction
The stages of event extraction David Ahn Intelligent Systems Lab Amsterdam University of Amsterdam ahn@science.uva.nl Abstract Event detection and recognition is a complex task consisting of multiple sub-tasks
More informationA Bayesian Learning Approach to Concept-Based Document Classification
Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors
More informationMaximizing Learning Through Course Alignment and Experience with Different Types of Knowledge
Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February
More informationChunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.
NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and
More informationUnsupervised Learning of Narrative Schemas and their Participants
Unsupervised Learning of Narrative Schemas and their Participants Nathanael Chambers and Dan Jurafsky Stanford University, Stanford, CA 94305 {natec,jurafsky}@stanford.edu Abstract We describe an unsupervised
More informationTransfer Learning Action Models by Measuring the Similarity of Different Domains
Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationModeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures
Modeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures Ulrike Baldewein (ulrike@coli.uni-sb.de) Computational Psycholinguistics, Saarland University D-66041 Saarbrücken,
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationUsing Web Searches on Important Words to Create Background Sets for LSI Classification
Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract
More informationBootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain
Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Andreas Vlachos Computer Laboratory University of Cambridge Cambridge, CB3 0FD, UK av308@cl.cam.ac.uk Caroline Gasperin Computer
More informationarxiv: v1 [cs.cl] 2 Apr 2017
Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,
More informationActive Learning. Yingyu Liang Computer Sciences 760 Fall
Active Learning Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven,
More informationMemory-based grammatical error correction
Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,
More informationProject in the framework of the AIM-WEST project Annotation of MWEs for translation
Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment
More informationVocabulary Usage and Intelligibility in Learner Language
Vocabulary Usage and Intelligibility in Learner Language Emi Izumi, 1 Kiyotaka Uchimoto 1 and Hitoshi Isahara 1 1. Introduction In verbal communication, the primary purpose of which is to convey and understand
More informationLearning Computational Grammars
Learning Computational Grammars John Nerbonne, Anja Belz, Nicola Cancedda, Hervé Déjean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard and Erik Tjong Kim Sang Abstract
More informationDeveloping True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability
Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability Shih-Bin Chen Dept. of Information and Computer Engineering, Chung-Yuan Christian University Chung-Li, Taiwan
More informationThe Ups and Downs of Preposition Error Detection in ESL Writing
The Ups and Downs of Preposition Error Detection in ESL Writing Joel R. Tetreault Educational Testing Service 660 Rosedale Road Princeton, NJ, USA JTetreault@ets.org Martin Chodorow Hunter College of CUNY
More informationAbstractions and the Brain
Abstractions and the Brain Brian D. Josephson Department of Physics, University of Cambridge Cavendish Lab. Madingley Road Cambridge, UK. CB3 OHE bdj10@cam.ac.uk http://www.tcm.phy.cam.ac.uk/~bdj10 ABSTRACT
More informationCS 446: Machine Learning
CS 446: Machine Learning Introduction to LBJava: a Learning Based Programming Language Writing classifiers Christos Christodoulopoulos Parisa Kordjamshidi Motivation 2 Motivation You still have not learnt
More informationLQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization
LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization Annemarie Friedrich, Marina Valeeva and Alexis Palmer COMPUTATIONAL LINGUISTICS & PHONETICS SAARLAND UNIVERSITY, GERMANY
More informationHLTCOE at TREC 2013: Temporal Summarization
HLTCOE at TREC 2013: Temporal Summarization Tan Xu University of Maryland College Park Paul McNamee Johns Hopkins University HLTCOE Douglas W. Oard University of Maryland College Park Abstract Our team
More informationLearning a Cross-Lingual Semantic Representation of Relations Expressed in Text
Learning a Cross-Lingual Semantic Representation of Relations Expressed in Text Achim Rettinger, Artem Schumilin, Steffen Thoma, and Basil Ell Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
More informationDisambiguation of Thai Personal Name from Online News Articles
Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online
More informationUniversiteit Leiden ICT in Business
Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:
More informationA DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF GRAPH DATA
International Journal of Semantic Computing Vol. 5, No. 4 (2011) 433 462 c World Scientific Publishing Company DOI: 10.1142/S1793351X1100133X A DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF
More informationAnalysis: Evaluation: Knowledge: Comprehension: Synthesis: Application:
In 1956, Benjamin Bloom headed a group of educational psychologists who developed a classification of levels of intellectual behavior important in learning. Bloom found that over 95 % of the test questions
More informationDetecting English-French Cognates Using Orthographic Edit Distance
Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National
More informationLearning Disability Functional Capacity Evaluation. Dear Doctor,
Dear Doctor, I have been asked to formulate a vocational opinion regarding NAME s employability in light of his/her learning disability. To assist me with this evaluation I would appreciate if you can
More informationExtracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models
Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Richard Johansson and Alessandro Moschitti DISI, University of Trento Via Sommarive 14, 38123 Trento (TN),
More informationSearch right and thou shalt find... Using Web Queries for Learner Error Detection
Search right and thou shalt find... Using Web Queries for Learner Error Detection Michael Gamon Claudia Leacock Microsoft Research Butler Hill Group One Microsoft Way P.O. Box 935 Redmond, WA 981052, USA
More informationCompositional Semantics
Compositional Semantics CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Words, bag of words Sequences Trees Meaning Representing Meaning An important goal of NLP/AI: convert natural language
More informationOrganizational Knowledge Distribution: An Experimental Evaluation
Association for Information Systems AIS Electronic Library (AISeL) AMCIS 24 Proceedings Americas Conference on Information Systems (AMCIS) 12-31-24 : An Experimental Evaluation Surendra Sarnikar University
More informationSyntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews
Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews Kang Liu, Liheng Xu and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy
More informationData Integration through Clustering and Finding Statistical Relations - Validation of Approach
Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Marek Jaszuk, Teresa Mroczek, and Barbara Fryc University of Information Technology and Management, ul. Sucharskiego
More informationReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology
ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon
More informationPython Machine Learning
Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled
More informationSouth Carolina English Language Arts
South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content
More informationRule Learning with Negation: Issues Regarding Effectiveness
Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX
More informationCombining a Chinese Thesaurus with a Chinese Dictionary
Combining a Chinese Thesaurus with a Chinese Dictionary Ji Donghong Kent Ridge Digital Labs 21 Heng Mui Keng Terrace Singapore, 119613 dhji @krdl.org.sg Gong Junping Department of Computer Science Ohio
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationLEARNING A SEMANTIC PARSER FROM SPOKEN UTTERANCES. Judith Gaspers and Philipp Cimiano
LEARNING A SEMANTIC PARSER FROM SPOKEN UTTERANCES Judith Gaspers and Philipp Cimiano Semantic Computing Group, CITEC, Bielefeld University {jgaspers cimiano}@cit-ec.uni-bielefeld.de ABSTRACT Semantic parsers
More information2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases
POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz
More informationAutomating the E-learning Personalization
Automating the E-learning Personalization Fathi Essalmi 1, Leila Jemni Ben Ayed 1, Mohamed Jemni 1, Kinshuk 2, and Sabine Graf 2 1 The Research Laboratory of Technologies of Information and Communication
More informationPrediction of Maximal Projection for Semantic Role Labeling
Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba
More informationThe Internet as a Normative Corpus: Grammar Checking with a Search Engine
The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a
More informationEnsemble Technique Utilization for Indonesian Dependency Parser
Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id
More informationA Topic Maps-based ontology IR system versus Clustering-based IR System: A Comparative Study in Security Domain
A Topic Maps-based ontology IR system versus Clustering-based IR System: A Comparative Study in Security Domain Myongho Yi 1 and Sam Gyun Oh 2* 1 School of Library and Information Studies, Texas Woman
More informationKnowledge-Based - Systems
Knowledge-Based - Systems ; Rajendra Arvind Akerkar Chairman, Technomathematics Research Foundation and Senior Researcher, Western Norway Research institute Priti Srinivas Sajja Sardar Patel University
More informationA Comparison of Standard and Interval Association Rules
A Comparison of Standard and Association Rules Choh Man Teng cmteng@ai.uwf.edu Institute for Human and Machine Cognition University of West Florida 4 South Alcaniz Street, Pensacola FL 325, USA Abstract
More informationDEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS
DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS Julia Tmshkina Centre for Text Techitology, North-West University, 253 Potchefstroom, South Africa 2025770@puk.ac.za
More informationOn-Line Data Analytics
International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob
More informationWE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT
WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working
More informationTask Tolerance of MT Output in Integrated Text Processes
Task Tolerance of MT Output in Integrated Text Processes John S. White, Jennifer B. Doyon, and Susan W. Talbott Litton PRC 1500 PRC Drive McLean, VA 22102, USA {white_john, doyon jennifer, talbott_susan}@prc.com
More informationEffect of Word Complexity on L2 Vocabulary Learning
Effect of Word Complexity on L2 Vocabulary Learning Kevin Dela Rosa Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA kdelaros@cs.cmu.edu Maxine Eskenazi Language
More informationLearning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for
Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com
More informationShort Text Understanding Through Lexical-Semantic Analysis
Short Text Understanding Through Lexical-Semantic Analysis Wen Hua #1, Zhongyuan Wang 2, Haixun Wang 3, Kai Zheng #4, Xiaofang Zhou #5 School of Information, Renmin University of China, Beijing, China
More informationMovie Review Mining and Summarization
Movie Review Mining and Summarization Li Zhuang Microsoft Research Asia Department of Computer Science and Technology, Tsinghua University Beijing, P.R.China f-lzhuang@hotmail.com Feng Jing Microsoft Research
More informationA Comparison of Two Text Representations for Sentiment Analysis
010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational
More informationExploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data
Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Maja Popović and Hermann Ney Lehrstuhl für Informatik VI, Computer
More informationThe taming of the data:
The taming of the data: Using text mining in building a corpus for diachronic analysis Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich Background Big data
More information