A Study of Relation Annotation in Business Environments Using Web Mining

Qi Li
School of Information Science, University of Pittsburgh

Daqing He
School of Information Science, University of Pittsburgh

Ming Mao
SAP North America Lab

Abstract

Relation annotation (RA) is the process of marking up relations among a set of entities identified in plain text. RA is important to enterprise applications because it can reveal semantics in business environments. However, RA in business environments differs from RA in the news domain: the entities involved in business relations often go beyond types such as People or Locations, and many business entities still cannot be identified by existing entity identification tools. In this paper, we explore RA in business environments using web mining techniques and propose the Relation Annotation Platform in Business Environments (RAPBE), which helps information workers by automatically annotating business relations in enterprise settings. We evaluated RAPBE on two sample relations that are common in the business domain: COMPANY-LOCATION and COMPANY-PRODUCT. Our experimental results demonstrate the usefulness of RAPBE for relation annotation. They also show that the Frequency Weight method is best for marking up relations whose entities are identifiable by existing entity identification tools, whereas Distance Weight is best when some entities involved in RA cannot be identified by the information extraction tools.

1. Introduction

Today, business people must deal with an overwhelming amount of potentially relevant information that is continuously produced in various media with varying interpretations. It is impossible for an information worker to manually discover and synthesize all the available information [1].
As pointed out by Mao, sense-making focuses on making sense of ambiguous contexts and on continuously making the found knowledge more precise by disambiguating the context. Effective analysis tools are needed to find the key entities and their relations in the sense-making task.

Currently, most relation identification work focuses on relations such as is-a or part-of, which express connections between entities in a hierarchical structure. However, relations in business environments are usually not limited to hierarchical relations. For example, in customer relationship management, it is important to capture the relations between a company and its products, and between a product and its customer reviews. In marketing and business intelligence, it is important to identify relations based on extracted entities (e.g. People-In-Organization-In-Some-Place or New-Product-With-Some-Company). To help with these goals, we propose a process called relation annotation (RA) and an associated system that automatically annotates non-hierarchical relations predefined according to the needs of business environments.

Relation annotation (RA) is the process of creating a markup of relations among entities in plain text. For example, as shown in Figure 1, two entities, Google and Mountain View, can be annotated with the relation Company-Based-on-Location (COMPANY-LOCATION for short) according to the ontology illustrated in Figure 1. Our approach to RA uses patterns, which are rules predefined or learned from a large corpus. The rules then guide the system in annotating relations among the extracted entities.

The general idea of our approach was developed initially in the news domain. However, our approach is novel in the following respects. First, methods in the news domain concentrate on entities such as people, locations, organizations, and times.
Although these entities and their relationships are important in business environments, other entities, such as products and competitors, are critical in business too. Second, most existing entity identification tools are designed for the news domain and are trained and tested on entities like people or locations. It is not clear whether, or how well, they perform on entities like products or competitors.

Figure 1. Business Ontology & Instances

One possible approach to identifying relations in the business domain that involve non-identified entities is to develop an entity identification tool first and then perform relation identification. However, there are too many entities in business environments and too little labeled data to train such tools. We therefore fall back on more basic linguistic features and assume that noun phrases, which can be identified reliably with existing tools such as POS taggers or syntactic parsers, are possible entity candidates. The patterns we construct for the business domain then help us filter out non-entity noun phrases and unrelated entities, and classify the relations between the entities.

In this paper, we choose relations that represent two different scenarios for studying our method. One type of relation involves entities that are identifiable by existing entity identification tools. This shows the connection between relation annotation in the business domain and relation extraction in the news domain. The other type of relation contains entities that require noun phrases as the starting point. This shows how our relation annotation differs from news-domain relation extraction. Our experiments examine the results for both scenarios.

Because little training data is available, we employ a bootstrap approach, which uses a limited set of samples to start pattern generation and then learns more patterns along the way. With better patterns, more samples can be created for better pattern generation. Many existing pattern generation methods draw on the local context of the known entities. The patterns generated this way, however, can miss critical global syntactic clues at the sentence level. Therefore, our pattern generation method uses sentence-level syntactic information, i.e. the Subject-Verb-Object (SVO) structure. SVO has been studied before in information extraction [2], in semantic navigation to represent the semantic structure of a sentence [3], and in many other areas. In this paper, we develop five different methods of using SVO information for pattern generation.

The remainder of the paper is organized as follows. Section 2 reviews related work. Because the core idea of RAPBE is a two-step bootstrap, Section 3 briefly introduces the two-step bootstrap for RA. Section 4 then presents the Relation Annotation Platform in Business Environments (RAPBE), which automatically annotates relations using Web mining techniques. Section 5 describes the experiment design and result analysis, and Section 6 concludes.

2. Related Work

In the relation extraction literature, most researchers so far have focused on relations like is-a [4] and part-of [5]. Hearst used patterns to extract hyponym relations [4]. Later, Berland and Charniak extended Hearst's work to part-of relation extraction [5]. Girju [6] combined machine learning algorithms and WordNet to discover and disambiguate generic part-of patterns for relation extraction (RE). Although some work, such as Blohm [20], has begun to address non-hierarchical relations, it still focuses on context information for pattern extraction. Our relation annotation approach also tries to extract generic patterns, but uses the Subject-Verb-Object (SVO) structure.

Bootstrap learning is an iterative approach that alternates between learning rules from a set of instances and learning instances from rules [7]. Hearst pioneered the use of the bootstrap method for extracting hyponym relations based on patterns [4]: after manually building three lexico-syntactic patterns, Hearst used them to induce other patterns. Blum and Mitchell [8] used the bootstrap method for classifying web pages.
Riloff and Jones [9] used bootstrap learning on a small corpus to iteratively learn instances of large semantic classes and four patterns that can generate more instances. Ravichandran & Hovy [10] used bootstrapping to find patterns surrounding seed values for question answering from a training set of question and answer pairs. Etzioni et al. [11] and [12] used bootstrapping for entity extraction. Stevenson and Greenwood [2], in the task of pattern induction, proposed inducing triple patterns of subject, verb, and object instead of local context. We therefore use the SVO pattern in the bootstrap for the relation annotation task.

3. Relation Annotation Using Bootstrap

Our relation annotation approach for business environments assumes that there is a knowledge base from which existing information about certain relations can be obtained. The input for training an annotation model for a given relation is therefore the set of predicates of that relation (e.g. Company-Locate-in-Country), and the output of the training is the annotation model.

Based on our assumption of having knowledge bases (KBs) as the starting point for training models in RAPBE, we design a two-step bootstrap algorithm. The first step is seed generation, that is, generating relation instances (entity pairs) from knowledge bases. It takes advantage of the structure or schema of the KBs to extract high-quality seeds. Related work includes relation extraction [8], taxonomy extraction [9], and ontology extraction [10] from KBs such as WordNet and Wikipedia. The second step is pattern generation and ranking, which forms the basis for building a trained model for RA. One key problem in this step is how to accurately identify the patterns that will effectively predict the relations in the later relation annotation task.

4. Relation Annotation Platform in Business Environments (RAPBE)

Our relation annotation platform is called RAPBE (Relation Annotation Platform in Business Environments). Figure 2 shows an example workflow of RAPBE.
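The two-step bootstrap described in Section 3 can be illustrated with a minimal Python sketch. It is not the RAPBE implementation: the toy knowledge base, the toy snippets, and the function names (generate_seeds, generate_patterns, rank_patterns) are all hypothetical, and the words between the two entities stand in for the key verb that an SVO parse would supply. Ranking here uses raw pattern frequency only.

```python
from collections import Counter

# Hypothetical stand-in for a knowledge base such as Wikipedia Infobox:
# company -> headquarters attribute. Not data from the paper.
TOY_KB = {"Google": "Mountain View", "SAP": "Walldorf"}

# Hypothetical stand-in for Web search result snippets.
TOY_SNIPPETS = [
    "Google is headquartered in Mountain View",
    "SAP is headquartered in Walldorf",
    "Google opened an office near Mountain View",
]


def generate_seeds(kb):
    """Step 1: read (x, y) entity pairs for the relation off the KB."""
    return sorted(kb.items())


def generate_patterns(seeds, snippets):
    """Step 2a: count the text linking x to y in each snippet.

    Simplification: the words between the two entities stand in for
    the key verb of an SVO parse, which this sketch does not compute.
    """
    counts = Counter()
    for x, y in seeds:
        for snippet in snippets:
            start = snippet.find(x)
            end = snippet.find(y)
            if start != -1 and end != -1 and start + len(x) < end:
                counts[snippet[start + len(x):end].strip()] += 1
    return counts


def rank_patterns(counts):
    """Step 2b: rank patterns by raw frequency, most frequent first."""
    return [pattern for pattern, _ in counts.most_common()]


seeds = generate_seeds(TOY_KB)
ranked = rank_patterns(generate_patterns(seeds, TOY_SNIPPETS))
print(ranked)
```

In a full bootstrap, the top-ranked patterns would be used to harvest new entity pairs, which in turn would feed another round of pattern generation.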
Figure 2. RAPBE relation annotation framework. Step 1: extract target relations and their instances from the KB as seeds; Step 2: generate queries from the extracted relation seeds; Step 3: issue the queries to the Web and get the search results; Step 4: generate patterns from the search results; Step 5: weight all the patterns for further RA; Step 6: RA.

4.1. Seed Generation

Seed generation in RAPBE relies on a knowledge base. RAPBE currently uses Wikipedia and its Infobox. Although Wikipedia contains rich knowledge with useful structural information, it still has some problems as a KB for seed generation. For example, the Wikipedia Infobox (Infobox for short) still needs further schema cleaning [18]. In this stage, we first identify whether the relation has instances in Infobox. The second step in seed generation is to identify the entity instances acting as the attributes of these relation instances. If the attribute fields in Infobox are not directly extractable, we mine the associated page in order to extract the product information.

4.2. Pattern Generation and Ranking

Based on the extracted seeds, RAPBE tries to infer patterns that cover them. RAPBE uses the Web as the corpus for generating patterns and uses a Web search engine (Yahoo! BOSS) to query the Web (the query step in Figure 2). The results returned from the search engine usually include several pieces of information, such as the page title and a short summary of the page content (the search-result step in Figure 2). RAPBE uses the subject-verb-object (SVO) structure to extract patterns from the search results. A pattern is identified by the key verb together with the two entities in the sentence.

As Brin [17] points out, the quality of the annotated relations correlates strongly with the quality of the extracted patterns. After collecting all the patterns, we need to evaluate the relevance of those patterns to the relations. Ravichandran [10] applied a frequency threshold to select the final patterns. However, low-frequency patterns can also be good patterns. Ranking the relevance between patterns and relation instances is therefore an important task, and we propose five different weighting schemes to rank patterns (the weighting step in Figure 2).

Frequency Weight (FW)

Frequency Weight (FW) assumes that the higher the frequency of a pattern on the Web, the better its quality [Hovy, 2002]. Stemming is used to improve the coverage of the method. In RAPBE, FW is defined as in formula (1),

(1) [formula missing from the transcription]

where x, p, y denotes the co-occurrence frequency of terms x and y and the pattern verb p within the same window. In this paper, the window is a single sentence.

Distance Weight (DW) and Verb Distance Weight (VDW)

Distance Weight (DW) denotes the word distance between the two entities, and is defined as in formula (2).
(2) [formula missing from the transcription]

Verb Distance Weight (VDW) is a special case of DW that examines the distance between the verb and entity y (see formula (3)).

(3) [formula missing from the transcription]

Frequency-Distance Weight (FDW)

Frequency-Distance Weight (FDW) combines the distance weight and the frequency weight, and is defined in formula (4).

(4) [formula missing from the transcription]

PMI

Pointwise Mutual Information (PMI) is a commonly used metric for measuring the association between two events. We adopt PMI as a weight for pattern ranking, and PMI also serves as a baseline for evaluating the weights described above. PMI is defined as in formula (5),

(5) [formula missing from the transcription]

where Max pmi is the maximum PMI over all patterns and all instances, and pmi is defined as in formula (6),

(6) [formula missing from the transcription]

where xi,p,yi is the frequency of the pattern p instantiated with terms xi and yi; xi,yi is the co-occurrence frequency of terms xi and yi; and p is the frequency of the verb term p. Likewise, x,p,y is the frequency of the pattern p instantiated with terms x and y, and x,y is the co-occurrence frequency of terms x and y.

4.3. Relation Annotation

The output of RA in RAPBE consists of flat lists of annotated relation instance pairs. For example, for the COMPANY-LOCATION relation, the output would include the pair (Google, Menlo Park). We assume that the system has a named entity tool to help identify the two entities in a relation. RAPBE then annotates whether the two entities stand in the relation according to the patterns (the RA step in Figure 2).

During RA, entities are matched to the surface text in documents. One problem in such matching is co-reference, which RAPBE cannot handle because it lacks a co-reference resolution tool. To overcome this problem, we developed a matching strategy that relies on matching just one entity. Our approach is motivated by Yarowsky's work on word sense disambiguation, which proposed the one-sense-per-collocation hypothesis [19]. In the experiments, we therefore also compare the two matching strategies:

COM matching: match the verb pattern with both entity x and entity y.
NON-COM matching: match the verb pattern with entity y only, and use the topic entity as the default entity x.
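The difference between the two matching strategies can be made concrete with a small Python sketch. This is an illustration under simplifying assumptions, not the RAPBE code: patterns are matched as plain substrings rather than through stemming or an SVO parse, and the function names, the sample document, and the candidate list are hypothetical.

```python
def com_match(sentence, x, verb, y):
    """COM: require entity x, then the pattern verb, then entity y."""
    x_at = sentence.find(x)
    if x_at == -1:
        return False
    return non_com_match(sentence[x_at + len(x):], verb, y)


def non_com_match(sentence, verb, y):
    """NON-COM: require only the pattern verb followed by entity y."""
    verb_at = sentence.find(verb)
    return verb_at != -1 and sentence.find(y, verb_at + len(verb)) != -1


def annotate(sentences, topic, verb, candidates, strategy):
    """Return (topic, y) pairs matched under the chosen strategy."""
    pairs = []
    for sentence in sentences:
        for y in candidates:
            if strategy == "COM":
                hit = com_match(sentence, topic, verb, y)
            else:
                # The topic entity is assumed as x without being matched.
                hit = non_com_match(sentence, verb, y)
            if hit and (topic, y) not in pairs:
                pairs.append((topic, y))
    return pairs


# Hypothetical document about Google; the second sentence refers to the
# company only through a co-referring phrase.
DOC = [
    "Google was founded in 1998.",
    "The company is headquartered in Mountain View.",
]
LOCATIONS = ["Mountain View", "Zurich"]

com_pairs = annotate(DOC, "Google", "headquartered", LOCATIONS, "COM")
non_com_pairs = annotate(DOC, "Google", "headquartered", LOCATIONS, "NON-COM")
print(com_pairs, non_com_pairs)
```

On this sample, COM matching finds nothing because the sentence containing the pattern mentions "The company" rather than "Google", while NON-COM matching recovers the pair by defaulting x to the topic entity. The trade-off is that NON-COM can also attach the topic entity to sentences that are about something else.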

5. Experiments

Two relations, COMPANY-LOCATION (C-L) and COMPANY-PRODUCT (C-P), were used as sample relations in the business environment to evaluate the performance of RAPBE for RA. The C-L relation was chosen to represent relations with an identifiable entity (e.g. LOCATION), and the C-P relation to represent relations with a non-identifiable entity (e.g. PRODUCT). The performance of five weights (FW, DW, VDW, FDW, and PMI) and two matching methods (COM and NON-COM) was evaluated.

5.1. Experimental Setup

The named entity extraction tool from the Inxight LinguistX Platform was used for entity identification, and Yahoo! Search BOSS was used for querying the Web. Wikipedia articles on twenty-five target companies, distributed across five industries (according to the 2008 Fortune 500), were chosen as the test set for both Company-Location (C-L) and Company-Product (C-P) relation annotation. Thirty-one companies with their C-P relation pairs from Infobox were extracted as seed C-P relation pairs for training. The companies in the Nasdaq-100 index with their C-L relation pairs from Infobox served as seeds for the C-L relation. Ground truth was manually marked up by two experts, and precision and recall were used for evaluation. Precision for RA in this paper is the fraction of produced relation pairs (e.g. C-L) that are correctly annotated, while recall is the ratio of the number of correctly labeled pairs to the total number that should have been labeled with the predefined relation (e.g. C-L).

5.2. COMPANY-LOCATION Experiment

Since a document contains about three C-L pairs on average, only the top five locations are evaluated. For NON-COM matching, there is no significant difference between PMI and VDW, or between FDW and FW, in either precision or recall. DW is significantly better than VDW in both precision and recall. The recall of FDW is significantly better than that of DW, and the precision and recall of PMI are worse than those of the other four weights.
Similar experiments were conducted with COM matching, and a t-test found no significant difference between COM and NON-COM matching in precision or recall for any of the five weights. Therefore, FW and FDW are better than DW and VDW for relations with an identifiable entity, and all four of these weights (FW, DW, VDW, and FDW) are better than PMI. The matching method, COM or NON-COM, has no effect for relations with identifiable entities.

5.3. COMPANY-PRODUCT Experiment

For NON-COM matching, there is no significant difference between FW and FDW in precision or recall. VDW is significantly better than FDW in both precision and recall. DW is significantly better than VDW in precision but not in recall. There is no significant difference between PMI and VDW. For COM matching, FDW is significantly better than FW; VDW and FDW show no significant difference; and PMI and DW show no significant difference either, but both are better than VDW.

Both FW and FDW with COM matching are significantly better in precision than with NON-COM matching. For DW and VDW, however, NON-COM matching is significantly better than COM matching in precision.

The difference between frequency weighting and distance weighting is that distance weighting considers the distance between the verb and the other entity, and so includes sentence-level syntactic information, while frequency weighting considers only the frequency of the verb, with no syntactic information at all. Syntactic information is therefore very useful for relations with a non-identifiable entity. None of the five weighting methods except VDW improved recall. DW and VDW perform better than FW and FDW for relations with a non-identifiable entity, and PMI is comparable to FW and FDW. The matching strategy can affect the results for relations with a non-identifiable entity.

6. Conclusion and Discussion

This paper describes relation annotation in business environments and proposes the Relation Annotation Platform in Business Environments (RAPBE) for relation annotation using web mining techniques. The core idea of RAPBE is a two-step bootstrap: seed generation and pattern generation. Seed generation extracts clean seeds from a knowledge base, such as Infobox from Wikipedia. Our studies show that Infobox is a good source for seed generation in relation annotation. Pattern generation produces the patterns used to build models for later relation annotation tasks. To find good-quality patterns, different weighting schemes (FW, FDW, DW, VDW, and PMI) were investigated as methods for ranking the relation patterns.

Our experiments tested two relations: COMPANY-PRODUCT (C-P), which represents relations with entities that existing entity identification tools cannot identify, and COMPANY-LOCATION (C-L), which represents relations with identifiable entities. For relations with non-identifiable entities (C-P), syntactic information is critical for RA, while for relations with identifiable entities (C-L), frequency is more important.

NON-COM matching compensates for the lack of co-reference resolution. As shown in the results, NON-COM and COM matching show no significant difference for relations with an identifiable entity, but NON-COM matching is better than COM matching with FW, VDW, and PMI for relations with a non-identifiable entity.

Many experiments have been run on the RAPBE system, but the precision and recall of RA are still not good enough for annotating relations with non-identified entities. One direction of future work, therefore, is to develop methods for filtering out irrelevant entities. Another is to enable the pattern extraction method in RAPBE to handle relations involving more than two entities.

7. Acknowledgments

We thank the SAP Continuous Sensemaking team for their great support, especially Keith Klemba and Thomas Heinzel.

8. References

[1] Ming Mao, T. Heinzel, Keith Klemba, and Qi Li. A Sensemaking-based Information Foraging and Summarization System in Business Environments. Proceedings of EEE09.
[2] Stevenson, Mark, and Greenwood, Mark A. A Semantic Approach to IE Pattern Induction. ACL05.
[3] Robin Stewart, Gregory Scott, and Vladimir Zelevinsky. Idea Navigation: Structured Browsing for Unstructured Text. CHI 2008 Proceedings, April 2008.
[4] Hearst, M. Automatic acquisition of hyponyms from large text corpora. COLING-92, Nantes, France.
[5] M. Berland and E. Charniak. Finding Parts in Very Large Corpora. ACL99, College Park, MD.
[6] Girju, R., Badulescu, A., & Moldovan, D. Automatic Discovery of Part-Whole Relations. Computational Linguistics.
[7] Jones, R., McCallum, A. M., Nigam, K., & Riloff, E. Bootstrapping for Text Learning Tasks. IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications.
[8] Avrim Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. The 1998 Conference on Computational Learning Theory.
[9] Riloff, E., & Jones, R. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence.
[10] D. Ravichandran and E.H. Hovy. Learning Surface Text Patterns for a Question Answering System. ACL02, Philadelphia.
[11] O. Etzioni, M.J. Cafarella, D. Downey, A. M. Popescu, et al. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1).
[15] Suchanek, F. M., Kasneci, G., & Weikum, G. Yago: A core of semantic knowledge - unifying WordNet and Wikipedia. WWW07.
[16] M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and Searching the World Wide Web of Facts - Step One: the One-Million Fact Extraction Challenge. AAAI06.
[17] Brin, S. Extracting Patterns and Relations from the World Wide Web. Lecture Notes in Computer Science.
[18] Wu, F., & Weld, D. S. Automatically Refining the Wikipedia Infobox Ontology. WWW08, Beijing, China.
[19] Yarowsky, D. One Sense Per Collocation. Proceedings of the ARPA Human Language Technology Workshop.
[20] Sebastian Blohm and Philipp Cimiano. Scaling up Pattern Induction for Web Relation Extraction through Frequent Itemset Mining. Proceedings of the KI 2008 Workshop on Ontology-Based Information Extraction Systems, September 2008.


More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS

METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Leveraging Sentiment to Compute Word Similarity

Leveraging Sentiment to Compute Word Similarity Leveraging Sentiment to Compute Word Similarity Balamurali A.R., Subhabrata Mukherjee, Akshat Malu and Pushpak Bhattacharyya Dept. of Computer Science and Engineering, IIT Bombay 6th International Global

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information



