Bachelor thesis research plan

MapReduce and word associations

Ruben Nijveld (0609781) <rubennijveld@student.ru.nl>

1 Introduction

Word associations can be used to offer users suggestions for their current query in an information retrieval system. This, however, requires suggestions of high enough quality to be useful to the user. Accurate word associations can only be obtained by analyzing large data sets, which in turn requires a large amount of computing power. As a single system cannot provide this much power, a cluster of computers has to be used.

Using a cluster does create some problems. The most obvious ones are distributing the workload among multiple machines and combining their results into one final result. The MapReduce [3] algorithm, developed at Google, targets exactly these two problems. However, the question remains whether and how word associations can be calculated effectively on a cluster that makes use of this algorithm.

2 Research Question

In this research project I intend to check whether MapReduce is a feasible and effective method of analyzing large data sets for word associations. The following question will be central to my research:

What advantages and disadvantages does using the MapReduce algorithm have when applied to an information retrieval task concerning word associations?

Answering the following subquestions first may help in answering this research question:

1. What metrics can be used to associate words?
2. What adaptations does MapReduce require for algorithms to be used in combination with it?
3. Which of these association metrics can be used in combination with MapReduce?
4. How do these association metrics hold up in a practical situation?

3 Relevance

Word associations may be used in several areas of expertise. One such example is helping people with aphasia to remember what words they want to use and how words relate. Another example is providing suggestions to people using an internet search engine, for instance people who do not know the exact terminology for what they are looking for and thus have difficulty finding relevant documents.

4 Theoretical scope

4.1 Word Associations

Typically, word associations are determined by analyzing a large corpus of documents. Such a large collection is required mainly because of the large variety in language use and the large number of words in many languages [6]. Determining whether two words are in some way associated requires defining when two words count as associated, as well as determining what an association actually means. Associations can be determined either with a context (context-specific associations) or without one (context-free associations). Examples of algorithms include automatic query expansion [1], Pointwise Mutual Information [2], skip-gram modeling [5] and Vector Space Models [7].

4.2 MapReduce

MapReduce [3] is an algorithm for distributing and combining workloads over (large) clusters of computers, and it is intended to scale extremely well. MapReduce works by taking a set of key/value pairs as input and applying a Map function to them. This produces a new set of intermediate key/value pairs, which the algorithm can use to distribute the workload. Finally, a Reduce function is applied that merges the values sharing an intermediate key, such that the total number of key/value pairs either remains the same or is reduced.

A number of implementations of the MapReduce algorithm exist. Hadoop [4] is an implementation of the MapReduce algorithm written in Java.

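As a rough illustration of the Map, grouping and Reduce steps described in section 4.2, the following Python sketch simulates them in memory on a toy corpus and then derives a Pointwise Mutual Information score from the resulting counts. It is only a sketch under stated assumptions: it does not use Hadoop, the function names (map_fn, shuffle, reduce_fn, map_reduce, pmi) and the four example documents are hypothetical, and counting each word or word pair once per document is just one possible definition of co-occurrence.

```python
import math
from collections import defaultdict
from itertools import combinations


def map_fn(doc_id, text):
    """Map: emit (key, 1) for each distinct word and each distinct word pair in one document."""
    words = sorted(set(text.lower().split()))
    for word in words:
        yield (word,), 1
    for a, b in combinations(words, 2):
        yield (a, b), 1  # a < b alphabetically, because words is sorted


def shuffle(pairs):
    """Group intermediate values by key (a real cluster framework does this step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: merge all values that share a key into a single count."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three steps sequentially; on a cluster they would run in parallel."""
    intermediate = [kv for doc_id, text in documents.items()
                    for kv in map_fn(doc_id, text)]
    return dict(reduce_fn(key, values)
                for key, values in shuffle(intermediate).items())


def pmi(counts, n_docs, a, b):
    """Pointwise mutual information of two words, based on document frequencies."""
    a, b = sorted((a, b))
    p_ab = counts.get((a, b), 0) / n_docs
    if p_ab == 0:
        return float("-inf")  # the words never occur in the same document
    p_a = counts[(a,)] / n_docs
    p_b = counts[(b,)] / n_docs
    return math.log2(p_ab / (p_a * p_b))


# Hypothetical toy corpus, for illustration only.
documents = {
    "d1": "hadoop runs mapreduce jobs",
    "d2": "mapreduce jobs scale on a cluster",
    "d3": "a cluster runs hadoop",
    "d4": "word associations help search",
}
counts = map_reduce(documents)
print(counts[("jobs", "mapreduce")])                  # 2
print(pmi(counts, len(documents), "mapreduce", "jobs"))  # 1.0
```

In this sketch the Map step only ever looks at a single document and the Reduce step only at a single key, which is what would allow both steps to be spread over several machines. Whether a given association metric can be decomposed in this way is the subject of the third subquestion.
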
5 Method

Research will begin with a survey of the existing literature on both MapReduce and word association algorithms, which allows for a better understanding of the problem. In this phase (phase 1) I want to answer the first and second subquestions.

In the next phase (phase 2) I want to combine the knowledge of both subject areas and see how one can be applied to the other. The method for doing this will be the answer to my third subquestion.

Using this method, I want to construct a simple implementation on Hadoop in order to test the idea that MapReduce can speed up the calculation of word associations. I will then compare this Hadoop implementation with existing implementations of the same algorithms that do not use Hadoop. This phase (phase 3) should result in an answer to the fourth subquestion.

In the final phase (phase 4), all information from the three previous phases is combined to form an answer to the research question posed earlier. This phase will mostly involve writing down my results.

5.1 Possible problems

Given the dependence on a Hadoop cluster in the final part of my research, it is best to have a backup plan. If using the Hadoop cluster is impossible or causes other problems, the following option is available: an analysis of different methods for applying algorithms on clusters. As MapReduce is not the only cluster algorithm available, a comparison can be made between these different algorithms.

6 Schedule

This schedule gives a rough indication of the time planned for this research project.

Week                Hours  Spent  Activity
7 - February 13th   10     10     Research planning
8 - February 20th   15     16     Research planning
                                  - Provisional research plan (24th)
9 - February 27th   15     16     Literature
                                  - Start of phase 1
10 - March 5th      15
11 - March 12th     5             Process research plan feedback
                                  - Final research plan (16th)
12 - March 19th     5             Writing
13 - March 26th     5             Writing
14 - April 2nd      5             Writing
                                  - Start of phase 2
15 - April 9th      5             Writing
16 - April 16th     5             Writing
                                  - First draft thesis (16th)
17 - April 23rd     15            Implementation
                                  - Start of phase 3
18 - April 30th     15            Implementation
19 - May 7th        15            Implementation
20 - May 14th       10            Implementation
                    5             Writing
21 - May 21st       15            Writing
                                  - Start of phase 4
22 - May 28th       15            Writing
23 - June 4th       15            Writing
24 - June 11th      10            Grace period
                    5             Presentation
                                  - Second draft thesis (11th)
25 - June 18th      5             Presentation
                    10            Grace period
26 - June 25th                    - Presentation slides (25th)
                                  - Final thesis (25th)
                                  - Presentations (26th)
Total               280

References

[1] J. Bhogal, A. Macfarlane, and P. Smith. A review of ontology based query expansion. Information Processing and Management, 43(4):866-886, 2007.

[2] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. Measuring semantic similarity between words using web search engines. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 757-766, New York, NY, USA, 2007. ACM.

[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51:107-113, January 2008.

[4] The Apache Software Foundation. Apache Hadoop, March 2012.

[5] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.

[6] Peter Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Luc De Raedt and Peter Flach, editors, Machine Learning: ECML 2001, volume 2167 of Lecture Notes in Computer Science, pages 491-502. Springer Berlin / Heidelberg, 2001.

[7] Peter D Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(arXiv:1003.1141):8. mult. p, Mar 2010.