Bachelor thesis research plan
MapReduce and word associations
Ruben Nijveld (0609781) <rubennijveld@student.ru.nl>

1 Introduction

Word associations can be used to provide users of an information retrieval system with suggestions based on the current query input. This requires, however, that the suggestions are of high enough quality to be useful. Providing useful associations in turn requires analyzing large quantities of data: the data sets needed for accurate word associations are so large that a single machine cannot supply the necessary computing power, and a cluster of computers must be used instead.

Using a cluster introduces problems of its own, the most obvious being how to distribute the workload among multiple machines and how to combine their results into one final result. The MapReduce [3] algorithm developed at Google targets exactly these two problems. However, the question remains whether, and how, the calculation of word associations can be done effectively on a cluster using this algorithm.

2 Research Question

In this research project I intend to examine whether MapReduce is a feasible and effective method for analyzing large data sets for word associations. The following question will be central to my research:

What advantages and disadvantages does using the MapReduce algorithm have when applied to an information retrieval task concerning word associations?

Some additional subquestions, to be answered first, may help in answering this research question:

1. What metrics can be used to associate words?
2. What adaptations does MapReduce require of algorithms used in combination with it?
3. Which of these association metrics can be used in combination with MapReduce?
4. How do these association metrics hold up in a practical setting?

3 Relevance

Word associations may be used in several areas of expertise. One example is helping people with aphasia remember which words they want to use and how words relate. Another example is providing suggestions to people searching with an internet search engine; consider, for instance, users who do not know the exact terminology of what they are looking for and therefore have difficulty finding relevant documents.

4 Theoretical scope

4.1 Word Associations

Word associations are typically determined by analyzing a large corpus of documents. Such a large collection is needed mainly because of the large variety in language use and the large number of words in many languages [6]. Determining whether two words are in some way associated requires defining both when two words count as associated and what an association actually means. Associations can be determined either with a context (context-specific associations) or without one (context-free associations). Examples of relevant techniques include automatic query expansion [1], Pointwise Mutual Information [2], skip-gram modeling [5] and Vector Space Models [7].

4.2 MapReduce

MapReduce [3] is an algorithm for distributing and combining workloads over (large) clusters of computers, designed to scale extremely well. MapReduce takes a set of key/value pairs as input and applies a map function to them, producing a new set of intermediate key/value pairs that the framework uses to distribute the workload. Finally, a reduce function is applied that merges key/value pairs, so that the total number of pairs either stays the same or shrinks. A number of implementations of the MapReduce algorithm exist; Hadoop [4] is one such implementation, written in Java.
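As a concrete illustration of one of the association metrics named in section 4.1, the following sketch computes Pointwise Mutual Information [2] over a toy corpus. The corpus, the sentence-level co-occurrence window and the whitespace tokenization are illustrative assumptions only; the actual research would use a large document collection and a more careful definition of co-occurrence.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus; a placeholder for the large document collection the project targets.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Count, per sentence, which words occur and which unordered pairs co-occur.
word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    tokens = set(sentence.split())  # presence per sentence, not raw frequency
    word_counts.update(tokens)
    pair_counts.update(frozenset(p) for p in combinations(sorted(tokens), 2))

n = len(corpus)  # number of "contexts" (here: sentences)

def pmi(w1, w2):
    """Pointwise Mutual Information: log2( P(w1, w2) / (P(w1) * P(w2)) ).

    Positive values indicate the words co-occur more often than chance,
    negative values less often than chance.
    """
    p_joint = pair_counts[frozenset((w1, w2))] / n
    return math.log2(p_joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

print(pmi("cat", "mat"))  # > 0: "mat" only ever occurs together with "cat"
print(pmi("the", "cat"))  # = 0: "the" is in every sentence, so it carries no information
```

On this tiny corpus the scores are unreliable, which is precisely the point made above: accurate association scores require far more data than a toy example provides.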
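The map/shuffle/reduce structure described in section 4.2 can be sketched in plain Python without any cluster framework. The helpers below are a hypothetical single-machine simulation (in Hadoop, the grouping and distribution happen across machines and are handled by the framework), shown with the canonical word-count example from Dean and Ghemawat [3].

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-supplied mapper to every input key/value record."""
    for key, value in records:
        yield from mapper(key, value)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key and its list of values."""
    for key, values in groups:
        yield key, reducer(key, values)

# Word count: for each word emit (word, 1); the reducer sums the ones.
def wc_mapper(doc_id, text):
    for word in text.split():
        yield word, 1

def wc_reducer(word, counts):
    return sum(counts)

docs = [("d1", "the cat sat"), ("d2", "the dog sat")]
counts = dict(reduce_phase(shuffle(map_phase(docs, wc_mapper)), wc_reducer))
print(counts)  # "the" and "sat" occur twice, "cat" and "dog" once
```

The same mapper/reducer pair could, in principle, emit word pairs instead of single words, which is one route towards computing association counts with MapReduce; whether that works well in practice is exactly what subquestions 3 and 4 ask.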
5 Method

Research will begin by detailing the information available on both MapReduce and the word association analysis algorithms, using the literature already available. This allows for a better understanding of the problem. In this phase (1) I want to answer the first and second subquestions.

In the next phase (2) of my research I want to combine the knowledge of both subject areas and examine how one can be applied to the other; the method for doing so will be the answer to my third subquestion. Using this, I want to construct a simple implementation in Hadoop to test the idea that MapReduce can speed up this process. I will then compare this Hadoop implementation with existing implementations of the algorithms that do not use Hadoop. This phase (3) is to result in an answer to the fourth subquestion.

All information from the three phases is then combined to form an answer to the research question posed above. This is the final phase (4) of my research project, which will mostly involve writing down my results.

5.1 Possible problems

Given the dependence on a Hadoop cluster in the final part of my research, it is best to have a backup plan. If using the Hadoop cluster is impossible or causes other problems, the following option is available: an analysis of different methods for applying algorithms on clusters. As MapReduce is not the only cluster algorithm available, a comparison can be made between these different algorithms.

6 Schedule

This schedule gives a rough indication of the time planned for this research project.

Week                                  Hours  Spent  Activity
7  - February 13th                    10     10     Research planning
8  - February 20th                    15     16     Research planning - Provisional research plan (24th)
9  - February 27th - Start of phase 1 15     16     Literature
10 - March 5th                        15
11 - March 12th                       5             Process research plan feedback - Final research plan (16th)
12 - March 19th                       5             Writing
13 - March 26th                       5             Writing
14 - April 2nd - Start of phase 2     5             Writing
15 - April 9th                        5             Writing
16 - April 16th                       5             Writing - First draft thesis (16th)
17 - April 23rd - Start of phase 3    15            Implementation
18 - April 30th                       15            Implementation
19 - May 7th                          15            Implementation
20 - May 14th                         10            Implementation
                                      5             Writing
21 - May 21st - Start of phase 4      15            Writing
22 - May 28th                         15            Writing
23 - June 4th                         15            Writing
24 - June 11th                        10            Grace period
                                      5             Presentation - Second draft thesis (11th)
25 - June 18th                        5             Presentation
                                      10            Grace period
26 - June 25th                                      Presentation slides (25th) - Final thesis (25th) - Presentations (26th)
Total                                 280

References

[1] J. Bhogal, A. Macfarlane, and P. Smith. A review of ontology based query expansion. Information Processing and Management, 43(4):866-886, 2007.

[2] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. Measuring semantic similarity between words using web search engines. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 757-766, New York, NY, USA, 2007. ACM.

[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51:107-113, January 2008.

[4] The Apache Software Foundation. Apache Hadoop, March 2012.

[5] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.

[6] Peter Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Luc De Raedt and Peter Flach, editors, Machine Learning: ECML 2001, volume 2167 of Lecture Notes in Computer Science, pages 491-502. Springer Berlin / Heidelberg, 2001.

[7] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141-188, March 2010.