Multi Hybrid Keyword Processing for Topic Decision of Unstructured Data

Jinwoo Lee, Hyoungmin Ma, Gitae Lee, Kihong Ahn, Sukyoung Kim


Abstract

The amount of available information is directly proportional to the difficulty users have in selecting the information they need. In addition, titles often consist of exaggerated expressions, because authors try to summarize the whole document in them, so the title frequently differs from the content. As such cases increase, providing information through simple keyword search will reach its limit. In this study, to address these problems, we apply TF-IDF to extract keywords from individual documents, that is, words that are scarce across the whole document collection, and we apply the LDA algorithm to find the topic of a single document. Finally, we propose a methodology that adds descriptions to scarce words and topics by extracting trigrams from the entire document set. To verify the accuracy of the methodology, we built supervised data and compared it with the data produced by the proposed methodology.

I. INTRODUCTION

The development of the Web 2.0 environment, together with the sudden expansion of SNS, has diversified and complicated the form and expression of information. This is an important reason why users fail to find accurate information. In particular, redundant meanings and metaphorical expressions are elements that reduce satisfaction with search results. The expansion of cloud technology makes it possible to create videos, photos, and documents such as PDF files easily and without limit, which makes information even harder to find. Sentences on SNS (such as Twitter and Facebook) are also abbreviated and are not literal descriptions of their content, so searching with a few keywords soon reaches its limit. These problems make searching for information time consuming.
To solve the problems described above, we apply TF-IDF to extract keywords, namely words that are scarce across documents, and we apply the LDA algorithm to find the topic of a single document. Finally, we propose a methodology that adds explanations to scarce words and topics by extracting trigrams from all documents, and we compare the experimental results to verify the accuracy of the methodology.

Lee JinWoo is with the Department of Computer Engineering, Hanbat National University, Daejeon, South Korea (fniko0084@gmail.com).
Ma HyoungMin is with the Department of Computer Engineering, Hanbat National University, Daejeon, South Korea.
Lee GiTae is with the Department of Computer Engineering, Hanbat National University, Daejeon, South Korea (mm1023@naver.com).
Ahn KiHong is with the Department of Computer Engineering, Hanbat National University, Daejeon, South Korea (khahn@hanbat.ac.kr).
Kim SuKyoung is with the Department of Computer Engineering, Hanbat National University, Daejeon, South Korea (kimsk@hanbat.ac.kr).

II. RELATED RESEARCH

A. TF-IDF

TF-IDF is the product of two statistics, word frequency and inverse document frequency:

\[ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \tag{1} \]

There are various ways to determine the two values (TF, IDF). For the word frequency tf(t,d), the simplest choice is the raw frequency of a word in a document, i.e. the number of times that word t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d). Alternatively, the raw frequency can be normalized by the frequency of the most frequent word in the document:

\[ \mathrm{tf}(t, d) = 0.5 + \frac{0.5 \times f(t, d)}{\max\{ f(w, d) : w \in d \}} \tag{2} \]

The inverse document frequency is defined as

\[ \mathrm{idf}(t, D) = \log \frac{|D|}{|\{ d \in D : t \in d \}|} \tag{3} \]
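The TF-IDF definitions above can be illustrated with a minimal Python sketch. It assumes each document is already a list of preprocessed tokens; the function names and the `k` parameter are illustrative and do not come from the paper, and returning 0 for a word that occurs in no document is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter


def augmented_tf(term, doc):
    """Eq. (2): 0.5 + 0.5 * f(t, d) / max frequency of any word in d."""
    counts = Counter(doc)
    if not counts:
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())


def idf(term, docs):
    """Eq. (3): log(|D| / number of documents containing the term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # assumption: unseen words get weight 0
    return math.log(len(docs) / containing)


def tfidf(term, doc, docs):
    """Eq. (1): product of term frequency and inverse document frequency."""
    return augmented_tf(term, doc) * idf(term, docs)


def top_keywords(doc, docs, k=20):
    """Rank the distinct words of one document by TF-IDF, highest first."""
    scores = {t: tfidf(t, doc, docs) for t in set(doc)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

A word that appears in every document gets an idf of 0 and therefore drops to the bottom of the ranking, which is the filtering effect on common words that TF-IDF is used for here.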

The inverse document frequency is a measure of whether the word is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient. A high weight in TF-IDF is reached by a high word frequency (in the given document) and a low document frequency of the word in the whole collection of documents; the weights hence tend to filter out common words. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and TF-IDF) is greater than or equal to 0. As a word appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and TF-IDF closer to 0.

B. LDA (Latent Dirichlet Allocation)

LDA is a generative model: given the parameters of certain probability distributions, it describes how data is generated by a random process. If we know the topic distribution of a document and the probability with which each topic generates each word, we can calculate the probability of a specific document.

Fig. 1. LDA's concept diagram.

Latent Dirichlet Allocation assumes that, given M documents, each document is composed of a small number of the k existing topics. The probability distributions used in the model are as follows. Here, only the words w are given by the actual documents; all other variables are latent and cannot be observed.

\theta_i follows a k-dimensional Dirichlet distribution: \theta_i \sim \mathrm{Dir}(\alpha).
z follows a multinomial distribution: z \sim \mathrm{Multinomial}(\theta).
w follows the word-generation probability of the topic indicated by z: p(w \mid z, \beta).

Here \alpha is the parameter of the Dirichlet distribution, and \beta is a k \times V matrix parameter that contains the word-generation probabilities, one row for each of the k topics and one column for each of the V words.
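The generative process described above can be sketched in Python using only the standard library. This is a toy illustration of sampling from the model, not the inference procedure used in the study; the function names, the vocabulary, and the example values of alpha and beta are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing Gamma(alpha_j, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document under the LDA generative process.

    alpha: list of k Dirichlet parameters.
    beta:  k x V matrix; beta[z][v] is p(word v | topic z).
    Returns the topic weights theta and a list of (topic, word) pairs.
    """
    theta = sample_dirichlet(alpha, rng)  # theta ~ Dir(alpha)
    topics = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]    # z ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]   # w ~ p(w | z, beta)
        words.append((z, w))
    return theta, words


# Toy example: two topics over a four-word vocabulary.
vocab = ["stress", "insomnia", "headset", "computer"]
beta = [[0.9, 0.1, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document([0.5, 0.5], beta, vocab, 20, random.Random(0))
```

Fitting LDA runs this process in reverse: only the words are observed, and theta, z, and beta have to be estimated from them.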
This model can be interpreted as follows. Each document has weights over the k topics, and the topic z of each word is drawn from a multinomial distribution with those weights. Finally, the actual word w is selected according to that topic. In this notation, \beta_k is the word distribution for topic k, z_{ij} is the topic for the j-th word in document i, and w_{ij} is the j-th word in document i.

C. N-gram

Understanding sentences requires lexical processing, but the grammar of a natural language is very complex, and many ordinary users do not follow standard grammar. Various algorithms are used to analyze such sentences. Among them, the n-gram model has the advantage of being faster and simpler to handle than other algorithms. It is a language model that estimates whether a sequence of n linked words is plausible.

III. RESEARCH


A. Basic data

The basic data was written by ordinary people, so it is very similar to the data commonly found on the web. Because its form is not defined, a computer needs many preprocessing steps to handle it. In addition, Korean grammar uses postpositions, so a word can take many different forms. As a result, nouns must be separated and the infinitive forms of verbs must be found, which is more difficult than processing English. To address this, we use a Korean parser (Komoran-1.12) to extract verbs and nouns. We also built a stop word dictionary to check the extracted nouns and infinitives. The stop word dictionary consists of 1230 words, including unknown, meaningless, and abstract words, articles, and postpositions. After preprocessing, each document consists only of the remaining words. This data is then processed by the TF-IDF algorithm and by the topic modeling algorithm LDA.

TABLE I
LIST OF KEYWORDS EXTRACTED USING THE TF-IDF ALGORITHM

Word                        TF-IDF
컨텐츠 (contents)
헤드셋 (headset)
컴퓨터 (computer)
불면증 (insomnia)
집중력 (concentration)
우울증 (melancholia)
헤어밴드 (hair bands)
긴장감 (tension)
스트레스 (stress)
스마트 (smart)

The data was accumulated over 3 years (2011~2013): 548 ideas in 2013, 266 ideas in 2012, and 447 ideas in 2011. Each idea is composed of its background of occurrence, necessity, technical core, and scenario. In total, 1261 data items are used in this research. Fig. 2 shows the proportion of exaggerated topics for each year: the mismatch ratio between content and title is 50% in 2013, 56% in 2012, and 38% in 2011. The more abstract the document, the higher the probability of a mismatch. As a result, readers who see only the title cannot infer the topic of the document.

Fig. 2. A value of every year's exaggerated topic (2011, 2012, 2013).
Fig. 3. Precision and recall of topic (left: precision, right: recall).

B. Keywords extraction

Each document contains its representative keywords, but they are very hard to find among a huge number of words. Using the TF-IDF algorithm, the top 20 keywords are extracted after preprocessing (Korean parser, stop word dictionary).
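The preprocessing described in Section A (keeping nouns and verbs, then removing stop words) can be sketched as follows. The sketch assumes the morphological analyzer has already produced (word, tag) pairs; the tag names `NOUN` and `VERB`, the function name, and the sample tokens are hypothetical and do not reflect the actual output format of Komoran.

```python
# Hypothetical part-of-speech labels; a real analyzer uses its own tag set.
KEEP_TAGS = {"NOUN", "VERB"}


def preprocess(tagged_tokens, stop_words):
    """Keep nouns and verbs, then drop anything in the stop word dictionary."""
    return [
        word
        for word, tag in tagged_tokens
        if tag in KEEP_TAGS and word not in stop_words
    ]


# Toy input standing in for analyzer output.
tagged = [
    ("스트레스", "NOUN"),
    ("는", "POSTPOSITION"),
    ("것", "NOUN"),
    ("줄이다", "VERB"),
]
stop_words = {"것"}  # an abstract word that carries no topic information
tokens = preprocess(tagged, stop_words)
```

The resulting token lists are the input that the TF-IDF and LDA steps operate on.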


Table 1 lists the keywords extracted with the TF-IDF algorithm from data item no. 1.

C. Topic Modeling

In this section, we extract topics and combine them with the TF-IDF result to raise the reliability of the keywords. The method extracts a topic from each document through the topic modeling algorithm (LDA). All documents have already been parsed, and their nouns and infinitives have been extracted by the morphological analyzer. To verify the keywords, we also extracted topics from the basic data by supervision. There are 430 verification documents, and the same 430 documents were processed by LDA, giving supervised data and LDA data. The standard of comparison is whether the supervised topic appears among the topics extracted by LDA. To measure the likelihood, the EM algorithm was run 1000 times for each document. As a result, each document normally yields 5 keywords, which serve as the representative words of the document.

D. Clustering

The specific meaning of a word cannot be known from the word alone, which is why analysis at the sentence level is needed. In this paper, we cluster words using trigrams to find the relations between words. Showing the related words between the TF-IDF result and the LDA result in this way can resolve words that are ambiguous in context. A trigram is expressed as follows:

\[ [PR^{n} E_k^{n}] \; [E_k^{n}] \; [PO^{n} E_k^{n}] \tag{4} \]

where E_k^n is the middle word and PR and PO denote the words that precede and follow it.

The clustering result of the trigrams shows the relations between words. If a trigram has a high frequency, its words can be assumed to be strongly related. Fig. 4 shows the relations among the 30 most frequent words using a Les Misérables co-occurrence graph. Extracted words such as those in Table 2 are related to the middle word. Among these trigram words, we select the extracted topic word and the trigram words directly related to it, and these are provided as additional description words for the topic.

\[ R(W \mid T) = \sum_{i=1}^{n} f(w_i) \, g(\theta, w_i) \tag{5} \]

Equation (5) calculates the relation between the trigrams and the topic. w_i is one of the trigram words generated by topic T, and f(w_i) is the rate of the specific word w_i among the trigrams generated by topic T; that is, f is a binomial distribution function:

\[ f(w_i) = \frac{n(w_i)}{N}, \qquad N = \sum_{i=1}^{n} n(w_i) \tag{6} \]

Fig. 4. Les Misérables co-occurrence graph of relations between words (top: sorted by frequency, bottom: sorted by name).
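A minimal Python sketch of the trigram step is given below. It collects the trigrams whose middle word is a given keyword and computes the rate f(w_i) of each neighboring word, assuming f(w_i) is the relative frequency n(w_i)/N over all neighbors found. The function names and sample tokens are illustrative, and the weighting term g from (5) is not modeled here.

```python
from collections import Counter


def trigrams_around(tokens, keyword):
    """Return (previous, keyword, next) triples for each inner occurrence."""
    return [
        (tokens[i - 1], tokens[i], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == keyword
    ]


def neighbor_rates(documents, keyword):
    """Relative frequency of each word adjacent to the keyword in its trigrams."""
    counts = Counter()
    for tokens in documents:
        for prev_word, _, next_word in trigrams_around(tokens, keyword):
            counts[prev_word] += 1
            counts[next_word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Toy token lists standing in for preprocessed documents.
docs = [
    ["reduce", "stress", "headset"],
    ["measure", "stress", "headset"],
]
rates = neighbor_rates(docs, "stress")
```

Words with a high rate are the candidates for describing the keyword, since they co-occur with it most consistently.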

\[ g(\theta, w) = P(w \mid \theta) \tag{7} \]

We select the word w that has the maximum value of R(w | T) and use it to describe the topic.

Fig. 5. Entire process model.

IV. CONCLUSION

In this paper, we showed how to provide users with the keywords of a document by applying the TF-IDF and LDA algorithms to unstructured data. Morphological analysis and word filtering through the stop word dictionary improved the results of the procedure. We also built supervised data to check the unsupervised results and to measure precision and recall. As a result, we achieved high precision. Recall, on the other hand, did not reach the expected level. To raise recall by extracting trigrams, we also suggested a methodology for measuring the relation between a topic and a word. We expect that adopting this methodology will lead to both high precision and high recall in the future.

There are several directions we plan to investigate in the future. One is building a dictionary of the abstract words that impede recall. Another is adapting the trigram methodology for higher quality. We expect this methodology to help any user select the information they want easily.
