Multi Hybrid Keyword Processing for Topic Decision of Unstructured Data. Jinwoo Lee, Hyoungmin Ma, Gitae Lee, Kihong Ahn, Sukyoung Kim
|
|
- Chloe Riley
- 6 years ago
- Views:
Transcription
1 Multi Hybrid Keyword Processing for Topic Decision of Unstructured Data Jinwoo Lee, Hyoungmin Ma, Gitae Lee, Kihong Ahn, Sukyoung Kim Abstract Amount of information and difficulty of the user's information selection has direct proportion relation. Also title is consists of exaggerated expression. Since, authors want to summarize about document. Therefore title is almost different from contents. If these case are more increased, offering information by simple keyword search will be reached to the limit. In this study, to solve these problems, we applied TF-IDF to extract keyword in particular documents which have scarcity words in all documents and applied LDA algorithm for to find topic about single document. Finally, we have proposed the methodologies that add description on scarcity word and topic through to extract the Trigram of the entire document. In this study, to verify the accurate of methodology, we made supervised data and compared these data with data that made by suggested methodology. D I. INTRODUCTION evelopment of WEB 2.0 environment increase diversification and complexification form and expression of the information which is made in sudden expansion of SNS. This is the important reason to the users why fail to find accurate information. Particularly, the redundancy of signification and the metaphor expression to be elements which obstruct the satisfaction of searching information. Expansion of cloud technology is possible to make videos, photos and document information of pdf file infinitely and easily. So it make more difficult to find information. Also the sentences in SNS (like twitter, facebook) are abbreviated form and unphotographic information. So it is generate the limit to find information with few keywords. These problems make the time consuming for searching information. In order to our study solves this problems which are mentioned sentence before, we apply TFIDF to extract keyword for scarcity word in documents and apply LDA algorithm to find a TOPIC in single document. Finally, we suggest additional explanation methodology of scarcity word and Topic to extract Trigram in all documents and compare result of experiment to extract Trigram for verifying accuracy of our methodology. II. RELATE RESEARCH A. TF-IDF TF IDF is the product of two statistics, word frequency and inverse document frequency. There are various ways for the extracting two values (TF, IDF). In the case of the word frequency tf(t,d), the simplest selection is to use the raw frequency of a word in a document, i.e. the number of times that word t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d) tfidf ( t, d, tf( t, d) udf ( t, (1) Lee JinWoo is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( fniko0084@gmail.com). Ma HyoungMin is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( Lee GiTae is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( mm1023@naver.com). Ahn KiHong is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( khahn@hanbat.ac.kr). Kim SuKyoung is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( kimsk@hanbat.ac.kr). 0.5 f ( t, d ) tf ( t, d ) 0.5 (2) max( fw, d ) : w d idf ( t, D log { d D : t d} (3)
2 The inverse document frequency is a measure of whether the word is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient. A high weight in TF-IDF is reached by a high word frequency (in the given document) and a low document frequency of the word in the whole collection of documents; the weights hence tend to filter out common words. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and TF-IDF) is greater than or equal to 0. As a word appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0. B. LDA(Latent Dirichlet Allocation) When there exist the parameter of any probability distribution, LDA is Generative Model of the viewpoint that generate data based on random process. If we know topic distribution of document and each words to generate probability, we can calculate specific document probability. Fig. 1. LDA s concept diagram. Latent Dirichlet Allocation given number of the M documents, it based that the documents has few existing k topic. At first, to use probability distribution at the model is as follows. In here, is given through the actual document, other potential variables can't observed. It is a potential variable which other variables can not be observed. :Follow the k dimension Dirichlet i distribution. ~ Dir( ) z ~ Mltinomial( ) distribution. : Follow the multinomial w follow generated word probability by topic that pointed by z. At that time is Dirichlet distribution and is k V matrix parameter that contain word generate probability. About topic that pointed by z w is conditioned by the word generating probability p ( w z, ). At this time, is the parameter of Dirichlet distribution and is the probability of topic k that can give with each result V which is also calculated as k V in the matrix. This model can be interpreted as follows. For each document, they have weight for number of k subject and z subject of each word that chosen in multinomial distribution of weight. Finally, real words w are selected based on specific topic. C. N-gram It is necessary to process of lexical for understanding sentences but the common grammar of language is very complex also many common users don't follow the standard grammar. There are various algorithms that used to analyze like these sentences. In these algorithms, n-gram has more fast and simple handling advantages than other algorithms. It is the language model which is possible to calculate the meaning whether it is real with the word link of number of n. III. RESEARCH : The word distribution for topic k z :The topic for the jth word in document i(index) w :The jth word in document i(index)
3 A. Basic data Basic data is made by normal people. So it is very similar to the data on web which we can see easy. To process these data, computer need many amount of preprocessing steps because the form is not defined. Also the Korea grammar has postposition, so the word's form changes to various form. As a result, we need to divide noun or find infinitives. It is more difficult than processing English. To solve these problem in this research extract verb and no use Korean parser (Komoran-1.12) to extract verb and noun. And also makes stop word dictionary to prove extracted noun and infinitives. The stop word dictionary is composed 1230 words including unknown meaningless and abstract word, article and postposition. Finally, processed documents are consists of only words by preprocessing. This data TABLE I LIST OF EXTRACTED KEYWORD WHICH USING TFIDF ALGORITHM Word TF-IDF 컨텐츠 (contents) 헤드셋 (headset) 컴퓨터 (computer) 불면증 (insomnia) 집중력 (concentration) 우울증 (melancholia) 헤어밴드 (hair bands) 긴장감 (tension) 스트레스 (stress) 스마트 (smart) is processed by TF-IDF algorithm and Topic modeling algorithm which called LDA. The data was accumulated for 3 years (2011~2013). And also ideas are composed 548 ideas in 2013, 266 ideas in 2012, and 447 ideas in Each idea are composed in their background of occurrence, necessity, technical core and scenario. Therefore 1261 data are used to this research. Next figure is value of every year's exaggerated topic. In the fig 2, the mismatching ratio between contents and title is 50% in 2013, 56% in 2012, and 38% in The document of more abstractive form show the more mismatching probability. As a result, when they saw title, they can't inference about document topic. Fig. 3. Precision and recall of topic (left-precision, right-recall). Fig. 2. A value of every year's exaggerated topic. (2011, 2012, 2013) B. Keywords extraction Each document contains their representative keywords. But it is very hard to find representative keywords in huge amount of word. By using TF-IDF algorithm, top 20 keywords are extracted
4 after preprocessing (ex. Korean parser, stop word dictionary). Table 1 is lists of extracted keywords which using TF-IDF algorithm at no.1 data. C. Topic Modeling In this chapter, we offer the result to find key word with TF-IDF's result to raise the key word's reliability. This method extract topic from each documents through topic modeling algorithm (LDA). Of course all documents already parsed and extracted noun and infinitives by morphological analyzer. To verify these keywords, we have extracted topic by supervised basic data. Total counts of verification documents are 430, also count of processed documents are 430. They are supervised data and LDA data. The standard of comparison is whether appear supervised topic in extracted topic by LDA. To measure this research's likelihood, each documents are processed EM-Algorithm 1000 times. As a result of algorithm, each document are normally included 5 keywords. These words be representative word in document. D. Clustering We can t know specific meaning of only a word. That is reason generate needs for analysis of sentence level. In this paper, we clustered word by trigram methodology for founding relation between words. This procedure that show relation words between tf-idf result and LDA result can solves problem for ambiguous word in context. Trigram expression is follows: [ n n n n n PR Ek ] [ Ek ] [ PO Ek ] (4) Clustering result of trigram can show relations between words. If frequency of trigram word has high variable, it can suppose high relation of these words. In fig 4, it shows relation of top 30 frequency words through Les Miserable Co-occurrence graph. Extracted words like table 2 are related to the middle word. In these trigram words, we select extracted topic word and directly related trigram word, and it must provide topic word as additional description word. K R( W T ) f ( w ) g(, w ) (5) n i 1 i (5) is equation of calculating relation between trigram and topic. w i is one of the trigram words that maked by topic T. f(w,t) is rate of specific word w in trigram that make by topic T. i.e. f(w,t) is binomial distribution function, n f ( w ) n( w ) n N i i 1 (6) Fig. 4. Les Miserable Co-occurrence graph of relation between words (top-sort by frequency, bottom-sort by name)
5 g(, w) P( w ) (7) We select w which maximum value in R(w T) and it is describe topic. Fig. 5. Entire process model. IV. CONCLUSION In this paper, we show that provide keyword in document to user through TF-IDF and LDA algorithm about unstructured data. Morpheme logical analysis and word stemming through stop word dictionary improve result of our procedure. Also we make supervised data for proving unsupervised data to measure precision and recall. As a result we can improve high precision. On the other hand, recall cannot reach expected point. Extracted trigram for fool recall, we also suggested a methodology for measuring relation between topic and word. In the future, we expect reached high precision and recall as adapt this methodology. There are several directions we plan to investigate in the future. One is making abstract word dictionary that impede recall. Another one is adapt trigram methodology for high quality. We expect to use this methodology for information select to any user that can easily select information when they want. [2] Wiliam B. Cavnar, John M. Trenkle, "N-Gram-Based Text Categorization ",Environmental Research Institute of Michigan P.O. Box Ann Arbor MI [3] Juan Ramos, "Using TF-IDF to Dewordine Word Relevance in Document Queries" Department of Computer Science, Rutgers University, BPO Way, Piscataway, NJ, [4] Chenghua Lin, Yulan He, Richard Everson, "Weakly Supervised Joint Sentiment-Topic Detection from Text", IEEE TRANSCATIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO.6, JUNE [5] Seungil Huh, Stephen E. Fienberg, Discriminative Topic Modeling Based on Manifold Learning, ACM Transactions on Knowledge Discovery from Data, Vol. 5 No. 4, Article 20, Publication date: February 2012 [6] Aurora Pons-Porrata, Rafael Berlanga-Llavori, Jose Ruiz-Shulcloper, Topic discovery based on text mining techniques, Information Processing and Management 43 (2007) [7] A. P. Dempster, N. M. Laird, D. B. Rubin Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Vol.39,No.1(1977),pp.1-38 REFERENCES [1] David M. Blei, Andrew Y. Ng, Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003)
Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationSINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,
More informationUniversiteit Leiden ICT in Business
Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationChapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard
Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA Alta de Waal, Jacobus Venter and Etienne Barnard Abstract Most actionable evidence is identified during the analysis phase of digital forensic investigations.
More informationLinking Task: Identifying authors and book titles in verbose queries
Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,
More informationA Comparison of Two Text Representations for Sentiment Analysis
010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationA Case Study: News Classification Based on Term Frequency
A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center
More informationBridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models
Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &
More informationIterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages
Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer
More informationThe Internet as a Normative Corpus: Grammar Checking with a Search Engine
The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a
More informationPerformance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database
Journal of Computer and Communications, 2016, 4, 79-89 Published Online August 2016 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2016.410009 Performance Analysis of Optimized
More informationDisambiguation of Thai Personal Name from Online News Articles
Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online
More informationAssignment 1: Predicting Amazon Review Ratings
Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationProduct Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments
Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &
More informationData Fusion Models in WSNs: Comparison and Analysis
Proceedings of 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1) Data Fusion s in WSNs: Comparison and Analysis Marwah M Almasri, and Khaled M Elleithy, Senior Member,
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationMatching Similarity for Keyword-Based Clustering
Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web
More informationEnhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities
Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion
More informationOn-Line Data Analytics
International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob
More informationThe Good Judgment Project: A large scale test of different methods of combining expert predictions
The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania
More informationSemi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.
Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link
More informationScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 368 373 The 6th International Conference on Current and Future Trends of Information and Communication Technologies
More informationExperiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling
Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad
More informationExperts Retrieval with Multiword-Enhanced Author Topic Model
NAACL 10 Workshop on Semantic Search Experts Retrieval with Multiword-Enhanced Author Topic Model Nikhil Johri Dan Roth Yuancheng Tu Dept. of Computer Science Dept. of Linguistics University of Illinois
More informationLecture 1: Machine Learning Basics
1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3
More informationAQUA: An Ontology-Driven Question Answering System
AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.
More informationTerm Weighting based on Document Revision History
Term Weighting based on Document Revision History Sérgio Nunes, Cristina Ribeiro, and Gabriel David INESC Porto, DEI, Faculdade de Engenharia, Universidade do Porto. Rua Dr. Roberto Frias, s/n. 4200-465
More informationMultilingual Sentiment and Subjectivity Analysis
Multilingual Sentiment and Subjectivity Analysis Carmen Banea and Rada Mihalcea Department of Computer Science University of North Texas rada@cs.unt.edu, carmen.banea@gmail.com Janyce Wiebe Department
More informationWHEN THERE IS A mismatch between the acoustic
808 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition Jeih-Weih Hung, Member,
More informationConstructing Parallel Corpus from Movie Subtitles
Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing
More informationJulia Smith. Effective Classroom Approaches to.
Julia Smith @tessmaths Effective Classroom Approaches to GCSE Maths resits julia.smith@writtle.ac.uk Agenda The context of GCSE resit in a post-16 setting An overview of the new GCSE Key features of a
More informationLongest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for
More informationTextGraphs: Graph-based algorithms for Natural Language Processing
HLT-NAACL 06 TextGraphs: Graph-based algorithms for Natural Language Processing Proceedings of the Workshop Production and Manufacturing by Omnipress Inc. 2600 Anderson Street Madison, WI 53704 c 2006
More informationMining Association Rules in Student s Assessment Data
www.ijcsi.org 211 Mining Association Rules in Student s Assessment Data Dr. Varun Kumar 1, Anupama Chadha 2 1 Department of Computer Science and Engineering, MVN University Palwal, Haryana, India 2 Anupama
More informationBooks Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny
By the End of Year 8 All Essential words lists 1-7 290 words Commonly Misspelt Words-55 working out more complex, irregular, and/or ambiguous words by using strategies such as inferring the unknown from
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationThe Role of String Similarity Metrics in Ontology Alignment
The Role of String Similarity Metrics in Ontology Alignment Michelle Cheatham and Pascal Hitzler August 9, 2013 1 Introduction Tim Berners-Lee originally envisioned a much different world wide web than
More informationDublin City Schools Mathematics Graded Course of Study GRADE 4
I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported
More informationAGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS
AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic
More informationModule 12. Machine Learning. Version 2 CSE IIT, Kharagpur
Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should
More informationLearning Disability Functional Capacity Evaluation. Dear Doctor,
Dear Doctor, I have been asked to formulate a vocational opinion regarding NAME s employability in light of his/her learning disability. To assist me with this evaluation I would appreciate if you can
More informationA Syllable Based Word Recognition Model for Korean Noun Extraction
are used as the most important terms (features) that express the document in NLP applications such as information retrieval, document categorization, text summarization, information extraction, and etc.
More informationPredicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks
Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com
More informationThe Indices Investigations Teacher s Notes
The Indices Investigations Teacher s Notes These activities are for students to use independently of the teacher to practise and develop number and algebra properties.. Number Framework domain and stage:
More informationEdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,
More informationClass-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification
Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,
More informationDistant Supervised Relation Extraction with Wikipedia and Freebase
Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational
More informationWE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT
WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationShort Text Understanding Through Lexical-Semantic Analysis
Short Text Understanding Through Lexical-Semantic Analysis Wen Hua #1, Zhongyuan Wang 2, Haixun Wang 3, Kai Zheng #4, Xiaofang Zhou #5 School of Information, Renmin University of China, Beijing, China
More informationHLTCOE at TREC 2013: Temporal Summarization
HLTCOE at TREC 2013: Temporal Summarization Tan Xu University of Maryland College Park Paul McNamee Johns Hopkins University HLTCOE Douglas W. Oard University of Maryland College Park Abstract Our team
More informationReducing Features to Improve Bug Prediction
Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science
More informationMULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
Ch 2 Test Remediation Work Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Provide an appropriate response. 1) High temperatures in a certain
More informationUsing Web Searches on Important Words to Create Background Sets for LSI Classification
Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract
More informationGuidelines for Writing an Internship Report
Guidelines for Writing an Internship Report Master of Commerce (MCOM) Program Bahauddin Zakariya University, Multan Table of Contents Table of Contents... 2 1. Introduction.... 3 2. The Required Components
More information11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation
tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each
More informationProbability and Statistics Curriculum Pacing Guide
Unit 1 Terms PS.SPMJ.3 PS.SPMJ.5 Plan and conduct a survey to answer a statistical question. Recognize how the plan addresses sampling technique, randomization, measurement of experimental error and methods
More informationTopicFlow: Visualizing Topic Alignment of Twitter Data over Time
TopicFlow: Visualizing Topic Alignment of Twitter Data over Time Sana Malik, Alison Smith, Timothy Hawes, Panagis Papadatos, Jianyu Li, Cody Dunne, Ben Shneiderman University of Maryland, College Park,
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationPostprint.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at CLEF 2013 Conference and Labs of the Evaluation Forum Information Access Evaluation meets Multilinguality, Multimodality,
More informationEnsemble Technique Utilization for Indonesian Dependency Parser
Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id
More informationTest Effort Estimation Using Neural Network
J. Software Engineering & Applications, 2010, 3: 331-340 doi:10.4236/jsea.2010.34038 Published Online April 2010 (http://www.scirp.org/journal/jsea) 331 Chintala Abhishek*, Veginati Pavan Kumar, Harish
More informationModeling user preferences and norms in context-aware systems
Modeling user preferences and norms in context-aware systems Jonas Nilsson, Cecilia Lindmark Jonas Nilsson, Cecilia Lindmark VT 2016 Bachelor's thesis for Computer Science, 15 hp Supervisor: Juan Carlos
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationPython Machine Learning
Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled
More informationDifferent Requirements Gathering Techniques and Issues. Javaria Mushtaq
835 Different Requirements Gathering Techniques and Issues Javaria Mushtaq Abstract- Project management is now becoming a very important part of our software industries. To handle projects with success
More informationhave to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,
A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationTABE 9&10. Revised 8/2013- with reference to College and Career Readiness Standards
TABE 9&10 Revised 8/2013- with reference to College and Career Readiness Standards LEVEL E Test 1: Reading Name Class E01- INTERPRET GRAPHIC INFORMATION Signs Maps Graphs Consumer Materials Forms Dictionary
More informationRule Learning With Negation: Issues Regarding Effectiveness
Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationProcedia - Social and Behavioral Sciences 226 ( 2016 ) 27 34
Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 226 ( 2016 ) 27 34 29th World Congress International Project Management Association (IPMA) 2015, IPMA WC
More informationMathematics subject curriculum
Mathematics subject curriculum Dette er ei omsetjing av den fastsette læreplanteksten. Læreplanen er fastsett på Nynorsk Established as a Regulation by the Ministry of Education and Research on 24 June
More informationA study of speaker adaptation for DNN-based speech synthesis
A study of speaker adaptation for DNN-based speech synthesis Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, Simon King The Centre for Speech Technology Research (CSTR) University of Edinburgh,
More informationAnalyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio
SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State
More informationA Comparison of Standard and Interval Association Rules
A Comparison of Standard and Association Rules Choh Man Teng cmteng@ai.uwf.edu Institute for Human and Machine Cognition University of West Florida 4 South Alcaniz Street, Pensacola FL 325, USA Abstract
More informationMining Topic-level Opinion Influence in Microblog
Mining Topic-level Opinion Influence in Microblog Daifeng Li Dept. of Computer Science and Technology Tsinghua University ldf3824@yahoo.com.cn Jie Tang Dept. of Computer Science and Technology Tsinghua
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationClickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models
Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Jianfeng Gao Microsoft Research One Microsoft Way Redmond, WA 98052 USA jfgao@microsoft.com Xiaodong He Microsoft
More informationAustralian Journal of Basic and Applied Sciences
AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean
More informationThe 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X
The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,
More informationComment-based Multi-View Clustering of Web 2.0 Items
Comment-based Multi-View Clustering of Web 2.0 Items Xiangnan He 1 Min-Yen Kan 1 Peichu Xie 2 Xiao Chen 3 1 School of Computing, National University of Singapore 2 Department of Mathematics, National University
More informationA Bayesian Learning Approach to Concept-Based Document Classification
Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors
More informationCross-lingual Short-Text Document Classification for Facebook Comments
2014 International Conference on Future Internet of Things and Cloud Cross-lingual Short-Text Document Classification for Facebook Comments Mosab Faqeeh, Nawaf Abdulla, Mahmoud Al-Ayyoub, Yaser Jararweh
More informationEfficient Online Summarization of Microblogging Streams
Efficient Online Summarization of Microblogging Streams Andrei Olariu Faculty of Mathematics and Computer Science University of Bucharest andrei@olariu.org Abstract The large amounts of data generated
More informationarxiv:cmp-lg/ v1 22 Aug 1994
arxiv:cmp-lg/94080v 22 Aug 994 DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS Fernando Pereira AT&T Bell Laboratories 600 Mountain Ave. Murray Hill, NJ 07974 pereira@research.att.com Abstract We describe and
More informationRadius STEM Readiness TM
Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and
More informationLet s think about how to multiply and divide fractions by fractions!
Let s think about how to multiply and divide fractions by fractions! June 25, 2007 (Monday) Takehaya Attached Elementary School, Tokyo Gakugei University Grade 6, Class # 1 (21 boys, 20 girls) Instructor:
More informationLip reading: Japanese vowel recognition by tracking temporal changes of lip shape
Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,
More information2 Mitsuru Ishizuka x1 Keywords Automatic Indexing, PAI, Asserted Keyword, Spreading Activation, Priming Eect Introduction With the increasing number o
PAI: Automatic Indexing for Extracting Asserted Keywords from a Document 1 PAI: Automatic Indexing for Extracting Asserted Keywords from a Document Naohiro Matsumura PRESTO, Japan Science and Technology
More informationAs a high-quality international conference in the field
The New Automated IEEE INFOCOM Review Assignment System Baochun Li and Y. Thomas Hou Abstract In academic conferences, the structure of the review process has always been considered a critical aspect of
More informationSummarizing Contrastive Themes via Hierarchical Non-Parametric Processes
Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes Zhaochun Ren z.ren@uva.nl Maarten de Rijke derijke@uva.nl University of Amsterdam, Amsterdam, The Netherlands ABSTRACT Given a topic
More informationCompositional Semantics
Compositional Semantics CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Words, bag of words Sequences Trees Meaning Representing Meaning An important goal of NLP/AI: convert natural language
More informationArtificial Neural Networks written examination
1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14
More informationA DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF GRAPH DATA
International Journal of Semantic Computing Vol. 5, No. 4 (2011) 433 462 c World Scientific Publishing Company DOI: 10.1142/S1793351X1100133X A DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF
More informationDetecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011
Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Cristian-Alexandru Drăgușanu, Marina Cufliuc, Adrian Iftene UAIC: Faculty of Computer Science, Alexandru Ioan Cuza University,
More information