Multi Hybrid Keyword Processing for Topic Decision of Unstructured Data. Jinwoo Lee, Hyoungmin Ma, Gitae Lee, Kihong Ahn, Sukyoung Kim

Size: px
Start display at page:

Download "Multi Hybrid Keyword Processing for Topic Decision of Unstructured Data. Jinwoo Lee, Hyoungmin Ma, Gitae Lee, Kihong Ahn, Sukyoung Kim"

Transcription

1 Multi Hybrid Keyword Processing for Topic Decision of Unstructured Data Jinwoo Lee, Hyoungmin Ma, Gitae Lee, Kihong Ahn, Sukyoung Kim Abstract Amount of information and difficulty of the user's information selection has direct proportion relation. Also title is consists of exaggerated expression. Since, authors want to summarize about document. Therefore title is almost different from contents. If these case are more increased, offering information by simple keyword search will be reached to the limit. In this study, to solve these problems, we applied TF-IDF to extract keyword in particular documents which have scarcity words in all documents and applied LDA algorithm for to find topic about single document. Finally, we have proposed the methodologies that add description on scarcity word and topic through to extract the Trigram of the entire document. In this study, to verify the accurate of methodology, we made supervised data and compared these data with data that made by suggested methodology. D I. INTRODUCTION evelopment of WEB 2.0 environment increase diversification and complexification form and expression of the information which is made in sudden expansion of SNS. This is the important reason to the users why fail to find accurate information. Particularly, the redundancy of signification and the metaphor expression to be elements which obstruct the satisfaction of searching information. Expansion of cloud technology is possible to make videos, photos and document information of pdf file infinitely and easily. So it make more difficult to find information. Also the sentences in SNS (like twitter, facebook) are abbreviated form and unphotographic information. So it is generate the limit to find information with few keywords. These problems make the time consuming for searching information. In order to our study solves this problems which are mentioned sentence before, we apply TFIDF to extract keyword for scarcity word in documents and apply LDA algorithm to find a TOPIC in single document. Finally, we suggest additional explanation methodology of scarcity word and Topic to extract Trigram in all documents and compare result of experiment to extract Trigram for verifying accuracy of our methodology. II. RELATE RESEARCH A. TF-IDF TF IDF is the product of two statistics, word frequency and inverse document frequency. There are various ways for the extracting two values (TF, IDF). In the case of the word frequency tf(t,d), the simplest selection is to use the raw frequency of a word in a document, i.e. the number of times that word t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d) tfidf ( t, d, tf( t, d) udf ( t, (1) Lee JinWoo is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( fniko0084@gmail.com). Ma HyoungMin is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( Lee GiTae is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( mm1023@naver.com). Ahn KiHong is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( khahn@hanbat.ac.kr). Kim SuKyoung is with Department of Computer Engineering, University of National Hanbat, Daejoen, South Korea ( kimsk@hanbat.ac.kr). 0.5 f ( t, d ) tf ( t, d ) 0.5 (2) max( fw, d ) : w d idf ( t, D log { d D : t d} (3)

2 The inverse document frequency is a measure of whether the word is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient. A high weight in TF-IDF is reached by a high word frequency (in the given document) and a low document frequency of the word in the whole collection of documents; the weights hence tend to filter out common words. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and TF-IDF) is greater than or equal to 0. As a word appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0. B. LDA(Latent Dirichlet Allocation) When there exist the parameter of any probability distribution, LDA is Generative Model of the viewpoint that generate data based on random process. If we know topic distribution of document and each words to generate probability, we can calculate specific document probability. Fig. 1. LDA s concept diagram. Latent Dirichlet Allocation given number of the M documents, it based that the documents has few existing k topic. At first, to use probability distribution at the model is as follows. In here, is given through the actual document, other potential variables can't observed. It is a potential variable which other variables can not be observed. :Follow the k dimension Dirichlet i distribution. ~ Dir( ) z ~ Mltinomial( ) distribution. : Follow the multinomial w follow generated word probability by topic that pointed by z. At that time is Dirichlet distribution and is k V matrix parameter that contain word generate probability. About topic that pointed by z w is conditioned by the word generating probability p ( w z, ). At this time, is the parameter of Dirichlet distribution and is the probability of topic k that can give with each result V which is also calculated as k V in the matrix. This model can be interpreted as follows. For each document, they have weight for number of k subject and z subject of each word that chosen in multinomial distribution of weight. Finally, real words w are selected based on specific topic. C. N-gram It is necessary to process of lexical for understanding sentences but the common grammar of language is very complex also many common users don't follow the standard grammar. There are various algorithms that used to analyze like these sentences. In these algorithms, n-gram has more fast and simple handling advantages than other algorithms. It is the language model which is possible to calculate the meaning whether it is real with the word link of number of n. III. RESEARCH : The word distribution for topic k z :The topic for the jth word in document i(index) w :The jth word in document i(index)

3 A. Basic data Basic data is made by normal people. So it is very similar to the data on web which we can see easy. To process these data, computer need many amount of preprocessing steps because the form is not defined. Also the Korea grammar has postposition, so the word's form changes to various form. As a result, we need to divide noun or find infinitives. It is more difficult than processing English. To solve these problem in this research extract verb and no use Korean parser (Komoran-1.12) to extract verb and noun. And also makes stop word dictionary to prove extracted noun and infinitives. The stop word dictionary is composed 1230 words including unknown meaningless and abstract word, article and postposition. Finally, processed documents are consists of only words by preprocessing. This data TABLE I LIST OF EXTRACTED KEYWORD WHICH USING TFIDF ALGORITHM Word TF-IDF 컨텐츠 (contents) 헤드셋 (headset) 컴퓨터 (computer) 불면증 (insomnia) 집중력 (concentration) 우울증 (melancholia) 헤어밴드 (hair bands) 긴장감 (tension) 스트레스 (stress) 스마트 (smart) is processed by TF-IDF algorithm and Topic modeling algorithm which called LDA. The data was accumulated for 3 years (2011~2013). And also ideas are composed 548 ideas in 2013, 266 ideas in 2012, and 447 ideas in Each idea are composed in their background of occurrence, necessity, technical core and scenario. Therefore 1261 data are used to this research. Next figure is value of every year's exaggerated topic. In the fig 2, the mismatching ratio between contents and title is 50% in 2013, 56% in 2012, and 38% in The document of more abstractive form show the more mismatching probability. As a result, when they saw title, they can't inference about document topic. Fig. 3. Precision and recall of topic (left-precision, right-recall). Fig. 2. A value of every year's exaggerated topic. (2011, 2012, 2013) B. Keywords extraction Each document contains their representative keywords. But it is very hard to find representative keywords in huge amount of word. By using TF-IDF algorithm, top 20 keywords are extracted

4 after preprocessing (ex. Korean parser, stop word dictionary). Table 1 is lists of extracted keywords which using TF-IDF algorithm at no.1 data. C. Topic Modeling In this chapter, we offer the result to find key word with TF-IDF's result to raise the key word's reliability. This method extract topic from each documents through topic modeling algorithm (LDA). Of course all documents already parsed and extracted noun and infinitives by morphological analyzer. To verify these keywords, we have extracted topic by supervised basic data. Total counts of verification documents are 430, also count of processed documents are 430. They are supervised data and LDA data. The standard of comparison is whether appear supervised topic in extracted topic by LDA. To measure this research's likelihood, each documents are processed EM-Algorithm 1000 times. As a result of algorithm, each document are normally included 5 keywords. These words be representative word in document. D. Clustering We can t know specific meaning of only a word. That is reason generate needs for analysis of sentence level. In this paper, we clustered word by trigram methodology for founding relation between words. This procedure that show relation words between tf-idf result and LDA result can solves problem for ambiguous word in context. Trigram expression is follows: [ n n n n n PR Ek ] [ Ek ] [ PO Ek ] (4) Clustering result of trigram can show relations between words. If frequency of trigram word has high variable, it can suppose high relation of these words. In fig 4, it shows relation of top 30 frequency words through Les Miserable Co-occurrence graph. Extracted words like table 2 are related to the middle word. In these trigram words, we select extracted topic word and directly related trigram word, and it must provide topic word as additional description word. K R( W T ) f ( w ) g(, w ) (5) n i 1 i (5) is equation of calculating relation between trigram and topic. w i is one of the trigram words that maked by topic T. f(w,t) is rate of specific word w in trigram that make by topic T. i.e. f(w,t) is binomial distribution function, n f ( w ) n( w ) n N i i 1 (6) Fig. 4. Les Miserable Co-occurrence graph of relation between words (top-sort by frequency, bottom-sort by name)

5 g(, w) P( w ) (7) We select w which maximum value in R(w T) and it is describe topic. Fig. 5. Entire process model. IV. CONCLUSION In this paper, we show that provide keyword in document to user through TF-IDF and LDA algorithm about unstructured data. Morpheme logical analysis and word stemming through stop word dictionary improve result of our procedure. Also we make supervised data for proving unsupervised data to measure precision and recall. As a result we can improve high precision. On the other hand, recall cannot reach expected point. Extracted trigram for fool recall, we also suggested a methodology for measuring relation between topic and word. In the future, we expect reached high precision and recall as adapt this methodology. There are several directions we plan to investigate in the future. One is making abstract word dictionary that impede recall. Another one is adapt trigram methodology for high quality. We expect to use this methodology for information select to any user that can easily select information when they want. [2] Wiliam B. Cavnar, John M. Trenkle, "N-Gram-Based Text Categorization ",Environmental Research Institute of Michigan P.O. Box Ann Arbor MI [3] Juan Ramos, "Using TF-IDF to Dewordine Word Relevance in Document Queries" Department of Computer Science, Rutgers University, BPO Way, Piscataway, NJ, [4] Chenghua Lin, Yulan He, Richard Everson, "Weakly Supervised Joint Sentiment-Topic Detection from Text", IEEE TRANSCATIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO.6, JUNE [5] Seungil Huh, Stephen E. Fienberg, Discriminative Topic Modeling Based on Manifold Learning, ACM Transactions on Knowledge Discovery from Data, Vol. 5 No. 4, Article 20, Publication date: February 2012 [6] Aurora Pons-Porrata, Rafael Berlanga-Llavori, Jose Ruiz-Shulcloper, Topic discovery based on text mining techniques, Information Processing and Management 43 (2007) [7] A. P. Dempster, N. M. Laird, D. B. Rubin Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Vol.39,No.1(1977),pp.1-38 REFERENCES [1] David M. Blei, Andrew Y. Ng, Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003)

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

More information

Universiteit Leiden ICT in Business

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA Alta de Waal, Jacobus Venter and Etienne Barnard Abstract Most actionable evidence is identified during the analysis phase of digital forensic investigations.

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

The Internet as a Normative Corpus: Grammar Checking with a Search Engine The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a

More information

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database Journal of Computer and Communications, 2016, 4, 79-89 Published Online August 2016 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2016.410009 Performance Analysis of Optimized

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Using dialogue context to improve parsing performance in dialogue systems

Using dialogue context to improve parsing performance in dialogue systems Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Data Fusion Models in WSNs: Comparison and Analysis

Data Fusion Models in WSNs: Comparison and Analysis Proceedings of 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1) Data Fusion s in WSNs: Comparison and Analysis Marwah M Almasri, and Khaled M Elleithy, Senior Member,

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

Matching Similarity for Keyword-Based Clustering

Matching Similarity for Keyword-Based Clustering Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web

More information

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

ScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques

ScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 368 373 The 6th International Conference on Current and Future Trends of Information and Communication Technologies

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

Experts Retrieval with Multiword-Enhanced Author Topic Model

Experts Retrieval with Multiword-Enhanced Author Topic Model NAACL 10 Workshop on Semantic Search Experts Retrieval with Multiword-Enhanced Author Topic Model Nikhil Johri Dan Roth Yuancheng Tu Dept. of Computer Science Dept. of Linguistics University of Illinois

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

Term Weighting based on Document Revision History

Term Weighting based on Document Revision History Term Weighting based on Document Revision History Sérgio Nunes, Cristina Ribeiro, and Gabriel David INESC Porto, DEI, Faculdade de Engenharia, Universidade do Porto. Rua Dr. Roberto Frias, s/n. 4200-465

More information

Multilingual Sentiment and Subjectivity Analysis

Multilingual Sentiment and Subjectivity Analysis Multilingual Sentiment and Subjectivity Analysis Carmen Banea and Rada Mihalcea Department of Computer Science University of North Texas rada@cs.unt.edu, carmen.banea@gmail.com Janyce Wiebe Department

More information

WHEN THERE IS A mismatch between the acoustic

WHEN THERE IS A mismatch between the acoustic 808 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition Jeih-Weih Hung, Member,

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

Julia Smith. Effective Classroom Approaches to.

Julia Smith. Effective Classroom Approaches to. Julia Smith @tessmaths Effective Classroom Approaches to GCSE Maths resits julia.smith@writtle.ac.uk Agenda The context of GCSE resit in a post-16 setting An overview of the new GCSE Key features of a

More information

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for

More information

TextGraphs: Graph-based algorithms for Natural Language Processing

TextGraphs: Graph-based algorithms for Natural Language Processing HLT-NAACL 06 TextGraphs: Graph-based algorithms for Natural Language Processing Proceedings of the Workshop Production and Manufacturing by Omnipress Inc. 2600 Anderson Street Madison, WI 53704 c 2006

More information

Mining Association Rules in Student s Assessment Data

Mining Association Rules in Student s Assessment Data www.ijcsi.org 211 Mining Association Rules in Student s Assessment Data Dr. Varun Kumar 1, Anupama Chadha 2 1 Department of Computer Science and Engineering, MVN University Palwal, Haryana, India 2 Anupama

More information

Books Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny

Books Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny By the End of Year 8 All Essential words lists 1-7 290 words Commonly Misspelt Words-55 working out more complex, irregular, and/or ambiguous words by using strategies such as inferring the unknown from

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

A heuristic framework for pivot-based bilingual dictionary induction

A heuristic framework for pivot-based bilingual dictionary induction 2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,

More information

The Role of String Similarity Metrics in Ontology Alignment

The Role of String Similarity Metrics in Ontology Alignment The Role of String Similarity Metrics in Ontology Alignment Michelle Cheatham and Pascal Hitzler August 9, 2013 1 Introduction Tim Berners-Lee originally envisioned a much different world wide web than

More information

Dublin City Schools Mathematics Graded Course of Study GRADE 4

Dublin City Schools Mathematics Graded Course of Study GRADE 4 I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported

More information

AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS

AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Learning Disability Functional Capacity Evaluation. Dear Doctor,

Learning Disability Functional Capacity Evaluation. Dear Doctor, Dear Doctor, I have been asked to formulate a vocational opinion regarding NAME s employability in light of his/her learning disability. To assist me with this evaluation I would appreciate if you can

More information

A Syllable Based Word Recognition Model for Korean Noun Extraction

A Syllable Based Word Recognition Model for Korean Noun Extraction are used as the most important terms (features) that express the document in NLP applications such as information retrieval, document categorization, text summarization, information extraction, and etc.

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

The Indices Investigations Teacher s Notes

The Indices Investigations Teacher s Notes The Indices Investigations Teacher s Notes These activities are for students to use independently of the teacher to practise and develop number and algebra properties.. Number Framework domain and stage:

More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Distant Supervised Relation Extraction with Wikipedia and Freebase

Distant Supervised Relation Extraction with Wikipedia and Freebase Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

Short Text Understanding Through Lexical-Semantic Analysis

Short Text Understanding Through Lexical-Semantic Analysis Short Text Understanding Through Lexical-Semantic Analysis Wen Hua #1, Zhongyuan Wang 2, Haixun Wang 3, Kai Zheng #4, Xiaofang Zhou #5 School of Information, Renmin University of China, Beijing, China

More information

HLTCOE at TREC 2013: Temporal Summarization

HLTCOE at TREC 2013: Temporal Summarization HLTCOE at TREC 2013: Temporal Summarization Tan Xu University of Maryland College Park Paul McNamee Johns Hopkins University HLTCOE Douglas W. Oard University of Maryland College Park Abstract Our team

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Ch 2 Test Remediation Work Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Provide an appropriate response. 1) High temperatures in a certain

More information

Using Web Searches on Important Words to Create Background Sets for LSI Classification

Using Web Searches on Important Words to Create Background Sets for LSI Classification Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract

More information

Guidelines for Writing an Internship Report

Guidelines for Writing an Internship Report Guidelines for Writing an Internship Report Master of Commerce (MCOM) Program Bahauddin Zakariya University, Multan Table of Contents Table of Contents... 2 1. Introduction.... 3 2. The Required Components

More information

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each

More information

Probability and Statistics Curriculum Pacing Guide

Probability and Statistics Curriculum Pacing Guide Unit 1 Terms PS.SPMJ.3 PS.SPMJ.5 Plan and conduct a survey to answer a statistical question. Recognize how the plan addresses sampling technique, randomization, measurement of experimental error and methods

More information

TopicFlow: Visualizing Topic Alignment of Twitter Data over Time

TopicFlow: Visualizing Topic Alignment of Twitter Data over Time TopicFlow: Visualizing Topic Alignment of Twitter Data over Time Sana Malik, Alison Smith, Timothy Hawes, Panagis Papadatos, Jianyu Li, Cody Dunne, Ben Shneiderman University of Maryland, College Park,

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

Postprint.

Postprint. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at CLEF 2013 Conference and Labs of the Evaluation Forum Information Access Evaluation meets Multilinguality, Multimodality,

More information

Ensemble Technique Utilization for Indonesian Dependency Parser

Ensemble Technique Utilization for Indonesian Dependency Parser Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id

More information

Test Effort Estimation Using Neural Network

Test Effort Estimation Using Neural Network J. Software Engineering & Applications, 2010, 3: 331-340 doi:10.4236/jsea.2010.34038 Published Online April 2010 (http://www.scirp.org/journal/jsea) 331 Chintala Abhishek*, Veginati Pavan Kumar, Harish

More information

Modeling user preferences and norms in context-aware systems

Modeling user preferences and norms in context-aware systems Modeling user preferences and norms in context-aware systems Jonas Nilsson, Cecilia Lindmark Jonas Nilsson, Cecilia Lindmark VT 2016 Bachelor's thesis for Computer Science, 15 hp Supervisor: Juan Carlos

More information

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Different Requirements Gathering Techniques and Issues. Javaria Mushtaq

Different Requirements Gathering Techniques and Issues. Javaria Mushtaq 835 Different Requirements Gathering Techniques and Issues Javaria Mushtaq Abstract- Project management is now becoming a very important part of our software industries. To handle projects with success

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

TABE 9&10. Revised 8/2013- with reference to College and Career Readiness Standards

TABE 9&10. Revised 8/2013- with reference to College and Career Readiness Standards TABE 9&10 Revised 8/2013- with reference to College and Career Readiness Standards LEVEL E Test 1: Reading Name Class E01- INTERPRET GRAPHIC INFORMATION Signs Maps Graphs Consumer Materials Forms Dictionary

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Procedia - Social and Behavioral Sciences 226 ( 2016 ) 27 34

Procedia - Social and Behavioral Sciences 226 ( 2016 ) 27 34 Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 226 ( 2016 ) 27 34 29th World Congress International Project Management Association (IPMA) 2015, IPMA WC

More information

Mathematics subject curriculum

Mathematics subject curriculum Mathematics subject curriculum Dette er ei omsetjing av den fastsette læreplanteksten. Læreplanen er fastsett på Nynorsk Established as a Regulation by the Ministry of Education and Research on 24 June

More information

A study of speaker adaptation for DNN-based speech synthesis

A study of speaker adaptation for DNN-based speech synthesis A study of speaker adaptation for DNN-based speech synthesis Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, Simon King The Centre for Speech Technology Research (CSTR) University of Edinburgh,

More information

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State

More information

A Comparison of Standard and Interval Association Rules

A Comparison of Standard and Interval Association Rules A Comparison of Standard and Association Rules Choh Man Teng cmteng@ai.uwf.edu Institute for Human and Machine Cognition University of West Florida 4 South Alcaniz Street, Pensacola FL 325, USA Abstract

More information

Mining Topic-level Opinion Influence in Microblog

Mining Topic-level Opinion Influence in Microblog Mining Topic-level Opinion Influence in Microblog Daifeng Li Dept. of Computer Science and Technology Tsinghua University ldf3824@yahoo.com.cn Jie Tang Dept. of Computer Science and Technology Tsinghua

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models

Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Jianfeng Gao Microsoft Research One Microsoft Way Redmond, WA 98052 USA jfgao@microsoft.com Xiaodong He Microsoft

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Comment-based Multi-View Clustering of Web 2.0 Items

Comment-based Multi-View Clustering of Web 2.0 Items Comment-based Multi-View Clustering of Web 2.0 Items Xiangnan He 1 Min-Yen Kan 1 Peichu Xie 2 Xiao Chen 3 1 School of Computing, National University of Singapore 2 Department of Mathematics, National University

More information

A Bayesian Learning Approach to Concept-Based Document Classification

A Bayesian Learning Approach to Concept-Based Document Classification Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors

More information

Cross-lingual Short-Text Document Classification for Facebook Comments

Cross-lingual Short-Text Document Classification for Facebook Comments 2014 International Conference on Future Internet of Things and Cloud Cross-lingual Short-Text Document Classification for Facebook Comments Mosab Faqeeh, Nawaf Abdulla, Mahmoud Al-Ayyoub, Yaser Jararweh

More information

Efficient Online Summarization of Microblogging Streams

Efficient Online Summarization of Microblogging Streams Efficient Online Summarization of Microblogging Streams Andrei Olariu Faculty of Mathematics and Computer Science University of Bucharest andrei@olariu.org Abstract The large amounts of data generated

More information

arxiv:cmp-lg/ v1 22 Aug 1994

arxiv:cmp-lg/ v1 22 Aug 1994 arxiv:cmp-lg/94080v 22 Aug 994 DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS Fernando Pereira AT&T Bell Laboratories 600 Mountain Ave. Murray Hill, NJ 07974 pereira@research.att.com Abstract We describe and

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

Let s think about how to multiply and divide fractions by fractions!

Let s think about how to multiply and divide fractions by fractions! Let s think about how to multiply and divide fractions by fractions! June 25, 2007 (Monday) Takehaya Attached Elementary School, Tokyo Gakugei University Grade 6, Class # 1 (21 boys, 20 girls) Instructor:

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

2 Mitsuru Ishizuka x1 Keywords Automatic Indexing, PAI, Asserted Keyword, Spreading Activation, Priming Eect Introduction With the increasing number o

2 Mitsuru Ishizuka x1 Keywords Automatic Indexing, PAI, Asserted Keyword, Spreading Activation, Priming Eect Introduction With the increasing number o PAI: Automatic Indexing for Extracting Asserted Keywords from a Document 1 PAI: Automatic Indexing for Extracting Asserted Keywords from a Document Naohiro Matsumura PRESTO, Japan Science and Technology

More information

As a high-quality international conference in the field

As a high-quality international conference in the field The New Automated IEEE INFOCOM Review Assignment System Baochun Li and Y. Thomas Hou Abstract In academic conferences, the structure of the review process has always been considered a critical aspect of

More information

Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes

Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes Zhaochun Ren z.ren@uva.nl Maarten de Rijke derijke@uva.nl University of Amsterdam, Amsterdam, The Netherlands ABSTRACT Given a topic

More information

Compositional Semantics

Compositional Semantics Compositional Semantics CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Words, bag of words Sequences Trees Meaning Representing Meaning An important goal of NLP/AI: convert natural language

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

A DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF GRAPH DATA

A DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF GRAPH DATA International Journal of Semantic Computing Vol. 5, No. 4 (2011) 433 462 c World Scientific Publishing Company DOI: 10.1142/S1793351X1100133X A DISTRIBUTIONAL STRUCTURED SEMANTIC SPACE FOR QUERYING RDF

More information

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Cristian-Alexandru Drăgușanu, Marina Cufliuc, Adrian Iftene UAIC: Faculty of Computer Science, Alexandru Ioan Cuza University,

More information