Feature Reduction Techniques for Arabic Text Categorization


Rehab Duwairi, Department of Computer Information Systems, Jordan University of Science and Technology, Irbid, Jordan.
Mohammad Nayef Al-Refai, Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan.
Natheer Khasawneh, Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan.

This paper presents and compares three feature reduction techniques that were applied to Arabic text. The techniques are stemming, light stemming, and word clusters. The effects of these techniques were studied and analyzed on the K-nearest-neighbor (KNN) classifier. Stemming reduces words to their stems. Light stemming, by comparison, removes common affixes from words without reducing them to their stems. Word clusters group synonymous words into clusters, and each cluster is represented by a single word. The purpose of employing these methods is to reduce the size of document vectors without affecting the accuracy of the classifiers. The comparison metrics include the size of document vectors, classification time, and accuracy (in terms of precision and recall). Several experiments were carried out using four different representations of the same corpus: the first version uses stem vectors, the second uses light-stem vectors, the third uses word-cluster vectors, and the fourth uses the original words (without any transformation) as representatives of documents. The corpus consists of 15,000 documents that fall into three categories: sports, economics, and politics. In terms of vector sizes and classification time, the stemmed vectors consumed the smallest space and the least time necessary to classify a testing dataset of 6,000 documents. The light-stemmed vectors surpassed the other three representations in terms of classification accuracy.
Received September 26, 2008; revised June 2, 2009; accepted June 3, 2009. Published online 13 July 2009 in Wiley InterScience.

Introduction

The exponential growth in the availability of online information and in Internet usage has created an urgent demand for fast and useful access to information (Correa & Ludermir, 2002; Ker & Chen, 2000; Pierre, 2000). People need help in finding, filtering, and managing resources. Furthermore, today's large repositories of information present the problem of how to analyze the information and how to facilitate navigation to it. This mass of information must be organized to make it comprehensible to people, and the most successful paradigm is to categorize documents according to their topics. Text categorization, or text classification, is one of many information management tasks. It is a way of assigning documents to predefined categories based on document contents. Categorization is generally done to organize information automatically. The need for automated classification arises mainly because of the rapid growth and change of the Web, where manual organization becomes almost impossible without expending massive time and effort (Pierre, 2000). Information retrieval, text routing, filtering, and understanding are some examples of the wide range of applications for text categorization (Dumais, Platt, Heckerman, & Sahami, 1998). Many categorization algorithms have been applied to text categorization, for example, Naïve Bayes probabilistic classifiers (Eyheramendy, Lewis, & Madigan, 2003), Decision Tree classifiers (Bednar, 2006), Neural Networks (Basu, Walters, & Shepherd, 2003), K-nearest-neighbor (KNN) classifiers (Gongde, Hui, David, Yaxin, & Kieran, 2004), and Support Vector Machines (Sebastiani, 2005). With the increasing size of datasets used in text classification, the number and quality of features provided to describe the data has become a relevant and challenging problem.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(11), 2009

There is a need for effective feature reduction strategies (Yan et al., 2005; Yang & Pedersen, 1997). Some standard feature reduction processes can initially be used to reduce the high number of features, such as eliminating stopwords, stemming, and removing very frequent/infrequent words. Feature selection strategies discover a subset of features that are relevant to the task to be learned and that cause a reduction in the training and testing data sizes (Seo, Ankolekar, & Sycara, 2004). The classifier built with only the relevant subset of features would have better predictive accuracy than the classifier built from the entire set of features (Mejia-Lavalle & Morales, 2006). If we keep working in a high-dimensional feature space, two main problems may arise: computational complexity and overfitting.

In this paper the researchers present and compare three heuristic feature selection measures for Arabic text: stemming, light stemming, and word clusters. Stemming reduces words to their stems (roots). Light stemming, on the other hand, removes common affixes from words without reducing them to their roots. Word clusters partition the words that appear in a document into clusters based on the synonymy relation; afterwards each cluster is represented by a single word (called the cluster representative). The effects of the above three techniques, in addition to the case of using the original words of a document, were assessed as feature selection techniques for text categorization. The assessment framework includes comparing the document vector sizes, the preprocessing and classification times, and the classifier accuracy. The KNN classifier was applied to an Arabic dataset. The dataset consists of 15,000 Arabic documents; the documents were collected, filtered, and classified manually into three categories: Sports, Economics, and Politics. The smallest vector sizes and the shortest times were achieved in the case of stemming, while the highest classifier accuracy in terms of precision and recall was obtained in the case of light stemming.
This paper is organized as follows: The first section is the introduction; the second section describes the proposed framework, which consists of the feature selection techniques and the classification subsystem. The third section presents and analyzes the results. Finally, the last section summarizes the conclusions and briefly highlights future work.

System Framework

The Proposed Feature Selection Measures

Stemming algorithms are needed in many applications such as natural language processing, data compression, and information retrieval systems. Very little work in the literature utilizes stemming algorithms for Arabic text categorization; examples include the work of Sawaf, Zaplo, and Ney (2001), Elkourdi, Bensaid, and Rachidi (2004), and Duwairi (2006). Applying stemming algorithms as a feature selection method reduces the number of features because lexical forms (of words) are derived from basic building blocks and, hence, many features that are generated from the same stem are represented as one feature (their stem). This technique reduces the size of document vectors and increases the speed of the learning and categorization phases for many classifiers, especially for classifiers that scan the whole training dataset for each test document. The stemming algorithm of Al-Shalabi, Kanaan, and Al-Serhan (2003) was followed here as a feature selection method. Most Arabic word roots consist of three letters; very few words have four, five, or six letters. The algorithm reported in Al-Shalabi et al. (2003) finds the three-letter roots for Arabic words without depending on any root or pattern files. For example, using Al-Shalabi et al.'s algorithm would reduce the Arabic words for the library, the writer, and the book to one stem, which means write. The main idea for using light stemming is that many word variants do not have similar meanings or semantics, even though these word variants are generated from the same root.
Thus, root extraction algorithms can distort the meanings of words. Light stemming, by comparison, aims to enhance categorization performance while retaining the words' meanings. It removes certain defined prefixes and suffixes from a word instead of extracting its original root (Aljlayl & Frieder, 2002). For example, the Arabic words for the book and the writers are extracted from the same root, which means write, but they have different meanings; the stemming approach therefore collapses their semantics. The light stemming approach, on the other hand, maps the word for the book to the word for book, and stems the word for the writers to the word for writer. Light stemming keeps the words' meanings unaffected. We applied the light stemming approach of Aljlayl and Frieder (2002) as a feature selection method. Their light stemming algorithm consists of several rounds that attempt to locate and remove the most frequent prefixes and suffixes from the word.

Word clustering aggregates synonymous words, which have various syntactical forms but similar meanings, into clusters. For example, two Arabic verbs that both mean run would be aggregated into a single cluster even though they have different roots. Without clustering, classifiers cannot treat such words as correlated terms that provide similar semantic interpretations. A word-cluster vector is created for every document by partitioning the words that appear in that document into groups based on their synonymy relation. Every cluster is then represented by a single word, the one most commonly used in that context. To alleviate minor syntactical variations among words in the same cluster, the words were light stemmed. Using this approach, a document vector consists of cluster representatives only, together with their frequencies. This fundamentally reduces the size of document vectors.
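To make the two transformations concrete, the following sketch shows a light stemmer and the cluster-lookup step. The affix lists and the tiny thesaurus are illustrative assumptions only; they are not the actual rule set of Aljlayl and Frieder (2002), nor the authors' hand-built thesaurus.

```python
# Illustrative sketch: light stemming followed by synonym-cluster lookup.
# The affix lists and THESAURUS entries are assumptions for demonstration.
from collections import Counter

PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]   # longest first
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ة"]

def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping a minimal stem."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

# Hypothetical synonym clusters: representative -> member light stems.
THESAURUS = {
    "ركض": ["ركض", "جرى"],   # a "run" cluster
}
WORD_TO_REP = {w: rep for rep, ws in THESAURUS.items() for w in ws}

def cluster_vector(tokens):
    """Light-stem each token, map it to its cluster representative
    (if any), and count term frequencies."""
    stems = (light_stem(t) for t in tokens)
    return Counter(WORD_TO_REP.get(s, s) for s in stems)
```

With these toy rules, a word such as the one for "the book" loses only its definite-article prefix, and two synonymous verbs collapse into one cluster representative, so the resulting vector contains a single feature for both.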
The distribution of words into clusters is performed by carrying out a thesaurus lookup for every word that appears in a document. If the word matches a cluster (list) in the thesaurus, then it is replaced by that list's representative. Since the dataset consists of three categories (sports, politics, and economics), we built a thesaurus for Arabic terms that are related to these topics. Many resources were utilized

to build this thesaurus, including several Arabic dictionaries, electronic resources such as the Microsoft Office 2003 thesaurus, and many Internet sites that provide information about the Arabic terms of politics, sports, and economics, or their synonymous terms.

FIG. 1. The main modules in the system.

The Classification Subsystem

The motivation of our research is to assess the effects of feature reduction methods on the KNN classifier. Therefore, the documents in the dataset were processed and represented in four versions, and then the classifier was run on each version. Each document in the dataset was represented using the following four vectors:

A stem vector: words that appear in the document were reduced to their stems.
A light-stem vector: words are processed by removing common affixes using the algorithm described previously (Aljlayl & Frieder, 2002).
A word-clusters vector: synonymous words that appear in the document are represented by a single word, their cluster representative.
A word vector: words of a document are used as is, without any transformation.

Figure 1 shows the main modules in the system. The following paragraphs describe each of these modules.

The preprocessor: preprocessing aims to represent documents in a format that is understandable to the classifier. Common functions for preprocessors include document conversion, stopword removal, and term weighting. The stemmer and light stemmer modules apply stemming and light stemming to the keywords of a document, respectively. The word cluster module groups synonymous words. After each transformation the keywords are weighted using term frequency (TF). The KNN classifier takes as input a test document (represented using the four vectors described above) and assigns a label to that document by calculating its similarity to every document in the training dataset.
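The similarity-based labeling performed by the KNN module can be sketched as follows. This is a minimal illustration over raw TF vectors; cosine similarity is an assumption here, as the paper does not name the exact similarity measure used.

```python
# Minimal KNN sketch: documents as raw term-frequency (TF) vectors,
# cosine similarity (assumed), and a majority vote over the K nearest
# training documents.
import math
from collections import Counter

def tf_vector(tokens):
    """Term-frequency vector of a tokenized document."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse TF vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_vec, training, k=10):
    """training: list of (tf_vector, label) pairs; returns the majority
    label among the k training documents most similar to test_vec."""
    ranked = sorted(training, key=lambda dv: cosine(test_vec, dv[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The training documents are represented the same way as the test document, and the default of k=10 mirrors the value the authors found best for this dataset.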
The training dataset was also represented using the four representational vectors used for test documents. The label of the test document is determined based on the labels of the closest K neighbors to that document. The best value of K was 10 (for this dataset); it was determined experimentally.

Experimentation and Results Analysis

Dataset Description and Vector Sizes

The dataset consists of 15,000 Arabic text documents. These documents were prepared manually by collecting them from online resources. The dataset was filtered and classified manually into three categories: politics, sports, and economics. Each category consists of 5,000 documents. The dataset was divided into two parts: training and testing. The testing data consist of 6,000 documents, 2,000 documents per category. The training data, on the other hand, consist of 9,000 documents, 3,000 documents per category. Every document was represented by four vectors depending on the feature reduction technique employed: in particular, word, word-cluster, stemmed, and light-stemmed vectors. Table 1 describes the characteristics of the four versions of document vectors. The purpose of this table is to show that feature reduction methods reduce the dataset size, and hence minimize the memory space required to handle the dataset. As can be seen from the table, the stemmed vectors consumed the least space (35 MB) and the smallest number of features (5,341,696). This is expected, as stemming reduces several words to one stem. The largest vectors in terms of size and

number of features were the word vectors (where no feature reduction technique was employed). Again, this is expected, as words with minor syntactical variations would each be represented as a distinct keyword on its own. Finally, the light-stemmed vectors consumed 59.5 MB with 7,092,884 features. This is slightly higher than the stemmed vectors because only certain prefixes and suffixes are removed from the words before storing them in their corresponding vectors. Word-cluster vectors fall between the stemmed and light-stemmed versions.

TABLE 1. The properties of the training dataset in terms of size and number of features.

Training dataset version    Size in megabytes    Total number of keywords for 9,000 documents
Word vectors                86 MB                8,987,278
Stemmed vectors             35 MB                5,341,696
Light-stemmed vectors       59.5 MB              7,092,884
Word-cluster vectors        57 MB                6,845,538

Preprocessing and Classification Times

The experiments were performed by categorizing the test documents (6,000 documents) in four cases based on the feature reduction method utilized. In every case the KNN classifier was used. The experiments were carried out on a Pentium IV personal computer (PC) with 256 MB of RAM. Table 2 shows the elapsed preprocessing and classification times for all test documents.

TABLE 2. The elapsed classification time (in seconds) for 6,000 test documents.

Experiment                  Preprocessing time   Classification time   Total time
Word vectors                1,778                12,013                13,791
Stemmed vectors             1,252                9,773                 11,025
Light-stemmed vectors       1,433                10,886                12,319
Word-cluster vectors        1,881                10,155                12,036

Preprocessing time depends on the activities performed during this process. In our work, preprocessing includes the removal of punctuation marks, tags, and stopwords, which is common to all experiments. Preprocessing also includes term weighting, which is common to all experiments but also depends on the feature reduction algorithm.
Term weighting time is proportional to the number of terms in a given document: the more terms, the higher the preprocessing time. Table 2 shows that the lowest preprocessing time was achieved in the case of stemming. This is because the vectors are smaller when compared with the other three vector types. Also, the stemming algorithm utilized is efficient in the sense that it needs to scan a given word only once to deduce its stem (Al-Shalabi et al., 2003). The next-best preprocessing time was achieved in the case of light stemming. To a certain extent, stemming and light stemming are similar in the sense that they both need to process every word in a document by running the stemming or light stemming algorithm. The worst preprocessing time was in the case of word clusters, as this requires accessing the thesaurus in addition to the document to create a document vector.

Classification time indicates the time necessary to classify the 6,000 test documents using the KNN classifier. As can be seen from the table, classifying documents using the stemmed vectors needed the least time, as document vector sizes are rather small. Classifying documents using the light-stemmed and word-cluster vectors consumed 10,886 and 10,155 seconds, respectively. The two values vary only slightly because the vector sizes of the two methods are similar (see Table 1). Word vectors needed the largest time to classify the collection of test documents; again, this is because word vectors are the largest. To sum up, classification time using the KNN classifier is proportional to document vector sizes: the smaller the vectors, the shorter the classification time. The last column in Table 2 shows the total time, which is the sum of the preprocessing time and the classification time. Stemmed vectors achieved the lowest total time; the largest total time occurred in the case of word vectors.
Classifier Accuracy Versus Feature Reduction Techniques

This section investigates the effects of feature reduction techniques on classifier accuracy. The accuracy of the classifier is judged by the standard precision and recall values widely used by the research community. These were originally used to evaluate information retrieval systems and are currently used to assess the accuracy of classifiers. Assume that we have a binary classification problem: there is only one category to predict (say, C). The sample of documents consists of both documents that belong to this category and documents that do not. Let TP (true positives) be the number of documents that were classified by the classifier as members of C and are true members of C (the human agrees with the classifier). Let FN (false negatives) be the number of documents that truly belong to C but that the classifier failed to single out. Let FP (false positives) be the number of documents that were misclassified by the classifier as belonging to C. Finally, let TN (true negatives) be the number of documents that were correctly classified as not belonging to C. Recall is defined as TP/(TP + FN) and Precision is given by TP/(TP + FP). Table 3 shows the precision values for the politics, economics, and sports documents that were fed to the KNN classifier. Every group of test documents was fed to the classifier four times: once in the form of word vectors, the second time in the form of stemmed vectors, then as light-stemmed vectors, and finally as word-cluster vectors. As can be seen from the table, the highest precision value was achieved for politics documents in the case of word-cluster vectors. The precision for the light stemming case was slightly less than for the word clusters case. The worst precision

TABLE 3. Classifier accuracy (using precision) for word vectors, stemmed vectors, light-stemmed vectors, and word-cluster vectors.

            Word vectors    Stemmed vectors    Light-stemmed vectors    Word-cluster vectors
Politics
Economics
Sports

TABLE 4. Classifier accuracy (using recall) for word vectors, stemmed vectors, light-stemmed vectors, and word-cluster vectors.

            Word vectors    Stemmed vectors    Light-stemmed vectors    Word-cluster vectors
Politics
Economics
Sports

FIG. 2. Average precision of the classifier.

FIG. 3. Average recall of the classifier.

value for the politics documents was in the case of the word vectors. The highest precision value for the economics documents, by comparison, was achieved in the case of light-stemmed vectors; the lowest value occurred in the case of the word vectors. Finally, the highest precision value for the sports documents was achieved in the case of light-stemmed vectors, and the worst in the case of word vectors. The conclusion is that using words without applying any stemming, light stemming, or word clustering makes the classifier too sensitive to minor syntactical variations of the same word; such words are therefore considered uncorrelated, which adversely affects the precision of the classifier. Figure 2 depicts the average precision for all test documents (politics, economics, and sports); the best average value was achieved in the case of light stemming and the worst in the case of word vectors. Table 4 shows the recall values for the three categories. The highest recall for politics documents was achieved when the documents were represented as stemmed vectors. The highest recall value for economics documents, by comparison, was achieved when the documents were represented as word clusters. Finally, the highest recall for sports documents was achieved when the documents were represented as light-stemmed vectors.
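The precision and recall measures defined earlier can be computed directly from the four confusion counts; a minimal sketch:

```python
# Precision and recall for a single category C, computed from the
# TP/FP/FN/TN counts defined in the text.

def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def confusion_counts(true_labels, predicted, category):
    """Count TP, FP, FN, TN for one category over paired label lists."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    tn = sum(t != category and p != category for t, p in pairs)
    return tp, fp, fn, tn
```

The per-category values in Tables 3 and 4 are obtained by running this computation once per category (politics, economics, sports) and per document representation.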
Figure 3 shows the average recall values for all test documents against the four employed feature reduction techniques. The two best values were achieved in the cases of light stemming and word clusters, respectively.

Conclusions and Future Work

In this study we applied three feature selection methods for Arabic text categorization. The Arabic dataset was collected manually from Internet sites such as Arabic journals. The dataset was filtered and classified manually into three categories: politics, sports, and economics. Each category consists of 5,000 documents. The dataset was divided into two parts: training and testing. The testing data consist of 6,000 documents, 2,000 documents per category. The training data consist of 9,000 documents, 3,000 documents per category. The techniques used for feature selection are stemming (Al-Shalabi et al., 2003), light stemming (Aljlayl & Frieder, 2002), and word clusters. Stemming finds the three-letter roots of Arabic words without depending on any root or pattern files. Light stemming, on the other hand, removes the common suffixes and prefixes from the words. Word clustering groups synonymous light stems using a prepared thesaurus and then chooses a light-stemmed word to represent each cluster. The KNN classifier was used to classify the test documents. The experiments were done in the following manner: the KNN classifier was run four times on four versions of the

dataset. In the first version, a document is represented as a vector that includes all the words that appear in that document. In the second version, the words of a given document are reduced to their stems (roots) and then the corresponding document vector is created. In the third version, the words that constitute a given document were light stemmed and then the corresponding vector was generated. In the final version, the words of a given document were grouped into clusters based on the synonymy relation, and each cluster was represented by a single word (the cluster representative); that document's vector was then formed from cluster representatives only. Our experiments have shown that stemming reduces vector sizes and therefore improves the classification time. However, it adversely affects the accuracy of the classifier in terms of precision and recall. Precision and recall reached their highest values when using the light stemming approach. These results are of interest to anyone working in Arabic information retrieval, text filtering, or text categorization. In the future we plan to extend this framework to include statistical feature selection techniques such as χ2, information gain, and the Gini index (Shang et al., 2007; Shankar & Karypis, 2000). Finally, the thesaurus used in this work was built manually. We plan to improve this thesaurus by utilizing automatic algorithms and then having the synonymy lists screened by language experts.

References

Aljlayl, M., & Frieder, O. (2002). On Arabic search: Improving the retrieval effectiveness via a light stemming approach. In Proceedings of the ACM 11th Conference on Information and Knowledge Management (pp. ). New York: ACM Press.

Al-Shalabi, R., Kanaan, G., & Al-Serhan, H. (2003, December). A new approach for extracting Arabic roots. Paper presented at the International Arab Conference on Information Technology (ACIT), Alexandria, Egypt.

Basu, A., Walters, C., & Shepherd, M. (2003).
Support vector machines for text categorization. In Proceedings of the 36th Annual Hawaii International Conference on System Sciences (pp. ). Los Alamitos, CA: IEEE Press. Retrieved July 2, 2009, from ieee.org/stamp/stamp.jsp?tp=&arnumber= &isnumber=26341

Bednar, P. (2006, January). Active learning of SVM and decision tree classifiers for text categorization. Paper presented at the Fourth Slovakian-Hungarian Joint Symposium on Applied Machine Intelligence, Herlany, Slovakia.

Correa, R.F., & Ludermir, T.B. (2002, November). Automatic text categorization: Case study. Paper presented at the VII Brazilian Symposium on Neural Networks, Pernambuco, Brazil.

Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management (pp. ). New York: ACM Press.

Duwairi, R.M. (2006). Machine learning for Arabic text categorization. Journal of the American Society for Information Science and Technology, 57(8).

Elkourdi, M., Bensaid, A., & Rachidi, T. (2004). Automatic Arabic document categorization based on the naïve Bayes algorithm. In Proceedings of the COLING 20th Workshop on Computational Approaches to Arabic Script-based Languages (pp. ). Retrieved July 2, 2009, from

Eyheramendy, S., Lewis, D., & Madigan, D. (2003). On the naïve Bayes model for text categorization. Paper presented at the Ninth International Conference on Artificial Intelligence and Statistics, Key West, FL.

Gongde, G., Hui, W., David, A.B., Yaxin, B., & Kieran, G. (2004). An kNN model-based approach and its application in text categorization. In A. Gelbukh (Ed.), Proceedings of the Fifth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing) (pp. ). Lecture Notes in Computer Science. Berlin/Heidelberg, Germany: Springer.

Ker, S., & Chen, J. (2000). A text categorization based on summarization technique. In J.
Klavans & J. Gonzalo (Eds.), Proceedings of the ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (pp. ). New Brunswick, NJ: Association for Computational Linguistics.

Mejia-Lavalle, M., & Morales, E. (2006). Feature selection considering attribute inter-dependencies. In International Workshop on Feature Selection for Data Mining: Interfacing Machine Learning and Statistics (pp. ). Providence, RI: American Mathematical Society.

Pierre, J. (2000, September). Practical issues for automated categorization of web pages. Paper presented at the 2000 Workshop on the Semantic Web, Lisbon, Portugal. Retrieved May 29, 2009, from psu.edu/pierre00practical.html

Sawaf, H., Zaplo, J., & Ney, H. (2001, July). Statistical classification methods for Arabic news articles. Paper presented at the Arabic Natural Language Processing Workshop, Toulouse, France.

Sebastiani, F. (2005). Text categorization. In A. Zanasi (Ed.), Text mining and its applications to intelligence, CRM and knowledge management (pp. ). Southampton, UK: WIT Press.

Seo, Y., Ankolekar, A., & Sycara, K. (2004). Feature selection for extracting semantically rich words. Technical Report CMU-RI-TR-04-18, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., & Wang, Z. (2007). A novel feature selection algorithm for text categorization. Expert Systems with Applications, 33(1), 1-5.

Shankar, S., & Karypis, G. (2000). A feature weight adjustment algorithm for document categorization. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press.

Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., & Ma, W. (2005). OCFS: Optimal orthogonal centroid feature selection for text categorization.
In Proceedings of the 28th Annual International ACM SIGIR Conference (SIGIR 2005) (pp. ). New York: ACM Press.

Yang, Y., & Pedersen, J. (1997). A comparative study on feature selection in text categorization. In J.D.H. Fisher (Ed.), The Fourteenth International Conference on Machine Learning (ICML 97) (pp. ). San Francisco: Morgan Kaufmann.


More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

arxiv: v1 [cs.lg] 3 May 2013

arxiv: v1 [cs.lg] 3 May 2013 Feature Selection Based on Term Frequency and T-Test for Text Categorization Deqing Wang dqwang@nlsde.buaa.edu.cn Hui Zhang hzhang@nlsde.buaa.edu.cn Rui Liu, Weifeng Lv {liurui,lwf}@nlsde.buaa.edu.cn arxiv:1305.0638v1

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming Data Mining VI 205 Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming C. Romero, S. Ventura, C. Hervás & P. González Universidad de Córdoba, Campus Universitario de

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Using dialogue context to improve parsing performance in dialogue systems

Using dialogue context to improve parsing performance in dialogue systems Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,

More information

ScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques

ScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 368 373 The 6th International Conference on Current and Future Trends of Information and Communication Technologies

More information

Using Web Searches on Important Words to Create Background Sets for LSI Classification

Using Web Searches on Important Words to Create Background Sets for LSI Classification Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract

More information

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities

Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB

More information

Large-Scale Web Page Classification. Sathi T Marath. Submitted in partial fulfilment of the requirements. for the degree of Doctor of Philosophy

Large-Scale Web Page Classification. Sathi T Marath. Submitted in partial fulfilment of the requirements. for the degree of Doctor of Philosophy Large-Scale Web Page Classification by Sathi T Marath Submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at Dalhousie University Halifax, Nova Scotia November 2010

More information

Mining Association Rules in Student s Assessment Data

Mining Association Rules in Student s Assessment Data www.ijcsi.org 211 Mining Association Rules in Student s Assessment Data Dr. Varun Kumar 1, Anupama Chadha 2 1 Department of Computer Science and Engineering, MVN University Palwal, Haryana, India 2 Anupama

More information

Speech Emotion Recognition Using Support Vector Machine

Speech Emotion Recognition Using Support Vector Machine Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,

More information

Matching Similarity for Keyword-Based Clustering

Matching Similarity for Keyword-Based Clustering Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

arxiv: v1 [cs.cl] 2 Apr 2017

arxiv: v1 [cs.cl] 2 Apr 2017 Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,

More information

CHAPTER 4: REIMBURSEMENT STRATEGIES 24

CHAPTER 4: REIMBURSEMENT STRATEGIES 24 CHAPTER 4: REIMBURSEMENT STRATEGIES 24 INTRODUCTION Once state level policymakers have decided to implement and pay for CSR, one issue they face is simply how to calculate the reimbursements to districts

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees

Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees Mariusz Łapczy ski 1 and Bartłomiej Jefma ski 2 1 The Chair of Market Analysis and Marketing Research,

More information

Conference Presentation

Conference Presentation Conference Presentation Towards automatic geolocalisation of speakers of European French SCHERRER, Yves, GOLDMAN, Jean-Philippe Abstract Starting in 2015, Avanzi et al. (2016) have launched several online

More information

What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data

What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data Kurt VanLehn 1, Kenneth R. Koedinger 2, Alida Skogsholm 2, Adaeze Nwaigwe 2, Robert G.M. Hausmann 1, Anders Weinstein

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Bug triage in open source systems: a review

Bug triage in open source systems: a review Int. J. Collaborative Enterprise, Vol. 4, No. 4, 2014 299 Bug triage in open source systems: a review V. Akila* and G. Zayaraz Department of Computer Science and Engineering, Pondicherry Engineering College,

More information

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Cristian-Alexandru Drăgușanu, Marina Cufliuc, Adrian Iftene UAIC: Faculty of Computer Science, Alexandru Ioan Cuza University,

More information

Feature Selection based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification using Naïve Bayes

Feature Selection based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification using Naïve Bayes Feature Selection based on Sampling and C4.5 Algorithm to Improve the Quality of Text Classification using Naïve Bayes Viviana Molano 1, Carlos Cobos 1, Martha Mendoza 1, Enrique Herrera-Viedma 2, and

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability

Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability Shih-Bin Chen Dept. of Information and Computer Engineering, Chung-Yuan Christian University Chung-Li, Taiwan

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

How to read a Paper ISMLL. Dr. Josif Grabocka, Carlotta Schatten

How to read a Paper ISMLL. Dr. Josif Grabocka, Carlotta Schatten How to read a Paper ISMLL Dr. Josif Grabocka, Carlotta Schatten Hildesheim, April 2017 1 / 30 Outline How to read a paper Finding additional material Hildesheim, April 2017 2 / 30 How to read a paper How

More information

An Effective Framework for Fast Expert Mining in Collaboration Networks: A Group-Oriented and Cost-Based Method

An Effective Framework for Fast Expert Mining in Collaboration Networks: A Group-Oriented and Cost-Based Method Farhadi F, Sorkhi M, Hashemi S et al. An effective framework for fast expert mining in collaboration networks: A grouporiented and cost-based method. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(3): 577

More information

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA Alta de Waal, Jacobus Venter and Etienne Barnard Abstract Most actionable evidence is identified during the analysis phase of digital forensic investigations.

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

Issues in the Mining of Heart Failure Datasets

Issues in the Mining of Heart Failure Datasets International Journal of Automation and Computing 11(2), April 2014, 162-179 DOI: 10.1007/s11633-014-0778-5 Issues in the Mining of Heart Failure Datasets Nongnuch Poolsawad 1 Lisa Moore 1 Chandrasekhar

More information

Cross-Lingual Text Categorization

Cross-Lingual Text Categorization Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es

More information

AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS

AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS R.Barco 1, R.Guerrero 2, G.Hylander 2, L.Nielsen 3, M.Partanen 2, S.Patel 4 1 Dpt. Ingeniería de Comunicaciones. Universidad de Málaga.

More information

Welcome to. ECML/PKDD 2004 Community meeting

Welcome to. ECML/PKDD 2004 Community meeting Welcome to ECML/PKDD 2004 Community meeting A brief report from the program chairs Jean-Francois Boulicaut, INSA-Lyon, France Floriana Esposito, University of Bari, Italy Fosca Giannotti, ISTI-CNR, Pisa,

More information

Human Emotion Recognition From Speech

Human Emotion Recognition From Speech RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

A Bayesian Learning Approach to Concept-Based Document Classification

A Bayesian Learning Approach to Concept-Based Document Classification Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation

Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation School of Computer Science Human-Computer Interaction Institute Carnegie Mellon University Year 2007 Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation Noboru Matsuda

More information

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration INTERSPEECH 2013 Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu Microsoft Corporation, One

More information

Evidence for Reliability, Validity and Learning Effectiveness

Evidence for Reliability, Validity and Learning Effectiveness PEARSON EDUCATION Evidence for Reliability, Validity and Learning Effectiveness Introduction Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies

More information

Universidade do Minho Escola de Engenharia

Universidade do Minho Escola de Engenharia Universidade do Minho Escola de Engenharia Universidade do Minho Escola de Engenharia Dissertação de Mestrado Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially

More information

Preference Learning in Recommender Systems

Preference Learning in Recommender Systems Preference Learning in Recommender Systems Marco de Gemmis, Leo Iaquinta, Pasquale Lops, Cataldo Musto, Fedelucio Narducci, and Giovanni Semeraro Department of Computer Science University of Bari Aldo

More information

Automatic document classification of biological literature

Automatic document classification of biological literature BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. Copyedited and fully formatted PDF and full text (HTML) versions will be made available soon. Automatic

More information

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis

More information

Data Integration through Clustering and Finding Statistical Relations - Validation of Approach

Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Data Integration through Clustering and Finding Statistical Relations - Validation of Approach Marek Jaszuk, Teresa Mroczek, and Barbara Fryc University of Information Technology and Management, ul. Sucharskiego

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Conversational Framework for Web Search and Recommendations

Conversational Framework for Web Search and Recommendations Conversational Framework for Web Search and Recommendations Saurav Sahay and Ashwin Ram ssahay@cc.gatech.edu, ashwin@cc.gatech.edu College of Computing Georgia Institute of Technology Atlanta, GA Abstract.

More information

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,

More information

Mining Student Evolution Using Associative Classification and Clustering

Mining Student Evolution Using Associative Classification and Clustering Mining Student Evolution Using Associative Classification and Clustering 19 Mining Student Evolution Using Associative Classification and Clustering Kifaya S. Qaddoum, Faculty of Information, Technology

More information

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database Journal of Computer and Communications, 2016, 4, 79-89 Published Online August 2016 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2016.410009 Performance Analysis of Optimized

More information

Beyond the Pipeline: Discrete Optimization in NLP

Beyond the Pipeline: Discrete Optimization in NLP Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We

More information

Exposé for a Master s Thesis

Exposé for a Master s Thesis Exposé for a Master s Thesis Stefan Selent January 21, 2017 Working Title: TF Relation Mining: An Active Learning Approach Introduction The amount of scientific literature is ever increasing. Especially

More information

Efficient Online Summarization of Microblogging Streams

Efficient Online Summarization of Microblogging Streams Efficient Online Summarization of Microblogging Streams Andrei Olariu Faculty of Mathematics and Computer Science University of Bucharest andrei@olariu.org Abstract The large amounts of data generated

More information

ScienceDirect. Malayalam question answering system

ScienceDirect. Malayalam question answering system Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam

More information

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 10, Issue 2, Ver.1 (Mar - Apr.2015), PP 55-61 www.iosrjournals.org Analysis of Emotion

More information

Automating the E-learning Personalization

Automating the E-learning Personalization Automating the E-learning Personalization Fathi Essalmi 1, Leila Jemni Ben Ayed 1, Mohamed Jemni 1, Kinshuk 2, and Sabine Graf 2 1 The Research Laboratory of Technologies of Information and Communication

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

Term Weighting based on Document Revision History

Term Weighting based on Document Revision History Term Weighting based on Document Revision History Sérgio Nunes, Cristina Ribeiro, and Gabriel David INESC Porto, DEI, Faculdade de Engenharia, Universidade do Porto. Rua Dr. Roberto Frias, s/n. 4200-465

More information

A Case-Based Approach To Imitation Learning in Robotic Agents

A Case-Based Approach To Imitation Learning in Robotic Agents A Case-Based Approach To Imitation Learning in Robotic Agents Tesca Fitzgerald, Ashok Goel School of Interactive Computing Georgia Institute of Technology, Atlanta, GA 30332, USA {tesca.fitzgerald,goel}@cc.gatech.edu

More information

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February

More information

The stages of event extraction

The stages of event extraction The stages of event extraction David Ahn Intelligent Systems Lab Amsterdam University of Amsterdam ahn@science.uva.nl Abstract Event detection and recognition is a complex task consisting of multiple sub-tasks

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

CS4491/CS 7265 BIG DATA ANALYTICS INTRODUCTION TO THE COURSE. Mingon Kang, PhD Computer Science, Kennesaw State University

CS4491/CS 7265 BIG DATA ANALYTICS INTRODUCTION TO THE COURSE. Mingon Kang, PhD Computer Science, Kennesaw State University CS4491/CS 7265 BIG DATA ANALYTICS INTRODUCTION TO THE COURSE Mingon Kang, PhD Computer Science, Kennesaw State University Self Introduction Mingon Kang, PhD Homepage: http://ksuweb.kennesaw.edu/~mkang9

More information

Semi-Supervised Face Detection

Semi-Supervised Face Detection Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University

More information