Improving Machine Learning Input for Automatic Document Classification with Natural Language Processing


Jan C. Scholtes, Tim H.W. van Cann
University of Maastricht, Department of Knowledge Engineering. P.O. Box 616, 6200 MD Maastricht

Abstract

Given a number of documents, we are interested in automatically classifying documents or document sections into a number of predefined classes as efficiently as possible, with as low computational requirements as possible. This is done by using Natural Language Processing (NLP) techniques in combination with traditional high-dimensional document representation techniques, such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), and machine learning techniques such as Support Vector Machines (SVM). Despite the availability of various statistical feature-selection techniques, the high dimensionality of the feature spaces causes computational problems, especially in collections containing old-spelling and Optical Character Recognition (OCR) errors, which lead to exploding feature spaces. As a result, feature extraction, feature selection, training a supervised machine learning algorithm, or clustering can no longer practically be used, because it is too slow and the memory requirements are too large. We show that by applying a variety of NLP techniques as pre-processing, it is possible to significantly increase the discrimination between the classes. In this paper, we report F1 measures that are up to 11.3% better than those of a baseline model which does not use NLP techniques. At the same time, the dimensionality of the feature space is reduced by up to 54%, leading to highly reduced computational requirements and better response times in building the model of the feature space as well as in the machine learning and classification. Further experiments resulted in vector reductions of up to 80%, with results being only 4% worse than the baseline model.
1 Introduction

In this paper, a description is given of a Natural Language Processing (NLP) based approach to the problem of reducing the dimensionality of the feature space when applying feature extraction such as Bag of Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF). BoW and TF-IDF are commonly used for document classification, document clustering and relevance ranking [16]. Despite the fact that the feature space is very sparse (i.e., it contains many dimensions holding zero values), the dimensionality of the feature space is often very high (in the 100,000s or more). When the text of the documents also contains Optical Character Recognition (OCR) errors or old-spelling variations, the dimensionality can easily double or triple. The high dimensionality of these feature spaces causes computational problems during feature extraction, feature selection, machine learning and classification. Although many OCR errors are quite unique in their occurrences (there are many OCR variations, but these occur with very low individual frequencies [32]), and although the TF-IDF algorithm automatically grants very low feature values to rare words, this still causes huge computation delays in the feature extraction process. Especially in the field of the humanities, where digital-heritage collections are the basis of research and investigations, this is currently causing many problems [10, 18].

Statistical feature selection methods such as, but not limited to, Principal Component Analysis, Chi-square, Maximum Likelihood and Maximum Entropy models reduce the feature space by selecting the best features to increase the inner-class similarity and inter-class difference [4, 6, 8, 17, 27]. Obviously, these algorithms also suffer from the high dimensionality of the initial feature extraction process. An additional problem exists: after feature selection, the feature space is highly optimized to the documents used in this process and, as a result, new documents that were not used in building the feature selection will be classified at much lower quality levels [21]. Since feature selection is computationally very expensive, it is also not convenient to repeat the feature selection calculations every time new documents are added to the collection. During the research, we also found that statistical feature selection methods such as Chi-square did not work well, due to the sparseness of the data and because of the high dependence of the features on each other: after all, language is more than a bag of words. Models that presume statistical independence between features (e.g., word occurrences in text), or that look for features with the largest statistical independence, have to overcome the fact that words do not occur independently in natural language. Based on our initial experience in [14], where co-reference resolution and synonym normalization led to significant improvement of classification results for sentence classification, we propose a different method in which the inner-class similarity and inter-class difference are increased by using a number of NLP techniques that pre-process the text of a document in such a way that the initial feature extraction and selection result in much smaller vectors than with the original text [8, 15].
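As an illustration of the Chi-square scoring mentioned above (a minimal sketch, not the authors' implementation), the score of one term for one class can be computed from a 2x2 contingency table of term presence versus class membership:

```python
def chi_square(n11, n10, n01, n00):
    # n11: docs in the class containing the term; n10: docs outside the
    # class containing the term; n01/n00: the same for docs lacking the term
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# a term concentrated inside the class scores high...
print(chi_square(90, 10, 10, 90))    # 128.0
# ...while a term spread evenly over both classes scores zero
print(chi_square(50, 50, 50, 50))    # 0.0
```

Terms with low scores would then be dropped; as noted above, this worked poorly here precisely because word features are far from statistically independent.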
This yields highly reduced calculation times, but at the same time, it is also expected that better classification results can be achieved.

2 Overview of the Classification Pipeline

First, a baseline performance is created. Next, the impact of a number of different NLP techniques on the quality as well as on the computational complexity is discussed. In this research, the following techniques are used in the machine learning process:

1. A number of Natural Language Processing techniques, and
2. Two document feature-extraction techniques known as (i) bag-of-words (BoW) and (ii) Term Frequency-Inverse Document Frequency (TF-IDF), and
3. Basic document feature-selection techniques such as logarithmic normalization and selection of the relevant features by vector cut-off, and
4. A supervised machine-learning algorithm based on Support Vector Machines (SVM) to build binary classifiers for each document category.

We will discuss the setup of these different components of the classification pipeline in more detail in the following paragraphs.

2.1 Natural Language Processing

By using a number of Natural Language Processing techniques, the original text is modified in such a way that the text that is presented to the feature selection process contains words that improve the machine-learning inner-class similarity and inter-class dissimilarity. This goal is reached with the following means:

(i) increasing the number of relevant words for a class by using named-entity recognition, co-reference and anaphora resolution, and (ii) consolidating the different textual occurrences of words which are caused by, for instance, synonyms, abbreviations, spelling variations, spelling errors, or OCR errors. In order to use these methods as well as possible, Part-of-Speech (POS) tagging from the NLTK library [35] was used [8, 15, 16, 24, 28]. First, the additional POS information is used for boundary and conjunction detection in the named-entity recognition process with the CoreNLP library [36]. Second, the POS tags are used for co-reference and pronoun resolution, to replace co-references and pronouns with their corresponding named-entity values [12, 26]. Next, abbreviations are resolved by using the official English abbreviations list from the Oxford English Dictionary [5], and synonyms are resolved by using the lexical database WordNet [29]. In addition, WordNet, in combination with the POS tag, is used for lemmatization to reduce words to their individual lemma. Hereafter, by using the Jaro-Winkler [11, 31] and Levenshtein [22] distance methods, spelling variations due to prefixes and suffixes [25], international spelling variations, spelling errors, or OCR errors are resolved, and all occurrences of such words are normalized to and replaced by one common token in the text. The process is concluded with the removal of stop words, for which a predefined list is used [33]. For reasons of clarity, the process is explained on the following sample text (note that 'atcor' is a deliberate typo):

In 2003, James Williams decided to become an atcor in Hollywood L.A. He moved there last week to pursue his livelong dream of acting glory and fame. His brother, a celebrated writer, wished him good luck.
After tokenization, the document looks like this:

['in', '2003', 'james', 'williams', 'decided', 'to', 'become', 'an', 'atcor', 'in', 'hollywood', 'l.a.', 'he', 'moved', 'there', 'last', 'week', 'to', 'pursue', 'his', 'livelong', 'dream', 'of', 'acting', 'glory', 'and', 'fame.', 'his', 'brother,', 'a', 'celebrated', 'writer,', 'wished', 'him', 'good', 'luck.']

After applying co-reference and pronoun resolution, lemmatization, word similarity and resolving abbreviations and synonyms, the document has changed to:

['James', 'Williams', 'decide', 'actor', 'Hollywood', 'los angeles', 'James', 'Williams', 'travel', 'week', 'prosecute', 'livelong', 'James', 'Williams', 'dream', 'act', 'glory', 'fame', 'James', 'Williams', 'brother', 'lionize', 'writer', 'wish', 'good', 'fortune']

This is then used as the basis for the feature extraction process. As one can see, on the one hand, the number of unique words is much smaller than in the original text, leading to a significantly smaller feature vector. On the other hand, the remaining words are all more distinguishing for the specific content of the document: there are fewer word variations for similar content words, and fewer common, high-frequency words such as stop words, pronouns and co-references.

2.2 Feature extraction

For feature extraction, the process of extracting relevant information from the data to create feature vectors, the Bag of Words (BoW) and the Term Frequency-Inverse Document Frequency (TF-IDF) methods are used [16].
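The TF-IDF weighting just mentioned can be sketched in a few lines. This is an illustrative sketch using the common tf * log(N/df) variant; the exact weighting scheme used in the paper is not specified:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: a list of token lists, e.g. the preprocessed output shown above
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    vectors = []
    for d in docs:
        tf = Counter(d)                             # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["james", "williams", "actor"],
        ["james", "williams", "writer"]]
vecs = tfidf_vectors(docs)
# 'james' occurs in every document, so its weight is 1 * log(2/2) = 0;
# 'actor' occurs in only one document, so it receives a positive weight
```

Stacking these sparse per-document dictionaries over the whole vocabulary gives exactly the kind of high-dimensional, mostly-zero document matrix the paper describes.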

2.3 Normalization & Basic Feature Selection

In order to reduce possibly large gaps between feature values, normalization can be applied. Each feature is normalized between 0 and 1. Since this normalization can still create very small values, a second normalization is applied by taking the logarithm (base 10). In order to further decrease the number of features present in the document vectors, and to select the best possible features for the machine learning process, a very basic feature selection process is applied by removing the dimensions in which all values fall below a threshold. We have called this the vector cut-off approach. The cut-off is defined as:

Cut-off = min + (max - min) * (perc / 100)

where max is the maximum value in the m x n document matrix (the collection of all document vectors, where m is the number of documents and n the number of features) and min is the minimum value in the document matrix. The value of perc needs to be determined empirically. If a feature has no value higher than or equal to the cut-off, the feature is removed from the vectors; otherwise it is kept. We used a very small value for perc: 1 in our case.

2.4 Supervised Machine Learning and Automatic Document Classification with SVM

Supervised machine learning based on a linear Support Vector Machine (SVM) was used to construct a system that can be trained with tagged data [1, 3, 7]. Because of the sparseness of the data, the linear model was good enough; the use of the Gaussian kernel and the sigmoid kernel did not lead to better results, only to longer calculation times. LIBSVM was used for the implementation of the experiments [2]. Each training document was tagged with the appropriate classification category. To enable multi-category classification, a separate binary model was trained for each category, which was used to predict with a certain probability whether a document was part of the category or not. Single-class classification was implemented by taking the maximum value returned.
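The vector cut-off of Section 2.3 can be sketched as follows (a NumPy sketch with an illustrative matrix, not the authors' code):

```python
import numpy as np

def vector_cutoff(X, perc=1.0):
    # X: m x n matrix of documents (rows) by normalized feature values (columns)
    lo, hi = X.min(), X.max()
    cutoff = lo + (hi - lo) * (perc / 100.0)
    # a feature survives only if at least one document reaches the cut-off
    keep = (X >= cutoff).any(axis=0)
    return X[:, keep]

X = np.array([[0.000, 0.90, 0.010],
              [0.000, 0.50, 0.020]])
print(vector_cutoff(X).shape)   # (2, 2): the all-zero first column is removed
```

With perc = 1, as in the paper, only dimensions that never rise meaningfully above the matrix minimum are discarded, so the reduction is conservative.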
Multi-class classification was implemented by selecting all classifiers returning a value higher than a certain threshold.

3 Experiments and Results

3.1 Corpus and Evaluation Method

In this research, the fully annotated Reuters RCV1 corpus is used [13, 19, 23]. For the evaluation, we have used the same evaluation method as the TREC Legal Track, which is based on best-practice principles for measuring the quality of document classification. We used the F1 score and derived 11-point precision graphs representing the quality of the classifiers [9]. In a so-called 11-point precision graph, for each recall value corresponding to a certain threshold value, the precision can be calculated, and both can be plotted as coordinates in a graph.

3.2 Baseline Performance Creation

The following experiments were implemented. First, we created a baseline performance by selecting labeled documents from the RCV1 corpus:

1. Training and validation sets are randomly generated per Reuters category from RCV1 by randomly choosing 500 documents in the actual Reuters category (positive instances) and 500 documents outside the Reuters category (negative instances). The test set consists of additional documents from RCV1, where 25% are documents from within the class, and

75% are outside the class, randomly selected from the entire RCV1 corpus. We tested several different RCV1 classes related to WAR, CRIME and VIOLENCE; the results for the different classes were about the same.

2. Next, feature extraction was done by using BoW and TF-IDF. Feature selection was done by using vector cut-off, and the features were normalized by using logarithmic normalization.

3. For training, an SVM with a linear kernel was used.

The result of the baseline performance can be found in Table 1.

Table 1: F1 Scores for the Baseline Performance

Feature-Extraction Used    F1 Score (Test Set)
Bag of Words
TF-IDF

In the baseline performance, the dimensionality of the feature vectors was 142,722 after the basic feature extraction. If n_features is the dimensionality of the feature vector and n_samples is the number of training samples, then the time complexity of the LIBSVM implementation scales between O(n_features x n_samples^2) and O(n_features x n_samples^3), depending on how efficiently the LIBSVM cache is used in practice (which is dataset dependent). If the data is very sparse, n_features should be replaced by the average number of non-zero features in a sample vector. By reducing the length of the feature vector, the time and space complexity can be significantly reduced.

3.3 The Effects of NLP on the Results

The results of the additional NLP steps are listed in Table 2. In these experiments, we used the same training and test sets as in the baseline performance. Even though BoW has a slightly higher baseline, BoW in combination with NLP techniques reached a maximum F-score of 0.795, only 0.3 higher than the initial baseline.
The results for TF-IDF in combination with NLP techniques were better: up to 0.820, so we will focus on these results.

Table 2: Results after NLP on TF-IDF Feature Extraction

NLP techniques applied (in chronological sequence)   Vector size TF-IDF   % vector reduction vs. baseline   F-score TF-IDF
Baseline (tokenization)                              142,722              0%
Anaphora and NER                                     94,985               33%
Abbreviations                                        80,906               43%
Lemmatization                                        66,008               54%
Synonyms                                             57,125               60%                               0.794
Jaro-Winkler                                                              80%

Figure 1 shows the 11-point precision-recall graphs for the NLP preprocessing experiments in which the results were better than the original baseline. In addition, the best results, in the upper right corner, are circled, as these are the best results from the experiments, with both precision and recall larger than 0.8.

Figure 1: 11-point precision for NLP Experiments on TF-IDF Feature Extraction

4 Conclusions

In this research we have shown that, by applying a variety of Natural Language Processing preprocessing techniques, it is possible to significantly increase the inner-class similarity and the inter-class difference, which results in up to 11.3% better machine learning quality at the lemmatization stage. At the same time, the dimensionality of the feature space is reduced by up to 54%, leading to highly reduced computational and memory requirements and better response times in building the model of the feature space as well as in the machine learning and classification. Further preprocessing with synonyms and Jaro-Winkler leads to even larger vector-size reductions (60% and 80%), but also to lower machine learning quality compared to lemmatization, although the synonym results are still better than the baseline. One can even say that the highly reduced vector length of the Jaro-Winkler processing leads to 80% smaller vectors at only 4% less machine learning quality. In order to investigate the stability of the results, we also tested the same model for several other classes in the RCV1 corpus, with up to 5 varying random training and test sets; the results were similar to the ones presented here. There is, of course, a price to the NLP processing, but each technique is applied only once per document. High-dimensional feature spaces lead to more problems, especially in cases where the machine learning process has to be done several times due to the addition of new documents or improved sets of training documents.

5 Future Research

A number of improvements can be made to the current setup. First, by using dependency grammars, we expect to reach a much better quality of co-reference and pronoun resolution. The currently available information from the POS tagging is not always sufficient to resolve the co-references and pronouns; recent research shows that applying dependency grammars to this problem is very promising [34]. Second, the synonym replacement did not improve the results. We expect this to be caused by the fact that the context of the synonym usage is not taken into consideration. A better, context-sensitive synonym replacement could result in maintaining or improving the quality of the machine learning, as shown in [14]. Finally, the Jaro-Winkler algorithm caused problems for words that contain a negating prefix, such as democratic and un-democratic. These contradicting words were replaced by the same common token, which resulted in quality loss. By detecting certain negating prefixes and excluding them from the Jaro-Winkler similarity detection, we also expect the machine learning results to increase.

6 References

[1] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational Learning Theory. ACM.
[2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM. tw/~cjlin/libsvm/
[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3).
[4] Devijver, P.A. and Kittler, J. (1982). Pattern Recognition: A Statistical Approach. Prentice Hall.
[5] The Oxford English Dictionary. List of abbreviations.
[6] Duda, R.O. and Hart, P.E. (2001). Pattern Classification (2nd Edition). John Wiley and Sons.
[7] Kai-Bo Duan and S. Sathiya Keerthi. Which is the best multiclass SVM method? An empirical study. In Multiple Classifier Systems. Springer.
[8] Feldman, R., and Sanger, J. (2006).
The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
[9] Maura R. Grossman, Gordon V. Cormack, Bruce Hedin, and Douglas W. Oard. Overview of the TREC 2011 Legal Track. In TREC.
[10] Martha van den Hoven, Antal van den Bosch, Kalliopi Zervanou. Beyond Reported History: Strikes That Never Happened. ILK Technical Report Series, August.
[11] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406).
[12] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. Deterministic co-reference resolution based on entity-centric, precision-ranked rules. Computational Linguistics.
[13] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5.
[14] Aisan Maghsoodi, Merlijn Sevenster, Jan C. Scholtes and Georgi Nabaltov (2012). Automatic Sentence-based Classification of Free-text Breast Cancer Radiology Reports. 25th IEEE International Symposium on Computer-Based Medical Systems (CBMS 2012).

[15] Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. MIT Press.
[16] Manning, C.D., Raghavan, P. and Schütze, H. Introduction to Information Retrieval. Cambridge University Press.
[17] Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (Editors) (1986). Machine Learning: An Artificial Intelligence Approach. Volume 1 & 2. Morgan Kaufmann.
[18]
[19] Reuters RCV1 Corpus.
[20] Rijsbergen, C.J. van (1979). Information Retrieval. Butterworths, London.
[21] Scholtes, J.C., Cann, T. van, and Mack, M. (2013). The Impact of Incorrect Training Sets and Rolling Collections on Technology-Assisted Review. International Conference on Artificial Intelligence in Law 2013, DESI V Workshop. June 14, 2013, Consiglio Nazionale delle Ricerche, Rome, Italy.
[22] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady.
[23] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5.
[24] Ann Bies, Constance Cooper, Mark Ferguson, Alyson Littman, Mitchell Marcus, and Ann Taylor. The Penn Treebank project.
[25] Martin F. Porter. An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3).
[26] Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. A multi-pass sieve for co-reference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[27] Claude Elwood Shannon and Warren Weaver. A Mathematical Theory of Communication.
[28] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network.
In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. Association for Computational Linguistics.
[29] Princeton University. About WordNet.
[30] Stanford University. English stop list.
[31] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.
[32] Scholtes, J.C. (1995). Artificial Neural Networks in Information Retrieval in a Libraries Context. PROLIB/ANN, EUR EN, European Commission, DG XIII-E3.
[33]
[34] Anders Björkelund and Jonas Kuhn. Phrase Structures and Dependencies for End-to-End Co-reference Resolution. Proceedings of COLING 2012: Posters, COLING 2012, Mumbai, December 2012.
[35] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media Inc.
[36] CoreNLP.


Chunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence. NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and

More information

The Role of String Similarity Metrics in Ontology Alignment

The Role of String Similarity Metrics in Ontology Alignment The Role of String Similarity Metrics in Ontology Alignment Michelle Cheatham and Pascal Hitzler August 9, 2013 1 Introduction Tim Berners-Lee originally envisioned a much different world wide web than

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

Memory-based grammatical error correction

Memory-based grammatical error correction Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Georgetown University at TREC 2017 Dynamic Domain Track

Georgetown University at TREC 2017 Dynamic Domain Track Georgetown University at TREC 2017 Dynamic Domain Track Zhiwen Tang Georgetown University zt79@georgetown.edu Grace Hui Yang Georgetown University huiyang@cs.georgetown.edu Abstract TREC Dynamic Domain

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

The taming of the data:

The taming of the data: The taming of the data: Using text mining in building a corpus for diachronic analysis Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich Background Big data

More information

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

The Internet as a Normative Corpus: Grammar Checking with a Search Engine The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a

More information

Human Emotion Recognition From Speech

Human Emotion Recognition From Speech RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

(Sub)Gradient Descent

(Sub)Gradient Descent (Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

Introduction, Organization Overview of NLP, Main Issues

Introduction, Organization Overview of NLP, Main Issues HG2051 Language and the Computer Computational Linguistics with Python Introduction, Organization Overview of NLP, Main Issues Francis Bond Division of Linguistics and Multilingual Studies http://www3.ntu.edu.sg/home/fcbond/

More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

Universiteit Leiden ICT in Business

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

More information

Reinforcement Learning by Comparing Immediate Reward

Reinforcement Learning by Comparing Immediate Reward Reinforcement Learning by Comparing Immediate Reward Punit Pandey DeepshikhaPandey Dr. Shishir Kumar Abstract This paper introduces an approach to Reinforcement Learning Algorithm by comparing their immediate

More information

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for

More information

Ensemble Technique Utilization for Indonesian Dependency Parser

Ensemble Technique Utilization for Indonesian Dependency Parser Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database Journal of Computer and Communications, 2016, 4, 79-89 Published Online August 2016 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2016.410009 Performance Analysis of Optimized

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011

Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Detecting Wikipedia Vandalism using Machine Learning Notebook for PAN at CLEF 2011 Cristian-Alexandru Drăgușanu, Marina Cufliuc, Adrian Iftene UAIC: Faculty of Computer Science, Alexandru Ioan Cuza University,

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State

More information

A Bayesian Learning Approach to Concept-Based Document Classification

A Bayesian Learning Approach to Concept-Based Document Classification Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors

More information

Netpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models

Netpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models 1 Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models James B.

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

Using Web Searches on Important Words to Create Background Sets for LSI Classification

Using Web Searches on Important Words to Create Background Sets for LSI Classification Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Active Learning. Yingyu Liang Computer Sciences 760 Fall

Active Learning. Yingyu Liang Computer Sciences 760 Fall Active Learning Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven,

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Large-Scale Web Page Classification. Sathi T Marath. Submitted in partial fulfilment of the requirements. for the degree of Doctor of Philosophy

Large-Scale Web Page Classification. Sathi T Marath. Submitted in partial fulfilment of the requirements. for the degree of Doctor of Philosophy Large-Scale Web Page Classification by Sathi T Marath Submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at Dalhousie University Halifax, Nova Scotia November 2010

More information

The Role of the Head in the Interpretation of English Deverbal Compounds

The Role of the Head in the Interpretation of English Deverbal Compounds The Role of the Head in the Interpretation of English Deverbal Compounds Gianina Iordăchioaia i, Lonneke van der Plas ii, Glorianna Jagfeld i (Universität Stuttgart i, University of Malta ii ) Wen wurmt

More information

On document relevance and lexical cohesion between query terms

On document relevance and lexical cohesion between query terms Information Processing and Management 42 (2006) 1230 1247 www.elsevier.com/locate/infoproman On document relevance and lexical cohesion between query terms Olga Vechtomova a, *, Murat Karamuftuoglu b,

More information

Optimizing to Arbitrary NLP Metrics using Ensemble Selection

Optimizing to Arbitrary NLP Metrics using Ensemble Selection Optimizing to Arbitrary NLP Metrics using Ensemble Selection Art Munson, Claire Cardie, Rich Caruana Department of Computer Science Cornell University Ithaca, NY 14850 {mmunson, cardie, caruana}@cs.cornell.edu

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Beyond the Pipeline: Discrete Optimization in NLP

Beyond the Pipeline: Discrete Optimization in NLP Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We

More information

CSL465/603 - Machine Learning

CSL465/603 - Machine Learning CSL465/603 - Machine Learning Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Introduction CSL465/603 - Machine Learning 1 Administrative Trivia Course Structure 3-0-2 Lecture Timings Monday 9.55-10.45am

More information

Distant Supervised Relation Extraction with Wikipedia and Freebase

Distant Supervised Relation Extraction with Wikipedia and Freebase Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information

Text-mining the Estonian National Electronic Health Record

Text-mining the Estonian National Electronic Health Record Text-mining the Estonian National Electronic Health Record Raul Sirel rsirel@ut.ee 13.11.2015 Outline Electronic Health Records & Text Mining De-identifying the Texts Resolving the Abbreviations Terminology

More information

*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN

*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN From: AAAI Technical Report WS-98-08. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Recommender Systems: A GroupLens Perspective Joseph A. Konstan *t, John Riedl *t, AI Borchers,

More information

Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain

Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Andreas Vlachos Computer Laboratory University of Cambridge Cambridge, CB3 0FD, UK av308@cl.cam.ac.uk Caroline Gasperin Computer

More information

Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition

Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Todd Holloway Two Lecture Series for B551 November 20 & 27, 2007 Indiana University Outline Introduction Bias and

More information

Prediction of Maximal Projection for Semantic Role Labeling

Prediction of Maximal Projection for Semantic Role Labeling Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba

More information

HLTCOE at TREC 2013: Temporal Summarization

HLTCOE at TREC 2013: Temporal Summarization HLTCOE at TREC 2013: Temporal Summarization Tan Xu University of Maryland College Park Paul McNamee Johns Hopkins University HLTCOE Douglas W. Oard University of Maryland College Park Abstract Our team

More information

Organizational Knowledge Distribution: An Experimental Evaluation

Organizational Knowledge Distribution: An Experimental Evaluation Association for Information Systems AIS Electronic Library (AISeL) AMCIS 24 Proceedings Americas Conference on Information Systems (AMCIS) 12-31-24 : An Experimental Evaluation Surendra Sarnikar University

More information

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH ISSN: 0976-3104 Danti and Bhushan. ARTICLE OPEN ACCESS CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH Ajit Danti 1 and SN Bharath Bhushan 2* 1 Department

More information

Generative models and adversarial training

Generative models and adversarial training Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

More information

Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities

Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB

More information

The University of Amsterdam s Concept Detection System at ImageCLEF 2011

The University of Amsterdam s Concept Detection System at ImageCLEF 2011 The University of Amsterdam s Concept Detection System at ImageCLEF 2011 Koen E. A. van de Sande and Cees G. M. Snoek Intelligent Systems Lab Amsterdam, University of Amsterdam Software available from:

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se

More information

Finding Translations in Scanned Book Collections

Finding Translations in Scanned Book Collections Finding Translations in Scanned Book Collections Ismet Zeki Yalniz Dept. of Computer Science University of Massachusetts Amherst, MA, 01003 zeki@cs.umass.edu R. Manmatha Dept. of Computer Science University

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

Physics 270: Experimental Physics

Physics 270: Experimental Physics 2017 edition Lab Manual Physics 270 3 Physics 270: Experimental Physics Lecture: Lab: Instructor: Office: Email: Tuesdays, 2 3:50 PM Thursdays, 2 4:50 PM Dr. Uttam Manna 313C Moulton Hall umanna@ilstu.edu

More information