A Named Entity Recognizer for Filipino Texts
Lim, L. E., New, J. C., Ngo, M. A., Sy, M. C., Lim, N. R.
De La Salle University-Manila, 2401 Taft Avenue, Malate, Manila

ABSTRACT

In this paper, we define the task of named entity recognition, look at existing systems for named entity recognition, and discuss the design, implementation, and evaluation of a system that performs named entity recognition on Filipino texts. We also compare the results of the system with those of an existing named entity recognizer designed for English texts, using a Filipino corpus.

Keywords: Named entity recognition, named entity extraction, information extraction, natural language processing

1. INTRODUCTION

Named entity recognition (NER) involves automatically (or semi-automatically) processing a series of words and extracting or recognizing words or phrases in the text that refer to people, places, organizations, products, and other named entities. The task of named entity recognition also entails identifying the class of each extracted named entity, i.e., person, place, organization, etc.

While the Filipino language makes use of capitalization to indicate the presence of proper nouns, named entities cannot be extracted by merely considering the case of the first letter of each word. This difficulty is, in large part, due to the fact that some named entities contain lower-case words, such as Komisyon sa Wikang Filipino. Alternatively, a list of named entities can be provided beforehand, and a system can be designed that simply scans a stream of text and searches for the named entities in the list. Such an approach, however, is cumbersome and error-prone, largely because new products and organizations come into being every day, and keeping the lists up to date is a manual and extremely time-consuming task.
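To make the limitation concrete, here is a minimal sketch of the list-based approach just described; the gazetteer entries, labels, and function name are illustrative assumptions, not part of the system presented in this paper.

```python
# Illustrative gazetteer (hypothetical entries): maps token tuples to
# entity classes using the paper's Filipino tag names.
GAZETTEER = {
    ("Komisyon", "sa", "Wikang", "Filipino"): "org",
    ("Philip", "Morris"): "tao",  # ambiguous: person or company?
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def gazetteer_tag(tokens):
    """Greedy longest-match lookup of gazetteer phrases in a token list."""
    matches, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in GAZETTEER:
                matches.append((list(phrase), GAZETTEER[phrase]))
                i += n
                break
        else:
            i += 1
    return matches
```

Keeping such a list current is exactly the maintenance burden described above, and the Philip Morris entry illustrates the class ambiguity that lookup alone cannot resolve.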
Furthermore, even assuming that all the named entities have been extracted successfully, neither of these two approaches would help a system decide whether a particular named entity contained in the text, such as Philip Morris, refers to a person or a company. Finally, other entities may be of interest to the user, such as dates, sums of money, percentages, temperatures, and product codes, some of which do not rely on capitalization cues and cannot be enumerated in a finite list.

While many named entity recognition systems exist in the market today, very few, if any, have been designed specifically for handling texts written in the Filipino language. Most software packages and implementations for NER accept a stream of English text and extract names of people, places, and companies or organizations. Some also include support for identifying relationships between named entities. For instance, a system processing text containing Gloria Macapagal-Arroyo, Presidente ng Pilipinas might identify Gloria Macapagal-Arroyo as the name of a person, Pilipinas as a place, and a Presidente-ng relationship between the two.

Approaches to named entity recognition fall into two main categories. The first involves heuristic rules and lists of named entities. The second, and the approach taken in the design of this system, is the statistical approach, which allows a program to learn the task of named entity recognition from previously annotated training data. In particular, hidden Markov models are used in the manner discussed in [2], and the system is then tested using manually annotated Filipino articles and essays.

In section 2, we look at existing approaches and systems for NER and at their individual strengths and limitations. Section 3 gives more information regarding the training data used for the system, as well as its actual design and implementation.
Section 4 outlines the tests conducted and compares the results of the system with those of existing systems (that are targeted at other languages). Finally, section 5 looks at possible extensions and improvements to the system.

2. RELATED WORK

Many contemporary software packages and implementations that perform automatic named entity recognition are in use today, and various techniques and algorithms have been designed for the NER task. The following subsections summarize and evaluate some of these approaches and techniques.

2.1 Using Risk Minimization

The proponents of this system looked into the availability of linguistic features and their impact on performance. They stated that statistically based named entity recognition has yet to be made consistent across a variety of data source types; their system therefore focused on features that exist in many languages, with the aim of creating language-independent named entity recognition systems that can perform as well as systems utilizing language-dependent features. The approach was patterned on an earlier text chunking system, with the input treated as a sequence of tokens. Named entity recognition is treated as a token-based tagging problem, employing IOB encoding for the entities. In token-based tagging, each token, denoted by w_i, is assigned a class label, denoted by t_i, from a set of existing class labels. The system must be able to calculate the probability of each possible class-label value t_i for each token w_i. This probability can be determined
with the following formula: P(t_i = c | x_i), calculated for every possible class-label value c, where x_i is the feature vector associated with position i. The feature vector can be based on previously determined class labels, which yields P(t_i = c | x_i) = P(t_i = c | {w_i}, {t_j} for j < i). Using a dynamic-programming approach, the tag sequence is decoded under a conditional probability model built on the linear score w_c . x_i + b_c, where w_c is a linear weight vector and b_c a constant, both estimated from the training data. The loss function of the resulting model was observed to be correlated with Huber's loss, and robust risk minimization describes the classification method used to minimize this risk function. The system underwent several experiments, varying the features extracted from the English-language development set: word case, prefixes and suffixes, part-of-speech tags, and chunking information.

The paper further stated that dictionaries can help improve system performance, although relying on them inclines a language-independent system toward language dependence. On the other hand, to increase the precision of named entity recognition, rules can be developed for specific linguistic patterns. The paper concluded that simple language attributes available across languages can yield a language-independent system with good performance, with language-specific attributes contributing less than expected to the overall improvement upon inclusion.

2.2 Using Character-Level Models

The paper describes two named entity recognition models, namely a character-level hidden Markov model (HMM) and a maximum-entropy conditional Markov model. It notes that most named entity recognition and part-of-speech tagging systems have used words as the basic inputs; however, because of limited data availability, there is a need to treat words with unknown-word models and to extract internal word features such as affixes (prefixes and suffixes), punctuation marks, and capitalization. Both models make use of representations of character sequences. The character-level HMM treats word sequences character by character, associating a state with every character. Each state depends on the previous state, and each character depends on the previous character and on the current state it is associated with. This character emission model is comparable to n-gram proper-name classification, with the addition of state-transition chaining, which allows categorization and segmentation of characters.

For the character-level models, assigning a different state to each character within a word is avoided in two ways: first by state-transition locking, and second by the choice of transition topology. The paper implements the transition topology as follows. A state is represented as (t, k), where t denotes the entity type and k the length of time the system has been in state t. In the case of (PERSON, 2), PERSON is the entity type and 2 indicates the second letter of a PERSON phrase, after which a space follows (inserted if not already present). Once the state reaches the final state (PERSON, F), it remains there. The emission probability of the current character c_0 is conditioned on the preceding characters c_-(n-1) ... c_-1 of the phrase, the class label, and the current state s. The model was tried in two configurations: one that discards previous context and one that retains it. The results showed that the character-level model increased system performance.
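The character-sequence representation these models rely on can be illustrated with a small extractor for character n-grams; the padding symbols and function name below are assumptions for illustration, not the authors' implementation.

```python
def char_ngrams(word, n=3):
    """Character n-grams over a word, padded with assumed start ('^')
    and end ('$') symbols, in the spirit of the start/end-symbol
    features discussed for the character-level models."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```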
Further inclusion of gazetteer entries built from the training data decreased performance. Given these results, further tests compared performance against word n-gram systems on data sets such as CoNLL. When assigning words to their classes, plain character n-grams did not scale well; however, adding n-gram features such as start and end symbols, substring features, and the prior and subsequent words increased performance. This constitutes an edge over word n-gram systems, which do not scale well to multi-word names because they do not combine such names into a pair or related sequence.

The conditional Markov model (CMM) is particularly useful for including sequence-sensitive features. Here, features such as joint tag sequences, longer-distance sequences, tag sequences, letter-type patterns, and the second-previous and second-next words allow more accurate determination of named entities. The system also allows repeated sub-elements to be folded into a single class label, such as a first name and a last name both labeled PERSON. Finally, the paper concluded that character models should be used further in named entity recognition systems, having shown significant improvements over word n-gram models.

2.3 Using Symbolic and Neural Learning

Named entity recognition and classification (NERC) plays a vital role in information extraction. The paper defines it as the identification and categorization of named entities. An NERC system has two components, namely, the lexicon and the grammar. The lexicon holds the named entities previously identified and classified; the grammar is responsible for recognizing and classifying named entities that are not part of the lexicon. NERC systems are considered domain-specific, and differences exist across languages. The paper introduces machine learning techniques that let machines automatically adapt and acquire information from unclassified data. These techniques can be classified according to the type of model representation used: the symbolic method uses distinct symbolic representations, while the sub-symbolic method uses numeric ones.

The Inductive NERC system described in the paper is based on C4.5, a general-purpose symbolic machine learning algorithm. C4.5 builds decision trees by recursively partitioning the data: starting from the whole data set, it repeatedly selects a feature to divide the data into classes, continuing until the data is exhausted and thoroughly classified. Because exhaustive division can lead to overtraining, some leaf nodes of the decision tree are not classified extensively, while still incorporating most of the significant classification rules.

The multi-layered feed-forward neural network (FNN) is composed of input, intermediate, and output nodes, each receiving inputs only from nodes in the previous layer. To determine which method yields the best results, both were deployed to identify person and organization entities, which are deemed the most difficult to recognize and classify. Two features are considered for this experiment: the part-of-speech (POS) tag and the gazetteer tag (person, location, etc.). The feature vectors are created by identifying and tagging noun phrases.
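The recursive partitioning at the heart of C4.5 chooses, at each node, the feature that most reduces the entropy of the class labels. A minimal information-gain sketch follows (the feature vectors in the usage example are hypothetical; this is not the Inductive NERC system itself):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the index of the feature whose split yields the largest
    information gain over `labels`, or None if no feature helps."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for f in range(len(rows[0])):
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row[f]].append(label)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = f, gain
    return best
```

Recursing on each resulting group, and stopping early to avoid overtraining, gives the pruned decision tree described above.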
The Inductive Recognizer's feature vector encodes noun phrases using tags based on what is available in the gazetteer list. Some of these tags are as follows: IN (preposition), DT (determiner), NNP (proper noun), CC (conjunction). In the system deploying C4.5, however, more than one gazetteer tag may be assigned to a word, depending on how many times it appears in the list. A '?' is used for missing words, and NOTAG for a missing gazetteer tag. The Neural Network Recognizer's feature set forms one dimension per part of speech and per gazetteer tag, so that each word can be represented by the combination of a part-of-speech vector and a gazetteer vector.

The representation of noun phrases was tested in two ways. The first experiment looks at how the NERC system functions as a whole and in its sub-functions; this is measured by identifying named entities and classifying them into three classes, namely person, organization, and non-named-entity (when an item fits neither of the first two). The second uses a more hierarchical method of classification: items are first divided into named entities and non-named entities, and those under named entities are then further classified as person or organization.

The experiment is evaluated on two measures, namely, recall and precision. Recall is the ratio of correctly identified named entities of a specific type to the total number of entities of that type in the data; precision is the ratio of correctly identified named entities of a particular type to the total number of items the system labeled as that type.
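The two measures can be computed directly from the sets of predicted and gold entities; this sketch assumes, for illustration, that entities are represented as (entity_text, entity_type) pairs:

```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (entity_text, entity_type) pairs.

    recall    = correctly identified entities of a type / entities of
                that type in the data
    precision = correctly identified entities of a type / items the
                system labeled as that type
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```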
Overall, the results showed that named entities of the person class are easier to identify than those of the organization class, because of the length and number of words that usually make up an organization name, and because of the presence of titles that signal person names. The first experiment showed that the order of words was not important for the neural NERC system, which performed better than the decision-tree NERC system. After the first experiment, which dealt with named entity identification, the second experiment concentrated on the recognizer functions (the Neural and Inductive Recognizers). This experiment again showed the neural NERC system performing better than the decision-tree NERC system, although both perform well in identifying and classifying named entities. The paper concluded that reducing or removing manually tagged data in favor of machine learning does not decrease system performance, which allows the developer to deploy the system in a different language or domain without much manual tagging of training data. The proponents further proposed developing an NERC system that is independent of a gazetteer list and can produce its own gazetteer list from raw data.

3. NAMED ENTITY RECOGNITION SYSTEM FOR FILIPINO TEXTS

According to [2], one way to approach the task of NER is to suppose that the text once had all the names within it marked for our convenience, but was then passed through a noisy channel in which this information was somehow deleted. Our aim is therefore to model the original process that marked the names. This can be achieved by reading each word in the input stream and deciding, for each, whether or not it is part of a named entity, and classifying it. For simplicity, a word that is not part of any named entity is classified as belonging to the (name) class NOT-A-NAME.
3.1 Training Data

The system was trained on existing text documents. In order to perform supervised learning, the documents were tagged with four distinct classes, namely: person (tao), place (lug), organization (org), and others (atbp). The tags were incorporated into the documents using an XML style of encoding, with opening and closing tags marking the start and end of each named entity. Tagging was done with respect to the context of the sentence: the proper noun Philip Morris, for example, can be tagged either as a person or as an organization, and the encoder determines the proper type from the word's usage and tags it accordingly. Tagging does not include the position or title of a named entity. For the named entity Dr. Jose Rizal, for instance, only the name of the person is tagged, resulting in: Dr. <tao> Jose Rizal </tao>. The same holds for location names: descriptions such as city, barangay, street, etc. are omitted. The training set came from different types of writing materials, including news articles, translations of books, scripts for plays, and biographies.
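Annotations in this style can be read off with a small extractor; the regular expression below is an illustrative sketch, not the system's actual parser.

```python
import re

# The four entity tags used in the training data (tao, lug, org, atbp);
# the backreference \1 requires the closing tag to match the opening one.
TAG_RE = re.compile(r"<(tao|lug|org|atbp)>\s*(.*?)\s*</\1>")

def extract_entities(text):
    """Return (entity_text, entity_class) pairs from an annotated line."""
    return [(m.group(2), m.group(1)) for m in TAG_RE.finditer(text)]
```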
3.2 System Design

To help classify each word, the features of each word are first extracted and identified. The system uses the same feature set as the Nymble system, which was later further improved and renamed IdentiFinder. There are fourteen features in total, and they are mutually exclusive; they are presented in Table 1.

Table 1. Nymble's word feature set [2].

Word feature            Example text   Explanation
twoDigitNum             90             Two-digit year
fourDigitNum            1990           Four-digit year
containsDigitAndAlpha   A8-67          Product code
containsDigitAndDash                   Date
containsDigitAndSlash   11/9/98        Date
containsDigitAndComma   1,000          Amount
containsDigitAndPeriod  1.00           Amount
otherNum                               Any other number
allCaps                 BBN            Organization
capPeriod               P.             Personal name initial
firstWord               The            Capitalized word that is the first word in a sentence
initCap                 Sally          Capitalized word in mid-sentence
lowercase               tree           Un-capitalized word
other                   .net           Punctuation, or any other word not covered above

The name class of the previous word in the sentence typically provides clues about the name class of the current word. Consequently, assuming sentence and word boundaries have been determined, [2] states that one component of assigning a name class NC_0 to the current word w_0 can be computed from the name class NC_-1 of the previous word w_-1, as follows:

P(NC_0 | NC_-1, w_-1)

The second component is the probability of generating the current word w_0, with its associated word feature f_0, given the name class of the current word and the name class of the previous word:

P((w_0, f_0) | NC_0, NC_-1)

The probability of the current word being the first word in a name class NC_0 is then given by the product of these two probabilities, as in the Nymble and IdentiFinder systems:

P(NC_0 | NC_-1, w_-1) * P((w_0, f_0) | NC_0, NC_-1)

On the other hand, if the previous word has been classified into a name class, the probability that the current word is
itself part of the named entity to which the previous word belongs is given by:

P((w_0, f_0) | (w_-1, f_-1), NC_0)

According to [2], this technique of basing decisions about the current word on decisions already made about the previous word follows the commonly used bigram language model, in which a word's probability of occurrence depends on the previous word. In cases where there is no previous word, i.e., the current word is the first word in the sentence, a START-OF-SENTENCE token is used to represent w_-1, as illustrated below:

P((w_0, f_0) | NC_0, START-OF-SENTENCE)

In addition, the system, as does the Nymble system, introduces a +END+ token, with word feature other, for representing the probability that the current word is the last word in its name class:

P((+END+, other) | (w_0, f_0), NC_0)

These probability values are computed from counts over previously annotated corpora, which constitute the training data for the system (see section 3.1). For instance, the probability P(NC_0 | NC_-1, w_-1) is generated by dividing the total number of times a word of name class NC_0 follows the word w_-1 of name class NC_-1 by the number of times the word w_-1 of name class NC_-1 appeared in the text. Similar computations are done for all the other probabilities. Currently, a default value is simply substituted for missing values; based on preliminary experimentation, a small default value was found to produce satisfactory results.

4. EXPERIMENTAL RESULTS

We classified the system's result for each recognized named entity as one of the following: correct, partially correct, or incorrect. A correct tag indicates that the system correctly identified both the boundaries of a named entity and its class, i.e., person, location, organization, or miscellaneous.
A named entity is considered incorrectly tagged when the system tags it as a named entity but none of the words in the phrase are actually part of one. Finally, a partially correct result means that either (1) the boundaries of a named entity were correctly determined but the system assigned the wrong class, or (2) the boundaries were not correctly determined (for example, extraneous words before or after the named entity were tagged as part of it). The system was compared to an existing system named ANNIE [1].

Table 2. Test results of the experimental system, per document and word class (T tao/person, L lugar/place, O organisasyon/organization, A atbp/miscellaneous). [The numeric values of this table were not preserved in this transcription.]
[Tables 2 and 4 list per-document, per-class results for the experimental system and for ANNIE over the test documents: Without a Net, Wild Swans, Why Are Filipinos Hungry, Walk Don't Run, TV Dinners, Sweet Valley Kids, Stop EVAT Law, Stardust, Snoopy Comics, Ryoga, Pol Medina, Pagmamalasakit, Naruto, My Brother, My Executioner, X-Men, and Women Power. The numeric values were not preserved in this transcription.]

Table 3. Average result per word class of the experimental system (person, place, organization, miscellaneous). [Values not preserved in this transcription.]

Table 4. Test results of ANNIE, per document and word class (T tao/person, L lugar/place, O organisasyon/organization). [Values not preserved in this transcription.]
Table 5. Average result per word class of ANNIE (person, place, organization). [Values not preserved in this transcription.]

Table 3 illustrates that the experimental system performed best in recognizing and tagging names of persons and worst in tagging names of organizations, possibly because of the lack of organization names in the training data. The experimental system recognized fewer named entities than ANNIE; however, the number of incorrect results it produced is also dramatically lower than ANNIE's. A possible explanation is the limited amount of training data fed into the experimental system.

5. CONCLUSION

The current implementation of the system is preliminary and can be further improved in terms of accuracy and ease of use. In particular, back-off models and smoothing can be used for handling missing data in the hash tables, and the classification process can be improved by considering all possible sequences of name classes and directly comparing their probabilities with one another. For instance, in the sentence Banks filed bankruptcy papers, the word Banks could refer to a person or to banks in general; the probability of each reading can be computed and compared to generate the best possible (or, in this case, most probable) sequence of labels or name classes [2].

In addition, the process of getting the next sentence in the text stream can be further improved. The current implementation simply checks for the presence of any of the three sentence delimiters (period, question mark, and exclamation point) and checks whether the word, if any, immediately following the punctuation mark is capitalized. This rule is very crude and can fail in many common situations, such as in the presence of abbreviated titles, e.g., Dr. Joe.
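The crude rule just described can be sketched as follows; the function name is an illustrative assumption, and the second usage example reproduces exactly the failure mode on abbreviated titles mentioned above.

```python
import re

# Sketch of the crude sentence-boundary rule described above: split
# after '.', '?' or '!' whenever the next word starts with a capital.
SPLIT_RE = re.compile(r"[.?!]\s+(?=[A-Z])")

def split_sentences(text):
    """Split text using the capitalization heuristic; note that titles
    such as 'Dr. Joe' trigger a spurious split."""
    sentences, start = [], 0
    for match in SPLIT_RE.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```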
In line with this, the process of getting the next word can also be improved by recognizing more characters as potential word delimiters. At present, the system assumes that all input documents are ANSI-encoded text files; consequently, some Unicode characters, such as the left and right single quotes, are unrecognized and can generate errors in the training and/or the recognition task. Also, while the system successfully identifies words with the features for dates, product codes, and amounts of money, it does not actually tag these words as entities. Finally, more training data could be prepared and fed into the system to further improve its performance on new texts and to reduce the effect of annotation errors on system performance.

6. REFERENCES

[1] Cunningham, H. et al. GATE: A General Architecture for Text Engineering. [Online]. [Accessed: January 24, 2007]

[2] Jackson, P. and Moulinier, I. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam, Netherlands: John Benjamins Publishing Co.

[3] Klein, D., Smarr, J., Nguyen, H. and Manning, C. Named Entity Recognition with Character-Level Models. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003).

[4] Petasis, G., Petridis, S., Paliouras, G., Karkaletsis, V., Perantonis, S. J. and Spyropoulos, C. D. Symbolic and Neural Learning for Named Entity Recognition. Presented at the Symposium on Computational Intelligence and Learning, Chios, Greece, 2000.

[5] Zhang, T. and Johnson, D. A Robust Risk Minimization based Named Entity Recognition System. In Proceedings of CoNLL-2003, Edmonton, Canada.
Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationAQUA: An Ontology-Driven Question Answering System
AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.
More informationPython Machine Learning
Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled
More informationCorrective Feedback and Persistent Learning for Information Extraction
Corrective Feedback and Persistent Learning for Information Extraction Aron Culotta a, Trausti Kristjansson b, Andrew McCallum a, Paul Viola c a Dept. of Computer Science, University of Massachusetts,
More informationOPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS
OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,
More informationDeveloping True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability
Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability Shih-Bin Chen Dept. of Information and Computer Engineering, Chung-Yuan Christian University Chung-Li, Taiwan
More informationA Version Space Approach to Learning Context-free Grammars
Machine Learning 2: 39~74, 1987 1987 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands A Version Space Approach to Learning Context-free Grammars KURT VANLEHN (VANLEHN@A.PSY.CMU.EDU)
More informationThe Internet as a Normative Corpus: Grammar Checking with a Search Engine
The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a
More informationAustralian Journal of Basic and Applied Sciences
AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean
More informationBYLINE [Heng Ji, Computer Science Department, New York University,
INFORMATION EXTRACTION BYLINE [Heng Ji, Computer Science Department, New York University, hengji@cs.nyu.edu] SYNONYMS NONE DEFINITION Information Extraction (IE) is a task of extracting pre-specified types
More informationPOS tagging of Chinese Buddhist texts using Recurrent Neural Networks
POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationRule Learning with Negation: Issues Regarding Effectiveness
Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX
More informationPredicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks
Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com
More informationOn-Line Data Analytics
International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationUsing Blackboard.com Software to Reach Beyond the Classroom: Intermediate
Using Blackboard.com Software to Reach Beyond the Classroom: Intermediate NESA Conference 2007 Presenter: Barbara Dent Educational Technology Training Specialist Thomas Jefferson High School for Science
More informationBootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain
Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Andreas Vlachos Computer Laboratory University of Cambridge Cambridge, CB3 0FD, UK av308@cl.cam.ac.uk Caroline Gasperin Computer
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationModeling function word errors in DNN-HMM based LVCSR systems
Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford
More informationMULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY
MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationEdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,
More informationDefragmenting Textual Data by Leveraging the Syntactic Structure of the English Language
Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Nathaniel Hayes Department of Computer Science Simpson College 701 N. C. St. Indianola, IA, 50125 nate.hayes@my.simpson.edu
More informationCS 598 Natural Language Processing
CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@
More informationQuickStroke: An Incremental On-line Chinese Handwriting Recognition System
QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents
More informationDetecting English-French Cognates Using Orthographic Edit Distance
Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National
More informationNCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches
NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science
More informationLearning and Retaining New Vocabularies: The Case of Monolingual and Bilingual Dictionaries
Learning and Retaining New Vocabularies: The Case of Monolingual and Bilingual Dictionaries Mohsen Mobaraki Assistant Professor, University of Birjand, Iran mmobaraki@birjand.ac.ir *Amin Saed Lecturer,
More informationExploiting Wikipedia as External Knowledge for Named Entity Recognition
Exploiting Wikipedia as External Knowledge for Named Entity Recognition Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292
More informationA Neural Network GUI Tested on Text-To-Phoneme Mapping
A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis
More informationLecture 1: Machine Learning Basics
1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3
More informationModeling function word errors in DNN-HMM based LVCSR systems
Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford
More informationSwitchboard Language Model Improvement with Conversational Data from Gigaword
Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword
More informationarxiv: v1 [cs.cl] 2 Apr 2017
Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,
More informationCoast Academies Writing Framework Step 4. 1 of 7
1 KPI Spell further homophones. 2 3 Objective Spell words that are often misspelt (English Appendix 1) KPI Place the possessive apostrophe accurately in words with regular plurals: e.g. girls, boys and
More informationSoftware Maintenance
1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories
More informationGCSE. Mathematics A. Mark Scheme for January General Certificate of Secondary Education Unit A503/01: Mathematics C (Foundation Tier)
GCSE Mathematics A General Certificate of Secondary Education Unit A503/0: Mathematics C (Foundation Tier) Mark Scheme for January 203 Oxford Cambridge and RSA Examinations OCR (Oxford Cambridge and RSA)
More informationDegreeWorks Advisor Reference Guide
DegreeWorks Advisor Reference Guide Table of Contents 1. DegreeWorks Basics... 2 Overview... 2 Application Features... 3 Getting Started... 4 DegreeWorks Basics FAQs... 10 2. What-If Audits... 12 Overview...
More informationLearning Methods for Fuzzy Systems
Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8
More information1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature
1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details
More informationAGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS
AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic
More informationPrentice Hall Literature: Timeless Voices, Timeless Themes Gold 2000 Correlated to Nebraska Reading/Writing Standards, (Grade 9)
Nebraska Reading/Writing Standards, (Grade 9) 12.1 Reading The standards for grade 1 presume that basic skills in reading have been taught before grade 4 and that students are independent readers. For
More informationTraining and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationNamed Entity Recognition: A Survey for the Indian Languages
Named Entity Recognition: A Survey for the Indian Languages Padmaja Sharma Dept. of CSE Tezpur University Assam, India 784028 psharma@tezu.ernet.in Utpal Sharma Dept.of CSE Tezpur University Assam, India
More informationGrade 5: Module 3A: Overview
Grade 5: Module 3A: Overview This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Exempt third-party content is indicated by the footer: (name of copyright
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationStatewide Framework Document for:
Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance
More informationArtificial Neural Networks written examination
1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14
More informationMultiobjective Optimization for Biomedical Named Entity Recognition and Classification
Available online at www.sciencedirect.com Procedia Technology 6 (2012 ) 206 213 2nd International Conference on Communication, Computing & Security (ICCCS-2012) Multiobjective Optimization for Biomedical
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationOnline Updating of Word Representations for Part-of-Speech Tagging
Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org
More informationarxiv: v1 [math.at] 10 Jan 2016
THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the
More informationA Framework for Customizable Generation of Hypertext Presentations
A Framework for Customizable Generation of Hypertext Presentations Benoit Lavoie and Owen Rambow CoGenTex, Inc. 840 Hanshaw Road, Ithaca, NY 14850, USA benoit, owen~cogentex, com Abstract In this paper,
More informationTowards a MWE-driven A* parsing with LTAGs [WG2,WG3]
Towards a MWE-driven A* parsing with LTAGs [WG2,WG3] Jakub Waszczuk, Agata Savary To cite this version: Jakub Waszczuk, Agata Savary. Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]. PARSEME 6th general
More informationSpeech Emotion Recognition Using Support Vector Machine
Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,
More informationKnowledge Transfer in Deep Convolutional Neural Nets
Knowledge Transfer in Deep Convolutional Neural Nets Steven Gutstein, Olac Fuentes and Eric Freudenthal Computer Science Department University of Texas at El Paso El Paso, Texas, 79968, U.S.A. Abstract
More informationReducing Features to Improve Bug Prediction
Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science
More informationAssignment 1: Predicting Amazon Review Ratings
Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationPrentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10)
Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Nebraska Reading/Writing Standards (Grade 10) 12.1 Reading The standards for grade 1 presume that basic skills in reading have
More informationSenior Stenographer / Senior Typist Series (including equivalent Secretary titles)
New York State Department of Civil Service Committed to Innovation, Quality, and Excellence A Guide to the Written Test for the Senior Stenographer / Senior Typist Series (including equivalent Secretary
More informationCPS122 Lecture: Identifying Responsibilities; CRC Cards. 1. To show how to use CRC cards to identify objects and find responsibilities
Objectives: CPS122 Lecture: Identifying Responsibilities; CRC Cards last revised February 7, 2012 1. To show how to use CRC cards to identify objects and find responsibilities Materials: 1. ATM System
More informationhave to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,
A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994
More informationProof Theory for Syntacticians
Department of Linguistics Ohio State University Syntax 2 (Linguistics 602.02) January 5, 2012 Logics for Linguistics Many different kinds of logic are directly applicable to formalizing theories in syntax
More informationAnalyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio
SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State
More informationLearning Computational Grammars
Learning Computational Grammars John Nerbonne, Anja Belz, Nicola Cancedda, Hervé Déjean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard and Erik Tjong Kim Sang Abstract
More informationTaught Throughout the Year Foundational Skills Reading Writing Language RF.1.2 Demonstrate understanding of spoken words,
First Grade Standards These are the standards for what is taught in first grade. It is the expectation that these skills will be reinforced after they have been taught. Taught Throughout the Year Foundational
More informationLip reading: Japanese vowel recognition by tracking temporal changes of lip shape
Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,
More informationELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading
ELA/ELD Correlation Matrix for ELD Materials Grade 1 Reading The English Language Arts (ELA) required for the one hour of English-Language Development (ELD) Materials are listed in Appendix 9-A, Matrix
More informationContent Language Objectives (CLOs) August 2012, H. Butts & G. De Anda
Content Language Objectives (CLOs) Outcomes Identify the evolution of the CLO Identify the components of the CLO Understand how the CLO helps provide all students the opportunity to access the rigor of
More informationFirst Grade Curriculum Highlights: In alignment with the Common Core Standards
First Grade Curriculum Highlights: In alignment with the Common Core Standards ENGLISH LANGUAGE ARTS Foundational Skills Print Concepts Demonstrate understanding of the organization and basic features
More informationNotes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1
Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial
More information