A Named Entity Recognizer for Filipino Texts


Lim, L. E., New, J. C., Ngo, M. A., Sy, M. C., Lim, N. R.
De La Salle University-Manila
2401 Taft Avenue Malate, Manila
{lan_585, johnchristophernew}@yahoo.com, mango.bunny@gmail.com, maureen_sy@yahoo.com, limn@dlsu.edu.ph

ABSTRACT
In this paper, we define the task of named entity recognition, survey existing systems for named entity recognition, and discuss the design, implementation, and evaluation of a system that performs named entity recognition on Filipino texts. We also compare the results of the system with those of an existing named entity recognizer designed for English texts, using a Filipino corpus.

Keywords
Named entity recognition, extraction, information extraction, natural language processing

1. INTRODUCTION
Named entity recognition (NER) involves automatically (or semi-automatically) processing a series of words and extracting or recognizing words or phrases in the text that refer to people, places, organizations, products, and other named entities. The task also entails identifying the class of each extracted named entity, i.e., person, place, organization, etc.

While the Filipino language makes use of capitalization to indicate the presence of proper nouns, named entities cannot be extracted by merely considering the case of the first letter of each word. This is, in large part, because some named entities contain lower-case words, such as Komisyon sa Wikang Filipino. Alternatively, a list of named entities can be provided beforehand, and a system can simply scan a stream of text for the entries in the list. Such an approach, however, is cumbersome and error-prone, largely because new products and organizations come into being every day, and keeping the lists up to date is a manual and extremely time-consuming task. Furthermore, even assuming that all the named entities were extracted successfully, neither of these two approaches would help a system decide whether a particular named entity in the text, such as Philip Morris, refers to a person or a company. Finally, other entities may be of interest to the user, such as dates, sums of money, percentages, temperatures, and product codes, some of which do not rely on capitalization cues and cannot be enumerated in a finite list.

While many named entity recognition systems exist in the market today, very few, if any, have been designed specifically for handling texts written in the Filipino language. Most software packages and implementations for NER accept a stream of English text and extract names of people, places, and companies or organizations. Some include support for identifying relationships between named entities. For instance, a system processing text containing Gloria Macapagal-Arroyo, Presidente ng Pilipinas might identify Gloria Macapagal-Arroyo as the name of a person, Pilipinas as a place, and a Presidente-ng relationship between the two.

Approaches to named entity recognition fall into two main categories. The first involves heuristic rules and lists of named entities. The second, which is the approach taken in the design of this system, is the statistical approach, which allows a program to learn the task of named entity recognition from previously annotated training data.
In particular, hidden Markov models are used in the manner discussed in [2], and the system is then tested using manually annotated Filipino articles and essays. Section 2 reviews existing approaches and systems for NER and their individual strengths and limitations. Section 3 gives more information regarding the training data used for the system, as well as its actual design and implementation. Section 4 outlines the tests conducted and compares the results of the system with those of an existing system targeted at another language. Finally, section 5 looks at possible extensions and improvements to the system.

2. RELATED WORK
Many contemporary software packages and implementations that perform automatic named entity recognition are in use today, and various techniques and algorithms have been designed for the NER task. The following subsections summarize and evaluate some of these approaches.

2.1 Using Risk Minimization
The proponents of this system [5] looked into the availability of linguistic features and their impact on performance. They observed that statistically based named entity recognition had yet to be made consistent across a variety of data source types; their system therefore focused on features that exist in many languages, with the aim of creating language-independent named entity recognizers that perform as well as systems using language-dependent features.

Their approach was patterned on an earlier text chunking system, with a sentence treated as a sequence of tokens. Named entity recognition is cast as a token-based tagging problem, using IOB encoding for the entity labels. In token-based tagging, each token w_i is to be assigned a class label t_i from a set of existing class labels. The system estimates the probability P(t_i = c | x_i) for every possible class-label value c, where x_i is the feature vector associated with position i. Since the feature vector can include previously determined class labels, this becomes P(t_i = c | x_i) = P(t_i = c | {w_j}, {t_j}_{j<i}). Using a dynamic programming approach, the label sequence is decoded under a conditional probability model of the form w_c . x_i + b_c, where w_c is a linear weight vector and b_c is a constant; both are estimated from the training data by minimizing a risk function observed to be correlated with Huber's loss function, a classification method the authors describe as robust risk minimization.

The system underwent several experiments that varied the features used on the English development set: word case, prefixes and suffixes, part-of-speech tags, and chunking information. The paper further stated that dictionaries can help improve system performance, although this introduces a language-dependent inclination into an otherwise language-independent system; likewise, to increase the precision of named entity recognition, rules can be developed for specific linguistic patterns. The paper concluded that simple, language-independent attributes are capable of good system performance, with language-specific attributes contributing less than expected to the overall improvement upon inclusion.
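The token-based tagging formulation above is straightforward to make concrete. The following is a minimal illustrative sketch, not the implementation from [5]: it scores each token with a per-class linear model w_c . x_i + b_c over simple binary features (word identity, capitalization, previously assigned tag) and decodes greedily left to right, mirroring the P(t_i = c | x_i) decomposition. All feature names, labels, and weights here are our own invented examples; a real system would learn the weights by risk minimization.

```python
# Illustrative sketch of token-based NE tagging with per-class linear scores.
# Not the system described in [5]; feature names and weights are invented.

def features(tokens, i, prev_tag):
    """Binary features for token i, including the previously assigned tag."""
    w = tokens[i]
    return {
        f"word={w.lower()}",
        f"is_capitalized={w[:1].isupper()}",
        f"prev_tag={prev_tag}",
    }

def score(weights, feats, label):
    """Linear score w_c . x_i + b_c for one class label."""
    return (sum(weights.get((f, label), 0.0) for f in feats)
            + weights.get(("bias", label), 0.0))

def tag(tokens, weights, labels=("B-PER", "I-PER", "O")):
    """Greedy left-to-right decoding: pick the best-scoring label per token."""
    tags, prev = [], "O"
    for i in range(len(tokens)):
        feats = features(tokens, i, prev)
        prev = max(labels, key=lambda c: score(weights, feats, c))
        tags.append(prev)
    return tags

# Toy usage with hand-set weights favoring capitalized words as names.
weights = {
    ("is_capitalized=True", "B-PER"): 1.0,
    ("prev_tag=B-PER", "I-PER"): 1.5,
    ("bias", "O"): 0.5,
}
print(tag("si Jose Rizal ay bayani".split(), weights))
# ['O', 'B-PER', 'I-PER', 'O', 'O']
```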
2.2 Using Character-Level Models
The paper [3] describes two named entity recognition models: a character-level hidden Markov model (HMM) and a maximum-entropy conditional Markov model. It states that most named entity recognition and part-of-speech tagging systems have used words as the basic inputs; however, because of limited data availability, unknown words have to be modeled by extracting internal word features such as affixes (prefixes and suffixes), punctuation marks, and capitalization. The two models discussed instead operate directly on representations of character sequences.

The character-level HMM treats a word sequence character by character, associating a state with every character. Each state depends on the previous state, and each character depends on the previous characters as well as on the current state. This character emission model is comparable to n-gram proper-name classification, with the addition of state-transition chaining, which allows the categorization and segmentation of characters. For character-level models, assigning unrelated states to the characters within one word must be avoided; the paper does this in two ways, first through state-transition locking and second through the choice of transition topology. In the topology used, a state is represented as (t, k), with the entity type denoted by t and the length of time the system has been in state t denoted by k. In the case of (PERSON, 2), PERSON is the entity type and 2 indicates the second letter of a PERSON phrase; once the state reaches (PERSON, F), the final state, it remains there until a space follows (inserted if not present). The emission probability is given by P(c_0 | c_-(n-1), ..., c_-1, s), where c_0 is the current character to be emitted, c_-(n-1), ..., c_-1 are the preceding characters of the phrase, and s is the current state.

The model was tried in two configurations: one that discards previous context and one that retains it. The results showed that the character model increased system performance, while the further inclusion of gazetteer entries built from the training data decreased it. Given these results, further tests were done to compare the model against word n-gram systems on CoNLL data. When words were simply assigned to their classes, the character n-grams did not scale well; however, adding features such as start and end symbols, substring features, and the prior and subsequent words increased performance. This constitutes an edge over word n-gram systems, which do not scale well to multi-word names because they cannot combine those names as a pair or related sequence.

The conditional Markov model (CMM) is particularly useful for including sequence-sensitive features. Features such as joint tag sequences, longer-distance sequences, letter-type patterns, and the second-previous and second-next words allow more accurate determination of named entities. The system also allows repeated sub-elements to be labeled as a single class, for instance folding first name and last name into the single class label PERSON. Finally, the paper concluded that character-level models should see further use in named entity recognition systems, having shown significant improvements over word n-gram models.
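As an illustration of the (t, k) state encoding and the n-gram character emission model, here is a minimal sketch of our own, not the implementation from [3]: it counts character n-grams per state in toy labeled strings, with add-one smoothing standing in for the paper's actual estimation details. All function names and the smoothing choice are assumptions made for this example.

```python
# Illustrative sketch of a character-level emission model with (type, k)
# states. Counts come from toy labeled data; smoothing is a simplified
# stand-in for the estimation used in [3].
from collections import defaultdict

def state_for(entity_type, k, length):
    """Encode a state as (type, position), using 'F' for the final character."""
    return (entity_type, "F" if k == length - 1 else k)

def train_emissions(examples, n=3):
    """Count occurrences of c_0 given the previous n-1 characters and state."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, etype in examples:
        for i, c in enumerate(text):
            context = text[max(0, i - (n - 1)):i]   # c_-(n-1) ... c_-1
            counts[(context, state_for(etype, i, len(text)))][c] += 1
    return counts

def emission_prob(counts, context, state, c, vocab_size=100):
    """Add-one smoothed estimate of P(c_0 | context, state)."""
    bucket = counts[(context, state)]
    return (bucket[c] + 1) / (sum(bucket.values()) + vocab_size)

examples = [("Jose", "PERSON"), ("Juan", "PERSON"), ("Manila", "LOCATION")]
counts = train_emissions(examples)
print(emission_prob(counts, "J", ("PERSON", 1), "o"))
```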

2.3 Using Symbolic and Neural Learning
Named entity recognition and classification (NERC) plays a vital role in information extraction. The paper [4] defines it as the identification and categorization of named entities. An NERC system has two components: the lexicon and the grammar. The lexicon holds the named entities previously identified and classified; the grammar is responsible for recognizing and classifying named entities that are not part of the lexicon. NERC systems are considered domain-specific, as languages differ. The paper introduces machine learning techniques that help machines automatically adapt to and acquire information from unclassified data. These techniques can be classified according to the type of model representation used: the symbolic method uses distinct symbolic representations, while the sub-symbolic method uses numeric ones.

The inductive NERC system described in the paper is based on the first general-purpose symbolic machine learning algorithm, C4.5. C4.5 builds decision trees by recursively dividing the data, starting with the whole set and repeatedly using a feature to partition it into classes until the data is exhausted and thoroughly classified. Since exhaustive repeated division is problematic and can lead to overtraining, some leaf nodes of the decision tree are not subdivided further, yet still incorporate most of the important and significant classification rules.

The multi-layered feed-forward neural network (FNN) is composed of input, intermediate, and output nodes, where each node receives inputs only from nodes in the previous layer.

To determine which method yields the best result, both methods are deployed to identify named entities, particularly person and organization entities, which are deemed the most difficult to recognize and classify. For the purpose of this experiment, two features are considered: the part of speech (POS) and the gazetteer tag (person, location, etc.). Feature vectors are created by identifying and tagging noun phrases. The inductive recognizer's feature vector encodes noun phrases using the gazetteer tags available in the gazetteer list, together with part-of-speech tags such as IN (preposition), DT (determiner), NNP (proper noun), and CC (conjunction). In the C4.5 system, more than one gazetteer tag may be assigned to a word, depending on how many times it appears in the list; a "?" marks a missing word, and NOTAG marks a missing gazetteer tag. The neural network recognizer's feature set forms one dimension per part of speech and per gazetteer tag, so that each word is represented by the combination of a part-of-speech vector and a gazetteer vector.

The representation of noun phrases was examined in two experiments. The first looks at how the NERC system functions as a whole and in its sub-functions, classifying named entities into three classes: person, organization, and non-named entity when the phrase fits neither of the first two. The second uses a more hierarchical method of classification, first separating named entities from non-named entities and then classifying the named entities as person or organization. The experiments are evaluated on two measures: recall and precision. Recall is the ratio of correctly identified named entities of a specific type to the total number of entities of that type in the data; precision is the ratio of correctly identified named entities of a particular type to the number of items the system labeled as that type.

Overall, the experiments found that named entities of the person class are easier to identify than those of the organization class, owing to the length and number of words that usually make up an organization name and to the presence of titles that signal person names. The first experiment showed that the order of words was not important for the neural NERC system, which performed better than the decision-tree NERC system. The second experiment, which concentrated on the recognizer functions (neural and inductive), again showed the neural NERC system performing better than the decision-tree system, although both identify and classify named entities well. The paper concluded that reducing or removing manually tagged data in favor of machine learning does not decrease system performance, which allows a developer to deploy the system in a different language or domain without much manual tagging of training data. The proponents further proposed developing an NERC system that is independent of a gazetteer list and can produce its own gazetteer list from raw data.

3. NAMED ENTITY RECOGNITION SYSTEM FOR FILIPINO TEXTS
According to [2], one way to approach the task of NER is to suppose that the text once had all the names within it marked for our convenience, but that the text was then passed through a noisy channel and this information was somehow deleted. Our aim is therefore to model the original process that marked the names. This can be achieved by reading each word in the input stream and deciding, for each, whether or not it is part of a named entity, and classifying it. For simplicity, a word that is not part of any named entity is classified as belonging to the (name) class NOT-A-NAME.

3.1 Training Data
The system was trained on existing text documents. To perform supervised learning, the documents were tagged with four distinct classes: person (tao), place (lug), organization (org), and others (atbp). The tags were incorporated into the text documents using an XML style of encoding, with opening and closing tags marking the start and end of each named entity. Tagging was done with respect to the context of the sentence. For example, the proper noun Philip Morris can be tagged in two ways, as a person or as an organization; the encoder determines the proper usage of the word and tags it accordingly. Tagging does not include the position or title of a named entity: for the named entity Dr. Jose Rizal, only the name of the person is tagged, resulting in Dr. <tao> Jose Rizal </tao>. The same holds for location names, where descriptions such as city, barangay, street, etc. are omitted. The training set came from different types of writing materials, including news articles, translations of books, scripts for plays, and biographies.
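A minimal sketch of how such XML-style annotations can be turned into word-level training labels follows. This is our illustration, not the authors' preprocessing code; it uses the four tag names given above (tao, lug, org, atbp) and labels untagged words NOT-A-NAME, while the function and variable names are invented.

```python
# Illustrative sketch: convert XML-style NE annotations into (word, class)
# training pairs. The tag set matches section 3.1; everything else is ours.
import re

TAGS = ("tao", "lug", "org", "atbp")  # person, place, organization, others
TOKEN = re.compile(r"<(\w+)>(.*?)</\1>|(\S+)")

def read_annotations(text):
    """Yield (word, name_class) pairs; untagged words get NOT-A-NAME."""
    for m in TOKEN.finditer(text):
        tag, span, plain = m.groups()
        if tag in TAGS:
            for word in span.split():
                yield (word, tag.upper())
        elif plain:
            yield (plain, "NOT-A-NAME")

sample = "Dr. <tao> Jose Rizal </tao> ay ipinanganak sa <lug> Calamba </lug> ."
print(list(read_annotations(sample)))
# [('Dr.', 'NOT-A-NAME'), ('Jose', 'TAO'), ('Rizal', 'TAO'), ...]
```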

3.2 System Design
To help classify each word, the features of each word are first extracted and identified. The system uses the same features as the Nymble system, which was later further improved and renamed the Identifinder system. There are fourteen features in total, and they are mutually exclusive. The features are presented in Table 1.

Table 1. Nymble's word feature set [2].

Word feature            Example text  Explanation
twodigitnum             90            Two-digit year
fourdigitnum            1990          Four-digit year
containsdigitandalpha   A8-67         Product code
containsdigitanddash    09-96         Date
containsdigitandslash   11/9/98       Date
containsdigitandcomma   1,000         Amount
containsdigitandperiod  1.00          Amount
othernum                12345         Any other number
allcaps                 BBN           Organization
capperiod               P.            Personal name initial
firstword               The           Capitalized word that is the first word in a sentence
initcap                 Sally         Capitalized word in mid-sentence
lowercase               tree          Un-capitalized word
other                   .net          Punctuation, or any other word not covered above

The name class of the previous word in a sentence typically provides clues about the name class of the current word. Consequently, assuming sentence and word boundaries have been determined, [2] states that one component of assigning a name class NC_0 to the current word w_0 can be computed from the name class NC_-1 of the previous word w_-1:

P(NC_0 | NC_-1, w_-1)

The second component is the probability of generating the current word w_0 with its associated word feature f_0, given the name classes of the current and previous words:

P((w_0, f_0) | NC_0, NC_-1)

The probability of the current word being the first word in a name class NC_0 is then given by the product of these two probabilities, as in the Nymble and Identifinder systems:

P(NC_0 | NC_-1, w_-1) * P((w_0, f_0) | NC_0, NC_-1)

On the other hand, if the previous word has already been classified into a name class, the probability that the current word is itself part of the named entity to which the previous word belongs is given by:

P((w_0, f_0) | (w_-1, f_-1), NC_0)

According to [2], this technique of basing decisions about the current word on decisions already made about the previous word follows the commonly used bigram language model, in which a word's probability of occurrence is conditioned on the previous word. Where there is no previous word, i.e., the current word is the first word in the sentence, a START-OF-SENTENCE token represents w_-1:

P((w_0, f_0) | NC_0, START-OF-SENTENCE)

In addition, the system, like the Nymble system, introduces a +END+ token, with word feature other, to represent the probability that the current word is the last word in its name class:

P((+END+, other) | (w_0, f_0), NC_0)

These probability values are computed from actual counts over previously annotated corpora, which constitute the training data for the system (see section 3.1). For instance, P(NC_0 | NC_-1, w_-1) is obtained by dividing the number of times a word of name class NC_0 follows a word w_-1 of name class NC_-1 by the number of times the word w_-1 appeared with name class NC_-1. Similar computations are done for all the other probabilities. Currently, a default value is simply substituted for missing values; based on preliminary experimentation, a default of 0.005 was found to produce satisfactory results.
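To make the above concrete, here is a small sketch, in our own code rather than the system's, of the two pieces just described: a function assigning each word exactly one of the fourteen Table 1 features, and a count-based estimate of P(NC_0 | NC_-1, w_-1) that falls back to the 0.005 default for unseen events. The order of the feature tests and all helper names are our own assumptions.

```python
# Illustrative sketch of the Table 1 word features and the count-based
# transition probability with the 0.005 default for missing values.
import re
from collections import Counter

def word_feature(word, first_in_sentence=False):
    """Assign exactly one of Nymble's fourteen mutually exclusive features."""
    if re.fullmatch(r"\d{2}", word):   return "twodigitnum"
    if re.fullmatch(r"\d{4}", word):   return "fourdigitnum"
    if any(c.isdigit() for c in word):
        if any(c.isalpha() for c in word):  return "containsdigitandalpha"
        if "-" in word:                     return "containsdigitanddash"
        if "/" in word:                     return "containsdigitandslash"
        if "," in word:                     return "containsdigitandcomma"
        if "." in word:                     return "containsdigitandperiod"
        return "othernum"
    if word.isalpha() and word.isupper() and len(word) > 1:
        return "allcaps"
    if re.fullmatch(r"[A-Z]\.", word): return "capperiod"
    if word[:1].isupper():
        return "firstword" if first_in_sentence else "initcap"
    if word.isalpha():                 return "lowercase"
    return "other"

def transition_prob(bigram_counts, unigram_counts, nc0, nc_prev, w_prev,
                    default=0.005):
    """Estimate P(NC_0 | NC_-1, w_-1) from counts, else use the default."""
    denom = unigram_counts[(nc_prev, w_prev)]
    num = bigram_counts[(nc0, nc_prev, w_prev)]
    return num / denom if denom and num else default

bigrams = Counter({("TAO", "NOT-A-NAME", "si"): 3})
unigrams = Counter({("NOT-A-NAME", "si"): 4})
print(word_feature("1990"), word_feature("P."), word_feature("Sally"))
print(transition_prob(bigrams, unigrams, "TAO", "NOT-A-NAME", "si"))  # 0.75
```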
4. EXPERIMENTAL RESULTS
We classified the system's output for each recognized named entity as one of the following: correct, partially correct, or incorrect. A correct tag indicates that the system identified both the boundaries of a named entity and its class, i.e., person, location, organization, or miscellaneous. A named entity is considered incorrectly tagged when the system tags it as a named entity but none of the words in the phrase are actually part of a named entity. Finally, a partially correct result means that (1) the boundaries of a named entity were correctly determined but the system specified the wrong class, or (2) the boundaries were not correctly determined (there may be extraneous words before or after the named entity that the system tagged as part of it). The system was compared with an existing system, ANNIE [1]. Tables 2 and 3 present the results of the experimental system, and Tables 4 and 5 those of ANNIE.
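The scoring scheme just described can be sketched as follows. This is our illustrative reading of the criteria, with invented function names, where entities are represented as (start, end, class) token spans.

```python
# Illustrative sketch of the correct / partially correct / incorrect scoring
# described above. Entities are (start, end, name_class) token spans.

def overlaps(a, b):
    """True when two (start, end, class) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def judge(predicted, gold):
    """Score one predicted entity against the gold annotations."""
    for g in gold:
        if predicted[:2] == g[:2]:
            # Exact boundaries: correct if the class also matches.
            return "correct" if predicted[2] == g[2] else "partially correct"
        if overlaps(predicted, g):
            # Boundaries wrong but some words belong to a real entity.
            return "partially correct"
    # No predicted word is part of any gold entity.
    return "incorrect"

gold = [(1, 3, "tao")]                 # "Jose Rizal" at token positions 1-2
print(judge((1, 3, "tao"), gold))      # correct
print(judge((0, 3, "tao"), gold))      # partially correct (extra word)
print(judge((5, 6, "lug"), gold))      # incorrect
```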

Table 2. Test results of the experimental system. Column 2 gives the word class (T = tao/person, L = lugar/place, O = organisasyon/organization, A = atbp/miscellaneous).

Document                    Class  Correct  Partially correct  Incorrect
X-men                       T      8        40                 8
                            L      0        19                 39
                            O      0        3                  0
                            A      0        10                 26
Women Power                 T      2        3                  2
                            L      0        3                  6
                            A      0        1                  11
Without a Net               T      2        14                 3
                            L      0        3                  15
                            A      0        4                  12
Wild Swans                  T      10       24                 7
                            L      0        4                  33
                            O      0        0                  1
                            A      0        1                  17
Why Are Filipinos Hungry    T      0        1                  1
                            L      0        6                  16
                            A      0        1                  6
Walk Don't Run              T      1        1                  2
                            L      0        0                  13
                            A      0        1                  18
TV Dinners                  T      10       3                  4
                            L      1        5                  4
                            O      0        1                  0
                            A      0        4                  8
Sweet Valley Kids           T      82       136                13
                            L      0        13                 34
                            O      0        0                  2
                            A      1        21                 22
Stop EVAT Law               T      0        9                  2
                            L      0        3                  2
                            A      0        1                  1
Stardust                    T      13       42                 20
                            L      0        16                 42
                            O      0        1                  2
                            A      0        19                 25
Snoopy Comics               T      0        5                  7
                            L      0        3                  10
                            O      0        0                  1
                            A      0        1                  8
Ryoga                       T      6        2                  3
                            L      0        2                  5
                            A      0        1                  11
Pol Medina                  T      0        1                  1
                            L      1        1                  3
                            A      0        4                  3
Pagmamalasakit              T      4        25                 8
                            L      0        6                  10
                            O      0        4                  12
                            A      0        4                  12
Naruto                      T      8        16                 35
                            L      0        3                  23
                            O      0        2                  5
                            A      0        8                  21
My Brother, My Executioner  T      40       102                23
                            L      4        25                 77
                            O      0        4                  4
                            A      1        37                 68

Table 3. Average result per word class of the experimental system.

Word class     Correct  Partially correct  Incorrect
Person         11.625   26.5               8.6875
Place          0.375    7                  20.75
Organization   0        1.9375             4.125
Miscellaneous  0.125    7.375              16.8125

Table 4. Test results of the system ANNIE. Column 2 gives the word class (T = tao/person, L = lugar/place, O = organisasyon/organization).

Document                    Class  Correct  Partially correct  Incorrect  Another class
My Brother, My Executioner  T      149      19                 306        17
                            L      12       0                  2          23
                            O      1        0                  127        3
Naruto                      T      10       2                  89         13
                            L      1        0                  0          0
                            O      0        0                  9          0
Pagmamalasakit              T      42       9                  36         5
                            L      1        1                  3          0
                            O      19       2                  14         0
Pol Medina                  T      1        0                  2          4
                            L      5        0                  0          2
                            O      1        0                  3          0
Ryoga                       T      2        0                  19         16
                            L      0        0                  3          0
                            O      0        0                  5          0
Stardust                    T      64       20                 86         7
                            L      13       0                  0          2
                            O      0        2                  37         2
Snoopy Comics               T      6        2                  22         1
                            L      0        0                  1          1
                            O      0        2                  2          0
Stop EVAT Law               T      5        3                  6          5
                            L      0        0                  1          1
                            O      0        0                  12         2
Sweet Valley Kids           T      240      27                 83         2
                            L      1        0                  0          0
                            O      0        0                  38         0
TV Dinner                   T      12       2                  20         4
                            L      1        0                  1          1
                            O      0        0                  5          0
Walk Don't Run              T      0        1                  7          2
                            L      1        0                  0          0
                            O      0        1                  6          0
Why Are Filipinos Hungry    T      1        3                  3          0
                            L      0        0                  0          1
                            O      2        1                  0          1
Wild Swans                  T      38       2                  49         2
                            L      0        0                  0          0
                            O      0        0                  8          0
Without A Net               T      5        4                  21         14
                            L      0        0                  0          0
                            O      0        0                  16         0
Women Power                 T      0        3                  16         2
                            L      0        0                  0          9
                            O      1        1                  15         0
X-Men                       T      43       7                  106        29
                            L      0        0                  4          0
                            O      0        0                  46         7

Table 5. Average result per word class of the system ANNIE.

Word class    Correct  Partially correct  Incorrect  Another class
Person        38.625   6.5                54.4375    7.6875
Place         2.1875   0.0625             0.9375     2.5
Organization  1.5      0.5625             21.4375    0.9375

Table 3 shows that the experimental system performed best in recognizing and tagging names of persons and worst in tagging names of organizations, possibly because of the lack of organization names in the training data. The experimental system recognized fewer named entities than ANNIE; however, the number of incorrect results it produced is also dramatically lower than ANNIE's. A possible explanation is the limited amount of training data fed into the experimental system.

5. CONCLUSION
The current implementation of the system is preliminary and can be further improved in terms of accuracy and ease of use. In particular, back-off models and smoothing can be used to handle missing data in the hash tables, and the classification process can be improved by considering all possible sequences of name classes and directly comparing their probabilities with one another. For instance, in the sentence Banks filed bankruptcy papers, the word Banks could refer to a person or to banks in general; the probability of each reading can be computed and compared to generate the best possible (or, in this case, most probable) sequence of labels or name classes [2].

In addition, the process of getting the next sentence from the text stream can be further improved. The current implementation simply checks for the presence of any of the three sentence delimiters (period, question mark, and exclamation point) and checks whether the word, if any, immediately following the punctuation mark is capitalized. This rule is very crude and can fail in many common situations, such as in the presence of abbreviated titles, e.g., Dr. Joe. Along the same lines, the process of getting the next word can be improved by recognizing more characters as potential word delimiters. At present, the system assumes that all input documents are encoded as ANSI text files; consequently, some Unicode characters, such as the left and right single quotes, are unrecognized and can generate errors in the training and/or the recognition task. Also, while the system successfully identifies words with the date, product code, and amount-of-money features, it does not actually tag these words as entities. Finally, more training data could be prepared and fed into the system to further improve its performance on new texts and to reduce the effect of annotation errors on system performance.
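As an illustration of the kind of improvement suggested above, the following sketch (ours, not part of the system) extends the crude delimiter-plus-capitalization rule with a small abbreviation exception list, so that titles like Dr. no longer break sentences. The abbreviation list is an invented sample, not a list drawn from the system.

```python
# Illustrative sketch: the crude sentence-splitting rule described above,
# extended with an abbreviation exception list. The list is a toy example.
import re

ABBREVIATIONS = {"Dr.", "G.", "Gng.", "Bb.", "Atty."}  # sample titles only

def split_sentences(text):
    """Split on ./?/! followed by a capitalized word, except after titles."""
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+(?=[A-Z])", text):
        words = text[start:m.end()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue  # e.g., "Dr. Joe" stays inside one sentence
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Si Dr. Joe ay dumating. Umalis siya kahapon."))
# ['Si Dr. Joe ay dumating.', 'Umalis siya kahapon.']
```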
6. REFERENCES
[1] Cunningham, H. et al. GATE: A General Architecture for Text Engineering. March 2001. [Online]. Available: http://gate.ac.uk/ [Accessed: January 24, 2007].
[2] Jackson, P. and Moulinier, I. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam, Netherlands: John Benjamins Publishing Co., 2002.
[3] Klein, D., Smarr, J., Nguyen, H. and Manning, C. Named Entity Recognition with Character-Level Models. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), 2003.
[4] Petasis, G., Petridis, S., Paliouras, G., Karkaletsis, V., Perantonis, S. J. and Spyropoulos, C. D. Symbolic and Neural Learning for Named Entity Recognition. Presented at the Symposium on Computational Intelligence and Learning, Chios, Greece, 2000.
[5] Zhang, T. and Johnson, D. A Robust Risk Minimization Based Named Entity Recognition System. In Proceedings of CoNLL-2003, Edmonton, Canada, 2003.