Multiobjective Optimization for Biomedical Named Entity Recognition and Classification


Procedia Technology 6 (2012) 206-213
2nd International Conference on Communication, Computing & Security (ICCCS-2012)

Asif Ekbal, Sriparna Saha, Utpal Kumar Sikdar
Department of Computer Science and Technology, Indian Institute of Technology Patna, Patna, Bihar
Email: {asif,sriparna,utpal.sikdar}@iitp.ac.in

Abstract

Named Entity Recognition and Classification (NERC) is one of the most fundamental and important tasks in biomedical information extraction. Biomedical named entities (NEs) include mentions of proteins, genes, DNA, RNA etc., which in general have complex structures and are difficult to recognize. We have developed a large number of features for identifying NEs in biomedical texts. Two robust and diverse classification methods, Conditional Random Field (CRF) and Support Vector Machine (SVM), are used to build a number of models depending upon the various representations of the set of features and/or feature templates. Finally, the outputs of these different classifiers are combined using a multiobjective weighted voting approach. We hypothesize that the reliability of predictions of each classifier differs among the various output classes. Thus, in an ensemble system, it is necessary to determine the appropriate weight of vote for each output class in each classifier. Here, a multiobjective genetic algorithm is utilized for determining appropriate weights of votes for combining the outputs of the classifiers. The developed technique is evaluated on the benchmark dataset of JNLPBA 2004 and yields overall recall, precision and F-measure values of 74.10%, 77.58% and 75.80%, respectively.

(c) 2012 The Authors. Published by Elsevier Ltd. Selection and/or peer-review under responsibility of the Department of Computer Science & Engineering, National Institute of Technology Rourkela. Open access under CC BY-NC-ND license.

Keywords: Multiobjective Optimization; Classifier Ensemble; Named Entity Recognition and Classification; Machine Learning; Genetic Algorithm (GA).

Corresponding author: Sriparna Saha. Tel.: +91-8809559190. E-mail address: sriparna@iit.ac.in
doi:10.1016/j.protcy.2012.10.025

1. Introduction

The explosion of information in the biomedical domain leads to a strong demand for automated biomedical information extraction techniques. Named Entity Recognition and Classification (NERC) is a fundamental task of biomedical text mining. Recognizing named entities (NEs) such as mentions of proteins, DNA, RNA etc. is one of the most important factors in biomedical knowledge discovery. However, the inherently complex structures of biomedical NEs pose a big challenge for their identification and classification in biomedical information extraction.

The literature on biomedical NERC is vast, but there is still a wide gap in performance between systems developed for the newswire domain (~91% F-measure) and existing systems in the biomedical domain (~78%). The major challenges and/or difficulties associated with the identification and classification of biomedical NEs are as follows: (i) building a complete dictionary for all types of biomedical NEs is infeasible due to the generative nature of NEs; (ii) NEs are often very long compound words (i.e., they contain nested entities) or abbreviations, and are hence difficult to classify properly; (iii) these names do not follow any nomenclature; (iv) they include different symbols, common words, punctuation symbols, conjunctions, prepositions etc., which makes NE boundary identification more difficult and challenging; and (v) the same word or phrase can refer to different NEs depending on the context.

The literature on biomedical NERC can be broadly classified into two main categories, namely rule based and machine learning based approaches. Rule based approaches (Tsuruoka & Tsujii 2003, Hanisch, Fluck, Mevissen & Zimmer 2003) depend on carefully handcrafted sets of rules, which are difficult to design given the inherently complex nature of biomedical NEs. They require good domain expertise, and it is therefore very difficult to obtain high performance with these models. Such systems also suffer from poor adaptability to new domains as well as to new NE types. The difficulties of rule based systems motivate the use of machine learning approaches, which are easier to adapt and relatively less expensive to maintain. The success of a learning algorithm depends crucially on the features it uses. A supervised machine learning algorithm captures the instances of positive and negative examples over a large collection of annotated documents. Supervised approaches (Wang, Zhao, Tan & Zhang 2008, Kim, Yoon, Park & Rim 2005, GuoDong & Jian 2004, Finkel, Dingare, Nguyen, Nissim, Sinclair & Manning 2004, Settles 2004) have been widely used for NERC in biomedical texts. The release of the tagged GENIA corpus (Ohta, Tateisi & Kim 2002) provides a way of comparing the existing biomedical NERC systems. However, most of these state-of-the-art approaches suggest that an individual NERC system may not cover entity representations with an arbitrary set of features and cannot achieve the best performance.

Classifier ensemble [1] is a popular concept in machine learning. In this paper, we use a genetic algorithm (GA) (Kirkpatrick, Gelatt & Vecchi 1983) based multiobjective optimization (MOO) (Deb 2001) approach for classifier ensemble (Ekbal, Saha & Garbe 2010). The MOO based method (Ekbal, Saha & Garbe 2010) provides an automatic way of determining the appropriate weights of votes for all the classes in each classifier. Thereafter, the decisions of all the classifiers are combined to form an ensemble using our developed approach. Here, we use a multiobjective genetic algorithm based technique, NSGA-II (non-dominated sorting genetic algorithm II) (Deb, Pratap, Agarwal & Meyarivan 2002), as the underlying optimization algorithm. It is to be noted that the approach developed here is evaluated on biomedical corpora, which are more challenging to cope with. In addition, we identify and implement a rich feature set that by itself achieves very good performance.
We use two popular and robust machine learning techniques, namely Conditional Random Field (CRF) and Support Vector Machine (SVM), as the base classifiers. We generate different models of these base classifiers by varying the available features and/or feature templates. We identify a very rich feature set that includes a variety of features based on orthography, local contextual information and global context. One important characteristic of our system is that the identification and selection of features are mostly done without any domain knowledge and/or resources. The developed approach is evaluated on the benchmark datasets of the JNLPBA 2004 shared task (Jin-Dong, Tomoko et al. 2004). Evaluation results show recall, precision and F-measure values of 74.10%, 77.58% and 75.80%, respectively. Comparisons with several baselines and the state-of-the-art systems clearly show the superiority of our developed approach under the same experimental setup. We also evaluate our proposed approach on other benchmark datasets, namely AIMed and GENETAG. Evaluation results on the AIMed dataset show 3-fold recall, precision and F-measure values of 96.08%, 94.81% and 95.44%, respectively. Experiments with the GENETAG dataset yield overall recall, precision and F-measure values of 98.05%, 98.45% and 98.25%, respectively.

[1] We use ensemble classifier and classifier ensemble interchangeably.

Table 1. Orthographic features

Feature          Example            Feature            Example
InitCap          Src                AllCaps            EBNA, LMP
InCap            mAb                CapMixAlpha        NFkappaB, EpoR
DigitOnly        1, 123             DigitSpecial       12-3
DigitAlpha       2NFkappaB, 2A      AlphaDigitAlpha    IL23R, E1A
Hyphen           -                  CapLowAlpha        Src, Ras, Epo
CapsAndDigits    32Dc13             RomanNumeral       I, II
StopWord         at, in             ATGCSeq            CCGCCC, ATAGAT
AlphaDigit       p50, p65           DigitCommaDigit    1,28
GreekLetter      alpha, beta        LowMixAlpha        mRNA, mAb

2. Named Entity Features

Feature selection plays an important role in the success of machine learning techniques. We use a large number of features, listed below, for constructing the various classifiers based on CRF and SVM. These features are easy to derive and do not require deep domain knowledge and/or external resources for their generation. Thus, these features are general in nature and can be applied to other domains as well as other languages. Due to the use of this variety of features, the individual classifiers achieve very high accuracies.

1. Context words: These are the words occurring within the context windows w_{i-3}^{i+3} = w_{i-3}...w_{i+3}, w_{i-2}^{i+2} = w_{i-2}...w_{i+2} and w_{i-1}^{i+1} = w_{i-1}...w_{i+1}, where w_i is the current word. This feature is motivated by the observation that surrounding words carry effective information for the identification of NEs.

2. Word prefix and suffix: These are the prefix and suffix character sequences of length up to n, stripped from the leftmost (prefix) and rightmost (suffix) positions of the word. We set the feature values to undefined if the length of w_i is less than or equal to n-1, if w_i is a punctuation symbol, or if it contains any special symbol or digit. We experiment with both n=3 (i.e., 6 features) and n=4 (i.e., 8 features).

3. Word length: We define a binary valued feature that fires if the length of w_i is greater than a pre-defined threshold. Here, the threshold value is set to 5. This feature captures the fact that short words are unlikely to be NEs.

4. Infrequent word: A list is compiled from the training data by considering the words that appear less frequently than a predetermined threshold, which depends on the size of the dataset. Here, we consider the words having fewer than 10 occurrences in the training data. A feature is then defined that fires if w_i occurs in the compiled list. This is based on the observation that more frequently occurring words are rarely NEs.

5. Part of Speech (PoS) information: PoS information is a critical feature for NERC. In this work, we use the PoS information of the current and/or the surrounding token(s) as features. This information is obtained using GENIA tagger V2.0.2 [2], which extracts PoS information in the biomedical domain with an accuracy of 98.26%.

6. Chunk information: We use GENIA tagger V2.0.2 to get the chunk information. Chunk information (or shallow parsing features) provides useful evidence about the boundaries of biomedical NEs. In the current work, we use the chunk information of the current and/or the surrounding token(s).

7. Dynamic feature: The dynamic feature denotes the output tags t_{i-3} t_{i-2} t_{i-1}, t_{i-2} t_{i-1}, t_{i-1} of the words w_{i-3} w_{i-2} w_{i-1}, w_{i-2} w_{i-1}, w_{i-1} preceding w_i in the sequence w_1^n. This feature is used for the SVM models. For CRF, we consider the bigram template that combines the current and previous output labels.

[2] http://www-tsujii.is.s.u-tokyo.ac.jp/genia/tagger
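To ground features 1-4 and the orthographic cues of Table 1 before continuing with the remaining features, here is a minimal Python sketch of per-token feature extraction. It is only an illustration: the regexes, function names and padding convention are our assumptions, and in the actual system these features are realised as CRF++/YamCha feature templates rather than Python code.

```python
import re

# Illustrative patterns for a few of the orthographic cues in Table 1;
# the authors' exact patterns are not given in the paper.
ORTHO_PATTERNS = {
    "InitCap":      re.compile(r"^[A-Z][a-z]+$"),         # Src
    "AllCaps":      re.compile(r"^[A-Z]{2,}$"),           # EBNA, LMP
    "DigitOnly":    re.compile(r"^\d+$"),                 # 1, 123
    "AlphaDigit":   re.compile(r"^[A-Za-z]+\d+$"),        # p50, p65
    "ATGCSeq":      re.compile(r"^[ATGC]{4,}$"),          # CCGCCC, ATAGAT
    "RomanNumeral": re.compile(r"^[IVXLC]+$"),            # I, II
    "GreekLetter":  re.compile(r"^(alpha|beta|gamma|kappa)$", re.IGNORECASE),
    "Hyphen":       re.compile(r"-"),
}

def token_features(sentence, i, rare_words, n=3, length_threshold=5):
    """Feature dict for token i of a tokenised sentence (features 1-4 and 15)."""
    w = sentence[i]
    feats = {}
    # 1. Context words within the +/-3 window (padded at sentence edges).
    for off in range(-3, 4):
        j = i + off
        feats["w[%+d]" % off] = sentence[j] if 0 <= j < len(sentence) else "<PAD>"
    # 2. Prefixes and suffixes of length up to n; undefined for short,
    #    punctuation-only or digit/symbol-bearing tokens.
    usable = len(w) > n - 1 and w.isalpha()
    for k in range(1, n + 1):
        feats["prefix%d" % k] = w[:k] if usable else "UNDEF"
        feats["suffix%d" % k] = w[-k:] if usable else "UNDEF"
    # 3. Word-length flag and 4. infrequent-word flag (rare_words is the set
    #    of tokens with fewer than 10 occurrences in the training data).
    feats["long_word"] = len(w) > length_threshold
    feats["infrequent"] = w in rare_words
    # 15. A subset of the orthographic flags of Table 1.
    for name, pattern in ORTHO_PATTERNS.items():
        feats[name] = bool(pattern.search(w))
    return feats
```

In a template-based toolkit such as CRF++, the same information would instead be declared through feature-template files over a tokenised, column-formatted corpus; the dictionary form above is simply easier to read.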

8. Unknown token feature: This is a binary valued feature that checks whether the current token was seen in the training corpus or not. In the training phase, this feature is set randomly.

9. Word normalization: We define two different types of features for word normalization. The first type attempts to reduce a word to its stem or root form. This helps to handle words containing plural forms, verb inflections, hyphens, and alphanumeric letters. The second type indicates how a target word is orthographically constructed. Word shapes refer to the mapping of each word to its equivalence class. Here each capitalized character of the word is replaced by 'A', small characters are replaced by 'a' and all consecutive digits are replaced by '0'. For example, IL is normalized to 'AA', IL-2 is normalized to 'AA-0' and IL-88 is also normalized to 'AA-0'.

10. Head nouns: A head noun is the major noun or noun phrase of an NE that describes its function or property. For example, transcription factor is the head noun of the NE NF-kappa B transcription factor. In comparison to the other words of an NE, head nouns are more important as they play the key role in correct classification of the NE class. In this work, we use only unigram and bigram head nouns like receptor, protein, binding protein etc. For domain independence, we extract these head nouns from the training data only. They are compiled into a list of 912 entries that contains only the most frequently occurring head nouns. Apart from these head nouns, we also consider the unigrams and bigrams extracted from the left ends of the NEs of the training data. A list of 578 entries is created by considering only the most frequent such n-grams. A feature is defined that fires iff the current word or sequence of words appears in either of these lists.

11. Verb trigger: These are special types of verbs (e.g., binds, participates etc.) that occur preceding NEs and provide useful information about the NE class. To maintain domain independence, these trigger words are extracted automatically from the training corpus based on their frequencies of occurrence. A feature is then defined that fires iff the current word appears in the list of trigger words.

12. Word class feature: Certain kinds of NEs that belong to the same class are similar to each other. The word class feature is defined as follows: for a given token, capital letters, small letters, numbers and non-English characters are converted to 'A', 'a', 'O' and '-', respectively. Thereafter, consecutive identical characters are squeezed into one character. This feature groups similar names into the same NE class.

13. Informative words: In general, biomedical NEs are quite long and contain many common words that are actually not NEs. For example, function words such as of, and etc. and nominals such as active, normal etc. appear quite frequently in the training data but do not help to recognize NEs. In order to select the most effective words, we first list all the words which occur inside multiword NEs. Thereafter, digits, numbers and various symbols are removed from this list. For each word w_i of this list, a weight is assigned that measures how useful the word is for identifying and/or classifying NEs. This weight, denoted by NEweight(w_i), is calculated as follows:

    NEweight(w_i) = (total no. of occurrences of w_i as part of an NE) / (total no. of occurrences of w_i in the training data)    (1)

The effective words are finally selected based on two parameters, namely NEweight and number of occurrences. The threshold values of these two parameters are selected based on some experiments. The words which have fewer than two occurrences inside the NEs are not considered informative. The remaining words are divided into the following classes:
Class 1: words that occur more than 100 times; here we consider those words whose NEweight is greater than 0.4.
Class 2: words having occurrences ≥ 20 and < 100; here we set NEweight ≥ 0.6.
Class 3: words having occurrences ≥ 10 and < 20; here we choose NEweight ≥ 0.75.
Class 4: words having occurrences ≥ 5 and < 10; here we choose NEweight ≥ 0.85.
Class 5: words having occurrences < 5; here we choose NEweight ≥ 1.00.
We compile five different lists for the above five classes of informative words. A binary feature vector of length five is defined for each word. If the current word in training (or test) is found in any particular list, then the value of the corresponding feature is set to 1. This feature is a modification of the one used in (Saha, Sarkar & Mitra 2009).
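To make features 9 and 13 concrete before moving on, here is a minimal Python sketch of the word-shape mapping and the informative-word classes. It is an illustration only: the function names are ours, and where the text above is ambiguous (e.g., whether the occurrence counts refer to the whole training data) the reading chosen here is an assumption.

```python
import re
from collections import Counter

def word_shape(token):
    """Feature 9: capitals -> 'A', lower-case letters -> 'a', digit runs -> '0'.
    E.g. 'IL' -> 'AA', 'IL-2' -> 'AA-0', 'IL-88' -> 'AA-0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"\d+", "0", shape)

def informative_word_classes(words_inside_nes, all_training_words):
    """Feature 13: bucket words seen inside multiword NEs into five classes
    using NEweight (Eq. 1) and raw frequency; thresholds follow the text."""
    ne_counts = Counter(words_inside_nes)
    total_counts = Counter(all_training_words)
    classes = {k: set() for k in range(1, 6)}
    for w, ne_n in ne_counts.items():
        if ne_n < 2 or not w.isalpha():      # drop rare-in-NE words, digits, symbols
            continue
        n = total_counts[w]                  # assumed: occurrences in training data
        weight = ne_n / n                    # NEweight(w), Eq. (1)
        if n > 100:
            if weight > 0.4:
                classes[1].add(w)
        elif n >= 20:
            if weight >= 0.6:
                classes[2].add(w)
        elif n >= 10:
            if weight >= 0.75:
                classes[3].add(w)
        elif n >= 5:
            if weight >= 0.85:
                classes[4].add(w)
        elif weight >= 1.0:
            classes[5].add(w)
    return classes
```

At feature-generation time the binary length-five vector for a token w would then simply be [w in classes[k] for k in range(1, 6)].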

14. Semantic feature: This feature is semantically motivated and exploits global context information, based on the content words in the surrounding context. We consider all unigrams in the contexts w_{i-3}^{i+3} = w_{i-3}...w_{i+3} of w_i (crossing sentence boundaries) over the entire training data. We convert tokens to lower case and remove stopwords, numbers, punctuation and special symbols. We then extract the 10 most frequent content words from this set of unigrams and define a feature vector of length 10 using them. This feature is defined for each token instance. Given a classification instance, the feature corresponding to token t is set to 1 iff the context w_{i-3}^{i+3} of w_i contains t.

15. Orthographic features: We define a number of orthographic features depending upon the content of the wordform. Several binary features are defined which use capitalization and digit information: initial capital, all capital, capital in inner, initial capital then mix, only digit, digit with special character, initial digit then alphabetic, digit in inner. The presence of some special characters like ',', '-', '.', ')', '(' etc. is very helpful for detecting NEs, especially in the biomedical domain. For example, many biomedical NEs have '-' (hyphen) in their construction. Some of these special characters are also important for detecting the boundaries of NEs. We also use features that check for the presence of ATGC sequences and stop words. The complete list of orthographic features is shown in Table 1.

3. Approach

A multiobjective GA (Ekbal, Saha & Garbe 2010), along the lines of NSGA-II (Deb et al. 2002), is now developed for solving the named entity recognition problem in the biomedical domain using classifier ensembles.

3.1. Chromosome Representation and Population Initialization

If the total number of available classifiers is M and the total number of output classes is O, then the length of the chromosome is M * O. Each chromosome encodes the weights of votes for the O possible classes in each classifier. As an example, the encoding of a particular chromosome is shown in Figure 1. Here, M = 3 and O = 3 (i.e., a total of 9 votes are possible). The weights of votes for the 3 output classes of each classifier are as follows: (i) Classifier 1: 0.59, 0.12 and 0.56; (ii) Classifier 2: 0.09, 0.91 and 0.02; (iii) Classifier 3: 0.76, 0.5 and 0.21. In the present work, we use real encoding. The entries of each chromosome are randomly initialized to a real value r between 0 and 1, with r = rand() / (RAND_MAX + 1). If the population size is P, then all P chromosomes of the population are initialized in this way.

3.2. Fitness Computation

Initially, the F-measure values of all the available classifiers (or models) for each of the output classes are calculated on the development data. Thereafter, we execute the following steps to compute the objective values. Suppose there are in total M classifiers, and let the overall F-measure values of these M classifiers on the development data be F_m, m = 1...M. For each word in the development data, we have M predicted classes, one from each classifier. For the ensemble classifier, the output class label for each word in the development data is determined using weighted voting of these M classifiers' outputs. The weight of the output class provided by the m-th classifier is F_m.
The combined score of a particular class c_i for a particular word w is

    f(c_i) = sum over m = 1...M with op(w, m) = c_i of F_m * I(m, i)

where I(m, i) is the entry of the chromosome corresponding to the m-th classifier and the i-th class, and op(w, m) denotes the output class provided by classifier m for the word w. The class receiving the maximum combined score is selected as the joint decision. The overall recall, precision and F-measure values of this classifier ensemble on this one-third of the training data are then calculated.
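The following Python sketch illustrates Sections 3.1 and 3.2: the real-valued chromosome, the weighted vote f(c_i), and the two objectives (average recall and precision over the folds). It is a simplified illustration under our own assumptions: classes are integer indices with 0 standing for the outside tag, recall and precision are computed token-wise here whereas the paper uses the entity-level JNLPBA scorer, and the surrounding NSGA-II machinery (crowded tournament selection, crossover, mutation, elitism) is not shown.

```python
import random

def init_population(P, M, O):
    """Section 3.1: each chromosome holds M*O vote weights in [0, 1];
    entry m*O + i is the weight of class i in classifier m."""
    return [[random.random() for _ in range(M * O)] for _ in range(P)]

def combined_class(chromosome, outputs, f_measures, O):
    """Section 3.2: weighted vote for one word. outputs[m] is the class index
    predicted by classifier m and f_measures[m] its overall F-measure on the
    development data. Returns argmax_i of
    f(c_i) = sum_{m: op(w,m)=c_i} F_m * I(m, i)."""
    scores = [0.0] * O
    for m, c in enumerate(outputs):
        scores[c] += f_measures[m] * chromosome[m * O + c]
    return max(range(O), key=lambda c: scores[c])

def objectives(chromosome, folds, f_measures, O, outside=0):
    """Average recall and precision over the folds, the two objectives handed
    to NSGA-II. Each fold is a list of (per-classifier outputs, gold class).
    Token-wise scoring is a simplification of the entity-level JNLPBA metric."""
    recalls, precisions = [], []
    for fold in folds:
        tp = fp = fn = 0
        for outputs, gold in fold:
            pred = combined_class(chromosome, outputs, f_measures, O)
            if gold != outside:
                if pred == gold:
                    tp += 1
                else:
                    fn += 1
            elif pred != outside:
                fp += 1
        recalls.append(tp / max(tp + fn, 1))
        precisions.append(tp / max(tp + fp, 1))
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)
```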

Fig. 1. Chromosome Representation (figure not reproduced here).

Table 2. Overall evaluation results in %

Model                        recall    precision    F-measure
Best individual classifier   73.10     76.78        74.90
Baseline-1                   71.03     75.76        73.32
Baseline-2                   71.42     75.90        73.59
Baseline-3                   71.72     76.25        73.92
MOO based approach           74.10     77.58        75.80

Steps 2 and 3 are repeated 3 times to perform 3-fold cross validation. The average recall and precision values of this cross validation are used as the two objective functions of the developed MOO technique.

3.3. Other Operators

Thereafter, the steps of NSGA-II are executed to optimize the two objective functions mentioned above. We use crowded binary tournament selection as in NSGA-II, followed by conventional crossover and mutation, for the MOO based classifier ensemble. The most characteristic part of NSGA-II is its elitism operation, where the non-dominated solutions (Deb 2001) among the parent and child populations are propagated to the next generation. The near-Pareto-optimal strings of the last generation provide the different solutions to the ensemble problem. For every solution on the final Pareto optimal front, the overall average F-measure value of the vote based classifier ensemble under 3-fold cross validation is calculated on the training data. The solution with the maximum F-measure value is selected as the best solution (a small code sketch of this selection step appears below). Final results on the test data are reported using the classifier ensemble corresponding to this best solution. Many other approaches for selecting a solution from the final Pareto optimal front are possible.

4. Evaluation Results and Discussions

We evaluate our developed approach on the JNLPBA 2004 shared task datasets [3]. The datasets were extracted from the GENIA Version 3.02 corpus of the GENIA project, which was constructed by a controlled search on Medline using MeSH terms such as human, blood cells and transcription factors. From this search, 2000 abstracts of about 500K wordforms were selected and manually annotated according to a small taxonomy of 48 classes based on a chemical classification. Out of these classes, 36 were used to annotate the GENIA corpus. In the shared task, the datasets were further simplified to be annotated with only five NE classes, namely Protein, DNA, RNA, Cell line and Cell type (Jin-Dong, Tomoko et al. 2004).

[3] http://research.nii.ac.jp/~collier/workshops/jnlpba04st.htm
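As referenced in Section 3.3, the following minimal, library-free Python sketch shows how the final ensemble could be picked from the last generation: a non-dominated filter over (recall, precision) pairs and a tie-break by the 3-fold average F-measure. The data layout and the f_measure_fn callback are our assumptions; the NSGA-II operators themselves are not reimplemented here.

```python
def dominates(a, b):
    """a, b are (recall, precision) pairs to maximise: a dominates b if it is
    no worse in both objectives and strictly better in at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(solutions):
    """Keep the non-dominated members of a list of (recall, precision, chromosome)."""
    return [s for s in solutions
            if not any(dominates((r, p), (s[0], s[1])) for r, p, _ in solutions)]

def select_best(front, f_measure_fn):
    """Section 3.3: among the solutions on the final front, keep the one whose
    3-fold average F-measure on the training data (f_measure_fn, a hypothetical
    callback) is highest; its chromosome defines the reported ensemble."""
    return max(front, key=lambda s: f_measure_fn(s[2]))
```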

The test set was a relatively new collection of Medline abstracts from the GENIA project. It contains 404 abstracts of around 100K words. One half of the test data was drawn from the same domain as the training data, and the other half from the super domain of blood cells and transcription factors. In order to properly denote the boundaries of NEs, the five classes are further divided using the BIO format, where B-XXX marks the beginning of a multi-word or single-word NE of type XXX, I-XXX marks the remaining words of that NE, and O marks tokens outside any NE.

We build a number of different CRF and SVM based classifiers by varying the available features described earlier. In particular, along with the other features we varied the local contexts within the previous and next three words, i.e. w_{i-3}^{i+3} = w_{i-3}...w_{i+3}. For constructing the CRF based classifiers, we use the C++ based CRF++ package [4], a simple, customizable, and open source implementation of CRF for segmenting or labeling sequential data. For constructing the SVM based classifiers, we use the YamCha [5] toolkit along with the TinySVM-0.07 [6] classifier. Here, we use both the one-vs-rest and pairwise multi-class decision methods, and the polynomial kernel function. The parameters of the NSGA-II based ensemble technique are as follows: population size = 100, number of generations = 50, probability of mutation = 0.2, probability of crossover = 0.9. The performance of each classifier as well as of the overall system is measured in terms of the standard metrics of recall, precision and F-measure, computed with the evaluation script provided with the JNLPBA 2004 shared task [7].

We define three different baseline models as below (a small code sketch of these voting schemes follows the results discussion):
Baseline-1: All the individual classifiers are combined into a final system based on majority voting.
Baseline-2: Classifiers are combined using weighted voting, where the weights are the average F-measure values obtained by 5-fold cross validation on the training data.
Baseline-3: This is also based on weighted voting, but here we consider the individual class F-measure values as the weights.

The CRF based model exhibits the best individual performance, with recall, precision and F-measure values of 73.10%, 76.78% and 74.90%, respectively. The corresponding feature template considers the contexts of the previous and next two tokens and all their possible n-gram (n ≤ 2) combinations from left to right; prefixes and suffixes of length up to 3 characters of only the current word; a feature vector consisting of length, infrequent word, normalization, chunk, orthographic constructs, trigger word, semantic information, unknown word, head noun, word class and effective NE information of only the current token; and bigram feature combinations. The CRF based system with a context window of -3 to +3, prefixes and suffixes of length 4, and all the other features including the dynamic class information feature achieves recall, precision and F-measure values of 76.63%, 73.04% and 74.79%, respectively. The SVM based system with a context window of -3 to +3, prefixes and suffixes of length 4 and all the features achieves recall, precision and F-measure values of 67.70%, 66.34% and 67.01%, respectively. The overall evaluation results of the developed approaches are presented in Table 2. The developed ensemble technique attains final recall, precision and F-measure values of 74.10%, 77.58% and 75.80%, respectively.
It outperforms the best individual model, Baseline-1, Baseline-2 and Baseline-3 by 0.90, 2.48, 2.21 and 1.88 F-measure points, respectively. We also compared our results with all the state-of-the-art systems that were developed on the same datasets and under the same experimental setup. The highest performance attained by the existing approaches (GuoDong & Jian 2004) without using any domain-dependent resources and/or tools such as gazetteers, dictionaries or external NE taggers was 72.55%, which is 3.25 points lower than that of our developed approach. The results show that the classifier ensemble approach performs much better than the individual classifiers with all relevant features, because combining all the classifiers lets us merge the strengths of the different systems.

[4] http://crfpp.sourceforge.net
[5] http://chasen.org/~taku/software/yamcha/
[6] http://cl.aist-nara.ac.jp/~taku-ku/software/tinysvm
[7] http://www-tsujii.is.s.u-tokyo.ac.jp/genia/ertask/report.html
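For clarity, here is a minimal Python sketch of the three baseline combiners defined above. It assumes outputs[m] is the class label predicted by classifier m for the current word, overall_f[m] is classifier m's average cross-validation F-measure, and class_f[m][c] is its F-measure for class c; these names describe an assumed data layout, not the authors' code.

```python
from collections import Counter, defaultdict

def baseline1_majority(outputs):
    """Baseline-1: simple majority voting over the M classifier outputs."""
    return Counter(outputs).most_common(1)[0][0]

def baseline2_overall_f(outputs, overall_f):
    """Baseline-2: weighted voting, classifier m weighted by its average
    cross-validation F-measure overall_f[m]."""
    scores = defaultdict(float)
    for m, c in enumerate(outputs):
        scores[c] += overall_f[m]
    return max(scores, key=scores.get)

def baseline3_class_f(outputs, class_f):
    """Baseline-3: weighted voting with per-class F-measures, i.e. classifier m
    votes for class c with weight class_f[m][c]."""
    scores = defaultdict(float)
    for m, c in enumerate(outputs):
        scores[c] += class_f[m][c]
    return max(scores, key=scores.get)
```

The MOO based ensemble differs from Baseline-3 only in that the per-class vote weights are searched by NSGA-II rather than fixed to the observed per-class F-measures.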

5. Conclusion

In this paper, we have developed a multiobjective classifier ensemble technique for NERC in the biomedical domain using the search capability of a GA based optimization technique, NSGA-II. We hypothesized, and have shown, that rather than combining all the available classifiers blindly or eliminating some classifiers, quantifying the amount of voting for each class in each classifier is a more fruitful approach. We have used the CRF and SVM frameworks as the base classifiers to generate different classification models by varying the available features and/or feature templates. We came up with a very rich feature set that by itself can achieve very high accuracy. Results on the JNLPBA 2004 shared task datasets show that the overall performance attained by the developed MOO based technique is better than the best individual classifier, several baselines and the state-of-the-art systems.

References

Deb, Kalyanmoy. 2001. Multi-objective Optimization Using Evolutionary Algorithms. England: John Wiley and Sons, Ltd.

Deb, Kalyanmoy, Amrit Pratap, Sameer Agarwal & T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2):181-197.

Ekbal, Asif, Sriparna Saha & Christoph S. Garbe. 2010. Multiobjective Optimization Approach for Named Entity Recognition. In PRICAI. pp. 52-63.

Finkel, J., S. Dingare, H. Nguyen, M. Nissim, G. Sinclair & C. Manning. 2004. Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web. In Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004). pp. 88-91.

GuoDong, Z. & S. Jian. 2004. Exploring Deep Knowledge Resources in Biomedical Name Recognition. In JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. pp. 96-99.

Hanisch, Daniel, Juliane Fluck, Heinz-Theodor Mevissen & Ralf Zimmer. 2003. Playing Biology's Name Game: Identifying Protein Names in Scientific Text. In Pacific Symposium on Biocomputing. pp. 403-414.

Jin-Dong, Kim, Ohta Tomoko & Tsuruoka Yoshimasa et al. 2004. Introduction to the Bio-Entity Recognition Task at JNLPBA. In JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics. pp. 70-75.

Kim, Seonho, Juntae Yoon, Kyung-Mi Park & Hae-Chang Rim. 2005. Two-Phase Biomedical Named Entity Recognition Using A Hybrid Method. In IJCNLP. pp. 646-657.

Kirkpatrick, S., C.D. Gelatt & M.P. Vecchi. 1983. Optimization by simulated annealing. Science 220:671-680.

Ohta, T., Y. Tateisi & J. Kim. 2002. The GENIA corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Second International Conference on Human Language Technology Research. pp. 82-86.

Saha, S. K., S. Sarkar & P. Mitra. 2009. Feature Selection Techniques for Maximum Entropy based Biomedical Named Entity Recognition. Journal of Biomedical Informatics 42(5):905-911.

Settles, Burr. 2004. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics. pp. 104-107.

Tsuruoka, Yoshimasa & Jun'ichi Tsujii. 2003. Boosting Precision and Recall of Dictionary-based Protein Name Recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. pp. 41-48.

Wang, Haochang, Tiejun Zhao, Hongye Tan & Shu Zhang. 2008. Biomedical named entity recognition based on classifiers ensemble. International Journal on Computer Science and Applications 5:1-11.