
LP&IIS 2013, Springer LNCS Vol. 7912, pp. 57-68
Aaron L.-F. Han, Derek F. Wong, and Lidia S. Chao
Hanlifengaaron AT gmail DOT com
June 17th-18th, 2013, Warsaw, Poland
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science, University of Macau

Outline:
Motivation and related work in NER (CNER)
Problem analysis and the aim of this work
A study of Chinese characteristics (in PER, LOC, and ORG)
The designed and optimized feature set
The employed CRF model
Experiments
Comparison with related work
Performance of different sub-features
Formal definitions of the problems in CNER
Conclusion
References

Research areas influenced by named entity recognition: information extraction, text mining, machine translation, knowledge management, information retrieval, etc. The rapid development of NLP also promotes NER research. Advances in computer technology enable analysis of big data: greater storage capacity and computational power.

Ratinov and Roth [1] perform NER on English using unlabeled text and Wikipedia gazetteers. Sang and Meulder [2] conduct NER research on German. Special applications of NER include geological text processing [3] and biomedical named entity detection [4]. Chinese NER (CNER) is more difficult. Why? There are no word boundaries in a Chinese sentence.

International CNER shared tasks were organized under SIGHAN (the special interest group for Chinese) and CIPS (the Chinese Information Processing Society) before 2008 [5][6], and Chinese personal name disambiguation after 2008 by SIGHAN [7][8]. Methods explored for CNER: Maximum Entropy [9][10][16], Hidden Markov Model [11], Support Vector Machine [12], Conditional Random Field [13][15]. Combinations with other research: word segmentation, sentence chunking, new word detection [14].

Problems with the employed methods:
Maximum Entropy: local optima, label bias
Hidden Markov Model: strong independence assumptions
Support Vector Machine: low performance
Conditional Random Field: challenges in feature selection
Problems in the research work:
More discussion of the algorithms, less of the issues in CNER
Different features used with little or no explanation or background
Little analysis of Chinese characteristics

The aims of this work:
An introduction to Chinese characteristics (PER, LOC, ORG)
Feature optimization based on linguistic analysis
Comparison of the performance of different algorithms
Issue analysis and problem formalization in CNER

Chinese personal names (PER) have a clear format: Surname + Given-name (we use x + y). Chinese surnames: 11,939 recorded by the Chinese Academy of Sciences [19][20], of which 5,313 consist of one character, 4,311 of two characters, 1,615 of three characters, 571 of four characters, etc. A Chinese given name usually contains one or two characters, as shown in Table 1.

Pl: place; Bud: building; Org: organization; Suf: suffix; Abbr: abbreviation

Chinese location names (LOC): commonly used suffixes include 路 (road), 區 (district), 縣 (county), 市 (city), 省 (province), 洲 (continent), etc. Some standard formats, as in Table 1: building names; place + building; place + organization; mix + suffix; abbreviations. A sketch of how such suffix cues could be turned into features follows below.
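One way these suffix cues could enter a feature set is shown in the hypothetical sketch below; the suffix list follows the examples on this slide, not necessarily the paper's exact feature definition.

```python
# Hypothetical location-suffix cue; the paper's optimized feature set may differ.
LOC_SUFFIXES = set("路區縣市省洲")

def loc_suffix_feature(char):
    """Binary cue: is this character a commonly used location suffix?"""
    return {"is_loc_suffix": char in LOC_SUFFIXES}

print(loc_suffix_feature("市"))   # {'is_loc_suffix': True}
print(loc_suffix_feature("山"))   # {'is_loc_suffix': False}
```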

Chinese organization names (ORG): some ORG entities contain suffixes, but the suffixes take various, unformalized expressions. Others have no apparent suffix and are named after the owners of the organization, e.g. 笑開花 (XiaoKaiHua, a small art association). Table 2 lists several kinds of ORG entities, including administrative units, companies, arts, public services, associations, education and culture, etc., suggesting that ORG may be one of the most difficult categories.

X: the random variable over input sequences; Y: the corresponding label sequence; P(Y|X): the conditional model. G = (V, E): a graph with vertex set V and edge set E; Y = {Y_v | v ∈ V}, so Y is indexed by the vertices of G. (X, Y) is a conditional random field model [24] when the conditional probability factorizes as

P_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)

where f_k and g_k are the feature functions, and \lambda_k and \mu_k are the parameters to be trained.
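As a rough illustration of this factorization, the sketch below computes the unnormalized score of one labeling of a character sequence for a linear-chain CRF, with hand-written edge (transition) and vertex (emission) features. The feature names and weights are hypothetical, and normalization and training are omitted.

```python
import math

def unnormalized_score(chars, labels, vertex_weights, edge_weights):
    """exp(sum of weighted feature functions) for one candidate labeling."""
    score = 0.0
    for i, (c, y) in enumerate(zip(chars, labels)):
        # vertex feature g_k(v, y|_v, x): character c observed with label y
        score += vertex_weights.get((y, c), 0.0)
        if i > 0:
            # edge feature f_k(e, y|_e, x): transition from the previous label to y
            score += edge_weights.get((labels[i - 1], y), 0.0)
    return math.exp(score)

# Toy example using the sentence from the function-overload slide; weights are made up.
chars = list("大山悄悄地走了")
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O"]
vertex_w = {("B-PER", "大"): 1.2, ("I-PER", "山"): 0.8}
edge_w = {("B-PER", "I-PER"): 1.5, ("I-PER", "O"): 0.4}
print(unnormalized_score(chars, labels, vertex_w, edge_w))
```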

Training methods for CRFs include: iterative scaling algorithms [24], non-preconditioned conjugate gradient [25], voted perceptron training [26], and the quasi-Newton algorithm [27], which is used in this work. Online tool: http://crfpp.googlecode.com/svn/trunk/doc/index.html
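The authors train with the CRF++ toolkit and a quasi-Newton optimizer; the sketch below is a hedged equivalent using the sklearn-crfsuite wrapper (not the tool used in the paper), whose "lbfgs" algorithm is also a quasi-Newton method. The feature dictionaries are small placeholders for the character features described later.

```python
import sklearn_crfsuite  # assumed available; stand-in for CRF++ used in the paper

def char_features(sent, i):
    """Very small placeholder feature dict for character i of a sentence."""
    feats = {"c0": sent[i]}
    if i > 0:
        feats["c-1"] = sent[i - 1]
        feats["b-10"] = sent[i - 1] + sent[i]   # bigram of previous + current character
    if i < len(sent) - 1:
        feats["c+1"] = sent[i + 1]
    return feats

# X: list of sentences, each a list of per-character feature dicts; y: BIO label sequences.
X_train = [[char_features("澳門大學", i) for i in range(4)]]
y_train = [["B-ORG", "I-ORG", "I-ORG", "I-ORG"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=300)  # quasi-Newton training
crf.fit(X_train, y_train)
print(crf.predict([[char_features("澳門大學", i) for i in range(4)]]))
```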

Data introduction: to cover an extensive range of named entities, we use the SIGHAN Bakeoff-4 corpora [6], which contain three kinds of entities (PER, LOC, and ORG) in two corpora: CityU (traditional Chinese) and MSRA (simplified Chinese). We perform on the closed track (without using external resources). Detailed information on the training and test data is given in Tables 4 and 5.

NE means the total of three kinds of named entities

OOV means the entities of the test data that do not exist in the training data, and Roov means the OOV rate

Samples of the training corpus are shown in Table 6. In the test data, there is only one column of Chinese characters.
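A minimal reader for this layout is sketched below, assuming the character-per-line format of Table 6 (character plus BIO tag, blank lines separating sentences, and no label column in the test data); the exact column separator is an assumption.

```python
def read_conll(path):
    """Read character-per-line data: 'char<TAB or space>label', blank line between sentences."""
    sentences, labels, chars, tags = [], [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                     # sentence boundary
                if chars:
                    sentences.append(chars)
                    labels.append(tags)
                    chars, tags = [], []
                continue
            parts = line.split()
            chars.append(parts[0])
            tags.append(parts[1] if len(parts) > 1 else None)  # test data: no label column
    if chars:                                # flush the last sentence
        sentences.append(chars)
        labels.append(tags)
    return sentences, labels
```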

Recognition results:

Evaluation metrics:

\text{Precision} = \frac{\text{number of correct output}}{\text{number of total output}}, \qquad \text{Recall} = \frac{\text{number of correct output}}{\text{number of truth}}

F\text{-score} = \mathrm{Harmonic}(\text{Precision}, \text{Recall}) = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Evaluation is performed at the NE level (not token per token). E.g., if a token should be B-LOC but is labeled I-LOC instead, this is not counted as a correct labeling.
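A minimal sketch of NE-level scoring under this definition follows: entities are extracted from BIO tags and counted only on exact match of span and type. How malformed sequences (e.g. an I- tag with no preceding B-) are treated is an assumption, not necessarily the bakeoff scorer's exact convention.

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    out, start, etype = [], None, None
    for i, t in enumerate(tags + ["O"]):          # sentinel flushes the last entity
        if t.startswith("B-") or t == "O" or (t.startswith("I-") and t[2:] != etype):
            if start is not None:
                out.append((start, i, etype))
            start, etype = (i, t[2:]) if t.startswith("B-") else (None, None)
    return set(out)

def prf(gold_seqs, pred_seqs):
    """Entity-level precision, recall, F-score over lists of BIO tag sequences."""
    correct = output = truth = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gs, ps = spans(g), spans(p)
        correct += len(gs & ps)
        output += len(ps)
        truth += len(gs)
    precision = correct / output if output else 0.0
    recall = correct / truth if truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# A B-LOC mislabeled as I-LOC yields no matching span, hence no credit.
print(prf([["B-LOC", "I-LOC", "O"]], [["I-LOC", "I-LOC", "O"]]))
```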

Evaluation scores:

Several main conclusions can be drawn:
1. The experimental results corroborate our analysis of the Chinese characteristics:
PER and LOC have simpler structures and expressions, which makes their recognition easier than ORG.
The Roov rate (Table 5) of LOC is the lowest (0.1857 and 0.0861 for CityU and MSRA respectively), and the corresponding recognition of LOC performs very well (0.8599 and 0.8988 respectively in F-score).
In the MSRA corpus, the Roov of ORG (0.3533) is larger than that of PER (0.3026) and the corresponding F-score of ORG is lower; however, in the CityU corpus, the Roov of ORG (0.4884) is much lower than that of PER (0.7850), yet the recognition of ORG still performs worse (0.6646 vs. 0.8036 in F-score).

2. Recognition of OOV entities is the principal challenge for automatic systems: the total OOV rate of CityU (0.4882) is larger than that of MSRA (0.2142), and the corresponding final F-score of CityU (0.7955) is accordingly lower than that of MSRA (0.8833).

Comparison with baselines in Table 9: the baselines are produced by a left-to-right maximum match algorithm applied to the test data, using named entity lists generated from the training data.

The experiments yield much higher F-scores than the baselines. The baseline scores are unstable across entity types, resulting in overall F-scores of 0.5955 and 0.6105 for the CityU and MSRA corpora respectively. In contrast, our results show that all three kinds of entities are recognized with generally high scores and without large fluctuations, which indicates that the approaches employed in this research are reasonable. The improvements on ORG and PER are especially large on both corpora, leading to total F-score increases of 33.6% and 44.7% respectively.

Comparison with related works: related works use different features (various window sizes), algorithms (CRF, ME, SVM, etc.), and external resources (external vocabularies, POS tools, name lists, etc.). Since most researchers test only on the MSRA corpus, the comparison is performed on MSRA; some works are summarized briefly in Table 10. We use the number n to denote the nth previous character when n < 0, the nth following character when n > 0, and the current token when n = 0. E.g., B(-10, 01, 12) denotes three bigram features (previous and current character, current and next character, and the next two characters).
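Under this notation, window-based unigram and bigram character features could be generated roughly as in the sketch below; the offset lists are illustrative, not the exact optimized feature set from the paper.

```python
def window_features(sent, i, unigram_offsets=(-2, -1, 0, 1, 2),
                    bigram_pairs=((-1, 0), (0, 1), (1, 2))):
    """Character features around position i in the paper's offset notation:
    U(n) = single character at offset n; B(n, n+1) = bigram at offsets (n, n+1).
    Offsets outside the sentence are padded with a boundary symbol."""
    def at(offset):
        j = i + offset
        return sent[j] if 0 <= j < len(sent) else "<PAD>"

    feats = {}
    for n in unigram_offsets:                       # U features, e.g. U(-2..2)
        feats[f"U{n}"] = at(n)
    for a, b in bigram_pairs:                       # B features, e.g. B(-10, 01, 12)
        feats[f"B{a}{b}"] = at(a) + at(b)
    return feats

print(window_features(list("大山國際銀行"), 2))
```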

From Table 10: when the window size of the features is smaller, performance is worse; however, too large a window size cannot guarantee good results, since it introduces noise and costs more running time. External materials do not necessarily ensure better performance. The combination of segmentation and POS tagging offers more information about the test set, but segmentation and POS accuracy also influence system quality. The experiments in this paper yield promising results by employing an optimized feature set and a concise model.

The performances of different sub-features in our experiments, with the corresponding results, are reported in Table 11.

Table 11 shows that, generally speaking, more features lead to more training time, and when the feature set is small the same holds for the number of iterations. However, this no longer holds when the feature set grows larger: e.g., on the MSRA corpus, feature set FS4 needs 314 iterations, fewer than the 318 needed by FS2, although FS4 is larger. This may be because FS2 needs more iterations to converge to a fixed point. With the CRF algorithm, the optimized feature set is chosen as FS4; if we continue to expand the features, recognition accuracy decreases, as shown in Table 11.

Due to the changeable and complicated characteristics of Chinese, there are special combinations of characters that can sometimes be labeled in different ways, all of which are reasonable in practice. This causes confusion for researchers. How do we deal with these problems? To facilitate further research, we introduce and provide formal definitions of the existing issues in CNER.

First, the function-overload problem (also called metonymy in some places): one word bears two or more meanings in the same text. E.g., the word 大山 (DaShan) is part of an organization name in the chunk 大山國際銀行 (DaShan International Bank), where the whole chunk denotes a company, while 大山 (DaShan) also represents a person name in the sequence 大山悄悄地走了 (DaShan quietly went away), where the whole sequence describes a person's action. It is difficult for a computer to distinguish these meanings and assign the corresponding labels (ORG or PER); they must be recognized through the analysis of context and semantics.

Furthermore, the multi-segmentation problem in CNER: one sequence can be segmented as a whole or into several fragments according to different meanings, and the labeling will correspondingly end in different results. For example, the sequence 中興實業 (ZhongXing Corporation) can be labeled as a whole chunk, "B-ORG I-ORG I-ORG I-ORG", meaning it is an organization name. It can also be divided as 中興 / 實業 and labeled as "B-ORG I-ORG / N N", meaning that the word 中興 (ZhongXing) represents the organization entity while 實業 (Corporation) is a common Chinese word; this usage is widespread in Chinese documents.

Another example of the multi-segmentation problem in CNER: the sequence 杭州西湖 (HangZhou XiHu) can be labeled "B-LOC I-LOC I-LOC I-LOC" as a single place name, but it can also be labeled "B-LOC I-LOC B-LOC I-LOC", since 西湖 (XiHu) is indeed a place that belongs to the city 杭州 (HangZhou). Which label sequence should we select? Both are reasonable. This is a difficult problem for manual annotation, let alone for a computer. The problems discussed above are only some of those existing in CNER; if we can handle them well, performance will improve in the future.
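Using the span-extraction sketch from the evaluation section (this assumes the spans() helper defined there), the two labelings of 杭州西湖 yield different entity sets, so an exact-match NE-level scorer can credit only whichever one the gold standard happens to use.

```python
gold_one_entity = ["B-LOC", "I-LOC", "I-LOC", "I-LOC"]    # 杭州西湖 as one place name
gold_two_entities = ["B-LOC", "I-LOC", "B-LOC", "I-LOC"]  # 杭州 / 西湖 as two places

# Both labelings are linguistically reasonable, yet they produce different spans.
print(spans(gold_one_entity))    # {(0, 4, 'LOC')}
print(spans(gold_two_entities))  # {(0, 2, 'LOC'), (2, 4, 'LOC')}
```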

This paper addresses CNER, a difficult issue in the NLP literature. The characteristics of Chinese named entities are introduced for personal names, location names, and organization names respectively. Employing the CRF algorithm, the optimized features show promising performance compared with related works that use different feature sets and algorithms. Furthermore, to facilitate further research, this paper discusses the problems existing in CNER and puts forward formal definitions together with instructive solutions. The results could be further improved in an open test by employing other high-quality resources and tools, e.g. externally generated word-frequency counts, common Chinese surnames, and internet dictionaries.

1. Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009), pp. 147-155. Association for Computational Linguistics Press, Stroudsburg (2009)
2. Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: HLT-NAACL, pp. 142-147. ACL Press, USA (2003)
3. Sobhana, N., Mitra, P., Ghosh, S.: Conditional Random Field Based Named Entity Recognition in Geological Text. J. IJCA 1(3), 143-147 (2010)
4. Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Collier, N., Ruch, P., Nazarenko, A. (eds.) International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104-107. ACL Press, Stroudsburg (2004)
5. Levow, G.A.: The third international CLP bakeoff: Word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on CLP, pp. 122-131. ACL Press, Sydney (2006)

6. Jin, G., Chen, X.: The fourth international CLP bakeoff: Chinese word segmentation, named entity recognition and Chinese POS tagging. In: Sixth SIGHAN Workshop on CLP, pp. 83-95. ACL Press, Hyderabad (2008)
7. Chen, Y., Jin, P., Li, W., Huang, C.-R.: The Chinese Persons Name Disambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News. In: CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 346-352. ACL Press, Beijing (2010)
8. Sun, L., Zhang, Z., Dong, Q.: Overview of the Chinese Word Sense Induction Task at CLP2010. In: CIPS-SIGHAN Joint Conference on CLP (CLP2010), pp. 403-409. ACL Press, Beijing (2010)
9. Jaynes, E.: The relation of Bayesian and maximum entropy methods. J. Maximum-Entropy and Bayesian Methods in Science and Engineering 1, 25-29 (1988)
10. Wong, F., Chao, S., Hao, C.C., Leong, K.S.: A Maximum Entropy (ME) Based Translation Model for Chinese Characters Conversion. J. Advances in Computational Linguistics, Research in Computer Science 41, 267-276 (2009)

11. Ekbal, A., Bandyopadhyay, S.: A hidden Markov model based named entity recognition system: Bengali and Hindi as case studies. In: Ghosh, A., De, R.K., Pal, S.K. (eds.) PReMI 2007. LNCS, vol. 4815, pp. 545-552. Springer, Heidelberg (2007)
12. Mansouri, A., Affendey, L., Mamat, A.: Named entity recognition using a new fuzzy support vector machine. J. IJCSNS 8(2), 320 (2008)
13. Putthividhya, D.P., Hu, J.: Bootstrapped named entity recognition for product attribute extraction. In: EMNLP 2011, pp. 1557-1567. ACL Press, Stroudsburg (2011)
14. Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Article 562. Computational Linguistics Press, Stroudsburg (2004)
15. Chen, W., Zhang, Y., Isahara, H.: Chinese named entity recognition with conditional random fields. In: Fifth SIGHAN Workshop on Chinese Language Processing, pp. 118-121. ACL Press, Sydney (2006)

16. Zhu, F., Liu, Z., Yang, J., Zhu, P.: Chinese event place phrase recognition of emergency event using Maximum Entropy. In: Cloud Computing and Intelligence Systems (CCIS), pp. 614-618. IEEE, Shanghai (2011)
17. Qin, Y., Yuan, C., Sun, J., Wang, X.: BUPT Systems in the SIGHAN Bakeoff 2007. In: Sixth SIGHAN Workshop on CLP, pp. 94-97. ACL Press, Hyderabad (2008)
18. Feng, Y., Huang, R., Sun, L.: Two Step Chinese Named Entity Recognition Based on Conditional Random Fields Models. In: Sixth SIGHAN Workshop on CLP, pp. 120-123. ACL Press, Hyderabad (2008)
19. Yuan, Y., Zhong, W.: Contemporary Surnames. Jiangxi People's Publishing House, China (2006)
20. Yuan, Y., Qiu, J., Zhang, R.: 300 Most Common Surnames in Chinese Surnames: Population Genetics and Population Distribution. East China Normal University Publishing House, China (2007)
21. Huang, D., Sun, X., Jiao, S., Li, L., Ding, Z., Wan, R.: HMM and CRF based hybrid model for Chinese lexical analysis. In: Sixth SIGHAN Workshop on CLP, pp. 133-137. ACL Press, Hyderabad (2008)

22. Sun, G.-L., Sun, C.-J., Sun, K., Wang, X.-L.: A Study of Chinese Lexical Analysis Based on Discriminative Models. In: Sixth SIGHAN Workshop on CLP, pp. 147-150. ACL Press, Hyderabad (2008)
23. Yang, F., Zhao, J., Zou, B.: CRFs-Based Named Entity Recognition Incorporated with Heuristic Entity List Searching. In: Sixth SIGHAN Workshop on CLP, pp. 171-174. ACL Press, Hyderabad (2008)
24. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282-289. Morgan Kaufmann, Massachusetts (2001)
25. Shewchuk, J.R.: An introduction to the conjugate gradient method without the agonizing pain. Technical Report CMU-CS-94-125, Carnegie Mellon University (1994)
26. Collins, M., Duffy, N.: New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pp. 263-270. Association for Computational Linguistics Press, Stroudsburg (2002)

27. The Numerical Algorithms Group: E04 - Minimizing or Maximizing a Function. NAG Library Manual, Mark 23 (retrieved 2012)
28. Zhao, H., Liu, Q.: The CIPS-SIGHAN CLP2010 Chinese Word Segmentation Bakeoff. In: CIPS-SIGHAN Joint Conference on CLP, pp. 199-209. ACL Press, Beijing (2010)
29. Zhou, Q., Zhu, J.: Chinese Syntactic Parsing Evaluation. In: CIPS-SIGHAN Joint Conference on CLP (CLP 2010), pp. 286-295. ACL Press, Beijing (2010)
30. Xu, Z., Qian, X., Zhang, Y., Zhou, Y.: CRF-based Hybrid Model for Word Segmentation, NER and even POS Tagging. In: Sixth SIGHAN Workshop on CLP, pp. 167-170. ACL Press, India (2008)

Aaron L.-F. Han, Derek F. Wong, and Lidia S. Chao
Hanlifengaaron AT gmail DOT com, {derekfw, lidiasc} AT umac.mo
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science, University of Macau