Analysis of Error Count Distributions for Improving the Postprocessing Performance of OCCR

Yue-Shi Lee and Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan, R.O.C.
E-Mail: {leeys, hh_chen}@csie.ntu.edu.tw

Submitted on 23 July 1996, revised on 17 November 1996 and accepted on 29 November 1996

Abstract

Contextual language processing plays an important role in the postprocessing of OCR, and its effects have been demonstrated by many proposed systems. In general it performs well. However, its performance is not as good as expected when the test data contain much unseen context, e.g., proper nouns such as personal names and organizational names. This paper addresses the importance of analyzing the error count distributions before applying the language models. According to this analysis, more than 50% of the errors can be reduced and more than 90% of the time can be saved on average with the Markov character bigram model.

Keywords: Contextual Language Processing, Error Count Distributions, Image Processing, Markov Model, OCCR, Unseen Context

1 Introduction

To improve the interface with computers, input devices such as optical character recognition (OCR) devices and speech recognition (SR) devices are being developed. The OCR device is a good choice when printed documents are provided. However, optical Chinese character recognition (OCCR) is an extremely challenging task because of multiple fonts, complex character shapes and a very large vocabulary (there are about 13,000 characters in Chinese). Because misrecognitions in the image processing stage are hard to avoid, contextual postprocessing (language processing) of the recognition is indispensable, both for reducing the recognition errors introduced by the preprocessing (image processing) stage and for saving time in human proofreading.

Contextual language processing for the postprocessing of OCR is not new. Shyu et al. [1] adapt the word-lattice-based Markov character bigram model suggested by [2] to an OCCR system. Chou and Chang [3] use a Markov word unigram model and a confusion matrix to decide the most plausible characters. Chang and Chen [4] combine a noisy channel model and a language model to implement the postprocessing of OCCR. Araki et al. [5-6] propose a selective error-correction method to detect and correct erroneous characters in Japanese text input through an OCR. Shinghal [7] and Sinha and Prasada [8] also propose approaches for English.

The purpose of contextual language processing is to find the most plausible candidate for each image character, i.e., the candidate with the maximum likelihood probability. The above approaches claim that it plays an important role and has a strong effect in the postprocessing of OCR systems. In general it performs well. However, its performance is not as good as expected when the test data contain much unseen context, e.g., proper nouns such as personal names and organizational names. Besides, some frequently used characters are always selected by the language models, but they may be wrong in some cases. Therefore, if we can predict which image characters have been recognized correctly by the image processing module, these problems can be alleviated. That is, it is important to make an analysis before applying the language models.

This paper is organized as follows. Section 2 presents our OCCR system. Section 3 introduces the language models used in this paper. Section 4 describes the analysis methods and demonstrates the proposed methods. Section 5 gives the concluding remarks.
2 System Description

The proposed system, shown in Fig. 1, consists of two major modules: (1) an image processing module (preprocessing) and (2) a language processing module (postprocessing). The image processing module contains three submodules: (1) image segmentation, (2) feature extraction, and (3) feature matching. An optical scanner scans the printed document and converts it into an image document. The image segmentation submodule segments the entire image into blocks and then classifies each block as a text, graphic or picture block. The text blocks are subsequently segmented into individual character blocks, each of which stands for an image character. Fig. 2(a) shows a simplified example of an image document; Fig. 2(b) shows the same document after the image segmentation submodule is applied. Once the image characters have been segmented properly, the feature extraction submodule extracts the features from each image character. In the feature matching stage, the extracted features of each individual image character are matched against a feature database to recognize the character. The top ten candidates for each image character, which form its candidate set, are generated for the subsequent language processing. Fig. 3 shows the candidates for each image character of Fig. 2(b).

[Fig. 1. Block Diagram of the Proposed System: an image document passes through the image processing module (image segmentation, feature extraction and feature matching, supported by an image base, a dictionary and a feature base) and then through the language processing module (analysis of error count distributions and the Markov character bigram model, supported by character unigram and bigram tables) to yield the text document.]

[Fig. 2. An Example of Image Segmentation: (a) an image document; (b) its segmented character blocks.]

[Fig. 3. The Top Ten Candidates for Each Image Character of Fig. 2(b): each candidate character is listed together with its error count.]

In Fig. 3, a number follows each candidate. The number indicates the error count between the current image character and the image character stored in the image database, according to the features. The lower the error count, the greater the similarity between the two image characters. Thus, the first candidate of each image character is the most plausible one according to the image processing module. The error count can be used to calculate the probability of each candidate given the image character. Given the k-th image character i_k, the matching score SCORE_j of its j-th candidate c_j is defined as follows [9-10]:

SCORE_j = ErrorCount_1 / (ErrorCount_1 + ErrorCount_j)

where ErrorCount_j denotes the error count of the j-th candidate. Based on the definition of SCORE_j, the probability of the j-th candidate c_j given the image character i_k is calculated as

P_IP(c_j | i_k) = SCORE_j / Σ_{m=1..10} SCORE_m

During the language processing stage, the analysis of error count distributions and the Markov character bigram model are adopted together to deal with the recognition errors caused by the image processing module and to yield the final text document. This paper focuses on the postprocessing of the recognition, especially the analysis of error count distributions.
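The score formula above is reconstructed from a garbled extraction, so its exact form is an assumption. The following minimal Python sketch shows how such error counts would be turned into candidate probabilities under that assumption; the error counts are apparently one candidate set from Fig. 3, and the function name is our own.

```python
def candidate_probabilities(error_counts):
    """Turn the per-candidate error counts of one image character
    (lower = more similar) into P_IP(c_j | i_k), using the score form
    SCORE_j = EC_1 / (EC_1 + EC_j) reconstructed above; the exact
    definition in [9-10] may differ."""
    ec1 = error_counts[0]  # error count of the first (best) candidate
    scores = [ec1 / (ec1 + ec) for ec in error_counts]
    total = sum(scores)
    return [s / total for s in scores]

# Error counts of one candidate set (top ten, sorted ascending).
error_counts = [2861, 3847, 3884, 3972, 4016, 4080, 4102, 4121, 4280, 4281]
probs = candidate_probabilities(error_counts)
print(probs[0], sum(probs))  # first candidate gets the highest probability; sums to 1
```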

3 Language Models in OCCR

The problem of OCCR can be defined as how to convert a sequence of image characters I into the corresponding sequence of characters C correctly, based on the language models. In this paper, a statistical Markov character bigram model is adopted to improve the recognition rate. Let I = <i_1, i_2, i_3, ..., i_n> be an image character string and C = <c_1, c_2, c_3, ..., c_n> be one of the possible character strings, where c_j denotes one of the characters in the j-th candidate set. The conversion can be formulated as follows:

Ĉ = argmax_C P(C | I) ≈ argmax_C P_IP(C | I) × P_LP(C) ...... (1)

The former probability, P_IP(C | I), is produced by the image processing module; the latter probability, P_LP(C), is calculated by the language processing module. If the contextual information P_LP(C) is ignored, the formula becomes

Ĉ = argmax_C P_IP(C | I) ...... (2)

where P_IP(C | I) is defined as

P_IP(C | I) = Π_{j=1..n} P_IP(c_j | i_j)

The definition of P_IP(c_j | i_j), the probability of candidate c_j given the image character i_j, is described in Section 2. Using Formula 2, the first candidate, i.e., the one with the lowest error count, is always selected as the result. If more than one candidate has the same error count, the most frequently used character among them is selected through dictionary lookup (see Fig. 1). Similarly, if P_IP(C | I) is ignored in Formula 1, the formula becomes

Ĉ = argmax_C P_LP(C) ...... (3)

In this paper, P_LP(C) is simplified as a Markov character bigram model:

P_LP(C) ≈ Π_{j=1..n+1} P(c_j | c_{j-1})

where c_0 and c_{n+1} mark the beginning and the ending of the character string, respectively. A small decoding sketch based on these formulas is given below.
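To make Formula 1 concrete, here is a minimal sketch of the decoding it implies: a Viterbi-style search over the candidate sets that combines P_IP with the character bigram P_LP. The paper does not give its decoder, so the dynamic program below is an assumption, and the names `candidates`, `p_ip` and `bigram` are hypothetical.

```python
import math

def decode(candidates, p_ip, bigram):
    """Viterbi search for Formula 1: argmax_C P_IP(C | I) * P_LP(C).

    candidates[t] -- list of candidate characters for image character t
    p_ip[t][c]    -- P_IP(c | i_t) from the image processing module
    bigram(a, b)  -- smoothed Markov bigram probability P(b | a), assumed
                     nonzero, with '<s>' and '</s>' as string boundaries
    """
    # best maps the last chosen character to (log-probability, path so far)
    best = {'<s>': (0.0, [])}
    for t, cands in enumerate(candidates):
        nxt = {}
        for c in cands:
            emit = math.log(p_ip[t][c])
            score, path = max(
                (prev_score + math.log(bigram(prev, c)) + emit, prev_path + [c])
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[c] = (score, path)
        best = nxt
    # close the string with the end marker and return the best path
    _, path = max(
        (score + math.log(bigram(prev, '</s>')), path)
        for prev, (score, path) in best.items()
    )
    return path
```

Formula 2 corresponds to keeping only the `emit` term (taking the best candidate at each position), and Formula 3 to dropping it.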
According to the above formulas, the preliminary results are shown in Table 1.

Table 1. The Preliminary Results

Article | Correct Rate for Formula 1 | Correct Rate for Formula 2 | Correct Rate for Formula 3
1       | 97.87% | 98.48% | 88.75%
2       | 96.34% | 96.57% | 88.56%
3       | 98.91% | 97.00% | 90.19%
4       | 97.82% | 97.28% | 89.47%
5       | 98.56% | 96.51% | 87.89%
6       | 97.55% | 94.76% | 89.16%
Total   | 97.84% | 96.83% | 88.97%

In these experiments, an unsegmented Chinese newspaper corpus is adopted as the training data for the Markov character bigram probabilities. It includes approximately 360,000 sentences (about 4,000,000 characters). The test data (6 articles) are scanned from the Liberty Times and include 237 sentences (2,457 characters). Table 1 makes clear that using the contextual information only (Formula 3) to select the most plausible candidate gains no advantage in these experiments. This is because the image processing module already performs excellently, and the test data (news) contain many proper nouns, such as personal names and organizational names, which are difficult for the language models to resolve. Besides, some frequently used characters are always selected by Formula 3, but they may be wrong in some cases. Because Formula 1 combines P_IP(C | I) and P_LP(C), these effects are alleviated there, but they still have some influence. The subsequent analysis will demonstrate this point.

The preliminary results for Formula 2 are examined in detail in Table 2.

Table 2. The Statistic Information of the Preliminary Results for Formula 2

Article | Correctly Recognized | Wrongly Recognized | Correct within the Top Ten Candidates
1       | 324  | 5  | 328
2       | 422  | 15 | 432
3       | 356  | 11 | 366
4       | 536  | 15 | 551
5       | 470  | 17 | 486
6       | 271  | 15 | 282
Total   | 2379 | 78 | 2445

In Table 2, article 1 has 5 image characters wrongly recognized by Formula 2; that is, the first candidate is not the correct result at these five positions. But for 4 of the 5, the correct character can be found within the top ten candidates. From Table 2, 84.62% ((2445-2379)/78) of the wrongly recognized image characters can be recovered by using the characters within the top ten candidates. This is a good sign, provided the contextual information can be successfully applied at the wrongly recognized positions. Tables 3 and 4 show the detailed statistic information of the preliminary results for Formulas 1 and 3, respectively.

Table 3. The Detail Statistic Information of the Preliminary Results for Formula 1

Article | Correct | Wrong | CC   | CW | WC | WW | Net Gain
1       | 322     | 7     | 320  | 4  | 2  | 3  | -2
2       | 421     | 16    | 413  | 9  | 8  | 7  | -1
3       | 363     | 4     | 355  | 1  | 8  | 3  | +7
4       | 539     | 12    | 526  | 10 | 13 | 2  | +3
5       | 480     | 7     | 465  | 5  | 15 | 2  | +10
6       | 279     | 7     | 268  | 3  | 11 | 4  | +8
Total   | 2404    | 53    | 2347 | 32 | 57 | 21 | +25

Table 4. The Detail Statistic Information of the Preliminary Results for Formula 3

Article | Correct | Wrong | CC   | CW  | WC | WW | Net Gain
1       | 292     | 37    | 290  | 34  | 2  | 3  | -32
2       | 387     | 50    | 378  | 44  | 9  | 6  | -35
3       | 331     | 36    | 323  | 33  | 8  | 3  | -25
4       | 493     | 58    | 484  | 52  | 9  | 6  | -43
5       | 428     | 59    | 413  | 57  | 15 | 2  | -42
6       | 255     | 31    | 244  | 27  | 11 | 4  | -16
Total   | 2186    | 271   | 2132 | 247 | 54 | 24 | -193

In these two tables, Correct (Wrong) denotes the number of correctly (wrongly) recognized image characters, i.e., Correct = CC + WC and Wrong = CW + WW. Columns CC, CW, WC and WW indicate the performance changes from the image processing module (preprocessing) to the language processing module (postprocessing), classified into four types: Correct-to-Correct (CC), Correct-to-Wrong (CW), Wrong-to-Correct (WC) and Wrong-to-Wrong (WW). In the CW type, an image character that is correctly recognized by the image processing module is changed to a wrong one by the language processing module. In the WC type, a wrongly recognized character is recovered to the correct one by the language processing module. In the CC type, a correctly recognized character is left unchanged. In the WW type, a wrongly recognized character is left unchanged or is changed to another wrong one. The performance of the language processing module can then be evaluated as the net gain:

Net Gain = WC - CW

In Table 4, the net gains are all negative. This reveals that the language processing module cannot be effectively applied to the OCCR application when P_IP(C | I) is ignored. But the net gains of articles 1 and 2 in Table 3 are also negative, even though P_IP(C | I) is incorporated with P_LP(C) there. In Table 3 (4), 32 (247) image characters that are correctly recognized by the image processing module are changed to wrong ones by the language processing module, while 57 (54) wrongly recognized image characters are recovered to the correct ones. Since Table 2 shows that 66 (2445-2379) wrongly recognized characters could in principle be recovered, the language processing module performs well at the wrongly recognized positions. That is, if we can predict which positions have been correctly recognized by the image processing module, the first candidate can be selected as the result there; the other candidates (from the second to the tenth) can be removed from the candidate set and will not be tried by the language processing module. In this way, the net gain can be turned into a positive value and the benefit of the language processing module can be shown. The next section describes how to predict whether a position is correctly or wrongly recognized by the image processing module.

4 Analysis of Error Count Distributions

To decide which image characters have been recognized correctly by the image processing module, the only information we can use is the error count of each candidate. In this paper, an image character is assumed to be correctly recognized by the image processing module based on the following two hypotheses:

(1) The error count of the first candidate in the candidate set must be less than a threshold value A.
(2) The difference between the error counts of the first candidate and the second candidate in the candidate set must be greater than a threshold value B.

Table 5 shows the error count distribution for the first hypothesis.

Table 5. The Error Count Distribution for the First Hypothesis

Range of the Error Count for the First Candidate | Correctly Recognized | Wrongly Recognized
0 ~ 999     | 5   | 0
1000 ~ 1499 | 5   | 0
1500 ~ 1999 | 43  | 0
2000 ~ 2499 | 181 | 0
2500 ~ 2999 | 407 | 6
3000 ~ 3499 | 688 | 9
3500 ~ 3999 | 651 | 25
4000 ~ 4499 | 326 | 15
4500 ~ 4999 | 66  | 17
5000 ~ 5499 | 7   | 6

In Table 5, Column 1 indicates the range of the error count for the first candidate. Column 2 (3) indicates the number of image characters that are correctly (wrongly) recognized by the image processing module under the condition in Column 1. For example, there are 407 correctly recognized and 6 wrongly recognized image characters whose first candidates have error counts between 2500 and 2999. Table 5 shows that an image character is always correctly recognized by the image processing module when the error count of its first candidate is less than 2500, i.e., threshold value A. That is, a total of 234 (5+5+43+181) positions can be detected as correct: at these positions the first candidate can be selected as the correct result and the other candidates need not be tried by the language processing module. However, this hypothesis alone yields only a small improvement (9.52%), because the total number of positions (image characters) is 2457. A sketch of the combined filtering test appears below.
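As a concrete illustration of the two hypotheses, the following minimal Python sketch flags the positions assumed to be correctly recognized. The threshold values A = 2500 and B = 300 are the ones the paper settles on later; the function name and input layout are our own, and the OR combination of the two tests is our reading of Section 4, since the paper does not state the combination explicitly.

```python
def correctly_recognized(error_counts, A=2500, B=300):
    """Apply the two hypotheses of Section 4 to one candidate set, whose
    error counts are sorted in ascending order (first = best candidate).

    Returns True when the position is assumed correctly recognized, so
    the first candidate is kept and the remaining candidates are removed
    before language processing."""
    first, second = error_counts[0], error_counts[1]
    hypothesis1 = first < A            # confident match (cf. Table 5)
    hypothesis2 = second - first > B   # clear gap to the runner-up (cf. Table 6)
    return hypothesis1 or hypothesis2

print(correctly_recognized([2100, 3400, 3550]))  # True: passes Hypothesis I
print(correctly_recognized([3000, 3100, 3250]))  # False: left to the language model
```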

Table 6 shows the error count distribution for the second hypothesis. In Table 6, Column 1 indicates the range of the difference between the error counts of the first and second candidates in the candidate set. Column 2 (3) indicates the number of correctly (wrongly) recognized image characters under the condition in Column 1. For example, there are only 136 correctly recognized and 63 wrongly recognized image characters for which the difference between the error counts of the first and second candidates is less than 200, i.e., threshold value B.

Table 6. The Error Count Distribution for the Second Hypothesis

Range of the Difference of the Error Count | Correctly Recognized | Wrongly Recognized
0 ~ 200 | 136 | 63
0 ~ 250 | 190 | 69
0 ~ 300 | 249 | 74
0 ~ 400 | 451 | 76
0 ~ 500 | 646 | 77

That is, if we assume that the first candidate is correct whenever the difference of the error count is greater than 200, a total of 2258 (2457-136-63) positions are identified. Of these, 2243 are correct and 15 (78-63) are wrong; these 15 wrongly recognized characters are identified as correct under this hypothesis. However, only 199 (136+63) positions then have to be considered further. It is clear that this analysis is useful, because most of the characters are identified correctly in advance. Tables 7 and 8 show the experimental results for Formulas 1 and 3 based on the two hypotheses, with A = 2500 and B set to 200, 250 and 300.

Table 7. The Experimental Results for Formula 1 Based on Two Hypotheses (A = 2500)

Article | Correct (B=200) | Correct (B=250) | Correct (B=300) | Wrong (B=200) | Wrong (B=250) | Wrong (B=300)
1     | 325  | 325  | 326  | 4  | 4  | 3
2     | 429  | 429  | 429  | 8  | 8  | 8
3     | 361  | 362  | 362  | 6  | 5  | 5
4     | 547  | 547  | 548  | 4  | 4  | 3
5     | 482  | 482  | 483  | 5  | 5  | 4
6     | 279  | 279  | 279  | 7  | 7  | 7
Total | 2423 | 2424 | 2427 | 34 | 33 | 30

Table 8. The Experimental Results for Formula 3 Based on Two Hypotheses (A = 2500)

Article | Correct (B=200) | Correct (B=250) | Correct (B=300) | Wrong (B=200) | Wrong (B=250) | Wrong (B=300)
1     | 324  | 324  | 324  | 5  | 5  | 5
2     | 425  | 424  | 424  | 12 | 13 | 13
3     | 360  | 360  | 358  | 7  | 7  | 9
4     | 544  | 543  | 544  | 7  | 8  | 7
5     | 477  | 477  | 476  | 10 | 10 | 11
6     | 278  | 278  | 278  | 8  | 8  | 8
Total | 2408 | 2406 | 2404 | 49 | 51 | 53

In Tables 7 and 8, A and B denote the threshold values for Hypothesis I and Hypothesis II, respectively. The performance in these two experiments increases considerably. For example, the original model based on Formula 1 (Formula 3) has 53 (271) errors; after the analysis, the recognition errors are reduced to 30 (53) under the threshold values A = 2500 and B = 300. That is, 43.40% (80.44%) of the errors on average are removed by the analysis.

The threshold values A and B in Hypotheses I and II depend strongly on the quality of the printed documents, not on the type or domain of the context. Another 7 printed documents were therefore scanned for testing. The experimental results are shown in Tables 9 and 10.

Table 9. The Experimental Results before the Analysis

Document | Correctly Recognized (Formula 1) | Wrongly Recognized (Formula 1) | Correctly Recognized (Formula 3) | Wrongly Recognized (Formula 3)
7     | 357  | 11 | 331  | 37
8     | 371  | 12 | 318  | 65
9     | 417  | 15 | 354  | 78
10    | 443  | 18 | 410  | 51
11    | 88   | 2  | 73   | 17
12    | 106  | 6  | 96   | 16
13    | 583  | 14 | 531  | 66
Total | 2365 | 78 | 2113 | 330

Table 10. The Experimental Results after the Analysis

Document | Correctly Recognized (Formula 1) | Wrongly Recognized (Formula 1) | Correctly Recognized (Formula 3) | Wrongly Recognized (Formula 3)
7     | 364  | 4  | 364  | 4
8     | 377  | 6  | 374  | 9
9     | 424  | 8  | 416  | 16
10    | 453  | 8  | 447  | 14
11    | 90   | 0  | 89   | 1
12    | 108  | 4  | 107  | 5
13    | 588  | 9  | 586  | 11
Total | 2404 | 39 | 2383 | 60

The threshold values A and B are set to 2500 and 300, respectively. The experimental results are clearly similar to the previous ones. Without the analysis, the correct rates for Formulas 1 and 3 are 96.81% and 86.49%, respectively; with the analysis, they rise to 98.40% and 97.54%. That is, 50.00% and 81.82% of the errors on average are removed by the analysis for Formulas 1 and 3, respectively. The processing speed also improves after applying the analysis: without the analysis it is 1.67 characters per second, and with the analysis it becomes 21.59 characters per second on a PC-486/DX4-100. That is, the analysis saves 92.26% of the time on average.
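These headline percentages follow directly from the totals in Tables 9 and 10 and from the timing figures; a few lines of Python reproduce them (the numbers are copied from the tables above):

```python
# Reproduce the headline numbers from Tables 9 and 10 and the timing figures.
err_before = {'Formula 1': 78, 'Formula 3': 330}
err_after = {'Formula 1': 39, 'Formula 3': 60}
for f in err_before:
    reduction = (err_before[f] - err_after[f]) / err_before[f]
    print(f, round(100 * reduction, 2))      # Formula 1: 50.0, Formula 3: 81.82

cps_before, cps_after = 1.67, 21.59          # characters per second
print(round(100 * (1 - cps_before / cps_after), 2))  # 92.26% of time saved
```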

5 Concluding Remarks

A standard approach to reducing the recognition errors caused by the preprocessing (image processing) is to use corpus-based language models in the postprocessing (language processing). This paper proposes the analysis of error count distributions to alleviate the problems caused by contextual language processing. The experimental results show that the analysis can reduce more than 50% of the errors and save more than 90% of the time on average with the Markov character bigram model. Besides, this simple but effective analysis can also be applied to other natural language applications such as speech recognition [2] and handwriting recognition [9-11].

References

[1] K.H. Shyu, et al., "An OCR Based Translation System between Simplified and Complex Chinese," Computer Processing of Chinese and Oriental Languages, Vol. 9, No. 1, pp. 59-68, 1995.
[2] L.S. Lee, et al., "Golden Mandarin (II) - An Improved Single-Chip Real-Time Mandarin Dictation Machine for Chinese Language with Very Large Vocabulary," Proceedings of ICASSP, pp. 503-506, 1993.
[3] B.H. Chou and J.S. Chang, "The Language Models in Optical Chinese Character Recognition," Proceedings of ROCLING V, pp. 259-286, 1992.
[4] J.S. Chang and S.D. Chen, "The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model," Proceedings of PACLIC, pp. 127-131, 1995.
[5] T. Araki, S. Ikehara, et al., "An Evaluation of a Method to Detect and Correct Erroneous Characters in Japanese Input through an OCR Using Markov Models," Proceedings of Applied Natural Language Processing, pp. 198-199, 1994.
[6] T. Araki, S. Ikehara, et al., "An Evaluation to Detect and Correct Erroneous Characters Wrongly Substituted, Deleted and Inserted in Japanese and English Sentences Using Markov Models," Proceedings of COLING, pp. 187-193, 1994.
[7] R. Shinghal, "A Hybrid Algorithm for Contextual Text Recognition," Pattern Recognition, Vol. 16, No. 2, pp. 261-267, 1983.
[8] R.M.K. Sinha and B. Prasada, "Visual Text Recognition Through Contextual Processing," Pattern Recognition, Vol. 21, No. 5, pp. 463-479, 1988.
[9] H.J. Lee, C.H. Tung and C.H. Chang Chien, "A Markov Model in Handwritten Chinese Text Recognition," Proceedings of ICDAR, pp. 72-75, 1993.
[10] C.H. Tung and H.J. Lee, "Increasing Character Recognition Accuracy by Detection and Correction of Erroneously Identified Characters," Pattern Recognition, Vol. 27, No. 9, pp. 1259-1266, 1994.
[11] C.H. Chang, "Word Class Discovery for Postprocessing Chinese Handwriting Recognition," Proceedings of COLING, pp. 1221-1225, 1994.