Analysis of Error Count Distributions for Improving the Postprocessing Performance of OCCR


Yue-Shi Lee and Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan, R.O.C.
{leeys,

Submitted on 23 July, 1996, Revised on 17 November, 1996 and Accepted on 29 November, 1996

Abstract

Contextual language processing plays an important role in the postprocessing of OCR. Its effects have been demonstrated by many proposed systems, and in general it performs well. However, its performance is not as good as expected when the test data contain more unseen context, e.g., proper nouns such as personal names and organizational names. This paper addresses the importance of analyzing the error count distributions before applying the language models. According to the analysis, more than 50% of errors can be eliminated and more than 90% of time can be saved on average, based on the Markov character bigram model.

Keywords: Contextual Language Processing, Error Count Distributions, Image Processing, Markov Model, OCCR, Unseen Context

1 Introduction

To improve the interface with computers, input devices such as optical character recognition (OCR) devices and speech recognition (SR) devices are expected to be developed. An OCR device is a good choice when printed documents are available. However, optical Chinese character recognition (OCCR) is an extremely challenging task because of the multiple fonts, the complex shapes and the very large vocabulary (footnote 1). Because misrecognitions in image processing are hard to avoid, contextual postprocessing (language processing) of the recognition is indispensable both for reducing the recognition errors caused by the preprocessing (image processing) and for saving time in human proofreading.

Contextual language processing for the postprocessing of OCR is not new. Shyu, et al. [1] apply a word-lattice-based Markov character bigram model suggested by [2] to an OCCR system. Chou and Chang [3] use a Markov word unigram model and a confusion matrix to decide the most plausible characters. Chang and Chen [4] combine a noisy channel model and a language model to implement the postprocessing of OCCR. Araki, et al. [5-6] propose a selective error-correction method to detect and correct erroneous characters in Japanese text input through an OCR. Shinghal [7] and Sinha and Prasada [8] propose approaches for English.

The purpose of contextual language processing is to find the most plausible candidate for each image character with the maximum likelihood probability. The above approaches claim that it plays an important role and has much effect in the postprocessing of OCR systems. In general, it performs well. However, its performance is not as good as expected when the test data contain more unseen context, e.g., proper nouns such as personal names and organizational names. Besides, some frequently used characters are always selected by the language models, although they may be wrong in some cases. Therefore, if we can predict which image characters have been recognized correctly by the image processing module, the above problems can be alleviated. That is, it is important to make an analysis before applying the language models.

This paper is organized as follows. Section 2 presents our OCCR system. Section 3 introduces the language models used in this paper. Section 4 describes the analysis methods and demonstrates the proposed methods. Section 5 gives the concluding remarks.

2 System Description

The proposed system shown in Fig. 1 consists of two major modules: (1) the Image Processing Module (Preprocessing) and (2) the Language Processing Module (Postprocessing). The image processing module contains three submodules: (1) Image Segmentation, (2) Feature Extraction, and (3) Feature Matching.

An optical scanner scans the printed document and converts it into an image document. The image segmentation submodule segments the entire image into blocks and then classifies each block as a text, graphic or picture block. The text blocks are then further segmented into individual character blocks. Each character block stands for an image character. Fig. 2(a) shows a simplified example of an image document. After the image segmentation submodule is applied, the segmented image document is as shown in Fig. 2(b).

Once the image characters have been segmented properly, the feature extraction submodule extracts the features from each image character. In the feature matching stage, the extracted features of each individual image character are matched against a feature database to recognize the character. The top ten candidates for each image character, which form a candidate set, are generated for the subsequent language processing. Fig. 3 shows the candidates for each image character of Fig. 2(b).

Footnote 1: There are about 13,000 characters in Chinese.

Fig. 1. Block Diagram of the Proposed System. [Figure: Image Document -> Image Processing Module (Preprocessing): Image Segmentation, Feature Extraction, Feature Matching, with Image base, Dictionary and Feature base -> Language Processing Module (Postprocessing): Analysis of Error Count Distributions, Markov Character Bigram Model, with Character Bigram Table and Character Unigram Table -> Text Document]

Fig. 2. An Example of Image Segmentation: (a) image document, (b) character blocks (image characters). [Figure]

Fig. 3. The Top Ten Candidates for Each Image Character of Fig. 2(b). [Figure]

In Fig. 3, a number follows each candidate. The number indicates the error count between the current image character and the image character stored in the image database, according to the features. The lower the error count, the greater the similarity between the two image characters. Thus, the first candidate of each image character is the most plausible candidate according to the image processing module. The error count can be used to calculate the probability of each candidate given the image character. Given the k-th image character i_k, the matching score of the j-th candidate c_j (SCORE_j) is defined as follows [9-10].

SCORE_j = [equation garbled in source]

Based on the definition of SCORE_j, the probability of the j-th candidate c_j given the image character i_k is calculated as follows.

$$P_{IP}(c_j \mid i_k) = \frac{SCORE_j}{\sum_{m=1}^{10} SCORE_m}$$

During the language processing stage, the analysis of error count distributions and the Markov character bigram model are applied together to deal with the recognition errors caused by the image processing module and to yield the final text document. This paper focuses on the postprocessing of the recognition, especially on the analysis of error count distributions.
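As a rough illustration of the normalization step, the sketch below turns the matching scores of one candidate set into probabilities that sum to one. It assumes the probability is simply each score divided by the sum of the scores in the set; the function name and the example scores are hypothetical, and the computation of SCORE_j itself from the error counts is deliberately left out.

```python
def candidate_probabilities(scores):
    """Normalize the matching scores of one candidate set into probabilities.

    Assumption: P(c_j | i_k) = SCORE_j / sum of all scores in the set.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


# Hypothetical scores for a three-candidate set (illustrative values only).
probs = candidate_probabilities([6.0, 3.0, 1.0])
print(probs)
```

The candidate with the highest score keeps the highest probability, so selecting by probability alone reproduces the ranking of the image processing module.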

3 Language Models in OCCR

The problem of OCCR can be defined as how to convert a sequence of image characters I into the corresponding sequence of characters Ĉ correctly, based on the language models. In this paper, a statistical Markov character bigram model is adopted to improve the recognition rate. Let I = <i_1, i_2, i_3, ..., i_n> be an image character string and C = <c_1, c_2, c_3, ..., c_n> be one of the possible character strings. Here, c_i denotes one of the characters in the i-th candidate set. The conversion can be formulated as follows.

$$\hat{C} = \arg\max_{C} P_{IP}(C \mid I) \times P_{LP}(C) \qquad (1)$$

The former probability, P_IP(C|I), is produced by the image processing module, and the latter probability, P_LP(C), is calculated by the language processing module. If the contextual information, i.e., P_LP(C), is ignored, the above formula becomes as follows.

$$\hat{C} = \arg\max_{C} P_{IP}(C \mid I) \qquad (2)$$

P_IP(C|I) is defined as follows.

$$P_{IP}(C \mid I) = \prod_{j=1}^{n} P_{IP}(c_j \mid i_j)$$

The definition of P_IP(c_j|i_j), i.e., the probability of candidate c_j given the image character i_j, is described in Section 2. With Formula 2, the first candidate, which has the lowest error count, is always selected as the result. If more than one candidate has the same error count, the most frequently used character is selected as the result through dictionary lookup (see Fig. 1). Similarly, if P_IP(C|I) is ignored in Formula 1, the formula becomes as follows.

$$\hat{C} = \arg\max_{C} P_{LP}(C) \qquad (3)$$

In this paper, P_LP(C) is simplified as the Markov character bigram model shown below.

$$P_{LP}(C) = \prod_{i=1}^{n+1} P(c_i \mid c_{i-1})$$

In this formula, c_0 and c_{n+1} mark the beginning and the end of the character string, respectively. According to the above formulas, the preliminary results are shown in Table 1.

Table 1. The Preliminary Results

Article   Correct Rate for Formula 1   Correct Rate for Formula 2   Correct Rate for Formula 3
1         [lost in source]             98.48%                       88.75%
2         [lost in source]             96.57%                       88.56%
3         [lost in source]             97.00%                       90.19%
4         [lost in source]             97.28%                       89.47%
5         [lost in source]             96.51%                       87.89%
6         [lost in source]             94.76%                       89.16%
Total     97.84%                       96.83%                       88.97%

In these experiments, an unsegmented Chinese newspaper corpus is adopted as the source of the training data for the Markov character bigram probabilities. It includes approximately 360,000 sentences (about 4,000,000 characters). The test data (6 articles) are scanned from the Liberty Times. They include 237 sentences (2457 characters).

Table 1 makes clear that using only the contextual information (Formula 3) to select the most plausible candidate brings no advantage in these experiments. This is because the image processing module performs very well and because the test data (news) contain many proper nouns, such as personal names and organizational names, which are difficult for the language models to resolve. Besides, some frequently used characters are always selected by Formula 3, although they may be wrong in some cases. Because Formula 1 combines P_IP(C|I) and P_LP(C), these effects are alleviated, but they still have some influence. The subsequent analysis will demonstrate this point. The preliminary results for Formula 2 are discussed in detail, and the statistics are shown in Table 2.

Table 2. The Statistics of the Preliminary Results for Formula 2 (columns: Article; Correctly Recognized Image Characters; Wrongly Recognized Image Characters; Correct within the Top Ten Candidates; last row: Total) [table data lost in source]

In the above table, Article 1 has 5 image characters wrongly recognized by Formula 2. That is, the first candidate is not the correct result for these five image characters. However, 4 of the 5 correct characters can be found within the top ten candidates. From Table 2, 84.62% (66/78) of the wrongly recognized image characters can be recovered by using the characters within the top ten candidates. This is a favorable situation, provided that the contextual information can be successfully applied to the wrongly recognized positions.
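Formula 1 can be decoded exactly with dynamic programming over the candidate sets, because the bigram model only links adjacent positions. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the `<s>` and `</s>` boundary markers, the toy Latin-letter candidates and the small floor probability for unseen bigrams are all hypothetical choices made for illustration.

```python
import math


def viterbi_decode(candidate_sets, bigram_prob, bos="<s>", eos="</s>"):
    """Return the string C maximizing P_IP(C|I) * P_LP(C).

    candidate_sets: one list per position of (char, p_ip) pairs, p_ip > 0.
    bigram_prob: callable (prev_char, char) -> P(char | prev_char), > 0.
    """
    # best[c] = (log score of the best path ending in c, that path)
    best = {bos: (0.0, [])}
    for cands in candidate_sets:
        new_best = {}
        for ch, p_ip in cands:
            # Pick the predecessor that gives the highest score for ch.
            prev = max(best, key=lambda p: best[p][0] + math.log(bigram_prob(p, ch)))
            score = best[prev][0] + math.log(bigram_prob(prev, ch)) + math.log(p_ip)
            if ch not in new_best or score > new_best[ch][0]:
                new_best[ch] = (score, best[prev][1] + [ch])
        best = new_best
    # Close the string with the end marker c_{n+1}.
    last = max(best, key=lambda p: best[p][0] + math.log(bigram_prob(p, eos)))
    return "".join(best[last][1])


# Toy bigram table; unseen pairs get a small floor (an assumption, not from the paper).
TABLE = {
    ("<s>", "a"): 0.5, ("<s>", "b"): 0.5,
    ("a", "x"): 0.1, ("a", "y"): 0.9,
    ("b", "x"): 0.5, ("b", "y"): 0.5,
    ("x", "</s>"): 0.5, ("y", "</s>"): 0.5,
}


def bigram_prob(prev, ch):
    return TABLE.get((prev, ch), 1e-6)


candidate_sets = [[("a", 0.6), ("b", 0.4)], [("x", 0.55), ("y", 0.45)]]
first_choice = "".join(cands[0][0] for cands in candidate_sets)  # Formula 2
decoded = viterbi_decode(candidate_sets, bigram_prob)            # Formula 1
print(first_choice, decoded)
```

In this toy case the image-only choice is "ax", while the combined model prefers "ay" because the bigram (a, y) is much more likely than (a, x). The same mechanism explains both the recoveries and the damage discussed in the paper: context can override a first candidate whether or not that candidate was correct.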
Tables 3 and 4 show the detailed statistics of the preliminary results for Formulas 1 and 3, respectively.

Table 3. The Detailed Statistics of the Preliminary Results for Formula 1 (columns: Article, Correct, Wrong, CC, CW, WC, WW, Net Gain; last row: Total) [table data lost in source]

Table 4. The Detailed Statistics of the Preliminary Results for Formula 3 (columns: Article, Correct, Wrong, CC, CW, WC, WW, Net Gain; last row: Total) [table data lost in source]

In these two tables, Correct (Wrong) denotes the number of correctly (wrongly) recognized image characters (footnote 2). Columns 4, 5, 6 and 7 indicate the performance changes from the image processing module (preprocessing) to the language processing module (postprocessing). They can be classified into four types: Correct-to-Correct (CC), Correct-to-Wrong (CW), Wrong-to-Correct (WC) and Wrong-to-Wrong (WW). In the CW type, an image character that is correctly recognized by the image processing module is changed to a wrong one by the language processing module. In the WC type, a wrongly recognized character is recovered to the correct one by the language processing module. In the CC type, no characters are changed. In the WW type, a wrongly recognized character is left unchanged or is changed to another wrong one. The performance of the language processing module can be evaluated as the net gain, defined as follows.

Net Gain = WC - CW

In Table 4, the Net Gains are all negative. This reveals that the language processing module cannot be applied effectively to the OCCR application when P_IP(C|I) is ignored. However, the Net Gains of Articles 1 and 2 in Table 3 are also negative, even though P_IP(C|I) is combined with P_LP(C). In Table 3 (4), 32 (247) image characters that are correctly recognized by the image processing module are changed to wrong ones by the language processing module, whereas 57 (54) image characters that are wrongly recognized by the image processing module are recovered to the correct ones by the language processing module. Because Table 2 shows that only 66 wrongly recognized characters can be recovered by the language processing module at all, the language processing module performs well at these wrongly recognized positions.
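The four transition types and the net gain can be tallied mechanically once the ground truth and both system outputs are available. The sketch below is illustrative only: the function names and the toy strings are hypothetical, while the totals in the last two lines are the CW and WC counts quoted above for Formulas 1 and 3.

```python
from collections import Counter


def transition_counts(truth, before, after):
    """Count CC, CW, WC and WW transitions, position by position.

    truth: ground-truth characters; before: output of the image processing
    module; after: output of the language processing module.
    """
    counts = Counter()
    for t, b, a in zip(truth, before, after):
        counts[("C" if b == t else "W") + ("C" if a == t else "W")] += 1
    return counts


def net_gain(counts):
    """Net Gain = WC - CW."""
    return counts["WC"] - counts["CW"]


# Toy example: one recovery (position 3) and one damaged position (position 4).
toy = transition_counts("abcd", "abxd", "abcz")
print(dict(toy), net_gain(toy))

# Totals quoted in the text for Formula 1 and Formula 3.
gain_formula_1 = net_gain(Counter(WC=57, CW=32))
gain_formula_3 = net_gain(Counter(WC=54, CW=247))
print(gain_formula_1, gain_formula_3)
```

On the quoted totals the net gain is positive for Formula 1 and strongly negative for Formula 3, which is why the CW column is the one worth reducing.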
Consequently, if we can predict which positions have been correctly recognized by the image processing module, the first candidate can be selected as the result at those positions. The other candidates (from the second to the tenth) can be removed from the candidate set and will not be tried by the language processing module. In this way, the Net Gain can be turned into a positive value and the effect of the language processing module can be shown. In the next section, we describe how to predict whether a position is correctly or wrongly recognized by the image processing module.

Footnote 2: Correct = CC + WC; Wrong = CW + WW.

4 Analysis of Error Count Distributions

To decide which image characters have been recognized correctly by the image processing module, the only information that we can use is the error count of each candidate. In this paper, an image character is assumed to be correctly recognized by the image processing module based on the following two hypotheses.

(1) The error count of the first candidate in the candidate set must be less than a threshold value A.
(2) The difference in error count between the first candidate and the second candidate in the candidate set must be greater than a threshold value B.

Table 5 shows the error count distribution for the first hypothesis.

Table 5. The Error Count Distribution for the First Hypothesis (columns: The Range of the Error Count for the First Candidate; Correctly Recognized Image Characters; Wrongly Recognized Image Characters) [table data lost in source]

In Table 5, Column 1 indicates the range of the error count for the first candidate. Column 2 (3) indicates the number of image characters that are correctly (wrongly) recognized by the image processing module under the condition in Column 1.
For example, there are 407 correctly recognized image characters and 6 wrongly recognized image characters whose first candidates have error counts in the range starting at 2500. From Table 5, we find that an image character is correctly recognized by the image processing module if the error count of its first candidate is less than 2500, i.e., threshold value A. That is, a total of 234 positions can be correctly detected. At these positions the first candidate can be selected as the correct result, and the other candidates are not tried by the language processing module. However, this hypothesis yields only a small improvement (9.52%), because the total number of positions (image characters) is 2457.

Table 6 shows the error count distribution for the second hypothesis. In Table 6, Column 1 indicates the range of the difference in error count between the first candidate and the second candidate in the candidate set. Column 2 (3) indicates the number of correctly (wrongly) recognized image characters under the condition in Column 1. For example, there are only 136 correctly recognized image characters and 63 wrongly recognized image characters when the difference in error count between the first candidate and the second candidate is less than 200, i.e., threshold value B.
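The two hypotheses amount to a cheap filter that runs before the language model. The sketch below is one possible reading, with loudly stated assumptions: the function names are hypothetical, the defaults reuse the example thresholds 2500 and 200 from the text, and because the source does not spell out how the two hypotheses are combined, the `require_both` flag exposes both options (either test passing, or both).

```python
def is_trusted(error_counts, a=2500, b=200, require_both=False):
    """Decide whether the first candidate is assumed correct.

    error_counts: error counts of one candidate set, best (lowest) first,
    with at least two entries.
    Hypothesis 1: first error count < a.
    Hypothesis 2: (second error count - first error count) > b.
    How the two are combined is an assumption, controlled by require_both.
    """
    low_error = error_counts[0] < a
    wide_margin = (error_counts[1] - error_counts[0]) > b
    if require_both:
        return low_error and wide_margin
    return low_error or wide_margin


def prune(candidates, **thresholds):
    """Keep only the first candidate at trusted positions.

    candidates: list of (char, error_count) pairs sorted by error count.
    """
    if len(candidates) < 2:
        return candidates
    counts = [count for _, count in candidates]
    return candidates[:1] if is_trusted(counts, **thresholds) else candidates


# Hypothetical candidate sets: a clear winner and a near tie.
clear = [("a", 3000), ("b", 3400), ("c", 3900)]
close = [("a", 3000), ("b", 3100), ("c", 3900)]
print(prune(clear), prune(close))
```

Positions that pass the filter contribute a single candidate to the lattice, so the language model can neither damage them nor spend time on them; only the remaining positions keep their full candidate sets.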

Table 6. The Error Count Distribution for the Second Hypothesis (columns: The Range of the Difference of the Error Count; Correctly Recognized Image Characters; Wrongly Recognized Image Characters) [table data lost in source]

That is, if we assume that the first candidate is correct when the difference in error count is greater than 200, a total of 2258 positions are identified. Of these, 2243 are correct and 15 (78-63) are wrong. In other words, these 15 wrongly recognized characters are identified as correct under this hypothesis. However, only 199 (136+63) positions have to be considered further. This analysis is clearly useful, because most of the characters are identified correctly in advance. Tables 7 and 8 show the experimental results for Formulas 1 and 3 based on the two hypotheses.

Table 7. The Experimental Results for Formula 1 Based on Two Hypotheses (column headers as extracted: Correct, Wrong, Total) [table data lost in source]

Table 8. The Experimental Results for Formula 3 Based on Two Hypotheses (column headers as extracted: Correct, Wrong, Total) [table data lost in source]

In Tables 7 and 8, A and B denote the threshold values for Hypothesis I and Hypothesis II, respectively. The performance in these two experiments improves considerably. For example, the original Markov character bigram model based on Formula 1 (3) makes 53 (271) errors. After the analysis, the number of recognition errors drops to 30 (53) under the selected threshold values A and B. That is, 43.40% (80.44%) of the errors are eliminated by the analysis.

The threshold values A and B in Hypotheses I and II depend highly on the quality of the printed documents. They do not depend on the type and domain of the context. Another 7 printed documents are also scanned for testing. The experimental results are shown in Tables 9 and 10.

Table 9. The Experimental Results before the Analysis (column headers as extracted: Correctly, Wrongly, Correctly, Wrongly, Total) [table data lost in source]

Table 10. The Experimental Results after the Analysis (column headers as extracted: Correctly, Wrongly, Correctly, Wrongly, Total) [table data lost in source]

The threshold values A and B are set to 2500 and 300, respectively. The experimental results are clearly similar to the previous ones.
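The error reductions quoted for the first test set follow directly from the error counts before and after the analysis. The short check below uses a hypothetical helper name and only the counts given in the text (53 to 30 errors for Formula 1, 271 to 53 for Formula 3).

```python
def error_reduction(errors_before, errors_after):
    """Relative error reduction, in percent."""
    return 100.0 * (errors_before - errors_after) / errors_before


reduction_formula_1 = round(error_reduction(53, 30), 2)
reduction_formula_3 = round(error_reduction(271, 53), 2)
print(reduction_formula_1, reduction_formula_3)
```

Both values agree with the 43.40% and 80.44% reported in the text.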
Without the analysis, the correct rates for Formulas 1 and 3 are 96.81% and 86.49%, respectively. With the analysis, the correct rates for Formulas 1 and 3 are 98.40% and 97.54%, respectively. That is, 50.00% and 81.82% of the errors are eliminated by the analysis for Formulas 1 and 3, respectively. Besides, processing time is also saved after applying the analysis. Without the analysis, the processing speed is 1.67 characters per second. With the analysis, the processing speed increases (measured on a PC-486/DX machine); that is, the analysis saves 92.26% of the time on average.

5 Concluding Remarks

A standard approach to reducing the recognition errors caused by the preprocessing, i.e., image processing, is to use corpus-based language models in the postprocessing, i.e., language processing. This paper proposes the analysis of error count distributions to alleviate the problems caused by contextual language processing. The experimental results show that the analysis can eliminate more than 50% of the errors and save more than 90% of the time on average, based on the Markov character bigram model. Besides, this simple but effective analysis can also be applied to other natural language applications such as speech recognition [2] and handwriting recognition [9,10,11].

References

[1] K.H. Shyu, et al., "An OCR Based Translation System between Simplified and Complex Chinese," Computer Processing of Chinese and Oriental Languages, Vol. 9, No. 1, pp ,
[2] L.S. Lee, et al., "Golden Mandarin (II) - An Improved Single-Chip Real-Time Mandarin Dictation Machine for Chinese Language with Very Large Vocabulary," Proceedings of ICASSP, pp ,
[3] B.H. Chou and J.S. Chang, "The Language Models in Optical Chinese Character Recognition," Proceedings of ROCLING V, pp ,
[4] J.S. Chang and S.D. Chen, "The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model," Proceedings of PACLIC, pp ,
[5] T. Araki, S. Ikehara, et al., "An Evaluation of a Method to Detect and Correct Erroneous Characters in Japanese Input through an OCR Using Markov Models," Proceedings of Applied Natural Language Processing, pp ,
[6] T. Araki, S. Ikehara, et al., "An Evaluation to Detect and Correct Erroneous Characters Wrongly Substituted, Deleted and Inserted in Japanese and English Sentences Using Markov Models," Proceedings of COLING, pp ,
[7] R. Shinghal, "A Hybrid Algorithm for Contextual Text Recognition," Pattern Recognition, Vol. 16, No. 2, pp ,
[8] R.M.K. Sinha and B. Prasada, "Visual Text Recognition Through Contextual Processing," Pattern Recognition, Vol. 21, No. 5, pp ,
[9] H.J. Lee, C.H. Tung and C.H. Chang Chien, "A Markov Model in Handwritten Chinese Text Recognition," Proceedings of ICDAR, pp ,
[10] C.H. Tung and H.J. Lee, "Increasing Character Recognition Accuracy by Detection and Correction of Erroneously Identified Characters," Pattern Recognition, Vol. 27, No. 9, pp ,
[11] C.H. Chang, "Word Class Discovery for Postprocessing Chinese Handwriting Recognition," Proceedings of COLING, pp ,
