An Improved Segmentation of Online English Handwritten Text Using Recurrent Neural Networks Cuong Tuan Nguyen and Masaki Nakagawa Department of Computer and Information Sciences Tokyo University of Agriculture and Technology ntcuong2103@gmail.com, nakagawa@cc.tuat.ac.jp Abstract Segmentation of online handwritten text is better made by employing the context of strokes written before and after each segmentation candidate. This paper presents an application of Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks to the segmentation of online handwritten English text. The networks incorporate long-range context from both the forward and backward directions to improve the confidence of segmentation under uncertainty. We show that applying the method in the semi-incremental recognition of online handwritten English text reduces waiting time by up to 62% and processing time by 50%. Moreover, the recognition rate of the system also improves remarkably by 3 points from 71.7%. 1. Introduction Due to the spread of pen-based and touch-based devices such as Tablet PCs, smartphones, electronic whiteboards, digital pens and so on, online handwriting recognition is receiving renewed attention. In particular, online handwritten text recognition is a practical keyboard-free input method for these devices [1, 2]. Research focuses on improving the recognition rate and reducing the processing time and dictionary size of recognition systems [3, 4], which enables a handwritten text recognition system to run effectively and reliably on small portable devices. Compared with isolated character or word recognition, handwritten text recognition faces the difficulty of word segmentation and character segmentation due to the ambiguity in segmentation. Moreover, in continuous handwriting, characters tend to be written more cursively. To deal with this problem, applying context to segmentation is crucial.
The typical approach uses over-segmentation in combination with recognition results and linguistic context [3]. Based on geometric features, all potential segmentation positions are determined to build up hypothetical segmentation paths. Then, recognition results and linguistic context are combined to evaluate the paths and find the best one. The SVM method, which has been widely applied to numerous classification tasks, achieves good performance on segmentation of on-line handwritten text [5]. The segmentation task, however, can be further improved by incorporating context from both the forward and backward directions. An improved bidirectional recurrent neural network, Bidirectional Long Short-Term Memory (BLSTM) [12], allows the network to access long-range context. BLSTM has shown its effectiveness in many sequence classification tasks. There are two methods for text recognition. The batch recognition method, which recognizes handwritten text after a user has finished writing, can easily employ the full context to achieve a high recognition rate [3]. If all the segmentation and recognition processes are run after the whole text is written, however, it suffers from long waiting time. As more text is written, longer waiting time is incurred. The other is the incremental recognition method [6, 7], which recognizes handwriting while a user is writing. Although it does not incur long waiting time after the user has finished writing, it may degrade the recognition rate due to local processing of every stroke (a sequence of finger-tip or pen-tip coordinates from finger/pen-down to finger/pen-up) and increase the total CPU time due to repeated processing after receiving every stroke. The semi-incremental recognition method, which resumes segmentation and recognition as strokes arrive, exploits context for both segmentation and recognition.
It realizes very short waiting time, low CPU burden and no significant loss in recognition rate for both Japanese text and English text [8, 9]. In this work, we apply BLSTM to improve segmentation and evaluate its effect on the semi-incremental English recognition method. 2. Segmentation of online handwritten text To recognize online handwritten text, there are two main streams: the segmentation-free method and the dissection method. In this paper, we focus on the dissection method since it is better for Chinese and Japanese text recognition [1, 3] and could produce better results even for western handwriting recognition, for which the segmentation-free method has been dominant.
2.1. Segmentation-recognition strategy Online handwritten text recognition deals with the problem of recognizing handwritten text comprising many text lines. For this problem, handwritten text is segmented into text lines, and then each text line is segmented into words. The segmentation can be hard-decision (yes or no) or soft-decision (allowing multiple possibilities). Segmentation is made based on geometric layout features (e.g. gap between strokes, stroke histogram, interrelationship and so on). Due to the instability and ambiguity of these features in practical handwriting, however, it is difficult to segment without using recognition cues and linguistic context. Thus, we employ the soft-decision approach. Segmentation-recognition is accomplished in two steps: over-segmentation and path evaluation-search. A text line is over-segmented into primitive segments such that each segment composes a single word or a part of a word. A segment or a sequence of a few consecutive segments is assumed to be a candidate word pattern, which is recognized by a word recognizer into a list of candidate categories. Multiple ways of segmentation into candidate words and multiple ways of recognition into words are represented by a segmentation-recognition candidate lattice [3]. Text recognition is made by the best-path search in the lattice, considering geometric and linguistic context as well as word recognition scores. 2.2. Features for segmentation From the local and global features based on the feature set described in [9], we extend it into a set of nine geometric features as shown in Table 2. We define the terms in Table 1:

Sp      Immediately preceding stroke
Ss      Immediately succeeding stroke
Bp      Bounding box of Sp
Bs      Bounding box of Ss
Bp_all  Bounding box of all the preceding strokes
Bs_all  Bounding box of all the succeeding strokes
P       Pattern of all strokes
Psub    Sub-pattern of Sp and Ss

Table 1. Terms for feature representation.
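The bounding-box terms above are the building blocks of the geometric features listed in the next section. The following is an illustrative sketch, not the paper's implementation: it assumes a stroke is a list of (x, y) points, and all function names (bbox, x_distance, x_overlap, centroid_angle) are hypothetical helpers.

```python
import math

def bbox(stroke):
    """Axis-aligned bounding box of a stroke: (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return min(xs), min(ys), max(xs), max(ys)

def x_distance(b1, b2):
    """Gap between two boxes along the x-axis (0 if they overlap)."""
    return max(0.0, max(b1[0], b2[0]) - min(b1[2], b2[2]))

def x_overlap(b1, b2):
    """Overlap length between two boxes along the x-axis (0 if disjoint)."""
    return max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))

def centroid_angle(b_s, b_p):
    """Angle (radians) between the vector joining the box centroids and the x-axis."""
    cx_p, cy_p = (b_p[0] + b_p[2]) / 2, (b_p[1] + b_p[3]) / 2
    cx_s, cy_s = (b_s[0] + b_s[2]) / 2, (b_s[1] + b_s[3]) / 2
    return math.atan2(cy_s - cy_p, cx_s - cx_p)
```

Features such as the inter-box distance, overlap lengths, and centroid angle follow directly from these primitives; y-axis variants and width/height ratios are analogous.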
F1  Distance between Bs_all and Bp_all along the x-axis
F2  Average horizontal stroke length of P
F3  Average horizontal stroke length of Psub
F4  Overlap length between Bp and Bs along the x-axis
F5  Overlap length between Bp and Bs along the y-axis
F6  Minimum point distance between Ss and Sp
F7  Angle between the vector joining the centroids of Bs and Bp and the x-axis
F8  Ratio of Bs width to Bp width
F9  Ratio of Bs height to Bp height

Table 2. Features for English word segmentation.

2.3. Segmentation by an SVM classifier For word over-segmentation of a text line, the work in [9] uses an SVM classifier to classify each off-stroke into two classes: segmentation point (SP) or non-segmentation point (NSP). An SP off-stroke separates two words, while an NSP off-stroke lies within a word. Off-strokes with low confidence are classified as undecided points (UP). In training the SVM, however, due to the imbalance between the numbers of positive and negative training patterns (i.e. the number of SPs and that of NSPs), we need to adjust the costs of false positives and false negatives [10]. The higher the cost of false positives, the higher the precision of determining SPs; the same holds for false negatives and the precision of determining NSPs. We use a combination of two SVMs: one with high precision for determining SPs, the other with high precision for determining NSPs. 2.4. Segmentation by a BLSTM classifier One of the key benefits of RNNs is their ability to use previous context. For standard RNN architectures, however, the range of context that can be accessed in practice is limited by the vanishing gradient problem [11]. Long Short-Term Memory (LSTM) [11] is an RNN architecture designed to address this problem. An LSTM layer consists of multiple recurrently connected memory blocks. Each block contains a set of internal units, known as cells, whose activations are controlled by three multiplicative gate units.
The effect of the gates is to allow the cells to store and access information over long periods of time. For many tasks, it is useful to have access to future as well as past context. Bidirectional LSTM (BLSTM) [12] allows this by using two separate hidden layers that process the input in the forward and backward directions, both connected to the same output layer to provide access to long-range bidirectional context. We use BLSTM to exploit the context of strokes written before and after an off-stroke for segmentation of that off-stroke. Training BLSTM does not suffer from the imbalance between the numbers of class patterns. Therefore, we use BLSTM with two thresholds for over-segmentation. For over-segmentation, we need to find all potential segmentation points among the off-strokes (which can then be determined as segmentation or non-segmentation points); the remaining off-strokes are non-segmentation points. Thus, we set a threshold TH1 to determine an off-stroke as a potential segmentation point if its score is above TH1 and as a non-segmentation point if its score is below TH1. Likewise, we set another threshold TH2 to determine an off-stroke as a potential non-segmentation point or a segmentation point. Off-strokes whose scores fall between TH1 and TH2 are classified as UP. Fig. 1 illustrates this method.
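The two-threshold rule can be sketched as follows. This is a minimal illustration, assuming the BLSTM emits a single segmentation score in [0, 1] per off-stroke (higher meaning more likely a segmentation point); the threshold values follow Sec. 4.2, and the function name is hypothetical.

```python
# Thresholds as chosen in Sec. 4.2 from the distribution of output scores.
TH1, TH2 = 0.1, 0.9

def classify_off_stroke(score, th1=TH1, th2=TH2):
    """Label an off-stroke as SP, NSP, or UP (undecided) from its BLSTM score."""
    if score >= th2:
        return "SP"    # confident segmentation point
    if score <= th1:
        return "NSP"   # confident non-segmentation point
    return "UP"        # undecided: resolved later by recognition context

labels = [classify_off_stroke(s) for s in (0.05, 0.4, 0.95)]
# labels == ["NSP", "UP", "SP"]
```

Only the UP off-strokes are carried forward as ambiguous hypotheses, which is what makes the detection rate of Sec. 4.1 matter for speed.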
Figure 1. Over-segmentation using BLSTM.

3. Semi-incremental recognition method 3.1. Processing flow The semi-incremental method, similarly to the incremental method, performs recognition in the background while a user is writing. The method propagates the effect of newly written strokes to the recognition of previous strokes by resuming segmentation, recognition, and best-path search. To do so, it determines a segmentation resuming point (Seg_rp) from which to resume segmentation, and determines a processing window, termed the scope, in which to update and resume the best-path search. Fig. 2 shows the processing flow of the semi-incremental recognition method.

Figure 2. Flow of semi-incremental recognition.

First, we receive new strokes. Secondly, we update Seg_rp. Thirdly, we apply segmentation from Seg_rp. Fourthly, we determine the scope. Fifthly, we update the src-lattice for this scope. Finally, we resume the best-path search from the beginning of this scope to get the text recognition result. The segmentation and text recognition results of the scope are used in the next processing cycle. 3.2. Seg_rp determination and segmentation process From the result of text recognition up to the latest scope at the beginning of each processing cycle, we update Seg_rp to the candidate segmentation point before a fixed number (N_seg) of the latest recognized words. The segmentation process is divided into two steps. Firstly, we apply segmentation using the SVM or BLSTM classifier from Seg_rp. Secondly, we fix the UP off-strokes before the N_seg_fix latest recognized words as SP if they match the word segmentation points retrieved from the text recognition result. Both N_seg and N_seg_fix are determined experimentally. 3.3. Determination of scope To determine the scope, we use the result of the segmentation process. The segmentations of the strokes before and after the method has received new strokes are compared with each other. If there is an off-stroke whose classification has changed (we call it a classification-changed off-stroke), we consider the strokes before the earliest classification-changed off-stroke to be stably classified, while the strokes after it are not. Otherwise, the off-stroke before the newly added strokes is considered the earliest classification-changed off-stroke. This earliest classification-changed off-stroke may occur within a candidate word block or between two candidate word blocks. We define the scope as the sequence of strokes from the first stroke of the candidate word block containing, or just preceding, the earliest classification-changed off-stroke to the last stroke. 4. Experiments 4.1. Metrics of segmentation evaluation First, over-segmentation is applied and then segmentation is determined along with word recognition and best-path search. We evaluate over-segmentation as well as segmentation. The over-segmentation process classifies each off-stroke as an SP, NSP, or UP off-stroke. Among them, a UP off-stroke may then be further classified as SP or NSP in the text recognition process. The performance of over-segmentation is evaluated by the following measures. Precision is the ratio of correctly classified SP off-strokes to detected SP off-strokes:

Precision = (# correctly classified SPs) / (# detected SPs)    (3)

Recall is the ratio of correctly classified SP off-strokes plus detected UP off-strokes to true SP off-strokes:

Recall = (# correctly classified SPs + # detected UPs) / (# true SPs)    (4)

Including detected UPs in the numerator is typical for over-segmentation since UP off-strokes keep the possibility of being classified correctly.
F-measure is calculated from precision and recall as follows:

F-measure = 2 × Precision × Recall / (Precision + Recall)    (5)

Detection rate is the ratio of detected SP off-strokes to detected SP and detected UP off-strokes:

Detection = (# detected SPs) / (# detected SPs + # detected UPs)    (6)

Although UP off-strokes keep the possibility of being classified correctly, and thus increase recall as noted above, they decrease the recognition speed of the system. Therefore, along with F-measure, we also evaluate the performance of over-segmentation using the detection rate. 4.2. Experiment setup We employ the IAM online database (IAM-OnDB) [13], which consists of pen trajectories collected from 221 different writers using an electronic whiteboard. We follow the handwritten text recognition task IAM-OnDB-t1, in which the database is divided into a training set, two validation sets, and a test set containing 5,364, 1,438, 1,518 and 3,859 written lines, respectively. We use a trigram table extracted from the LOB text corpus for language modeling. For segmentation, we train both the SVM classifier and the BLSTM classifier on segmented words of IAM-OnDB. The SVM classifiers use the RBF kernel with a cost factor of 0.1 for high precision of SP determination and 7.5 for high precision of NSP determination. The BLSTM classifier uses a bidirectional layer of 20 LSTM blocks with one cell in each block. After training the BLSTM classifier, based on the distribution of output scores, we set TH1 = 0.1 and TH2 = 0.9. We compare the performance of two semi-incremental recognition systems: the first uses the SVM classifiers for segmentation (SVM system) and the second uses the BLSTM classifier for segmentation (BLSTM system). 4.3. Over-segmentation The BLSTM system outperforms the SVM system in both recall and detection rates. The recall rate improves by 1.5 points, while the detection rate improves significantly from 48% to 86%, as shown in Table 3. 4.4. Recognition rate The recognition rates of the SVM system and the BLSTM system with a changing N_seg parameter are shown in Fig. 3 and Fig. 4.
The result at each N_seg includes the maximum, minimum, and average recognition rates when running with the number of strokes per incremental recognition (Ns) from 1 to 10. The BLSTM system improves the recognition rate by about 3 points. With its high detection rate, the BLSTM system produces far fewer UPs than the SVM system. Therefore, the system reduces the ambiguity in the best-path search and improves the recognition rate.

Figure 3. Recognition rate of the SVM system.

Figure 4. Recognition rate of the BLSTM system.

4.5. Waiting time We measure the average waiting time of the two systems with changing Ns. The BLSTM system reduces the average waiting time by 36.65% to 62.43% compared with the SVM system, as shown in Fig. 5.

Measures     SVM    BLSTM
Recall       96.91  98.57
Precision    99.25  99.06
F-measure    98.07  98.81
Detection    48.34  86.55

Table 3: Over-segmentation results.
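The over-segmentation measures of Sec. 4.1 can be computed directly from raw counts. A small sketch follows; the parameter `up_on_true_sp` (UPs falling on true segmentation points) reflects our assumption about which UPs enter the recall numerator, and all names are illustrative.

```python
def over_segmentation_metrics(correct_sp, detected_sp, detected_up,
                              up_on_true_sp, true_sp):
    """Precision, recall, F-measure, and detection rate from off-stroke counts."""
    precision = correct_sp / detected_sp                       # Eq. (3)
    recall = (correct_sp + up_on_true_sp) / true_sp            # Eq. (4)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (5)
    detection = detected_sp / (detected_sp + detected_up)      # Eq. (6)
    return precision, recall, f_measure, detection
```

Note how a large UP count drags the detection rate down without hurting recall, which is exactly the SVM-versus-BLSTM contrast in Table 3.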
Figure 5. Average waiting time of the two systems.

4.6. CPU time We also compare the two systems in CPU time per stroke. Fig. 6 shows the results of the BLSTM and SVM systems with changing Ns. The BLSTM system also reduces CPU time by about 50% compared with the SVM system.

Figure 6. CPU time of the two systems.

5. Discussion The detection rate gives the ratio of segmentation points to all potential segmentation points. For the two systems, at the same recall rate, a higher detection rate reduces the number of undecided points. Since each undecided point doubles the number of candidate word patterns which need to be recognized, a high detection rate reduces processing time and waiting time. Each undecided point also doubles the number of search paths. Therefore, a higher detection rate reduces the number of search paths, lowers ambiguity, and improves the recognition rate. 6. Conclusion In this paper, we proposed a system using a BLSTM recurrent neural network for segmentation of on-line handwritten English text. Through a large improvement in the detection rate of over-segmentation, BLSTM reduces the number of undecided points, each of which doubles the number of candidate character patterns. The reduction of candidate character patterns is vital since character recognition is applied to each candidate. The BLSTM system reduces waiting time by up to 62.34% and CPU time by around 50% compared with the SVM system. Moreover, reducing undecided points also reduces the number of search paths and lowers the ambiguity of recognition, so that BLSTM improves the recognition rate of the system by 3 points from 71.7%. Acknowledgement This work is being supported by the Grant-in-Aid for Scientific Research (B)-224300095. References [1] C. L. Liu, S. Jaeger, and M. Nakagawa, "Online Recognition of Chinese Characters: The State-of-the-Art," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 198-213, February 2004. [2] R. Plamondon and S. N.
Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63-85, January 2000. [3] B. Zhu, X. D. Zhou, C. L. Liu, and M. Nakagawa, "A robust model for on-line handwritten Japanese text recognition," International Journal on Document Analysis and Recognition, vol. 13, no. 2, pp. 121-131, June 2010. [4] B. Zhu and M. Nakagawa, "Building a compact online MRF recognizer for large character set by structured dictionary representation and vector quantization technique," Pattern Recognition, vol. 47, no. 3, pp. 982-993, 2014. [5] B. Zhu and M. Nakagawa, "Segmentation of On-Line Freely Written Japanese Text Using SVM for Improving Text Recognition," IEICE Transactions on Information and Systems, vol. E91.D, no. 1, pp. 105-113, 2010. [6] H. Tanaka, "Implementation of real-time box-free online Japanese handwriting recognition system," Japanese Patent 3925247, issued March 13, 2002 (in Japanese). [7] D. H. Wang, C. L. Liu, and X. D. Zhou, "An approach for real-time recognition of online Chinese handwritten sentences," Pattern Recognition, vol. 45, pp. 3661-3675, 2012. [8] C. T. Nguyen, B. Zhu, and M. Nakagawa, "A semi-incremental recognition method for on-line handwritten Japanese text," Proc. 12th Int. Conf. on Document Analysis and Recognition, Washington D.C., USA, Aug. 2014. [9] C. T. Nguyen, B. Zhu, and M. Nakagawa, "A semi-incremental recognition method for on-line handwritten English text," Proc. 14th Int. Conf. on Frontiers in Handwriting Recognition, Crete, Greece, Sept. 2014. [10] K. Morik, P. Brockhausen, and T. Joachims, "Combining statistical learning with a knowledge-based approach - A case study in intensive care monitoring," Proc. 16th Int. Conf. on Machine Learning (ICML-99), 1999. [11] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997. [12] A. Graves and J.
Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602-610, July 2005. [13] M. Liwicki and H. Bunke, "IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard," Proc. 8th Int. Conf. on Document Analysis and Recognition, pp. 956-961, 2005.