End-to-end Speech Recognition for Languages with Ideographic Characters

Hitoshi Ito, Aiko Hagiwara, Manon Ichiki, Takeshi Mishima, Shoei Sato, Akio Kobayashi
NHK (Japan Broadcasting Corp.), Japan
E-mail: {itou.h-ce, hagiwara.a-iy, ichiki.m-fq, mishima.t-iy, satou.s-gu}@nhk.or.jp
NHK Engineering System, Japan
E-mail: kobayashi.a-fs@nhk.or.jp

Abstract

This paper describes a novel training method for acoustic models using connectionist temporal classification (CTC) for Japanese end-to-end automatic speech recognition (ASR). End-to-end ASR can estimate characters directly without using a pronunciation dictionary; however, this approach has been studied mostly for English. When dealing with languages such as Japanese, we confront difficulties with robust acoustic modeling. One issue is the large number of characters, including Japanese kanji, which increases the number of model parameters. Additionally, the multiple pronunciations of kanji increase the variance of the acoustic features for the corresponding characters. Therefore, we propose end-to-end ASR based on bi-directional long short-term memory (BLSTM) networks to solve these problems. Our proposal involves two approaches: reducing the number of dimensions of the BLSTM and adding character strings to the output layer labels. Dimensional compression decreases the number of parameters, while output label expansion reduces the variance of the acoustic features. Consequently, we could obtain a robust model with a small number of parameters. Our experimental results with Japanese broadcast programs show that the combination of these two approaches improved the word error rate significantly compared with the conventional character-based end-to-end approach.

I. INTRODUCTION

Automatic speech recognition (ASR) technology has advanced significantly since the adoption of hidden Markov models (HMMs) and deep neural networks (DNNs).
The HMM-DNN model handles the difference between the sequence lengths of the input acoustic features and the output labels through the HMM topology and a pronunciation dictionary that maps phonemes to words. The end-to-end ASR approach has since been introduced and has been shown to achieve performance comparable with that of the HMM-DNN hybrid approach [1]-[10]. One advantage of the end-to-end approach is that it can absorb the difference in sequence length between the input speech features and the output characters by using connectionist temporal classification (CTC) [11]. This enables the acoustic model to output characters directly from the acoustic features. Therefore, a diversity of pronunciations can be learned by the acoustic model without a lexicon that lists multiple pronunciations, which saves the cost of preparing appropriate pronunciations.

While this approach has been reported mainly for English, fewer studies have addressed languages with ideographic characters, such as Japanese and Chinese. In particular, the following two issues are evident in Japanese. First, the total number of characters is much higher than that in English. Thus, the number of parameters to be estimated is also larger, making it more difficult to train a robust model. Second, because each kanji character usually has multiple pronunciations, the acoustic model needs to map different acoustic features to the same label, and consequently the training could become difficult.

In this paper, we propose a new end-to-end approach using CTC and bi-directional long short-term memory (BLSTM) network architectures [12], [13] to tackle these problems on the basis of the following two ideas. One is to reduce the number of parameters of the BLSTM. The dimensional compression of the BLSTM is achieved by a low-rank matrix decomposition of the affine transformation.
In feedforward HMM-DNN models, dimensional compression of the affine transformation shortens the training time without degrading the word error rate (WER) [14]. We expect that, through the matrix decomposition, dimensional compression will not only shorten the training time but also improve the generalization ability of the model.

The other idea is to add meaningful character strings as output labels. In Japanese, the pronunciation of each character changes depending on the preceding and following characters. In the conventional end-to-end ASR approach, however, the model must implicitly learn multiple pronunciations for the same character even though they are produced by different manners of articulation. We therefore add a set of character strings as output labels to separate different pronunciations into different labels for training. This has the potential to ease the difficulty of model training caused by the differences among multiple pronunciations. We add high-frequency words and words whose pronunciations appear with low frequency in the training data.

II. JAPANESE END-TO-END ASR

A. End-to-end ASR

End-to-end ASR is an approach that maps acoustic features directly to characters, syllables, or phonemes. CTC is one of the practical approaches for converting a temporal sequence of acoustic features directly into symbols. To absorb the difference in sequence length between the input and the output, CTC uses blank labels. Therefore, an acoustic model using CTC can output a sequence with blanks between characters. For example, with ø denoting the blank, both AAøBøCC and AøBBBøCø are mapped to the same sequence ABC. Many CTC-based acoustic models are used along with long short-term memory (LSTM) [15], [16] or bi-directional LSTM (BLSTM) networks, and the output labels are phonemes [1], [5], [7]-[9], syllables [10], or characters [2]-[6]. We propose a character-based Japanese end-to-end system, which does not require any pronunciation dictionary.

As described in the Introduction, some difficulties are evident in applying end-to-end ASR to Japanese. One difficulty is caused by the large number of characters. English has at most about 100 characters, including alphabetical characters and symbols, while Japanese has over 3,000 characters, including kanji, hiragana, katakana, alphabetical characters, and symbols. If 100 English characters and 3,000 Japanese characters are used with the same amount of training data, the average number of training frames per Japanese character is about 0.033 times (100/3,000) that per English character, which makes robust training more difficult than for English. The total number of distinct pronunciations, however, is small relative to the number of characters, and training with an unnecessarily large number of parameters reduces the generalization performance of the model.

The other issue is caused by the multiple pronunciations of kanji characters. Most Japanese kanji have not only a Chinese-origin pronunciation (On-yomi) but also a Japanese-origin pronunciation (Kun-yomi). English characters can also be read in more than one way. In Japanese, however, the readings of a single kanji are completely different from one another, and some kanji have more than ten readings. For example, the readings of the kanji 生 include shou, sei, nama, ki, i, u, o, and ha. An acoustic model should learn a mapping from similar acoustic features to one label. With kanji as output labels, however, the acoustic model needs to learn a mapping from multiple different acoustic features to one label.
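The CTC collapsing rule described at the beginning of this subsection (merge consecutive repeated symbols, then remove blanks) can be illustrated with a minimal Python sketch. The function name `ctc_collapse` and the use of the character ø as the blank are assumptions made only for illustration; they are not taken from the EESEN implementation.

```python
from itertools import groupby

BLANK = "ø"  # blank symbol as printed in the text; the concrete character is an assumption


def ctc_collapse(path: str, blank: str = BLANK) -> str:
    """Map a frame-level CTC path to its label sequence.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: drop the blank symbol.
    """
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != blank)


if __name__ == "__main__":
    print(ctc_collapse("AAøBøCC"))   # ABC
    print(ctc_collapse("AøBBBøCø"))  # ABC
```

Because repeats are merged before blanks are removed, a blank between two identical symbols keeps them distinct: the path AøA yields AA, whereas AA yields A.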
Because of these difficulties, syllable-based CTC has been introduced for Japanese end-to-end ASR [10]. In Chinese, which, like Japanese, has a large number of ideographic characters, handling these characters has required a tremendous amount of training data [4]. We propose a Japanese end-to-end method that outputs characters directly without a large amount of data by designing a network structure suited to limited data and by assigning appropriate labels.

B. Dimensional Compression

First, we apply dimensional compression to the BLSTM structure to make the number of parameters appropriate. Our network has a structure in which an affine transformation is connected to the final BLSTM layer (Fig. 1) [17]. The affine transformation layer transforms the dimensions of the final BLSTM layer to those of the softmax layer. The number of parameters, $P_A$, of the affine transformation is given by

$$P_A = D_{in} D_{out} + D_{out}, \tag{1}$$

where $D_{in}$ is the number of dimensions of the final BLSTM layer and $D_{out}$ is the number of dimensions of the softmax layer.

Fig. 1. DNN structure using bi-directional LSTM.

When the matrix is decomposed with a low rank, $D_r$, the number of parameters, $P_r$, is expressed as follows:

$$P_r = D_r D_{in} + D_r D_{out} + D_{out}. \tag{2}$$

When the low rank $D_r$ satisfies the following inequality, the dimension can be compressed using matrix decomposition because $P_r$ is smaller than $P_A$:

$$D_r < \frac{D_{in} D_{out}}{D_{in} + D_{out}}. \tag{3}$$

C. Label Expansion

Next, we add character strings as output labels of the BLSTM to avoid mapping multiple pronunciations to one label.

1) High-Frequency Word Addition Method: Because the pronunciation of a character is determined by its relationship with the preceding and following characters, adding words as output labels makes it possible to assign different acoustic features to different labels. One way to select the words to add is the high-frequency word addition method (HF-selection).
In this method, among the words $J$ in the training data, the top-$k$ most frequently observed words are added to the set $A_f$:

$$A_f = \{\, x \in J \mid x \text{ is among the top-}k \text{ most frequent words in } J \,\}. \tag{4}$$

By adding high-frequency words, we expect the model to learn acoustic features on a word basis and thus to train efficiently. Words that consist of two or more kanji characters are added when they satisfy Equation 4.

2) Low-Frequency Pronunciation-Word Addition Method: We propose a low-frequency pronunciation-word addition method (LFP-selection) as another method. In this method, the labels to add are selected on the basis of how rare the pronunciation of a character is among the words in the training text. We choose words from the training text and, among them, pick out those in which a character has a low-frequency pronunciation. The reason for adding words with low-frequency pronunciations is to assign each unusual pronunciation to a separate label during training. Adding words with unusual pronunciations will most likely improve the per-label training performance because the variance of the acoustic features learned with one label decreases.

In this paper, for simplicity, we consider only the pronunciation of the initial character of each word. We group the words by their initial character $s$ and then by the pronunciation of that character. If the proportion of words with a given pronunciation is less than a certain threshold, $t$, those words are added as new labels. A word list, $A_p$, satisfying the following is added as a set of new output labels:

$$A_p = \Bigl\{\, L_s^p \subseteq H_s \Bigm| \frac{|L_s^p|}{\sum_{p'} |L_s^{p'}|} < t \,\Bigr\}, \tag{5}$$

where $H_s$ is the set of words whose initial character is $s$, and $L_s^p$ is the set of words in $H_s$ that begin with the pronunciation $p$. We classify the words in $H_s$ by $p$ into the lists $L_s^p$ and count the number of words in each list. If the number of words in $L_s^p$ satisfies Equation 5, the word list $L_s^p$ is added to the output layer.

As an example, we show how to judge whether a pronunciation is unusual among the words beginning with the kanji 生 (Fig. 2). A list of the words in the training text that contain 生 is created: 生物 (seibutsu, "living thing"), 生徒 (seito, "student"), 生活 (seikatsu, "living"), 生息 (seisoku, "inhabiting"), 生涯 (shougai, "lifelong"), and 生糸 (kiito, "raw silk"). In this case, the number of words is 6. We count the words for each initial pronunciation, here se, sh, and ki (the initial letters of the hiragana-based readings), and add as output labels the words whose initial pronunciation is shared by few words. In this case, the number of words for each pronunciation is 4 for se, 1 for sh, and 1 for ki, i.e., proportions of about 0.67, 0.17, and 0.17. If the threshold $t$ is 0.2, the words that satisfy Equation 5 are 生涯 and 生糸, and they are added as output labels. However, the selected $A_p$ includes words that start with the same characters and pronunciation. We therefore split off the first two characters of each word and add only these two-character strings as additional labels, which yields character strings that can be shared across words.

Fig. 2. Example of how to select low-frequency pronunciation words. Word list containing 生: 生物, 生徒, 生活, 生息, 生涯, 生糸. Words whose pronunciation begins with se: 生物, 生徒, 生活, 生息. Word whose pronunciation begins with sh: 生涯. Word whose pronunciation begins with ki: 生糸. The sh and ki words are the low-frequency pronunciation words.

III. EXPERIMENT

A. Experimental Setup

For the experiments, we used the EESEN [5] framework, which is based on the Kaldi toolkit [18], and modified it to enable Japanese character output. We set the parameters as follows. We trained a CTC-based 3-layer BLSTM as the acoustic model using 1404 hours of NHK broadcast programs and their closed captions as training data. The input acoustic features were 40-dimensional filter-bank features plus delta and delta-delta features, for a total of 120 dimensions. We built a WFST-based language model from ARPA-format 3-grams, which were estimated from a total of 620 million words of NHK news manuscripts and closed captions with a 200-k-word vocabulary. We adopted a newbob annealing schedule: the learning rate was halved at each epoch in which the error on the cross-validation data decreased by less than 0.5 from that of the previous epoch. The initial learning rate was set to $2.0 \times 10^{-5}$.

We used NHK's information program "Hirumae Hotto," consisting of 32 k words, as evaluation data. The evaluation data contained spontaneous speech and noise such as background music and cooking sounds. A breakdown of the evaluation data is shown in Table I. It consists of 461 clean utterances, mainly news manuscripts, and 2313 noisy utterances, which include background music or field noise.

TABLE I
EVALUATION DATA

| | Clean | Noisy | Total |
|---|---|---|---|
| Utterances | 461 | 2313 | 2774 |
| Words | 5875 | 25936 | 31811 |

B. Dimensional Compression Experiment

First, we conducted dimensional compression experiments. We used the models illustrated in Figure 3.

baseline: The number of memory cells was set to 320 in each of the forward (Fw) and backward (Bw) layers, and the output dimension was 3225. The labels include hiragana, katakana, kanji, numbers, alphabetical characters, and symbols.

affine: The affine transformation in the final layer was decomposed into two matrices with rank $D_r = 320$.

Table II shows the overall WER results. In the table, Cost indicates the training time per iteration in hours, and #Params indicates the number of parameters in millions. The affine model achieved a relative WER improvement of 9.1% over the baseline. This indicates that the generalization ability of the proposed model was most likely improved by the parameter compression. The proposed method also slightly reduced the training time, from 38.3 to 38.1 hours per iteration.

TABLE II
DIMENSION COMPRESSION METHOD EXPERIMENT

| Model | WER [%] | Cost [H/iter] | #Params [M] |
|---|---|---|---|
| baseline | 19.8 | 38.3 | 7.5 |
| affine | 18.0 | 38.1 | 6.7 |
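The parameter counts of Equations 1 to 3 can be checked numerically for the affine layer of this experiment. The following Python sketch is only an illustration: the function names are hypothetical, and $D_{in} = 640$ is an assumption read from the 640*320 factor shown in Fig. 3 (320 forward plus 320 backward cells); $D_{out} = 3225$ and $D_r = 320$ are the values stated above.

```python
def affine_params(d_in: int, d_out: int) -> int:
    """Eq. (1): weights plus biases of the full affine transformation."""
    return d_in * d_out + d_out


def low_rank_params(d_in: int, d_out: int, d_r: int) -> int:
    """Eq. (2): two factor matrices of rank d_r plus the output biases."""
    return d_r * d_in + d_r * d_out + d_out


def rank_bound(d_in: int, d_out: int) -> float:
    """Eq. (3): the decomposition saves parameters only if d_r is below this value."""
    return d_in * d_out / (d_in + d_out)


D_IN = 640    # assumption: 320 Fw + 320 Bw cells, matching the 640*320 factor in Fig. 3
D_OUT = 3225  # number of output labels in the baseline
D_R = 320     # rank used for the affine model

p_a = affine_params(D_IN, D_OUT)
p_r = low_rank_params(D_IN, D_OUT, D_R)

if __name__ == "__main__":
    print(f"P_A = {p_a:,}")                          # full affine layer
    print(f"P_r = {p_r:,}")                          # decomposed layer
    print(f"saved = {p_a - p_r:,}")                  # parameters removed by the decomposition
    print(f"bound = {rank_bound(D_IN, D_OUT):.1f}")  # largest rank that still compresses
```

Under these assumptions the decomposition removes 827,200 parameters from the final layer, which is in line with the drop from 7.5 M to 6.7 M parameters reported in Table II, and $D_r = 320$ lies well below the bound of about 534 given by Equation 3.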

Fig. 3. DNN structure of experiment. (baseline): input (120 dim), BLSTM layers, softmax. (affine): input (120 dim), BLSTM layers, affine transformation decomposed into (640*320) and (320*3225) matrices, softmax.

C. Label Expansion Experiment

Next, we conducted experiments on the output label expansion methods. We used the affine model as the network structure and added output labels using HF-selection and LFP-selection, respectively. The numbers of additional labels were determined in preliminary experiments and set to 800 for HF-selection and 325 for LFP-selection. Consequently, the number of output labels increased from 3225 to 4025 with HF-selection. In LFP-selection, the labels were selected with a threshold value of 0.12 from words that occurred more than 10 times in the training data.

Table III shows the results. As shown in the table, the HF-selection and LFP-selection methods further improved the WER compared with that of the affine model. In particular, HF-selection reduced the WER by a relative 6.1% compared with the affine model and achieved a total relative WER reduction of 14.6% compared with the baseline. Although HF-selection chose words without considering their pronunciations, the results indicate that the model could be trained efficiently. We attribute this to the added character strings capturing the transitional dynamics of the features between characters. In contrast, the LFP-selection model achieved poor results compared with those of the HF-selection model. These results are probably due to the lack of training data for the added labels. For further improvement, the design of the output labels should take the frequency into account when selecting additional labels.

TABLE III
EXPERIMENT INVOLVING METHOD ADDING CHARACTER STRINGS

| Model | WER [%] | #Params [M] |
|---|---|---|
| baseline | 19.8 | 7.5 |
| affine | 18.0 | 6.7 |
| affine+expansion lexicon | 18.2 | 6.7 |
| affine+LFP-selection | 17.9 | 6.8 |
| affine+HF-selection | 16.9 | 6.9 |

Additionally, for comparison, we made a lexicon to which the syllabic Japanese scripts (hiragana and katakana) of about 1,800 words were added on the basis of the proposed methods. Among the 1,800 words, 1,000 were chosen by HF-selection, and 800 were chosen by LFP-selection. We used this expanded lexicon with the affine model (affine+expansion lexicon). However, the affine+expansion lexicon model showed no improvement. We believe that this occurred for the following two reasons. One reason is that the kanji added to the lexicon were not rewritten into hiragana in the training data. We presume that a model trained to map acoustic features to kanji is unlikely to output hiragana for those kanji. The other reason is that characters output in hiragana and katakana were unnecessarily converted into kanji by the added lexicon. In Japanese, some words that are otherwise written in kanji are written in hiragana in certain cases, such as proper nouns. When hiragana characters were output as a result of this training, they may have been unnecessarily converted into kanji by the added lexicon.

Table IV shows the results by type of error, i.e., substitutions, deletions, and insertions. In the affine+HF-selection model, the substitution errors were significantly reduced compared with the baseline. We believe that the model learned the relationship between character strings and acoustic features. CTC-based training originally had difficulty learning the acoustic features between characters from insufficient data. Adding character strings as labels is equivalent to considering cross-word contexts in conventional acoustic modeling [19] and enables the proposed model to learn the dynamics between characters, yielding a noise-robust model.

TABLE IV
BREAKDOWN OF MISTAKES IN EVALUATION DATA (ERROR RATE [%])

| Model | Sub. | Del. | Ins. |
|---|---|---|---|
| baseline | 11.3 | 4.6 | 3.8 |
| affine | 9.7 | 5.2 | 3.1 |
| affine+LFP-selection | 9.9 | 4.4 | 3.7 |
| affine+HF-selection | 8.9 | 5.0 | 3.0 |

IV. CONCLUSION AND FUTURE WORK

In this paper, we proposed a novel method for end-to-end ASR for languages with a large number of characters and multiple pronunciations. We utilized a CTC-based acoustic model and used the following two approaches.
One reduces the parameters of the BLSTM using a matrix decomposition of the affine transformation. The other adds character strings as output labels. The experimental results showed that the model combining these two methods achieved a relative WER reduction of 14.6% compared with the baseline character-based model. Therefore, we conclude that a Japanese end-to-end model needs to be dimensionally compressed appropriately and to learn the acoustic features between characters. For future work, we will deal with the unknown characters that arise from the large number of characters and conduct multicondition training for noisy conditions.
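As a compact recap of the two label selection rules of Section II-C (Equations 4 and 5), the following Python sketch reproduces the 生 example of Fig. 2. It is a simplified illustration under stated assumptions, not the procedure actually used in the experiments: the function names are hypothetical, readings are given as romanized strings, the initial pronunciation is approximated by the first two romanized letters (se, sh, ki), and kanji are detected with the basic CJK Unicode range.

```python
from collections import Counter, defaultdict


def is_kanji(char: str) -> bool:
    """Rough kanji test using the basic CJK Unified Ideographs range (assumption)."""
    return "\u4e00" <= char <= "\u9fff"


def hf_selection(words: list[str], k: int) -> list[str]:
    """Eq. (4): the k most frequent words, restricted to words with two or more kanji."""
    candidates = [w for w in words if sum(is_kanji(c) for c in w) >= 2]
    return [word for word, _ in Counter(candidates).most_common(k)]


def lfp_selection(readings: dict[str, str], t: float) -> list[str]:
    """Eq. (5): words whose initial pronunciation is rare for their initial character.

    readings maps each word to a romanized reading; the first two letters of the
    reading stand in for the initial pronunciation p (illustrative assumption).
    """
    # groups[s][p] corresponds to the word list L_s^p; the union over p is H_s.
    groups: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for word, reading in readings.items():
        groups[word[0]][reading[:2]].append(word)

    selected: list[str] = []
    for by_pronunciation in groups.values():
        total = sum(len(ws) for ws in by_pronunciation.values())
        for ws in by_pronunciation.values():
            if len(ws) / total < t:  # proportion of this pronunciation below threshold t
                selected.extend(ws)
    return selected


# The word list of Fig. 2 with the readings given in the text.
FIG2_READINGS = {
    "生物": "seibutsu", "生徒": "seito", "生活": "seikatsu",
    "生息": "seisoku", "生涯": "shougai", "生糸": "kiito",
}

if __name__ == "__main__":
    print(lfp_selection(FIG2_READINGS, t=0.2))  # ['生涯', '生糸']
```

With $t = 0.2$, the se group (4 of 6 words) is kept out, while the sh and ki groups (1 of 6 words each) are selected, matching the example in the text. The final step of the method, which keeps only the first two characters of each selected word as the new label, is omitted here.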

REFERENCES

[1] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP. IEEE, 2013, pp. 6645-6649.
[2] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. ICML, vol. 14, 2014, pp. 1764-1772.
[3] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., "Deep speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[4] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," arXiv preprint arXiv:1512.02595, 2015.
[5] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU. IEEE, 2015, pp. 167-174.
[6] A. L. Maas, Z. Xie, D. Jurafsky, and A. Y. Ng, "Lexicon-free conversational speech recognition with neural networks," in Proc. HLT-NAACL, 2015, pp. 345-354.
[7] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
[8] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in Proc. ICASSP. IEEE, 2015, pp. 4280-4284.
[9] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, "Towards end-to-end speech recognition with deep convolutional neural networks," in Proc. Interspeech, 2016, pp. 410-414.
[10] N. Kanda, X. Lu, and H. Kawai, "Maximum a posteriori based decoding for CTC acoustic models," in Proc. Interspeech, 2016, pp. 1868-1872.
[11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369-376.
[12] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[13] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602-610, 2005.
[14] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. ICASSP. IEEE, 2013, pp. 6655-6659.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," CoRR, vol. abs/1402.1128, 2014.
[17] A. Zeyer, R. Schlüter, and H. Ney, "Towards online-recognition with deep bidirectional LSTM acoustic models," in Proc. Interspeech, 2016, pp. 3424-3428.
[18] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[19] P. Beyerlein, M. Ullrich, and P. Wilcox, "Modeling and decoding of crossword context dependent phones in the Philips large vocabulary continuous speech recognition system," in Proc. Eurospeech, 1997, pp. 1163-1166.