THESIS RESEARCH



A Model for Segment-Based Speech Recognition
Jane Chang

Currently, most approaches to speech recognition are frame-based, in that they represent the speech signal using a temporal sequence of frame-based features, such as Mel-cepstral vectors. Frame-based approaches take advantage of efficient search algorithms that largely contribute to their success. However, they cannot easily incorporate segment-based modeling strategies that can further improve recognition performance. For example, duration is a segment-based feature that is useful but difficult to model in a frame-based approach. In contrast, segment-based approaches represent the speech signal using a graph of segment-based features, such as average Mel-cepstral vectors over hypothesized phone segments. Segment-based approaches enable the use of segment-based modeling strategies. However, they introduce multiple difficulties in recognition that have limited their success.

In this work, we have developed a framework for speech recognition that overcomes many of the difficulties of a segment-based approach. We have published experiments in phone recognition on the core test set of the TIMIT corpus over 39 classes [1]. We have also run preliminary experiments in word recognition on the December 94 test set of the ATIS corpus.

In our segment-based approach, we hypothesize segments prior to recognition. Previously, our segmentation algorithm was based on local acoustic change. However, segmentation depends on contextual factors that are difficult to capture in a simple measure. We have developed a probabilistic segmentation algorithm called segmentation by recognition that hypothesizes segments in the process of recognition. Segmentation by recognition applies all of the constraints used in recognition towards segmentation. As a result, it hypothesizes more accurate segments. In addition, it adapts to all types of variability, focuses modeling on confusable segments, hypothesizes all types of units, and uses scores that can be re-used in recognition. We have implemented this segmentation algorithm using a backwards A* search and a diphone context-dependent frame-based phone recognizer. In published TIMIT experiments, we have reported an 11.3% reduction in phone recognition error rate, from 38.7% with our previous acoustic segmentation to 34.3% with segmentation by recognition [1].

In segment-based recognition, the speech signal is represented using a graph of features. Probabilistically, it is necessary to account for all of the features in the graph. However, each path through the graph directly accounts for only a subset of all features. Previously, we modeled the features that are not in a path using a single anti-phone model [2]. However, the features that are not in a path depend on contextual factors that are difficult to capture in one model. We have developed a search algorithm called near-miss modeling that uses multiple models for all features in a graph. Near-miss modeling associates each feature with a near-miss subset of features such that any path through a graph is associated with all features. As a result, it probabilistically accounts for and efficiently enforces constraints across all features. In addition, it focuses modeling on discriminating between a feature and its near-misses. We have implemented near-miss modeling using a Viterbi search and a set of near-miss phone models that correspond to our context-independent phone models.
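The central property of near-miss modeling is easy to see in miniature. In the sketch below (toy data structures of our own, not the SUMMIT implementation), a segment's near-miss set is taken to be the other segments that overlap it in time; any path of non-overlapping segments that tiles the utterance then accounts for every feature in the graph, since each off-path segment overlaps, and is therefore a near-miss of, some on-path segment.

```python
# Minimal sketch of near-miss association by temporal overlap
# (hypothetical Segment type and data; one simple rule that satisfies
# the property that every path is associated with all features).
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float

def near_misses(seg, graph):
    """Segments that overlap seg in time, excluding seg itself."""
    return [s for s in graph
            if s != seg and s.start < seg.end and seg.start < s.end]

graph = [Segment(0.0, 0.3), Segment(0.0, 0.5),
         Segment(0.3, 0.5), Segment(0.5, 0.8)]
for seg in graph:
    print(seg, "->", near_misses(seg, graph))
```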

[Figure 7. Example of framework, showing spectrogram, segment graph, phone and word recognition, and scores for the highlighted segment.]

In published experiments, we have reported a 9.3% reduction in phone recognition error rate, from 34.3% with anti-phone modeling to 31.1% with near-miss modeling [1]. In addition, in preliminary ATIS experiments, we have shown a 21.4% reduction in word recognition error rate, from 12.6% with anti-phone modeling to 9.9% with near-miss modeling.

In word recognition, deletion and insertion errors in segmentation can cause multiple recognition errors. Previously, we have been using phone units. However, phone realizations depend on contextual factors and are difficult to segment. We have developed larger units called multi-phone units that span multiple phones. Multi-phone units cover phone sequences that demonstrate systematic acoustic and lexical variations. As a result, they recover from systematic segmentation errors. In addition, they focus modeling on systematic context-dependencies. We select multi-phone units using a Viterbi search to minimize match, deletion and insertion criteria. In preliminary experiments, we have shown a 4.2% reduction in word recognition error rate, from 9.9% with phone units to 9.1% with multi-phone units.

Figure 7 shows an example of our framework. The input speech is displayed as a spectrogram. Segmentation by recognition hypothesizes the graph of segments under the spectrogram. Near-miss modeling associates near-misses such that the black segment is associated with the three gray segments. The total score for a unit is the sum of segment and near-miss scores. The seven best scoring units for the black segment are listed on the right. The best scoring unit is the multi-phone unit that spans the phone sequence of /r/ followed by /l/. The recognized phone and word outputs are displayed under the segment graph.

With segmentation by recognition, near-miss modeling and multi-phone units, our framework overcomes many of the difficulties in segment-based recognition and enables the exploration of a wide range of segment-based modeling strategies.

Although our work does not focus on developing such strategies, we have already shown improvements in recognition performance. For example, a segment-based approach can use both frame- and segment-based features. In published experiments, we have reported a 4.0% reduction in phone recognition error rate, from 27.7% with just frame-based features to 26.6% with both types of features. In addition, a segment-based approach facilitates the use of duration to model segment probability. In preliminary experiments, we have shown a 5.3% reduction in word recognition error rate, from 9.1% with no duration model to 8.5% with a duration model.

References

[1] J. Chang and J. Glass, Segmentation and Modeling in Segment-based Recognition, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[2] J. Glass, J. Chang, and M. McCandless, A Probabilistic Framework for Feature-based Speech Recognition, Proc. International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.

[3] J. Chang, Near-Miss Modeling: A Segment-based Approach to Speech Recognition, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.

Hierarchical Duration Modelling for a Speech Recognition System
Grace Chung

Durational patterns of phonetic segments and pauses convey information about the linguistic content of an utterance. Most speech recognition systems grossly underutilize the knowledge provided by durational cues due to the vast array of factors that influence speech timing and the complexity with which they interact. In this thesis, we introduce a duration model based on the ANGIE framework. ANGIE is a paradigm which captures morpho-phonemic and phonological phenomena under a unified hierarchical structure. Sublexical parse trees provided by ANGIE are well-suited for constructing complex statistical models to account for durational patterns that are functions of effects at various linguistic levels. By constructing models for all the sublexical nodes of a parse tree, we implicitly model duration phenomena at these linguistic levels simultaneously, and subsequently account for a vast array of contextual variables affecting duration from the phone level up to the word level. Experiments in our work have been conducted in the ATIS domain, which consists of continuous, spontaneous utterances concerning enquiries for travel information.

In this duration model, a strategy has been formulated in which node durations in upper layers are successively normalized by their respective realizations in the layers below; that is, given a nonterminal node, the individual probability distributions corresponding to each different realization in the layer immediately below are all scaled to have the same mean. This reduces the variance at each node, and enables the sharing of statistical distributions. Upon normalization, a set of relative duration models is constructed by measuring the percentage duration of nodes occupied with respect to their parent nodes. Under this normalization scheme, the normalized duration of each word node is independent of the inherent durations of its descendents and hence is an indicator of speaking rate. A speaking rate parameter can be defined as the ratio of the normalized word duration over the global average normalized word duration. This speaking rate parameter is then used to construct absolute duration models that are normalized by the rate of speech, by scaling absolute phoneme durations by the above parameter. By combining hierarchical normalization and speaking rate normalization, the average standard deviation for phoneme duration was reduced from 50 ms to 33 ms.

Using the hierarchical structure, we have conducted a series of experiments investigating speech timing phenomena. We are specifically interested in (1) examining secondary effects of speaking rate, (2) characterizing the effects of prepausal lengthening, and (3) detecting other word boundary effects associated with duration, such as gemination. For example, we have found, with statistical significance, that a suffix within a word is affected far more by speaking rate than is a prefix. We have also studied closely the types of words which tend to be realized particularly slowly in our training corpus, and discovered that these are predominantly function words and single syllable words. Prepausal lengthening is the phenomenon whereby words preceding pauses tend to be somewhat lengthened. Our goal is to examine the characteristics associated with prepausal effects and, in the future, to incorporate these into our model.
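A minimal sketch of the speaking-rate portion of this scheme (the mean-duration table and the example word are our own toy values; ANGIE's normalization additionally proceeds hierarchically at every parse node, which is not shown):

```python
# Speaking rate as the observed word duration over its expected
# duration given the phonemic realization (> 1 means slower than
# average), followed by rate-normalizing the absolute phone durations.
MEAN_DUR = {"f": 0.09, "l": 0.06, "ay": 0.14, "t": 0.07}  # toy means, seconds

def speaking_rate(phonemes, durations):
    expected = sum(MEAN_DUR[p] for p in phonemes)
    return sum(durations) / expected

def rate_normalized(phonemes, durations):
    """Dividing by the speaking rate reduces duration variance across
    fast and slow speech, as hierarchical normalization does by layer."""
    rate = speaking_rate(phonemes, durations)
    return [round(d / rate, 3) for d in durations]

# a slow rendition of "flight": every phone comes out lengthened
print(rate_normalized(["f", "l", "ay", "t"], [0.12, 0.08, 0.18, 0.09]))
```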
In our studies, we consider the relationship between prepausal lengthening and the rate of speech.

We found that lengthening tends to occur when pauses are greater than 100 ms in duration. It is also observed that prepausal lengthening affects the various sublexical units non-uniformly; for example, the stressed syllable nucleus tends to be lengthened more than the onset position.

The final duration model has been implemented in the ANGIE phonetic recognizer. In addition to contextual effects captured by the model at various sublexical levels, the scoring mechanism also accounts explicitly for two inter-word level phenomena, namely prepausal lengthening and gemination. Our experiments have been conducted under increasing levels of linguistic constraint, with correspondingly different baseline performances. The improved performance is obtained by providing successively greater amounts of implicit lexical knowledge during recognition by way of an intermediate morph or syllable lexicon. When maximal linguistic constraint is imposed, the incorporation of the relative and speaking-rate normalized absolute phoneme duration scores reduced the phonetic error rate from 29.7% to 27.4%, a relative reduction of 7.7%. These gains are over and above any gains realized from the standard phone duration models present in the baseline system, and encourage us to further apply our model in future recognition tasks.

As a first step towards demonstrating the benefit of duration modelling for full word recognition, we have conducted a preliminary study using duration as a postprocessor in a word-spotting task. We have simplified the task of spotting city names in the ATIS domain by choosing a pair of highly confusable keywords, New York and Newark. All tokens initially spotted as New York are passed to a post-processor, which reconsiders those words and makes a final decision with the duration component incorporated. For this task, the duration postprocessor reduced the number of confusions from 60 to 19 tokens out of a total of 323 tokens, a 68% reduction of error. We believe that this dramatic performance improvement demonstrates the power of durational knowledge in specific instances where acoustic-phonetic features are less effective.

In another experiment, the duration model was fully integrated into an ANGIE-based word-spotting system. As in our phonetic recognition experiments, results were obtained by adding varying degrees of linguistic constraint. When maximum constraint is imposed, the duration model improved performance from 89.3 to 91.6 (FOM), a relative improvement of 21.5%. The duration model has been shown to be most effective when the maximum amount of lexical knowledge is provided, wherein the model is able to best take advantage of the various durational relationships among the components of the sublexical parse structure. We also believe that the more complex parse structures available in the keywords for this task contribute to the performance of our duration model.

This research has demonstrated success in employing a complex statistical duration model to improve speech recognition performance. In particular, we see that duration is more valuable during word recognition. We would like to incorporate our duration modeling into a continuous speech recognition system, where significant gains should also be possible.

Reference

[1] G. Chung, Hierarchical Duration Modelling for a Speech Recognition System, S.M. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1997.

Discourse Segmentation of Spoken Dialogue: An Empirical Approach
Giovanni Flammia

Empirical research in discourse and dialogue is instrumental in quantifying which conventions of human-to-human language may be applicable to human-to-machine language [1,2]. This thesis is an empirical exploration of one aspect of human-to-human dialogue that can be applicable to human-to-machine language. Some linguistic and computational models assume that human-to-human dialogue can be modeled as a sequence of segments [3]. Detecting segment boundaries has potential practical benefits in building spoken language applications (e.g., designing effective system dialogue strategies for each discourse segment and dynamically changing the system lexicon at segment boundaries). Unfortunately, drawing conclusions from studying human-to-human conversation is difficult because spontaneous dialogue can be quite variable, containing frequent interruptions, incomplete sentences and unstructured segments. Some of these variabilities may not contribute directly to effective communication of information.

The goal of this thesis is to determine empirically the extent to which discourse segment boundaries can be extracted from annotated transcriptions of spontaneous, natural dialogues in specific application domains. We seek answers to three questions. First, is it possible to obtain consistent annotations from many subjects? Second, what are the regular vs. irregular discourse patterns found by the analysis of the annotated corpus? Third, is it possible to build discourse segment models automatically from an annotated corpus?

The contributions of this thesis are twofold. Firstly, we developed and evaluated the performance of a novel annotation tool and associated discourse segmentation instructions. The tool and the instructions have proven to be instrumental in obtaining reliable annotations from many subjects. Our findings indicate that it is possible to obtain reliable and efficient discourse segmentation when the task instructions are specific and the annotators have few degrees of freedom, i.e., when the annotation task is limited to choosing among a few independent alternatives. The reliability results are very competitive with other published work [4]. Secondly, the analysis of the annotated corpus provides substantial quantitative evidence about the differences between human-to-human conversation and current human-to-machine telephone applications.

Since dialogue annotation can be extremely time consuming, it is essential that we develop the necessary tools to maximize efficiency and consistency. To this end, we have developed a visual annotation tool called Nb, which has been used for discourse segmentation in our group and at other institutions. With the help of Nb, we determined how reliably human annotators can tag segments in the dialogue transcriptions of our corpus. We conducted two experiments in which the transcriptions have each been annotated by several people. To carry out our research, we are making use of a corpus of orthographically transcribed and annotated telephone conversations. The text data are faithful transcriptions of actual telephone conversations between customers and telephone operators collected by BellSouth Intelliventures and American Airlines. The first pilot study consisted of 18 dialogues from all the domains of our corpus, each one annotated by 6 different coders [5].

The goal of this experiment was rather exploratory in nature, without particular constraints on where to place discourse segment boundaries. We measured reliability by recall, precision and the kappa coefficient. When comparing two different segmentations of the same text, we alternately select one as the reference and the other one as the test. Reliability is best measured by the kappa coefficient, a statistical measure which is gaining popularity in computational linguistics because it measures how much better than chance the observed agreement is [6]. A coefficient of 0.7 or better indicates reliable results. Table 11 summarizes our findings. We found that without detailed instructions, annotators agree at the 0.45 reliability level in placing segment boundaries. In our data, we found that the kappa coefficient is always less than the average of precision and recall.

[Table 11. Summary percentage statistics of the two annotation experiments (columns: first and second experiment; rows: number of dialogues, coders per dialogue, and precision, recall, and kappa with the sentence and with the dialogue turn as unit of analysis). Average precision and recall are measured across all possible combinations of pairs of coders. The groupwise kappa coefficient is computed from the classification matrix of all the coders. Typically, a dialogue turn is composed of one to three short sentences.]

The analysis of the disagreements of the first experiment led to a second, more focused experiment. This experiment consisted of 22 dialogues from only one application, the movies listing domain. Each dialogue was annotated by 7-9 coders [7]. The instructions defined a segment to be a section of the dialogue in which the agent delivers a new piece of information that is relevant to the task. In addition, the annotators had to choose among five different segment purpose labels when tagging a discourse segment. In that case, we found that the kappa reliability measure in placing segment boundaries is 0.824, and the accuracy in assigning segment purpose labels is 80.1%.
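As a reference point, the two-coder version of the kappa coefficient used above can be computed as follows; a minimal sketch with toy boundary judgments (the thesis reports a groupwise kappa over all coders, which is not shown):

```python
def kappa(a, b):
    """Cohen's kappa for two binary label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each coder's marginal label frequencies
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    p_exp = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_obs - p_exp) / (1 - p_exp)

# 1 = coder placed a segment boundary after this sentence, 0 = did not
coder1 = [1, 0, 0, 1, 0, 1, 0, 0]
coder2 = [1, 0, 1, 1, 0, 1, 0, 0]
print(round(kappa(coder1, coder2), 3))  # 0.75 for these toy labels
```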

To evaluate the feasibility of segmenting dialogues automatically, we implemented a simple discourse segment boundary classifier based on learning classification rules from lexical features [8]. On average, the automatic algorithm agrees with the manually annotated boundaries with 69.4% recall and 74.5% precision.

Analysis of the movies listing conversations indicates that the customer follows the information reported by the agent with an explicit acknowledgment 84% of the time. We found that the agent delivers information using shorter rather than longer sentences. Figure 8 is a cumulative frequency plot of the length of the agent's dialogue turn before a customer's acknowledgment. Most of the time, the agent does not speak more than 15 words before the customer responds with an acknowledgment. After the acknowledgment, 40% of the time the information is explicitly confirmed by both parties with at least two additional dialogue turns.

[Figure 8. Observed frequency of customer acknowledgments as a function of the preceding agent dialogue turn duration (y-axis: probability of customer back-channel response; x-axis: number of words in preceding agent turn).]

Analysis of the annotated segments indicates that the customer is mainly responsible for switching to new topics, and that on average the agent's response is not immediate but instead is preceded by a few clarification turns. Table 12 lists the fraction of agent- vs. customer-initiated segments by topic and the average turn of the agent's response from the beginning of the segment.

[Table 12. Distribution of segment initiatives by topic and average turn position of response.]

Topic                  Customer Init.   Agent Init.   Turn of Response
List movies            67.8%            32.2%         4.5
Phone number           87.1%            12.9%         3.7
Show times             76.3%            23.7%         2.9
Where is it playing    79.4%            20.7%         4.0

References

[1] N. O. Bernsen, L. Dybkjaer, and H. Dybkjaer, Cooperativity in Human-Machine and Human-Human Spoken Dialogue, Discourse Processes, Vol. 21, No. 2, 1996.

[2] N. Yankelovich, Using Natural Dialogs as the Basis for Speech Interface Design, chapter in the forthcoming book Automated Spoken Dialog Systems, Susann Luperfoy, ed., MIT Press.

[3] B. Grosz and C. Sidner, Attention, Intentions, and the Structure of Discourse, Computational Linguistics, Vol. 12, No. 3, 1986.

[4] M. Walker and J. Moore, eds., Empirical Studies in Discourse, Computational Linguistics special issue, Vol. 20, No. 2.

[5] G. Flammia and V. Zue, Empirical Evaluation of Human Performance and Agreement in Parsing Discourse Constituents in Spoken Dialogue, Proc. European Conference on Speech Communication and Technology, Madrid, Spain, September 1995.

[6] J. Carletta, Assessing Agreement on Classification Tasks: The Kappa Statistic, Computational Linguistics, Vol. 22, No. 2, 1996.

[7] G. Flammia and V. Zue, Learning the Structure of Mixed-Initiative Dialogues Using a Corpus of Annotated Conversations, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[8] W. W. Cohen, Fast Effective Rule Induction, Machine Learning: Proceedings of the 12th International Conference, 1995.

[9] G. Flammia, Corpus-based Discourse Segmentation of Spoken Dialogue, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.

Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition
Andrew Halberstadt

Most automatic speech recognition systems use a small set of homogeneous acoustic measurements and a single classifier to make acoustic-phonetic distinctions. We are exploring the use of a large set of heterogeneous measurements and multiple classifiers in order to improve phonetic classification. There are several areas for innovative work involved in implementing this approach. First, a variety of acoustic measurements need to be developed, or selected from those proposed in the literature. In the past, different acoustic measurements have generally been compared in a winner-takes-all paradigm in which the goal is to select the single best measurement set. In contrast to this approach, we are interested in making use of complementary information in different measurement sets. In addition, measurements have usually been evaluated for their performance over the entire phone set. In contrast, in this work we explore the notion that high-performance acoustic measurements may be different across different phone classes. Thus, heterogeneous measurements may be used both within and across phone classes. Second, methods for utilizing high-dimensional acoustic measurement spaces need to be proposed and developed. This problem will be addressed through schemes for combining the results of multiple classifiers.

In the process of developing heterogeneous acoustic measurements, we focused initially on stop consonants because of evidence that their short-time burst characteristics and rapidly changing acoustics were poorly represented by conventional homogeneous measurements [1]. A perceptual experiment using only stop consonants was performed in order to facilitate comparative analysis of the types of errors made by humans and machines. The experiment was designed so that humans could not make profitable use of phonotactic or lexical knowledge. Figure 9 provides a summary of the results of these experiments. The error rates are generally high because the data set was deliberately chosen to include some of the most difficult-to-identify stops in our development set. The machine systems are labelled A, B, C, MMV, and D, where A, B, and C are three different context-independent systems, MMV (Machine Majority Vote) is a system that takes the 3-way majority vote answer from A, B, and C, and D is a context-dependent system. The perceptual results from listeners are labelled PA (Perceptual Average) and PMV (Perceptual Majority Vote). The machines' place of articulation identification is several times worse than the humans', whereas their voicing identification is considerably closer to human performance. Our conclusion is that place of articulation identification is the area in which automatic systems require significant improvement in order to approach human-like levels of performance.

[Figure 9. Human perception (PA, PMV) versus machine classification (A, B, C, MMV, D) in the tasks of stop identification, place of articulation identification of stops, and voicing identification of stops (percent error per system and task).]

The second challenge is to develop overall system architectures which can make profitable use of a large number of acoustic measurements.
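The MMV and PMV systems above use simple majority voting, which can be sketched as follows (toy labels of our own; the actual systems vote per stop token):

```python
# Committee classification by 3-way majority vote among classifiers
# A, B, and C, as in the MMV system described above.
from collections import Counter

def majority_vote(predictions):
    """Pick the label most classifiers agree on; ties fall back to the
    first classifier's choice, a simple stand-in for a score-based
    tiebreak."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > 1 else predictions[0]

print(majority_vote(["b", "p", "b"]))  # two of three say "b" -> "b"
print(majority_vote(["b", "p", "d"]))  # no majority -> classifier A's "b"
```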
The fundamental challenge of high-dimensional input spaces arises because the quantity of training data needed to adequately train a classifier grows exponentially with the input dimensionality.

[Figure 10. Bubble plot of confusions in phonetic classification on the TIMIT development set, with reference phone labels on one axis and hypothesized labels on the other, grouped into vowels/semivowels, nasals/flaps, strong fricatives, weak fricatives, and stops. Radii are linearly proportional to the error; the largest bubble is 5.2% of the total error.]

In one approach to the problem, multiple classifiers trained from different measurements can be arranged hierarchically. In this scheme, the hierarchical structure emphasizes taking the task of phonetic classification and breaking it down into subproblems such as vowel classification and nasal classification. Figure 10 illustrates the fact that most classifier errors remain within the same manner class, thus supporting the subproblem approach of hierarchical classification. Roughly speaking, if the first stage of the hierarchy has high confidence that a particular token is a nasal, then a classifier tuned especially for nasals may perform further processing. In [1], this approach was developed and used to obtain 79.0% context-independent classification on the TIMIT core test set.

Alternatively, multiple classifiers may be formed into committees. Each committee member has some influence on the final selection. In its simplest form, the final choice could be determined by popular vote of the classifier committee. The performance of the MMV (Machine Majority Vote) system in Figure 9, which is the result of voting among systems A, B, and C, and the PMV (Perceptual Majority Vote) results are examples of improved performance through the use of voting. The ideas of classification according to a hierarchy or by a committee are not mutually exclusive, but rather can be combined. Thus, one member of a committee could be a hierarchical classifier, or there could be a hierarchy of committees.

In the future, we hope to narrow the gap observed in perceptual experiments between human and machine performance in the task of place of articulation identification. We plan to continue investigating heterogeneous measurement sets and developing a variety of ways of combining those measurements into classification and recognition systems.

Reference

[1] A. K. Halberstadt and J. R. Glass, Heterogeneous Measurements for Phonetic Classification, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

The Use of Speaker Correlation Information for Automatic Speech Recognition
T. J. Hazen

Typical speech recognition systems perform much better in speaker dependent (SD) mode than they do in speaker independent (SI) mode. This is a result of flaws in the probabilistic framework and modeling techniques used by today's speech recognizers. In particular, current SI recognizers typically assume that all acoustic observations can be considered independent of each other. This assumption ignores within-speaker correlation information which exists between speech events produced by the same speaker. Knowledge of the speaker constraints imposed on the acoustic realization of an utterance can be extremely useful for improving the accuracy of a recognition system.

To describe the problem mathematically, begin by letting $P$ represent a sequence of phonetic units. If $P$ contains $N$ different phones then let it be expressed as

$$P = \{p_1, p_2, \ldots, p_N\} \qquad (1)$$

Here each $p_n$ represents the identity of one phone in the sequence. Next, let $X$ be a sequence of feature vectors which represent the acoustic information of an utterance. If $X$ contains one feature vector for each phone in $P$ then $X$ can be expressed as

$$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N\} \qquad (2)$$

Given the above definitions, the probabilistic expression for the acoustic model is given as $p(X \mid P)$. In order to develop effective and efficient methods for estimating the acoustic model likelihood, typical recognition systems use a variety of simplifying assumptions. To begin, the general expression can be expanded as follows:

$$p(X \mid P) = p(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \mid P) = \prod_{n=1}^{N} p(\vec{x}_n \mid \vec{x}_{n-1}, \ldots, \vec{x}_1, P) \qquad (3)$$

At this point, speech recognition systems almost universally assume that the acoustic feature vectors are independent. With this assumption the acoustic model is expressed as follows:

$$p(X \mid P) = \prod_{n=1}^{N} p(\vec{x}_n \mid P) \qquad (4)$$

Because this is a standard assumption in most recognition systems, the term $p(\vec{x}_n \mid P)$ will be referred to as the standard acoustic model. In Equation (3), the likelihood of a particular feature vector is deemed dependent on the observation of all of the feature vectors which have preceded it. In Equation (4), each feature vector $\vec{x}_n$ is treated as an independently drawn observation which is not dependent on any other observations, thus implying that no statistical correlation exists between the observations. What these two equations do not show is the net effect of making the independence assumption. Consider applying Bayes' rule to the probabilistic term in Equation (3). In this case the term in this expression can be rewritten as:

$$p(\vec{x}_n \mid \vec{x}_{n-1}, \ldots, \vec{x}_1, P) = \frac{p(\vec{x}_{n-1}, \ldots, \vec{x}_1 \mid \vec{x}_n, P)}{p(\vec{x}_{n-1}, \ldots, \vec{x}_1 \mid P)} \, p(\vec{x}_n \mid P) \qquad (5)$$

After applying Bayes' rule, the conditional probability expression contained in (3) is rewritten as a product of the standard acoustic model $p(\vec{x}_n \mid P)$ and a probability ratio which will be referred to as the consistency ratio.
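Combining Equations (3) and (5) makes the net effect explicit (this rearrangement is spelled out here for clarity; the bracketed grouping is ours):

$$p(X \mid P) = \left[ \prod_{n=1}^{N} \frac{p(\vec{x}_{n-1}, \ldots, \vec{x}_1 \mid \vec{x}_n, P)}{p(\vec{x}_{n-1}, \ldots, \vec{x}_1 \mid P)} \right] \prod_{n=1}^{N} p(\vec{x}_n \mid P)$$

The independence assumption of Equation (4) therefore amounts to setting the bracketed product, the consistency ratio, to one.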

The consistency ratio is a multiplicative factor which is ignored when the feature vectors are considered independent. It represents the contribution of the correlations which exist between the feature vectors.

The purpose of this dissertation is to examine the assumptions and modeling techniques that are utilized by SI recognition systems and to propose novel modeling techniques to account for the speaker constraints which are typically ignored. To this end, this thesis has examined two primary approaches: speaker adaptation and consistency modeling. The goal of speaker adaptation is to alter the standard acoustic models represented by the expression $p(\vec{x}_n \mid P)$ so as to match the current test speaker as closely as possible. The goal of consistency modeling is to estimate the contribution of the consistency ratio, which is typically ignored when the independence-of-observations assumption is made.

Speaker clustering provides one of the most effective techniques used by speaker adaptation algorithms. This thesis examines several different approaches to speaker clustering: reference speaker weighting, hierarchical speaker clustering, and speaker cluster weighting. These methods examine various approaches for utilizing and combining acoustic model parameters trained from different speakers or speaker clusters. For example, the hierarchical speaker clustering used in this thesis examines the use of gender dependent models as well as gender and speaking rate dependent models.

Consistency modeling is a novel recognition technique for accounting for the correlation information which is generally ignored when each acoustic observation is considered independent. The key idea of consistency modeling is that the contribution of the consistency ratio must be estimated. Using several simplifying assumptions, the estimation of the consistency ratio can be reduced to the problem of estimating the mutual information between pairs of acoustic observations.

The various techniques have been evaluated on the DARPA Resource Management recognition task [1] using the SUMMIT speech recognition system [2]. The algorithms were tested on the task of instantaneous adaptation; in other words, the methods attempt to adapt to the same utterance which the system is trying to recognize. The results are tabulated in Table 13 with respect to the baseline SI system, and include experiments where speaker adaptation or clustering techniques are used in conjunction with consistency modeling in order to combine their strengths. The results indicate that significant performance improvements are possible when speaker correlation information is accounted for within the framework of a speech recognition system.

[Table 13. Summary of recognition results using various instantaneous adaptation techniques, including reference speaker weighting (RSW), gender dependent modeling (GD), gender and speaking rate dependent modeling (GRD), speaker cluster weighting (SCW), and consistency modeling (CM).]

Adaptation Method   Word Error Rate   Error Rate Reduction
SI                  8.6%              -
SI+RSW              8.0%              6.5%
SI+CM               7.9%              8.2%
SI+RSW+CM           7.7%              10.0%
GD                  7.7%              10.5%
GD+CM               7.1%              17.6%
GRD                 7.2%              16.4%
GRD+CM              6.8%              20.3%
SCW                 6.9%              18.9%
SCW+CM              6.8%              21.1%
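As one illustration, reference speaker weighting can be sketched as follows. This is a hedged sketch under our own simplifying assumptions: one-dimensional observations, a single Gaussian per reference speaker, and likelihood-proportional weights standing in for the thesis's maximum-likelihood weight estimation.

```python
# Toy reference speaker weighting: the adapted model mean is a convex
# combination of reference speakers' means, weighted by how well each
# reference speaker explains the adaptation observations.
import math

def gaussian_loglik(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def rsw_mean(obs, ref_means, var=1.0):
    logliks = [sum(gaussian_loglik(x, m, var) for x in obs)
               for m in ref_means]
    mx = max(logliks)                       # for numerical stability
    weights = [math.exp(l - mx) for l in logliks]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to a simplex
    return sum(w * m for w, m in zip(weights, ref_means))

# three reference speakers; a few frames from the test utterance
print(rsw_mean(obs=[1.9, 2.1, 2.0], ref_means=[0.0, 2.0, 4.0]))
```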

References

[1] W. Fisher, The DARPA Task Domain Speech Recognition Database, Proc. of the DARPA Speech Recognition Workshop, San Diego, CA, March.

[2] J. Glass, J. Chang, and M. McCandless, A Probabilistic Framework for Feature-based Speech Recognition, Proc. of the International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.

[3] T. Hazen, A Comparison of Novel Techniques for Instantaneous Speaker Adaptation, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[4] T. Hazen, The Use of Speaker Correlation Information for Automatic Speech Recognition, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, January 1998.

The Mole: A Robust Framework for Accessing Information from the World Wide Web
Hyung-Jin Kim

Although many people have labeled the World Wide Web the largest database ever created, very few applications have been able to use the web as a database. This is because the web is dynamic: web pages change constantly, sometimes on a daily basis. I propose a system called the Mole that aims to solve this problem by providing a semantic interface to the web. The semantic interface uses the semantic content of web pages to map very high-level concepts, such as "weather reports for Boston," to low-level requests for data (such as getting the text in the third A tag of a web page). Therefore, even though web pages change, the Mole will still be able to find information on them.

The Mole robustly accesses a web page by taking advantage of the topology of its underlying HTML. When web pages get updated, the information that is presented usually retains the same structure. For example, when the CNN Weather Data site changed in November of 1997, its facade changed, but it still continued to present the same information: CNN still presented data about the current conditions of a city, and it still gave a four-day forecast. Furthermore, although the HTML structure of the new page was drastically different, the weather information was still grouped in the same way (i.e., high and low temperatures were still presented next to each other).

The Mole uses semantic templates to access information from web pages. In the weather example, to gather all of the four-day forecasts of a city, the template in Figure 11 is used. The Mole takes this template and matches it to the data on the web page. The template essentially drills down through high-level concepts presented on the web page. First, it finds a day word, e.g., "Monday," on the web page, and then it tries to find the words "low" and "high" that are associated with that word. Finally, it finds the integers that are most closely located to the words "low" and "high." Since this semantic template abstracts away from the HTML structure, it would have found the same temperature information before and after the change (see Figure 12). Notice that this template follows what a human does to gather the same information: first searching for a specific day, and then for the temperatures beside the words "high" and "low."

[Figure 11. Semantic template for CNN weather: a Day encapsulates the words "high" and "low," each near an Integer.]

In order to make use of semantic templates, the Mole will require the following facilities: a taxonomy of data descriptors and a library of relationship descriptors. A taxonomy of data descriptors is used to describe all possible data or recognizable features on a web page. In our weather template, we used the names Integer and Day to describe the data we are looking for. In order for the Mole to access many different types of web pages, a large library of data types needs to be created. One can imagine extending this taxonomy to incorporate concepts of state, country, and car_name. The taxonomy can be hierarchical, in that a semantic idea can be built on top of other semantic ideas, making them highly scalable and re-usable. A library of relationship descriptors describes all the ways in which features of a web page can relate to each other.

Descriptors such as near and on_top_of are simple examples of relationship descriptors. More complicated descriptors include encapsulate, which defines not only how one datum is positioned relative to another, but also how the fonts of the data are related to each other (words with large, bolded fonts encapsulate the smaller-fonted words following them).

The Mole is potentially a very robust and simple interface through which applications can access the web. By lifting the semantic concepts found on a web page away from the HTML structure, the Mole will be able to gather information from web pages even when these pages change. In many ways, semantic templates attempt to mimic what a human does to find information. By using concepts instead of HTML tags to find information, the Mole is using web pages as they were meant to be used: by the human eye.

[Figure 12. Mapping of the semantic template to two versions of the weather page, e.g., "Monday High: 95 Low: 55" (note: not necessarily the CNN weather page).]
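A toy version of this matching, under our own simplifying assumptions (plain tokenized text rather than HTML, and nearest-integer proximity standing in for the near and encapsulate descriptors):

```python
# Drill down exactly as described above: find a day word, then the
# words "high" and "low" within that day's span, then the integer
# nearest each field word -- matching concepts, not HTML tags.
DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}

def match_forecast(text):
    tokens = text.lower().replace(":", " ").split()
    day_pos = [i for i, t in enumerate(tokens) if t in DAYS]
    results = []
    for n, i in enumerate(day_pos):
        end = day_pos[n + 1] if n + 1 < len(day_pos) else len(tokens)
        window = tokens[i:end]  # this day's span of the page
        entry = {"day": tokens[i]}
        for field in ("high", "low"):
            if field in window:
                j = window.index(field)
                ints = [k for k, t in enumerate(window) if t.isdigit()]
                if ints:
                    # nearest integer; on ties prefer the one after the word
                    k = min(ints, key=lambda k: (abs(k - j), k < j))
                    entry[field] = int(window[k])
        results.append(entry)
    return results

print(match_forecast("Monday High: 95 Low: 55 Tuesday High: 90 Low: 60"))
# [{'day': 'monday', 'high': 95, 'low': 55},
#  {'day': 'tuesday', 'high': 90, 'low': 60}]
```

Because the matcher keys on concepts (day words, field words, integers) rather than markup positions, the same template survives a page redesign that reorders or restyles the underlying HTML.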

Sublexical Modelling for Word-Spotting and Speech Recognition using ANGIE
Raymond Lau

In this work, we introduce and explore a novel framework, ANGIE, for modelling subword lexical phenomena in speech recognition. Our framework provides a flexible and powerful mechanism for capturing morphology, syllabification, phonology and other subword effects in a hierarchical manner which maximizes the sharing of subword structures. We hope that such a system can provide a single unified probabilistic framework for modelling phonological variation and morphology. Many current systems handle phonological variations either by having a pronunciation graph (such as in MIT's SUMMIT system) or by implicitly absorbing the variations into the acoustic modelling. The former has the disadvantage of not sharing common subword structure, hence splitting training data. The latter masks the process of handling phonological variations and makes the process difficult to control and to improve upon. For example, in the ATIS domain, the words "fly," "flying," "flight," and "flights" all share the common initial phoneme sequence f l ay, so presumably phonological variations affecting this sequence can be better learned if examples from all four words are pooled together. Our system does just that.

The sharing of subword structure will hopefully facilitate the search process and also make it easier to deal with new, out-of-vocabulary words. By pursuing merged common subword theories during search, we can mitigate the combinatorial explosion of the search tree, making large vocabulary recognition more manageable. Because we expect new words to share much common subword structure with words in our vocabulary, we can easily add new words dynamically, allowing them to adopt existing subword structures. In principle, we can even detect the occurrence of out-of-vocabulary words by recognizing as much of the subword structure as we can in a bottom-up manner.

We are using the ANGIE framework to model constraints at the subword level. Within our framework, subword structure is modeled via a context-free grammar and a probability model. The grammar generates a layered structure very similar to that proposed by Meng [1]. An example of an ANGIE parse tree is shown in Figure 13. Our work attempts to validate the feasibility of using the framework for speech recognition by demonstrating its effectiveness in three recognition tasks: phonetic recognition, word-spotting and continuous speech recognition. We also explore the combination of ANGIE with a natural language understanding system, TINA, that is also based on a context-free grammar and hence can be more easily integrated into our ANGIE-based system as compared to a more traditional recognition framework. Finally, we conclude with two pilot studies, one attempting to leverage the ANGIE subword structural information for prosodic modelling, and the other exploring the addition of new words to the recognition vocabulary in real time.

Our first demonstration of recognition with ANGIE was a system for forced phonemic/phonetic/acoustic alignment and phonetic recognition, as described in greater detail in [2]. In this system, we perform a bottom-up best-first search over possible phone strings, incorporating the acoustic score of each phone along with the score of the best ANGIE parse for the path up to that phone. Phonetic recognition results obtained have been promising, with the ANGIE-based system achieving a 36.1% error rate as compared to a phone bigram baseline system with a 39.8% error rate on ATIS data.

[Figure 13. A sample ANGIE parse tree for the phrase "I'm interested," with layers for morphology, syllabification, phonemics, and phonetics.]

The improvement was due roughly equally to improved phonological modelling and to the more powerful longer-distance constraints made possible by ANGIE's upper layers.

Our second demonstration was the implementation of an ANGIE-based system for word spotting. Our test case was spotting the city names in the ATIS corpus. We have successfully implemented the word-spotter with competitive performance. We have also conducted several experiments varying the nature of the subword constraints on the filler model within the word-spotter. The constraint sets experimented with ranged from a simple phone bigram, to syllables, to full word recognition. The results showed that, as expected, the inclusion of more constraints on the filler led to improved word-spotting performance. On our test set, the system had a figure of merit (FOM) of 89.3 with full word recognition, 87.7 with syllables, and 85.3 with phone bigrams. Surprisingly, speed tended to improve with FOM performance. We believe the explanation is that more constraints lead to a less bushy search. More details of our work in word-spotting can be found in [3].

For our final feasibility test, we have implemented a continuous speech recognition system based on the ANGIE framework. Our recognizer, employing a word bigram, achieves the same level of performance as our SUMMIT baseline system with a word bigram (18.8% word error rate vs. 18.9%). In both cases, context-independent acoustic models were used. Because ANGIE is based on a context-free grammar framework, we have experimented with integrating our TINA natural language understanding system (also based on a context-free framework) with ANGIE, resulting in a single, coupled search strategy. The main challenge with the integrated system was in curtailing the computational requirements of supporting robust parsing. We settled upon a greedy strategy described in greater detail in [4]. With the combined system, the word error rate declines to 14.8%. We have also attempted TINA resorting of SUMMIT N-best lists in an effort to separate the benefits of an integrated search strategy from those of bringing in the powerful TINA language model. That experiment yielded only marginal improvement over the word bigram, suggesting that the tightly coupled search can lead to a gain not attainable when the recognition and NL understanding processes are separated and interfaced through an N-best list generated without the use of information from TINA.
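The kind of structure sharing that motivates ANGIE, and that the new-word study below relies on, can be illustrated with a toy phoneme trie (our own simplification; ANGIE's sharing operates over full hierarchical parse structures, not a flat trie):

```python
# "fly", "flying", "flight", and "flights" all traverse the same
# f -> l -> ay prefix, so effects on that prefix are trained on
# examples pooled from all four words, and a new word added later
# can reuse the existing path.
def build_trie(lexicon):
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})
        node["#"] = word  # word-end marker
    return root

LEXICON = {
    "fly":     ["f", "l", "ay"],
    "flying":  ["f", "l", "ay", "ih", "ng"],
    "flight":  ["f", "l", "ay", "t"],
    "flights": ["f", "l", "ay", "t", "s"],
}
trie = build_trie(LEXICON)
# all four words share a single f -> l -> ay path
assert set(trie) == {"f"} and set(trie["f"]) == {"l"}
```

Merging common prefixes in this way is also what keeps the search tree from exploding: the search pursues one shared theory for f l ay rather than four copies.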

Finally, we conducted two pilot studies exploring problems for which we believe the ANGIE-based framework will exhibit advantages. The first pilot study examines the ability to add new words to the recognition vocabulary in real time, that is, without requiring extensive retraining of the lexical models. We believe that, because of ANGIE's hierarchical structure, new words added to the vocabulary can share lexical subword structures with existing words in the vocabulary. For this study, we simulated the appearance of new words by artificially removing the city names that only appear in ATIS-3, that is, city names which did not appear in ATIS-2. These city names were then considered the new words in our system. For the baseline comparison, we added the words to a similarly reduced SUMMIT recognizer and assigned zero to their lexical arc weights in the pronunciation graph. In the ANGIE case, we allowed ANGIE to generalize probabilities learned from other words with similar word substructures. In both cases, the word level bigram model used a class bigram, with uniform probabilities distributed over all city names, including the simulated new words. Both the baseline and ANGIE systems achieved the same word error rate, 19.2%. This represents a slight degradation from a system trained with full knowledge of the simulated new words. Apparently, the lack of lexical training did not adversely impact recognition performance much with our set of simulated new words. It is unclear whether ANGIE would show an improvement over the baseline for a different choice of new words. We do note, however, that for the artificially reduced system, without the simulated new words in the vocabulary, the ANGIE-based system achieves a 31.2% error rate as compared to a 34.2% error rate for the baseline SUMMIT system, suggesting that ANGIE is more robust in the presence of unknown words.

For our other pilot study, we attempted to leverage the word substructure information provided by ANGIE for prosodic modelling. Our experiment, conducted in conjunction with our colleague Grace Chung, was to implement a hierarchical duration model based on the ANGIE parse tree and to incorporate the duration score into our recognition search process. We evaluated the duration model in the context of our ANGIE-based word-spotting system. Its inclusion increased the FOM from 89.3 to 91.6, leading us to conclude that the ANGIE subword structure information can indeed be used for improved prosodic modelling, minimally in terms of duration.

We believe that our work demonstrates the feasibility of using ANGIE as a competitive lexical modelling framework for various speech recognition systems. Our experience with word-spotting shows that ANGIE provides a platform where it is easy to alter subword constraints. Our success at NL integration for improved recognition suggests that a context-free framework has several advantages. Finally, our pilot study in prosodic modelling suggests that ANGIE's subword structuring information can be leveraged to provide improved performance.

References

[1] H. M. Meng, Phonological Parsing for Bi-directional Letter-to-Sound/Sound-to-Letter Generation, Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, June 1995.

[2] S. Seneff, R. Lau, and H. Meng, ANGIE: A New Framework for Speech Analysis Based on Morpho-Phonological Modelling, Proc. ICSLP '96, Philadelphia, PA, October 1996. (Available online at icslp96_angie.pdf)

[3] R. Lau and S. Seneff, Providing Sublexical Constraints for Word Spotting within the ANGIE Framework, Proc. Eurospeech '97, Rhodes, Greece, September 1997. (Available online at main.pdf)

[4] R. Lau, Subword Lexical Modelling for Speech Recognition, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1998.

Probabilistic Segmentation for Segment-Based Speech Recognition
Steven Lee

The objective of this research is to develop a high-quality, real-time probabilistic segmentation algorithm for use with SUMMIT, a segment-based speech recognition system [1]. Until recently, SUMMIT used a segmentation algorithm based on acoustic change. This algorithm was adequate, but produced segment graphs that were denser than necessary, because a low acoustic change threshold was needed to ensure that segment boundaries not marked by sharp acoustic change were also included. Recently, Chang developed an approach to segmentation that uses a Viterbi and a backwards A* search to produce a phonetic graph in the same manner as word graph production [2,3]. This algorithm achieved an 11.4% decrease in phonetic recognition error rate while hypothesizing half the number of segments of the acoustic segmentation algorithm.

While the results of this approach are promising, it has two drawbacks that keep it from widespread use in practical speech recognition systems. The first is that the algorithm cannot run in real-time, because it requires a complete forward Viterbi search followed by a backward A* search. The second is that the algorithm requires enormous computational power, since the search is performed at the frame level. This research seeks to develop a search algorithm that produces a segment network in a pipelined, left-to-right mode. It also aims to lower the computational requirements.

The approach being adopted in this research is to introduce a simplified search framework and to shrink the search space. The new search framework, a frame-based Viterbi search that does not utilize a segment graph, is attractive for probabilistic segmentation because of its simplicity and its relatively low computational requirements. Although work on using this search to produce a segment graph is ongoing, preliminary results using this search on phonetic recognition resulted in a competitive error rate of 30.3% [4]. Since recognition performance should be somewhat correlated with the quality of the segment graph produced, this is a promising result.

The size of the search space in probabilistic segmentation is bounded by time in one dimension and by the number of phonetic units in another dimension. Both dimensions can be shrunk to provide computational savings. This research will investigate shrinking the time dimension by using landmarks instead of frames. It will also investigate the use of broad classes to shrink the search space along the lexical dimension. The domains being used for this work are TIMIT and JUPITER [5], a telephone-based weather information domain.

References

[1] J. Glass, J. Chang, and M. McCandless, A Probabilistic Framework for Feature-based Speech Recognition, Proc. International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.

[2] J. Chang and J. Glass, Segmentation and Modeling in Segment-based Recognition, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[3] I. Hetherington, M. Phillips, J. Glass, and V. Zue, A* Word Network Search for Continuous Speech Recognition, Proc. European Conference on Speech Communication and Technology, Berlin, Germany, September 1993.

[4] S. Lee, Probabilistic Segmentation for Segment-based Speech Recognition, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.

[5] V. Zue, et al., From Interface to Content: Translingual Access and Delivery of On-line Information, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

A Model for Interactive Computation: Applications to Speech Research
Michael McCandless

Although interactive tools are extremely valuable for progress in speech research, the programming techniques required to implement them are often difficult to master and apply. There are numerous interface toolkits which facilitate implementation of the user interface, but these tools still require the programmer to build the tool's back end by hand. The goal of this research is to create a programming environment which simplifies the process of building interactive tools by automating the computational details of providing interactivity.

Interactive tools engage their users in a dialogue, effectively allowing the user to ask questions and receive answers. Questions are typically asked by interacting with the tool's interface via direct manipulation. I propose a set of metrics which may be used to measure the extent of a tool's interactivity: rapid response (does the tool answer the user's question as quickly as possible?); high coverage (is the user able to ask a wide range of questions?); adaptability (does the tool adapt to varying computation environments?); scalability (can the tool manage both large and small inputs?); pipelining (does the tool provide the answer in pieces over time for computations that take a long time?); and backgrounding (is the user able to ask other questions while an answer is being computed?). I refer to a tool which can meet these stringent requirements as a finely interactive tool. These dimensions provide metrics for measuring and comparing the interactivity of different tools.

Based on these requirements for interactivity, I have designed a declarative computation model for specifying and implementing interactive computation. In order to evaluate the effectiveness of the model, I have incorporated it into a speech toolkit called MUSE [1,2]. MUSE contains numerous components allowing a programmer to quickly construct finely interactive tools. MUSE is implemented in the Python programming language with extensions in C. A Python interface to the Tk widget set is used for interface design and layout.

The programmer specifies computation in MUSE differently than in existing imperative programming languages. Like existing languages, the programmer builds a MUSE program by applying functions to strongly-typed values. However, in MUSE, the programmer does not have detailed control over when the computations actually take place, nor over when and where intermediate results are stored; instead, the programmer declares the functional relationships among a collection of MUSE values. The MUSE system records these relationships, constructs a run-time acyclic dependency graph, and then chooses when to compute which values. MUSE's data-types also differ from those of existing programming languages. The specification for each data-type, for example a waveform, image or graph, includes provisions for incremental change: every data-type is allowed to change in certain ways. For example, images may change by replacing the set of pixels within a specified rectangular area. When a value changes at run-time, MUSE will consult the dependency graph, and will then take the necessary steps to bring all dependents of that value up to date with the new change. These unique properties of MUSE free the programmer from dealing with many of the complex computational aspects of providing interactivity.
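The flavor of this model can be conveyed with a small sketch, written here in plain Python with hypothetical names (MUSE itself is a full toolkit; this shows only lazy evaluation, caching, and change propagation over a dependency graph, not incremental changes to ranges of a value):

```python
# Declarative dataflow: the programmer states relationships among
# values; the runtime decides when to compute, caches results, and
# invalidates dependents when a source changes.
class Node:
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs
        self.cache, self.valid = None, False
        self.dependents = []
        for inp in inputs:
            inp.dependents.append(self)

    def value(self):
        # computed entirely on demand, then cached
        if not self.valid:
            self.cache = self.fn(*(i.value() for i in self.inputs))
            self.valid = True
        return self.cache

class Source(Node):
    def __init__(self, val):
        super().__init__(None)
        self.cache, self.valid = val, True

    def set(self, val):
        # a change notifies all dependents depth-first; nothing is
        # recomputed until someone asks for a value again
        self.cache = val
        stack = list(self.dependents)
        while stack:
            n = stack.pop()
            if n.valid:
                n.valid = False
                stack.extend(n.dependents)

wave = Source([1.0, 2.0, 3.0])
energy = Node(lambda w: sum(x * x for x in w), wave)
print(energy.value())  # 14.0, computed now and cached
wave.set([1.0, 1.0])
print(energy.value())  # 2.0, recomputed only because it was requested
```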

Figure 14. A screen shot of an interactive lexical access tool. The user is able to edit the phonetic transcription and, with each change, the word transcription is updated in real time to reflect the allowed word alignments according to the TIMIT pronunciation lexicon. The tool demonstrates the unique nature of MUSE's incremental computation model.

Because the programmer relinquishes control over the details of how values are computed, the MUSE run-time system must make such choices. While there are many ways to implement this, the technique used by MUSE is based on purely lazy evaluation plus caching. When values are changed, a synchronous depth-first search is performed, notifying all impacted values of the change. Values are computed entirely on demand, and are then cached as directed by the program. For example, if the user is looking at a spectrogram, only the portion of the image they are actually looking at will be computed, which requires a certain range of the STFT, which in turn requires only a certain range of the input waveform. This implementation choice affects all of the built-in functions: the implementation of these functions, in both Python and C, must force the evaluation of any inputs that they need, but only in response to their output being forced. Further, any incremental change on an input to a function must be propagated as an incremental change on the function's output.

In order to effectively test the interactivity of MUSE, I have added many necessary speech functions and data-types. The functions include waveform preemphasis, (short-time) Fourier transforms, linear-predictive analysis, cepstral analysis, energy, word-spotting lexical access, and mixture diagonal Gaussian training. The data-types include waveforms, spectra, graphs, tracks, time marks and cursors, and images. Each data-type has an associated cache which the programmer may use to easily control the extent of storage of intermediate results, as well as a visual function which translates the data-type into an appropriate image.
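The demand-driven behavior of the spectrogram example above can be made concrete with a second self-contained toy. The function names and the stand-in "transform" are inventions for illustration, not MUSE code:

# Demand-driven, range-restricted computation with caching, in the spirit
# of MUSE's lazy evaluation. Only the frames actually viewed are computed.

from functools import lru_cache

WAVEFORM = list(range(10000))            # stand-in for samples on disk

@lru_cache(maxsize=None)
def waveform_range(start, stop):
    # In a real tool this would read just these samples from storage.
    return tuple(WAVEFORM[start:stop])

@lru_cache(maxsize=None)
def stft_frame(i, frame=256, hop=128):
    # Forcing one "STFT" frame forces only the waveform range it covers.
    samples = waveform_range(i * hop, i * hop + frame)
    return sum(x * x for x in samples)   # toy stand-in for a Fourier transform

def spectrogram_view(first_frame, last_frame):
    # Only the frames the user is looking at get computed; lru_cache plays
    # the role of MUSE's per-value caches when the view scrolls back.
    return [stft_frame(i) for i in range(first_frame, last_frame)]

print(spectrogram_view(40, 44))          # computes (and caches) just 4 frames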

I have constructed four example tools which illustrate the unique capabilities of the MUSE toolkit. The first tool is a basic speech analysis tool showing a waveform, spectrogram, and transcription, which allows the user to modify the alignment of individual frames of the STFT by directly editing time marks, and then see the impact on the spectrogram image. The second tool displays three overlaid spectral slices (FFT, LPC, and cepstrum), and allows the user to change all aspects of the computation. The third tool illustrates the process of training a diagonal Gaussian mixture model on one-dimensional data, allowing the user to vary many of the parameters affecting the training process. The final tool is a lexical access tool, allowing the user to phonetically transcribe an utterance and then see the corresponding potential word matches. Figure 14 shows a screen shot of this tool. The properties of MUSE's incremental computation model are reflected in the high degree of interactivity each of these tools offers the user; MUSE's run-time model is able to effectively carry out the requirements of interactivity.

References

[1] M. McCandless and J. Glass, MUSE: A Scripting Language for the Development of Interactive Speech Analysis and Recognition Tools, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[2] M. McCandless, A Model for Interactive Computation: Applications to Speech Research, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.

Subword Approaches to Spoken Document Retrieval
Kenney Ng

As the amount of accessible data continues to grow, the need for automatic methods to process, organize, and analyze this data and present it in human-usable form has become increasingly important. Of particular interest is the problem of efficiently finding interesting pieces of information in the growing collections and streams of data. Much research has been done on the problem of selecting relevant items from large collections of text documents given a query or request from a user. Only recently has there been work addressing the retrieval of information from other media such as images, video, audio, and speech. Given the growing amounts of spoken language data, such as recorded speech messages and radio and television broadcasts, the development of automatic methods to index, organize, and retrieve spoken documents will become more important.

In our work, we are investigating the feasibility of using subword unit indexing terms for spoken document retrieval as an alternative to words generated by either keyword spotting or word recognition. The investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language, and the use of subword unit indexing terms allows for the detection of new user-specified query terms during retrieval.

We explore a range of subword unit indexing terms of varying complexity derived from phonetic transcriptions. The basic underlying unit is the phone; more and less complex units are derived by varying the level of detail and the sequence length of these units. Labels of the units range from specific phones to broad phonetic classes obtained via hierarchical clustering. Automatically derived fixed- and variable-length sequences ranging from one to six units long are examined. Sequences with and without overlap are also explored. In generating the subword units, each message or query is treated as one long phone sequence with no word or sentence boundary information.

The speech data used in this work consists of recorded FM radio broadcasts of the NPR Morning Edition news show. The training set for the speech recognizer consists of 2.5 hours of clean speech from 5 shows, while the development set consists of one hour of data from one show. The spoken document collection is made up of 12 hours of speech from 16 shows partitioned into 384 separate news stories. In addition, a set of 50 natural language text queries and associated relevance judgments on the message collection were created to support the retrieval experiments.

Phonetic recognition of the data is performed with the MIT SUMMIT speech recognizer. It is a probabilistic segment-based recognizer that uses context-independent segment and context-dependent boundary acoustic models. A two-pass search strategy is used during recognition: a forward Viterbi search is performed using a statistical bigram language model, followed by a backwards A* search using a higher-order statistical n-gram language model.

Information retrieval is done using a standard vector space approach. In this model, the documents and queries are represented as vectors where each component is an indexing term. The terms are weighted based on the term's occurrence statistics both within the document and across the collection. A normalized inner product similarity measure between document and query vectors is used to score and rank the documents during retrieval.
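The summary does not give code for generating these units, but overlapping fixed-length phone sequences, one of the unit types explored, can be pictured with a short sketch (the function name and separator are invented for illustration):

def subword_terms(phones, n=3, overlapping=True):
    """Fixed-length phone-sequence indexing terms from one long phone sequence.

    phones: list of phone labels for a whole message (no word boundaries).
    n: sequence length; this work examines lengths from one to six units.
    overlapping: hop by one phone if True, by n phones if False."""
    hop = 1 if overlapping else n
    return ["_".join(phones[i:i + n]) for i in range(0, len(phones) - n + 1, hop)]

print(subword_terms(["w", "eh", "dh", "axr", "f", "ao", "r"], n=3))
# ['w_eh_dh', 'eh_dh_axr', 'dh_axr_f', 'axr_f_ao', 'f_ao_r']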

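The vector space scoring described above can likewise be sketched in a few lines. The tf-idf style weighting below is one common instantiation of "occurrence statistics within the document and across the collection"; it is an assumption for illustration, not necessarily the exact weighting used in this work:

import math
from collections import Counter

def weight_vector(term_counts, doc_freq, num_docs):
    # tf * idf: emphasize terms frequent in the item, rare across the collection.
    return {t: tf * math.log(num_docs / doc_freq[t])
            for t, tf in term_counts.items() if doc_freq.get(t, 0) > 0}

def similarity(query_vec, doc_vec):
    # Normalized inner product (cosine) between query and document vectors.
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm = (math.sqrt(sum(w * w for w in query_vec.values())) *
            math.sqrt(sum(w * w for w in doc_vec.values())))
    return dot / norm if norm else 0.0

def retrieve(query_terms, docs_terms):
    """Rank documents (each a list of indexing terms) against one query."""
    num_docs = len(docs_terms)
    doc_freq = Counter(t for terms in docs_terms for t in set(terms))
    doc_vecs = [weight_vector(Counter(terms), doc_freq, num_docs)
                for terms in docs_terms]
    query_vec = weight_vector(Counter(query_terms), doc_freq, num_docs)
    scores = [similarity(query_vec, dv) for dv in doc_vecs]
    return sorted(range(num_docs), key=lambda i: -scores[i])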
We perform a series of experiments to measure the ability of the different subword units to support effective spoken document retrieval. A baseline text retrieval run is performed using word-level text transcriptions of the spoken documents and queries. This is equivalent to using a perfect word recognizer to transcribe the speech messages, followed by a full-text retrieval system. An upper bound on the performance of the different subword unit indexing terms is obtained by running retrieval experiments using phonetic expansions of the words in the messages and queries obtained via a pronunciation dictionary. We find that many of the subword unit indexing terms are able to capture enough information to perform effective retrieval. With the appropriate subword units it is possible to achieve performance comparable to that of text-based word units, provided the underlying phonetic units are recognized correctly.

We next examine the retrieval performance of the subword unit indexing terms derived from errorful phonetic transcriptions created by running the phonetic recognizer on the entire spoken document collection. From this experiment, we find that although performance is worse for all units when there are phonetic recognition errors, some subword units can still give reasonable performance even before the use of any error compensation techniques such as approximate term matching.

We then attempt to improve retrieval performance by exploring robust indexing and retrieval approaches which take into account, and try to compensate for, the speech recognition errors introduced into the spoken document collection. We look at two approaches. One involves modifying the query representation to include additional approximate match terms; the main idea is to include terms that are likely to be confused with the original query terms. The other is to modify the spoken document representations by expanding them to include high-scoring recognition alternatives; the goal is to increase the chance of including the correct hypothesis. We find that both approaches help improve retrieval performance.

Our results indicate that subword-based approaches to spoken document retrieval are feasible and merit further research. In terms of current and future work, we are expanding the corpus to include more speech for both recognizer training and the speech message collection; exploring ways to improve the performance of the phonetic recognizer; and investigating more sophisticated robust indexing and retrieval methods in an effort to improve retrieval performance when there are recognition errors.

References

[1] K. Ng and V. Zue, Subword Unit Representations for Spoken Document Retrieval, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[2] K. Ng and V. Zue, An Investigation of Subword Unit Representations for Spoken Document Retrieval, Proc. of the ACM SIGIR Conference, p. 139, Philadelphia, PA, July 1997.

A Semi-Automatic System for the Syllabification and Stress Assignment of Large Lexicons
Aarati Parmar

Sub-word modelling, which includes morphology, syllabification, stress, and phonemes, has been shown to improve performance in certain speech applications [1]. This observation has motivated us to attempt to formally define a convention for a set of syllable-sized units, intended to capture these sub-word level realizations in words of the English language, through a two-tiered approach. The assumption is that words can be represented as sequences of units we call morphs, which capture explicitly both the pronunciation and the orthography. Each morph unit has a carefully constructed label and a lexical entry that provides its canonic phonemic realization. Each word is entered into a word lexicon decomposed into its appropriate morph sequence. Thus, for example, the word contentiously would be represented as con- ten+ -tious =ly, with the markers -, +, and = coding for morphological categories such as prefix, stressed root, and derivational/inflectional suffix. It is our hope that all words of English can be represented in terms of a reasonably small, closed set of these morph units.

This thesis introduces a new semi-automatic procedure for acquiring a representation of a large corpus of words in terms of morphs. Morph transcription, as we have defined it, is a considerably more difficult task than phonetic or phonemic transcription, simply because constraints have to be satisfied on more than one level. Morphs with similar spellings but different pronunciations must be distinguished through selected capital letters, as in the examples com+ (/k!/ /aa+/ /m/) in combat and com+ (/k!/ /ah+/ /m/) in comfort. The letters of the morph spellings for a given word must, if lowercased and concatenated, realize a correct spelling of the word. Syllabification must be correctly marked, and the phonemic transcription obtained by replacing the morph units with their phonemic realizations must be accurate.

We would like to know if our representation is extensible, and if it is possible to automatically or semi-automatically extract these sub-lexical units from large corpora of words with associated phonetic transcriptions. We have therefore devised a procedure intended to propose morph decompositions accurately and efficiently. We have evaluated the procedure on two corpora, and have also assessed how appropriate the morph concept is as a basic unit for capturing sub-lexical constraints.

We used the ANGIE formalism to generate and test our morphs. ANGIE is a system that can parse either spellings or phonetics into a probabilistic hierarchical framework. We decided to develop our procedure based on a medium-sized corpus known as TIMIT. We began with a grammar that had been developed and trained on a corpus we call ABH (a combination including the ATIS vocabulary, a subset of the 10,000 most frequent words of the Brown corpus, and the Harvard List lexicon). We then applied the knowledge we had gained from ABH, both with and without that derived from the TIMIT experiment, to the much larger COMLEX lexicon (omitting proper nouns and abbreviations). In this way we tested how well a set of morphs derived from a seed lexicon can be applied to a much larger set of some 30,000 words. If morphs are a good representation, then good coverage should be attainable.
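The orthographic constraint described above is mechanical enough to check in a couple of lines; the helper below is a toy illustration (its name is invented), using the marker conventions of this work:

def spelling_ok(word, morphs):
    """Check the orthographic constraint on a morph decomposition: stripping
    the markers (-, +, =) and lowercasing the morph spellings must
    reconstruct the spelling of the word exactly."""
    letters = "".join(m.strip("-+=").lower() for m in morphs)
    return letters == word.lower()

print(spelling_ok("contentiously", ["con-", "ten+", "-tious", "=ly"]))  # True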
Our procedure was to first parse, in recognition mode, the letters of all the new words in a corpus to be absorbed, using a letter-terminal grammar trained on the seed ABH corpus. This yielded a set of hypothesized phoneme and/or morph sequences for each word, which could then be verified or rejected by parsing in phone mode, using the phonetic transcription provided by the corpus as established phone terminals, along with a phone-to-phoneme grammar that defines the mappings from the conventions of the corpus to ANGIE's conventions. By enforcing morph constraints as well, we obtained more constraint than if we had used the phonemic knowledge alone.
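Schematically, this is a two-stage generate-and-verify loop. In the sketch below all names are hypothetical stand-ins (the real system runs ANGIE parses, not these stubs):

def absorb_word(spelling, corpus_phones, letter_parse, phone_parse):
    """Two-stage morph acquisition: hypothesize from letters, verify with phones.

    letter_parse(spelling) -> candidate morph decompositions (stage 1);
    phone_parse(morphs, corpus_phones) -> True if the decomposition's phonemic
    realization is consistent with the corpus transcription (stage 2)."""
    candidates = letter_parse(spelling)
    return [m for m in candidates if phone_parse(m, corpus_phones)]

On this reading, coverage is the fraction of words for which the returned list is non-empty, and the degree of constraint is its average length.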

We have some encouraging signs that our set of morphs is large enough to encompass most or all English words, particularly if we allow novel stressed roots to be invented by decomposing them into a confirmed onset and rhyme. In our experiments, even without invented stressed roots, we determined that coverage of TIMIT was about 89%, and for COMLEX it was about 94%. The parse coverage of our procedure is quite good, considering the large size of the COMLEX corpus. The accuracy of the morphological decompositions is reasonable as well. According to an informal evaluation, morphological decompositions of words in TIMIT that pass through both letter and phone parsing steps have a 78% probability of matching the expert transcription exactly. Of course, this metric does not take into account alternate decompositions which may also be correct, or which may be more consistent with one another than the human-generated ones.

We performed an analysis and comparison of the experiments on TIMIT and COMLEX, covering the degree of constraint, hand-written versus automatic rules, and the consistency of morphological decompositions. Constraint can be measured by the average number of alternate morphological decompositions per word. The average number of decompositions generated by the letter parsing step is about three, for both TIMIT and COMLEX. After parsing with phones, the figure drops to 1.1 for TIMIT and to 1.7 for COMLEX. Automatically derived rules (for the mapping from ANGIE's phoneme conventions to the phonetic conventions of the corpus) provide a quick alternative to hand-written rules, with greater coverage, but at the price of some performance loss. Morphological decompositions produced by our procedure also appear to be self-consistent.

We have also developed a new analysis tool to simplify the task of labelling words for morph transcriptions. This tool aids the transcriber by providing easy access to many different sources of knowledge via a sophisticated graphical interface. It can be used to efficiently repair errors made by the automatic parsing procedure.

A significant outcome of this thesis is a much larger inventory of the possible morphs of English, and a much larger lexicon of words decomposed into these morph units. These resources should serve us well in future experiments in letter-to-sound/sound-to-letter generation, for the automatic acquisition of pronunciations for new words. They should also be useful for the automatic acquisition of vocabularies for speech recognition tasks using ANGIE, and for other experiments, e.g., in prosodic analysis, where syllable decomposition may be important.

Reference

[1] R. Lau and S. Seneff, Providing Sublexical Constraints for Word Spotting within the ANGIE Framework, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

A Segment-Based Speaker Verification System Using SUMMIT
Sridevi Sarma

This thesis describes the development of a segment-based speaker verification system and explores two computationally efficient techniques. Our investigation is motivated by past observations that speaker-specific cues may manifest themselves differently depending on the manner of articulation of the phonemes. By treating the speech signal as a concatenation of phone-sized units, one may be able to capitalize on measurements for such units more readily. A potential side benefit of such an approach is that one may be able to achieve good performance with unit (i.e., phonetic inventory) and feature set sizes that are smaller than what would normally be required for a frame-based system, thus deriving the benefit of reduced computation.

To carry out our investigation, we started with the segment-based speech recognition system developed in our group, called SUMMIT [1,2], and modified it to suit our needs. The speech signal was first transformed into a hierarchical segment network using frame-based measurements. Next, acoustic models for each speaker were developed for a small set of six broad phoneme classes. The models represented feature statistics with diagonal Gaussians, which characterized the principal components of the feature set. The feature vector included averages of MFCCs 1-14, plus three prosodic measurements: energy, fundamental frequency (F0), and duration.

To facilitate a comparison with previously reported work [3,4,5], our speaker verification experiments were carried out using two sets of 100 speakers from the TIMIT corpus. Each speaker-specific model was developed from the eight SI and SX sentences. Verification was performed using the two SA sentences common to all speakers. To classify a speaker, a Viterbi forced alignment was determined for each test utterance, and the forced alignment score of the purported speaker was compared with those obtained with the models of the speaker's competitors. These scores were then rank ordered, and the user was accepted if his/her model's score was within the top N of 100 scores, where N is a parameter we varied in our experiments. To test for false acceptance, we used every other speaker in the system as an impostor.

Ideally, the purported speaker's score should be compared to the scores of every other system user. However, computation becomes expensive as more users are added to the system. To reduce the computation, we adopted a procedure in which the score for the purported speaker is compared only to the scores of a cohort set consisting of a small set of acoustically similar speakers. The acceptance criterion was the same rank-ordering test, with the top N now taken over the cohort scores, and only the members of a speaker's cohort set were used as impostors.

In addition to using cohort normalization to reduce computation, we determined the size and content of the feature vector through a greedy algorithm optimized on overall speaker verification performance. Fewer features means fewer parameters to estimate during training and fewer scores to compute during testing. We were able to achieve a performance of 0% equal error rate (EER) on clean data and 8.36% EER on noisy telephone data, with a simple system design.
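The rank-based decision rule with cohort normalization described above amounts to a few lines of logic. The sketch below is illustrative only (the function name is invented, and the numbers stand in for Viterbi forced-alignment scores):

def accept(claimed_score, cohort_scores, top_n):
    """Accept the purported speaker if his/her model's forced-alignment score
    ranks within the top N among the cohort's scores for the same utterance."""
    ranked = sorted(cohort_scores + [claimed_score], reverse=True)
    return ranked.index(claimed_score) < top_n

# e.g., a cohort of five acoustically similar speakers, accept if in the top 2:
print(accept(-41.2, [-39.8, -44.0, -42.5, -47.1, -43.3], top_n=2))  # True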

Thus we show that a segment-based approach to speaker verification is viable, competitive, and efficient. Cohort normalization and conducting a feature search to reduce dimensions only minimally affect performance, and both are useful when computation is prohibitive.

References

[1] V. Zue, J. Glass, M. Phillips, and S. Seneff, Acoustic Segmentation and Phonetic Classification in the SUMMIT Speech Recognition System, Proc. of the International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989.

[2] V. Zue, J. Glass, M. Phillips, and S. Seneff, The SUMMIT Speech Recognition System: Phonological Modeling and Lexical Access, Proc. of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, April 1990.

[3] L. Lamel and J.L. Gauvain, A Phone-based Approach to Non-linguistic Speech Feature Identification, Computer Speech and Language, 1995.

[4] Y. Bennani, Speaker Identification Through Modular Connectionist Architecture: Evaluation on the TIMIT Database, Proc. of the International Conference on Spoken Language Processing, Banff, Alberta, 1992.

[5] D. Reynolds, Speaker Identification and Verification Using Gaussian Mixture Speaker Models, Speech Communication, Vol. 17, No. 1, August 1995.

Context-Dependent Modelling in a Segment-Based Speech Recognition System
Benjamin Serridge

Modern speech recognition systems typically classify speech into sub-word units that loosely correspond to phonemes. These phonetic units are, at least in theory, independent of task and vocabulary, and because they constitute a small set, each one can be well trained with a reasonable amount of data. In practice, however, the acoustic realization of a phoneme varies greatly depending on its context, and speech recognition systems can benefit by choosing units that more explicitly model such contextual effects.

The goal of this research was to explore various strategies for incorporating contextual information into a segment-based speech recognition system, while maintaining computational costs at a level acceptable for implementation in a real-time system. The latter was achieved by using context-independent models in the search, while context-dependent models are reserved for re-scoring the hypotheses proposed by the context-independent system. Within this framework, several types of context-dependent sub-word units were evaluated, including word-dependent, biphone, and triphone phonetic units. In each case, deleted interpolation was used to compensate for the lack of training data for the models (a sketch of the idea follows the references below). Other types of context-dependent modeling, such as context-dependent boundary modelling and offset modelling, were also used successfully in the re-scoring pass.

The evaluation of the system was performed using the Resource Management task. Context-dependent segment models were able to reduce the error rate of the context-independent system by more than twenty percent, and context-dependent boundary models were able to reduce the word error rate by more than a third. A straightforward combination of context-dependent segment models and boundary models leads to further reductions in error rate.

So that it can be incorporated easily into existing and future systems, the code for re-sorting N-best lists has been implemented as an object in SAPPHIRE [2], a framework for specifying the configuration of a speech recognition system using a scripting language. It is currently being tested on JUPITER [3], a real-time telephone-based weather information system under development at SLS.

References

[1] B. Serridge, Context-Dependent Modeling in a Segment-Based Speech Recognition System, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, August 1997.

[2] L. Hetherington and M. McCandless, SAPPHIRE: An Extensible Speech Analysis and Recognition Tool Based on Tcl/Tk, Proc. International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.

[3] V. Zue, et al., From Interface to Content: Translingual Access and Delivery of On-line Information, Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.
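Deleted interpolation, mentioned above, blends a sparse context-dependent estimate with a robust context-independent one. The sketch below shows the general shape of the idea only; the names and the simple count-based weight are illustrative assumptions, not the exact recipe used in the thesis (which would estimate the weights on held-out data):

def interpolated_prob(cd_prob, ci_prob, lam):
    """Blend context-dependent and context-independent estimates:
    p = lam * p_cd + (1 - lam) * p_ci."""
    return lam * cd_prob + (1.0 - lam) * ci_prob

def estimate_lambda(cd_count, k=50.0):
    """Toy weight: trust the context-dependent model more as its training
    count grows. Proper deleted interpolation estimates lam on held-out
    ("deleted") data; this count-based form is only a stand-in."""
    return cd_count / (cd_count + k)

lam = estimate_lambda(cd_count=120)        # well-trained triphone: lam ~ 0.71
print(interpolated_prob(0.02, 0.005, lam)) # leans toward the triphone estimate

In the re-scoring framework described above, such blended scores would replace the raw context-dependent scores when re-sorting the N-best list.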

Toward the Automatic Transcription of General Audio Data
Michelle S. Spina

Recently, ASR research has broadened its scope to include the transcription of general audio data (GAD), from sources such as radio or television broadcasts. This shift in research focus is largely driven by the growing need to extend content-based information retrieval from text to speech. However, GAD pose new challenges to present-day ASR technology because they often contain extemporaneously generated, and therefore disfluent, speech, with words drawn from a very large vocabulary, and they are usually recorded in varying acoustic environments. Also, the voices of multiple speakers often interleave and overlap with one another or with music and other sounds. Since the performance of ASR systems can vary a great deal depending on speaker, microphone, recording conditions, and transmission channel, we have argued that the transcription of GAD would benefit from a preprocessing step that first segmented the signal into acoustically homogeneous chunks [3]. Such preprocessing would enable the transcription system to utilize the appropriate acoustic models during recognition.

The goal of the research presented here was to investigate strategies for training a phonetic recognition system for GAD. We have chosen to focus on the Morning Edition (ME) news program broadcast by National Public Radio (NPR). NPR-ME consists of news reports from national and local studio anchors as well as reporters in the field, special interest editorials, and musical segments. The analysis presented here is based on a collection of six hours of recordings from November 1996 to January 1997. The six one-hour shows were automatically split into manageable sized waveform files at silence breaks. In addition, if any of the resulting waveform files contained multiple sound environments (e.g., a segment of music followed by a segment of speech), they were further split at these boundaries. Therefore, each file was homogeneous with respect to sound environment. Orthographies and phonetic alignments were generated for each of the files using orthographic transcriptions of the data and a forced Viterbi search.

Seven categories were used to characterize the files. These categories were described in our previous work [3], and are briefly reviewed here: 1) clean speech: wideband (8 kHz) speech from anchors and reporters, recorded in the studio; 2) music speech: speech with music in the background; 3) noisy speech: speech with background noise; 4) field speech: telephone-bandwidth (4 kHz) speech from field reporters; 5) music; 6) silence; and 7) garbage, which accounted for anything that did not fall into one of the other six categories.

In [3], we described some preliminary analyses and experiments that we had conducted concerning the transcription of this data. For the NPR-ME corpus, we were able to achieve better than 80% classification accuracy for these seven sound classes on unseen data, using relatively straightforward acoustic measurements and pattern classification techniques. A speech/non-speech classifier achieved an accuracy of nearly 94%. The level of performance required of such a classifier clearly depends on the ways in which it will serve as an intelligent front end to a speech recognition system. The experiments done for this work attempt to determine whether such a preprocessor is necessary, and if so, what level of performance is required of the sound segmentation.
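Only the outline of the classifier is given above; its general shape can be sketched as one trained scorer per sound class with a maximum-likelihood decision. Everything named below is a stand-in for illustration, not the actual system:

# Toy sound-environment classifier for homogeneous waveform files.
# The seven labels come from the text; the models and scorer are stand-ins.

CLASSES = ["clean speech", "music speech", "noisy speech",
           "field speech", "music", "silence", "garbage"]

def classify_file(features, models, log_likelihood):
    """features: per-frame acoustic feature vectors for one homogeneous file.
    models: dict mapping class name -> trained model (e.g., a Gaussian mixture).
    log_likelihood(model, features) -> total log likelihood (stand-in)."""
    return max(CLASSES, key=lambda c: log_likelihood(models[c], features))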

For the development of the phonetic recognition system, 4.25 hours of the NPR-ME data were used for system training, and the remaining hour was used for system test. Acoustic models were built using the TIMIT 61-label set. Results, expressed as phonetic recognition error rates, are collapsed down to the 39 labels typically used by others to report recognition results. The SUMMIT segment-based speech recognizer developed by our group was used for these experiments. The feature vector for each segment consisted of MFCC and energy averages over segment thirds, as well as two derivatives computed at segment boundaries. Segment duration was also included. Mixtures of up to 50 diagonal Gaussians were used to model the phone distributions on the training data. For simplicity, only context-independent models were used. The language model used in all experiments was a phone bigram based on over four hours of training data. This particular configuration of SUMMIT achieved an error rate of 37.1% when trained and tested on TIMIT.

We conducted experiments to determine the trade-offs between using a large amount of data recorded in a variety of speaking environments (a multi-style training approach) and a smaller amount of high-quality data, if a single recognizer system was to be used to recognize all four different types of speech material present in NPR-ME. We found that the multi-style approach yielded an overall error rate of 39.2%, with the lowest error rate arising from clean speech (33.2%) and the highest from field speech (50.4%). Training the system with only the clean, wideband speech material found in the training set yielded comparable results, with an overall error rate of 38.8%. However, the multi-style approach utilized nearly 1.7 times the amount of data for training the acoustic models. To perform a fair comparison between the two approaches, we trained a multi-style system with an amount of training data equivalent to that of the clean speech system. We found that this training approach degraded our results to an overall error rate of 41.1%, an increase of nearly 3%. This result indicates that it is advantageous to use only clean, wideband speech material for acoustic model training when data and computation availability becomes an issue.

In addition to the single recognizer system described above, we also explored the use of a multiple recognizer system for the phonetic recognition of NPR-ME, with one recognizer for each type of speech material. This environment-specific approach involves training a separate set of models for each speaking environment, and using the appropriate models for testing. We used the sound classification system described in [3] as a preprocessor to classify each test utterance as one of the four speech environments. The environment-specific model chosen by the automatic classifier for each utterance was then used to perform the phonetic recognition. This resulted in an overall error rate of 38.3%, which is slightly better than the best single recognizer result.
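The environment-specific setup just described, classify first and then recognize with matching models, reduces to a small dispatch step at test time. A sketch with hypothetical names:

def transcribe(utterance, classify_environment, recognizers):
    """Two-stage recognition for general audio data: an environment classifier
    picks which environment-specific recognizer to run.

    classify_environment(utterance) -> one of the four speech classes;
    recognizers: dict mapping class name -> phonetic recognizer (stand-ins)."""
    environment = classify_environment(utterance)   # e.g., "field speech"
    return recognizers[environment].recognize(utterance)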

In all of the experiments conducted, we found that the field speech environment consistently showed the highest phonetic recognition error rates. In an attempt to improve recognition performance on the field speech, we bandlimited the training data by restricting our analysis to the frequency range of 133 Hz to 4 kHz. By bandlimiting the clean speech training data in this way, we were able to lower the recognition error rate on the field speech data to 46.9%. Using the bandlimited clean speech models in the multiple recognizer system for utterances classified as field speech, the overall error rate becomes 37.9%, which is 2.3% better than the best single recognizer result.

In future work in this area, we intend to concentrate on improving the phonetic recognition results in the clean speech environment, and to investigate how the recognition of GAD compares to other automatic speech recognition tasks.

References

[1] J.L. Gauvain, L. Lamel, and M. Adda-Decker, Acoustic Modelling in the LIMSI Nov96 Hub4 System, Proc. of the DARPA Speech Recognition Workshop, February 1997.

[2] R. Schwartz, H. Jin, F. Kubala, and S. Matsoukas, Modeling those F-conditions - Or not, Proc. of the DARPA Speech Recognition Workshop, February 1997.

[3] M.S. Spina and V.W. Zue, Automatic Transcription of General Audio Data: Preliminary Analysis, Proc. of the International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.

[4] M.S. Spina and V.W. Zue, Automatic Transcription of General Audio Data: Effect of Environment Segmentation on Phonetic Recognition, Proc. of the European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

Porting the GALAXY System to Mandarin Chinese
Chao Wang

The GALAXY system is a human-computer conversational system providing a spoken language interface for accessing on-line information. It was initially implemented for English in travel-related domains, including air travel, local city navigation, and weather. One of the design goals of the GALAXY architecture was to accommodate multiple languages in a common framework. This thesis concerns the development of YINHE, a Mandarin Chinese version of the GALAXY system [1,2]. Acoustic models, language models, vocabularies, and linguistic rules for Mandarin speech recognition, language understanding, and language generation have been developed; large amounts of domain-specific Mandarin speech data have been collected from native speakers for system training; and issues that are specific to Chinese have been addressed to make the system core more language-independent. Figure 15 shows the system operating in Chinese. The user communicates with the system in spoken Mandarin, and the system displays responses in Chinese ideographs, along with maps, etc. In the following, data collection, the development of the speech recognition, understanding, and generation components, and system evaluation will be described in more detail.

Both read and spontaneous speech have been collected from native speakers of Mandarin Chinese. Spontaneous speech data were collected using a simulated environment based on the existing English GALAXY system. The data were used for training both acoustic and language models for recognition, and for deriving and training a grammar for language understanding. In addition, a significant amount of read speech data was collected through our Web data collection facility. It is easier to collect read data in large amounts, and such data are very valuable for acoustic training due to the phone-line diversity of randomly distributed callers. We use pinyin, enhanced with tones, for the Chinese representation in our transcriptions to simplify the input task.

Figure 15. An example of a dialogue exchange between YINHE and a user.
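The tone-enhanced pinyin representation mentioned above can be pictured with a small parser. The transcription string and helper below are illustrative only, assuming the common convention of a trailing tone digit per syllable; the thesis's exact transcription conventions are not given in this summary:

def parse_pinyin(transcription):
    """Split a tone-enhanced pinyin transcription into (syllable, tone) pairs,
    assuming each syllable carries a trailing tone digit (1-5)."""
    pairs = []
    for token in transcription.split():
        if token[-1].isdigit():
            pairs.append((token[:-1], int(token[-1])))
        else:
            pairs.append((token, None))   # tone left unmarked
    return pairs

print(parse_pinyin("jin1 tian1 tian1 qi4 zen3 me5 yang4"))
# [('jin', 1), ('tian', 1), ('tian', 1), ('qi', 4), ('zen', 3), ('me', 5), ('yang', 4)]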


More information

Corpus Linguistics (L615)

Corpus Linguistics (L615) (L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives

More information

AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282)

AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282) B. PALTRIDGE, DISCOURSE ANALYSIS: AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC. 2012. PP. VI, 282) Review by Glenda Shopen _ This book is a revised edition of the author s 2006 introductory

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Mathematics subject curriculum

Mathematics subject curriculum Mathematics subject curriculum Dette er ei omsetjing av den fastsette læreplanteksten. Læreplanen er fastsett på Nynorsk Established as a Regulation by the Ministry of Education and Research on 24 June

More information

Language Acquisition Chart

Language Acquisition Chart Language Acquisition Chart This chart was designed to help teachers better understand the process of second language acquisition. Please use this chart as a resource for learning more about the way people

More information

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form Orthographic Form 1 Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form The development and testing of word-retrieval treatments for aphasia has generally focused

More information

First Grade Curriculum Highlights: In alignment with the Common Core Standards

First Grade Curriculum Highlights: In alignment with the Common Core Standards First Grade Curriculum Highlights: In alignment with the Common Core Standards ENGLISH LANGUAGE ARTS Foundational Skills Print Concepts Demonstrate understanding of the organization and basic features

More information

Beyond the Pipeline: Discrete Optimization in NLP

Beyond the Pipeline: Discrete Optimization in NLP Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We

More information

Phonological Processing for Urdu Text to Speech System

Phonological Processing for Urdu Text to Speech System Phonological Processing for Urdu Text to Speech System Sarmad Hussain Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, B Block, Faisal Town, Lahore,

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Evidence for Reliability, Validity and Learning Effectiveness

Evidence for Reliability, Validity and Learning Effectiveness PEARSON EDUCATION Evidence for Reliability, Validity and Learning Effectiveness Introduction Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies

More information

BENCHMARK TREND COMPARISON REPORT:

BENCHMARK TREND COMPARISON REPORT: National Survey of Student Engagement (NSSE) BENCHMARK TREND COMPARISON REPORT: CARNEGIE PEER INSTITUTIONS, 2003-2011 PREPARED BY: ANGEL A. SANCHEZ, DIRECTOR KELLI PAYNE, ADMINISTRATIVE ANALYST/ SPECIALIST

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 2aSC: Linking Perception and Production

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES MODELING IMPROVED AMHARIC SYLLBIFICATION ALGORITHM

ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES MODELING IMPROVED AMHARIC SYLLBIFICATION ALGORITHM ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES MODELING IMPROVED AMHARIC SYLLBIFICATION ALGORITHM BY NIRAYO HAILU GEBREEGZIABHER A THESIS SUBMITED TO THE SCHOOL OF GRADUATE STUDIES OF ADDIS ABABA UNIVERSITY

More information

Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games

Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games David B. Christian, Mark O. Riedl and R. Michael Young Liquid Narrative Group Computer Science Department

More information

Mathematics Scoring Guide for Sample Test 2005

Mathematics Scoring Guide for Sample Test 2005 Mathematics Scoring Guide for Sample Test 2005 Grade 4 Contents Strand and Performance Indicator Map with Answer Key...................... 2 Holistic Rubrics.......................................................

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

Medical Complexity: A Pragmatic Theory

Medical Complexity: A Pragmatic Theory http://eoimages.gsfc.nasa.gov/images/imagerecords/57000/57747/cloud_combined_2048.jpg Medical Complexity: A Pragmatic Theory Chris Feudtner, MD PhD MPH The Children s Hospital of Philadelphia Main Thesis

More information

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District Report Submitted June 20, 2012, to Willis D. Hawley, Ph.D., Special

More information

Effect of Word Complexity on L2 Vocabulary Learning

Effect of Word Complexity on L2 Vocabulary Learning Effect of Word Complexity on L2 Vocabulary Learning Kevin Dela Rosa Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA kdelaros@cs.cmu.edu Maxine Eskenazi Language

More information

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011 CAAP Content Analysis Report Institution Code: 911 Institution Type: 4-Year Normative Group: 4-year Colleges Introduction This report provides information intended to help postsecondary institutions better

More information

A Note on Structuring Employability Skills for Accounting Students

A Note on Structuring Employability Skills for Accounting Students A Note on Structuring Employability Skills for Accounting Students Jon Warwick and Anna Howard School of Business, London South Bank University Correspondence Address Jon Warwick, School of Business, London

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

More information

re An Interactive web based tool for sorting textbook images prior to adaptation to accessible format: Year 1 Final Report

re An Interactive web based tool for sorting textbook images prior to adaptation to accessible format: Year 1 Final Report to Anh Bui, DIAGRAM Center from Steve Landau, Touch Graphics, Inc. re An Interactive web based tool for sorting textbook images prior to adaptation to accessible format: Year 1 Final Report date 8 May

More information

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING Yong Sun, a * Colin Fidge b and Lin Ma a a CRC for Integrated Engineering Asset Management, School of Engineering Systems, Queensland

More information

Review in ICAME Journal, Volume 38, 2014, DOI: /icame

Review in ICAME Journal, Volume 38, 2014, DOI: /icame Review in ICAME Journal, Volume 38, 2014, DOI: 10.2478/icame-2014-0012 Gaëtanelle Gilquin and Sylvie De Cock (eds.). Errors and disfluencies in spoken corpora. Amsterdam: John Benjamins. 2013. 172 pp.

More information

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Jana Kitzmann and Dirk Schiereck, Endowed Chair for Banking and Finance, EUROPEAN BUSINESS SCHOOL, International

More information