Analysis of Importance of the prosodic Features for Automatic Sentence Modality Recognition in French in real Conditions

PAVEL KRÁL 1, JANA KLEČKOVÁ 1, CHRISTOPHE CERISARA 2
1 Dept. Informatics & Computer Science, University of West Bohemia, Plzeň, CZECH REPUBLIC
2 LORIA UMR 7503, Vandoeuvre-les-Nancy, FRANCE

Abstract: - This paper measures the importance of prosodic features for automatic sentence modality recognition in French in real conditions. We first analyse the subjectivity of manual corpus labeling. We then report the results of automatic sentence modality recognition using only two prosodic features: fundamental frequency (F0) and energy. The global accuracy (ACC) is not sufficient for our application: animating a talking head [1] for deaf and hearing-impaired children using information about the sentence type. Next, we analyse the corpus to explain these results. We conclude that prosodic features alone are sufficient only for detecting prosodic questions with an accuracy greater than 80 %. Recognizing the other modalities with an accuracy over 80 % requires additional information, such as a language model or semantics.

Key-Words: - prosody, fundamental frequency (F0), energy, automatic sentence modality recognition (ASMR), modal corpus.

1 Introduction

The main objective of this work is to analyse how much the basic prosodic features, fundamental frequency (F0) and energy, contribute to automatic sentence modality recognition in French in real conditions. "Real conditions" means spontaneous speech from the French broadcast news evaluation. This study is part of the development of an application that helps deaf and hearing-impaired children understand speech better and integrate into classrooms with normal-hearing children. Classic automatic speech recognition systems transform the speech signal into a sequence of words, which forms the sentence.
This is not sufficient for conversation, because the information about the sentence type is lost. Our study aims to add this information to the system.

2 Short review of modality recognition approaches

The basic rules concerning the prosody of French sentence modality are summarized in [2]:
Declarative sentence: small decrease of melody,
Imperative sentence: important decrease of melody,
Interrogative sentence: increase of melody,
Grammar interrogative sentence: neutral intonation.

Another view of French sentence modalities [3] distinguishes only two classes: declarative and interrogative. The imperative sentence then appears as a variant of the declarative sentence.

Very few papers have paid attention to sentence type recognition in French, but many more studies address other languages, particularly English. In the published works, the following features are used: F0 contour in [4] for German, F0 and energy in [5] for German and English, F0 and energy in [6] for Czech, F0 and duration of the ending suffix in [7] for standard Korean. Another work [8] investigates many other prosodic attributes that are mostly derived from F0, energy and duration, for example the maximum, minimum, mean and standard deviation of F0, the mean and standard deviation of energy, the number of frames in the utterance and the number of frames with F0. The features are computed on the whole sentence and also on the last 200 ms of each sentence. The authors conclude that the end of the sentence is the most important part for modality recognition.

In the literature, the following classification methods have been tested and compared for sentence type recognition: Neural Network (NN) [4, 6, 9], Hidden Markov Models (HMMs) [8] and Classification and Regression Trees (CART) [8, 9]. The error rate is comparable between these classifiers.

3 French modal corpus

We used the ESTER corpus [10], which is used in the French broadcast news evaluation. This corpus was not designed for sentence modality recognition, so we decided to re-label it. We use the three punctuation marks ? . ! to extract from the raw ESTER corpus a set of sentences that belong to each category. This first modal corpus (hereafter called the original corpus) contains 18499 sentences (15619 declarations, 1339 exclamations and 1541 questions) for training and 895 sentences (581 declarations, 170 exclamations and 144 questions) for testing. Because of this automatic extraction, the context of the sentences is lost.

3.1 Manual re-labeling of the corpus

Manually labeling sentence modality without context is very subjective. We discovered that different persons often give different labels to the same sentence. In this part, we study how the overlap of the assigned sentence types depends on the number of listeners. For the following analysis, we concentrate only on the training part of the corpus.
We chose the following classes for the manual re-labeling of the corpus, according to our application:
D: declaration,
E: exclamation,
G: grammar question (the listener is able to distinguish this sentence only by its grammatical structure),
Q: prosodic question (the listener is able to distinguish this class mainly by intonation),
X: the listener is not able to determine the type of this sentence,
E: errors (the sentence contains noise, music, two people speaking at once, etc., so it is considered an error).

Two types of questions are chosen in relation to our prosodic module. We suppose that the prosodic module will not be able to detect grammar questions, because their prosodic information may be close to that of affirmative sentences.

We randomly chose 400 questions from the original corpus, and four listeners re-labeled them. Table 1 shows the results of the labeling for groups of two, three and four listeners. We can observe that the labeling is really very subjective. Some sentences, initially labeled as questions, are now labeled as declarations, errors or sentences whose type is very difficult to determine. The number of exclamations is very small and two labelers find none of them. Table 1 does not show all combinations of listeners, because the analysis is the same. We can summarize: as the number of listeners increases, the number of common sentences decreases.
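The overlap counts reported in table 1 can be read as the number of sentences to which every listener in a group gives the same label. The following is a minimal sketch of that computation, assuming each listener's labels are stored as a list in sentence order. The function name `common_counts` and the toy labels are hypothetical and are not the data of this study.

```python
from collections import Counter


def common_counts(*labelings):
    """Count, per class, the sentences that all listeners labeled identically."""
    if not labelings:
        return Counter()
    n = len(labelings[0])
    if any(len(lab) != n for lab in labelings):
        raise ValueError("all listeners must label the same sentences")
    counts = Counter()
    for labels in zip(*labelings):
        # A sentence is "common" only if every listener chose the same class.
        if len(set(labels)) == 1:
            counts[labels[0]] += 1
    return counts


# Toy example (hypothetical labels for six sentences, three listeners).
l1 = ["Q", "G", "D", "Q", "X", "G"]
l2 = ["Q", "G", "D", "D", "X", "Q"]
l3 = ["Q", "D", "D", "D", "Q", "Q"]

two = common_counts(l1, l2)        # agreement between two listeners
three = common_counts(l1, l2, l3)  # agreement between three listeners
```

In this toy data, adding the third listener can only keep or reduce each common count, which mirrors the trend summarized above.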
L     D    Do   E    G    Go   Q    Qo   X    Xo
Relation between two listeners
L1    64   34   1    162  77   85   38   40   13
L2    53   34   0    87   77   90   38   59   13
Relation between three listeners
L1    64   10   1    162  43   85   27   40   6
L2    53   10   0    87   43   90   27   59   6
L3    23   10   2    76   43   113  27   84   6
Relation between four listeners
L1    64   8    1    162  28   85   17   40   4
L2    53   8    0    87   28   90   17   59   4
L3    23   8    2    76   28   113  17   84   4
L4    76   8    10   66   28   99   17   30   4

Table 1: Relation between the number of listeners and the overlap in labeling: the first part shows the relation between two listeners, the second between three and the last one between four listeners; Do, Go, Qo and Xo are the numbers of sentences given the same label by all listeners for the classes D, G, Q and X.

In relation to our application, we analyse in detail only the questions. Figure 1 shows the decrease in the number of commonly labeled questions (grammar and prosodic) as a function of the number of listeners. After the first re-labeling, the number of questions decreases by about 50 % (more precisely, to 247 for listener L1 and to 177 for listener L2). After the second re-labeling, the number of common questions, chosen by both listeners, was reduced to about 100. When the third listener heard the sentences, the common part for questions was reduced to only 70 sentences. After the last listener, only 45 questions remained. This number represents only 11 % of the 400 questions initially selected.

This problem can be explained by the following reasons:
the context of the dialog is essential for sentence modality recognition,
some sentences can belong to several modalities,
the listeners make errors when labeling.

Figure 1: Number of commonly labeled questions (grammar and prosodic) as a function of the number of listeners: G curve = grammar questions; Q curve = prosodic questions; G + Q curve = union of G and Q questions

4 Automatic sentence modality recognition

This section describes the steps and requirements of ASMR and our recognition accuracy.

4.1 Attributes choice

We chose only the basic prosodic attributes, F0 and energy, because other studies [5, 6] used them successfully for automatic modality recognition. In addition, they are not very difficult to compute. F0 is computed with the autocorrelation function [11].

4.2 Attributes extraction

The second step is attribute extraction. Our approach is based on the following principle. We calculate F0 and energy values for each microsegment of the speech. F0 in the unvoiced parts of the signal is filled in by linear interpolation. Then each sentence is decomposed into 20 segments, and for each segment the average value of F0 and of energy is computed. We obtain 20 values of F0 and 20 values of energy. This number of features was chosen experimentally [6].
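The extraction steps of section 4.2 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the study: frame-level F0 and energy values are assumed to be already available as arrays, unvoiced frames are assumed to be marked with F0 = 0, and the function names are hypothetical. Leading and trailing unvoiced frames are held at the nearest voiced value, which is the behaviour of `numpy.interp` at the edges and a choice of this sketch.

```python
import numpy as np


def interpolate_unvoiced(f0):
    """Fill unvoiced frames (assumed to be marked by F0 == 0) by linear interpolation."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0.copy()  # nothing to interpolate from
    idx = np.arange(len(f0))
    # np.interp holds the first/last voiced value constant outside the voiced range.
    return np.interp(idx, idx[voiced], f0[voiced])


def segment_means(values, n_segments=20):
    """Split a frame-level contour into n_segments parts and average each part."""
    values = np.asarray(values, dtype=float)
    if len(values) < n_segments:
        raise ValueError("need at least one frame per segment")
    return np.array([seg.mean() for seg in np.array_split(values, n_segments)])


def sentence_features(f0, energy, n_segments=20):
    """Return n_segments F0 means followed by n_segments energy means."""
    f0_filled = interpolate_unvoiced(f0)
    return np.concatenate([segment_means(f0_filled, n_segments),
                           segment_means(energy, n_segments)])


# Toy contours (hypothetical values): 40 frames, some of them unvoiced.
f0 = np.full(40, 120.0)
f0[[0, 5, 6, 39]] = 0.0
energy = np.linspace(0.0, 1.0, 40)
features = sentence_features(f0, energy)  # 20 F0 values + 20 energy values
```

With 20 segments per contour, each sentence is described by a fixed-length vector of 40 values regardless of its duration, which is what lets sentences of different lengths be compared by the same classifier.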
4.3 Classifier choice

The literature [9] shows that the choice of classifier is not very important for our goal. For this reason, and in order to simplify this part of the research, we chose two basic classifiers: a neural network (NN, listed as MLP in table 2) and a Gaussian mixture model (GMM).

4.4 Recognition accuracy

Table 2 shows the accuracy of ASMR in French for different features, different classifiers and their combination. The Q class here contains only the prosodic questions. The grammar questions are excluded from this experiment, because we assume that they cannot be detected by prosody. The accuracy obtained is not sufficient for our application, which needs an accuracy above 90 %. One possible reason is that the basic prosodic features alone are not discriminating enough for this task. We analyse our corpus to confirm or reject this hypothesis.

Features   Classifier   ACC in [%]
                        D    E    Q    total
F0         GMM          44   47   75   54
F0, E      MLP          69   47   53   59
F0, E      GMM, MLP     56   41   84   61

Table 2: Modal recognition ACC for different prosodic features and classifiers in % (in the Features column, E denotes energy; in the ACC columns, E denotes the exclamation class)

5 Study of the French ESTER corpus

The first study is the observation of the F0 slope at the end of the sentence. It tests the basic prosodic rules described above. The end of the sentence means the last segment of 0.7 s duration. Four values of F0 are computed for this segment by the autocorrelation function. We use the linear regression of these four values to analyse the F0 slope. Table 3 shows the percentage of sentences according to the following rules. The column with the + symbol represents the sentences with a positive F0 slope, and the column with the - symbol those with a negative F0 slope. This first analysis separates the linear regression values into only two intervals. The next analysis divides the linear regression values into three intervals. The first one contains all linear regression values greater than 0.03 (marked as ++ in the table). This may be characteristic of questions. The second interval (marked as 0) is [-0.03; 0.03]. The sentences with linear regression coefficients smaller than -0.03 (marked as --) are in the last column of the table. This may be characteristic of declarative sentences.

We can conclude that the majority (80 %) of prosodic questions respect the basic prosodic rule: the final F0 slope is increasing. Only 59 % of declarations have a decreasing final F0 slope, and approximately half of the grammar questions have an increasing final F0 slope. This analysis is consistent with the good question detection accuracy obtained with F0 features only. Conversely, the grammar questions, as mentioned above, cannot be detected by F0 features only. The number of exclamations is not sufficient for performing this study, therefore they are not shown in the table.

Class   +    -    ++   0    --
D       41   59   14   14   72
G       47   53   28   18   54
Q       80   20   62   10   28

Table 3: Analysis of the slope of the F0 curve at the end of sentences by linear regression in %

The second analysis is the observation of the F0 and energy curves, represented by all computed F0 and energy features. We compute the mean and variance values for all features. The means of F0 are shown in figure 2 and the means of energy in figure 3. The variance is not shown, because the figures would be difficult to read. The variances of F0 lie in the interval (0; 0.02] and can be neglected. Conversely, the variance of energy lies in the interval [0.01; 0.2], which can be very important for ASMR.

We can see that the first two thirds of the F0 values are close for all classes, which makes them useless for modality recognition. The last third of the segment is the most discriminating. The ending F0 slope for prosodic questions is clearly increasing, and for the other two sentence types it is falling or neutral. These two types (D and G) are very close, which leads to some confusion in ASMR if only F0 features are used.

The behaviour of energy is difficult to explain from figure 3. We suppose that energy is less discriminating than F0, because the variance of energy is 10 times greater than the variance of F0, so the overlap of the sentences in the different classes will be greater. To test this hypothesis, we built histograms of F0 and, especially, of energy.

Figure 2: F0 curves for three types of sentences: D curve = declarations, G curve = grammar questions, Q curve = prosodic questions

Figure 3: Energy curves for three types of sentences: D curve = declarations, G curve = grammar questions, Q curve = prosodic questions

Figure 4 shows an important overlap (for F0 and for energy too), particularly between the classes of grammar questions and declarations. This fact explains most of the confusions between these two classes.
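The final-slope analysis of this section can be illustrated with a short sketch. It assumes that the slope is the least-squares regression coefficient of the four final F0 values against their index (0 to 3), that the F0 values are normalized, and that the three intervals are split at -0.03 and 0.03. The text does not state the unit of the regression axis, so these are assumptions of the example, and the function names are hypothetical.

```python
import numpy as np


def f0_slope(f0_values):
    """Least-squares slope of the final F0 values against their index."""
    y = np.asarray(f0_values, dtype=float)
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, y, 1)  # degree-1 fit: [slope, intercept]
    return slope


def slope_category(slope, threshold=0.03):
    """Map a slope to the three intervals '++', '0' and '--' (threshold assumed)."""
    if slope > threshold:
        return "++"
    if slope < -threshold:
        return "--"
    return "0"


# Toy endings (hypothetical normalized F0 values).
rising = slope_category(f0_slope([1.0, 1.1, 1.2, 1.3]))
flat = slope_category(f0_slope([1.0, 1.0, 1.0, 1.0]))
falling = slope_category(f0_slope([1.3, 1.2, 1.1, 1.0]))
```

Counting these categories per class over a labeled corpus would give percentages of the kind reported in table 3.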

Figure 4: Histograms of F0 (left) and energy (right) slopes. From top to bottom: declarations, grammar questions and prosodic questions.

6 Conclusion

Our analysis of the corpus shows that it is not possible to recognize all sentence types with good accuracy in real conditions using only the basic prosodic features (F0 and energy). This is due to an important overlap between the feature values of the classes. The most discriminable class is Q, for which the accuracy is about 84 %. The recognition accuracy of the other classes (E and D) is about 50 %, which is not sufficient for our application. We will include other information, such as language models and semantics, to improve the accuracy of ASMR.

7 Acknowledgements

This work would not have been possible without the aid of Daniel Dechelot and Emanuel Didiot from the French laboratory Loria, who participated in the manual corpus re-labeling. Our greatest thanks go to Christophe Cerisara from the same laboratory for his remarks, his help and his contribution to the manual corpus re-labeling. The work presented in this paper was partly supported by the Grant Agency of the Czech Republic under contract number 201/02/1553.

References:
[1] P. Kral and J. Kleckova, Speech recognition and animation of talking head, in IWSSIP 03, Prague, Czech Republic, September 2003.
[2] H. Gezundhajt, La prosodie, in http://www.linguistes.com/phonetique/prosodie.html.
[3] P. Martin, L'intonation en parole spontanée, in Revue Française de Linguistique Appliquée, Paris, France, 2000, vol. IV-2, pp. 57-76.
[4] R. Kompe, Prosody in Speech Understanding Systems, Springer, July 1997.
[5] V. Strom, Detection of accents, phrase boundaries and sentence modality in German with prosodic features, in Eurospeech 95, Madrid, 1995.
[6] J. Kleckova and V. Matousek, Using prosodic characteristics in Czech dialog system, in Interact 97, 1997.
[7] K. Chongdok and Y. Hiyon, Defining modality by terminal contours in standard Korean, in 1st International Conference on Speech Sciences, Seoul, 2002.
[8] H. Wright, M. Poesio, and S. Isard, Using high level dialogue information for dialogue act recognition using prosodic features, in ESCA Workshop on Prosody and Dialogue, Eindhoven, Holland, September 1999.
[9] H. Wright, Automatic utterance type detection using suprasegmental features, in ICSLP 98, Sydney, 1998, p. 1403.
[10] http://www.recherche.gouv.fr/technolangue/.
[11] A. de Cheveigne and H. Kawahara, Comparative evaluation of F0 estimation algorithms, in Eurospeech 2001, Scandinavia, 2001.