Prosody-based automatic segmentation of speech into sentences and topics
As presented in the similarly titled paper by E. Shriberg, A. Stolcke, D. Hakkani-Tür and G. Tür

Vesa Siivola
Vesa.Siivola@hut.fi

Vesa Siivola, Audio Mining, Oct 3 2002 p.1/27
Why?

Segmentation into sentences and topics is needed for robust information extraction.

Sentence segmentation:
- Speech input has no typographic cues (punctuation, paragraphs, capitalization, ...)
- First step toward topic segmentation

Topic segmentation is needed for:
- Topic detection and tracking
- Summarization

Example: "this is speech as you can see it can be hard to read without any punctuation capitalization would also help as well as paragraphing"
Part I: Models and features
Information sources for segmentation
- Language models
- Prosody: timing, pitch, stress and other voice qualities (e.g. creak)
  - Relatively unaffected by word identity, so robust to speech recognition (SR) errors
  - Segmentation by prosody alone can be used for audio browsing
  - Many prosodic features are invariant to channel changes, so robust
  - Minimal additional load when used with traditional SR
Training the prosodic models
- Word boundaries found by forced-alignment SR (causes a mismatch between training and test data)
- Features extracted from the words on both sides of each boundary (or, alternatively, from the 200 ms on each side of a pause)
  Example: "after an earthquake hit last night (pause) at eleven we bring ..."
- Started with about 100 features, reduced with decision tree experiments
  - Pause durations, phone durations, pitch information, voice quality
  - Given (not estimated): speaker gender, speaker change
  - No energy- or amplitude-based features: these were not robust enough across channels
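A minimal sketch of extracting boundary features from a forced alignment. The word/time layout and all numbers are invented for illustration; the real system extracts many more features than the pause duration shown here.

```python
# Hypothetical alignment: each word is (token, start_s, end_s).
words = [("after", 0.00, 0.25), ("an", 0.30, 0.35),
         ("earthquake", 0.35, 0.95), ("hit", 1.00, 1.20),
         ("last", 1.25, 1.55), ("night", 1.55, 2.00),
         ("at", 2.80, 2.90), ("eleven", 2.90, 3.40)]

def boundary_features(words):
    """Pause duration at each inter-word boundary, in seconds."""
    feats = []
    for (w1, _, end1), (w2, start2, _) in zip(words, words[1:]):
        feats.append({"left": w1, "right": w2,
                      "pause": round(start2 - end1, 3)})
    return feats

feats = boundary_features(words)
longest = max(feats, key=lambda f: f["pause"])  # the pause before "at"
```

The long pause before "at" is exactly the kind of cue the prosodic model exploits for sentence boundaries.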
Pause features
- Pauses give a strong hint about possible topic and sentence breaks.
- Used features:
  - Current pause duration / previous pause duration
  - Raw durations vs. speaker-normalized durations
- False pauses at stop closures are no problem; the model learns them.
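The duration-ratio feature above can be sketched as a log ratio; the eps floor is my own assumption (not from the paper) to guard against zero-length pauses.

```python
import math

def pause_ratio(cur_pause, prev_pause, eps=1e-3):
    """Log ratio of current to previous pause duration (seconds).
    Positive values mean the current pause is longer than the
    previous one, hinting at a possible boundary."""
    return math.log((cur_pause + eps) / (prev_pause + eps))
```

A speaker-normalized variant would divide each duration by that speaker's mean pause before taking the ratio.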
Phone and syllable duration features
- Speakers typically slow down toward the end of a unit.
- Last syllable length compared to the average syllable length
- Longest phone and longest vowel of the last word
- General features vs. speaker-normalized features
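A sketch of the speaker-normalized final-syllable feature; the numbers are invented for illustration.

```python
def normalized_final_duration(syllable_durs, speaker_mean):
    """Final-syllable duration relative to the speaker's mean
    syllable duration; values well above 1.0 suggest the
    pre-boundary lengthening described above."""
    return syllable_durs[-1] / speaker_mean

# e.g. a 240 ms final syllable against a 160 ms speaker mean
ratio = normalized_final_duration([0.12, 0.15, 0.24], 0.16)
```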
Pitch features
[Pipeline diagram: pitch tracker -> LTM filtering -> median filtering -> piecewise linear stylization -> feature computation; histogram of log f0 with modes near mu - log 2, mu, mu + log 2]
- Pitch (f0) estimation is not very robust; it needs postprocessing:
  - f0 doubling/halving corrected, estimated on a per-speaker basis
  - median filtering (removes unstable estimates at the beginnings of voiced sounds)
  - piecewise linearization
- The result is a stylized f0 contour.
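The median-filtering step can be sketched as below. This is a simplified stand-in for the paper's pipeline: it only smooths spikes (such as an octave error) and leaves unvoiced frames untouched, which is my own simplifying assumption; it does not do the per-speaker LTM halving/doubling correction or the stylization.

```python
def median_filter(f0, width=5):
    """Running median over voiced f0 estimates (Hz); unvoiced
    frames are marked 0.0 and passed through unchanged."""
    half = width // 2
    out = []
    for i, v in enumerate(f0):
        if v == 0.0:
            out.append(0.0)
            continue
        window = [x for x in f0[max(0, i - half):i + half + 1] if x > 0.0]
        window.sort()
        out.append(window[len(window) // 2])
    return out

track = [210.0, 420.0, 212.0, 214.0, 0.0, 208.0]  # 420 Hz = octave error
smoothed = median_filter(track)
```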
Pitch features 2
- f0 reset features
  - A speaker usually resets pitch at the start of a new block, typically preceded by a final fall.
  - Features: log ratio or log difference of the min, max, mean, start and end of the stylized f0 in the preceding and following word
- f0 range features
  - Pitch range in the word before the boundary, compared to the speaker's baseline f0
- f0 slope on each side of the boundary
- f0 continuity across the boundary
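One of the reset features (log ratio of mean stylized f0 across the boundary) can be sketched as follows; the f0 values are invented for illustration.

```python
import math

def f0_reset(prev_word_f0, next_word_f0):
    """Log ratio of mean stylized f0 in the word after vs. before
    the boundary; clearly positive values indicate a pitch reset."""
    mean = lambda xs: sum(xs) / len(xs)
    return math.log(mean(next_word_f0) / mean(prev_word_f0))

reset = f0_reset([140.0, 120.0, 100.0],  # final fall before the boundary
                 [240.0])                # pitch reset at the new sentence
```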
Other features
- Pitch halving at the f0 detector (usually a sign of creaky voice)
- Gender of the speaker (given, not estimated)
- Speaker change (given, not estimated)
Modeling: Decision trees
- CART-based; the IND package was used, which copes with missing values
- Decision trees (DTs) make no assumptions about the shape of the feature distributions
- Categorical features also work
- Decision trees are interpretable by humans
[Example tree diagram: the root splits on a > 5 vs. a <= 5, the children on b = true vs. b = false; each leaf stores class posteriors, e.g. (0.8, 0.2)]
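The example tree from the slide can be written out directly; the feature names, thresholds and leaf posteriors are the illustrative ones from the diagram, not learned values.

```python
def p_boundary(x):
    """Toy decision tree: internal nodes test features 'a'
    (numeric) and 'b' (boolean); each leaf stores the posterior
    P(boundary). This mirrors the slide's example diagram."""
    if x["a"] > 5:
        return 0.8 if x["b"] else 0.4
    else:
        return 0.1 if x["b"] else 0.3

p = p_boundary({"a": 7, "b": True})
```

The interpretability claim is visible here: the learned questions can be read off the tree directly.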
Feature selection algorithm
- The initial, highly redundant feature set is not very good for a greedy algorithm like CART.
- Iterative feature selection algorithm:
  1. Leave-one-out elimination, as long as performance does not decrease significantly
  2. Beam search over all subsets that contain the human-selected core features
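The first stage (greedy backward elimination) can be sketched as below. `score(subset)` stands in for any evaluation function; the paper tuned on held-out error, so the toy score here is purely illustrative.

```python
def leave_one_out_prune(features, score, tol=0.0):
    """Repeatedly drop the feature whose removal hurts the score
    least, stopping once any removal would cost more than tol."""
    current = list(features)
    best = score(current)
    while len(current) > 1:
        trials = [(score([f for f in current if f != g]), g)
                  for g in current]
        s, g = max(trials)
        if s < best - tol:
            break
        current.remove(g)
        best = s
    return current

# toy score: how many of the two "informative" features survive
informative = {"pause_dur", "f0_reset"}
pruned = leave_one_out_prune(
    ["pause_dur", "f0_reset", "junk_a", "junk_b"],
    lambda subset: len(informative & set(subset)))
```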
Language modeling for sentence segmentation
- Hidden Markov model: observations are words, states are words and boundaries.
- The observations keep the model and the word stream in sync.
- Can be described as:

  P_S  = P(<S> | w_{n-1}, w_{n-2}) * P(w_n | <S>)
  P_!S = P(w_n | w_{n-1}, w_{n-2})

  where <S> is a sentence boundary.
- Trained from annotated, boundary-tagged training data, with Katz back-off.
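The comparison of the two hypotheses above can be sketched as a posterior computation; all probability values below are invented for illustration.

```python
def boundary_posterior(p_s_given_hist, p_word_given_s, p_word_given_hist):
    """Posterior of a sentence boundary from the two path scores:
    the path that emits <S> before w_n vs. the plain n-gram path."""
    p_s = p_s_given_hist * p_word_given_s        # P_S
    p_no_s = (1.0 - p_s_given_hist) * p_word_given_hist  # P_!S
    return p_s / (p_s + p_no_s)

post = boundary_posterior(0.2, 0.3, 0.05)
```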
Language modeling for topic segmentation
- 100 unigram topic-cluster language models
- HMM: states are topic clusters, observations are sentences
- Complete graph over the clusters, plus initial and end states
[Diagram: start and end nodes connected to topic-cluster nodes such as politics, news, sports, culture]
- Data is presegmented at pauses > 0.65 s
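The observation likelihoods of this HMM can be sketched as unigram cluster scores. The two toy clusters, their vocabularies and the smoothing floor are all invented for illustration (the paper used 100 clusters).

```python
import math

clusters = {
    "sports":   {"game": 0.10, "score": 0.08, "the": 0.05},
    "politics": {"vote": 0.10, "senate": 0.07, "the": 0.05},
}
FLOOR = 1e-4  # smoothing floor for unseen words (an assumption)

def topic_loglik(words, model):
    """Unigram log likelihood of a sentence under one cluster LM."""
    return sum(math.log(model.get(w, FLOOR)) for w in words)

sentence = ["the", "game", "score"]
best = max(clusters, key=lambda c: topic_loglik(sentence, clusters[c]))
```

In the full model, a Viterbi pass over the complete cluster graph then places topic boundaries where the best state sequence changes cluster.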
Model combination
- Posterior probability interpolation:

  P(T_i | W, F) ~ lambda * P_LM(T_i | W) + (1 - lambda) * P_DT(T_i | F_i, W)

  lambda is optimized on held-out data.
- Integrated hidden Markov modeling
  - Similar to the hidden Markov model used in language modeling
  - The model emits both words and prosodic observations
- HMM posteriors as decision tree features (not used here)
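The interpolation formula above is a one-liner; the posterior values below are invented for illustration.

```python
def interpolate(p_lm, p_dt, lam):
    """Linear interpolation of the LM and decision-tree posteriors;
    lam is tuned on held-out data."""
    return lam * p_lm + (1.0 - lam) * p_dt

p = interpolate(0.9, 0.5, 0.75)
```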
Part II: Data and experiments
Data
- Switchboard (SWB)
  - Telephone conversations
  - Hand-labeled subset of data from the Linguistic Data Consortium (LDC)
- Broadcast News (BN): from LDC's 1997 Broadcast News corpus
  - Sentence boundaries automatically marked by the MITRE tagger (punctuation, capitalization, etc.)
  - Some Hub-4 data for the sentence-detection language models
  - TDT and TDT2 data for the topic-detection language models
Data 2
For the recognized-speech experiments, the 1-best output of SRI's DECIPHER recognizer was used.
- Switchboard WER: 46.7 %
- Broadcast News WER: 30.5 %

Task                 | LM training | Prosody training | Tuning     | Test
SWB sentence (real)  | 1.2M words  | 1.2M words       | 103K words | 101K words
SWB sentence (recog) | 1.2M words  | 1.2M words       | 6K words   | 8K words
BN sentence          | 130M words  | 700K words       | 24K words  | 21K words
BN topic             | 10.7M words | 700K words       | 205K words | 44K words
Results: BN sentence segmentation (error rates, %)

Model           | True words | SR words
chance          | 6.2        | 13.3
lower bound     | 0.0        | 7.9
With f0 features:
LM only         | 4.1        | 11.8
Prosody only    | 3.6        | 10.9
Interpolated    | 3.5        | 10.8
Combined HMM    | 3.3        | 13.3
Without f0 features:
Prosody only    | 3.8        | 11.3
Interpolated    | 3.2        | --
Combined HMM    | --         | 11.1
Results: BN sentence segmentation
Features queried in the decision tree:
- 46% pause duration at the boundary
- 42% speaker change
- 11% f0 difference
- 1% last syllable duration
Observations:
- The decision tree looks as expected from the prosody literature.
- Prosodic features are better than word-based ones, and also more robust to SR errors.
- Dropping the f0-based features did not matter much.
Results: SWB sentence segmentation (error rates, %)

Model        | True words | SR words
chance       | 11.0       | 25.8
lower bound  | 0.0        | 17.6
LM only      | 4.3        | 22.8
Prosody only | 6.7        | 22.9
Interpolated | 4.1        | 22.2
Combined HMM | 4.0        | 22.5
Results: SWB sentence segmentation
Features queried in the decision tree:
- 49% phone and syllable duration preceding the boundary
- 18% pause length at the boundary
- 17% speaker change
- 15% pause at the previous word boundary (e.g. "<S> Yeah <S> I know what you mean")
- 1% how long this speaker has been speaking
Observations:
- Either the prosodic model is not very good, or the material is very easy for language modeling: a few words (e.g. "I") appear very often at the start of a sentence.
- The prosodic model is robust to ASR errors; the LM degrades badly.
Results: BN topic segmentation (segmentation cost)

Model        | True words | SR words
chance       | 0.3        | 0.3
With f0 features:
LM only      | 0.190      | 0.190
Prosody only | 0.166      | 0.173
Combined HMM | 0.138      | 0.144
Without f0 features:
Combined HMM: 0.151
Results: BN topic segmentation
Features queried in the decision tree:
- 43% pause duration at the boundary
- 36% f0 range (preceding word vs. baseline f0)
- 9% speaker change
- 7% speaker gender: men use f0 differently than women (even after normalization)
- 5% how long this speaker has been speaking
Observations:
- Pause duration is underestimated, since the speech was presegmented by cutting at pauses longer than 0.65 s.
- Prosody is more reliable than the LM; the combination is very good.
- f0 features are important: serious degradation without them.
Summary
- The topic LM is much more robust than the sentence LM.
- Feature usage is corpus- and task-dependent.
- Possible improvements:
  - Model lexical stress and syllable structure
  - Try different combinations of features
  - Remove the mismatch in training conditions (true words vs. recognized words)
  - Condition on show, speaking style and speaker
Home exercise
a) Which features were the most important?
b) Which of these were estimated, and which were given?
c) How hard would it be to estimate the given features, and what kind of error rates could be achieved?
d) How well would the sentence/topic finder work if it really had to estimate the given features as well?
e) What are the real phenomena that the most important features are sensitive to? That is, how do people separate sentences and topics in real life, and how can these effects be seen in the features?
Without tests there can be no single right answer to questions c) and d). State your own guesstimate and briefly explain the reasoning behind it.
Project work: segmentation based on temporal features
- Task: sentence segmentation
- Data: Syntymättömien sukupolvien Eurooppa
- Features:
  - pause duration
  - previous pause duration
  - last syllable length (vs. average syllable length?)
  - average syllable length in the last sentence / last word
- Modeling: SOM or MLP?