
BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY

AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY

Sergey Levine

Principal Adviser: Vladlen Koltun
Secondary Adviser: Christian Theobalt

May 2009

Abstract

Human communication involves not only speech, but also a wide variety of gestures and body motions. Interactions in virtual environments often lack this multi-modal aspect of communication. This thesis presents a method for automatically synthesizing body language animations directly from the participants' speech signals, without the need for additional input. The proposed system generates appropriate body language animations in real time from live speech by selecting segments from motion capture data of real people in conversation. The selection is driven by a hidden Markov model and uses prosody-based features extracted from speech. The training phase is fully automatic and does not require hand-labeling of input data, and the synthesis phase is efficient enough to run in real time on live microphone input. The results of a user study confirm that the proposed method is able to produce realistic and compelling body language.

3 Acknowledgements First and foremost, I would like to thank Vladlen Koltun and Christian Theobalt for their help and advice over the course of this project. Their help and support was invaluable at the most critical moments, and this project would not have been possible without them. I would also like to thank Jerry Talton for being there to lend a word of advice, and never hesitating to point out the occasional glaring flaws that others were willing to pass over. I would like to thank Stefano Corazza and Emiliano Gambaretto of Animotion Inc., Alison Sheets of the Stanford BioMotion Laboratory, and all of the staff at PhaseSpace Motion Capture for the time and energy they spent helping me with motion capture processing and acquisition at various stages in the project. I would also like to thank Sebastián Calderón Bentin for his part as the main motion capture actor, and Chris Platz, for help with setting up scenes for the figures and videos. I would also like to thank the many individuals who reviewed my work and provided constructive feedback, including Sebastian Thrun, Matthew Stone, Justine Cassell, and Michael Neff. Finally, I would like to thank all of the participants in my evaluation surveys, as well as all other individuals who were asked to evaluate the results of my work in some capacity. The quality of body language can only be determined reliably by a human being, and the human beings around me were often my first option for second opinions over the entire course of this project. Therefore, I would like to especially thank all of my friends, who remained supportive of me throughout this work despite being subjected to an unending barrage of trial videos and sample questions. iii

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Background and Motivation
  1.2 Related Work
  1.3 Gesture and Speech
  1.4 System Overview
2 Motion Capture Processing
  2.1 Motion Capture
  2.2 Motion Segmentation
  2.3 Segment Clustering
3 Audio Segmentation and Prosody
  3.1 Gesture Synchrony and Prosody
  3.2 Audio Processing
  3.3 Syllable Segmentation
  3.4 Feature Extraction
4 Probabilistic Gesture Model
  4.1 Probabilistic Models for Body Language
  4.2 Gesture Model State Space
  4.3 Parameterizing the Gesture Model
  4.4 Estimating Model Parameters
5 Synthesis
  5.1 Overview
  5.2 Forward Probabilities
  5.3 Most Probable and Coherent State
  5.4 Early and Late Termination
  5.5 Constructing the Animation Stream
6 Evaluation
  6.1 Evaluation Methodology
  6.2 Results
  6.3 Video Examples
7 Discussion and Future Work
  7.1 Applicability of Prosody and the Synchrony Rule
  7.2 Limitations
  7.3 Future Work
8 Conclusion
Bibliography

List of Tables

2.1 This table shows the number of segments and clusters identified for each body section in each training set. Training set I consisted of 30 minutes divided into four scenes. Training set II consisted of 15 minutes in eight scenes.
3.1 This table presents a summary of the audio features used by the system. For each continuous variable X, µ_X represents its mean value and σ_X represents its standard deviation.

List of Figures

1.1 Training of the body language model from motion capture data with simultaneous audio.
2.1 The body parts which constitute each section are shown in (a) and the motion summary is illustrated in (b).
4.1 The hidden state space of the model consists of the index of the active motion, A_t = s_i or c_i, and the fraction of that motion which has elapsed up to this point, τ_t. Observations V_t are paired with the active motion at the end of the next syllable.
5.1 Diagram of the synthesis algorithm for one observation. A_t is the t-th animation state, V_t is the t-th observation, α_t is the forward probability vector, τ̄_t is the vector of expected τ values, and γ_t is the direct probability vector.
5.2 The states selected over 40 observations with max_i γ_t(i) (red) and max_i α_t(i)γ_t(i) (green), with dark cells corresponding to high forward probability. The graph plots the exact probability of the selections, as well as max_i α_t(i).
6.1 An excerpt from a synthesized animation used in the evaluation study.
6.2 Mean scores on a 5-point Likert scale for the four evaluation surveys. Error bars indicate standard error.
An excerpt from a synthesized animation.
Synthesized body language helps create an engaging virtual interaction.

8 Chapter 1 Introduction 1.1 Background and Motivation Today, virtual worlds are a rapidly growing venue for entertainment, with socialization and communication comprising some of their most important uses [54]. Communication in virtual environments may be accomplished by means of text chat or voice. However, human beings do not communicate exclusively with words, and virtual environments often fail to recreate the full, multi-modal nature of human communication. Gestures and speech coexist in time and are tightly intertwined [30], but current input devices are far too cumbersome to allow body language to be conveyed as intuitively and seamlessly as it would be in person. Current virtual worlds frequently employ keyboard or mouse commands to allow participants to utilize a small library of pre-recorded gestures, but this mode of communication is unnatural for extemporaneous body language. Given these limitations on direct input, body language for human characters must be synthesized automatically in order to produce consistent and believable results. Though significant progress has been made towards automatic animation of conversational behavior, as discussed in the next section, to the best knowledge of the author no current method exists which can animate characters in real time without requiring specialized input in addition to speech. In this thesis, I present a data-driven method that automatically generates body language animations from the participant s speech signal. The system is trained on motion capture data of real people in conversation, with simultaneously recorded audio. The main 1

9 CHAPTER 1. INTRODUCTION 2 contribution is a method of modeling the gesture formation process that is appropriate for real-time synthesis, as well as an efficient algorithm that uses this model to produce an animation from a live speech signal, such as a microphone. Prosody is known to correspond well to emotional state [1, 46] and emphasis [49]. Gesture has also been observed to reflect emotional state [50, 32] and highlight emphasized phrases [30]. Therefore, the system generates the animation by selecting appropriate gesture subunits from the motion capture training data based on prosody cues in the speech signal. The selection is performed by a specialized hidden Markov model (HMM), which ensures that the gesture subunits transition smoothly and are appropriate for the tone of the current utterance. In order to synthesize gestures in real time, the HMM predicts the next gesture and corrects mispredictions. The use of coherent gesture subunits, with appropriate transitions enforced by the HMM, ensures a smooth and realistic animation. The use of prosody for driving the selection of motions also ensures that the synthesized animation matches the timing and tone of the speech. Four user studies were conducted to evaluate the effectiveness of the proposed method. The user studies evaluated two different training sets and the ability of the system to generalize to novel speakers and utterances. Participants were shown synthesized animations for novel utterances together with the original motion capture sequences corresponding to each utterance and an additional sequence of gestures selected at random from the training set. The results of the study, presented in Chapter 6, confirm that the method produces compelling body language and generalizes to different speakers. By animating human characters from live speech in real time with no additional input, the proposed system can seamlessly produce plausible body language for human-controlled characters, thus improving the immersiveness of interactive virtual environments and the fluidity of virtual conversations. 1.2 Related Work Although no system has been proposed which both synthesizes full-body gestures and operates in real time on live voice input, a number of methods have been devised that synthesize either full-body or facial animations from a variety of inputs. Such methods often aim to

10 CHAPTER 1. INTRODUCTION 3 animate Embodied Conversational Agents (ECAs), which operate on a pre-defined script or behavior tree [10], and therefore allow for concurrent planning of synthesized speech and gesture to ensure co-occurrence. Often these methods rely on the content author to specify gestures as part of the input, using a concise annotation scheme [17, 23]. New, more complete annotation schemes for gestures are still being proposed [21], and there is no clear consensus on how gestures should be specified. Some higher-level methods also combine behavioral planning with gesture and speech synthesis [11, 39], with gestures specified as part of a scripted behavior. However, all of these methods rely on an annotation scheme that concisely and completely specifies the desired gestures. Stone et al. [48] avoid the need for annotation of the input text with a data-driven method, which re-arranges pre-recorded motion capture data to form the desired utterance. However, this method is limited to synthesizing utterances made up of pre-recorded phrases, and does require the hand-annotation of all training data. The proposed system also employs a datadriven method to avoid a complex annotation scheme, but is able to decouple speech from motion data and retarget appropriate pre-recorded motion to a completely novel utterance. Since the proposed system must generate animations for arbitrary input, it also cannot require any form of annotation or specification from the user. Several methods have been proposed that animate characters from arbitrary text using natural language processing. Cassell et al. [12] propose an automatic rule-based gesture generation system for ECAs, while Neff et al. [36] use a probabilistic synthesis method trained on hand-annotated video. However, both of these methods rely on concurrent generation of speech and gesture from text. Text does not capture the emotional dimension that is so important to body language, and neither text communication, nor speech synthesized from text can produce as strong an impact as real conversation [18]. Animation directly from voice has been explored for synthesizing facial expressions and lip movements, generally as a data-driven approach using some form of probabilistic model. Bregler et al. [9] propose a video-based method that reorders frames in a video sequence to correspond to a stream of phonemes extracted from speech. This method is further extended by Brand [8] by retargeting the animation onto a new model and adopting a sophisticated hidden Markov model for synthesis. Hidden Markov models are now commonly used to model the relationship between speech and facial expression [25, 53]. Other

11 CHAPTER 1. INTRODUCTION 4 automatic methods have proposed synthesis of facial expressions using more sophisticated morphing of video sequences [15], physical simulation of muscles [47], or by using hybrid rule-based and data-driven methods [5]. Although speech-based synthesis of facial expressions is quite common, it does not generally utilize vocal prosody. Since facial expressions are dominated by mouth movement, many speech-based systems use techniques similar to phoneme extraction to select appropriate mouth shapes. However, a number of methods have been proposed that use speech prosody to model expressive human motion beyond lip movements. Albrecht et al. [2] use prosody features to drive a rule-based facial expression animation system, while more recent systems apply a data-driven approach to generate head motion from pitch [13] and facial expressions from vocal intensity [19]. Incorporating a more sophisticated model, Sargin et al. [42] use prosody features to directly drive head orientation with a HMM. Although these methods only animate head orientation from prosody, Morency et al. [33] suggest that prosody may also be useful for predicting gestural displays. The proposed system selects appropriate motions using a prosody-driven HMM reminiscent of the above techniques, but each output segment corresponding to an entire gesture subunit, such as a stroke or hold. This effectively reassembles the training motion segments into a new animation. Current techniques for assembling motion capture into new animations generally utilize some form of graph traversal to select the most appropriate transitions [3, 24]. The proposed system captures this notion of a motion graph in the structure of the HMM, with high-probability hidden-state transitions corresponding to transitions which occur frequently in the training data. While facial animation synthesis HMMs have previously used remapping of observations [8], or conditioned the output animations on the input audio [25], the proposed method maps animation states directly to the hidden states of the model. This allows a simpler system to be designed that is able to deal with coherent gesture subunits without requiring them to directly coincide in time with audio segments. While the methods described above are able to synthesize plausible body language or facial expression for ECAs, none of them can generate full-body animations from live speech. Animating human-controlled characters requires real-time speeds and a predictive model that does not rely on looking ahead in the audio stream. Such a model constitutes

12 CHAPTER 1. INTRODUCTION 5 the main contribution of this work. 1.3 Gesture and Speech The most widely-used taxonomy for gestures was proposed by McNeill [30], though he later suggested that a more continuous classification would be more appropriate [31]. Mc- Neill s original taxonomy consists of four gesture types: iconics, metaphorics, deictics, and beats. Iconics present images of concrete objects or actions, metaphorics represent abstract ideas, deictics serve to locate entities in space, and beats are simple, repetitive motions meant to emphasize key phrases [30]. In addition to the above taxonomy, McNeill made several other relevant observations. In regard to beat gestures, he observed that beats tend to have the same form regardless of content that is, an up/down or sideways flick of the hand. He also observed that beats are generally used to highlight important concepts or emphasize certain words, and constitute almost half of all gestures in both narrative and non-narrative speech. While narrative speech contains many iconic gestures, beats were observed by McNeill to constitute about two thirds of the gestures accompanying non-narrative speech, suggesting that synthesizing only beats would adequately animate non-narrative conversations [30]. McNeill also observed that metaphoric gestures vary across cultures, but within the same culture tend to be drawn from a small pool of possible options. While narrative speech tends to be dominated by iconic gestures, over 80% of the gestures accompanying non-narrative speech were observed by McNeill to be either beats or metaphorics, suggesting that synthesizing only beats and metaphorics would adequately animate non-narrative, conversational speech [30]. As observed in [14], the overall amplitude and frequency of gestures varies across culture, so it is necessary for us to restrict the proposed method to a single cultural context. Since metaphoric gestures are drawn from a small pool of options within a single such context, they are also good candidates for synthesis. Even if metaphoric gestures depend on the semantic meaning of the utterance, the probability of selecting an appropriate metaphoric is quite high due to the small number of options. The abstract nature of the gesture further increases the likelihood that it will be seen as plausible even if it does not correspond to the

13 CHAPTER 1. INTRODUCTION 6 semantic meaning, provided it occurs at the right time. Since such gestures often accompany a central or emphasized idea, we can hypothesize that their occurrence may also be signaled by prosody cues. Since prosody correlates well to emphasis [49], it should correspond to the emphasized words that gestures highlight, making it useful for selecting gestures, such as beats and metaphorics, with the appropriate timing and rhythm. In addition, there is evidence that prosody carries much of the emotive content of speech [1, 44], and that emotional state is often reflected in body language [50, 32]. Therefore, we can hypothesize that a prosody-driven system would produce accurately timed and appropriate beats, as well as more complex abstract or emotion-related gestures with the appropriate emotive content. The system captures prosody using the three features that it is most commonly associated with: pitch, intensity, and duration. Pitch and intensity have previously been used to drive expressive faces [13, 19], duration has a natural correspondence to the rate of speech, and all of these aspects of prosody are informative for determining emotional state [43]. While semantic meaning also corresponds strongly to gesture, the system does not attempt to interpret the utterance. This approach has some limitations, as discussed in Section 7.2, but is more appropriate for online, real-time synthesis. Extracting semantic meaning from speech to synthesize gesture without knowledge of the full utterance is difficult because it requires a predictive speech interpreter, while prosody can be obtained efficiently without looking ahead in the utterance. 1.4 System Overview During the training phase, the system processes a corpus of motion capture and audio data to build a probabilistic model that correlates gesture and prosody. The motion capture data is processed by extracting gesture subunits, such as strokes and holds. These subunits are then clustered to identify recurrences of the same motion, as described in Chapter 2. The training speech signal is processed by extracting syllables and computing a set of prosody-based features on each syllable, as described in Chapter 3. Finally, the parameters of a hidden Markov model are estimated directly from the two resulting state-streams, with clustered animation segments as hidden states and the audio

14 CHAPTER 1. INTRODUCTION 7 Motion Capture Speech Segmentation Segmentation Clustering Feature Extraction Combined Motion/Speech State Stream HMM Parameter Estimation Figure 1.1: Training of the body language model from motion capture data with simultaneous audio. segments as observations. Mapping hidden states directly to clustered motions allows us to build a model which can handle non-synchronized motion and audio segments efficiently. The head, upper body, and lower body are processed separately into three distinct HMMs, since the gestures of these body parts do not always coincide. The hidden state space of the HMM consists of start and continuation states for each motion, along with the fraction of the motion which has elapsed up to the current time step. This allows motions to last for multiple time steps to handle the common case of multiple syllable-observations within one long motion. The model is described in more detail in Chapter 4. The entire training process is summarized in Figure 1.1. The HMM produced in the training phase is used in the synthesis phase to drive the animation in real time based on a novel audio signal. Each time a new syllable is identified in the audio stream, the system computes the most probable hidden state based on the previous state, as well as the most probable state based on the current vector of forward probabilities. These vectors are then combined to find the state that is both probable and coherent. The synthesis process is described in Chapter 5.

15 CHAPTER 1. INTRODUCTION 8 The correlation between speech and body language is inherently a many-to-many mapping: there is no unique animation that is most appropriate for a given utterance [8]. This ambiguity makes validation of synthesis techniques difficult. Since the only way to judge the quality of a synthesized body language animation is by subjective evaluation, surveys were conducted to validate the proposed method. The results of these surveys are presented in Chapter 6.

Chapter 2

Motion Capture Processing

2.1 Motion Capture

The final training set for the system consists of 30 minutes of motion capture data with accompanying speech, divided into four scenes. An additional 15-minute training set from a different speaker was also used in the user study. Each scene was excerpted from an extemporaneous conversation, ensuring that the body language was representative of real human interaction. Although the topic of each conversation was not selected in advance, conversations were intentionally selected to provide a variety of common behaviors, including turn-taking, extended periods of fluent speech, the asking of questions, and the expression of a few simple emotions (such as anger and excitement). Training data from different speakers was not mixed together to avoid creating an inconsistent gesturing style. Additional motion capture data from other speakers was used to test the ability of the system to generalize to other gestural styles and for comparison evaluation of the synthesized sequence, as described in Chapter 6.

For capture, I employed the markerless technique proposed by Mündermann et al. [35], as well as the motion capture facilities at PhaseSpace Motion Capture. The markerless system computes a visual hull using silhouettes taken from eight cameras and fits a high-resolution scan of the speaker's body to the visual hull with appropriate joint orientations. The PhaseSpace motion capture system operates using conventional active markers. The resulting marker location data was processed with Autodesk MotionBuilder to compute joint orientations.

Both the markerless data and the active marker data obtained from the PhaseSpace system are processed to obtain an animated 14-joint skeleton. The data is then segmented into gesture unit phases, henceforth referred to as gesture subunits. These segments are then clustered to identify recurrences of similar motions.

2.2 Motion Segmentation

Current motion segmentation methods identify broad categories of motion within a corpus of motion capture data, such as walking and sitting [4, 34]. Perceptually distinct gesture subunits are not as dissimilar as these broad categories, and much of the existing work on data-driven gesture animation segments training data manually [48, 36]. However, manual annotation does not scale gracefully to large amounts of training data. Therefore, a more sensitive automatic segmentation algorithm is needed.

Gesture units consist of the pre-stroke hold, stroke, and post-stroke hold phases [30, 20], which provide natural boundaries for segmentation of gesture motions. Such phases have previously been used to segment hand gestures [29]. From this we deduce that a gesture unit consists of alternating periods of fast and slow motion. To segment the motion capture data into gesture unit phases (or subunits), boundaries are placed between strokes and holds using an algorithm inspired by [16]. Since individual high-velocity strokes are separated by low-velocity pauses, we may detect these boundaries by looking for dips in the average angular velocities of the joints. For each frame of the motion capture data, let ω_j be the angular velocity of joint j, and let u_j be a weight corresponding to the perceptual importance of the joint. This weight is chosen to weigh high-influence joints, such as the pelvis or abdomen, higher than smaller joints such as the hands or feet. We can then compute a weighted average of the angular velocities as follows:

z = Σ_{j=1}^{14} u_j ‖ω_j‖_2.

A stroke is expected to have a high z value, while a hold will have a low value. Similarly, a pause between a stroke and a retraction will be marked by a dip in z as the arm stops and reverses direction. Therefore, segment boundaries are inserted when this sum crosses an empirically determined threshold. To avoid creating small segments due to noise, the system only creates segments which exceed either a minimum length or a minimum limit on the integral of z over the segment. To avoid unnaturally long still segments during long pauses, the algorithm places an upper bound on segment length.
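To make the boundary-detection step concrete, the following is a minimal sketch in Python, assuming per-frame joint angular speeds are available as a NumPy array. The function name, the weights, and the threshold and length limits are illustrative placeholders, not the values used in the actual system.

    import numpy as np

    def segment_motion(omega, weights, threshold, min_len=5, min_energy=0.5, max_len=120):
        """Split a motion into stroke/hold subunits at dips in weighted angular speed.

        omega:   (T, J) array of per-joint angular speed magnitudes over T frames.
        weights: (J,) perceptual-importance weights (e.g., pelvis/abdomen > hands/feet).
        threshold, min_len, min_energy, and max_len are illustrative tuning values.
        """
        z = omega @ weights                      # weighted sum of joint speeds per frame
        boundaries = [0]
        for t in range(1, len(z)):
            crossed = (z[t - 1] >= threshold) != (z[t] >= threshold)
            too_long = (t - boundaries[-1]) >= max_len
            if crossed or too_long:
                length = t - boundaries[-1]
                energy = z[boundaries[-1]:t].sum()
                # keep a boundary only if the segment is long enough or contains enough motion
                if length >= min_len or energy >= min_energy:
                    boundaries.append(t)
        boundaries.append(len(z))
        return list(zip(boundaries[:-1], boundaries[1:]))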

The motions of the head, arms, and lower body do not always coincide, so these three body sections are segmented separately and henceforth treated as separate animation streams. The joints which constitute each of the sections are illustrated in Figure 2.1(a). The number of segments extracted from the training sets is summarized in Table 2.1.

Figure 2.1: The body parts which constitute each section are shown in (a), where the three body sections are highlighted in different colors, and the motion summary is illustrated in (b): the displacement of the key parts is used along with velocity to summarize the dynamics of a segment for clustering.

2.3 Segment Clustering

Once the motion data has been segmented, the segments are clustered to identify recurring gesture subunits. The clustering algorithm is based on Ward hierarchical clustering [52], though a variety of methods may be appropriate.

Ward hierarchical clustering merges two clusters at every iteration so as to minimize the error sum-of-squares criterion (ESS) for the new cluster. If we define C_i to be the center of the i-th cluster, and S_i^(j) to be the j-th segment in that cluster, we may define ESS_i as

ESS_i = Σ_{j=1}^{N_i} D(C_i, S_i^(j)),

where D is the distance metric. The algorithm is modified to always select one of the existing sample points as a cluster center (specifically, the point which minimizes the resulting ESS value), thus removing the requirement that a Euclidean metric be used. This allows the distance function to be arbitrarily complex.

Segments are compared according to a three-part distance function that takes into account length, starting pose, and a fixed-size summary of the dynamics of the segment. The summary holds some measure of velocity and maximum and average displacement along each of the axes for a few key body parts in each section, as in Figure 2.1. The head section uses the rotation of the neck about each of the axes. The arm section uses the positions of the hands, which often determine the perceptual similarity of two gestures. The lower body uses the positions of the feet relative to the pelvis, which capture motions such as the shifting of weight. For the arms and lower body, velocity and maximum and average displacement each constitute a six-dimensional vector (three axes per joint), while for the head this vector has three dimensions. Each vector is rescaled to a logarithmic scale, since larger gestures can tolerate greater variation while remaining perceptually similar. Using six-dimensional vectors serves to de-emphasize the unused hand in one-handed gestures: if instead a three-dimensional vector for each hand was rescaled, small motions in the unused hand would be weighted too heavily on the logarithmic scale relative to the more important large motion of the active hand. To compare two summaries, the algorithm uses the sum of the distances between the three vectors.

Although the summary is the most significant part of the distance function, it is important that all gestures in a cluster have roughly the same length and pose, as this allows us to synthesize a smooth animation stream. In particular, still segments or holds (segments without much motion, e.g., a z value that is consistently below the threshold) tend to have very similar summaries, so the weight of the pose is increased on such segments to avoid clustering all still segments together. The starting pose is far more important for still segments, which remain in their starting pose for their duration, while active segments quickly leave the starting pose.

Table 2.1 (Final Training Data Motion Statistics) provides a summary of the number of segments and clusters identified for each body section in each training set: for each of training set I (30 minutes divided into four scenes) and training set II (15 minutes in eight scenes), it lists the total segments, total clusters, and average segments per cluster for the head, arms, and lower body. It should be noted that the lower body section generally contained very few unique segments. In fact, in training set II, only two holds and one motion were identified. The motion corresponded to a shifting of weight from one foot to the other. The larger training set contained a few different weight shifting motions. The arms and head were more varied by comparison, reflecting their greater prominence in body language.
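The modified clustering step of Section 2.3 can be sketched as follows. This is a simplified illustration, assuming segments are arbitrary Python objects and that dist implements the three-part distance described above; the function names and the greedy merge loop are illustrative, not the thesis implementation.

    def cluster_ess(center, members, dist):
        # Error sum-of-squares of a cluster around a candidate center segment.
        return sum(dist(center, s) for s in members)

    def best_center(members, dist):
        # Modified Ward step: the center is always an existing segment (a medoid),
        # chosen to minimize the resulting ESS, so dist may be arbitrarily complex.
        return min(members, key=lambda c: cluster_ess(c, members, dist))

    def ward_cluster(segments, dist, n_clusters):
        """Greedy agglomerative clustering: repeatedly merge the pair of clusters
        whose union has the smallest ESS, until n_clusters remain."""
        clusters = [[s] for s in segments]
        while len(clusters) > n_clusters:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    union = clusters[a] + clusters[b]
                    cost = cluster_ess(best_center(union, dist), union, dist)
                    if best is None or cost < best[0]:
                        best = (cost, a, b)
            _, a, b = best
            merged = clusters[a] + clusters[b]
            clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        return clusters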

21 Chapter 3 Audio Segmentation and Prosody 3.1 Gesture Synchrony and Prosody Gesture strokes have been observed to consistently end at or before, but never after, the prosodic stress peak of the accompanying syllable. This is referred to as one of the gesture synchrony rules by McNeill [30]. Due to the importance of syllables for maintaining the rhythm of gestures, syllables serve as the basic segmentation unit for audio data in the proposed system. During both training and synthesis, the audio data is parsed to identify syllable boundaries, and a concise prosody descriptor is computed for each syllable. During real-time synthesis, the syllable segmentation algorithm identifies syllables precisely when the peak of the next syllable is encountered, making it easier to produce gestures in accordance with the synchrony rule. This algorithm is further discussed in Section 3.3. As noted earlier, prosody carries much of the emotive content of speech, and therefore is expected to correlate well to body language. Prosody also allows the system to operate independently of language. Therefore, the features extracted for each syllable reflect some aspect of prosody, which is defined as pitch, intensity, and duration, as described in Section

22 CHAPTER 3. AUDIO SEGMENTATION AND PROSODY Audio Processing For both syllable segmentation and feature extraction, the system continuously extracts pitch and intensity curves from the audio stream. The intensity curve is used to drive segmentation, and both curves are used after segmentation to compute a concise prosody descriptor for each audio segment. Pitch is extracted using the autocorrelation method, as described in [6], and intensity is extracted by squaring waveform values and filtering them with a Gaussian analysis window. For both tasks, components of the Praat speech analysis tool [7] were used. Components of the tool were modified and optimized to run in real time, allowing for continuous extraction of pitch and intensity. 3.3 Syllable Segmentation For efficient segmentation, the system employs a simple algorithm inspired by Maeran et al. [28], which identifies peaks separated by valleys in the intensity curve, under the assumption that distinct syllables will have distinct intensity peaks. In order to segment the signal progressively in real time, the algorithm identifies the highest peak within a fixed radius, and continues scanning the intensity curve until the next local maximum is found. If this maximum is at least a fixed distance away from the previous one, the previous peak is identified as the peak of the syllable terminating at the lowest point between the two peaks. Finally, segment boundaries are inserted at voiced/unvoiced transitions, and when the intensity falls below a fixed silence threshold. As stated previously, one effect of using this algorithm is that a syllable is only detected once the intensity peak of the next syllable is found, so syllable observations are issued precisely at syllable peaks, which allows the system to more easily follow the previously mentioned synchrony rule. 3.4 Feature Extraction In order to train the system on a small corpus of training data, the size of the observation state space is limited by using a small set of discrete features, rather than the more standard mixture of Gaussians approach. The selected features concisely describe each of the three

aspects of prosody: pitch, intensity, and duration. The features are summarized in Table 3.1. Pitch is described using two binary features indicating the presence or absence of significant upward or downward inflection, corresponding to one standard deviation above the mean. Intensity is described using a trinary feature indicating silence, standard intensity, or high intensity, corresponding to half a standard deviation above the mean. Length is also described using a trinary feature, with long segments one deviation above the mean and short segments half a deviation below. This asymmetry is necessitated by the fact that syllable lengths are not normally distributed.

Audio Features
Feature         | Variable          | Value 0              | Value 1                  | Value 2
Inflection up   | Pitch change df0  | df0 - µ_df0 < σ_df0  | df0 - µ_df0 ≥ σ_df0      |
Inflection down | Pitch change df0  | df0 - µ_df0 > -σ_df0 | df0 - µ_df0 ≤ -σ_df0     |
Intensity       | Intensity I       | I < C_silence        | I - µ_I < σ_I/2          | I - µ_I ≥ σ_I/2
Length          | Syllable length L | L - µ_L ≤ -σ_L/2     | -σ_L/2 < L - µ_L < σ_L   | L - µ_L ≥ σ_L

Table 3.1: This table presents a summary of the audio features used by the system. For each continuous variable X, µ_X represents its mean value and σ_X represents its standard deviation.

The means and deviations of the training sequence are established prior to feature extraction. During synthesis, the means and deviations are estimated incrementally from the available audio data as additional syllables are observed. This set of discrete prosody-based features creates an observation state-space of 36 states, which enumerates all possible combinations of inflection, intensity, and duration. In practice, the state space is somewhat smaller, since silent segments do not contain inflection, and are restricted to short or medium length. Long silent segments are truncated, since they do not provide additional information over a sequence of medium segments, though short segments are retained, as these usually correspond to brief pauses between words. This effectively limits the state space to about 26 distinct observation states.
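As a concrete illustration of Table 3.1, the sketch below maps one syllable's prosody measurements to a discrete observation index. The function and the layout of the stats dictionary are hypothetical; the thresholds follow the table and the surrounding prose.

    def observation_state(df0, intensity, length, stats, c_silence):
        """Map one syllable's prosody measurements to a discrete observation index.

        df0       : pitch change over the syllable
        intensity : intensity of the syllable
        length    : syllable duration
        stats     : running means and standard deviations, e.g.
                    {'mu_df0': ..., 'sd_df0': ..., 'mu_i': ..., 'sd_i': ...,
                     'mu_len': ..., 'sd_len': ...}
        c_silence : fixed silence threshold on intensity
        """
        up = int(df0 - stats['mu_df0'] >= stats['sd_df0'])        # binary
        down = int(df0 - stats['mu_df0'] <= -stats['sd_df0'])     # binary
        if intensity < c_silence:
            loud = 0                                              # silence
        elif intensity - stats['mu_i'] < stats['sd_i'] / 2:
            loud = 1                                              # standard intensity
        else:
            loud = 2                                              # high intensity
        if length - stats['mu_len'] <= -stats['sd_len'] / 2:
            dur = 0                                               # short
        elif length - stats['mu_len'] < stats['sd_len']:
            dur = 1                                               # medium
        else:
            dur = 2                                               # long
        if loud == 0:
            up, down = 0, 0          # silent segments carry no inflection
            dur = min(dur, 1)        # and are restricted to short or medium length
        # enumerate all combinations of inflection, intensity, and duration
        return ((up * 2 + down) * 3 + loud) * 3 + dur

With the silence restrictions applied, many of the 36 nominal indices never occur, which is consistent with the roughly 26 distinct observation states noted above.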

24 Chapter 4 Probabilistic Gesture Model 4.1 Probabilistic Models for Body Language Previously proposed HMM-driven animation methods generally use temporally aligned animation and speech segments. The input audio is remapped directly to animation output [8], or is coupled to the output in a more complex manner, such as conditioning output and transition probabilities on the input state [25]. The HMM itself is trained with some variation on the EM algorithm [25, 53]. Since the input and output states in the proposed system are not temporally aligned, and since each of the animation segments corresponds to a meaningful unit, such as a stroke or a hold, the model takes a different approach and maps animation segments directly to the hidden states of the model, rather than inferring hidden structure with the EM algorithm. This allows greater control in designing a specialized system for dealing with the lack of temporal alignment even when using a small amount of training data, though at the expense of being unable to infer additional hidden structure beyond that provided by the clustering of gesture subunits. The hidden state space of the HMM consists of start and continuation states for each motion, along with the fraction of the motion which has elapsed up to the current observation. This allows the same motion to span multiple observations and handle the common case of multiple syllables within a single stroke or hold. 17

In Section 4.2, I formulate the hidden state space for the gesture model. A parametrization of the model which allows efficient inference and training is described in Section 4.3, and the estimation of model parameters from training data is detailed in Section 4.4.

4.2 Gesture Model State Space

The process of gesture subunit formation consists of the selection of appropriate motions that correspond well to speech and line up in time to form a continuous animation stream. As discussed in Section 3.1, gesture strokes terminate at or before syllable peaks, and syllable observations arrive when the peak of the next syllable is encountered. In the training data, less than 0.4% of motion segments did not contain a syllable observation, so we may assume that at least one syllable will be observed in every motion, though we will often observe more. Therefore, syllable observations provide a natural discretization for continuous gesture subunit formation, allowing us to model this process with a discrete-time HMM.

Under this discretization, we assume that gesture formation is a Markov process in which H_t, the animation state at syllable observation V_t, depends only on V_t and the previous animation state H_{t-1}. As discussed in Section 3.4, V_t contains a small set of discrete prosody features, and may be expressed as the index of a discrete observation state, which is denoted by v_j. We define the animation state H_t = {A_t, τ_t(i)}, where A_t = m_i is the index of the motion cluster corresponding to the desired motion, and τ_t(i) is the fraction of that motion which has elapsed up to the observation V_t. This allows motions to span multiple syllables.

To prevent adjacent motions from overlapping, one could use a temporal scaling factor to ensure that the current motion terminates precisely when the next motion begins. However, during real-time synthesis, there is no knowledge of the length of future syllables, and it is therefore not possible to accurately predict the scaling factor. Instead, we may interrupt a motion and begin a new one when it becomes unlikely that another syllable will be observed within that motion, thus ensuring that motions terminate at syllable observations and do not overlap.

Figure 4.1: The hidden state space of the model consists of the index of the active motion, A_t = s_i or c_i, and the fraction of that motion which has elapsed up to this point, τ_t. Observations V_t are paired with the active motion at the end of the next syllable.

4.3 Parameterizing the Gesture Model

In order to compactly express the parameters of the gesture model, first note that τ evolves in a very structured manner. If observation V_t corresponds to the continuation of the current motion m_i, then the following equations describe the evolution of A and τ:

A_t = A_{t-1} = m_i
τ_t(i) = τ_{t-1}(i) + Δτ_t(i),

where Δτ_t(i) is the fraction of the motion m_i which has elapsed since the previous observation. If instead the current motion terminates at observation V_t and is succeeded by A_t = m_j, then we simply have τ_t(j) = 0. Let τ'_t(i) = τ_{t-1}(i) + Δτ_t(i); then τ_t is completely determined by τ'_t and A_t, except in the case when motion m_i follows itself, and A_t = A_{t-1} = m_i but τ_t(i) = 0. To disambiguate this case, we introduce a start state s_i and a continuation state c_i for each motion cluster m_i, so that A_t may take on either s_i or c_i (we will denote an arbitrary animation state as a_j). This approach is similar to the BIO notation (beginning, inside, outside) used in semantic role labeling [41]. Under this new formulation, τ_t(i) = τ'_t(i) when A_t = c_i and A_{t-1} = c_i or s_i; otherwise, τ_t(i) = 0. Therefore, τ_t is completely determined by τ'_t and A_t.

The transition probabilities for A_t are a function of A_{t-1} = m_i and τ'_t(i). While this distribution may be learned from the training data, we can simplify it considerably. Although the precise length of a motion may vary slightly for syllable alignment, the transitions out of that motion do not depend strongly on this length, and we can assume that they are independent. Therefore, we can define T_{c_i} as the vector of transition probabilities out of a motion m_i:

T_{c_i, s_j} = P(A_t = s_j | A_{t-1} = c_i, A_t ≠ c_i).

We can also assume that the continuation of c_i depends only on τ'_t(i), since if A_t = c_i, then A_{t-1} must be c_i or s_i, so the previous state provides little additional information. Intuitively, this seems reasonable, since encountering another observation within a motion becomes less probable as the motion progresses. Therefore, we can construct the probabilities of transitions from c_i as a linear combination of T_{c_i} and e_{c_i}, the guaranteed transition into c_i, according to some function f_i of τ'_t(i):

P(A_t | A_{t-1} = c_i) =
  { f_i(τ'_t(i))                     if A_t = c_i
  { (1 - f_i(τ'_t(i))) T_{c_i, A_t}  otherwise.        (4.1)

We now define the vector T_{s_i} as the distribution of transitions out of s_i, and construct the full transition matrix T = [T_{s_1}, T_{c_1}, ..., T_{s_n}, T_{c_n}], where n is the total number of motion clusters. The parameters of the HMM are then completely specified by the matrix T, a matrix of observation probabilities O given by O_{i,j} = P(v_i | a_j), and the interpolation functions f_1, f_2, ..., f_n.

4.4 Estimating Model Parameters

The transition and observation matrices are estimated directly from the training data by counting the frequency of transitions and observations. When pairing observations with animation states, we must consider the animation state that is active at the end of the next observation, as shown in Figure 4.1. When a motion m_i terminates on the syllable V_t, we must predict the next animation segment. However, if we associate V_t with the current

segment, the observed evidence would indicate that c_i is the most probable state, which is not the case. Instead, we would like to predict the next state, which is precisely the state that is active at the end of the next syllable V_{t+1}.

The value of an interpolation function, f_i(τ_t(i)), gives the probability that another syllable will terminate within motion m_i after the current one, which terminated at point τ_t(i). Since motions are more likely to terminate as τ_t(i) increases, we may assume that f_i is monotonically decreasing. From empirical observation of the distribution of τ values for final syllables within training motions, I concluded that a logistic curve would be well suited for approximating f_i. Therefore, the interpolation functions are estimated by fitting a logistic curve to the τ values for final syllables within each training motion cluster by means of gradient descent.
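The counting and curve-fitting steps of Section 4.4 could look roughly like the sketch below. It assumes the training data has already been reduced to an aligned stream of hidden-state indices (with s_i and c_i stored as 2i and 2i + 1, an indexing convention chosen here purely for illustration) and observation indices, with each observation already paired to the state active at the end of the next syllable. The way the logistic fit is posed, with targets of 1 at τ = 0 and 0 at the observed final-τ values, is one simple formulation and not necessarily the one used in the thesis.

    import numpy as np

    def estimate_matrices(state_seq, obs_seq, n_clusters, n_obs_states):
        """Estimate the transition matrix T and observation matrix O by counting."""
        n_hidden = 2 * n_clusters
        T = np.ones((n_hidden, n_hidden))              # add-one smoothing
        O = np.ones((n_obs_states, n_hidden))          # O[v, a] = P(v | a)
        for a_prev, a_next in zip(state_seq[:-1], state_seq[1:]):
            T[a_prev, a_next] += 1
        for v, a in zip(obs_seq, state_seq):
            O[v, a] += 1
        T /= T.sum(axis=1, keepdims=True)              # rows are distributions over A_t
        O /= O.sum(axis=0, keepdims=True)              # columns are distributions over V_t
        return T, O

    def fit_logistic(final_taus, steps=2000, lr=0.2):
        """Fit a decreasing logistic f(tau) = 1 / (1 + exp(k * (tau - m))) by gradient
        descent on squared error: f should be near 1 at tau = 0 (another syllable is
        almost certain early on) and near 0 at the tau values where motions ended."""
        t = np.asarray(final_taus, dtype=float)
        x = np.concatenate([np.zeros_like(t), t])
        y = np.concatenate([np.ones_like(t), np.zeros_like(t)])
        k, m = 10.0, float(t.mean())                   # initial steepness and midpoint
        for _ in range(steps):
            f = 1.0 / (1.0 + np.exp(k * (x - m)))
            common = 2.0 * (f - y) * f * (1.0 - f)     # chain rule through the logistic
            k -= lr * np.mean(common * -(x - m))       # df/dk = f(1-f) * -(x - m)
            m -= lr * np.mean(common * k)              # df/dm = f(1-f) * k
        return lambda tau: 1.0 / (1.0 + np.exp(k * (tau - m)))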

29 Chapter 5 Synthesis 5.1 Overview The model assembled in the training phase allows the system to animate a character in real time from a novel speech signal. Since the model is trained on animation segments which follow audio segments, it is able to predict the most appropriate animation as soon as a new syllable is detected. As noted earlier, gesture strokes terminate at or before the intensity peak of the accompanying syllable [30]. The syllable segmentation algorithm detects a syllable in continuous speech when it encounters the syllable peak of the next syllable, so new gestures begin at syllable peaks. Consequently, the previous gesture ends at or before a syllable peak. To synthesize the most probable and coherent animation, the system computes a vector of forward probabilities along with direct probabilities based on the previously selected hidden state. The forward probabilities carry context from previous observations, while the direct probabilities indicate likely transitions given only the last displayed animation and current observation. Together, these two vectors may be combined to obtain a prediction that is both probable given the speech context and coherent given the previously displayed state, resulting in animation that is both smooth and plausible. The synthesis process is illustrated in Figure 5.1. Once the most probable and coherent motion cluster has been selected, the system selects a single segment from this cluster which blends well with the current pose of the 22

character. This segment is then blended into the animation stream, as described in Section 5.5.

Figure 5.1: Diagram of the synthesis algorithm for one observation. A_t is the t-th animation state, V_t is the t-th observation, α_t is the forward probability vector, τ̄_t is the vector of expected τ values, and γ_t is the direct probability vector. For each new syllable observation, the direct probabilities γ_t are updated from A_{t-1} via Equation 5.4, the forward probabilities α_t and τ̄_t via Equations 5.1 and 5.3, and the next animation state A_t is then chosen by stochastic sampling.

5.2 Forward Probabilities

Given the parameters of a hidden Markov model and a sequence of syllable-observations {V_1, V_2, ..., V_t}, the most probable hidden state at time t may be efficiently determined from forward probabilities [40]. Since transition probabilities depend on the elapsed fraction τ'_t(i) at the current observation, we must take it into account when updating forward probabilities. To this end, the system maintains both a vector of forward probabilities α_t, and a vector τ̄_t of the expected values E(τ_t(i) | A_t = c_i) at the t-th observation. Using the expected value of τ_t(i) rather than a more complex probability distribution over possible values implies the assumption that the distribution of τ_t(i) is convex, that is,

that using the expected value gives a good approximation of the distribution. This assumption may not hold in the case of two or more probable timelines which transition into m_i at different points in the past, resulting in two or more peaks in the distribution of τ_t(i). However, we assume that this case is not common.

As before, we define a vector τ̄'_t(i) = τ̄_{t-1}(i) + Δτ_t(i). Using this vector, computing the transition probabilities between any two states at observation V_t is analogous to Equation 4.1:

P(A_t | A_{t-1}) =
  { f_i(τ̄'_t(i))                     if A_t = A_{t-1} = c_i
  { (1 - f_j(τ̄'_t(j))) T_{c_j, A_t}  if A_{t-1} = c_j and A_t ≠ c_j
  { T_{A_{t-1}, A_t}                 otherwise.

With this formula for the transition between any two hidden states, we can use the usual update rule for α_t:

α_t(i) = η Σ_{k=1}^{2n} α_{t-1}(k) P(A_t = a_i | A_{t-1} = a_k) P(V_t | a_i),        (5.1)

where η is the normalization value. Once the forward probabilities are computed, τ̄_t must also be updated. This is done according to the following update rule:

τ̄_t(i) = E(τ_t(i) | A_t = c_i) = [ Σ_{k=1}^{2n} P(A_t = c_i, A_{t-1} = a_k) E(τ_t(i) | A_t = c_i, A_{t-1} = a_k) ] / P(A_t = c_i).        (5.2)

Since only s_i and c_i can transition into c_i, P(A_t = c_i, A_{t-1}) = 0 if A_{t-1} ≠ c_i and A_{t-1} ≠ s_i. If A_{t-1} = s_i, this is the first event during this motion, so the previous value τ_{t-1}(i) must have been zero. Therefore, E(τ_t(i) | A_t = c_i, A_{t-1} = s_i) = Δτ_t(i). If A_{t-1} = c_i, the expected previous value of τ_{t-1}(i) is simply τ̄_{t-1}(i), so the new value is E(τ_t(i) | A_t = c_i, A_{t-1} = c_i) = τ̄_{t-1}(i) + Δτ_t(i) = τ̄'_t(i). Therefore, we may reduce Equation 5.2 to:

τ̄_t(i) = [ P(A_t = c_i, A_{t-1} = c_i) E(τ_t(i) | A_t = c_i, A_{t-1} = c_i)
         + P(A_t = c_i, A_{t-1} = s_i) E(τ_t(i) | A_t = c_i, A_{t-1} = s_i) ] / P(A_t = c_i)
       = [ α_{t-1}(c_i) f_i(τ̄'_t(i)) τ̄'_t(i) + α_{t-1}(s_i) T_{s_i, c_i} Δτ_t(i) ] / α_t(c_i).        (5.3)

Once the forward probabilities and τ̄_t are computed, the most probable state could simply be selected from α_t(i). However, choosing the next state based solely on forward probabilities would produce an erratic animation stream, since the most probable state may alternate between several probable paths as new observations are provided.

5.3 Most Probable and Coherent State

Since the system has already displayed the previous animation state A_{t-1}, it can estimate the current state from only the previous state and current observation to ensure a high-probability transition, and thus a coherent animation stream. Given τ'_t(i) = τ_{t-1}(i) + Δτ_t(i) for the currently active motion i, the system computes transition probabilities just as in Section 4.3. If the previous state is the start state s_i, transition probabilities are given directly as P(A_t | A_{t-1} = s_i) = T_{s_i, A_t}. If the previous state is the continuation state c_i, the transition probability P(A_t | A_{t-1} = c_i) is given by Equation 4.1. With these transition probabilities, the vector of direct probabilities γ_t can be computed in the natural way as:

γ_t(i) = η P(A_t = a_i | A_{t-1}) P(V_t | a_i).        (5.4)

Using γ_t directly, however, would quickly drive the synthesized animation down an improbable path, since context is not carried forward. Instead, γ_t and α_t can be used together to select a state which is both probable given the sequence of observations and coherent given the previously displayed animation state. The final distribution for the current state is therefore computed as the normalized pointwise product of the two distributions, α_t γ_t. As shown in Figure 5.2, this method also generally selects states that have a higher forward probability than simply using γ_t directly.

Figure 5.2: The image shows the states selected over 40 observations with max_i γ_t(i) (red) and max_i α_t(i)γ_t(i) (green), with dark cells corresponding to high forward probability; the state index runs along the vertical axis. The graph plots the exact probability of the selections, as well as max_i α_t(i).

The green lines, representing states selected using the combined method, tend to coincide with darker areas, representing higher forward probabilities, though they deviate as necessary to preserve coherence.

To obtain the actual estimate, the most probable and coherent state could be selected as arg max_i (α_t γ_t)(i). However, always displaying the optimal animation is not necessary, since various gestures are often appropriate for the same speech segment. Instead, the current state is selected by stochastically sampling the state space according to this product distribution. This has several desirable properties: it introduces greater variety into the animation, prevents grating repetitive motions, and makes the algorithm less sensitive to issues arising from the specific choice of clusters. To illustrate this last point, consider a case where two fast gestures receive probabilities of 0.3 each, and one slow gesture receives a probability of 0.4. Clearly, a fast gesture is more probable, but a highest-probability selection will play the slow gesture every time. Randomly sampling according to the distribution, however, will properly display a fast gesture six times out of ten.
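Putting Equations 5.1, 5.3, and 5.4 together, one synthesis step might be organized as in the sketch below. It assumes the matrices T and O and the per-cluster logistics f from Chapter 4 (with s_i and c_i stored at indices 2i and 2i + 1), takes delta_tau as the fraction of each motion that has elapsed since the last syllable, and makes two simplifications that should be read as such: the counted row for c_i is reused as its leave-distribution, and the direct probabilities reuse the expected τ̄ rather than the exact elapsed fraction of the displayed motion. The early and late termination correction of Section 5.4 is omitted.

    import numpy as np

    def build_transitions(T, tau_prime, f):
        """Per-observation transition matrix: continuation states keep probability
        f_i(tau'), and the remainder is distributed over the counted transitions out."""
        n = T.shape[0] // 2
        P = T.copy()
        for i in range(n):
            c = 2 * i + 1                              # continuation state of motion i
            stay = f[i](tau_prime[i])
            P[c, :] = (1.0 - stay) * T[c, :]
            P[c, c] = stay
        return P

    def synthesis_step(alpha, tau_bar, prev_state, obs, delta_tau, T, O, f):
        n = T.shape[0] // 2
        tau_prime = tau_bar + delta_tau
        P = build_transitions(T, tau_prime, f)
        # forward update (Equation 5.1)
        alpha_new = (alpha @ P) * O[obs, :]
        alpha_new /= alpha_new.sum()
        # direct probabilities from the previously displayed state (Equation 5.4)
        gamma = P[prev_state, :] * O[obs, :]
        gamma /= gamma.sum()
        # combine, then sample the next animation state from the product distribution
        combined = alpha_new * gamma
        combined /= combined.sum()
        next_state = np.random.choice(len(combined), p=combined)
        # expected elapsed fraction per motion (Equation 5.3, with the two
        # incoming paths into c_i renormalized)
        tau_new = np.zeros(n)
        for i in range(n):
            s, c = 2 * i, 2 * i + 1
            w_c = alpha[c] * f[i](tau_prime[i])
            w_s = alpha[s] * T[s, c]
            if w_c + w_s > 0:
                tau_new[i] = (w_c * tau_prime[i] + w_s * delta_tau[i]) / (w_c + w_s)
        return alpha_new, tau_new, next_state

Replacing np.random.choice with an argmax over the combined distribution would give the deterministic variant discussed above, at the cost of the variety that stochastic sampling provides.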

34 CHAPTER 5. SYNTHESIS Early and Late Termination As discussed in Section 4.2, the system accounts for variation in motion length by terminating a motion if another syllable is unlikely to occur within that motion. In the case that this event is mispredicted and the motion is not terminated, it may end between syllable observations. To handle this case, the system re-examines the last observation by restoring α t 1 and τ t 1 and performing the update again, with the new evidence that, if A t 1 = c i or s i, A t c i. This corrects the previous misprediction, though the newly selected motion is not always as far along as it should be, since it must start from the beginning. 5.5 Constructing the Animation Stream Once the animation state and its corresponding cluster of animation segments is chosen, we must blend an appropriate animation segment from this cluster into the animation stream. Although segments within a motion cluster generally share a similar pose, there is some variation. Because of this, some segments in a cluster may be closer to the last frame in the animation stream than others. The system performs the actual segment selection by considering only the segments which begin in a pose similar to that of the last frame. In practice, the system selects a subset of a cluster, the pose difference of every element of which is within some tolerance of the minimum pose difference for that cluster. One of the segments that fall within this tolerance is randomly selected. Random selection avoids jarring repetitive motion when the same animation state occurs multiple times consecutively. Once the appropriate segment is selected, it must be blended with the previous frame to create a smooth transition. Although linear blending is generally sufficient for a perceptually smooth transition [51], it requires each motion to have enough extra frames at the end to accommodate a gradual blend, and simply taking these frames from the original motion capture data may introduce extra, unwanted gestures. For many of the motions, the blend interval would also exceed the length of the gesture, requiring many gestures to be linearly blended simultaneously. Instead, the proposed method uses a velocity-based blending algorithm, which keeps the magnitude of the angular velocity on each joint equal to that of the desired animation,

and adjusts joint orientations to be as close as possible to the desired pose within this constraint. For a new rotation quaternion r_t, previous rotation r_{t-1}, desired rotation d_t, and the derivative in the source animation of the desired frame, Δd_t = d_t d_{t-1}^{-1}, the orientation of a joint at frame t is given by

r_t = slerp( r_{t-1}, d_t ; angle(Δd_t) / angle(d_t r_{t-1}^{-1}) ),

where slerp is the quaternion spherical interpolation function [45], and angle gives the angle of the axis-angle representation of a quaternion.

The rotation of each joint can be represented either in parent-space or world-space. Using a parent-space representation drastically reduces the chance of unnatural interpolations (such as violation of anatomic constraints), while using a world-space representation preserves the feel of the desired animation, even when the starting pose is different, by keeping to roughly the same alignment of the joints. For example, a horizontal gesture performed at hip level would remain horizontal even when performed at chest level, so long as a world-space representation of the joint rotations is used. In practice, the artifacts introduced by world-space blending are extremely rare, so all joints are blended in world-space with the exception of the hands, which are blended in parent-space. Being farther down on the skeleton, the hands are particularly susceptible to unnatural poses as a result of world-space blending.

The velocity-based method roughly preserves the feel of the desired animation even when the starting pose is different, by maintaining the same velocity magnitude. One effect of this is that the actual pose of still segments, or holds, is relatively unimportant, since when Δd_t ≈ 0, the last frame is held indefinitely. Due to noise and low-level motion, these segments will still eventually drift toward the original segment pose, but will do so very gradually.
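For a single joint, the velocity-based blend above could be implemented roughly as follows. Quaternions are assumed to be unit length in (w, x, y, z) order, the helper names are illustrative, and the interpolation parameter is clamped to [0, 1] so a step never overshoots the desired pose.

    import numpy as np

    def quat_mul(a, b):
        w1, x1, y1, z1 = a; w2, x2, y2, z2 = b
        return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                         w1*x2 + x1*w2 + y1*z2 - z1*y2,
                         w1*y2 - x1*z2 + y1*w2 + z1*x2,
                         w1*z2 + x1*y2 - y1*x2 + z1*w2])

    def quat_inv(q):
        return q * np.array([1.0, -1.0, -1.0, -1.0])   # conjugate of a unit quaternion

    def quat_angle(q):
        return 2.0 * np.arccos(np.clip(abs(q[0]), 0.0, 1.0))

    def slerp(a, b, t):
        t = np.clip(t, 0.0, 1.0)
        d = np.clip(np.dot(a, b), -1.0, 1.0)
        if d < 0:                                      # take the short way around
            b, d = -b, -d
        theta = np.arccos(d)
        if theta < 1e-6:
            return b
        return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

    def blend_joint(r_prev, d_t, d_prev):
        """One step of the velocity-based blend: rotate from the previous output pose
        toward the desired pose by no more than the angular step the source animation
        itself takes this frame."""
        delta_d = quat_mul(d_t, quat_inv(d_prev))      # per-frame rotation of the source
        denom = quat_angle(quat_mul(d_t, quat_inv(r_prev)))
        if denom < 1e-6:
            return d_t                                 # already at the desired pose
        return slerp(r_prev, d_t, quat_angle(delta_d) / denom)

Holds fall out naturally from this rule: when the source animation is nearly still, delta_d is close to the identity, the step size is close to zero, and the previous pose is simply held.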

36 Chapter 6 Evaluation 6.1 Evaluation Methodology There is no single correct body language sequence for a given utterance, which makes validation of the system inherently difficult. The only known way to determine the quality of synthesized body language is by human observation. To this end, I conducted four surveys to evaluate the method. Participants unfamiliar with the details of the system were recruited from a broad group of university students. Participants were asked to evaluate three animations corresponding to the same utterance, presented in random order on a web page. The utterances ranged from 40 to 60 seconds in length. In addition to the synthesized sequence, two controls were used. One of the controls contained motion capture of the original utterance being spoken. The other control was generated by randomly selecting new animation segments whenever the current segment terminated, producing an animation that did not correspond to the speech but still appeared generally coherent. The original motion capture provides a natural standard for quality, while random selection represents a simple alternative method for synthesizing body language in real time. Random selection has previously been used to animate facial expressions [38] and to add detail to coarsely defined motion [26]. As observed in [22], the lack of facial animation tends to distract viewers from evaluating synthesized gestures. Therefore, the videos featured a simplified model without a face, as shown in Figure 6.1. Two sets of training data were used in the evaluation, from two different speakers. 29

Figure 6.1: An excerpt from a synthesized animation used in the evaluation study. ("We just, uh, you know... we just hung around, uh... went to the beach.")

Training set I consisted of 30 minutes of motion capture data in four scenes, recorded from a trained actor. Training set II consisted of 15 minutes in eight scenes, recorded from a good speaker with no special training. For each training set, two surveys were conducted. Surveys AI and AII sought to determine the quality of the synthesized animation compared to motion capture of the same speaker, in order to ensure a reliable comparison without interference from confounding variables, such as gesturing style. These surveys used utterances from the training speakers that were not present in the training data. Surveys BI and BII sought to determine how well the system performed on an utterance from a novel speaker.

6.2 Results

After viewing the animations, participants rated each one on a five-point Likert scale for timing and appropriateness. To assess timing, participants were asked their agreement with the statement "the movements of character X were generally timed appropriately." To assess the appropriateness of the motions for the current speech, they were asked their agreement with the statement "the motions of character X were generally consistent with what he was saying." The questions and average scores for the surveys are presented in Figure 6.2. In all surveys, the synthesized sequence outperformed the random sequence, with p < . In fact, with the exception of survey AI, the score given to the synthesized sequences remained quite stable.
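The thesis does not state which statistical test produced these significance values; as a purely illustrative sketch, one could compare the Likert ratings of the synthesized and random sequences with a nonparametric test such as the Mann-Whitney U test. The ratings below are invented placeholder values, not survey data.

    from scipy.stats import mannwhitneyu

    # Hypothetical 5-point Likert ratings for one survey question (placeholder values only).
    synthesized_ratings = [4, 4, 5, 3, 4, 5, 4, 3, 4, 5]
    random_ratings = [2, 3, 2, 3, 1, 2, 3, 2, 2, 3]

    # One-sided test: are ratings for the synthesized sequence higher than for the random one?
    stat, p = mannwhitneyu(synthesized_ratings, random_ratings, alternative="greater")
    print(f"U = {stat:.1f}, p = {p:.4f}")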

Figure 6.2: Mean scores on a 5-point Likert scale for the four evaluation surveys (AI: same speaker, N=45; BI: novel speaker, N=37; AII: same speaker, N=40; BII: novel speaker, N=50), comparing the original motion capture, random synthesis, and the proposed method on the statements "Movements were timed appropriately" and "Motions were consistent with speech." Error bars indicate standard error.

The relatively low scores of both the synthesized and randomly constructed sequences in survey AI may be accounted for by the greater skill of the trained actor in set I, which is the reason that training set I is used in all other examples in this thesis and in the accompanying videos discussed in Section 6.3.

In the generalization surveys BI and BII, the original motion capture does not always outperform the synthesized sequence, and in survey BII its performance is comparable to that of the random sequence. This indicates that the two speakers used in the generalization tests had body language that was not as compelling as the training data. This is not unexpected, since skilled speakers were intentionally chosen to create the best possible training data, and different individuals may appear more or less comfortable with their body language in conversation. It may be difficult for viewers to separate their notions of believable and effective speakers (e.g., an awkward speaker will appear less realistic, even though he is just as human as a more skilled speaker). Even the random sequence used the gesture subunits from the training data, and its transitions were smooth, because motion segments were segmented at low-velocity boundaries and the blending was velocity-based. Therefore, it is reasonable to conclude that even a random sequence from a skilled speaker might appear more believable than the original motion capture of a less confident speaker. It is notable, however, that the system was able to synthesize gestures that were rated as appropriate as those in the same-speaker tests. The stability of the scores for the proposed method across the surveys indicates that it successfully transplanted the training speakers' more effective gesturing styles to the novel speakers.

6.3 Video Examples

In addition to the images provided in this thesis, a number of videos were created to demonstrate the quality of animations synthesized with the proposed system. The videos present examples of animations for numerous speakers, utterances with pronounced emphasis, and utterances with a clear emotional atmosphere. One of the sets of videos from the surveys is also included, along with separate evaluations of the selection and synchronization components, as discussed in Section 7.1. The videos may be viewed on a website, located at svlevine/vids/svlugthesis.htm.

Chapter 7

Discussion and Future Work

7.1 Applicability of Prosody and the Synchrony Rule

I presented a system for generating expressive body language animations for human characters from live speech input. The system is efficient enough to run on a 2 GHz Intel Centrino processor, making it suitable for modern consumer PCs. While drawing on previous work on animated ECAs, the proposed method addresses the somewhat neglected problem of animating human characters in networked virtual environments, which requires a real-time technique to handle unpredictable and varied extemporaneous speech. The system rests on two assumptions: that gestures co-occur with syllable peaks, and that prosody is useful for selecting appropriate body language. The former assumption is justified by the synchrony rule described in [30], and the latter by the relationship between prosody and emotion [1], and by extension body language. The effectiveness of the proposed method was validated by a comparative study that confirmed that these assumptions produce plausible body language that often compares well to the actual body language accompanying a sample utterance.

In addition to the survey discussed in Chapter 6, a pilot study was conducted in which participants were also shown a random animation with gestures synchronized to syllable peaks, and an animation generated with the proposed system but without the synchrony rule (i.e., all gestures ran to completion). Although these animations were not included in the final survey to avoid fatiguing the respondents, comments from the pilot study indicated that both synchrony and proper selection were needed to synthesize plausible body language. Respondents noted that the randomly animated character felt "a little forced" and "didn't move with the rhythm of the speech," while the character that did not synchronize with syllable peaks appeared "off sync" and "consistently low energy." Example animations that use random selection with synchronization and HMM-driven selection without synchronization may be viewed online, as discussed in Section 6.3. These findings suggest that both the synchrony rule and prosody-based gesture selection are useful for gesture synthesis.

Figure 7.1: An excerpt from a synthesized animation. ("Motorcycles become a fully immersive experience, where the sound of it, the vibration, the seating, it all matters.")

7.2 Limitations

Despite the effectiveness of this method for synthesizing plausible animations, it has several inherent limitations. Most importantly, relying on prosody alone precludes the system from generating meaningful iconic gestures when they are not accompanied by emotional cues. Metaphoric gestures are easier to select because they originate from a smaller repertoire and are more abstract [30], and therefore more tolerant of mistakes, but iconic gestures cannot be guessed without interpreting the words in an utterance. This fundamental limitation may be addressed in the future with a method that combines rudimentary word recognition with prosody-based gesture selection. Word recognition may be performed either directly or from phonemes, which can already be extracted in real time [37].

A second limitation of this method is that it must synthesize gestures from information that is already available to the listener, thus limiting its ability to provide supplementary details. While there is no consensus in the linguistics community on just how much information is conveyed by gestures [27], a predictive speech-based real-time method is unlikely to impart to the listener any more information than could be obtained from simply listening to the speaker attentively, whereas real gestures often convey information not present in speech [30]. Therefore, while compelling and automatic animation of human-controlled characters brings clear benefits in immersiveness and realism, such methods cannot provide additional details without some form of additional input. This additional input, however, need not be as obtrusive as keyboard or mouse controls. For example, facial expressions carry more information about emotional state than prosody [1], which suggests that more informative gestures may be synthesized by analyzing the speaker's facial expressions, for example through a consumer webcam.

7.3 Future Work

Besides addressing the limitations of the proposed system, future work may also expand its capabilities. Training data from multiple individuals, for example, may allow the synthesis of personalized gesture styles, as advocated by [36]. A more advanced method of extracting gestures from the training data may allow the principal joints of a gesture to be identified automatically, which would both eliminate the need for the current separation of leg, arm, and head gestures and allow for more intelligent blending of gestures with other animations. This would allow a gesturing character to engage in other activities with realistic interruptions for performing important gestures. In addition to animating characters, I hope that the proposed method will eventually reveal new insight into how people form and perceive gestures. The pilot survey already suggests that the method may be useful for confirming the validity of the synchrony rule and the relationship between prosody and gesture. Further work could explore the relative importance of various gesture types, as well as the impact of timing and appropriateness on perceived gesture quality.

Figure 7.2: Synthesized body language helps create an engaging virtual interaction. ("Which is also one of those very funny episodes that are in this movie.")

7.4 Conclusion

In this thesis, I present a new method for generating plausible body language for a variety of utterances and speakers, such as the virtual interaction portrayed in Figure 7.2, or the story being told by the character in Figure 7.1. The use of vocal prosody allows the system to detect emphasis, produce gestures that reflect the emotional state of the speaker, and select a rhythm that is appropriate for the utterance. Since the system generates compelling body language in real time without specialized input, it is particularly appropriate for animating human characters in networked virtual worlds.
