Semantic-based Audio Recognition and Retrieval

Semantic-based Audio Recognition and Retrieval

Colin R. Buchanan

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2005

Abstract

This study considers the problem of attaching meaning to non-speech sound. The purpose is to demonstrate automated annotation of a sound with a string of semantically appropriate words, and also retrieval of the sounds most relevant to a given textual query. This is achieved by constructing acoustic and semantic spaces from a database of sound and description pairs and using statistical models to learn similarity in each space. The spaces are then linked to allow retrieval in either direction. A key aspect is effective prediction of novel events through generalisation from known examples. The motivation and implementation of the system are described using such techniques and representations as Mel frequency cepstral coefficients, Gaussian mixture models, hierarchical clustering and latent semantic analysis. System results are evaluated with automatic classification measures and human judgements, demonstrating that this is an effective method for annotation and retrieval of general sound.

Acknowledgements

Firstly, I would like to thank Steve Renals for his indispensable help, feedback and direction throughout, for which I am extremely grateful. I would also like to thank Dimitrios Zeimpekis and Efstratios Gallopoulos at the University of Patras, Greece for allowing me to use their Text to Matrix Generator (TMG) toolbox. Thanks are also due to Daniel P. W. Ellis at Columbia University for providing freely available code for MFCC extraction and to Ian T. Nabney for producing the Netlab machine learning toolbox.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Colin R. Buchanan)

Table of Contents

1 Introduction
  1.1 Aims and Objectives
  1.2 Overview
  1.3 Literature Review
    1.3.1 Audio Classification
    1.3.2 Audio Retrieval
    1.3.3 Image and Multimedia Retrieval
  1.4 Dataset
  1.5 Acoustic-Semantic Framework
2 The Acoustic Model
  2.1 Feature Extraction
    2.1.1 Mel Frequency Cepstral Coefficients
    2.1.2 The Delta-cepstrum
  2.2 Audio Classification
    2.2.1 Gaussian Mixture Models
  2.3 Class-based Prediction
  2.4 Experiments
    2.4.1 The MFCC Parameterisation
    2.4.2 The Effect of the Delta-cepstrum
    2.4.3 Silence Detection and Removal
    2.4.4 The Number of Mixture Components
3 The Semantic Model
  3.1 The Vector Space Model
  3.2 Latent Semantic Analysis
    Query Matching
    Limitations of LSA
  3.3 The Number of Singular Values
4 Linking Acoustic and Semantic Spaces
  4.1 The Distributions of Acoustic and Semantic Space
  4.2 Acoustic to Semantic Linkage
    Clustering the Acoustic Space
    The Word Model
    Interpolation of Semantic Predictions
  4.3 Semantic to Acoustic Linkage
    Retrieving Sounds from an Unlabelled Database
  4.4 The Complete Acoustic-Semantic Framework
5 Evaluation and Discussion
  5.1 Automatic Evaluation
    Measuring Annotation Performance
    Measuring Retrieval Performance
  5.2 Subjective Experiments
  5.3 Analysis
    Comparison of Objective and Subjective Evaluation
    Prediction on Novel Examples
6 Conclusions
  6.1 Summation
  6.2 Limitations and Future Work
  6.3 Concluding Remarks
A Subjective Test Material
  A.1 Acoustic to Semantic
  A.2 Semantic to Acoustic
Bibliography

Chapter 1
Introduction

For decades researchers of automatic speech recognition have addressed the problem of interpreting speech by machine, their efforts over time leading to a considerable understanding of the domain and numerous practical applications. However, despite the impressive activity on speech interpretation and a well established practice of audio and signal processing, there has been little emphasis on automatic understanding of non-speech.

This study focuses on making sense of non-speech audio, chiefly for two related purposes: intelligently labelling a given sound and retrieving sound(s) from a database via a textual description. For instance, imagine a system that given an input sound of a lion roaring could return the label "lion roaring", and given an input prompt of "lion roar" could retrieve from a database samples most like that of a lion roaring. Previous and current research in audio classification tends to focus on matching test sounds into a limited number of predefined categories such as music, applause, speech etc.; the approach taken here instead describes each sound with a string of semantically appropriate words. Furthermore, the proposed system should allow intelligent interpretation of unseen examples, e.g. describe a tiger roaring based on its similarity to previously seen events. The analogy in human perception is that we can easily describe a new sound by its relation to other sounds.

Retrieval systems are often based on query by keyword, e.g. literal keyword matching (requiring annotations paired with sounds), a method familiar to users accustomed to web search engines. This study augments audio retrieval by finding semantic concepts linked to words in the user's query in order to match the most relevant sounds in a

database. Such an approach improves over simple literal query matching, as the user need not know exact search terms to achieve useful results. For example, a search for "boat", though ranking exact matches highest, could also predict sounds described by related words, e.g. "kayak" or "jet-ski". An extension is made to retrieve sounds from unlabelled databases, where selection is based on acoustic similarity to known sounds.

Such a semantic system would be of benefit in many applications. For example, audio databases could be better accessed through semantic based queries. Currently, audio annotation is typically performed manually; an attractive alternative is an automatic annotation system capable of classifying and labelling an entire audio archive with minimal cost. In particular, the abundance of raw audio content already available and the continual growth of multimedia archives create a pressing need for automated handling of audio/multimedia content for the tasks of indexing and retrieval.

1.1 Aims and Objectives

The principal goal is to demonstrate that this approach to modelling the semantics of sound is highly appropriate for the task of labelling and retrieving audio. The objectives to achieve this goal are:

- Construction of an acoustic model using motivated audio features and an appropriate classification method
- Construction of a semantic model with an appropriate semantic representation and classification method
- A mapping between the acoustic and semantic spaces to allow retrieval in either direction, demonstrating generalisation to deal with novel examples in a reasonable manner
- Evaluation of system performance against baseline methods and through human judgement

1.2 Overview

Given these goals, a framework to develop such a system is outlined making use of appropriate techniques from the literature. Firstly, in the next section, a study of existing audio classification and retrieval systems is undertaken with emphasis on semantic attachment. Specifically, a state-of-the-art audio system developed by Malcolm Slaney in 2002 [31], [32] demonstrates intelligent labelling and retrieval of audio samples, and this work forms the basis of this study. Other related fields, such as areas of multimedia including image and video processing, have received more attention in semantic modelling and provide valuable insight. Section 1.4 describes the dataset used in this work, and section 1.5 outlines the steps for constructing a semantic audio system based on the proposed approach. Development follows engineering based methods such as signal processing, pattern recognition and stochastic models.

In chapter 2, methods are described for extracting high-level audio features (Mel frequency cepstral coefficients) and measuring acoustic similarity using Gaussian mixture models. Suitable experiments are motivated and described for aspects of the audio parameterisations in order that the most effective values can be established for the task. Likewise, chapter 3 presents suitable ideas for a semantic space, utilising latent semantic analysis to model related search terms as single concepts. To allow acoustic-to-semantic and semantic-to-acoustic queries, a linkage between acoustic and semantic spaces is described in chapter 4. Clustering is applied to the acoustic space to permit a general-to-specific hierarchy, and a word model is employed to predict relevant words. Semantic retrieval uses a mapping from the semantic query to the acoustic domain, using the acoustic model to predict acoustically appropriate sounds. Forms of interpolation to improve operational results are described for both mappings.

Chapter 5 presents suitable evaluation methodology, where both large-scale automatic evaluation and small-scale subjective tests are performed. Evaluation tasks test prediction performance in both retrieval directions against baseline methods using held-out data. Subjective evaluation tests manual ratings of predictions against those of true sounds/descriptions. Finally, in chapter 6, analysis allows us to infer some overall conclusions about the value and future direction of the semantic audio system.

1.3 Literature Review

Though audio content has never been lacking, for years it has often been overlooked while multimedia research efforts predominantly focused on image and video elements. However, in recent years there has been increasing interest in automatically processing audio content for both indexing and retrieval, particularly for the purposes of integration with multimedia systems, e.g. the University of Mannheim's Movie Content Analysis (MoCA) project [24]. Similarly, there has also been interest in automated handling of audio archives, e.g. the Muscle Fish SoundFisher system [36]. Much of this work is experimental and no comprehensive techniques have been established, yet there is a rich literature to exploit from many closely related fields such as speech recognition, speaker identification, music classification and information retrieval, all of which are discussed as appropriate.

1.3.1 Audio Classification

In recent years a substantial literature on audio classification has developed. Approaches mainly differ in the set of acoustic features used to represent the audio signal and the classification technique applied. For example, a violence detection system developed for the MoCA project [24] predicts gunshots, explosions and cries based on statistics of the waveform (e.g. measures of amplitude and frequency) using correlation and Euclidean distance measures. Another system for speech, music and noise segmentation and classification, developed by Lu & Hankinson, uses similar waveform statistics and decision tree based classification [23]. However, most current research only concerns a small number of sound types (often involving some speech content), e.g. music and speech discrimination [29] or silence, music, speech, noise classification [20]. Consequently the features and discrimination techniques are tailored to a specific domain and are unlikely to apply well to the general case. Nevertheless, this research provides valuable insight into effective classification techniques and acoustic features.

Classification Techniques

Several techniques have been employed for the purpose of classifying an unknown sound. The principle is to measure similarity between an input feature vector and those of known sounds. In the early days of speech processing, template matching between feature vectors was the intuitive approach. Current acoustic research favours stochastic models, which provide more flexibility and more theoretically meaningful likelihood scores. Of these the most common approaches are Gaussian model based methods [29], [32], [21], hidden Markov models [38], nearest neighbour methods [21], [29], neural network variants [21], vector quantisation [13], [25] and support vector machines [35].

For example, Scheirer & Slaney [29] investigate Gaussian based models, nearest neighbour and spatial partitioning approaches employing 13 different acoustic features (such as spectral centroid and zero-crossing rate) for a speech and music discrimination task. They conclude that the topology of the feature space is rather simple and that there may be little performance difference between classification methods. This claim is also backed by other larger studies; for example, Liu & Wan conclude that four classification techniques all achieve similar results (between 56-64% accuracy) on a larger-scale 29-class problem. Furthermore, both Scheirer & Slaney [29] and Li et al. [20] argue that the choice of acoustic feature appears to be more critical than the classification method.

Acoustic Features

A broad selection of acoustic features has been applied with varying success on different tasks. Generally features are either derived from simple measures of a waveform (e.g. energy functions, fundamental frequency) or may be motivated by perception, such as pitch and loudness. For example, the general audio study by Zhang and Kuo [38] applies an energy function to measure amplitude variation over time, zero-crossing rate to estimate spectral properties, fundamental frequency to capture harmonic properties of the signal, etc. Such features obtained from the time, frequency and time-frequency domains are numerous, and a comprehensive study for the case of general audio is undertaken by Liu & Wan, considering 87 features for a content-based classification task in order to build an optimal feature vector [21]. Likewise, the speech and music discrimination work by

Scheirer & Slaney tests combinations of 13 acoustic properties [29]. Composite feature vectors obtained through such work have been used effectively for discrimination (Scheirer & Slaney report less than 2% error on a small test set). Both studies show that optimal feature selection depends on the domain and classification technique. As the feature compositions are optimised for a specific domain, they are unlikely to scale well to more complex discrimination tasks.

Alternatively, properties motivated by perception such as pitch, loudness and timbre are clearly important for us to distinguish between sounds but are difficult to quantify. Attempts to model human auditory perception in every detail are impractical due to the complexity and only partial knowledge of the process. However, compact representations of a signal can capture significant frequency and energy information in an attempt to model known perceptual properties. In speech research, features such as Mel frequency cepstral coefficients (MFCCs) or linear prediction coefficients (LPCs) have been demonstrated to provide good representations of a speech signal, allowing for better discrimination than temporal or frequency based features alone [18].

However, as both MFCCs and LPCs are intended to model speech, their effectiveness with non-speech is questionable. In particular, LPCs are based on speech production rather than perception, and the rudimentary vocal tract model is unlikely to provide a good representation of more general sounds, which may often lack resonance and exhibit fricative sources (though both Liu & Wan [21] and Li et al. [20] use them effectively with non-speech). MFCCs, on the other hand, are derived from a sinusoidal based expansion of the energy spectrum and are capable of capturing more varied spectral phenomena. MFCCs correspond to a frequency smoothed log-magnitude spectrum which suppresses undesirable spectral variation, particularly at higher frequencies [7]. This perceptual motivation makes them ideal for general audio discrimination as they capture crucial properties used in human hearing. MFCCs are ubiquitous in speech research, but they have been applied successfully in non-speech tasks such as the music system developed by Pye [25], another by Berenzweig et al. [3], and also more general audio studies by Foote [13] and Liu & Wan [21]. Li et al. conclude from their study that cepstral features such as MFCCs perform better than temporal or frequency based features and advocate their use for general audio tasks, particularly when the number of audio classes is large [20].

Semantic Attachment

Regardless of the choice of acoustic features or classification method, all content-based classification systems treat prediction as a statistical classification into one of a number of predefined classes, ignoring any notion of meaning. However, the intention of this work is to describe each sound with a string of semantically appropriate words, based on the known descriptions. Some semantic representation and a method to link predictions of a classifier to a point in semantic space is required.

It is in this manner that Slaney proposes a state-of-the-art system which incorporates a mapping between audio and semantic spaces [31], [32]. Methods are developed to describe general audio with words (and also predict sounds given a text query) using a labelled sound set. In brief, audio is represented by a stacked MFCC vector, using linear discriminant analysis to reduce dimensions and promote separation of acoustic classes. To predict acoustic similarity, Gaussian mixture models (GMMs) are applied, and a clustering method is used to permit generalisation from the training sounds. To generate a description given a test sound, a linkage is made to predict words from the descriptions associated with the most similar training sounds. In his initial work [32] prediction only involves the single best acoustic answer; the later study [31] employs a mixture of experts approach [33] to interpolate between answers and predict more suitable descriptions. This concept provides an effective framework on which to build the proposed system, though implementation is also influenced by other work. Where relevant, notable differences and their justification are described.

The evaluation phase of Slaney's initial study only consists of demonstrating examples from the training material [32]. The later study involves evaluation with a held-out test set which is used to test predicted labels against true labels [31]. We hope to supplement evaluation by also testing against baseline methods and involving human judgement. Content-based approaches exhibit an inability to suitably predict a test sound of a type not in the database; the intention of the proposed system is to judge novel events based on similarity to known examples. In this way, a semantic approach can deal with a more extensive set of acoustic events provided the initial training set allows for such generalisation. Slaney achieves this by clustering the acoustic space to create a general-to-specific hierarchy of sounds [32].

1.3.2 Audio Retrieval

Foote provides a now slightly dated but comprehensive overview of audio retrieval [14]. Typical approaches are query by example (QBE), which allows retrieval of sound(s) based on similarity to acoustic properties of user supplied sounds or templates, or query by keyword (QBK), which allows users to search via textual queries but requires annotations paired with sounds.

A classic example of QBE is the query by humming method used to retrieve music by humming a melody (based on whether a note is higher or lower in pitch than the previous note) developed by Ghias et al. [17]. While this is surprisingly effective for retrieval of music scores, such queries are not particularly natural or convenient for other sound types. The Muscle Fish system (later developed into the commercial application SoundFisher) implements retrieval for a general audio database based on similarity between psycho-acoustic properties, e.g. loudness, pitch, harmony [36]. The system measures similarity (by Mahalanobis distance) between a new sound and sounds in a database, which are then ranked on proximity. Alternatively, the database can also be sorted via parametric relations such as pitch and brightness. The authors demonstrate retrieval on a manually annotated collection of 400 sounds (classified into laughter, percussion etc.) but do not formally evaluate retrieval accuracy. Foote applies a vector quantisation approach to retrieval using MFCC audio features; he also evaluates against the sounds in the Muscle Fish database with a similar demonstration [13].

The advantage of the QBE approach is that similarity is derived from the audio signal (annotations are not required) and can therefore be applied inexpensively on a large scale. However, QBE is not orientated towards the kind of audio semantics proposed for this study, where queries based on acoustic properties may well be effective at finding acoustic similarity but not necessarily higher-level semantic relations. This indicates a gulf between users' needs and current QBE methods. It is apparent that modelling the high-level meaning of sound requires semantic (e.g. textual) content.

In essence, QBK is the same problem as conventional text information retrieval, the aim being to retrieve relevant documents (though associated with sounds) through a textual query. Though literal keyword matching is the simplest approach and capable of retrieving exact matches, due to the subjective nature of descriptions it can be difficult

to satisfy a particular query. For example, search terms differ from one user to the next, in the worst case resulting in frustrated or failed searches. However, retrieval can be augmented by grouping closely related terms as concepts to allow matching of the most relevant sounds in a database, ideally finding similarity in the same way a real user would.

A suitable information retrieval technique is latent semantic analysis (LSA), devised by Deerwester et al. [9]. LSA is a vector based semantic approach designed to solve the underlying problem of synonymy through dimensionality reduction. The authors demonstrate it to be effective at improving retrieval of relevant documents over literal term matching, where users need not know exact search terms to achieve useful results. A technique derived from factor analysis is used to reduce a matrix of documents indexed by terms into a lower dimensional space. This effectively models the global usage patterns of terms so that documents sharing related concepts (rather than just literal terms) are represented by nearby vectors in the lower dimensional space, as the sketch below illustrates. The semantic system developed by Slaney uses an alternative approach where multinomial clustering is used to group together alike documents, in a similar manner to the approach used on the acoustic space [32]. In essence, this achieves a similar result to LSA, though Deerwester et al. argue that hierarchies are too limited to capture the rich semantics of most document collections [9].
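To make the reduction concrete, the following toy sketch applies a truncated singular value decomposition to a small term-document matrix. It is not code from the thesis: the four-word vocabulary, the counts and the rank k = 2 are invented purely for illustration.

    import numpy as np

    # Rows index terms ("dog", "bark", "howl", "engine"); columns index four
    # sound descriptions. Both the counts and the rank k are toy values.
    X = np.array([[1., 1., 0., 0.],     # "dog"    appears in descriptions 0 and 1
                  [1., 0., 1., 0.],     # "bark"   appears in descriptions 0 and 2
                  [0., 1., 1., 0.],     # "howl"   appears in descriptions 1 and 2
                  [0., 0., 0., 2.]])    # "engine" appears only in description 3

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    Uk, sk = U[:, :k], s[:k]
    docs = Vt[:k, :].T * sk             # one latent-space vector per description

    def fold_in(query_counts):
        """Map a raw term-count query into the latent space (q^T U_k S_k^-1)."""
        return (query_counts @ Uk) / sk

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    q = fold_in(np.array([1., 0., 0., 0.]))        # a query containing only "dog"
    print([round(cosine(q, d), 2) for d in docs])
    # Description 2 ("bark howl") scores well despite sharing no literal term
    # with the query, because its terms co-occur with "dog" elsewhere;
    # description 3 ("engine") scores near zero.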

1.3.3 Image and Multimedia Retrieval

The substantial literature encompassing image and multimedia research reveals a number of notable and relevant retrieval techniques. For example, IBM's Query by Image Content (QBIC) system [12] is a QBE approach allowing comparison of images on properties such as colour histograms, texture information, foreground objects (in a limited fashion), backgrounds etc., and can allow queries by colour distributions, example images or even user-constructed sketches. In some respects this is a very successful technique for image retrieval and can find images (or video) similar in property to the query. However, this system makes no pretence of attaching semantics to the queries; for example, finding an image of a bird could involve sketching a bird shape, and the system could find similarly shaped items (though not necessarily birds). This method cannot handle retrieval of differently shaped birds, which humans could readily recognise as birds nonetheless.

Studies by Barnard et al. [1] indicate that queries based on image histograms, texture etc. are uncommon, suggesting that this is not the natural way we think about media types. In contrast, their work on the semantics of words and pictures [1], [2] creates a semantic linking between images and words, allowing automated annotation and semantic retrieval much in the same manner as the proposed audio system. They implement a joint model where image features (such as measures of colour, texture and shape) are combined with text features to create a single feature space. The authors also introduce the concept of correspondence to associate labels with distinct regions of an image. Correspondence is unnecessary for the present acoustic model, where the foreground/background problem is not yet under investigation. This semantic image system has some similarities to the proposed work and provides insight into semantic modelling and representing the semantic space.

1.4 Dataset

Audio retrieval systems are typically created from insufficient audio datasets consisting of raw audio content with a truncated filename or perhaps a brief description to suffice for high-level information. Potentially, web-based audio retrieval could benefit from capturing words surrounding or related to sounds found on web pages, in the same manner as conventional information retrieval. Ideally, annotations should precisely describe audio content, but in practice they can vary considerably in consistency and comprehensiveness depending on the purpose and source. The insufficiency of crucial semantic information is a major obstacle for retrieval systems; clearly, richer annotation would benefit a semantic model.

Consequently, for this study a dataset of sounds paired with reasonably thorough and consistent descriptions was chosen. The dataset consists of a set of CD-ROMs containing over 3,000 isolated sounds with annotations (the XV Series Sound Effects Library from Sound Ideas). Though the samples are intended for film and multimedia production, the studio quality recording and consistency in labelling are ideal for this work. Sounds are divided into a broad range of general categories (e.g. airplane, animal, household sounds etc.), each with a suitable label (organised by hierarchy); see table 1.1. This type of concise labelling lends itself well to a workable semantic model.

Category      Title & Description                                                      Time
ANIMALS       animal, frog: great basin spadefoot toad: single call amphibian          0:01
ANIMALS       animal, wolf: timber-wolf: one wolf howling                              0:06
AUTOMOBILES   auto, police: ext: pass by at fast speed with siren emergency vehicle    0:17
HOUSEHOLD     household, toaster: pop up                                               0:03

Table 1.1: Example listings from the XV Series Sound Effects Library

The audio samples are recorded at a sampling rate of 44.1 kHz with two (stereo) audio channels. Before use, the stereo channels are mixed to a single monaural channel for each sample. Typically sounds are a few seconds in length and contain only a single sound source (or occasionally a mixture of related sounds). Descriptions contain on average eight words (with a maximum of 27 words), limited by information retrieval standards but sufficient for a practical model.

Before allocation of training and testing sets, some refinement and pre-processing of the dataset is required. Some of the samples fall into indistinct categories such as exercise equipment, gas station sounds or hospital sounds, and it proves difficult to create effective models which can distinguish between these sound types; consequently such examples are excluded from the dataset. This is justified as we would not expect to perceive the difference between the clanks of a bench press and various other indistinct sound types without visual cues. The refined dataset consists of sound types which we would reasonably expect a human user to distinguish between. Additionally, some categories are very similar to others (e.g. crashes and impacts), and the pre-processing stage involves merging similar categories and partitioning others. Finally, very short samples and a few of the longer ones are also excluded from the dataset.

The result is a set of 914 isolated sounds divided among 36 categories and 90 subcategories. From this bank of isolated sounds with associated descriptions, a conventional training, development and testing split was constructed. The training set is constructed from approximately 70% of the archive (allocated randomly), with the remainder of held-out examples divided between the development and test sets, as in the sketch below.
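Only the 70% training share is stated above, so the even split of the remainder between development and test sets in this sketch is an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(0)          # fixed seed for a repeatable split
    idx = rng.permutation(914)              # one index per sound in the refined set
    n_train = int(0.70 * len(idx))          # ~70% for training, allocated randomly
    n_dev = (len(idx) - n_train) // 2       # remainder shared between dev and test
    train, dev, test = np.split(idx, [n_train, n_train + n_dev])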

1.5 Acoustic-Semantic Framework

The overall purpose of the system is to predict words to describe a sound and, in the opposite direction, retrieve sounds via a textual query. To achieve this, separate acoustic and semantic spaces can be constructed. The acoustic space concerns the actual audio content and the semantic space the words attached to a sound, e.g. "dog barking". For retrieval we wish, given an example in one domain, to make the most appropriate prediction in the opposite domain.

Statistical models can be constructed for each domain, where mathematical feature spaces and appropriate classification techniques are applied. The purpose of these models is to provide a measure of similarity between points in a feature space. For example, in the acoustic space the model should predict acoustic similarity between sounds. In the semantic space the model should predict semantic similarity between the descriptions belonging to sounds. Thus each model can perform classification within the respective domain. Such models then serve as a channel to map to the opposite domain for retrieval. This is achieved by creating two one-way linkages between domains using the known relationships between sound and description pairs, thus achieving our goal of retrieval in either direction. Acoustic-to-semantic retrieval involves using the acoustic model to predict sounds matching an audio query; a word model then takes the associated descriptions and predicts words with a high likelihood of describing the query sound. Semantic-to-acoustic retrieval uses the semantic model to predict descriptions matching a textual query; the associated sounds can then be used by an audio retrieval component to predict sounds relevant to the query. This architecture is illustrated conceptually in figure 1.1, and a toy sketch of the two linkages follows below.

An alternative approach would be to combine both words and sounds into the same feature space and create a joint two-way model to predict words in one direction and sounds in the other, cf. the work of Barnard et al. [1]. However, as the proposed method focuses on distinct measures of acoustic and semantic similarity, two separate models are more appropriate.

In this work both sounds and descriptions are represented by a point in a high-dimensional vector space, a quality crucial for capturing significant information in the content, for reducing the effect of noise and for computational performance. Appropriate high-level features can vastly improve pattern recognition performance over low-level content and are more semantically meaningful [37]. As the chosen feature representations are critical for prediction performance, we discuss and justify the choices and where necessary carry out experiments to determine optimal parameters.
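As a toy, runnable illustration of the two linkages (not the thesis implementation, which uses the GMM and LSA machinery of chapters 2-4), suppose sounds are already points in an acoustic feature space and descriptions are bags of words; nearest-neighbour distance and word counting stand in here for the acoustic model and word model:

    import numpy as np

    # Three training pairs: acoustic points and their bag-of-words descriptions.
    sounds = np.array([[0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
    descriptions = [{"dog", "barking"}, {"dog", "growl"}, {"jet", "takeoff"}]

    def annotate(query_point, n_words=2):
        """Acoustic-to-semantic: predict words for an unlabelled input sound."""
        order = np.argsort(np.linalg.norm(sounds - query_point, axis=1))
        votes = {}
        for i in order[:2]:                     # two acoustically nearest sounds
            for word in descriptions[i]:
                votes[word] = votes.get(word, 0) + 1
        return sorted(votes, key=votes.get, reverse=True)[:n_words]

    def retrieve(query_words):
        """Semantic-to-acoustic: rank sounds by description overlap with a query."""
        overlap = [len(query_words & d) / len(query_words | d) for d in descriptions]
        return list(np.argsort(overlap)[::-1])  # indices of sounds, best first

    print(annotate(np.array([0.05, 0.95])))     # -> ['dog', ...]
    print(retrieve({"dog"}))                    # dog sounds first, jet last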

Figure 1.1: Overview of system architecture with example inputs and outputs

System components are designed modularly to aid integration of alternative methods. Parameters are controlled by a configuration file in order to facilitate experiments on parameter settings and optimisation. However, except where appropriate, we do not explicitly discuss exact implementation details, preferring more concise mathematical descriptions that can be replicated. Additionally, the aim of an automated system is that the modelling should be largely unsupervised. After the user creates a parameter configuration, the system then (automatically and without supervision) extracts features, learns similarity and a linking between domains, requiring only a pairing of sounds and their corresponding descriptions as input.

However, there are some limitations on the focus of the study. Firstly, the descriptions predicted by acoustic-to-semantic retrieval are not intended to be sentences, rather a collection of descriptive words. Also, for this study we limit the scope to isolated sounds of the type in our chosen dataset, and we do not investigate the effects of segmentation or of multiple sound sources occurring within samples.

Chapter 2
The Acoustic Model

The purpose of the acoustic model is to transform a given sound into a point in acoustic space where sounds can be compared on acoustic similarity. Distinguishing between sounds is not a trivial task; an audio archive will contain a broad range of sound types which are distinguished by different acoustic characteristics. The difficulty stems from the requirement to extract features capable of capturing the individuality of each sound source. It would be unrealistic to anticipate a solution to all audio classification problems, but the aim is to present a generally applicable solution.

2.1 Feature Extraction

As low-level waveforms are opaque and difficult to compare, a better strategy is to extract some higher-level, more meaningful feature from the signal. In order to facilitate accurate prediction, an acoustic feature should ideally distinguish between significant acoustic variation yet eliminate irrelevant spectral detail and noise which do not contribute to recognition. As discussed in our review of the literature, Mel frequency cepstral coefficients are a well established acoustic representation derived from the energy spectrum and are capable of capturing varied spectral phenomena. Though intended to model speech, they have been applied successfully in non-speech tasks such as the music system developed by Pye [25] and more general audio studies by Foote [13] and Liu & Wan [21]. This success indicates that they are a good starting point for general audio discrimination.

2.1.1 Mel Frequency Cepstral Coefficients

MFCCs are obtained through a frame-based analysis of a signal, where the waveform is divided into a sequence of frames (usually in the tens of milliseconds), the purpose being to smooth the frequency spectra and reduce the effects of acoustic variation. A sinusoidal transform (discrete Fourier transform) is performed using a Hamming window overlapping each frame to obtain an amplitude spectrum, which is then converted to a Mel-scale spectrum using triangular filters emphasising frequencies according to their perceptual importance on this scale. The particular scale used in this study incorporates linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz in order to reflect perception of frequency. The final stage takes the logarithm of the Mel-weighted spectrum, and another sinusoidal transform (discrete cosine transform) reconstructs these values into a number of cepstral coefficients augmented with a zeroth coefficient representing the overall energy of each frame [7].

The result is a vector of reasonably uncorrelated coefficients describing smoothed frequency and compressed amplitude information. The advantage of uncorrelated features becomes apparent when applying statistical models. Valuable lower and mid-range frequencies and energy are retained, yet the representation exhibits some robustness in the presence of noise. Clearly there is also a computational advantage in dimensionality reduction. MFCCs are popular in audio discrimination tasks because they capture acoustic properties useful in perception, and this motivation is instrumental in justifying their use for a semantic audio system. The extraction pipeline is sketched below.
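The following numpy sketch walks through the stages just described (framing, Hamming window, DFT, triangular mel filterbank, log, DCT). It is a minimal illustration, not the extraction code used in the thesis: the 26-filter bank and the 2595 log10(1 + f/700) mel formula are common textbook choices, with only the 10 ms frames, 50% overlap and 13 coefficients taken from the text.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        # Mel scale: roughly linear below 1 kHz, logarithmic above.
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(signal, sr=44100, frame_len=441, hop=220, n_filters=26, n_ceps=13):
        """Return an (n_frames, n_ceps) matrix of MFCCs from a mono waveform."""
        # 1. Frame the waveform (10 ms frames, 50% overlap) with a Hamming window.
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop:i * hop + frame_len]
                           for i in range(n_frames)])
        frames = frames * np.hamming(frame_len)
        # 2. Amplitude spectrum via the discrete Fourier transform.
        spec = np.abs(np.fft.rfft(frames, axis=1))
        # 3. Triangular filters spaced evenly on the mel scale.
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
        bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, spec.shape[1]))
        for j in range(n_filters):
            lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
            fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        # 4. Log of the mel-weighted spectrum, then a DCT to give reasonably
        #    uncorrelated coefficients; coefficient 0 acts as the energy term.
        logmel = np.log(spec @ fbank.T + 1e-10)
        return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]

    feats = mfcc(np.random.randn(44100))   # 1 s of noise -> (199, 13) features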

Limitations of MFCCs

However, some stages in the MFCC extraction process may be of questionable suitability for non-speech, including the emphasis on mid-range frequencies, the size of frame and the number of MFCC coefficients. Logan investigates the application of MFCCs in modelling music, concluding that they are at least effective for music/speech discrimination and that there is no evidence to suggest that the frequency emphasis is inappropriate [22]. Other aspects such as frame size and number of coefficients are investigated for the case of general audio in section 2.4.1.

Furthermore, although the extraction process attempts to reduce the effects of noise, there is no reason to stop a classifier learning undesired properties from the MFCC representation. It is possible that a model could learn to distinguish between sounds based on background noise (such as the silence occurring between footsteps) rather than content. To counteract this, a method of silence detection and removal is implemented, section 2.4.3.

2.1.2 The Delta-cepstrum

MFCCs are essentially static features where coefficient values describe the signal over a single frame. A means to capture the time evolution over a longer interval should help increase distinction between dynamic sound types. For example, the salient feature of a particular sound could be a rising or falling pitch, while the instantaneous pitch itself is less important. In order to capture the slope or difference over a number of frames, a delta-cepstrum can be derived from the static cepstrum. Delta coefficients Δx(t) are derived for every time-slice t by the first-order time derivatives of cepstrum values over a window (t-k) to (t+k), usually ±1 or ±2 frames, where x is the input cepstrum coefficient [16]:

    \Delta x(t) = \sum_{i=-k}^{k} i \, x(t+i)    (2.1)

Computing delta coefficients for every cepstrum value results in a delta-cepstrum reflecting the change occurring in the static cepstrum. The second order derivative of the sequence, or double delta, calculated from the first order delta-cepstrum may also be applied to capture the acceleration of change in the cepstrum. Delta cepstrums are appended to the original static cepstrum, resulting in a composite feature vector as commonly used in speech recognition. This allows modelling the acoustic feature at a particular instant as well as capturing the rate and direction of change.

The benefit of dynamic features was demonstrated by Furui in an isolated word recognition task with a significant reduction in error rate over the static cepstrum alone [15]. Though such features derived from modelling speech do not automatically apply to non-speech tasks, the dynamic behaviour of sound features is clearly applicable to other modelling tasks. For example, dynamic features have been used successfully in the music classification system by Pye [25] and the speaker verification system by Reynolds [27]. It is anticipated that these enhancements should also improve distinction between general sound types. For example, different engine noises are characterised by unique short-term temporal fluctuations.
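A sketch of eq. 2.1 in numpy follows; repeating the edge frames to handle the window at sample boundaries is an assumption here, and the default k = 2 matches the ±2 frames mentioned above.

    import numpy as np

    def delta(ceps, k=2):
        """First-order delta coefficients of eq. 2.1 over a +/-k frame window."""
        n = len(ceps)
        padded = np.pad(ceps, ((k, k), (0, 0)), mode='edge')  # repeat edge frames
        return sum(i * padded[k + i:k + i + n] for i in range(-k, k + 1))

    def with_deltas(ceps, k=2):
        """Composite vector: static + delta + double delta, tripling the dimension."""
        d = delta(ceps, k)
        return np.hstack([ceps, d, delta(d, k)])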

2.2 Audio Classification

Each sound consists of a mass of points in MFCC feature space, and to measure similarity we require a method of comparing distributions in this space. Though a variety of pattern recognition techniques can be applied, it seems natural to use methods that can probabilistically model such distributions.

2.2.1 Gaussian Mixture Models

Gaussian mixture models (GMMs) are a widely used supervised classification technique that improves over a single Gaussian distribution, accommodating a broader, more complex range of distributions using a combination of simple components. They can effectively model distributions where the data points originate from separate clusters but membership is unknown. The idea is to fit the data with regions of probability mass on the assumption that high-level data points are clustered into groups and that each cluster follows a Gaussian distribution (or at least that this is a reasonable approximation). This is appropriate as it is anticipated that points emanating from a sound source will typically form a cluster in high dimensional space. Once properly trained, a GMM should be effective at predicting the likelihood of a new test sound matching the trained data.

For general audio discrimination the content of the data is usually more important than the sequence or structure that is important in speech recognition. GMMs are suitable as they model the whole sample as a mass of points, dismissing temporal order (though it can be beneficial to capture the short-term rate of change by a delta-cepstrum). In the case that a sound type is characterised by long-term temporal ordering (e.g. bird song), it may be beneficial to apply a method such as hidden Markov models (as used extensively in speech recognition) [16] to capture the characteristic sequence, though this seems unnecessary for the chosen dataset.

GMMs are commonly used in speech research, providing improved discrimination over single Gaussian distributions in many cases. For example, Reynolds [27] achieves

good results on databases of differing quality by employing GMMs to verify a speaker's voice pattern, in the same manner as we wish to capture the class of a sound. Additionally, Singer et al. develop an effective language identification system using GMMs and delta-cepstrums [30]. Indeed, mixtures are applicable to more general audio problems such as the music classification tasks by Pye [25] and Berenzweig et al. [3]. Though a sound may not necessarily form distinct clusters in acoustic space, the distribution can always be approximated by using enough mixture components. The benefit of modelling the data with a number of simple components is clear, but in each case there is the problem of determining the appropriate order of GMM to fit the data. The effect of mixture sizes is investigated in section 2.4.4.

Parameters of a Gaussian Mixture

To model an acoustic feature vector x (of dimension D), the distribution is described by

    p(x) = \sum_{i=1}^{K} \pi_i \, p(x \mid \theta_i)    (2.2)

where K is the number of mixture components, π_i is the probability that component i contributes to modelling the data and θ_i the parameters of component i. In the case of GMMs, θ_i = {µ_i, Σ_i} and the probability density function is a multivariate Gaussian distribution

    p(x \mid \theta_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    (2.3)

where µ_i is the D x 1 mean vector and Σ_i the D x D covariance matrix of component i. Hence, for each training case there are two sets of parameters to estimate, the mixing weights π_i and the parameters θ_i. An appropriate value for K must also be determined. The sketch below transcribes these two equations directly.
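Equations 2.2 and 2.3 translate directly into code; the two-component, two-dimensional parameter values below are toy numbers for illustration only.

    import numpy as np

    def gaussian_pdf(x, mu, cov):
        """Multivariate Gaussian density of eq. 2.3."""
        D = len(mu)
        diff = x - mu
        norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(cov))
        return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

    def gmm_pdf(x, weights, mus, covs):
        """Mixture density p(x) of eq. 2.2: a pi-weighted sum of K Gaussians."""
        return sum(w * gaussian_pdf(x, m, c)
                   for w, m, c in zip(weights, mus, covs))

    weights = [0.6, 0.4]                              # mixing weights pi_i
    mus = [np.zeros(2), np.array([3.0, 3.0])]         # component means mu_i
    covs = [np.eye(2), 0.5 * np.eye(2)]               # covariances Sigma_i
    print(gmm_pdf(np.array([0.1, -0.2]), weights, mus, covs))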

Estimation of Parameters

There is no closed form to calculate GMM parameters directly, but a maximum likelihood estimate can be obtained over a number of iterative steps, often achieved via the expectation-maximisation (EM) algorithm [10]. Given a set of training vectors x_1, x_2, ..., x_P, the EM algorithm iteratively refines the GMM parameters to increase the likelihood of the estimated model for the observed data. Starting with a parameter initialisation, the EM algorithm then proceeds over two steps.

Expectation step: find the data points closest to a mixture component j, where w_ij is the probability that x_i belongs to cluster j using the current estimate of parameters, and p(x | θ_j) is computed as in eq. 2.3:

    w_{ij} = \frac{\pi_j \, p(x_i \mid \theta_j)}{\sum_{k=1}^{K} \pi_k \, p(x_i \mid \theta_k)}    (2.4)

Maximisation step: calculate the new weight π̂_j, mean µ̂_j and covariance Σ̂_j over the data points closest to each cluster j:

    \hat{\pi}_j = \frac{1}{P} \sum_{i=1}^{P} w_{ij}    (2.5)

    \hat{\mu}_j = \frac{1}{P \hat{\pi}_j} \sum_{i=1}^{P} w_{ij} \, x_i    (2.6)

    \hat{\Sigma}_j = \frac{1}{P \hat{\pi}_j} \sum_{i=1}^{P} w_{ij} \, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T    (2.7)

Both steps are repeated (with the maximisation estimates becoming the parameters at the next stage) for a set number of iterations or until convergence of the complete-data likelihood, \prod_{i=1}^{P} \sum_{j=1}^{K} \pi_j \, p(x_i \mid \theta_j). The derivation can be found in [6].

In this work individual clusters are not represented by a full covariance matrix but by diagonal approximations. The approximation causes the cluster axes to be orientated parallel to the axes of the feature space. This is done for reasons of computational efficiency during parameter estimation. Furthermore, studies by Reynolds have shown that any Gaussian mixture with full covariance can be replicated by a larger order mixture with diagonal covariance, which can even outperform full covariance models [27].

Unfortunately the procedure can converge to different solutions depending on the initialisation of parameters. A good initialisation strategy adopted in this study is to randomly assign each mixture component to a subset of data points and set the mixing weights π_j to 1/K. Several iterations of the K-means algorithm (a non-probabilistic method) are then used to quickly converge on reasonable estimates of the mean parameters, and the EM algorithm then proceeds. It was observed that fewer than ten iterations are needed for convergence.
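This training recipe (k-means initialisation, diagonal covariances, a floor on covariance values, a handful of EM iterations) can be reproduced with scikit-learn's GaussianMixture, shown here on synthetic stand-in frames; the data, K = 4 and the exact floor value are illustrative, not thesis settings.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    frames = np.vstack([rng.normal(0.0, 1.0, (200, 13)),   # stand-in MFCC frames
                        rng.normal(4.0, 1.0, (200, 13))])  # from two rough clusters

    gmm = GaussianMixture(n_components=4,
                          covariance_type='diag',  # diagonal covariance approximation
                          init_params='kmeans',    # k-means before EM, as above
                          reg_covar=1e-4,          # floor stops covariances collapsing
                          max_iter=10,             # few iterations needed in practice
                          random_state=0)
    gmm.fit(frames)
    print(gmm.weights_.round(2))                   # estimated mixing weights pi_j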

A further problem is that covariances will tend to zero as the likelihood tends to infinity, e.g. if a mixture component models a single point or points close together. To prevent this, a size constraint is imposed on covariance values.

The optimum number of mixture components K depends on various factors such as size, dimensionality and inherent clusters in the source data. Greater numbers of elements will fit the data more precisely but may capture outliers or noisy data elements. Fewer elements will fit the data more crudely but may perhaps allow better generalisation performance to sounds of a similar type. For illustration, GMMs are trained on just two dimensions (the energy term and first MFCC coefficient) over a range of mixture sizes. Figure 2.1 shows the regions of probability generated by these GMMs trained on a source sound of dog vocals. Using the described training strategy, mixture components tend to fit to heavily populated clusters and ignore outliers in the source sample. This inbuilt robustness to noise is an encouraging trait for good generalisation performance. Further experiments investigating prediction performance with differing mixture sizes are reported in section 2.4.4.

Figure 2.1: The source data (top left) and the probability mass generated by GMMs of differing order trained on this data

Classification

Once a set of class models has been trained by obtaining the parameters π_i and θ_i for each of the classes in turn, the models can be used to predict the most likely class membership of a new sound. Classification of a test sound, with feature vectors X = x_1, x_2, ..., x_N, is achieved by estimating the likelihood that each model could have generated X. The log likelihood of a model for the sequence of feature vectors is calculated as

    L(X \mid \lambda) = \frac{1}{N} \sum_{n=1}^{N} \log \sum_{j=1}^{K} \pi_j \, p(x_n \mid \theta_j)    (2.8)

where λ represents all the parameters of a GMM, λ = {θ_j, π_j : j = 1, ..., K}. Summing log values makes the assumption that the feature vectors of X are independent. The normalisation factor 1/N is used to normalise for sample duration; omitting this term would result in longer samples obtaining disproportionately low likelihood values. Reynolds argues that this normalisation factor counteracts the underestimation of actual likelihood values due to the incorrect independence assumption [27]. Likelihood values may be combined with a class prior to determine class membership; however, this is not appropriate for this study, where there is no prior information about which sounds will occur. Thus, to classify a sound, the test vector is tested against all models in turn and the sound is assigned to the class of the model predicting the highest likelihood.
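The classification rule of eq. 2.8 then reduces to a maximisation over a set of fitted models. In scikit-learn terms, score(X) already returns the duration-normalised log likelihood (1/N) Σ log p(x_n); the two toy class models below are illustrative, not thesis data.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def classify(test_frames, class_models):
        """Assign the class whose GMM gives the highest value of eq. 2.8;
        score() returns the per-frame average log likelihood."""
        return max(class_models, key=lambda c: class_models[c].score(test_frames))

    rng = np.random.default_rng(1)
    models = {}
    for label, centre in [("dog", 0.0), ("jet", 5.0)]:      # toy classes
        m = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
        m.fit(rng.normal(centre, 1.0, (300, 13)))
        models[label] = m

    print(classify(rng.normal(5.0, 1.0, (50, 13)), models))  # -> 'jet'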

2.3 Class-based Prediction

In order to demonstrate the described acoustic model, four sets of GMMs were trained and tested. Training Set 1 consists of a small set of 320 labelled sounds and Set 2 consists of 611 sounds. Each sound has two classes determined from the hierarchical labelling; e.g. in Set 1 there are 16 coarse labels such as airplane, and 53 fine labels corresponding to airplane: biplane, airplane: jet etc. Models were trained on the set of feature vectors belonging to each class using 8 element GMMs and a 13 dimensional MFCC vector obtained at a 10 ms frame rate. For each training set two collections of GMMs are trained, one for the coarse classes and another for the fine classes. Accuracy is measured by the number of correct class predictions on an unseen test set, table 2.1.

TEST                    ACCURACY
Set 1 (16 classes)      (56/75)   74.7%
Set 1 (53 classes)      (51/75)   68.0%
Set 2 (36 classes)      (105/153) 68.6%
Set 2 (90 classes)      (94/153)  61.4%

Table 2.1: Classification accuracy obtained with four sets of trained models on an unseen test set

As would be expected, prediction with coarser labels achieves greater accuracy; arguably the finer classes are too insubstantial (some classes having only 3 training sounds). Nevertheless, using fine labels still allows reasonable results. This approach of predicting into predefined categories is typical of those used in content-based audio classification, and these results compare favourably to studies such as the work of Liu & Wan, who achieve 56% test accuracy using GMMs with a larger training set divided into 29 classes [21].

The approach applied to acoustic-to-semantic and semantic-to-acoustic retrieval follows the work of Slaney, measuring acoustic similarity between sounds using individual GMMs for each sound rather than trained class models (the above models only serve as an illustration and are not used further). The purpose of this approach is to predict similarity between each sound in the set, and it is this measure of closeness that proves crucial in mapping between the audio and semantic spaces.

2.4 Experiments

In order to determine the parameter settings of the acoustic model most suitable for general audio, a number of experiments are considered. In particular, the MFCC parameterisation, the delta-cepstrum, silence detection and the number of GMM elements are investigated. To facilitate this investigation a classification framework is created in which particular parameter values can be varied while all other variables, including training and testing material, are kept constant. This framework involves using a representative subset of the training data and the development set as testing material. For each training sound, X_1, X_2, ..., X_n, a single GMM is initialised and trained, λ_1, λ_2, ..., λ_n (the same approach is used in acoustic-to-semantic prediction). Classification of a test sound is then achieved by querying each GMM, and the sound is assigned the class of the model

most likely to have generated it. Performance is simply rated by classification accuracy (between 36 classes) on the development set. The experiments are relatively small scale, so we should not expect them to provide definitive settings that apply to all types of general audio task. Nevertheless, it is anticipated that these results should be representative enough to illustrate useful information about the problem domain and ultimately allow improved discrimination with the acoustic model.

2.4.1 The MFCC Parameterisation

As the MFCC parameterisation is intended for modelling speech, there is a need to investigate parameter settings for the case of general audio. Therefore, various experiments are conducted comparing MFCCs generated at various informed parameter values in the described framework. Though numerous parameters affect the generation of MFCCs, we specifically investigate frame size, the number of MFCC coefficients and the energy term. The number of GMM elements is kept constant throughout to ensure fair comparison.

Comparing Frame Size

Though a frame size of 10 ms is conventional in the MFCC parameterisation used in speech research, there is no reason to assume such a resolution is suited to more general audio. Previous studies indicate the choice of frame size can vary quite widely depending on the problem domain; for example, Foote finds a frame rate of 20 ms to be effective for a music discrimination task [13]. Therefore, we compare frame sizes of 5, 10 and 20 ms in order to establish which provides the best overall discrimination for the audio used in this study. In each case the window overlap used in the frame based analysis is kept constant at 50%, and although alternative overlap proportions were tested, no clear difference was found over the values trialled. A set of GMMs is initialised and trained for each frame size and each is tested on the development set. The overall classification accuracy of each set is presented in table 2.2.

FRAME SIZE    ACCURACY
5 ms          77.8%
10 ms         77.1%
20 ms         71.9%

Table 2.2: The percentage of test sounds correctly classified using models trained with 5, 10 and 20 ms frame sizes

Analysis reveals that the greater resolution provided by 5 and 10 ms frames allows for a notable improvement in discrimination over a 20 ms frame size (over 20% error reduction). The difference in performance between the 5 and 10 ms sizes is marginal, and a frame size of 10 ms is chosen for use in this work for computational efficiency (training and testing time does not simply increase linearly with a doubling in frame rate). Additionally, from these results we can conclude that a frame size below 20 ms is preferable for the discrimination of the general audio sounds in this dataset. In particular, the higher frame rates are advantageous for effectively capturing sounds more dynamic than the human voice. Analysis of misclassifications indicates that higher frame rates improve discrimination between dog, bird and human vocals and also between engine noises such as car, boat, motorcycle etc.

The Number of Cepstral Coefficients and the Energy Term

In speech research a thirteen dimensional vector consisting of twelve cepstral coefficients plus the energy term is standard. Again there is no reason to assume this is the optimum number for more general audio. In fact, work by Berenzweig et al. demonstrates that a 20 dimensional MFCC vector achieves the best performance for a music classification problem [3]. Therefore an appropriate experiment is to compare different dimensions in the classification framework. Additionally, the impact of the energy term is investigated by conducting these experiments with and without it, figure 2.2.

Analysis indicates that a 13 dimensional vector is in fact globally optimal for the dataset (though the advantage over 8 and 16 coefficients is marginal). The biggest advantage can be gained by including the energy coefficient, which is clearly important for sound discrimination. Generally, fewer dimensions do not capture enough significant content and larger dimensions may channel too much noise to the modelling process. However, the optimal dimension appears to depend on the particular type of sound under analysis. From observation it was determined that the optimal dimension for one sound type such as animal vocals is not necessarily the best for other samples such as impact sounds, which are more impulse-like.

Figure 2.2: The percentage of test sounds correctly classified using 4, 8, 12, 16 and 20 cepstrum coefficients, with and without energy

Nevertheless, the best overall dimension should be selected in order to achieve the best prediction on unseen test sounds.

2.4.2 The Effect of the Delta-cepstrum

A feature vector based on cepstrum values does not capture any temporal change in acoustic properties. However, it is anticipated that a method of capturing short-term temporal changes (under 100 ms) will prove effective in distinguishing between certain sounds. Indeed, spectral transition is believed to play an important role in auditory perception [16]. Consequently, this experiment involves augmenting the static MFCC representation with delta-cepstrums and comparing discrimination performance. The delta coefficients are calculated over a 7 frame window (70 ms) by eq. 2.1. A window size of 70 ms was selected through comparative trials, where it was observed to obtain slightly better discrimination than a 50 ms window. The results from tests with and without applying delta and double delta-cepstrums are presented in table 2.3.

The disadvantage of such a representation is that the dimension of the feature vector is increased threefold, though the motivation to capture acoustic transition and improve prediction often outweighs this cost. Analysis reveals that misclassifications are reduced between classes with similar instantaneous properties but dissimilar acoustic variation over time. In particular, misclassification between various types of engine noise is notably reduced. For example, models trained on helicopter sounds were liable to cause false positive predictions. The sound of a helicopter rotor is characterised by short-term temporal fluctuations, and the application of delta-cepstrums was observed to entirely eliminate those misclassifications.

Silence Detection and Removal

In early trials it was observed that problems arise from the presence of background noise or regions of silence, e.g. the silence occurring between footsteps. This causes a problem for the modelling process, where a model may learn similarity between sounds based on these regions, resulting in erroneous classifications. In particular, models trained on such sounds are liable to generate false positives. A straightforward strategy to prevent this is to remove such content from waveforms beforehand.

A simple method to detect and remove silence is to measure the energy levels of the input sample and exclude frames whose energy values fall below a set threshold. In practice, detection is averaged over a number of frames so that only reasonably contiguous periods of silence are removed and the actual waveform content is not disrupted. Though rather rudimentary, this method was found to be reasonably effective. However, sounds are often sourced from different archives and recording environments, and this rather limited approach is less reliable in such situations. An energy threshold is too crude to reliably detect varying levels of noise, removing either too much or too little. Improvements can be made by accounting for other measures such as zero-crossing rate, and an adaptive threshold is an interesting prospect [24].
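A sketch of this energy-threshold method, with illustrative threshold and smoothing-window values:

import numpy as np

# Frame energies are smoothed over a window so that only reasonably
# contiguous low-energy regions are marked as silence.
def silence_mask(frames, threshold=1e-4, smooth=5):
    """frames: (n_frames, frame_len) array of waveform frames."""
    energy = np.mean(frames ** 2, axis=1)
    kernel = np.ones(smooth) / smooth            # moving average
    smoothed = np.convolve(energy, kernel, mode="same")
    return smoothed < threshold                  # True where a frame is silence

# Frames flagged as silence would be excluded before training:
# content_frames = frames[~silence_mask(frames)]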

Figure 2.3: Examples of predicted silence and content regions (waveform amplitude against time) for BIRD PARROT CALLS.wav and ANIMAL DOG02 MED BARKING.wav.

However, a more sophisticated approach is adopted from the speech/silence discrimination literature, cf. the work of Reynolds [27]. This involves creating one GMM to model silence and another to model content. The advantage is that this can account for the spectral difference between regions of silence and content, as well as capturing energy. Representative examples of both silence and content were marked out in the training set and the GMMs were then trained on this data. The content model required considerably more mixture components to fit the data. The test for a frame of silence is then by the likelihood ratio

    α = L(x | λ_S) / L(x | λ_C)    (2.9)

where x is the input frame, L(x | λ) is calculated as in eq. 2.8, and λ_S and λ_C are the estimated parameters for the silence and content GMMs, respectively. If α is greater than or equal to a threshold the frame is classed as silence, otherwise as content. As before, detection is averaged over a number of frames so that only contiguous periods of silence are removed. For illustration, example regions of silence and content predicted by this method are shown in figure 2.3.

Applying this method of silence removal to acoustic feature vectors before training results in a set of models relatively unaffected by noise elements in the feature space. However, evaluation by classification accuracy using such models indicates only a slight improvement, or none at all, over no removal. This is partly because the majority of sounds contain no silence periods (less than 10% contain any significant regions of silence) and so remain unaffected.
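The likelihood-ratio test of eq. 2.9 might be sketched as follows, assuming scikit-learn GMMs (the original work used the Netlab toolbox) and hypothetical arrays sil_feats and con_feats of frames marked as silence and content:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm_sil = GaussianMixture(n_components=2).fit(sil_feats)
gmm_con = GaussianMixture(n_components=8).fit(con_feats)  # content needs more components

def is_silence(frames, threshold=1.1):
    # score_samples returns per-frame log-likelihoods, so the ratio
    # of eq. 2.9 becomes a difference in the log domain.
    log_alpha = gmm_sil.score_samples(frames) - gmm_con.score_samples(frames)
    return log_alpha >= np.log(threshold)   # threshold > 1 under-predicts silence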

Another problem stems from the limited amount of silence-marked training material, which may not be sufficient for the model to fit all types of silence in the dataset. Additionally, it was observed that over-predicting silence is harmful and can decrease prediction accuracy; a more cautious prediction is therefore made by raising the threshold on α. A value slightly larger than 1 ensures under-prediction of silence, preventing harmful removal of content.

In order to demonstrate the value of silence detection more clearly, an artificial test set was created consisting of sounds known to contain background noise, e.g. footsteps and various animal noises (dog barking, horse galloping). Some examples contain around 80% silence. As before, performance is rated by classification accuracy: 90.48% of test sounds are correctly classified using silence detection against only 85.71% without it, an error reduction of around 33%.

The Number of Mixture Components

The optimum number of mixture components in a GMM depends on a number of factors, such as the shape and separation of clusters in the source data, and the size and dimensionality of the source sample. Generally, greater numbers of components will fit the data more precisely but may capture outliers or noisy data elements, whereas fewer components will fit the data more crudely but will perhaps generalise better to sounds of a similar type. This experiment compares the effect of varying the mixture size over all models. Mixture sizes of 2, 4, 8 and 16 are compared, along with an optimised configuration, MIX, discussed below. Throughout, the feature dimension is held constant using 13-dimensional MFCCs without delta-cepstrums (though trials indicate a similar trend with delta-cepstrums). Performance is measured by the percentage of correctly classified sounds from the development set (table 2.4).

The results suggest that a good policy is to use few mixture components to prevent over-fitting to the data. Indeed, although 16-component models are likely a good model of their own content, they provide poorer generalisation performance than using just 2 components. Using 4 mixture components provides an error reduction of 18% over any of the other single-size models.

ELEMENTS   ACCURACY
2          …
4          …
8          …
16         …
MIX        82.4%

Table 2.4: The percentage of correctly classified test sounds using GMMs of differing order.

However, the best overall strategy is the MIX configuration, which compares varying mixture sizes for each model. This is constructed by initially selecting a mixture size of 4 (the best overall size) for each model. An optimisation strategy is then used to select the best mixture size per model, rather than globally. The procedure involves testing the set of models on the development data and iteratively refitting models which cause either false positives or false negatives with differing mixture sizes, K = {2, 4, 8, 16}. For each model, the size achieving the best classification accuracy is selected; in the case of a draw the lowest-order model is chosen, for efficiency and potentially better generalisation in practice. This strategy was observed to reduce misclassifications considerably, resulting in improved performance on the test set compared to the single best mixture size (around 23% error reduction).

From analysis of the MIX results, it seems that the optimised mixture size varies according to the sample size or complexity of the sound (e.g. the number of sound sources present). It was observed that some of the lengthier and more complex samples generally suit 8 or 16 components (perhaps more in some cases), whereas 2-component models suit a few of the shortest or most stationary samples. Most examples in the dataset are short single-source sounds, and this explains why they can be modelled effectively with a small number of mixtures. Bearing in mind that these results are optimised for the development set, results on an unseen test set are less impressive: a small trial indicates an error reduction of less than 12% over the single best mixture size. A considerable drawback of this strategy is that the optimisation procedure is relatively intensive and would be impractical on larger datasets. Though effective, the optimisation strategy is not applied to the final clustered models used in acoustic-to-semantic retrieval (see chapter 4); instead, models are trained using the single best mixture size, allowing straightforward comparison between models.
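The per-model search behind the MIX configuration might be sketched as follows; train_sets and dev_accuracy are hypothetical helpers standing in for the dataset and the development-set scoring described above:

from sklearn.mixture import GaussianMixture

SIZES = (2, 4, 8, 16)

def fit_gmm(X, k):
    return GaussianMixture(n_components=k).fit(X)

# start every class model at the best single size, 4
models = {name: fit_gmm(X, 4) for name, X in train_sets.items()}

for name, X in train_sets.items():
    best_k, best_acc = 4, dev_accuracy(models)
    for k in SIZES:
        models[name] = fit_gmm(X, k)
        acc = dev_accuracy(models)
        # prefer the lowest order on a draw, as described above
        if acc > best_acc or (acc == best_acc and k < best_k):
            best_k, best_acc = k, acc
    models[name] = fit_gmm(X, best_k)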

Chapter 3

The Semantic Model

The semantic feature space is derived from the words forming each sound's description (termed a document). As documents are just text, standard text processing and information retrieval techniques can be applied to construct a semantic model. In this case the ordering of words within a document is largely irrelevant; the focus is on treating keywords as terms that can be mapped to concepts, the so-called bag-of-words approach.

However, before setting out to learn relationships between documents it is beneficial to perform some pre-processing on the text data to facilitate later indexing and retrieval. After the descriptions are tokenised (i.e. terms are extracted) and punctuation stripped, the next step is the exclusion of terms that appear in a stop list. The stop list is composed of common function words such as "the", "and", "while" etc., which have little semantic content. Though the collection of descriptions contains few function words, a custom stop list of 54 terms was applied for this purpose. The reason for excluding these terms is that they do not contribute to the meaning of a description and could cause a semantic model to learn undesirable relationships between terms and documents. The focus is on retaining descriptive verbs, nouns and adjectives.

The process of stemming to remove uninformative word endings is commonly applied in information retrieval tasks, including the semantic work by Slaney [31], [32]. The idea is that removal of plurals and suffixes will improve retrieval performance by mapping terms with the same stem to a single root. However, a trial of stemming on the text in this study found no benefit. Firstly, the corpus contains few terms that would benefit from this procedure, and secondly the vector space model applied to the semantic space in most cases finds a relationship between root forms and their derivatives without the need for stemming.
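A minimal sketch of this pre-processing, with a small illustrative subset of the 54-term stop list:

import re

STOP_LIST = {"the", "and", "while", "a", "of", "single"}

def preprocess(description):
    # tokenise, lower-case, strip punctuation, drop stop-listed terms
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOP_LIST]

print(preprocess("boat, sail: schooner: bow cutting through water"))
# ['boat', 'sail', 'schooner', 'bow', 'cutting', 'through', 'water']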

After pre-processing, the task is to construct a semantic model which, given a new textual description (or query), can determine its semantic similarity to existing descriptions. Ideally the semantic model should establish useful relationships between terms. Whereas literal term matching predicts only exact matches to a term, a concept-based approach can find relationships between terms following similar global patterns, potentially allowing more suitable retrieval. For example, matching terms related to "storm", such as "wind", "rain" etc., is likely to retrieve more relevant documents than otherwise.

3.1 The Vector Space Model

A vector space model can be used to represent the unique terms, t_1, t_2, ..., t_m, and documents, d_1, d_2, ..., d_n, occurring in a collection [28]. The collection of n documents indexed by m terms can be encoded by an m × n term-by-document matrix. A column denotes a document, indicating the terms indexed, and a row represents a term, indicating the documents in which it occurs. An example of a partial term-by-document matrix, recording term frequency, is shown in figure 3.1.

d1  boat, sail: schooner: bow cutting through water marine sports boating sailboat
d2  boat, storm: large ship bow slamming waves creaks weather
d3  canoe: pull up onto shore lake waves marine sports boating
d4  water, river: heavy flow ambience

              d1  d2  d3  d4
t1: ambience   0   0   0   1
t2: boat       1   1   0   0
t3: boating    1   0   1   0
t4: bow        1   1   0   0
...

Figure 3.1: Example of a partial term-by-document matrix.

The values recorded in a term-by-document matrix reflect the weighting of the term in the document and in the collection. A weighting scheme has two components: a local weight and a global weight [4].
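A sketch of constructing such a matrix, assuming scikit-learn rather than the Text to Matrix Generator (TMG) toolbox used in this work:

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer produces a document-by-term matrix, so it is
# transposed to match the m x n convention above.
docs = [
    "boat sail schooner bow cutting through water marine sports boating sailboat",
    "boat storm large ship bow slamming waves creaks weather",
    "canoe pull up onto shore lake waves marine sports boating",
    "water river heavy flow ambience",
]
vectorizer = CountVectorizer()
F = vectorizer.fit_transform(docs).T        # m terms x n documents
terms = vectorizer.get_feature_names_out()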

Local term weighting reflects the importance of the term in a document; in most document collections it is appropriate to represent the local weighting by the term frequency f_ij, the number of times term t_i occurs in document d_j. However, for this collection the local weight is represented by a binary value, where multiple occurrences of a term in a document are not taken into account:

    χ(f_ij) = 1 if f_ij > 0,  0 if f_ij = 0    (3.1)

The reason for this is that terms are merely keywords, so we only wish to record an occurrence once; e.g. "animal, cat: domestic cat purring" is equivalent to "animal domestic cat purring".

The second weighting component is global weighting, which reflects the importance of a term in the collection by weighting all occurrences of the term with the same value. A common strategy is inverse document frequency (IDF) weighting, where rare terms are up-weighted to reflect their relative importance. However, for this purpose it was found that the entropy weighting scheme used by Dumais [11] was more effective than IDF, and both of these schemes were more effective than no global weighting. The rationale of this information-theoretic approach is that terms occurring frequently in the collection have low information content. Thus, accounting for both local and global weighting, the overall weight of term i in document j is defined

    w_ij = χ(f_ij) [ 1 + Σ_j (p_ij log(p_ij)) / log(n) ]    (3.2)

where p_ij = f_ij / g_i and g_i is the total number of times term i occurs in the collection. In addition, document vectors are normalised using standard cosine normalisation to ensure that all documents are treated equally regardless of length. Accounting for weighting and normalisation, the values of a term-by-document matrix A are calculated as

    A_ij = w_ij / sqrt( Σ_i (w_ij)^2 )    (3.3)

As each document will likely contain only a few terms, most elements of the matrix will be zero. Due to the short length of descriptions in the document collection, only about 0.3% of the elements in the term-by-document matrix are filled.

Query matching in vector space can be performed by indexing a textual query in the same manner to produce an m × 1 query vector. This can then be compared against the existing document vectors using a distance metric such as the cosine angle between query and document vectors; this corresponds to literal term matching. Matching is not strict, in the sense that a query containing multiple words may retrieve relevant documents when not all words are present. This approach allows documents to be ranked by similarity to the query.
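The weighting and matching steps of eqs. 3.1-3.3 might be sketched as follows, where F is a dense term-by-document count matrix (e.g. F.toarray() from the previous sketch) and q is a query vector indexed in the same way:

import numpy as np

def weight_matrix(F):
    chi = (F > 0).astype(float)                     # binary local weight, eq. 3.1
    n = F.shape[1]
    g = F.sum(axis=1, keepdims=True)                # total count of each term
    p = np.where(F > 0, F / g, 1.0)                 # p_ij; 1.0 makes log(p) zero
    entropy = 1.0 + (p * np.log(p)).sum(axis=1, keepdims=True) / np.log(n)
    W = chi * entropy                               # eq. 3.2
    norms = np.sqrt((W ** 2).sum(axis=0, keepdims=True))
    return W / np.where(norms > 0, norms, 1.0)      # eq. 3.3

def match(A, q):
    # cosine similarity of a weighted query vector against all
    # document columns of A; A's columns are already unit length
    qn = q / np.linalg.norm(q)
    return qn @ A                                   # one score per document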

However, the vector space model is limited. Literal term matching is unlikely to retrieve all relevant documents, as there is a broad range of words people may use to describe the same documents. The vector space model fails to capture the concepts of synonymy and polysemy. Synonymy refers to words having the same meaning as another word in a language; for example, in the audio dataset the words "car" and "automobile" are synonyms. Polysemy refers to multiple unrelated meanings of the same word; e.g. "taxi" is a polysemous word denoting either a form of public transportation or the movement of an aircraft on the ground. Ideally, a retrieval system should capture relations between synonymous words yet identify the different word senses of a polysemous word. Also, in the vector space model each word is considered independent of all other words. This is unrealistic, as we would expect some words to commonly co-occur with others, e.g. "fire" and "engine". Adopting a concept-based indexing technique attempts to overcome this and model the co-occurrence relation of terms.

3.2 Latent Semantic Analysis

Given the limitations of the vector space model, an extension of it to latent semantic analysis (LSA) [9] is described. The aim of LSA is to build a model of the true relationships between terms and documents. LSA is a popular information retrieval approach that encodes a term-by-document matrix into a lower dimensional space, capturing hidden (hence latent) relationships between terms. LSA attempts to model the global usage patterns of terms so that documents sharing related concepts (rather than just literal terms) are represented by nearby vectors in the lower dimensional space. In most cases this allows improved retrieval of relevant documents over the vector space model, though the effectiveness of this is by no means guaranteed.

In order to achieve this goal, a technique is required to project the term-by-document matrix to a lower dimensional space. Such an approximation is most often made by use of the singular value decomposition (SVD). This is a generalisation of factor analysis, a statistical technique used to explain most of the variation among a number of random variables in terms of a smaller number of hidden variables. The SVD of a rectangular m × n matrix A is defined as

    A = U S V^T    (3.4)

where U is an orthonormal m × m matrix, V is an orthonormal n × n matrix and S is a diagonal m × n matrix with the singular values of A along the diagonal in decreasing order. The larger singular values account for most of the variation in the data. Therefore an (optimal) approximation can be made by taking the k largest singular values of S and discarding the others [4]. The result of this approximation is the reduction of S to a diagonal k × k matrix S_k. Accordingly, the matrix U can be reduced to an m × k matrix U_k and V to an n × k matrix V_k. An approximate reconstruction of the original matrix can then be calculated as A_k = U_k S_k V_k^T. Berry shows this rank-k approximation to be the closest to the original matrix in terms of error [5].

The choice of the value k is crucial for good results. Ideally, k should be large enough to fit all the meaningful structure of the data to allow for good generalisation, yet not so large as to capture noise. Typically k will be much smaller than m or n, in which case the set of terms is represented by a vector space of lower dimensionality than the total number of terms in the vocabulary. An experiment to find the optimal value of k for this task is conducted in section 3.3. The motivation behind the approximation is that the reduced matrix corresponds to a matrix with minimised noise elements (ideally a better matrix than the original), potentially allowing improved prediction.
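A sketch of the rank-k truncation in numpy, where A is the weighted term-by-document matrix built earlier:

import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 70                      # the value found best in section 3.3
U_k = U[:, :k]              # m x k
S_k = np.diag(s[:k])        # k x k, singular values in decreasing order
V_k = Vt[:k, :].T           # n x k

A_k = U_k @ S_k @ V_k.T     # closest rank-k approximation of A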

Query Matching

To allow retrieval using LSA, an m × 1 query vector q can be indexed as before; this can then be represented in k-dimensional space by q̂ = q^T U_k S_k^{-1} [5]. To find the similarity between the query and any document in the collection, the cosine angle between q̂ and the k-dimensional document vectors is computed, returning a list of documents ranked by proximity to the query. Queries can be any type of free text using the vocabulary of terms. As an example, the most relevant documents to the query "canoe" are shown in table 3.1. The most relevant matches are ranked highest, and semantically related descriptions concerning boats and kayaks are appropriately predicted as relevant; this could not be achieved with literal term matching. It is believed the effectiveness of semantic matching on this dataset is largely due to the hierarchical and consistent annotation of descriptions, which often share keywords among related descriptions (e.g. "boating").

SIMILARITY   DOCUMENT
…            canoe: two paddlers: launch and pull away from shore, boating
…            canoe: single paddler: on board: pull onto shore step out of canoe, boating
…            canoe: pull up onto shore lake waves follow, boating
…            canoe: single paddler: on board: launch from shore jump in, boating
…            boat, sail three masted schooner: on board: lower sail sailboat, boating
…            boat, sail three masted schooner: on board: turning winch sailboat, boating
…            kayak: approach head on pull up short reverse forest birds in background
…            scull eight man sculler: on board: rowing boat marine sports boating

Table 3.1: Most relevant documents retrieved with the query "canoe".
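A sketch of this query matching, assuming the U_k, S_k and V_k factors from the earlier SVD sketch and treating the rows of V_k as the k-dimensional document vectors (one common convention):

import numpy as np

# q is a weighted m-dimensional query vector indexed like a document
q_hat = q @ U_k @ np.linalg.inv(S_k)       # fold the query into k-space
docs_k = V_k                               # one row per document in k-space

# rank documents by cosine angle to the query
sims = (docs_k @ q_hat) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(sims)[::-1]           # most relevant documents first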

Limitations of LSA

There are also limitations to this approach. Firstly, rank reduction does not necessarily improve matching for all queries, and LSA is not guaranteed to find the desired (synonymous) relationships between terms. It is worth noting that the LSA approach is not linguistically motivated (e.g. by marked-up word sense) but driven by linear algebra, and the learned concepts may not actually correspond to observed linguistic data. In some cases it was observed that seemingly unrelated terms are linked, though application of a stop list can help prevent this to a certain extent. For example, the word "single" appears in many descriptions, e.g. "single horse", "single explosion" etc. As this is not the type of semantic relationship we wish to capture, the word is excluded from the vocabulary. Additionally, research such as the work of Landauer et al. has indicated that large corpora (millions of words) are required to find realistic linguistic relationships between terms [19]. By contrast, this corpus consists of less than a thousand terms in most experiments. Possibly these sparse statistics do not allow for the most effective use of LSA, though we argue that there is a clear benefit to a semantic system, as illustrated in table 3.1.

3.3 The Number of Singular Values

Choosing a suitable dimension for the reduced matrix when performing LSA is crucial for effective results. If the value is too small, important information will be lost; too big, and undesirable relationships (noise) may be modelled. However, there is no definitive method for choosing an optimal dimension k, and the literature suggests that the best value is usually determined empirically [9]. The approach used here involves measuring retrieval accuracy at various values of k, using the development set as test material. The descriptions of the development set are applied as queries to the semantic model, and retrieval is deemed correct if the class of the closest predicted document matches the class of the test document. Although this measure does not truly reflect the relevance of a given document (e.g. in some cases it is reasonable for relevant documents to belong to another class), it should provide a good approximation of retrieval performance. For comparison, a baseline method is constructed using the ordinary vector space model (corresponding to literal term matching), where query matching is performed using the cosine angle between query and document vectors [4].

The percentage of correctly retrieved documents (averaged over the set) for the LSA method is plotted against k along with the baseline in figure 3.2. Accuracy is relatively high (90.2% for the vector space model) because within most classes the test descriptions are very similar to the training descriptions, and prediction by both methods becomes fairly certain (a more realistic evaluation would involve testing the model against free text queries).

Figure 3.2: Percentage of correctly predicted test documents against differing sizes of reduced matrix for LSA retrieval.

Firstly, it is apparent that values of k < 50 do indeed lose valuable information and are outperformed by the baseline. This means that in general there may be no benefit in applying LSA unless a suitably reduced dimension is determined. However, larger values of k can considerably outperform the baseline, and the best accuracy for this task is 95.4% when k = 70. Performance gradually drops from around k = 100 onwards as more noise is introduced; eventually performance will match that of the vector space baseline.
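This empirical search might be sketched as follows; retrieval_accuracy, dev_queries and dev_classes are hypothetical stand-ins for the development-set scoring just described:

import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

best_k, best_acc = None, 0.0
for k in range(10, 201, 10):
    U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
    acc = retrieval_accuracy(U_k, S_k, V_k, dev_queries, dev_classes)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)   # 70 and 95.4% in the experiment above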

The results demonstrate that LSA is indeed applicable to this task, allowing improved performance over literal term matching. In this case the LSA method benefits from the well-structured descriptions and co-occurrence of informative keywords in the dataset. It is unlikely the LSA method would prove so effective if the semantic model were formed from less sophisticated annotation. Finally, having attained an effective method for measuring the statistical closeness of queries to documents, the next step is to construct a linking between the acoustic and semantic spaces.

Chapter 4

Linking Acoustic and Semantic Spaces

To allow an audio request to generate a semantic answer (and vice versa), some mapping between the acoustic and semantic models must be implemented. This relies on the known relationship between sounds and descriptions in the training set. Firstly, for insight into the problem domain, the distributions of the acoustic and semantic spaces are compared and the difference is used to justify two separate linking models. The procedures for achieving both acoustic-to-semantic and semantic-to-acoustic retrieval are then described in the remainder of this chapter.

4.1 The Distributions of Acoustic and Semantic Space

To compare the similarity predicted by the acoustic and semantic models, an illustration of the distances between a number of training points in both spaces is presented in figure 4.1. Acoustic distance is derived from the acoustic model: for each training sound, X_1, X_2, ..., X_n, a GMM is initialised and trained. For each trained model, λ_1, λ_2, ..., λ_n, the likelihood that it generated each training sound is recorded, resulting in an n × n matrix indicating how well each sound scores with each model (the leftmost matrix in figure 4.1). The lower the likelihood of model λ_i generating sound X_j, the greater the distance between them. The distances are also normalised and made symmetrical.
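A sketch of constructing this matrix with scikit-learn GMMs; the min-max normalisation and averaging used to symmetrise are illustrative assumptions, as the exact scheme is described elsewhere in the thesis:

import numpy as np
from sklearn.mixture import GaussianMixture

# `sounds` is a list of per-sound MFCC frame arrays
models = [GaussianMixture(n_components=4).fit(X) for X in sounds]

n = len(sounds)
S = np.zeros((n, n))
for i, gmm in enumerate(models):
    for j, X in enumerate(sounds):
        S[i, j] = gmm.score(X)   # average log-likelihood of sound j under model i

# normalise to [0, 1] and make symmetric, so that S can be compared
# directly with the semantic similarity matrix
S = (S - S.min()) / (S.max() - S.min())
S = 0.5 * (S + S.T)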

The distance in semantic space is measured by the similarity scores predicted by LSA query matching, where each training document is treated as a query and its similarity to all other documents is found. These distances also undergo normalisation and symmetrisation.

Figure 4.1: Comparison of acoustic (leftmost panel) and semantic (rightmost panel) similarity between training points as predicted by the acoustic and semantic models; lighter regions indicate greater similarity.

Of course, in both spaces each point scores itself with the highest similarity, hence the strong diagonal. As the points are numbered in catalogue order, similar types of sounds are placed together, hence the rectangular groupings on the diagonal. In this case, the most visible rectangles denote four broad classes: animals, birds, explosions and footsteps (from left to right). Little acoustic similarity is found between bird sounds; it is believed this is because the class is rather diverse in comparison to other groupings, and there is most often only one example for each bird species. Also, there is a strong overlap in acoustic similarity between animals and footsteps, which arises from the relation of footsteps to sounds such as a horse trotting. The strong structuring in the semantic space is due to the hierarchical labelling of descriptions.

Overall, both the acoustic and semantic spaces have a visibly similar distribution, where the similarity found between examples is in some cases complementary. However, the distributions are by no means identical, and we would not expect acoustic similarity to correspond exactly to semantic similarity. As each space is differently distributed, it seems wise to build two separate linking models, one for mapping from audio to semantics and one for the reverse, in the same manner as the work of Slaney [31], [32].


More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

Speech Recognition by Indexing and Sequencing

Speech Recognition by Indexing and Sequencing International Journal of Computer Information Systems and Industrial Management Applications. ISSN 215-7988 Volume 4 (212) pp. 358 365 c MIR Labs, www.mirlabs.net/ijcisim/index.html Speech Recognition

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

On-the-Fly Customization of Automated Essay Scoring

On-the-Fly Customization of Automated Essay Scoring Research Report On-the-Fly Customization of Automated Essay Scoring Yigal Attali Research & Development December 2007 RR-07-42 On-the-Fly Customization of Automated Essay Scoring Yigal Attali ETS, Princeton,

More information

Why Did My Detector Do That?!

Why Did My Detector Do That?! Why Did My Detector Do That?! Predicting Keystroke-Dynamics Error Rates Kevin Killourhy and Roy Maxion Dependable Systems Laboratory Computer Science Department Carnegie Mellon University 5000 Forbes Ave,

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC On Human Computer Interaction, HCI Dr. Saif al Zahir Electrical and Computer Engineering Department UBC Human Computer Interaction HCI HCI is the study of people, computer technology, and the ways these

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information