Semantic-based Audio Recognition and Retrieval

Semantic-based Audio Recognition and Retrieval

Colin R. Buchanan

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2005

Abstract

This study considers the problem of attaching meaning to non-speech sound. The purpose is to demonstrate automated annotation of a sound with a string of semantically appropriate words, and also retrieval of the sounds most relevant to a given textual query. This is achieved by constructing acoustic and semantic spaces from a database of sound and description pairs and using statistical models to learn similarity in each space. The spaces are then linked to allow retrieval in either direction. A key aspect is effective prediction of novel events through generalisation from known examples. The motivation and implementation of the system are described using such techniques and representations as Mel frequency cepstral coefficients, Gaussian mixture models, hierarchical clustering and latent semantic analysis. System results are evaluated with automatic classification measures and human judgements, demonstrating that this is an effective method for annotation and retrieval of general sound.

Acknowledgements

Firstly, I would like to thank Steve Renals for his indispensable help, feedback and direction throughout, for which I am extremely grateful. I would also like to thank Dimitrios Zeimpekis and Efstratios Gallopoulos at the University of Patras, Greece for allowing me to use their Text to Matrix Generator (TMG) toolbox. Thanks are also due to Daniel P. W. Ellis at Columbia University for providing freely available code for MFCC extraction and to Ian T. Nabney for producing the Netlab machine learning toolbox.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Colin R. Buchanan)

Table of Contents

1 Introduction
  1.1 Aims and Objectives
  1.2 Overview
  1.3 Literature Review
    1.3.1 Audio Classification
    1.3.2 Audio Retrieval
    1.3.3 Image and Multimedia Retrieval
  1.4 Dataset
  1.5 Acoustic-Semantic Framework
2 The Acoustic Model
  2.1 Feature Extraction
    2.1.1 Mel Frequency Cepstral Coefficients
    2.1.2 The Delta-cepstrum
  2.2 Audio Classification
    2.2.1 Gaussian Mixture Models
  2.3 Class-based Prediction
  2.4 Experiments
    2.4.1 The MFCC Parameterisation
    2.4.2 The Effect of the Delta-cepstrum
    2.4.3 Silence Detection and Removal
    2.4.4 The Number of Mixture Components
3 The Semantic Model
  3.1 The Vector Space Model
  3.2 Latent Semantic Analysis
    Query Matching
    Limitations of LSA
  3.3 The Number of Singular Values
4 Linking Acoustic and Semantic Spaces
  4.1 The Distributions of Acoustic and Semantic Space
  4.2 Acoustic to Semantic Linkage
    Clustering the Acoustic Space
    The Word Model
    Interpolation of Semantic Predictions
  4.3 Semantic to Acoustic Linkage
    Retrieving Sounds from an Unlabelled Database
  4.4 The Complete Acoustic-Semantic Framework
5 Evaluation and Discussion
  5.1 Automatic Evaluation
    Measuring Annotation Performance
    Measuring Retrieval Performance
  5.2 Subjective Experiments
  5.3 Analysis
    Comparison of Objective and Subjective Evaluation
    Prediction on Novel Examples
6 Conclusions
  6.1 Summation
  6.2 Limitations and Future Work
  6.3 Concluding Remarks
A Subjective Test Material
  A.1 Acoustic to Semantic
  A.2 Semantic to Acoustic
Bibliography

Chapter 1
Introduction

For decades researchers of automatic speech recognition have addressed the problem of interpreting speech by machine, their efforts over time leading to a considerable understanding of the domain and numerous practical applications. However, despite the impressive activity on speech interpretation and a well established practice of audio and signal processing, there has been little emphasis on automatic understanding of non-speech.

This study focuses on making sense of non-speech audio, chiefly for two related purposes: intelligently labelling a given sound and retrieving sound(s) from a database via a textual description. For instance, imagine a system that given an input sound of a lion roaring could return the label "lion roaring", and given an input prompt of "lion roar" could retrieve from a database samples most like that of a lion roaring. Previous and current research in audio classification tends to focus on matching test sounds into a limited number of predefined categories such as music, applause, speech etc.; the approach taken here instead describes each sound with a string of semantically appropriate words. Furthermore, the proposed system should allow intelligent interpretation of unseen examples, e.g. describe a tiger roaring based on its similarity to previously seen events. The analogy in human perception is that we can easily describe a new sound by its relation to other sounds.

Retrieval systems are often based on query by keyword, e.g. literal keyword matching (requiring annotations paired with sounds), a method familiar to users accustomed to web search engines. This study augments audio retrieval by finding semantic concepts linked to words in the user's query in order to match the most relevant sounds in a

database. Such an approach improves over simple literal query matching, as the user need not know exact search terms to achieve useful results. For example, a search for "boat", though ranking exact matches highest, could also predict sounds described by related words, e.g. "kayak" or "jet-ski". An extension is made to retrieve sounds from unlabelled databases, where selection is based on acoustic similarity to known sounds.

Such a semantic system would be of benefit in many applications. For example, audio databases could be better accessed through semantic based queries. Currently, audio annotation is typically performed manually; an attractive alternative is an automatic annotation system capable of classifying and labelling an entire audio archive with minimal cost. In particular, the abundance of raw audio content already available and the continual growth of multimedia archives create a pressing need for automated handling of audio/multimedia content for the tasks of indexing and retrieval.

1.1 Aims and Objectives

The principal goal is to demonstrate that this approach to modelling the semantics of sound is highly appropriate for the task of labelling and retrieving audio. The objectives to achieve this goal are:

- Construction of an acoustic model using motivated audio features and an appropriate classification method
- Construction of a semantic model with an appropriate semantic representation and classification method
- A mapping between the acoustic and semantic spaces to allow retrieval in either direction, demonstrating generalisation to deal with novel examples in a reasonable manner
- Evaluation of system performance against baseline methods and through human judgement

1.2 Overview

Given these goals, a framework to develop such a system is outlined making use of appropriate techniques from the literature. Firstly, in the next section, a study of existing audio classification and retrieval systems is undertaken with emphasis on semantic attachment. Specifically, a state-of-the-art audio system developed by Malcolm Slaney in 2002 [31], [32] demonstrates intelligent labelling and retrieval of audio samples, and this work forms the basis of this study. Other related fields, such as areas of multimedia including image and video processing, have received more attention in semantic modelling and provide valuable insight. Section 1.4 describes the dataset used in this work, and section 1.5 outlines the steps for constructing a semantic audio system based on the proposed approach. Development follows engineering based methods such as signal processing, pattern recognition and stochastic models.

In chapter 2, methods are described for extracting high-level audio features (Mel frequency cepstral coefficients) and measuring acoustic similarity using Gaussian mixture models. Suitable experiments are motivated and described for aspects of the audio parameterisations in order that the most effective values can be established for the task. Likewise, chapter 3 presents suitable ideas for a semantic space, utilising latent semantic analysis to model related search terms as single concepts. To allow acoustic-to-semantic and semantic-to-acoustic queries, a linkage between acoustic and semantic spaces is described in chapter 4. Clustering is applied to the acoustic space to permit a general-to-specific hierarchy, and a word model is employed to predict relevant words. Semantic retrieval uses a mapping from the semantic query to the acoustic domain, using the acoustic model to predict acoustically appropriate sounds. Forms of interpolation to improve operational results are described for both mappings.

Chapter 5 presents suitable evaluation methodology, where both large-scale automatic evaluation and small-scale subjective tests are performed. Evaluation tasks test prediction performance in both retrieval directions against baseline methods using held-out data. Subjective evaluation tests manual ratings of predictions against those of true sounds/descriptions. Finally, in chapter 6, analysis allows us to infer some overall conclusions about the value and future direction of the semantic audio system.

1.3 Literature Review

Though audio content has never been lacking, for years it has often been overlooked while multimedia research efforts predominantly focused on image and video elements. However, in recent years there has been increasing interest in automatically processing audio content for both indexing and retrieval, particularly for the purposes of integration with multimedia systems, e.g. the University of Mannheim's Movie Content Analysis (MoCA) project [24]. Similarly, there has also been interest in automated handling of audio archives, e.g. the Muscle Fish SoundFisher system [36]. Much of this work is experimental and no comprehensive techniques have been established, yet there is a rich literature to exploit from many closely related fields such as speech recognition, speaker identification, music classification and information retrieval, all of which are discussed as appropriate.

1.3.1 Audio Classification

In recent years a substantial literature on audio classification has developed. Approaches mainly differ in the set of acoustic features used to represent the audio signal and the classification technique applied. For example, a violence detection system developed for the MoCA project [24] predicts gunshots, explosions and cries based on statistics of the waveform (e.g. measures of amplitude and frequency) using correlation and Euclidean distance measures. Another system for speech, music and noise segmentation and classification, developed by Lu & Hankinson, uses similar waveform statistics and decision tree based classification [23]. However, most current research only concerns a small number of sound types (often involving some speech content), e.g. music and speech discrimination [29] or silence, music, speech, noise classification [20]. Consequently the features and discrimination techniques are tailored to a specific domain and are unlikely to apply well to the general case. Nevertheless, this research provides valuable insight into effective classification techniques and acoustic features.

Classification Techniques

Several techniques have been employed for the purpose of classifying an unknown sound. The principle is to measure similarity between an input feature vector and those of known sounds. In the early days of speech processing, template matching between feature vectors was the intuitive approach. Current acoustic research favours stochastic models, which provide more flexibility and more theoretically meaningful likelihood scores. Of these the most common approaches are Gaussian model based methods [29], [32], [21], hidden Markov models [38], nearest neighbour methods [21], [29], neural network variants [21], vector quantisation [13], [25] and support vector machines [35].

For example, Scheirer & Slaney [29] investigate Gaussian based models, nearest neighbour and spatial partitioning approaches employing 13 different acoustic features (such as spectral centroid and zero-crossing rate) for a speech and music discrimination task. They conclude that the topology of the feature space is rather simple and that there may be little performance difference between classification methods. This claim is also backed by other larger studies; for example, Liu & Wan conclude that four classification techniques all achieve similar results (between 56-64% accuracy) on a larger-scale 29-class problem. Furthermore, both Scheirer & Slaney [29] and Li et al. [20] argue that the choice of acoustic feature appears to be more critical than the classification method.

Acoustic Features

A broad selection of acoustic features has been applied with varying success on different tasks. Generally features are either derived from simple measures of a waveform (e.g. energy functions, fundamental frequency) or may be motivated by perception, such as pitch and loudness. For example, the general audio study by Zhang and Kuo [38] applies an energy function to measure amplitude variation over time, zero-crossing rate to estimate spectral properties, fundamental frequency to capture harmonic properties of the signal, etc. Such features obtained from the time, frequency and time-frequency domains are numerous, and a comprehensive study for the case of general audio is undertaken by Liu & Wan, considering 87 features for a content-based classification task in order to build an optimal feature vector [21]. Likewise, the speech and music discrimination work by

Scheirer & Slaney tests combinations of 13 acoustic properties [29]. Composite feature vectors obtained through such work have been used effectively for discrimination (Scheirer & Slaney report less than 2% error on a small test set). Both studies show that optimal feature selection depends on the domain and classification technique. As the feature compositions are optimised for a specific domain, they are unlikely to scale well to more complex discrimination tasks.

Alternatively, properties motivated by perception such as pitch, loudness and timbre are clearly important for us to distinguish between sounds but are difficult to quantify. Attempts to model human auditory perception in every detail are impractical due to the complexity and only partial knowledge of the process. However, compact representations of a signal can capture significant frequency and energy information in an attempt to model known perceptual properties. In speech research, features such as Mel frequency cepstral coefficients (MFCCs) or linear prediction coefficients (LPCs) have been demonstrated to provide good representations of a speech signal, allowing for better discrimination than temporal or frequency based features alone [18].

However, as both MFCCs and LPCs are intended to model speech, their effectiveness with non-speech is questionable. In particular, LPCs are based on speech production rather than perception, and the rudimentary vocal tract model is unlikely to provide a good representation of more general sounds, which may often lack resonance and exhibit fricative sources (though both Liu & Wan [21] and Li et al. [20] use them effectively with non-speech). MFCCs, on the other hand, are derived from a sinusoidal based expansion of the energy spectrum and are capable of capturing more varied spectral phenomena. MFCCs correspond to a frequency smoothed log-magnitude spectrum which suppresses undesirable spectral variation, particularly at higher frequencies [7]. This perceptual motivation makes them ideal for general audio discrimination as they capture crucial properties used in human hearing. MFCCs are ubiquitous in speech research, but they have been applied successfully in non-speech tasks such as the music system developed by Pye [25], another by Berenzweig et al. [3], and also more general audio studies by Foote [13] and Liu & Wan [21]. Li et al. conclude from their study that cepstral features such as MFCCs perform better than temporal or frequency based features and advocate their use for general audio tasks, particularly when the number of audio classes is large [20].

Semantic Attachment

Regardless of the choice of acoustic features or classification method, all content-based classification systems treat prediction as a statistical classification into one of a number of predefined classes, ignoring any notion of meaning. However, the intention of this work is to describe each sound with a string of semantically appropriate words, based on the known descriptions. Some semantic representation and a method to link predictions of a classifier to a point in semantic space is required.

It is in this manner that Slaney proposes a state-of-the-art system which incorporates a mapping between audio and semantic spaces [31], [32]. Methods are developed to describe general audio with words (and also predict sounds given a text query) using a labelled sound set. In brief, audio is represented by a stacked MFCC vector, using linear discriminant analysis to reduce dimensions and promote separation of acoustic classes. To predict acoustic similarity, Gaussian mixture models (GMMs) are applied, and a clustering method is used to permit generalisation from the training sounds. To generate a description given a test sound, a linkage is made to predict words from the descriptions associated with the most similar training sounds. In his initial work [32] prediction only involves the single best acoustic answer; the later study [31] employs a mixture of experts approach [33] to interpolate between answers and predict more suitable descriptions. This concept provides an effective framework on which to build the proposed system, though implementation is also influenced by other work. Where relevant, notable differences and their justification are described.

The evaluation phase of Slaney's initial study only consists of demonstrating examples from the training material [32]. The later study involves evaluation with a held-out test set which is used to test predicted labels against true labels [31]. We hope to supplement evaluation by also testing against baseline methods and involving human judgement. Content-based approaches exhibit an inability to suitably predict a test sound of a type not in the database; the intention of the proposed system is to judge novel events based on similarity to known examples. In this way, a semantic approach can deal with a more extensive set of acoustic events provided the initial training set allows for such generalisation. Slaney achieves this by clustering the acoustic space to create a general-to-specific hierarchy of sounds [32].

1.3.2 Audio Retrieval

Foote provides a now slightly dated but comprehensive overview of audio retrieval [14]. Typical approaches are query by example (QBE), which allows retrieval of sound(s) based on similarity to acoustic properties of user supplied sounds or templates, or query by keyword (QBK), which allows users to search via textual queries but requires annotations paired with sounds.

A classic example of QBE is the query by humming method used to retrieve music by humming a melody (based on whether a note is higher or lower in pitch than the previous note) developed by Ghias et al. [17]. While this is surprisingly effective for retrieval of music scores, such queries are not particularly natural or convenient for other sound types. The Muscle Fish system (later developed into the commercial application SoundFisher) implements retrieval for a general audio database based on similarity between psycho-acoustic properties, e.g. loudness, pitch, harmony [36]. The system measures similarity (by Mahalanobis distance) between a new sound and sounds in a database, which are then ranked on proximity. Alternatively, the database can also be sorted via parametric relations such as pitch and brightness. The authors demonstrate retrieval on a manually annotated collection of 400 sounds (classified into laughter, percussion etc.) but do not formally evaluate retrieval accuracy. Foote applies a vector quantisation approach to retrieval using MFCC audio features; he also evaluates against the sounds in the Muscle Fish database with a similar demonstration [13].

The advantage of the QBE approach is that similarity is derived from the audio signal (annotations are not required) and can therefore be applied inexpensively on a large scale. However, QBE is not orientated towards the kind of audio semantics proposed for this study, where queries based on acoustic properties may well be effective at finding acoustic similarity but not necessarily higher-level semantic relations. This indicates a gulf between users' needs and current QBE methods. It is apparent that modelling the high-level meaning of sound requires semantic (e.g. textual) content.

In essence, QBK is the same problem as conventional text information retrieval, the aim being to retrieve relevant documents (though associated with sounds) through a textual query. Though literal keyword matching is the simplest approach and capable of retrieving exact matches, due to the subjective nature of descriptions it can be difficult

to satisfy a particular query. For example, search terms differ from one user to the next, in the worst case resulting in frustrated or failed searches. However, retrieval can be augmented by grouping closely related terms as concepts to allow matching of the most relevant sounds in a database, ideally finding similarity in the same way a real user would.

A suitable information retrieval technique is latent semantic analysis (LSA), devised by Deerwester et al. [9]. LSA is a vector based semantic approach designed to solve the underlying problem of synonymy through dimensionality reduction. The authors demonstrate it to be effective at improving retrieval of relevant documents over literal term matching, where users need not know exact search terms to achieve useful results. A technique derived from factor analysis is used to reduce a matrix of documents indexed by terms into a lower dimensional space. This effectively models the global usage patterns of terms so that documents sharing related concepts (rather than just literal terms) are represented by nearby vectors in the lower dimensional space, as the sketch below illustrates. The semantic system developed by Slaney uses an alternative approach where multinomial clustering is used to group together alike documents, in a similar manner to the approach used on the acoustic space [32]. In essence, this achieves a similar result to LSA, though Deerwester et al. argue that hierarchies are too limited to capture the rich semantics of most document collections [9].
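To make the reduction concrete, the following toy sketch applies a truncated singular value decomposition to a small term-document matrix. It is not code from the thesis: the four-word vocabulary, the counts and the rank k = 2 are invented purely for illustration.

    import numpy as np

    # Rows index terms ("dog", "bark", "howl", "engine"); columns index four
    # sound descriptions. Both the counts and the rank k are toy values.
    X = np.array([[1., 1., 0., 0.],     # "dog"    appears in descriptions 0 and 1
                  [1., 0., 1., 0.],     # "bark"   appears in descriptions 0 and 2
                  [0., 1., 1., 0.],     # "howl"   appears in descriptions 1 and 2
                  [0., 0., 0., 2.]])    # "engine" appears only in description 3

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    Uk, sk = U[:, :k], s[:k]
    docs = Vt[:k, :].T * sk             # one latent-space vector per description

    def fold_in(query_counts):
        """Map a raw term-count query into the latent space (q^T U_k S_k^-1)."""
        return (query_counts @ Uk) / sk

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    q = fold_in(np.array([1., 0., 0., 0.]))        # a query containing only "dog"
    print([round(cosine(q, d), 2) for d in docs])
    # Description 2 ("bark howl") scores well despite sharing no literal term
    # with the query, because its terms co-occur with "dog" elsewhere;
    # description 3 ("engine") scores near zero.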

1.3.3 Image and Multimedia Retrieval

The substantial literature encompassing image and multimedia research reveals a number of notable and relevant retrieval techniques. For example, IBM's Query by Image Content (QBIC) system [12] is a QBE approach allowing comparison of images on properties such as colour histograms, texture information, foreground objects (in a limited fashion), backgrounds etc., and can allow queries by colour distributions, example images or even user-constructed sketches. In some respects this is a very successful technique for image retrieval and can find images (or video) similar in property to the query. However, this system makes no pretence of attaching semantics to the queries; for example, finding an image of a bird could involve sketching a bird shape, and the system could find similarly shaped items (though not necessarily birds). This method cannot handle retrieval of differently shaped birds, which humans could readily recognise as birds nonetheless.

Studies by Barnard et al. [1] indicate that queries based on image histograms, texture etc. are uncommon, suggesting that this is not the natural way we think about media types. In contrast, their work on the semantics of words and pictures [1], [2] creates a semantic linking between images and words, allowing automated annotation and semantic retrieval much in the same manner as the proposed audio system. They implement a joint model where image features (such as measures of colour, texture and shape) are combined with text features to create a single feature space. The authors also introduce the concept of correspondence to associate labels with distinct regions of an image. Correspondence is unnecessary for the present acoustic model, where the foreground/background problem is not yet under investigation. This semantic image system has some similarities to the proposed work and provides insight into semantic modelling and representing the semantic space.

1.4 Dataset

Audio retrieval systems are typically created from insufficient audio datasets consisting of raw audio content with a truncated filename or perhaps a brief description to suffice for high-level information. Potentially, web-based audio retrieval could benefit from capturing words surrounding or related to sounds found on web pages, in the same manner as conventional information retrieval. Ideally, annotations should precisely describe audio content, but in practice they can vary considerably in consistency and comprehensiveness depending on the purpose and source. The insufficiency of crucial semantic information is a major obstacle for retrieval systems; clearly, richer annotation would benefit a semantic model.

Consequently, for this study a dataset of sounds paired with reasonably thorough and consistent descriptions was chosen. The dataset consists of a set of CD-ROMs containing over 3,000 isolated sounds with annotations (the XV Series Sound Effects Library from Sound Ideas). Though the samples are intended for film and multimedia production, the studio quality recording and consistency in labelling are ideal for this work. Sounds are divided into a broad range of general categories (e.g. airplane, animal, household sounds etc.), each with a suitable label (organised by hierarchy); see table 1.1. This type of concise labelling lends itself well to a workable semantic model.

Category      Title & Description                                                      Time
ANIMALS       animal, frog: great basin spadefoot toad: single call amphibian          0:01
ANIMALS       animal, wolf: timber-wolf: one wolf howling                              0:06
AUTOMOBILES   auto, police: ext: pass by at fast speed with siren emergency vehicle    0:17
HOUSEHOLD     household, toaster: pop up                                               0:03

Table 1.1: Example listings from the XV Series Sound Effects Library

The audio samples are recorded at a sampling rate of 44.1 kHz with two (stereo) audio channels. Before use, the stereo channels are mixed to a single monaural channel for each sample. Typically sounds are a few seconds in length and contain only a single sound source (or occasionally a mixture of related sounds). Descriptions contain on average eight words (with a maximum of 27 words), limited by information retrieval standards but sufficient for a practical model.

Before allocation of training and testing sets, some refinement and pre-processing of the dataset is required. Some of the samples fall into indistinct categories such as exercise equipment, gas station sounds or hospital sounds, and it proves difficult to create effective models which can distinguish between these sound types; consequently such examples are excluded from the dataset. This is justified as we would not expect to perceive the difference between the clanks of a bench press and various other indistinct sound types without visual cues. The refined dataset consists of sound types which we would reasonably expect a human user to distinguish between. Additionally, some categories are very similar to others (e.g. crashes and impacts), and the pre-processing stage involves merging similar categories and partitioning others. Finally, very short samples and a few of the longer ones are also excluded from the dataset.

The result is a set of 914 isolated sounds divided among 36 categories and 90 subcategories. From this bank of isolated sounds with associated descriptions, a conventional training, development and testing split was constructed. The training set is constructed from approximately 70% of the archive (allocated randomly), with the remainder of held-out examples divided between the development and test sets, as in the sketch below.
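Only the 70% training share is stated above, so the even split of the remainder between development and test sets in this sketch is an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(0)          # fixed seed for a repeatable split
    idx = rng.permutation(914)              # one index per sound in the refined set
    n_train = int(0.70 * len(idx))          # ~70% for training, allocated randomly
    n_dev = (len(idx) - n_train) // 2       # remainder shared between dev and test
    train, dev, test = np.split(idx, [n_train, n_train + n_dev])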

1.5 Acoustic-Semantic Framework

The overall purpose of the system is to predict words to describe a sound and, in the opposite direction, retrieve sounds via a textual query. To achieve this, separate acoustic and semantic spaces can be constructed. The acoustic space concerns the actual audio content and the semantic space the words attached to a sound, e.g. "dog barking". For retrieval we wish, given an example in one domain, to make the most appropriate prediction in the opposite domain.

Statistical models can be constructed for each domain, where mathematical feature spaces and appropriate classification techniques are applied. The purpose of these models is to provide a measure of similarity between points in a feature space. For example, in the acoustic space the model should predict acoustic similarity between sounds. In the semantic space the model should predict semantic similarity between the descriptions belonging to sounds. Thus each model can perform classification within the respective domain. Such models then serve as a channel to map to the opposite domain for retrieval. This is achieved by creating two one-way linkages between domains using the known relationships between sound and description pairs, thus achieving our goal of retrieval in either direction. Acoustic-to-semantic retrieval involves using the acoustic model to predict sounds matching an audio query; a word model then takes the associated descriptions and predicts words with a high likelihood of describing the query sound. Semantic-to-acoustic retrieval uses the semantic model to predict descriptions matching a textual query; the associated sounds can then be used by an audio retrieval component to predict sounds relevant to the query. This architecture is illustrated conceptually in figure 1.1, and a toy sketch of the two linkages follows below.

An alternative approach would be to combine both words and sounds into the same feature space and create a joint two-way model to predict words in one direction and sounds in the other, cf. the work of Barnard et al. [1]. However, as the proposed method focuses on distinct measures of acoustic and semantic similarity, two separate models are more appropriate.

In this work both sounds and descriptions are represented by a point in a high-dimensional vector space, a quality crucial for capturing significant information in the content, for reducing the effect of noise and for computational performance. Appropriate high-level features can vastly improve pattern recognition performance over low-level content and are more semantically meaningful [37]. As the chosen feature representations are critical for prediction performance, we discuss and justify the choices and where necessary carry out experiments to determine optimal parameters.
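As a toy, runnable illustration of the two linkages (not the thesis implementation, which uses the GMM and LSA machinery of chapters 2-4), suppose sounds are already points in an acoustic feature space and descriptions are bags of words; nearest-neighbour distance and word counting stand in here for the acoustic model and word model:

    import numpy as np

    # Three training pairs: acoustic points and their bag-of-words descriptions.
    sounds = np.array([[0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
    descriptions = [{"dog", "barking"}, {"dog", "growl"}, {"jet", "takeoff"}]

    def annotate(query_point, n_words=2):
        """Acoustic-to-semantic: predict words for an unlabelled input sound."""
        order = np.argsort(np.linalg.norm(sounds - query_point, axis=1))
        votes = {}
        for i in order[:2]:                     # two acoustically nearest sounds
            for word in descriptions[i]:
                votes[word] = votes.get(word, 0) + 1
        return sorted(votes, key=votes.get, reverse=True)[:n_words]

    def retrieve(query_words):
        """Semantic-to-acoustic: rank sounds by description overlap with a query."""
        overlap = [len(query_words & d) / len(query_words | d) for d in descriptions]
        return list(np.argsort(overlap)[::-1])  # indices of sounds, best first

    print(annotate(np.array([0.05, 0.95])))     # -> ['dog', ...]
    print(retrieve({"dog"}))                    # dog sounds first, jet last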

Figure 1.1: Overview of system architecture with example inputs and outputs

System components are designed modularly to aid integration of alternative methods. Parameters are controlled by a configuration file in order to facilitate experiments on parameter settings and optimisation. However, except where appropriate, we do not explicitly discuss exact implementation details, preferring more concise mathematical descriptions that can be replicated. Additionally, the aim of an automated system is that the modelling should be largely unsupervised. After the user creates a parameter configuration, the system then (automatically and without supervision) extracts features, learns similarity and a linking between domains, requiring only a pairing of sounds and their corresponding descriptions as input.

However, there are some limitations on the focus of the study. Firstly, the descriptions predicted by acoustic-to-semantic retrieval are not intended to be sentences, rather a collection of descriptive words. Also, for this study we limit the scope to isolated sounds of the type in our chosen dataset, and we do not investigate the effects of segmentation or of multiple sound sources occurring within samples.

Chapter 2
The Acoustic Model

The purpose of the acoustic model is to transform a given sound into a point in acoustic space where sounds can be compared on acoustic similarity. Distinguishing between sounds is not a trivial task; an audio archive will contain a broad range of sound types which are distinguished by different acoustic characteristics. The difficulty stems from the requirement to extract features capable of capturing the individuality of each sound source. It would be unrealistic to anticipate a solution to all audio classification problems, but the aim is to present a generally applicable solution.

2.1 Feature Extraction

As low-level waveforms are opaque and difficult to compare, a better strategy is to extract some higher-level, more meaningful feature from the signal. In order to facilitate accurate prediction, an acoustic feature should ideally distinguish between significant acoustic variation yet eliminate irrelevant spectral detail and noise which do not contribute to recognition. As discussed in our review of the literature, Mel frequency cepstral coefficients are a well established acoustic representation derived from the energy spectrum and are capable of capturing varied spectral phenomena. Though intended to model speech, they have been applied successfully in non-speech tasks such as the music system developed by Pye [25] and more general audio studies by Foote [13] and Liu & Wan [21]. This success indicates that they are a good starting point for general audio discrimination.

2.1.1 Mel Frequency Cepstral Coefficients

MFCCs are obtained through a frame-based analysis of a signal, where the waveform is divided into a sequence of frames (usually in the tens of milliseconds), the purpose being to smooth the frequency spectra and reduce the effects of acoustic variation. A sinusoidal transform (discrete Fourier transform) is performed using a Hamming window overlapping each frame to obtain an amplitude spectrum, which is then converted to a Mel-scale spectrum using triangular filters emphasising frequencies according to their perceptual importance on this scale. The particular scale used in this study incorporates linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz in order to reflect perception of frequency. The final stage takes the logarithm of the Mel-weighted spectrum, and another sinusoidal transform (discrete cosine transform) reconstructs these values into a number of cepstral coefficients augmented with a zeroth coefficient representing the overall energy of each frame [7].

The result is a vector of reasonably uncorrelated coefficients describing smoothed frequency and compressed amplitude information. The advantage of uncorrelated features becomes apparent when applying statistical models. Valuable lower and mid-range frequencies and energy are retained, yet the representation exhibits some robustness in the presence of noise. Clearly there is also a computational advantage in dimensionality reduction. MFCCs are popular in audio discrimination tasks because they capture acoustic properties useful in perception, and this motivation is instrumental in justifying their use for a semantic audio system. The extraction pipeline is sketched below.
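The following numpy sketch walks through the stages just described (framing, Hamming window, DFT, triangular mel filterbank, log, DCT). It is a minimal illustration, not the extraction code used in the thesis: the 26-filter bank and the 2595 log10(1 + f/700) mel formula are common textbook choices, with only the 10 ms frames, 50% overlap and 13 coefficients taken from the text.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        # Mel scale: roughly linear below 1 kHz, logarithmic above.
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(signal, sr=44100, frame_len=441, hop=220, n_filters=26, n_ceps=13):
        """Return an (n_frames, n_ceps) matrix of MFCCs from a mono waveform."""
        # 1. Frame the waveform (10 ms frames, 50% overlap) with a Hamming window.
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop:i * hop + frame_len]
                           for i in range(n_frames)])
        frames = frames * np.hamming(frame_len)
        # 2. Amplitude spectrum via the discrete Fourier transform.
        spec = np.abs(np.fft.rfft(frames, axis=1))
        # 3. Triangular filters spaced evenly on the mel scale.
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
        bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, spec.shape[1]))
        for j in range(n_filters):
            lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
            fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        # 4. Log of the mel-weighted spectrum, then a DCT to give reasonably
        #    uncorrelated coefficients; coefficient 0 acts as the energy term.
        logmel = np.log(spec @ fbank.T + 1e-10)
        return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]

    feats = mfcc(np.random.randn(44100))   # 1 s of noise -> (199, 13) features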

Limitations of MFCCs

However, some stages in the MFCC extraction process may be of questionable suitability for non-speech, including the emphasis on mid-range frequencies, the size of frame and the number of MFCC coefficients. Logan investigates the application of MFCCs in modelling music, concluding that they are at least effective for music/speech discrimination and that there is no evidence to suggest that the frequency emphasis is inappropriate [22]. Other aspects such as frame size and number of coefficients are investigated for the case of general audio in section 2.4.1.

Furthermore, although the extraction process attempts to reduce the effects of noise, there is no reason to stop a classifier learning undesired properties from the MFCC representation. It is possible that a model could learn to distinguish between sounds based on background noise (such as the silence occurring between footsteps) rather than content. To counteract this, a method of silence detection and removal is implemented, section 2.4.3.

2.1.2 The Delta-cepstrum

MFCCs are essentially static features where coefficient values describe the signal over a single frame. A means to capture the time evolution over a longer interval should help increase distinction between dynamic sound types. For example, the salient feature of a particular sound could be a rising or falling pitch, while the instantaneous pitch itself is less important. In order to capture the slope or difference over a number of frames, a delta-cepstrum can be derived from the static cepstrum. Delta coefficients Δx(t) are derived for every time-slice t by the first-order time derivatives of cepstrum values over a window (t-k) to (t+k), usually ±1 or ±2 frames, where x is the input cepstrum coefficient [16]:

    \Delta x(t) = \sum_{i=-k}^{k} i \, x(t+i)    (2.1)

Computing delta coefficients for every cepstrum value results in a delta-cepstrum reflecting the change occurring in the static cepstrum. The second order derivative of the sequence, or double delta, calculated from the first order delta-cepstrum may also be applied to capture the acceleration of change in the cepstrum. Delta cepstrums are appended to the original static cepstrum, resulting in a composite feature vector as commonly used in speech recognition. This allows modelling the acoustic feature at a particular instant as well as capturing the rate and direction of change.

The benefit of dynamic features was demonstrated by Furui in an isolated word recognition task with a significant reduction in error rate over the static cepstrum alone [15]. Though such features derived from modelling speech do not automatically apply to non-speech tasks, the dynamic behaviour of sound features is clearly applicable to other modelling tasks. For example, dynamic features have been used successfully in the music classification system by Pye [25] and the speaker verification system by Reynolds [27]. It is anticipated that these enhancements should also improve distinction between general sound types. For example, different engine noises are characterised by unique short-term temporal fluctuations.
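A sketch of eq. 2.1 in numpy follows; repeating the edge frames to handle the window at sample boundaries is an assumption here, and the default k = 2 matches the ±2 frames mentioned above.

    import numpy as np

    def delta(ceps, k=2):
        """First-order delta coefficients of eq. 2.1 over a +/-k frame window."""
        n = len(ceps)
        padded = np.pad(ceps, ((k, k), (0, 0)), mode='edge')  # repeat edge frames
        return sum(i * padded[k + i:k + i + n] for i in range(-k, k + 1))

    def with_deltas(ceps, k=2):
        """Composite vector: static + delta + double delta, tripling the dimension."""
        d = delta(ceps, k)
        return np.hstack([ceps, d, delta(d, k)])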

2.2 Audio Classification

Each sound consists of a mass of points in MFCC feature space, and to measure similarity we require a method of comparing distributions in this space. Though a variety of pattern recognition techniques can be applied, it seems natural to use methods that can probabilistically model such distributions.

2.2.1 Gaussian Mixture Models

Gaussian mixture models (GMMs) are a widely used supervised classification technique that improves over a single Gaussian distribution, accommodating a broader, more complex range of distributions using a combination of simple components. They can effectively model distributions where the data points originate from separate clusters but membership is unknown. The idea is to fit the data with regions of probability mass on the assumption that high-level data points are clustered into groups and that each cluster follows a Gaussian distribution (or at least that this is a reasonable approximation). This is appropriate as it is anticipated that points emanating from a sound source will typically form a cluster in high dimensional space. Once properly trained, a GMM should be effective at predicting the likelihood of a new test sound matching the trained data.

For general audio discrimination the content of the data is usually more important than the sequence or structure that is important in speech recognition. GMMs are suitable as they model the whole sample as a mass of points, dismissing temporal order (though it can be beneficial to capture the short-term rate of change by a delta-cepstrum). In the case that a sound type is characterised by long-term temporal ordering (e.g. bird song), it may be beneficial to apply a method such as hidden Markov models (as used extensively in speech recognition) [16] to capture the characteristic sequence, though this seems unnecessary for the chosen dataset.

GMMs are commonly used in speech research, providing improved discrimination over single Gaussian distributions in many cases. For example, Reynolds [27] achieves

good results on databases of differing quality by employing GMMs to verify a speaker's voice pattern, in the same manner as we wish to capture the class of a sound. Additionally, Singer et al. develop an effective language identification system using GMMs and delta-cepstrums [30]. Indeed, mixtures are applicable to more general audio problems such as the music classification tasks by Pye [25] and Berenzweig et al. [3]. Though a sound may not necessarily form distinct clusters in acoustic space, the distribution can always be approximated by using enough mixture components. The benefit of modelling the data with a number of simple components is clear, but in each case there is the problem of determining the appropriate order of GMM to fit the data. The effect of mixture sizes is investigated in section 2.4.4.

Parameters of a Gaussian Mixture

To model an acoustic feature vector x (of dimension D), the distribution is described by

    p(x) = \sum_{i=1}^{K} \pi_i \, p(x \mid \theta_i)    (2.2)

where K is the number of mixture components, π_i is the probability that component i contributes to modelling the data and θ_i the parameters of component i. In the case of GMMs, θ_i = {µ_i, Σ_i} and the probability density function is a multivariate Gaussian distribution

    p(x \mid \theta_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    (2.3)

where µ_i is the D x 1 mean vector and Σ_i the D x D covariance matrix of component i. Hence, for each training case there are two sets of parameters to estimate, the mixing weights π_i and the parameters θ_i. An appropriate value for K must also be determined. The sketch below transcribes these two equations directly.
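Equations 2.2 and 2.3 translate directly into code; the two-component, two-dimensional parameter values below are toy numbers for illustration only.

    import numpy as np

    def gaussian_pdf(x, mu, cov):
        """Multivariate Gaussian density of eq. 2.3."""
        D = len(mu)
        diff = x - mu
        norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(cov))
        return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

    def gmm_pdf(x, weights, mus, covs):
        """Mixture density p(x) of eq. 2.2: a pi-weighted sum of K Gaussians."""
        return sum(w * gaussian_pdf(x, m, c)
                   for w, m, c in zip(weights, mus, covs))

    weights = [0.6, 0.4]                              # mixing weights pi_i
    mus = [np.zeros(2), np.array([3.0, 3.0])]         # component means mu_i
    covs = [np.eye(2), 0.5 * np.eye(2)]               # covariances Sigma_i
    print(gmm_pdf(np.array([0.1, -0.2]), weights, mus, covs))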

Estimation of Parameters

There is no closed form to calculate GMM parameters directly, but a maximum likelihood estimate can be obtained over a number of iterative steps, often achieved via the expectation-maximisation (EM) algorithm [10]. Given a set of training vectors x_1, x_2, ..., x_P, the EM algorithm iteratively refines the GMM parameters to increase the likelihood of the estimated model for the observed data. Starting with a parameter initialisation, the EM algorithm then proceeds over two steps.

Expectation step: find the data points closest to a mixture component j, where w_ij is the probability that x_i belongs to cluster j using the current estimate of parameters, and p(x | θ_j) is computed as in eq. 2.3:

    w_{ij} = \frac{\pi_j \, p(x_i \mid \theta_j)}{\sum_{k=1}^{K} \pi_k \, p(x_i \mid \theta_k)}    (2.4)

Maximisation step: calculate the new weight π̂_j, mean µ̂_j and covariance Σ̂_j over the data points closest to each cluster j:

    \hat{\pi}_j = \frac{1}{P} \sum_{i=1}^{P} w_{ij}    (2.5)

    \hat{\mu}_j = \frac{1}{P \hat{\pi}_j} \sum_{i=1}^{P} w_{ij} \, x_i    (2.6)

    \hat{\Sigma}_j = \frac{1}{P \hat{\pi}_j} \sum_{i=1}^{P} w_{ij} \, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T    (2.7)

Both steps are repeated (with the maximisation estimates becoming the parameters at the next stage) for a set number of iterations or until convergence of the complete-data likelihood, \prod_{i=1}^{P} \sum_{j=1}^{K} \pi_j \, p(x_i \mid \theta_j). The derivation can be found in [6].

In this work individual clusters are not represented by a full covariance matrix but by diagonal approximations. The approximation causes the cluster axes to be orientated parallel to the axes of the feature space. This is done for reasons of computational efficiency during parameter estimation. Furthermore, studies by Reynolds have shown that any Gaussian mixture with full covariance can be replicated by a larger order mixture with diagonal covariance, which can even outperform full covariance models [27].

Unfortunately the procedure can converge to different solutions depending on the initialisation of parameters. A good initialisation strategy adopted in this study is to randomly assign each mixture component to a subset of data points and set the mixing weights π_j to 1/K. Several iterations of the K-means algorithm (a non-probabilistic method) are then used to quickly converge on reasonable estimates of the mean parameters, and the EM algorithm then proceeds. It was observed that fewer than ten iterations are needed for convergence.
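This training recipe (k-means initialisation, diagonal covariances, a floor on covariance values, a handful of EM iterations) can be reproduced with scikit-learn's GaussianMixture, shown here on synthetic stand-in frames; the data, K = 4 and the exact floor value are illustrative, not thesis settings.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    frames = np.vstack([rng.normal(0.0, 1.0, (200, 13)),   # stand-in MFCC frames
                        rng.normal(4.0, 1.0, (200, 13))])  # from two rough clusters

    gmm = GaussianMixture(n_components=4,
                          covariance_type='diag',  # diagonal covariance approximation
                          init_params='kmeans',    # k-means before EM, as above
                          reg_covar=1e-4,          # floor stops covariances collapsing
                          max_iter=10,             # few iterations needed in practice
                          random_state=0)
    gmm.fit(frames)
    print(gmm.weights_.round(2))                   # estimated mixing weights pi_j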

A further problem is that covariances will tend to zero as the likelihood tends to infinity, e.g. if a mixture component models a single point or points close together. To prevent this, a size constraint is imposed on covariance values.

The optimum number of mixture components K depends on various factors such as size, dimensionality and inherent clusters in the source data. Greater numbers of elements will fit the data more precisely but may capture outliers or noisy data elements. Fewer elements will fit the data more crudely but may perhaps allow better generalisation performance to sounds of a similar type. For illustration, GMMs are trained on just two dimensions (the energy term and first MFCC coefficient) over a range of mixture sizes. Figure 2.1 shows the regions of probability generated by these GMMs trained on a source sound of dog vocals. Using the described training strategy, mixture components tend to fit to heavily populated clusters and ignore outliers in the source sample. This inbuilt robustness to noise is an encouraging trait for good generalisation performance. Further experiments investigating prediction performance with differing mixture sizes are reported in section 2.4.4.

Figure 2.1: The source data (top left) and the probability mass generated by GMMs of differing order trained on this data

Classification

Once a set of class models has been trained by obtaining the parameters π_i and θ_i for each of the classes in turn, the models can be used to predict the most likely class membership of a new sound. Classification of a test sound, with feature vectors X = x_1, x_2, ..., x_N, is achieved by estimating the likelihood that each model could have generated X. The log likelihood of a model for the sequence of feature vectors is calculated as

    L(X \mid \lambda) = \frac{1}{N} \sum_{n=1}^{N} \log \sum_{j=1}^{K} \pi_j \, p(x_n \mid \theta_j)    (2.8)

where λ represents all the parameters of a GMM, λ = {θ_j, π_j : j = 1, ..., K}. Summing log values makes the assumption that the feature vectors of X are independent. The normalisation factor 1/N is used to normalise for sample duration; omitting this term would result in longer samples obtaining disproportionately low likelihood values. Reynolds argues that this normalisation factor counteracts the underestimation of actual likelihood values due to the incorrect independence assumption [27]. Likelihood values may be combined with a class prior to determine class membership; however, this is not appropriate for this study, where there is no prior information about which sounds will occur. Thus, to classify a sound, the test vector is tested against all models in turn and the sound is assigned to the class of the model predicting the highest likelihood.
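The classification rule of eq. 2.8 then reduces to a maximisation over a set of fitted models. In scikit-learn terms, score(X) already returns the duration-normalised log likelihood (1/N) Σ log p(x_n); the two toy class models below are illustrative, not thesis data.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def classify(test_frames, class_models):
        """Assign the class whose GMM gives the highest value of eq. 2.8;
        score() returns the per-frame average log likelihood."""
        return max(class_models, key=lambda c: class_models[c].score(test_frames))

    rng = np.random.default_rng(1)
    models = {}
    for label, centre in [("dog", 0.0), ("jet", 5.0)]:      # toy classes
        m = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
        m.fit(rng.normal(centre, 1.0, (300, 13)))
        models[label] = m

    print(classify(rng.normal(5.0, 1.0, (50, 13)), models))  # -> 'jet'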

2.3 Class-based Prediction

In order to demonstrate the described acoustic model, four sets of GMMs were trained and tested. Training Set 1 consists of a small set of 320 labelled sounds and Set 2 consists of 611 sounds. Each sound has two classes determined from the hierarchical labelling; e.g. in Set 1 there are 16 coarse labels such as airplane, and 53 fine labels corresponding to airplane: biplane, airplane: jet etc. Models were trained on the set of feature vectors belonging to each class using 8 element GMMs and a 13 dimensional MFCC vector obtained at a 10 ms frame rate. For each training set two collections of GMMs are trained, one for the coarse classes and another for the fine classes. Accuracy is measured by the number of correct class predictions on an unseen test set, table 2.1.

TEST                    ACCURACY
Set 1 (16 classes)      (56/75)   74.7%
Set 1 (53 classes)      (51/75)   68.0%
Set 2 (36 classes)      (105/153) 68.6%
Set 2 (90 classes)      (94/153)  61.4%

Table 2.1: Classification accuracy obtained with four sets of trained models on an unseen test set

As would be expected, prediction with coarser labels achieves greater accuracy; arguably the finer classes are too insubstantial (some classes having only 3 training sounds). Nevertheless, using fine labels still allows reasonable results. This approach of predicting into predefined categories is typical of those used in content-based audio classification, and these results compare favourably to studies such as the work of Liu & Wan, who achieve 56% test accuracy using GMMs with a larger training set divided into 29 classes [21].

The approach applied to acoustic-to-semantic and semantic-to-acoustic retrieval follows the work of Slaney, measuring acoustic similarity between sounds using individual GMMs for each sound rather than trained class models (the above models only serve as an illustration and are not used further). The purpose of this approach is to predict similarity between each sound in the set, and it is this measure of closeness that proves crucial in mapping between the audio and semantic spaces.

2.4 Experiments

In order to determine the parameter settings of the acoustic model most suitable for general audio, a number of experiments are considered. In particular, the MFCC parameterisation, the delta-cepstrum, silence detection and the number of GMM elements are investigated. To facilitate this investigation a classification framework is created in which particular parameter values can be varied while all other variables, including training and testing material, are kept constant. This framework involves using a representative subset of the training data and the development set as testing material. For each training sound, X_1, X_2, ..., X_n, a single GMM is initialised and trained, λ_1, λ_2, ..., λ_n (the same approach is used in acoustic-to-semantic prediction). Classification of a test sound is then achieved by querying each GMM, and the sound is assigned the class of the model

most likely to have generated it. Performance is simply rated by classification accuracy (between 36 classes) on the development set. The experiments are relatively small scale, so we should not expect them to provide definitive settings that apply to all types of general audio task. Nevertheless, it is anticipated that these results should be representative enough to illustrate useful information about the problem domain and ultimately allow improved discrimination with the acoustic model.

2.4.1 The MFCC Parameterisation

As the MFCC parameterisation is intended for modelling speech, there is a need to investigate parameter settings for the case of general audio. Therefore, various experiments are conducted comparing MFCCs generated at various informed parameter values in the described framework. Though numerous parameters affect the generation of MFCCs, we specifically investigate frame size, the number of MFCC coefficients and the energy term. The number of GMM elements is kept constant throughout to ensure fair comparison.

Comparing Frame Size

Though a frame size of 10 ms is conventional in the MFCC parameterisation used in speech research, there is no reason to assume such a resolution is suited to more general audio. Previous studies indicate the choice of frame size can vary quite widely depending on the problem domain; for example, Foote finds a frame rate of 20 ms to be effective for a music discrimination task [13]. Therefore, we compare frame sizes of 5, 10 and 20 ms in order to establish which provides the best overall discrimination for the audio used in this study. In each case the window overlap used in the frame based analysis is kept constant at 50%, and although alternative overlap proportions were tested, no clear difference was found over the values trialled. A set of GMMs is initialised and trained for each frame size and each is tested on the development set. The overall classification accuracy of each set is presented in table 2.2.

FRAME SIZE    ACCURACY
5 ms          77.8%
10 ms         77.1%
20 ms         71.9%

Table 2.2: The percentage of test sounds correctly classified using models trained with 5, 10 and 20 ms frame sizes

Analysis reveals that the greater resolution provided by 5 and 10 ms frames allows for a notable improvement in discrimination over a 20 ms frame size (over 20% error reduction). The difference in performance between the 5 and 10 ms sizes is marginal, and a frame size of 10 ms is chosen for use in this work for computational efficiency (training and testing time does not simply increase linearly with a doubling in frame rate). Additionally, from these results we can conclude that a frame size below 20 ms is preferable for the discrimination of the general audio sounds in this dataset. In particular, the higher frame rates are advantageous for effectively capturing sounds more dynamic than the human voice. Analysis of misclassifications indicates that higher frame rates improve discrimination between dog, bird and human vocals and also between engine noises such as car, boat, motorcycle etc.

The Number of Cepstral Coefficients and the Energy Term

In speech research a thirteen dimensional vector consisting of twelve cepstral coefficients plus the energy term is standard. Again there is no reason to assume this is the optimum number for more general audio. In fact, work by Berenzweig et al. demonstrates that a 20 dimensional MFCC vector achieves the best performance for a music classification problem [3]. Therefore an appropriate experiment is to compare different dimensions in the classification framework. Additionally, the impact of the energy term is investigated by conducting these experiments with and without it, figure 2.2.

Analysis indicates that a 13 dimensional vector is in fact globally optimal for the dataset (though the advantage over 8 and 16 coefficients is marginal). The biggest advantage can be gained by including the energy coefficient, which is clearly important for sound discrimination. Generally, fewer dimensions do not capture enough significant content and larger dimensions may channel too much noise to the modelling process. However, the optimal dimension appears to depend on the particular type of sound under analysis. From observation it was determined that the optimal dimension for one sound type such as animal vocals is not necessarily the best for other samples such as impact sounds, which are more impulse-like.

Figure 2.2: The percentage of test sounds correctly classified using 4, 8, 12, 16 and 20 cepstrum coefficients, with and without energy

Nevertheless, the best overall dimension should be selected in order to achieve the best prediction on unseen test sounds.

2.4.2 The Effect of the Delta-cepstrum

A feature vector based on cepstrum values does not capture any temporal change in acoustic properties. However, it is anticipated that a method of capturing short-term temporal changes (under 100 ms) will prove effective in distinguishing between certain sounds. Indeed, spectral transition is believed to play an important role in auditory perception [16]. Consequently, this experiment involves augmenting the static MFCC representation with delta-cepstrums and comparing discrimination performance. The delta coefficients are calculated over a 7 frame window (70 ms) by eq. 2.1. A window size of 70 ms was selected through comparative trials, where it was observed to obtain slightly better discrimination than a 50 ms window. The results from tests with and without applying delta and double delta-cepstrums are presented in table 2.3.

The disadvantage of such a representation is that the dimension of the feature vector is increased threefold, though the motivation to capture acoustic transition and improve prediction often outweighs this cost. Analysis reveals that misclassifications are reduced between classes with similar instantaneous properties but dissimilar acoustic variation over time. In particular, misclassification between various types of engine noise is notably reduced. For example, models trained on helicopter sounds were liable to cause false positive predictions. The sound of a helicopter rotor is characterised by short-term temporal fluctuations, and the application of delta-cepstrums was observed to entirely eliminate those misclassifications.

Silence Detection and Removal

In early trials it was observed that problems arise from the presence of background noise or regions of silence, e.g. the silence occurring between footsteps. This causes a problem for the modelling process, where a model may learn similarity between sounds based on these regions, resulting in erroneous classifications. In particular, models trained on such sounds are liable to generate false positives. A straightforward strategy to prevent this is to remove such content from waveforms beforehand.

A simple method to detect and remove silence is to measure the energy levels of the input sample and exclude frames whose energy values fall below a set threshold. In practice, detection is averaged over a number of frames so that only reasonably contiguous periods of silence are removed and the actual waveform content is not disrupted. Though rather rudimentary, this method was found to be reasonably effective. However, sounds are often sourced from different archives and recording environments, and this rather limited approach is less reliable in such situations. An energy threshold is too crude to reliably detect varying levels of noise, removing either too much or too little. Improvements can be made by accounting for other measures such as zero-crossing rate, and an adaptive threshold is an interesting prospect [24].
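A sketch of this energy-threshold method, with illustrative threshold and smoothing-window values:

import numpy as np

# Frame energies are smoothed over a window so that only reasonably
# contiguous low-energy regions are marked as silence.
def silence_mask(frames, threshold=1e-4, smooth=5):
    """frames: (n_frames, frame_len) array of waveform frames."""
    energy = np.mean(frames ** 2, axis=1)
    kernel = np.ones(smooth) / smooth            # moving average
    smoothed = np.convolve(energy, kernel, mode="same")
    return smoothed < threshold                  # True where a frame is silence

# Frames flagged as silence would be excluded before training:
# content_frames = frames[~silence_mask(frames)]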

Figure 2.3: Examples of predicted silence and content regions (waveform amplitude against time) for BIRD PARROT CALLS.wav and ANIMAL DOG02 MED BARKING.wav.

However, a more sophisticated approach is adopted from the speech/silence discrimination literature, cf. the work of Reynolds [27]. This involves creating one GMM to model silence and another to model content. The advantage is that this can account for the spectral difference between regions of silence and content, as well as capturing energy. Representative examples of both silence and content were marked out in the training set and the GMMs were then trained on this data. The content model required considerably more mixture components to fit the data. The test for a frame of silence is then by the likelihood ratio

    α = L(x | λ_S) / L(x | λ_C)    (2.9)

where x is the input frame, L(x | λ) is calculated as in eq. 2.8, and λ_S and λ_C are the estimated parameters for the silence and content GMMs, respectively. If α is greater than or equal to a threshold the frame is classed as silence, otherwise as content. As before, detection is averaged over a number of frames so that only contiguous periods of silence are removed. For illustration, example regions of silence and content predicted by this method are shown in figure 2.3.

Applying this method of silence removal to acoustic feature vectors before training results in a set of models relatively unaffected by noise elements in the feature space. However, evaluation by classification accuracy using such models indicates only a slight improvement, or none at all, over no removal. This is partly because the majority of sounds contain no silence periods (less than 10% contain any significant regions of silence) and so remain unaffected.
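The likelihood-ratio test of eq. 2.9 might be sketched as follows, assuming scikit-learn GMMs (the original work used the Netlab toolbox) and hypothetical arrays sil_feats and con_feats of frames marked as silence and content:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm_sil = GaussianMixture(n_components=2).fit(sil_feats)
gmm_con = GaussianMixture(n_components=8).fit(con_feats)  # content needs more components

def is_silence(frames, threshold=1.1):
    # score_samples returns per-frame log-likelihoods, so the ratio
    # of eq. 2.9 becomes a difference in the log domain.
    log_alpha = gmm_sil.score_samples(frames) - gmm_con.score_samples(frames)
    return log_alpha >= np.log(threshold)   # threshold > 1 under-predicts silence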

Another problem stems from the limited amount of silence-marked training material, which may not be sufficient for the model to fit all types of silence in the dataset. Additionally, it was observed that over-predicting silence is harmful and can decrease prediction accuracy; a more cautious prediction is therefore made by raising the threshold on α. A value slightly larger than 1 ensures under-prediction of silence, preventing harmful removal of content.

In order to demonstrate the value of silence detection more clearly, an artificial test set was created consisting of sounds known to contain background noise, e.g. footsteps and various animal noises (dog barking, horse galloping). Some examples contain around 80% silence. As before, performance is rated by classification accuracy: 90.48% of test sounds are correctly classified using silence detection against only 85.71% without it, an error reduction of around 33%.

The Number of Mixture Components

The optimum number of mixture components in a GMM depends on a number of factors, such as the shape and separation of clusters in the source data, and the size and dimensionality of the source sample. Generally, greater numbers of components will fit the data more precisely but may capture outliers or noisy data elements, whereas fewer components will fit the data more crudely but will perhaps generalise better to sounds of a similar type. This experiment compares the effect of varying the mixture size over all models. Mixture sizes of 2, 4, 8 and 16 are compared, along with an optimised configuration, MIX, discussed below. Throughout, the feature dimension is held constant using 13-dimensional MFCCs without delta-cepstrums (though trials indicate a similar trend with delta-cepstrums). Performance is measured by the percentage of correctly classified sounds from the development set (table 2.4).

The results suggest that a good policy is to use few mixture components to prevent over-fitting to the data. Indeed, although 16-component models are likely a good model of their own content, they provide poorer generalisation performance than using just 2 components. Using 4 mixture components provides an error reduction of 18% over any of the other single-size models.

ELEMENTS   ACCURACY
2          …
4          …
8          …
16         …
MIX        82.4%

Table 2.4: The percentage of correctly classified test sounds using GMMs of differing order.

However, the best overall strategy is the MIX configuration, which compares varying mixture sizes for each model. This is constructed by initially selecting a mixture size of 4 (the best overall size) for each model. An optimisation strategy is then used to select the best mixture size per model, rather than globally. The procedure involves testing the set of models on the development data and iteratively refitting models which cause either false positives or false negatives with differing mixture sizes, K = {2, 4, 8, 16}. For each model, the size achieving the best classification accuracy is selected; in the case of a draw the lowest-order model is chosen, for efficiency and potentially better generalisation in practice. This strategy was observed to reduce misclassifications considerably, resulting in improved performance on the test set compared to the single best mixture size (around 23% error reduction).

From analysis of the MIX results, it seems that the optimised mixture size varies according to the sample size or complexity of the sound (e.g. the number of sound sources present). It was observed that some of the lengthier and more complex samples generally suit 8 or 16 components (perhaps more in some cases), whereas 2-component models suit a few of the shortest or most stationary samples. Most examples in the dataset are short single-source sounds, and this explains why they can be modelled effectively with a small number of mixtures. Bearing in mind that these results are optimised for the development set, results on an unseen test set are less impressive: a small trial indicates an error reduction of less than 12% over the single best mixture size. A considerable drawback of this strategy is that the optimisation procedure is relatively intensive and would be impractical on larger datasets. Though effective, the optimisation strategy is not applied to the final clustered models used in acoustic-to-semantic retrieval (see chapter 4); instead, models are trained using the single best mixture size, allowing straightforward comparison between models.
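The per-model search behind the MIX configuration might be sketched as follows; train_sets and dev_accuracy are hypothetical helpers standing in for the dataset and the development-set scoring described above:

from sklearn.mixture import GaussianMixture

SIZES = (2, 4, 8, 16)

def fit_gmm(X, k):
    return GaussianMixture(n_components=k).fit(X)

# start every class model at the best single size, 4
models = {name: fit_gmm(X, 4) for name, X in train_sets.items()}

for name, X in train_sets.items():
    best_k, best_acc = 4, dev_accuracy(models)
    for k in SIZES:
        models[name] = fit_gmm(X, k)
        acc = dev_accuracy(models)
        # prefer the lowest order on a draw, as described above
        if acc > best_acc or (acc == best_acc and k < best_k):
            best_k, best_acc = k, acc
    models[name] = fit_gmm(X, best_k)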

Chapter 3

The Semantic Model

The semantic feature space is derived from the words forming each sound's description (termed a document). As documents are just text, standard text processing and information retrieval techniques can be applied to construct a semantic model. In this case the ordering of words within a document is largely irrelevant; the focus is on treating keywords as terms that can be mapped to concepts, the so-called bag-of-words approach.

However, before setting out to learn relationships between documents it is beneficial to perform some pre-processing on the text data to facilitate later indexing and retrieval. After the descriptions are tokenised (i.e. terms are extracted) and punctuation stripped, the next step is the exclusion of terms that appear in a stop list. The stop list is composed of common function words such as "the", "and", "while" etc., which have little semantic content. Though the collection of descriptions contains few function words, a custom stop list of 54 terms was applied for this purpose. The reason for excluding these terms is that they do not contribute to the meaning of a description and could cause a semantic model to learn undesirable relationships between terms and documents. The focus is on retaining descriptive verbs, nouns and adjectives.

The process of stemming to remove uninformative word endings is commonly applied in information retrieval tasks, including the semantic work by Slaney [31], [32]. The idea is that removal of plurals and suffixes will improve retrieval performance by mapping terms with the same stem to a single root. However, a trial of stemming on the text in this study found no benefit. Firstly, the corpus contains few terms that would benefit from this procedure, and secondly the vector space model applied to the semantic space in most cases finds a relationship between root forms and their derivatives without the need for stemming.
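A minimal sketch of this pre-processing, with a small illustrative subset of the 54-term stop list:

import re

STOP_LIST = {"the", "and", "while", "a", "of", "single"}

def preprocess(description):
    # tokenise, lower-case, strip punctuation, drop stop-listed terms
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOP_LIST]

print(preprocess("boat, sail: schooner: bow cutting through water"))
# ['boat', 'sail', 'schooner', 'bow', 'cutting', 'through', 'water']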

After pre-processing, the task is to construct a semantic model which, given a new textual description (or query), can determine its semantic similarity to existing descriptions. Ideally the semantic model should establish useful relationships between terms. Whereas literal term matching predicts only exact matches to a term, a concept-based approach can find relationships between terms following similar global patterns, potentially allowing more suitable retrieval. For example, matching terms related to "storm", such as "wind", "rain" etc., is likely to retrieve more relevant documents than otherwise.

3.1 The Vector Space Model

A vector space model can be used to represent the unique terms, t_1, t_2, ..., t_m, and documents, d_1, d_2, ..., d_n, occurring in a collection [28]. The collection of n documents indexed by m terms can be encoded by an m × n term-by-document matrix. A column denotes a document, indicating the terms indexed, and a row represents a term, indicating the documents in which it occurs. An example of a partial term-by-document matrix, recording term frequency, is shown in figure 3.1.

d1  boat, sail: schooner: bow cutting through water marine sports boating sailboat
d2  boat, storm: large ship bow slamming waves creaks weather
d3  canoe: pull up onto shore lake waves marine sports boating
d4  water, river: heavy flow ambience

              d1  d2  d3  d4
t1: ambience   0   0   0   1
t2: boat       1   1   0   0
t3: boating    1   0   1   0
t4: bow        1   1   0   0
...

Figure 3.1: Example of a partial term-by-document matrix.

The values recorded in a term-by-document matrix reflect the weighting of the term in the document and in the collection. A weighting scheme has two components: a local weight and a global weight [4].
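A sketch of constructing such a matrix, assuming scikit-learn rather than the Text to Matrix Generator (TMG) toolbox used in this work:

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer produces a document-by-term matrix, so it is
# transposed to match the m x n convention above.
docs = [
    "boat sail schooner bow cutting through water marine sports boating sailboat",
    "boat storm large ship bow slamming waves creaks weather",
    "canoe pull up onto shore lake waves marine sports boating",
    "water river heavy flow ambience",
]
vectorizer = CountVectorizer()
F = vectorizer.fit_transform(docs).T        # m terms x n documents
terms = vectorizer.get_feature_names_out()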

Local term weighting reflects the importance of the term in a document; in most document collections it is appropriate to represent the local weighting by the term frequency f_ij, the number of times term t_i occurs in document d_j. However, for this collection the local weight is represented by a binary value, where multiple occurrences of a term in a document are not taken into account:

    χ(f_ij) = 1 if f_ij > 0,  0 if f_ij = 0    (3.1)

The reason for this is that terms are merely keywords, so we only wish to record an occurrence once; e.g. "animal, cat: domestic cat purring" is equivalent to "animal domestic cat purring".

The second weighting component is global weighting, which reflects the importance of a term in the collection by weighting all occurrences of the term with the same value. A common strategy is inverse document frequency (IDF) weighting, where rare terms are up-weighted to reflect their relative importance. However, for this purpose it was found that the entropy weighting scheme used by Dumais [11] was more effective than IDF, and both of these schemes were more effective than no global weighting. The rationale of this information-theoretic approach is that terms occurring frequently in the collection have low information content. Thus, accounting for both local and global weighting, the overall weight of term i in document j is defined

    w_ij = χ(f_ij) [ 1 + Σ_j (p_ij log(p_ij)) / log(n) ]    (3.2)

where p_ij = f_ij / g_i and g_i is the total number of times term i occurs in the collection. In addition, document vectors are normalised using standard cosine normalisation to ensure that all documents are treated equally regardless of length. Accounting for weighting and normalisation, the values of a term-by-document matrix A are calculated as

    A_ij = w_ij / sqrt( Σ_i (w_ij)^2 )    (3.3)

As each document will likely contain only a few terms, most elements of the matrix will be zero. Due to the short length of descriptions in the document collection, only about 0.3% of the elements in the term-by-document matrix are filled.

Query matching in vector space can be performed by indexing a textual query in the same manner to produce an m × 1 query vector. This can then be compared against the existing document vectors using a distance metric such as the cosine angle between query and document vectors; this corresponds to literal term matching. Matching is not strict, in the sense that a query containing multiple words may retrieve relevant documents when not all words are present. This approach allows documents to be ranked by similarity to the query.
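The weighting and matching steps of eqs. 3.1-3.3 might be sketched as follows, where F is a dense term-by-document count matrix (e.g. F.toarray() from the previous sketch) and q is a query vector indexed in the same way:

import numpy as np

def weight_matrix(F):
    chi = (F > 0).astype(float)                     # binary local weight, eq. 3.1
    n = F.shape[1]
    g = F.sum(axis=1, keepdims=True)                # total count of each term
    p = np.where(F > 0, F / g, 1.0)                 # p_ij; 1.0 makes log(p) zero
    entropy = 1.0 + (p * np.log(p)).sum(axis=1, keepdims=True) / np.log(n)
    W = chi * entropy                               # eq. 3.2
    norms = np.sqrt((W ** 2).sum(axis=0, keepdims=True))
    return W / np.where(norms > 0, norms, 1.0)      # eq. 3.3

def match(A, q):
    # cosine similarity of a weighted query vector against all
    # document columns of A; A's columns are already unit length
    qn = q / np.linalg.norm(q)
    return qn @ A                                   # one score per document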

However, the vector space model is limited. Literal term matching is unlikely to retrieve all relevant documents, as there is a broad range of words people may use to describe the same documents. The vector space model fails to capture the concepts of synonymy and polysemy. Synonymy refers to words having the same meaning as another word in a language; for example, in the audio dataset the words "car" and "automobile" are synonyms. Polysemy refers to multiple unrelated meanings of the same word; e.g. "taxi" is a polysemous word denoting either a form of public transportation or the movement of an aircraft on the ground. Ideally, a retrieval system should capture relations between synonymous words yet identify the different word senses of a polysemous word. Also, in the vector space model each word is considered independent of all other words. This is unrealistic, as we would expect some words to commonly co-occur with others, e.g. "fire" and "engine". Adopting a concept-based indexing technique attempts to overcome this and model the co-occurrence relation of terms.

3.2 Latent Semantic Analysis

Given the limitations of the vector space model, an extension of it to latent semantic analysis (LSA) [9] is described. The aim of LSA is to build a model of the true relationships between terms and documents. LSA is a popular information retrieval approach that encodes a term-by-document matrix into a lower dimensional space, capturing hidden (hence latent) relationships between terms. LSA attempts to model the global usage patterns of terms so that documents sharing related concepts (rather than just literal terms) are represented by nearby vectors in the lower dimensional space. In most cases this allows improved retrieval of relevant documents over the vector space model, though the effectiveness of this is by no means guaranteed.

In order to achieve this goal, a technique is required to project the term-by-document matrix to a lower dimensional space. Such an approximation is most often made by use of the singular value decomposition (SVD). This is a generalisation of factor analysis, a statistical technique used to explain most of the variation among a number of random variables in terms of a smaller number of hidden variables. The SVD of a rectangular m × n matrix A is defined as

    A = U S V^T    (3.4)

where U is an orthonormal m × m matrix, V is an orthonormal n × n matrix and S is a diagonal m × n matrix with the singular values of A along the diagonal in decreasing order. The larger singular values account for most of the variation in the data. Therefore an (optimal) approximation can be made by taking the k largest singular values of S and discarding the others [4]. The result of this approximation is the reduction of S to a diagonal k × k matrix S_k. Accordingly, the matrix U can be reduced to an m × k matrix U_k and V to an n × k matrix V_k. An approximate reconstruction of the original matrix can then be calculated as A_k = U_k S_k V_k^T. Berry shows this rank-k approximation to be the closest to the original matrix in terms of error [5].

The choice of the value k is crucial for good results. Ideally, k should be large enough to fit all the meaningful structure of the data to allow for good generalisation, yet not so large as to capture noise. Typically k will be much smaller than m or n, in which case the set of terms is represented by a vector space of lower dimensionality than the total number of terms in the vocabulary. An experiment to find the optimal value of k for this task is conducted in section 3.3. The motivation behind the approximation is that the reduced matrix corresponds to a matrix with minimised noise elements (ideally a better matrix than the original), potentially allowing improved prediction.
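A sketch of the rank-k truncation in numpy, where A is the weighted term-by-document matrix built earlier:

import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 70                      # the value found best in section 3.3
U_k = U[:, :k]              # m x k
S_k = np.diag(s[:k])        # k x k, singular values in decreasing order
V_k = Vt[:k, :].T           # n x k

A_k = U_k @ S_k @ V_k.T     # closest rank-k approximation of A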

Query Matching

To allow retrieval using LSA, an m × 1 query vector q can be indexed as before; this can then be represented in k-dimensional space by q̂ = q^T U_k S_k^{-1} [5]. To find the similarity between the query and any document in the collection, the cosine angle between q̂ and the k-dimensional document vectors is computed, returning a list of documents ranked by proximity to the query. Queries can be any type of free text using the vocabulary of terms. As an example, the most relevant documents to the query "canoe" are shown in table 3.1. The most relevant matches are ranked highest, and semantically related descriptions concerning boats and kayaks are appropriately predicted as relevant; this could not be achieved with literal term matching. It is believed the effectiveness of semantic matching on this dataset is largely due to the hierarchical and consistent annotation of descriptions, which often share keywords among related descriptions (e.g. "boating").

SIMILARITY   DOCUMENT
…            canoe: two paddlers: launch and pull away from shore, boating
…            canoe: single paddler: on board: pull onto shore step out of canoe, boating
…            canoe: pull up onto shore lake waves follow, boating
…            canoe: single paddler: on board: launch from shore jump in, boating
…            boat, sail three masted schooner: on board: lower sail sailboat, boating
…            boat, sail three masted schooner: on board: turning winch sailboat, boating
…            kayak: approach head on pull up short reverse forest birds in background
…            scull eight man sculler: on board: rowing boat marine sports boating

Table 3.1: Most relevant documents retrieved with the query "canoe".
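A sketch of this query matching, assuming the U_k, S_k and V_k factors from the earlier SVD sketch and treating the rows of V_k as the k-dimensional document vectors (one common convention):

import numpy as np

# q is a weighted m-dimensional query vector indexed like a document
q_hat = q @ U_k @ np.linalg.inv(S_k)       # fold the query into k-space
docs_k = V_k                               # one row per document in k-space

# rank documents by cosine angle to the query
sims = (docs_k @ q_hat) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(sims)[::-1]           # most relevant documents first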

Limitations of LSA

There are also limitations to this approach. Firstly, rank reduction does not necessarily improve matching for all queries, and LSA is not guaranteed to find the desired (synonymous) relationships between terms. It is worth noting that the LSA approach is not linguistically motivated (e.g. by marked-up word sense) but driven by linear algebra, and the learned concepts may not actually correspond to observed linguistic data. In some cases it was observed that seemingly unrelated terms are linked, though application of a stop list can help prevent this to a certain extent. For example, the word "single" appears in many descriptions, e.g. "single horse", "single explosion" etc. As this is not the type of semantic relationship we wish to capture, the word is excluded from the vocabulary. Additionally, research such as the work of Landauer et al. has indicated that large corpora (millions of words) are required to find realistic linguistic relationships between terms [19]. By contrast, this corpus consists of less than a thousand terms in most experiments. Possibly these sparse statistics do not allow for the most effective use of LSA, though we argue that there is a clear benefit to a semantic system, as illustrated in table 3.1.

3.3 The Number of Singular Values

Choosing a suitable dimension for the reduced matrix when performing LSA is crucial for effective results. If the value is too small, important information will be lost; too big, and undesirable relationships (noise) may be modelled. However, there is no definitive method for choosing an optimal dimension k, and the literature suggests that the best value is usually determined empirically [9]. The approach used here involves measuring retrieval accuracy at various values of k, using the development set as test material. The descriptions of the development set are applied as queries to the semantic model, and retrieval is deemed correct if the class of the closest predicted document matches the class of the test document. Although this measure does not truly reflect the relevance of a given document (e.g. in some cases it is reasonable for relevant documents to belong to another class), it should provide a good approximation of retrieval performance. For comparison, a baseline method is constructed using the ordinary vector space model (corresponding to literal term matching), where query matching is performed using the cosine angle between query and document vectors [4].

The percentage of correctly retrieved documents (averaged over the set) for the LSA method is plotted against k along with the baseline in figure 3.2. Accuracy is relatively high (90.2% for the vector space model) because within most classes the test descriptions are very similar to the training descriptions, and prediction by both methods becomes fairly certain (a more realistic evaluation would involve testing the model against free text queries).

Figure 3.2: Percentage of correctly predicted test documents against differing sizes of reduced matrix for LSA retrieval.

Firstly, it is apparent that values of k < 50 do indeed lose valuable information and are outperformed by the baseline. This means that in general there may be no benefit in applying LSA unless a suitably reduced dimension is determined. However, larger values of k can considerably outperform the baseline, and the best accuracy for this task is 95.4% when k = 70. Performance gradually drops from around k = 100 onwards as more noise is introduced; eventually performance will match that of the vector space baseline.
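This empirical search might be sketched as follows; retrieval_accuracy, dev_queries and dev_classes are hypothetical stand-ins for the development-set scoring just described:

import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

best_k, best_acc = None, 0.0
for k in range(10, 201, 10):
    U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
    acc = retrieval_accuracy(U_k, S_k, V_k, dev_queries, dev_classes)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)   # 70 and 95.4% in the experiment above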

The results demonstrate that LSA is indeed applicable to this task, allowing improved performance over literal term matching. In this case the LSA method benefits from the well-structured descriptions and co-occurrence of informative keywords in the dataset. It is unlikely the LSA method would prove so effective if the semantic model were formed from less sophisticated annotation. Finally, having attained an effective method for measuring the statistical closeness of queries to documents, the next step is to construct a linking between the acoustic and semantic spaces.

Chapter 4

Linking Acoustic and Semantic Spaces

To allow an audio request to generate a semantic answer (and vice versa), some mapping between the acoustic and semantic models must be implemented. This relies on the known relationship between sounds and descriptions in the training set. Firstly, for insight into the problem domain, the distributions of the acoustic and semantic spaces are compared and the difference is used to justify two separate linking models. The procedures for achieving both acoustic-to-semantic and semantic-to-acoustic retrieval are then described in the remainder of this chapter.

4.1 The Distributions of Acoustic and Semantic Space

To compare the similarity predicted by the acoustic and semantic models, an illustration of the distances between a number of training points in both spaces is presented in figure 4.1. Acoustic distance is derived from the acoustic model: for each training sound, X_1, X_2, ..., X_n, a GMM is initialised and trained. For each trained model, λ_1, λ_2, ..., λ_n, the likelihood that it generated each training sound is recorded, resulting in an n × n matrix indicating how well each sound scores with each model (the leftmost matrix in figure 4.1). The lower the likelihood of model λ_i generating sound X_j, the greater the distance between them. The distances are also normalised and made symmetrical.
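A sketch of constructing this matrix with scikit-learn GMMs; the min-max normalisation and averaging used to symmetrise are illustrative assumptions, as the exact scheme is described elsewhere in the thesis:

import numpy as np
from sklearn.mixture import GaussianMixture

# `sounds` is a list of per-sound MFCC frame arrays
models = [GaussianMixture(n_components=4).fit(X) for X in sounds]

n = len(sounds)
S = np.zeros((n, n))
for i, gmm in enumerate(models):
    for j, X in enumerate(sounds):
        S[i, j] = gmm.score(X)   # average log-likelihood of sound j under model i

# normalise to [0, 1] and make symmetric, so that S can be compared
# directly with the semantic similarity matrix
S = (S - S.min()) / (S.max() - S.min())
S = 0.5 * (S + S.T)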

The distance in semantic space is measured by the similarity scores predicted by LSA query matching, where each training document is treated as a query and its similarity to all other documents is found. These distances also undergo normalisation and symmetrisation.

Figure 4.1: Comparison of acoustic (leftmost panel) and semantic (rightmost panel) similarity between training points as predicted by the acoustic and semantic models; lighter regions indicate greater similarity.

Of course, in both spaces each point scores itself with the highest similarity, hence the strong diagonal. As the points are numbered in catalogue order, similar types of sounds are placed together, hence the rectangular groupings on the diagonal. In this case, the most visible rectangles denote four broad classes: animals, birds, explosions and footsteps (from left to right). Little acoustic similarity is found between bird sounds; it is believed this is because the class is rather diverse in comparison to other groupings, and there is most often only one example for each bird species. Also, there is a strong overlap in acoustic similarity between animals and footsteps, which arises from the relation of footsteps to sounds such as a horse trotting. The strong structuring in the semantic space is due to the hierarchical labelling of descriptions.

Overall, both the acoustic and semantic spaces have a visibly similar distribution, where the similarity found between examples is in some cases complementary. However, the distributions are by no means identical, and we would not expect acoustic similarity to correspond exactly to semantic similarity. As each space is differently distributed, it seems wise to build two separate linking models, one for mapping from audio to semantics and one for the reverse, in the same manner as the work of Slaney [31], [32].


More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

Speech Recognition by Indexing and Sequencing

Speech Recognition by Indexing and Sequencing International Journal of Computer Information Systems and Industrial Management Applications. ISSN 215-7988 Volume 4 (212) pp. 358 365 c MIR Labs, www.mirlabs.net/ijcisim/index.html Speech Recognition

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

On-the-Fly Customization of Automated Essay Scoring

On-the-Fly Customization of Automated Essay Scoring Research Report On-the-Fly Customization of Automated Essay Scoring Yigal Attali Research & Development December 2007 RR-07-42 On-the-Fly Customization of Automated Essay Scoring Yigal Attali ETS, Princeton,

More information

Why Did My Detector Do That?!

Why Did My Detector Do That?! Why Did My Detector Do That?! Predicting Keystroke-Dynamics Error Rates Kevin Killourhy and Roy Maxion Dependable Systems Laboratory Computer Science Department Carnegie Mellon University 5000 Forbes Ave,

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC On Human Computer Interaction, HCI Dr. Saif al Zahir Electrical and Computer Engineering Department UBC Human Computer Interaction HCI HCI is the study of people, computer technology, and the ways these

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information