
Faculty of Engineering and Information Technology
School of Computing and Communications

Action Recognition and Video Summarisation by Submodular Inference

Thesis submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy

Principal Supervisor: Prof. Massimo Piccardi
Candidate: Fairouz Hussein

April, 2017

Certificate of Original Authorship

I certify that the work in this thesis has not previously been submitted for a degree, nor has it been submitted as part of the requirements for a degree, except as fully acknowledged within the text. I also certify that the thesis has been written by me. Any help that I have received in my research work and in the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.

Fairouz Farouq Fayiz Hussein
16 April 2017

Abstract

In the field of computer vision, action recognition and video summarisation are two important tasks, useful for applications such as video indexing and retrieval, human-computer interaction, video surveillance and home intelligence. While many approaches exist in the literature for these two tasks, to date they have always been addressed separately. Instead, in this thesis we move from the assumption that action recognition can usefully drive the selection of frames for the summary, and that recognising actions from a summary can prove more accurate than from the whole video; therefore, the two tasks should be tackled simultaneously as a joint objective. To this aim, we propose a novel framework based on structured max-margin algorithms and an efficient model for inferring the action and the summary based on the property of submodularity. Recently, submodularity has emerged as an area of interest in machine learning and theoretical computer science, particularly within the domains of optimisation and game theory, and it is therefore one of the main frameworks for this thesis. To ensure proper exploitation of the proposed method, we have conducted experiments in three different scenarios: unsupervised summaries, semi-supervised summaries and fully supervised summaries. We also propose a novel loss function - V-JAUNE - to evaluate the quality of a predicted video summary against the summaries annotated by multiple annotators. In a last experiment, we leverage the proposed loss function not only for evaluation, but also for the training stage. The effectiveness of the proposed algorithms is demonstrated by qualitative and quantitative tests on two challenging depth action datasets: ACE and MSR DailyActivity. The results show that the proposed approaches are capable of learning accurate action classifiers and producing informative summaries.

Dedicated to my sweet and loving family

Acknowledgements

Sincere feelings and the kindest words emanating from my heart go to my supervisor, Professor Massimo Piccardi. I present him with my most heartfelt thanks and gratitude, together with respect and appreciation. He has given a lot to support my thesis and he is still offering his time and thoughts pro-actively and gladly, without waiting for praise or thanks. I consider myself very lucky to have had a supervisor like him, who is known for his wonderful experience, creative assistance and distinctive presence.

I would like to extend my thanks and gratitude to Tareq, my beloved husband, who is the reason for the continuation and completion of my studies, who stood by me in the toughest conditions and encouraged me to persevere and continue and to not give in to despair. I am also thankful to my parents, my lovely kids Marah, Abdullah, Leen and Joury, and my sisters and brothers who have given me their love and care. I ask God to bless them with good health, happiness and faith.

Finally, I would like to thank all my colleagues and friends - Shaukat, Sari, Khalid, Subheih, Rana, Majeda, Dana, Ali, Raniah, Arwa, Hanadi, Hayat, and Ala'a - who filled my time at UTS with smiles and support.

Contents

Abstract

1 Introduction
   1.1 Motivation and Objectives
   1.2 Research Questions
   1.3 Contributions
   1.4 Organisation of the thesis
   1.5 Publications

2 Background and Related Work
   2.1 Introduction
   2.2 Action Recognition Approaches
   2.3 Local Representations
      Feature detectors
      Feature descriptors
      Feature representations
   2.4 Types of Features
      Colour-based features
      Skeleton-based features
      Depth-based features
   2.5 Action classification models
      Rule-based methods
      Probabilistic methods
   2.6 Learning
      Supervised learning
      Unsupervised learning
      Semi-supervised learning
   2.7 Classification Methods
      k-NN
      SVM
      Multi-class SVM
      Structural SVM
      Main applications of SSVM
      Latent structural SVM
      Formulation
   Submodular Functions
      Why submodularity?
   Video Summarisation and Evaluation
      Video summarisation approaches
      Video summarisation evaluation
   Video Summarisation and Submodular Functions
      Formulation
   Action Recognition in Depth Videos: Datasets
      ACE
      MSR DailyActivity3D
      MSR Action3D

3 Joint action recognition and summarisation
   Introduction and Related Work
   Recognition and summarisation by submodular functions
   Learning: latent variables
   Experimental Results

4 V-JAUNE: A Framework for Joint Action Recognition and Video Summarisation
   Introduction
   Related Work
   Learning Framework
      Model Formulation
      Latent Structural SVM for Unsupervised and Semi-Supervised Learning
   V-JAUNE: Video Summary Evaluation
   Experimental results
      ACE
      MSR DailyActivity3D

5 Minimum Risk Structured Learning of Video Summarisation
   Introduction and Related Work
   Summarisation via structured learning
      Problem Formulation
      Structural SVM for Supervised Learning
      Learning with V-JAUNE
      V-JAUNE for Evaluation
   Experimental results
      ACE
      MSR DailyActivity3D

6 Conclusion

Bibliography

List of Tables

- 3.1 Comparison of action recognition accuracy on the MSR DailyActivity3D dataset
- Sensitivity analysis of the accuracy with different weights in (3.6) and with depth and RGB data
- The accuracy achieved by latent SSVM on depth data
- Details of the ACE dataset
- Comparison of the action recognition accuracy on the ACE dataset
- The evaluation results on the ACE dataset using various amounts of supervision
- Influence of the budget on the action recognition accuracy for the ACE dataset
- Sensitivity analysis of the action recognition accuracy at the variation of the λ parameters for the ACE dataset (unsupervised case)
- The evaluation results on the MSR DailyActivity3D using various flavours of learning
- Comparison of the action recognition accuracy on the MSR DailyActivity3D dataset (depth frames only)
- The values of the V-JAUNE measure on the ACE dataset (clipped)
- The values of the V-JAUNE measure on the ACE dataset (unclipped)
- The values of the V-JAUNE measure on the MSR DailyActivity3D dataset

List of Figures

- 1.1 With Kinect games, the players are the controller
- Examples of actions in videos
- Example of a smart home (reprinted from [Simpson, 2016])
- Example of Input and Output of Video Summarisation from the ACE dataset
- Some challenges of action recognition
- Extraction of space-time cuboids at interest points from similar actions performed by different persons (reprinted from [Laptev et al., 2007])
- A generic machine learning system (reprinted from [Kadre and Konasani, 2015])
- Flavours of machine learning: a) fully-supervised learning; b) unsupervised learning
- Binary Support Vector Machines on (a) linearly separable data and (b) non-linearly separable data. Squares represent one class, circles the other one. Support vectors lie on the margin
- Examples of structured problems
- The diminishing return property in a submodular set function
- A comparison between RGB channels and depth channels (reprinted from [Wang et al., 2014a])
- 2.8 A typical clip of ACE actions performed by five different actors (distinguishable by their clothing)
- Some examples from the MSR DailyActivity3D (displayed as RGB and depth frames): the first column in each subfigure shows the subject standing close to the couch; the second, sitting on it
- Sample clips from the MSR Action3D for actions a) Draw tick and b) Tennis serve (reprinted from [Li et al., 2010])
- Summary examples (displayed as RGB frames) for action walk: a) proposed method; b) SAD
- Each row contains the summary of a video representing a certain activity; the activities are: drinking, eating, reading, using cell phones, writing, using computers/laptop, vacuuming, cheering up, sitting still, tossing crumbled paper, playing games, lying on the sofa, walking, playing the guitar, standing up, and sitting down
- The graphical model for joint action classification and summarisation of a video: y: action class label; h: frames selected for the summary; x: measurements from the video frames
- V-JAUNE values for the ACE test set (95 videos) with multiple annotators: blue bars: denormalised values; red bars: normalised values
- V-JAUNE loss for different annotators over the ACE test set (95 videos), using the first annotator as ground truth and the second as prediction. The changes in value are mainly due to the changes in magnitude of the VLAD descriptors; however, the agreement also varies with the video
- Examples of predicted summaries from the ACE dataset (displayed as RGB frames for the sake of visualisation). The subfigures display the following actions: a) breaking; b) baking (omelet); c) baking (ham); and d) turning. In each subfigure, the first row is from the proposed method, the second from SAD
- 4.5 Examples of summaries from the MSR DailyActivity3D dataset (displayed as RGB frames for ease of interpretation) for actions a) Cheer and d) Walk: in each subfigure, the first row is from the proposed method and the second from SAD. The results from the proposed method look more informative
- V-JAUNE values for the ACE test set for actions a) boiling and b) seasoning, with multiple annotators: blue bars: denormalised values; red bars: normalised values
- V-JAUNE loss for different annotators for actions a) boiling and b) seasoning, using the first annotator as ground truth and the second as prediction
- Examples of predicted summaries from the ACE dataset (clipped). The subfigures display the actions a) seasoning and b) peeling. In each subfigure, the first row is from the proposed method, the second from SAD
- Examples of predicted summaries from the ACE dataset (unclipped). In each subfigure, the first row is from the proposed method, the second from SAD
- Examples of predicted summaries from the MSR DailyActivity3D dataset. The subfigures display the actions a) using vacuum and b) playing guitar. In each subfigure, the first row is from the proposed method, the second from SAD

Chapter 1

Introduction

Over the past decades, our lives have witnessed tremendous technological improvements. Developments in computers and global networks have changed our way of living, whether at home, work or school, and even during leisure. Along with the advances in computers, video data have become more and more accessible. Nowadays, nearly everyone uses electronic devices such as mobile phones, digital cameras and notebooks which allow the seamless capture of videos. At the same time, the increase in the speed of the Internet and in storage volumes has made video content more widespread and accessible. For example, YouTube users upload more than 400 hours of video to the site every minute, according to VIDCON. Moreover, SocialMediaToday has recently reported that video views on Facebook average 8 billion a day. In short, video data have become a major priority. However, despite their expanding significance, the automated methods to examine, analyse and summarise them are still fairly constrained. For this reason, the research area of computer vision strives to emulate the capabilities of human vision, resorting to geometry, probability, statistics, physics and machine learning methods to recover unknown and incomplete information from video data. In the following, we briefly mention its main applications and scenarios.

Video surveillance systems that can automatically understand the occurrences in a scene represent a fundamental cornerstone in terms of security and safety for many premises (e.g., roads, markets, airports and car parks) [Otoom et al., 2008]. They also allow investigating crimes (i.e., forensics) and effectively warding off possible criminals (i.e., prevention).

Video surveillance systems are also suited for monitoring automated production, tracking packages and protecting goods from damage and vandalism. They make a valuable contribution to guaranteeing quality and optimising processes in automation chains.

Another application of computer vision is traffic sign recognition (TSR) systems. These systems are beneficial for maintaining maps of road signs and periodically checking their condition. In addition, they are useful in advanced driving assistance systems (ADAS) [García-Garrido et al., 2012], which reduce hazards and warn drivers of precarious situations, with functions including automatic braking, pedestrian detection and sleepiness alerts. In the near future, they will form an integral part of the many self-driving vehicles that will appear on our roads.

Another interesting area in computer vision is human-computer interaction (HCI). HCI systems are popular in electronic game platforms, where players can be immersed in a virtual world with their full bodies, natural voice, and free hand and body motion, as shown in Figure 1.1. HCI interfaces allow them to control the game without the need for any extra input equipment such as joysticks and Wi-Fi controllers. Fortunately, this technology also lends itself to applications beyond entertainment, such as the rehabilitation of people who have suffered major traumatic injuries and of the elderly suffering from motor illnesses.

Another major area of application for computer vision is social multimedia. User-uploaded videos can be automatically classified into different categories, such as vehicles, buildings, animals, sports, nature and people. The indexing and retrieval of relevant videos in web search engines is currently based mainly on keyword search. While this is rather effective, the manual annotation of videos consumes time and costs money for large-scale databases. Conversely, content-based image retrieval (CBIR) can automate this process and satisfy the query by automatically analysing and classifying the visual content [Liu et al., 2007].

From this brief review, it appears that most computer vision applications, in one way or another, require techniques for recognising human actions in different scenarios such as wide-area surveillance, healthcare, smart homes, sports analysis and so forth (see Figure 1.2). As an example of useful human action recognition, consider the following scenario:

Figure 1.1: With Kinect games, the players are the controller.

Figure 1.2: Examples of actions in videos.

Figure 1.3: Example of a smart home (reprinted from [Simpson, 2016]).

An old woman with Alzheimer's disease wakes up in her apartment. She turns on the kitchen lights, takes the last two eggs from the fridge and operates the stove to prepare boiled eggs for breakfast. After the eggs are cooked, a computer-generated voice gently reminds her to turn off the stove and also to take her medicines before eating her breakfast. Moreover, through an Internet connection, the fridge notifies a supermarket that she has run out of eggs and issues a new order. This is the scenario of a smart home, which can provide assisted daily living through automated understanding of a user's actions (Figure 1.3).

On the other hand, great challenges arise from the sheer volume of today's video footage. By nature, videos can be long, at times repetitive, and might contain irrelevant parts. Humans struggle to gain an understanding of vast amounts of video footage. As an example, surveillance systems acquire massive amounts of video data 24/7. Video summarisation can therefore prove an essential tool to help users understand the occurrences in large video collections. For example, in baseball videos the overall scene is similar over long periods, but automated summarisation can detect distinguishable events such as the squatting of the catcher at the beginning of every pitch. In addition, summarisation can help in storing, browsing, retrieving and processing video footage in databases. Throughout the years, many algorithms have been proposed for automated summarisation, mainly aiming to quickly identify the salient frames of given videos. In general, a good frame summary must avoid redundancy and ensure high coverage of the original video (see Figure 1.4).

Figure 1.4: Example of Input and Output of Video Summarisation from the ACE dataset.

1.1 Motivation and Objectives

Action recognition is intrinsically challenging due to multiple, concurrent factors such as viewpoint variations, illumination changes, occlusions and actor dependencies (see Figure 1.5). However, major advancements have recently been achieved thanks to the availability of depth cameras, which add a new sensing dimension compared to conventional colour cameras. Accordingly, recognition of ongoing actions and an understanding of the surrounding context have become possible. On the other hand, video summarisation is imperative, given that traversing, retrieving and processing huge amounts of video footage consumes an inordinate amount of time and storage. Furthermore, it is very difficult to gain an understanding of vast amounts of video footage. Automated video summarisation, however, appears to deal effectively with these situations. On their own, action recognition in video and video summarisation are two well-established research areas. However, the simultaneous classification and summarisation of a video depicting an action has not received appropriate attention in the literature to date. We believe that classification and summarisation could be usefully merged into a single, simultaneous

objective, following the intuition that action recognition can suitably drive the selection of the frames for the summary (that is, action classification is expected to be more accurate if based on only a set of key frames) and, at the same time, the summary is expected to be more meaningful if it conveys information about the activity.

Figure 1.5: Some challenges of action recognition.

Another challenge in this area is that a meaningful and accepted measure for the quality evaluation of a video summary is still missing. A main impediment might be the overall shortage of ground-truth annotated video summaries; another could be the intrinsic subjectivity of video summaries. A proper evaluation measure should reflect not only the video's contents, but also the frames' order, to ensure chronological coherence. For example, the actions sitting down and standing up consist of similar sets of frames, but in approximately reverse order. Generally, a video summary can be thought of as a sub-sequence of the original frames. In this thesis, we aim to:

- tackle action recognition and video summarisation simultaneously as a joint objective;
- formulate a novel evaluation measure for the quality of a video summary;
- achieve automatic summarisation of large-scale video by minimising loss functions.

1.2 Research Questions

This thesis aims to address the following research questions:

- Can action recognition and summarisation be performed more effectively by being addressed jointly? This, in turn, equates to these two questions:

1. Can action recognition be performed more accurately using only a selection of key frames rather than the entire video?
2. Can summarisation be more meaningful to a human user if driven by the requirement of supporting accurate action recognition?

- Can the quality of a predicted video summary be evaluated in a sensible, quantitative way?
- Can we design effective machine learning algorithms that directly minimise video summary loss functions?

These questions have been extensively answered in our published and submitted works. In a first paper, we presented a structured prediction system that jointly infers an activity label and a summary of an activity video based on latent structural SVM. A key advantage of this learning approach is its capability of handling huge numbers of features which, in turn, provides extensive flexibility for building an effective model for recognition and summarisation. In a second paper, we proposed a new measure, nicknamed V-JAUNE, to evaluate video summaries, and conducted various flavours of learning (from unsupervised to fully supervised) to achieve maximum benefit from the framework. In a third paper, we presented an approach for the automatic summarisation of large-scale video collections, using V-JAUNE directly in the learning algorithm together with a novel feature function. Lastly, in a collaborative paper with a colleague, I experimented with dense local features from depth videos to learn how to best represent the individual frames in my summarisation and action recognition approaches.

1.3 Contributions

The main contributions of this thesis are summarised as follows:

- We design a novel scoring function for the action class and the video summary which enjoys the property of submodularity. In addition, the summary is set to a predetermined number of frames (a 'budget') suitable for review by users.
- We present a proof of submodularity for the loss-augmented inference of latent structural SVM under an action classification loss.
- We propose a novel measure, named V-JAUNE, to compute the loss between a predicted summary and multiple ground-truth summaries.
- We use the V-JAUNE loss to train a minimum-risk classifier (structural SVM) for video summarisation. Also, we present a new, key proof of submodularity for the loss-augmented inference of structural SVM under the V-JAUNE loss.
- We explore various flavours of learning, including fully-supervised, semi-supervised and unsupervised summaries.
- We present a new feature function for video summarisation. The proposed function depends not only on the contents but, significantly, on the frames' order, to ensure temporal coherence.
- We present experiments over two challenging benchmarks, the MSR DailyActivity3D and the ACE datasets, showing that the approaches are capable of achieving remarkable action recognition accuracy and providing high-quality, meaningful and visually-appealing video summaries.
- Lastly, as a simple yet laborious contribution, we manually annotate the ground truth of the video summaries for both benchmarks.

1.4 Organisation of the thesis

The rest of this thesis is organised as follows. In Chapter 2, we review previous research work in the areas of action recognition, feature extraction and representation, structural learning, classification techniques, submodular functions, and video summarisation and evaluation. In Chapter 3, we present a novel framework for joint action recognition-summarisation based on a latent structural SVM framework, together with an efficient algorithm for inferring the action and the summary based on the property of submodularity. The framework is evaluated on a challenging benchmark, MSR DailyActivity3D; the experimental results show that the approach is capable of achieving remarkable action recognition accuracy while providing appealing video summaries. In Chapter 4, we present a new measure for the quantitative evaluation of video summaries, nicknamed V-JAUNE, and we conduct a vast extension of the experimental evaluation including: another, larger and more probing action benchmark (ACE); training with different extents of summary supervision; quantitative evaluation of the quality of the predicted video summaries; quantification of multiple annotators' disagreement; and an analysis of sensitivity to the (hyper-)parameters. Also, we present a new, key proof of submodularity for the loss-augmented inference of latent structural SVM. In Chapter 5, we present a new mechanism to achieve automatic summarisation of large-scale video collections by using the loss function, V-JAUNE, directly in the learning algorithm, together with a new feature function that encapsulates the frames' sequentiality while still enjoying the property of submodularity. Finally, in Chapter 6, we present a brief conclusion as well as future work.

1.5 Publications

These are the publications produced to date from this thesis:

- Fairouz Hussein, Sari Awwad, and Massimo Piccardi. "Joint action recognition and summarisation by sub-modular inference", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Fairouz Hussein and Massimo Piccardi. "V-JAUNE: A Framework for Joint Action Recognition and Video Summarisation", 2017 ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM).
- Fairouz Hussein and Massimo Piccardi. "Minimum Risk Structured Learning of Video Summarisation", in preparation for IEEE Multimedia Signal Processing (MMSP).
- Sari Awwad, Fairouz Hussein, and Massimo Piccardi. "Local Depth Patterns for Tracking in Depth Videos", 23rd ACM International Conference on Multimedia, 2015.

Chapter 2

Background and Related Work

This chapter reviews the state-of-the-art work in the fields of action recognition and video summarisation and evaluation. It is accompanied by a brief discussion of these fields and highlights areas of potential research interest.

2.1 Introduction

This thesis builds on discriminative structural learning methods and their applications in action recognition and video summarisation. Since the state of the art is vast, we restrict the review to the topics to which this thesis tries to contribute. The review begins with a discussion of existing methods in activity recognition. Then, an analysis of feature extraction and representation approaches, which form the first stage of any recognition system, is presented. In subsequent sections, a range of models, learning tasks and classification techniques are discussed. Thereafter, the chapter reviews the research literature concerned with submodularity and with video summarisation and evaluation. At the end of the chapter, we describe some of the datasets that are used to evaluate depth action recognition methods.

2.2 Action Recognition Approaches

The existing methods in activity recognition comprise top-down and bottom-up approaches.

1. Top-down approaches

These systems are constructed using body-geometry features: they build the activity model based on a representation of the human posture, and the body parts are described using cylinders, spheres or superquadrics. Often, the 3D volume is used as a feature or prototype to match the action videos against a gallery to complete the classification [Yamamoto and Koshikawa, 1991]. The region of interest is encoded as a global representation of the visual observations as a whole, and background subtraction or tracking techniques are used to localise the person. However, these reconstruction methods suffer from a lack of reliability and robustness on real images, owing to viewpoint sensitivity and noisy, meaningless background information. They also depend on accurate tracking and localisation methods [Poppe, 2010]. Some kinds of global features that are typically extracted are:

(a) Silhouette information. The person's silhouette is extracted by using background subtraction techniques. [Yamato et al., 1992] were amongst the first to use silhouette images. These features encapsulate meaningful information about the image's motion energy and history. This includes a motion energy image (MEI) to specify the region where the motion happens, and a motion history image (MHI) to weight that region's pixel intensities (a higher weight is given to more recent regions).

(b) Edge (contour) information. Using background subtraction, the shapes of humans are obtained from contour information [Yilma and Shah, 2005]. The action is represented as a set of main points (such as peak, saddle, valley and pit points) and recognised by computing point-to-point correspondences.

(c) Optical flow information. With dynamic backgrounds, it is hard to perform background subtraction due to the noise in the descriptor. An alternative is to calculate the optical flow magnitudes of periodic motion patterns. A pioneering work in this context was introduced by Efros et al. [Efros et al., 2003], who attempted to track players in a soccer game by analysing motion channels. Their descriptors consider the vertical and horizontal flow components and divide them into directed positive and negative vectors, resulting in four different channels.
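To make the channel construction concrete, the following minimal sketch (our own illustration, not code from the original paper; it assumes a dense flow field has already been computed by any off-the-shelf method) performs the half-wave rectification into four non-negative channels:

```python
import numpy as np

def four_channel_flow(fx, fy):
    """Split a dense optical-flow field into half-wave rectified channels.

    fx, fy: 2-D arrays with the horizontal and vertical flow components.
    Returns four non-negative channels (Fx+, Fx-, Fy+, Fy-), in the
    style of the Efros et al. motion descriptor.
    """
    fx_pos = np.maximum(fx, 0.0)   # rightward motion
    fx_neg = np.maximum(-fx, 0.0)  # leftward motion
    fy_pos = np.maximum(fy, 0.0)   # downward motion
    fy_neg = np.maximum(-fy, 0.0)  # upward motion
    return fx_pos, fx_neg, fy_pos, fy_neg
```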

Unfortunately, if there is an unexpected change in motion, these descriptors fail and capture only the first-order motion.

2. Bottom-up approaches

In these systems, the action is recognised from time-sequential images using low-level features. In general, this local representation goes through the following stages: (a) detect the spatio-temporal interest points; (b) compute and describe the local patches around these points; (c) encode the local patches into a final representation. Conversely to the top-down approach, which uses a global representation, the local representation of relevant interest points obtained by the extraction process does not require background subtraction or tracking, is less sensitive to noise, and is more robust in model-fitting procedures. We further discuss the local representation stages in the next section.

2.3 Local Representations

Inspired by object recognition in images, the concept of local features (i.e., vectors that represent the local patches of an image) has been extended to videos. Local features designate a salient point in space and time, (x, y, t), and its neighbourhood. Such a salient point is discriminatively characterised by large changes with respect to its adjacent points in the spatial and temporal domains. For example, in the action running, the contrast in pixel intensity between the person and the background generates high variation in the space domain, resulting in many Spatial Interest Points (SIPs). Once a SIP is extended over the time domain, it becomes a Spatio-Temporal Interest Point (STIP). The motivation for detecting regions of high saliency in the video is that they should be informative for labelling and discriminating the action. In the literature, local features have shown their effectiveness for classifying actions under cluttered backgrounds, occlusions, and possibly rotation and scaling.

2.3.1 Feature detectors

The idea of a feature detector is to make a local decision at every image point as to whether the point and its neighbourhood constitute a local feature. Feature detection is typically the first operation performed on an image and is therefore considered a low-level image processing operation. A famous detector of prominent regions in images is the Harris corner detector [Harris and Stephens, 1988]; it uses the eigenvalues of the second-moment matrix to determine whether the area centred on the chosen pixel is an edge, a corner or flat. In the extended 3D Harris detector, the time dimension was added by Laptev [Laptev, 2005]. Consequently, the second-moment matrix becomes spatio-temporal and the temporal variance is measured within a Gaussian window function to identify points with significant local differences in both the spatial and temporal domains. [Dollár et al., 2005] proposed different corner STIPs by applying 2D Gaussian and 1D Gabor filters on the spatial and temporal volumes separately. Many other feature detectors have been proposed in the literature based on what kind of point is considered potentially interesting (e.g., an edge [Canny, 1986], a corner [Willis and Sui, 2009] or a blob [Rosten and Drummond, 2005]).
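As an illustration, the 2D Harris response can be sketched in a few lines; the code below (our own simplification, using SciPy filters) computes R = det(M) - k*trace(M)^2 from smoothed gradient products. The spatio-temporal extension of [Laptev, 2005] adds temporal gradients and a 3x3 second-moment matrix:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.5, k=0.04):
    """Harris corner response map for a grey-scale image (2-D float array).

    Builds the second-moment matrix from Gaussian-smoothed products of
    image gradients and returns R = det(M) - k * trace(M)^2 per pixel.
    """
    ix = sobel(image.astype(float), axis=1)  # horizontal gradient
    iy = sobel(image.astype(float), axis=0)  # vertical gradient
    # Elements of the second-moment matrix, averaged in a Gaussian window
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det_m = ixx * iyy - ixy ** 2
    trace_m = ixx + iyy
    return det_m - k * trace_m ** 2  # large positive values -> corners
```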

2.3.2 Feature descriptors

The term local descriptor is used interchangeably with local feature to denote a measurement extracted from a pixel and its neighbourhood (often called a patch). Local descriptors have proved useful in many major applications of computer vision: object detection, disparity matching in stereo vision, tracking, and action recognition (see Figure 2.1). Descriptors can be extracted over a regular grid on the image (dense descriptors) or only where significant features are detected (sparse descriptors). Recent trends have been in favour of the dense option. Generally, descriptors are computed from optical flow or gradients, because these representations reflect the changes that occur in a video. A popular 2D feature descriptor is the Histogram of Oriented Gradients (HOG) [Dalal and Triggs, 2005]. In this case, the image is divided into blocks, either sparsely over the detected interest points or densely. In dense sampling, the descriptors are computed regularly over the entire frame on a grid and at every time step, instead of relying only on detected points; the resulting information is often huge, but no possible location is missed. Each patch yields one HOG descriptor.

Figure 2.1: Extraction of space-time cuboids at interest points from similar actions performed by different persons (reprinted from [Laptev et al., 2007]).

The HOG descriptor is calculated by dividing the neighbourhood into a grid of cells, each cell consisting of a block of pixels. Then, a histogram of the pixels' gradient orientations is computed over each cell and the cell histograms are concatenated. The Histogram of Flow (HOF) descriptor is similar to HOG, but the spatial gradients are replaced by the optical flow. A combination of HOG and HOF by a fusion approach was also introduced by [Laptev et al., 2008a]. The code from [Wang et al., 2009] was used in our experiments to extract local descriptors (HOG/HOF) over a regular spatio-temporal grid.
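The per-cell histogramming described above can be sketched as follows (an illustrative simplification, not the [Wang et al., 2009] code; full HOG implementations add block-level normalisation and bin interpolation):

```python
import numpy as np

def hog_cells(patch, n_bins=9, cell=8):
    """Simplified HOG: per-cell histograms of gradient orientations.

    patch: 2-D grey-scale array whose sides are multiples of `cell`.
    Returns the concatenated, L2-normalised cell histograms.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h_cells, w_cells = patch.shape[0] // cell, patch.shape[1] // cell
    feats = []
    for i in range(h_cells):
        for j in range(w_cells):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            # Orientation histogram of the cell, weighted by magnitude
            hist, _ = np.histogram(ang[sl], bins=n_bins, range=(0, 180),
                                   weights=mag[sl])
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-12)
```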

2.3.3 Feature representations

The extracted features can be used directly or represented in a different way to improve classification performance. This representation provides a single vector - often called an encoding - for the whole video or per frame, depending on the classification level (i.e., a single label for the whole video, or one per frame). Two popular encodings are Bag of Features (BOF) and VLAD. The idea in BOF is that the image or video can be represented by the occurrences of its features, ignoring their order. The result is a histogram of feature occurrences, where the bins of the histogram are named the vocabulary or codebook, and each bin denotes a cluster, or codevector, in that codebook. An unsupervised learning step is applied to build the codebook, typically using a k-means technique; the aim is to minimise the distances between points and their nearest cluster centres. Initially, the centres are assigned random values and each point is allocated to the nearest centre; then, at every iteration, each cluster centre is recomputed as the mean of its allocated points and the points are re-assigned, and this is repeated until convergence. Finally, the resulting codebook is used for quantising the features by mapping a feature vector to the index of the nearest codevector in the codebook. The Vector of Locally Aggregated Descriptors (VLAD) encoding [Jégou et al., 2010] typically uses k-means to generate the vocabulary. It differs from BOF by recording the differences between each descriptor and the centres. The dimension of the encoded vector, v, equals the dimension of the descriptor multiplied by the number of clusters.
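The two encodings can be sketched as follows (illustrative code under the assumption that the codebook has already been learned, e.g. by the k-means procedure described above):

```python
import numpy as np

def bof_encode(descriptors, codebook):
    """Bag of Features: histogram of nearest-codevector assignments.

    descriptors: (n, d) local descriptors; codebook: (k, d) centres.
    """
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-12)

def vlad_encode(descriptors, codebook):
    """VLAD: aggregate the residuals between descriptors and their
    nearest centres; the result has dimension k * d."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    v = np.zeros_like(codebook, dtype=float)
    for i, c in enumerate(nearest):
        v[c] += descriptors[i] - codebook[c]
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```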

2.4 Types of Features

2.4.1 Colour-based features

Colour features represent the colour distribution in a local patch. These features can be computed in various colour spaces such as CIE, CMYK and RGB. Colour features are more informative in the sense that they permit a better description of appearance than grey-scale features.

2.4.2 Skeleton-based features

The movements of a human skeleton can help discriminate its various actions. Indeed, the joint locations and angles are informative representations, especially if the subject is facing the camera and there are no occlusions. However, this information is not always available and is not entirely view-invariant. In addition, the same pose structure may be obtained from different actions, such as drinking or eating. The work proposed by [Xia et al., 2012] presented the Histogram of 3D Joint Locations (HOJ3D) descriptor, which encodes spherical information around a predefined joint considered as the root, for example the hip. The 3D space is then divided into n bins originating from the root's centre. However, the strong dependence on the root joint might limit the recognition accuracy. In the experiments in [Xia et al., 2012], a sequential classifier was used to recognise and model the action.

2.4.3 Depth-based features

A revolution has taken place since the appearance of depth cameras: a new dimension - depth - is added to the colour information delivered by conventional cameras. These cameras offer an approximation of the distance, D, between the camera origin and the position of a specified pixel in the scene, and they are low-cost. Dedicated local features for depth frames have started to appear in the literature. For instance, the Local Depth Pattern (LDP) descriptor was proposed by [Zhao et al., 2012] to describe the local patch around a pixel in the depth map: the patch is first divided into a grid; for each cell pair, the difference between the averages of their depth values is calculated, and the differences are concatenated to produce the feature. Inspired by the achievements of the HOG descriptor [Dalal and Triggs, 2005], [Yang et al., 2012] extracted HOG descriptors from Depth Motion Maps (DMM) obtained by assembling the motion energy of the depth maps. The temporal order of the motion is disregarded in their method, as the entire sequence is stacked into one image. [Lai et al., 2011] extracted HOG from both depth and colour images. In their experiments, they found that HOG over depth images outperforms HOG over RGB images, thanks to the strong gradients of object boundaries in depth images.

2.5 Action classification models

Throughout the years, a variety of models have been proposed for action classification, from simple rule-based approaches to full probabilistic models. Below, we present a brief review.

2.5.1 Rule-based methods

In rule-based methods, the actions are classified based on predefined propositions. For instance, [Ivanov and Bobick, 1999] used these techniques to provide an automatic surveillance system for a parking lot that labels activities and person-vehicle interactions such as drop-off and pick-up in order to detect suspicious events. Rule-based methods can be workable, but are restricted in terms of scalability and viewpoint, and by the predefined rules. To overcome these limitations, researchers have tended to exploit more sophisticated

probabilistic techniques, especially in recent years.

2.5.2 Probabilistic methods

Researchers following probabilistic approaches adopt a statistical model of the problem, estimate its parameters by learning from the data, and search for the optimal solution to the given problem using the learned model. The taxonomy of probabilistic learning methodologies distinguishes between generative and discriminative models. Generative models learn the joint probability, p(x, y), of an action measurement x and its class label y (without limitation, x and y can be sets of measurements and labels, even with graphical structure). As such, p(x, y) can be sampled to generate new virtual (x, y) samples. Conversely, discriminative models learn only the conditional probability distribution p(y|x), so they do not model the measurements but rather emphasise the discrimination between classes. Indeed, leaving aside computational concerns, the dominant consensus seems to be that discriminative models lead to more accurate classification than generative ones, and so one should prefer them [Jordan, 2002]. Hereafter, we describe some examples of probabilistic methods: HMM, CRF and HCRF.

HMM

The hidden Markov model (HMM) is a generative probabilistic model with a Markov chain of discrete state variables and a sequence of corresponding measurements, or observations. The states of the system are hidden variables and are inferred through the measurements. HMMs are particularly known for their applications in handwriting, speech and gesture recognition. [Yamato et al., 1992] were amongst the first to utilise an HMM to recognise human actions in time-sequential images from silhouettes of tennis players. The parameters of an HMM include:

1. Initial state probabilities, which form an N x 1 vector.
2. State transition probabilities, which form an N x N state transition matrix.
3. Observation probabilities, which state the conditional probability of generating an output given a state. If the measurements are discrete with M possible values, they form an N x M matrix. If the measurements are continuous and multi-variate (the common case in computer vision and signal processing), they include the parameters of a probability density function for each state.
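For concreteness, the sketch below evaluates p(x) for a discrete-observation HMM with the parameters listed above, using the forward algorithm (an illustration of inference, not of the training procedure):

```python
import numpy as np

def hmm_likelihood(pi, A, B, obs):
    """Forward algorithm: p(x) for a discrete-observation HMM.

    pi: (N,) initial state probabilities; A: (N, N) transitions with
    A[i, j] = p(s_t = j | s_{t-1} = i); B: (N, M) observation matrix;
    obs: sequence of observation indices. For long sequences, rescale
    alpha or work in log space to avoid numerical underflow.
    """
    alpha = pi * B[:, obs[0]]          # joint of first state and obs
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then absorb next obs
    return alpha.sum()
```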

An HMM can be learned in an unsupervised manner from a set of sequences of measurements, X, by maximising p(X) in the parameters. Otherwise, it can be learned in a supervised manner from a set of sequences of measurements and corresponding state values, (X, Y), by maximising p(X, Y) in the parameters. HMMs can also be used to create an action classifier by assuming that variable y is a label for the action and that the states of the HMM, this time noted as h to avoid confusion, represent an evolving internal state, or progression, of the action. The full joint probability becomes p(x, h, y), where y is the action label, h is the sequence of the internal states and x is the sequence of the measurements. The conditional probability p(x, h|y) is therefore the probability of a conventional HMM for action class y. This model is typically trained using a training set of measurement sequences, X, and the corresponding action labels, Y, by marginalising the states, H, and maximising p(X, Y) in the parameters.

CRF

Conditional random fields (CRFs) are discriminative probabilistic models [Lafferty et al., 2001] that leverage the properties of the exponential family of distributions. Their main advantage compared to HMMs and other Bayesian networks is that they can be trained discriminatively by maximising p(Y|X). This training objective is often called the conditional likelihood and is a probability only of the state labels, Y. The linear-chain CRF is the analogue of the HMM, but it is always trained in a supervised way.

HCRF

Founded on the CRF, [Wang et al., 2006] presented the HCRF model, a CRF augmented with latent (hidden) variables to capture intermediate structures. An HCRF models the distribution p(y, h|x), where y can be one or more class labels and h are the hidden variables. During training, y is supervised while h is marginalised. During inference, y is inferred and h marginalised. The HCRF is the discriminative equivalent of the action classifier based on multiple HMMs described in the previous paragraph. The conditional probability of a class label y given a set of measurements x in the HCRF model is expressed as:

p(y \mid x, \theta) = \sum_{h} p(y, h \mid x, \theta) = \frac{\sum_{h \in H} e^{\psi(y,h,x;\theta)}}{\sum_{y' \in Y,\, h \in H} e^{\psi(y',h,x;\theta)}} \quad (2.1)

where \theta is the parameter vector and \psi is a feature function based on measurements x, label(s) y, and hidden variables h. The structure of an HCRF is an undirected graph. In recent years, a new trend for training HCRFs has been the extension of maximum-margin methods. This style of training is presented in the following section.
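For a hidden space small enough to enumerate, Equation (2.1) can be evaluated directly; the sketch below (our own illustration; real HCRFs compute the sums over h by dynamic programming such as belief propagation) uses the log-sum-exp trick for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def hcrf_posterior(psi):
    """Class posterior p(y | x) for an HCRF, as in Eq. (2.1).

    psi: (|Y|, |H|) array of scores psi(y, h, x; theta) for one input x,
    assuming the hidden space is enumerated explicitly.
    """
    log_num = logsumexp(psi, axis=1)  # log sum_h exp(psi(y, h, x))
    log_den = logsumexp(psi)          # log sum over both y and h
    return np.exp(log_num - log_den)
```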

2.6 Learning

Machine learning is a discipline of artificial intelligence that gives computers the ability to solve problems by using example data or past experience. The computers are expected to find hidden insights by using algorithms that learn iteratively from data without being explicitly programmed (see Figure 2.2). Such algorithms cover detection, classification, prediction, robot control, etc. With recent technological developments, we need algorithms that can quickly and automatically produce models able to analyse bigger, more complex data and deliver faster, more accurate results. In order to ensure effective use of machine learning algorithms, one has to abide by the following steps (a minimal sketch follows the list):

- Determine the nature of the samples: for example, video, audio, handwriting, etc.
- Collect a large set of these samples. They should represent the real-world application in all its diversity.
- Divide the collected data into two disjoint sets: the training set and the test set.
- Apply the learning algorithm to the training set to generate a predictive function f.
- Estimate the accuracy of the learned function as the percentage of samples in the test set that are correctly recognised by f.
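As an illustration of the split-train-evaluate protocol above (a minimal sketch; the helper names are ours):

```python
import numpy as np

def split_data(x, y, test_frac=0.3, seed=0):
    """Shuffle and split a dataset into disjoint training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return x[train], y[train], x[test], y[test]

def accuracy(f, x_test, y_test):
    """Fraction of test samples correctly recognised by predictor f."""
    return np.mean([f(xi) == yi for xi, yi in zip(x_test, y_test)])
```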

Figure 2.2: A generic machine learning system (reprinted from [Kadre and Konasani, 2015]).

The above steps can be repeated with various sizes and combinations of training and test sets. Typically, machine learning tasks are classified into three categories, depending on the availability of the class labels in the training data: fully-supervised, unsupervised and semi-supervised. The situation in which the entire training data are labelled with the true class labels is called fully-supervised learning (see Figure 2.3, top). Learning when there is no hint at all about the labels is called unsupervised learning (see Figure 2.3, bottom). Semi-supervised learning refers to any case in between fully-supervised and unsupervised learning.

2.6.1 Supervised learning

Supervised learning is the basic learning approach for classification problems. The task is to construct a function from training data to predict the labels of unseen examples. The training set contains the examples from which the function attempts to learn; each example used for training is labelled with its true class. The function is often learned iteratively by comparing the training labels (ground truth) with the predicted labels and adjusting the model accordingly. More formally, let us be given a set of N examples (x_i, y_i), i = 1, ..., N, such that x_i is the feature vector of the i-th example and y_i is its label. Let X denote the space of all possible

Figure 2.3: Flavours of machine learning: a) fully-supervised learning; b) unsupervised learning.

inputs, Y the discrete space of possible outputs, and H the set of all hypotheses that the learning algorithm can produce. A hypothesis (or classifier) h is a function from X to Y, h : X → Y. The learning algorithm seeks to produce a hypothesis h ∈ H that will work well on new examples, usually by finding an h that correctly classifies the training data and adds some guarantees of generalisation. Supervised learning is the common approach for training most classifiers, including the popular decision trees, neural networks and support vector machines (SVMs), and it has been used in a myriad of fields. Below, we cite some examples. [Olaru and Wehenkel, 2003] used supervised learning to train decision trees. [Fritzke, 1994] used supervised training to learn the weights of neural networks and automatically find a suitable network structure and size; the classification error was used to adjust the network. [Dollar et al., 2006] presented a fully supervised algorithm for edge and object boundary detection which learns from thousands of features computed on image patches; this approach can handle cues such as incomplete and parallel curves. Supervised techniques for learning the dictionaries of sparse representations have been successfully used in action recognition [Wang et al., 2012a], image classification [Yang et al., 2014] and event detection [Xu et al., 2003]. Speech recognition using Bayesian networks [Nefian et al., 2002] or hidden Markov models [Rabiner, 1989] typically leverages supervision to maximise joint probabilities in the parameters. In the field of video summarisation, [Gong et al., 2014, Potapov et al., 2014] used human summaries together with the original videos to learn a model that can select the most informative frames. Recently, [Gygli et al., 2015, Kim et al., 2014] adapted supervised techniques to summarise videos by learning submodular functions. However, annotation is expensive, time-consuming and highly subjective to the annotators.

2.6.2 Unsupervised learning

In unsupervised learning, data points do not have labels associated with them, so the algorithms must figure out what is being shown and organise the unlabelled data somehow to describe their hidden structure. As an example, consider the following scenario from a bank: given data about customers (education, salary, age, etc.), the aim is to cluster the customers into homogeneous groups. The hope is that all customers within a single group

will exhibit some similarity in behaviour, such as their propensity to spend, save or repay. The bank can usefully exploit these similarities from a business perspective to tailor its offering to its customers. Popular techniques for unsupervised learning include Expectation-Maximisation [Bailey et al., 1994], nearest-neighbour clustering [Roweis and Saul, 2000], k-means clustering [Figueiredo and Jain, 2002], neural networks [Lee et al., 2009] and max-margin [Maji and Malik, 2009]. Much work has been dedicated to building models for human action classification that can be learned in an unsupervised fashion. The works presented by [Le et al., 2011, Niebles et al., 2008, Wong and Cipolla, 2007] discovered semantic clusters by learning the features directly from the training data. The method in [Niebles et al., 2008] applied two unsupervised learning models - probabilistic Latent Semantic Analysis (pLSA) [Hofmann, 1999] and Latent Dirichlet Allocation (LDA) [Blei et al., 2003] - to classify and localise human actions. Some researchers have also explored unsupervised learning for face detection: for example, [Le, 2013] employed a convolutional network model for unsupervised learning of facial expressions; the model was able to detect faces and bodies of humans and cats in a large dataset. Another example is the work of [Cao et al., 2010], where the authors encoded the local micro-structures of the face into a set of discrete codes by using three unsupervised learning methods: the PCA tree [Freund et al., 2007], k-means, and the random projection tree [Freund et al., 2007]. A well-known track of research is video summarisation without supervision. Such methods produce keyframe summaries using conventional clustering [Wang and Merialdo, 2009, Mundur et al., 2006] or hierarchical clustering [Mahmoud et al., 2013, Zhu et al., 2004]: the frame which is closest to a cluster centroid is designated as a keyframe. The video can be clustered using low-level features [De Avila et al., 2011] or objects [Lee et al., 2012]. A recent approach reduces redundancy by learning a dictionary of keyframes [Yang et al., 2013, Cong et al., 2012a].
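A minimal sketch of this clustering-based keyframe selection (our own illustration of the generic approach, not a specific published method):

```python
import numpy as np

def keyframe_summary(frame_feats, k, iters=50, seed=0):
    """Unsupervised keyframe selection: k-means over per-frame features,
    then pick the frame closest to each cluster centroid.

    frame_feats: (T, d) array, one feature vector per frame.
    Returns the sorted indices of the selected keyframes.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frame_feats), k, replace=False)
    centres = frame_feats[idx].astype(float)
    for _ in range(iters):
        d2 = ((frame_feats[:, None] - centres[None]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):  # skip empty clusters
                centres[c] = frame_feats[assign == c].mean(axis=0)
    d2 = ((frame_feats[:, None] - centres[None]) ** 2).sum(-1)
    keyframes = d2.argmin(axis=0)  # closest frame to each centroid
    return np.unique(keyframes)
```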

2.6.3 Semi-supervised learning

Semi-supervised learning (SSL) is intermediate between supervised and unsupervised learning. Traditional learning models use only labelled data for training. However, labelled data are often costly and time-consuming to generate, or they may require experts in a certain domain (e.g. to transcribe an audio segment) or a physical experiment (e.g. to determine whether there is oil at a particular location). Conversely, the acquisition of unlabelled data is rather inexpensive. Another reason why semi-supervised learning may be needed is when the (x, y) pair is insufficient for representing the input-output relationship, since this relationship also depends on unobserved/latent variables, h. These unobserved quantities, or missing data, even if not obvious, are essential for expressing models for such applications. For example, in automated translation, we may wish to establish whether the many noun phrases in a document refer to the same object. Semi-supervised learning attempts to build models with higher accuracy by utilising a large amount of unlabelled data together with the labelled data. SSL has gained wide attention in the machine learning community, in both practice and theory, since it is the most common scenario in real applications. SSL can be either inductive or transductive, as explained hereafter.

A learner is transductive if it takes into account a particular (unlabelled) test set as well as a (labelled) training set, and attempts to minimise misclassification of only those particular examples. This is somewhat natural for inference based on graph representations of the data, with a node for each labelled or unlabelled datum. Transduction was pioneered by Vapnik [Vapnik and Chervonenkis, 1974, Vapnik and Sterin, 1977]. A transductive learner faces limitations if the data are provided incrementally in a stream: if a new unlabelled sample is added, the whole training has to be repeated with all of the samples to predict a label, which is computationally expensive. In inductive learning, the model is trained on the labelled data and is then used to predict labels for all of the (unlabelled) test data. Consider the pair (x, f(x)), where x is the input data and f(x) is the output of function f applied to x. The task of pure inductive inference is this: given a set of samples, return a function f that agrees with all the examples. Induction algorithms have the advantage of being independent of the unlabelled data. These styles of semi-supervised learning can be applied to any type of model: for example, generative models using Bayes' rule [Ratsaby and Venkatesh, 1995] or the chain rule [Schum, 1994], low-density separation models headed by max-margin algorithms [Cortes and Vapnik, 1995], heuristic models and graph-based models. Semi-supervised learning has proven its convenience in many computer vision applications. In the fields of object detection and action recognition, the latent variables may correspond to the locations of parts [Bouchard and Triggs, 2005], simple motion [Niebles and Fei-Fei, 2007] or complex motion [Niebles et al., 2010]. The max-margin algorithms are championed by latent SVM, which has proven to offer significant advantages in object and action recognition. Latent SVM was effective for object detection in [Felzenszwalb et al., 2008], where the positions of the object parts were treated as latent variables. Inspired by the success of [Felzenszwalb et al., 2008], [Liu et al., 2011] offered a framework built on a latent SVM formulation to recognise human actions using high-level semantic concepts, called attributes. The latent variables were utilised to represent the importance of each attribute for each class. The proposed model was able to recognise a class label with significant accuracy by using both manually-specified attributes and trained attributes. We discuss latent SVM in more detail in a later section.
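As a simple illustration of inductive SSL, the sketch below implements generic self-training: a strategy distinct from the latent-SVM methods discussed in this thesis, shown only to make the labelled/unlabelled interplay concrete (all function names are hypothetical):

```python
import numpy as np

def self_training(fit, predict_proba, x_lab, y_lab, x_unlab,
                  thresh=0.95, rounds=5):
    """Generic self-training loop for inductive semi-supervised learning.

    fit(x, y) -> model; predict_proba(model, x) -> (n, n_classes) scores.
    At each round, confidently-predicted unlabelled samples are moved
    into the labelled pool and the model is retrained on the union.
    """
    for _ in range(rounds):
        model = fit(x_lab, y_lab)
        if len(x_unlab) == 0:
            break
        proba = predict_proba(model, x_unlab)
        conf, pred = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= thresh
        if not keep.any():
            break  # no confident predictions left
        x_lab = np.vstack([x_lab, x_unlab[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        x_unlab = x_unlab[~keep]
    return fit(x_lab, y_lab)
```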

2.7 Classification Methods

A large number of different classification techniques have been proposed and applied to human action classification. Amongst others, they include decision trees, k-NN, Bayesian networks and support vector machines (SVMs). The aim of the classification process is to predict the class label, y, given the measurement, x. In some cases (so-called structured prediction), y can be a graph of labels that have to be predicted jointly from a set of measurements. During training, some of the labels may be unknown. For this reason, hereafter we first review k-NN and SVM, and then structural SVM and latent structural SVM.

2.7.1 k-NN

The k-Nearest Neighbours algorithm (k-NN) is a simple, non-parametric method that stores all available samples and uses the K most similar (closest) neighbouring samples to classify a new sample. The distance is typically measured by the Manhattan or Euclidean distance, and the new sample is assigned to the majority class of its K neighbours. If K = 1, the sample is simply assigned to the class of its closest neighbour. In large datasets, the comparisons needed to find the closest neighbours are computationally expensive (however, fast search algorithms are available).
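A minimal k-NN classifier can be sketched as follows (illustrative code; assumes integer class labels and the Euclidean distance):

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """Classify one sample by the majority class of its k nearest
    neighbours. y_train: non-negative integer class labels."""
    dists = np.linalg.norm(x_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(y_train[nearest])
    return votes.argmax()
```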

The support vector machine objective is to find the optimal separating hyperplane: maximising the distance from the closest points of both classes, and minimising the risk of misclassifying the training points. The hyperplane can be described by the equation $w^T x + b = 0$, where $w \in \mathbb{R}^d$ and $b$ is a scalar. The hyperplane is found by solving the following optimisation problem:

$$\begin{aligned} \operatorname*{argmin}_{w, b, \xi} \; & \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \\ & y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1 \ldots N \end{aligned} \qquad (2.2)$$

The objective is a quadratic (therefore, convex) function subject to linear inequality constraints. The variables $\xi_i$ are called slack variables and allow the two classes to be non-perfectly separable. By definition, a slack variable $\xi_i$ for a data point $x_i$ is larger than or equal to 0, so that it implies slackness of the corresponding constraint. Likewise, we can notice that if $\xi_i > 1$, the data point is misclassified; if $0 < \xi_i \le 1$, the data point lies between the margin and the right side of the hyperplane (it is correctly classified, but with a poor margin); and if $\xi_i = 0$, the data point lies on the margin (i.e., it is a support vector) or further inside the correct region (see Figure 2.4). Constant $C$ is a regularisation parameter used to balance the importance of maximising the margin (which is equivalent to minimising the norm of $w$) and minimising the training error. The higher the value of $C$, the more one penalises misclassified samples compared to a smaller margin. The cross-validation approach (i.e., try and see how the model works on a validation set) is common for choosing the value of parameter $C$. However, when the data are heavily non-linearly separable in the original data space, the points must be mapped - implicitly or explicitly - using a mapping function (a kernel) to a space of higher dimensions, so that they become separable by a hyperplane in that space (Figure 2.4). Once the model $w$ is learned, the inference of the class (prediction, classification) for a new point, $x$, is given by:

$$y^* = \operatorname*{argmax}_{y \in \{-1, +1\}} \; y\,(w^T x + b) \qquad (2.3)$$
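To make the objective in (2.2) concrete, the sketch below minimises the equivalent, unconstrained hinge-loss form by sub-gradient descent. It is a didactic toy under assumed names (train_linear_svm, predict); a production solver would use quadratic programming or SMO, as in libsvm:

    import numpy as np

    def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
        # Minimal primal solver for the soft-margin SVM of Eq. (2.2):
        # minimises ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
        # X: (N, d) array; y: (N,) array with labels in {-1, +1}.
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            margins = y * (X @ w + b)
            viol = margins < 1                       # points with slack > 0
            grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
            grad_b = -C * y[viol].sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    def predict(w, b, X):
        # Eq. (2.3): the sign of the decision function selects the class
        return np.sign(X @ w + b)

In practice, the value of C would be chosen by the cross-validation procedure described above, retraining for each candidate value and keeping the one with the best validation accuracy.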

Multi-class SVM

Conventional SVMs are binary classifiers. However, most real-world problems have more than two classes. Multi-class classification with SVM is often achieved by using multiple binary SVMs. The conventional way to deal with SVM multi-class classification is to utilise the one-vs-one [Kreßel, 1999] or one-vs-all [Vapnik and Vapnik, 1998] strategies. The one-vs-one strategy consists of:

- breaking down the multi-class classification (assuming $K$ classes) into $K(K-1)/2$ binary classifiers;
- using the maximum-voting principle to predict the label: at prediction time, all classifiers are applied to a new sample and a voting scheme is used to predict the label.

In one-vs-all, one builds $K$ binary classifiers (one class as positive and all other classes as negative) for the $K$ classes. When a new sample needs to be classified, all $K$ classifiers are applied. This method seems quicker to train than one-vs-one because it requires $O(K)$ classifiers instead of $O(K^2)$, but each classifier in one-vs-all is usually bigger as it involves data from all classes. The choice between one-vs-one and one-vs-all is mainly empirical; a sketch of both reductions is given below. While practical, these strategies may lead to conflicting or insufficient assignments from the multiple classifiers. Moreover, breaking a multi-class classification problem into multiple independent binary classification problems does not fully capture the correlations between the classes [Crammer and Singer, 2002]. Alternatively, a true multi-class SVM can be solved exactly by a single machine that learns all the class models jointly in a single optimisation problem. An example is the multi-class SVM from Crammer and Singer [Crammer and Singer, 2002]:

$$\begin{aligned} \operatorname*{argmin}_{w, b, \xi \ge 0} \; & \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \\ & w_{y_i}^T x_i + b_{y_i} - (w_k^T x_i + b_k) \ge 1 - \xi_i \quad \forall k \ne y_i, \\ & \xi_i \ge 0, \quad i = 1 \ldots N \end{aligned} \qquad (2.4)$$

where $w$ is the concatenation of the class models, $w^T = [w_1^T \ldots w_K^T]$, and $b = [b_1 \ldots b_K]$. In practice, training of this machine is significantly slower than the reductions described above.
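As an illustration of the two reductions (a toy example, not code from this thesis), the scikit-learn wrappers below build $O(K)$ one-vs-all machines and $K(K-1)/2$ one-vs-one machines around a linear SVM; the synthetic data are stand-ins for real feature vectors:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

    # Toy data: 3 classes in 2-D
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, size=(30, 2))
                   for c in ((0, 0), (3, 0), (0, 3))])
    y = np.repeat([0, 1, 2], 30)

    ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # K binary machines
    ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # K(K-1)/2 machines, majority vote

    print(len(ovr.estimators_))   # 3 classifiers -> O(K)
    print(len(ovo.estimators_))   # K(K-1)/2 = 3 classifiers for K = 3
    print(ovr.predict([[2.8, 0.2]]), ovo.predict([[2.8, 0.2]]))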

Figure 2.4: Binary support vector machines on (a) linearly separable data and (b) non-linearly separable data. Squares represent one class, circles the other. The support vectors lie on the margin.

Structural SVM

Structural SVM (SSVM) is an extension of the conventional support vector machine to the classification of structured outputs [Tsochantaridis et al., 2005]. The input can be any set of measurements, and the output classes can be complex and structured (chains, trees, graphs, etc.). In this case, notation $x$ represents all the measurements, while notation $y$ represents all the class variables. For instance, $x$ and $y$ can be all the measurements and states of an HMM. A training set to train an HMM will consist of $N$ such sequences of measurements and states, noted here as $x^i, y^i, i = 1 \ldots N$. Please note that hereafter we adopt superscripts (e.g., $^i$) to index the samples, for consistency with our publications, and reserve subscripts (e.g., $_t$) to index the various variables in the graph. In structural SVM, the primal objective to learn the discriminative function, $F(y, x) = w^T \psi(y, x)$, is posed as:

$$\begin{aligned} \operatorname*{argmin}_{w, \xi \ge 0} \; & \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \\ & w^T \psi(y^i, x^i) - w^T \psi(y, x^i) \ge \Delta(y^i, y) - \xi_i, \\ & i = 1 \ldots N, \; \forall y \in \mathcal{Y} \end{aligned} \qquad (2.5)$$

This is, again, a quadratic programming optimisation problem with linear constraints. However, the number of constraints is huge compared to the binary and multi-class cases, since it is exponential in the number of class variables. For instance, if the output, $y$, consists of 10 variables with 5 values each, the total number of their values, $|\mathcal{Y}|$, will be $5^{10} = 9{,}765{,}625$! Therefore, for a training set with $N$ samples, there will be that many constraints, times $N$. The feature function $\psi$ extracts a feature vector from a given sample, and $w$ is a model with an adequate number of parameters. The loss function $\Delta(y^i, y)$ is an arbitrary, user-defined function that measures the loss between the ground-truth labelling and the predicted labelling; the Hamming loss is the most common choice.
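For concreteness, here is a minimal sketch of the two ingredients the formulation asks for - a joint feature map and a Hamming loss - for a simple chain-labelling problem. The function names and the unary-plus-transition feature map are illustrative assumptions, not the feature functions used later in this thesis:

    import numpy as np

    def joint_feature_map(x, y, n_classes):
        # psi(y, x) for a chain: class-conditional sums of the frame
        # measurements plus label-transition counts (a common minimal choice).
        # x: (T, d) measurements; y: (T,) integer labels.
        T, d = x.shape
        unary = np.zeros((n_classes, d))
        trans = np.zeros((n_classes, n_classes))
        for t in range(T):
            unary[y[t]] += x[t]
            if t > 0:
                trans[y[t - 1], y[t]] += 1
        return np.concatenate([unary.ravel(), trans.ravel()])

    def hamming_loss(y_true, y_pred):
        # Delta(y^i, y): fraction of mislabelled positions
        return np.mean(np.asarray(y_true) != np.asarray(y_pred))

    # Score of a labelling under model w: F(y, x) = w . psi(y, x), with
    # w of size n_classes * d + n_classes ** 2 (the "adequate number of
    # parameters" mentioned above).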

As a fundamental breakthrough for this objective, [Tsochantaridis et al., 2005] proposed a method to obtain a closely-approximated solution by drastically reducing the number of constraints. For each sample, at every iteration of the solver only the most violated constraint is added to a working set of constraints. The most violated constraint is found by the loss-augmented inference:

$$\bar{y}^i = \operatorname*{argmax}_{y} \left( w^T \psi(y, x^i) + \Delta(y^i, y) \right) \qquad (2.6)$$

Again, once the model is learned, the prediction for a new sample, $x$, is made according to:

$$y^* = \operatorname*{argmax}_{y} F(y, x) = \operatorname*{argmax}_{y} w^T \psi(y, x) \qquad (2.7)$$

Main applications of SSVM

In the structural SVM learning framework, the aim is to maximise the objective function while at the same time permitting computation of the model. In the literature, structural SVM [Tsochantaridis et al., 2005] draws its motivations from applications such as handwritten digit recognition, object recognition, information retrieval and structured prediction (see Figure 2.5). The input can be any kind of object, and the output can be complex and structured (chains, trees, graphs, etc.). Structural SVM is based on the cutting-plane algorithm (sketched below), which has also been effectively applied to binary classification, multi-class classification, sequential labelling, sequence alignment and Context-Free Grammar (CFG) parsing [Joachims et al., 2009]. Hereafter we cite a number of notable utilisations: [Wang and Mori, 2009] have used structural SVM for action recognition from static images; they modelled the actor's pose as a graph with a root node and a constellation of implicitly correlated local patches as the hidden variables. [Zhang and Piccardi, 2014] have used structural SVM with different loss functions to label sequential data in the area of video segmentation. The recognition system proposed by [Yang et al., 2010] configured the poselets of body parts as latent variables and inferred them using a tree structure. Local shapes and dense optical flow were used by [Schindler and Van Gool, 2008] for recognition. Structural SVM has also proved beneficial with higher-order energy functions [Fix et al., 2013].
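The cutting-plane scheme itself is compact enough to sketch. The snippet below is a schematic rendering of the working-set loop around (2.6), with solve_qp, loss_aug_inf and the tolerance eps as assumed placeholders rather than a faithful copy of the solver of [Tsochantaridis et al., 2005]:

    import numpy as np

    def cutting_plane(samples, psi, delta, loss_aug_inf, solve_qp, dim,
                      C=1.0, eps=1e-3, max_iter=100):
        # Schematic working-set training for structural SVM.
        # samples: list of (x_i, y_i); psi(y, x) -> np.ndarray;
        # delta(y_true, y) -> scalar loss;
        # loss_aug_inf(w, x, y_true) solves Eq. (2.6);
        # solve_qp(working_set, C) -> new w (placeholder QP solver).
        w = np.zeros(dim)
        working_set = []
        for _ in range(max_iter):
            n_added = 0
            for x_i, y_i in samples:
                y_bar = loss_aug_inf(w, x_i, y_i)                 # Eq. (2.6)
                violation = (delta(y_i, y_bar)
                             + w @ psi(y_bar, x_i) - w @ psi(y_i, x_i))
                if violation > eps:        # constraint of (2.5) is violated
                    working_set.append((x_i, y_i, y_bar))
                    n_added += 1
            if n_added == 0:               # eps-approximate solution reached
                break
            w = solve_qp(working_set, C)   # re-optimise on the working set
        return w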

Our algorithms, described in Sections 3.3 and 4.3, exploit structural SVM with and without latent variables [Yu and Joachims, 2009] to recognise the action and find the video summaries.

Latent structural SVM

In many structured problems, the model formulation of the training data $(x^1, y^1), \ldots, (x^N, y^N)$ is not sufficient for describing the input-output relationship, because this relationship also depends on unobserved/latent variables, $h$. The use of latent variables has been explored widely in probabilistic models, such as the hidden Markov model (HMM) and the hidden conditional random field (HCRF). These models are progressively being used by the research community for problems such as speech recognition [Rabiner, 1989], background modelling [Stenger et al., 2001], and action recognition [Kim et al., 2010, Wang and Mori, 2011a]. In non-probabilistic models such as structural SVMs and max-margin Markov networks, the use of latent variables was not introduced until 2009, which limited the use of these models in many interesting problems despite their effectiveness in the supervised case. Yu and Joachims [Yu and Joachims, 2009] have extended the structural SVM framework to include latent variables and used the concave-convex procedure to find a local optimum. Latent structural SVM has proven its ability to model latent variables in many multimedia applications, especially those where it is expensive to produce large amounts of labelled data, such as speech recognition [Zhang et al., 2011], complex event detection [Tang et al., 2012, Vahdat et al., 2013], object detection [Felzenszwalb et al., 2010] and action recognition [Wu and Jia, 2012, Wu et al., 2013]. The recognition system proposed by [Wu and Jia, 2012] treated the scene class labels as hidden variables, while the work presented by [Zhu et al., 2010] for object detection modelled the positions of the object's parts as hidden variables. Moreover, [Wang and Mori, 2011a] have proposed an MM-HCRF (max-margin hidden conditional random field) for human action recognition that models the positions of the actor's parts as hidden variables.

Figure 2.5: Examples of structured problems.

Formulation

In latent structural SVM, training involves a set of training data, $\{(x^1, y^1), \ldots, (x^N, y^N)\}$, and a set of latent variables, $\{h^1, \ldots, h^N\}$. The latent variables may be unavailable for a number of reasons: missing values, high annotation cost, or unknown semantics. The goal is to learn a linear discriminative function $F: \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \rightarrow \mathbb{R}$, where $\mathbb{R}$ is a score domain that expresses the compatibility of the inputs. The learning objective associated with $F$ is defined as:

Step 1:

$$\begin{aligned} w^* = \operatorname*{argmin}_{w, \xi} \; & \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \\ & \underbrace{w^T \psi(x^i, y^i, h^i)}_{\text{score of ground truth}} - \underbrace{w^T \psi(x^i, y, h)}_{\text{score of prediction}} \ge \underbrace{\Delta(y^i, y, h)}_{\text{required margin = loss}} - \; \xi_i \quad \forall i, y, h \end{aligned} \qquad (2.8)$$

Step 2:

$$h^i = \operatorname*{argmax}_{h} \left( w^T \psi(x^i, y^i, h) \right)$$

The learning objective alternates between these two steps until convergence (which is provably guaranteed, to a local optimum). In step 1, the algorithm searches for the model that gives a higher score to the ground truth than to other labelings; the margin between these scores must be at least equal to the loss of the prediction. At the first iteration, the $h^i$ are initialised arbitrarily. In step 2, the $h^i$ are assigned new values using the updated model $w$. Objective (2.8) is a standard optimisation problem that can be solved with any common solver. Despite the exponential size of the constraint set, [Tsochantaridis et al., 2005] have solved this problem, again, by using the cutting-plane algorithm, obtaining a closely-approximated solution that drastically reduces the number of constraints. Namely, for each sample and at every iteration of the solver, only the most violated constraint is added

to a working set of constraints. The number of constraints required to achieve an arbitrary $\epsilon$-approximation of the solution is proven to be polynomial in $N$. The most violated constraint is found by the loss-augmented inference:

$$\xi_i = \max_{y, h} \left( -w^T \psi(x^i, y^i, h^i) + w^T \psi(x^i, y, h) + \Delta(y^i, y, h) \right) \qquad (2.9)$$

which, as in standard structural SVM, boils down to finding the labeling with the highest sum of score and loss:

$$\bar{y}^i, \bar{h}^i = \operatorname*{argmax}_{y, h} \left( w^T \psi(x^i, y, h) + \Delta(y^i, y, h) \right) \qquad (2.10)$$

The ideal case is where the loss function, $\Delta(y^i, y, h)$, decomposes in the same way as the score, so that the same algorithms available for the conventional inference can also be used for the loss-augmented one. Eventually, once the model is trained, the best $y$ and $h$ are found by the inference rule:

$$y^*, h^* = \operatorname*{argmax}_{y, h} w^T \psi(x, y, h) \qquad (2.11)$$

Recap

To apply the latent structural SVM algorithm to a structured prediction problem with latent variables, we basically need to formulate an appropriate feature map $\psi(x, y, h)$ and a loss function $\Delta(y^i, y, h)$ for the following three problems:

Loss-augmented inference:

$$\operatorname*{argmax}_{y, h} \left( w^T \psi(x^i, y, h) + \Delta(y^i, y, h) \right) \qquad (2.12)$$

Latent variable assignment:

$$\operatorname*{argmax}_{h} \left( w^T \psi(x^i, y^i, h) \right) \qquad (2.13)$$

Inference:

$$\operatorname*{argmax}_{y, h} w^T \psi(x, y, h) \qquad (2.14)$$
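A bird's-eye sketch of the resulting alternation is given below. The function names and the convergence test are illustrative assumptions; the actual solver used later in this thesis is Joachims' implementation:

    import numpy as np

    def latent_ssvm(samples, psi, delta, argmax_yh, argmax_h, solve_step1,
                    dim, max_outer=20):
        # Alternating scheme of Eq. (2.8); samples is a list of (x_i, y_i).
        # argmax_yh solves the loss-augmented inference (2.12),
        # argmax_h the latent assignment (2.13), and solve_step1 performs
        # the structural-SVM optimisation of step 1 given the imputed h's.
        w = np.zeros(dim)
        # arbitrary initialisation of the latent variables (first iteration)
        h = [argmax_h(w, x_i, y_i) for x_i, y_i in samples]
        for _ in range(max_outer):
            w_new = solve_step1(samples, h, psi, delta, argmax_yh)   # step 1
            h = [argmax_h(w_new, x_i, y_i) for x_i, y_i in samples]  # step 2
            if np.allclose(w, w_new):      # converged to a local optimum
                break
            w = w_new
        return w, h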

Figure 2.6: The diminishing-returns property in a submodular set function.

2.8 Submodular Functions

Submodularity is a property of set functions with profound theoretical implications that can be used to speed up the inference in structured prediction. Submodular inference has been used in many machine learning and data mining domains by properly choosing features and structures: in computer vision it has been recognised and employed in applications such as automatic summarisation of image collections [Lin and Bilmes, 2011, Tschiatschek et al., 2014], image segmentation [Kohli et al., 2009, Jegelka and Bilmes, 2011], feature selection [Krause et al., 2008] and active learning [Wei et al., 2015]. For summarisation, submodularity intuitively formalises the property of diminishing returns [Narasimhan and Bilmes, 2012], i.e., the idea that adding words to already large summaries brings relatively little benefit. A definition of submodularity is:

Definition 2.1. Let $V = \{v_1, \ldots, v_n\}$ and let $F: 2^V \rightarrow \mathbb{R}$ return a real value for any subset $A \subseteq V$; function $F$ is submodular iff

$$\forall A, B \subseteq V: \quad F(A) + F(B) \ge F(A \cup B) + F(A \cap B)$$

An equivalent definition, based on the notion of diminishing returns, is:

$$\forall B \subseteq A \subseteq V \setminus \{v\}: \quad F(B \cup \{v\}) - F(B) \ge F(A \cup \{v\}) - F(A)$$

The diminishing-returns property means that the incremental value of $v$ decreases as the set to which $v$ is added grows from $B$ to $A$. That is, adding $v$ to set $B$ offers a greater improvement of the score function, $F$, than adding $v$ to a larger set, $A$ (Figure 2.6). If this property is satisfied with the equality sign, $F$ is simply called a modular function (the opposite inequality defines supermodular functions). Submodular functions can be further classified as:

- Monotonic functions: $F$ is monotonic if $F(B) \le F(A)$ for all $B \subseteq A$.
- Non-monotonic functions: the above does not hold.

Submodular functions arise in connection with many optimisation problems, in particular:

Minimisation:

- Clustering: [Nagano et al., 2010] introduced the minimum average cost clustering problem, where the cost is computed by minimising a submodular objective function.
- Image segmentation: the work by [Jegelka and Bilmes, 2011] provided edge coupling for image segmentation; in their approach, the classes are identified by optimising over submodular energy functions.
- MAP inference: an investigation of probabilistic inference in Bayesian submodular models is presented by [Djolonga and Krause, 2014].

Maximisation:

- Feature selection: [Das and Kempe, 2011] offered an accurate predictor for feature selection and dictionary selection by analysing greedy algorithms using the notion of submodularity.
- Active learning: the framework in [Wei et al., 2015] formulated data subset selection as constrained submodular maximisation to obtain the desired outputs.
- Summarisation: [Lin and Bilmes, 2011, Sipos et al., 2012a] have presented submodular maximisation for document summarisation; [Tschiatschek et al., 2014] have proposed a similar approach for the summarisation of a collection of images.
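As a quick numerical illustration (a toy check, not part of this thesis' method), the snippet below verifies Definition 2.1 exhaustively for a small coverage function, which is both monotonic and submodular:

    from itertools import chain, combinations

    def coverage(A, sets):
        # F(A) = number of elements covered by the subsets indexed by A
        return len(set().union(*(sets[i] for i in A))) if A else 0

    def powerset(V):
        return chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))

    sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}   # toy ground set V = {0, 1, 2}
    V = list(sets)
    submodular = all(
        coverage(set(A), sets) + coverage(set(B), sets)
        >= coverage(set(A) | set(B), sets) + coverage(set(A) & set(B), sets)
        for A in powerset(V) for B in powerset(V)
    )
    print(submodular)   # True: coverage satisfies Definition 2.1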

Why submodularity?

- Greedy algorithms: the main benefit of submodular functions for maximisation is the performance guarantee of greedy inference algorithms. A greedy algorithm solves the inference problem heuristically: it makes one greedy decision after another, reducing the inference search to a much smaller one. In other words, a greedy algorithm finds a local optimum at each step in the hope of finding a global or near-global optimum in the end. The maximum achieved by a greedy algorithm over a monotonic submodular function is at least $(e-1)/e$ of the actual maximum [Nemhauser et al., 1978b]. In practice, various papers have shown that these greedy algorithms are often within 90% of the actual maximum [Krause, 2008] and that they also work well with submodular functions which are not monotonic. The simplicity of greedy inference makes it attractive for various applications where exact inference is not feasible (a sketch is given below).
- Submodularity is a natural property of many common scoring functions. For instance, in summarisation any function that rewards good coverage of the whole document is spontaneously submodular. The same goes for functions that penalise redundancy inside the summary [Carbonell and Goldstein, 1998, Lin and Bilmes, 2010]. Any positive combination of these two functions is also submodular; in fact, any positive combination of any number of submodular functions is still submodular (by propagation of submodularity).
- Budget-additive: in settings where there is a fixed budget, $B$, on the inference (for instance, a maximum number of variables that can be inferred as positive), the combination of greedy inference and submodular functions allows us to easily respect the budget. Inference is also faster, since it is $O(B)$.
- Diminishing revenues: adding another frame to an already large summary brings little improvement, justifying working with a budget.
- Submodularity provides sufficient structure for a mathematically elegant and practically useful theory to be exploited.
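A minimal sketch of the budgeted greedy maximiser these guarantees apply to (names are illustrative; F can be, for instance, the toy coverage function from the previous snippet):

    def greedy_max(V, F, budget):
        # Greedily maximise a monotonic submodular F over subsets of V,
        # adding the element with the largest marginal gain until the budget
        # is reached. Guaranteed to reach >= (e - 1)/e of the optimum.
        S = set()
        while len(S) < budget:
            gains = {v: F(S | {v}) - F(S) for v in set(V) - S}
            best = max(gains, key=gains.get)
            if gains[best] <= 0:        # no positive marginal gain left
                break
            S.add(best)
        return S

    # Example with the coverage function above:
    # greedy_max({0, 1, 2}, lambda A: coverage(A, sets), budget=2)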

2.9 Video Summarisation and Evaluation

Video summarisation is a family of algorithms targeted at helping users to understand, browse and analyse videos quickly. The user is able to traverse the video at a higher altitude, ignore details, and see shots of videos in response to queries such as "In a baseball video, show me the squatting catcher at the beginning of every pitch".

Video summarisation approaches

Throughout the years, an assortment of algorithms have been proposed for automated video summarisation. Most existing algorithms can be categorised as either (a) clustering approaches or (b) frame-difference approaches. Various steps are involved in a clustering-based approach [De Avila et al., 2011, Ghosh et al., 2012, Jaffe et al., 2006, Mundur et al., 2006]: first, the clusters are built based on the similarities between frames; then, the clusters are ranked according to some notion of importance and only the most important are retained; eventually, the frames that are closest to the centre of each cluster are selected as the summary. The videos can be clustered using low-level features [De Avila et al., 2011] or objects [Ghosh et al., 2012], and structure can also be usefully enforced during clustering [Chen et al., 2009, Gygli et al., 2015]. The structure is used to capture correlation between the frames and also to favour some or all objectives of a desired summary, such as likelihood, coverage, importance and orthogonality. Greedy algorithms can be used to quasi-optimise these objectives. For example, [Gygli et al., 2015] presented an automated model for video summarisation that learns submodular mixtures of global characteristics of desirable summaries. Frame-difference approaches, on the other hand, scan the video's frames in sequential order to detect shot boundaries and key frames [Xiong et al., 2006, Cong et al., 2012a, Yang et al., 2013, Lu et al., 2016].

Video summarisation evaluation

Unlike for text, a reliable and shared measure for the quantitative assessment of a video summary's quality is still missing. Ideally, such an evaluation measure should reflect the quality of a summary for:

- Understanding: reducing the time needed by the users to understand a video compared to the original sequence.
- Browsing: when a video summary is used to help a user search for a specific video in a large database.
- Query search: how well and how easily a user navigates to a required position in the video.

Amongst others, the F-score has often been used for video summarisation evaluation [Li et al., 2011, Ejaz et al., 2013, Gygli et al., 2014, Gygli et al., 2015]. It consists of the harmonic mean of the precision and the recall of the predicted summary. The precision is defined as the number of frames occurring in both the ground-truth and the predicted summaries divided by the number of frames in the predicted summary; the recall is the number of frames occurring in both the ground-truth and the predicted summaries divided by the number of frames in the ground-truth summary. The F-score is formalised as:

$$F = \frac{2 \cdot precision \cdot recall}{precision + recall} \quad \text{s.t.} \quad recall = \frac{N_{pre \cap gt}}{N_{gt}}, \quad precision = \frac{N_{pre \cap gt}}{N_{pre}} \qquad (2.15)$$

where $N_{gt}$ is the total number of frames in a ground-truth summary, $N_{pre}$ is the total number of frames in a predicted summary and $N_{pre \cap gt}$ is the number of predicted and ground-truth frames being matched. This is also known as the F1 measure, because recall and precision are equally weighted; in its $F_\beta$ extension, a relative weight, $\beta$, is applied. Following the effective application of summarisation evaluation measures such as ROUGE [Lin, 2004] in the case of natural language, [Tschiatschek et al., 2014] have introduced a recall-based measure called V-ROUGE and used it to evaluate the summarisation of image collections. V-ROUGE represents each frame based on bags-of-words, i.e. as a histogram. The measure is computed by counting the number of overlapping words between the predicted summary and the ground-truth summaries from multiple annotators. For example, the comparison of two images, $A$ and $B$, can be done by a function called histogram intersection, $\min(A, B)$.
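A direct implementation of (2.15), treating summaries as sets of frame indices, is sketched below; frame "matching" here is a plain set intersection, whereas matching in practice may be softer:

    def summary_f_score(pred_frames, gt_frames):
        # F1 of a predicted summary against one ground-truth summary,
        # both given as sets of frame indices (Eq. 2.15)
        overlap = len(set(pred_frames) & set(gt_frames))   # N_{pre & gt}
        if overlap == 0:
            return 0.0
        precision = overlap / len(set(pred_frames))        # overlap / N_pre
        recall = overlap / len(set(gt_frames))             # overlap / N_gt
        return 2 * precision * recall / (precision + recall)

    print(summary_f_score({3, 10, 42, 77}, {10, 42, 50}))  # 0.571...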

Others have used a recall-based measure called Comparison of User Summaries (CUS) [Guimarães et al., 2003, De Avila et al., 2011, Almeida et al., 2012, Ejaz et al., 2012]. The CUS method compares the predicted summary with the ground-truth summaries from multiple annotators by various distances, such as the Manhattan distance. Instead of comparing with multiple ground-truth summaries, [Guimarães et al., 2003] synthesised a single, optimal summary from the ground-truth summaries; this single summary is then compared with the predicted summary. However, while all these measures can prove useful, they do not convey the mutual information amongst the frames [Chen and Lin, 2006] and are not fit for video summarisation, where the output is an ordered sequence of frames rather than just a set. A widely accepted and reliable evaluation measure for video summaries that takes the frame order into account is still missing.

2.10 Video Summarisation and Submodular Functions

Video summarisation is undoubtedly a foundational area in multimedia. It provides concise information about a video by means of a few, informative frames. Many techniques have been proposed for video summarisation and many of them depend on clustering, which is computationally expensive [De Avila et al., 2011]. An existing state-of-the-art method offered by [Chen et al., 2009] is based on a graph that reflects the story structure and the semantic arrangements among entities to detect the most meaningful shots. A useful video summary typically enjoys two properties: coverage, accounting for the similarity between the summary and the rest of the video, and non-redundancy, accounting for the diversity amongst the frames in the summary. These two properties can be combined into a single scoring function by suitable weights and the optimal summary can then be sought. Unfortunately, the number of possible summaries is exponential in the number of frames and this search may prove prohibitive. Submodular functions can curb this problem and have had a major influence in computer vision, natural language processing and machine learning in general. Submodular functions have been increasingly identified and utilised for social network analysis, graph cutting and information gathering. Indeed, the rich structure of submodular functions enables them to be customised over discrete optimisation problems, in convex analysis and as regularisation functions in combinatorial structures for both supervised and unsupervised learning [Bach, 2011]. Document summarisation as presented in [Lin and Bilmes, 2011] has been framed as a

mixture of submodular functions representing the diversity and the coverage of a particular summary. In turn, max-margin approaches such as structural SVM have also been applied to learn the weights of submodular functions for text summarisation by Sipos et al. [Sipos et al., 2012a] and for image summarisation by Tschiatschek et al. [Tschiatschek et al., 2014]. These works have remarked that the scoring function is submodular, and have exploited the properties of submodularity to provide fast and effective summary inference.

Formulation

In this thesis, we exploit submodularity to provide the summarisation of a video $x = \{x_1, \ldots, x_t, \ldots, x_T\}$. Our aim is to jointly infer the class label $y$ for the video and a summary $h = \{h_1, \ldots, h_t, \ldots, h_T\}$, where $h_t \in \{0, 1\}$. For every frame $x_t$, the corresponding $h_t$ is set to 1 if the frame is included in the summary and to 0 otherwise. The summary frames are selected so as to maximise the scoring function $F$ subject to a budget, $B$, i.e. the maximum number of frames allowed to go in the summary. This inference is formalised as:

$$y^*, h^* = \operatorname*{argmax}_{y, h} F(y, h, x) \quad \text{s.t.} \quad \sum_{t=1}^{T} h_t \le B \qquad (2.16)$$

The maximisation of the score function $F$ for a video $x$ yields the label of the predicted class together with the summary of that video. More specifically, based on a similarity function $\phi$ that measures the distance between any two frames, the summary should be representative of the whole video, have limited redundancy and simultaneously help infer the class label. In addition, we will exploit a similar approach to solve the loss-augmented inference - the crux of both structural and latent structural SVM.

2.11 Action Recognition in Depth Videos

Action recognition in video has been a fruitful area of research for over fifteen years. Indeed, its significance is demonstrated by its many, diverse applications such as video surveillance, video retrieval, human-computer interaction, sports video analysis, and home intelligence. Its usefulness lies in its ability to classify the actions that occur in a scene [Ali and Shah, 2010]. Action recognition is a challenging area due to the inherently noisy nature of

observations captured by cameras, which are frequently subject to variations in viewpoint and illumination, occlusions and scaling. In addition, the same actions are performed differently by different people, and humans exhibit very large anthropomorphic variations. A revolution has taken place with the introduction of depth cameras, where a new dimension has been added to the colour information delivered by conventional cameras. Depth is a fundamental cue for object and action recognition in video since it is not influenced by colours and appearance. Depth cameras provide an approximation of the distance, $D$, between the camera's focal point and every object in the scene, and are relatively low-cost. The information provided by depth cameras helps disambiguate occlusions and resolve illumination artifacts and, under certain viewing conditions, allows estimating the actor's pose in terms of a skeletal model (a model containing all the main joints of the actor). Thereby, over the last few years a number of works have addressed action recognition with depth cameras, with approaches based on different combinations of depth, RGB and skeletal features. For instance, in [Wang et al., 2012b], local occupancy information was computed at given skeletal positions and used to model human-object interaction. In [Wang et al., 2014b], a hierarchical compositional model based on RGB and skeleton features was used for cross-view action recognition. However, the skeleton information is not always obtainable and is usually noisy, with a limited ability to handle view occlusions and to model the interactions between humans and objects. In [Oreifej and Liu, 2013], a new 4D spatio-temporal descriptor was proposed to effectively describe a depth video. In [Lu et al., 2014], new range-sample features were proposed to provide invariance to background and occlusions. All these approaches assume that the actor's pose conveys the most important information about the action. Conversely, approaches based on descriptors from the entire image, such as [Bo et al., 2011, Blum et al., 2012], can recognise a wider variety of actions related to the presence of telling objects and contextual information.

2.12 Datasets

In this section we present a brief description of some of the state-of-the-art human action recognition datasets ("benchmarks") that are commonly used to evaluate depth action recognition methods. The following subsections describe the ACE, MSR DailyActivity3D and MSR Action3D datasets, respectively. These datasets are three-dimensional

(3D) video data in which a new dimension - depth - has been added to the colour information. A comparative example of colour and depth channels is shown in Figure 2.7.

Figure 2.7: A comparison between RGB channels and depth channels (reprinted from [Wang et al., 2014a]).

ACE

ACE is the Actions for Cooking Eggs dataset, released as part of a contest called Kitchen Scene Context based Gesture Recognition at ICPR. It was collected in a kitchen environment by a Kinect RGBD camera that delivers synchronised colour and depth image sequences at 30 fps. Each of the recordings is from 5 to 10 minutes in length. As annotation, a class label was manually assigned to each frame. There are five menus for cooking eggs: omelet, boiled egg, ham and eggs, Kinshi and scrambled egg. Each menu was staged by five different actors. For test purposes, two cooking scenes were captured for each menu. The main task is to recognise eight kinds of human cooking actions: cutting, seasoning, peeling, boiling, turning, baking, mixing and breaking. ACE is a challenging dataset since most of the actions share the same body postures and involve similar human-object interactions. For example, the actions turning and mixing have almost identical motion sequences for the hand: if the cook uses a pan, the class label should be turning, while it should be mixing if the cook uses a bowl or a pan. Example clips are shown in Figure 2.8.

Figure 2.8: Typical clips of the ACE actions, performed by five different actors (distinguishable by their clothing): (a) Breaking: a cook is cracking an egg; (b) Mixing: a cook is mixing something in a bowl or a pan; (c) Baking: a cook is baking something on a pan; (d) Turning: a cook is turning something in a pan; (e) Cutting: a cook is cutting something on a cutting board; (f) Boiling: the water is boiling; (g) Seasoning: a cook is seasoning with salt; (h) Peeling: a cook is peeling egg shells.

MSR DailyActivity3D

The MSR DailyActivity3D dataset was released by Microsoft Research and captured using a Kinect sensor. It contains 16 indoor activities that cover most daily activities in a living room: drinking, eating, reading, using a cell phone, writing, using a computer/laptop, vacuuming, cheering up, sitting still, tossing crumpled paper, playing a game, lying on the sofa, walking, playing the guitar, standing up, and sitting down. The total number of videos is 320, staged by 10 actors and performed in two different poses, one standing close to the couch and the other sitting on it. For evaluation, a cross-subject protocol is common, with subjects 1-5 used for training and subjects 6-10 for test. As shown in Figure 2.9, the objects in the background are very close to the subject, and the same action includes many variations as a result of the different positions. Also, some actions, such as writing and eating, are composed of very precise movements of the hands. Thus, learning from this dataset is also very challenging.

Figure 2.9: Some examples from MSR DailyActivity3D, displayed as RGB and depth frames, for (a) Drink, (b) Eat, (c) Call and (d) Write: the first column in each subfigure shows the subject standing close to the couch; the second, sitting on it.

MSR Action3D

The MSR Action3D dataset was introduced in [Li et al., 2010]. It consists of twenty outdoor sports action classes: side kick, jogging, golf swing, hand catch, forward punch, high throw, draw circle, draw x, draw tick, hand clap, two-hand wave, side boxing, hammer, forward kick, horizontal arm wave, high arm wave, bend, tennis serve, tennis swing, and pick up and throw (see Figure 2.10). Each action is performed two or three times by ten subjects, resulting in 567 depth videos in total. The common setup for evaluation is cross-subject, with subjects 1, 3, 5, 7, 9 used for training and subjects 2, 4, 6, 8, 10 for test. The background is clean and static in most sequences, making this dataset generally easier than the previous two.

Figure 2.10: Sample clips from MSR Action3D for actions (a) Draw tick and (b) Tennis serve (reprinted from [Li et al., 2010]).

Amongst these datasets, an extensive amount of research has been carried out on MSR Action3D, whose accuracy has almost saturated, with the overall best performance at 96.7% [Luo et al., 2013]. The MSR DailyActivity3D dataset is considered more challenging than MSR Action3D. In turn, the ACE dataset is much bigger: almost ten times larger than MSR DailyActivity3D. For these reasons, in this thesis we have used both the MSR DailyActivity3D and ACE datasets to prove the effectiveness of our approaches.

However, none of the existing methods has tackled action recognition and video summarisation jointly. Another challenge is that a meaningful measure for the quantitative evaluation of a video summary is still missing. In order to address these gaps, this thesis presents a framework that combines action recognition and video summarisation into a single objective, together with a novel measure to evaluate the quality of a predicted video summary against the annotations of multiple annotators.

Chapter 3

Joint action recognition and summarisation

As said in our Introduction, action recognition and video summarisation are two important multimedia tasks that are useful for applications such as video indexing and retrieval, video surveillance, human-computer interaction and home intelligence. While many approaches exist in the literature for these two tasks, to date they have always been addressed separately. Instead, in this chapter we move from the assumption that these two tasks should be tackled as a joint objective: on the one hand, action recognition can drive the selection of meaningful and informative summaries; on the other hand, recognising actions from a summary rather than the entire video can in principle reduce noise and prove more accurate. To this aim, we propose a novel approach for joint action recognition and summarisation based on the latent structural SVM framework, together with an efficient algorithm for inferring the action and the summary based on the property of submodularity. Experimental results on a challenging benchmark, MSR DailyActivity3D, show that the approach is capable of achieving remarkable action recognition accuracy while providing appealing video summaries.

3.1 Introduction and Related Work

Action recognition in video has been an important research area of multimedia signal processing for over a decade. Applications are varied and include, amongst others, video surveillance, human-computer interaction, sport analysis and home intelligence. Over the years, a variety of approaches have been proposed for recognition, including bag-of-features representations, sequential classifiers and deformable part models [Wang et al., 2009, Hoai et al., 2011, Tang et al., 2012, Felzenszwalb et al., 2010]. Such approaches have led to important results even in challenging cases with realistic scenarios and large class sets [Soomro et al., 2012]. However, action recognition in video is still intrinsically challenged by the typical, extensive variations in illumination and viewpoint. Fortunately, the recent release of inexpensive depth cameras such as the Microsoft Kinect has helped mitigate these issues by adding an extra dimension to the traditional RGB components and generally improving recognition accuracy [Wang et al., 2012b, Oreifej and Liu, 2013, Tang et al., 2014]. Another major area of application for multimedia signal processing is video summarisation, which provides concise information about a video by a few, informative frames. Video summaries can be used for indexing and retrieval or for story-boarding the videos to end users [Ma et al., 2002, Liu et al., 2010, Cong et al., 2012b]. A useful video summary typically enjoys two properties: coverage, accounting for the similarity between the summary and the rest of the video, and non-redundancy, accounting for the diversity among the frames in the summary. These two properties can be combined into a single scoring function so as to assign a unique score to each candidate summary. Unfortunately, the number of possible candidates is exponential in the number of frames and an exhaustive search for the optimal summary is impossible. However, recent work from Lin and Bilmes [Lin and Bilmes, 2011], Sipos et al. [Sipos et al., 2012a], and Tschiatschek et al. [Tschiatschek et al., 2014] has remarked that the scoring function is submodular, and has exploited the properties of submodularity to provide fast and effective summary inference. Given their intrinsic complexity, both action recognition and summarisation can benefit from structured prediction approaches. Structured prediction leverages the formalism of graphical models to provide prediction for objects such as sequences, trees and graphs [Nowozin and Lampert, 2011]. In multimedia, its typical applications range from image segmentation and action recognition to video indexing and summarisation [Hoai et al.,

2011, Yang et al., 2010, Tschiatschek et al., 2014]. An increasingly popular approach in this area is structural SVM (SSVM), an extension of the conventional support vector machine to the classification of structured objects [Tsochantaridis et al., 2005, Yu and Joachims, 2009]. SSVM has reported a strong experimental performance when compared to alternative approaches such as generative models and conditional random fields [Wang and Mori, 2011b, Nowozin and Lampert, 2011]. To date, action recognition and video summarisation have been tackled as separate objectives. Instead, we believe that they can be usefully merged into a single, joint objective, following the intuition that action recognition can drive the selection of meaningful frames for the summary and that, in turn, recognising the action from a summary rather than the entire video may reduce noise and prove more accurate. Therefore, in this chapter we present an approach based on latent structural SVM that jointly provides the action class and the summary for an action video. Our main contribution is the design of a novel scoring function which enjoys the property of submodularity and therefore supports efficient inference of both the action and the summary. We present experiments over a challenging benchmark, MSR DailyActivity3D, showing that the approach is capable of achieving remarkable action recognition accuracy while providing meaningful and visually-appealing video summaries.

3.2 Recognition and summarisation by submodular functions

Let us note the sequence of measurements from the frames as $x = \{x_1, \ldots, x_t, \ldots, x_T\}$; the sequence of binary variables indicating whether a frame belongs to the summary or not as $h = \{h_1, \ldots, h_t, \ldots, h_T\}$; and the action class as $y$. Formally, we aim to jointly infer class label $y$ and summary $h$ while keeping the summary within a given, maximum size, $B$:

$$y^*, h^* = \operatorname*{argmax}_{y, h} F(x, h, y) \quad \text{s.t.} \quad \sum_{t=1}^{T} h_t \le B \qquad (3.1)$$

Lin and Bilmes [Lin and Bilmes, 2011] have shown that desirable summaries (i.e., summaries with good coverage and limited redundancy) enjoy the property of submodularity. Submodularity can be intuitively explained as a law of diminishing returns [Lin and Bilmes, 2011]: let us assume we have a scalar function, $F$, which can measure the quality of a given

summary, together with an arbitrary summary, $A$. We now add a new element, $v$, to $A$ and compute the difference in value between $F(A \cup v)$ and $F(A)$ (the return of $v$ for $A$). Let us then consider a super-set of $A$, $B \supseteq A$, and add $v$ to it: submodularity holds if the difference in value between $F(B \cup v)$ and $F(B)$ is less than or equal to the return of $v$ for $A$. In simple terms, the larger the summary is, the smaller the benefit brought in by a new element. This property can be formally expressed as:

$$\forall A \subseteq B, \; \forall v: \quad F(A \cup v) - F(A) \ge F(B \cup v) - F(B) \qquad (3.2)$$

Note that submodular functions are not required to be monotonically non-decreasing, i.e., returns can be negative; however, (3.2) must hold. For simplicity, in the following we also assume $F$ to be non-negative. The attractive property of submodularity is that a value for $F$ with a guaranteed lower bound can be found by simply selecting the elements for the summary one by one. The approximate maximum returned by such a greedy algorithm is guaranteed to be at least $(e-1)/e$ of the actual maximum, and is often found to be better in practice [Nemhauser et al., 1978a, Lin and Bilmes, 2011]. We now restrict the choice of scoring function to the case of linear models:

$$F(x, h, y) = w^T \psi(x, h, y) \qquad (3.3)$$

with $w$ a parameter vector and $\psi(x, h, y)$ a suitable feature function of equal size. We further restrict $w$ and $\psi(x, h, y)$ to be non-negative in all their elements. Lin and Bilmes [Lin and Bilmes, 2011] have proposed the following feature function for summarisation:

$$\psi(x, h, y) = \sum_{t=1}^{T} \left( \sum_{u=1}^{T} \delta(h_t, h_u)\, \sigma(x_t, x_u) \right) \qquad (3.4)$$

where

$$\delta(h_t, h_u) = \begin{cases} \lambda_1 & \text{if } h_t = 1, h_u = 0 \\ -\lambda_2 & \text{if } h_t = 1, h_u = 1 \\ 0 & \text{otherwise,} \end{cases} \qquad (3.5)$$

with $\lambda_1, \lambda_2 > 0$, and $\sigma(x_t, x_u)$ a non-negative function measuring the similarity between frames $x_t$ and $x_u$. A frame $x_t$ is selected for the summary if its corresponding binary indicator, $h_t$, is set to one. Therefore, the $\lambda_1$ terms in (3.4) are the coverage terms, while the $\lambda_2$ terms promote non-redundancy in the summary by penalising similar frames. Following [Lin and Bilmes, 2011], it is easy to prove that function (3.4) is submodular. Functions based on between-frame similarities such as (3.4) are suitable for summarisation, but do not properly describe the class of the action, since their space is too sparse. Typical feature functions for action recognition are instead based on bagging or averages of the frame measurements. To provide joint summarisation and recognition, we thus modify (3.4) as follows:

$$\psi(x, h, y) = \sum_{t=1}^{T} \left( \delta(h_t)\, x_t + \sum_{u=1}^{T} \delta(h_t, h_u)\, \sigma(x_t, x_u) \right) \qquad (3.6)$$

with

$$\delta(h_t) = \begin{cases} \lambda_3 > 0 & \text{if } h_t = 1 \\ 0 & \text{if } h_t = 0 \end{cases} \qquad (3.7)$$

In this way, we add a new term consisting of the scaled average of all measurements $x_t$ in the summary, which promises to be informative for action recognition. We now prove that (3.6) is still submodular.

Proof: Given a current summary, $h$, the addition of any new frame to it makes the term $\sum_{t=1}^{T} \delta(h_t) x_t$ vary by the same amount irrespective of $h$. This term therefore satisfies inequality (3.2) with the equality sign. Given that convex combinations of submodular functions are also submodular, the overall submodularity of (3.6) follows.
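Below is a sketch of how the score (3.3) with the feature function (3.6) can be evaluated for a candidate summary. The scalar weights stand in for the corresponding entries of the full parameter vector w, and the RBF similarity is an illustrative choice for the sigma function, not necessarily the one used in the experiments:

    import numpy as np

    def score(x, h, lam1=1.0, lam2=0.5, lam3=1.0, w_cls=None):
        # Score of Eq. (3.3) with the feature function of Eq. (3.6).
        # x: (T, d) frame measurements; h: (T,) binary summary indicators;
        # w_cls: (d,) class model weighting the summed summary measurements.
        T = len(x)
        sigma = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # illustrative RBF
        cover, redun = 0.0, 0.0
        for t in range(T):
            if h[t] == 1:
                for u in range(T):
                    if h[u] == 0:
                        cover += sigma(x[t], x[u])   # lambda_1 terms (coverage)
                    else:
                        redun += sigma(x[t], x[u])   # lambda_2 terms (redundancy)
        summary_sum = lam3 * x[np.asarray(h) == 1].sum(axis=0)  # delta(h_t) x_t
        cls = summary_sum @ w_cls if w_cls is not None else summary_sum.sum()
        return cls + lam1 * cover - lam2 * redun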

The main benefit of submodular scoring functions is the performance guarantee of greedy inference algorithms. Algorithm 1 shows the greedy algorithm that we use to jointly infer the best action class and the best summary, choosing one frame for the summary at a time.

Algorithm 1: Greedy algorithm for inferring class y* and summary h* given scoring function F(x, h, y).

    max = -inf, argmax = 0
    for y = 1 ... |Y| do
        h <- {}, X <- x
        while X != {} and |h| < B do
            k <- argmax_{v in X} [F(x, h + v, y) - F(x, h, y)]
            h <- h + {k}
            X <- X \ {k}
        end while
        if F(x, h, y) > max then
            max = F(x, h, y), argmax = y
        end if
    end for

3.3 Learning: latent variables

As the framework for learning parameter vector $w$, we adopt the popular latent structural SVM [Yu and Joachims, 2009], which has proved effective in a variety of computer vision applications [Wang and Mori, 2009, Yang et al., 2010, Hoai et al., 2011]. In the training set, the action classes are supervised, but the summaries are completely unsupervised. Given a training set with $N$ videos, $(x^i, y^i), i = 1 \ldots N$, the learning objective of latent structural SVM:

$$\begin{aligned} w^* = \operatorname*{argmin}_{w, \xi_{1:N}} \; & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \\ & w^T \psi(x^i, h^i, y^i) - w^T \psi(x^i, h, y) \ge \Delta(y^i, y) - \xi_i \quad \forall \{y, h\} \ne \{y^i, h^i\}, \\ & w_d \ge 0 \quad d = 1 \ldots D \end{aligned} \qquad (3.8)$$

$$h^i = \operatorname*{argmax}_{h} w^T \psi(x^i, h, y^i) \qquad (3.9)$$

is an iterative objective that alternates between the constrained optimisation in (3.8), performed using the current values of the latent variables $h^i$, and a new assignment of the $h^i$ (3.9) from the updated model $w$. The loss function that we choose to minimise, $\Delta(y^i, y)$, only accounts for the loss from action misclassification. As such, the selection of frames for the summary, $h$, on the training samples is solely driven by the action recognition accuracy. The optimisation in (3.8) is a standard optimisation problem that can be solved with any common solver. However, since the number of constraints in (3.8) is exponential, we adopt the relaxation of [Tsochantaridis et al., 2005], which can find almost-correct solutions using only a polynomial-size working set of constraints. The working set is built by searching for each sample's most violated constraint at each iteration of the solver:

$$\xi_i = \max_{y, h} \left( -w^T \psi(x^i, h^i, y^i) + w^T \psi(x^i, h, y) + \Delta(y^i, y) \right) \qquad (3.10)$$

which equates to finding the labeling with the highest sum of score and loss:

$$\bar{y}^i, \bar{h}^i = \operatorname*{argmax}_{y, h} \left( w^T \psi(x^i, h, y) + \Delta(y^i, y) \right) \qquad (3.11)$$

This problem is commonly referred to as loss-augmented inference due to its similarity to the standard inference, and can again be solved by Algorithm 1, simply with the addition of the loss $\Delta(y^i, y)$ to the score.
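A compact Python rendering of Algorithm 1 is sketched below, with an optional loss term for the loss-augmented variant of (3.11); F is any scoring function, such as the score sketch given after (3.7):

    import numpy as np

    def greedy_infer(x, F, n_classes, budget, loss=None):
        # Algorithm 1: jointly infer the action class and the summary.
        # x: (T, d) frames; F(x, h, y) -> score; loss(y) -> Delta(y_i, y)
        # enables the loss-augmented variant used during training.
        T = len(x)
        best_score, best_y, best_h = -np.inf, None, None
        for y in range(n_classes):
            h = np.zeros(T, dtype=int)
            candidates = set(range(T))
            while candidates and h.sum() < budget:
                gains = {}
                for v in candidates:            # marginal gain of frame v
                    h[v] = 1
                    gains[v] = F(x, h, y)
                    h[v] = 0
                k = max(gains, key=gains.get)
                h[k] = 1
                candidates.remove(k)
            score = F(x, h, y) + (loss(y) if loss else 0.0)
            if score > best_score:
                best_score, best_y, best_h = score, y, h.copy()
        return best_y, best_h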

3.4 Experimental Results

The proposed method is evaluated on the MSR DailyActivity3D dataset [Wang et al., 2012b], released by Microsoft Research and captured using the Kinect RGBD camera. It depicts 16 common living-room activities: drinking, eating, reading, using a cell phone, writing, using a computer/laptop, vacuuming, cheering up, sitting still, tossing crumpled paper, playing a game, lying on the sofa, walking, playing the guitar, standing up, and sitting down. The total number of videos is 320, staged by 10 actors and performed in two different poses, one standing close to the couch and the other sitting on it. For evaluation, a cross-subject protocol is common, with subjects 1-5 used for training and subjects 6-10 for test. To pursue a more general approach, we have decided not to use the information about the actor's skeleton, limiting feature extraction to the depth and RGB streams. For each video, we have extracted local descriptors (HOG/HOF) over a regular spatio-temporal grid using the code from [Wang et al., 2009]. As time scale we have used $\tau = 2$, resulting in 162-D descriptors. For the encoding, we have first run k-means with $k = 32$ clusters over the entire set of descriptors of the training set. Then, we have encoded all the descriptors of each frame using VLAD [Jégou et al., 2010], which embeds the distance between the pooled local features and the cluster centres. The resulting encoding is a $32 \times 162 = 5{,}184$-D vector and is our measurement for the frame. As software for the latent structural SVM model, we have used Joachims' solver [Tsochantaridis et al., 2005] with Vedaldi's MATLAB wrapper [Vedaldi, 2011]. As parameters, we have used summary size $B = 10$ and regularisation coefficient $C = 100$, and performed a grid search over the training set for weights $\lambda_1, \lambda_2, \lambda_3$ [Tsochantaridis et al., 2005, Vedaldi, 2011]. Table 3.2 shows the sensitivity analysis of the accuracy with different weights in (3.6) and with depth and RGB data. Table 3.3 shows the accuracy achieved by latent SSVM on depth data.

For performance evaluation, we note that our approach is the only approach to date to provide action recognition and video summarisation as an integrated task. To evaluate the action recognition component, we compare the test-set recognition accuracy using depth videos with a reference system using libsvm as the classifier [Chang and Lin, 2011] and the sum of all VLAD descriptors of the video as the measurement. In addition, we compare the action recognition accuracy with a system from the literature that uses dynamic time warping; to the best of our knowledge, this is the only approach which does not use the actor's skeletal information in any form (locations or angles). Table 3.1 shows that the accuracy achieved by the proposed method (60.6%) is higher than that of the dynamic time warping approach (54.0%) and much higher than that of the reference system (34.4%). These accuracies can be regarded as satisfactory since they are much above chance accuracy, i.e. 1/16 = 6.25% for this dataset. The accuracy using depth videos is also remarkably higher than that using RGB videos (42.5%), showing that depth is a more informative cue for action recognition.

Table 3.1: Comparison of action recognition accuracy on the MSR DailyActivity3D dataset.

    Method                                                                 | Accuracy
    Dynamic temporal warping [Wang et al., 2012b, Müller and Röder, 2006]  | 54.0%
    libsvm [Chang and Lin, 2011]                                           | 34.4%
    Proposed method                                                        | 60.6%
    Proposed method (RGB videos)                                           | 42.5%

Table 3.2: Sensitivity analysis of the accuracy with different weights in (3.6) and with depth and RGB data.

Table 3.3: The accuracy achieved by latent SSVM on depth data over seven iterations of the latent-variable assignment.

Figure 3.1: Summary examples (displayed as RGB frames) for the action walk: (a) proposed method; (b) SAD.

For the evaluation of the summarisation component, since a ground truth is not available, we resort to qualitative comparisons. In particular, we compare the summaries obtained with the proposed method with those produced by a popular summarisation approach, the sum of absolute differences (SAD), which has been widely used in object recognition and video compression [Xiong et al., 2006]. SAD is a low-level approach that selects for the summary the frames with the largest absolute difference from the previous frame, up to the given budget. The examples displayed in Figure 3.1 show that the summaries provided by the proposed approach appear more meaningful, faithful and informative about the content of the video. Example summaries for all 16 classes of MSR DailyActivity3D are shown in Figure 3.2.
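For reference, a minimal sketch of the SAD baseline just described; the frame tensor layout and the budget parameter are assumptions of this illustration:

    import numpy as np

    def sad_summary(frames, budget):
        # Select the `budget` frames with the largest sum of absolute
        # differences from their previous frame (the SAD baseline).
        # frames: (T, H, W) or (T, H, W, C) array of decoded frames.
        diffs = np.abs(frames[1:].astype(np.int32)
                       - frames[:-1].astype(np.int32))
        diffs = diffs.sum(axis=tuple(range(1, frames.ndim)))
        # indices of the `budget` largest frame-to-frame differences
        keep = np.sort(np.argsort(diffs)[::-1][:budget] + 1)
        return keep   # frame indices selected for the summary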

Figure 3.2: Each row contains the summary of one video for one of the 16 activities: drinking, eating, reading, using a cell phone, writing, using a computer/laptop, vacuuming, cheering up, sitting still, tossing crumpled paper, playing a game, lying on the sofa, walking, playing the guitar, standing up, and sitting down.

Chapter 4

V-JAUNE: A Framework for Joint Action Recognition and Video Summarisation

In this chapter we present a new measure for the quantitative evaluation of video summaries, nicknamed V-JAUNE, and we extend the experiments carried out in the previous chapter by: 1) another, larger and more probing action benchmark, ACE; 2) training with different extents of summary supervision; 3) quantitative evaluation of the quality of the predicted video summaries; 4) quantification of multiple annotators' disagreement; and 5) an analysis of sensitivity to the (hyper-)parameters. Also, we present a new, key proof of submodularity for the loss-augmented inference of latent structural SVM.

4.1 Introduction

The amount of publicly-available video footage is growing at unprecedented rates thanks to the commoditisation of video acquisition and the role played by social media. However, video data are typically large in size, whereas the events of interest may be concentrated only in small segments. Video summarisation has therefore become imperative to concisely capture the contents of videos. The main applications of video summaries are indexing, search and retrieval from video collections and the storyboarding of videos to end users [Ma et al., 2002, Liu et al., 2010, Cong et al., 2012a, Guan et al., 2014]. The basic requirements of

an effective video summary are well understood and reduce to appropriate coverage of the original footage together with limited redundancy among the frames selected for the summary. At the same time, the huge number of videos calls for the automated labelling of their main theme or activity. For instance, in social media it can be helpful to know whether a video depicts activities such as food preparation or conversation in a living room, for categorisation and content customisation. In addition to social media, automated activity recognition is an important component of many other applications such as video surveillance, human-computer interaction and home intelligence [Wang and Mori, 2011a, Wang et al., 2011, Wang and Schmid, 2013, Yang et al., 2016]. Yet, it remains a challenging task to date due to the inherent challenges of activity videos, which include subject dependence, occlusions, and viewpoint, illumination and scale variations. Given the above, a question that spontaneously arises is: can video summarisation and action recognition benefit from being performed jointly? This question can be rephrased as: can action recognition prove more accurate if performed on a selection of the video's frames rather than the entire set? And, simultaneously, can the selected frames enjoy the properties required of an effective summary, i.e. good coverage and limited redundancy? Assuming that this question can be answered in the affirmative, in this chapter we set out to investigate the performance of joint action recognition and video summarisation. Inferring an optimal summary is a combinatorially exponential problem and, as such, intractable. However, it has been proven that most functions used to evaluate the quality of a summary are monotonic submodular [Lin and Bilmes, 2011, Sipos et al., 2012b, Tschiatschek et al., 2014]. The main advantage of these functions is that inexpensive, greedy algorithms can be used to perform approximate inference with performance guarantees [Nemhauser et al., 1978b]. In this chapter, we extend the existing submodular functions for summarisation to functions for joint recognition and summarisation that still enjoy submodularity. As learning framework, we adopt the latent structural SVM [Yu and Joachims, 2009]. This framework joins the benefits of structured prediction (i.e., the ability to predict sequences, trees and other graphs) with maximum-margin learning, which has gained a reputation for accurate prediction in a number of fields [Zhu et al., 2010, Wang and Mori, 2011a, Duan et al., 2012, Kim et al., 2015, Sachan et al., 2015]. In addition, this framework allows us to exploit different degrees of supervision for the summary variables, from

completely unsupervised to fully supervised, which suit different scenarios of application. The main contributions of this chapter are:

- a submodular inference approach for the computation of latent structural SVM (Section 4.3.2);
- a new measure for the quantitative evaluation of video summaries, nicknamed V-JAUNE (Section 4.4);
- an extensive experimental evaluation over action datasets with different extents of summary supervision (Section 4.5).

The rest of this chapter is organised as follows: in Section 4.2 we review the state of the art on relevant topics. In Section 4.3 we describe the model and the learning framework. In Section 4.4 we introduce the proposed summary evaluation measure. Experiments and results are presented in Section 4.5.

4.2 Related Work

This chapter relates to structured prediction learning and its applications to action recognition and video summarisation. Since the state of the art is vast, we restrict the review of related work to immediately relevant topics. Automated video summarisation is a long-standing research area in multimedia [Maybury et al., 1997]. Summarisation methods can be mainly categorised into: a) clustering approaches; and b) frame-differences approaches. The clustering approaches are aggregative methods that attempt grouping similar frames and select representatives from each group [De Avila et al., 2011, Ghosh et al., 2012, Jaffe et al., 2006, Mundur et al., 2006]. Frames can be clustered using low-level features (e.g., [De Avila et al., 2011]) or even detected objects [Ghosh et al., 2012], and structure can also be usefully enforced during clustering [Chen et al., 2009, Gygli et al., 2015]. Frame-differences approaches, instead, scan the video's frames in sequential order to detect shot boundaries and key frames [Xiong et al., 2006, Cong et al., 2012a, Yang et al., 2013, Lu et al., 2016]. Submodular functions have recently played a major role in machine learning thanks to their efficient maximisation and

minimisation properties. Submodular functions have been identified in tasks as diverse as social network analysis, graph cutting, machine translation and summarisation [Bach, 2011]. For instance, Lin and Bilmes [Lin and Bilmes, 2011] and Sipos et al. [Sipos et al., 2012b] have presented submodular approaches to document summarisation. Tschiatschek et al. [Tschiatschek et al., 2014] have proposed a similar approach for the summarisation of a collection of images. The most attractive property of submodular functions that are also monotonic (a frequent case) is the guaranteed performance of greedy maximisation algorithms. This is not only useful for the inference of unseen examples, but also for inference during training. Action recognition has been one of the most active areas of computer vision for over a decade [Negin and Bremond, 2016]. An obvious categorisation of the approaches is elusive, but in the context of this chapter we can categorise them as a) non-structural versus b) structural. The approaches in the first category extract a single representation from the whole video and apply a classifier to predict its action label [Laptev et al., 2008b, Wang et al., 2011, Wang and Schmid, 2013, Karpathy et al., 2014]. The approaches in the structural category leverage the relationships between the video's frames, often in terms of graphical models, and infer the action class from the graph [Tang et al., 2012, Izadinia and Shah, 2012, Donahue et al., 2015]. These approaches lend themselves more naturally to extensions to summarisation, key frame detection and pose detection (e.g., [Brendel and Todorovic, 2010]). Various works have argued that actions can be recognised more accurately using only a selection of the video's key frames [Schindler and Van Gool, 2008, Hu and Zheng, 2011, Raptis and Sigal, 2013] and our work follows along the same lines. Structural SVM is an extension of the conventional support vector machine to the classification of structured outputs, i.e., sets of labels organised into sequences, trees and graphs [Tsochantaridis et al., 2005]. It has been applied successfully in areas as diverse as handwritten digit recognition, object recognition, action recognition and information retrieval [Altun et al., 2005, Wang and Mori, 2011a, Wu et al., 2013, Kim et al., 2015]. Yu and Joachims [Yu and Joachims, 2009] have extended structural SVM to training samples with latent variables and used a concave-convex procedure to ensure convergence to a local optimum. Latent structural SVM, too, has proven useful in many multimedia applications, especially those where ground-truth annotation is expensive or impossible, such as complex event detection [Tang et al., 2012] or natural language comprehension [Sachan et al., 2015].

In this chapter, we adopt a latent and semi-latent structural SVM approach to jointly infer the action and the summary from a video sequence, dealing with the summary as a set of latent variables. For the quantitative evaluation of video summarisation, many works have adopted the F1 score for its ability to balance the precision and recall requirements [Li et al., 2011, Ejaz et al., 2013, Gygli et al., 2015]. Others have used a recall-based measure called Comparison of User Summaries (CUS) where the predicted summary is compared against ground-truth summaries from multiple annotators using a Manhattan distance [De Avila et al., 2011, Almeida et al., 2012, Ejaz et al., 2012]. Following the widespread use of summarisation measures such as ROUGE [Lin, 2004] in natural language processing, Tschiatschek et al. [Tschiatschek et al., 2014] have introduced a recall-based measure called V-ROUGE and applied it to the evaluation of summaries of image collections. While this measure could also be used to evaluate the quality of a video summary, it does not take into account the frames' sequentiality. In other words, a video summary ought to consider the order in which the frames appear and consist of a sequence, rather than a set, of frames. For this reason, in this chapter we introduce a novel measure, V-JAUNE, that addresses video summaries as frame sequences.

Figure 4.1: The graphical model for joint action classification and summarisation of a video: y: action class label; h: frames selected for the summary; x: measurements from the video frames.

4.3 Learning Framework

The framework proposed for joint action recognition and summarisation is based on graphical models and latent structural SVM. The model is described hereafter, while latent structural SVM is presented in Section 4.3.2.

4.3.1 Model Formulation

The goal of our work is to provide joint classification and summarisation for a video representing an action. To this aim, let us note a sequence of multivariate measurements, one per frame, as x = \{x_1, \ldots, x_i, \ldots, x_T\}; a sequence of binary variables indicating whether a frame belongs to the summary or not as h = \{h_1, \ldots, h_i, \ldots, h_T\}; and the action class as y \in \{1, \ldots, M\}. Figure 4.1 shows the variables organised in a graphical model. Formally, we aim to jointly infer class label y and summary h while keeping the summary within a given, maximum size, B (the "budget"):

    y^*, h^* = \argmax_{y,h} F(x, y, h)  \quad \text{s.t.} \; \sum_{i=1}^{T} h_i \leq B    (4.1)

Lin and Bilmes [Lin and Bilmes, 2011] have shown that desirable summaries (i.e., summaries with good coverage of the entire document and limited redundancy) enjoy the property of submodularity. Submodularity can be intuitively explained as a law of diminishing returns: let us assume to have a scalar function, F, which can measure the quality of a given summary, together with an arbitrary summary, S. We now add a new element, v, to S and compute the difference in value between F(S \cup v) and F(S) (the "return" of v for S). Let us then consider a superset of S, T \supseteq S, and add v to it: submodularity holds if the return of v for T is less than or equal to the return of v for S. In simple terms, the larger the summary, the less is the benefit brought in by a new element. This property can be formally expressed as:

    \forall S \subseteq T, v: \; F(S \cup v) - F(S) \geq F(T \cup v) - F(T)    (4.2)

Note that submodular functions are not required to be monotonically non-decreasing, i.e., returns can be negative; however, (4.2) must hold. For simplicity, in the following we

assume that F is monotonically non-decreasing for reasonably small sizes of the summary. The most remarkable property of monotonic submodular functions is that a value for F with a guaranteed lower bound can be found by simply selecting the elements for the summary one by one. The approximate maximum returned by such a greedy algorithm is guaranteed to be at least (1 - 1/e) of the actual maximum [Nemhauser et al., 1978b], and it is found to be often better in practice. De facto, greedy inference algorithms perform well with submodular functions [Lin and Bilmes, 2011]. In addition, the search for the B highest-scoring elements of a set enjoys minimal, linear complexity, O(T), in the size of the set, which is the lowest possible computational complexity for the inference. We now restrict the choice of the scoring function to the case of linear models:

    F(x, y, h) = w^\top \psi(x, y, h)    (4.3)

with w a parameter vector of non-negative elements and \psi(x, y, h) a suitable feature function of equal size. Lin and Bilmes in [Lin and Bilmes, 2011] have proposed the following feature function for summarisation:

    \psi(x, y, h) = \sum_{i,j=1, j \neq i}^{T} \phi(x_i, x_j, y, h_i, h_j)    (4.4)

where

    \phi(x_i, x_j, y, h_i, h_j) = \lambda(h_i, h_j) \, s(x_i, x_j)    (y-aligned)

    \lambda(h_i, h_j) = \begin{cases} \lambda_1, & h_i = 1, h_j = 0 \quad \text{(coverage)} \\ \lambda_2, & h_i = 1, h_j = 1 \quad \text{(non-redundancy)} \\ 0, & h_i = 0, h_j = 0 \end{cases} \qquad \lambda_1 \geq 0, \; \lambda_2 \leq 0

with s(x_i, x_j) a similarity function between frames x_i and x_j. If the similarity function is D-dimensional, function \phi(x_i, x_j, y, h_i, h_j) is MD-dimensional and is obtained by aligning

the similarity function at index (y-1)D and padding all the remaining elements with zeros. Frame x_i, i = 1 \ldots T, is included in the summary if its corresponding binary indicator, h_i, is set to one. Therefore, the \lambda_1 terms in (4.4) are the coverage terms, while the \lambda_2 terms promote non-redundancy in the summary by penalising similar frames. Following [Lin and Bilmes, 2011], it is easy to prove that function (4.4) is submodular. Functions based on between-frame similarities such as (4.4) are suitable for summarisation, but do not properly describe the class of the action since their space is very sparse. Typical feature functions for action recognition are instead based on bagging or averages of the frame measurements. To provide joint summarisation and recognition, we propose to augment (4.4) as follows:

    \psi(x, y, h) = \sum_{i,j=1, j \neq i}^{T} \phi(x_i, x_j, y, h_i, h_j) + \underbrace{\lambda_3 \sum_{i=1}^{T} \mathbb{I}[y, h_i = 1] \, x_i}_{\text{action}}    (4.5)

In this way, a new term is added containing the weighted sum of all measurements x_i in the summary. Such a term is equivalent to a pooled descriptor and promises to be informative for action recognition. Its dimensionality is assumed to be equal to that of the similarity function, D, so that the terms can be added up and y-aligned. We now prove that (4.5) is still submodular:

Proposition 1: Function \psi(x, y, h) in (4.5) is submodular.

Proof: Given a current summary, h, adding any extra frame to it makes term \lambda_3 \sum_{i=1}^{T} \mathbb{I}[y, h_i = 1] x_i vary by the same amount irrespectively of h. This term thus satisfies inequality (4.2) with the equal sign and is therefore submodular. In turn, function \psi(x, y, h) is a positive combination of two submodular terms and is therefore submodular thanks to well-known properties of submodularity [Bach, 2011].

Algorithm 2 shows the greedy algorithm that we use to jointly infer the best action class and the best summary, choosing one frame for the summary at a time. Given the recent ascent of deep neural networks in classification performance ([Karpathy et al., 2014, Donahue et al., 2015] and many others), it is important to highlight the

Algorithm 2: Greedy algorithm for inferring class y and summary h given scoring function F(x, y, h).

    max = -\infty, argmax = 0
    for y = 1 \ldots M do
        h^* \leftarrow \emptyset
        X \leftarrow x
        while X \neq \emptyset and |h^*| < B do
            k \leftarrow \argmax_{v \in X} \; F(x, y, h^* \cup v) - F(x, y, h^*)
            h^* \leftarrow h^* \cup \{k\}
            X \leftarrow X \setminus \{k\}
        end while
        if F(x, y, h^*) > max then
            max = F(x, y, h^*)
            argmax = y
        end if
    end for

advantages of using a graphical model such as (4.5) for the joint prediction of an action and its summary:

- conventional deep neural networks such as convolutional or recurrent networks [Goodfellow et al., 2016] could straightforwardly be used to infer either the action or the summary, but their joint inference would require substantial modifications;
- the nature of the score function in a graphical model allows enforcing meaningful constraints for the score (e.g., coverage and non-redundancy of the summaries) and enjoys the properties of submodular inference;
- the variables for the summary can be trained in an unsupervised way alongside the supervised actions. This is a major advantage in terms of annotation and is discussed in detail in the following section.

4.3.2 Latent Structural SVM for Unsupervised and Semi-Supervised Learning

Latent structural SVM is an extension of the support vector machine suitable for the prediction of complex outputs such as trees and graphs in the presence of latent nodes [Yu and Joachims, 2009]. It has been applied successfully in a number of fields such as computer vision, natural language processing and bioinformatics [Zhu et al., 2010, Wang and Mori, 2011a, Duan et al., 2012, Kim et al., 2015, Sachan et al., 2015]. Its main strength is its abil-

ity to combine the typical accuracy of large-margin training with the flexibility of arbitrary output structures and unobserved variables. It is therefore a natural training framework for the score function in (4.3). Let us assume that we are given a training set with N videos, (x^n, y^n), n = 1 \ldots N, where the action classes are supervised, but the summaries are completely unsupervised. Please note that in this section we use a superscript index to indicate the video and, where needed, a subscript to indicate the frame. The learning objective of latent structural SVM can be expressed as:

    w^* = \argmin_{w \geq 0, \, \xi^{1:N}} \; \|w\|^2 + C \sum_{n=1}^{N} \xi^n
    \text{s.t.} \quad w^\top \psi(x^n, y^n, h^n) \geq w^\top \psi(x^n, y, h) + \Delta(y^n, y) - \xi^n, \quad \forall \{y, h\} \neq \{y^n, h^n\}    (4.6)

    h^n = \argmax_{h} \; w^\top \psi(x^n, h, y^n)    (4.7)

Like in a conventional SVM, the objective function in (4.6) is a trade-off between two terms: an upper bound over the classification error on the training set, \sum_{n=1}^{N} \xi^n (also known as the hinge loss), and a regulariser, \|w\|^2, that encourages a large margin between the classes. The constraints in the minimisation impose that, for every sample, the score assigned to the correct labeling, y^n, h^n, is higher than the score given to any other labeling, y, h \neq y^n, h^n, by a margin equal to the loss function, \Delta(y^n, y) (margin-rescaled SVM). However, given that the h variables are unsupervised/unknown, an estimate has to be inferred in (4.7) using the current model. Latent structural SVM is therefore an iterative algorithm that alternates between the constrained optimisation in (4.6), performed using the current values for the latent variables, h^n, and a new assignment for the h^n in (4.7), performed using the current model, w. This algorithm is guaranteed to converge to a local minimum of the objective function [Yu and Joachims, 2009]. Note that the loss function that we minimise, \Delta(y^n, y), only accounts for the loss from action misclassifications. As such, the selection of the frames for the summary, h, is driven by the requirement of maximising the action recognition accuracy.
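To make the alternation of (4.6)-(4.7) concrete, the following Python sketch outlines the training loop. It is a minimal illustration under stated assumptions, not the thesis implementation: solve_ssvm_qp (the quadratic program of (4.6)) and infer_summary (the inference of (4.7)) are hypothetical placeholder functions.

    # Minimal sketch of latent structural SVM training (eqs. 4.6-4.7).
    # solve_ssvm_qp and infer_summary are hypothetical placeholders.
    def train_latent_ssvm(videos, actions, budget, n_rounds=10):
        # Arbitrary initialisation: uniformly spaced frames as starting summaries.
        summaries = [uniform_summary(len(x), budget) for x in videos]
        w = None
        for _ in range(n_rounds):
            # Step 1 (eq. 4.6): solve the structural SVM QP with the latent
            # summaries held fixed at their current values.
            w = solve_ssvm_qp(videos, actions, summaries)
            # Step 2 (eq. 4.7): re-infer each latent summary with the new model.
            summaries = [infer_summary(w, x, y, budget)
                         for x, y in zip(videos, actions)]
        return w

    def uniform_summary(n_frames, budget):
        # Select 'budget' uniformly spaced frame indices.
        step = max(n_frames // budget, 1)
        return list(range(0, n_frames, step))[:budget]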

The initialisation of the training algorithm requires an arbitrary, starting assignment for the h variables. A uniformly-spaced selection of the frames is a reasonable starting summary and we thus use it for initialisation (i.e., h_i = 1 if i is a multiple of \lfloor T/B \rfloor, where T is the video's length and B is the budget). In case some of the summaries can be ground-truth annotated (semi-supervised training), algorithm (4.6)-(4.7) can be used substantially unchanged by just skipping assignment (4.7) for the supervised sequences. The optimisation in (4.6) is a standard quadratic program that can be addressed by any common solver. However, since the number of constraints in (4.6) is exponential, we adopt the relaxation of [Tsochantaridis et al., 2005] which can find \epsilon-correct solutions using only a polynomial-size working set of constraints. The working set is built by searching the sample's most violated constraint at each iteration of the solver:

    \xi^n = \max_{y,h} \left( -w^\top \psi(x^n, h^n, y^n) + w^\top \psi(x^n, h, y) + \Delta(y^n, y) \right)    (4.8)

which equates to finding the labeling with the highest sum of score and loss:

    \bar{y}^n, \bar{h}^n = \argmax_{y,h} \left( w^\top \psi(x^n, h, y) + \Delta(y^n, y) \right)    (4.9)

This problem is commonly referred to as loss-augmented inference due to its similarity to the standard inference and can be, again, solved by Algorithm 2 simply with the addition of loss \Delta(y^n, y) to the score. In the following, we prove that the argument of (4.9) is submodular:

Proposition 2: Function w^\top \psi(x^n, h, y) + \Delta(y^n, y) is submodular.

Proof: In Proposition 1, we have already proved that function \psi(x, y, h) is submodular. Score w^\top \psi(x, y, h), w \geq 0, is a positive combination of the dimensions of \psi(x, y, h) and is therefore submodular for well-known properties of submodularity [Bach, 2011]. Given that \Delta(y^n, y) is independent of h, its contribution to the return is null and the function is therefore submodular overall.
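As an illustration of how the same greedy procedure serves both the standard inference of Algorithm 2 and the loss-augmented inference of (4.9), consider the minimal Python sketch below. Here score_fn stands for F(x, y, h) in (4.3) and loss_fn for \Delta(y^n, y); both are assumed callables introduced only for illustration, not part of the thesis code.

    # Sketch of Algorithm 2 with optional loss augmentation (eq. 4.9).
    # score_fn(x, y, h) plays the role of F(x, y, h); a non-trivial
    # loss_fn(y) adds the margin-rescaling term, which is independent of h
    # (Proposition 2) and thus only affects the choice of the class y.
    def greedy_joint_inference(x, n_classes, budget, score_fn,
                               loss_fn=lambda y: 0.0):
        best_val, best_y, best_h = float('-inf'), None, None
        for y in range(1, n_classes + 1):
            h, candidates = [], set(range(len(x)))
            while candidates and len(h) < budget:
                # Greedily add the frame with the largest marginal return.
                k = max(candidates,
                        key=lambda v: score_fn(x, y, h + [v]) - score_fn(x, y, h))
                h.append(k)
                candidates.remove(k)
            val = score_fn(x, y, h) + loss_fn(y)
            if val > best_val:
                best_val, best_y, best_h = val, y, h
        return best_y, best_h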

4.4 V-JAUNE: Video Summary Evaluation

Video summarisation still lacks a generally-agreed measure for the quantitative assessment of a summary's quality. Unlike measures for text summaries like the popular ROUGE [Lin, 2004], a measure for video summaries should reflect not only the summary's content, but also its order, since summaries could otherwise prove ambiguous. For example, actions "sitting down" and "standing up" could generate similar sets of summary frames, but their order must be different to correctly convey the action. For this reason, in this chapter we propose a novel performance measure, nicknamed V-JAUNE following the conventional use of color names and referring by "V" to visual data as in [Tschiatschek et al., 2014]. To present the measure, in this section we utilise a compact notation for a summary, h = \{h_1, \ldots, h_i, \ldots, h_B\}, consisting of the frame indices of its B frames. Given a ground-truth summary, h, and a predicted summary, \bar{h}, the measure is phrased as a loss function and defined as follows:

    \Delta(h, \bar{h}) = \sum_{i=1}^{B} \delta(h_i, \bar{h}_i)
    \delta(h_i, \bar{h}_i) = \min_j \{ \| x_{h_j} - x_{\bar{h}_i} \|_2 \}, \quad \text{s.t.} \; i - l \leq j \leq i + l    (4.10)

With this definition, loss function \Delta(h, \bar{h}) reflects the sequential order of the frames in their respective summaries, while allowing for a \pm l tolerance in the matching of the corresponding positions. In the field of summarisation, the annotation of the ground truth is highly subjective and it is therefore desirable to extend the loss to the multi-annotator case. By calling M the number of annotators, the multi-annotator loss is defined as:

    \Delta(h^{1:M}, \bar{h}) = \sum_{m=1}^{M} \Delta(h^m, \bar{h})    (4.11)

A loss function such as (4.10)-(4.11) visibly depends on the scale of the x measurements and it is therefore denormalised. A possible way to normalise it would be to estimate the scale of the measurements and divide the loss by it. However, a preferable approach is to

normalise it by the disagreement between the annotators' summaries. In this way, the loss simultaneously becomes normalised by both the measurements' scale and the extent of disagreement between the ground-truth annotators. Therefore, we quantify the disagreement as:

    D = \frac{2}{M(M-1)} \sum_{p,q} \Delta(h^p, h^q), \quad p = 1 \ldots M, \; q = p+1 \ldots M    (4.12)

and normalise the loss as:

    \bar{\Delta}(h^{1:M}, \bar{h}) = \Delta(h^{1:M}, \bar{h}) / D    (4.13)

Figure 4.2 compares the values of the denormalised and normalised loss functions for three summary annotations of 95 videos from the ACE action dataset (Section 4.5.1). It is evident that the normalised loss values are much more uniform. For further detail, Figure 4.3 plots the disagreements between pairs of annotators.
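For concreteness, a minimal NumPy sketch of (4.10)-(4.13) could read as follows. Here x is assumed to be an array of per-frame measurement vectors and summaries are lists of frame indices; this is an illustrative reading of the definitions, not the evaluation code used in the experiments.

    import numpy as np
    from itertools import combinations

    def vjaune(x, h_true, h_pred, l=1):
        # Eq. 4.10: sequential, position-tolerant matching of summary frames.
        B = len(h_pred)
        total = 0.0
        for i in range(B):
            lo, hi = max(0, i - l), min(B - 1, i + l)
            total += min(np.linalg.norm(x[h_true[j]] - x[h_pred[i]])
                         for j in range(lo, hi + 1))
        return total

    def vjaune_normalised(x, h_annotators, h_pred, l=1):
        # Eq. 4.11: sum the loss over all annotators' ground truths.
        loss = sum(vjaune(x, h_m, h_pred, l) for h_m in h_annotators)
        # Eq. 4.12: average pairwise disagreement between annotators.
        M = len(h_annotators)
        D = (2.0 / (M * (M - 1))) * sum(vjaune(x, p, q, l)
                                        for p, q in combinations(h_annotators, 2))
        return loss / D  # Eq. 4.13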

Figure 4.2: V-JAUNE values for the ACE test set (95 videos) with multiple annotators: blue bars: denormalised values; red bars: normalised values.

Figure 4.3: V-JAUNE loss for different annotators over the ACE test set (95 videos), using the first annotator as ground truth and the second as prediction. Please note that the changes in value are mainly due to the changes in magnitude of the VLAD descriptors. However, the agreement also varies with the video.

4.5 Experimental results

To evaluate the effectiveness of the proposed method, we have performed experiments on two challenging action datasets of depth videos: the Actions for Cooking Eggs (ACE) dataset [Shimada et al., 2013] and the MSR DailyActivity3D dataset [Wang et al., 2012b]. The datasets and experimental results are presented in detail in Sections 4.5.1 and 4.5.2. The evaluation addresses both the accuracy of the action recognition and the qualitative and quantitative quality of the produced summaries. For both datasets, we have used comparable implementation settings: for each video, we have extracted dense local descriptors (HOG/HOF) over a regular spatio-temporal grid using the code from [Wang et al., 2009]. As time scale, we have used \tau = 2, which has resulted in 162-D individual descriptors. We have chosen the HOG/HOF features as well-proven, general-purpose features for action recognition, suitable for the scope of this thesis. However, it is likely that the experimental accuracy could easily be improved by using alternative features. As feature encoding, we have used VLAD [Jégou et al., 2010], which embeds the distance between the pooled local features and the clusters' centres. For the encoding, we have first run a k-means clustering over all the descriptors in the training set, empirically choosing k = 64 for the ACE dataset (more complex and varied) and k = 32 for MSR DailyActivity3D. Then, for each frame, we have used the found clusters to encode the frame's descriptors in an encoding of 162 x k dimensions, to be used as the measurement vector for the frame. As software for the latent structural SVM model, we have used Joachims' solver [Tsochantaridis et al., 2005] with Vedaldi's MATLAB wrapper [Vedaldi, 2011]. As parameters, we have used summary size B = 10, regularisation coefficient C = 100, and performed a grid search over the training set for weights \lambda_1, \lambda_2, \lambda_3 in the range [-1, 1] in 0.5 steps. The summary size was chosen arbitrarily as a reasonable number of frames to display at once to a user, while the values for the number of clusters k and regularisation coefficient C were chosen over an initial evaluation phase using a small subset of the training set as a validation set.
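To illustrate the encoding step, here is a simplified per-frame VLAD sketch built on scikit-learn's KMeans. It follows the spirit of [Jégou et al., 2010] but is only an assumed rendition of the pipeline described above, with the HOG/HOF descriptor extraction done elsewhere.

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_codebook(train_descriptors, k=64):
        # Fit the codebook once on all training descriptors (k = 64 for ACE).
        return KMeans(n_clusters=k, n_init=10).fit(train_descriptors)

    def vlad_encode(frame_descriptors, codebook):
        # Accumulate residuals between descriptors and their nearest centres,
        # giving a (k * 162)-dimensional measurement vector per frame.
        k, d = codebook.cluster_centers_.shape
        assignments = codebook.predict(frame_descriptors)
        vlad = np.zeros((k, d))
        for desc, c in zip(frame_descriptors, assignments):
            vlad[c] += desc - codebook.cluster_centers_[c]
        return vlad.ravel()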

Table 4.1: Details of the ACE dataset (number of training and test instances and frames for each of the eight actions: breaking, mixing, baking, turning, cutting, boiling, seasoning and peeling).

4.5.1 ACE

The ACE dataset was released as part of an ICPR 2012 official contest (the Kitchen Scene Context Based Gesture Recognition contest; the dataset is also known as KSCGR). It was collected in a simulated kitchen using a Kinect camera at 30 fps. The resolution of the depth frames is 640 x 480. Using our grid on these frames resulted in 1,786 descriptors per frame. In the dataset, five actors cook eggs according to different recipes (omelet, scrambled eggs, etc.). The cooking entails 8 different classes of actions: cutting, seasoning, peeling, boiling, turning, baking, mixing and breaking, annotated in the videos at frame level. Classification is challenging since most of the actions share similar body postures and span limited movements. Most of the previous work on this dataset had used it for joint action segmentation and classification [Niebles et al., 2010, Yuan et al., 2011, Wang and Mori, 2011a, Wang et al., 2011, Wang and Schmid, 2013, Ni et al., 2015], while we have used it for joint action classification and summarisation. To prepare the dataset for this task, we have clipped the individual action instances, maintaining the same training and test split mentioned in [Shimada et al., 2013]. In this way, we have obtained 161 action instances for training and 95 for testing. Table 4.1 shows the number of instances and frames per action. Each instance ranges between 20 and 4,469 frames. In addition, we have asked three annotators to independently select B = 10 frames from each action instance as their preferred summary for that instance. The annotators were instructed to select the frames based on how well they seemed to cover the overall instance and its various phases. This left room for significant subjectivity and variance in the resulting summaries (Figure 4.3).

Results. To evaluate the action recognition component, we have compared the test-set recognition accuracy of the proposed system with: 1) a baseline system using the pooled descriptors from all frames as measurement and libsvm [Chang and Lin, 2011] as the classifier; 2) the proposed system without the summarisation component in the score function (i.e., \lambda_1 = \lambda_2 = 0), still with B = 10; and 3) the same, with all the frames. In addition, we have compared it with the highest result reported in the literature for the joint action segmentation and classification task [Ni et al., 2015]. Table 4.2 shows that the proposed method (with \lambda_1 = 0.5 and \lambda_2 = -0.5) has achieved a much higher accuracy (77.9%) than the same method without the summarisation component in the score function, both when using only 10 frames (66.3%) and all frames (54.7%). The recognition accuracy obtained by the baseline system (62.1%) has also been remarkably lower than that of the proposed method. In addition, the proposed method has outperformed the highest result for the action segmentation and classification task (75.2%) [Ni et al., 2015], although these accuracies cannot be directly compared since we have not undertaken action segmentation. Overall, these results give evidence that: a) higher action classification accuracy can be achieved by leveraging a selection of the frames; b) the summarisation component in the score function increases the accuracy of action recognition; and c) the proposed method is in line with the state of the art on this dataset.

Table 4.2: Comparison of the action recognition accuracy on the ACE dataset.

    Method                                        Accuracy
    libsvm [Chang and Lin, 2011]                  62.1%
    PA-Pooling [Ni et al., 2015]                  75.2%
    Proposed method (all frames & no summary)     54.7%
    Proposed method (10 frames & no summary)      66.3%
    Proposed method                               77.9%

For the evaluation of the summarisation component, we resort to both a qualitative comparison and a quantitative comparison by means of the proposed V-JAUNE loss. To put the normalised loss value achieved by the proposed system, as described in the previous paragraph, in perspective, we compare it with the loss value obtained by a popular summarisation approach, the sum of absolute differences (SAD), which has been widely used in object recognition and video compression [Xiong et al., 2006]. The loss value achieved by SAD is 0.927, showing that our summaries are only slightly worse than those obtained with this method.
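Since SAD is used as a baseline throughout, one plausible minimal reading of it is sketched below: frames with the largest sum of absolute differences from their predecessor are kept as key frames. The exact baseline implementation may differ; this sketch only conveys the idea.

    import numpy as np

    def sad_summary(frames, budget=10):
        # Score each frame by its sum of absolute differences (SAD)
        # from the previous frame, then keep the largest change points.
        sad = [np.abs(frames[i] - frames[i - 1]).sum()
               for i in range(1, len(frames))]
        top = np.argsort(sad)[-budget:] + 1  # +1: scores start at frame 1
        return sorted(top.tolist())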

Table 4.3: The evaluation results on the ACE dataset using various amounts of supervision (action recognition accuracy and normalised V-JAUNE against each annotator's ground truth).

    Learning            Ground 1             Ground 2             Ground 3
                        Accuracy  V-JAUNE    Accuracy  V-JAUNE    Accuracy  V-JAUNE
    Unsupervised        77.9%     --         --        --         --        --
    10% supervised      81.1%     --         --        --         --        --
    20% supervised      81.1%     --         --        --         --        --
    Fully supervised    66.3%     --         --        --         --        --

However, all the experiments conducted so far have been carried out in a completely unsupervised way as far as the summaries are concerned. This means that the initialisation of the summary variables in latent structural SVM has been performed in an arbitrary way (i.e., uniform spacing). Conversely, the proposed method has the potential to take advantage of a more informed initialisation. To this aim, we have created a set of experiments where an increasing percentage of the training sequences were initialised using the summaries from one of the annotators in turn (the remaining sequences were still initialised uniformly). Table 4.3 shows the quantitative evaluation of the produced summaries alongside the recognition accuracy with different percentages of supervision. A remarkable result has been obtained with 10% supervision from the first annotator (Ground 1), with an action recognition accuracy of 81.1% and an improved normalised V-JAUNE value. This result seems very valuable since both the action recognition accuracy and the summary quality have improved compared to the unsupervised case, also outperforming the SAD baseline. As expected, the performance and the optimal amount of supervision vary with the annotator (Table 4.3), and they are therefore selected by cross-validation. However, the fact that the performance does not improve beyond a given amount of supervision seems desirable: while it is easy to collect large datasets of action videos, it is very time-consuming to manually annotate their summaries. Therefore, such a weakly-supervised scenario is also the most feasible in practice. For a qualitative comparison, Figure 4.4 shows examples of summaries predicted by the proposed approach (10% supervision) and SAD for actions breaking, baking (omelet), baking (ham) and turning. In general, the summaries obtained with the proposed approach appear mildly more informative and diversified in terms of stages of the action (compare, for instance, the first and second rows of Figure 4.4.b). Given that the loss value is also slightly lower for the proposed method, this qualitative comparison confirms the usefulness of V-JAUNE as a quantitative indicator of a summary's quality.

Figure 4.4: Examples of predicted summaries from the ACE dataset (displayed as RGB frames for the sake of visualisation). The subfigures display the following actions: a) breaking; b) baking (omelet); c) baking (ham); and d) turning. In each subfigure, the first row is from the proposed method, the second from SAD.

Table 4.4: Influence of the budget on the action recognition accuracy for the ACE dataset.

    Budget size   Accuracy
    5             54.7%
    10            77.9%

Table 4.5: Sensitivity analysis of the action recognition accuracy at the variation of the \lambda parameters for the ACE dataset (unsupervised case).

As a last experiment with this dataset, we have aimed to explore the sensitivity of the action recognition accuracy to the budget, B, and the \lambda parameters in the score function. Table 4.4 shows that if the budget is reduced to 5 frames per video, the action recognition accuracy drops significantly (54.7%). This is likely due to an insufficient description of the action. However, the action recognition accuracy also starts to decrease if the budget is increased beyond a certain value. This confirms that adding frames in excess ends up introducing noise in the score function. Table 4.5 shows that the best accuracy is achieved with a tuned balance of the coverage, non-redundancy and recognition coefficients (first row). Renouncing the summarisation components (second row) decreases the accuracy, while the accuracy increases back as they are progressively reintroduced (third row). Renouncing the recognition component, too, leads to a marked decrease in accuracy (fourth row).

4.5.2 MSR DailyActivity3D

The MSR DailyActivity3D dataset is a popular activity dataset captured using a Kinect sensor. It consists of 16 classes of typical indoor activities, namely drinking, eating, reading, using cell phones, writing, using a laptop, vacuum cleaning, cheering, sitting still, tossing crumpled paper, playing games, lying on the sofa, walking, playing the guitar, standing up, and sitting down. The total number of videos is 320, each representing an activity instance from one of 10 actors and either of two different poses (standing close to the couch and sitting on it). The resolution of the depth frames is 320 x 240, which, with our parameterisation, has led to a total of 419 descriptors per frame. For evaluation, we have used the most common training/test split for this dataset, with the first five subjects used for training and the remaining for testing. For the annotation of the summaries, given the easier interpretability of these videos, we have used only a single annotator.

Results. Table 4.6 reports the test-set results from the proposed method with different extents of summary supervision. The highest action recognition accuracy (60.6%) has been achieved with completely unsupervised summaries. This can be justified by the fact that, during training, the summary variables are free to take the values that maximise the training objective, and this seems to generalise well on the test set. While the accuracy is not high in absolute terms, it is the highest reported to date for this dataset without the use of the skeleton information. Conversely, the best value of the denormalised V-JAUNE measure on the test set (4.99) is achieved when training with full supervision of the summaries. However, the increase in value with decreasing supervision is modest, and unsupervised training may be regarded as the preferable trade-off between action recognition accuracy and summary quality in this case. Please note that the values for V-JAUNE are generally higher than those reported in Table 4.3 since in this experiment the measure does not include multi-annotator normalisation.

Table 4.6: The evaluation results on the MSR DailyActivity3D dataset using various flavours of learning.

    Learning style         Action accuracy    V-JAUNE (denorm.)
    Fully supervised       58.8%              4.99
    Semi-supervised 10%    56.3%              5.02
    Semi-supervised 20%    56.3%              5.10
    Unsupervised           60.6%              5.22

For a comparative evaluation of the action recognition accuracy, we report the test-set accuracy of:

1. a reference system that uses the pooled descriptors from all frames as measurement and libsvm as the classifier;
2. the proposed system using all the frames and without the summarisation component in the score function (i.e., \lambda_1 = \lambda_2 = 0);
3. the proposed system with full functionalities; and
4. a system from the literature that uses dynamic time warping; to the best of our knowledge, this is the best reported accuracy without making use of the actors' skeletal joints in any form (locations or angles).

Table 4.7 shows that the accuracy achieved with the proposed method (60.6%) is much higher than that of the reference system (34.4%) and also remarkably higher than that of the proposed method using all the frames (48.8%). This proves that action recognition based on a selected summary can prove more accurate than recognition from the entire video, and validates the intuition of providing action recognition and summarisation jointly. In addition, the accuracy using depth videos is also remarkably higher than that using RGB videos (46.3%), showing that depth can prove a more informative clue for recognising actions.

Table 4.7: Comparison of the action recognition accuracy on the MSR DailyActivity3D dataset (depth frames only).

    Method                                                                  Accuracy
    libsvm [Chang and Lin, 2011]                                            34.4%
    Proposed method (all frames)                                            48.8%
    Proposed method                                                         60.6%
    Dynamic temporal warping [Wang et al., 2012b, Müller and Röder, 2006]   54.0%
    Proposed method (RGB videos)                                            46.3%

For a quantitative evaluation of the summarisation component, we have compared the V-JAUNE measure for the summaries obtained with the proposed method and with SAD: the loss with SAD (5.65) is significantly higher than with the proposed method with any extent of supervision (4.99-5.22). Also from a qualitative perspective, the predicted summaries seem more informative, as displayed in Figure 4.5.

Figure 4.5: Examples of summaries from the MSR DailyActivity3D dataset (displayed as RGB frames for ease of interpretation) for actions a) cheer and b) walk: in each subfigure, the first row is from the proposed method and the second from SAD. The results from the proposed method look more informative.

Chapter 5

Minimum Risk Structured Learning of Video Summarisation

Video summarisation is an important multimedia task that is useful for applications such as video indexing and retrieval, video surveillance, human-computer interaction and video storyboarding. In this chapter, we present a new mechanism to achieve automatic summarisation of large-scale video collections by using the proposed loss function, V-JAUNE, directly in the learning algorithm, together with a new feature function that encapsulates the frames' sequentiality while still enjoying the property of submodularity. The efficiency of the proposed algorithms is proved using qualitative and quantitative tests on two challenging depth action datasets: the ACE and the MSR DailyActivity3D datasets. The results show that the proposed approach leads to effective classifiers and high-quality summaries.

5.1 Introduction and Related Work

The amount of publicly-available video footage is growing at unprecedented rates thanks to the commoditisation of video acquisition and the role played by social media. According to VIDCON, YouTube users upload more than 400 hours of video to the site every minute. Moreover, SocialMediaToday has recently reported that video views on Facebook are averaging 8 billion a day. With the rapidly expanding size of video repositories, the

need for summarisation tools is becoming more urgent. Fortunately, typical video content can be effectively summarised to a remarkable extent. For example, in sports videos an informative summary may contain highlights of scored points and defensive actions. In surveillance systems, a summary may contain only the main events, filtered from the long hours of uneventful recording. In general, video summarisation offers an efficient approach to abstract the main actions, scenes, or objects in a video to provide an easily-understood synopsis [Cong et al., 2012a]. Over the years, a large number of algorithms have been proposed for automated summarisation, aimed at both accuracy and efficiency. These algorithms can be mainly categorised as a) clustering approaches and b) frame-differences approaches. The clustering approaches are aggregative methods that attempt grouping similar frames and select representatives from each group [De Avila et al., 2011, Ghosh et al., 2012, Jaffe et al., 2006, Mundur et al., 2006]. Frames can be clustered using low-level features (e.g., [De Avila et al., 2011]) or even detected objects [Ghosh et al., 2012], and structure can also be usefully enforced during clustering [Chen et al., 2009, Gygli et al., 2015]. Frame-differences approaches, instead, scan the video's frames in sequential order to detect shot boundaries and key frames [Xiong et al., 2006, Cong et al., 2012a, Yang et al., 2013, Lu et al., 2016]. The basic requirements of an effective video summary are well understood and boil down to appropriate coverage of the original footage together with limited redundancy in the frames selected as the summary. In this chapter, we tackle the problem of video summarisation by structural SVM. The motivation of structural SVM is to extend the powerful classification performance of SVM to the structured case. The input can be any amount of measurements, and the output classes can be complex and structured (chains, trees, graphs, etc.). The attractive property is that it permits building submodular score functions that allow inferring meaningful values for the parameters and efficient inference of the summaries at run time. The main challenge is the need for an appropriate measure of the summary's quality to use both for evaluation and as a training loss for structural SVM. A popular metric for evaluating text summaries in the NLP community is ROUGE [Lin, 2004]. Inspired by [Lin, 2004], [Tschiatschek et al., 2014] have introduced a metric called V-ROUGE and applied it to the evaluation of summaries of image collections. While this metric could also be utilised to measure the quality of a video summary, it does not consider the frames' sequentiality. Therefore, in this chapter, we present a framework that allows us to apply a loss function in the learning algo-

rithm to derive effective video summaries and exploit different types of feature functions for the summary variables. The proposed framework depends not only on the contents but significantly on the frames' order to ensure the temporal coherence of the predicted summary.

Contributions:

- A novel scoring function that embraces the frames' sequentiality of a summary and still enjoys the property of submodularity.
- Using the dedicated summarisation loss function, V-JAUNE, for the empirical risk minimisation of structural SVM.

The remainder of the chapter is organised as follows: the detailed explanation of the model and the learning framework is elucidated in Section 5.2. In Section 5.3 we present the summary evaluation metric. Experiments and results are discussed in Section 5.4.

5.2 Summarisation via structured learning

The framework proposed for video summarisation is based on structural SVM. The model is described hereafter, while structural SVM is presented in Section 5.2.2. Please note that in this section we use a superscript index to indicate the video and, where needed, a subscript to indicate the frame. In addition, we switch to noting the summary variables as y since there are no hidden variables (h) in this unit of work. For clarity, no action recognition is involved in this chapter.

5.2.1 Problem Formulation

The goal of our work is to provide summaries from video sequences. To this aim, let us note a sequence of multivariate measurements, one per frame, as x = \{x_1, \ldots, x_i, \ldots, x_T\} and a corresponding sequence of binary variables indicating whether a frame belongs to the summary or not as y = \{y_1, \ldots, y_i, \ldots, y_T\}. Formally, we aim to infer summary y while keeping it within a given, maximum size, B (the "budget"):

    y^* = \argmax_{y} F(x, y)  \quad \text{s.t.} \; \sum_{i=1}^{T} y_i \leq B    (5.1)

As in previous chapters, we restrict the choice of the scoring function to the case of linear models:

    F(x, y) = w^\top \psi(x, y)    (5.2)

with w a parameter vector of non-negative elements and \psi(x, y) a suitable feature function of equal size. Lin and Bilmes in [Lin and Bilmes, 2011] have proposed the following feature function for summarisation:

    \psi(x, y) = \sum_{i,j=1, j \neq i}^{T} \phi(x_i, x_j, y_i, y_j)    (5.3)

where

    \phi(x_i, x_j, y_i, y_j) = \lambda(y_i, y_j) \, s(x_i, x_j)

    \lambda(y_i, y_j) = \begin{cases} \lambda_1, & y_i = 1, y_j = 0 \quad \text{(coverage)} \\ \lambda_2, & y_i = 1, y_j = 1 \quad \text{(non-redundancy)} \\ 0, & y_i = 0, y_j = 0 \end{cases} \qquad \lambda_1 \geq 0, \; \lambda_2 \leq 0

with s(x_i, x_j) a similarity function between frames x_i and x_j. Frame x_i, i = 1 \ldots T, is included in the summary if its corresponding binary indicator, y_i, is set to one. Therefore, the \lambda_1 terms in (5.3) are the coverage terms, while the \lambda_2 terms promote non-redundancy in the summary by penalising similar frames. However, functions based on between-frame similarities such as (5.3) are suitable for the summarisation of sets, but do not properly describe the sequence of the frames. To ensure that the frames' sequentiality is taken into account, we propose to augment (5.3) as follows:

    \psi(x, y) = \left[ \; \sum_{i,j=1, j \neq i}^{T} \phi(x_i, x_j, y_i, y_j) \;\;\; \underbrace{\Omega(y)}_{\text{order term}} \; \right]    (5.4)

where

    \Omega(y) = \lambda_3 \sqrt{ \sum_{i,j=1}^{T} (i - j)^2 } \quad \text{s.t.} \; y_i = 1, \, y_j = 0, \qquad \lambda_3 \geq 0    (5.5)

In this way, a new term, \Omega(y), is concatenated to the scoring function to reward the coverage of the frame indices (notation [a b] represents the concatenation of a and b). This term helps to ensure that the summary will contain a good representation of the frames based not only on their contents, but also on their order in the sequence. The square root in (5.5) retains the submodularity of its argument, which is a conventional coverage term. This new term is a scalar, so the size of \psi (and thus w) only increases by one. We restate herewith that the main benefit of submodular scoring functions is the performance guarantees over greedy maximisation algorithms. For clarity, the guarantees only hold for monotonic functions, but this is the de-facto case of submodular functions over reasonably small summaries (intuitively, adding an extra frame to a summary may actually decrease its value, but only if it already contains many frames).

5.2.2 Structural SVM for Supervised Learning

The main motivation of structural SVM is to extend the powerful classification performance of SVM to the structured case with the ability to optimise a custom loss function [Finley and Joachims, 2005]. It is therefore a natural training framework for the score function in (5.2). Let us assume that we are given a training set with N videos, (x^i, y^i), i = 1 \ldots N, where the summaries are supervised. The learning objective of structural SVM can then be expressed as:

    w^* = \argmin_{w, \, \xi^{1:N}} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi^i
    \text{s.t.} \quad w^\top \psi(x^i, y^i) \geq w^\top \psi(x^i, y) + \Delta(y^i, y) - \xi^i    (5.7)

The objective (5.7) is a quadratic optimisation problem with linear constraints. The intuition behind this objective is that the score w^\top \psi(x^i, y^i) of the ground-truth summary should be greater than the score w^\top \psi(x^i, y) of all other, incorrect structures by a margin equal to the loss function, \Delta(y^i, y) (margin-rescaled SVM). However, the number of constraints is huge since it is exponential in the number of class variables. For instance, for a 1-minute video at 30 fps, the total number of possible summaries of B = 10 frames is \binom{1800}{10}, a number in the order of 10^{26}. For a training set with N such samples, the total number of constraints will be N times that. Since the complete constraint set is exponential, we adopt the relaxation of [Tsochantaridis et al., 2005] which obtains a closely-approximated solution (to an arbitrary epsilon) by drastically reducing the number of constraints to a polynomial in N. That is, for each sample at every iteration of the solver only the most violated constraint is added to a working set of constraints. The most violated constraint is, as usual, found by the loss-augmented inference:

    \bar{y}^i = \argmax_{y} \left( w^\top \psi(x^i, y) + \Delta(y^i, y) \right)    (5.8)

where \Delta(y^i, y) is an arbitrary loss function that we choose to measure the accuracy for the task at hand. The loss function that we choose to minimise for summarisation is described hereafter.

5.2.3 Learning with V-JAUNE

The goal of video summarisation is to produce an ordered selection of the frames as the summary. The objective in (5.7) is optimised on a customisable loss function whose choice is important for the task at hand. For this reason, we have applied the loss function, V-JAUNE, that we presented in Section 4.4 for evaluation. To present this loss hereafter, we use a compact notation for a summary, y = \{y_1, \ldots, y_i, \ldots, y_B\}, consisting of the frame indices

of its B frames. Given a ground-truth summary, y, and a predicted summary, \bar{y}, the loss function is defined as follows:

    \Delta(y, \bar{y}) = \sum_{i=1}^{B} \delta(y_i, \bar{y}_i)
    \delta(y_i, \bar{y}_i) = \min_j \{ \| x_{y_j} - x_{\bar{y}_i} \|_2 \}, \quad \text{s.t.} \; i - l \leq j \leq i + l    (5.9)

With this definition, loss function \Delta(y, \bar{y}) reflects the sequential order of the frames in their respective summaries, while allowing for a \pm l tolerance in the matching of the corresponding positions. Algorithm 3 shows the greedy algorithm that we use to infer the best sequence summary, selecting one frame at a time.

Algorithm 3: Greedy algorithm for inferring summary \bar{y} given scoring function F(x, y) and loss \Delta(y, \bar{y}).

    for n = 1 \ldots N do
        y^* \leftarrow \emptyset
        X \leftarrow x
        while X \neq \emptyset and |y^*| < B do
            k \leftarrow \argmax_{v \in X} \; [F(x, y^* \cup v) + \Delta(y, y^* \cup v)] - [F(x, y^*) + \Delta(y, y^*)]
            y^* \leftarrow y^* \cup \{k\}
            X \leftarrow X \setminus \{k\}
        end while
    end for

5.3 V-JAUNE for Evaluation

Video summarisation still lacks a generally-agreed measure for the quantitative assessment of a summary's quality. Unlike metrics for text summaries like the popular ROUGE [Lin, 2004], a measure for video summaries should reflect not only the summary's content, but also its order, since summaries could otherwise prove ambiguous. For example, actions "sitting down" and "standing up" could generate similar sets of summary frames, but their order must be different to correctly convey the action. A loss function such as (5.9) is considerably beneficial and appropriate for this purpose.
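To tie (5.4)-(5.5) and Algorithm 3 together, the following Python sketch shows the order term and the loss-augmented greedy selection. Here score_fn and loss_fn stand for F(x, y) and a V-JAUNE-style loss respectively; all names are illustrative assumptions rather than the thesis code, and the loss is evaluated on the partially-grown summary at each greedy step.

    import math

    def order_term(y_indices, n_frames, lam3=0.5):
        # Eq. 5.5: square-rooted coverage of the frame indices; the square
        # root preserves submodularity of the inner coverage sum.
        rest = set(range(n_frames)) - set(y_indices)
        return lam3 * math.sqrt(sum((i - j) ** 2
                                    for i in y_indices for j in rest))

    def greedy_loss_augmented(x, y_true, budget, score_fn, loss_fn):
        # Sketch of Algorithm 3: greedily grow the summary by the largest
        # marginal gain of score plus loss (the most violated constraint).
        y, candidates = [], set(range(len(x)))
        while candidates and len(y) < budget:
            base = score_fn(x, y) + loss_fn(y_true, y)
            k = max(candidates,
                    key=lambda v: score_fn(x, y + [v])
                                  + loss_fn(y_true, y + [v]) - base)
            y.append(k)
            candidates.remove(k)
        return y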

In the field of summarisation, the annotation of the ground truth is highly subjective and it is therefore desirable to extend the loss to the multi-annotator case. By calling M the number of annotators, the multi-annotator loss is defined as:

    \Delta(y^{1:M}, \bar{y}) = \sum_{m=1}^{M} \Delta(y^m, \bar{y})    (5.10)

A loss function such as (5.9)-(5.10) visibly depends on the scale of the x measurements and it is therefore denormalised. A possible way to normalise it would be to estimate the scale of the measurements and divide the loss by it. However, a preferable approach is to normalise it by the disagreement between the annotators' summaries. In this way, the loss simultaneously becomes normalised by both the measurements' scale and the extent of disagreement between the ground-truth annotators. Therefore, we quantify the disagreement as:

    D = \frac{2}{M(M-1)} \sum_{p,q} \Delta(y^p, y^q), \quad p = 1 \ldots M, \; q = p+1 \ldots M    (5.11)

and normalise the loss as:

    \bar{\Delta}(y^{1:M}, \bar{y}) = \Delta(y^{1:M}, \bar{y}) / D    (5.12)

Figure 5.1 compares the values of the denormalised and normalised loss functions for three summary annotations of videos from the ACE action dataset. It is evident that the normalised loss values are much more uniform.

Figure 5.1: V-JAUNE values for the ACE test set for actions a) boiling and b) seasoning, with multiple annotators: blue bars: denormalised values; red bars: normalised values.

Figure 5.2: V-JAUNE loss for different annotators for actions a) boiling and b) seasoning, using the first annotator as ground truth and the second as prediction. Please note that the changes in value are mainly due to the changes in magnitude of the VLAD descriptors. However, the agreement also varies with the video.

5.4 Experimental results

To evaluate the effectiveness of the proposed method, we have performed experiments on two challenging action datasets of depth videos: the Actions for Cooking Eggs (ACE) dataset [Shimada et al., 2013] and the MSR DailyActivity3D dataset [Wang et al., 2012b]. The detailed descriptions of these datasets have been presented previously in Sections 4.5.1 and 4.5.2. For both datasets, we have kept the same implementation settings that were presented in Section 4.5. The evaluation addresses the qualitative and quantitative quality of the produced summaries.

5.4.1 ACE

To ensure the generality of the proposed method, we have applied the framework to two forms of the ACE dataset: a) clipped, and b) unclipped. In the first form, we have clipped the individual action instances, obtaining 161 instances for training and 95 for testing. Each instance ranges between 20 and 4,469 frames. In the second form, we have kept the original videos without clipping. The videos portray five actors cooking eggs according to five recipes: ham and eggs, scrambled eggs, boiled eggs, omelet and Kinshi-Tamago (a Japanese egg crepe). There are 35 unclipped videos in total and we have used 25 for training and 10 for testing: their number of frames ranges between 2,000 and 12,000 each. For both forms, we have maintained the same training and test split mentioned in [Shimada et al., 2013]. In addition, we have asked five annotators (three for the clipped instances and two for the unclipped) to independently select B = 10 frames from each video as their preferred summary.

Results (clipped). To evaluate the videos' summaries, we have resorted to both a qualitative comparison and a quantitative comparison using the proposed V-JAUNE loss. To put the normalised loss value achieved by the proposed system in perspective, we compare it with the loss values obtained by: 1) a system that considers only the coverage and the non-redundancy in the objective (i.e., \lambda_3 = 0); 2) the proposed method using only the coverage and the frames' order (i.e., \lambda_2 = 0); and 3) a popular summarisation approach, the sum of absolute differences (SAD), which has been widely used in object recognition and video compression [Xiong et al., 2006]. The loss value achieved by SAD is 0.927, showing that our summaries are remarkably better than those obtained with this method.

Table 5.1: The values of the V-JAUNE measure on the ACE dataset (clipped).

    Method                                                       Ground 1   Ground 2   Ground 3
    Lin and Bilmes [Lin and Bilmes, 2011] (\lambda_3 = 0)        --         --         --
    Proposed method (\lambda_2 = 0)                              --         --         --
    Proposed method (\lambda_1, \lambda_2, \lambda_3 \neq 0)     --         --         --

Table 5.2: The values of the V-JAUNE measure on the ACE dataset (unclipped).

    Method                                                       Ground 1   Ground 2
    Lin and Bilmes [Lin and Bilmes, 2011] (\lambda_3 = 0)        --         --
    Proposed method (\lambda_2 = 0)                              --         --
    Proposed method (\lambda_1, \lambda_2, \lambda_3 \neq 0)     --         --

Table 5.1 shows a quantitative comparison of the summaries obtained with the different methods and different ground-truth annotations. These results seem encouraging since the inclusion of the frames' sequentiality in the scoring function has led to higher accuracy than the original scoring function of [Lin and Bilmes, 2011], which only takes into account coverage and non-redundancy. For a qualitative comparison, Figure 5.3 shows examples of summaries predicted by the proposed approach (i.e., \lambda_1, \lambda_2, \lambda_3 \neq 0) and SAD for actions seasoning and peeling. In general, the summaries obtained with the proposed approach appear mildly more informative and diversified in terms of stages of the action (compare, for instance, the first and second rows of Figure 5.3(a)). Given that the loss value is also lower for the proposed method, this qualitative comparison confirms the usefulness of using V-JAUNE in the learning objective.

Results (unclipped). Table 5.2 shows a quantitative comparison of the produced summaries. The best result is obtained, again, with the proposed method, at a substantial parity with and without the non-redundancy term (\lambda_2 \neq 0 and \lambda_2 = 0, respectively). In addition, the loss with the SAD baseline (2.197) is worse than with the proposed method (2.149). For a qualitative comparison, we compare the summaries obtained with the proposed method (i.e., \lambda_1, \lambda_2, \lambda_3 \neq 0) with those produced by SAD in Figure 5.4: in our judgement, the summaries provided by the proposed approach appear more able to describe the entire preparation of the recipe. For example, frames from actions seasoning and mixing only appear in the summaries provided by the proposed method, and in the expected order.

Figure 5.3: Examples of predicted summaries from the ACE dataset (clipped). The subfigures display the actions a) seasoning; and b) peeling. In each subfigure, the first row is from the proposed method, the second from SAD.

Figure 5.4: Examples of predicted summaries from the ACE dataset (unclipped). In each subfigure, the first row is from the proposed method, the second from SAD.

Table 5.3: The values of the V-JAUNE measure on the MSR DailyActivity3D dataset (denormalised).

    Method                                                       V-JAUNE (denorm.)
    SAD                                                          --
    Lin and Bilmes [Lin and Bilmes, 2011] (\lambda_3 = 0)        --
    Proposed method (\lambda_2 = 0)                              --
    Proposed method (\lambda_1, \lambda_2, \lambda_3 \neq 0)     --

5.4.2 MSR DailyActivity3D

Results. For a quantitative evaluation over this dataset, we have compared the V-JAUNE measure for the summaries obtained with the proposed method with: 1) SAD; 2) a system that considers only the coverage and the non-redundancy in the objective (i.e., \lambda_3 = 0); and 3) the proposed method using only the coverage and the frames' order (i.e., \lambda_2 = 0). Table 5.3 reports the values of the denormalised V-JAUNE measure on the test set (please note that these values are denormalised since we only have one annotation), showing that the loss with the proposed method is the best. Also from a qualitative perspective, the predicted summaries seem more informative than with SAD, as displayed in Figure 5.5.

Figure 5.5: Examples of predicted summaries from the MSR DailyActivity3D dataset. The subfigures display the actions a) using vacuum; and b) playing guitar. In each subfigure, the first row is from the proposed method, the second from SAD.
