
Master's Thesis

CLASSIFICATION OF GESTURES USING POINTING DEVICE BASED ON HIDDEN MARKOV MODEL

By: Tanvir Alam
Date: 26/06/ :15

Supervisor at Philips Research: Dr. Jan Kneissler (Senior Scientist), Philips Research Laboratories, Medical Signal Processing, Eindhoven, the Netherlands
Supervisor at IDE: Frank Lüders (Industrial PhD student), Mälardalen University, Västerås, Sweden

Preface

This thesis was conducted at Philips Research in Eindhoven, the Netherlands, from June 1st, 2006 to January 31st, 2007. Additional work was done at Mälardalen University in Västerås, Sweden.

Open source frameworks: HTK and GT²k
Operating systems: Windows XP and Linux Fedora 5.0
Additional tools used: C/C++, UNIX shell scripts, Perl, Matlab 7, LabView 7 and MS Excel

This report was written in MS Word.

Acknowledgment

I would like to thank all who helped me with this thesis. First of all, I would like to personally thank Mr. Galileo June Destura for giving me the chance of a lifetime; without his help nothing would have been possible, and I would have spent my first day in the Netherlands on the streets. Second, I would like to thank Dr. Jan Kneissler, my supervisor at Philips, for his help and motivation; his smooth communication and understanding eased me into the Linux environment. Third, I would like to thank Mr. Ard Biesheuvel, a friend and colleague, who helped me with MS Visual C++, UNIX scripting and Perl; without his help everything would have proceeded much more slowly. Fourth, I would like to give special thanks to Mr. Frank Lüders, my supervisor at Mälardalen University, for his initial encouragement and motivation and for correcting my report. Thanks to the teams who developed the Hidden Markov Model Toolkit (HTK) and the Georgia Tech Gesture Toolkit (GT²k). I would also like to thank the 50 people who helped me with data collection, including friends, colleagues and Philips employees. Most importantly, I would like to express thanks and gratitude to my parents, who blessed me with their love, happiness and strength, which helped me through the bad times and good times of this thesis.

Abstract

The HMM is a very powerful modeling tool compared to many alternatives such as neural networks, template matching, dictionary lookup, linguistic matching, ad hoc methods, etc. To use HMMs for gesture recognition, the gestures are transformed from sensor data into sequences of feature vectors, and an HMM is trained to represent each gesture. An existing open source framework (HTK and GT²k) is used. The data is collected using a pointing device similar to a PC mouse, producing (x, y) coordinates sampled every 16 milliseconds. Eight different feature vectors were generated, and experiments were conducted with different numbers of HMM states. It was found that an increased number of HMM parameters, including an increased number of HMM states, gives better accuracy; the different HMM models were also compared with respect to their number of parameters, since the target is an implementation on an embedded system. Recognition can be improved in two ways: using a greater number of HMM states, or using multiple Gaussian densities per probability distribution. In both cases, improvements were shown with an increased number of parameters compared to a fixed number of parameters, which is critical for the memory and real-time constraints of an embedded platform. The best accuracy achieved is 96% for isolated recognition and approximately 90% for grammar-based recognition with elements shared among multiple gestures.

Table of Contents

Chapter 1: Introduction
    1.1 Introduction
        Organization
        Philips Research
        High Tech Campus
    1.2 Motivation
    1.3 What is gesture?
        Types of gestures
        Purpose of gesture recognition
    1.4 Related Work
    1.6 Overview
Chapter 2: Hidden Markov Model
    2.1 Background
    2.2 Pattern Recognition
        Stochastic process
        Markov Chain
    2.3 Hidden Markov Model
        Formal definition of HMM
    2.4 Problems and Algorithms
        Probability of the observation sequence (Evaluation)
        Discovering the Hidden State (Decoding)
        Estimating the Model Parameters (Learning)
    2.5 Why use HMM?
        Advantages and Disadvantages
    2.6 Hidden Markov Model Toolkit (HTK)
        HTK Environment
        Use of HTK
    2.7 Georgia Tech Gesture Toolkit (GT²k)
        Connecting GT²k with HTK
        Process of training, validation and recognition
    2.8 Relation to previous work
Chapter 3: Approach
    3.1 Approach
    3.2 Method
        Collecting Data
        The tool
        Extracting the feature vector
        Experiment for collecting data
        Training and Validation
    3.3 Experiments
        Types of experiment
        Grammar for isolated gesture recognition
        Testing the feature vector
        Differentiating GOOD and BAD data
        Testing each HMM state count with different feature vectors
        Searching for a Filter
        Gesture grouping
        Searching for a Decay Filter

    3.5 Recognition with gesture elements
        Grammar for gesture elements
        Label (MLF) file for gesture elements
        Experiment with gesture elements
    Conclusion
4.0 Future Work
5.0 References
Appendix A: The source codes
Appendix B: New gestures for new data collection

List of Figures

FIGURE 1: HIGH TECH CAMPUS - EINDHOVEN
FIGURE 2: EXAMPLE OF MARKOV PROCESS
FIGURE 3: FIRST ORDER TRANSITIONS
FIGURE 4: TRANSITION MATRIX OF POSSIBLE TRANSITION PROBABILITIES
FIGURE 5: EXAMPLE OF HIDDEN MARKOV MODEL
FIGURE 6: HIDDEN AND OBSERVABLE STATES IN THE WEATHER
FIGURE 7: CONFUSION MATRIX
FIGURE 8: A THREE-STATE HIDDEN MARKOV MODEL (HMM)
FIGURE 9: SOFTWARE STRUCTURE OF A TYPICAL HTK
FIGURE 10: EXAMPLE HMM MODELS
FIGURE 11: COMMUNICATION BETWEEN GT²K AND HTK
FIGURE 12: THE TOOL INTERFACE
FIGURE 13: EXAMPLE OF NORMAL PDF DENSITY
FIGURE 14: SAMPLE OF DECAY FILTER
FIGURE 15: DIFFERENCE BETWEEN MALE AND FEMALE OPINION
FIGURE 16: 30 GESTURES
FIGURE 17: PROCESS FOR CROSS-VALIDATION
FIGURE 18: PROCESS FOR LEAVE-ONE-OUT VALIDATION
FIGURE 19: PROCESS FOR FIXED VALIDATION
FIGURE 20: RESULTS FOR GOOD AND BAD DATA
FIGURE 21: RESULTS FOR 21 SUBJECTS
FIGURE 22: RESULTS FOR 40 SUBJECTS
FIGURE 23: FILTER RANGE
FIGURE 24: INDIVIDUAL TEST OF EACH FEATURE VECTOR USING FILTER RANGE
FIGURE 25: RESULTS FOR INCREASED VECTOR DIMENSIONS
FIGURE 26: RESULTS FOR INCREASED HMM STATES AND DIFFERENT FILTER TEST
FIGURE 27: RESULTS FOR INCREASED FILTER TEST
FIGURE 28: SPLIT TEST
FIGURE 29: GESTURE GROUP HMM STATE RATIO TEST FOR SMALL
FIGURE 30: GESTURE GROUP HMM STATE RATIO TEST FOR MEDIUM
FIGURE 31: GESTURE GROUP HMM STATE RATIO TEST FOR LARGE
FIGURE 32: DECAY FILTER TEST 0.2 TO ...
FIGURE 33: DECAY FILTER TEST 0.1 TO 0.5 FOR (RVA, RVI, RI, RA AND RIA)
FIGURE 34: DECAY FILTER TEST 0.1 TO 0.5 USING DIFFERENT SAMPLE RATES
FIGURE 35: DECAY FILTER 0.6 USING DIFFERENT FEATURE VECTORS
FIGURE 36: DECAY FILTER 0.6 USING FEATURE VECTORS: RIA, R, I, A, IA
FIGURE 37: VISUALIZATION OF DIRECTION OF THE ELEMENTS; (A) ARC, (B) LINES
FIGURE 38: SAMPLE GRAMMAR FOR CONTINUOUS RECOGNITION
FIGURE 39: GRAMMAR TEST WITH TESTING SET FOR 21 SUBJECTS

List of Tables

TABLE 1: TOOLS USED BY HTK
TABLE 2: FILES USED BY GT²K
TABLE 3: NAMING OF THE FILES
TABLE 4: EXAMPLE OF RESULTS USING CROSS-VALIDATION FOR FOUR GESTURES
TABLE 5: GRAMMAR FOR GESTURE
TABLE 6: EXTRACTING THE GOOD QUALITY DATA
TABLE 7: DIFFERENCE IN LABEL FILES FOR ISOLATED AND CONTINUOUS
TABLE 8: SAMPLE OF HRESULT.LOG
TABLE 9: SAMPLE CONFUSION MATRIX FOR CONTINUOUS RECOGNITION
TABLE 10: DIFFERENCE IN ERROR RATE FOR LABEL.MLF AND MERGED LABEL.MLF
TABLE 11: SAMPLE GRAMMAR FOR GESTURE CIRCLE IN GRAMMAR_CONT.TXT
TABLE 12: THE FILE GESTUREREC.TXT AND ITS FORMAT

Chapter 1: Introduction

1.1 Introduction

The main aim of this thesis is to contribute to the development of a universal remote control (URC) that can interact with any electronic device in a domestic home, or with devices in the medical sector, using one or two buttons and simple one-handed autonomous human gestures to communicate with the software applications installed in the device. The main objective of the thesis is to understand and generate autonomous gestures that can be recognized using a Hidden Markov Model (HMM). In order to fully understand HMMs, the thesis consists of experiments for collecting data, experiments for training and validating the HMMs, and recognition of gestures using the HMM models. Before describing these aspects, this chapter gives a summary of the organization and the motivation, a brief overview of the report, and related work.

Organization

In 1891, Anton and Gerard Philips established a company in Eindhoven (the Netherlands) to manufacture incandescent lamps and other electrical products. Today, it is one of the world's biggest electronics companies. Initially, the company focused on making carbon-filament lamps but later concentrated on other electronic products, which help us lead comfortable lives. The company has been ranked as the global leader in sustainability within the cyclical goods market. It is No. 1 in the global lighting, DVD and electric shaver markets, as well as No. 3 in the world for TV, video and audio products, computer monitors, consumer communications, set-top boxes and accessories.

Philips Research

Today, Philips Research has laboratories in six different countries, including the Netherlands, England, Germany, China and the United States. The research conducted from its founding to the present has led to a great number of patents and design rights, as well as many publications of technical and scientific papers.

High Tech Campus

The High Tech Campus (HTC) in Eindhoven is a well-known technology centre, with a diversity of high tech companies that work together on the development of new technologies, from idea and concept to prototyping and small series production. It is situated in the middle of what has been described as Europe's leading R&D region. This area, also known as the Intelligence Delta, stretches from Leuven and Aachen in the south to the important technical university towns of Delft and Enschede further north. Figure 1 below shows a picture of the HTC.

Figure 1: High Tech Campus - Eindhoven

1.2 Motivation

Pointing at an object is the most basic human interaction, and it is very important in human-machine interfaces. Similarly, gestures made with one's hand and arm can provide information: pointing to a chair can ask for permission to sit down, and pointing can indicate directions. The main goal of research on gesture recognition is to develop systems that can identify specific human gestures and use them to interact with technological devices. In order to understand gestures, one must ask questions such as: How do humans use gestures to communicate with and command other people? How is information encoded in gestures? How do engineering researchers define and use gestures? How can gestures be used as an interface between humans and machines?

1.3 What is gesture?

Humans habitually use gestures to interact with other humans. Gestures can be used for everything from pointing at an object to attract attention to conveying information about spatial and temporal characteristics [1]. Biologists define gesture as embracing "all kinds of instances where an individual engages in movements whose communicative intent is paramount, manifest, and openly acknowledged" [2].

Types of gestures

Gestures that are related to speech are called gesticulation, and gestures that are independent of speech are called autonomous gestures. Autonomous gestures have their own communicative language, such as American Sign Language (ASL), as well as motion commands. There are several broad categories of gesture, such as Semiotic (communicating meaning), Ergotic (manipulating objects) and Epistemic (groping). There are six types of semiotic gestures: Symbolic (arbitrary), Deictic (pointing), Iconic (descriptive), Pantomimic (use of invisible objects), Beat (indicating the flow of speech) and Cohesive (markers indicating related topics) [3].

Symbolic or arbitrary gestures are gestures that can be learned but are not common in a cultural setting; they can be very useful because they can be created specifically for device control. An example is the set of gestures used for directing aircraft at airports. Deictic gestures are used to point at important objects; they can be specific (referring to one object), general (referring to a class of objects), or functional (symbolizing intentions). An example is a simple hand gesture such as pointing to one's mouth when hungry.

Purpose of gesture recognition

Gesture recognition is a process through which a computer can recognize human gestures. Such interaction can make interfacing with computers more accessible and expressive, both for the physically impaired and for young children, who might find this type of interaction more natural. Gestures can be used in applications such as word processing, hand sign language, games, entertainment and educational approaches. There are other forms of gesture recognition than hand gestures, for example finger pointing to select or move objects, face tracking, eye motion and lip reading. Technology that implements gestures has the ability to change the way humans interact with computers by eliminating input devices such as joysticks, mice and keyboards. The main objective of the thesis is to understand and generate autonomous gestures that can be implemented using the URC. There are many questions that must be answered before the objective can be accomplished. For instance: What kinds of gestures are suitable for the URC? What kinds of gestures can be implemented in the URC that are easy and effective for both the user and the device? How can gesture recognition be realized in such a way that NO precise motion is needed to accomplish a task? How can these gestures be recognized? Which models or algorithms, such as the Hidden Markov Model (HMM), are both effective and easy to implement for recognizing gestures? How can gestures be implemented and tested easily? How can using gestures be made more user-friendly compared to traditional ways of interfacing? Will the user be able to create dynamic (their own) gestures? Which position (sitting, standing, etc.) does the user have to be in to use these gestures? What type of learning does the user need in order to learn these gestures?

1.4 Related Work

Today, there exist many image-based or device-based hardware techniques that can track gestures. For example, an image-based technique can detect gestures by capturing pictures of the user's hand motions via camera. The captured image is then sent to computer-vision software, which tracks the image and identifies the gesture. For instance, television sets that can be controlled by hand gestures instead of a remote control have been developed [4] [5] [6]. Basically, to turn the TV on, the user raises his open hand; the computer recognizes the gesture and turns the TV on. Device-based techniques such as instrumented gloves, styluses and other position trackers have been developed, which capture the movements and send signals so that the system can understand the gesture. For example, the Dexterous Handmaster [7], developed in 1987 and initially used to control a robot hand, was very accurate but not suitable for rapid movement. The Power Glove [7], developed in 1989 by Mattel, used resistive ink sensors for finger position plus ultrasonic tracking, and the Space Glove [7], developed in 1991, used plastic rings around the fingers. Other gloves include the 5DT Data GloveTM [7], SuperGlove [7], Pinch Gloves [7] and CyberGlove [7]. The latest in computer technology is the G-Speak Gestural Technology System [8], a glove which is faster and easier to use than a mouse and keyboard and with which one can move anything anywhere on the screen.

At present, there exist several products closely related to the hypothesis of this thesis. For example, the GyroPoint is a product of Gyration, Inc. in Saratoga, CA [9] [10]. The device can operate in two different modes: as a regular mouse, or in the air (3D). Another example is the Bluewand [11], a small pen-like device used to control Bluetooth-enabled devices by hand movements. A 6-axis accelerometer and gyroscope system detects the device's full orientation and movement in space. The Bluewand can be used with a variety of applications, such as a remote control for a TV set, cell phone, MP3 player, etc. Furthermore, the Australasian CRC for Interaction Design (ACID), led by Professor Duane Varan, has developed the world's first TV Mouse [12]. The device is a click-on that helps the user issue commands using gestures. ACID is currently fine-tuning the prototype so that it can recognize wide variation in gestures, including speed, extent and support for both left- and right-handed users. In addition, there is StrokeIt, by Jeff Doozan [13], an advanced mouse-gesture recognition engine and command processor. One can use this application to trigger various actions using mouse gestures, as well as create dynamic gestures. StrokeIt currently has more than 80 pre-configured mouse gestures and can easily be trained to recognize more. Finally, there is the Hidden Markov Model work on gesture recognition by Donald O. Tanguay, Jr. [14], which is closely related to this thesis and will be described in detail in Chapter 2.

1.6 Overview

The report is divided into 5 chapters. Chapter 1 gives the basic introduction and describes the organization, the motivation for using gestures for communication, and related work on gestures. Chapter 2 describes the background of the thesis together with its theoretical aspects. A detailed description of pattern recognition and the Hidden Markov Model, including its foundation and types of models, is given in section 2.1. Section 2.2 describes the Hidden Markov Model Toolkit (HTK) and its environment. Section 2.3 states how HMMs can be used for gesture recognition with the help of the Georgia Tech Gesture Toolkit (GT²k). Finally, the last section of chapter 2 describes how this thesis differs from previous theses using HMMs for gesture recognition. Chapter 3, the most important chapter of this report, describes the approach to gesture recognition. Section 3.1 gives a brief introduction and describes how the data was collected from 50 subjects using a device called the uwand, and how the data has been processed, for example by calculating the linear interpolation, deltas, angles, velocity, etc. After the data had been collected and processed, it was used for training and validation of the HMMs; this section also describes how the training and validation work. The next section describes how recognition is done using the trained and validated HMM models. Section 3.4, Evaluation, describes the 30 gestures used and all the experiments conducted. Finally, chapters 4 and 5 give the conclusion and future work, followed by the references and appendices.

Chapter 2: Hidden Markov Model

2.1 Background

Patterns can be seen in many areas over a space of time, and recognizing them can be very crucial in real life. For example, patterns appear in sequences of words in sentences, sequences of phonemes in speech, patterns of instructions given to a computer, etc. Taking a simple example from a lecture of the University of Leeds [22]: predicting the weather from a sample of seaweed by referencing folklore, soggy seaweed indicates rainy weather and dry seaweed indicates sunny weather, but for the intermediate state, damp, it is impossible to state the weather condition. However, since the state of the weather is not limited to the state of the seaweed, it is possible to examine the seaweed and predict the weather. It is also possible to state that the current weather status depends on the previous weather status. The weather example is a very simple problem where an HMM can help to predict accurate results. Before understanding how to recognize gestures, it is important to comprehend the problems that arise in HMM gesture recognition by looking at a popular example such as the weather example. This chapter is divided into three major areas: Pattern Recognition and the Hidden Markov Model (HMM), a brief description of HTK and GT²k, and the relation to previous work.

2.2 Pattern Recognition

Pattern recognition is defined as "the act of taking in raw data and taking an action based on the category of the data" [15]. In other words, it is a process to classify patterns of data based on previous statistical information about the patterns. "A complete pattern recognition system consists of a sensor that gathers the observations to be classified or described; a feature extraction mechanism that computes numeric or symbolic information from the observations; and a classification or description scheme that does the actual job of classifying or describing observations, relying on the extracted features." [15]. Classic applications concerned with pattern recognition are speech recognition, handwriting recognition, image recognition and gesture recognition.

There are two types of patterns: deterministic patterns and non-deterministic patterns. An example of a deterministic pattern is a set of traffic lights, where the state of the light changes from red to red/amber to green, and each state depends on the previous state. Such systems are very easy to understand because the current state always depends on the previous state. A non-deterministic pattern is more complicated. A good example is the weather with three states: Rainy, Cloudy and Sunny. Unlike the traffic light, the weather does not follow a fixed sequence; for example, sunny weather can follow rainy or cloudy weather and vice versa. However, there exists a pattern even in this type of unpredictable system. The most important concept of statistical pattern classification is stochastic models, whose parameters are automatically trained on large sets of data. All speech units such as words, syllables and phones are represented using HMMs, which are doubly stochastic models. In order to understand HMMs, it is important to have an overview of stochastic processes.
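As a small illustration of this distinction (a sketch with made-up numbers, not data from this thesis), a deterministic pattern corresponds to a transition matrix with a single 1 per row, while a non-deterministic pattern has rows that are probability distributions. In Matlab:

    % Deterministic: traffic light cycling red -> red/amber -> green -> amber -> red
    P_light = [0 1 0 0; 0 0 1 0; 0 0 0 1; 1 0 0 0];   % exactly one successor per state
    % Non-deterministic: weather (sunny, cloudy, rainy); each row is a distribution
    % (the sunny row follows the example in figure 4 below; the others are assumed)
    P_weather = [0.50 0.25 0.25;
                 0.25 0.50 0.25;
                 0.25 0.25 0.50];
    disp(all(abs(sum(P_weather, 2) - 1) < 1e-12))     % rows sum to one

Under the Markov assumption discussed next, the probability of an observed state sequence is simply a product of such transition entries, e.g. P_weather(1,2) * P_weather(2,2) for sunny-cloudy-cloudy given a sunny start.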

2.2.1 Stochastic process

"In the mathematics of probability, a stochastic process is a random function." [16]. The random function is defined over a domain such as a time interval or a region of space. Stochastic modeling can be applied to many real-life problems such as stock exchange rate fluctuations, speech, audio, video, images, etc. An HMM is a statistical model built on a finite state machine (FSM) with a fixed number of states, and it can help to identify the properties of patterns. The HMM is a "doubly embedded stochastic process" [17], where the underlying process is not directly observable but can be analyzed through another set of stochastic processes.

Markov Chain

"Any multivariate probability density whose independence diagram is a chain is called a Markov chain." [18]. Hence, if the variables are sampled left-to-right, each variable depends only on its neighbouring variables. In other words, a Markov chain is a sequence of random variables x1, x2, x3, ... with the Markov property, and it can be described using a directed graph showing the probability of going from one state to another. A good example of a Markov chain is an FSM composed of states (providing information about the past by reflecting the present), transitions (indicating a state change) and actions (activities to be performed). The basic property is that if a process is in state y at time t, then the probability that it moves to state x at time t + 1 depends only on the current state and does not depend on the time t [19]. In order to understand Markov processes, it is helpful to view an example published by Phil Blunsom [21] in his tutorial on HMMs. Below is an example of a Markov process with a simple model for a stock market.

Figure 2: Example of Markov Process [21]

The above model has three states, BULL, BEAR and EVEN, and three index observations, UP, DOWN and UNCHANGED, as well as probabilistic transitions between states. Given an observation sequence such as UP-DOWN-DOWN, it is easily possible to identify the state sequence that produced these observations, namely BULL-BEAR-BEAR, and to compute the probability of that sequence. Considering the previous weather example, it is possible to assume that the state of the model depends only on the previous states of the model, which is the Markov assumption. "A Markov process is a process, which moves from state to state depending (only) on the previous n states." [22]. The figure below shows first-order transitions between the states of the weather example.

Figure 3: First order transitions [22]

Looking at the figure above, it is visible that there are multiple transitions between states, and each transition has a probability; these can be represented in a transition matrix such as the one in the figure below.

Figure 4: Transition matrix of possible transition probabilities [22]

The above transition matrix shows that if it was sunny yesterday, the probability of today being sunny is 0.5, cloudy 0.25 and rainy 0.25, and similarly for the other rows. An important point about the Markov assumption is that "the state transition probabilities do not vary in time - the matrix is fixed throughout the life of the system." [22]. The discrete variables of a Markov process can help to identify the states of an HMM even where the output values are continuous. "A HMM is equivalently a coupled mixture model where the joint distribution over states is a Markov chain." [19]. However, in many cases some patterns cannot be analysed properly using a plain Markov process. Looking at the weather example, one may not see the weather itself, but the seaweed, which is related to the weather, can help to predict it. In this case there exist two sets of states, the observable states and the hidden states, and the challenge is to predict the weather by observing the observable states and the Markov process, without the help of the weather itself. Similarly, in speech recognition the sound that is produced is a result of the vocal chords, the size of the throat, the position of the tongue, etc.; each of these factors is important for the sound, yet what a speech recognition system detects is only the changing sound generated by these internal physical changes in the person speaking [22].

2.3 Hidden Markov Model

A statistical model which is a Markov process with unknown parameters is called a Hidden Markov Model (HMM); identifying the hidden parameters from the observable parameters is the biggest challenge of HMMs. Once the hidden parameters are extracted from the model, they can later be analysed to extract better HMM models. The HMM is the simplest dynamic Bayesian network, which is a form of probabilistic graphical model. In a normal Markov model all the states are visible, and therefore the parameters are the state transition probabilities. In an HMM, however, the states are not directly visible; only variables related to the states are visible, and each state has a probability distribution over the possible output tokens.
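To make the hidden/observable split concrete, the weather example can be simulated with a small Matlab sketch. The sunny transition row follows figure 4; all other numbers are assumed purely for illustration:

    % Hidden states: 1 = sunny, 2 = cloudy, 3 = rainy
    % Observations:  1 = dry,   2 = damp,   3 = soggy seaweed
    A   = [0.50 0.25 0.25; 0.25 0.50 0.25; 0.25 0.25 0.50];  % transitions (rows 2-3 assumed)
    B   = [0.60 0.30 0.10; 0.25 0.50 0.25; 0.05 0.35 0.60];  % P(observation | state), assumed
    pi0 = [1/3 1/3 1/3];                                     % initial distribution, assumed
    T = 10; q = zeros(1, T); o = zeros(1, T);
    q(1) = find(rand < cumsum(pi0), 1);
    o(1) = find(rand < cumsum(B(q(1), :)), 1);
    for t = 2:T
        q(t) = find(rand < cumsum(A(q(t-1), :)), 1);  % the hidden chain evolves unseen
        o(t) = find(rand < cumsum(B(q(t), :)), 1);    % only the observation is visible
    end
    disp(o)   % an observer sees the seaweed sequence o, never the weather sequence q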

The example in figure 2 can be extended to represent an HMM, as shown in figure 5 below. The new model emits observation symbols with finite probabilities, which makes it more expressive. The main difference is that, given the observation UP-DOWN-DOWN, it is now impossible to state that it was produced by BULL-BEAR-BEAR, since the state sequence stays hidden. However, it is still possible to calculate the probability of the produced sequence.

Figure 5: Example of Hidden Markov Model [21]

Similarly, the hidden and observable states can be seen for the weather example in the figure below. "The connections between the hidden states and the observable states represent the probability of generating a particular observed state given that the Markov process is in a particular hidden state." [22]

Figure 6: Hidden and observable states in the weather [22]

Similar to the transition matrix defined by the Markov process, another matrix, called the confusion matrix, can be created with the probabilities of the observable states given a particular hidden state, as shown in the figure below.

Figure 7: Confusion matrix [22]

Formal definition of HMM

Before discussing the formal definition of the HMM, it is crucial to mention that an HMM is a stochastic finite state automaton (SFSA) built from a finite set of possible states Q = \{q_1, q_2, \ldots, q_k\}, where each state is associated with a specific probability density function (pdf). Figure 8 below shows an example of an HMM with three states.

Figure 8: A three-state Hidden Markov Model (HMM) [20]

An HMM has a set of states q_i, an emission probability density p(x_n \mid q_i) associated with each state, and transition probabilities p(q_j \mid q_i) for the transition from state q_i to state q_j. The formal definition of an HMM is

    \lambda = (A, B, \pi)   [21] (1)

A is the transition array, storing the probability of state j following state i:

    A = [a_{ij}],  a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i)   [21] (2)

B is the observation array, storing the probability of observation k being produced from state j, independently of t:

    B = [b_j(k)],  b_j(k) = P(o_t = v_k \mid q_t = s_j)   [21] (3)

\pi is the initial probability array:

    \pi = [\pi_i],  \pi_i = P(q_1 = s_i)   [21] (4)

Let S be the state alphabet set and V the observation alphabet set:

    S = (s_1, s_2, \ldots, s_n)   [21] (5)

    V = (v_1, v_2, \ldots, v_m)   [21] (6)

Let Q be a fixed state sequence of length T, and O the corresponding observations of length T:

    Q = q_1, q_2, \ldots, q_T   [21] (7)

    O = o_1, o_2, \ldots, o_T   [21] (8)

Two assumptions are made in the model of equation (1), called the Markov assumption and the independence assumption. The Markov assumption states that the current state depends only on the previous state:

    P(q_t \mid q_1^{t-1}) = P(q_t \mid q_{t-1})   [21] (9)

The independence assumption states that the output observation at time t depends only on the current state:

    P(o_t \mid o_1^{t-1}, q_1^t) = P(o_t \mid q_t)   [21] (10)

2.4 Problems and Algorithms

Given an HMM, there are three problems that need to be overcome: finding the probability of an observed sequence under the model, called evaluation; finding the sequence of hidden states that most probably generated an observed sequence, called decoding; and fitting an HMM to a given sequence of observations, called learning. Each problem is solved by a different algorithm:

    Evaluation - calculating the probability of a particular observation sequence: Forward algorithm
    Decoding - calculating the most likely sequence of hidden states: Viterbi algorithm
    Learning - calculating the most likely set of state transition and output probabilities: Baum-Welch algorithm

Probability of the observation sequence (Evaluation)

One of the important problems that can be solved, given an HMM and a sequence of observations, is calculating P(O \mid \lambda), the probability of the observation sequence. Using this value it is possible to judge how well an HMM model predicts the observation sequence, which gives a better chance of selecting a suitable HMM model. When dealing with many HMMs, it is common to ask which HMM most probably generated a given sequence. For instance, having individual models for Summer and Winter for the seaweed, "it is crucial to determine the season on the basis of a sequence of dampness observations" [22]. Using the forward algorithm it is possible to calculate the probability of an observation sequence given a particular HMM and to select the most probable model. Denote by \alpha_t(j) the forward probability of the partial observation sequence o_1, o_2, \ldots, o_t with the process in state s_j at time t. For an observation sequence Y^{(k)} = y_{k_1}, \ldots, y_{k_T}, the initial probabilities are

    \alpha_1(j) = \pi(j) \, b_{j k_1}

Hence, for each step in time (t = 1, 2, \ldots, T-1), the partial probabilities are computed using the formula

    \alpha_{t+1}(j) = \Big( \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \Big) \, b_{j k_{t+1}}

which is the product of the appropriate observation probability and the sum over all possible routes to that state; each partial probability (at time t > 2) is calculated from all the previous states [22].
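This recursion is easy to state in code. Below is a minimal Matlab sketch for a discrete-observation HMM, together with the Viterbi recursion of the next subsection, which has exactly the same trellis structure with the sum replaced by a maximisation plus a back-pointer. Here A is the N-by-N transition matrix, B the N-by-M emission matrix, pi0 the initial distribution and obs a vector of observation symbol indices; this sketches the textbook algorithms, not code from this thesis:

    function p = forward_prob(A, B, pi0, obs)
        % P(O | lambda) by the forward algorithm
        N = size(A, 1); T = numel(obs);
        alpha = zeros(T, N);
        alpha(1, :) = pi0(:)' .* B(:, obs(1))';                 % alpha_1(j) = pi(j) b_j(o_1)
        for t = 2:T
            alpha(t, :) = (alpha(t-1, :) * A) .* B(:, obs(t))'; % sum over all routes, then emit
        end
        p = sum(alpha(T, :));
    end

    function [path, p] = viterbi_path(A, B, pi0, obs)
        % Most likely hidden state sequence: max instead of sum, plus back-pointers
        N = size(A, 1); T = numel(obs);
        delta = zeros(T, N); psi = zeros(T, N); path = zeros(1, T);
        delta(1, :) = pi0(:)' .* B(:, obs(1))';
        for t = 2:T
            [m, arg] = max(repmat(delta(t-1, :)', 1, N) .* A, [], 1);  % best predecessor
            delta(t, :) = m .* B(:, obs(t))';
            psi(t, :) = arg;
        end
        [p, path(T)] = max(delta(T, :));
        for t = T-1:-1:1
            path(t) = psi(t+1, path(t+1));   % trace the back-pointers
        end
    end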

Finally, summing all the final partial probabilities gives the probability of the observation sequence for a given HMM. Using this algorithm it is easy to decide which HMM best explains a given observation sequence.

Discovering the Hidden State (Decoding)

In many cases it is essential to find the hidden states that generated the observed output. In the seaweed and weather example, the seaweed is the observed output and the weather provides the hidden states. To find the hidden state sequence behind an observation sequence, the Viterbi algorithm is used, which is similar to the forward algorithm; the only difference is that in the forward algorithm the transition probabilities are summed, whereas in the Viterbi algorithm the probability is maximised at each step. "The Viterbi algorithm is another trellis algorithm, which is very similar to the forward algorithm, except that the transition probabilities are maximized at each step, instead of summed." [21]. The probability of the most probable state path for a given observation sequence is found by initializing the probability calculations and then estimating the most probable route to each next state. "This can be estimated by considering all products of transition probabilities with the maximal probabilities already derived for the preceding step." [21]. The maximisation detects the most likely route to the current position. By utilizing time invariance, the problem's complexity can be reduced, and "the algorithm keeps a backward pointer for each state (t > 1), and stores a probability with each state." [21]. Because the Viterbi algorithm considers the entire sequence before deciding on the most likely final state, it can read data through isolated noise garbles. The algorithm also computes the most likely state sequence efficiently by using recursion to reduce the computational load.

Estimating the Model Parameters (Learning)

The hardest problem to solve with an HMM is to take a sequence of observations and the hidden states and fit the most probable HMM [21]. Using the Baum-Welch algorithm it is possible to find the unknown parameters, making use of the forward-backward algorithm, when the matrices A and B are not directly measurable. "The easiest solution for creating a model is to have a large corpus of training examples, each annotated with the correct classification." [21]. This annotated-corpus approach is used, for example, in PoS tagging; in other words, the model parameters are determined using maximum likelihood estimation. The main problems solved with HMMs, as mentioned before, are evaluation and decoding of the HMM model. Nevertheless, in many cases the parameters cannot be measured directly but have to be estimated, which is called learning. When the sequences of observations from a given set are known, the forward-backward algorithm can be used for this estimation. This algorithm is, however, the hardest to comprehend compared to the forward and Viterbi algorithms. In brief, the forward-backward algorithm makes an initial guess of the parameters and later makes a proper estimation of the parameters by reducing the errors; in this way it performs gradient descent, looking for a minimum of an error measure.
"It derives its name from the fact that, for each state in an execution trellis, it computes the forward probability of arriving at that state (given the current model

approximation) and the backward probability of generating the final state of the model, again given the current approximation." [22]. Using recursion, both the forward and the backward probabilities can be calculated efficiently.

2.5 Why use HMM?

There are some characteristics to consider before an HMM is used: whether the problem can be phrased as classification, whether the observations of the problem are ordered, and whether the observations follow a grammatical structure (the latter being optional). Meanwhile, HMMs have been fully adopted in speech recognition and are slowly beginning to capture the attention of the vision community. In 1992, Yamato and his colleagues used discrete HMMs "to recognize image sequences of six different tennis strokes among three subjects" [23], and since then many people have contributed to gesture recognition using HMMs. This thesis looks at the two works most closely related to its gesture recognition; these are described later in this chapter. Before going into the details of how HMMs can be utilized using tools such as HTK and GT²k, the advantages and disadvantages of HMMs are briefly discussed.

Advantages and Disadvantages

The reasons HMMs are so popular are that statisticians are comfortable with them, that the training and verification processes are flexible to manipulate, and that the results and processes lend themselves to mathematical and theoretical analysis. The HMM is a very powerful modeling tool compared to many statistical tools. Using HMMs it is possible to make use of individually distinct functional units (modularity). HMMs provide transparency of the model, which makes a model easy to read and understand, and they allow prior knowledge to be built into the architecture and used to constrain the training process. As for the disadvantages of HMMs: due to multiple local maxima, the trained model may not reflect a truly optimal parameter set; an HMM is only as good as its training set, and having many training samples does not by itself guarantee good accuracy; and the method is slow compared to other methods of recognition. The next section describes the tools used to make use of HMMs.

2.6 Hidden Markov Model Toolkit (HTK)

HTK was developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED) for building large vocabulary speech recognition systems. Today, Microsoft holds the copyright to the original HTK source code, but anyone is allowed to modify it according to his or her needs. The first version of HTK was developed in 1989, and many versions have been developed since; in this project, HTK version 3.2 is used. Basically, it is a portable toolkit that helps to manipulate HMMs. Although HTK is used mainly for speech recognition, the tool has been used in many other applications such as speech synthesis, character recognition, DNA sequencing, sign language recognition and gesture recognition. HTK contains a set of library modules developed in the C language. The tool supports HMMs with both continuous density mixture Gaussians and discrete distributions, and can be used to build complex HMM systems [24]. HTK has been used for years by researchers from all over the world, and Microsoft therefore decided to make the core HTK toolkit available again and licensed the software back to CUED so that it could distribute and develop the software [24]. Since September 2000, HTK has been available for free and can be downloaded from the CUED Web site.

2.6.1 HTK Environment

The HTK toolkit was developed for Unix platforms, but it is possible to use HTK on Windows. Below is an overall picture of the architecture of HTK.

Figure 9: Software structure of a typical HTK [25]

Even though not all library modules are needed for gesture recognition, it is a good idea to be briefly acquainted with each module in order to differentiate what is and is not used for gesture recognition. The library modules implement all the HTK functionality, providing the tools with interfaces to the outside world as well as a central resource of regularly used functions. Looking at the figure above, user input/output and the interaction between HTK and the OS are maintained by the library module HShell. HMem handles memory management, and HMath handles mathematical calculations. HLabel provides an interface for label files, HModel handles HMM definitions, and HRec contains the main recognition processing functions. There are 33 tools offered by HTK for speech recognition, listed below:

Cluster, HBuild, HCompV, HCopy, HDMan, HERest, HHEd, HInit, HLEd, HList, HLMCopy, HLRescore, HLStats, HParse, HQuant, HRest, HResults, HSGen, HSLab, HSmooth, HVite, LAdapt, LBuild, LFoF, LGCopy, LGList, LGPrep, LLink, LMerge, LNewMap, LNorm, LPlex

Table 1: Tools used by HTK

Use of HTK

Even though HTK was developed mainly for speech recognition, other applications can benefit from it. The four main basic uses of HTK are data preparation, model training, pattern recognition and model analysis. The data preparation part of HTK is not needed for gesture recognition, but distinguishing between gesture data and speech data is necessary. As mentioned before, HTK was made for speech recognition, so its data are in audio format; these data are simply vectors with components of 2-byte integers or 4-byte floating-point numbers. How data are prepared for gesture recognition is explained later in this chapter. HMM models are defined using a markup-like definition language, similar in appearance to HTML. For example, the models have a beginning (<BeginHMM>) and an ending (<EndHMM>), and

a model is created for each gesture. Below is an example of an HMM model definition. <VecSize>, <Mean> and <Variance> define the feature vector: the values under <Mean> and <Variance> are the mean observation vector and the diagonal of the covariance matrix. <NumStates> is the number of states of the HMM, <State> indicates the current state, and <NumMixes> is the number of Gaussian distributions per state. <Mixture> gives a distribution's ID and weight, and at the end there is <TransP>, the transition matrix (A).

Figure 10: Example HMM Models [26]

Model training and retraining are done by HInit and HRest respectively, and a label file is generated to show what is happening in the data sequence. During training and recognition there can be many test files and label files; hence all the label files are condensed into one file called the master label file (MLF). The main pattern recognition is done by HVite, and the model analysis is done by HResults. The next section describes the tool that helps make use of HTK for gesture recognition.

2.7 Georgia Tech Gesture Toolkit (GT²k)

GT²k is a toolkit that makes the use of the HTK library modules easy. The HTK toolkit was made for speech recognition and can be very difficult to use for people who are not familiar with speech. GT²k helps with data preparation, HMM creation and training, HMM validation and, most of all, recognition. At first GT²k must be trained on known data provided by the user; after training, GT²k can perform gesture recognition. However, before GT²k can perform gesture recognition, the input data must be processed to extract the salient characteristics (feature vectors); a detailed explanation of the feature vectors is given in chapter 3. After the data has been gathered and the feature vectors extracted, GT²k can train on the data. GT²k creates the HMM layout for the training process, and using a simple grammar GT²k can recognize gestures. In other words, GT²k labels the data, trains the HMMs on the chosen layout and prepares them for recognition. After training the HMM model, GT²k can validate it using cross-validation or leave-one-out validation, which are explained in more detail in chapter 3. Using the validation process, GT²k calculates the accuracy level of the model, and if the result is not satisfactory, a different HMM layout can be tested to improve the accuracy level.

GT²k must be installed on a UNIX system (Linux Fedora 5). Once it is installed, the utils/new_project.sh script can be used to create a new project. The script needs three arguments: the location where the project will be stored, the size of the feature vector, and a link to GT²k. A directory with useful files and scripts is then generated. Once the project directories are generated, a script in the scripts/ subdirectory can be edited for the different specifications needed for training and validating.

Connecting GT²k with HTK

Once GT²k is installed, it creates a directory /bin/bin.linux where it keeps all the HTK files.

Figure 11: Communication between GT²k and HTK

There are 33 items in the folder /bin/bin.linux; the most commonly used are HVite, HResults, HParse, HCompV, HRest, HInit, HERest and HCopy. The table below describes each of them.

HVite: a general-purpose Viterbi word recognizer; it matches gesture files against a network of HMMs and outputs a transcription for each (label files).
HResults: the HTK performance analysis tool; it reads the label files and compares them with the corresponding transcription files.
HParse: generates word-level lattice files.
HCompV: computes the global mean and covariance of a set of training data; it is also used to initialize the parameters of an HMM.
HRest: performs Baum-Welch re-estimation of an HMM using a set of observation sequences.
HERest: performs linear transforms of the HMMs by embedded training with the Baum-Welch algorithm.
HCopy: copies data files to a designated output file.
HInit: provides initial estimates for the parameters of an HMM; it works by repeatedly using Viterbi alignment.

Table 2: Files used by GT²k

Process of training, validation and recognition

Using the script new_project.sh provided by GT²k it is possible to generate a project to test each HMM model. Each project contains the folders data, ext, models, scripts, testsets and trainset, the text files datadir, datafiles, commands, grammar, hmm and labels.mlf, and one main script, options.sh. The training process is initialized using the script train.sh found in the scripts folder. These scripts, along

with others, are copied into the project folder automatically by the script new_project.sh. The script train.sh is invoked with one argument, the options file: train.sh scripts/options.sh. Once the training is complete, the results of the HMM performance are generated as a confusion matrix and stored in hresults.log. Before the training can start, an HMM definition must be generated using the script gen_hmmdef.pl, which requires three arguments: the number of states, the observation vector size and the number of states that can be skipped. For example, the following command generates an HMM definition with 5 states, 7-dimensional observations and 2 skippable states: gen_hmmdef.pl n5 v7 s2. The transition probability matrix is generated automatically by gen_hmmdef.pl; each row of the matrix corresponds to a state of the HMM, and the columns represent the transitions out of that state. To retrieve the description of the HMM models, one can search the models folder and select the HMM model with a name similar to hmm0.3, depending on the number of iterations. After training is complete, new data can be used for recognition. The script recognize.sh, found in the scripts/ directory of a project, is used for recognition. Four arguments need to be passed: the data files, a file to store the recognition results, options.sh and the trained HMM model. For the HMM model one can use newmacros, and the output will be in the form of a Master Label File (MLF), with gestures ranked by likelihood score. For the training and validation in this thesis, a huge number of projects had to be generated to test different HMM parameters for a good accuracy level. To conduct this kind of testing, it was important to write a UNIX script that automatically generates each project and conducts training and validation. A sample of the script is shown in the appendix.
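For concreteness, a full recognition call might then look like the line below; the argument order is assumed from the description above rather than taken from the GT²k documentation:

    recognize.sh datafiles results.mlf scripts/options.sh models/newmacros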

2.8 Relation to previous work

Much research today focuses on learning and recognizing image behaviour related to computer vision. Gesture recognition can be conducted in many ways, such as neural networks, template matching, dictionary lookup, linguistic matching, ad hoc methods, etc. However, not much research exists that studies the recognition process using HMMs. Two projects were selected that are very closely related to this project: HMM for Gesture Recognition by Donald O. Tanguay Jr. [27] and Hidden Markov Model for Gesture Recognition by Jie Yang and Yangsheng Xu [28]. Although the main similarities are the use of HTK and two-dimensional mouse gestures, many differences also exist that show the uniqueness of all three projects. The differences and similarities between the projects are described below.

Donald O. Tanguay Jr. [27] conducted both simple experiments with poor performance and experiments that gave 100% accuracy. He demonstrated that many factors contribute to success in recognition using HMMs. In his research, a simple tool was created to collect mouse coordinates as well as add and modify gestures. Unlike the data format of the current project, his data consist of only three features, window-scale position with time in milliseconds, velocity, and a single vector of position and velocity, sampled at 20 Hz and stored in a text file. His thesis does not use a grammar, in order to avoid dependency on a specific domain, and he also did not collect enough data to train and validate the HMMs; he used only 5 gestures, which limits the HMMs. Having a lot of data does not by itself mean an HMM will give better recognition, but it does mean the HMM can learn from a wide variety of data samples.

Jie Yang and Yangsheng Xu [28] converted gestures into sequential symbols, unlike the current thesis, which uses geometric features. They developed a prototype system that achieved 99.78% accuracy for isolated recognition using only 9 gestures. They also analyzed other forms of gesture recognition and concluded that the HMM is the best available way to do gesture recognition. Their approach is to define meaningful gestures in terms of HMMs, collect data (two-dimensional mouse gestures), train the HMMs using the data and evaluate gestures using the trained models. They also discuss the difference between isolated and continuous gesture recognition, stating that HMMs are the only feasible way to recognize continuous gestures. They further give details of computational methods for HMM-based systems, such as scaling, logarithmic computation, thresholding, multiple independent sequences and initialization [28]. They collected 150 samples of data for each gesture and kept 100 for training and 50 for testing. They showed that the more training samples, the better the recognition rate, and that the HMM is a feasible parametric model for gesture recognition. They also analyzed online and offline handwriting recognition.

The main difference between both of these theses and the current thesis is the use of GT²k with HTK, which makes life much easier and helps to focus on other aspects, such as using different feature vectors, splitting, skipping, etc., rather than attending to how HMM works with HTK, given the differences between gesture recognition and speech recognition.

Chapter 3: Approach

3.1 Approach

This chapter is the main part of the thesis; it explains in detail how HMMs are used to train and validate on the data in order to identify the best HMM model. Section 3.2.1 describes the method of collecting data, the tool used to collect it and the processing of the data. This section also describes the data-collection experiment with the 50 people who contributed, together with an analysis of the 30 gestures used and of the device (uwand) used by the subjects. Section 3.3 explains in detail how training and validation are done using the data, computing the features and filters, along with the results and the process of selecting a model for testing. The last section, 3.5, describes grammar-constrained recognition.

3.2 Method

Before recognizing gestures, one has to collect sample data, and after a long search a decision was made to modify a tool written in Visual C++. The tool, initially created by Mr. Konstantin Boukreev [29] in 2001 for mouse gesture recognition using neural networks, is used to collect the initial data. However, it was modified in such a way that all functions related to neural networks were removed; the only thing that remains similar is the interface of the application, which can be implemented using the default functionality provided by Microsoft VC++.

The device, the uwand, is an easy way to interact with different devices such as TVs, lights, games, media players, medical devices, etc., by pointing and clicking, similar to a mouse. The device has a receiver, which detects and decodes the angle at which the device is pointing; this is how it is possible to put a cursor on the screen. However, a device such as a light does not need a cursor, so the light or its colour can be changed depending on the angle at which the device is pointing. The uwand is similar to a PC mouse except that it works in 3D and can be used with any device; its maximum range is about 5 meters. Compared to a technology such as the Nintendo Wii, the uwand is very different: the Wii uses a gyroscope and accelerometer, which gives a different range of interaction, and does not use the pointing method.

3.2.1 Collecting Data

The basic procedure for collecting the data is as follows. At first the user gets to practice the movements using the uwand; once ready, they write down their name, age and gender and start their session, drawing each of the 30 gestures 10 times, 300 gesture drawings in total. Each person spends an average of 10 minutes per session. A folder is created using their last name, in which a file is generated for each gesture drawing with the raw coordinates (x, y, t), where (x, y) is recorded every 16 milliseconds. Once the initial files are generated, it is possible to compute a linear interpolation and use it for calculating all the other feature vectors.

3.2.2 The tool

The main objective of the tool is to draw and record the pointer path produced with the 3D device uwand. Figure 12 below shows the interface of the tool.

Figure 12: The Tool Interface

The interface has three windows: the one in the upper-right hand corner shows the gesture to be drawn, the second window shows the gesture drawn by the user, and the third window shows the number of points in the path, the coordinates of the path, the time in milliseconds, the total number of gestures and the number of gestures completed by the user.

3.2.3 Extracting the feature vector

Each time a gesture is created, its file is named in such a way that it can be tracked later. Table 3 below shows an example of how each file is named.

Original name: 01_Circle (image)
Normal data: 01_Circle_01 (position)
Interpolation (I): 01_Circle_01_I

Table 3: Naming of the files

Linear interpolation is used to fill up the gaps between the (x, y) coordinates of the position vector on a 20-millisecond grid. Linear interpolation is a simple form of interpolation, a process that generates new coordinates from a discrete set of coordinates; it is the simplest method to implement but may not be very accurate. To calculate the linear interpolation of the position vector, the following formulas are used:

    y = y_0 + \frac{(x - x_0)}{(x_1 - x_0)} (y_1 - y_0)   (1)

    x = x_0 + \frac{(y - y_0)}{(y_1 - y_0)} (x_1 - x_0)   (2)
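The Matlab sketch below ties this resampling together with the derived features and the two filters described in the remainder of this section; the dummy input trace, kernel width, filter length and decay constant are illustrative values only, and the running-average (R) and start-offset (S) features are omitted for brevity:

    % One recorded gesture: (x, y) every 16 ms (dummy random walk for illustration)
    tr = 0:16:320;
    xr = cumsum(randn(1, numel(tr)));
    yr = cumsum(randn(1, numel(tr)));
    % Resample onto a 20 ms grid; interp1 implements equations (1)-(2)
    ti = 0:20:tr(end);
    xi = interp1(tr, xr, ti, 'linear');
    yi = interp1(tr, yr, ti, 'linear');
    % Derived features, cf. equations (3)-(5) below
    dx = diff(xi); dy = diff(yi);          % deltas
    v  = sqrt(dx.^2 + dy.^2);              % velocity per 20 ms step
    a  = atan2(dy, dx);                    % angle (atan2 avoids division by zero)
    dv = diff(v); da = diff(a);            % velocity delta and angle delta
    % Gaussian smoothing kernel (cf. figure 13) and exponential decay filter
    g = normpdf(-4:0.5:4, 0, 0.5); g = g / sum(g);
    xs = filter(g, 1, xi);                 % smoothed trace (causal)
    lambda = 0.6;                          % one value from the tested range 0.1 to 0.8
    k = exp(-lambda * (0:14)); k = k / sum(k);
    xd = filter(k, 1, xi);                 % decay filter: recent samples weigh most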

For the delta of the interpolation, the differences between consecutive interpolated x values and consecutive interpolated y values are calculated, i.e. the change per interpolation step. The deltas are computed with the following formulas:

    \Delta x = x_1 - x_0   (3)

    \Delta y = y_1 - y_0   (4)

In order to add more vector dimensions, it was decided to calculate the velocity (V) and the arctangent angle for each coordinate point, the angle adding one more vector dimension:

    A = \tan^{-1}(y / x)   (5)

A decision was also made to calculate the difference in angle (angle delta, a) and the difference in velocity (velocity delta, v), and to calculate the running average (R) for both interpolated x and y: the running average of the interpolated x and y is taken and subtracted from the interpolated x and y. In addition, the initial interpolated x and y are subtracted from the rest of the interpolated values of x and y (S).

For further processing of the data, an implementation of Gaussian smoothing is needed. The Gaussian distribution underlying Gaussian smoothing is also called the normal distribution and is a probability density function (pdf): statisticians and mathematicians use the term normal distribution, physicists refer to it as the Gaussian distribution, and social scientists call it the bell curve because of its shape. Figure 13 illustrates how a normal probability density function can be calculated in Matlab 7:

    x = [-4:0.5:4];
    f = normpdf(x, 0, 0.5);

Figure 13: Example of normal pdf density

The variable f holds the normal pdf density over the range of x (-4 to 4) with an interval of 0.5 and parameters µ = 0 and σ = 0.5; normpdf takes two parameters, the location parameter (µ) and the scale parameter (σ). In order to evaluate the best filtering range, it was necessary to test a variety of normal pdf densities. Similar to Gaussian filtering, which reduces noise, there are other filtering processes, such as the exponential decay filter. Exponential decay is a form of rapid decrease of a quantity; for example, kerosene is purified for use as jet fuel by passing it through a clay filter that removes pollutants. Exponential decay can be used to represent a number of different things; it is not linear, and the decrease is rapid at first but not constant. It is often used to describe population decrease, or increase in the case of exponential growth. The filter here is defined by lambda (λ), which ranges from 0.1 to 0.8, with the square of lambda running from 2 to 30 in reverse. Figure 14 below shows a sample of the decay filter.

Similar to Gaussian filtering, which reduces noise, there are other filtering processes, such as the exponential decay filter. Exponential decay is a form of rapid decrease of a quantity; an example is purifying kerosene for use as jet fuel, where the kerosene is passed through a clay filter to remove pollutants. Exponential decay appears in many other areas and can be used to represent a number of different things. It is not linear: the decrease is rapid at first but not constant. It is often used to describe population decrease (its increasing counterpart depicts exponential growth). The shape of the filter is defined by lambda (λ), which runs from 0.1 to 0.8, with filter lengths ranging from 2 to 30. Figure 14 shows a sample of a decay filter.

Figure 14: Sample of decay filter

3.2.4 Experiment for collecting data

The experiment for collecting data took approximately one week. All the male and female candidates were asked to sit in front of a TV with an aspect ratio of 16:9 and draw specific gestures using the uWand device. Each person took about 10 to 15 minutes to complete the 30 gestures (10 repetitions of each gesture). After completing their session, they were asked to fill in a form answering three basic questions, as well as to draw a gesture of their own. The questions were:

Was the instruction easy to follow? (1 to 5)
Was the gesture HARD to perform? (1 to 5)
Was the device comfortable to use? (1 to 5)

In total there were 21 female and 30 male participants, of whom about 10 were left-handed and the rest right-handed. Figure 15 shows the participants' average scores on the questionnaires they filled in.

Figure 15: Difference between male and female opinion.

From the figure above it can be concluded that both male and female participants found the instructions, the tool itself and the displayed gesture images easy to follow. However, of the 30 gestures drawn by the users, a little more than 50% were found difficult to draw, and the device itself was found uncomfortable. The reason users found the gestures hard to draw is that the device was hard to control and uncomfortable; subjects older than 50 years found the device painful due to certain health issues.

Group 1 (Small): Up, Down, Right, Left, Back_Forth, Vee_Up, Vee_Down, Up_Down, Down_Right
Group 2 (Medium): Down_Left, Cross, N, Z, W, Rectangle
Group 3 (Large): Circle, Circle_Right, Circle_Left, Omega, #, Down_Up_Circle, Question, Circle_Down, Phi, Down_Circle_Left, Down_Circle_Right, Left_Circle_Left, 8, Left bracket, Right bracket

Figure 16: The 30 gestures

3.2.5 Training and Validation

The training procedure is fully automatic: it returns results and verifies models that can later be used for recognition. GT²k abstracts the training process so that the user can avoid dealing with complicated algorithms. Once the processing and preparation of the data is complete, the model can be trained after selecting one of two training/validation methods. The process requires dividing the data into two sets: a training set and a validation set. The training set is used to train the HMM models, whereas the validation set is used to verify them.

3.3 Experiments

Due to the long duration of the leave-one-out validation method, a decision was made to conduct all training and validation experiments using cross-validation, since many simulations were needed to select the best model and each cross-validation test takes approximately 45 minutes to 1 hour or more, depending on the number of HMM states. To run the tests more efficiently and effectively, a UNIX script was generated that automatically creates a project, trains and validates the model and, once complete, creates and runs the next project, and so on. This script, along with the others, is described in detail in Appendix A.

As mentioned before, there were 50 candidates, and each candidate provided 300 separate data files. Each data file is labelled in such a way that it is possible to search and test the files separately using a UNIX script.

3.3.1 Types of experiment

GT²k provides two types of training/validation techniques: cross-validation and leave-one-out validation. Cross-validation randomly divides the data into two sets, 66.6% as training set and 33.3% as validation set. To select one of these techniques, the script options.sh needs to be changed. The main setting in options.sh for HMM training is TRAIN_TEST_VALIDATION, which can be set to CROSS or LEAVE_ONE_OUT. The rest of the settings in options.sh are default settings for the project and need not be changed unless a specific training requires it. Figure 17 below shows the overall process for cross-validation.

Figure 17: Process for cross-validation

For leave-one-out validation, one data sample is selected randomly for the validation set, and the rest of the data is used for the training set.

The process for leave-one-out validation can take up to one day or even more, depending on the number of states of each HMM. Figure 18 below shows the overall process for leave-one-out validation.

Figure 18: Process for Leave-one-out validation

The figure above shows that the training/validation phase is repeated for every change of the data set; each iteration contributes to the overall performance of the HMM models. Since both leave-one-out and cross-validation select data randomly, a decision was made to also assign a FIXED amount of data for training and testing. For training, all files numbered run_0[1,2,3,5,6,7,9] were assigned, and for testing all files named run_0[4,8,10] were used. Unlike cross-validation, FIXED validation selects the data deterministically instead of randomly. Figure 19 below shows the overall process of FIXED validation.

Figure 19: Process for FIXED validation

GT²k provides a metric for accuracy that incorporates substitution (S), insertion (I) and deletion (D) errors, the total number of gestures classified (N) and the total number of gestures correctly recognized (H). A substitution error arises when the system classifies a gesture incorrectly, and an insertion error occurs when the system reports a gesture that was never performed. A deletion error happens when the system does not recognize a gesture within a series of gestures. For isolated gestures, the values of D and I are always zero; these errors only occur during continuous recognition. Once HTK has generated the above values, the accuracy of the overall system is calculated using the following formula:

Accuracy = ((N - S - D - I) / N) * 100

For example, with N = 100, S = 3, D = 1 and I = 2, the accuracy is 94%. The overall performance is recorded in the form of a confusion matrix, as shown in the following table.

Overall Results
SENT: %Correct=0.00 [H=0, S=0, N=0]
WORD: %Corr=0.00, Acc=0.00 [H=0, D=0, S=0, I=0, N=0]

Confusion Matrix
        Down  Left  Righ  Up    Del  [%c / %e]
Down    0     0     0     0     0    [0.0/0.0]
Left    0     0     0     0     0    [0.0/0.0]
Righ    0     0     0     0     0    [0.0/0.0]
Up      0     0     0     0     0    [0.0/0.0]
Ins     0     0     0     0
=============================================

Table 4: Example of results using cross-validation for four gestures

3.3.2 Grammar for isolated gesture recognition

GT²k needs a grammar to perform training and recognition; it can be generated automatically or specified manually in a text file. If it is generated automatically, the grammar will only allow gestures to be recognized one at a time; continuous recognition needs a special grammar that is set manually. The grammar is built from variables and commands, where variables are specified using the $ symbol. Commands are simple text strings associated with the variables, and each variable may contain more than one command, separated by the pipe character. Table 5 below shows an example of the grammar used for testing isolated gestures.

$gesture = Back_Forth | Circle | Circle_Down | Circle_Left | Circle_Right | Cross | Down | Down_Circle_Left | Down_Circle_Right | Down_Left | Down_Right | Down_Up_Circle | Eight | Left | leftbracket | Left_Circle_Left | N | Omega | Phi | Question | Rectangle | Right | rightbracket | Three | Up | Up_Down_Up | Vee_Down | Vee_Up | W | Z;
( $gesture )

Table 5: Grammar for gestures

3.4 Testing the feature vector

A feature vector is an n-dimensional vector that represents numerical features of an object. For an image, the feature values are the pixels of the image; for mouse gestures, they are derived from the (x, y, t) coordinates, which helps to reduce the dimensionality. If the number of variables in a feature vector is large, processing the data will be slow and may need a lot of memory. Feature extraction, which combines variables, can overcome this problem. Since leave-one-out validation takes a long time, the experiments were conducted using cross-validation and FIXED cross-validation.

3.4.1 Differentiating GOOD and BAD data

Before running the real experiments on the full set of feature vectors, it was important to find out whether the data collected was of sufficient quality for training and validation. Data from 40 of the candidates was used for this experiment, each having provided 300 gestures with feature vectors. Each feature vector was tested per candidate using a 5-state HMM, and the 40 candidates were divided into four groups according to their training and validation results.

Linear interpolation (I), no. of states: 5, vector dimension: 2
Group 1: 40 candidates | bad data: 50% | cross-validation: < 60%
Group 2: 31 candidates (deleted 9 candidates below 50%) | cross-validation: 62%
Group 3: 27 candidates (deleted candidates below 60%) | cross-validation: 63%
Group 4: 19 candidates (deleted candidates below 70%) | cross-validation: 62%

Delta (D), no. of states: 5, vector dimension: 2
Group 1: 40 candidates | bad data: 50% | cross-validation: < 60%
Group 2: 36 candidates (deleted 4 candidates below 60%) | cross-validation: 59%
Group 3: 29 candidates (deleted candidates below 70%) | cross-validation: 62%
Group 4: 18 candidates (deleted candidates below 75%) | cross-validation: 67%

Linear interpolation & Delta (ID), no. of states: 5, vector dimension: 4
Group 1: 40 candidates | bad data: 50% | cross-validation: < 60%
Group 2: 36 candidates (deleted 4 candidates below 60%) | cross-validation: 75.12%
Group 3: 32 candidates (deleted candidates below 70%) | cross-validation: 77.60%
Group 4: 21 candidates (deleted candidates below 80%) | cross-validation: 81.88%

Table 6: Extracting the good-quality data

The experiment using ID shows that 21 candidates provided good data. To confirm this result, all 40 candidates' data was tested, as well as the 21 candidates' good data and the 19 candidates' bad data separately. Figure 20 illustrates the results obtained.

Figure 20: Results for good and bad data (accuracy versus number of HMM states for ID with all 40 subjects, the 21 good subjects and the 19 bad subjects)

From the figure it can be confirmed that the 21 good data sets are sufficient to train and validate the HMM models. Nevertheless, it remains worthwhile to test all 40 candidates in order to properly differentiate the data according to the HMM models.

3.4.2 Testing each HMM state count with different feature vectors

After separating the good data from the rest, more feature vectors were extracted, trained and validated. As described in section 3.2.3, the velocity (V), angle (A) and angle delta (a) were calculated and combined with the old feature vectors to obtain more vector dimensions. After training it was observed that a higher vector dimension, along with an increased number of HMM states, gives a better result. Figure 21 shows the results obtained after training the good data from the 21 subjects.

Figure 21: Results for 21 subjects (accuracy versus number of HMM states for the feature vectors IV, IVA, VA, IiVAa, ia, V, A, a, I, i and Ii)

As the graph shows, it is fair to state that the higher the vector dimension, the better the results. For example, the feature vector IiVAa, with a vector dimension of 7, yields results from 83.65% accuracy at 5 HMM states up to 94.63% accuracy at 22 HMM states, whereas the lowest results are obtained by the feature vector V with a vector dimension of 1. The feature vectors with vector dimension 1 (V, A and a) have less effect on the accuracy than those with a dimension greater than 1. Interestingly, however, increasing the number of HMM states does have a clear positive effect on the accuracy level.

When the same feature vectors are trained on all 40 subjects, a minor difference in accuracy is seen. The figure below shows the results for 40 subjects. In this graph the highest accuracy is obtained with the feature vector interpolation and delta (Ii). There is no major difference between figures 21 and 22, other than that the vector dimension of 7 (IiVAa) gives lower accuracy than the vector dimension of 4 (Ii), suggesting that a higher vector dimension does not always improve the accuracy level. However, this may be because the data mixes good and bad subjects.

Figure 22: Results for 40 subjects (accuracy versus number of HMM states for the same feature vectors as in figure 21)

3.4.3 Searching for a filter

After analysing the data and the full set of feature vectors, a decision was made to test different filters to reduce the noise level. In order to find the best filter, many filter ranges had to be tested against all feature vectors: ranges from (-2, 2) up to (-20, 20) with an interval of 0.5, and sigma values from 0.1 to 9.0. In the end, the filter range of -20 to 20 with an interval of 0.5 and sigma 2.0 or 2.1 gave the best results. The final filter range selected is shown below.

x = [-20:0.5:20];
f = normpdf(x, 0, 2.0);

Figure 23: Filter range
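The following C++ sketch shows how such a Gaussian filter file can be produced; the output format (a count followed by the weights) matches what the ComputeFeatures program in Appendix A-2 reads, while the output filename is illustrative.

#include <cmath>
#include <cstdio>

int main()
{
    const double lo = -20.0, hi = 20.0, step = 0.5, mu = 0.0, sigma = 2.0;
    const int n = static_cast<int>((hi - lo) / step) + 1; // 81 taps
    std::FILE* f = std::fopen("Filter_20_05_20.txt", "w");
    if (!f) return 1;
    std::fprintf(f, "%d\n", n); // ComputeFeatures reads the tap count first
    for (int i = 0; i < n; ++i) {
        double x = lo + i * step;
        // normal pdf, the same values Matlab's normpdf(x, mu, sigma) returns
        double w = std::exp(-(x - mu) * (x - mu) / (2 * sigma * sigma))
                   / (sigma * std::sqrt(2 * M_PI));
        std::fprintf(f, "%g\n", w);
    }
    std::fclose(f);
    return 0;
}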

Following the selection of the best filter ranges (sigma 2.0 and 2.1), both filters were tested against all feature vectors. Figure 24 below shows the results of each feature vector with the sigma 2.0 filter.

Figure 24: Individual test of each feature vector using the filter range (I, i, V, v, R, S and A, each with the sigma 2.0 filter, accuracy versus number of HMM states)

Figure 24 shows that the running average (R), the angle (A) and the interpolation delta (i) give better performance than the other feature vectors. Hence, the decision was made to use only the feature vector RAi. From the experiments in figures 21 and 22 it appeared that a higher vector dimension improves the results. To test this hypothesis, it was examined how the HMM behaves with an increasing number of feature dimensions. Figure 25 below shows how the results behave as the vector dimension increases.

Figure 25: Results for increased vector dimensions (V-1 A, V-2 R, V-3 Ia, V-4 Ii, V-5 IiV, V-6 IiVA, V-7 IiVAa, V-8 IiVAav, V-10 IiVAavR, V-11 IiVAaRS and V-12 IiVAavRS, all with the sigma 2.0 filter)

In fact, the hypothesis turns out to be incorrect: the graph above shows that a vector dimension of 10 gives better results than a vector dimension of 12. Thus, a higher vector dimension does not automatically mean a better accuracy level.

Since the HMM experiments so far used 4 to 30 states, with the accuracy level increasing with the number of states, it was important to find out what happens when the number of HMM states is increased beyond 30. Below are the training results using the feature vector RAi, filters with sigma 2.0 to 3.0, and HMM states up to 70.

Figure 26: Results for increased HMM states and different filters (RAi with sigma 2.1, 2.3, 2.5, 2.7 and 2.9)

Figure 26 shows that the accuracy levels off from roughly state 30 up to state 70. The figure also shows that the filter with sigma 2.1 gives better results than the larger sigma values.

Figure 27: Results for increased filter test (accuracy versus filter for RAi with sigma 2.1, 2.3, 2.5, 2.7 and 2.9)

With all these experiments completed, the best feature vector with the best filtering was selected: RAi with the filter (-20, 0.5, 20; sigma 2.1). It was then time to analyze the HMM models and their parameters. The following UNIX pipeline counts the parameters of each HMM generated by HTK:

cd models/hmm0.3/
more newmacros | sed 's/0\.0*e+00//g' | sed 's/<.*>//g' | sed 's/[0-9]\.[0-9]*e[+-][0-9]*/@/g' | sed 's/[^@]//g' | awk 'BEGIN{count=0} {count += length($0)} END{print count}' > parm.txt

Table 7: UNIX script for calculating the HMM parameters

It replaces every non-zero floating-point value in the model file (newmacros) with a marker character, strips everything else, counts the markers and stores the resulting parameter count in parm.txt.

All experiments so far used cross-validation, which selects data randomly, so it was decided to use a FIXED amount of data for testing and training. A decision was made to conduct a split test, where the model training process generates four iterations per HMM when auto-estimation is set to true; the final trained model is stored in the third iteration, as specified by NUM_HMM_DIR = 3 in train.sh.

Figure 28: Split test (accuracy versus HMM model parameters for RAi with sigma 2.1: no split and splits 1 through 7, i.e. NUM_HMM_DIR 4 to 10)

Figure 28 shows that when splitting with NUM_HMM_DIR from 4 to 10, the number of parameters grows with each split and the accuracy level increases as well; each split doubles the number of Gaussian densities per state.

3.4.4 Gesture grouping

Dividing the gestures into groups makes it possible to test ratios of HMM states per group. The main reason for testing the gestures in groups is to find out how the accuracy level of each gesture is affected. In order to calculate the accuracy level of each group of gestures, the confusion matrix needs to be analyzed, which can be done with a UNIX shell script and a simple C++ program. First, all the diagonal values of the confusion matrix are retrieved and kept in a .txt file, and all the gesture names are kept in three separate files. Using join and grep in a UNIX script, each group's accuracy level is calculated along with its model parameters. Once all the necessary tests were complete, the ratio with the best results (14:21:28) was chosen. Using this ratio, both the 21-subject and the 40-subject data were tested by combining the different ratios. When the tests were complete, the confusion matrix was retrieved and analyzed manually to find out how the groups of gestures are confused with each other.

There are in total 30 gestures; of these, 10 are small, 5 are medium and 15 are large. The gestures are divided according to the number of elements or lines needed to create each gesture, as shown in figure 16. As mentioned above, the selected ratio is 14:21:28, so each group was tested according to this ratio and its permutations, e.g. 14:21:28, 28:14:21, 14:28:21 and so on, which gives 6 combinations for small, medium and large. The total amount of data for the 21 subjects is 6300 gestures, divided into training and test sets; the test set contains 1890 gestures, of which small (630), medium (315) and large (945). Below are the results of the 21 subjects with the ratio-combination test for SMALL, MEDIUM and LARGE.

Figure 29 represents the results for the SMALL gestures: as the number of HMM states goes higher, the accuracy level for SMALL decreases, with SMALL gestures being confused with LARGE and other SMALL gestures. With fewer HMM states, SMALL tends to be confused with other SMALL gestures rather than with MEDIUM and LARGE.

Figure 29: Gesture group HMM state ratio test for SMALL (ratios 14:21:28, 28:21:14, 14:28:21, 21:28:14, 21:14:28, 28:14:21 and 21:21:21)

Figure 30 represents the results for the MEDIUM gestures: as the number of HMM states increases, the accuracy level also increases, with MEDIUM being confused with SMALL and LARGE. With fewer HMM states the error rate is higher than with many HMM states. Interestingly, however, no MEDIUM gesture is confused with another MEDIUM gesture.

Figure 30: Gesture group HMM state ratio test for MEDIUM (same ratio combinations)

Figure 31 represents the results for the LARGE gestures, which are quite similar to those for MEDIUM: as the number of HMM states increases, the accuracy level also increases. However, LARGE tends to be confused with SMALL and other LARGE gestures as the number of HMM states is increased.

Figure 31: Gesture group HMM state ratio test for LARGE (same ratio combinations)

3.4.5 Searching for a decay filter

The decay filter is applied in the same way as the Gaussian filter, except that the number of filter taps is decreased by 1 instead of being divided by 2 as in the Gaussian case. A sketch of one possible construction of such a filter is given below.
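The exact construction of the decay filter is not fully specified here, so the following C++ sketch is only one plausible reading: one-sided, exponentially decaying weights exp(-λ·j), normalized to sum to 1.

#include <cmath>
#include <vector>

// Build a one-sided exponential decay filter of the given length;
// the weight formula is an assumption, not taken from the thesis code.
static std::vector<double> decayFilter(double lambda, int length)
{
    std::vector<double> w(length);
    double sum = 0.0;
    for (int j = 0; j < length; ++j) {
        w[j] = std::exp(-lambda * j); // most recent sample weighted highest
        sum += w[j];
    }
    for (int j = 0; j < length; ++j)
        w[j] /= sum; // normalize so the weights sum to 1
    return w;
}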

Figure 32: Decay filter test, 0.2 to 0.5 (accuracy for the feature vectors RVA, RVa, Rv, RVAia, RVI, RVi, RVAi, Ri, RVAiav, Ra and RAi at decay filters D_F_02(9), D_F_03(12), D_F_04(15) and D_F_05(20))

Figure 32 shows the decay filter test using 11 different feature vectors with 10 HMM states on the 21 subjects, with lambda from 0.2 to 0.5. From this test the best feature vectors were selected (RVa, RVi, Ri, Ra and Ria) and tested with more decay filters to view the difference in accuracy. The results for the selected feature vectors are shown below; to observe them more clearly, the feature vectors with good accuracy were tested further with an increasing decay filter.

Figure 33: Decay filter test, 0.1 to 0.5, for RVa, RVi, Ri, Ra and Ria (filters D_F_01(6) through D_F_08(9))

From figure 33 it is clear that the running average (R) combined with the interpolation delta (i) and the angle delta (a) performs best. As mentioned before, the data for all gestures was sampled at a rate of 20 milliseconds. It was decided to examine the difference in accuracy when the sample rate is increased or decreased: sample rates of 30 ms, 20 ms and 10 ms were considered, using the feature vector Rai with 10 HMM states and decay filters with lambda 0.1, 0.2, 0.3, 0.4 and 0.5.

Figure 34: Decay filter test, 0.1 to 0.5, using different sample rates (Rai at 30 ms, 20 ms and 10 ms)

From figure 34 it can be stated that less data gives better accuracy: data sampled at 30 ms gives better results than data sampled at 10 ms. Since decay with lambda 0.6 gave a significant improvement, this decay filter was tested with all the other feature vectors to see the difference in accuracy.

Figure 35: Decay filter 0.6, 10 ms test, using different feature vectors (Ria, RiVvAa, RAV, RaV, iAV, iaV, VA, Va, R, i, a and ia at 6, 8 and 10 HMM states)

Figure 35 shows that Ria with a decay filter of 0.6 gives the best result, and that relative values or coordinates give better performance than absolute coordinates. The main reason is that relative values are computed from the absolute values and therefore describe a range of movement rather than a fixed position. To make the observation clearer, the following figure shows the results of the feature vectors with the best accuracy: Ria, R, i, a and ia.

Figure 36: Decay filter 0.6, 10 ms test, using the feature vectors Ria, R, i, a and ia (at 6, 8 and 10 HMM states)

The angle delta (a) gives the worst result compared to the interpolation delta (i), but combining (a) with the running average (R) changes the accuracy level significantly. This result can be improved further by using a sample rate of 30 ms instead of 20 or 10 milliseconds.

3.5 Recognition with gesture elements

The main requirement for gesture recognition is continuous online recognition, but achieving such system performance with HMM models is very difficult. The main reason is that the starting and ending point of each gesture is unknown, which is a complicated problem to solve. Today, Hidden Markov Models are among the most used methods for continuous gesture recognition. The advantage of HMMs is that they can automatically learn a range of model boundary information for continuous gesture recognition. For other methods, such as neural networks, this is a problem, because gesture boundaries are not automatically detectable.

The process for continuous gesture recognition using HMMs is similar to isolated gesture recognition. HMMs are concatenated when their parameters are trained, where each HMM is instantiated with a corresponding gesture. The concatenated HMM can be trained because each gesture HMM is trained on the entire observation sequence for the corresponding gestures; hence, the parameters of each model are re-estimated regardless of the location of each gesture's boundary. However, continuous gesture recognition is much more difficult than isolated gesture recognition due to the unknown gesture boundaries: all possible Start_Gesture and End_Gesture positions must be considered, which results in a tree search. The Viterbi algorithm is an efficient search algorithm that can be used for continuous gesture recognition.

For isolated recognition, the grammar, the MLF file and the commands were generated automatically with the help of HTK and GT²k. In order to proceed with continuous gesture recognition, it is essential to change the grammar to use specific commands and to manually modify the MLF label files that describe the boundaries of each gesture, with a frame duration of 2000.

3.5.1 Grammar for gesture elements

The grammar for continuous recognition is different from that of isolated recognition. Each gesture is decomposed into elements according to direction: N, S, E, W, NE, NW, SE and SW. Using these simple directions it is possible to generate two types of commands, Line and Arc. A line is specified with an L in front of the direction and an arc with an A in front of the direction, for example LN, LS and LSW, or ANW and ANE.

Figure 37: Visualization of the direction of the elements; (a) Arc, (b) Lines.

An illustrative mapping from stroke angles to these eight direction labels is sketched below.
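This C++ sketch maps a stroke's delta vector to one of the eight compass labels by quantizing its angle into 45-degree sectors; the mapping function is our own illustration of the element encoding and is not part of the recognition pipeline, where the elements are assigned through the grammar and label files.

#include <cmath>
#include <string>

// Map a (dx, dy) stroke delta to one of the eight compass directions.
static std::string direction(double dx, double dy)
{
    static const char* names[8] = { "E", "NE", "N", "NW", "W", "SW", "S", "SE" };
    double a = std::atan2(dy, dx); // angle in (-pi, pi]
    int sector = static_cast<int>(std::floor((a + M_PI / 8) / (M_PI / 4)));
    return names[((sector % 8) + 8) % 8]; // wrap into 0..7
}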

With the help of the HTK syntax, the format of the grammar can be changed manually. The grammar is built from variables and commands, where variables are specified using the $ symbol. Commands are simple text strings associated with the variables, and each variable may contain more than one command, separated by the pipe character. The following figure shows an example of the grammar used for testing continuous recognition.

Circle:    start_ges AWN ANE AES ASW end_ges
Down_Left: start_ges LS LW end_ges

Figure 38: Sample grammar for continuous recognition.

The above grammar was created after analysing different combinations of commands. Many different grammars were tested, and this one was chosen as the final grammar, with 17 commands. Using fewer commands gives fewer HMM models with fewer parameters than 39 or 41 commands would. Grammars with and without start_ges and end_ges (which add two commands) were also tested, and the grammar with start_ges and end_ges gives the best result, yielding 15 + 2 = 17 commands.

3.5.2 Label (MLF) files for gesture elements

The MLF files generated by GT²k for isolated recognition are similar to those for continuous recognition, with a small difference, shown below.

Rectangle (isolated):      Rectangle
Rectangle (continuous):    start_ges LE LS LW LN end_ges
Circle_Right (isolated):   Circle_Right
Circle_Right (continuous): start_ges AWN ANE end_ges

Table 8: Difference in label files for isolated and continuous recognition.

The MLF file is generated by the gen_mlf.sh script, which needs to be modified. The script reads each file, the number of lines in each file and the path of the file. gen_mlf.sh keeps a start_time and an end_time, where the end_time is calculated by multiplying the number of lines by the frame duration of 2000; this interval is then divided evenly over the commands of the corresponding gesture (including start_ges and end_ges), incrementing the start time for each command. Below is a sample of the label files generated by the modified gen_mlf.sh script for the 6300 gestures.

#!MLF!#
"/home/gesture/t_con_rvaia_21_yesse_testing_merged_labels/ext/data/limerkens29MR/run_10_Up_294_RVAia.txt.lab"
start_ges
LN
end_ges
.
"/home/gesture/t_con_rvaia_21_yesse_testing_merged_labels/ext/data/limerkens29MR/run_10_N_286_RVAia.txt.lab"
start_ges
LN
LSE
LN
end_ges
.
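The boundary arithmetic of the modified gen_mlf.sh (Appendix A-5) can be expressed in the following C++ sketch: the total duration is the number of lines times the frame duration of 2000, divided evenly over start_ges, the gesture elements and end_ges. The function name is illustrative.

#include <cstdio>
#include <string>
#include <vector>

// Print one MLF entry with evenly spaced boundaries, as gen_mlf.sh does.
static void printLabel(int numLines, const std::vector<std::string>& elements)
{
    const long frameDuration = 2000;
    long endTime = static_cast<long>(numLines) * frameDuration;
    long step = endTime / static_cast<long>(elements.size() + 2); // +2 for start/end
    long t = 0;
    std::printf("%ld %ld start_ges\n", t, t + step);
    t += step;
    for (const std::string& e : elements) {
        std::printf("%ld %ld %s\n", t, t + step, e.c_str());
        t += step;
    }
    std::printf("%ld %ld end_ges\n.\n", t, endTime); // '.' separates entries
}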

The operations performed by HTK assume that the gestures are divided into segments, each with a name or label. The set of labels related to a gesture file makes up a transcription, and each transcription is kept in a separate label file with the same name as the corresponding gesture file but a different extension. HTK can handle a very large number of files using Master Label Files (MLFs), which act as an index referencing the actual label files. An MLF thus stores a large set of label files in one single file, allows a single transcription to be shared by many logical label files, and allows arbitrary file redirection.

3.5.3 Experiments with gesture elements

Before the real experiments could be conducted, the grammar and the changed MLF file had to be tested for proper use and format. Once the grammar and the MLF were confirmed, the relative feature vector (RVAia) with the 21 subjects, a simple 4-state HMM and the filter with sigma 2.1 was tested. Retrieving the error rate, however, is different from isolated recognition. As mentioned before, the result of the recognition is written to the file hresult.log, and this is the file that has to be analysed to find the correct error rate. A sample of the generated hresult.log is shown below.

Aligned transcription:
/home/gesture/t_con_v_21_yesse/ext/data/akesson29ml/run_10_Down_Left_279_v.txt.lab vs
/home/gesture/t_con_v_21_yesse/ext/data/akesson29ml/run_10_Down_Left_279_v.txt.rec
LAB: start_ges LS LW end_ges
REC: start_ges LS LE end_ges

Table 9: Sample of hresult.log.

It is not possible to analyze the confusion matrix in the usual way, because the new confusion matrix is generated over the 17 commands rather than over whole gestures. Table 10 below shows a sample of this confusion matrix.

====================== HTK Results Analysis ============
Date: Fri Nov 24 14:38
Ref : /home/gesture/t_con_v_21_yesse/labels.mlf
Rec : /home/gesture/t_con_v_21_yesse/ext/result.mlf
Overall Results
SENT: %Correct=58.33 [H=7, S=5, N=12]
WORD: %Corr=89.58, Acc=89.58 [H=43, D=0, S=5, I=0, N=48]

Confusion Matrix
            start_ges  end_ges  LS  LE  LW  Del  [%c / %e]
start_ges
end_ges
LS
LE                                               [66.7/4.2]
LW                                               [50.0/6.2]
Ins
========================================================

Table 10: Sample confusion matrix for continuous recognition.

In order to analyze the log file, a PERL script was written. It compares the string values (REC: vs LAB:), sums the unmatched strings and divides that sum by the size of either the training set (4410), the test set (1890) or the full data set (6300), depending on which data set is being tested, and on which of the two sets of subjects (21 or 40) is used. A sketch of this comparison is given below.
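The comparison itself was done with a PERL script; this C++ sketch shows the same idea, counting transcriptions whose recognized element string (REC) differs from the reference (LAB) and turning that count into an error rate.

#include <string>
#include <vector>

struct Transcription { std::string lab, rec; };

// Percentage of transcriptions where the recognized string does not
// exactly match the reference string.
static double errorRate(const std::vector<Transcription>& t)
{
    int errors = 0;
    for (const Transcription& p : t)
        if (p.lab != p.rec) // any mismatch counts as one sentence error
            ++errors;
    return 100.0 * errors / t.size();
}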

The following graph shows the error rate using different grammars for the 21 subjects on the test set.

Figure 39: Grammar test with the test set for 21 subjects: 17 elements (with Start_Gesture & End_Gesture) versus 15 elements (without Start_Gesture & End_Gesture), for the feature vectors IiVAaRS, RVAia, RVAiav, SVAiav, SVAia, RVA, VAia, RVi, VAiav, VAi, i, VA, SVi, Vi, Ii, SVA, R, S, I, v and V.

The simulation in figure 39 was conducted for 21 different feature vectors. The graph confirms that using start_ges and end_ges gives a lower error rate than omitting them, and it also shows that the absolute values combined with relative values (IiVAaRS) give the lowest error rate. The graph further shows that combining feature vectors such as angle (A), velocity (V), etc., is crucial to obtain a low error rate.

Using HResults and HVite, two files are generated: hresults.log and result.mlf0. HVite is a general-purpose Viterbi recognizer: it matches each file against a network of HMMs and outputs a transcription for each. HResults is the HTK performance analysis tool: it reads a set of label MLF files output by a recognition tool such as HVite and compares them with the corresponding reference transcription files. Before looking at result.mlf0, it is important to understand the label.mlf file as explained above. After building and training the HMM models, HTK stores its recognition output in result.mlf0, which is similar to the original label files except that the boundaries are re-estimated, together with a log probability. This file is later compared with the original labels using the HResults function. Below is a sample of result.mlf0.

#!MLF!#
"/home/gesture/t_con_v_21_yesse/ext/data/akesson29ml/run_08_Down_Right_220_v.txt.rec"
start_ges
LS
LE
end_ges

The main difference is the boundary re-estimation along with the log probability. The main intention of using this file, however, is to keep all the correct recognitions and to replace all the wrong recognitions according to the original label file and the errors retrieved from hresults.log.

Below is the result after merging the original label file with result.mlf0.

RVAia            | Testing (1890) error | Training (4410) error | All (6300) error
Original label   |                      |                       |
Merged label     |                      |                       |

Table 11: Difference in error rate between the original label.mlf and the merged label.mlf.

This simulation was conducted using the RVAia feature vector, the 21 subjects and the 17 commands. There is not much difference between the original label file and the merged label file; merging the label files does not help as much as expected.

Once the above result was confirmed, it was decided to manipulate the grammar to see what happens. First, all the gesture names and all the recognition (REC) results are retrieved from hresults.log and kept in two separate files (gesture_name.txt and recognised.txt). Using paste, the two files are merged into another file (gesturerec.txt), and this file is sorted on unique gestures into a separate file (sortedgesturerec.txt). Using the recognized grammar, the original grammar file is then modified manually, as shown below.

$Circle = ( Start_ges AES ASW AWN ANE End_Ges |
            Start_ges AWN ANE AES ASW AWN End_Ges |
            Start_ges AWN ANE AES ASW LN End_Ges |
            Start_ges AWN ANE ASW End_Ges );

Table 12: Sample grammar for the gesture Circle in grammar_cont.txt.

The first alternative is the original grammar, followed by the other recognized grammars. Retrieving the error rate is again different: hresults.log now has to be analysed against the new grammar. The REC: lines from hresults.log are compared with the new grammar; a match is a correct recognition and a mismatch is an error. One file is generated after each simulation from hresults.log, namely gesturerec.txt. Table 13 below shows an example of this file.

Vee_up REC: start_ges LSE LNE LSE LNE end_ges
Up REC: start_ges LSE LNE LSE LNE end_ges

Table 13: The file gesturerec.txt and its format.

This file is compared, using a PERL script, with grammar_cont.txt (the file with the gesture grammars); the numbers of matches and mismatches are counted and the accuracy rate is calculated. However, since this produced no significant change in the results, it was decided to end the testing of grammar-element recognition here.

Chapter 4: Conclusion

The initial data collection used a pointing device similar to a PC mouse, with (x, y) coordinates and a sample time of 16 milliseconds. Due to other devices there was interference, and hence the initially collected data was bad; the interference included EMC transmitted by the TV, the light source and other devices. Collecting data with the new and improved device will improve the accuracy level.

The basic idea behind using the existing open source frameworks HTK and GT²k is to use HMMs and their algorithms to calculate the likelihood result generated by the specific HMM model of each recognized gesture, as well as the overall accuracy result. Using the above data, different feature vectors, filters and different numbers of HMM states, it is possible to influence the likelihood of the recognized result. Different feature vectors were generated and experiments were conducted with different numbers of HMM states. It was found that an increased number of HMM parameters, along with an increased number of HMM states, gives better accuracy; the different HMM models were also compared with respect to the number of model parameters, in view of implementation on an embedded system. There are two ways of recognition: one using a greater number of HMM states and a second using multiple Gaussian densities per probability distribution. In both cases there are improvements with an increased number of parameters compared to a fixed number of parameters, which is critical for the memory as well as the real-time constraints of an embedded platform.

For isolated recognition, the best performance was 96%, using the relative feature vector (Ria); this result can be improved using better data with less jitter and better filtering methods. Using different grammars with elements shared by almost all the gestures does change the accuracy. However, this is not true continuous recognition, because the gestures do not follow any sequence and the start and end of each gesture are unknown.

The reason for using Gaussian filtering is that it is very easy to use for filtering out white noise; the decay filter is used because it is easy to compute with no buffered values. With data containing a lot of jitter, due to the poorly calibrated device, and with complicated gestures, the accuracy level is low; it is expected to increase when the data is collected with a better device. The data was collected using a TV with an aspect ratio of 16:9, which had a bad effect on the collected data due to jittering of the device.

Chapter 5: Future Work

As with any research, many improvements and extensions are possible. First, due to bugs and poor calibration of the uWand device, the device influenced the data collection by providing uneven and scattered data. Hence, more data needs to be collected using the new and improved device, and the 30 gestures should be analyzed to find the most feasible and easy-to-use gestures.

Second, for isolated recognition, more feature vectors, such as polynomial interpolation, spline interpolation and other relative feature vectors, should be tried, and more experiments should be conducted in order to confirm the best feature vector. The experiments and simulations should cover a variety of feature-vector combinations, such as relative position, absolute position and their combination, which could improve the recognition result. Even though such experiments take a long time, more of them should be conducted using LEAVE_ONE_OUT validation, which could also improve the recognition result. A small feature vector gives few HMM parameters, so feature vectors with a low dimension but good recognition performance should be sought.

Third, other common extensions such as time-duration modelling and corrective training could also improve the result. Fourth, more complex 2-D mouse gestures can be collected for more interesting studies, for example using video sequences in combination with computer vision, allowing different types of applications to use the recognition facility. Fifth, white noise can be reduced using different types of filters, such as the Kalman filter, other low-pass filters, the Butterworth filter, the Chebyshev filter, etc.

As for grammar-constrained recognition, the grammar can be manipulated in various ways to decrease the error rate, for example by increasing the number of elements and by defining the elements properly and accurately. An increased number of elements or commands increases the model parameters, so using few elements will keep the model parameters low. The label files must be changed according to the format specified in chapter 3, and the boundaries can be estimated by modifying gen_mlf.sh. The confusion matrix generated in continuous recognition is not similar to that of isolated recognition, so processing the hresult.log file to retrieve the error rate is essential. The only filter used for continuous recognition was the Gaussian filter; the feature vectors should therefore also be tested with other filters, such as the exponential decay filter, the Kalman filter, etc.


Appendix A: The source codes

A-1: UNIX script: Make Project

#!/bin/sh
#
# Use this to create a new project with the following parameters:
#
# makeproject <name> <datadir> <namelist> <gesturelist> <vectortype> <numstates> <filter(optional)>

FILTER=$7
if test "$FILTER" != ""
then
    FILTER="../../../$FILTER"
    echo Using filter $FILTER
else
    echo Using no filter
fi

cp -r emptyproject $1
cd $1
cp ../$3.txt persons.txt
cp ../$4.txt gestures.txt
cp ../runs.txt runs.txt
cd data
for person in `cat ../persons.txt`
do
    mkdir $person
    cd $person
    for gesture in `cat ../../gestures.txt`
    do
        for run in `cat ../../runs.txt`
        do
            find ../../../rawdata/$2/$person -name run_"$run"_"$gesture"_"[0-9]*".txt -exec ../../scripts/ComputeFeaturesNormalize {} $5 $FILTER \;
        done
    done
    cd ..
done

# count the vector dimension: I, i, R and S add two dimensions each,
# A, a, V and v add one dimension each
DIM=`echo "$5" | sed 's/[IiRS]/++/g' | sed 's/[AaVv]/+/g' | awk '{print length($1)}'`
echo Vector dimension: $DIM
cd ..
find data/* -type d > datadir
find data/* -type f -name "*.txt" | sort > datafiles
scripts/gen_commands.sh datafiles | sort -u > commands
scripts/gen_hmmdef.pl -n$6 -v$DIM -s0 hmm
scripts/gen_mlf.sh datafiles ext/ > labels.mlf
cat scripts/options_any.sh | sed "s/<ANY>/$DIM/" > options.sh
echo Project creation done.

A-2: C++: Compute Features

#include <iostream>
#include <fstream>
#include <cmath>
#include <cstdlib>
#include <string>

#define PI 3.14159265358979

using namespace std;

int main(int argc, char **argv)
{
    if ((argc != 3) && (argc != 4)) {
        cout << "Illegal number of arguments, Syntax: ComputeFeatures <in> <IVAiva> <filter>";
        return -1;
    }

    // read the input file
    int N;
    ifstream myinput;
    myinput.open(argv[1], fstream::in);
    if (myinput.fail()) { cerr << "Error opening file"; exit(-1); }
    if (myinput.eof())  { cerr << "Error reading file"; exit(-1); }
    myinput >> N;
    int *xa = new int[N];
    int *ya = new int[N];
    int *ta = new int[N];
    for (int i = 0; i < N; i++) {
        if (myinput.eof()) { cerr << "Error reading file"; exit(-1); }
        myinput >> xa[i];
        myinput >> ya[i];
        myinput >> ta[i];
    }

    double *filter;
    int NFilter;
    // read the filter file if specified
    if (argc == 4) {
        ifstream myfilter;
        myfilter.open(argv[3], fstream::in);
        cout << "Using Filter " << argv[3] << endl;
        if (myfilter.fail()) { cerr << "Error opening filter file"; exit(-1); }

        if (myfilter.eof()) { cerr << "Error reading filter file"; exit(-1); }
        myfilter >> NFilter;
        filter = new double[NFilter];
        for (int i = 0; i < NFilter; i++) {
            if (myfilter.eof()) { cerr << "Error reading filter file"; exit(-1); }
            myfilter >> filter[i];
        }
    } else {
        cout << "No Filter file specified." << endl;
        NFilter = 1;
        filter = new double[NFilter];
        filter[0] = 1.0;
    }

    // do interpolation
    int Ni = 0;
    int deltat = 20;
    for (int t = ta[0]; t < ta[N-1]; t += deltat)
        Ni++;
    double *xiunfiltered = new double[Ni];
    double *yiunfiltered = new double[Ni];
    int count = 0;
    for (int t = ta[0]; t < ta[N-1]; t += deltat) {
        double Td1, Td2;
        double Td3, Td4;
        int Xdiff, Ydiff;
        double ResultX = 0;
        double ResultY = 0;
        int Fi = 0;
        bool found = false;
        // find the raw sample interval that brackets time t
        for (int k = 0; k < N; k++) {
            if ((ta[k] > t) && (!found)) {
                Fi = k - 1;
                found = true;
            }
        }
        Td1 = t - ta[Fi];
        Td2 = ta[Fi+1] - ta[Fi];
        Xdiff = xa[Fi+1] - xa[Fi];
        ResultX = xa[Fi] + (Td1 / Td2) * Xdiff;
        Td3 = t - ta[Fi];
        Td4 = ta[Fi+1] - ta[Fi];
        Ydiff = ya[Fi+1] - ya[Fi];
        ResultY = ya[Fi] + (Td3 / Td4) * Ydiff;
        xiunfiltered[count] = ResultX;

        yiunfiltered[count] = ResultY;
        count++;
    }

    // do filtering
    double *xi = new double[Ni];
    double *yi = new double[Ni];
    for (int i = 0; i < Ni; i++) {
        double sumx = 0;
        double sumy = 0;
        double sumweights = 0;
        for (int j = 0; j < NFilter; j++) {
            int shiftedi = i + j - NFilter/2;
            if ((shiftedi >= 0) && (shiftedi < Ni)) {
                sumx += xiunfiltered[shiftedi] * filter[j];
                sumy += yiunfiltered[shiftedi] * filter[j];
                sumweights += filter[j];
            }
        }
        xi[i] = sumx / sumweights;
        yi[i] = sumy / sumweights;
    }

    // compute deltas
    double *dx = new double[Ni-1];
    double *dy = new double[Ni-1];
    for (int i = 1; i < Ni; i++) {
        dx[i-1] = xi[i] - xi[i-1];
        dy[i-1] = yi[i] - yi[i-1];
    }

    // compute running average
    double *rax = new double[Ni];
    double *ray = new double[Ni];
    {
        int count = 0;
        double sumx = 0.0;
        double sumy = 0.0;
        for (int i = 0; i < Ni; i++) {
            sumx += xi[i];
            sumy += yi[i];
            count++;
            rax[i] = sumx / (double)count;
            ray[i] = sumy / (double)count;
        }
    }

    // compute angle/velocity
    double *angle = new double[Ni-1];

    double *veloc = new double[Ni-1];
    for (int i = 0; i < Ni-1; i++) {
        double powx = dx[i]*dx[i];
        double powy = dy[i]*dy[i];
        veloc[i] = sqrt(powx + powy);
        angle[i] = atan2(dy[i], dx[i]);
    }

    // compute delta angle/velocity
    double *deltaangle = new double[Ni-1]; // index 0 is not used!
    double *deltaveloc = new double[Ni-1]; // index 0 is not used!
    for (int i = 1; i < Ni-1; i++) {
        deltaveloc[i] = veloc[i] - veloc[i-1];
        double da = angle[i] - angle[i-1];
        if (da < -PI) da += 2*PI; // wrap the angle delta into (-PI, PI]
        if (da >  PI) da -= 2*PI;
        deltaangle[i] = da;
    }

    // determine start and end time
    char *type = argv[2];
    int start = 0;
    int end = Ni;
    for (int i = 0; type[i] != 0; i++) {
        if ((type[i] == 'i') || (type[i] == 'V') || (type[i] == 'A') ||
            (type[i] == 'v') || (type[i] == 'a'))
            end = Ni-1;
        if ((type[i] == 'v') || (type[i] == 'a'))
            start = 1;
    }

    // write file
    string inname(argv[1]), outname;
    if (inname.rfind('/') != string::npos) {
        int pos = inname.rfind('/');
        outname = inname.substr(pos+1, inname.length());
    } else
        outname = inname;
    if (outname.rfind('.') != string::npos)
        outname = outname.substr(0, outname.rfind('.'));
    outname += "_";
    outname += type;
    outname += ".txt";
    fstream file_op(outname.c_str(), ios::out);
    for (int i = start; i < end; i++) {

        for (int j = 0; type[j] != 0; j++) {
            switch (type[j]) {
                case 'I': file_op << xi[i] << " " << yi[i]; break;
                case 'V': file_op << veloc[i]; break;
                case 'A': file_op << angle[i]; break;
                case 'i': file_op << dx[i] << " " << dy[i]; break;
                case 'v': file_op << deltaveloc[i]; break;
                case 'a': file_op << deltaangle[i]; break;
                case 'R': file_op << (xi[i]-rax[i]) << " " << (yi[i]-ray[i]); break;
                case 'S': file_op << (xi[i]-xi[0]) << " " << (yi[i]-yi[0]); break;
            }
            if (type[j+1] != 0) file_op << " ";
        }
        file_op << endl;
    }

    cout << "Created " << outname << " with " << end-start
         << " samples (filter length " << NFilter << ")." << endl;
    return 0;
}

A-3: UNIX script: Running each project

makeprojectmoreclever.sh T_Ria_21_DF06_10 All_MF_21people_10 persontest21 gesturetest RAi Filter_20_05_20_21.txt
cd T_Ria_21_DF06_10
scripts/train.sh options.sh
rm -r data
rm -r ext
cd ..

A-4: UNIX script: Calculating the HMM parameters

cd models/hmm0.3/
more newmacros | sed 's/0\.0*e+00//g' | sed 's/<.*>//g' | sed 's/[0-9]\.[0-9]*e[+-][0-9]*/@/g' | sed 's/[^@]//g' | awk 'BEGIN{count=0} {count += length($0)} END{print count}' > parm.txt

A-5: UNIX script: Generating label files for grammar elements

DATA_LIST_FILE=$1
EXT_DIR=$2
EXTRACT_LABELS=scripts/extract_label.sh    # extract data labels
GESTURES=Grammar.txt
integer frame_duration=2000    # each frame is about 2000ns in HTK
integer start_time=0
integer end_time=0
integer num_lines=0
integer num_ges=0
integer incr_step=0
integer next_time=0

echo "#!MLF!#"    # write the header for the file
for m in `cat $DATA_LIST_FILE`; do    # for each data file listed
    # if ! [ -d "$EXT_DIR/$m" ]; then    # mirror the data directory hierarchy
    #     mkdir -p "$EXT_DIR/$m"         # in the ext output
    # fi
    # see if we need to append a path to this
    echo "${EXT_DIR}" | grep '^/' > /dev/null
    if [ "$?" = "0" ]; then
        labname="\"${EXT_DIR}/$m.lab\""    # search for HTK readable datafiles
    else
        labname="\"`pwd`/${EXT_DIR}/$m.lab\""
    fi
    echo "$labname" | sed -e 's|//|/|g'    # output the label filename
    INT=`cat $m | sed -e 's/^[^_]*_[^_]*_//' | sed -e 's/_[0-9].*$//' | wc -l`
    Ges=`head -n $INT $GESTURES | tail -n 1`
    num_lines=`cat $m | wc -l`    # compute the num lines per file
    end_time=num_lines*frame_duration    # total time = #_frames * duration
    num_ges=`echo $Ges | wc -w`+2
    incr_step=$end_time/$num_ges
    # write start/stop time with label
    start_time=0
    for i in start_ges $Ges
    do
        next_time=$start_time+$incr_step
        echo "$start_time $next_time $i"
        start_time=$next_time
    done
    echo "$start_time $end_time end_ges"
    echo "."    # data separator
done

A-6: UNIX script: Calculating the gesture group results

more hresults.log | tail -32 | head -30 | awk 'BEGIN{n=1} {n++; print $n}' > G_results.txt
GestureG.sh
join small.txt Ges_Res.txt > Small_Value.txt
more Small_Value.txt | awk '{sum += $2} END{S = sum/630; print S}' > Small_Result.txt
join medium.txt Ges_Res.txt > medium_value.txt
more medium_value.txt | awk '{sum += $2} END{M = sum/315; print M}' > medium_result.txt
join large.txt Ges_Res.txt > large_value.txt
more large_value.txt | awk '{sum += $2} END{L = sum/945; print L}' > large_result.txt
cd ..

A-7: GRAMMAR

$Back_Forth = (start_ges LE LW end_ges);
$Circle = (start_ges AWN ANE AES ASW end_ges);
$Circle_Down = (start_ges AES ASW AWN ANE LS end_ges);
$Circle_Left = (start_ges AEN ANW end_ges);
$Circle_Right = (start_ges AWN ANE end_ges);
$Cross = (start_ges LSW LN LSE end_ges);
$Down = (start_ges LS end_ges);
$Down_Circle_Left = (start_ges LS AWS ASE AEN LW end_ges);
$Down_Circle_Right = (start_ges LS AES ASW AWN LE end_ges);
$Down_Left = (start_ges LS LW end_ges);
$Down_Right = (start_ges LS LE end_ges);
$Down_Up_Circle = (start_ges LS LN AWN ANE end_ges);
$Eight = (start_ges ANW AWS ANE AES ASW AWN LNE end_ges);
$Left = (start_ges LW end_ges);
$leftbracket = (start_ges ANW AES ANE AWS end_ges);
$Left_Circle_Left = (start_ges LW ASW AWN ANE AES LW end_ges);
$N = (start_ges LN LSE LN end_ges);
$Omega = (start_ges LE ASW AWN ANE AES LE end_ges);
$Phi = (start_ges AWS ASE AEN LS end_ges);
$Question = (start_ges AWN ANE ASW AWS end_ges);
$Rectangle = (start_ges LE LS LW LN end_ges);
$Right = (start_ges LE end_ges);
$rightbracket = (start_ges ANE AWS ANW AES end_ges);
$Three = (start_ges AWN ANE AES ANW AES ASW end_ges);
$Up = (start_ges LN end_ges);
$Up_Down_Up = (start_ges LS LN end_ges);
$Vee_Down = (start_ges LNE LSE end_ges);
$Vee_Up = (start_ges LSE LNE end_ges);
$W = (start_ges LSE LNE LSE LNE end_ges);
$Z = (start_ges LE LSW LE end_ges);

$gesture = $Back_Forth | $Circle | $Circle_Down | $Circle_Left | $Circle_Right | $Cross | $Down | $Down_Circle_Left | $Down_Circle_Right | $Down_Left | $Down_Right | $Down_Up_Circle | $Eight | $Left | $leftbracket | $Left_Circle_Left | $N | $Omega | $Phi | $Question | $Rectangle | $Right | $rightbracket | $Three | $Up | $Up_Down_Up | $Vee_Down | $Vee_Up | $W | $Z;

($gesture)

Appendix B: New gestures for new data

Two, Three, Four, Five, One Circle (start anywhere, counterclockwise), C, One Circle (start anywhere, clockwise), Cross, D, Down, E (Internet Explorer), L, Left, Left_Circle_Counter, Left_Circle_Clock, M, N, Ok, Omega, Down_Circle_Counter, Down_Circle_Clock, P, Gamma, Phi, Question, Right, Right_Circle_Counter, Right_Circle_Clock, S, Up, W


More information

University of Groningen. Systemen, planning, netwerken Bosman, Aart

University of Groningen. Systemen, planning, netwerken Bosman, Aart University of Groningen Systemen, planning, netwerken Bosman, Aart IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

WHEN THERE IS A mismatch between the acoustic

WHEN THERE IS A mismatch between the acoustic 808 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition Jeih-Weih Hung, Member,

More information

Online Marking of Essay-type Assignments

Online Marking of Essay-type Assignments Online Marking of Essay-type Assignments Eva Heinrich, Yuanzhi Wang Institute of Information Sciences and Technology Massey University Palmerston North, New Zealand E.Heinrich@massey.ac.nz, yuanzhi_wang@yahoo.com

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

Houghton Mifflin Online Assessment System Walkthrough Guide

Houghton Mifflin Online Assessment System Walkthrough Guide Houghton Mifflin Online Assessment System Walkthrough Guide Page 1 Copyright 2007 by Houghton Mifflin Company. All Rights Reserved. No part of this document may be reproduced or transmitted in any form

More information

WiggleWorks Software Manual PDF0049 (PDF) Houghton Mifflin Harcourt Publishing Company

WiggleWorks Software Manual PDF0049 (PDF) Houghton Mifflin Harcourt Publishing Company WiggleWorks Software Manual PDF0049 (PDF) Houghton Mifflin Harcourt Publishing Company Table of Contents Welcome to WiggleWorks... 3 Program Materials... 3 WiggleWorks Teacher Software... 4 Logging In...

More information

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders

More information

Robot manipulations and development of spatial imagery

Robot manipulations and development of spatial imagery Robot manipulations and development of spatial imagery Author: Igor M. Verner, Technion Israel Institute of Technology, Haifa, 32000, ISRAEL ttrigor@tx.technion.ac.il Abstract This paper considers spatial

More information

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 10, Issue 2, Ver.1 (Mar - Apr.2015), PP 55-61 www.iosrjournals.org Analysis of Emotion

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Introduction to Moodle

Introduction to Moodle Center for Excellence in Teaching and Learning Mr. Philip Daoud Introduction to Moodle Beginner s guide Center for Excellence in Teaching and Learning / Teaching Resource This manual is part of a serious

More information

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY Sergey Levine Principal Adviser: Vladlen Koltun Secondary Adviser:

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

A study of speaker adaptation for DNN-based speech synthesis

A study of speaker adaptation for DNN-based speech synthesis A study of speaker adaptation for DNN-based speech synthesis Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, Simon King The Centre for Speech Technology Research (CSTR) University of Edinburgh,

More information

Session Six: Software Evaluation Rubric Collaborators: Susan Ferdon and Steve Poast

Session Six: Software Evaluation Rubric Collaborators: Susan Ferdon and Steve Poast EDTECH 554 (FA10) Susan Ferdon Session Six: Software Evaluation Rubric Collaborators: Susan Ferdon and Steve Poast Task The principal at your building is aware you are in Boise State's Ed Tech Master's

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators s and environments Percepts Intelligent s? Chapter 2 Actions s include humans, robots, softbots, thermostats, etc. The agent function maps from percept histories to actions: f : P A The agent program runs

More information

Axiom 2013 Team Description Paper

Axiom 2013 Team Description Paper Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

PowerTeacher Gradebook User Guide PowerSchool Student Information System

PowerTeacher Gradebook User Guide PowerSchool Student Information System PowerSchool Student Information System Document Properties Copyright Owner Copyright 2007 Pearson Education, Inc. or its affiliates. All rights reserved. This document is the property of Pearson Education,

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

Large vocabulary off-line handwriting recognition: A survey

Large vocabulary off-line handwriting recognition: A survey Pattern Anal Applic (2003) 6: 97 121 DOI 10.1007/s10044-002-0169-3 ORIGINAL ARTICLE A. L. Koerich, R. Sabourin, C. Y. Suen Large vocabulary off-line handwriting recognition: A survey Received: 24/09/01

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

PART 1. A. Safer Keyboarding Introduction. B. Fifteen Principles of Safer Keyboarding Instruction

PART 1. A. Safer Keyboarding Introduction. B. Fifteen Principles of Safer Keyboarding Instruction Subject: Speech & Handwriting/Input Technologies Newsletter 1Q 2003 - Idaho Date: Sun, 02 Feb 2003 20:15:01-0700 From: Karl Barksdale To: info@speakingsolutions.com This is the

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

Intelligent Agents. Chapter 2. Chapter 2 1

Intelligent Agents. Chapter 2. Chapter 2 1 Intelligent Agents Chapter 2 Chapter 2 1 Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types The structure of agents Chapter 2 2 Agents

More information

16.1 Lesson: Putting it into practice - isikhnas

16.1 Lesson: Putting it into practice - isikhnas BAB 16 Module: Using QGIS in animal health The purpose of this module is to show how QGIS can be used to assist in animal health scenarios. In order to do this, you will have needed to study, and be familiar

More information

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February

More information

Speech Emotion Recognition Using Support Vector Machine

Speech Emotion Recognition Using Support Vector Machine Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,

More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

Generating Test Cases From Use Cases

Generating Test Cases From Use Cases 1 of 13 1/10/2007 10:41 AM Generating Test Cases From Use Cases by Jim Heumann Requirements Management Evangelist Rational Software pdf (155 K) In many organizations, software testing accounts for 30 to

More information

LEGO MINDSTORMS Education EV3 Coding Activities

LEGO MINDSTORMS Education EV3 Coding Activities LEGO MINDSTORMS Education EV3 Coding Activities s t e e h s k r o W t n e d Stu LEGOeducation.com/MINDSTORMS Contents ACTIVITY 1 Performing a Three Point Turn 3-6 ACTIVITY 2 Written Instructions for a

More information

Intel-powered Classmate PC. SMART Response* Training Foils. Version 2.0

Intel-powered Classmate PC. SMART Response* Training Foils. Version 2.0 Intel-powered Classmate PC Training Foils Version 2.0 1 Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,

More information

Physics 270: Experimental Physics

Physics 270: Experimental Physics 2017 edition Lab Manual Physics 270 3 Physics 270: Experimental Physics Lecture: Lab: Instructor: Office: Email: Tuesdays, 2 3:50 PM Thursdays, 2 4:50 PM Dr. Uttam Manna 313C Moulton Hall umanna@ilstu.edu

More information

ACCOMMODATIONS FOR STUDENTS WITH DISABILITIES

ACCOMMODATIONS FOR STUDENTS WITH DISABILITIES 0/9/204 205 ACCOMMODATIONS FOR STUDENTS WITH DISABILITIES TEA Student Assessment Division September 24, 204 TETN 485 DISCLAIMER These slides have been prepared and approved by the Student Assessment Division

More information

STUDENT MOODLE ORIENTATION

STUDENT MOODLE ORIENTATION BAKER UNIVERSITY SCHOOL OF PROFESSIONAL AND GRADUATE STUDIES STUDENT MOODLE ORIENTATION TABLE OF CONTENTS Introduction to Moodle... 2 Online Aptitude Assessment... 2 Moodle Icons... 6 Logging In... 8 Page

More information

USER ADAPTATION IN E-LEARNING ENVIRONMENTS

USER ADAPTATION IN E-LEARNING ENVIRONMENTS USER ADAPTATION IN E-LEARNING ENVIRONMENTS Paraskevi Tzouveli Image, Video and Multimedia Systems Laboratory School of Electrical and Computer Engineering National Technical University of Athens tpar@image.

More information

Course Content Concepts

Course Content Concepts CS 1371 SYLLABUS, Fall, 2017 Revised 8/6/17 Computing for Engineers Course Content Concepts The students will be expected to be familiar with the following concepts, either by writing code to solve problems,

More information

Introduction to Simulation

Introduction to Simulation Introduction to Simulation Spring 2010 Dr. Louis Luangkesorn University of Pittsburgh January 19, 2010 Dr. Louis Luangkesorn ( University of Pittsburgh ) Introduction to Simulation January 19, 2010 1 /

More information

ECE-492 SENIOR ADVANCED DESIGN PROJECT

ECE-492 SENIOR ADVANCED DESIGN PROJECT ECE-492 SENIOR ADVANCED DESIGN PROJECT Meeting #3 1 ECE-492 Meeting#3 Q1: Who is not on a team? Q2: Which students/teams still did not select a topic? 2 ENGINEERING DESIGN You have studied a great deal

More information

GCSE Mathematics B (Linear) Mark Scheme for November Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education

GCSE Mathematics B (Linear) Mark Scheme for November Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education GCSE Mathematics B (Linear) Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education Mark Scheme for November 2014 Oxford Cambridge and RSA Examinations OCR (Oxford Cambridge

More information

BMBF Project ROBUKOM: Robust Communication Networks

BMBF Project ROBUKOM: Robust Communication Networks BMBF Project ROBUKOM: Robust Communication Networks Arie M.C.A. Koster Christoph Helmberg Andreas Bley Martin Grötschel Thomas Bauschert supported by BMBF grant 03MS616A: ROBUKOM Robust Communication Networks,

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

Introduction to Mobile Learning Systems and Usability Factors

Introduction to Mobile Learning Systems and Usability Factors Introduction to Mobile Learning Systems and Usability Factors K.B.Lee Computer Science University of Northern Virginia Annandale, VA Kwang.lee@unva.edu Abstract - Number of people using mobile phones has

More information

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Amit Juneja and Carol Espy-Wilson Department of Electrical and Computer Engineering University of Maryland,

More information

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016 AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory

More information

Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode

Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode Diploma Thesis of Michael Heck At the Department of Informatics Karlsruhe Institute of Technology

More information

Individual Component Checklist L I S T E N I N G. for use with ONE task ENGLISH VERSION

Individual Component Checklist L I S T E N I N G. for use with ONE task ENGLISH VERSION L I S T E N I N G Individual Component Checklist for use with ONE task ENGLISH VERSION INTRODUCTION This checklist has been designed for use as a practical tool for describing ONE TASK in a test of listening.

More information

SER CHANGES~ACCOMMODATIONS PAGES

SER CHANGES~ACCOMMODATIONS PAGES EAST PARISH SCHOOL BOARD EXCEPTIONAL STUDENT SERVICES DEPARTMENT Excellence in Education! 12732 SILLIMAN STREET. P.O. BOX 397 CLINTON, LOUISIANA 70722 PHONE: (225) 683-8582 FAX: (225) 683-8525 www.efpsb.k12.la.us

More information

Five Challenges for the Collaborative Classroom and How to Solve Them

Five Challenges for the Collaborative Classroom and How to Solve Them An white paper sponsored by ELMO Five Challenges for the Collaborative Classroom and How to Solve Them CONTENTS 2 Why Create a Collaborative Classroom? 3 Key Challenges to Digital Collaboration 5 How Huddle

More information

Think A F R I C A when assessing speaking. C.E.F.R. Oral Assessment Criteria. Think A F R I C A - 1 -

Think A F R I C A when assessing speaking. C.E.F.R. Oral Assessment Criteria. Think A F R I C A - 1 - C.E.F.R. Oral Assessment Criteria Think A F R I C A - 1 - 1. The extracts in the left hand column are taken from the official descriptors of the CEFR levels. How would you grade them on a scale of low,

More information

Instructor: Mario D. Garrett, Ph.D. Phone: Office: Hepner Hall (HH) 100

Instructor: Mario D. Garrett, Ph.D.   Phone: Office: Hepner Hall (HH) 100 San Diego State University School of Social Work 610 COMPUTER APPLICATIONS FOR SOCIAL WORK PRACTICE Statistical Package for the Social Sciences Office: Hepner Hall (HH) 100 Instructor: Mario D. Garrett,

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

Courses in English. Application Development Technology. Artificial Intelligence. 2017/18 Spring Semester. Database access

Courses in English. Application Development Technology. Artificial Intelligence. 2017/18 Spring Semester. Database access The courses availability depends on the minimum number of registered students (5). If the course couldn t start, students can still complete it in the form of project work and regular consultations with

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Dr. Amardeep Kaur Professor, Babe Ke College of Education, Mudki, Ferozepur, Punjab Abstract The present

More information

Reinforcement Learning by Comparing Immediate Reward

Reinforcement Learning by Comparing Immediate Reward Reinforcement Learning by Comparing Immediate Reward Punit Pandey DeepshikhaPandey Dr. Shishir Kumar Abstract This paper introduces an approach to Reinforcement Learning Algorithm by comparing their immediate

More information

The Moodle and joule 2 Teacher Toolkit

The Moodle and joule 2 Teacher Toolkit The Moodle and joule 2 Teacher Toolkit Moodlerooms Learning Solutions The design and development of Moodle and joule continues to be guided by social constructionist pedagogy. This refers to the idea that

More information

First Grade Curriculum Highlights: In alignment with the Common Core Standards

First Grade Curriculum Highlights: In alignment with the Common Core Standards First Grade Curriculum Highlights: In alignment with the Common Core Standards ENGLISH LANGUAGE ARTS Foundational Skills Print Concepts Demonstrate understanding of the organization and basic features

More information

Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I

Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I Session 1793 Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I John Greco, Ph.D. Department of Electrical and Computer Engineering Lafayette College Easton, PA 18042 Abstract

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

Unit purpose and aim. Level: 3 Sub-level: Unit 315 Credit value: 6 Guided learning hours: 50

Unit purpose and aim. Level: 3 Sub-level: Unit 315 Credit value: 6 Guided learning hours: 50 Unit Title: Game design concepts Level: 3 Sub-level: Unit 315 Credit value: 6 Guided learning hours: 50 Unit purpose and aim This unit helps learners to familiarise themselves with the more advanced aspects

More information

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz

More information

Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games

Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games David B. Christian, Mark O. Riedl and R. Michael Young Liquid Narrative Group Computer Science Department

More information

arxiv: v1 [math.at] 10 Jan 2016

arxiv: v1 [math.at] 10 Jan 2016 THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the

More information

Implementing a tool to Support KAOS-Beta Process Model Using EPF

Implementing a tool to Support KAOS-Beta Process Model Using EPF Implementing a tool to Support KAOS-Beta Process Model Using EPF Malihe Tabatabaie Malihe.Tabatabaie@cs.york.ac.uk Department of Computer Science The University of York United Kingdom Eclipse Process Framework

More information

Multimedia Courseware of Road Safety Education for Secondary School Students

Multimedia Courseware of Road Safety Education for Secondary School Students Multimedia Courseware of Road Safety Education for Secondary School Students Hanis Salwani, O 1 and Sobihatun ur, A.S 2 1 Universiti Utara Malaysia, Malaysia, hanisalwani89@hotmail.com 2 Universiti Utara

More information

Test Administrator User Guide

Test Administrator User Guide Test Administrator User Guide Fall 2017 and Winter 2018 Published October 17, 2017 Prepared by the American Institutes for Research Descriptions of the operation of the Test Information Distribution Engine,

More information

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers October 31, 2003 Amit Juneja Department of Electrical and Computer Engineering University of Maryland, College Park,

More information

Introduction to Causal Inference. Problem Set 1. Required Problems

Introduction to Causal Inference. Problem Set 1. Required Problems Introduction to Causal Inference Problem Set 1 Professor: Teppei Yamamoto Due Friday, July 15 (at beginning of class) Only the required problems are due on the above date. The optional problems will not

More information

Seminar - Organic Computing

Seminar - Organic Computing Seminar - Organic Computing Self-Organisation of OC-Systems Markus Franke 25.01.2006 Typeset by FoilTEX Timetable 1. Overview 2. Characteristics of SO-Systems 3. Concern with Nature 4. Design-Concepts

More information