Simple Variable Length N-grams for Probabilistic Automata Learning


JMLR: Workshop and Conference Proceedings 21:254-258, 2012        The 11th ICGI

Fabio N. Kepler        fabiokepler@unipampa.edu.br
Sergio L. S. Mergen    sergiomergen@unipampa.edu.br
Cleo Z. Billa          cleobilla@unipampa.edu.br
LEA Group, Alegrete, Federal University of Pampa (UNIPAMPA)

Editors: Jeffrey Heinz, Colin de la Higuera, and Tim Oates

Abstract

This paper describes an approach used in the 2012 Probabilistic Automata Learning Competition. The main goal of the competition was to obtain insights about which techniques and approaches work best for sequence learning based on different kinds of generating automata. This paper proposes the use of n-gram models with variable length. Experiments show that, on the test sets provided by the competition, the variable-length approach works better than fixed 3-grams.

Keywords: Probabilistic automata learning; Variable length n-grams; Markov models.

1. Introduction

This paper describes the approach used in the 2012 Probabilistic Automata Learning Competition (PAutomaC).1 The competition was about learning non-deterministic probabilistic finite state machines, based on artificial data automatically generated by four kinds of machines: Markov Chains, Deterministic Probabilistic Finite Automata, Hidden Markov Models, and Probabilistic Finite Automata. The main goal of the competition was to obtain insights about which techniques and approaches would work best for these machines given their kind and parameter settings. For details about how the machines were generated and an overview of the area of probabilistic automata learning, see Verwer et al. (2012).

1. http://ai.cs.umbc.edu/icgi2012/challenge/pautomac

Basically, a number of different problems, i.e., data sets generated by different machine settings, were provided, each consisting of a training set and a test set. Each problem comprises a number of sentences, which are sequences of integer symbols from a given vocabulary. The goal is to predict the probability of a sequence of symbols. Three baselines were provided: one based on the frequency of each token (a small sketch of this baseline closes this section); a 3-gram model; and the ALERGIA algorithm.

Our approach to solving these problems is based on an n-gram model with variable length. This kind of solution handles the memory cost of larger n-grams by applying pruning strategies that effectively shrink the state space. In this paper we show how we employed this idea using a simple tree structure.

This paper is organized as follows: in Section 2 we explain the vln-gram approach we used; in Section 3 we show the results this approach achieved; and in Section 4 we draw some final considerations.

© 2012 F.N. Kepler, S.L.S. Mergen & C.Z. Billa.
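For reference, the first of these baselines, which scores a sequence by the frequencies of its individual tokens, can be sketched in a few lines. This is our own illustration rather than the competition's reference implementation; sequences are assumed to be plain lists of integer symbols, and the floor used for unseen symbols is an arbitrary choice.

from collections import Counter

def train_unigram(sequences):
    """Relative frequency of each symbol over all training sequences."""
    counts = Counter(sym for seq in sequences for sym in seq)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

def unigram_probability(model, sequence, unseen=1e-6):
    """Score a sequence as the product of its per-symbol frequencies."""
    prob = 1.0
    for sym in sequence:
        prob *= model.get(sym, unseen)
    return prob

# Toy usage over the vocabulary {1, 2, 3}:
model = train_unigram([[1, 2, 3, 1], [2, 2, 3]])
print(unigram_probability(model, [1, 2, 3]))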

2. N-grams with variable length

The use of a trigram (3-gram) model as a baseline for many tasks is due to its simplicity and yet relatively good performance. Several researchers have argued that a higher order model, i.e., an n-gram model with n > 3, would increase performance if a sufficiently large training set were given. However, this incurs an intractable state space in terms of required memory. A solution developed since the 1990s to deal with such restrictions is to use n-gram models with variable length (Ron et al., 1996; Kneser, 1996). These models use larger values of n, which means a longer context window, and then prune the state space, allowing contexts of different lengths to coexist.

In order to effectively store contexts with variable length we use a simple tree structure, which we call a context tree. A context tree is built using the sequences of symbols from a training set. An example can be seen in Figure 1. Each node is a symbol in the vocabulary (in this case {1, 2, 3, 4}) and has an associated table with the counts of the symbols that appear after the sequence formed by the nodes on the path up to the root.

[Figure 1: Example of a context tree. Each node holds one symbol and a table counting the symbols that follow the context given by the path from that node up to the root.]

Taking the leaf node with the symbol 2, the context it represents is 2 4, and 2, 3, and 4 are symbols that appear after 2 4. That means the training set contains the subsequences 2 4 2, 2 4 3, and 2 4 4, which occur 3, 54, and 5 times, respectively.

The process for building the tree involves two steps. First, having defined n as the longest size of subsequence to consider, we run over the training set, adding sequences of n symbols to the tree, either by inserting new nodes or by updating counts. Then the tree is pruned. For every leaf, the Kullback-Leibler divergence is calculated between the leaf and its parent. If the divergence is smaller than a given cut value, defined empirically, the leaf node does not add more relevant information than its parent, and is thus pruned off the tree.

To predict the probability of a sequence of symbols we go the usual way. For each symbol, from left to right, we search the tree for the longest subsequence to its left, returning its corresponding probability and multiplying it by the probabilities of the previous symbols.
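Putting the pieces together, a compact sketch of the scheme just described (counting contexts, KL-based pruning, and longest-context lookup) could look as follows. This reflects our reading of the method rather than the authors' implementation: the class and parameter names, the Laplace smoothing of the count tables, and the cascading bottom-up pruning pass are assumptions made to obtain a runnable example. The cut parameter plays the role of the cut value K used in the experiments of the next section.

import math
from collections import defaultdict

class Node:
    """One tree node: a child per symbol, plus counts of the symbols that
    follow the context formed by the path from this node up to the root."""
    def __init__(self):
        self.children = {}
        self.counts = defaultdict(int)

    def distribution(self, vocab, alpha=1.0):
        # Laplace-smoothed next-symbol distribution (smoothing is an assumption).
        total = sum(self.counts.values()) + alpha * len(vocab)
        return {s: (self.counts[s] + alpha) / total for s in vocab}

class ContextTree:
    def __init__(self, n, cut):
        self.n = n        # maximum context length
        self.cut = cut    # KL-divergence cut value (K in Section 3)
        self.root = Node()
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            self.vocab.update(seq)
            for i, sym in enumerate(seq):
                # Walk the context backwards (most recent symbol first),
                # updating the next-symbol counts at every depth up to n.
                node = self.root
                node.counts[sym] += 1
                for j in range(1, self.n + 1):
                    if i - j < 0:
                        break
                    node = node.children.setdefault(seq[i - j], Node())
                    node.counts[sym] += 1
        self._prune(self.root)
        return self

    def _kl(self, p, q):
        return sum(p[s] * math.log(p[s] / q[s]) for s in p)

    def _prune(self, node):
        # Bottom-up pass: prune leaves whose next-symbol distribution stays
        # close to their parent's (KL below the cut value). Pruning may
        # cascade, a stricter variant of the single pass described above.
        for sym, child in list(node.children.items()):
            self._prune(child)
            if not child.children:
                p = child.distribution(self.vocab)
                q = node.distribution(self.vocab)
                if self._kl(p, q) < self.cut:
                    del node.children[sym]

    def _longest_context(self, history):
        # Follow the history backwards as deep as the (pruned) tree allows.
        node = self.root
        for sym in reversed(history[-self.n:]):
            if sym not in node.children:
                break
            node = node.children[sym]
        return node

    def sequence_prob(self, seq):
        # Unknown symbols are not handled, matching the competition setting.
        prob = 1.0
        for i, sym in enumerate(seq):
            node = self._longest_context(seq[:i])
            prob *= node.distribution(self.vocab)[sym]
        return prob

# Toy usage over the vocabulary {1, 2, 3, 4}:
train = [[1, 2, 4, 2, 3], [2, 4, 3, 3, 1], [1, 2, 4, 2, 3]]
model = ContextTree(n=4, cut=5e-4).fit(train)
print(model.sequence_prob([1, 2, 4, 2]))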

3. Results

We use the data made available during the training phase of the competition. This allows us to also use the solutions that were made available and to compare the perplexities of different models. For each kind of machine used to generate the data, we choose at least two problems: one where vln-grams perform well and one where they do not. The kinds of machine are HMM (Hidden Markov Models), MC (Markov Chains), PDFA (Probabilistic Deterministic Finite Automata), and PNFA (Probabilistic Non-deterministic Finite Automata).

We report the perplexities for the 3-gram baseline, the leader during the training phase,2 and three vln-gram instances with different combinations of parameters. Table 1 shows the results. |V| is the vocabulary size, and vln-gram1, vln-gram2, and vln-gram3 use as parameters, respectively, n = 4, K = 5 × 10^-4; n = 5, K = 5 × 10^-4; and n = 5, K = 5 × 10^-5, where n is the maximum subsequence length and K is the cut value used for pruning the tree according to the Kullback-Leibler divergence.

2. Available at http://ai.cs.umbc.edu/icgi2012/challenge/pautomac/results_old.php.

We can see that the 3-gram baseline is easily surpassed by a vln-gram model in almost all cases, the exceptions being problems HMM 22 and PNFA 0.

Table 1: Perplexity of various models with respect to the solution.

    Dataset    |V|   3-gram    Leader    vln-gram1  vln-gram2  vln-gram3
    HMM 23      5    45.2930    4.343    44.7347    44.6885     4.4695
    HMM 4       2     3.60      3.0573    3.0655     3.0623     3.0676
    HMM 22      8      .563      .389      .560       .6288      .3496
    MC 7             28.2403   27.4760   27.9483    28.0040    27.7656
    MC 8        8    69.8465   68.2840   69.2366    69.74      68.8033
    MC 6        5    68.4676   68.4409   68.4464    68.447     68.4480
    PDFA 9      0     4.6605    4.6032    4.6075     4.647      4.607
    PDFA 2      7     4.9779    4.2604   42.52      42.0768     4.428
    PDFA 43     9     4.4766   28.5565   36.24      32.205     32.959
    PNFA 7      5    40.0527   39.6367   40.4574    40.4567    39.6922
    PNFA 48    20    43.9553   34.7346   36.232     35.638     35.3566
    PNFA 0      7    34.464    34.464    35.0555    35.0376    34.522

However, we can also see that the same vln-gram model does not always yield the best result for all problems. The reasons for that are unclear, although they are obviously due to the specificities of each problem's generating automaton. For example, a large vocabulary could cause data sparsity with long contexts. But in problems HMM 4 and especially PNFA 48 we see that longer contexts have lower perplexity, despite the large vocabularies.
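The comparison above uses the competition's perplexity score, computed against the target probabilities published with the solutions. A sketch of such an evaluation loop is given below, reusing the hypothetical ContextTree class sketched in Section 2. The scoring formula follows our understanding of the PAutomaC measure, with candidate probabilities normalized over the test set (see Verwer et al., 2012, for the exact definition), and the function names are ours.

import math

def pautomac_perplexity(target_probs, candidate_probs):
    """2 ** (-sum_i target_i * log2(candidate_i)), candidates normalized.
    Assumes strictly positive candidate probabilities (e.g. from smoothing)."""
    total = sum(candidate_probs)
    score = 0.0
    for t, c in zip(target_probs, candidate_probs):
        score += t * math.log2(c / total)
    return 2 ** (-score)

def evaluate(train_seqs, test_seqs, target_probs, settings):
    """Try several (n, K) pairs, e.g. the vln-gram1..3 settings of Table 1."""
    for n, cut in settings:
        model = ContextTree(n=n, cut=cut).fit(train_seqs)
        cand = [model.sequence_prob(s) for s in test_seqs]
        ppl = pautomac_perplexity(target_probs, cand)
        print(f"n={n}, K={cut:g}: perplexity {ppl:.4f}")

# e.g. evaluate(train, test, targets, [(4, 5e-4), (5, 5e-4), (5, 5e-5)])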

Regarding the different kinds of problems, we can see that no single kind can be said to be easier than another: each of them has both harder and easier problems. Some of the problems clearly need only short contexts, and allowing longer contexts with a higher cut value does not always yield a similar perplexity. However, as noted above, using n-grams longer than 3 and pruning them yields better results than a fixed 3-gram.

Table 2 shows results over the HMM 23 problem with different parameter values. Neither increasing the context size nor decreasing the cut value shows a predictable result. There seems to be a tipping point around K = 5 × 10^-6 once the context is larger than 4.

Table 2: Perplexities using different model parameters over a single problem (HMM 23).

    K            n = 4      n = 5      n = 6
    5 × 10^-4    44.7347    44.6885    43.9577
    5 × 10^-5    42.2438     4.4695     4.74
    5 × 10^-6    42.05       4.4659     4.6727
    5 × 10^-7    42.046      4.4685     4.7205

There are also the aspects of the resulting model size and of the influence of the training set size, but due to space restrictions we do not analyze them here.

4. Considerations

We presented the simple approach of using variable length n-grams for sequence modeling, which has been used by the community since the 1990s for various problems, and reported its development and results on the automata learning competition. Our goal was to apply the simplest possible approach able to surpass the baselines, taking advantage of not having to deal with unknown words and not having to actually predict sequences, thus not requiring any algorithm like Viterbi's.

The approach expands on the classic 3-gram model, is fairly easy to implement, and usually yields better results. However, it raises the question of how to determine the best context size and the best cut value. A way of determining these parameters from aspects of the training set remains as future work, with the potential of raising the baseline bar in the future.

To allow the replication of the experiments, the full source code is available at https://gist.github.com/ee

Acknowledgement

The authors would like to thank the competition committee for providing open data and solutions, and for encouraging participation and the writing of this paper.

References

Reinhard Kneser. Statistical language modeling using a variable context length. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), volume 1, pages 494-497. IEEE, 1996.

Dana Ron, Yoram Singer, and Naftali Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25:117-149, 1996.

Sicco Verwer, Rémi Eyraud, and Colin de la Higuera. PAutomaC: a PFA/HMM Learning Competition. In Proceedings of the 11th International Conference on Grammatical Inference, 2012.