Predicting Words and Sentences using Statistical Models


Predicting Words and Sentences using Statistical Models
Nicola Carmignani
Department of Computer Science, University of Pisa
carmigna@di.unipi.it
Language and Intelligence Reading Group, July 5, 2006

Outline
1 Introduction
2 Word Prediction
   The Origin of Word Prediction
   n-gram Models
3 Sentence Prediction
   Statistical Approach
   Information Retrieval Approach
4 Conclusions

Introduction
Natural Language Processing (NLP) aims to study the problems of automated generation and understanding of natural human languages. The major tasks in NLP:
- Text-to-Speech (TTS)
- Speech Recognition
- Machine Translation
- Information Extraction
- Question Answering
- Part-of-Speech (POS) Tagging
- Information Retrieval
- Automatic Summarization

Statistical NLP
Statistical inference aims to collect some data and then make inferences about its probability distribution. Prediction requires an appropriate language model, and natural language modeling is a statistical inference problem. Statistical NLP methods can be useful for capturing the human knowledge needed to allow prediction and for assessing the likelihood of various hypotheses:
- the probability of word sequences;
- the likelihood of word co-occurrence.

Prediction::An Overview
Humans are good at word prediction... Once upon a time...
... and at sentence prediction. Penny Lane is in my ears and in my eyes.

Prediction::Why?
Predictors support writing and are commonly used in combination with assistive devices such as keyboards, virtual keyboards, touchpads, and pointing devices. Frequent applications include repetitive tasks such as writing emails in call centers or letters in an administrative environment.
Applications of word prediction:
- spelling checkers
- mobile phone/PDA texting
- aids for disabled users
- handwriting recognition
- word-sense disambiguation

Outline
1 Introduction
2 Word Prediction
   The Origin of Word Prediction
   n-gram Models
3 Sentence Prediction
   Statistical Approach
   Information Retrieval Approach
4 Conclusions

Word Prediction::An Overview
Word prediction is the problem of guessing which word is likely to continue a given initial text fragment. Word prediction techniques are well-established methods in the field of Augmentative and Alternative Communication (AAC) and are frequently used as communication aids for people with disabilities:
- they accelerate writing;
- they reduce the effort needed to type;
- they suggest the correct word (no misspellings).

Please, don't confuse them!
Usually, when I say word prediction, everybody calls Tegic T9 to mind. T9 is a successful system, but its prediction is based on dictionary disambiguation (according to the last word only). We would like something that is skilful at predicting according to the previous context.

Word Prediction::The Origins
The word prediction task can be viewed as the statistical formulation of the speech recognition problem: finding the most likely word sequence $\hat{W}$ given the observable acoustic signal $A$,
$$\hat{W} = \arg\max_W P(W \mid A)$$
We can rewrite it using Bayes' rule:
$$\hat{W} = \arg\max_W \frac{P(A \mid W)\, P(W)}{P(A)}$$
Since $P(A)$ is independent of the choice of $W$, we can simplify as follows:
$$\hat{W} = \arg\max_W P(A \mid W)\, P(W)$$

n-gram Models::Introduction
In order to predict the next word $w_N$ given the context or history $w_1, \ldots, w_{N-1}$, we want to estimate the probability $P(w_N \mid w_1, \ldots, w_{N-1})$. The language model estimates the values $P(W)$, where $W = w_1, \ldots, w_N$. By the chain rule of probability, we get
$$P(W) = \prod_{i=1}^{N} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

n-gram Models
Since the parameter space of $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is too large, we need a model in which all similar histories $w_1, w_2, \ldots, w_{i-1}$ are placed in the same equivalence class.
Markov assumption: only the local prior context (the last few words) affects the next word.
This gives an $(n-1)$th-order Markov model, or n-gram.

n-gram Models
Formally, an n-gram model is defined by the approximation
$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$
Typical values of n are:
- n = 1 (unigram): $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i)$
- n = 2 (bigram): $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
- n = 3 (trigram): $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

n-gram Word Models::Example
Example: W = "Last night I went to the concert". Instead of $P(\text{concert} \mid \text{Last night I went to the})$, we use a bigram $P(\text{concert} \mid \text{the})$ or a trigram $P(\text{concert} \mid \text{to the})$.
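As a concrete illustration, here is a minimal sketch of maximum-likelihood bigram estimation over a toy corpus. The corpus, tokenization, and function names are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w2 | w1) by maximum likelihood."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = ["last night I went to the concert",
          "we went to the cinema last night"]
model = train_bigram(corpus)
print(model[("the", "concert")])  # P(concert | the) = 0.5 on this toy corpus
```

The same counting scheme extends to trigrams by conditioning on pairs of preceding words.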

How to Estimate Probabilities
Where do we find these probabilities? Corpora are collections of text and speech (e.g., the Brown Corpus). Two different corpora are needed:
- probabilities are extracted from a training corpus, which is used to design the model;
- a test corpus is used to run trials in order to evaluate the model.

Problems with n-grams
The drawback of these methods is the amount of text needed to train the model. The training corpus has to be large enough to ensure that each valid word sequence appears a significant number of times. A great amount of computational resources is needed, especially if the lexicon is large. For a vocabulary V of 20,000 words:
- V^2 = 400 million bigrams;
- V^3 = 8 trillion trigrams;
- V^4 = 1.6 × 10^17 four-grams.
Since the number of possible word combinations is very large, there is a need to focus attention on a smaller subset of these.

n-gram POS Models
One proposed solution consists in generalizing the n-gram model by grouping words into categories according to the context. A mapping $\varphi$ is defined to approximate a context by the equivalence class it belongs to:
$$P(w_i \mid \varphi[w_{i-n+1}, \ldots, w_{i-1}])$$
Usually, Part-of-Speech (POS) tags are used as the mapping function, replacing each word with the corresponding POS tag (i.e., its class). POS tags have the potential of allowing generalization over similar words, as well as reducing the size of the language model.
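As an illustration, a class-based bigram can be estimated by tagging the history and conditioning on the previous tag rather than the previous word. The sketch below uses NLTK's pos_tag as one possible choice of the mapping $\varphi$; the function name and toy sentences are assumptions made for the example.

```python
from collections import Counter
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

def train_class_bigram(sentences):
    """Estimate P(word | POS tag of the previous word): a class-based bigram."""
    tag_counts, tag_word = Counter(), Counter()
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))  # phi: word -> POS tag
        for (_, prev_tag), (word, _) in zip(tagged, tagged[1:]):
            tag_counts[prev_tag] += 1
            tag_word[(prev_tag, word.lower())] += 1
    return {k: c / tag_counts[k[0]] for k, c in tag_word.items()}

model = train_class_bigram(["The girl writes a letter",
                            "The boy reads a book"])
print(model.get(("DT", "girl")))  # P(girl | previous word is a determiner)
```

Because all determiners share the tag DT, the model generalizes over histories that a word-level bigram would treat as distinct.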

n-gram Models for Inflected Languages
Many word prediction methods focus on weakly inflected languages (such as English) that have a small amount of variation. Inflected languages can have a huge number of affixes that affect the syntactic function of every word, and it is difficult to include every variation of a word in the dictionary. Italian is a morphologically rich language with a high rate of inflected forms, so a morpho-syntactic component is needed to compose inflections in accordance with the context:
- Gender: "lui è un... professore" ('he is a... professor', masculine), not "professoressa" (feminine);
- Number: "le mie... scarpe" ('my... shoes', plural), not "scarpa" (singular);
- Verbal agreement: "la ragazza... scrive" ('the girl... writes'), not "scriviamo" ('we write').

Hybrid Approach to Prediction
Prediction can be based either on text statistics or on linguistic rules. Two Markov models can be included: one for word classes (POS-tag unigrams, bigrams, and trigrams) and one for words (word unigrams and bigrams). A linear combination algorithm may combine these two models, incorporating morpho-syntactic information to improve prediction accuracy; a sketch of such an interpolation follows.
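A minimal sketch of such a linear combination, assuming the word-level and class-level models estimated as in the previous sketches; the interpolation weight lam is an illustrative value that would in practice be tuned on held-out data.

```python
def interpolated_prob(word, history_words, history_tags,
                      word_model, class_model, lam=0.6):
    """Linearly combine a word bigram model with a POS-class model.

    word_model:  dict (prev_word, word) -> P(word | prev_word)
    class_model: dict (prev_tag,  word) -> P(word | prev_tag)
    """
    p_word = word_model.get((history_words[-1], word), 0.0)
    p_class = class_model.get((history_tags[-1], word), 0.0)
    return lam * p_word + (1 - lam) * p_class
```

The class model acts as a back-off: when a word bigram has never been seen, the POS component can still rank morphologically appropriate candidates.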

Outline
1 Introduction
2 Word Prediction
   The Origin of Word Prediction
   n-gram Models
3 Sentence Prediction
   Statistical Approach
   Information Retrieval Approach
4 Conclusions

Sentence Prediction::An Overview
It is now easy to guess what sentence prediction is: we would like to predict how a user will continue a given initial fragment of natural language text.
Some applications:
- the operating system shell: Korvemaker and Greiner have developed a system that predicts whole command lines;
- search engines;
- word processors;
- mobile phones.
(Images retrieved from Arnab Nandi's presentation "Better Phrase Prediction".)

Sentence Prediction::Two Approaches
A possible approach to the sentence prediction problem is to learn a language model and construct the most likely sentence: the statistical approach. An alternative solution addresses completion with information retrieval methods. Domain-specific collections of documents are used as corpora in both lines of research. Clearly, a constrained application context improves the accuracy of prediction.

Sentence Prediction::Statistical Approach
As shown, n-gram language models provide a natural approach to the construction of sentence completion systems, but on their own they may not be sufficient.
- Eng and Eisner have developed a radiology report entry system that implements an automated phrase completion feature based on language modeling (a trigram language model).
- Bickel, Haider and Scheffer have developed an n-gram based completion method using specific document collections, such as emails and weather reports.

[Eng et al., 2004]
Radiology report domain. Training corpus: 36,843 general reports; performance was tested on 200 reports outside of the training set. The algorithm is based on a trigram language model and provides both word prediction and phrase completion. Word chaining guesses zero or more subsequent words; a threshold chain length $L(w_1, w_2)$ can be determined in order to extend the prediction to further words.

[Eng et al., 2004]
- All alphabetic characters were converted to uppercase.
- Words occurring fewer than 10 times in the corpus were replaced with a special label in order to eliminate misspelled words.
- Punctuation marks were removed from the corpus, so they do not appear in the suggested sentence and must be entered when needed.

[Bickel et al., 2005]
Application corpora: call-center emails, personal emails, weather reports, and cooking recipes. Sentence completion is based on a linear interpolation of n-gram models: finding the most likely word sequence $w_{t+1}, \ldots, w_{t+T}$ given a word n-gram model and an initial sequence $w_1, \ldots, w_t$. The decoding problem is mathematically defined as
$$\arg\max_{w_{t+1}, \ldots, w_{t+T}} P(w_{t+1}, \ldots, w_{t+T} \mid w_1, \ldots, w_t)$$
The n-th order Markov assumption constrains each $w_t$ to depend on at most $w_{t-n+1}$ through $w_{t-1}$. The parameters of the problem are the probabilities $P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$.

[Bickel et al., 2005]
An n-gram model is learned by estimating the probability of all possible combinations of n words. The solution to overcome the data sparseness problem is to use a weighted linear mixture of n-gram models. Several mathematical transformations reduce the problem to a Viterbi algorithm that retrieves the most likely word sequence: the algorithm starts with the most recently entered word ($w_t$) and moves forward iteratively, looking for the highest-scoring sequence up to a period.
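A much-simplified sketch of this idea: n-gram models of increasing order are mixed with fixed weights, and the text is extended greedily one word at a time. The real system decodes the whole sequence with a Viterbi algorithm; the data structures and weights below are illustrative assumptions.

```python
def mixture_prob(word, history, models, weights):
    """Weighted linear mixture of n-gram models.

    models[k] maps (context tuple of length k, word) -> probability,
    so models[0] is a unigram model, models[1] a bigram model, and so on.
    """
    p = 0.0
    for k, (model, w) in enumerate(zip(models, weights)):
        ctx = tuple(history[-k:]) if k else ()
        p += w * model.get((ctx, word), 0.0)
    return p

def complete(history, vocab, models, weights, max_len=10):
    """Greedy completion: repeatedly append the most likely next word."""
    out = list(history)
    for _ in range(max_len):
        word = max(vocab, key=lambda w: mixture_prob(w, out, models, weights))
        out.append(word)
        if word == ".":  # stop at the end of the sentence
            break
    return out[len(history):]
```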

Sentence Prediction::IR Approach
An information retrieval approach to sentence prediction involves finding, in a corpus, the sentence that is most similar to a given initial fragment. Grabski and Scheffer have developed an indexing method that retrieves such a sentence from a collection of documents. Information retrieval aims to provide methods that satisfy a user's information need; here, the model has to retrieve the remaining part of a sentence.

[Grabski et al., 2004]
The approach is to search for the sentence whose initial words are most similar to the given initial sequence in a vector space representation. For each training sentence $d_j$ and each length $l$, a TF-IDF representation of the first $l$ words is calculated:
$$f^{l}_{i,j} = \mathrm{normalize}(TF(t_i, d_j, l) \cdot IDF(t_i))$$
The similarity between two vectors is defined by the cosine measure.
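A sketch of the prefix representation and cosine matching, assuming a precomputed idf dictionary and tokenized sentences; the helper names are illustrative.

```python
import math
from collections import Counter

def tfidf_prefix(tokens, l, idf):
    """L2-normalized TF-IDF vector of the first l words of a sentence."""
    tf = Counter(tokens[:l])
    vec = {t: c * idf.get(t, 0.0) for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def cosine(u, v):
    """Cosine similarity; with unit vectors this is just the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def best_match(fragment, sentences, idf):
    """Return the training sentence whose prefix best matches the fragment."""
    q = tfidf_prefix(fragment, len(fragment), idf)
    return max(sentences,
               key=lambda s: cosine(q, tfidf_prefix(s, len(fragment), idf)))
```

The completion offered to the user is then the part of the best-matching sentence that follows the matched prefix.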

[Grabski et al., 2004]
To find the best-fitting sentence, an indexing algorithm is used:
- an inverted index structure lists, for each term, the sentences in which the term occurs (the postings);
- the postings lists are sorted according to a relation < defined on sentence pairs: $s_1 < s_2$ if $s_1$ appears in the document collection more frequently than $s_2$;
- a similarity bound can be calculated to stop the retrieval algorithm once there is no better sentence left to find.
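A toy version of such an index with frequency-sorted postings; the structure and the sorting key are assumptions made for illustration, not the paper's exact data layout.

```python
from collections import Counter, defaultdict

def build_index(sentences):
    """Inverted index: term -> sentence ids, most frequent sentences first."""
    freq = Counter(tuple(s) for s in sentences)  # duplicates raise a sentence's rank
    postings = defaultdict(set)
    for sid, sent in enumerate(sentences):
        for term in sent:
            postings[term].add(sid)
    order = lambda sid: -freq[tuple(sentences[sid])]
    return {t: sorted(ids, key=order) for t, ids in postings.items()}
```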

[Grabski et al., 2004]
Such a structure improves access time but raises the problem of having to store a huge amount of data. Data compression has been achieved by using clustering techniques that find groups of semantically equivalent sentences. The result of the clustering algorithm is a tree of clusters whose leaf nodes contain the groups of sentences; the tree can also be used to access the data more quickly.

Outline
1 Introduction
2 Word Prediction
   The Origin of Word Prediction
   n-gram Models
3 Sentence Prediction
   Statistical Approach
   Information Retrieval Approach
4 Conclusions

Conclusions
A prediction system is particularly useful to minimize keystrokes for users with special needs and to reduce misspellings and typographic errors. Moreover, it can be effectively used in language learning, by suggesting well-formed words to non-native users. Prediction methods can include different modeling strategies for linguistic information. Stochastic modeling (n-gram models) considers only a small window of the written text (e.g., the last n words).

It's Worth Another Try!!!
THE END

References I
S. Hunnicutt, L. Nozadze and G. Chikoidze, "Russian Word Prediction with Morphological Support," 5th International Symposium on Language, Logic and Computation, Tbilisi, Georgia, 2003.
Y. Even-Zohar and D. Roth, "A Classification Approach to Word Prediction," NAACL-2000, 1st North American Conference on Computational Linguistics, pp. 124-131, 2000.
S. Bickel, P. Haider and T. Scheffer, "Predicting Sentences using N-Gram Language Models," Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2005.

References II
A. Fazly and G. Hirst, "Testing the Efficacy of Part-of-Speech Information in Word Completion," Proceedings of the Workshop on Language Modeling for Text Entry Methods, 10th EACL, Budapest, 2003.
J. Eng and J. Eisner, "Radiology Report Entry with Automatic Phrase Completion Driven by Language Modeling," Radiographics 24(5):1493-1501, September-October 2004.
K. Grabski and T. Scheffer, "Sentence Completion," Proceedings of the SIGIR International Conference on Information Retrieval, 2004.

References III
B. Korvemaker and R. Greiner, "Predicting UNIX Command Lines: Adjusting to User Patterns," Proceedings of AAAI/IAAI 2000, pp. 230-235, 2000.
S. Cagigas, "Contribution to Word Prediction in Spanish and its Integration in Technical Aids for People with Physical Disabilities," PhD Dissertation, Madrid University, 2001.
E. Gustavii and E. Pettersson, "A Swedish Grammar for Word Prediction," Master's Thesis, Department of Linguistics, Uppsala University, 2003.