Natural Language Processing

Lecture 18 Natural Language Processing Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Slides by Dan Klein at Berkeley

Course Overview
- Introduction: Artificial Intelligence, Intelligent Agents
- Search: Uninformed Search, Heuristic Search
- Uncertain knowledge and Reasoning: Probability and Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
- Learning: Supervised (Decision Trees, Neural Networks, Learning Bayesian Networks), Unsupervised (EM Algorithm), Reinforcement Learning
- Games and Adversarial Search: Minimax search and Alpha-beta pruning, Multiagent search
- Knowledge representation and Reasoning: Propositional logic, First order logic, Inference, Planning

Outline 1. 2. 3. Statistical MT, Rule-based MT

Sequential data

Filtering

State Trellis
- State trellis: graph of states and transitions over time
- Each arc represents some transition $x_{t-1} \to x_t$
- Each arc has weight $\Pr(x_t \mid x_{t-1}) \Pr(e_t \mid x_t)$
- Each path is a sequence of states
- The product of weights on a path is the sequence's probability
- Can think of the Forward (and now Viterbi) algorithms as computing sums of all paths (best paths) in this graph
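The two readings of the trellis (sum over all paths versus best path) can be sketched on a toy HMM. This is only an illustration: the states, the probabilities in `PRIOR`, `TRANS` and `EMIT`, and the use of a prior for the first time step are assumptions, not values from the lecture.

```python
# Toy two-state HMM. All numbers are illustrative assumptions.
STATES = ["rain", "sun"]
PRIOR = {"rain": 0.5, "sun": 0.5}
TRANS = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun": {"rain": 0.3, "sun": 0.7}}
EMIT = {"rain": {"umbrella": 0.9, "none": 0.1},
        "sun": {"umbrella": 0.2, "none": 0.8}}

def forward(evidence):
    """Sum of the weights of all trellis paths, i.e. Pr(e_1:T)."""
    alpha = {s: PRIOR[s] * EMIT[s][evidence[0]] for s in STATES}
    for e in evidence[1:]:
        # Arc weight is Pr(x_t | x_{t-1}) * Pr(e_t | x_t); sum over predecessors.
        alpha = {s: EMIT[s][e] * sum(alpha[p] * TRANS[p][s] for p in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(evidence):
    """Weight and state sequence of the single best trellis path."""
    best = {s: (PRIOR[s] * EMIT[s][evidence[0]], [s]) for s in STATES}
    for e in evidence[1:]:
        new_best = {}
        for s in STATES:
            # Same recursion as forward(), with max in place of sum.
            weight, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (weight * EMIT[s][e], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])
```

Calling `viterbi(["umbrella", "umbrella", "none"])` returns the best path together with its weight, which can never exceed the value of `forward` on the same evidence.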

Forward/Viterbi

Particle Filtering
- Particles: track samples of states rather than an explicit distribution
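One update step of a particle filter might look like the sketch below. The model tables `TRANS` and `EMIT` and the function name `particle_filter_step` are hypothetical, and the propagate, weight, resample scheme is the common textbook variant rather than anything specified on the slide.

```python
import random

# Illustrative model; the probabilities are assumptions.
TRANS = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun": {"rain": 0.3, "sun": 0.7}}
EMIT = {"rain": {"umbrella": 0.9, "none": 0.1},
        "sun": {"umbrella": 0.2, "none": 0.8}}

def particle_filter_step(particles, evidence, rng):
    """Advance a list of sampled states by one time step."""
    # 1. Propagate: sample a successor for every particle.
    moved = [
        rng.choices(list(TRANS[p]), weights=list(TRANS[p].values()))[0]
        for p in particles
    ]
    # 2. Weight: score each particle by how well it explains the evidence.
    weights = [EMIT[p][evidence] for p in moved]
    # 3. Resample: draw a new population in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))
```

The fraction of particles in a state approximates the filtered probability of that state; with more particles the approximation gets closer to the exact forward computation.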

Natural Language
- 100,000 years ago humans started to speak
- 7,000 years ago humans started to write
- Machines process natural language to:
  - acquire information
  - communicate with humans

Natural Language Processing
- Speech technologies
  - Automatic speech recognition (ASR)
  - Text-to-speech synthesis (TTS)
  - Dialog systems
- Language processing technologies
  - Machine translation
  - Information extraction
  - Web search, question answering
  - Text classification, spam filtering, etc.

Digitizing Speech
- Speech input is an acoustic waveform

Spectral Analysis

Acoustic Feature Sequence

State Space
- $\Pr(E \mid X)$ encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
- $\Pr(X \mid X')$ encodes how sounds can be strung together
- We will have one state for each sound in each word
- From some state x, can only:
  - Stay in the same state (e.g. speaking slowly)
  - Move to the next position in the word
  - At the end of the word, move to the start of the next word
- We build a little state graph for each word and chain them together to form our state space X
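The construction of the state space can be sketched as follows. The lexicon, the fixed self-loop probability `p_stay`, and the uniform choice of the next word are all assumptions made for illustration; a real recognizer would take the word-to-word probabilities from a language model.

```python
def build_state_space(lexicon, p_stay=0.5):
    """Chain per-word state graphs into one transition table.

    lexicon maps a word to its list of sounds. A state is the pair
    (word, position). Returns {state: {next_state: probability}}.
    """
    word_starts = [(word, 0) for word in lexicon]
    transitions = {}
    for word, sounds in lexicon.items():
        for i in range(len(sounds)):
            state = (word, i)
            out = {state: p_stay}  # stay in the same state
            if i + 1 < len(sounds):
                out[(word, i + 1)] = 1.0 - p_stay  # next position in the word
            else:
                # End of the word: jump to the start of some next word
                # (uniform here, which is an assumption).
                share = (1.0 - p_stay) / len(word_starts)
                for start in word_starts:
                    out[start] = out.get(start, 0.0) + share
            transitions[state] = out
    return transitions

# Hypothetical two-word lexicon with rough phoneme labels.
LEXICON = {"cat": ["k", "ae", "t"], "at": ["ae", "t"]}
STATE_SPACE = build_state_space(LEXICON)
```

Each row of `STATE_SPACE` sums to one, and only the last state of a word has arcs leaving the word.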

HMM for speech

Transition with Bigrams

Decoding
- While there are some practical issues, finding the words given the acoustics is an HMM inference problem
- We want to know which state sequence $x_{1:T}$ is most likely given the evidence $e_{1:T}$:
  $x^*_{1:T} = \operatorname{argmax}_{x_{1:T}} \Pr(x_{1:T} \mid e_{1:T})$
- From the sequence x, we can simply read off the words
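Reading the words off a decoded state sequence can be sketched as below, assuming (hypothetically) that each state is a pair `(word, position)` with position 0 marking the start of a word. A repeated one-sound word cannot be told apart from staying in its only state under this encoding.

```python
def read_off_words(state_path):
    """Collapse a decoded state sequence into the sequence of words."""
    words = []
    previous = None
    for word, position in state_path:
        # A word starts whenever we enter position 0 from a different state.
        if position == 0 and previous != (word, position):
            words.append(word)
        previous = (word, position)
    return words

# Hypothetical decoder output: slow "cat" followed by "at".
PATH = [("cat", 0), ("cat", 0), ("cat", 1), ("cat", 2),
        ("at", 0), ("at", 1), ("at", 1)]
```

Here `read_off_words(PATH)` yields the two words in order; the repeated states only reflect how long each sound lasted.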

Fundamental goal: analyze and process human language, broadly, robustly, accurately...
End systems that we want to build:
- Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering...
- Modest: spelling correction, text categorization, language recognition, genre classification.

Language Models
- A language is defined by a set of strings and by rules, called grammars.
- Formal languages also need semantics that define meaning.
- Natural languages are:
  1. not definitive: there is disagreement about the grammar rules
     "Not to be invited is sad" / "To be not invited is sad"
  2. ambiguous:
     "Entire store 25% off"
     "I will bring my bike tomorrow if it looks nice in the morning."
  3. large and constantly changing

n-gram
- n-gram: sequence of n characters, or sequence of n words or syllables
- n-gram models: define probability distributions for these sequences
- An n-gram model is defined as a Markov chain of order $n-1$. For a trigram:
  $p(c_i \mid c_{1:i-1}) = p(c_i \mid c_{i-2:i-1})$
  $p(c_{1:N}) = \prod_{i=1}^{N} p(c_i \mid c_{1:i-1}) = \prod_{i=1}^{N} p(c_i \mid c_{i-2:i-1})$
- With 100 characters a trigram table already has $100^3$ (a million) entries; with words it is even worse
- Corpus: body of text
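A character trigram model estimated by plain counting could look like this minimal sketch. The two-space padding at the start of the text and the function names are assumptions, and no smoothing is applied, so unseen trigrams get probability zero.

```python
from collections import Counter

def train_trigram(text):
    """Count character trigrams and their two-character contexts."""
    padded = "  " + text  # pad so the first characters have a context
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts

def char_prob(char, context, model):
    """Relative-frequency estimate of p(c_i | c_{i-2:i-1})."""
    trigrams, contexts = model
    if contexts[context] == 0:
        return 0.0
    return trigrams[context + char] / contexts[context]

def sequence_prob(text, model):
    """p(c_1:N) as the product of the trigram conditionals."""
    padded = "  " + text
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= char_prob(padded[i], padded[i - 2:i], model)
    return prob

MODEL = train_trigram("ababac")
```

In the tiny corpus "ababac" the context "ba" is followed once by "b" and once by "c", so both continuations get probability one half.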

Language identification
- Learned from corpus: $p(c_i \mid c_{i-2:i-1}, l)$
- Most probable language:
  $l^* = \operatorname{argmax}_l p(l \mid c_{1:N})$
  $= \operatorname{argmax}_l p(l)\, p(c_{1:N} \mid l)$ (Bayes)
  $= \operatorname{argmax}_l p(l) \prod_{i=1}^{N} p(c_i \mid c_{i-2:i-1}, l)$ (Markov property)
- Computers can reach 99% accuracy
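The argmax above can be sketched with one character trigram model per language. The two training sentences, the equal priors and the add-one smoothing are assumptions chosen to keep the example small; a usable identifier needs far larger corpora.

```python
import math
from collections import Counter

def train(text):
    """Trigram counts, context counts and character set for one language."""
    padded = "  " + text.lower()
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts, set(padded)

def log_likelihood(text, model, vocab_size):
    """log p(c_1:N | l) with add-one smoothing (an assumption)."""
    trigrams, contexts, _ = model
    padded = "  " + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        total += math.log((trigrams[gram] + 1) / (contexts[gram[:2]] + vocab_size))
    return total

def identify(text, models, priors):
    """argmax over languages of log p(l) + log p(c_1:N | l)."""
    vocab = set().union(*(model[2] for model in models.values()))
    return max(
        models,
        key=lambda lang: math.log(priors[lang])
        + log_likelihood(text, models[lang], len(vocab)),
    )

# Tiny hypothetical corpora.
MODELS = {
    "en": train("the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "fr": train("le renard brun saute par dessus le chien paresseux et le chat est sur le tapis"),
}
PRIORS = {"en": 0.5, "fr": 0.5}
```

Working in log space avoids underflow when the product runs over a long text.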

- Rough translation: gives the main point but contains errors
- Pre-edited translation: original text written in a constrained language that is easier to translate automatically
- Restricted-source translation: fully automatic, but only on technical content such as weather forecasts

Systems
In very simplified terms, there are three types of machine translation:
- Statistical machine translation (SMT)
  - learns relational dependencies of features such as n-grams, lemmas, etc.
  - requires large data sets
  - example: Google Translate
  - relatively easy to implement
- Rule-based machine translation (RBMT)
  - uses grammatical rules and language constructions to analyze syntax and semantics
  - uses moderate-size data sets
  - requires long development time and expertise
- Hybrid machine translation
  - either start from RBMT and use SMT to post-process and optimize the result
  - or use grammatical rules to derive further features that are then fed into the statistical learning machine
  - new direction of research

Brief History

- Interlingual model: the source language, i.e. the text to be translated, is transformed into an interlingua, i.e. an abstract language-independent representation. The target language is then generated from the interlingua.
- Transfer model: the source language is transformed into an abstract, less language-specific representation. Linguistic rules that are specific to the language pair then transform the source language representation into an abstract target language representation, and from this the target sentence is generated.
- Direct model: words are translated directly without passing through an additional representation.

Levels of Transfer (Vauquois pyramid)
- Interlingua Semantics: Attraction(NamedJohn, NamedMary, High)
- English Semantics: Loves(John, Mary) | French Semantics: Aime(Jean, Marie)
- English Syntax: S(NP(John), VP(loves, NP(Mary))) | French Syntax: S(NP(Jean), VP(aime, NP(Marie)))
- English Words: John loves Mary | French Words: Jean aime Marie

The problem with dictionary look ups

Statistical machine translation: data-driven MT

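- e: sequence of strings in English
- f: sequence of strings in French
- $f^* = \operatorname{argmax}_f \Pr(f \mid e) = \operatorname{argmax}_f \Pr(e \mid f) \Pr(f)$ (Bayes' rule; the constant $\Pr(e)$ is dropped)
- $\Pr(e \mid f)$ is learned from a bilingual (parallel) corpus made of phrases seen before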

With 100 candidate French phrases for each phrase of a 5-phrase English sentence, there are $100^5$ different 5-phrase translations and $5!$ reorderings of each.

Example:
- English phrases $e_1\, e_2\, e_3\, e_4\, e_5$: There is a smelly wumpus sleeping in 2 2
- French phrases, in output order $f_1\, f_3\, f_2\, f_4\, f_5$: Il y a un wumpus malodorant qui dort à 2 2
- Distortions: $d_1 = 0$, $d_3 = -2$, $d_2 = +1$, $d_4 = +1$, $d_5 = 0$

Given an English sentence e, find a French sentence f:
1. Break the English e into phrases $e_1, \ldots, e_n$
2. For each $e_i$ choose the French $f_i$: $\Pr(f_i \mid e_i)$
3. Choose a permutation of the phrases $f_1, \ldots, f_n$: for each $f_i$ choose a distortion $d_i$, the number of words that phrase $f_i$ has moved with respect to $f_{i-1}$

$\Pr(f, d \mid e) = \prod_{i=1}^{n} \Pr(f_i \mid e_i) \Pr(d_i)$
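The distortion values and the product formula can be checked with a small sketch. Two things here are assumptions rather than content of the slide: the formula $d_i = \mathrm{START}(f_i) - \mathrm{END}(f_{i-1}) - 1$, which is a common textbook definition, and the phrase boundaries in `SPANS`, which were chosen so that the formula reproduces the distortions listed above. The phrase and distortion probabilities are placeholders.

```python
import math

def distortions(spans):
    """Distortion of each French phrase, given in English phrase order.

    spans[i] is the (start, end) word position (1-based, inclusive) of
    phrase f_{i+1} in the French sentence. Assumes the definition
    d_i = START(f_i) - END(f_{i-1}) - 1 with END(f_0) = 0.
    """
    result = []
    previous_end = 0
    for start, end in spans:
        result.append(start - previous_end - 1)
        previous_end = end
    return result

def translation_score(phrase_probs, dist, distortion_prob):
    """Pr(f, d | e) as the product of Pr(f_i | e_i) * Pr(d_i)."""
    return math.prod(p * distortion_prob(d) for p, d in zip(phrase_probs, dist))

# "Il y a un wumpus malodorant qui dort à 2 2", words numbered 1..11.
# Assumed segmentation: f1 = words 1-4, f2 = word 6, f3 = word 5,
# f4 = words 7-8, f5 = words 9-11.
SPANS = [(1, 4), (6, 6), (5, 5), (7, 8), (9, 11)]
DIST = distortions(SPANS)
```

With this segmentation `DIST` comes out as 0, +1, -2, +1, 0 for phrases 1 to 5, matching the slide. A score then follows from any choice of phrase probabilities, for example `translation_score([0.5] * 5, DIST, lambda d: 0.5 ** abs(d))`.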

Learn probabilities
1. Parallel corpus: parliamentary debates, web pages
2. Segment into sentences. Periods are good indicators, with some care.
3. Align sentences. The length of sentences is one indicator, landmarks another.
4. Align phrases within a sentence: iterative process, aggregation of evidence, no other pair appears so frequently in the corpus. This gives $\Pr(f_i \mid e_i)$.
5. Extract distortions: count how often distortions appear in the corpus after phrase alignment (smoothing).
6. Improve the estimates of $\Pr(f \mid e)$ and $\Pr(d)$ with EM.
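Steps 4 and 5 end in counting, which might be sketched as follows once aligned phrase pairs are available. The aligned pairs, the distortion range `max_abs` and the add-alpha smoothing are illustrative assumptions, and the EM refinement of step 6 is not shown.

```python
from collections import Counter, defaultdict

def estimate_phrase_table(aligned_pairs):
    """Relative-frequency estimate of Pr(f_i | e_i) from aligned phrase pairs."""
    pair_counts = Counter(aligned_pairs)
    english_counts = Counter(english for english, _ in aligned_pairs)
    table = defaultdict(dict)
    for (english, french), count in pair_counts.items():
        table[english][french] = count / english_counts[english]
    return dict(table)

def estimate_distortion(observed, max_abs=3, alpha=1.0):
    """Smoothed estimate of Pr(d) over distortions in [-max_abs, max_abs]."""
    counts = Counter(observed)
    support = range(-max_abs, max_abs + 1)
    total = sum(counts[d] for d in support) + alpha * len(support)
    return {d: (counts[d] + alpha) / total for d in support}

# Hypothetical output of the phrase alignment step.
PAIRS = [("wumpus", "wumpus"), ("smelly", "malodorant"),
         ("smelly", "puant"), ("smelly", "malodorant")]
PHRASE_TABLE = estimate_phrase_table(PAIRS)
DISTORTION = estimate_distortion([0, 1, -2, 1, 0])
```

Smoothing keeps distortions that were never observed, such as +3 here, from receiving probability zero.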

Learning to translate

An HMM model

Machine translation systems

Grammars
- Grammars: set of rules (applied from left to right) that describe how to form strings from the language's alphabet that are valid according to the language's syntax (language generator).
- Parsing is the process of recognizing a string in a natural language by breaking it down into a set of symbols and analyzing each one against the grammar of the language, i.e., determining whether the string belongs to the language or is grammatically incorrect. The result is a parse tree.
- Types of grammars:
  - context free grammars (see http://en.wikipedia.org/wiki/chomsky_hierarchy)
  - probabilistic context free grammars
  - lexicalized probabilistic context free grammars
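Deciding whether a string belongs to a context free language, and building a parse tree if it does, can be sketched with the CYK algorithm. CYK is one standard choice and is not named in the lecture; the toy grammar (in Chomsky normal form, with an added preterminal `V`) is an assumption built around the sentence "John loves Mary".

```python
def cyk_parse(words, lexical, binary, start="S"):
    """Return a parse tree for words, or None if they are not in the language.

    lexical maps a word to the nonterminals that produce it.
    binary maps a pair (B, C) to the nonterminals A with a rule A -> B C.
    """
    n = len(words)
    # chart[i][j] maps a nonterminal to one tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for nonterminal in lexical.get(word, []):
            chart[i][i + 1][nonterminal] = (nonterminal, word)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for (left, right), parents in binary.items():
                    if left in chart[i][k] and right in chart[k][j]:
                        for parent in parents:
                            chart[i][j].setdefault(
                                parent, (parent, chart[i][k][left], chart[k][j][right])
                            )
    return chart[0][n].get(start)

# Hypothetical toy grammar.
LEXICAL = {"John": ["NP"], "Mary": ["NP"], "loves": ["V"]}
BINARY = {("NP", "VP"): ["S"], ("V", "NP"): ["VP"]}
```

For `"John loves Mary".split()` the function returns a nested tuple with `S` at the root, while a scrambled order such as "loves John Mary" is rejected with `None`.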

Parsing as search

Probabilistic Context Free Grammars

Hybrid Systems
- The translated sentence can be checked against a monolingual corpus.

- Translate text from one language to another
- Recombines fragments of example translations
- Challenges:
  - What fragments? [learning to translate]
  - How to make it efficient? [fast translation search]

- After a first bubble, the sector is now at full speed
- In spite of the economic crisis, 7% growth worldwide
- Commercial and technological focus
- Danish is a marginal language and existing systems cannot be applied reliably
- www.eicom.dk and www.oversaetterhuset.dk: research and development in collaboration with research institutions (SDU, CBS, ASB)

Announcement
- Need for human resources; possibilities for thesis and individual study activities together with:
  - Visual Interactive Syntax Learning project at the Institute for Language and Communication of SDU: http://beta.visl.sdu.dk/constraint_grammar.html
  - Eckhard Bick, project leader: http://en.wikipedia.org/wiki/eckhard_bick
- If interested, contact me.