Applications of memory-based natural language processing

Antal van den Bosch and Roser Morante
ILK Research Group, Tilburg University
Prague, June 24, 2007

Current ILK members
  - Principal investigator: Antal van den Bosch
  - Post-doc researchers: Piroska Lendvai, Martin Reynaert, Roser Morante, Erwin Marsi
  - Ph.D. students: Sander Canisius, Toine Bogers, Marieke van Erp, Herman Stehouwer
  - Scientific programmers: Ko van der Sloot, Steve Hunt, Peter Berck
  - Guest researchers: Erik Tjong Kim Sang, Iris Hendrickx, Walter Daelemans

Outline of the talk
  1. Scientific embedding
     1.1 NLP as classification
     1.2 Inference in NLP
  2. Memory-based NLP applications
  3. Embedded memory-based applications
  4. Software and infrastructure
  5. e-learning?

1 Scientific embedding (1)
  - Language processing is memory-based
  - Learning consists of:
      - Storing instances in memory
      - Drawing analogies with the stored instances to deal with new experiences
  - Learning is a supervised process
      - Annotated data are needed

Representation of instances
  - Task: assigning part-of-speech tags
  - [Figure: an instance shown as left context, focus word, right context; example fragment "were always accepted ."]
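The windowed representation on this slide can be sketched as follows. This is a minimal illustration, not the ILK implementation: the function name `make_instances`, the window sizes, the `_` padding symbol, and the example tags are all assumptions made here for the sake of the example.

```python
from typing import List, Sequence, Tuple

PAD = "_"  # assumed padding symbol for positions outside the sentence

Instance = Tuple[Tuple[str, ...], str]  # (feature window, class label)


def make_instances(tokens: Sequence[str], tags: Sequence[str],
                   left: int = 2, right: int = 2) -> List[Instance]:
    """Turn a tagged sentence into one fixed-width instance per token.

    Each instance holds `left` context words, the focus word, and `right`
    context words; the class label is the tag of the focus word.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    padded = [PAD] * left + list(tokens) + [PAD] * right
    width = left + 1 + right
    instances = []
    for i, tag in enumerate(tags):
        # padded[i + left] is the focus word of the i-th token
        instances.append((tuple(padded[i:i + width]), tag))
    return instances


# Illustrative tags only; they are not taken from the slide.
example = make_instances(["were", "always", "accepted", "."],
                         ["VBD", "RB", "VBN", "."], left=1, right=1)
```

Every instance has the same number of features, so any two instances can be compared position by position.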

1 Scientific embedding (2) Language processing has simplicity constraints: Context is a local phenomenon Abstraction is harmful

1 Scientific embedding (3) Language processing can be reduced to: Classification Segmentation, mapping Inference: Finding the optimal sequence/structure

1.1 NLP as classification (1)
Classification: given a new test instance X,
  - Compare it to all memory instances:
      - Compute the distance between X and each memory instance Y
      - Update the top k closest instances (nearest neighbors)
  - When done, take the majority class of the k nearest neighbors as the class of X
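The classification procedure above can be sketched in a few lines. The distance function is an assumption chosen for simplicity (an unweighted overlap count of mismatching feature values); the slide does not say which metric is used, and the names `overlap_distance` and `knn_classify` are hypothetical.

```python
import heapq
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Instance = Tuple[Sequence[Hashable], str]  # (feature values, class label)


def overlap_distance(x: Sequence[Hashable], y: Sequence[Hashable]) -> int:
    """Assumed metric: number of feature positions where x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def knn_classify(x: Sequence[Hashable], memory: List[Instance], k: int = 1) -> str:
    """Return the majority class among the k stored instances closest to x."""
    if not memory:
        raise ValueError("memory is empty")
    # Compare x to every stored instance and keep the k closest ones.
    nearest = heapq.nsmallest(
        k, memory, key=lambda inst: overlap_distance(x, inst[0]))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy memory; the tags are illustrative only.
memory = [
    (("_", "were", "always"), "VBD"),
    (("were", "always", "accepted"), "RB"),
    (("always", "accepted", "."), "VBN"),
]
predicted = knn_classify(("_", "were", "never"), memory, k=1)
```

Nothing is abstracted away at training time: "learning" is storing `memory`, and all the work happens when a new instance is classified.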

1.1 NLP as classification (2) Sentence accent placement Dependency relation assignment

1.2 Inference in NLP
  - From local classifications to a global solution
  - Open up the search space, in which there is an optimal global solution
  - Search algorithms:
      - Constraint satisfaction inference
      - Beam search
      - Viterbi
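As an illustration of how local classifications can be combined into a global solution, here is a minimal Viterbi sketch. It assumes that a local classifier supplies a log score per label at each position and that label-to-label transitions carry an additional log score; the function name, the score format, and the toy numbers are assumptions for this example, not the setup used in the talk.

```python
import math
from typing import Dict, List, Tuple


def viterbi(local_scores: List[Dict[str, float]],
            transition: Dict[Tuple[str, str], float],
            default: float = 0.0) -> List[str]:
    """Find the label sequence with the highest total log score.

    local_scores[t][label] is the local classifier's log score at position t.
    transition[(prev, label)] is added when `label` follows `prev`;
    pairs that are not listed receive `default`.
    """
    if not local_scores:
        return []
    # best[t][label] = (score of the best path ending in label, previous label)
    best = [{lab: (s, None) for lab, s in local_scores[0].items()}]
    for scores in local_scores[1:]:
        column = {}
        for lab, s in scores.items():
            prev_lab, prev_score = max(
                ((p, ps + transition.get((p, lab), default))
                 for p, (ps, _) in best[-1].items()),
                key=lambda item: item[1])
            column[lab] = (prev_score + s, prev_lab)
        best.append(column)
    # Trace back from the best final label.
    label = max(best[-1], key=lambda lab: best[-1][lab][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    path.reverse()
    return path


# Toy chunking example: an "I" tag may not directly follow an "O" tag.
scores = [{"O": math.log(0.6), "B": math.log(0.4)},
          {"I": math.log(0.7), "O": math.log(0.3)}]
forbidden = {("O", "I"): -math.inf}
best_path = viterbi(scores, forbidden)
```

In the toy example, picking the best label at each position independently would give the invalid sequence O, I; searching over whole sequences yields B, I instead.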

2 Memory-based NLP apps
  - Basic NLP
      - Spelling correction
      - Speech synthesis
      - Morpho-syntax
      - Semantics
      - Machine translation
  - Embedded NLP
      - Dialogue systems
      - Professional document writing
      - Knowledge enrichment

2.1 Morpho-phonology

2.2 Morpho-syntax

2.3 Semantics

2.3 Semantics Semantic relations: content-container

2.4 Machine Translation Memory-based text-to-text processing Machine translation Language modelling Confusible disambiguation

3 Embedded Memory-Based Apps
  - Dialogue systems
      - NWO IMIX: ROLAQUAD
  - Professional document writing
      - Senter Novem IOP-MMI À Propos
  - Knowledge enrichment in domains
      - NWO CATCH: MITCH

3.1 Semantic Classification in QA Answer retrieval from domain documents through alignment of question analyses with off-line document analyses.

3.2 Professional Document Writing
Pro-active personalization for professional document writing
  - Recommend related articles for a 'focus' online news article
  - Retrieve similar passages
  - Classify experts

3.3 Knowledge Enrichment
  - Mining information from texts in the cultural heritage
  - From documents to knowledge bases and ontologies
  - Goal: research and develop techniques to discover new meaning in large collections of partially structured data that are available at Naturalis

3.4 Text Mining in Animal Data

In sum

[Figure: language technology modules and applications, arranged along an axis from Text to Meaning]
  - LT Modules: Lexical / Morphological Analysis, Tagging, Chunking, Syntactic Analysis, Word Sense Disambiguation, Grammatical Relation Finding, Named Entity Recognition, Semantic Analysis, Reference Resolution, Discourse Analysis
  - Applications: OCR, Spelling Error Correction, Grammar Checking, Information Retrieval, Document Classification, Information Extraction, Summarization, Question Answering, Ontology Extraction and Refinement, Dialogue Systems, Machine Translation

4 Software and Infrastructure
  - Open Source (GPL) software, among others:
      - TiMBL, MBT: machine learning and sequence processing
      - NeXTeNS: text-to-speech conversion
      - Tadpole: POS tagging, lemmatization, morphological analysis, shallow parsing
  - Demos
      - Web interfaces
  - Computing infrastructure
      - One supercomputer; one high-end file server
      - Approx. 20 computing servers, 4 web/data servers, 20 desktops
      - Parallelisation: Dimbl, Mumbl

e-learning?
  - Better accessibility
      - Recommendation tools
      - Multi-lingual NLP & MT
  - Creating better e-learning apps with more natural interfaces
      - Speech synthesis
      - QA, dialogue systems
  - Language e-learning
      - Help the computer learn language
  - Win-win situation, open mind

Thanks for your attention! You will find more information at: http://ilk.uvt.nl

Partners
  - Academic
      - CNTS, University of Antwerp
      - Project partners: Nijmegen, Groningen, Maastricht, Utrecht, Eindhoven, Leuven
      - University of Bergen, Dublin City University, Polytechnic University of Catalunya, Saarland University, University of Illinois at Urbana-Champaign
  - Non-commercial
      - Naturalis Museum of Natural History
  - Industrial
      - Textkernel
      - Project partners: Polderland, SEC, Irion, Trezorix

Spin-off
  - Textkernel B.V.
      - Information extraction
      - Robust text matching
      - Dialogue systems
  - Foundation for Inductive Learning Applications
      - Broker for Tilburg and Antwerp university software
      - Consultancy

Eager vs Lazy Learning