NLP for Norwegian: adaptation to the clinical domain

Lilja Øvrelid & Taraka Rama
University of Oslo, Department of Informatics
Nov 2nd, 2017

Language Technology Group (LTG), UiO
Research group at Dept of Informatics, UiO
4 permanent staff
6 PhDs (ongoing)
2 Postdocs (one BigMed funded)
2 II-positions from industry

Language Technology Group (LTG), UiO
Data-driven linguistic analysis of text
Extensive use of machine learning and HPC
Dedicated to furthering NLP for Norwegian

Information Extraction

Underlying NLP Pipeline

Data, data, data
Language technology (LT) today is largely data-driven
Machine learning is the central methodology
Need data: annotated or (large amounts of) unannotated
Domain adaptation is a central issue

Machine learning
Manually annotated data as training data
Allows for rigorous evaluation and system comparison

NLP for Norwegian
We have developed several resources and tools for processing general-domain Norwegian text:
  sentence splitter, tokenizer (sentences, words)
  part-of-speech tagger (nouns, verbs)
  parser (relations between words: subj, obj)
  Named Entity Recognition (semantic entities: Person, Location, etc.) (ongoing)
  Sentiment Analysis (positive/negative texts) (ongoing)

NLP for Norwegian
A treebank is a manually annotated corpus containing syntactic analysis
We need treebanks for several reasons:
  development of NLP tools
  down-stream use of these tools
Treebanks exist for a range of languages, but until recently no treebank existed for Norwegian

Norwegian Dependency Treebank (NDT)
NDT was released in 2014 (Solberg et al, 2014)
Approx. 600,000 tokens of manually annotated Bokmål and Nynorsk text (general domain)
Allows for training of taggers and parsers (Øvrelid & Hohle, 2016; Hohle et al, 2017; Velldal et al, 2017)
Freely available, so others can too!

NLP for Norwegian
Universal Dependencies (UD)
Community-driven effort to develop cross-linguistically consistent treebank annotation for many languages
Enables cross-lingual learning
Conversion of NDT to UD (Øvrelid & Hohle, 2016)
Currently more than 50 languages (including Norwegian)!
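
UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with "#", and a blank line between sentences. The following is a minimal sketch of reading such data and listing the dependency relations. The sample sentence and its annotation are made up for illustration and are not taken from NDT; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Column names of the CoNLL-U format used by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Illustrative sentence only (an assumption, not an excerpt from NDT).
SAMPLE = """# text = Pasienten har feber .
1\tPasienten\tpasient\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\thar\tha\tVERB\t_\t_\t0\troot\t_\t_
3\tfeber\tfeber\tNOUN\t_\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():       # skip multiword ranges and empty nodes
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def dependency_pairs(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (dependent, relation, head) triples; the root's head is 'ROOT'."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(tok["form"], tok["deprel"], forms.get(tok["head"], "ROOT"))
            for tok in sentence]


sentences = read_conllu(SAMPLE)
pairs = dependency_pairs(sentences[0])
for dependent, relation, head in pairs:
    print(f"{dependent} --{relation}--> {head}")
```

Because the column layout is shared across UD treebanks, the same reader would apply unchanged to UD data for other languages, which is part of what makes cross-lingual experiments practical.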

NLP for Norwegian
Semantic vectors (word embeddings) for Norwegian
Distributional semantic models of words acquired using unsupervised machine learning from raw text (Norsk Aviskorpus)
Web service (Kutuzov et al, 2017): http://ltr.uio.no/semvec

Clinical NLP for Norwegian
Domain adaptation is a challenge for data-driven NLP
Most tools are trained on highly edited news texts
Results drop when these are applied to new domains and text types (clinical notes are both!)

Adaptation of existing tools
Evaluation and adaptation of existing Norwegian tools
Quantify and study performance of existing taggers and parsers
Investigate methods for improving their performance on clinical data
  normalization: spelling correction, abbreviation detection
  use of structured domain knowledge
Unsupervised learning of domain knowledge and vocabulary
  Norsk Legemiddelhåndbok
  Store Medisinske Leksikon
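
The normalization step (abbreviation handling and spelling correction) could look roughly like the sketch below. It is only an illustration under stated assumptions: the abbreviation table and the vocabulary are tiny hypothetical examples (in practice the vocabulary might be derived from a domain lexicon), and spelling correction is approximated with the standard library's difflib string matching rather than any method named in the slides.

```python
import difflib
from typing import Dict, List, Sequence

# Hypothetical abbreviation table (assumption, for illustration only).
ABBREVIATIONS: Dict[str, str] = {
    "pas.": "pasient",
    "beh.": "behandling",
}

# Hypothetical in-domain vocabulary (assumption, for illustration only).
VOCABULARY: List[str] = sorted(
    {"pasient", "behandling", "feber", "hoste", "har", "og", "får"}
)


def normalize(tokens: Sequence[str],
              abbreviations: Dict[str, str] = ABBREVIATIONS,
              vocabulary: Sequence[str] = VOCABULARY,
              cutoff: float = 0.8) -> List[str]:
    """Expand known abbreviations and correct near-miss spellings."""
    normalized = []
    for token in tokens:
        lowered = token.lower()
        if lowered in abbreviations:
            # Known abbreviation: replace with its expansion.
            normalized.append(abbreviations[lowered])
        elif lowered in vocabulary or not lowered.isalpha():
            # Known word, number, or punctuation: leave untouched.
            normalized.append(token)
        else:
            # Unknown word: take the closest vocabulary entry if it is
            # similar enough, otherwise keep the token as it is.
            candidates = difflib.get_close_matches(
                lowered, vocabulary, n=1, cutoff=cutoff
            )
            normalized.append(candidates[0] if candidates else token)
    return normalized


result = normalize(["Pas.", "har", "febr", "38,5"])
print(result)
```

The cutoff trades off coverage against false corrections; a conservative value keeps rare but legitimate clinical terms from being rewritten into common words.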

Adaptation of existing tools
Transfer from Swedish
Swedish has an entity recognizer for clinical text which cannot be shared due to sensitive patient information
How to use the Swedish resource for Norwegian?

Some strategies
Figure: A Bi-Directional LSTM entity recognizer (ER) for biomedical text (Liu et al. 2017)
Delexicalized training
Machine translation
Joint learning of embeddings

Delexicalized training
Remove Swedish words and train the ER on (Universal) POS tags
Tag the POS-tagged Norwegian clinical text using the ER trained on Swedish data
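
The idea of delexicalized transfer can be sketched as follows. Word forms are dropped, so the Swedish training data and the Norwegian input share one feature space of Universal POS tags. To keep the example short and dependency-free, the BiLSTM recognizer from the slides is swapped for a simple most-frequent-label lookup over POS trigram contexts. The sentences, the "B-Disorder" label, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]          # (word, universal POS tag, entity label)
Context = Tuple[str, str, str]        # (previous POS, current POS, next POS)


def delexicalize(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Drop the word forms, keeping only (POS tag, entity label) pairs."""
    return [(upos, label) for _, upos, label in sentence]


def contexts(pos_tags: List[str]) -> List[Context]:
    """Build one POS trigram context per token, padding at the edges."""
    padded = ["<s>"] + pos_tags + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(pos_tags))]


def train(sentences: List[List[Token]]) -> Dict[Context, str]:
    """Learn the most frequent entity label for each POS context."""
    counts: Dict[Context, Counter] = defaultdict(Counter)
    for sentence in sentences:
        delex = delexicalize(sentence)
        tags = [upos for upos, _ in delex]
        labels = [label for _, label in delex]
        for context, label in zip(contexts(tags), labels):
            counts[context][label] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag(model: Dict[Context, str], pos_tags: List[str]) -> List[str]:
    """Label a POS-tagged sentence; unseen contexts default to 'O'."""
    return [model.get(context, "O") for context in contexts(pos_tags)]


# Toy Swedish training sentence (made up, not real clinical data).
swedish_train = [[
    ("Patienten", "NOUN", "O"),
    ("har", "VERB", "O"),
    ("diabetes", "NOUN", "B-Disorder"),
    (".", "PUNCT", "O"),
]]

model = train(swedish_train)

# Norwegian sentence "Pasienten har feber ." represented by its POS tags only.
norwegian_pos = ["NOUN", "VERB", "NOUN", "PUNCT"]
predicted = tag(model, norwegian_pos)
print(predicted)
```

Nothing language-specific reaches the model here, which is why the approach depends on the two languages being tagged with the same (Universal) POS inventory.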

Machine Translation
Translate the Swedish clinical text to Norwegian using a machine translation system
Train an ER on the translated Norwegian clinical text

Generalize Word/Character Embeddings
Word/character embeddings are typically trained on corpora from the same domain
Norsk Legemiddelhåndbok is too small for BIG experiments
Learn word embeddings jointly on Norsk Legemiddelhåndbok, Norwegian and Swedish corpora
Train an ER on the Swedish data using the word embeddings
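
A rough illustration of why joint training over Norwegian and Swedish text can place words from both languages in one space: the sketch below pools toy sentences from both languages and builds simple co-occurrence count vectors. This is a deliberately simplified stand-in for learned embeddings, not the method used in the slides. The sentences and function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(sentences: List[List[str]],
                         window: int = 2) -> Dict[str, Counter]:
    """Count, for every word, the words seen within `window` positions."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpora (made up). Many context words are spelled identically in
# Norwegian and Swedish, which links the two vocabularies.
norwegian = [["pasienten", "har", "feber"],
             ["pasienten", "får", "behandling"]]
swedish = [["patienten", "har", "feber"],
           ["patienten", "får", "behandling"]]

# Joint training data: simply pool the sentences from both languages.
vectors = cooccurrence_vectors(norwegian + swedish)

sim_cross = cosine(vectors["pasienten"], vectors["patienten"])
sim_other = cosine(vectors["pasienten"], vectors["feber"])
print(f"pasienten ~ patienten: {sim_cross:.2f}")
print(f"pasienten ~ feber:     {sim_other:.2f}")
```

In this toy setting the Norwegian and Swedish words for "the patient" end up with matching vectors because they occur in the same shared contexts, so a recognizer trained on Swedish vectors would see familiar input for the Norwegian word.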

QUESTIONS?