NLP for Norwegian: adaptation to the clinical domain
Lilja Øvrelid & Taraka Rama
University of Oslo, Department of Informatics
Nov 2nd, 2017
Language Technology Group (LTG), UiO 2
- Research group at Dept of Informatics, UiO
- 4 permanent staff
- 6 PhDs (ongoing)
- 2 postdocs (one BigMed funded)
- 2 II-positions from industry
Language Technology Group (LTG), UiO 3
- Data-driven linguistic analysis of text
- Extensive use of machine learning and HPC
- Dedicated to furthering NLP for Norwegian
Information Extraction 4
Information Extraction 5
Underlying NLP Pipeline 6
Data, data, data 7
- LT today is largely data-driven
- Machine learning is the central methodology
- Need data: annotated or (large amounts of) unannotated
- Domain adaptation is a central issue
Machine learning 8
- Manually annotated data as training data
- Allows for rigorous evaluation and system comparison
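As an illustration of what evaluation against manually annotated data can look like, here is a minimal sketch that scores system output against gold annotations with precision, recall and F1. The span format `(start, end, label)` and the toy data are assumptions made for this example, not taken from the talk.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration: (start_token, end_token, label) spans.
gold_spans = [(0, 1, "PER"), (3, 4, "LOC"), (6, 7, "PER")]
system_spans = [(0, 1, "PER"), (3, 4, "PER")]

precision, recall, f1 = precision_recall_f1(gold_spans, system_spans)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because two systems can be scored against the same gold spans, the same function also supports system comparison.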
NLP for Norwegian 9
We have developed several resources and tools for processing of general-domain Norwegian text:
- sentence splitter, tokenizer (sentences, words)
- part-of-speech tagger (nouns, verbs)
- parser (relations between words: subj, obj)
- Named Entity Recognition (semantic entities: Person, Location, etc.) ONGOING
- Sentiment Analysis (positive/negative texts) ONGOING
NLP for Norwegian 10
- A treebank is a manually annotated corpus containing syntactic analysis
- We need treebanks for several reasons
  - development of NLP tools
  - down-stream use of these tools
- Treebanks exist for a range of languages, but until recently no treebank existed for Norwegian
Norwegian Dependency Treebank (NDT) 11
- NDT was released in 2014 (Solberg et al, 2014)
- Approx 600,000 tokens of manually annotated Bokmål and Nynorsk text (general domain)
- Allows for training of taggers and parsers (Øvrelid & Hohle, 2016; Hohle et al, 2017; Velldal et al, 2017)
- Freely available, so others can too!
NLP for Norwegian 12
Universal Dependencies
- Community-driven effort to develop cross-linguistically consistent treebank annotation for many languages
- Enables cross-lingual learning
- Conversion of NDT to UD (Øvrelid & Hohle, 2016)
- Currently more than 50 languages (including Norwegian)!
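UD treebanks are distributed in the tab-separated CoNLL-U format, with ten columns per token. The following is a minimal reading sketch. The sample sentence and its analysis are hand-written for illustration and are not taken from NDT, and the reader ignores details such as multiword-token ranges.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):     # comment / metadata line
            continue
        else:
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hand-written toy sentence (an assumption for this example).
SAMPLE = "\n".join([
    "# text = Pasienten har feber .",
    "\t".join(["1", "Pasienten", "pasient", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "har", "ha", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "feber", "feber", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = read_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```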
NLP for Norwegian 13
Semantic vectors (word embeddings) for Norwegian
- Distributional semantic models of words acquired using unsupervised machine learning from raw text (Norsk Aviskorpus)
- Web service (Kutuzov et al, 2017): http://ltr.uio.no/semvec
Clinical NLP for Norwegian 14
- Domain adaptation is a challenge for data-driven NLP
- Most tools are trained on highly edited news texts
- Results drop when these are applied to new domains and text types (clinical notes are both!)
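One simple way to make the domain gap visible is the out-of-vocabulary (OOV) rate: the share of tokens in the new domain that never occurred in the training data. The sketch below uses invented toy token lists. It is only a rough proxy and not the evaluation used in the talk.

```python
def oov_rate(train_tokens, target_tokens):
    """Fraction of target tokens that never occur in the training tokens."""
    vocabulary = set(train_tokens)
    if not target_tokens:
        return 0.0
    unknown = sum(1 for token in target_tokens if token not in vocabulary)
    return unknown / len(target_tokens)


# Toy token lists, invented for illustration.
news_tokens = ["regjeringen", "la", "fram", "budsjettet", "i", "dag"]
clinical_tokens = ["pas", "innlagt", "i", "dag", "med", "feber"]

rate = oov_rate(news_tokens, clinical_tokens)
print(f"OOV rate on clinical text: {rate:.2f}")
```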
Adaptation of existing tools 15
- Evaluation and adaptation of existing Norwegian tools
  - Quantify and study performance of existing taggers and parsers
  - Investigate methods for improving their performance on clinical data
    - normalization: spelling correction, abbreviation detection
    - use of structured domain knowledge
- Unsupervised learning of domain knowledge and vocabulary
  - Norsk Legemiddelhåndbok
  - Store Medisinske Leksikon
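The normalization step can be sketched as a lookup for abbreviations followed by fuzzy matching against a domain vocabulary. Everything specific here is an assumption for illustration: the abbreviation table, the word list (which in practice might be harvested from a resource such as Norsk Legemiddelhåndbok) and the similarity cutoff. Fuzzy matching uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical resources, invented for this example.
ABBREVIATIONS = {"pas.": "pasient", "beh.": "behandling"}
DOMAIN_VOCABULARY = ["pasient", "behandling", "feber", "hoste", "paracetamol"]


def normalize_token(token, abbreviations, vocabulary, cutoff=0.8):
    """Expand known abbreviations, then correct likely misspellings."""
    lowered = token.lower()
    if lowered in abbreviations:
        return abbreviations[lowered]
    if lowered in vocabulary:
        return lowered
    # Closest vocabulary entry, if any is at least `cutoff` similar.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


note = ["Pas.", "fikk", "paracetmol", "mot", "feber"]
normalized = [normalize_token(t, ABBREVIATIONS, DOMAIN_VOCABULARY) for t in note]
print(normalized)
```

Tokens with no close match are passed through unchanged, so the normalizer does not rewrite words it knows nothing about.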
Adaptation of existing tools 16
Transfer from Swedish
- Swedish has an entity recognizer for clinical text which cannot be shared due to sensitive patient information
- How to use the Swedish resource for Norwegian?
Some strategies 17
Figure: A Bi-Directional LSTM entity recognizer (ER) for biomedical text (Liu et al. 2017)
- Delexicalized training
- Machine translation
- Joint learning of embeddings
Delexicalized training 18
- Remove Swedish words and train the ER on (Universal) POS tags
- Tag the POS-tagged Norwegian clinical text using the ER trained on Swedish data
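The idea can be sketched in a few lines: once word forms are dropped, Swedish and Norwegian sentences share the same feature space of Universal POS tags. The example below is a deliberate simplification. It swaps the Bi-LSTM for a lookup table over POS-tag contexts, and the sentences and the `FINDING` label are invented for illustration.

```python
from collections import Counter, defaultdict


def delexicalize(sentence):
    """Drop the word forms; keep only the UPOS tag of each (word, upos) pair."""
    return [upos for _word, upos in sentence]


def context(tags, i):
    """Feature for position i: previous, current and next POS tag."""
    previous = tags[i - 1] if i > 0 else "<s>"
    following = tags[i + 1] if i + 1 < len(tags) else "</s>"
    return (previous, tags[i], following)


def train(sentences, label_sequences):
    """Memorize the most frequent entity label for each POS-tag context."""
    counts = defaultdict(Counter)
    for sentence, labels in zip(sentences, label_sequences):
        tags = delexicalize(sentence)
        for i, label in enumerate(labels):
            counts[context(tags, i)][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def predict(model, sentence, default="O"):
    tags = delexicalize(sentence)
    return [model.get(context(tags, i), default) for i in range(len(tags))]


# Toy Swedish training data and toy Norwegian input (both invented).
swedish = [[("patienten", "NOUN"), ("har", "VERB"), ("feber", "NOUN")]]
swedish_labels = [["O", "O", "FINDING"]]
norwegian = [("pasienten", "NOUN"), ("har", "VERB"), ("hoste", "NOUN")]

model = train(swedish, swedish_labels)
predicted = predict(model, norwegian)
print(predicted)
```

No Swedish word form is stored in `model`, which is what makes this kind of recognizer a candidate for transfer when the underlying text cannot be shared.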
Machine Translation 19
- Translate the Swedish clinical text to Norwegian using a machine translation system
- Train an ER on the translated Norwegian clinical text
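A rough sketch of the data flow, under a strong simplifying assumption: translation is done word by word with a small hypothetical lexicon, so each entity label can be carried over to the translated token at the same position. A real machine translation system reorders and merges words, and labels would then have to be projected through word alignments.

```python
# Hypothetical Swedish-to-Norwegian lexicon, invented for this example.
SV_TO_NO = {
    "patienten": "pasienten",
    "och": "og",
    "hosta": "hoste",
}


def translate_with_labels(labelled_sentence, lexicon):
    """Translate token by token, keeping each token's entity label.

    Tokens missing from the lexicon are copied unchanged, which is often
    acceptable for closely related languages with shared word forms.
    """
    return [(lexicon.get(token, token), label)
            for token, label in labelled_sentence]


# Toy annotated Swedish sentence (the FINDING label is an assumption).
swedish = [("patienten", "O"), ("har", "O"), ("feber", "FINDING"),
           ("och", "O"), ("hosta", "FINDING")]

norwegian_training_data = translate_with_labels(swedish, SV_TO_NO)
print(norwegian_training_data)
```

The resulting labelled Norwegian tokens could then serve as training data for an entity recognizer.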
Generalize Word/Character Embeddings 20
- Word/character embeddings are typically trained on corpora from the same domain
- Norsk Legemiddelhåndbok is too small for BIG experiments
- Learn word embeddings jointly on Norsk Legemiddelhåndbok, Norwegian and Swedish corpora
- Train an ER on the Swedish data using the word embeddings
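To illustrate why pooling the corpora can help, the sketch below builds simple count-based co-occurrence vectors over a combined Norwegian and Swedish toy corpus. This is a stand-in for neural word embeddings, not the method from the talk, and the sentences are invented. Because the two languages share many word forms, related words end up with similar context vectors in the joint space.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            low, high = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(low, high):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpora, invented for illustration.
norwegian = [["pasienten", "har", "feber", "og", "hoste"],
             ["pasienten", "har", "smerter"]]
swedish = [["patienten", "har", "feber", "och", "hosta"],
           ["patienten", "har", "smärta"]]

joint = cooccurrence_vectors(norwegian + swedish)
similarity = cosine(joint["pasienten"], joint["patienten"])
print(f"similarity(pasienten, patienten) = {similarity:.3f}")
```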
QUESTIONS? 21