NLP for Norwegian: adaptation to the clinical domain

Lilja Øvrelid & Taraka Rama
University of Oslo, Department of Informatics
Nov 2nd, 2017

Language Technology Group (LTG), UiO
- Research group at the Department of Informatics, UiO
- 4 permanent staff
- 6 PhDs (ongoing)
- 2 postdocs (one BigMed-funded)
- 2 II-positions from industry

Language Technology Group (LTG), UiO
- Data-driven linguistic analysis of text
- Extensive use of machine learning and HPC
- Dedicated to furthering NLP for Norwegian

Information Extraction

Underlying NLP Pipeline

Data, data, data
- LT today is largely data-driven
- Machine learning is the central methodology
- Need data: annotated or (large amounts of) unannotated
- Domain adaptation is a central issue

Machine learning
- Manually annotated data as training data
- Allows for rigorous evaluation and system comparison
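To make the point about evaluation concrete, here is a minimal sketch of how manually annotated (gold) tags allow two systems to be compared on the same data. The tag sequences and the two "systems" are invented for illustration; token-level accuracy is only one of several possible metrics.

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy gold-standard tags and the output of two hypothetical taggers.
gold = ["PRON", "VERB", "DET", "NOUN", "PUNCT"]
system_a = ["PRON", "VERB", "DET", "NOUN", "PUNCT"]
system_b = ["PRON", "NOUN", "DET", "NOUN", "PUNCT"]

print("system A:", token_accuracy(gold, system_a))
print("system B:", token_accuracy(gold, system_b))
```

Because both systems are scored against the same annotated tokens, the two numbers are directly comparable.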

NLP for Norwegian
We have developed several resources and tools for processing general-domain Norwegian text:
- sentence splitter, tokenizer (sentences, words)
- part-of-speech tagger (nouns, verbs)
- parser (relations between words: subj, obj)
- Named Entity Recognition (semantic entities: Person, Location, etc.) ONGOING
- Sentiment Analysis (positive/negative texts) ONGOING

NLP for Norwegian
- A treebank is a manually annotated corpus containing syntactic analyses
- We need treebanks for several reasons:
  - development of NLP tools
  - downstream use of these tools
- Treebanks exist for a range of languages, but until recently none existed for Norwegian

Norwegian Dependency Treebank (NDT)
- NDT was released in 2014 (Solberg et al., 2014)
- Approx. 600,000 tokens of manually annotated Bokmål and Nynorsk text (general domain)
- Allows for training of taggers and parsers (Øvrelid & Hohle, 2016; Hohle et al., 2017; Velldal et al., 2017)
- Freely available, so others can too!

NLP for Norwegian
Universal Dependencies
- Community-driven effort to develop cross-linguistically consistent treebank annotation for many languages
- Enables cross-lingual learning
- Conversion of NDT to UD (Øvrelid & Hohle, 2016)
- Currently more than 50 languages (including Norwegian)!
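Universal Dependencies treebanks are distributed in the tab-separated CoNLL-U format (ten columns per token, `#` comment lines, blank lines between sentences). As a rough illustration of what such annotation looks like to a program, the sketch below reads a single hand-written sentence. The sentence and its analysis are invented for this example and are not taken from NDT; the reader is deliberately minimal and simply skips multiword-token ranges and empty nodes.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                  # skip ranges like "1-2" and nodes like "1.1"
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Hand-written toy sentence ("She reads the book."), not real treebank data.
rows = [
    ["1", "Hun", "hun", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "leser", "lese", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "boka", "bok", "NOUN", "_", "_", "2", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Hun leser boka .\n" + "\n".join("\t".join(r) for r in rows) + "\n"

for token in read_conllu(SAMPLE)[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

Since every UD treebank shares this layout and label inventory, the same reader would apply unchanged to other languages, which is what makes cross-lingual experiments practical.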

NLP for Norwegian
Semantic vectors (word embeddings) for Norwegian
- Distributional semantic models of words, acquired using unsupervised machine learning from raw text (Norsk Aviskorpus)
- Web service (Kutuzov et al., 2017)
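Separately from the web service mentioned on the slide, the basic operation on semantic vectors can be sketched in a few lines: words are compared by the cosine of the angle between their vectors. The three-dimensional vectors below are made up for illustration; real embeddings are learned from a corpus and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy vectors (lege = doctor, sykepleier = nurse, bil = car).
toy_vectors = {
    "lege": [0.9, 0.1, 0.2],
    "sykepleier": [0.8, 0.2, 0.3],
    "bil": [0.1, 0.9, 0.1],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
    )


print(most_similar("lege", toy_vectors))
```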

Clinical NLP for Norwegian
- Domain adaptation is a challenge for data-driven NLP
- Most tools are trained on highly edited news texts
- Results drop when these are applied to new domains and text types (clinical notes are both!)
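One simple way to get a first impression of the domain gap is the out-of-vocabulary (OOV) rate: the share of target-domain tokens never seen in the training data. The sketch below assumes pre-tokenized text; both example sentences are invented, and OOV rate is only a rough proxy for the performance drop, not a measurement of it.

```python
def oov_rate(train_tokens, target_tokens):
    """Share of target tokens whose lowercased form is absent from training."""
    vocab = {t.lower() for t in train_tokens}
    if not target_tokens:
        return 0.0
    unseen = sum(1 for t in target_tokens if t.lower() not in vocab)
    return unseen / len(target_tokens)


# Invented examples: an edited news-style sentence and a clinical-style note.
news = "Regjeringen la fram budsjettet i dag .".split()
clinical = "Pas. innl. i dag med brystsmerter .".split()

print(round(oov_rate(news, clinical), 2))
```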

Adaptation of existing tools
Evaluation and adaptation of existing Norwegian tools
- Quantify and study the performance of existing taggers and parsers
- Investigate methods for improving their performance on clinical data
  - normalization: spelling correction, abbreviation detection
  - use of structured domain knowledge
- Unsupervised learning of domain knowledge and vocabulary
  - Norsk Legemiddelhåndbok
  - Store Medisinske Leksikon
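As an illustration of the normalization step, the following sketch expands abbreviations by dictionary lookup before the text is passed on to a tagger or parser. The abbreviation table is a hypothetical, hand-made example, not an existing resource; in practice such a table might be derived from domain sources like those listed above, and detecting unknown abbreviations is a harder problem than this lookup suggests.

```python
# Hypothetical abbreviation table, invented for this example.
ABBREVIATIONS = {
    "pas.": "pasient",
    "innl.": "innlagt",
    "beh.": "behandling",
}


def expand_abbreviations(tokens, table=ABBREVIATIONS):
    """Replace known abbreviations with their expansion; keep other tokens."""
    return [table.get(tok.lower(), tok) for tok in tokens]


note = ["Pas.", "innl.", "i", "dag"]
print(expand_abbreviations(note))
```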

Adaptation of existing tools
Transfer from Swedish
- Swedish has an entity recognizer for clinical text, which cannot be shared due to sensitive patient information
- How can the Swedish resource be used for Norwegian?

Some strategies
Figure: A bidirectional LSTM entity recognizer (ER) for biomedical text (Liu et al., 2017)
- Delexicalized training
- Machine translation
- Joint learning of embeddings

Delexicalized training
- Remove the Swedish words and train the ER on (Universal) POS tags
- Tag the POS-tagged Norwegian clinical text using the ER trained on Swedish data
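The idea can be sketched as a preprocessing step: once the word forms are dropped, a Swedish and a Norwegian sentence with the same structure look identical to the model. The sentences, POS tags, and the `B-FINDING` label below are invented for illustration, and the actual recognizer is left out; this only shows the data transformation.

```python
def delexicalize(sentence):
    """Split (word, upos, label) triples into POS-only inputs and labels.

    The word forms are discarded, so the resulting inputs are
    language-independent as long as both languages share the tag set.
    """
    features = [upos for _word, upos, _label in sentence]
    labels = [label for _word, _upos, label in sentence]
    return features, labels


# Invented Swedish training sentence with a hypothetical entity label.
swedish = [("Patienten", "NOUN", "O"), ("har", "VERB", "O"), ("feber", "NOUN", "B-FINDING")]
# Invented Norwegian sentence; its labels are unknown at prediction time.
norwegian = [("Pasienten", "NOUN", None), ("har", "VERB", None), ("feber", "NOUN", None)]

sv_features, sv_labels = delexicalize(swedish)
no_features, _ = delexicalize(norwegian)

print(sv_features, sv_labels)
print(no_features)
```

The obvious cost is that POS tags alone carry far less information than the words themselves.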

Machine translation
- Translate the Swedish clinical text to Norwegian using a machine translation system
- Train an ER on the translated Norwegian clinical text
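To train on translated text, the entity labels have to travel with the tokens. The sketch below makes a strong simplifying assumption: a word-for-word lexicon lookup stands in for the machine translation system, so each label can simply be copied to the translated token. The lexicon and the `B-SYMPTOM` label are invented for this example; a real MT system reorders and merges words, so label projection would need word alignments.

```python
# Toy Swedish-to-Norwegian lexicon, invented for this example.
TOY_LEXICON = {
    "patienten": "pasienten",
    "huvudvärk": "hodepine",
}


def translate_and_project(tagged_sentence, lexicon=TOY_LEXICON):
    """Translate token by token and copy each entity label unchanged.

    Assumes a one-to-one token correspondence, which real translation
    does not guarantee. Unknown words are passed through as-is.
    """
    return [(lexicon.get(word.lower(), word), label) for word, label in tagged_sentence]


swedish = [("patienten", "O"), ("har", "O"), ("huvudvärk", "B-SYMPTOM")]
print(translate_and_project(swedish))
```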

Generalize Word/Character Embeddings
- Word/character embeddings are typically trained on corpora from the same domain
- Norsk Legemiddelhåndbok is too small for BIG experiments
- Learn word embeddings jointly on Norsk Legemiddelhåndbok, Norwegian, and Swedish corpora
- Train an ER on the Swedish data using the word embeddings

QUESTIONS?
