Text-mining the Estonian National Electronic Health Record

Text-mining the Estonian National Electronic Health Record Raul Sirel rsirel@ut.ee 13.11.2015

Outline
- Electronic Health Records & Text Mining
- De-identifying the Texts
- Resolving the Abbreviations
- Terminology EXtraction and Text Analytics (TEXTA) Toolkit

Electronic Health Record (EHR) Peter B. Jensen, Lars J. Jensen and Søren Brunak. 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13, 395-405.

Estonian National Health Information System (ENHIS)
- A nationwide electronic health record
- All healthcare providers are obligated by law to forward their medical data to the ENHIS
- The main unit of data is the epicrisis, which contains information about:
  - the reason the patient arrived (anamnesis)
  - conducted procedures
  - medications
  - etc.

The Data

Epicrisis type                       2012        2013        Total
Outpatient consultation summaries    1 216 400   1 975 016   3 191 416
Discharge summaries                    214 874     208 171     423 045
Total                                1 431 274   2 183 187   3 614 461

2 years, ~1 million patients

Why Text Mining?
- A significant portion (~50%) of digital health data is unstructured (Hicks 2003).
- To do something useful with the data, we need to analyse the unstructured part.

[Figure: example record with patient complaints, pulse and blood pressure measurements marked in the text]

De-identifying the Texts
- Medical records contain sensitive information
- Identity-related information is often found among the unstructured data
- Prior to releasing the data to researchers, the identity-related information needs to be removed:
  - names
  - national identity numbers
  - phone numbers
  - etc.

De-identifying the Texts

Input:
Patsient John Doe Vanus 44 a. IK 77771478888 võeti statsionaarsele ravile. Asjaolude täpsustamiseks helistada dr. Hämarikule tel: 7177765, kell 10.00-13.00.

De-identified text (output of the de-identifier):
Patsient XXX Vanus 44 a. IK XXX võeti statsionaarsele ravile. Asjaolude täpsustamiseks helistada dr. XXX tel: XXX, kell 10.00-13.00.

95% of identity-related information removed

Under the Bonnet
- Motivation from Named Entity Recognition
- CRF (conditional random field) learning algorithm
- Surrounding words and grammatical attributes (case, number, etc.) as features

             CRF-based system   Dictionary-based system
Precision    97%                40%
Recall       95%                70%
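As a rough illustration of "surrounding words and grammatical attributes as features", the sketch below builds one feature dictionary per token, a shape that several CRF toolkits accept as input. It is not the system described in the slides: the function names, the window size and the attribute keys ("case", "number") are assumptions, and the grammatical attributes are assumed to come from some upstream morphological analyser.

```python
def token_features(sentence, i, window=2):
    """Build a feature dict for token i from the token, its neighbours and their grammar tags.

    `sentence` is a list of (word, attrs) pairs; `attrs` is a dict of grammatical
    attributes (hypothetical keys such as "case" and "number").
    """
    word, attrs = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalised words often signal names
        "word.isdigit": word.isdigit(),   # digit-only tokens may be ID or phone numbers
    }
    for name, value in attrs.items():
        features["attr." + name] = value

    # Add the surrounding words and their attributes within the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(sentence):
            neighbour, neighbour_attrs = sentence[j]
            features["%+d:word.lower" % offset] = neighbour.lower()
            for name, value in neighbour_attrs.items():
                features["%+d:attr.%s" % (offset, name)] = value
        else:
            features["%+d:pad" % offset] = True  # position falls outside the sentence
    return features


def sentence_features(sentence, window=2):
    """Feature dicts for every token of one sentence."""
    return [token_features(sentence, i, window) for i in range(len(sentence))]


# Toy input; the attribute values are illustrative, not real analyser output.
example = [
    ("Patsient", {"case": "nom", "number": "sg"}),
    ("John", {}),
    ("Doe", {}),
    ("Vanus", {"case": "nom", "number": "sg"}),
]
example_features = sentence_features(example)
```

Each token's dictionary would then be paired with a label (for instance "name" or "other") to train a sequence model.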

Resolving the Abbreviations
Up to 13%* of all tokens are abbreviations.

[Figure: plot with the axis "Length of the word"]

* Short functional words removed prior to analysis

Under the Bonnet
For each abbreviation in text:
- produce all possible full forms (rule-based model)
- select the most probable variant (statistical language model)

Context                               Full form
p silm ei näe...                      parem
kolmas p palavik...                   päev
vähene p pleurareaktsiooni riba...    parietaalne
p 6mm ümargune...                     pupill

Example scores for the candidate full forms of one occurrence:

Full form      Score
parietaalne    92%
parem          4%
päev           3%
pupill         0.3%
...
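The two steps above (generate candidates, then rank them with a language model) can be sketched as follows. This is only an illustration under stated assumptions: the candidate rule (any known word that starts with the abbreviation), the add-one-smoothed bigram model and the three-sentence training corpus are stand-ins invented here, not the rule set or language model used in the talk.

```python
from collections import Counter


def candidate_full_forms(abbreviation, vocabulary):
    """Rule-based step (toy rule): every known word that starts with the abbreviation."""
    abbr = abbreviation.lower().rstrip(".")
    return [w for w in vocabulary if w.startswith(abbr) and len(w) > len(abbr)]


class BigramModel:
    """Bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        """Smoothed estimate of P(word | prev)."""
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)


def resolve(abbreviation, left, right, vocabulary, model):
    """Statistical step: score each candidate in its context, best first, scores summing to 1."""
    scores = {}
    for candidate in candidate_full_forms(abbreviation, vocabulary):
        scores[candidate] = model.prob(left, candidate) * model.prob(candidate, right)
    total = sum(scores.values())
    return sorted(((c, s / total) for c, s in scores.items()), key=lambda pair: -pair[1])


# Toy data assembled from the words on the slide (hypothetical training corpus).
vocabulary = ["parem", "päev", "parietaalne", "pupill"]
corpus = [
    ["parem", "silm", "ei", "näe"],
    ["kolmas", "päev", "palavik"],
    ["pupill", "6mm", "ümargune"],
]
model = BigramModel(corpus)
ranking = resolve("p", "kolmas", "palavik", vocabulary, model)
```

With this toy corpus the top candidate for "kolmas p palavik" is "päev", because only that word has been seen between those two neighbours. A realistic model would be trained on a large corpus of unabbreviated text.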

Understanding the Text http://www.cvast.tuwien.ac.at/projects/iumls

Problems
- Language specificity: most existing NLP methodologies are language-specific and therefore not applicable to other languages
- Domain specificity: most NLP research currently focuses on general language (e.g. newspaper articles)
- Lack of semantic resources: when working with sublanguages, lexical resources (e.g. dictionaries or thesauri) built for general language often correspond poorly to actual language usage
- Scalability: existing NLP methods usually require large-scale resources in order to be used in big data analysis

The Objective
The aim was to build a system for exploratory text analytics which:
- is robust (and scalable)
- is domain independent
- doesn't require language-specific resources
- doesn't require external semantic resources

Terminology Extraction and Text Analytics (TEXTA) Toolkit
A system for:
- describing domain terminologies
- exploring and analysing the data using the defined terminologies

Terminology Extraction: Base Lexicon Extraction, Semantic Grouping of Words, Multi-word Expression Extraction
Text Analytics: Searches, Aggregations

For each subtask, the toolkit provides a corresponding tool

TEXTA: Base Lexicon Extraction
Base lexicon: a list of words describing some topic or semantic property, e.g.:
- symptoms: pain, nausea, queasiness, cut, etc.
- anatomical: head, hand, arm, leg, lung, etc.
- locations: left, right, central, lower, upper, medial, etc.
- etc.

TEXTA: Base Lexicon Extraction
1. The user enters some words
2. The system suggests similar words

Under the Bonnet
- Distributional semantics: "You shall know a word by the company it keeps" (Firth 1957)
- Distributional hypothesis: words with similar distributional properties are semantically similar
- Language modelling, word-vector modelling

[Figure: illustration with the words Furry, Cute, Filthy and Dog, Cat, Pig]

Under the Bonnet Semantic similarity in word-vector models using cosine similarity
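Cosine similarity between two word vectors u and v is their dot product divided by the product of their lengths. The sketch below shows how it could drive the "suggest similar words" step; the three-dimensional vectors are made-up toy numbers (one dimension each for the contexts furry, cute, filthy), and `suggest_similar` is a hypothetical helper, not a TEXTA function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def suggest_similar(seed_words, vectors, top_n=3):
    """Rank non-seed words by their mean cosine similarity to the seed words."""
    seeds = [vectors[w] for w in seed_words if w in vectors]
    if not seeds:
        return []
    scored = []
    for word, vec in vectors.items():
        if word in seed_words:
            continue
        score = sum(cosine_similarity(vec, s) for s in seeds) / len(seeds)
        scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors: invented co-occurrence counts with the contexts (furry, cute, filthy).
vectors = {
    "dog": [2, 2, 1],
    "cat": [2, 3, 0],
    "pig": [0, 1, 3],
}
suggestions = suggest_similar(["dog"], vectors)
```

Here "cat" ranks above "pig" as a suggestion for the seed "dog", since its toy vector points in a more similar direction.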

TEXTA: Semantic Grouping of Words
The aim is to group together words with similar meaning: headache, migraine, pain, ache, etc.
The user is supported with an interactive 2-D projection of the base lexicons:
- PCA
- MDS
- t-SNE

TEXTA: Semantic Grouping of Words
PCA plot of a base lexicon containing patient complaints:

[Figure: PCA plot with clusters of constipation-related, nausea-related and pain-related words]

The user can now group similar words into concepts (groups of words with similar meanings)

TEXTA: Multi-word Expressions
More complex concepts are represented as multi-word expressions.

Base lexicons:
- Complaints: pain, cut, ...
- Anatomical: head, arm, ...
- Locations: left, right, ...

Text corpus:
- Patient complaints pain in left arm.
- Motorcycle accident deep cut in right leg.
- ...

Base lexicons + text corpus -> multi-word expressions

Under the Bonnet
- A k-partite graph is a graph whose vertices are partitioned into k different independent sets
- k = number of base lexicons
- A multi-word expression is a path with a length of n (n ≤ k), whose vertices are located in different sets (the path is acyclic)

[Figure: example graphs for k=2 and k=3]
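One way to read the k-partite idea in code: treat every base lexicon as one vertex set and accept a token sequence as a multi-word expression only if each lexicon word in it comes from a different lexicon. The sketch below is a simplified reading, not TEXTA's implementation; the function name, the `max_gap` parameter (how many non-lexicon tokens may sit between two matched words) and the requirement of at least two matched words are assumptions, and each word is assumed to belong to exactly one lexicon.

```python
def find_multiword_expressions(tokens, lexicons, max_gap=2):
    """Return token spans whose lexicon words all belong to different lexicons.

    tokens:   list of lower-cased tokens of one sentence
    lexicons: dict mapping a lexicon name to a set of words (the k vertex sets)
    max_gap:  maximum number of non-lexicon tokens allowed between two matched words
    """
    word_to_lexicon = {}
    for name, words in lexicons.items():
        for word in words:
            word_to_lexicon[word] = name

    expressions = []
    path = []     # (index, word) pairs matched so far in the current path
    used = set()  # lexicons already visited by the current path

    for index, token in enumerate(tokens):
        lexicon = word_to_lexicon.get(token)
        if lexicon is None:
            continue
        too_far = bool(path) and index - path[-1][0] - 1 > max_gap
        if too_far or lexicon in used:
            # Close the current path: it may visit each lexicon at most once.
            if len(path) >= 2:
                expressions.append(tuple(tokens[path[0][0]:path[-1][0] + 1]))
            path, used = [], set()
        path.append((index, token))
        used.add(lexicon)

    if len(path) >= 2:
        expressions.append(tuple(tokens[path[0][0]:path[-1][0] + 1]))
    return expressions


# Base lexicons and sentences taken from the slide example.
lexicons = {
    "complaints": {"pain", "cut"},
    "anatomical": {"head", "arm", "leg"},
    "locations": {"left", "right"},
}
found = find_multiword_expressions("patient complaints pain in left arm".split(), lexicons)
```

For the first example sentence this yields the span "pain in left arm", a path of three vertices across the three lexicons (k = 3).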

TEXTA: Searches

TEXTA: Aggregating the Matches
Matching documents can be aggregated over any field in the dataset.

[Figure: bite-related documents aggregated over time]

TEXTA: Aggregating the Matches
Bite-related documents aggregated over diagnoses:
- Open wound of unspecified body region
- Venom of other arthropods
- Need for immunization against rabies
- Multiple open wounds of wrist and hand
- Cellulitis of other parts of limb
- Lyme disease (Borreliosis)
- Localized oedema
- Urticaria, unspecified

TEXTA: Aggregating the Matches
Bite-related documents aggregated over:
- significant words: to bite (verb), bite, wound, dog, tick, neighbour, anti-rabic
- gender: Female, Male
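Aggregating matches over a field amounts to searching first and then counting the values of that field among the hits. The sketch below illustrates this with plain Python; the document structure, the field names ("text", "gender", "month") and the four records are invented for the example and do not reflect the ENHIS data or TEXTA's search backend.

```python
from collections import Counter


def search(documents, terms, text_field="text"):
    """Return the documents whose text contains at least one of the terms as a token."""
    wanted = {t.lower() for t in terms}
    return [d for d in documents if wanted & set(d[text_field].lower().split())]


def aggregate(documents, field):
    """Count the documents per value of `field`, most frequent value first."""
    return Counter(d[field] for d in documents if field in d).most_common()


# Hypothetical records for illustration only.
documents = [
    {"text": "Dog bite on left hand", "gender": "Female", "month": "2013-06"},
    {"text": "Tick bite wound", "gender": "Male", "month": "2013-06"},
    {"text": "Headache and nausea", "gender": "Female", "month": "2013-07"},
    {"text": "Bite wound of wrist", "gender": "Female", "month": "2013-07"},
]
matches = search(documents, ["bite"])
by_gender = aggregate(matches, "gender")
by_month = aggregate(matches, "month")
```

The same `aggregate` call works for any field, which mirrors the idea that matches can be grouped by time, diagnosis or gender. Ranking significant words would additionally need a comparison against the background corpus, which is not shown here.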

TEXTA: Conclusion
TEXTA: a toolkit for performing text mining
The toolkit's workflow is based on:
- describing domain terminologies
- exploring and analysing the data using the defined terminologies
The sales pitch:
- it's robust (and scalable)
- it's domain independent
- it doesn't require language-specific resources
- it doesn't require external semantic resources

TEXTA: Demo
https://ehr.stacc.ee/public/texta

Overall Conclusion
The general aim is to provide resources for increasing the meaningful usage of unstructured data:
- clinical research
- quality of care assessments
- clinical decision support
- personalised medicine
- etc.

Thank You for listening!

References
Jensen et al. 2012. Jensen PB, Jensen LJ, Brunak S. 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13: 395-405.
Hicks 2003. Hicks J. 2003. The potential of claims data to support the measurement of health care quality. San Diego, CA: RAND.
Firth 1957. Firth JR. 1957. A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis (Oxford: Philological Society): 1-32. Reprinted in F.R. Palmer, ed. (1968). Selected Papers of J.R. Firth 1952-1959. London: Longman.