NLP Structured Data Investigation on Non-Text


NLP Structured Data Investigation on Non-Text. Casey Stella, Spring 2015.

Table of Contents: Preliminaries, Borrowing from NLP, Demo, Questions.

Introduction: I'm a Principal Architect at Hortonworks. I work primarily doing data science in the Hadoop ecosystem. Prior to this, I've spent my time and had a lot of fun: doing data mining on medical data at Explorys using the Hadoop ecosystem; doing signal processing on seismic data at Ion Geophysical using MapReduce; and being a graduate student in the Math department at Texas A&M, working in algorithmic complexity theory.

Domain Challenges in Data Science: A data scientist has to merge analytical skills with domain expertise. Often we're thrown into places where we have insufficient domain experience. Gaining this expertise can be challenging and time-consuming. Unsupervised machine learning techniques can be very useful for understanding complex data relationships. We'll use an unsupervised structure-learning algorithm borrowed from NLP to look at medical data.

Word2Vec: Word2Vec is a vectorization model created by Google [1] that attempts to learn relationships between words automatically, given a large corpus of sentences. It gives us a way to find similar words by finding near neighbors in the vector space with cosine similarity. It uses a neural network to learn the vector representations. Recent work by Pennington, Socher, and Manning [2] shows that the word2vec model is equivalent to weighting a word co-occurrence matrix by window distance and lowering its dimension via matrix factorization. Takeaway: the technique boils down, intuitively, to a riff on word co-occurrence. See http://radimrehurek.com/2014/12/making-sense-of-word2vec/ for more.
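To make the nearest-neighbor idea concrete, here is a minimal sketch (not from the original talk) that trains a word2vec model on a toy corpus with the gensim library and queries near neighbors by cosine similarity; the corpus and parameters are illustrative assumptions, and the parameter names follow gensim 4.x.

# Minimal gensim sketch (illustrative; not part of the original talk).
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "patient", "was", "prescribed", "metformin"],
    ["the", "patient", "was", "diagnosed", "with", "diabetes"],
    ["the", "patient", "was", "prescribed", "insulin"],
]

# Skip-gram model; "window" controls how far apart two words may be
# and still count as co-occurring.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

# Near neighbors in the vector space, ranked by cosine similarity.
print(model.wv.most_similar("metformin", topn=3))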

Clinical Data as Sentences: Clinical encounters form a sort of sentence over time. For a given encounter: vitals are measured (e.g. height, weight, BMI); labs are performed and results are recorded (e.g. blood tests); procedures are performed; diagnoses are made (e.g. diabetes); drugs are prescribed. Each of these can be considered a clinical word, and the encounter forms a clinical sentence. Idea: we can use word2vec to investigate connections between these clinical concepts.
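As an illustration of the encoding, here is my own sketch of how one encounter might be flattened into a token list. The field names, code systems, and bucketing thresholds are invented for illustration and are not taken from the talk's actual Pig/Hive preprocessing.

# Hypothetical sketch: turn one encounter's events into a "clinical sentence".
def bucket(value, low, high):
    # Discretize a continuous measurement so it becomes a finite "word".
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def encounter_to_sentence(encounter):
    tokens = []
    tokens += ["vital_%s_%s" % (name, bucket(v, lo, hi))
               for name, (v, lo, hi) in encounter["vitals"].items()]
    tokens += ["lab_%s_%s" % (code, bucket(v, lo, hi))
               for code, (v, lo, hi) in encounter["labs"].items()]
    tokens += ["proc_%s" % code for code in encounter["procedures"]]
    tokens += ["dx_%s" % code for code in encounter["diagnoses"]]
    tokens += ["rx_%s" % code for code in encounter["prescriptions"]]
    return tokens

example = {
    "vitals": {"bmi": (31.0, 18.5, 25.0)},
    "labs": {"a1c": (7.9, 4.0, 5.6)},
    "procedures": [],
    "diagnoses": ["250.00"],       # ICD-9 code for diabetes, as an example
    "prescriptions": ["metformin"],
}
print(encounter_to_sentence(example))
# ['vital_bmi_high', 'lab_a1c_high', 'dx_250.00', 'rx_metformin']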

Demo: As part of a Kaggle competition (https://www.kaggle.com/c/pf2012-diabetes), Practice Fusion, a digital electronic medical records provider, released depersonalized clinical records of 10,000 patients. I ingested and preprocessed these records into 197,340 clinical sentences using Pig and Hive. MLlib from Spark now contains an implementation of word2vec, so let's use pyspark and IPython Notebook to explore this dataset on Hadoop.
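The gist of that exploration, as a hedged sketch rather than the talk's actual notebook: the input path, parameters, and the query token (which follows the hypothetical encoding sketched above) are assumptions, and the RDD-based pyspark.mllib API is the one available at the time.

# Sketch of the pyspark exploration; file path and token names are assumed.
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="clinical-word2vec")

# One clinical sentence per line, tokens separated by spaces.
sentences = (sc.textFile("hdfs:///data/clinical_sentences.txt")
               .map(lambda line: line.split(" ")))

model = Word2Vec().setVectorSize(100).setMinCount(5).fit(sentences)

# Near neighbors of a diagnosis token by cosine similarity.
for word, similarity in model.findSynonyms("dx_250.00", 10):
    print(word, similarity)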

Questions: Thanks for your attention! Questions? Code & scripts for this talk are available on my GitHub presentations page: http://github.com/cestella/presentations/. Find me at http://caseystella.com. Twitter handle: @casey_stella. Email address: cstella@hortonworks.com.

Bibliography
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. Association for Computational Linguistics, 2014.