Structure Discovery and Visualization in Scientific Literature


DIPF Workshop at the Lichtenberghaus
Chris Biemann, August 2, 2012
biem@cs.tu-darmstadt.de
Data-driven Methods for Text Analysis
Structure Discovery and Visualization in Scientific Literature

Outline
What standard NLP can do
Structure Discovery: unsupervised and knowledge-free methods
  Word co-occurrence
  Semantic Similarity
  Segmentation with Topic Models
Visual Analytics of Language
Conclusion

What standard NLP can do
Indexing: find documents containing search terms, ordered by relevance, filtered by metadata (think: Google / OPAC)
Low-level processing: language recognition, part-of-speech tagging, named entity recognition, glossing, segment detection, ...
... and where we are going:
  semantic matching and paraphrase recognition
  automatic summarization
  personal assistant

Structure Discovery: Collection exploration
Structure Discovery:
  Units (words, sentences, documents) are characterized by distinguishing features
  Similar units are grouped: discovery of structure in the data
  Groups of units become new units/features for further analysis
  Structures can be visualized or used by processing systems
Advantages:
  domain- and language-independent
  no manual creation of lexical resources or training material

Syntagmatic vs. Paradigmatic Relations
Ferdinand de Saussure
http://courses.nus.edu.sg/course/elltankw/history/vocab/b.htm
Syntagmatic relations: syntactic constraints in the context
Paradigmatic relations: associations, semantic constraints

Co-occurrence Graphs from Large Corpora
Pairs of words that significantly co-occur within a given window
  significance test: scores interesting pairs higher
  result: top-significant co-occurrences per word
[Figure: co-occurrence graphs for the example words Gesamtschule (comprehensive school) and Lerntheorie (learning theory)]
Local neighborhoods: top significant co-occurrences, and their top significant connections
Mix of syntactic and semantic relations
Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr.: Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea
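The slide does not say which significance test is used. As a minimal sketch, the following assumes Dunning's log-likelihood ratio (G²) over a 2x2 contingency table of ordered (word, neighbor) observations inside the window; the function name, the window size and the toy corpus are illustrative and not taken from the talk.

```python
import math
from collections import Counter, defaultdict


def significant_cooccurrences(sentences, window=2, top_n=5):
    """Rank, for every word, its window neighbors by log-likelihood (G^2)."""
    pair = Counter()    # (word, neighbor) observations inside the window
    first = Counter()   # observations with the word in first position
    second = Counter()  # observations with the word in second position
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair[(w, tokens[j])] += 1
                    first[w] += 1
                    second[tokens[j]] += 1
    n = sum(pair.values())

    def g2(k11, row, col):
        # Observed vs. expected counts of the 2x2 contingency table.
        observed = [k11, row - k11, col - k11, n - row - col + k11]
        expected = [row * col / n, row * (n - col) / n,
                    (n - row) * col / n, (n - row) * (n - col) / n]
        return 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0)

    ranked = defaultdict(list)
    for (w, v), k11 in pair.items():
        # Keep only pairs that occur more often than expected by chance.
        if k11 > first[w] * second[v] / n:
            ranked[w].append((v, g2(k11, first[w], second[v])))
    return {w: sorted(vs, key=lambda item: -item[1])[:top_n]
            for w, vs in ranked.items()}


# Invented toy corpus, only to show the call.
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "loud"],
    ["the", "city", "is", "big"],
    ["the", "town", "is", "loud"],
    ["new", "york", "city"],
]
for word, neighbors in significant_cooccurrences(corpus, window=1).items():
    print(word, neighbors)
```

Keeping only the top-ranked neighbors per word and linking them to each other yields the local neighborhoods that the graph visualizes.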

Distributional Thesaurus (DT)
Computed from distributional similarity statistics
Entry for a target word consists of a ranked list of neighbors

meeting: gathering 56.0, seminar 49.0, meet 46.0, lecture 43.0, conference 42.0, concert 38.0, fair 35.0, exhibition 33.0, demonstration 33.0, reception 33.0, rally 32.0, presentation 30.0, symposium 28.0, screening 27.0, workshop 26.0, dinner 26.0, occasion 25.0, reading 25.0, picnic 25.0, congress 25.0, ...

PowerPoint: Excel 4.9585013, Word 3.4647698, Access 2.8596914, Outlook 2.617733, Flash 1.792471, Microsoft_Excel 1.7355845, WordPerfect 1.5644555, PostScript 1.4552999, SVG 1.3335394, RTF 1.3335392, Microsoft_Word 1.3207517, XML 1.2791278, Internet_Explorer 1.2188575, DjVu 1.1352614, TIFF 1.1352614, PDB 1.1352614, insight 1.1162213, ...

[Figure: first-order vs. second-order relations between Kuh (cow) and Hund (dog), with context features bunt#attr, fliegen#subj, Katze#kon, die#det]
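One simple way to obtain such ranked neighbor lists is to count how many context features two words share. The sketch below assumes exactly that; the talk does not specify the similarity statistic, and the feature sets (written in the word#relation style of the figure) are invented for illustration.

```python
from collections import Counter, defaultdict


def build_thesaurus(word_features, top_n=5):
    """Rank neighbors of each word by the number of shared context features."""
    # Invert the mapping: feature -> words that occur with it.
    feature_words = defaultdict(set)
    for word, features in word_features.items():
        for feature in features:
            feature_words[feature].add(word)

    # Two words gain one point for every feature they have in common.
    shared = defaultdict(Counter)
    for words in feature_words.values():
        for a in words:
            for b in words:
                if a != b:
                    shared[a][b] += 1

    # Sort by descending score, break ties alphabetically.
    return {word: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
            for word, counts in shared.items()}


# Hypothetical first-order features (not taken from the slide).
features = {
    "meeting": {"attend#obj", "hold#obj", "annual#attr", "schedule#obj"},
    "seminar": {"attend#obj", "hold#obj", "annual#attr"},
    "concert": {"attend#obj", "hold#obj"},
    "banana": {"eat#obj", "yellow#attr"},
}
print(build_thesaurus(features)["meeting"])
```

Words that share no feature with any other word, such as "banana" here, simply get no entry.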

Matching with semantic expansions
Knowledge-based Word Sense Disambiguation (à la Lesk)
Context: "A patient fell over a stack of magazines in an aisle at a physiotherapist practice."
WordNet: S: (n) magazine (product consisting of a paperback periodic publication as a physical object) "tripped over a pile of magazines"
Without expansions: zero word overlap
Expansions of context words:
  patient: customer, student, individual, person, mother, user, passenger, ...
  fell: rose, dropped, climbed, increased, slipped, declined, tumbled, surged
  stack: pile, copy, lots, dozens, array, collection, amount, ton
  aisle: field, hill, line, river, stairs, road, hall, driveway
  physiotherapist: physician, attorney, psychiatrist, scholar, engineer, journalist, contractor
  practice: session, game, camp, workouts, training, meeting, work
Expansions of gloss words:
  tripped: jumped, woke, turned, drove, walked, blew, put, fell, ...
  pile: stack, tons, piece, heap, collection, bag, loads, mountain, ...
With expansions: Overlap = 2, Overlap = 1, Overlap = 2
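The effect of the expansions can be sketched as a set overlap between the expanded context and the expanded gloss example. This is a simplified reading of the slide, not the author's implementation: the assignment of expansion lists to individual words is inferred from the slide layout, only a subset of the lists is used, and the raw overlap computed here is not meant to reproduce the per-item overlap figures shown on the slide.

```python
def expand(tokens, thesaurus):
    """Return the tokens together with their thesaurus neighbors."""
    expanded = set(tokens)
    for token in tokens:
        expanded.update(thesaurus.get(token, ()))
    return expanded


def overlap(context, gloss, thesaurus=None):
    """Lesk-style overlap; with a thesaurus both sides are expanded first."""
    if thesaurus is None:
        return set(context) & set(gloss)
    return expand(context, thesaurus) & expand(gloss, thesaurus)


# Content words only; the target word "magazines" and stop words are left out.
context = ["patient", "fell", "stack", "aisle", "physiotherapist", "practice"]
gloss_example = ["tripped", "pile"]

# Subset of the expansion lists on the slide (word assignment is an assumption).
thesaurus = {
    "fell": ["rose", "dropped", "climbed", "slipped", "tumbled"],
    "stack": ["pile", "copy", "lots", "dozens", "array", "collection"],
    "tripped": ["jumped", "woke", "turned", "drove", "walked", "blew", "put", "fell"],
    "pile": ["stack", "tons", "piece", "heap", "collection", "bag"],
}

print(overlap(context, gloss_example))             # no shared words
print(overlap(context, gloss_example, thesaurus))  # shared words after expansion
```

A sense whose gloss or example overlaps most with the expanded context would then be preferred.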

Text

Mr. Pohs, previously executive vice president and chief operating officer, was named interim president and chief executive officer after David M. Harrold, a company founder, resigned from the posts for personal reasons in August. Cellular said Robert J. Lunday Jr., its chairman and another founder, resigned from the company's board to pursue the sale of his telephone company, Big Sandy Telecommunications Inc.

Apartheid foes staged a massive antigovernment rally in South Africa. More than 70,000 people filled a soccer stadium on the outskirts of the black township of Soweto and welcomed freed leaders of the outlawed African National Congress. It was considered South Africa's largest opposition rally.

Intuition: Cohesion within segments is higher than cohesion between segments.

Text Segmentation using Topic Models
(each token is annotated as word:topic ID)

Mr.:62 Pohs:2 ,:2 previously:4 executive:2 vice:2 president:2 and:17 chief:2 operating:2 officer:2 ,:72 was:2 named:2 interim:2 president:2 and:73 chief:2 executive:2 officer:2 after:17 David:2 M:27 .:36 Harrold:65 ,:2 a:84 company:2 founder:2 ,:26 resigned:2 from:91 the:34 posts:2 for:62 personal:61 reasons:2 in:84 August:2 .:58 Cellular:70 said:54 Robert:2 J:61 .:42 Lunday:2 Jr:18 .:31 ,:44 its:57 chairman:2 and:73 another:25 founder:2 ,:31 resigned:2 from:91 the:57 company:2 s:24 board:2 to:10 pursue:2 the:10 sale:55 of:67 his:28 telephone:31 company:42 ,:74 Big:10 Sandy:50 Telecommunications:31 Inc:2 .:74

Apartheid:37 foes:37 staged:41 a:37 massive:37 antigovernment:37 rally:37 in:40 South:37 Africa:37 .:19 More:29 than:34 70:45 ,:26 000 people:37 filled:17 a:22 soccer:37 stadium:88 on:46 the:34 outskirts:37 of:93 the:24 black:37 township:37 of:45 Soweto:37 and:37 welcomed:11 freed:37 leaders:37 of:98 the:57 outlawed:37 African:37 National:45 Congress:87 .:72 It:79 was:55 considered:37 South:37 Africa:37 s:33 largest:90 opposition:67 rally:37 .:37

Riedl M., Biemann C. (2012): TopicTiling: A Text Segmentation Algorithm based on LDA, Proc. of the Student Research Workshop of the 50th ACL, Jeju, Republic of Korea
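The intuition behind segmenting with topic IDs can be sketched as follows. This is a deliberately simplified stand-in for TopicTiling, not the published algorithm: it compares the topic-count vectors of adjacent sentences by cosine similarity and places boundaries at the weakest gaps, without the windowing and depth scores one would expect in a full implementation. The topic IDs are transcribed from the slide (the token "000" carries no ID there and is skipped); the function names are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two topic-count vectors stored as Counters."""
    dot = sum(count * b[topic] for topic, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def gap_similarities(sentence_topics):
    """Similarity between each sentence and the one that follows it."""
    vectors = [Counter(topics) for topics in sentence_topics]
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]


def boundaries(sentence_topics, n_boundaries=1):
    """Indices i such that a boundary falls between sentence i and i + 1."""
    sims = gap_similarities(sentence_topics)
    weakest = sorted(range(len(sims)), key=lambda i: sims[i])[:n_boundaries]
    return sorted(weakest)


# One list of topic IDs per sentence, in slide order.
SENTENCE_TOPICS = [
    [62, 2, 2, 4, 2, 2, 2, 17, 2, 2, 2, 72, 2, 2, 2, 2, 73, 2, 2, 2, 17, 2, 27,
     36, 65, 2, 84, 2, 2, 26, 2, 91, 34, 2, 62, 61, 2, 84, 2, 58],
    [70, 54, 2, 61, 42, 2, 18, 31, 44, 57, 2, 73, 25, 2, 31, 2, 91, 57, 2, 24,
     2, 10, 2, 10, 55, 67, 28, 31, 42, 74, 10, 50, 31, 2, 74],
    [37, 37, 41, 37, 37, 37, 37, 40, 37, 37, 19],
    [29, 34, 45, 26, 37, 17, 22, 37, 88, 46, 34, 37, 93, 24, 37, 37, 45, 37,
     37, 11, 37, 37, 98, 57, 37, 37, 45, 87, 72],
    [79, 55, 37, 37, 37, 33, 90, 67, 37, 37],
]

print(gap_similarities(SENTENCE_TOPICS))
print(boundaries(SENTENCE_TOPICS))
```

On this data the second and third sentence share no topic ID at all, so the weakest gap coincides with the change from the business story to the South Africa story.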

Visual Analytics using NLP
NLP: getting linguistic annotations right
Visual Analytics: present data in an interesting way. The interpretation lies in the eye of the beholder
NLP + Visual Analytics can yield interesting tools for literature research and document collection understanding

Term Maps and Concept Trails
Background Map: significant terms and their co-occurrences
Red/Yellow Trail: sequence of terms in incoming document
[Figure: term map around Georgien (Georgia), Afghanistan, Irak (Iraq)]
Quickly maps a new document in a background map, e.g. visualizes how a new document matches the current body of references
Martin Riedl and Chris Biemann, TU Darmstadt
Biemann, C., Böhm, K., Heyer, G., Melz, R. (2004): SemanticTalk: Software for Visualizing Brainstorming Sessions and Thematic Concept Trails on Document Collections, Proceedings of ECML/PKDD 2004, Pisa, Italy

Time lines: Frequency over time slices
Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C. (2006): Ord i Dag: Mining Norwegian Daily Newswire. Proc. FinTAL, Turku, Finland
Quasthoff, U. (2007): Deutsches Neologismenwörterbuch. Neue Wörter und Wortbedeutungen in der Gegenwartssprache. Berlin, De Gruyter
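A time line of this kind can be sketched as the relative frequency of a term per time slice. The example assumes ISO-formatted dates and monthly slices; the document format, the function name and the sample data are illustrative and not taken from the cited systems.

```python
from collections import Counter


def term_timeline(documents, term, slice_key=lambda date: date[:7]):
    """Relative frequency of `term` per time slice (default: YYYY-MM)."""
    hits, totals = Counter(), Counter()
    for date, tokens in documents:
        time_slice = slice_key(date)
        totals[time_slice] += len(tokens)
        hits[time_slice] += tokens.count(term)
    # Normalize by slice size so that busy months do not dominate.
    return {s: hits[s] / totals[s] for s in sorted(totals) if totals[s]}


# Invented mini collection of (date, tokens) pairs.
documents = [
    ("2006-01-03", ["a", "b", "a", "c"]),
    ("2006-01-20", ["a", "d", "e", "f"]),
    ("2006-02-01", ["b", "c"]),
]
print(term_timeline(documents, "a"))
```

Passing a different `slice_key`, for example `lambda date: date[:4]`, switches from monthly to yearly slices.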

Conclusions
Language Technology can solve many basic preprocessing tasks
Structure Discovery can be used to unveil specific phenomena and relations of language units
  methodology is independent of domain or language
  resulting structure is domain-specific and adapts to changes
Visual Analytics of language material
  visual aid for locating an incoming document in a background map
  time series analysis
  (many more possible, ask Daniela Oelke)
Statistics over text is a powerful tool to support literature research
Language technology cannot replace human researchers, editors and authors. But it can make their job easier!

Q&A

Positional Co-occurrences: sagte ("said") vs. meinte ("opined")
Also store the distance between words in the sentence
  captures parts of syntactic structure
  similar terms have similar contexts
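Storing the distance can be sketched by keying each co-occurrence on the signed offset between the two words. The two German sentences below are invented for illustration, and the maximum distance of 2 is an arbitrary choice.

```python
from collections import Counter, defaultdict


def positional_contexts(sentences, max_dist=2):
    """Map each word to a Counter over (offset, neighbor) pairs."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for offset in range(-max_dist, max_dist + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    contexts[word][(offset, tokens[j])] += 1
    return contexts


def shared_contexts(contexts, a, b):
    """Positional contexts that two words have in common."""
    return set(contexts[a]) & set(contexts[b])


# Invented example sentences.
sentences = [
    ["er", "sagte", ",", "dass", "es", "regnet"],
    ["sie", "meinte", ",", "dass", "es", "schneit"],
]
contexts = positional_contexts(sentences)
print(shared_contexts(contexts, "sagte", "meinte"))
```

Both verbs are followed by a comma at offset +1 and by "dass" at offset +2, which is the kind of shared positional context that makes them look similar.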

Clustering of DT entries: Sense Induction
[Figure: clustered DT entries for paper#nn and bright#jj]
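The slide does not name the clustering method. As a rough sketch, the neighbors in a DT entry can be linked whenever one of them appears in the other's own entry, and each connected component of that graph can be read as one induced sense. Connected components are a stand-in chosen for brevity here, and the thesaurus entries are hypothetical.

```python
def induce_senses(target, thesaurus):
    """Group the DT neighbors of `target` into connected components."""
    neighbors = set(thesaurus.get(target, ()))
    # Link two neighbors if either one lists the other in its own DT entry.
    adjacency = {
        n: {m for m in neighbors
            if m != n and (m in thesaurus.get(n, ()) or n in thesaurus.get(m, ()))}
        for n in neighbors
    }
    senses, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        cluster, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in cluster:
                cluster.add(node)
                stack.extend(adjacency[node] - cluster)
        seen |= cluster
        senses.append(sorted(cluster))
    return senses


# Hypothetical DT entries (not taken from the slide).
thesaurus = {
    "paper": ["newspaper", "magazine", "journal", "cardboard", "plastic"],
    "newspaper": ["magazine", "journal"],
    "magazine": ["newspaper"],
    "cardboard": ["plastic"],
}
print(induce_senses("paper", thesaurus))
```

With these entries the neighbors of "paper" fall into a material group and a publication group.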

Cooc- PEDOCS