DIPF Workshop at the Lichtenberghaus
Chris Biemann, August 2, 2012
biem@cs.tu-darmstadt.de
Data-driven Methods for Text Analysis
Structure Discovery and Visualization in Scientific Literature
Outline
  What standard NLP can do
  Structure Discovery: unsupervised and knowledge-free methods
    Word co-occurrence
    Semantic Similarity
    Segmentation with Topic Models
  Visual Analytics of Language
  Conclusion
What standard NLP can do
  Indexing: find documents containing search terms, ordered by relevance, filtered by metadata (think: Google / OPAC)
  Low-level processing: language recognition, part-of-speech tagging, named entity recognition, glossing, segment detection, ...
  ... and where we are going:
    semantic matching and paraphrase recognition
    automatic summarization
    personal assistant
Structure Discovery: Collection exploration
  Structure Discovery:
    Units (words, sentences, documents) are characterized by distinguishing features
    Similar units are grouped: discovery of structure in the data
    Groups of units serve as new units/features for further analysis
    Structures can be visualized or used by processing systems
  Advantages:
    domain- and language-independent
    no manual creation of lexical resources or training material
Syntagmatic vs. Paradigmatic Relations
  Ferdinand de Saussure
  http://courses.nus.edu.sg/course/elltankw/history/vocab/b.htm
  Syntagmatic relations: syntactic constraints in the context
  Paradigmatic relations: associations, semantic constraints
Co-occurrence Graphs from Large Corpora
  Pairs of words that significantly co-occur within a given window
    significance test: scores interesting pairs higher
    result: top significant co-occurrences per word
  [Figure: co-occurrence graphs for the German example words Gesamtschule and Lerntheorie]
  Local neighborhoods: a word's top significant co-occurrences, and their top significant connections
  Mix of syntactic and semantic relations
  Biemann, C., Bordag, S., Heyer, G., Quasthoff, U., Wolff, C. (2004): Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea
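The slide does not name the significance test, so the following is only a minimal sketch under stated assumptions: the window is taken to be the sentence, and the log-likelihood ratio (G^2) is used as one common choice of test. The names `log_likelihood`, `significant_cooccurrences`, `top_neighbors` and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_likelihood(k_ab, k_a, k_b, n):
    """G^2 statistic for the 2x2 table of two words over n sentences."""
    observed = [k_ab, k_a - k_ab, k_b - k_ab, n - k_a - k_b + k_ab]
    expected = [
        k_a * k_b / n,
        k_a * (n - k_b) / n,
        (n - k_a) * k_b / n,
        (n - k_a) * (n - k_b) / n,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def significant_cooccurrences(sentences, min_count=2):
    """Score word pairs that share a sentence more often than chance predicts."""
    n = len(sentences)
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in sentences:
        types = sorted(set(tokens))  # count each word once per sentence
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))
    scores = {}
    for (a, b), k_ab in pair_freq.items():
        expected = word_freq[a] * word_freq[b] / n
        # Assumption: keep only pairs seen more often than expected,
        # because G^2 alone does not separate attraction from repulsion.
        if k_ab >= min_count and k_ab > expected:
            scores[(a, b)] = log_likelihood(k_ab, word_freq[a], word_freq[b], n)
    return scores


def top_neighbors(scores, word, top_n=5):
    """Return the top-scoring co-occurrence partners of one word."""
    ranked = sorted(
        ((s, b if a == word else a) for (a, b), s in scores.items() if word in (a, b)),
        reverse=True,
    )
    return [w for _, w in ranked[:top_n]]
```

Calling `top_neighbors` for a word and then again for each returned neighbor gives the kind of local neighborhood described above. Frequent function words tend to drop out because their observed co-occurrence count rarely exceeds the expected one.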
Distributional Thesaurus (DT)
  Computed from distributional similarity statistics
  Entry for a target word consists of a ranked list of neighbors

  Target word: meeting
    gathering       56.0
    seminar         49.0
    meet            46.0
    lecture         43.0
    conference      42.0
    concert         38.0
    fair            35.0
    exhibition      33.0
    demonstration   33.0
    reception       33.0
    rally           32.0
    presentation    30.0
    symposium       28.0
    screening       27.0
    workshop        26.0
    dinner          26.0
    occasion        25.0
    reading         25.0
    picnic          25.0
    congress        25.0
    ...

  Target word: PowerPoint
    Excel               4.9585013
    Word                3.4647698
    Access              2.8596914
    Outlook             2.617733
    Flash               1.792471
    Microsoft_Excel     1.7355845
    WordPerfect         1.5644555
    PostScript          1.4552999
    SVG                 1.3335394
    RTF                 1.3335392
    Microsoft_Word      1.3207517
    XML                 1.2791278
    Internet_Explorer   1.2188575
    DjVu                1.1352614
    TIFF                1.1352614
    PDB                 1.1352614
    insight             1.1162213
    ...

  [Figure: first-order vs. second-order relations between the German example words Kuh (cow) and Hund (dog), with context features bunt#attr, fliegen#subj, Katze#kon, die#det]
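How the similarity statistics are computed is not given on the slide. As a rough sketch, assuming that each word is described by a set of context features in the `word#relation` style of the figure and that similarity is simply the number of shared features, a thesaurus could be built as follows. The function name `build_thesaurus` and the scoring rule are assumptions, not the method behind the scores shown above.

```python
from collections import Counter, defaultdict


def build_thesaurus(observations, top_n=10):
    """Rank neighbors of each word by the number of shared context features.

    observations: iterable of (word, feature) pairs, e.g. ("Hund", "bunt#attr").
    """
    features = defaultdict(set)
    for word, feature in observations:
        features[word].add(feature)

    # Inverted index: which words were seen with each feature.
    words_by_feature = defaultdict(set)
    for word, feats in features.items():
        for feature in feats:
            words_by_feature[feature].add(word)

    thesaurus = {}
    for word, feats in features.items():
        shared = Counter()
        for feature in feats:
            for other in words_by_feature[feature]:
                if other != word:
                    shared[other] += 1
        # Highest overlap first; ties broken alphabetically.
        ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
        thesaurus[word] = ranked[:top_n]
    return thesaurus
```

Words are first-order related to their features and second-order related to each other through the features they share, which is the distinction the figure illustrates.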
Matching with semantic expansions
  Knowledge-based Word Sense Disambiguation (à la Lesk)

  Context: A patient fell over a stack of magazines in an aisle at a physiotherapist practice.
    patient: customer student individual person mother user passenger ..
    fell: rose dropped climbed increased slipped declined tumbled surged
    stack: pile copy lots dozens array collection amount ton
    aisle: field hill line river stairs road hall driveway
    physiotherapist: physician attorney psychiatrist scholar engineer journalist contractor
    practice: session game camp workouts training meeting work

  WordNet: S: (n) magazine (product consisting of a paperback periodic publication as a physical object) "tripped over a pile of magazines"
    tripped: jumped woke turned drove walked blew put fell ..
    pile: stack tons piece heap collection bag loads mountain ..

  Without expansions: zero word overlap
  With expansions: Overlap = 2, Overlap = 1, Overlap = 2
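The idea can be sketched in a few lines, assuming the expansions come from a thesaurus lookup and that overlap is counted on the expanded word sets. The helpers `expand`, `overlap` and `best_sense` are hypothetical, and the way overlap is counted here is one possible reading, so the numbers it produces need not match those on the slide.

```python
def expand(words, thesaurus):
    """Add the thesaurus neighbors of every word to the word set."""
    expanded = set(words)
    for word in words:
        expanded.update(thesaurus.get(word, ()))
    return expanded


def overlap(context, gloss, thesaurus=None):
    """Lesk-style overlap; with a thesaurus, both sides are expanded first."""
    thesaurus = thesaurus or {}
    return len(expand(context, thesaurus) & expand(gloss, thesaurus))


def best_sense(context, glosses, thesaurus=None):
    """Pick the sense whose gloss overlaps most with the context.

    glosses: dict mapping a sense label to the list of gloss words.
    """
    return max(glosses, key=lambda sense: overlap(context, glosses[sense], thesaurus))


# Toy data taken from the slide; the target word itself is left out.
thesaurus = {
    "fell": ["rose", "dropped", "climbed", "increased", "slipped", "declined", "tumbled", "surged"],
    "stack": ["pile", "copy", "lots", "dozens", "array", "collection", "amount", "ton"],
    "tripped": ["jumped", "woke", "turned", "drove", "walked", "blew", "put", "fell"],
    "pile": ["stack", "tons", "piece", "heap", "collection", "bag", "loads", "mountain"],
}
context = ["fell", "stack"]
gloss = ["tripped", "pile"]
plain = overlap(context, gloss)                 # no shared surface words
expanded = overlap(context, gloss, thesaurus)   # expansions create matches
```

Without expansions the context and the gloss example share no words, while the expanded sets meet through entries such as fell, stack, pile and collection.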
Text
  Mr. Pohs, previously executive vice president and chief operating officer, was named interim president and chief executive officer after David M. Harrold, a company founder, resigned from the posts for personal reasons in August. Cellular said Robert J. Lunday Jr., its chairman and another founder, resigned from the company's board to pursue the sale of his telephone company, Big Sandy Telecommunications Inc.
  Apartheid foes staged a massive antigovernment rally in South Africa. More than 70,000 people filled a soccer stadium on the outskirts of the black township of Soweto and welcomed freed leaders of the outlawed African National Congress. It was considered South Africa's largest opposition rally.
  Intuition: Cohesion within segments is higher than cohesion between segments
Text Segmentation using Topic Models
  Each token is annotated with a topic ID (token:ID):

  Mr.:62 Pohs:2 ,:2 previously:4 executive:2 vice:2 president:2 and:17 chief:2 operating:2 officer:2 ,:72 was:2 named:2 interim:2 president:2 and:73 chief:2 executive:2 officer:2 after:17 David:2 M:27 .:36 Harrold:65 ,:2 a:84 company:2 founder:2 ,:26 resigned:2 from:91 the:34 posts:2 for:62 personal:61 reasons:2 in:84 August:2 .:58
  Cellular:70 said:54 Robert:2 J:61 .:42 Lunday:2 Jr:18 .:31 ,:44 its:57 chairman:2 and:73 another:25 founder:2 ,:31 resigned:2 from:91 the:57 company:2 s:24 board:2 to:10 pursue:2 the:10 sale:55 of:67 his:28 telephone:31 company:42 ,:74 Big:10 Sandy:50 Telecommunications:31 Inc:2 .:74

  Apartheid:37 foes:37 staged:41 a:37 massive:37 antigovernment:37 rally:37 in:40 South:37 Africa:37 .:19
  More:29 than:34 70:45 ,:26 000 people:37 filled:17 a:22 soccer:37 stadium:88 on:46 the:34 outskirts:37 of:93 the:24 black:37 township:37 of:45 Soweto:37 and:37 welcomed:11 freed:37 leaders:37 of:98 the:57 outlawed:37 African:37 National:45 Congress:87 .:72
  It:79 was:55 considered:37 South:37 Africa:37 s:33 largest:90 opposition:67 rally:37 .:37

  Riedl, M., Biemann, C. (2012): TopicTiling: A Text Segmentation Algorithm based on LDA, Proc. of the Student Research Workshop of the 50th ACL, Jeju, Republic of Korea
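To make the intuition concrete, here is a simplified sketch that assumes topic IDs per token are already available: each sentence becomes a vector of topic-ID counts, and a boundary is placed wherever the cosine similarity between neighboring sentences falls below a threshold. This is not the published TopicTiling procedure; the function names and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(sentence_topics, threshold=0.5):
    """Return boundary positions and the similarities between neighbors.

    sentence_topics: one list of topic IDs per sentence.
    A boundary at position i means a new segment starts with sentence i.
    """
    vectors = [Counter(ids) for ids in sentence_topics]
    sims = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    boundaries = [i + 1 for i, sim in enumerate(sims) if sim < threshold]
    return boundaries, sims


# Topic IDs of the first tokens of four sentences from the example text.
sentence_topics = [
    [62, 2, 2, 4, 2, 2, 2, 17, 2, 2, 2],              # Mr. Pohs, previously ...
    [70, 54, 2, 61, 42, 2, 18, 31, 44, 57, 2],         # Cellular said Robert J. ...
    [37, 37, 41, 37, 37, 37, 37, 40, 37, 37, 19],      # Apartheid foes staged ...
    [79, 55, 37, 37, 37, 33, 90, 67, 37, 37],          # It was considered ...
]
boundaries, sims = segment(sentence_topics)
```

In this toy input the first two sentences are dominated by topic 2 and the last two by topic 37, so the similarity drops between the second and third sentence.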
Visual Analytics using NLP
  NLP: getting linguistic annotations right
  Visual Analytics: present data in an interesting way. The interpretation lies in the eye of the beholder
  NLP + Visual Analytics can yield interesting tools for literature research and document collection understanding
Term Maps and Concept Trails
  Background map: significant terms and their co-occurrences
  Red/yellow trail: sequence of terms in the incoming document
  [Figure: term map with labels including Georgien (Georgia), Afghanistan, Irak (Iraq)]
  Quickly maps a new document onto a background map, e.g. visualizes how a new document matches the current body of references
  Martin Riedl and Chris Biemann, TU Darmstadt
  Biemann, C., Böhm, K., Heyer, G., Melz, R. (2004): SemanticTalk: Software for Visualizing Brainstorming Sessions and Thematic Concept Trails on Document Collections, Proceedings of ECML/PKDD 2004, Pisa, Italy
Time lines: Frequency over time slices
  Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F., Biemann, C. (2006): Ord i Dag: Mining Norwegian Daily Newswire. Proc. FinTAL, Turku, Finland
  Quasthoff, U. (2007): Deutsches Neologismenwörterbuch. Neue Wörter und Wortbedeutungen in der Gegenwartssprache. Berlin, De Gruyter
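A time line of this kind can be approximated by counting a term per time slice and normalizing by the slice size. The sketch below assumes documents arrive as (time slice, token list) pairs; the function name `term_timeline` and this input format are hypothetical.

```python
from collections import Counter


def term_timeline(documents, term):
    """Relative frequency of a term per time slice.

    documents: iterable of (time_slice, tokens) pairs, e.g. ("2006-03", [...]).
    Returns a dict ordered by time slice label.
    """
    hits = Counter()
    totals = Counter()
    for time_slice, tokens in documents:
        totals[time_slice] += len(tokens)
        hits[time_slice] += tokens.count(term)
    # Normalize so that slices with more text do not dominate the curve.
    return {s: hits[s] / totals[s] for s in sorted(totals) if totals[s]}
```

Plotting the returned values over the sorted slice labels gives the frequency curve; a sudden rise from zero is one simple signal for a new word or a new topic.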
Conclusions
  Language Technology can solve many basic preprocessing tasks
  Structure Discovery can be used to unveil specific phenomena and relations of language units
    methodology is independent of domain or language
    resulting structure is domain-specific and adapts to changes
  Visual Analytics of language material
    visual aid for locating an incoming document in a background map
    time series analysis
    (many more possible, ask Daniela Oelke)
  Statistics over text is a powerful tool to support literature research
  Language technology cannot replace human researchers, editors and authors. But it can make their job easier!
Q&A
Positional Co-occurrences: sagte vs. meinte
  Also store the distance between words in the sentence
  Captures parts of syntactic structure
  Similar terms have similar contexts
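One way to store the distance is to key every co-occurrence by the signed offset between the two words. The following is a minimal sketch under that assumption; the function name `positional_cooccurrences` and the `max_distance` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def positional_cooccurrences(sentences, max_distance=2):
    """Count (offset, neighbor) pairs for every word.

    An offset of -1 is the word directly to the left, +1 directly to the right.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - max_distance)
            end = min(len(tokens), i + max_distance + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][(j - i, tokens[j])] += 1
    return counts


# Toy sentences for the slide's example verbs sagte ("said") and meinte ("meant").
sentences = [
    ["er", "sagte", "dass"],
    ["sie", "meinte", "dass"],
]
counts = positional_cooccurrences(sentences)
shared_contexts = set(counts["sagte"]) & set(counts["meinte"])
```

Two words that share many (offset, neighbor) entries occur in the same positions relative to the same words, which is the sense in which similar terms have similar contexts.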
Clustering of DT entries: Sense Induction
  [Figure: sense clusters for the example entries paper#nn and bright#jj]
Cooc- PEDOCS