NewsReader: Automatically extracting Events, Entities and Perspectives from Newspapers

Marieke van Erp
marieke.van.erp@vu.nl
http://mariekevanerp.com

NewsReader
http://www.newsreader-project.eu
ICT 316404, FP7-ICT-2011-8: Jan. 2013 - Dec. 2015
Consortium: Vrije Universiteit Amsterdam (NL), The University of The Basque Country (ES), Fondazione Bruno Kessler (IT), LexisNexis (NL), ScraperWiki (now The Sensible Code Company, UK) & SynerScope (NL)
Read massive streams of news from many different sources.
Record the changes in the world as they are told in the sources, in 4 languages: English, Dutch, Spanish and Italian.
What happened, where and when, and who was involved.
From unstructured text to structured RDF (through a happy marriage between Computational Linguistics and Semantic Web researchers).
Provenance: who made what statement, where do sources agree and disagree, what is their emotion or judgement.

From Text to RDF

Natural Language Processing Pipeline

NLP Annotation Format (NAF)
Stand-off XML
Based on KAF, TAF and LAF, and uses URIs (from RDF)
NAF-FoLiA converters are in progress
Each annotation receives a new layer
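To make the stand-off, layered idea concrete, here is a minimal Python sketch. The XML fragment is hand-written and simplified for illustration; it is not output of the NewsReader pipeline, and the element and attribute names (wf, term, span, target) are assumptions that should be checked against the NAF specification. The point it shows is that the term layer does not copy the text: it only points back to token identifiers in the text layer.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified NAF-like fragment (illustrative assumption,
# not real pipeline output).
NAF_SAMPLE = """
<NAF version="v3">
  <text>
    <wf id="w1" offset="0" length="4">Mary</wf>
    <wf id="w2" offset="5" length="4">sold</wf>
    <wf id="w3" offset="10" length="3">the</wf>
    <wf id="w4" offset="14" length="4">book</wf>
  </text>
  <terms>
    <term id="t1" lemma="Mary" pos="R"><span><target id="w1"/></span></term>
    <term id="t2" lemma="sell" pos="V"><span><target id="w2"/></span></term>
    <term id="t3" lemma="the" pos="D"><span><target id="w3"/></span></term>
    <term id="t4" lemma="book" pos="N"><span><target id="w4"/></span></term>
  </terms>
</NAF>
"""


def terms_with_tokens(naf_xml):
    """Resolve each term's stand-off span back to the surface tokens."""
    root = ET.fromstring(naf_xml.strip())
    # Token layer: id -> surface form.
    tokens = {wf.get("id"): wf.text for wf in root.iter("wf")}
    resolved = []
    for term in root.iter("term"):
        # The term layer only stores references (targets) to token ids.
        words = [tokens[target.get("id")] for target in term.iter("target")]
        resolved.append((term.get("id"), term.get("lemma"), " ".join(words)))
    return resolved


for term_id, lemma, surface in terms_with_tokens(NAF_SAMPLE):
    print(term_id, lemma, surface)
```

Because each layer refers to earlier layers by identifier, a new module can add its own layer without rewriting what previous modules produced.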

Semantic Annotation
Pipeline layers (bottom to top): Input text, Tokenisation, Lexical Analysis, Syntactic Analysis, Semantic Analysis, Pragmatic Analysis (speaker's intended meaning)
Semantic Analysis tasks: Named Entity Recognition & Linking, From words to concepts, Semantic Role Labelling, Recognising Temporal Expressions & Relations, Wikification

Named Entity Recognition & Linking
Semi-supervised NER: R. Agerri, G. Rigau. Robust multilingual Named Entity Recognition with shallow semi-supervised features. Artificial Intelligence, 238 (2016) 63-82. JCR 2015: 3.371
Named Entity Linking (DBpedia Spotlight): Daiber, Joachim, et al. "Improving efficiency and accuracy in multilingual entity extraction." Proceedings of the 9th International Conference on Semantic Systems. ACM, 2013.

Named Entities in NAF

Why link to a resource such as DBpedia?
It allows you to query for fine-grained entity types: give me all politicians in the dataset, give me all football players.
Plus: the background knowledge provides additional filters: give me all politicians born after 1900 in the dataset.
Caveat: the background knowledge is not complete.
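A minimal sketch of what such a query amounts to, in plain Python rather than SPARQL. The entity records, the ex: URIs, the type labels and the birth years below are all invented placeholders, not DBpedia data. The last entity has no birth year on purpose, to show how incomplete background knowledge silently drops results.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class LinkedEntity:
    uri: str                      # identifier the mention was linked to
    types: FrozenSet[str]         # fine-grained types from the knowledge base
    birth_year: Optional[int] = None  # None when the knowledge base lacks it


# Invented toy records (assumption, for illustration only).
ENTITIES = [
    LinkedEntity("ex:person_a", frozenset({"Person", "Politician"}), 1954),
    LinkedEntity("ex:person_b", frozenset({"Person", "Politician"}), 1874),
    LinkedEntity("ex:person_c", frozenset({"Person", "FootballPlayer"}), 1987),
    LinkedEntity("ex:person_d", frozenset({"Person", "Politician"})),
]


def filter_entities(entities, entity_type, born_after=None):
    """Select entities of a type, optionally restricted by birth year."""
    selected = []
    for entity in entities:
        if entity_type not in entity.types:
            continue
        if born_after is not None:
            # Entities without a known birth year cannot pass the filter.
            if entity.birth_year is None or entity.birth_year <= born_after:
                continue
        selected.append(entity)
    return selected


print([e.uri for e in filter_entities(ENTITIES, "Politician")])
print([e.uri for e in filter_entities(ENTITIES, "Politician", born_after=1900)])
```

The second call returns fewer politicians than were actually born after 1900 if some birth years are missing, which is the caveat in practice.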

Named Entity Recognition & Linking
We are developing a new entity linker that allows for use of datasets other than DBpedia and is less sensitive to general entity popularity.
Discovering more about Dark and NIL entities is also ongoing work.

From words to concepts
Linking terms to synonyms to obtain a higher level of abstraction
Word-sense disambiguation + WordNet + Multilingual Central Repository + FrameNet + PropBank
Stop, quit, leave, relinquish, bow out -> all linked to the concept wn:leave_office
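The abstraction step can be sketched as a lexicon lookup over lemmas. This is a deliberately naive illustration: the function name and the greedy longest-match strategy are assumptions, and it performs no word-sense disambiguation, so it would wrongly tag every "stop" or "leave" regardless of context. Only the five lemmas and the concept label come from the slide.

```python
# Lemma (or multiword lemma) -> concept, taken from the slide's example.
CONCEPT_LEXICON = {
    "stop": "wn:leave_office",
    "quit": "wn:leave_office",
    "leave": "wn:leave_office",
    "relinquish": "wn:leave_office",
    "bow out": "wn:leave_office",
}


def tag_concepts(lemmas, lexicon=CONCEPT_LEXICON, max_len=2):
    """Greedy longest-match lookup of lemma sequences in a concept lexicon."""
    tagged = []
    i = 0
    while i < len(lemmas):
        for n in range(max_len, 0, -1):
            if i + n > len(lemmas):
                continue  # not enough lemmas left for an n-word phrase
            phrase = " ".join(lemmas[i:i + n])
            if phrase in lexicon:
                tagged.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            i += 1  # no entry starts at this position
    return tagged


print(tag_concepts(["the", "minister", "bow", "out"]))
print(tag_concepts(["she", "relinquish", "the", "post"]))
```

Once different verbs share one concept label, a single query on that label retrieves all of them.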

From Words to Concepts

Why link to WordNet/ConceptNet/etc?
It allows you to query for types rather than instances: give me all lawsuits in the dataset.
In the context of CLARIAH, we are converting various diachronic lexicons to Linked Data to:
integrate resources
tag interesting concepts in text
support query expansion

New synonym/concept lists are easy to plug in

Semantic Role Labelling
Detecting the agent, patient, recipient and theme of a sentence
Example: Mary sold the book to John
Agent: Mary
Recipient: John
Theme: the book

Sources (2013-06-17): http://english.alarabiya.net, http://www.telegraph.co.uk
Headlines: "Qatar Holding sells 10% stake in Porsche to founding families", "Porsche family buys back 10pc stake from Qatar"
Both are mapped to a single event:
Event12 (buy/sell)
type: fn:commerce_money_transfer
fn:buyer: dbp:porsche_family
fn:seller: dbp:qatarholding
fn:goods: Entity23 (10% stake)
fn:money: ?
sem:hastime: 2013-06-17
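The event graph on this slide can be written down as subject-predicate-object triples. The sketch below uses plain Python tuples instead of an RDF library. The identifiers are copied from the slide, except ex:mentionedIn, which is a hypothetical predicate introduced here only to show that one event instance can point back to both news sources. fn:money is left out because its value is unknown on the slide.

```python
# Triples for the single event both headlines describe.
# "ex:mentionedIn" is a hypothetical predicate (assumption), not NewsReader's.
EVENT_TRIPLES = [
    ("Event12", "rdf:type", "fn:commerce_money_transfer"),
    ("Event12", "fn:buyer", "dbp:porsche_family"),
    ("Event12", "fn:seller", "dbp:qatarholding"),
    ("Event12", "fn:goods", "Entity23"),
    ("Event12", "sem:hastime", "2013-06-17"),
    ("Event12", "ex:mentionedIn", "http://english.alarabiya.net"),
    ("Event12", "ex:mentionedIn", "http://www.telegraph.co.uk"),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(EVENT_TRIPLES, "Event12", "fn:buyer"))
print(objects(EVENT_TRIPLES, "Event12", "ex:mentionedIn"))
```

Whether a source says "sells" or "buys back", the role assignments end up identical, which is why the two reports collapse into one event.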

Event abstractions
Enable searches such as: give me all lawsuits in which a politician was involved between 1990 and 2000.

Pragmatic Analysis
Factuality/Attribution
Who said what, who agrees with whom, how certain is a speaker about her statement, is she talking about the past, present or future?
Pipeline layer: Pragmatic Analysis (speaker's intended meaning), on top of Semantic Analysis, Syntactic Analysis, Lexical Analysis, Tokenisation and Input text

Perspective
Sentence: Pro-EU campaigners have hoped that big carmakers would also support the Remain campaign.
Statement: big carmakers support the Remain campaign
Values: CONFIRM CERTAIN FUTURE POSITIVE
Statement: Pro-EU campaigners hoped
Source: FINANCIAL TIMES
Values: CONFIRM_CERTAIN_PAST_NEUTRAL
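The perspective labels on this slide pack several values into one string. A small sketch of unpacking them follows. The field names (polarity, certainty, tense, sentiment) are assumptions inferred from the example values and are not taken from the NewsReader documentation.

```python
from typing import NamedTuple


class Perspective(NamedTuple):
    # Field names are assumptions inferred from the slide's example values.
    polarity: str
    certainty: str
    tense: str
    sentiment: str


def parse_perspective(label):
    """Split a four-part perspective label into named fields."""
    # The slide shows the parts separated by underscores or by spaces.
    parts = label.strip().replace(" ", "_").split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {label!r}")
    return Perspective(*parts)


print(parse_perspective("CONFIRM_CERTAIN_PAST_NEUTRAL"))
print(parse_perspective("CONFIRM CERTAIN FUTURE POSITIVE"))
```

With the values separated, one can filter statements by a single dimension, for example everything a source presents as future.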

and beyond

Find out more
All modules and evaluations are described in: http://kyoto.let.vu.nl/newsreader_deliverables/nwr-d4-2-3.pdf (158 pages!)
http://www.newsreader-project.eu/results/software/
Black box setup
Links to individual modules on GitHub
Hadoop package for batch processing
New developments: http://www.clariah.nl & https://github.com/clariah

Discussion
It's research software (no fancy interface)
Currently not adapted to deal with old spelling variants, OCR, etc.
NLP isn't perfect (but humans don't always agree either!)
What would it take for you to start using such tools?
What types of analyses are most interesting to the community?
What use cases are most useful to the community at this point in time?

Thank you for your attention https://youtu.be/rylavn3oqli