Towards a Vecsigrafo Portable Semantics in Knowledge-based Text Analytics. Ronald Denaux & José Manuel Gómez Pérez HSSUES Oct.


Towards a Vecsigrafo Portable Semantics in Knowledge-based Text Analytics Ronald Denaux & José Manuel Gómez Pérez HSSUES Oct. 21st, 2017

The Cognitive Chasm
How can humans and AI interact with and understand each other?
Machine understanding vs. human understanding: is this possible, or are they cognitively disconnected?
What mechanisms are needed to cross the cognitive chasm?
How can knowledge representation be flexible, scalable, deep and logical all at once?

Pros and cons of structured knowledge
PROS
  Humans have a rich understanding of the domain, resulting in detailed, expressive models
  Underlying formalisms support logical explanations
  Reasonable response times
  Tooling can optimize cost, enabling user-entered knowledge
CONS
  Requires a considerable amount of well-trained, centralized labor to manually encode knowledge
  Lacks scalability with large corpora and is still costly due to humans in the loop
  Possible bias, hard to generalize
  Brittleness

Structured knowledge (Sensigrafo)
Sensigrafo: a knowledge graph containing word definitions, related concepts and linguistic information
Main entities include syncons (concepts), lemmas (canonical representation of a word) and relations (properties, taxonomical, polysemy, synonymy, ...)
  301,582 syncons
  401,028 lemmas
  80+ relation types that yield ~2.8 million links
Internal representation that leverages external resources, both general and domain-specific
Word-sense disambiguation, based on the context of a word in Sensigrafo
Categorization and extraction supported through Sensigrafo plus lexical-syntactic rules

Building multiple language models
Word2vec represents words in a vector space, making natural language computer-readable
Neural word embeddings enable word similarity, analogy and relatedness based on vector arithmetic (cosine similarity)
Essential property: semantic portability
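To make the vector-arithmetic point concrete, here is a minimal sketch of cosine similarity and an analogy query. The three-dimensional vectors are made up for illustration and are not taken from any trained model; the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, exclude=()):
    """Return the vocabulary item whose vector is closest to `query`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))


# Toy embeddings, invented for this example only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.5, 0.5],
}

# Analogy by vector arithmetic: king - man + woman
analogy = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]
best = most_similar(analogy, toy_vectors, exclude=("king", "man", "woman"))
print(best)
```

With real embeddings the same two functions apply unchanged; only the vectors differ.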

Towards Natural Learning at Expert System
Knowledge embedded in document corpora
  Broad, flexible, scalable
  Good for POS tagging, parsing, semantic relatedness
  Statistical induction, not logical explanation
  Lack of true understanding of real-world semantics and pragmatics
Knowledge encoded in the mind of the expert
  Structured knowledge base
  Good for logical deduction and explanation
  Deep, but rigid and brittle
  Human is a bottleneck: hand-engineered features and powerful modeling tools needed
Vecsigrafo: automatically learning how language is used in real life and materializing that in structured knowledge graphs

Vecsigrafo: Putting it all together

Vocab elements   EN-grafo Sensi   EN-grafo Vecsi   ES-grafo Sensi   ES-grafo Vecsi
Lemmas           398              80               268              91
Concepts         300              67               226              52
Total            698              147              474              143

Corpus     Sentences    Spanish words   English words
Europarl   1,965,734    51,575,748      49,093,806
UN.en-es   21,911,121   678,778,068     590,672,799

Two parallel corpora, focused on English and Spanish (Europarl and UN)
Meaning extracted from corpora and related to Sensigrafo (21% and 30% of Sensigrafo covered, resp.)
Tokenized, lemmatized and disambiguated with COGITO
Learned monolingual joint word-concept models and a (non-linear) transformation between vector spaces for crosslinguality
Deeplearning4j with Skip-gram, minfreq 10, vector dimensionality 400
TensorFlow and Swivel for better vectorization time (~16x & ~20x speedup, 80 epochs)

Vecsigrafo - Evaluation

Model               WSim   WSrel   Simlex999   Rarewords   Simverb
SotA 2015           79.4   70.6    43.3        50.8        n/a
Swivel              74.8   61.6    40.3        48.3        62.8
Swivel UN, en       58.8   45.0    18.3        37.8        15.3
Vecsigrafo UN, en   47.6   24.1    12.4        30.8        13.2

Word Prediction Plots (quality validation and hypothesis checking)
[Figure: word prediction plots; axis labels: average cosim, most frequent, least frequent; panels: a) Random baseline, b) Buggy correlations, c) Uncentered, d) Re-centered]

Corpus size and distribution matter
Overall performance equivalent at lemma level (Swivel, same corpus)
Including concepts has a cost
Visual inspection (t-SNE, PCA) and manual inspection (relatedness, analogy, ...)
Further insight needed

Vecsigrafo: Word Similarity Redux

Model                                WSim   WSrel   Simlex999   Rarewords   Simverb
SotA 2015                            79.4   70.6    43.3        50.8        62.8
Swivel                               74.8   61.6    40.3        48.3        n/a
Swivel UN, en                        58.8   45.0    18.3        37.8        15.3
Swivel UN, en recentered             57.7   47.2    21.3        39.2        17.0
Vecsigrafo UN, en                    47.6   24.1    12.4        30.8        13.2
Vecsigrafo UN, en                    69.9   51.6    38.2        50.3*       30.6
Vecsigrafo UN, en recentered         59.3   43.0    42.4        49.3        30.4
Vecsigrafo UN, en NN aligned to es   65.8   45.3    39.2        49.3        28.5

Better than Swivel for same corpus
Effect of recentering
Effect of aligning to Spanish
Further insight needed
  How similar are two Vecsigrafos?
  Which relations are inferred?
  How are relations encoded in the embedding space?
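The slides do not define "recentering". One common reading, assumed here and not confirmed by the source, is subtracting the mean embedding from every vector so that the vocabulary is centered on the origin. A minimal sketch under that assumption, with a hypothetical `recenter` helper and invented toy vectors:

```python
def recenter(vectors):
    """Subtract the mean vector from every embedding.

    Assumption: this is what "recentered" means on the slide; the source
    does not spell out the procedure.
    """
    if not vectors:
        return {}
    dim = len(next(iter(vectors.values())))
    count = len(vectors)
    mean = [sum(v[i] for v in vectors.values()) / count for i in range(dim)]
    return {key: [x - m for x, m in zip(v, mean)] for key, v in vectors.items()}


# Toy two-dimensional embeddings, invented for this example.
toy = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
centered = recenter(toy)
print(centered)
```

After recentering, cosine similarities are computed on the shifted vectors, which can change similarity rankings when the original vectors share a large common offset.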

Vecsigrafo: Application Roadmap
Crosslinguality
  Map individual Vecsigrafos
  Correlate and identify modeling gaps in Sensigrafos
  Suggest crosslingual synonyms
Assisted Sensigrafo Learning
Fast internationalization at Expert System (EU, US, LATAM) and growing customer needs in 14 languages

Mapping and correlation
Mapping vector spaces in different languages:
  Linear transformation suggested by (Mikolov, 2013) produced poor results
  Non-linear transformation using NNs: hit@5 = 0.78 and 90% semantic relatedness
Manual inspection showed only 28% exact correspondence EN → ES, due to volume (75K fewer concepts in the Spanish Sensigrafo) and strategic modeling decisions
How to address the gap?

Alignment performance
Method   Nodes   hit@5
TM       n/a     0.36
NN2      4K      0.61
NN2      5K      0.68
NN2      10K     0.78
NN3      5K      0.72

Manual inspection EN → ES
                in dict.   out dict.
#concepts       46         64
hit@5           0.72       0.28
no concept ES   2          33
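The hit@5 figures above can be read as the fraction of source concepts whose correct target concept appears among the five nearest candidates after mapping. A minimal sketch of that metric follows; the function name `hit_at_k`, the concept identifiers and the ranked lists are all hypothetical and only illustrate the calculation.

```python
def hit_at_k(predictions, gold, k=5):
    """Fraction of gold pairs whose target is among the top-k predictions.

    predictions: source id -> list of target ids, best candidate first
    gold:        source id -> correct target id
    """
    if not gold:
        raise ValueError("gold mapping is empty")
    hits = 0
    for source, target in gold.items():
        top_k = predictions.get(source, [])[:k]
        if target in top_k:
            hits += 1
    return hits / len(gold)


# Invented identifiers, not real Sensigrafo concepts.
gold = {"en#1": "es#1", "en#2": "es#2", "en#3": "es#3", "en#4": "es#4"}
predictions = {
    "en#1": ["es#1", "es#9", "es#8"],                          # correct at rank 1
    "en#2": ["es#7", "es#6", "es#2"],                          # correct at rank 3
    "en#3": ["es#5", "es#6", "es#7", "es#8", "es#9", "es#3"],  # correct at rank 6
    "en#4": ["es#9"],                                          # never correct
}
print(hit_at_k(predictions, gold, k=5))
```

Because the metric only checks membership in the top k, it rewards a mapping that places the right concept nearby even when it is not the single nearest neighbour.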

Examples
Scrap value (EN → ES)
Financing (EN → ES)
PYME (ES → EN)

Crosslingual synonym suggester
Combines features from the bilingual Vecsigrafo, the target and source Sensigrafos and a dictionary (PanLex)
1. For each concept in the source language, find the n nearest concepts in the target language that match grammar type (noun, verb, adjective, etc.)
2. For each candidate, calculate hybrid features (lemma translation, glossa similarity, cosine similarity, shared hypernyms and domains)
3. Combine into a single score and rank
4. Check if the suggested synonym candidate is already mapped to a different concept and compare
5. Suggestion made if score is over a threshold
[Chart: manual inspection EN → ES over 1546 IPTC concepts; categories: No suggestions, Clashing, Non clashing]
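The five steps above can be sketched as a small ranking routine. The slides do not say how the features are combined, so the weighted sum, the weights, the threshold and every identifier below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical feature weights and threshold; the source gives no values.
WEIGHTS = {
    "lemma_translation": 0.3,
    "gloss_similarity": 0.2,
    "cosine_similarity": 0.3,
    "shared_hypernyms": 0.1,
    "shared_domains": 0.1,
}
THRESHOLD = 0.5


@dataclass
class Candidate:
    concept_id: str
    pos: str                                       # grammar type: noun, verb, ...
    features: dict = field(default_factory=dict)   # feature name -> value in [0, 1]


def score(candidate):
    """Step 3: combine the hybrid features into one score (assumed weighted sum)."""
    return sum(w * candidate.features.get(name, 0.0) for name, w in WEIGHTS.items())


def suggest(source_id, source_pos, nearest, existing_mappings):
    """Steps 1 to 5 for one source concept.

    nearest:           target-language candidates (step 1 neighbours)
    existing_mappings: target concept id -> source concept id already linked
    Returns (target id, score, clashes) or None when nothing passes the threshold.
    """
    same_pos = [c for c in nearest if c.pos == source_pos]        # step 1 filter
    ranked = sorted(same_pos, key=score, reverse=True)            # steps 2 and 3
    if not ranked or score(ranked[0]) < THRESHOLD:                # step 5
        return None
    best = ranked[0]
    mapped_to = existing_mappings.get(best.concept_id)            # step 4
    clashes = mapped_to is not None and mapped_to != source_id
    return best.concept_id, score(best), clashes


# Invented candidates for one English noun concept.
nearest = [
    Candidate("es#A", "noun", {"lemma_translation": 1.0, "gloss_similarity": 0.6,
                               "cosine_similarity": 0.8, "shared_hypernyms": 1.0,
                               "shared_domains": 1.0}),
    Candidate("es#B", "verb", {"lemma_translation": 1.0, "cosine_similarity": 0.9}),
    Candidate("es#C", "noun", {"gloss_similarity": 0.2, "cosine_similarity": 0.5}),
]
result = suggest("en#X", "noun", nearest, existing_mappings={"es#A": "en#Y"})
print(result)
```

Here the top candidate passes the threshold but is flagged as clashing, because it is already linked to another source concept in the toy mapping.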

Wrapping up

Ronald Denaux, Senior Researcher, rdenaux@expertsystem.com
Jose Manuel Gomez-Perez, Director R&D, jmgomez@expertsystem.com

Denaux R, Gomez-Perez JM. Towards a Vecsigrafo: Portable Semantics in Knowledge-based Text Analytics. To appear in proceedings of the Intl. Workshop on Hybrid Statistical Semantic Understanding and Emerging Semantics (HSSUES), co-located with the 16th Intl. Semantic Web Conference (ISWC), Vienna, 2017.

linkedin.com/company/expert-system
twitter.com/expert_system
info@expertsystem.com


Correlation calculation
Sentence: "Develop an indicative list of advisory and conciliatory measures to encourage full compliance;"

Tokenize & WSD
  en#67083 develop
  en#89749 indicative
  en#113271 list
  en#88602 advisory
  en#85521 conciliatory
  en#33443 measure
  en#77189 encourage
  en#84127 full
  en#4941 compliance

Correlation for en_lem_list (window 2, harmonic weight)
token          distance   weight
en#67083       2          ½
develop        2          ½
en#89749       1          1
indicative     1          1
en#113271      0          1
list           0          1
en#88602       1          1
advisory       1          1
en#85521       2          ½
conciliatory   2          ½
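The weight table above can be reproduced with a short routine. This is a sketch inferred from the slide rather than the authors' implementation: it assumes each position holds a concept and its lemma, that both share the same distance to the focus position, and that the harmonic weight is 1/distance, with distance 0 given weight 1 as the table shows. The function name `harmonic_weights` is hypothetical.

```python
# (concept, lemma) pairs as listed on the slide after tokenization and WSD.
tokens = [
    ("en#67083", "develop"),
    ("en#89749", "indicative"),
    ("en#113271", "list"),
    ("en#88602", "advisory"),
    ("en#85521", "conciliatory"),
    ("en#33443", "measure"),
    ("en#77189", "encourage"),
    ("en#84127", "full"),
    ("en#4941", "compliance"),
]


def harmonic_weights(tokens, focus_index, window=2):
    """List (item, distance, weight) for every item within `window` positions."""
    rows = []
    start = max(0, focus_index - window)
    stop = min(len(tokens), focus_index + window + 1)
    for position in range(start, stop):
        distance = abs(position - focus_index)
        # Assumption taken from the table: distance 0 gets weight 1.
        weight = 1.0 if distance == 0 else 1.0 / distance
        for item in tokens[position]:  # concept first, then lemma
            rows.append((item, distance, weight))
    return rows


# Focus on the lemma "list", which sits at index 2.
rows = harmonic_weights(tokens, focus_index=2, window=2)
for item, distance, weight in rows:
    print(f"{item:<14}{distance:<4}{weight}")
```

Summing these weights over every occurrence of a focus item in the corpus would give weighted co-occurrence counts of the kind a Swivel-style model is trained on.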