CLARIN-PL Research User-driven Language Technology Infrastructure


Research User-driven Language Technology Infrastructure
Maciej Piasecki
Wrocław University of Technology, G4.19 Research Group
maciej.piasecki@pwr.wroc.pl

Basic Notions
- Language Technology (LT): language resources and tools that are
  - robust in terms of quality and coverage
  - multipurpose
  - component-based
- Language Technology Infrastructure (LTI): a software framework (architecture or platform) for combining language tools with language resources into processing chains (pipelines)
  - the defined processing chains are then applied to language data sources
  - interoperability, also with external systems
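The notion of a processing chain can be illustrated with a minimal sketch. It assumes that every tool is a function that adds one annotation layer to a shared document, and that a language resource (here a tiny lexicon) is plugged into a tool. All names (`split_sentences`, `tokenize`, `make_lexicon_tagger`, `run_chain`, `TOY_LEXICON`) are hypothetical and are not part of any CLARIN service.

```python
from typing import Callable, Dict, List

# A document is a plain dictionary; each tool adds one annotation layer to it.
Document = Dict[str, object]
Tool = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    """Toy sentence splitter: breaks on full stops only (not robust)."""
    text = str(doc["text"])
    doc["sentences"] = [s.strip() for s in text.split(".") if s.strip()]
    return doc


def tokenize(doc: Document) -> Document:
    """Toy tokenizer: whitespace split of every sentence."""
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc


def make_lexicon_tagger(lexicon: Dict[str, str]) -> Tool:
    """Combine a language resource (a word -> tag lexicon) with a tool."""
    def tag(doc: Document) -> Document:
        doc["tags"] = [
            [lexicon.get(token.lower(), "unknown") for token in sentence]
            for sentence in doc["tokens"]
        ]
        return doc
    return tag


def run_chain(tools: List[Tool], text: str) -> Document:
    """Apply the tools in order to one language data source."""
    doc: Document = {"text": text}
    for tool in tools:
        doc = tool(doc)
    return doc


# Hypothetical three-entry lexicon, for illustration only.
TOY_LEXICON = {"ala": "noun", "ma": "verb", "kota": "noun"}
chain = [split_sentences, tokenize, make_lexicon_tagger(TOY_LEXICON)]
result = run_chain(chain, "Ala ma kota. Kot ma Alę.")
print(result["tags"])
```

Because every tool shares the same signature, tools can be reordered or replaced without touching the rest of the chain, which is the kind of interoperability the slide refers to.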

LT in Humanities and Social Sciences: Barriers
- Physical: language tools and resources are not accessible on the Internet
- Informational: descriptions are not available, or there is no means of searching them
- Technological: lack of commonly accepted standards for LT, lack of a common platform, a variety of technological solutions, insufficient users' computers
- Knowledge-related: the use of LT requires programming skills or knowledge of natural language engineering
- Legal: licences for language resources and tools (LRTs) limit their applications

CLARIN Support for Humanities & Social Sciences
- CLARIN is an ERIC-type consortium of 11 countries (Austria, Bulgaria, Czech Republic, Denmark, Estonia, Germany, Lithuania, The Netherlands, Poland, Portugal, Sweden) and the Dutch Language Union
- 1 observer: Norway
- Focus area: supporting research in Humanities and Social Sciences
- Users: researchers, PhD students, students and scientific institutions

CLARIN Mission
- To significantly lower the barriers to the use of Language Technology in Humanities & Social Sciences (H&SS)
- To facilitate or enable research methods based on automated analysis of text and speech resources

CLARIN Offer
- Decreased Physical and Informational Barriers
  - Integration of different LT components into one interoperable system
  - Common, flexible meta-data standard (CMDI)
  - Central searching for resources (Virtual Language Observatory)
  - One sign-on and one login into the distributed infrastructure
- Decreased Technological Barrier
  - Common standards: promoting, co-ordinating, harmonising
  - Web Services for Language Tools and Resources
- Decreased Knowledge Barrier
  - Installation-free access via Web Applications
- Decreased Legal Barrier
  - Common licences and promotion of open access

CLARIN: Portal

CLARIN: Virtual Language Observatory

CLARIN: Federated Content Search Searching Corpora

LTI Development Paradigms
- Bottom-up: a "collected offer" approach
  - based on linking together already existing Language Resources and Tools
  - focused on accessibility, technical interoperability and processing chains
- Top-down: follows the user-centred design paradigm
  - research applications for H&SS are the starting point
- Bi-directional: linking of Language Resources and Tools combined with the development of research applications

Bi-directional LTI Development
- Idea: development of the necessary elements
  - a distributed network infrastructure
  - a basic LT processing chain
  combined with a user-centred approach to the development of research applications
- Top-down part
  - close co-operation with key users from the H&SS domain
  - a metaphor of an Agile-like lightweight software design method, with emphasis on prototyping
  - amendments to the shape of the technical basis: LRTs, standards
  - inspirations, identification of further user needs, next iterations

CLARIN-PL: the Consortium
- Polish scientific consortium
  - Wrocław University of Technology, G4.19 Research Group
  - Institute of Computer Science, Polish Academy of Sciences
  - Polish-Japanese Institute of Information Technology, Chair of Multimedia
  - University of Łódź, PELCRA group at the Chair of English Language and Applied Linguistics
  - Institute of Slavic Studies, Polish Academy of Sciences
  - Wrocław University
- Goal: implementation of the Polish part of the CLARIN ERIC LTI
- Follows the bi-directional approach to LTI development

CLARIN-PL: Mission
- Starting point
  - several language resources and tools for Polish were publicly available, but many were still lacking
  - a deeper technological barrier: restricted applications
- Pillars
  - Language Technology Centre (www.clarin-pl.eu): the Polish node of the CLARIN distributed infrastructure
  - a complete set of the basic Language Resources & Tools for Polish
  - research applications for H&SS: a first set for key users and selected H&SS sub-domains

Language Technology Centre
- Location: Wrocław University of Technology
- Based on a modified DSpace system from LINDAT (Czech CLARIN)
- One sign-on, one login (a member of the Pioneer.id Federation)
- Advanced repository system for language resources
  - Persistent Identifiers for resources and tools
  - rich CMDI meta-data
  - CLARIN-wide visibility in the central search
  - interface for Federated Content Search
  - depositing service for researchers from H&SS
  - application for the Data Seal of Approval
- Adherence to all CLARIN specifications about standards and protocols
- Web Services for LRTs: the basic processing chain for Polish
- Prototype system for flexible composition of natural language processing chains
  - support for developers
  - SOAP & REST interfaces
- Web Applications for LRTs
- Knowledge sharing: expertise and support for the users

CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon: plWordNet 3.0, a wordnet for Polish (formal description of lexical meanings)
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar

CLARIN-PL: Language Resources
- Starting point: a set of large resources
  - a huge National Corpus of Polish (1 billion tokens)
  - plWordNet 2.1, a very large wordnet for Polish
  - Korpus Politechniki Wrocławskiej, an open Polish corpus with rich annotation
- Expanded resources
  - plWordNet 3.0, a huge semantic lexicon of Polish
    - a comprehensive description of the Polish lexico-semantic system (~200 000 lemmas, ~280 000 senses)
    - fully mapped to the English Princeton WordNet
    - described formally by mapping to an ontology
  - Dictionary of multiword expressions, described syntactically
  - NELexicon 2.0, a huge lexicon of Polish Proper Names (2.5 million)

CLARIN-PL: Language Resources for Polish
- Expanded resources
  - Conversational corpus (following PELCRA and NKJP)
  - A large semantic valency lexicon for Polish predicative lexical units
- Newly built resources
  - Transcribed training-testing Polish speech corpus
  - Bilingual corpora: Polish-English, Polish-Bulgarian-Russian, Polish-Lithuanian
  - Polish historical corpus (for the years 1945-1954)
  - Corpora annotated for: meta-data, anaphora, time expressions, spatial expressions, semantic relations and situations

plWordNet 2.2 at http://plwordnet.pwr.edu.pl


CLARIN-PL: Language Tools for Polish
- Systems for searching corpora, especially Polish corpora
  - Spokes for conversational and bilingual corpora
  - Poliqarp 2.0 for richly annotated Historical corpora [New]
- Text mining (information extraction)
  - Recognition and classification of Proper Names
  - Recognition of anaphoric links
  - Recognition and classification of time expressions and spatial expressions [New]
  - Situation recognition [New]
- Extraction of multiword expressions (collocations)
- A generic set of morpho-syntactic tools for Polish that can be adapted to a domain specified by the user [New]

CLARIN-PL: Language Tools for Polish
- Word Sense Disambiguation based on plWordNet
- Shallow semantic parser [New]
- Deep syntactic-semantic parser [New]
- Tools for the extraction of semantic-pragmatic information from documents and collections of documents, e.g. keywords [New], semantic relations between text fragments, and text summaries

Basic Language Tools for Polish
1. Segmentation into tokens and sentences
2. Morphological analysis
3. Morphological guessing of unknown words (both without context and context-sensitive)
4. Morpho-syntactic tagging
5. Word Sense Disambiguation
6. Chunker and shallow syntactic parser
7. Named Entity Recognition and disambiguation
8. Co-reference and anaphora resolution
9. Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and constituent
14. Deep semantic parser


CLARIN-PL: Processing Chain for Polish

CLARIN-PL: Recognition and classification of Proper Names

Bi-directional, Top-down Part: First Applications
- Approaching users who are
  - already active, interested, and working on large textual and speech resources
  - covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  - matching the available language tools for Polish
- The first set of several prototype applications, illustrating possibilities and facilitating identification of the needs
- First applications
  - Spokes: searching corpora of conversational data
  - A system for collecting Polish text corpora from the Web
  - An open textometric and stylometric system focused on Polish
  - Semantic text classification for sociology
  - Literary Map

Spokes (University of Łódź) http://spokes.clarin-pl.eu

System for Collecting Polish Text Corpora from the Web
- Requests from the users revealed gaps in the available technology: existing corpus-building systems were
  - too sensitive to text encoding errors found on the web
  - not designed for informal corpora like blogs
- A system for collecting Polish text corpora from the Web had to be constructed. It
  - is based on tools from Masaryk University in Brno
  - detects texts that include a larger number of errors (by morphological analysis)
  - supports semi-automated extraction of texts from blogs, forum posts, etc.
  - is integrated with tools for processing
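The idea of detecting error-laden texts can be sketched as follows. This is only an approximation under stated assumptions: a lookup in a set of known word forms stands in for a real morphological analyser, and the threshold `max_ratio` as well as the names `unknown_word_ratio`, `filter_clean` and `KNOWN_FORMS` are hypothetical.

```python
import re
from typing import Iterable, List, Set


def unknown_word_ratio(text: str, known_forms: Set[str]) -> float:
    """Share of word tokens that the (stand-in) analyser does not recognise."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 1.0  # an empty text is treated as unusable
    unknown = sum(1 for word in words if word not in known_forms)
    return unknown / len(words)


def filter_clean(texts: Iterable[str], known_forms: Set[str],
                 max_ratio: float = 0.3) -> List[str]:
    """Keep only texts whose unknown-word ratio does not exceed max_ratio."""
    return [t for t in texts if unknown_word_ratio(t, known_forms) <= max_ratio]


# Hypothetical mini-lexicon; a real system would query a morphological analyser.
KNOWN_FORMS = {"to", "jest", "dobry", "tekst", "o", "kotach"}
docs = [
    "To jest dobry tekst o kotach.",
    "T0 j3st d0bry t3kst.",  # simulated encoding damage
]
clean_docs = filter_clean(docs, KNOWN_FORMS)
print(clean_docs)
```

A ratio-based filter keeps informal but readable texts (a few unknown words) while rejecting pages where most tokens are damaged; the right threshold would have to be tuned on real data.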

Open Textometric and Stylometric System
- Designed for characteristic features of Polish, such as rich inflection and weakly constrained word order
- Based on several existing components, including Stylo (Eder & Rybicki)
- Enables the use of features defined on any level of the linguistic structure: from the level of word forms up to the level of semantic-pragmatic structures
- Available as a Web Application and a Web Service
- Stylometric techniques appear to be applicable to many tasks in H&SS
  - sociology (features that are characteristic of different subgroups)
  - political studies (similarities and differences between political parties)
  - literary studies
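As an illustration of a stylometric comparison, the sketch below computes Burrows's Delta, a classic distance over the most frequent words. The slides do not state which measures the system uses, so this choice is an assumption; the toy corpus and all function names are hypothetical, and the features are plain word forms (for Polish, lemmas or tags could be substituted).

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def most_frequent_words(corpus: Dict[str, List[str]], n: int) -> List[str]:
    """The n most frequent words over the whole corpus."""
    counts = Counter(tok for tokens in corpus.values() for tok in tokens)
    return [word for word, _ in counts.most_common(n)]


def relative_freqs(tokens: List[str], vocabulary: List[str]) -> List[float]:
    """Relative frequency of every vocabulary word in one (non-empty) text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(corpus: Dict[str, List[str]],
             vocabulary: List[str]) -> Dict[str, List[float]]:
    """Standardise each word's frequency across the texts of the corpus."""
    freqs = {name: relative_freqs(toks, vocabulary) for name, toks in corpus.items()}
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    # A constant feature has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {
        name: [(f - m) / s for f, m, s in zip(row, means, stds)]
        for name, row in freqs.items()
    }


def delta(z_a: List[float], z_b: List[float]) -> float:
    """Burrows's Delta: mean absolute difference of z-scores."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


# Toy corpus of Polish function words, invented for illustration.
corpus = {
    "A1": "i w i na i w".split(),
    "A2": "i w i na w i".split(),
    "B1": "na na w na i na".split(),
}
vocabulary = most_frequent_words(corpus, 3)
z = z_scores(corpus, vocabulary)
print(delta(z["A1"], z["A2"]), delta(z["A1"], z["B1"]))
```

Texts with the same frequency profile (`A1`, `A2`) get a distance of zero, while `B1` lies further away; with real data, lower Delta suggests greater stylistic similarity.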

Semantic Text Classification for Sociology
- Users: Collegium Civitas, Warsaw
- Goal
  - Support for large-scale analysis of the source materials
  - Automatically annotate documents and text fragments with pre-defined semantic categories
  - Definition of categories by examples
  - Automated semantic grouping of documents and text fragments
- Support for
  - Corpus building
  - Manual annotation of the learning sub-corpus
  - Automated annotation process
  - Statistical analysis of the results
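"Definition of categories by examples" can be sketched with a nearest-centroid classifier over bag-of-words vectors. The slides do not describe the actual method, so this is a stand-in technique; the example texts, the category labels and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train_centroids(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Sum the word counts of the manually annotated examples per category."""
    centroids: Dict[str, Counter] = defaultdict(Counter)
    for text, category in examples:
        centroids[category].update(bag_of_words(text))
    return dict(centroids)


def classify(text: str, centroids: Dict[str, Counter]) -> str:
    """Assign the category whose centroid is most similar to the text."""
    vector = bag_of_words(text)
    return max(centroids, key=lambda cat: cosine(vector, centroids[cat]))


# Invented learning sub-corpus: (text fragment, category) pairs.
EXAMPLES = [
    ("the strike at the factory lasted a week", "labour"),
    ("workers demanded higher wages", "labour"),
    ("the parish organised a pilgrimage", "religion"),
    ("the bishop addressed the congregation", "religion"),
]
centroids = train_centroids(EXAMPLES)
print(classify("factory workers announced a strike", centroids))
print(classify("the bishop led a pilgrimage", centroids))
```

The manually annotated examples play the role of the learning sub-corpus; the automated annotation step then applies `classify` to the remaining fragments, whose labels can be counted for statistical analysis.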

GeTClasS Generalised Text Classification for Sociology

Literary Map
- Users: Digital Humanities Centre of the Institute of Literary Research (Polish Academy of Sciences)
- Goal
  - Support for using maps in literary criticism
  - A tool for the identification of all geographical names in a literary text (or a corpus) and for mapping them onto a geographical map
- Tasks
  1. Identification and semantic classification of the referring language expressions
  2. Disambiguation of the referents
  3. Mapping the referents onto a map (geo-location)
  4. Recognition of the semantic relations and statistical analysis
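The first three tasks, plus a simple count for the statistical part, can be sketched with a toy gazetteer. Everything here is an assumption made for illustration: the gazetteer entries with approximate coordinates, the exact-match lookup (a real tool for Polish would need lemmatisation and named entity recognition to handle inflected names), and the disambiguation heuristic that prefers the country seen most often among unambiguous mentions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float


# Hypothetical gazetteer with approximate coordinates.
GAZETTEER: Dict[str, List[Place]] = {
    "Wrocław": [Place("Wrocław", "PL", 51.11, 17.03)],
    "Lyon": [Place("Lyon", "FR", 45.76, 4.84)],
    "Paris": [Place("Paris", "FR", 48.86, 2.35),
              Place("Paris", "US", 33.66, -95.56)],
}


def find_mentions(text: str) -> List[str]:
    """Task 1: identify tokens that match a gazetteer name exactly."""
    tokens = [token.strip(".,;:!?") for token in text.split()]
    return [token for token in tokens if token in GAZETTEER]


def disambiguate(mentions: List[str]) -> List[Place]:
    """Task 2: pick, for each mention, the candidate whose country is most
    frequent among the unambiguous mentions (ties go to the first candidate)."""
    context = Counter(
        GAZETTEER[m][0].country for m in mentions if len(GAZETTEER[m]) == 1
    )
    return [max(GAZETTEER[m], key=lambda p: context[p.country]) for m in mentions]


text = "She left Lyon for Paris, then wrote from Wrocław."
places = disambiguate(find_mentions(text))
# Task 3: each resolved referent carries coordinates for plotting on a map.
points = [(p.name, p.lat, p.lon) for p in places]
# Task 4 (statistics only): how often each place is mentioned.
mention_counts = Counter(p.name for p in places)
print(points, mention_counts)
```

In this example the ambiguous "Paris" resolves to the French city because another French place is mentioned in the same text; the resulting coordinate list is what a map layer would consume.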

Literary Map

Conclusions
- Applying LT to research in Humanities & Social Sciences seems to be much more challenging than applying it in commercial systems!
- LT for Polish has reached a stage at which valuable support can be provided for research applications
- The bi-directional approach combines the development of a basic, universal set of language tools and resources with inspirations from the research applications

Thank you very much for your attention!
www.clarin-pl.eu
Supported by the Polish Ministry of Science and Higher Education

Bi-directional: Bottom-up Part
PALC 2014, Łódź, 2014-11-22
- LRTs and LRT chains can be useful if the required tools and resources exist and they are robust!
- What is the minimal set of LRTs?
- What kind of LRTs can be called robust?
  - automated applications in H&SS seem to require high quality of language tools and, mostly, large coverage of resources
- BLARK: the Basic Language Resource Kit
  - "the minimal set of language resources that is necessary to do any precompetitive research and education at all" (Krauwer, 2003)
  - and also basic processing chains
  - a possible reference point for comparing LRTs for different languages