Stefano Rovetta. University of Genova. ICT for Eu-India cross-cultural dissemination Co-financed by the European Commission

ICT for Eu-India cross-cultural dissemination. Co-financed by the European Commission. Stefano Rovetta, University of Genova, Department of Computer and Information Sciences.

Workgroup 8. Semantic Information Retrieval: A Natural Language Processing Task. Multi-Language Communication: Two Sides of a Golden Coin.

Outline. Multi-Language Communication as an ICT task. Multi-Language Communication as a challenge. Multi-Language Communication as an opportunity. Preview: Genoa contribution to Workgroup 8.

Multi-Language communication

Communication. Communicating and community making by necessity go through computers. Language is still an issue. Access to digital documents: search; organize and group; present; answer questions directly; suggest interesting items...

June 2005 WG4 Workshop. The 2005 Cross-Language Information Processing Workshop was held in Genoa (http://www.disi.unige.it/clip2005), with participants from WG4 countries (Italy and Spain) and from Russia. Topics discussed: cross-language question answering; document organization and clustering; structural analysis of documents; content personalization. There was also a panel discussion about more general pattern recognition topics.

Workshop conclusions. Electronic documents form the basis of many everyday tasks, both for personal productivity and for group work. Automatic document organization is of vital importance in this regard. Despite its advancement, further work is needed. Structural and simple content-based analysis are the basic tools. Significant improvements also need an approach based on semantic analysis.

More workshop conclusions. Cross-language document processing is possible in two ways: either by using knowledge encoded into language-dependent resources, such as ontologies and automatic translators (intensive methods), or by using trainable systems that learn from examples of different languages (extensive methods).

Side I: The challenge

Organizing and searching documents. A traditional area for computers. In the past 10 years it has developed exponentially: the Web; desktop document production and processing; powerful aids for digitization (scanners, OCR).

The status of multi-language methods research. Typical cross-language task: retrieve documents from a collection in more than one target language. Usually the target languages are known in advance. This helps in the preliminary processing steps: eliminating uninformative terms; extracting the stem; part-of-speech tagging...
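The first two preliminary steps named above can be sketched as follows. This is only an illustration, assuming English as the known target language: the stop-word list and the suffix rules are tiny hypothetical stand-ins, not a real stemmer, and part-of-speech tagging is left out because it needs a trained tagger.

```python
import re

# Hypothetical, deliberately tiny resources for one known language (English).
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "are", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"\w+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

Both resources have to be rebuilt for every language, which is why knowing the target languages in advance matters for this kind of pipeline.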

CLEF. The Cross-Language Evaluation Forum (http://www.clef-campaign.org/) is the most representative international initiative in this field. It periodically poses challenges and gathers results in annual workshops. Typical methods presented are based on translation software or on ontologies (which are ready-made knowledge repositories).

Some remarks. Multi-language communities in Europe and India have to face much more complex situations. Although there are widespread languages both across India and across Europe, the effective number of languages in use is at least of the order of 100. There is also the issue of different scripts.

Solutions to the multi-script problem. European languages are widely studied, and standard encodings for all significant scripts are available. Indian languages are receiving attention (e.g. the ISCII code). The multi-script problem may be tackled with tools that are becoming standard, such as Unicode.
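As a small illustration of how Unicode helps with mixed scripts, the sketch below guesses the dominant script of a text using Python's standard `unicodedata` module. Taking the first word of a character's Unicode name as its script is an approximation assumed here for brevity, not the official Unicode Script property, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def dominant_script(text):
    """Return the most frequent script among letters and combining marks."""
    counts = Counter(
        script_of(ch)
        for ch in text
        if unicodedata.category(ch)[0] in ("L", "M")
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

Because every script lives in the same character set, a single routine like this can handle Latin, Greek, Devanagari or Bengali input without a per-language encoding table.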

Language independence. For a universal multi-language approach, language-specific facts should be learned from examples. Methods should be based as much as possible on statistical approaches rather than a-priori knowledge. Methods based on plug-in knowledge repositories are also useful, but limited to those languages for which translators or ontologies exist.

The contribution from Genoa. WG4: a task that has been studied is organizing documents in coherent clusters, both for efficient indexing and for meaningful presentation. WG8: a technical problem to be solved is finding the best keywords for document indexing.

Side II: The opportunities

The language-independent approach. In many instances the proposed approach has already been implemented or prepared. A prominent example: Google (http://www.google.com) is not based on language-dependent preprocessing (stemming).

Benefits of this activity. The results of these studies are likely to have an impact on important areas of interest: the EU priorities to bring ICT to the citizen ("e-inclusion"); the Indian Minister of Communications and Information Technology agenda, point 9 ("Language Computing"). However, the very fact of working on these topics has already had an impact on the creation of multi-language communities.

Widening the network. As a result of the Project's activities, more initiatives and new partnerships have been launched by WG4/WG8 participants: research cooperation with the Indian Statistical Institute, Kolkata; partnership and cooperation with other European research centres on document and language technology (from Greece and Switzerland); hosting more young Indian researchers with support from the Italian Ministry of University.

A golden coin. We believe that the expected benefits are of great importance in building and supporting multi-language communities. The benefits already achieved are a confirmation.

Preview: WG8 contribution

Workgroup 8. WG8 is dedicated to the following topic: Semantic Information Retrieval: A Natural Language Processing Task. Start: September 2005. End: April 2006. The Genoa contribution is focused on automatic keyword extraction.

The Vector Space model. It is the main approach of the field. It represents a document as a list of keywords. Keywords are extensive, i.e. take all terms as keywords and exclude only some. How do we know which keywords are important? Knowledge of the topic and the language is necessary.
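A minimal sketch of the vector space representation, assuming plain term counts as weights and cosine similarity for comparison (the slides do not say which weighting or similarity measure was used). Every term is taken as a keyword, as described above.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a document as term -> frequency, using every term as a keyword."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two documents that share no terms score 0, and identical documents score 1. With all terms kept, frequent but uninformative words dominate the vectors, which is what motivates the question of which keywords are important.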

Natural language processing. An alternative, powerful approach. The content of documents is analyzed at the grammatical and semantic levels. We need to store the knowledge about languages in resources such as: a corpus (or training collection); an ontology (or semantic network).

Language independence. The approach with methods learning from examples is a third way. It combines implicit semantic information with language independence.

Automatic keyword selection. All terms in a document are possible keywords, but not all would make good keywords. A method has been developed to identify the most relevant terms. The method is fully automatic and focused on the task of document clustering.
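The slides do not describe the selection method itself. As a rough stand-in, the sketch below ranks terms by TF-IDF, a common statistical weighting that needs no stop-word list or stemmer. It is an assumed illustration of automatic, language-independent keyword ranking, not the method developed in Genoa, and the function name is hypothetical.

```python
import math
import re
from collections import Counter


def top_keywords(docs, k=3):
    """Rank each document's terms by TF-IDF and keep the k best.

    TF-IDF is an assumed stand-in here, not the method from the talk.
    """
    bags = [Counter(re.findall(r"\w+", d.lower())) for d in docs]
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for bag in bags for term in bag)
    n = len(docs)
    result = []
    for bag in bags:
        scores = {t: tf * math.log(n / doc_freq[t]) for t, tf in bag.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result
```

A term that occurs in every document (for example the Italian article "il" in a small Italian collection) gets a score of zero and falls to the bottom of the ranking, without any language-specific list of uninformative words.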

Expected results. WG8 is focused on taking into account the meaning of documents (semantic analysis). The keyword selection method provides an automatic evaluation of which terms are interesting (useful). This is learned from examples and is therefore independent of the specific language. The method also works for multi-language documents.

Final remarks

The approach. Accessing collections of documents is one of the key points for cooperation in teams and communities. The main requirement in multilingual communications is language-independent methods. We try not to rely only on pre-existing resources, but also on methods based on learning from data.

Summary of Genoa contribution to WG 4 and WG 8. Workgroup 4 provided tools for automatic organization of collections of documents. Workgroup 8 is working on techniques to exploit the content of documents and their meaning. The Genova group is studying techniques to automatically find relevant keywords from documents in a language-independent setting. Community building is being widened outside the project consortium.

the end