A Survey of Research on Computing Language in Bahasa Indonesia Conducted at the University of Indonesia

1 A Survey of Research on Computing Language in Bahasa Indonesia Conducted at the University of Indonesia
Information Retrieval Lab
Conference on "Policy and Sustainability of Local Language Computing in Developing Asia", Lahore, 29 Jan to 3 Feb 2012
Faculty of Computer Science, University of Indonesia

3 Faculty of Computer Science
- Undergraduate (1986), Masters (1988) & Doctoral (1998) programmes
- >1400 active students, annual intake ±350
- ±60 teaching staff, mostly MSc & PhD holders
- Research labs:
  - Computational Intelligence
  - Digital Libraries & Distance Learning
  - Formal Methods in Software Engineering
  - Architecture, Networks & High Performance Computing
  - Image Processing & Pattern Recognition
  - Information Retrieval & Text Processing
  - IT Governance
  - E-Government

5 Computational lexicography
- Corpus-based word-frequency dictionary (1996)
- Electronic Kamus Besar Bahasa Indonesia, joint work with Pusat Bahasa:
  - Data structures & compression
  - Efficient matching
- Spell-checking (Lotus SmartSuite):
  - Data structures
  - Error-tolerant matching
  - Suggestion
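As an illustration of error-tolerant matching and suggestion, here is a minimal sketch based on Levenshtein edit distance. The slides do not say which matching technique the spell-checker used, so the technique, the function names, and the toy lexicon are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=1):
    """Return lexicon entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for dist, entry in scored if dist <= max_distance]


# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = ["makan", "makam", "malam", "nasi", "bahasa"]

if __name__ == "__main__":
    print(suggest("mkan", LEXICON))
```

Scanning the whole lexicon is only practical for small word lists; a production spell-checker would typically pair the distance measure with an index such as a trie.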

6 Morphological analysis
- Stemming algorithms:
  - Rule-based
  - Corpus-based
- Morphological parser:
  - Two-level morphology
Syntactic parsing
- Formal modeling:
  - Regular languages
  - Context-free grammars
  - Feature structure unification grammars
- Statistical parsing:
  - Probabilistic CFGs
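To make the idea of rule-based stemming concrete, the following is a toy sketch that strips a few common Indonesian prefixes and suffixes and accepts a candidate only if it appears in a root-word list. It is not the algorithm developed at the lab; the affix lists are deliberately incomplete, and the root list and function name are hypothetical.

```python
# Hypothetical, deliberately tiny resources for illustration only.
ROOTS = {"makan", "baca", "ajar", "main"}
PREFIXES = ("di", "ber", "ter", "mem", "me", "pe")
SUFFIXES = ("kan", "nya", "lah", "an", "i")


def stem(word, roots=ROOTS):
    """Strip at most one suffix and one prefix; accept only known roots."""
    word = word.lower()
    if word in roots:
        return word
    candidates = [word]
    # Rule 1: try removing each suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            candidates.append(word[:-len(suffix)])
    # Rule 2: try removing each prefix from every candidate so far.
    for candidate in list(candidates):
        for prefix in PREFIXES:
            if candidate.startswith(prefix) and len(candidate) > len(prefix) + 2:
                candidates.append(candidate[len(prefix):])
    for candidate in candidates:
        if candidate in roots:
            return candidate
    return word  # no rule produced a known root; leave the word unchanged


if __name__ == "__main__":
    for w in ["makanan", "dimakan", "membaca", "bermain"]:
        print(w, "->", stem(w))
```

Checking candidates against a root list prevents over-stemming, for example reducing "makan" to "mak" by wrongly removing "-an".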

7 Semantic and discourse analysis
- Lexical semantics
  - Vector-space lexical similarity
  - Indonesian WordNet
- Text semantic analysis
  - Syntax-based, lambda calculus
  - Example: "Anto makan nasi" (Anto eats rice) / "Nasi disantap Anto" (the rice is eaten by Anto)
    eat(e) ^ agent(e,a) ^ patient(e,n) ^ person(a,anto) ^ object(n,nasi)
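Vector-space lexical similarity can be sketched as follows: each word is represented by counts of the words that occur near it, and two words are compared by the cosine of their count vectors. The window size, function names, and example sentences are assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo)
                              if j != i)
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical mini-corpus for illustration only.
CORPUS = ["anto makan nasi", "anto menyantap nasi", "anto membaca buku"]

if __name__ == "__main__":
    makan = context_vector("makan", CORPUS)
    menyantap = context_vector("menyantap", CORPUS)
    membaca = context_vector("membaca", CORPUS)
    print(cosine(makan, menyantap), cosine(makan, membaca))
```

In this toy corpus "makan" and "menyantap" share identical contexts and so score higher than "makan" and "membaca", which share only one context word.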

9 IR: Cross-language Retrieval
- Retrieving documents in one language using a query in another language
  - Retrieving Indonesian documents using English queries
  - Retrieving English documents using Indonesian queries
- Mode of translation
  - Query translation (2006)
  - Document translation (2007)
- CLIR translation approaches
  - Bilingual dictionaries
    - Direct translation (2006): English-to-Indonesian & Indonesian-to-English
    - Transitive translation (2007): Indonesian-French-German-English
  - Machine translation (2006): Transtool, Toggletext
  - Parallel corpus (2007)
    - Collecting parallel articles from the Internet
    - Translating English documents into Indonesian
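Dictionary-based query translation, both direct and transitive, can be sketched roughly as below. Direct translation looks each query term up in one bilingual dictionary; transitive translation composes two dictionaries through a pivot language. The dictionaries here are hypothetical toy entries and only a single pivot is shown, whereas the slides mention a longer Indonesian-French-German-English chain.

```python
def translate_query(query, dictionary):
    """Replace each term by all its translations; keep unknown terms as-is."""
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def compose(first, second):
    """Build a source->target dictionary from source->pivot and pivot->target."""
    composed = {}
    for source, pivots in first.items():
        targets = []
        for pivot in pivots:
            for target in second.get(pivot, []):
                if target not in targets:
                    targets.append(target)
        if targets:
            composed[source] = targets
    return composed


# Hypothetical toy dictionaries for illustration only.
ID_EN = {"harga": ["price"], "beras": ["rice"]}
ID_FR = {"harga": ["prix"], "beras": ["riz"]}
FR_EN = {"prix": ["price"], "riz": ["rice"]}

if __name__ == "__main__":
    print(translate_query("harga beras", ID_EN))                   # direct
    print(translate_query("harga beras", compose(ID_FR, FR_EN)))   # via French
```

Calling `compose` repeatedly would extend the chain through further pivot languages. Each extra hop tends to add ambiguity, because every pivot word may have several translations.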

10 IR: Document summarization
- Single-document summarization
  - Extract sentences containing cue phrases & important keywords (2006)
- Query-biased summary
  - Extract sentences related to query words (2007)
- Multi-document summarization
  - Extract sentences containing important keywords that occur in the centroid of a cluster (2008)
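The keyword-based and query-biased extraction ideas above can be sketched as a simple sentence scorer. This is an illustrative approximation only: it weights words by raw frequency, or by the query terms when a query is given, and it does not model cue phrases or cluster centroids. The stopword list and sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, very short stopword list for illustration.
STOPWORDS = {"yang", "di", "dan", "ke", "dari", "itu", "ini"}


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def summarize(text, n=1, query=None):
    """Return the n best-scoring sentences in their original order."""
    sentences = split_sentences(text)
    # Keyword weights come from the query if given, else from the document.
    weights = Counter(content_words(query if query else text))
    scored = [(sum(weights[t] for t in set(content_words(s))), i)
              for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]


TEXT = ("Harga beras naik di Jakarta. Pemerintah menambah stok beras. "
        "Cuaca hari ini cerah.")

if __name__ == "__main__":
    print(summarize(TEXT, n=1))
    print(summarize(TEXT, n=1, query="cuaca"))
```

Ties are broken in favour of the earlier sentence, which is a common heuristic for news text but is an assumption here.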

11 IR: Question answering
- Finding answers to Indonesian questions in Indonesian documents
  - Using a statistical technique that considers the position of candidate answers in the passages (2006)
- Finding answers to Indonesian questions in English documents (CLEF-QA)
  - Translating Indonesian queries into English
  - Using linguistic knowledge and external resources found on the Internet to find the answer (2008)

12 IR: Geographic Information Retrieval
- Finding events occurring in certain locations
  - We developed a location parser to identify any location name that appears in Indonesian queries and documents (2007)
  - We use geographic relation words to identify events that happen in certain locations (2007)
  - We use a location-based query expansion technique to improve the retrieval performance of CL-GIR (2007)
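One plausible reading of location-based query expansion is sketched below: region names are spotted in the query using a gazetteer, and the query is extended with places inside each region. The slides do not describe the actual technique, so the gazetteer, its structure, and the function names are all assumptions.

```python
import re

# Hypothetical toy gazetteer: region name -> some places inside that region.
GAZETTEER = {
    "jawa barat": ["bandung", "bogor", "depok"],
    "bali": ["denpasar", "ubud"],
}


def find_regions(text, gazetteer=GAZETTEER):
    """Return gazetteer regions that occur as whole words in the text."""
    lowered = text.lower()
    return sorted(region for region in gazetteer
                  if re.search(r"\b" + re.escape(region) + r"\b", lowered))


def expand_query(query, gazetteer=GAZETTEER):
    """Append the places contained in every region mentioned in the query."""
    terms = query.lower().split()
    for region in find_regions(query, gazetteer):
        terms.extend(place for place in gazetteer[region] if place not in terms)
    return terms


if __name__ == "__main__":
    print(expand_query("banjir di Jawa Barat"))
```

With this expansion, a document that mentions only "Bandung" can still match a query about "Jawa Barat".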

13 IR: Information extraction
- Extracting important information from Indonesian documents
  - Developing a named entity tagger to identify person, location, and organization names using a rule-based approach (2004) and a machine learning approach with association rules (2007)
  - Identifying whether named entities refer to the same object (co-reference resolution)
  - Identifying the relationships that exist between those named entities
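A rule-based named entity tagger of the kind mentioned above might look roughly like this sketch, which relies on contextual cue words such as titles, organization keywords, and locative prepositions. The cue words, the rule order, and the example sentence are hypothetical and far smaller than any realistic rule set.

```python
import re

NAME = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"  # one or more capitalized words

# Rules are tried in order; earlier rules win when spans overlap.
RULES = [
    ("PERSON", re.compile(r"\b(?:Bapak|Ibu|Presiden)\s+(" + NAME + r")")),
    ("ORGANIZATION", re.compile(r"\b((?:PT|Universitas)\s+" + NAME + r")")),
    ("LOCATION", re.compile(r"\b(?:di|ke|dari)\s+(" + NAME + r")")),
]


def tag_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    accepted = []  # (start, end, surface, label)
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            start, end = match.span(1)
            # Skip a span that overlaps one accepted by a higher-priority rule.
            if all(end <= s or start >= e for s, e, _, _ in accepted):
                accepted.append((start, end, match.group(1), label))
    return [(surface, label) for _, _, surface, label in sorted(accepted)]


SENTENCE = "Bapak Anto mengajar di Universitas Indonesia dan tinggal di Depok."

if __name__ == "__main__":
    print(tag_entities(SENTENCE))
```

Giving the organization rule priority over the location rule keeps "Universitas Indonesia" from also being tagged as a location just because it follows "di".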

14 Speech recognition
- Developing ASR for Bahasa Indonesia
  - Using open-source ASR systems: Sphinx-4, Julius
  - Intended for telephone applications
- Building a speech corpus
  - 5000 speakers
  - Each speaker spends 15 minutes recording a list of sentences
  - Indonesia has many local languages and dialects (> 600)
  - Need to identify the various pronunciations of words

15 Indonesian WordNet
- Indonesian WordNet built using the expand approach, based on Princeton WordNet (PWN) & KBBI
- Automatic PWN-KBBI mapping using LSA

16 Legal Information System
- Indonesian law documents are written in natural language
- Standardizing Indonesian law documents using an XML format
- Indonesian legal document search engine
- Recapitulation system for Indonesian law documents

17 Publications
- Malindo Workshop (http://malindo.org): the 6th workshop will be held in Malaysia
- ICACSIS Conferences: http://icacsis.cs.ui.ac.id
- Workshop on Technologies and Corpora for Asia-Pacific Speech Translation
- etc.

18 Resources
- The resources are available at: http://fws.cs.ui.ac.id
- They include lexical resources (Electronic KBBI, Indonesian WordNet), NLP tools (stemmer, parser, POS, etc.), and IR applications (machine translator, speech recognition, etc.)
- UI also collaborates with Kyoto University in the Language Grid (LangGrid) Project, in which UI will become the Jakarta Operation Center
- UI will provide language resources and tools for Bahasa Indonesia that can be accessed through web services

19 Further Information
- Further information can be found at:
  - http://ir.cs.ui.ac.id
  - http://bahasa.cs.ui.ac.id
Thank You