Natural Language Processing Techniques for Managing Legal Resources


Natural Language Processing Techniques for Managing Legal Resources Managing Legal Resources on the Semantic Web European University Institute Fiesole, Italy September 11, 2009 Adam Wyner University College London adam@wyner.info

Main Point Legal text expressed in natural language can be automatically annotated with semantic mark ups using natural language processing systems such as the General Architecture for Text Engineering (GATE).

Overview Motivation and objectives of NLP in this context. General Architecture for Text Engineering (GATE). Processing and marking up text. Another technology for parsing and semantic interpretation (C&C/Boxer). Other approaches.

Motivations Annotate large legacy corpora. Address growth of corpora. Reduce number of human annotators and tedious work. Make annotation systematic and automatic. Annotate fine-grained information: Names, locations, addresses, web links, organisations, actions, argument structures, relations between entities... Map from well-drafted documents in NL to RDF/OWL.

Motivations Top-down vs. Bottom-up approaches: Both do initial (and iterative) analysis of the texts in the target corpora. Top-down: defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application. Bottom-up: reconstructs and implements linguistic knowledge. The annotation system is defined in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. The annotation system is then applied to text automatically, which outputs annotated text. Knowledge intensive in development. However, there are limits... Convergent/complementary/integrated approaches.

Objectives of NLP NLP: automated processing of natural language. Generation: convert information in a database into natural language. Understanding: convert natural language into a machine-readable form. Range of subtasks (focusing on text): Segment text (words, phrases, sentences, paragraphs, sections,...). Morphological analysis (plural/singular, tense,...). Tag each word for part of speech in context (noun, verb, adjective, number,...).

Objectives of NLP Range of subtasks: Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,...). Identify semantic roles (agent, patient,...). Entity recognition (organisations, people, places,...). Resolve pronominal anaphora and co-reference. Address ambiguity.

Objectives of NLP NLP is useful for: Marking up documents in large corpora. Automatic mark up to overcome the manual annotation bottleneck. Semantic representation for modelling and inference. Semantic representation as an interlanguage for translation. Understanding and working with human language capabilities.

Objectives of NLP Develop annotations, ontologies, and gold-standard corpora. Semantically annotated texts support activities such as: Maintenance, presentation, and navigation. Information extraction (find patterns -- words or statements -- among documents). Translation. Query (find all individuals who did a particular action). Inference.

Reminder Presentations on acquisition of ontologies using NLP. Ontology design patterns with natural language tie-ins. WordNet and FrameNet. The analysis cycle: Text -> Linguistic Analysis -> Knowledge Extraction -> Structural Content. Cycle between Linguistic Analysis and Knowledge Extraction to improve the final Structural Content. Computational linguistic analysis layer cake.

Current State at OPSI, UK Office of Public Sector Information, United Kingdom. Wants to develop and leverage public information. http://www.opsi.gov.uk/ The Stationery Office, which have used GATE to develop automated mark up for OPSI, have not (yet) made marked up documents or processes available. Public vs. Private development. NLP for legislation is not an academic exercise. Applications?

The Crown XML Schema for Legislation

Terrorism Act 2000 (1.0)

Terrorism Act 2000 (1.1)

Terrorism Act 2000 (1.2)

Terrorism Act 2000 (2.0)

Terrorism Act 2000 (2.1)

Not glamorous, but useful. RuleBurst. Content in Notices

GATE General Architecture for Text Engineering (GATE): an open source framework which supports plug-in NLP components to process a corpus of text. Is open open? Where to get it? http://gate.ac.uk/ Components and sequences of processes, each process feeding the next in a pipeline. Annotated text output. Example of a case with screen shots.

GATE References: Building Search Applications: Lucene, LingPipe, and Gate by Manu Konchady, 2008. Introduction to Linguistic Annotation and Analytics Technologies by Graham Wilcock, 2009

GATE Language Resources: lexicons, corpora, ontologies. Processing Resources: parsers, generators, taggers. Visual Resources: visualisation and editing. The resources are plug-ins, so can be added or taken away. Document = text + annotations + features: <Person gender="male">John Smith</Person> <Verb tense="past">ran</Verb>

GATE Computational linguistic analysis layer cake: Sentence segmentation. Tokenisation (words identified by spaces between them). Morphological analysis (singular/plural, tense, nominalisation,..., range of parts of speech such as noun, verb, adjective,...). Part of speech tagging (noun or verb given other words nearby). Shallow syntactic parsing/chunking (noun phrase, verb phrase, subordinate clause,...). Dependency analysis (subordinate clauses, pronominal anaphora,...). Pattern matching and rule application.
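The lower layers of this layer cake can be sketched as a chain of functions, each feeding the next. This is a minimal Python illustration, not GATE code: the function names, the regular expressions, and the toy LEXICON are all assumptions made for the example, and a real tagger would use context rather than a lookup table.

```python
import re

# Toy lexicon for part-of-speech lookup (hypothetical; real taggers use context).
LEXICON = {"the": "DET", "dog": "NOUN", "cyndi": "PROPN", "likes": "VERB", "runs": "VERB"}

def segment(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Naive tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(tokens):
    """Look each token up in the lexicon; punctuation gets PUNCT, unknown words UNK."""
    return [
        (t, LEXICON.get(t.lower(), "PUNCT" if not t.isalnum() else "UNK"))
        for t in tokens
    ]

def run_pipeline(text):
    """Apply the stages in order, each stage consuming the previous stage's output."""
    return [tag(tokenise(sentence)) for sentence in segment(text)]

tagged = run_pipeline("Cyndi likes the dog. The dog runs.")
```

Each stage only sees the output of the one before it, which is the property that lets components be swapped in and out of a pipeline.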

GATE Lists: List of verbs: like, run, jump,... List of common nouns: dog, cat, hamburger,... List of proper names: Cyndi, Bill, Lisa,... List of determiners: the, a, two,... Rules: (Determiner + Common Noun) | Proper Name => Noun Phrase. Verb + Noun Phrase => Verb Phrase. Noun Phrase + Verb Phrase => Sentence. Output: [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].
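The lists and rules on this slide are small enough to run directly. The sketch below is an illustration in Python rather than anything GATE provides: the function names are hypothetical, and stripping a final "-s" so that "likes" matches the listed verb "like" is a crude assumption standing in for morphological analysis.

```python
VERBS = {"like", "run", "jump"}
COMMON_NOUNS = {"dog", "cat", "hamburger"}
PROPER_NAMES = {"Cyndi", "Bill", "Lisa"}
DETERMINERS = {"the", "a", "two"}

def is_verb(word):
    # Assumption: a trailing third-person "-s" is stripped before the list lookup.
    return word in VERBS or (word.endswith("s") and word[:-1] in VERBS)

def parse_np(words, i):
    """Rule: (Determiner + Common Noun) | Proper Name => Noun Phrase."""
    if i < len(words) and words[i] in PROPER_NAMES:
        return f"[np {words[i]}]", i + 1
    if i + 1 < len(words) and words[i] in DETERMINERS and words[i + 1] in COMMON_NOUNS:
        return f"[np [det {words[i]}] [cn {words[i + 1]}]]", i + 2
    return None

def parse_vp(words, i):
    """Rule: Verb + Noun Phrase => Verb Phrase."""
    if i < len(words) and is_verb(words[i]):
        np = parse_np(words, i + 1)
        if np:
            return f"[vp [v {words[i]}] {np[0]}]", np[1]
    return None

def parse_sentence(text):
    """Rule: Noun Phrase + Verb Phrase => Sentence. Returns None if no parse."""
    words = text.split()
    np = parse_np(words, 0)
    if np:
        vp = parse_vp(words, np[1])
        if vp and vp[1] == len(words):
            return f"[s {np[0]} {vp[0]}]"
    return None

print(parse_sentence("Cyndi likes the dog"))
# prints: [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]]
```

Anything outside the lists or the three rules (for example an intransitive sentence such as "Bill runs") returns None, which shows how quickly list-and-rule coverage runs out.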

GATE Offset Annotations are: tokens (located by offsets in the text, from the start of the token to its end) along with a type and features, each of which has a name and a value.
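An offset annotation of this kind can be modelled as a small record kept separately from the text it describes. The Annotation class below is a hypothetical Python sketch, not GATE's own data model; the field names are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int                  # offset of the first character covered
    end: int                    # offset just past the last character covered
    type: str                   # annotation type, e.g. "Person"
    features: dict = field(default_factory=dict)  # feature name -> value

    def text(self, document):
        """Recover the annotated string from the unmodified document text."""
        return document[self.start:self.end]

document = "John Smith ran"
annotations = [
    Annotation(0, 10, "Person", {"gender": "male"}),
    Annotation(11, 14, "Verb", {"tense": "past"}),
]

for ann in annotations:
    print(ann.type, ann.features, repr(ann.text(document)))
```

Because the annotations only point into the text, the text itself is never altered, and several annotations may cover the same characters.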

GATE Annotations Partial. Missing namespace and type needed for full definition.

GATE Construction: From smaller units, compose larger, derivative units. Gazetteers: Lists of words (or abbreviations) that fit an annotation: first names, street locations, organizations... JAPE (Java Annotation Patterns Engine): Build other annotations out of previously given/defined annotations. Use this where the mark up is not given by a gazetteer. Rules have a syntax.
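A gazetteer lookup can be sketched as a search for every listed entry in the text, emitting an annotation wherever one is found. This is an illustrative Python sketch only: the GAZETTEER contents and the lookup function are assumptions for the example and do not reflect GATE's gazetteer file format or API.

```python
import re

# Hypothetical gazetteer: annotation type -> list of matching strings.
GAZETTEER = {
    "FirstName": {"Adam", "Cyndi", "Bill"},
    "Organisation": {"University College London", "European University Institute"},
}

def lookup(text, gazetteer):
    """Return sorted (start, end, type) triples for each gazetteer entry found in text."""
    found = []
    for ann_type, entries in gazetteer.items():
        for entry in entries:
            # Word boundaries stop "Bill" from matching inside "Billing".
            for match in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
                found.append((match.start(), match.end(), ann_type))
    return sorted(found)

text = "Adam works at University College London."
for start, end, ann_type in lookup(text, GAZETTEER):
    print(ann_type, repr(text[start:end]))
```

Matching is exact and case-sensitive here, so every spelling variant has to appear in the list.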

GATE Gazetteers

GATE Organisation Gazetteer

GATE JAPE JAPE idea (here with mark up, but could be some feature). <FirstName>aaaa</FirstName><LastName>bbbb</LastName> => <WholeName><FirstName>aaaa</FirstName> <LastName>bbbb</LastName></WholeName> FirstName and LastName we get from the Gazetteer. WholeName we construct using the rule. For complex constructions, must have a range of alternatives.
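The idea behind the WholeName rule can be imitated outside JAPE. The Python sketch below is not JAPE syntax and not part of GATE: annotations are assumed to be plain (start, end, type) triples, and apply_whole_name_rule is a hypothetical helper that builds a new annotation from two existing ones.

```python
def apply_whole_name_rule(text, annotations):
    """Add a WholeName annotation over each FirstName directly followed by a LastName."""
    firsts = [a for a in annotations if a[2] == "FirstName"]
    lasts = [a for a in annotations if a[2] == "LastName"]
    result = list(annotations)
    for f_start, f_end, _ in firsts:
        for l_start, l_end, _ in lasts:
            # "Directly followed" means only whitespace (or nothing) lies between them.
            if f_end <= l_start and text[f_end:l_start].strip() == "":
                result.append((f_start, l_end, "WholeName"))
    return sorted(result)

text = "John Smith ran"
annotations = [(0, 4, "FirstName"), (5, 10, "LastName")]
combined = apply_whole_name_rule(text, annotations)
```

The original FirstName and LastName annotations are kept, and the new WholeName spans both. Handling titles, middle names, or initials would need further alternatives in the pattern.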

GATE Example Organisations and Quotations. Case references.

GATE XML

Other GATE Components Develop an ontology, import it into GATE, then mark up elements manually. Use the ontology in writing the JAPE rules. Plug in other parsers, create gazetteers, work with other languages... Machine learning component. Have not discussed mark up for metadata, structure, or presentation (see de Maat, Winkels, and van Engers). Work to develop gazetteers and JAPE rules.

GATE Problems and Issues Any difference in the characters of the basic text or in annotations is an absolute difference: theatre and theater are different strings for entities. Variants in Gazetteers. Organisation and Organization are different annotations. Output in XML is possible, but GATE mark up allows overlapping tags, which are barred in standard XML. Must rework GATE XML with XSLT to make it standard XML. Accuracy is not 100% for a variety of reasons, but it can be 80-95%.
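The overlapping-tag problem can be made concrete by checking which annotation pairs cross, that is, overlap without one being nested inside the other. The sketch below is a hypothetical Python check over (start, end, type) triples, not a GATE facility; such pairs are the ones that cannot be written as properly nested XML elements.

```python
def crossing_pairs(annotations):
    """Return pairs of spans that partially overlap without one nesting in the other."""
    pairs = []
    spans = sorted(annotations)
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            # After sorting, a starts no later than b. They cross when b starts
            # strictly inside a and ends strictly after a ends.
            if a[0] < b[0] < a[1] < b[1]:
                pairs.append((a, b))
    return pairs

annotations = [
    (0, 10, "Quotation"),
    (5, 15, "Organisation"),   # starts inside the quotation, ends outside it
    (0, 4, "FirstName"),       # nested inside the quotation, so not a problem
]
problems = crossing_pairs(annotations)
```

Nested or disjoint annotations pass the check; only crossing pairs need to be reworked before serialising to standard XML.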

C&C/Boxer Motivations and Objectives Fine-grained syntactic parsing can identify not only parts of speech, but grammatical roles (subject, object) and phrases (e.g. verb plus direct object is verb phrase). Contributes to NL to RDF/OWL translation: individual entities, data and object properties? Input to semantic interpretation in FOL: test for consistency, support inference, allow rule extraction.

C&C/Boxer C&C is a parser for Combinatory Categorial Grammar (CCG). Boxer provides a semantic interpretation, given the parse. The semantic interpretation is a form of first-order logic (Discourse Representation Theory). Needs some manipulation. Parser outputs the best parse, but that might not be what one wants; the semantic representation might need to be selected. Try it out at: http://svn.ask.it.usyd.edu.au/trac/candc Various representations: C&C, Graphic, XML Parse, Prolog.

C&C/Boxer ∀x [man(x) → happy(x)]
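A formula of this shape can be tested against a small model, which is the simplest form of the consistency checking and inference that a first-order representation supports. The Python sketch below is an illustration only and is unrelated to Boxer's actual output formats; the domain, the sets men and happy, and the helper holds_for_all are all assumptions.

```python
def holds_for_all(domain, antecedent, consequent):
    """True if every x in domain that satisfies antecedent also satisfies consequent."""
    return all(consequent(x) for x in domain if antecedent(x))

# Hypothetical finite model: a domain of individuals and two unary predicates.
domain = {"bill", "john", "lisa", "cyndi"}
men = {"bill", "john"}
happy = {"bill", "john", "lisa"}

# "For all x, if x is a man then x is happy" evaluated in the model above.
all_men_happy = holds_for_all(domain, lambda x: x in men, lambda x: x in happy)

# The same formula fails in a model where one man is not happy.
unhappy_case = holds_for_all(domain, lambda x: x in men, lambda x: x in {"bill"})
```

The implication is vacuously true for individuals who are not men, so only members of men are tested against happy.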

If Bill is rich and healthy, then he is happy

A More Complex Example A person commits an offence if he invites another to provide money or other property and intends that it should be used, or has reasonable cause to suspect that it may be used, for the purposes of terrorism. From UK Terrorism Act 2000, Interpretation, Terrorist Property (Partial parse image).

Other Topics Controlled Languages: An expressive subset of grammatical constructions and lexicon. Guided input, so only well-formed, unambiguous expressions are accepted. Translation to FOL. Machine Learning: Annotate a set of documents to make a gold standard. Train the system on the gold standard and unannotated documents. Test accuracy and adjust. No information on how the algorithm works.
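The "test accuracy" step is commonly done by comparing the system's annotations with the gold standard. The slide does not say which measure was used, so the sketch below swaps in precision, recall, and F1 over exact-match (start, end, type) triples as one plausible choice; the evaluate function and the sample data are hypothetical.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, f1) for exact-match annotation triples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold-standard and system annotations for one document.
gold = [(0, 4, "Person"), (10, 20, "Organisation"), (25, 30, "Person")]
predicted = [(0, 4, "Person"), (10, 20, "Organisation"), (40, 45, "Person"), (50, 55, "Location")]

precision, recall, f1 = evaluate(gold, predicted)
```

Here two of the four predicted annotations are correct and two of the three gold annotations are found. Exact matching is strict: an annotation whose boundaries are off by one character counts as wrong.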

Conclusion Different approaches to mark up. Burdens of initial analysis, coding, and labour. Top-down is far ahead of bottom-up, but this is a matter of focus of research effort. Converging, complementary, integrated approaches. Potential to enrich annotations further for finer-grained information.