Natural Language Processing Techniques for Managing Legal Resources

Managing Legal Resources on the Semantic Web. European University Institute, Fiesole, Italy. September 11, 2009. Adam Wyner, University College London, adam@wyner.info

Main Point Legal text expressed in natural language can be automatically annotated with semantic mark ups using natural language processing systems such as the General Architecture for Text Engineering (GATE).

Overview Motivation and objectives of NLP in this context. General Architecture for Text Engineering (GATE). Processing and marking up text. Another technology for parsing and semantic interpretation (C&C/Boxer). Other approaches.

Motivations Annotate large legacy corpora. Address growth of corpora. Reduce number of human annotators and tedious work. Make annotation systematic and automatic. Annotate fine-grained information: Names, locations, addresses, web links, organisations, actions, argument structures, relations between entities... Map from well-drafted documents in NL to RDF/OWL.

Motivations Top-down vs. Bottom-up approaches: Both do initial (and iterative) analysis of the texts in the target corpora. Top-down: defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application. Bottom-up: the annotation system is defined in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. Apply the annotation system to text, which outputs annotated text. Knowledge intensive in development. Convergent/complementary/integrated approaches. Bottom-up reconstructs and implements linguistic knowledge. However, there are limits...

Objectives of NLP NLP: automated processing of natural language. Generation: convert information in a database into natural language. Understanding: convert natural language into a machine-readable form. Range of subtasks (focusing on text): Segment text (words, phrases, sentences, paragraphs, sections,...). Morphological analysis (plural/singular, tense,...). Tag each word for part of speech in context (noun, verb, adjective, number,...).

Objectives of NLP Range of subtasks: Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,...). Identify semantic roles (agent, patient,...). Entity recognition (organisations, people, places,...). Resolve pronominal anaphora and co-reference. Address ambiguity.

Objectives of NLP NLP useful for: Mark up documents in large corpora. Automatic mark up to overcome bottleneck. Semantic representation for modelling and inference. Semantic representation as an interlanguage for translation. To understand and work with human language capabilities.

Objectives of NLP Develop annotations, ontologies, and gold-standard corpora. Semantically annotated texts support activities such as: Maintenance, presentation, and navigation. Information extraction (find patterns -- words or statements -- among documents). Translation. Query (find all individuals who did a particular action). Inference.

Reminder Presentations on acquisition of ontologies using NLP. Ontology design patterns with natural language tie-ins. WordNet and FrameNet. The analysis cycle: Text -> Linguistic Analysis -> Knowledge Extraction -> Structural Content. Cycle between Linguistic Analysis and Knowledge Extraction to improve the final Structural Content. Computational linguistic analysis layer cake.

Current State at OPSI, UK Office of Public Sector Information, United Kingdom. Want to develop and leverage public information. http://www.opsi.gov.uk/ The Stationery Office, which have used GATE to develop automated mark up for OPSI, have not (yet) made marked up documents or processes available. Public vs. Private development. NLP for legislation is not an academic exercise. Applications?

The Crown XML Schema for Legislation

Terrorism Act 2000 (1.0)

Terrorism Act 2000 (1.1)

Terrorism Act 2000 (1.2)

Terrorism Act 2000 (2.0)

Terrorism Act 2000 (2.1)

Not glamorous, but useful. RuleBurst. Content in Notices

Content in Notices

GATE General Architecture for Text Engineering (GATE): an open-source framework which supports plug-in NLP components to process a corpus of text. Is open open? Where to get it? http://gate.ac.uk/ Components and sequences of processes, each process feeding the next in a pipeline. Annotated text output. Example of a case with screen shots.

GATE References: Building Search Applications: Lucene, LingPipe, and Gate by Manu Konchady, 2008. Introduction to Linguistic Annotation and Analytics Technologies by Graham Wilcock, 2009

GATE Language Resources: lexicons, corpora, ontologies. Processing Resources: parsers, generators, taggers. Visual Resources: visualisation and editing. The resources are plug-ins, so can be added or taken away. Document = text + annotations + features. <Person, gender = male>John Smith</Person> <Verb, tense = past>ran</Verb>

GATE Computational linguistic analysis layer cake: Sentence segmentation. Tokenisation (words identified by spaces between them). Morphological analysis (singular/plural, tense, nominalisation,..., range of parts of speech such as noun, verb, adjective,...). Part of speech tagging (noun or verb given other words nearby). Shallow syntactic parsing/chunking (noun phrase, verb phrase, subordinate clause,...). Dependency analysis (subordinate clauses, pronominal anaphora,...). Pattern matching and rule application.
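The lower layers of the cake can be sketched as a chain of functions, each feeding the next. This is a minimal illustration in plain Python, not GATE's API: the function names and the toy lexicon are assumptions made for the example, and a real tagger would decide the part of speech from context rather than from a lookup table.

```python
import re

def split_sentences(text):
    """Sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Tokenisation: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical toy lexicon standing in for a part-of-speech tagger.
LEXICON = {
    "the": "DET", "court": "NOUN", "dismissed": "VERB", "appeal": "NOUN",
    "it": "PRON", "was": "VERB", "late": "ADJ", ".": "PUNCT",
}

def tag(tokens):
    """Part-of-speech tagging by lookup; unknown words are tagged UNK."""
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

def run_pipeline(text):
    """Each stage consumes the output of the previous one."""
    return [tag(tokenise(s)) for s in split_sentences(text)]

result = run_pipeline("The court dismissed the appeal. It was late.")
for sentence in result:
    print(sentence)
```

Because each stage only depends on the output of the one before it, a stage can be swapped for a better component without touching the rest, which is the point of the plug-in design.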

GATE Lists: List of verbs: like, run, jump,... List of common nouns: dog, cat, hamburger,... List of proper names: Cyndi, Bill, Lisa,... List of determiners: the, a, two,... Rules: (Determiner + Common Noun) | Proper Name => Noun Phrase. Verb + Noun Phrase => Verb Phrase. Noun Phrase + Verb Phrase => Sentence. Output: [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].
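The lists and rules on this slide are small enough to run directly. The sketch below is plain Python rather than anything GATE provides; the function names are hypothetical, and the handling of "likes" (stripping a final "-s") is a crude stand-in for the morphological analysis step.

```python
VERBS = {"like", "run", "jump"}
COMMON_NOUNS = {"dog", "cat", "hamburger"}
PROPER_NAMES = {"Cyndi", "Bill", "Lisa"}
DETERMINERS = {"the", "a", "two"}

def is_verb(word):
    # Accept the listed form or its third-person singular "-s" form.
    return word in VERBS or (word.endswith("s") and word[:-1] in VERBS)

def parse_np(tokens, i):
    """Noun Phrase: Proper Name, or Determiner + Common Noun.
    Returns (bracketing, next index) or None."""
    if i < len(tokens) and tokens[i] in PROPER_NAMES:
        return f"[np {tokens[i]}]", i + 1
    if (i + 1 < len(tokens) and tokens[i] in DETERMINERS
            and tokens[i + 1] in COMMON_NOUNS):
        return f"[np [det {tokens[i]}] [cn {tokens[i + 1]}]]", i + 2
    return None

def parse_vp(tokens, i):
    """Verb Phrase: Verb + Noun Phrase."""
    if i < len(tokens) and is_verb(tokens[i]):
        np = parse_np(tokens, i + 1)
        if np:
            return f"[vp [v {tokens[i]}] {np[0]}]", np[1]
    return None

def parse_sentence(text):
    """Sentence: Noun Phrase + Verb Phrase covering every token."""
    tokens = text.split()
    np = parse_np(tokens, 0)
    if np:
        vp = parse_vp(tokens, np[1])
        if vp and vp[1] == len(tokens):
            return f"[s {np[0]} {vp[0]}]"
    return None

print(parse_sentence("Cyndi likes the dog"))
# prints: [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]]
```

A string that the rules do not license, such as "dog likes", yields None rather than a partial bracketing.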

GATE Offset Annotations are: spans of text (identified by the offsets of their start and end) along with a type and features, where each feature has a name and a value.
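One way to picture offset annotations is as records kept beside the text rather than tags written into it. The classes below are a hypothetical Python model for illustration, not GATE's own data structures; they reuse the "John Smith ran" example from the earlier slide.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int                 # offset of the first character
    end: int                   # offset one past the last character
    ann_type: str              # e.g. "Person", "Verb"
    features: dict = field(default_factory=dict)  # name -> value

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, phrase, ann_type, **features):
        """Annotate the first occurrence of phrase in the text."""
        start = self.text.index(phrase)
        ann = Annotation(start, start + len(phrase), ann_type, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        """Recover the annotated string from its offsets."""
        return self.text[ann.start:ann.end]

doc = Document("John Smith ran")
person = doc.annotate("John Smith", "Person", gender="male")
verb = doc.annotate("ran", "Verb", tense="past")

for ann in doc.annotations:
    print(ann.start, ann.end, ann.ann_type, ann.features, doc.covered_text(ann))
```

Keeping the text untouched means any number of annotations can refer to the same characters, including annotations that overlap.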

GATE Annotations Partial. Missing namespace and type needed for full definition.

GATE Annotations

GATE Construction: From smaller units, compose larger, derivative units. Gazetteers: Lists of words (or abbreviations) that fit an annotation: first names, street locations, organizations... JAPE (Java Annotation Patterns Engine): Build other annotations out of previously given/defined annotations. Use this where the mark up is not given by a gazetteer. Rules have a syntax.
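A gazetteer lookup amounts to matching list entries against the text and recording where each one occurs. The sketch below is plain Python with made-up lists and a hypothetical lookup function; GATE's gazetteers are configured through list files rather than code like this.

```python
import re

# Hypothetical gazetteer lists; real ones are far longer.
GAZETTEER = {
    "FirstName": ["Adam", "Cyndi", "Bill"],
    "Organisation": ["University College London",
                     "European University Institute"],
}

def lookup(text):
    """Return (start, end, type) for each gazetteer entry found as whole words."""
    found = []
    for ann_type, entries in GAZETTEER.items():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), match.end(), ann_type))
    return sorted(found)

text = "Adam works at University College London."
matches = lookup(text)
for start, end, ann_type in matches:
    print(ann_type, repr(text[start:end]))
```

Matching is exact, so a spelling variant that is missing from a list is simply not found.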

GATE Gazetteers

GATE Organisation Gazetteer

GATE JAPE JAPE idea (here with mark up, but could be some feature). <FirstName>aaaa</FirstName><LastName>bbbb</LastName> => <WholeName><FirstName>aaaa</FirstName> <LastName>bbbb</LastName></WholeName> FirstName and LastName we get from the Gazetteer. WholeName we construct using the rule. For complex constructions, must have a range of alternatives.
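The FirstName + LastName => WholeName idea can be imitated in a few lines. This is a Python sketch of the logic only, not JAPE syntax; the function name and the tuple representation of annotations are assumptions made for the example.

```python
def build_whole_names(text, annotations):
    """annotations: list of (start, end, type) tuples.
    Adds a WholeName over each FirstName immediately followed by a LastName."""
    firsts = [a for a in annotations if a[2] == "FirstName"]
    lasts = [a for a in annotations if a[2] == "LastName"]
    whole = []
    for f_start, f_end, _ in firsts:
        for l_start, l_end, _ in lasts:
            # "Immediately followed" means only whitespace (or nothing) between.
            if f_end <= l_start and text[f_end:l_start].strip() == "":
                whole.append((f_start, l_end, "WholeName"))
    return annotations + whole

text = "Mr John Smith ran."
gazetteer_output = [(3, 7, "FirstName"), (8, 13, "LastName")]
result = build_whole_names(text, gazetteer_output)
for start, end, ann_type in result:
    print(ann_type, repr(text[start:end]))
```

The original annotations are kept and the new one is added on top, so later rules can use either the parts or the whole.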

GATE JAPE

GATE Example

GATE Example Organisations and Quotations. Case references.

GATE XML

Other GATE Components Develop an ontology, import it into GATE, then mark up elements manually. Use the ontology in writing the JAPE rules. Plug in other parsers, create gazetteers, work with other languages... Machine learning component. Have not discussed mark up for metadata, structure, or presentation (see de Maat, Winkels, and van Engers). Work to develop gazetteers and JAPE rules.

GATE Problems and Issues Any difference in the characters of the basic text or in annotations is an absolute difference: theatre and theater are different strings for entities. Variants in Gazetteers: Organisation and Organization are different annotations. Output in XML is possible, but GATE mark up allows overlapping tags, which are barred in standard XML. Must rework GATE XML with XSLT to make it standard XML. Accuracy is not 100% for a variety of reasons, but it can be 80-95%.
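The overlapping-tag problem can be made concrete: two annotations can be nested inside one another in XML, but not if they cross. The check below is a Python sketch with hypothetical function names, treating each annotation as a (start, end) pair of offsets.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a_start, a_end), (b_start, b_end) = sorted([a, b])
    return a_start < b_start < a_end < b_end

def crossing_pairs(spans):
    """All pairs of spans that could not be written as nested XML elements."""
    return [(a, b) for i, a in enumerate(spans)
            for b in spans[i + 1:] if crosses(a, b)]

spans = [(0, 10), (5, 15), (11, 14)]
problems = crossing_pairs(spans)
print(problems)
# prints: [((0, 10), (5, 15))]
```

Here (11, 14) sits wholly inside (5, 15) and is fine as a nested element, while (0, 10) and (5, 15) cross and would need reworking before export as standard XML.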

C&C/Boxer Motivations and Objectives Fine-grained syntactic parsing can identify not only parts of speech, but grammatical roles (subject, object) and phrases (e.g. verb plus direct object is verb phrase). Contributes to NL to RDF/OWL translation: individual entities, data and object properties? Input to semantic interpretation in FOL: test for consistency, support inference, allow rule extraction.

C&C/Boxer C&C is a parser for Combinatory Categorial Grammar (CCG). Boxer provides a semantic interpretation, given the parse. The semantic interpretation is a form of first-order logic based on Discourse Representation Theory. Needs some manipulation. Parser outputs the best parse, but that might not be what one wants; the semantic representation might need to be selected. Try it out at: http://svn.ask.it.usyd.edu.au/trac/candc Various representations: C&C, Graphic, XML Parse, Prolog.

C&C/Boxer

C&C/Boxer ∀x [man(x) -> happy(x)]
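To see what such a formula buys for inference, the universally quantified implication on this slide can be checked against a small finite model. The Python below is only an illustration: the domain and the extensions of man and happy are invented, and nothing here is Boxer output.

```python
# Hypothetical finite model: individuals and the extension of each predicate.
domain = {"bill", "adam", "lisa"}
man = {"bill", "adam"}
happy = {"bill", "adam", "lisa"}

def implies(p, q):
    """Material implication: false only when p holds and q does not."""
    return (not p) or q

def every_man_is_happy(domain, man, happy):
    """Evaluate 'for all x, man(x) -> happy(x)' by checking each individual."""
    return all(implies(x in man, x in happy) for x in domain)

print(every_man_is_happy(domain, man, happy))     # prints: True
print(every_man_is_happy(domain, man, {"bill"}))  # prints: False
```

The second call fails because adam is a man who is not in the extension of happy, which is exactly the kind of counterexample a consistency check looks for.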

If Bill is rich and healthy, then he is happy

A More Complex Example A person commits an offence if he invites another to provide money or other property and intends that it should be used, or has reasonable cause to suspect that it may be used, for the purposes of terrorism. From UK Terrorism Act 2000, Interpretation, Terrorist Property (Partial parse image).

A More Complex Example

Other Topics Controlled Languages: An expressive subset of grammatical constructions and lexicon. Guided input, so only well-formed, unambiguous expressions. Translation to FOL. Machine Learning: Annotate a set of documents to make a gold standard. Train the system on the gold standard and unannotated documents. Test accuracy and adjust. No information on how the algorithm works.
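Testing against a gold standard usually comes down to comparing two sets of annotations. The slide does not say which measure is used; the sketch below assumes precision, recall and F1 over exact matches, a common choice, with invented annotations and a hypothetical evaluate function.

```python
def evaluate(gold, predicted):
    """gold, predicted: sets of (start, end, type) annotations.
    Only exact matches on offsets and type count as correct."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented example: two exact matches, one wrong boundary, one spurious hit.
gold = {(0, 10, "Person"), (11, 14, "Verb"), (20, 45, "Organisation")}
predicted = {(0, 10, "Person"), (11, 14, "Verb"),
             (20, 40, "Organisation"), (50, 55, "Person")}

precision, recall, f1 = evaluate(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.67 f1=0.57
```

Exact matching is strict: the Organisation annotation with a slightly wrong end offset counts as both a miss and a false hit, so a more lenient scheme would report higher figures.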

Conclusion Different approaches to mark up. Burdens of initial analysis, coding, and labour. Top-down is far ahead of bottom-up, but this is a matter of focus of research effort. Converging, complementary, integrated approaches. Potential to enrich annotations further for finer-grained information.