Design, Prototypical Implementation, and Evaluation of an Active Machine Learning Service in the Context of Legal Text Classification

Design, Prototypical Implementation, and Evaluation of an Active Machine Learning Service in the Context of Legal Text Classification
Johannes Muhr, February 13th, 2017, Munich
Chair of Software Engineering for Business Information Systems (sebis)
Faculty of Informatics
Technische Universität München
wwwmatthes.in.tum.de

Key Facts
Title (German): Design, Prototypische Implementierung, und Evaluation eines Active Machine Learning Services im Kontext von Rechtstexten
Advisor: Bernhard Waltl
Supervisor: Prof. Dr. Florian Matthes
Project: LexAlyze: Analysis of Legal Texts
Chair: Software Engineering for Business Information Systems (sebis)
Student: Johannes Muhr
Start: January 15th, 2017
Submission: July 15th, 2017
February 13th, 2017, Kick-off presentation, Johannes Muhr, sebis

Outline
1. Motivation
2. Active Learning
3. Research Questions
4. Solution Approach
5. Roadmap

Motivation
- A huge number of legal documents is produced every day.
- There are many different kinds of legal documents.
[1] Gruner (2008). A Client's Analysis and Discussion of a Multi-Million Dollar Federal Lawsuit.

Motivation
- Manual document classification is very expensive and time-consuming: $13.5 million was spent on classifying 1.6 million items over four months (about $8.50 per document) [1].
- A lot of time and money is wasted on (document) discovery [2]:
  Hours: 1,446 / 5,486 = 26.4%
  Dollars: 483,986 / 1,697,322 = 28.5%
[1] Roitblat, H. L., et al. (2010). Document categorization in legal electronic discovery: computer classification vs. manual review.
[2] Gruner (2008). A Client's Analysis and Discussion of a Multi-Million Dollar Federal Lawsuit.
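
The ratios on this slide can be rechecked with a few lines of Python. This is only a sanity check on the quoted numbers; the function names are illustrative. With the rounded totals from the slide, the cost per document comes out at roughly $8.44, so the $8.50 on the slide presumably rests on unrounded totals from the cited study, which this transcript does not give.

```python
def per_item_cost(total_cost, items):
    """Average cost per classified item."""
    return total_cost / items


def share(part, whole):
    """Fraction of the whole taken up by the part."""
    return part / whole


# Rounded totals as quoted on the slide (Roitblat et al., 2010).
cost = per_item_cost(13.5e6, 1.6e6)

# Discovery effort versus total effort as quoted on the slide (Gruner, 2008).
hours_share = share(1446, 5486)
dollars_share = share(483986, 1697322)

print(f"cost per document: ${cost:.2f}")
print(f"discovery share of hours: {hours_share:.1%}")
print(f"discovery share of dollars: {dollars_share:.1%}")
```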

Motivation
Result
- Document and sentence classification is a hot topic.
- Manual classification is very expensive and time-consuming.
- A machine learning approach is expected to help here.
Solution approaches
1. Use of (Ruta) rules
2. Active machine learning (AL)
3. Combination of Ruta rules and AL

Active Learning: Motivation
Why use active machine learning for document and sentence classification?
- Detection by rules is limited: minor linguistic variations are enough to keep a sentence from being classified correctly, e.g. "Im Sinne des Gesetzes" != "Im Sinne der Gesetze" ("within the meaning of the law" vs. "within the meaning of the laws").
- Active learning has already been applied successfully in text classification [3] and also within the legal environment [4].
[3] Novak, Mladenič, & Grobelnik, 2006; S. Tong & Koller, 2002; Segal, Markowitz, & Arnold, 2006
[4] Cardellino, Villata, Alemany, & Cabrio, 2015; Šavelka, Trivedi, & Ashley, 2015; Sunkle et al., 2016
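
The brittleness of literal rules can be illustrated with a minimal sketch. It uses a plain Python regular expression as a stand-in for a hand-written rule; it is not Ruta syntax, and the rule itself is a hypothetical example built from the two phrases on the slide.

```python
import re

# Hypothetical rule: match the literal phrase, ignoring case.
# A plain regex stands in for a hand-written rule here; this is not Ruta syntax.
RULE = re.compile(r"\bim Sinne des Gesetzes\b", re.IGNORECASE)


def rule_matches(text):
    """Return True if the literal rule fires on the given text."""
    return RULE.search(text) is not None


# The two phrases from the slide: singular versus plural form.
print(rule_matches("Im Sinne des Gesetzes"))  # the rule fires
print(rule_matches("Im Sinne der Gesetze"))   # the rule misses the variant
```

Covering the plural would need a second rule or a broader pattern, and every further variant adds another; this is the maintenance burden that a learned classifier is meant to reduce.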

Active Learning: Overview
- Subfield of machine learning with a human in the loop (iterative and interactive).
- Goal: reduce the amount of training data needed by labelling only those instances that are especially helpful.
- Many influencing factors need to be considered (e.g. classifier, query strategy).
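
A minimal sketch of such a loop, assuming pool-based active learning with least-confidence sampling as the query strategy and a tiny hand-rolled Naive Bayes model as the classifier. Both choices are illustrative assumptions, not the setup chosen in the thesis, and the English example sentences and labels are made up.

```python
import math
from collections import Counter, defaultdict


def train(labelled):
    """Fit a tiny multinomial Naive Bayes model on (text, label) pairs."""
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in labelled:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_proba(model, text):
    """Return {label: probability}, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab) + 1))
        log_scores[label] = score
    top = max(log_scores.values())
    scaled = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(scaled.values())
    return {label: value / norm for label, value in scaled.items()}


def least_confident(model, pool):
    """Query strategy: the pool item whose best class probability is lowest."""
    return min(pool, key=lambda text: max(predict_proba(model, text).values()))


def active_learning(seed, pool, oracle, budget):
    """Retrain, query the most uncertain item, ask the human oracle, repeat."""
    labelled, pool = list(seed), list(pool)
    for _ in range(min(budget, len(pool))):
        model = train(labelled)
        query = least_confident(model, pool)
        pool.remove(query)
        labelled.append((query, oracle(query)))  # the human in the loop
    return train(labelled), labelled


# Made-up toy data; the gold dictionary plays the role of the human annotator.
seed = [("the tenant shall pay rent", "obligation"),
        ("the tenant may terminate the lease", "permission")]
gold = {"the landlord shall repair defects": "obligation",
        "the landlord may enter the premises": "permission",
        "the tenant shall return the keys": "obligation"}
model, labelled = active_learning(seed, list(gold), gold.get, budget=2)
print(len(labelled), "labelled examples after two queries")
```

Swapping `least_confident` for another function with the same signature changes the query strategy without touching the loop, which is one way to compare the influencing factors named on the slide.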

Active Learning: Data Set
Document classification
- More than 100,000 documents
- Manually labelled set of documents received from Datev
Sentence classification
- Available from laws (Lexia)
- Manual classification with the help of Elena Scepankova

Research Path
- What are common concepts, strategies, and technologies used in the context of text classification?
- How can (active) machine learning support the classification of legal documents and their content (sentences)?
- What do the concept and design of an active machine learning service look like?
- How well does the active machine learning service perform in classifying legal documents and their content (sentences)?

Literature Study and Framework Assessment
- Machine Learning
- (Legal) Text Classification
- Active Learning
- Analysis of Machine Learning Frameworks

Preliminary Architecture
[Architecture diagram. Components shown: Lexia, REST API, Machine Learning Microservice, Machine Learning Framework, Model Store. Part of the diagram is marked "Scope of thesis".]
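
One way to read the component names in the diagram is sketched here, with no networking involved. Everything in it is a hypothetical illustration: the route names, the JSON fields, the `ModelStore` and `KeywordModel` classes, and the `handle_request` function are assumptions, not the interface designed in the thesis. `KeywordModel` merely stands in for whatever the machine learning framework would train.

```python
import json
from collections import Counter


class ModelStore:
    """In-memory stand-in for the Model Store component."""

    def __init__(self):
        self._models = {}

    def save(self, name, model):
        self._models[name] = model

    def load(self, name):
        return self._models.get(name)


class KeywordModel:
    """Placeholder model: each known token votes for the label it was last seen with."""

    def __init__(self, examples):
        self.token_labels = {}
        for text, label in examples:
            for token in text.lower().split():
                self.token_labels[token] = label

    def predict(self, text):
        votes = Counter(self.token_labels[t] for t in text.lower().split()
                        if t in self.token_labels)
        return votes.most_common(1)[0][0] if votes else None


def handle_request(store, method, path, body):
    """Dispatch a JSON request as a REST layer might; returns (status, json_text)."""
    payload = json.loads(body) if body else {}
    if method == "POST" and path == "/models/train":  # hypothetical route
        examples = [(e["text"], e["label"]) for e in payload["examples"]]
        store.save(payload["name"], KeywordModel(examples))
        return 201, json.dumps({"trained": payload["name"]})
    if method == "POST" and path == "/models/predict":  # hypothetical route
        model = store.load(payload["name"])
        if model is None:
            return 404, json.dumps({"error": "unknown model"})
        return 200, json.dumps({"label": model.predict(payload["text"])})
    return 404, json.dumps({"error": "no such route"})


# A client such as Lexia would send requests like these over HTTP.
store = ModelStore()
train_body = json.dumps({"name": "sentences", "examples": [
    {"text": "shall pay", "label": "obligation"},
    {"text": "may terminate", "label": "permission"}]})
print(handle_request(store, "POST", "/models/train", train_body))
predict_body = json.dumps({"name": "sentences", "text": "tenant shall pay"})
print(handle_request(store, "POST", "/models/predict", predict_body))
```

Keeping the dispatch logic in a plain function separates it from the transport, so the same handler could later sit behind any HTTP server.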

Timeline
[Gantt chart covering January to July. Rows: Literature Review; Sentence Classification (Concept, Implementation, Evaluation); Document Classification (Concept, Implementation, Evaluation); Writing Master's Thesis.]

Johannes Muhr
Advisor: Bernhard Waltl
Technische Universität München
Faculty of Informatics
Chair of Software Engineering for Business Information Systems
Boltzmannstraße 3
85748 Garching bei München
Tel +49.89.289.17132
Fax +49.89.289.17136
matthes@in.tum.de
wwwmatthes.in.tum.de

Bibliography
Busse, D. (2000). Textsorten des Bereichs Rechtswesen und Justiz. In G. Antos, K. Brinker, W. Heinemann, & S. F. Sager (Eds.), Text- und Gesprächslinguistik. Ein internationales Handbuch zeitgenössischer Forschung (Handbücher zur Sprach- und Kommunikationswissenschaft) (pp. 658-675). Berlin/New York: de Gruyter.
Cardellino, C., Villata, S., Alemany, L. A., & Cabrio, E. (2015). Information Extraction with Active Learning: A Case Study in Legal Text. Paper presented at the International Conference on Intelligent Text Processing and Computational Linguistics.
Gruner, R. H. (2008). Anatomy of a Lawsuit - A Client's Analysis and Discussion of a Multi-Million Dollar Federal Lawsuit. Retrieved from http://www.gruner.com/writings/anatomylawsuit.pdf
Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1), 24. doi:10.1186/s40537-015-0032-1
Novak, B., Mladenič, D., & Grobelnik, M. (2006). Text Classification with Active Learning. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, & W. Gaul (Eds.), From Data and Information Analysis to Knowledge Engineering: Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Magdeburg, March 9-11, 2005 (pp. 398-405). Berlin, Heidelberg: Springer Berlin Heidelberg.

Bibliography (continued)
Šavelka, J., Trivedi, G., & Ashley, K. D. (2015). Applying an Interactive Machine Learning Approach to Statutory Analysis.
Segal, R., Markowitz, T., & Arnold, W. (2006). Fast Uncertainty Sampling for Labeling Large E-mail Corpora. Paper presented at the CEAS.
Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison, 52(55-66), 11.
Sunkle, S., Kholkar, D., & Kulkarni, V. (2016, 5-9 Sept. 2016). Informed Active Learning to Aid Domain Experts in Modeling Compliance. Paper presented at the 2016 IEEE 20th International Enterprise Distributed Object Computing Conference (EDOC).
Tong, S. (2001). Active learning: theory and applications. Citeseer.
Tong, S., & Koller, D. (2002). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(1), 45-66.

Backup: Literature Study
- Use of online platforms such as Google Scholar, Web of Science, Institute of Electrical and Electronics Engineers (IEEE), Online Public Access Catalogue (OPAC), and Google Books
- Backward search