Towards semantics-enabled infrastructure for knowledge acquisition from distributed data


Towards semantics-enabled infrastructure for knowledge acquisition from distributed data. Vasant Honavar and Doina Caragea, Artificial Intelligence Research Laboratory, Bioinformatics and Computational Biology Graduate Program, Center for Computational Intelligence, Learning, & Discovery, Iowa State University. honavar@cs.iastate.edu, www.cs.iastate.edu/~honavar/. In collaboration with Jun Zhang (Ph.D., 2005) and Jie Bao (Ph.D., 2007).

Outline Background and motivation Learning from data revisited Learning predictive models from distributed data Learning predictive models from semantically heterogeneous data Learning predictive models from partially specified data Current Status and Summary of Results

Representative Application: Gene Annotation. Discovering potential errors in gene annotation using machine learning (Andorf, Dobbs, and Honavar, BMC Bioinformatics, 2007). Train on human kinases and test on mouse kinases: surprisingly poor accuracy! Nearly 95 percent of the GO annotations returned by AmiGO for a set of mouse protein kinases are inconsistent with the annotations of their human homologs and are likely erroneous. The mouse annotations came from Okazaki et al., Nature, 420, 563-573, 2002; they were propagated to MGI through the Fantom2 (Functional Annotation of Mouse) database and from MGI to AmiGO. 136 rat protein kinase annotations retrieved using AmiGO had functions assigned based on one of the 201 potentially incorrectly annotated mouse proteins. Postscript: the erroneous mouse annotations were traced to a bug in the annotation script and have since been corrected by MGI.

Representative Application: Predicting Protein-RNA Binding Sites. [Figure: predicted vs. experimentally validated protein-binding and RNA-binding residues, mapped onto the structure and sequence of EIAV Rev (residues 31-165) and MBP constructs, with the NES, NLS, and the KRRRK and RRDRW motifs marked.] Terribilini, M., Lee, J.-H., Yan, C., Carpenter, S., Jernigan, R., Honavar, V. and Dobbs, D. (2006).

Background. Data revolution: bioinformatics (over 200 data repositories of interest to molecular biologists alone; Discala, 2000), environmental informatics, enterprise informatics, medical informatics, social informatics, ... Information processing revolution: algorithms as theories; Computation : Biology :: Calculus : Physics. Connectivity revolution (the Internet and the web). Integration revolution: we need to understand the elephant as opposed to examining the trunk, the tail, etc. Infrastructure is needed to support collaborative, integrative analysis of data.

Predictive Models from Data. Supporting collaborative, integrative analysis of data across geographic, organizational, and disciplinary barriers requires coming to terms with: large, distributed, autonomous data sources; memory, bandwidth, and computing limitations; access and privacy constraints; and differences in data semantics (same term, different meaning; different terms, same meaning; different domains of values for semantically equivalent attributes; different measurement units; different levels of abstraction). Can we learn without centralized access to data? Can we learn in the presence of semantic gaps between the user and the data sources? How do the results compare with the centralized setting?

Outline Background and motivation Learning from data revisited Learning predictive models from distributed data Learning predictive models from semantically heterogeneous data Learning predictive models from partially specified data Current Status and Summary of Results

Acquiring knowledge from data. Most machine learning algorithms assume centralized access to semantically homogeneous data. [Diagram: data plus assumptions are fed to a learner L, which produces a hypothesis h, i.e., knowledge.]

Learning Classifiers from Data. [Diagram: in the learning phase, labeled examples are given to a learner, which outputs a classifier; in the classification phase, the classifier assigns a class to an unlabeled instance.] Standard learning algorithms assume centralized access to data. Can we do without direct access to data?

Example: Learning decision tree classifiers.

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
1    Sunny     Hot    High      Weak    No
2    Sunny     Hot    High      Strong  No
3    Overcast  Hot    High      Weak    Yes
4    Overcast  Cold   Normal    Weak    No

Splitting on Outlook partitions the examples {1, 2, 3, 4} into {1, 2} (Sunny) and {3, 4} (Overcast). The Sunny branch becomes a leaf labeled No; the Overcast branch is split further on Temp., giving the leaves {3} (Hot: Yes) and {4} (Cold: No). Node impurity is measured by the entropy

$H(D) = -\sum_{i \in \text{Classes}} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$

where $D_i$ is the subset of examples in $D$ belonging to class $i$.
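As a worked check (not on the original slide), the entropy of the full four-example set, which contains one Yes and three No labels, is

$H(D) = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx 0.811$ bits.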

Example: Learning decision tree classifiers. The decision tree is constructed by recursively (and greedily) choosing the attribute that provides the greatest estimated information about the class label. What information do we need to choose a split at each step? The information gain of each candidate split, the estimated probability distribution resulting from each candidate split, and the proportion of instances of each class along each branch of each candidate split. Key observation: if we have the relevant counts, we have no need for the data! (See the sketch below.)
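A minimal Python sketch of that observation, assuming class counts are handed to the learner as plain dictionaries; the function names and count layout are illustrative, not part of the original system:

```python
from math import log2

def entropy(class_counts):
    """Entropy of a node, computed from class counts alone (no raw data needed)."""
    total = sum(class_counts.values())
    return -sum((c / total) * log2(c / total) for c in class_counts.values() if c > 0)

def information_gain(parent_counts, branch_counts):
    """Information gain of a candidate split.

    parent_counts: {class: count} at the node being refined
    branch_counts: {attribute_value: {class: count}} for the candidate split
    """
    total = sum(parent_counts.values())
    remainder = sum(
        (sum(counts.values()) / total) * entropy(counts)
        for counts in branch_counts.values()
    )
    return entropy(parent_counts) - remainder

# Counts for the four-example table above, for a split on Outlook
parent = {"Yes": 1, "No": 3}
outlook = {"Sunny": {"No": 2}, "Overcast": {"Yes": 1, "No": 1}}
print(round(information_gain(parent, outlook), 3))  # 0.311
```

Nothing here touches individual examples; the split can be scored wherever the counts can be obtained.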

Sufficient statistics for refining a partially constructed decision tree. [The partially constructed tree and the entropy formula from the previous example are shown again.] The sufficient statistics for refining a partially constructed decision tree are count(attribute value, class | path) and count(class | path), i.e., class counts conditioned on the path from the root to the node being refined.

Decision Tree Learning = Answering Count Queries + Hypothesis Refinement. [Diagram: a decision tree is grown by issuing count queries against the data. Choosing the root (Outlook, with branches Sunny, Overcast, Rain) uses Counts(Attribute, Class) and Counts(Class); refining the Sunny branch into a Humidity test (High: No, Normal: Yes) uses Counts(Humidity, Class | Outlook) and Counts(Class | Outlook); refining the Rain branch into a Wind test (Strong: No, Weak: Yes) uses Counts(Wind, Class | Outlook) and Counts(Class | Outlook); the Overcast branch is a Yes leaf. Only the counts flow from the data to the learner.]

Sufficient statistics for learning: analogy with statistical parameter estimation. [Diagram: in classical estimation, a statistic s(D) computed from the data D is sufficient for estimating a parameter θ ∈ Θ; in learning, a statistic s(h_i → h_{i+1}, D) is sufficient for the learner L to refine a hypothesis h_i ∈ H into h_{i+1} ∈ H.]

Sufficient statistics for learning a hypothesis from data. It helps to break down the computation of s_L(D, h) into smaller steps: queries to the data D, and computation on the results of those queries. This generalizes the classical notion of sufficient statistics by interleaving computation and queries against the data. Basic operations: refinement and composition.

Learning from Data Reexamined. [Diagram: the learner is decomposed into a statistical query generation component, which poses the query s(h_i → h_{i+1}, D) against the data D, and a hypothesis construction component, which computes h_{i+1} = C(h_i, s(h_i → h_{i+1}, D)).] Learning = Sufficient Statistics Extraction + Hypothesis Construction. [Caragea, Silvescu, and Honavar, 2004]

Learning from Data Reexamined. Designing algorithms for learning from data reduces to: identifying minimal or near-minimal sufficient statistics for different classes of learning algorithms, and designing procedures for obtaining the relevant sufficient statistics or their efficient approximations. This leads to a separation of concerns between hypothesis construction (through successive refinement and composition operations) and statistical query answering, sketched in code below.
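A minimal Python sketch of that separation of concerns, assuming a toy count-query interface; the names CountQueryOracle and learn_decision_stump are illustrative and not part of the original system:

```python
class CountQueryOracle:
    """Answers count queries against a dataset; the learner never sees the raw rows."""
    def __init__(self, rows):
        self._rows = rows  # e.g. [{"Outlook": "Sunny", "Play": "No"}, ...]

    def count(self, **conditions):
        """Number of rows matching all attribute=value conditions."""
        return sum(all(r[k] == v for k, v in conditions.items()) for r in self._rows)

def learn_decision_stump(oracle, attribute, values, classes, label="Play"):
    """Toy learner: refines an empty hypothesis into a one-level decision stump.
    Each refinement step is driven only by answers to count (statistical) queries."""
    stump = {}
    for v in values:                                     # successive refinement
        counts = {c: oracle.count(**{attribute: v, label: c}) for c in classes}
        stump[v] = max(counts, key=counts.get)           # composition: attach a leaf
    return stump

data = [
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Overcast", "Play": "Yes"},
    {"Outlook": "Overcast", "Play": "No"},
]
print(learn_decision_stump(CountQueryOracle(data), "Outlook",
                           ["Sunny", "Overcast"], ["Yes", "No"]))
# {'Sunny': 'No', 'Overcast': 'Yes'}  (the Overcast tie is broken by class order)
```

The same learner can be pointed at an oracle that answers its count queries from a remote or distributed source; only the oracle changes, not the hypothesis construction.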

Outline Background and motivation Learning from data revisited Learning predictive models from distributed data Learning predictive models from semantically heterogeneous data Learning predictive models from partially specified data Current Status and Summary of Results

Learning Classifiers from Distributed Data. Learning from distributed data requires learning from dataset fragments without gathering all of the data in a central location. Assuming that the data set is represented in tabular form, data fragmentation can be horizontal, vertical, or more general (e.g., multi-relational).

Learning from distributed data. [Diagram: the learner poses the statistical query s(D, h_i → h_{i+1}); a query decomposition component translates it into sub-queries q_1, q_2, q_3 against the distributed data sources D_1, D_2, D_3, and an answer composition component assembles the partial answers into the answer to the original query.]

Learning from Distributed Data. Learning classifiers from distributed data reduces to statistical query answering from distributed data: we need a sound and complete procedure for answering the desired class of statistical queries from distributed data under different types of data fragmentation, different constraints on access and query capabilities, and different bandwidth and resource constraints. [Caragea, Silvescu, and Honavar, 2004; Caragea et al., 2005]
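As a minimal sketch (assuming horizontal fragmentation and a plain count query; this is not the INDUS implementation), answer composition for counts is just summation over per-fragment answers, which is why the composed answer is identical to the one a centralized run would produce:

```python
def count_query_horizontal(fragments, **conditions):
    """Answer a count query over horizontally fragmented data by decomposing it into
    one count query per fragment and composing (summing) the partial answers.
    For counts this is exact: it equals the count over the never-materialized union."""
    return sum(
        sum(all(row[k] == v for k, v in conditions.items()) for row in fragment)
        for fragment in fragments
    )

# Two sites each hold a row subset of the same relation (horizontal fragmentation).
d1 = [{"Outlook": "Sunny", "Play": "No"}, {"Outlook": "Sunny", "Play": "No"}]
d2 = [{"Outlook": "Overcast", "Play": "Yes"}, {"Outlook": "Overcast", "Play": "No"}]
print(count_query_horizontal([d1, d2], Play="No"))  # 3
```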

How can we evaluate algorithms for learning from distributed data? Compare with their batch counterparts. Exactness: a guarantee that the learned hypothesis is the same as, or equivalent to, the one obtained by the batch counterpart. Approximation: a guarantee that the learned hypothesis is an approximation (in a quantifiable sense) of the hypothesis obtained in the batch setting. Also compare communication, memory, and processing requirements. [Caragea, Silvescu, and Honavar, 2003; 2004]

Some Results on Learning from Distributed Data. Provably exact algorithms for learning decision tree, SVM, Naïve Bayes, neural network, and Bayesian network classifiers from distributed data. Positive and negative results concerning the efficiency (bandwidth, memory, computation) of learning from distributed data. [Caragea, Silvescu, and Honavar, 2004; Honavar and Caragea, 2008]

Outline Background and motivation Learning from data revisited Learning classifiers from distributed data Learning classifiers from semantically heterogeneous data Learning classifiers from partially specified data Current Status and Summary of Results

Semantically heterogeneous data: different schemas, different data semantics.

D1:
Day  Temperature (C)  Wind Speed (km/h)  Outlook
1    20               16                 Cloudy
2    10               34                 Sunny
3    17               25                 Rainy

D2:
Day  Temp (F)  Wind (mph)  Precipitation
4    3         24          Rain
5    -2        50          Light Rain
6    0         34          No Prec

Making Data Sources Self-Describing. Exposing the schema: the structure of the data, i.e., a specification of its attributes. D1: Day: day; Temperature: deg C; Wind Speed: km/h; Outlook: outlook. D2: Day: day; Temp: deg F; Wind: mph; Precipitation: prec. Exposing the ontology: schema semantics and data semantics.
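One way such a self-description might be represented in code; this is an illustrative sketch under assumed names (AttributeSpec, DataSourceDescription), not the format used by the system described here:

```python
from dataclasses import dataclass, field

@dataclass
class AttributeSpec:
    name: str  # attribute name as exposed by the source, e.g. "Temperature"
    type: str  # type / unit / value taxonomy it draws on, e.g. "deg C", "outlook"

@dataclass
class DataSourceDescription:
    """What a self-describing source exposes: its schema and, optionally, the
    ontology (schema and data semantics, e.g. attribute value hierarchies)."""
    name: str
    schema: list = field(default_factory=list)   # list of AttributeSpec
    ontology: dict = field(default_factory=dict)

d1 = DataSourceDescription("D1", [
    AttributeSpec("Day", "day"),
    AttributeSpec("Temperature", "deg C"),
    AttributeSpec("Wind Speed", "km/h"),
    AttributeSpec("Outlook", "outlook"),
])
```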

Ontology Extended Data Sources: expose the data semantics. A special case of interest: the values of each attribute are organized as an attribute value hierarchy (AVH).

Ontology Extended Data Sources. The ontology-extended data source [Caragea et al., 2005] is inspired by ontology-extended relational algebra [Bonatti et al., 2003]. Querying data sources from a user's point of view is facilitated by specifying mappings from the user schema to the data source schemas, and from the user AVH to the data source AVHs. A more systematic characterization of OEDS and mappings within a description logics framework is in progress.

Mappings between schemas. D1: Day: day; Temperature: deg C; Wind Speed: km/h; Outlook: outlook. D2: Day: day; Temp: deg F; Wind: mph; Precipitation: prec. User schema D_U: Day: day; Temp: deg F; Wind: km/h; Outlook: outlook. Schema mappings: Day:D1 maps to Day:D_U; Day:D2 maps to Day:D_U; Temperature:D1 maps to Temp:D_U; Temp:D2 maps to Temp:D_U.

Semantic Correspondence between Ontologies. [Figure: three is-a hierarchies, H1, H2, and the user hierarchy H_U; the white nodes represent the values used to describe the data.]

Data sources from a user's perspective. [Figure: is-a hierarchies H1 and H_U with semantic correspondences between their values.] Rainy:H1 = Rain:H_U; Snow:H1 = Snow:H_U; NoPrec:H_U < Outlook:H1; {Sunny, Cloudy}:H1 = NoPrec:H_U. Conversion functions are used to map units (e.g., degrees F to degrees C). A small sketch of such correspondences and conversions follows. [Caragea, Pathak, and Honavar, 2004]
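A minimal sketch of value correspondences and a unit conversion function in Python; the value names come from the slide, while the dictionary layout and function names are illustrative assumptions:

```python
# Value-level correspondences from the data-source hierarchy H1 to the user hierarchy HU.
h1_to_hu = {
    "Rainy": "Rain",
    "Snow": "Snow",
    "Sunny": "NoPrec",   # {Sunny, Cloudy}: H1 = NoPrec: HU
    "Cloudy": "NoPrec",
}

# Conversion functions map measurement units of semantically equivalent attributes.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0

def to_user_vocabulary(value: str) -> str:
    """Translate a data-source outlook value into the user's vocabulary."""
    return h1_to_hu[value]

print(to_user_vocabulary("Cloudy"), round(fahrenheit_to_celsius(50), 1))  # NoPrec 10.0
```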

Learning from Semantically Heterogeneous Data. [Diagram: mappings M(O, O_1 .. O_N) relate the user ontology O to the data source ontologies O_1 .. O_N. The learner poses the statistical query s_O(h_i → h_{i+1}, D) in terms of O; a query decomposition component translates it into sub-queries q_1, q_2, q_3 against the ontology-extended data sources (D_1, O_1), (D_2, O_2), (D_3, O_3); an answer composition component assembles the answer to the original query.]

Semantic gaps lead to partially specified data. Different data sources may describe data at different levels of abstraction; if the description of the data is more abstract than what the user expects, additional statistical assumptions become necessary. [Figure: is-a hierarchies H1 and the user hierarchy H_U; Snow is under-specified in H1 relative to the user ontology H_U, making D1 partially specified from the user's perspective.] [Zhang and Honavar, 2003; 2004; 2005]

Outline Background and motivation Learning from data revisited Learning predictive models from distributed data Learning predictive models from semantically heterogeneous data Learning predictive models from partially specified data Current Status and Summary of Results

Learning Classifiers from Attribute Value Taxonomies (AVTs) and Partially Specified Data. Given a taxonomy over the values of each attribute, and data specified in terms of values at different levels of abstraction, learn a concise and accurate hypothesis. [Figure: two attribute value taxonomies. Student Status: Undergraduate (Freshman, Sophomore, Junior, Senior) and Graduate (Master, Ph.D). Work Status: On-Campus (TA, RA, AA) and Off-Campus (Government (Federal, State, Local) and Private (Org, Com)). Cuts γ_0, γ_1, ..., γ_k through the taxonomies correspond to hypotheses h(γ_0), h(γ_1), ..., h(γ_k).] [Zhang and Honavar, 2003; 2004; Zhang et al., 2006; Caragea et al., 2006]

Learning Classifiers from AVTs and Partially Specified Data. Cuts through an AVT induce a partial order over instance representations. The classifiers AVT-DTL and AVT-NBL show how to learn classifiers from partially specified data, estimate sufficient statistics from partially specified data under specific statistical assumptions, and use a CMDL score to trade off classifier complexity against accuracy. A small sketch of a cut through an AVT appears below. [Zhang and Honavar, 2003; 2004; 2005]
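A minimal Python sketch of an AVT and a cut through it, using the Student Status taxonomy from the figure; the taxonomy encoding and helper name are illustrative assumptions, not AVT-DTL/AVT-NBL internals:

```python
# Child -> parent links for the Student Status attribute value taxonomy.
parent = {
    "Freshman": "Undergraduate", "Sophomore": "Undergraduate",
    "Junior": "Undergraduate", "Senior": "Undergraduate",
    "Master": "Graduate", "Ph.D": "Graduate",
    "Undergraduate": "Student Status", "Graduate": "Student Status",
}

def abstract_to_cut(value: str, cut: set) -> str:
    """Map an attribute value to its ancestor on the chosen cut, if it has one.
    A value at or above the cut (e.g. 'Graduate' below) has no such ancestor:
    it is partially specified with respect to the cut and is returned unchanged."""
    v = value
    while v not in cut:
        if v not in parent:
            return value   # no ancestor on the cut: the value lies above the cut
        v = parent[v]
    return v

cut = {"Undergraduate", "Master", "Ph.D"}   # one possible cut gamma
print(abstract_to_cut("Junior", cut))       # Undergraduate
print(abstract_to_cut("Graduate", cut))     # Graduate (under-specified w.r.t. the cut)
```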

Outline Background and motivation Learning from data revisited Learning predictive models from distributed data Learning predictive models from semantically heterogeneous data Learning predictive models from partially specified data Current Status and Summary of Results

Implementation: INDUS System [Caragea et al., 2005]

Summary. Algorithms for learning classifiers from distributed data with provable performance guarantees relative to their centralized or batch counterparts. Tools for making data sources self-describing. Tools for specifying semantic correspondences between data sources. Tools for answering statistical queries from semantically heterogeneous data. Tools for collaborative construction of ontologies and mappings, distributed reasoning, ...

Current Directions. Further development of the open-source tools for collaborative construction of predictive models from data. Resource-bounded approximations of statistical queries under different access constraints and statistical assumptions. Algorithms for learning predictive models from semantically disparate, alternately structured data. Further investigation of OEDS: description logics, RDF, ...; relation to modular ontologies and knowledge importing; distributed reasoning and privacy-preserving reasoning. Applications in bioinformatics, medical informatics, materials informatics, and social informatics.

Acknowledgements. Students: Doina Caragea, Ph.D., 2004; Jun Zhang, Ph.D., 2005; Jie Bao, Ph.D., 2007; Cornelia Caragea, Ph.D., in progress; Oksana Yakhnenko, Ph.D., in progress. Collaborators: Giora Slutzki, George Voutsadakis. National Science Foundation.