Data Warehouse. Data Mining. InterPro Database. Knowledge Discovery. BINF 630: Introduction to Bioinformatics.

Transcription:

BINF 630: Introduction to Bioinformatics
Iosif Vaisman
Email: ivaisman@gmu.edu

Data Warehouse
- Operational data
- Data fusion
- Data cleansing
- Metadata

InterPro Database

Data Mining
Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.

Common data mining tasks
- Classification
- Estimation
- Prediction
- Affinity Grouping
- Clustering
- Description

Knowledge Discovery
Directed and Undirected KD

Directed KD
Purpose: Explain value of some field in terms of all the others
Method: We select the target field based on some hypothesis about the data. We ask the algorithm to tell us how to predict or classify it
Similar to hypothesis testing (e.g., in regression modeling) in statistics

Undirected KD
Purpose: Find patterns in the data that may be interesting
Method: clustering, affinity grouping
Closest to ideas of machine learning in artificial intelligence

Comparison
UKD helps us to recognize relationships & DKD helps us to explain them

Classification
Classifying observations into different categories given characteristics

Estimation
Rules that explain how to estimate a value given characteristics

Prediction
Rules that explain how to predict a future value or classification, given characteristics

Affinity Grouping
Grouping by relations (not by characteristics)

Clustering
Segmenting a diverse population into more similar groups
In clustering, there are no pre-defined classes and no examples. Records are grouped together by some similarity measure.

Scientific Models
Physical models -- Mathematical models
- Mechanistic models: mechanism, predictive power, elegance, consistency
- Stochastic models: black box, predictive power

Artificial Intelligence in Biosciences
- Neural Networks (NN)
- Genetic Algorithms (GA)
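The clustering task above (no pre-defined classes, records grouped by a similarity measure) can be sketched with k-means. This is only an illustration: the slides do not name a clustering algorithm, so k-means, Euclidean distance, the helper name `kmeans`, the toy records, and the "first k records as starting centers" rule are all assumptions.

```python
import math

def kmeans(points, k, iterations=20):
    """Group points into k clusters by Euclidean distance (hypothetical helper)."""
    # Assumption: the first k records serve as starting centers (deterministic run).
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

# Made-up 2-D records forming two loose groups.
records = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (4.8, 5.2), (0.9, 1.1), (5.1, 4.9)]
labels, centers = kmeans(records, k=2)
print(labels)
```

No class labels are supplied anywhere; the groups emerge only from the distance measure, which is the distinction the slide draws between clustering and classification.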

Neural Networks
- interconnected assembly of simple processing elements (units or nodes)
- node functionality is similar to that of the animal neuron
- processing ability is stored in the inter-unit connection strengths (weights)
- weights are obtained by a process of adaptation to, or learning from, a set of training patterns

Perceptron
Input layer -> Output layer
$Y = 1$ if $\sum_i w_i I_i > \Theta$; $Y = 0$ otherwise
Learning process: $\Delta w_i = (T_p - Y_p)\, I_{pi}$

Hierarchical neural network
Input layer -> Hidden layer -> Output layer

[Figure: hierarchical network with input layer (7x21 units), hidden layer (2 units) and output layer (2 units: Helix, Sheet), applied to the sequence window MKFGNFLLYQP [ PELSQE ] VMKRLVNLGKSEG...]
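The perceptron threshold rule and its learning process can be sketched as follows. This is a minimal illustration, not the lecture's code: the function names, the logical-AND toy data, the fixed threshold of 0.5, and the learning-rate factor `rate` are all assumptions added for the example.

```python
def predict(weights, theta, inputs):
    """Threshold unit: output 1 if the weighted input sum exceeds theta, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > theta else 0

def train_perceptron(patterns, targets, theta=0.5, rate=0.1, epochs=20):
    """Adjust each weight by rate * (target - output) * input, pattern by pattern."""
    weights = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for inputs, target in zip(patterns, targets):
            error = target - predict(weights, theta, inputs)
            # Weights change only when the output disagrees with the target.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights

# Toy training patterns (assumption): logical AND of two binary inputs.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
weights = train_perceptron(patterns, targets)
outputs = [predict(weights, 0.5, p) for p in patterns]
print(outputs)
```

A single threshold unit like this can only separate classes with a linear boundary, which is one motivation for the hidden layer in the hierarchical network.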

Genetic Algorithms
Search or optimization methods using simulated evolution. Population of potential solutions is subjected to natural selection, crossover, and mutation.

Genetic Algorithms
  choose initial population
  evaluate each individual's fitness
  repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
  until terminating condition

[Flowchart: START -> INITIALIZATION -> EVALUATION -> SOLUTION? (YES -> STOP; NO -> REPRODUCTION -> CROSS-OVER -> MUTATION -> EVALUATION)]

[Figure: crossover and mutation. Parent A and Parent B are cut at the crossover point and recombined into Child AB and Child BA.]

Genetic Algorithms Applications
[Figure: parent and child bit strings]
GA simulation of folding: membrane binding domain of Blood Coagulation Factor VIII (J. Moult)
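The genetic algorithm loop (initial population, fitness evaluation, selection, random mating, crossover, mutation, termination test) can be sketched on a toy problem. Everything specific here is an assumption for illustration: the "count the 1 bits" fitness function, binary tournament selection, the parameter values, and the elitism step that carries the best individual forward are not taken from the slides.

```python
import random

def fitness(bits):
    """Toy objective (assumption): number of 1 bits in the string."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=60, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # Choose initial population and evaluate fitness.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        if fitness(best) == n_bits:              # terminating condition
            break
        # Select individuals to reproduce (binary tournaments, an assumption).
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        rng.shuffle(parents)                     # mate pairs at random
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            point = rng.randint(1, n_bits - 1)   # single crossover point
            children += [a[:point] + b[point:], b[:point] + a[point:]]
        # Mutation: flip each bit with probability p_mut.
        pop = [[1 - g if rng.random() < p_mut else g for g in child]
               for child in children]
        pop[0] = best[:]                         # elitism (added assumption)
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

best, history = evolve()
print(history[0], history[-1])
```

With elitism the best fitness never decreases from one generation to the next; without it, crossover and mutation can destroy the current best solution.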

Grammars and Language
grammar, n. 1. the study of the way the sentences of a language are constructed... 4. Generative Gram. a device, as a body of rules, whose output is all of the sentences that are permissible in a given language, while excluding all those that are not permissible. (Random House Unabridged Dictionary)

Language Components
- Semantics (meaning)
- Syntax (structure, form)

Language Syntax
- Alphabet: primitive elements (letters, phonemes)
- Vocabulary: elements composed from the alphabet (words, phrases, sentences)
- Grammar: legal composition of vocabulary (rules, operators)

Semantics
- Derived from syntax
- Semantic content derived from vocabulary within a context
- Vocabulary element has its own meanings: dictionary lookup; meanings depending on context
- Time flies like an arrow
- Fruit flies like a banana

Formal Grammars
formal grammar: a means for specifying the syntactic structure of natural language by a set of transformation functions
Chomsky hierarchy (for string grammars)
- type 0: phrase structure
- type 1: context sensitive
- type 2: context free (SCFG)
- type 3: regular (Hidden Markov models)
Chomsky, Syntactic Structures (1957)

Markov Model (or Markov Chain)
Probability for each character based only on several preceding characters in the sequence
# of preceding characters = order of the Markov Model
[Figure: Markov chain over nucleotide states, with transitions labeled by observed frequencies]
Probability of a sequence of six characters $s = s_1 \dots s_6$ (second-order model):
$P(s) = P[s_1]\, P[s_1, s_2]\, P[s_1, s_2, s_3]\, P[s_2, s_3, s_4]\, P[s_3, s_4, s_5]\, P[s_4, s_5, s_6]$

Hidden Markov Models
Probabilistic model - true state is unknown
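The Markov model idea (each character's probability depends only on the preceding `order` characters, with probabilities taken from observed frequencies) can be sketched as below. The training sequences are made up, the function names are hypothetical, and scoring the first `order` characters with a uniform probability is a simplifying assumption rather than the slide's treatment of the sequence start.

```python
from collections import Counter, defaultdict
import math

def train_markov(sequences, order=1):
    """Estimate P(next char | preceding `order` chars) from observed frequencies."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {ctx: {ch: n / sum(c.values()) for ch, n in c.items()}
            for ctx, c in counts.items()}

def log_probability(seq, model, order=1, alphabet="ACGT"):
    """log P(seq); the first `order` characters are scored uniformly (assumption)."""
    logp = order * math.log(1 / len(alphabet))
    for i in range(order, len(seq)):
        p = model.get(seq[i - order:i], {}).get(seq[i], 0.0)
        if p == 0.0:
            return float("-inf")   # transition never observed in training
        logp += math.log(p)
    return logp

training = ["ACGTACGT", "ACGGACGT"]   # made-up sequences
model = train_markov(training, order=1)
print(model["G"])
print(math.exp(log_probability("ACGT", model)))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a long sequence.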

Hidden Markov Models
States -- well defined conditions
Edges -- transitions between the states
Each transition is assigned a probability.
Probability of the sequence:
- single path with the highest probability -- Viterbi path
- sum of the probabilities over all paths -- Baum-Welch method
[Figure: example HMM with aligned sequences]

Hidden Markov Models
probabilities: $P(S)$
log-odds: $\log \frac{P(S)}{0.25^{L}}$
Adopted from Anders Krogh, 1998

Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)
Adopted from S. Salzberg, 1997

Hidden Markov Model in Structural Analysis
A hidden Markov model consists of Markov states connected by directed transitions. Each state emits an output symbol, representing sequence or structure. There are four categories of emission symbols in our model: b, d, r, and c, corresponding to amino acid residues, three-state secondary structure, backbone angles (discretized into regions of phi-psi space) and structural context (e.g. hairpin versus diverging turn, middle versus end-strand), respectively.
Adopted from C. Bystroff et al., 2000

Hidden Markov Model in Structural Analysis
HMM topology from merging of two motifs, the extended Type-I hairpin motif and the Serine hairpin.
Adopted from C. Bystroff et al., 2000, JMB, 301, 173

Artificial Intelligence in Biosciences
Other machine learning algorithms:
- Support vector machines
- Decision trees
- Random forests
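The two ways of scoring a sequence with an HMM (the single most probable path versus the sum over all paths) can be sketched on a tiny model. The two-state model, its state names `H` and `L`, and every probability in it are invented for illustration. The sum over all paths is computed here with the forward recursion, which is the quantity that Baum-Welch training builds on. The log-odds score compares P(S) with a uniform background of 0.25 per nucleotide.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return (probability, path) for the single most probable state path."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        best = {
            s: max(
                ((p * trans[prev][s] * emit[s][symbol], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda item: item[0])

def forward(obs, states, start, trans, emit):
    """Return P(obs), the sum of the probabilities over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model: H (GC-rich) and L (AT-rich).
states = ("H", "L")
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

obs = "GGC"
best_prob, best_path = viterbi(obs, states, start, trans, emit)
total_prob = forward(obs, states, start, trans, emit)
log_odds = math.log(total_prob / 0.25 ** len(obs))
print(best_path, best_prob, total_prob, log_odds)
```

The total probability is never smaller than the Viterbi path probability, because the best path is one of the terms in the sum.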

Support Vector Machines (SVM) Algorithm
- Decision surface is a hyperplane (line in 2D, plane in 3D, etc.) in feature space
- Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize margin
- Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications
- Map data to high dimensional space where it is easier to classify with linear decision surfaces: reformulate problem so that data is mapped implicitly to this space
Aliferis & Tsamardinos

Support Vector Machines (SVM)
[Figure: separating hyperplane in the Var 1 / Var 2 plane, with the margin width marked]
Aliferis & Tsamardinos

Support Vector Machines (SVM)
[Figure: linear SVM and non-linear SVM]
Aliferis & Tsamardinos
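The first two ideas above (maximize the margin, and add a penalty term for misclassifications) can be sketched for the linear case. This is a teaching sketch under stated assumptions: it minimizes the soft-margin objective by plain subgradient descent rather than the quadratic-programming solvers SVM packages normally use, it leaves out the implicit high-dimensional mapping (kernels) entirely, and the function names, toy data and parameter values are invented.

```python
import math

def train_linear_svm(X, y, C=1.0, rate=0.01, epochs=2000):
    """Minimise 0.5*||w||^2 + C * sum(max(0, 1 - y_i*(w.x_i + b)))."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)          # gradient of the margin term 0.5*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                # Inside the margin or misclassified: the penalty term is active.
                grad_w = [g - C * yi * xj for g, xj in zip(grad_w, xi)]
                grad_b -= C * yi
        w = [wj - rate * g for wj, g in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b

def classify(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls (+1 or -1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Toy, linearly separable data centred on the origin (assumption).
X = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
predictions = [classify(w, b, x) for x in X]
margin_width = 2 / math.sqrt(sum(wj * wj for wj in w))
print(w, b, predictions, margin_width)
```

The parameter `C` controls the trade-off: a large value penalizes margin violations heavily, while a small value accepts some violations in exchange for a wider margin.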