Defining Big Data. Data Mining. Knowledge Discovery. BINF 630: Bioinformatics Methods

Transcription:

Defining Big Data
BINF 630: Bioinformatics Methods
Iosif Vaisman
Email: ivaisman@gmu.edu

NOT JUST SIZE
The three Vs of Big Data: volume, variety and velocity (D. Laney, 2001)
Elements of "Big Data" include:
  The degree of complexity within the data set
  The amount of value that can be derived from innovative vs. non-innovative analysis techniques
  The use of longitudinal information supplements the analysis
http://mike2.openmethodology.org/wiki/big_data_definition

Data Mining
Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.
Common data mining tasks: Classification, Estimation, Prediction, Affinity Grouping, Clustering, Description

Knowledge Discovery
Knowledge is... a pattern that exceeds certain threshold of interestingness.
Factors that contribute to interestingness: coverage, confidence, statistical significance, simplicity, unexpectedness, actionability

Directed and Undirected KD

Directed KD
Purpose: Explain value of some field in terms of all the others
Method: We select the target field based on some hypothesis about the data. We ask the algorithm to tell us how to predict or classify it
Similar to hypothesis testing (e.g., in regression modeling) in statistics

Undirected KD
Purpose: Find patterns in the data that may be interesting
Method: clustering, affinity grouping
Closest to ideas of machine learning in artificial intelligence

Comparison
UKD helps us to recognize relationships & DKD helps us to explain them

Knowledge Discovery
Classification: Classifying observations into different categories given characteristics
Estimation: Rules that explain how to estimate a value given characteristics
Prediction: Rules that explain how to predict a future value or classification, given characteristics
Affinity Grouping: Grouping by relations (not by characteristics)
Clustering: Segmenting a diverse population into more similar groups
In clustering, there are no pre-defined classes and no examples. Records are grouped together by some similarity measure.
B. Bergeron, 2002

Scientific Models
Physical models -- Mathematical models

Artificial Intelligence in Biosciences
Mechanistic models: Mechanism, Predictive power, Elegance, Consistency
Stochastic models: Black box, Predictive power
Neural Networks (NN)
Genetic Algorithms (GA)
Formal Grammars (FG)
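The clustering idea above (no pre-defined classes, records grouped by a similarity measure) can be illustrated with a minimal k-means sketch. The slides do not name an algorithm; k-means, the Euclidean distance, the toy points and the choice of the first k points as starting centers are all assumptions made for illustration.

```python
def kmeans(points, k, iters=20):
    """Group 2-D points into k clusters using squared Euclidean distance."""
    # Assumption of this sketch: the first k points serve as initial centers.
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each record joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centers


# Hypothetical records: two well-separated groups, with no class labels supplied.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The function only ever sees the coordinates; the group labels come out of the similarity measure alone, which is the sense in which undirected knowledge discovery has "no examples".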

Artificial Intelligence in Biosciences: Neural Networks (NN), Genetic Algorithms (GA), Formal Grammars (FG)

Neural Networks
An interconnected assembly of simple processing elements (units or nodes)
Node functionality is similar to that of the animal neuron
Processing ability is stored in the inter-unit connection strengths (weights)
Weights are obtained by a process of adaptation to, or learning from, a set of training patterns

Perceptron
[Figure: perceptron with an input layer and an output layer]
$Y = 1$ if $\sum_i w_i x_i > \theta$, $Y = 0$ otherwise
Learning process: $\Delta w_i = \eta \, (T_p - Y_p) \, x_{pi}$

Hierarchical neural network
[Figure: input layer (7x21 units), hidden layer (2 units), output layer (2 units: Helix, Sheet); input sequence window MKFGNFLLYQP [ PELSQE ] VMKRLVNLGKSEGC...]
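The perceptron output and learning rule can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the threshold value, the learning rate, the constant bias input and the logical-AND training set are all assumptions chosen so the example is small enough to trace by hand.

```python
def predict(weights, inputs, theta=0.0):
    """Threshold unit: output 1 if the weighted input sum exceeds theta, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > theta else 0


def train_perceptron(patterns, targets, eta=0.5, epochs=20, theta=0.0):
    """Perceptron rule: w_i += eta * (T_p - Y_p) * x_pi for each training pattern p."""
    weights = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for inputs, target in zip(patterns, targets):
            error = target - predict(weights, inputs, theta)
            weights = [w + eta * error * x for w, x in zip(weights, inputs)]
    return weights


# Logical AND. The trailing 1.0 is a constant bias input (an assumption of this
# sketch) so that the decision boundary does not have to pass through the origin.
patterns = [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
targets = [0, 0, 0, 1]

weights = train_perceptron(patterns, targets)
outputs = [predict(weights, p) for p in patterns]
print(weights, outputs)
```

Weights change only when the output disagrees with the target, so once every training pattern is classified correctly the weights stop moving.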

Artificial Intelligence in Biosciences: Neural Networks (NN), Genetic Algorithms (GA), Formal Grammars (FG)

Genetic Algorithms
Search or optimization methods using simulated evolution. Population of potential solutions is subjected to natural selection, crossover, and mutation.

choose initial population
evaluate each individual's fitness
repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
until terminating condition

Genetic Algorithms
[Flowchart: START, INITIALIZATION, EVALUATION, SOLUTION? (YES: STOP; NO: REPRODUCTION, CROSS-OVER, MUTATION, back to EVALUATION)]
[Figure: Crossover -- Parent A and Parent B exchange segments at the crossover point, giving Child AB and Child BA; Mutation]

Genetic Algorithms Applications
GA simulation of folding
[Figure: parent and child bit strings with the corresponding chain conformations]
Membrane binding domain of Blood Coagulation Factor VIII (J. Moult)
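The genetic algorithm loop described above can be sketched as follows. The slides leave the operators unspecified, so everything concrete here is an assumption made for illustration: the toy fitness function (number of 1-bits, often called OneMax), tournament selection, single-point crossover, the mutation rate and the population size.

```python
import random


def fitness(bits):
    """Toy objective (an assumption of this sketch): the number of 1-bits."""
    return sum(bits)


def tournament(population, rng, size=3):
    """Selection: the fittest of a few randomly drawn individuals reproduces."""
    return max(rng.sample(population, size), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: the parents swap tails at a random crossover point."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]


def mutate(bits, rng, rate=0.02):
    """Flip each bit independently with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def genetic_algorithm(length=20, pop_size=30, generations=40, seed=1):
    rng = random.Random(seed)
    # Choose the initial population and evaluate it.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]  # best fitness seen so far, one entry per generation
    for _ in range(generations):
        if fitness(best) == length:  # terminating condition: optimum reached
            break
        children = []
        while len(children) < pop_size:
            parent_a = tournament(population, rng)
            parent_b = tournament(population, rng)
            child_ab, child_ba = crossover(parent_a, parent_b, rng)
            children += [mutate(child_ab, rng), mutate(child_ba, rng)]
        population = children[:pop_size]
        best = max(population + [best], key=fitness)
        history.append(fitness(best))
    return best, history


best, history = genetic_algorithm()
print(fitness(best), history)
```

Keeping the best individual seen so far is a convenience of this sketch rather than part of the pseudocode; it makes the reported fitness non-decreasing even though each new population fully replaces the old one.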

Artificial Intelligence in Biosciences: Neural Networks (NN), Genetic Algorithms (GA), Formal Grammars (FG)

Grammars and Language
grammar, n. 1. the study of the way the sentences of a language are constructed... 4. Generative Gram. a device, as a body of rules, whose output is all of the sentences that are permissible in a given language, while excluding all those that are not permissible.
Random House Unabridged Dictionary

Language Components
Semantics (meaning)
Syntax (structure, form)

Language Syntax
Alphabet: Primitive elements (letters, phonemes)
Vocabulary: Elements composed from the alphabet (words, phrases, sentences)
Grammar: Legal composition of vocabulary (rules, operators)

Semantics
Derived from syntax
Semantic content derived from vocabulary within a context
Vocabulary element has its own meanings: dictionary lookup; meanings depending on context
Time flies like an arrow
Fruit flies like a banana

Formal Grammars
formal grammar: a means for specifying the syntactic structure of natural language by a set of transformation functions
Chomsky hierarchy (for string grammars)
type 0: phrase structure
type 1: context sensitive
type 2: context free (SCFG)
type 3: regular (Hidden Markov models)
Chomsky, Syntactic Structures (1957)

Markov Model (or Markov Chain)
Probability for each character based only on several preceding characters in the sequence
# of preceding characters = order of the Markov Model
Probability of a sequence $s = s_1 s_2 \ldots s_6$ with $s_3 = C$ and $s_6 = G$, under a second-order model:
$P(s) = P[s_1] \, P[s_1, s_2] \, P[s_1, s_2, s_3] \, P[s_2, s_3, s_4] \, P[s_3, s_4, s_5] \, P[s_4, s_5, s_6]$
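The sequence probability under a Markov model of a given order can be sketched as a product of conditional probabilities estimated from counts. This is a hedged illustration: the training sequences are hypothetical, the estimates are plain relative frequencies without smoothing, and the first characters of a sequence simply use whatever shorter context is available, as in the product above.

```python
from collections import defaultdict


def train(sequences, order=2):
    """Count how often each character follows each context of up to `order` characters."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, ch in enumerate(seq):
            # Contexts of length 0..order ending just before position i.
            for k in range(min(i, order) + 1):
                counts[seq[i - k:i]][ch] += 1
    return counts


def sequence_probability(seq, counts, order=2):
    """P(s) as a product of P(character | up to `order` preceding characters)."""
    p = 1.0
    for i, ch in enumerate(seq):
        context = seq[max(0, i - order):i]
        total = sum(counts[context].values())
        if total == 0:
            return 0.0  # context never seen in training (no smoothing in this sketch)
        p *= counts[context][ch] / total
    return p


# Hypothetical training data, for illustration only.
training = ["ATCATG", "ATCTTG", "TTCATG"]
counts = train(training, order=2)
p = sequence_probability("ATCATG", counts, order=2)
print(p)
```

A practical model would add pseudocounts so that an unseen context or character does not force the whole product to zero.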

Hidden Markov Models
[Figure: example HMM with observed emission frequencies in each state and transition probabilities on the edges]
States -- well defined conditions
Edges -- transitions between the states
[Figure: example sequences and their alignment]
Probabilistic model -- true state is unknown
Each transition is assigned a probability.
Probability of the sequence:
  single path with the highest probability -- Viterbi path
  sum of the probabilities over all paths -- computed by the forward algorithm (the basis of the Baum-Welch training method)

Hidden Markov Models probabilities
$\text{log-odds} = \log \frac{P(S)}{0.25^L}$
Adopted from Anders Krogh, 1998

Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)
Adopted from S. Salzberg, 1997

Hidden Markov Model in Structural Analysis
[Figure: Markov state]
HMM topology from merging of two motifs, the extended Type-I hairpin motif and the Serine hairpin.
Adopted from C. Bystroff et al, 2000, JMB, 301, 173

Hidden Markov Model in Structural Analysis
A hidden Markov model consists of Markov states connected by directed transitions. Each state emits an output symbol, representing sequence or structure. There are four categories of emission symbols in our model: b, d, r, and c, corresponding to amino acid residues, three-state secondary structure, backbone angles (discretized into regions of phi-psi space) and structural context (e.g. hairpin versus diverging turn, middle versus end-strand), respectively.
Adopted from C. Bystroff et al, 2000
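The two ways of scoring a sequence with an HMM (the single best path versus the sum over all paths) can be compared on a small example. The two-state model below, its state names and all of its probabilities are hypothetical values chosen for illustration; they are not taken from the slides' figures. The log-odds line assumes the null model is four equally likely nucleotides, hence the factor 0.25 per position.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return (probability, path) of the single most probable state path."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        new_best = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            new_best[s] = (prob * emit[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])


def forward(obs, states, start, trans, emit):
    """Return the total probability of obs, summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model of base composition.
states = ("AT-rich", "GC-rich")
start = {"AT-rich": 0.5, "GC-rich": 0.5}
trans = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.2, "GC-rich": 0.8},
}
emit = {
    "AT-rich": {"A": 0.35, "T": 0.35, "C": 0.15, "G": 0.15},
    "GC-rich": {"A": 0.15, "T": 0.15, "C": 0.35, "G": 0.35},
}

obs = "GGCATA"
best_prob, path = viterbi(obs, states, start, trans, emit)
total = forward(obs, states, start, trans, emit)
log_odds = math.log(total / 0.25 ** len(obs))
print(path, best_prob, total, log_odds)
```

For long sequences these products underflow, so real implementations usually work with log probabilities or rescale the forward values at each step.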

Comparison of AI methods
Artificial Intelligence in Biosciences
Other machine learning algorithms: Support vector machines, Decision trees, Random forests
Olden et al., 2008

Support Vector Machines (SVM) Algorithm
Decision surface is a hyperplane (line in 2D, plane in 3D, etc.) in feature space
Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize margin
Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications
Map data to high dimensional space where it is easier to classify with linear decision surfaces: reformulate problem so that data is mapped implicitly to this space
Aliferis & Tsamardinos

Support Vector Machines (SVM)
[Figure: linear SVM and non-linear SVM decision surfaces in the Var 1 / Var 2 plane, with the margin width marked]
Aliferis & Tsamardinos
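The "maximize margin, with a penalty term for misclassifications" formulation can be sketched for the linear case as minimizing a regularized hinge loss. The code below swaps in plain subgradient descent for the quadratic-programming solvers normally used, and it does not cover the kernel mapping; the toy data, the regularization strength and the learning rate are assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))) by subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the margin (regularization) term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified: hinge penalty is active
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls (+1 or -1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable data in the (Var 1, Var 2) plane.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [classify(w, b, x) for x in X]
print(w, b, predictions)
```

Only points with margin below 1 contribute to the update, which mirrors the role of support vectors: points far from the boundary do not influence the hyperplane.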

Applications of ML methods
Discrimination between regulatory ChIP-seq peaks and flanking regions within a single cell type using a support vector machine
Arvey et al., 2012

Applications of ML methods
Mapping in topological space