Machine Learning and Applications in Finance


Christian Hesse (1, 2, *)
  1 Autobahn Equity Europe, Global Markets Equity, Deutsche Bank AG, London, UK, christian-a.hesse@db.com
  2 Department of Computer Science, University College London, London, UK, c.hesse@ucl.ac.uk
  * The opinions and ideas expressed in this presentation are those of the author alone, and do not necessarily reflect the views of Deutsche Bank AG, its subsidiaries or affiliates.
Christian Hesse, MATLAB Computational Finance Conference, 24 June 2014, London, UK

Outline
  Machine Learning Overview
  Unsupervised Learning
  Supervised Learning
  Practical Considerations
Recommended Reading:
  C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
  T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2009

Machine Learning
  Machine learning is concerned with the design and development of data-driven algorithms able to identify and describe complex structure in data.
  Machine learning algorithms are designed to tackle:
    High-dimensional data
    Noisy data
    Data corrupted by artifacts
    Data with missing values
    Data with small sample size
    Non-stationary data (i.e., structural changes in the data-generating process)
    Non-linear data
  Machine learning techniques have been successfully applied in many areas of science, engineering and industry, including finance.
  Related fields: computer science, artificial intelligence, neural networks, statistics, signal processing, computational neuroscience

Machine Learning Approach
  Machine learning is generic
    Data is just numbers
    Data can be structured
  Machine learning is data-driven
    Data structure is defined as a model
    Model has minimal or generic assumptions!
    Model parameters estimated from data
  Machine learning is robust
    Maximize performance on unseen data
    Consistent performance
    Complexity control (avoid over-fitting)
    Reliable and efficient parameter estimation

Machine Learning Problems
  Unsupervised Learning
    Identifying component parts of data
    Representing different data features
  Unsupervised Learning Methods
    Dimension estimates/reduction
    Decomposition methods
    Clustering methods
  [Diagram: Components; variables x1, x2, xn]
  Supervised Learning
    Mapping one data part onto another
  Supervised Learning Methods
    Regression
    Ranking
    Classification
    Feature Selection
  [Diagram: Mapping; variables x1, x2, xn]

Decomposition Methods
  Orthogonal de-correlating transforms (Gaussian mixtures): x = As + n
    PCA, SVD, Whitening
    Probabilistic PCA and Factor Analysis
  Non-orthogonal un-mixing transforms (non-Gaussian mixtures): x = As + n
    Independent component analysis (ICA)
    Probabilistic and noisy ICA
  Factorization and coding methods: X = WH
    Non-negative matrix factorization (NMF), etc.
    Dictionary learning and sparse coding
  Applications
    Dimension reduction, regularization and de-noising
    Feature extraction
  Applications in Finance (portfolio optimization)
    Risk factor models, covariance matrix regularization
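As an illustration of the orthogonal de-correlating transforms listed above, the following is a minimal sketch of PCA whitening computed through the SVD. The function name pca_whiten, the synthetic data and the noise level are assumptions made for this example and are not taken from the talk.

```python
import numpy as np

def pca_whiten(X, n_components):
    """Project rows of X onto the leading principal axes and rescale to unit variance."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    n = X.shape[0]
    scale = s[:n_components] / np.sqrt(n - 1)    # standard deviation of each component
    Z = Xc @ Vt[:n_components].T / scale         # de-correlated, unit-variance scores
    return Z, Vt[:n_components], scale

# Synthetic data following x = As + n, stored one observation per row (illustrative only).
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 2))                    # two latent sources
A = rng.normal(size=(5, 2))                      # mixing matrix
X = S @ A.T + 0.01 * rng.normal(size=(500, 5))   # small additive noise

Z, components, scale = pca_whiten(X, n_components=2)
cov_Z = np.cov(Z, rowvar=False)                  # close to the 2 x 2 identity
```

Keeping only the leading components is what gives the dimension reduction and de-noising effect mentioned on the slide; how many to keep is a model order choice that this sketch leaves to the caller.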

Clustering Methods
  K-Means Clustering
    Distance metric (Euclidean, city-block, cosine)
    Number of centres
  Probabilistic Clustering
    Mixture of Gaussians (spherical covariance)
    Mixture of Probabilistic PCA
    Mixture of Factor Analysers
  Non-Gaussian Clusters
    Mixture of ICA models
    Mixture of von Mises-Fisher distributions
  Time Series Clusters
    Clusters reflect states
    State transitions and Hidden Markov Models (HMM)
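The two k-means design choices named above (distance metric and number of centres) show up directly in code. The sketch below uses squared Euclidean distance and a farthest-first initialization; both are choices made for this example, and the function names and toy data are hypothetical rather than anything shown in the talk.

```python
import numpy as np

def sq_dists(X, centres):
    """Squared Euclidean distance from every row of X to every centre."""
    return ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Farthest-first initialization: random first centre, then the point
    # farthest from all centres chosen so far (an assumption of this sketch).
    centres = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[d2.argmax()])
    centres = np.array(centres, dtype=float)

    for _ in range(n_iter):
        labels = sq_dists(X, centres).argmin(axis=1)   # assignment step
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                       # keep old centre if cluster is empty
                new_centres[j] = members.mean(axis=0)  # update step
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels = sq_dists(X, centres).argmin(axis=1)       # final assignment
    return centres, labels

# Two well-separated toy clusters (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
centres, labels = kmeans(X, k=2)
```

Swapping in a city-block or cosine metric would change both the assignment step and the appropriate centre update (for example, medians rather than means for city-block), so the metric is not a drop-in parameter here.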

Application: Volume Analysis [Figure; source: Bloomberg]

Intra-day Volume Profiles [Figure; source: Bloomberg]

Volume Profile Analysis
  Motivation
    Examination of market structure
    Importance in algorithmic trading
  Data Set
    Stocks: constituent names of the STOXX Europe 50 Index
    Period: Dec 2010 - May 2014
    Intra-day trading volumes from primary exchange aggregated over 5-minute buckets
    Normalized volume profiles (density)
  Analysis Techniques
    K-Means Clustering
    Non-negative Matrix Factorization
    Different initialization methods
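The normalization step in the data set description can be sketched as follows: each day's bucketed volumes are divided by that day's total, so every profile sums to one and can be read as a density over the trading day. The function name and the tiny four-bucket example are assumptions for illustration; the real data would have one column per 5-minute bucket.

```python
import numpy as np

def normalize_profiles(volumes):
    """Turn a (days x buckets) array of traded volumes into per-day densities."""
    volumes = np.asarray(volumes, dtype=float)
    totals = volumes.sum(axis=1, keepdims=True)   # total volume per day
    if np.any(totals <= 0):
        raise ValueError("each day needs a positive total volume")
    return volumes / totals

# Hypothetical volumes: two days, four buckets each.
raw = np.array([[100, 50, 50, 200],
                [10, 10, 10, 10]])
profiles = normalize_profiles(raw)
```

Normalizing removes the overall level of activity, so clustering or factorization of the profiles responds to the intra-day shape rather than to how busy a given day was.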

Volume Profile Cluster Analysis [figures, 2 slides]

Volume Profile Decomposition [figures, 2 slides]

Volume Profile Analysis Summary
  Both approaches are sensitive to fundamental characteristics of the volume data, e.g., special days, US market effects, intra-day skews
    K-means provides an exemplar-based representation
    NMF provides a reduced sum-of-parts representation
    Unclear which is more desirable/useful
  Open issues and ongoing work
    Intelligent initialization of both methods is important
    What is the best distance metric to use for k-means here?
    What is the most appropriate model order selection approach here?
    Vanilla NMF results exhibit spurious zeros, so consider extensions of NMF
    Behaviour on data from less liquid stocks
  Applications
    Feature extraction for intra-day volume prediction
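To make the sum-of-parts idea concrete, here is a minimal sketch of NMF, X ≈ WH, using the standard multiplicative updates for the squared-error objective. The talk does not say which NMF algorithm was used, so this choice, the function name nmf and the synthetic data are all assumptions of the example.

```python
import numpy as np

def nmf(X, rank, n_iter=500, seed=0, eps=1e-12):
    """Factor a non-negative matrix X (n x m) as W (n x rank) times H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps    # random non-negative initialization
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative data with two underlying parts (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((30, 2)) @ rng.random((2, 12))
W, H = nmf(X, rank=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

In the volume setting each row of H would play the role of a basis profile and each row of W the non-negative weights for one day. The result depends on the random start, which is consistent with the slide's point that initialization matters.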

Supervised Learning
  Data structure
    Most of the data, X, is just numbers
    Part of the data, Y, is annotation
    Pairs (X, Y) reflect a mapping of X onto Y
    Learning task: find the mapping Y = F(X)
  The nature of the learning task depends on the scale that Y is measured on
    Y is metric: regression task
    Y is ordinal: ranking task
    Y is nominal: classification task
  What kind of mapping is F?
    Linear or non-linear map
    Exemplar or kernel based
  How complex and reliable is the map?
    Feature (variable) selection
    Regularization and stability
  [Diagram: Mapping; variables x1, x2, xn]

Application: Index Forecasting
  [Figure]
  Data Source: Bloomberg, Thomson Reuters

Classifying Future Index Moves
  [Figure panels: Class; Features: Macroeconomic Time Series]
  Data Source: Bloomberg, Thomson Reuters

Classifier Evaluation
  Out-of-Sample Evaluation Procedure
    [Diagram: train and test segments at Step 1, Step 2, ..., Step T along the data time line]
  Measure aggregate proportion of correct predictions (hit rate)
  Compare with guessing, naïve benchmarks and/or other classifiers
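The out-of-sample procedure above can be sketched as a generator of time-ordered train/test index pairs plus an aggregate hit rate. The window sizes, the growing flag, the toy label series and the majority-class benchmark are all hypothetical choices for this example; the talk does not give the actual settings.

```python
def walk_forward_splits(n_samples, train_size, test_size, growing=False):
    """Yield (train_indices, test_indices); test data always follows train data in time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_start = 0 if growing else start      # growing vs moving window
        train_end = start + train_size
        yield (list(range(train_start, train_end)),
               list(range(train_end, train_end + test_size)))
        start += test_size                         # step forward by one test block

def hit_rate(y_true, y_pred):
    """Proportion of correct predictions."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

# Toy up/down labels and a naive benchmark that predicts the majority
# training class (ties go to +1); both are illustrative only.
y = [1, 1, -1, 1, 1, -1, -1, 1, -1, -1]
y_true_all, y_pred_all = [], []
for train_idx, test_idx in walk_forward_splits(len(y), train_size=4, test_size=2):
    majority = 1 if sum(y[i] for i in train_idx) >= 0 else -1
    y_true_all += [y[i] for i in test_idx]
    y_pred_all += [majority] * len(test_idx)

rate = hit_rate(y_true_all, y_pred_all)            # pooled over all steps
```

Pooling predictions across steps before computing the hit rate matches the "aggregate proportion" wording; averaging per-step hit rates would give the same number only when every test block has the same length.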

Feature Selection: Linear Methods
  [Figure panels: discriminative features, train and test]
  Data Source: Bloomberg, Thomson Reuters

Feature Selection: Kernel Methods
  [Figure panels: discriminative features, train and test]
  Data Source: Bloomberg, Thomson Reuters

Classification Results
  HSI 2-class problem
    Best out-of-sample hit rate for 2-class case is 0.67
    Statistically significantly better than guessing and naïve benchmarks
  Confusion Matrix (rows: observed class; columns: predicted class)
                    predicted -1    predicted 1
    observed -1        0.6429          0.3571
    observed  1        0.3029          0.6971
  Classification seems to perform well, but is it good enough?
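Each row of the confusion matrix above sums to one, which suggests it is normalized by observed class. A minimal sketch of that computation follows; the function name and the toy labels are hypothetical and are not the HSI data.

```python
def confusion_matrix_rows(y_true, y_pred, classes=(-1, 1)):
    """Return rates[observed][predicted], normalized within each observed class."""
    counts = {c: {p: 0 for p in classes} for c in classes}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    rates = {}
    for c in classes:
        total = sum(counts[c].values())
        rates[c] = {p: (counts[c][p] / total if total else float("nan"))
                    for p in classes}
    return rates

# Toy labels (illustrative only): 4 observed down-moves, 6 observed up-moves.
y_true = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, 1, 1, 1, -1, -1]
rates = confusion_matrix_rows(y_true, y_pred)
overall = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

The overall hit rate is the class-frequency-weighted average of the diagonal entries, so a row-normalized matrix alone does not determine it unless the class proportions are also known.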

Practical Considerations
  Data Quality and Quantity
    Garbage in, garbage out
    Missing values, repeated values
    Some methods are impractical for very large datasets
    Regularization can help when fitting large models to small datasets
    Sample biases affect model estimates
  Performance Evaluation
    Cross-validation tests
    Out-of-sample tests (moving window, growing window)
  Application Domain Knowledge
    Remains critically important
    Required to define the learning problem (e.g., labels for classification)
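The point about regularization and small datasets can be illustrated with ridge regression, used here as one common example of regularization; the talk does not name a specific method. With more features than samples the ordinary least-squares normal equations are singular, while the ridge system has a unique solution. The function name, data sizes and penalty value are assumptions of this sketch.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge weight vector w."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical small-sample problem: 10 observations, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
y = rng.normal(size=10)

gram_rank = np.linalg.matrix_rank(X.T @ X)   # at most 10, so X^T X is singular
w = ridge_fit(X, y, lam=1.0)                 # still well defined thanks to the penalty
```

The penalty lam trades bias for stability and would normally be chosen with the cross-validation or out-of-sample tests listed on the slide.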