Python Machine Learning Step-by-Step: Modeling Financial Time Series Data

Reece Heineke, Director of Big Data, Credibly
February 27, 2017

What is Machine Learning?
Data Preparation: Overview, Python Toolbox, Trade Ideas to Data, Conclusion
Exploratory Data Analysis: Overview, Scatter Plot, Principal Component Analysis (PCA), Conclusion
Fitting Models: Overview, Models and Pipelines, Learning Curves, Interpretability, Conclusion
A Fitted Model

What is Machine Learning?
1. Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed.
2. There are two sides to every machine learning problem:

Data Preparation: Overview
Review the Python software stack
Motivate the problem
Discuss some issues specific to time series modeling

Python Toolbox (source: Scientific Python by Eueung Mulyana)

Trump2Cash (source: Trump2Cash GitHub Project)

Input: Trump criticizes Toyota on Twitter

Output: Toyota stock opens lower (source: Toyota Stock on Yahoo Finance's Interactive Chart)

WSJ Analysis of Trump Tweets, by Akane Otani and Shane Shifflett

IPython: A Data Scientist's Best Friend (Jupyter Notebook)

Data Preparation: Conclusion
We now have an illustrative data set to work with
Data set has 10 numeric dimensions: 9 inputs, 1 output
Data set is large (400MB compressed)

Exploratory Data Analysis: Overview
Covariance and Correlation Matrices
Scatter plots
Principal Component Analysis (PCA)
Kernel PCA
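As a minimal sketch of the first item, covariance and correlation matrices can be computed with NumPy. The data here is synthetic and only mimics the shape described in the talk (9 inputs, 1 output); the variable names and the sine relationship are assumptions for illustration, not the talk's data set.

```python
import numpy as np

# Synthetic stand-in data (NOT the talk's data set): 500 rows, 10 columns,
# where column 9 plays the role of the output.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 10))
data[:, 9] = np.sin(data[:, 0]) + 0.1 * rng.normal(size=500)

# rowvar=False treats each column as one variable.
cov = np.cov(data, rowvar=False)
corr = np.corrcoef(data, rowvar=False)

# Correlation of each input column with the output column.
print(np.round(corr[:9, 9], 2))
```

Correlation only measures linear association, so a nonlinear dependence can look weaker here than it really is; that is one reason to follow up with scatter plots.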

Using IPython (Jupyter Notebook)

Scatter Plot: What can we say about the data?

scikit-learn Algorithm Cheat-Sheet: Just looking (source: scikit-learn Cheat-Sheet)

Principal Component Analysis (PCA)
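A minimal PCA sketch with scikit-learn, assuming synthetic inputs in place of the talk's data set. Scaling first and keeping two components are illustrative choices, not settings taken from the talk.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 input dimensions (NOT the talk's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))

# PCA is sensitive to scale, so standardize each column first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```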

Kernel PCA with Radial Basis Function (RBF)
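A minimal Kernel PCA sketch with an RBF kernel. The concentric-circles data and the value of gamma are assumptions chosen for illustration; the talk's own data and settings are not shown in the slides.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Synthetic stand-in data: two concentric circles, a standard example of
# structure that a linear projection cannot separate.
X, labels = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# gamma=10 is an illustrative value, not a tuned one.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca[:3])
```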

Exploratory Data Analysis: Conclusion
Nonlinear relationships appear in the dimension pairs (0, 9), (2, 9), (6, 9)
All other dimensions look largely random

Fitting Models: Overview
scikit-learn's models and pipelines
Illustrative learning curves

scikit-learn Revisited (source: scikit-learn Cheat-Sheet)

scikit-learn Pipeline (source: Python Machine Learning by Sebastian Raschka)
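A minimal pipeline sketch: a scaler chained to an RBF-kernel SVR, so that preprocessing and model are fitted and applied together. The step names "scale" and "svr" and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the talk's data set): 9 inputs, 1 output.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Each step is a (name, estimator) pair; the last step is the model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])

# fit() runs fit_transform on the scaler, then fits the SVR on its output.
pipeline.fit(X, y)

# predict() applies the already-fitted scaler before calling the SVR.
predictions = pipeline.predict(X[:5])
print(predictions)
```

Keeping the scaler inside the pipeline means its statistics are learned only from the data passed to fit(), which avoids leaking information from held-out rows.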

Holdout Method (source: Python Machine Learning by Sebastian Raschka)
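A minimal holdout sketch on synthetic data. Passing shuffle=False is an assumption made here because the talk concerns time series: it keeps rows in their original order, so the test set is the most recent 20% rather than a random sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the talk's data set), rows assumed to be in time order.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# Hold out the last 20% of rows; shuffle=False preserves chronological order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.shape, X_test.shape)
```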

Cross-Validation (source: Python Machine Learning by Sebastian Raschka)
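A minimal cross-validation sketch on synthetic data. It shows plain k-fold alongside scikit-learn's TimeSeriesSplit, which only ever trains on rows that precede the validation rows; which splitter the talk used is not stated in the slides, so both are shown as assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the talk's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

model = DecisionTreeRegressor(max_depth=3, random_state=0)

# Standard 5-fold cross-validation; the default score for regressors is R^2.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))

# Time-ordered alternative: each fold validates on rows after its training rows.
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

print(kfold_scores.mean(), ts_scores.mean())
```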

Learning Curves: What do they tell us? (source: Python Machine Learning by Sebastian Raschka)
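A minimal sketch of how learning-curve data can be computed with scikit-learn's learning_curve. The model, training sizes, and synthetic data are assumptions for illustration; plotting is left out to keep the example short.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the talk's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# Score the model at five training-set sizes, each with 5-fold cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(max_depth=4, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Rows correspond to training sizes, columns to folds; average over folds.
for size, train, val in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(size, round(train, 3), round(val, 3))
```

A large, persistent gap between the training and validation scores suggests overfitting; two low scores that converge suggest underfitting.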

Poor fit: Linear Regression even with (K)PCA

Good fits: SVR (RBF) and Decision Tree Learning Curves

Classic Overfitting: Random Forest Regressor
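A minimal sketch of how overfitting shows up numerically: the training score sits well above the held-out score. The noisy synthetic data and the forest settings are assumptions for illustration, not the talk's experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Noisy synthetic stand-in data (NOT the talk's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = np.sin(X[:, 0]) + 0.5 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fully grown trees can fit much of the noise in the training rows.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# score() returns R^2; compare the training rows with the held-out rows.
train_r2 = forest.score(X_train, y_train)
test_r2 = forest.score(X_test, y_test)
print(round(train_r2, 3), round(test_r2, 3))
```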

Decision Trees: Easy to understand
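A minimal sketch of why a decision tree is easy to read: scikit-learn can print the fitted rules as text. The feature names x0 to x8, the tree depth, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (NOT the talk's data set): only x0 drives the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# A shallow tree keeps the rule set small enough to read at a glance.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Render the fitted splits as indented if/else style rules.
feature_names = [f"x{i}" for i in range(9)]
rules = export_text(tree, feature_names=feature_names)
print(rules)
```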

Fitting Models: Conclusion
Support Vector Regression (SVR) with a Radial Basis Function (RBF) kernel achieves higher accuracy
Decision Tree is easier to understand
Choice involves our own priors on the underlying structure

Second Half of Machine Learning: A Persistent Model (Jupyter Notebook)
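A minimal sketch of persisting a fitted model, assuming joblib as the serialization tool; the slides do not say which mechanism the talk's notebook uses. The file name model.joblib and the synthetic data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the talk's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Write the fitted model to disk and read it back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
print(np.array_equal(model.predict(X), restored.predict(X)))
```

Files written this way should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.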

Thanks for listening: Q&A https://github.com/rheineke/time series modeling