Linear Regression: Predicting House Prices

I am a big fan of Kalid Azad's writings. He has a knack for explaining hard mathematical concepts like calculus in simple words, and he helps readers get the intuition behind the idea. A couple of days back I was reading his book on calculus, and I came across the following passage:

What's a better learning strategy: covering a subject in full detail from top-to-bottom, or progressively sharpening a quick overview?

The better way to learn is to use the idea of progressive rendering: get a rough outline as quickly as possible, then gradually improve your understanding of the subject over time. This approach helps to keep our interest alive, lets us see the big picture, and shows how the individual parts are connected. This is the idea I am using to learn Machine Learning (ML). In the last post, I introduced the idea behind ML and why it's super important for machines to learn by themselves. If you haven't read my previous post, you can read it here. In this post, I will be opening the Pandora's box of ML. We will learn about Linear Regression, the granddaddy of all ML algorithms, and use it to predict house prices.

Imagine that you are planning to sell your house. How much should you sell it for? Is there a way to find out? One way is to look at the sales data of similar houses in your neighborhood and use that information to set your sale price. What do similar houses mean? We can use the properties (also called features) of your house, compare them with those of other houses, and pick the houses that closely match yours. Some examples of features are year-built, size, and no-of-bedrooms.

Let's keep things simple and use only the size (area-in-sqft) feature. Take a look at the recent sales data of 10 houses in your neighborhood. Suppose your house is 1,500 square feet; what should your sale price be? Before answering this question, let's find out if there is a positive correlation between size and price. If a change in house size (the independent variable) is associated with a change in house price (the dependent variable) in the same direction, then the two variables are positively correlated. A scatter plot is a great tool for visualizing the relationship between any two variables. From the chart we can see that there is a strong positive relationship between size and price, with a correlation coefficient of 0.97. Click here to learn more about the correlation coefficient.
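For readers who want to check such a number themselves, here is a minimal sketch of how the correlation coefficient can be computed with NumPy. The sales figures below are made-up illustrative values, not the actual data behind the chart:

```python
import numpy as np

# Hypothetical sales data: size in sq. ft. and sale price in dollars.
# These numbers are illustrative, not the dataset from the post.
sizes = np.array([850, 900, 1100, 1250, 1350, 1400, 1700, 1850, 2000, 2200])
prices = np.array([94000, 98500, 115000, 128000, 136000,
                   139500, 165000, 179000, 188000, 205000])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient of the two variables.
r = np.corrcoef(sizes, prices)[0, 1]
print(f"correlation coefficient: {r:.2f}")  # close to 1: strong positive relationship
```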

There is one issue with the recent sales data. None of the 10 houses sold recently has the same 1,500 sq. ft. as your house. Without this information, how do we come up with a sale price for your house? One idea is to use the average price of the houses that are closest in size to the house we are trying to sell. The problem with this approach is that we are only using the sale prices of 2 houses and throwing away the sales information from the remaining 8 houses. It might not be a big deal in this case, as we are using a single feature (house size). In real-life situations we will be using several features (size, year-built, no-of-bedrooms) to decide the sale price, and throwing away information from other houses is not an acceptable solution. Is there a better solution? There is, and we studied it in 9th grade mathematics. What if we fit a line through the data points and use that line to predict house prices? The line equation can be written as Price = w₀ + w₁ * Area to better reflect our house price prediction problem. Our goal is to find w₀ (the intercept) and w₁ (the slope). There are infinite possible values for w₀ and w₁, and this will result in infinite possible lines. Which line should we choose?

The idea is to choose the line that is closest to all the data points. Take a look at the chart shown below. Of the 2 lines, which one is a better predictor of the house price? Clearly line A is a better predictor than line B. Why is that? Visually, line A is closer to all the data points than line B. The next question is: what does visually closer mean? Can it be represented mathematically? Given below is the mathematical representation of visually closer with 2 houses. Our goal is to choose the line that minimizes the residual sum of errors. This happens when the predicted price (represented by the straight line) is close to the actual house price (represented by x).
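Since that representation appears as an image in the original post, here is the same idea written out, assuming the squared-error form used in the rest of the post:

$$\text{RSS}(w_0, w_1) = \big(\text{price}_1 - (w_0 + w_1 \cdot \text{area}_1)\big)^2 + \big(\text{price}_2 - (w_0 + w_1 \cdot \text{area}_2)\big)^2$$

Each squared term measures how far the line's predicted price is from the actual sale price of that house.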

Let's generalize the residual sum of errors from 2 to n houses and figure out a way to find the optimal values for w₀ and w₁. The worked-out generalization is given below. Our goal is to find the optimal values for w₀ and w₁ so that the cost function J(w) is minimized.
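Written out (the worked-out image is not reproduced here), the generalized cost function is the standard least-squares form; some texts scale it by 1/2n for convenience, which doesn't change where the minimum is:

$$J(w_0, w_1) = \sum_{i=1}^{n} \big(\text{price}_i - (w_0 + w_1 \cdot \text{area}_i)\big)^2$$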

People often say that a picture is worth a thousand words. Take a look at the 3-dimensional chart to get a better understanding of the problem we are trying to solve. The chart looks visually appealing, but I have a few problems with it. My brain doesn't interpret 3D charts well. Finding the optimal values for both w₀ and w₁ that minimize the cost function J(w) requires a good understanding of multivariate calculus. For a novice like me this is too much to handle; it's like forcing someone to build a car before driving it. I am going to simplify this by cutting down a dimension. Let's remove w₀ for now and make it a 2-dimensional chart. Finding the optimal value for a single variable w₁ doesn't require multivariate calculus, and we should be able to solve the problem with basic calculus.

How do we find the optimal value for w₁? One option is trial-and-error: try all possible values and pick the one that minimizes the cost function J(w). This is not a scalable approach. Why is that? Let's consider a house with 3 features. Each feature will have its own weight; let's call them (w₁, w₂, w₃). If each weight can take values from 1 to 1,000, then it will result in 1 billion evaluations. In ML, solving problems with 100+ features is very common. If we use trial-and-error, then coming up with the optimal weights will take longer than the age of our universe. We need a better solution. It turns out that our cost function J(w) is quadratic (y = x²), and it results in a convex shape (a U shape). Play around with an online graph calculator to see the convex shape of a quadratic equation. One important feature of a quadratic function is that it has only one global minimum instead of several local minima. To begin with, we will choose a random value for w₁. This value can be in one of three possible locations: right of the global minimum, left of the global minimum, or on the global minimum. Let's see how w₁ reaches the optimal value and minimizes the cost function J(w) irrespective of the location it starts in. The image given below shows how w₁ reaches the optimal value for all 3 cases. Here are a few questions that came to my mind while creating the image.

1. Why am I taking a derivative to find the slope at the current value of w₁, instead of using the usual method? The usual method of calculating slope requires 2 points, but in our case we have just a single point. How do we find the slope at a point? We need the help of derivatives, a fundamental idea from calculus. Click here to learn more about derivatives.

2. Why is the value of the slope positive for right-of-global-minimum, negative for left-of-global-minimum, and zero for on-global-minimum? To answer it yourself, I would highly recommend you practice calculating the value of a slope using 2 points. Click here to learn more about slope calculations.

3. What is the need for a learning factor alpha (α), and why should I set it to a very small value? Remember that our goal is to keep adjusting the value of w₁ so that we minimize the cost function J(w). Alpha (α) controls the step size, and it ensures that we don't overshoot our goal of finding the global minimum. A smart choice of α is crucial: when α is too small, it will take our algorithm forever to reach the lowest point, and if α is too big we might overshoot and miss the bottom.

The algorithm explained above is called Gradient Descent. If you're wondering what the word gradient means, read it as slope; they are one and the same. Using Python, I ran the Gradient Descent algorithm by initializing (w₁ = 0, α = 0.1) and running it for 2,000 iterations. The table given below shows how w₁ converged to the optimal value of 82.15. This is the value which minimizes the cost function. Note that in the first few iterations the value of w₁ adjusts quickly due to the steep gradient; in later iterations the value of w₁ adjusts very slowly. Google Sheets allows us to do linear regression and find the best-fit line. I used this feature on the house data, and the optimal value for w₁ came to 82.156. The chart given below shows the best-fit line along with the equation. This shows that the value I got from my Python code matches the value from Google Sheets.
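For reference, the update performed at every iteration can be written as follows. This is the standard gradient descent step for the simplified model without w₀, reconstructed from the description above rather than copied from the original image:

$$w_1 \leftarrow w_1 - \alpha \cdot \frac{dJ}{dw_1}, \qquad \frac{dJ}{dw_1} = -2 \sum_{i=1}^{n} \text{area}_i \big(\text{price}_i - w_1 \cdot \text{area}_i\big)$$

When w₁ is to the right of the minimum the derivative is positive and the update decreases w₁; to the left it is negative and the update increases w₁; at the minimum it is zero and w₁ stops moving.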

It took me 10 pages to explain the intuition behind linear regression, and that too for a single feature (size-of-house). But in reality a house has several features. Linear regression is very flexible, and it works for several features. The general form of linear regression is: w₀ + w₁ * feature₁ + w₂ * feature₂ + … + wₙ * featureₙ. The gradient descent algorithm finds the optimal weights (w₁, w₂, …, wₙ). Calculating the optimal weight for a single feature required us to deal with 2 dimensions; the second dimension is for the cost function. For 2 features we need to deal with 3 dimensions, and for N features we need to deal with (N+1) dimensions. Unfortunately, our brain is not equipped to deal with more than 3 dimensions, and my brain can handle only 2. Also, finding the optimal weights for more than 1 feature requires a good understanding of multivariate calculus. My goal is to develop a good intuition for linear regression. We achieved that goal by working out the details for a single feature, and I am going to assume that what worked for a single feature is going to work for multiple features. This is all great, but what has linear regression got to do with ML? To answer this question you need to take a look at the Python code. This code uses scikit-learn, a powerful open-source Python library for ML. Just a few lines of code find the optimal values for w₀ and w₁. The values for (w₀, w₁) exactly match the values from Google Sheets.
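The code itself appears as an image in the original post; the following is a minimal sketch of what those few lines plausibly look like, using the standard scikit-learn API. The data values are illustrative stand-ins for the post's 10 recent sales:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data: house sizes (sq. ft.) and sale prices ($).
# Stand-ins for the 10 neighborhood sales used in the post.
features = np.array([[850], [900], [1100], [1250], [1350],
                     [1400], [1700], [1850], [2000], [2200]])
labels = np.array([94000, 98500, 115000, 128000, 136000,
                   139500, 165000, 179000, 188000, 205000])

# Ordinary least squares: scikit-learn solves for the intercept (w0)
# and the slope (w1) directly, no hand-rolled gradient descent needed.
model = LinearRegression()
model.fit(features, labels)

print("w0 (intercept):", model.intercept_)
print("w1 (slope):", model.coef_[0])
```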

Machines can learn in a couple of ways: supervised and unsupervised. In the case of supervised learning, we give the ML algorithm an input dataset along with the correct answers. The input dataset is a collection of several examples, and each example is a collection of one-to-many features. The correct answer is called a label. For the house prediction problem, the input dataset had 10 examples and each example had 1 feature, and the label is the house price. Using the features and labels, also called the training data, the ML algorithm trains itself and generates a hypothesis function as output. For the house prediction problem it generated the hypothesis function: (82.156 * area-of-house + 24954.143). Why did I call the generated output a hypothesis function instead of a function? To answer this question we need to understand the method used by scientists to discover new laws. This method is called the scientific method, and Richard Feynman explains it beautifully in the video below.

Now I'm going to discuss how we would look for a new law. In general, we look for a new law by the following process. First, we guess it (audience laughter), no, don't laugh, that's the truth. Then we compute the consequences of the guess, to see what, if this is right, if this law we guess is right, to see what it would imply and then we compare the computation results to nature or we say compare to experiment or experience, compare it directly with observations to see if it works. If it disagrees with experiment, it's wrong. In that simple statement is the key to science. It doesn't make any difference how beautiful your guess is, it doesn't matter how smart you are who made the guess, or what his name is. If it disagrees with experiment, it's wrong. That's all there is to it. - Richard Feynman

A scientist would compare the law he guessed (the hypothesis) with the results from nature, experiment, and experience. If the law he guessed disagrees with the experiment, then he will reject his hypothesis. We need to do the same thing for our ML algorithm. The hypothesis function (82.156 * area-of-house + 24954.143) generated by our ML algorithm is similar to a scientist's guess. Before accepting it we need to measure the accuracy of this hypothesis by applying it to data that the algorithm didn't see. This data is called test data. The image given below explains how this process works. I translated the above image and came up with the Python code shown below. The variables featurestest and labelstest contain the test data. We never showed this data to our ML algorithm. Using this test data we are validating the hypothesis function (82.156 * area-of-house + 24954.143) generated by our ML algorithm. The actual house prices [115200, 123400] of the test data almost matched the predicted house prices [115326, 123541]. It looks like our hypothesis function is working.
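That code is also embedded as an image; a minimal reconstruction of the validation step might look like the sketch below. The actual prices are the ones quoted above, while the two test sizes (1,100 and 1,200 sq. ft.) are inferred, since they make the quoted hypothesis function reproduce the quoted predictions:

```python
import numpy as np

# Hypothesis function generated by the ML algorithm (from the post).
w0, w1 = 24954.143, 82.156

# Test data the algorithm never saw during training.
featurestest = np.array([1100, 1200])       # sizes in sq. ft. (inferred)
labelstest = np.array([115200, 123400])     # actual sale prices (from the post)

# Apply the hypothesis function to the unseen houses.
predictions = w0 + w1 * featurestest
print("predicted prices:", np.round(predictions))   # [115326. 123541.]
print("actual prices:   ", labelstest)
```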

Is there a summary statistic that tells us how well our predictions match the actual labels? R-squared is a summary statistic that measures the accuracy of our predictions. A score of 1 tells us that all our predictions exactly matched reality. In the above example the score of 0.998 is really good. This shouldn't be surprising, as I created the test labels using the hypothesis function and slightly bumped up the values. Take a look at the image below. It gives you the intuition behind how the R-squared metric is computed. The idea is very simple. If the predicted value is very close to the actual value, then the numerator is close to zero. This will keep the value of R-squared close to 1. Otherwise the value of R-squared moves far below 1.
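The image isn't reproduced here, but the standard definition it illustrates is:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where yᵢ are the actual labels, ŷᵢ are the predicted values, and ȳ is the mean of the actual labels. The numerator mentioned above is the sum of squared prediction errors; the denominator measures how much the labels vary on their own.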

So far I have been assuming that the relationship between the features and the label is linear. What does linear mean? Take a look at the general form of linear regression: w₀ + w₁ * feature₁ + w₂ * feature₂ + … + wₙ * featureₙ. None of the features has a degree (power) greater than 1. This assumption is not always true. There is a special case of multiple linear regression, called polynomial regression, that adds terms with degree greater than 1. The general form of polynomial regression is: w₀ + w₁ * feature₁ + w₂ * feature₁² + … + wₙ * feature₁ⁿ. Why do we need features with degree greater than 1? To answer this question, take a look at the house price prediction chart shown above. The trendline is a quadratic function, which is of degree 2. This makes a lot of sense: house price doesn't increase linearly as the square footage increases. The price levels off after a point, and this can only be captured by a quadratic model (a sketch of fitting one follows below). You can create a model with a very high degree polynomial, but we have to be very careful with high-degree polynomials, as they fail to generalize on test data. Take a look at the chart above. The model perfectly fits the training data by coming up with a very high degree polynomial, but it might fail to fit properly on the test data. What's the use of such a model? The technical term for this problem is overfitting. This is akin to a student who scored 100 on a calculus exam by rote memorization but failed to apply the concepts in real life. While coming up with a model we need to remember the Occam's razor principle: if there exist two explanations for an occurrence, the simpler one is usually better. All else being equal, prefer simpler models (optimal number of features and degree) over complex ones (too many features and higher degree).
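For the degree-2 case discussed above, here is a minimal sketch of how such a quadratic model could be fitted with scikit-learn; the PolynomialFeatures transformer and the data values are my choices for illustration, not code from the original post:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data; with real sales data the fitted curve would
# level off at large sizes, as described above.
sizes = np.array([[850], [1100], [1350], [1700], [2000], [2200]])
prices = np.array([94000, 115000, 136000, 165000, 188000, 205000])

# PolynomialFeatures expands [area] into [1, area, area^2]; fitting a
# linear model on the expanded features is polynomial regression of degree 2.
quadratic_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quadratic_model.fit(sizes, prices)

print(quadratic_model.predict([[1500]]))   # predicted price for a 1,500 sq. ft. house
```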

Here are a few points that I want to mention before concluding this post.

1. The hypothesis function produced by ML is generic. We used it for predicting house prices, but the hypothesis function is so generic that it can be used for ranking web pages, predicting wine prices, and several other problems. The algorithm doesn't even know that it's predicting house prices. As long as it gets features and labels, it can train itself to generate a hypothesis function.

2. Data is the new oil of the 21st century. Whoever has the best algorithms and the most data wins. For the algorithm to produce a hypothesis function that can generalize, we need to give it a lot of relevant data. What does that mean? Suppose I come up with a model to predict house prices based on housing data from the Bay Area. What will happen if I use it to predict house prices in Texas? It will blow up in my face.

3. I just scratched the surface of linear regression in this post. I have not covered several concepts like regularization (which penalizes overfitting), outliers, feature scaling, and multivariate calculus. My objective was to develop the intuition behind linear regression and gradient descent, and I believe I achieved it through this post.

References

1. Udacity: Linear Regression course material - https://goo.gl/6smlxu
2. Udacity: Gradient Descent course material - https://goo.gl/pbtcli
3. Math Is Fun: Introduction To Derivatives - https://goo.gl/d0tvzd
4. Betterexplained: Understanding the Gradient - https://goo.gl/j1vyv6
5. Gradient Descent Derivation - https://goo.gl/rxamy0
6. Machine Learning Is Fun - https://goo.gl/24he5k

Appendix: Gradient Descent Implementation In Python

Author: Jana Vembunarayanan
Website: https://janav.wordpress.com
Twitter: @jvembuna
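The implementation in the original appendix is an image; below is a minimal reconstruction of the single-weight version described in the post (w₁ initialized to 0, 2,000 iterations). The data and units are assumptions: sizes are in thousands of sq. ft. and prices in thousands of dollars so that a small fixed step size stays numerically stable, and the step size itself is scaled by n, which the original may or may not have done:

```python
import numpy as np

# Assumed data for the simplified model price = w1 * area (w0 removed, as in
# the post). Units: area in thousands of sq. ft., price in thousands of $,
# so the optimal w1 lands near 82, the scale of the value reported above.
areas = np.array([0.85, 0.90, 1.10, 1.25, 1.35, 1.40, 1.70, 1.85, 2.00, 2.20])
prices = np.array([70.0, 74.0, 90.5, 103.0, 111.0, 115.0,
                   139.5, 152.0, 164.5, 180.5])

def cost(w1):
    """Residual sum of squares J(w1) for the line price = w1 * area."""
    return np.sum((prices - w1 * areas) ** 2)

def gradient(w1):
    """Derivative dJ/dw1 of the cost with respect to w1."""
    return -2.0 * np.sum(areas * (prices - w1 * areas))

w1 = 0.0
alpha = 0.1 / len(areas)   # step size; scaling by n is an assumption
for _ in range(2000):
    w1 -= alpha * gradient(w1)

print(f"w1 = {w1:.2f}, J(w1) = {cost(w1):.2f}")   # w1 converges near 82
```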