
Investigating linear regression

September 16, 2012

Here’s a great question to get students thinking about linear regression, found by Michael Pershan, by way of the PCMI problem sets.

Here’s how I used this problem in my class. I started by asking pairs of students to find an equation for the line of best fit, and gave students around 10 minutes to work on this problem.

I then asked students to list their equations of fit, and we wrote all of them on the board while simultaneously plotting them on Desmos on the projector.

Then I had a few groups explain how they came up with their lines, and we got all sorts of interesting procedures. Some averaged all the points and then drew a line between the first point and the average point. Another group found the midpoints of the segments connecting the various points and then drew a line through those midpoints. Another averaged the slopes of the segments connecting the midpoints and then tried to forecast backward to find a y-intercept.

At this point, we discussed a need to find a way to compare the lines to determine which was the better fit. I offered up a line that was clearly a terrible fit, y=-3x+4, and asked if any students had a method for proving that their line was better than mine.

Eventually, a student mentioned that we should be able to measure the distance from a data point to a potential line, and that if we added up the distances from all the data points to the line, we could use that total distance as a measure of the quality of the fit.

Then we went through an exercise to show how this works for just a single point and a line: finding the perpendicular line through the data point, finding the point of intersection between the fit line and the perpendicular, and then finding the distance between the data point and the intersection. This turned out to be a great review of some more linear concepts, such as finding the perpendicular slope, and it reinforced the primacy of the point-slope form of a line. Still, we all agreed that this process, while easy to describe and not too complicated to do, was incredibly tedious and called for a computer to automate it.
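The three steps above can be sketched in a few lines of Python. This is only an illustration, not something we wrote in class: the function name and the sample point are my own assumptions, and the intersection formula in step 2 is what you get by setting the perpendicular line equal to the fit line and solving for x.

```python
import math

def perpendicular_distance(m, b, x0, y0):
    """Shortest distance from the point (x0, y0) to the line y = m*x + b.

    Hypothetical helper that follows the three classroom steps.
    """
    # Step 1: the perpendicular through the data point has slope -1/m,
    # so in point-slope form it is  y - y0 = (-1/m) * (x - x0).
    # Step 2: setting that equal to m*x + b and solving for x gives the
    # intersection below (this form also works when m == 0).
    x_int = (x0 + m * (y0 - b)) / (m * m + 1)
    y_int = m * x_int + b
    # Step 3: distance between the data point and the intersection.
    return math.hypot(x0 - x_int, y0 - y_int)

# Example with the deliberately bad line y = -3x + 4 and a made-up point (1, 4).
print(perpendicular_distance(-3, 4, 1, 4))
```

Doing this once by hand is a nice review; doing it for every point and every candidate line is exactly the tedium a loop removes.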

I then mentioned that we might be able to simplify things even further if we used just the vertical distance y_{data}-y_{line} rather than computing the shortest possible distance, and I introduced this least squares demo from the Wolfram Demonstrations project.
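Here is a rough sketch of the vertical-distance idea in Python. The data points are made up for illustration (they are not the points from the original problem), and the function names are my own. The first function scores any candidate line by summing the squares of y_{data}-y_{line}; the second uses the standard closed-form least squares formulas for the slope and intercept.

```python
def sum_squared_vertical(m, b, points):
    """Sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

def least_squares_line(points):
    """Slope and intercept that minimize the sum of squared vertical distances."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Standard formulas: slope = S_xy / S_xx, and the line passes
    # through the point (mean_x, mean_y).
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data, for illustration only.
points = [(1, 2), (2, 3), (3, 5), (4, 4)]

m, b = least_squares_line(points)
print("least squares line:", m, b)
print("score of least squares line:", sum_squared_vertical(m, b, points))
print("score of y = -3x + 4:", sum_squared_vertical(-3, 4, points))
```

Any student line can be dropped into `sum_squared_vertical` to see how it compares, which gives a concrete answer to "prove your line is better than mine."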

Finally, we had a little bit of extra time left in class, so we looked at some data we collected on airline flights, inspired by Dan Meyer’s Air Travel lesson. I created a simple Google doc for another class and asked them to add information for 5 nonstop flights from Philly. Unfortunately, when we looked at the data we saw people had made all kinds of data entry errors, so I opened up the spreadsheet to the class, and we did a quick 5-minute data cleaning operation. From there, I asked students to make a prediction on a small whiteboard about what a time vs. distance graph might look like for this data. That sparked a good conversation about whether the relationship would pass through (0,0) and whether the “fact” that longer flights are flown by faster airplanes might cause the data to appear to curve downward for very long flights. Pretty quickly we were able to graph the data in Excel and see a beautiful linear relationship with a non-zero vertical intercept, indicating that airlines are building taxiing, takeoff, and delay times into their schedules.

Final thoughts

While I was pleased with how this conversation went, I would really like to use the moment where we decided a computer should be able to find the line of best fit for us as an opportunity to do just that: designing a linear regression program using our algorithm in Python. I’ve just been too busy recently to put anything together, but I think this could be a great place where students could be given a nearly complete program and simply add a few lines of code to calculate the distance from each point to the line and sum those distances.
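To give a sense of what that nearly complete program might look like, here is a minimal sketch. Everything in it is an assumption on my part: the function names, the made-up data, and the candidate lines standing in for student proposals. The body of `point_to_line_distance` is the part students would fill in; I've used the standard point-to-line distance formula, which gives the same result as the perpendicular-line procedure we did by hand.

```python
import math

def point_to_line_distance(m, b, x0, y0):
    """Distance from (x0, y0) to the line y = m*x + b."""
    # --- students would write this line ---
    # Rewrite the line as m*x - y + b = 0 and apply the
    # point-to-line distance formula.
    return abs(m * x0 - y0 + b) / math.sqrt(m * m + 1)

def total_distance(m, b, points):
    """Our class's measure of fit: the sum of distances from all points."""
    # --- students would write this line too ---
    return sum(point_to_line_distance(m, b, x, y) for x, y in points)

def rank_lines(candidates, points):
    """Return the candidate (m, b) pairs sorted from best fit to worst."""
    return sorted(candidates, key=lambda line: total_distance(line[0], line[1], points))

# Hypothetical data and hypothetical student lines, for illustration only.
points = [(1, 2), (2, 3), (3, 5), (4, 4)]
candidates = [(-3, 4), (1, 1), (0.8, 1.5)]

for m, b in rank_lines(candidates, points):
    print(f"y = {m}x + {b}: total distance {total_distance(m, b, points):.3f}")
```

Keeping the distance calculation in its own function means a class could later swap in the vertical distance, or its square, and see whether the ranking of lines changes.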

In case you’re interested, I recorded this class. Here’s the lesson, from start to finish on video.

2 Comments
  1. September 16, 2012 11:20 pm

    Thanks for sharing!

    For me the most important reason not to find the perpendicular distance is that if the quantities graphed have different units, one needs a metric to say what distance is.

    • September 17, 2012 8:36 am

      Brian,
      This is a great point I didn’t even think of. I’ll be sure to bring it up with my students.
