Beware the Lurking Variable

I just completed the discussion on correlation and regression with my Introductory Statistics students. One of the recommendations within the new GAISE outline is to introduce students to multivariate analysis.  A classic application of this practice is the SAT score versus teacher salary data.

This data may be found by joining a group I created in StatCrunch titled “SullyStats”. To join the group, go to StatCrunch (if you don’t have a StatCrunch account, ask your Pearson representative for one). Under Explore, select Groups. Type SullyStats into the search box and join the group. The data set is titled “SAT versus Teacher Salaries”.

Use the data to illustrate the danger of considering only the relation that appears to exist between two variables. Draw a scatter diagram of Overall SAT score against Teacher Salary. What do you notice? What is the correlation coefficient between Teacher Salary and Overall SAT score?

Now introduce the variable “Percent Taking”. This is a qualitative variable with three levels: “low” means 22% or fewer of eligible students took the SAT; “med” means between 23% and 49% of eligible students took the SAT; “high” means at least 50% of eligible students took the SAT. Draw a scatter diagram using a different plotting symbol for each level of “Percent Taking.” Find the linear correlation coefficient between Teacher Salary and Overall SAT score within each level. What happens to the apparent association between Teacher Salary and Overall SAT score?
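If you would rather show the idea in code than in StatCrunch, here is a minimal Python sketch. The numbers are invented purely for illustration and are not the values in the SullyStats data set; the helper name pearson_r and the variable names are my own. The sketch computes the correlation for all rows together and then separately within each level of “Percent Taking”.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL rows, made up for this sketch (not the real data):
# (teacher salary in $1000s, overall SAT score, percent-taking level)
rows = [
    (25, 1020, "low"), (27, 1030, "low"), (29, 1045, "low"), (31, 1060, "low"),
    (30, 930, "med"), (32, 945, "med"), (34, 950, "med"), (36, 965, "med"),
    (35, 880, "high"), (38, 890, "high"), (41, 905, "high"), (44, 910, "high"),
]

# Correlation ignoring the lurking variable
overall_r = pearson_r([r[0] for r in rows], [r[1] for r in rows])

# Correlation within each level of the lurking variable
group_r = {}
for level in ("low", "med", "high"):
    subset = [r for r in rows if r[2] == level]
    group_r[level] = pearson_r([r[0] for r in subset], [r[1] for r in subset])

print(f"overall r = {overall_r:.2f}")
for level, r in group_r.items():
    print(f"{level:>4} r = {r:.2f}")
```

In this made-up example the overall correlation is negative while the correlation within every level is positive, which is the kind of reversal a lurking variable can produce. Have students run the same comparison on the actual data and describe what they see.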

Another area that I emphasize is the difference between deterministic relations and probabilistic relations. As an example, I ask students to pretend they have a job that pays $20 per hour. If the student is asked to work an additional hour, how much will the student earn? Of course, the answer is $20 (before taxes). This is a deterministic relation because the value of the response variable (earnings) can be determined with 100% certainty if the value of the explanatory variable (hours worked) is known.

Contrast this with probabilistic models. This is easily illustrated by telling students they are servers in a restaurant where they typically make $20 per hour. Each additional hour of work in this scenario does not guarantee an additional $20 in income. Why? Sometimes the server gets good tips, sometimes bad tips.

Now extend this idea to the interpretation of slope in regression. It is vital that students understand that any interpretation of the slope from a least-squares regression model must emphasize that the relation is not deterministic. For example, I use Zillow to build a model where the sale price of a home is the response variable and the Zestimate is the explanatory variable. If the slope in the regression model is 0.92, then we interpret the slope as follows: If the Zestimate increases by $1, the selling price of the home increases by $0.92, on average. The words “on average” are vital because the slope describes the typical change in selling price across the observed data, not a guaranteed change for any particular home.
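The contrast between the hourly job and the server can also be shown with a short simulation. The Python sketch below is only an illustration under assumptions of my own: the server's earnings are modeled as $20 per hour plus random tip noise drawn from a normal distribution with a standard deviation of $15, and the helper least_squares is a hypothetical name, not part of any library.

```python
import random


def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys, slope, intercept):
    """Observed value minus the value predicted by the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


random.seed(1)  # fixed seed so the simulation is repeatable
hours = list(range(1, 41))

# Deterministic: exactly $20 for every hour worked
hourly_pay = [20 * h for h in hours]

# Probabilistic (ASSUMED model): $20 per hour plus random tip noise
server_pay = [20 * h + random.gauss(0, 15) for h in hours]

det_slope, det_int = least_squares(hours, hourly_pay)
prob_slope, prob_int = least_squares(hours, server_pay)

det_resid = residuals(hours, hourly_pay, det_slope, det_int)
prob_resid = residuals(hours, server_pay, prob_slope, prob_int)

print(f"hourly job: slope = {det_slope:.2f}, "
      f"largest miss = {max(abs(e) for e in det_resid):.2f}")
print(f"server:     slope = {prob_slope:.2f}, "
      f"largest miss = {max(abs(e) for e in prob_resid):.2f}")
```

For the hourly job every point falls on the line, so knowing the hours gives the earnings exactly. For the server the fitted slope lands close to 20, but individual shifts miss the line, which is why the slope has to be read as a change “on average.”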