Summarizing Bivariate Data

In Sullivan Statistics, this material is broken into two main parts.  Part I is contained in Sections 4.1 to 4.3 where we summarize bivariate quantitative data.  That is, two quantitative variables are measured on each individual.  Part II is contained in Section 4.4 where we summarize bivariate qualitative data.

Section 4.1 Scatter Diagrams and Correlation

When I introduce the material in this section, I like to use data from Zillow.com.  I ask my students to tell me where they would like to live and then we go to that location in Zillow.  I type in some parameters for the house (such as number of bedrooms), and then I look for homes that were recently sold.  Finally, I "randomly" select about 15 homes and record the Zestimate and Selling Price of each.  Inevitably, the two variables are positively associated.  Use this data for all examples in Sections 4.1 to 4.3.

When introducing the linear correlation coefficient, I strongly encourage you to utilize the "Correlation by Eye" applet.  There are suggestions for how to use this applet in the Student Activity Workbook and Classroom Notes.
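Before students estimate r by eye, it can help to show the computation once.  The following is a minimal Python sketch, assuming a handful of invented Zestimate and selling-price pairs (in thousands of dollars).  The numbers and the function name correlation are illustrative only; they do not come from Zillow or from the text.

```python
import math

# Hypothetical (Zestimate, selling price) pairs in thousands of dollars.
# These values are invented for illustration; they are not real Zillow data.
zestimate = [250, 310, 275, 420, 365, 298]
selling_price = [245, 318, 270, 410, 372, 301]


def correlation(x, y):
    """Linear correlation coefficient as the average product of z-scores."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    z_products = (
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    )
    return sum(z_products) / (n - 1)


r = correlation(zestimate, selling_price)
print(f"r = {r:.3f}")
```

Because each term is a product of z-scores, points in the upper-right and lower-left of the scatter diagram contribute positive amounts, which is why positively associated data give r close to 1.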

Finally, be sure to emphasize the difference between correlation and causation.  There are silly examples of time series data that are highly correlated on this webpage.

Section 4.2 Least-Squares Regression

I like to draw a scatter diagram of my Zillow data and then find a line through two points that seems to describe the relation between the Zestimate and Selling Price.  Of course, the line I find depends on the two points selected, and someone else may have chosen different points.  This leads to the questions, "Whose line is better?" and "Is there a 'best' line?"  To answer them, we need a criterion for judging "best".

It was Adrien-Marie Legendre who suggested that the line of best fit is the line that minimizes the sum of squared errors.  You can allow students to visualize the least-squares regression line using the Regression by Eye applet in StatCrunch.  Select Applets > Regression > by eye.  Be sure to check the "Add boxes to display squared residuals" box so students can visualize the squared residuals.  There is an activity that uses this applet in the Classroom Notes.

Be sure to emphasize that the least-squares regression line is a probabilistic model.  As such, care must be taken in interpreting the slope.  The slope is the change in the response variable for a unit change in the explanatory variable, on average.  We say "on average" because individual observations vary about the line, so the slope describes the typical change in y for a unit change in x over the range of the observed data.
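The comparison between a line drawn through two points and the least-squares line can be sketched in a few lines of Python.  This is only an illustration, assuming invented Zestimate and selling-price values (in thousands of dollars); the data and the helper names sse, two_point_line, and least_squares_line are hypothetical and not taken from the text or from StatCrunch.

```python
# Hypothetical (Zestimate, selling price) data in thousands of dollars.
# Invented for illustration; not real Zillow data.
x = [250, 310, 275, 420, 365, 298]
y = [245, 318, 270, 410, 372, 301]


def sse(x, y, slope, intercept):
    """Sum of squared residuals for the line y-hat = slope * x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


def two_point_line(p, q):
    """Slope and intercept of the line through points p and q."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    intercept = p[1] - slope * p[0]
    return slope, intercept


def least_squares_line(x, y):
    """Slope and intercept that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# A line "by eye" through the first and fourth observations.
eye_slope, eye_intercept = two_point_line((x[0], y[0]), (x[3], y[3]))
ls_slope, ls_intercept = least_squares_line(x, y)

print("two-point line SSE:    ", round(sse(x, y, eye_slope, eye_intercept), 1))
print("least-squares line SSE:", round(sse(x, y, ls_slope, ls_intercept), 1))
```

Whichever two points are chosen, the least-squares line never has a larger sum of squared residuals, which is exactly the sense in which it is "best".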

Section 4.3  Coefficient of Determination and Residual Analysis

This section could be considered optional, but I find it important to present - especially residual analysis.  Look in the Classroom Notes for three data sets where each has the same variance in y.  This really helps students to grasp the concept of the coefficient of determination.
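The idea behind such data sets can be sketched in Python.  The three data sets below are invented for illustration and are not the ones in the Classroom Notes; they share the same y values (and therefore the same variance in y) but pair them with x differently.  The function name r_squared is hypothetical.

```python
def r_squared(x, y):
    """Coefficient of determination, 1 - SSE/SST, for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Unexplained variation: squared residuals about the regression line.
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    # Total variation: squared deviations of y about its mean.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Same y values in every data set, so the total variation in y is identical.
y = [2, 4, 6, 8, 10]
data_sets = {
    "A": [1, 2, 3, 4, 5],  # perfectly linear
    "B": [1, 2, 3, 5, 4],  # strong linear relation
    "C": [3, 5, 1, 2, 4],  # almost no linear relation
}

for name, x in data_sets.items():
    print(name, round(r_squared(x, y), 2))
```

Since the total variation in y is fixed, the only thing that changes from one data set to the next is how much of that variation the regression line explains.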

For influential observations, I also have an activity for students to develop a conceptual understanding of the material.  It uses the Regression Influence applet.

Section 4.4 Contingency Tables and Association

This section focuses on the association between two qualitative variables.  Here is some data you could use for the section based on a study done by PayScale.  The research question is, "Is household income while enrolled in college associated with mid-career income?"

Find marginal distributions, the conditional distribution by household income in college (the explanatory variable), and draw a conditional bar graph.

Be sure to cover Simpson's Paradox, which states that an apparent association between two variables inverts or goes away when a third variable is introduced to the analysis. This concept is one emphasized in the new GAISE report.

Here is an example for you to use.

Death Sentence: The following data represent the sentences imposed on offenders convicted of murder by race.

                  Jail Time    Death Sentence    Total
Black Offender    2498         28                2526
White Offender    2323         49                2372
Total             4821         77                4898

Source: John Blume, Theodore Eisenberg, and Martin T. Wells. “Explaining Death Row’s Population and Racial Composition,” Journal of Empirical Legal Studies, 1(1), 165-207, March 2004.

 

  (a) Which race of offender appears to receive a death sentence more often? Why?

 

The data in the table above do not take into account the race of the victim. The data below show the sentence of the offender by race of the victim.

 

                  Black Victim                   White Victim
                  Jail Time    Death Sentence    Jail Time    Death Sentence
Black Offender    2139         12                359          16
White Offender    100          0                 2223         49

 

  (b) Determine the proportion of black offenders who were given a death sentence by race of the victim. Determine the proportion of white offenders who were given a death sentence by race of the victim.
  (c) Repeat part (b) for offenders given jail time: for each race of offender, determine the proportion given jail time by race of the victim.
  (d) Draw a conditional bar graph of the conditional distributions from parts (b) and (c).
  (e) Write a report detailing your findings.
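If you want to check the proportions before class, the following Python sketch computes the death-sentence rates from the counts in the tables above, both overall and within each race of victim.  The counts are those given in the source tables; the variable and function names are illustrative only.

```python
# (jail time, death sentence) counts by offender race and victim race,
# copied from the second table above.
counts = {
    ("Black offender", "Black victim"): (2139, 12),
    ("Black offender", "White victim"): (359, 16),
    ("White offender", "Black victim"): (100, 0),
    ("White offender", "White victim"): (2223, 49),
}


def death_rate(jail, death):
    """Proportion of offenders in a group who received a death sentence."""
    return death / (jail + death)


# Overall rates: combine the two victim groups for each offender race.
overall = {}
for offender in ("Black offender", "White offender"):
    jail = sum(j for (o, _), (j, _d) in counts.items() if o == offender)
    death = sum(d for (o, _), (_j, d) in counts.items() if o == offender)
    overall[offender] = death_rate(jail, death)

# Conditional rates: one rate for each offender race / victim race pair.
by_victim = {group: death_rate(*pair) for group, pair in counts.items()}

for offender, rate in overall.items():
    print(f"{offender}, all victims: {rate:.2%}")
for (offender, victim), rate in by_victim.items():
    print(f"{offender}, {victim}: {rate:.2%}")
```

Overall, white offenders receive a death sentence at a higher rate (about 2.1% versus 1.1%), yet within each race of victim the rate is higher for black offenders (about 0.6% versus 0% for black victims, and about 4.3% versus 2.2% for white victims).  That reversal is Simpson's Paradox.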