Inference on Least-Squares Regression

Overview

This chapter represents a continuation of the material in Sections 4.1 – 4.3. In these sections, we discussed descriptive statistics (such as scatter diagrams, correlation, and regression) along with diagnostic tools (such as residual plots) for bivariate quantitative data.
Now that students understand statistical inference, we are prepared to discuss inference on bivariate quantitative data. Essentially, the chapter has two areas of focus. In Section 14.1, we learn how to test the hypothesis of whether a linear association exists between two quantitative variables and how to construct a confidence interval for the slope of the least-squares regression equation. In Section 14.2, we learn how to construct confidence and prediction intervals.
This chapter also includes an optional section on randomization tests on the slope of the least-squares regression line (Section 14.1A).

What to Emphasize

This chapter builds on inferential methods presented in Chapters 9 and 10 (especially confidence intervals and hypothesis testing for a single population mean). The computations in this chapter can be overwhelming, so by-hand computation should be de-emphasized.

Testing the Significance of the Least-Squares Regression Model (Sections 14.1 and 14.1A) – Section 14.1 begins with a review of least-squares regression. If you just covered the presentation of regression (Chapter 4), then the review material should be skipped. The optional Section 14.1A discusses randomization tests on the slope of the least-squares regression line. If you choose to use this method, we recommend that you start with Section 14.1A and then move to Section 14.1.

Section 14.1A requires an applet found in StatCrunch. To use this applet, first load the bivariate data you are using to illustrate concepts into the StatCrunch spreadsheet. Then, in StatCrunch, go to Applets > Resampling > Randomization test for slope. Select the X (explanatory) variable and Y (response) variable. Click Compute!. The idea behind the technique is similar to other randomization techniques. We assume the statement in the null hypothesis, H₀: β₁ = 0, is true. If this is the case, then there is no linear association between the explanatory and response variables. So, we randomly assign each value of the response variable to a value of the explanatory variable and compute the correlation coefficient. If we do this many, many times, we can determine the proportion of randomizations in which we observe a correlation coefficient as extreme as, or more extreme than, the one actually observed. This proportion is an approximation of the P-value. After doing the randomization technique, test the hypothesis H₀: β₁ = 0 versus the same direction (left-tailed, right-tailed, two-tailed) alternative hypothesis using the classical approach of Section 14.1. Students should note that the P-values are similar. The randomization technique has the added bonus that we did not make any assumptions regarding the distributions of the variables.
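For instructors who want to show what such an applet is doing, the following is a minimal Python sketch of a two-tailed randomization test for the null hypothesis of no linear association (a slope of zero). It is not the StatCrunch implementation. The data, the function names, the number of repetitions, and the seed are all made up for illustration. The sketch compares slopes rather than correlation coefficients. Because shuffling the response values leaves the spread of each variable unchanged, the two statistics are proportional and give the same P-value.

```python
import random


def slope(x, y):
    """Least-squares slope b1 = Sxy / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def randomization_p_value(x, y, reps=5000, seed=1):
    """Approximate two-tailed P-value for the null hypothesis of zero slope.

    Each repetition shuffles the response values, which mimics a world in
    which the response is unrelated to the explanatory variable.
    """
    rng = random.Random(seed)
    observed = slope(x, y)
    shuffled = list(y)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(shuffled)
        # Count shuffles whose slope is at least as extreme as the observed one.
        if abs(slope(x, shuffled)) >= abs(observed):
            extreme += 1
    return observed, extreme / reps


# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8, 9.1, 9.7]

b1, p = randomization_p_value(x, y)
print(f"observed slope = {b1:.3f}, approximate P-value = {p:.4f}")
```

For a one-tailed alternative, the comparison inside the loop would use the signed slope instead of its absolute value.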
In Section 14.1, it is very important to spend time discussing the requirements of the least-squares regression model (Objective 1). In particular, be sure students understand the least-squares regression equation is found from sample data. This means that the estimates of the slope and intercept will vary from sample to sample, so there is a sampling distribution associated with both the slope and intercept. This sampling distribution allows us to conduct inference on the slope and intercept.
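A short simulation can make the sample-to-sample variation of the slope concrete. The sketch below is one possible classroom demonstration, not a prescribed activity. It assumes a made-up population model with intercept 3.0, slope 2.0, and standard deviation 1.5, draws many samples from it, and records the fitted slope each time. All names and parameter values are hypothetical.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical population parameters, chosen only for illustration.
BETA0, BETA1, SIGMA = 3.0, 2.0, 1.5

rng = random.Random(42)
x = list(range(1, 21))  # fixed values of the explanatory variable

slopes = []
for _ in range(2000):
    # Each sample: responses are normal about the population line.
    y = [BETA0 + BETA1 * xi + rng.gauss(0, SIGMA) for xi in x]
    slopes.append(fit_line(x, y)[1])

mean_slope = statistics.mean(slopes)
sd_slope = statistics.stdev(slopes)

# Standard deviation of the sample slope when sigma is known: sigma / sqrt(Sxx).
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)
theoretical_sd = SIGMA / math.sqrt(sxx)

print(f"mean of sample slopes = {mean_slope:.4f}")
print(f"sd of sample slopes   = {sd_slope:.4f}")
print(f"sigma / sqrt(Sxx)     = {theoretical_sd:.4f}")
```

A histogram of `slopes` should look roughly bell-shaped and centered near the population slope, which is the picture students need before inference on the slope makes sense.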
In addition, students should understand the fact that for each value of the explanatory variable, the distribution of the response variable is normal with mean μ_y|x = β₀ + β₁x and standard deviation σ. The fact that the standard deviation of the distribution of the response variable is the same regardless of the value of the explanatory variable suggests that the data should be equally spread around the least-squares regression line. We verified this requirement back in Section 4.3 with residual plots (it was the requirement of homoscedasticity). Emphasize the relation between the standard deviation of a single variable and the standard error of the estimate for regression. This gives you an opportunity to once again emphasize the fact that predicted values based on the least-squares regression line represent the mean value of the response variable for a given value of the explanatory variable. Do not emphasize by-hand computation, however.
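The model requirements and the parallel between the two measures of spread can be written side by side as follows. The notation is the conventional one and is an assumption here: β₀ and β₁ are the population intercept and slope, σ is the common standard deviation of the response (so N(0, σ) is parameterized by standard deviation), s is the ordinary sample standard deviation of the response, and s_e is the standard error of the estimate.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma) \\
\mu_{y \mid x} &= \beta_0 + \beta_1 x \\
s &= \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} \\
s_e &= \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}
\end{align*}
```

Both s and s_e measure a typical deviation. The first measures deviations from the single mean of the response, and the second measures deviations from the mean response predicted by the line. The divisor n − 2 reflects the two estimated parameters, the slope and the intercept.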
Finally, present an overview of hypothesis testing for the slope of the least-squares regression model. We suggest using technology to obtain the P-value. As always, emphasize the interpretation of the P-value.
One last item to discuss: We do not present inference on the correlation coefficient. While methods do exist for conducting inference on correlation, these methods require that the data be bivariate normal. This is a difficult requirement to verify. Plus, the results from inference on the slope are the same as those for inference on correlation. So, it seems redundant to cover inference on the correlation coefficient.
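The redundancy can be checked numerically. The sketch below, using made-up data and hypothetical function names, computes the test statistic for the slope, t = b1 / s_b1, and the usual test statistic for correlation, t = r·sqrt(n − 2) / sqrt(1 − r²), and shows that they agree. It illustrates the algebraic identity only and does not compute a P-value.

```python
import math


def slope_and_correlation_t(x, y):
    """Return (t_slope, t_corr) for testing no linear relation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least-squares estimates.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Standard error of the estimate and standard error of the slope.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    s_b1 = s_e / math.sqrt(sxx)
    t_slope = b1 / s_b1

    # Correlation coefficient and its test statistic.
    r = sxy / math.sqrt(sxx * syy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t_slope, t_corr


# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8, 9.1, 9.7]

t_slope, t_corr = slope_and_correlation_t(x, y)
print(f"t from slope       = {t_slope:.6f}")
print(f"t from correlation = {t_corr:.6f}")
```

Because both statistics have n − 2 degrees of freedom, identical values of t lead to identical P-values.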

Confidence and Prediction Intervals (Section 14.2) – It might be a good idea to review confidence intervals for a mean at the start of this section. Also, review the fact that a predicted value of the response variable using the least-squares regression equation may be interpreted either as the mean of the response variable for a given value of the explanatory variable, or as the predicted value of the response variable for a particular individual whose value of the explanatory variable is known.

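Emphasize the difference between a confidence interval and a prediction interval. Help students to conceptually understand why prediction intervals are wider: a prediction interval must account for both the uncertainty in estimating the mean response and the variability of individual responses around that mean.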
By-hand computation should be de-emphasized for this section as well. This is difficult for instructors using a TI-calculator, however. So, if you are using a TI-calculator for your course, consider introducing students to StatCrunch for this section.
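To show why the prediction interval is wider, it can help to compute both intervals from the same data. The sketch below uses the standard formulas, in which the prediction interval carries an extra "1 +" under the square root for individual variability. The data, the value x_star = 5.5, and the function name are made up for illustration. The critical value 2.306 is the t-table value for 95% confidence with n − 2 = 8 degrees of freedom and is supplied by hand because the sketch uses only the Python standard library.

```python
import math


def mean_and_prediction_intervals(x, y, x_star, t_crit):
    """Return (y_hat, confidence interval, prediction interval) at x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Standard error of the estimate.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x_star
    core = 1 / n + (x_star - x_bar) ** 2 / sxx

    # Margin for the mean response at x_star.
    e_mean = t_crit * s_e * math.sqrt(core)
    # Margin for an individual response: note the extra "1 +".
    e_pred = t_crit * s_e * math.sqrt(1 + core)

    ci = (y_hat - e_mean, y_hat + e_mean)
    pi = (y_hat - e_pred, y_hat + e_pred)
    return y_hat, ci, pi


# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8, 9.1, 9.7]

T_CRIT = 2.306  # t-table value for 95% confidence, 8 degrees of freedom

y_hat, ci, pi = mean_and_prediction_intervals(x, y, x_star=5.5, t_crit=T_CRIT)
print(f"predicted value: {y_hat:.3f}")
print(f"95% confidence interval for the mean response: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% prediction interval for an individual:     ({pi[0]:.3f}, {pi[1]:.3f})")
```

Both intervals are centered at the same predicted value. Only the margin of error differs, which is the point to stress with students.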

Ideas for Traditional/Online/Blended/Flipped

Back in Chapter 4 we suggested that you secure real data to illustrate concepts. Continue to use this data in this chapter. We like to use www.zillow.com to illustrate linear associations between their “Zestimate” and “Selling Price” for recently sold homes. Students can build a linear model that might help in determining the sale price of a currently listed home.
Utilize discussion boards to get students communicating statistical ideas. For example, ask students to explain the distribution of a response variable for a given value of the explanatory variable. Or, ask for an explanation of the difference between a prediction and confidence interval.