Example 6 Identifying Influential Observations

Load the data from Table 1 in Section 4.1 into R.

Table1 <- read.csv("https://sullystats.github.io/Statistics6e/Data/Chapter4/Table1.csv")
head(Table1,n=4)
##   Speed Distance
## 1   100      257
## 2   102      264
## 3   103      274
## 4   101      266

Notice that Speed is in the first column and Distance is in the second column. Now we want to add the observation corresponding to Justin Thomas to the data set. To do so, use the rbind( ) command (short for “row bind”), listing the new values in the same order as the columns.

Example6 <- rbind(Table1,c(120,305))
Example6
##   Speed Distance
## 1   100      257
## 2   102      264
## 3   103      274
## 4   101      266
## 5   105      277
## 6   100      263
## 7    99      258
## 8   105      275
## 9   120      305
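Because rbind( ) matches an unnamed vector to the columns by position, swapping the two values would silently place the speed in the Distance column. A minimal sketch of a more explicit alternative is to bind a one-row data frame with named columns. The object names Table1_local, new_row, and Example6_named are illustrative, and the first eight rows are typed in by hand so that the sketch runs without downloading the file.

```r
# First eight observations, entered manually (same values as Table 1)
Table1_local <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275)
)

# One-row data frame for the new observation.
# The columns are deliberately listed in a different order:
# rbind( ) on data frames matches columns by name, not by position.
new_row <- data.frame(Distance = 305, Speed = 120)

Example6_named <- rbind(Table1_local, new_row)
tail(Example6_named, n = 2)
```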

Rather than reading the data from GitHub, we could manually enter the data into R.

Example6a <- data.frame("Speed"=c(100, 102, 103, 101, 105, 100, 99, 105, 120), "Distance"=c(257, 264, 274, 266, 277, 263, 258, 275, 305))

Base R has an influence.measures command that identifies influential observations.

golf_model <- lm(Distance ~ Speed, data=Example6)   # Find and name the regression model
influence.measures(golf_model)
## Influence measures of
##   lm(formula = Distance ~ Speed, data = Example6) :
## 
##    dfb.1_ dfb.Sped  dffit cov.r   cook.d   hat inf
## 1 -0.4987   0.4581 -0.847 0.599  0.25486 0.157    
## 2 -0.1091   0.0921 -0.309 1.248  0.04992 0.122    
## 3  0.1230  -0.0883  0.607 0.701  0.14526 0.114    
## 4  0.0794  -0.0709  0.165 1.490  0.01535 0.136    
## 5 -0.0479   0.0703  0.389 1.079  0.07378 0.115    
## 6  0.0489  -0.0449  0.083 1.595  0.00399 0.157    
## 7 -0.2025   0.1892 -0.301 1.465  0.04950 0.184    
## 8 -0.0193   0.0282  0.156 1.446  0.01380 0.115    
## 9  5.7588  -5.8973 -6.299 4.553 13.36257 0.900   *

The influence.measures command outputs a table with 7 columns and 9 rows, one row per observation. Justin Thomas’s swing is in the 9th row because it is the 9th observation in the data set. Each of the first six columns is a separate influence measure. We will focus on the 5th and 7th columns. The 5th column (cook.d) is Cook’s distance, a common measure of how much the fitted model changes when an observation is removed. As a rule of thumb, any observation with a Cook’s distance greater than 1 is considered influential. Justin Thomas’s swing has a Cook’s distance of 13.36, so it qualifies as influential. The 7th column (inf) is a yes-or-no summary: R places a star there when an observation is flagged on at least one of the six measures. The star in the “inf” column of row 9 confirms that Justin Thomas’s swing is an influential observation.
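The two columns discussed above can also be extracted directly rather than read off the printed table. The following is a minimal sketch, assuming the data are entered manually as in Example6a. The cooks.distance( ) function and the is.inf component of the influence.measures( ) result are part of base R; the object names cooks_d, flags, and flagged_rows are illustrative.

```r
# Data entered manually, as in Example6a
Example6a <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105, 120),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275, 305)
)
golf_model <- lm(Distance ~ Speed, data = Example6a)

# Cook's distance for each observation (the cook.d column)
cooks_d <- cooks.distance(golf_model)

# Observations that exceed the rule-of-thumb cutoff of 1
which(cooks_d > 1)

# Logical matrix behind the "inf" column: one column per measure
flags <- influence.measures(golf_model)$is.inf

# Rows flagged on at least one measure
flagged_rows <- which(apply(flags, 1, any))
flagged_rows
```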

The mosaic package provides a graphical version of this influence check through its mplot( ) command.

library(mosaic)
golf_model <- lm(Distance ~ Speed, data=Example6)   # Find and name the regression model
mplot(golf_model,which = 4)  # the "which" option can take on a value from 1 to 7.  

Any observation with a Cook’s d in excess of 1 is considered influential. Clearly, the observation corresponding to Justin Thomas is influential.
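One way to see what “influential” means in practice is to refit the line without the observation and compare the estimates. This is a rough sketch, assuming the manually entered data from Example6a; the names model_all and model_without_9 are illustrative.

```r
# Data entered manually, as in Example6a
Example6a <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105, 120),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275, 305)
)

# Fit with all nine observations
model_all <- lm(Distance ~ Speed, data = Example6a)

# Fit again after dropping row 9 (Justin Thomas)
model_without_9 <- lm(Distance ~ Speed, data = Example6a[-9, ])

# Intercept and slope side by side
rbind(all_observations = coef(model_all),
      without_obs_9    = coef(model_without_9))
```

If the two rows differ noticeably, the single observation is pulling the fitted line toward itself, which is what a large Cook’s distance signals.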

Note:
- which = 1 draws a residual plot (residuals versus fits)
- which = 2 draws a QQ plot of the residuals
- which = 3 draws a residual plot (standardized residuals versus fits)
- which = 4 is Cook’s d
- which = 5 is residuals versus leverage
- which = 6 is Cook’s d versus leverage
- which = 7 is confidence intervals of estimates (Chapter 14)
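If the mosaic package is not available, base R can draw a comparable Cook’s distance plot: calling plot( ) on an lm object with which = 4 produces it (in base R, which runs from 1 to 6). The sketch below assumes the manually entered data from Example6a; the dashed reference line at 1 is an addition for illustration and is not part of the default plot.

```r
# Data entered manually, as in Example6a
Example6a <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105, 120),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275, 305)
)
golf_model <- lm(Distance ~ Speed, data = Example6a)

# Base R Cook's distance plot for the fitted model
plot(golf_model, which = 4)

# Illustrative addition: dashed line at the rule-of-thumb cutoff of 1
abline(h = 1, lty = 2)
```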