Let’s learn how to use R to find descriptive measures of multiple variables. To do this, let’s work with the HomeRun_2014 data. This data set contains all home runs hit during the 2014 Major League Baseball season.

HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL

We can see that there are a number of variables in the data set. We are going to focus on “TrueDist”, which represents the distance the ball traveled (in feet).

First, let’s draw a histogram of the data (using the mosaic package).

install.packages("mosaic")   # run once; skip this line if mosaic is already installed
library(mosaic)
gf_histogram(~TrueDist,data=HomeRun,binwidth=25,boundary=300,color='black',fill='skyblue',title="Distance of a Home Run (in feet)")
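To make the binwidth and boundary arguments concrete, here is a minimal base R sketch of the binning they describe: bins 25 feet wide with one bin edge at 300. The distances are hypothetical toy values, not taken from HomeRun_2014. The sketch also assumes each bin includes its right endpoint, which we believe matches the ggplot2 default but is worth checking against its documentation.

```r
# Hypothetical toy distances (in feet); not from the HomeRun_2014 data
dist <- c(310, 330, 349, 350, 351, 420)

# Bin edges 25 feet apart, with one edge at 300 (binwidth = 25, boundary = 300)
edges <- seq(from = 300, to = 425, by = 25)

# Assign each distance to a bin; intervals are closed on the right, e.g. (325,350]
bins <- cut(dist, breaks = edges)

# Count the observations in each bin (these counts are the bar heights)
bin_counts <- table(bins)
print(bin_counts)
```

Changing the boundary shifts every bin edge by the same amount, so a value such as 350 can land in a different bar even though the bin width is unchanged.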

The distribution is bell-shaped. Now, let’s find the mean and median distance of a home run hit in 2014.

mean(~TrueDist,data=HomeRun)
## [1] 395.2172
median(~TrueDist,data=HomeRun)
## [1] 396

Notice that the mean and the median are roughly the same value, which is consistent with the distribution being roughly symmetric.
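As a small illustration of why comparing the mean and median says something about shape, the sketch below uses two hypothetical sets of distances (toy values, not from HomeRun_2014). It uses base R's mean() and median() on plain vectors rather than the formula interface; with the real data the base R equivalents would be mean(HomeRun$TrueDist) and median(HomeRun$TrueDist).

```r
# Hypothetical toy distances (in feet); not from the HomeRun_2014 data
symmetric <- c(380, 390, 395, 400, 410)   # values balanced around the middle
skewed    <- c(380, 390, 395, 400, 480)   # one unusually long home run

# For the balanced data, the mean and the median agree
print(c(mean = mean(symmetric), median = median(symmetric)))

# For the skewed data, the long home run pulls the mean above the median
print(c(mean = mean(skewed), median = median(skewed)))
```

The median is the same in both cases, but the single long home run raises the mean of the second set, which is the pattern we would expect in a right-skewed distribution.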

Okay, but what if we wanted to know whether the typical distance of a home run differed based on the “Type” of home run? Home runs were classified as the following types.

- PL (Plenty): the home run had plenty of distance to clear the outfield wall.
- ND (No doubt): the home run was clearly a home run.
- JE (Just enough): the home run barely had enough distance to clear the outfield fence.
- ITP (Inside the park): the ball did not clear the outfield fence, but the hitter was able to get around all four bases.

To find the mean by type of home run, use the following syntax:

mean(y variable ~ explanatory variable,data=df_name)

mean(TrueDist ~ Type,data=HomeRun)
##      ITP       JE       ND       PL 
## 375.0000 385.2539 417.6432 394.5639

Notice that “no doubters” had the highest mean distance (417.6 feet).
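For readers who want to see the same grouped mean without the formula interface for mean(), here is a minimal base R sketch using tapply() and aggregate(). The data frame toy is hypothetical (it only borrows the TrueDist and Type column names), so the numbers do not match the output above.

```r
# Hypothetical toy data; column names mirror HomeRun, values are made up
toy <- data.frame(
  TrueDist = c(400, 420, 380, 390, 410, 430),
  Type     = c("PL", "ND", "JE", "JE", "PL", "ND")
)

# tapply() splits TrueDist by Type and applies mean() to each group
by_type <- tapply(toy$TrueDist, toy$Type, mean)
print(by_type)

# aggregate() does the same with a formula and returns a data frame
agg <- aggregate(TrueDist ~ Type, data = toy, FUN = mean)
print(agg)
```

tapply() returns a named vector, which is convenient for quick lookups such as by_type["ND"], while aggregate() returns a data frame that is easier to pass on to other functions.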

We could also apply this concept to histograms. Adding |Type to the formula draws a separate histogram for each type of home run.

gf_histogram(~TrueDist|Type,data=HomeRun,binwidth=25,boundary=300,color='black',fill='skyblue',title="Distance of a Home Run (in feet)")

Another neat feature of the mosaic package is that it is easy to bring in more variables than just Type. Let’s see how the inning of the game plays a role (if any) in home run distances.

mean(TrueDist ~ Type + INN,data=HomeRun)
##    ITP.1     JE.1     ND.1     PL.1    ITP.2     JE.2     ND.2     PL.2 
##      NaN 385.0476 417.2639 398.0456      NaN 385.2961 415.9767 395.4980 
##    ITP.3     JE.3     ND.3     PL.3    ITP.4     JE.4     ND.4     PL.4 
##      NaN 385.9441 420.3500 394.6300 339.0000 386.1366 416.7308 395.3838 
##    ITP.5     JE.5     ND.5     PL.5    ITP.6     JE.6     ND.6     PL.6 
## 419.0000 387.4937 417.0690 393.8367      NaN 382.5373 420.2609 395.2950 
##    ITP.7     JE.7     ND.7     PL.7    ITP.8     JE.8     ND.8     PL.8 
## 361.0000 385.3562 415.4032 393.0802 394.2500 384.8140 416.0758 391.9077 
##    ITP.9     JE.9     ND.9     PL.9   ITP.10    JE.10    ND.10    PL.10 
## 304.0000 383.9832 420.9615 393.6800      NaN 384.2143 417.4000 375.5263 
##   ITP.11    JE.11    ND.11    PL.11   ITP.12    JE.12    ND.12    PL.12 
##      NaN 380.4286 411.0000 398.2500      NaN 395.1667 415.6667 383.7500 
##   ITP.13    JE.13    ND.13    PL.13   ITP.14    JE.14    ND.14    PL.14 
##      NaN 370.0000 403.0000 399.1667      NaN 380.0000      NaN 387.5000 
##   ITP.15    JE.15    ND.15    PL.15   ITP.16    JE.16    ND.16    PL.16 
##      NaN      NaN      NaN 408.0000      NaN 383.0000      NaN      NaN 
##   ITP.19    JE.19    ND.19    PL.19 
##      NaN 383.0000      NaN      NaN

There is quite a bit of output. The format is Type.INN. So, ITP.1 represents the mean distance of inside-the-park home runs in the first inning. Notice the output is NaN (which stands for “not a number”). It appears because there were no inside-the-park home runs in the first inning in 2014, so there is no mean to compute.
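The sketch below shows where a NaN like this comes from and one way to spot the empty combinations ahead of time. It uses base R only, and the data frame toy is hypothetical (made-up values with the same column names as HomeRun).

```r
# The mean of an empty set of values is 0/0, which R reports as NaN
empty_mean <- mean(numeric(0))
print(empty_mean)
print(is.nan(empty_mean))

# Hypothetical toy data; values are made up, not from HomeRun_2014
toy <- data.frame(
  Type     = c("JE", "JE", "ND", "ITP"),
  INN      = c(1, 2, 1, 2),
  TrueDist = c(385, 390, 420, 340)
)

# Cross-tabulate Type by inning; a count of 0 marks a combination with no home runs
counts <- table(toy$Type, toy$INN)
print(counts)
```

Any Type and inning pair with a count of 0 has no distances to average, which is why its mean shows up as NaN in the output above.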