First, be sure the package mosaic is installed.

install.packages('mosaic')

Now we will load the population data: the distance of every home run hit during the 2014 baseball season.

HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL

We are going to focus on the variable “TrueDist”, which is the distance (in feet) the home run traveled.

Let’s look at the distribution of this variable and get some summary statistics.

library(mosaic)
gf_histogram(~TrueDist, data=HomeRun, binwidth=10, color="black", fill="blue",
             xlab="Distance (in feet)", ylab="Frequency",
             title="Distance of a Home Run in 2014")

favstats(~TrueDist,data=HomeRun)
##  min  Q1 median  Q3 max     mean       sd    n missing
##  304 378    396 413 489 395.2172 24.81088 4185       0

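Notice that the distribution is approximately normal. Because the data set contains every home run hit in 2014, the mean and standard deviation reported above are population parameters: \(\mu = 395.2\) feet and \(\sigma = 24.8\) feet.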

Now, let’s take a random sample of n = 9 home run distances from this data set and determine the sample mean of the home run distance.

mean(~TrueDist,data=sample(HomeRun,9))   # Find the mean of a sample of size 9
## [1] 389.4444

Let’s take another random sample of n = 9 home run distances and determine the sample mean.

mean(~TrueDist,data=sample(HomeRun,9))   # Find the mean of a sample of size 9
## [1] 396.3333

Notice that the sample mean changes from sample to sample because we have different home runs in the random sample.
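The two sample means above differ because each call draws a fresh set of rows, so rerunning the code will not reproduce 389.4444 or 396.3333 exactly. A minimal sketch of how a draw can be made repeatable with base R's set.seed() follows. It assumes a simulated stand-in population: the vector `pop` is hypothetical, generated with rnorm() to resemble the summary statistics above, and is not the actual HomeRun data.

```r
# Hypothetical stand-in for HomeRun$TrueDist: 4185 simulated distances
set.seed(2014)
pop <- rnorm(4185, mean = 395.2, sd = 24.8)

# Fixing the seed immediately before sampling makes the draw repeatable
set.seed(1)
first_mean <- mean(sample(pop, 9))
set.seed(1)
second_mean <- mean(sample(pop, 9))   # same seed, same sample, same mean

# Without resetting the seed, the next sample contains different values
third_mean <- mean(sample(pop, 9))

identical(first_mean, second_mean)    # TRUE
```

Setting the seed once at the top of a script is usually enough to make a whole analysis reproducible; it is reset twice here only to show that the same seed yields the same sample.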

To get a sense of the shape, center, and spread of the sampling distribution of \(\bar{x}\), we need to obtain many random samples of size n = 9. The code below draws 5000 such samples and records the mean of each.

SamplingDist <- bind_rows(do(5000) * c(mean = mean(~TrueDist, data = sample(HomeRun,9))))
head(SamplingDist,n=4)
##       mean
## 1 401.5556
## 2 407.6667
## 3 384.2222
## 4 407.1111

You can see the sample mean for the first four random samples. Now, let’s look at the shape, center, and spread of the sampling distribution of \(\bar{x}\).

gf_histogram(~mean, data=SamplingDist, binwidth=5, color="black", fill="blue",
             xlab="Mean Distance (in feet)", ylab="Frequency",
             title="Distribution of Sample Mean Distance of a Home Run in 2014 with n = 9")

mean(~mean,data=SamplingDist)
## [1] 395.1058
sd(~mean,data=SamplingDist)
## [1] 8.231234

Notice the shape of the distribution of the sample mean is approximately normal. The simulated mean, 395.1, agrees closely with the theoretical mean of the sampling distribution of \(\bar{x}\), \(\mu_{\bar{x}} = \mu\), and the simulated standard deviation, 8.23, is consistent with the theoretical standard deviation of the sampling distribution of \(\bar{x}\), \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\).
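As a quick check of the formula for \(\sigma_{\bar{x}}\), the theoretical value can be computed directly from the numbers reported earlier. This is only a sketch: the variable names are illustrative, and the values are copied by hand from the favstats and sd output above rather than recomputed from the data.

```r
# Values copied from the output above
sigma <- 24.81088          # population standard deviation (from favstats)
n <- 9                     # sample size
simulated_sd <- 8.231234   # standard deviation of the 5000 simulated sample means

# Theoretical standard deviation of the sample mean
theoretical_sd <- sigma / sqrt(n)
theoretical_sd             # about 8.27

# Relative difference between the simulation and the theory
(simulated_sd - theoretical_sd) / theoretical_sd
```

The relative difference is well under one percent in magnitude, which is the kind of agreement to expect from 5000 simulated samples.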

Let’s repeat this for n = 16 to see the role sample size plays.

SamplingDist_16 <- bind_rows(do(5000) * c(mean = mean(~TrueDist, data = sample(HomeRun,16))))
head(SamplingDist_16,n=4)
##       mean
## 1 383.2500
## 2 404.3125
## 3 401.1250
## 4 393.6250

You can see the sample mean for the first four random samples of size 16. Now, let’s look again at the shape, center, and spread of the sampling distribution of \(\bar{x}\).

gf_histogram(~mean, data=SamplingDist_16, binwidth=5, color="black", fill="blue",
             xlab="Mean Distance (in feet)", ylab="Frequency",
             title="Distribution of Sample Mean Distance of a Home Run in 2014 with n = 16")

mean(~mean,data=SamplingDist_16)
## [1] 395.0118
sd(~mean,data=SamplingDist_16)
## [1] 6.168464

The shape of the distribution is still approximately normal, and the simulated mean, 395.0, again agrees with \(\mu_{\bar{x}} = \mu\). Notice the standard deviation of the sampling distribution of \(\bar{x}\) is now lower (6.17 compared with 8.23 for n = 9) because the sample size has increased. This is because \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\), which for n = 16 gives \(\frac{24.8}{\sqrt{16}} = 6.2\).
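The same comparison can be repeated for several sample sizes at once to see the \(\frac{\sigma}{\sqrt{n}}\) pattern more clearly. The sketch below uses only base R (replicate() in place of mosaic's do()) and, as an assumption, a simulated stand-in population: `pop` is a hypothetical vector generated with rnorm() to resemble the home run distances, and the sample sizes 36 and 64 are added purely for illustration.

```r
# Hypothetical stand-in for HomeRun$TrueDist (not the real data)
set.seed(42)
pop <- rnorm(4185, mean = 395.2, sd = 24.8)
sigma <- sd(pop)

sample_sizes <- c(9, 16, 36, 64)

# For each n, draw 5000 samples of size n and take the sd of their means
simulated_sd <- sapply(sample_sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Theoretical standard deviation of the sample mean for each n
theoretical_sd <- sigma / sqrt(sample_sizes)

data.frame(n = sample_sizes, simulated_sd, theoretical_sd)
```

Each row of the resulting table should show the simulated and theoretical values close together, with both shrinking as n grows. Quadrupling the sample size (for example, from 9 to 36 or from 16 to 64) roughly halves the standard deviation of the sample mean.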