Sampling Distribution of the Sample Mean

First, be sure the package mosaic is installed.

install.packages('mosaic')

Now, we will load population data. First, let’s consider the fare charged by ALL Chicago taxi rides on a single day.

Taxi <- read.csv("https://sullystats.github.io/Statistics6e/Data/ChicagoTaxi.csv")
head(Taxi,n=4)

##   Trip  Fare Payment
## 1  300  6.50    Cash
## 2 1281 42.25  Credit
## 3  780 10.75    Cash
## 4  900 17.00  Credit

We are going to focus on the variable “Fare”, which is the fare charged for the ride.

Let’s look at the distribution of this variable and get some summary statistics.

library(mosaic)
gf_histogram(~Fare, data = Taxi, binwidth = 10, color = "black", fill = "blue",
             xlab = "Fare (in dollars)", ylab = "Frequency",
             title = "Fare of a Chicago Taxi Ride")

favstats(~Fare,data=Taxi)

##   min   Q1 median    Q3    max     mean       sd     n missing
##  0.01 6.25    8.5 16.25 292.75 15.01437 14.27812 28605       0

Notice the distribution is skewed right (the mean is well above the median of 8.50), with $\mu = \$15.014$ and $\sigma = \$14.278$.

Now, let’s take a random sample of n = 10 rides from this data set and determine the sample mean fare.

mean(~Fare,data=sample(Taxi,10))   # Find the mean of a sample of size 10

## [1] 15.5

Let’s take another random sample of n = 10 rides and determine the sample mean fare.

mean(~Fare,data=sample(Taxi,10))   # Find the mean of a sample of size 10

## [1] 10.75

Notice that the sample mean changes from sample to sample because we have different rides in the random sample.
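Because the rides are drawn at random, rerunning the lines above will give different means each time. A minimal sketch of making such draws reproducible with set.seed() follows. It uses base R only, and the vector `fares` and the seed value 123 are assumptions for illustration; they are not the taxi data.

```r
# Hypothetical fares, invented for illustration only (NOT the taxi data)
fares <- c(6.50, 42.25, 10.75, 17.00, 8.25, 5.75,
           12.00, 30.50, 7.25, 9.00, 14.50, 6.00)

set.seed(123)                           # arbitrary seed; any integer works
first_draw  <- mean(sample(fares, 5))   # mean of a random sample of size 5

set.seed(123)                           # same seed, so the same values are drawn
second_draw <- mean(sample(fares, 5))

third_draw  <- mean(sample(fares, 5))   # no reseed: a fresh random sample

c(first = first_draw, second = second_draw, third = third_draw)
```

The first two means are identical because the seed was reset before each draw. The third comes from a new random sample, so it will usually differ.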

To get a sense of the shape, center, and spread of the sampling distribution of $\bar{x}$, we need to obtain many, many random samples of size n = 10. Below, do(5000) repeats the sampling 5,000 times and stores each sample mean.

SamplingDist <- bind_rows(do(5000) * c(mean = mean(~Fare, data = sample(Taxi,10))))
head(SamplingDist,n=4)

##     mean
## 1 14.200
## 2 10.675
## 3  9.325
## 4 14.950

You can see the sample mean for the first four random samples. Now, let’s look at the shape, center, and spread of the sampling distribution of $\bar{x}$.

gf_histogram(~mean, data = SamplingDist, binwidth = 2, color = "black", fill = "blue",
             xlab = "Sample Mean Fare (in dollars)", ylab = "Frequency",
             title = "Distribution of Sample Mean Fare of a Taxi Ride with n = 10")

mean(~mean,data=SamplingDist)

## [1] 15.02069

sd(~mean,data=SamplingDist)

## [1] 4.500416

Notice the shape of the distribution of the sample mean is skewed right. The mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$, and the simulated value, 15.02, is close to $\mu$ = 15.014. The standard deviation of the sampling distribution of $\bar{x}$ is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
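As a quick check of the formula $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ for n = 10, the sketch below computes the theoretical value and compares it with the simulated standard deviation. The numbers are copied from the favstats() and sd() output above; the variable names are chosen here for illustration.

```r
# Population standard deviation, copied from the favstats() output
sigma <- 14.27812
n     <- 10

# Theoretical standard deviation of the sample mean: sigma / sqrt(n)
se_theory <- sigma / sqrt(n)
se_theory

# Simulated standard deviation of the 5000 sample means, copied from sd()
se_simulated <- 4.500416

# Difference between simulation and theory (expected to be small)
se_simulated - se_theory
```

The theoretical value is about 4.515, so the simulated 4.500 agrees with it to within ordinary simulation error.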

Let’s repeat this for n = 30 to see the role sample size plays.

SamplingDist_30 <- bind_rows(do(5000) * c(mean = mean(~Fare, data = sample(Taxi,30))))
head(SamplingDist_30,n=4)

##       mean
## 1 15.41667
## 2 17.42500
## 3 16.17500
## 4 15.09167

You can see the sample mean for the first four random samples. Now, let’s look at the shape, center, and spread of the sampling distribution of $\bar{x}$.

gf_histogram(~mean, data = SamplingDist_30, binwidth = 2, color = "black", fill = "blue",
             xlab = "Sample Mean Fare (in dollars)", ylab = "Frequency",
             title = "Distribution of Sample Mean Fare of a Taxi Ride with n = 30")

mean(~mean,data=SamplingDist_30)

## [1] 14.99778

sd(~mean,data=SamplingDist_30)

## [1] 2.584228

The shape of the distribution is still skewed right, but not as skewed as it was for n = 10. The mean of the sampling distribution of $\bar{x}$ is still $\mu_{\bar{x}} = \mu$. Notice the standard deviation of the sampling distribution of $\bar{x}$ is now lower (2.584 versus 4.500) because the sample size has increased: since $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, a larger n gives a smaller standard deviation. For n = 30, $\frac{14.278}{\sqrt{30}} \approx 2.607$, which is close to the simulated value.
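The same experiment can be sketched in base R with replicate(), which may help show what the repeated sampling is doing. This is a sketch under stated assumptions, not the code used above: the population is a simulated exponential distribution standing in for a right-skewed variable (it is not the taxi data), and the names `population`, `sigma_pop`, and `sim_means` and the seed are hypothetical.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population (NOT the taxi data):
# 20,000 exponential values with mean 15
population <- rexp(20000, rate = 1 / 15)
sigma_pop  <- sd(population)

# Draw `reps` random samples of size n and return their sample means
sim_means <- function(n, reps = 5000) {
  replicate(reps, mean(sample(population, n)))
}

# Compare the simulated spread with sigma / sqrt(n) for each sample size
for (n in c(10, 30)) {
  means <- sim_means(n)
  cat("n =", n,
      " simulated sd =", round(sd(means), 3),
      " sigma/sqrt(n) =", round(sigma_pop / sqrt(n), 3), "\n")
}
```

For each sample size the two printed values should be close, and both shrink as n grows from 10 to 30.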