install.packages("mosaic")

Data in a Contingency Table (Using do() command)

You can enter data into a data frame using the do() command, which requires the mosaic package. For example, let’s say we want to enter the following contingency table, which represents survival status on the Titanic.

Status      Men   Women   Boys   Girls
Survived    334     318     29      27
Died       1360     104     35      18

Let’s call the column variable Person and the row variable Status.

library(mosaic)
Titanic_df <- rbind(
do(334)*data.frame(Person="Men",Status="Survived"),
do(318)*data.frame(Person="Women",Status="Survived"),
do(29)*data.frame(Person="Boys",Status="Survived"),
do(27)*data.frame(Person="Girls",Status="Survived"),
do(1360)*data.frame(Person="Men",Status="Died"),
do(104)*data.frame(Person="Women",Status="Died"),
do(35)*data.frame(Person="Boys",Status="Died"),
do(18)*data.frame(Person="Girls",Status="Died")
)
tally(~Status+Person,data=Titanic_df)    #Build contingency table from data frame
##           Person
## Status     Boys Girls  Men Women
##   Died       35    18 1360   104
##   Survived   29    27  334   318
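
If the mosaic package is not available, the same expansion from counts to one row per person can be sketched in base R. This is only an illustrative alternative, not the method used in this section; the names counts, Titanic_base, and Titanic_tab are made up for the example, and the counts are the ones from the table above.

```r
# Base-R sketch: expand a table of counts into one row per person.
# The counts are the Titanic counts given above.
counts <- data.frame(
  Person = rep(c("Men", "Women", "Boys", "Girls"), times = 2),
  Status = rep(c("Survived", "Died"), each = 4),
  n      = c(334, 318, 29, 27, 1360, 104, 35, 18)
)

# Repeat each row index n times, then keep only the two variables.
Titanic_base <- counts[rep(seq_len(nrow(counts)), times = counts$n),
                       c("Person", "Status")]
rownames(Titanic_base) <- NULL

# Cross-tabulate to confirm the counts match the original table.
Titanic_tab <- table(Titanic_base$Status, Titanic_base$Person,
                     dnn = c("Status", "Person"))
Titanic_tab
```

The resulting table should show the same eight counts as the tally() output, with rows and columns in alphabetical order.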

Now, let’s draw a barplot of the conditional distribution. Be sure to set margins to FALSE.

Titanic_Condition <- tally(~Status|Person,margins=FALSE,format="proportion",data=Titanic_df)    #Conditional distribution by type of person
barplot(Titanic_Condition, beside = TRUE, cex.names = .7,legend=TRUE, ylim=c(0,1.2),main="Survival Status on the Titanic", xlab = "Type of Passenger", ylab = "Relative Frequency", col = c('#6897bb', '#c06723'))    #One color per Status level so the bars match the legend

Data in a Contingency Table (Matrix)

Now, let’s construct a conditional distribution from a contingency table. We will work with the contingency table in Section 4.4, Table 9.

To create a conditional distribution in R, use the following command:

variable <- prop.table(table, margin)    #margin is 1 or 2

Note: Use 1 to condition by the row variable (each row sums to 1); use 2 to condition by the column variable (each column sums to 1).
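
As a quick check of what the two margin values do, here is a minimal sketch using the Titanic counts from earlier in this section. The names Titanic_counts, by_row, and by_col are chosen only for this illustration.

```r
# Titanic counts from the earlier table, entered as a small matrix
# (values are listed column by column).
Titanic_counts <- matrix(c(334, 1360, 318, 104, 29, 35, 27, 18),
                         nrow = 2,
                         dimnames = list(Status = c("Survived", "Died"),
                                         Person = c("Men", "Women", "Boys", "Girls")))

by_row <- prop.table(Titanic_counts, 1)  # condition on Status: each row sums to 1
by_col <- prop.table(Titanic_counts, 2)  # condition on Person: each column sums to 1

rowSums(by_row)
colSums(by_col)
```

Checking rowSums() or colSums() like this is a simple way to confirm you conditioned on the variable you intended.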

The matrix command takes the cell counts as a single vector built with the c( ) command. In addition, you must specify the number of rows (nrow) and the number of columns (ncol). Finally, you name the rows and columns using dimnames along with list.

Notice how the cells are entered into the matrix (all entries in first column, then second column, and so on). With dimnames, name the row values first, then the column values.
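
The fill order is easiest to see with a tiny matrix. The sketch below uses made-up values 1 through 6 and placeholder names (m_col, m_row, r1, c1, and so on); it also shows the byrow argument, which is an alternative if you would rather type the counts row by row.

```r
# Default fill is by column: values go down column 1, then column 2, and so on.
m_col <- matrix(1:6, nrow = 2, ncol = 3,
                dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m_col

# byrow = TRUE fills across each row instead.
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m_row
```

In m_col the value 2 lands in row r2 of column c1; in m_row it lands in row r1 of column c2.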

Table9 <- matrix(c(9607, 570, 11662, 34625, 1274, 26426, 36370, 1170, 19861, 57102, 1305, 20841), nrow = 3, ncol = 4, dimnames = list(c("Employed", "Unemployed", "Not in the Labor Force"), c("Did Not Finish High School", "High School Graduate", "Some College", "Bachelor's Degree or Higher")))

Does a higher level of education play a role in employment status? Let’s condition by level of education to find out. Because level of education is the column variable, use 2 in the prop.table command.

Table9_Condition <- prop.table(Table9, 2)
Table9_Condition
##                        Did Not Finish High School High School Graduate
## Employed                                0.4399011           0.55555556
## Unemployed                              0.0261001           0.02044124
## Not in the Labor Force                  0.5339988           0.42400321
##                        Some College Bachelor's Degree or Higher
## Employed                 0.63361265                  0.72054815
## Unemployed               0.02038292                  0.01646729
## Not in the Labor Force   0.34600443                  0.26298455

Now that we have the conditional distribution, use the barplot command. The syntax is as follows:

barplot(df_name,beside=TRUE)

Note: cex.names scales the font size of the bar labels (values below 1 shrink them). legend = TRUE adds a legend. ylim=c(0,1.2) extends the y-axis so the legend does not overlay the bars. You should experiment with the limits until you are happy with the graph.

barplot(Table9_Condition, beside = TRUE, cex.names = .7,legend=TRUE, ylim=c(0,1.2),main="Employment Status by Level of Education", xlab = "Level of Education", ylab = "Relative Frequency", col = c('#6897bb', '#c06723', '#baebae'))
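
Stretching ylim is one way to keep the legend off the bars. Another option, sketched here as an alternative rather than the approach used in this section, is the args.legend argument of barplot, which passes settings through to legend(). The position and size values below are starting guesses that may need adjusting for your plot window; mids is a name introduced only to hold the bar midpoints that barplot returns.

```r
# Counts copied from the Table9 definition above.
Table9 <- matrix(
  c(9607, 570, 11662, 34625, 1274, 26426,
    36370, 1170, 19861, 57102, 1305, 20841),
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate",
      "Some College", "Bachelor's Degree or Higher")))
Table9_Condition <- prop.table(Table9, 2)

# args.legend is a list of arguments that barplot() hands to legend().
# Here the legend is drawn as one horizontal row across the top, without a box.
mids <- barplot(Table9_Condition, beside = TRUE, cex.names = 0.7,
                legend.text = TRUE,
                args.legend = list(x = "top", horiz = TRUE, bty = "n", cex = 0.8),
                ylim = c(0, 1),
                main = "Employment Status by Level of Education",
                xlab = "Level of Education", ylab = "Relative Frequency",
                col = c("#6897bb", "#c06723", "#baebae"))
```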

Raw Data

Now, let’s learn how to create a conditional distribution bar graph from raw data. First, obtain the conditional distribution. To do so, we use the tally command in the mosaic package.

Load the HomeRun_2014 data.

HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL

Use the tally command to build a conditional distribution.

tally(~ response_variable | explanatory_variable,margins=FALSE,format="proportion",data = data_file)

For the HomeRun_2014 data, let’s say we want to determine if inning plays a role in the type of home run hit. In this regard, we want to find a conditional distribution of type by inning. So, inning (INN) is the explanatory variable.
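
To see what this conditional distribution computes, here is a minimal base-R sketch on a small made-up data frame. The toy data are hypothetical and are not taken from the HomeRun_2014 file (the type codes are only placeholders), and prop.table(table(...), 2) is used as a stand-in for tally() to show the idea.

```r
# Hypothetical toy data (not the HomeRun_2014 file): one row per home run.
toy <- data.frame(
  INN  = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Type = c("PL", "ND", "PL", "JE", "PL", "ND", "ND", "JE", "JE")
)

# Rows hold the response (Type), columns hold the explanatory variable (INN).
# Margin 2 divides by the column totals, so each inning's column sums to 1.
toy_condition <- prop.table(table(Type = toy$Type, INN = toy$INN), 2)
toy_condition
```

Each column answers the question "among home runs hit in this inning, what proportion were of each type?"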

Don’t forget to set margins to FALSE and use ylim to adjust the limits on the y-axis so the legend is not blocking the bars.

library(mosaic)
HomeRun_Condition <- tally(~Type|INN,margins=FALSE,format="proportion",data=HomeRun)
barplot(HomeRun_Condition, beside = TRUE, cex.names = .7,legend=TRUE, ylim=c(0,1.6),main="Type of Home Run by Inning", xlab = "Inning", ylab = "Relative Frequency", col = c('#6897bb', '#c06723', '#baebae'))

```