Summarize Data

Chapter 2 Organize Qualitative Data

The material in this chapter is elementary, and many of your students have likely seen much of it in prior classes.  Therefore, do not get bogged down in the mechanics of constructing the graphs.  Instead, focus on what makes a graph well constructed and on how to display qualitative and quantitative data so that the graph clearly tells the story of the data.

Section 2.1 Summarizing Qualitative Data

I begin with a discussion of summarizing qualitative data in tables. This should not be a challenge for the students, so don't spend a lot of class time on this material.  Ideally, you will use software such as StatCrunch to build frequency and relative frequency tables.  The same goes for the construction of bar graphs and pie charts.  Do not emphasize by-hand construction, since nobody would be expected to construct one of these graphs by hand in practice.  Instead, emphasize when each type of graph should be constructed.  Bar graphs are typically used when we wish to compare one value of a variable to another, while pie charts are useful when comparing a part to the whole.  For example, what proportion of taxes collected by the federal government comes from personal income taxes?  This is easier to see from a pie chart than from a bar graph.  Bar graphs can also be used to display ordinal data (Poor, Fair, Good, Excellent).  Ask students: can the same information be displayed effectively in a pie chart?

Also, emphasize side-by-side bar graphs and the fact that they should be constructed using relative frequencies, since it is not "fair" to compare frequencies when the sample or population sizes differ.  A version of side-by-side bar graphs will also be used in Chapters 4 and 12 when we study conditional distributions.
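
The point about relative frequencies can be shown with a few lines of code. The following is a minimal sketch in Python (not StatCrunch); the function name relative_frequencies and the two groups of survey responses are made up for illustration.

```python
from collections import Counter

def relative_frequencies(responses):
    """Return {category: proportion} for a list of qualitative responses."""
    counts = Counter(responses)
    total = len(responses)
    return {category: count / total for category, count in counts.items()}

# Hypothetical survey responses; the group sizes differ (10 versus 40).
group_a = ["Good"] * 6 + ["Fair"] * 3 + ["Poor"] * 1
group_b = ["Good"] * 12 + ["Fair"] * 16 + ["Poor"] * 12

rel_a = relative_frequencies(group_a)
rel_b = relative_frequencies(group_b)

# Print the two relative frequency distributions side by side.
for category in ["Poor", "Fair", "Good"]:
    print(f"{category:5s}  A: {rel_a[category]:.2f}   B: {rel_b[category]:.2f}")
```

Group B has twice as many "Good" responses as Group A (12 versus 6), yet its relative frequency of "Good" is half as large (0.30 versus 0.60). A side-by-side bar graph of the raw frequencies would hide this.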

Section 2.2 Summarizing Quantitative Data

I decided to segment summarizing quantitative data into two sections—the popular displays (Section 2.2) and other graphs (Section 2.3). Section 2.3 is optional and can be skipped without loss of continuity.  When considering which topics you might skip, ask yourself if the topic will be needed or revisited later in the course.  For example, graphs such as frequency polygons are not utilized later in the course, so you might consider skipping this topic.  That said, some graphs allow for interesting results, such as time series plots. So, judgment should be used when deciding what to skip.    

In Section 2.2, do not get bogged down on by-hand creation of frequency or relative frequency distributions or the construction of histograms.  Allow technology to do the work so that you may focus on the fact that there are many acceptable class widths that provide a nice summary of the data.  However, be sure to emphasize that some class widths are really poor.  In fact, spend time making sure students understand the results from the “Exploring Histograms with StatCrunch” activity in the Student Activity Notebook.  In addition, spend time on distribution shape.  Require students to justify their conclusions regarding shape, rather than simply claiming a distribution is skewed right.
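
If you want a quick demonstration of the effect of class width outside of StatCrunch, the sketch below may help. It is written in Python, the data set is made up, and the helper names frequency_distribution and print_text_histogram are hypothetical. It assumes whole-number data and classes of the form "lower limit up to, but not including, the next lower limit."

```python
from collections import Counter

def frequency_distribution(data, lower_limit, class_width):
    """Return (lower, upper, count) for classes [lower, upper) of equal width.

    Assumes every observation is at least lower_limit.
    """
    counts = Counter((x - lower_limit) // class_width for x in data)
    return [
        (lower_limit + k * class_width,
         lower_limit + (k + 1) * class_width,
         counts.get(k, 0))
        for k in range(max(counts) + 1)
    ]

def print_text_histogram(data, lower_limit, class_width):
    """Print one row of asterisks per class as a rough histogram."""
    print(f"class width = {class_width}")
    for lower, upper, count in frequency_distribution(data, lower_limit, class_width):
        print(f"  {lower:3d} to <{upper:3d} | {'*' * count}")

# Hypothetical data set of 18 whole-number observations.
data = [12, 15, 21, 23, 24, 28, 31, 33, 35, 36, 38, 41, 44, 47, 52, 58, 63, 71]

for width in (2, 10, 50):
    print_text_histogram(data, lower_limit=0, class_width=width)
```

With a class width of 2 the display is mostly empty classes and single observations, and with a class width of 50 all of the data fall into two classes. A class width of 10 (and several nearby choices) gives a usable picture of the shape.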

Section 2.3 Additional Displays

This section is entirely optional and may be skipped without loss of continuity. Personally, I skip the entire section.  That said, you may consider at least requiring that students review time series graphs since they are popular in the media, and time series data is discussed at various points in the course (especially when we present correlation versus causation).

Section 2.4 Graphical Misrepresentations of Data

While this is an optional section, it does have merit. The media is full of examples where graphs mislead or misrepresent data. Be sure to alert students to be on the lookout for poor graphics and require students to clearly label each graph they create.

 

Chapter 3 Organize Quantitative Data

It is reasonable to conclude that many of your students will have seen and worked with the computational formulas presented in this chapter.  However, do not assume that students have a firm understanding of how to interpret measures of center, dispersion, and position.  It is unlikely that students know the shortcomings of each measure, or how to decide which measure is the best (or better) measure to report.  Therefore, the emphasis of this chapter should be two-fold:  (1) Emphasize what each measure represents and its interpretation, and (2) Emphasize which measure should be reported based on the data.  As in Chapter 2, by-hand computation should be de-emphasized.  It is okay to require students to compute a mean, median, or standard deviation by hand, but this should not be the primary emphasis of the chapter.

Section 3.1 Measures of Central Tendency

I begin with a discussion of measures of central tendency. There are a number of areas of emphasis.

  • First, it is worthwhile to emphasize the difference between parameters and statistics. A great activity to do this would be to collect classroom data on a simple quantitative variable, such as commute time to school (or work).  Treat the class as a population and compute the population mean.  Then, find at least two simple random samples from the class and determine the sample mean.  Emphasize that the sample mean varies depending on the individuals selected and, therefore, is a variable (foreshadowing sampling distributions).
  • Second, emphasize the idea of resistance. With the same population data from your activity, include an extreme observation.  What happens to the mean?  To the median?  This concept could also be presented using the Mean/Median applet in StatCrunch.  An optional activity is to explore the role the number of observations plays in resistance: the larger the sample, the less a single extreme observation moves the mean.
  • Third, emphasize the relationship between the mean and median in symmetric, skewed left, and skewed right distributions. This is a lead-in to determining which measure of central tendency should be reported for the various distribution shapes.
  • Most students have experience with means and medians.  Therefore, emphasize the idea of resistance with these measures.  The ideal tool for discussing the idea of resistance is the mean versus median applet.   Add points to the graph.  Drag points to see how each measure is affected.  Draw distributions of various shapes.  What is the relation between the mean and median for skewed left distributions? Skewed right? Symmetric?
  • One last thing—don't forget to emphasize the rounding rule in this course. We agree to round parameters and statistics to one more decimal place than the raw data.
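
The classroom activity described in the bullets above can be previewed with a short Python sketch. The commute times are made up, and the sample size of 4 and the extreme value of 180 minutes are arbitrary choices for illustration.

```python
import random
from statistics import mean, median

# Hypothetical commute times (minutes) for a class of 12, treated as the population.
commute = [5, 8, 10, 12, 15, 15, 18, 20, 22, 25, 30, 35]

# Parameter: the population mean.
population_mean = mean(commute)

# Statistics: means of two simple random samples of size 4.
rng = random.Random(1)  # fixed seed so a rerun gives the same samples
sample_means = [mean(rng.sample(commute, 4)) for _ in range(2)]

# Resistance: add one extreme observation and recompute both measures.
with_extreme = commute + [180]

# Rounding rule: one more decimal place than the raw (whole-number) data.
print("population mean:", round(population_mean, 1))
print("sample means:   ", [round(m, 1) for m in sample_means])
print("mean before/after extreme value:  ",
      round(mean(commute), 1), round(mean(with_extreme), 1))
print("median before/after extreme value:",
      round(median(commute), 1), round(median(with_extreme), 1))
```

In this made-up data set, the single extreme observation moves the mean from 17.9 to 30.4 minutes, while the median moves only from 16.5 to 18 minutes. Changing the seed gives different samples, which reinforces that the sample mean is a variable.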

Section 3.2 Measures of Dispersion

Emphasize that measures of center alone are not sufficient to describe a set of data. We present three measures of spread in this section: the range, the standard deviation, and the variance.  None of the three is resistant, so be sure to emphasize this.  Also, emphasize the shortcoming of the range (in spite of the ease of computation) in that it uses only two observations to obtain its value.  Spend a lot of time illustrating how standard deviation measures spread using deviations about the mean.  Present the interpretation as, roughly, an average deviation about the mean.  Therefore, the more observations that are far from the mean, the greater the standard deviation.  Variance should be covered, but don't dwell on it since interpretation of the value is difficult.
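
To illustrate how the standard deviation is built from deviations about the mean, here is a minimal Python sketch. It assumes the sample standard deviation (dividing by n - 1); the three data sets are made up and all have a mean of 50.

```python
from math import sqrt

def sample_standard_deviation(data):
    """Sample standard deviation computed from deviations about the mean."""
    n = len(data)
    xbar = sum(data) / n
    deviations = [x - xbar for x in data]            # deviations about the mean
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    return sqrt(variance)

def data_range(data):
    """Range: uses only the largest and smallest observations."""
    return max(data) - min(data)

# Hypothetical data sets, each with mean 50.
set_a = [48, 49, 50, 51, 52]   # observations close to the mean
set_b = [30, 40, 50, 60, 70]   # observations far from the mean
set_c = [30, 50, 50, 50, 70]   # same range as set_b, different spread

for name, data in (("A", set_a), ("B", set_b), ("C", set_c)):
    print(name, "range:", data_range(data),
          "standard deviation:", round(sample_standard_deviation(data), 1))
```

Sets B and C have the same range (40) because the range looks only at the two extreme observations, but C has the smaller standard deviation because fewer of its observations are far from the mean.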

When presenting the Empirical Rule, it is a good idea to treat it like a model. Compare actual counts to those obtained using the Empirical Rule.
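
One way to set up the comparison is sketched below in Python. The data are made up (20 roughly bell-shaped observations), the function name compare_to_empirical_rule is hypothetical, and the percentages 68%, 95%, and 99.7% are the usual Empirical Rule values.

```python
from statistics import mean, stdev

# Approximate proportions within 1, 2, and 3 standard deviations of the mean.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

def compare_to_empirical_rule(data):
    """Return (k, actual count, count predicted by the rule) for k = 1, 2, 3."""
    xbar, s, n = mean(data), stdev(data), len(data)
    rows = []
    for k, proportion in EMPIRICAL_RULE.items():
        actual = sum(1 for x in data if xbar - k * s <= x <= xbar + k * s)
        rows.append((k, actual, proportion * n))
    return rows

# Hypothetical, roughly bell-shaped data (20 observations).
heights = [61, 64, 66, 67, 68, 69, 69, 70, 70, 70,
           71, 71, 71, 72, 72, 73, 74, 75, 77, 80]

for k, actual, predicted in compare_to_empirical_rule(heights):
    print(f"within {k} standard deviation(s): actual {actual}, model {predicted:.1f}")
```

The actual and predicted counts will rarely match exactly, which is the point of treating the rule as a model: ask students whether the agreement is close enough for the model to be useful.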

Section 3.3 Measures of Central Tendency and Dispersion from Grouped Data

This section is optional and may be skipped without loss of continuity. If you do decide to cover it, we recommend that by-hand computation be de-emphasized.  If you are not using a TI calculator or StatCrunch, we suggest introducing the students to Excel and allowing it to do the heavy computational "lifting."

Section 3.4 Measures of Position

The focus of this section is measures of relative position. Emphasize that the z-score represents the number of standard deviations an observation is from the mean.  We do not discuss the method for obtaining percentile ranks, but instead focus on the interpretation of percentiles.  The main reason for not presenting the methodology behind finding percentile ranks is that dividing a small data set (fewer than 100 observations) into 100 parts to get percentile ranks does not make sense.  We wait until we have the normal model (Chapter 7) to discuss how percentile ranks might be assigned.
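
A z-score comparison takes only a few lines. The following Python sketch uses made-up exam scores, means, and standard deviations to compare relative position across two different distributions.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / standard_deviation

# Hypothetical exams: which score is better relative to its own class?
z_exam_1 = z_score(82, mean=70, standard_deviation=8)    # score of 82
z_exam_2 = z_score(88, mean=80, standard_deviation=10)   # score of 88

print("Exam 1 z-score:", z_exam_1)
print("Exam 2 z-score:", z_exam_2)
```

Although 88 is the higher raw score, the score of 82 is 1.5 standard deviations above its mean while the 88 is only 0.8 standard deviations above its mean, so the 82 is the better performance relative to its class.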

That said, quartiles represent important percentiles that are discussed in detail.  Emphasize the interpretation of quartiles.  In particular, emphasize the interquartile range as a resistant measure of dispersion.  This section also introduces outliers.  We intentionally do not present the "two standard deviations from the mean" rule for identifying outliers because this approach is based on the Empirical Rule (which assumes bell-shaped distributions).  Instead, we present Tukey's use of the lower and upper fences to find outliers.  It may be worthwhile to discuss that outliers could result from including an observation that is not part of the population under study, from a data entry error, or from an unusual observation within the population.  Spend some time talking about the different instances in which outliers may occur.
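
The fence calculation can be sketched in Python as follows. The quartiles here are found by taking the median of the lower and upper halves of the data (leaving the median out of both halves when the number of observations is odd); this is one common convention and is an assumption of the sketch, so software that uses a different quartile method may report slightly different fences. The data are made up.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 by the median-of-halves convention (an assumption here)."""
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]          # observations below the median position
    upper_half = ordered[(n + 1) // 2:]     # observations above the median position
    return median(lower_half), median(ordered), median(upper_half)

def tukey_fences(data):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(data):
    """Observations that fall outside the fences."""
    lower, upper = tukey_fences(data)
    return [x for x in data if x < lower or x > upper]

# Hypothetical commute times (minutes), including one extreme observation.
times = [5, 8, 10, 12, 15, 15, 18, 20, 22, 25, 30, 35, 180]

print("quartiles:", quartiles(times))
print("fences:   ", tukey_fences(times))
print("outliers: ", outliers(times))
```

Because the fences are built from the quartiles, which are resistant, the extreme observation does not inflate the cutoff used to flag it. That is not true of a rule based on the mean and standard deviation.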

Section 3.5 The Five-Number Summary and Boxplots

The major area of emphasis in this section is the boxplot. The boxplot is used for two major purposes: (1) to gauge the shape of the distribution, and (2) to identify outliers.  Boxplots will be used extensively in the inference chapters to identify outliers.

Data on Default Rates (Use to Illustrate Histograms, Boxplots, Side-by-Side Boxplots)

A cohort default rate is the percentage of a school's borrowers who enter repayment on certain Federal Family Education Loan (FFEL) Program or William D. Ford Federal Direct Loan (Direct Loan) Program loans during a particular federal fiscal year (FY), October 1 to September 30, and default or meet other specified conditions prior to the end of the second following fiscal year.  In FY2013, the national cohort default rate was 11.3%.  The following data represent default rates for schools participating in the Title IV financial assistance programs.

TitleIV Default Rate

Ideas for this Data

  • Construct frequency and relative frequency tables of DRate 1 (2013).
  • Draw a histogram of the default rate for 2013.
  • Draw a boxplot of the default rate for 2013.
  • Draw side-by-side boxplots of default rate by year.
  • Draw side-by-side boxplots of default rate for 2013 by school type.

Field Definitions 

Field Name      Field Definition
Name            Institution's Name
State           State Abbreviation
Program Length  The length of the longest program offered by the institution:
                  0 = Short-Term (300–599 hours)
                  1 = Graduate/Professional (≥ 300 hours)
                  2 = Non-Degree (600–899 hours)
                  3 = Non-Degree 1 Year (900–1799 hours)
                  4 = Non-Degree 2 Years (1800–2699 hours)
                  5 = Associate's Degree
                  6 = Bachelor's Degree
                  7 = First Professional Degree
                  8 = Master's Degree or Doctor's Degree
                  9 = Professional Certification
                  10 = Undergraduate (Previous Degree Required)
                  11 = Non-Degree 3 Plus Years (≥ 2700 hours)
                  12 = Two-Year Transfer

Sch Type        The code identifying the ownership control of the institution:
                  1 = Public
                  2 = Private, Nonprofit
                  3 = Proprietary
                  5 = Foreign public
                  6 = Foreign private
                  7 = Foreign For-Profit

Year 1          Cohort Year 2013
Num 1           Number of Borrowers in Default for 2013
Denom 1         Number of Borrowers in Repay for 2013
DRate 1         Official Cohort Default Rate for 2013

                The code classifying the ethnic affiliation of the institution:
                  1 = Native American
                  2 = HBCU
                  3 = Hispanic
                  4 = Traditionally Black College
                  5 = Ethnicity Not Reported

Year 2          Cohort Year 2012
Num 2           Number of Borrowers in Default for 2012
Denom 2         Number of Borrowers in Repay for 2012
DRate 2         Official Cohort Default Rate for 2012
Year 3          Cohort Year 2011
Num 3           Number of Borrowers in Default for 2011
Denom 3         Number of Borrowers in Repay for 2011
DRate 3         Official Default Rate for 2011

Last updated September 28, 2016