Decreasing Body Temperatures

Posted in Data Set, Research Article, Statistics Articles

Researchers at Stanford University used data from the Union Army Veterans of the Civil War (1860-1940), the National Health and Nutrition Examination Survey (1971-1975), and the Stanford Translational Research Integrated Database Environment (2007-2017) to determine that the mean body temperature in men and women has decreased by 0.03 degrees Celsius per birth decade.  The reduction in body temperature is a proxy for metabolic rate and this may help to explain the changes in the health of humans and the increase in life expectancy.

In 1851, Carl Reinhold August Wunderlich measured the body temperature of 25,000 humans and found that normal body temperature was 98.6 degrees Fahrenheit (or 37 degrees Celsius).  The mean body temperature of 578,222 individuals during the period from 2007 to 2017 was found to be 97.99 degrees Fahrenheit.  Below is a histogram showing the distribution of temperatures (drawn in R).  It seems pretty clear that the mean is below 98.6 degrees Fahrenheit.

Here is a link to the original study.   Included in the study is a zip file with all the original data.  The file is extremely large (over 70 MB) for the combined data, so analysis in most stats packages will be a challenge.   However, R can handle it.  If you are using other packages, first import the data to Excel and attempt to cut and paste what you need.  The data sets for the individual cohorts are a little more manageable.

Okay, so what can you do with this data?

  • Fit a normal model to the Stanford data by finding the population mean and standard deviation.  Use the model to determine the proportion of individuals whose temperature is above 98.6 degrees Fahrenheit. Compare this to the actual proportion in the data.
  • Treat the Stanford data as a population. Obtain 2000 simple random samples of size 9 from the population.  Compute the mean of each sample. Draw a histogram of the 2000 sample means.  Find the mean of the 2000 sample means. Find the standard deviation of the 2000 sample means.  Great for illustrating sampling distributions.
  • Are people "cooling off" over time?  Run a regression using birth year as the explanatory variable and temperature as the response variable.  Again, you might consider randomly sampling from each cohort and building a data set.
  • Do people "cool off" with age? Run a regression treating age as the explanatory variable and temperature as the response variable.
  • Is there a difference in the body temperature of the 1860-1940 (Union Army Vets) cohort compared to the 2007-2017 (Stanford) cohort?  Maybe take a random sample from each cohort and go through a two-sample t-test.
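For anyone who wants to try the first two ideas outside of R or StatCrunch, here is a minimal Python sketch. It does not read the study's zip file. Instead it builds a synthetic stand-in "population" using the 97.99 degree mean quoted above and an assumed standard deviation of 0.7 degrees, which is a placeholder and not a value from the study. Swap in the real temperatures to do the actual exercise.

```python
import random
from statistics import NormalDist, mean, pstdev, stdev

rng = random.Random(1)

# ASSUMPTION: synthetic stand-in for the Stanford cohort. The mean (97.99 F)
# comes from the post; the standard deviation (0.7 F) is a made-up placeholder.
population = [rng.gauss(97.99, 0.7) for _ in range(20_000)]

# Bullet 1: fit a normal model using the population mean and standard deviation.
mu = mean(population)
sigma = pstdev(population)
model_prop = 1 - NormalDist(mu, sigma).cdf(98.6)   # P(temp > 98.6) under the model
actual_prop = sum(t > 98.6 for t in population) / len(population)

# Bullet 2: 2000 simple random samples of size 9, and the mean of each sample.
sample_means = [mean(rng.sample(population, 9)) for _ in range(2000)]
mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

print(f"model proportion above 98.6:  {model_prop:.3f}")
print(f"actual proportion above 98.6: {actual_prop:.3f}")
print(f"mean of sample means: {mean_of_means:.2f} (population mean {mu:.2f})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(9) = {sigma / 3:.3f})")
```

With real data the model and actual proportions may not agree as closely as they do here, since the stand-in population is normal by construction. That comparison is the point of the first exercise.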


Frames in Polling

Posted in Statistics Articles

Random-digit-dialing (RDD) is a popular method for selecting people to be in polls.  So, the population from which a sample is drawn is anyone with a telephone (because RDD dials both listed and unlisted numbers).  A problem with RDD is that of bias as a result of nonresponse.

A study by Pew Research compared polling results using random-digit-dialing to the use of registered voter files.  In this study, it was found that in 56 of 65 survey questions, the estimates from the two polling methods were not significantly different.

Have your students read the article and discuss issues such as frames, nonresponse bias, under-representation, and the overall challenges in finding a sample that is representative of the population being studied.



StatCrunch Featured Data Sets

Posted in Uncategorized

In August, 2018, Pearson posted a wide variety of featured data sets that you can use to illustrate a variety of statistical concepts in your class. Each data set on this site has a thorough description of the data and provides a source along with descriptions of each variable.

I would like to focus on the “California Home Prices, 2009” data set. This particular data set is interesting because it represents a variety of variables on homes sold in San Luis Obispo County during the financial crisis that started in August, 2008 (including sale type such as regular, foreclosure, or short sale).  Many of our students would have been between 8 and 10 years old during this very scary time in our country’s financial history, so I think students would benefit from a brief discussion of what happened during the financial crisis – especially as it pertains to “easy” credit being extended to folks who wanted to purchase homes.  Some of the mortgages people secured to buy homes were interest only and others were adjustable rate mortgages.  In an adjustable rate mortgage, the interest rate on the loan may increase as the interest rate the mortgage is based on (such as the federal funds rate) increases. One outcome of the crisis was a massive decline in home prices, which left many folks with homes worth less than the amount they owed on them.  As a result, homeowners were either foreclosed upon or sold their home for less than they owed (short sale).

The Guidelines for Assessment and Instruction in Statistics Education (GAISE) report released in 2016 provides a variety of recommendations for teaching Introductory Statistics. One such recommendation is to “give students experience with multivariable thinking.”  This data set provides a relatively easy way to address this recommendation.

Begin by drawing a scatter diagram of “Price” against “SQFT” treating SQFT (square footage) as the explanatory variable.

Notice the positive association between the two variables.  However, there is an observation that is potentially an outlier and/or influential. Use your mouse to draw a square around the observation.  After doing this the observation changes color to magenta. Hit the down arrow on the spreadsheet to see which house this value corresponds to.  It is a 5060 square foot home that sold for $5,499,000 and is located in Arroyo Grande.

Notice that there are three types of sales reported in this data set (Foreclosure, Regular, and Short Sale).  You might ask students to research what each of these sale types represents.  Then, ask students to draw a scatter diagram by Status (use the “Group by:” drop-down menu in the Scatter Plot dialogue box).  This way, we are introducing a qualitative variable to the analysis and may be able to visualize the role status plays in any association between square footage and price.  What do you see in the graph?

Based on the graph, it looks like status plays a role in the association between square footage and sales price.   So, now students could find the linear correlation between the two variables by status (again, use the “Group by:” in the dialogue box).  All three status sales have roughly the same correlation.

Or, we could find the least-squares regression between square footage and sales price (treating square footage as the explanatory variable).

In the screen shot above, the least-squares regression between square footage, x, and sales price, y, is

Status = Foreclosure

The least-squares regressions for the other sale statuses are:

Status = Regular

Status = Short Sale

Students should interpret the slope of each regression as follows:  When the status is foreclosure, if the square footage increases by 1 square foot, the expected sale price increases by $236.  The obvious thing to note is that the slope of regular sales is much higher than the other two.  What does this suggest?  Distressed sales make the sales price increase much less, on average, as square footage increases.  Ask students if it makes sense to interpret the intercept.  The answer is that it does not because a home with 0 square feet does not make sense (and 0 square feet is outside the scope of the model).
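If you want to reproduce the "regression by status" idea outside of StatCrunch, the following Python sketch fits a separate least-squares line for each status. The rows in `homes` are made-up illustration values, not the StatCrunch data, and the helper names `least_squares` and `fit_by_status` are my own.

```python
from collections import defaultdict

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_by_status(rows):
    """Group (status, sqft, price) rows by status and fit one line per group."""
    groups = defaultdict(lambda: ([], []))
    for status, sqft, price in rows:
        groups[status][0].append(sqft)
        groups[status][1].append(price)
    return {status: least_squares(xs, ys) for status, (xs, ys) in groups.items()}

# HYPOTHETICAL rows for illustration only (status, square feet, sale price).
homes = [
    ("Foreclosure", 1000, 300_000), ("Foreclosure", 1500, 400_000),
    ("Foreclosure", 2000, 500_000),
    ("Regular", 1000, 550_000), ("Regular", 1500, 800_000),
    ("Regular", 2000, 1_050_000),
]

for status, (slope, intercept) in fit_by_status(homes).items():
    print(f"{status}: price = {intercept:.0f} + {slope:.1f} * sqft")
```

Each slope is read the same way as above: the expected change in sale price for one additional square foot, within that status.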

Finally, it is worthwhile to focus on the regular status homes (due to the potential outlier/influential observation).  To determine if the observation is an outlier, save the residuals when doing a regression. Then, draw a residual plot and also a boxplot of the residuals.

The residual plot does not show any obvious pattern (other than the potential outlier), so a linear model is appropriate in describing the association between square footage and price. In the boxplot, notice there are a number of outliers (surprising given the residual plot).  The largest residual corresponds to the 5060 square foot home in Arroyo Grande mentioned earlier.

We also want to know if Arroyo Grande observation is influential.  While advanced methods for identifying influential observations exist, at the introductory level, I recommend a simple graphical test using StatCrunch’s “Influential Observation” applet. To use this applet, select Applet > Regression > Influence in StatCrunch. Fill in the dialogue box as shown below. Click Compute!.

Use your mouse to draw a square around the alleged influential observation (in the upper-right portion of the scatter diagram). The observation becomes a square and two regression lines appear – one is the regression line that includes the alleged influential point; one is the regression line without the point.

Notice the slope without the Arroyo Grande observation (green line) is $447.7/square foot and the slope with the observation (red line) is $604.6/square foot. Removing the observation decreases the slope by 26%.  It would appear this one observation is influential.  So, while the slope of the least-squares regression line for regular sales is still much higher than it is for other sales (Foreclosure or Short Sale), the difference is not as extreme as it would appear from our original analysis.
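The 26% figure is just the relative change between the two slopes reported by the applet. A quick check in Python, using only the two slopes quoted above:

```python
slope_with = 604.6      # $/sq ft, regression including the Arroyo Grande home
slope_without = 447.7   # $/sq ft, regression excluding it

# Relative change in slope when the observation is removed.
pct_change = (slope_without - slope_with) / slope_with * 100
print(f"{pct_change:.1f}%")  # -26.0%
```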

This StatCrunch data set provides lots of opportunities for students to explore.  Most important, students are exposed to some multivariable thinking.  Perhaps students could explore the role the other variables (location, number of bedrooms, number of bathrooms) play in the selling price.

Learning Catalytics

Posted in Classroom Strategy

George Woodbury (@georgewoodbury) and I have written a Learning Catalytics course to accompany Interactive Statistics 2/e.  I started classes this week and immediately started using the program in my flipped class.  The level of engagement from my students is enormous and peer-to-peer instruction is taking place.  This has increased the level of understanding of my students and created a dynamic classroom.   I asked my students whether they prefer lecture or Learning Catalytics and they all replied “Learning Catalytics”!

If you would like a copy of our course, please email me or George.

Reproducibility of Research

Posted in Uncategorized

The article below discusses how much of the scientific research that folks in the media, the general population, and other “stake-holders” rely on is flawed because the results cannot be reproduced.  This is a great illustration of ethics in statistical research.  This could be used to formulate a classroom discussion about the ability to replicate research.

How Bad Is the Government’s Science? – WSJ

Beware the Lurking Variable

Posted in Data Set, Statistics Articles

I just completed the discussion on correlation and regression with my Introductory Statistics students. One of the recommendations within the new GAISE outline is to introduce students to multivariate analysis.  A classic application of this practice is the SAT score versus teacher salary data.

This data may be found by joining a group I created in StatCrunch titled “SullyStats”. To join the group, go to StatCrunch (if you don’t have a StatCrunch account, ask your Pearson representative for an account).  Under Explore, select Groups. Type SullyStats into the search box and join the group.  The data set is titled “SAT versus Teacher Salaries”.

Use the data to illustrate the danger in only considering a relation that appears to exist between two variables.  Draw a scatter diagram between Teacher Salary and Overall SAT score. What do you notice? What is the correlation coefficient between Teacher Salary and Overall SAT score?

Now introduce the variable “Percent Taking”.   This is a qualitative variable where “low” means less than or equal to 22% of eligible students took the SAT; “med” means between 23 and 49% of eligible students took the SAT; “high” means at least 50% of eligible students took the SAT. Draw a scatter diagram using a different plotting symbol for each level of “Percent Taking.” Find the linear correlation coefficient between salary and SAT score for each classification.  What happens to the apparent association between Teacher Salary and Overall SAT Score?
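To see how a lurking variable can flip an association, here is a small Python sketch that computes the correlation overall and then within each level of “Percent Taking.” The numbers are invented toy values chosen to show the effect. They are not the SullyStats data, and the function name `correlation` is my own.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson linear correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# HYPOTHETICAL toy values: (teacher salary in $1000s, overall SAT score),
# keyed by the "Percent Taking" level.
states = {
    "low":  [(30, 1050), (32, 1065), (34, 1070)],
    "med":  [(36, 950), (38, 960), (40, 970)],
    "high": [(42, 880), (44, 890), (46, 900)],
}

# Correlation ignoring "Percent Taking".
all_pairs = [pair for pairs in states.values() for pair in pairs]
overall_r = correlation([s for s, _ in all_pairs], [t for _, t in all_pairs])
print(f"overall r = {overall_r:.2f}")

# Correlation within each level of "Percent Taking".
within_r = {}
for level, pairs in states.items():
    within_r[level] = correlation([s for s, _ in pairs], [t for _, t in pairs])
    print(f"{level}: r = {within_r[level]:.2f}")
```

In this toy example the overall correlation is negative while the correlation within every level is positive, which is the pattern students should look for in the real data.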

Another area that I emphasize is the difference between deterministic relations and probabilistic relations.  As an example, I ask students to pretend they have a job that pays $20 per hour.  If the student is asked to work an additional hour, how much will the student earn?  Of course, the answer is $20 (before taxes).  This is a deterministic relation because the value of the response variable (earnings) can be determined with 100% certainty if the value of the explanatory variable (hours worked) is known.   Contrast this with probabilistic models.  This is easily illustrated by telling the student they are a server in a restaurant where they typically make $20 per hour.  Each additional hour of work in this scenario does not guarantee an additional $20 in income.  Why?  Sometimes the server gets good tips, sometimes bad tips.  Now extend this idea to the interpretation of slope in regression.  It is vital that students understand that slope interpreted from a least-squares regression model must emphasize the relation is not deterministic.  For example, I use Zillow to build a model where the sale price of a home is the response variable and the Zestimate is the explanatory variable.  If the slope in the regression model is $0.92, then we interpret the slope as follows:  If the Zestimate increases by $1, the selling price of the home increases by $0.92, on average. The words “on average” are vital because this is the change in selling price over the course of the observed data.

The General’s Dilemma

Posted in Classroom Strategy

Today I am going to do “The General’s Dilemma” activity in my Intro Stats class.  I am teaching completely randomized designs, so this is a great opportunity to illustrate the methodology behind this experimental design with this activity.  This data will be used to introduce the inferential methods of comparing two independent proportions using randomization methods.  Feel free to use this in your classes.


Activity – The General’s Dilemma

The following two questions are called the first and second versions of the General’s Dilemma.  The questions were written by psychologists Daniel Kahneman and Amos Tversky.


Version I:  Threatened by a superior enemy force, the general faces a dilemma.  His intelligence officers say his soldiers will be caught in an ambush in which 600 of them will die unless he leads them to safety by one of two available routes.  If he takes the first route, 200 soldiers will be saved.  If he takes the second, there is a one-third chance that 600 soldiers will be saved, and a two-thirds chance that none will be saved.  Which route should he take?


Version II:  Threatened by a superior enemy force, the general faces a dilemma.  His intelligence officers say his soldiers will be caught in an ambush in which 600 of them will die unless he leads them to safety by one of two available routes.  If he takes the first route, 400 soldiers will die.  If he takes the second, there is a one-third chance that no soldiers will die, and a two-thirds chance that 600 soldiers will die.  Which route should he take?


(a) Do not share both versions with your students.  Randomly assign Version I to half the students in your class and Version II to the other half.    If you like, tell your students they are participating in a randomized trial and share with them how you will assign each to a treatment group.

(b) Administer the treatment (be sure each version is written on the same style and color paper).  Ask each student to record which route he or she would take: Route 1 or Route 2.  Aggregate the data for the class.

(c) Show students both versions.  Ask them to explain why Version I might be called the “Saving Lives” version and Version II the “Preventing Deaths” version.  Also, ask students to explain why Route 1 might be called the “Risk-Averse” option and Route 2 the “Risk-Seeking” option.

(d) Organize the data in a two-way contingency table.  Let the row variable represent the version and let the column variable represent the route.

(e) Determine the conditional distribution of route selected by version.

(f) The research objective is to determine if there is a difference in route selected depending upon the version read.  Based on this, determine the null and alternative hypotheses.

(g) Use the randomization test for two proportions applet in StatCrunch to approximate a P-value for this hypothesis test.  Based on the result, what do you conclude?
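If you would like students to see what a randomization test for two proportions is doing, here is a rough Python sketch of the idea. It is not the StatCrunch applet's code, and the class counts at the bottom are hypothetical. The function name `randomization_test` and its parameters are my own.

```python
import random

def randomization_test(route1_a, n_a, route1_b, n_b, reps=5000, seed=0):
    """Approximate a two-sided P-value for the difference in the proportion
    choosing Route 1 between Version I (group a) and Version II (group b)."""
    rng = random.Random(seed)
    observed = route1_a / n_a - route1_b / n_b

    # Under the null hypothesis the version read does not matter, so pool all
    # responses (1 = Route 1, 0 = Route 2) and re-deal them to the two groups.
    total_route1 = route1_a + route1_b
    pooled = [1] * total_route1 + [0] * (n_a + n_b - total_route1)

    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Count re-randomizations at least as extreme as what was observed.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / reps

# HYPOTHETICAL class results: 14 of 20 students chose Route 1 under Version I,
# and 6 of 20 chose Route 1 under Version II.
observed, p_value = randomization_test(14, 20, 6, 20)
print(f"observed difference = {observed:.2f}, approximate P-value = {p_value:.3f}")
```

A small P-value suggests the wording of the version, and not chance in the random assignment, explains the difference in routes selected.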