Beard line: Time Series in R (Part III)

2012-06-05 00:31:38 · Forecasting, R, r-bloggers, weird data

In Part 2, we showed how to add recession shading to a plot of American Beards over time, and did some diagnostics to check whether 19th Century Americans grew recession beards. (Spoiler alert: it appears they did not.)  In Part 1, we showed how to plot the series in the first place.  Today, we're going to look at the beardly trend over the period.  We all know about the gilded age popularity of mutton chops and sideburns, but were full beards on the rise or on the decline between 1866 and 1911?  And more importantly, what can this period tell us about beards of the future (in the past)?!

What you'll learn

The Data

As before, the data comes from...

Robinson, D.E. Fashions in shaving and trimming of the beard: the men of the Illustrated London News, 1842-1972. American Journal of Sociology, 81:5 (1976), 1133-1141.

...via the TSDL.

The recession dates come from the NBER.

Let's Take a Look

We don't have much to add from the previous example.  In fact, you only need to put a single line onto the previous script:

# Add a bearded trend line
bplot3 <- bplot2 + geom_smooth(se=F, method="lm", formula=y~x, colour='red', size=1.125)
print(bplot3)

This provides a linear trend (method="lm"),  removes the standard error bands (se=F), asks for a linear regression to be applied (y~x), and colors the line red.

Some Notes

For now...

Beautiful Results

American Beards over Time in R

You can see our estimate, the red line, shows that in general, there was a trend of increasing beardfulness in late 19th Century America.

A Little more

The line we drew is a way of simplifying the information within our sample.  It lets us construct new estimates in-between years, even.  (Well, a second way---we've already done that once with the dotted lines between points.)  In-sample estimation is called interpolation.  You'd contrast that with extrapolation---which is estimating points beyond our sample. As you can see, the linear model interpolation is not spectacular---there are times when the sample deviates quite far from the intermediate value line we drew, and where using it would result in a bad estimate.

Well, extrapolating from this unfit model would be even worse.  It's bad enough that we should try it, as a motivatation for how we can do better.  So, let's extend this linear model into the future.  Let's forecast, and extrapolate from our model fit until we hit...

Peak Beardfulness

Our model is a simple line.  It never changes direction.  If it started up, over enough time, it'll go up to infinity.  If it started pointed down, it would just always get more negative.  So if we see it continuing up within our sample---there must be a time outside and beyond our sample when our line hits 100% and never turns back.  That is, our model suggests the shocking conclusion that we will someday hit 100% beardfulness.

When?

Using the following code from the "forecast" package in R, we can find the very year in which American beardfulness reaches 100%.

fit <- lm(data=beard.df,beard~date) # Fit a linear model
fdates <- as.Date(timeSequence(        # Over the next 100 years
    from = "1912-01-01",
    length.out = 100,
    by = "year"))
new <- data.frame(date=fdates)     # Make a new data frame
pred <- predict(fit,newdata=new) # Extrapolate over the new period
fcast <- cbind(new,pred)     # Connect past to future
names(fcast) <- c("date", "beard")
print(first(fcast[fcast$beard>100,])[,1]) # Stop at 100%...

The output of the forecast says the first date where our forecast exceeds 100% is in 1951.  Excellent.  Problem solved.

(Here is some Evidence we did not hit peak beard in 1951.)

What Went Wrong?

If you've ever looked around at modern American faces---generally unless you're looking at elderly math professors or the homeless--you'll know the answer is that we do not reach 100% beardfulness, but that beardfullness actually decreases in post-war America.  This is an illustration of the failure of extrapolation from an unsuitable model, right?---or perhaps it's a failure of American beardfulness...

You could actually argue both---on the one hand, our model didn't fully capture all the information in the data that would have given us better estimates. But on the other hand, something else was happening---trends changed.  In response to cultural fashions not present in the original data, previous data was no longer relevant---there was a regime change in beard standards that couldn't be foreseen within the previous data set---even if you could extract all the information from it, the rise of this trend likely just didn't exist or provide a signal beforehand.  Even with the best of models, the relationship between variables can change over time.

So, how can we do better?  Well...

Next up

We'll add credits to our plot for pretty publication and also explore why time series stationarity matters in making better forecasts.

Forecasting R r-bloggers weird data