Broadway Show Revenue Analysis

Motivation

As part of my CUNY SPS Masters in Data Science training I am taking a Business Analytics and Data Mining course. We use Simon Shealther’s A Modern Approach to Regression with R in the class. In chapter 2 he introduces the topic of simple linear regression. In the excercise section of that chapter he uses data on gross box office revenue for plays on Broadway in New York. He gives the following visualization for the week ending 2004-10-17:

He later states that “some promoters of Broadway plays use the prediction rule that next week’s gross box office results will be equal to this week’s gross box office results.” The exercise asks you to examine this claim with the data.

As I worked through this question I was surprised to find that the the previous week’s revenue really was a good predictor of the current week’s. Here’s the output of the regression model:

Call:
lm(formula = current_week ~ past_week, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-36926  -7525  -2581   7782  35443 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.805e+03  9.929e+03   0.685    0.503    
past_week   9.821e-01  1.443e-02  68.071   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18010 on 16 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9963 
F-statistic:  4634 on 1 and 16 DF,  p-value: < 2.2e-16

However, something did not sit right with me. The conclusion seemed to defy logic. If the current week’s revenue is predicted by the previous week’s, then there is basically no change in how much a show makes over time. The show would run week after week and pull in the same revenue. That did not make any sense. I began to wonder if the authors cherry-picked the data for the purpose of the exercise and if the trend would not hold if a different week was examined. I decided to look into this further.

Data Acquisition

I scrapped playbill.com’s website to collect the gross box office results for all possible years. The data goes back to as far as 1985-06-16. At the time of this analysis the data goes up to 2019-09-22. New data continues to be added.

After exploring the data, I compiled a dataset matching what the author produced. No data was excluded from my dataset. In all there are 1,789 weeks worth of data covering 1,090 shows preformed in 58 theaters.

Does the Previous Week Predict the Current Week?

So does the previous week’s revenue predict the current week? I looked deeper into the question. Here’s what I discovered:

The first thing I discovered is that the author threw out data from 5 shows (Brooklyn, I Am My Own Wife, Marc Salem’s Mind Games on Broadway, Reckless, and Twelve Angry Men). They did not explain this nor their reasoning. So the chart should have looked like this:

Hmmm. Let’s see what kind of effect this has on the regression model:

Call:
lm(formula = current_week ~ past_week, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-66882 -10036    989  11527  44544 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.405e+03  1.010e+04   0.238    0.814    
past_week   9.872e-01  1.636e-02  60.352   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24890 on 21 degrees of freedom
Multiple R-squared:  0.9943,    Adjusted R-squared:  0.994 
F-statistic:  3642 on 1 and 21 DF,  p-value: < 2.2e-16

There does not seem to be any problem with this model. I wonder why they dropped some rows from the dataset?

Full Dataset

Now there are 121 times the current or previous week is zero. Let’s eliminate these cases from the data set.  Now let’s look at the full dataset. We will look at the rule of thumb (last week predicts this week) and the linear regression line (since that is what the textbook is about).

There is definitely a positive correlation between the previous and current week’s box office revenue. The regression line (blue line above) is similar to the rule of thumb line (in red). Let’s see how the regression model preformed with more data:

Call:
lm(formula = current_week ~ past_week, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-2003564   -35775    -6446    29796  1573978 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.458e+04  8.779e+02    28.0   <2e-16 ***
past_week   9.651e-01  1.219e-03   791.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 113000 on 45306 degrees of freedom
Multiple R-squared:  0.9326,    Adjusted R-squared:  0.9326 
F-statistic: 6.27e+05 on 1 and 45306 DF,  p-value: < 2.2e-16

That is interesting. With a very high R2, the models indicates that that the current week’s gross box office revenue is about 97% of the past week’s revenue. This is close to but less than one.

What About Week by Week?

Now this is a little unfair because the example in the textbook only looked at one week. What if we looked at all the weeks? Here’s the summary statistics for all of the linear regressions:

   Intercept         Past Week      Adjusted R Squared
 Min.   :-315631   Min.   :0.3021   Min.   :0.5240    
 1st Qu.:  -5538   1st Qu.:0.9479   1st Qu.:0.9502    
 Median :  12500   Median :0.9830   Median :0.9738    
 Mean   :  18858   Mean   :0.9771   Mean   :0.9574    
 3rd Qu.:  36027   3rd Qu.:1.0122   3rd Qu.:0.9857    
 Max.   : 528262   Max.   :1.8348   Max.   :0.9999    

The current week’s revenue is on average 98% of the preceding weeks. This is close to 1 and the adjusted R2 are very high so the rule of thumb seems valid. But it is a fraction of the preceding week. This aligns nicely with my expectations.

Note: source code available at https://github.com/mikeasilva/playbill

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.