About a year ago I was introduced to Stan. Stan is a free and open-source probabilistic programming language and Bayesian inference engine. After watching the webinar I wanted to try it out but lacked the free time. I finally got to it this past weekend. Here’s my tale.

I wanted to do a regression on a well-known data set. I decided to use the data set compiled by Francis Galton on the relationship between the height of a child and their parents. Galton observed “regression towards mediocrity” or what would be referred to today as regression to the mean.

I analysed the Galton data published in the R HistData package. Since I planned to use the PyStan interface, I exported the data from R to a CSV file. I did not make any adjustments to the data.
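For anyone following along, loading the exported CSV with pandas looks like the sketch below. The real file has 928 rows with `parent` and `child` columns; here a tiny inline stand-in replaces it so the snippet is self-contained (the file name and the three rows are illustrative, not the actual data).

```python
import io
import pandas as pd

# Stand-in for the exported file; the real CSV came from R's
# HistData::Galton (928 rows, columns "parent" and "child"),
# e.g. via write.csv(Galton, "galton.csv", row.names=FALSE).
csv_text = """parent,child
70.5,61.7
68.5,61.7
65.5,62.2
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)           # (3, 2) for this stand-in
print(list(df.columns))   # ['parent', 'child']
```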

## OLS Regression Model

I ran an ordinary least squares (OLS) regression using the statsmodels library as a baseline. Here are the results of that regression:
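The original post doesn't show the fitting call, but with the statsmodels formula API it would look roughly like this. Since the Galton CSV isn't bundled here, the sketch generates synthetic parent/child heights with a similar slope; the seed, sample size, and coefficients are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Galton data: child height regresses
# toward the mean, with a slope of roughly 0.65 as in the real fit.
rng = np.random.default_rng(0)
parent = rng.normal(68.3, 1.8, size=928)
child = 23.9 + 0.65 * parent + rng.normal(0.0, 2.2, size=928)
df = pd.DataFrame({'parent': parent, 'child': child})

# OLS fit of child on parent via the formula interface
fit = smf.ols('child ~ parent', data=df).fit()
print(fit.summary())
print(fit.params['parent'])  # slope estimate near 0.65
```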

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  child   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.210
Method:                 Least Squares   F-statistic:                     246.8
Date:                Mon, 27 Mar 2017   Prob (F-statistic):           1.73e-49
Time:                        11:09:52   Log-Likelihood:                -2063.6
No. Observations:                 928   AIC:                             4131.
Df Residuals:                     926   BIC:                             4141.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.9415      2.811      8.517      0.000      18.425      29.458
parent         0.6463      0.041     15.711      0.000       0.566       0.727
==============================================================================
Omnibus:                       11.057   Durbin-Watson:                   0.046
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.944
Skew:                          -0.241   Prob(JB):                      0.00420
Kurtosis:                       2.775   Cond. No.                     2.61e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
    correctly specified.
[2] The condition number is large, 2.61e+03. This might indicate that there
    are strong multicollinearity or other numerical problems.
```

## Stan Model

Then I tried the same model in Stan. It took a couple of hours of the read-the-documentation, try, fail, pull-my-hair-out cycle before I was successful. Here is my Stan code:

```python
stan_model_code = """
data {
    int<lower=0> N;       // number of cases
    vector[N] x;          // predictor (covariate)
    vector[N] y;          // outcome (variate)
}
parameters {
    real alpha;           // intercept
    real beta;            // slope
    real<lower=0> sigma;  // outcome noise
}
model {
    y ~ normal(x * beta + alpha, sigma);
}
"""

stan_data = {
    'N': len(df['child'].values),
    'x': df['parent'].values,
    'y': df['child'].values
}

stan_model = pystan.stan(model_name='galton', model_code=stan_model_code,
                         data=stan_data, iter=1000, chains=4)
```

I got hung up on the data section. I had initially declared x and y without the vector type, which kept throwing errors. Here's the output from Stan:

```
Inference for Stan model: galton_2a77bd156aec196d5a464494a175b11a.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean    sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha  24.24    0.11  2.67   18.9  22.45   24.2  26.17  29.37    607   1.01
beta    0.64  1.6e-3  0.04   0.57   0.61   0.64   0.67   0.72    606   1.01
sigma   2.24  1.9e-3  0.05   2.14   2.21   2.24   2.27   2.35    732    1.0
lp__   -1211    0.05  1.19  -1214  -1212  -1211  -1210  -1210    643    1.0

Samples were drawn using NUTS at Mon Mar 27 11:11:15 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains
(at convergence, Rhat=1).
```

## Comparison of Results

I’ve pulled out all the results and put them in a side-by-side comparison. The results are very similar, as one would expect. The intercept is around 24 (95% interval roughly 18–29), and the coefficient for the parent variable is about 0.65 (roughly 0.57–0.73).

|                   | OLS Mean | OLS 2.5% | OLS 97.5% | Stan Mean | Stan 2.5% | Stan 97.5% |
|-------------------|---------:|---------:|----------:|----------:|----------:|-----------:|
| Intercept (Alpha) | 23.9415  | 18.425   | 29.458    | 24.24     | 18.9      | 29.37      |
| Parent (Beta)     | 0.6463   | 0.566    | 0.727     | 0.64      | 0.57      | 0.72       |
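The Stan interval in the table is nothing more than the 2.5th and 97.5th percentiles of the posterior draws (in PyStan 2 you'd get the draws with something like `stan_model.extract()['beta']`). The sketch below illustrates the computation on synthetic draws matching the reported beta summary; the draws themselves are simulated, not the actual fit.

```python
import numpy as np

# Stand-in for posterior draws of beta: 2000 synthetic samples
# centered on the reported posterior mean (0.64) and sd (0.04).
rng = np.random.default_rng(1)
beta_draws = rng.normal(0.64, 0.04, size=2000)

mean = beta_draws.mean()
lo, hi = np.percentile(beta_draws, [2.5, 97.5])
print(f'beta: mean={mean:.2f}, 95% interval=({lo:.2f}, {hi:.2f})')
```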

Now that I’ve used Stan, I’m confident I will be using this Bayesian modeling tool again in the future. As always, my source code is available in my GitHub repository.