About a year ago I was introduced to Stan. Stan is a free, open-source probabilistic programming language and Bayesian inference engine. After watching the webinar I wanted to try it out but lacked the free time. I finally got to it this past weekend. Here’s my tale.
I wanted to do a regression on a well-known data set, so I chose the data compiled by Francis Galton on the relationship between the heights of children and their parents. Galton observed “regression towards mediocrity,” or what would today be called regression to the mean.
I analysed the Galton data published in the R HistData package. Since I was going to use the PyStan interface, I exported the data from R to a CSV. I did not make any adjustments to the data.
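Loading the exported CSV into Python is a one-liner with pandas. A minimal sketch — the column names match the HistData export, but the filename and the sample rows below are illustrative stand-ins, not the actual measurements:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv('galton.csv'); the rows below are
# illustrative, not actual Galton measurements. The real export
# has 928 rows with 'parent' and 'child' height columns (inches).
csv_text = """parent,child
70.5,61.7
68.5,61.7
65.5,62.2
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # → ['parent', 'child']
```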
OLS Regression Model
I ran an ordinary least squares (OLS) regression using the StatsModel library as a sort of baseline. Here are the results of that regression:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  child   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.210
Method:                 Least Squares   F-statistic:                     246.8
Date:                Mon, 27 Mar 2017   Prob (F-statistic):           1.73e-49
Time:                        11:09:52   Log-Likelihood:                -2063.6
No. Observations:                 928   AIC:                             4131.
Df Residuals:                     926   BIC:                             4141.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.9415      2.811      8.517      0.000      18.425      29.458
parent         0.6463      0.041     15.711      0.000       0.566       0.727
==============================================================================
Omnibus:                       11.057   Durbin-Watson:                   0.046
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.944
Skew:                          -0.241   Prob(JB):                      0.00420
Kurtosis:                       2.775   Cond. No.                     2.61e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
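The fit itself takes only a couple of lines with the StatsModels formula API. A sketch of that baseline — here synthetic stand-in data replaces the Galton frame (the real run fits the 928 parent/child heights and produces the table above). A slope well below 1 is Galton’s “regression towards mediocrity”: predictions are pulled toward the mean height.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the actual analysis uses Galton's 928
# parent/child heights (inches).
rng = np.random.default_rng(0)
parent = rng.normal(68.3, 1.8, 500)
child = 23.9 + 0.65 * parent + rng.normal(0, 2.2, 500)
df = pd.DataFrame({'parent': parent, 'child': child})

fit = smf.ols('child ~ parent', data=df).fit()
print(fit.params)

# Regression to the mean, using the OLS estimates from the table above:
# a short (64") and a tall (72") mid-parent both get predictions that
# sit closer to the overall mean (~68") than the parents do.
alpha, beta = 23.9415, 0.6463
for p in (64.0, 72.0):
    print(p, round(alpha + beta * p, 1))  # 64 -> 65.3, 72 -> 70.5
```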
Stan Model
Then I tried the same process with Stan. It took a couple of hours of the read-the-documentation, try, fail, pull-my-hair-out cycle before I was successful. Here is my Stan code:
stan_model_code = """
data {
    int<lower=0> N;       // number of cases
    vector[N] x;          // predictor (covariate)
    vector[N] y;          // outcome (variate)
}
parameters {
    real alpha;           // intercept
    real beta;            // slope
    real<lower=0> sigma;  // outcome noise
}
model {
    y ~ normal(x * beta + alpha, sigma);
}
"""
stan_data = {
    'N': len(df['child'].values),
    'x': df['parent'].values,
    'y': df['child'].values
}

stan_model = pystan.stan(model_name='galton', model_code=stan_model_code,
                         data=stan_data, iter=1000, chains=4)
I got hung up on the data block: at first I didn’t declare x and y as the vector type, which kept throwing errors. Here’s the output from Stan:
Inference for Stan model: galton_2a77bd156aec196d5a464494a175b11a.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean  se_mean    sd   2.5%    25%    50%    75%  97.5%  n_eff  Rhat
alpha  24.24     0.11  2.67   18.9  22.45   24.2  26.17  29.37    607  1.01
beta    0.64   1.6e-3  0.04   0.57   0.61   0.64   0.67   0.72    606  1.01
sigma   2.24   1.9e-3  0.05   2.14   2.21   2.24   2.27   2.35    732   1.0
lp__   -1211     0.05  1.19  -1214  -1212  -1211  -1210  -1210    643   1.0

Samples were drawn using NUTS at Mon Mar 27 11:11:15 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
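The mean, 2.5%, and 97.5% columns are just summaries of the posterior draws. A sketch of how you could compute them yourself — in PyStan 2 the draws come from the fit object’s extract() method; here simulated draws stand in for stan_model.extract()['beta']:

```python
import numpy as np

# Simulated stand-in for the 2000 post-warmup draws of beta
# (in the real run: stan_model.extract()['beta']).
rng = np.random.default_rng(1)
beta_draws = rng.normal(0.64, 0.04, 2000)

# Posterior mean and central 95% interval, as in Stan's summary table.
mean = beta_draws.mean()
lo, hi = np.percentile(beta_draws, [2.5, 97.5])
print(round(mean, 2), round(lo, 2), round(hi, 2))
```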
Comparison of Results
I’ve pulled out all the results and put them in a side-by-side comparison. The results are very similar (as one would expect). The intercept is around 24 (95% interval roughly 18-29), with the coefficient for the parent variable at about 0.6 (roughly 0.5-0.7).

                         OLS                        Stan
                   Mean     2.5%    97.5%     Mean    2.5%   97.5%
Intercept (Alpha)  23.9415  18.425  29.458    24.24   18.9   29.37
Parent (Beta)       0.6463   0.566   0.727     0.64   0.57   0.72
Now that I’ve used Stan, I’m confident I will be using this Bayesian modeling tool in the future. As always, my source code is available on my GitHub repository.