Bringing Pystan to Anaconda

I’ve recently learned how to create Python packages for the Anaconda distribution. I have created pystan packages for 64-bit Linux and Windows systems. Anyone is free to install them with

conda install pystan -c mikesilva

This will save others the effort of downloading and compiling the program. This is not the biggest hurdle in the world but may be large enough to prevent others from using this great software package. I’m happy to make it available so we can focus our efforts on using the tool rather than building the tool.

Francis Meets Stan

About a year ago I was introduced to Stan through a webinar.  Stan is a free and open-source probabilistic programming language and Bayesian inference engine.  After watching the webinar I wanted to try it out but lacked the free time.  I finally got to it this past weekend.  Here’s my tale.

I wanted to run a regression on a well-known data set.  I decided to use the data compiled by Francis Galton on the relationship between the heights of children and the heights of their parents.  Galton observed “regression towards mediocrity,” or what would today be called regression to the mean.

I analysed the Galton data published in the R HistData package.  Since I was going to use the pystan interface, I exported the data from R to a CSV file.  I did not make any adjustments to the data.
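Loading the exported CSV on the Python side is a one-liner with pandas.  Here is a minimal sketch; the three rows below are made-up stand-ins, but the parent/child columns match the HistData Galton export:

```python
import io
import pandas as pd

# Stand-in for the CSV exported from R's HistData Galton data set;
# the real file has 928 rows with the same parent/child columns.
csv_text = io.StringIO(
    "parent,child\n"
    "70.5,61.7\n"
    "68.5,65.2\n"
    "65.5,63.2\n"
)
df = pd.read_csv(csv_text)
print(df.describe())
```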

OLS Regression Model

I ran an ordinary least squares (OLS) regression using the StatsModels library as a baseline.  Here are the results of that regression:

                            OLS Regression Results                            
Dep. Variable:                  child   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.210
Method:                 Least Squares   F-statistic:                     246.8
Date:                Mon, 27 Mar 2017   Prob (F-statistic):           1.73e-49
Time:                        11:09:52   Log-Likelihood:                -2063.6
No. Observations:                 928   AIC:                             4131.
Df Residuals:                     926   BIC:                             4141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     23.9415      2.811      8.517      0.000      18.425      29.458
parent         0.6463      0.041     15.711      0.000       0.566       0.727
Omnibus:                       11.057   Durbin-Watson:                   0.046
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.944
Skew:                          -0.241   Prob(JB):                      0.00420
Kurtosis:                       2.775   Cond. No.                     2.61e+03

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Stan Model

Then I tried the same process with Stan.  It took a couple of hours of the read-the-documentation, try, fail, pull-my-hair-out cycle before I was successful.  Here is my Stan code:

import pystan
import pandas as pd

df = pd.read_csv('galton.csv')  # CSV exported from R's HistData package (filename assumed)

stan_model_code = """
data {
    int<lower=0> N;      // number of cases
    vector[N] x;         // predictor (covariate)
    vector[N] y;         // outcome (variate)
}
parameters {
    real alpha;          // intercept
    real beta;           // slope
    real<lower=0> sigma; // outcome noise
}
model {
    y ~ normal(x * beta + alpha, sigma);
}
"""

stan_data = {
    'N': len(df['child'].values),
    'x': df['parent'].values,
    'y': df['child'].values
}

stan_model = pystan.stan(model_name='galton', model_code=stan_model_code,
                         data=stan_data, iter=1000, chains=4)

I got hung up on the data section: at first I didn’t declare x and y with the vector type, which threw errors.  Here’s the output from Stan:

Inference for Stan model: galton_2a77bd156aec196d5a464494a175b11a.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha  24.24    0.11   2.67   18.9  22.45   24.2  26.17  29.37    607   1.01
beta    0.64  1.6e-3   0.04   0.57   0.61   0.64   0.67   0.72    606   1.01
sigma   2.24  1.9e-3   0.05   2.14   2.21   2.24   2.27   2.35    732    1.0
lp__   -1211    0.05   1.19  -1214  -1212  -1211  -1210  -1210    643    1.0

Samples were drawn using NUTS at Mon Mar 27 11:11:15 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparison of Results

I’ve pulled out all the results and put them in a side-by-side comparison.  The results are very similar (as one would expect).  The intercept is around 24 (95% interval roughly 18–29), with the coefficient for the parent variable at about 0.65 (roughly 0.57–0.73).

                       OLS                        Stan
                   Mean     2.5%    97.5%    Mean   2.5%   97.5%
Intercept (Alpha)  23.9415  18.425  29.458   24.24  18.9   29.37
Parent (Beta)       0.6463   0.566   0.727    0.64   0.57   0.72
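The Stan columns in that table are just the mean and the 2.5%/97.5% quantiles of the posterior draws (a PyStan 2 fit exposes them via extract()).  A numpy sketch on simulated draws, standing in for the real posterior:

```python
import numpy as np

# Hypothetical stand-in for the 2000 post-warmup draws of beta;
# a real fit would supply these via stan_model.extract()['beta'].
rng = np.random.default_rng(1)
beta_draws = rng.normal(0.64, 0.04, 2000)

mean = beta_draws.mean()
lo, hi = np.percentile(beta_draws, [2.5, 97.5])
print(f"beta: mean={mean:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```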

Now that I’ve used Stan, I am confident I will be using this Bayesian modeling tool in the future.  As always, my source code is available in my GitHub repository.