
Bringing PyStan to Anaconda

I’ve recently learned how to create Python packages for the Anaconda distribution. I have created PyStan packages for 64-bit Linux and Windows systems. Anyone is free to download them with

conda install pystan -c mikesilva

This will save others the effort of downloading and compiling the program. This is not the biggest hurdle in the world but may be large enough to prevent others from using this great software package. I’m happy to make it available so we can focus our efforts on using the tool rather than building the tool.


Francis Meets Stan

About a year ago I was introduced to Stan through a webinar.  Stan is a free and open-source probabilistic programming language and Bayesian inference engine.  After watching the webinar I wanted to try it out but lacked the free time.  I finally got to it this past weekend.  Here’s my tale.

I wanted to do a regression on a well-known data set.  I decided to use the data set compiled by Francis Galton on the relationship between the height of a child and their parents.  Galton observed “regression towards mediocrity” or what would be referred to today as regression to the mean.

I analyzed the Galton data published in the R HistData package.  Since I planned to use the PyStan interface, I exported the data from R to a CSV file.  I did not make any adjustments to the data.

OLS Regression Model

I ran an ordinary least squares (OLS) regression using the statsmodels library as a sort of baseline.  Here are the results of that regression:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  child   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.210
Method:                 Least Squares   F-statistic:                     246.8
Date:                Mon, 27 Mar 2017   Prob (F-statistic):           1.73e-49
Time:                        11:09:52   Log-Likelihood:                -2063.6
No. Observations:                 928   AIC:                             4131.
Df Residuals:                     926   BIC:                             4141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.9415      2.811      8.517      0.000      18.425      29.458
parent         0.6463      0.041     15.711      0.000       0.566       0.727
==============================================================================
Omnibus:                       11.057   Durbin-Watson:                   0.046
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.944
Skew:                          -0.241   Prob(JB):                      0.00420
Kurtosis:                       2.775   Cond. No.                     2.61e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
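
For reference, here is a minimal sketch of that baseline regression using pandas and statsmodels; the file name is just a placeholder for the CSV I exported from R, which has parent and child columns:

import pandas as pd
import statsmodels.formula.api as smf

# Load the Galton data exported from R (file name is a placeholder)
df = pd.read_csv('galton.csv')

# Regress child height on parent height, mirroring the OLS baseline above
ols_fit = smf.ols('child ~ parent', data=df).fit()
print(ols_fit.summary())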

Stan Model

Then I tried the same process with Stan.  It took a couple of hours of the read-the-documentation, try, fail, pull-my-hair-out cycle before I was successful.  Here is my Stan code:

import pystan  # PyStan 2.x interface

stan_model_code = """
data {
    int<lower=0> N; // number of cases
    vector[N] x; // predictor (covariate)
    vector[N] y; // outcome (variate)
}
parameters {
    real alpha; // intercept
    real beta; // slope
    real<lower=0> sigma; // outcome noise
}
model {
    y ~ normal(x * beta + alpha, sigma);
}
"""

# df is the pandas DataFrame holding the Galton data exported from R
stan_data = {
    'N': len(df['child'].values),
    'x': df['parent'].values,
    'y': df['child'].values
}

stan_model = pystan.stan(model_name='galton', model_code=stan_model_code,
                         data=stan_data, iter=1000, chains=4)

I got hung up on the data block.  I initially didn’t declare x and y with the vector type, which was throwing errors.  Here’s the output from Stan:

Inference for Stan model: galton_2a77bd156aec196d5a464494a175b11a.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha  24.24    0.11   2.67   18.9  22.45   24.2  26.17  29.37    607   1.01
beta    0.64  1.6e-3   0.04   0.57   0.61   0.64   0.67   0.72    606   1.01
sigma   2.24  1.9e-3   0.05   2.14   2.21   2.24   2.27   2.35    732    1.0
lp__   -1211    0.05   1.19  -1214  -1212  -1211  -1210  -1210    643    1.0

Samples were drawn using NUTS at Mon Mar 27 11:11:15 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparison of Results

I’ve pulled out all the results and put them in a side-by-side comparison.  The results are very similar (as one would expect).  The intercept is around 24 (roughly 18 to 29), with the coefficient for the parent variable at about 0.65 (roughly 0.57 to 0.73).

                          OLS                         Stan
                   Mean     2.5%    97.5%      Mean    2.5%   97.5%
Intercept (Alpha)  23.9415  18.425  29.458     24.24   18.9   29.37
Parent (Beta)       0.6463   0.566   0.727      0.64   0.57    0.72
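
For reference, the Stan side of that table can be pulled straight out of the fit object in PyStan 2; a minimal sketch, where the 2.5% and 97.5% figures are just percentiles over the posterior draws:

import numpy as np

# Posterior draws from the fitted model returned by pystan.stan()
samples = stan_model.extract()

for name in ('alpha', 'beta'):
    draws = samples[name]
    lo, hi = np.percentile(draws, [2.5, 97.5])
    print('{}: mean={:.3f}, 2.5%={:.3f}, 97.5%={:.3f}'.format(name, draws.mean(), lo, hi))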

Now that I’ve used Stan, I am confident I will be using this Bayesian modeling tool in the future.  As always, my source code is available in my GitHub repository.


Visualizing EMS Service Delivery Options

I put together a neat little visualization that allows the user to do a back-of-the-envelope calculation for an EMS service delivery option.  You can try it out at https://msilva-cgr.shinyapps.io/essex-county-ems-options/.  It is a Shiny app.  Because of the time crunch it is slow: it does a lot of calculations on the fly.  If I were to do it again I would preprocess the data and save the results, which would cut out the costly computations.  But in any event it is a really neat tool.

[Screenshot of the Essex County EMS options Shiny app]

How to Calculate Cosine Similarity in Excel

I often use cosine similarity at my job to find peers.  Cosine similarity measures how similar two vectors are, based on the angle between them.  While there are libraries in Python and R that will calculate it, sometimes I’m doing a small-scale project, so I use Excel.  Here’s how to do it.

First the Theory

I will not go into depth on what cosine similarity is as the web abounds in that kind of content.  Suffice it to say the formula for cosine similarity (for those of you who are mathematically inclined) is:

similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σ AᵢBᵢ / ( √(Σ Aᵢ²) √(Σ Bᵢ²) )

So for the numerator we need to find the dot product of the two vectors.  That is done using the SUMPRODUCT function in Excel.  For each term in the denominator you need to find the square root of the sum of squares.  This is done by using the SUMSQ function nested in a SQRT function.

Now the Practice

I have put together a little template that shows how to calculate the cosine similarity.  I first organize the data in the spreadsheet so the attributes (or features or variables) go across the columns and each geography occupies its own row.  I usually put the entity we are trying to find peers for at the top of the spreadsheet.  I scale the data so that each attribute ranges between zero and one; this way differences in one attribute don’t overpower the others simply due to a difference in scale.  With all that done I am ready to compute the cosine similarity.

Let’s suppose you have four attributes (A, B, C and D) for the baseline and five peers.  It would look like this in Excel:

[Example spreadsheet: attributes A–D across columns B–E, with the Baseline in row 2 and Peers 1–5 in rows 3–7]

In order to calculate the cosine similarity of Peer 1 and the Baseline, I would divide the dot product (=SUMPRODUCT(B$2:E$2,B3:E3)) by the square root of the sum of squares multiplied together (=SQRT(SUMSQ(B3:E3))*SQRT(SUMSQ($B$2:$E$2))).  So if you want it all in one hairy formula in cell F3 it would be:

=SUMPRODUCT(B$2:E$2,B3:E3)/(SQRT(SUMSQ(B3:E3))*SQRT(SUMSQ($B$2:$E$2)))
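
If you want to sanity-check the spreadsheet outside of Excel, here is a minimal sketch in Python with NumPy.  The numbers are placeholders rather than data from the template, and it mirrors the steps above: min-max scale each attribute column, then divide the dot product by the product of the vector norms:

import numpy as np

# Placeholder data: rows are geographies (Baseline first), columns are attributes A-D
raw = np.array([
    [120.0, 3.5, 0.40, 88.0],   # Baseline
    [100.0, 2.9, 0.55, 91.0],   # Peer 1
    [140.0, 4.1, 0.35, 79.0],   # Peer 2
])

# Scale each attribute column to the 0-1 range so no attribute dominates
scaled = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

baseline, peer_1 = scaled[0], scaled[1]

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = baseline.dot(peer_1) / (np.linalg.norm(baseline) * np.linalg.norm(peer_1))
print(cos_sim)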


Did I Just Slip Into the World of Big Data?

I was recently trying to build a random forest classifier on a data set using R.  As you can see, I ran into some problems:

[Screenshot of the R session showing the error]

The data set was half a gig with nearly one hundred rows.  I reduced the data down to three columns (the class and two features) but still couldn’t build the model in R.  I was able to generate a random forest classifier in Python; however, I wanted the R output so I could develop an API that would use the random forest model.
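
For what it’s worth, the Python side took only a few lines with scikit-learn.  Here is a minimal sketch, assuming the reduced three-column data set; the file and column names are placeholders:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder file and column names for the reduced data set (class plus two features)
df = pd.read_csv('data.csv')
X = df[['feature_1', 'feature_2']]
y = df['class']

# Fit a random forest classifier on the two features
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X, y)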

Does Investment in Public Libraries Increase Usage?

I recently mentioned that I had been exploring some data on public libraries.  Here’s the reason why.  A recent local newspaper article chronicled the role libraries are playing today.  It highlights some local libraries that have undergone major renovations recently.  In the article they claim:

The surge in popularity mirrors what other communities have seen. When they invest in libraries, the number of people using them goes up.

The claim seemed to rely on anecdotal evidence, so I set out to examine it using data.

Disclaimer/Confessions

I want to preface this by admitting that I am a big fan of libraries.  I have fond memories of summer reading programs in my childhood.  My very first exposure to the Internet happened in a public library.  I used to rollerblade to the local public library as a teenager to do my homework (even though I had my own desk at home).  When my parents moved and I visited them, one of the local attractions I wanted to see was their public libraries.  I love them.  However, I love claims backed with data more than anecdote, especially when it touches something close to me.

Methodology

I used data from the annual report for public and association libraries to evaluate the claims.  I looked at the data from 1991-2014.  As always, for those who care to replicate my analysis, you can check out the GitHub repository.

I examined the change in library “usage” in terms of circulation and visits.  I wanted to see whether investment in libraries spurred an increase in usage that died out over time, so I looked at before-and-after differences over windows ranging from one year to ten years around the investment.

There are just under 500 libraries that had a renovation over the time period.  There were also about 200 libraries in New York State that didn’t have major renovations.  I was able to use these libraries as a control group.  If there were a statistically significant difference between these two groups, there would be data to back up the newspaper article’s claim.
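
To give a sense of the kind of comparison involved (an illustrative sketch, not necessarily the exact test I ran), a two-sample t-test on, say, the percent change in circulation for the two groups would look like this in Python; the arrays are placeholders for the real data:

import numpy as np
from scipy import stats

# Placeholder arrays: percent change in circulation over a given window
renovated = np.array([0.05, -0.02, 0.10, 0.03])  # libraries with major renovations
control = np.array([0.04, 0.01, -0.03, 0.02])    # libraries without major renovations

# Welch's two-sample t-test comparing the renovated and control groups
t_stat, p_value = stats.ttest_ind(renovated, control, equal_var=False)
print(t_stat, p_value)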

Findings

After looking at circulation and visitation over the various time frames, there was no statistically significant difference between the libraries that were renovated and those that were not, neither over the short term nor the long term.  So the bottom line is that the claim that investment increases library usage is not supported by the data.