Simple Bayesian Regression with Baysig

16 July 2013

This document will show a simple Bayesian statistical analysis in the Baysig language. The code that carries out this analysis is generated by BayesHive through the "Linear regression" model builder, so you don't have to type it out yourself if you want to do something similar with your own data. But in this document, I want to go a bit more into the meaning of this code.

Why would you use Bayesian statistics? I will show in this document why a Bayesian analysis can be more informative than calculating an r^2 value or a p-value from a classical regression analysis; at the same time, the output of the Bayesian analysis is easier to understand. I will show how the output of the Bayesian analysis can be used both to estimate model parameters and to replace significance tests.

The dataset

I will be building a model for the "electricSupply" dataset which originally appeared here. Let's look at the first five rows:

take 5 (#electricSupply)
capital labour energy output
98.2880 0.3860 13.2190 1.2700
255.0680 1.1790 49.1450 4.5970
208.9040 0.5320 18.0050 1.9850
528.8640 1.8360 75.6390 9.8970
307.4190 1.1360 52.2340 5.9070

In this document I will keep things very simple, so I will concentrate on how one variable (output) is predicted by one other variable (energy). Let's start by plotting this relationship to get a feel for how it looks:

axisLabels "energy" "output" (scatterPlot (#electricSupply <#> (energy,output)))
[ {"x": 1.739,"y":0.125},{"x": 9.027,"y":1.832},{"x": 13.219,"y":1.27},{"x": 18.005,"y":1.985},{"x": 21.99,"y":2.239},{"x": 31.244,"y":2.728},{"x": 41.676,"y":4.865},{"x": 42.037,"y":3.507},{"x": 43.232,"y":4.477},{"x": 49.145,"y":4.597},{"x": 52.234,"y":5.907},{"x": 75.581,"y":7.037},{"x": 75.639,"y":9.897},{"x": 82.296,"y":8.727},{"x": 104.584,"y":9.685},{"x": 125.351,"y":10.077} ]

It is clear that there is some kind of relationship between the two variables. I will show you how to use a simple regression model to explore this relationship.

A simple regression model

The first thing we will want to do is to define the model for these data. Here is the model definition that BayesHive generates for you:

regress = prob
   beta_0 ~ improper_uniform
   beta_energy ~ improper_uniform
   variance ~ improper_uniform_positive
   repeat 16 $ prob
      energy ~ any
      output ~ normal (beta_0+energy*beta_energy) variance
      return {output=> output; energy=> energy}

That looks like a lot of work; it is certainly more work than just typing "y = mx + b + N(0, var)". There are several good reasons for that:

  • we have given the model a name (regress). That is good, because now we can pass it around to estimation procedures or manipulate the model.

  • in a Bayesian analysis, we have to state the prior, which is what we already know about the parameters before seeing the data. Here, I have chosen a flat uniform prior, which essentially says that we don't know anything about the parameters. But if you have some knowledge of these parameters from previous analyses, you would put that in the first three lines (for example, beta_0 ~ normal 5 1; see the sketch after this list).

  • we explicitly said that we are repeating draws from a common distribution.

  • we have also explicitly said what in the model is observed -- the last line returns the output and the energy.

Explicit is good, because it clears up mysteries.
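
To make the role of the prior concrete, here is a sketch of the same model with informative priors substituted into the first three lines. The specific numbers are purely illustrative assumptions on my part, not values derived from this dataset:

regressInformed = prob
   beta_0 ~ normal 5 1
   beta_energy ~ normal 0.1 0.01
   variance ~ improper_uniform_positive
   repeat 16 $ prob
      energy ~ any
      output ~ normal (beta_0+energy*beta_energy) variance
      return {output=> output; energy=> energy}

Everything below the first three lines is unchanged; only the statement of what we believed before seeing the data differs.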

Estimating parameters

Next, we will want to call estimate. The estimate procedure is what sets Baysig apart from most programming languages. Everything else in Baysig has been designed to make it easy to work with estimate.

regressElectricSupplyPars <* estimate regress (#electricSupply)

The output of running estimate is a probability distribution over the parameters in the regress model; this is what Bayesians call the posterior -- what you should believe about the model parameters after you have seen the data. The posterior is represented in Baysig as a probability distribution over records, where each field is a parameter in the model. It is probably much easier to understand this if we just ask for its value and see what Baysig prints out.

regressElectricSupplyPars ⇒ {
   beta_0 => 0.52 ± 0.50,
   beta_energy => 0.0897 ± 0.0084,
   variance => 1.30 ± 0.65,
}

I want to emphasise that this is just a summary and there is much more information in the posterior. Looking at this summary does tell you a bit about the approximate values that are estimated (mean plus/minus standard deviation of each estimated parameter).
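
For instance, because the posterior is an ordinary value, we can project out a single parameter with the same <#> operator used on the dataset above and print its summary on its own. This is a sketch: I am assuming the projected distribution prints with the same mean ± standard deviation convention as the full record:

regressElectricSupplyPars <#> beta_energy ⇒ 0.0897 ± 0.0084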

It is a bit nicer to see a graphical summary of the parameter estimates. BayesHive offers to generate code that defines several such plots, and I will describe each of them in detail. I will focus on the parameter beta_energy, which is the slope: the sensitivity of the output with respect to the energy. In the kind of analysis we are outlining here, it is often the most important parameter.

First we can look at the estimated distribution of this parameter. That gives us all the probable values, given the data. Here, we can see that the beta_energy parameter is probably somewhere around 0.09 and very likely to lie between 0.07 and 0.11. In some data analysis traditions, one spends a long time trying to find the single most likely value of this parameter, perhaps a number such as 0.089513. In the Bayesian school embraced in Baysig, this number has no particular meaning, nor is there any way of calculating it. The parameter beta_energy is uncertain and cannot be collapsed onto a single number.

axisLabels "beta_energy" "density" (distPlot (regressElectricSupplyPars <#> beta_energy))
[Density plot: posterior distribution of beta_energy (x-axis: beta_energy, y-axis: density), peaked around 0.09 with most of its mass between 0.07 and 0.11]
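
This is also where the posterior can replace a significance test. Instead of asking whether the slope is "significantly different from zero", we can ask directly how much posterior probability lies above zero, or above any practically relevant threshold. The following is a hypothetical sketch, assuming that <#> can map an arbitrary expression, and not just a field name, over the posterior:

regressElectricSupplyPars <#> (beta_energy > 0)

Judging from the density plot above, essentially all of the posterior mass lies above zero, so this analysis leaves little practical doubt that energy use predicts output.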