A simple Bayesian regression analysis in Baysig

This document will show a simple Bayesian statistical analysis in the Baysig language. The code that carries out this analysis is generated by BayesHive through the "Linear regression" model builder, so you don't have to type it out yourself if you want to do something similar with your own data. But in this document, I want to go a bit more into the meaning of this code.

Why would you use Bayesian statistics? I will show in this document why a Bayesian analysis can be more informative than calculating an r^2 value or a p-value from a classical regression analysis; at the same time, the output of the Bayesian analysis is easier to understand. I will show how the output of the Bayesian analysis can be used both to estimate model parameters and to replace significance tests.

I will be building a model for the "electricSupply" dataset which originally appeared here. Let's look at the first five rows:

take 5 (#electricSupply)

capital | labour | energy | output |
---|---|---|---|
98.2880 | 0.3860 | 13.2190 | 1.2700 |
255.0680 | 1.1790 | 49.1450 | 4.5970 |
208.9040 | 0.5320 | 18.0050 | 1.9850 |
528.8640 | 1.8360 | 75.6390 | 9.8970 |
307.4190 | 1.1360 | 52.2340 | 5.9070 |

In this document I will keep things very simple, so I will concentrate on how one variable (*output*) is predicted by one other variable (*energy*). Let's start by plotting this relationship to get a feel for how it looks:

axisLabels "energy" "output" (scatterPlot (#electricSupply <#> (energy,output)))

It is clear that there is *some* kind of relationship between the two variables. I will show you how to use a simple regression model to explore this relationship.

The first thing we will want to do is to define the model for these data. Here is the model definition that BayesHive generates for you:

regress = prob
   beta_0 ~ improper_uniform
   beta_energy ~ improper_uniform
   variance ~ improper_uniform_positive
   repeat 16 $ prob
      energy ~ any
      output ~ normal (beta_0+energy*beta_energy) variance
      return {output=> output; energy=> energy}

That looks like a lot of work; it is certainly more work than just typing "y=mx+b+N(0,var)". There are several good reasons for that:

* we have given the model a name (*regress*). That is good, because now we can pass it around to estimation procedures or manipulate the model.
* in a Bayesian analysis, we have to state the prior, which is what we already know about the parameters before seeing the data. Here, I have chosen a flat uniform prior, which essentially says that we don't know anything about the parameters. But if you have some knowledge of these parameters from previous analyses, you would put that in the first three lines (for example, `beta_0 ~ normal 5 1`).
* we explicitly said that we are `repeat`ing draws from a common distribution.
* we have also explicitly said what in the model is observed -- the last line `return`s the `output` and the `energy`.

Explicit is good, because it clears up mysteries.
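To make the generative reading of the model concrete, here is a rough sketch of the same process in Python rather than Baysig. The parameter values are made up for illustration (in the Baysig model they are unknown and get estimated); note that Baysig's `normal` takes a variance, while NumPy's takes a standard deviation.

```python
import numpy as np

# Hypothetical parameter values; in the Baysig model these are
# unknowns with priors, to be estimated from the data.
rng = np.random.default_rng(0)
beta_0, beta_energy, variance = 0.5, 0.09, 1.0

# "energy ~ any" marks energy as an unmodelled input, so here we
# simply invent 16 values; the "repeat 16" corresponds to drawing
# 16 (energy, output) rows from a common distribution.
energy = rng.uniform(10, 80, size=16)
output = rng.normal(beta_0 + energy * beta_energy, np.sqrt(variance))

print(output.shape)
```

Running the model forward like this is exactly what "repeat 16 draws from a common distribution" means: each row of the dataset is one draw.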

Next, we will want to call `estimate`. The `estimate` procedure is what sets Baysig apart from most programming languages. Everything else in Baysig has been designed to make it easy to work with `estimate`.

regressElectricSupplyPars <* estimate regress (#electricSupply)

The output of running `estimate` is a probability distribution over the parameters in the `regress` model; this is what Bayesians call the *posterior* -- what you should believe about the model parameters *after* you have seen the data. The posterior is represented in Baysig as a probability distribution over records where each field is a parameter in the model. It is probably much easier to understand this if we just ask for its value and see what Baysig prints out.

regressElectricSupplyPars ⇒
  { beta_0      => 0.52   ± 0.50,
    beta_energy => 0.0897 ± 0.0084,
    variance    => 1.30   ± 0.65 }

I want to emphasise that this is just a summary and there is much more information in the posterior. Looking at this summary does tell you a bit about the approximate values that are estimated (mean plus/minus standard deviation of the estimated parameters).
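Because the priors here are flat, this kind of summary can be cross-checked outside Baysig: with improper uniform priors, the posterior mean of the coefficients coincides with the least-squares estimate, and (conditional on the noise level) the posterior covariance is s^2 (X'X)^-1. Here is a minimal NumPy sketch on synthetic stand-in data -- the real electricSupply numbers live in BayesHive, so the values below are illustrative only.

```python
import numpy as np

# Synthetic stand-in data; intercept 0.5 and slope 0.09 are chosen
# to loosely mimic the fit reported above.
rng = np.random.default_rng(0)
energy = rng.uniform(10, 80, size=16)
output = 0.5 + 0.09 * energy + rng.normal(0.0, 1.0, size=16)

# Flat-prior posterior summary: mean = least-squares estimate,
# covariance = s^2 * (X'X)^-1 given the estimated noise level.
X = np.column_stack([np.ones_like(energy), energy])
beta_hat, *_ = np.linalg.lstsq(X, output, rcond=None)
resid = output - X @ beta_hat
s2 = resid @ resid / (len(output) - X.shape[1])
cov = s2 * np.linalg.inv(X.T @ X)
sd = np.sqrt(np.diag(cov))

print(f"beta_0      = {beta_hat[0]:.2f} +/- {sd[0]:.2f}")
print(f"beta_energy = {beta_hat[1]:.4f} +/- {sd[1]:.4f}")
```

The "mean ± sd" pairs printed by Baysig are exactly this kind of summary, except that `estimate` computes them by sampling and so works for models with no closed form.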

It is a bit nicer to see a graphical summary of the parameter estimates. BayesHive offers to generate code that defines several such plots, and I will describe each of them in detail. I will focus on the parameter `beta_energy`, which is the slope, or the sensitivity, of the output with respect to the energy. In the kind of analysis we are outlining here, it is often the most important parameter.

First we can look at the estimated distribution of this parameter. That gives us all the probable values, given the data. Here, we can see that the `beta_energy` parameter is probably somewhere around 0.09 and very likely to lie between 0.07 and 0.11. In some data analysis traditions, one spends a long time trying to find the single most likely value of this parameter, perhaps a number such as 0.089513. In the Bayesian school embraced in Baysig, this number has no particular meaning, nor is there any way of calculating it. The parameter `beta_energy` is *uncertain* and cannot be collapsed onto a single number.

axisLabels "beta_energy" "density" (distPlot (regressElectricSupplyPars <#> beta_energy))
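This distribution is also how the posterior replaces a significance test, as promised at the start: instead of asking for a p-value, ask for the posterior probability that the slope is positive. As an illustration of the idea outside Baysig, here is a rough Python sketch that approximates the posterior of `beta_energy` by a normal distribution with the mean and standard deviation printed above.

```python
import numpy as np

# Normal approximation to the posterior of beta_energy, using the
# summary reported by estimate: mean 0.0897, sd 0.0084.
rng = np.random.default_rng(1)
samples = rng.normal(0.0897, 0.0084, size=100_000)

# Bayesian analogue of a one-sided significance test:
# the posterior probability that the slope is positive.
p_positive = (samples > 0).mean()
print(f"P(beta_energy > 0) is about {p_positive:.4f}")
```

With the mean more than ten standard deviations above zero, this probability is essentially 1: the data leave almost no doubt that output increases with energy, and the statement is directly about the parameter rather than about a hypothetical repetition of the experiment.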