Type systems, data analysis and uncertainty

01 November 2013

The most successful languages for scientific data analysis are currently MATLAB, R, and Python, with Clojure and Julia [1] playing catch-up. Did you notice that all of these have something in common? They are dynamically typed. Heavyweight Fortran or C++ may be pulled out when performance is critical, but they are often not the first go-to tools for researchers in many fields.

A dynamic type system is one in which the types of variables are not known before runtime. In a dynamically typed language, we can write programs that are difficult to write in statically typed languages, in which the types of all variables and functions must be known (possibly inferred) at compile-time. For instance, take this function in JavaScript:

function myFunc(b) {
  if(b) 
     return 42;
  else
     return "Hello World!";
}

Here, the function myFunc returns either an integer or a string, depending on its input. So we cannot assign a type to this function's return value without knowing the inputs. [2]
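
For contrast, a statically typed language has to name the choice in the type. Here is a hedged Haskell sketch (not from the original post) using an ordinary sum type, so the return type is known at compile-time:

-- The two possible results are wrapped in a single sum type,
-- so myFunc has one fixed return type: Either Int String.
myFunc :: Bool -> Either Int String
myFunc True  = Left 42
myFunc False = Right "Hello World!"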

The debate between the proponents of static and dynamic types is as old and tired as the fight over syntax in programming languages. Both kinds of languages have advantages, and those advantages can be subtle and often the opposite of the arguments that are regurgitated ad infinitum. For instance, dynamic types, long associated with diminutive "scripting languages", are great for highly skilled programmers. A good static type system catches a lot of errors made by the less competent (including this author).

No side seems to be winning, and it looks as if there is room for both dynamically and statically typed languages. Except in data analysis, where most of the major players are dynamically typed. Here I want to focus on what static type systems can bring to data analysis.

The first thing that usually comes to mind is the possibility of enforcing dimensional consistency. For instance, code trying to add kilograms to meters would be caught by the compiler in a type system that enforces unit correctness - without running any code at all. The advantage of testing this at compile-time over run-time checking is that you know your program is correct for any input, not just the inputs with which you are testing or running your program.
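
As a flavour of what this looks like in practice, here is a minimal Haskell sketch using newtypes as stand-in unit annotations; the names and quantities are invented for the example, and real unit systems (F#'s units of measure, Haskell's dimensional package) are far more complete:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

-- Quantities in kilograms and in meters get distinct types,
-- so mixing them up is a compile-time error.
newtype Kilograms = Kilograms Double deriving (Show, Num)
newtype Meters    = Meters    Double deriving (Show, Num)

mass :: Kilograms
mass = Kilograms 70

height :: Meters
height = Meters 1.8

-- broken = mass + height      -- rejected by the compiler: Kilograms is not Meters
heavier :: Kilograms
heavier = mass + Kilograms 5   -- adding like units is fine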

Some of the most notorious software bugs have been due to insufficient treatment of dimensional types; the loss of the Mars Climate Orbiter, where one component worked in pound-force seconds and another in newton seconds, is probably the best-known example. We don't check units of measure in Baysig, but it is something we would like to do in the future.

Likewise, an obvious way to use static types is to ensure that vectors and matrices have the correct dimensions for their operations. For instance, you want to make sure that when you add two vectors, they have the same number of elements.

(1, 2) + (7, 8, 9) = WTF?

Using the type system to enforce vector and matrix dimensions is much more difficult, first of all because expressing the dimensions would require a dependent type system. That is, the type of a vector would depend on a number (its length), which is usually a value, not a type. Secondly, the dimensionality of the vector may not be known at compile-time at all, if for instance the vector represents observed data. If the dimensions aren't known at compile-time, then how can we check that they match the dimensions of some other vector or matrix? This is not an impossible problem, but it is a difficult one. Static checking of linear algebra is often used as a motivation for dependently typed programming languages, but there are few such systems available for practical data analysis.
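
For a flavour of the statically-known case, here is a hedged Haskell sketch (not Baysig) using GHC's promoted data kinds: the two-element and three-element vectors from the example above simply cannot be passed to the same addition.

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- A length-indexed vector: the length n is part of the type.
data Nat = Z | S Nat

data Vec (n :: Nat) a where
  VNil  :: Vec 'Z a
  VCons :: a -> Vec n a -> Vec ('S n) a

-- Element-wise addition only type-checks when both lengths are the same n.
vadd :: Num a => Vec n a -> Vec n a -> Vec n a
vadd VNil         VNil         = VNil
vadd (VCons x xs) (VCons y ys) = VCons (x + y) (vadd xs ys)

v2 :: Vec ('S ('S 'Z)) Int
v2 = VCons 1 (VCons 2 VNil)

v3 :: Vec ('S ('S ('S 'Z))) Int
v3 = VCons 7 (VCons 8 (VCons 9 VNil))

-- ok  = vadd v2 v2   -- fine: the lengths match
-- bad = vadd v2 v3   -- rejected by the compiler: length 2 is not length 3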

I want to highlight a different use of types in data analysis, and that is in tracking uncertainty. This has been highlighted again and again as a difficult aspect of the interpretation of scientific data. The plea for uncertainty is not confined to scientific research; social workers are urged to practice "respectful uncertainty". The correct handling of uncertainty is the principal aim of probabilistic programming. Wouldn't it be nice if the type system could tell us that such-and-such an analysis does not sweep anything under the carpet?

The solution [3] is to introduce a type for uncertain quantities, to have statistical estimation procedures use this type to return their results, and to restrict the operations available on uncertain quantities. Because we are good Bayesians here at OpenBrain Ltd., we represent uncertainty as probability distributions. But I wouldn't object if frequentists were to represent this uncertainty in a different way, perhaps as confidence intervals.
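
As a rough illustration of the idea, and emphatically not Baysig's actual implementation, a hypothetical Haskell module might keep the representation private and export nothing that collapses an uncertain value to a single number:

module Prob (Prob, fromSamples) where

-- An uncertain value, represented here as a private bag of posterior samples.
-- The constructor is not exported, so callers cannot look inside.
newtype Prob a = Prob [a]

-- An estimation procedure would wrap up its posterior samples like this.
fromSamples :: [a] -> Prob a
fromSamples = Prob

-- The only way to act on an uncertain value: transform every possibility at once.
instance Functor Prob where
  fmap f (Prob xs) = Prob (map f xs)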

The ability to manipulate uncertainty is crucial if you want to calculate anything further on the basis of statistical estimates. Say you have estimated some quantity x = 2.3 ± 0.5, but what you are really interested in is the quantity x². How do you calculate that? If you don't know any better, you might start by squaring 2.3. What about the uncertainty? You can calculate the transformed uncertainty using propagation of uncertainty (there is a small worked sketch after this list), but this has a number of limitations:

  • It is not easy, because it involves partial differentiation, so people often don't bother.

  • It makes strong assumptions about the linearity of the transform; even if you can calculate it for a non-linear transformation, it breaks down if the transform is too non-linear.

  • You may have had some idea that your initial uncertainty was normally distributed. But the square of a normally distributed quantity is not normally distributed, so you now only have an abstract uncertainty, not a distribution you can name.
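
For the running example x = 2.3 ± 0.5, the first-order rule gives an uncertainty on x² of |d(x²)/dx| · 0.5 = 2 · 2.3 · 0.5 = 2.3, centred on 2.3² = 5.29. A throwaway Haskell sketch of that calculation (the function name is made up for illustration):

-- First-order propagation of uncertainty for y = x^2:
-- sigma_y ~= |dy/dx| * sigma_x = |2 * mu| * sigma.
propagateSquare :: Double -> Double -> (Double, Double)
propagateSquare mu sigma = (mu * mu, abs (2 * mu) * sigma)

-- propagateSquare 2.3 0.5  gives roughly (5.29, 2.3)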

So you might be tempted into doing all sorts of ad-hoc things like the following (a quick numeric check comes after the list):

  • Forgetting about the uncertainty and only representing the transformed mean

  • Can't I just square the error estimate as well? (no)

  • Or maybe I can square the mean minus the error and the mean plus the error? (no)
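
To see how far off these shortcuts are for x = 2.3 ± 0.5, where the first-order answer above was 2.3, here is a small illustrative snippet (again just a sketch):

-- Naive shortcuts for the uncertainty of x^2, with x = 2.3 +/- 0.5.
mu, sigma :: Double
mu    = 2.3
sigma = 0.5

squaredError :: Double
squaredError = sigma * sigma                        -- 0.25: far too small

endpointsSquared :: (Double, Double)
endpointsSquared = ((mu - sigma)^2, (mu + sigma)^2)
  -- (3.24, 7.84): an interval that is not even centred on mu^2 = 5.29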

I once asked a distinguished Professor of Neuroscience, who was an expert in spike sorting using methods inspired by thermodynamics: why not propagate the uncertainty? Why throw it away? No, he said, that would be much too cumbersome. He did everything in MATLAB, of course.

On the other hand, if you are working within a type system that enforces uncertainty propagation, then you will simply not be able to get this wrong.

The key feature of such a type system is:

There is no way of collapsing uncertainty onto a single value

That means you cannot calculate the mean, the variance, the expectation, the median or anything else from the uncertainty. All you can do is transform uncertainty. Our friend here is a function that we call fmap because we have spent far too long in the land of functional programming. Let's give it a proper name and call it pmap for mapping-over-probability distributions.

[1] Many apologies if I have left out your favourite language.
[2] Unless our type system has polymorphic variants.
[3] Well, our solution.
pmap = fmap

Much better. Here is a normal distribution, and one that is squared with pmap.

myDist = normal 2.3 0.25

aspect 3 $
  besides [distPlot myDist,
           distPlot (pmap square myDist)]

Squaring all the possibilities represented by a probability distribution introduces a skew, so the transformed distribution is clearly no longer normal.

Here is a more complicated example. Let's say that you have estimated quantities a