# Archive for category Minimum Message Length

### MML analysis of the multivariate Gaussian distribution

In his book on the minimum message length (MML) theory of inference, Wallace shows how one may derive MML estimates and message length formulae for many popular parametric distributions. An important example is the multivariate Gaussian distribution, treated in Section 6.2.1 of the book. Here, Wallace derives the commonly used Wallace-Freeman (or MML87) message length formula and parameter estimates for the p-variate Gaussian using conjugate priors. After noticing a minor typographical error in the Fisher information formula (p. 262), I decided to re-examine the MML87 treatment of the Gaussian from scratch. The first draft of my attempt is available here.

I hope this document will be useful to researchers interested in learning more about the MML principle.

Update (24/07/2014): I have written MATLAB code to compute the MML estimates and the message length, and have made minor updates to the document.

### Parameter Inference under Type I Censoring

Suppose $n$ items have lifetimes $T_1, \ldots, T_n$ that are independent and identically distributed according to an exponential distribution with mean $\theta$ and probability density function

$p(T_i \,|\, \theta) = \theta^{-1} \exp(-T_i/\theta), \quad (i=1,\ldots,n)$

and suppose we would like to infer the mean. We shall assume that each lifetime is subject to Type I censoring with a common censoring threshold C > 0. That is, together with each recorded survival time we observe a censoring indicator, such that

$t_i = \min(T_i, C), \quad \delta_i = I(T_i \leq C)$

In words, we observe the lifetime $T_i$ exactly if and only if it is less than or equal to the threshold C; otherwise, we only learn that it exceeded C. The likelihood therefore factorises over the uncensored and censored observations: each uncensored datum contributes its density $\theta^{-1} \exp(-t_i/\theta)$, while each censored datum contributes the survival probability $P(T_i > C) = \exp(-C/\theta)$. Since $t_i = C$ whenever an observation is censored, the log likelihood function for this model is

$l(\theta) = - r \log \theta - \left( \sum_{i=1}^n t_i \right) / \theta$

where

$r= \sum_{i=1}^n \delta_i$

is the number of uncensored observations. We then maximise the log likelihood function and obtain the maximum likelihood (ML) estimator of the mean

$\hat{\theta}_{\rm{ml}} = \left( \sum_{i=1}^n t_i \right) / r$

This is certainly a reasonable estimator; it is asymptotically unbiased and consistent. However, we wish to investigate its performance at finite (small) sample sizes.
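To make the setup concrete, the censoring mechanism and the ML estimate are easy to simulate. The short Python sketch below is purely illustrative (the function names are my own invention): it draws exponential lifetimes, applies the threshold, and computes $\hat{\theta}_{\rm{ml}}$.

```python
import math
import random

def simulate_censored(n, theta, C, rng):
    """Draw n exponential lifetimes with mean theta, apply Type I censoring at C."""
    t, d = [], []
    for _ in range(n):
        life = rng.expovariate(1.0 / theta)   # expovariate takes the rate 1/theta
        t.append(min(life, C))                # observed (possibly censored) time
        d.append(1 if life <= C else 0)       # 1 = uncensored, 0 = censored
    return t, d

def ml_estimate(t, d):
    """ML estimator: total observed time divided by the number of uncensored data."""
    r = sum(d)
    if r == 0:
        return float('inf')                   # undefined when everything is censored
    return sum(t) / r

rng = random.Random(1)
t, d = simulate_censored(15, 3.0, 2.5, rng)
print(ml_estimate(t, d))
```

Note that when every observation is censored the estimator is undefined, a failure mode that becomes relevant in the small-sample experiments below.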

As an alternative, I have derived the Wallace–Freeman (WF) estimator

$\hat{\theta}_{\rm{wf87}} =\displaystyle{\mathop{\mbox{arg min}}_{\theta}} \left\{ (r-1) \log \theta + \frac{1}{\theta} \sum_{i=1}^n t_i + \frac{1}{2} \log \left( 1 - \exp(-C/\theta) \right) + \theta \right\}$

The WF estimator is Bayesian and requires specification of a prior distribution for the mean. For this example, the mean is assumed to follow an exponential prior distribution with mean one. Unlike the ML estimator, the WF estimator is not available in closed form and must be obtained by numerical optimisation. In this setting, the WF estimator converges to the ML estimator as the sample size grows, and hence possesses the same favourable properties of asymptotic unbiasedness and consistency. The question we wish to investigate is: how does the WF estimator compare to the ML estimator when the sample size is small?
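Since the objective is one-dimensional, any bounded minimiser will do. The Python sketch below is my own illustration (the helper names are hypothetical, and a crude two-stage grid search stands in for a proper optimiser); it evaluates the WF objective from the formula above on a toy data set:

```python
import math

def wf_objective(theta, t, d, C):
    """Wallace-Freeman message-length objective for the censored exponential model."""
    r = sum(d)
    return ((r - 1) * math.log(theta) + sum(t) / theta
            + 0.5 * math.log(1.0 - math.exp(-C / theta)) + theta)

def wf_estimate(t, d, C, lo=0.01, hi=100.0):
    """Crude two-stage grid search for the minimiser (sufficient for illustration)."""
    for _ in range(3):                        # refine the grid three times
        step = (hi - lo) / 1000.0
        grid = [lo + i * step for i in range(1001)]
        best = min(grid, key=lambda x: wf_objective(x, t, d, C))
        lo, hi = max(0.001, best - step), best + step
    return best

# toy data: lifetimes 1.0 and 2.0 observed exactly, one lifetime censored at C = 2.5
t, d, C = [1.0, 2.0, 2.5], [1, 1, 0], 2.5
print(wf_estimate(t, d, C))
```

The bounded search interval mirrors the box constraints used in the MATLAB experiment later in the post.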

We compare the WF and ML estimators using the loss function

$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 / \theta^2$

We note in passing that this is of course not the only possible loss function for this problem. It may in fact be preferable to use, say, the Kullback-Leibler divergence which, arguably, is a natural divergence between two distributions and is invariant under one-to-one transformations of the parameters. Since the WF estimator has no closed form, we compare the estimators by numerical experiment. A simple MATLAB function to compare the two estimators is given below:

```matlab
function [loss_mml, loss_ml] = cen_exp(niter, n, theta, C)

%% Setup numerical optimisation options
options = optimset('TolFun',1e-6,'GradObj','off','Algorithm','interior-point');
options = optimset(options, 'Display','off');

%% Record losses for mml and ml
loss_mml = zeros(niter,1);
loss_ml  = zeros(niter,1);

%% Do niter iterations
for i = 1:niter
    %% generate data
    y = exprnd(theta,n,1);
    d = ones(n,1);
    t = y;
    t(y>=C) = C;
    d(y>=C) = 0;

    %% estimate parameters
    msglen = @(x) sum(d)*log(x) + sum(t)./x + 1/2*log(1-exp(-C./x)) - log(x) + x;
    mml_theta = fmincon(msglen, 1.0, [],[],[],[],0.01,100.0,[],options);
    ml_theta  = sum(t)/sum(d);

    %% compute losses
    loss_mml(i) = (theta-mml_theta)^2/theta^2;
    loss_ml(i)  = (theta-ml_theta)^2/theta^2;
end

%% done
return;
```

As an example, we run the script for 1000 iterations with the following parameters

$n=15, \theta = 3, C = 2.5$

```matlab
>> [loss_mml, loss_ml] = cen_exp(1000, 15, 3.0, 2.5);
>> prctile(loss_mml,[5 25 50 75 95]), prctile(loss_ml,[5 25 50 75 95])

ans =

    0.0003    0.0085    0.0375    0.0978    0.2270

ans =

    0.0007    0.0129    0.0559    0.1501    0.9566
```

In this particular run, the WF estimator was slightly superior to the ML estimator. I briefly experimented with various values for the data generating mean, the sample size and the censoring threshold. For sample sizes less than about n=30, the WF estimator commonly outperforms the ML estimator. If the data generating mean is unlikely a priori, the ML estimator is slightly superior to the WF estimator, the difference being about 5%. The ML estimator is problematic when there is a large number of censored observations, tending to grossly overestimate the true mean. As expected, the two estimators converge to each other as the sample size grows. Overall, the WF estimator seems preferable to the ML estimator in small samples; in larger samples the two are more or less identical, though the ML estimator is much easier to compute.
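For readers without MATLAB, the experiment is straightforward to reproduce. The following is a rough Python port, illustrative only: the function and variable names are my own, and a golden-section search stands in for `fmincon`.

```python
import math
import random

def cen_exp(niter, n, theta, C, seed=0):
    """Monte Carlo comparison of the WF and ML estimators under Type I censoring.
    Returns two lists of squared relative losses, one per estimator."""
    rng = random.Random(seed)
    phi = (math.sqrt(5.0) - 1.0) / 2.0       # golden ratio conjugate
    loss_wf, loss_ml = [], []
    for _ in range(niter):
        # generate censored data: observe min(T_i, C), flag uncensored data
        y = [rng.expovariate(1.0 / theta) for _ in range(n)]
        t = [min(v, C) for v in y]
        d = [1 if v <= C else 0 for v in y]
        r, s = sum(d), sum(t)

        # WF message-length objective (cf. the formula earlier in the post)
        def f(x):
            return ((r - 1) * math.log(x) + s / x
                    + 0.5 * math.log(1.0 - math.exp(-C / x)) + x)

        # golden-section search on [0.01, 100], standing in for fmincon
        a, b = 0.01, 100.0
        c, e = b - phi * (b - a), a + phi * (b - a)
        while b - a > 1e-6:
            if f(c) < f(e):
                b, e = e, c
                c = b - phi * (b - a)
            else:
                a, c = c, e
                e = a + phi * (b - a)
        wf = 0.5 * (a + b)
        ml = s / r if r > 0 else float('inf')

        loss_wf.append((theta - wf) ** 2 / theta ** 2)
        loss_ml.append((theta - ml) ** 2 / theta ** 2)
    return loss_wf, loss_ml
```

As with the MATLAB version, the returned loss samples can be summarised by their percentiles; individual runs will of course vary with the random seed.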

### History of Minimum Message Length

Minimum Message Length, henceforth MML, is a principle for model selection developed originally by the late Prof. Chris Wallace at Monash University, Australia. In simple terms, given a set of observations and several competing theories (hypotheses) that ‘explain’ the data, MML advocates choosing the theory that is (in some sense) the simplest. This is certainly not a new idea: it is usually attributed to William of Ockham in the 14th century and is now known as Occam’s razor. However, it has taken over 30 years of research by various individuals to turn Occam’s razor into a quantitative theory that is now practically useful.

The first incarnation of MML was developed in the mid-to-late 1960s by Prof. Wallace and Dr. David Boulton as an application to the problem of taxonomy. This work is now referred to as the MML68 approximation. MML68 formed the basis for a number of tractable MML approximations that have since been developed by Wallace and co-authors. In 2003, I was fortunate enough to attend a seminar by Prof. Wallace in which he talked about the history of MML research. The seminar was transcribed by Dr. Lloyd Allison and is available online. Anyone interested in MML should head over to A Brief History of MML; I highly recommend it.