# Archive for category Model Selection

### MML analysis of the multivariate Gaussian distribution

In his book on the minimum message length (MML) theory of inference, Wallace shows how one may derive MML estimates and message length formulae for many popular parametric distributions. An important example is the multivariate Gaussian distribution given in section 6.2.1 of the book. Here, Wallace derives the commonly used Wallace-Freeman (or MML87) message length formula and parameter estimates for the p-variate Gaussian using conjugate priors. After noticing a minor typographical error in the Fisher information formula (pp. 262), I decided to re-examine the MML87 treatment of the Gaussian from scratch. The first draft of my attempt is available here.

I hope this document will be useful to researchers interested in learning more about the the MML principle.

Update (24/07/2014): I have written MATLAB code to compute MML estimates and the message length. I have also done minor updates to the document.

### Model Selection with Akaike’s Information Criterion

Model selection is the task of choosing a model that best explains an observed set of data generated by an unknown mechanism. Put simply, model selection is all about finding the model that, in some way, is closest to the data generating distribution. This data generating distribution, or the truth, is generally unknown and, in many cases, can never be known. Subsequently, finding a good model that explains the observed data is a non-trivial problem common to many areas of science. As an aside, it has been shown that it is not possible to automate model selection; thus, one is forced to use intuition together with knowledge of statistics in order to discover good models.

Fortunately, there exist a large number of model selection criteria in the literature which can aid us in discriminating between competing models. One of the perhaps most popular criteria in use is the Akaike’s Information Criterion (AIC) [1] which is defined as

$\rm{AIC}(\theta,k) = -2 \log p({\bf x} | \hat{\theta}_{\rm ML}) + 2k$

where the first term is two times the negative log-likelihood at the maximum likelihood estimate, and the second term is a model structure penalty; here, k is the number of free parameters. AIC then advocates choosing the model that results in the smallest AIC score as optimal.

This criterion did not arise by numerical experiments, pure luck or divine inspiration. It was derived, under suitable regularity conditions, by a Japanese mathematician, Hirotsugu Akaike, as an asymptotically unbiased estimator of the expected Kullback-Leibler (KL) divergence [2] between the truth and the approximating model. Briefly, the KL divergence is a measure of distance between two probability distributions which has a natural interpretation in information theory. The expected KL divergence between the optimal AIC model and the truth is smaller than the expected KL divergence between the truth and any other candidate model. In words, the AIC selected model is optimal if we are interested in minimising prediction error. Keep in mind that so far, we have not mentioned anything about AIC properties if one is instead interested in induction (finding the truth).

It is no surprise that AIC is a popular model selection criterion in the research literature. Given a log-likelihood function, the AIC score is easy to compute and is in fact automatically computed in most statistical packages. It is this ease of computation that has perhaps resulted in a number of publications where the AIC is applied blindly without consideration of the particular problem characteristics. The two most common examples of AIC misuse are: (1) applying AIC for induction, and (2) applying AIC under small sample sizes. The reasoning behind using AIC for purposes of induction goes along the lines of “AIC is asymptotically efficient (in terms of minimising prediction error), and hence as sample size increases, AIC will discover the truth”. Unfortunately, this reasoning is incorrect and it can be shown that AIC is an inconsistent model selection criterion; AIC is not guaranteed to select the truth no matter how large the sample size! Similarly, applying AIC under small sample sizes is not recommended as all theory behind AIC performance is asymptotic. In fact, one is strongly advised against using AIC in small sample sizes as it often results in poor ranking of models.

So what should we use if there is not much data available but consistency is required? It has been shown that there exists a minimum penalty that a criterion must have to be be consistent. Examples of criteria that satisfy this constraint and are, in fact, consistent include Schwarz’s Bayesian Information Criterion (BIC) [3], Minimum Message Length (MML) [4] and Minimum Description Length (MDL) [5], among others. The latter two criteria are particularly recommended if the sample size is small to moderate. If, instead, we wish to minimise prediction error in the small sample size setting, criteria such as the corrected AIC (AICc) [6] and the corrected Kullback’s Information Criterion (KICc) [7] should be considered.

References:
[1] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control, Vol. 19, No. 6, 716â€“723, 1974.
[2] S. Kullback and R. A. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics, Vol. 22, No. 1, 79-86, 1951.
[3] G. E. Schwarz, Estimating the dimension of a model. Annals of Statistics, Vol. 6, No. 2, 461â€“464, 1978.
[4] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length, 1st ed., Springer, 2005.
[5] Jorma Rissanen, Information and Complexity in Statistical Modeling, 1st ed., Springer, 2009.
[6] C. M. Hurvich and C-L. Tsai, Regression and time series model selection in small samples, Biometrika, Vol. 76, No. 2, 297-307, 1989.
[7] A.K. Seghonane and M. Bekara, A small sample model selection criterion based on Kullback’s symmetric divergence, Vol. 52, No. 12, 3314-3323 , 2004.

### Parameter Inference under Type I Censoring

Suppose we observe n data points generated from an exponential distribution with probability density function

$p(t_i |\theta) = \theta^{-1} \exp(-t_i/\theta), \quad (i=1,\ldots,n)$

and would like to infer the mean. It is known that each datum represents the lifetime of an item, and that the observed item lifetimes are independent and identically distributed. We shall also assume that each datum is subject to Type I censoring with a common censoring threshold C > 0. That is, we have a censoring indicator together with each survival time, such that

$t_i = \min(T_i, C), \quad \delta_i = I(T_i \leq C)$

In words, we observe t if and only if t is less than or equal to the threshold C, otherwise we only know that t is greater than C. After some basic manipulation, the log likelihood function for this model is

$l(\theta) = - r \log \theta - \left( \sum_{i=1}^n t_i \right) / \theta$

where

$r= \sum_{i=1}^n \delta_i$

is the number of uncensored observations. We then maximise the log likelihood function and obtain the maximum likelihood (ML) estimator of the mean

$\hat{\theta}_{\rm{ml}} = \left( \sum_{i=1}^n t_i \right) / r$

This is certainly a reasonable estimator; it is asymptotically unbiased and consistent. However, we wish to investigate it’s performance under finite (small) sample sizes.

As an alternative, I have derived the Wallace–Freeman estimator (WF)

$\hat{\theta}_{\rm{wf87}} =\displaystyle{\mathop{\mbox{arg min}}_{\theta}} \left\{ (r-1) \log \theta + \frac{1}{\theta} \sum_{i=1}^n t_i + \frac{1}{2} \log \left( 1 - \exp(-C/\theta) \right) + \theta \right\}$

The WF estimator is Bayesian and requires specification of a prior distribution for the mean. For this example, it is assumed that the mean is distributed as per an exponential distribution with prior mean one. Unlike the maximum likelihood estimator, the WF estimator is not analytic and must be obtained by numerical optimisation. In this setting, the WF estimator equals to the ML estimator in the limit, and hence possesses the same favourable properties of asymptotic unbiasedness and consistency. The question we wish to investigate is how does the WF estimator compare to the ML estimator when the sample size is small?

We compare the WF and ML estimators using the loss function

$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 / \theta^2$

We note in passing that this is of course not the only possible loss function for this problem. In fact, it may be preferable to use, say, the Kullback-Leibler divergence which, arguably, is a natural divergence between two distributions and is invariant under one-to-one transformation of the parameters. When comparing the estimators, we resort to numerical experiments since the WF estimator is not analytic. A simple MATLAB function to compare the two estimators is given below:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 function [loss_mml, loss_ml] = cen_exp(niter, n, theta, C)   %% Setup numerical optimisation options options = optimset('TolFun',1e-6,'GradObj','off','Algorithm','interior-point'); options = optimset(options, 'Display','off');   %% Record losses for mml and ml loss_mml=zeros(niter,1); loss_ml=zeros(niter,1);   %% Do niter iterations for i=1:niter, %% generate data y=exprnd(theta,n,1); d=ones(n,1); t=y; t(y>=C)=C; d(y>=C)=0;   %% estimate parameters msglen=@(x) sum(d)*log(x)+(sum(t))./x+1/2*log(1-exp(-C./x) )-log(x)+x; mml_theta=fmincon(msglen, 1.0, [],[],[],[],0.01,100.0,[],options); ml_theta=sum(t)/sum(d);   %% compute losses loss_mml(i)=(theta-mml_theta)^2/theta^2; loss_ml(i)=(theta-ml_theta)^2/theta^2; end;   %% done return;

As an example, we run the script for 1000 iterations with the following parameters

$n=15, \theta = 3, C = 2.5$

>> [loss_mml, loss_ml] = cen_exp(1000, 15, 3.0, 2.5); >> prctile(loss_mml,[5 25 50 75 95]), prctile(loss_ml,[5 25 50 75 95])   ans = 0.0003 0.0085 0.0375 0.0978 0.2270 ans = 0.0007 0.0129 0.0559 0.1501 0.9566

In this particular run, the WF estimator was slightly superior to the ML estimator. I briefly played around with various values for the data generating mean, the sample size and the censoring threshold. For sample sizes less than about n=30, the WF estimator commonly outperforms the ML estimator. If the data generating mean is unlikely a priori, the ML estimator is slightly superior to the WF estimator, the difference being about 5%. The ML estimator is problematic when there is a large number of censored observations, tending to grossly overestimate the true mean. As expected, as the sample size gets larger, both estimators converge. Overall, the WF estimator seems preferable to the ML estimator in small samples; in larger samples, both estimators are more or less identical, however the ML estimator is much easier to compute.

### History of Minimum Message Length

Minimum Message Length, henceforth MML, is a principle for model selection developed originally by late Prof. Chris Wallace at Monash University, Australia. In simple terms, given a set of observations and several competing theories (hypotheses) that ‘explain’ this data, MML advocates choosing the theory that is (in some sense) the simplest. This is certainly not a new idea. It was first attributed to William of Ockham in the 14th century and is now known as the Occam’s razor. However, it has taken over 30 years of research by various individuals to quantify the Occam’s razor into a theory that is now practically useful.

The first incarnation of MML was developed in the mid-to-late 60’s by Prof. Wallace and Dr. David Boulton as an application to the problem of taxonomy. This work is now referred to as the MML68 approximation. MML68 formed a basis for a number of tractable MML approximations that have since been developed by Wallace and co-authors. In 2003, I was fortunate enough to attend a seminar by Prof. Wallace where he talked about the history of MML research. The seminar was transcribed by Dr. Lloyd Allison and is available online. For anyone interested in MML, head over to A Brief History of MML. I highly recommend it.