## Model Selection with Akaike’s Information Criterion

Model selection is the task of choosing a model that best explains an observed set of data generated by an unknown mechanism. Put simply, model selection is about finding the model that is, in some sense, closest to the data-generating distribution. This data-generating distribution, or the truth, is generally unknown and, in many cases, can never be known. Consequently, finding a good model that explains the observed data is a non-trivial problem common to many areas of science. As an aside, it has been shown that it is not possible to automate model selection; thus, one is forced to use intuition together with knowledge of statistics in order to discover good models.

Fortunately, the literature offers a large number of model selection criteria that can aid us in discriminating between competing models. Perhaps one of the most popular criteria in use is Akaike’s Information Criterion (AIC) [1], which is defined as

$\rm{AIC}(\theta,k) = -2 \log p({\bf x} | \hat{\theta}_{\rm ML}) + 2k$

where the first term is twice the negative log-likelihood of the data ${\bf x}$ evaluated at the maximum likelihood estimate $\hat{\theta}_{\rm ML}$, and the second term is a model structure penalty; here, $k$ is the number of free parameters. AIC then advocates choosing the model with the smallest AIC score as optimal.
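
As a minimal sketch of how the score is used in practice, the snippet below compares two Gaussian candidates on a small sample: one with the mean fixed at zero (one free parameter, the variance) and one with both mean and variance free (two free parameters). The data values and the function names `gaussian_log_likelihood`, `fit_zero_mean` and `fit_free_mean` are illustrative assumptions, not part of any particular package.

```python
import math

def gaussian_log_likelihood(xs, mu, var):
    """Log-likelihood of i.i.d. Gaussian data with mean mu and variance var."""
    n = len(xs)
    rss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * var) - rss / (2.0 * var)

def aic(log_lik, k):
    """AIC = -2 * (maximised log-likelihood) + 2 * (number of free parameters)."""
    return -2.0 * log_lik + 2.0 * k

def fit_zero_mean(xs):
    """Candidate 1: mean fixed at 0, variance estimated by maximum likelihood (k = 1)."""
    var = sum(x * x for x in xs) / len(xs)
    return gaussian_log_likelihood(xs, 0.0, var), 1

def fit_free_mean(xs):
    """Candidate 2: mean and variance both estimated by maximum likelihood (k = 2)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return gaussian_log_likelihood(xs, mu, var), 2

# Hypothetical sample, clearly centred away from zero.
xs = [2.1, 1.9, 2.4, 1.7, 2.2, 2.0, 1.8, 2.3]

scores = {
    "zero_mean": aic(*fit_zero_mean(xs)),
    "free_mean": aic(*fit_free_mean(xs)),
}
best = min(scores, key=scores.get)  # the candidate with the smallest AIC
print(scores, best)
```

Here the extra parameter of the second candidate is easily paid for by its much higher likelihood, so it receives the smaller AIC score.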

This criterion did not arise by numerical experiments, pure luck or divine inspiration. It was derived, under suitable regularity conditions, by a Japanese mathematician, Hirotsugu Akaike, as an asymptotically unbiased estimator of the expected Kullback-Leibler (KL) divergence [2] between the truth and the approximating model. Briefly, the KL divergence is a measure of the discrepancy between two probability distributions which has a natural interpretation in information theory; it is not a true distance, since it is not symmetric in its arguments. Asymptotically, the expected KL divergence between the AIC-selected model and the truth is smaller than the expected KL divergence between the truth and any other candidate model. In other words, the AIC-selected model is optimal if we are interested in minimising prediction error. Keep in mind that so far, we have not said anything about the properties of AIC if one is instead interested in induction (finding the truth).
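
To make the KL divergence concrete, here is a small sketch for discrete distributions, where it reduces to a finite sum. The two distributions `p` and `q` are made-up values chosen only for illustration, and `kl_divergence` is a hypothetical helper name.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Assumes p and q have the same length, each sums to 1, and q_i > 0
    wherever p_i > 0. Terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # hypothetical "truth"
q = [0.9, 0.1]  # hypothetical approximating model

print(kl_divergence(p, q))  # divergence of q from the truth p
print(kl_divergence(q, p))  # swapping the arguments gives a different value
print(kl_divergence(p, p))  # zero when the two distributions coincide
```

The divergence is zero only when the two distributions agree, and its value changes when the arguments are swapped, so the order (truth first, approximating model second) matters.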

It is no surprise that AIC is a popular model selection criterion in the research literature. Given a log-likelihood function, the AIC score is easy to compute and is in fact automatically computed in most statistical packages. It is this ease of computation that has perhaps resulted in a number of publications where AIC is applied blindly, without consideration of the characteristics of the particular problem. The two most common examples of AIC misuse are: (1) applying AIC for induction, and (2) applying AIC with small sample sizes. The reasoning behind using AIC for purposes of induction goes along the lines of “AIC is asymptotically efficient (in terms of minimising prediction error), and hence as sample size increases, AIC will discover the truth”. Unfortunately, this reasoning is incorrect and it can be shown that AIC is an inconsistent model selection criterion: because its penalty does not grow with the sample size, AIC retains a non-zero probability of selecting a model with too many parameters, so it is not guaranteed to select the truth no matter how large the sample size! Similarly, applying AIC with small sample sizes is not recommended, as all theory behind AIC performance is asymptotic. In fact, one is strongly advised against using AIC with small samples, as it often results in a poor ranking of models.
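
The inconsistency can be illustrated with a small simulation. The setup below is an assumption chosen for simplicity: the data are drawn from a Gaussian with mean 0 and known unit variance, and AIC chooses between the true model (mean fixed at 0, no free parameters) and a larger model that estimates the mean (one free parameter). In this setup AIC prefers the larger model exactly when $n \bar{x}^2 > 2$, which under the truth happens with probability of roughly 0.157 for every sample size.

```python
import math
import random

def aic_scores(xs):
    """AIC of the two candidates, both Gaussian with known unit variance."""
    n = len(xs)
    xbar = sum(xs) / n
    const = n * math.log(2.0 * math.pi)
    aic_true = const + sum(x * x for x in xs) + 2 * 0            # mean fixed at 0
    aic_large = const + sum((x - xbar) ** 2 for x in xs) + 2 * 1  # mean estimated
    return aic_true, aic_large

def overfit_rate(n, trials, rng):
    """Fraction of simulated samples in which AIC prefers the larger model."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the truth has mean 0
        aic_true, aic_large = aic_scores(xs)
        if aic_large < aic_true:
            hits += 1
    return hits / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
rates = {n: overfit_rate(n, 500, rng) for n in (20, 200, 2000)}
for n, rate in rates.items():
    print(n, rate)
```

The estimated rates should hover around the same value at every sample size rather than shrinking towards zero, which is what inconsistency means here.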

So what should we use if there is not much data available but consistency is required? It has been shown that there is a minimum penalty that a criterion must have in order to be consistent. Examples of criteria that satisfy this constraint and are, in fact, consistent include Schwarz’s Bayesian Information Criterion (BIC) [3], Minimum Message Length (MML) [4] and Minimum Description Length (MDL) [5], among others. The latter two criteria are particularly recommended if the sample size is small to moderate. If, instead, we wish to minimise prediction error in the small sample size setting, criteria such as the corrected AIC (AICc) [6] and the corrected Kullback’s Information Criterion (KICc) [7] should be considered.
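
For comparison, two of these alternatives can be sketched next to AIC. The forms used below are the commonly quoted ones and are stated here as assumptions: BIC replaces the penalty $2k$ with $k \log n$, and AICc adds the correction $2k(k+1)/(n-k-1)$ to AIC, a form derived for Gaussian linear regression models. Here `n` denotes the sample size; MML, MDL and KICc are omitted because their expressions depend on the model class.

```python
import math

def aic(log_lik, k):
    """AIC = -2 log L + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """BIC = -2 log L + k log n; the penalty grows with the sample size n."""
    return -2.0 * log_lik + k * math.log(n)

def aicc(log_lik, k, n):
    """AICc = AIC + 2k(k + 1) / (n - k - 1); the correction vanishes as n grows."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + 2.0 * k * (k + 1) / (n - k - 1)

# Penalty terms alone (log-likelihood set to 0) for a model with k = 3 parameters.
for n in (10, 100, 1000):
    print(n, aic(0.0, 3), aicc(0.0, 3, n), bic(0.0, 3, n))
```

With the log-likelihood held fixed, the AICc penalty is noticeably larger than the AIC penalty at small `n` and approaches it as `n` grows, while the BIC penalty keeps increasing with `n`.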

References:
[1] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control, Vol. 19, No. 6, 716–723, 1974.
[2] S. Kullback and R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics, Vol. 22, No. 1, 79–86, 1951.
[3] G. E. Schwarz, Estimating the dimension of a model, The Annals of Statistics, Vol. 6, No. 2, 461–464, 1978.
[4] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, 1st ed., Springer, 2005.
[5] J. Rissanen, Information and Complexity in Statistical Modeling, 1st ed., Springer, 2009.
[6] C. M. Hurvich and C.-L. Tsai, Regression and time series model selection in small samples, Biometrika, Vol. 76, No. 2, 297–307, 1989.
[7] A.-K. Seghouane and M. Bekara, A small sample model selection criterion based on Kullback’s symmetric divergence, IEEE Transactions on Signal Processing, Vol. 52, No. 12, 3314–3323, 2004.