Fisher information for irregular likelihood functions

I recently came across the following problem: how do you work out the Fisher information for a parameter \theta \in \Omega when the likelihood function, say p(x; \theta), is continuous but not everywhere differentiable with respect to the parameter? The standard formula for the Fisher information

 {\rm J}(\theta) = -\int \left(\frac{\partial^2 l}{\partial \theta^2}\right) p(x; \theta) \, dx

where l = \log p(x; \theta) is the log-likelihood and the integral is over the sample space, assumes regularity conditions that no longer hold, and hence is not applicable. After hunting around with Google for some time, I came across the following (freely downloadable) paper

H. E. Daniels
The Asymptotic Efficiency of a Maximum Likelihood Estimator
Fourth Berkeley Symp. on Math. Statist. and Prob., University of California Press, 1961, 1, 151-163

which, fortunately, had exactly what I needed. It turns out that we can still compute the Fisher information, without the existence of second derivatives, as the expected squared score

 {\rm J}(\theta) = \int \left(\frac{\partial l}{\partial \theta}\right)^2 p(x; \theta) \, dx

provided a set of weaker conditions holds. To get my head around the issue, I decided to look at a simple problem of working out the Fisher information for the mean \mu \in \Omega = \mathbb{R} of a Laplace distribution

  p(x; \mu, s) = \frac{1}{2s} \exp(-|x - \mu| / s)

The log-likelihood is now given by

  l(\mu, s; x) = -\log(2s) - \frac{|x-\mu|}{s} = \left\{\begin{array}{ll} -\log(2s) - \frac{x-\mu}{s} & x > \mu \\ -\log(2s) + \frac{x-\mu}{s} & x < \mu \end{array}\right.

The first derivative with respect to \mu is

\frac{\partial l}{\partial \mu} = \left\{\begin{array}{ll} \frac{1}{s} & x > \mu \\ -\frac{1}{s} & x < \mu \end{array}\right.

which does not exist at x = \mu. Therefore, the Fisher information is

 {\rm J}(\mu) = \int_{-\infty}^\mu \left(\frac{\partial l}{\partial \mu}\right)^2 p(x; \mu, s) \, dx + \int_{\mu}^\infty \left(\frac{\partial l}{\partial \mu}\right)^2 p(x; \mu, s) \, dx = \frac{1}{s^2}
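As a quick numerical sanity check (my own sketch, not part of the Daniels paper): if J(\mu) = 1/s^2, then the maximum likelihood estimator of \mu, which for the Laplace distribution is the sample median, should have variance close to 1/(n {\rm J}(\mu)) = s^2/n for large n. In MATLAB, with illustrative parameter values:

% Monte Carlo check that the MLE of mu (the sample median) attains
% the asymptotic variance 1/(n*J(mu)) = s^2/n. Names are illustrative.
mu = 2; s = 1.5;
n = 200;                        % observations per replication
R = 20000;                      % number of replications

% Draw Laplace(mu, s) samples by inverse-transform sampling
u = rand(n, R) - 0.5;
x = mu - s * sign(u) .* log(1 - 2 * abs(u));

mu_hat = median(x);             % MLE of mu for each replication (column)
fprintf('var(mu_hat) = %.5f, s^2/n = %.5f\n', var(mu_hat), s^2 / n);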

The Fisher information for the scale parameter can be obtained in a similar manner. The first derivative with respect to s is

\frac{\partial l}{\partial s}  = -\frac{1}{s} + \frac{1}{s^2} |x - \mu| = \frac{1}{s^2} (|x - \mu| - s )

Since |x - \mu| is exponentially distributed with mean s, we have

 {\rm E}\{|x-\mu|\} = s, \quad {\rm E}\{|x-\mu|^2\} = 2 s^2

and so the Fisher information for the scale parameter is

 {\rm J}(s) = \frac{1}{s^4} {\rm E}\left\{(|x-\mu| - s)^2\right\} = \frac{1}{s^4}\left(2s^2 - 2s^2 + s^2\right) = \frac{1}{s^2}
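The same numerical check works for the scale parameter: with \mu known, the MLE of s is the mean absolute deviation from \mu, and its variance should approach 1/(n {\rm J}(s)) = s^2/n. Continuing the sketch above:

% With mu known, the MLE of s is the mean absolute deviation, and
% Var(|x - mu|) = 2s^2 - s^2 = s^2, so var(s_hat) should be near s^2/n.
s_hat = mean(abs(x - mu));      % MLE of s for each replication
fprintf('var(s_hat) = %.5f, s^2/n = %.5f\n', var(s_hat), s^2 / n);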

The paper by Daniels details the exact conditions needed for the Fisher information to be derived in this way. Note that there is a mistake in one of the proofs; the correction is detailed in

J. A. Williamson
A Note on the Proof by H. E. Daniels of the Asymptotic Efficiency of a Maximum Likelihood Estimator
Biometrika, 1984, 71, 651-653

Unfortunately, the Williamson paper requires a JSTOR subscription for download.


The blog expands

I’ve started writing a new page for the blog that examines the theory behind various common statistical methods and models. The idea here is to include references and brief descriptions of some of the important and interesting papers on various topics. I also intend to include web links to software implementations whenever possible. The new material can be found by clicking on the “Statistics Theory and Applications” tab above. I am currently working on the linear regression section, including the LASSO regularisation method.

In other news, the grant writing season has again started. This year, we aim to write one early career researcher ARC grant and at least one NHMRC grant. The ARC grant will be a two-year proposal for work on modern logistic regression techniques. We are in the process of determining which NHMRC grant(s) will be written and which researchers will be involved. I can only hope that this grant season is as successful as last year’s!


More Grant News

The NHMRC funding outcomes have been released and we have been awarded another grant! This is fantastic news given the competitive nature of the funding process. The list of project grants that were funded for 2011 can be found here. We were awarded approximately $400k over a period of two years for a research project on mammographic density. This is a very exciting area of research, and we hope to contribute positively to breast cancer research and to making mammographic density more clinically useful.

Continuing on with more good news, the list of accepted papers for the 23rd Australasian Joint Conference on Artificial Intelligence (AJCAI 2010) was announced a couple of weeks ago. AJCAI is a premier conference on artificial intelligence and machine learning that is held annually in the Australasian region. We usually send at least one paper every year to AJCAI given how rare international machine learning conferences are in Australia. This year we ended up submitting two papers, both of which were accepted. Moreover, one of the papers was also awarded best conference paper! Following the conference, I will be putting up both papers in the Publications section along with any relevant MATLAB and C/C++ code.

Lastly, I’d like to congratulate my colleague and good friend Daniel Schmidt for getting engaged this year to one lovely lady! I wish you both the best of luck!


ARC Grant Time

The Australian Research Council (ARC) has announced the funding outcomes for the 2011 round of Discovery Projects. The success rate this year was 22.0%, compared to 22.7% in the last round. The success rate really is quite low considering the amount of time and effort required to fill out one of these applications. The great news is that our first ever ARC grant got funded! Although we didn’t quite get the amount of money we requested, we still got more than the national average. Now we nervously await the NHMRC funding outcomes, which should be released in a week or so.


Matrix inversion lemma

The other day at work I came across an interesting problem while trying to optimise some MATLAB MCMC sampling code. The major bottleneck in the code was the inversion of a [p x p] matrix M, where p can be quite large (on the order of thousands). Now, I noticed that M can be written as

 {\bf M} = {\bf A} + {\bf X} {\bf G} {\bf X}^{T}

where A is a diagonal [p x p] matrix, X is [p x n] and G is a full rank diagonal [n x n] matrix. In my setting, p could be much larger than n and speed is important since this particular code is executed numerous times within a loop. I immediately thought about using the matrix inversion lemma (or the Sherman–Morrison–Woodbury formula) to speed up the inversion when p >> n. However, it turns out that in my case the matrix A is

 {\bf A} = {\rm diag}(0, a_1, \ldots, a_{p-1})

which is of rank (p – 1) and singular, so the matrix inversion lemma cannot be applied in a straightforward manner. After talking to a colleague about this issue, he suggested a nice trick to make A full rank by replacing the top-left zero element with a non-zero entry, and then changing X and G to correct for this modification. If we apply this trick, we can write M as

 {\bf M} = \underbrace{\left({\bf A} - {\bf e} {\bf e}^T\right)}_{{\bf A}_*} + \left({\bf e} \; \; {\bf X}\right) \left(  \begin{array}{cc}  1         & {\bf 0}^T \\  {\bf 0} & {\bf G}  \end{array}\right)  \left(  \begin{array}{cc}  {\bf e}^T\\  {\bf X}^T  \end{array} \right)

where e = (1, 0, …, 0) is a [p x 1] vector. Application of the matrix inversion lemma is now straightforward and reduces the computational cost of inverting M from O(p^3) to O(p n^2 + n^3), a substantial saving when p >> n. I did some rough timing of the new code and it is (unsurprisingly) significantly faster than the previous version when p / n gets large; a small sketch of the trick is given below. I’ve updated my Bayesian LASSO code for logistic regression (see Publications) to include this neat trick.
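Here is a minimal MATLAB sketch of the trick; the sizes, and the random A, X and G, are illustrative rather than taken from the actual MCMC code:

% Rank-correction trick for applying the matrix inversion lemma when
% the diagonal matrix A has a zero top-left entry. Names illustrative.
p = 2000; n = 50;
a = [0; rand(p-1, 1) + 0.1];        % diagonal of A; first entry is zero
X = randn(p, n);
g = rand(n, 1) + 0.1;               % diagonal of G (full rank)
M = diag(a) + X * diag(g) * X';     % the matrix we want to invert

% Shift the singular entry: A* = A - e*e' has -1 in the top-left corner
e = [1; zeros(p-1, 1)];
a_star = a;  a_star(1) = -1;        % diagonal of A*, now invertible
U = [e, X];                         % p x (n+1) factor
Gt = blkdiag(1, diag(g));           % (n+1) x (n+1) augmented G

% Matrix inversion lemma on M = A* + U*Gt*U':
% inv(M) = inv(A*) - inv(A*)*U * inv(inv(Gt) + U'*inv(A*)*U) * U'*inv(A*)
AiU  = U ./ a_star;                 % inv(A*)*U, since A* is diagonal
S    = inv(Gt) + U' * AiU;          % small (n+1) x (n+1) system
Minv = diag(1 ./ a_star) - AiU * (S \ AiU');

fprintf('error vs direct inversion: %g\n', norm(Minv - inv(M), 'fro'));

The dominant costs are the product forming S and the small (n+1)-dimensional solve, which is where the O(p n^2 + n^3) figure comes from.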


Penalized Logistic Regression

Last week I gave a seminar at my workplace on the advantages of using penalized logistic regression methods (such as the LASSO and the elastic net) over the standard maximum likelihood approach. The target audience was genetic epidemiologists who have some practical knowledge of fitting logistic models, but may not be aware of the recent theoretical work in the area. The slides from the seminar are now available from the Publications page.


Statistical Analysis Questions

The other day I came across a new Q&A site for statistical analysis called Statistical Analysis Questions. I am not sure how long the site has been running, but there are already about 200 questions, 750 answers and more than 600 users. I recommend checking it out; it seems quite useful for anyone involved in either applied or theoretical statistics work.


Hypothesis testing with Paul the Octopus

For the past four weeks, I’ve been enjoying the FIFA World Cup 2010, the most watched television event in the world. This World Cup is being held in South Africa, making it the first time an African nation has hosted the prestigious tournament. One of the surprise teams of the tournament has been Germany, beating both England and Argentina (4-1 and 4-0 respectively) before losing 0-1 to current European champions Spain in a tightly contested semi-final encounter.

Meanwhile in Oberhausen, Germany, a somewhat odd event took place before each of the Germany matches. Paul the Octopus, who resides at the local Sea Life Aquarium, was used as an oracle to predict the outcomes of all Germany World Cup matches prior to the games taking place. For a description of exactly how Paul makes his predictions, see this Wikipedia article. Amazingly, Paul has successfully predicted all six of the German games so far and has recently tipped Germany, to the delight of many Germans, to beat Uruguay in the upcoming 3rd/4th place game. This should hopefully put a stop to those anti-octopus songs and calls to eat Paul. As statisticians, let us ask the question: “Is Paul really an animal oracle or just one extremely lucky octopus?”

We can model the number of Paul’s successful predictions at this World Cup as a binomial distribution B(n=6, p); that is, we have six independent trials (matches) with p being the probability of success (Paul predicting correctly) at each trial. In order to test whether Paul is psychic, we shall construct a 95% confidence interval for the probability of success, p. The standard confidence interval, often called the Wald interval, is known to have poor coverage properties in this scenario and exhibits erratic behaviour, even when the sample size is large or p is near 0.5. Instead, we compute the modified Jeffreys 95% CI, recommended in [1], and find that

CI_{M-J} = [0.54, 1.0]

This CI is quite wide, which is not unexpected given such a small sample size (n=6), but it excludes the possibility that Paul is just plain old lucky (p=0.5)!
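For anyone wanting to reproduce the interval, it is just a pair of Beta quantiles. A minimal MATLAB sketch (betainv is in the Statistics Toolbox; I only include the boundary modification relevant here, see [1] for the full recipe):

% Modified Jeffreys 95% CI for x successes in n trials (see [1]).
n = 6; x = 6; alpha = 0.05;

% Jeffreys interval: quantiles of the Beta(x + 1/2, n - x + 1/2) posterior
lo = betainv(alpha/2,     x + 0.5, n - x + 0.5);
hi = betainv(1 - alpha/2, x + 0.5, n - x + 0.5);

% Boundary modification: pin the endpoints when x = 0 or x = n
if x == 0, lo = 0; end
if x == n, hi = 1; end

fprintf('95%% CI = [%.2f, %.2f]\n', lo, hi);    % [0.54, 1.00]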

What can Minimum Message Length (MML) and Minimum Description Length (MDL) tell us about Paul’s psychic powers? We shall use the Wallace–Freeman (MML) codelength formula [2,3] and the Normalized Maximum Likelihood (NML) distribution (MDL) [4] for this task. Let A denote the hypothesis that Paul is lucky, and B the alternative hypothesis that Paul is an animal oracle. We compute the codelength of data and hypothesis under both scenarios A and B, and interpret the difference in codelengths (i.e., codelength A − codelength B) as the log posterior odds in favour of the hypothesis with the smaller codelength. From standard information theory, the codelength for hypothesis A is 6 × log2(2) = 6 bits, since each prediction costs one bit when p = 0.5. The codelength for hypothesis B is 2.82 bits using the WF formula and 1.92 bits using the NML distribution. Thus, both MML and MDL prefer hypothesis B.
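The NML codelength is easy to reproduce: normalise the maximised binomial likelihood over all possible outcomes and take the negative base-2 logarithm. A minimal MATLAB sketch (the Wallace–Freeman codelength also needs a prior and the Fisher information, so I omit it here):

% NML codelength for x = 6 correct predictions in n = 6 trials,
% versus the one-bit-per-match code under hypothesis A (p = 0.5).
n = 6; x = 6;
ks = 0:n;

% Maximised binomial likelihood for each possible outcome
% (MATLAB evaluates 0^0 as 1, which is what is needed here)
ml = arrayfun(@(k) nchoosek(n, k) * (k/n)^k * (1 - k/n)^(n - k), ks);

C   = sum(ml);                      % the NML normaliser
L_B = log2(C) - log2(ml(x + 1));    % NML codelength: about 1.92 bits
L_A = n;                            % 6 bits under p = 0.5
fprintf('A: %.2f bits, B (NML): %.2f bits\n', L_A, L_B);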

So there you have it, Paul must be the real deal! 😉

References:
[1] Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. Interval Estimation for a Binomial Proportion, Statistical Science, Vol. 16, No. 2, pp. 101-133, 2001.
[2] C. S. Wallace and P. R. Freeman. Estimation and inference by compact coding, Journal of the Royal Statistical Society (Series B) Vol. 49, No. 3, pp. 230-265, 1987.
[3] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length, 1st ed., Springer, 2005.
[4] Jorma Rissanen. Information and Complexity in Statistical Modeling, 1st ed., Springer, 2009.


Eurovision 2010 Forecast (Part 2)

The Eurovision 2010 competition finished last Saturday with Lena Meyer-Landrut from Germany taking the title for her song “Satellite”. If you missed the show, you can see Lena performing the song at the official Eurovision YouTube channel here. Given the current state of the world economy, this is a pretty good outcome, as Germany is one of the few countries left in Europe with enough finances to host next year’s show. So how did my team, StatoVIC, go in the Kaggle Eurovision 2010 competition? The results have been tabulated and released here. It looks like StatoVIC took seventh place out of 22 submissions, with an absolute error of 2626 points calculated from the predicted ratings. This score is in the top quartile of the submissions and about 1000 points better than “Lucky Guess”, the last-place submission (I assume this submission is just a random selection of ratings). Not a bad result for StatoVIC, really. Congratulations to Jure Zbontar for winning the competition with an impressive absolute error score about 400 rating points lower than our team’s.

It’s time for StatoVIC to look at the HIV progression challenge and see if we can do better than seventh place!


Eurovision 2010 Forecast (Part 1)

Last week I submitted predictions for the Kaggle Eurovision 2010 competition under the team name StatoVIC. The first part of the competition requires selecting the 25 countries that will make the Eurovision 2010 final. Once the 25 finalists are chosen, you are asked to predict the voting behaviour of all the participating countries based on 10 years of data collected from previous Eurovision competitions. In this year’s Eurovision, 20 countries are selected for the final based on the outcome of two semi-finals. In both semi-finals, there are 17 countries competing, and the 10 countries with the most points go through to the final. The remaining five countries (Spain, Germany, United Kingdom, France and Norway) are guaranteed final competitors. With the second semi-final finishing last Thursday, it is time to see how the StatoVIC team has fared thus far.

In the first semi-final, I ended up predicting five (Bosnia, Russia, Greece, Serbia and Belgium) of the ten finalists correctly. In the second semi-final, I fared somewhat better, selecting eight of the ten countries that made the final. I missed out on picking Romania and Cyprus, instead choosing Croatia and Finland. Given the relatively naive strategy used to select the finalists, these numbers are certainly not too bad.

Out of interest, I had a brief look at how you would fare if you were to randomly select all the finalists in either of the two semi-finals. First, the “good” news: since you pick 10 of the 17 countries and 10 of them are finalists, you are guaranteed to select at least 10 + 10 − 17 = 3 finalists correctly. The odds of correctly guessing all ten finalists in a semi are unfortunately 1 in 19,448. The probability of correctly guessing exactly five and exactly eight finalists is about 0.27 and 0.05 respectively. The expected number of finalists guessed correctly with this strategy is 10 × 10/17 ≈ 5.9 (a quick check is sketched below). In light of this, the performance of StatoVIC is about average in the first semi, and moderately better than average in the second.
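These numbers all come from the hypergeometric distribution: 10 of the 17 entrants in a semi are true finalists, and we sample 10 of them without replacement. A quick MATLAB check (hygepdf is in the Statistics Toolbox):

% Random-guessing baseline: pick 10 of 17 entrants when 10 are finalists.
pk = hygepdf(0:10, 17, 10, 10);     % P(exactly k correct), k = 0..10
fprintf('P(all 10 correct) = 1/%d\n', round(1 / pk(11)));   % 1 in 19448
fprintf('P(5) = %.3f, P(8) = %.3f\n', pk(6), pk(9));        % 0.272, 0.049
fprintf('E[correct] = %.2f\n', 10 * 10 / 17);               % about 5.88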

The Eurovision final is on this Saturday night, but is shown Sunday night on SBS if you are in Australia. It will be interesting to see how StatoVIC fares in predicting the voting behaviour. In the meantime, here are the predictions of the fine folks at Google:

[Embedded image: Google’s Eurovision 2010 predictions]
