# Archive for category Statistics

### Bayesian penalized regression with continuous shrinkage prior densities

We have implemented a new MATLAB toolbox for Bayesian regression. The toolbox features Bayesian linear regression with Gaussian or heavy-tailed error models, and Bayesian logistic regression, with ridge, lasso, horseshoe and horseshoe+ estimators. Two heavy-tailed error models are supported in linear regression: Laplace and Student-t errors. The toolbox is efficient and can be used with high-dimensional data, and it uses the C/C++ PolyaGamma sampler written by Jesse Windle.

Version 1.1 of the MATLAB toolbox can be downloaded from here. Please see “examples_bayesreg.m” for examples on how to use the toolbox, or type “help bayesreg” within MATLAB.

An R version of this toolbox is now available on CRAN. To install the R package, type install.packages("bayesreg") within R.

To cite this toolbox:

1. Makalic, E. & Schmidt, D. F.
High-Dimensional Bayesian Regularised Regression with the BayesReg Package
arXiv:1611.06649 [stat.CO], 2016

(23/01): Updated the bayesreg code to version 1.2. This version implements Zellner’s g-prior for linear and logistic regression. The g-prior only works with full rank matrices. The examples in “examples_bayesreg.m” have been updated to include a g-prior example.

(17/01): Updated the bayesreg code to version 1.1. The display code is now in a separate “summary()” function so that it can be used many times on the same data and sampling results. We have updated “examples_bayesreg.m” to include examples of the new “summary()” function.

### MML analysis of the multivariate Gaussian distribution

In his book on the minimum message length (MML) theory of inference, Wallace shows how one may derive MML estimates and message length formulae for many popular parametric distributions. An important example is the multivariate Gaussian distribution given in section 6.2.1 of the book. Here, Wallace derives the commonly used Wallace-Freeman (or MML87) message length formula and parameter estimates for the p-variate Gaussian using conjugate priors. After noticing a minor typographical error in the Fisher information formula (p. 262), I decided to re-examine the MML87 treatment of the Gaussian from scratch. The first draft of my attempt is available here.

I hope this document will be useful to researchers interested in learning more about the MML principle.

Update (24/07/2014): I have written MATLAB code to compute MML estimates and the message length. I have also done minor updates to the document.

### Fisher Information in Single Factor Analysis

The single factor analysis model for data $\{ {\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n \}$ is

${\bf x}_i = {\bf m} + v_i {\bf a} + \boldsymbol{\epsilon}_i \quad \; (i=1,2,\ldots,n),$

where ${\bf x}_i \in \mathbb{R}^k$, $\boldsymbol{\epsilon}_i$ is an i.i.d. zero-mean multivariate Gaussian random variable with a symmetric, positive definite variance-covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{k \times k}$; that is, $\boldsymbol{\epsilon}_i \sim {\rm N}_k({\bf 0}, \boldsymbol{\Sigma})$ for all $i=1,2,\ldots,n$. It is thus assumed that each data vector ${\bf x}_i$ follows a $k$-variate Gaussian distribution with moments

${\rm E} \{ {\bf x}_i \} = {\bf m} + v_i {\bf a},$

${\rm Var} \{ {\bf x}_i \} = \boldsymbol{\Sigma}.$

The aim of single factor analysis is to estimate the unknown factor scores ${\bf v} = (v_1, v_2, \ldots, v_n)^\prime \in \mathbb{R}^n$, the mean ${\bf m} = (m_1,m_2, \ldots, m_k)^\prime \in \mathbb{R}^k$, the variance-covariance matrix $\boldsymbol{\Sigma}$ and the factor loadings ${\bf a} = (a_1, a_2, \ldots, a_k)^\prime \in \mathbb{R}^k$.
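Since each ${\bf x}_i$ follows a $k$-variate Gaussian with the moments above, the log-likelihood from which the Fisher information is ultimately derived is the standard multivariate Gaussian one, stated here for reference:

$l({\bf m}, {\bf a}, {\bf v}, \boldsymbol{\Sigma}) = -\frac{nk}{2} \log(2\pi) - \frac{n}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i=1}^n ({\bf x}_i - {\bf m} - v_i {\bf a})^\prime \boldsymbol{\Sigma}^{-1} ({\bf x}_i - {\bf m} - v_i {\bf a}).$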

I’ve written this note to show how one may derive the Fisher information for the single factor model. This turns out to be a fairly involved task requiring a lot of linear algebra and matrix differential calculus identities. I’ve uploaded a document describing all the necessary mathematical steps to the Publications section. Anyone interested in matrix differential calculus and linear algebra should find it useful.

### Computing p-values for large Z scores in MATLAB

I recently found a little trick for computing two-sided p-values in MATLAB when given a large Z score. The obvious approach is to use the normcdf(.) function with something like this

```matlab
z = 1.96;
2*(1 - normcdf(z))
```

However, the above only works for relatively small z, roughly z < 10: for larger z, normcdf(z) rounds to 1 in double precision and the subtraction returns exactly zero. In order to get p-values for z as large as 30, one should instead run

```matlab
z = 1.96;
2*normcdf(-abs(z))
```

For those using R, the above can be accomplished by running:

```r
z <- 1.96
2*pnorm(abs(z), lower.tail=FALSE)
```

### Computing the Hodges–Lehmann estimator

The Hodges–Lehmann (HL) estimator of the location parameter for $n$ samples of data is defined as the median of the $n(n+1)/2$ pairwise averages $(x_i + x_j)/2$ for $i \le j$. This set of pairwise averages is usually referred to as the Walsh averages. The Hodges–Lehmann estimator has excellent efficiency and robustness properties when the observed data are generated by a symmetric distribution.

Below is a simple C function to compute the HL estimator for a given array of observations.

```c
#include <stdlib.h>

// Comparison function needed for qsort
int compare(const void *a, const void *b)
{
    const double *da = (const double *) a;
    const double *db = (const double *) b;

    return (*da > *db) - (*da < *db);
}

// Function to compute the Hodges-Lehmann estimator
double hodgeslehmann(double *x, int n)
{
    // store and compute Walsh averages
    unsigned int nw = n*(n+1)/2;
    double *walsh_avgs = (double *) malloc(sizeof(double)*nw);

    unsigned int k = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++) {
            walsh_avgs[k] = (x[i] + x[j]) / 2.0;
            k++;
        }
    }

    // sort the vector
    qsort(walsh_avgs, nw, sizeof(double), compare);

    // median of Walsh averages
    double m = walsh_avgs[nw/2];
    if (nw % 2 == 0) {
        m = (walsh_avgs[nw/2] + walsh_avgs[nw/2 - 1]) / 2.0;
    }

    // done
    free(walsh_avgs);
    return m;
}
```

I have also written a MATLAB driver function that allows the aforementioned C code to be executed within MATLAB; see the Software section to download this implementation. An efficient (pure) MATLAB implementation that avoids nested for loops is a little bit tricky. Here is a simple (memory intensive!) implementation that relies on the ndgrid(.) function. This code works reasonably well as long as the sample size is not too big.

```matlab
function m = hodglehm_slow(x)

%% Length of vector
x = x(:);       % make sure x is a column vector
n = length(x);

%% Compute cartesian product
[a, b] = ndgrid(x, x);

%% Average of all pairs
A = triu((a + b) ./ 2);

%% Extract upper triangle (including the diagonal)
Ix = triu(true(n));
A = A(Ix);

%% Compute median of averages
m = median(A(:));

%% done
return;
```

We can get the standard error of the HL estimator in MATLAB by using the bootstrap:

```matlab
% Generate 50 observations from a Student-t distribution with 2 dof
x = trnd(2, 50, 1);

% Compute standard errors for the mean and HL estimators
std(bootstrp(1000, @(a) mean(a), x))
std(bootstrp(1000, @(a) hodglehm_slow(a), x))
```

### Fisher information for irregular likelihood functions

I recently came across the following problem: how do you work out the Fisher information for a parameter $\theta \in \Omega$ when the likelihood function, say $p(\theta; x)$, is continuous, but not everywhere differentiable with respect to the parameter. The standard formula for the Fisher information

${\rm J}(\theta) = -\int_{\Omega} \left(\frac{\partial^2 l}{\partial \theta^2}\right) p(\theta; x) dx$

assumes regularity conditions which no longer hold, and hence is not applicable. After hunting around with Google for some time, I came across the following (freely downloadable) paper

H. E. Daniels
The Asymptotic Efficiency of a Maximum Likelihood Estimator
Fourth Berkeley Symp. on Math. Statist. and Prob., University of California Press, 1961, 1, 151-163

which, fortunately, had exactly what I needed. It turns out that we can still compute the Fisher information, without requiring the existence of second derivatives, using

${\rm J}(\theta) = \int_{\Omega} \left(\frac{\partial l}{\partial \theta}\right)^2 p(\theta; x) dx$

provided a set of weaker conditions holds. To get my head around the issue, I decided to look at a simple problem of working out the Fisher information for the mean $\mu \in \Omega = \mathbb{R}$ of a Laplace distribution

$p(x,\mu; s)= \frac{1}{2s} \exp(-| x - \mu| / s)$

The log-likelihood is now given by

$l(\mu; s) = -\log(2s) - \frac{|x-\mu|}{s} = \left\{\begin{array}{ll} -\log(2s) - \frac{x-\mu}{s} & x > \mu \\ -\log(2s) + \frac{x-\mu}{s} & x < \mu \\ \end{array} \right.$

The first derivative with respect to $\mu$ is

$\frac{\partial l}{\partial \mu} = \left\{\begin{array}{ll} \frac{1}{s} & x > \mu \\ -\frac{1}{s} & x < \mu \\ \end{array} \right.$

Therefore, the Fisher information is

${\rm J}(\mu) = \int_{-\infty}^\mu \left(\frac{\partial l}{\partial \mu}\right)^2 p(x,\mu; s) \, dx + \int_{\mu}^\infty \left(\frac{\partial l}{\partial \mu}\right)^2 p(x,\mu; s) \, dx = \frac{1}{s^2}$

The Fisher information for the scale parameter can be obtained in a similar manner. The first derivative with respect to $s$ is

$\frac{\partial l}{\partial s} = -\frac{1}{s} + \frac{1}{s^2} |x - \mu| = \frac{1}{s^2} (|x - \mu| - s )$

Since

${\rm E}\{|x-\mu|\} = s$, ${\rm E}\{|x-\mu|^2\} = 2 s^2$

the Fisher information for the scale parameter is

${\rm J}(s) = \frac{1}{s^4} {\rm E} \left\{(|x-\mu| - s)^2 \right\} = \frac{1}{s^2}$
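For completeness, the two moments used above follow from the fact that $|x - \mu|$ has density $\frac{1}{s} \exp(-t/s)$ on $(0, \infty)$, so that

${\rm E}\{|x-\mu|\} = \frac{1}{s} \int_0^\infty t \, e^{-t/s} \, dt = s, \qquad {\rm E}\{|x-\mu|^2\} = \frac{1}{s} \int_0^\infty t^2 \, e^{-t/s} \, dt = 2s^2,$

and expanding the square inside the expectation confirms the result:

${\rm E}\{(|x-\mu| - s)^2\} = 2s^2 - 2s^2 + s^2 = s^2.$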

The paper by Daniels details the exact conditions needed for the Fisher information to be derived in this way. Note that there is a mistake in one of the proofs; the correction is detailed in

J. A. Williamson
A Note on the Proof by H. E. Daniels of the Asymptotic Efficiency of a Maximum Likelihood Estimator
Biometrika, 1984, 71, 651-653

### The blog expands

I’ve started writing a new page for the blog that examines the theory behind various common statistical methods and models. The idea here is to include references and brief descriptions of some of the important and interesting papers on various topics. I also intend to include web links to software implementations whenever possible. The new material can be found by clicking on the “Statistics Theory and Applications” tab above. I am currently working on the linear regression section and the LASSO regularisation method for linear regression.

In other news, the grant writing season has again started. This year, we aim to write one early career researcher ARC grant, and at least one NHMRC grant. The ARC grant will be a two year proposal for work on modern logistic regression techniques. We are in the process of determining which NHMRC grant(s) will be written and which researchers will be involved. I can only hope that this grant season is as successful as last year's!

### Penalized Logistic Regression

Last week I gave a seminar at my work place on the advantages of using penalized logistic regression methods (such as the LASSO, elastic net, etc.) over the standard maximum likelihood approach. The target audience was genetic epidemiologists who have some practical knowledge of fitting logistic models, but may not be aware of the recent theoretical work in the area. The slides from the seminar are now available from the Publications page.

### Statistical Analysis Questions

The other day I came across a new Q&A site for statistical analysis called Statistical Analysis Questions. I am not sure how long the site has been running, but there are already about 200 questions, 750 answers and more than 600 users. I recommend checking it out; it seems quite useful for anyone involved in applied or theoretical statistics work.