Linear algebra in C/C++ using Eigen

I’ve recently started using the free, open-source library Eigen for linear algebra routines in C/C++. The library is absolutely fantastic in terms of perfomance and functionality, however, as is sometimes the case with open-source software, the API documentation is somewhat lacking. If I get some time in the future I hope to contribute a few articles to the online documentation. Here is an example of what I mean.

Suppose you want to do a QR decomposition of a matrix X. There are three classes that can do this for you, one of which is ColPivHouseholderQR, and you may write something like

1
2
// perform a QR decomposition of the [n x p] X matrix; where n > p
Eigen::ColPivHouseholderQR< Eigen::MatrixXd > qr = X.colPivHouseholderQr();

Seems fairly straightforward, right? But, how do we get Q and R? If we look at the online documentation for ColPivHouseholderQR, we find a public method called matrixQR() which seems to return a reference to a matrix related to Q and R in some fashion. The documentation doesn’t state anything about how to get Q and R from the returned matrix. It turns out, that the matrix returned by matrixQR() contains both Q and R, with R occupying the ‘top’ block and Q the ‘bottom’ block. Thus, to get, say, R we need to use something like

1
2
3
4
// perform a QR decomposition of the [n x p] X matrix; where n > p
Eigen::ColPivHouseholderQR< Eigen::MatrixXd > qr = X.colPivHouseholderQr();  
Eigen::Matrix<double,p,p> R; 
R = qr.matrixQR().topRightCorner(p,p).triangularView<Eigen::Upper>();	// R matrix

Here, topRightCorner() extracts the matrix of interest and triangularView() ensures that R is stored as an upper triangular dense matrix.

It didn’t take too long to figure out how QR works in Eigen. After discovering the solution I found a post on the Eigen forum that describes a similar solution to the aforementioned. I urge anyone who uses Eigen to contribute code and documentation to the project, if possible. It really is a fantastic open-source library and equals and surpasses many other commercial linear algebra packages in both performance and quality.

No Comments

Analysing cancer data with a supercomputer

Yours truly has recently appeared in an article published in the local newspaper, The Age. The article discusses our NHMRC grant to use the IBM Blue Gene/P supercomputer for processing and analysis of SNP data. Unfortunately, I didn’t get a chance to proof read the story and ended up being referred to as a “computer whiz” and a talented “games programmer”. Oh well, any publicity is good publicity, right?

No Comments

Fisher Information in Single Factor Analysis

The single factor analysis model for data \{ {\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n \} is

 {\bf x}_i  = {\bf m} + v_i {\bf a} + \boldsymbol{\epsilon}_i \quad \; (i=1,2,\ldots,n),

where {\bf x}_i \in \mathbb{R}^k, \boldsymbol{\epsilon}_i is an i.i.d. zero-mean multivariate Gaussian random variable with a symmetric, positive definite variance-covariance matrix \boldsymbol{\Sigma} \in \mathbb{R}^{k \times k}; that is, \boldsymbol{\epsilon}_i \sim {\rm N}_k({\bf 0}, \boldsymbol{\Sigma}) for all i=1,2,\ldots,n. It is thus assumed that each data vector {\bf x}_i follows a k-variate Gaussian distribution with moments

	{\rm E} \{ {\bf x}_i \} = {\bf m} + v_i {\bf a},

	{\rm Var} \{ {\bf x} \} = \boldsymbol{\Sigma}.

The aim of single factor analysis is to estimate the unknown factor scores {\bf v} = (v_1, v_2, \ldots, v_n)^\prime \in \mathbb{R}^n, the mean {\bf m} = (m_1,m_2, \ldots, m_k)^\prime \in \mathbb{R}^k, the variance-covariance matrix \boldsymbol{\Sigma} and the factor loadings {\bf a} = (a_1, a_2, \ldots, a_k)^\prime \in \mathbb{R}^k.

I’ve written this note to show how one may derive the Fisher information for the single factor model. This turns out to be a fairly involved task requiring a lot of linear algebra and matrix differential calculus identities. I’ve uploaded a document describing all the necessary mathematical steps to the Publications section. Anyone interested in matrix differential calculus and linear algebra should find it useful.

No Comments

Computing p-values for large Z scores in MATLAB

I recently found a little trick for computing two-sided p-values in MATLAB when given a large Z score. The obvious approach is to use the normcdf(.) function with something like this

1
2
z=1.96;                                                                            
2*(1-normcdf(z))

However, the above only works for relatively small z, in the range z < 10. In order to get p-values for z as large as 30, one should instead run

1
2
z=1.96;                                                                           
2*normcdf(-abs(z))

For those using R, the above can be accomplished by running:

1
2
z <- 1.96                                                                          
2*pnorm(z,lower.tail=F)

No Comments

Sampling from a k-nomial distribution with Boost

Boost is free, open source and peer-reviewed set of C++ libraries that is used in academic as well as commercial environments. I have recently used the Boost random number generation and special mathematical functions library for a project to detect gene-gene interactions in breast cancer. Unfortunately, the Boost random number generation library dose not currently include the k-nomial distribution, which is something I needed for the project. Here is a simple implementation I ended up with in the hope that it may be useful to other people. Note, my implementation is definitelly not optimal and is inefficient for generating large samples.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
#include <cassert>
 
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/discrete_distribution.hpp>
 
boost::mt19937 rng;	// random number generator
boost::random::uniform_01<double> dist;
 
/*
 * Generate n samples from a k-nomial distribution.
 * Input:	
 *     tab - vector of length k to store the counts; 
 *         - tab[j] \in [0, n] for all j=1,...,k
 *         - sum_{j=0}^k tab[j] = n
 *     p   - the probability vector of length k 
 *         - does not assume p(0)+p(1)+...+p(k)=1; 
 *     k   - k-nomial (k>1)
 *     n   - sample size
 */
void sample_knomial(unsigned int *tab, double *p, unsigned int k, unsigned int n)
{
        assert(k>1);
 
	// set all k-cells to zero
	memset(tab,0,sizeof(unsigned int)*k);
 
	// normalise the probabilities
	double norm_factor=0.0;
	for(unsigned int j=0; j<k; j++) {
		assert(p[j] > 0); 
		norm_factor += p[j];
	}
 
	// pre-compute cumulative sums [p(0),p(0)+p(1),...,p(0)+p(1)+...+p(k-1)]
	// (could sort p prior to this)
	double *cumsum = new double[k];
	cumsum[0]=p[0]/norm_factor;
	for(unsigned int j=1; j<k; j++)
	{
		cumsum[j] = cumsum[j-1] + p[j]/norm_factor;
	}
 
	// Generate n samples from a k-nomial with probabilities p_1,...,p_k
	unsigned int t = 0;
	double u = 0.0;
	for(unsigned int i=0; i<n; i++)
	{
		// Generate uniform u \in [0,1)
		u = dist(rng);
		// Determine in which of the k cells u falls
		for(t=0; u>cumsum[t]; t++);
		// Increment count for cell j
		tab[t]++;
	}
 
	// free memory
	delete[] cumsum;
}

No Comments

Computing the Hodges–Lehmann estimator

The Hodges–Lehmann estimator (HL) of the location parameter for n samples of data is defined as the median of the averages of all n(n+1)/2 data pairs. This set of all pairwise averages is usually referred to as the Walsh averages. The Hodges-Lehmann estimator has some excellent efficiency and robustness properties if the observed data is generated by a symmetric distribution.

Below is a simple C function to compute the HL estimator for a given array of observations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
// Comparison function needed for quicksort                                          
int compare (const void * a, const void * b)
{
    const double *da = (const double *) a;
    const double *db = (const double *) b;
 
    return (*da > *db) - (*da < *db);
}
 
 
// Function to compute the Hodges-Lehmann estimator
double hodgeslehmann(double *x, int n)
{
    // store and compute Walsh averages
    unsigned int nw=n*(n+1)/2;
    double *walsh_avgs=(double*) malloc(sizeof(double)*nw);
 
    unsigned int k=0;
    for(unsigned int i=0; i<n; i++) {
        for(unsigned int j=i; j<n; j++) {
            walsh_avgs[k]=(x[i]+x[j])/2.0;
            k++;
        }
    }
 
    // sort the vector
    qsort(walsh_avgs,nw,sizeof(double),compare);
 
    // median of walsh averages
    double m=walsh_avgs[nw/2];
    if(nw % 2 == 0) {
        m=((walsh_avgs[nw/2] + walsh_avgs[nw/2 - 1]) / 2.0);
    }
 
    // done
    free(walsh_avgs);
    return m;
}

I have also written a MATLAB driver function that allows the aforementioned C code to be executed within MATLAB, see the Software section to download this implementation. An efficient (pure) MATLAB implementation that avoids the use of nested for loops is a little bit tricky. Here is a simple (memmory intensive!) implementation that relies on the ndgrid(.) function. This code works reasonably well as long as the sample size is not too big.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
function m=hodglehm_slow(x)                                                         
 
%% Length of vector
x=x(:);         % make sure x is a column vector
n=length(x);
 
%% compute cartesian product
[a,b]=ndgrid(x,x);
 
%% Sum of all pairs
A=triu((a+b)./2);  
 
%% Extract upper diagonal
Ix=triu(true(n));
A=A(Ix);
 
%% Compute median of averages
m=median(A(:));
 
%% done
return;

We can get the standard error of the HL estimator in MATLAB by using the bootstrap:

1
2
3
4
5
6
% Generate 50 observations from a Student-t distribution with 2 dof                  
x=trnd(2,50,1);
 
% Compute standard errors for the mean and HL estimators
std(bootstrp(1000, @(a) mean(a), x))
std(bootstrp(1000, @(a) hodglehm_slow(a), x))

No Comments

Hyperspherical Coordinates

Suppose we have a vector {\bf x}=(x_1,x_2,\ldots,x_n)^\prime \in \mathbb{R}^n residing in an n-dimensional Euclidean space. How do we express this vector using (hyper-)spherical coordinates in MATLAB? MATLAB implements a couple of functions, namely cart2pol() and cart2sph() , that handle the conversion in two and three dimensions, respectively. However, there is no general method for vectors of arbitrary dimension.

Fortunately, Wikipedia lists the formulas required to convert Euclidean coordinates to hyperspherical coordinates and these are relatively straightforward to implement in MATLAB. Below is a MATLAB function that takes an n-dimensional Euclidean vector as input and returns the corresponding radial coordinate r > 0 as well as the (n-1) angular coordinates \phi_1, \phi_2, \ldots, \phi_{n-1}; note that, \phi_{n-1} \in [0,2\pi], while all the other angles are in the set [0, \pi].

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
%% The input vector x is a n-dim. column vector.
function [r,theta]=cart2nsph(x)
 
%% setup
n = length(x);
theta = zeros(n-1,1);
x2 = x.^2;
cumx = cumsum(x2(end:-1:2));
 
%% Compute the radial coordinate
r = sqrt(sum(x2));
 
%% Compute the angles
theta(1:end-1,1) = x(1:end-2,1) ./ sqrt( cumx(end:-1:2) ) ;
theta(end,1) = (sqrt(sum(x2(end:-1:end-1,1))) + x(end-1,1)) ./ x(end,1);
theta = acot(theta);
theta(theta<0) = theta(theta<0)+pi;   % Convert acot principal value to [0,pi]
theta(end,1) = 2*theta(end,1);       % Last angle is between [0,2pi]
 
%% done
return;

Line 11 computes the radial coordinate which is simply the square root of the sum of the squares of the original, Euclidean coordinates. Lines 14-18 are required to obtain the angular coordinates. Some care is necessary in order to get all the angles in the correct range, since the principal value of the function acot() in MATLAB is in [-\pi/2,\pi/2] and not in the (usual) range of [0,\pi]. We correct for this in Line 17 by adding \pi to all the “negative” angles.

The function to compute n-dimensional Euclidean coordinates from hyperspherical coordinates is somewhat simpler. One possible implementation is given below:

1
2
3
4
5
6
7
8
9
10
11
12
%% Radial coordinate, r>0, must be positive. 
%% The angles theta_1,...,theta_{n-2} must be in [0,pi], while 
%% theta_{n-1} should be in [0,2pi].
function x = nsph2cart(r,theta)                                                   
 
c=[cos(theta); 1];
s=[1 ;cumprod(sin(theta))];
 
x=r*(c.*s);
 
%% done
return;

Except in some special cases the transform is unique. Both functions were tested on MATLAB version 7.12.0 (R2011a) and dont have any error checking. I hope that some of you can find these functions useful in your work.

No Comments

Fisher information for irregular likelihood functions

I recently came across the following problem: how do you work out the Fisher information for a parameter \theta \in \Omega when the likelihood function, say p(\theta; x), is continuous, but not everywhere differentiable with respect to the parameter. The standard formula for the Fisher information

 {\rm J}(\theta) = \int_{\Omega} \left(\frac{\partial^2 l}{\partial \theta^2}\right) p(\theta; x) dx

assumes regularity conditions which no longer hold, and hence is not applicable. After hunting around with Google for some time, I came across the following (freely downloadable) paper

H. E. Daniels
The Asymptotic Efficiency of a Maximum Likelihood Estimator
Fourth Berkeley Symp. on Math. Statist. and Prob., University of California Press, 1961, 1, 151-163

which, fortunately, had exactly what I need. It turns out, we can still compute the Fisher information, without existence of second derivatives, using

 {\rm J}(\theta) = \int_{\Omega} \left(\frac{\partial l}{\partial \theta}\right)^2 p(\theta; x) dx

provided a set of weaker conditions holds. To get my head around the issue, I decided to look at a simple problem of working out the Fisher information for the mean \mu \in \Omega = \mathbb{R} of a Laplace distribution

  p(x,\mu; s)= \frac{1}{2s} \exp(-| x - \mu| / s)

The log-likelihood is now given by

  l(\mu; s) = -\log(2s) - \frac{|x-\mu|}{s}  = \left\{\begin{array}{ll} -\log(2s) - \frac{x-\mu}{s} & x > \mu \\ -\log(2s) + \frac{x-\mu}{s} 	& x < \mu \\ \end{array} \right.

The first derivative with respect to \mu is

\frac{\partial l}{\partial \mu}  = \left\{\begin{array}{ll} \frac{1}{s} & x > \mu \\ -\frac{1}{s} & x < \mu \\ \end{array} \right.

Therefore, the Fisher information is

 {\rm J}(\mu) = \int_{-\infty}^\mu \left(\frac{\partial l}{\partial \mu}\right)^2 p(x,\mu; s) dx + \int_{\mu}^\infty \left(\frac{\partial l}{\partial \mu}\right)^2 p(x,\mu; s) dx = \frac{1}{s^2}

The Fisher information for the scale parameter can be obtained in a similar manner. The first derivative with respect to s is

\frac{\partial l}{\partial s}  = -\frac{1}{s} + \frac{1}{s^2} |x - \mu| = \frac{1}{s^2} (|x - \mu| - s )

Since,

 {\rm E}\{|x-\mu|\} = s,  {\rm E}\{|x-\mu|^2\} = 2 s^2

the Fisher information for the scale parameter is

 {\rm J}(s) = \frac{1}{s^4} {\rm E} \left\{(|x-\mu| - s)^2 \right\} = \frac{1}{s^2}

The paper by Daniels details the exact conditions needed for the Fisher information to be derived in this way. Note, there is a mistake in one of the proofs, the correction is detailed in

J. A. Williamson
A Note on the Proof by H. E. Daniels of the Asymptotic Efficiency of a Maximum Likelihood Estimator
Biometrika, 1984, 71, 651-653

Unfortunately, the Williamson paper requires a JSTOR subscription for download.

2 Comments

The blog expands

I’ve started writing a new page for the blog that examines the theory behind various common statistical methods and models. The idea here is to include references and brief descriptions of some of the important and interesting papers on various topics. I also intend to include web links to software implementations whenever possible. The new material can be found by clicking on the “Statistics Theory and Applications” tab above. I am currently working on the linear regression section and the LASSO regularisation method for linear regression.

In other news, the grant writing season has again started. This year, we aim to write one early career researcher ARC grant, and at least one NHMRC grant. The ARC grant will be a two year proposal for work on modern logistic regression techniques. We are in the process of determining which NHMRC grant(s) will be written and which researchers will be involved. I can only hope that this grant season is as successful as last years!

No Comments

More Grant News

The NHMRC funding outcomes have been released and we have been awarded another grant! This is fantastic news given the competitive nature of the funding process. The list of project grants that were funded for 2011 can be found here. We were awarded approximately $400k over a period of two years for a research project on mammographic density. This is a very exciting area of research and we hope to positively contribute towards breast cancer research and making mammographic density more clinically useful.

Continuing on with more good news, the list of accepted papers for the 23rd Australasian Joint Conference on Artificial Intelligence (AJCAI 2010) has been announced a couple of weeks ago. AJCAI is a premier conference on artificial intelligence and machine learning that is held annually in the Australasian region. We usually send at least one paper every year to AJCAI given how rare international machine learning conferences are in Australia. This year we ended up submitting two papers, both of which were accepted. Moreover, one of the papers we submitted has also been awarded the best conference paper! Following the conference, I will be putting up both papers in the Publications section along with any relevant MATLAB and C/C++ code.

Lastly, I’d like to congratulate my colleague and good friend Daniel Schmidt for getting engaged this year to one lovely lady! I wish you both the best of luck!

No Comments