MATLAB Bayesian regression software updated

We have updated bayesreg, a MATLAB toolbox that implements Bayesian linear and logistic regression with sparsity-inducing priors, to version 1.6. The package now handles logistic regression without the need for MEX files, but substantial speed-ups are obtained with the compiled code, so compiling is recommended. To compile the C++ code, run compile.m from the bayesreg directory within MATLAB. Alternatively, for convenience, pre-compiled MEX files (MATLAB R2017a) for Windows, Linux and Mac OS X can be downloaded from here. To use these, simply download and unzip them into the “bayesreg” folder.

Version 1.6 includes the following changes:

  • Implemented block sampling of betas for data with large numbers of predictors (options ‘blocksample’ and ‘blocksize’)
  • Display the Widely Applicable Information Criterion (WAIC) instead of the DIC in the summary output
  • Implemented alternative priors for tau2
  • Added an example to demonstrate block sampling and alternative priors for tau2 (see examples\br_example13.m); a minimal call sketch is also given below this list
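
For orientation, here is a hedged sketch of a call that turns on the new block sampling. Only the ‘blocksample’ and ‘blocksize’ option names come from this release; the remaining arguments are assumptions about the usual bayesreg interface, so treat examples\br_example13.m and “help bayesreg” as the authoritative reference.

% Sketch only -- option names/values other than 'blocksample'/'blocksize'
% are illustrative; see examples\br_example13.m for the real example.
X = randn(100, 5000);                        % many more predictors than samples
y = X(:,1) - 2*X(:,2) + randn(100, 1);
[beta, beta0, retval] = bayesreg(X, y, 'gaussian', 'hs', ...   % 'hs' = horseshoe prior
    'nsamples', 1e4, 'burnin', 1e3, ...
    'blocksample', true, 'blocksize', 500);                    % sample the betas in blocks of 500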

Version 1.5 includes the following changes:

  • Wrote a new parallelised C++ implementation of the logistic regression sampler
  • Included an efficient pure-MATLAB implementation of the logistic regression sampler; this is used when MEX files are not available, though it is slower than the compiled code

Version 1.4 includes the following changes:

  • Added option ‘groups’ which allows grouping of variables into potentially overlapping groups
  • Groups work with HS, HS+ and lasso priors
  • Fixed a regression bug with g priors and logistic models
  • Updated examples

Version 1.3 includes the following changes:

  • Tidied up the summary display to switch between scientific and normal notation depending
    on size of numbers in table
  • Fixed the summary banner to adjust to the size of the table
  • Added support for MATLAB tables
    • Code now accepts either X/y as a matrix/vector pair, or a table together with the string name of the table variable to use as the target
  • Added support for categorical predictors
    • These are now either specified via the ‘catvars’ option for matrix inputs, or determined automatically from the contents of the table
  • Removed the dependency of br_summary() on X and y
  • Consolidated all likelihood functions into a single br_regnlike() function that is vectorized using bsxfun
  • Added a br_predict function to produce predictions over data given design matrices
    • Checks input design matrix/table against structure of training data
    • Automatically handles categorical expansions as required
    • Produces credible intervals if requested
    • Produces prediction statistics if targets are supplied
  • Updated and improved the example scripts
  • Added initial internal code structures required for grouping variables and variable expansions
  • Fixed a bug in the computation of R2


Bayesian penalized regression with continuous shrinkage prior densities

We have implemented a new MATLAB toolbox for Bayesian regression. The toolbox features Bayesian linear regression with Gaussian or heavy-tailed error models, and Bayesian logistic regression, with ridge, lasso, horseshoe and horseshoe+ estimators. Two heavy-tailed error models are supported in linear regression: Laplace and Student-t errors. The toolbox is very efficient and can be used with high-dimensional data. It uses the C/C++ Pólya-Gamma sampler written by Jesse Windle.

Version 1.1 of the MATLAB toolbox can be downloaded from here. Please see “examples_bayesreg.m” for examples on how to use the toolbox, or type “help bayesreg” within MATLAB.
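
As a quick orientation, a minimal fit of a linear model with a horseshoe prior might look like the sketch below. The argument names and the model/prior strings used here are assumptions rather than quotes from the documentation, so check “help bayesreg” and “examples_bayesreg.m” for the definitive interface.

% Minimal sketch -- argument names and model/prior strings are assumptions;
% see "help bayesreg" for the definitive interface.
n = 100; p = 20;
X = randn(n, p);
y = X(:,1) + 0.5*X(:,2) + randn(n, 1);
[beta, beta0, retval] = bayesreg(X, y, 'gaussian', 'hs', ...   % Gaussian errors, horseshoe prior
    'nsamples', 1e4, 'burnin', 1e3, 'thin', 5);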

An R version of this toolbox is now available on CRAN. To install the R package, type install.packages("bayesreg") within R.

To cite this toolbox:

  1. Makalic, E. & Schmidt, D. F.
    High-Dimensional Bayesian Regularised Regression with the BayesReg Package
    arXiv:1611.06649 [stat.CO], 2016

Updates:
(23/01): Updated the bayesreg code to version 1.2. This version implements Zellner’s g-prior for linear and logistic regression. The g-prior only works with full rank matrices. The examples in “examples_bayesreg.m” have been updated to include a g-prior example.

(17/01): Updated the bayesreg code to version 1.1. The display code is now in a separate “summary()” function so that it can be used many times on the same data and sampling results. We have updated “examples_bayesreg.m” to include examples of the new “summary()” function.


MML analysis of the multivariate Gaussian distribution

In his book on the minimum message length (MML) theory of inference, Wallace shows how one may derive MML estimates and message length formulae for many popular parametric distributions. An important example is the multivariate Gaussian distribution given in Section 6.2.1 of the book. Here, Wallace derives the commonly used Wallace-Freeman (or MML87) message length formula and parameter estimates for the p-variate Gaussian using conjugate priors. After noticing a minor typographical error in the Fisher information formula (p. 262), I decided to re-examine the MML87 treatment of the Gaussian from scratch. The first draft of my attempt is available here.
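
As background for readers who have not seen it, the general MML87 (Wallace–Freeman) message length that the book specialises to the p-variate Gaussian has the standard form

 I_{87}(\theta, {\bf x}) = -\log \frac{h(\theta)}{\sqrt{|F(\theta)|}} - \log f({\bf x} \,|\, \theta) + \frac{d}{2}\left(1 + \log \kappa_d\right),

where h(\theta) is the prior density, f({\bf x} \,|\, \theta) is the likelihood, F(\theta) is the expected Fisher information matrix, d is the number of free parameters and \kappa_d is the optimal quantising lattice constant in d dimensions.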

I hope this document will be useful to researchers interested in learning more about the MML principle.

Update (24/07/2014): I have written MATLAB code to compute the MML estimates and the message length. I have also made minor updates to the document.


Australian Rules Football and ELO Ratings

The ELO rating system is widely used in professional individual and team sports to compute relative rankings of teams or individuals. I had some free time on the weekend and wrote a fairly simple MATLAB script to compute the ELO rating for each team in the Australian Football League (AFL). I used the data files from AFL Tables as input to the MATLAB code. The table below lists all the ELO ratings as well as some basic statistics, such as the total number of wins, losses and draws for each team.

RANK TEAM ELO WINS LOSSES DRAWS TOTAL
1 Hawthorn 1475 869 951 10 1830
2 Geelong 1469 1224 1042 21 2287
3 Collingwood 1468 1460 910 26 2396
4 Sydney 1428 1061 1199 23 2283
5 West Coast 1360 339 269 5 613
6 Brisbane Bears 1319 72 148 2 222
7 Adelaide 1316 269 241 1 511
8 St Kilda 1311 877 1338 25 2240
9 North Melbourne 1293 803 1005 17 1825
10 Fremantle 1292 168 236 0 404
11 Carlton 1253 1385 939 33 2357
12 Essendon 1194 1315 973 34 2322
13 Richmond 1170 1058 1035 22 2115
14 Brisbane Lions 1147 192 175 6 373
15 Western Bulldogs 1099 803 976 22 1801
16 Melbourne 1032 1034 1209 21 2264
17 Port Adelaide 1032 183 181 5 369
18 GW Sydney 1016 2 20 0 22
19 Gold Coast 992 6 38 0 44
20 Fitzroy 871 869 1034 25 1928
21 University 781 27 97 2 126

Note:

  • I used the results of all 14,166 games played in the VFL/AFL since 1897 to create the table.
  • The ranking scores of the top three teams are really close, but Hawthorn has edged out Collingwood and Geelong.
  • The ELO rating of each team was initialised to 1200; the only exception to this rule was the Brisbane Lions, which inherited the final ELO score of the Brisbane Bears. This seemed reasonable given the details of the “merger” between the Brisbane Bears and Fitzroy Lions.
  • The team University withdrew from the competition during World War I and never returned.
  • All ratings were updated using a constant K-factor of 32; a sketch of the update rule is given below this list.
  • My MATLAB script and the data file used to create this table are freely available for download here.
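
For those curious about the mechanics, the per-game update in a standard ELO scheme with a fixed K-factor is sketched below in MATLAB; this is the textbook rule rather than a verbatim excerpt from my script.

% Standard ELO update for a single game (textbook rule, not a verbatim
% excerpt from the script linked above).
Rh = 1200; Ra = 1200;                  % pre-game ratings (home, away)
S  = 1;                                % home result: 1 = win, 0.5 = draw, 0 = loss
K  = 32;                               % constant K-factor, as used for the table
Eh = 1/(1 + 10^((Ra - Rh)/400));       % expected score for the home team
Rh_new = Rh + K*(S - Eh);              % home rating after the game
Ra_new = Ra + K*((1 - S) - (1 - Eh));  % away rating after the game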

It would be of interest to use this data for prediction of the AFL 2013 season, perhaps using a methodology similar to the ELO++ work.


Linear algebra in C/C++ using Eigen

I’ve recently started using the free, open-source library Eigen for linear algebra routines in C/C++. The library is absolutely fantastic in terms of performance and functionality; however, as is sometimes the case with open-source software, the API documentation is somewhat lacking. If I get some time in the future I hope to contribute a few articles to the online documentation. Here is an example of what I mean.

Suppose you want to do a QR decomposition of a matrix X. There are three classes that can do this for you, one of which is ColPivHouseholderQR, and you may write something like

// perform a QR decomposition of the [n x p] X matrix; where n > p
Eigen::ColPivHouseholderQR< Eigen::MatrixXd > qr = X.colPivHouseholderQr();

Seems fairly straightforward, right? But how do we get Q and R? If we look at the online documentation for ColPivHouseholderQR, we find a public method called matrixQR() which returns a reference to a matrix related to Q and R in some fashion, but the documentation doesn’t state how to recover Q and R from it. It turns out that the returned matrix stores both factors in compact form: R occupies the upper triangular part, while the Householder vectors that encode Q are stored below the diagonal. Thus, to extract, say, R, we can write something like

// perform a QR decomposition of the [n x p] X matrix; where n > p
Eigen::ColPivHouseholderQR< Eigen::MatrixXd > qr = X.colPivHouseholderQr();
Eigen::MatrixXd R;	// p is a run-time dimension, so use a dynamic-size matrix
R = qr.matrixQR().topRightCorner(p,p).triangularView<Eigen::Upper>();	// R matrix

Here, topRightCorner() extracts the block of interest and triangularView() ensures that R is stored as an upper triangular dense matrix. If the Q factor itself is needed, it can be obtained from the decomposition’s householderQ() member.

It didn’t take too long to figure out how QR works in Eigen. After discovering the solution, I found a post on the Eigen forum describing a similar approach. I urge anyone who uses Eigen to contribute code and documentation to the project, if possible. It really is a fantastic open-source library that matches or surpasses many commercial linear algebra packages in both performance and quality.


Analysing cancer data with a supercomputer

Yours truly has recently appeared in an article published in the local newspaper, The Age. The article discusses our NHMRC grant to use the IBM Blue Gene/P supercomputer for processing and analysis of SNP data. Unfortunately, I didn’t get a chance to proofread the story and ended up being referred to as a “computer whiz” and a talented “games programmer”. Oh well, any publicity is good publicity, right?


Fisher Information in Single Factor Analysis

The single factor analysis model for data \{ {\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n \} is

 {\bf x}_i  = {\bf m} + v_i {\bf a} + \boldsymbol{\epsilon}_i \quad \; (i=1,2,\ldots,n),

where {\bf x}_i \in \mathbb{R}^k, \boldsymbol{\epsilon}_i is an i.i.d. zero-mean multivariate Gaussian random variable with a symmetric, positive definite variance-covariance matrix \boldsymbol{\Sigma} \in \mathbb{R}^{k \times k}; that is, \boldsymbol{\epsilon}_i \sim {\rm N}_k({\bf 0}, \boldsymbol{\Sigma}) for all i=1,2,\ldots,n. It is thus assumed that each data vector {\bf x}_i follows a k-variate Gaussian distribution with moments

	{\rm E} \{ {\bf x}_i \} = {\bf m} + v_i {\bf a},

	{\rm Var} \{ {\bf x}_i \} = \boldsymbol{\Sigma}.

The aim of single factor analysis is to estimate the unknown factor scores {\bf v} = (v_1, v_2, \ldots, v_n)^\prime \in \mathbb{R}^n, the mean {\bf m} = (m_1,m_2, \ldots, m_k)^\prime \in \mathbb{R}^k, the variance-covariance matrix \boldsymbol{\Sigma} and the factor loadings {\bf a} = (a_1, a_2, \ldots, a_k)^\prime \in \mathbb{R}^k.
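
For completeness, the log-likelihood of a single observation, from which the Fisher information is obtained, follows directly from the Gaussian assumption above:

 \log f({\bf x}_i \,|\, {\bf m}, {\bf a}, v_i, \boldsymbol{\Sigma}) = -\frac{k}{2} \log (2\pi) - \frac{1}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} ({\bf x}_i - {\bf m} - v_i {\bf a})^\prime \boldsymbol{\Sigma}^{-1} ({\bf x}_i - {\bf m} - v_i {\bf a}).

The Fisher information is then the expectation of the negative Hessian of the full log-likelihood \sum_{i=1}^{n} \log f({\bf x}_i \,|\, \cdot) with respect to the unknown parameters.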

I’ve written this note to show how one may derive the Fisher information for the single factor model. This turns out to be a fairly involved task requiring a lot of linear algebra and matrix differential calculus identities. I’ve uploaded a document describing all the necessary mathematical steps to the Publications section. Anyone interested in matrix differential calculus and linear algebra should find it useful.


Computing p-values for large Z scores in MATLAB

I recently found a little trick for computing two-sided p-values in MATLAB when given a large Z score. The obvious approach is to use the normcdf(.) function with something like this

z=1.96;                                                                            
2*(1-normcdf(z))

However, the above only works for relatively small z, roughly in the range z < 10: in double precision, normcdf(z) rounds to exactly 1 once the upper tail drops below machine precision, so 1 - normcdf(z) returns 0. To get accurate p-values for z as large as 30, one should instead evaluate the lower tail directly and run

z=1.96;                                                                           
2*normcdf(-abs(z))
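
To see the difference, try a large score; the exact figure printed depends on your MATLAB version, but the first expression collapses to zero while the second does not.

z = 30;
2*(1 - normcdf(z))      % returns 0: normcdf(30) has already rounded to 1
2*normcdf(-abs(z))      % returns a tiny but nonzero value (on the order of 1e-197)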

For those using R, the above can be accomplished by running:

z <- 1.96
2*pnorm(-abs(z))


Sampling from a k-nomial distribution with Boost

Boost is a free, open-source and peer-reviewed set of C++ libraries used in academic as well as commercial environments. I have recently used the Boost random number generation and special mathematical functions libraries for a project to detect gene-gene interactions in breast cancer. Unfortunately, the Boost random number generation library does not currently include the k-nomial (multinomial) distribution, which is something I needed for the project. Here is a simple implementation I ended up with, in the hope that it may be useful to other people. Note that my implementation is definitely not optimal and is inefficient for generating large samples.

#include <cassert>
#include <cstring>
 
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_01.hpp>
 
boost::mt19937 rng;	// random number generator
boost::random::uniform_01<double> dist;	// uniform variates on [0,1)
 
/*
 * Generate n samples from a k-nomial distribution.
 * Input:
 *     tab - vector of length k in which the cell counts are returned;
 *         - tab[j] \in [0, n] for all j=0,...,k-1
 *         - sum_{j=0}^{k-1} tab[j] = n
 *     p   - the probability vector of length k
 *         - does not assume p[0]+p[1]+...+p[k-1]=1 (normalised internally)
 *     k   - number of cells (k>1)
 *     n   - sample size
 */
void sample_knomial(unsigned int *tab, double *p, unsigned int k, unsigned int n)
{
	assert(k > 1);
 
	// set all k cells to zero
	memset(tab, 0, sizeof(unsigned int)*k);
 
	// normalise the probabilities
	double norm_factor = 0.0;
	for(unsigned int j = 0; j < k; j++) {
		assert(p[j] > 0);
		norm_factor += p[j];
	}
 
	// pre-compute cumulative sums [p(0),p(0)+p(1),...,p(0)+p(1)+...+p(k-1)]
	// (p could be sorted in decreasing order first to speed up the search)
	double *cumsum = new double[k];
	cumsum[0] = p[0]/norm_factor;
	for(unsigned int j = 1; j < k; j++)
	{
		cumsum[j] = cumsum[j-1] + p[j]/norm_factor;
	}
 
	// generate n samples from a k-nomial with probabilities p_1,...,p_k
	unsigned int t = 0;
	double u = 0.0;
	for(unsigned int i = 0; i < n; i++)
	{
		// generate uniform u \in [0,1)
		u = dist(rng);
		// determine in which of the k cells u falls
		// (the t < k-1 guard protects against rounding error in the cumulative sums)
		for(t = 0; t < k-1 && u > cumsum[t]; t++);
		// increment the count for cell t
		tab[t]++;
	}
 
	// free memory
	delete[] cumsum;
}


Computing the Hodges–Lehmann estimator

The Hodges–Lehmann (HL) estimator of the location parameter for a sample of n observations is defined as the median of the n(n+1)/2 pairwise averages (x_i + x_j)/2 for i ≤ j. This set of all pairwise averages is usually referred to as the Walsh averages. For example, for the sample {1, 2, 4} the Walsh averages are {1, 1.5, 2, 2.5, 3, 4} and the HL estimate is 2.25. The Hodges–Lehmann estimator has excellent efficiency and robustness properties when the observed data are generated by a symmetric distribution.

Below is a simple C function to compute the HL estimator for a given array of observations.

#include <stdlib.h>     // malloc, free and qsort
 
// Comparison function needed for qsort
int compare (const void * a, const void * b)
{
    const double *da = (const double *) a;
    const double *db = (const double *) b;
 
    return (*da > *db) - (*da < *db);
}
 
 
// Function to compute the Hodges-Lehmann estimator
double hodgeslehmann(double *x, int n)
{
    // store and compute Walsh averages
    unsigned int nw=n*(n+1)/2;
    double *walsh_avgs=(double*) malloc(sizeof(double)*nw);
 
    unsigned int k=0;
    for(unsigned int i=0; i<n; i++) {
        for(unsigned int j=i; j<n; j++) {
            walsh_avgs[k]=(x[i]+x[j])/2.0;
            k++;
        }
    }
 
    // sort the vector
    qsort(walsh_avgs,nw,sizeof(double),compare);
 
    // median of walsh averages
    double m=walsh_avgs[nw/2];
    if(nw % 2 == 0) {
        m=((walsh_avgs[nw/2] + walsh_avgs[nw/2 - 1]) / 2.0);
    }
 
    // done
    free(walsh_avgs);
    return m;
}

I have also written a MATLAB driver function that allows the aforementioned C code to be executed within MATLAB; see the Software section to download this implementation. An efficient (pure) MATLAB implementation that avoids the use of nested for loops is a little bit tricky. Here is a simple (memory intensive!) implementation that relies on the ndgrid() function. This code works reasonably well as long as the sample size is not too big.

function m=hodglehm_slow(x)                                                         
 
%% Length of vector
x=x(:);         % make sure x is a column vector
n=length(x);
 
%% compute cartesian product
[a,b]=ndgrid(x,x);
 
%% Average of all pairs (upper triangle, including the diagonal)
A=triu((a+b)./2);
 
%% Extract the upper-triangular entries (the Walsh averages)
Ix=triu(true(n));
A=A(Ix);
 
%% Compute median of averages
m=median(A(:));
 
%% done
return;

We can get the standard error of the HL estimator in MATLAB by using the bootstrap:

% Generate 50 observations from a Student-t distribution with 2 dof                  
x=trnd(2,50,1);
 
% Compute standard errors for the mean and HL estimators
std(bootstrp(1000, @(a) mean(a), x))
std(bootstrp(1000, @(a) hodglehm_slow(a), x))
