Archive for category Statistics
MATLAB Bayesian regression software updated
Posted by emakalic in Programming, Statistics on June 23, 2017
We have updated bayesreg, a MATLAB toolbox that implements Bayesian linear and logistic regression with sparsity-inducing priors, to version 1.6. The package now handles logistic regression without the need for MEX files, but substantial speed-ups can be obtained when using the compiled code, so this is recommended. To compile the C++ code, run compile.m from the bayesreg directory within MATLAB. Alternatively, for convenience, the pre-compiled MEX files (MATLAB R2017a) for Windows, Linux and Mac OS X can be downloaded from here. To use these, all you need to do is download them and unzip into the “bayesreg” folder.
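For reference, the compile step from within MATLAB is just:

    cd('bayesreg');   % or wherever the toolbox was unzipped
    compile;          % builds the MEX files from the C++ sources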
Version 1.6 includes the following changes:
- Implemented block sampling of betas for data with large numbers of predictors (options ‘blocksample’ and ‘blocksize’; see the sketch after this list)
- Display the Widely Applicable Information Criterion (WAIC) instead of the DIC in the summary output
- Implemented alternative priors for tau2
- Added an example to demonstrate block sampling and alternative priors for tau2 (see examples\br_example13.m)
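For example, the new block-sampling options are passed to bayesreg as option/value pairs. The following is a minimal sketch only; the model and prior strings follow the toolbox documentation, but the data and the option values here are illustrative assumptions:

    % Sketch: linear regression with a horseshoe prior and block sampling.
    % X (n x p, with p large) and y (n x 1) are assumed to be in the workspace.
    [beta, beta0, retval] = bayesreg(X, y, 'gaussian', 'hs', ...
                                     'nsamples', 1e4, 'burnin', 1e3, ...
                                     'blocksample', true, 'blocksize', 500);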
Version 1.5 includes the following changes:
- Written a new parallelised C++ implementation of sampling code for logistic regression
- Included an efficient pure-MATLAB implementation of the logistic regression sampler; this works even when MEX files are not available, though it is not as fast as the compiled code
Version 1.4 includes the following changes:
- Added option ‘groups’ which allows grouping of variables into potentially overlapping groups
- Groups work with HS, HS+ and lasso priors
- Fixed a regression bug with g priors and logistic models
- Updated examples
Version 1.3 includes the following changes:
- Tidied up the summary display to switch between scientific and normal notation depending on the size of the numbers in the table
- Fixed the summary banner to adjust to the size of the table
- Added support for MATLAB tables
- Code now accepts either X/y as a matrix/vector pair, or a MATLAB table plus the string name of the variable to use as the target
- Added support for categorical predictors
- These are now either specified using the ‘catvars’ option for matrix inputs, or automatically determined from the contents of the table (see the sketch after this list)
- Removed the dependency of br_summary() on X and y
- Consolidated all likelihood functions into a single br_regnlike() function that is vectorized using bsxfun
- Added a br_predict function to produce predictions over data given design matrices
- Checks input design matrix/table against structure of training data
- Automatically handles categorical expansions as required
- Produces credible intervals if requested
- Produces prediction statistics if targets are supplied
- Updated and improved the example scripts
- Added initial internal code structures required for grouping variables and variable expansions
- Fixed a bug in the computation of R2
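As a rough illustration of the v1.3 workflow, the sketch below fits a logistic lasso model with a categorical predictor and then calls br_predict on new data. The option and function names come from the changelog above, but the exact argument lists are assumptions, so check the toolbox help before use:

    % Sketch: X (n x p) and y (n x 1, binary) are assumed to be in the
    % workspace, with column 3 of X categorical; Xnew holds new observations.
    [beta, beta0, retval] = bayesreg(X, y, 'binomial', 'lasso', 'catvars', 3);
    br_summary(beta, beta0, retval);
    pred = br_predict(Xnew, beta, beta0, retval);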
Bayesian penalized regression with continuous shrinkage prior densities
Posted by emakalic in Statistics on December 20, 2016
We have implemented a new MATLAB toolbox for Bayesian regression. The toolbox features Bayesian linear regression with Gaussian or heavy-tailed error models and Bayesian logistic regression with ridge, lasso, horseshoe and horseshoe+ estimators. Two heavy-tailed error models are supported in linear regression: Laplace and Student t distribution errors. The toolbox is very efficient and can be used with high-dimensional data. This toolbox uses the C/C++ PolyaGamma sampler written by Jesse Windle.
Version 1.1 of the MATLAB toolbox can be downloaded from here. Please see “examples_bayesreg.m” for examples on how to use the toolbox, or type “help bayesreg” within MATLAB.
An R version of this toolbox is now available on CRAN. To install the R package, type install.packages("bayesreg") within R.
To cite this toolbox:
- Makalic, E. & Schmidt, D. F.
High-Dimensional Bayesian Regularised Regression with the BayesReg Package
arXiv:1611.06649 [stat.CO], 2016
Updates:
(23/01): Updated the bayesreg code to version 1.2. This version implements Zellner’s g-prior for linear and logistic regression. The g-prior only works with full rank matrices. The examples in “examples_bayesreg.m” have been updated to include a g-prior example.
(17/01): Updated the bayesreg code to version 1.1. The display code is now in a separate “summary()” function so that it can be used many times on the same data and sampling results. We have updated “examples_bayesreg.m” to include examples of the new “summary()” function.
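A minimal sketch of that pattern (the model and prior strings follow the toolbox examples, but treat the exact summary() signature as an assumption):

    % Sketch: sample once, then summarise the results as often as needed.
    [beta, beta0, retval] = bayesreg(X, y, 'gaussian', 'hs', 'nsamples', 1e4);
    summary(beta, beta0, retval);   % display the summary table
    summary(beta, beta0, retval);   % e.g. again later, without re-sampling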
Fisher Information in Single Factor Analysis
Posted by emakalic in Linear algebra, Statistics on February 7, 2012
The single factor analysis model for data $\mathbf{y}_1, \ldots, \mathbf{y}_n \in \mathbb{R}^K$ is

$$\mathbf{y}_i = \boldsymbol{\mu} + v_i \mathbf{a} + \boldsymbol{\varepsilon}_i, \qquad i = 1, \ldots, n,$$

where $v_i \in \mathbb{R}$ is the factor score for datum $i$, and $\boldsymbol{\varepsilon}_i$ is an i.i.d. zero-mean multivariate Gaussian random variable with a symmetric, positive definite variance-covariance matrix $\boldsymbol{\Sigma}$; that is, $\boldsymbol{\varepsilon}_i \sim N_K(\mathbf{0}, \boldsymbol{\Sigma})$ for all $i = 1, \ldots, n$. It is thus assumed that each data vector $\mathbf{y}_i$ follows a $K$-variate Gaussian distribution with moments

$$E[\mathbf{y}_i] = \boldsymbol{\mu} + v_i \mathbf{a}, \qquad \mathrm{Var}(\mathbf{y}_i) = \boldsymbol{\Sigma}.$$

The aim of single factor analysis is to estimate the unknown factor scores $\mathbf{v} = (v_1, \ldots, v_n)'$, the mean $\boldsymbol{\mu}$, the variance-covariance matrix $\boldsymbol{\Sigma}$ and the factor loadings $\mathbf{a}$.
I’ve written this note to show how one may derive the Fisher information for the single factor model. This turns out to be a fairly involved task requiring a lot of linear algebra and matrix differential calculus identities. I’ve uploaded a document describing all the necessary mathematical steps to the Publications section. Anyone interested in matrix differential calculus and linear algebra should find it useful.
Computing p-values for large Z scores in MATLAB
Posted by emakalic in Statistics on November 29, 2011
I recently found a little trick for computing two-sided p-values in MATLAB when given a large Z score. The obvious approach is to use the normcdf(.) function with something like this
    z = 1.96;
    2*(1 - normcdf(z))
However, the above only works for relatively small z, roughly z < 8: in double precision, normcdf(z) rounds to exactly 1 once the upper tail probability falls below machine precision, so 1 - normcdf(z) returns 0. In order to get p-values for z as large as 30 (and beyond), one should instead run
    z = 1.96;
    2*normcdf(-abs(z))
For those using R, the above can be accomplished by running:
    z <- 1.96
    2*pnorm(z, lower.tail=FALSE)
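To see the difference, compare the two approaches at z = 30 in MATLAB:

    z = 30;
    2*(1 - normcdf(z))     % returns 0, since normcdf(30) has rounded to 1
    2*normcdf(-abs(z))     % returns the correct p-value, about 9.8e-198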
Computing the Hodges–Lehmann estimator
Posted by emakalic in Statistics on June 17, 2011
The Hodges–Lehmann estimator (HL) of the location parameter for a sample of data $x_1, \ldots, x_n$ is defined as the median of the averages of all $n(n+1)/2$ data pairs; that is,

$$\hat{\theta} = \operatorname{median} \left\{ \frac{x_i + x_j}{2} : 1 \le i \le j \le n \right\}.$$

This set of all pairwise averages is usually referred to as the Walsh averages. The Hodges-Lehmann estimator has some excellent efficiency and robustness properties when the observed data are generated by a symmetric distribution.
Below is a simple C function to compute the HL estimator for a given array of observations.
    #include <stdlib.h>

    // Comparison function needed for qsort
    int compare(const void *a, const void *b)
    {
        const double *da = (const double *) a;
        const double *db = (const double *) b;
        return (*da > *db) - (*da < *db);
    }

    // Function to compute the Hodges-Lehmann estimator
    double hodgeslehmann(double *x, int n)
    {
        // store and compute the Walsh averages
        unsigned int nw = n*(n+1)/2;
        double *walsh_avgs = (double *) malloc(sizeof(double)*nw);

        unsigned int k = 0;
        for(int i = 0; i < n; i++) {
            for(int j = i; j < n; j++) {
                walsh_avgs[k] = (x[i] + x[j]) / 2.0;
                k++;
            }
        }

        // sort the Walsh averages
        qsort(walsh_avgs, nw, sizeof(double), compare);

        // median of the Walsh averages
        double m = walsh_avgs[nw/2];
        if(nw % 2 == 0) {
            m = (walsh_avgs[nw/2] + walsh_avgs[nw/2 - 1]) / 2.0;
        }

        // done
        free(walsh_avgs);
        return m;
    }
I have also written a MATLAB driver function that allows the aforementioned C code to be executed within MATLAB; see the Software section to download this implementation. An efficient (pure) MATLAB implementation that avoids the use of nested for loops is a little bit tricky. Here is a simple (memory intensive!) implementation that relies on the ndgrid(.) function. This code works reasonably well as long as the sample size is not too big.
    function m = hodglehm_slow(x)

    %% Length of vector
    x = x(:);        % make sure x is a column vector
    n = length(x);

    %% Compute the Cartesian product of x with itself
    [a, b] = ndgrid(x, x);

    %% Averages of all pairs
    A = (a + b) ./ 2;

    %% Extract the upper triangle (including the diagonal)
    Ix = triu(true(n));
    A = A(Ix);

    %% Compute the median of the Walsh averages
    m = median(A(:));

    %% done
    return;
We can get the standard error of the HL estimator in MATLAB by using the bootstrap:
    % Generate 50 observations from a Student-t distribution with 2 dof
    x = trnd(2, 50, 1);

    % Compute bootstrap standard errors for the mean and HL estimators
    std(bootstrp(1000, @(a) mean(a), x))
    std(bootstrp(1000, @(a) hodglehm_slow(a), x))
Fisher information for irregular likelihood functions
Posted by emakalic in Statistics on February 15, 2011
I recently came across the following problem: how do you work out the Fisher information for a parameter $\theta$ when the likelihood function, say $p(x|\theta)$, is continuous, but not everywhere differentiable with respect to the parameter. The standard formula for the Fisher information

$$J(\theta) = -E\left[ \frac{\partial^2 \log p(x|\theta)}{\partial \theta^2} \right]$$

assumes regularity conditions which no longer hold, and hence is not applicable. After hunting around with Google for some time, I came across the following (freely downloadable) paper
H. E. Daniels
The Asymptotic Efficiency of a Maximum Likelihood Estimator
Fourth Berkeley Symp. on Math. Statist. and Prob., University of California Press, 1961, 1, 151-163
which, fortunately, had exactly what I needed. It turns out that we can still compute the Fisher information, without requiring the existence of second derivatives, using

$$J(\theta) = E\left[ \left( \frac{\partial \log p(x|\theta)}{\partial \theta} \right)^2 \right],$$

provided a set of weaker conditions holds. To get my head around the issue, I decided to look at the simple problem of working out the Fisher information for the mean of a Laplace distribution with density

$$p(x|\mu, \tau) = \frac{1}{2\tau} \exp\left( -\frac{|x - \mu|}{\tau} \right), \qquad \tau > 0.$$

The log-likelihood is now given by

$$\log p(x|\mu, \tau) = -\log(2\tau) - \frac{|x - \mu|}{\tau},$$

which is not differentiable with respect to $\mu$ at $x = \mu$. Where it exists, the first derivative with respect to $\mu$ is $\mathrm{sgn}(x - \mu)/\tau$, and since $\mathrm{sgn}(x - \mu)^2 = 1$ almost surely, the Fisher information for the mean is $J(\mu) = 1/\tau^2$. The Fisher information for the scale parameter can be obtained in a similar manner. The first derivative with respect to $\tau$ is

$$\frac{\partial \log p(x|\mu, \tau)}{\partial \tau} = \frac{|x - \mu|}{\tau^2} - \frac{1}{\tau}.$$

Since

$$E|x - \mu| = \tau \quad \text{and} \quad E\left[ (x - \mu)^2 \right] = 2\tau^2,$$

the Fisher information for the scale parameter is

$$J(\tau) = E\left[ \left( \frac{|x - \mu|}{\tau^2} - \frac{1}{\tau} \right)^2 \right] = \frac{2\tau^2}{\tau^4} - \frac{2\tau}{\tau^3} + \frac{1}{\tau^2} = \frac{1}{\tau^2}.$$
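As a quick numerical sanity check, the last expectation can be approximated by Monte Carlo in MATLAB; the sketch below draws Laplace samples using the standard inverse-CDF method:

    % Check that E[(d/dtau log p)^2] is close to 1/tau^2 for the Laplace scale
    mu = 0; tau = 2; N = 1e6;
    u = rand(N,1) - 0.5;                        % uniform on (-1/2, 1/2)
    x = mu - tau*sign(u).*log(1 - 2*abs(u));    % x ~ Laplace(mu, tau)
    score = abs(x - mu)/tau^2 - 1/tau;          % derivative of log-likelihood wrt tau
    mean(score.^2)                              % should be close to 1/tau^2 = 0.25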
The paper by Daniels details the exact conditions needed for the Fisher information to be derived in this way. Note that there is a mistake in one of the proofs; the correction is detailed in
J. A. Williamson
A Note on the Proof by H. E. Daniels of the Asymptotic Efficiency of a Maximum Likelihood Estimator
Biometrika, 1984, 71, 651-653
Unfortunately, the Williamson paper requires a JSTOR subscription for download.
The blog expands
Posted by emakalic in Research Grants, Statistics on January 25, 2011
I’ve started writing a new page for the blog that examines the theory behind various common statistical methods and models. The idea here is to include references and brief descriptions of some of the important and interesting papers on various topics. I also intend to include web links to software implementations whenever possible. The new material can be found by clicking on the “Statistics Theory and Applications” tab above. I am currently working on the linear regression section and the LASSO regularisation method for linear regression.
In other news, the grant writing season has started again. This year, we aim to write one early career researcher ARC grant and at least one NHMRC grant. The ARC grant will be a two-year proposal for work on modern logistic regression techniques. We are in the process of determining which NHMRC grant(s) will be written and which researchers will be involved. I can only hope that this grant season is as successful as last year's!
Penalized Logistic Regression
Posted by emakalic in Statistics on October 9, 2010
Last week I gave a seminar at my workplace on the advantages of using penalized logistic regression methods (such as the LASSO, elastic net, etc.) over the standard maximum likelihood approach. The target audience was genetic epidemiologists who have some practical knowledge of fitting logistic models but may not be aware of the recent theoretical work in the area. The slides from the seminar are now available from the Publications page.
Statistical Analysis Questions
Posted by emakalic in Interesting URLs, Statistics on July 30, 2010
The other day I came across a new Q&A site for statistical analysis called Statistical Analysis Questions. I am not sure how long the site has been running, but there are already about 200 questions, 750 answers and more than 600 users. I recommend checking it out; it seems quite useful for anyone involved in applied or theoretical statistics work.