## Statistics Theory and Applications

This page details some of the theory behind various common statistical methods and models. The material presented here is a work in progress and is constantly evolving. Most of it is not original; instead, I aim to reference and briefly describe important literature on various topics in statistical methodology. Web links to software implementations will be provided whenever possible. I am currently working on the linear regression section and on the various regularisation approaches for linear regression in the large $p$ (number of covariates), small $n$ (sample size) setting.

#### Linear Regression

Linear regression is perhaps the most commonly used statistical model in practice. The standard multiple linear regression model for explaining a vector of observed data ${\bf y} \in \mathbb{R}^n$ can be written as

${\bf y} = {\bf X}_{\gamma} \boldsymbol{\beta} + \boldsymbol{\epsilon}$

where

• $\boldsymbol{\beta}$ is the vector of linear regression coefficients, of length equal to the number of selected regressors $|\gamma|$
• $\boldsymbol{\epsilon} \in \mathbb{R}^n$ is a vector of i.i.d. normal errors, $\boldsymbol{\epsilon} \sim N_n\left({\bf 0}, \tau {\bf I}_n \right)$, where $\tau > 0$ is the noise variance
• ${\bf I}_k$ denotes the $(k \times k)$ identity matrix
• $\gamma \subseteq \left\{ 1, \ldots, p \right\}$ is an index set determining which regressors comprise the design matrix ${\bf X}_\gamma$

The design matrix comprising all regressors is denoted as

${\bf X} = ({\bf x}_1, \ldots, {\bf x}_p)$

where ${\bf x}_i \in \mathbb{R}^n$ and $p$ is the maximum number of candidate regressors. The set $\gamma$ then indexes any possible design matrix that can be derived from the full matrix ${\bf X}$. The aim in linear regression is to estimate the unknown parameters $\boldsymbol{\beta}$ and $\tau$ as well as determine the optimal regressor subset $\gamma$.
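To make the indexing concrete, here is a small Python/NumPy sketch (the variable names `X` and `gamma` mirror the notation above and are purely illustrative) of how an index set $\gamma$ selects a submatrix of the full design matrix:

```python
import numpy as np

# Full design matrix with n = 5 samples and p = 4 candidate regressors
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

# The index set gamma selects which regressors enter the model;
# here gamma = {1, 3} in 1-based notation, i.e. columns 0 and 2
gamma = [0, 2]
X_gamma = X[:, gamma]   # the (n x |gamma|) design matrix X_gamma

print(X_gamma.shape)    # (5, 2)
```

With $p$ candidate regressors there are $2^p$ possible subsets $\gamma$, which is why subset selection becomes difficult for large $p$.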

A popular method for estimating the regression parameters is Fisher's maximum likelihood approach. The idea is to set the regression coefficients to the values that maximise the likelihood of the observed data. In the case of linear regression, the maximum likelihood estimates exist in closed form provided that ${\bf X}_\gamma^\prime {\bf X}_\gamma$ is invertible; this requires (1) $n \geq |\gamma|$, and (2) that no regressor is an exact linear combination of the others. The maximum likelihood estimator can be written as

$\hat{\boldsymbol{\beta}}({\bf y}; {\bf X}_\gamma) = \left( {\bf X}_{\gamma}^\prime {\bf X}_{\gamma} \right)^{-1} {\bf X}_{\gamma}^\prime {\bf y}$

or in MATLAB,

```matlab
% Assuming targets y and regressor matrix X exist,
% estimate coefficients beta by maximum likelihood

% Option 1
beta = inv(X'*X)*X'*y;

% Option 2; much better; more numerically stable and faster
% than calculating the inverse explicitly
beta = X\y;
```
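For readers working outside MATLAB, the same estimate can be sketched in Python/NumPy; `numpy.linalg.lstsq` plays the role of the backslash operator and is similarly preferable to forming the inverse explicitly (the simulated data below is purely illustrative):

```python
import numpy as np

# Simulate some illustrative data: y = X @ beta_true + noise
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Option 1: explicit normal equations (avoid in practice)
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Option 2: numerically stable least-squares solve
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Both options agree to numerical precision
print(np.allclose(beta_normal, beta_lstsq))
```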

The maximum likelihood estimate of the regression parameters has some nice statistical properties. It is an unbiased estimator of the true regression coefficients, and it is strongly consistent provided that

$\left( {\bf X}^\prime {\bf X} \right)^{-1} \to {\bf 0} \quad {\rm as} \; \; n \to \infty$
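A quick numerical check of this condition, sketched in Python/NumPy with simulated i.i.d. regressors (for such regressors ${\bf X}^\prime {\bf X}$ grows roughly like $n \, {\bf I}_p$, so its inverse shrinks towards the zero matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3
max_entries = []
for n in [10, 100, 1000, 10000]:
    X = rng.standard_normal((n, p))
    # Largest-magnitude entry of (X'X)^{-1}; it shrinks roughly
    # like 1/n as the sample size grows
    inv_gram = np.linalg.inv(X.T @ X)
    max_entries.append(np.abs(inv_gram).max())
    print(n, max_entries[-1])
```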

I recommend the paper by Lai et al. [1] for consistency proofs and further results (see References below). However, we cannot use maximum likelihood if:

• the regressors are exactly collinear (so that ${\bf X}^\prime {\bf X}$ is singular), or
• the number of regressors is greater than the number of samples ($p > n$).

Since maximum likelihood does not zero out regressors, we cannot use maximum likelihood alone to select the optimal regressor subset.

Recently, there has been a great deal of interest in regularisation approaches to linear regression. The idea is again to maximise the log-likelihood, but subject to inequality constraints on the regression coefficients. The nature of the constraints determines the properties of the resulting estimates and the type of regularisation. Some commonly used regularisation methods are discussed below.
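As one concrete instance, ridge regression constrains the squared norm of the coefficients and has the closed-form estimate $\hat{\boldsymbol{\beta}}_{\rm ridge} = \left( {\bf X}^\prime {\bf X} + \lambda {\bf I}_p \right)^{-1} {\bf X}^\prime {\bf y}$, which remains computable even when $p > n$. A minimal Python/NumPy sketch (the value $\lambda = 1$ is illustrative only; in practice $\lambda$ is chosen by, e.g., cross-validation):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 50          # more regressors than samples: ML is unavailable
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

lam = 1.0              # illustrative regularisation strength
# Ridge estimate: (X'X + lam*I)^{-1} X'y; adding lam*I makes the
# matrix invertible even though X'X is rank-deficient when p > n
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(beta_ridge.shape)   # (50,)
```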

References

[1] T. L. Lai, H. Robbins and C. Z. Wei, "Strong consistency of least squares estimates in multiple regression", Proceedings of the National Academy of Sciences of the United States of America, vol. 75, pp. 3034–3036, 1978.