Ever since I started working in biostatistics, I keep coming across terms that at first glance appear new but, upon further investigation, turn out to correspond to well-known statistical definitions. A typical example of this terminology issue is the ‘genomic inflation factor’ (GIF), which crops up in genome-wide association (GWA) studies and is computed by many commonly used GWA software packages. Briefly, a GWA study is an analysis of association between genetic markers and a disease or trait of interest. The outcome of a GWA study is a list of p-values, one per marker, which are used to rank the markers by significance of association and, subsequently, to guide further biological investigation. In a GWA study, the GIF corresponds to the slope of the univariate regression line fitted to a quantile-quantile (QQ) plot where
(1) the x-axis denotes the negative base-10 logarithm of the expected order statistics of a Uniform(0,1) distribution, and
(2) the y-axis denotes the negative base-10 logarithm of the observed order statistics of the p-values (that is, the sorted p-values).
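Written out, and assuming the regression line is forced through the origin (an assumption made here to match the no-intercept computation in the MATLAB function later in this post), the slope described by (1) and (2) is

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}, \qquad x_i = -\log_{10} e_i, \quad y_i = -\log_{10} p_{(i)},$$

where $e_i$ is the expected i-th order statistic of a Uniform(0,1) sample of size n and $p_{(i)}$ is the i-th smallest observed p-value. The symbols $\hat{\lambda}$, $e_i$ and $p_{(i)}$ are notation introduced here for illustration only. If the observed p-values behave like a Uniform(0,1) sample, the points lie close to the line y = x and the slope is close to 1.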
The GIF is used during the quality control process of a GWA study to identify poor quality genetic markers; for example, in the biostatistics literature a GIF significantly larger than 1.0 is frowned upon and is considered to indicate poor quality data.
So how do we calculate the GIF for a GWA study? In MATLAB, we could use the qqplot() function to generate a QQ plot; however, that does not leave us with an easy way of computing the GIF. Instead, we will draw the QQ plot and compute the GIF from scratch. The first set of statistics we need are the expected order statistics of a Uniform(0,1) distribution. The i-th order statistic of a sample of size n from the Uniform(0,1) distribution follows a Beta distribution with parameters i and (n-i+1). The expected values are then

$$E[U_{(i)}] = \frac{i}{n+1}, \quad i = 1, \ldots, n,$$

which are clearly easy to compute. A simple MATLAB function to display a GWA study QQ plot and compute the GIF is:
```matlab
function gif = compute_gif(pvals)
% COMPUTE_GIF  Display a QQ plot of GWA p-values and compute the GIF.

% number of p-values
n = length(pvals);

% expected order statistics of Uniform(0,1)
es = (1:n)' ./ (n+1);

% x-axis (expected) and y-axis (observed), both on the -log10 scale
x = -log10(es);
y = -log10(sort(pvals(:)));

% compute GIF: slope of the regression line through the origin
gif = (x'*y) / (x'*x);

% QQ plot
figure; hold on; grid on;
% axis limit covering both expected and (finite) observed values
maxh = ceil(max([x; y(isfinite(y))]));
xlim([0 maxh]); ylim([0 maxh]);
plot(x, y, 'bx');
plot(x, x, 'r-');
xlabel('-log10(Expected Order Statistics)');
ylabel('-log10(Observed Order Statistics)');
title('QQ Plot');

end
```
The function compute_gif() takes a single argument, a list of p-values, sorts the p-values, and uses the sorted list to display a QQ plot and compute the GIF. Pretty simple, isn’t it?

As a final note, the GIF is just one example of terminology reincarnation in biostatistics. Other examples include linkage disequilibrium (statistical dependence between alleles at different loci) and the transmission disequilibrium test (an application of McNemar’s nonparametric test).
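As a quick sanity check of the GIF computation described earlier, the following minimal sketch repeats the slope calculation on simulated data, without the plotting. It assumes that null p-values are Uniform(0,1), and it mimics inflation crudely by raising the p-values to the power 1.2; the marker count, the seed, the exponent and the variable names gif_null and gif_infl are all hypothetical choices made for illustration, not part of any GWA package.

```matlab
% Sanity check (illustrative sketch): under the null hypothesis the
% p-values are Uniform(0,1), so the no-intercept slope should be close to 1.
rng(1);                          % fix the seed so the run is repeatable
n = 1e5;                         % hypothetical number of markers
pvals = rand(n, 1);              % simulated null p-values

es = (1:n)' ./ (n + 1);          % expected order statistics of Uniform(0,1)
x  = -log10(es);
y  = -log10(sort(pvals));
gif_null = (x' * y) / (x' * x);  % slope of the regression through the origin

% Crude inflation, for illustration only: p -> p^1.2.
% Because -log10(p^1.2) = 1.2 * (-log10(p)), the slope scales by 1.2.
y_infl   = -log10(sort(pvals .^ 1.2));
gif_infl = (x' * y_infl) / (x' * x);

fprintf('GIF (null):     %.3f\n', gif_null);
fprintf('GIF (inflated): %.3f\n', gif_infl);
```

The first value should come out close to 1, and the second should be 1.2 times the first, which shows how a systematic shift of the p-values towards zero pushes the GIF above 1.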