Biostatistics Advance Access originally published online on April 7, 2006
Biostatistics 2007 8(1):86-100; doi:10.1093/biostatistics/kxj035
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Regularized linear discriminant analysis and its application in microarrays
Department of Statistics, Stanford University, Stanford, CA 94305, USA yaqiang{at}stanford.edu
Department of Statistics, Stanford University, Stanford, CA 94305, USA
Department of Health Research and Policy, Redwood Building, Room T101C, Stanford University, Stanford, CA 94305, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003
Keywords: Classification; Discriminant analysis; Microarray; Prediction analysis of microarrays (PAM); Regularization; Shrunken centriods
| 1. INTRODUCTION |
|---|
|
|
|---|
Discriminant analysis (DA) is widely used in classification problems. The traditional way of doing DA was introduced by R. Fisher, known as the linear discriminant analysis (LDA). For the convenience, we first describe the general setup of this method so that we can follow the notation used here throughout this paper.
Suppose there are G different populations, each assumed to have a multivariate normal distribution with a common covariance matrix
of dimension
and mean vectors
(
), both of which are assumed known for the time being. Now, suppose we have a random sample of n observations from these populations with their true group labels being unknown. The question is how to correctly identify the population from which each observation comes. To be more explicit, let
be observations from population 1,
from population 2, and so on. Thus,
|
|
The idea of LDA is to classify observation
to a population
which minimizes
, i.e.
|
|
Under the above multivariate normal assumptions, this is equivalent to finding the population that maximizes the likelihood of the observation. More often, people have some prior knowledge as to the proportion of each population. For example, let
be the proportion of population g such that
. Then, instead of maximizing the likelihood, we maximize the posterior probability the observation belongs to a particular group, i.e.
|
|
The linearity of this DA method comes from the assumption of common covariance matrix, which simplifies the above criterion as
|
| (1.1) |
where
|
|
is the so-called discriminant function.
In reality, both
and
are unknown and therefore need to be estimated from the sample. Almost always, one uses the maximum likelihood estimates for these parameters,
![]() |
where X is a
matrix with columns corresponding to the observations and
is a matrix of the same dimensions with each column corresponding to the sample mean vector of the population that column belongs to. Therefore, in a more practical form, the sample version discriminant function is usually used in the LDA,
|
| (1.2) |
When the assumption of common covariance matrix is not satisfied, one uses an individual covariance matrix for each group and this leads to the so-called quadratic discriminant analysis (QDA) as the discriminating boundaries are quadratic curves. There is also an intermediate method between LDA and QDA, which is a regularized version of discriminant analysis (RDA) proposed by Friedman (1989)
. However, the regularization used in that method is different from the one we will propose here. A detailed source about the LDA, QDA, and Friedman's RDA methods can be found in the book by Hastie and others (2001)
. As we can see, the concept of DA certainly embraces a broader scope. But in this paper, our main focus will be solely the LDA part and henceforth the term DA will stand for LDA unless otherwise emphasized.
This paper is arranged as follows. In Section 2, we will first discuss in detail our version of regularization in DA, its statistical properties, and some computational issues (Section 2.1). Then, we introduce the "shrunken centroids regularized discriminant analysis" (SCRDA) method based on this regularization (Section 2.2). In Section 3, we compare our SCRDA method against other classification approaches through several publicly available real life microarray data sets. We also discuss an important issue about how to choose the optimal parameter pairs
for our methods (Section 3.4). Section 4 is devoted to a simulation study, where we generate data sets under different scenarios to evaluate the performance of our SCRDA method. In Section 5, we briefly discuss the feature selection property of SCRDA method. Section 6 is the discussion.
| 2. SHRUNKEN CENTROIDS RDA |
|---|
|
|
|---|
LDA is straightforward in the cases where the number of observations is greater than the dimensionality of each observation, i.e.
. In addition to being easy to apply, it also has nice properties, like robustness to deviations from model assumptions and almost-"Bayes" optimality. However, it becomes a serious challenge to use this method in the microarray analysis settings, where
is always the case. There are two major concerns here. First, the sample covariance matrix estimate is singular and cannot be inverted. Although we may use the generalized inverse instead, the estimate will be very unstable due to lack of observations. Actually, the performance of LDA in high-dimensional situations is far from being optimal (Dipillo, 1976), (, dipillo2). Second, high dimensionality makes direct matrix operation formidable, hence hindering the applicability of this method. Therefore, we will make some changes to the original LDA to overcome these problems. First, to resolve the singularity problem, instead of using
directly, we use
|
| (2.1) |
for some
,
. Some other forms of regularization on
can be
|
| (2.2) |
|
| (2.3) |
for some
,
. It is easy to see that if we ignore the prior constant, the three forms of regularization are equivalent in terms of the discriminant function. In this paper, the form (2.1) will be applied in all our computational examples due to its convenience to use.
The formulations above are not entirely new and actually have been frequently seen in situations like the ridge regression (Hoerl and Kennard, 1970
), where the correlation between predictors is high. By introducing a slightly biased covariance estimate, we not only resolve the singularity problem but also stabilize the sample covariance estimate. For example, the discriminant function (2.5) below is the main formula that we will be using in this paper. It utilizes the regularization form (2.1). Figures 1 and 2
show how the discriminant function (2.5) behaves in a two-class DA setup, where data are generated according to our model assumptions in Section 1. As we can see, regularization both stabilizes the variance and reduces the bias of the discriminant function (2.5). And as a result, the prediction accuracy is improved. In Figure 1, the identity covariance matrix is used to generate the data. From the plot, we can see that the optimal regularization parameter
is very close to 0, which means the optimal regularized covariance matrix tends to look like the true identity covariance matrix, regardless the sample size and data dimension. On the other hand, in Figure 2, an auto-regressive covariance structure (4.2) is used and the optimal
now lies somewhere between 0 and 1. Furthermore, the sample size and data dimension now have effect on choosing the optimal regularization parameter
. When the sample size is greater than the data dimension as is the case in the two top plots of Figure 2, the sample covariance matrix can be accurately estimated and since it is also unbiased for the true population covariance matrix, it is more favored in this case. When sample size is less than data dimension as is the case in the two bottom plots of Figure 2, the sample covariance matrix estimate will be highly variable. Although it is still unbiased for the true covariance matrix, a biased estimator will be favored due to the bias-variance trade-off. Hence, the optimal value of
is shifted toward 0. These plots give indications on the conditions when the regularized DA can work well. We will discuss this point later by using more data examples.
|
|
Another intuitive way of regularization modifies the sample correlation matrix
in the same way,
|
| (2.4) |
where
is the diagonal matrix taking the diagonal elements of
. Then, we compute the regularized sample covariance matrix by
. In this paper, we will consider both cases and their performance will be compared. Now, having introduced the regularized covariance matrix, we can define the corresponding regularized discriminant function as
|
| (2.5) |
where the
can be from either (2.1) or (2.4).
Our next goal is to facilitate the computation of this new discriminant function. We have addressed the issue that direct matrix manipulation is impractical in microarray settings. But if we employ the singular value decomposition (SVD) trick to compute the matrix inversion, we can get around this trouble. This enables a very efficient way of computing the discriminant function and reduces the computation complexity from the order of
to
, which will be a significant saving when
. For more details about the algorithm, please refer to Hastie and others (2001)
.
In this section, we define our new method SCRDA based on the regularized discriminant function (2.5) in Section 2.1. The idea of this method is similar to the "nearest shrunken centroids" (NSC) method (Tibshirani and others, 2003
), which we will describe briefly first. In microarray analysis, a widely accepted assumption is that most genes do not have differential expression level among different classes. In reality, the differences we observe are mostly due to random fluctuation. The NSC method removes the noisy information from such fluctuation by setting a soft threshold. This will effectively eliminate many non-contributing genes and leave us with a small subset of genes for scientific interpretation and further analysis. In the NSC method, the group centroids of each gene are shrunken individually. This is based on the assumption that genes are independent of each other, which however, for most of the time is not totally valid. Note that after shrinking the group centroids of a particular gene g, they compute the following gene-specific score for an observation
:
|
| (2.6) |
where
is the gth component of the
vector
,
is the shrunken centroid of group k for gene g, and
is the pooled standard deviation of gene g. Then
is classified to group k if k minimizes the sum of the scores over all genes (if the prior information is available, a term
should be included.), i.e.
![]() |
which is also equivalent to
|
|
given
. This is similar to the discriminant function (2.5) except that we replace
with the diagonal matrix
and the centroid vector
with the shrunken centroid vector
. Therefore, a direct modification in the regularized discriminant function (2.5) to incorporate the idea of the NSC method is to shrink the centroids in (2.5) before calculating the discriminant score, i.e.
|
| (2.7) |
However, in addition to shrinking the centroids directly, there are also two other possibilities. One is to shrink
, i.e.
|
| (2.8) |
and the other is to shrink
, i.e.
|
| (2.9) |
Although, it has been shown in our numerical analysis that all three shrinking methods have very good classification performance, only (2.8) will be the main focus of this paper as it also possesses the feature elimination property, which we will discuss later. Hence, we will refer to the DA resulted from (2.8) as SCRDA without differentiating whether
comes from (2.1) or (2.4). We will say more specifically which method is actually used when such a distinction is necessary.
| 3. COMPARISON BASED ON REAL MICROARRAY DATA |
|---|
|
|
|---|
In this section, we will investigate how well our new method works on some real data sets. A few existing classification methods are used as standards for comparison. The first two methods are the penalized logistic regression (PLR) and the support vector machines (SVM), both via univariate ranking (UR) and recursive feature elimination (RFE). They are proposed by Zhu and Hastie (2004)
The Tamayo data set (Ramaswamy and others, 2001
; Zhu and Hastie, 2004) is divided into a training subset, which contains 144 samples and a test subset of 54 samples. They consist of totally 14 different types of cancers and the number of genes in each array is 16 063. Since there are two tuning parameters in the SCRDA method, i.e. the regularization parameter
and the shrinkage parameter
, we choose the optimal pairs
for
and
using cross-validation on the training samples. And then we calculate the test error based on the tuning parameter pairs we chose and compare it with the results from Zhu and Hastie (2004)
. The results are summarized in Table 1. Based on how the covariance matrix is regularized in (2.8), two different forms of SCRDA are considered. In the table, we use "SCRDA" to denote the one from regularization on the covariance matrix and "SCRDA
" for the regularization on the correlation matrix. We can see that SCRDA clearly dominates PAM and slightly outperforms the last four methods in the table. Meanwhile, it also does a good job on selecting informative gene subset.
|
The Golub data (Golub and others, 1999
; Zhu and Hastie, 2004) consist of 38 training samples and 34 test samples from two cancer classes. The number of genes on each array is 7129. As there are only two groups to predict, this data set is much easier to analyze than the Tamayo data. The classification performance is generally impressive for most methods such that the difference among them is almost negligible. The result is summarized in Table 2.
|
The Brown data set, similar to the Tamayo data, is also a complex cancer data set. It contains a large number of samples (
) and classes (
, 14 cancer types and one normal type). The number of genes in the arrays is smaller (
) than the Tamayo data (
). As we can see (Table 3), the SCRDA method dominates PAM by a large margin and is as good as SVM.
|
,
) in SCRDAIn this section, we discuss the issue on how to choose the optimal tuning parameters. We have mentioned briefly that cross-validation should be used in determining the parameters. However, in practice, this process can be somewhat confusing. The main problem is that there are many possible tuning parameter pairs giving the same cross-validation error rate. Yet, the test error rate based on them may vary differently. Therefore, how to choose the best parameter pairs is an essential issue in evaluating the performance of the SCRDA method. Therefore, we will suggest two rules for choosing the parameters based on our experience.
First let us take a look at how the classification errors and the number of genes remained are distributed across the varying scopes of the tuning parameters
. The two plots in Figure 3 show the cross-validation error and test error given by the SCRDA method for the Tamayo data.
is chosen to lie between 0 and 1 while
between 0 and 3. Figure 4 shows the number of genes remained for the same range of the parameters. In all three plots, the stars correspond to the parameter pairs that yield the minimal cross-validation error. The round dots correspond to the 1 standard error boundary around the minimal cross-validation error points. The diamond dot in the test error plot (Figure 3, right) corresponds to the parameter pair that gives the minimal test error in the whole parameter space.
|
|
The most significant pattern we can observe in Figure 4 is the decreasing gradient approximately along the 45° diagonal line, i.e. the number of genes remained decreases as
increases or
decreases. This makes sense by intuition and has been consistently observed for all the real data and simulation data we have worked on. On the other hand, the distribution of the classification errors (both CV and test) as in Figure 3 does not have such a clearly consistent pattern. They may change dramatically from one data set to another. Further, as it is not possible to establish a unified correspondence between the distribution map of the classification error and the number of genes remained, we need to consider two distribution maps at the same time to achieve improved classification performance using a reasonably reduced subset of genes. We hence suggest the following two rules for identifying the optimal parameter pair(s)
: the "Min-Min" rule and the "One Standard Error" rule. The "Min-Min" rule
- First, find all the pairs
that correspond to the minimal cross-validation error from training samples.
- Select the pair or (pairs) that use the minimal number of genes.
The "One Standard Error" rule
- First, identify all the pairs
that correspond to the minimal cross-validation error from the training samples.
- Find the one standard error boundary points
.
- On the boundary, find the pair
that gives the smallest number of genes remained.
If there is more than one optimal pair, it is recommended to calculate the averaged test error based on all the pairs chosen as we did in this paper. There is no theoretical justification yet why these two rules are suggested. But from our empirical experience, both methods worked and provided no significant difference in terms of classification accuracy. For consistency, we have been using the "Min-Min" rule throughout this paper.
| 4. SIMULATION STUDY |
|---|
|
|
|---|
It was encouraging to see how our new method performs on some real microarray data sets. In this section, we investigate the performance of the SCRDA method in a more controlled manner. We deliberately construct three different simulation setups to study the conditions under which our SCRDA method would work well. Particularly, we choose the PAM method as our competitor.
The first setup is the simplest. There are two classes of multivariate normal distributions:
and
, each of dimension
. The true covariance structure is the independence structure, i.e.
. Also, for simplicity, all components of
are assumed to be 0 and for
, the first 100 components are
while the rest are all 0 as well, i.e.
and
For each class, we generate
training samples and
test samples.
This is not a situation where we see much advantage of the SCRDA method over the PAM method (Table 4). In this situation, all methods produce basically the same results. PAM seems to be even slightly better than the SCRDA method. However, it is hard to declare PAM as a clear winner in this case as the margin of betterment is still within the range of error fluctuation. There are two reasons. First, the number of classes is only two, the simplest case in all classification problems. As people are aware of, it is much easier for most classification methods to work well in the two-group classification problems and often it is hard to really observe the advantage of one method over another, e.g. the Golub data example. Second, the data is generated from the covariance structure of identity matrix. This suits exactly the assumption in the PAM method to make it work well. As we can see in the next two examples, when the number of classes increases, especially when data is more correlated, the SCRDA method will start to show true advantage over the PAM method.
|
The second simulation setup is slightly more complicated than the first one. We generate a multiple groups (
) classification scenario. Again each class is assumed to have distribution
with dimension of
.
is still assumed to be
as in setup I. The components of each mean vector
is assumed to be all 0 except for
components, which are set to be
. The positions of the non-zero components are selected randomly for each mean vector and they do not overlap. Again a total of
training samples and
test samples are generated with equal probabilities for each class. This time, we start to observe noticable differences among these methods (Table 5). Particularly, the SCRDA method starts to outperform the PAM method as we would expect.
|
In the last case, we produce a scenario that more resembles the real microarray data. The simulation structure is as follows. We again consider a two-group classification problem as in setup I. Two distributions are still
and
with
.
is assumed to be the same as in setup I while
is slightly different.
is no longer the identity matrix. Instead, we assume the following block diagonal structure:
![]() | (4.1) |
with each diagonal block being the following auto-regressive structure:
![]() | (4.2) |
The block size is
and there are totally
blocks. We assume the auto-correlation within each block is
and we set alternating signs for each block. A total of
training samples and
test samples are generated with half from each class.
This simulation setup does have sound basis in real microarray data. It is common knowledge that genes are networked together in pathways. Although it is true that weak connections between groups may exist, independence between groups is usually a reasonable assumption. Also, within each group, genes are either positively or negatively correlated and due to their relative distance in the regulatory pathway, the further apart two genes, the less correlation between them. These are exactly the reasons why we use the above simulation model. From the results in Table 6, we can clearly see that the SCRDA method outperforms PAM by a significant margin. Considering this is only a two-group classification problem mimicking the real microarray data, we should expect that the difference will be more significant when the number of classes is large as we have observed for the Tamayo and Brown data.
|
| 5. FEATURE SELECTION BY THE SCRDA METHOD |
|---|
|
|
|---|
Remember that the discriminant function (2.5) is linear in X with coefficients vector
. Now, if the ith element of the coefficients vector is 0 for all
, then it means gene i does not contribute to our classification purpose and hence can be eliminated. Therefore, SCRDA potentially can be used for the gene selection purpose. For example, as shown in Table 7, the number of genes that are truly differentially expressed is 100, 280, and 200, respectively, in the three simulation setups above. Correspondingly, the SCRDA method picks out 240, 785, and 282 genes in each setup. Among these genes, 82, 204, and 138 are truly differentially expressed, respectively. The detection rate is at least
in all situations. However, the false-positive rate is also high, especially when the number of classes is large. For now, there is not a good way to adjust this high false-positive rate. Therefore, SCRDA can be conservatively used as gene selection method.
|
| 6. DISCUSSION |
|---|
|
|
|---|
Through comparisons using both real microarray data sets and simulated data sets, we have shown that the SCRDA method can be a promising classification tool. Particularly, it is consistently better than its sibling method, PAM in many problems. This new method is also very competitive to some other methods, e.g. SVM.
This new method is not only empirically useful in terms of classification performance, it also has some interesting theoretical implications, which we will discuss carefully in a future paper. For example, it can be shown that the regularized discriminant function (2.5) is equivalent to the penalized log-likelihood method and in some special cases, our new method SCRDA can be related to another recently proposed new method called "elastic net" (Zou and Hastie, 2005
). These results are interesting since they not only give different perspectives of statistical methods but also provide new computational approaches. For example, an alternative method for estimating the shrunken regularized centroids other than the way we have proposed in this paper is to solve the solution of the mixed
-
penalty function. This has been made possible as the problem will convert to the usual LASSO (Tibshirani, 1996
) solution. And with the emergence of the new algorithm LARS (Efron and others, 2004
), efficient numerical solution is also available.
As mentioned before, choosing the optimal parameter pairs for the SCRDA method is not as straightforward as in PAM and in some cases, the process can be somewhat tedious. The guidelines given in Section 3.4 work generally well, at least for all the data examples provided in this paper. However, it may require some experience with the SCRDA method to get the best result. Also, the computation in the SCRDA method is not as fast as in PAM due to two reasons. First, we have two parameters
to optimize over a two-dimensional grid rather than the one-dimensional one in PAM. Second, although the SVD algorithm is very efficient, the computation still involves large matrix manipulation in practice, while only vector operations are involved in PAM. On the other hand, as shown in this paper, PAM does not always do a worse job than the SCRDA method. In some situations, e.g. when the number of classes is small or the covariance structure is nearly diagonal, PAM is both accurate in prediction and computationally efficient. Therefore, we recommend using the SCRDA method only when PAM cannot perform well in classification.
Also, the SCRDA method can be used directly for gene selection proposes. As we have seen in Section 5, the selection process of SCRDA is rather conservative, tending to include many more genes unnecessary. But overall speaking, it is not generally worse than PAM. And since it includes most of the genes that are truly differentially expressed, it is a safer way of including the ones we really should detect.
| REFERENCES |
|---|
|
|
|---|
-
Dipillo P. (1976) The application of bias to discriminant analysis. Communication in Statistics Theory and Methodology A5:84354.
Dipillo P. (1977) Further application of bias discriminant analysis. Communication in Statistics Theory and Methodology A6:93343.
Efron B, Hastie T, Johnstone I, Tibshirani R. (2004) Least angle regression. Annals of Statistics 32:40799.[CrossRef][Web of Science]
Friedman J. (1989) Regularized discriminant analysis. Journal of the American Statistical Association 84:16575.[CrossRef]
Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:5316.
Hastie T, Tibshirani R, Friedman J. (2001) The Elements of Statistical Learning(Springer, New York).
Hoerl A and Kennard R. (1970) Ridge regression: biased estimation for non-orthogonal problems. Technometrics 12:5567.
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J. and others. (2001) Multiclass cancer diagnosis using tumor gene expression signature. Proceedings of the National Academy of Sciences of the United States of America 98:1514954.
Tibshirani R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58:26788.
Tibshirani R, Hastie T, Narashimhan B, Chu G. (2003) Class prediction by nearest shrunken centroids with applications to dna microarrays. Statistical Science 18:10417.[CrossRef]
Zhu J and Hastie T. (2004) Classification of gene microarrays by penalized logistic regression. Biostatistics 5:42743.[Abstract]
Zou H and Hastie T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67:30120.[CrossRef]
Received November 1, 2005; revised March 27, 2006; accepted for publication March 29, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
V. Zuber and K. Strimmer Gene ranking and biomarker discovery under correlation Bioinformatics, October 15, 2009; 25(20): 2700 - 2707. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Gertheiss and G. Tutz Supervised feature selection in mass spectrometry-based proteomic profiling by blockwise boosting Bioinformatics, April 15, 2009; 25(8): 1076 - 1077. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Tai and W. Pan Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data Bioinformatics, December 1, 2007; 23(23): 3170 - 3177. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


ratios. The two plots on the right show the prediction error of the discriminant function for the corresponding conditions on the left. The data points are generated from a p-dimensional multivariate normal distribution with
.
= 0.6.




