Why use log2
On average, the foreground intensity of a spot with a fixed specific hybridization increases by one unit when its local background intensity increases by one unit. As we showed previously, local background intensities depend on spatial localization and are independent of their corresponding foreground intensities. It can therefore be inferred that the background noise has an additive effect on foreground intensities. In this study, we addressed the problem of background correction and transformation in spotted microarray data.
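In symbols, one way to write the additive model implied here (the notation is assumed for illustration, not taken from the original article):

```latex
F_i = S_i + B_i
\quad\Longrightarrow\quad
\widehat{S}_i = F_i - B_i
```

where F_i is the measured foreground intensity of spot i, S_i the signal due to specific hybridization, and B_i the local background intensity; background subtraction estimates S_i by removing the additive background term.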
We compared the features of eight preprocessing methods combining four background correction methods with two transformations. We first compared the correlations between gold-standard fold-changes obtained from quantitative PCR and fold-changes obtained from microarray data. The best correlations were obtained with the Edwards and the Standard background corrections coupled with the log2 transformation. The lowest correlations were obtained with the No background correction method and with all preprocessing methods using the glog transformation.
These results were explained by plotting lowess curves of the fold-change compression as a function of the average processed intensity. While all preprocessing methods produced low fold-change compression at high processed intensities, they differed markedly in fold-change compression at low processed intensities. Accordingly, fold-change compression was minimized using either the Standard or the Edwards background correction with the log2 transformation.
Using a glog transformation led to high fold-change compression regardless of the background correction method. Note that product-moment correlation coefficients are affected by the fold-change compression because this effect is highly dependent on the average processed intensity. A constant fold-change compression across the whole range of processed intensities would indeed have an impact on intraclass correlation coefficients but no impact on the product-moment correlation coefficients.
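To illustrate why the glog transformation compresses fold-changes at low intensities, here is a minimal sketch; one common form of the generalized log is used, and the offset c and the example intensities are assumed values, not taken from the study:

```python
import numpy as np

def glog2(x, c=50.0):
    """Generalized log (base 2): behaves like log2(x) for x >> c,
    but stays finite and roughly linear near zero."""
    return np.log2((x + np.sqrt(x**2 + c**2)) / 2.0)

# A true 4-fold change at low and at high background-corrected intensity.
low_a, low_b = 20.0, 80.0          # low-intensity pair (4-fold)
high_a, high_b = 2000.0, 8000.0    # high-intensity pair (4-fold)

for name, f in [("log2", np.log2), ("glog", glog2)]:
    print(name,
          "low-intensity FC:", round(f(low_b) - f(low_a), 2),
          "high-intensity FC:", round(f(high_b) - f(high_a), 2))
# With log2, both pairs give a log fold-change of 2 (i.e. 4-fold).
# With glog, the high-intensity pair is still ~2, but the low-intensity
# pair is compressed toward zero -- the compression described above.
```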
These results complement those published in previous studies which reported that microarray data exhibit fold-change compression [5, 18, 29], including the studies of Han et al. and Ritchie et al. While average biases were only estimated on 9 LUS control probes in Ritchie's study, fold-change compression in our study was computed from probes on both the GE Healthcare and the Eppendorf platforms.
Moreover, in Ritchie's study, compression factors were only available for 2 of the 9 LUS control probes; for these, most background correction methods produced fold-change compression, whereas the VSN method (equivalent to the Standard background correction followed by a glog transformation) surprisingly produced fold-change expansion.
In the current study, all preprocessing methods produced fold-change compression. Furthermore, the compression mainly affected low-intensity data, an effect that can be minimized by combining the Standard or Edwards background correction with a log2 transformation.
The observed differences between these current results and those of Ritchie et al. may be explained by the type of array used (one-color versus two-color). In one-color arrays, the background noise caused by non-specific hybridization and deposits may differ between the target and control spots used to quantify the expression fold-change.
In two-color microarrays, the control and target samples are both hybridized on the same array, so the signals due to non-specific hybridization and deposits are consequently more alike. In microarray class comparison studies, effect sizes and p-values are computed by dividing the log2 fold-changes by an estimate of variability. The combination of the Edwards or Standard method with the log2 transformation produced low fold-change compression but extremely high variance at low processed intensities.
Conversely, the combination of the Edwards or Standard method with the glog transformation produced high fold-change compression but good variance stabilization at low processed intensities. The impact of fold-change compression and variance stabilization on p-value estimation was assessed by computing the correlations between the cumulative Gaussian quantiles of the gold-standard p-values obtained from quantitative PCR and the cumulative Gaussian quantiles of the p-values obtained from microarray data.
Compared to the log2 transformation, the glog transformation, which effectively stabilizes the variance across the whole range of processed intensities, generally produced higher intraclass correlations and comparable product-moment correlations. These results are in line with those obtained by Ritchie et al. and also agree with those of Cui et al. While the No background correction method is sometimes recommended in the literature [3, 22] because it decreases the variance at low processed intensities, our results show that the combination of the Edwards or Standard background correction with a glog transformation represents a better alternative for p-value computation.
Furthermore, we also recommend subtracting the background, as we confirmed in this study the additive effect of background noise on foreground intensity values. Historically, the first method used to identify differentially expressed genes was based on the fold-change [2, 29]. A change of at least two-fold up or down was generally considered meaningful. Because this method did not take the variance of gene expression into account, it was replaced by statistical inference methods and p-values.
P-values are nowadays used to rank genes according to their likelihood of differential expression. Nevertheless, the fold-change remains an important feature because it is generally accepted that the greater the magnitude of change, the higher the likelihood of physiological or pathological significance [29]. In the context of class comparison, we therefore recommend combining the Edwards correction with a hybrid transformation method that uses the log2 transformation to estimate fold-change magnitudes and the glog transformation to estimate p-values.
This hybrid method was compared to the log2 and to the glog transformation and was found to lead to the lowest number of incorrect decisions. Although comparable to the Standard method, the Edwards method is preferable because it avoids the occurrence of missing values even when combined with a log2 transformation.
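A minimal sketch of such a hybrid approach is given below; the function and variable names, the glog constant, and the replicate values are assumed for illustration and this is not the authors' implementation. Fold-changes are read off the log2 scale, while the group comparison is performed on the glog scale:

```python
import numpy as np
from scipy import stats

def glog2(x, c=50.0):
    """Generalized log (base 2); c is an assumed tuning constant."""
    return np.log2((x + np.sqrt(x**2 + c**2)) / 2.0)

def hybrid_summary(group_a, group_b):
    """group_a, group_b: background-corrected intensities (replicates)
    for one gene in the two classes being compared."""
    group_a = np.asarray(group_a, float)
    group_b = np.asarray(group_b, float)
    # Fold-change magnitude estimated on the log2 scale (less compression).
    log2_fc = np.mean(np.log2(group_b)) - np.mean(np.log2(group_a))
    # P-value estimated on the glog scale (better variance stabilization).
    _, p_value = stats.ttest_ind(glog2(group_b), glog2(group_a))
    return log2_fc, p_value

# Example with made-up replicate intensities for a single gene:
fc, p = hybrid_summary([120, 150, 135], [480, 520, 610])
print(f"log2 fold-change = {fc:.2f}, p-value = {p:.3g}")
```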
Moreover, when microarrays are used in the context of class prediction, the most important feature is the stability of the variance across the whole range of processed intensities.
In this context (see also Parson et al.), we therefore recommend using the Standard or Edwards background correction with a glog transformation in order to stabilize the variance in this kind of microarray application. As shown here, the choice of preprocessing steps should therefore not only be based on the type of microarray platform but also on the type of application.

Authors' contributions: JA performed the microarray data analysis presented in this paper and participated in the compilation of the publication.
AR and BG supervised the statistical analysis component of the work and assisted with review of the manuscript. All authors read and approved the final manuscript.
Abstract

Background: The standard approach for preprocessing spotted microarray data is to subtract the local background intensity from the spot foreground intensity, to perform a log2 transformation, and to normalize the data with a global median or a lowess normalization.
Results: In this study, we assessed the impact of eight preprocessing methods, combining four background correction methods and two transformations (the log2 and the glog), by using data from the MAQC study.

Conclusion: As both fold-change magnitudes and p-values are important in the context of microarray class comparison studies, we recommend combining the Edwards correction with a hybrid transformation method that uses the log2 transformation to estimate fold-change magnitudes and the glog transformation to estimate p-values.
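As a concrete illustration of the standard pipeline mentioned in the Background above, here is a minimal sketch; the intensity values are made up, and only the global median variant of normalization is shown:

```python
import numpy as np

# Assumed toy data: foreground and background intensities for two arrays.
foreground = np.array([[520.0, 1800.0, 90.0, 4100.0],
                       [480.0, 2100.0, 70.0, 3900.0]])
background = np.array([[60.0, 70.0, 55.0, 80.0],
                       [50.0, 65.0, 45.0, 75.0]])

# 1) Standard background correction: subtract the local background.
corrected = foreground - background

# 2) log2 transformation (non-positive corrected values would become
#    NaN/-inf; that is the missing-value problem the Edwards method avoids).
log2_intensity = np.log2(corrected)

# 3) Global median normalization: subtract each array's median so that
#    all arrays share the same median log2 intensity.
normalized = log2_intensity - np.median(log2_intensity, axis=1, keepdims=True)

print(normalized)
```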
A short description of the background correction methods and transformations appears below:

Standard: We refer to this method when background intensities are subtracted from foreground intensities.
No background: We refer to this method when the background intensities are not subtracted.

Edwards: In this method, the background intensities are subtracted if the difference between foreground and background is bigger than a pre-specified small threshold value.
Normexp: The Normexp method is based on the normal plus exponential convolution model [18].

Table 2: Correlation between Microarray and Taqman fold-changes.
Table 3: Correlation between cumulative Gaussian quantiles of p-values obtained with Taqman and Microarray.
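For concreteness, here is a minimal sketch of the Standard, No background, and an Edwards-style correction described above. The threshold value and the smooth substitute used below the threshold are assumptions made for this sketch, not Edwards' published formula, and Normexp (which requires fitting the convolution model) is omitted:

```python
import numpy as np

def correct_standard(fg, bg):
    """Standard: subtract the local background (can yield non-positive values)."""
    return fg - bg

def correct_none(fg, bg):
    """No background: return the foreground unchanged."""
    return fg

def correct_edwards_like(fg, bg, delta=16.0):
    """Edwards-style correction (simplified): plain subtraction when
    fg - bg exceeds a small threshold delta; below it, a smooth positive
    substitute is used so that log2 never produces missing values.
    The substitute delta * exp((d - delta) / delta) is an assumed
    placeholder that is continuous at d = delta."""
    d = fg - bg
    return np.where(d > delta, d, delta * np.exp((d - delta) / delta))

fg = np.array([500.0, 90.0, 60.0])
bg = np.array([80.0, 85.0, 70.0])   # note: the last spot has bg > fg
for f in (correct_standard, correct_none, correct_edwards_like):
    print(f.__name__, np.round(f(fg, bg), 2))
# Only the Edwards-style correction keeps all values strictly positive,
# which is why it avoids missing values after a log2 transformation.
```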
Table 4: Comparison of the transformation methods.

References:
Hardiman G. Microarray platforms-comparisons and contrasts.
Fundamentals of cDNA microarray data analysis.
Analysis of cDNA microarray images. Briefings in Bioinformatics.
Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica.
Improvement in the reproducibility and accuracy of DNA microarray quantification by optimizing hybridization conditions. BMC Bioinformatics.
BMC Cancer.
Transformations for cDNA microarray data. Statistical Applications in Genetics and Molecular Biology.
Microarray data normalization and transformation.
A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiological Genomics.
Improved background correction for spotted DNA microarrays. Journal of Computational Biology.
Non-linear normalization and background correction in one-channel cDNA microarray studies.

I would like to understand the basic theory behind the log2 transformation as applied to gene expression data.

From the previous comments you should now realize that gene expression data from different platforms, such as microarray and RNA-seq, have different properties associated with them. Likewise, in mathematics (such as linear algebra), there are also properties associated with different functions, distributions and equations; log bases have different scales, for example base 2 versus base 10. It was determined that the negative binomial distribution best fits count data for testing the hypothesis of differential expression with confidence.
In addition, you can scale and mean-center the count data with a log base 10 transformation for biological network analysis. For microarray data, you can normalize using the RMA method and then run a t-test or another test of your hypothesis.
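A minimal sketch of that last step, assuming you already have an RMA-normalized, log2-scale expression matrix; the expression values and group labels here are made up:

```python
import numpy as np
from scipy import stats

# Assumed toy expression matrix: rows = genes, columns = samples,
# already RMA-normalized (log2 scale).
expr = np.array([[7.1, 7.3, 7.0, 9.2, 9.5, 9.1],   # gene 1
                 [5.4, 5.6, 5.5, 5.5, 5.3, 5.6]])  # gene 2
group = np.array(["ctrl", "ctrl", "ctrl", "case", "case", "case"])

ctrl = expr[:, group == "ctrl"]
case = expr[:, group == "case"]

# Per-gene two-sample t-test across the two classes.
t_stat, p_val = stats.ttest_ind(case, ctrl, axis=1)
log2_fc = case.mean(axis=1) - ctrl.mean(axis=1)

for i, (fc, p) in enumerate(zip(log2_fc, p_val), start=1):
    print(f"gene {i}: log2 FC = {fc:.2f}, p = {p:.3g}")
```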
To circle back: think about what happens when you integrate or differentiate the log2 function, what its derivative is, and what properties those results have that you can apply to data structures with their own properties, as you may have done in calculus. Sorry if this is redundant or hard to follow; I was just trying to sum up years of study in this small box.

There isn't any theoretical reason for using base 2 instead of any other base. One could reasonably use log10 for the fold-changes.
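Indeed, changing the base only rescales the values by a constant factor; for example (standard identities, shown here for illustration):

```latex
\log_2 x = \frac{\ln x}{\ln 2} = \frac{\log_{10} x}{\log_{10} 2},
\qquad
\text{a two-fold change: } \log_2 2 = 1,\;\; \log_{10} 2 \approx 0.301,\;\; \ln 2 \approx 0.693.
```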
Microarray-detectable changes in expression tend to be relatively small fold-changes in my experience. If d is small, then r is approximately equal to d. In words: if there is a small difference d between two natural-log values, you can easily estimate the relative change r between the two original data points, because r is approximately equal to d. But this estimation is not one-size-fits-all: the larger d becomes, the less accurate the approximation is. You may need to show the original scale on another axis for easier comprehension.
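Written out, this is the standard first-order approximation (the numerical values below are illustrative):

```latex
d = \ln b - \ln a
\;\Longrightarrow\;
r = \frac{b - a}{a} = e^{d} - 1 \approx d \quad \text{for small } d.
```

For example, d = 0.05 gives an actual relative change of e^0.05 - 1 ≈ 5.1%, very close to d itself, whereas d = 0.7 gives e^0.7 - 1 ≈ 101%, roughly a doubling, which the linear approximation badly underestimates.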
See the example below.

Figure 4: Plotting data with natural logarithms (figure republished from [1]).

To sum up, the choice of log base depends on the range of your data values.
Under proper application, logarithms improve both the analysis and communication of data remarkably well. While log base 10 is excellent for larger ranges, it can hinder the study of small-range data sets, which can be better explained in log base 2 and natural log. Have we covered everything?
Feel free to discuss with us in the comment box below. The BioTuring Team.