Outlier masking:
- For each probe, look at all probes within the surrounding 5 probe x 5 probe region. If at least 13 of the 25 probes in this region differ from their trimmed replicate mean (the mean of the three middle replicates, excluding the highest and lowest replicates) by more than 10%, consider this probe an outlier.
- Once these outlier probes have been identified, pad them by also including any probes that are within a 5-probe radius, i.e. probes for which sqrt((row distance)^2 + (column distance)^2) < 6.
- Also discard extreme single-probe outliers (points for which the replicate intensity > (1.35 - (mean intensity)*0.000025) * mean intensity, and mean intensity > 200). These extreme outliers are typically caused by bright pieces of debris on the chip that affect only a single probe, and are therefore not padded.
- Intensity values are then calculated for each tag by averaging all unmasked replicates.
Notes: The masking algorithm is also provided in the accompanying MATLAB scripts, which also generate a heatmap of the array showing how each probe compares to its replicates. This view makes array defects clearly visible.
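The masking steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the accompanying MATLAB implementation. The function names are hypothetical. `intensity[r][c]` is assumed to hold the probe value at row r and column c, and `reference[r][c]` the trimmed replicate mean of the tag that probe belongs to. The protocol does not say how to treat probes near the chip edge, so the sketch counts only in-bounds probes and keeps the threshold at 13. It also leaves open which mean the extreme-outlier test uses, so the caller passes it in.

```python
from statistics import mean


def trimmed_replicate_mean(replicates):
    """Mean of the replicates after dropping the single highest and lowest."""
    ordered = sorted(replicates)
    return mean(ordered[1:-1])


def deviates(value, reference, tolerance=0.10):
    """True if value differs from its reference mean by more than 10%."""
    return abs(value - reference) > tolerance * abs(reference)


def region_outliers(intensity, reference, min_bad=13, half_width=2):
    """Flag probes whose 5 x 5 region holds at least min_bad deviating probes.

    Assumption: near the chip edge only in-bounds probes are counted and the
    threshold of 13 is left unchanged.
    """
    rows, cols = len(intensity), len(intensity[0])
    bad = [[deviates(intensity[r][c], reference[r][c]) for c in range(cols)]
           for r in range(rows)]
    flagged = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            count = sum(
                bad[rr][cc]
                for rr in range(max(0, r - half_width), min(rows, r + half_width + 1))
                for cc in range(max(0, c - half_width), min(cols, c + half_width + 1))
            )
            flagged[r][c] = count >= min_bad
    return flagged


def pad_mask(flagged, limit=6):
    """Also mask every probe closer than `limit` probes to a flagged probe."""
    rows, cols = len(flagged), len(flagged[0])
    padded = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not flagged[r][c]:
                continue
            for rr in range(max(0, r - limit), min(rows, r + limit + 1)):
                for cc in range(max(0, c - limit), min(cols, c + limit + 1)):
                    # sqrt(dr^2 + dc^2) < limit, compared without the square root
                    if (rr - r) ** 2 + (cc - c) ** 2 < limit ** 2:
                        padded[rr][cc] = True
    return padded


def is_extreme_outlier(value, mean_intensity):
    """Single-probe debris check; these probes are masked but not padded."""
    threshold = (1.35 - mean_intensity * 0.000025) * mean_intensity
    return mean_intensity > 200 and value > threshold
```

The final mask for a probe would be the padded region mask combined with the single-probe check. Tag intensities are then the average of whatever replicates remain unmasked.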
Array normalization:
- Normalize the uptags and downtags separately: they are amplified separately, and the intensities of the individual PCR reactions affect tag intensity on the array.
- Average all unmasked values for each tag prior to normalizing the data.
- Normalize array data using quantile normalization (ref. 2). To normalize a set of arrays:
- For each set of tags (up and down), rank values obtained from each array in order of increasing intensity.
- For each rank, replace the value of the tag at that rank in each array with the median of all values at that rank.
Notes: This method of normalizing is dependent on a standard curve to which the arrays are normalized (the standard curve used above is the median of the control arrays). To keep this curve from changing over time, it is best to calculate one standard curve from a set of arrays and keep it for normalizing future arrays.
Only experiments with a similar distribution of tag intensities can be normalized together. For example, het and hom experiments must be normalized separately, and experiments with different generation times should also not be normalized together.
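The ranking and median steps above can be sketched as follows, assuming a fixed standard curve built once from the control arrays, as the note recommends. The function names are hypothetical. Every array is assumed to report the same number of tags, and ties are broken by input order, which the protocol does not specify.

```python
from statistics import median


def reference_curve(control_arrays):
    """Median intensity at each rank across the control arrays.

    control_arrays: list of lists, one list of tag intensities per array.
    """
    sorted_arrays = [sorted(values) for values in control_arrays]
    return [median(at_rank) for at_rank in zip(*sorted_arrays)]


def quantile_normalize(values, curve):
    """Replace each value with the standard-curve value at the same rank."""
    # Indices of `values` in order of increasing intensity (stable for ties).
    order = sorted(range(len(values)), key=lambda i: values[i])
    normalized = [0.0] * len(values)
    for rank, index in enumerate(order):
        normalized[index] = curve[rank]
    return normalized
```

Under these assumptions the curve would be computed once per tag type (uptags and downtags separately) and stored. Each new array of the same experiment type is then passed through `quantile_normalize` with the stored curve.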
Non-parametric analysis (best for large-scale studies):
- For each tag, calculate the mean of the controls (µc) and the standard deviation of the controls (σc).
- Compute the 90th percentile of the standard deviations across genes for the uptags (ku), and separately for the downtags (kd).
- For each uptag with treatment intensity t, calculate (µc - t) / (σc + ku).
- For each downtag with treatment intensity t, calculate (µc - t) / (σc + kd).
- For each strain, average all tags that are > 200 afu in the control arrays to obtain a final sensitivity score. If a strain has no useable tags, set the score to zero, indicating no information is available for that strain.
- Strains that are sensitive will have positive scores, while strains that are resistant will have negative scores.
- A cutoff of 3 standard deviations from the mean is stringent: it gives few false positives but suffers from a high false negative rate. The cutoff for sensitivity is therefore up to the user.
Notes: This method works best for large scale studies where it is possible to generate a set of control arrays to use against many treatment sets (> 10 control arrays). Although a large number of control arrays are required, one set of controls can be used to analyze many experimental arrays. One major benefit is that the control arrays do not need to be processed on the same day as the drug arrays to obtain a good result.
One caveat is that the control arrays must represent as diverse a set of samples as the treatment arrays (cells grown on different days, tag PCRs done in different runs, etc.). Otherwise the standard deviation for tags in the control set will be deceptively small, making strains appear more sensitive or resistant than they actually are.
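A minimal sketch of the scoring steps, for one tag type at a time, might look like this. The function names are hypothetical. The sketch assumes the sample standard deviation (n - 1 denominator) and a linear-interpolation percentile, neither of which the protocol specifies. It also reads "> 200 afu in the control arrays" as a control mean above 200.

```python
from statistics import mean, stdev


def percentile(values, pct):
    """Percentile by linear interpolation between ranks (one common convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def score_tags(controls, treatment, pct=90):
    """Score one tag type (uptags or downtags).

    controls:  dict tag -> list of intensities, one per control array
    treatment: dict tag -> intensity on the treatment array
    Returns dict tag -> (control mean, score).
    """
    stats = {tag: (mean(v), stdev(v)) for tag, v in controls.items()}
    # k is the 90th percentile of the control standard deviations (ku or kd).
    k = percentile([sd for _, sd in stats.values()], pct)
    return {tag: (mu, (mu - treatment[tag]) / (sd + k))
            for tag, (mu, sd) in stats.items()}


def strain_score(tag_results, min_control=200.0):
    """Average the scores of a strain's usable tags; 0.0 means no information."""
    usable = [score for mu, score in tag_results if mu > min_control]
    return mean(usable) if usable else 0.0
```

`score_tags` would be called once for the uptags and once for the downtags, so that ku and kd are computed separately. A strain's uptag and downtag results are then passed together to `strain_score`.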
CelCompare (small scale studies):
- Multiply raw tag values by e^(0.00031 × tag intensity) to correct for array saturation.
- For each tag, average the saturation-corrected values for all unmasked replicates.
- Normalize the averaged values using quantile normalization.
- To estimate background hybridization, take the mean intensity of the unassigned tag probes.
- For each tag, take the log2 ratio of (control - background)/(treatment - background).
- For each strain, average all tags that are > 200 afu in the control arrays to obtain a final sensitivity score.
- Strains that are sensitive will have positive scores, while strains that are resistant will have negative scores.
- The score corresponds to the log2 ratio of cells present in the control vs. the treatment sample.
Notes: This method works best for small-scale studies where it would be inconvenient to generate the large number of control arrays required for non-parametric analysis. Because only a small number of control arrays are used, it is best to use control and drug arrays that were processed together (cells grown on the same day, PCR amplified together, etc.) to minimize any variation between the control and drug samples that is not related to the treatment.
Any strains for which the treatment value is indistinguishable from background have reached their maximum sensitivity score, so they may actually be more sensitive than they appear in your data. To resolve the sensitivity of these strains, sample earlier time points or examine the growth of the strain individually as described below.
The data for strains with low representation in the pool is prone to noise due to increased sampling error. One class of strains that is especially prone to this problem is the slow-growing strains (ref. 1), so data from these strains should be carefully confirmed.
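The CelCompare arithmetic can be sketched as follows. The sketch reads the saturation factor as e raised to the power 0.00031 × tag intensity. The function names are hypothetical. The `floor` parameter is an assumption added here so that the ratio stays defined when a value is at or below background; the protocol does not give a value for it. The protocol also does not say what to report for a strain with no usable tags, so the sketch returns None. Replicate averaging and quantile normalization are omitted.

```python
import math
from statistics import mean

SATURATION_RATE = 0.00031  # exponent coefficient from the saturation-correction step


def saturation_correct(raw):
    """Multiply a raw tag value by e^(0.00031 * raw)."""
    return raw * math.exp(SATURATION_RATE * raw)


def background_level(unassigned_probe_values):
    """Background hybridization: mean intensity of the unassigned tag probes."""
    return mean(unassigned_probe_values)


def log2_ratio(control, treatment, background, floor=1.0):
    """log2((control - background) / (treatment - background)).

    Assumption: background-subtracted values are clamped to `floor` (a
    hypothetical choice), so a treatment value at or below background gives
    a large but finite score.
    """
    numerator = max(control - background, floor)
    denominator = max(treatment - background, floor)
    return math.log2(numerator / denominator)


def strain_score(tags, background, min_control=200.0):
    """Average log2 ratio over a strain's tags with control intensity > 200 afu.

    tags: list of (control, treatment) intensity pairs for one strain.
    Returns None when no tag is usable (an assumption of this sketch).
    """
    usable = [log2_ratio(c, t, background) for c, t in tags if c > min_control]
    return mean(usable) if usable else None
```

With these definitions, a score of 2 means the strain is about four times more abundant in the control than in the treatment sample. Scores produced by the clamp mark strains that have reached the maximum measurable sensitivity.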
References
1. Deutschbauer, A.M. et al. Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics (2005).
2. Bolstad, B.M., Irizarry, R.A., Astrand, M. & Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-93 (2003).