What is the best practice when preprocessing microarray data using a detection filter (on scanner p-value)?
Suppose I have a microarray dataset that I have to normalize with Loess and correct with ComBat. When should I apply a detection filter, relative to the other steps in the process?
In my experimental design, I have two pipelines to test:
- Normalize per sample between 0 - 1
- Loess normalization between samples
- Loess normalization between samples
Is there any best practice for the timing of applying a detection filter?
My comment above still stands: provided your detection filter doesn't rely on your data being normalized, it can go anywhere in the pipeline.
However, since it seems you already know that you want to run a PCA, and you know all the normalizations you'd like to try, it is simplest to run all of your normalization first, rather than having to filter out high-scoring vectors that appear to be caused by inter-sample variation.
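To illustrate why the position of the filter does not matter in that case, here is a minimal sketch, assuming the scanner exports one detection p-value per probe and sample. The function names, the 0.05 cut-off and the "detected in at least half of the samples" rule are all hypothetical choices, not something from your pipeline. The mask is computed from the p-values only, so it comes out the same whether you apply it before or after Loess and ComBat.

```python
def detection_mask(pvals, alpha=0.05, min_fraction=0.5):
    """Return one boolean per probe: True if the probe is detected
    (p < alpha) in at least min_fraction of the samples.
    alpha and min_fraction are assumed values, tune them to your platform."""
    mask = []
    for row in pvals:
        detected = sum(1 for p in row if p < alpha)
        mask.append(detected / len(row) >= min_fraction)
    return mask


def apply_mask(matrix, mask):
    """Keep only the rows (probes) flagged True in the mask."""
    return [row for row, keep in zip(matrix, mask) if keep]


# Hypothetical data: 4 probes x 3 samples.
pvals = [
    [0.001, 0.020, 0.010],
    [0.300, 0.600, 0.040],
    [0.040, 0.200, 0.030],
    [0.900, 0.800, 0.700],
]
intensities = [
    [120.0, 150.0, 130.0],
    [15.0, 12.0, 40.0],
    [60.0, 20.0, 70.0],
    [8.0, 9.0, 7.0],
]

mask = detection_mask(pvals)              # depends on p-values only
filtered = apply_mask(intensities, mask)  # works on raw or normalized values
```

If your filter instead thresholds on the intensities themselves, this argument no longer holds and the order needs to be fixed explicitly.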
Normalization and quantification of differential expression in gene expression microarrays
Christine Steinhoff is a postdoctoral scientist in the Computational Molecular Biology Department at the Max Planck Institute for Molecular Genetics in Berlin. Her research interest focuses on epigenetic gene regulatory mechanisms especially based on gene expression experimental approaches.
Martin Vingron is the Director at the Max Planck Institute for Molecular Genetics in Berlin and Head of the Department for Computational Molecular Biology. His current research interest lies in utilizing gene expression data as well as evolutionary data for the elucidation of gene regulatory mechanisms.
Christine Steinhoff, Martin Vingron, Normalization and quantification of differential expression in gene expression microarrays, Briefings in Bioinformatics, Volume 7, Issue 2, June 2006, Pages 166–177, https://doi.org/10.1093/bib/bbl002
Genomic data integration is a key goal to be achieved towards large-scale genomic data analysis. This process is very challenging due to the diverse sources of information resulting from genomics experiments. In this work, we review methods designed to combine genomic data recorded from microarray gene expression (MAGE) experiments. It has been acknowledged that the main source of variation between different MAGE datasets is due to the so-called ‘batch effects’. The methods reviewed here perform data integration by removing (or more precisely attempting to remove) the unwanted variation associated with batch effects. They are presented in a unified framework together with a wide range of evaluation tools, which are mandatory in assessing the efficiency and the quality of the data integration process. We provide a systematic description of the MAGE data integration methodology together with some basic recommendation to help the users in choosing the appropriate tools to integrate MAGE data for large-scale analysis and also how to evaluate them from different perspectives in order to quantify their efficiency. All genomic data used in this study for illustration purposes were retrieved from InSilicoDB http://insilico.ulb.ac.be.
The Human Transcriptome
Matthias E. Futschik PhD, Christine Sers PhD, in Molecular Pathology (Second Edition), 2018
Bioinformatics I—Basic Processing of Microarray and RNA-seq Data
Finding meaningful structures and information in an ocean of numerical values obtained in transcriptome experiments is a formidable task and demands various approaches of data processing and analysis. Although the type of data analysis naturally depends on the research questions posed and the chosen technical platform, common first steps are data preprocessing and normalization in order to derive quantities and comparable measures for gene expression (Fig. 7.4). Subsequently, these measures are merged in a so-called gene expression matrix, which is basically a table with rows corresponding to specific transcripts and columns corresponding to samples. The constructed matrix holds two types of different expression profiles in a compact form. The set of expression values of the different genes measured in a sample constitutes the expression profile of the sample. Likewise, the expression of a gene across the different samples constitutes the expression profile of this gene. Thus, the columns of the gene expression matrix provide the profiles of the samples, while the rows provide the profiles of the genes. This matrix can then be scrutinized for the detection of genes with significant fold-changes in expression, clustering and classification of expression profiles of samples or genes, and functional profiling. In all these tasks, visualization of data plays an important role for quality control and knowledge discovery. It should be noted that the early analysis steps can influence subsequent examination. For example, the choice of preprocessing and normalization procedures can have considerable impact on the results of clustering and classification.
Figure 7.4 . Bioinformatics workflow for transcriptomic analysis using microarrays or RNA-seq technology.
Whereas data from both technologies require distinct preprocessing, higher level analyses can be carried by similar or even the same approaches.
Microarray Data Preprocessing
The first preprocessing step for microarray data is commonly the logarithmic transformation of signal ratios. In this way, fold-changes of the same order of magnitude become symmetrical around zero for upregulation (increased signal abundance) and downregulation (decreased signal abundance). For example, using log2 transformation, a positive or negative fold-change of two is displayed as 1 or −1, respectively. The spot intensities are usually more equally distributed along the scale, which enables an easier detection of intensity bias or saturation effects (Fig. 7.5). In addition, the variance of intensities tends to be more homogeneous with respect to a log intensity scale compared to a linear one. A homogeneous variance is often required for statistical tests.
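The symmetry argument can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the choice of Cy5 as numerator are assumptions, and the intensities are made up.

```python
import math


def log2_ratio(cy5, cy3):
    """Log2 fold-change of the Cy5 signal relative to the Cy3 signal.
    Which channel is the numerator is an assumption of this sketch."""
    return math.log2(cy5 / cy3)


# A two-fold increase and a two-fold decrease (hypothetical intensities).
up_linear = 200.0 / 100.0    # 2.0 on the linear scale
down_linear = 100.0 / 200.0  # 0.5 on the linear scale, not symmetric to 2.0

up = log2_ratio(200.0, 100.0)    # 1.0
down = log2_ratio(100.0, 200.0)  # -1.0
```

On the linear scale the two changes sit at 2.0 and 0.5, at unequal distances from the no-change value of 1, while on the log2 scale they are mirror images around zero.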
Figure 7.5 . Plot representations for signal intensities of a two-color array comparing colorectal cancer cell lines derived from primary carcinoma (labeled by Cy3) and from a metastasis (labeled by Cy5).
The spot intensities in both fluorescence channels are shown using linear (A) and log2-scale (B). The use of log2-scale reveals nonlinear behavior, i.e., a dye bias towards Cy3 for low intensity spots. The MA-plot presents this dye bias even more clearly and also a saturation effect in the Cy5 channel for large intensities. (C) To correct the dye bias, a local regression (red line) of M can be performed (D). The obtained residuals of the local regression, i.e., normalized logged fold-changes are well balanced around zero in MA plot.
Raw microarray data are often compromised by systematic errors such as differences in detection efficiencies, dye labeling, and fluorescence yields. Such signals are corrected by normalization procedures. Depending on the experimental design and microarray technique applied, two main normalization schemes are used: (1) between-array normalization to compare signal intensities between different microarrays and (2) within-array normalization for adjustment of signals of a single microarray. While between-array normalization is commonly used for Affymetrix chip technology, within-array normalization is applied mainly to two-color arrays for balancing both channels. The simple global normalization, a within-slide procedure, assumes that the majority of assayed genes are not differentially expressed and that the total amount of transcripts consequently remains constant. Therefore, the ratios can be linearly scaled to the same constant median value in both channels. Alternatively, a set of house-keeping genes can be selected, which are thought to be equally expressed in both samples. The median of these genes can then be taken to adjust the intensity in both channels by a linear transformation, so that the intensity medians of the house-keeping genes are the same. The popular so-called quantile normalization should be treated with care, since it assumes that the overall distribution of expression values is exactly the same across different samples, which might frequently not be the case, especially in the analysis of cancer specimens. If a dye bias is suspected in two-color arrays, the use of an intensity-dependent normalization procedure might be justified. A widespread method is to perform a local regression of the logged signal ratios M with respect to the logged intensities A and to subtract the regressed ratios from the raw ratios. The derived residuals of the regression provide the normalized fold-changes (Fig. 7.5C).
Additional normalization procedures are required, if measured spot intensity ratios show a spatial bias across the array.
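The intensity-dependent normalization described above can be sketched in plain Python. This is a simplified stand-in for a loess fit, not the implementation used by any particular package: for every spot it fits a tricube-weighted straight line to the nearest neighbours along A and subtracts the fitted value from M. The function names and the `frac` parameter are assumptions made for illustration.

```python
def local_linear_fit(a_values, m_values, frac=0.5):
    """Fitted M at each spot from a tricube-weighted local line in A."""
    n = len(a_values)
    k = min(n, max(3, round(frac * n)))  # neighbours used per local fit
    fitted = []
    for a0 in a_values:
        # (distance, centred A, M) for the k spots closest to a0
        window = sorted(
            (abs(a - a0), a - a0, m) for a, m in zip(a_values, m_values)
        )[:k]
        h = window[-1][0] or 1.0  # bandwidth: distance to the k-th neighbour
        sw = swx = swm = swxx = swxm = 0.0
        for d, x, m in window:
            w = (1.0 - (d / h) ** 3) ** 3  # tricube weight
            sw += w
            swx += w * x
            swm += w * m
            swxx += w * x * x
            swxm += w * x * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swm / sw)  # degenerate window: weighted mean
        else:
            # intercept of the weighted line = fitted value at a0
            fitted.append((swxx * swm - swx * swxm) / denom)
    return fitted


def loess_normalize(a_values, m_values, frac=0.5):
    """Normalized log-ratios: raw M minus the locally regressed M."""
    fitted = local_linear_fit(a_values, m_values, frac)
    return [m - f for m, f in zip(m_values, fitted)]


# Hypothetical spots with a purely intensity-dependent dye bias.
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m_vals = [1.2 - 0.1 * a for a in a_vals]
normalized = loess_normalize(a_vals, m_vals)
```

Because the simulated bias is a straight line in A, the residuals in `normalized` are zero up to rounding error; with real data, spots with genuine differential expression remain as non-zero residuals around the fitted trend.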
Plot representations are simple but very helpful tools to detect artifacts or other trends in microarray data. The most basic plots present the two channel intensities versus each other on a linear or log scale (Fig. 7.5A and B). More recently, MA-plots have become a popular tool for displaying the logged intensity ratio (M) versus the mean logged intensities (A). Although MA-plots basically are only a 45° rotation with a subsequent scaling, they reveal intensity-dependent patterns more clearly than the original plot (Fig. 7.5C).
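A short sketch of the M and A coordinates may help here. It assumes M is the log2 ratio of Cy5 over Cy3 and A is the mean of the two log2 intensities; the channel order, the function name and the intensities are illustrative assumptions.

```python
import math


def ma_values(red, green):
    """Return (M, A) for paired channel intensities.
    M = log2(red) - log2(green); A = mean of the two log2 intensities."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


# Hypothetical spot intensities (powers of two keep the numbers readable).
cy5 = [1024.0, 256.0, 64.0]
cy3 = [256.0, 256.0, 256.0]

m, a = ma_values(cy5, cy3)
# m == [2.0, 0.0, -2.0], a == [9.0, 8.0, 7.0]
```

M is the difference and A the scaled sum of the two log intensities, which is the rotation and scaling of the log-log scatter plot mentioned in the text.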
For new users of RNA-seq technologies, the amount of data to be analyzed can be daunting. In contrast to microarray data analysis, which can be carried out even on portable computers, NGS data analysis typically requires the use of multiple CPUs, sufficient computer memory, and disk space up to terabytes even for a single experiment. Alternatives to in-house computational infrastructures are publicly accessible web-platforms such as Galaxy ( https://usegalaxy.org/ ), or the use of commercial cloud computing. However, the cloud approach requires moving the data across the internet, which often presents a notorious bottleneck given the large file sizes. For researchers who carry out a few studies, it might be advisable to begin with web-tools, and then move to stand-alone tools if the required hardware resources are locally available. An excellent platform especially for follow-up analysis is provided by R/Bioconductor ( http://www.bioconductor.org/ ), which offers numerous add-on packages for specific tasks such as detection of differential expression, functional enrichment analysis, clustering and classification, but also requires basic scripting knowledge.
RNA-seq—Base Calling and Sequencing Quality
Base calling (converting measured intensity data into sequences and assessing the sequencing quality) is usually carried out by algorithms supplied by the vendor of the sequencing platform. The identified sequences and their corresponding quality scores are subsequently stored in files of FASTQ format. The quality of base calling is represented by a so-called Phred score. Sequences or parts of sequences with low Phred scores indicate potential sequence errors and need to be removed. Also, reads need to be assessed for the presence of adapter sequences, which interfere with subsequent analysis.
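As a minimal sketch of how Phred scores are used, the snippet below decodes a FASTQ quality string and trims low-quality bases from the 3′ end. It assumes the common Phred+33 ASCII encoding and the standard relation Q = −10·log10(p); the cut-off of 20 and the simple tail-trimming strategy are illustrative assumptions, not a recommendation from the text.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(c) - offset for c in quality_string]


def error_probability(q):
    """Base-calling error probability implied by Phred score q."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q."""
    scores = phred_scores(qual, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: six high-quality bases followed by two poor ones.
seq = "ACGTACGT"
qual = "IIIIII#!"

scores = phred_scores(qual)                         # [40, 40, 40, 40, 40, 40, 2, 0]
trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual)
p_q20 = error_probability(20)                       # about 1 error in 100 bases
```

Dedicated read-trimming tools additionally handle adapter removal and sliding-window quality checks, which this sketch leaves out.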
RNA-seq—Read Mapping and Transcriptome Reconstruction
To analyze and interpret the reads produced by RNA-seq, their position within a reference sequence must be determined, a process known as alignment or mapping. This is a challenging process not only due to the large number of reads to be aligned, but also due to sequencing errors or mutations in the sequence, which need to be coped with in the alignment process. For mapping of short reads, numerous programs have been developed using different computational strategies. Several of them use the so-called Burrows-Wheeler transformation that was originally developed for file compression. It enables the indexing of large genomes and the use of this index for faster read mapping with reduced computer memory. Alternatively, parts of the reads termed seeds are first mapped to the reference, after which the alignment is extended to the full read. Outputs of the aligners are files in Sequence Alignment/Map (SAM) or Binary Alignment/Map (BAM) format, which present the chromosomal location along with the mapped sequences as text or binary encoding, respectively.
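To give an idea of what the Burrows-Wheeler transformation does, here is a deliberately naive sketch that builds it from sorted rotations and inverts it again. Real aligners construct the transform from a suffix array and combine it with additional index structures; the function names and the `$` end marker are assumptions of this illustration.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, small inputs only).
    The sentinel must not occur in text and must sort before all its characters."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("banana")            # "annb$aa"
decoded = inverse_bwt(encoded)     # "banana"
read_roundtrip = inverse_bwt(bwt("GATTACA"))
```

The transform groups identical characters that precede similar contexts, which is what makes it useful both for compression and for compact genome indexes.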
Ideally, one would like to use the transcriptome as reference (align the reads directly to the transcriptome). However, in practice reads are aligned to the genome, as complete transcriptomes are not (yet) available. This procedure adds a layer of complexity for the sequencing of eukaryotic RNA, as many genes undergo splicing. The removal of introns leads to transcript sequences that do not correspond to a continuous stretch on the genome, but are composed of sequences from distant exons. To reconstruct the exon structure of genes, alignment programs take reads that could not be aligned in their full length to the genome and either map them to known or predicted splice junctions (locations where two exons join), or split them and map the different read parts to different exons. Basically, reads overlapping the 5′ end sequence of one exon and the 3′ end sequence of another indicate that the two exons were spliced together. Based on the number of reads aligned to the exons and splice junctions, we can seek to quantify the different splice isoforms, although this task has remained difficult and requires sufficient sequencing depth.
To enable comparison of gene expression within a sample or across different samples, a summarization step and a normalization step need to be carried out. Summarization provides the strength of gene expression, given all the reads mapped to its chromosomal region. For this quantification, the mapped reads are counted and divided by the gene length, as we expect longer genes to lead to more fragments and those to more reads, even if the transcript abundance stays the same. To enable comparison of RNA-seq runs with different numbers of total reads, a further normalization step is carried out. In the simplest version of normalization, this is achieved through an additional division by the total number of mapped reads, producing RPKM (reads per kilo-base of exon model per million reads) values, as the number of reads mapped to a gene should be proportional to the total number of reads produced. Alternatively, other normalization procedures can be chosen, which for instance seek to keep the expression of house-keeping genes constant or minimize the overall fold-change between samples.
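The RPKM calculation described above amounts to one line of arithmetic. The sketch below spells it out; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.
    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp exon model, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)  # 25.0

# The same gene in a run sequenced twice as deep yields twice the reads
# but the same RPKM, which is the point of the normalization.
deeper = rpkm(1000, 2000, 20_000_000)
```

Both calls return the same value, showing that the division by library size removes the dependence on sequencing depth in this idealized case.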
RNA-seq—Data Visualization and Inspection of Read Mapping
For visual display of the mapping of reads to the reference sequence, various software tools such as the Integrative Genomics Viewer have been developed. As input, they use SAM or BAM files as well as available gene annotation. They help to inspect the coverage of specific genes or to discover genetic alterations. For instance, RNA-seq data can offer as a byproduct the accurate identification of single nucleotide polymorphisms (SNPs) in regions with high read coverage.
The EMG signal carries valuable information regarding the nervous system. The aim of this paper was therefore to give brief information about EMG and to present the various methodologies used to analyze the signal. Techniques for EMG signal detection, decomposition, processing and classification were discussed along with their advantages and disadvantages. Discovery of a problem or disadvantage in one method leads to other improved methods. This study sets out the various types of EMG signal analysis techniques so that the right methods can be applied during any clinical diagnosis, biomedical research, hardware implementation and end-user application.
DASC is an effective method to identify hidden batch effects in large consortium datasets. Our method uses data-adaptive shrinkage to get the appropriate estimate of ‘batch-free’ data. The output of DASC is more stable and robust due to the use of consensus matrix and data-adaptive shrinkage method.
From the case study of SEQC dataset, DASC outperforms all the other algorithms compared in this study based on purity and entropy measurement. From the second case study, DASC identified a strong batch effect missed by the original study, which verifies the effectiveness of our method and importance of batch correction. In a scRNA-Seq study, DASC outperformed existing methods in detecting day-to-day sequencing variations.
Moreover, we showed that the results of DASC are independent of data distribution assumptions, in contrast to PCA and sva. Altogether, DASC is a general and flexible algorithm for detecting unknown batch effects. It can be generalized to other omics datasets as well.
Using raw GC/MS data as the X-block for chemometric modeling has the potential to provide better classification models for complex samples when compared to using the total ion current (TIC), extracted ion chromatograms/profiles (EIC/EIP), or integrated peak tables. However, the abundance of raw GC/MS data necessitates some form of data reduction/feature selection to remove the variables containing primarily noise from the data set. Several algorithms for feature selection exist; however, due to the extreme number of variables (10^6–10^8 variables per chromatogram), the feature selection time can be prolonged and computationally expensive. Herein, we present a new prefilter for automated data reduction of GC/MS data prior to feature selection. This tool, termed unique ion filter (UIF), is a module that can be added after chromatographic alignment and prior to any subsequent feature selection algorithm. The UIF objectively reduces the number of irrelevant or redundant variables in raw GC/MS data, while preserving potentially relevant analytical information. In the m/z dimension, data are reduced from a full spectrum to a handful of unique ions for each chromatographic peak. In the time dimension, data are reduced to only a handful of scans around each peak apex. UIF was applied to a data set of GC/MS data for a variety of gasoline samples to be classified using partial least-squares discriminant analysis (PLS-DA) according to octane rating. It was also applied to a series of chromatograms from casework fire debris analysis to be classified on the basis of whether or not signatures of gasoline were detected. By reducing the overall population of candidate variables subjected to subsequent variable selection, the UIF reduced the total feature selection time for which a perfect classification of all validation data was achieved from 373 to 9 min (98% reduction in computing time).
Additionally, the significant reduction in included variables resulted in a concomitant reduction in noise, improving overall model quality. A minimum of two um/z and scan window of three about the peak apex could provide enough information about each peak for the successful PLS-DA modeling of the data as 100% model prediction accuracy was achieved. It is also shown that the application of UIF does not alter the underlying chemical information in the data.
Special thanks to Leander Dony, who debugged, updated, and tested the case study to work with the latest methods. Furthermore, we would like to thank the many people who proofread the case study notebook and the manuscript and improved it with their comments and expertise. For this, we acknowledge the input of Maren Buttner, David Fischer, Alex Wolf, Lukas Simon, Luis Ospina-Forero, Sophie Tritschler, Niklas Koehler, Goekcen Eraslan, Benjamin Schubert, Meromit Singer, Dana Pe'er, and Rahul Satija. Special thanks for this also to the anonymous reviewers of the manuscript and the editor, Thomas Lemberger, for their thorough, constructive, and extensive comments. The case study notebook was tested and improved by the early adopters Marius Lange, Hananeh Aliee, Subarna Palit and Lisa Thiergart. Volker Bergen and Alex Wolf also contributed to the workflow by making scanpy adaptations. The choice of dataset to optimally show all aspects of the analysis workflow was facilitated by the kind input from Adam Haber and Aviv Regev. This work was supported by the BMBF grant# 01IS18036A and grant# 01IS18053A, by the German Research Foundation (DFG) within the Collaborative Research Centre 1243, Subproject A17, by the Helmholtz Association (Incubator grant sparse2big, grant # ZT-I-0007) and by the Chan Zuckerberg Initiative DAF (advised fund of Silicon Valley Community Foundation, 182835).
In microarray experiments, randomly missing values may occur due to scratches on the chip, spotting errors, dust, or hybridization errors. Other nonrandom missing values may be biological in nature, for example, probes with low intensity values or intensity values that may exceed a readable threshold. These missing values will create incomplete gene expression matrices where the rows refer to genes and the columns refer to samples. These incomplete expression matrices will make it difficult for researchers to perform downstream analyses such as differential expression inference, clustering or dimension reduction methods (e.g., principal components analysis), or multidimensional scaling. Hence, it is critical to understand the nature of the missing values and to choose an accurate method to impute the missing values.
There have been several methods put forth to impute missing data in microarray experiments. In one of the first papers related to microarrays, Troyanskaya et al. examine several methods of imputing missing data and ultimately suggest a k-nearest neighbors approach. Researchers also explored applying previously developed schemes for microarrays such as the nonlinear iterative partial least squares (NIPALS) as discussed by Wold. A Bayesian approach for missing data in gene expression microarrays is provided by Oba et al. Other approaches such as that of Bø et al. suggest using least squares methods to estimate the missing values in microarray data, while Kim et al. suggest using a local least squares imputation. A Gaussian mixture method for imputing missing data is proposed by Ouyang et al.
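To make the k-nearest neighbors idea concrete, the following is a simplified sketch and not a reimplementation of any of the published methods cited above. It fills each missing entry with the unweighted mean of that column taken over the k genes whose observed values are closest; the function name, the distance definition, the use of `None` for missing entries and the choice k = 2 are assumptions made for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries using the k most similar rows (genes).
    Similarity is the root-mean-square difference over columns
    observed in both rows; neighbours must have the target column."""
    completed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for i2, other in enumerate(matrix):
                if i2 == i or other[j] is None:
                    continue
                shared = [(x, y) for x, y in zip(row, other)
                          if x is not None and y is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((x - y) ** 2 for x, y in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort()
            nearest = candidates[:k]
            if nearest:  # leave the gap if no usable neighbour exists
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


# Hypothetical expression matrix: rows are genes, columns are samples.
expr = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(expr, k=2)
```

Here the missing value of the first gene is estimated from the second and third genes, whose profiles are closest, while the distant fourth gene is ignored.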
While many of these approaches can be generally applied to different types of gene expression arrays, we will focus on applying these methods to Affymetrix gene expression arrays, one of the most popular arrays in scientific research. Naturally, when proposing a new imputation scheme for expression arrays, it is necessary to compare the new method against existing methods. Several excellent papers have compared missing data procedures on high throughput data platforms such as two-dimensional gel electrophoresis, as in Miecznikowski et al.'s works, or gene expression arrays. Before studying missing data imputation schemes in Affymetrix gene expression arrays, it is reasonable to first remove any existing missing values. In this way, we ensure that any subsequent missing values have known true values. A detection call algorithm is used to filter and remove missing expression values based on absent/present calls. Subsequently, a preprocessing scheme is employed. There are numerous tasks to perform in preprocessing Affymetrix arrays, including background adjustment, normalization, and summarization. A good overview of the methods available for preprocessing is provided by Gentleman et al. For our analysis, the detection call employs MAS 5.0 to obtain expression values; thus, we also use the MAS 5.0 suite of functions as our preprocessing method.
For our analysis, we focus on the microarray quality control (MAQC) datasets (Accession no. GSE5350), where the datasets have been specifically designed to address the points of strength and weakness of various microarray analysis methods. The MAQC datasets were designed by the US Food and Drug Administration to provide quality control (QC) tools to the microarray community to avoid procedural failures. The project aimed to develop guidelines for microarray data analysis by providing the public with large reference datasets along with readily accessible reference ribonucleic acid (RNA) samples. Another purpose of this project was to establish QC metrics and thresholds for objectively assessing the performance achievable by various microarray platforms. These datasets were designed to evaluate the advantages and disadvantages of various data analysis methods.
The initial results from the MAQC project were published in Shi's work and later in Chen et al.'s work and Shi et al.'s work. Specifically, the MAQC experimental design for the Affymetrix gene expression HG-U133 Plus 2.0 GeneChip includes 6 different test sites, 4 pools per site, and 5 replicates per pool, for a total of 120 arrays (see Section 2). This rich dataset provides an ideal setting for evaluating imputation methods on Affymetrix expression arrays. While this dataset has been mined to determine inter- and intra-platform reproducibility of measurements, to our knowledge, no one has studied imputation methods on this dataset.
The MAQC dataset hybridizes two RNA sample types: Universal Human Reference RNA (UHRR) from Stratagene and a Human Brain Reference RNA (HBRR) from Ambion. These 2 reference samples and varying mixtures of these samples constitute the 4 different pools included in the MAQC dataset. By using various mixtures of UHRR and HBRR, this dataset is designed to study technical variations present in this technology. By technical variations, we are referring to the variability between preparations and labeling of sample, variability between hybridization of the same sample to different arrays, testing site variability, and variability between the signal on replicate features of the same array. Meanwhile, biological variability refers to variability between individuals in a population and is independent of the microarray process itself. Because the MAQC dataset was designed to study technical variation, we can examine the accuracy of the imputation procedures without the confounding feature of biological variability. Other than MAQC datasets, similar technical datasets have been used to evaluate different analysis methods specific to Affymetrix microarrays, for example, methods for identifying differentially expressed genes.
In summary, our analysis examines cutting edge imputation schemes on an Affymetrix technical dataset with minimal biological variation. Section 2 discusses the MAQC dataset and the proposed imputation schemes. Meanwhile, Section 3 describes the results from applying the imputation methods for addressing missingness in the MAQC datasets. Finally, we conclude our paper with a discussion and conclusion in Sections 4 and 5.