
R STRINGdb Bioconductor


I'm using the STRINGdb package from Bioconductor to work with the STRING database of protein-protein interactions (PPI). I'm a newbie and I don't find the package documentation useful. I want to extract the network associated with cancer, for example. I tried to use this function: s <- string_db$get_term_proteins(terms_id, string_ids = NULL, enableIEA = TRUE), but I don't know what the terms_id argument stands for.

Second, is there a simpler way to get the full network using the R interface, like the one given in this URL as an example?

https://string-db.org/cgi/network.pl?taskId=2hQuabKoXe02

The terms_id is meant to be a vector of GO or KEGG IDs that annotate the proteins of interest. As the documentation describes it, the method get_term_proteins:

Returns the proteins annotated to belong to a given term.

My reasoning comes in part from the STRINGdb vignettes, which use terms in a separate set of functions like so:

    > head(enrichmentGO, n=7)
         term_id proteins hits       pvalue  pvalue_fdr
    1 GO:0048545      383   15 2.337593e-07 0.000368020
    2 GO:0006952     1286   28 3.806997e-07 0.000368020
    3 GO:0002252      405   15 4.756829e-07 0.000368020
    4 GO:0010466      234   11 1.767763e-06 0.001025745
    5 GO:0045861      330   12 8.407161e-06 0.003765892
    6 GO:0010951      228   10 9.735179e-06 0.003765892
    7 GO:0002443      141    8 1.224368e-05 0.003839606
      term_description
    1 response to steroid hormone
    2 defense response
    3 immune effector process
    4 negative regulation of peptidase activity
    5 negative regulation of proteolysis
    6 negative regulation of endopeptidase activity
    7 leukocyte mediated immunity

    > head(enrichmentKEGG, n=7)
      term_id proteins hits       pvalue   pvalue_fdr
    1   04115       66    6 9.969204e-08 1.475442e-05
    2   04610       68    5 3.577966e-06 2.647695e-04
    3   05168      174    5 3.254067e-04 1.605340e-02
    4   01100     1161   12 5.277721e-04 1.952757e-02
    5   04380      123    4 8.399948e-04 2.486385e-02
    6   00590       61    3 1.197959e-03 2.638087e-02
    7   05161      139    4 1.322517e-03 2.638087e-02
      term_description
    1 p53 signaling pathway
    2 Complement and coagulation cascades
    3 Herpes simplex infection
    4 Metabolic pathways
    5 Osteoclast differentiation
    6 Arachidonic acid metabolism
    7 Hepatitis B

Usage then follows naturally. For example, if I want the proteins associated with cell growth, I can use c("GO:0016049") as my terms_id to retrieve that list.

    library(STRINGdb)  # load the package

    # create the STRINGdb object; 9606 = human
    string_db <- STRINGdb$new(version="10", species=9606, score_threshold=0)

    # use the new object to query the database;
    # GO:0016049 refers to cell growth, just as an example
    string_db$get_term_proteins(c("GO:0016049"))

Output (annotation column w/ text descriptions removed for brevity):

                   STRING_id    term_id preferred_name protein_size
      1 9606.ENSP00000002829 GO:0016049         SEMA3F          785
      2 9606.ENSP00000211998 GO:0016049            VCL         1134
      3 9606.ENSP00000212355 GO:0016049         TGFBR3          851
      4 9606.ENSP00000216037 GO:0016049           XBP1          261
      5 9606.ENSP00000216911 GO:0016049          AURKA          403
      6 9606.ENSP00000221930 GO:0016049          TGFB1          390
      7 9606.ENSP00000225792 GO:0016049           DDX5          614
      8 9606.ENSP00000230732 GO:0016049         POU4F3          338
      9 9606.ENSP00000238682 GO:0016049          TGFB3          412
     10 9606.ENSP00000239462 GO:0016049            TNN         1299
     11 9606.ENSP00000245479 GO:0016049           SOX9          509
     12 9606.ENSP00000255448 GO:0016049          DCLK1          729
     13 9606.ENSP00000256951 GO:0016049           EMP1          157
     14 9606.ENSP00000258106 GO:0016049           EMX1          290
     15 9606.ENSP00000258743 GO:0016049            IL6          212
     16 9606.ENSP00000261023 GO:0016049          ITGAV         1048
     17 9606.ENSP00000261918 GO:0016049         SEMA7A          666
     18 9606.ENSP00000262017 GO:0016049          ITGB3          788
     19 9606.ENSP00000263713 GO:0016049        EPB41L5          733
     20 9606.ENSP00000263980 GO:0016049         SLC9A1          815
     21 9606.ENSP00000264057 GO:0016049           DGKD         1214
     22 9606.ENSP00000264279 GO:0016049          NOP58          529
     23 9606.ENSP00000265136 GO:0016049           COBL         1261
     24 9606.ENSP00000265362 GO:0016049         SEMA3A          771
     25 9606.ENSP00000265371 GO:0016049           NRP1          923
     26 9606.ENSP00000266058 GO:0016049          SLIT1         1534
     27 9606.ENSP00000268182 GO:0016049         IQGAP1         1657
     28 9606.ENSP00000270221 GO:0016049           EMP3          163
     29 9606.ENSP00000272134 GO:0016049         LEFTY1          366
     30 9606.ENSP00000276416 GO:0016049           BIN3          253
     31 9606.ENSP00000278671 GO:0016049        LAMTOR1          161
     32 9606.ENSP00000281321 GO:0016049         POU4F2          409
     33 9606.ENSP00000283027 GO:0016049          NUBP1          320
     34 9606.ENSP00000284981 GO:0016049            APP          770
     35 9606.ENSP00000296388 GO:0016049         LEPRE1          736
     36 9606.ENSP00000296755 GO:0016049          MAP1B         2468
     37 9606.ENSP00000296875 GO:0016049           GDF9          454
     38 9606.ENSP00000301067 GO:0016049           MLL2         5537
     39 9606.ENSP00000303351 GO:0016049          ITGB1          798
     40 9606.ENSP00000305133 GO:0016049          SOCS5          536
     41 9606.ENSP00000306157 GO:0016049           IL7R          459
     42 9606.ENSP00000306662 GO:0016049         ADRA1B          520
     43 9606.ENSP00000307156 GO:0016049          LAMB2         1798
     44 9606.ENSP00000307272 GO:0016049          RPTOR         1335
     45 9606.ENSP00000316357 GO:0016049          USP9X         2570
     46 9606.ENSP00000316543 GO:0016049          RAPH1         1250
     47 9606.ENSP00000317257 GO:0016049          CPNE1          542
     48 9606.ENSP00000322957 GO:0016049           PAK7          719
     49 9606.ENSP00000324549 GO:0016049         CYFIP1         1253
     50 9606.ENSP00000324560 GO:0016049           ULK1         1050
     51 9606.ENSP00000324856 GO:0016049          STK11          433
     52 9606.ENSP00000329623 GO:0016049           BCL2          239
     53 9606.ENSP00000330382 GO:0016049          PDGFB          241
     54 9606.ENSP00000332643 GO:0016049            NDN          321
     55 9606.ENSP00000333982 GO:0016049          NDEL1          345
     56 9606.ENSP00000334458 GO:0016049          GATA4          442
     57 9606.ENSP00000335246 GO:0016049         ENTPD5          428
     58 9606.ENSP00000337146 GO:0016049          ENOX2          610
     59 9606.ENSP00000337332 GO:0016049          SIRT6          355
     60 9606.ENSP00000337697 GO:0016049            DCX          441
     61 9606.ENSP00000339740 GO:0016049         CAMK2D          499
     62 9606.ENSP00000340820 GO:0016049           MAPT          776
     63 9606.ENSP00000341940 GO:0016049           CAV3          151
     64 9606.ENSP00000347906 GO:0016049          PRMT2          433
     65 9606.ENSP00000348769 GO:0016049          ARIH2          493
     66 9606.ENSP00000351049 GO:0016049           PAK4          591
     67 9606.ENSP00000351591 GO:0016049          NLGN3          848
     68 9606.ENSP00000351905 GO:0016049         TGFBR2          592
     69 9606.ENSP00000353362 GO:0016049        CACNA1A         2506
     70 9606.ENSP00000353582 GO:0016049           NRP2          931
     71 9606.ENSP00000354558 GO:0016049           MTOR         2549
     72 9606.ENSP00000354829 GO:0016049          SGMS1          413
     73 9606.ENSP00000354877 GO:0016049           ULK2         1036
     74 9606.ENSP00000355627 GO:0016049            AGT          485
     75 9606.ENSP00000355785 GO:0016049         LEFTY2          366
     76 9606.ENSP00000355896 GO:0016049          TGFB2          442
     77 9606.ENSP00000357288 GO:0016049        LAMTOR2          125
     78 9606.ENSP00000357470 GO:0016049           IL6R          468
     79 9606.ENSP00000357494 GO:0016049           ROS1         2347
     80 9606.ENSP00000359095 GO:0016049          PRMT6          375
     81 9606.ENSP00000359729 GO:0016049         SLC9A6          701
     82 9606.ENSP00000360973 GO:0016049          AGTR2          363
     83 9606.ENSP00000360988 GO:0016049          LUZP4          313
     84 9606.ENSP00000361423 GO:0016049           ABL1         1149
     85 9606.ENSP00000362092 GO:0016049          RRAGC          399
     86 9606.ENSP00000362717 GO:0016049           LHX2          406
     87 9606.ENSP00000363822 GO:0016049             AR          920
     88 9606.ENSP00000365663 GO:0016049           NPPA          151
     89 9606.ENSP00000366396 GO:0016049           XRN2          950
     90 9606.ENSP00000367123 GO:0016049         SLC3A2          631
     91 9606.ENSP00000367830 GO:0016049          PRKCZ          592
     92 9606.ENSP00000368030 GO:0016049         ATAD3A          634
     93 9606.ENSP00000368169 GO:0016049           DVL1          670
     94 9606.ENSP00000368683 GO:0016049           EDN1          212
     95 9606.ENSP00000369014 GO:0016049           NDNF          568
     96 9606.ENSP00000369816 GO:0016049           SHBG          402
     97 9606.ENSP00000369960 GO:0016049         ADRA1A          475
     98 9606.ENSP00000370330 GO:0016049        ERBB2IP         1371
     99 9606.ENSP00000376800 GO:0016049           MTPN          118
    100 9606.ENSP00000377823 GO:0016049          NDRG4          391
    101 9606.ENSP00000377947 GO:0016049           RARG          454
    102 9606.ENSP00000378306 GO:0016049         PPP3CB          525
    103 9606.ENSP00000379003 GO:0016049          NUPR1          100
    104 9606.ENSP00000385610 GO:0016049          MEX3C          659
    105 9606.ENSP00000395359 GO:0016049          CADM1          442
    106 9606.ENSP00000411672 GO:0016049       ATP6V0E2          213
    107 9606.ENSP00000414303 GO:0016049           BDNF          329
    108 9606.ENSP00000419782 GO:0016049           CDK5          292
    109 9606.ENSP00000422591 GO:0016049          SLIT2         1529
    110 9606.ENSP00000427941 GO:0016049       ATP6V0E1           81
    111 9606.ENSP00000430333 GO:0016049          SLIT3         1523
    112 9606.ENSP00000465742 GO:0016049        RASGRP4          673
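As for the second part of the question (retrieving a network from R), a minimal sketch might look like the following. It relies on the map(), get_interactions() and plot_network() methods of the STRINGdb object. The gene list, the score threshold of 400 and the version string "11.5" are assumptions made for illustration; the version must be one that the installed STRINGdb release supports, and the calls need network access.

```r
library(STRINGdb)

# Assumption: version "11.5" is supported by the installed STRINGdb release
# (the example above used version = "10").
string_db <- STRINGdb$new(version = "11.5", species = 9606,
                          score_threshold = 400, input_directory = "")

# Hypothetical input: a few gene names taken from the output above
genes <- data.frame(gene = c("TGFB1", "TGFBR2", "MTOR", "RPTOR", "STK11", "BCL2"),
                    stringsAsFactors = FALSE)

# Map gene names to STRING identifiers (adds a STRING_id column)
mapped <- string_db$map(genes, "gene", removeUnmappedRows = TRUE)

# Interactions among the mapped proteins, as a data frame of scored edges
interactions <- string_db$get_interactions(mapped$STRING_id)
head(interactions)

# Plot the network for these proteins, as rendered by the STRING website
string_db$plot_network(mapped$STRING_id)
```

The interactions data frame can then be passed to a graph package for further analysis if a plot alone is not enough.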

CytoTree: an R/Bioconductor package for analysis and visualization of flow and mass cytometry data

Background: The rapidly increasing dimensionality and throughput of flow and mass cytometry data necessitate new bioinformatics tools for analysis and interpretation, and the recently emerging single-cell-based algorithms provide a powerful strategy to meet this challenge.

Results: Here, we present CytoTree, an R/Bioconductor package designed to analyze and interpret multidimensional flow and mass cytometry data. CytoTree provides multiple computational functionalities that integrate most of the commonly used techniques in unsupervised clustering and dimensionality reduction and, more importantly, support the construction of a tree-shaped trajectory based on the minimum spanning tree algorithm. A graph-based algorithm is also implemented to estimate the pseudotime and infer intermediate-state cells. We apply CytoTree to several examples of mass cytometry and time-course flow cytometry data on heterogeneity-based cytology and differentiation/reprogramming experiments to illustrate the practical utility achieved in a fast and convenient manner.

Conclusions: CytoTree represents a versatile tool for analyzing multidimensional flow and mass cytometry data and for producing heuristic results for trajectory construction and pseudotime estimation in an integrated workflow.

Keywords: Flow cytometry; Mass cytometry; Pseudotime; Single-cell; Tree.

Conflict of interest statement

The authors have declared no competing interests.

Figures

Overview of CytoTree package functionalities and algorithm. The preprocessing panel reveals the preparation…

Summary of the analysis workflow of CytoTree. Summary of the CytoTree workflow for…

Analysis of mass cytometry data to identify the hematopoietic differentiation hierarchy. a Known…

Pseudotime estimation and the identification of intermediate states in the hematopoiesis of different…

Analysis of time-course flow cytometry data reveals the induced differentiation process. a Experimental…


SpatialLIBD: an R/Bioconductor package to visualize spatially-resolved transcriptomics data

Motivation: Spatially-resolved transcriptomics has now enabled the quantification of high-throughput and transcriptome-wide gene expression in intact tissue while also retaining the spatial coordinates. Incorporating the precise spatial mapping of gene activity advances our understanding of intact tissue-specific biological processes. In order to interpret these novel spatial data types, interactive visualization tools are necessary.

Results: We describe spatialLIBD, an R/Bioconductor package to interactively explore spatially-resolved transcriptomics data generated with the 10x Genomics Visium platform. The package contains functions to interactively access, visualize, and inspect the observed spatial gene expression data and data-driven clusters identified with supervised or unsupervised analyses, either on the user’s computer or through a web application.


Bioinformatics and Computational Biology Solutions Using R and Bioconductor

Covers the basics of R software and the key capabilities of the Bioconductor project (a widely used open source and open development software project for the analysis and comprehension of data arising from high-throughput experimentation in genomics and molecular biology, rooted in the open source statistical computing environment R), including importation and preprocessing of high-throughput data from microarrays and other platforms. Also introduces statistical concepts and tools necessary to interpret and critically evaluate the bioinformatics and computational biology literature. Includes an overview of preprocessing and normalization, statistical inference, multiple comparison corrections, Bayesian inference in the context of multiple comparisons, clustering, and classification/machine learning.

© 2021 The Johns Hopkins University.
Creative Commons BY-NC-SA.


Bioconductor Utilization

Installation

When using Bioconductor for the first time, install the latest release of R, then install the latest version of Bioconductor by starting R and entering the following commands:
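A minimal sketch of this step, assuming a current Bioconductor release in which the BiocManager package from CRAN is the installer (older releases used a different installer, so this is not necessarily the command the original text showed):

```r
# Install the BiocManager helper from CRAN if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# With no package names, this installs (or updates) the core Bioconductor packages
BiocManager::install()
```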

These commands install all the core packages of the latest Bioconductor release. To install specific packages available in the Bioconductor repository, such as IMMAN and GenomicRanges, follow the instructions on the package landing page, as in the following command:
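As a sketch under the same assumption (BiocManager as the installer in current releases), installing individual packages could look like this. GenomicRanges and Biostrings are used because they appear later in this section; other package names, such as IMMAN if it is available in your release, are passed in the same way.

```r
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install a single Bioconductor package by name
BiocManager::install("GenomicRanges")

# Several packages can be installed in one call by passing a character vector
BiocManager::install(c("GenomicRanges", "Biostrings"))
```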

Historically, the biocLite() function in the BiocInstaller package was used for package installation instead of R's usual install.packages() function; in current Bioconductor releases it has been replaced by BiocManager::install(). A dedicated installer is needed because Bioconductor has a repository separate from CRAN and a release schedule different from R's. Because of this mismatch, the version found by install.packages() is often not the most recent ‘release’ available.

Examples of use

After the initial installation, this part gives simple examples from the Bioconductor workspace to help you become familiar with it. The examples are drawn from different components of the Bioconductor project, such as high-throughput sequencing, statistical analysis, and comprehension.

To begin, and to get acquainted with high-throughput data, you will work with DNA sequences. For this, use Biostrings, the principal package for manipulating large biological sequences. Its DNAStringSet() function stores a set of DNAString objects (strings over the DNA alphabet, that is, the IUPAC Extended Genetic Alphabet). It turns the input into an XStringSet object of the DNA base type. The letters are encoded in a way that supports fast search algorithms.

Then, we can use the complement() function to complement the DNA sequences.
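The two steps above can be sketched as follows. The three short sequences are made up for illustration and are not taken from the original tutorial.

```r
library(Biostrings)

# Hypothetical input: three short DNA sequences
dna <- DNAStringSet(c("ATGCGT", "GGCATT", "TTAACC"))
dna

# complement() swaps each base for its partner (A<->T, C<->G)
# without reversing the sequence
comp <- complement(dna)
as.character(comp)
```

If the reverse strand is needed rather than the plain complement, Biostrings also provides reverseComplement().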

Learning to work with classes like DNAStringSet from Biostrings pays off in other packages.

In another example, we use the GenomicRanges package, which is designed to represent genomic locations within the Bioconductor project. It provides a basis for genomic analysis through three classes (GRanges, GPos, and GRangesList), which represent genomic ranges, genomic positions, and groups of genomic ranges, respectively.

The GRanges class represents a set of genomic ranges, each with a single start and end location on the genome. It stores the location of genomic features such as contiguous binding sites, transcripts, and exons. We can create these objects using the GRanges constructor function as below:

The result is a GRanges object with 10 genomic ranges. As shown, the printed output is divided into a left-hand and a right-hand region separated by | symbols. The genomic coordinates (seqnames, ranges, and strand) are on the left-hand side, and the metadata columns (annotation) are on the right. In this instance, the metadata includes score and GC information. However, almost anything can be stored in the metadata portion of a GRanges object.
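A sketch of such a constructor call, modelled on the introductory example in the GenomicRanges vignette (ten ranges with score and GC metadata columns); the exact values are illustrative:

```r
library(GenomicRanges)

gr <- GRanges(
    # chromosome of each range, run-length encoded (1 + 3 + 2 + 4 = 10 ranges)
    seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
    # start and end positions, with one name per range
    ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
    # strand of each range (1 + 2 + 2 + 3 + 2 = 10 values)
    strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
    # any further named arguments become metadata columns
    score = 1:10,
    GC = seq(1, 0, length = 10)
)
gr
```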

The seqnames(), ranges(), and strand() accessor functions are used to extract the components of the genomic coordinates from a GRanges object. For example,

The last example covers the statistical analysis portion of the project, using the Gene Ontology (GO) classification. GO is a collection of concepts (molecular function, cellular component, and biological process) used to describe gene function, together with the relationships between these concepts. More specifically, GO classifies genes and gene products hierarchically into a graph called an ontology. For a practical example, three Bioconductor packages can be used: clusterProfiler, DOSE, and org.Hs.eg.db.

The clusterProfiler package is a set of methods for analyzing and visualizing functional profiles, such as GO, of genes and gene clusters. With it, you can group genes according to their similarities. The geneList data set shipped with the DOSE package is a good choice for illustrating this procedure; it contains values indexed by gene IDs. The org.Hs.eg.db package, in turn, provides genome-wide annotation for human, that is, a representation of the genetic material enriched with information relating to genomic position.

The enrichGO() function performs GO enrichment analysis on a given vector of genes. Enrichment analysis identifies the predefined classes (here, GO terms) to which a group of genes is assigned according to their functional annotation. The enriched result may contain very general terms. To use this function, you have to supply an argument called “OrgDb”; for human data, this is the org.Hs.eg.db package described above.

To draw the enrichment map, the emapplot() function can be used. It displays gene sets as a network. Put simply, mutually overlapping gene sets tend to cluster together.
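a small sketch with a hypothetical three-range object (built here only so that the example is self-contained):

```r
library(GenomicRanges)

# Hypothetical GRanges object with three ranges
gr <- GRanges(
    seqnames = c("chr1", "chr2", "chr2"),
    ranges = IRanges(start = c(101, 201, 301), width = 10),
    strand = c("+", "-", "*")
)

seqnames(gr)  # chromosome names, returned as a run-length encoded factor
ranges(gr)    # start/end/width, returned as an IRanges object
strand(gr)    # strand information, also run-length encoded
```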
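The enrichment analysis and the enrichment map described above can be sketched as follows. The fold-change cutoff of 2 and the p-value cutoffs are illustrative choices, not values stated in the text. The pairwise_termsim() step is an assumption about the installed version: recent releases of the enrichplot package expect it before emapplot(), while older releases accepted the enrichGO() result directly.

```r
library(clusterProfiler)
library(enrichplot)
library(org.Hs.eg.db)

# geneList: named numeric vector shipped with DOSE (names are Entrez gene IDs)
data(geneList, package = "DOSE")

# Illustrative choice: keep genes with an absolute value above 2
gene <- names(geneList)[abs(geneList) > 2]

# GO enrichment for the Biological Process ontology
ego <- enrichGO(gene          = gene,
                OrgDb         = org.Hs.eg.db,
                keyType       = "ENTREZID",
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
head(as.data.frame(ego))

# Compute term-term similarity, then draw the enrichment map
ego_sim <- pairwise_termsim(ego)
emapplot(ego_sim)
```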

Getting help

If you need help while using Bioconductor packages, you can use the following commands:

Furthermore, many handy help pages and workflows are available on the Bioconductor sites:
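A sketch of typical help commands, using GenomicRanges as the example package (any installed package name can be substituted):

```r
library(GenomicRanges)

# Index of all documented topics in a package
help(package = "GenomicRanges")

# Help page for a single function or class
?GRanges

# List the vignettes shipped with the package
vignette(package = "GenomicRanges")

# Open the vignettes in a web browser
browseVignettes("GenomicRanges")
```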


ScPipe: A flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data

Single-cell RNA sequencing (scRNA-seq) technology allows researchers to profile the transcriptomes of thousands of cells simultaneously. Protocols that incorporate both designed and random barcodes have greatly increased the throughput of scRNA-seq, but give rise to a more complex data structure. There is a need for new tools that can handle the various barcoding strategies used by different protocols and exploit this information for quality assessment at the sample-level and provide effective visualization of these results in preparation for higher-level analyses. To this end, we developed scPipe, an R/Bioconductor package that integrates barcode demultiplexing, read alignment, UMI-aware gene-level quantification and quality control of raw sequencing data generated by multiple protocols that include CEL-seq, MARS-seq, Chromium 10X, Drop-seq and Smart-seq. scPipe produces a count matrix that is essential for downstream analysis along with an HTML report that summarises data quality. These results can be used as input for downstream analyses including normalization, visualization and statistical testing. scPipe performs this processing in a few simple R commands, promoting reproducible analysis of single-cell data that is compatible with the emerging suite of open-source scRNA-seq analysis tools available in R/Bioconductor and beyond. The scPipe R package is available for download from https://www.bioconductor.org/packages/scPipe.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. The scPipe workflow.

Fig 2. Example QC plots that can be created using output from scPipe to assess…

Fig 3. A pairwise scatter plot of the quality control metrics collected by scPipe for…

Fig 4. Screenshot of an HTML report created by scPipe for the mouse blood cell…

Fig 5. Analysis results produced with SingleCellExperiment-compatible packages from Bioconductor.


4 CONCLUSION

The R/Bioconductor package RamiGO provides an easy-to-use R interface to the AmiGO Visualize web server. It provides a simple and elegant way to retrieve Graphviz trees that display hundreds of GO IDs at once and efficiently study clusters or subcomponents of the GO tree in graph form. RamiGO provides functions to convert a GO tree into different formats and display it in Cytoscape without leaving the R environment. RamiGO is therefore a perfect companion to GSEA and GO analyses in R, as it helps one better analyse and interpret the long, and sometimes complex, lists of GO identifiers that these analyses produce.


3.2 Team bootcamps

The team bootcamps are really for our team members and other more advanced R/programming members and collaborators. However, the material we cover is within reach of everyone as long as you practice using R/Bioconductor here and there. The concepts covered are of use to all of us who work with R, but we understand that you might not have as much time to learn these materials. If that’s the case, please feel free to sign up for our Data Science guidance sessions and we’ll help you learn these concepts at your own pace.

The first iteration of these bootcamps was run in September 2020 with the following schedule. For all of them, you should have the latest R and RStudio versions installed on your computer and be familiar with the R programming language. For a more structured working environment, we might use JHPCE’s computational resources while running RStudio on our computers and running code through a Linux terminal. You will probably need to spend time self-learning and practicing some of the material beyond these videos. If you just started learning about R, then these bootcamps will be quite challenging.

Session  Date        Time    Prerequisites                      Topic
1        2020-09-21  3-5 pm  NA                                 How to be a modern scientist
2        2020-09-22  3-5 pm  R + RStudio                        What they forgot to teach you about R
3        2020-09-23  1-3 pm  R + RStudio                        What they forgot to teach you about R
4        2020-09-24  3-5 pm  R + RStudio                        The Elements of Data Analytic Style + CBDS
5        2020-09-28  3-5 pm  R + RStudio                        The Elements of Data Analytic Style + CBDS
6        2020-09-29  3-5 pm  Be a part of the DSgs-guides team  DSgs-guide training
7        2020-09-30  1-3 pm  RStudio + R functions              Building Tidy Tools
8        2020-10-01  3-5 pm  RStudio + R functions              Building Tidy Tools


5.6.1 What is Tidy Data?

Tidy data is a concept largely defined by Hadley Wickham (Wickham 2014). Tidy data has the following three characteristics:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

Here is an example of some tidy data:

Here is an example of some untidy data:

Task 1: In what ways is the untidy data not tidy? How could we make the untidy data tidy?

Tidy data is generally easier to work with than untidy data, especially if you are working with packages such as ggplot. Fortunately, packages are available to make untidy data tidy. Today we will explore a few of the functions available in the tidyr package which can be used to make untidy data tidy. If you are interested in finding out more about tidying data, we recommend reading “R for Data Science”, by Garrett Grolemund and Hadley Wickham. An electronic copy is available here: http://r4ds.had.co.nz/

The untidy data above is untidy because two variables (Wins and Losses) are stored in one column (Category). This is a common way in which data can be untidy. To tidy this data, we need to make Wins and Losses into columns, and store the values in Counts in these columns. Fortunately, there is a function from the tidyverse packages to perform this operation. The function is called spread(), and it takes two arguments, key and value. You should pass the name of the column which contains multiple variables to key, and pass the name of the column which contains values from multiple variables to value. For example:
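A minimal sketch of this operation. The data frame below is hypothetical: only the column names Category and Counts and the values Wins and Losses come from the text, while the Team column and the numbers are made up.

```r
library(tidyr)

# Hypothetical untidy data: Wins and Losses share the Category column
untidy <- data.frame(
    Team     = c("A", "A", "B", "B"),
    Category = c("Wins", "Losses", "Wins", "Losses"),
    Counts   = c(10, 5, 7, 8),
    stringsAsFactors = FALSE
)

# key = column holding the variable names, value = column holding their values
tidy <- spread(untidy, key = Category, value = Counts)
tidy
```

In current tidyr releases spread() still works but is superseded by pivot_wider(), which performs the same reshaping with names_from and values_from arguments.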

Task 2: The dataframe foods defined below is untidy. Work out why and use spread() to tidy it

The other common way in which data can be untidy is if the columns are values instead of variables. For example, the dataframe below shows the percentages some students got in tests they did in May and June. The data is untidy because the columns May and June are values, not variables.

Fortunately, there is a function in the tidyverse packages to deal with this problem too. gather() takes the names of the columns which are values, the key and the value as arguments. This time, the key is the name of the variable with values as column names, and the value is the name of the variable with values spread over multiple columns, i.e.:

These examples don’t have much to do with single-cell RNA-seq analysis, but are designed to help illustrate the features of tidy and untidy data. You will find it much easier to analyse your single-cell RNA-seq data if your data is stored in a tidy format. Fortunately, the data structures we commonly use to facilitate single-cell RNA-seq analysis usually encourage you to store your data in a tidy manner.
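A minimal sketch with a hypothetical version of the test-score data. The column names May and June come from the text; the student names, the percentages, and the new column names Month and Percentage are made up.

```r
library(tidyr)

# Hypothetical untidy data: the month names are used as column names
scores <- data.frame(
    Student = c("Ana", "Ben", "Cal"),
    May     = c(70, 82, 64),
    June    = c(75, 80, 71),
    stringsAsFactors = FALSE
)

# key = name for the new column of former column names,
# value = name for the new column of their values,
# followed by the columns to gather
tidy_scores <- gather(scores, key = "Month", value = "Percentage", May, June)
tidy_scores
```

As with spread(), gather() is superseded in current tidyr releases; pivot_longer() is the newer equivalent.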

5.6.2 What is Rich Data?

If you google ‘rich data’, you will find lots of different definitions for this term. In this course, we will use ‘rich data’ to mean data which is generated by combining information from multiple sources. For example, you could make rich data by creating an object in R which contains a matrix of gene expression values across the cells in your single-cell RNA-seq experiment, but also information about how the experiment was performed. Objects of the SingleCellExperiment class, which we will discuss below, are an example of rich data.

5.6.3 What is Bioconductor?

From Wikipedia: Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology. Bioconductor is based primarily on the statistical R programming language, but does contain contributions in other programming languages. It has two releases each year that follow the semiannual releases of R. At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs.

We strongly recommend that all newcomers, and even experienced high-throughput data analysts, use well developed and maintained Bioconductor methods and classes.

5.6.4 SingleCellExperiment class

SingleCellExperiment (SCE) is an S4 class for storing data from single-cell experiments. This includes specialized methods to store and retrieve spike-in information, dimensionality reduction coordinates and size factors for each cell, along with the usual metadata for genes and libraries.

In practice, an object of this class can be created using its constructor:
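A minimal sketch of the constructor call, using a small simulated count matrix (10 genes by 10 cells). The gene and cell names and the metadata column names are made up for illustration.

```r
library(SingleCellExperiment)

# Simulated counts: 10 genes (rows) by 10 cells (columns)
set.seed(1)
counts_matrix <- matrix(rpois(100, lambda = 10), nrow = 10, ncol = 10)
rownames(counts_matrix) <- paste0("gene", 1:10)
colnames(counts_matrix) <- paste0("cell", 1:10)

sce <- SingleCellExperiment(
    assays  = list(counts = counts_matrix),                    # expression data
    rowData = data.frame(gene_names = rownames(counts_matrix)), # per-gene metadata
    colData = data.frame(cell_names = colnames(counts_matrix))  # per-cell metadata
)
sce
```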

In the SingleCellExperiment, users can assign arbitrary names to entries of assays. To assist interoperability between packages, some suggestions for what the names should be for particular types of data are provided by the authors:

  • counts: Raw count data, e.g., number of reads or transcripts for a particular gene.
  • normcounts: Normalized values on the same scale as the original counts. For example, counts divided by cell-specific size factors that are centred at unity.
  • logcounts: Log-transformed counts or count-like values. In most cases, this will be defined as log-transformed normcounts, e.g., using log base 2 and a pseudo-count of 1.
  • cpm: Counts-per-million. This is the read count for each gene in each cell, divided by the library size of each cell in millions.
  • tpm: Transcripts-per-million. This is the number of transcripts for each gene in each cell, divided by the total number of transcripts in that cell (in millions).

Each of these suggested names has an appropriate getter/setter method for convenient manipulation of the SingleCellExperiment. For example, we can take the (very specifically named) counts slot, normalise it and assign it to normcounts instead:
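A sketch of this pattern on simulated data. The normalisation shown here (dividing by library-size factors centred at unity, following the definition of normcounts in the list above) is one simple choice and not necessarily the one used in the original course.

```r
library(SingleCellExperiment)

# Simulated counts: 10 genes by 10 cells
set.seed(1)
counts_matrix <- matrix(rpois(100, lambda = 10), nrow = 10, ncol = 10,
                        dimnames = list(paste0("gene", 1:10), paste0("cell", 1:10)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Cell-specific size factors from library sizes, centred at unity
lib_sizes <- colSums(counts(sce))
size_factors <- lib_sizes / mean(lib_sizes)

# Getter on the right-hand side, setter on the left-hand side
normcounts(sce) <- sweep(counts(sce), 2, size_factors, "/")
logcounts(sce) <- log2(normcounts(sce) + 1)

assayNames(sce)
```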

5.6.5 scater package

scater is an R package for single-cell RNA-seq analysis (McCarthy et al. 2017). The package contains several useful methods for quality control, visualisation and pre-processing of data prior to further downstream analysis.

scater features the following functionality:

  • Automated computation of QC metrics
  • Transcript quantification from read data with pseudo-alignment
  • Data format standardisation
  • Rich visualizations for exploratory analysis
  • Seamless integration into the Bioconductor universe
  • Simple normalisation methods

We highly recommend using scater for all single-cell RNA-seq analyses, and scater is the basis of the first part of the course.

As illustrated in the figure below, scater will help you with quality control, filtering and normalization of your expression matrix following mapping and alignment. Keep in mind that this figure represents the original version of scater where an SCESet class was used. In the newest version this figure is still correct, except that SCESet can be substituted with the SingleCellExperiment class.
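The quality-control and normalisation steps mentioned above can be sketched on simulated data as follows. This assumes a recent scater release, where addPerCellQC() and logNormCounts() are available after loading scater (they are provided through its companion package scuttle); older releases used different function names.

```r
library(scater)

# Simulated counts: 10 genes by 10 cells
set.seed(1)
counts_matrix <- matrix(rpois(100, lambda = 10), nrow = 10, ncol = 10,
                        dimnames = list(paste0("gene", 1:10), paste0("cell", 1:10)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Per-cell QC metrics (e.g. library size and number of detected genes)
# are added as columns of colData
sce <- addPerCellQC(sce)
head(colData(sce))

# Library-size normalisation followed by a log2 transform;
# the result is stored in the logcounts assay
sce <- logNormCounts(sce)
assayNames(sce)
```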


Getting associated GO:IDs for a given gene name using R bioconductor annotation package

I am trying to play around with the hgu95av2.db and GO.db libraries from Bioconductor.

I have a list of gene names.

These are standard gene names from genedb.

I now want to get the GO IDs associated with them. I would like to use this information to bin the data in the future.

I have attempted to go through the annotationapi help file where they say one way of getting the ids is to use this api as follows:

I am not sure how to set up the keys. Do I just set up a column object with the list of keys?

When I try to do and run the command above I get the following error:

Error in .testIfKeysAreOfProposedKeytype(x, keys, keytype) : None of the keys entered are valid keys for the keytype specified.

I have played around with the keytype and always get the same error, which makes me think I don't really understand fundamentally how to use this database query tool.

I have done a search in Bioconductor, and the examples just assume that I have expression data in affy matrix format, whereas I just have a list of gene names.

I would appreciate your help, and apologies, as I am a newbie and not really clear on the R Bioconductor interface.
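This error typically appears when the values passed as keys are not of the type named in keytype. A minimal sketch of a lookup by gene symbol, assuming the gene names are official gene symbols (the two symbols below are placeholders, not the asker's list):

```r
library(hgu95av2.db)

# Hypothetical input: a character vector of gene symbols
gene_symbols <- c("TP53", "BRCA1")

# List the key types this annotation package accepts
keytypes(hgu95av2.db)

# Look up GO IDs by symbol; keytype must match the kind of key supplied
go_map <- AnnotationDbi::select(hgu95av2.db,
                                keys    = gene_symbols,
                                columns = c("GO", "ONTOLOGY"),
                                keytype = "SYMBOL")
head(go_map)
```

The result has one row per gene and GO term pair, so it can be split by gene symbol to bin the data later.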

