scfind

About

scfind is a tool that allows one to search single cell RNA-Seq collections using either cells, lists of genes or GO term lists, e.g. searching for cell-types where a specific set of genes are expressed. Here we provide a tutorial on how to create an index from your dataset and how to run queries using the created index.

scfind is a Bioconductor package.

Cloud implementation of scfind with a large collection of datasets is available on our website.

Dataset

We will run scfind on a human pancreas dataset. SingleCellExperiment object created from the data from this dataset can be downloaded from our website. We will put it in the data folder in the root directory. scfind is compatible with the SingleCellExperiment format and we will run it directly on the downloaded dataset:

library(SingleCellExperiment)
muraro <- readRDS("~/data/muraro.rds")

Gene Index

Now we are ready to create a gene index using our dataset:

library(scfind)
cellTypeIndex <- buildCellTypeIndex(muraro)
head(cellTypeIndex)
##                alpha     ductal endothelial      delta     acinar
## A1BG-AS1 0.000000000 0.00000000  0.00000000 0.00000000 0.00000000
## A1BG     0.123152709 0.01224490  0.00000000 0.09844560 0.04109589
## A1CF     0.895320197 0.11020408  0.04761905 0.68911917 0.62557078
## A2M-AS1  0.027093596 0.02448980  0.00000000 0.02072539 0.01826484
## A2ML1    0.013546798 0.01632653  0.04761905 0.02072539 0.05022831
## A2M      0.004926108 0.06122449  0.95238095 0.02072539 0.02739726
##                beta unclear      gamma mesenchymal epsilon
## A1BG-AS1 0.00000000    0.00 0.00000000      0.0000       0
## A1BG     0.14062500    0.00 0.12871287      0.2875       0
## A1CF     0.55357143    0.25 0.69306931      0.0500       1
## A2M-AS1  0.02455357    0.00 0.02970297      0.0625       0
## A2ML1    0.01562500    0.00 0.01980198      0.0375       0
## A2M      0.01116071    0.00 0.00990099      0.8125       0
cellIndex <- buildCellIndex(muraro)

The gene index contains probability of expression of each gene in each cell type (ratio of expressed cells versus all cells in a given cell type). Since we only need to store one floating point value for each gene and cell type, the size of the index will scale with the total number of cell types in the atlas.

By default the cell_type1 column of the colData slot of the SingleCellExperiment object is used to define cell types, however it can also be defined manually using the cell_type_column argument of the buildCellTypeIndex function (check ?buildCellTypeIndex).

Marker genes

Now let’s define lists of cell type specific marker genes. We will use the original publication, namely Figure 1:

# these genes are taken from fig. 1
muraro_alpha <- c("GCG", "LOXL4", "PLCE1", "IRX2", "GC", "KLHL41", "CRYBA2", "TTR", "TM4SF4", "RGS4")
muraro_beta <- c("INS", "IAPP", "MAFA", "NPTX2", "DLK1", "ADCYAP1", "PFKFB2", "PDX1", "TGFBR3", "SYT13")
muraro_gamma <- c("PPY", "SERTM1", "CARTPT", "SLITRK6", "ETV1", "THSD7A", "AQP3", "ENTPD2", "PTGFR", "CHN2")
muraro_delta <- c("SST", "PRG4", "LEPR", "RBP4", "BCHE", "HHEX", "FRZB", "PCSK1", "RGS2", "GABRG2")

Additionally, we will use cell type specific marker genes from another human pancreas publication (taken from Figures 2 and 3):

# these genes are taken from fig. 2
segerstolpe_alpha <- c("GCG", "TTR", "SSR4", "CRYBA2", "SPINT2", "PEMT", "CHGA", "GPX3", "TMEM176B", "GC", "FKBP2", "PCSK2", "NAA20", "TMEM176A", "ERP29", "DPM3", "TMED3", "CNPY2", "C10orf10", "TUBA1B", "SMIM24", "F10", "FXYD5", "CD46", "SLC22A17")
segerstolpe_beta <- c("INS", "SCGN", "IAPP", "FXYD2", "RPL3", "G6PC2", "HSP90AB1", "PEBP1", "HSPA8", "EIR4A2", "ERO1LB", "SERINC1", "SLC30A8", "FAM159B", "SURF4", "NPTX2", "PFKFB2", "EDARADD", "MEG3", "HOPX", "LMO1", "PDX1", "PTEN", "SH3GL2", "ADCYAP1")
segerstolpe_gamma <- c("PPY", "MALAT1", "SCG2", "SCGB2A1", "PAM", "GPC5-AS1", "STMN2", "PAX6", "MEIS2", "CMTM8", "TTC3", "ARX", "FGFR1", "AKAP9", "ETV1", "PPY2", "NCKAP1", "INPP5F", "PXK", "ID4", "SERTM1", "SLITRK6", "SEMA3E", "APCBEC2", "ABCC9")
segerstolpe_delta <- c("SST", "RBP4", "SEC11C", "PCP4", "AGS2", "HHEX", "TPPP3", "UCP2", "LEPA", "BAIAP3", "MS4A8", "CASR", "PSIP1", "BCHE", "GABRB3", "LY6H", "UNC5B", "EDN3", "OGDHL", "NSG1", "FFAR4", "LINC00643", "LINC01014", "TMEM130", "PRG4")

# these genes are taken from fig. 3
segerstolpe_acinar <- c("REG1A", "SPINK1", "PRSS1", "REG3A", "CTRB2", "SERPINA3", "RNASE1", "IL32", "PRSS3", "REG1B", "PRSS3P2", "CTRB1", "CFB", "GDF15", "MUC1", "C15orf48", "DUOXA2", "AKR1C3", "CPA2", "OLFM4", "GSTA1", "LGALS2", "MGST1", "PDZK1IP1", "SOD2", "RARRES2", "EIF4EBP1", "CXCL17", "RAMP1", "C3", "UBD", "GSTA2", "LDHB", "ANPEP", "SDC4", "LYZ", "ANGPTL4", "IGFBP2", "ALDOB")
segerstolpe_ductal <- c("SPP1", "KRT19", "MMP7", "ANXA4", "ANXA2", "DEFB1", "SERPING1", "ATP5B", "ANXA2P2", "TSPAN8", "CLDN10", "IFITM2", "CTSH", "SERPINA1", "CD59", "SLPI", "S100A16", "CD9", "MIR492", "SERPINA5", "PIGR", "PPAP2C", "CFTR", "S100A13", "AMBP", "CLDN1", "LGALS4", "ANXA3", "ITGB1", "PERP", "LITAF", "LY6E", "CMTM7", "PDLIM3", "WFDC2", "TXNIP", "SLC3A1", "AOP1", "ALDH1A3", "VTCN1")

Search cell types by a gene list

Now, when the gene index is created one can easily and very quickly query information from the dataset. One query which is likely to be of central interest is to retrieve the cell-types where a given set of genes is expressed. For a list of genes, we can then calculate the log-likelihood that all of them will be expressed in the cell-type in question. scfind uses theis log-likelihood to perform a likelihood-ratio test and identify cell types where the given list of genes is significantly expressed.

findCellType function returns a list of p-values corresponding to all cell types in a given dataset. We will run it on all lists of marker genes defined above:

tmp <- -log10(findCellType(cellTypeIndex, muraro_alpha))
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, muraro_beta))
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, muraro_delta))
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, muraro_gamma))
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, segerstolpe_alpha))
## Warning in findCellType(cellTypeIndex, segerstolpe_alpha): Genes: SMIM24
## were exluded from search since they are not present in the Gene Index!
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, segerstolpe_beta))
## Warning in findCellType(cellTypeIndex, segerstolpe_beta): Genes: EIR4A2
## were exluded from search since they are not present in the Gene Index!
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, segerstolpe_delta))
## Warning in findCellType(cellTypeIndex, segerstolpe_delta): Genes: AGS2,
## LEPA, MS4A8, LINC01014 were exluded from search since they are not present
## in the Gene Index!
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, segerstolpe_gamma))
## Warning in findCellType(cellTypeIndex, segerstolpe_gamma): Genes: PPY2,
## APCBEC2 were exluded from search since they are not present in the Gene
## Index!
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, segerstolpe_acinar))
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCellType(cellTypeIndex, segerstolpe_ductal))
## Warning in findCellType(cellTypeIndex, segerstolpe_ductal): Genes: MIR492,
## AOP1 were exluded from search since they are not present in the Gene Index!
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

We can also check whether the obtained p-values for marker genes are significantly different from p-values of random gene sets:

pvals <- NULL
while(length(pvals) < 100) {
    tmp <- findCellType(cellTypeIndex, sample(rownames(cellTypeIndex), 10))
    if(!all(is.na(tmp))) {
        pvals <- c(pvals, min(tmp[!is.na(tmp)]))
    }
}

boxplot(pvals)

Search cells by a gene list

findCell function returns a similar list of p-values corresponding to all cell types in a given dataset. However, it also outputs a list of cells in which genes from the given gene list are co-expressed. We will run it on all lists of marker genes defined above:

tmp <- -log10(findCell(cellIndex, muraro_alpha)$p_values)
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCell(cellIndex, muraro_beta)$p_values)
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCell(cellIndex, muraro_delta)$p_values)
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

tmp <- -log10(findCell(cellIndex, muraro_gamma)$p_values)
if (!all(is.na(tmp))) {
    barplot(tmp, ylab = "-log10(pval)", las = 2)
}

# tmp <- -log10(findCell(cellIndex, segerstolpe_alpha)$p_values)
# if (!all(is.na(tmp))) {
#     barplot(tmp, ylab = "-log10(pval)", las = 2)
# }
# tmp <- -log10(findCell(cellIndex, segerstolpe_beta)$p_values)
# if (!all(is.na(tmp))) {
#     barplot(tmp, ylab = "-log10(pval)", las = 2)
# }
# tmp <- -log10(findCell(cellIndex, segerstolpe_delta)$p_values)
# if (!all(is.na(tmp))) {
#     barplot(tmp, ylab = "-log10(pval)", las = 2)
# }
# tmp <- -log10(findCell(cellIndex, segerstolpe_gamma)$p_values)
# if (!all(is.na(tmp))) {
#     barplot(tmp, ylab = "-log10(pval)", las = 2)
# }
# tmp <- -log10(findCell(cellIndex, segerstolpe_acinar)$p_values)
# if (!all(is.na(tmp))) {
#     barplot(tmp, ylab = "-log10(pval)", las = 2)
# }
# tmp <- -log10(findCell(cellIndex, segerstolpe_ductal)$p_values)
# if (!all(is.na(tmp))) {
#     barplot(tmp, ylab = "-log10(pval)", las = 2)
# }

sessionInfo()

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] bindrcpp_0.2                scfind_0.99.6              
##  [3] SingleCellExperiment_0.99.4 SummarizedExperiment_1.7.10
##  [5] DelayedArray_0.3.21         matrixStats_0.52.2         
##  [7] Biobase_2.37.2              GenomicRanges_1.29.15      
##  [9] GenomeInfoDb_1.13.5         IRanges_2.11.19            
## [11] S4Vectors_0.15.14           BiocGenerics_0.23.4        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13            plyr_1.8.4             
##  [3] compiler_3.4.2          XVector_0.17.2         
##  [5] bindr_0.1               bitops_1.0-6           
##  [7] tools_3.4.2             zlibbioc_1.23.0        
##  [9] digest_0.6.12           bit_1.1-12             
## [11] tibble_1.3.4            evaluate_0.10.1        
## [13] lattice_0.20-35         pkgconfig_2.0.1        
## [15] rlang_0.1.2             Matrix_1.2-11          
## [17] yaml_2.1.14             blogdown_0.1           
## [19] GenomeInfoDbData_0.99.1 stringr_1.2.0          
## [21] dplyr_0.7.4             knitr_1.17             
## [23] rprojroot_1.2           grid_3.4.2             
## [25] glue_1.1.1              R6_2.2.2               
## [27] hash_2.2.6              rmarkdown_1.6          
## [29] bookdown_0.5            reshape2_1.4.2         
## [31] magrittr_1.5            backports_1.1.1        
## [33] htmltools_0.3.6         assertthat_0.2.0       
## [35] stringi_1.1.5           RCurl_1.95-4.8
comments powered by Disqus