dagLogo 1.22.1
Sequence logos are widely used to represent alignments of multiple amino acid (AA) or nucleic acid sequences graphically. The R package seqLogo (Bembom 2006) draws DNA sequence logos, and the motifStack package (Ou 2012) draws sequence logos for amino acid, DNA and RNA sequences and can also display multiple motifs together.
IceLogo (Colaert et al. 2009) is a tool developed in Java to visualize significantly conserved sequence patterns in an alignment of multiple peptide sequences against background sequences. Whereas WebLogo (Crooks et al. 2004) relies on information theory, IceLogo builds on probability theory. IceLogo is reported to be more dynamic and better suited to the analysis of conserved sequence patterns.
However, IceLogo can only compare conserved sequences to reference sequences at the level of individual amino acids. Some sequence patterns are not conserved at that level but are conserved at the level of amino acids grouped by their physical and chemical properties, such as charge and hydrophobicity.
Here we present dagLogo, an R/Bioconductor package inspired by IceLogo, which visualizes significantly conserved sequence patterns relative to a proper background set of sequences, with or without grouping amino acid residues by their physical and chemical properties. The workflow of a dagLogo analysis is shown in Figure 1.
Figure 1. Flowchart of performing dagLogo analyses. The two ways to prepare a Proteome object are colored green and yellow. The two alternative ways to build a dagPeptides object are colored blue and red.
library(dagLogo)
## Wrapped in try() in case the biomaRt server does not respond
if (interactive())
{
    try({
        mart <- useMart("ensembl")
        fly_mart <-
            useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl")
        dat <- read.csv(system.file("extdata", "dagLogoTestData.csv",
                                    package = "dagLogo"))
        seq <- fetchSequence(IDs = as.character(dat$entrez_geneid),
                             anchorPos = as.character(dat$NCBI_site),
                             mart = fly_mart,
                             upstreamOffset = 7,
                             downstreamOffset = 7)
        head(seq@peptides)
    })
}
if (interactive())
{
    try({
        mart <- useMart("ensembl")
        fly_mart <-
            useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl")
        dat <- read.csv(system.file("extdata", "dagLogoTestData.csv",
                                    package = "dagLogo"))
        seq <- fetchSequence(IDs = as.character(dat$entrez_geneid),
                             anchorAA = "*",
                             anchorPos = as.character(dat$peptide),
                             mart = fly_mart,
                             upstreamOffset = 7,
                             downstreamOffset = 7)
        head(seq@peptides)
    })
}
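To make the roles of the anchor position and the two offsets concrete, the following base R sketch cuts a window of 7 residues on either side of an anchor out of a toy protein sequence. The helper extractWindow, the toy sequence and the "-" padding for positions beyond the protein ends are assumptions made for illustration; this is not how fetchSequence is implemented.

```r
# Hypothetical helper (not part of dagLogo): cut a window around an anchor.
extractWindow <- function(protein, anchorPos, upstreamOffset, downstreamOffset,
                          pad = "-") {
    aa <- strsplit(protein, "")[[1]]
    idx <- seq(anchorPos - upstreamOffset, anchorPos + downstreamOffset)
    inside <- idx >= 1 & idx <= length(aa)
    # Clamp indices so subsetting is always valid, then pad positions
    # that fall outside the protein (padding is an assumption of this sketch).
    clamped <- pmin(pmax(idx, 1), length(aa))
    window <- ifelse(inside, aa[clamped], pad)
    paste(window, collapse = "")
}

protein <- "MKTAYIAKQRSISFVKSHFSRQ"   # made-up sequence, serine at position 11
w1 <- extractWindow(protein, anchorPos = 11,
                    upstreamOffset = 7, downstreamOffset = 7)
w2 <- extractWindow(protein, anchorPos = 3,
                    upstreamOffset = 7, downstreamOffset = 7)
w1
w2
```

With upstreamOffset = 7 and downstreamOffset = 7, each window has 15 positions and the anchor always sits at position 8, which is what allows the peptides to be aligned column by column.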
In the following example, the anchoring AA is marked by a lowercase “s”, for the amino acid serine.
if(interactive()){
    try({
        dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
                                    package = "dagLogo"))
        mart <- useMart("ensembl")
        human_mart <-
            useDataset(mart = mart, dataset = "hsapiens_gene_ensembl")
        seq <- fetchSequence(IDs = toupper(as.character(dat$symbol)),
                             type = "hgnc_symbol",
                             anchorAA = "s",
                             anchorPos = as.character(dat$peptides),
                             mart = human_mart,
                             upstreamOffset = 7,
                             downstreamOffset = 7)
        head(seq@peptides)
    })
}
If the peptides contain multiple anchoring AAs, each represented by a lowercase amino acid letter, and you want to keep only a subset of them, try the cleanPeptides function.
if(interactive()){
    dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
                                package="dagLogo"))
    dat <- cleanPeptides(dat, anchors = c("s", "t"))
    mart <- useMart("ensembl", "hsapiens_gene_ensembl")
    seq <- fetchSequence(toupper(as.character(dat$symbol)),
                         type="hgnc_symbol",
                         anchorAA=as.character(dat$anchor),
                         anchorPos=as.character(dat$peptides),
                         mart=mart,
                         upstreamOffset=7,
                         downstreamOffset=7)
    head(seq@peptides)
}
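The idea of marking anchors with lowercase letters and keeping only a subset of them can be sketched in base R as follows. The peptides and the helper findAnchors are made up for illustration and do not reproduce the internals or the output format of cleanPeptides.

```r
# Toy peptides; lowercase letters mark candidate anchoring residues.
peptides <- c("AAGsPLK", "RRtVDEy", "KLMNPQR")

# Hypothetical helper (not dagLogo's cleanPeptides): report every lowercase
# residue that belongs to the requested set of anchors.
findAnchors <- function(peptide, anchors = c("s", "t")) {
    chars <- strsplit(peptide, "")[[1]]
    pos <- which(chars %in% anchors)
    data.frame(peptide = rep(peptide, length(pos)),
               anchor = chars[pos],
               position = pos,
               stringsAsFactors = FALSE)
}

# One row per retained anchor; peptides without a matching anchor drop out.
anchorTable <- do.call(rbind, lapply(peptides, findAnchors))
anchorTable
```

In this toy input the lowercase “y” is ignored because only “s” and “t” were requested, and the third peptide contributes no rows because it has no lowercase residue.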
Similarly, peptide sequences can be fetched from a predefined Proteome object.
dat <- unlist(read.delim(system.file("extdata", "grB.txt", package = "dagLogo"),
                         header = FALSE, as.is = TRUE))
##prepare proteome from a fasta file
proteome <- prepareProteome(fasta = system.file("extdata",
                                                "HUMAN.fasta",
                                                package = "dagLogo"),
                            species = "Homo sapiens")
##prepare an object of dagPeptides
seq <- formatSequence(seq = dat, proteome = proteome, upstreamOffset = 14,
                      downstreamOffset = 15)
Once you have a dagPeptides object in hand, you can build a background model for the DAG test. The background can consist of random subsequences drawn from a whole proteome or from your inputs. If the background is built from a whole proteome, or from a proteome excluding your inputs, a Proteome object is required.
Sequences provided in a FASTA file or downloaded from the UniProt database can be used to prepare a Proteome object. Case 3 of Step 1 shows how to prepare a Proteome object from a FASTA file. Here we show how to prepare a Proteome object via the UniProt database.
if(interactive()){
library(UniProt.ws)
UniProt.ws <- UniProt.ws(taxId=9606)
proteome <- prepareProteome(UniProt.ws=UniProt.ws)
}
The proteome is then used to build background models for Fisher’s exact test or the Z-test, as follows.
bg_fisher <- buildBackgroundModel(seq, background = "wholeProteome",
proteome = proteome, testType = "fisher")
bg_ztest <- buildBackgroundModel(seq, background = "wholeProteome",
proteome = proteome, testType = "ztest")
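To give a feel for what the two test types compare, the sketch below applies both to a single residue at a single position, using made-up counts and base R only. It is an illustration under stated assumptions, not dagLogo's implementation: in particular, computing the Z-score from residue frequencies in repeated background subsamples is our assumption about the general idea, and all numbers are invented.

```r
# Made-up counts for one residue at one aligned position.
inputHits <- 18;  inputTotal <- 100     # seen in 18 of 100 input peptides
bgHits    <- 600; bgTotal    <- 10000   # seen in 600 of 10000 background peptides

# Fisher's exact test on the 2 x 2 table (residue vs. other, input vs. background)
tab <- matrix(c(inputHits, inputTotal - inputHits,
                bgHits,    bgTotal - bgHits),
              nrow = 2,
              dimnames = list(c("residue", "other"),
                              c("input", "background")))
fisherP <- fisher.test(tab)$p.value

# Z-score of the input frequency against frequencies observed in simulated
# background subsamples (30 subsamples of 100 peptides each; an assumption).
set.seed(1)
bgFreq <- rbinom(30, size = 100, prob = bgHits / bgTotal) / 100
zScore <- (inputHits / inputTotal - mean(bgFreq)) / sd(bgFreq)
zP <- 2 * pnorm(-abs(zScore))           # two-sided p-value

c(fisherP = fisherP, zScore = zScore, zP = zP)
```

A positive Z-score here means the residue is more frequent in the input set than in the background at that position; a negative score would indicate depletion.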
The test can be run on the formatted, aligned amino acid symbols as they are. Alternatively, amino acids can be grouped by their physical and chemical properties, and the aligned symbols replaced by one symbol per group: for example, “P”, “N” and “U” for amino acids with positive charge, negative charge and no charge when grouping by charge.
## no grouping
t0 <- testDAU(seq, dagBackground = bg_ztest)
## grouping based on properties of individual amino acids.
t1 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "chemistry_property_Mahler_group")
## grouped on the basis of charge.
t2 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "charge_group")
## grouped on the basis of consensus similarity.
t3 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "consensus_similarity_SF_group")
## grouped on the basis of hydrophobicity.
t4 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "hydrophobicity_KD")
## In case users prefer to use their own grouping scheme, dagLogo allows users
## to supply one as follows. The grouping scheme is named "custom_group" internally.
## Add a grouping scheme based on the level 3 of BLOSUM50
color = c(LVIMC = "#33FF00", AGSTP = "#CCFF00",
FYW = '#00FF66', EDNQKRH = "#FF0066")
symbol = c(LVIMC = "L", AGSTP = "A", FYW = "F", EDNQKRH = "E")
group = list(
LVIMC = c("L", "V", "I", "M", "C"),
AGSTP = c("A", "G", "S", "T", "P"),
FYW = c("F", "Y", "W"),
EDNQKRH = c("E", "D", "N", "Q", "K", "R", "H"))
addScheme(color = color, symbol = symbol, group = group)
t5 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "custom_group")
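The effect of a grouping scheme on the aligned symbols can be sketched in base R by recoding a peptide with the custom groups defined above. The helper recodePeptide and the example peptide are hypothetical; the sketch only illustrates the symbol replacement and does not show how testDAU applies a scheme internally.

```r
# Same groups and symbols as the custom BLOSUM50-based scheme above.
group <- list(
    LVIMC = c("L", "V", "I", "M", "C"),
    AGSTP = c("A", "G", "S", "T", "P"),
    FYW = c("F", "Y", "W"),
    EDNQKRH = c("E", "D", "N", "Q", "K", "R", "H"))
symbol <- c(LVIMC = "L", AGSTP = "A", FYW = "F", EDNQKRH = "E")

# Named lookup vector: residue letter -> symbol of its group
lookup <- unlist(lapply(names(group), function(g) {
    setNames(rep(symbol[[g]], length(group[[g]])), group[[g]])
}))

# Hypothetical helper: replace every residue by its group symbol and
# leave unknown characters (e.g. "X" or padding) unchanged.
recodePeptide <- function(peptide) {
    chars <- strsplit(peptide, "")[[1]]
    recoded <- lookup[chars]
    unknown <- is.na(recoded)
    recoded[unknown] <- chars[unknown]
    paste(recoded, collapse = "")
}

recoded <- recodePeptide("MKVSYDW")
recoded
```

After recoding, residues in the same group become indistinguishable, so a position that alternates between, say, L, V and I is counted as fully conserved at the group level.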
We can use a heatmap or logo to display the test results.
##Plot a heatmap to show the results
dagHeatmap(t0)
Figure 1: DAG heatmap
## dagLogo showing ungrouped AAs differentially used
dagLogo(t0)
Figure 2: ungrouped results
## dagLogo showing AA grouped based on properties of individual amino acids.
dagLogo(t1, groupingSymbol = getGroupingSymbol(t1@group), legend = TRUE)
Figure 3: classic grouped
## grouped on the basis of charge.
dagLogo(t2, groupingSymbol = getGroupingSymbol(t2@group), legend = TRUE)
Figure 4: charge grouped
## grouped on the basis of consensus similarity.
dagLogo(t3, groupingSymbol = getGroupingSymbol(t3@group), legend = TRUE)
Figure 5: chemistry grouped
## grouped on the basis of hydrophobicity.
dagLogo(t4, groupingSymbol = getGroupingSymbol(t4@group), legend = TRUE)
Figure 6: hydrophobicity grouped
Catabolite Activator Protein (CAP), also known as cAMP Receptor Protein (CRP), is a transcription activator that binds to more than 100 sites in the E. coli genome.
The DNA-binding helix-turn-helix motif of the CAP family is first visualized using the motifStack package. Residues 7-13 form the first helix, residues 14-17 the turn, and residues 18-26 the DNA recognition helix. The glycine at position 15 appears to be critical in forming the turn.
protein <- read.table(file.path(find.package("motifStack"), "extdata",
"cap.txt"))
protein <- t(protein[, 1:20])
motif <- pcm2pfm(protein)
motif <- new("pfm", mat = motif,name = "CAP",
color = colorset(alphabet = "AA", colorScheme = "chemistry"))
##The DNA-binding helix-turn-helix motif of the CAP family plotted by motifStack
plotMotifLogo(motif@mat, motifName = motif@name,
p = motif@background, colset = motif@color)
Figure 7: Catabolite Activator Protein Motif
For comparison, the CAP DNA-binding motif is then visualized using the dagLogo package.
cap <- as.character(readAAStringSet(system.file("extdata",
"cap.fasta",
package="dagLogo")))
data("ecoli.proteome")
seq <- formatSequence(seq = cap, proteome = ecoli.proteome)
bg <- buildBackgroundModel(seq,
background= "wholeProteome",
proteome = ecoli.proteome,
numSubsamples = 10L)
##The DNA-binding helix-turn-helix motif of the CAP family plotted by dagLogo
t0 <- testDAU(seq, bg)
dagLogo(t0)
Figure 8: Catabolite Activator Protein Motif
Residues at positions 10, 14, 16, 21 and 25, which are partially or completely buried in the 3-D structure of CAPs, are preferentially hydrophobic. The amino acids of the CAP DNA-binding motifs are grouped on the basis of their chemical properties and visualized using dagLogo as follows. This plot clearly displays those preferentially hydrophobic sites.
## The DNA-binding helix-turn-helix motif of the CAP family grouped by hydrophobicity
t1 <- testDAU(seq, bg, groupingScheme = "hydrophobicity_HW_group")
dagLogo(t1, groupingSymbol = getGroupingSymbol(t1@group), legend = TRUE)
Figure 9: Catabolite Activator Protein Motif
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] UniProt.ws_2.24.0 RCurl_1.95-4.12 bitops_1.0-6
## [4] RSQLite_2.1.1 dagLogo_1.22.1 motifStack_1.28.0
## [7] Biostrings_2.52.0 XVector_0.24.0 IRanges_2.18.0
## [10] S4Vectors_0.22.0 ade4_1.7-13 MotIV_1.40.0
## [13] BiocGenerics_0.30.0 grImport2_0.1-5 biomaRt_2.40.0
## [16] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.44.0 httr_1.4.0
## [3] bit64_0.9-7 assertthat_0.2.1
## [5] highr_0.8 BiocManager_1.30.4
## [7] BiocFileCache_1.8.0 blob_1.1.1
## [9] BSgenome_1.52.0 GenomeInfoDbData_1.2.1
## [11] Rsamtools_2.0.0 yaml_2.2.0
## [13] progress_1.2.0 rGADEM_2.32.0
## [15] pillar_1.3.1 lattice_0.20-38
## [17] glue_1.3.1 digest_0.6.18
## [19] GenomicRanges_1.36.0 RColorBrewer_1.1-2
## [21] colorspace_1.4-1 htmltools_0.3.6
## [23] Matrix_1.2-17 XML_3.98-1.19
## [25] pkgconfig_2.0.2 pheatmap_1.0.12
## [27] bookdown_0.9 zlibbioc_1.30.0
## [29] purrr_0.3.2 scales_1.0.0
## [31] jpeg_0.1-8 BiocParallel_1.18.0
## [33] tibble_2.1.1 SummarizedExperiment_1.14.0
## [35] magrittr_1.5 crayon_1.3.4
## [37] memoise_1.1.0 evaluate_0.13
## [39] MASS_7.3-51.4 tools_3.6.0
## [41] prettyunits_1.0.2 hms_0.4.2
## [43] matrixStats_0.54.0 stringr_1.4.0
## [45] munsell_0.5.0 DelayedArray_0.10.0
## [47] AnnotationDbi_1.46.0 compiler_3.6.0
## [49] GenomeInfoDb_1.20.0 rlang_0.3.4
## [51] rappdirs_0.3.1 htmlwidgets_1.3
## [53] base64enc_0.1-3 rmarkdown_1.12
## [55] gtable_0.3.0 curl_3.3
## [57] DBI_1.0.0 R6_2.4.0
## [59] GenomicAlignments_1.20.0 knitr_1.22
## [61] dplyr_0.8.0.1 seqLogo_1.50.0
## [63] rtracklayer_1.44.0 bit_1.1-14
## [65] stringi_1.4.3 Rcpp_1.0.1
## [67] png_0.1-7 tidyselect_0.2.5
## [69] dbplyr_1.4.0 xfun_0.6
Bembom, Oliver. 2006. “SeqLogo: Sequence Logos for DNA Sequence Alignments.” R Package Version 1.5.4.
Colaert, Niklaas, Kenny Helsens, Lennart Martens, Joel Vandekerckhove, and Kris Gevaert. 2009. “Improved Visualization of Protein Consensus Sequences by iceLogo.” Nature Methods 6 (11):786–87.
Crooks, Gavin E., Gary Hon, John-Marc Chandonia, and Steven E. Brenner. 2004. “WebLogo: A Sequence Logo Generator.” Genome Research 14:1188–90.
Ou, Jianhong. 2012. “MotifStack: Plot Stacked Logos for Single or Multiple DNA, RNA and Amino Acid Sequence.” R Package Version 1.5.4.