Chemical metrics of reference proteomes for taxa

Here we visualize chemical metrics for archaeal and bacterial taxa and viruses using precomputed reference proteomes in the chem16S package.

Chemical metrics are molecular properties computed from elemental compositions – inferred from amino acid compositions of proteins – and include carbon oxidation state (Z_C) and stoichiometric hydration state (nH₂O), as described by Dick et al. (2020).
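
To make "computed from elemental compositions" concrete, here is a minimal sketch of the usual expression for the average oxidation state of carbon in a molecule with formula C_c H_h N_n O_o S_s and net charge Z. The helper name formula_Zc and its arguments are hypothetical and not part of chem16S or canprot, whose own implementation works from amino acid compositions and may differ in detail.

```r
# Hypothetical helper (not from chem16S or canprot): average oxidation state
# of carbon from counts of C, H, N, O, S and the net charge Z
formula_Zc <- function(nC, nH, nN = 0, nO = 0, nS = 0, Z = 0) {
  (-nH + 3 * nN + 2 * nO + 2 * nS + Z) / nC
}

formula_Zc(nC = 1, nH = 4)                   # methane, CH4: -4
formula_Zc(nC = 1, nH = 0, nO = 2)           # carbon dioxide, CO2: 4
formula_Zc(nC = 3, nH = 7, nN = 1, nO = 2)   # alanine, C3H7NO2: 0
formula_Zc(nC = 2, nH = 5, nN = 1, nO = 2)   # glycine, C2H5NO2: 1
```

Methane and carbon dioxide bracket the possible range, and amino acids fall in between.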

This vignette uses the RefSeq reference database for reference proteomes. The GTDB reference database is also available in chem16S (and is the default for functions in the package), but doesn’t have viral reference proteomes, which are visualized below.

Required packages

library(chem16S)

This vignette was compiled on 2025-03-21 with chem16S version 1.2.0-3.

Reference proteomes for taxa

Read the amino acid compositions of reference proteomes.
Show the counts of taxa at each level (taxonomic rank).

taxon_AA <- read.csv(system.file("RefDB/RefSeq_206/taxon_AA.csv.xz", package = "chem16S"))
ranks <- taxon_AA$protein
table(ranks)[unique(ranks)]

## ranks
##      species        genus       family        order        class       phylum 
##            1         4737          757          299          138           68 
## superkingdom 
##            3
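
The indexing by unique(ranks) is what keeps the counts in the order the ranks first appear in the file, rather than the alphabetical order that table() gives by default. A small sketch with made-up data (ranks_demo is an illustrative vector, not taken from the reference database):

```r
# Made-up rank labels, in the order they might appear in a file
ranks_demo <- c("genus", "genus", "family", "genus", "phylum", "family")

# table() sorts its categories alphabetically: family, genus, phylum
alphabetical <- table(ranks_demo)

# Indexing by unique() restores first-appearance order: genus, family, phylum
counts <- table(ranks_demo)[unique(ranks_demo)]
counts
```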

Calculate Z_C.
Make boxplots for the taxa at each rank.

taxon_Zc <- canprot::calc_metrics(taxon_AA, "Zc")[, 1]
Zc_list <- sapply(unique(ranks), function(rank) taxon_Zc[ranks == rank])
opar <- par(mar = c(4, 7, 1, 1))
boxplot(Zc_list, horizontal = TRUE, las = 1, xlab = chemlab("Zc"))

par(opar)
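
The grouping step relies on sapply() returning a named list when the groups have different lengths. If every group happened to have the same length, sapply() would simplify the result to a matrix. As a sketch with made-up values, split() is an alternative that always returns a list. The names values and groups are illustrative only.

```r
# Made-up metric values and the rank each one belongs to
values <- c(-0.20, -0.15, -0.10, -0.25, -0.05)
groups <- c("genus", "genus", "genus", "family", "phylum")

# Same pattern as above: one vector of values per rank (a list, because lengths differ)
by_sapply <- sapply(unique(groups), function(g) values[groups == g])

# Alternative: split() with factor levels set to first-appearance order
by_split <- split(values, factor(groups, levels = unique(groups)))

names(by_sapply)   # genus, family, phylum
lengths(by_split)  # 3, 1, 1
```

Either list can be passed directly to boxplot().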

Z_C for genera in selected phyla and classes

Read the list of taxonomic names.
Define a function to get the names of unique genera in a phylum.
Define a function to get Z_C for named genera.
Calculate mean Z_C for genera in selected phyla.

taxnames <- read.csv(system.file("RefDB/RefSeq_206/taxonomy.csv.xz", package = "chem16S"))
phylum_to_genus <- function(phylum) na.omit(unique(taxnames$genus[taxnames$phylum == phylum]))
get_Zc <- function(genera) na.omit(taxon_Zc[match(genera, taxon_AA$organism)])
sapply(sapply(sapply(c("Crenarchaeota", "Euryarchaeota"), phylum_to_genus), get_Zc), mean)

## Crenarchaeota Euryarchaeota 
##    -0.2148740    -0.1220774
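
The get_Zc() function depends on match() returning NA for genera that have no reference proteome, and on na.omit() dropping those entries before the mean is taken. A sketch with made-up names and values (not taken from the reference database):

```r
# Made-up lookup table: taxon names and their metric values
taxa <- c("Sulfolobus", "Methanococcus", "Halobacterium")
Zc <- c(-0.21, -0.23, -0.08)

# Requested names; "Unknownium" is deliberately absent from the table
wanted <- c("Halobacterium", "Unknownium", "Sulfolobus")

idx <- match(wanted, taxa)  # 3 NA 1
found <- na.omit(Zc[idx])   # the NA from the missing name is dropped
mean(found)                 # mean of -0.08 and -0.21
```

Without na.omit(), a single missing genus would make the mean NA.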

Within the Euryarchaeota, there are classes with extremely high and low Z_C (see below and Dick and Tan, 2023). Let’s look at a couple of them:

class_to_genus <- function(class) na.omit(unique(taxnames$genus[taxnames$class == class]))
sapply(sapply(sapply(c("Methanococci", "Halobacteria"), class_to_genus), get_Zc), mean)

## Methanococci Halobacteria 
##  -0.22562725  -0.08043995

Prokaryotic genera, aggregated to phylum

Read full list of taxonomic names.
Remove viruses.
Prune to keep unique genera.
Count number of genera in each phylum.
Calculate Z_C for genera in 20 most highly represented phyla of Bacteria and Archaea.
Order according to mean Z_C.
Make boxplots.

taxnames2 <- taxnames[taxnames$superkingdom != "Viruses", ]
taxnames3 <- taxnames2[!duplicated(taxnames2$genus), ]
(top20_phyla <- head(sort(table(taxnames3$phylum), decreasing = TRUE), 20))

## 
##              Proteobacteria                  Firmicutes 
##                        1375                         672 
##              Actinobacteria               Bacteroidetes 
##                         419                         354 
##               Cyanobacteria               Euryarchaeota 
##                         113                         107 
##              Planctomycetes                 Chloroflexi 
##                          62                          38 
##             Verrucomicrobia               Crenarchaeota 
##                          32                          28 
##               Acidobacteria                Spirochaetes 
##                          25                          19 
##                 Thermotogae                   Aquificae 
##                          14                          13 
##               Synergistetes                  Chlamydiae 
##                          13                          12 
##                Fusobacteria                 Tenericutes 
##                          12                          12 
##              Thaumarchaeota Candidatus Thermoplasmatota 
##                          12                          10

Zc_list <- sapply(sapply(names(top20_phyla), phylum_to_genus), get_Zc)
order_Zc <- order(sapply(Zc_list, mean))
Zc_list <- Zc_list[order_Zc]
opar <- par(mar = c(4, 13, 1, 1))
boxplot(Zc_list, horizontal = TRUE, las = 1, xlab = chemlab("Zc"))

par(opar)
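
The ordering step can be sketched on a small made-up list. Because boxplot() with horizontal = TRUE draws the first list element at the bottom, sorting by increasing mean puts the group with the lowest mean at the bottom of the plot. The list demo and its values are illustrative only.

```r
# Made-up groups of values
demo <- list(A = c(-0.10, -0.12), B = c(-0.25, -0.20, -0.22), C = c(-0.15))

# Reorder the list by the mean of each element, lowest first
demo_sorted <- demo[order(sapply(demo, mean))]
names(demo_sorted)  # B, C, A
```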

Prokaryotic genera, aggregated to class

Notice the large range for Euryarchaeota and Proteobacteria. Let’s take a closer look at the classes within each phylum.

Loop over phyla (Euryarchaeota and Proteobacteria).
- Get names of all classes within the phylum.
- Calculate Z_C for genera in each class.
- Order according to mean Z_C.
- Make boxplots.

opar <- par(mfrow = c(1, 2), mar = c(4, 10, 1, 1))
for(phylum in c("Euryarchaeota", "Proteobacteria")) {
  taxnames4 <- taxnames3[taxnames3$phylum == phylum, ]
  classes <- na.omit(unique(taxnames4$class))
  Zc_list <- sapply(sapply(classes, class_to_genus), get_Zc)
  order_Zc <- order(sapply(Zc_list, mean))
  Zc_list <- Zc_list[order_Zc]
  boxplot(Zc_list, horizontal = TRUE, las = 1, xlab = chemlab("Zc"))
}

par(opar)

See take-home message #1.

Stoichiometric hydration state (nH₂O)

As above, but calculate nH₂O instead of Z_C.

taxon_nH2O <- canprot::calc_metrics(taxon_AA, "nH2O")[, 1]
get_nH2O <- function(genera) na.omit(taxon_nH2O[match(genera, taxon_AA$organism)])

opar <- par(mfrow = c(1, 2), mar = c(4, 10, 1, 1))
for(phylum in c("Euryarchaeota", "Proteobacteria")) {
  taxnames4 <- taxnames3[taxnames3$phylum == phylum, ]
  classes <- na.omit(unique(taxnames4$class))
  nH2O_list <- sapply(sapply(classes, class_to_genus), get_nH2O)
  order_nH2O <- order(sapply(nH2O_list, mean))
  nH2O_list <- nH2O_list[order_nH2O]
  boxplot(nH2O_list, horizontal = TRUE, las = 1, xlab = chemlab("nH2O"))
}

par(opar)

See take-home message #2.

Viruses and prokaryotes

Plot Z_C and nH₂O for viral and prokaryotic phyla with the most genus-level representatives.
Label Bacteria with the lowest nH₂O.

(top50_phyla <- head(sort(table(taxnames$phylum), decreasing = TRUE), 50))

## 
##              Proteobacteria              Actinobacteria 
##                       17362                        7569 
##                  Firmicutes                 Uroviricota 
##                        7033                        3982 
##               Bacteroidetes                Pisuviricota 
##                        3224                         945 
##            Cressdnaviricota             Kitrinoviricota 
##                         927                         812 
##               Cyanobacteria               Euryarchaeota 
##                         760                         733 
##             Negarnaviricota               Cossaviricota 
##                         664                         471 
##                 Tenericutes            Duplornaviricota 
##                         271                         228 
##              Artverviricota                Spirochaetes 
##                         204                         202 
##             Verrucomicrobia              Planctomycetes 
##                         181                         169 
##          Nucleocytoviricota         Deinococcus-Thermus 
##                         142                         131 
##               Peploviricota           Preplasmiviricota 
##                         116                         115 
##                Fusobacteria                 Thermotogae 
##                         110                         106 
##               Crenarchaeota                 Chloroflexi 
##                          93                          92 
##               Acidobacteria               Lenarviricota 
##                          87                          80 
##                Phixviricota              Thaumarchaeota 
##                          62                          60 
##              Hofneiviricota                  Chlamydiae 
##                          48                          47 
##               Fibrobacteres                   Aquificae 
##                          46                          45 
##               Synergistetes Candidatus Thermoplasmatota 
##                          37                          35 
##                    Chlorobi                 Nitrospirae 
##                          33                          29 
##                Balneolaeota       Thermodesulfobacteria 
##                          22                          17 
##                Saleviricota Candidatus Saccharibacteria 
##                          14                          11 
##             Deferribacteres             Armatimonadetes 
##                           9                           8 
##              Dividoviricota            Gemmatimonadetes 
##                           7                           6 
##     Candidatus Cryosericota        Candidatus Kryptonia 
##                           5                           5 
##          Kiritimatiellaeota               Lentisphaerae 
##                           5                           5

Zc_mean <- sapply(sapply(sapply(names(top50_phyla), phylum_to_genus), get_Zc), mean)
nH2O_mean <- sapply(sapply(sapply(names(top50_phyla), phylum_to_genus), get_nH2O), mean)
domain <- taxnames$superkingdom[match(names(top50_phyla), taxnames$phylum)]
pchs <- c(24, 21, 23)
pch <- sapply(domain, switch, Archaea = pchs[1], Bacteria = pchs[2], Viruses = pchs[3])
bgs <- topo.colors(3, alpha = 0.5)
bg <- sapply(domain, switch, Archaea = bgs[1], Bacteria = bgs[2], Viruses = bgs[3])
opar <- par(mar = c(4, 4, 1, 1))
plot(Zc_mean, nH2O_mean, xlab = chemlab("Zc"), ylab = chemlab("nH2O"), pch = pch, bg = bg)
ilow <- nH2O_mean < -0.77 & domain == "Bacteria"
xadj <- c(-0.9, -0.8, 0.8, 1, -0.8)
yadj <- c(0, 1, 1, -1, -1)
text(Zc_mean[ilow] + 0.02 * xadj, nH2O_mean[ilow] + 0.005 * yadj, names(top50_phyla[ilow]), cex = 0.9)
legend("bottomleft", c("Archaea", "Bacteria", "Viruses"), pch = pchs, pt.bg = bgs)

par(opar)
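
The plotting symbols and colors are assigned by applying switch() to each domain name. A sketch of that idiom next to a named-vector lookup, which gives the same result for these inputs. The names domain_demo and pch_by_domain are illustrative only.

```r
# Made-up domain labels, one per plotted point
domain_demo <- c("Bacteria", "Viruses", "Archaea", "Bacteria")

# Idiom used above: call switch() once per element
pch_switch <- sapply(domain_demo, switch, Archaea = 24, Bacteria = 21, Viruses = 23)

# Alternative: index a named vector by the labels
pch_by_domain <- c(Archaea = 24, Bacteria = 21, Viruses = 23)
pch_lookup <- pch_by_domain[domain_demo]

unname(pch_lookup)  # 21 23 24 21
```

One difference to keep in mind: for a label with no match, switch() returns NULL, so sapply() returns a list, whereas the named-vector lookup returns NA.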

See take-home message #3.

Other metrics

Besides Z_C and nH₂O, the calc_metrics() function in canprot can calculate elemental ratios (H/C, N/C, O/C, and S/C), grand average of hydropathicity (GRAVY), isoelectric point (pI), average molecular weight of amino acid residues (MW), and protein length. The code below uses the classes of Proteobacteria, which is the value of classes left by the last iteration of the loop above.

AAcomp <- taxon_AA[match(classes, taxon_AA$organism), ]
metrics <- canprot::calc_metrics(AAcomp, c("HC", "OC", "NC", "SC", "GRAVY", "pI", "MW", "plength"))
layout(rbind(c(1, 2, 5), c(3, 4, 5)), widths = c(2, 2, 1.5))
opar <- par(mar = c(4.5, 4, 1, 1), cex = 1)
plot(metrics$OC, metrics$HC, col = 1:10, pch = 1:10, xlab = "O/C", ylab = "H/C")
plot(metrics$NC, metrics$SC, col = 1:10, pch = 1:10, xlab = "N/C", ylab = "S/C")
plot(metrics$pI, metrics$GRAVY, col = 1:10, pch = 1:10, xlab = "pI", ylab = "GRAVY")
plot(metrics$plength, metrics$MW, col = 1:10, pch = 1:10, xlab = "Length", ylab = "MW")
plot.new()
legend("right", classes, col = 1:10, pch = 1:10, bty = "n", xpd = NA)

par(opar)

Take-home messages

1. Methanococci and Epsilonproteobacteria are prokaryotic classes whose genera have proteins with the lowest mean Z_C in their respective phyla.
2. Proteins in Gammaproteobacteria tend to have lower nH₂O than Alpha- and Betaproteobacteria.
3. Viral proteins have lower nH₂O than most Bacteria and Archaea.

Respectively, these findings suggest genomic adaptation by Methanococci and Epsilonproteobacteria (the latter now known as Campylobacterota) to reducing environments, such as submarine hot springs and anoxic zones of sediments; by Gammaproteobacteria to lower water availability in certain habitats; and by viruses to lower water availability in their environment. A notable observation in this regard is that viruses without an envelope have lower water content than bacterial cells (Matthews, 1975).

In summary, chemical metrics provide insight into how environmental factors shape the amino acid and elemental composition of proteins.

References

Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85(4): 1338–1355. doi: 10.1007/s00248-022-01988-9

Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020

Matthews REF. 1975. A classification of virus groups based on the size of the particle in relation to genome size. Journal of General Virology 27(2): 135–149. doi: 10.1099/0022-1317-27-2-135