Community-level chemical metrics are computed from the elemental compositions of community reference proteomes, which in turn are derived from genomic protein sequences weighted by taxonomic abundances. Unlike the synthetic variables in ordination methods (e.g. the principal components in PCA), chemical metrics are analogous to thermodynamic components that are defined independently of the data, so they can be compared across datasets. Their theoretical foundation and cross-dataset applicability facilitates the use of chemical metrics to explore hypotheses about genomic adaptation to multiple physicochemical variables at a global scale.
In this vignette we’ll analyze phyloseq’s
GlobalPatterns
dataset (based on
data from Caporaso et al., 2011) to
visualize chemical variation of community reference proteomes across
environments. Then, we’ll explore specific hypotheses about the effects
of redox conditions and salinity on genomic adaptation by analyzing
datasets for microbial communities in marine sediment (Fonseca et al., 2022)
and geothermal waters (Zhang et al., 2023).
require(chem16S)
require(phyloseq)
require(ggplot2)
theme_set(theme_bw())
# For composing plots and making a common legend (plot_layout())
require(patchwork)
# For annotating plots with regression coefficients (stat_poly_line())
require(ggpmisc)
This vignette was compiled on 2024-11-23 with chem16S version 1.1.0-9 and phyloseq version 1.51.0.
We will use the GlobalPatterns
dataset ‘as-is’, without
the preprocessing described in phyloseq’s Ordination
Plots tutorial. There, less-abundant OTUs and phyla were removed in
order to show high-level trends and shorten computing time. One step we
do take from that tutorial is the addition of a categorical variable
that identifies whether the samples are human-associated:
data(GlobalPatterns)
Human = get_variable(GlobalPatterns, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
sample_data(GlobalPatterns)$Human <- factor(Human)
Taxonomic assignments reported by Caporaso et
al. (2011) were made using the RDP
Classifier (presumably with its default training set), so we use the
refdb = "RefSeq_206"
argument to use manual mappings
between RDP and RefSeq described by Dick and Tan
(2023).
p2 <- plot_ps_metrics2(GlobalPatterns, color = "SampleType", shape = "Human", refdb = "RefSeq_206")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"
order_Rhizobiales –> order_Hyphomicrobiales (0.2%)
order_Clostridiales –> order_Eubacteriales (0.6%)
family_Ruminococcaceae –> family_Oscillospiraceae (3.1%)
## [1] "map_taxa: can't map groups order_Stramenopiles (12.94%), family_ACK-M1 (3.27%), 374 others (11.75%)"
## [1] "map_taxa: mapping rate to RefSeq_206 taxonomy is 71.9%"
p2 + geom_polygon(aes(fill = SampleType), alpha = 0.5) + geom_point(size = 3) +
guides(colour = guide_legend(override.aes = list(shape = c(17, 19, 19, 17, 19, 19, 17, 19, 17))))
At the extremes of carbon oxidation state (ZC), soil communities are the most oxidized and skin and tongue communities are the most reduced. At the extremes of stoichiometric hydration state (nH2O), skin communities are the most hydrated and some fecal communities are the least hydrated. In more detail, there are environmental microbiomes that show similar ranges of chemical metrics (e.g., Freshwater (creek) and Sediment (estuary)) and others that are different. Freshwater – described as “lake” by Caporaso et al. (2011) – has lower ZC than Freshwater (creek), and some ocean samples have lower nH2O than either freshwater group. These patterns could suggest an influence of greater oxygenation in flowing water compared to lakes (this is a distinction between lotic and lentic systems), and dehydration in communities adapted to life in salty water compared to freshwater.
Fonseca et al. (2022) reported 16S rRNA gene sequences for sediment samples from the oxygen minimum zone of the Pacific Ocean along the coast of Chile, known as the Humboldt Sulfuretum. The sample data include dissolved oxygen, redox potential in sediment and overlying water, and organic matter (OM) content. This is a useful dataset for exploring the hypothesis that ZC of proteins is shaped by redox conditions.
Here we read the phyloseq-class
object created by using
DADA2 (Callahan et al.,
2016) to identify amplicon sequence variants (ASVs) in this
dataset and to classify them using the GTDB training set. A sample taken
from 50 m depth at the Valparaiso location on 2012-05-12 is available in
the Sequence Read Archive (SRA) but was not included in the analysis
described by Fonseca et al. (2022). The taxonomic composition of this
sample is highly different from the all the others (see the ordination
plots in the extdata
directory where the
ps_FEN+22.rds
file is located), so we exclude it to avoid
anomalous results.
psfile <- system.file("extdata/DADA2-GTDB_220/FEN+22/ps_FEN+22.rds", package = "chem16S")
ps <- readRDS(psfile)
ps <- prune_samples(sample_names(ps) != "SRR1346095", ps)
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2095 taxa and 13 samples ]
## sample_data() Sample Data: [ 13 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 2095 taxa by 6 taxonomic ranks ]
## refseq() DNAStringSet: [ 2095 reference sequences ]
Then, we plot ZC and nH2O for the community reference proteomes using different colors for sample groups.
plot_ps_metrics2(ps, refdb = "GTDB_220", color = "Location") +
geom_polygon(aes(fill = Location), alpha = 0.5) + geom_point(size = 3)
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
Among the areas with more than one sample, the community reference proteomes for Iquique are more reduced (i.e., have lower ZC) than those for Concepcion and Valparaiso. Two of the communities at Valparaiso are also characterized by higher nH2O, suggesting the influence of a hydrating factor.
Let’s take a step toward more quantitative tests of these hypotheses about genomic adaptation to environments. The color scales in the next two plots reflect sediment redox potential and concentration of organic matter. The rationale for choosing these environmental measurements is described below.
p2 <- plot_ps_metrics2(ps, refdb = "GTDB_220", color = "Sediment_redox") +
geom_point(size = 4) + labs(color = "Sediment redox (Eh)")
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
p3 <- plot_ps_metrics2(ps, refdb = "GTDB_220", color = "Organic_matter") +
geom_point(size = 4) + labs(color = "Organic matter (%)")
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
Thermodynamic considerations predict a positive correlation between redox potential and ZC (Dick and Meng, 2023). It has also been predicted that salinity has a dehydrating effect that favors proteins with nH2O (Dick et al., 2020). However, these samples have no documented salinity gradient. Previous observations that protein expression in cells is shifted toward lower nH2O under hyperglycemic (high-glucose) conditions (Dick et al., 2020) suggest another hypothesis: a higher content of organic matter may be a proxy for dehydrating conditions.
We can use correlations between two environmental variables (redox potential or OM) and two chemical metrics for communities (ZC or nH2O) in order to test these hypotheses. To make the plots, let’s construct a single data frame containing the sample data and chemical metrics.
sample.data.and.chemical.metrics.for.communities <- cbind(sample_data(ps), ps_metrics(ps, refdb = "GTDB_220"))
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
Now let’s write a function to create a scatter plot for two variables and add a regression line. We use this function to make a plot for each combination of environmental variable and chemical metric.
We find that carbon oxidation state is positively correlated with redox potential, and stoichiometric hydration state is negatively correlated with organic matter content. Taken alone, each of these correlations supports our initial hypotheses. However, in part because of strong covariation of the environmental variables, ZC is also negatively correlated with OM content, and nH2O is positively correlated with redox potential.
The covariation of environmental variables makes it difficult to identify primary factors that drive the observed differences between communities. However, the chemical nature of these variables provides additional clues. The covariation of environmental variables (higher OM content with lower redox potential) makes sense if greater availability of organic compounds drives respiration and ensuing depletion of oxygen. This interaction among environmental variables yields a mechanistic hypothesis for the positive association between ZC and nH2O (see first plot above), which could not be explained by our initial hypotheses about the effects of single variables.
Zhang et al. (2023) reported 16S rRNA gene sequences for mildly alkaline hot spring reservoirs in the Qinghai-Tibet Plateau. The following general predictions can be made about these data:
Let’s test these predictions by doing some calculations. The following commands load the data and plot two environmental variables (ORP and temperature (T)) against two chemical metrics.
psfile2 <- system.file("extdata/DADA2-GTDB_220/ZFZ+23/ps_ZFZ+23.rds", package = "chem16S")
ps2 <- readRDS(psfile2)
data.and.metrics <- cbind(sample_data(ps2), ps_metrics(ps2, refdb = "GTDB_220"))
## [1] "map_taxa: can't map groups family_Koribacteraceae (0.02%), family_Phormidesmiaceae (0.02%), 2 others (0.02%)"
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 99.9%"
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 9466 taxa and 7 samples ]
## sample_data() Sample Data: [ 7 samples by 24 sample variables ]
## tax_table() Taxonomy Table: [ 9466 taxa by 6 taxonomic ranks ]
## refseq() DNAStringSet: [ 9466 reference sequences ]
The correlation between ORP and ZC isn’t as strong as might be predicted. Moreover, neither of the chemical metrics is strongly associated with temperature. Therefore, this dataset seems to be an exception to the notion that particular chemical metrics of community reference proteomes are shaped by the environment at a local scale.
But let’s not forget about the global-scale predictions! How do
communities in hot springs compare to those in ocean sediments? In order
to make a plot, we can merge both datasets into a new
phyloseq-class
object. The sequence processing pipeline
assigned the same taxon names to both datasets (ASV1
,
ASV2
, etc.). Therefore, let’s append a letter to one set of
names so that distinct taxa are not mistakenly combined.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 11561 taxa and 20 samples ]
## sample_data() Sample Data: [ 20 samples by 34 sample variables ]
## tax_table() Taxonomy Table: [ 11561 taxa by 6 taxonomic ranks ]
## refseq() DNAStringSet: [ 11561 reference sequences ]
Let’s add a column to the sample data to indicate the type of environment for each dataset, then plot nH2O against ZC.
sample_data(ps_merged)$Environment <-
ifelse(is.na(sample_data(ps_merged)$Depth), "Hot spring", "Marine sediment")
plot_ps_metrics2(ps_merged, refdb = "GTDB_220", color = "Environment", shape = "Environment") +
geom_point(size = 3)
## [1] "map_taxa: can't map groups family_Koribacteraceae (0.02%), family_Phormidesmiaceae (0.02%), 2 others (0.02%)"
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 99.9%"
In most cases, the communities from hot springs in the Qinghai-Tibet Plateau have lower ZC and higher nH2O compared to those in marine sediments of the Humboldt Sulfuretum. This outcome is consistent with predictions about genomic adaptation to relatively more reducing and less saline conditions of the hot springs.
The positive association between ZC and nH2O for the sediment communities does not extend to the comparison between two datasets. This suggests different influences of environmental factors on community-level elemental composition at local and global scales.
Let’s add more data to the previous comparison, specifically
freshwater and marine samples from the Baltic Sea salinity gradient
(Herlemann et al.,
2016). The first part of the following code chunk is adapted
from the help page for plot_metrics()
. A few lines are
added to remove brackish samples and adjust the plotting symbols for the
remaining samples. Then, the values for hot springs and sediments are
appended to the data frame. Finally, new colors are assigned and the
plot is made. Note: this chunk does not depend on running the
previous code, except for library(chem16S)
and
library(phyloseq)
.
Notice how more reducing samples (hot spring and sediment) have relatively low ZC and more saline samples (marine water and sediment) have relatively low nH2O.