This vignette demonstrates the integration of chem16S with phyloseq to calculate chemical metrics of community reference proteomes.
This vignette was compiled on 2024-11-23 with chem16S version 1.1.0-9 and phyloseq version 1.51.0.
The mothur MiSeq SOP
has an example dataset with 16S rRNA gene sequences for mouse gut
communities, which was extracted from the complete dataset reported by
Schloss et al. (2012). This example dataset includes data
collected at 0–9 (early) and 141-150 (late) days post-weaning from one
mouse. This dataset is used in the DADA2 Pipeline
Tutorial to demonstrate construction of amplicon sequence variants
(ASV), taxonomic classification of the ASVs, and analysis using the
phyloseq package. Here, we load the
phyloseq-class
object that was created on 2023-06-14 using
dada2 (version 1.28.0) by following the steps in the
DADA2 pipeline tutorial (version 1.16). Note that taxonomy was assigned
to genus level using a newer version of the Silva training set (silva_nr99_v138.1_train_set.fa.gz
)
than that indicated in the DADA2 tutorial on this date
(silva_nr_v132_train_set.fa.gz
).
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 232 taxa and 19 samples ]
## sample_data() Sample Data: [ 19 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 232 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 232 reference sequences ]
To start to visualize this data, let’s recreate the bar plot from the DADA2 tutorial:
top20.silva <- names(sort(taxa_sums(mouse.silva), decreasing = TRUE))[1:20]
mouse.silva.top20 <- transform_sample_counts(mouse.silva, function(OTU) OTU/sum(OTU))
mouse.silva.top20 <- prune_taxa(top20.silva, mouse.silva.top20)
plot_bar(mouse.silva.top20, x = "Day", fill = "Family") + facet_wrap(~When, scales = "free_x")
The chem16S package contains the amino acid compositions of archaeal, bacterial, and viral taxa at levels from genus to phylum constructed from the RefSeq database; see the vignette Chemical metrics of reference proteomes for more information. There are denoted as reference proteomes for taxa. Several functions are available to identify the lowest-level taxon for each classified read (or in this case, each ASV), combine the reference proteomes for all taxa to create community reference proteomes, and calculate chemical metrics from them.
First, let’s make a table that lists the lowest-level taxon for each
ASV and their abundances. Because chem16S was first
developed to analyze the output from the RDP Classifier, this data frame
has a format that is similar to the file created by using the
classify -h
command of the RDP Classifier followed by the
merge-count
command. That is, the data frame has columns
named taxid
, rank
, lineage
, and
name
, followed by columns for all samples. Because of the
long text in the taxid
column (see below), let’s look at
the other columns first.
## lineage
## ASV1 Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Muribaculaceae
## ASV5 Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides
## ASV8 Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Rikenellaceae;Alistipes
## ASV11 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus
## ASV13 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Ligilactobacillus
## ASV15 Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnospiraceae NK4A136 group
## name rank F3D0 F3D141 F3D142 F3D143 F3D144
## ASV1 Muribaculaceae family 3370 2927 1734 1603 2535
## ASV5 Bacteroides genus 154 189 180 130 104
## ASV8 Alistipes genus 184 321 89 83 41
## ASV11 Lactobacillus genus 17 168 42 78 269
## ASV13 Ligilactobacillus genus 52 12 103 43 16
## ASV15 Lachnospiraceae NK4A136 group genus 895 245 76 76 116
## F3D145 F3D146 F3D147 F3D148 F3D149 F3D150 F3D1 F3D2 F3D3 F3D5 F3D6 F3D7
## ASV1 4346 2205 9454 6365 6123 2482 1469 8711 3216 1719 3583 2667
## ASV5 307 179 453 443 417 169 140 338 402 151 476 470
## ASV8 125 71 75 507 515 120 190 1211 381 207 261 213
## ASV11 317 178 448 411 478 61 102 55 171 61 45 16
## ASV13 22 4 147 18 88 64 129 330 94 48 422 117
## ASV15 94 230 460 409 455 218 948 1739 178 564 580 192
## F3D8 F3D9
## ASV1 1647 2313
## ASV5 582 596
## ASV8 286 438
## ASV11 19 27
## ASV13 145 182
## ASV15 512 772
ASVs with identical lowest-level taxonomic assignments are merged, so
this data frame has fewer rows than the number of ASVs. The short names
of the ASVs are stored in the taxid
column, and the names
of multiple ASVs with the same lowest-level taxonomic classification are
separated by a semicolon.
## [1] "ASV1;ASV2;ASV3;ASV4;ASV6;ASV7;ASV9;ASV10;ASV12;ASV14;ASV17;ASV18;ASV20;ASV100;ASV143"
## [2] "ASV5;ASV80;ASV163"
## [3] "ASV8"
## [4] "ASV11"
## [5] "ASV13"
## [6] "ASV15;ASV23;ASV24;ASV30;ASV31;ASV33;ASV36;ASV37;ASV38;ASV56;ASV57;ASV69;ASV73;ASV84;ASV91;ASV94;ASV111;ASV129;ASV134;ASV162;ASV180"
As explained in the DADA2 tutorial, the ASVs themselves (that is, the
DNA sequences, not their short names) are available in the
phyloseq-class
object, but they are not used here. Now
let’s list the number of unique lowest-level classifications at each
taxonomic level:
These taxa will be used below to construct community reference proteomes.
A challenge of using reference proteomes is that they come from one source (GTDB and RefSeq are available in chem16S), but taxonomic assignments may be made using a different training set (such as Silva or RDP). While identical names and ranks can be used to automatically match taxonomic assignments to reference proteomes, inherent differences in available taxonomies pose difficulties. For example, RDP uses the name Escherichia/Shigella, for which the closest correspondence in RefSeq is Escherichia. This and other manual mappings were used by Dick and Tan (2023) for mapping between RDP and RefSeq.
Let’s analyze a phyloseq-class
object that was made
using the Silva training set but use the manual mapping that was
designed to map from RDP to RefSeq.
The manual mapping from RDP to RefSeq was not designed for taxonomic assignments made using Silva. This is just an example to show the limitations of the technique.
map.silva <- map_taxa(tc.silva, refdb = "RefSeq_206")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"
family_Ruminococcaceae –> family_Oscillospiraceae (0.1%)
## [1] "map_taxa: can't map groups genus_Lachnospiraceae NK4A136 group (7.3%), order_RF39 (1.35%), 22 others (5%)"
## [1] "map_taxa: mapping rate to RefSeq_206 taxonomy is 86.4%"
The messages show that one group was manually mapped, and the total mapping rate (including both automatic name matching and manual mapping) is 86.4%.
Not let’s analyze a phyloseq-class
object that was made
using the RDP training set. This results in a considerably higher
mapping rate to the NCBI taxonomy.
data(mouse.RDP)
tc.RDP <- ps_taxacounts(mouse.RDP)
map.RDP <- map_taxa(tc.RDP, refdb = "RefSeq_206")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"
order_Clostridiales –> order_Eubacteriales (1.4%)
family_Ruminococcaceae –> family_Oscillospiraceae (0.4%)
## [1] "map_taxa: can't map groups genus_Clostridium_XlVa (0.47%), genus_Ihubacter (0.29%), 7 others (0.54%)"
## [1] "map_taxa: mapping rate to RefSeq_206 taxonomy is 98.7%"
Now the total mapping rate is 98.7%.
Manual mapping does not overcome inconsistencies between different taxonomies but is simply a starting point for getting the most out of the available taxonomic classifications and reference proteomes. Using the Genome Taxonomy Database (GTDB) for both taxonomic classification and reference proteomes obviates the need for manual mapping and is described further below.
The ps_metrics()
function in chem16S
wraps the previous two functions (ps_taxacounts()
and
map_taxa()
) as well as another function
(get_metrics()
). In order, these functions get the
lowest-level taxon for each OTU or ASV, map taxonomic names to RefSeq,
and combine taxonomic abundances with reference proteomes for taxa to
calculate the amino acid composition of community reference proteomes,
which are then used to calculate chemical metrics. To peek into this
process, let’s begin by listing the amino acid compositions of some
community reference proteomes.
AA.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE, return_AA = TRUE)
head(AA.RDP)
## Run chains Ala Cys Asp Glu Phe Gly
## 1 F3D0 6311 168957.04 28145.99 127964.63 144258.44 90485.66 153197.94
## 2 F3D141 4844 131696.18 21018.59 99639.82 109007.77 70725.38 118533.00
## 3 F3D142 2504 68690.57 10656.83 52038.20 55416.74 36915.27 61378.43
## 4 F3D143 2506 68110.94 10812.55 51760.34 56159.19 36655.56 61374.30
## 5 F3D144 3475 94794.51 14396.10 72002.10 75436.86 50753.16 84495.19
## 6 F3D145 5777 158713.35 24481.12 120504.38 126494.34 85184.00 141730.49
## His Ile Lys Leu Met Asn Pro Gln
## 1 38825.56 150203.83 129193.60 190479.38 61593.45 97793.83 81072.95 64847.15
## 2 30605.73 115752.86 98376.59 147717.39 47186.31 76797.75 64185.98 50711.12
## 3 16112.96 59866.06 50578.38 77032.13 24249.03 40450.34 33999.57 26543.86
## 4 15866.29 59980.25 51043.97 76421.37 24496.76 39949.30 33278.73 26250.61
## 5 22360.65 83019.84 69553.84 106001.86 33597.48 55812.08 46764.89 36638.24
## 6 37246.79 138367.70 115624.12 176975.14 56166.29 93440.60 78714.49 60420.63
## Arg Ser Thr Val Trp Tyr
## 1 108700.05 135531.51 118757.78 144695.16 23899.49 88542.69
## 2 86150.54 106069.75 92865.09 112114.17 19198.37 68566.76
## 3 45216.96 55896.96 48672.76 58110.19 10238.79 35484.42
## 4 44330.80 55167.48 48100.75 58075.59 9968.14 35548.78
## 5 61656.32 77077.97 66902.65 80283.65 13955.21 48662.21
## 6 104476.27 129395.65 111906.70 133940.97 23651.22 81864.74
In this data frame, columns named Run
and
chains
have the sample names and abundance-weighted numbers
of reference proteomes – that is, the total abundance of ASVs that have
taxonomic classifications mapped to the NCBI taxonomy – that are used to
compute the amino acid composition. We can use
canprot::calc_metrics()
to calculate the average protein
length of each community reference proteome, then make a histogram:
lengths <- canprot::calc_metrics(AA.RDP, "length")[, 1]
hist(lengths)
imin <- which.min(lengths)
text(lengths[imin], 1.5, AA.RDP$Run[imin], adj = 1)
Compared to the other samples, that from Day 143 has the lowest average protein length of the community reference proteome.
We’re now ready to calculcate chemical metrics from the amino acid compositions.
metrics.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE)
head(metrics.RDP)
## Zc nO2 nH2O
## F3D0 -0.1567602 -0.7047975 -0.7698409
## F3D141 -0.1549554 -0.7025843 -0.7738190
## F3D142 -0.1536513 -0.7007578 -0.7759440
## F3D143 -0.1549676 -0.7025566 -0.7740865
## F3D144 -0.1543925 -0.7014054 -0.7745432
## F3D145 -0.1532613 -0.7000987 -0.7767563
This data frame has columns for ZC (carbon oxidation state), nO2 (stoichiometric oxidation state), and nH2O (stoichiometric hydration state). ZC is computed from the elemental formula, while nO2 and nH2O are coefficients on O2 and H2O in theoretical formation reactions of the proteins, per residue, from a particular set of thermodynamic components, also known as basis species. The rationale for the choice of basis species used here (glutamine, glutamic acid, cysteine, O2, and H2O, abbreviated as QEC) was given by Dick et al. (2020). Because they are both measures of oxidation state, in general there is a strong positive correlation between ZC and nO2. For this dataset, there is in addition a strong negative correlation between ZC and nH2O, but other datasets do not show such a correlation.
cor(metrics.RDP$Zc, metrics.RDP$nO2)
## [1] 0.9937252
cor(metrics.RDP$Zc, metrics.RDP$nH2O)
## [1] -0.9266887
Let’s plot each of these chemical metrics against the sampling day.
plot_ps_metrics(mouse.RDP, x = "Day", color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) +
geom_point(size = 3)
Let’s plot two chemical metrics against each other.
plot_ps_metrics2(mouse.RDP, color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) +
geom_point(size = 3)
The community reference proteomes for Late samples tend to be more oxidized and less hydrated (i.e., they have higher ZC and lower nH2O) than Early samples, but there is some overlap between the groups.
As noted above, mapping taxon names from Silva or RDP to RefSeq is
not perfect. To overcome this limitation, the GTDB can be used for both
taxonomic classification of 16S rRNA gene sequences and for reference
proteomes of taxa. Here we load a phyloseq-class
object
prepared following the DADA2 Pipeline Tutorial modified to classify 16S
rRNA gene sequences using GTDB r220 (Alishum, 2024). Then, we plot chemical
metrics calculated for reference proteomes also derived from the GTDB,
but a later version. (The reason for this is that DADA2-formatted 16S
rRNA sequences from GTDB r220 are not yet available in the Zenodo
repository maintained by Alishum (2024).)
data(mouse.GTDB_220)
plot_ps_metrics2(mouse.GTDB_220, refdb = "GTDB_220", color = "When", shape = "When") + geom_point(size = 3)
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
Compared to using the RDP for classification of 16S rRNA sequences and RefSeq for reference proteomes, using the GTDB for both yields a clearer separation of Early and Late samples in chemical space. There is one Early sample with anomalously high ZC and low nH2O; let’s identify it:
metrics.GTDB <- ps_metrics(mouse.GTDB_220)
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
is.early <- sample_data(mouse.GTDB_220)$When == "Early"
iout <- which.min(metrics.GTDB[is.early, ]$nH2O)
(day <- sample_data(mouse.GTDB_220)[is.early, ]$Day[iout])
## [1] 7
This is the sample from day 7. Let’s look again at the taxonomic classifications, this time at the phylum level in GTDB:
top20.GTDB <- names(sort(taxa_sums(mouse.GTDB_220), decreasing = TRUE))[1:20]
mouse.GTDB_220.top20 <- transform_sample_counts(mouse.GTDB_220, function(OTU) OTU/sum(OTU))
mouse.GTDB_220.top20 <- prune_taxa(top20.GTDB, mouse.GTDB_220.top20)
plot_bar(mouse.GTDB_220.top20, x = "Day", fill = "Phylum") + facet_wrap(~When, scales = "free_x")
This plot suggests that a relatively high proportion of Bacteroidota makes the Day 7 sample more similar to samples in the Late group. Bacteroidetes (an older name for Bacteroidota) have some of the lowest nH2O among major bacterial phyla (see Dick and Tan, 2023), which could partially explain the low nH2O observed for the Day 7 sample.
quiet = FALSE
argument to
map_taxa()
, ps_metrics()
,
plot_ps_metrics()
, and
plot_ps_metrics2()
.Beyond conventional diversity metrics used for microbiome analysis, visualizing the chemical metrics of community reference proteomes can provide new insight into genomic adaptation. See the next vignette (Plotting two chemical metrics) for more examples.