--- title: "Integration of chem16S with phyloseq" author: "Jeffrey M. Dick" output: html_document: mathjax: null toc: true css: style.css vignette: > %\VignetteIndexEntry{Integration of chem16S with phyloseq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: chem16S.bib csl: elementa.csl link-citations: true --- ```{r HTML, include = FALSE} Zc <- "ZC" nH2O <- "nH2O" nO2 <- "nO2" H2O <- "H2O" O2 <- "O2" ``` ```{r setup, include = FALSE} oldopt <- options(width = 80) # Use pngquant to optimize PNG images library(knitr) knit_hooks$set(pngquant = hook_pngquant) pngquant <- "--speed=1 --quality=0-25" if (!nzchar(Sys.which("pngquant"))) pngquant <- NULL # To make warnings appear in text box 20230619 # https://selbydavid.com/2017/06/18/rmarkdown-alerts/ knitr::knit_hooks$set( error = function(x, options) { paste('\n\n
', gsub('##', '\n', gsub('^##\ Error', '**Error:**', x)), '
', sep = '\n') }, warning = function(x, options) { paste('\n\n
', gsub('##', '\n', gsub('^##\ Warning:', '**Warning:**', x)), '
', sep = '\n') }, message = function(x, options) { paste('\n\n
', gsub('##', '\n', x), '
', sep = '\n') } ) ## Don't evaluate chunks if phyloseq is not available 20230619 #if(!requireNamespace("phyloseq", quietly = TRUE)) { # knitr::opts_chunk$set(eval = FALSE) # day <- imin <- AA.RDP <- map.RDP <- map.silva <- NULL # warning("The **phyloseq** package is not available, so this vignette shows only the code without the results.") #} ``` This vignette demonstrates the integration of **chem16S** with **phyloseq** to calculate chemical metrics of community reference proteomes. - *Chemical metrics* are molecular properties computed from elemental compositions -- inferred from amino acid compositions of proteins -- and include carbon oxidation state (`r Zc`) and stoichiometric hydration state (`r nH2O`), as described by @DYT20. - *Community reference proteomes* are generated by combining lowest-level taxonomic assignments (from genus to phylum) with reference proteomes for taxa. Precomputed reference proteomes for taxa are available for the GTDB and RefSeq reference databases [the latter were described by @DT23]. ## Load required packages ```{r load_packages} library(chem16S) library(phyloseq) library(ggplot2) theme_set(theme_bw()) ``` This vignette was compiled on `r Sys.Date()` with **chem16S** version `r sessionInfo()$otherPkgs$chem16S$Version` and **phyloseq** version `r sessionInfo()$otherPkgs$phyloseq$Version`. ## Mouse microbiome data The [mothur MiSeq SOP](https://mothur.org/wiki/miseq_sop/) has an example dataset with 16S rRNA gene sequences for mouse gut communities, which was extracted from the complete dataset reported by @SSZ+12. This example dataset includes data collected at 0--9 (early) and 141-150 (late) days post-weaning from one mouse. This dataset is used in the [DADA2 Pipeline Tutorial](https://benjjneb.github.io/dada2/tutorial.html) to demonstrate construction of amplicon sequence variants (ASV), taxonomic classification of the ASVs, and analysis using the **phyloseq** package. Here, we load the `phyloseq-class` object that was created on 2023-06-14 using **dada2** (version 1.28.0) by following the steps in the DADA2 pipeline tutorial (version 1.16). Note that taxonomy was assigned to genus level using a newer version of the Silva training set ([`silva_nr99_v138.1_train_set.fa.gz`](https://zenodo.org/record/4587954)) than that indicated in the DADA2 tutorial on this date (`silva_nr_v132_train_set.fa.gz`). ```{r load_mouse.silva} data(mouse.silva) mouse.silva ``` To start to visualize this data, let's recreate the bar plot from the DADA2 tutorial: ```{r ps_dada2_barplot, fig.width = 7, fig.height = 5, fig.align = "center", pngquant = pngquant} top20.silva <- names(sort(taxa_sums(mouse.silva), decreasing = TRUE))[1:20] mouse.silva.top20 <- transform_sample_counts(mouse.silva, function(OTU) OTU/sum(OTU)) mouse.silva.top20 <- prune_taxa(top20.silva, mouse.silva.top20) plot_bar(mouse.silva.top20, x = "Day", fill = "Family") + facet_wrap(~When, scales = "free_x") ``` ## Lowest-level taxa The **chem16S** package contains the amino acid compositions of archaeal, bacterial, and viral taxa at levels from genus to phylum constructed from the RefSeq database; see the vignette [**Chemical metrics of reference proteomes**](metrics.html) for more information. There are denoted as *reference proteomes* for taxa. Several functions are available to identify the lowest-level taxon for each classified read (or in this case, each ASV), combine the reference proteomes for all taxa to create *community reference proteomes*, and calculate chemical metrics from them. First, let's make a table that lists the lowest-level taxon for each ASV and their abundances. Because **chem16S** was first developed to analyze the output from the RDP Classifier, this data frame has a format that is similar to the file created by using the `classify -h` command of the RDP Classifier followed by the `merge-count` command. That is, the data frame has columns named `taxid`, `rank`, `lineage`, and `name`, followed by columns for all samples. Because of the long text in the `taxid` column (see below), let's look at the other columns first. ```{r mouse.silva_taxacounts} tc.silva <- ps_taxacounts(mouse.silva) head(tc.silva[, -1]) ``` ASVs with identical lowest-level taxonomic assignments are merged, so this data frame has fewer rows than the number of ASVs. The short names of the ASVs are stored in the `taxid` column, and the names of multiple ASVs with the same lowest-level taxonomic classification are separated by a semicolon. ```{r mouse.silva_taxid} head(tc.silva$taxid) ``` As explained in the DADA2 tutorial, the ASVs themselves (that is, the DNA sequences, not their short names) are available in the `phyloseq-class` object, but they are not used here. Now let's list the number of unique lowest-level classifications at each taxonomic level: ```{r mouse.silva_levels, collapse = TRUE} table(tc.silva$rank) ``` These taxa will be used below to construct community reference proteomes. ## Taxonomic mapping A challenge of using reference proteomes is that they come from one source (GTDB and RefSeq are available in **chem16S**), but taxonomic assignments may be made using a different training set (such as Silva or RDP). While identical names and ranks can be used to automatically match taxonomic assignments to reference proteomes, inherent differences in available taxonomies pose difficulties. For example, RDP uses the name *Escherichia/Shigella*, for which the closest correspondence in RefSeq is *Escherichia*. This and other manual mappings were used by @DT23 for mapping between RDP and RefSeq. Let's analyze a `phyloseq-class` object that was made using the Silva training set but use the manual mapping that was designed to map from RDP to RefSeq. ::: {.infobox .note} The manual mapping from RDP to RefSeq was not designed for taxonomic assignments made using Silva. This is just an example to show the limitations of the technique. ::: ```{r map.silva, collapse = TRUE} map.silva <- map_taxa(tc.silva, refdb = "RefSeq_206") ``` The messages show that one group was manually mapped, and the total mapping rate (including both automatic name matching and manual mapping) is `r round(100 - sum(attr(map.silva, "unmapped_percent")), 1)`%. Not let's analyze a `phyloseq-class` object that was made using the RDP training set. This results in a considerably higher mapping rate to the NCBI taxonomy. ```{r map.RDP, collapse = TRUE} data(mouse.RDP) tc.RDP <- ps_taxacounts(mouse.RDP) map.RDP <- map_taxa(tc.RDP, refdb = "RefSeq_206") ``` Now the total mapping rate is `r round(100 - sum(attr(map.RDP, "unmapped_percent")), 1)`%. ::: {.infobox .note} Manual mapping does not overcome inconsistencies between different taxonomies but is simply a starting point for getting the most out of the available taxonomic classifications and reference proteomes. Using the Genome Taxonomy Database (GTDB) for both taxonomic classification and reference proteomes obviates the need for manual mapping and is described further below. ::: ## From taxonomy to amino acid composition The `ps_metrics()` function in **chem16S** wraps the previous two functions (`ps_taxacounts()` and `map_taxa()`) as well as another function (`get_metrics()`). In order, these functions get the lowest-level taxon for each OTU or ASV, map taxonomic names to RefSeq, and combine taxonomic abundances with reference proteomes for taxa to calculate the amino acid composition of community reference proteomes, which are then used to calculate chemical metrics. To peek into this process, let's begin by listing the amino acid compositions of some community reference proteomes. ```{r AA.RDP, collapse = TRUE} AA.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE, return_AA = TRUE) head(AA.RDP) ``` In this data frame, columns named `Run` and `chains` have the sample names and abundance-weighted numbers of reference proteomes -- that is, the total abundance of ASVs that have taxonomic classifications mapped to the NCBI taxonomy -- that are used to compute the amino acid composition. We can use `canprot::calc_metrics()` to calculate the average protein length of each community reference proteome, then make a histogram: ```{r AA.RDP_length, fig.width = 7, fig.height = 5, fig.align = "center", pngquant = pngquant} lengths <- canprot::calc_metrics(AA.RDP, "length")[, 1] hist(lengths) imin <- which.min(lengths) text(lengths[imin], 1.5, AA.RDP$Run[imin], adj = 1) ``` Compared to the other samples, that from Day `r if(!is.null(AA.RDP)) strsplit(AA.RDP$Run[imin], "D")[[1]][2]` has the lowest average protein length of the community reference proteome. ## From taxonomy to chemical metrics We're now ready to calculcate chemical metrics from the amino acid compositions. ```{r metrics.RDP, collapse = TRUE} metrics.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE) head(metrics.RDP) ``` This data frame has columns for `r Zc` (carbon oxidation state), `r nO2` (stoichiometric oxidation state), and `r nH2O` (stoichiometric hydration state). `r Zc` is computed from the elemental formula, while `r nO2` and `r nH2O` are coefficients on `r O2` and `r H2O` in theoretical formation reactions of the proteins, per residue, from a particular set of thermodynamic components, also known as basis species. The rationale for the choice of basis species used here (glutamine, glutamic acid, cysteine, `r O2`, and `r H2O`, abbreviated as QEC) was given by @DYT20. Because they are both measures of oxidation state, in general there is a strong positive correlation between `r Zc` and `r nO2`. For this dataset, there is in addition a strong negative correlation between `r Zc` and `r nH2O`, but other datasets do not show such a correlation. ```{r cor.RDP, collapse = TRUE} cor(metrics.RDP$Zc, metrics.RDP$nO2) cor(metrics.RDP$Zc, metrics.RDP$nH2O) ``` Let's plot each of these chemical metrics against the sampling day. ```{r plot_metrics.RDP, fig.width = 7, fig.height = 5, fig.align = "center", pngquant = pngquant} plot_ps_metrics(mouse.RDP, x = "Day", color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) + geom_point(size = 3) ``` Let's plot two chemical metrics against each other. ```{r plot_metrics2.RDP, fig.width = 6, fig.height = 4, fig.align = "center", pngquant = pngquant} plot_ps_metrics2(mouse.RDP, color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) + geom_point(size = 3) ``` The community reference proteomes for Late samples tend to be more oxidized and less hydrated (i.e., they have higher `r Zc` and lower `r nH2O`) than Early samples, but there is some overlap between the groups. ## Using GTDB for reference proteomes As noted above, mapping taxon names from Silva or RDP to RefSeq is not perfect. To overcome this limitation, the GTDB can be used for both taxonomic classification of 16S rRNA gene sequences and for reference proteomes of taxa. Here we load a `phyloseq-class` object prepared following the DADA2 Pipeline Tutorial modified to classify 16S rRNA gene sequences using GTDB r220 [@Ali24]. Then, we plot chemical metrics calculated for reference proteomes also derived from the GTDB, but a later version. (The reason for this is that DADA2-formatted 16S rRNA sequences from GTDB r220 are not yet available in the Zenodo repository maintained by @Ali24.) ```{r data.GTDB, collapse = TRUE, fig.width = 6, fig.height = 4, fig.align = "center", pngquant = pngquant} data(mouse.GTDB_220) plot_ps_metrics2(mouse.GTDB_220, refdb = "GTDB_220", color = "When", shape = "When") + geom_point(size = 3) ``` Compared to using the RDP for classification of 16S rRNA sequences and RefSeq for reference proteomes, using the GTDB for both yields a clearer separation of Early and Late samples in chemical space. There is one Early sample with anomalously high `r Zc` and low `r nH2O`; let's identify it: ```{r early.GTDB, collapse = TRUE, fig.width = 6, fig.height = 4, fig.align = "center", pngquant = pngquant} metrics.GTDB <- ps_metrics(mouse.GTDB_220) is.early <- sample_data(mouse.GTDB_220)$When == "Early" iout <- which.min(metrics.GTDB[is.early, ]$nH2O) (day <- sample_data(mouse.GTDB_220)[is.early, ]$Day[iout]) ``` This is the sample from day `r day`. Let's look again at the taxonomic classifications, this time at the phylum level in GTDB: ```{r barplot.GTDB, fig.width = 7, fig.height = 5, fig.align = "center", pngquant = pngquant} top20.GTDB <- names(sort(taxa_sums(mouse.GTDB_220), decreasing = TRUE))[1:20] mouse.GTDB_220.top20 <- transform_sample_counts(mouse.GTDB_220, function(OTU) OTU/sum(OTU)) mouse.GTDB_220.top20 <- prune_taxa(top20.GTDB, mouse.GTDB_220.top20) plot_bar(mouse.GTDB_220.top20, x = "Day", fill = "Phylum") + facet_wrap(~When, scales = "free_x") ``` This plot suggests that a relatively high proportion of Bacteroidota makes the Day `r day` sample more similar to samples in the Late group. Bacteroidetes (an older name for Bacteroidota) have some of the lowest `r nH2O` among major bacterial phyla [see @DT23], which could partially explain the low `r nH2O` observed for the Day `r day` sample. ## Take-home messages 1. Taxonomic mapping from RDP to RefSeq is incomplete. If you use this, it is advisable to report the percentage of mapped taxonomic classifications as well as the major unmapped groups, which can be listed with the `quiet = FALSE` argument to `map_taxa()`, `ps_metrics()`, `plot_ps_metrics()`, and `plot_ps_metrics2()`. 2. Consistent taxonomy can be achieved by using the GTDB for both reference proteomes and taxonomic classification of 16S rRNA gene sequences. For the dataset analyzed here, this leads to a clearer separation between sample groups in chemical space and helps to identify the taxonomic basis for possible outliers. 2. Gut communities in later stages of mouse growth are characterized by relatively oxidized and dehydrated reference proteomes. Beyond conventional diversity metrics used for microbiome analysis, visualizing the chemical metrics of community reference proteomes can provide new insight into genomic adaptation. See the next vignette ([**Plotting two chemical metrics**](plotting.html)) for more examples. ## References ```{r cleanup, include = FALSE} options(oldopt) ```