--- title: "Introduction to canprot" output: html_vignette: mathjax: null vignette: > %\VignetteIndexEntry{Introduction to canprot} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: canprot.bib link-citations: yes csl: elementa.csl --- ```{r setup, include=FALSE} library(canprot) library(CHNOSZ) oldopt <- options(width = 80) ``` ```{r HTML, include=FALSE} Zc <- "Z_C" nH2O <- "nH₂O" ``` The **canprot** package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on `r Sys.Date()` with **canprot** version `r packageDescription("canprot")$Version`. Previous vignette: [Demos for canprot](demos.html) | Next vignette: [More about metrics](metrics.html) ## Reading FASTA files `KHAB17.fasta` was obtained from Supplemental Information of @KHAB17 and is provided in the `extdata/fasta` directory of **canprot**. Use `read_fasta()` to read the file and return a data frame of amino acid composition. ```{r read_fasta_KHAB17} fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot") aa <- read_fasta(fasta_file) ``` The result is small enough that we can look at it here. The data frame has four columns for identifying information (`protein`, `organism`, `ref`, and `abbrv`); the first two are filled by `read_fasta()`. The `chains` column is the number of polypeptide chains. ```{r aa_KHAB17} aa ``` *None of the first five columns is necessary for the calculation of chemical metrics.* They are provided for compatibility with CHNOSZ, and if you don't plan to use CHNOSZ, you can put any information here that you want, including NA values, or remove the columns completely. The columns that do matter for the calculation of chemical metrics are the last 20 columns, which are named with the 3-letter abbreviations of the amino acids. These particular sequences are ancestral sequences of Rubisco. A geochemical biology hypothesis is that proteins are oxidized when the environment is oxidizing. Is this what happens? Plot the carbon oxidation state (`r Zc`) to find out. Note the use of pre-formatted plot labels for chemical metrics available in `cplab`. ```{r Zc_KHAB17, out.width = "75%", fig.align = "center", fig.width = 6, fig.height = 4} xlab <- "Ancestral sequences (older to younger)" plot(Zc(aa), type = "b", xaxt = "n", xlab = xlab, ylab = cplab$Zc) names <- gsub(".*_", "", aa$protein) axis(1, at = 1:6, names) abline(v = 3.5, lty = 2, col = 2) axis(3, at = 3.5, "GOE (proposed)") ``` The vertical line denotes the proposed timing of the Great Oxidation Event (GOE) between Anc. I/III and Anc. I [@KHAB17]. This analysis shows that reconstructed ancestral Rubiscos become more oxidized around the proposed timing of GOE. ## Human proteins in **canprot** **canprot** has a database of amino acid compositions of human proteins assembled from UniProt. Use `human_aa()` to get the amino acid composition. This example is for alanine aminotransferase, which has a UniProt ID of [P24298](https://www.uniprot.org/uniprotkb/P24298): ```{r human_aa_ALAT1} (aa <- human_aa("P24298")) Zc(aa) ``` Do you have a list of UniProt IDs for a differential expression dataset? Great! We can use those to calculate chemical metrics and make a boxplot. The IDs in this example come from Figure 5 of @DKM+20, where they were identified as differentially regulated proteins in aggregate (3D) cell culture compared to monolayer (2D) culture. ```{r DKM20_UniProt} up <- c("Q92743", "P43490", "P52895", "P98160", "P23142", "P17301", "U3KQK0", "Q15582", "Q9HCJ1", "P36222", "P27701", "Q08380", "P08572", "P00734", "P22413", "O43657", "P35625", "O75348", "P02649", "P13861", "P10620", "Q9H3N1", "A8K878", "P13611", "P07305", "E7ESP4", "Q9Y625", "Q5ZPR3", "P62266", "Q96AQ6", "Q8N357", "Q13217", "Q9Y230", "Q9Y639", "Q86W92", "C9JF17", "Q96PK6", "O95671", "P01033", "Q13501", "P69905", "Q9Y5X1", "P50281", "Q9UBG0", "O60831", "P02751", "O43854", "P61803", "J3KN66", "P42765", "P36543", "P15121", "Q16563", "Q12884", "P27695", "P12110", "P07686", "Q92598", "Q02818", "Q07954", "O60493", "P40939", "Q9Y3I0", "P51149", "P46776", "P46778", "P62805") down <- c("J3KN67", "Q9Y490", "J3KNQ4", "E7EVA0", "Q01082", "J3KQ32", "P54136", "Q9Y696", "Q01995", "Q15404", "P62714", "Q09666", "P07814", "E7EQR4", "P46821", "O75369", "P02452", "P08123", "P54577", "P01023", "Q6ZN40", "P42224", "B4DUT8", "Q13443", "Q9HCE1", "Q6DKJ4", "P50552", "P35222", "P20908", "Q15417", "O75822", "P17812", "P05997", "P04080", "O43294", "P08243", "P02458") ``` With those UniProt IDs for human proteins we can retrieve the amino acid compositions, then calculate a couple of chemical metrics and make some boxplots comparing the groups of differentially regulated proteins. ```{r DKM20_plot, out.width = "75%", fig.align = "center", fig.width = 7, fig.height = 5} aa_down <- human_aa(down) aa_up <- human_aa(up) bp_names <- paste0(c("Down (", "Up ("), c(nrow(aa_down), nrow(aa_up)), c(")", ")")) par(mfrow = c(1, 2)) Zclist <- list(Zc(aa_down), Zc(aa_up)) names(Zclist) <- bp_names boxplot(Zclist, ylab = cplab$Zc, col = c(4, 2)) names(Zclist) <- c("x", "y") p <- do.call(wilcox.test, Zclist)$p.value legend("bottomleft", paste("p =", round(p, 3)), bty = "n") title("Cabon oxidation state", font.main = 1) nH2Olist <- list(nH2O(aa_down), nH2O(aa_up)) names(nH2Olist) <- bp_names boxplot(nH2Olist, ylab = cplab$nH2O, col = c(4, 2)) names(nH2Olist) <- c("x", "y") p <- do.call(wilcox.test, nH2Olist)$p.value legend("bottomleft", paste("p =", round(p, 3)), bty = "n") title("Stoichiometric water content", font.main = 1) ``` We find no significant difference of `r Zc` for the differentially regulated proteins. In contrast, `r nH2O` is significantly lower for up-regulated than for down-regulated proteins. A similar dehydration trend characterizes most datasets for 3D cell culture [@Dic21a]. *The differential expression datasets analyzed in that paper, which were previously in **canprot**, have been moved to [JMDplots](https://github.com/jedick/JMDplots).* ```{r reset, include=FALSE} options(oldopt) ``` ## References