The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2025-02-07 with canprot version 2.0.0-2.
Previous vignette: Introduction to canprot
Carbon oxidation state (ZC) is calculated from elemental ratios, but stoichiometric water content (nH2O) depends on a specific choice of thermodynamic components (or basis species). In canprot, nH2O is calculated from a theoretical reaction to form proteins from the following set of basis species, abbreviated as QEC:
The rationale for this choice is described in Dick et al. (2020) and Dick (2022).
To see how it works, consider the formation reaction of alanylglycine, which can be written using functions in the CHNOSZ package:
## C H N O S ispecies logact state
## C5H10N2O3 5 10 2 3 0 1723 -3.2 aq
## C5H9NO4 5 9 1 4 0 1724 -4.5 aq
## C3H7NO2S 3 7 1 2 1 1721 -3.6 aq
## H2O 0 2 0 1 0 1 0.0 liq
## O2 0 0 0 2 0 2679 -80.0 gas
## coeff name formula state ispecies model
## 2665 1 alanylglycine C5H10N2O3 cr 2665 CGL
## 1723 -1 glutamine C5H10N2O3 aq 1723 HKF
As it turns out, alanylglycine has the same elemental formula as glutamine, which is one of the basis species, so there are no other basis species in the reaction. That includes H2O, so we can deduce that nH2O is zero.
Let’s do the calculation with the nH2O()
function to see
this result. We have to specify terminal_H2O = 1
in order
to account for the terminal -H and -OH groups on alanlglycine. As
described below, the default for this setting is 0 because, most of the
time, we want to measure the per-amino-acid contributions for
proteins.
## [1] 0
Now let’s try an actual protein, chicken egg-white lysozome, which has the name LYSC_CHICK in UniProt with accession number P00698. The amino acid compositions of this and selected other proteins are available in the CHNOSZ package. This gets the amino acid composition and prints the protein length and ZC and nH2O:
## 6
## 129
## 6
## 0.01631321
## 6
## -0.8868217
By the way, lysozyme (a secreted protein) is highly oxidized compared to cytoplasmic proteins, most of which have negative ZC.
To see where the value of nH2O comes from, we can
write the formation reaction of LYSC_CHICK from the QEC
basis species. Recall that we set the basis species above with
CHNOSZ::basis("QEC")
; that setting is used here to balance
the reaction.
## coeff name formula state ispecies
## 3487 1.0 LYSC_CHICK C613H959N193O185S10 aq 3487
## 1723 -66.4 glutamine C5H10N2O3 aq 1723
## 1724 -50.2 glutamic acid C5H9NO4 aq 1724
## 1721 -10.0 cysteine C3H7NO2S aq 1721
## 1 113.4 water H2O liq 1
## 2679 60.8 oxygen O2 gas 2679
## model
## 3487 HKF
## 1723 HKF
## 1724 HKF
## 1721 HKF
## 1 water.SUPCRT92
## 2679 CGL
## [1] 113.4
That says 113.4, but the value for nH2O above was -0.8868217. What happened? Let’s step through the logic:
A general observation: This is a theoretical reaction in terms of thermodynamic components, so we are not dealing with biochemical mechanisms here. That’s one reason for calling this approach geochemical biology.
To save time, nH2O()
does not calculate the formation
reaction for each protein but instead uses precomputed values of
nH2O for each amino acid. The two methods give
equivalent results, as described in Dick et al.
(2020).
Similarly, Zc()
uses precomputed values of
ZC and nC (number of carbon atoms)
for each amino acid. NOTE: Calculating
ZC of proteins from amino acid frequencies
(i.e. abundances or counts in a protein sequence) requires weighting the
amino-acid ZC by the number of carbon atoms in each
amino acid, in addition to weighting by amino acid frequency. Using the
unweighted mean of ZC of amino acids leads to
artificially higher values for proteins.
There are also functions for calculating the grand average of hydropathy (GRAVY) and isoelectric point (pI) of proteins. Below, values for representative proteins are compared with results from the ProtParam tool (Gasteiger et al., 2005).
We first look up a few proteins in CHNOSZ’s list of proteins, then get the amino acid compositions.
iprotein <- CHNOSZ::pinfo(c("LYSC_CHICK", "RNAS1_BOVIN", "AMYA_PYRFU", "CSG_HALJP"))
AAcomp <- CHNOSZ::pinfo(iprotein)
Calculate GRAVY and compare it to reference values from https://web.expasy.org/protparam/.
G_calc <- GRAVY(AAcomp)
# https://web.expasy.org/cgi-bin/protparam/protparam1?P00698@19-147@
# https://web.expasy.org/cgi-bin/protparam/protparam1?P61823@27-150@
# https://web.expasy.org/cgi-bin/protparam/protparam1?P49067@2-649@
G_ref <- c(-0.472, -0.663, -0.325)
stopifnot(all.equal(round(G_calc[1:3], 3), G_ref, check.attributes = FALSE))
Calculate pI and compare it to reference values.
pI_calc <- pI(AAcomp)
# Reference values calculated with ProtParam
# LYSC_CHICK: residues 19-147 (sequence v1)
# RNAS1_BOVIN: residues 27-150 (sequence v1)
# AMYA_PYRFU: residues 2-649 (sequence v2)
# CSG_HALJP: residues 35-862 (sequence v1)
pI_ref <- c(9.32, 8.64, 5.46, 3.37)
stopifnot(all.equal(as.numeric(pI_calc), pI_ref))
Calculate molecular weight (MW
) and compare it to
reference values
# Per-residue molecular weight multiplied by number of residues
MWcalc <- MW(AAcomp) * plength(AAcomp)
# Add terminal groups
MWcalc <- MWcalc + 18.01528
# Reference values for molecular weights of proteins
MWref <- c(14313.14, 13690.29, 76178.25)
stopifnot(all.equal(round(MWcalc[1:3], 2), MWref, check.attributes = FALSE))