Calculating dn/dc from a protein sequence
Introduction
Here is a quick note to show how to calculate the theoretical refractive index increment (dn/dc) of a protein based on its sequence and on published dn/dc values of individual amino acids. This number can be used to determine protein concentration from measurements of relative refractive index, and is used in certain biophysical experiments such as SEC-MALLS. Very often, a consensus value for proteins is used, but it is actually not very difficult to calculate a value for each specific protein sequence of interest.
Getting dn/dc of individual amino acids
We will use dn/dc values of individual amino acids reported by Zhao et al (2011), using the version of the article provided by PubMed Central (PMC) because PMC’s normalized formatting will make it easier to automatically extract dn/dc values from its table 1 and will make this approach more easily reusable for other similar data extraction problems. The simple approach described here is an approximation and neglects the effects of buffer composition and temperature.
We will need the following functions and values from external R packages (some of them are quite big, so this approach of importing only the necessary bits is more convenient than loading everything, for the purpose of a blog post):
::here(.from = magrittr, `%>%`)
import::here(.from = xml2, read_html)
import::here(.from = rvest, html_table)
import::here(.from = tibble, as_tibble)
import::here(.from = dplyr,
import
select,
arrange,
mutate)# Biostrings comes from the Bioconductor repository.
::here(.from = Biostrings,
import
AMINO_ACID_CODE,
AA_STANDARD,
readAAStringSet, alphabetFrequency)
Let’s begin by extracting the dn/dc values from the article’s table 1, and building a table that we can easily use from R. Let’s also convert amino acid names from 3-letter codes to 1-letter codes. This will be more convenient because protein sequences are more often found with the 1-letter code notation.
<- read_html("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149238/")
article
<- article %>%
dndc_table html_table() %>%
1]] %>%
.[[as_tibble() %>%
select(aa = `Amino acid`,
dndc_value = `dn/dc(ml/g)‡`) %>%
arrange(aa) %>%
mutate(amino_acid = names(AMINO_ACID_CODE[AMINO_ACID_CODE == aa])) %>%
select(amino_acid, dndc_value)
At this point, we have collected the two relevant columns from the article’s table into a data frame easy to manipulate with R:
dndc_table
## # A tibble: 20 x 2
## amino_acid dndc_value
## <chr> <dbl>
## 1 A 0.167
## 2 R 0.206
## 3 N 0.192
## 4 D 0.197
## 5 C 0.206
## 6 Q 0.186
## 7 E 0.183
## 8 G 0.175
## 9 H 0.219
## 10 I 0.179
## 11 L 0.173
## 12 K 0.181
## 13 M 0.204
## 14 F 0.244
## 15 P 0.165
## 16 S 0.17
## 17 T 0.172
## 18 W 0.277
## 19 Y 0.24
## 20 V 0.172
Because R offers vectorized operations, we can simplify further downstream use of these numbers by transforming this table into a named vector:
<- dndc_table$dndc_value
dndc names(dndc) <- dndc_table$amino_acid
dndc
## A R N D C Q E G H I L K M
## 0.167 0.206 0.192 0.197 0.206 0.186 0.183 0.175 0.219 0.179 0.173 0.181 0.204
## F P S T W Y V
## 0.244 0.165 0.170 0.172 0.277 0.240 0.172
Calculating dn/dc of a protein
The dn/dc value of a protein is the average of all its amino acids’ dn/dc values. One way to calculate it is to first calculate the amino acid composition of the given protein sequence as the frequency of each amino acid, then multiplying each amino acid’s dn/dc value by its frequency in this particular sequence, and finally adding up the frequency-weighted dn/dc values.
Here is an example, using the sequence of human TRF2 (one of the two
proteins I studied during my PhD). First, we can retrieve the sequence
from Uniprot. Biostrings::readAAStringSet
from Bioconductor returns a
list of sequences, so in this case a single-element list; to make subsequent
operations more convenient we can extract this single element.
<- "https://www.uniprot.org/uniprot/Q15554.fasta" %>%
trf2 readAAStringSet() %>%
1]]
.[[ trf2
## 542-letter AAString object
## seq: MAAGAGTAGPASGPGVVRDPAASQPRKRPGREGGEG...YGEGNWAAISKNYPFVNRTAVMIKDRWRTMKRLGMN
Then we can calculate dn/dc. R’s vectorized operations and the dndc
vector
that we constructed in the previous paragraph will make this easy to express.
Biostrings::alphabetFrequency
computes a count of occurrences or frequency of
each amino acid in a protein sequence.
# Calculate amino acid composition as frequencies.
<- alphabetFrequency(trf2, as.prob = TRUE)
trf2_aa_comp
# Calculate dn/dc values.
# We need to subset trf2_aa_composition to keep only frequency
# values of standard amino acids, because we only have dn/dc
# values for these ones.
<- sum(trf2_aa_comp[AA_STANDARD] * dndc)
trf2_dndc trf2_dndc
## [1] 0.1836863
This is pretty close to the consensus value of 0.185 for unmodified proteins that is often used by default when a dn/dc value is needed.
Possible improvements
Now that we can automatically calculate a dn/dc value from a protein sequence,
we can determine the distributions of dn/dc across entire proteomes (like in
figures 2, 4 and 6 and table 2 of the article). A starting point for such an
analysis is the reference proteomes provided by Uniprot and the
Biostrings::readAAStringSet
. The rest mostly consists of figuring out how to
operate on a large list of sequences (AAStringSet
) instead of a single
sequence (AAString
). The purrr
package will probably help.
Another improvement would be to read the article more carefully, especially the equations, and implement a more accurate calculation of dn/dc.
Replicate this post
You can replicate this post by running the Rmd source with RStudio and R. The easiest way to replicate this post is to clone the entire git repository.
::session_info() sessioninfo
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os macOS Catalina 10.15.7
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Stockholm
## date 2021-02-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
## BiocGenerics 0.33.3 2020-04-07 [1] Bioconductor
## Biostrings 2.55.7 2020-04-07 [1] Bioconductor
## blogdown 1.1 2021-01-19 [1] CRAN (R 4.0.3)
## bookdown 0.21 2020-10-13 [1] CRAN (R 4.0.3)
## cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.2)
## crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
## fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
## import 1.2.0 2020-09-24 [1] CRAN (R 4.0.2)
## IRanges 2.21.8 2020-04-07 [1] Bioconductor
## knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
## purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2)
## rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
## S4Vectors 0.25.15 2020-04-07 [1] Bioconductor
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
## tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
## vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
## withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
## xfun 0.20 2021-01-06 [1] CRAN (R 4.0.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
## XVector 0.27.2 2020-04-07 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
## zlibbioc 1.33.1 2020-04-07 [1] Bioconductor
##
## [1] /Users/guillaume/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library