Calculating dn/dc from a protein sequence • Guillaume

Introduction

Here is a quick note to show how to calculate the theoretical refractive index increment (dn/dc) of a protein based on its sequence and on published dn/dc values of individual amino acids. This number can be used to determine protein concentration from measurements of relative refractive index, and is used in certain biophysical experiments such as SEC-MALLS. Very often, a consensus value for proteins is used, but it is actually not very difficult to calculate a value for each specific protein sequence of interest.

Getting dn/dc of individual amino acids

We will use dn/dc values of individual amino acids reported by Zhao et al (2011), using the version of the article provided by PubMed Central (PMC) because PMC’s normalized formatting will make it easier to automatically extract dn/dc values from its table 1 and will make this approach more easily reusable for other similar data extraction problems. The simple approach described here is an approximation and neglects the effects of buffer composition and temperature.

We will need the following functions and values from external R packages (some of them are quite big, so this approach of importing only the necessary bits is more convenient than loading everything, for the purpose of a blog post):

import::here(.from = magrittr, `%>%`)
import::here(.from = xml2, read_html)
import::here(.from = rvest, html_table)
import::here(.from = tibble, as_tibble)
import::here(.from = dplyr,
             select,
             arrange,
             mutate)
# Biostrings comes from the Bioconductor repository.
import::here(.from = Biostrings,
             AMINO_ACID_CODE,
             AA_STANDARD,
             readAAStringSet,
             alphabetFrequency)

Let’s begin by extracting the dn/dc values from the article’s table 1, and building a table that we can easily use from R. Let’s also convert amino acid names from 3-letter codes to 1-letter codes. This will be more convenient because protein sequences are more often found with the 1-letter code notation.

article <- read_html("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149238/")

dndc_table <- article %>% 
    html_table() %>% 
    .[[1]] %>% 
    as_tibble() %>% 
    select(aa = `Amino acid`,
           dndc_value = `dn/dc(ml/g)‡`) %>% 
    arrange(aa) %>% 
    mutate(amino_acid = names(AMINO_ACID_CODE[AMINO_ACID_CODE == aa])) %>% 
    select(amino_acid, dndc_value)

At this point, we have collected the two relevant columns from the article’s table into a data frame easy to manipulate with R:

dndc_table

## # A tibble: 20 x 2
##    amino_acid dndc_value
##    <chr>           <dbl>
##  1 A               0.167
##  2 R               0.206
##  3 N               0.192
##  4 D               0.197
##  5 C               0.206
##  6 Q               0.186
##  7 E               0.183
##  8 G               0.175
##  9 H               0.219
## 10 I               0.179
## 11 L               0.173
## 12 K               0.181
## 13 M               0.204
## 14 F               0.244
## 15 P               0.165
## 16 S               0.17 
## 17 T               0.172
## 18 W               0.277
## 19 Y               0.24 
## 20 V               0.172

Because R offers vectorized operations, we can simplify further downstream use of these numbers by transforming this table into a named vector:

dndc <- dndc_table$dndc_value
names(dndc) <- dndc_table$amino_acid
dndc

##     A     R     N     D     C     Q     E     G     H     I     L     K     M 
## 0.167 0.206 0.192 0.197 0.206 0.186 0.183 0.175 0.219 0.179 0.173 0.181 0.204 
##     F     P     S     T     W     Y     V 
## 0.244 0.165 0.170 0.172 0.277 0.240 0.172

Calculating dn/dc of a protein

The dn/dc value of a protein is the average of all its amino acids’ dn/dc values. One way to calculate it is to first calculate the amino acid composition of the given protein sequence as the frequency of each amino acid, then multiplying each amino acid’s dn/dc value by its frequency in this particular sequence, and finally adding up the frequency-weighted dn/dc values.

Here is an example, using the sequence of human TRF2 (one of the two proteins I studied during my PhD). First, we can retrieve the sequence from Uniprot. Biostrings::readAAStringSet from Bioconductor returns a list of sequences, so in this case a single-element list; to make subsequent operations more convenient we can extract this single element.

trf2 <- "https://www.uniprot.org/uniprot/Q15554.fasta" %>% 
    readAAStringSet() %>% 
    .[[1]]
trf2

## 542-letter AAString object
## seq: MAAGAGTAGPASGPGVVRDPAASQPRKRPGREGGEG...YGEGNWAAISKNYPFVNRTAVMIKDRWRTMKRLGMN

Then we can calculate dn/dc. R’s vectorized operations and the dndc vector that we constructed in the previous paragraph will make this easy to express. Biostrings::alphabetFrequency computes a count of occurrences or frequency of each amino acid in a protein sequence.

# Calculate amino acid composition as frequencies.
trf2_aa_comp <- alphabetFrequency(trf2, as.prob = TRUE)

# Calculate dn/dc values.
# We need to subset trf2_aa_composition to keep only frequency
# values of standard amino acids, because we only have dn/dc
# values for these ones.
trf2_dndc <- sum(trf2_aa_comp[AA_STANDARD] * dndc)
trf2_dndc

## [1] 0.1836863

This is pretty close to the consensus value of 0.185 for unmodified proteins that is often used by default when a dn/dc value is needed.

Possible improvements

Now that we can automatically calculate a dn/dc value from a protein sequence, we can determine the distributions of dn/dc across entire proteomes (like in figures 2, 4 and 6 and table 2 of the article). A starting point for such an analysis is the reference proteomes provided by Uniprot and the Biostrings::readAAStringSet. The rest mostly consists of figuring out how to operate on a large list of sequences (AAStringSet) instead of a single sequence (AAString). The purrr package will probably help.

Another improvement would be to read the article more carefully, especially the equations, and implement a more accurate calculation of dn/dc.

Replicate this post

You can replicate this post by running the Rmd source with RStudio and R. The easiest way to replicate this post is to clone the entire git repository.

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       macOS Catalina 10.15.7      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Stockholm            
##  date     2021-02-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date       lib source        
##  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  BiocGenerics   0.33.3  2020-04-07 [1] Bioconductor  
##  Biostrings     2.55.7  2020-04-07 [1] Bioconductor  
##  blogdown       1.1     2021-01-19 [1] CRAN (R 4.0.3)
##  bookdown       0.21    2020-10-13 [1] CRAN (R 4.0.3)
##  cli            2.3.0   2021-01-31 [1] CRAN (R 4.0.2)
##  crayon         1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
##  DBI            1.1.1   2021-01-15 [1] CRAN (R 4.0.3)
##  digest         0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
##  dplyr          1.0.4   2021-02-02 [1] CRAN (R 4.0.2)
##  ellipsis       0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
##  evaluate       0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  fansi          0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
##  generics       0.1.0   2020-10-31 [1] CRAN (R 4.0.3)
##  glue           1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
##  httr           1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
##  import         1.2.0   2020-09-24 [1] CRAN (R 4.0.2)
##  IRanges        2.21.8  2020-04-07 [1] Bioconductor  
##  knitr          1.31    2021-01-27 [1] CRAN (R 4.0.2)
##  lifecycle      0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
##  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
##  pillar         1.4.7   2020-11-20 [1] CRAN (R 4.0.2)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
##  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R6             2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
##  rlang          0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
##  rmarkdown      2.6     2020-12-14 [1] CRAN (R 4.0.3)
##  rvest          0.3.6   2020-07-25 [1] CRAN (R 4.0.2)
##  S4Vectors      0.25.15 2020-04-07 [1] Bioconductor  
##  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi        1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  tibble         3.0.6   2021-01-29 [1] CRAN (R 4.0.2)
##  tidyselect     1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
##  utf8           1.1.4   2018-05-24 [1] CRAN (R 4.0.0)
##  vctrs          0.3.6   2020-12-17 [1] CRAN (R 4.0.2)
##  withr          2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
##  xfun           0.20    2021-01-06 [1] CRAN (R 4.0.2)
##  xml2           1.3.2   2020-04-23 [1] CRAN (R 4.0.0)
##  XVector        0.27.2  2020-04-07 [1] Bioconductor  
##  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
##  zlibbioc       1.33.1  2020-04-07 [1] Bioconductor  
## 
## [1] /Users/guillaume/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

References

Zhao H, Brown PH & Schuck P (2011) On the Distribution of Protein Refractive Index Increments. Biophysical Journal 100: 2309–2317. https://doi.org/10.1016/j.bpj.2011.03.004