Here is a quick note to show how to calculate the theoretical refractive index
increment (dn/dc) of a protein based on its sequence and on published dn/dc
values of individual amino acids. This number can be used to determine protein
concentration from measurements of relative refractive index, and is used in
certain biophysical experiments such as SEC-MALLS. Very often, a
consensus value for proteins is used, but it is actually not very difficult to
calculate a value for each specific protein sequence of interest.
Getting dn/dc of individual amino acids
We will use dn/dc values of individual amino acids reported by
Zhao et al (2011), using the version of the article
provided by PubMed Central (PMC) because PMC’s normalized formatting
will make it easier to automatically extract dn/dc values from its table 1 and
will make this approach more easily reusable for other similar data extraction
problems. The simple approach described here is an approximation and neglects
the effects of buffer composition and temperature.
We will need the following functions and values from external R packages (some
of them are quite big, so this approach of importing only the necessary bits is
more convenient than loading everything, for the purpose of a blog post):
import::here(.from = magrittr, `%>%`)import::here(.from = xml2, read_html)import::here(.from = rvest, html_table)import::here(.from = tibble, as_tibble)import::here(.from = dplyr, select, arrange, mutate)# Biostrings comes from the Bioconductor repository.import::here(.from = Biostrings, AMINO_ACID_CODE, AA_STANDARD, readAAStringSet, alphabetFrequency)
Let’s begin by extracting the dn/dc values from the article’s table 1, and
building a table that we can easily use from R. Let’s also convert amino acid
names from 3-letter codes to 1-letter codes. This will be more convenient
because protein sequences are more often found with the 1-letter code notation.
At this point, we have collected the two relevant columns from the article’s
table into a data frame easy to manipulate with R:
## # A tibble: 20 x 2
## amino_acid dndc_value
## <chr> <dbl>
## 1 A 0.167
## 2 R 0.206
## 3 N 0.192
## 4 D 0.197
## 5 C 0.206
## 6 Q 0.186
## 7 E 0.183
## 8 G 0.175
## 9 H 0.219
## 10 I 0.179
## 11 L 0.173
## 12 K 0.181
## 13 M 0.204
## 14 F 0.244
## 15 P 0.165
## 16 S 0.17
## 17 T 0.172
## 18 W 0.277
## 19 Y 0.24
## 20 V 0.172
Because R offers vectorized operations, we can simplify further downstream use
of these numbers by transforming this table into a named vector:
## A R N D C Q E G H I L K
## 0.167 0.206 0.192 0.197 0.206 0.186 0.183 0.175 0.219 0.179 0.173 0.181
## M F P S T W Y V
## 0.204 0.244 0.165 0.170 0.172 0.277 0.240 0.172
Calculating dn/dc of a protein
The dn/dc value of a protein is the average of all its amino acids’ dn/dc
values. One way to calculate it is to first calculate the amino acid composition
of the given protein sequence as the frequency of each amino acid, then
multiplying each amino acid’s dn/dc value by its frequency in this particular
sequence, and finally adding up the frequency-weighted dn/dc values.
Here is an example, using the sequence of human TRF2 (one of the two
proteins I studied during my PhD). First, we can retrieve the sequence
from Uniprot. Biostrings::readAAStringSet from Bioconductor returns a
list of sequences, so in this case a single-element list; to make subsequent
operations more convenient we can extract this single element.
Then we can calculate dn/dc. R’s vectorized operations and the dndc vector
that we constructed in the previous paragraph will make this easy to express.
Biostrings::alphabetFrequency computes a count of occurrences or frequency of
each amino acid in a protein sequence.
# Calculate amino acid composition as frequencies.trf2_aa_comp <-alphabetFrequency(trf2, as.prob =TRUE)# Calculate dn/dc values.# We need to subset trf2_aa_composition to keep only frequency# values of standard amino acids, because we only have dn/dc# values for these ones.trf2_dndc <-sum(trf2_aa_comp[AA_STANDARD] *dndc)trf2_dndc
##  0.1836863
This is pretty close to the consensus value of 0.185 for unmodified proteins
that is often used by default when a dn/dc value is needed.
Now that we can automatically calculate a dn/dc value from a protein sequence,
we can determine the distributions of dn/dc across entire proteomes (like in
figures 2, 4 and 6 and table 2 of the article). A starting point for such an
analysis is the reference proteomes provided by Uniprot and the
Biostrings::readAAStringSet. The rest mostly consists of figuring out how to
operate on a large list of sequences (AAStringSet) instead of a single
sequence (AAString). The purrr package will probably help.
Another improvement would be to read the article more carefully, especially the
equations, and implement a more accurate calculation of dn/dc.