A simple DNA scrambling function in R
Many DNA-binding proteins recognize a specific sequence, although they often bind DNA somewhat weakly regardless of its sequence. For the in vitro study of such sequence-specific protein/DNA interactions, an important control to prove that binding is specific is to show that there is no binding (or weaker binding by orders of magnitude) to a random sequence.
The ideal random sequence for this control is a scrambled version of the recognition sequence (i.e. the same base composition, but occurring in a completely different order). When faced with this simple problem of scrambling a DNA sequence, I tried to find an online tool to do that. A simple search leads to a tool offered by GenScript (a gene synthesis company), but upon closer inspection it turns out that the scrambled sequences it generates are not completely random; they are in fact biased in order to make good siRNA controls, to minimize possible matches to naturally occurring sequences.
In the case of an in vitro binding experiment using purified proteins and DNA, it doesn’t matter that the scrambled control might resemble another naturally occurring sequence. So, instead of spending more time searching for an online tool, I figured it shouldn’t be too difficult to write my own. In the rest of this post, I will explain how I did that with R.
Utility functions
From other packages
We will need a few functions from the stringr
package, as well as the pipe
operator from magrittr
(for convenience).
::here(.from = stringr, str_c, str_count, str_length)
import::here(.from = magrittr, `%>%`) import
Base composition
Before we can scramble a DNA sequence, we need a function that calculates its base composition, because we will need to check that the original binding motif and the scrambled sequence have the same base composition.
<- function(dna_seq) {
base_composition <- c(A = "A", T = "T", G = "G", C = "C")
dna_bases <- str_count(dna_seq, dna_bases)
composition names(composition) <- names(dna_bases)
composition }
This function is very easy to write because R has built-in named vectors. This
means that after we count the occurrences of each base in an input sequence, we
can transfer the names of the dna_bases
vector to the composition
vector we
just built, and printing that vector will report a count for each base.
GC content
With this base_composition
function in hand, we can very easily write a
function to calculate the GC content of a sequence.
<- function(dna_seq) {
gc_content <- base_composition(dna_seq)
composition <- 100 * (composition["G"] + composition["C"]) / str_length(dna_seq)
gc_dna <- round(gc_dna, digits = 2)
gc_dna names(gc_dna) <- NULL
gc_dna }
“Exploding” a DNA sequence
We will need one last utility function to turn a DNA sequence into a vector of single bases.
<- function(dna_seq) {
explode substring(text = dna_seq,
first = seq(1, str_length(dna_seq), 1),
last = seq(1, str_length(dna_seq), 1))
}
DNA sequence scrambling function
Finally, we can write the sequence scrambling function. The way it works is:
- turn a DNA sequence into a vector of single bases,
- shuffle the elements of this vector by drawing all of them one by one, randomly and without replacing them,
- concatenate the elements of this new vector into a new string.
This is where the pipe notation makes it much easier to read.
<- function(dna_seq) {
scramble_dna %>%
dna_seq explode() %>%
sample(x = .,
size = str_length(dna_seq),
replace = FALSE) %>%
str_c(collapse = "")
}
Example
We can now use this function on an example sequence, say the EcoRI
recognition motif GAATTC
.
<- "GAATTC"
ecori base_composition(ecori)
## A T G C
## 2 2 1 1
gc_content(ecori)
## [1] 33.33
<- scramble_dna(ecori)
ecori_scrambled
# The scrambled sequence will obviously differ from the original one
all.equal(ecori, ecori_scrambled)
## [1] "1 string mismatch"
# But it will still have the same base composition and GC content
all.equal(
%>% base_composition(),
ecori %>% base_composition()
ecori_scrambled )
## [1] TRUE
all.equal(
%>% gc_content(),
ecori %>% gc_content()
ecori_scrambled )
## [1] TRUE
Possible improvements
This was easy to write, but could be better. One possible improvement would be to vectorize this function. As it is now, it will only work on a single DNA sequence. It could be nice to make it accept a vector of DNA sequences and return a vector of corresponding scrambled sequence.
An important and simple improvement would be to include an initial check of the
alphabet of each string. If a string contains letters that are not A, C, G or T
(for DNA) or A, C, G or U (for RNA), then it’s not a nucleic acid sequence and
trying to compute base composition and GC content is irrelevant. The
Biostrings
package, part of Bioconductor, provides these alphabets in
Biostrings::DNA_BASES
and Biostrings::RNA_BASES
.
Performance seems reasonnable with respect to sequence length: I tried with a
~15 kb long sequence, and base_composition
, gc_content
and scramble_dna
all evaluated instantly (i.e. system.time
reported 0 for all measured
evaluation times). I don’t know how a vectorized version would perform with
respect to the number of sequences to process.
Practical considerations
I could have looked whether Bioconductor offers such functionality (it definitely has advanced sequence manipulation tools), but this homemade version is totally sufficient for my typical need and took me very little time to write.
Some DNA-binding proteins recognize very short motifs, or derive specificity from very few base pairs within a larger consensus motif. As a result, it might be difficult to completely disrupt binding by simply shuffling the recognition sequence. One might need several attempts at shuffling a sequence before finding one that doesn’t match any feature (or as little as possible) of the recognition sequence. This can be checked by aligning these sequences, for example using Clustal Omega. One can also check that the resulting scrambled sequence doesn’t match any entry of databases such as HOCOMOCO or JASPAR.
Replicate this post
You can replicate this post by running the Rmd source with RStudio and R. The easiest way to replicate this post is to clone the entire git repository.
::session_info() sessioninfo
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os macOS Catalina 10.15.7
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Stockholm
## date 2021-02-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
## blogdown 1.1 2021-01-19 [1] CRAN (R 4.0.3)
## bookdown 0.21 2020-10-13 [1] CRAN (R 4.0.3)
## cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
## import 1.2.0 2020-09-24 [1] CRAN (R 4.0.2)
## knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2)
## rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
## withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
## xfun 0.20 2021-01-06 [1] CRAN (R 4.0.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
##
## [1] /Users/guillaume/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library