Keeping up with new nucleosome structures
In this post, I will show how I made the bar graph in Box 3 of the review I co-authored recently (Zhou et al., 2019). This involves collecting metadata from the PDB, which I explained in the first post in this series. Having an automated way to regenerate this graph was invaluable during the writing of this review, because new nucleosome structures were published at a very rapid rate over the last few months.
Required packages
We will need the following packages:
library(magrittr)
library(purrr)
library(forcats)
library(jsonlite)
library(readr)
library(tibble)
library(dplyr)
library(ggplot2)
library(plotly)
Getting and cleaning data
This bar graph will present the number of PDB entries containing a nucleosome by year of publication since 1997. We need to request a dataset containing all released structures (status:REL) with the word “nucleosome” in their title, and with the following columns (corresponding PDBe API fields):
- database accession code (pdb_id),
- publication year (citation_year),
- structure title (title),
- experimental method (experimental_method).
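We need to take into account the fact that older entries in the PDB have their titles in all capitals, while newer entries have normal case titles. The PDBe API search is case-sensitive, and I haven’t found how to make it case-insensitive (or even whether it is possible at all), so we will need two queries. Both queries use a wildcard pattern that leaves out the first letter of the word, so that the lowercase query also matches titles in which “Nucleosome” is capitalized.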
This translates as follows:
# We need two queries: one for uppercase titles, the other for lowercase ones
pdb_queries <- c(
  uppercase = 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=title:*UCLEOSOM*%20AND%20status:REL&fl=pdb_id,citation_year,title,experimental_method&rows=1000000&wt=json',
  lowercase = 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=title:*ucleosom*%20AND%20status:REL&fl=pdb_id,citation_year,title,experimental_method&rows=1000000&wt=json'
)
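The two URLs differ only in the title pattern, so they can also be assembled from their parts. Here is a minimal sketch of that idea, using base R only. Note that build_pdbe_query and pdb_queries_sketch are hypothetical names I am introducing for illustration; the base URL, the field list and the row limit are simply copied from the queries above.

```r
# Hypothetical helper (not part of any package): assembles a PDBe search URL
# for released entries whose title matches the given wildcard pattern.
build_pdbe_query <- function(title_pattern) {
  base_url <- "https://www.ebi.ac.uk/pdbe/search/pdb/select"
  fields <- c("pdb_id", "citation_year", "title", "experimental_method")
  paste0(
    base_url,
    "?q=title:", title_pattern, "%20AND%20status:REL",
    "&fl=", paste(fields, collapse = ","),
    "&rows=1000000&wt=json"
  )
}

# Same two queries as above, built from the two case variants of the pattern
pdb_queries_sketch <- c(
  uppercase = build_pdbe_query("*UCLEOSOM*"),
  lowercase = build_pdbe_query("*ucleosom*")
)
```

Adding a field to the request (for example the resolution) then only requires changing the fields vector in one place.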
Some PDB entries have the word “nucleosome” in their title, but do not actually contain a nucleosome. This is the case for structures of Nap1 (nucleosome assembly protein 1), for instance. I could not find a more accurate way than combing through the dataset one entry at a time and building a list of such irrelevant entries to filter them out.
# The following PDB entries do not contain a nucleosome
non_nucleosome_structures <- c(
  "1hst",
  "2z2r",
  "5x7v",
  "3uv2",
  "3fs3",
  "1wg3",
  "1nw3",
  "5ikf",
  "3gyw",
  "3gyv",
  "1ofc",
  "2ayu",
  "2iw5",
  "3hfd"
)
We can then retrieve and clean up the dataset as follows. Having to perform two case-sensitive queries complicates the procedure a little bit; the key parts are to keep only one occurrence of each PDB accession code (remember that a query will return all molecules that match, and that a structure often contains several molecules, as explained in the first post in this series) and to filter out the above list of PDB entries that don’t contain a nucleosome.
nucleosome_structures_dataset <- "datasets/nucleosome-structures.csv"

# Extract the table of results from a parsed JSON response
dig_up_data <- . %>%
  .$response %>%
  .$docs %>%
  as_tibble()

# Query the PDBe API only if there is no local copy of the dataset
if (!file.exists(nucleosome_structures_dataset)) {
  pdb_data <- pdb_queries %>%
    map(fromJSON) %>%
    map(dig_up_data) %>%
    bind_rows() %>%
    # Keep a single row per PDB entry
    distinct(pdb_id, .keep_all = TRUE) %>%
    # Remove entries that do not contain a nucleosome
    filter(!(pdb_id %in% non_nucleosome_structures)) %>%
    mutate(experimental_method = as.character(experimental_method))
  write_csv(pdb_data, file = nucleosome_structures_dataset)
} else {
  pdb_data <- read_csv(nucleosome_structures_dataset)
}

pdb_data %<>%
  mutate(experimental_method = as_factor(experimental_method),
         citation_year = as.integer(citation_year))

pdb_data
## # A tibble: 179 x 4
## citation_year experimental_meth… pdb_id title
## <int> <fct> <chr> <chr>
## 1 2003 X-ray diffraction 1m19 LIGAND BINDING ALTERS THE STRUCTURE …
## 2 2001 X-ray diffraction 1id3 CRYSTAL STRUCTURE OF THE YEAST NUCLE…
## 3 2013 X-ray diffraction 4x23 CRYSTAL STRUCTURE OF CENP-C IN COMPL…
## 4 1997 X-ray diffraction 1aoi COMPLEX BETWEEN NUCLEOSOME CORE PART…
## 5 2000 X-ray diffraction 1f66 2.6 A CRYSTAL STRUCTURE OF A NUCLEOS…
## 6 2000 X-ray diffraction 1eqz X-RAY STRUCTURE OF THE NUCLEOSOME CO…
## 7 2003 X-ray diffraction 1m1a LIGAND BINDING ALTERS THE STRUCTURE …
## 8 2003 X-ray diffraction 1m18 LIGAND BINDING ALTERS THE STRUCTURE …
## 9 2004 X-ray diffraction 1p3f Crystallographic Studies of Nucleoso…
## 10 2011 X-ray diffraction 3ayw Crystal Structure of Human Nucleosom…
## # … with 169 more rows
Here is the resulting dataset for download.
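To try out the cleaning steps without querying the PDBe API, here is a small offline sketch. It assumes that the response has the same nesting as above (a response element containing a docs table). The JSON string, the accession codes and the titles are all made up for illustration, and extract_docs is a hypothetical name for a plain-function equivalent of dig_up_data.

```r
library(jsonlite)
library(dplyr)
library(tibble)

# Made-up response: entry "aaaa" appears twice, as if two of its molecules
# had matched the query; entry "cccc" stands for an irrelevant hit.
toy_json <- '{
  "response": {
    "docs": [
      {"pdb_id": "aaaa", "citation_year": 2001, "title": "TOY NUCLEOSOME A"},
      {"pdb_id": "aaaa", "citation_year": 2001, "title": "TOY NUCLEOSOME A"},
      {"pdb_id": "bbbb", "citation_year": 2015, "title": "Toy nucleosome B"},
      {"pdb_id": "cccc", "citation_year": 2010, "title": "Toy nucleosome assembly protein"}
    ]
  }
}'

# Hypothetical exclusion list for the toy data
toy_excluded <- c("cccc")

# Plain-function equivalent of the functional sequence used above
extract_docs <- function(parsed) {
  as_tibble(parsed$response$docs)
}

toy_clean <- fromJSON(toy_json) %>%
  extract_docs() %>%
  distinct(pdb_id, .keep_all = TRUE) %>%   # one row per accession code
  filter(!(pdb_id %in% toy_excluded))      # drop the irrelevant entry

toy_clean
```

In this toy case, distinct() keeps the first of the two "aaaa" rows and filter() then removes "cccc", so two entries remain.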
Making a bar graph
From this table, we can now easily construct a bar graph. This is the same graph I made for the review article, except this one is also interactive (you can zoom in and out, and hovering over the bars will indicate the years and counts).
figure <- ggplot(data = pdb_data) +
  geom_bar(mapping = aes(x = citation_year, fill = experimental_method)) +
  guides(fill = guide_legend(title = "Experimental method")) +
  ggtitle("Nucleosome structures") +
  xlab("Publication year") +
  ylab("Number of PDB entries") +
  theme_bw()

ggplotly(figure)
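When only x is mapped, geom_bar() counts the rows for each value of x (and here for each fill group), so the bar heights can also be obtained as a table with dplyr::count(). The sketch below shows this on a few made-up rows; toy_entries and the n_entries column name are assumptions for illustration, not part of the dataset above.

```r
library(dplyr)
library(tibble)

# Made-up rows with the same two columns the graph relies on
toy_entries <- tibble(
  citation_year = c(1997L, 1997L, 2018L, 2018L, 2018L),
  experimental_method = factor(c(
    "X-ray diffraction", "X-ray diffraction", "X-ray diffraction",
    "Electron Microscopy", "Electron Microscopy"
  ))
)

# One row per year and method, with the number of entries in each group
toy_counts <- toy_entries %>%
  count(citation_year, experimental_method, name = "n_entries")

toy_counts
```

Running the same count() call on pdb_data should give the numbers behind each colored segment of the bar graph.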
Going further
To generate this graph, only the following fields are absolutely necessary:
- database accession code (pdb_id),
- publication year (citation_year),
- experimental method (experimental_method).
I also included the title field in this post to show in the table, but it was not required to build the graph. Now, when writing the review, I also downloaded two additional fields:
- resolution (resolution),
- article DOI (citation_doi).
Knowing the resolution was somewhat useful, because we tended to focus on high-resolution structures in the review. Being able to get each entry’s DOI was very helpful to easily access the corresponding article. The citation_doi field only contains a DOI, not a full URL, but with stringr::str_c it’s very easy to prepend all DOIs with https://doi.org/ or the URL of your favorite DOI resolver. Then, by saving the table as an Excel spreadsheet with readr::write_excel_csv (and converting to XLSX with Excel), the DOI links can be made clickable.
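As a minimal sketch of that last step, here is how the DOI column could be turned into full URLs and written to a file Excel can open. The DOIs below are placeholders, not real articles, and toy_citations, citation_url and the temporary output file are names I am assuming for illustration.

```r
library(dplyr)
library(tibble)
library(stringr)

# Placeholder accession codes and DOIs; the last entry has no DOI
toy_citations <- tibble(
  pdb_id = c("aaaa", "bbbb", "cccc"),
  citation_doi = c("10.1000/example-1", "10.1000/example-2", NA)
)

# str_c() propagates missing values, so entries without a DOI stay NA
toy_links <- toy_citations %>%
  mutate(citation_url = str_c("https://doi.org/", citation_doi))

# Write a CSV file that Excel opens with the correct encoding
toy_output <- tempfile(fileext = ".csv")
readr::write_excel_csv(toy_links, toy_output)
```

Because str_c() returns NA when any of its inputs is NA, entries without a DOI do not end up with a dangling https://doi.org/ link.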
Replicate this post
You can replicate this post by running the Rmd source with RStudio and R. The easiest way to replicate this post is to clone the entire git repository.
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os macOS Catalina 10.15.7
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Stockholm
## date 2021-02-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
## blogdown 1.1 2021-01-19 [1] CRAN (R 4.0.3)
## bookdown 0.21 2020-10-13 [1] CRAN (R 4.0.3)
## cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.2)
## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2)
## crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
## crosstalk 1.1.1 2021-01-12 [1] CRAN (R 4.0.2)
## data.table 1.13.6 2020-12-30 [1] CRAN (R 4.0.2)
## DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## dplyr * 1.0.4 2021-02-02 [1] CRAN (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.0)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
## ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
## hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.2)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
## htmlwidgets 1.5.3 2020-12-10 [1] CRAN (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
## jsonlite * 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
## knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.3)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.0.0)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
## magrittr * 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
## plotly * 4.9.3 2021-01-10 [1] CRAN (R 4.0.2)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
## rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2)
## rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
## scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
## tibble * 3.0.6 2021-01-29 [1] CRAN (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
## vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.0)
## withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
## xfun 0.20 2021-01-06 [1] CRAN (R 4.0.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
##
## [1] /Users/guillaume/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library