Keeping up with new nucleosome structures

In this post, I will show how I made the bar graph in Box 3 of the review I co-authored recently (Zhou et al., 2019). This involves collecting metadata from the PDB, which I explained in the first post in this series. Having an automated way to regenerate this graph was invaluable while writing the review, because new nucleosome structures were published at a very rapid rate over the last few months.

Required packages

We will need the following packages:

library(magrittr)
library(purrr)
library(forcats)
library(jsonlite)
library(readr)
library(tibble)
library(dplyr)
library(ggplot2)
library(plotly)

Getting and cleaning data

This bar graph will present the number of PDB entries containing a nucleosome by year of publication since 1997. We need to request a dataset containing all released structures (status:REL) with the word “nucleosome” in their title, and with the following columns (corresponding PDBe API fields):

  • database accession code (pdb_id),
  • publication year (citation_year),
  • structure title (title),
  • experimental method (experimental_method).

Older entries in the PDB have their titles in all capitals, while newer entries have normal-case titles. The PDBe API search is case-sensitive, and I haven’t found how to make it case-insensitive (or even whether that is possible at all), so we will need two queries. Leaving the first letter out of the wildcard pattern lets the lowercase query match both “Nucleosome” and “nucleosome”.

This translates as follows:

# We need two queries: one for uppercase titles, the other for lowercase ones
pdb_queries <- c(
    uppercase = 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=title:*UCLEOSOM*%20AND%20status:REL&fl=pdb_id,citation_year,title,experimental_method&rows=1000000&wt=json',
    lowercase = 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=title:*ucleosom*%20AND%20status:REL&fl=pdb_id,citation_year,title,experimental_method&rows=1000000&wt=json'
)
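The two URLs differ only in the title pattern, so they could also be assembled by a small helper. The sketch below is one possible way to do it: build_pdbe_query is a hypothetical name (it is not part of any package), and it only rebuilds the same strings as above without contacting the API.

```r
# Hypothetical helper (not from any package): assemble a PDBe search URL
# for a given title pattern, with the same fields as the queries above.
build_pdbe_query <- function(title_pattern,
                             fields = c("pdb_id", "citation_year",
                                        "title", "experimental_method"),
                             rows = 1000000) {
    paste0(
        "https://www.ebi.ac.uk/pdbe/search/pdb/select?",
        "q=title:", title_pattern, "%20AND%20status:REL",
        "&fl=", paste(fields, collapse = ","),
        # format() avoids scientific notation ("1e+06") in the URL
        "&rows=", format(rows, scientific = FALSE),
        "&wt=json"
    )
}

pdb_queries_built <- c(
    uppercase = build_pdbe_query("*UCLEOSOM*"),
    lowercase = build_pdbe_query("*ucleosom*")
)
```

Additional fields can then be requested by changing the fields argument instead of editing two long strings by hand.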

Some PDB entries have the word “nucleosome” in their title, but do not actually contain a nucleosome. This is the case for structures of Nap1 (nucleosome assembly protein 1), for instance. I could not find a more accurate way than combing through the dataset one entry at a time and building a list of such irrelevant entries to filter them out.

# The following PDB entries do not contain a nucleosome
non_nucleosome_structures <- c(
    "1hst",
    "2z2r",
    "5x7v",
    "3uv2",
    "3fs3",
    "1wg3",
    "1nw3",
    "5ikf",
    "3gyw",
    "3gyv",
    "1ofc",
    "2ayu",
    "2iw5",
    "3hfd"
)

We can then retrieve and clean up the dataset as follows. Having to perform two case-sensitive queries complicates the procedure slightly. The key steps are to keep only one occurrence of each PDB accession code (a query returns every matching molecule, and a structure often contains several molecules, as explained in the first post in this series) and to filter out the PDB entries listed above that don’t contain a nucleosome.

nucleosome_structures_dataset <- "datasets/nucleosome-structures.csv"

dig_up_data <- . %>%
    .$response %>%
    .$docs %>%
    as_tibble()

if (!file.exists(nucleosome_structures_dataset)) {
    pdb_data <- pdb_queries %>%
        map(fromJSON) %>%
        map(dig_up_data) %>% 
        bind_rows() %>%
        distinct(pdb_id, .keep_all = TRUE) %>% 
        filter(!(pdb_id %in% non_nucleosome_structures)) %>%
        mutate(experimental_method = as.character(experimental_method))
    # readr >= 1.4.0 names this argument `file`; `path` is deprecated
    write_csv(pdb_data, file = nucleosome_structures_dataset)
} else {
    pdb_data <- read_csv(nucleosome_structures_dataset)
}

pdb_data %<>%
    mutate(experimental_method = as_factor(experimental_method),
           citation_year       = as.integer(citation_year))

pdb_data
## # A tibble: 179 x 4
##    citation_year experimental_meth… pdb_id title                                
##            <int> <fct>              <chr>  <chr>                                
##  1          2003 X-ray diffraction  1m19   LIGAND BINDING ALTERS THE STRUCTURE …
##  2          2001 X-ray diffraction  1id3   CRYSTAL STRUCTURE OF THE YEAST NUCLE…
##  3          2013 X-ray diffraction  4x23   CRYSTAL STRUCTURE OF CENP-C IN COMPL…
##  4          1997 X-ray diffraction  1aoi   COMPLEX BETWEEN NUCLEOSOME CORE PART…
##  5          2000 X-ray diffraction  1f66   2.6 A CRYSTAL STRUCTURE OF A NUCLEOS…
##  6          2000 X-ray diffraction  1eqz   X-RAY STRUCTURE OF THE NUCLEOSOME CO…
##  7          2003 X-ray diffraction  1m1a   LIGAND BINDING ALTERS THE STRUCTURE …
##  8          2003 X-ray diffraction  1m18   LIGAND BINDING ALTERS THE STRUCTURE …
##  9          2004 X-ray diffraction  1p3f   Crystallographic Studies of Nucleoso…
## 10          2011 X-ray diffraction  3ayw   Crystal Structure of Human Nucleosom…
## # … with 169 more rows
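
The clean-up steps can also be tried offline on made-up data. The following is a minimal sketch, assuming the API response has the nested response/docs shape implied by the code above. The JSON strings, the accession codes ("aaaa", "bbbb", "cccc") and the names toy_responses, toy_excluded and extract_docs are all hypothetical.

```r
library(purrr)
library(jsonlite)
library(tibble)
library(dplyr)

# Made-up responses standing in for the uppercase and lowercase queries.
# "aaaa" appears twice, as happens when several molecules of the same
# entry match a query.
toy_responses <- c(
    uppercase = '{"response": {"docs": [
        {"pdb_id": "aaaa", "citation_year": 2001, "title": "TOY NUCLEOSOME ENTRY"},
        {"pdb_id": "aaaa", "citation_year": 2001, "title": "TOY NUCLEOSOME ENTRY"}
    ]}}',
    lowercase = '{"response": {"docs": [
        {"pdb_id": "bbbb", "citation_year": 2015, "title": "Toy nucleosome assembly protein"},
        {"pdb_id": "cccc", "citation_year": 2018, "title": "Toy nucleosome complex"}
    ]}}'
)

# Made-up exclusion list, playing the role of non_nucleosome_structures
toy_excluded <- c("bbbb")

# Same idea as dig_up_data, written as a plain function
extract_docs <- function(parsed) as_tibble(parsed$response$docs)

toy_data <- toy_responses %>%
    map(fromJSON) %>%                        # parse each JSON string
    map(extract_docs) %>%                    # keep only the table of results
    bind_rows() %>%                          # stack both result sets
    distinct(pdb_id, .keep_all = TRUE) %>%   # one row per accession code
    filter(!(pdb_id %in% toy_excluded))      # drop irrelevant entries

toy_data
```

Only the rows for "aaaa" and "cccc" should remain: the duplicate is collapsed by distinct() and the excluded entry is removed by filter().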

Here is the resulting dataset for download.

Making a bar graph

From this table, we can now easily construct a bar graph. This is the same graph I made for the review article, except this one is also interactive (you can zoom in and out, and hovering over the bars will indicate the years and counts).

figure <- ggplot(data = pdb_data) +
    geom_bar(mapping = aes(x = citation_year, fill = experimental_method)) +
    guides(fill = guide_legend(title = "Experimental method")) +
    ggtitle("Nucleosome structures") +
    xlab("Publication year") +
    ylab("Number of PDB entries") +
    theme_bw()
ggplotly(figure)
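
geom_bar() counts the rows that fall in each year and experimental method. To check the numbers behind the bars, the same tally can be computed with dplyr::count(). Here is a minimal sketch on a made-up table; toy_structures and its values are hypothetical, not taken from the real dataset.

```r
library(tibble)
library(dplyr)

# Made-up table with the two columns used by the bar graph
toy_structures <- tibble(
    citation_year       = c(1997L, 2000L, 2000L, 2018L),
    experimental_method = c("X-ray diffraction", "X-ray diffraction",
                            "X-ray diffraction", "Electron Microscopy")
)

# One row per year and method, with the number of entries in column n
toy_counts <- toy_structures %>%
    count(citation_year, experimental_method)

toy_counts
```

Running the same count() call on pdb_data would give the bar heights of the figure as a table.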

Going further

To generate this graph, only the following fields are absolutely necessary:

  • database accession code (pdb_id),
  • publication year (citation_year),
  • experimental method (experimental_method).

I also included the title field in this post to show in the table, but it was not required to build the graph. Now, when writing the review, I also downloaded two additional fields:

  • resolution (resolution),
  • article DOI (citation_doi).

Knowing the resolution was somewhat useful, because we tended to focus on high-resolution structures in the review. Being able to get each entry’s DOI was very helpful to easily access the corresponding article. The citation_doi field only contains a DOI, not a full URL, but with stringr::str_c it’s very easy to prefix every DOI with https://doi.org/ or the URL of your favorite DOI resolver. Then, by saving the table as an Excel-compatible CSV file with readr::write_excel_csv (and converting it to XLSX with Excel), the DOI links can be made clickable.
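
As a rough sketch of that last step, assuming a table with a citation_doi column: the accession codes below are placeholders, the only real DOI is the one of the review cited in this post, and doi_example, citation_url and out_file are names made up for this example.

```r
library(tibble)
library(dplyr)
library(stringr)
library(readr)

# Placeholder table: "xxxx" and "yyyy" are not real accession codes
doi_example <- tibble(
    pdb_id       = c("xxxx", "yyyy"),
    citation_doi = c("10.1038/s41594-018-0166-x", NA)
)

# str_c() propagates NA, so entries without a DOI stay NA
doi_links <- doi_example %>%
    mutate(citation_url = str_c("https://doi.org/", citation_doi))

# Write a CSV file that Excel opens with the correct encoding
out_file <- tempfile(fileext = ".csv")
write_excel_csv(doi_links, out_file)
```

Entries without a DOI keep a missing value instead of a broken link, which makes them easy to spot in the spreadsheet.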

Replicate this post

You can replicate this post by running the Rmd source with RStudio and R. The easiest way to do so is to clone the entire Git repository.

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       macOS Catalina 10.15.7      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Stockholm            
##  date     2021-02-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  blogdown      1.1     2021-01-19 [1] CRAN (R 4.0.3)
##  bookdown      0.21    2020-10-13 [1] CRAN (R 4.0.3)
##  cli           2.3.0   2021-01-31 [1] CRAN (R 4.0.2)
##  colorspace    2.0-0   2020-11-11 [1] CRAN (R 4.0.2)
##  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
##  crosstalk     1.1.1   2021-01-12 [1] CRAN (R 4.0.2)
##  data.table    1.13.6  2020-12-30 [1] CRAN (R 4.0.2)
##  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
##  dplyr       * 1.0.4   2021-02-02 [1] CRAN (R 4.0.2)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  farver        2.0.3   2020-01-16 [1] CRAN (R 4.0.0)
##  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.0.2)
##  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.3)
##  ggplot2     * 3.3.3   2020-12-30 [1] CRAN (R 4.0.3)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
##  hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.2)
##  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
##  htmlwidgets   1.5.3   2020-12-10 [1] CRAN (R 4.0.2)
##  httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
##  jsonlite    * 1.7.2   2020-12-09 [1] CRAN (R 4.0.2)
##  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.2)
##  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.0.3)
##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 4.0.0)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
##  magrittr    * 2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
##  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.2)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
##  plotly      * 4.9.3   2021-01-10 [1] CRAN (R 4.0.2)
##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
##  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
##  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
##  scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  tibble      * 3.0.6   2021-01-29 [1] CRAN (R 4.0.2)
##  tidyr         1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
##  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
##  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.2)
##  viridisLite   0.3.0   2018-02-01 [1] CRAN (R 4.0.0)
##  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
##  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
## 
## [1] /Users/guillaume/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

References

Zhou K, Gaullier G & Luger K (2019) Nucleosome structure and dynamics are coming of age. Nature Structural & Molecular Biology 26: 3–13. https://doi.org/10.1038/s41594-018-0166-x