In this post, I will show how I made the bar graph in Box 3 of the review I co-authored recently (Zhou et al., 2019). This involves collecting metadata from the PDB, which I explained in the first post in this series. Having an automated way to regenerate this graph was invaluable while writing the review, because new nucleosome structures were published at a very rapid rate over the last few months.
Getting and cleaning data
This bar graph will present the number of PDB entries containing a nucleosome
by year of publication since 1997. We need to request a dataset containing all
released structures (status:REL) with the word “nucleosome” in their title,
and with the following columns (corresponding PDBe API fields):
- database accession code (pdb_id)
- publication year (citation_year)
- structure title (title)
- experimental method (experimental_method)
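As a rough sketch of how such a request can be assembled from its parts, the helper below pastes a search term and a vector of field names into a search URL. The function name `build_query` and its arguments are hypothetical, chosen for illustration; the base URL and the `q`, `fl`, `rows` and `wt` parameters are copied from the query URLs used in this post.

```r
# Hypothetical helper: assemble a PDBe search URL from its components.
build_query <- function(term, fields, max_rows = 1000000) {
  paste0(
    "https://www.ebi.ac.uk/pdbe/search/pdb/select?",
    # q: search the title for `term`, restricted to released entries
    "q=title:", term, "%20AND%20status:REL",
    # fl: comma-separated list of the columns to return
    "&fl=", paste(fields, collapse = ","),
    # rows: upper limit on the number of records returned
    # (format() avoids scientific notation such as "1e+06")
    "&rows=", format(max_rows, scientific = FALSE),
    # wt: ask for a JSON reply
    "&wt=json"
  )
}

fields <- c("pdb_id", "citation_year", "title", "experimental_method")
query_url <- build_query("*ucleosom*", fields)
```

Keeping the field names in one vector means a column can be added or removed in a single place if the dataset needs to change.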
We need to take into account the fact that older entries in the PDB have their titles in all capitals, while newer entries have normal case titles. The PDBe API search is case-sensitive, and I haven’t found a way to make it case-insensitive (or even whether it is possible at all), so we will need two queries. Each query omits the first letter of the word and wraps the rest in wildcards, so the lowercase query matches both “Nucleosome” and “nucleosome”.
This translates as follows:
# We need two queries: one for uppercase titles, the other for lowercase ones
pdb_queries <- c(
  uppercase = 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=title:*UCLEOSOM*%20AND%20status:REL&fl=pdb_id,citation_year,title,experimental_method&rows=1000000&wt=json',
  lowercase = 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=title:*ucleosom*%20AND%20status:REL&fl=pdb_id,citation_year,title,experimental_method&rows=1000000&wt=json'
)
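One possible way to run both queries and merge their results into a single table is sketched below, using the jsonlite package. The function names `extract_docs` and `fetch_entries` are hypothetical. The sketch assumes that the reply nests its records under `response$docs` and that `experimental_method` may hold several values per entry; both assumptions are worth checking against an actual reply.

```r
library(jsonlite)

# Pull the table of matching entries out of one JSON reply.
# `json` can be a URL, a file path or a JSON string: fromJSON() accepts all three.
# Assumption: the records sit under response$docs in the reply.
extract_docs <- function(json) {
  docs <- fromJSON(json)$response$docs
  # Assumption: experimental_method may contain several values per entry,
  # so collapse each one into a single string to get a plain character column.
  docs$experimental_method <- vapply(
    docs$experimental_method,
    function(m) paste(m, collapse = ", "),
    character(1),
    USE.NAMES = FALSE
  )
  docs
}

# Run every query, stack the resulting tables and drop exact duplicate rows.
fetch_entries <- function(queries) {
  tables <- lapply(unname(queries), extract_docs)
  entries <- unique(do.call(rbind, tables))
  rownames(entries) <- NULL
  entries
}
```

Calling `fetch_entries(pdb_queries)` would then download both result sets and return one data frame. Dropping duplicate rows is only a precaution here, in case the same record comes back more than once.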
Some PDB entries have the word “nucleosome” in their title, but do not actually contain a nucleosome. This is the case for structures of Nap1 (nucleosome assembly protein 1), for instance. I could not find a more accurate way than combing through the dataset one entry at a time and building a list of such irrelevant entries to filter them out.
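A minimal sketch of this filtering step follows. The table, the accession codes and the titles are made-up placeholders standing in for the real query results, and the variable names are hypothetical. The keyword search only shortlists candidates for manual review; it does not replace that review.

```r
# Toy table standing in for the query results (IDs and titles are placeholders).
entries <- data.frame(
  pdb_id = c("0aaa", "0bbb", "0ccc"),
  title = c(
    "Placeholder nucleosome core particle",
    "Placeholder nucleosome assembly protein 1",
    "Placeholder nucleosome with a bound factor"
  ),
  stringsAsFactors = FALSE
)

# Optional first pass: shortlist titles that look suspicious, to be checked by hand.
candidates <- entries[grepl("assembly protein", entries$title, ignore.case = TRUE), ]

# Hand-curated list of entries that mention nucleosomes but do not contain one.
not_nucleosomes <- c("0bbb")

# Keep only the rows whose accession code is absent from the exclusion list.
nucleosome_entries <- entries[!(entries$pdb_id %in% not_nucleosomes), ]
```

Storing the exclusion list as a plain vector of accession codes makes it easy to extend each time the dataset is refreshed and new irrelevant entries turn up.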