New tool: countparticles
In my previous post, I wrote that I would not bother writing a
Python command-line program to simply count particles in each class in a
run_data.star file from RELION, because it is straightforward to do with
AWK (and it probably runs faster on large files). I
changed my mind and made a new tool (also installable with pip and citeable
with doi:10.5281/zenodo.4139778).
I still love the AWK solution for many reasons: it is indeed straightforward, it
runs fast even on large files, and there is no boilerplate code to write to
handle file input. And most importantly, it consists of only one file: drop it
anywhere in your $PATH, make it executable, and it will work on any system that
has AWK,
which means everywhere, since AWK is “mandatory” in the sense of IEEE Std
1003.1-2008. The only problem is that the layout of star files from RELION
changes between versions, so the data needed for this counting is not
always stored in the same column. And AWK can only refer to a column by index,
not by name. Changing the column number in the AWK script is trivial, but
when the script produces nonsensical output or no output at all while I am
trying to make sense of data, this kind of limitation gets frustrating quickly.
The obvious solution was to write a Python program, because the starfile
library produces pandas DataFrames, and these in turn can
refer to a column by its name as defined in the star file. This works regardless
of the column’s numerical index, so it doesn’t break when a new version of
RELION produces star files in which the relevant column has a new index. The
downside is having to manage a Python installation… Luckily, conda makes this
manageable, but it now seems like a way over-engineered
solution to the simple problem of counting lines by groups in a file… so I
also added an option to display a bar graph representing the counts.
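
To illustrate why looking up a column by name survives a layout change, here is
a minimal sketch. It is not the code of countparticles: to stay free of
dependencies it swaps starfile and pandas for a small hand-written parser. The
function name count_by_class and the two sample layouts are made up for
illustration, and the sketch assumes the class is stored under the
_rlnClassNumber label.

```python
from collections import Counter


def count_by_class(star_text, column="_rlnClassNumber"):
    """Count data rows per value of `column`, looked up by name, not position."""
    names = []          # column labels of the current loop_, in file order
    counts = Counter()
    in_loop = False
    for line in star_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                            # skip blank lines and comments
        if line.startswith("data_"):
            names, in_loop = [], False          # a new data block starts
        elif line == "loop_":
            names, in_loop = [], True           # a new table header starts
        elif in_loop and line.startswith("_"):
            names.append(line.split()[0])       # "_rlnClassNumber #3" -> label
        elif in_loop and column in names:
            fields = line.split()
            counts[fields[names.index(column)]] += 1
    return counts


# Two hypothetical layouts: same particles, class column at a different index.
OLD_LAYOUT = """
data_

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnClassNumber #3
100.0 200.0 1
110.0 210.0 2
120.0 220.0 1
"""

NEW_LAYOUT = """
data_particles

loop_
_rlnClassNumber #1
_rlnCoordinateX #2
_rlnCoordinateY #3
1 100.0 200.0
2 110.0 210.0
1 120.0 220.0
"""

if __name__ == "__main__":
    for layout in (OLD_LAYOUT, NEW_LAYOUT):
        print(dict(count_by_class(layout)))
```

Both layouts give the same counts (two particles in class 1, one in class 2),
whereas a script hard-coded to read the third column would only work on the
first one.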