I have recently started using the high performance cluster at the Sanger Institute, and it comes with an interesting quandary. You have to be explicit about how much RAM you request when you submit your job (the PHE cluster wasn’t like this). That amount of RAM is then assigned to your job and won’t be available to other jobs for as long as yours runs.
This leads to an interesting question: ‘how much RAM should I request?’ The considerations, as far as my rudimentary understanding of HPCs goes, are:
- The less RAM you request, the sooner your jobs will run, because there will be more ‘spaces’ they can fit into; i.e. if you demand loads of RAM, fewer machines will have enough available, so you will wait longer.
- If you don’t request enough RAM, your job will crash and will have to be re-run. This has an, admittedly fairly minor, cost in my time.
- If you request loads of RAM that you don’t need, that is very inefficient, and other jobs that do need that RAM will be delayed.
- Chances are, your jobs have a range of RAM requirements, even within the same workflow. For example, on a recent batch where I requested 6 GB of RAM per job, the jobs that finished had a mean requirement of 4.3 GB with a standard deviation of 0.6 GB, but 15% of the jobs ran out of memory and will need to be re-run.
Requesting 6 GB of RAM seems like a decent compromise, with 85% of jobs finishing successfully, but I wonder whether there is a more principled (either mathematically or ethically) way of doing it?
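One mathematically principled option might be to treat it as a quantile problem: record the peak memory of a previous batch and request a high percentile of that, rather than the mean plus a fixed margin. A 15% failure rate at 6 GB, nearly three standard deviations above the 4.3 GB mean, suggests the usage distribution is heavy-tailed, so the tail matters more than the mean. A minimal sketch, with made-up usage numbers:

```python
# A minimal sketch of a percentile-based memory request. The peak-usage
# numbers below are hypothetical; in practice you would collect them from
# your scheduler's accounting records for a previous batch.
import math

observed_peak_gb = [4.1, 4.5, 3.9, 4.3, 5.0, 4.2, 8.7, 4.4, 4.6, 12.1]

def request_for_quantile(samples, q):
    """Nearest-rank quantile of observed peak usage."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, math.ceil(q * len(ranked)) - 1)
    return ranked[idx]

# Aim for ~95% of jobs succeeding first time, plus a 10% safety margin.
request_gb = request_for_quantile(observed_peak_gb, 0.95) * 1.1
print(f"request {request_gb:.1f} GB per job")
```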
Having worked in HPC, the first-pass solution is to see if you can fit your job within the RAM/CPU ratio of the machines. For example, if the majority of machines have 16 cores and 64 GB, request 4 GB/core. This allows most other jobs to run proportionally…
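In code terms, that rule of thumb is just the node’s GB-per-core ratio scaled by however many cores your job asks for; a tiny hypothetical helper, purely for illustration, not a real scheduler call:

```python
# The GB-per-core rule of thumb as a tiny, hypothetical helper.
def proportional_request_gb(node_mem_gb=64, node_cores=16, job_cores=1):
    """Memory request matching the cluster's GB-per-core ratio."""
    return node_mem_gb / node_cores * job_cores

print(proportional_request_gb())             # 4.0 GB for a 1-core job
print(proportional_request_gb(job_cores=4))  # 16.0 GB for a 4-core job
```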
Thanks Matthew, good tip!
If you write your pipelines with a workflow manager, it can take care of this issue for you. You can set the number of times to retry a specific process, increasing the amount of RAM per retry. See here for a nice post on dealing with resource allocation and errors: https://www.nextflow.io/blog/2016/error-recovery-and-automatic-resources-management.html
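To make the idea concrete outside any particular workflow manager, here is an illustrative Python sketch of the retry-with-more-memory pattern. This is not Nextflow syntax; `submit_job` is a hypothetical stand-in that simulates a memory-capped run locally with `ulimit`, and on a real cluster you would translate the request into your scheduler’s flags:

```python
# Illustrative sketch of retry-with-increasing-memory, the pattern that
# workflow managers such as Nextflow automate. NOT Nextflow syntax.
import subprocess

def submit_job(cmd, mem_gb):
    """Hypothetical stand-in for a scheduler submission: caps virtual
    memory locally with ulimit (which takes KB) and runs the command."""
    capped = f"ulimit -v {mem_gb * 1024 * 1024}; {cmd}"
    return subprocess.run(["bash", "-c", capped]).returncode

def run_with_retries(cmd, base_mem_gb=4, max_retries=3):
    for attempt in range(1, max_retries + 1):
        mem_gb = base_mem_gb * attempt  # escalate the request each attempt
        if submit_job(cmd, mem_gb) == 0:
            return mem_gb  # succeeded with this much memory
        # In practice, check the exit code first to distinguish an
        # out-of-memory kill from other failures before retrying.
    raise RuntimeError(f"still failing after {max_retries} attempts")

run_with_retries("echo hello")  # trivially succeeds on the first attempt
```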
Wow, that’s so cool. The other thing I was looking at today was adopting Nextflow, so great timing!
If it’s just one of many jobs, try submitting test jobs you expect to be outliers (e.g. the largest fastqs) in interactive mode and monitor the resource usage (qacct -j, I think, in SGE), then use those settings (+ a bit) when you submit your main batch of sequences.
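If you want to script that check, something like the sketch below pulls the peak memory out of the accounting output, assuming your SGE’s `qacct -j <jobid>` prints a `maxvmem` line (field names can vary between grid engine flavours):

```python
# Sketch: pull peak memory (maxvmem) for a finished test job out of SGE
# accounting output. Assumes `qacct -j <jobid>` prints a maxvmem line.
import subprocess

def peak_memory(job_id):
    out = subprocess.run(
        ["qacct", "-j", str(job_id)],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("maxvmem"):
            return line.split(maxsplit=1)[1].strip()  # e.g. "5.2G"
    return None

# e.g. peak_memory(123456) -> "5.2G"; request that plus a bit for the batch
```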
Hi Phil. Let me know how you get on with Nextflow. I’m thinking that may be our next workflow manager in PHE 🙂 Have already had some good conversations with the Nextflow people.
will do!