Basic Parallelisation in Bash
I have recently started writing some quite CPU-intensive code, and since we have a nice cluster here (without any management software on it) I decided it would be best for me to take advantage of the number of cores on it. Actually, this works nicely on my desktop, which has four cores anyway (and 4GB of RAM). Essentially, I’m lazy, and this basically runs my pipeline in parallel for different sources by sending different jobs off to different cores (up to the maximum number that you specify)… and then when they finish it runs the next few… and so on until they are all done. No longer do I have to worry about waiting for the jobs to finish and wasting time by missing their finishing point. I also no longer have to have lots of [screen] sessions or loads of terminals open… bliss…
Oh here is my basic [Bash] script:
#!/bin/bash
#By Samuel George; http://www.krioma.net

#My scripts run in sub directories, replace with your list of commands to run.
array=(`ls -d */`)
len=${#array[@]}

maxjobs=4      #maximum number of jobs to run at once
jobnumber=0

#loop over the maximum number of jobs based on the number of files in array
while [ $jobnumber -lt $len ]; do
    jobsrunning=0
    #start jobs up till maximum and then wait for them to finish before continuing.
    while [ $jobsrunning -lt $maxjobs ] && [ $jobnumber -lt $len ]; do
        #go into the directory and run – this is an oddity of my processing
        #replace ./run.sh with your own functions
        cd "${array[$jobnumber]}"
        ./run.sh &
        #go back a dir
        cd ..
        #add to counter so that you know how many jobs are running at once.
        jobsrunning=$(( $jobsrunning + 1 ))
        #keep a running total of where you are, so the script loops over all the jobs you want run
        jobnumber=$(( $jobnumber + 1 ))
    done
    #wait for this batch to finish before starting the next
    wait
done
Stick this in a shell script, chmod 700 file.sh and job done.
There is definitely room for improvement here. For example, this code will wait until all of the processes in the inner loop have finished; ideally you’d want it to move on as soon as the first one finishes. Watch this space… well, that might not happen, since my tasks all take about the same time to finish anyway.
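For what it’s worth, newer versions of Bash make that improvement fairly easy: `wait -n` (Bash 4.3 and later) blocks until any single background job exits, so you can top the pool back up immediately instead of waiting for a whole batch. Here is a minimal sketch of that idea; the `sleep`/`echo` job is just a stand-in for your real command:

```shell
#!/bin/bash
#requires Bash 4.3+ for `wait -n`
maxjobs=4

for jobnumber in 1 2 3 4 5 6 7 8; do
    #throttle: once $maxjobs jobs are running, wait for any ONE of them to exit
    while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do
        wait -n
    done
    #stand-in job; replace with your own command
    ( sleep 0.2; echo "job $jobnumber done" ) &
done
wait   #catch the last few stragglers
```

This keeps exactly `maxjobs` jobs running at all times, so a pool of mixed-length tasks wastes no cores waiting for the slowest member of each batch.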