Running jobs on a cluster submitted via qsub from Python. Does it make sense?
I have a situation where I am doing some computation in Python, and based on the outcomes I get a list of target files that are candidates to be passed to a 2nd program.
For example, I have 50,000 files which contain ~2,000 items each. I want to filter certain items and call a command line program to do a calculation on some of those.
This program #2 can be used via the shell command line, but it requires a lengthy set of arguments. For performance reasons I have to run program #2 on a cluster.
Right now, I am running program #2 via subprocess.call("...", shell=True).
I would like to run it via qsub in the future, but I have no experience of how this can be done in a reasonably efficient manner.
Would it make sense to write temporary 'qsub' files and run them via subprocess() directly from the Python script, something like the sketch below? Or is there a better, maybe more Pythonic solution?
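For illustration, this is roughly what I have in mind; program2, its arguments, and the PBS options are just placeholders:

    # Rough sketch of the "temporary qsub file" idea (PBS-style cluster assumed;
    # program2 and its arguments are placeholders).
    import subprocess
    import tempfile

    def submit(target_file):
        script = ("#!/bin/bash\n"
                  "#PBS -l nodes=1\n"
                  "program2 --input {} --lots --of --args\n".format(target_file))
        # write the job script to a temporary file and hand it to qsub
        with tempfile.NamedTemporaryFile('w', suffix='.sh', delete=False) as fh:
            fh.write(script)
            job_script = fh.name
        subprocess.call(['qsub', job_script])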
Any ideas and suggestions are welcome!
It makes perfect sense, although I would go for another solution.
As far as I understand, you have programme #1 which determines which of your 50,000 files need to be computed by programme #2. Both programme #1 and #2 are written in Python. Excellent choice.
Incidentally, I have a Python module that might come in handy: https://gist.github.com/stefanedwards/8841307
If you are running the same qsub system as I have (no idea what ours is called), you cannot use command arguments on the submitted scripts. Instead, any options are submitted via the -v option, which puts them into environment variables, e.g.:
    [me@local ~] $ python isprime.py 1
    1: True
    [me@local ~] $ head -n 5 isprime.py
    #!/usr/bin/python
    ### python script ...
    import os
    os.chdir(os.environ.get('PBS_O_WORKDIR', '.'))
    [me@local ~] $ qsub -v isprime='1 2 3' isprime.py
    123456.cluster.control.com
    [me@local ~]
Here, isprime.py can handle command line arguments using argparse. You just need to check whether the script is running as a submitted job, and then retrieve the arguments from the environment variables (os.environ).
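To illustrate, here is a minimal sketch of such a dual-mode script; the primality test is only a stand-in, and the isprime variable name matches the qsub -v isprime='1 2 3' call above:

    #!/usr/bin/python
    # Sketch: accept arguments from the command line, or from the environment
    # variable set by `qsub -v isprime=...` when running as a submitted job.
    import argparse
    import os
    import sys

    def is_prime(n):
        # stand-in computation
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    if __name__ == '__main__':
        # PBS_O_WORKDIR is set by the batch system; fall back to '.' locally
        os.chdir(os.environ.get('PBS_O_WORKDIR', '.'))
        if 'isprime' in os.environ:          # running as a submitted job
            argv = os.environ['isprime'].split()
        else:                                # running interactively
            argv = sys.argv[1:]
        parser = argparse.ArgumentParser()
        parser.add_argument('numbers', type=int, nargs='+')
        args = parser.parse_args(argv)
        for n in args.numbers:
            print('{}: {}'.format(n, is_prime(n)))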
Once programme #2 is modified to run on the cluster, programme #1 can submit jobs using subprocess.call(['qsub', '-v options=...', 'programme2.py'], shell=False).
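For example, programme #1 could submit one job per selected file roughly like this; the options variable name and programme2.py are whatever you use on the receiving side, and selected_files stands in for the result of your filtering step:

    import subprocess

    def submit_job(target_file, extra_args=''):
        # everything goes through a single environment variable; programme2.py
        # splits it again on the cluster side (see the isprime.py sketch above)
        subprocess.call(['qsub', '-v', 'options={} {}'.format(target_file, extra_args),
                         'programme2.py'], shell=False)

    # selected_files would come from the filtering step in programme #1
    selected_files = ['data/file_00001.txt', 'data/file_00002.txt']
    for target in selected_files:
        submit_job(target, '--threshold 0.05')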
Another approach would be to queue the files in a database (say, an SQLite database). You could have programme #1 check for non-processed entries in the database and determine the outcome (run, do not run, run with special options). You then have the opportunity to run programme #2 in parallel on the cluster, where it simply checks the database for files to analyse.
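A minimal sqlite3 sketch of such a queue, as seen from programme #1 (the table and column names are just an assumption):

    import sqlite3

    conn = sqlite3.connect('queue.db')
    with conn:
        # status is one of: pending / running / done / skip
        conn.execute("""CREATE TABLE IF NOT EXISTS files (
                            path    TEXT PRIMARY KEY,
                            options TEXT,
                            status  TEXT DEFAULT 'pending')""")
        # programme #1: decide what to do with each candidate file and record it
        conn.execute("INSERT OR IGNORE INTO files (path, options) VALUES (?, ?)",
                     ('data/file_00001.txt', '--threshold 0.05'))
    conn.close()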
Edit: When programme #2 is an executable instead of a Python script, use a bash script that takes the environment variables and puts them onto the command line for the programme:
    #!/bin/bash
    cd .
    # put the options into context/flags etc.
    if [ -n "$option1" ]; then _opt1="--opt1 $option1"; fi
    # we can also define our own defaults
    _opt2='--no-verbose'
    if [ -n "$opt2" ]; then _opt2="-o $opt2"; fi
    /path/to/exe $_opt1 $_opt2
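You would then submit it with the variables on the -v list, e.g. (wrapper.sh is a placeholder name for the script above):

    qsub -v option1=foo,opt2=bar wrapper.sh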
If you are going for the database solution, you would have a Python script that checks the database for unprocessed files, marks a file as being processed (do these two in a single transaction), gets the options, calls the executable with subprocess, and, when done, marks the file as done and checks for a new file, etc.
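Roughly, using the same hypothetical table as above and a placeholder executable path:

    import sqlite3
    import subprocess

    conn = sqlite3.connect('queue.db')
    while True:
        with conn:  # claim one pending file and mark it in a single transaction
            row = conn.execute(
                "SELECT path, options FROM files WHERE status = 'pending' LIMIT 1"
            ).fetchone()
            if row is None:
                break
            path, options = row
            conn.execute("UPDATE files SET status = 'running' WHERE path = ?", (path,))
        # run the external programme on the claimed file
        subprocess.call(['/path/to/exe'] + options.split() + [path], shell=False)
        with conn:
            conn.execute("UPDATE files SET status = 'done' WHERE path = ?", (path,))
    conn.close()

With many workers in parallel you would want the claim step to be stricter (e.g. BEGIN IMMEDIATE or a worker-id column), but this shows the basic loop.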