| Path: | sge.rb |
| Last Update: | Fri May 21 23:48:46 +0900 2010 |
| Copyright: | Copyright (C) 2009, 2010 Toshiaki Katayama <ktym at hgc dot jp> |
| License: | Distributed under the same terms as Ruby |
| Site: | kanehisa.hgc.jp/~k/sge/ |
| Download: | kanehisa.hgc.jp/~k/sge/sge.rb |
| Version: | 2.3 |
As of version 2.0, this library can also be used as a command.
Usage:
% sge.rb [options...] -q input_file -t db_file -c 'command --opts #{query} > #{output}'
Options:
-q or --query file
Specify a flatfile including multiple entries.
-t or --target file
Specify a database file to be used.
-c or --command 'string'
Specify a command line to be executed.
The following identifiers can be used in the command line 'string'.
'#{query}' fragmented query file name (== input_file)
'#{target}' target database file name
'#{work_dir}' current working directory
'#{task_id}' SGE_TASK_ID
'#{slice}' -- task_id / @@slice (integer >= 1)
'#{input_file}' -- 'input/#{slice}/#{task_id}'
'#{output_file}' -- 'output/#{slice}/#{task_id}'
'#{error_file}' -- 'error/#{slice}/#{task_id}'
-o or --sge_opts 'string'
Additional options for the qsub command.
'-l s_vmem=16G -l mem_req=16' to reserve 16GB RAM for each job
'-l cpu_arch=xeon' to restrict jobs to xeon CPUs only
Resource reservation and backfill options:
'-R y -l s_rt=12:0:0' to limit max exec time to 12h (SIGUSR1)
'-R y -l h_rt=12:0:0' to limit max exec time to 12h (SIGKILL)
'-R y -pe mpi-fillup 4' to reserve 4 threads for MPI
-m or --task_min integer
First task number (default is 1; increase it to start partway through).
-M or --task_max integer
Last task number (default is the total number of entries in the query).
-s or --task_step integer
Number of processes per job (default is 1000). A large value is
recommended for short tasks with a large number of queries, and
a small value (including 1) can be used for time-consuming tasks
with a small number of queries.
--clear
Remove the SGE script and the output/error/log directories.
--clean
Remove the count file and the extracted input directory.
--distclean
Run both --clear and --clean.
-h or --help
Print this help message.
Examples:
% sge.rb -q data/query.pep -t data/target.pep -c 'blastall -p blastp -i #{query} -d #{target}' -o '-l cpu_arch=xeon'
% sge.rb -q data/query.nuc -t /usr/local/db/blast/ncbi/nr -c 'blastall -p blastx -s 10 -i #{query} -d #{target}' -o '-l cpu_arch=xeon -l sjob -l s_vmem=4G,mem_req=4'
% sge.rb -q data/dme.nuc -t data/dme.genome -s 1 -c 'exonerate --bestn 1 --model est2genome --showtargetgff 1 --showvulgar yes #{query} #{target}'
% sge.rb -q data/hsa.pep -t data/Pfam-A.hmm -m 1000 -M 2000 -s 10 -c 'hmmscan --tblout output/#{slice}/#{task_id}.tbl #{target} #{query}'
% sge.rb -q data/refseq.gb -c 'bp_genbank2gff3.pl -out stdout #{query}'
% sge.rb --distclean
See also:
http://kanehisa.hgc.jp/~k/sge/
The Bio::SGE class extracts entries from a biological flatfile as queries and submits them in bulk to the Sun Grid Engine as an array job.
This class takes a flatfile (e.g. multi FASTA file) as a ‘query’, a database file as a ‘target’, and a command line to be executed as a ‘command’ (see also SCRIPT VARIABLES section).
The flatfile must be accepted by the Bio::FlatFile.auto class method of the BioRuby (bioruby.org/) package.
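To make "one entry per query" concrete, here is a minimal pure-Ruby sketch that splits multi-FASTA text into single entries and numbers them from 1. It is not how sge.rb does it: the library relies on Bio::FlatFile.auto to detect and parse the format, whereas the hypothetical helper `split_fasta` below handles FASTA only and exists purely for illustration.

```ruby
# Hypothetical helper (not part of sge.rb): split multi-FASTA text into
# single-entry strings. The real library delegates parsing to BioRuby.
def split_fasta(text)
  # A new entry starts at every '>' found at the beginning of a line.
  text.split(/^(?=>)/).map(&:strip).reject(&:empty?)
end

fasta = <<~FASTA
  >seq1 first protein
  MKVLA
  AGIVG
  >seq2 second protein
  MSTNP
FASTA

entries = split_fasta(fasta)

# Number the entries from 1, the way array-job task IDs are numbered.
entries.each_with_index do |entry, i|
  id = entry[/\A>(\S+)/, 1]
  puts "#{i + 1}\t#{id}\t#{entry.lines.size - 1} sequence line(s)"
end
```

Each numbered entry corresponds to one unit of work that the command line is later run against.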
Instantiation of the Bio::SGE object can be done by
sge = Bio::SGE.new(query, target, command, sge_opts)
or by assigning these values through accessors prior to a job submission
sge = Bio::SGE.new
sge.query = 'flat_file'
sge.target = 'target_database_file'
sge.command = 'command --to_be_executed --with_opts'
or by assigning these values with a block parameter.
sge = Bio::SGE.new { |opt|
opt.query = 'flat_file'
opt.target = 'target_database_file'
opt.command = 'command --to_be_executed --with_opts'
}
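Then, the "prepare" method will create the output directories, generate an SGE script, and extract each entry of the query into the input directory, and now you can submit your SGE job with the "submit" method.
sge.prepare
sge.submit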
The "submit" method automatically takes care of tedious tasks such as (1) splitting array jobs according to the total number of jobs and (2) saving stdout and stderr from the SGE system to a separate log directory.
The execution results will be stored in the following files and directories.
count.txt # correspondence table of the file numbers and entry IDs
input/    # extracted sequence files (one file, one sequence)
output/   # outputs of the command (numbered the same as the input files)
error/    # errors of the command (numbered the same as the input files)
log/      # log files of the qsub run (stdout and stderr)
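The command-line help describes the per-task files as 'input/#{slice}/#{task_id}' (and likewise for output/ and error/), with the slice given as "task_id / @@slice (integer >= 1)". The sketch below shows one way such a layout could be computed. The slice size of 1000 and the ceiling division are assumptions made for illustration, since the exact formula is not spelled out here, and `task_paths` is a hypothetical helper rather than a method of Bio::SGE.

```ruby
# Hypothetical helper, not part of sge.rb. SLICE_SIZE and the ceiling
# division are assumptions; the documentation only states that the slice
# is derived from task_id and is an integer >= 1.
SLICE_SIZE = 1000

def task_paths(task_id, slice_size = SLICE_SIZE)
  # Ceiling division: tasks 1..1000 fall in slice 1, 1001..2000 in slice 2.
  slice = (task_id + slice_size - 1) / slice_size
  %w[input output error].map do |dir|
    [dir.to_sym, File.join(dir, slice.to_s, task_id.to_s)]
  end.to_h
end

[1, 1000, 1001, 8421].each do |task_id|
  puts "#{task_id}: #{task_paths(task_id)[:input]}"
end
```

Under these assumptions, grouping files by slice keeps any single directory from holding more than `SLICE_SIZE` files.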
You can check for system errors during the SGE execution by inspecting the sizes and contents of the files in the log/ directory.
Then, check the error/ directory to see whether any of your jobs ran into a problem (note that some commands use stderr for other purposes).
Finally, the main results can be obtained from the files in the output/ directory.
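The size check described above can be automated. The following is a minimal sketch, assuming the log/ and error/ layout listed earlier; `non_empty_files` is a hypothetical helper that is not part of sge.rb, and the file names created in the demonstration (such as `job.o1`) are made up.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper, not part of sge.rb: list the non-empty files below a
# directory. An empty log or error file usually means nothing was reported.
def non_empty_files(dir)
  Dir.glob(File.join(dir, '**', '*'))
     .select { |path| File.file?(path) && File.size(path) > 0 }
     .sort
end

# Demonstration in a throwaway directory that mimics the layout above.
Dir.mktmpdir do |work_dir|
  FileUtils.mkdir_p(File.join(work_dir, 'log'))
  FileUtils.mkdir_p(File.join(work_dir, 'error', '1'))
  File.write(File.join(work_dir, 'log', 'job.o1'), '')               # clean log
  File.write(File.join(work_dir, 'error', '1', '1'), '')             # no stderr
  File.write(File.join(work_dir, 'error', '1', '2'), "warning: x\n") # needs a look

  %w[log error].each do |dir|
    hits = non_empty_files(File.join(work_dir, dir))
    puts "#{dir}/: #{hits.size} non-empty file(s)"
    hits.each { |path| puts "  #{path.sub(work_dir + '/', '')}" }
  end
end
```

A non-empty file in error/ is not necessarily a failure, so the listed files still need to be read rather than counted.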
You can call the following methods individually instead of the "prepare" method.
sge.setup   # to prepare output directories
sge.script  # to generate an SGE script
sge.extract # to extract each entry
Therefore, if you want to reuse the sequence files already extracted to the input directory, just comment out the line calling the "prepare" method (and, of course, do not call the "extract" method either).
#sge.prepare # comment out this line in your script
sge.script
sge.setup
#sge.extract # don't use this either
sge.submit   # then submit
Conversely, you can clean up the working directory (e.g. to remove test or previous execution results) with the following methods.
sge.clear     # to remove the SGE script and the output/error/log directories
sge.clean     # to remove the count file and the extracted input directory
sge.distclean # to remove all of the above
You can specify the "-t start-last:step" range values for an array job with the following accessors (these are optional; see the EXAMPLES section below).
sge.task_min  # start value (default is 1)
sge.task_max  # last value (default is the total number of entries in the query)
sge.task_step # number of processes per job (default is 1000)
sge.sge_opts  # additional options for the qsub command
For example, if you only need to process the sequences from the 8421st up to the 9064th, and want to invoke 100 processes per qsub execution, you can specify them in the following way.
sge.task_min = 8421
sge.task_max = 9064
sge.task_step = 100
sge.submit
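The arithmetic behind these three values can be checked with a short sketch. It only divides the range into blocks of `task_step` entries; how sge.rb maps those blocks onto qsub calls or array tasks is not shown here, so treat the block boundaries as an assumption about its behavior.

```ruby
task_min, task_max, task_step = 8421, 9064, 100

# Each block starts at task_min + k * task_step and is capped at task_max,
# so the final block may be shorter than task_step.
blocks = (task_min..task_max).step(task_step).map do |first|
  last = [first + task_step - 1, task_max].min
  (first..last)
end

blocks.each { |r| puts "#{r.first}-#{r.last} (#{r.size} entries)" }
puts "#{blocks.size} blocks, #{blocks.sum(&:size)} entries in total"
```

With these values the range covers 644 entries in 7 blocks, the last of which (9021-9064) holds only 44 entries.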
#!/usr/bin/env ruby
require 'sge'
sge = Bio::SGE.new { |opt|
opt.query = 'flat_file'
opt.target = 'target_database_file'
opt.command = 'command --to_be_executed --with_opts'
opt.sge_opts = '-l cpu_arch=xeon'
opt.task_min = 8421
opt.task_max = 9064
opt.task_step = 100
}
sge.clear # included in sge.distclean
sge.clean # included in sge.distclean
sge.script # included in sge.prepare
sge.setup # included in sge.prepare
sge.extract # included in sge.prepare
sge.submit
In the ‘command’ specification, you can use the following identifiers as variables.
'#{query}' fragmented query file name (== input_file)
'#{target}' target database file name
'#{work_dir}' current working directory
'#{task_id}' SGE_TASK_ID
'#{slice}' -- task_id / @@slice (integer >= 1)
'#{input_file}' -- 'input/#{slice}/#{task_id}'
'#{output_file}' -- 'output/#{slice}/#{task_id}'
'#{error_file}' -- 'error/#{slice}/#{task_id}'
Note that these identifiers must be kept in ‘single quotes’ to avoid variable expansion before the script generation (see the EXAMPLES section below).
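The reason for the single quotes is ordinary Ruby string behavior: a double-quoted string interpolates `#{...}` immediately, while a single-quoted string keeps it as literal text. The sketch below demonstrates this with a local variable named `query`, which is made up for the demonstration and unrelated to the library.

```ruby
# A local Ruby variable that happens to share its name with a placeholder.
query = 'my_local_value'

# Single quotes: the placeholders stay in the string as literal text.
single = 'blastall -p blastp -i #{query} -d #{target}'

# Double quotes: Ruby substitutes its own variable right away.
# (Writing #{target} here would raise NameError, since no such variable exists.)
double = "blastall -p blastp -i #{query}"

puts single
puts double
```

Only the single-quoted form still contains the placeholders, so it is the form to pass as the ‘command’.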
#!/usr/bin/env ruby
require 'sge'
sge = Bio::SGE.new { |opt|
opt.query = 'd.melanogaster.pep'
opt.target = 'genomic_scaffolds'
opt.command = 'exonerate --bestn 1 --model protein2genome --showtargetgff 1 --showvulgar yes #{query} #{target}'
opt.sge_opts = '-l cpu_arch=xeon'
}
sge.prepare
sge.submit
#!/usr/bin/env ruby
require 'sge'
sge = Bio::SGE.new { |opt|
opt.query = 'query.pep'
opt.target = 'target.pep'
opt.command = 'blastall -p blastp -i #{query} -d #{target}'
opt.sge_opts = '-l cpu_arch=xeon'
}
sge.prepare
sge.submit
#!/usr/bin/env ruby
require 'sge'
sge = Bio::SGE.new { |opt|
opt.query = 'data/h.sapiens.pep'
opt.target = 'db/Pfam_ls'
opt.command = 'hmmscan --tblout output/#{slice}/#{task_id}.tbl #{target} #{query}'
opt.sge_opts = '-l cpu_arch=xeon'
}
sge.prepare
sge.submit
#!/usr/bin/env ruby
require 'sge'
sge = Bio::SGE.new { |opt|
opt.query = 'invertebrate6.genomic.gbff'
opt.command = 'bp_genbank2gff3.pl -out stdout #{query}'
}
sge.prepare
sge.submit