GATK - bwHPC Wiki


Description           Content
module load           bio/gatk (Genome Analysis Tool Kit)
Availability          bwUniCluster
License               Mixed licensing mode: Free for academics.
Citing                n/a
Links                 GATK Homepage
Graphical Interface   no
Requirements          Java version >= 1.7; some tools additionally require R to generate PDF plots

1 Description/What is GATK?

GATK is a toolkit for genome analysis. More specifically, it is a toolkit for variant discovery.
The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data, and its scope is currently being extended to include somatic variant calling tools. In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.
The GATK tools are primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.
For more information on features, please visit the GATK Homepage.

2 Versions and Availability

A list of versions currently available on all bwHPC-C5 clusters can be obtained from the Cluster Information System (CIS).

On the command line interface you'll get a list of available versions by using the command 'module avail bio/gatk'.

$ module avail bio/gatk
------------------------ /opt/bwhpc/common/modulefiles -------------------------

3 Usage

3.1 Loading the module

3.1.1 Default

You can load the default version of GATK with the command 'module load bio/gatk'.

$ module avail bio/gatk
------------------------ /opt/bwhpc/common/modulefiles -------------------------
$ module load bio/gatk
$ module list
Currently Loaded Modulefiles:
  1) bio/gatk/3.5

The module will try to load any modules it needs to function. If loading the module fails, check whether you have already loaded one of those modules in a version other than the one GATK needs.
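
The check described above can be scripted. The following is a minimal sketch, assuming the `module` command of the cluster is available in the current shell; `load_gatk_or_report` is a hypothetical helper name and not part of the GATK module. It verifies the result with `module list` instead of relying on the exit status of `module load`, and prints the currently loaded modules if the requested one is missing.

```shell
#!/bin/bash
# Sketch: load a GATK module and, if that fails, show what is already loaded.
# load_gatk_or_report is a hypothetical helper name (an assumption).
load_gatk_or_report() {
    local mod="${1:-bio/gatk}"
    module load "$mod"
    # 'module list' may print to stderr, so merge both streams before grep.
    if module list 2>&1 | grep -qF -- "$mod"; then
        echo "Loaded: $mod"
        return 0
    fi
    echo "ERROR: failed to load module $mod. Currently loaded modules:" >&2
    module list 1>&2
    return 1
}
```

Call it as `load_gatk_or_report bio/gatk/3.5` and inspect the printed list for modules that may conflict.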

3.1.2 Special Version

If you wish to load a specific version of GATK, use module load bio/gatk/'version', where 'version' is the version you want.

$ module avail bio/gatk
------------------------ /opt/bwhpc/common/modulefiles -------------------------
$ module load bio/gatk/3.5
$ module list
Currently Loaded Modulefiles:
  1) bio/gatk/3.5

3.2 Program Binaries

You can find the binaries in the GATK home folder. After loading the GATK module ('module load bio/gatk'), this folder is added to $PATH and its location is stored in the $GATK_HOME environment variable.

  • GATK is a command-line program and is usually used in a pipeline.
  • Multithreading is supported by the GATK Queue.
$ ls -RxF $GATK_HOME
/opt/bwhpc/common/bio/gatk/3.5: # $GATK_HOME
bwhpc-examples/  gatk@  gatk_queue@  GenomeAnalysisTK.jar  modulefiles/  Queue.jar  resources/

/opt/bwhpc/common/bio/gatk/3.5/bwhpc-examples:  # $GATK_EXA_DIR
bwhpc-gatk-example.moab   exampleBAM.bam@             exampleBAM.bam.bai@  ExampleCountLoci.scala@
ExampleCountReads.scala@  ExampleCustomWalker.scala@  exampleFASTA.dict@   exampleFASTA.fasta@
exampleFASTA.fasta.fai@   ExampleReadFilter.scala@    gatk*                gatk_queue*

/opt/bwhpc/common/bio/gatk/3.5/modulefiles: # Modulefile

'*' indicates the file is executable. '/' indicates it's a folder. '@' indicates it's a symbolic link.

3.3 Wrapper

To start the GATK Java application, you have to type the command:

java -jar $GATK_HOME/GenomeAnalysisTK.jar <options>

Unfortunately, Java does not search $PATH for the jar file.
If you start 'java -jar ...' without an absolute pathname, you'll get a 'not found' error.
So the following does not work unless you are in the $GATK_HOME folder:

java -jar GenomeAnalysisTK.jar <options>

To avoid these problems, we wrote two wrappers called "gatk" and "gatk_queue".
They are linked into $GATK_HOME, and you may use them without any limitations.
You'll find these scripts in the $GATK_EXA_DIR folder, too.

$ cat gatk
# GATK-Wrapper
#, 08.03.2016
[ -z "$GATK_HOME" ] && { module load bio/gatk/3.5; sleep 5; }
[ -z "$options" ] && echo "enter at least one option (e.g. --help). " \
|| { java -jar $GATK_HOME/GenomeAnalysisTK.jar ${options}; }
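
The wrapper excerpt above reads its arguments from a variable named `options`, whose assignment is not part of the excerpt. For comparison, here is a minimal sketch of the same idea that forwards the command-line arguments directly with `"$@"`. It is not the wrapper installed on the cluster; `gatk_sketch` is a hypothetical name, and the sketch assumes that `module load bio/gatk` has already set $GATK_HOME.

```shell
#!/bin/bash
# Sketch of a GATK wrapper that forwards its arguments unchanged.
# Assumption: GATK_HOME has been set by 'module load bio/gatk'.
# gatk_sketch is a hypothetical name, not the site's 'gatk' wrapper.
gatk_sketch() {
    if [ -z "$GATK_HOME" ]; then
        echo "GATK_HOME is not set; run 'module load bio/gatk' first." >&2
        return 1
    fi
    if [ "$#" -eq 0 ]; then
        echo "enter at least one option (e.g. --help)." >&2
        return 2
    fi
    # The absolute jar path avoids the 'not found' problem described above.
    # "$@" keeps each argument intact, even if it contains spaces.
    java -jar "$GATK_HOME/GenomeAnalysisTK.jar" "$@"
}
```

Quoting `"$@"` matters for file names with spaces, which an unquoted variable such as `${options}` would split into several arguments.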

4 bwHPC Examples for GATK

In the folder $GATK_EXA_DIR you'll find an example of how to use GATK.

$ ls -l $GATK_EXA_DIR
[...] bwhpc-gatk-example.moab # example Moab submit script for use with 'msub'

# example-files (symbolic links)
[...] exampleBAM.bam
[...] exampleBAM.bam.bai
[...] ExampleCountLoci.scala
[...] ExampleCountReads.scala
[...] ExampleCustomWalker.scala
[...] exampleFASTA.dict
[...] exampleFASTA.fasta
[...] exampleFASTA.fasta.fai
[...] ExampleReadFilter.scala

[...] gatk # wrapper for 'java -jar GenomeAnalysisTK.jar <options>'
[...] gatk_queue # wrapper for 'java -jar Queue.jar <options>'

[...] gatk_options # usage, gatk --help output
[...] README.bwhpc-examples # some more explanations
[...] # run examples outside 'msub'/Moab

4.1 GATK Queue

GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs; for example, our BAM to VCF calling pipeline includes, among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

See "Pipelining with Queue" for more information about GATK-Queue pipelines.

4.2 bwhpc-example file

  • bwhpc-gatk-example.moab

Use this Moab start script to start your own GATK session in interactive mode. Look for the corresponding section inside the file and make your modifications.

4.2.1 How to use the GATK Test-Script

  • Create your own work-space
# WS-Name        Days alive (max. 60)
ws_allocate gatk_repo 30
  • Change dir to your workspace
cd $(ws_find gatk_repo)
  • Copy the Moab example file from $GATK_EXA_DIR and make your modifications
cp $GATK_EXA_DIR/bwhpc-gatk-example.moab .
  • Submit your job
msub bwhpc-gatk-example.moab
  • Wait a while...

... until you see some more files created (e.g. a tarball). The *.tgz file contains your data.

Use tar xvzf *.tgz to extract the file contents.
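
Note that `tar xvzf *.tgz` behaves as intended only when exactly one archive matches: with several matches, tar treats the additional names as members to extract from the first archive. A minimal sketch that extracts each archive separately follows; `extract_results` is a hypothetical helper name, and the sketch assumes the tarballs are written to the directory passed as its argument.

```shell
#!/bin/bash
# Sketch: extract every result tarball in a directory, one at a time.
# extract_results is a hypothetical helper name (an assumption).
extract_results() {
    local dir="${1:-.}" archive found=0
    for archive in "$dir"/*.tgz; do
        [ -e "$archive" ] || continue   # the glob did not match any file
        tar -xzf "$archive" -C "$dir" || return 1
        found=1
    done
    if [ "$found" -ne 1 ]; then
        echo "no *.tgz files in $dir yet" >&2
        return 1
    fi
}
```

Run it from the workspace as `extract_results .` once the job has finished.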

4.2.2 Excerpt from bwhpc-gatk-example.moab

These parameters apply to the use of GATK on the bwUniCluster.

#MSUB -N gatk_job
#MSUB -j oe
#MSUB -m ae
# -M 'your e-mail-address@DN'
#MSUB -q singlenode
#MSUB -l walltime=00:10:00
echo " "
echo "### Loading GATK module:"
echo " "
module load bio/gatk/3.5
[ -z "$GATK_HOME" ] && { echo 'ERROR: Failed to load module bio/gatk/3.5.'; exit 1; }
module list

echo " "
echo "### Copying input test files for job (if required):"
echo " "
cp -v $GATK_EXA_DIR/[Ee]xample* .
echo " "
echo "### Running GATK example in single-node-mode..."
echo " "
# -T = AnalysisType
gatk -T CountLoci \
   -R exampleFASTA.fasta \
   -I exampleBAM.bam \
   -o CountLoci_analysis.out > CountLoci_run.out 2>&1
rc=$?  # save the exit status; the test below would otherwise overwrite $?
[ "$rc" -ne 0 ] && { echo "gatk returned with an error: $rc"; exit 1; }
echo "done"
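
When adapting the error check at the end of the excerpt, keep in mind that `$?` is overwritten by every command, including a `[ ... ]` test, so the status is best saved in a variable right after the command of interest. A minimal sketch of that pattern follows; `run_step` is a hypothetical helper and not part of the example script.

```shell
#!/bin/bash
# Sketch: run a command and report its real exit status on failure.
# run_step is a hypothetical helper; pass it any command line.
run_step() {
    "$@"
    local rc=$?   # save immediately; the next command overwrites $?
    if [ "$rc" -ne 0 ]; then
        echo "step '$1' returned with an error: $rc" >&2
        return "$rc"
    fi
    echo "done"
}
```

In a job script this could be used as `run_step gatk -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o CountLoci_analysis.out || exit 1`.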

In the (compressed) tarball you'll find two *.out files.

  1. CountLoci_run.out : Complete output as written by GATK
  2. CountLoci_analysis.out : Summary of the above analysis-run

5 GATK-Specific Environments

To see a list of all GATK environment variables set by the 'module load' command, use env | grep GATK. Alternatively, use the command module display bio/gatk.

$ module display bio/gatk
module-whatis	 GATK 3.5 is a software package for analysis of high-throughput
     sequencing data. 
setenv		 GATK_VERSION 3.5 
setenv		 GATK_HOME /opt/bwhpc/common/bio/gatk/3.5 
setenv		 GATK_EXA_DIR /opt/bwhpc/common/bio/gatk/3.5/bwhpc-examples 
setenv		 GATK_BIN_DIR /opt/bwhpc/common/bio/gatk/3.5 
setenv		 GATK_BPR_URL 
prepend-path	 PATH /opt/bwhpc/common/bio/gatk/3.5 
conflict	 bio/gatk 

The module display command will not load the module!

6 Version-Specific Information

For more detailed information about a specific GATK version, see the information available via the module system with the command module help bio/gatk/'version'.
For a short summary of what GATK is about, use the command module whatis bio/gatk.

$ module whatis bio/gatk
bio/gatk  : GATK 3.5 is a software package for analysis of high-throughput
   sequencing data.

$ module help bio/gatk
----------- Module Specific Help for 'bio/gatk/3.5' ---------------
  GATK is a Toolkit for Genome Analysis
  More specifically, it's a toolkit for variant discovery.

* Get started/GATK Homepage 

* GATK documentation/BPR   

* A primer on parallelism with the GATK(-Queue)

* GATK repository (binaries/sources)   

* bwHPC examples can be found here:
  Please read the 'README.bwhpc-examples' file.


Please use the 'gatk' wrapper to start the programm. 
  Beware: The 'GenomeAnalysisTK.jar' file can't be started
  without an absolute path. Appending $HOME to $GATK_HOME will _not_ 
  start the *.jar.
  But the gatk-wrapper works fine.