
Eli Lilly / September 14-15, 2011

20-Line Lifesavers: Coding simple solutions in the GATK

Kiran V Garimella (kiran.garimella@gmail.com), Mark A DePristo
Genome Sequencing and Analysis, Broad Institute

Research Informatics Group
Eli Lilly and Company


Genome Analysis Toolkit (GATK) \ˈjē-ˌnōm(,)ə-ˈna-lə-səs(,)ˈtül(,)kit\

Noun
1.  A suite of tools for working with medical resequencing projects (e.g. 1,000 Genomes, The Cancer Genome Atlas)
2.  A structured software library that makes writing efficient analysis tools using next-generation sequencing data easy

Most users think of the toolkit merely as a set of tools that implement our ideas… but the GATK’s real power is in how easy it makes it to instantiate your ideas.

This is what we will discuss today.


Some tasks are made difficult by the wrong tools

These BAMs have numeric, nonunique read IDs that collide when you merge them! How long will it take to fix?

Convert to SAM format, read the header, parse the read group info into a hash table keyed on the ID, loop over the reads, look up the read group ID in the hash, find the platform unit tag, prepend it to the read name, convert back to BAM, reindex BAM. Lines of code: 500.

All day!

With all apologies to Randall Munroe and XKCD


That same task, written in the GATK (20 lines of code)

    package org.broadinstitute.sting.gatk.walkers.examples;

    import net.sf.samtools.SAMFileWriter;
    import net.sf.samtools.SAMRecord;
    import org.broadinstitute.sting.commandline.Output;
    import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
    import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
    import org.broadinstitute.sting.gatk.walkers.ReadWalker;

    public class FixReadNames extends ReadWalker<Integer, Integer> {
        // The GATK instantiates this writer from the -o argument.
        @Output
        SAMFileWriter out;

        // Called once per read: prepend the platform unit, then write the read.
        @Override
        public Integer map(ReferenceContext ref, SAMRecord read, ReadMetaDataTracker metaDataTracker) {
            read.setReadName(read.getReadGroup().getPlatformUnit() + "." + read.getReadName());
            out.addAlignment(read);
            return null;
        }

        @Override
        public Integer reduceInit() { return null; }

        @Override
        public Integer reduce(Integer value, Integer sum) { return null; }
    }


That same task, written in the GATK (code that’s not filled in for you by the IDE: 5 lines)

Most of the code is boilerplate, and the IDE can fill it in for you. The amount of code you have to manually write is actually very small.


Those tasks are simple when using the right tools…

These BAMs have numeric, nonunique read IDs that collide when you merge them! How long will it take to fix?

Write a GATK ReadWalker that modifies the read name and writes it out again. Spend rest of time looking at LOLcats. Lines of code: 5.

Um, all day...


…though whether you’ll tell people that is up to you.

Hehe, I can haz cheezburger INDEED.



We’re going to write genuinely useful, deadline-defeating, lifesaving tools in < 20 lines of code


Now we’ll go through a bunch of programs and learn to write new GATK tools by example

•  We’ll set up the environment and look at five tutorial programs:
   –  HelloRead: A simple walker that prints read information from a BAM
   –  FixReadNames: Modify read names and emit results to a new BAM file
   –  HelloVariant: A simple walker that prints variant information from a VCF
   –  ComputeCoverageFromVCF: Computes a coverage histogram from a VCF
   –  FindExclusiveVariants: Create a new VCF of variants exclusive to a sample

•  Finished and commented versions are in the codebase at:
   –  java/src/org/broadinstitute/sting/gatk/walkers/tutorial/

•  How these tutorials work:
   –  The numbers (e.g. 3) enumerate the various steps in each tutorial.
   –  The code that you should write at each step is in the IntelliJ window.
   –  Text in boxes gives additional information on each step, emphasizes some information, and may clarify the command or code that you should write.


Setting up for GATK development


See our wiki resources

•  http://www.broadinstitute.org/gsa/wiki/index.php/Configuring_IntelliJ
•  http://www.broadinstitute.org/gsa/wiki/index.php/Queue_with_IntelliJ_IDEA


Mechanics of a GATK “walker” (a program that “walks” along a dataset in a prescribed way)


ReadWalker: “walks” over reads and allows a computation to be performed on each one

ReadWalker: process one read at a time.

[Figure: reads aligned beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, numbered (1) through (5) to show the computation order.]

Example use cases:
1.  Setting an extra metadata tag in a read
2.  Searching for mouse contaminant reads and excluding them
3.  Find or realign indels

Some example GATK programs: CycleQualityWalker, TableRecalibrationWalker, CountReadsWalker, IndelRealigner, etc.


LocusWalker: “walks” over genomic positions and allows a computation to be performed at each one

LocusWalker: process a single-base genomic position at a time.

[Figure: reads aligned beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order (1)(2)(3)(4)(5)… runs along the reference positions.]

Example use cases:
1.  Variant calling
2.  Depth of coverage calculations
3.  Compute properties of regions (GC content, read error rates)

Note: reads are required for locus walkers. RefWalkers are a similar type of walker that examines each genomic locus but does not require reads.
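To make the difference between the two traversal orders concrete, here is a minimal, self-contained Java sketch. It does not use the GATK API: the Read class, the countReads and depthByLocus helpers, and the toy coordinates are all hypothetical stand-ins. The first method visits one read at a time (the ReadWalker view); the second visits one position at a time and counts the reads overlapping it (the LocusWalker view, as in a depth of coverage calculation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TraversalSketch {
    /** Hypothetical stand-in for an aligned read: a name and 1-based inclusive coordinates. */
    static class Read {
        final String name;
        final int start;
        final int end;

        Read(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Read-style traversal: one step per read, in the order the reads are stored. */
    static int countReads(List<Read> reads) {
        int count = 0;
        for (Read read : reads) {
            count++; // a real walker would do its per-read work here
        }
        return count;
    }

    /** Locus-style traversal: one step per position, looking at every read that overlaps it. */
    static TreeMap<Integer, Integer> depthByLocus(List<Read> reads, int firstPos, int lastPos) {
        TreeMap<Integer, Integer> depth = new TreeMap<Integer, Integer>();
        for (int pos = firstPos; pos <= lastPos; pos++) {
            int overlapping = 0;
            for (Read read : reads) {
                if (read.start <= pos && pos <= read.end) {
                    overlapping++;
                }
            }
            depth.put(pos, overlapping);
        }
        return depth;
    }

    /** Three made-up reads covering positions 1 through 6. */
    static List<Read> toyReads() {
        List<Read> reads = new ArrayList<Read>();
        reads.add(new Read("r1", 1, 4));
        reads.add(new Read("r2", 3, 6));
        reads.add(new Read("r3", 4, 5));
        return reads;
    }

    public static void main(String[] args) {
        List<Read> reads = toyReads();
        System.out.println("reads visited: " + countReads(reads));
        System.out.println("depth by locus: " + depthByLocus(reads, 1, 6));
    }
}
```

The same input gives three steps when walked by read and six steps when walked by locus, which is why the choice of walker type depends on the unit your computation naturally works on.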


RodWalker: “walks” over positions in a file and allows a computation to be performed at each one

RodWalker: process a genomic position from a file (e.g. VCF) at a time.

[Figure: variant sites (*) for SampleA, SampleB, and SampleC shown along the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, numbered (1) through (4) to show the computation order.]

Example use cases:
1.  Variant filtering
2.  Computing metrics on variants
3.  Refining variant calls by enforcing additional constraints

Some example GATK programs: VariantEval, PhaseByTransmission, VariantAnnotator, VariantRecalibrator, SelectVariants, etc.


Writing your first GATK walkers


Example 1: Hello, Read!

1.  Right-click on “walkers”, select New->Package.
2.  Type “examples” as the package name.
3.  Click “OK”.
4.  Right-click on “examples” and select New->Java class. Enter the name “HelloRead”. A file declaring the class and proper package name is created for you.

5.  Add the following text to the class declaration:

        extends ReadWalker<Integer, Integer> {

    This tells the GATK that you are creating a program that iterates over all of the reads in a BAM file, one at a time. The “import” statement at the top will be added by the IDE.
6.  IntelliJ can detect which methods you need to implement in order to get your program working. Make sure your cursor is on the class declaration and type “Alt-Enter” to get the contextual action menu. Select “Implement Methods”.
7.  Select all of the methods (usually, they’ll already be selected, so you won’t need to do anything).
8.  Click “OK”. The three methods, map(), reduceInit(), and reduce(), are now implemented with placeholder code.


9.  Declare a PrintStream and mark it with the @Output annotation. This tells the GATK that we’re going to channel our output through this object. Don’t worry about instantiating it; the GATK will do that automatically.
10.  In your map() method, add a line of code that prints “Hello” and the name of the read:

        out.println("Hello, " + read.getReadName());

    Or, just type read. and then hit Ctrl-Space. IntelliJ will show you a window of all the methods you can call, and you can select the one you want from the list.
11.  When you’re done, hit the disk icon (or type Ctrl-S) to save your work.


12.  Back in the terminal window, change to your gatk-lilly directory and type:

        ant dist

    This will compile the GATK-Lilly codebase, including your new walker! It’ll take about a minute to compile.
13.  Run your code by entering the following command:

        java -jar dist/GenomeAnalysisTK.jar \
            -T HelloRead \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
            | less

    Every walker must be provided with a reference FASTA file. Your code is now running and saying “Hello” to every read in the file!


14.  Let’s add some information to the output. Add the line:

        out.println("Hello, " + read.getReadName() + " at " + read.getReferenceName() + ":" + read.getAlignmentStart());

    This will print out the read name, the contig name, and the starting position of the read’s alignment.
15.  Compile and run with a single command:

        ant dist && java -jar dist/GenomeAnalysisTK.jar \
            -T HelloRead \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
            | less -S

    (The && instructs the shell to proceed only if the previous command was successful. If the compilation fails, HelloRead will not be run.)

    The updated command is running and showing us the alignment position in addition to the read name!


16.  You can run on just a specific region by supplying the -L argument, and redirect the output to a separate file with the -o argument:

        java -jar dist/GenomeAnalysisTK.jar \
            -T HelloRead \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
            -L chr21:9411000-9411200 \
            -o test.txt

    No additional code is required on your part to enable this. The resulting file contains reads from chr21:9,411,000-9,411,200 only.


Example 2: Fix read names

Let’s use what we’ve learned to write a program that changes read names, as discussed earlier in this tutorial.

1.  Create a new example program called “FixReadNames”.
2.  Make FixReadNames a ReadWalker.
3.  This time, we’ll emit a BAM file by directing the output to a SAMFileWriter object instead of a PrintStream.
4.  Change the read name, tacking on the platform unit information.
5.  Add the alignment to the output stream.
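Steps 4 and 5 amount to a one-line string edit followed by a write. The sketch below mimics that logic in plain Java so it can be run without the GATK; ToyRead, fixReadName, fixAll, and the sample platform unit values are hypothetical stand-ins for SAMRecord, its read group, and the SAMFileWriter, not the real API.

```java
import java.util.ArrayList;
import java.util.List;

public class FixReadNamesSketch {
    /** Hypothetical stand-in for a read: a mutable name plus its read group's platform unit. */
    static class ToyRead {
        String name;
        final String platformUnit;

        ToyRead(String name, String platformUnit) {
            this.name = name;
            this.platformUnit = platformUnit;
        }
    }

    /** Prepend the platform unit and a "." to the read name, as the walker's map() does. */
    static void fixReadName(ToyRead read) {
        read.name = read.platformUnit + "." + read.name;
    }

    /** Rename every read and collect the new names, standing in for out.addAlignment(read). */
    static List<String> fixAll(List<ToyRead> reads) {
        List<String> written = new ArrayList<String>();
        for (ToyRead read : reads) {
            fixReadName(read);
            written.add(read.name);
        }
        return written;
    }

    public static void main(String[] args) {
        List<ToyRead> reads = new ArrayList<ToyRead>();
        // Two reads share the numeric name "1" but come from different (made-up) platform units.
        reads.add(new ToyRead("1", "flowcellA.1"));
        reads.add(new ToyRead("1", "flowcellB.2"));
        System.out.println(fixAll(reads));
    }
}
```

After renaming, the two reads that collided on the name “1” carry distinct names, which is the property needed before merging BAM files.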


6.  Compile and run your code:

        ant dist && java -jar dist/GenomeAnalysisTK.jar \
            -T FixReadNames \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
            -L chr21:9411000-9411200 \
            -o test.bam

7.  Run the following command to see your results:

        samtools view test.bam | less -S

    All of the read names now have the platform unit prepended to them!


Example 3: Hello, Variant!

This will be a larger example, introducing variant processing, map-reduce calculations, and the onTraversalDone() method. All code required is listed here.

1.  We’ve created a new program called “HelloVariant”.
2.  This program extends RodWalker<Integer, Integer>.
3.  It declares a PrintStream.
4.  In the map() function, we’ll loop over lines in a VCF file and print metadata from each record.
5.  Return 1. This will get passed to reduce() later.
6.  reduceInit() gets called before the first reduce() call. By returning 0, we initialize the record counter.
7.  All of the return values from map() get passed to reduce(), one at a time. Here, we add value to sum, effectively counting all the calls to map().
8.  The onTraversalDone() method runs after the computation is complete. Here, we print the total number of map() calls made.
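The call order described in steps 5 through 8 can be sketched in plain Java. This is a toy driver, not the GATK engine: the ToyWalker interface, the traverse method, the CountingWalker class, and the sample records are hypothetical and exist only to show how reduceInit(), map(), reduce(), and onTraversalDone() fit together.

```java
import java.util.Arrays;
import java.util.List;

public class MapReduceSketch {
    /** Hypothetical miniature of a walker: the same four hooks the tutorial describes. */
    interface ToyWalker<T> {
        Integer map(T record);
        Integer reduceInit();
        Integer reduce(Integer value, Integer sum);
        void onTraversalDone(Integer result);
    }

    /** Toy driver: seed the accumulator, fold each map() result into it, then report. */
    static <T> Integer traverse(ToyWalker<T> walker, List<T> records) {
        Integer sum = walker.reduceInit();
        for (T record : records) {
            Integer value = walker.map(record);
            sum = walker.reduce(value, sum);
        }
        walker.onTraversalDone(sum);
        return sum;
    }

    /** Counts records the way the tutorial describes: map() returns 1, reduce() adds. */
    static class CountingWalker implements ToyWalker<String> {
        public Integer map(String record) {
            System.out.println("Hello, " + record);
            return 1;
        }

        public Integer reduceInit() {
            return 0;
        }

        public Integer reduce(Integer value, Integer sum) {
            return value + sum;
        }

        public void onTraversalDone(Integer result) {
            System.out.println("Processed " + result + " records");
        }
    }

    public static void main(String[] args) {
        // Made-up record descriptions; a real RodWalker would receive VCF records instead.
        List<String> records = Arrays.asList("chr21:100 A/G", "chr21:250 C/T", "chr21:300 G/A");
        traverse(new CountingWalker(), records);
    }
}
```

Because map() returns 1 for every record and reduce() adds, the value handed to onTraversalDone() is simply the number of records seen. Returning something other than 1 from map() (for example, a per-record measurement) turns the same skeleton into a sum.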


9.  Compile and run the HelloVariant walker, but this time, rather than specifying a BAM file with the -I argument, we’ll attach a VCF file:

        ant dist && java -jar dist/GenomeAnalysisTK.jar \
            -T HelloVariant \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf

    The program prints out the reference allele, alternate allele, and locus for each VCF record, and finally prints out the number of records processed!


Example 4: Compute depth of coverage from a VCF file

Let’s continue exploring variant processing by taking a closer look at the VariantContext object, the programmatic representation of a VCF record. This program will compute a depth of coverage histogram using VCF metadata rather than a BAM file.

1.  Create a new program called ComputeCoverageFromVCF of type RodWalker<Integer, Integer> with the usual @Output PrintStream out declaration.
2.  Add a command-line argument with the following code:

        @Argument(fullName="sample", shortName="sn", doc="Sample to process", required=false)
        public String SAMPLE;

    This adds the command-line argument --sample (aka -sn) and stores the supplied value in the String variable SAMPLE. We’ll use this to let the user choose between coverage for a specific sample and coverage for all of the samples (by specifying no sample at all).
3.  Declare a map to store the coverage counts:

        private TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

    A TreeMap is a map that, unlike a hashtable, returns its keys in sorted order.
4.  Loop over the variants. For each one, we’ll print the coverage observed. We also make sure that we get the coverage for the sample requested (if the user specified a sample name to the --sample argument), or for all samples (if the user specified no sample name at all). For every coverage level we observe, we increment the appropriate entry in the histogram object.
5.  In the onTraversalDone() method, we’ll loop over every coverage level in the histogram and output the depth and the number of times we observed that depth.
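The bookkeeping in steps 3 through 5 is ordinary java.util code, so it can be tried outside the GATK. In this sketch the depth values are made up, the record and format helpers are hypothetical, and the tab-separated output layout is an assumption; only the TreeMap usage mirrors the tutorial.

```java
import java.util.Map;
import java.util.TreeMap;

public class CoverageHistogramSketch {
    // Keys are coverage levels, values are how often each level was observed.
    private final TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

    /** Increment the count for one observed depth (what map() would do per variant). */
    void record(int depth) {
        Integer seen = histogram.get(depth);
        histogram.put(depth, seen == null ? 1 : seen + 1);
    }

    /** Build "depth<TAB>count" lines in ascending depth order (what onTraversalDone() would print). */
    String format() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> entry : histogram.entrySet()) {
            sb.append(entry.getKey()).append('\t').append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CoverageHistogramSketch sketch = new CoverageHistogramSketch();
        int[] depths = {30, 12, 30, 7, 12, 30}; // made-up depths, deliberately unsorted
        for (int depth : depths) {
            sketch.record(depth);
        }
        System.out.print(sketch.format());
    }
}
```

Even though the depths arrive unsorted, iterating over the TreeMap yields them in ascending order, so no explicit sort is needed before printing.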


6.  Compile and run:

        ant dist && java -jar dist/GenomeAnalysisTK.jar \
            -T ComputeCoverageFromVCF \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
            -o histogram.txt

    Two columns of information are printed. The first column is the coverage level; the second is the number of times that coverage level was observed.


Example 5: Find variants unique to a single sample

For our last example, we’ll write a simple program that can take an input VCF and write a new VCF containing only variants that are exclusive to one sample. We’ll also introduce the initialize() method, which can be used to prepare the environment for the computation.

1.  Create a new RodWalker called FindExclusiveVariants that has a command-line argument called “sample” (aka “sn”) of type String. Add an output stream, but rather than making it a PrintStream, make it of type VCFWriter. We’ll use this to output a new VCF file based on the input VCF.
2.  The initialize() method is called first, before any of the map() or reduce() calls are made. It is useful for preparing the environment, writing headers, setting up variables, etc. Here, we’ll write a VCF header to the output stream. While we’re free to add/remove header lines and samples, we’ll just copy the input file’s header to the output file.
3.  Loop over each record in the VCF, and each Genotype object contained within the VariantContext object. Check the genotypes of each sample and, if only our sample of interest is variant, output the record to the new VCF file.
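The test in step 3 reduces to a small predicate over per-sample genotypes. Below is a sketch in plain Java with hypothetical names (isExclusiveTo, and a Map from sample name to an “is variant” flag); the real walker would read this information from the Genotype objects in a VariantContext, whose API is not reproduced here. Sample names other than 113N are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExclusiveVariantSketch {
    /**
     * Returns true only if the sample of interest is variant at this site
     * and every other sample is not.
     */
    static boolean isExclusiveTo(String sample, Map<String, Boolean> variantBySample) {
        Boolean target = variantBySample.get(sample);
        if (target == null || !target) {
            return false; // sample missing from the record, or not variant there
        }
        for (Map.Entry<String, Boolean> entry : variantBySample.entrySet()) {
            if (!entry.getKey().equals(sample) && entry.getValue()) {
                return false; // another sample also carries the variant
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Site 1: only 113N is variant, so the record would be written out.
        Map<String, Boolean> site1 = new LinkedHashMap<String, Boolean>();
        site1.put("S1", false);
        site1.put("S2", false);
        site1.put("113N", true);
        site1.put("S3", false);

        // Site 2: S1 is variant too, so the record would be skipped.
        Map<String, Boolean> site2 = new LinkedHashMap<String, Boolean>(site1);
        site2.put("S1", true);

        System.out.println("site1 exclusive to 113N: " + isExclusiveTo("113N", site1));
        System.out.println("site2 exclusive to 113N: " + isExclusiveTo("113N", site2));
    }
}
```

What counts as “variant” for a genotype (for example, whether no-calls are treated as non-variant) is a design decision the sketch leaves to whoever fills in the Boolean flags.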


4.  Compile and run:

        ant dist && java -jar dist/GenomeAnalysisTK.jar \
            -T FindExclusiveVariants \
            -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
            -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
            -sn 113N \
            -o 113.exclusive.vcf

5.  After the program completes, look at the output.
6.  You can scroll left and right with the arrow keys, but let’s clean up the output to make it easier to read. Supply this command instead:

        grep -v '##' 113.exclusive.vcf | cut -f1-7,10- | head -10 | column -t | less -S

    Observe how the third sample is variant and the other three samples are not. Our program is selecting only the variants that are exclusive to 113N!


Conclusions

•  From the five example programs, we have learned how to:
   –  configure IntelliJ for GATK development
   –  create a new ReadWalker or RodWalker
   –  declare output streams (PrintStream, SAMFileWriter, VCFWriter)
   –  access and modify metadata in reads
   –  access variants, samples, and metadata from a VCF file
   –  declare command-line arguments
   –  prepare for computations with the initialize() method
   –  finish computations with the onTraversalDone() method
   –  compile and run new GATK programs

•  This tutorial is more than enough to get started with writing new and useful GATK programs
   –  Our FixReadNames, ComputeCoverageFromVCF, and FindExclusiveVariants walkers are fully realized programs, ready to be used for real work.
   –  You now have enough information to write your own somatic variant finder.


Additional resources

•  For more information on developing in the GATK and Java, see
   –  http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development
   –  http://download.oracle.com/javase/tutorial/java/index.html

•  Explore the GATK Git repository at
   –  https://github.com/broadgsa
   –  https://github.com/signup/free (to add your own code, sign up for a free account)

•  To learn Git, the codebase’s version control system, see
   –  http://gitref.org/
   –  http://git-scm.com/course/svn.html (for those already familiar with SVN)

•  Read our papers on the GATK framework and tools
   –  http://genome.cshlp.org/content/20/9/1297.long
   –  http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html

•  For more guidance, feel free to look at other programs in the GATK
   –  Every program is a tutorial!
