ADAM version 0.25.0 and Cannoli version 0.3.0 have been released!

Since the 0.24.0 release of ADAM, more than 40 issues have been closed, including bug fixes around indexed reads and attributes in VCF. New features include additional filter-by convenience methods and multi-sample coverage. The ADAM Python APIs now support Python 3.

Based on feedback from the 2018 GCCBOSC bioinformatics community conference, the Cannoli API was refactored at the 2018 GCCBOSC CollaborationFest to greatly improve interactive use in cannoli-shell (a Scala REPL based on Spark Shell, similar to adam-shell) and in notebooks such as Jupyter, Zeppelin, and Spark Notebook.

For example, here is an entire variant calling pipeline, based on bwa, ADAM, and Freebayes:

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.cannoli.cli._
import org.bdgenomics.cannoli.cli.Cannoli._

val sample = "sample"
val reference = "ref.fa"

// load paired-end FASTQ files as fragments (sc is the SparkContext provided by the shell)
val reads = sc.loadPairedFastqAsFragments(sample + "_1.fq", sample + "_2.fq")

// configure bwa with the sample name and index path
val bwaArgs = new BwaArgs()
bwaArgs.sample = sample
bwaArgs.indexPath = reference

// align with bwa, then sort and mark duplicates with ADAM
val alignments = reads.alignWithBwa(bwaArgs)
val sorted = alignments.sortReadsByReferencePositionAndIndex()
val markdup = sorted.markDuplicates()

// configure Freebayes with the reference path
val freebayesArgs = new FreebayesArgs()
freebayesArgs.referencePath = reference

// call variants with Freebayes
val variantContexts = markdup.callVariantsWithFreebayes(freebayesArgs)

// save the called variants as VCF
variantContexts.saveAsVcf(sample + ".freebayes.vcf.bgzf")
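
Calls such as reads.alignWithBwa(bwaArgs) in the pipeline above read as if the method belonged to the ADAM dataset itself. The CANNOLI-131 changelog entry suggests this is done with Scala implicits brought into scope by the Cannoli._ import. The following is a minimal sketch of that general pattern in plain Scala, with no Spark dependency. Fragments, Alignments, ToyBwaArgs, and ToyCannoli are hypothetical stand-ins invented for illustration; they are not the real ADAM or Cannoli classes, and the real implementation may differ.

```scala
// Hypothetical stand-ins for illustration only; NOT the real ADAM or Cannoli types.
final case class Fragments(names: Seq[String])
final case class Alignments(names: Seq[String], sample: String)

// Mutable argument holder, loosely mirroring how BwaArgs is populated in the post.
final class ToyBwaArgs {
  var sample: String = ""
  var indexPath: String = ""
}

object ToyCannoli {
  // An implicit class wraps Fragments, so alignWithBwa can be called as if it
  // were a method on Fragments once ToyCannoli._ has been imported.
  implicit class ToyCannoliFragments(fragments: Fragments) {
    def alignWithBwa(args: ToyBwaArgs): Alignments =
      Alignments(
        fragments.names.map(n => s"$n aligned to ${args.indexPath}"),
        args.sample)
  }
}

object Demo {
  import ToyCannoli._ // brings the implicit class into scope

  def run(): Alignments = {
    val args = new ToyBwaArgs()
    args.sample = "sample"
    args.indexPath = "ref.fa"
    // The compiler rewrites this call to new ToyCannoliFragments(...).alignWithBwa(args)
    Fragments(Seq("read1", "read2")).alignWithBwa(args)
  }

  def main(argv: Array[String]): Unit = println(run())
}
```

With this pattern, a single wildcard import in a REPL or notebook is enough to make the extra methods available for chaining, which is presumably why the style suits interactive use.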

Changes since Previous Releases

The full lists of changes to ADAM since version 0.24.0 and to Cannoli since version 0.2.0 are below.

ADAM version 0.25.0

Closed issues:

  • Expand illumina metadata regex to include “N” character #2079
  • Remove support for Hadoop 2.6 #2073
  • NumberFormatException: For input string: “nan” in VCF #2068
  • Support Spark 2.3.2 #2062
  • Arrays should be passed to HTSJDK in the JVM primitive type #2059
  • toCoverage() function for alignments does not distinguish samples #2049
  • Building from adam-core module directory fails to generate Scala code for sql package #2047
  • Data Sets #2043
  • saveAsBed writes missing score values as ‘.’ instead of ‘0’ #2039
  • Fix GFF3 parser to handle trailing FASTA #2037
  • Add StorageLevel as an optional parameter to loadPairedFastq #2032
  • Error: File name too long when building on encrypted file system #2031
  • Fail to transform a VCF file containing multiple genome data (Multiple sample) #2029
  • Dataset and RDD constructors are missing from CoverageRDD #2027
  • How to create a single RDD[Genotype] object out of multiple VCF files? #2025
  • ReadTheDocs github banner is broken #2020
  • -realign_indels throws serialization error with instrumentation enabled #2007
  • Support 0 length FASTQ reads #2006
  • Speed of Reading into ADAM RDDs from S3 #2003
  • Support Python 3 #1999
  • Unordered list of region join types in doc is missing nested levels #1997
  • Add VariantContextRDD.saveAsPartitionedParquet, ADAMContext.loadPartitionedParquetVariantContexts #1996
  • VCF annotation question #1994
  • Fastq reader clips long reads at 10,000 bp #1992
  • adam-submit Error: Number of executors must be a positive number on EMR 5.13.0/Spark 2.3.0 #1991
  • Test against Spark 2.3.1, Parquet 1.8.3 #1989
  • END does not get set when writing a gVCF #1988
  • Support saving single files to filesystems that don’t implement getScheme #1984
  • Add additional filter by convenience methods #1978
  • Limiting FragmentRDD pipe parallelism #1977
  • Consider javadoc.io for API documentation linking #1976
  • FASTQ Reader leaks connections #1974
  • Update bioconda recipe for version 0.24.0 #1971
  • Update homebrew formula at brewsci/homebrew-bio for version 0.24.0 #1970
  • loadPartitionedParquetAlignments fails with Reference.all #1967
  • Caused by: java.lang.VerifyError: class com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final method withResolved #1953
  • FASTQ input format needs to support index sequences #1697
  • Changelog must be edited and committed manually during release process #936

Merged and closed pull requests:

  • added pyspark mock modules for API documentation #2084 (akmorrow13)
  • Added mock python modules for API python documentation #2082 (akmorrow13)
  • [ADAM-2079] Expand illumina metadata regex to include “N” character #2081 (pauldwolfe)
  • ADAM-2079 Added “N” to regexs for illumina metadata #2080 (pauldwolfe)
  • Update docs with new template and documentation #2078 (akmorrow13)
  • [ADAM-1992] Make maximum FASTQ read length configurable. #2077 (heuermh)
  • [ADAM-2059] Properly pass back primitive typed arrays to HTSJDK. #2075 (heuermh)
  • Update dependency versions, including htsjdk to 2.16.1 and guava to 27.0-jre #2072 (heuermh)
  • [ADAM-1999] Support Python 3 #2070 (akmorrow13)
  • [ADAM-2068] Prevent NumberFormatException for nan vs NaN in VCF files. #2069 (heuermh)
  • Update python MAKE file #2067 (Georgehe4)
  • Update python MAKE file #2066 (Georgehe4)
  • Update jenkins script to test python 3.6 #2060 (Georgehe4)
  • [ADAM-2062] Update Spark version to 2.3.2 #2055 (heuermh)
  • Clean up fields and doc in fragment. #2054 (heuermh)
  • [ADAM-2037] Support GFF3 files containing FASTA formatted sequences. #2053 (heuermh)
  • modified CoverageRDD and FeatureRDD to extend MultisampleGenomicDataset #2051 (akmorrow13)
  • Multi-sample coverage #2050 (akmorrow13)
  • [ADAM-2047] Use source directory relative to project.basedir for adam codegen. #2048 (heuermh)
  • [ADAM-2039] Adding support for writing BED format per UCSC definition #2042 (heuermh)
  • Update Jenkins Spark version to 2.2.2 #2035 (akmorrow13)
  • [ADAM-2032] Add StorageLevel as an optional parameter to loadPairedFastq #2033 (heuermh)
  • [ADAM-2027] Add RDD and Dataset constructors to CoverageRDD. #2028 (heuermh)
  • Allow for export of query name sorted SAM files #2026 (karenfeng)
  • [ADAM-2020] Fix ReadTheDocs Github banner. #2021 (fnothaft)
  • [ADAM-1988] Add copyVariantEndToAttribute method to support gVCF END attribute … #2017 (heuermh)
  • [ADAM-936] Use github-changes-maven-plugin to update CHANGES.md. #2014 (heuermh)
  • [ADAM-1992] Make maximum FASTQ read length configurable. #2011 (fnothaft)
  • [ADAM-1697] Expand Illumina metadata regex to cover interleaved index sequences. #2010 (heuermh)
  • [ADAM-2007] Make IndelRealignmentTarget implement Serializable. #2009 (fnothaft)
  • [ADAM-2006] Support loading 0-length reads as FASTQ. #2008 (fnothaft)
  • [ADAM-1697] Expand Illumina metadata regex to cover index sequences #2004 (pauldwolfe)
  • [ADAM-1996] Load and save VariantContexts as partitioned Parquet. #2001 (heuermh)
  • [ADAM-1997] Nest list of region join types in joins doc. #1998 (heuermh)
  • [ADAM-1877] Add filterToReferenceName(s) to SequenceDictionary. #1995 (heuermh)
  • [ADAM-1984] Support file systems that don’t set the scheme. #1985 (fnothaft)
  • [ADAM-1978] Add additional filter by convenience methods. #1983 (heuermh)
  • Adding printAttribute methods for alignment records, features, and samples. #1982 (heuermh)
  • Fix partitioning code to use Long instead of Int #1980 (fnothaft)
  • [ADAM-1976] Adding core API documentation link and badge. #1979 (heuermh)
  • [ADAM-1974] Close unclosed stream in FastqInputFormat. #1975 (fnothaft)
  • Set defaults to schemas #1972 (ffinfo)
  • Add loadPairedFastqAsFragments method. #1866 (heuermh)
  • Adding loadPairedFastqAsFragments method #1828 (ffinfo)

Cannoli version 0.3.0

Closed issues:

  • Add implicit methods that attach to source RDD #131
  • Flip function and command line class names around #130
  • Add API documentation link and badge #128
  • Add homebrew formula at brewsci/homebrew-bio #124
  • Add bioconda recipe #123
  • Support validation stringency in out formatters #122
  • Add Ensembl Variant Effect Predictor (VEP) for variant annotation #112
  • Add Minimap2 for alignment #111

Merged and closed pull requests:

  • Update release script for changelog. #143 (heuermh)
  • [CANNOLI-141] Update ADAM dependency to 0.25.0. #142 (heuermh)
  • Update default docker image for bowtie2. #140 (heuermh)
  • [CANNOLI-138] Update Cannoli per latest ADAM snapshot changes. #139 (heuermh)
  • [CANNOLI-131] Add implicits on Cannoli function source data sets. #133 (heuermh)
  • [CANNOLI-130] Extract function classes to core package. #132 (heuermh)
  • [CANNOLI-128] Adding API documentation link and badge. #129 (heuermh)
  • [CANNOLI-112] Adding Ensembl Variant Effect Predictor (VEP) for variant annotation #127 (heuermh)
  • [CANNOLI-122] Support validation stringency in out formatters. #126 (heuermh)
  • [CANNOLI-111] Adding Minimap2 for alignment. #119 (heuermh)
