Big Data Genomics

ADAM 0.16.0 is now available.

This release makes Base Quality Score Recalibration (BQSR) 3.5x faster, adds support for multiline FASTQ input and for visualization of variants from VCF input, includes a new shuffle-based RegionJoin implementation, and adds new methods for region coverage calculations.
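
To give a feel for what a shuffle-based region join does, here is a minimal sketch in plain Python. It is not ADAM's RegionJoin code: the (contig, start, end) tuple layout with half-open coordinates, the bin size, and all function names are assumptions made for illustration. The idea is to key every region by the fixed-size genome bins it touches, group regions by key (the step a cluster would perform as a shuffle), and compare only regions that land in the same bin.

```python
from collections import defaultdict


def bins_for(start, end, bin_size):
    """Bins touched by the half-open interval [start, end); assumes end > start."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def overlaps(a, b):
    """True if two (contig, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def shuffle_region_join(left, right, bin_size=1000):
    """Pair every left region with every right region it overlaps."""
    # Group both inputs by (contig, bin); on a cluster this grouping is the shuffle.
    buckets = defaultdict(lambda: ([], []))
    for side, regions in enumerate((left, right)):
        for region in regions:
            contig, start, end = region
            for b in bins_for(start, end, bin_size):
                buckets[(contig, b)][side].append(region)
    # Compare only within a bucket; the set removes pairs that meet in several bins.
    joined = set()
    for lefts, rights in buckets.values():
        for l in lefts:
            for r in rights:
                if overlaps(l, r):
                    joined.add((l, r))
    return sorted(joined)


# Hypothetical toy regions, not real data.
reads = [("chr1", 100, 200), ("chr1", 950, 1050), ("chr2", 100, 200)]
targets = [("chr1", 150, 1000), ("chr2", 300, 400)]
pairs = shuffle_region_join(reads, targets, bin_size=100)
```

Counting how many pairs each target receives would be one simple way to derive a per-region coverage figure from the same join.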

Drop into our Gitter channel to talk with us about this release.

In this post, we will detail how to perform simple scalable population stratification analysis, leveraging ADAM and Spark MLlib, as previously presented at scala.io.

The data source is the set of genotypes from the 1000 Genomes Project, obtained by whole-genome sequencing of samples taken from about 1000 individuals of known geographic and ethnic origin.

This dataset is large enough to test the scalability of the methods we present here, and it lends itself to interesting machine learning. With the data we have, we can for example:

build models to classify genomes by population
run unsupervised learning (clustering) to see whether the populations are reconstructed in the model
build models to infer missing genotypes

We took the second route (clustering), following these steps:

Set up the environment
Collect and extract the original data
Distribute the original data and convert it to the ADAM model
Collect metadata (sample labels and completeness)
Filter the data to match our cluster capacity (number of nodes, CPUs, memory, and wall-clock time…)
Read and prepare the distributed, ADAM-formatted genotypes so that they sit in a separable high-dimensional space (this requires a metric)
Apply KMeans (train/predict)
Assess performance
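
The last three steps can be sketched in a few lines of plain Python. This is an illustration only, not the code from the talk, which ran Spark MLlib on ADAM genotypes. The encoding (each sample is a vector holding the number of alternate alleles, 0, 1 or 2, per variant), the Euclidean metric, the farthest-first seeding, the function names, and the toy samples are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def farthest_first(points, k):
    """Deterministic seeding: start from the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def kmeans(points, k, max_iterations=20):
    """Lloyd's algorithm: alternate between assigning points and updating centers."""
    centers = farthest_first(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else center
            for members, center in zip(clusters, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical samples: alternate-allele counts at four variants.
genotypes = {
    "sample_a1": [0, 0, 0, 0],
    "sample_a2": [0, 1, 0, 0],
    "sample_a3": [0, 0, 1, 0],
    "sample_b1": [2, 2, 2, 2],
    "sample_b2": [2, 1, 2, 2],
    "sample_b3": [2, 2, 1, 2],
}
centers = kmeans(list(genotypes.values()), k=2)  # train
predicted = {name: nearest(vector, centers) for name, vector in genotypes.items()}  # predict
```

Assessing performance then amounts to comparing the predicted cluster of each sample with its known population label.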

We’re proud to announce the release of ADAM 0.15.0!

This release includes important memory and performance improvements, better documentation, new features and many bug fixes.

We have upgraded from Parquet 1.4.3 to 1.6.0 in order to dramatically reduce our memory footprint. For string columns with dictionary encoding, the amount of memory used will now be proportional to the number of dictionary entries instead of the number of records materialized. Parquet 1.6.0 also provides improved column statistics and the ability to store custom metadata. We will use these features in subsequent ADAM releases to improve random access performance. Note that ADAM 0.14.0 had a serious memory regression, so we recommend upgrading to 0.15.0 as soon as possible.
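
To see why dictionary encoding makes memory scale with the number of distinct values, consider this minimal sketch in plain Python. It only illustrates the idea and is not Parquet's implementation; the function names and the column contents are made up for the example.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): each distinct value is stored once and
    every record keeps only a small integer index into the dictionary."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical string column with many repeated values.
contig_names = ["chr1"] * 4 + ["chr2"] * 3 + ["chr1"] * 2
dictionary, indices = dictionary_encode(contig_names)
```

Nine records share two dictionary entries here, so a reader that keeps one string per dictionary entry, rather than one per record, holds two strings instead of nine.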

We are unhappy with the quality of the documentation we have been providing ADAM users and are working to improve it. With this release, all documentation has been centralized into the ./docs directory and we’re using pandoc to convert the Markdown source into both PDF and HTML formats. We are committed to improving the content of the docs over time and welcome your pull requests!

This release includes binary distributions to make it easier for you to get up and running with ADAM. We do not include any Spark or Hadoop artifacts in order to prevent versioning conflicts. For application developers, we have also changed the scope of our Spark and Hadoop dependencies to provided. This means that you can more easily run ADAM with your preferred Spark and Hadoop versions and configuration. We want to make deployment as easy as possible.

This release includes numerous features and bug fixes that are detailed below:

Andy Petrella and Xavier Tordoir gave a talk, Scalable Genomics with ADAM, at Scala.IO in Paris, France.

We are at a time when biotech allows us to get personal genomes for $1000. DNA sequencing has made tremendous progress since the 70s, e.g. more samples per experiment and greater genomic coverage at higher speeds. Genomic analysis standards that have been developed over the years weren’t designed with scalability and adaptability in mind. In this talk, we’ll present a game-changing technology in this area, ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We’ll see how it can speed up sequence reconstruction by a factor of 150.

Andy and Xavier’s talk included a demo: using Spark’s MLlib to do population stratification across 1000 Genomes in just a few minutes in the cloud using Amazon Web Services (AWS). Their talk highlights the advantages of building on open-source technologies, like Apache Spark and Parquet, designed for performance and scale.

Andy also modified the Scala Notebook to create Spark Notebook which enables visualization and reproducible analysis on Apache Spark inside a web browser. A great addition to the Spark ecosystem!

Lightning fast genomics with Spark, Adam and Scala from noootsab

ADAM 0.14.0 is now available. Special thanks to Arun Ahuja, Timothy Danford, Michael L Heuer, Uri Laserson, Frank Nothaft, Andy Petrella and Ryan Williams for their contributions to this release!

This release uses the newly-released Apache Spark 1.1.0 which brings operational and performance improvements in Spark core. Two new scripts, adam-shell and adam-submit, allow you to use ADAM via the Spark shell or the Spark submit script in addition to the ADAM CLI.

The Hadoop-BAM team is now publishing their artifacts to Maven Central (yea!) so we no longer rely on snapshot releases. ADAM 0.14.0 uses the 7.0.0 release of Hadoop-BAM.

This release also adds a new Java plugin interface, improves MD tag processing, and fixes numerous bugs.

We hope that you enjoy this release. Drop by #adamdev on freenode.net, follow us on Twitter or subscribe to our mailing list to stay in touch.

ADAM 0.13.0 is now available!

This release includes genome visualization to view aligned reads and coverage information over a reference region. You simply run e.g. adam viz myreads.adam chr1 from the ADAM source directory and open your favorite web browser to http://localhost:8080/ to view your data.

This release also includes a number of features and bug fixes including upgrading to Spark 1.0.1.

Frank Austin Nothaft gave a talk on ADAM at the Spark Summit in San Francisco.

ADAM—Spark Summit, 2014 from fnothaft

The Spark Summit organizers will make a video of the talk available soon; we will post a link to the talk as soon as it is available.

ADAM 0.12.0 is now available!

This release includes new Parquet utilities that are part of an effort to read/write Parquet directly on S3, eliminating the need to transfer data from S3 to HDFS for processing. This release also upgrades ADAM to Spark 1.0 and provides new schema definitions, bug fixes and features:

ISSUE 264: Parquet-related Utility Classes
ISSUE 259: ADAMFlatGenotype is a smaller, flat version of a genotype schema
ISSUE 266: Removed extra command ‘BuildInformation’
ISSUE 263: Added AdamContext.referenceLengthFromCigar
ISSUE 260: Modifying conversion code to resolve #112.
ISSUE 258: Adding an ‘args’ parameter to the plugin framework.
ISSUE 262: Adding reference assembly name to ADAMContig.
ISSUE 256: Upgrading to Spark 1.0
ISSUE 257: Adds toString method for sequence dictionary.
ISSUE 255: Add equals, canEqual, and hashCode methods to MdTag class
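
As an illustration of what a helper like referenceLengthFromCigar (ISSUE 263) computes, here is a minimal Python sketch. It is not ADAM's implementation. We assume the method returns the number of reference bases an alignment spans, and we rely on the SAM convention that only the M, D, N, = and X operations consume the reference. The Python function name is our own.

```python
import re

# CIGAR operations that advance the position on the reference (SAM convention).
REFERENCE_CONSUMING_OPS = set("MDN=X")
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length_from_cigar(cigar):
    """Number of reference bases covered by an alignment with this CIGAR string."""
    elements = CIGAR_ELEMENT.findall(cigar)
    # Reject strings that contain anything other than <length><operation> pairs.
    if "".join(length + op for length, op in elements) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return sum(int(length) for length, op in elements if op in REFERENCE_CONSUMING_OPS)


# Insertions (I) and soft clips (S) do not consume the reference.
span = reference_length_from_cigar("10M2I5D3S")
# A skipped region (N), as in a spliced alignment, does.
spliced_span = reference_length_from_cigar("5S20M100N30M")
```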

ADAM 0.11.0 is now available.

This release allows you not just to read but also to write SAM/BAM files, adds utilities for trimming reads, implements contig-to-RefSeq translation, refactors SequenceDictionary to include RefSeq information (and drop numeric IDs), prepares ADAMGenotype for incorporating reference model information, and fixes a bug in FASTA fragments.

For details see the following issues…

ISSUE 250: Adding ADAM to SAM conversion.
ISSUE 248: Adding utilities for read trimming.
ISSUE 252: Added a note about rebasing-off-master to CONTRIBUTING.md
ISSUE 249: Cosmetic changes to FastaConverter and FastaConverterSuite.
ISSUE 251: CHANGES.md is updated at release instead of per pull request
ISSUE 247: For #244, Fragments were incorrect order and incomplete
ISSUE 246: Making sample ID field in genotype nullable.
ISSUE 245: Adding ADAMContig back to ADAMVariant.
ISSUE 243: Rebase PR#238 onto master

This short screencast is meant to get someone new to Scala, IntelliJ, and the Big Data Genomics stack up and running with a configured development environment suitable for working with or on projects like ADAM and Avocado.

We’ll walk you through downloading the appropriate JDK, the IntelliJ IDE, and plugins. Then we set up the project (using ADAM as the example), generate sources, package the application, and build the project. Finally, we cover running tests, as well as some basic exploration and code navigation using the IDE.

Note: if you have trouble running mvn package on the command line, you may want to add the following to your .bashrc, or at least export these environment variables before running mvn package:

export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128M"
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`

Links

00:00:14 Oracle JDK 7
00:00:28 IntelliJ IDE
00:00:43 ADAM Github Repository

About us…

This project is supported in part by NIH BD2K Award 1-U54HG007990-01 and NIH Cancer Cloud Pilot Award HHSN261201400006C with collaborators from the AMPLab at UC Berkeley, Genome Informatics Lab at UC Santa Cruz, Icahn School of Medicine at Mount Sinai, Microsoft Research, Cloudera, and the Broad Institute.

Chat with us…

If you’re interested in contributing, take a look at the open “pick me up!” issues.
