
ADAM version 0.22.0 has been released!

Due to major changes between Spark versions 1.6 and 2.0, we build for four combinations of Apache Spark and Scala versions: Spark 1.x and Scala 2.10, Spark 1.x and Scala 2.11, Spark 2.x and Scala 2.10, and Spark 2.x and Scala 2.11.
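To make the build matrix concrete, here is a small sketch that enumerates the four combinations and pairs each with a Maven artifact ID. The naming scheme (a `-spark2` suffix for Spark 2.x builds, followed by `_` and the Scala binary version) and the helper names `adam_core_artifact` and `build_matrix` are assumptions for illustration, not something stated in this announcement; please check the coordinates against Maven Central before depending on them.

```python
from itertools import product

ADAM_VERSION = "0.22.0"  # the release announced here


def adam_core_artifact(spark_major: int, scala_binary: str) -> str:
    """Return the assumed adam-core artifact ID for one build combination.

    Assumption: Spark 2.x builds carry a "-spark2" suffix and every
    artifact ends with "_<scala binary version>".
    """
    if spark_major not in (1, 2):
        raise ValueError(f"unsupported Spark major version: {spark_major}")
    if scala_binary not in ("2.10", "2.11"):
        raise ValueError(f"unsupported Scala binary version: {scala_binary}")
    spark_suffix = "-spark2" if spark_major == 2 else ""
    return f"adam-core{spark_suffix}_{scala_binary}"


def build_matrix():
    """List (spark_major, scala_binary, artifact_id) for all four builds."""
    return [
        (spark, scala, adam_core_artifact(spark, scala))
        for spark, scala in product((1, 2), ("2.10", "2.11"))
    ]


if __name__ == "__main__":
    for spark, scala, artifact in build_matrix():
        print(f"Spark {spark}.x / Scala {scala}: "
              f"org.bdgenomics.adam:{artifact}:{ADAM_VERSION}")
```

Pick the entry that matches the Spark and Scala versions already running on your cluster; mixing Scala binary versions on one classpath generally fails at runtime.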

The focus of this release was performance, including major improvements to BQSR and INDEL realignment.

More than 80 other issues were closed in this release, including bug fixes for VCF validation and paired-end FASTQ parsing, and new functionality such as pipe API support for features.

The full list of changes since version 0.21.0 is below.

Closed issues:

  • Realign all reads at target site, not just reads with no mismatches #1469
  • Parallel file merger fails if the output file is smaller than the HDFS block size #1467
  • Add new realigner arguments to docs #1465
  • Recalibrate method misspelled as recalibateBaseQualities #1463
  • FASTQ may try to split GZIPed files #1459
  • Update to Hadoop-BAM 7.8.0 #1455
  • Publish Markdown and Scaladoc to the interwebs #1453
  • Make VariantContextConverter public #1451
  • Apply method in FragmentRDD is package private #1445
  • Thread pool will block inside of pipe command for streams too large to buffer #1442
  • FeatureRDD.apply() does not allow addition of other parameters with defaults in the case class #1439
  • Question : Why the number of paired sequence in adam-0.21.0 less than adam-0.19.0? #1424
  • loadCoverage missing from Java API #1420
  • Estimate contig lengths in SequenceDictionary for BED, GFF3, GTF, and NarrowPeak feature formats #1410
  • loadIntervalList FeatureRDD has empty SequenceDictionary #1409
  • problem using transform command #1406
  • Add coveralls #1403
  • INDEL realigner binary search conditional is flipped #1402
  • Delete adam-scripts/R #1398
  • Data missing when transforming FASTQ to Adam #1393
  • java.io.FileNotFoundException when file exists #1385
  • Off-by-1 error in FASTQ InputFormat start positioning code #1383
  • Set the wrong value for end for symbolic alts #1381
  • RecordGroupDictionary should support isEmpty #1380
  • Add pipe API in and out formatters for Features #1374
  • Increase visibility for SupportedHeaderLines.allHeaderLines #1372
  • Bits of VariantContextConverter don’t get ValidationStringencied #1371
  • Add Markdown docs for Pipe API #1368
  • Array[Consensus] not registered #1367
  • ValidationStringency in MDTagging should apply to reads on unknown references #1365
  • When doing a release, the SNAPSHOT should bump by 0.1.0, not 0.0.1 #1364
  • FromKnowns consensus generator fails if no reads overlap a consensus #1362
  • Performance tune-up in BQSR #1358
  • Increase visibility for ADAMContext.sc and/or getFs… methods #1356
  • Pipe API formatters need to be public #1354
  • Version 0.21.0: VariantContextConverter fails for 1000G VCF data #1353
  • ConsensusModel’s can’t really be instantiated #1352
  • Runtime conflicts in transitive versions of Guava dependency #1350
  • Transcript Effects ignored if more than 1 #1347
  • Remove “fork” tag from releases #1344
  • Refactor isSorted boolean parameters to sorted #1341
  • Loading GZipped VCF returns an empty RDD #1333
  • Follow up on error messages in build scripts #1331
  • Bump Spark 2 build to Spark 2.1.0 #1330
  • FeatureRDD instantiation tries to cache the RDD #1321
  • Load queryname sorted BAMs as Fragments #1303
  • Run Duplicate Marking on Fragments #1302
  • GenomicRDD.pipe may hang on failure error codes #1282
  • IllegalArgumentException Wrong FS for vcf_head files on HDFS #1272
  • java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord #1240
  • Investigate sorted join in dataset api #1223
  • Support looser validation stringency for loading some VCF Integer fields #1213
  • Add new feature-overlap command to demonstrate new region joins #1194
  • What should our API at the command line look like? #1178
  • Split apart partition and join in ShuffleRegionJoin #1175
  • Merging files should be multithreaded #1164
  • File _rgdict.avro does not exist #1150
  • how to collect the .adam files from Spark cluster multiple nodes and some questions about avocado #1140
  • JFYI: tiny forked adam-core “0.20.0” release #1139
  • Samtools (htslib) integration testing #1120
  • AlignmentRecordRDD does not extend GenomicRDD per javac #1092
  • Release ADAM version 0.21.0 #1088
  • Difference running markdups with and without projection #1014
  • ADAM to BAM conversion fails using relative path #1012
  • Refactor SequenceDictionary to use Contig instead of SequenceRecord #997
  • Customize adam-main cli from configuration file #918
  • genotypeType for genotypes with multiple OtherAlt alleles? #897
  • How to convert genotype DataFrame to VariantContext DataFrame / RDD #886
  • Ensure Java API is up-to-date with Scala API #855
  • Improve parallelism during FASTA output #842
  • Explicitly validate user args passed to transform enhancement #841
  • BroadcastRegionJoin fails with unmapped reads #821
  • Resolve Fragment vs. SingleReadBucket #789
  • Add profile for skipping test compilation/resolution #713
  • Next on empty iterator in BroadcastRegionJoin #661
  • Cleanup code smell in sort work balancing code #635
  • Remove reliance on MD tags #622
  • Provide low-impact alternative to transform -repartition for reducing partition size #594
  • Clean up Rich records #577
  • Create standardized, interpretable exceptions for error reporting #420
  • Create ADAM Benchmarking suite #120

Merged and closed pull requests:

  • [ADAM-1469] Don’t filter on whether reads have mismatches during realignment #1470 (fnothaft)
  • [ADAM-1467] Skip concat call if there is only one shard. #1468 (fnothaft)
  • [ADAM-1465] Updating realigner CLI docs. #1466 (fnothaft)
  • [ADAM-1463] Rename recalibateBaseQualities method as recalibrateBaseQualities #1464 (heuermh)
  • [ADAM-1453] Add hooks to publish ADAM docs from CI flow. #1461 (fnothaft)
  • [ADAM-1459] Don’t split FASTQ when compressed. #1459 (fnothaft)
  • [ADAM-1451] Make VariantContextConverter class and convert methods public #1452 (fnothaft)
  • Moving API overview from building apps doc to new source file. #1450 (heuermh)
  • [ADAM-1424] Adding test for reads dropped in 0.21.0. #1448 (heuermh)
  • [ADAM-1439] Add inferSequenceDictionary ctr to FeatureRDD. #1447 (heuermh)
  • [ADAM-1445] Make apply method for FragmentRDD public. #1446 (fnothaft)
  • [ADAM-1442] Fix thread pool deadlock in GenomicRDD.pipe #1443 (fnothaft)
  • [ADAM-1164] Add parallel file merger. #1441 (fnothaft)
  • Dependency version bump + BroadcastRegionJoin fix #1440 (fnothaft)
  • added JavaApi for loadCoverage #1437 (akmorrow13)
  • Update versions, etc. in build docs #1435 (heuermh)
  • Add test sample(verify number of reads in loadAlignments function) and ADAM SNAPSHOT document #1433 (xubo245)
  • Add cache argument to loadFeatures, additional Feature timers #1427 (heuermh)
  • feat: speed up 2bit file extract #1426 (Blaok)
  • BQSR refactor for perf improvements #1423 (fnothaft)
  • Add ADAMContext/GenomicRDD/pipe docs #1422 (fnothaft)
  • INDEL realigner cleanup #1412 (fnothaft)
  • Estimate contig lengths in SequenceDictionary for BED, GFF3, GTF, and NarrowPeak feature formats #1411 (heuermh)
  • Add coveralls badge to README.md. #1408 (fnothaft)
  • [ADAM-1403] Push coverage reports to Coveralls. #1404 (fnothaft)
  • Added instrumentation timers around joins. #1401 (fnothaft)
  • Add Apache Spark version to --version text #1400 (heuermh)
  • [ADAM-1398] Delete adam-scripts/R. #1399 (fnothaft)
  • [ADAM-1383] Use gt instead of gteq in FASTQ input format line size checks #1396 (fnothaft)
  • Maint spark2 2.11 0.21.0 #1395 (A-Tsai)
  • [ADAM-1393] fix missing reads when transforming fastq to adam #1394 (A-Tsai)
  • [ADAM-1380] Adds isEmpty method to RecordGroupDictionary. #1392 (fnothaft)
  • [ADAM-1381] Fix Variant end position. #1389 (fnothaft)
  • Make javac see that AlignmentRecordRDD extends GenomicRDD #1386 (fnothaft)
  • Added ShuffleRegionJoin usage docs #1384 (devin-petersohn)
  • Misc. INDEL realigner bugfixes #1382 (fnothaft)
  • Add pipe API in and out formatters for Features #1378 (heuermh)
  • [ADAM-1356] Make ADAMContext.getFsAndFiles and related protected visibility #1376 (heuermh)
  • [ADAM-1372] Increase visibility for DefaultHeaderLines.allHeaderLines #1375 (heuermh)
  • [ADAM-1371] Wrap ADAM->htsjdk VariantContext conversion with validation stringency. #1373 (fnothaft)
  • [ADAM-1367] Register Consensus array for serialization. #1369 (fnothaft)
  • [ADAM-1365] Apply validation stringency to reads on missing contigs when MD tagging #1366 (fnothaft)
  • [ADAM-1362] Fixing issue where FromKnowns consensus model fails if no reads hit a target. #1363 (fnothaft)
  • [ADAM-1352] Clean up consensus model usage. #1357 (fnothaft)
  • Increase visibility for InFormatter case classes from package private to public #1355 (heuermh)
  • Use htsjdk getAttributeAsList for VCF INFO ANN key #1348 (heuermh)
  • Fixes parsing variant annotations for multi-allelic rows #1346 (majkiw)
  • Sort pull requests by id #1345 (heuermh)
  • HBase genotypes backend -revised #1335 (jpdna)
  • [ADAM-1330] Move to Spark 2.1.0. #1332 (fnothaft)
  • Support deduping fragments #1309 (fnothaft)
  • [ADAM-1280] Silence CRAM logging in tests. #1294 (fnothaft)
  • Added test to try and repro #1282. #1292 (fnothaft)
