ADAM 0.20.0 Released - Big Data Genomics

ADAM version 0.20.0 has been released!

Due to major changes between Spark versions 1.6 and 2.0, we now build for combinations of Apache Spark and Scala versions: Spark 1.x and Scala 2.10, Spark 1.x and Scala 2.11, Spark 2.x and Scala 2.10, and Spark 2.x and Scala 2.11.

Since the last release, version 0.19.0, we have closed more than 180 issues and merged more than 120 pull requests.

We added a new pipe API, allowing for streaming alignment and variant records out to external applications and streaming back in the results. Several new region join implementations are now public API, including a broadcast inner join, broadcast right outer join, sort-merge inner join, sort-merge right outer join, sort-merge left outer join, sort-merge full outer join, sort-merge inner join followed by a group by, and a sort-merge right outer join followed by a group by.
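To make the difference between the join flavors concrete, here is a minimal sketch in plain Python. It is not ADAM's API: the `Region` tuple, `overlaps`, `region_join`, and the `how` strings are hypothetical names chosen for illustration, and the sample coordinates are made up. It only shows which pairs an inner, left outer, right outer, or full outer overlap join would be expected to emit. It does not model the broadcast versus sort-merge strategies, which concern how the work is distributed rather than which pairs come out.

```python
from typing import List, Optional, Tuple

# Hypothetical region type: (contig name, start, end), half-open [start, end).
Region = Tuple[str, int, int]
Pair = Tuple[Optional[Region], Optional[Region]]

JOIN_TYPES = ("inner", "left_outer", "right_outer", "full_outer")


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions are on the same contig and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def region_join(left: List[Region], right: List[Region], how: str = "inner") -> List[Pair]:
    """Naive O(n*m) overlap join; an illustration of semantics, not of performance."""
    if how not in JOIN_TYPES:
        raise ValueError(f"unknown join type: {how}")

    pairs: List[Pair] = []
    matched_right = set()  # indices of right-side regions that overlapped something

    for l in left:
        hits = [i for i, r in enumerate(right) if overlaps(l, r)]
        matched_right.update(hits)
        if hits:
            pairs.extend((l, right[i]) for i in hits)
        elif how in ("left_outer", "full_outer"):
            # Keep left-side regions that overlap nothing.
            pairs.append((l, None))

    if how in ("right_outer", "full_outer"):
        # Keep right-side regions that overlap nothing.
        pairs.extend((None, r) for i, r in enumerate(right) if i not in matched_right)

    return pairs


# Made-up example data: two reads and two annotated regions.
reads = [("chr1", 100, 150), ("chr1", 400, 450)]
genes = [("chr1", 120, 300), ("chr2", 0, 50)]

for how in JOIN_TYPES:
    print(how, region_join(reads, genes, how))
```

With this data the inner join yields only the one overlapping pair, each one-sided outer join adds the unmatched region from its own side paired with `None`, and the full outer join adds both.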

Alignment records can now be read from and written to CRAM format. We updated upstream dependencies on Hadoop-BAM and htsjdk to fix various alignment record header bugs and to add support for gzip and BGZF compressed VCF.

Our sequence feature schema now more closely follows the GFF3 specification, while still supporting BED, GFF2/GTF, IntervalList, and NarrowPeak formats. We also added a new sample schema for e.g. SRA sample metadata.

With this version, the core ADAM APIs are undergoing a major refactoring. We changed many method names on ADAMContext to make the API more consistent. We also added RDD wrapper classes to increase performance by serializing metadata (such as record groups, samples, and sequence dictionaries) to disk separately from primary data in Parquet. API incompatibilities between ADAM releases will settle down by the 1.0 release, currently targeted for early 2017.

The full list of changes since version 0.19.0 is below.

Closed issues:

Sorting by reference index seems doesn’t work or sorted by DESC order? #1204
master won’t compile #1200
VCF format tag SB field parse error in loading #1199
Publish sources JAR with snapshots #1195
Type SparkFunSuite in package org.bdgenomics.utils.misc is not available #1193
MDTagging fails on GRCh38 #1192
Fix stack overflow in IndelRealigner serialization #1190
Delete ./scripts/commit-pr.sh #1188
Hadoop globStatus returns null if no glob matches #1186
Swapping out IntervalRDD under GenomicRDDs #1184
How to get “SO coordinate” instead of “SO unsorted”? #1182
How to read glob of multiple parquet Genotype #1179
Update command line doc and examples in README.md #1176
FastqRecordConverter needs cleanup and tests #1172
TransformFormats write to .gff3 file path incorrectly writes as parquet #1168
Should be able to merge shards across two different file systems #1165
RG ID gets written as the index, not the record group name #1162
Users should be able to save files as -single without merging them #1161
Users should be able to set size of buffer used for merging files #1160
Bump Hadoop-BAM to 7.7.0 #1158
adam-shell prints command trace to stdout #1154
Map IntervalList format column four to feature name or attributes? #1152
Parquet storage of VariantContext #1151
vcf2adam unparsable vcf record #1149
Reorder kryo.register statements in ADAMKryoRegistrator #1146
Make region joins public again #1143
Support CRAM input/output #1141
Transform should run with spark.kryo.requireRegistration=true #1136
adam-shell not handling bash args correctly #1132
Remove Gene and related models and parsing code #1129
Generate Scoverage reports when running CI #1124
Remove PairingRDD #1122
SAMRecordConverter.convert takes unused arguments #1113
Add Pipe API #1112
Improve coverage in Feature unit tests #1106
K-mer.scala code #1105
add -single file output option to ADAM2Vcf #1102
adam2vcf Fails with Sample not serializable #1100
ReferenceRegion.apply(AlignmentRecord) should not NPE on unmapped reads #1099
Add outer region join implementations #1098
VariantContextConverter never returns DatabaseVariantAnnotation #1097
loadvcf: conflicting require statement #1094
ADAM version 0.19.0 will not run on Spark version 2.0.0 #1093
Be more rigorous with FileSystem.get #1087
Remove network-connected and default test-related Maven profiles #1073
Releases should get pushed to Spark Packages #1067
Invalid POM for cli on 0.19.0 #1066
scala.MatchError RegExp does not catch colons in value part properly #1061
Support writing IntervalList header for features #1059
Add -single support when writing features in native formats #1058
Remove workaround for gzip/BGZF compressed VCF headers #1057
Clean up if clauses in Transform #1053
Adam-0.18.2 can not load Adam-0.14.0 adamSave function data (sam) #1050
filterByOverlappingRegion Incorrect for Genotypes #1042
Move Interval trait to utils, added in #75 #1041
Remove implicit GenomicRDD to RDD conversion #1040
VCF sample metadata – proposal for a GenotypedSampleMetadata object #1039
[build system] ADAM test builds pollute /tmp, leaving lots of cruft… #1038
adamMarkDuplicates function in AlignmentRecordRDDFunctions class can not mark the same read? #1037
test MarkDuplicatesSuite with two similar read in ref and start position and different avgPhredScore, error! #1035
Explore protocol buffers vs Avro #1031
Increase Avro dependency version to 1.8.0 #1029
ADAM specific logging #1024
Reenable Travis CI for pull request builds #1023
Bump Apache Spark version to 1.6.1 in Jenkins #1022
ADAM compatibility with Spark 2.0 #1021
ADAM to BAM conversion failing on 1000G file #1013
Factor out *RDDFunctions classes #1011
Port single file BAM and header code to VCF #1009
Roll Jenkins JDK 8 changes into ./scripts/jenkins-test #1008
Support GFF3 format #1007
Separate fat jar build from adam-cli to new maven module #1006
adam-cli POM invalid: maven.build.timestamp #1004
Sub-partitioning of Parquet file for ADAM #1003
Flattening the Genotype schema #1002
install adam 0.19 error! #1001
How to solve it please? #1000
Has the project realized alignment reads to reference genome algorithm? #996
All file-based input methods should support running on directories, compressed files, and wildcards #993
Contig to ContigName Change not reflected in AlignmentRecordField #991
Add homebrew guidelines to release checklist or automate PR generation #987
fix deprecation warnings #985
rename fragments package #984
Explore if SeqDict data can be factored out more aggressively #983
Make “Adam” all caps in filename Adam2Fastq.scala #981
Adam2Fastq should output reverse complement when 0x10 flag is set for read #980
Allow lowercase letters in jar/version names #974
Add stringency parameter to flagstat #973
Arg-array parsing problem in adam-submit #971
Pass recordGroup parameter to loadPairedFastq #969
Send a number of partitions to sc.textFile calls #968
adamGetReferenceString doesn’t reduce pairs correctly #967
Update ADAM formula in homebrew-science to version 0.19.0 #963
BAM output in ADAM appears to be corrupt #962
Remove code workarounds necessary for Spark 1.2.1/Hadoop 1.0.x support #959
Issue with version 18.0.2 #957
Expose sorting by reference index #952
.rgdict and .seqdict files are not placed in the adam directory #945
Why does count_kmers not return k-mers that are split between two records? #930
Load legacy file formats to Spark SQL Dataframes #912
Clean up RDD method names #910
Load/store sequence dictionaries alongside Genotype RDDs #909
vcf2adam -print_metrics throws IllegalStateException on Spark 1.5.2 or later #902
error: no reads in first split: bad BAM file or tiny split size? #896
FastaConverter.FastaDescriptionLine not kryo-registered #893
Work With ADAM fasta2adam in a distributed mode #881
vcf2adam –> Exception in thread “main” java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less; #871
Code coverage profile is broken #849
Building Adam on OS X 10.10.5 with Java 1.8 #835
Normalize AlignmentRecord.recordGroup* fields onto a separate record type #828
Gracefully handle missing Spark- and Hadoop-versions in jenkins-test; document how to set them. #827
Use Adam File with Hive #820
How do we handle reads that don’t have original quality scores when converting to FASTQ with original qualities? #818
SAMFileHeader “sort order” attribute being un-set during file-save job #800
Use same sort order as Samtools #796
RNAME and RNEXT fields jumbled on transform BAM->ADAM->BAM #795
Support loading multiple indexed read files #787
Duplicate OUTPUT command line argument metaVar in adam2fastq #776
Allow Variant to ReferenceRegion conversion #768
Spark Errors References Deprecated SPARK_CLASSPATH #767
Spark Errors References Deprecated SPARK_CLASSPATH #766
adam2vcf fails with -coalesce #735
Writing to a BAM file with adamSAMSave consistently fails #721
BQSR on C835.HCC1143_BL.4 uses excessive amount of driver memory #714
Support writing RDD[Feature] to various file formats #710
adamParquetSave has a menacing false error message about *.adam extension #681
BAMHeader not set when running on a cluster #676
spark 1.3.1 upgarde to hortonworks HDP 2.2.4.2-2? #675
Symbol case class is nucleotide-centric #672
xAssembler cannot be build using mvn #658
adam-submit VerifyError #642
vcf2adam : Unsupported type ENUM #638
Update CDH documentation #615
Remove and generalize plugin code #602
Fix record oriented shuffle #599
Migrate preprocessing stages out of ADAM #598
Publish/socialize a roadmap #591
Eliminate format detection and extension checks for loading data #587
Improve error message when we can’t find a ReferenceRegion for a contig #582
Do reference partitioners restrict a partition to contain keys from a single contig? #573
Connection refused errors when transforming BAM file with BQSR #516
ReferenceRegion shouldn’t extend Ordered #511
Documentation for common usecases #491
Improve handling of “*” sequences during BQSR #484
Original qualities are parsed out, but left in attribute fields #483
Need a FileLocator that mirrors the use of Path in HDFS #477
FileLocator should support finding “child” locators. #476
Add S3 based Parquet directory loader #463
Should FASTQ output use reads’ “original qualities”? #436
VcfStringUtils unused? #428
We should be able to filter genotypes that overlap a region #422
Create a simplified vocabulary for naming projections. #419
Update documentation #406
Bake off different region join implementations #395
Handle no-ops more intelligently when creating MD tags #392
Remove all the commands in the “CONVERSION OPERATIONS” CommandGroup #373
Fail to Write RDD into HDFS with Parquet Format #344
Refactor ReferencePositionWithOrientation #317
Add docs about SPARK_LOCAL_IP #305
PartitionAndJoin should throw an exception if it sees an unmapped read #297
Add insert size calculation #296
Newbie questions – learning resources? Reading a range of records from Adam? #281
Add variant effect ontology #261
Don’t flatten optional SAM tags into a string #240
Characterize impact of partition size on pileup creation #163
Need to support BCF output format #153
Allow list of commands to be injected into adam-cli AdamMain #132
Parse out common annotations stored in VCF format #118
Update normalization code to enable normalization of sequences with more than two indels #64
Add clipping heuristic to indel realigner #63
BQSR should support recalibration across multiple ADAM files #58

Merged and closed pull requests:

fix SB tag parsing #1209 (fnothaft)
Fastq record converter #1208 (fnothaft)
Doc suggested partitionSize in ShuffleRegionJoin #1207 (jpdna)
Test demonstrating region join failure #1206 (jpdna)
fix SB tag parsing #1203 (jpdna)
fix build #1201 (ryan-williams)
[ADAM-1192] Correctly handle other whitespace in FASTA description. #1198 (fnothaft)
[ADAM-1190] Manually (un)pack IndelRealignmentTarget set. #1191 (fnothaft)
[ADAM-1188] Delete scripts/commit-pr.sh #1189 (fnothaft)
[ADAM-1186] Mask null from fs.globStatus. #1187 (fnothaft)
Fastq record converter #1185 (zyxue)
[ADAM-1182] isSorted=true should write SO:coordinate in SAM/BAM/CRAM header. #1183 (fnothaft)
Add scoverage aggregator and fail on low coverage. #1181 (fnothaft)
[ADAM-1179] Improve error message when globbing a parquet file fails. #1180 (fnothaft)
[ADAM-1176] Update command line doc and examples in README.md #1177 (heuermh)
Refactor CLIs for merging sharded files #1167 (fnothaft)
Update Hadoop-BAM to version 7.7.0 #1166 (heuermh)
[ADAM-1162] Write record group string name. #1163 (fnothaft)
Map IntervalList format column four to feature name #1159 (heuermh)
Make AlignmentRecordConverter public so that it can be used from other projects #1157 (tomwhite)
added predicate option to loadCoverage #1156 (akmorrow13)
[ADAM-1154] Change set -x to set -e in ./bin/adam-shell. #1155 (fnothaft)
Remove Gene and related models and parsing code #1153 (heuermh)
Reorder kryo.register statements in ADAMKryoRegistrator #1148 (heuermh)
Updated GenomicPartitioners to accept additional key. #1147 (akmorrow13)
[ADAM-1141] Add support for saving/loading AlignmentRecords to/from CRAM. #1145 (fnothaft)
misc pom/test/resource improvements #1142 (ryan-williams)
[ADAM-1136] Transform runs successfully with kryo registration required #1138 (fnothaft)
[ADAM-1132] Fix improper quoting of bash args in adam-shell. #1133 (fnothaft)
Remove StructuralVariant and StructuralVariantType, add names field to Variant #1131 (heuermh)
Remove StructuralVariant and StructuralVariantType, add names field to Variant #1130 (heuermh)
PR #1108 with issue #1122 #1128 (fnothaft)
[ADAM-1038] Eliminate writing to /tmp during CI builds. #1127 (fnothaft)
Update for bdg-formats code style changes #1126 (heuermh)
[ADAM-1124] Add Scoverage and generate coverage reports in Jenkins. #1125 (fnothaft)
[ADAM-1093] Move to support Spark 2.0.0. #1123 (fnothaft)
remove duplicated dependency #1119 (ryan-williams)
Clean up ADAMContext #1118 (fnothaft)
[ADAM-993] Support loading files using globs and from directory paths. #1117 (fnothaft)
[ADAM-1087] Migrate away from FileSystem.get #1116 (fnothaft)
[ADAM-1099] Make reference region not throw NPE. #1115 (fnothaft)
Add pipes API #1114 (fnothaft)
[ADAM-1105] Use assembly jar in adam-shell. #1111 (fnothaft)
Add outer joins #1109 (fnothaft)
Modified CalculateDepth to calcuate coverage from alignment files #1108 (akmorrow13)
Resolves various single file save/header issues #1104 (fnothaft)
[ADAM-1100] Resolve Sample Not Serializable exception #1101 (fnothaft)
added loadIndexedVcf and loadIndexedBam for multiple ReferenceRegions #1096 (akmorrow13)
Added support for Indexed VCF files #1095 (akmorrow13)
[ADAM-582] Eliminate .get on option in FragmentCoverter. #1091 (fnothaft)
[ADAM-776] Rename duplicate OUTPUT metaVar in ADAM2Fastq. #1090 (fnothaft)
refactored ReferenceFile to require SequenceDictionary #1086 (akmorrow13)
[ADAM-1073] Remove network-connected and default test-related Maven profiles #1082 (heuermh)
[ADAM-1053] Clean up Transform #1081 (fnothaft)
[ADAM-1061] Clean up attributes regex and denormalized fields #1080 (fnothaft)
Extended TwoBitFile and NucleotideContigFragmentRDDFunctions to behave more similar #1079 (akmorrow13)
Refactor variant and genotype annotations #1078 (heuermh)
[ADAM-1039] Add basic support for Sample record. #1077 (fnothaft)
Remove code workarounds necessary for Spark 1.2.1/Hadoop 1.0.x support #1076 (heuermh)
[ADAM-194] Use separate filtersFailed and filtersPassed arrays for variant quality filters #1075 (heuermh)
Whitespace code style fixes #1074 (heuermh)
[ADAM-1006] Split überjar out to adam-assembly submodule. #1072 (fnothaft)
Remove code coverage profile #1071 (heuermh)
[ADAM-768] ReferenceRegion from variant/genotypes #1070 (fnothaft)
[ADAM-1044] Support VCF annotation ANN field #1069 (heuermh)
[ADAM-1067] Add release documentation and scripting for Spark Packages. #1068 (fnothaft)
[ADAM-602] Remove plugin code. #1065 (fnothaft)
Refactoring org.bdgenomics.adam.io package. #1064 (fnothaft)
Cleanup in org.bdgenomics.adam.converters package. #1062 (fnothaft)
[ADAM-1057] Remove workaround for gzip/BGZF compressed VCF headers #1057 (heuermh)
Cleanup on org.bdgenomics.adam.algorithms.smithwaterman package. #1056 (fnothaft)
Documentation cleanup and minor refactor on the consensus package. #1055 (fnothaft)
Add KEYS with public code signing keys #1054 (heuermh)
Adding GA4GH 0.5.1 converter for reads. #1052 (fnothaft)
[ADAM-1011] Refactor to add GenomicRDDs for all Avro types #1051 (fnothaft)
removed interval trait and redirected to interval in utils-intervalrdd #1046 (akmorrow13)
[ADAM-952] Expose sorting by reference index. #1045 (fnothaft)
overlap query reflects new formats #1043 (erictu)
Changed loadIndexedBam to use hadoop-bam InputFormat #1036 (fnothaft)
Increase Avro dependency version to 1.8.0 #1034 (heuermh)
Improved README fix using feedback from other approach review. #1034 (InvisibleTech)
Error in the README.md for kmer.scala example, need to get rdd first. #1032 (InvisibleTech)
Add fragmentEndPosition to NucleotideContigFragment #1030 (heuermh)
Logging to be done by ADAM utils code rather than Spark #1028 (jpdna)
add maxScore #1027 (xubo245)
[ADAM-1008] Modify jenkins-test script to support Java 8 build. #1026 (fnothaft)
whitespace change, do not merge #1025 (shaneknapp)
require kryo registration in tests #1020 (ryan-williams)
print full stack traces on test failures #1019 (ryan-williams)
bump commons-io version #1017 (ryan-williams)
exclude javadoc jar in adam-shell #1016 (ryan-williams)
[ADAM-909] Refactoring variation RDDs. #1015 (fnothaft)
Modified CalculateDepth to get coverage on whole alignment adam files #1010 (akmorrow13)
[ADAM-1004] Remove recursive maven.build.timestamp declaration #1005 (heuermh)
Maint 2.11 0.19.0 #999 (tushu1232)
[ADAM-710] Add saveAs methods for feature formats GTF, BED, IntervalList, and NarrowPeak #998 (heuermh)
Moving Adam2Fastq to ADAM2Fastq #995 (heuermh)
Update release doc for CHANGES.md and homebrew #994 (heuermh)
Update to AlignmentRecordField and its usages as contig changed to co… #992 (jpdna)
[ADAM-974] Short term fix for multiple ADAM cli assembly jars check #990 (heuermh)
Update hadoop-bam dependency version to 7.5.0 #989 (heuermh)
Replaced Contig with ContigName in AlignmentRecord and related changes #988 (jpdna)
fix some deprecation/style things and rename a pkg #986 (ryan-williams)
Fix Adam2fastq in case of read with both reverse and unmapped flags #982 (jpdna)
[ADAM-510] Refactoring RDD function names #979 (heuermh)
Use .adam/_{seq,rg}dict.avro paths for Avro-formatted dictionaries #978 (heuermh)
Remove unused file VcfHeaderUtils.scala #977 (heuermh)
add validation stringency to bam parsing, flagstat #976 (ryan-williams)
more permissible jar regex in adam-submit #975 (ryan-williams)
fix bash arg array processing in adam-submit #972 (ryan-williams)
adamGetReferenceString reduces pairs correctly, fixes #967 #970 (erictu)
A few improvements #966 (ryan-williams)
improve SW performance by replacing functional reductions with imperative ones #965 (noamBarkai)
[ADAM-962] Fix corrupt single-file BAM output. #964 (fnothaft)
[ADAM-960] Updating bdg-utils dependency version to 0.2.4 #961 (heuermh)
[ADAM-946] Fixes to FlagStat for Samtools concordance issue #954 (jpdna)
Use hadoop-bam BAMInputFormat to do loadIndexedBam #953 (andrewmchen)
Add -print_metrics option to Jenkins build #947 (heuermh)
adam2vcf doesn’t have info fields #939 (andrewmchen)
[ADAM-893] Register missing serializers. #933 (fnothaft)
