ADAM 0.20.0 Released
ADAM version 0.20.0 has been released!
Due to major changes between Spark versions 1.6 and 2.0, we now build for combinations of Apache Spark and Scala versions: Spark 1.x and Scala 2.10, Spark 1.x and Scala 2.11, Spark 2.x and Scala 2.10, and Spark 2.x and Scala 2.11.
Since the last release, version 0.19.0, we have closed more than 180 issues and merged more than 120 pull requests.
We added a new pipe API that streams alignment and variant records out to external applications and streams the results back in. Several new region join implementations are now public API, including a broadcast inner join, broadcast right outer join, sort-merge inner join, sort-merge right outer join, sort-merge left outer join, sort-merge full outer join, sort-merge inner join followed by a group by, and a sort-merge right outer join followed by a group by.
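To make the join vocabulary above concrete, here is a minimal sketch of what an inner and a right outer region join return. It is plain Python and does not use the ADAM API: the `Region` type, the function names, and the half-open coordinate convention are all assumptions made for illustration, and the nested loop stands in for the broadcast and sort-merge strategies, which differ in how they find overlaps rather than in what they return.

```python
from typing import List, NamedTuple, Optional, Tuple


class Region(NamedTuple):
    """Hypothetical genomic interval; start is inclusive, end is exclusive."""
    contig: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """Two regions overlap if they share a contig and at least one base."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def inner_join(left: List[Region], right: List[Region]) -> List[Tuple[Region, Region]]:
    """Keep only the (left, right) pairs that overlap."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


def right_outer_join(
    left: List[Region], right: List[Region]
) -> List[Tuple[Optional[Region], Region]]:
    """Keep every right-hand region; pair it with None if nothing overlaps it."""
    joined: List[Tuple[Optional[Region], Region]] = []
    for r in right:
        matches = [l for l in left if overlaps(l, r)]
        if matches:
            joined.extend((l, r) for l in matches)
        else:
            joined.append((None, r))
    return joined


# Illustrative data only: two reads and two features.
reads = [Region("chr1", 100, 200), Region("chr1", 500, 600)]
features = [Region("chr1", 150, 250), Region("chr2", 0, 50)]

print(inner_join(reads, features))
print(right_outer_join(reads, features))
```

The left outer and full outer variants follow the same pattern, keeping unmatched regions from the left side or from both sides. The "followed by a group by" variants would collect all matches for a region into one list instead of emitting one pair per match.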
Alignment records can now be read from and written to CRAM format. We updated upstream dependencies on Hadoop-BAM and htsjdk to fix various alignment record header bugs and to add support for gzip and BGZF compressed VCF.
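For readers unsure how gzip and BGZF differ, the sketch below classifies a file by its first bytes. It is not how ADAM, Hadoop-BAM, or htsjdk detect compression; the function name `compression_kind` is hypothetical. It relies only on the published layout: a BGZF block is a gzip member whose header carries an extra subfield tagged "BC".

```python
import gzip
import struct


def compression_kind(header: bytes) -> str:
    """Classify leading file bytes as 'bgzf', 'gzip', or 'plain'."""
    # Every gzip member starts with the magic bytes 0x1f 0x8b.
    if len(header) < 4 or header[0] != 0x1F or header[1] != 0x8B:
        return "plain"
    # Bit 2 of the FLG byte (FEXTRA) says an extra field is present.
    if not header[3] & 0x04 or len(header) < 12:
        return "gzip"
    # XLEN is a little-endian 16-bit length at offset 10.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the subfields looking for SI1='B' (66), SI2='C' (67), SLEN=2.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == 66 and si2 == 67 and slen == 2:
            return "bgzf"
        pos += 4 + slen
    return "gzip"


# The 28-byte empty BGZF block that marks end of file.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

print(compression_kind(BGZF_EOF))
print(compression_kind(gzip.compress(b"##fileformat=VCFv4.2\n")))
print(compression_kind(b"##fileformat=VCFv4.2\n"))
```

Because BGZF is a series of independent gzip members, any gzip reader can decompress it, while block-aware readers can also split it across workers.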
Our sequence feature schema now follows the GFF3 specification more closely, while still supporting BED, GFF2/GTF, IntervalList, and NarrowPeak formats. We also added a new sample schema for sample metadata, such as that from SRA.
With this version, the core ADAM APIs are undergoing a major refactoring. We changed many method names on ADAMContext to make the API more consistent. We also added RDD wrapper classes to increase performance by serializing metadata (such as record groups, samples, and sequence dictionaries) to disk separately from the primary data in Parquet. API incompatibilities between ADAM releases will settle down by the 1.0 release, currently targeted for early 2017.
The full list of changes since version 0.19.0 is below.
Closed issues:
- Sorting by reference index seems doesn’t work or sorted by DESC order? #1204
- master won’t compile #1200
- VCF format tag SB field parse error in loading #1199
- Publish sources JAR with snapshots #1195
- Type SparkFunSuite in package org.bdgenomics.utils.misc is not available #1193
- MDTagging fails on GRCh38 #1192
- Fix stack overflow in IndelRealigner serialization #1190
- Delete ./scripts/commit-pr.sh #1188
- Hadoop globStatus returns null if no glob matches #1186
- Swapping out IntervalRDD under GenomicRDDs #1184
- How to get “SO coordinate” instead of “SO unsorted”? #1182
- How to read glob of multiple parquet Genotype #1179
- Update command line doc and examples in README.md #1176
- FastqRecordConverter needs cleanup and tests #1172
- TransformFormats write to .gff3 file path incorrectly writes as parquet #1168
- Should be able to merge shards across two different file systems #1165
- RG ID gets written as the index, not the record group name #1162
- Users should be able to save files as -single without merging them #1161
- Users should be able to set size of buffer used for merging files #1160
- Bump Hadoop-BAM to 7.7.0 #1158
- adam-shell prints command trace to stdout #1154
- Map IntervalList format column four to feature name or attributes? #1152
- Parquet storage of VariantContext #1151
- vcf2adam unparsable vcf record #1149
- Reorder kryo.register statements in ADAMKryoRegistrator #1146
- Make region joins public again #1143
- Support CRAM input/output #1141
- Transform should run with spark.kryo.requireRegistration=true #1136
- adam-shell not handling bash args correctly #1132
- Remove Gene and related models and parsing code #1129
- Generate Scoverage reports when running CI #1124
- Remove PairingRDD #1122
- SAMRecordConverter.convert takes unused arguments #1113
- Add Pipe API #1112
- Improve coverage in Feature unit tests #1106
- K-mer.scala code #1105
- add -single file output option to ADAM2Vcf #1102
- adam2vcf Fails with Sample not serializable #1100
- ReferenceRegion.apply(AlignmentRecord) should not NPE on unmapped reads #1099
- Add outer region join implementations #1098
- VariantContextConverter never returns DatabaseVariantAnnotation #1097
- loadvcf: conflicting require statement #1094
- ADAM version 0.19.0 will not run on Spark version 2.0.0 #1093
- Be more rigorous with FileSystem.get #1087
- Remove network-connected and default test-related Maven profiles #1073
- Releases should get pushed to Spark Packages #1067
- Invalid POM for cli on 0.19.0 #1066
- scala.MatchError RegExp does not catch colons in value part properly #1061
- Support writing IntervalList header for features #1059
- Add -single support when writing features in native formats #1058
- Remove workaround for gzip/BGZF compressed VCF headers #1057
- Clean up if clauses in Transform #1053
- Adam-0.18.2 can not load Adam-0.14.0 adamSave function data (sam) #1050
- filterByOverlappingRegion Incorrect for Genotypes #1042
- Move Interval trait to utils, added in #75 #1041
- Remove implicit GenomicRDD to RDD conversion #1040
- VCF sample metadata – proposal for a GenotypedSampleMetadata object #1039
- [build system] ADAM test builds pollute /tmp, leaving lots of cruft… #1038
- adamMarkDuplicates function in AlignmentRecordRDDFunctions class can not mark the same read? #1037
- test MarkDuplicatesSuite with two similar read in ref and start position and different avgPhredScore, error! #1035
- Explore protocol buffers vs Avro #1031
- Increase Avro dependency version to 1.8.0 #1029
- ADAM specific logging #1024
- Reenable Travis CI for pull request builds #1023
- Bump Apache Spark version to 1.6.1 in Jenkins #1022
- ADAM compatibility with Spark 2.0 #1021
- ADAM to BAM conversion failing on 1000G file #1013
- Factor out *RDDFunctions classes #1011
- Port single file BAM and header code to VCF #1009
- Roll Jenkins JDK 8 changes into ./scripts/jenkins-test #1008
- Support GFF3 format #1007
- Separate fat jar build from adam-cli to new maven module #1006
- adam-cli POM invalid: maven.build.timestamp #1004
- Sub-partitioning of Parquet file for ADAM #1003
- Flattening the Genotype schema #1002
- install adam 0.19 error! #1001
- How to solve it please? #1000
- Has the project realized alignment reads to reference genome algorithm? #996
- All file-based input methods should support running on directories, compressed files, and wildcards #993
- Contig to ContigName Change not reflected in AlignmentRecordField #991
- Add homebrew guidelines to release checklist or automate PR generation #987
- fix deprecation warnings #985
- rename fragments package #984
- Explore if SeqDict data can be factored out more aggressively #983
- Make “Adam” all caps in filename Adam2Fastq.scala #981
- Adam2Fastq should output reverse complement when 0x10 flag is set for read #980
- Allow lowercase letters in jar/version names #974
- Add stringency parameter to flagstat #973
- Arg-array parsing problem in adam-submit #971
- Pass recordGroup parameter to loadPairedFastq #969
- Send a number of partitions to sc.textFile calls #968
- adamGetReferenceString doesn’t reduce pairs correctly #967
- Update ADAM formula in homebrew-science to version 0.19.0 #963
- BAM output in ADAM appears to be corrupt #962
- Remove code workarounds necessary for Spark 1.2.1/Hadoop 1.0.x support #959
- Issue with version 18.0.2 #957
- Expose sorting by reference index #952
- .rgdict and .seqdict files are not placed in the adam directory #945
- Why does count_kmers not return k-mers that are split between two records? #930
- Load legacy file formats to Spark SQL Dataframes #912
- Clean up RDD method names #910
- Load/store sequence dictionaries alongside Genotype RDDs #909
- vcf2adam -print_metrics throws IllegalStateException on Spark 1.5.2 or later #902
- error: no reads in first split: bad BAM file or tiny split size? #896
- FastaConverter.FastaDescriptionLine not kryo-registered #893
- Work With ADAM fasta2adam in a distributed mode #881
- vcf2adam –> Exception in thread “main” java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less; #871
- Code coverage profile is broken #849
- Building Adam on OS X 10.10.5 with Java 1.8 #835
- Normalize AlignmentRecord.recordGroup* fields onto a separate record type #828
- Gracefully handle missing Spark- and Hadoop-versions in jenkins-test; document how to set them. #827
- Use Adam File with Hive #820
- How do we handle reads that don’t have original quality scores when converting to FASTQ with original qualities? #818
- SAMFileHeader “sort order” attribute being un-set during file-save job #800
- Use same sort order as Samtools #796
- RNAME and RNEXT fields jumbled on transform BAM->ADAM->BAM #795
- Support loading multiple indexed read files #787
- Duplicate OUTPUT command line argument metaVar in adam2fastq #776
- Allow Variant to ReferenceRegion conversion #768
- Spark Errors References Deprecated SPARK_CLASSPATH #767
- Spark Errors References Deprecated SPARK_CLASSPATH #766
- adam2vcf fails with -coalesce #735
- Writing to a BAM file with adamSAMSave consistently fails #721
- BQSR on C835.HCC1143_BL.4 uses excessive amount of driver memory #714
- Support writing RDD[Feature] to various file formats #710
- adamParquetSave has a menacing false error message about *.adam extension #681
- BAMHeader not set when running on a cluster #676
- spark 1.3.1 upgarde to hortonworks HDP 2.2.4.2-2? #675
- Symbol case class is nucleotide-centric #672
- xAssembler cannot be build using mvn #658
- adam-submit VerifyError #642
- vcf2adam : Unsupported type ENUM #638
- Update CDH documentation #615
- Remove and generalize plugin code #602
- Fix record oriented shuffle #599
- Migrate preprocessing stages out of ADAM #598
- Publish/socialize a roadmap #591
- Eliminate format detection and extension checks for loading data #587
- Improve error message when we can’t find a ReferenceRegion for a contig #582
- Do reference partitioners restrict a partition to contain keys from a single contig? #573
- Connection refused errors when transforming BAM file with BQSR #516
- ReferenceRegion shouldn’t extend Ordered #511
- Documentation for common usecases #491
- Improve handling of “*” sequences during BQSR #484
- Original qualities are parsed out, but left in attribute fields #483
- Need a FileLocator that mirrors the use of Path in HDFS #477
- FileLocator should support finding “child” locators. #476
- Add S3 based Parquet directory loader #463
- Should FASTQ output use reads’ “original qualities”? #436
- VcfStringUtils unused? #428
- We should be able to filter genotypes that overlap a region #422
- Create a simplified vocabulary for naming projections. #419
- Update documentation #406
- Bake off different region join implementations #395
- Handle no-ops more intelligently when creating MD tags #392
- Remove all the commands in the “CONVERSION OPERATIONS” CommandGroup #373
- Fail to Write RDD into HDFS with Parquet Format #344
- Refactor ReferencePositionWithOrientation #317
- Add docs about SPARK_LOCAL_IP #305
- PartitionAndJoin should throw an exception if it sees an unmapped read #297
- Add insert size calculation #296
- Newbie questions – learning resources? Reading a range of records from Adam? #281
- Add variant effect ontology #261
- Don’t flatten optional SAM tags into a string #240
- Characterize impact of partition size on pileup creation #163
- Need to support BCF output format #153
- Allow list of commands to be injected into adam-cli AdamMain #132
- Parse out common annotations stored in VCF format #118
- Update normalization code to enable normalization of sequences with more than two indels #64
- Add clipping heuristic to indel realigner #63
- BQSR should support recalibration across multiple ADAM files #58
Merged and closed pull requests:
- fix SB tag parsing #1209 (fnothaft)
- Fastq record converter #1208 (fnothaft)
- Doc suggested partitionSize in ShuffleRegionJoin #1207 (jpdna)
- Test demonstrating region join failure #1206 (jpdna)
- fix SB tag parsing #1203 (jpdna)
- fix build #1201 (ryan-williams)
- [ADAM-1192] Correctly handle other whitespace in FASTA description. #1198 (fnothaft)
- [ADAM-1190] Manually (un)pack IndelRealignmentTarget set. #1191 (fnothaft)
- [ADAM-1188] Delete scripts/commit-pr.sh #1189 (fnothaft)
- [ADAM-1186] Mask null from fs.globStatus. #1187 (fnothaft)
- Fastq record converter #1185 (zyxue)
- [ADAM-1182] isSorted=true should write SO:coordinate in SAM/BAM/CRAM header. #1183 (fnothaft)
- Add scoverage aggregator and fail on low coverage. #1181 (fnothaft)
- [ADAM-1179] Improve error message when globbing a parquet file fails. #1180 (fnothaft)
- [ADAM-1176] Update command line doc and examples in README.md #1177 (heuermh)
- Refactor CLIs for merging sharded files #1167 (fnothaft)
- Update Hadoop-BAM to version 7.7.0 #1166 (heuermh)
- [ADAM-1162] Write record group string name. #1163 (fnothaft)
- Map IntervalList format column four to feature name #1159 (heuermh)
- Make AlignmentRecordConverter public so that it can be used from other projects #1157 (tomwhite)
- added predicate option to loadCoverage #1156 (akmorrow13)
- [ADAM-1154] Change set -x to set -e in ./bin/adam-shell. #1155 (fnothaft)
- Remove Gene and related models and parsing code #1153 (heuermh)
- Reorder kryo.register statements in ADAMKryoRegistrator #1148 (heuermh)
- Updated GenomicPartitioners to accept additional key. #1147 (akmorrow13)
- [ADAM-1141] Add support for saving/loading AlignmentRecords to/from CRAM. #1145 (fnothaft)
- misc pom/test/resource improvements #1142 (ryan-williams)
- [ADAM-1136] Transform runs successfully with kryo registration required #1138 (fnothaft)
- [ADAM-1132] Fix improper quoting of bash args in adam-shell. #1133 (fnothaft)
- Remove StructuralVariant and StructuralVariantType, add names field to Variant #1131 (heuermh)
- Remove StructuralVariant and StructuralVariantType, add names field to Variant #1130 (heuermh)
- PR #1108 with issue #1122 #1128 (fnothaft)
- [ADAM-1038] Eliminate writing to /tmp during CI builds. #1127 (fnothaft)
- Update for bdg-formats code style changes #1126 (heuermh)
- [ADAM-1124] Add Scoverage and generate coverage reports in Jenkins. #1125 (fnothaft)
- [ADAM-1093] Move to support Spark 2.0.0. #1123 (fnothaft)
- remove duplicated dependency #1119 (ryan-williams)
- Clean up ADAMContext #1118 (fnothaft)
- [ADAM-993] Support loading files using globs and from directory paths. #1117 (fnothaft)
- [ADAM-1087] Migrate away from FileSystem.get #1116 (fnothaft)
- [ADAM-1099] Make reference region not throw NPE. #1115 (fnothaft)
- Add pipes API #1114 (fnothaft)
- [ADAM-1105] Use assembly jar in adam-shell. #1111 (fnothaft)
- Add outer joins #1109 (fnothaft)
- Modified CalculateDepth to calcuate coverage from alignment files #1108 (akmorrow13)
- Resolves various single file save/header issues #1104 (fnothaft)
- [ADAM-1100] Resolve Sample Not Serializable exception #1101 (fnothaft)
- added loadIndexedVcf and loadIndexedBam for multiple ReferenceRegions #1096 (akmorrow13)
- Added support for Indexed VCF files #1095 (akmorrow13)
- [ADAM-582] Eliminate .get on option in FragmentCoverter. #1091 (fnothaft)
- [ADAM-776] Rename duplicate OUTPUT metaVar in ADAM2Fastq. #1090 (fnothaft)
- refactored ReferenceFile to require SequenceDictionary #1086 (akmorrow13)
- [ADAM-1073] Remove network-connected and default test-related Maven profiles #1082 (heuermh)
- [ADAM-1053] Clean up Transform #1081 (fnothaft)
- [ADAM-1061] Clean up attributes regex and denormalized fields #1080 (fnothaft)
- Extended TwoBitFile and NucleotideContigFragmentRDDFunctions to behave more similar #1079 (akmorrow13)
- Refactor variant and genotype annotations #1078 (heuermh)
- [ADAM-1039] Add basic support for Sample record. #1077 (fnothaft)
- Remove code workarounds necessary for Spark 1.2.1/Hadoop 1.0.x support #1076 (heuermh)
- [ADAM-194] Use separate filtersFailed and filtersPassed arrays for variant quality filters #1075 (heuermh)
- Whitespace code style fixes #1074 (heuermh)
- [ADAM-1006] Split überjar out to adam-assembly submodule. #1072 (fnothaft)
- Remove code coverage profile #1071 (heuermh)
- [ADAM-768] ReferenceRegion from variant/genotypes #1070 (fnothaft)
- [ADAM-1044] Support VCF annotation ANN field #1069 (heuermh)
- [ADAM-1067] Add release documentation and scripting for Spark Packages. #1068 (fnothaft)
- [ADAM-602] Remove plugin code. #1065 (fnothaft)
- Refactoring org.bdgenomics.adam.io package. #1064 (fnothaft)
- Cleanup in org.bdgenomics.adam.converters package. #1062 (fnothaft)
- [ADAM-1057] Remove workaround for gzip/BGZF compressed VCF headers #1057 (heuermh)
- Cleanup on org.bdgenomics.adam.algorithms.smithwaterman package. #1056 (fnothaft)
- Documentation cleanup and minor refactor on the consensus package. #1055 (fnothaft)
- Add KEYS with public code signing keys #1054 (heuermh)
- Adding GA4GH 0.5.1 converter for reads. #1052 (fnothaft)
- [ADAM-1011] Refactor to add GenomicRDDs for all Avro types #1051 (fnothaft)
- removed interval trait and redirected to interval in utils-intervalrdd #1046 (akmorrow13)
- [ADAM-952] Expose sorting by reference index. #1045 (fnothaft)
- overlap query reflects new formats #1043 (erictu)
- Changed loadIndexedBam to use hadoop-bam InputFormat #1036 (fnothaft)
- Increase Avro dependency version to 1.8.0 #1034 (heuermh)
- Improved README fix using feedback from other approach review. #1034 (InvisibleTech)
- Error in the README.md for kmer.scala example, need to get rdd first. #1032 (InvisibleTech)
- Add fragmentEndPosition to NucleotideContigFragment #1030 (heuermh)
- Logging to be done by ADAM utils code rather than Spark #1028 (jpdna)
- add maxScore #1027 (xubo245)
- [ADAM-1008] Modify jenkins-test script to support Java 8 build. #1026 (fnothaft)
- whitespace change, do not merge #1025 (shaneknapp)
- require kryo registration in tests #1020 (ryan-williams)
- print full stack traces on test failures #1019 (ryan-williams)
- bump commons-io version #1017 (ryan-williams)
- exclude javadoc jar in adam-shell #1016 (ryan-williams)
- [ADAM-909] Refactoring variation RDDs. #1015 (fnothaft)
- Modified CalculateDepth to get coverage on whole alignment adam files #1010 (akmorrow13)
- [ADAM-1004] Remove recursive maven.build.timestamp declaration #1005 (heuermh)
- Maint 2.11 0.19.0 #999 (tushu1232)
- [ADAM-710] Add saveAs methods for feature formats GTF, BED, IntervalList, and NarrowPeak #998 (heuermh)
- Moving Adam2Fastq to ADAM2Fastq #995 (heuermh)
- Update release doc for CHANGES.md and homebrew #994 (heuermh)
- Update to AlignmentRecordField and its usages as contig changed to co… #992 (jpdna)
- [ADAM-974] Short term fix for multiple ADAM cli assembly jars check #990 (heuermh)
- Update hadoop-bam dependency version to 7.5.0 #989 (heuermh)
- Replaced Contig with ContigName in AlignmentRecord and related changes #988 (jpdna)
- fix some deprecation/style things and rename a pkg #986 (ryan-williams)
- Fix Adam2fastq in case of read with both reverse and unmapped flags #982 (jpdna)
- [ADAM-510] Refactoring RDD function names #979 (heuermh)
- Use .adam/_{seq,rg}dict.avro paths for Avro-formatted dictionaries #978 (heuermh)
- Remove unused file VcfHeaderUtils.scala #977 (heuermh)
- add validation stringency to bam parsing, flagstat #976 (ryan-williams)
- more permissible jar regex in adam-submit #975 (ryan-williams)
- fix bash arg array processing in adam-submit #972 (ryan-williams)
- adamGetReferenceString reduces pairs correctly, fixes #967 #970 (erictu)
- A few improvements #966 (ryan-williams)
- improve SW performance by replacing functional reductions with imperative ones #965 (noamBarkai)
- [ADAM-962] Fix corrupt single-file BAM output. #964 (fnothaft)
- [ADAM-960] Updating bdg-utils dependency version to 0.2.4 #961 (heuermh)
- [ADAM-946] Fixes to FlagStat for Samtools concordance issue #954 (jpdna)
- Use hadoop-bam BAMInputFormat to do loadIndexedBam #953 (andrewmchen)
- Add -print_metrics option to Jenkins build #947 (heuermh)
- adam2vcf doesn’t have info fields #939 (andrewmchen)
- [ADAM-893] Register missing serializers. #933 (fnothaft)