Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INFO/FORMAT fields with number of A, R, or G are not interpretable without defined ordering of alleles #1170

Open
bpow opened this issue Jul 31, 2018 · 3 comments

Comments

@bpow
Copy link
Contributor

bpow commented Jul 31, 2018

Verify

There are not relevant logs. This is an issue related to documentation and/or API design.

Subject of the issue

The API docs for accessing alleles within a VariantContext (VariantContext::getAlleles) states: "There are no constraints imposed on the ordering of alleles in the set." However, the ordering of alleles is important in that it determines how INFO or FORMAT fields with "Number" of types A, R or G are interpreted when read from a VCF file. For instance, if Number=A, then the field contains an array of values associated with each of the alt alleles, in the order they are given in the VCF file. Similarly, for Number=R, the values will be associated with the ref allele, then each of the alt alleles in the order in which they were presented in the VCF file.

It seems like the VariantContext should preserve the order of alleles when read, or else there would have to be accessors on this field by which the API would allow specific values in the array of Number A, R, or G to be accessed using the allele itself. I think the order is actually preserved when reading from VCF files (from some simple tests, but I have not dug into the code), but this is not a guarantee provided in the javadoc (which sort of serves as the spec).

Furthermore (and perhaps a second issue), the AD and PL fields of a Genotype are not, in the general case, interpretable without access to the list of alleles from the containing VariantContext. The Genotype alleles only contain those that are present in the Genotype, but the AD and PL fields can refer to other alleles when sites are multi-alleleic. For instance, at a site that is ref=A, alt=T,C in a sample with genotype 1/2 (heterozygous T/C), a call to Genotype::getAlleles only returns the alleles representing 'T' and 'C', but the DP field will have 3 values, and the PL field 6... to know which DP value is associated with which allele, one needs to back-track to the VariantContext and hope that the result of VariantContext::getAlleles actually provides the alleles in the order in which they are used in the DP field. Unless I am missing something...

Your environment

  • version of htsjdk: 2.16.0
  • version of java: 8 (but effectively any where htsjdk is usable...)
  • which OS: any

Expected behaviour

I think that VariantContext::getAlleles should maintain the order of alleles, such that it matches the order of INFO or FORMAT fields with Number=A, R or G (e.g., keep the order that was present when reading from a VCF file).

Actual behaviour

Empirically, I think this is the actual behavior, but the javadoc states that this order is not guaranteed.

@lindenb
Copy link
Contributor

lindenb commented Jul 31, 2018

the order is set at runtime when the VCF is written at https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFEncoder.java#L314

@yfarjoun
Copy link
Contributor

I think that the comment indicates that one shouldn't expect that the reference will be first, or that the alleles are sorted alphabetically, etc. however, this doesn't mean that the order is random or arbitrary. The order that the alleles show up in the VC is the order that one should reference when interpreting the PL field in the genotype.

If you can write a better javaDoc comment, I'd be happy to review!

@magicDGS
Copy link
Member

magicDGS commented Aug 2, 2018

I think that this is related with #424

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants