-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INFO/FORMAT fields with number of A, R, or G are not interpretable without defined ordering of alleles #1170
Comments
the order is set at runtime when the VCF is written at https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFEncoder.java#L314 |
I think that the comment indicates that one shouldn't expect that the reference will be first, or that the alleles are sorted alphabetically, etc. however, this doesn't mean that the order is random or arbitrary. The order that the alleles show up in the VC is the order that one should reference when interpreting the PL field in the genotype. If you can write a better javaDoc comment, I'd be happy to review! |
I think that this is related with #424 |
Verify
There are not relevant logs. This is an issue related to documentation and/or API design.
Subject of the issue
The API docs for accessing alleles within a VariantContext (
VariantContext::getAlleles
) states: "There are no constraints imposed on the ordering of alleles in the set." However, the ordering of alleles is important in that it determines how INFO or FORMAT fields with "Number" of types A, R or G are interpreted when read from a VCF file. For instance, if Number=A, then the field contains an array of values associated with each of the alt alleles, in the order they are given in the VCF file. Similarly, for Number=R, the values will be associated with the ref allele, then each of the alt alleles in the order in which they were presented in the VCF file.It seems like the VariantContext should preserve the order of alleles when read, or else there would have to be accessors on this field by which the API would allow specific values in the array of Number A, R, or G to be accessed using the allele itself. I think the order is actually preserved when reading from VCF files (from some simple tests, but I have not dug into the code), but this is not a guarantee provided in the javadoc (which sort of serves as the spec).
Furthermore (and perhaps a second issue), the AD and PL fields of a
Genotype
are not, in the general case, interpretable without access to the list of alleles from the containing VariantContext. TheGenotype
alleles only contain those that are present in theGenotype
, but the AD and PL fields can refer to other alleles when sites are multi-alleleic. For instance, at a site that is ref=A, alt=T,C in a sample with genotype 1/2 (heterozygous T/C), a call toGenotype::getAlleles
only returns the alleles representing 'T' and 'C', but the DP field will have 3 values, and the PL field 6... to know which DP value is associated with which allele, one needs to back-track to the VariantContext and hope that the result ofVariantContext::getAlleles
actually provides the alleles in the order in which they are used in the DP field. Unless I am missing something...Your environment
Expected behaviour
I think that
VariantContext::getAlleles
should maintain the order of alleles, such that it matches the order of INFO or FORMAT fields with Number=A, R or G (e.g., keep the order that was present when reading from a VCF file).Actual behaviour
Empirically, I think this is the actual behavior, but the javadoc states that this order is not guaranteed.
The text was updated successfully, but these errors were encountered: