Can HTSJDK use a VCF index to quickly count total records in a VCF? #1586

Open
bbimber opened this issue Dec 10, 2021 · 9 comments

Comments

@bbimber
Contributor

bbimber commented Dec 10, 2021

Hello,

When working with a large VCF, iterating over all features to determine the total variant count is slow. Can HTSJDK use a VCF index to quickly count the total records in a VCF?

Thanks

@cmnbroad
Collaborator

Someone else may have a more definitive answer, but I think the linear index part of a Tribble index (.idx) has that information, per-chromosome. I don't think tabix does.

@lindenb
Contributor

lindenb commented Dec 14, 2021

@cmnbroad Well, it should be possible, since you can get this information with bcftools index -s in.vcf.gz

@bbimber
Contributor Author

bbimber commented Dec 14, 2021

Exactly. I also didn't know this was possible, but bcftools apparently can do it. It would be very useful to be able to get the variant count like this for big files.

@yfarjoun
Contributor

yfarjoun commented Dec 15, 2021 via email

@lindenb
Contributor

lindenb commented Dec 15, 2021

@yfarjoun with a recent version of bcftools, I'm able to extract the number of variants/chrom with a tbi index and bcftools index -s.

@yfarjoun
Contributor

yfarjoun commented Dec 15, 2021 via email

@lindenb
Contributor

lindenb commented Dec 15, 2021

@yfarjoun bcftools (but I think both tools now use the same C code for tbi).

@lindenb
Contributor

lindenb commented Dec 15, 2021

@yfarjoun the C code collecting metadata is here : https://github.com/samtools/htslib/blob/1d79f449cb3b02eda8fc151556395b7b50ccd730/hts.c#L2857

Indexes (both .tbi and .csi) made by tabix include extra data about the indexed file. The function returns a pointer to this data. Note that the data is stored exactly as it is in the index. Callers need to interpret the results themselves, including knowing what sort of data to expect, byte swapping, etc.
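
To make the idea concrete, here is a rough Java sketch of how per-contig record counts could be read out of a .tbi index without touching the VCF itself. It is not HTSJDK code: `TabixRecordCounter` and `countPerReference` are hypothetical names. It rests on two assumptions: that the input buffer holds the already-decompressed index bytes (the .tbi file itself is BGZF-compressed), and that the index was written by htslib, which stores a pseudo-bin numbered 37450 per sequence whose second chunk carries the mapped/unmapped record counts (the same convention the SAM specification describes for BAI). Indexes written by other tools may not contain this bin, in which case the sequence is simply left out of the result.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of HTSJDK): reads record counts from a .tbi image. */
public class TabixRecordCounter {
    // Assumed pseudo-bin number for a .tbi index (min_shift = 14, depth = 5).
    static final int META_BIN = 37450;

    /**
     * Parses an already-decompressed .tbi index and returns, for each sequence
     * that has a pseudo-bin, the record count stored there.
     */
    public static Map<String, Long> countPerReference(ByteBuffer input) {
        ByteBuffer buf = input.duplicate().order(ByteOrder.LITTLE_ENDIAN);

        byte[] magic = new byte[4];
        buf.get(magic);
        if (magic[0] != 'T' || magic[1] != 'B' || magic[2] != 'I' || magic[3] != 1) {
            throw new IllegalArgumentException("not a tabix index");
        }

        int nRef = buf.getInt();
        // Skip format, col_seq, col_beg, col_end, meta, skip (six int32 fields).
        buf.position(buf.position() + 6 * 4);

        // Sequence names: l_nm bytes of NUL-terminated strings.
        byte[] nameBytes = new byte[buf.getInt()];
        buf.get(nameBytes);
        String[] names = new String(nameBytes, StandardCharsets.UTF_8).split("\0");

        Map<String, Long> counts = new LinkedHashMap<>();
        for (int ref = 0; ref < nRef; ref++) {
            int nBin = buf.getInt();
            for (int b = 0; b < nBin; b++) {
                int bin = buf.getInt();
                int nChunk = buf.getInt();
                if (bin == META_BIN && nChunk == 2) {
                    buf.position(buf.position() + 16); // chunk 0: file offsets
                    long nMapped = buf.getLong();      // chunk 1, first field
                    buf.getLong();                     // chunk 1, second field (n_unmapped)
                    counts.put(names[ref], nMapped);
                } else {
                    buf.position(buf.position() + 16 * nChunk); // ordinary bin
                }
            }
            // Skip the linear index: n_intv offsets of 8 bytes each.
            int nIntv = buf.getInt();
            buf.position(buf.position() + 8 * nIntv);
        }
        return counts;
    }
}
```

Summing the map's values would give the total record count. To feed it a real file, the index would first need to be decompressed, for example with HTSJDK's `BlockCompressedInputStream`; a .csi index has a different header layout and would need its own parser.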

@bbimber
Contributor Author

bbimber commented Dec 15, 2021

All of our indexes are made by tabix and have this info, which makes sense if bcftools and tabix share the same code.
