Evidence string generation for 2025.03 release #450

Open
apriltuesday opened this issue Jan 20, 2025 · 2 comments

@apriltuesday

Deadline for submission: 31 January

Refer to the documentation for a full description of the steps.

@apriltuesday

Counts:

Total number of evidence strings generated      3681863
Total number of complete evidence strings generated     3354238

Total number of ClinVar records 4143238
    Fatal: Cannot produce evidence      1735305
        No traits with valid names      1733629
        No clinical significance        1248
        Excluded submissions    428
    Skipped: Can be rescued by future improvements      342641
        Unsupported variation type      3381
        No functional consequences      30126
        Missing EFO mapping     308586
        Invalid evidence string 520
        Multiple clinical classifications       28
    Done: Generated at least one complete evidence string       2065292
        One complete evidence string    1205960
        Multiple complete evidence strings      859332
Percentage of all potentially supportable ClinVar records which generated at least one complete evidence string 85.8%

Total number of trait-to-ontology mappings in the database      14376
    The number of distinct trait-to-ontology mappings used in the evidence strings      8969
The number of distinct unmapped trait names which prevented complete evidence string generation 9786

Total number of variant to consequence mappings 3164716
    Number of repeat expansion variants 1491
    Number of structural variants       1357
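As a sanity check on the summary, the short Python sketch below re-adds the category counts and recomputes the headline percentage. The numbers are copied from the report above. The formula for "potentially supportable" records (total records minus the "Fatal" category) is an assumption on my part; it is not stated in the report, but it does reproduce the reported figure.

```python
# Counts copied from the summary above.
total_records = 4143238

fatal = {
    "No traits with valid names": 1733629,
    "No clinical significance": 1248,
    "Excluded submissions": 428,
}
skipped = {
    "Unsupported variation type": 3381,
    "No functional consequences": 30126,
    "Missing EFO mapping": 308586,
    "Invalid evidence string": 520,
    "Multiple clinical classifications": 28,
}
done = {
    "One complete evidence string": 1205960,
    "Multiple complete evidence strings": 859332,
}

fatal_total = sum(fatal.values())
skipped_total = sum(skipped.values())
done_total = sum(done.values())

# The three top-level categories should partition all ClinVar records.
categories_sum = fatal_total + skipped_total + done_total

# Assumption: "potentially supportable" means every record that is not fatal.
supportable = total_records - fatal_total
pct_done = 100 * done_total / supportable

print(fatal_total, skipped_total, done_total)
print(categories_sum == total_records)
print(f"{pct_done:.1f}%")
```

If any subtotal or the percentage disagrees with the report, that points to a copy error or to a different definition of the denominator than the one assumed here.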

Check that all invalid evidence is due to new clinical significance terms:

$ grep -Pc "Error message: .*? is not one of \['affects'" evidence_string_generation_*.err
evidence_string_generation_0.err:1
evidence_string_generation_1242972.err:0
evidence_string_generation_1657296.err:0
evidence_string_generation_2071620.err:0
evidence_string_generation_2485944.err:0
evidence_string_generation_2900268.err:5
evidence_string_generation_3314592.err:0
evidence_string_generation_3728916.err:514
evidence_string_generation_414324.err:12
evidence_string_generation_828648.err:0

The matches sum to 532, which is more than the 520 invalid evidence strings reported above. I assume this means the other 12 were excluded for other reasons.
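The per-file counts can be totalled without adding them by hand. This is a minimal sketch that parses the `grep -Pc` output pasted from above; the helper name `sum_grep_counts` is hypothetical, and the value 520 is the "Invalid evidence string" count from the summary.

```python
# Output of the grep -Pc command above, pasted verbatim.
grep_output = """\
evidence_string_generation_0.err:1
evidence_string_generation_1242972.err:0
evidence_string_generation_1657296.err:0
evidence_string_generation_2071620.err:0
evidence_string_generation_2485944.err:0
evidence_string_generation_2900268.err:5
evidence_string_generation_3314592.err:0
evidence_string_generation_3728916.err:514
evidence_string_generation_414324.err:12
evidence_string_generation_828648.err:0
"""


def sum_grep_counts(text: str) -> int:
    """Sum the trailing counts of `filename:count` lines (hypothetical helper)."""
    total = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last colon so that colons in a file name would not matter.
        _, _, count = line.rpartition(":")
        total += int(count)
    return total


matched = sum_grep_counts(grep_output)
reported_invalid = 520  # "Invalid evidence string" in the summary

print(matched)                     # total matching error lines
print(matched - reported_invalid)  # lines not reflected in the invalid count
```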

Check no other unexpected errors:

$ grep 'ERROR' evidence_string_generation_*.err | grep -Pv "Error message: .*? is not one of \['affects'|evidence string does not validate against schema|Complete evidence string:"
evidence_string_generation_414324.err:ERROR:root:Found multiple descriptions for one ClinicalClassification in RCV003883131

This one is counted under "Multiple clinical classifications" above, so it is expected.

@apriltuesday

Based on the counts and a little manual searching on ClinVar, I now think that the gene-related disorder records that were previously filtered out are either supported by a new submission or have had their submission ID updated for some reason.
