
Avoid duplicates with alias BACKEND #3685

Open
wants to merge 3 commits into master

Conversation

@LaVibeX (Contributor) commented May 8, 2024

Description

This PR introduces new functionality that addresses the issue of duplicate vulnerabilities by comparing the priority of sources and the aliases attached to a component.

The implementation required adding new database rows to support the changes.

To ensure the correctness of the implementation, tests have been added to validate the behavior of the updated functionality.

In addition, a new API endpoint has been added to retrieve the currently enabled sources.

It is important to note that this update specifically affects the addVulnerability function and never deletes a vulnerability.

Frontend changes: DependencyTrack/frontend#838
I'm open to discussing any changes or improvements👍🏽

Examples:

  1. Alias Deduplication Disabled

  2. Vulnerability source with highest priority: NVD

  3. Vulnerability source with highest priority: GITHUB

Flow Charts for better understanding:

  • Case 1: (Vulnerability source does not have a higher priority than the alias):


  • Case 2: (The vulnerability source has a higher priority):


  • Case 3: (The vulnerability source has a higher priority, but the alias is already in the component; see the sketch after these cases):

  • This case can occur only if a higher-priority source vulnerability does not exist at that time. A lower-priority source vulnerability will be added. Later, upon alias mapping and reanalysis, the higher-priority source vulnerability will not be added because the alias is already present.

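To summarize the three cases in code, below is a minimal sketch of the decision logic as described above. The record type, the sourceOf helper, and the method name are illustrative stand-ins, not the actual classes or methods in this PR.

import java.util.List;

// Illustrative sketch only -- simplified stand-ins for the real model classes.
record Vuln(String source, String vulnId, List<String> aliasIds) {}

final class AliasDedupSketch {

    // Very rough mapping from a vulnerability ID to its source, for this sketch only.
    static String sourceOf(String vulnId) {
        if (vulnId.startsWith("CVE-")) return "NVD";
        if (vulnId.startsWith("GHSA-")) return "GITHUB";
        return "OSV";
    }

    // priorityList is ordered from highest to lowest priority, e.g. ["NVD", "GITHUB", "OSV"].
    static boolean shouldAttach(Vuln incoming, List<Vuln> attached, List<String> priorityList) {
        // Case 3: the same flaw, under any of its IDs, is already attached to the component.
        for (Vuln existing : attached) {
            boolean sameFlaw = existing.vulnId().equals(incoming.vulnId())
                    || existing.aliasIds().contains(incoming.vulnId())
                    || incoming.aliasIds().contains(existing.vulnId());
            if (sameFlaw) {
                return false;
            }
        }
        // Case 1: one of the incoming record's aliases belongs to a higher-priority source,
        // so that record is preferred and this one is skipped.
        int incomingRank = priorityList.indexOf(incoming.source());
        for (String aliasId : incoming.aliasIds()) {
            int aliasRank = priorityList.indexOf(sourceOf(aliasId));
            if (aliasRank >= 0 && aliasRank < incomingRank) {
                return false;
            }
        }
        // Case 2: the incoming source outranks all of its known aliases, so attach it.
        return true;
    }
}

The sketch is only meant to make the flow charts easier to read; the actual logic lives in the addVulnerability path.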

Addressed Issue

This PR fixes #1994 and #2181

Additional Details

Add a new file named ConfigPropertyQueryManager.java to manage functions related to the EnabledSources
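Roughly, the class exposes the following operations. The method names are the ones used in this PR; the bodies and exact semantics below are placeholder illustrations, not the actual implementation.

import java.util.List;

// Outline only: method names as used in this PR, bodies replaced by placeholders.
public class ConfigPropertyQueryManagerOutline {

    // Whether alias-based de-duplication is enabled via configuration.
    public boolean isDedupEnabled() {
        return true; // placeholder
    }

    // The enabled vulnerability sources, ordered by configured priority (highest first).
    public List<String> parsePriorityList() {
        return List.of("NVD", "GITHUB", "OSV"); // placeholder
    }

    // Re-syncs the stored priority list when sources are enabled or disabled,
    // so that disabled sources no longer take part in de-duplication.
    public void updatePropertiesFromEnabledSources() {
        // placeholder
    }
}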

Checklist

  • I have read and understand the contributing guidelines
  • This PR fixes a defect, and I have provided tests to verify that the fix is effective
  • This PR implements an enhancement, and I have provided tests to verify that it works as intended
  • This PR introduces changes to the database model, and I have added corresponding update logic
  • This PR introduces new or alters existing behavior, and I have updated the documentation accordingly

@valentijnscholten (Contributor) commented:

Not sure if I fully understand the PR. Does it only add one vulnerability, the one from the source with the highest priority?

It feels to me that this would be a "bolt on" solution to a database model that should be changed. Wouldn't it be better to have a data model that has one vulnerability that can have multiple aliases (possibly from different sources)? Currently, with multiple aliases/sources there are (can be / most often are) multiple vulnerability rows. This makes a lot of things harder and more complicated, for example determining the number of affected projects for a vulnerability, or something "as simple as" sorting the list of vulnerabilities by number of affected projects.
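Roughly, the shape I have in mind looks like this (just a sketch of the idea, not a concrete schema proposal):

import java.util.List;

// Sketch of the shape only: one logical vulnerability row, with the per-source
// identifiers attached to it as aliases, instead of one row per source.
record SourceAlias(String source, String sourceVulnId) {}

record LogicalVulnerability(
        long id,                      // single internal row per flaw
        String preferredVulnId,       // e.g. the CVE ID once one has been assigned
        List<SourceAlias> aliases) {} // GHSA / OSV / VulnDB IDs for the same flaw

Counting affected projects, or sorting by that count, would then operate on the single logical row instead of on several per-source rows.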

@LaVibeX (Contributor, Author) commented May 10, 2024

> Not sure if I fully understand the PR. Does it only add one vulnerability, the one from the source with the highest priority?
>
> It feels to me that this would be a "bolt on" solution to a database model that should be changed. Wouldn't it be better to have a data model that has one vulnerability that can have multiple aliases (possibly from different sources)? Currently, with multiple aliases/sources there are (can be / most often are) multiple vulnerability rows. This makes a lot of things harder and more complicated, for example determining the number of affected projects for a vulnerability, or something "as simple as" sorting the list of vulnerabilities by number of affected projects.

PR description updated: I hope the example images explain the PR better, @valentijnscholten.

… I removed it for now; issues with isEmpty, better to use != null

Signed-off-by: Andres Tito <[email protected]>
codacy-production bot commented May 15, 2024

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: +0.10% (target: -1.00%)
Diff coverage:      93.46% (target: 70.00%)

Coverage variation details:

                                   Coverable lines   Covered lines   Coverage
Common ancestor commit (db58e69)   21630             16377           75.71%
Head commit (846d2dd)              21736 (+106)      16478 (+101)    75.81% (+0.10%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details:

                        Coverable lines   Covered lines   Diff coverage
Pull request (#3685)    107               100             93.46%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


Codacy will stop sending the deprecated coverage status from June 5th, 2024.

@pkunze commented May 17, 2024

I think this might also solve #2181, which would be greatly appreciated :) 🎉

@nscuro (Member) left a comment:

Thanks for the PR @LaVibeX!

I added a few comments. I wasn't able to test this since it doesn't build for me locally, which is unfortunately caused by a recent refactoring in Alpine and DT (#3730).

Also I have to say that I do agree with @valentijnscholten in that we really should change the underlying data model to make aliases more useful in general. The problem with the priority approach is that you get inconsistent outcomes across projects, depending on which vulnerability came first, and / or the order in which they were processed, which can become quite confusing.

On a related note, during the last community meeting we briefly mentioned that we're considering building a pre-compiled vulnerability database: https://youtu.be/9harG5GcV_E?t=2799. One of the things that it would make easier is the correlation of aliases across vulnerability sources. Perhaps give that a watch and let us know your thoughts?

It would go along quite well with a change in the data model as proposed by @valentijnscholten:

Wouldn't it be better to have data model that has one vulnerability that can have multiple aliases (from different sources possibly)?

I am happy to provide a feature branch so people can test the approach in this PR out, if you're interested to continue working on it.

Comment on lines +230 to +233
setVulnerabilityAliasesIfNull(vulnerability);
boolean vulnerabilityExists = checkVulnerabilityExists(vulnerability, component);

if (!vulnerabilityExists){
@nscuro (Member):

Will doing this here not cause it all to depend on the order in which vulnerabilities are processed?

For example, if my priority list has:

  1. CVE
  2. GHSA

And the GHSA is processed before the corresponding CVE, de-duplication won't happen until the next time the component is analyzed.

Ideally, given the same set of vulnerabilities being reported across all sources, we should get the same consistent outcome, regardless of the order in which they were processed.

IMO, if we end up doing this sort of de-duplication, we should do it after we have the results from all scanners, so possibly somewhere here:

private void analyzeComponents(final QueryManager qm, final List<Component> components, final Event event) {
    /*
      When this task is processing events that specify the components to scan,
      separate them out into 'candidates' so that we can fire off multiple events
      in hopes of perform parallel analysis using different analyzers.
     */
    final InternalAnalysisTask internalAnalysisTask = new InternalAnalysisTask();
    final OssIndexAnalysisTask ossIndexAnalysisTask = new OssIndexAnalysisTask();
    final VulnDbAnalysisTask vulnDbAnalysisTask = new VulnDbAnalysisTask();
    final SnykAnalysisTask snykAnalysisTask = new SnykAnalysisTask();
    final TrivyAnalysisTask trivyAnalysisTask = new TrivyAnalysisTask();
    final List<Component> internalCandidates = new ArrayList<>();
    final List<Component> ossIndexCandidates = new ArrayList<>();
    final List<Component> vulnDbCandidates = new ArrayList<>();
    final List<Component> snykCandidates = new ArrayList<>();
    final List<Component> trivyCandidates = new ArrayList<>();
    for (final Component component : components) {
        inspectComponentReadiness(component, internalAnalysisTask, internalCandidates);
        inspectComponentReadiness(component, ossIndexAnalysisTask, ossIndexCandidates);
        inspectComponentReadiness(component, vulnDbAnalysisTask, vulnDbCandidates);
        inspectComponentReadiness(component, snykAnalysisTask, snykCandidates);
        inspectComponentReadiness(component, trivyAnalysisTask, trivyCandidates);
    }
    qm.detach(components);
    // Do not call individual async events when processing a known list of components.
    // Call each analyzer task sequentially and catch any exceptions as to prevent one analyzer
    // from interrupting the successful execution of all analyzers.
    performAnalysis(internalAnalysisTask, new InternalAnalysisEvent(internalCandidates), internalAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(ossIndexAnalysisTask, new OssIndexAnalysisEvent(ossIndexCandidates), ossIndexAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(snykAnalysisTask, new SnykAnalysisEvent(snykCandidates), snykAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(trivyAnalysisTask, new TrivyAnalysisEvent(trivyCandidates), trivyAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(vulnDbAnalysisTask, new VulnDbAnalysisEvent(vulnDbCandidates), vulnDbAnalysisTask.getAnalyzerIdentity(), event);
}

The problem is that currently, vulnerabilities are "persisted", and notifications are sent, as soon as they are found. De-duplication is supposed to reduce the noise, so we'd need to refactor the scanning such that these things only happen at the very end, when all scanners completed their work.
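To sketch the idea (toy types, not DT's actual model or task flow): if all scanner results are collected first and then de-duplicated once by alias and source priority, the outcome no longer depends on the order in which the scanners ran.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of "collect everything, then de-duplicate, then persist/notify".
final class PostScanDedupSketch {

    record Finding(String source, String vulnId, Set<String> aliasIds) {}

    static int rank(String source, List<String> priorityList) {
        int i = priorityList.indexOf(source);
        return i < 0 ? Integer.MAX_VALUE : i; // unknown sources get the lowest priority
    }

    static List<Finding> deduplicate(List<Finding> all, List<String> priorityList) {
        // Sort so higher-priority sources come first, then keep the first record seen per flaw.
        List<Finding> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingInt((Finding f) -> rank(f.source(), priorityList)));

        List<Finding> kept = new ArrayList<>();
        Set<String> seenIds = new HashSet<>();
        for (Finding f : sorted) {
            boolean alreadyCovered = seenIds.contains(f.vulnId())
                    || f.aliasIds().stream().anyMatch(seenIds::contains);
            if (!alreadyCovered) {
                kept.add(f);
                seenIds.add(f.vulnId());
                seenIds.addAll(f.aliasIds());
            }
        }
        return kept; // persist and send notifications only for these, after all scanners completed
    }
}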

@LaVibeX (Contributor, Author):

The functionality first checks the priority order, then examines aliases in the context of the vulnerability. Aliases are loaded at the time the vulnerability is created, separate from the vulnerability's processing order.

Suppose VULNDB has an alias [NVD], and NVD has a higher priority. If, for some reason, VULNDB is processed before NVD, the system will check whether the vulnerability contains any NVD alias. If it does, the system will not add that vulnerability, to avoid duplication. Subsequently, when NVD is processed, it will be added according to the priority configuration.


I will add more cases to the PR description.

Contributor:

I've read and looked at it three times, and I am not sure I understand it. It seems to rely on the list of aliases being complete and reliable 100% of the time, regardless of which source published the vulnerability. I am not sure that when NVD publishes a vulnerability, the aliases field in, for example, OSV will contain the correct GitHub alias straight away. You would need a lot of test cases to make sure it behaves as expected. You don't want to create false negatives, and you want consistent behaviour: if you have 10 projects all using the exact same component, you want all 10 projects to have the same vulnerabilities from the same source/analyzer attached to that component.

@LaVibeX (Contributor, Author):

I have observed that vulnerabilities often appear in other sources before NVD assigns a CVE ID. It is essential to select sources like GitHub or VulnDB, which provide the most up-to-date vulnerabilities without waiting for a CVE ID. When NVD releases a CVE ID and DT updates any changes in GitHub or VulnDB, these sources will map the CVE ID to their corresponding VulnID, adding a new alias to the table without creating duplicate "new vulnerabilities" in the component audit vulnerabilities view.
The test cases I provided are reliable and cover all possible outcomes. However, I am open to implementing a consistent check and reporting back on its behavior.

Also, I would like to know whether we agree on this approach, or whether you still believe that an internal ID would be better to tackle this problem?
#1994 (comment)

Comment on lines +224 to +229
try {
    priorityList = configPropertyQueryManager.parsePriorityList();
} catch (Exception ex) {
    LOGGER.warn("An unexpected error occurred while retrieving the preference list for alias duplicates", ex);
}
@nscuro (Member):

This code will be executed by multiple threads in parallel, hence it's not a good idea to work with class-level fields like this. Also we will want to avoid loading it for every single vulnerability over and over again, so some sort of caching would be necessary.
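For illustration, something along these lines would avoid both the shared mutable field and the repeated lookups (just a sketch; Alpine/DT may already offer a better-suited caching utility):

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

// Sketch of a small thread-safe, time-bounded cache for the priority list.
final class CachedPriorityList {

    private final Supplier<List<String>> loader; // e.g. a method reference to parsePriorityList()
    private final Duration ttl;

    private volatile List<String> cached;
    private volatile Instant loadedAt = Instant.MIN;

    CachedPriorityList(Supplier<List<String>> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    List<String> get() {
        List<String> snapshot = cached;
        if (snapshot == null || loadedAt.plus(ttl).isBefore(Instant.now())) {
            synchronized (this) {
                if (cached == null || loadedAt.plus(ttl).isBefore(Instant.now())) {
                    cached = List.copyOf(loader.get()); // immutable snapshot, safe to share across threads
                    loadedAt = Instant.now();
                }
                snapshot = cached;
            }
        }
        return snapshot;
    }
}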

@LaVibeX (Contributor, Author):

I will take a look at this.

Comment on lines +351 to +355
// Update PriorityList with new Enabled/Disabled Sources to avoid conflicts
ConfigPropertyQueryManager configPropertyQueryManager = new ConfigPropertyQueryManager();
if (configPropertyQueryManager.isDedupEnabled()) {
    configPropertyQueryManager.updatePropertiesFromEnabledSources();
}
@nscuro (Member):

Seems out-of-place here, perhaps used for testing?

@LaVibeX (Contributor, Author):

I am seeking a method to update and retrieve the enabled sources before creating a new component. It would be ideal to keep this check in the component creation process to ensure that the most up-to-date data is used and to maintain the correct priority logic.

Should I keep it there or should I move it somewhere else?

@LaVibeX (Contributor, Author) commented Jul 3, 2024

Hi @valentijnscholten @nscuro

Thank you for your comments on the PR. You're right that a more ideal solution would be to modify the data model to have one vulnerability with multiple aliases from different sources. This change would indeed make it easier to determine the number of affected projects for a vulnerability and sort the list of vulnerabilities by the number of affected projects.

However, I would like to present the current PR as a reasonable approach to solving the issue of duplicate vulnerabilities. The implementation compares the priority of sources and the aliases attached to a component, which helps in reducing duplicates. I understand your concerns about inconsistent outcomes across projects, but this solution is a step towards addressing the issue while we wait for the new database that can accommodate the required changes in the data model.

I understand that the current issue of duplicate vulnerabilities is persistent, and the resulting noise makes it difficult for most of our teams to activate other sources. This is unfortunate, because support for multiple sources is one of the best aspects of Dependency-Track.

In the meantime, I hope you find the current solution helpful. I am open to any feedback or suggestions to improve it further. I will address and answer your reviews and questions @nscuro.
Thank you for your understanding and support.

Best,
Andrés.

@ellipse2v commented:
Hi, thanks @LaVibeX. Very interesting feature.

With your feature we can try to distinguish between publisher advisories that arrive through OSV and CVEs, which are the unique IDs of vulnerabilities.

E.g. the Debian advisory [SECURITY] [DSA 5759-1] python3.11 security update (debian.org) (DSA-5759-1 in OSV), which covers 3 vulnerabilities (NVD - CVE-2024-8088 (nist.gov), NVD - CVE-2024-4032 (nist.gov), NVD - CVE-2024-8088 (nist.gov)).

Successfully merging this pull request may close these issues.

Offer alias-based de-duplication of vulnerabilities
5 participants