
Avoid duplicates with alias BACKEND #3685

Open
wants to merge 3 commits into master

Conversation

@LaVibeX (Contributor) commented May 8, 2024

Description

This PR introduces new functionality that addresses the issue of duplicate vulnerabilities by comparing the priority of sources and the aliases attached to a component.

The implementation required adding new database rows to support the changes.

To ensure the correctness of the implementation, tests have been added to validate the behavior of the updated functionality.

In addition, a new API endpoint has been added to retrieve the currently enabled sources.

It is important to note that this update specifically affects the addVulnerability function and never deletes a vulnerability.

Frontend changes: DependencyTrack/frontend#838
I'm open to discussing any changes or improvements👍🏽

Examples:

  1. Alias Deduplication Disabled

  2. Vulnerability source with highest priority: NVD

  3. Vulnerability source with highest priority: GITHUB

Flow Charts for better understanding:

  • Case 1: (Vulnerability source does not have a higher priority than the alias):


  • Case 2: (The vulnerability source has a higher priority):


  • Case 3: (The vulnerability source has a higher priority, but the alias is already in the component; see the sketch after these cases):

  • This case can occur only if a higher-priority source vulnerability does not exist at that time. A lower-priority source vulnerability will be added. Later, upon alias mapping and reanalysis, the higher-priority source vulnerability will not be added because the alias is already present.

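To summarize the three cases in code, below is a minimal sketch of the decision logic as described above. The record type, the sourceOf helper, and the method name are illustrative stand-ins, not the actual classes or methods in this PR.

import java.util.List;

// Illustrative sketch only -- simplified stand-ins for the real model classes.
record Vuln(String source, String vulnId, List<String> aliasIds) {}

final class AliasDedupSketch {

    // Very rough mapping from a vulnerability ID to its source, for this sketch only.
    static String sourceOf(String vulnId) {
        if (vulnId.startsWith("CVE-")) return "NVD";
        if (vulnId.startsWith("GHSA-")) return "GITHUB";
        return "OSV";
    }

    // priorityList is ordered from highest to lowest priority, e.g. ["NVD", "GITHUB", "OSV"].
    static boolean shouldAttach(Vuln incoming, List<Vuln> attached, List<String> priorityList) {
        // Case 3: the same flaw, under any of its IDs, is already attached to the component.
        for (Vuln existing : attached) {
            boolean sameFlaw = existing.vulnId().equals(incoming.vulnId())
                    || existing.aliasIds().contains(incoming.vulnId())
                    || incoming.aliasIds().contains(existing.vulnId());
            if (sameFlaw) {
                return false;
            }
        }
        // Case 1: one of the incoming record's aliases belongs to a higher-priority source,
        // so that record is preferred and this one is skipped.
        int incomingRank = priorityList.indexOf(incoming.source());
        for (String aliasId : incoming.aliasIds()) {
            int aliasRank = priorityList.indexOf(sourceOf(aliasId));
            if (aliasRank >= 0 && aliasRank < incomingRank) {
                return false;
            }
        }
        // Case 2: the incoming source outranks all of its known aliases, so attach it.
        return true;
    }
}

The sketch is only meant to make the flow charts easier to read; the actual logic lives in the addVulnerability path.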

Addressed Issue

This PR fixes #1994 and #2181

Additional Details

Add a new file named ConfigPropertyQueryManager.java to manage functions related to the EnabledSources
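Roughly, the class exposes the following operations. The method names are the ones used in this PR; the bodies and exact semantics below are placeholder illustrations, not the actual implementation.

import java.util.List;

// Outline only: method names as used in this PR, bodies replaced by placeholders.
public class ConfigPropertyQueryManagerOutline {

    // Whether alias-based de-duplication is enabled via configuration.
    public boolean isDedupEnabled() {
        return true; // placeholder
    }

    // The enabled vulnerability sources, ordered by configured priority (highest first).
    public List<String> parsePriorityList() {
        return List.of("NVD", "GITHUB", "OSV"); // placeholder
    }

    // Re-syncs the stored priority list when sources are enabled or disabled,
    // so that disabled sources no longer take part in de-duplication.
    public void updatePropertiesFromEnabledSources() {
        // placeholder
    }
}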

Checklist

  • I have read and understand the contributing guidelines
  • This PR fixes a defect, and I have provided tests to verify that the fix is effective
  • This PR implements an enhancement, and I have provided tests to verify that it works as intended
  • This PR introduces changes to the database model, and I have added corresponding update logic
  • This PR introduces new or alters existing behavior, and I have updated the documentation accordingly

@valentijnscholten (Contributor) commented:

Not sure if I fully understand the PR. Does it only add one vulnerability, the one from the source with the highest priority?

It feels to me that this would be a "bolt on" solution to a database model that should be changed. Wouldn't it be better to have a data model that has one vulnerability that can have multiple aliases (possibly from different sources)? Currently, with multiple aliases/sources there are (can be / most often are) multiple vulnerability rows. This makes a lot of things harder and more complicated, for example determining the number of affected projects for a vulnerability, or something "as simple as" sorting the list of vulnerabilities by number of affected projects.
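Roughly, the shape I have in mind looks like this (just a sketch of the idea, not a concrete schema proposal):

import java.util.List;

// Sketch of the shape only: one logical vulnerability row, with the per-source
// identifiers attached to it as aliases, instead of one row per source.
record SourceAlias(String source, String sourceVulnId) {}

record LogicalVulnerability(
        long id,                      // single internal row per flaw
        String preferredVulnId,       // e.g. the CVE ID once one has been assigned
        List<SourceAlias> aliases) {} // GHSA / OSV / VulnDB IDs for the same flaw

Counting affected projects, or sorting by that count, would then operate on the single logical row instead of on several per-source rows.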

@LaVibeX (Contributor, Author) commented May 10, 2024

> Not sure if I fully understand the PR. Does it only add one vulnerability, the one from the source with the highest priority?
>
> It feels to me that this would be a "bolt on" solution to a database model that should be changed. Wouldn't it be better to have a data model that has one vulnerability that can have multiple aliases (possibly from different sources)? Currently, with multiple aliases/sources there are (can be / most often are) multiple vulnerability rows. This makes a lot of things harder and more complicated, for example determining the number of affected projects for a vulnerability, or something "as simple as" sorting the list of vulnerabilities by number of affected projects.

PR description updated: I hope the example images explain the PR better, @valentijnscholten.

… I removed it for now; issues with isEmpty, better to use != null

Signed-off-by: Andres Tito <[email protected]>
codacy-production bot commented May 15, 2024

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: +0.10% (target: -1.00%)
Diff coverage:      93.46% (target: 70.00%)

Coverage variation details:

                                   Coverable lines   Covered lines   Coverage
Common ancestor commit (db58e69)   21630             16377           75.71%
Head commit (846d2dd)              21736 (+106)      16478 (+101)    75.81% (+0.10%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details:

                        Coverable lines   Covered lines   Diff coverage
Pull request (#3685)    107               100             93.46%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


Codacy will stop sending the deprecated coverage status from June 5th, 2024.

@pkunze commented May 17, 2024

I think this might also solve #2181, which would be greatly appreciated :) 🎉

@nscuro (Member) left a comment:

Thanks for the PR @LaVibeX!

I added a few comments. I wasn't able to test this since it doesn't build for me locally, which is unfortunately caused by a recent refactoring in Alpine and DT (#3730).

Also I have to say that I do agree with @valentijnscholten in that we really should change the underlying data model to make aliases more useful in general. The problem with the priority approach is that you get inconsistent outcomes across projects, depending on which vulnerability came first, and / or the order in which they were processed, which can become quite confusing.

On a related note, during the last community meeting we briefly mentioned that we're considering building a pre-compiled vulnerability database: https://youtu.be/9harG5GcV_E?t=2799. One of the things that it would make easier is the correlation of aliases across vulnerability sources. Perhaps give that a watch and let us know your thoughts?

It would go along quite well with a change in the data model as proposed by @valentijnscholten:

Wouldn't it be better to have data model that has one vulnerability that can have multiple aliases (from different sources possibly)?

I am happy to provide a feature branch so people can test the approach in this PR out, if you're interested to continue working on it.

Comment on lines +230 to +233
setVulnerabilityAliasesIfNull(vulnerability);
boolean vulnerabilityExists = checkVulnerabilityExists(vulnerability, component);

if (!vulnerabilityExists){
@nscuro (Member):

Will doing this here not cause it all to depend on the order in which vulnerabilities are processed?

For example, if my priority list has:

  1. CVE
  2. GHSA

And the GHSA is processed before the corresponding CVE, de-duplication won't happen until the next time the component is analyzed.

Ideally, given the same set of vulnerabilities being reported across all sources, we should get the same consistent outcome, regardless of the order in which they were processed.

IMO, if we end up doing this sort of de-duplication, we should do it after we have the results from all scanners, so possibly somewhere here:

private void analyzeComponents(final QueryManager qm, final List<Component> components, final Event event) {
    /*
      When this task is processing events that specify the components to scan,
      separate them out into 'candidates' so that we can fire off multiple events
      in hopes of perform parallel analysis using different analyzers.
     */
    final InternalAnalysisTask internalAnalysisTask = new InternalAnalysisTask();
    final OssIndexAnalysisTask ossIndexAnalysisTask = new OssIndexAnalysisTask();
    final VulnDbAnalysisTask vulnDbAnalysisTask = new VulnDbAnalysisTask();
    final SnykAnalysisTask snykAnalysisTask = new SnykAnalysisTask();
    final TrivyAnalysisTask trivyAnalysisTask = new TrivyAnalysisTask();
    final List<Component> internalCandidates = new ArrayList<>();
    final List<Component> ossIndexCandidates = new ArrayList<>();
    final List<Component> vulnDbCandidates = new ArrayList<>();
    final List<Component> snykCandidates = new ArrayList<>();
    final List<Component> trivyCandidates = new ArrayList<>();
    for (final Component component : components) {
        inspectComponentReadiness(component, internalAnalysisTask, internalCandidates);
        inspectComponentReadiness(component, ossIndexAnalysisTask, ossIndexCandidates);
        inspectComponentReadiness(component, vulnDbAnalysisTask, vulnDbCandidates);
        inspectComponentReadiness(component, snykAnalysisTask, snykCandidates);
        inspectComponentReadiness(component, trivyAnalysisTask, trivyCandidates);
    }
    qm.detach(components);
    // Do not call individual async events when processing a known list of components.
    // Call each analyzer task sequentially and catch any exceptions as to prevent one analyzer
    // from interrupting the successful execution of all analyzers.
    performAnalysis(internalAnalysisTask, new InternalAnalysisEvent(internalCandidates), internalAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(ossIndexAnalysisTask, new OssIndexAnalysisEvent(ossIndexCandidates), ossIndexAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(snykAnalysisTask, new SnykAnalysisEvent(snykCandidates), snykAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(trivyAnalysisTask, new TrivyAnalysisEvent(trivyCandidates), trivyAnalysisTask.getAnalyzerIdentity(), event);
    performAnalysis(vulnDbAnalysisTask, new VulnDbAnalysisEvent(vulnDbCandidates), vulnDbAnalysisTask.getAnalyzerIdentity(), event);
}

The problem is that currently, vulnerabilities are "persisted", and notifications are sent, as soon as they are found. De-duplication is supposed to reduce the noise, so we'd need to refactor the scanning such that these things only happen at the very end, when all scanners completed their work.
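To sketch the idea (toy types, not DT's actual model or task flow): if all scanner results are collected first and then de-duplicated once by alias and source priority, the outcome no longer depends on the order in which the scanners ran.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of "collect everything, then de-duplicate, then persist/notify".
final class PostScanDedupSketch {

    record Finding(String source, String vulnId, Set<String> aliasIds) {}

    static int rank(String source, List<String> priorityList) {
        int i = priorityList.indexOf(source);
        return i < 0 ? Integer.MAX_VALUE : i; // unknown sources get the lowest priority
    }

    static List<Finding> deduplicate(List<Finding> all, List<String> priorityList) {
        // Sort so higher-priority sources come first, then keep the first record seen per flaw.
        List<Finding> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingInt((Finding f) -> rank(f.source(), priorityList)));

        List<Finding> kept = new ArrayList<>();
        Set<String> seenIds = new HashSet<>();
        for (Finding f : sorted) {
            boolean alreadyCovered = seenIds.contains(f.vulnId())
                    || f.aliasIds().stream().anyMatch(seenIds::contains);
            if (!alreadyCovered) {
                kept.add(f);
                seenIds.add(f.vulnId());
                seenIds.addAll(f.aliasIds());
            }
        }
        return kept; // persist and send notifications only for these, after all scanners completed
    }
}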

@LaVibeX (Contributor, Author):

The functionality first checks the priority order, then examines aliases in the context of the vulnerability. Aliases are loaded at the time the vulnerability is created, separate from the vulnerability's processing order.

Suppose VULNDB has an alias [NVD], and NVD has a higher priority. If, for some reason, VULNDB is processed before NVD, the system will check whether the vulnerability contains any NVD alias. If it does, the system will not add that vulnerability, to avoid duplication. Subsequently, when NVD is processed, it will be added according to the priority configuration.


I will add more cases to the PR description.

Contributor:

I've read and looked at it three times, and I am not sure I understand it. It seems to rely on the list of aliases being complete and reliable 100% of the time, regardless of which source published the vulnerability. I am not sure that when NVD publishes a vulnerability, the aliases field in, for example, OSV will contain the correct GitHub alias straight away. You would need a lot of test cases to make sure it behaves as expected. You don't want to create false negatives, and you want consistent behaviour: if you have 10 projects all using the exact same component, you want all 10 projects to have the same vulnerabilities from the same source/analyzer attached to that component.

@LaVibeX (Contributor, Author):

I have observed that vulnerabilities often appear in other sources before NVD assigns a CVE ID. It is essential to select sources like GitHub or VulnDB, which provide the most up-to-date vulnerabilities without waiting for a CVE ID. When NVD releases a CVE ID and DT updates any changes in GitHub or VulnDB, these sources will map the CVE ID to their corresponding VulnID, adding a new alias to the table without creating duplicate "new vulnerabilities" in the component audit vulnerabilities view.
The test cases I provided are reliable and cover all possible outcomes. However, I am open to implementing a consistent check and reporting back on its behavior.

Also, I would like to know whether we agree on this approach, or whether you still believe that an internal ID would be better to tackle this problem?
#1994 (comment)

Comment on lines +224 to +229
try {
    priorityList = configPropertyQueryManager.parsePriorityList();
} catch (Exception ex) {
    LOGGER.warn("An unexpected error occurred while retrieving the preference list for alias duplicates", ex);
}
@nscuro (Member):

This code will be executed by multiple threads in parallel, hence it's not a good idea to work with class-level fields like this. Also we will want to avoid loading it for every single vulnerability over and over again, so some sort of caching would be necessary.
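For illustration, something along these lines would avoid both the shared mutable field and the repeated lookups (just a sketch; Alpine/DT may already offer a better-suited caching utility):

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

// Sketch of a small thread-safe, time-bounded cache for the priority list.
final class CachedPriorityList {

    private final Supplier<List<String>> loader; // e.g. a method reference to parsePriorityList()
    private final Duration ttl;

    private volatile List<String> cached;
    private volatile Instant loadedAt = Instant.MIN;

    CachedPriorityList(Supplier<List<String>> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    List<String> get() {
        List<String> snapshot = cached;
        if (snapshot == null || loadedAt.plus(ttl).isBefore(Instant.now())) {
            synchronized (this) {
                if (cached == null || loadedAt.plus(ttl).isBefore(Instant.now())) {
                    cached = List.copyOf(loader.get()); // immutable snapshot, safe to share across threads
                    loadedAt = Instant.now();
                }
                snapshot = cached;
            }
        }
        return snapshot;
    }
}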

@LaVibeX (Contributor, Author):

I will take a look at this.

Comment on lines +351 to +355
// Update PriorityList with new Enabled/Disabled Sources to avoid conflicts
ConfigPropertyQueryManager configPropertyQueryManager = new ConfigPropertyQueryManager();
if (configPropertyQueryManager.isDedupEnabled()) {
    configPropertyQueryManager.updatePropertiesFromEnabledSources();
}
@nscuro (Member):

Seems out-of-place here, perhaps used for testing?

@LaVibeX (Contributor, Author):

I am seeking a method to update and retrieve the enabled sources before creating a new component. It would be ideal to keep this check in the component creation process to ensure that the most up-to-date data is used and to maintain the correct priority logic.

Should I keep it there or should I move it somewhere else?

@LaVibeX (Contributor, Author) commented Jul 3, 2024

Hi @valentijnscholten @nscuro

Thank you for your comments on the PR. You're right that a more ideal solution would be to modify the data model to have one vulnerability with multiple aliases from different sources. This change would indeed make it easier to determine the number of affected projects for a vulnerability and sort the list of vulnerabilities by the number of affected projects.

However, I would like to present the current PR as a reasonable approach to solving the issue of duplicate vulnerabilities. The implementation compares the priority of sources and the aliases attached to a component, which helps in reducing duplicates. I understand your concerns about inconsistent outcomes across projects, but this solution is a step towards addressing the issue while we wait for the new database that can accommodate the required changes in the data model.

I understand that the current issue of duplicate vulnerabilities is persistent, and the resulting noise makes it difficult for most of our teams to activate other sources. This is unfortunate, because support for multiple sources is one of the best aspects of Dependency-Track.

In the meantime, I hope you find the current solution helpful. I am open to any feedback or suggestions to improve it further. I will address and answer your reviews and questions @nscuro.
Thank you for your understanding and support.

Best,
Andrés.

@ellipse2v commented:
Hi, thanks @LaVibeX. Very interesting feature.

With your feature we can try to distinguish between publisher advisories that arrive through OSV and CVEs, which are the unique IDs of vulnerabilities.

E.g. the Debian advisory [SECURITY] [DSA 5759-1] python3.11 security update (debian.org) (DSA-5759-1 in OSV), which covers 3 vulnerabilities (NVD - CVE-2024-8088 (nist.gov), NVD - CVE-2024-4032 (nist.gov), NVD - CVE-2024-8088 (nist.gov)).

Successfully merging this pull request may close these issues.

Offer alias-based de-duplication of vulnerabilities
5 participants