feat(security): Add package name typosquatting detection #1059

Open: wants to merge 4 commits into base: main

Conversation

AmineRaouane

Implement typosquatting detection for package names during analysis. Compares package names against a list of popular packages using the Jaro-Winkler similarity algorithm. Packages exceeding a defined threshold of similarity to a popular package are flagged.

Summary

Adds typosquatting detection for package names during analysis using Jaro-Winkler similarity.

Description of changes

This PR introduces a new security analysis feature to detect potential typosquatting in package names. The implementation compares the name of a package being analyzed against a list of popular package names. By default, it uses a predefined list stored in a dedicated file, but it also offers an option to use a custom list provided via a configuration path.

The comparison utilizes the Jaro-Winkler similarity algorithm to calculate a similarity score between the package name and each name in the popular packages list. If the calculated similarity score exceeds a configurable threshold, the package is flagged as a potential typosquat.
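The comparison described above can be sketched as follows. This is a minimal, textbook Jaro-Winkler implementation, not the code from this PR (which, judging by the review thread, also weights keyboard-adjacent characters). The function names, the 0.95 default threshold, the use of `>=` for the comparison, and the choice to skip exact matches are all assumptions made for illustration.

```python
from __future__ import annotations


def jaro_similarity(first_name: str, second_name: str) -> float:
    """Return the plain Jaro similarity of two strings, in [0, 1]."""
    if first_name == second_name:
        return 1.0
    first_length = len(first_name)
    second_length = len(second_name)
    if first_length == 0 or second_length == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    match_window = max(max(first_length, second_length) // 2 - 1, 0)
    first_matched = [False] * first_length
    second_matched = [False] * second_length
    matches = 0
    for index, char in enumerate(first_name):
        window_start = max(0, index - match_window)
        window_end = min(second_length, index + match_window + 1)
        for candidate in range(window_start, window_end):
            if not second_matched[candidate] and second_name[candidate] == char:
                first_matched[index] = True
                second_matched[candidate] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as half a transposition each.
    first_sequence = [char for char, matched in zip(first_name, first_matched) if matched]
    second_sequence = [char for char, matched in zip(second_name, second_matched) if matched]
    out_of_order = sum(1 for left, right in zip(first_sequence, second_sequence) if left != right)
    transpositions = out_of_order / 2

    return (
        matches / first_length
        + matches / second_length
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_similarity(first_name: str, second_name: str, scaling: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 characters)."""
    jaro_score = jaro_similarity(first_name, second_name)
    prefix_length = 0
    for left, right in zip(first_name[:4], second_name[:4]):
        if left != right:
            break
        prefix_length += 1
    return jaro_score + prefix_length * scaling * (1 - jaro_score)


def find_typosquat_target(
    package_name: str, popular_packages: list[str], threshold: float = 0.95
) -> str | None:
    """Return the most similar popular package at or above the threshold, if any.

    The threshold default and the exact-match exemption are illustrative assumptions.
    """
    if package_name in popular_packages:
        # The package is itself a popular package, so it is not a typosquat.
        return None
    best_name = None
    best_score = threshold
    for popular_name in popular_packages:
        score = jaro_winkler_similarity(package_name, popular_name)
        if score >= best_score:
            best_name = popular_name
            best_score = score
    return best_name
```

With these definitions, `find_typosquat_target("reqeusts", ["numpy", "requests"])` returns `"requests"`, because the two names differ only by one transposition and share a three-character prefix.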

This feature helps identify malicious packages attempting to mimic legitimate, popular ones through slight variations in spelling, thus enhancing the security posture of the project by warning users about such risks.

The changes include:

  • Integration of the Jaro-Winkler similarity algorithm.
  • Inclusion of a default file containing a list of popular package names for comparison.
  • Addition of a configuration option to provide a custom file path for the popular packages list, overriding the default.
  • Implementation of the comparison logic and threshold-based flagging.

Related issues

Checklist

  • I have reviewed the contribution guide.
  • My PR title and commits follow the Conventional Commits convention.
  • My commits include the "Signed-off-by" line.
  • I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
  • I have tested my changes and verified they work as expected.


Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement bot added the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) on Apr 21, 2025
Amine and others added 2 commits April 23, 2025 09:28
Implement typosquatting detection for package names during analysis.
Compares package names against a list of popular packages using the Jaro-Winkler similarity algorithm.
Packages exceeding a defined threshold of similarity to a popular package are flagged.

Signed-off-by: Amine <[email protected]>
Adds a new security analysis feature to detect potential typosquatting in package names. Compares the package name against a list of popular packages using the Jaro-Winkler similarity algorithm. Packages exceeding a configurable threshold are flagged. Includes a default popular package list and an option for a custom list via configuration.

Signed-off-by: Amine <[email protected]>
Adds a new security analysis feature to detect potential typosquatting in package names. Compares the package name against a list of popular packages using the Jaro-Winkler similarity algorithm. Packages exceeding a configurable threshold are flagged. Includes a default popular package list and an option for a custom list via configuration.

Signed-off-by: Amine <[email protected]>
@behnazh-w behnazh-w requested review from art1f1c3R and benmss April 24, 2025 01:58
@behnazh-w
Member

@AmineRaouane Please add unit tests following the instructions here.

Take a look at the unit tests for other malware heuristics at tests/malware_analyzer/pypi/ and add a similar one for this new heuristic.

For small and standalone functions, you can add test cases to the docstring itself. You can find an example here.
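As a rough illustration of a docstring test, the sketch below uses a hypothetical helper, `shared_prefix_length`, which is not part of this PR. The examples in the docstring are executed by the standard-library `doctest` module.

```python
import doctest


def shared_prefix_length(first_name: str, second_name: str, limit: int = 4) -> int:
    """Return how many leading characters two names share, capped at ``limit``.

    >>> shared_prefix_length("requests", "reqeusts")
    3
    >>> shared_prefix_length("numpy", "pandas")
    0
    >>> shared_prefix_length("django", "django")
    4
    """
    length = 0
    for first_char, second_char in zip(first_name[:limit], second_name[:limit]):
        if first_char != second_char:
            break
        length += 1
    return length


if __name__ == "__main__":
    # Runs every docstring example in this module and reports any mismatch.
    doctest.testmod()
```

Running the file directly prints nothing when all docstring examples pass.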

@art1f1c3R
Member

Would it be possible to make the path to the custom package list file configurable through defaults.ini? Our configurations for heuristic analyzers live under the [heuristic.pypi] section of that file. Check out some of the heuristics that use it to get an idea (anomalous_version.py, high_release_frequency.py). I do something similar with paths in the semgrep PR.

Added unit tests for typosquatting detection. Analyzer variables, including the file path, are now loaded from defaults.ini. Raised heuristic confidence level from medium to high.

BREAKING CHANGE: Analyzer config must now be defined in defaults.ini.

Signed-off-by: Amine <[email protected]>
logger.error(err_msg)
return HeuristicResult.SKIP, {"error": err_msg}

package_name = pypi_package_json.component_name
Member

I think we should also have a check for when the popular_packages list ends up being empty.
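One possible shape for such a check, mirroring the `HeuristicResult.SKIP, {"error": err_msg}` return seen in the snippet above. The `HeuristicResult` enum here is a stand-in defined only so the sketch runs on its own, and `check_popular_packages` is a hypothetical helper name.

```python
from __future__ import annotations

import logging
from enum import Enum

logger = logging.getLogger(__name__)


class HeuristicResult(Enum):
    """Stand-in for the project's result type (assumed members)."""

    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


def check_popular_packages(
    popular_packages: list[str],
) -> tuple[HeuristicResult, dict[str, str]] | None:
    """Return a SKIP result if there is nothing to compare against, else None."""
    if not popular_packages:
        err_msg = "Popular packages list is empty."
        logger.error(err_msg)
        return HeuristicResult.SKIP, {"error": err_msg}
    # A non-empty list means the analysis can proceed.
    return None
```

Skipping rather than passing avoids reporting a package as safe when no comparison was actually performed.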

Comment on lines +118 to +119
c1 = self.KEYBOARD_LAYOUT.get(char1)
c2 = self.KEYBOARD_LAYOUT.get(char2)
Member

Please avoid using very short variable names such as c1 and c2 here.

transpositions = 0.0 # Now a float to handle partial costs

# Count matches
for i in range(len1):
Member

Same comment here for variable names. Use index. Please also apply this throughout the PR.

if package_name == popular_package_name:
return 1.0

len1, len2 = len(package_name), len(popular_package_name)
Member

Please put these assignments on separate lines.

if defaults.has_section(section_name):
section = defaults[section_name]
path = section.get("popular_packages_path", default_path)
# Fall back to default if the path in defaults.ini is empty
Member

Please use a full stop . at the end of every comment.
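For context, the fallback behaviour in the snippet under review could look roughly like this with the standard-library `configparser`. The function name, the default path value, and the section name default are assumptions; only the `popular_packages_path` key and the `[heuristic.pypi]` section come from the thread.

```python
import configparser

# Hypothetical default location, used only for this illustration.
DEFAULT_POPULAR_PACKAGES_PATH = "resources/popular_packages.txt"


def resolve_popular_packages_path(
    config: configparser.ConfigParser,
    section_name: str = "heuristic.pypi",
    default_path: str = DEFAULT_POPULAR_PACKAGES_PATH,
) -> str:
    """Return the configured path, or the default if it is missing or empty."""
    if config.has_section(section_name):
        section = config[section_name]
        path = section.get("popular_packages_path", default_path)
        # Fall back to the default if the path in the config is empty.
        if path.strip():
            return path.strip()
    return default_path
```

An entry such as `popular_packages_path =` parses to an empty string, which is why the explicit emptiness check is needed in addition to the fallback argument of `get`.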

self._load_defaults()
)

if global_config.popular_packages_path is not None:
Member

Now that the path is configurable in defaults.ini, could we remove the command-line argument for it? Currently, it looks like the command-line argument will override the defaults.ini value.

Author

Removing the command-line option would reduce flexibility for users who may want to specify a different list path dynamically, without modifying the config file.

Member

I think it's best to remove the command-line option now that we have opted for defaults.ini.
If users want to provide their own path, they can create a custom defaults.ini with the following content:

[heuristic.pypi]
popular_packages_path = <custom_value>

This overrides only the popular_packages_path value and keeps the other values in defaults.ini unchanged. This approach is consistent with the other options we have for malware analysis heuristics, so I think it's fine to drop the command-line parameter without worrying about the loss of flexibility.

Besides, it's usually not a good idea to have two ways to achieve the same thing, as it can confuse users (even though both ways are completely fine on their own).
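The layering idea can be illustrated with the standard-library `configparser`, where values read later override values read earlier for the same key. This is only a sketch of the general mechanism, not the project's actual config loader; the `threshold` key, its value, the example path, and the helper name are hypothetical, and the key name `popular_packages_path` follows the code snippet under review.

```python
import configparser

# Stand-in for the built-in defaults.ini (contents are illustrative).
BUILT_IN_DEFAULTS = """
[heuristic.pypi]
popular_packages_path =
threshold = 0.95
"""

# Stand-in for a user's custom defaults.ini that overrides a single value.
USER_OVERRIDES = """
[heuristic.pypi]
popular_packages_path = /tmp/my_popular_packages.txt
"""


def load_layered_config(*sources: str) -> configparser.ConfigParser:
    """Read each source in order so that later sources override earlier ones."""
    config = configparser.ConfigParser()
    for source in sources:
        config.read_string(source)
    return config


config = load_layered_config(BUILT_IN_DEFAULTS, USER_OVERRIDES)
section = config["heuristic.pypi"]
print(section["popular_packages_path"])  # Overridden by the user file.
print(section["threshold"])  # Untouched built-in value.
```

Only the overridden key changes; every other key keeps its built-in value, which is the behaviour the comment above relies on.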

5 participants