Name	Name	Last commit message	Last commit date
Latest commit History 299 Commits
.github/workflows	.github/workflows
benchmarks	benchmarks
docs	docs
src/fuzzysearch	src/fuzzysearch
tests	tests
.bumpversion.cfg	.bumpversion.cfg
.coveragerc	.coveragerc
.gitignore	.gitignore
AUTHORS.rst	AUTHORS.rst
CONTRIBUTING.rst	CONTRIBUTING.rst
HISTORY.rst	HISTORY.rst
LICENSE	LICENSE
MANIFEST.in	MANIFEST.in
Makefile	Makefile
README.rst	README.rst
build.cmd	build.cmd
requirements_dev.txt	requirements_dev.txt
setup.py	setup.py
tox.ini	tox.ini

fuzzysearch

Fuzzy search: Find parts of long text or data, allowing for some changes/typos.

Easy, fast, and just works!

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

Two simple functions to use: one for in-memory data and one for files
- Fastest search algorithm is chosen automatically
Levenshtein Distance metric with configurable parameters
- Separately configure the max. allowed distance, substitutions, deletions and/or insertions
Advanced algorithms with optional C and Cython optimizations
Properly handles Unicode; special optimizations for binary data
Simple installation:
- pip install fuzzysearch just works
- pure-Python fallbacks for compiled modules
- only one dependency (attrs)
Extensively tested
Free software: MIT license

For more info, see the documentation.

Installation

fuzzysearch supports Python versions 3.8+, as well as PyPy 3.9 and 3.10.

$ pip install fuzzysearch

This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.

Usage

Just call find_near_matches() with the sub-sequence you're looking for, the sequence to search, and the matching parameters:

>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use find_near_matches_in_file() similarly:

>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

Examples

fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for "heavier", domain-specific tools like BioPython:

>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

Matching Criteria

The search function supports four possible match criteria, which may be supplied in any combination:

maximum Levenshtein distance (max_l_dist)
maximum # of subsitutions
maximum # of deletions ("delete" = skip a character in the sub-sequence)
maximum # of insertions ("insert" = skip a character in the sequence)

Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]

# note that a deletion + insertion may be combined to match a substution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1

# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]

When to Use Other Tools

Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.

Suggestion: Consider using fuzzywuzzy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fuzzysearch

Installation

Usage

Examples

Matching Criteria

When to Use Other Tools

About

Releases 9

Packages

Languages

License

taleinat/fuzzysearch

Folders and files

Latest commit

History

Repository files navigation

fuzzysearch

Installation

Usage

Examples

Matching Criteria

When to Use Other Tools

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 9

Packages 0

Languages

Packages