When it comes to raw search speed, FlexSearch outperforms every other search library out there and also provides flexible search capabilities like multi-word matching, phonetic transformations and partial matching. It also has the most memory-efficient index. Keep in mind that updating or removing existing items from the index has a significant cost; when your index needs to be updated continuously, BulkSearch may be the better choice. FlexSearch also provides a non-blocking asynchronous processing model as well as web workers to perform updates or queries on the index in parallel through dedicated balanced threads.
Installation Guide • API Reference • Example Options • Custom Builds
Comparison:
Supported Platforms:
- Browser
- Node.js
All Features:
- Web-Worker Sharding (not available in Node.js)
- Contextual Indexes
- Partial Matching
- Multi-Phrase Search
- Phonetic Matching
- Relevance-based Scoring
- Auto-Balanced Cache by Popularity
- Suggestions (Results)
- Asynchronous Processing & Concurrency Control
- Customizable: Matcher, Encoder, Tokenizer, Stemmer, Filter
These features are not available in the 50% smaller light version:
- WebWorker
- Asynchronous Processing
- Cache
- Built-in encoders except 'balance' and 'icase' (you can still pass in custom encoders)
- Debug logging
- Methods: index.info()
The light version is only available as a compiled build (flexsearch.light.js).
It is also pretty simple to make Custom Builds.
FlexSearch introduces a new scoring mechanism called Contextual Search, which was invented by Thomas Wilkerling, the author of this library. Contextual Search boosts queries to a completely new level. The basic idea of this concept is to limit relevance by context instead of calculating relevance over the whole (unlimited) distance. Imagine you add a text block of several sentences to an index ID. If a query combines the first and the last word of this text block, are they really relevant to each other? In this way, contextual search also improves the results of relevance-based queries on large amounts of text data.
Note: This feature is not enabled by default.
"TF-IDF and all kinds of variations (like BM25) is a big mistake in searching algorithms today. They don't provide neither: a meaningful relevance of a term nor the importance of it! Like many pseudo-intelligent algorithms this is also just an example of mathematical stupidity." — Thomas Wilkerling, Contextual-based Scoring, 2018
Each worker gets its own dedicated memory. Especially for larger indexes, web workers improve speed and available memory a lot. A FlexSearch index was tested with a 250 MB text file including 10 million words. The indexing was done silently in the background by multiple workers running in parallel in about 7 minutes. The final index reserves ~ 8.2 MB of memory/space. A search took ~ 0.25 ms.
Note: It is slightly faster to use no web worker when the index or query isn't too big (index < 500,000 words, query < 25 words).
| Description | BulkSearch | FlexSearch |
|---|---|---|
| Access | Read-Write optimized index | Read-Memory optimized index |
| Memory | Large: ~ 1 Mb per 100,000 words | Tiny: ~ 100 Kb per 100,000 words |
| Index Type | Bulk of encoded string data divided into chunks | |
| Strength | | |
| Weaknesses | | |
| Pagination | Yes | No |
<html>
<head>
<script src="js/flexsearch.min.js"></script>
</head>
...
Note: Use flexsearch.min.js for production and flexsearch.js for development.
Use latest from CDN:
<script src="https://cdn.rawgit.com/nextapps-de/flexsearch/master/flexsearch.min.js"></script>
npm install flexsearch
In your code, include it as follows:
var FlexSearch = require("flexsearch");
Or pass in options when requiring:
var index = require("flexsearch").create({/* options */});
AMD
var FlexSearch = require("./flexsearch.js");
Global methods:
- FlexSearch.create(<options>)
- FlexSearch.registerMatcher({KEY: VALUE})
- FlexSearch.registerEncoder(name, encoder)
- FlexSearch.registerLanguage(lang, {stemmer:{}, filter:[]})
- FlexSearch.encode(name, string)
Index methods:
- Index.add(id, string)
- Index.search(string, <limit>, <callback>)
- Index.search(<options>)
- Index.update(id, string)
- Index.remove(id)
- Index.reset()
- Index.destroy()
- Index.init(<options>)
- Index.info()
- Index.addMatcher({KEY: VALUE})
- Index.encode(string)
FlexSearch.create(<options>)
var index = new FlexSearch();
alternatively you can also use:
var index = FlexSearch.create();
var index = new FlexSearch("speed");
var index = new FlexSearch({
// default values:
profile: "balance",
encode: "icase",
mode: "ngram",
async: false,
cache: false
});
Read more about custom options
Index.add(id, string)
index.add(10025, "John Doe");
Index.search(string|options, <limit>, <callback>)
index.search("John");
Limit the result:
index.search("John", 10);
Perform queries asynchronously:
index.search("John", function(result){
// array of results
});
Pass custom options for each query:
index.search({
query: "John",
limit: 1000,
threshold: 5, // >= initial threshold
depth: 3, // <= initial depth
callback: function(results){/* ... */}
});
The same from above could also be written as:
index.search("John", {
limit: 1000,
threshold: 5,
depth: 3
}, function(results){
// ....
});
Also get suggestions for a query:
index.search({
query: "John Doe",
suggest: true
});
When suggestions are enabled, all results will be filled up (to the limit, default 1000) with similar matches ordered by relevance.
Index.update(id, string)
index.update(10025, "Road Runner");
Index.remove(id)
index.remove(10025);
index.reset();
index.destroy();
Index.init(<options>)
Note: Re-initialization will also destroy the old index!
Initialize (with same options):
index.init();
Initialize with new options:
index.init({
/* options */
});
FlexSearch.registerMatcher({REGEX: REPLACE})
Add global matchers for all instances:
FlexSearch.registerMatcher({
'ä': 'a', // replaces all 'ä' to 'a'
'ó': 'o',
'[ûúù]': 'u' // replaces multiple
});
Add private matchers for a specific instance:
index.addMatcher({
'ä': 'a', // replaces all 'ä' to 'a'
'ó': 'o',
'[ûúù]': 'u' // replaces multiple
});
Define a private custom encoder during creation/initialization:
var index = new FlexSearch({
encode: function(str){
// do something with str ...
return str;
}
});
FlexSearch.registerEncoder(name, encoder)
FlexSearch.registerEncoder('whitespace', function(str){
return str.replace(/ /g, '');
});
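The encoder body itself is an ordinary string transform. Standalone (without FlexSearch), the 'whitespace' example above behaves like this plain function:

```javascript
// Same transform as the 'whitespace' encoder above, as a plain function.
function whitespaceEncode(str) {
    return str.replace(/ /g, ''); // strip all spaces
}

console.log(whitespaceEncode("sample text")); // "sampletext"
```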
Use global encoders:
var index = new FlexSearch({ encode: 'whitespace' });
Private encoder:
var encoded = index.encode("sample text");
var encoded = FlexSearch.encode("whitespace", "sample text");
FlexSearch.registerEncoder('mixed', function(str){
str = this.encode("icase", str); // built-in
str = this.encode("whitespace", str); // custom
return str;
});
FlexSearch.registerEncoder('extended', function(str){
str = this.encode("custom", str);
// do something additional with str ...
return str;
});
Define a private custom tokenizer during creation/initialization:
var index = new FlexSearch({
mode: function(str){
// split string into components, e.g. at spaces, hyphens and slashes:
return str.split(/[- \/]+/g);
}
});
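Such a tokenizer is just a plain function from string to array. A standalone sketch (independent of FlexSearch) of a splitter that breaks at spaces, hyphens and slashes:

```javascript
// Hypothetical custom tokenizer: split a string into components
// at spaces, hyphens and slashes (runs of separators collapse).
function tokenize(str) {
    return str.split(/[- \/]+/g);
}

console.log(tokenize("Half-Life 2/Episode One"));
// → ["Half", "Life", "2", "Episode", "One"]
```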
Stemmer: several linguistic mutations of the same word (e.g. "run" and "running")
Filter: a blacklist of words to be excluded from indexing entirely (e.g. "and", "to" or "be")
Define a private custom stemmer or filter during creation/initialization:
var index = new FlexSearch({
stemmer: {
// object {key: replacement}
"ational": "ate",
"tional": "tion",
"enci": "ence",
"ing": ""
},
filter: [
// array blacklist
"in",
"into",
"is",
"isn't",
"it",
"it's"
]
});
Or assign stemmer/filters globally to a language:
FlexSearch.registerLanguage('us', {
stemmer: {/* ... */},
filter: [/* ... */]
});
Or use built-in stemmer or filter of your preferred languages:
<html>
<head>
<script src="js/flexsearch.min.js"></script>
<script src="js/lang/en.min.js"></script>
<script src="js/lang/de.min.js"></script>
</head>
...
Now you can assign built-in stemmer during creation/initialization:
var index_en = new FlexSearch({stemmer: 'en', filter: 'en'});
var index_de = new FlexSearch({stemmer: 'de', filter: [/* custom */]});
In Node.js you just need to require the language pack files to make them available:
require('lang/en.js');
require('lang/de.js');
It is also possible to compile language packs into the build as follows:
node compile SUPPORT_LANG_EN=true SUPPORT_LANG_DE=true
index.info();
Returns information e.g.:
{
"bytes": 64000,
"id": 0,
"matchers": 0,
"size": 10000,
"status": false
}
Simply chain methods like:
var index = FlexSearch.create()
.addMatcher({'â': 'a'})
.add(0, 'foo')
.add(1, 'bar');
index.remove(0).update(1, 'foo').add(2, 'foobar');
Create index and just set the limit of relevance ("depth"):
var index = new FlexSearch({
encode: "icase",
mode: "strict",
depth: 3
});
Create index and just set a limit of cache entries:
var index = new FlexSearch({
profile: "score",
cache: 10000
});
Create index and just set the count of parallel threads:
var index = new FlexSearch({
encode: "icase",
mode: "full",
async: true,
worker: 4
});
Adding items to worker index as usual (async enabled):
index.add(10025, "John Doe");
Perform search and simply pass in callback like:
index.search("John Doe", function(results){
// do something with array of results
});
FlexSearch is highly customizable. Making use of the right options can really improve your results as well as memory economy and query time.
| Option | Values | Description |
|---|---|---|
| profile | "memory", "speed", "match", "score", "balance", "fastest" | The configuration profile. Choose your preference. |
| mode | "strict", "forward", "reverse", "ngram", "full", function() | The indexing mode (tokenizer). Choose one of the built-ins or pass a custom tokenizer function. |
| encode | false, "icase", "simple", "advanced", "extra", "balance", function() | The encoding type. Choose one of the built-ins or pass a custom encoding function. |
| cache | false, true, {number} | Enable/disable the cache and/or set its capacity. When passing a number as a limit, the cache automatically balances stored entries by popularity. Note: when just using "true" the cache has no limit and is actually 2-3 times faster (because the balancer does not have to run). |
| async | true, false | Enable/disable asynchronous processing. Each job will be queued for non-blocking processing. Recommended when using web workers. |
| worker | false, {number} | Enable/disable workers and set the count of running worker threads. |
| depth | false, {number:0-9} | Enable/disable contextual indexing and set the contextual distance of relevance. |
| threshold | false, {number:0-9} | Enable/disable the threshold of minimum relevance all results should have. Note: it is also possible to set a lower threshold for indexing and pass a higher value when calling index.search(options). |
| stemmer | false, {string}, {function} | Disable, or pass in a language shorthand flag (ISO-3166) or a custom object. |
| filter | false, {string}, {function} | Disable, or pass in a language shorthand flag (ISO-3166) or a custom array. |
The tokenizer affects the required memory as well as query time and the flexibility of partial matches. Try to choose the uppermost of these tokenizers that fits your needs:
| Option | Description | Example | Memory Factor (n = length of word) |
|---|---|---|---|
| "strict" | index whole words | foobar | * 1 |
| "ngram" (default) | index words partially through phonetic n-grams | foobar | * n / 3 |
| "forward" | incrementally index words in forward direction | foobar | * n |
| "reverse" | incrementally index words in both directions | foobar | * 2n - 1 |
| "full" | index every possible combination | foobar | * n * (n - 1) |
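To make the memory factors concrete, here is a standalone sketch (not FlexSearch internals) that enumerates the partials of a single word for the incremental modes. Note how the counts line up with the factors * n and * 2n - 1 above:

```javascript
// Illustrative helpers only - they enumerate word partials per mode;
// they are not FlexSearch internals.
function forwardPartials(word) {
    var parts = [];
    for (var i = 1; i <= word.length; i++) {
        parts.push(word.substring(0, i)); // "f", "fo", ..., "foobar"
    }
    return parts;
}

function reversePartials(word) {
    var parts = forwardPartials(word);
    for (var i = word.length - 1; i >= 1; i--) {
        parts.push(word.substring(i)); // "r", "ar", ..., "oobar"
    }
    return parts;
}

console.log(forwardPartials("foobar").length); // 6  (= n)
console.log(reversePartials("foobar").length); // 11 (= 2n - 1)
```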
Encoding affects the required memory as well as query time and phonetic matches. Try to choose the uppermost of these encoders that fits your needs, or pass in a custom encoder:
| Option | Description | False Positives | Compression |
|---|---|---|---|
| false | Turn off encoding | no | no |
| "icase" (default) | Case-insensitive encoding | no | no |
| "simple" | Phonetic normalizations | no | ~ 7% |
| "advanced" | Phonetic normalizations + literal transformations | no | ~ 35% |
| "extra" | Phonetic normalizations + Soundex transformations | yes | ~ 60% |
| function() | Pass custom encoding: function(string):string | | |
Reference String: "Björn-Phillipp Mayer"
| Query | iCase | Simple | Advanced | Extra |
|---|---|---|---|---|
| björn | yes | yes | yes | yes |
| björ | yes | yes | yes | yes |
| bjorn | no | yes | yes | yes |
| bjoern | no | no | yes | yes |
| philipp | no | no | yes | yes |
| filip | no | no | yes | yes |
| björnphillip | no | yes | yes | yes |
| meier | no | no | yes | yes |
| björn meier | no | no | yes | yes |
| meier fhilip | no | no | yes | yes |
| byorn mair | no | no | no | yes |
| (false positives) | no | no | no | yes |
The required memory for the index depends on several options:
| Encoding | Memory usage per ~ 100,000 indexed words |
|---|---|
| false | 260 kb |
| "icase" (default) | 210 kb |
| "simple" | 190 kb |
| "advanced" | 150 kb |
| "extra" | 90 kb |
| Mode | Multiplied with: (n = average length of indexed words) |
|---|---|
| "strict" | * 1 |
| "ngram" (default) | * n / 3 |
| "forward" | * n |
| "reverse" | * 2n - 1 |
| "full" | * n * (n - 1) |
| Contextual Index | Multiply the sum above with: |
|---|---|
| | * (depth * 2 + 1) |
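As a worked example of the tables above, assuming "icase" encoding (~ 210 kb base), "ngram" mode, an average word length of n = 6 and a contextual depth of 3:

```javascript
// Rough index size per ~ 100,000 indexed words, per the tables above.
var baseKb = 210;               // "icase" encoding
var n = 6;                      // assumed average word length
var modeFactor = n / 3;         // "ngram" mode -> * n / 3
var contextFactor = 3 * 2 + 1;  // contextual depth 3 -> * (depth * 2 + 1)
var estimateKb = baseKb * modeFactor * contextFactor;
console.log(estimateKb + " kb"); // 2940 kb per ~ 100,000 words
```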
The book "Gulliver's Travels" (Jonathan Swift, 1726) was used for this test.
You can pass a built-in profile during creation/initialization. The profiles have the following settings:
Standard profile: "default"
{
encode: "icase",
mode: "forward"
}
Memory-optimized profile: "memory"
{
encode: "extra",
mode: "strict",
threshold: 7
}
Speed-optimized profile: "speed"
{
encode: "icase",
mode: "strict",
threshold: 7,
depth: 2
}
Matching-tolerant profile: "match"
{
encode: "extra",
mode: "full"
}
Relevance-optimized profile: "score"
{
encode: "extra",
mode: "strict",
threshold: 5,
depth: 5
}
Most-balanced profile: "balance"
{
encode: "balance",
mode: "ngram",
threshold: 6,
depth: 3
}
Absolute fastest profile: "fastest"
{
encode: "icase",
threshold: 9,
depth: 1
}
Compare these options above:
Full Build:
npm run build
Light Build:
npm run build-light
Build Language Packs:
npm run build-lang
Custom Build:
npm run build-custom SUPPORT_WORKER=true SUPPORT_ASYNC=true
Supported flags:
- SUPPORT_DEBUG
- SUPPORT_WORKER
- SUPPORT_CACHE
- SUPPORT_ASYNC
- SUPPORT_BUILTINS (built-in encoders)
Supported language flags (includes stemmer and filter):
- SUPPORT_LANG_EN
- SUPPORT_LANG_DE
Alternatively you can also use:
node compile SUPPORT_WORKER=true
The custom build was saved to flexsearch.custom.js
Copyright 2018 Thomas Wilkerling
Released under the Apache 2.0 License