The new version is finally available. FlexSearch 0.7.0 was developed as a modern rebuild from the ground up. The result is an improvement in every single aspect and covers many enhancements collected over the last 3 years of production use.
This new version is largely compatible with the previous generation, but it might require some migration steps in your code.
Read the documentation of new features and changes:
https://github.com/nextapps-de/flexsearch/blob/0.7.0/doc/0.7.0.md
Read the documentation of new language encoding features:
https://github.com/nextapps-de/flexsearch/blob/0.7.0/doc/0.7.0-lang.md
When it comes to raw search speed, FlexSearch outperforms every library in the benchmark below and also provides flexible search capabilities like multi-field search, phonetic transformations or partial matching.
Depending on the options used, it also provides the most memory-efficient index. FlexSearch introduces a new scoring algorithm called "contextual index", based on a pre-scored lexical dictionary architecture, which performs queries up to 1,000,000 times faster compared to other libraries. FlexSearch also provides a non-blocking asynchronous processing model as well as web workers to perform any updates or queries on the index in parallel through dedicated balanced threads.
Supported Platforms:
- Browser
- Node.js
Library Comparison "Gulliver's Travels":
Plugins (external projects):
- https://github.com/angeloashmore/react-use-flexsearch
- https://www.gatsbyjs.org/packages/gatsby-plugin-flexsearch/
Get Latest Stable Build (Recommended):
Build | File | CDN |
flexsearch.bundle.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.0/dist/flexsearch.bundle.js |
flexsearch.light.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.0/dist/flexsearch.light.js |
flexsearch.compact.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.0/dist/flexsearch.compact.js |
flexsearch.es5.js * | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.0/dist/flexsearch.es5.js |
ES6 Modules | Download | The /dist/module/ folder of this Github repository |
* The bundle "flexsearch.es5.js" includes polyfills for ECMAScript 5 support.
Get Latest (NPM):
npm install flexsearch
Get Latest Nightly (Do not use for production!):
Just replace the version number in the URLs above with "master", e.g. change "/flexsearch/0.7.0/dist/" to "/flexsearch/master/dist/".
Compare Web-Bundles:
The Node.js package includes all features from flexsearch.bundle.js.
Feature | flexsearch.bundle.js | flexsearch.compact.js | flexsearch.light.js |
Presets | ✓ | ✓ | - |
Async Search | ✓ | ✓ | - |
Web-Workers | ✓ | - | - |
Contextual Indexes | ✓ | ✓ | ✓ |
Index Documents (Field-Search) | ✓ | ✓ | - |
Document Store | ✓ | ✓ | - |
Partial Matching | ✓ | ✓ | ✓ |
Relevance Scoring | ✓ | ✓ | ✓ |
Auto-Balanced Cache by Popularity | ✓ | - | - |
Tags | ✓ | - | - |
Suggestions | ✓ | ✓ | - |
Phonetic Matching | ✓ | ✓ | - |
Customizable Charset/Language (Matcher, Encoder, Tokenizer, Stemmer, Filter, Split, RTL) | ✓ | ✓ | ✓ |
Export / Import | ✓ | - | - |
File Size (gzip) | 6.8 kb | 5.3 kb | 2.9 kb |
Run Comparison: Performance Benchmark "Gulliver's Travels"
Operations per second, higher is better, except for the test "Memory", where lower is better.
Rank | Library | Memory | Query (Single Term) | Query (Multi Term) | Query (Long) | Query (Dupes) | Query (Not Found) |
1 | FlexSearch | 23 | 7039844 | 1429457 | 113091 | 1467937 | 2895284 |
2 | JSii | 27 | 6564 | 158149 | 61290 | 95098 | 534109 |
3 | Wade | 424 | 20471 | 78780 | 16693 | 225824 | 213754 |
4 | JS Search | 193 | 8221 | 64034 | 10377 | 95830 | 167605 |
5 | Elasticlunr.js | 646 | 5412 | 7573 | 2865 | 23786 | 13982 |
6 | BulkSearch | 1021 | 3069 | 3141 | 3333 | 3265 | 21825569 |
7 | MiniSearch | 24348 | 4406 | 10945 | 72 | 39989 | 17624 |
8 | bm25 | 15719 | 1429 | 789 | 366 | 884 | 1823 |
9 | Lunr.js | 2219 | 255 | 271 | 272 | 266 | 267 |
10 | FuzzySearch | 157373 | 53 | 38 | 15 | 32 | 43 |
11 | Fuse | 7641904 | 6 | 2 | 1 | 2 | 3 |
Note: This feature is disabled by default because of its extended memory usage. See the context options for how to enable it.
FlexSearch introduces a new scoring mechanism called Contextual Search, which was invented by Thomas Wilkerling, the author of this library. Contextual Search boosts queries to a completely new level but also requires some additional memory (depending on depth). The basic idea of this concept is to limit relevance by its context instead of calculating relevance through the whole distance of its corresponding document. This way contextual search also improves the results of relevance-based queries on a large amount of text data.
import Index from "./index.js";
import Document from "./document.js";
import WorkerIndex from "./worker/index.js";
const index = new Index(options);
const document = new Document(options);
const worker = new WorkerIndex(options);
<html>
<head>
<script src="js/flexsearch.bundle.js"></script>
</head>
...
Or via CDN:
<script src="https://cdn.jsdelivr.net/gh/nextapps-de/[email protected]/dist/flexsearch.bundle.js"></script>
AMD:
var FlexSearch = require("./flexsearch.js");
Load one of the builds from the folder dist into your HTML as a script and use it as follows:
var index = new FlexSearch.Index(options);
var document = new FlexSearch.Document(options);
var worker = new FlexSearch.Worker(options);
npm install flexsearch
In your code include as follows:
const { Index, Document, Worker } = require("flexsearch");
index.add(id, text);
index.search(text, limit);
index.search(text, options);
index.search(text, limit, options);
index.search(options);
document.add(doc);
document.add(id, doc);
document.search(text, limit);
document.search(text, options);
document.search(text, limit, options);
document.search(options);
worker.add(id, text);
worker.search(text, limit);
worker.search(text, options);
worker.search(text, limit, options);
worker.search(text, limit, options, callback);
worker.search(options);
The worker inherits from type Index and does not inherit from type Document. Therefore, a WorkerIndex basically works like a standard FlexSearch Index. Worker support in documents needs to be enabled by passing the appropriate option during creation: { worker: true }.
Global methods:
Index + WorkerIndex methods:
- Index.add(id, string) *
- Index.append(id, string) *
- Index.search(string, <limit>, <options>) *
- Index.update(id, string) *
- Index.remove(id) *
- async Index.export(handler)
- async Index.import(key, data)
Document methods:
- Document.add(id, document) *
- Document.append(id, document) *
- Document.search(string, <limit>, <options>) *
- Document.update(id, document) *
- Document.remove(id || document) *
- async Document.export(handler)
- async Document.import(key, data)
* For each of those methods there exists an asynchronous equivalent:
Async Version:
- async .addAsync( ... , <callback>)
- async .appendAsync( ... , <callback>)
- async .searchAsync( ... , <callback>)
- async .updateAsync( ... , <callback>)
- async .removeAsync( ... , <callback>)
The methods export and import are always async, as is every method you call on a Worker-based Index.
FlexSearch is highly customizable. Making use of the right options can really improve your results as well as memory economy and query time.
Option | Values | Description |
preset | "memory", "performance", "match", "score", "default" | The configuration profile as a shortcut or as a base for your custom settings. |
tokenize | "strict", "forward", "reverse", "full", function() | The indexing mode (tokenizer). Choose one of the built-ins or pass a custom tokenizer function. |
cache | Boolean, Number | Enable/Disable and/or set the capacity of cached entries. When passing a number as a limit, the cache automatically balances stored entries by their popularity. Note: When just using "true", the cache has no limit and grows unbounded. |
resolution | Number | Sets the scoring resolution (default: 9). |
context | Boolean, Context Options | Enable/Disable contextual indexing. When passing "true" as the value, the default values for the context are used. |
Additional Options for Language Encoding: | | |
charset | Charset Payload, String (key) | Provide a custom charset payload or pass one of the keys of built-in charsets. |
language | Language Payload, String (key) | Provide a custom language payload or pass one of the keys of built-in languages. |
encode | false, "default", "simple", "balance", "advanced", "extra", function(str):[words] | The encoding type. Choose one of the built-ins or pass a custom encoding function. |
stemmer | false, String, Function | Disable or pass in language shorthand flag (ISO-3166) or a custom object. |
filter | false, String, Function | Disable or pass in language shorthand flag (ISO-3166) or a custom array. |
Additional Options for Document Indexes: | | |
worker | Boolean | Enable/Disable and set count of running worker threads. |
document | Document Descriptor | Includes definitions for the document index and storage. |
Option | Values | Description |
resolution | {number} | Sets the scoring resolution for the context (default: 1). |
depth | false, {number} | Enable/Disable contextual indexing and also sets the contextual distance of relevance. Depth is the maximum number of words/tokens a term may be away to still be considered relevant. |
bidirectional | false, true | Enable/Disable bidirectional context. When enabled, a source text containing "red hat" is found by the queries "red hat" and "hat red". |
Option | Values | Description |
id | String | |
tag | String | |
index | String, Array | |
store | String, Array | |
Option | Values | Description |
split | RegExp, string | The rule to split words when using a non-custom tokenizer (built-ins, e.g. "forward"). Use a string/char or a regular expression (default: /\W+/ ). |
rtl | true, false | Enables Right-To-Left encoding. |
encode | function(str) => [words] | The custom encoding function. |
Option | Values | Description |
stemmer | false, String, Function | Disable or pass in language shorthand flag (ISO-3166) or a custom object. |
filter | false, String, Function | Disable or pass in language shorthand flag (ISO-3166) or a custom array. |
matcher | false, String, Function | Disable or pass in language shorthand flag (ISO-3166) or a custom array. |
Option | Values | Description |
limit | number | Sets the limit of results. |
offset | number | Enables paginated results. |
suggest | true, false | Enables suggestions in results. |
* In addition to the Index search options above.
Option | Values | Description |
enrich | true, false | Enriches the IDs in the result with the corresponding stored documents. |
index | string, Array<string> | Sets the document fields which should be searched. When no field is set, all fields will be searched. Custom options per field are also supported. |
bool | "and", "or" | Sets the used logical operator when searching through multiple fields. |
tag | string, Array<string> | Sets the tags the results should be filtered by. When multiple tags are passed, they are combined via the logical operator set in "bool". |
The tokenizer affects the required memory as well as the query time and the flexibility of partial matches. Try to choose the topmost of these tokenizers that fits your needs:
Option | Description | Example (indexed parts of "foobar") | Memory Factor (n = length of word) |
"strict" | index whole words | foobar | * 1 |
"forward" | incrementally index words in forward direction | fo, foob | * n |
"reverse" | incrementally index words in both directions | foobar, obar | * 2n - 1 |
"full" | index every possible combination | ba, oob | * n * (n - 1) |
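To make the memory factors above more tangible, the following sketch lists the partial terms each mode would register for a single word. It only illustrates the idea behind the table; it is not the library's internal implementation, and the helper name tokensFor is hypothetical.

```javascript
// Hypothetical helper: lists the partial terms a tokenizer mode would
// register for one word. Illustration only, not FlexSearch internals.
function tokensFor(word, mode) {
  const tokens = new Set();
  switch (mode) {
    case "strict": // the whole word only
      tokens.add(word);
      break;
    case "forward": // every prefix of the word
      for (let end = 1; end <= word.length; end++) tokens.add(word.slice(0, end));
      break;
    case "reverse": // every prefix plus every suffix
      for (let end = 1; end <= word.length; end++) tokens.add(word.slice(0, end));
      for (let start = 1; start < word.length; start++) tokens.add(word.slice(start));
      break;
    case "full": // every contiguous substring
      for (let start = 0; start < word.length; start++) {
        for (let end = start + 1; end <= word.length; end++) {
          tokens.add(word.slice(start, end));
        }
      }
      break;
    default:
      throw new Error("Unknown mode: " + mode);
  }
  return Array.from(tokens);
}

console.log(tokensFor("foobar", "forward"));
```

The more partial terms a mode registers, the more flexible partial matching becomes, at the cost of a larger index.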
Encoding affects the required memory as well as the query time and phonetic matches. Try to choose the topmost of these encoders that fits your needs, or pass in a custom encoder:
Option | Description | False-Positives | Compression |
false | Turn off encoding | no | no |
"default" (default) | Case in-sensitive encoding | no | no |
"simple" | Phonetic normalizations | no | ~ 7% |
"balance" | Phonetic normalizations + literal transformations | no | ~ 25% |
"advanced" | Phonetic normalizations + advanced literal transformations | no | ~ 35% |
"extra" | Phonetic normalizations + Soundex transformations | yes | ~ 60% |
function() | Pass custom encoding via function(string):[words] |
var index = new Index();
Create a new index and choose one of the presets:
var index = new Index("performance");
Create a new index with custom options:
var index = new Index({
// default values:
charset: "latin:extra",
tokenize: "reverse",
resolution: 9
});
Create a new index and extend a preset with custom options:
var index = new FlexSearch("memory", {
encode: "balance",
tokenize: "forward",
threshold: 0
});
See all available custom options.
Every content item which should be added to the index needs an ID. When your content has no ID, you need to create one by passing an index, a count or something else as an ID (a value of type number is highly recommended). Those IDs are unique references to a given content. This is important when you update or add content under existing IDs. When referencing is not a concern, you can simply use something like count++.
Index.add(id, string)
index.add(10025, "John Doe");
Index.search(string | options, <limit>, <callback>)
index.search("John");
Limit the result:
index.search("John", 10);
You can check if an ID was already indexed by:
if(index.contain(1)){
console.log("ID is already in index");
}
The "async" option was removed. Instead, you can call each method in its async version, e.g. index.addAsync or index.searchAsync.
The advantage is that you can now use both variations on the same index, whereas the old version performed all methods asynchronously when the option flag was set.
You can assign callbacks to each async function:
index.addAsync(id, content, function(){
console.log("Task Done");
});
index.searchAsync(query, function(result){
console.log("Results: ", result);
});
Or omit the callback function to get back a Promise instead:
index.addAsync(id, content).then(function(){
console.log("Task Done");
});
index.searchAsync(query).then(function(result){
console.log("Results: ", result);
});
Or use async and await:
async function add(){
await index.addAsync(id, content);
console.log("Task Done");
}
async function search(){
const results = await index.searchAsync(query);
console.log("Results: ", results);
}
You can append contents to an existing index like:
index.append(id, content);
This will not overwrite the old indexed contents, as index.update(id, content) would. Keep in mind that index.add(id, content) will also perform an "update" under the hood when the id was already indexed.
Appended contents will have their own context and also their own full resolution. Therefore, the relevance isn't stacked but gets its own context.
Let us take this example:
index.add(0, "some index");
index.append(0, "some appended content");
index.add(1, "some text");
index.append(1, "index appended content");
When you query index.search("index"), you will get index id 1 as the first entry in the result, because the context starts from zero for the appended data (it isn't stacked onto the old context) and here "index" is the first term.
If you don't want this behavior, just use index.add(id, content) and provide the full content.
Pass custom options for each query:
index.search({
query: "John",
limit: 1000,
threshold: 5, // >= threshold
depth: 3, // <= depth
callback: function(results){
// ...
}
});
The same from above could also be written as:
index.search("John", {
limit: 1000,
threshold: 5,
depth: 3
}, function(results){
// ....
});
See all available custom search options.
FlexSearch provides cursor-based pagination, which hooks into the innermost search process. This enables many performance improvements.
The cursor implementation may change often. Just take the cursor as it is and do not expect any specific value or format.
To enable pagination, pass a page field within the custom search object (optionally also a limit as the maximum number of items per page).
Get the first page of results:
var response = index.search("John Doe", {
limit: 5,
page: true
});
Whenever a page is passed within a custom search, the response has this format:
{
"page": "xxx:xxx",
"next": "xxx:xxx",
"result": []
}
- page is the pointer to the current page
- next is the pointer to the next page or null when no pages are left
- result yields the searching results
Get the second (next) page of results:
index.search("John Doe", {
limit: 10,
page: response.next
});
The limit can be modified for each query.
Get also suggestions for a query:
index.search({
query: "John Doe",
suggest: true
});
When suggestions are enabled, all results will be filled up (until the limit, default 1000) with similar matches ordered by relevance.
Phonetic suggestions are currently not supported; for that purpose use an encoder and tokenizer which provide similar functionality. Suggestions come into play when a query has multiple words/phrases. Assume a query contains 3 words. When the index matches just 2 of the 3 words, you would normally get no results, but with suggestions enabled you will also get results where 2 of 3 or 1 of 3 words were matched (depending on the limit), also sorted by relevance.
Note: It is planned to improve this feature and provide more flexibility.
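The "2 of 3 words" behavior described above can be pictured with a small sketch. It assumes you already have one list of matching IDs per query term and ranks the IDs by how many terms they matched. This is a conceptual illustration only; the helper name rankByMatchedTerms is hypothetical and the library's actual relevance ordering may differ.

```javascript
// Hypothetical helper: ranks ids by the number of query terms they matched.
// idsPerTerm holds one array of matching ids per query term.
function rankByMatchedTerms(idsPerTerm, limit) {
  const counts = new Map();
  for (const ids of idsPerTerm) {
    // Count each id at most once per term.
    for (const id of new Set(ids)) {
      counts.set(id, (counts.get(id) || 0) + 1);
    }
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1]) // most matched terms first
    .slice(0, limit)
    .map(([id]) => id);
}

// Three query terms: only id 3 matches all of them,
// ids matching fewer terms fill up the result until the limit.
console.log(rankByMatchedTerms([[1, 2, 3], [2, 3], [3, 4]], 3));
```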
Index.update(id, string)
index.update(10025, "Road Runner");
Index.remove(id)
index.remove(10025);
index.clear();
index.destroy();
Index.init(<options>)
Initialize (with same options):
index.init();
Initialize with new options:
index.init({
/* options */
});
Re-initialization will also destroy the old index.
Get the length of an index:
var length = index.length;
Get the index (register) of an instance:
var index = index.index;
The register has the format "@" + id.
Important: Do not modify manually, just use it as read-only.
FlexSearch.registerMatcher({REGEX: REPLACE})
Add global matchers for all instances:
FlexSearch.registerMatcher({
'ä': 'a', // replaces all 'ä' to 'a'
'ó': 'o',
'[ûúù]': 'u' // replaces multiple
});
Add private matchers for a specific instance:
index.addMatcher({
'ä': 'a', // replaces all 'ä' to 'a'
'ó': 'o',
'[ûúù]': 'u' // replaces multiple
});
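As a rough picture of what such a matcher map expresses, the sketch below treats each key as a regular expression and replaces every occurrence with the assigned value. It is a standalone illustration under that assumption, not the library's implementation; the helper name applyMatchers is hypothetical.

```javascript
// Hypothetical helper: applies a {pattern: replacement} map to a string.
// Each key is compiled as a global regular expression.
function applyMatchers(str, matchers) {
  for (const [pattern, replacement] of Object.entries(matchers)) {
    str = str.replace(new RegExp(pattern, "g"), replacement);
  }
  return str;
}

console.log(applyMatchers("äpfel únd ó", {
  "ä": "a",
  "ó": "o",
  "[ûúù]": "u"
}));
```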
Assign a custom encoder by passing a function during index creation/initialization:
var index = new FlexSearch({
encode: function(str){
// do something with str ...
return str;
}
});
The encoder function gets a string as a parameter and has to return the modified string.
Call a custom encoder directly:
var encoded = index.encode("sample text");
FlexSearch.registerEncoder(name, encoder)
Global encoders can be shared/used by all instances.
FlexSearch.registerEncoder("whitespace", function(str){
return str.replace(/\s/g, "");
});
Initialize index and assign a global encoder:
var index = new FlexSearch({ encode: "whitespace" });
Call a global encoder directly:
var encoded = FlexSearch.encode("whitespace", "sample text");
FlexSearch.registerEncoder('mixed', function(str){
str = this.encode("icase", str); // built-in
str = this.encode("whitespace", str); // custom
// do something additional with str ...
return str;
});
A tokenizer splits words into components or chunks.
Define a private custom tokenizer during creation/initialization:
var index = new FlexSearch({
tokenize: function(str){
return str.split(/[\s\-\/]+/); // split on whitespace, "-" or "/"
}
});
The tokenizer function gets a string as a parameter and has to return an array of strings (parts).
Stemmer: several linguistic mutations of the same word (e.g. "run" and "running")
Filter: a blacklist of words to be filtered out from indexing at all (e.g. "and", "to" or "be")
Assign a private custom stemmer or filter during creation/initialization:
var index = new FlexSearch({
stemmer: {
// object {key: replacement}
"ational": "ate",
"tional": "tion",
"enci": "ence",
"ing": ""
},
filter: [
// array blacklist
"in",
"into",
"is",
"isn't",
"it",
"it's"
]
});
Using a custom filter, e.g.:
var index = new FlexSearch({
filter: function(value){
// just add values with length > 1 to the index
return value.length > 1;
}
});
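The following sketch shows one way the stemmer object and the filter (array or function) from the examples above could act on a list of words. It assumes the stemmer replaces matching word endings and the filter drops unwanted words; it is an illustration only, and the helper name applyStemmerAndFilter is hypothetical.

```javascript
// Hypothetical helper: drops filtered words, then rewrites word endings.
// stemmer: object {suffix: replacement}; filter: array blacklist or function.
function applyStemmerAndFilter(words, stemmer, filter) {
  const blocked = Array.isArray(filter) ? new Set(filter) : null;
  const keep = word =>
    blocked ? !blocked.has(word)
            : typeof filter === "function" ? filter(word) : true;
  return words.filter(keep).map(word => {
    for (const [suffix, replacement] of Object.entries(stemmer || {})) {
      // Only the first matching suffix is applied.
      if (word.length > suffix.length && word.endsWith(suffix)) {
        return word.slice(0, -suffix.length) + replacement;
      }
    }
    return word;
  });
}

console.log(applyStemmerAndFilter(
  ["running", "is", "relational", "it"],
  { "ational": "ate", "tional": "tion", "ing": "" },
  ["is", "it"]
));
```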
Or assign stemmer/filters globally to a language:
Stemmers are passed as an object (key-value pairs), filters as an array.
FlexSearch.registerLanguage("us", {
stemmer: { /* ... */ },
filter: [ /* ... */ ]
});
Or use some pre-defined stemmer or filter of your preferred languages:
<html>
<head>
<script src="js/flexsearch.min.js"></script>
<script src="js/lang/en.min.js"></script>
<script src="js/lang/de.min.js"></script>
</head>
...
Now you can assign built-in stemmer during creation/initialization:
var index_en = new FlexSearch({
stemmer: "en",
filter: "en"
});
var index_de = new FlexSearch({
stemmer: "de",
filter: [ /* custom */ ]
});
In Node.js you just have to require the language pack files to make them available:
require("flexsearch.js");
require("lang/en.js");
require("lang/de.js");
It is also possible to compile language packs into the build as follows:
node compile SUPPORT_LANG_EN=true SUPPORT_LANG_DE=true
Set the tokenizer at least to "reverse" or "full" when using RTL.
Just set the field "rtl" to true and use a compatible tokenizer:
var index = FlexSearch.create({
encode: "icase",
tokenize: "reverse",
rtl: true
});
Set a custom tokenizer which fits your needs, e.g.:
var index = FlexSearch.create({
encode: false,
tokenize: function(str){
return str.replace(/[\x00-\x7F]/g, "").split("");
}
});
You can also pass a custom encoder function to apply some linguistic transformations.
index.add(0, "一个单词");
var results = index.search("单词");
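To see what the custom tokenizer from above produces, it can be run standalone. The function name cjkTokenize is only used here for illustration.

```javascript
// Same logic as the custom tokenizer above: strip ASCII characters,
// then split the remainder into single characters.
function cjkTokenize(str) {
  return str.replace(/[\x00-\x7F]/g, "").split("");
}

console.log(cjkTokenize("一个单词 abc"));
```

Note that split("") works on UTF-16 code units, so characters outside the Basic Multilingual Plane would be split into two halves. Array.from(str) is an alternative when such characters are expected.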
Assuming our document has a data structure like this:
{
"id": 0,
"content": "some text"
}
Old syntax FlexSearch v0.6.3 (not supported anymore!):
const index = new Document({
doc: {
id: "id",
field: ["content"]
}
});
The document descriptor has slightly changed: there is no field branch anymore. Instead, the definition moves one level higher, so key becomes a main member of options.
For the new syntax the field "doc" was renamed to document and the field "field" was renamed to index:
const index = new Document({
document: {
id: "id",
index: ["content"]
}
});
index.add({
id: 0,
content: "some text"
});
The field id describes where the ID or unique key lives inside your documents. The key defaults to id when not passed, so you can shorten the example from above to:
const index = new Document({
document: {
index: ["content"]
}
});
The member index has a list of fields which you want to be indexed from your documents. When selecting just one field, you can pass a string. When also using the default key id, this shortens to just:
const index = new Document({ document: "content" });
index.add({ id: 0, content: "some text" });
Assuming you have several fields, you can add multiple fields to the index:
var docs = [{
id: 0,
title: "Title A",
content: "Body A"
},{
id: 1,
title: "Title B",
content: "Body B"
}];
const index = new Document({
id: "id",
index: ["title", "content"]
});
You can pass custom options for each field:
const index = new Document({
id: "id",
index: [{
field: "title",
tokenize: "forward",
optimize: true,
resolution: 9
},{
field: "content",
tokenize: "strict",
optimize: true,
resolution: 5,
minlength: 3,
context: {
depth: 1,
resolution: 3
}
}]
});
Field options are inherited from the global options when both are passed, e.g.:
const index = new Document({
tokenize: "strict",
optimize: true,
resolution: 9,
document: {
id: "id",
index:[{
field: "title",
tokenize: "forward"
},{
field: "content",
minlength: 3,
context: {
depth: 1,
resolution: 3
}
}]
}
});
Note: The context options of the field "content" are also inherited from the corresponding field options, which in turn are inherited from the global options.
Assume the document array looks more complex (has nested branches etc.), e.g.:
{
"record": {
"id": 0,
"title": "some title",
"content": {
"header": "some text",
"footer": "some text"
}
}
}
Then use the colon separated notation "root:child:child" to define hierarchy within the document descriptor:
const index = new Document({
document: {
id: "record:id",
index: [
"record:title",
"record:content:header",
"record:content:footer"
]
}
});
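The colon notation can be read as a path through the nested document. The sketch below resolves such a path against a plain object to show which value a descriptor entry points to. It is an illustration only (it does not handle the [] array notation), and the helper name resolvePath is hypothetical.

```javascript
// Hypothetical helper: follows a "root:child:child" path through an object.
// Returns undefined when any part of the path is missing.
function resolvePath(doc, path) {
  return path.split(":").reduce(
    (node, key) => (node === null || node === undefined ? undefined : node[key]),
    doc
  );
}

const sample = {
  record: {
    id: 0,
    title: "some title",
    content: { header: "some text", footer: "some text" }
  }
};

console.log(resolvePath(sample, "record:id"));
console.log(resolvePath(sample, "record:content:header"));
```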
Only add the fields you want to query against. Do not index fields that you only need in the result (but never query against). For this purpose you can store documents independently of the index (read below).
When you want to query a field, you have to pass the exact key of the field as defined in the document descriptor as the field name (with colon syntax):
index.search(query, {
index: [
"record:title",
"record:content:header",
"record:content:footer"
]
});
Same as:
index.search(query, [
"record:title",
"record:content:header",
"record:content:footer"
]);
Using field-specific options:
index.search([{
field: "record:title",
query: "some query",
limit: 100,
suggest: true
},{
field: "record:title",
query: "some other query",
limit: 100,
suggest: true
}]);
You can perform a search through the same field with different queries.
When passing field-specific options you need to provide the full configuration for each field. Unlike the document descriptor, they are not inherited.
You need to follow 2 rules for your documents:
- The document cannot start with an Array at the root index. This will introduce sequential data and isn't supported yet. See below for a workaround for such data.
[ // <-- not allowed as document start!
{
"id": 0,
"title": "title"
}
]
- The id can't be nested inside an array (and none of its parent fields can be an array either). This would introduce sequential data, which isn't supported yet. See below for a workaround for such data.
{
"records": [ // <-- not allowed when ID or tag lives inside!
{
"id": 0,
"title": "title"
}
]
}
Here is an example of a supported complex document:
{
"meta": {
"tag": "cat",
"id": 0
},
"contents": [
{
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
},
{
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
}
]
}
The corresponding document descriptor (when all fields should be indexed) looks like:
const index = new Document({
document: {
id: "meta:id",
tag: "meta:tag",
index: [
"contents[]:body:title",
"contents[]:body:footer",
"contents[]:keywords"
]
}
});
Again, when searching you have to use the same colon-separated-string from your field definition.
index.search(query, {
index: "contents[]:body:title"
});
This example breaks both rules from above:
[ // <-- not allowed as document start!
{
"tag": "cat",
"records": [ // <-- not allowed when ID or tag lives inside!
{
"id": 0,
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
},
{
"id": 1,
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
}
]
}
]
You need to apply some kind of structure normalization.
A workaround to such a data structure looks like this:
const index = new Document({
document: {
id: "record:id",
tag: "tag",
index: [
"record:body:title",
"record:body:footer",
"record:keywords"
]
}
});
function add(sequential_data){
for(let x = 0, data; x < sequential_data.length; x++){
data = sequential_data[x];
for(let y = 0, record; y < data.records.length; y++){
record = data.records[y];
index.add({
id: record.id,
tag: data.tag,
record: record
});
}
}
}
// now just use add() helper method as usual:
add([{
// sequential structured data
// take the data example above
}]);
You can skip the first loop when your document data has just one index as the outer array.
Just pass the document array (or a single object) to the index:
index.add(docs);
Update index with a single object or an array of objects:
index.update({
data:{
id: 0,
title: "Foo",
body: {
content: "Bar"
}
}
});
Remove a single object or an array of objects from the index:
index.remove(docs);
When the id is known, you can also simply remove by (faster):
index.remove(id);
In the complex example above, the field keywords is an array, but the notation did not use brackets like keywords[]. The array will still be detected, but instead of appending each entry to a new context, the array will be joined into one large string and added to the index.
The difference between the two ways of adding array contents is the relevance when searching. When adding each item of an array via append() to its own context by using the syntax field[], the relevance of the last entry competes with that of the first entry. When you leave out the brackets in the notation, the array is joined into one whitespace-separated string. Here the first entry has the highest relevance, whereas the last entry has the lowest relevance.
So assuming the keywords from the example above are pre-sorted by relevance or popularity, you want to keep this order (information of relevance). For this purpose do not add brackets to the notation. Otherwise, the entries would each get a new scoring context (the old order gets lost).
You can also leave out the bracket notation for better performance and a smaller memory footprint. Use it when you do not need the granularity of relevance per entry.
Search through all fields:
index.search(query);
Search through a specific field:
index.search(query, { index: "title" });
Search through a given set of fields:
index.search(query, { index: ["title", "content"] });
Same as:
index.search(query, ["title", "content"]);
Pass custom modifiers and queries to each field:
index.search([{
field: "content",
query: "some query",
limit: 100,
suggest: true
},{
field: "content",
query: "some other query",
limit: 100,
suggest: true
}]);
You can perform a search through the same field with different queries.
See all available field-search options.
One of the few breaking changes which requires migration of your old implementation is the result set. I thought about it for a long time and came to the conclusion that this new structure might look weird at first, but it also comes with some nice new capabilities.
Schema of the result-set:
fields[] => { field, result[] => { document }}
The first level is an array of the fields the query was applied to. Each of these fields has a record (object) with the 2 properties "field" and "result". The "result" is also an array and includes the result for this specific field. The result can be an array of IDs or can be enriched with stored document data.
A non-enriched result set now looks like:
[{
field: "title",
result: [0, 1, 2]
},{
field: "content",
result: [3, 4, 5]
}]
An enriched result set now looks like:
[{
field: "title",
result: [
{ id: 0, doc: { /* document */ }},
{ id: 1, doc: { /* document */ }},
{ id: 2, doc: { /* document */ }}
]
},{
field: "content",
result: [
{ id: 3, doc: { /* document */ }},
{ id: 4, doc: { /* document */ }},
{ id: 5, doc: { /* document */ }}
]
}]
When using pluck instead of "field" you can explicitly select just one field and get back a flat representation:
index.search(query, { pluck: "title", enrich: true });
[
{ id: 0, doc: { /* document */ }},
{ id: 1, doc: { /* document */ }},
{ id: 2, doc: { /* document */ }}
]
This change is basically motivated by "boolean search". Instead of applying your bool logic to a nested object (which almost always ends in structural hell), you can apply your logic yourself on top of the result set dynamically. This opens up huge capabilities in how you process the results. Therefore, the results from the fields aren't squashed into one result anymore. That keeps some important information, like the name of the field as well as the relevance of each field's results, which no longer get mixed.
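As a minimal sketch of applying your own logic on top of the result set, the helpers below combine a non-enriched result of the shape [{ field, result: [ids] }] with "or" (union) and "and" (intersection) semantics. The helper names unionIds and intersectIds are hypothetical and not part of the library.

```javascript
// Hypothetical helpers for a non-enriched result set:
// [{ field: "title", result: [ids] }, { field: "content", result: [ids] }]

// "or": every id that appears in at least one field result.
function unionIds(resultSet) {
  const seen = new Set();
  for (const entry of resultSet) {
    for (const id of entry.result) seen.add(id);
  }
  return Array.from(seen);
}

// "and": only ids that appear in every field result,
// in the order of the first field.
function intersectIds(resultSet) {
  if (resultSet.length === 0) return [];
  const [first, ...rest] = resultSet.map(entry => entry.result);
  const restSets = rest.map(ids => new Set(ids));
  return first.filter(id => restSets.every(set => set.has(id)));
}

const resultSet = [
  { field: "title", result: [0, 1, 2] },
  { field: "content", result: [2, 3, 0] }
];

console.log(unionIds(resultSet));
console.log(intersectIds(resultSet));
```

Because the per-field results stay separate, you could just as well weight one field higher than another before merging.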
A field search will apply a query with the boolean "or" logic by default. Each field has its own result to the given query.
There is one situation where the bool property is still supported: when you want to switch the default "or" logic of the field search to "and", e.g.:
index.search(query, {
index: ["title", "content"],
bool: "and"
});
You will only get results which contain the query in both fields. That's it.
Like the key for the ID, just define the path to the tag:
const index = new Document({
document: {
id: "id",
tag: "tag",
index: "content"
}
});
index.add({
id: 0,
tag: "cat",
content: "Some content ..."
});
Your data can also have multiple tags as an array:
index.add({
id: 1,
tag: ["animal", "dog"],
content: "Some content ..."
});
You can perform a tag-specific search by:
index.search(query, {
index: "content",
tag: "animal"
});
This only gives you results which are tagged with the given tag.
Use multiple tags when searching:
index.search(query, {
index: "content",
tag: ["cat", "dog"]
});
This gives you results which are tagged with one of the given tags.
Multiple tags are applied with the boolean "or" by default. Only one of the tags needs to exist.
This is another situation where the bool property is still supported: when you would like to switch the default "or" logic of the tag search to "and", e.g.:
index.search(query, {
index: "content",
tag: ["dog", "animal"],
bool: "and"
});
You will only get results which contain both tags (in this example there is just one record which has both the tags "dog" and "animal").
You can also fetch results from one or more tags when no query was passed:
index.search({ tag: ["cat", "dog"] });
In this case the result-set looks like:
[{
tag: "cat",
result: [ /* all cats */ ]
},{
tag: "dog",
result: [ /* all dogs */ ]
}]
By default, every query is limited to 100 entries. Unbounded queries lead to issues. You need to set the limit as an option to adjust the size.
You can set the limit and the offset for each query:
index.search(query, { limit: 20, offset: 100 });
You cannot pre-count the size of the result-set. That's a limitation by design of FlexSearch. When you really need a count of all results which you are able to page through, just assign a high enough limit, get back all results and apply your paging offset manually (this also works on the server side). FlexSearch is fast enough that this isn't an issue.
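A minimal sketch of such manual paging, assuming you already fetched a flat array of results with a high limit. The helper name `paginate` and the sample array are hypothetical; the array only stands in for what a search with a high limit would return.

```javascript
// Hypothetical helper (not part of FlexSearch): slice one page out of a full result array.
// `page` is zero-based.
function paginate(results, page, pageSize) {
  const offset = page * pageSize;
  return {
    total: results.length,                        // count of all results
    pages: Math.ceil(results.length / pageSize),  // number of available pages
    items: results.slice(offset, offset + pageSize)
  };
}

// Stand-in for the IDs returned by a search with a high enough limit
const allIds = Array.from({ length: 45 }, (_, i) => i);

const page = paginate(allIds, 2, 20);
console.log(page.total); // 45
console.log(page.pages); // 3
console.log(page.items); // IDs 40 to 44
```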
Only a document index can have a store. You can use a document index instead of a flat index to get this functionality, even when only storing ID-content pairs.
You can define independently which fields should be indexed and which fields should be stored. This way you can index fields which should not be included in the search result.
Do not use a store when: 1. an array of IDs as the result is good enough, or 2. you already have the contents/documents stored elsewhere (outside the index).
When the store attribute is set, you have to explicitly include all fields which should be stored (it acts like a whitelist).
When the store attribute is not set, the original document is stored as a fallback.
This will add the whole original content to the store:
const index = new Document({
document: {
index: "content",
store: true
}
});
index.add({ id: 0, content: "some text" });
You can get indexed documents from the store:
var data = index.get(1);
You can update/change store contents directly without changing the index by:
index.set(1, data);
To update the store and also update the index, just use index.update, index.add or index.append.
When you perform a query, whether on a document index or a flat index, you will always get back an array of IDs.
Optionally you can enrich the query results automatically with stored contents by:
index.search(query, { enrich: true });
Your results now look like:
[{
id: 0,
doc: { /* content from store */ }
},{
id: 1,
doc: { /* content from store */ }
}]
This will add just specific fields from a document to the store (the ID isn't necessary to keep in store):
const index = new Document({
document: {
index: "content",
store: ["author", "email"]
}
});
index.add(id, content);
You can configure independently what should be indexed and what should be stored. It is highly recommended to make use of this whenever you can.
Here is a useful example of configuring the index and the store:
const index = new Document({
document: {
index: "content",
store: ["author", "email"]
}
});
index.add({
id: 0,
author: "Jon Doe",
email: "[email protected]",
content: "Some content for the index ..."
});
You can query through the contents and will get back the stored values instead:
index.search("some content", { enrich: true });
Your results now look like:
[{
field: "content",
result: [{
id: 0,
doc: {
author: "Jon Doe",
email: "[email protected]",
}
}]
}]
Both fields "author" and "email" are stored but not indexed.
Simply chain methods like:
var index = FlexSearch.create()
.addMatcher({'â': 'a'})
.add(0, 'foo')
.add(1, 'bar');
index.remove(0).update(1, 'foo').add(2, 'foobar');
Create an index and just set the limit of relevance as "depth":
var index = new FlexSearch({
encode: "icase",
tokenize: "strict",
threshold: 7,
depth: 3
});
Only the tokenizer "strict" is actually supported by the contextual index.
The contextual index requires an additional amount of memory depending on the depth.
Try to use the lowest depth and highest threshold which fit your needs.
It is possible to modify the values for threshold and depth during search (see custom search). The restriction is that the threshold can only be raised, whereas the depth can only be lowered.
You need to initialize the cache and its limit during the creation of the index:
const index = new Index({ cache: 100 });
const results = index.searchCache(query);
A common scenario for using a cache is an autocomplete or instant search when typing.
When passing a number as a limit, the cache automatically balances stored entries by their popularity.
When just using "true", the cache is unbounded and actually performs 2-3 times faster (because the balancer does not have to run).
The whole worker implementation has changed, also keeping Node.js support in mind. The good news is that workers are now also supported in Node.js by the library.
One important change is how workers divide their tasks and how contents are distributed. One big issue was that in the old model workers were cycled for each task (Round Robin). Theoretically that provides an optimal balance of workload and storage. But it breaks the internal architecture of this search library, and almost every performance optimization gets lost.
Let us take an example. Assuming you have 4 workers and you add 4 contents to the index, then each content is delegated to one worker (a perfect balance, but each index becomes a partial index).
Old syntax FlexSearch v0.6.3 (not supported anymore!):
const index = new FlexSearch({ worker: 4 });
index.add(1, "some")
.add(2, "content")
.add(3, "to")
.add(4, "index");
Worker 1: { 1: "some" }
Worker 2: { 2: "content" }
Worker 3: { 3: "to" }
Worker 4: { 4: "index" }
The issue starts when you query a term. Each worker has to resolve the search on its own index and has to send the results back to apply the intersection calculation. That's the problem. None of the workers can solve a search task completely; they have to transmit intermediate results back. Therefore, no optimization path can be applied early, because every worker has to send back the full (non-limited) result first.
The new worker model from v0.7.0 is divided by "fields" of the document (1 worker = 1 field index). This way each worker is able to solve tasks (subtasks) completely. The downside of this paradigm is that storage might not be perfectly balanced (fields may have contents of different lengths). On the other hand, there is no indication that balancing the storage gives any advantage (the workers all require the same amount in total).
const index = new Document({
index: ["tag", "name", "title", "text"],
worker: true
});
index.add({
id: 1, tag: "cat", name: "Tom", title: "some", text: "some"
}).add({
id: 2, tag: "dog", name: "Ben", title: "title", text: "content"
}).add({
id: 3, tag: "cat", name: "Max", title: "to", text: "to"
}).add({
id: 4, tag: "dog", name: "Tim", title: "index", text: "index"
});
Worker 1: { 1: "cat", 2: "dog", 3: "cat", 4: "dog" }
Worker 2: { 1: "Tom", 2: "Ben", 3: "Max", 4: "Tim" }
Worker 3: { 1: "some", 2: "title", 3: "to", 4: "index" }
Worker 4: { 1: "some", 2: "content", 3: "to", 4: "index" }
When you perform a field search through all fields then this task is perfectly balanced through all workers, which can solve their subtasks independently.
Above we have seen that documents automatically create a worker for each field. You can also create a WorkerIndex directly (the same as using Index instead of Document).
Use as ES6 module:
import WorkerIndex from "./worker/index.js";
const index = new WorkerIndex(options);
index.add(1, "some")
.add(2, "content")
.add(3, "to")
.add(4, "index");
Or when the bundled version is used instead:
var index = new FlexSearch.Worker(options);
index.add(1, "some")
.add(2, "content")
.add(3, "to")
.add(4, "index");
Such a WorkerIndex works pretty much the same as an instance of Index.
A WorkerIndex only supports the async variant of all methods. That means when you call index.search() on a WorkerIndex, it will also run asynchronously, the same way index.searchAsync() does.
The worker model for Node.js is based on "worker threads" and works exactly the same way:
const { Document } = require("flexsearch");
const index = new Document({
index: ["tag", "name", "title", "text"],
worker: true
});
Or create a single worker instance for a non-document index:
const { Worker } = require("flexsearch");
const index = new Worker(options);
A worker will always run asynchronously. On a query method call you should always handle the returned promise (e.g. use await) or pass a callback function as the last parameter.
const index = new Document({
index: ["tag", "name", "title", "text"],
worker: true
});
All requests and sub-tasks will run in parallel (prioritize "all tasks completed"):
index.searchAsync(query, callback);
index.searchAsync(query, callback);
index.searchAsync(query, callback);
Also (prioritize "all tasks completed"):
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
Or when you have just one callback for when all requests are done, simply use Promise.all(), which also prioritizes "all tasks completed":
Promise.all([
index.searchAsync(query).then(callback),
index.searchAsync(query).then(callback),
index.searchAsync(query).then(callback)
]).then(callback);
Inside the callback of Promise.all() you will also get an array of results as the first parameter, one for each query you passed in, in the same order.
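To illustrate how the results array lines up with the queries, here is a minimal sketch. The object `stubWorkerIndex` is a hypothetical stand-in that only mimics `searchAsync()` with a naive substring match, so the example runs without the library; `searchAll` is likewise an assumed helper name.

```javascript
// Hypothetical stand-in for a worker index: only searchAsync() is mimicked.
const stubWorkerIndex = {
  contents: { 1: "some content", 2: "more content", 3: "other text" },
  searchAsync(query) {
    const ids = Object.keys(this.contents)
      .filter(id => this.contents[id].includes(query))
      .map(Number);
    return Promise.resolve(ids);
  }
};

// Runs all queries in parallel and pairs each query with its own result.
async function searchAll(index, queries) {
  const results = await Promise.all(queries.map(q => index.searchAsync(q)));
  // Promise.all keeps the input order: results[i] belongs to queries[i]
  return queries.map((query, i) => ({ query, result: results[i] }));
}

searchAll(stubWorkerIndex, ["content", "other"]).then(pairs => {
  for (const { query, result } of pairs) {
    console.log(query, result); // "content" -> IDs 1, 2 and "other" -> ID 3
  }
});
```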
When using await you can prioritize the order (prioritize "first task completed"), solve requests one by one and just process the sub-tasks in parallel:
await index.searchAsync(query);
await index.searchAsync(query);
await index.searchAsync(query);
The same applies to index.add(), index.append(), index.remove() or index.update(). There is a special case here which isn't disabled by the library, but which you need to keep in mind when using workers.
When you call the "synced" version on a worker index:
index.add(doc);
index.add(doc);
index.add(doc);
// contents aren't indexed yet,
// they are just queued on the message channel
Of course you can do that, but keep in mind that the main thread does not have an additional queue for distributed worker tasks. Running these in a long loop fires content massively to the message channel via worker.postMessage() internally. Luckily the browser and Node.js will handle such incoming tasks for you automatically (as long as enough free RAM is available). When using the "synced" version on a worker index, the content isn't indexed one line below, because all calls are treated as async by default.
When adding/updating/removing large bulks of content to the index (or at high frequency), it is recommended to use the async version along with async/await to keep a low memory footprint during long processes.
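A minimal sketch of that recommendation, assuming the index exposes an `addAsync(id, content)` method that returns a promise. The object `stubIndex` is a hypothetical stand-in for a worker index so the example is runnable on its own, and `addSequentially` is an assumed helper name.

```javascript
// Hypothetical stand-in for a worker index: addAsync() resolves once the content is "indexed".
const stubIndex = {
  indexed: new Map(),
  addAsync(id, content) {
    return new Promise(resolve => {
      setTimeout(() => {
        this.indexed.set(id, content);
        resolve();
      }, 0);
    });
  }
};

// Adds entries one by one. Each call waits until the previous one has finished,
// so only one task at a time is pending instead of the whole bulk.
async function addSequentially(index, entries) {
  for (const [id, content] of entries) {
    await index.addAsync(id, content);
  }
  return entries.length;
}

addSequentially(stubIndex, [[1, "some"], [2, "content"], [3, "to"], [4, "index"]])
  .then(count => console.log(count, stubIndex.indexed.size)); // 4 4
```

The trade-off is throughput: awaiting each call is slower than firing everything at once, but the number of pending messages stays bounded.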
The export has slightly changed. The export now consists of several smaller parts instead of just one large bulk. You need to pass a callback function which takes 2 arguments, "key" and "data". This callback function is called for each part, e.g.:
index.export(function(key, data){
// you need to store both the key and the data!
// e.g. use the key for the filename and save your data
localStorage.setItem(key, data);
});
Exporting data to the localStorage isn't really good practice, but if size is not a concern then use it if you like. The export primarily exists for usage in Node.js or to store indexes you want to delegate from a server to the client.
The size of the export corresponds to the memory consumption of the library. To reduce the export size you have to use a configuration with a smaller memory footprint (use the table at the bottom to get information about configs and their memory allocation).
When your save routine runs asynchronously you have to return a promise:
index.export(function(key, data){
return new Promise(function(resolve){
// do the saving as async
resolve();
});
});
You cannot export the additional table for the "fastupdate" feature. This table consists of references, and when stored they get fully serialized and become too large. The lib will handle this automatically for you. When importing data, the index automatically disables "fastupdate".
Before you can import data, you need to create your index first. For document indexes provide the same document descriptor you used when exporting the data. This configuration isn't stored in the export.
var index = new Index({ ... });
To import the data just pass a key and data:
index.import(key, localStorage.getItem(key));
You need to import every key! Otherwise, your index will not work. You need to store the keys from the export and use these keys for the import (the order of the keys can differ).
This is just for demonstration and is not recommended, because you might have other keys in your localStorage which aren't supported as an import:
var keys = Object.keys(localStorage);
for(let i = 0, key; i < keys.length; i++){
key = keys[i];
index.import(key, localStorage.getItem(key));
}
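A safer variant of the loop above is to remember the keys reported during the export and to import exactly those keys. The following is only a sketch: `createExportCollector`, `importAll` and `createStubIndex` are hypothetical names, the part keys are made up, and the stub calls the export callback synchronously, which a real index may not do.

```javascript
// Hypothetical helper: builds a callback for index.export() that records every part.
function createExportCollector(storage) {
  const keys = [];
  return {
    keys,
    callback(key, data) {
      keys.push(key);          // remember the key for the later import
      storage.set(key, data);  // persist the part (here: a Map)
    }
  };
}

// Hypothetical helper: imports exactly the recorded keys and fails loudly on a missing part.
function importAll(index, storage, keys) {
  for (const key of keys) {
    if (!storage.has(key)) throw new Error("Missing export part: " + key);
    index.import(key, storage.get(key));
  }
}

// Stand-in for an index so the sketch runs without the library.
function createStubIndex(parts = {}) {
  return {
    parts: { ...parts },
    export(callback) {
      for (const [key, data] of Object.entries(this.parts)) callback(key, data);
    },
    import(key, data) {
      this.parts[key] = data;
    }
  };
}

const storage = new Map();
const source = createStubIndex({ "part-a": "[1,2]", "part-b": "{}" });
const collector = createExportCollector(storage);
source.export(collector.callback);

const target = createStubIndex();
importAll(target, storage, collector.keys);
console.log(Object.keys(target.parts)); // "part-a", "part-b"
```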
Reference String: "Björn-Phillipp Mayer"
Query | icase | simple | advanced | extra |
björn | yes | yes | yes | yes |
björ | yes | yes | yes | yes |
bjorn | no | yes | yes | yes |
bjoern | no | no | yes | yes |
philipp | no | no | yes | yes |
filip | no | no | yes | yes |
björnphillip | no | yes | yes | yes |
meier | no | no | yes | yes |
björn meier | no | no | yes | yes |
meier fhilip | no | no | yes | yes |
byorn mair | no | no | no | yes |
(false positives) | no | no | no | yes |
The book "Gulliver's Travels" (Swift Jonathan 1726) was fully indexed for the examples below.
The most memory-optimized meaningful setting will allocate just 1.2 Mb for the whole indexed book! This is probably the smallest memory footprint you will get from a search library.
import { encode } from "./lang/latin/extra.js";
const index = new Index({
encode: encode,
tokenize: "strict",
optimize: true,
resolution: 1,
minlength: 3,
fastupdate: false,
context: false
});
The book "Gulliver's Travels" (Swift Jonathan 1726) was used for this test.
by default a lexical index is very small:
depth: 0, bidirectional: 0, resolution: 3, minlength: 0
=> 2.1 Mb
a higher resolution will increase the memory allocation:
depth: 0, bidirectional: 0, resolution: 9, minlength: 0
=> 2.9 Mb
using the contextual index will increase the memory allocation:
depth: 1, bidirectional: 0, resolution: 9, minlength: 0
=> 12.5 Mb
a higher contextual depth will increase the memory allocation:
depth: 2, bidirectional: 0, resolution: 9, minlength: 0
=> 21.5 Mb
a higher minlength will decrease memory allocation:
depth: 2, bidirectional: 0, resolution: 9, minlength: 3
=> 19.0 Mb
using bidirectional will decrease memory allocation:
depth: 2, bidirectional: 1, resolution: 9, minlength: 3
=> 17.9 Mb
enabling the option "fastupdate" will increase memory allocation:
depth: 2, bidirectional: 1, resolution: 9, minlength: 3
=> additional 6.3 Mb
Every search library is constantly in competition with these 4 properties:
- Memory Allocation
- Performance
- Matching Capabilities
- Relevance Order (Scoring)
FlexSearch provides you many parameters you can use to adjust the optimal balance for your specific use-case.
Modifier | Memory Impact * | Performance Impact ** | Matching Impact ** | Scoring Impact ** |
resolution | +1 (per level) | +1 (per level) | 0 | +2 (per level) |
depth | +4 (per level) | -1 (per level) | -10 + depth | +10 |
minlength | -2 (per level) | +2 (per level) | -3 (per level) | +2 (per level) |
bidirectional | -2 | 0 | +3 | -1 |
fastupdate | +1 | +10 (update, remove) | 0 | 0 |
optimize: true | -7 | -1 | 0 | -3 |
encoder: "icase" | 0 | 0 | 0 | 0 |
encoder: "simple" | -2 | -1 | +2 | 0 |
encoder: "advanced" | -3 | -2 | +4 | 0 |
encoder: "extra" | -5 | -5 | +6 | 0 |
encoder: "soundex" | -6 | -2 | +8 | 0 |
tokenize: "strict" | 0 | 0 | 0 | 0 |
tokenize: "forward" | +3 | -2 | +5 | 0 |
tokenize: "reverse" | +5 | -4 | +7 | 0 |
tokenize: "full" | +8 | -5 | +10 | 0 |
document index | +3 (per field) | -1 (per field) | 0 | 0 |
document tags | +1 (per tag) | -1 (per tag) | 0 | 0 |
store: true | +5 (per document) | 0 | 0 | 0 |
store: [fields] | +1 (per field) | 0 | 0 | 0 |
cache: true | +10 | +10 | 0 | 0 |
cache: 100 | +1 | +9 | 0 | 0 |
type of ids: number | 0 | 0 | 0 | 0 |
type of ids: string | +3 | -3 | 0 | 0 |
** range from -10 to 10, higher is better
- memory (primary optimize for memory)
- performance (primary optimize for performance)
- match (primary optimize for matching)
- score (primary optimize for scoring)
- default (the default balanced profile)
These profiles cover standard use cases. It is recommended to apply a custom configuration instead of using profiles to get the best out of your situation. Every profile could be optimized further for its specific task, e.g. an extremely performance-optimized configuration or an extremely memory-optimized one.
You can pass a preset during creation/initialization of the index.
It is recommended to use numeric id values as reference when adding content to the index. The byte length of passed ids influences the memory consumption significantly. If this is not possible you should consider using an index table and mapping the ids to indexes; this becomes especially important when using contextual indexes on a large amount of content.
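Such an index table could be sketched as follows. The class `IdTable` and the sample ids are hypothetical and not part of FlexSearch; the idea is only to hand compact numeric ids to the index and translate them back after a search.

```javascript
// Hypothetical helper (not part of FlexSearch): maps arbitrary ids to compact numeric ids.
class IdTable {
  constructor() {
    this.toIndex = new Map(); // external id -> numeric index
    this.toId = [];           // numeric index -> external id
  }

  // Returns the numeric index for an external id, creating one if needed.
  indexOf(id) {
    let i = this.toIndex.get(id);
    if (i === undefined) {
      i = this.toId.length;
      this.toId.push(id);
      this.toIndex.set(id, i);
    }
    return i;
  }

  // Translates numeric search results back to the external ids.
  resolve(indexes) {
    return indexes.map(i => this.toId[i]);
  }
}

const table = new IdTable();
console.log(table.indexOf("user-8f3a")); // 0
console.log(table.indexOf("user-c21d")); // 1
console.log(table.indexOf("user-8f3a")); // 0 (already known)
console.log(table.resolve([1, 0]));      // "user-c21d", "user-8f3a"
```

In use, you would pass `table.indexOf(id)` as the id when adding content and wrap the returned result array in `table.resolve(...)`.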
Whenever you can, try to divide content by categories and add each category to its own index, e.g.:
var action = new FlexSearch();
var adventure = new FlexSearch();
var comedy = new FlexSearch();
This way you can also provide different settings for each category. This is actually the fastest way to perform a fuzzy search.
To make this workaround more extensible you can use a short helper:
var index = {};
function add(id, cat, content){
(index[cat] || (
index[cat] = new FlexSearch
)).add(id, content);
}
function search(cat, query){
return index[cat] ?
index[cat].search(query) : [];
}
Add content to the index:
add(1, "action", "Movie Title");
add(2, "adventure", "Movie Title");
add(3, "comedy", "Movie Title");
Perform queries:
var results = search("action", "movie title"); // --> [1]
Splitting indexes by categories improves performance significantly.
Copyright 2018-2021 Nextapps GmbH
Released under the Apache 2.0 License