A local, in-memory search tool. Query and compare your text documents from the terminal, with results ranked by textual similarity.
$ docsim --show-scores --limit 3 --best-first "search query" ~/documents/notes
0.472 very-relevant-file.txt
0.123 slightly-similar-file.org
0.000 completely-unrelated-file.md
docsim
is an information retrieval tool, so it's different from other search
tools like grep
, ripgrep
, ag
, and so on. Those tools are all great, but
they search for literal text matches, and sometimes we want to know, "what notes
are most similar to this query, or to this other note?"
If I search for "chunky bacon," I still want to see documents that talk about
"chunks of bacon." And, below those, I probably want to see notes that discuss
regular "bacon," even if it's not chunky. docsim
uses a few different
information retrieval algorithms to provide a ranked list of text documents.
It's also slower and more memory-intensive than e.g. grep
, of course, since it
does more work. But performance is a goal, and on a mid-range machine it'll
process a few thousand documents without notable lag.
This all sounds complicated, but docsim
aspires to be easy to use and to
behave like a good UNIX citizen. It's a single binary that operates on plain
files and streams. No servers, no daemons, no dependencies on Docker containers
or scikit-learn
, not even any persistent indexes or caches to get out of sync.
Searching local documents with information retrieval algorithms shouldn't be any
harder than using grep
!
Check the man
page for the definitive documentation, but these should get
you started.
If no paths are provided docsim
will search the current working directory.
$ docsim "here's a search query"
[...]
Use the --stdin
flag to read the search query from STDIN
instead of a string
argument.
$ echo "Here's another query to search for." | docsim --stdin ~/documents/notes
[...]
Search for similar files in a given directory:
$ docsim --file some-file.txt ~/documents/notes
[...]
Find Go files similar to main.go
in the current directory. Don't use natural
language processing techniques like stemming or stoplists, since these
aren't English documents:
$ docsim --file main.go --no-stemming --no-stoplist **/*.go
[...]
Note that because docsim
uses an English stoplist and an English stemming
algorithm, you'll almost certainly want to use the --no-stoplist
and
--no-stemming
flags if your documents are written in another language
(including source code).
Optionally, you can use the --stoplist
flag to provide a custom stoplist. A
custom stoplist is just a text file of words to ignore, separated by whitespace.
WARNING: docsim
doesn't respect .ignore
or .gitignore
files yet, so
it'll try to search through .git
directories, node_modules
, and so on. That
should be fixed in the near future.
If you're using Homebrew:
$ brew tap hrs/docsim
# brew install docsim
Otherwise, grab an appropriate package from the latest release, which
includes .deb
and .rpm
packages, and precompiled binaries appropriate for
most popular platforms.
If you've got a Go toolchain handy, you can also either:
$ git clone [email protected]:hrs/docsim.git
$ cd docsim
$ sudo make install
Or just:
$ go install github.com/hrs/docsim@latest
Note that using go install
doesn't include the man
page, which you can
optionally install manually by copying it into e.g. /usr/local/share/man/man1
.
Just use the supplied make
task:
$ make test
docsim
uses TF-IDF weighting and cosine similarity to numerically
score the textual similarity between the query and every other document.
"Textual similarity" roughly means "uses the same words." Each document is parsed into a big bag of words, which are passed through a common English stoplist, stemmed (so "spins," "spinner," and "spinning" might all reduce down to just "spin"). Those terms are assigned weights based on how often they appear in the document and how rare they are in the corpus as a whole.
We can think of each of these documents as a vector in term space, where each term is a dimension with its weight as a magnitude. Two documents are "similar," then, inasmuch as they point in the same direction, so we define similarity by the size of the angle between them.
docsim
is still in a nascent state, so I'm happy just writing the code myself
for now, but please feel free to report any issues you encounter!