performances and bugs fix thanks to voodoo testing
art-w committed Jan 30, 2024
1 parent bb6d459 commit 106f8e7
Showing 46 changed files with 1,200 additions and 998 deletions.
191 changes: 50 additions & 141 deletions README.md
@@ -1,182 +1,91 @@
**Try it online at [doc.sherlocode.com](https://doc.sherlocode.com) !**

A Hoogle-like search engine for OCaml documentation. It can be used in
different ways: [online](https://doc.sherlocode.com), or offline with
the dev version of odoc.
Sherlodoc is a search engine for OCaml documentation (inspired by [Hoogle](https://hoogle.haskell.org/)), which allows you to search through OCaml libraries by names and approximate type signatures:

It has fuzzy type search supported by a polarity search. As an example, the type
`string -> int -> char` gets simplified to `{ -string, -int, +char }` which
means that it consumes a `string` and an `int` and produces a `char`
(irrespective of the order of the arguments). This polarity search is fast
enough and yields good candidates which are then sorted by similarity with the
query. The sort is slower but the number of candidates is small.
- Search by name: [`list map`](https://doc.sherlocode.com/?q=list%20map)
- Search inside documentation comments: [`raise Not_found`](https://doc.sherlocode.com/?q=raise%20Not_found)
- Fuzzy type search is introduced with a colon, e.g. [`: map -> list`](https://doc.sherlocode.com/?q=%3A%20map%20-%3E%20list)
- Search by name and type with a colon separator [`Bogue : Button.t`](https://doc.sherlocode.com/?q=Bogue%20%3A%20Button.t)
- An underscore `_` can be used as a wildcard in type queries: [`(int -> _) -> list -> _`](https://doc.sherlocode.com/?q=(int%20-%3E%20_)%20-%3E%20list%20-%3E%20_)
- Type search supports products and reordering of function arguments: [`array -> ('a * int -> bool) -> array`](https://doc.sherlocode.com/?q=%3A%20array%20-%3E%20(%27a%20*%20int%20-%3E%20bool)%20-%3E%20array)

You can search for anything that can exist in an MLI file: values, types,
modules, exceptions, constructors, etc.
## Local usage

Fuzzy type search is available for values, sum-types constructors, exceptions,
and record fields.

# Usage

First, install sherlodoc :
First, install sherlodoc and odig:

```bash
opam pin add https://github.com/art-w/sherlodoc.git#jsoo
opam install sherlodoc
```
$ opam pin add 'https://github.com/art-w/sherlodoc.git' # optional

## Generating a search-database

The first step to using sherlodoc is generating a search-database. You do this
with the command `sherlodoc index` :

```bash
sherlodoc index --format=marshal -o db.marshal a.odocl b.odocl
$ opam install sherlodoc odig
```

The `--format` option determines the format in which the database is written. The
available formats are `marshal` and `js`. The `js` format, for
JavaScript, is the one compatible with odoc; `marshal` is for most other
uses.

There is a third format, `ancient`, which is only available if the package
`ancient` is installed. It is more complicated than the other two; you can read
about it [here](https://github.com/UnixJunkie/ocaml-ancient). It is used for the
[online](https://doc.sherlocode.com) version of sherlodoc, and is an optional
dependency of the `sherlodoc` package.

The `-o` option is the filename of the output.

Then you need to provide a list of .odocl files that contain the signature
items that are going to be searchable. They are build artifacts of odoc.

There are other options, documented by `sherlodoc index --help`.

## Queries
[Odig](https://erratique.ch/software/odig) can generate the odoc documentation of your current switch with:

To query sherlodoc, be it on the command-line or in a web interface, you need
to input a string query. A query is a list of words, separated by spaces.
Results will be entries that have every word of the list present in them.

```
"list map"
```

The above query will return entries that have both `list` and `map` in them.

You can also add `: <type>` at the end of your query. In that case, results
are restricted to entries whose type matches `<type>`. Such an entry can only be a value, an
exception, a constructor or a record field.

Matching a type is fuzzy, if you do the following query :

```
"blabla : string"
```bash
$ odig odoc # followed by `odig doc` to browse your switch documentation
```

It could return `val blablabla : int -> string` and `val blabla2 : string`.
Which sherlodoc can then index to create a search database:

You can have just the type-part of the query : `": string -> int"` is a valid
query.
```bash
# name your sherlodoc database
$ export SHERLODOC_DB=/tmp/sherlodoc.marshal

You can use wildcards :
# if you are using OCaml 4, we recommend the `ancient` database format:
$ opam install ancient
$ export SHERLODOC_DB=/tmp/sherlodoc.ancient

# index all odoc files generated by odig for your current switch:
$ sherlodoc index $(find $OPAM_SWITCH_PREFIX/var/cache/odig/odoc -name '*.odocl')
```
": string -> _"
```

will only return functions that take a string as an argument, no matter what they
return.

There is limited support for polymorphism : you cannot search for `'a -> 'a` and
get every function `int -> int`, `string -> string` etc. However it will return
a function whose literal type is `'a -> 'a`. Having the first behaviour would
be a lot harder to program, and probably not a good idea, as it would be
impossible to search for polymorphic functions.

## Searching on the command line

If you have a search database in `marshal` format, you can search on the command
line :
Enjoy searching from the command-line or run the webserver:

```bash
sherlodoc --db=db.marshal "blabla : int -> string"
```

`--db` is the filename of the search database. If absent, the environment
variable `SHERLODOC_DB` will be used instead.
$ sherlodoc search "map : list"
$ sherlodoc search # interactive cli

In my example, I gave a query, but if you give none, sherlodoc enters an
interactive mode where you can enter queries until you decide to quit.
$ opam install dream
$ sherlodoc serve # webserver at http://localhost:1234
```

There are more options documented by `sherlodoc --help`; some of them are for
debugging/testing purposes, others might be useful.
The different commands support a `--help` argument for more details/options.

### Search your switch
In particular, sherlodoc supports three different file formats for its database, which can be specified either in the filename extension or through the `--db-format=` flag:
- `ancient` for fast database loading using mmap, but is only compatible with OCaml 4.
- `marshal` for when ancient is unavailable, with slower database opening.
- `js` for integration with odoc static html documentation for client-side search without a server.

A reasonable use of sherlodoc on the cli is to search for signature items from
your whole switch. Since odig can generate the documentation of the switch, we
can get the .odocl files with it :
## Integration with Odoc

Generate the documentation of your switch :
Odoc 2.4.0 adds a search bar inside the statically generated html documentation. [Integration with dune is in progress](https://github.com/ocaml/dune/pull/9772), you can try it inside a fresh opam switch with: (warning! this will recompile any installed package that depends on dune!)

```bash
odig odoc
```

Generate the search database :
$ opam pin https://github.com/emileTrotignon/dune.git#search-odoc-new

```bash
sherlodoc index --format=marshal -o db.marshal $(find $OPAM_SWITCH_PREFIX/var/cache/odig/odoc -name "*.odocl")
$ dune build @doc # in your favorite project
```

Enjoy searching :
Otherwise, manual integration with odoc requires adding the flags `--search-uri sherlodoc.js --search-uri db.js` to every call of `odoc html-generate` to activate the search bar. You'll also need to generate a search database `db.js` and provide the `sherlodoc.js` dependency (a version of the sherlodoc search engine with odoc support, compiled to JavaScript):

```bash
sherlodoc search --db=db.marshal
```
$ sherlodoc index --db=_build/default/_doc/_html/YOUR_LIB/db.js \
$(find _build/default/_doc/_odocls/YOUR_LIB -name '*.odocl')

## Searching from an odoc search bar

The latest unreleased version of odoc is compatible with sherlodoc. This allows
you to upload the documentation of a package with a search for this package
embedded.

For this to work, you need to generate a search database with format `js`, and
then add to every call of `odoc html-generate` the flags `--search-uri
sherlodoc.js --search-uri db.js`.

Be sure to copy the two js files in the output directory given to the
html-generate command :

```bash
sherlodoc js html_output/sherlodoc.js ;
cp db.js html_output/db.js ;
$ sherlodoc js > _build/default/_doc/_html/sherlodoc.js
```

Obviously, most people use dune, and do not call `odoc html-generate`. A patch
for dune is being [worked on](https://github.com/emileTrotignon/dune/tree/search-odoc-new).
If you want to, you can test it, it should work. It is still work in progress.
## How it works

## Sherlodoc online
The sherlodoc database uses [Suffix Trees](https://en.wikipedia.org/wiki/Suffix_tree) to search for substrings in value names, documentation and types. During indexation, the suffix trees are compressed to state machine automata. The children of every node are also sorted, such that a sub-tree can be used as a priority queue during search enumeration.
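
To illustrate why suffixes help with substring search, here is a minimal sketch that is not sherlodoc's implementation: it stores every suffix of every entry in a plain list instead of a compressed suffix tree, and relies on the fact that a substring of a name is always a prefix of one of its suffixes. The names `suffixes`, `build` and `search` are hypothetical and chosen for this example only.

```ocaml
(* Every suffix of [entry], paired with the entry it came from. *)
let suffixes entry =
  let n = String.length entry in
  List.init n (fun i -> String.sub entry i (n - i), entry)

(* [starts_with ~prefix s] is true when [s] begins with [prefix]. *)
let starts_with ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Naive index: the list of all (suffix, entry) pairs.
   A real suffix tree shares common prefixes instead of storing them flat. *)
let build entries = List.concat_map suffixes entries

(* An entry contains [query] iff one of its suffixes starts with [query]. *)
let search index query =
  index
  |> List.filter (fun (suffix, _) -> starts_with ~prefix:query suffix)
  |> List.map snd
  |> List.sort_uniq compare

let () =
  let index = build [ "List.map"; "List.filter_map"; "Array.iter" ] in
  List.iter print_endline (search index "map")
```

This flat list costs time linear in the total number of suffixes per query; the tree structure described above is what makes the lookup proportional to the query length instead.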

If you want to use sherlodoc as a server, like on
[doc.sherlocode.com](https://doc.sherlocode.com) it is also possible.
To rank the search results, sherlodoc computes a static evaluation of each candidate during indexation. This static scoring biases the search to favor short names, short types, the presence of documentation, etc. When searching, a dynamic evaluation dependent on the user query is used to adjust the static ordering of the results:

As usual, generate your search database :
- How similar is the result name to the search query? (to e.g. prefer results which respect the case: [`map`](https://doc.sherlocode.com/?q=map) vs [`Map`](https://doc.sherlocode.com/?q=Map))
- How similar are the types? (using a tree diff algorithm, as for example [`('a -> 'b -> 'a) -> 'a -> 'b list -> 'a`](https://doc.sherlocode.com/?q=(%27a%20-%3E%20%27b%20-%3E%20%27a)%20-%3E%20%27a%20-%3E%20%27b%20list%20-%3E%20%27a) and [`('a -> 'b -> 'b) -> 'a list -> 'b -> 'b`](https://doc.sherlocode.com/?q=(%27a%20-%3E%20%27b%20-%3E%20%27b)%20-%3E%20%27a%20list%20-%3E%20%27b%20-%3E%20%27b) are isomorphic yet point to `fold_left` and `fold_right` respectively)

```bash
sherlodoc index --format=ancient -o db.ancient $(find /path/to/doc -name "*.odocl")
```

Then you can run the website :
For fuzzy type search, sherlodoc aims to provide good results without requiring a precise search query, on the basis that the user doesn't know the exact type of the things they are looking for (e.g. [`string -> file_descr`](https://doc.sherlocode.com/?q=string%20-%3E%20file_descr) is incomplete but should still point in the right direction). In particular when exploring a package documentation, the common question "how do I produce a value of type `foo`" can be answered with the query `: foo` (and "which functions consume a value of type `bar`" with `: bar -> _`). This should also work when the type can only be produced indirectly through a callback (for example [`: Eio.Switch.t`](https://doc.sherlocode.com/?q=%3A%20Eio.Switch.t) has no direct constructor). To achieve this, sherlodoc performs a type decomposition based on the polarity of each term: A value produced by a function is said to be positive, while an argument consumed by a function is negative. This simplifies away the tree shape of types, allowing their indexation in the suffix trees. The cardinality of each value type is also indexed, to e.g. differentiate between [`list -> list`](https://doc.sherlocode.com/?q=list%20-%3E%20list) and [`list -> list -> list`](https://doc.sherlocode.com/?q=list%20-%3E%20list%20-%3E%20list).
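
The polarity decomposition can be sketched as follows. This is a simplified illustration under stated assumptions, not sherlodoc's actual code: the `typ` AST, the `polarities` function and the choice to keep the same polarity for the parameters of a type constructor are all hypothetical. Function results keep the current sign, and each argument position flips it, which is why a type received through a callback ends up positive.

```ocaml
(* Illustrative type AST (an assumption, not sherlodoc's representation). *)
type typ =
  | Constr of string * typ list (* e.g. [int] or ['a list] *)
  | Arrow of typ * typ (* [arg -> result] *)

type polarity = Pos | Neg

let flip = function Pos -> Neg | Neg -> Pos

(* Flatten a type into signed constructor names: results keep the
   current polarity, arguments get the flipped one. *)
let rec polarities pol = function
  | Constr (name, args) -> (pol, name) :: List.concat_map (polarities pol) args
  | Arrow (arg, res) -> polarities (flip pol) arg @ polarities pol res

let show (pol, name) = (match pol with Pos -> "+" | Neg -> "-") ^ name

(* string -> int -> char *)
let example =
  Arrow (Constr ("string", []), Arrow (Constr ("int", []), Constr ("char", [])))

let () =
  polarities Pos example |> List.map show |> String.concat ", " |> print_endline
(* prints "-string, -int, +char" *)
```

Because the result is a flat list of signed names, the order of arguments no longer matters, and counting how often each signed name occurs gives the cardinality mentioned above.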

```bash
sherlodoc serve db.ancient
```
While the polarity search results are satisfying, sherlodoc offers very limited support for polymorphic variables, type aliases and true type isomorphisms. You should check out the extraordinary [Dowsing](https://github.com/Drup/dowsing) project for this!

The real magic for [doc.sherlocode.com](https://doc.sherlocode.com) is all the
.odocl artifacts of the package documentation generated for
[`ocaml.org/packages`](https://ocaml.org/packages), which I got my hands on
thanks to insider trading (but don't have the bandwidth to share back... sorry!)
And if you speak French, a more detailed [presentation of Sherlodoc](https://www.irill.org/videos/OUPS/2023-03/wendling.html) (and [Sherlocode](https://sherlocode.com)) was given at the [OCaml Users in PariS (OUPS)](https://oups.frama.io/) in March 2023.
1 change: 1 addition & 0 deletions cli/dune
@@ -9,6 +9,7 @@
index
query
db_store
unix
(select
serve.ml
from
47 changes: 37 additions & 10 deletions cli/search.ml
@@ -36,31 +36,46 @@ let print_result ~print_cost ~no_rhs (elt : Db.Entry.t) =
in
Format.printf "%s%s %s%s%a@." cost kind typedecl_params name pp_rhs elt.rhs

let search ~print_cost ~static_sort ~limit ~db ~no_rhs ~pretty_query query =
let search ~print_cost ~static_sort ~limit ~db ~no_rhs ~pretty_query ~time query =
let query = Query.{ query; packages = []; limit } in
if pretty_query then print_endline (Query.pretty query) ;
match Query.search ~shards:db ~dynamic_sort:(not static_sort) query with
let t0 = Unix.gettimeofday () in
let r = Query.Blocking.search ~shards:db ~dynamic_sort:(not static_sort) query in
let t1 = Unix.gettimeofday () in
match r with
| [] -> print_endline "[No results]"
| _ :: _ as results ->
List.iter (print_result ~print_cost ~no_rhs) results ;
flush stdout
flush stdout ;
if time then Format.printf "Search in %f@." (t1 -. t0)

let rec search_loop ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~db =
let rec search_loop ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~time ~db =
Printf.printf "%ssearch>%s %!" "\027[0;36m" "\027[0;0m" ;
match Stdlib.input_line stdin with
| query ->
search ~print_cost ~static_sort ~limit ~db ~no_rhs ~pretty_query query ;
search_loop ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~db
search ~print_cost ~static_sort ~limit ~db ~no_rhs ~pretty_query ~time query ;
search_loop ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~time ~db
| exception End_of_file -> Printf.printf "\n%!"

let search query print_cost no_rhs static_sort limit pretty_query db_format db_filename =
let search
query
print_cost
no_rhs
static_sort
limit
pretty_query
time
db_format
db_filename
=
let module Storage = (val Db_store.storage_module db_format) in
let db = Storage.load db_filename in
match query with
| None ->
print_endline header ;
search_loop ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~db
| Some query -> search ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~db query
search_loop ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~time ~db
| Some query ->
search ~print_cost ~no_rhs ~pretty_query ~static_sort ~limit ~time ~db query

open Cmdliner

@@ -76,6 +91,10 @@ let print_cost =
let doc = "For debugging purposes: prints the cost of each result" in
Arg.(value & flag & info [ "print-cost" ] ~doc)

let print_time =
let doc = "For debugging purposes: prints the search time" in
Arg.(value & flag & info [ "print-time" ] ~doc)

let static_sort =
let doc =
"Sort the results without looking at the query.\n\
@@ -93,4 +112,12 @@ let pretty_query =
Arg.(value & flag & info [ "pretty-query" ] ~doc)

let term =
Term.(const search $ query $ print_cost $ no_rhs $ static_sort $ limit $ pretty_query)
Term.(
const search
$ query
$ print_cost
$ no_rhs
$ static_sort
$ limit
$ pretty_query
$ print_time)