publish dryrun works e2e; need better error handling throughout
mooreniemi committed Jun 14, 2021
1 parent 82b470e commit 298dd60
Showing 2 changed files with 15 additions and 6 deletions.
14 changes: 12 additions & 2 deletions README.md
@@ -9,7 +9,12 @@ jgs '
```
# suntan

-This is just a proof-of-concept tool to dump Elasticsearch Lucene shards into Tantivy. There's also a couple `examples` of calling Lucene through Rust for querying. You provide input, output, and the Tantivy output schema and the tool dumps into it. Your Tantivy schema must be just like or a subset of the Elasticsearch schema. Not all types are supported yet.
+This is a proof-of-concept CLI tool to dump Elasticsearch Lucene shards into Tantivy. There's also a couple `examples` of calling Lucene through Rust for querying. You provide input, output, and the Tantivy output schema and the tool dumps into it. Your Tantivy schema must be just like or a subset of the Elasticsearch schema. Not all types are supported yet.
+
+```
+# this creates a tantivy index at /tmp/suntan/tantivy-idx given the test resources
+suntan -i tests/resources/es-idx/ -s tests/resources/tantivy-schema.json
+```

## cli

@@ -46,6 +51,12 @@ We rely on [`j4rs`](https://github.com/astonbitecode/j4rs). This is a trade-off.

I have made some headway patching [rucene](https://github.com/zhihu/rucene) in order to read Lucene directly from Rust. I got far enough for what I'd need (to pull out `StoredField` and even to do basic text search) but not further to things like `DocValues`. When/if I have time I may try to incorporate that here as an optimization. The danger would be with the next breaking version we'd have to again patch.

+## development
+
+I worked with Java 8 and [maven](https://maven.apache.org/what-is-maven.html).
+
+`mvn package` is all you need to generate the jar. Then `build.rs` will copy it into `jassets/suntan.jar`.
+
## test_data

To generate Elasticsearch data I use [elasticsearch-test-data](https://github.com/oliver006/elasticsearch-test-data). A copy of test data is kept in `tests/resources`. Here's what I used to generate the `test_data` index:
@@ -79,7 +90,6 @@ rsync -r /var/lib/elasticsearch/nodes/0/indices/TvG2djXSQgqg4PWZSrv2wQ/0/index/

## high level todos

-- Properly package the jar with the bin.
- `HierarchicalFacet` and `DateTime` support in the schema mapping.
- Remapping field names on export.
- java_wrapper should probably be made into a git submodule. Right now I `rsync` from another repo.
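
The `development` section added in this README diff says `build.rs` copies the Maven-built jar into `jassets/suntan.jar`. The actual `build.rs` is not part of this commit, so the following is only a minimal sketch of that copy step using the standard library. The helper name `copy_jar` and any source path passed to it are hypothetical; only the destination file name `suntan.jar` and the `jassets` directory come from the README.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copy a built jar into `assets_dir` as `suntan.jar`, creating the
/// directory if it does not exist yet. Returns the destination path.
/// NOTE: `copy_jar` is a hypothetical helper, not the repo's real build.rs.
fn copy_jar(built_jar: &Path, assets_dir: &Path) -> io::Result<PathBuf> {
    // Make sure the target directory (e.g. `jassets`) exists.
    fs::create_dir_all(assets_dir)?;
    let dest = assets_dir.join("suntan.jar");
    // `fs::copy` fails with an io::Error if the source jar is missing,
    // which surfaces a forgotten `mvn package` instead of hiding it.
    fs::copy(built_jar, &dest)?;
    Ok(dest)
}
```

From the `main` function of a `build.rs`, this could be called as `copy_jar(Path::new("<maven output jar>"), Path::new("jassets"))?`, where the Maven output path depends on the project's `pom.xml` and is left unfilled here on purpose.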
7 changes: 3 additions & 4 deletions src/main.rs
@@ -87,7 +87,7 @@ fn run(
// there is also a parse_document method we could use specific to tantivy
// but it errors on any keys not in the schema so the below is more flexible right now
// let doc: Document = schema.parse_document(&doc_source)?;
-let v: Value = serde_json::from_str(&doc_source).unwrap();
+let v: Value = serde_json::from_str(&doc_source).expect("must be valid doc");
// dbg!(v);

let mut doc = Document::new();
@@ -112,6 +112,7 @@ fn run(
}
tantivy::schema::FieldType::Date(_) => {
// TODO: need to bring in chrono etc
+// doc.add_date(content, v["last_updated"].as_str().unwrap_or(""));
todo!()
}
tantivy::schema::FieldType::HierarchicalFacet(_) => {
@@ -127,9 +128,6 @@ fn run(
}
});

-// TODO: chrono timestamp
-// doc.add_date(content, v["last_updated"].as_str().unwrap_or(""));
-
index_writer.add_document(doc);
});
}
@@ -138,6 +136,7 @@ fn run(
index_writer.commit()?;

// # Searching
+// We read the created index and send a test query into it, to confirm that we successfully exported

let reader = index.reader()?;
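
The commit message notes that better error handling is still needed, and the `run` hunk above only swaps `unwrap()` for `expect("must be valid doc")`, which still panics on the first malformed document. One possible direction, sketched here with the standard library only, is to record failures and keep exporting. `ExportReport` and `export_docs` are hypothetical names that do not exist in suntan; in the real tool the `parse` closure would presumably wrap `serde_json::from_str` and the `index` closure would build and add the Tantivy `Document`, but that wiring is an assumption.

```rust
use std::fmt::Debug;

/// Summary of one export pass (hypothetical type, not part of suntan).
#[derive(Debug, Default)]
struct ExportReport {
    /// Number of documents that parsed and were handed to the indexer.
    indexed: usize,
    /// (position in the input, debug-formatted error) for each skipped doc.
    skipped: Vec<(usize, String)>,
}

/// Parse every raw document; pass successes to `index` and record
/// failures in the report instead of panicking.
fn export_docs<T, E, P, I>(docs: &[&str], parse: P, mut index: I) -> ExportReport
where
    E: Debug,
    P: Fn(&str) -> Result<T, E>,
    I: FnMut(T),
{
    let mut report = ExportReport::default();
    for (pos, raw) in docs.iter().copied().enumerate() {
        match parse(raw) {
            Ok(doc) => {
                index(doc);
                report.indexed += 1;
            }
            // Keep going: one bad source document should not abort the export.
            Err(e) => report.skipped.push((pos, format!("{:?}", e))),
        }
    }
    report
}
```

A caller could print `report.skipped` after `index_writer.commit()` and decide whether a non-empty list should turn into a non-zero exit code. Whether skipping or failing fast is the right policy for a shard dump is a design choice this sketch does not settle.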

