## Updating Elasticsearch
TL;DR: when you update an existing mapping, use `Reindex`; when you add a mapping, use `UpdateMapping`.
On occasion you will need to update our Elasticsearch mappings. Unfortunately, you need to change the mapping and then reindex the data for the change to take effect. Read more about the inspiration behind this approach.
### Reindex

The `Reindex` script performs the following (a sketch of the alias moves follows the list):
- Creates a new index (with the new mappings), appending a version number to the new index name, e.g. `images_5`
- Copies over all data from the original index to the new index using scrolling
- Points the write alias to the new index
- Checks whether any new data has been written since the script started and, if so, copies that over as well
- Points the read alias to the new index
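The alias moves above are, conceptually, what you would do by hand with the Elasticsearch aliases API. The sketch below is only an illustration with placeholder index and alias names (`images_4`, `images_5`, `images-write`); the script derives the real names itself and drives this through its own client.

```
# Hypothetical: move the write alias from the old index to the new one in a
# single atomic call. Index and alias names are placeholders.
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "images_4", "alias": "images-write" } },
    { "add":    { "index": "images_5", "alias": "images-write" } }
  ]
}'
```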
Run the script:

```
$ sbt
> scripts/run Reindex <ES_URL>
```

It optionally takes a DateTime string argument, in which case the reindex only covers documents updated since the date provided:

```
> scripts/run Reindex <ES_URL> FROM_TIME=2016-01-28T10:55:10.232Z
```

It also optionally takes a new index name, in which case it reindexes into that index instead of using the default version increment:

```
> scripts/run Reindex <ES_URL> NEW_INDEX=images
```
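For context, the FROM_TIME filter amounts to restricting the copy to documents whose modification timestamp is on or after the given time. A rough sketch of that filter against the REST API is below; the index name `images` and the timestamp field `lastModified` are assumptions, not necessarily the real names.

```
# Hypothetical: select only documents updated since a given time.
# "images" and "lastModified" are placeholder names.
curl -X GET "localhost:9200/images/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": { "lastModified": { "gte": "2016-01-28T10:55:10.232Z" } }
  }
}'
```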
### UpdateMapping

When you add a mapping, e.g. a new field to the image mapping, you should add it with the `UpdateMapping` script: we use strict mappings, so you cannot just add fields willy-nilly. Updating mappings is done in two steps:
1. Set up an SSH tunnel to the AWS Elasticsearch instance:

   ```
   ssh -L 9200:localhost:9200 <ES_URL>
   ```

2. Run the script:

   ```
   $ sbt
   > scripts/run UpdateMapping <ES_URL>
   ```

   It optionally takes an index name, e.g. `scripts/run UpdateMapping <ES_URL> images_5`.
To test the connection without making any changes to the mappings, you can run `sbt "scripts/run GetMapping <ES_URL>"`.
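For context, adding a single field to an existing index by hand looks roughly like the call below (the exact URL varies by Elasticsearch version). The index name `images_5` and field name `newField` are placeholders; the script drives this through its own client rather than curl.

```
# Hypothetical: add one new field to an existing index's strict mapping.
# "images_5" and "newField" are placeholder names.
curl -X PUT "localhost:9200/images_5/_mapping" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "newField": { "type": "keyword" }
  }
}'
```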
### UpdateSettings

When you need to close the index to update its settings, i.e. when you have to add or reconfigure analysers, this is the command to use:
```
$ # after pausing thrall
$ sbt
> scripts/run UpdateSettings localhost
```
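For context, an analyser change requires the index to be closed, the settings updated, and the index reopened, roughly as sketched below. The index name and analyser configuration are placeholders; the script takes care of this sequence for you.

```
# Hypothetical: close the index, change its analysis settings, reopen it.
# "images_5" and "my_analyzer" are placeholder names.
curl -X POST "localhost:9200/images_5/_close"
curl -X PUT "localhost:9200/images_5/_settings" -H 'Content-Type: application/json' -d'
{
  "analysis": {
    "analyzer": {
      "my_analyzer": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase"] }
    }
  }
}'
curl -X POST "localhost:9200/images_5/_open"
```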
### DownloadAllEsIds

Elasticsearch doesn't provide a way to download all document IDs directly, so this script does just that and writes them to a file, for example a CSV file for upload to AWS Athena.
It relies on the `es-ssh-ssm-tunnel.sh` script.
It's most efficient to do this as a 'scan and scroll' (see stackoverflow.com/a/30855670).
```
$ sbt
> scripts/run DownloadAllEsIds http://localhost:9200 /tmp/testing
```
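A rough sketch of the 'scan and scroll' idea the script relies on, done by hand against the REST API: fetch matches without `_source` (IDs only) and keep pulling pages with the returned scroll ID. The index name `images` is a placeholder.

```
# Hypothetical: first page, IDs only, sorted by _doc for cheap scrolling.
curl -X POST "localhost:9200/images/_search?scroll=1m" -H 'Content-Type: application/json' -d'
{
  "size": 1000,
  "_source": false,
  "sort": ["_doc"]
}'
# Subsequent pages (repeat until no hits come back):
curl -X POST "localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d'
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}'
```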
### BulkDeleteS3Files

```
$ sbt
> scripts/run BulkDeleteS3Files <bucketName> <inputFile> <auditFile>
```
The input file needs to be a CSV with a heading row and a single column containing the S3 paths to delete from the specified bucket.
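A minimal sketch of such an input file; the header name and paths are hypothetical:

```
path
originals/12/34/1234abcd.jpg
thumbs/12/34/1234abcd.jpg
```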
The script groups the input paths into batches of 1,000 so it can use the S3 bulk delete API, and reports the success or failure of each S3 path both to the console and to the `auditFile` path provided (as CSV output).
Note: the bulk delete API reports 'deleted' even if a path is not found, so the script can be run multiple times without issue (although delete markers will be created in S3 on every execution).
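For reference, each batch corresponds to one S3 `DeleteObjects` call, which accepts at most 1,000 keys per request; an equivalent AWS CLI invocation is sketched below with placeholder bucket and key names.

```
# Hypothetical: a single bulk delete request (up to 1000 keys), expressed via
# the AWS CLI. Bucket and keys are placeholders.
aws s3api delete-objects \
  --bucket my-image-bucket \
  --delete '{
    "Objects": [
      { "Key": "originals/12/34/1234abcd.jpg" },
      { "Key": "thumbs/12/34/1234abcd.jpg" }
    ],
    "Quiet": false
  }'
```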