This plugin provides classes for integrating BlackLab with Solr. The ultimate goal is to enable distributed search via SolrCloud. This is a work in progress.
To enable this plugin for your core, add this to the <config> section of your solrconfig.xml:
<!-- Load the blacklab-solr plugin -->
<lib dir="${solr.install.dir:/opt/solr}/contrib/blacklab-solr/lib/" regex="blacklab-solr.*\.jar" />
Add the blacklab-search search component and specify where to find the core's BlackLab config file:
<!-- The BlackLab SearchComponent -->
<searchComponent name="blacklab-search" class="org.ivdnt.blacklab.solr.BlackLabSearchComponent" >
<!-- Where to find a core's BlackLab config file (value shown below is the default path).
Each core gets its own config file (although certain settings are engine-wide...)
-->
<str name="configFile">conf/blacklab-webservice.yaml</str>
</searchComponent>
To run the plugin on your /select handler, add this to the <requestHandler name="/select" ...> element:
<!-- After all other components (standard Solr per-document search) have run, run the BlackLab (per-hit) search -->
<arr name="last-components">
<str>blacklab-search</str>
</arr>
A Dockerfile is included that adds this plugin to a Solr image. Build the image with this command:
docker build -t instituutnederlandsetaal/blacklab-solr:1 -f Dockerfile .
You can derive your own Dockerfile from this. Here's an example that adds a Solr configuration dir to the image and creates a core based on that configuration:
# Based on the Solr + BlackLab plugin image.
# Creates our core (using the config).
FROM instituutnederlandsetaal/blacklab-solr:1
# Copy the configuration files for our core
COPY . /opt/solr/server/solr/configsets/blacklab/conf
# Pre-create core (using the config copied above)
# as soon as the container is started.
CMD ["solr-precreate", "my-blacklab-corpus", "/opt/solr/server/solr/configsets/blacklab"]
In addition to standard Solr parameters like q and fq for document filtering, you can use all of the parameters that BlackLab Server uses, but you should prefix them with bl. (so patt becomes bl.patt). Some examples are shown below.
Note that we always pass rows=0 to Solr, because we don't want Solr's document results; BlackLab sends a list of hits and automatically includes the document info for those hits.
Find hits: https://server/solr/corename/select?bl.op=hits&bl.patt=%22the%22&q=*%3A*&rows=0
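The prefixing rule can be sketched in Python as follows. This is only an illustration: the helper name blacklab_select_url is hypothetical and not part of the plugin, and the server and core names are the placeholders used in the URL above.

```python
from urllib.parse import urlencode


def blacklab_select_url(base_url, core, bl_params, q="*:*"):
    """Build a /select URL that runs a BlackLab operation on a Solr core.

    Hypothetical helper for illustration; not part of the plugin.
    """
    # Prefix every BlackLab Server parameter name with "bl."
    params = {f"bl.{name}": value for name, value in bl_params.items()}
    # Standard Solr parameters: q filters documents, rows=0 suppresses
    # Solr's own document results.
    params["q"] = q
    params["rows"] = 0
    # safe="*" leaves the asterisks in q unescaped, matching the URL above.
    return f"{base_url}/{core}/select?{urlencode(params, safe='*')}"


url = blacklab_select_url(
    "https://server/solr", "corename", {"op": "hits", "patt": '"the"'}
)
print(url)
```

Letting urlencode handle the escaping avoids hand-encoding the quotes in the pattern.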
As an alternative to passing separate bl.NAME parameters, you can also pass a JSON structure with all the parameters in a parameter called bl.req, e.g.:
{ "op": "hits", "patt": "\"the\"" }
The full URL in this case would be: https://server/solr/corename/select?bl.req=%7B%22op%22%3A%22hits%22%2C%22patt%22%3A%22%5C%22the%5C%22%22%7D&q=*%3A*&rows=0
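As a rough sketch, the encoded bl.req value can be produced with Python's standard library instead of by hand. The server and core names are the same placeholders as in the URL above; the variable names are illustrative.

```python
import json
from urllib.parse import urlencode

# The BlackLab parameters, without the "bl." prefix.
request = {"op": "hits", "patt": '"the"'}

# Compact separators avoid spaces, which keeps the encoded URL short.
bl_req = json.dumps(request, separators=(",", ":"))

# safe="*" leaves the asterisks in q unescaped, matching the URL above.
query = urlencode({"bl.req": bl_req, "q": "*:*", "rows": 0}, safe="*")
url = f"https://server/solr/corename/select?{query}"
print(url)
```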
In the JSON structure, the values of group and viewgroup are not strings with separators, but arrays of arrays:
{
"op": "hits",
"patt": "\"the\"",
"group": [ [ "field", "title" ] ],
"viewgroup": [ [ "str", "interview about city" ] ]
}
The group and viewgroup parts above correspond to bl.group=field:title&bl.viewgroup=str:interview about city.
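The correspondence between the two forms can be sketched as below. This assumes that criteria in the string form are separated by commas and the parts of one criterion by colons, and it ignores any escaping of those characters inside values; the function name spec_to_json is hypothetical.

```python
import json


def spec_to_json(spec):
    """Convert a string such as "field:title" into [["field", "title"]].

    Assumption: "," separates criteria and ":" separates the parts of one
    criterion. Escaped separators inside values are not handled.
    """
    return [criterion.split(":") for criterion in spec.split(",")]


request = {
    "op": "hits",
    "patt": '"the"',
    "group": spec_to_json("field:title"),
    "viewgroup": spec_to_json("str:interview about city"),
}
print(json.dumps(request, indent=2))
```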
The values of bl.op are:
bl.op | Operation | BLS URL equivalent | Extra parameter |
---|---|---|---|
server-info | Server information | / | |
corpus-info | Corpus information, including fields and values | /CORPUS | |
corpus-status | Corpus (indexing) status | /CORPUS/status | |
field-info | Info about (metadata or annotated) field | /CORPUS/field/FIELDNAME | field |
hits | Search (and optionally group) hits | /CORPUS/hits | |
docs | Search (and optionally group) documents | /CORPUS/docs | |
doc-info | Get document metadata and other information | /CORPUS/docs/PID | docpid |
doc-contents | Get the full contents of a document (if allowed) | /CORPUS/docs/PID/contents | docpid |
doc-snippet | Get snippet of a document (if allowed) | /CORPUS/docs/PID/snippet | docpid |
termfreq | Calculate term frequencies | /CORPUS/termfreq | |
autocomplete | Return terms matching a prefix in a field | /CORPUS/autocomplete | |
list-input-formats | List available input formats | /CORPUS/input-formats | |
input-format-info | Info about an input format | /CORPUS/input-formats/NAME | inputformat |
input-format-xslt | Generate XSLT for an input format | /CORPUS/input-formats/NAME | inputformat |
cache-info | Show cache contents (NOT IMPLEMENTED YET) | /CORPUS/cache-info | |
cache-clear | Clear the cache (debug mode only; NOT IMPLEMENTED YET) | /CORPUS/cache-clear | |
create-corpus | Create corpus (NOT IMPLEMENTED YET) | | |
delete-corpus | Delete corpus (NOT IMPLEMENTED YET) | | |
add-to-corpus | Add to corpus (NOT IMPLEMENTED YET) | | |
write-input-format | Write input format (NOT IMPLEMENTED YET) | | |
delete-input-format | Delete input format (NOT IMPLEMENTED YET) | | |
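A client could use the "Extra parameter" column to check a request before sending it. The sketch below assumes that the listed parameter must be supplied for those operations and that it takes the bl. prefix like the other parameters; the names REQUIRED_EXTRA and missing_extra_parameter are hypothetical and not part of the plugin.

```python
# Operations from the table above that list an extra parameter.
REQUIRED_EXTRA = {
    "field-info": "field",
    "doc-info": "docpid",
    "doc-contents": "docpid",
    "doc-snippet": "docpid",
    "input-format-info": "inputformat",
    "input-format-xslt": "inputformat",
}


def missing_extra_parameter(params):
    """Return the missing bl.* parameter name for this request, or None."""
    extra = REQUIRED_EXTRA.get(params.get("bl.op"))
    if extra is not None and f"bl.{extra}" not in params:
        return f"bl.{extra}"
    return None


print(missing_extra_parameter({"bl.op": "doc-info"}))
print(missing_extra_parameter({"bl.op": "doc-info", "bl.docpid": "PRint602"}))
```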
Some example queries:
- Documents containing "the":
bl.op=docs&bl.patt="the"
- The same documents, grouped by title:
bl.op=docs&bl.patt="the"&bl.group=field:title
- Viewing a single group:
bl.op=docs&bl.patt="the"&bl.group=field:title&bl.viewgroup=str:interview about conference experience and impressions of city
- Information about a document:
bl.op=doc-info&bl.docpid=PRint602
- Document contents:
bl.op=doc-contents&bl.docpid=PRint602
- Document snippet:
bl.op=doc-snippet&bl.docpid=PRint602&bl.wordstart=100&bl.wordend=200
- Term frequencies:
bl.op=termfreq&bl.field=contents&bl.annotation=lemma
- Autocomplete:
bl.op=autocomplete&bl.field=contents&bl.annotation=lemma&bl.term=a