-
Notifications
You must be signed in to change notification settings - Fork 331
InvertedIndex
GE provides built-in inverted index support for substring queries on cell field values. GE only indexes the cell field values that are marked with [index] attribute in TSL.
Only the cell fields with string type can be indexed. There are two
cases. 1) The type of the cell field is string. During substring
query processing, a cell is matched if the value of its indexed field
contains the queried substring. 2) The cell field is a collection of
strings, e.g., List<string>
or List<List<string>>
. During
substring query processing, a cell is matched as long as any string in
the collection contains the queried substring.
To make a field indexed, add an [index]
attribute to it:
cell MyCell
{
[Index]
string Name;
}
The index attribute is only valid for the fields whose type is string or a collection of strings.
It is possible to declare an index on a nested field, for example:
struct leaf
{
[Index]
string data;
}
cell root
{
leaf substructure;
}
Because there is no way to globally identify such a struct in
GE, the index attribute on struct leaf
alone does not
produce a valid index. Only when such a struct is contained in a cell
root
, root.leaf.data
becomes an accessible cell field; a valid
index is formed. It is allowed to have an indexed field included in a
substructure of a substructure, thus
cell.inner_1.inner_2. ... .leaf.data
is still indexable.
If a substructure with one or more indexed fields is included in multiple cell constructs, then one index will be generated for each of these cells, and the indexes are independent with each other.
For an indexed field, we can issue a substring query against it by
calling the method Index.SubstringQuery
with a field identifier and
a query string. A field identifier, such as Index.root.leaf.data,
is an automatically generated nested class defined within the Index
class. It is used to specify which cell field we are going to query
against: Index.root.leaf.data identifies the data
field in the
leaf
struct within the root cell.
Index.SubstringQuery(Index.root.leaf.data, "query string")
returns a
list of cell Ids. Each root.leaf.data field value of the corresponding
cells contains the "query string".
The method Index.SubstringQuery also accepts a sequence of query
strings. Given a sequence of query strings q1, q2, ..., qn
, this
method will perform a wildcard search with the *q1*q2*...*qn*
pattern. It means all the strings q1, q2, ..., qn
are substrings of
a matched string in the order specified in the sequence.
If the cells are continuously updated, updates made to the indexed fields may not be immediately reflected in the index. That is, a substring query may have false negatives -- a cell is currently matched, but not included in the index yet. To rule out false positives (the previously matched cells do not match now), we need to check the cell field values again after getting the matched cell ids.
Indexes are updated periodically by the system. To manually update an
index, we can call Index.UpdateSubstringIndex()
on a specific field
identifier.
The inverted index subsystem is integrated with LINQ. In a LINQ query
over a selector, GE
translates the invocations of String.Contains
on an indexed field
into an inverted index queries. The same rule applies for
IEnumerable<string>.Contains
. Refer to the LINQ section
for more details.