Converted many sidebars to TIPS, and replaced a number of TODOs
clintongormley committed Sep 23, 2014
1 parent 0a9ad9c commit 8ac0033
Showing 42 changed files with 249 additions and 220 deletions.
2 changes: 1 addition & 1 deletion 075_Inside_a_shard/60_Segment_merging.asciidoc
@@ -48,7 +48,7 @@ search performance if left unchecked. By default, Elasticsearch throttles the
merge process so that search still has enough resources available to perform
well.

TIP: See the <<TODO>> section for advice about tuning merging for your use
TIP: See <<segments-and-merging>> for advice about tuning merging for your use
case.
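
As a quick illustration of the kind of tuning covered there, the merge
throttle can be adjusted at runtime through the cluster settings API. This is
only a sketch -- it assumes the dynamic
`indices.store.throttle.max_bytes_per_sec` setting, and the `100mb` value is
purely illustrative:

[source,js]
--------------------------------------------------
PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.max_bytes_per_sec" : "100mb" <1>
    }
}
--------------------------------------------------
<1> Raise (or lower) the merge throttle to suit your hardware.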

[[optimize-api]]
5 changes: 3 additions & 2 deletions 120_Proximity_Matching/00_Intro.asciidoc
@@ -29,9 +29,10 @@ score.

This is the province of _phrase matching_ or _proximity matching_.

****
[TIP]
==================================================
In this chapter we will be using the same example documents that we used for
the <<match-test-data,`match` query>>.
****
==================================================
10 changes: 6 additions & 4 deletions 120_Proximity_Matching/05_Phrase_matching.asciidoc
@@ -25,7 +25,8 @@ only keeps documents which contain *all* of the search terms in the same
would not match any of our documents because no document contains the word
`"quick"` immediately followed by `"fox"`.

****
[TIP]
==================================================
The `match_phrase` query can also be written as a `match` query with type
`phrase`:
@@ -41,7 +42,7 @@ The `match_phrase` query can also be written as a `match` query with type
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json
****
==================================================

==== Term positions

@@ -103,7 +104,8 @@ For a document to be considered a match for the phrase ``quick brown fox'':

If any of these conditions is not met, the document is not considered a match.
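
If you want to see the positions that the query will compare, the `analyze`
API reports the position of every term it emits. A minimal sketch (the
analyzer and text are illustrative):

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
--------------------------------------------------

Each token in the response carries a `position` value, which is what the
`match_phrase` query checks.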

**************************************************
[TIP]
==================================================
Internally, the `match_phrase` query uses the low-level `span` query family to
do position-aware matching. Span queries are term-level queries, so they have
@@ -114,4 +116,4 @@ Thankfully, most people never need to use the `span` queries directly as the
fields, like patent searches, use these low-level queries to perform very
specific, carefully constructed positional searches.
**************************************************
==================================================
5 changes: 3 additions & 2 deletions 120_Proximity_Matching/30_Performance.asciidoc
@@ -12,7 +12,8 @@ with `slop`).

And of course, this cost is paid at search time instead of at index time.

**************************************************************************
[TIP]
==================================================
Usually, the extra cost of phrase queries is not as scary as these numbers
suggest. Really, the difference in performance is a testimony to just how fast
@@ -26,7 +27,7 @@ many many identical terms repeated in many positions. Using higher `slop`
values in this case results in a huge growth in the number of position
calculations.
**************************************************************************
==================================================

So what can we do to limit the performance cost of phrase and proximity
queries? One useful approach is to reduce the total number of documents that
5 changes: 3 additions & 2 deletions 120_Proximity_Matching/35_Shingles.asciidoc
@@ -29,7 +29,8 @@ but also each word *and its neighbour* as single terms:

These word pairs (or _bigrams_) are known as _shingles_.

**************************************************************************
[TIP]
==================================================
Shingles are not restricted to being pairs of words; you could index word
triplets (_trigrams_) as well:
@@ -40,7 +41,7 @@ Trigrams give you a higher degree of precision, but greatly increase the
number of unique terms in the index. Bigrams are sufficient for most use
cases.
**************************************************************************
==================================================
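
As a rough sketch of how such shingles might be produced at index time (the
filter and analyzer names are invented; the parameters assume the standard
`shingle` token filter):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, <1>
                    "max_shingle_size": 2,
                    "output_unigrams":  false <2>
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "my_shingle_filter" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Emit bigrams only.
<2> Emit just the word pairs, not the individual words as well.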

Of course, shingles are only useful if the user enters their query in the same
order as in the original document; a query for `"sue alligator"` would match
5 changes: 3 additions & 2 deletions 130_Partial_Matching/10_Prefix_query.asciidoc
@@ -21,14 +21,15 @@ The `prefix` query is a low-level query that works at the term level. It
doesn't analyze the query string before searching -- it assumes that you have
passed it the exact prefix that you want to find.

**************************************************
[TIP]
==================================================
By default, the `prefix` query does no relevance scoring. It just finds
matching documents and gives them all a score of `1`. Really it behaves more
like a filter than a query. The only practical difference between the
`prefix` query and the `prefix` filter is that the filter can be cached.
**************************************************
==================================================
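
A minimal sketch of the filter form, wrapped in a `filtered` query (the index,
type, and field names are illustrative):

[source,js]
--------------------------------------------------
GET /my_index/address/_search
{
    "query": {
        "filtered": {
            "filter": {
                "prefix": { "postcode": "W1" } <1>
            }
        }
    }
}
--------------------------------------------------
<1> Find all postcodes that begin with `W1`.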


Previously we said that ``you can only find terms that exist in the inverted
5 changes: 3 additions & 2 deletions 130_Partial_Matching/35_Search_as_you_type.asciidoc
@@ -315,15 +315,16 @@ the `postcode` field would need to be `analyzed` instead of `not_analyzed` but
you could use the `keyword` tokenizer to treat the postcodes as if they were
`not_analyzed`.

*************************************************
[TIP]
==================================================
The `keyword` tokenizer is the NOOP tokenizer, the tokenizer which does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as `not_analyzed` but which require some other analysis
transformation such as lowercasing.
*************************************************
==================================================

[source,js]
--------------------------------------------------
15 changes: 9 additions & 6 deletions 170_Relevance/10_Scoring_theory.asciidoc
@@ -155,14 +155,15 @@ These three factors -- term frequency, inverse document frequency, and field
length norm -- are calculated and stored at index time. Together, they are
used to calculate the _weight_ of a single term in a particular document.

[TIP]
.Document vs Field
***************************
==================================================
When we refer to documents in the above formulae, we are actually talking about
a field within a document. Each field has its own inverted index and thus,
for TF/IDF purposes, the value of the field is the value of the document.
***************************
==================================================
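
As a rough worked example -- the numbers are invented, and assume the default
formulae shown earlier in this chapter (`tf = √frequency`,
`idf = 1 + log ( numDocs / (docFreq + 1))` using the natural logarithm, and
`norm = 1 / √numTerms`):

....
tf   = √3                        ≈ 1.7   (the term appears 3 times in the field)
idf  = 1 + log( 1000 / (100+1) ) ≈ 3.3   (it appears in 100 of 1,000 documents)
norm = 1 / √25                   = 0.2   (the field contains 25 terms)
....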

When we run a simple `term` query with `explain` set to `true` (see
<<explain>>), you will see that the only factors involved in calculating the
@@ -221,15 +222,16 @@ A vector is really just a one-dimensional array containing numbers, like:
In the Vector Space Model, each number in the vector is the _weight_ of a term,
as calculated with <<tfidf,Term Frequency/Inverse Document Frequency>>.

*****************************************
[TIP]
==================================================
While TF/IDF is the default way of calculating term weights for the Vector
Space Model, it is not the only way. Other models like Okapi-BM25 exist and
are available in Elasticsearch. TF/IDF is the default because it is a
simple, efficient algorithm which produces high quality search results, and
has stood the test of time.
*****************************************
==================================================
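
If you want to experiment with a different similarity, it can be selected per
field in the mapping. A minimal sketch (the index, type, and field names are
illustrative):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "title": {
                    "type":       "string",
                    "similarity": "BM25" <1>
                }
            }
        }
    }
}
--------------------------------------------------
<1> Use the built-in `BM25` similarity for this field instead of the default TF/IDF.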

Imagine that we have a query for ``happy hippopotamus''. A common word like
`happy` will have a low weight, while an uncommon term like `hippopotamus`
@@ -265,7 +267,8 @@ the query is large, so it is of low relevance. Document 2 is closer to the
query, meaning that it is reasonably relevant, and document 3 is a perfect
match.

**********************************************
[TIP]
==================================================
In practice, only two-dimensional vectors (queries with two terms) can be
plotted easily on a graph. Fortunately, _linear algebra_ -- the branch of
@@ -276,7 +279,7 @@ same principles explained above to queries which consist of many terms.
You can read more about how to compare two vectors using _Cosine Similarity_
at http://en.wikipedia.org/wiki/Cosine_similarity.
**********************************************
==================================================
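
As a small worked example with invented weights: give the query ``happy
hippopotamus'' the vector `[1, 4]` (a low weight for `happy`, a high weight
for `hippopotamus`), and give a document that mentions only `hippopotamus`
the vector `[0, 4]`. Their cosine similarity is then:

....
cos(θ) = (1×0 + 4×4) / ( √(1² + 4²) × √(0² + 4²) )
       = 16 / ( 4.12 × 4 )
       ≈ 0.97
....

A value close to `1` means the document points in almost the same direction
as the query -- in other words, it is highly relevant.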

Now that we have talked about the theoretical basis of scoring, we can move on
to see how scoring is implemented in Lucene.
5 changes: 3 additions & 2 deletions 170_Relevance/15_Practical_scoring.asciidoc
@@ -84,15 +84,16 @@ The query normalization factor (`queryNorm`) is an attempt to ``normalize'' a
query so that the results from one query may be compared with the results of
another.

**************************
[TIP]
==================================================
Even though the intent of the query norm is to make results from different
queries comparable, it doesn't work very well. Really, the only purpose of
the relevance `_score` is to sort the results of the current query in the
correct order. You should not try to compare the relevance scores from
different queries.
**************************
==================================================

This factor is calculated at the beginning of the query. The actual
calculation depends on the queries involved but a typical implementation would
5 changes: 3 additions & 2 deletions 170_Relevance/20_Query_time_boosting.asciidoc
@@ -86,7 +86,8 @@ appear in the query DSL. Instead, any boost values are combined and passed
down to the individual terms. The `t.getBoost()` method returns any `boost`
value applied to the term itself or to any of the queries higher up the chain.
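
For instance, here is a minimal sketch of boosting one query clause over
another (the field names and the boost value are illustrative):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "title": {
                            "query": "quick brown fox",
                            "boost": 2 <1>
                        }
                    }
                },
                { "match": { "content": "quick brown fox" }} <2>
            ]
        }
    }
}
--------------------------------------------------
<1> Matches on `title` contribute twice as much weight...
<2> ...while matches on `content` keep the default `boost` of `1`.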

********************************
[TIP]
==================================================
In fact, reading the <<explain,`explain`>> output is a little more complex
than that. You won't see the `boost` value or `t.getBoost()` mentioned in the
@@ -95,4 +96,4 @@ than that. You won't see the `boost` value or `t.getBoost()` mentioned in the
the `queryNorm` is the same for every term, you will see that the `queryNorm`
for a boosted term is higher than the `queryNorm` for an unboosted term.
********************************
==================================================
5 changes: 3 additions & 2 deletions 170_Relevance/60_Decay_functions.asciidoc
@@ -129,14 +129,15 @@ If we were to set the `origin` to £100, then prices below £100 would receive a
lower score. Instead, we set both the `origin` and the `offset` to £50. That
way, the score only decays for any prices above £100 (`origin + offset`).

[TIP]
.Tuning `function_score` clauses
********************************************
==================================================
The `weight` parameter can be used to increase or decrease the contribution of
individual clauses. The `weight`, which defaults to `1.0`, is multiplied with
the score from each clause before the scores are combined with the specified
`score_mode`.
********************************************
==================================================
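
Putting the pieces together, here is a rough sketch of a `function_score`
query that combines two decay clauses with different weights (the field
names, locations, and values are illustrative):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "function_score": {
            "functions": [
                {
                    "gauss": {
                        "location": { <1>
                            "origin": "51.5, 0.12",
                            "offset": "2km",
                            "scale":  "3km"
                        }
                    }
                },
                {
                    "gauss": {
                        "price": { <2>
                            "origin": "50",
                            "offset": "50",
                            "scale":  "20"
                        }
                    },
                    "weight": 2 <3>
                }
            ]
        }
    }
}
--------------------------------------------------
<1> Prefer locations within about 2km of the central point.
<2> Prefer prices up to £100 (`origin + offset`), as discussed above.
<3> Multiply the score from the price clause by `2` before the scores are combined.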


5 changes: 3 additions & 2 deletions 210_Identifying_words/20_Standard_tokenizer.asciidoc
@@ -42,8 +42,9 @@ In the above example, the apostrophe in `You're` is treated as part of the
word while the single quotes in `'favourite'` are not, resulting in the
following terms: `You're`, `my`, `favourite`.

[TIP]
.`uax_url_email` tokenizer
***************************************************
==================================================
The `uax_url_email` tokenizer works in exactly the same way as the `standard`
tokenizer, except that it recognises email addresses and URLs and emits them as
@@ -52,7 +53,7 @@ break them up into individual words. For instance, the email address
`joe-bloggs@foo-bar.com` would result in the tokens `joe`, `bloggs`, `foo`,
`bar.com`.
***************************************************
==================================================
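
A minimal sketch of wiring it into a custom analyzer (the index and analyzer
names are invented):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_email_analyzer": {
                    "type":      "custom",
                    "tokenizer": "uax_url_email", <1>
                    "filter":    [ "lowercase" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Email addresses and URLs are emitted as single tokens; all other text is tokenized just as the `standard` tokenizer would.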

The `standard` tokenizer is a reasonable starting point for tokenizing most
languages, especially Western languages. In fact, it forms the basis of most
5 changes: 3 additions & 2 deletions 220_Token_normalization/30_Unicode_world.asciidoc
@@ -70,8 +70,9 @@ PUT /my_index
--------------------------------------------------
<1> Normalize all tokens into the `nfkc` normalization form.

[TIP]
.When to normalize
**************************************************
==================================================
Besides the `icu_normalizer` token filter mentioned above, there is also an
`icu_normalizer` *character* filter, which does the same job as the token
@@ -83,7 +84,7 @@ However, if you plan on using a different tokenizer, such as the `ngram`,
`edge_ngram` or `pattern` tokenizers, then it would make sense to use the
`icu_normalizer` character filter in preference to the token filter.
**************************************************
==================================================
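
For example, here is a rough sketch of pairing the `icu_normalizer` character
filter with the `pattern` tokenizer. The names are invented, and the `name`
parameter is assumed to mirror the token filter shown above:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "nfkc_normalizer": {
                    "type": "icu_normalizer", <1>
                    "name": "nfkc"
                }
            },
            "analyzer": {
                "my_normalizing_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "nfkc_normalizer" ],
                    "tokenizer":   "pattern" <2>
                }
            }
        }
    }
}
--------------------------------------------------
<1> Normalize the raw text before it reaches the tokenizer.
<2> The `pattern` tokenizer then works on text that has already been normalized.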

Usually, though, not only will you want to normalize the byte order of tokens,
but also to lowercase them. This can be done with the `icu_normalizer` using
14 changes: 8 additions & 6 deletions 220_Token_normalization/60_Sorting_and_collations.asciidoc
@@ -275,12 +275,13 @@ PUT /my_index/user/_bulk
GET /my_index/user/_search?sort=name.sort
--------------------------------------------------

.Binary sort keys
**************************************************
The first thing to notice is that the `sort` key returned with each document,
which in earlier examples looked like `brown` and `böhm`, now looks like
gobbledygook: `ᖔ乏昫တ倈⠀\u0001`. The reason is that the `icu_collation` filter
emits keys intended only for efficient sorting, not for any other purposes.
Note that the `sort` key returned with each document, which in earlier
examples looked like `brown` and `böhm`, now looks like gobbledygook:
`ᖔ乏昫တ倈⠀\u0001`. The reason is that the `icu_collation` filter emits keys
intended only for efficient sorting, not for any other purposes.
**************************************************

@@ -325,13 +326,14 @@ German phonebooks::
{ "language": "en", "variant": "@collation=phonebook" }
-------------------------

[TIP]
.Supported locales
*******************************
==================================================
You can read more about the locales supported by ICU here:
http://userguide.icu-project.org/locale
*******************************
==================================================

This example shows how to set up the German phonebook sort order:

6 changes: 4 additions & 2 deletions 230_Stemming/10_Algorithmic_stemmers.asciidoc
@@ -18,15 +18,17 @@ http://snowball.tartarus.org/[Snowball language] for creating stemming
algorithms and a number of the stemmers available in Elasticsearch are
written in Snowball.

********************************************
[TIP]
.`kstem` token filter
==================================================
The {ref}analysis-kstem-tokenfilter.html[`kstem` token filter] is a stemmer
for English which combines the algorithmic approach with a built-in
dictionary. The dictionary contains a list of root words and exceptions in
order to avoid conflating words incorrectly. `kstem` tends to stem less
aggressively than the Porter stemmer.
********************************************
==================================================
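
A minimal sketch of dropping `kstem` into a custom English analyzer (the
index and analyzer names are invented):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "kstem" ] <1>
                }
            }
        }
    }
}
--------------------------------------------------
<1> Lowercase first, then stem with `kstem`.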

==== Using an algorithmic stemmer

5 changes: 3 additions & 2 deletions 230_Stemming/30_Hunspell_stemmer.asciidoc
@@ -169,13 +169,14 @@ An interesting property of the `hunspell` stemmer, as can be seen in the
example above, is that it can remove prefixes as well as suffixes. Most
algorithmic stemmers remove suffixes only.

***********************************************
[TIP]
==================================================
Hunspell dictionaries can consume a few megabytes of RAM. Fortunately,
Elasticsearch only creates a single instance of a dictionary per node. All
shards which use the same Hunspell analyzer share the same instance.
***********************************************
==================================================
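
A rough sketch of declaring a `hunspell` filter and analyzer. It assumes that
an `en_US` dictionary has already been installed under the node's
`config/hunspell/en_US` directory, and that the filter accepts a `language`
parameter naming that dictionary:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "en_US_stemmer": {
                    "type":     "hunspell",
                    "language": "en_US" <1>
                }
            },
            "analyzer": {
                "en_US_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "en_US_stemmer" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Points at the dictionary files installed in `config/hunspell/en_US`.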

[[hunspell-dictionary-format]]
==== Hunspell dictionary format
5 changes: 3 additions & 2 deletions 240_Stopwords/10_Intro.asciidoc
@@ -26,13 +26,14 @@ Common words that appear in many documents in the index, like `the`, `and` and
`is`. These words have a low weight and contribute little to the relevance
score.

**********************************************
[TIP]
==================================================
Of course, frequency is really a scale rather than just two points labelled
_low_ and _high_. We just draw a line at some arbitrary point and say that any
terms below that line are low frequency and above the line are high frequency.
**********************************************
==================================================

Which terms are low or high frequency depends on the documents themselves. The
word `and` may be a low frequency term if all of the documents are in Chinese.
5 changes: 3 additions & 2 deletions 240_Stopwords/40_Divide_and_conquer.asciidoc
@@ -81,8 +81,9 @@ document like ``Quick **AND THE** dead'' higher than ``**THE** quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.

[TIP]
.`and` query
********************************
==================================================
Setting the operator parameter to `and` would make *all* low frequency terms
required, and score documents that contain *all* high frequency terms higher.
@@ -91,7 +92,7 @@ frequency terms. If you would prefer all low and high frequency terms to be
required, then you should use a `bool` query instead. As we saw in
<<stopwords-and>>, this is already an efficient query.
********************************
==================================================
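
A minimal sketch of what this looks like with the `common` terms query. The
field name and cutoff are illustrative, and `low_freq_operator` is assumed to
be the parameter referred to above:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "text": {
                "query":             "Quick and the dead",
                "cutoff_frequency":  0.01, <1>
                "low_freq_operator": "and" <2>
            }
        }
    }
}
--------------------------------------------------
<1> Terms that appear in more than 1% of documents are treated as high frequency.
<2> All of the low frequency terms -- here `quick` and `dead` -- must be present.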

==== Controlling precision
