Converted many sidebars to TIPS, and replaced a number of TODOs
clintongormley committed Sep 23, 2014
1 parent 0a9ad9c commit 8ac0033
Showing 42 changed files with 249 additions and 220 deletions.
2 changes: 1 addition & 1 deletion 075_Inside_a_shard/60_Segment_merging.asciidoc
@@ -48,7 +48,7 @@ search performance if left unchecked. By default, Elasticsearch throttles the
merge process so that search still has enough resources available to perform
well.

TIP: See the <<TODO>> section for advice about tuning merging for your use
TIP: See <<segments-and-merging>> for advice about tuning merging for your use
case.
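
As a quick illustration of the kind of tuning covered there, the merge
throttle can be adjusted at runtime through the cluster settings API. This is
only a sketch -- it assumes the dynamic
`indices.store.throttle.max_bytes_per_sec` setting, and the `100mb` value is
purely illustrative:

[source,js]
--------------------------------------------------
PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.max_bytes_per_sec" : "100mb" <1>
    }
}
--------------------------------------------------
<1> Raise (or lower) the merge throttle to suit your hardware.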

[[optimize-api]]
5 changes: 3 additions & 2 deletions 120_Proximity_Matching/00_Intro.asciidoc
@@ -29,9 +29,10 @@ score.

This is the province of _phrase matching_ or _proximity matching_.

****
[TIP]
==================================================
In this chapter we will be using the same example documents that we used for
the <<match-test-data,`match` query>>.
****
==================================================
10 changes: 6 additions & 4 deletions 120_Proximity_Matching/05_Phrase_matching.asciidoc
@@ -25,7 +25,8 @@ only keeps documents which contain *all* of the search terms in the same
would not match any of our documents because no document contains the word
`"quick"` immediately followed by `"fox"`.

****
[TIP]
==================================================
The `match_phrase` query can also be written as a `match` query with type
`phrase`:
@@ -41,7 +42,7 @@ The `match_phrase` query can also be written as a `match` query with type
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json
****
==================================================

==== Term positions

@@ -103,7 +104,8 @@ For a document to be considered a match for the phrase ``quick brown fox'':

If any of these conditions is not met, the document is not considered a match.
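
If you want to see the positions that the query will compare, the `analyze`
API reports the position of every term it emits. A minimal sketch (the
analyzer and text are illustrative):

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
--------------------------------------------------

Each token in the response carries a `position` value, which is what the
`match_phrase` query checks.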

**************************************************
[TIP]
==================================================
Internally, the `match_phrase` query uses the low-level `span` query family to
do position-aware matching. Span queries are term-level queries, so they have
@@ -114,4 +116,4 @@ Thankfully, most people never need to use the `span` queries directly as the
fields, like patent searches, use these low-level queries to perform very
specific, carefully constructed positional searches.
**************************************************
==================================================
5 changes: 3 additions & 2 deletions 120_Proximity_Matching/30_Performance.asciidoc
@@ -12,7 +12,8 @@ with `slop`).

And of course, this cost is paid at search time instead of at index time.

**************************************************************************
[TIP]
==================================================
Usually, the extra cost of phrase queries is not as scary as these numbers
suggest. Really, the difference in performance is a testimony to just how fast
@@ -26,7 +27,7 @@ many many identical terms repeated in many positions. Using higher `slop`
values in this case results in a huge growth in the number of position
calculations.
**************************************************************************
==================================================

So what can we do to limit the performance cost of phrase and proximity
queries? One useful approach is to reduce the total number of documents that
5 changes: 3 additions & 2 deletions 120_Proximity_Matching/35_Shingles.asciidoc
@@ -29,7 +29,8 @@ but also each word *and its neighbour* as single terms:

These word pairs (or _bigrams_) are known as _shingles_.

**************************************************************************
[TIP]
==================================================
Shingles are not restricted to being pairs of words; you could index word
triplets (_trigrams_) as well:
@@ -40,7 +41,7 @@ Trigrams give you a higher degree of precision, but greatly increase the
number of unique terms in the index. Bigrams are sufficient for most use
cases.
**************************************************************************
==================================================
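
As a rough sketch of how such shingles might be produced at index time (the
filter and analyzer names are invented; the parameters assume the standard
`shingle` token filter):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, <1>
                    "max_shingle_size": 2,
                    "output_unigrams":  false <2>
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "my_shingle_filter" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Emit bigrams only.
<2> Emit just the word pairs, not the individual words as well.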

Of course, shingles are only useful if the user enters their query in the same
order as in the original document; a query for `"sue alligator"` would match
5 changes: 3 additions & 2 deletions 130_Partial_Matching/10_Prefix_query.asciidoc
@@ -21,14 +21,15 @@ The `prefix` query is a low-level query that works at the term level. It
doesn't analyze the query string before searching -- it assumes that you have
passed it the exact prefix that you want to find.

**************************************************
[TIP]
==================================================
By default, the `prefix` query does no relevance scoring. It just finds
matching documents and gives them all a score of `1`. Really it behaves more
like a filter than a query. The only practical difference between the
`prefix` query and the `prefix` filter is that the filter can be cached.
**************************************************
==================================================
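
A minimal sketch of the filter form, wrapped in a `filtered` query (the index,
type, and field names are illustrative):

[source,js]
--------------------------------------------------
GET /my_index/address/_search
{
    "query": {
        "filtered": {
            "filter": {
                "prefix": { "postcode": "W1" } <1>
            }
        }
    }
}
--------------------------------------------------
<1> Find all postcodes that begin with `W1`.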


Previously we said that ``you can only find terms that exist in the inverted
5 changes: 3 additions & 2 deletions 130_Partial_Matching/35_Search_as_you_type.asciidoc
@@ -315,15 +315,16 @@ the `postcode` field would need to be `analyzed` instead of `not_analyzed` but
you could use the `keyword` tokenizer to treat the postcodes as if they were
`not_analyzed`.

*************************************************
[TIP]
==================================================
The `keyword` tokenizer is the NOOP tokenizer, the tokenizer which does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as `not_analyzed` but which require some other analysis
transformation such as lowercasing.
*************************************************
==================================================

[source,js]
--------------------------------------------------
15 changes: 9 additions & 6 deletions 170_Relevance/10_Scoring_theory.asciidoc
@@ -155,14 +155,15 @@ These three factors -- term frequency, inverse document frequency, and field
length norm -- are calculated and stored at index time. Together, they are
used to calculate the _weight_ of a single term in a particular document.

[TIP]
.Document vs Field
***************************
==================================================
When we refer to documents in the above formulae, we are actually talking about
a field within a document. Each field has its own inverted index and thus,
for TF/IDF purposes, the value of the field is the value of the document.
***************************
==================================================
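
As a rough worked example -- the numbers are invented, and assume the default
formulae shown earlier in this chapter (`tf = √frequency`,
`idf = 1 + log ( numDocs / (docFreq + 1))` using the natural logarithm, and
`norm = 1 / √numTerms`):

....
tf   = √3                        ≈ 1.7   (the term appears 3 times in the field)
idf  = 1 + log( 1000 / (100+1) ) ≈ 3.3   (it appears in 100 of 1,000 documents)
norm = 1 / √25                   = 0.2   (the field contains 25 terms)
....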

When we run a simple `term` query with `explain` set to `true` (see
<<explain>>), you will see that the only factors involved in calculating the
@@ -221,15 +222,16 @@ A vector is really just a one-dimensional array containing numbers, like:
In the Vector Space Model, each number in the vector is the _weight_ of a term,
as calculated with <<tfidf,Term Frequency/Inverse Document Frequency>>.

*****************************************
[TIP]
==================================================
While TF/IDF is the default way of calculating term weights for the Vector
Space Model, it is not the only way. Other models like Okapi-BM25 exist and
are available in Elasticsearch. TF/IDF is the default because it is a
simple, efficient algorithm which produces high quality search results, and
has stood the test of time.
*****************************************
==================================================
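
If you want to experiment with a different similarity, it can be selected per
field in the mapping. A minimal sketch (the index, type, and field names are
illustrative):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "title": {
                    "type":       "string",
                    "similarity": "BM25" <1>
                }
            }
        }
    }
}
--------------------------------------------------
<1> Use the built-in `BM25` similarity for this field instead of the default TF/IDF.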

Imagine that we have a query for ``happy hippopotamus''. A common word like
`happy` will have a low weight, while an uncommon term like `hippopotamus`
@@ -265,7 +267,8 @@ the query is large, so it is of low relevance. Document 2 is closer to the
query, meaning that it is reasonably relevant, and document 3 is a perfect
match.

**********************************************
[TIP]
==================================================
In practice, only two-dimensional vectors (queries with two terms) can be
plotted easily on a graph. Fortunately, _linear algebra_ -- the branch of
@@ -276,7 +279,7 @@ same principles explained above to queries which consist of many terms.
You can read more about how to compare two vectors using _Cosine Similarity_
at http://en.wikipedia.org/wiki/Cosine_similarity.
**********************************************
==================================================
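
As a small worked example with invented weights: give the query ``happy
hippopotamus'' the vector `[1, 4]` (a low weight for `happy`, a high weight
for `hippopotamus`), and give a document that mentions only `hippopotamus`
the vector `[0, 4]`. Their cosine similarity is then:

....
cos(θ) = (1×0 + 4×4) / ( √(1² + 4²) × √(0² + 4²) )
       = 16 / ( 4.12 × 4 )
       ≈ 0.97
....

A value close to `1` means the document points in almost the same direction
as the query -- in other words, it is highly relevant.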

Now that we have talked about the theoretical basis of scoring, we can move on
to see how scoring is implemented in Lucene.
5 changes: 3 additions & 2 deletions 170_Relevance/15_Practical_scoring.asciidoc
@@ -84,15 +84,16 @@ The query normalization factor (`queryNorm`) is an attempt to ``normalize'' a
query so that the results from one query may be compared with the results of
another.

**************************
[TIP]
==================================================
Even though the intent of the query norm is to make results from different
queries comparable, it doesn't work very well. Really, the only purpose of
the relevance `_score` is to sort the results of the current query in the
correct order. You should not try to compare the relevance scores from
different queries.
**************************
==================================================

This factor is calculated at the beginning of the query. The actual
calculation depends on the queries involved but a typical implementation would
5 changes: 3 additions & 2 deletions 170_Relevance/20_Query_time_boosting.asciidoc
@@ -86,7 +86,8 @@ appear in the query DSL. Instead, any boost values are combined and passed
down to the individual terms. The `t.getBoost()` method returns any `boost`
value applied to the term itself or to any of the queries higher up the chain.
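
For instance, here is a minimal sketch of boosting one query clause over
another (the field names and the boost value are illustrative):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "title": {
                            "query": "quick brown fox",
                            "boost": 2 <1>
                        }
                    }
                },
                { "match": { "content": "quick brown fox" }} <2>
            ]
        }
    }
}
--------------------------------------------------
<1> Matches on `title` contribute twice as much weight...
<2> ...while matches on `content` keep the default `boost` of `1`.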

********************************
[TIP]
==================================================
In fact, reading the <<explain,`explain`>> output is a little more complex
than that. You won't see the `boost` value or `t.getBoost()` mentioned in the
@@ -95,4 +96,4 @@ than that. You won't see the `boost` value or `t.getBoost()` mentioned in the
the `queryNorm` is the same for every term, you will see that the `queryNorm`
for a boosted term is higher than the `queryNorm` for an unboosted term.
********************************
==================================================
5 changes: 3 additions & 2 deletions 170_Relevance/60_Decay_functions.asciidoc
@@ -129,14 +129,15 @@ If we were to set the `origin` to £100, then prices below £100 would receive a
lower score. Instead, we set both the `origin` and the `offset` to £50. That
way, the score only decays for any prices above £100 (`origin + offset`).

[TIP]
.Tuning `function_score` clauses
********************************************
==================================================
The `weight` parameter can be used to increase or decrease the contribution of
individual clauses. The `weight`, which defaults to `1.0`, is multiplied with
the score from each clause before the scores are combined with the specified
`score_mode`.
********************************************
==================================================
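
Putting the pieces together, here is a rough sketch of a `function_score`
query that combines two decay clauses with different weights (the field
names, locations, and values are illustrative):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "function_score": {
            "functions": [
                {
                    "gauss": {
                        "location": { <1>
                            "origin": "51.5, 0.12",
                            "offset": "2km",
                            "scale":  "3km"
                        }
                    }
                },
                {
                    "gauss": {
                        "price": { <2>
                            "origin": "50",
                            "offset": "50",
                            "scale":  "20"
                        }
                    },
                    "weight": 2 <3>
                }
            ]
        }
    }
}
--------------------------------------------------
<1> Prefer locations within about 2km of the central point.
<2> Prefer prices up to £100 (`origin + offset`), as discussed above.
<3> Multiply the score from the price clause by `2` before the scores are combined.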


5 changes: 3 additions & 2 deletions 210_Identifying_words/20_Standard_tokenizer.asciidoc
@@ -42,8 +42,9 @@ In the above example, the apostrophe in `You're` is treated as part of the
word while the single quotes in `'favourite'` are not, resulting in the
following terms: `You're`, `my`, `favourite`.

[TIP]
.`uax_url_email` tokenizer
***************************************************
==================================================
The `uax_url_email` tokenizer works in exactly the same way as the `standard`
tokenizer, except that it recognises email addresses and URLs and emits them as
@@ -52,7 +53,7 @@ break them up into individual words. For instance, the email address
`joe-bloggs@foo-bar.com` would result in the tokens `joe`, `bloggs`, `foo`,
`bar.com`.
***************************************************
==================================================
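
A minimal sketch of wiring it into a custom analyzer (the index and analyzer
names are invented):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_email_analyzer": {
                    "type":      "custom",
                    "tokenizer": "uax_url_email", <1>
                    "filter":    [ "lowercase" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Email addresses and URLs are emitted as single tokens; all other text is tokenized just as the `standard` tokenizer would.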

The `standard` tokenizer is a reasonable starting point for tokenizing most
languages, especially Western languages. In fact, it forms the basis of most
5 changes: 3 additions & 2 deletions 220_Token_normalization/30_Unicode_world.asciidoc
@@ -70,8 +70,9 @@ PUT /my_index
--------------------------------------------------
<1> Normalize all tokens into the `nfkc` normalization form.

[TIP]
.When to normalize
**************************************************
==================================================
Besides the `icu_normalizer` token filter mentioned above, there is also an
`icu_normalizer` *character* filter, which does the same job as the token
@@ -83,7 +84,7 @@ However, if you plan on using a different tokenizer, such as the `ngram`,
`edge_ngram` or `pattern` tokenizers, then it would make sense to use the
`icu_normalizer` character filter in preference to the token filter.
**************************************************
==================================================
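
For example, here is a rough sketch of pairing the `icu_normalizer` character
filter with the `pattern` tokenizer. The names are invented, and the `name`
parameter is assumed to mirror the token filter shown above:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "nfkc_normalizer": {
                    "type": "icu_normalizer", <1>
                    "name": "nfkc"
                }
            },
            "analyzer": {
                "my_normalizing_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "nfkc_normalizer" ],
                    "tokenizer":   "pattern" <2>
                }
            }
        }
    }
}
--------------------------------------------------
<1> Normalize the raw text before it reaches the tokenizer.
<2> The `pattern` tokenizer then works on text that has already been normalized.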

Usually, though, not only will you want to normalize the byte order of tokens,
but also to lowercase them. This can be done with the `icu_normalizer` using
14 changes: 8 additions & 6 deletions 220_Token_normalization/60_Sorting_and_collations.asciidoc
@@ -275,12 +275,13 @@ PUT /my_index/user/_bulk
GET /my_index/user/_search?sort=name.sort
--------------------------------------------------

.Binary sort keys
**************************************************
The first thing to notice is that the `sort` key returned with each document,
which in earlier examples looked like `brown` and `böhm`, now looks like
gobbledygook: `ᖔ乏昫တ倈⠀\u0001`. The reason is that the `icu_collation` filter
emits keys intended only for efficient sorting, not for any other purposes.
Note that the `sort` key returned with each document, which in earlier
examples looked like `brown` and `böhm`, now looks like gobbledygook:
`ᖔ乏昫တ倈⠀\u0001`. The reason is that the `icu_collation` filter emits keys
intended only for efficient sorting, not for any other purposes.
**************************************************

@@ -325,13 +326,14 @@ German phonebooks::
{ "language": "en", "variant": "@collation=phonebook" }
-------------------------

[TIP]
.Supported locales
*******************************
==================================================
You can read more about the locales supported by ICU here:
http://userguide.icu-project.org/locale
*******************************
==================================================

This example shows how to set up the German phonebook sort order:

6 changes: 4 additions & 2 deletions 230_Stemming/10_Algorithmic_stemmers.asciidoc
@@ -18,15 +18,17 @@ http://snowball.tartarus.org/[Snowball language] for creating stemming
algorithms and a number of the stemmers available in Elasticsearch are
written in Snowball.

********************************************
[TIP]
.`kstem` token filter
==================================================
The {ref}analysis-kstem-tokenfilter.html[`kstem` token filter] is a stemmer
for English which combines the algorithmic approach with a built-in
dictionary. The dictionary contains a list of root words and exceptions in
order to avoid conflating words incorrectly. `kstem` tends to stem less
aggressively than the Porter stemmer.
********************************************
==================================================
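
A minimal sketch of dropping `kstem` into a custom English analyzer (the
index and analyzer names are invented):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "kstem" ] <1>
                }
            }
        }
    }
}
--------------------------------------------------
<1> Lowercase first, then stem with `kstem`.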

==== Using an algorithmic stemmer

5 changes: 3 additions & 2 deletions 230_Stemming/30_Hunspell_stemmer.asciidoc
@@ -169,13 +169,14 @@ An interesting property of the `hunspell` stemmer, as can be seen in the
example above, is that it can remove prefixes as well as suffixes. Most
algorithmic stemmers remove suffixes only.

***********************************************
[TIP]
==================================================
Hunspell dictionaries can consume a few megabytes of RAM. Fortunately,
Elasticsearch only creates a single instance of a dictionary per node. All
shards which use the same Hunspell analyzer share the same instance.
***********************************************
==================================================
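
A rough sketch of declaring a `hunspell` filter and analyzer. It assumes that
an `en_US` dictionary has already been installed under the node's
`config/hunspell/en_US` directory, and that the filter accepts a `language`
parameter naming that dictionary:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "en_US_stemmer": {
                    "type":     "hunspell",
                    "language": "en_US" <1>
                }
            },
            "analyzer": {
                "en_US_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "en_US_stemmer" ]
                }
            }
        }
    }
}
--------------------------------------------------
<1> Points at the dictionary files installed in `config/hunspell/en_US`.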

[[hunspell-dictionary-format]]
==== Hunspell dictionary format
5 changes: 3 additions & 2 deletions 240_Stopwords/10_Intro.asciidoc
@@ -26,13 +26,14 @@ Common words that appear in many documents in the index, like `the`, `and` and
`is`. These words have a low weight and contribute little to the relevance
score.

**********************************************
[TIP]
==================================================
Of course, frequency is really a scale rather than just two points labelled
_low_ and _high_. We just draw a line at some arbitrary point and say that any
terms below that line are low frequency and above the line are high frequency.
**********************************************
==================================================

Which terms are low or high frequency depends on the documents themselves. The
word `and` may be a low frequency term if all of the documents are in Chinese.
5 changes: 3 additions & 2 deletions 240_Stopwords/40_Divide_and_conquer.asciidoc
@@ -81,8 +81,9 @@ document like ``Quick **AND THE** dead'' higher than ``**THE** quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.

[TIP]
.`and` query
********************************
==================================================
Setting the operator parameter to `and` would make *all* low frequency terms
required, and score documents that contain *all* high frequency terms higher.
@@ -91,7 +92,7 @@ frequency terms. If you would prefer all low and high frequency terms to be
required, then you should use a `bool` query instead. As we saw in
<<stopwords-and>>, this is already an efficient query.
********************************
==================================================
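
A minimal sketch of what this looks like with the `common` terms query. The
field name and cutoff are illustrative, and `low_freq_operator` is assumed to
be the parameter referred to above:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "text": {
                "query":             "Quick and the dead",
                "cutoff_frequency":  0.01, <1>
                "low_freq_operator": "and" <2>
            }
        }
    }
}
--------------------------------------------------
<1> Terms that appear in more than 1% of documents are treated as high frequency.
<2> All of the low frequency terms -- here `quick` and `dead` -- must be present.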

==== Controlling precision
