forked from mongodb/docs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
map-reduce-sharded-collections.txt
72 lines (52 loc) · 2.4 KB
/
map-reduce-sharded-collections.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
==================================
Map-Reduce and Sharded Collections
==================================
.. default-domain:: mongodb
.. contents:: On this page
:local:
:backlinks: none
:depth: 1
:class: singlecol
Map-reduce supports operations on sharded collections, both as an input
and as an output. This section describes the behaviors of
:dbcommand:`mapReduce` specific to sharded collections.
.. _map-reduce-sharded-cluster:
Sharded Collection as Input
---------------------------
When using sharded collection as the input for a map-reduce operation,
:program:`mongos` will automatically dispatch the map-reduce job to
each shard in parallel. There is no special option
required. :program:`mongos` will wait for jobs on all shards to
finish.
Sharded Collection as Output
----------------------------
If the ``out`` field for :dbcommand:`mapReduce` has the ``sharded``
value, MongoDB shards the output collection using the ``_id`` field as
the shard key.
To output to a sharded collection:
- If the output collection does not exist, MongoDB creates and shards
the collection on the ``_id`` field.
- For a new or an empty sharded collection, MongoDB uses the results of
the first stage of the map-reduce operation to create the initial
:term:`chunks <chunk>` distributed among the shards.
- :program:`mongos` dispatches, in parallel, a map-reduce
post-processing job to every shard that owns a chunk. During the
post-processing, each shard will pull the results
for its own chunks from the other shards, run the final
reduce/finalize, and write locally to the output collection.
.. note::
- During later map-reduce jobs, MongoDB splits chunks as needed.
- Balancing of chunks for the output collection is automatically
prevented during post-processing to avoid concurrency issues.
In MongoDB 2.0:
- :program:`mongos` retrieves the results from each shard,
performs a merge sort to order the results, and proceeds to the reduce/finalize phase as
needed. :program:`mongos` then writes the result to the output
collection in sharded mode.
- This model requires only a small amount of memory, even for large data sets.
- Shard chunks are not automatically split during insertion. This
requires manual intervention until the chunks are granular and
balanced.
.. important::
For best results, only use the sharded output options for
:dbcommand:`mapReduce` in version 2.2 or later.