source/core/map-reduce-sharded-collections.txt

==================================
Map-Reduce and Sharded Collections
==================================

.. default-domain:: mongodb

.. contents:: On this page
   :local:
   :backlinks: none
   :depth: 1
   :class: singlecol

Map-reduce supports operations on sharded collections, both as an input
and as an output. This section describes the behaviors of
:dbcommand:`mapReduce` specific to sharded collections.

.. _map-reduce-sharded-cluster:

Sharded Collection as Input
---------------------------

When using sharded collection as the input for a map-reduce operation,
:program:`mongos` will automatically dispatch the map-reduce job to
each shard in parallel. There is no special option
required. :program:`mongos` will wait for jobs on all shards to
finish.

Sharded Collection as Output
----------------------------

If the ``out`` field for :dbcommand:`mapReduce` has the ``sharded``
value, MongoDB shards the output collection using the ``_id`` field as
the shard key.

To output to a sharded collection:

- If the output collection does not exist, MongoDB creates and shards
  the collection on the ``_id`` field.

- For a new or an empty sharded collection, MongoDB uses the results of
  the first stage of the map-reduce operation to create the initial
  :term:`chunks <chunk>` distributed among the shards.

- :program:`mongos` dispatches, in parallel, a map-reduce
  post-processing job to every shard that owns a chunk. During the
  post-processing, each shard will pull the results
  for its own chunks from the other shards, run the final
  reduce/finalize, and write locally to the output collection.

.. note::

   - During later map-reduce jobs, MongoDB splits chunks as needed.

   - Balancing of chunks for the output collection is automatically
     prevented during post-processing to avoid concurrency issues.

In MongoDB 2.0:

- :program:`mongos` retrieves the results from each shard,
  performs a merge sort to order the results, and proceeds to the reduce/finalize phase as
  needed. :program:`mongos` then writes the result to the output
  collection in sharded mode.

- This model requires only a small amount of memory, even for large data sets.

- Shard chunks are not automatically split during insertion. This
  requires manual intervention until the chunks are granular and
  balanced.

.. important::
   For best results, only use the sharded output options for
   :dbcommand:`mapReduce` in version 2.2 or later.