-
Notifications
You must be signed in to change notification settings - Fork 1.7k
/
sharding.txt
399 lines (280 loc) · 13.7 KB
/
sharding.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
.. _sharding-background:
.. _sharding-introduction:
.. _sharding-sharded-cluster:
========
Sharding
========
.. default-domain:: mongodb
.. facet::
:name: genre
:values: reference
.. meta::
:description: Scale MongoDB deployments horizontally. Use sharding to distribute data across multiple machines, supporting large datasets and high throughput operations.
.. contents:: On this page
:local:
:backlinks: none
:depth: 1
:class: twocols
.. toctree::
:titlesonly:
:hidden:
Sharded Cluster Components </core/sharded-cluster-components>
Shard Keys </core/sharding-shard-key>
Hashed Sharding </core/hashed-sharding>
Ranged Sharding </core/ranged-sharding>
Zones </core/zone-sharding>
Data Partitioning </core/sharding-data-partitioning>
Balancer </core/sharding-balancer-administration>
Administration </administration/sharded-cluster-administration>
Reference </reference/sharding>
:term:`Sharding<sharding>` is a method for distributing data across multiple
machines. MongoDB uses sharding to support deployments with very large data
sets and high throughput operations.
Database systems with large data sets or high throughput applications can
challenge the capacity of a single server. For example, high query rates can
exhaust the CPU capacity of the server. Working set sizes larger than the
system's RAM stress the I/O capacity of disk drives.
There are two methods for addressing system growth: vertical and horizontal
scaling.
*Vertical Scaling* involves increasing the capacity of a single server, such
as using a more powerful CPU, adding more RAM, or increasing the amount of
storage space. Limitations in available technology may restrict a single
machine from being sufficiently powerful for a given workload. Additionally,
Cloud-based providers have hard ceilings based on available hardware
configurations. As a result, there is a practical maximum for vertical scaling.
*Horizontal Scaling* involves dividing the system dataset and load over
multiple servers, adding additional servers to increase capacity as required.
While the overall speed or capacity of a single machine may not be high, each
machine handles a subset of the overall workload, potentially providing better
efficiency than a single high-speed high-capacity server. Expanding the
capacity of the deployment only requires adding additional servers as needed,
which can be a lower overall cost than high-end hardware for a single machine.
The trade off is increased complexity in infrastructure and maintenance for
the deployment.
MongoDB supports *horizontal scaling* through :term:`sharding`.
.. |page-topic| replace:: :atlas:`shard collections in the UI </atlas-ui/collections/#shard-a-collection>`
.. cta-banner::
:url: https://www.mongodb.com/docs/atlas/atlas-ui/collections/#shard-a-collection
:icon: Cloud
.. include:: /includes/fact-atlas-compatible.rst
.. _sharded-cluster:
Sharded Cluster
---------------
.. note::
.. include:: /includes/fact-sharded-cluster-components.rst
The following graphic describes the interaction of components within a
sharded cluster:
.. include:: /images/sharded-cluster-production-architecture.rst
MongoDB shards data at the :term:`collection` level, distributing the
collection data across the shards in the cluster.
Shard Keys
----------
MongoDB uses the :ref:`shard key <sharding-shard-key>` to
distribute the collection's documents across shards. The shard key
consists of a field or multiple fields in the documents.
Documents in sharded collections can be missing the shard key fields. Missing
shard key fields are treated as having null values when distributing the
documents across shards but not when routing queries. For more information, see
:ref:`shard-key-missing`.
You select the shard key when :ref:`sharding a collection
<sharding-shard-key-creation>`.
- Starting in MongoDB 5.0, you can :ref:`reshard a collection
<sharding-resharding>` by changing a collection's shard key.
- You can :ref:`refine a shard key <shard-key-refine>` by adding a suffix field
or fields to the existing shard key.
A document's shard key value determines its distribution across the
shards. You can update a document's shard key value unless your shard key field
is the immutable ``_id`` field. For more information, see
:ref:`update-shard-key`.
Shard Key Index
~~~~~~~~~~~~~~~
To shard a populated collection, the collection must have an
:term:`index` that starts with the shard key. When sharding an empty
collection, MongoDB creates the supporting index if the collection does
not already have an appropriate index for the specified shard key. See
:ref:`sharding-shard-key-indexes`.
Shard Key Strategy
~~~~~~~~~~~~~~~~~~
The choice of shard key affects the performance, efficiency, and scalability
of a sharded cluster. A cluster with the best possible hardware and
infrastructure can be bottlenecked by the choice of shard key. The choice of
shard key and its backing index can also affect the :ref:`sharding strategy
<sharding-strategy>` that your cluster can use.
.. seealso::
:ref:`sharding-shard-key-selection`
Chunks
------
MongoDB partitions sharded data into :term:`chunks<chunk>`. Each
chunk has an inclusive lower and exclusive upper range based on the
:term:`shard key`.
Balancer and Even Data Distribution
-----------------------------------
In an attempt to achieve an even distribution of data across all
shards in the cluster, a :ref:`balancer <sharding-balancing>` runs in the
background to migrate :term:`ranges <range>` across the :term:`shards <shard>`.
.. seealso::
:ref:`sharding-range-migration`
Advantages of Sharding
----------------------
.. _sharding-shard-key-write-scaling:
Reads / Writes
~~~~~~~~~~~~~~
MongoDB distributes the read and write workload across the
:term:`shards <shard>` in the :term:`sharded cluster`, allowing each shard to
process a subset of cluster operations. Both read and write workloads can be
scaled horizontally across the cluster by adding more shards.
For queries that include the shard key or the prefix of a :term:`compound
<compound index>` shard key, :binary:`~bin.mongos` can target the query at a
specific shard or set of shards. These :ref:`targeted
operations<sharding-mongos-targeted>` are generally more efficient than
:ref:`broadcasting <sharding-mongos-broadcast>` to every shard in the cluster.
Storage Capacity
~~~~~~~~~~~~~~~~
:term:`Sharding <sharding>` distributes data across the :term:`shards <shard>` in the
cluster, allowing each shard to contain a subset of the total cluster data. As
the data set grows, additional shards increase the storage capacity of the
cluster.
.. _sharding-availability:
High Availability
~~~~~~~~~~~~~~~~~
The deployment of config servers and shards as replica sets provide
increased availability.
Even if one or more shard replica sets become completely unavailable,
the sharded cluster can continue to perform partial reads and writes.
That is, while data on the unavailable shard(s) cannot be accessed,
reads or writes directed at the available shards can still succeed.
Considerations Before Sharding
------------------------------
Sharded cluster infrastructure requirements and complexity require
careful planning, execution, and maintenance.
While you can :ref:`reshard your collection <sharding-resharding>`
later, it is important to carefully consider your shard key choice to
avoid scalability and performance issues.
.. seealso::
:ref:`sharding-shard-key-selection`
To understand the operational requirements and restrictions for sharding
your collection, see :doc:`/core/sharded-cluster-requirements`.
If queries do *not* include the shard key or the prefix of a
:term:`compound <compound index>` shard key, :binary:`~bin.mongos` performs
a :ref:`broadcast operation <sharding-mongos-broadcast>`, querying
*all* shards in the sharded cluster. These scatter/gather queries can
be long running operations.
.. include:: /includes/fact-5.1-fassert-shard-restart-add-CWWC.rst
.. note::
If you have an active support contract with MongoDB, consider contacting
your account representative for assistance with sharded cluster
planning and deployment.
.. _sharded-vs-non-sharded-collections:
Sharded and Non-Sharded Collections
-----------------------------------
A database can have a mixture of sharded and unsharded collections. Sharded
collections are :term:`partitioned<data partition>` and distributed across the
:term:`shards<shard>` in the cluster. Unsharded collections can be located
on any shard but cannot span across shards.
.. include:: /images/sharded-cluster-primary-shard.rst
Connecting to a Sharded Cluster
-------------------------------
You must connect to a :term:`mongos` router to interact with any collection in
the :term:`sharded cluster`. This includes sharded *and* unsharded
collections. Clients should *never* connect to a single shard in order to
perform read or write operations.
.. include:: /images/sharded-cluster-mixed.rst
You can connect to a :binary:`~bin.mongos` the same way you connect to a
:binary:`~bin.mongod` using the :binary:`~bin.mongosh` or a
MongoDB :driver:`driver </?jump=docs>`.
.. note::
.. include:: /includes/fact-cannot-connect-directly-to-shards.rst
.. _sharding-strategy:
Sharding Strategy
-----------------
MongoDB supports two sharding strategies for distributing data
across :term:`sharded clusters<sharded cluster>`.
Hashed Sharding
~~~~~~~~~~~~~~~
Hashed Sharding involves computing a hash of the shard key field's
value. Each :term:`chunk` is then assigned a range based on the
hashed shard key values.
.. include:: /includes/tip-applications-do-not-need-to-compute-hashes.rst
.. include:: /images/sharding-hash-based.rst
While a range of shard keys may be "close", their hashed values are unlikely
to be on the same :term:`chunk`. Data distribution based on hashed values
facilitates more even data distribution, especially in data sets where the
shard key changes :ref:`monotonically<shard-key-monotonic>`.
However, hashed distribution means that range-based queries on the shard key
are less likely to target a single shard, resulting in more cluster wide
:ref:`broadcast operations<sharding-mongos-broadcast>`
See :doc:`/core/hashed-sharding` for more information.
Ranged Sharding
~~~~~~~~~~~~~~~
Ranged sharding involves dividing data into ranges based on the
shard key values. Each :term:`chunk` is then assigned a range based on the
shard key values.
.. include:: /images/sharding-range-based.rst
A range of shard keys whose values are "close" are more likely to reside on
the same :term:`chunk`. This allows for :ref:`targeted
operations<sharding-mongos-targeted>` as a :binary:`~bin.mongos` can route the
operations to only the shards that contain the required data.
The efficiency of ranged sharding depends on the shard key chosen. Poorly
considered shard keys can result in uneven distribution of data, which can
negate some benefits of sharding or can cause performance bottlenecks. See
:ref:`shard key selection for range-based sharding<sharding-ranged-shard-key>`.
See :doc:`/core/ranged-sharding` for more information.
.. _sharding-zone-sharding:
Zones in Sharded Clusters
-------------------------
Zones can help improve the locality of data for sharded clusters that
span multiple data centers.
.. include:: /includes/intro-zone-sharding.rst
Each zone covers one or more ranges of :term:`shard key` values. Each range a
zone covers is always inclusive of its lower boundary and exclusive of its
upper boundary.
.. include:: /images/sharded-cluster-zones.rst
You must use fields contained in the :term:`shard key` when defining a new
range for a zone to cover. If using a :term:`compound <compound index>` shard
key, the range must include the prefix of the shard key. See :ref:`shard keys
in zones <zone-sharding-shard-key>` for more information.
The possible use of zones in the future should be taken into
consideration when choosing a shard key.
.. tip::
Setting up zones and zone ranges *before* you shard an empty or a
non-existing collection allows for a faster setup of zoned sharding.
See :ref:`zones <zone-sharding>` for more information.
Collations in Sharding
----------------------
Use the :dbcommand:`shardCollection` command with the ``collation :
{ locale : "simple" }`` option to shard a collection which has a
:ref:`default collation <collation>`. Successful
sharding requires that:
- The collection must have an index whose prefix is the shard key
- The index must have the collation ``{ locale: "simple" }``
When creating new collections with a collation, ensure these conditions
are met prior to sharding the collection.
.. include:: /includes/note-sharding-collation.rst
See :dbcommand:`shardCollection` for more information about sharding
and collation.
Change Streams
--------------
:ref:`Change streams <changeStreams>` are
available for replica sets and sharded clusters. Change streams allow
applications to access real-time data changes without the complexity
and risk of tailing the oplog. Applications can use change streams to
subscribe to all data changes on a collection or collections.
Transactions
------------
With the introduction of :ref:`distributed transactions <transactions>`,
multi-document transactions are available on sharded clusters.
.. include:: /includes/extracts/transactions-committed-visibility.rst
Learn More
----------
Practical MongoDB Aggregations E-Book
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For more information on how sharding works with
:ref:`aggregations <aggregation-pipeline>`, read the sharding
chapter in the `Practical MongoDB Aggregations
<https://www.practical-mongodb-aggregations.com/guides/sharding.html>`__
e-book.
Additional Information
~~~~~~~~~~~~~~~~~~~~~~
- :doc:`/core/transactions`
- :doc:`/core/transactions-production-consideration`
- :doc:`/core/transactions-sharded-clusters`