[FLINK-21410][docs] Document checkpoint interval trade-offs
This closes apache#14964
sjwiesman committed Feb 19, 2021
1 parent f5f5d4b commit 7f03e13
Showing 3 changed files with 31 additions and 1 deletion.
15 changes: 15 additions & 0 deletions docs/content.zh/docs/ops/production_ready.md
@@ -59,6 +59,21 @@ To establish a stable mapping, we need stable operator uids provided by the user

See the [description of state backends]({{< ref "docs/ops/state/state_backends" >}}#choose-the-right-state-backend) for choosing the right one for your use case.

### Choose The Right Checkpoint Interval

[Checkpointing]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}) is Flink's primary fault-tolerance mechanism, wherein a snapshot of your job's state is periodically persisted to some durable location.
In the case of failure, Flink will restart from the most recent checkpoint and resume processing.
A job's [checkpoint interval]({{< ref "docs/deployment/config" >}}#execution-checkpointing-interval) configures how often Flink will take these snapshots.
While there is no single correct answer for the perfect checkpoint interval, the community can offer guidance on which factors to consider when configuring this parameter.

* **What is the SLA of your service:** The checkpoint interval is best understood as an expression of the job's service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes' worth of data after a failure.

* **How often must your service deliver results:** Exactly-once sinks, such as Kafka or the FileSink, only make results visible on checkpoint completion. Shorter checkpoint intervals make results available more quickly but may also put additional pressure on these systems. It is important to work with stakeholders to find a delivery time that meets product requirements without putting undue load on your sinks.

* **How much load can your Task Managers sustain:** All of Flink's built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still requires CPU cycles and network bandwidth from your machines. [Incremental checkpointing]({{< ref "docs/deployment/config" >}}#state-backend-incremental) can be a powerful tool to reduce the cost of any given checkpoint.

And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.
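
As a concrete, hedged illustration of these trade-offs, the sketch below enables checkpointing with a 5-minute interval and adds a minimum pause between checkpoints. The values and the minimum-pause safeguard are placeholder assumptions for illustration, not recommendations.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 5 minutes: after a failure, Flink reprocesses at most
        // roughly the last 5 minutes of data (the SLA framing above).
        env.enableCheckpointing(5 * 60 * 1000);

        // Assumed safeguard: leave at least 1 minute between checkpoints so the
        // Task Managers keep headroom for regular processing.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000);

        // Placeholder pipeline; substitute your real sources, operators, and sinks.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-interval-sketch");
    }
}
```

Whether a minimum pause is useful at all depends on the load discussion above; measure your own job before adopting either value.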

### Configure JobManager High Availability

The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
@@ -46,7 +46,7 @@ Flink's checkpointing mechanism interacts with durable storage for streams and s

## Enabling and Configuring Checkpointing

By default, checkpointing is disabled. To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the checkpoint interval in milliseconds.
By default, checkpointing is disabled. To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the [checkpoint interval]({{< ref "docs/ops/production_ready#choose-the-right-checkpoint-interval" >}}) in milliseconds.
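
For example, a minimal sketch (the 10-second interval is only an illustrative value):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// n = 10000 milliseconds: take a checkpoint every 10 seconds.
env.enableCheckpointing(10000);
```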

Other parameters for checkpointing include:

15 changes: 15 additions & 0 deletions docs/content/docs/ops/production_ready.md
@@ -59,6 +59,21 @@ To establish a stable mapping, we need stable operator uids provided by the user

See the [description of state backends]({{< ref "docs/ops/state/state_backends" >}}#choose-the-right-state-backend) for choosing the right one for your use case.

### Choose The Right Checkpoint Interval

[Checkpointing]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}) is Flink's primary fault-tolerance mechanism, wherein a snapshot of your job's state is periodically persisted to some durable location.
In the case of failure, Flink will restart from the most recent checkpoint and resume processing.
A job's [checkpoint interval]({{< ref "docs/deployment/config" >}}#execution-checkpointing-interval) configures how often Flink will take these snapshots.
While there is no single correct answer for the perfect checkpoint interval, the community can offer guidance on which factors to consider when configuring this parameter.

* **What is the SLA of your service:** The checkpoint interval is best understood as an expression of the job's service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes' worth of data after a failure.

* **How often must your service deliver results:** Exactly-once sinks, such as Kafka or the FileSink, only make results visible on checkpoint completion. Shorter checkpoint intervals make results available more quickly but may also put additional pressure on these systems. It is important to work with stakeholders to find a delivery time that meets product requirements without putting undue load on your sinks.

* **How much load can your Task Managers sustain:** All of Flink's built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still requires CPU cycles and network bandwidth from your machines. [Incremental checkpointing]({{< ref "docs/deployment/config" >}}#state-backend-incremental) can be a powerful tool to reduce the cost of any given checkpoint.

And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.
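
A hedged sketch of an alternative way to supply the interval, via the `execution.checkpointing.interval` option from the configuration reference linked above rather than a direct `enableCheckpointing()` call. Passing a `Configuration` to `getExecutionEnvironment` is an assumption about your Flink version and setup, and the 5-minute value is a placeholder, so treat this as an illustration only.

```java
import java.time.Duration;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ConfiguredCheckpointInterval {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Corresponds to execution.checkpointing.interval = 5 min.
        conf.set(ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL, Duration.ofMinutes(5));

        // Assumes a Flink version where the environment accepts a Configuration directly;
        // otherwise the same key can be set in the cluster configuration instead.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // Placeholder pipeline; substitute your real job graph.
        env.fromElements("a", "b", "c").print();

        env.execute("configured-checkpoint-interval");
    }
}
```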

### Configure JobManager High Availability

The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
