[FLINK-21410][docs] Document checkpoint interval trade-offs
This closes apache#14964
sjwiesman committed Feb 19, 2021
1 parent f5f5d4b commit 7f03e13
Showing 3 changed files with 31 additions and 1 deletion.
15 changes: 15 additions & 0 deletions docs/content.zh/docs/ops/production_ready.md
@@ -59,6 +59,21 @@ To establish a stable mapping, we need stable operator uids provided by the user

See the [description of state backends]({{< ref "docs/ops/state/state_backends" >}}#choose-the-right-state-backend) for choosing the right one for your use case.

### Choose The Right Checkpoint Interval

[Checkpointing]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}) is Flink's primary fault-tolerance mechanism, wherein a snapshot of your job's state is periodically persisted to some durable location.
In the case of failure, Flink will restart from the most recent checkpoint and resume processing.
A job's [checkpoint interval]({{< ref "docs/deployment/config" >}}#execution-checkpointing-interval) configures how often Flink will take these snapshots.
While there is no single correct answer for the perfect checkpoint interval, the community can offer guidance on which factors to consider when configuring this parameter.

* **What is the SLA of your service:** The checkpoint interval is best understood as an expression of the job's service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes' worth of data after a failure.

* **How often must your service deliver results:** Exactly-once sinks, such as Kafka or the FileSink, only make results visible on checkpoint completion. Shorter checkpoint intervals make results available more quickly but may also put additional pressure on these systems. It is important to work with stakeholders to find a delivery time that meets product requirements without putting undue load on your sinks.

* **How much load can your Task Managers sustain:** All of Flink's built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still requires CPU cycles and network bandwidth from your machines. [Incremental checkpointing]({{< ref "docs/deployment/config" >}}#state-backend-incremental) can be a powerful tool to reduce the cost of any given checkpoint.

And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.
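
As a concrete, hedged illustration of these trade-offs, the sketch below enables checkpointing with a 5-minute interval and adds a minimum pause between checkpoints. The values and the minimum-pause safeguard are placeholder assumptions for illustration, not recommendations.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 5 minutes: after a failure, Flink reprocesses at most
        // roughly the last 5 minutes of data (the SLA framing above).
        env.enableCheckpointing(5 * 60 * 1000);

        // Assumed safeguard: leave at least 1 minute between checkpoints so the
        // Task Managers keep headroom for regular processing.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000);

        // Placeholder pipeline; substitute your real sources, operators, and sinks.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-interval-sketch");
    }
}
```

Whether a minimum pause is useful at all depends on the load discussion above; measure your own job before adopting either value.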

### Configure JobManager High Availability

The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
@@ -46,7 +46,7 @@ Flink's checkpointing mechanism interacts with durable storage for streams and s

## Enabling and Configuring Checkpointing

By default, checkpointing is disabled. To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the checkpoint interval in milliseconds.
By default, checkpointing is disabled. To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the [checkpoint interval]({{< ref "docs/ops/production_ready#choose-the-right-checkpoint-interval" >}}) in milliseconds.
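
For example, a minimal sketch (the 10-second interval is only an illustrative value):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// n = 10000 milliseconds: take a checkpoint every 10 seconds.
env.enableCheckpointing(10000);
```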

Other parameters for checkpointing include:

15 changes: 15 additions & 0 deletions docs/content/docs/ops/production_ready.md
@@ -59,6 +59,21 @@ To establish a stable mapping, we need stable operator uids provided by the user

See the [description of state backends]({{< ref "docs/ops/state/state_backends" >}}#choose-the-right-state-backend) for choosing the right one for your use case.

### Choose The Right Checkpoint Interval

[Checkpointing]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}) is Flink's primary fault-tolerance mechanism, wherein a snapshot of your job's state is periodically persisted to some durable location.
In the case of failure, Flink will restart from the most recent checkpoint and resume processing.
A job's [checkpoint interval]({{< ref "docs/deployment/config" >}}#execution-checkpointing-interval) configures how often Flink will take these snapshots.
While there is no single correct answer for the perfect checkpoint interval, the community can offer guidance on which factors to consider when configuring this parameter.

* **What is the SLA of your service:** The checkpoint interval is best understood as an expression of the job's service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes' worth of data after a failure.

* **How often must your service deliver results:** Exactly-once sinks, such as Kafka or the FileSink, only make results visible on checkpoint completion. Shorter checkpoint intervals make results available more quickly but may also put additional pressure on these systems. It is important to work with stakeholders to find a delivery time that meets product requirements without putting undue load on your sinks.

* **How much load can your Task Managers sustain:** All of Flink's built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still requires CPU cycles and network bandwidth from your machines. [Incremental checkpointing]({{< ref "docs/deployment/config" >}}#state-backend-incremental) can be a powerful tool to reduce the cost of any given checkpoint.

And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.
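
A hedged sketch of an alternative way to supply the interval, via the `execution.checkpointing.interval` option from the configuration reference linked above rather than a direct `enableCheckpointing()` call. Passing a `Configuration` to `getExecutionEnvironment` is an assumption about your Flink version and setup, and the 5-minute value is a placeholder, so treat this as an illustration only.

```java
import java.time.Duration;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ConfiguredCheckpointInterval {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Corresponds to execution.checkpointing.interval = 5 min.
        conf.set(ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL, Duration.ofMinutes(5));

        // Assumes a Flink version where the environment accepts a Configuration directly;
        // otherwise the same key can be set in the cluster configuration instead.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // Placeholder pipeline; substitute your real job graph.
        env.fromElements("a", "b", "c").print();

        env.execute("configured-checkpoint-interval");
    }
}
```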

### Configure JobManager High Availability

The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
