Flink JobManager High Availability Configuration and Failure Drills

1. Configuration

Add the following to flink-conf.yaml:

high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/recovery
high-availability.zookeeper.quorum: xxx
high-availability.zookeeper.path.root: /flink-ha
yarn.application-attempts: 2

Notes:
1) In YARN mode there is no need to configure high-availability.cluster-id; it is only required in standalone mode.
2) yarn.application-attempts defaults to 1 when HA is disabled and to 2 once HA is enabled. Setting yarn.application-attempts greater than 1 without HA can make the job restart from an old checkpoint/restart point, which is dangerous, so if HA is not enabled it is best not to set yarn.application-attempts at all.
3) yarn.application-attempts is capped by YARN's yarn.resourcemanager.am.max-attempts setting.
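
For reference, the YARN-side cap lives in yarn-site.xml; a minimal sketch (the value 4 is purely illustrative, not a recommendation):

<!-- yarn-site.xml: cluster-wide upper bound on ApplicationMaster restart attempts -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>  <!-- illustrative; yarn.application-attempts cannot exceed this value -->
</property>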

2. Failure Drills

Without HA, a NodeManager or JobManager crash brings the job down immediately, with no automatic failure recovery. All of the drills below were run with HA enabled.
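
A minimal sketch of one way the processes in these drills can be located and killed on a node (assuming Flink 1.13 on YARN; depending on the deployment mode the JobManager main class is YarnJobClusterEntrypoint, YarnSessionClusterEntrypoint or YarnApplicationClusterEntryPoint, and the TaskManager runs as YarnTaskExecutorRunner):

# list running JVMs with their main-class names (JobManager / TaskManager / NodeManager)
jps
# simulate a crash of the chosen process; replace <pid> with the PID printed by jps
kill -9 <pid>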

(1) Kill the JobManager process

The JobManager and TaskManager are automatically moved to new nodes and restarted, recovering from the latest checkpoint. The whole process is very fast, and the failover is barely noticeable.

(2) Kill the NodeManager process on the JobManager's machine

At first the JobManager process is unaffected and the job keeps running normally. About ten minutes later, the JobManager and TaskManager are automatically moved to new nodes, restarted, and recovered from the latest checkpoint.

Observing the ResourceManager log:

2022-05-18 11:30:39,662 INFO  util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(148)) - Expired:hdfs08-dev.yingzi.com:45454 Timed out after 600 secs
2022-05-18 11:30:39,666 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:deactivateNode(1077)) - Deactivating Node hdfs08-dev.yingzi.com:45454 as it is now LOST
2022-05-18 11:30:39,667 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(671)) - hdfs08-dev.yingzi.com:45454 Node Transitioned from RUNNING to LOST
2022-05-18 11:30:39,675 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e142_1652757164854_0006_11_000001 Container Transitioned from RUNNING to KILLED
2022-05-18 11:30:39,680 INFO  capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1960)) - Removed node hdfs08-dev.yingzi.com:45454 clusterResource: <memory:143360, vCores:224>

By default it takes 10 minutes (determined by yarn.nm.liveness-monitor.expiry-interval-ms) before the NodeManager is marked LOST and the node and its containers are removed.

If the NodeManager is restarted manually within that window, the JobManager and TaskManager are not moved.

Without HA, the job would simply die after those 10 minutes.

(3) Kill the NodeManager on the JobManager's machine first, then kill the JobManager process

At first the TaskManager process stays alive, but the job can no longer make progress; after 5 minutes the TaskManager process dies as well. The TaskManager log shows this is mainly caused by the lost connection to the JobManager:

2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job b5a0c61ab1b1495dd9c2f8e45d349957 from job leader monitoring.
2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/b5a0c61ab1b1495dd9c2f8e45d349957/job_manager_lock'}.
2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job b5a0c61ab1b1495dd9c2f8e45d349957.
2022-05-18 11:37:15,079 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close ResourceManager connection 27c66057868b763b46426beea8666578.
2022-05-18 11:37:25,014 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot find task to fail for execution fc957691a7fbc7f3ac92f904b7840190 with exception:
java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterGateway.updateTaskExecutionState(org.apache.flink.runtime.taskmanager.TaskExecutionState) timed out.
...
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hdfs05-dev.yingzi.com:16897/user/rpc/jobmanager_2#-1091134734]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
...
2022-05-18 11:42:15,096 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.

Ten minutes later, the JobManager and TaskManager are automatically moved to new nodes, restarted, and recovered from the latest checkpoint; the job returns to normal.

Observing the ResourceManager log:

2022-05-17 17:06:08,398 INFO  util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(148)) - Expired:hdfs03-dev.yingzi.com:45454 Timed out after 600 secs
2022-05-17 17:06:08,399 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:deactivateNode(1077)) - Deactivating Node hdfs03-dev.yingzi.com:45454 as it is now LOST
2022-05-17 17:06:08,399 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(671)) - hdfs03-dev.yingzi.com:45454 Node Transitioned from RUNNING to LOST

Again, by default it takes 10 minutes (determined by yarn.nm.liveness-monitor.expiry-interval-ms) before the NodeManager is marked LOST and the node and its containers are removed.

If either yarn.nm.liveness-monitor.expiry-interval-ms or yarn.am.liveness-monitor.expiry-interval-ms (both default to 10 minutes) is lowered, the JobManager and TaskManager restart earlier. In other words, when the NodeManager and the JobManager die at the same time, the job's recovery time is determined jointly by the NM and AM liveness timeouts (see the yarn-site.xml sketch below).
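
A minimal yarn-site.xml sketch for lowering these timeouts (300000 ms, i.e. 5 minutes, is purely illustrative; smaller values mean faster failover but a higher risk of declaring a merely busy node dead):

<!-- how long the ResourceManager waits for NodeManager heartbeats before marking the node LOST -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>300000</value>
</property>
<!-- how long the ResourceManager waits for ApplicationMaster (JobManager) heartbeats -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>300000</value>
</property>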

If the NodeManager is restarted manually during this window, the job can still recover automatically.

(4) Kill the TaskManager process

A TaskManager is moved to a new node and restarted, but the job cannot make progress and checkpoints keep failing. The logs show the JobManager still trying to talk to the TaskManager on the old node, reporting a connection-refused exception every 10 seconds, until roughly 10 minutes later (related to the heartbeat.timeout value) the log shows:

2022-05-18 14:39:05,553 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@hdfs05-dev.yingzi.com:23065] has failed, address is now gated for [50] ms. Reason: [Disassociated]

2022-05-18 14:40:06,013 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job b5a0c61ab1b1495dd9c2f8e45d349957 with leader id 84a8fd94fa276bf476d3af016c1441be lost leadership.
2022-05-18 14:40:06,017 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job b5a0c61ab1b1495dd9c2f8e45d349957.
2022-05-18 14:40:06,018 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to fail task externally Source: TableSourceScan(table=[[default_catalog, default_database, anc_herd_property_val]], fields=[id, create_user, modify_user, create_time, modify_time, app_id, tenant_id, deleted, animal_id, property_id, property_val]) -> MiniBatchAssigner(interval=[2000ms], mode=[ProcTime]) -> DropUpdateBefore (1/1)#0 (9c5ac532a5f8459790af009cbc74fd3a).
2022-05-18 14:40:06,020 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Source: TableSourceScan(table=[[default_catalog, default_database, anc_herd_property_val]], fields=[id, create_user, modify_user, create_time, modify_time, app_id, tenant_id, deleted,animal_id, property_id, property_val]) -> MiniBatchAssigner(interval=[2000ms], mode=[ProcTime]) -> DropUpdateBefore (1/1)#0 (9c5ac532a5f8459790af009cbc74fd3a) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for b5a0c61ab1b1495dd9c2f8e45d349957.
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
...
Caused by: java.lang.Exception: Job leader for job id b5a0c61ab1b1495dd9c2f8e45d349957 lost leadership.
...

2022-05-18 14:40:16,042 INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=837.120mb (877783849 bytes), taskOffHeapMemory=0 bytes, managedMemory=2.010gb (2158221148 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 85159a1bbf11b3993c17d8adcbc9694c, jobId: b5a0c61ab1b1495dd9c2f8e45d349957).
2022-05-18 14:40:16,050 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job b5a0c61ab1b1495dd9c2f8e45d349957 from job leader monitoring.
2022-05-18 14:40:16,050 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 14:40:16,050 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/b5a0c61ab1b1495dd9c2f8e45d349957/job_manager_lock'}.
2022-05-18 14:40:16,109 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot find task to fail for execution 373d61b162d19ffda23ea2906f0cb7a9 with exception:
java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterGateway.updateTaskExecutionState(org.apache.flink.runtime.taskmanager.TaskExecutionState) timed out.
	at com.sun.proxy.$Proxy36.updateTaskExecutionState(Unknown Source) ~[?:?]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.updateTaskExecutionState(TaskExecutor.java:1852) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.unregisterTaskAndNotifyFinalState(TaskExecutor.java:1885) ~[flink-dist_2.12-1.13.2.jar:1.13.2]

...

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hdfs05-dev.yingzi.com:23065/user/rpc/jobmanager_2#-613206152]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

...

2022-05-18 14:45:06,090 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Fatal error occurred in TaskExecutor akka.tcp://flink@hdfs04-dev.yingzi.com:32782/user/rpc/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1440) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$17(TaskExecutor.java:1425) ~[flink-dist_2.12-1.13.2.jar:1.13.2]

...

2022-05-18 14:45:06,106 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Stopping TaskExecutor akka.tcp://flink@hdfs04-dev.yingzi.com:32782/user/rpc/taskmanager_0.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2022-05-18 14:45:06,114 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-io-2f303ad9-eef4-4048-846e-85dc29da5038
2022-05-18 14:45:06,114 INFO  org.apache.flink.runtime.io.network.NettyShuffleEnvironment  [] - Shutting down the network environment and its components.
2022-05-18 14:45:06,114 INFO  org.apache.flink.runtime.io.network.netty.NettyClient        [] - Successful shutdown (took 0 ms).
2022-05-18 14:45:06,117 INFO  org.apache.flink.runtime.io.network.netty.NettyServer        [] - Successful shutdown (took 2 ms).
2022-05-18 14:45:06,125 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-netty-shuffle-5b4f703f-8827-4139-8237-6b9bfdfefe6e
2022-05-18 14:45:06,125 INFO  org.apache.flink.runtime.taskexecutor.KvStateService         [] - Shutting down the kvState service and its components.
2022-05-18 14:45:06,125 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
2022-05-18 14:45:06,129 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-dist-cache-f2a2be98-66ed-45c2-815b-a981e7a90d05


Only after this does the job return to normal. Evidently the heartbeat timeout fired, which in turn triggered the failover, and the job was recovered from the checkpoint.
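
The heartbeat behaviour is governed by Flink's heartbeat options, which go in flink-conf.yaml like the HA settings above; a minimal sketch (the values are illustrative, not this cluster's actual settings):

heartbeat.interval: 10000     # ms between heartbeat requests between JobManager and TaskManagers
heartbeat.timeout: 180000     # ms without a heartbeat before the peer is considered dead; lower it to detect a killed TaskManager sooner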

(5) Kill the NodeManager process on the TaskManager's machine

Right after the kill, the TaskManager process is still fine and the job keeps working. Ten-odd minutes later, the TaskManager is automatically moved to a new node and restarted. The behaviour is essentially the same as in (2) above.

(6) Kill the NodeManager on the TaskManager's machine first, then kill the TaskManager process

The Flink UI still shows the TaskManager as healthy and on the original node, but in reality the job is no longer working and checkpoints fail. Apart from the TaskManager not being moved immediately, the behaviour is essentially the same as in (4) above.

Conclusions & Open Questions

With HA enabled, the job can recover automatically even when a service process, or an entire machine, goes down outright. The actual recovery time depends on which services died and on the configuration of several timeout parameters.

If a machine goes down outright, recovery can take ten-odd minutes. The failover time can be shortened by tuning these parameters (summarised below), but the side effects of shorter timeouts need further investigation.
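
For reference, the timeout knobs that showed up in these drills (the defaults are taken from the Hadoop/Flink documentation for the versions used here and should be verified against your own cluster):

yarn.nm.liveness-monitor.expiry-interval-ms   (yarn-site.xml, default 10 min)  - when a dead NodeManager is marked LOST
yarn.am.liveness-monitor.expiry-interval-ms   (yarn-site.xml, default 10 min)  - when a dead ApplicationMaster, i.e. the JobManager, is expired
heartbeat.timeout                             (flink-conf.yaml)                - when Flink declares a JobManager/TaskManager heartbeat target dead
taskmanager.registration.timeout              (flink-conf.yaml, default 5 min) - matches the "300000 ms" registration timeout seen in the TaskManager logs above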
