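Preface

In a microservice architecture, a service is usually deployed as multiple instances to avoid a single point of failure. A service consumer therefore has to decide how to pick an available instance and what to do when a call fails. Dubbo addresses these questions with several cluster fault-tolerance strategies.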
The following is based on Dubbo 2.7.12.
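Cluster Fault Tolerance

The Cluster interface is Dubbo's abstraction over a group of service providers. It defines how one or more providers are selected from the group for a call, and how exceptions raised during the call are handled. Dubbo ships with several common cluster fault-tolerance strategies, including: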
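Failover Cluster: automatic failover to another provider
Failfast Cluster: fail fast
Failsafe Cluster: fail safe
Failback Cluster: automatic recovery after failure
Forking Cluster: parallel invocation
Broadcast Cluster: broadcast invocation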
Source Code Implementation

Each Cluster implementation creates its corresponding Cluster Invoker object, and the invocation logic lives in that Cluster Invoker. All of these Cluster Invokers extend AbstractClusterInvoker:

    @Override
    public Result invoke(final Invocation invocation) throws RpcException {
        checkWhetherDestroyed();

        // copy the attachments held by the RpcContext onto the invocation
        Map<String, Object> contextAttachments = RpcContext.getContext().getObjectAttachments();
        if (contextAttachments != null && contextAttachments.size() != 0) {
            ((RpcInvocation) invocation).addObjectAttachments(contextAttachments);
        }

        // list the candidate invokers, then initialize the load balancer
        List<Invoker<T>> invokers = list(invocation);
        LoadBalance loadbalance = initLoadBalance(invokers, invocation);
        RpcUtils.attachInvocationIdIfAsync(getUrl(), invocation);
        // the strategy-specific logic lives in the subclass
        return doInvoke(invocation, invokers, loadbalance);
    }

Let's go straight to the doInvoke implementations in its subclasses.
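The split between invoke and doInvoke is a template-method arrangement: the base class runs the shared steps and each strategy only supplies the last one. A minimal sketch of that shape, assuming plain strings in place of Dubbo's Invocation and Invoker types; ClusterTemplateSketch and FirstProviderStrategy are hypothetical names used only for illustration and are not part of Dubbo:

```java
import java.util.List;

/** Fixed invocation skeleton; plain strings stand in for Dubbo's Invocation and Invoker types. */
abstract class ClusterTemplateSketch {
    private final List<String> providers;
    private volatile boolean destroyed;

    ClusterTemplateSketch(List<String> providers) {
        this.providers = providers;
    }

    /** Template method: the shared steps run here, the strategy-specific step is delegated. */
    final String invoke(String request) {
        if (destroyed) {
            throw new IllegalStateException("invoker is destroyed");
        }
        List<String> candidates = providers; // stands in for list(invocation)
        return doInvoke(request, candidates);
    }

    void destroy() {
        destroyed = true;
    }

    /** Each fault-tolerance strategy supplies its own implementation. */
    protected abstract String doInvoke(String request, List<String> candidates);
}

/** Trivial hypothetical strategy: always use the first candidate. */
final class FirstProviderStrategy extends ClusterTemplateSketch {
    FirstProviderStrategy(List<String> providers) {
        super(providers);
    }

    @Override
    protected String doInvoke(String request, List<String> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        return candidates.get(0) + " handled " + request;
    }
}
```

Because invoke is final in the sketch, a strategy cannot skip the destroyed check; it can only change how the candidates are used.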
FailoverClusterInvoker

    public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        List<Invoker<T>> copyInvokers = invokers;
        checkInvokers(copyInvokers, invocation);
        String methodName = RpcUtils.getMethodName(invocation);
        // total number of attempts = retries + 1
        int len = calculateInvokeTimes(methodName);
        RpcException le = null; // last exception
        List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invokers already tried
        Set<String> providers = new HashSet<String>(len);
        for (int i = 0; i < len; i++) {
            // before a retry, list the invokers again so that provider changes are picked up
            if (i > 0) {
                checkWhetherDestroyed();
                copyInvokers = list(invocation);
                checkInvokers(copyInvokers, invocation);
            }
            Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
            invoked.add(invoker);
            RpcContext.getContext().setInvokers((List) invoked);
            try {
                Result result = invoker.invoke(invocation);
                if (le != null && logger.isWarnEnabled()) {
                    // empty in this excerpt
                }
                return result;
            } catch (RpcException e) {
                if (e.isBiz()) { // business exceptions are not retried
                    throw e;
                }
                le = e;
            } catch (Throwable e) {
                le = new RpcException(e.getMessage(), e);
            } finally {
                providers.add(invoker.getUrl().getAddress());
            }
        }
        throw new RpcException(le.getCode(), "Failed to invoke the method "
                + methodName + " in the service " + getInterface().getName()
                + ". Tried " + len + " times of the providers " + providers
                + " (" + providers.size() + "/" + copyInvokers.size()
                + ") from the registry " + directory.getUrl().getAddress()
                + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
                + Version.getVersion() + ". Last error is: "
                + le.getMessage(), le.getCause() != null ? le.getCause() : le);
    }

    private int calculateInvokeTimes(String methodName) {
        int len = getUrl().getMethodParameter(methodName, RETRIES_KEY, DEFAULT_RETRIES) + 1;
        // a retries value attached to the RpcContext overrides the URL parameter
        RpcContext rpcContext = RpcContext.getContext();
        Object retry = rpcContext.getObjectAttachment(RETRIES_KEY);
        if (null != retry && retry instanceof Number) {
            len = ((Number) retry).intValue() + 1;
            rpcContext.removeAttachment(RETRIES_KEY);
        }
        if (len <= 0) {
            len = 1;
        }
        return len;
    }

FailoverClusterInvoker makes up to retries + 1 attempts. Before each retry it lists the available invokers again and passes the ones already tried to select so that they can be avoided. A business exception is rethrown at once; if every attempt fails, the last error is wrapped in an RpcException.
Use case: scenarios that tolerate longer response times but require a high success rate.
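The retry loop can be tried out in isolation. The following is a minimal sketch under stated assumptions, not Dubbo's implementation: Provider is a hypothetical stand-in for Invoker, providers are picked in simple rotation instead of through a LoadBalance, and the immediate rethrow of business exceptions is left out.

```java
import java.util.List;

/** Simplified failover loop; Provider is a hypothetical stand-in for Dubbo's Invoker. */
final class FailoverSketch {

    /** Hypothetical provider abstraction: a name plus a call that may throw. */
    interface Provider {
        String name();
        String call();
    }

    /** Test helper: a provider that either answers or throws. */
    static Provider of(String name, boolean healthy) {
        return new Provider() {
            public String name() { return name; }
            public String call() {
                if (!healthy) throw new IllegalStateException(name + " is down");
                return "ok from " + name;
            }
        };
    }

    /** Same arithmetic as calculateInvokeTimes: retries + 1 attempts, never fewer than one. */
    static int invokeTimes(int retries) {
        int len = retries + 1;
        return len <= 0 ? 1 : len;
    }

    /** Tries providers in rotation (an assumption; Dubbo asks a LoadBalance) until one succeeds. */
    static String invoke(List<Provider> providers, int retries, List<String> tried) {
        if (providers.isEmpty()) throw new IllegalStateException("no provider available");
        int len = invokeTimes(retries);
        RuntimeException last = null;
        for (int i = 0; i < len; i++) {
            Provider p = providers.get(i % providers.size());
            tried.add(p.name()); // record every provider that was attempted
            try {
                return p.call();
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        throw new IllegalStateException(
                "Tried " + len + " times on " + tried + ". Last error is: " + last.getMessage(), last);
    }
}
```

With providers a and b down and c healthy, a retries value of 2 is just enough for the third attempt to reach c; with a retries value of 1 the call would fail after trying a and b.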
FailfastClusterInvoker

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception
                throw (RpcException) e;
            }
            throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
                    "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()
                            + " select from all providers " + invokers + " for service " + getInterface().getName()
                            + " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost()
                            + " use dubbo version " + Version.getVersion()
                            + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
                    e.getCause() != null ? e.getCause() : e);
        }
    }

FailfastClusterInvoker makes a single call. If that call fails, the exception is thrown immediately and there is no retry.
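Use case: scenarios with response-time requirements, where the caller needs a prompt answer whether the call succeeds or fails.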
FailsafeClusterInvoker

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            checkInvokers(invokers, invocation);
            Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            // log the failure and return an empty result instead of throwing
            logger.error("Failsafe ignore exception: " + e.getMessage(), e);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation);
        }
    }

FailsafeClusterInvoker also makes only a single call. It differs from FailfastClusterInvoker in that a failed call is only logged: no exception is thrown, and an empty result is returned.
Use case: scenarios that do not place high demands on the consistency of call results.
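The difference between the two single-attempt strategies comes down to what the catch block does. A minimal sketch, assuming a java.util.function.Supplier in place of the selected Invoker; SingleShotSketch and its two methods are hypothetical names, and null stands in for the empty result:

```java
import java.util.function.Supplier;

/** Contrasts the two single-attempt strategies; a Supplier stands in for the selected invoker. */
final class SingleShotSketch {

    /** Failfast: one attempt; a failure is wrapped and rethrown to the caller right away. */
    static <T> T failfast(Supplier<T> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            throw new IllegalStateException("Failfast invoke failed. Last error is: " + e.getMessage(), e);
        }
    }

    /** Failsafe: one attempt; a failure is logged and an empty result (null here) is returned. */
    static <T> T failsafe(Supplier<T> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            System.err.println("Failsafe ignore exception: " + e.getMessage());
            return null;
        }
    }
}
```

A caller of failsafe cannot tell a failed call from a call that legitimately returned nothing, which is why the strategy fits operations whose result the caller does not depend on.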
FailbackClusterInvoker

    @Override
    protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        Invoker<T> invoker = null;
        try {
            checkInvokers(invokers, invocation);
            invoker = select(loadbalance, invocation, invokers, null);
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            logger.error("Failback to invoke method " + invocation.getMethodName()
                    + ", wait for retry in background. Ignored exception: " + e.getMessage() + ", ", e);
            // register the failed call for a background retry, then return an empty result
            addFailed(loadbalance, invocation, invokers, invoker);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation);
        }
    }

    private void addFailed(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker) {
        // lazily create the timer with double-checked locking
        if (failTimer == null) {
            synchronized (this) {
                if (failTimer == null) {
                    failTimer = new HashedWheelTimer(
                            new NamedThreadFactory("failback-cluster-timer", true),
                            1,
                            TimeUnit.SECONDS, 32, failbackTasks);
                }
            }
        }
        RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD);
        try {
            failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);
        } catch (Throwable e) {
            logger.error("Failback background works error,invocation->" + invocation + ", exception: " + e.getMessage());
        }
    }

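When a call fails, FailbackClusterInvoker returns an empty result immediately, adds the call to a failure list, and retries it after 5 seconds. The number of retries can be set through retries.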
Use case: scenarios that place high demands on the consistency of call results but not on response time.
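The "return now, retry later" behaviour can be sketched with a standard ScheduledExecutorService. This is an illustration under stated assumptions rather than Dubbo's code: the JDK scheduler replaces HashedWheelTimer, a Supplier replaces the invoker, the period is given in milliseconds so the example finishes quickly, and FailbackSketch is a hypothetical name.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Failback sketch: a failed call returns null at once and is retried in the background. */
final class FailbackSketch {
    // daemon thread so that pending retries never keep the JVM alive
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "failback-sketch-timer");
        t.setDaemon(true);
        return t;
    });
    private final int retries;
    private final long periodMillis;
    final AtomicInteger retryCount = new AtomicInteger();

    FailbackSketch(int retries, long periodMillis) {
        this.retries = retries;
        this.periodMillis = periodMillis;
    }

    /** Returns the result, or null immediately on failure while a retry is scheduled. */
    <T> T invoke(Supplier<T> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            schedule(call, 1);
            return null;
        }
    }

    /** Schedules background attempt number `attempt`; gives up once `retries` attempts are used. */
    private <T> void schedule(Supplier<T> call, int attempt) {
        if (attempt > retries) {
            return; // retry budget exhausted
        }
        timer.schedule(() -> {
            retryCount.incrementAndGet();
            try {
                call.get();
            } catch (RuntimeException e) {
                schedule(call, attempt + 1);
            }
        }, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

The caller never sees the outcome of a background retry, so this pattern only suits operations that are safe to run late and more than once.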
ForkingClusterInvoker

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            checkInvokers(invokers, invocation);
            final List<Invoker<T>> selected;
            final int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
            final int timeout = getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT);
            if (forks <= 0 || forks >= invokers.size()) {
                selected = invokers;
            } else {
                // pick `forks` distinct invokers through the load balancer
                selected = new ArrayList<>(forks);
                while (selected.size() < forks) {
                    Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
                    if (!selected.contains(invoker)) {
                        // avoid adding the same invoker several times
                        selected.add(invoker);
                    }
                }
            }
            RpcContext.getContext().setInvokers((List) selected);
            final AtomicInteger count = new AtomicInteger();
            final BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
            for (final Invoker<T> invoker : selected) {
                executor.execute(() -> {
                    try {
                        Result result = invoker.invoke(invocation);
                        ref.offer(result);
                    } catch (Throwable e) {
                        // only the last failure is published, once every fork has failed
                        int value = count.incrementAndGet();
                        if (value >= selected.size()) {
                            ref.offer(e);
                        }
                    }
                });
            }
            try {
                // wait for the first element: a result, or the exception of the last failed fork
                Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
                if (ret instanceof Throwable) {
                    Throwable e = (Throwable) ret;
                    throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
                            "Failed to forking invoke provider " + selected
                                    + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
                            e.getCause() != null ? e.getCause() : e);
                }
                return (Result) ret;
            } catch (InterruptedException e) {
                throw new RpcException("Failed to forking invoke provider " + selected
                        + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
            }
        } finally {
            // clear attachments which are bound to the current thread
            RpcContext.getContext().clearAttachments();
        }
    }

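ForkingClusterInvoker uses a thread pool to call several providers asynchronously at the same time and returns the first successful result. If all of them fail, the call fails.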
Use case: scenarios with strict response-time requirements and a fairly large number of providers.
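The queue-based "first result wins" mechanism can be reproduced with JDK classes alone. A minimal sketch, assuming Supplier instances in place of the selected invokers and non-null results; ForkingSketch is a hypothetical name. Unlike the code quoted above, where a timed-out poll yields a null return value, the sketch treats a timeout as an error.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Forking sketch: run all calls in parallel and return whichever result arrives first. */
final class ForkingSketch {

    static <T> T invoke(List<Supplier<T>> selected, long timeoutMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        AtomicInteger failures = new AtomicInteger();
        BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
        try {
            for (Supplier<T> call : selected) {
                executor.execute(() -> {
                    try {
                        ref.offer(call.get()); // results are assumed to be non-null
                    } catch (Throwable e) {
                        // publish an exception only when every fork has failed
                        if (failures.incrementAndGet() >= selected.size()) {
                            ref.offer(e);
                        }
                    }
                });
            }
            Object ret = ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (ret == null) {
                throw new IllegalStateException("forking invoke timed out"); // sketch-specific choice
            }
            if (ret instanceof Throwable) {
                throw new IllegalStateException("all forks failed", (Throwable) ret);
            }
            @SuppressWarnings("unchecked")
            T result = (T) ret;
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for forks", e);
        } finally {
            executor.shutdownNow(); // interrupt forks that are still running
        }
    }
}
```

Every fork consumes a thread and a provider call even though only one result is used, so the lower latency is paid for with extra load on the providers.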
BroadcastClusterInvoker

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        RpcContext.getContext().setInvokers((List) invokers);
        RpcException exception = null;
        Result result = null;
        URL url = getUrl();
        // broadcast.fail.percent must lie between 0 and 100; the default is 100
        int broadcastFailPercent = url.getParameter(BROADCAST_FAIL_PERCENT_KEY, MAX_BROADCAST_FAIL_PERCENT);

        if (broadcastFailPercent < MIN_BROADCAST_FAIL_PERCENT || broadcastFailPercent > MAX_BROADCAST_FAIL_PERCENT) {
            logger.info(String.format("The value corresponding to the broadcast.fail.percent parameter must be between 0 and 100. " +
                    "The current setting is %s, which is reset to 100.", broadcastFailPercent));
            broadcastFailPercent = MAX_BROADCAST_FAIL_PERCENT;
        }

        int failThresholdIndex = invokers.size() * broadcastFailPercent / MAX_BROADCAST_FAIL_PERCENT;
        int failIndex = 0;
        for (Invoker<T> invoker : invokers) {
            try {
                result = invoker.invoke(invocation);
                if (null != result && result.hasException()) {
                    Throwable resultException = result.getException();
                    if (null != resultException) {
                        exception = getRpcException(result.getException());
                        logger.warn(exception.getMessage(), exception);
                        if (failIndex == failThresholdIndex) {
                            break;
                        } else {
                            failIndex++;
                        }
                    }
                }
            } catch (Throwable e) {
                exception = getRpcException(e);
                logger.warn(exception.getMessage(), exception);
                if (failIndex == failThresholdIndex) {
                    break;
                } else {
                    failIndex++;
                }
            }
        }

        // any recorded failure makes the whole broadcast fail
        if (exception != null) {
            if (failIndex == failThresholdIndex) {
                logger.debug(
                        String.format("The number of BroadcastCluster call failures has reached the threshold %s", failThresholdIndex));
            } else {
                logger.debug(String.format("The number of BroadcastCluster call failures has not reached the threshold %s, fail size is %s",
                        failIndex));
            }
            throw exception;
        }

        return result;
    }

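BroadcastClusterInvoker calls every provider in turn. If any of the calls fails, the invocation as a whole fails: the exception is thrown after the loop ends. The broadcast.fail.percent parameter lets the loop stop early once the number of failures exceeds the threshold derived from that percentage.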
Use case: publish/subscribe scenarios.
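The threshold arithmetic is easy to misread, so here is a small sketch that replays it on a list of outcomes. It follows the integer arithmetic of the code quoted above; BroadcastThresholdSketch, failThreshold and providersCalled are hypothetical names, and a boolean array stands in for the providers' results.

```java
/** Replays the broadcast failure-threshold arithmetic on a list of call outcomes. */
final class BroadcastThresholdSketch {
    static final int MIN_PERCENT = 0;
    static final int MAX_PERCENT = 100;

    /** Same integer arithmetic as the quoted code; out-of-range percentages fall back to 100. */
    static int failThreshold(int providerCount, int failPercent) {
        if (failPercent < MIN_PERCENT || failPercent > MAX_PERCENT) {
            failPercent = MAX_PERCENT;
        }
        return providerCount * failPercent / MAX_PERCENT;
    }

    /** Returns how many providers are called before the loop ends (true = successful call). */
    static int providersCalled(boolean[] outcomes, int failPercent) {
        int threshold = failThreshold(outcomes.length, failPercent);
        int failIndex = 0;
        int called = 0;
        for (boolean ok : outcomes) {
            called++;
            if (!ok) {
                if (failIndex == threshold) {
                    break; // threshold already reached: stop broadcasting
                }
                failIndex++;
            }
        }
        return called;
    }
}
```

With 10 providers and a percentage of 20 the threshold is 2, so the loop stops at the third failure; with the default of 100 the threshold equals the provider count and every provider is called.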
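Summary

This article used the source code to briefly introduce the cluster fault-tolerance strategies that Dubbo provides and the scenarios each one suits. In practice, choosing and configuring these strategies appropriately can effectively improve the stability and availability of a system.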