Queue-Based Load Leveling Pattern
- Article
- 08/26/2015
- 5 minutes to read
Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out. This pattern can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service.
Context and Problem
Many solutions in the cloud involve running tasks that invoke services. In this environment, if a service is subjected to intermittent heavy loads, it can cause performance or reliability issues.
A service could be a component that is part of the same solution as the tasks that utilize it, or it could be a third-party service providing access to frequently used resources such as a cache or a storage service. If the same service is utilized by a number of tasks running concurrently, it can be difficult to predict the volume of requests to which the service might be subjected at any given point in time.
It is possible that a service might experience peaks in demand that cause it to become overloaded and unable to respond to requests in a timely manner. Flooding a service with a large number of concurrent requests may also result in the service failing if it is unable to handle the contention that these requests could cause.
Solution
Refactor the solution and introduce a queue between the task and the service. The task and the service run asynchronously. The task posts a message containing the data required by the service to a queue. The queue acts as a buffer, storing the message until it is retrieved by the service. The service retrieves the messages from the queue and processes them. Requests from a number of tasks, which can be generated at a highly variable rate, can be passed to the service through the same message queue. Figure 1 shows this structure.
Figure 1 - Using a queue to level the load on a service
The queue effectively decouples the tasks from the service, and the service can handle the messages at its own pace irrespective of the volume of requests from concurrent tasks. Additionally, there is no delay to a task if the service is not available at the time it posts a message to the queue.
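The decoupling described above can be sketched in a few lines of C#. This is a minimal illustration rather than a production design: an in-memory `Channel<string>` from `System.Threading.Channels` stands in for a durable message queue, and the names `LoadLevelingSketch` and `RunAsync` are hypothetical. Several tasks post messages in a burst while a single service loop drains them at its own pace.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class LoadLevelingSketch
{
    // Tasks post messages in a burst; the service drains the queue at its own pace.
    public static async Task<List<string>> RunAsync(int taskCount, int messagesPerTask)
    {
        // Assumption: an in-memory channel stands in for a durable message queue.
        Channel<string> queue = Channel.CreateUnbounded<string>();
        var processed = new List<string>();

        // The "service": a single consumer that handles one message at a time.
        Task service = Task.Run(async () =>
        {
            await foreach (string message in queue.Reader.ReadAllAsync())
            {
                await Task.Delay(1); // Simulated processing time.
                processed.Add(message);
            }
        });

        // The "tasks": each posts its messages without waiting for the service.
        var producers = new List<Task>();
        for (int t = 0; t < taskCount; t++)
        {
            int taskId = t;
            producers.Add(Task.Run(async () =>
            {
                for (int m = 0; m < messagesPerTask; m++)
                {
                    await queue.Writer.WriteAsync($"task{taskId}-msg{m}");
                }
            }));
        }

        await Task.WhenAll(producers);
        queue.Writer.Complete(); // No more messages will be posted.
        await service;           // Wait until the service has drained the queue.
        return processed;
    }
}
```

The producers finish as soon as their messages are queued, regardless of how long the consumer takes, which is the property the pattern relies on.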
This pattern provides the following benefits:
- It can help to maximize availability because delays arising in services will not have an immediate and direct impact on the application, which can continue to post messages to the queue even when the service is not available or is not currently processing messages.
- It can help to maximize scalability because both the number of queues and the number of services can be varied to meet demand.
- It can help to control costs because the number of service instances deployed needs only to be sufficient to meet average load rather than the peak load.
Note
Some services may implement throttling if demand reaches a threshold beyond which the system could fail. Throttling may reduce the functionality available. You might be able to implement load leveling with these services to ensure that this threshold is not reached.
Issues and Considerations
Consider the following points when deciding how to implement this pattern:
- It is necessary to implement application logic that controls the rate at which services handle messages to avoid overwhelming the target resource. Avoid passing spikes in demand to the next stage of the system. Test the system under load to ensure that it provides the required leveling, and adjust the number of queues and the number of service instances that handle messages to achieve this.
- Message queues are a one-way communication mechanism. If a task expects a reply from a service, it may be necessary to implement a mechanism that the service can use to send a response. For more information, see the Asynchronous Messaging Primer.
- You must be careful if you apply autoscaling to services that are listening for requests on the queue because this may result in increased contention for any resources that these services share, and diminish the effectiveness of using the queue to level the load.
When to Use this Pattern
This pattern is ideally suited to any type of application that uses services that may be subject to overloading.
This pattern might not be suitable if the application expects a response from the service with minimal latency.
Example
A Microsoft Azure web role stores data by using a separate storage service. If a large number of instances of the web role run concurrently, it is possible that the storage service could be overwhelmed and be unable to respond to requests quickly enough to prevent these requests from timing out or failing. Figure 2 highlights this issue.
Figure 2 - A service being overwhelmed by a large number of concurrent requests from instances of a web role
To resolve this issue, you can use a queue to level the load between the web role instances and the storage service. However, the storage service is designed to accept synchronous requests and cannot be easily modified to read messages and manage throughput. Therefore, you can introduce a worker role to act as a proxy service that receives requests from the queue and forwards them to the storage service. The application logic in the worker role can control the rate at which it passes requests to the storage service to prevent the storage service from being overwhelmed. Figure 3 shows this solution.
Figure 3 - Using a queue and a worker role to level the load between instances of the web role and the service
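The rate control performed by the worker role in this example could look roughly like the following sketch. It assumes an in-memory `ChannelReader<string>` in place of the real queue and a caller-supplied `forwardAsync` delegate in place of the storage service call. `PacedWorker`, `DrainAsync`, and `minInterval` are hypothetical names, and a fixed minimum interval between requests is only one of several possible throttling strategies.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class PacedWorker
{
    // Reads requests from the queue and forwards them one at a time,
    // never starting a new request sooner than minInterval after the previous one.
    public static async Task<int> DrainAsync(
        ChannelReader<string> queue,
        Func<string, Task> forwardAsync,
        TimeSpan minInterval)
    {
        int forwarded = 0;
        var stopwatch = new Stopwatch();

        await foreach (string request in queue.ReadAllAsync())
        {
            stopwatch.Restart();
            await forwardAsync(request); // Stand-in for the synchronous storage call.
            forwarded++;

            // Sleep for whatever is left of the interval, if anything.
            TimeSpan remaining = minInterval - stopwatch.Elapsed;
            if (remaining > TimeSpan.Zero)
            {
                await Task.Delay(remaining);
            }
        }

        return forwarded; // The loop ends when the queue is completed and empty.
    }
}
```

Because the worker processes requests sequentially, bursts from the web role instances accumulate in the queue instead of reaching the storage service.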
Related Patterns and Guidance
The following patterns and guidance may also be relevant when implementing this pattern:
- Asynchronous Messaging Primer. Message queues are an inherently asynchronous communications mechanism. It may be necessary to redesign the application logic in a task if it is adapted from communicating directly with a service to using a message queue. Similarly, it may be necessary to refactor a service to accept requests from a message queue (alternatively, it may be possible to implement a proxy service, as described in the example).
- Competing Consumers Pattern. It may be possible to run multiple instances of a service, each of which acts as a message consumer from the load-leveling queue. You can use this approach to adjust the rate at which messages are received and passed to a service.
- Throttling Pattern. A simple way to implement throttling with a service is to use queue-based load leveling and route all requests to a service through a message queue. The service can process requests at a rate that ensures resources required by the service are not exhausted, and that reduces the amount of contention that could occur.
Retry Pattern
- Article
- 08/26/2015
- 10 minutes to read
Enable an application to handle anticipated, temporary failures when it attempts to connect to a service or network resource by transparently retrying an operation that has previously failed in the expectation that the cause of the failure is transient. This pattern can improve the stability of the application.
Context and Problem
An application that communicates with elements running in the cloud must be sensitive to the transient faults that can occur in this environment. Such faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that arise when a service is busy.
These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it is likely to be successful. For example, a database service that is processing a large number of concurrent requests may implement a throttling strategy that temporarily rejects any further requests until its workload has eased. An application attempting to access the database may fail to connect, but if it tries again after a suitable delay it may succeed.
Solution
In the cloud, transient faults are not uncommon and an application should be designed to handle them elegantly and transparently, minimizing the effects that such faults might have on the business tasks that the application is performing.
If an application detects a failure when it attempts to send a request to a remote service, it can handle the failure by using the following strategies:
- If the fault indicates that the failure is not transient or is unlikely to be successful if repeated (for example, an authentication failure caused by providing invalid credentials is unlikely to succeed no matter how many times it is attempted), the application should abort the operation and report a suitable exception.
- If the specific fault reported is unusual or rare, it may have been caused by freak circumstances such as a network packet becoming corrupted while it was being transmitted. In this case, the application could retry the failing request immediately because the same failure is unlikely to be repeated and the request will probably be successful.
- If the fault is caused by one of the more commonplace connectivity or "busy" failures, the network or service may require a short period while the connectivity issues are rectified or the backlog of work is cleared. The application should wait for a suitable time before retrying the request.
For the more common transient failures, the period between retries should be chosen so as to spread requests from multiple instances of the application as evenly as possible. This can reduce the chance of a busy service continuing to be overloaded. If many instances of an application are continually bombarding a service with retry requests, it may take the service longer to recover.
If the request still fails, the application can wait for a further period and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts until some maximum number of requests have been attempted and failed. The delay time can be increased incrementally, or a timing strategy such as exponential back-off can be used, depending on the nature of the failure and the likelihood that it will be corrected during this time.
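One way to compute the increasing delays described above is sketched below. The doubling rule, the cap, and the randomization ("jitter") are assumptions chosen for illustration and are not prescribed by the pattern. `Backoff`, `Exponential`, and `WithJitter` are hypothetical names. The jitter step is one possible way to spread retries from multiple application instances so that they do not all arrive at the same moment.

```csharp
using System;

public static class Backoff
{
    // Delay before retry number `attempt` (1-based): baseDelay * 2^(attempt - 1),
    // capped at maxDelay so the wait never grows without bound.
    public static TimeSpan Exponential(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));

        int exponent = Math.Min(attempt - 1, 30); // Guard against overflow for large attempts.
        double milliseconds = baseDelay.TotalMilliseconds * Math.Pow(2, exponent);
        return TimeSpan.FromMilliseconds(Math.Min(milliseconds, maxDelay.TotalMilliseconds));
    }

    // Picks a random delay between half of `delay` (inclusive) and `delay` (exclusive),
    // so that instances that failed together do not all retry together.
    public static TimeSpan WithJitter(TimeSpan delay, Random random)
    {
        double half = delay.TotalMilliseconds / 2;
        return TimeSpan.FromMilliseconds(half + random.NextDouble() * half);
    }
}
```

A retry loop might then wait with `await Task.Delay(Backoff.WithJitter(Backoff.Exponential(attempt, baseDelay, maxDelay), random))` before each new attempt.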
Figure 1 illustrates this pattern. If the request is unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.
Figure 1 - Invoking an operation in a hosted service using the Retry pattern
The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies, and some vendors provide libraries that encapsulate this approach. These libraries typically implement policies that are parameterized, and the application developer can specify values for items such as the number of retries and the time between retry attempts.
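A parameterized policy of the kind described above might be shaped like the following sketch. It is not the API of any particular vendor library. `RetryPolicy`, `ExecuteAsync`, and the constructor parameters are hypothetical names, and the fixed delay between attempts is a simplifying assumption.

```csharp
using System;
using System.Threading.Tasks;

public sealed class RetryPolicy
{
    private readonly int maxRetries;
    private readonly TimeSpan delay;
    private readonly Func<Exception, bool> isTransient;

    public RetryPolicy(int maxRetries, TimeSpan delay, Func<Exception, bool> isTransient)
    {
        this.maxRetries = maxRetries;
        this.delay = delay;
        this.isTransient = isTransient;
    }

    // Runs the operation, retrying up to maxRetries times when the failure is
    // classified as transient. Any other failure propagates to the caller.
    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxRetries && isTransient(ex))
            {
                // Transient fault with retries remaining: wait, then loop again.
                await Task.Delay(delay);
            }
        }
    }
}
```

Different services can be given different `RetryPolicy` instances, which keeps the retry count, delay, and fault classification out of the calling code.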
The code in an application that detects faults and retries failing operations should log the details of these failures. This information may be useful to operators. If a service is frequently reported as unavailable or busy, it is often because the service has exhausted its resources. You may be able to reduce the frequency with which these faults occur by scaling out the service. For example, if a database service is continually overloaded, it may be beneficial to partition the database and spread the load across multiple servers.
Note
Microsoft Azure provides extensive support for the Retry pattern. The patterns & practices Transient Fault Handling Block enables an application to handle transient faults in many Azure services using a range of retry strategies. The Microsoft Entity Framework version 6 provides facilities for retrying database operations. Additionally, many of the Azure Service Bus and Azure Storage APIs implement retry logic transparently.
Issues and Considerations
You should consider the following points when deciding how to implement this pattern:
- The retry policy should be tuned to match the business requirements of the application and the nature of the failure. It may be better for some noncritical operations to fail fast rather than retry several times and impact the throughput of the application. For example, in an interactive web application that attempts to access a remote service, it may be better to fail after a smaller number of retries with only a short delay between retry attempts, and display a suitable message to the user (for example, "please try again later") to prevent the application from becoming unresponsive. For a batch application, it may be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.
- A highly aggressive retry policy with minimal delay between attempts, and a large number of retries, could further degrade a busy service that is running close to or at capacity. This retry policy could also affect the responsiveness of the application if it is continually attempting to perform a failing operation rather than doing useful work.
- If a request still fails after a significant number of retries, it may be better for the application to prevent further requests going to the same resource for a period and simply report a failure immediately. When the period expires, the application may tentatively allow one or more requests through to see whether they are successful. For more details of this strategy, see the Circuit Breaker pattern.
- The operations in a service that are invoked by an application that implements a retry policy may need to be idempotent. For example, a request sent to a service may be received and processed successfully but, due to a transient fault, the service may be unable to send a response indicating that the processing has completed. The retry logic in the application might then attempt to repeat the request on the assumption that the first request was not received.
- A request to a service may fail for a variety of reasons and raise different exceptions, depending on the nature of the failure. Some exceptions may indicate a failure that could be resolved very quickly, while others may indicate that the failure is longer lasting. It may be beneficial for the retry policy to adjust the time between retry attempts based on the type of the exception.
- Consider how retrying an operation that is part of a transaction will affect the overall transaction consistency. It may be useful to fine-tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.
- Ensure that all retry code is fully tested against a variety of failure conditions. Check that it does not severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.
- Implement retry logic only where the full context of a failing operation is understood. For example, if a task that contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add long delays to the processing. It may be better to configure the lower-level task to fail fast and report the reason for the failure back to the task that invoked it. This higher-level task can then decide how to handle the failure based on its own policy.
- It is important to log all connectivity failures that prompt a retry so that underlying problems with the application, services, or resources can be identified.
- Investigate the faults that are most likely to occur for a service or a resource to discover if they are likely to be long lasting or terminal. If this is the case, it may be better to handle the fault as an exception. The application can report or log the exception, and then attempt to continue either by invoking an alternative service (if there is one available), or by offering degraded functionality. For more information on how to detect and handle long-lasting faults, see the Circuit Breaker pattern.
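The idempotency point in the list above can be illustrated with a small service-side sketch. It assumes that each request carries a caller-generated identifier and that completed results can be remembered in memory. Both are assumptions made for illustration, and `IdempotentProcessor` and `Process` are hypothetical names. A real service would more likely record completed request IDs in durable storage.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdempotentProcessor
{
    private readonly Dictionary<string, string> results = new Dictionary<string, string>();
    private readonly object gate = new object();

    // Performs the work at most once per requestId. A retried request with the
    // same ID receives the stored result instead of repeating the side effects.
    public string Process(string requestId, Func<string> work)
    {
        lock (gate)
        {
            if (results.TryGetValue(requestId, out string cached))
            {
                return cached; // Duplicate delivery caused by a retry.
            }

            string result = work();
            results[requestId] = result;
            return result;
        }
    }
}
```

With this in place, a retry that follows a lost response does not apply the operation a second time.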
When to Use this Pattern
Use this pattern:
- When an application could experience transient faults as it interacts with a remote service or accesses a remote resource. These faults are expected to be short lived, and repeating a request that has previously failed could succeed on a subsequent attempt.
This pattern might not be suitable:
- When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The application may simply be wasting time and resources attempting to repeat a request that is most likely to fail.
- For handling failures that are not due to transient faults, such as internal exceptions caused by errors in the business logic of an application.
- As an alternative to addressing scalability issues in a system. If an application experiences frequent "busy" faults, it is often an indication that the service or resource being accessed should be scaled up.
Example
This example illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync method, shown below, invokes an external service asynchronously through the TransientOperationAsync method (the details of this method will be specific to the service and are omitted from the sample code).
```csharp
// Requires: using System; using System.Diagnostics; using System.Threading.Tasks;

private int retryCount = 3;
// Delay between attempts. The original sample does not give a value,
// so the five seconds used here is an assumption.
private readonly TimeSpan delay = TimeSpan.FromSeconds(5);
...
public async Task OperationWithBasicRetryAsync()
{
  int currentRetry = 0;

  for (; ; )
  {
    try
    {
      // Calling external service.
      await TransientOperationAsync();

      // Return or break.
      break;
    }
    catch (Exception ex)
    {
      Trace.TraceError("Operation Exception");

      currentRetry++;

      // Check if the exception thrown was a transient exception
      // based on the logic in the error detection strategy.
      // Determine whether to retry the operation, as well as how
      // long to wait, based on the retry strategy.
      if (currentRetry > this.retryCount || !IsTransient(ex))
      {
        // If this is not a transient error
        // or we should not retry re-throw the exception.
        throw;
      }
    }

    // Wait to retry the operation.
    // Consider calculating an exponential delay here and
    // using a strategy best suited for the operation and fault.
    await Task.Delay(delay);
  }
}

// Async method that wraps a call to a remote service (details not shown).
private async Task TransientOperationAsync()
{
  ...
}
```
The statement that invokes this method is encapsulated within a try/catch block wrapped in a for loop. The for loop exits if the call to the TransientOperationAsync method succeeds without throwing an exception. If the TransientOperationAsync method fails, the catch block examines the reason for the failure, and if it is deemed to be a transient error the code waits for a short delay before retrying the operation.
The for loop also tracks the number of times that the operation has been attempted. If the operation still fails after the number of retries allowed by retryCount (three in this sample), the fault is assumed to be longer lasting. If the exception is not transient or it is long lasting, the catch handler rethrows the exception. This exception exits the for loop and should be caught by the code that invokes the OperationWithBasicRetryAsync method.
The IsTransient method, shown below, checks for a specific set of exceptions that are relevant to the environment in which the code is run. The definition of a transient exception may vary according to the resources being accessed and the environment in which the operation is being performed.
```csharp
// Requires: using System; using System.Linq; using System.Net;
// OperationTransientException is an application-specific exception type (not shown).

private bool IsTransient(Exception ex)
{
  // Determine if the exception is transient.
  // In some cases this may be as simple as checking the exception type, in other
  // cases it may be necessary to inspect other properties of the exception.
  if (ex is OperationTransientException)
    return true;

  var webException = ex as WebException;
  if (webException != null)
  {
    // If the web exception contains one of the following status values
    // it may be transient.
    return new[] { WebExceptionStatus.ConnectionClosed,
                   WebExceptionStatus.Timeout,
                   WebExceptionStatus.RequestCanceled }
            .Contains(webException.Status);
  }

  // Additional exception checking logic goes here.
  return false;
}
```
Related Patterns and Guidance
The following pattern may also be relevant when implementing this pattern:
- Circuit Breaker Pattern. The Retry pattern is ideally suited to handling transient faults. If a failure is expected to be longer lasting, it may be more appropriate to implement the Circuit Breaker Pattern. The Retry pattern can also be used in conjunction with a circuit breaker to provide a comprehensive approach to handling faults.
Runtime Reconfiguration Pattern
- Article
- 08/26/2015
- 12 minutes to read
Design an application so that it can be reconfigured without requiring redeployment or restarting the application. This helps to maintain availability and minimize downtime.
Context and Problem
A primary aim for important applications such as commercial and business websites is to minimize downtime and the consequent interruption to customers and users. However, at times it is necessary to reconfigure the application to change specific behavior or settings while it is deployed and in use. Therefore, it is an advantage for the application to be designed in such a way as to allow these configuration changes to be applied while it is running, and for the components of the application to detect the changes and apply them as soon as possible.
Examples of the kinds of configuration changes to be applied might be adjusting the granularity of logging to assist in debugging a problem with the application, swapping connection strings to use a different data store, or turning on or off specific sections or functionality of the application.
Solution
The solution for implementing this pattern depends on the features available in the application hosting environment. Typically, the application code will respond to one or more events that are raised by the hosting infrastructure when it detects a change to the application configuration. This is usually the result of uploading a new configuration file, or in response to changes in the configuration through the administration portal or by accessing an API.
Code that handles the configuration change events can examine the changes and apply them to the components of the application. It is necessary for these components to detect and react to the changes, and so the values they use will usually be exposed as writable properties or methods that the code in the event handler can set to new values or execute. From this point, the components should use the new values so that the required changes to the application behavior occur.
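The event-driven flow described above can be sketched as follows. The hosting infrastructure is represented by a hypothetical `ConfigurationSource` class that raises a `Changed` event. Real hosting environments expose their own events and types, so every name here (`ConfigurationSource`, `ConfigurationChangedEventArgs`, `LoggingComponent`, `ReconfigurationWiring`, and the `LogLevel` key) is an assumption made for illustration.

```csharp
using System;

public sealed class ConfigurationChangedEventArgs : EventArgs
{
    public ConfigurationChangedEventArgs(string key, string value)
    {
        Key = key;
        Value = value;
    }

    public string Key { get; }
    public string Value { get; }
}

// Hypothetical stand-in for the hosting infrastructure's change notification.
public sealed class ConfigurationSource
{
    public event EventHandler<ConfigurationChangedEventArgs> Changed;

    public void Publish(string key, string value) =>
        Changed?.Invoke(this, new ConfigurationChangedEventArgs(key, value));
}

// A component that exposes its setting as a writable property.
public sealed class LoggingComponent
{
    public string Level { get; set; } = "Warning";
}

public static class ReconfigurationWiring
{
    // The event handler examines each change and applies the relevant ones.
    public static void Attach(ConfigurationSource source, LoggingComponent logging)
    {
        source.Changed += (sender, e) =>
        {
            if (e.Key == "LogLevel")
            {
                logging.Level = e.Value; // Takes effect without a restart.
            }
        };
    }
}
```

After `ReconfigurationWiring.Attach(source, logging)` has run, a call such as `source.Publish("LogLevel", "Verbose")` changes the component's behavior while the application keeps running.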
If it is not possible for the components to apply the changes at runtime, it will be necessary to restart the application so that these changes are applied when the application starts up again. In some hosting environments it may be possible to detect these types of changes, and indicate to the environment that the application must be restarted. In other cases it may be necessary to implement code that analyses the setting changes and forces an application restart when necessary.
Figure 1 shows an overview of this pattern.
Figure 1 - A basic overview of this pattern
Most environments expose events raised in response to configuration changes. In those that do not, a polling mechanism that regularly checks for changes to the configuration and applies these changes will be necessary. It may also be necessary to restart the application if the changes cannot be applied at runtime. For example, it may be possible to compare the date and time of a configuration file at preset intervals, and run code to apply the changes when a newer version is found. Another approach would be to incorporate a control in the administration UI of the application, or expose a secured endpoint that can be accessed from outside the application, that executes code that reads and applies the updated configuration.
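The file-timestamp comparison mentioned above might be implemented along these lines. This is a sketch under stated assumptions: the configuration lives in a single local file, the `apply` callback knows how to parse it, and something else (a timer, for example) calls `CheckOnce` at the preset interval. `ConfigFilePoller` and `CheckOnce` are hypothetical names.

```csharp
using System;
using System.IO;

public sealed class ConfigFilePoller
{
    private readonly string path;
    private readonly Action<string> apply;
    private DateTime lastApplied = DateTime.MinValue;

    public ConfigFilePoller(string path, Action<string> apply)
    {
        this.path = path;
        this.apply = apply;
    }

    // Applies the file contents if the file is newer than the version applied last.
    // Returns true when a change was applied, false otherwise.
    public bool CheckOnce()
    {
        if (!File.Exists(path))
        {
            return false;
        }

        DateTime modified = File.GetLastWriteTimeUtc(path);
        if (modified <= lastApplied)
        {
            return false; // Nothing newer than what is already in effect.
        }

        apply(File.ReadAllText(path));
        lastApplied = modified;
        return true;
    }
}
```

The interval between calls to `CheckOnce` determines how quickly a change takes effect and how much I/O the polling itself consumes.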
Alternatively, the application could react to some other change in the environment. For example, occurrences of a specific runtime error might change the logging configuration to automatically collect additional information, or the code could use the current date to read and apply a theme that reflects the season or a special event.
Issues and Considerations
Consider the following points when deciding how to implement this pattern:
在决定如何实现此模式时,请考虑以下几点:
- The configuration settings must be stored outside of the deployed application so that they can be updated without requiring the entire package to be redeployed. Typically the settings are stored in a configuration file, or in an external repository such as a database or online storage. Access to the runtime configuration mechanism should be strictly controlled, as well as strictly audited when used. 配置设置必须存储在已部署的应用程序之外,以便可以在不需要重新部署整个包的情况下更新它们。通常,这些设置存储在配置文件中,或存储在外部存储库(如数据库或联机存储)中。应严格控制对运行时配置机制的访问,并在使用时进行严格审核
- If the hosting infrastructure does not automatically detect configuration change events, and expose these events to the application code, you must implement an alternative mechanism to detect and apply the changes. This may be through a polling mechanism, or by exposing an interactive control or endpoint that initiates the update process. 如果宿主基础结构不能自动检测配置更改事件,并将这些事件公开给应用程序代码,则必须实现一种替代机制来检测和应用更改。这可以通过轮询机制实现,也可以通过公开启动更新过程的交互式控件或端点实现
- If you need to implement a polling mechanism, consider how often checks for updates to the configuration should take place. A long polling interval means that changes might not be applied for some time. A short interval might adversely affect operation by consuming compute and I/O resources.
- If there is more than one instance of the application, additional factors should be considered, depending on how changes are detected. If changes are detected automatically through events raised by the hosting infrastructure, these changes may not be detected by all instances of the application at the same time. This means that some instances will be using the original configuration for a period while others will use the new settings. If the update is detected through a polling mechanism, that mechanism must communicate the change to all instances in order to maintain consistency.
- Some configuration changes may require the application to be restarted, or even require the hosting server to be rebooted. You must identify these types of configuration settings and perform the appropriate action for each one. For example, a change that requires the application to be restarted might trigger the restart automatically, or it might be the responsibility of the administrator to initiate the restart at a suitable time, when the application is not under excessive load and other instances of the application can handle the load.
- Plan for a staged rollout of updates, and confirm that the updates are successful and that the updated application instances are performing correctly before applying the update to all instances. This can prevent a total outage of the application should an error occur. Where the update requires a restart or a reboot of the application, particularly where the application has a significant startup or warm-up time, use a staged rollout approach to prevent multiple instances being offline at the same time.
- Consider how you will roll back configuration changes that cause issues, or that result in failure of the application. For example, it should be possible to roll back a change immediately instead of waiting for a polling interval to detect the change.
- Consider how the location of the configuration settings might affect application performance. For example, you should handle the error that will occur if the external store you use is unavailable when the application starts, or when configuration changes are to be applied. One option is to use a default configuration; another is to cache the settings locally on the server and reuse these values while retrying access to the remote data store.
- Caching can help to reduce delays if a component needs to repeatedly access configuration settings. However, when the configuration changes, the application code will need to invalidate the cached settings, and the component must use the updated settings.
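The last two points (falling back when the external store is unavailable, and invalidating cached settings after a change) can be combined in one small component. The sketch below is illustrative only: `CachedSettings`, the `loadFromStore` delegate, and the defaults dictionary are hypothetical, and catching every exception type is a simplification.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical settings cache with invalidation and a fallback path.
public sealed class CachedSettings
{
    private readonly Func<IDictionary<string, string>> _loadFromStore;
    private readonly IDictionary<string, string> _defaults;
    private readonly object _gate = new object();
    private IDictionary<string, string> _cache;         // null means "needs reload"
    private IDictionary<string, string> _lastKnownGood; // last successful load

    public CachedSettings(
        Func<IDictionary<string, string>> loadFromStore,
        IDictionary<string, string> defaults)
    {
        _loadFromStore = loadFromStore;
        _defaults = defaults;
    }

    // Returns the setting from the cache, loading from the store when needed.
    // If the store is unavailable, serves the last good copy or the defaults
    // and leaves the cache empty so that the next call retries the store.
    public string Get(string name)
    {
        lock (_gate)
        {
            if (_cache == null)
            {
                try
                {
                    _cache = _loadFromStore();
                    _lastKnownGood = _cache;
                }
                catch (Exception)
                {
                    return Lookup(_lastKnownGood ?? _defaults, name);
                }
            }

            return Lookup(_cache, name);
        }
    }

    // Called when a configuration change is detected.
    public void Invalidate()
    {
        lock (_gate)
        {
            _cache = null;
        }
    }

    private string Lookup(IDictionary<string, string> source, string name)
    {
        string value;
        if (source.TryGetValue(name, out value))
        {
            return value;
        }

        return _defaults.TryGetValue(name, out value) ? value : null;
    }
}
```

Retrying the store on every call while it is down keeps the sketch short; a real implementation would more likely back off between attempts.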
When to Use this Pattern
This pattern is ideally suited for:
- Applications for which you must avoid all unnecessary downtime, while still being able to apply changes to the application configuration.
- Environments that expose events raised automatically when the main configuration changes. Typically this is when a new configuration file is detected, or when changes are made to an existing configuration file.
- Applications where the configuration changes often and the changes can be applied to components without requiring the application to be restarted, or without requiring the hosting server to be rebooted.
This pattern might not be suitable if the runtime components are designed so they can be configured only at initialization time, and the effort of updating those components cannot be justified in comparison to restarting the application and enduring a short downtime.
Example
Microsoft Azure Cloud Services roles detect and expose two events that are raised when the hosting environment detects a change to the ServiceConfiguration.cscfg files:
- RoleEnvironment.Changing. This event is raised after a configuration change is detected, but before it is applied to the application. You can handle the event to query the changes and to cancel the runtime reconfiguration. If you cancel the change, the web or worker role will be restarted automatically so that the new configuration is used by the application.
- RoleEnvironment.Changed. This event is raised after the application configuration has been applied. You can handle the event to query the changes that were applied.
When you cancel a change in the RoleEnvironment.Changing event, you are indicating to Azure that a new setting cannot be applied while the application is running, and that the application must be restarted in order to use the new value. In effect, you should cancel a change only if your application or component cannot react to the change at runtime and requires a restart in order to use the new value.
Note
For more information, see RoleEnvironment.Changing Event and Use the RoleEnvironment.Changing Event on MSDN.
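The decision of whether to cancel a change can be isolated from the event handler as plain logic. The following sketch assumes a hypothetical `ChangePolicy` class and hypothetical setting names; in a real RoleEnvironment.Changing handler, the changed setting names would come from the event arguments and the result would determine whether the change is canceled. That wiring is omitted here because it depends on the Azure SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical policy that decides whether a set of changed settings can be
// applied while the application is running.
public static class ChangePolicy
{
    // Assumed names of settings that this application can re-read at runtime.
    private static readonly HashSet<string> RuntimeAppliable =
        new HashSet<string>(StringComparer.Ordinal) { "CustomSetting", "LogLevel" };

    // Returns true when at least one changed setting cannot be applied at
    // runtime, which is the case in which the change should be canceled so
    // that the role restarts with the new configuration.
    public static bool RequiresRestart(IEnumerable<string> changedSettingNames)
    {
        return changedSettingNames.Any(name => !RuntimeAppliable.Contains(name));
    }
}
```

Keeping the list of runtime-appliable settings in one place makes it easier to review which changes will cause a restart.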
To handle the RoleEnvironment.Changing and RoleEnvironment.Changed events you will typically add a custom handler to the event. For example, the following code from the Global.asax.cs class in the RuntimeReconfiguration solution of the examples you can download for this guide shows how to add a custom function named RoleEnvironment_Changed to the event handler chain.
Note
The examples for this pattern are in the RuntimeReconfiguration.Web project of the RuntimeReconfiguration solution.
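protected void Application_Start(object sender, EventArgs e)
{
    // Apply the initial configuration when the web application starts.
    ConfigureFromSetting(CustomSettingName);

    // Subscribe so the configuration can be re-applied after each change.
    RoleEnvironment.Changed += this.RoleEnvironment_Changed;
}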
In a web or worker role you can use similar code in the OnStart event handler of the role to handle the RoleEnvironment.Changing event. This is from the WebRole.cs file of the example.
public override bool OnStart()
{
    // Add the trace listener. The web role process is not configured by web.config.
    Trace.Listeners.Add(new DiagnosticMonitorTraceListener());

    RoleEnvironment.Changing += this.RoleEnvironment_Changing;
    return base.OnStart();
}
Be aware that, in the case of web roles, the OnStart event handler runs in a separate process from the web application process itself. This is why you will typically handle the RoleEnvironment.Changed event in the Global.asax file, so that you can update the runtime configuration of your web application, and the RoleEnvironment.Changing event in the role itself. In the case of a worker role, you can subscribe to both the RoleEnvironment.Changing and RoleEnvironment.Changed events within the OnStart event handler.
请