Keep Your Service Running Forever by Designing an Instant Shutdown

More than a year of designing services and moving several of them from Azure Cloud Services to Service Fabric taught me a few things that are important to keep in mind when creating or refactoring microservices hosted in a Service Fabric environment. Remember that Service Fabric patterns are tightly coupled to .NET, which has gone through a massive paradigm shift. You need to be up to date with asynchronous programming, at the very least, to write solid services.

I experienced a failure of an entire cluster because I underestimated the importance of one detail. The service ran for half a year without any outages. Then it suddenly started to oscillate (one part of the system slowed down and caused a domino effect) and finally shut down (the logging indicated that the code was no longer executing). The Azure Portal notified me: “Your cluster version has expired. Go to ‘Fabric upgrades’ to upgrade to a supported version.”

This was a surprise because my cluster was set to automatic upgrade mode. It was stuck at version 5.5.216.0, although the latest available version at that time was 5.7.198.9494.

My attempt to upgrade to the latest version by switching to manual mode was not successful.

Later, I found out that every upgrade attempt was rolled back because of this failure:

This warning means that the service ignores the CancellationToken passed as an argument to the RunAsync method. (The warning is relevant to stateful and stateless reliable services. The actor service follows the single entry pattern.) Cancellation is so important because Service Fabric moves your services away from a node that is being prepared for an upgrade. When cancellation takes a very long time, the cancellation time multiplied by the number of upgrade domains may exceed the time limit for an environment upgrade. This causes the upgrade attempt to fail.

Service Fabric dynamically balances your services among cluster nodes according to their memory and compute characteristics. This mechanism is also paralyzed when a service freezes on a node. Another consequence is that monitored upgrades are blocked: when the current version of the service cannot be shut down, it cannot be replaced by a newer version.

The programmer’s mission is to write the code so that the CancellationToken is propagated to every awaitable call that can accept it. (When you communicate over HTTP, you should use HttpClient, because neither HttpWebRequest nor WebClient accepts a CancellationToken as a parameter.)
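
As an illustration of the propagation described above, here is a minimal sketch of a polling loop that hands the token to every awaitable call. The class name StatusPoller, the method PollAsync, and its parameters are hypothetical names chosen for this example; in a real service the token would be the one that RunAsync receives.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Service Fabric.
public static class StatusPoller
{
    private static readonly HttpClient Client = new HttpClient();

    // Polls the endpoint until cancellation is requested.
    // Both awaitable calls receive the token, so a shutdown request
    // interrupts the wait or the HTTP call instead of waiting for it to finish.
    public static async Task PollAsync(Uri endpoint, TimeSpan interval, CancellationToken cancellationToken)
    {
        while (true)
        {
            // Throws TaskCanceledException (an OperationCanceledException)
            // as soon as cancellation is requested.
            await Task.Delay(interval, cancellationToken);

            using (HttpResponseMessage response = await Client.GetAsync(endpoint, cancellationToken))
            {
                // ... inspect the response here ...
            }
        }
    }
}
```

Called from RunAsync as `await StatusPoller.PollAsync(endpoint, TimeSpan.FromSeconds(30), cancellationToken);`, the loop has no exit condition of its own. It ends only when one of the token-aware calls throws, which is one of the two acceptable ways for the service to terminate.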

Sometimes the CancellationToken.ThrowIfCancellationRequested method is useful, for example, in the body of a long-running loop. It does not matter whether the service terminates by throwing an exception or by returning from the RunAsync method. Both options are correct.
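
A small sketch of the long-running loop case, assuming a purely CPU-bound computation with no awaitable calls to pass the token to. BatchProcessor and SumSquares are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical example of a synchronous, long-running loop.
public static class BatchProcessor
{
    // Sums the squares of the items and checks the token once per iteration,
    // so a shutdown request is noticed after at most one item.
    public static long SumSquares(IEnumerable<int> items, CancellationToken cancellationToken)
    {
        long total = 0;
        foreach (int item in items)
        {
            // Throws OperationCanceledException if cancellation was requested.
            cancellationToken.ThrowIfCancellationRequested();
            total += (long)item * item;
        }
        return total;
    }
}
```

How often to check is a trade-off: the check itself is cheap, but the work between two checks sets the lower bound on how quickly the service can shut down.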

Once cancellation is requested, token-aware calls and ThrowIfCancellationRequested throw an OperationCanceledException. When you log exceptions in a catch clause, you may want to exclude this kind of exception. There are many ways to do it, for example, with an exception filter like this:

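try {
    // ... work that passes cancellationToken to every awaitable call ...
    cancellationToken.ThrowIfCancellationRequested();
    // ... more work ...
} catch (Exception ex) when (!cancellationToken.IsCancellationRequested) {
    // The filter skips this block once cancellation has been requested,
    // so the OperationCanceledException raised by a shutdown is not logged
    // and propagates to the caller.
    // ... log ex here ...
}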