Health monitoring of the Service Fabric app upgrade

Deploying an update of any application can be risky, because new code may contain new bugs. Unit testing is an advisable way to reduce the risk. However, some behavior depends on the workload, and some workloads are easier to simulate than others. Service Fabric provides health monitoring after the new application version is deployed to the cluster. If the new version is not healthy, the application is rolled back to the old version automatically. Setting up this protection against failures caused by upgrades is relatively easy.

Create a new Service Fabric Stateful service, open the Stateful1.cs file and replace its content with the following code:

using Microsoft.ServiceFabric.Services.Communication.Runtime;
using Microsoft.ServiceFabric.Services.Runtime;
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Health;
using System.Threading;
using System.Threading.Tasks;

namespace Stateful1
{
    internal sealed class Stateful1 : StatefulService
    {
        public Stateful1(StatefulServiceContext context)
            : base(context)
        {
        }

        protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
        {
            // This sample exposes no endpoints.
            return new ServiceReplicaListener[0];
        }

        protected override async Task RunAsync(CancellationToken cancellationToken)
        {
            // The service manifest version decides which behavior is simulated.
            var version = Context.CodePackageActivationContext.GetServiceManifestVersion();
            ServiceEventSource.Current.ServiceMessage(Context, $"version: {version}", Context.ServiceName);

            while (!cancellationToken.IsCancellationRequested)
            {
                if (version == "1.0.0")
                {
                    // Healthy version: the report lives for 1 minute and is refreshed every 10 seconds.
                    var healthInformation = new HealthInformation(nameof(Stateful1), "Watchdog", HealthState.Ok)
                    {
                        TimeToLive = TimeSpan.FromMinutes(1)
                    };
                    FabricRuntime.GetActivationContext().ReportDeployedServicePackageHealth(healthInformation);

                    // Simulated unit of work.
                    await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
                }
                else
                {
                    // Slow version: the report lives for 10 seconds but is refreshed only every 30 seconds.
                    var healthInformation = new HealthInformation(nameof(Stateful1), "Watchdog", HealthState.Ok)
                    {
                        TimeToLive = TimeSpan.FromSeconds(10)
                    };
                    FabricRuntime.GetActivationContext().ReportDeployedServicePackageHealth(healthInformation);

                    // Simulated unit of work that takes longer than the report stays valid.
                    await Task.Delay(TimeSpan.FromSeconds(30), cancellationToken);
                }
            }
        }
    }
}

As you can see, the code uses the HealthInformation class. An instance describes the health state of one property, and the health of the entire service is made up of multiple properties. Health information can either stay valid until it is overwritten, or it can require periodic renewal. In the latter case, the TimeToLive interval must be set. When the interval expires and no new health information has arrived, the health state automatically changes to Error.
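The two kinds of health information described above might be built as in the following sketch. It is only an illustration: the "Configuration" property name and the description text are made up for this example, and the sketch assumes that a report without TimeToLive stays in effect until it is replaced, and that RemoveWhenExpired set to false is what turns an expired report into an Error.

```csharp
using System;
using System.Fabric.Health;

static class HealthReportSketch
{
    // No TimeToLive is set: the report is assumed to stay in effect until
    // another report for the same source and property replaces it.
    // "Configuration" is a hypothetical property name.
    public static HealthInformation ValidUntilReplaced()
    {
        return new HealthInformation("Stateful1", "Configuration", HealthState.Ok)
        {
            Description = "Hypothetical one-time report."
        };
    }

    // TimeToLive is set: the report has to be sent again before the
    // interval elapses, otherwise it expires.
    public static HealthInformation MustBeRefreshed()
    {
        return new HealthInformation("Stateful1", "Watchdog", HealthState.Ok)
        {
            TimeToLive = TimeSpan.FromMinutes(1),
            // Keep the expired report so that the expiry is evaluated as an error
            // instead of the report silently disappearing.
            RemoveWhenExpired = false
        };
    }
}
```

Either object would then be passed to a report method such as ReportDeployedServicePackageHealth, as in the service code.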

The code above simulates two versions of the same service. In the first version, the unit of work (10 seconds) finishes before the health information expires (1 minute). Then the loop starts again, the health information is refreshed and the whole cycle repeats.

In the newer version, the unit of work takes longer (30 seconds), so the health information (valid for 10 seconds) expires before the work is done. This simulates an unexpected decrease in service performance. The service is unhealthy most of the time, so Service Fabric can detect it and stop the upgrade.
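The difference between the two versions comes down to simple arithmetic, which the following sketch works through. It is a simplified model, not Service Fabric code: TtlModel and ExpiredFraction are names invented for this example, and the model ignores the time the report call itself takes and any delay before the cluster notices an expired report.

```csharp
using System;

// Simplified model of a periodically refreshed health report (not the Service Fabric API).
static class TtlModel
{
    // Returns the fraction of each report cycle during which the last report has already expired.
    public static double ExpiredFraction(TimeSpan timeToLive, TimeSpan reportInterval)
    {
        if (reportInterval <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(reportInterval));
        }

        // The report covers the first timeToLive of the cycle; the rest is uncovered.
        var expired = reportInterval - timeToLive;
        return expired <= TimeSpan.Zero
            ? 0.0
            : expired.TotalSeconds / reportInterval.TotalSeconds;
    }
}

static class Program
{
    static void Main()
    {
        // Version 1.0.0: report valid for 1 minute, refreshed every 10 seconds.
        var first = TtlModel.ExpiredFraction(TimeSpan.FromMinutes(1), TimeSpan.FromSeconds(10));

        // Newer version: report valid for 10 seconds, refreshed every 30 seconds.
        var newer = TtlModel.ExpiredFraction(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(30));

        Console.WriteLine($"1.0.0 expired fraction: {first}");
        Console.WriteLine($"newer expired fraction: {newer}");
    }
}
```

Under this model the first version is never expired, while the newer version is expired for 20 of every 30 seconds, which is two thirds of the time.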

Publish the application as usual, then update its version. The code above treats any service manifest version other than 1.0.0 as the newer version.

Publish the application again, but modify the settings first. Check the Upgrade the Application option.

Click the Configure Upgrade Settings link and set the upgrade mode to Monitored. Verify that the FailureAction property is set to Rollback.
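The same settings can probably also be applied from code instead of the publish dialog. The following is a rough sketch using the FabricClient API as I understand it; it is not what Visual Studio runs internally. The application name and target version are placeholders, the health check durations are arbitrary example values, and the sketch assumes the new application version has already been uploaded and registered in the cluster.

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

static class UpgradeSketch
{
    // applicationName and targetVersion are placeholders,
    // for example new Uri("fabric:/MyApp") and "2.0.0".
    public static async Task StartMonitoredUpgradeAsync(Uri applicationName, string targetVersion)
    {
        var policy = new MonitoredRollingApplicationUpgradePolicyDescription
        {
            // Corresponds to the Monitored upgrade mode in the dialog.
            UpgradeMode = RollingUpgradeMode.Monitored,
            MonitoringPolicy = new RollingUpgradeMonitoringPolicy
            {
                // Corresponds to FailureAction = Rollback in the dialog.
                FailureAction = UpgradeFailureAction.Rollback,
                // Example values only; tune them to how fast the service reports health.
                HealthCheckWaitDuration = TimeSpan.FromSeconds(30),
                HealthCheckStableDuration = TimeSpan.FromSeconds(60),
                HealthCheckRetryTimeout = TimeSpan.FromMinutes(2)
            }
        };

        var description = new ApplicationUpgradeDescription
        {
            ApplicationName = applicationName,
            TargetApplicationTypeVersion = targetVersion,
            UpgradePolicyDescription = policy
        };

        // Connects to the local cluster by default.
        using (var client = new FabricClient())
        {
            await client.ApplicationManager.UpgradeApplicationAsync(description);
        }
    }
}
```

With the sample service, the health check durations would need to be long enough for the 10-second report to expire, otherwise the upgrade could be evaluated as healthy before the problem shows up.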

Click the Publish button and open Service Fabric Explorer. You can see that an upgrade is in progress.

While the upgrade is being processed, the health of the service is monitored.

If the service is unhealthy after the upgrade, it is rolled back to the original version.

Health monitoring can reflect the quality of the service and block the upgrade if the quality of the service decreases.