

Memory leak symptoms & causes in Service Fabric reliable services


Memory leaks are hard to detect and can cause serious problems. While a single job task running for a few seconds doesn't have to care about them, a service running 24 hours a day must be carefully tuned to successfully fulfill its job. Moreover, memory leaks don't have to be detected straight away; they usually show up as another exception, pointing you to look for problems in a completely different area. Let's look at one example of a memory leak and how it showed up.

A good Service Fabric service propagates the CancellationToken to every possible awaitable call (why? read my previous article). Keeping this in mind, I consistently wrote code like this:

using (var conn = new SqlConnection("..."))
{
    await conn.OpenAsync(cancellationToken).ConfigureAwait(false);
    using (var cmd = new SqlCommand("...", conn))
    {
        ...
        using (var reader = await cmd.ExecuteReaderAsync(cancellationToken))
        {
            while (await reader.ReadAsync(cancellationToken))
            {
                ...
            }
        }
    }
}

The code looks good, right? Using statements for everything IDisposable, a well-established ADO.NET technology with a long history dating back to August 1996. I thought nothing here could go wrong at all, but I was wrong. The code above contains a memory leak. It is not visible and it doesn't throw an OutOfMemoryException.


The caught exception was different. It was a Win32Exception with the message:

A connection was successfully established with the server, but then an error occurred during the pre-login handshake.

Moreover, the exceptions being raised weren't consistent. Sometimes the error message looked like this:

The client was unable to establish a connection because of an error during connection initialization process before login. Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server.


This message was very helpful, because connecting to an unsupported version of SQL Server is very unlikely with an Azure SQL database, and Query Performance Insights shows performance problems more accurately than exceptions in Application Insights. Additionally, the Service Fabric cluster reported something different:

SourceId='System.FabricNode', Property='SecurityApi_CertGetCertificateChanin', HealthState='Warning', ConsiderWarningAsError=false, …


I compared the actual code with the latest stable version and found the problem. It is the SqlDataReader.ReadAsync method that contains a memory leak. The underlying bug in this method had already been reported prior to my investigation, so my job was much easier. It is fixed in .NET 4.7, which became generally available in May 2017. I cannot say at this point whether this bug also causes my memory leak. I just reverted to a synchronous call. It is much easier than forcing the underlying virtual machine scale set with Windows Server 2016 to install the latest .NET Framework (and again after every possible cluster recreation).
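
For illustration, here is a minimal sketch of such a synchronous fallback under the same assumptions as the snippet above (the connection string, query, and row processing are placeholders, and cancellationToken is assumed to come from the service's RunAsync):

using (var conn = new SqlConnection("..."))
{
    // opening the connection can stay asynchronous
    await conn.OpenAsync(cancellationToken).ConfigureAwait(false);
    using (var cmd = new SqlCommand("...", conn))
    {
        // ExecuteReader/Read instead of their async counterparts
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // cooperative cancellation is still honored between rows
                cancellationToken.ThrowIfCancellationRequested();
                // ... process the row
            }
        }
    }
}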

A Service Fabric service can be updated in Monitored mode. It can roll back when the new version fails due to bugs that show up only during big data processing – typically memory leaks. An unhealthy service does not necessarily mean that the service is raising exceptions. It can also mean it is not processing as much data as expected – typically because of memory swapping.
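
To give an idea of what requesting a Monitored upgrade with automatic rollback can look like, here is a hedged sketch using the System.Fabric client API; the application name and target version are hypothetical, and the monitoring timeouts and health policy are left at their defaults:

using System;
using System.Fabric;
using System.Fabric.Description;

var upgrade = new ApplicationUpgradeDescription
{
    ApplicationName = new Uri("fabric:/MyApp"),  // hypothetical application
    TargetApplicationTypeVersion = "2.0.0",      // hypothetical version
    UpgradePolicyDescription = new MonitoredRollingApplicationUpgradePolicyDescription
    {
        UpgradeMode = RollingUpgradeMode.Monitored,
        MonitoringPolicy = new RollingUpgradeMonitoringPolicy
        {
            // roll back automatically when health checks fail during the upgrade
            FailureAction = UpgradeFailureAction.Rollback
        }
    }
};

using (var client = new FabricClient())
{
    await client.ApplicationManager.UpgradeApplicationAsync(upgrade);
}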