Slow Clustering

The clustering problem is getting worse. Users who log in to mail01 and then mail02, or vice versa, are having great difficulty because the two systems are not in sync. Especially in the morning, it takes a long time for the same emails to appear on both systems.

I tested this by creating a draft on mail01, and it took half an hour to replicate. At that point the availability index was 22. The issue is creating a lot of problems and operational difficulties.

I also quit the compact task. Sometimes cluster replication happens within a few seconds, and sometimes it takes 30 minutes.

SN   SVR01 (Drafted)   # of Users   A. Index   SVR02 (Sync)   Delay / Lag (h:mm)
1    9:14              600          35         9:29           0:15
2    10:20             700          30         10:59          0:39
3    11:03             700          29         11:18          0:15
4    11:43             600          22         12:12          0:29
5    12:40             550          13         13:05          0:25
6    13:05             500          9          13:06          0:01
7    13:06             500          20         13:07          0:01
8    14:07             360          38         14:07          0:00
9    14:16             380          35         14:16          0:00
10   14:30             450          35         14:30          0:00
11   14:52             600          38         14:59          0:07
12   15:20             650          37         15:20          0:00
13   15:42             700          18         15:42          0:00
14   16:01             750          24         16:04          0:03
15   16:45             700          26         16:46          0:01
16   17:01             712          39         17:03          0:02
17   17:14             723          43         17:14          0:00
18   17:26             600          30         17:26          0:00
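To make the pattern in these measurements easier to see, here is a minimal Python sketch that recomputes each lag from the drafted and synced times and averages it before and after a cutoff. The 13:00 cutoff and all function names are assumptions chosen for illustration; only the measurement values come from the table.

```python
from datetime import datetime

# (drafted on SVR01, users online, availability index, synced on SVR02),
# copied from the measurement table in the post.
MEASUREMENTS = [
    ("9:14", 600, 35, "9:29"),
    ("10:20", 700, 30, "10:59"),
    ("11:03", 700, 29, "11:18"),
    ("11:43", 600, 22, "12:12"),
    ("12:40", 550, 13, "13:05"),
    ("13:05", 500, 9, "13:06"),
    ("13:06", 500, 20, "13:07"),
    ("14:07", 360, 38, "14:07"),
    ("14:16", 380, 35, "14:16"),
    ("14:30", 450, 35, "14:30"),
    ("14:52", 600, 38, "14:59"),
    ("15:20", 650, 37, "15:20"),
    ("15:42", 700, 18, "15:42"),
    ("16:01", 750, 24, "16:04"),
    ("16:45", 700, 26, "16:46"),
    ("17:01", 712, 39, "17:03"),
    ("17:14", 723, 43, "17:14"),
    ("17:26", 600, 30, "17:26"),
]

CUTOFF = "13:00"  # assumed split point, not part of the original data


def to_time(text):
    """Parse an h:mm clock time on an arbitrary (same) day."""
    return datetime.strptime(text, "%H:%M")


def lag_minutes(drafted, synced):
    """Whole minutes between drafting on SVR01 and seeing the draft on SVR02."""
    return int((to_time(synced) - to_time(drafted)).total_seconds() // 60)


def average_lag(rows):
    """Mean lag in minutes over a list of measurement rows."""
    lags = [lag_minutes(d, s) for d, _users, _index, s in rows]
    return sum(lags) / len(lags)


before = [r for r in MEASUREMENTS if to_time(r[0]) < to_time(CUTOFF)]
after = [r for r in MEASUREMENTS if to_time(r[0]) >= to_time(CUTOFF)]

print(f"before {CUTOFF}: {average_lag(before):.1f} min over {len(before)} tests")
print(f"after  {CUTOFF}: {average_lag(after):.1f} min over {len(after)} tests")
```

With the values above, the five tests before 13:00 average about 25 minutes of lag, while the thirteen tests after it average about 1 minute, even though the user counts in the afternoon are similar.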

Subject: Slow Clustering

How many cluster replicators do you have?

Cluster replication should normally take less than 5 seconds to push a document update to the other servers in the cluster.

You can run this command on your clustered servers:

show stat replica.cluster*

You will get statistics like these:

Replica.Cluster.CachedHandles Number of handles cached by cluster.

Replica.Cluster.Docs.Added Number of docs added by cluster replicator.

Replica.Cluster.Docs.Deleted Number of docs deleted by cluster replicator.

Replica.Cluster.Docs.Updated Number of docs updated by cluster replicator.

Replica.Cluster.Failed Number of failed replications since server startup.

Replica.Cluster.Files.Local Number of cluster replicas on this server.

Replica.Cluster.Files.Remote Number of cluster replicas on other servers in cluster.

Replica.Cluster.Retry.Skipped Number of times the Cluster Replicator did not attempt replication because it knew the server was unreachable or the database inaccessible.

Replica.Cluster.Retry.Waiting Number of replicas currently waiting for replication retry attempts.

Replica.Cluster.SecondsOnQueue Number of seconds Cluster Replicator was on queue.

Replica.Cluster.SecondsOnQueue.Avg Average number of seconds Cluster Replicator was on queue.

Replica.Cluster.SecondsOnQueue.Max Maximum number of seconds Cluster Replicator was on queue.

Replica.Cluster.Servers Number of other servers in the cluster that are receiving replications from this server.

Replica.Cluster.SessionBytes.In Number of incoming bytes to the Cluster Replicator.

Replica.Cluster.SessionBytes.Out Number of outgoing bytes to the Cluster Replicator.

Replica.Cluster.Successful Number of successful replications since server startup.

Replica.Cluster.WorkQueueDepth Current number of modified databases in the queue waiting to be replicated by the Cluster Replicator to other servers.

Replica.Cluster.WorkQueueDepth.Avg Average number of modified databases in the queue waiting to be replicated by the Cluster Replicator to other servers.

Replica.Cluster.WorkQueueDepth.Max Maximum number of modified databases in the queue waiting to be replicated by the Cluster Replicator to other servers.

Replica.Docs.Added Number of documents added to the server’s databases via replication.

Replica.Docs.Deleted Number of documents deleted from the server’s databases via replication.

Replica.Docs.Updated Number of documents updated in the server’s databases via replication.

Replica.Failed Number of attempted replications that returned some kind of error.

Replica.Successful Number of replication attempts that returned no error of any kind.

If you run good servers with a dedicated network card, you should run at least 4 instances of the cluster replicator. One of my clients was running 6 instances of the cluster replicator without any problems.

JYR

Subject: RE: Slow Clustering

Dear JYR,

I am posting the result of show stat replica.cluster* below. It was taken at peak time, when clustering is delayed by up to 30 minutes. I have only 2 servers in the cluster.

The most important point is that server1 is slow to update server2, while server2 updates server1 in less than 5 seconds. For example, if I create a draft mail on server2, it appears on server1 within a couple of seconds, but server1 takes up to 30 minutes during peak hours. Peak hours run from the morning until 1:00 pm.

I have 4000 mailboxes on the server, but only about 700 users are online during our daytime. The others are in different parts of the world, so they come online during our night. The server was last restarted on the 6th of November.

Replica.Cluster.Docs.Added = 248858

Replica.Cluster.Docs.Deleted = 228298

Replica.Cluster.Docs.Updated = 62735

Replica.Cluster.Failed = 227

Replica.Cluster.Files.Local = 4087

Replica.Cluster.Files.Remote = 4094

Replica.Cluster.Retry.Skipped = 11800

Replica.Cluster.Retry.Waiting = 2

Replica.Cluster.SecondsOnQueue = 86

Replica.Cluster.SecondsOnQueue.Avg = 191

Replica.Cluster.SecondsOnQueue.Max = 6055

Replica.Cluster.Servers = 1

Replica.Cluster.SessionBytes.In = 1024107206

Replica.Cluster.SessionBytes.Out = 1414381844

Replica.Cluster.Successful = 263041

Replica.Cluster.WorkQueueDepth = 62

Replica.Cluster.WorkQueueDepth.Avg = 94

Replica.Cluster.WorkQueueDepth.Max = 2899
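For readers who want these raw counters in more familiar units, the short Python sketch below converts the queue times to h:mm:ss and derives a rough failure ratio. The helper name `hms` and the reading of the ratio as failed / (failed + successful) are assumptions made for illustration; the numbers are the ones posted above.

```python
# Values copied from the posted statistics.
failed = 227            # Replica.Cluster.Failed
successful = 263041     # Replica.Cluster.Successful
queue_avg_s = 191       # Replica.Cluster.SecondsOnQueue.Avg
queue_max_s = 6055      # Replica.Cluster.SecondsOnQueue.Max


def hms(seconds):
    """Format a number of seconds as h:mm:ss."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Rough share of replication attempts that failed since server startup.
failure_ratio = failed / (failed + successful)

print("average time on queue:", hms(queue_avg_s))   # 0:03:11
print("maximum time on queue:", hms(queue_max_s))   # 1:40:55
print(f"failure ratio: {failure_ratio:.4%}")
```

Read this way, the average wait on the queue is a little over three minutes and the worst case since startup is more than an hour and a half.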

Subject: RE: Slow Clustering

Hi Yasir,

JYR has asked some very good questions. Just to expand a little…

The default number of cluster replicators is 1, which is typically too few for a busy mail server. Recently, a client’s mail server was experiencing similar issues, with SecondsOnQueue.Avg at 100. By simply loading several more cluster replicators, we were able to reduce this value to 3 seconds on the queue. My personal rule of thumb is that an average of < 10 seconds is acceptable and the max should be < 30 seconds - YMMV.

However, cluster replicators are not the only factor that can affect this, so before you assume the above will correct the problem, be sure to rule out other issues. Below are some examples.

  • Network slowness or latency can cause high values for Seconds On Queue (SOQ).

  • Slow or overly busy disks can cause high values for SOQ.

  • Other performance factors can also cause high values for SOQ.
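The rule of thumb given earlier in this reply (average under 10 seconds, maximum under 30 seconds on the queue) can be checked against pasted console output with a small script. This is only a sketch: it assumes the "Name = value" layout shown in the statistics posted in this thread, and the function names and limit constants are hypothetical.

```python
AVG_LIMIT_S = 10   # personal rule of thumb from the reply, not an official limit
MAX_LIMIT_S = 30


def parse_stats(text):
    """Turn 'Name = value' lines into a dict of integers, skipping other lines."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            # Tolerate thousands separators such as 1,024 in the value.
            stats[name.strip()] = int(value.strip().replace(",", ""))
        except ValueError:
            continue
    return stats


def queue_warnings(stats):
    """Return a message for each SecondsOnQueue value above its limit."""
    checks = [
        ("Replica.Cluster.SecondsOnQueue.Avg", AVG_LIMIT_S),
        ("Replica.Cluster.SecondsOnQueue.Max", MAX_LIMIT_S),
    ]
    warnings = []
    for name, limit in checks:
        value = stats.get(name)
        if value is not None and value >= limit:
            warnings.append(f"{name} is {value} s (rule of thumb: < {limit} s)")
    return warnings


sample = """
Replica.Cluster.SecondsOnQueue = 86
Replica.Cluster.SecondsOnQueue.Avg = 191
Replica.Cluster.SecondsOnQueue.Max = 6055
"""

for message in queue_warnings(parse_stats(sample)):
    print(message)
```

A clean result here does not rule out the network or disk causes listed above; it only shows whether the queue times themselves look unusual.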

Subject: RE: Slow Clustering

Hi,

Can you post the cluster stats from both of your clustered servers?

How many cluster replicator tasks do you have on each server?

Thank you,

JYR