Desperate to find out what the server is actually doing

I sincerely hope that someone may have some suggestions to help us.

We have a couple of clustered R6 servers.

The servers are spec’d exactly the same.

On those servers is a rather large (approx. 10GB) database.

The database is running very well on one of the servers (Server “B”), and like an absolute dog on the other (Server “A”).

I know that the issue is most likely related to the number of users who regularly attach to, and replicate with the server.

We have about 900 users who replicate the database to their remote clients.

Most of those users automatically replicate with Server “A”.

We have tried forcing the clients to “failover” to Server “B” by flagging the database “Out Of Service”, but that has no effect on client replication (as we found out by reading the Admin help - duh!).

What confuses us even more is the discrepancy between the Server Availability Index (SAI) displayed on the graphs we use to monitor the servers’ realtime performance, and the client access performance we experience at the same time.

On the graphs, we observe between 90 - 100% availability for BOTH servers.

Opening any database on Server “B” is quite quick.

Attempting the same thing on Server “A”, however, means you might as well make yourself a cup of coffee while you wait.

We can see all the clients attached to Server “A”, but really can’t tell what they are doing.

The server tasks do not show anything out of the ordinary happening at all - no indexer, no agents, no replication.

Yet, we KNOW the server is being hammered.

I guess I have a couple of questions…

  1. Is there any way to force replication of a clustered database with a particular server (regardless of the replication settings on the Notes client)?

  2. Is there any way to generate more granular statistics that will show us exactly what the server is doing when the SAI is reporting 100% and server tasks show nothing wrong, but we are experiencing massive connectivity issues?

Any and all suggestions would be most welcome!!

Cheers,

T.

Subject: Desperate to find out what the server is actually doing

In response to your first question (Is there any way to force replication of a clustered database with a particular server), you might be able to do this with a tool from IBM called DSKTOOL.EXE http://www-1.ibm.com/support/docview.wss?&uid=swg24004260

It allows you to manipulate the workspace icons / bookmarks, which might let you force replication to a different server.

Subject: Desperate to find out what the server is actually doing

First of all, thanks very much for the responses guys!

Secondly, we came up with a way to prevent access to a database without modifying the ACL.

Flagging the database as “Out of Service” had absolutely no effect on backend Notes client replication.

In the end, I simply moved the database to a subdirectory outside of the Domino “Data” directory.

I then created a directory link file that points to this new location, and placed it in the server’s “Data” directory. That file included a new group that I created in the Domino Directory.

Using this method, I could easily grant access to admins and servers, while at the same time forcing the clients to “fail over” to the replica on the other server.
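
As a rough illustration of the link file described above: a Domino directory link is a plain text file with a .dir extension whose first line is the full path of the target directory, and any following lines name the users or groups allowed through it. The Python sketch below only builds that text. The function name, the path and the group name are hypothetical and are not the ones used in this thread.

```python
def build_directory_link(target_dir, allowed_names):
    """Return the text of a directory link (.dir) file.

    Line 1 is the full path of the linked directory; each following
    line names a user or group that may open databases through the link.
    """
    if not target_dir.strip():
        raise ValueError("target_dir must not be empty")
    names = [name.strip() for name in allowed_names if name.strip()]
    return "\n".join([target_dir.strip()] + names) + "\n"


# Hypothetical path and group name, for illustration only.
link_text = build_directory_link(r"E:\Restricted\Customer", ["CustomerDbReplicators"])
print(link_text, end="")
```

Saving that text under the server’s data directory with a .dir extension is what makes the folder appear to clients; anyone not named in the file cannot open databases through the link.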

Now, through all of this process, we found that the second server (server “B” that was running fine previously) started to run like a dog as well as soon as clients were directed to replicate with it.

One would expect large replication traffic in this sort of case, but we have observed absolutely nothing out of the ordinary (this database has been in operation for well over 4 years).

I do not believe this issue has anything to do with the internal operations of the database, since it is very rarely opened/used here at head office. It is primarily used once it has been replicated locally by the users. Also…

  • All scheduled agents have been disabled

  • The database open script has been disabled

  • The index properties of the large views have been modified so that they are updated at most every 12 hours

We are currently investigating the use of SPOFU (Shared, Private On First Use) views in the database. The users currently have the ability (through the ACL settings) to save their private folders/views in the database. Yes, I know this is not advisable, but I have inherited this system and wanted to preserve the way it operated without “rocking the boat” too much.

Besides that, this setting has been associated with the database for years with little adverse effect on the rest of the server (yes, the database runs slowly centrally most times, but it’s not used here very much).

There are a little more than 3,500 views/folders in the system, about 110 of which are shared. I am undertaking a project to remove all private views (not the folders) and de-select the authority in the database ACL.

I will also consider the feedback provided earlier, and post back anything I find.

Thanks again for your help thus far!

T.

Subject: Desperate to find out what the server is actually doing

Hi Terry,

We can see all the clients attached to Server “A”, but really can’t tell what they are doing.

You can try this; it’s very useful:

Keyword: Client_Clock

Description: Enables certain time keeping logging… see the link

http://www.drcc.com/ref/notesini.nsf/all/7A4872639FB90313C12567D70052CE32

Have you checked the user activity in the log.nsf db? You could be surprised.

Is there any massive reading going on?

Are there any private views on one of the servers? (I know they are supposed to be replicas, but :-)…)

Are there any special agents running on one side?

Have you done a delta of both designs? Maybe you have hidden design elements.

Do you maintain unread documents count?

Have you read those? Seems to be

About readers fields, three very interesting documents:

Performance Considerations for Domino Applications

http://www.redbooks.ibm.com/redbooks/pdfs/sg245602.pdf

Notes from Support: Reader Names fields can impact performance

and

technote : Use of ReaderNames fields slows view performance in Notes

Product: Lotus Notes > Lotus Notes > Version 7.0, 6.5, 6.0, 5.0

Platform(s): Mac OS, Windows

Doc Number: 1097609

Published 2006-05-22

http://www-1.ibm.com/support/docview.wss?uid=swg21247611

Excellent article

Great links here:

http://www-10.lotus.com/ldd/bpmpblog.nsf/dx/search.htm?opendocument&q=performance

Notes/Domino Best Practices: Performance

http://www-1.ibm.com/support/docview.wss?rs=463&uid=swg27008849

JYR

Subject: RE: Desperate to find out what the server is actually doing

Hi Terry

From

http://www.martinscott.com/home.nsf/newsview/C7F5B845A9E6C76385256C180007FBEF?opendocument

RPC: Description

Open_Session: Authenticate with the server and establish a session

Open_Database: Find and open a database

Open_Note: Send the contents of a note (data document or design element)

Open_Collection: Open a view

Read_Entries: Send a list of information from a view or search, usually follows Open_Collection

Find_By_Key: A view lookup via DBLookup

Get_Special_Note_ID: Send info from the ACL

Close_dB: Close DB session

This list will help you interpret the output of the Client_Clock setting.
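
To make that output easier to digest, the timings can be totalled per RPC. What follows is only a sketch in Python, not an IBM tool. It assumes each logged line looks roughly like the invented samples below: a sequence prefix, the RPC name, and a final elapsed time in ms followed by a byte count in brackets. The sample lines, server name and timings are made up for illustration, so adjust the pattern to whatever your client actually writes.

```python
import re
from collections import defaultdict

# Assumed line shape (not taken from a real log):
#   (seq-id [seq]) RPC_NAME(optional target): ... <elapsed> ms. [bytes]
LINE = re.compile(r"\]\)\s+(?P<rpc>[A-Za-z_]+).*?(?P<ms>\d+) ms\. \[")


def total_ms_by_rpc(lines):
    """Sum the final elapsed time of each logged call, grouped by RPC name."""
    totals = defaultdict(int)
    for line in lines:
        match = LINE.search(line)
        if match:  # skip anything that does not fit the assumed shape
            totals[match.group("rpc").upper()] += int(match.group("ms"))
    return dict(totals)


# Invented sample lines, for illustration only.
sample = [
    "(1-78 [1]) OPEN_SESSION: 0 ms. [32+154=186]",
    "(2-78 [2]) OPEN_DB(CN=ServerA/O=Acme!!customer.nsf): (Connect: 130 ms) 240 ms. [100+256=356]",
    "(3-79 [3]) OPEN_COLLECTION(REP1234:NOTE5678): 1200 ms. [48+34=82]",
    "(4-80 [4]) READ_ENTRIES(REP1234:NOTE5678): 3500 ms. [76+65000=65076]",
    "(5-81 [5]) READ_ENTRIES(REP1234:NOTE5678): 2800 ms. [76+65000=65076]",
]

totals = total_ms_by_rpc(sample)
# Slowest RPC types first
for rpc, ms in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{rpc:<16} {ms:>6} ms")
```

Sorting by total time puts the RPC types that dominate the wait at the top, which is usually the first thing to look at.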

Subject: RE: Desperate to find out what the server is actually doing

Hi Jean-Yves,

Thanks very much for responding with such verbosity.

I know a couple of consultants here that are busy reading through the links you provided (after pulling their fingers out of their hair!!).

Through further investigation…

http://www-10.lotus.com/ldd/nd6forum.nsf/DateAllThreadedweb/54bf4b3a37ff19c48525737d0021c412?OpenDocument

…we have determined that the replication of the database with our remote clients seems to be the main contributing factor in the performance degradation of our server. It isn’t just the database in question that starts to operate extremely slowly - it’s the entire server.

I did try a lot of the techniques to which you linked, and they showed severe latency in a number of areas.

However, it is my contention that it is not the actual database operation that is causing the issues, but the replication of the database out to the 900 or so remote Notes clients. This database is very rarely used in-house (opened on the server) - it is primarily used by our remote clients, who each maintain a relatively small subset of the db data. There are no reports of performance degradation on the client at all.

I am currently undertaking a project to remove all of the private SPOFU view instances from the server-based replicas, and then de-select the ACL setting which permits the users to save their private views/folders in the database (I obviously have to leave the private folders intact). Hopefully, that will have some sort of positive impact on the performance moving forward.

I have also raised a PMR with IBM here to see if I can get even more suggestions as to the origin of the issue.

I’ll make sure to post anything I find back here.

Cheers!

T.

Subject: RE: Desperate to find out what the server is actually doing

Hi Terry,

I know a couple of consultants here that are busy reading through the links you provided…

I think those links are obvious… you need to change your team of consultants!!! :-)

Anyway, for your private views, 2 good technotes:

How to delete private views stored in the server database in Notes

http://www-1.ibm.com/support/docview.wss?uid=swg21093742

Is it possible to use LotusScript to delete a private or personal ‘on first use’ view?

http://www-1.ibm.com/support/docview.wss?rs=0&uid=swg21086960

Subject: RE: Desperate to find out what the server is actually doing

Hi Jean-Yves,

Thanks again for your valuable feedback.

As for my team of consultants, I believe them to be quite capable.

When you have been spinning your wheels for over a week trying in vain to solve a very elusive issue, the frustration, the lack of results, and almost 1,000 angry users (and management) tend to blur your vision sometimes.

As I mentioned in my earlier post, we had tried many of the suggestions in those links, but it seems the actual operation of the database is not the culprit here. It is only when we experience user load (upwards of 100 to 200 concurrent users at a time) replicating with the database that we start to see the downward performance spiral.

Thanks for your links to methods with which I can remove the private views.

I did use a script agent, but wanted to avoid indexing every view in the database. I believe looping through a database’s views with something like - “Forall vw in db.Views” - touches every view in the system.

Instead, I used a NotesNoteCollection to collect just the view design elements. Using that collection, I could then examine the associated Notes documents and determine whether or not they represented private views. I added those documents to a document collection, and then removed all documents in the collection at the end of the agent.

I had a heck of a time running the agent from my client, as it would never complete in a “reasonable amount of time” (according to the network). So, I simply modified it to run as a scheduled agent on the server. That was more effective, but once it started, the AMgr just never let it go (it never ended). Not even the agent execution limitations defined in the server document would override and terminate it. I finally had to terminate it manually with the Admin client. I could see that it had done the job overnight though, because the private views were missing from the server log document.

Any thoughts on why the agent might act like this (feel free to rip my logic apart!)?

For anyone interested, here is the code (missing a lot of the preamble, error checking, and auditing/reporting logic)…

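' Work on the database the agent runs in
Set customerDb = session.CurrentDatabase

' "GFIYT" is presumably a profile name that does not exist, used to obtain an empty collection to fill
Set collection = customerDb.GetProfileDocCollection( "GFIYT" )

' Collect only the view design notes, without opening (and indexing) any view
Set nc = customerDb.CreateNoteCollection( False )
nc.SelectViews = True
Call nc.BuildCollection

nid$ = nc.GetFirstNoteId
For i% = 1 To nc.Count
	Set viewDoc = customerDb.GetDocumentByID( nid$ )
	If Isarray( viewDoc.Items ) Then
		' Only notes with a $Readers item are candidates...
		If viewDoc.HasItem( "$Readers" ) Then
			If viewDoc.HasItem( "$Flags" ) Then
				' ...and a "V" in $Flags is treated as the private-view marker (case-sensitive match)
				flagValues = viewDoc.GetItemValue( "$Flags" )(0)
				If Instr( 1, flagValues, "V", 0 ) > 0 Then
					Call collection.AddDocument( viewDoc )
				End If
			End If
		End If
	End If

	nid$ = nc.GetNextNoteId( nid$ )
Next

' Delete everything that was collected, in one pass at the end
If collection.Count > 0 Then Call collection.RemoveAll( True )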

Again, thanks very much for your interest in this discussion thread.

We are very hopeful that a solution will be forthcoming shortly, and will be sure to post any further findings here for future reference.

Cheers!

T.

Subject: RE: Desperate to find out what the server is actually doing

Hi Terry

For your agent, I don’t know why it was not responding, especially if all your private views were destroyed. So maybe the agent was running well.

Did you try to use Tivoli?

Look in the admin help, for the word “Tivoli”

Look on Google for : Tivoli Analyzer for Lotus Domino

http://publib.boulder.ibm.com/tividd/td/IBMTivoliAnalyzerforLotusDomino6.0.html

In R6, you will have to install a little program named : Tivoli Analyzer for Lotus Domino.

In R7 (Client admin side), it is already included in the admin console.

It will give you valuable information.

Also, about your db, I know it is 10GB, but can you tell more about the application?

Do you have a lot of documents?

Do you have any readers fields?

When your users replicate, do they have to replicate a few documents or do they have to replicate allllllllll the documents because they are updated by an agent?

Can you identify which type of elements your users replicate?

You can use this setting (via DRCC http://www.drcc.com/ref/notesini.nsf/all/710FFF922D28E215862569E0004F5328#)

Log_Replication

Defines the logging level of replication events. Acceptable values are:

Setting: Replication events recorded

0: None

1: Only the fact that a database is replicating

2: Summary data about each database

3: Information about each replicated document, including design and data

4: Information about each field replicated

debug_repl_all

http://www.drcc.com/ref/notesini.nsf/all/085870A7479EB49FC1256BA300755D70

To restrict the access, have you tried these?

Server_Availability_Threshold

http://www.drcc.com/ref/notesini.nsf/all/F760AE0A5E838E5DC125668A00422339

Server_MaxUsers

http://www.drcc.com/ref/notesini.nsf/all/1691F713D4423D67C125668A00422345

For your SAI, have you read those?

Frequently Asked Questions - Server Availability Index (SAI) in Domino 6.x and 6.5x

JYR

Subject: RE: Desperate to find out what the server is actually doing

Hi Jean-Yves,

Thanks again for hanging in there with me!

I must admit, I had never really considered using Tivoli - I will certainly investigate that avenue as well.

In regards to your questions about the database in question…

  • It is approx 10.5GB in size

  • After running “Compact -D” to remove all view indexes, the database size is reduced to 5.5GB

  • There are approximately 900k documents in the database

  • Nightly agents typically change up to 1k documents (e.g. updating status fields and changing view inclusion). Only a very small subset of these changed documents apply to any individual user

  • We use “Readers” fields extensively to secure the data

  • When users replicate, they only replicate their own data, and that typically results in a database replica between 50 - 200MB in size

  • Before last night, there were 1,172 views in the database, of which 1,079 were private. Those have now been removed, and the ACL has been changed (possibly only temporarily) to prevent the storage of those views in the database.
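
Put as quick arithmetic, the figures above show how much of the file is view index and how many of the views were private. This is a throwaway Python sketch using only the numbers in the list; the variable names are illustrative.

```python
total_gb = 10.5            # database size including view indexes
without_indexes_gb = 5.5   # size after "Compact -D" discarded the view indexes
index_gb = total_gb - without_indexes_gb
index_share = index_gb / total_gb

views_total = 1172         # views in the database before the clean-up
views_private = 1079       # of which private
private_share = views_private / views_total

print(f"View indexes: {index_gb:.1f} GB ({index_share:.0%} of the file)")
print(f"Private views: {private_share:.0%} of all views")
```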

We are currently creating multiple remote replicas to determine whether or not the upgrade resulted in the creation of rogue documents without any “Reader” security. We have taken several steps to prevent such things happening (e.g. preventing document pasting), but it is always a possibility. That, and the fact that those rogue documents would probably not appear in any of the database’s UI views. I am sceptical this is the issue (but I am ALWAYS open to ANY possibilities!), because there have been no associated traffic spikes reported by the network admins.

Just on a side note, IBM has contacted me and asked me to set up the following…

debug_capture_timeout=1

debug_show_timeout=1

console_log_enabled=1

debug_threadid=1

debug_disable_chronos=1

I did so last night.

I then opened the database up for replication to all users at 8AM this morning.

By 8:20 this morning, we had approx 80 users replicating the database, and the server had been brought to its knees.

I executed the necessary steps to produce the log files (nsd, semdebug, console, etc.) and shot them off in an e-mail to IBM in India.

I then restarted the server and isolated the database again.

The server is now running perfectly (nobody replicating that particular database) with 300 connected users.

The investigation continues.

Thanks again!

T.

Subject: Some more info re: server-2-server replication…

I was just trolling through the server logs, and found another interesting thing.

Every time the cluster replicator attempts to replicate the database between the servers, it results in the following error:

“Unable to replicate from database.nsf to CLUSTER ServerA/Domain database.nsf: Replication cancelled due to timeout”

I did check the “Connection” documents between the servers (as suggested in your R4/5 Forum post), and there is no defined “Replication Time Limit”.

I also took note of the replica document counts last night…

Server A: 848,344

Server B: 848,429

I cleared the replication history on both servers, and replicated the database via the server console.

When I came in this morning, the replica task had completed.

However, the document counts had not changed at all.

I did not expect the document count to increase because of user replication (the database had been isolated at that point), but I DID expect the document counts to equate.

The servers both have identical (Manager) access in the database ACL with all necessary roles.

So, it seems even the clustered servers are experiencing the same replication performance issues as the clients. It is only when a lot of clients attempt the same operation concurrently that it hammers the servers until they can no longer accept any more requests.

Cheers,

T.

Subject: RE: Some more info re: server-2-server replication…

Hi Terry,

How many cluster replicators do you have?

You can run this command on your clustered servers:

show stat replica.cluster*

It will give you something similar:

Replica.Cluster.Docs.Added = 5022

Replica.Cluster.Docs.Deleted = 62226

Replica.Cluster.Docs.Updated = 162746

Replica.Cluster.Failed = 4149

Replica.Cluster.Files.Local = 1057

Replica.Cluster.Files.Remote = 2066

Replica.Cluster.Retry.Skipped = 592538 <–check this

Replica.Cluster.Retry.Waiting = 0

Replica.Cluster.SecondsOnQueue = 0 <–check this

Replica.Cluster.SecondsOnQueue.Avg = 18 <–check this

Replica.Cluster.SecondsOnQueue.Max = 1894 <–check this

Replica.Cluster.Servers = 2

Replica.Cluster.SessionBytes.In = 378840196

Replica.Cluster.SessionBytes.Out = 1747925166

Replica.Cluster.Successful = 433515

Replica.Cluster.WorkQueueDepth = 0 <–check this

Replica.Cluster.WorkQueueDepth.Avg = 7 <–check this

Replica.Cluster.WorkQueueDepth.Max = 363 <–check this
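
To keep an eye on the statistics marked “check this”, the console output can be filtered offline. Here is a minimal sketch in Python, assuming the output of show stat replica.cluster* has been pasted into a string with one Name = value pair per line. The helper name and the choice of which names to watch simply follow the marks above; it is an illustration, not an official health check, and it applies no thresholds.

```python
# Substrings of the statistic names marked "check this" in the sample above
WATCHED = ("Retry.Skipped", "SecondsOnQueue", "WorkQueueDepth")


def parse_stats(text):
    """Turn 'Name = 123 <-note' lines into a {name: int} dictionary."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        tokens = value.split()
        # Keep only lines with a numeric value; trailing notes are ignored
        if sep and tokens and tokens[0].isdigit():
            stats[name.strip()] = int(tokens[0])
    return stats


# A few values copied from the sample output above
console_output = """
Replica.Cluster.Failed = 4149
Replica.Cluster.Retry.Skipped = 592538 <-check this
Replica.Cluster.SecondsOnQueue.Avg = 18 <-check this
Replica.Cluster.SecondsOnQueue.Max = 1894 <-check this
Replica.Cluster.Successful = 433515
Replica.Cluster.WorkQueueDepth.Avg = 7 <-check this
Replica.Cluster.WorkQueueDepth.Max = 363 <-check this
"""

stats = parse_stats(console_output)
watched = {name: value for name, value in stats.items()
           if any(key in name for key in WATCHED)}
failed = stats["Replica.Cluster.Failed"]
failure_share = failed / (failed + stats["Replica.Cluster.Successful"])

for name, value in watched.items():
    print(f"{name} = {value}")
print(f"Failed share of cluster replications: {failure_share:.2%}")
```

Running the same filter on snapshots taken at different times of day makes it easier to see whether the queue figures grow while users are replicating.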

If you run good servers, with a dedicated network card, you should run at least 4 instances. One of my clients was running 6 instances of the cluster replicator without any problems.

JYR

Subject: RE: Some more info re: server-2-server replication…

Thanks again for persisting with this Jean-Yves.

To date, we have only ever needed to run with the single cluster replicator.

I take your point though, and will certainly review the stats once we are back up and running to see if more replicators are warranted.

On that, we are still experiencing dramatic server performance degradation whenever our users start replicating en masse with the database.

Using the “Directory Link” technique described in earlier posts, I have been able to isolate the database from client replicators, and can easily reproduce the problem.

IBM has struggled to find any issues, so our only course of action seems to be to expose the database every evening/weekend to let whatever hidden replication backlog there seems to be clear itself.

This is by far the most frustrating issue I personally have ever had to deal with (having been a Lotus devotee for over 15 years now)! A simple database design upgrade (and it really was quite simple) has caused irreparable damage to both our reputation and our confidence.

We will, of course, persevere until we have resolved this issue.

There must be a solution for this somewhere!

Cheers!

T.

Subject: RE: Some more info re: server-2-server replication…

To date, we have only ever needed to run with the single cluster replicator.

I would run a lot more cluster replicators. This is an in-memory process, so it is not a priority task like nrouter or amgr.

So the more you run, the more accurate your information will be. When you have only one, if one of your users sends a small message (a few KB) to a lot of users, nrouter has to distribute this message, the views of every database must be updated, and it must also replicate to the other cluster mate. So sometimes you will get poor stats, like the lines marked “check this” in this post:

http://www-10.lotus.com/ldd/nd6forum.nsf/DateAllFlatweb/bda729c17b5452d68525737e003b1492?OpenDocument

Look in the help files for more details about each stats.

For the replication, I would be curious to see what kind of information is replicated. It’s as if the replication with the server never ends. Either there is too much to replicate (??), or, like a “handshake” that fails, the client never receives the acknowledgement that the replication is finished. Maybe it is a problem on the client side.

Perhaps a cache issue or a replication history problem, where the client replicates the database every morning as if it were the first time.

JYR