Scenario:
We have two clustered servers running multiple instances of the same application. Each NSF has an agent that is triggered by a console command once per minute. These apps run 24x7, with each instance specified to run on one of the two clustered servers. This has been working great for years.
Problem:
Over the last two weeks, every night at 4AM, one of the servers throws errors from every instance. The error is either "Invalid Universal ID" or "Document has been deleted". This only occurs on one server, and only for a 2-4 minute period around 4AM.
The network admins claim there is no backup or virus software running at this time. There are no Domino maintenance tasks running at this time (that I can find in notes.ini or log.nsf). Nothing shows up in the Windows Event Viewer.
I’m 99.999% sure it’s not related to the application: I can move the instances to the other server with no issues, and the problem never occurs outside of this window.
The two servers are almost identical in every way. One other oddity: on Sunday, during the time change, the error occurred at 5AM (as if the time had not changed). On Monday it was back to 4AM.
Has anyone seen anything like this? Any suggestions on how to track down the problem?