Agents / Server Performance

We are getting ready to implement IBM's Content Collector (ICC) to archive e-mail that is over 1 year old. This process places all e-mail into a searchable repository and deletes the e-mails from the user's production mail file. Part of the archiving process requires that any e-mail restored by the end user gets deleted from the production mail file after 48 hours; since the restored e-mail already exists in the repository, we just need to delete it. The ICC process does not account for this requirement, and IBM's solution was to provide us with an agent to add to and run in EVERY mail file on all of our servers. We have about 4500 mail files on 9 servers. Some of the mail files are very large (that will change once we begin the archive process). The agent uses db.Search to create a collection of restored e-mails and then uses ndc.RemoveAll to delete the collection of documents found by the search formula ( SELECTION_FORMULA$ = |(IBMAfuMessageState = "RSEARCH" & IBMAfuRestoreDate < @Adjust(@Now; 0; 0; 0; -48; 0; 0; [InGMT]))| ).
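For reference, the described agent presumably looks something like the sketch below. This is my reconstruction from the description, not IBM's actual code; only the selection formula is taken verbatim from the post.

```lotusscript
Sub Initialize
	' Reconstruction (untested) of the described ICC cleanup agent:
	' find restored e-mails older than 48 hours and delete them.
	Dim s As New NotesSession
	Dim db As NotesDatabase
	Dim ndc As NotesDocumentCollection
	Dim formula As String

	Set db = s.CurrentDatabase
	' Selection formula as quoted in the post
	formula = |(IBMAfuMessageState = "RSEARCH" & IBMAfuRestoreDate < @Adjust(@Now; 0; 0; 0; -48; 0; 0; [InGMT]))|

	' Full db.Search with no date/time cutoff (third argument 0 = no limit
	' on matches), then delete everything the formula matched
	Set ndc = db.Search(formula, Nothing, 0)
	If ndc.Count > 0 Then Call ndc.RemoveAll(True)
End Sub
```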

So my question is: if we add this agent to the mail template so that it runs in every mail file every night, can you see any issues with server performance? The mail file count on each server ranges from 300 to 1000. How many mail files are on each server currently depends on the size of the mail files, so we will likely distribute the mail files more evenly, but we will also reduce the number of servers.

I did run a test of the agent in one user's mail file and it worked fine and ran rather fast (~10 seconds), but the mail file was rather small and only had about 4000 docs in it. Even if the agent took only 10 seconds to run in each mail file, it would still take about 3 hours (1000 × 10 seconds ≈ 2.8 hours) to run the agent in all mail files on a server that has 1000 mail files and one Agent Manager task running.

I have my doubts about an agent running in every mail file on every server every night, but wanted to get some feedback from others. It also seems like a waste of time and resources, because most users will not be restoring much e-mail, so in most cases the agent's search will return nothing and the agent will have run for nothing. I'm not looking for suggestions to change anything; I just want some feedback on your thoughts regarding the agent and server performance. Thx!

Subject: Search method can work provided…

When doing timing tests, are you having the server run the agent, or are you running it from your client? It makes a big difference, because if you use the client you have to add network speed to the time it takes to search the documents, whereas the server accesses them as local files.

If the mail files are full-text indexed you could use FTSearch instead (or just build the query into the agent selection).

Or, since this is one of those rare cases where you're looking only for documents modified quite recently, I would think Search would be OK if you took advantage of the date/time argument to search only among documents modified in the last 3 days or so (whatever you decide is a safe margin over 48 hours, in case the agent doesn't run one day for whatever reason).
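Concretely, applying that suggestion to the selection formula quoted earlier might look like the sketch below. This is untested against ICC, and the 90-hour cutoff is just an example of a safe margin over 48 hours:

```lotusscript
	' Sketch (untested): same Search, but with a cutoff NotesDateTime so the
	' server only scans documents modified in roughly the last 3.75 days.
	' Formula is the one quoted earlier in the thread.
	Dim s As New NotesSession
	Dim db As NotesDatabase
	Dim ndc As NotesDocumentCollection
	Dim cutoff As New NotesDateTime("")
	Dim formula As String

	Set db = s.CurrentDatabase
	formula = |(IBMAfuMessageState = "RSEARCH" & IBMAfuRestoreDate < @Adjust(@Now; 0; 0; 0; -48; 0; 0; [InGMT]))|

	cutoff.SetNow
	cutoff.AdjustHour -90   ' example safe margin over 48 hours

	Set ndc = db.Search(formula, cutoff, 0)
	If ndc.Count > 0 Then Call ndc.RemoveAll(True)
```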

I just tried that in my own local mail file (which is fairly large): searching with Nothing as the date/time argument takes 1.2 seconds, while searching with a date/time argument of now minus 90 hours takes 0.04 seconds. The full-text search took 0.6 seconds.

Searching for documents - the wiki doc.

Code I used to test timing:

Sub Click(Source As Button)
	Dim db As New NotesDatabase("", "aguirard")
	Dim t1 As Single, t2 As Single, t3 As Single, t4 As Single
	Dim coll As NotesDocumentCollection
	Dim ndt As New NotesDateTime("")
	ndt.SetNow
	ndt.AdjustHour -90
	t1 = Timer
	Print "first search"
	Set coll = db.Search({@Contains(From; "abc")}, Nothing, 0)
	t2 = Timer
	Print "second search"
	Set coll = db.Search({@Contains(From; "abc")}, ndt, 0)
	t3 = Timer
	Print "third search"
	Set coll = db.FTSearch({ [from] = "*abc*" }, 0)
	t4 = Timer
	Print "first search: " & (t2-t1) & ", second search: " & (t3-t2) & ", third search: " & (t4-t3)
End Sub

Subject: understood but …

I am not looking to change anything that IBM provided as their solution. To answer your questions: I tested running from the server; there is no date/time argument used in the db.Search; and we do NOT full-text index our mail files.

As mentioned earlier, I am just looking for comments on the solution as provided to us. Running an agent daily in every mail file on every server just seems unrealistic and unnecessary, since there will be little restoring of archived mail. Personally, I think the archiving process of the ICC module should handle the deletion of the restored mail, but apparently that would be much more difficult for IBM to provide than the quick agent they did provide, which to me is a quick-and-dirty solution rather than doing it the correct way.

Subject: on the face of it…

…it does seem to make sense to do all the processing in one place, since you're accessing the mail file anyway for archiving purposes. However, I don't like to second-guess a design that I know only through a couple of paragraphs of narrative. And, being familiar with the product development process, I can totally understand not wanting to change a working, tested product to satisfy the requirements of a single customer, particularly when the change involves deleting documents from people's mail files. If you get it wrong, you could be deleting documents you didn't want to delete, from the mail files of customers who didn't want this feature to start with.

I grant you, it seems like long odds against any complications, but the cost of a failure in this area is so high that we'd want really, really long odds before risking putting it into a product. That means thorough code reviews and testing, which is expensive.

Also, the timing requirements for when to archive might be different enough from those for deleting restored documents that it would be impractical to merge the code. Or merging them might result in code that exceeds some time limit, causing it to miss a performance goal.

I do think it makes more sense to have a single process on each server rather than a separate agent in each mail file, if only because it makes it harder for the users to mess things up, but partly for performance as well. And there seems to be no reason not to limit the Search call to recent documents, for greater speed.
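A single server-side agent could, for example, walk the mail directory with NotesDbDirectory and run the same date-limited Search in each file. The sketch below is untested and makes assumptions not in the thread: that mail files live under a "mail\" subdirectory, and that a 90-hour cutoff is an acceptable margin; the formula is the one quoted in the original post, and error handling is minimal.

```lotusscript
Sub Initialize
	' Rough sketch (untested): one agent per server instead of one per mail
	' file. Iterates every database on the current server, filters to mail
	' files, and runs the same date-limited Search and RemoveAll in each.
	Dim s As New NotesSession
	Dim dir As New NotesDbDirectory("")   ' "" = current server
	Dim db As NotesDatabase
	Dim ndc As NotesDocumentCollection
	Dim cutoff As New NotesDateTime("")
	Dim formula As String

	formula = |(IBMAfuMessageState = "RSEARCH" & IBMAfuRestoreDate < @Adjust(@Now; 0; 0; 0; -48; 0; 0; [InGMT]))|
	cutoff.SetNow
	cutoff.AdjustHour -90   ' safe margin over 48 hours

	Set db = dir.GetFirstDatabase(1247)   ' 1247 = DATABASE (lsconst.lss)
	Do Until db Is Nothing
		' Assumption: mail files live under the mail\ subdirectory
		If Instr(Lcase$(db.FilePath), "mail\") = 1 Then
			If Not db.IsOpen Then Call db.Open("", "")
			If db.IsOpen Then
				Set ndc = db.Search(formula, cutoff, 0)
				If ndc.Count > 0 Then Call ndc.RemoveAll(True)
			End If
		End If
		Set db = dir.GetNextDatabase
	Loop
End Sub
```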