Subject: RE: Insufficient memory - index pool is full
Hi Guys,
We have been having this problem for some time now. We are running 6.5.2 on a Windows 2000 server with SP4. What is happening is that you are blowing the NIF buffer pool. If you check the server stats and use Windows performance monitoring to track the NIF buffer pool, you will find that it starts high (our normal servers use kilobytes, while our problem servers start in megabytes) and then gradually expands, followed by a sudden jump to 128 MB, at which point it maxes out and cannot recover.
To manage this issue in the meantime, I would recommend restarting the server in the evening if the NIF pool has grown beyond 90 MB. However, we have not always been successful with this.
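The evening check above can be sketched as a small script. This is only an illustration of the threshold logic: the stat name you would grep out of "show stat" output, and the assumption that it reports bytes, both need to be verified against your own server before using anything like this.

```shell
#!/bin/sh
# Decide whether tonight's restart is needed, based on the NIF pool
# size captured earlier (e.g. from "show stat" at the Domino console).
# ASSUMPTION: the captured value is in bytes -- check your server's
# actual stat output before relying on this.

THRESHOLD_MB=90

check_nif_pool() {
    # $1 = current NIF buffer pool size in bytes
    pool_mb=$(( $1 / 1024 / 1024 ))
    if [ "$pool_mb" -gt "$THRESHOLD_MB" ]; then
        echo "RESTART"    # pool is past the 90 MB danger mark
    else
        echo "OK"         # leave the server up tonight
    fi
}
```

For example, a pool of 125829120 bytes (120 MB) would trip the restart, while 52428800 bytes (50 MB) would not.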
Aggressive monitoring of the problem shows that the NIF pool leaps suddenly when large groups of users log onto the system for the first time in the morning.
We have tried many different things. Initially we thought it was linked to the Indexer queue issues, but after we worked around that problem we found we still had the NIF problem. We tried raising NSF_BUFFER_POOL_SIZE_MB, but that didn't help. It seems we are hitting a hard limit on the NIF buffer pool (or on the number of entries it can hold).
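For anyone wanting to try the same thing: the setting lives in the server's notes.ini. The value below is purely illustrative (and, as noted, raising it did not help us):

```
NSF_BUFFER_POOL_SIZE_MB=512
```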
One thing that has helped on some of the servers is to take the server offline and run a full compact -d to discard the view indexes on all databases, then an updall -r to rebuild them. I prefer an updall -r -c on the directory, but that's just my preference.
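The offline pass above can be wrapped in a small script. The Windows command-line task names (ncompact, nupdall) are assumptions here; verify them against your Domino program directory before use. Pass "1" for a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Offline NIF maintenance pass: discard all built view indexes, then
# rebuild the used views. The Domino server must be shut down first.
nif_rebuild() {
    dry_run=$1
    for cmd in "ncompact -d" "nupdall -r"; do
        if [ "$dry_run" = "1" ]; then
            echo "$cmd"     # dry run: show what would be executed
        else
            $cmd            # real run: requires the server to be offline
        fi
    done
}
```

A dry run (`nif_rebuild 1`) simply prints the two commands in order, which is a handy sanity check before letting it loose on a production box.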
This looks like a memory leak in the indexer. We don't know the cause, and we are working with IBM support, sending them NSD output. It is proving hard to resolve, though, because we can't leave the servers out of use all day just to see what happens when the server finally crashes. One of the main problems we see is that new users can't log in once the error occurs.
We have also seen various errors, such as corrupt databases that won't compact unless you do a compact -c -i followed by another compact -d; however, nothing works for long on the problem servers.
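The per-database recovery sequence above looks like this when scripted. Again the Windows task name ncompact is an assumption; the flags are -c for a copy-style compact and -i to ignore errors and continue. Pass "1" as the second argument for a dry run.

```shell
#!/bin/sh
# Recovery sketch for a database that refuses a normal compact:
# first a copy-style compact ignoring errors, then a discard of the
# rebuilt view indexes.
recover_db() {
    db=$1
    dry_run=$2
    for flags in "-c -i" "-d"; do
        if [ "$dry_run" = "1" ]; then
            echo "ncompact $db $flags"   # dry run: show the command
        else
            ncompact "$db" $flags        # real run, one database at a time
        fi
    done
}
```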
It is even more frustrating because we saw it from day 1 on one server, and only after 10 months on the first user server we migrated. There is no consistent pattern across server performance, the databases hosted, or the number of users. We are seeing a gradual rise in this condition as our users migrate to 6.5.2. Servers with a 50-50 split of Notes 5 and Notes 6 users, operating that way over a period of time, tend to develop this issue. Servers with purely version 5 or purely version 6 usage don't seem to have it.
We moved one server to Windows 2003, changed the OS RAID and did a full re-install of the Domino code. This improved the server: it went from failing every morning (after a restart the night before) to failing once a week. Not much, but better.
I would recommend that you do a full compact -d before giving up on 6.5.4, though. You may be carrying corrupt indexes over in your databases that 6.5.4 can't deal with.
Other than that we just need to wait and see if they can find the leak.