How to Better Revive a Failed Search Node (and Why)

I've been working with the same technology stack for an amazingly long nine years. That has given me plenty of opportunity to see the same types of issues over and over, and in doing so, I've refined my approach quite a bit. So here's a post that is essentially an improvement on a two-year-old post, How to Revive a Failed Search Node. I hope this one offers both a better description of the problem and a better solution to it.

The WebCenter Interaction search product has two features that can interfere with each other. First, on the search cluster, you can schedule checkpoints, which essentially wrap up and archive the search index so that you can restore it later. Second, at startup each search node looks to the request directories on the search cluster to synchronize in a copy of the latest index.

Customers running both checkpoints and multiple nodes periodically run into trouble because the checkpoint process removes old search cluster request directories that the nodes still want to access. Say one of your search nodes goes down, the other node keeps working, and checkpoints continue to run on a daily schedule. By the time you realize a node has failed a few days later, it won't start: it fails when it tries to access the numbered directory that existed the last time it ran properly. In that case, the errors in %WCI_HOME%\ptsearchserver\[version]\[node]\logs may look like this:

Cannot find queue segment for last committed index request: \\servername\SearchCluster\requests\1_1555_256\indexQueue.segment

Indeed, if you look at the path that was shown in the error, you'll find that the numbered folder no longer exists. Perhaps the latest folder will be SearchCluster\requests\1_1574_256.
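Before resetting anything, a quick sanity check is to compare what the node expects against what the cluster actually has (the server name and folder below come from the example error above; use the path from your own error message):

@rem does the folder named in the error still exist?
if exist \\servername\SearchCluster\requests\1_1555_256 (echo still there) else (echo gone)
@rem list the request folders the cluster currently has, oldest first
dir /b /o:d \\servername\SearchCluster\requests

If the folder from the error is gone and the listing only shows newer numbered folders, you're looking at exactly this problem.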

The fix is to reset the search node so that it no longer expects the specific folder it had been fixated on. I wrote about a way to do this with several manual steps in my prior post. This time, after encountering the problem perhaps dozens of times, I'm sharing a batch file that I place on Windows search servers to automate the reset process (this works on both ALUI 6.1 and WCI 10gR3):

@rem configure these two variables for your environment
set searchservice1=myserver0201
set search_home=c:\oracle\wci\ptsearchserver\10.3.0

@rem stop the failed node
net stop %searchservice1%

@rem throw away the node's stale local index and recreate an empty one
c:
rmdir /s /q %search_home%\%searchservice1%\index\
mkdir %search_home%\%searchservice1%\index\1
echo 1 > %search_home%\%searchservice1%\index\ready

@rem recreate empty lexicon and spell archives so the node can start cleanly
cd %search_home%\%searchservice1%\index\1
..\..\..\bin\native\emptyarchive lexicon archive
..\..\..\bin\native\emptyarchive spell spell

@rem start the node; it will resync its content from the cluster
net start %searchservice1%

To find the name of the search service that goes in the searchservice1 variable, open your Windows services panel, find your search node, right-click into its properties page, and find the "service name" value. This is not the same as the display name. The service name by default is [machine][node] as far as I can tell. So on my box (bbenac02) as the first node, my service name is bbenac0201. This is different from the display name, which defaults to something like "BEA ALI Search bbenac0201."
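If you'd rather not click through the services panel, the sc utility can translate a display name into a service name (the display name below is the default from my box; substitute your own):

sc getkeyname "BEA ALI Search bbenac0201"

The name it returns (bbenac0201 in my case) is what goes into the searchservice1 variable at the top of the batch file.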

Enjoy!

7 Comments

Thanks for the batch file. You always seem to post something right before I need it.

After starting the search service, do we need to do anything to rebuild the indexes?

Hi Tim:

I'm glad this post was relevant for you.

To answer your question, nothing needs to be done at the end of the process other than starting the service. At startup, the node with its empty index looks at the cluster to see whether it is in sync. It sees that it needs to update its content, so it spins, sometimes for several minutes, while it copies the content in from the cluster.
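If you want to watch the progress, a crude but serviceable check is to watch the node's local index directory grow (this reuses the variables from the batch file above; plug in your own paths if you run it on its own):

@rem crude progress check: total size of the node's local index
dir /s %search_home%\%searchservice1%\index

The "Total Files Listed" figure at the bottom should keep climbing until the node is back in sync with the cluster.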

I hope that helps.

Bill

Hi Bill, I need some advice. I am running ALUI 6.1 Search, and my cluster files are growing rather large, about 65GB. How would I purge old files, or is there a way to reduce the space taken up?

The file path is: D:\Program Files\plumtree\ptsearchserver\6.1\cluster\requests

This is where the buildup is: the oldest folder, 0_1_256, is dated 6/10/2007, and the most recent is from today, of course. Do I need to keep all of these folders?

Thanks in advance Bill!

Ben

Hi Ben:

Are you running regular checkpoints on your search cluster? This can be scheduled from within the search cluster manager utility. Try running these nightly.

You will need a lot of disk space for the first checkpoint if you've never run it before, and it sounds like you haven't.

Are your search nodes similarly large? If not, then you know the cruft is just in the cluster, and it should clear out after checkpointing (which you should run at least every weekend if not nightly). If the nodes are also huge, then you could look at using multiple search partitions, since nodes aren't supposed to be bigger than about 10GB.

I hope that helps.

Bill

Hi Bill, the node itself is about 900MB.

No, I have not run checkpoints before. The problem is that only 1GB of space is free on the D drive.

Can I delete the old folders from 2007-2008 from the requests directory and then run a checkpoint?

or

Purge the cluster with cadmin command?

If I can, I will schedule the checkpoints after to run nightly as suggested.

Thanks again Bill!

Hi Ben:

I've not tested deleting old request folders. There's only so much support you can rely on when it comes through blog comments, but:

I /think/ those request folders are only important if you have more than one node in your cluster, and even when you have multiple nodes, I think those folders only need to be read once by the secondary nodes. So I /think/ you would be safe deleting those old request folders, but I'm not sure.
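If you do try it, I'd move the old folders somewhere else first rather than deleting them outright, so you can put them back if a node complains. Something like this (untested; the source path comes from your comment above, and E:\requests-archive is just an example destination on a drive with free space) would relocate one dated folder at a time:

@rem untested sketch: copy an old request folder to another drive, then remove it from D:
xcopy "D:\Program Files\plumtree\ptsearchserver\6.1\cluster\requests\0_1_256" "E:\requests-archive\0_1_256\" /E /I /Q
rmdir /s /q "D:\Program Files\plumtree\ptsearchserver\6.1\cluster\requests\0_1_256"

Start with the oldest folders from 2007 and confirm the cluster and your node still behave before going further.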

Good luck!

Bill

Also:

DO NOT purge the cluster! That would get rid of your entire search index, and your node(s) would remove all their content.

Thanks,

Bill
