September 2009 Archives

In software development, we can sometimes have maddening debates about whether something is a feature or a bug. This reminds me of an old Phish song: Windora Bug.

"Is that a wind? Or a bug? It's a Windora bug." In other words, it's both. While troubleshooting your system, you might want to listen to the mp3.

In WCI 10gR3, we find the collision of two reasonable features. I think together they make a bug. Or at least, a badly designed feature. So let's start with the old feature:

Sometimes agents outside the portal need to authenticate in. Users count as agents, and so do remote portlets. To allow the agent to log in without providing a password each time, the portal can send a login token that the agent can use for future portal connections. Two old examples of this are [1] when a person uses the "Remember my Password" feature of the portal login screen (usually valid for many days) and [2] when a remote portlet web service sends a login token to the remote service (usually valid for five minutes). The login token held on the remote tier by the agent can be decrypted by the server using its key. This works fine in both the old use cases I provided because the remote tier is handed this value by the portal server.

For whatever reason, you may decide every once in a while that there is a security issue related to saved passwords. The portal had a great feature in the old days to let you update the login token's root key and thereby invalidate these old login tokens, forcing users to reauthenticate. The tool for the reset is in the administrative portal under the Portal Settings utility, and it looks something like this:

[Image: update-login-token-key.jpg]

When you click that "Update" button, it connects to the portal database and generates a new login token root key, stored in PTSERVERCONFIG with settingid 65.

The trouble comes in with the new feature. In 10gR3, the portal introduces new applications that encrypt passwords based on the login token root key, but this is done at configuration time in the remote application's Configuration Manager. The problem is that those applications are built apparently assuming that the login token root key will never change. The configuration manager requires that you provide the login token root key to it directly. Applications that do this include but are not limited to the Common Notification Service and Analytics. For example:

[Image: update-login-token-analytics.jpg]

The upshot of all this is that if you choose to click that button in the Portal Settings utility, then you get a new login token root key that no longer matches the one relied on by your remote applications.

If this part of the portal were reconceived, then perhaps the database would have one login token root seed for agents that hold a transient token, such as the tokens handed to users and to remote services through web service calls so that the agent can come back. Those keys basically say, "you've been here before, and you can come back." Then the database might have a second root seed for applications that need permanent access to the portal. In that case, the update feature would be fine, and it would only apply to the key for transient agents.
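To make the two-seed idea concrete, here is a minimal sketch of how it might behave. Everything in it is hypothetical: the class and method names are mine, and it signs tokens with HMAC as a stand-in, whereas the real portal encrypts tokens in a way I am not reproducing here. The only point is that updating one key leaves tokens made with the other key valid.

```python
import hashlib
import hmac
import secrets


class TokenKeys:
    """Hypothetical model: one root key per class of agent."""

    def __init__(self):
        # Assumption: two independent root keys instead of the single
        # login token root key the portal actually stores.
        self.transient_key = secrets.token_bytes(32)
        self.application_key = secrets.token_bytes(32)

    @staticmethod
    def _sign(key, agent):
        # HMAC stands in for whatever the portal really does with the key.
        return hmac.new(key, agent.encode("utf-8"), hashlib.sha256).hexdigest()

    def issue_transient(self, agent):
        return self._sign(self.transient_key, agent)

    def issue_application(self, agent):
        return self._sign(self.application_key, agent)

    def verify_transient(self, agent, token):
        return hmac.compare_digest(token, self._sign(self.transient_key, agent))

    def verify_application(self, agent, token):
        return hmac.compare_digest(token, self._sign(self.application_key, agent))

    def update_transient_key(self):
        # The "Update" button in this model: only transient tokens break.
        self.transient_key = secrets.token_bytes(32)


def demo():
    keys = TokenKeys()
    user_token = keys.issue_transient("some-user")
    app_token = keys.issue_application("analytics")
    keys.update_transient_key()
    return (
        keys.verify_transient("some-user", user_token),
        keys.verify_application("analytics", app_token),
    )
```

In this model, calling demo() after the update shows the user's saved token rejected while the application's token still verifies, which is the behavior the current single-key design cannot offer.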

Oh well. We have to live with it. So to avoid administrators accidentally breaking remote applications, I suggest you update the portal UI to explain the full effect of this particular feature (if you don't want to go through the headache of an involved UI modification to entirely remove it). I did this and now have the following:

[Image: update-login-token-key-new.jpg]

I got there by modifying this file on the admin servers:
d:\bea\alui\ptportal\10.3.0\i18n\en\ptmsgs_portaladminmsgs.xml
Within it I changed strings 2134, 2135, 2136, and 2964. My file has no other modifications in it from the vanilla 10.3.0 version. You can download it here.

Enjoy.


I've been working with the same technology stack for an amazingly long nine years. This has given me much opportunity to work with the same types of issues over and over, and in doing so, I've refined my approach quite a bit. Thus, here's a post that is essentially an improvement on a two-year-old post, How to Revive a Failed Search Node. I hope this post will offer both a better description of the problem and a better solution to it.

The WebCenter Interaction search product has two features that can interfere with each other. First, on the search cluster, you can schedule checkpoints to essentially wrap up and archive the search index to give you the ability to later restore it. Second, on the search nodes, at startup the node's index looks to the directories on the search cluster to synchronize in a copy of the latest index.

Customers running both checkpoints and multiple nodes periodically encounter trouble because the checkpoint process removes old search cluster request directories that the nodes want to access. Suppose one of your search nodes goes down while the other node keeps working and checkpoints continue to run on a daily schedule. By the time you realize a few days later that the node has failed, it won't start: it fails when it tries to access the numbered directory that existed the last time it ran properly. In such a case, the errors in your %WCI_HOME%\ptsearchserver\[version]\[node]\logs may look like this:

Cannot find queue segment for last committed index request: \\servername\SearchCluster\requests\1_1555_256\indexQueue.segment

Indeed, if you look at the path that was shown in the error, you'll find that the numbered folder no longer exists. Perhaps the latest folder will be SearchCluster\requests\1_1574_256.
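If you want to confirm this diagnosis from a log line without browsing the share by hand, something like the following sketch might help. The function names are my own, and the idea that the "latest" folder is the one with the highest numbers is only an assumption drawn from the 1_1555_256 versus 1_1574_256 example above. You would pass in the folder names you actually see under SearchCluster\requests.

```python
import ntpath
import re

# Matches the log message quoted above and captures the segment file path.
_ERROR = re.compile(
    r"Cannot find queue segment for last committed index request:\s*(\S.*)$"
)
# Assumption: request folders are named like 1_1574_256.
_FOLDER = re.compile(r"^\d+_\d+_\d+$")


def expected_request_dir(log_line):
    """Return the request directory named in the error, or None."""
    match = _ERROR.search(log_line.strip())
    if match is None:
        return None
    # Drop the trailing indexQueue.segment file name. ntpath understands
    # backslash-separated and UNC paths on any platform.
    return ntpath.dirname(match.group(1))


def _numeric_key(folder_name):
    return tuple(int(part) for part in folder_name.split("_"))


def diagnose(log_line, existing_folders):
    """Compare the folder the node expects with the folders that exist."""
    wanted = expected_request_dir(log_line)
    if wanted is None:
        return None
    name = ntpath.basename(wanted)
    numbered = [f for f in existing_folders if _FOLDER.match(f)]
    latest = max(numbered, key=_numeric_key) if numbered else None
    return {
        "expected": name,
        "missing": name not in existing_folders,
        "latest": latest,
    }
```

A result with "missing" set to True means the node is fixated on a folder the checkpoint process has already removed, which is the situation the reset below addresses.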

The fix is to reset the search node so that it no longer expects that specific folder upon which it had been fixated. I wrote about a way to do this with several manual steps in my prior post. This time, however, and after encountering the problem perhaps tens of times, I'm sharing a batch file that I place on Windows search servers to automate the reset process (and this works on both ALUI 6.1 and WCI 10gR3):

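set searchservice1=myserver0201
set search_home=c:\oracle\wci\ptsearchserver\10.3.0
@rem
@rem configure the above two variables
@rem
@rem stop the node before touching its index
net stop %searchservice1%
@rem discard the stale local index and create an empty one
rmdir /s /q %search_home%\%searchservice1%\index\
mkdir %search_home%\%searchservice1%\index\1
echo 1 > %search_home%\%searchservice1%\index\ready
@rem /d also switches drives, in case search_home is not on c:
cd /d %search_home%\%searchservice1%\index\1
..\..\..\bin\native\emptyarchive lexicon archive
..\..\..\bin\native\emptyarchive spell spell
@rem start the node again so it can synchronize from the search cluster
net start %searchservice1%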

[Image: search-panel.jpg]

To find the name of the search service that goes in the first parameter, open your Windows services panel, find your search node, right-click into its properties page, and find the "service name" value. This is not the same as the display name. The service name by default is [machine][node] as far as I can tell. So on my box (bbenac02) as the first node, my service name is bbenac0201. This is different from the display name, which defaults to something like "BEA ALI Search bbenac0201."

Enjoy!


[Image: phantom.jpg]

I'll stick with the Star Wars theme from my past post because today's issue is quite similar (even though I haven't bothered to watch all the movies in the series). Do you occasionally encounter errors that are tied to phantom users?

My customer tried propagating security into a WCI admin folder using the async job option today, but they got an error similar to this in the job log:

Sep 1, 2009 9:56:39 AM- *** Job Operation #1 of 1 with ClassID 20 and ObjectID 334 cannot be run, probably because the operation has been deleted.

In the error log on the automation server, we found something like this:

Error creating operation 20:302
com.plumtree.server.marshalers.PTException: -2147204095 - InternalSession.Connect(): UserID (205) not found.

Indeed, user 205 didn't exist. Where did the portal get the idea it should look for the user? It turns out that at the time the particular admin folder was created (folder ID 302), it was created by user 205. Later, that user was deleted from the system, but just as in my last post, sometimes when an object is deleted, references to that object are left in certain tables of the database. In this case, the deletion of a user does not trigger a removal of that user's ownership of certain objects like admin folders. I ran this query to look for all instances of the problem:

select folderid from ptadminfolders where ownerid not in (select objectid from ptusers)

The fix here is to set the ownership of that particular folder (and all others) to the administrative user:

update ptadminfolders set ownerid=1 where ownerid not in (select objectid from ptusers)

While we're thinking about this class of problem, we can look for other cases where a phantom user remains, since in some of these cases it will become a menace. The following is a list of queries that found phantom users at my current customer:

select folderid from ptadminfolders where ownerid not in (select objectid from ptusers)
select objectid from ptcards where ownerid not in (select objectid from ptusers)
select objectid from ptcrawlers where ownerid not in (select objectid from ptusers)
select objectid from ptcommunities where ownerid not in (select objectid from ptusers)
select objectid from ptdatasources where ownerid not in (select objectid from ptusers)
select objectid from ptdocumenttypes where ownerid not in (select objectid from ptusers)
select objectid from ptfilters where ownerid not in (select objectid from ptusers)
select objectid from ptgadgetbundles where ownerid not in (select objectid from ptusers)
select objectid from ptgadgets where ownerid not in (select objectid from ptusers)
select objectid from ptgcservers where ownerid not in (select objectid from ptusers)
select objectid from ptjobs where ownerid not in (select objectid from ptusers)
select objectid from ptwebservices where ownerid not in (select objectid from ptusers)

I suggest in each of the above cases that you replace the phantom user with the administrator user. This will cause no harm, and in some cases it allows you to avoid errors:

update ptadminfolders set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptcards set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptcrawlers set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptcommunities set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptdatasources set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptdocumenttypes set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptfilters set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptgadgetbundles set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptgadgets set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptgcservers set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptjobs set ownerid=1 where ownerid not in (select objectid from ptusers)
update ptwebservices set ownerid=1 where ownerid not in (select objectid from ptusers)


Again, like the problem I last wrote about with the phantom footer ID, this one with users is a bug. The fix would be to add to the deleteUser() method a command to clean up each of these tables. Since no fix is provided, you might set up a nightly job on your database to run these cleanup queries.
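As a rough idea of what such a nightly job could look like, here is a sketch that runs the same update against each table in the list above. The function name is mine, and it assumes you hand it a Python DB-API connection to the portal database (the driver you would need depends on your database and is not shown). The demo uses an in-memory SQLite database with made-up rows purely to show the behavior. It also assumes, as the queries above do, that user 1 is the administrator.

```python
import sqlite3

# Tables taken from the queries in this post; each has an ownerid column.
OWNED_TABLES = (
    "ptadminfolders", "ptcards", "ptcrawlers", "ptcommunities",
    "ptdatasources", "ptdocumenttypes", "ptfilters", "ptgadgetbundles",
    "ptgadgets", "ptgcservers", "ptjobs", "ptwebservices",
)


def reassign_phantom_owners(conn, tables=OWNED_TABLES, admin_id=1):
    """Run the cleanup update per table; return {table: rows_changed}."""
    changed = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed list above, never from user
        # input, so building the statement with format() is acceptable.
        cur.execute(
            "update {0} set ownerid = {1:d} where ownerid not in "
            "(select objectid from ptusers)".format(table, admin_id)
        )
        changed[table] = cur.rowcount
    conn.commit()
    return changed


def demo():
    # Stand-in database with invented rows: user 205 no longer exists.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table ptusers (objectid integer)")
    conn.execute(
        "create table ptadminfolders (folderid integer, ownerid integer)"
    )
    conn.executemany("insert into ptusers values (?)", [(1,), (200,)])
    conn.executemany(
        "insert into ptadminfolders values (?, ?)", [(302, 205), (303, 200)]
    )
    changed = reassign_phantom_owners(conn, tables=["ptadminfolders"])
    rows = conn.execute(
        "select folderid, ownerid from ptadminfolders order by folderid"
    ).fetchall()
    return changed, rows
```

In the demo, folder 302 moves from the deleted user 205 to user 1, and folder 303 keeps its valid owner. Try this against a test copy of your database first.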

Enjoy!

PS: you might like this sed example that converts the list of select statements from this post (saved as select.txt) into the list of update statements:

sed -r s/.*("objectid|folderid")" from "(.*)("where.*")/"update "\2"set ownerid\=1 "\3/g select.txt
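If you don't have sed handy, the same conversion can be sketched in Python. This is my own port of the expression above, not something tested against every sed build, and the function name is made up for the example. It uses the same three capture groups and keeps the second and third.

```python
import re

# Same pattern as the sed expression: group 2 is the table name plus the
# space that follows it, group 3 is the where clause.
_SELECT = re.compile(r".*(objectid|folderid) from (.*)(where.*)")


def select_to_update(line, admin_id=1):
    """Turn one phantom-owner select statement into the matching update."""
    replacement = r"update \g<2>set ownerid={0:d} \g<3>".format(admin_id)
    return _SELECT.sub(replacement, line)


def convert(lines):
    """Convert every line, for example the lines read from select.txt."""
    return [select_to_update(line.rstrip("\n")) for line in lines]
```

For example, passing the first select statement from the list through select_to_update gives back the first update statement from the list.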