When PeopleSoft Cache goes Bad

In general cache is a good thing, but sometimes things go wrong. I usually try to avoid clearing cache unless I have a real reason. In my experience, clearing cache without a reason only adds to the end users’ perception that PeopleSoft is a slow, painful application to work with, and we know PeopleSoft gets plenty of opportunities to prove that daily. In large-scale environments without a cache-building process, clearing cache across multiple domains with a decent number of processes each can put a significant damper on some users’ mornings. If, say, your Expense Approval team has to re-cache 30 – 40 processes, that might ruin their morning. This post isn’t about what makes things go wrong, but about how to identify a bad cache and deal with it with the least possible impact. Let me show you a recent case I ran into. Along the way I’ll go over some tmadmin basics, so if you’ve been doing this a while you’ll probably already know the stop/start (shutdown and boot) and psr commands, but maybe you’ll learn something new.

Users started reporting intermittent errors: sometimes things worked, sometimes they didn’t. That was hint #1. In this case the errors were around the Query Manager pages, and three different error messages were being reported: “Function CheckSec not found in PeopleCode program QRYFUNCTIONS.QRYQUERYFUNCS.FieldFormula.”, “Page load failed for QUERY_MANAGER/GBL.” and “Data Integrity Error.” Errors like these usually show up in the APPSRV log, and look something like this…


PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Function CheckSec not found in PeopleCode program QRYFUNCTIONS.QRYQUERYFUNCS.FieldFormula. (2,301)
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) PRMGet failed for component QUERY_MANAGER market GBL
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Data Integrity Error (124,85)
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Function CheckSec not found in PeopleCode program QRYFUNCTIONS.QRYQUERYFUNCS.FieldFormula. (2,301)
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) PRMGet failed for component QUERY_MANAGER market GBL
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) An error has occurred which prevents this transaction continuing

If you can find these in your log file, the next step is to determine how many processes they are coming from. Use grep (or find, if you’re on a Windows server) to narrow the error messages down to a single one. For instance, I ran
grep 'PRMGet failed for component QUERY_MANAGER' APPSRV_0117.LOG
You could append something like | awk -F" " '{print $1}' to keep only the first field of each matching line (PSAPPSRV.7546 in the example above). Either way, what we are after is the PSAPPSRV pid, which is the number right after “PSAPPSRV.”. In our example above it’s 7546.
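
If you’d rather not eyeball the pids, here is a rough sketch that counts matching lines per pid. It assumes every log line starts with PSAPPSRV.<pid> as in the excerpt above; errors_by_pid is just a name I made up for illustration, not a PeopleSoft tool.

```shell
# Hypothetical helper: count log lines matching a fixed string, per PSAPPSRV pid.
# Assumes each line begins with "PSAPPSRV.<pid> " as in the excerpt above.
# usage: errors_by_pid 'PRMGet failed for component QUERY_MANAGER' APPSRV_0117.LOG
errors_by_pid() {
    grep -F -- "$1" "$2" |
        awk '{
            # First field looks like PSAPPSRV.7546; keep what follows the dot.
            pid = substr($1, index($1, ".") + 1)
            count[pid]++
        }
        END { for (pid in count) print pid, count[pid] }' |
        sort -k2,2nr    # busiest pid first
}
```

Each output line is a pid followed by its match count. If only one pid comes back, you are looking at a single misbehaving process.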

If you look at the entries that come back and find that all the errors are coming from only one pid, that is hint #2. Remember that the users were saying some of the requests worked, so it’s possible we’ve identified a single process that has either gone haywire or whose cache has gone bad (corrupt). Let’s shut this particular process down and see whether our users report that the problem stops. There are several ways to identify which app server process this is. You could use ps or Task Manager, but since we want to shut it down properly let’s just use our psadmin tools.

Run psadmin, pick the domain that has problems, and use option 5
TUXEDO command line (tmadmin)

At the tmadmin prompt “>” enter psr -v -g APPSRV (-v is verbose output, -g limits output to the group name specified)

> psr -v -g APPSRV

The output is paginated, so just page through until you find the process whose process ID matches the one you’re looking for.


Group ID: APPSRV, Server ID: 2
Machine ID: psapp1.localdomain
Process ID: 7546, Request Qaddr: 229380, Reply Qaddr: 129663006

Here it is: PID 7546 is server ID 2. If we run plain psr, the server ID is the ID column.

> psr
Prog Name      Queue Name  Grp Name      ID RqDone Load Done Current Service
---------      ----------  --------      -- ------ --------- ---------------
PSAPPSRV       APPQ        APPSRV         2   1022     51100 (  IDLE )
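
If you have saved the verbose psr output to a file, the pid-to-server-ID lookup can be scripted instead of paged through. This is only a sketch: it assumes the “Server ID:” line comes before the “Process ID:” line for each server, as in the excerpt above, and server_id_for_pid is a hypothetical name.

```shell
# Hypothetical helper: print the Tuxedo server ID that owns a given pid.
# Reads saved "psr -v" text on stdin; assumes "Server ID:" precedes
# "Process ID:" within each server's block, as in the excerpt above.
# usage: server_id_for_pid 7546 < psr_verbose.txt
server_id_for_pid() {
    awk -v pid="$1" '
        /Server ID:/ {
            sid = $0
            sub(/.*Server ID: */, "", sid)   # drop everything before the number
            sub(/[^0-9].*/, "", sid)         # drop anything after the number
        }
        /Process ID:/ {
            p = $0
            sub(/.*Process ID: */, "", p)
            sub(/[^0-9].*/, "", p)
            if (p == pid) print sid          # most recent Server ID seen
        }
    '
}
```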

So now let’s shut down our possible troublemaker.

> shutdown -g APPSRV -i 2

There, we shut down just ONE of multiple PSAPPSRV processes. Now we can have users test. If the problem goes away, we really did find our culprit, and it’s safe to clear its cache. Are we sure it’s safe? Well, let’s take a look. Again, I’m going to describe it as performed on a Linux server, but you could use Process Explorer from Sysinternals to do the same verification step on Windows.

Let’s make sure no one is using our cache files, which might cause us a headache if we were to wipe them prematurely. As your PeopleSoft application user on the Linux box, run lsof and grep for CACHE

$ /usr/sbin/lsof |grep CACHE

...
.
PSAPPSRV 15803 psoft 178u REG 252,2 0 263319 /opt/apps/psoft/domains/appserv/PA91/CACHE/PSAPPSRV_1/SDEFM.DAT
.
...

Look at that, as an example I can see that PSAPPSRV_1/SDEFM.DAT is open by process PSAPPSRV pid 15803 which is Server ID 1. That builds the correlation that the CACHE/PSAPPSRV_1 directory belongs to the PSAPPSRV process with server id 1. Well that makes sense.
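
To see that pid-to-directory correlation for every process at once, something like the following sketch could summarize the lsof output. It assumes the pid is the second column and the file path is the last column with no spaces in it, as in the line above; cache_owners is a made-up name.

```shell
# Hypothetical helper: list "pid cache_subdirectory" pairs from lsof output.
# Assumes column 2 is the pid and the last column is a path without spaces.
# usage: /usr/sbin/lsof | cache_owners
cache_owners() {
    awk '$NF ~ /\/CACHE\// {
        n = split($NF, seg, "/")
        # The path segment right after CACHE is the per-process directory.
        for (i = 1; i < n; i++)
            if (seg[i] == "CACHE") { print $2, seg[i + 1]; break }
    }' | sort -u
}
```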

We shut down server ID 2, so let’s see if that directory has anything open.

$ /usr/sbin/lsof |grep PSAPPSRV_2

Crickets, just the prompt returned. Perfect, let’s remove all that cache and give it a fresh start. From inside the domain directory let’s run our rm command,

$ rm -rf CACHE/PSAPPSRV_2
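
If you do this often, the check-then-remove steps above could be wrapped into one guarded function so the rm never runs while something still has the directory open. This is a minimal sketch rather than a hardened tool: clear_cache_if_idle is a hypothetical name, and it takes the lsof output on stdin and the domain directory as its first argument.

```shell
# Hypothetical helper: remove one CACHE subdirectory only if lsof shows
# nothing open beneath it. Feed it lsof output on stdin.
# usage: /usr/sbin/lsof | clear_cache_if_idle /path/to/domain PSAPPSRV_2
clear_cache_if_idle() {
    domain_dir=$1
    cache_name=$2
    # Refuse to run without both arguments, so we never rm -rf a bare CACHE/.
    if [ -z "$domain_dir" ] || [ -z "$cache_name" ]; then
        echo "usage: clear_cache_if_idle DOMAIN_DIR CACHE_SUBDIR" >&2
        return 2
    fi
    # Any lsof line mentioning the directory means a process still has it open.
    if grep -qF -- "/CACHE/$cache_name/"; then
        echo "$cache_name still has open files, leaving it alone" >&2
        return 1
    fi
    rm -rf -- "$domain_dir/CACHE/$cache_name"
}
```

It returns non-zero and removes nothing when the directory is still in use.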

Now let’s start up the process again. Go back into psadmin and run tmadmin again.

> boot -g APPSRV -i 2

The process boots, recreates PSAPPSRV_2 in CACHE and is ready to service requests.

So there we have it. In my case, the problems went away and I inconvenienced a much smaller set of users than I would have if I had cleared cache across the board. I’ve seen this pop up a few times in the last 2 years, and each time this strategy has worked well. One time there was a problem with the process scheduler PSAE cache on a piece of SQL (the SQL statement was actually only partially returned). The job would fail when it ran on PSAE server ID 3, but the same principle applied. In that case it took me a little longer to determine that it was really a caching problem, but that’s what it ended up being. Happy troubleshooting.
