Bypassing a Load Balancer

Load balancers are fantastic devices, but sometimes we need to get around them.  I can’t count the times I have needed or wanted to test or diagnose each PIA instance individually, especially when fighting what at first appears to be an intermittent or random issue.  When we have multiple PeopleSoft PIA instances behind a load balancer or reverse proxy, we set the virtual addressing URL in the web profile to the URL of the load balancing device.  If we try to hit SERVER-A or SERVER-B directly, the URL is rewritten back to SERVER-LB.  The following are some tips that may help you establish a connection to a specific PIA instance rather than the load balancer.

8.54 PreRelease Notes Available

The PeopleSoft Technology Blog has announced the availability of the PreRelease notes for 8.54.

Some observations from skimming the document:

  • Oracle Linux 6, Win 2012, and Win 2012 R2 support added, Win 2008 (R1 not R2) dropped
  • Client OS Windows 8.1 added, Windows 7 (32 bit) dropped
  • WebLogic 12.1.2 added, 10.3.6 dropped
  • Oracle 12, MSSQL 2014 added, Oracle 10.2.0.5, 11.2.0.3, and MSSQL 2008 dropped
  • Current browsers added (Chrome, Firefox, IE 11, Safari), IE 8 and Firefox 17 dropped
  • Tuxedo 12.1.1.0 added, Tuxedo 11gR1 dropped
  • Still uses Java 7
  • SES is 11.2.2.2, SES 11.1.2.2 dropped
  • Excel 32-bit dropped
  • New Fluid User Interface: moves away from rigid page layouts and makes heavier use of CSS3, HTML5, and JavaScript.  Supposed to “scale gracefully between devices”.  Fluid page definitions are maintained within App Designer.  Adds Fluid Homepages, Tiles, Notification Framework, PeopleSoft Navigation Bar
  • Mobile Application Platform: Similar to the Fluid User Interface but utilizes RESTful web services
  • 64-bit Development environment!  App Designer, Data Mover, Change Assistant, the whole lot of them are now 64-bit.  That explains the dropped Win 7 (32-bit).
  • App Designer will have improved search functionality (reference, text), code auto-completion for PeopleCode, and new toolbar buttons to improve productivity
  • Enhancements to App Engine tracing: split files, naming convention, program section trace, combined output of PeopleCode and SQL into the AE trace file
  • Portable PS_HOME: Hard-coded paths and sym links within PS_HOME have been removed to further consolidate and allow a single PS_HOME to be shared across multiple environments
  • Two new metaSQL enhancements for Oracle added.  %SqlHint and %SelectDummyTable
  • You can also now use Oracle Global Temporary Tables, Materialized Views, and the new 12c container/pluggable databases, which allow multiple PS databases in the same instance.  Some may have used GTTs and Materialized Views in the past, but now App Designer will handle them.  App Designer can also now be used directly to partition tables and indexes on Oracle.
  • Domain caching changes allowing automatic monitoring and adjusting?  I’ll have to look into that one more.
  • A Push Notification Event Framework.  Maybe we can finally broadcast a message to all users via the system.
  • Several security enhancements, Oracle Secure Files for the report repository is one that jumped out at me.
  • There’s a lot of other stuff, just go read it!

Read the full post here, or find the notes directly on MOS here

Setup AWStats on IIS

AWStats is a great tool for parsing web server access logs of any kind.  If you are not familiar with it, I recommend checking out the Live Demo to see what kind of data it can provide.  I’ve been using it for a long time to provide stats on all sorts of different websites and applications (including PeopleSoft).  It’s just another great tool for any Admin’s toolbox.  Normally I run it on Linux, but recently I set up AWStats on IIS, which was actually pretty painless.  Here is what I did.

What you need:
IIS, Perl, AWStats

nVision going to Error

I ran into this problem with nVision a little while ago.  nVision processes were going to Error status on the process scheduler.  A quick glance at the logs made it look like the process was either crashing or not starting at all.  I logged into the Windows Process Scheduler server to take a look at Excel.  Logged in as the service account that runs our process scheduler, I looked at Task Manager and saw two copies of Excel running, even though no jobs had been run for quite a while.  I killed the processes, tried to start Excel by hand, and began to wait.  Several minutes later Excel popped up a window stating it was starting in safe mode, which I assumed was because I had killed the processes.  Once it started I closed it cleanly and tried again.  Again it took about four minutes to open.  This time, however, it opened with the recovered files pane showing what I would estimate was 1500+ recovered files.  I closed the pane without saving any of them, and removing them took some more time.  Eventually I was able to shut down Excel again.  The next time I opened it, it started right up in a second.  I then disabled the AutoRecover feature in the Save settings in Excel Options (for 2007).

I’m not sure what caused them to show up, or if they were there all along and just finally got to the point where Excel took too long to open. Like many, I inherited this environment, but now I have one more thing to add to my environment assessment checklist for the next time.

How I forward displays from Linux or other *nix systems to Windows

First off, I usually try to avoid it.  I’m old school and like my command line options, so when I’m installing PeopleSoft components I use the console options if available.  Sometimes it’s not that simple though, and when that happens I’ve got a pretty standard method of doing things.  So here’s how I forward a display from a Linux server to my Windows workstation.

PeopleSoft TEMP/TMP Directories

In PeopleTools 8.53 Oracle has changed how the TEMP and TMP environment variables are handled.  They are taking the responsibility of setting these out of the Admins’ hands.  In the past we set these as environment variables, perhaps in the shell profile or psconfig.sh.  When we configured the app server / process scheduler these settings would be inherited into the Tuxedo configuration.  With Windows 2008, some were impacted by its default of dynamic TEMP/TMP variables that are based on the session and deleted on logout.  On Windows, a previous co-worker of mine got me into the practice of modifying the psappsrv.ubx and psprcsrv.ubx files and setting these variables there rather than relying on what the user in Windows might have set or trying to adjust them with scripts.  This proved to be helpful in many ways.  Now Oracle has decided to do the same thing by default.

This may impact Admins who are used to having these variables set to something they specifically wanted.  Depending on your setup you may prefer to change these back.  A default psappsrv.ubx file has the following section:

# ————–

*PS_ENVFILE
TEMP={LOGDIR}{FS}tmp
TMP={LOGDIR}{FS}tmp
TM_BOOTTIMEOUT=120
TM_RESTARTSRVTIMEOUT=120

{LOGDIR} is $PS_CFG_HOME/appserv/<DOMAIN>/LOGS and {FS} is the OS specific path delimiter.  So this will use a temp directory in the LOGS directory of each domain such as /opt/apps/psoft/domains/appserv/HCM92/LOGS/tmp.
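If you want to double check where TEMP and TMP will land for a given domain before you boot it, the substitution described above is easy to mimic.  This is only a minimal sketch: the function name and its parameters are made up for illustration, and PeopleTools does the real expansion itself when the domain is configured.

```python
import os


def expand_ubx_value(value, ps_cfg_home, domain, fs=os.sep):
    """Expand the {LOGDIR} and {FS} placeholders the way the post describes them.

    Assumption: {LOGDIR} is <PS_CFG_HOME>/appserv/<DOMAIN>/LOGS and {FS} is the
    OS path delimiter. This is an illustration, not PeopleTools code.
    """
    logdir = fs.join([ps_cfg_home, "appserv", domain, "LOGS"])
    return value.replace("{LOGDIR}", logdir).replace("{FS}", fs)


# The TEMP setting from the default psappsrv.ubx, expanded for a Linux domain.
print(expand_ubx_value("{LOGDIR}{FS}tmp", "/opt/apps/psoft/domains", "HCM92", fs="/"))
```

Passing fs explicitly keeps the result the same no matter which OS you run the sketch on.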

I’m not sure how much of a fan I am of having TEMP in the logs directory.  I guess just knowing it’s moving is half the battle.  It’s easy to change back if you like.

See Oracle support document [ID 1486978.1] for the announcement.  I don’t recall seeing this in the release notes, but I might have missed it.  They also said it would be backported to 8.52.16, but you would need to recreate all your domains, not just reconfigure them, in order to get the change.  Oracle is adding these settings to the domain templates, so once you create a new domain from a template, TEMP and TMP will be set in the appropriate ubx file.

App Server to Database Reconnection issues in 8.51

I ran into a problem a while ago which brought a more severe problem to my attention.  It appears that from at least 8.51.02 (but probably back to 8.51.00) through 8.51.09 there are issues with application server processes properly recovering from a disconnection from the database.  I don’t have firsthand experience with that problem, but there is some info on Oracle’s support site about it.  If you are running in this PeopleTools range and experiencing odd crashes every once in a while, this may be worth investigating.  Using TraceSql=31 will produce ORA-3113/3114 errors in your logs.  In 8.51.08 a patch went in to fix it, but it broke something else, causing the problem I encountered.  Bug 11724645 has the details.

My particular problem was experienced on 8.51.09 and was limited to only the Integration Broker PUBSUB processes.  So apparently the PSAPPSRV code had been fixed by then as I never had a problem with those.   In this post I’ll discuss what I saw, some of the troubleshooting steps I used to isolate the problem, and some options I came up with to resolve it.

The Problem:

Integration Broker stops processing messages.   The processes don’t crash and look OK from a quick glance (psr and Process Explorer), but do nothing.  The environment was 8.51.09 all Windows 2008 on SQL Server.  The problem occurred everywhere, even in environments that restarted nightly.

Diagnosing the Problem:

I knew from day one something strange was occurring.  I had never needed to restart the PUBSUB processes this often before.  Almost daily some environment would need to be restarted, sometimes multiple environments, sometimes ones that had already been restarted.  Obviously it was off to the logs first.  There I found something interesting.  Here’s an example.  The tables might be different depending on the process (PSPUBDSP, PSSUBDSP, and PSBRKDSP), but the main message is always the same:  The SELECT permission was denied on the object

PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:25](3) File: E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cppSQL error. Stmt #: 67  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSAPMSGDSPSTAT’, database ‘HCMDEV’, schema ‘dbo’. (SQLSTATE 42000) 229
Failed SQL stmt:SELECT DSPSTATUS, IB_SLAVEMODE, DSPRESET, CLEANUP_DTTM FROM PSAPMSGDSPSTAT WHERE DISPATCHERNAME=:1 AND MACHINENAME=:2 AND APPSERVER_PATH=:3
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:25](1) GenMessageBox(200, 0, M): E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cpp: A SQL error occurred. Please consult your system log for details.
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](3) File: E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cppSQL error. Stmt #: 663  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSAPMSGSUBCON’, database ‘HCMDEV’, schema ‘dbo’. (SQLSTATE 42000) 229
Failed SQL stmt:SELECT IBTRANSACTIONID, IB_SEGMENTINDEX, QUEUENAME, IB_OPERATIONNAME, ACTIONNAME, SUBCONSTATUS, PROCESS_INSTANCE FROM PSAPMSGSUBCON WHERE SUBCONSTATUS IN (0,10)AND PROCESS_INSTANCE > 0
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](1) GenMessageBox(200, 0, M): E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cpp: A SQL error occurred. Please consult your system log for details.
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](3) File: E:\pt85109b-retail\peopletools\src\psmgr\mgrvers.cppSQL error. Stmt #: 881  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSVERSION’, database ‘HCMDEV’, schema ‘dbo’.
[Microsoft][SQL Server Native Client 10.0][SQL Server]The cursor was not declared. (SQLSTATE 37000) 16945
Failed SQL stmt:SELECT VERSION FROM PSVERSION WHERE OBJECTTYPENAME = ‘SYS’
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](1) GenMessageBox(200, 0, M): E:\pt85109b-retail\peopletools\src\psmgr\mgrvers.cpp: A SQL error occurred. Please consult your system log for details.

My first reaction was to check the permissions for the ACCESSID user and, of course, nothing was out of the ordinary there.  I searched Oracle support and found a case indicating that I needed to synchronize the ACCESSID after a Tools upgrade to 8.50+ on SQL Server, but from what I could see the account was set up just fine.  That’s when I took a look at the database connections and saw something bizarre.  There was a connection to the database as user people.  I waited a minute and looked again, and the same connection was still there as people.  Now that shouldn’t happen.  As you should know, the people user is very limited in what it can do.  It’s really only used to validate OPERIDs and retrieve the ACCESSID and password.  I queried sys.dm_exec_sessions for the host_process_id so I could see which process had been connected as people for so long.

login_name session_id login_time              program_name host_process_id status
psaccess   77         2012-01-28 04:16:11.970 PeopleSoft   5624            sleeping
people     81         2012-01-28 04:16:12.030 PeopleSoft   6076            sleeping
psaccess   82         2012-01-28 04:16:12.030 PeopleSoft   5612            sleeping

Once I got the host_process_id, I went back to the app server and confirmed what the log was already telling me.  PID 6076 on the app server was the PSSUBDSP process, the same one that didn’t have select permission anymore.  Of course it couldn’t select, since it was connected as people.  I also noticed the login_time of the processes seemed odd.  I didn’t restart that system at 04:16, and in fact Process Explorer showed the processes had been running quite a while longer.  Now the processes will restart after a certain amount of work load, but that should rarely have them all reconnecting to the database at the same time.  In the logs I didn’t see anything around that time.  In fact, the errors didn’t show up until several hours after the login_time.

I restarted the PUBSUB processes and saw that after restarting all processes were once again connected as the ACCESSID.  I decided to see what happened if I killed the connection for the process from the database.  I killed a newly connected PSSUBDSP process by matching the session_id from sys.dm_exec_sessions with the host_process_id again.  The process reconnected as people and never made the switch to the ACCESSID again.

The Test Plan:

I decided to dig in a little deeper.  I was going to file a case with Oracle since I had not found anything on their support site that seemed to address this at all.  I also wanted to have someone else try to replicate it on a different version of PeopleTools.  Plus I didn’t know why the error didn’t really show up until some time later.  To make this section shorter, I did several things to validate my theory (after a network/database disconnection the reconnect process was broken) and ensure I had the detail needed for the support case.  I validated that outbound port numbers changed with netstat, turned up tracing a bit, and forced messages through.  Something I found during this was that the error would not start showing in the app server logs until the process tried to handle a message.  I ended up with the following test plan to provide to Oracle and others to use:

  1. Set TraceSql=7 in psappsrv.cfg
  2. Review newly created <USER>_PSSUBDSP_dflt.tracesql log to determine pid for PSSUBDSP process
  3. While reviewing the log, verify that transactions are “committed” after every SQL statement is run;  look for a line like:
    PSSUBDSP_dflt.17196 (4)      1-3      16.25.27    0.005000 Cur#20.17196.FINDEV RC=0 Dur=0.003000 Commit
  4. Run netstat -a -o | find "<pid>" and note the outbound TCP port
  5. Determine SID to kill and the user it’s logged in as
    select session_id, login_name from sys.dm_exec_sessions where host_name='<App Server HOSTNAME>' and host_process_id=<pid>
  6. kill <sid> :  to kill connection
  7. Wait 15 – 30 seconds
  8. Rerun netstat -a -o | find "<pid>" and note the outbound TCP port.  Did it change?  It should have.
  9. Review <USER>_PSSUBDSP_dflt.tracesql log to determine if transactions are now failing; look for a line like
    PSSUBDSP_dflt.17196 (62)      1-3      17.00.24    0.002000 Cur#1.5844.FINDEV RC=0 Dur=0.001000 Rollback
  10. Rerun the SQL above to determine if the login_name has changed
    select session_id, login_name from sys.dm_exec_sessions where host_name='<App Server HOSTNAME>' and host_process_id=<pid>
  11. If login_name = people or transactions are saying Rollback in the log, you have a problem.  Review the APPSRV log to check for the following errors.  None should exist until the next message is processed.
    PSSUBDSP_dflt.5844 (63) [11/12/12 15:14:56 Dispatch](3) File: E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cppSQL error. Stmt #: 67  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSAPMSGDSPSTAT’, database ‘FINDEV’, schema ‘dbo’. (SQLSTATE 42000) 229
    Failed SQL stmt:SELECT DSPSTATUS, IB_SLAVEMODE, DSPRESET, CLEANUP_DTTM FROM PSAPMSGDSPSTAT WHERE DISPATCHERNAME=:1 AND MACHINENAME=:2 AND APPSERVER_PATH=:3
    PSSUBDSP_dflt.5844 (63) [11/12/12 15:14:56 Dispatch](1) GenMessageBox(200, 0, M): E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cpp: A SQL error occurred. Please consult your system log for details.
  12. Force a message through the system. I was just locking and unlocking my account on the HR side to force the message over to another application.
  13. Check to see if the message was processed and review APPSRV log for error.
  14. Test either Passed: Message processed ok, or Failed: Message stuck in New status on subscription side and error in the APPSRV log.
  15. Set TraceSql=0 in psappsrv.cfg
  16. Restart PUBSUB processes to correct any connection problem.
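Steps 3 and 9 of the plan come down to reading the tracesql log and watching whether the lines for the process end in Commit or Rollback.  If eyeballing the file gets old, a small script can do the counting.  This is a rough sketch that assumes the trace lines look like the two samples in the plan, with the outcome as the last word on the line; the function name is made up for illustration.

```python
from collections import Counter


def summarize_trace(lines):
    """Count Commit/Rollback trace lines and remember the most recent outcome.

    Assumption: the outcome is the last whitespace-separated word on the line,
    as in the sample tracesql lines from the test plan.
    """
    counts = Counter()
    last_outcome = None
    for line in lines:
        words = line.split()
        if words and words[-1] in ("Commit", "Rollback"):
            counts[words[-1]] += 1
            last_outcome = words[-1]
    return counts, last_outcome


sample = [
    "PSSUBDSP_dflt.17196 (4)      1-3      16.25.27    0.005000 Cur#20.17196.FINDEV RC=0 Dur=0.003000 Commit",
    "PSSUBDSP_dflt.17196 (62)      1-3      17.00.24    0.002000 Cur#1.5844.FINDEV RC=0 Dur=0.001000 Rollback",
]
counts, last_outcome = summarize_trace(sample)
print(dict(counts), last_outcome)
```

If the most recent outcome after the kill is Rollback, that lines up with the failure condition in step 11.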

The Fix:

In a test environment I validated that the PUBSUB processes were fixed in 8.51.10.  Unfortunately, a minor PeopleTools patch was not an option at the time for production, so a workaround was in order.  What would be the best way to identify this problem and take corrective action?  If you’ve read some of my other posts you might know blindly restarting every night isn’t my style, and as I saw, it didn’t guarantee anything.  Whatever was causing the disconnect could happen at any time, so you might need to restart every 30 minutes to ensure decent availability for integration.  I needed to detect the problem, identify which database on the SQL Server was impacted, and restart only what was impacted.  Time to break out my scripting fingers.  I came up with two scripted solutions.  The first solution I wrote was a SQL script that could be scheduled which would:

  1. Identify connections to the database as people that were older than X minutes
  2. Execute a PowerShell script on the database server, providing the server name, database, and host PID of the offending process
  3. That database server side PowerShell script would then remotely execute PowerShell commands on the correct app server which
  4. Ensured the host PID provided was for a PUBSUB process and
  5. Executed my normal PUBSUB restart script for the domain

This method had several challenges that I did not really try to overcome.

  1. The database user running the SQL script needs xp_cmdshell.  This is a big security concern in many shops and in general something that should probably be frowned upon.  It would probably be easy enough to have the PowerShell script run the SQL and collect the data as well, but I didn’t look into it.
  2. My SQL script assumed the app server domain running PUBSUB was the same as the database name.  If you ran multiple or differently named domains it would need to be tweaked to take those into account.
  3. Remote PowerShell capabilities had to be turned on, another possible security concern.
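The heart of step 1 in the SQL-based approach is just an age filter on the rows from sys.dm_exec_sessions.  A brief connection as people is normal during sign-on, so only sessions that stay that way past a threshold are suspicious.  Here is that filter sketched in Python rather than SQL, using the three rows from the query output earlier in the post.  The function name and the dictionary layout are made up for illustration and this is not the script I actually scheduled.

```python
from datetime import datetime, timedelta

# Rows copied from the sys.dm_exec_sessions output shown earlier in the post.
SAMPLE_ROWS = [
    {"login_name": "psaccess", "session_id": 77,
     "login_time": datetime(2012, 1, 28, 4, 16, 11, 970000), "host_process_id": 5624},
    {"login_name": "people", "session_id": 81,
     "login_time": datetime(2012, 1, 28, 4, 16, 12, 30000), "host_process_id": 6076},
    {"login_name": "psaccess", "session_id": 82,
     "login_time": datetime(2012, 1, 28, 4, 16, 12, 30000), "host_process_id": 5612},
]


def stale_people_sessions(rows, now, max_age_minutes, limited_login="people"):
    """Return sessions still connected as the limited login after max_age_minutes."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    return [
        row for row in rows
        if row["login_name"] == limited_login and row["login_time"] <= cutoff
    ]


# Evaluate the sample at the time the first error showed up in the log.
suspects = stale_people_sessions(SAMPLE_ROWS, datetime(2012, 1, 28, 10, 0, 25), 5)
print([row["host_process_id"] for row in suspects])
```

The host_process_id of each hit is what would get handed to the restart step on the app server.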

I also wrote another PowerShell script that was application server side based.  I have a large script infrastructure already deployed to any Windows app servers that somewhat mimics psconfig.sh on the *nix servers.  This script leverages that infrastructure, and with one additional script, I can monitor all domain APPSRV logs on a server for the error.  This script does the following:

  1. Reads a central file that includes all application domains on the server
  2. Uses pattern matching to find a domain with “select permission was denied” in the log file
  3. If an error is found, the script waits for 20 seconds and rechecks the log file
  4. It compares the number of error lines before and after the 20 seconds (default polling interval is 15 seconds)
  5. If the count increases, it restarts PUBSUB and emails me notification of the restart
  6. If the count is the same, we must have restarted previously, so we do nothing

This method also has some disadvantages, but is the route I chose:

  1. It won’t detect the problem until the error is in the log, which could be hours later.  However, there is no user/functional impact until the error arrives as that is the first time a message is actually processed.
  2. I’m not doing anything fancy to track file offsets and start reading where I left off, therefore, the larger the file the longer the script will take to run.  I would not recommend running this against log files that are growing really large or have tracing on.
  3. This script needs to be scheduled on each physical application server instead of a smaller number of database servers.
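The before and after comparison at the center of the log monitoring script is simple enough to show.  My real script is PowerShell and tied into my own infrastructure, so treat this Python version as a minimal sketch of the logic only.  The function names are made up for illustration, and the restart and email steps are left out.

```python
import time

ERROR_TEXT = "select permission was denied"


def count_error_lines(lines):
    """Count log lines containing the permission error, ignoring case."""
    return sum(1 for line in lines if ERROR_TEXT in line.lower())


def read_log(path):
    """Read the whole log each time; there is no offset tracking here."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()


def needs_restart(get_lines, wait_seconds=20, sleep=time.sleep):
    """Return True only if the error count grows during the wait.

    get_lines is any callable that returns the current log lines. An unchanged
    count is treated as an old error left over from before an earlier restart.
    """
    before = count_error_lines(get_lines())
    if before == 0:
        return False
    sleep(wait_seconds)
    return count_error_lines(get_lines()) > before
```

Against a real domain you would call something like needs_restart(lambda: read_log(path_to_appsrv_log)) and kick off the PUBSUB restart when it returns True.  Rereading the whole file twice is exactly the cost described in the second disadvantage above.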

I never did identify the real culprit behind the processes being disconnected from the database.  It seems to be pretty hit and miss.  I had some ideas, but once I got the automatic restarts scheduled everyone’s interest in the problem died down significantly.

UPPER CASE PeopleSoft scripts for SQL Server

I remember the first time I started working on a PeopleSoft system running on SQL Server.  I was constantly having problems with my SQL and the scripts I’d write because I would write everything in lower case.  That’s when I learned PeopleSoft recommended the Latin1_General_Bin collation for the database on SQL Server, and that collation is case sensitive.  Now I had to train myself to use upper case on SQL Server.

I still have the habit of writing things in lower case (old habits die hard), and a lot of times I’ll take portions of Oracle scripts and put them to use on SQL Server.  Sometimes I go the other way, but since Oracle has the majority of market share it’s usually Oracle to SQL Server.  There are lots of ways to change the case of text, but I thought I’d share this tip for those that didn’t know.  After I copy and paste something that was lower case into Management Studio, I hit CTRL + A to select it all and then hit CTRL + SHIFT + U, and it converts everything to upper case.  You can reverse it with CTRL + SHIFT + L, but there is not usually a need for that.

When PeopleSoft Cache goes Bad

In general Cache is a good thing, but sometimes things go wrong. I usually try to avoid clearing cache unless I have a real reason. In my experience clearing cache without reason usually only adds to the end users’ perception that PeopleSoft is a slow, painful application to work with, and we know PeopleSoft gets plenty of opportunities to prove that daily. In large scale environments, if you don’t have a cache building process, clearing cache across multiple domains with a decent number of processes each could put a significant damper on some users’ mornings. Say your Expense Approval team needs to re-cache 30 – 40 processes; that might ruin their morning. This post isn’t about what makes things go wrong, but how to identify and deal with them in the least impactful manner possible. Let me show you a recent case I ran into. In this example I’ll go over some basics of tmadmin, so if you’ve been doing this a while you’ll probably already know the stop/start and psr commands, but maybe you’ll learn something new.

Users started reporting intermittent errors: sometimes things worked, sometimes they didn’t. That was hint #1. In this case, the errors were around the Query Manager pages, and three different error messages were being reported: “Function CheckSec not found in PeopleCode program QRYFUNCTIONS.QRYQUERYFUNCS.FieldFormula.”, “Page load failed for QUERY_MANAGER/GBL.”, and “Data Integrity Error”. Usually these error types will show up in the APPSRV log. They would look something like this…


PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Function CheckSec not found in PeopleCode program QRYFUNCTIONS.QRYQUERYFUNCS.FieldFormula. (2,301)
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) PRMGet failed for component QUERY_MANAGER market GBL
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Data Integrity Error (124,85)
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Function CheckSec not found in PeopleCode program QRYFUNCTIONS.QRYQUERYFUNCS.FieldFormula. (2,301)
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) PRMGet failed for component QUERY_MANAGER market GBL
PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) An error has occurred which prevents this transaction continuing

If you can find these in your log file, then the next step is to determine how many processes they are coming from. Use grep or find (if you’re on a Windows server) to trim the error messages down to an individual one. For instance, I ran
grep 'PRMGet failed for component QUERY_MANAGER' APPSRV_0117.LOG
You could pipe that through something like | awk -F" " '{print $1}' to print only the first field, which is the process name and pid (PSAPPSRV.7546). Anyway, what we are after is the PSAPPSRV pid, which is the first number after PSAPPSRV. In our example above it’s 7546.
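If grep and awk aren’t handy (say, on a Windows app server), the same tally is easy to script.  Here is a minimal Python sketch that counts matching error lines per process.  It assumes every log line starts with the process name, a dot, and the pid, as in the sample above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the "PSAPPSRV.7546 " prefix at the start of an APPSRV log line.
PREFIX = re.compile(r"^(\w+)\.(\d+)\s")


def count_errors_by_pid(lines, needle):
    """Count lines containing needle, grouped by (process name, pid)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = PREFIX.match(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample_log = [
    "PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) PRMGet failed for component QUERY_MANAGER market GBL",
    "PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) Data Integrity Error (124,85)",
    "PSAPPSRV.7546 (1610) [01/17/13 15:34:50 user@client.where.com (IE 8.0; WIN7) ICPanel](0) PRMGet failed for component QUERY_MANAGER market GBL",
]
print(count_errors_by_pid(sample_log, "PRMGet failed for component QUERY_MANAGER"))
```

A single key in the result means every matching error came from one process.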

If you look at the entries that come back and find that all the errors are coming from only one pid, that is hint #2. Remember that the users were saying that some of the requests work, so it’s possible we’ve identified a process that has either gone haywire or whose cache has gone bad (corrupt). Let’s shut this particular process down and see if our users report the problem stops. There are several ways to identify which appserv process this is. You could use ps or Task Manager, but since we want to shut it down properly let’s just use our psadmin tools.

Run psadmin, pick the domain that has problems, and use option 5
TUXEDO command line (tmadmin)

At the tmadmin prompt “>” enter psr -v -g APPSRV (-v is verbose output, -g limits output to the group name specified)

> psr -v -g APPSRV

The output is paginated, so just page through until you find the process with the process id that matches the one you’re looking for.


Group ID: APPSRV, Server ID: 2
Machine ID: psapp1.localdomain
Process ID: 7546, Request Qaddr: 229380, Reply Qaddr: 129663006

Here it is, PID 7546 is server ID 2. If we just run psr without -v, the server ID is shown in the ID column.
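If you capture the verbose psr output to a file instead of paging through it, matching a pid to its server ID can be scripted.  This is a rough sketch that assumes the output keeps the “Server ID:” and “Process ID:” labels shown in the excerpt above; the function name is made up for illustration.

```python
import re


def server_id_for_pid(psr_verbose_output, pid):
    """Return the Server ID whose entry lists the given Process ID, or None."""
    current_server_id = None
    for line in psr_verbose_output.splitlines():
        server = re.search(r"Server ID:\s*(\d+)", line)
        if server:
            # Remember the most recent server entry we have seen.
            current_server_id = int(server.group(1))
        process = re.search(r"Process ID:\s*(\d+)", line)
        if process and int(process.group(1)) == pid:
            return current_server_id
    return None


excerpt = """Group ID: APPSRV, Server ID: 2
Machine ID: psapp1.localdomain
Process ID: 7546, Request Qaddr: 229380, Reply Qaddr: 129663006
"""
print(server_id_for_pid(excerpt, 7546))
```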

So now let’s shut down our possible trouble maker.

> shutdown -g APPSRV -i 2

There, we shut down just ONE of multiple PSAPPSRV processes. Now we can have users test. If the problem goes away, we really did find our culprit. Now it’s safe to clear cache. Are we sure it’s safe? Well, let’s take a look. Again, I’m going to describe it as performed on a Linux server, but you could use Process Explorer from Sysinternals to do the same verification step here.

Let’s make sure no one is using our cache files, which might cause us a headache if we were to wipe them prematurely. As your PeopleSoft application user on the Linux box, run lsof and grep for CACHE

$ /usr/sbin/lsof |grep CACHE

...
.
PSAPPSRV 15803 psoft 178u REG 252,2 0 263319 /opt/apps/psoft/domains/appserv/PA91/CACHE/PSAPPSRV_1/SDEFM.DAT
.
...

Look at that, as an example I can see that PSAPPSRV_1/SDEFM.DAT is open by process PSAPPSRV pid 15803 which is Server ID 1. That builds the correlation that the CACHE/PSAPPSRV_1 directory belongs to the PSAPPSRV process with server id 1. Well that makes sense.
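The same check can be scripted if you want a yes or no answer before removing anything.  Here is a minimal sketch that looks through lsof output for open files under a given cache directory.  It assumes lsof’s default column layout, where the pid is the second field, and the function name is made up for illustration.

```python
def pids_using_cache_dir(lsof_lines, cache_dir):
    """Return the set of pids holding files open under CACHE/<cache_dir>/."""
    marker = "/CACHE/" + cache_dir + "/"
    pids = set()
    for line in lsof_lines:
        fields = line.split()
        # Default lsof layout: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if marker in line and len(fields) >= 2 and fields[1].isdigit():
            pids.add(int(fields[1]))
    return pids


lsof_sample = [
    "PSAPPSRV 15803 psoft 178u REG 252,2 0 263319 /opt/apps/psoft/domains/appserv/PA91/CACHE/PSAPPSRV_1/SDEFM.DAT",
]
print(pids_using_cache_dir(lsof_sample, "PSAPPSRV_1"))
print(pids_using_cache_dir(lsof_sample, "PSAPPSRV_2"))
```

An empty set for a directory means nothing has its files open, which is the condition we want before wiping it.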

We shut down server ID 2, so let’s see if that directory has anything open.

$ /usr/sbin/lsof |grep PSAPPSRV_2

Crickets, just the prompt returned. Perfect, let’s remove all that cache and give it a fresh start. From inside the domain directory let’s run our rm command,

$ rm -rf CACHE/PSAPPSRV_2

Now let’s start up the process again. Go back into psadmin and run tmadmin again.

> boot -g APPSRV -i 2

The process boots, recreates PSAPPSRV_2 in CACHE and is ready to service requests.

So there we have it. In my case, the problems went away and I inconvenienced a much smaller set of users than I would have if I had cleared cache across the board. I’ve seen this pop up a few times in the last 2 years, and each time this strategy has worked well. One time there was a problem with process scheduler PSAE cache on a piece of SQL (the SQL statement was actually only partially returned). The job would fail when it ran on PSAE server ID 3, but the same principle applied. In that case it took me a little longer to determine that it was really a caching problem, but that’s what it ended up being. Happy troubleshooting.