App Server to Database Reconnection issues in 8.51

I ran into a problem a while ago which brought a more severe problem to my attention.  It appears in at least 8.51.02 (but probably back to 8.51.00) to 8.51.09 there are issues with application server processes properly recovering from a disconnection from the database.  I don’t have first hand experience with that problem, but there is some info on Oracle’s support site about it. If you are running in this PeopleTools range and experiencing odd crashes every once in a while this may be worth investigating.  Using Tracesql=31 will create ORA-3113/3114 errors in your logs.  In 8.51.08 a patch went in to fix it, but it broke something else, causing the problem I encountered.  Bug 11724645 has the details.

My particular problem was experienced on 8.51.09 and was limited to only the Integration Broker PUBSUB processes.  So apparently the PSAPPSRV code had been fixed by then as I never had a problem with those.   In this post I’ll discuss what I saw, some of the troubleshooting steps I used to isolate the problem, and some options I came up with to resolve it.

The Problem:

Integration Broker stops processing messages.   The processes don’t crash and look OK from a quick glance (psr and Process Explorer), but do nothing.  The environment was 8.51.09 all Windows 2008 on SQL Server.  The problem occurred everywhere, even in environments that restarted nightly.

Diagnosing the Problem:

I knew from day one something strange was occurring.  I had never needed to restart the PUBSUB processes this often ever before.  Almost daily some environment would need restarted, sometimes multiple environments, sometimes ones that had already been restarted.  Obviously it was off to the logs first.  There I found something interesting.  Here’s an example, the tables might be different depending on the processes (PSPUBDSP, PSSUBDSP AND PSBRKDSP), but the main message is always the same:  The SELECT permission was denied on the object

PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:25](3) File: E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cppSQL error. Stmt #: 67  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSAPMSGDSPSTAT’, database ‘HCMDEV’, schema ‘dbo’. (SQLSTATE 42000) 229
Failed SQL stmt:SELECT DSPSTATUS, IB_SLAVEMODE, DSPRESET, CLEANUP_DTTM FROM PSAPMSGDSPSTAT WHERE DISPATCHERNAME=:1 AND MACHINENAME=:2 AND APPSERVER_PATH=:3
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:25](1) GenMessageBox(200, 0, M): E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cpp: A SQL error occurred. Please consult your system log for details.
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](3) File: E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cppSQL error. Stmt #: 663  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSAPMSGSUBCON’, database ‘HCMDEV’, schema ‘dbo’. (SQLSTATE 42000) 229
Failed SQL stmt:SELECT IBTRANSACTIONID, IB_SEGMENTINDEX, QUEUENAME, IB_OPERATIONNAME, ACTIONNAME, SUBCONSTATUS, PROCESS_INSTANCE FROM PSAPMSGSUBCON WHERE SUBCONSTATUS IN (0,10)AND PROCESS_INSTANCE > 0
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](1) GenMessageBox(200, 0, M): E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cpp: A SQL error occurred. Please consult your system log for details.
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](3) File: E:\pt85109b-retail\peopletools\src\psmgr\mgrvers.cppSQL error. Stmt #: 881  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSVERSION’, database ‘HCMDEV’, schema ‘dbo’.
[Microsoft][SQL Server Native Client 10.0][SQL Server]The cursor was not declared. (SQLSTATE 37000) 16945
Failed SQL stmt:SELECT VERSION FROM PSVERSION WHERE OBJECTTYPENAME = ‘SYS’
PSSUBDSP_dflt.6076 (1) [01/28/12 10:00:40](1) GenMessageBox(200, 0, M): E:\pt85109b-retail\peopletools\src\psmgr\mgrvers.cpp: A SQL error occurred. Please consult your system log for details.

My first reaction was to check the permissions for the ACCESSID user and of course, nothing was out of the ordinary there.  I searched Oracle support and found a case indicating that I needed to Synchronize the ACCESSID after a Tools upgrade to 8.50+ on SQL Server, but from what I could see the account was setup just fine.  That’s when I took a look at the database connections and saw something bizarre. There was a connection to the database as user people. I waited a minute and looked again, the same connection was still there as people.  Now that shouldn’t happen.  As you should know, the people user is very limited in what it can do.  It’s really only used to validate OPERID’s and retrieve the ACCESSID and password.  I queried sys.dm_exec_sessions, I wanted the host_process_id so I could see what process was connected as people for so long.

login_name session_id login_time              program_name host_process_id status
psaccess   77         2012-01-28 04:16:11.970 PeopleSoft   5624            sleeping
people     81         2012-01-28 04:16:12.030 PeopleSoft   6076            sleeping
psaccess   82         2012-01-28 04:16:12.030 PeopleSoft   5612            sleeping

Once I got the host_process_id, I went back to the app server and confirmed what the log was already telling me.  PID 6076 on the app server was the PSSUBDSP process.  The same one that didn’t have select permission anymore and of course it couldn’t select, it was connected as people.  I also noticed the login_time of the processes seemed odd.  I didn’t restart that system at 04:16 and in fact looking at the processes in Process Explorer indicated the processes had been running quite a while longer.  Now the processes will restart after a certain amount of work load, but that should rarely have them all reconnecting to the database at the same time.  In the logs I didn’t see anything around that time, in fact, the errors didn’t show up until several hours after the login_time.

I restarted the PUBSUB processes and saw that after restarting all processes were once again connected as the ACCESSID.  I decided to see what happened if I killed the connection for the process from the database.  I killed a newly connected PSSUBDSP process by matching the session_id from sys.dm_exec_sessions with the host_process_id again.  The process reconnected as people and never made the switch to the ACCESSID again.

The Test Plan:

I decided to dig in a little deeper.   I was going to file a case with Oracle since I had not found anything on their support site that seemed to address this at all.  I also wanted to have someone else try to replicate it on a different version of PeopleTools.  Plus I didn’t know why the error didn’t really show up until some time later.  To make this section shorter, I did several things to validate my theory (after network/database disconnection the reconnect process was broke) and ensure I had the detail needed for the support case.  I validated outbound port numbers changed with netstat, turned up tracing a bit, and forced messages through.  Something I found during this was that the error would not start showing in the app server logs until a message had tried to be processed.  I ended up coming up with the following test plan to provide to Oracle and others to use:

  1. Set TraceSql=7 in psappsrv.cfg
  2. Review newly created <USER>_PSSUBDSP_dflt.tracesql log to determine pid for PSSUBDSP process
  3. While reviewing the log, verify that after every SQL statement is run transactions are “commited”;  look for a line like:
    PSSUBDSP_dflt.17196 (4)      1-3      16.25.27    0.005000 Cur#20.17196.FINDEV RC=0 Dur=0.003000 Commit
  4. Run netstat -a -o |find “<pid>”  : noting outbound TCP port
  5. Determine SID to kill and the user it’s logged in as
    select session_id, login_name from sys.dm_exec_sessions where host_name=”<App Server HOSTNAME>” and host_process_id=<pid>
  6. kill <sid> :  to kill connection
  7. Wait 15 – 30 seconds
  8. Rerun netstat -a -o |find “<pid>”  : noting outbound TCP port, did it change?  it should have.
  9. Review <USER>_PSSUBDSP_dflt.tracesql log to determine if transactions are now failing; look for a line like
    PSSUBDSP_dflt.17196 (62)      1-3      17.00.24    0.002000 Cur#1.5844.FINDEV RC=0 Dur=0.001000 Rollback
  10. Rerun the SQL above to determine if the login_name has changed
    select session_id, login_name from sys.dm_exec_sessions where host_name=”<App Server HOSTNAME>” and host_process_id=<pid>
  11. If login_name = people or transactions are saying Rollback in the log, you have a problem, review APPSRV log to check for the following errors. None should exist until the next message is processed. PSSUBDSP_dflt.5844 (63) [11/12/12 15:14:56 Dispatch](3) File: E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cppSQL error. Stmt #: 67  Error Position: 0  Return: 8601 – [Microsoft][SQL Server Native Client 10.0][SQL Server]The SELECT permission was denied on the object ‘PSAPMSGDSPSTAT’, database ‘FINDEV’, schema ‘dbo’. (SQLSTATE 42000) 229
    Failed SQL stmt:SELECT DSPSTATUS, IB_SLAVEMODE, DSPRESET, CLEANUP_DTTM FROM PSAPMSGDSPSTAT WHERE DISPATCHERNAME=:1 AND MACHINENAME=:2 AND APPSERVER_PATH=:3
    PSSUBDSP_dflt.5844 (63) [11/12/12 15:14:56 Dispatch](1) GenMessageBox(200, 0, M): E:\pt851-903-R1-retail\peopletools\src\pspubsub\statements.cpp: A SQL error occurred. Please consult your system log for details.
  12. Force a message through the system. I was just locking and unlocking my account on the HR side to force the message over to another application.
  13. Check to see if the message was processed and review APPSRV log for error.
  14. Test either Passed: Message proccessed ok, or Failed: Message stuck in New status on subscription side and error in the APPSRV log.
  15. Set TraceSql=0 in psappsrv.cfg
  16. Restart PUBSUB processes to correct any connection problem.

The Fix:

In a test environment I validated that the PUBSUB processes were fixed in 8.51.10, unfortunately, a minor PeopleTools patch was not an option at the time for production, so a work around was in order.  What would be the best way to identify this problem and take corrective action?  If you’ve read some of my other posts you might know blindly restarting every night isn’t my style, and as I saw didn’t guarantee anything.  What ever was causing the disconnect could happen any time, you might need to restart every 30 minutes to ensure a decent availability for integration.  I needed to detect the problem, identify which database on the SQL Server was impacted, and restart only what was impacted.  Time to break out my scripting fingers.  I came up with two scripted solutions.  The first solution I wrote was a SQL script that could be scheduled which would:

  1. Identify connections to the database as people that were older than X minutes
  2. Execute a power shell script on the database server providing  the server name, database, and host PID of the offending process
  3. That database server side PowerShell script would then remotely execute PowerShell commands on the correct app server which
  4. Ensured the host PID provided was for a PUBSUB process and
  5. Executed my normal PUBSUB restart script for the domain

This method had several challenges that I did not try to overcome really.

  1. The database user running the SQL script needs xp_cmdshell.  This is a big security concern in many shops and in general something that should probably be frowned upon.  It would probably be easy enough to have the PowerShell script run the SQL and collect the data as well.  But I didn’t look into it.
  2. My SQL script assumed the app server domain running PUBSUB was the same as the database name.  If you ran multiple or different named domains it would need to be tweaked to take those into account.
  3. Remote PowerShell capabilities had to be turned on, another possible security concern.

I also wrote another Powershell script that was application server side based.  I have a large script infrastructure that is already deployed to any windows app servers that some what mimics psconfig.sh on the nix servers.  This script leverages that infrastructure, and with one additional script, I can monitor all domain APPSRV logs on a server for the error.  This script does the following:

  1. Reads a central file that includes all application domains on the server
  2. Uses pattern matching to find a domain with “select permission was denied” in the log file
  3. If an error is found, the script waits for 20 seconds and rechecks the log file
  4. It compares the number of error lines before and after the 20 seconds (default polling interval is 15 seconds)
  5. If the count increases, it restarts PUBSUB and emails me notification of the restart
  6. If the count is the same, we must have restarted previously, so we do nothing

This method also has some disadvantages, but is the route I chose

  1. It won’t detect the problem until the error is in the log, which could be hours later.  However, there is no user/functional impact until the error arrives as that is the first time a message is actually processed.
  2. I’m not doing anything fancy to track file offsets and start reading where I left off, therefore, the larger the file the longer the script will take to run.  I would not recommend running this against log files that are growing really large or have tracing on.
  3. This script needs to be scheduled on each physical application server instead of a smaller number of database servers.

I never did identify the real culprit of why the processes are being disconnected from the database, it seems to be pretty hit and miss.  I had some ideas, but once I got the automatic restarts scheduled everyone’s interest in the problem died down significantly.

One thought on “App Server to Database Reconnection issues in 8.51

Leave a Reply

Your email address will not be published. Required fields are marked *