This is the follow-up to an initial post that covered some details of using the errorstack and ksq traces as an aid to finding the cause of an intermittent ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired. We were (hypothetically) looking at a scenario where a batch-like process would occasionally fail raising this error, leaving us to deal with an error that we could not reproduce on demand.
Recapping the previous article, we saw that we could set a system-wide call to dump an errorstack at level 1 whenever the error occurred, producing a trace file containing the statement that had raised the error along with its SQL ID and a call stack that would allow us to find the object_id of the specific object that had caused the lock conflict. Once we had the SQL ID we then had the option to set a system-wide call to dump the ksq (Kernel Service Enqueues) trace whenever that statement was executed [but see footnote 1]. Whenever the statement succeeded this would give us a complete listing of the locks (enqueues) needed by the statement, and when the statement failed (due to ORA-00054) we would be able to see very clearly where the breakdown had occurred.
alter system set events '54 trace name errorstack level 1';
alter system set events 'trace[ksq][SQL:3afvh3rtqqwyg] disk=highest';
Neither option, however, would tell us anything about the competing session or about the SQL that caused the competing lock to come into existence; all we could hope for was some hint about why the ORA-00054 had been raised, and little clue about the part of the application that was causing the conflict.
Using the SystemState
The basic systemstate dump tends to be rather large – for starters, it’s going to include a lot of information about currently open cursors for every (user) session – so it’s not something you really want to make frequent use of, and you don’t want it to be triggered frequently. But if you have an occasional (and critical) batch failure due to an intermittent locking problem then you can issue a call like:
alter system set events '00054 trace name systemstate level 2, lifetime 1';
The “lifetime 1” ensures that a session will only dump a systemstate once in its lifetime – which may be necessary to ensure the system isn’t overloaded by large numbers of systemstate dumps being generated in a very short time interval. You may need to allow for more than just 1 dump per session, though.
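For example, if you think a single session might reasonably trigger the error a few times in one run you could raise the limit. The following is only a sketch, with an arbitrary count of 3:
alter system set events '00054 trace name systemstate level 2, lifetime 3';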
In fact, since I want to dump both the errorstack and the systemstate when the ORA-00054 occurs, the critical three lines in my model were as follows:
alter system set events '54 trace name errorstack level 1; name systemstate level 2, lifetime 1';
alter table child enable novalidate constraint chi_fk_par;
alter system set events '54 trace name systemstate off; name errorstack off';
So what do you get if you make this call and then try to re-enable a foreign key constraint when the parent table is locked? In my very small system, with just a couple of live sessions, and shortly after instance startup, my tracefile was about 1.5MB and 16,000 lines in size, so not something to read through without a little filtering.
From the Call Stack Trace produced by the errorstack dump I could see that the first argument to ktaiam (and the associated function calls) was 00001EA61. This told me that I would find at least one session holding a lock identified as TM-0001EA61, so that’s the text I searched for next. Note the little trap: the value reported in the call stack has an extra leading zero. I found the TM enqueue 11,000 lines further down the file in a “State Object (SO:)” of type “DML Lock”:
SO: 0x9cf0e708, type: DML lock (83), map: 0x9bac7b98
state: LIVE (0x4532), flags: 0x0
owner: 0x9cf648d8, proc: 0xa0eed9f0
link: 0x9cf0e728[0x9cf64948, 0x9cf64948]
conid: 3, conuid: 3792595, SGA version=(1,0), pg: 0
SOC: 0x9bac7b98, type: DML lock (83), map: 0x9cf0e708
state: LIVE (0x99fc), flags: INIT (0x1)
DML LOCK: tab=125537 flg=11 chi=0
his[0]: mod=6 spn=348
2022-07-06 23:34:49.285*:ksq.c@10787:ksqdmc(): Enqueue Dump (enqueue) TM-0001EA61-00000000-0039DED3-00000000 DID: ksqlkdid: 0001-0029-0000001F
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x16
mode: X, lock_flag: 0x20, lock: 0x9bac7bc0, res: 0x9e7ec5e8
own: 0xa086dc70, sess: 0xa086dc70, proc: 0xa05617d0, prv: 0x9e7ec5f8
SGA version=(1,0)
In my case there was only one holder for this lock, but in a live system the same object could be locked by many users.
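If you want to double-check which object the lock is protecting, the hex id in the TM resource name (or the decimal tab= value in the DML LOCK line) translates back to an object name with a query along these lines (just a sketch, using the values from my trace):
select owner, object_name, object_type
from   dba_objects
where  object_id = to_number('1EA61','XXXXX')   -- = 125537, matching tab=125537
;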
Having found a state object for the lock I had to identify the process holding this lock – which means searching backwards to find the parent of this state object, then its parent, and so on. Here, in the order I found them (which is the reverse of the order they appear in the file), are the lines I picked out:
SO: 0x9cf0e708, type: DML lock (83), map: 0x9bac7b98
SO: 0x9cf648d8, type: transaction (85), map: 0x9bbebdc8
SO: 0x8df2ac88, type: LIBRARY OBJECT LOCK (118), map: 0x63693c60
SO: 0x8df4fa78, type: LIBRARY OBJECT LOCK (118), map: 0x81af83e8
SO: 0xa0efc020, type: session (4), map: 0xa086dc70
SO: 0xa0eed9f0, type: process (2), map: 0xa05617d0
So the DML lock is owned by a transaction, which is owned by a session (with a couple of “library object lock” state objects “in the way”), which is owned by a process. You may find other “extraneous” lines on the way, but the key detail to note is the hierarchical pattern of state objects – keep going until you’ve reached the session and process state objects.
Once we’ve got this information we need to search the systemstate dump for any cursors that this session/process has open to see if we can find something that looks like a statement that could have created the lock, so we need to search for state objects of type “LIBRARY OBJECT LOCK” with the correct owner information.
Technically they would have to appear in the trace file between the process state object we’ve found and the next process state object listed in the file, but it would be a little tedious to do this search with a text editor so I switched from using vi to using grep – and here’s a search condition that will identify and print part of each state object owned by this session and process:
grep -B+2 -A+13 "owner: 0xa0efc020, proc: 0xa0eed9f0" or19_ora_19369.trc >temp.txt
The hexadecimal value following “owner: “ is from the session state object, the value following “proc: “ is from the process state object. or19_ora_19369.trc is my trace file, and each time I’ve found a matching line I’ll write the 2 lines before it, the line itself, and 13 lines after it to the file temp.txt.
In my example I found 29 state objects owned by the process, of which 25 were of type “LIBRARY OBJECT LOCK” – and I’ve reported two of them below:
SO: 0x8df36d38, type: LIBRARY OBJECT LOCK (118), map: 0x7b54afa0
state: LIVE (0x4532), flags: 0x1
owner: 0xa0efc020, proc: 0xa0eed9f0
link: 0x8df36d58[0x8df496c8, 0x8df31608]
child list count: 0, link: 0x8df36da8[0x8df36da8, 0x8df36da8]
conid: 3, conuid: 3792595, SGA version=(1,0), pg: 0
SOC: 0x7b54afa0, type: LIBRARY OBJECT LOCK (118), map: 0x8df36d38
state: LIVE (0x99fc), flags: INIT (0x1)
LibraryObjectLock: Address=0x7b54afa0 Handle=0x82967ac0 Mode=N
CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1
User=0xa086dc70 Session=0xa086dc70 ReferenceCount=1
Flags=CNB/[0001] SavepointNum=155 Time=07/06/2022 23:29:53
LibraryHandle: Address=0x82967ac0 Hash=5ce74124 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=lock table parent in exclusive mode
SO: 0x8defb228, type: LIBRARY OBJECT LOCK (118), map: 0x944f8880
state: LIVE (0x4532), flags: 0x1
owner: 0xa0efc020, proc: 0xa0eed9f0
link: 0x8defb248[0x96fbe5e8, 0x8df496c8]
child list count: 0, link: 0x8defb298[0x8defb298, 0x8defb298]
conid: 3, conuid: 3792595, SGA version=(1,0), pg: 0
SOC: 0x944f8880, type: LIBRARY OBJECT LOCK (118), map: 0x8defb228
state: LIVE (0x99fc), flags: INIT (0x1)
LibraryObjectLock: Address=0x944f8880 Handle=0x896af550 Mode=N
CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0
Context=0x7f125ecf34b8
User=0xa086dc70 Session=0xa08664b8 ReferenceCount=1
Flags=[0000] SavepointNum=0 Time=07/06/2022 23:29:53
LibraryHandle: Address=0x896af550 Hash=0 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
Name: Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=3
Details to note:
- The last line I’ve selected from the first state object looks like a good candidate SQL statement for creating the blocking lock. The line above it, showing “Hash=5ce74124”, gives us the hexadecimal equivalent of the v$sql.hash_value for this statement (see the query sketch after this list).
- I believe the last line of the second state object is telling us that the associated statement has been flushed from the library cache, but I’m not sure that I’m interpreting that correctly. You’ll notice, though, that the line does give us a suitable namespace and type for something to do with a SQL or PL/SQL cursor (and a hash value of zero – so if it is/was a (PL/)SQL statement that’s the clue that it’s no longer in memory).
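If the statement is still in the library cache that hash value, converted to decimal, can be used to look it up in v$sql. Here’s a sketch using the value from my trace:
select sql_id, child_number, sql_text
from   v$sql
where  hash_value = to_number('5ce74124','XXXXXXXX')
;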
Comparing these two state objects, the things I want to find with minimal hassle are (I hope) the lines that start with the text “ObjectName” that appear one line after a line holding the text “Hash=” followed by anything but a zero, from state objects labelled “LIBRARY OBJECT LOCK”. Here’s a one-line (wrapped) grep command to do that, followed by the (slightly re-formatted) results I got from my trace file:
grep -B+2 -A+13 "owner: 0xa0efc020, proc: 0xa0eed9f0" or19_ora_19369.trc |
grep -A+15 "SO:.*LIBRARY OBJECT LOCK" |
grep -A+1 "Hash=[^0]" |
grep "ObjectName"
ObjectName: Name=lock table parent in exclusive mode
ObjectName: Name=select pctfree_stg, pctused_stg, size_stg,initial_stg, next_stg, minext_stg, maxext_stg,
maxsiz_stg, lobret_stg,mintim_stg, pctinc_stg, initra_stg, maxtra_stg, optimal_stg,
maxins_stg,frlins_stg, flags_stg, bfp_stg, enc_stg, cmpflag_stg, cmplvl_stg,imcflag_stg,
ccflag_stg, flags2_stg from deferred_stg$ where obj# =:1
It looks like I’ve got the information I need (or, at least, a good clue) about why my batch session raised an ORA-00054; and, in a real system, the other open cursors reported for this session might give me enough information to work out where the problem is coming from.
Warnings
The first warning is just a reminder that there may have been multiple sessions/processes holding locks on the table, so don’t stop after finding the first occurrence of the TM-xxxxxxxx lock; check to see if there are any more and repeat the search for each one’s owning process and its Library Object Locks.
The second warning is that all this work may not give you an answer. A session may have locked a table ages ago and still have an active transaction open; if you’re unlucky the statement that produced the lock may have been flushed from the library cache. A comment I made in 2009 about finding the locking SQL is just as relevant here for the systemstate dump immediately after the ORA-00054 as it was when I first wrote about querying v$sql all those years ago. You may get lucky, and this prompt dumping of the systemstate may make you luckier, but there’s no guarantee you’ll find the guilty statement.
Furthermore, the state objects that I’ve been looking at are “LIBRARY OBJECT LOCK” state objects – these are the things that link to a cursor that’s held open by the session (i.e. things you’d see in v$open_cursor), so if a session locked a table then closed the cursor (and hasn’t committed) the table will still be locked but the systemstate won’t have a state object for the statement that locked the table. For example, when I created and executed the following procedure to lock the table using an “execute immediate” I found a state object for the procedure call, but I didn’t find a state object for the “lock table” statement:
create or replace procedure lock_p as
begin
execute immediate 'lock table PARENT in exclusive mode';
end;
/
On the other hand when I created the procedure with embedded SQL I found state objects for both the procedure call and the SQL statement.
create or replace procedure lock_p as
begin
LOCK TABLE test_user.PARENT in exclusive mode;
end;
/
In passing, the text of the “ObjectName:” you find for the procedure call varies depending on whether you “execute lock_p” or “call lock_p()” from SQL*Plus. The former shows up as “BEGIN lock_p; END;” and the latter as “call lock_p()”.
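As an aside to the v$open_cursor analogy above: while a blocking session is still connected you could do a rough live-system version of this search by matching open cursors against the holders of the relevant TM lock. The following is only a sketch (it uses the object_id from my example, and it suffers from the same limitation that closed or flushed cursors won’t appear):
select oc.sid, oc.sql_id, oc.sql_text
from   v$open_cursor oc
where  oc.sid in (
        select sid
        from   v$lock
        where  type  = 'TM'
        and    id1   = 125537        -- object_id of the locked table
        and    lmode > 0
)
;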
Conclusion
If you need to track down the cause of an intermittent locking problem that results in an Oracle error ORA-00054 then enabling a system-wide dump of the systemstate (level 2 is sufficient) on error 54 may allow you to find out what everyone else was doing around the time of the problem.
If you don’t already know which object is the locked object that’s the direct cause of the ORA-00054 then enabling the errorstack trace at the same time will allow you to find the object_id of the object, so that you can then find the processes/sessions that are holding a DML (TM-) lock with the correct id.
For each State Object for the relevant DML lock you can track backwards up the trace file, and then use the address of each pair of session and process state objects to find all of their “open cursor”/“library object lock” state objects, and check the “ObjectName” of each to see the SQL or PL/SQL text. This may give you the information you need to identify where/how the application is going wrong.
Cursors close, and cursors that are still open (but not pinned) can be flushed from the library cache, and a lock may have been placed by a cursor whose text is no longer available, or not part of the systemstate dump, so this method is not perfect – however, since the systemstate dump takes place the instant the error occurs it does improve your chances that the problem statement is still available and reported.
Footnote 1
Trouble-shooting when the problem is not reproducible on demand often puts you in a position where you have to make a trade-off between information gained and overheads required. Dumping an errorstack for every occurrence of an ORA-00054 is probably a small overhead since you (hopefully) aren’t generating thousands of locking problems per hour – in the unlikely case that the errors occur very frequently to every session that connects you might be able to limit the overhead by adding the “lifetime” clause to the call, e.g.:
alter system set events '00054 trace name errorstack level 1 , lifetime 5';
This would result in every session being able to dump the trace only on the first 5 occasions it triggered the error.
On the other hand, I can think of no effective way of limiting the ksq trace (beyond restricting its action to a specific SQL_ID). If the problem statement executes a couple of times in each batch run, and there are only a few batch runs per day, then the overhead will be small when you’re trying to find the details of a problem that happens once a week. But if the problem statement runs thousands of times in each batch run then it would probably be very expensive to enable the ksq trace to catch an intermittent error.