ORA-00054 pt.1

If you don’t recognise the number the relevant extract from the oraus.msg file is:

00054, 00000, "resource busy and acquire with NOWAIT specified or timeout expired"
// *Cause:  Interested resource is busy.
// *Action: Retry if necessary or increase timeout.

The error is typically the result of application code trying to do some DDL on an object that is locked in an incompatible mode by some other session. When it shows up in the log from some batch process it can be difficult to find out what was going on at the same time to cause the problem, because the error message tells you nothing about the blocker.

Nenad Noveljic has just published a note discussing what you can do to trouble-shoot this type of problem, but I thought I’d write up a note on one of the ways I’d address the problem, in this case starting from a specific question on the Oracle-L list-server.

Statement of problem

A piece of application code disables the foreign key constraints on a table, inserts (using “insert as select” with the /*+ append */ hint) a very large volume of data (tens to hundreds of millions of rows), then executes a pl/sql loop to re-enable, in novalidate mode, all the foreign key constraints on that table.

****Sample code *****
INSERT /*+Append*/ INTO TAB1 (c1,c2.....)
        SELECT (ROWNUM + sqq_key) ,col1, col2, col3..... from....tab2,
tab3, tab4.....;
COMMIT;

FOR I IN (
        SELECT TABLE_NAME, CONSTRAINT_NAME 
        FROM ALL_CONSTRAINTS 
        WHERE TABLE_NAME = v_table_nm 
        AND CONSTRAINT_TYPE = 'R' 
        AND STATUS = 'DISABLED'
) LOOP
   EXECUTE IMMEDIATE ('ALTER TABLE ' || v_table_nm || ' ENABLE NOVALIDATE CONSTRAINT '|| I.CONSTRAINT_NAME);
 END LOOP I;

From time to time one of the calls to re-enable a constraint fails, raising ORA-00054, so the OP had set event 54 to do a systemstate dump to see if that would help identify the cause of the error:

alter system set events '54 trace name systemstate level 266, lifetime 1';

Picking through the resulting trace file, though, the OP got the impression that the session was blocking itself, leading to a worry that somehow the “commit;” wasn’t releasing locks properly so that the lock due to the insert was blocking the lock needed for the “enable constraint”.

Trouble-shooting

Reading the key question “is the commit not working properly?” my first thought was “it’s almost guaranteeable that the commit is doing what it’s supposed to do”; and I had no intention of reading a systemstate dump (or reading the bits that had been extracted by someone who had (almost guaranteeably) misinterpreted it).

Where you start trouble-shooting does depend to a degree on how much you already know about what’s going on. The OP, for example, already knew that the error appeared in response to one of the “alter table” commands and was also able to identify which constraint had caused the error to appear – but if you don’t even have that information how do you begin?

Since Oracle is raising an error (and one that probably doesn’t occur very frequently) you could just set the system to dump an errorstack every time the error occurred. (For a repeatable test you might use “alter session”, for a randomly occurring event you might have to “alter system” unless you were able to modify the batch code itself to issue its own “alter session” at the right point.) To minimise the size of the trace file level 1 should suffice, at least to begin with:

alter system set events '54 trace name errorstack level 1';
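For a repeatable test, the session-level variant might look like the following. This is a minimal sketch rather than a tested recipe: it assumes the session has the privilege to set events and can query v$diag_info (available from 11g onwards) to report the name of its trace file.

```sql
-- Sketch: session-level equivalent of the system-level event above.
alter session set events '54 trace name errorstack level 1';

-- ... run the statement that is expected to raise ORA-00054 ...

-- Switch the event off again once the error has been captured.
alter session set events '54 trace name errorstack off';

-- Report the trace file that this session writes to.
select  value
from    v$diag_info
where   name = 'Default Trace File'
;
```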

Here’s the start of the trace information that I produced by setting this event and trying to enable a constraint when I knew the call would be blocked by a competing lock:

dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
<error barrier> at 0x7ffcdfb0dd20 placed dbkda.c@296
ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired
----- Current SQL Statement for this session (sql_id=3afvh3rtqqwyg) -----
alter table child enable novalidate constraint chi_fk_par

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst1()+95         call     kgdsdst()            7FFCDFB0D180 000000002
                                                   7FFCDFB074B0 ? 7FFCDFB075C8 ?
                                                   000000000 000000082 ?

Key points: the report starts with the SQL that triggered the error and gives us its SQL_ID. Before doing anything else, then, let’s consider what locks might be necessary for the constraint to be enabled (noting, particularly, that in this case the constraint is being enabled with the “novalidate” option). The OP suspects that the problem appears because the session can’t acquire a lock on the child table – but maybe there are other locks involved.

Let’s model the scenario with a parent table, a child table and a referential integrity constraint, and see what locks appear as we try to enable the constraint with “novalidate”. To begin with we want to find out what locking goes on when there are no problems with competing sessions and no errors raised. You’ll note that I used the pause command in mid-script so that I could connect through another session if I wanted to introduce some competing locks:

rem
rem     Script:         errorstack.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem

create table parent (
        id      number(4),
        name    varchar2(10),
        constraint par_pk primary key (id)
)
;

create table child(
        id_p    number(4)       constraint chi_fk_par references parent,
        id      number(4),
        name    varchar2(10),
        constraint chi_pk primary key (id_p, id) 
)
;

alter table child disable constraint chi_fk_par;

pause Press return

-- alter system set events '54 trace name errorstack level 1'; 
alter system set events 'trace[ksq][SQL:3afvh3rtqqwyg] disk=highest';

alter table child enable novalidate constraint chi_fk_par;

alter system set events 'trace[ksq][SQL:3afvh3rtqqwyg] off';
-- alter system set events '54 trace name errorstack off'; 

This is using the “new” trace mechanism, tracing the “ksq” (Kernel Service Enqueues) component of the RDBMS library (“oradebug doc component rdbms”), restricted to tracing only when the current SQL statement has a specific SQL_ID.

If you examine the trace file you’ll find lots of lines referencing the source file ksq.c, with call names like ksqgtlctx (get lock context ?) and ksqcli (clear lock information?). I’m just going to grep out the lines that contain the text “mode=”:

2022-07-05 10:44:13.642*:ksq.c@9175:ksqgtlctx(): *** TM-0001E8C8-00000000-0039DED3-00000000 mode=4 flags=0x401 why=173 timeout=0 ***
2022-07-05 10:44:13.642*:ksq.c@9175:ksqgtlctx(): *** TM-0001E8C6-00000000-0039DED3-00000000 mode=4 flags=0x401 why=173 timeout=0 ***
2022-07-05 10:44:13.644*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-00000005-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.644*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-00000006-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.646*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-00000007-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.646*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-00000008-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.646*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-00000009-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.647*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-0000000A-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.647*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-0000000B-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.648*:ksq.c@9175:ksqgtlctx(): *** ZH-0001E8C8-0000000C-0039DED3-00000000 mode=6 flags=0x10021 why=225 timeout=0 ***
2022-07-05 10:44:13.649*:ksq.c@9175:ksqgtlctx(): *** TX-00090014-000022C2-0039DED3-00000000 mode=6 flags=0x401 why=176 timeout=0 ***
2022-07-05 10:44:13.649*:ksq.c@9175:ksqgtlctx(): *** TM-00000061-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.651*:ksq.c@9175:ksqgtlctx(): *** TM-00000049-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.651*:ksq.c@9175:ksqgtlctx(): *** TM-00000004-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.652*:ksq.c@9175:ksqgtlctx(): *** TM-0000001F-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.653*:ksq.c@9175:ksqgtlctx(): *** TM-00000012-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.654*:ksq.c@9175:ksqgtlctx(): *** RC-00000002-00000012-0039DED3-00000000 mode=4 flags=0x10401 why=294 timeout=5 ***
2022-07-05 10:44:13.654*:ksq.c@9175:ksqgtlctx(): *** RC-00000002-0000001F-0039DED3-00000000 mode=4 flags=0x10401 why=294 timeout=5 ***
2022-07-05 10:44:13.657*:ksq.c@9175:ksqgtlctx(): *** TM-00004887-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.657*:ksq.c@9175:ksqgtlctx(): *** TM-0001E643-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-07-05 10:44:13.657*:ksq.c@9175:ksqgtlctx(): *** TX-0003001C-0000223A-0039DED3-00000000 mode=6 flags=0x401 why=176 timeout=0 ***

  • The first two locks are table locks in mode 4 (share) on the child and parent tables respectively (The values 0x0001E8C8 and 0x0001E8C6 are the hexadecimal equivalents of the object_ids).
  • The next 8 locks in mode 6 (exclusive) are something to do with the child table (same object_id appearing) and a check in v$lock_type tells us they’re something to do with compression.
  • Then we see a TX (transaction) lock in mode 6, 5 TM locks on very low (data dictionary) object_ids in mode 3, two RC (result cache) locks in mode 4.
  • Finally there are two TM locks and a TX lock; the table locks are for the aud$unified table and one of its partitions.
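The lock types in that list can be checked against v$lock_type with a query along the following lines. It is only a sketch: the list of types is simply the set that appears in the trace above, and the descriptions returned may vary with the Oracle version.

```sql
-- Look up the meaning of each lock type seen in the ksq trace.
select  type, name, id1_tag, id2_tag, description
from    v$lock_type
where   type in ('TM', 'TX', 'ZH', 'RC')
order by
        type
;
```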

It’s probably safe to ignore the locking related to the recursive transactions (especially since the locks show non-zero timeouts); I don’t know what the ZH locks are about but the increasing nature of the second component of the lock id suggests that they’re not likely to be a problem (even though they have to be acquired without a timeout).

The thing that catches my eye is that we have to lock both the child and the parent – and until I did this test I wasn’t certain that for a “novalidate” constraint there would be any need for a data lock on the parent – though a rowcache lock to check for the legality of the constraint definition would make sense.

So maybe the problem isn’t about the child table, possibly it’s about the parent. I’m going to rerun the whole test again, enabling the ksq trace and the errorstack, and in the pause that’s in the script I’m going to lock the parent table from another session before enabling the constraint. From the new trace file I’m going to show you more lines about some of the TM locks (which will be for new object_ids since the code drops and recreates tables) and then a few more lines from the error stack.
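For anyone repeating the test, the competing lock can be taken in the second session with a simple lock table statement. The choice of mode here is an assumption about how to set up the conflict, not a record of the exact command used: any mode that is incompatible with a mode 4 (share) request will do, and row exclusive (mode 3) is the mode that uncommitted DML against the parent would hold.

```sql
-- Second session, issued while the first session is sitting at the pause.
lock table parent in row exclusive mode;

-- After the first session has reported ORA-00054, release the lock.
rollback;
```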

First the locking information:

2022-07-05 11:20:54.686*:ksq.c@9100:ksqgtlctx(): ksqtgtlctx: PDB mode 
2022-07-05 11:20:54.688*:ksq.c@9175:ksqgtlctx(): *** TM-0001E8D4-00000000-0039DED3-00000000 mode=4 flags=0x401 why=173 timeout=0 ***
...
2022-07-05 11:20:54.688*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:4 req:0
2022-07-05 11:20:54.688*:ksq.c@9960:ksqgtlctx(): SUCCESS

2022-07-05 11:20:54.689*:ksq.c@9100:ksqgtlctx(): ksqtgtlctx: PDB mode 
2022-07-05 11:20:54.689*:ksq.c@9175:ksqgtlctx(): *** TM-0001E8D2-00000000-0039DED3-00000000 mode=4 flags=0x401 why=173 timeout=0 ***
...
2022-07-05 11:20:54.689*:ksq.c@9001:ksqcmi(): returns 51
2022-07-05 11:20:54.689*:ksq.c@9948:ksqgtlctx(): FAILURE: returns 51

Note how we can see here that it’s the attempt to lock the parent (lower object id:0x0001E8D2) that makes Oracle raise the error. Notice, by the way, that internally it’s raising error 51 (which is “timeout occurred while waiting for a resource”) not error 54.

And here’s a section of the error stack – quite a long way down:

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
...
ktagetg0()+929       call     ktaiam()             00001E8D2 000000004 000000000
                                                   7FFF26A8BC38 ? 000000000 ?
                                                   000000000 ?
ktagetp_internal()+  call     ktagetg0()           00001E8D2 ? 000000004 ?
141                                                000000004 ? 7FFF26A8BC38 ?
                                                   000000000 ? 000000000 ?
ktagetg_ddlX()+323   call     ktagetp_internal()   00001E8D2 000000000 000000004
                                                   000000000 ? 000000000
                                                   F7A4F57D00000000
ktagetg_ddl()+30     call     ktagetg_ddlX()       00001E8D2 ? 000000000 ?
                                                   000000004 ? 000000000 ?
                                                   000000000 ? 000000000
kkdllk0()+1551       call     ktagetg_ddl()        00001E8D2 ? 000000000 ?
                                                   000000004 ? 000000000 ?
                                                   000000000 ? 000000000 ?

You’ll notice in the list of arguments for these calls (which relate to getting locks for DDL) that the values 00001E8D2 (the object_id) and 000000004 (the requested lock mode) keep appearing. So simply setting errorstack to level 1 will give you the SQL statement that caused the ORA-00054, and you will be able to find the object_id that Oracle was unable to lock and the attempted lock mode.
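Turning the hexadecimal value from the trace into an object name is then a simple dictionary lookup. A sketch, assuming access to dba_objects and a connection to the same container that produced the trace; the value 1E8D2 is the one from the stack above and will be different on any other system:

```sql
-- Convert the hex object_id from the call stack and report the object.
select  owner, object_name, object_type, object_id
from    dba_objects
where   object_id = to_number('1E8D2', 'XXXXXXXX')
;
```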

Conclusion

It can be a little tricky to track down the source of Oracle error ORA-00054 when it appears unpredictably and cannot be reproduced on demand, but there are two options that help you to get started.

The error is about attempting to lock something – so if you set the system to dump an error stack (even at only level 1) when the error occurs you should be able to find the object_id of the object that is the source of the problem and the lock mode being requested. It’s probably the case that this will be a small overhead when run at the system level since you probably don’t generate lock timeouts very often.

If you are lucky enough to know the SQL_ID of the statement that runs into the problem you can enable ksq (lock) tracing for that specific SQL statement and that will make it a lot easier to see exactly which lock attempt failed. If you have no idea of the SQL_ID, then ksq tracing for the whole system will probably be too much of an overhead to leave in place. The benefit of the ksq trace is that if you don’t know what locking your application code needs you will be able to see all the locks involved, and simply knowing what locks are involved may be enough to point you in the right direction.

Note (which I haven’t tested): if the guilty SQL is called from inside a package, then using the SQL_ID of the package call may result in ksq tracing for every statement call inside the package call, and that might be a bearable overhead.

Footnote:

Although this allows us to discover where the locking conflict appeared, it doesn’t tell us what the blocking session did to get in our way. In the next installment I’ll describe how we can drill through the systemstate dump to find out (if the information is still there) what the other session was doing to cause the problem.

Footnote 2:

It’s worth mentioning that in some cases of locking it can be a good idea to use the “wait N” (for a small value of N) option in your code as a wait of a few seconds may allow you to find some clues about blockers in the ASH (v$active_session_history / dba_hist_active_sess_history) information when a timeout occurs. In this specific case, though, I don’t think there’s a variant of the syntax that would allow you to do something like “alter table modify constraint …. wait 5”.
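Where a timed wait is possible, the ASH samples for the waiting session should carry the session id of the blocker. The query below is a hedged sketch: it assumes the waits show up as “enq: TM - contention”, that the samples are still in memory, and that you are licensed for the Diagnostics Pack, which querying ASH requires. The ten-minute window is arbitrary.

```sql
-- Recent table-lock waits, with the blocking session Oracle recorded for each sample.
select  sample_time, session_id, session_serial#,
        sql_id, event, current_obj#,
        blocking_session, blocking_session_serial#
from    v$active_session_history
where   event = 'enq: TM - contention'
and     sample_time > systimestamp - interval '10' minute
order by
        sample_time
;
```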

Update (a few minutes after publishing)

If you check comment #1 below you’ll see that Alexander Chervinskiy has supplied a method for getting the recursive locks to wait for a limited period by setting the parameter ddl_lock_timeout to a small value (in seconds). This can be done at the session or at the system level.
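At the session level that might look like the following sketch. The five-second value is arbitrary and the statement is the one from the test script; setting the parameter back to 0 restores the default of no waiting.

```sql
-- Allow DDL in this session to wait up to 5 seconds for its locks.
alter session set ddl_lock_timeout = 5;

alter table child enable novalidate constraint chi_fk_par;

-- Restore the default (no wait).
alter session set ddl_lock_timeout = 0;
```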

