Friday, March 27, 2015

oracle wait events explained by experts : asynch descriptor resize

Tanel Poder
The “direct path loader” (KCBL) module is used for performing direct path IO in Oracle, such as direct path segment scans and reading/writing spilled over workareas in temporary tablespace. Direct path IO is used whenever you see “direct path read/write*” wait events reported in your session. This means that IOs aren’t done from/to buffer cache, but from/to PGA directly, bypassing the buffer cache.
This KCBL module tries to dynamically scale up the number of asynch IO descriptors (AIO descriptors are the OS kernel structures, which keep track of asynch IO requests) to match the number of direct path IO slots a process uses. In other words, if the PGA workarea and/or spilled-over hash area in temp tablespace gets larger, Oracle also scales up the number of direct IO slots. Direct IO slots are PGA memory structures helping to do direct IO between files and PGA.
In order to be able to perform this direct IO asynchronously, Oracle also dynamically scales up the number of OS asynch IO descriptors, one for each slot (up to 4096 descriptors per process). When Oracle doesn’t need the direct IO slots anymore (when the direct path table scan has ended or a workarea/tempseg gets cancelled) then it scales down the number of direct IO slots and asynch IO descriptors. Scaling asynch IO descriptors up/down requires issuing syscalls to OS (as the AIO descriptors are OS kernel structures).
I guess this is supposed to be an optimization, to avoid running out of OS AIO descriptors, by releasing them when not they’re not needed, but as that Metalink note mentioned, the resize apparently sucks on Linux. Perhaps that’s why other ports also suffer and have seen the same wait event.
The “asynch descriptor resize” event itself is really an IO wait event (recorded in the wait class Other though), waiting for reaping outstanding IOs. Once this wait is over, then the OS call to change the amount of asynch IO descriptors (allocated to that process) is made. There’s no wait event recorded for the actual “resize” OS call as it shouldn’t block.
So, the more direct IO you do, especially when sorting/hashing to temp with frequent workarea closing/opening, the more of this event you’ll see (and it’s probably the same for regular tablespace direct path IO too).
This problem wouldn’t be noticeable if Oracle kept async io descriptors cached and wouldn’t constantly allocated/free them. Of course then you may end up running out of aio descriptors in the whole server easier. Also I don’t know whether there would be some OS issues with reusing cached aio descriptors, perhaps there is a good reason why such caching isn’t done.
Nevertheless, what’s causing this wait event is too frequent aio descriptor resize due to changes in direct IO slot count (due to changes in PGA workarea/temp segment and perhaps when doing frequent direct path scans through lots of tables/partitions too).
So, the obvious question here is what to do about this wait event? Well, first you should check how big part of your total response time this event takes at all?
  1. If it’s someting like 1% of your response time, then this is not your problem anyway and troubleshooting this further would be not practical – it’s just how Oracle works :)
  2. If it’s something like 20% or more of your response time, then it’s clearly a problem and you’d need to talk to Oracle Support to sort out the bug
  3. If it’s anything in between, make sure you don’t have an IO problem first, before telling that this is a bug. In one recent example I saw direct path reads take over a second on average when this problem popped up. The asynch descriptor resize wait event may well disappear from the radar once you fix the root cause – slow IO (or SQL doing too much IO). Remember, the asynch descriptor resize wait event, at least on Linux, is actually an IO wait event, the process is waiting for outstanding IO completion before the descriptor count increase/decrease can take place.
Source :