Road To OCM: Buffer Busy Waits : Common Causes, Diagnosis, and Actions

A session that reads or modifies a buffer in the SGA must first acquire the cache buffers chains latch and traverse the buffer chain until it finds the necessary buffer header. Then it must acquire a buffer lock or a pin on the buffer header in shared or exclusive mode, depending on the operation it intends to perform. Once the buffer header is pinned, the session releases the cache buffers chains latch and performs the intended operation on the buffer itself. If a pin cannot be obtained, the session waits on the buffer busy waits wait event. This wait event does not apply to read or write operations that are performed in sessions’ private PGAs.

Keep the following key thoughts in mind when dealing with the buffer busy waits event.

The Oracle session is waiting to pin a buffer. A buffer must be pinned before it can be read or modified. Only one process can pin a buffer at any one time.
buffer busy waits indicate read/read, read/write, or write/write contention.
The appropriate action to take depends on the reason encoded in the P3 parameter.

The following pieces of information are essential for diagnosing the buffer busy waits problem when it is the leading bottleneck that slows a process down.
The primary reason code that represents why a process fails to get a buffer pin.
The class of block that the buffer busy waits wait event is for.
The SQL statement that is associated with the buffer busy waits event.
The segment that the buffer belongs to.

A session may fail to get a pin on a buffer header for various reasons. The reasons are represented by a set of codes. Up to Oracle9i Database, a process that waits on the buffer busy waits event publishes the reason code in the P3 parameter of the wait event.

This is typical in applications that spawn multiple concurrent threads or sessions, and each one executes the same query that goes after the same data set. In this case, you can check the sessions’ logon time in the V$SESSION view, and chances are you will find them only a few seconds apart. When multiple sessions request the same data block that is not in the buffer cache, Oracle is smart enough to prevent every session from making the same operating system I/O call. Otherwise, this can severely increase the number of system I/Os. Instead, Oracle allows only one of the sessions to perform the actual I/O, while others wait for the block to be brought into the buffer cache. The other sessions wait for the block on the buffer busy waits event and the session that performs the I/O waits on the db file sequential read (or the db file scattered read) wait event. You will notice that the buffer busy waits and the db file sequential read events share the same P1 (file#) and P2 (block#) values.

Reason code 220 indicates there are multiple sessions trying to concurrently modify different rows within the same block that is in the buffer cache. This symptom is typical in applications with high DML concurrency. Unfortunately, a block can be pinned by only one process at any one time. The other concurrent processes must wait on the buffer busy waits wait event until the first change is complete. This is a good thing; otherwise the block will be corrupted.

If the majority of the buffer busy waits wait events are centered on data blocks (class #1) and the reason code is 130, this shows the application runs multiple sessions that query the same data set at the same time. (You will only know this if you query the V$SESSION_WAIT view repeatedly or trace the session with the 10046 event or use the data sampling methods) This is an application issue. There are three things you can do to minimize this problem:

Reduce the level of concurrency or change the way the work is partitioned between the parallel threads.
Optimize the SQL statement to reduce the number of physical and logical reads.
Increase the number of FREELISTS and FREELIST GROUPS.

From our experience, it is very difficult to get the application to reduce the level of concurrency. It may not be a good idea because it limits scalability. However, there are differences between scalability and a blind attempt by the application to improve performance by spawning multiple sessions. So far, SQL tuning has worked wonderfully to reduce the occurrences of buffer busy waits. Check the SQL execution plan and optimize the SQL statement to use the most effective join method and access paths that reduce the number of physical and logical reads.

If the majority of the buffer busy waits wait events are centered on data blocks and the reason code is 220, this indicates there are multiple sessions performing DML on the same object at the same time. In addition, if the database block size is large (for example, 16K and above), it can only intensify this symptom as larger blocks generally contain more rows per block. There are three things you can do to minimize this problem:

Reduce the level of concurrency or change the portioning method.
Reduce the number of rows in the block.
Rebuild the object in another tablespace with a smaller block size (Oracle9i Database and above).

Again, as mentioned earlier, it may not be practical to limit scalability by reducing the level of concurrency.

If the data blocks belong to tables or indexes, then consider rebuilding the objects to reduce the number of rows per block and spread the data over a larger number of blocks. For example, you can rebuild a table or an index with a higher PCTFREE. In some cases, we have rebuilt indexes with PCTFREE as high as 50 percent. The downside to this is that index range scans and index fast full scans will be slower. You can also alter the table to minimize the number of rows per block with the ALTER TABLE table_name MINIMIZE RECORDS_PER_BLOCK command. Starting in Oracle9i Database, you can move or rebuild the object in another tablespace with a smaller block size. While these actions can minimize the buffer busy waits problem, they will definitely increase full table scans time and disk space utilization. As the saying goes, there is no such thing as a free lunch.

If the majority of the buffer busy waits wait events are centered on data segment headers (that is, the table or index segment header and not the undo segment header), this usually means some tables or indexes in the database have high segment header activities. Processes visit segment headers for two main reasons—to get or modify the process FREELISTS information and to extend the high watermark (HWM). There are three things you can do to minimize this problem:

Increase the number of process FREELISTS and FREELIST GROUPS of the identified object.
Make sure the gap between PCTFREE and PCTUSED is not too small.
Make sure the next extent size is not too small.

The first step is to identify the segment name and type. They can be identified from the P1 and P2 parameters of the buffer busy waits wait event. Then you can alter the object to increase the number of process FREELISTS. If time and opportunity permit, you should also increase the number of FREELIST GROUPS by recreating the object. You should create all data segments with a minimum of two FREELIST GROUPS, by default—even in a single-instance database.
If you do not want to mess with FREELISTS and FREELIST GROUPS, you can rely on the Automatic Segment Space Management (ASSM) feature to scatter the incoming data from the insert statements.

Finally, you should check the next extent size of the identified segment. A high insert rate combined with a small next extent size can cause frequent insertion of new entries into the extent map located in the segment header. Consider altering or rebuilding the object with a larger next extent size. If the object resides in a locally managed tablespace, consider moving the object into a reasonable uniform-size locally managed tablespace.

If the majority of the buffer busy waits wait events are centered on undo segment headers, this indicates there are either too few rollback segments in the database or their extent sizes are too small, causing frequent updates to the segment headers. If you use the system-managed undo introduced in Oracle9i Database, you shouldn’t have to deal with this problem as Oracle will create additional undo segments according to demand.

If the majority of the buffer busy waits wait events are centered on undo blocks, this usually means there are multiple concurrent sessions querying data that is being updated at the same time. Essentially the query sessions are fighting for the read consistent images of the data blocks. This is an application issue and there is nothing amiss in the database. The problem should go away when the application can run the query and DML at different times.

Oracle maintains a number of instance-level statistics on buffer busy waits. These statistics can give you a rough idea of what you are dealing with, but the information may not be specific enough for you to formulate a corrective action. They are mentioned here for the sake of completeness.

The view X$KCBWAIT (kernel cache buffer wait) is the base view for the V$WAITSTAT view, which keeps track of buffer busy waits contentions by block class. The class with the highest count deserves your attention, but unless you are also monitoring the buffer busy waits symptom at a lower level, you don’t have a clear direction to proceed. Let’s say the data block class has the highest count of all. Which segment was affected most, and why were the sessions unable to pin the buffers? Did they fail to get the pins while attempting to read or change the blocks? Unfortunately, Oracle does not keep track of buffer busy waits by block class and reason code. We hope someday Oracle will provide a matrix of buffer busy waits by SQL statement, segment name, and block class.

Road To OCM

Saturday, August 4, 2012

Buffer Busy Waits : Common Causes, Diagnosis, and Actions

No comments:

Post a Comment