I was asked to assist in troubleshooting an interesting problem recently, and I thought it would be good to share my methodology in resolving it. There seemed to be a gradual performance degradation on one of our 18.104.22.168 systems, starting with ORA-01013 errors being reported, seeing lots of "resmgr:cpu quantum"wait events and finally bouncing the database.
All of this happened after hours and the following morning all the previous "issues" were no longer present. Now, however, there was a backlog of data and the database just wasn't loading it quickly enough. And this is where I joined the fray.
Since boilers were being setup, I decided to wipe the proverbial slate clean and consider the database in its (Then) present state. Looking at the OEM Top Activity page as below, it was quick to identify the process where the "backlog" was prominent.
Next step was to try and figure out why an INSERT statement was requiring so much CPU time. I was quite astonished with what I found when I drilled into the statistics of that SQL_ID
2.3 million logical I/O's for an INSERT????? That surely can't be right. I did an ASH report as well, but unfortunately that lead me no closer to finding the problem.
I needed slightly more detail and decided to turn to my trusty old friend (Shout out to @CaryMillsap), the extended SQL trace file! I had heard about a trick to trace only specific SQL_ID's, and was finally able to put that to use as below.
After about 2 minutes, it was turned off and we found a tracefile by grepping for the SQL_ID. TKPROF was called into action, and things started making more sense.
It turned out that most of the logical I/O's were being executed as part of recursive SQL related to the recyclebin!
In order to appease the awaiting masses in the impending boiler, we decided to purge the recyclebin. Within about 30 minutes, we started seeing significant improvements on the response time of the INSERT statement, and the backlog was gradually depleted over the next few hours.
Having said that though, the PURGE DBA_RECYCLEBIN command is still running (As of 14 December 2017), even though we also uncovered MOS note Purging RECYCLEBIN Running Forever (Doc ID 2284986.1) and implemented the suggested workaround. As a matter of interest, the original count in the recyclebin was over 4.5 million (At the time of writing this it's down to 2.3 million)
My best guess is that there was an attempt(s) to reclaim space from the recyclebin, and because the cleanup thereof wasn't completing very quickly, the performance of the INSERT suffered.