Oracle is a software program that is at once relatively simple and devilishly complex. Want to do a backup and don’t mind some downtime? Just copy the data files to another location after stopping the server. It’s like backing up a Word document. Want to implement a RAC (Real Application Clusters)? Well, clear your calendar, because that’s like rebuilding an engine. There are database administrators who have devoted their entire careers to understanding and maintaining Oracle systems. Given the many flavors and the longevity of the products, this is a necessary and often worthwhile endeavor. I have heard that the RAC is one of the most complex pieces of this puzzle.
I have done Microsoft Failover Clustering on both Windows 2003 and 2008 servers. MS DBAs often complain about its level of complexity, but implementing just a simple Oracle RAC on a VMware server made that look like junior-level busy work. Let me now tell you about the trials and tribulations of completing an Oracle RAC.
VMware is the reference for software virtualization. I use it at work, and I use it at home for experiments and for software that will only run on Linux. It is easy to configure, and there are three solid free offerings that let you learn how to use it: ESXi Server, Server 2.0, and Player. If you get Player, just google “free VMX creator” and you’ll find tools that create the initial virtual machine on which you can install Windows or Linux. When you configure Server, which is necessary for the Oracle RAC experiments, be sure to use the Bridged network connection. I tried for a couple of hours to get NAT working, but after looking at Wireshark and seeing my packets go out but never come back, I decided it was not worth it. Bridged worked like a charm.
I downloaded and configured VMware Server, and was ready to begin.
Downloading Oracle Software
Oracle, unlike some other major DB vendors, does not mind you downloading and installing software that normally would cost more than your house. They encourage it, in fact. Read over the license and make sure you know the limits (absolutely no commercial activity, only development work). They release a flavor of Linux called Oracle Enterprise Linux v5, which I used for my RAC. I also used version 11gR1 for the clusterware and the database. If you don’t know, the clusterware, or grid infrastructure, is the software that manages the multi-server architecture for the RAC. You used to have to buy this from another vendor, but now Oracle ships its own.
I followed the aggregated advice of several blogs that took you step by step through the process. You can google Oracle RAC, and you’ll probably find one. Something to note is that you can get hung up on little details that vary from version to version. For instance, 11gR2 clusterware is really different from 11gR1, so watch out.
After configuring a VM, I installed the software and copied the VM to create a second node. Then trouble occurred. I run a fairly beefy box at home with an SSD. For some reason, the VMs just tanked on it and would take forever on startup (15 minutes). I could not pinpoint a bottleneck. I had to transfer the VMs to a laptop (which is also rather beefy, but not as much so as my desktop), and then they started up just fine.
After getting the two VMs to actually boot up within a minute or so, I began my installation of the Oracle clusterware. I spent a while working through dependencies as well as sorting out the differences between 64-bit and 32-bit packages. After working out a few other kinks, it went swimmingly. The installation went just as the various tutorials said it would, and I thought, “Well, what are these DBAs complaining about?” I also liked the command-line centric nature of the process, and I realized that this would be very easy to automate. After installing the software on both nodes, and adding the second node, disaster struck.
One of the tutorials I was reading recommended that you take snapshots of the nodes after each major change, in case you had to revert. This was a life-saver. There are certain mistakes you can make that for whatever reason cannot be undone easily. For some reason, when I finished joining the second node, both servers would randomly restart.
I spent hours trying to figure out why. I had never really dug into Linux like this before. I was doing basic troubleshooting like checking logs and looking for serious bottlenecks, but this was greatly impeded by the fact that the nodes kept restarting.
What I found out was that if a node did not meet certain performance criteria, it would be evicted from the cluster and automatically restarted. I could not find a single manual on this, only bits and pieces scattered about. With Microsoft Clustering Services, there is a central event log, and a node is not restarted when it fails to meet certain requirements. Once I figured out that the restarts were due to the clusterware, I was able to find logs telling me why it was not working.
I found that my laptop was failing on three different counts: disk latency, CPU latency, and network latency. I guess I should have expected this, but it certainly made me wonder what happens in production when there’s a performance bottleneck on any one of these.
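To make the eviction idea concrete, here is a minimal sketch of a heartbeat check. It is not Oracle's implementation: the class names, the field names, and the threshold numbers are all made up for illustration, and it only models network and disk heartbeats (a starved CPU would presumably show up as late heartbeats on both).

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical limits in seconds; not Oracle's parameter names or defaults.
    network_timeout: float
    disk_timeout: float


@dataclass
class NodeStatus:
    name: str
    last_network_heartbeat: float  # timestamp in seconds
    last_disk_heartbeat: float     # timestamp in seconds


def eviction_reasons(node, now, limits):
    """Return the reasons a node would be evicted; an empty list means healthy."""
    reasons = []
    if now - node.last_network_heartbeat > limits.network_timeout:
        reasons.append("network heartbeat missed")
    if now - node.last_disk_heartbeat > limits.disk_timeout:
        reasons.append("disk heartbeat missed")
    return reasons


now = 1000.0
# A slow node: last network heartbeat 40 s ago, last disk heartbeat 10 s ago.
slow_node = NodeStatus("node2", last_network_heartbeat=960.0, last_disk_heartbeat=990.0)

strict = Thresholds(network_timeout=30.0, disk_timeout=200.0)
relaxed = Thresholds(network_timeout=120.0, disk_timeout=600.0)

print(eviction_reasons(slow_node, now, strict))   # ['network heartbeat missed']
print(eviction_reasons(slow_node, now, relaxed))  # []
```

The same slow node is evicted under the strict limits and tolerated under the relaxed ones, which is the trade-off you face on underpowered test hardware.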
I found various commands that would allow me to keep the cluster going for hours without a node restarting, by setting the latency thresholds for the aforementioned bottlenecks much more loosely than you would want in production. I also learned more than I wanted to know about logging in Oracle, as well as digging into the fundamentals of the clusterware. Finally, I got both nodes stable and ready for the database install.
I finally got the database installed after two tries (the first try suffered a restart while I was configuring it). The database came up and I was able to select and insert! I was done, after two weeks of playing around with it. I quickly shut down and made a copy of it.
I bought a normal but fairly fast 10K HDD, and put the VMs on that. They ran like a charm and I was able to start up the database. I found the docs and a tutorial on how to change the IPs (I wanted to switch networks). On the first try, I totally messed it up and the cluster would not start. I tried again after realizing my mistake, but still goofed up the IPs. The third time was the charm, and I finally got it right and the cluster came up beautifully! What I had messed up was accidentally changing the cluster interconnect. As with all clusters, there is a universal truth: don’t mess with the private networks.
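A small sanity check before applying an IP change would have saved me two tries. The sketch below is only an illustration of that idea: the interface names, the addresses, and the private subnet are invented, and the function is my own helper, not part of any Oracle tool. It flags any planned change whose old or new address falls inside the interconnect subnet.

```python
import ipaddress


def changes_touching_private(planned_changes, private_subnet):
    """Return interface names whose old or new address is in the private subnet."""
    private = ipaddress.ip_network(private_subnet)
    flagged = []
    for iface, (old_ip, new_ip) in planned_changes.items():
        old_in = ipaddress.ip_address(old_ip) in private
        new_in = ipaddress.ip_address(new_ip) in private
        if old_in or new_in:
            flagged.append(iface)
    return flagged


# Hypothetical plan: interface -> (current address, intended address).
planned = {
    "eth0": ("192.168.1.101", "192.168.2.101"),  # public address moving networks
    "eth1": ("10.0.0.1", "10.0.1.1"),            # interconnect touched by mistake
}

print(changes_touching_private(planned, "10.0.0.0/24"))  # ['eth1']
```

If the list comes back non-empty, stop and recheck the plan before touching the cluster.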
An Oracle RAC requires knowledge of both the database and the grid software. It is not easy to understand everything all at once, so start out easy. The grid software makes use of things like Oracle’s networking software, SQL*Net, and its file structures are similar to those of the database software. You can think of the two as separate pieces of software, but the database still needs the grid software.
I definitely gained respect and appreciation for Oracle DBAs, and now I have joined the ranks of people who have installed an Oracle RAC. When will you?