One great feature of Nimble arrays is being able to upgrade resources without downtime. On one of our arrays the cache hit ratio was constantly below 80%, and Nimble's InfoSight statistics showed we needed at least 10% more cache than we had to keep up with activity. Fortunately we had budgeted for this upgrade, which turns the unit into a CS220G-X2 array (going from 320 GB of SSD to 640 GB).
The upgrade process is very simple: once the upgrade kit has been purchased and arrives on site, the drives are replaced one at a time. I'm usually full speed ahead on things like this, but I had my reservations about the process. I know all of the data held in cache is also written to the SATA drives, but there's always a part of me thinking the system will come crashing down.
While watching the console of the array, I removed the SSD from slot 7 and replaced it with a new 160 GB drive. After refreshing the Management/Array page a couple of times and not seeing the new drive, I took a look at the Events page. It took just over 2 minutes for the system to recognize the new drive and then display "The CS-Model has changed. The model is now CS220G-X2." The Array page then showed the correct drive in slot 7 and the additional 80 GB of cache for the system. I waited 3 minutes before replacing the drive in slot 8, and it showed up on the Array page within a minute.
From researching Nimble arrays and passing their SE Certification exam, I knew data goes to cache only on reads and writes, with pre-fetching limited to the blocks needed to complete current read requests. That means the new drives aren't automatically "rebuilt" with the data that was on the old drives, and it also means there's a substantial performance hit if all 4 drives are replaced within a short period of time. I browsed to the performance tab to see how bad it was. The first drive was replaced at 3:00 PM, which is a slow time for this array. You can see how much data had to be pulled from SATA (a lower cache hit ratio means more data pulled from SATA). I took the following screenshots to show the next 5 hours and then the next morning, when the cache hit ratio finally climbed back up.
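To make that cold-cache effect concrete, here's a minimal simulation of my own (a sketch, not Nimble's actual caching algorithm): a simple LRU cache over a skewed read workload. Clearing the cache, which is roughly what swapping in empty SSDs does, drops the hit ratio, and it only climbs back as reads repopulate the cache.

```python
import random
from collections import OrderedDict

# Hypothetical illustration only -- NOT Nimble's actual caching algorithm.
# A simple LRU cache over a skewed read workload shows why the hit ratio
# collapses when empty SSDs are swapped in and only recovers as reads
# repopulate the cache.

CACHE_BLOCKS = 2500        # blocks the "SSD cache" can hold
HOT_BLOCKS = 2000          # frequently re-read working set
COLD_BLOCKS = 18000        # rarely re-read blocks
READS_PER_INTERVAL = 1000

def run_interval(cache, rng):
    hits = 0
    for _ in range(READS_PER_INTERVAL):
        if rng.random() < 0.8:                    # 80% of reads hit the hot set
            block = rng.randrange(HOT_BLOCKS)
        else:                                     # 20% scattered across cold data
            block = HOT_BLOCKS + rng.randrange(COLD_BLOCKS)
        if block in cache:
            hits += 1
            cache.move_to_end(block)              # refresh LRU position
        else:
            cache[block] = True                   # cache is populated on the read
            if len(cache) > CACHE_BLOCKS:
                cache.popitem(last=False)         # evict least recently used
    return hits / READS_PER_INTERVAL

rng = random.Random(1)
cache = OrderedDict()

for _ in range(10):                               # warm the cache up first
    ratio = run_interval(cache, rng)
print(f"steady-state hit ratio: {ratio:.0%}")

cache.clear()                                     # simulate installing empty SSDs
for i in range(10):
    print(f"interval {i} after swap: hit ratio {run_interval(cache, rng):.0%}")
```

It's a toy model, but the shape of the curve matches what the performance tab showed: a sharp drop right after the swap, then a gradual recovery as reads bring the hot data back into cache.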
After seeing the very low levels, and watching the activity lights on each drive, I decided to wait until the next morning to replace the last two drives. That gave a couple of disk-intensive operations that run overnight a chance to be served from cache, and anything that wasn't cached would (hopefully) be added to the new drives. When I replaced the last two drives the next morning, I saw the same cache hit behavior as above. Also, since these drives are not in any form of RAID with each other, they don't all have to be replaced at the same time.
Depending on the load of the system, it can take a few days for the cache hit ratio to stay above the desired 80% threshold.
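If you'd rather not keep eyeballing the performance tab, a quick script can flag when the hit ratio has held above that threshold. The sketch below assumes a hypothetical CSV export of the array's performance history with timestamp and cache_hit_pct columns; the file name and column names are placeholders, not an actual Nimble export format.

```python
import csv
from datetime import datetime

# Assumptions (hypothetical): a CSV export of the array's performance history
# with "timestamp" and "cache_hit_pct" columns. The file name and column names
# are placeholders, not an actual Nimble export format.

THRESHOLD = 80.0          # desired cache hit ratio, in percent
SUSTAINED_SAMPLES = 24    # e.g. 24 hourly samples ~= a full day above threshold

def warmed_up(samples):
    """True if the most recent SUSTAINED_SAMPLES readings are all at or above THRESHOLD."""
    recent = samples[-SUSTAINED_SAMPLES:]
    return len(recent) == SUSTAINED_SAMPLES and all(
        hit >= THRESHOLD for _, hit in recent
    )

with open("array_perf_export.csv", newline="") as f:
    samples = [
        (datetime.fromisoformat(row["timestamp"]), float(row["cache_hit_pct"]))
        for row in csv.DictReader(f)
    ]

below = sum(1 for _, hit in samples if hit < THRESHOLD)
print(f"{below} of {len(samples)} samples below {THRESHOLD:.0f}% (latest: {samples[-1][0]})")
if warmed_up(samples):
    print("Hit ratio has held above the threshold -- cache looks warmed up.")
else:
    print("Still below the threshold recently -- give it more time.")
```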