skinnyfat
Guest
www.head-fi.org
Head-fi said: 2007-11-13 0857 EST
This is obviously our worst outage in the history of Head-Fi.org. What happened was that we had Head-Fi.org's files and backups moved to a multi-terabyte network-attached storage (NAS) unit while we continued to work on the proper implementation of a true clustering configuration for Head-Fi.org. From what we can tell, this particular NAS unit--with a reputation for being ultra-reliable--had one of its 12-channel RAID controllers malfunction.
This particular NAS unit is a 24-drive unit, made up of two 12-drive arrays, each array with two parity drives (RAID 6). Maybe we put too much faith in it, but we thought we were safe housing everything on it for the time being. From what we're being told, when the controller card malfunctioned, it messed up the NAS unit's logical volume, which is where we're at now. We are working closely with the vendor and the technical support team in Europe to restore the logical volume and get the NAS back up again. We feel confident we will be able to restore Head-Fi to its state just before its outage, but we won't know for sure whether we'll have to fall back to a backup, of which there are many on the NAS. Unfortunately, the only off-NAS backups we have of Head-Fi.org's databases are quite old, meaning we'd potentially lose thousands of posts, so I will not put Head-Fi.org back up until we know for sure the status of the logical volume restoration.
The repair was well under way yesterday when the repair process ran out of RAM. (The NAS has four gigabytes of RAM.) Since, for a number of reasons, the repair process was being run almost entirely from RAM, the four gigs was apparently not enough. I have ordered 16 gigabytes of RAM, which will arrive this morning, immediately after which I will head to the datacenter to install it; then the team in Europe can commence with the remotely administered repair process(es). The repair has gone slower than we anticipated, and running out of RAM yesterday was an unfortunate setback. But, once again, 16 gigabytes of RAM (versus the four gigabytes in there now) should be arriving this morning, whereupon we will immediately call our friends in Europe so they can continue with the repair work. We already know some data was lost, but hope and pray that what we do retrieve will be enough to let us get the site back up this evening.
We know we should have been more diligent about keeping more backups off the NAS, but with two 12-drive arrays, each array in RAID 6 (two parity drives each, for a total of four), and given that our previous, self-built NAS units ran without problems for over seven years, we felt we were safe in keeping them there until we were finally through with the proper clustering we've intended for months.
All I can do at this point (other than what we're doing above) is to apologize to you all for the outage. Though Head-Fi isn't what I do for a living, it is very important to me as a gathering place for friends, and I know it is for many of you, too. Once again, I'm sorry about this extended outage, and will continue to work on it until we're back up (hopefully tonight).