Dr. Z
19-01-2007, 01:06
I got a call at about 6pm tonight - the shock radio webserver is no longer serving ANY pages (Apache running on Gentoo). Despite the fact that I am one of only two people the chief engineer can contact who know about Linux, I still have no root access (or any shell access at all) to the box - so I get on MSN to someone who does and start the process of finding out what's wrong.
It accepted an SSH session but the box was apparently shutting down ... hmm. Wait a few minutes and ssh back in - everything is royally buggered, services up, down, left, right, who knows what. dmesg indicated a problem with the fileserver. I get my coat on....
Get to the studio, walk into the server room to find that the webserver is halfway down (refusing logins but still multi-user) with tty1 just showing its progress through shutting down...
The fileserver isn't much healthier - RAID5 in degraded mode. Now, if we were running something GOOD this wouldn't be too much of an issue, but we have a crap RocketRAID SATA controller with FreeNAS on top. It's obviously spat out at least one disk, but I hear from the station manager that "it stopped working yesterday, then came back so I thought it was fine" - now FreeNAS is totally unresponsive, the web interface is down, the box won't reboot and god only knows what else. The thing is, FreeNAS appears to be buggered but it can see individual drives (ad2, 4 and 6) and THEN an array ar0, rather than just seeing one array. Is it hardware RAID? The RAID card thinks so. Is it software RAID? FreeNAS thinks so.
The webserver died because, for some reason, something critical is in a folder on the fileserver. What's the point in that? WHY? It introduces so many ways to bring the webserver down it's unfunny. The webserver isn't the only thing that runs on that box either, so it's not like it crashing only takes out our web presence!
The data on this array has NEVER been backed up and is not only irreplaceable but required to be kept by law. The penalty for losing this stuff is a fairly hefty fine and/or imprisonment. Quite why they felt like using GOD DAMN MAXTOR DRIVES for this I don't know. F****** ridiculous.
Going to rebuild the array tomorrow hopefully (!) and promptly make a backup. Thankfully it's "only" 400GB :/
Just needed a bit of a rant...