So, what does failure sound like? I am not sure, but I think I know what impending failure looks like. It looks like this:
end_request: I/O error, dev sda, sector 844726271
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
sd 0:0:0:0: [sda] CDB: cdb=0x28: 28 00 32 58 e9 1f 00 00 38 00
This is what I found in the dmesg output on my server tonight after a large copy failed. When the copy failed I started poking around, and the entire /home partition went read-only on me and then, right after, went offline completely. Talk about getting worried… I checked the logs, of course, and saw the above nastiness.
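For the curious, the CDB line in that log can be picked apart by hand. Here is a minimal sketch, assuming the bytes follow the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian starting block address in bytes 2 to 5, and a 2-byte block count in bytes 7 and 8). The function name decode_read10 is just something I made up for illustration, not part of any tool.

```python
def decode_read10(cdb_hex):
    """Decode a READ(10) CDB given as a hex string such as '28 00 ...'.

    Returns (starting_lba, block_count). Assumes the standard READ(10)
    layout; anything else raises ValueError.
    """
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: starting block
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = decode_read10("28 00 32 58 e9 1f 00 00 38 00")
    print(f"read of {blocks} blocks starting at LBA {lba}")
```

In other words, the command that failed was a plain read request, which fits with a copy dying partway through.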
On reboot the drive came back up just fine. The problem, of course, is that the failure is a sign of bad things to come. I shut down the server to preserve whatever life the drive may have left until I can secure a replacement. I simply have no place to store 824 GB of data right now, and I don’t want to risk burning up what little life it has left until I do.
Of course, since I have to shut the thing down, what better time to work out an upgrade? Nothing like a little scare to remind one that RAID is a good idea. I am pricing the parts: a pair of 2 TB HDs for /home, two 640 GB HDs for / and such, and a 1.5 TB HD to replace the failing unit. That would put all the drives into pairs for RAID1 in all cases. At the same time I can go ahead and install the hot swap cages.
Now to cross my fingers and hope the drive can survive an 824 GB data transfer before it dies…