Why do you restart servers when you replace RAID disks? Haven't you heard of hot-swapping?
On our blog, we occasionally mention that we’re restarting a server to replace a failing disk in a RAID array.
In theory, this isn’t necessary. So-called “hot swap” RAID should make it possible to replace a hard disk without restarting the server.
However, to put it bluntly, we don’t trust it. There are all sorts of horror stories on the Internet about hot-swap RAID disasters, even on high-end RAID hardware. If something goes wrong while all the disks are still powered on, it could cause problems on disks that were previously working fine.
And human error, such as pulling the wrong disk out of a RAID array, would cause vastly more problems if the server were still running.
We have a policy of being extremely careful where data integrity is involved, so we intentionally don’t swap hard disks in RAID arrays without shutting down the affected server first. The roughly five minutes of downtime is undesirable, but we consider it worth the trade-off.