The system would not reboot and would not even reach the emergency console; it hung short of that, looping on the message "welcome to emergency mode...". It turned out that partitions on three of the seven hard drives in this system were corrupted (!), and I could not make the system start until I took two of those partitions offline, which I did by booting a USB Linux installation and editing fstab to disable mounting them. The three were a 3 TB SATA data drive (not a system drive, but one holding several virtual machines, 3 of which were running when the crash occurred), a 300GB SCSI drive that holds /home, and a partition on the 300GB SCSI drive that holds the system. On that last drive the corruption was not in the system partition; the corrupted partition was the one that caused the trouble with the Win 2000 VM to begin with. Of these three, the corruption of /home was not sufficient to prevent the system from starting, but I had to remove both of the others to get it to start.
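For anyone who has to do the same fstab edit from a rescue USB, here is a minimal sketch of one way to comment out an entry by mount point. The function name disable_mount and the paths in the usage note are my own placeholders, not anything from my actual system, and it prints the edited file rather than changing it in place so you can look it over first.

```sh
#!/bin/sh
# Sketch only: print a copy of an fstab file with every active entry for
# one mount point commented out. Nothing is modified on disk.
# disable_mount is a hypothetical helper name.
disable_mount() {
    # $1 = path to an fstab file, $2 = mount point to disable
    awk -v mp="$2" '
        # Leave existing comments and blank lines untouched.
        /^[[:space:]]*#/ || NF == 0 { print; next }
        # Field 2 of an fstab entry is the mount point.
        $2 == mp { print "#" $0; next }
        { print }
    ' "$1"
}
```

Assuming the installed root is mounted at /mnt from the USB system and the data drive mounts at /mnt/vmdata (both hypothetical), you would run something like "disable_mount /mnt/etc/fstab /mnt/vmdata > /tmp/fstab.new", check the result, and then copy it over the original.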
Now, the biggest problem was the 3 TB drive, which was now reporting itself as 802 GB. This implied that the partition table was screwed. This drive is organized as one encrypted ext4 partition, and I recovered the drive with no data loss, though it did take me a while to figure out how to do it. Once I figured it out, it was simple enough, though I did have to work around a bug in the one tool that seemed to be able to do what I wanted to do.
I tried running diskdrake on the drive. Diskdrake reported that it was a 3TB encrypted device with an 802GB partition and the remaining space free. So I tried to enlarge the partition to encompass the drive. I ignored the warning that this would cause me to lose all my data; I knew better. However, the attempt failed: Diskdrake did not resize the partition and threw an error about cryptsetup failing. Then, Diskdrake refused to start again until after a reboot.
As an aside, I consider it a bug when corrupted non-system drives prevent the system from booting, and a corrupted drive organization prevents diskdrake from starting.
Next I tried parted, and tried to make it create a partition that spanned the drive. It failed. I don't know why.
What succeeded was gdisk. This is a command-line program that is intended to be fdisk for large GPT disks.
To run gdisk on the damaged drive, first I opened the encrypted filesystem. Now, this particular drive is sdg in my system, so I manually opened the encrypted partition like this:
cryptsetup luksOpen /dev/sdg1 crypt_sdg1
I entered my passphrase when requested, and the drive was open.
Obviously if your drive is not encrypted you do not have to do this. If your drive IS encrypted, you DO have to do this before running gdisk or the repair won't work. Don't ask me why; I don't know. This is the outcome of trial and error I am reporting here.
I ran gdisk on the damaged drive like this:
gdisk /dev/sdg
and had it list the partitions it found. It did find one partition, gave me the size in 512-byte blocks (which was wrong), AND told me what the first useable block number and the last useable block number on the drive were. This latter information was VERY helpful. In fact, it was key.
So, using the option menu (just like in fdisk), I deleted the partition (this was partition 1, the only partition). Then, I added a new partition (as partition 1) and for the starting block of the partition, I gave the first useable block, and for the ending block of the partition, I gave the last useable block - as those were given to me by gdisk. I then wrote this partition table to the drive.
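If you want to pull the first and last useable block numbers out of the listing without copying them by hand, a small sketch follows. It assumes the listing contains a line of the form "First usable sector is N, last usable sector is M", which is what I have seen gdisk print; usable_range is a name I made up for illustration.

```sh
#!/bin/sh
# Sketch only: read a gdisk-style listing on stdin and print
# "FIRST LAST" (the first and last usable sector numbers).
# Assumes the listing has a line like:
#   First usable sector is N, last usable sector is M
usable_range() {
    sed -n 's/^First usable sector is \([0-9][0-9]*\), last usable sector is \([0-9][0-9]*\).*$/\1 \2/p'
}
```

With the drive from this post, that would be used as "gdisk -l /dev/sdg | usable_range". The -l option only lists the partition table and exits, so this step writes nothing to the disk.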
Now I ran fsck on the disk, using "fsck -y /dev/mapper/crypt_sdg1" since I wanted to pick up the opened filesystem. If your drive is not encrypted, you would run "fsck -y /dev/sdg1".
I immediately encountered an error that caused fsck to halt (because of the -y, which answered "yes" when fsck asked whether to abort). Fsck informed me that my partition showed its size in 4K blocks (note that gdisk was listing 512-byte blocks...you have to be careful here to handle the translation correctly) to be larger than the physical volume on which it was built by 8 blocks. Therefore, most probably the partition table was corrupt and fsck could not proceed.
So, OK. Back to gdisk. Recreate the partition again, this time 64 512-byte blocks smaller (that is, 8 blocks of 4K each). Back to fsck. Same error message, though this time both the volume size and the partition size were listed as 8 blocks smaller.
After fumbling with this for a while, I concluded that it was a bug in gdisk and I was going to find my partition to be larger than my volume no matter what I did. So I set the partition back to the maximum size available, and then turned once again to diskdrake.
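Since the 4K-versus-512-byte translation is easy to get wrong, here is the arithmetic as a small sketch. The function names are hypothetical, and the default of 4096 bytes is just the filesystem block size in my case; pass your own as the second argument if it differs.

```sh
#!/bin/sh
# Sketch only: convert between filesystem blocks (as fsck reports them)
# and 512-byte sectors (as gdisk reports them).
# $1 = count, $2 = filesystem block size in bytes (default 4096).
fs_blocks_to_sectors() {
    echo $(( $1 * ${2:-4096} / 512 ))
}

# Integer division: any partial filesystem block is rounded down.
sectors_to_fs_blocks() {
    echo $(( $1 * 512 / ${2:-4096} ))
}
```

For the 8-block discrepancy fsck complained about, "fs_blocks_to_sectors 8" gives 64, which is where the 64 blocks I trimmed off in gdisk came from.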
This time, when I started diskdrake, it reported my partition as occupying the entire drive, which was correct. I chose the option to resize the partition, ignored all the dire warnings, moved the slider down a bit, then moved the slider back to a partition size that encompassed the drive, then saved the changes.
Now, I once again ran fsck on the filesystem, and the error in volume size vs partition size was gone. Fsck ran to completion, fixed a number of things, rebuilt the journal, and finished normally. I then mounted the drive and had everything back...no data lost.
So, this post is just to show you how to go about it when your GPT drive gets hosed. This will work with multiple partitions as well, though of course you have to be careful about your start and end points.
Curiously enough, I am presently building a network attached storage box specifically to protect myself against the kind of disaster this almost was. Had this happened two weeks later, I might not have bothered with the recovery because I would be able to restore from NAS backup.
But anyhow, I hope this helps someone.