How to recover a large, corrupted, gpt disk

Postby jiml8 » Jul 14th, '14, 23:06

My system went down in flames yesterday, and it took hours to recover. Worst crash I have had in many years. I am not sure what happened, but the problem seems to have started in a Windows 2000 virtual machine. There was some filesystem issue: one physical drive partition could not be accessed by the VM (which was strange, because that partition is assigned to that VM for ancient historical reasons), yet I could mount and access it from the Mageia host. I won't bore you with the details, but I decided to reorganize that VM and virtualize all partitions... and something happened and the system locked up.

The system would not reboot and would not even reach the emergency console; it hung just short of there, looping on the statement "welcome to emergency mode...". It turned out that partitions on three of the seven hard drives in this system were corrupted (!), and I could not make the system start until I kicked two of those partitions offline, which I did by booting a USB Linux installation and editing fstab to disable mounting them. The three drives were a 3 TB SATA data drive (not a system drive, but holding several virtual machines, 3 of which were running when the crash occurred), a 300GB SCSI drive that holds /home, and the other 300GB SCSI drive, the one that holds the system. On that last drive the corruption was not in the system partition; the corrupted partition was the one that caused the trouble with the Win 2000 VM to begin with. Of these three, the corruption of /home was not sufficient to prevent the system from starting, but I had to remove both of the others from fstab to get it to start.
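For anyone stuck at the same point, disabling a mount from a rescue system comes down to commenting out its line in the installed system's fstab. A minimal sketch, assuming the installed root is mounted somewhere like /mnt/sysroot (a hypothetical path) and that you write to a copy first; the function name disable_fstab_mount is made up for illustration:

```shell
# Comment out the fstab entry for one mount point so the boot no longer
# depends on it. Hypothetical helper: reads an fstab file, writes the
# modified version to stdout and leaves the original file untouched.
disable_fstab_mount() {
    fstab_file=$1
    mount_point=$2
    awk -v mp="$mount_point" '
        # Leave existing comments and blank lines as they are.
        /^[[:space:]]*#/ || NF == 0 { print; next }
        # Field 2 of an fstab entry is the mount point.
        $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab_file"
}

# Example (all paths are assumptions):
#   disable_fstab_mount /mnt/sysroot/etc/fstab /data > /tmp/fstab.new
#   diff /mnt/sysroot/etc/fstab /tmp/fstab.new   # review before copying back
```

Writing to a separate file and diffing it before copying it back keeps a working fstab around if the edit turns out wrong.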

Now, the biggest problem was the 3 TB drive, which was now reporting itself as 802 GB. This implied that the partition table was screwed. This drive is organized as one encrypted ext4 partition, and I recovered the drive with no data loss, though it did take me a while to figure out how to do it. Once I figured it out, it was simple enough, though I did have to work around a bug in the one tool that seemed able to do what I wanted.
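A side note on where a figure like 802 GB can come from. This is a guess, not something confirmed in this thread: if a sector count is truncated to 32 bits (the limit of an MBR-style partition entry), a 3 TB drive wraps around to roughly that size. The arithmetic below assumes a typical 3 TB drive with 5860533168 sectors of 512 bytes; the real sector count of this drive is not given here.

```shell
# Assumption: a typical 3 TB drive exposes 5860533168 sectors of 512 bytes.
total_sectors=5860533168
sector_bytes=512

# A 32-bit sector count holds at most 2^32 sectors (2 TiB at 512 bytes).
wrap=4294967296

# What remains if the sector count is truncated to 32 bits.
wrapped_sectors=$(( total_sectors % wrap ))
wrapped_bytes=$(( wrapped_sectors * sector_bytes ))

echo "full size:    $(( total_sectors * sector_bytes )) bytes"
echo "wrapped size: $wrapped_bytes bytes"
```

Under that assumption the wrapped size is 801569726464 bytes, about 801.6 GB, which is close enough to the reported 802 GB to make a 32-bit wraparound a plausible suspect.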

I tried running diskdrake on the drive. Diskdrake reported that it was a 3TB encrypted device with an 802GB partition and the remaining space free. So I tried to enlarge the partition to encompass the drive. I ignored the warning that this would cause me to lose all my data; I knew better. However, the attempt failed: Diskdrake did not resize the partition and threw an error about cryptsetup failing. Then Diskdrake refused to start again until after a reboot.

As an aside, I consider it a bug when corrupted non-system drives prevent the system from booting, and a corrupted drive organization prevents diskdrake from starting.

Next I tried parted, and tried to make it create a partition that spanned the drive. It failed. I don't know why.

What succeeded was gdisk. This is a command-line program that is intended to be the fdisk equivalent for GPT disks.

To run gdisk on the damaged drive, first I opened the encrypted filesystem. Now, this particular drive is sdg in my system, so I manually opened the encrypted partition like this:
Code: Select all
cryptsetup luksOpen /dev/sdg1 crypt_sdg1

I entered my passphrase when requested, and the drive was open.

Obviously if your drive is not encrypted you do not have to do this. If your drive IS encrypted, you DO have to do this before running gdisk or the repair won't work. Don't ask me why; I don't know. This is the outcome of trial and error I am reporting here.

I ran gdisk on the damaged drive like this:
Code: Select all
gdisk /dev/sdg

and had it list the partitions it found. It did find one partition, gave me the size in 512-byte blocks (which was wrong), AND told me what the first usable block number and the last usable block number on the drive were. This latter information was VERY helpful. In fact, it was key.

So, using the option menu (just like in fdisk), I deleted the partition (this was partition 1, the only partition). Then I added a new partition (as partition 1); for the starting block of the partition I gave the first usable block, and for the ending block I gave the last usable block, exactly as gdisk had reported them. I then wrote this partition table to the drive.
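The two numbers can also be pulled out of a listing non-interactively. A small sketch, assuming the listing contains a line of the form "First usable sector is N, last usable sector is M", which is how gdisk prints it; the helper name usable_range and the sample numbers are made up for illustration:

```shell
# Extract the first and last usable sector from a gdisk listing.
# Reads the listing on stdin and prints "FIRST LAST".
usable_range() {
    sed -n 's/^First usable sector is \([0-9][0-9]*\), last usable sector is \([0-9][0-9]*\).*/\1 \2/p'
}

# Example with an illustrative listing (not the real numbers from this drive):
usable_range <<'EOF'
Disk /dev/sdg: 5860533168 sectors, 2.7 TiB
First usable sector is 34, last usable sector is 5860533134
EOF
```

On a live system the input would come from something like "gdisk -l /dev/sdg | usable_range", run as root. Writing both numbers down before deleting anything is the whole point.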

Now I ran fsck on the disk, using "fsck -y /dev/mapper/crypt_sdg1" since I wanted to pick up the opened filesystem. If your drive is not encrypted, you would run "fsck -y /dev/sdg1".

I immediately encountered an error that caused fsck to halt (because of the -y, which answers yes to every prompt, including the one asking whether to abort). Fsck informed me that my partition, with its size counted in 4K blocks, was 8 blocks larger than the physical volume on which it was built. (Note that gdisk was listing 512-byte blocks; you have to be careful here to handle the translation correctly.) Therefore, most probably the partition table was corrupt, and fsck could not proceed.
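The translation between the two block sizes is simple but easy to get backwards. A minimal sketch, assuming 4096-byte filesystem blocks and 512-byte sectors as in this case; the function name is made up for illustration:

```shell
# Convert a count of filesystem blocks into disk sectors.
# Defaults match this thread: 4096-byte fsck blocks, 512-byte gdisk sectors.
fs_blocks_to_sectors() {
    blocks=$1
    fs_block_bytes=${2:-4096}
    sector_bytes=${3:-512}
    echo $(( blocks * fs_block_bytes / sector_bytes ))
}

# The 8-block overshoot fsck complained about, expressed in gdisk units.
fs_blocks_to_sectors 8
```

This prints 64, so an overshoot of 8 filesystem blocks corresponds to 64 sectors in gdisk.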

So, OK. Back to gdisk. Recreate the partition again, this time 64 512-byte blocks (8 4K blocks) smaller. Back to fsck. Same error message, though this time both the volume size and the partition size were listed as 8 blocks smaller.

After fumbling with this for a while, I concluded that it was a bug in gdisk and that I was going to find my partition to be larger than my volume no matter what I did. So I set the partition back to the maximum size available, and then turned once again to diskdrake.

This time, when I started diskdrake, it reported my partition as occupying the entire drive, which was correct. I chose the option to resize the partition, ignored all the dire warnings, moved the slider down a bit, then moved the slider back to a partition size that encompassed the drive, then saved the changes.

Now, I once again ran fsck on the filesystem, and the error in volume size vs partition size was gone. Fsck ran to completion, fixed a number of things, rebuilt the journal, and finished normally. I then mounted the drive and had everything back...no data lost.

So, this post is just to show you how to go about it when your gpt drive gets hosed. This will work with multiple partitions as well, though of course you have to be careful about your start and end points.

Curiously enough, I am presently building a network attached storage box specifically to protect myself against the kind of disaster this almost was. Had this happened two weeks later, I might not have bothered with the recovery because I would be able to restore from NAS backup.

But anyhow, I hope this helps someone.
jiml8
 
Posts: 1253
Joined: Jul 7th, '13, 18:09

Re: How to recover a large, corrupted, gpt disk

Postby doktor5000 » Jul 15th, '14, 19:39

Dumb question: wouldn't it have been sufficient to run testdisk on the drive after opening the LUKS volume, do a deep search for partitions, and, if any were found, inspect their contents?
Cauldron is not for the faint of heart!
Caution: Hot, bubbling magic inside. May explode or cook your kittens!
----
Disclaimer: Beware of allergic reactions in answer to unconstructive complaint-type posts
doktor5000
 
Posts: 17659
Joined: Jun 4th, '11, 10:10
Location: Leipzig, Germany

Re: How to recover a large, corrupted, gpt disk

Postby jiml8 » Jul 15th, '14, 20:15

Testdisk failed. It did find the beginning of the correct partition, and many of the backups. It also claimed to find a bunch of other filesystems including FAT and NTFS. It ultimately concluded that a FAT filesystem was the valid one and asked me if I wanted to restore that. I fiddled with its settings and could not make it work. I ultimately concluded it did not handle gpt disks well.

I have since read more documentation about gdisk, and it has a lot of diagnostic and repair capability built in. It also has the capability to take backups of the gpt data, which should make any future restores much easier.
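For the backup part, the scriptable companion to gdisk is sgdisk, which has a --backup option. A minimal sketch; the wrapper name backup_gpt, the DRY_RUN switch, and the output path are assumptions for illustration, and the real command has to run as root:

```shell
# Save the GPT structures of a disk to a file using sgdisk --backup.
# With DRY_RUN=1 the command is only printed, which is a safe way to
# double-check the device name before running it for real.
backup_gpt() {
    disk=$1
    outfile=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sgdisk --backup=$outfile $disk"
    else
        sgdisk --backup="$outfile" "$disk"
    fi
}

# Example: print the command for the drive from this thread.
DRY_RUN=1
backup_gpt /dev/sdg /root/sdg-gpt.bak
```

The matching sgdisk option for putting a saved table back is --load-backup. Keep the backup file on a different disk than the one it describes.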

Re: How to recover a large, corrupted, gpt disk

Postby doktor5000 » Jul 15th, '14, 21:31

Thanks for the clarification, appreciated.

I've also had my first encounter with gdisk pretty recently, when growing a 3TB partition to 4.5TB online by recreating it.
I liked gdisk overall, even though I forgot to record the old partition boundaries and had to consult /proc/partitions for the new size in sectors. I was glad gdisk could easily be switched to sectors as units :)
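One thing to watch when taking sizes from /proc/partitions: its "#blocks" column is in 1 KiB units, so the number has to be doubled to get 512-byte sectors. A small sketch; the helper name and the sample line are made up for illustration:

```shell
# Print the size of one device from /proc/partitions-style text in
# 512-byte sectors. The "#blocks" column (field 3) is in 1 KiB units,
# so the sector count is twice that number.
partition_sectors() {
    dev_name=$1
    # %.0f keeps large values from being printed in exponent notation.
    awk -v dev="$dev_name" '$4 == dev { printf "%.0f\n", $3 * 2 }'
}

# Example with an illustrative line (major, minor, #blocks, name):
printf '   8       97 2930265543 sdg1\n' | partition_sectors sdg1
```

On a live system the input would simply be "partition_sectors sdg1 < /proc/partitions".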

Re: How to recover a large, corrupted, gpt disk

Postby jiml8 » Jul 19th, '14, 04:14

As a followup on this, my system has been unstable since that crash. It went down again this morning, but without the damage.

When bringing it back up, I discovered that the error of volume size smaller than partition size on that drive had returned, so what I did to fix it apparently was never written to disk. I have fooled with it for a bit, and I don't seem to be able to fix it properly with any tool I have. I "fixed" it again using diskdrake, but this won't survive a reboot (or a crash).

As I write this, my new NAS is initializing its 32 TB RAID-6 array. Based on current progress, I expect the array to be initialized in about 4 days (lol...big array, so the times are long). In a few days I will just copy the entire contents of this disk onto that array and then reformat the disk to solve the problem, unless I figure out how to solve it between now and then.

As for my system instability, I believe that the Nvidia driver version 337.25 is responsible. I have rolled back to 331.49, which I used without incident for several months.

Re: How to recover a large, corrupted, gpt disk

Postby jiml8 » Aug 2nd, '14, 18:45

And as a continuation of the drama, rolling back the Nvidia driver did stabilize my system. It seems fine again.

I had to reboot a few days ago after installing some Mageia updates, and this time the damaged drive did survive the process with its partition table intact.

Unfortunately, because of the way this whole thing went, I am not exactly certain how I actually repaired it. After the last crash, when I restarted the system, I just immediately used gdisk without first opening the drive's encryption, then used diskdrake the same way. No errors were reported and the drive apparently was fine at the end.

So, it could be that the drive should have been repaired while completely closed to begin with, and it could be that it needed to be repaired both open and closed.

In any event, it is now repaired, so I did it. Thus, this thread really is a how-to, even if not a very clear one. It does tell you that gdisk will do the job where other tools would not, even if gdisk does not do it quite right. But gdisk and diskdrake together will recover the drive. I guess that is the real takeaway. If someone else here winds up having to do this, you might want to update this thread with exactly what worked.