I have encountered a problem with my 2 TB Samsung 980 Pro SSD and, strangely enough, a DuckDuckGo search turns up very little that is relevant. I have looked around on Reddit and on the Samsung site, and I am seeing almost nothing. So either my problem isn't a problem, or I am just very unlucky. I am looking for any information anyone here might have.
I purchased this SSD last June and it is installed in my Asus X570 Plus motherboard, with a Ryzen 7 5800X CPU and 128 GB of Crucial RAM. The SSD is in the M.2 slot closest to the CPU, which gives me PCIe 4.0 capability.
The SSD contains my system, and also contains all of my active VMs, with "active" being defined as used at least occasionally - in other words, not archived. The total of all these VMs is around 900 GB of storage, so the SSD is half-full with VMs.
The virtual disks of these VMs are divided into multiple files which (depending on the VM) range from 2 GB to 4 GB per file. There is one FreeBSD VM that has its entire 10 GB virtual disk in one file, but it is the exception (and, as it happens, that 10 GB file is one of the ones that exhibited the problem).
I noticed about a week ago, when backing up some of these VMs (which I do from time to time, manually), that I was getting I/O errors from the rsync command that did the backup. Some investigation confirmed that the I/O errors originated from the SSD, and that several virtual disk files across several VMs were affected, as well as the virtual memory files for two VMs.
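For context, the backup is nothing exotic, just an rsync per VM along these lines (the paths here are illustrative, not my actual layout):

    rsync -av --progress /vmstore/somevm/ /mnt/backup/somevm/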
All the VMs had been working without apparent problems (with the exception of one OpenSUSE VM that had been showing occasional anomalous behavior).
I unmounted the partition and ran e2fsck, which reported that the filesystem had no errors.
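Concretely, something like this (the mount point and partition name are illustrative; substitute whatever the VM store actually lives on):

    umount /vmstore
    e2fsck -f /dev/nvme0n1p4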
Running both smartctl -a and the nvme command turned up multiple media errors, and initially indicated that 95% of the device's spare NAND remained unused.
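Specifically, the media error count and available-spare percentage I'm quoting come from output like this (the device node will vary):

    smartctl -a /dev/nvme0
    nvme smart-log /dev/nvme0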
Running e2fsck -c to do a badblocks scan turned up multiple bad blocks on the SSD. Given the nature of an SSD, I can't map those out myself, but finding them (and the files affected) was useful. While this scan was running (it took hours), the number of media errors reported by both smartctl and nvme increased sharply: when I started there were some 790 (I don't remember exactly) media errors reported; by the end of the full scan there were 12,384.
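Concretely, the scan was just e2fsck's read-only badblocks pass, something like this (partition name illustrative, as before):

    e2fsck -fc /dev/nvme0n1p4

A single -c does a read-only test and adds anything it finds to the filesystem's bad-block inode.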
Also, during this scan, my spare NAND declined from 95% to 91%, so clearly the SSD controller was mapping out the bad NAND as it was discovered during the badblocks scan.
So, evidently, a section of NAND in the SSD has failed, and the controller failed to find it and map it out transparently; it became visible to me - and this is not supposed to happen. Some forensic analysis suggests this problem first occurred sometime in the 4th quarter of last year, and went undetected until now.
After the badblocks scan, the cleanup of multiply linked files, and the other things e2fsck does, I still had I/O errors in the particular VM files that had displayed them before, which prevented me from copying them to backup storage.
I solved this by using dd with conv=noerror,sync to copy the files that were showing damage. This worked and produced a new copy of each file with the damaged sections replaced by zeros. I then started each affected VM and let its filesystem checker run; after that, all of them appear to be OK.
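For anyone who hits the same thing, the recovery copy was essentially this (filenames are just examples):

    dd if=vmdisk-s012.vmdk of=vmdisk-s012.vmdk.fixed bs=64K conv=noerror,sync status=progress

One caveat: noerror,sync pads each unreadable input block with zeros, so a smaller bs limits how much good data gets zeroed around each bad spot, at the cost of speed.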
So, at this point I seem to have the mess cleaned up, and the SSD says it still has 91% of its spare capacity available. I do not know whether this was a one-off infant-mortality failure of a section of NAND and the drive will be fine from here on, or whether this failure heralds a cascade of failures. I will watch it and see what happens.
But what really puzzles me is that no one is reporting things like this...at least, not where I searched anyway. If this is somewhat normal behavior, I certainly would have expected the SSD controller to map out those bad sections before it came to my attention...why didn't it? Even if this is abnormal behavior and a hard failure is looming, I still would expect the controller to handle it. Maybe the fact that these are all VMs, and those files get modified in place has something to do with it? I don't see how, but...
Has anyone else here seen anything like this, or have a rational explanation for what might have happened? And should I be planning to replace this SSD? It would be a warranty exchange but, as a practical matter, I would have to purchase another one because of the time required to RMA it.