I have encountered a problem with my 2 TB Samsung 980 Pro SSD and, strangely enough, a DuckDuckGo search turns up very little that is relevant. I have looked around on Reddit and on the Samsung site, and I am seeing almost nothing. So either my problem isn't a problem, or I am just very unlucky. I am looking for any information anyone here might have.
I purchased this SSD last June and it is installed in my Asus X570 Plus motherboard, with a Ryzen 7 5800X CPU and 128 GB of Crucial RAM. The SSD is in the M.2 slot closest to the CPU, which gives me PCIe 4.0 capability.
The SSD contains my system, and also contains all of my active VMs, with "active" being defined as used at least occasionally - in other words, not archived. The total of all these VMs is around 900 GB of storage, so the SSD is half-full with VMs.
The virtual disks of these VMs are divided into multiple files which (depending on the VM) range from 2 GB to 4 GB per file. There is one FreeBSD VM that has its entire 10 GB virtual disk in one file, but it is the exception (and, as it happens, that 10 GB file is one of the ones that exhibited the problem).
I noticed about a week ago, when backing up some of these VMs (which I do from time to time, manually), that I was getting I/O errors from the rsync command that did the backup. Some investigation confirmed that the I/O errors originated from the SSD, and that several virtual disk files across several VMs were affected, as well as the virtual memory files for two VMs.
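For context, the backup is nothing exotic, just an rsync per VM along these lines (the paths here are illustrative, not my actual layout):

    rsync -av --progress /vmstore/somevm/ /mnt/backup/somevm/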
All the VMs had been working without apparent problems (with the exception of one OpenSUSE VM that had been showing occasional anomalous behavior).
I unmounted the partition and ran e2fsck, which reported that the filesystem had no errors.
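Concretely, something like this (the mount point and partition name are illustrative; substitute whatever the VM store actually lives on):

    umount /vmstore
    e2fsck -f /dev/nvme0n1p4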
Running both smartctl -a and the nvme command turned up multiple media errors, and initially indicated that 95% of the device's spare NAND remained unused.
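Specifically, the media error count and available-spare percentage I'm quoting come from output like this (the device node will vary):

    smartctl -a /dev/nvme0
    nvme smart-log /dev/nvme0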
Running e2fsck -c to do a badblocks scan turned up multiple bad blocks on the SSD. Given the nature of an SSD, I can't map those out myself, but finding them (and the files affected) was useful. While this scan was running (it took hours), the number of media errors reported by both smartctl and nvme increased sharply: when I started there were some 790 (I don't remember exactly) media errors reported; by the end of the full scan there were 12,384.
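Concretely, the scan was just e2fsck's read-only badblocks pass, something like this (partition name illustrative, as before):

    e2fsck -fc /dev/nvme0n1p4

A single -c does a read-only test and adds anything it finds to the filesystem's bad-block inode.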
Also, during this scan, my spare NAND declined from 95% to 91%, so clearly the SSD controller was mapping out the bad NAND as it was discovered during the badblocks scan.
So, evidently, a section of NAND in the SSD has failed, and the controller failed to find it and map it out transparently; it became visible to me - and this is not supposed to happen. Some forensic analysis suggests this problem first occurred sometime in the 4th quarter of last year, and went undetected until now.
After the badblocks scan, the cleanup of multiply linked files, and the other things e2fsck does, I still had I/O errors in the particular VM files that had displayed them before, which prevented me from copying them to backup storage.
I solved this by using dd with conv=noerror,sync to copy the files that were showing damage. This worked and produced a new copy of each file with the damaged sections replaced by zeros. I then started each affected VM and let its filesystem checker run; after that, all of them appear to be OK.
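For anyone who hits the same thing, the recovery copy was essentially this (filenames are just examples):

    dd if=vmdisk-s012.vmdk of=vmdisk-s012.vmdk.fixed bs=64K conv=noerror,sync status=progress

One caveat: noerror,sync pads each unreadable input block with zeros, so a smaller bs limits how much good data gets zeroed around each bad spot, at the cost of speed.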
So, at this point I seem to have the mess cleaned up, and the SSD says it still has 91% of its spare capacity available. I do not know whether this was a one-off infant-mortality failure of a section of NAND and the drive will be fine from here on, or whether this failure heralds a cascade of failures. I will watch it and see what happens.
But what really puzzles me is that no one is reporting things like this...at least, not where I searched anyway. If this is somewhat normal behavior, I certainly would have expected the SSD controller to map out those bad sections before it came to my attention...why didn't it? Even if this is abnormal behavior and a hard failure is looming, I still would expect the controller to handle it. Maybe the fact that these are all VMs, and those files get modified in place has something to do with it? I don't see how, but...
Has anyone else here seen anything like this, or have a rational explanation for what might have happened? And should I be planning to replace this SSD? It would be a warranty exchange but, as a practical matter, I would have to purchase another one because of the time required to RMA it.