modified: Friday 12 July 2024
author: Hales
markup: textile
NVME is so much more complicated than I expected
I’ve had perfect luck with SATA SSDs. I deployed dozens in my previous job and didn’t see any failures over their few years of usage. I have two 1TB Samsung 860 QVOs (quad-bit flash, i.e. 16 levels) split across my desktop and my laptop that have glided through several years without any issues whatsoever.
Last year I bought my first NVME M.2 drive. Things have gone terribly since then.
Silicon Power P34A80 number 1: Disaster
Pros:
- TLC NAND that can sustain a few hundred MiB/s writes even once cache has expired (!)
- Very reasonable price ($145AUD for 2TB of NVME)
- Has DRAM, so it will handle lots of little files very well (needed for my work — 160GiB mosquito recording collection)
Cons:
- Died after a few months of usage.
I wasn’t doing anything special. One day I noticed that some of my programs were crashing in weird ways. A few days later I tried running my backup utility and it started spitting horrors about inaccessible files.
[55416.212950] critical medium error, dev nvme0n1, sector 1015530568 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[55416.286907] nvme0n1: Read(0x2) @ LBA 1015530568, 8 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81)
... similar lines spammed infinitely ...
The flash on the disk seems to have failed and become unreadable, but only for certain files at certain addresses. Luckily my backup tool spat out a neat list of the files it couldn’t read. I walked through the rest of my drive and found lots of libraries and binaries that were also unreadable.
Perhaps I had over-used the drive? Let’s check the SMART stats:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 38 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 8,851,693 [4.53 TB]
Data Units Written: 4,043,783 [2.07 TB]
Host Read Commands: 92,705,493
Host Write Commands: 23,809,339
Controller Busy Time: 0
Power Cycles: 105
Power On Hours: 924
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 179
Error Information Log Entries: 80
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
Everything on that list looks fine and healthy other than the “Media and Data Integrity Errors” count, which kept going up whenever I tried to access one of the cursed files. Interestingly the drive hadn’t tried to use any spare sectors, so the detection of bad flash must have been sudden rather than gradual. Gradual flash degradation can be detected and fixed in most drives, as they store ECC bits along with each data sector; the “available spare” space then starts being used as the drive remaps the addresses of the bad sectors to their replacements.
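For anyone who wants to keep an eye on that counter, here is a minimal sketch. The helper name `media_errors` is made up for illustration, and the device name in the usage comment is an assumption. It only pulls the number out of smartctl-style text, so you can compare it before and after reading a suspect file.

```shell
# Extract the "Media and Data Integrity Errors" counter from
# smartctl-style text on stdin. Thousands separators are stripped.
# (media_errors is a hypothetical helper name, not part of smartmontools.)
media_errors() {
    awk -F': *' '/^Media and Data Integrity Errors/ { gsub(/,/, "", $2); print $2 }'
}

# Example usage (needs root and smartmontools; device name is an assumption):
#   smartctl -a /dev/nvme0 | media_errors
```

If the number grows after you read a file, that file is probably sitting on bad flash.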
No-one can claim that I farmed Chia on this drive: 1 drive’s worth of writes (2TB) and about 2 drives’ worth of reads (4.5TB) are a pittance. 924 power-on-hours is equivalent to about 39 days of continuous operation.
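The arithmetic behind those figures can be checked with a short sketch. NVMe reports "data units" in thousands of 512-byte blocks, so each unit is 512,000 bytes. The helper names below are made up for illustration.

```shell
# Convert NVMe "data units" (512,000 bytes each) to terabytes (10^12 bytes).
units_to_tb() {
    awk -v u="$1" 'BEGIN { printf "%.2f\n", u * 512000 / 1e12 }'
}

# Convert power-on hours to days of continuous operation.
hours_to_days() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 24 }'
}

units_to_tb 8851693    # Data Units Read    -> 4.53
units_to_tb 4043783    # Data Units Written -> 2.07
hours_to_days 924      # Power On Hours     -> 38.5
```

On a 2TB drive that works out to roughly one full drive write and a bit over two full drive reads.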
At the time I was terrified that I might have also lost some of my other files to silent corruption, but to date I’ve not found any evidence of this happening (and I keep versioned backups). Credit seems due to Silicon Power that their firmware identified the problem and did not hide it.
I squeezed myself back onto my old 1TB Samsung SATA SSD, wiped the broken drive and took it back to the store for a warranty replacement. The failed drive didn’t complain during any of the wipes (it probably started using replacement sectors when I wrote over the bad ones but I didn’t check).
Silicon Power P34A80 number 2: Disaster
The fear of data loss was still in my bones. Instead of using ext4 again I decided to use a fancy new filesystem called bcachefs because it supports file checksums. This would help let me know if files get silently corrupted (I don’t want to lose photos and only realise years later). Sadly my distro’s kernel was too old so I had to compile my own and do a bit of initrd surgery — I’ve left a section at the end of this article with more information about that.
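The same idea can be approximated by hand on any filesystem. The sketch below is not how bcachefs does it internally; it is just a checksum manifest built with GNU coreutils' `sha256sum`. The function names are made up for illustration, and the manifest file should live outside the directory being checked.

```shell
# Record a SHA-256 checksum for every file under DIR.
# usage: make_manifest DIR MANIFEST
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# Re-read every file and compare against the manifest.
# Exits non-zero if any file has changed or cannot be read.
# usage: check_manifest DIR MANIFEST
check_manifest() {
    ( cd "$1" && sha256sum --quiet -c - ) < "$2"
}

# Example usage (paths are assumptions):
#   make_manifest ~/photos /backup/photos.sha256
#   check_manifest ~/photos /backup/photos.sha256
```

Unlike a checksumming filesystem, this only notices corruption when you run the check, and it cannot tell corruption apart from a deliberate edit.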
So how did this story end? Exactly the same as the first one, but it only took 2 months this time:
Media and Data Integrity Errors: 464
My system logs confirm the problem appeared suddenly. It wasn’t gradual.
At this point I ordered a new (non-Silicon Power) NVME SSD and also contacted Silicon Power for help:
Hello SP.
(Summary: I bought a P34A80 and it failed within a few months. I got a second P34A80 as a warranty replacement and it has failed in the same way. I do not know why.)
[…]
Question 1: Is this a known problem with P34A80 that perhaps has a known solution?
Question 2: Could I be doing something wrong to cause these issues?
Question 3: Would it help you if I provided some more information? What information would help?
Regards, Hales
I received their reply back only yesterday. Their wording is dangerously close to ignoring my questions and asking me to RMA through them (which I did not ask to do):
Before you process RMA request, we still would like to double confirm information to make sure if there are still any other solutions that we could provide and solve as below
I’m going to give them the benefit of the doubt, but if they completely ignore my questions then I’m going to be super pissed. I have already gone through two rounds of data loss with their products; having support ignore what I ask will seal the deal. If I RMA through them then will I lose my data, my questions and my money? It would be unwise for me to trust black holes when I can instead return to the retailer for a refund in my country.
Western Digital SN580: Surely not a disaster too?
Hah. Hah hah.
I copied all of my files over, keeping a list of the unreadable ones. I painstakingly fixed my OS by working out what packages provide the affected files and reinstalled those inside a chroot. I tried to boot off the new disk and… my BIOS couldn’t see it?
Huh. Maybe I didn’t install my bootloader correctly. I chrooted back in and tried reinstalling all of its pieces again.
No dice. Weird. OK I’ll just keep using the bootloader on the old (bad) Silicon Power disk and just tell it to point the kernel at my new disk. No wakkas, I’ll work this out later.
Everything seemed fine until I tried using unison, a program I use to synchronise files between my laptop and desktop. After a few seconds all disk IO locked up (I couldn’t even open a new shell) and about 60 seconds later my kernel logs were spammed with this:
[ 985.666937] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 985.666945] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 985.666948] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 985.688944] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 985.689097] nvme nvme0: Disabling device after reset failure: -19
[ 985.694936] bcachefs (nvme0n1p2 inum 67335731 offset 0): data read error: I/O
[ 985.694958] bcachefs (nvme0n1p2 inum 134389612 offset 7080): data read error: I/O
[ 985.695000] bcachefs (nvme0n1p2 inum 134389612 offset 7232): data read error: I/O
[ 985.695021] bcachefs (nvme0n1p2): btree write error: I/O
[ 985.695023] bcachefs (nvme0n1p2 inum 134389612 offset 7104): data read error: I/O
[ 985.695026] bcachefs (nvme0n1p2 inum 134389612 offset 7360): data read error: I/O
[ 985.695048] bcachefs (nvme0n1p2): btree write error: I/O
[ 985.695091] bcachefs (nvme0n1p2): error writing journal entry 53425: I/O
[ 985.695111] bcachefs (nvme0n1p2): btree write error: I/O
...
The NVME controller was locking up and a soft reset wasn’t resurrecting it. My filesystem driver (bcachefs) then, unsurprisingly, started running around in circles with paint in its eyes.
The suggested commandline flags didn’t help. I don’t think this was a powersaving issue.
This problem ended up being completely reproducible. Even the following was enough to trigger it:
cd ~/library; bfs . | while IFS= read -r line; do [ -f "$line" ] && cat "$line" > /dev/null; done
It sometimes choked around the same file, other times at other files. Interestingly the following things would NOT trigger the problem:
- Reading just those files (and the ones near them)
- Reading the entire disk end to end using dd if=/dev/nvme1n1 of=/dev/null. Changing the blocksize (bs=) from tiny to massive didn’t make any difference either.
- Heavy write loads
It seemed that only heavy random read loads triggered the problem.
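A rough sketch of a more deliberately random read load follows. This is not the command used above, and I make no claim that it triggers the same fault. It assumes GNU coreutils (`shuf -z`), and the function name is made up for illustration. It reads every file under a directory in shuffled order and prints the total number of bytes read.

```shell
# Read every regular file under DIR in random order and
# print the total number of bytes read.
# usage: random_read DIR
random_read() {
    find "$1" -type f -print0 | shuf -z | xargs -0 -r cat | wc -c | tr -d ' '
}

# Example usage (path is an assumption):
#   random_read ~/library
```

Files that are already in the page cache never touch the disk, so a second run straight after the first is a much weaker test.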
At this point I was tearing my hair out. Was my computer cursed? Did my motherboard or CPU have PCIE/NVME bugs that caused the failure of 3 different drives? Perhaps my power supply was sagging at just the wrong times? Or maybe this was a new set of bugs caused by me using bcachefs (although the first drive died when I was using ext4).
Days of debugging, BIOS updating and automation of the testing later: I realised I had a laptop in the house with an NVME slot. I moved the WD SN580 into that, booted from a liveUSB from another distro, and discovered that it misbehaved identically there too. Phew.
A good night’s sleep later and I worked out…
The problem was 4k sector sizes
When I first received the SN580 I noticed that it supported both 512-byte and 4096-byte (4k) sectors. It reported them like this:
$ nvme id-ns -H /dev/nvme1n1 | grep "Relative Performance"
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
The drive itself reported 4k as a better option (0x1) than the default of 512 bytes (0x2). This makes sense: internally most SSDs use 4k sectors anyway as it’s more efficient, the only thing we’re changing here is the API exposed to the OS. My Silicon Power drives didn’t report supporting anything other than 512b but this WD disk seems more sophisticated.
It turns out that changing to 4k sectors was a massive mistake. It’s buggy as all shit. Advanced format 4k disks have been around for something like 15 years but counting above 9 bits is still apparently too hard for us.
Changing the drive back into 512b mode (which requires a format) seems to have magically fixed both of my problems:
1. The drive’s controller no longer locks up under heavy random read load
2. My BIOS can now see it and boot from it.
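For reference, here is a minimal sketch of how the LBA format index can be picked out of the `nvme id-ns -H` output shown earlier. The two function names are made up for illustration, and the parsing assumes the line layout shown in this article. The actual switch is done with nvme-cli's `nvme format` and its `--lbaf` option, which wipes the namespace, so it is left as a comment.

```shell
# Print the index of the LBA format currently in use.
# Reads `nvme id-ns -H` output on stdin.
lbaf_in_use() {
    sed -nE 's/^LBA Format +([0-9]+) *:.*\(in use\).*/\1/p'
}

# Print the index of the first LBA format whose data size is SIZE bytes.
# usage: ... | lbaf_for_size SIZE
lbaf_for_size() {
    sed -nE "s/^LBA Format +([0-9]+) *:.*Data Size: +$1 +bytes.*/\1/p" | head -n 1
}

# Example usage (device name taken from the output above):
#   nvme id-ns -H /dev/nvme1n1 | lbaf_in_use
#   nvme id-ns -H /dev/nvme1n1 | lbaf_for_size 512
# Switching format DESTROYS ALL DATA on the namespace:
#   nvme format /dev/nvme1n1 --lbaf=0
```

Check the index on your own drive before formatting; there is no guarantee that index 0 is the 512-byte format on every model.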
I don’t know who to blame for this.
- Could it be my (AMD) CPU and/or (Gigabyte) motherboard? But then why did the same issue occur when I put it into an (Intel) laptop?
- Could it be that Western Digital’s firmware is shoddy? But then why would the device expose the feature at all, especially from a company that straight up says no to firmware updates?
Are things working now?
Yes, at least for the last 24 hours :) I can play games again!
Lessons learned:
- You must treat your $200 2TB NVME SSDs like 1.44MB floppies: use 512 byte sectors. Do NOT use any 4k sector features that they say are better, they’re lying, the date is actually 2009 and Advanced Format is not stable yet.
- Specific SSD models and/or batches seem to be buggered, you can’t expect the warranty replacement to work any better.
I want to end this article on a happy note so here are some photos of my bootloader. I changed the backer image for each of the (3 by this point) copies of my main disk so I wouldn’t get confused and boot into the wrong one. Oldest is on the left, newest on the right.
The last image is from Modus Interactive’s Bryce3d render pack which is absolutely gorgeous and worth a look (it’s free). I don’t know the sources of the first two images, they have been sitting around in my folders for years, apologies to the artists behind them.
Sidestory: getting bcachefs working
Back in December 2023 this was quite difficult.
Finding the right incantations to chant on my kernel (boot) command line was difficult:
root=/dev/nvme0n1p2 rootfstype=bcachefs fastboot
root= can’t use a UUID because that’s hidden inside the filesystem metadata, not outside. I later learned I could have used PARTUUID instead which is visible at the partition-table level. To find out what the drive was named/numbered I added rd.break=pre-mount to the boot commandline, which dropped me at a shell just at the point where disks were to be mounted.
rootfstype= was necessary because the “mount” command couldn’t automatically recognise bcachefs partitions at the time.
fastboot tells the initrd not to fsck the partition. Again, it couldn’t work out what the partition was. This is relatively safe because the root partition gets mounted read-only (ro) at this stage, so when it hands over to my real full operating system init it gets properly fscked before remounting as read-write (rw).
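Putting those three pieces together with the PARTUUID variant might look like the sketch below. The function name is made up for illustration, and I have not booted with this exact line, so treat it as an assumption rather than a tested recipe. `blkid -s PARTUUID -o value` is the util-linux way of reading the partition-table UUID.

```shell
# Build a kernel command line that finds the root partition by PARTUUID
# instead of by device name.
# usage: root_cmdline PARTUUID
root_cmdline() {
    printf 'root=PARTUUID=%s rootfstype=bcachefs fastboot\n' "$1"
}

# Example usage (needs root; device name taken from the line above):
#   root_cmdline "$(blkid -s PARTUUID -o value /dev/nvme0n1p2)"
```

The advantage over root=/dev/nvme0n1p2 is that the line keeps working if the drives get renumbered.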
One day the dracut-generated initrd started hanging, so I had to replace it with a mkinitcpio-generated initrd. It was 200MiB instead of 80MiB and was noticeably slower to boot but it worked. I still don’t know why the dracut one hung, but my fucks were exhausted.
Voidlinux’s “linux-mainline” kernel package is now version 6.7 so I am now using it instead of my self-compiled kernel. Its dracut initrd seems to be good.