Samsung 850 SSD Checksum Errors on Cache drive

KYThrill · January 16, 2018

For 8 years, I used a regular Samsung HDD for a cache drive. I transfer all the files to my Dune using Teracopy, with checksum verification.

I recently decided to replace the HDD with a Samsung 850 PRO SSD. It is formatted as btrfs (I don't remember whether the HDD was; I think it may have been XFS).

I've had it in for about two weeks now, and have probably had a dozen cases where Teracopy gave me an error because the checksum failed verification (copying from my PC to unRAID). I've had to copy a file as many as 3 times before it eventually passed the checksum. I've never had to do that before.
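For readers unfamiliar with verified copies, the copy, re-read, compare, retry cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Teracopy actually implements it: the hash algorithm (SHA-256) and the names `file_digest`, `copy_verified`, and `max_attempts` are hypothetical choices made for the example.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_verified(src, dst, max_attempts=3):
    """Copy src to dst, re-read dst, and compare digests.

    Retries the whole copy on a mismatch. Returns the attempt number
    that verified, or raises OSError once max_attempts is exhausted.
    (max_attempts=3 is an assumption mirroring the retries in the post.)
    """
    expected = file_digest(src)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(src, dst)
        if file_digest(dst) == expected:
            return attempt
    raise OSError(f"{dst}: checksum mismatch after {max_attempts} attempts")


if __name__ == "__main__":
    # Small self-contained demo using a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.bin")
        dst = os.path.join(tmp, "copy.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(4 * 1024 * 1024))
        print("verified on attempt", copy_verified(src, dst))
```

One caveat with any check of this kind: the read-back may be served from a cache rather than from the disk itself, so a passing check does not strictly prove the data on the drive is good, while a failing check does show that something in the path corrupted the data.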

Checking my logs, I see that several times a week Mover is also hitting checksum errors and repeatedly retrying a file.

Jan 15 03:44:54 TOWER kernel: BTRFS warning (device sdf1): csum failed ino 520 off 4287762432 csum 2562197151 expected csum 1350497962
Jan 15 03:44:54 TOWER kernel: BTRFS warning (device sdf1): csum failed ino 520 off 4287762432 csum 2562197151 expected csum 1350497962
Jan 15 03:44:54 TOWER kernel: BTRFS warning (device sdf1): csum failed ino 520 off 4287762432 csum 2562197151 expected csum 1350497962

I get errors like this (sdf1 is the cache drive) during the mover process, and it will try again. If eventually unable to move without csum error, it just errors out, leaves the file, and moves on to the next. I've never seen this kind of error in 8 years (admittedly, don't check every night, but never saw one).
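To see how many distinct files and offsets are actually affected (rather than counting repeated retries), the warnings can be grouped by device, inode, and offset. The sketch below assumes the exact message format shown in this post; other kernel versions print these warnings differently, so the pattern would need adjusting. The function name `summarize_csum_failures` is hypothetical.

```python
import re
from collections import Counter

# Matches only the format seen in this thread, e.g.
# "BTRFS warning (device sdf1): csum failed ino 520 off 4287762432 csum ... expected csum ..."
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>\S+)\): csum failed "
    r"ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<got>\d+) expected csum (?P<want>\d+)"
)


def summarize_csum_failures(lines):
    """Count csum failures per (device, inode, offset) tuple."""
    counts = Counter()
    for line in lines:
        m = CSUM_RE.search(line)
        if m:
            counts[(m["dev"], int(m["ino"]), int(m["off"]))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jan 15 03:44:54 TOWER kernel: BTRFS warning (device sdf1): csum failed "
        "ino 520 off 4287762432 csum 2562197151 expected csum 1350497962",
    ] * 3
    for (dev, ino, off), n in summarize_csum_failures(sample).items():
        print(f"{dev} inode {ino} offset {off}: {n} warning(s)")
```

Applied to the three lines above, this collapses them into a single entry: one inode (520) at one offset, reported three times. If the filesystem is mounted, `btrfs inspect-internal inode-resolve 520 <mountpoint>` should map that inode back to a path (the mount point is left as a placeholder here).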

Lots of people use SSD cache drives. What could be wrong? Should I switch to XFS, like the old HDD? When I check the SMART data, I only have 1 UDMA CRC error count. I think if I had bad cabling, this number would be higher (386 hours powered on). There are no other errors of any type indicated in the SMART data.

My only other thought is that I have some bad RAM. I have had a stick go bad in the past, which caused a failed CRC check on every file transfer. Just seems odd that this started happening the first day after installing the SSD (SSD is physically far away from RAM, so shouldn't be any type of electrical noise issue).

The other wrench in that thought is that about 10% of my transfers from PC to unRAID go not to the cache drive but directly to an array HDD. In the same 2 weeks, none of those transfers have had a checksum failure. Obviously, mover doesn't move those, so there is no data point there. It seems like if the RAM were failing, at least one of those would end up with an error too.

Any ideas?

KYThrill · January 16, 2018

Also, may be important:

Jan  1 21:55:34 TOWER kernel: ata5.00: supports DRM functions and may not be fully accessible
Jan  1 21:55:34 TOWER kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0
Jan  1 21:55:34 TOWER kernel: ata4.00: configured for UDMA/133
Jan  1 21:55:34 TOWER kernel: ata5.00: disabling queued TRIM support
Jan  1 21:55:34 TOWER kernel: ata5.00: ATA-9: Samsung SSD 850 PRO 256GB, S39KNX0JA32484E, EXM04B6Q, max UDMA/133
Jan  1 21:55:34 TOWER kernel: ata5.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32)
Jan  1 21:55:34 TOWER kernel: ata5.00: supports DRM functions and may not be fully accessible
Jan  1 21:55:34 TOWER kernel: ata5.00: disabling queued TRIM support
Jan  1 21:55:34 TOWER kernel: ata5.00: configured for UDMA/133

ATA5 seems to be the SSD. After the first mover csum error, this happened:

Jan  4 21:26:42 TOWER kernel: ata5.00: exception Emask 0x10 SAct 0x7ffeffff SErr 0x400000 action 0x6 frozen
Jan  4 21:26:42 TOWER kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Jan  4 21:26:42 TOWER kernel: ata5: SError: { Handshk }
Jan  4 21:26:42 TOWER kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Jan  4 21:26:42 TOWER kernel: ata5.00: cmd 61/40:00:88:f7:79/05:00:00:00:00/40 tag 0 ncq 688128 out
Jan  4 21:26:42 TOWER kernel:         res 40/00:90:48:b3:79/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jan  4 21:26:42 TOWER kernel: ata5.00: status: { DRDY }
Jan  4 21:26:42 TOWER kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Jan  4 21:26:42 TOWER kernel: ata5.00: cmd 61/40:08:c8:fc:79/05:00:00:00:00/40 tag 1 ncq 688128 out
Jan  4 21:26:42 TOWER kernel:         res 40/00:90:48:b3:79/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jan  4 21:26:42 TOWER kernel: ata5.00: status: { DRDY }

The failed command portion (the "failed command" line and the three lines after it) is repeated 30 times or so, and ends with:

Jan  4 21:26:42 TOWER kernel: ata5: hard resetting link
Jan  4 21:26:43 TOWER kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  4 21:26:43 TOWER kernel: ata5.00: supports DRM functions and may not be fully accessible
Jan  4 21:26:43 TOWER kernel: ata5.00: disabling queued TRIM support
Jan  4 21:26:43 TOWER kernel: ata5.00: supports DRM functions and may not be fully accessible
Jan  4 21:26:43 TOWER kernel: ata5.00: disabling queued TRIM support
Jan  4 21:26:43 TOWER kernel: ata5.00: configured for UDMA/133
Jan  4 21:26:43 TOWER kernel: ata5: EH complete
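Since the failed-command block repeats many times, it can help to reduce the log to a few counts per port: how many commands failed, how often the link was hard reset, and at what speed it came back up. The following is a minimal sketch that assumes the message wording shown in this post; `summarize_ata_errors` and the result keys are hypothetical names chosen for the example.

```python
import re
from collections import Counter

FAILED_RE = re.compile(r"(ata\d+(?:\.\d+)?): failed command: (.+?)\s*$")
RESET_RE = re.compile(r"(ata\d+): hard resetting link")
LINK_RE = re.compile(r"(ata\d+): SATA link up ([\d.]+ Gbps)")


def summarize_ata_errors(lines):
    """Summarize libata error messages from kernel log lines.

    Returns a dict with:
      "failed":     Counter keyed by (device, command name)
      "resets":     Counter keyed by port
      "link_speed": last reported link speed per port
    """
    failed = Counter()
    resets = Counter()
    link_speed = {}
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = LINK_RE.search(line)
        if m:
            link_speed[m.group(1)] = m.group(2)
    return {"failed": failed, "resets": resets, "link_speed": link_speed}


if __name__ == "__main__":
    sample = [
        "Jan  4 21:26:42 TOWER kernel: ata5.00: failed command: WRITE FPDMA QUEUED",
        "Jan  4 21:26:42 TOWER kernel: ata5.00: failed command: WRITE FPDMA QUEUED",
        "Jan  4 21:26:42 TOWER kernel: ata5: hard resetting link",
        "Jan  4 21:26:43 TOWER kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    ]
    print(summarize_ata_errors(sample))
```

Feeding the full syslog through this gives one line of evidence per port, which makes it easier to compare the dates of the ATA errors against the dates of the csum warnings.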

And after checking the log, the mover csum error hasn't happened as often as I thought. It has happened 3 times in 16 days, and only one of those three times involved a file that also failed in Teracopy. The other two were files Teracopy moved with no errors. And of all the files that failed in Teracopy, only one had a mover error.

Weird.

KYThrill · January 16, 2018

1 pass of memtest and 0 errors. I know, just one pass, but...

JorgeB · January 16, 2018

3 hours ago, KYThrill said:

If eventually unable to move without csum error, it just errors out, leaves the file

That's expected: btrfs (or any other checksum-enabled filesystem) won't let you copy a corrupted file.

3 hours ago, KYThrill said:

Should I switch to XFS, like the old HDD?

You can, but it would be much better to figure out what's causing the corruption. It will likely still happen with XFS; you just won't be warned.

Start by posting your diagnostics: Tools -> Diagnostics

KYThrill · January 18, 2018

Well, for better or worse, I changed two things... I have not seen a CRC error on Teracopy yet, so fingers crossed (but it hasn't been a week yet, and I was seeing 1-2 errors per week).

Reviewing Linux kernel bug logs, I found many past issues with Samsung SSDs. Apparently they are not a good choice for a Linux device. However, there were many fixes for Samsung SSDs, spread over various kernel versions. One thing not working correctly was TRIM.

So the first thing I did was update to unRAID 6.4 (from 6.1.6), to get as new a kernel as possible. I also installed the Dynamix TRIM plugin and set a daily schedule for that.

The second thing I did, at the same time (I know, one change at a time is best), was change the SSD formatting to XFS. I had found numerous reports of data corruption using BTRFS or EXT4 on Samsung 850 SSDs. They don't seem to play well together, likely because of Linux-related bugs in Samsung's firmware. Often, when a user was getting data corruption with either of these, they would switch to EXT3, and everything would be fine.

My theory on this is that EXT3 doesn't support TRIM, but EXT4 does. Samsung acknowledged buggy TRIM support for Linux (though it works in Windows), which led to kernel fixes to work around Samsung's implementation. So TRIM doesn't play well on Samsung SSDs, depending on the file system.

XFS looks like it supports TRIM (fstrim/discard), but I read some more info indicating that may be misleading. The reason is that with XFS, TRIM is disabled by default. The preferred method for TRIM with XFS is to call it explicitly, which you can schedule with cron (which is what I think the Dynamix plugin is doing).

So everything appears to be working, but I would not advise anyone to put a Samsung SSD (or M.2) into an unRAID machine. Using this combo seems like it could be risky. But since I already have one, since this is just the cache drive, and since the XFS/Dynamix plugin setup seems to work around the problem, I don't have much at risk.

JorgeB · January 18, 2018

I've used Samsung SSDs with btrfs for years, and continue to use them, with zero corruption issues.
