Sudden unclean reboots, SSD Cache error

lonnestig · December 27, 2019

My Unraid server suddenly reboots. Sometimes it restarts 10 times a day, and sometimes it just runs for a week without a restart. When it starts rebooting, it can reboot with just minutes between restarts. Here is the disk log from my SSD cache, which might be the problem. I just don't know where to begin my troubleshooting.

Quote

Dec 27 23:14:45 Tower kernel: ata8: SATA max UDMA/133 abar m2048@0xf9bff800 port 0xf9bff980 irq 36
Dec 27 23:14:45 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:14:45 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:14:45 Tower kernel: ata8.00: ATA-11: Samsung SSD 860 EVO 1TB, S3Z9NB0K915954W, RVT01B6Q, max UDMA/133
Dec 27 23:14:45 Tower kernel: ata8.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Dec 27 23:14:45 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:14:45 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sdi: sdi1
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Attached SCSI disk
Dec 27 23:16:07 Tower emhttpd: Samsung_SSD_860_EVO_1TB_S3Z9NB0K915954W (sdi) 512 1953525168
Dec 27 23:16:07 Tower emhttpd: import 30 cache device: (sdi) Samsung_SSD_860_EVO_1TB_S3Z9NB0K915954W
Dec 27 23:16:25 Tower emhttpd: shcmd (50): /usr/sbin/cryptsetup luksOpen /dev/sdi1 sdi1 --allow-discards --key-file=/root/keyfile
Dec 27 23:16:32 Tower emhttpd: shcmd (77): mount -t btrfs -o noatime,nodiratime /dev/mapper/sdi1 /mnt/cache
Dec 27 23:19:25 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x6 frozen
Dec 27 23:19:25 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:19:25 Tower kernel: ata8.00: cmd 60/20:b8:e0:aa:01/00:00:00:00:00/40 tag 23 ncq dma 16384 in
Dec 27 23:19:25 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:19:25 Tower kernel: ata8: hard resetting link
Dec 27 23:19:25 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:19:25 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:19:25 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:19:25 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:19:25 Tower kernel: ata8: EH complete
Dec 27 23:19:25 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:20:20 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x40000001 SErr 0x0 action 0x6 frozen
Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:20 Tower kernel: ata8.00: cmd 60/08:00:20:43:53/00:00:00:00:00/40 tag 0 ncq dma 4096 in
Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:20 Tower kernel: ata8.00: cmd 60/20:f0:c0:f1:02/00:00:00:00:00/40 tag 30 ncq dma 16384 in
Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:20 Tower kernel: ata8: hard resetting link
Dec 27 23:20:20 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:20:20 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:20 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:20 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 Sense Key : 0x5 [current]
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 ASC=0x21 ASCQ=0x4
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 CDB: opcode=0x28 28 00 00 53 43 20 00 00 08 00
Dec 27 23:20:20 Tower kernel: print_req_error: I/O error, dev sdi, sector 5456672
Dec 27 23:20:20 Tower kernel: ata8: EH complete
Dec 27 23:20:20 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:20:51 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x6 frozen
Dec 27 23:20:51 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:51 Tower kernel: ata8.00: cmd 60/08:58:68:b9:8a/00:00:00:00:00/40 tag 11 ncq dma 4096 in
Dec 27 23:20:51 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:51 Tower kernel: ata8: hard resetting link
Dec 27 23:20:51 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:20:51 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:51 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:51 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 Sense Key : 0x5 [current]
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 ASC=0x21 ASCQ=0x4
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 CDB: opcode=0x28 28 00 00 8a b9 68 00 00 08 00
Dec 27 23:20:51 Tower kernel: print_req_error: I/O error, dev sdi, sector 9091432
Dec 27 23:20:51 Tower kernel: ata8: EH complete
Dec 27 23:20:51 Tower kernel: ata8.00: Enabling discard_zeroes_data

I'm kind of new to Unraid and the whole Linux thing, so I want all the help I can get.

trurl · December 27, 2019

Go to Tools - Diagnostics and attach the complete diagnostics zip file to your NEXT post.

Have you done memtest?

lonnestig · December 27, 2019

No I haven't done a memtest, can I do that in unraid?

I have attached the diagnostics file.

tower-diagnostics-20191228-0021.zip

trurl · December 27, 2019

1 minute ago, lonnestig said:

No I haven't done a memtest, can I do that in unraid?

Not in Unraid, it has to be done before booting Unraid. Memtest is available on the boot menu.

lonnestig · December 27, 2019

Okay, then I'll do that and return with the results.

trurl · December 28, 2019

Those things in the log in your first post suggest a problem communicating with the cache disk. Likely a bad connection or cable.
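For anyone reading along, here is a rough Python sketch of how those log lines could be summarized. It is not an Unraid tool; the function names and regular expressions are my own, modelled only on the kernel messages quoted in the first post. It counts exceptions and link resets per ATA port, collects the sectors from the I/O error lines, and decodes the LBA from a SCSI READ(10) CDB (opcode 0x28, big-endian LBA in bytes 2 to 5).

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in the first post.
EXCEPTION = re.compile(r"kernel: (ata\d+)\.\d+: exception Emask")
HARD_RESET = re.compile(r"kernel: (ata\d+): hard resetting link")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def summarize_ata_errors(lines):
    """Count exceptions and link resets per ATA port, and collect failed sectors."""
    exceptions, resets, bad_sectors = Counter(), Counter(), []
    for line in lines:
        if m := EXCEPTION.search(line):
            exceptions[m.group(1)] += 1
        elif m := HARD_RESET.search(line):
            resets[m.group(1)] += 1
        elif m := IO_ERROR.search(line):
            bad_sectors.append((m.group(1), int(m.group(2))))
    return exceptions, resets, bad_sectors


def read10_lba(cdb_hex):
    """Decode the big-endian LBA (bytes 2-5) from a SCSI READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return int.from_bytes(cdb[2:6], "big")


# Lines copied from the log in the first post.
sample = [
    "Dec 27 23:20:20 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x40000001 SErr 0x0 action 0x6 frozen",
    "Dec 27 23:20:20 Tower kernel: ata8: hard resetting link",
    "Dec 27 23:20:20 Tower kernel: print_req_error: I/O error, dev sdi, sector 5456672",
]
exceptions, resets, bad_sectors = summarize_ata_errors(sample)
print(dict(exceptions), dict(resets), bad_sectors)
print(read10_lba("28 00 00 53 43 20 00 00 08 00"))
```

The decoded LBA for that CDB is 5456672, the same sector the kernel reports in the I/O error line. Repeated resets on a single port with a clean SError value are at least consistent with a cable, connector, or power problem, though the log alone cannot prove which.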

I notice Fix Common Problems warnings in the syslog in your diagnostics. One of those warnings says you have syslog mirrored to flash. I think that must mean you have Syslog Server set up.

The syslog in your diagnostics only has information since the last boot, but Syslog Server saves syslogs indefinitely, and they might tell us more about what happened before the reboot. Post those syslogs from your flash, preferably zipped.

I see a lot of other things in your diagnostics that show things are configured wrong, but probably best to take this a little at a time.

lonnestig · December 28, 2019

I have removed the logging and I have 29 more GB on the flashdrive left so that's not the problem. Here is the syslog file.

When the memtest is done, I'm gonna change the cable to the cache ssd disk.

syslog (1)

lonnestig · December 28, 2019

No errors during memtest

trurl · December 28, 2019

2 hours ago, lonnestig said:

I have removed the logging and I have 29 more GB on the flashdrive left so that's not the problem. Here is the syslog file.

The warning is there because frequent writes to flash are not good for its life. It is possible to configure Syslog Server to write to something else, such as a user share. But for quick troubleshooting, using flash is OK as long as you remember to turn it off.

I think you may have to back up the cache and reformat it. I'll see what @johnnie.black thinks about that.

trurl · December 28, 2019

You have used 31G of a 50G docker image. That almost certainly indicates you have one or more of your applications misconfigured and they are writing into the docker image instead of to mapped storage.

I always recommend allocating only 20G for docker image (less than you have already used). Making it larger won't fix your problem, it will just make it take longer to fill.

Typically the way this happens is an application is set to write to a path that doesn't correspond to a container volume mapping. Common mistakes are not using an absolute path, or not using upper/lower case that matches the mappings.
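As a rough illustration of that check, here is a small Python sketch. The function name `is_mapped` and the example container paths are hypothetical, not taken from anyone's actual setup. It tests whether a path configured inside an application falls under one of the container-side volume mappings, comparing case-sensitively and treating relative paths as unmapped.

```python
import posixpath


def is_mapped(app_path, container_mounts):
    """Return True if app_path lies inside one of the container-side mount paths.

    Comparison is case-sensitive, as Linux paths are; relative paths never match.
    """
    if not app_path.startswith("/"):
        return False  # relative path: resolved inside the container image
    path = posixpath.normpath(app_path)
    for mount in container_mounts:
        mount = posixpath.normpath(mount)
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            return True
    return False


mounts = ["/config", "/downloads"]  # hypothetical container paths
for candidate in ["/downloads/complete", "/Downloads/complete", "downloads/complete"]:
    print(candidate, is_mapped(candidate, mounts))
```

Only the first candidate is covered by a mapping. The other two, one with the wrong case and one that is not absolute, would end up inside the docker image.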

Another way you can misconfigure a docker is by using a host mapping that doesn't correspond to actual storage. That will result in a docker writing into RAM. Note that any mapping to an Unassigned Device that isn't actually mounted will also be in RAM.
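A sketch of how a host path could be checked against the mount table follows. The helper names and the sample mount list are hypothetical, and it assumes that a path whose nearest mount point is only the root filesystem is sitting in RAM, which is how the paragraph above describes it. `read_mount_points` shows how the real list could be read from /proc/mounts on the server.

```python
import posixpath


def nearest_mount(path, mount_points):
    """Return the longest mount point that contains path ("/" if none is deeper)."""
    path = posixpath.normpath(path)
    best = "/"
    for mp in mount_points:
        mp = posixpath.normpath(mp)
        inside = path == mp or path.startswith(mp.rstrip("/") + "/")
        if inside and len(mp) > len(best):
            best = mp
    return best


def read_mount_points(proc_mounts="/proc/mounts"):
    """Read mount points from /proc/mounts (second whitespace-separated field)."""
    with open(proc_mounts) as fh:
        return [line.split()[1] for line in fh if len(line.split()) > 1]


# Hypothetical mount table: the unassigned device "backup" is not mounted.
mounted = ["/", "/boot", "/mnt/cache", "/mnt/user"]
for host_path in ["/mnt/user/appdata/plex", "/mnt/disks/backup/media"]:
    print(host_path, "->", nearest_mount(host_path, mounted))
```

In this made-up example the second path resolves to "/", meaning nothing is mounted underneath it and writes would not reach a disk.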

The way forward is to delete the docker image, recreate it at the more reasonable size of 20G, then use the Previous Apps feature of the Apps page to reinstall your dockers just as before. But just as before isn't good enough, so do it one docker at a time and consider what is wrong with the way each one is set up.

But before that let's wait and see if we get another opinion on what to do about your cache drive. Maybe for now just disable the docker service and wait. Possibly that might even make your crashing go away.

I have some things to say about your user shares. I will start a new post for that.

trurl · December 28, 2019

You have an unusually large number of user share configuration files on flash. Some appear to be duplicates, and many are not actually user shares with any contents. Possibly some of these shares were created by accident by some of your dockers.

Go to Shares - User Shares and click the Compute All button at the bottom. Wait for the result; it will probably take several minutes. Then post a screenshot.

Also, go to Main - Boot Device, click on the folder icon at far right for flash to drill down into its contents. Then click on config, then click on shares, then post a screenshot.

JorgeB · December 28, 2019

The ATA errors could be a bad cable, or could be the controller. Since the SSD is on a Marvell controller, avoid those ports if possible.

More worrying are the checksum errors, which indicate data corruption. I would start by connecting the SSD to one of the onboard SATA ports to see if the ATA errors go away, then run a scrub. If there are any uncorrectable errors, you need to delete the affected files or reformat the cache.

lonnestig · December 28, 2019

The SSD is on the onboard connections. But now I think I found the problem... The big power supply connector on the mobo is glitching. When I stretch the PSU cable to the mobo it seems to run, and when I release it, it dies... Here are the screenshots of my shares and of the boot device.

JorgeB · December 28, 2019

3 minutes ago, lonnestig said:

The SSD is on the onboard connections

Yes, but 2 of those ports are on a Marvell controller. That doesn't mean they don't work, just that they are much more prone to issues compared to the Intel SATA ports, which are usually rock solid.

lonnestig · December 28, 2019

Okay, but I moved it to another port last night, so it shouldn't be on Marvell now.

trurl · December 28, 2019

That screenshot of /boot/config/shares is not as useful as I had hoped since it isn't presenting them in alphabetical order. And all of the screenshots are pretty small to read.

Did you really intend to have so many user shares?

lonnestig · December 29, 2019

I have one share for each category in my Plex and some others for things like nzbs and torrents, but I might have used it wrong from the start. If you have a better example, I would like to see it so I know what to do when I decide to reorganize. I have now moved the SSD to another onboard port that should be Intel.

lonnestig · December 29, 2019

Ever since I tightened the cable on the mobo, the server hasn't rebooted even once.

trurl · December 29, 2019

13 hours ago, trurl said:

That screenshot of /boot/config/shares is not as useful as I had hoped

Let's try another way.

Put your flash in your PC. While there make a backup of it. Then get a directory listing of the files in config/shares on flash, sorted by name, and post it.
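If a file manager makes a sorted listing awkward, a few lines of Python can produce one. This is only a sketch: the function name is my own, and the file names in the demo are made up. It sorts case-insensitively, so names that differ only in upper/lower case end up next to each other.

```python
import tempfile
from pathlib import Path


def list_share_configs(shares_dir):
    """Return file names in shares_dir sorted case-insensitively by name."""
    return sorted(
        (p.name for p in Path(shares_dir).iterdir() if p.is_file()),
        key=str.lower,
    )


# Demo against a temporary folder with hypothetical share config names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Movies.cfg", "appdata.cfg", "TV.cfg"]:
        (Path(tmp) / name).touch()
    print("\n".join(list_share_configs(tmp)))
```

To use it for real, call `list_share_configs` with the path to config/shares on the flash drive as it appears on your PC, and paste the printed list into your post.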
