
Sudden unclean reboots, SSD Cache error



My Unraid server suddenly reboots. Sometimes it restarts 10 times a day, and sometimes it runs for a week without a restart. Once it starts rebooting, the reboots can be just minutes apart. Here is the disk log from my SSD cache, which might be the problem. I just don't know where to begin troubleshooting.

 

Quote

Dec 27 23:14:45 Tower kernel: ata8: SATA max UDMA/133 abar m2048@0xf9bff800 port 0xf9bff980 irq 36
Dec 27 23:14:45 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:14:45 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:14:45 Tower kernel: ata8.00: ATA-11: Samsung SSD 860 EVO 1TB, S3Z9NB0K915954W, RVT01B6Q, max UDMA/133
Dec 27 23:14:45 Tower kernel: ata8.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Dec 27 23:14:45 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:14:45 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sdi: sdi1
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Attached SCSI disk
Dec 27 23:16:07 Tower emhttpd: Samsung_SSD_860_EVO_1TB_S3Z9NB0K915954W (sdi) 512 1953525168
Dec 27 23:16:07 Tower emhttpd: import 30 cache device: (sdi) Samsung_SSD_860_EVO_1TB_S3Z9NB0K915954W
Dec 27 23:16:25 Tower emhttpd: shcmd (50): /usr/sbin/cryptsetup luksOpen /dev/sdi1 sdi1 --allow-discards --key-file=/root/keyfile
Dec 27 23:16:32 Tower emhttpd: shcmd (77): mount -t btrfs -o noatime,nodiratime /dev/mapper/sdi1 /mnt/cache
Dec 27 23:19:25 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x6 frozen
Dec 27 23:19:25 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:19:25 Tower kernel: ata8.00: cmd 60/20:b8:e0:aa:01/00:00:00:00:00/40 tag 23 ncq dma 16384 in
Dec 27 23:19:25 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:19:25 Tower kernel: ata8: hard resetting link
Dec 27 23:19:25 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:19:25 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:19:25 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:19:25 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:19:25 Tower kernel: ata8: EH complete
Dec 27 23:19:25 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:20:20 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x40000001 SErr 0x0 action 0x6 frozen
Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:20 Tower kernel: ata8.00: cmd 60/08:00:20:43:53/00:00:00:00:00/40 tag 0 ncq dma 4096 in
Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:20 Tower kernel: ata8.00: cmd 60/20:f0:c0:f1:02/00:00:00:00:00/40 tag 30 ncq dma 16384 in
Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:20 Tower kernel: ata8: hard resetting link
Dec 27 23:20:20 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:20:20 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:20 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:20 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 Sense Key : 0x5 [current]
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 ASC=0x21 ASCQ=0x4
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 CDB: opcode=0x28 28 00 00 53 43 20 00 00 08 00
Dec 27 23:20:20 Tower kernel: print_req_error: I/O error, dev sdi, sector 5456672
Dec 27 23:20:20 Tower kernel: ata8: EH complete
Dec 27 23:20:20 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:20:51 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x6 frozen
Dec 27 23:20:51 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:51 Tower kernel: ata8.00: cmd 60/08:58:68:b9:8a/00:00:00:00:00/40 tag 11 ncq dma 4096 in
Dec 27 23:20:51 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:51 Tower kernel: ata8: hard resetting link
Dec 27 23:20:51 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:20:51 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:51 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:51 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 Sense Key : 0x5 [current]
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 ASC=0x21 ASCQ=0x4
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 CDB: opcode=0x28 28 00 00 8a b9 68 00 00 08 00
Dec 27 23:20:51 Tower kernel: print_req_error: I/O error, dev sdi, sector 9091432
Dec 27 23:20:51 Tower kernel: ata8: EH complete
Dec 27 23:20:51 Tower kernel: ata8.00: Enabling discard_zeroes_data

I'm kind of new to Unraid and the whole Linux thing, so I'd appreciate all the help I can get.

Link to comment

The ATA errors in the log in your first post (failed READ FPDMA QUEUED commands, each followed by a hard link reset) suggest a problem communicating with the cache disk, most likely a bad connection or cable.
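
If it helps to see how often this happens, here is a rough Python sketch that counts the exception and reset events per ATA port and lists the failed sectors. The regular expressions are only written against the message shapes visible in the quoted log, and the function names (`summarize`, `read10_lba`) are my own, so treat it as a starting point rather than a proper log parser. The second function decodes the READ(10) command block printed on the `CDB:` line, where bytes 2 to 5 hold the starting block address, so you can confirm it matches the sector in the I/O error line.

```python
import re
from collections import Counter

# Patterns written against the kernel messages quoted in the first post.
EXCEPTION = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception ")
RESET = re.compile(r"kernel: (ata\d+): hard resetting link")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def summarize(lines):
    """Count ATA exceptions and link resets per port and collect failed sectors."""
    exceptions, resets, sectors = Counter(), Counter(), []
    for line in lines:
        if m := EXCEPTION.search(line):
            exceptions[m.group(1)] += 1
        elif m := RESET.search(line):
            resets[m.group(1)] += 1
        elif m := IO_ERROR.search(line):
            sectors.append((m.group(1), int(m.group(2))))
    return exceptions, resets, sectors


def read10_lba(cdb_hex):
    """Return the block address from a READ(10) CDB (bytes 2..5, big-endian)."""
    cdb = bytes.fromhex(cdb_hex)
    return int.from_bytes(cdb[2:6], "big")


# Four lines copied from the quoted log.
sample = [
    "Dec 27 23:20:20 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x40000001 SErr 0x0 action 0x6 frozen",
    "Dec 27 23:20:20 Tower kernel: ata8: hard resetting link",
    "Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 CDB: opcode=0x28 28 00 00 53 43 20 00 00 08 00",
    "Dec 27 23:20:20 Tower kernel: print_req_error: I/O error, dev sdi, sector 5456672",
]
exceptions, resets, sectors = summarize(sample)
print(dict(exceptions), dict(resets), sectors)
print(read10_lba("28 00 00 53 43 20 00 00 08 00"))
```

If every event lands on the same port (ata8 here) and the failing sectors are scattered across the disk, that fits a link problem better than a single bad spot on the SSD.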

 

I notice Fix Common Problems warnings in the syslog in your diagnostics. One of those warnings says you have syslog mirrored to flash. I think that must mean you have Syslog Server set up.

 

The syslog in your diagnostics only has information since the last boot, but Syslog Server saves syslogs indefinitely, and they might tell us more about what happened before the reboot. Post those syslogs from your flash, preferably zipped.

 

I see a lot of other things in your diagnostics that are configured wrong, but it's probably best to take this a little at a time.

 

Link to comment
2 hours ago, lonnestig said:

I have removed the logging, and I still have 29 GB free on the flash drive, so that's not the problem. Here is the syslog file.

The reason for the warning is that frequent writes to flash shorten its life. It is possible to configure Syslog Server to write somewhere else, such as a user share. But for quick troubleshooting, using flash is OK as long as you remember to turn it off.

 

I think you may have to back up the cache and reformat it. I'll see what @johnnie.black thinks about that.

Link to comment

You have used 31G of a 50G docker image. That almost certainly indicates you have one or more of your applications misconfigured and they are writing into the docker image instead of to mapped storage.

 

I always recommend allocating only 20G for the docker image (less than you have already used). Making it larger won't fix your problem; it will just make it take longer to fill.

 

Typically the way this happens is an application is set to write to a path that doesn't correspond to a container volume mapping. Common mistakes are not using an absolute path, or not using upper/lower case that matches the mappings.
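
To make that concrete, here is a small Python sketch of the check I do in my head for each application path. The function name `is_inside_mapping` and the example container paths are made up for illustration; substitute the container paths from your own templates.

```python
from pathlib import PurePosixPath


def is_inside_mapping(app_path, container_paths):
    """Return True if app_path is absolute and lies under a mapped container path.

    The comparison is case-sensitive, just like paths inside a Linux container.
    """
    path = PurePosixPath(app_path)
    if not path.is_absolute():
        return False
    return any(
        path == PurePosixPath(c) or PurePosixPath(c) in path.parents
        for c in container_paths
    )


# Hypothetical container-side paths from a docker template.
mappings = ["/config", "/downloads"]
for candidate in ["/downloads/complete", "downloads/complete", "/Downloads/complete"]:
    print(candidate, is_inside_mapping(candidate, mappings))
```

Only the first candidate is inside a mapping. The second is not an absolute path and the third differs in case, so an application set to either of those would write into the docker image.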

 

Another way you can misconfigure a docker is by using a host mapping that doesn't correspond to actual storage. That will result in a docker writing into RAM. Note that any mapping to an Unassigned Device that isn't actually mounted will also be in RAM.
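
A rough way to sanity-check the host side of each mapping is sketched below in Python. It assumes the usual Unraid layout, with user shares under /mnt/user, array disks under /mnt/diskN, a pool named cache under /mnt/cache, and Unassigned Devices mounted under /mnt/disks. Your pool names may differ, and the function name and return labels are my own, so adjust as needed.

```python
import os
import re

# Assumed Unraid layout; adjust if your pools are named differently.
ARRAY_OR_POOL = re.compile(r"^/mnt/(user0?|cache|disk\d+)(/|$)")
UNASSIGNED = re.compile(r"^(/mnt/disks/[^/]+)(/|$)")


def classify_host_path(host_path, is_mounted=os.path.ismount):
    """Label a host path by what is expected to back it."""
    m = UNASSIGNED.match(host_path)
    if m:
        # An Unassigned Device path only reaches a disk if the device is mounted.
        return "unassigned" if is_mounted(m.group(1)) else "unassigned-not-mounted"
    if ARRAY_OR_POOL.match(host_path):
        return "storage"
    # Anything else needs a manual look; it may well end up in RAM.
    return "check"


# Hypothetical host paths; the lambda stands in for a device that is not mounted.
print(classify_host_path("/mnt/user/appdata/plex"))
print(classify_host_path("/mnt/disks/usb1/media", is_mounted=lambda p: False))
print(classify_host_path("/mnt/usr/appdata"))
```

The last example is the kind of typo that quietly sends data to RAM instead of a disk.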

 

The way forward is to delete the docker image, recreate it at the more reasonable size of 20G, then use the Previous Apps feature of the Apps page to reinstall your dockers just as before. But just as before isn't good enough, so do it one docker at a time and consider what is wrong with the way each one is set up.

 

But before that, let's wait and see if we get another opinion on what to do about your cache drive. Maybe for now just disable the docker service and wait. That might even make your crashing go away.

 

I have some things to say about your user shares. I will start a new post for that.

Link to comment

You have an unusually large number of user share configuration files on flash. Some appear to be duplicates, and many are not actually user shares with any contents. Possibly some of these shares were created by accident by some of your dockers.

 

Go to Shares - User Shares and click the Compute All button at the bottom. Wait for the result; it will probably take several minutes. Then post a screenshot.

 

Also, go to Main - Boot Device, click on the folder icon at far right for flash to drill down into its contents. Then click on config, then click on shares, then post a screenshot.

Link to comment

The ATA errors could be a bad cable, or they could be the controller, since the SSD is on a Marvell controller. Avoid those if possible.

 

More worrying are the checksum errors, which indicate data corruption. I would start by connecting the SSD to one of the onboard SATA ports to see if the ATA errors go away, then run a scrub. If there are any uncorrectable errors, you need to delete the affected files or reformat the cache.
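
Besides the scrub result, the per-device error counters from `btrfs device stats /mnt/cache` are worth checking. Below is a small Python sketch that picks the non-zero counters out of that output. The sample text is made up purely to show the format and is not taken from these diagnostics; the function name `nonzero_stats` is also my own.

```python
import re

# One line per counter, e.g. "[/dev/mapper/sdi1].corruption_errs  3".
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def nonzero_stats(text):
    """Return {(device, counter): value} for every non-zero counter."""
    found = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m and int(m.group("value")) > 0:
            found[(m.group("dev"), m.group("name"))] = int(m.group("value"))
    return found


# Illustrative output only; these numbers are invented for the example.
sample = """\
[/dev/mapper/sdi1].write_io_errs    0
[/dev/mapper/sdi1].read_io_errs     12
[/dev/mapper/sdi1].flush_io_errs    0
[/dev/mapper/sdi1].corruption_errs  3
[/dev/mapper/sdi1].generation_errs  0
"""
print(nonzero_stats(sample))
```

Non-zero read or write counters point at the link or the device, while non-zero corruption counters mean checksum failures, which is what the scrub will try to find and report.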

Link to comment

I have one share for each category in my Plex library and some others for things like nzbs and torrents, but I might have used it wrong from the start. If you have a better example I would like to see it, so I know what to do when I decide to reorganize :) I have now moved the SSD to another onboard port, which should be Intel.

Link to comment
