lonnestig Posted December 27, 2019 My Unraid server suddenly reboots. Sometimes it restarts 10 times a day, and sometimes it runs for a week without a restart. When it starts rebooting, it can reboot with just minutes in between. Here is the disk log from my SSD cache, which might be the problem. I just don't know where to begin troubleshooting.
Dec 27 23:14:45 Tower kernel: ata8: SATA max UDMA/133 abar m2048@0xf9bff800 port 0xf9bff980 irq 36
Dec 27 23:14:45 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:14:45 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:14:45 Tower kernel: ata8.00: ATA-11: Samsung SSD 860 EVO 1TB, S3Z9NB0K915954W, RVT01B6Q, max UDMA/133
Dec 27 23:14:45 Tower kernel: ata8.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Dec 27 23:14:45 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:14:45 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sdi: sdi1
Dec 27 23:14:45 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:14:45 Tower kernel: sd 8:0:0:0: [sdi] Attached SCSI disk
Dec 27 23:16:07 Tower emhttpd: Samsung_SSD_860_EVO_1TB_S3Z9NB0K915954W (sdi) 512 1953525168
Dec 27 23:16:07 Tower emhttpd: import 30 cache device: (sdi) Samsung_SSD_860_EVO_1TB_S3Z9NB0K915954W
Dec 27 23:16:25 Tower emhttpd: shcmd (50): /usr/sbin/cryptsetup luksOpen /dev/sdi1 sdi1 --allow-discards --key-file=/root/keyfile
Dec 27 23:16:32 Tower emhttpd: shcmd (77): mount -t btrfs -o noatime,nodiratime /dev/mapper/sdi1 /mnt/cache
Dec 27 23:19:25 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x6 frozen
Dec 27 23:19:25 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:19:25 Tower kernel: ata8.00: cmd 60/20:b8:e0:aa:01/00:00:00:00:00/40 tag 23 ncq dma 16384 in
Dec 27 23:19:25 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:19:25 Tower kernel: ata8: hard resetting link
Dec 27 23:19:25 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:19:25 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:19:25 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:19:25 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:19:25 Tower kernel: ata8: EH complete
Dec 27 23:19:25 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:20:20 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x40000001 SErr 0x0 action 0x6 frozen
Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:20 Tower kernel: ata8.00: cmd 60/08:00:20:43:53/00:00:00:00:00/40 tag 0 ncq dma 4096 in
Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:20 Tower kernel: ata8.00: cmd 60/20:f0:c0:f1:02/00:00:00:00:00/40 tag 30 ncq dma 16384 in
Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:20 Tower kernel: ata8: hard resetting link
Dec 27 23:20:20 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:20:20 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:20 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:20 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 Sense Key : 0x5 [current]
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 ASC=0x21 ASCQ=0x4
Dec 27 23:20:20 Tower kernel: sd 8:0:0:0: [sdi] tag#0 CDB: opcode=0x28 28 00 00 53 43 20 00 00 08 00
Dec 27 23:20:20 Tower kernel: print_req_error: I/O error, dev sdi, sector 5456672
Dec 27 23:20:20 Tower kernel: ata8: EH complete
Dec 27 23:20:20 Tower kernel: ata8.00: Enabling discard_zeroes_data
Dec 27 23:20:51 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x6 frozen
Dec 27 23:20:51 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Dec 27 23:20:51 Tower kernel: ata8.00: cmd 60/08:58:68:b9:8a/00:00:00:00:00/40 tag 11 ncq dma 4096 in
Dec 27 23:20:51 Tower kernel: ata8.00: status: { DRDY }
Dec 27 23:20:51 Tower kernel: ata8: hard resetting link
Dec 27 23:20:51 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 27 23:20:51 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:51 Tower kernel: ata8.00: supports DRM functions and may not be fully accessible
Dec 27 23:20:51 Tower kernel: ata8.00: configured for UDMA/133
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 Sense Key : 0x5 [current]
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 ASC=0x21 ASCQ=0x4
Dec 27 23:20:51 Tower kernel: sd 8:0:0:0: [sdi] tag#11 CDB: opcode=0x28 28 00 00 8a b9 68 00 00 08 00
Dec 27 23:20:51 Tower kernel: print_req_error: I/O error, dev sdi, sector 9091432
Dec 27 23:20:51 Tower kernel: ata8: EH complete
Dec 27 23:20:51 Tower kernel: ata8.00: Enabling discard_zeroes_data
I'm kind of new to Unraid and the whole Linux thing, so I want all the help I can get.
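For readers trying to connect the `CDB:` lines in the log above to the `print_req_error` sector numbers, here is a minimal Python sketch. It assumes the standard 10-byte SCSI READ(10) layout (opcode 0x28, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 to 8) and the 512-byte logical block size the log reports for sdi. The helper name `decode_read10` is made up for illustration and is not part of any Unraid tool.

```python
def decode_read10(cdb_hex: str) -> tuple:
    """Return (lba, block_count) from a 10-byte SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # starting logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # number of blocks requested
    return lba, blocks


# The two CDBs copied from the log in the first post.
for cdb in ("28 00 00 53 43 20 00 00 08 00", "28 00 00 8a b9 68 00 00 08 00"):
    lba, blocks = decode_read10(cdb)
    print(f"LBA {lba}, {blocks} blocks ({blocks * 512} bytes)")
# prints:
# LBA 5456672, 8 blocks (4096 bytes)
# LBA 9091432, 8 blocks (4096 bytes)
```

The decoded LBAs match the sectors in the two `I/O error, dev sdi` lines, and the 4096-byte length matches `ncq dma 4096 in`, so the failed reads land at unrelated places on the disk rather than repeating on one sector.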
trurl Posted December 27, 2019 Go to Tools - Diagnostics and attach the complete diagnostics zip file to your NEXT post. Have you done memtest?
lonnestig Posted December 27, 2019 No, I haven't done a memtest. Can I do that in Unraid? I have attached the diagnostics file: tower-diagnostics-20191228-0021.zip
trurl Posted December 27, 2019 1 minute ago, lonnestig said: "No I haven't done a memtest, can I do that in unraid?" Not in Unraid; it has to be done before booting Unraid. Memtest is available on the boot menu.
lonnestig Posted December 27, 2019 Okay, then I'll do that and return with the results.
trurl Posted December 28, 2019 Those entries in the log in your first post suggest a problem communicating with the cache disk, likely a bad connection or cable. I notice Fix Common Problems warnings in the syslog in your diagnostics. One of those warnings says you have syslog mirrored to flash. I think that must mean you have Syslog Server set up. The syslog in your diagnostics only has information since the last boot, but Syslog Server saves syslogs indefinitely, and they might tell us more about what happened before the reboot. Post those syslogs from your flash, preferably zipped. I see a lot of other things in your diagnostics that show things are configured wrong, but it's probably best to take this a little at a time.
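When going through a long saved syslog for the same pattern, a rough count of failed commands and link resets per ATA port can show whether the trouble stays on one port. The following is only a sketch: it assumes the kernel message wording seen in the first post (`failed command:` and `hard resetting link`), and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Patterns based on the message wording in the log from the first post.
FAILED = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+)$")
RESET = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize(lines):
    """Count failed commands per (device, command) and hard resets per port."""
    failed, resets = Counter(), Counter()
    for line in lines:
        line = line.rstrip()
        m = FAILED.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
            continue
        m = RESET.search(line)
        if m:
            resets[m.group(1)] += 1
    return failed, resets


# A few lines copied from the 23:20:20 event in the first post.
sample = [
    "Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED",
    "Dec 27 23:20:20 Tower kernel: ata8.00: status: { DRDY }",
    "Dec 27 23:20:20 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED",
    "Dec 27 23:20:20 Tower kernel: ata8: hard resetting link",
]
failed, resets = summarize(sample)
print(dict(failed))
print(dict(resets))
```

To run it over a saved file instead of the sample, pass an open file object to `summarize`, since it only needs an iterable of lines.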
lonnestig Posted December 28, 2019 I have removed the logging, and I have 29 GB free on the flash drive, so that's not the problem. Here is the syslog file: syslog (1). When the memtest is done, I'm going to change the cable to the cache SSD.
lonnestig Posted December 28, 2019 No errors during memtest.
trurl Posted December 28, 2019 2 hours ago, lonnestig said: "I have removed the logging and I have 29 more GB on the flashdrive left so that's not the problem. Here is the syslog file." The reason for the warning is that frequent writes to flash are not good for its life. It is possible to configure Syslog Server to write to something else, such as a user share. But for quick troubleshooting, using flash is OK as long as you remember to turn it off. I think you may have to back up the cache and reformat it. I'll see what @johnnie.black thinks about that.
trurl Posted December 28, 2019 You have used 31G of a 50G docker image. That almost certainly indicates you have one or more of your applications misconfigured, and they are writing into the docker image instead of to mapped storage. I always recommend allocating only 20G for the docker image (less than you have already used). Making it larger won't fix your problem; it will just make it take longer to fill.
Typically the way this happens is that an application is set to write to a path that doesn't correspond to a container volume mapping. Common mistakes are not using an absolute path, or not using upper/lower case that matches the mappings.
Another way you can misconfigure a docker is by using a host mapping that doesn't correspond to actual storage. That will result in a docker writing into RAM. Note that any mapping to an Unassigned Device that isn't actually mounted will also be in RAM.
The way forward for this is going to be deleting the docker image, recreating it at the more reasonable size of 20G, then using the Previous Apps feature of the Apps page to reinstall your dockers just as before. But just as before isn't good enough, so do it one docker at a time and consider what is wrong with the way each one is set up.
But before that, let's wait and see if we get another opinion on what to do about your cache drive. Maybe for now just disable the docker service and wait. Possibly that might even make your crashing go away.
I have some things to say about your user shares. I will start a new post for that.
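To make the two common mistakes above concrete (a relative path, or a path whose case doesn't match the mapping), here is a small Python sketch. It is not how Docker or Unraid resolves anything; it only illustrates the comparison. The function name `is_mapped` and the example container paths `/config` and `/downloads` are hypothetical.

```python
from pathlib import PurePosixPath


def is_mapped(app_path: str, container_paths: list) -> bool:
    """True if app_path is absolute and equals or sits under a mapped container path."""
    p = PurePosixPath(app_path)
    if not p.is_absolute():
        return False  # a relative path ends up inside the docker image
    for c in container_paths:
        cp = PurePosixPath(c)
        if p == cp or cp in p.parents:  # comparison is case-sensitive
            return True
    return False


# Hypothetical container-side paths from a container's volume mappings.
mappings = ["/config", "/downloads"]

print(is_mapped("/downloads/complete", mappings))  # True
print(is_mapped("/Downloads/complete", mappings))  # False: case doesn't match
print(is_mapped("downloads/complete", mappings))   # False: not an absolute path
```

Any path an application is told to write to that comes back `False` here would, per the explanation above, be written inside the docker image rather than to mapped storage.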
trurl Posted December 28, 2019 You have an unusually large number of user share configuration files on flash. Some appear to be duplicates, and many are not actually user shares with any contents. Possibly some of these shares were created by accident by some of your dockers. Go to Shares - User Shares and click the Compute All button at the bottom. Wait for the result; it will probably take several minutes. Then post a screenshot. Also, go to Main - Boot Device and click on the folder icon at far right for flash to drill down into its contents. Then click on config, then click on shares, then post a screenshot.
JorgeB Posted December 28, 2019 The ATA errors could be a bad cable or could be the controller, since the SSD is on a Marvell controller; avoid those if possible. More worrying are the checksum errors, which indicate data corruption. I would start by connecting the SSD to one of the onboard SATA ports to see if the ATA errors go away, then run a scrub. If there are any uncorrectable errors, you need to delete the affected files or reformat the cache.
lonnestig Posted December 28, 2019 The SSD is on the onboard connections. But now I think I found the problem... The big power supply connection on the mobo is glitching. When I stretch the PSU cable to the mobo it seems to run, and when I release it, it dies... Here are the screenshots of my shares and of the boot device.
JorgeB Posted December 28, 2019 3 minutes ago, lonnestig said: "The SSD is on the onboard connections" Yes, but 2 of those ports are on a Marvell controller. That doesn't mean they don't work, just that they are much more prone to issues compared to the Intel SATA ports, which are usually rock solid.
lonnestig Posted December 28, 2019 Okay, but I moved it to another port last night, so it shouldn't be on Marvell now.
trurl Posted December 28, 2019 That screenshot of /boot/config/shares is not as useful as I had hoped since it isn't presenting them in alphabetical order. And all of the screenshots are pretty small to read. Did you really intend to have so many user shares?
lonnestig Posted December 29, 2019 I have one share for each category on my Plex and some others for things like nzbs and torrents, but I might have used it wrong from the start. If you have a better example, I would like to see it so I know what to do when I decide to reorganize. I have now moved the SSD to another onboard port that should be Intel.
lonnestig Posted December 29, 2019 Ever since I tightened the cable on the mobo, the server hasn't rebooted even once.
trurl Posted December 29, 2019 13 hours ago, trurl said: "That screenshot of /boot/config/shares is not as useful as I had hoped" Let's try another way. Put your flash in your PC. While it is there, make a backup of it. Then get a directory listing of the files in config/shares on flash, sorted by name, and post it.
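One possible way to produce that sorted listing on a PC, sketched in Python. It assumes Python is installed on the PC; the function name `list_share_configs` is made up, and the drive letter in the commented usage line is only a placeholder for wherever the flash drive shows up.

```python
import os


def list_share_configs(shares_dir):
    """Return the file names in shares_dir sorted by name, ignoring case."""
    return sorted(os.listdir(shares_dir), key=lambda name: (name.lower(), name))


# Usage (the path is an assumption; substitute the flash drive's actual location):
# print("\n".join(list_share_configs("E:/config/shares")))
```

The printed list can be pasted into a post as text, which avoids the unsorted, hard-to-read screenshot.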