Stubbs Posted January 23, 2022 (edited)

Disk 4 in my array seems to have issues. It automatically dismounted at some point in the last day, and I don't know why or what's going on. Trying to remount it does not work. When I rebooted the server, it started a file system check. It prompted me to open a certain log, which kept going on and on infinitely:

Again, I do not understand what any of this means. My Unraid's main log looks like this:

Jan 24 02:18:13 Tower kernel: md4: writeback error on inode 98313, offset 125083648, sector 674783544
Jan 24 02:18:13 Tower kernel: XFS (sdg1): Filesystem has duplicate UUID 0fb26a6d-0040-405b-92c9-4fe171b93e9b - can't mount
Jan 24 02:18:13 Tower unassigned.devices: Mount of 'sdg1' failed: 'mount: /mnt/disks/WD-WX22DB0KE762: wrong fs type, bad option, bad superblock on /dev/sdg1, missing codepage or helper program, or other error. '
Jan 24 02:18:13 Tower unassigned.devices: Partition 'WD-WX22DB0KE762' cannot be mounted.
Jan 24 02:18:33 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0
Jan 24 02:18:33 Tower kernel: ata6.00: irq_stat 0x40000008
Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:18:33 Tower kernel: ata6.00: cmd 61/20:30:20:35:d2/00:00:3b:00:00/40 tag 6 ncq dma 16384 out
Jan 24 02:18:33 Tower kernel: res 41/10:00:20:35:d2/00:00:3b:00:00/40 Emask 0x481 (invalid argument) <F>
Jan 24 02:18:33 Tower kernel: ata6.00: status: { DRDY ERR }
Jan 24 02:18:33 Tower kernel: ata6.00: error: { IDNF }
Jan 24 02:18:33 Tower kernel: ata6.00: configured for UDMA/133
Jan 24 02:18:33 Tower kernel: ata6: EH complete
Jan 24 02:18:52 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin update community.applications.plg
Jan 24 02:18:52 Tower root: plugin: running: anonymous
Jan 24 02:18:52 Tower root: plugin: running: anonymous
Jan 24 02:18:52 Tower root: plugin: creating: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz - downloading from URL https://raw.githubusercontent.com/Squidly271/community.applications/master/archive/community.applications-2022.01.22-x86_64-1.txz
Jan 24 02:18:53 Tower root: plugin: checking: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz - MD5
Jan 24 02:18:53 Tower root: plugin: running: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz
Jan 24 02:18:53 Tower root: plugin: running: anonymous
Jan 24 02:18:56 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin update unassigned.devices.plg
Jan 24 02:18:56 Tower root: plugin: running: anonymous
Jan 24 02:18:56 Tower root: plugin: creating: /boot/config/plugins/unassigned.devices/unassigned.devices-2022.01.21.tgz - downloading from URL https://github.com/dlandon/unassigned.devices/raw/master/unassigned.devices-2022.01.21.tgz
Jan 24 02:18:56 Tower nginx: 2022/01/24 02:18:56 [error] 5625#5625: *4570 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 192.168.1.217, server: , request: "POST /plugins/unassigned.devices/UnassignedDevices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.5", referrer: "http://192.168.1.5/Main"
Jan 24 02:18:56 Tower nginx: 2022/01/24 02:18:56 [error] 5625#5625: *4086 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 10.10.20.217, server: , request: "POST /plugins/unassigned.devices/UnassignedDevices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.5", referrer: "http://192.168.1.5/Main"
Jan 24 02:18:57 Tower root: plugin: checking: /boot/config/plugins/unassigned.devices/unassigned.devices-2022.01.21.tgz - MD5
Jan 24 02:18:57 Tower root: plugin: creating: /tmp/start_unassigned_devices - from INLINE content
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/start_unassigned_devices - mode to 0770
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/unassigned.devices.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/samba_mount.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/iso_mount.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/smb-settings.conf already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/config/smb-extra.conf already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/add-smb-extra already exists
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/unassigned.devices/add-smb-extra - mode to 0770
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/remove-smb-extra already exists
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/unassigned.devices/remove-smb-extra - mode to 0770
Jan 24 02:18:57 Tower root: plugin: running: anonymous

I've attached the SMART reports for both Disk 4 and the Parity Drive (which also has read errors). Can someone please explain to me the problems? Did my disk fail? Is there anything that I should be doing?

tower-smart-20220124-0219 (disk 4).zip
tower-smart-20220124-0221 (parity).zip
tower-diagnostics-20220124-0251.zip

Edited January 23, 2022 by Stubbs
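Most of the log in the post above is plugin-update noise; the lines that bear on the disk are the `ata6` kernel messages. As a rough sketch of how those could be pulled out of a saved syslog, the Python below counts failed commands and error flags per ATA port. The function name `summarize_ata_errors` and the regular expressions are illustrative assumptions, not part of Unraid or any diagnostic tool, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches kernel ATA lines such as
# "Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED".
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")
# Matches the flag list in messages such as "error: { IDNF }".
ERROR_FLAGS = re.compile(r"^error: \{ (.+?) \}$")


def summarize_ata_errors(lines):
    """Count failed commands and error flags per ATA port (hypothetical helper)."""
    failed = Counter()
    flags = Counter()
    for line in lines:
        match = ATA_LINE.search(line.strip())
        if not match:
            continue  # not an ATA kernel message (e.g. plugin or nginx output)
        port, message = match.groups()
        if message.startswith("failed command:"):
            command = message.split(":", 1)[1].strip()
            failed[(port, command)] += 1
        flag_match = ERROR_FLAGS.match(message)
        if flag_match:
            for flag in flag_match.group(1).split():
                flags[(port, flag)] += 1
    return failed, flags


# Sample lines taken from the syslog excerpt in the post above.
sample = [
    "Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED",
    "Jan 24 02:18:33 Tower kernel: ata6.00: status: { DRDY ERR }",
    "Jan 24 02:18:33 Tower kernel: ata6.00: error: { IDNF }",
    "Jan 24 02:18:33 Tower kernel: ata6: EH complete",
    "Jan 24 02:18:52 Tower root: plugin: running: anonymous",
]
failed, flags = summarize_ata_errors(sample)
print(dict(failed))
print(dict(flags))
```

A count that keeps growing on a single port, as with `ata6` here, points at that one drive, cable, or bay rather than at the controller as a whole; it does not by itself say which of the three is at fault.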
Squid Posted January 23, 2022
You should post your entire diagnostics. SMART looks OK, and what can be inferred from the syslog snippet (full diagnostics help much more) is a connection issue (poor or slightly loose cabling, etc.), so reseat the cables. Also be aware that non-locking cables tend to work better on WD drives due to their design (or a locking cable that also has the internal "bump", which is actually rather rare).
Stubbs Posted January 23, 2022 Author
6 minutes ago, Squid said: You should post your entire diagnostics. SMART looks OK, and what can be inferred from the syslog snippet (full diagnostics help much more) is a connection issue (poor or slightly loose cabling, etc.), so reseat the cables. Also be aware that non-locking cables tend to work better on WD drives due to their design (or a locking cable that also has the internal "bump", which is actually rather rare).
I forgot about the diagnostics log, sorry. Here it is (attached). This drive is also in one of the hotswap bays. I had not had this issue before; it has only just occurred, abruptly. Unraid also just automatically initiated a "read check". I don't know if that will help or not.
tower-diagnostics-20220124-0251.zip
Squid Posted January 23, 2022
Over the course of drive 4's life (4599 hours), it's had 34526 errors logged. Only the last couple appear within the drive's logs, but they all appear to be connection related. Advice stays the same: reseat the cabling to it and the bay, and also give the tray an extra little nudge.
For parity I'd say the same (but only 14000 errors over 22000 hours).
The other drives haven't suffered the same ills.
While an occasional device reset is OK, continual ones will eventually lead to the drive being disabled, because a write failure is going to happen at some point.
It's also possible that the particular bay you're using just doesn't "like" the drive itself. (I've got one bay that refuses to work properly with one particular drive.)
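To put the two error counts quoted above on the same footing, here is a small sketch that converts them to errors per power-on hour. The figures are the ones given in the post (the parity count is a round number, so treat its rate as approximate); the helper name `errors_per_hour` is made up for illustration.

```python
def errors_per_hour(error_count, power_on_hours):
    """Average logged errors per power-on hour (hypothetical helper)."""
    return error_count / power_on_hours


# Figures quoted in the post above.
disk4 = errors_per_hour(34526, 4599)
parity = errors_per_hour(14000, 22000)

print(f"disk 4: {disk4:.2f} errors/hour")   # 7.51
print(f"parity: {parity:.2f} errors/hour")  # 0.64
print(f"ratio:  {disk4 / parity:.1f}x")     # 11.8x
```

On these numbers disk 4 has logged errors at roughly twelve times the rate of the parity drive, which fits the advice to look at disk 4's connection first.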
Stubbs Posted January 23, 2022 Author (edited)
8 minutes ago, Squid said: Over the course of drive 4's life (4599 hours), it's had 34526 errors logged. Only the last couple appear within the drive's logs, but they all appear to be connection related. Advice stays the same: reseat the cabling to it and the bay, and also give the tray an extra little nudge. For parity I'd say the same (but only 14000 errors over 22000 hours). The other drives haven't suffered the same ills. While an occasional device reset is OK, continual ones will eventually lead to the drive being disabled, because a write failure is going to happen at some point. It's also possible that the particular bay you're using just doesn't "like" the drive itself. (I've got one bay that refuses to work properly with one particular drive.)
It's just weird that it was working perfectly fine for the last 6 or so months in this very bay, only to stop working now. Are you sure it's not some kind of XFS filesystem error? Because I actually did manage to re-mount the drive into my array; it's just marked with a red x saying "device disabled, contents emulated", and it then started a "Read check" which will take about 26 hours to complete. Also, my parity drive currently has 144 read errors, disk 3 has 168 read errors and disk 4 (the broken one) has 1024 read errors.
Edited January 23, 2022 by Stubbs
Squid Posted January 23, 2022
Cancel the read check... No big issue there. Reseat all the cabling, both ends etc. Then try rebuilding the parity drive onto itself.
Stubbs Posted January 23, 2022 Author
1 hour ago, Squid said: Cancel the read check... No big issue there. Reseat all the cabling, both ends etc. Then try rebuilding the parity drive onto itself.
I don't have the time to open up the server and adjust the cabling right now, but I did try putting the problem drive in a different bay (with a different cable). Same problem, same drive. Just for the sake of it, I'm attaching the diagnostics .zip for when the problem disk is actually in the array, but not functioning.
tower-diagnostics-20220123-1728.zip
Stubbs Posted January 24, 2022 Author
According to people in the Unraid Discord, it's literally a failed disk that needs to be replaced.
JorgeB Posted January 24, 2022
16 hours ago, Stubbs said: Same problem, same drive.
Once a device gets disabled it needs to be rebuilt; just changing cables or slot won't fix anything. You can rebuild and see if the problem occurs again, and if it does, replace the disk.
Stubbs Posted January 24, 2022 Author
1 hour ago, JorgeB said: Once a device gets disabled it needs to be rebuilt; just changing cables or slot won't fix anything. You can rebuild and see if the problem occurs again, and if it does, replace the disk.
Yeah, I tried rebuilding, and at the 2% mark, it stopped, disabled the drive and started another "read check". I'm just going to have to RMA and replace.
JorgeB Posted January 24, 2022
1 hour ago, Stubbs said: I tried rebuilding, and at the 2% mark, it stopped, disabled the drive
Diags you posted didn't show a rebuild, but yeah, in that case you should replace it.
Stubbs Posted January 24, 2022 Author (edited)
5 minutes ago, JorgeB said: Diags you posted didn't show a rebuild, but yeah, in that case you should replace it.
I can't remember if the previous diagnostics was before or after I tried the rebuild. This one attached is after the failed rebuild and completed read check:
tower-diagnostics-20220124-2349.zip
Edited January 24, 2022 by Stubbs
Solution JorgeB Posted January 24, 2022
It's logged more like a connection/power issue, but since it failed in a different slot it's likely a disk problem.
Stubbs Posted January 24, 2022 Author
I've just taken the drive out, connected it to my Windows computer with an HDD docking bay, formatted it as NTFS and it seems to be working fine. At this point, I honestly have no clue what's going on. I'll try formatting it again on Unraid in yet another slot.
Stubbs Posted January 24, 2022 Author
Well, looks like I fixed my problem. Thanks for the help. I have a hunch for what caused it, but it's too stupid for words.
JorgeB Posted January 24, 2022
7 minutes ago, Stubbs said: but it's too stupid for words.
Now you made me curious
JonathanM Posted January 24, 2022
52 minutes ago, Stubbs said: Well, looks like I fixed my problem. Thanks for the help. I have a hunch for what caused it, but it's too stupid for words.
Spill! Novel ways of causing intermittent issues are always interesting, especially non-intuitive stuff.
Stubbs Posted January 24, 2022 Author
59 minutes ago, JorgeB said: Now you made me curious
12 minutes ago, JonathanM said: Spill! Novel ways of causing intermittent issues are always interesting, especially non-intuitive stuff.
Alright, I'm not 100% sure this is what caused it, but when removing the tray, I noticed the front two screws holding the hard drive in had slightly thicker heads than the back two. The drive was still connected to the SATA and power connector in the drive bay, but it probably wasn't the most stable connection. I replaced those two thicker screws with the correct thinner-head ones, and all of a sudden it works fine.
JonathanM Posted January 24, 2022
Very plausible. The SATA slip-fit connection is precise to fractions of a mm. Plastic and metal drive tray assemblies, not so much. I make it a habit to do the final tightening of the screws in a drive tray while holding only the tray, letting the drive hang freely against the partially tightened screws. That way all the tolerances should stack up to put the drive at the absolute rear of the tray. SATA / SAS connectors are an electrical nightmare; they cause WAY more drive errors than actual disk failure.