Disk Failure? Help needed

Stubbs · January 23, 2022

Disk 4 in my array seems to have issues. It automatically dismounted at some point in the last day, and I don't know why or what's going on. Trying to remount it does not work.

When I rebooted the server, it started a file system check. It prompted me to open a certain log, which kept going on and on infinitely:

Again, I do not understand what any of this means.

My Unraid's main log looks like this:

Jan 24 02:18:13 Tower kernel: md4: writeback error on inode 98313, offset 125083648, sector 674783544
Jan 24 02:18:13 Tower kernel: XFS (sdg1): Filesystem has duplicate UUID 0fb26a6d-0040-405b-92c9-4fe171b93e9b - can't mount
Jan 24 02:18:13 Tower unassigned.devices: Mount of 'sdg1' failed: 'mount: /mnt/disks/WD-WX22DB0KE762: wrong fs type, bad option, bad superblock on /dev/sdg1, missing codepage or helper program, or other error. '
Jan 24 02:18:13 Tower unassigned.devices: Partition 'WD-WX22DB0KE762' cannot be mounted.
Jan 24 02:18:33 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0
Jan 24 02:18:33 Tower kernel: ata6.00: irq_stat 0x40000008
Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:18:33 Tower kernel: ata6.00: cmd 61/20:30:20:35:d2/00:00:3b:00:00/40 tag 6 ncq dma 16384 out
Jan 24 02:18:33 Tower kernel: res 41/10:00:20:35:d2/00:00:3b:00:00/40 Emask 0x481 (invalid argument) <F>
Jan 24 02:18:33 Tower kernel: ata6.00: status: { DRDY ERR }
Jan 24 02:18:33 Tower kernel: ata6.00: error: { IDNF }
Jan 24 02:18:33 Tower kernel: ata6.00: configured for UDMA/133
Jan 24 02:18:33 Tower kernel: ata6: EH complete
Jan 24 02:18:52 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin update community.applications.plg
Jan 24 02:18:52 Tower root: plugin: running: anonymous
Jan 24 02:18:52 Tower root: plugin: running: anonymous
Jan 24 02:18:52 Tower root: plugin: creating: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz - downloading from URL https://raw.githubusercontent.com/Squidly271/community.applications/master/archive/community.applications-2022.01.22-x86_64-1.txz
Jan 24 02:18:53 Tower root: plugin: checking: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz - MD5
Jan 24 02:18:53 Tower root: plugin: running: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz
Jan 24 02:18:53 Tower root: plugin: running: anonymous
Jan 24 02:18:56 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin update unassigned.devices.plg
Jan 24 02:18:56 Tower root: plugin: running: anonymous
Jan 24 02:18:56 Tower root: plugin: creating: /boot/config/plugins/unassigned.devices/unassigned.devices-2022.01.21.tgz - downloading from URL https://github.com/dlandon/unassigned.devices/raw/master/unassigned.devices-2022.01.21.tgz
Jan 24 02:18:56 Tower nginx: 2022/01/24 02:18:56 [error] 5625#5625: *4570 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 192.168.1.217, server: , request: "POST /plugins/unassigned.devices/UnassignedDevices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.5", referrer: "http://192.168.1.5/Main"
Jan 24 02:18:56 Tower nginx: 2022/01/24 02:18:56 [error] 5625#5625: *4086 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 10.10.20.217, server: , request: "POST /plugins/unassigned.devices/UnassignedDevices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.5", referrer: "http://192.168.1.5/Main"
Jan 24 02:18:57 Tower root: plugin: checking: /boot/config/plugins/unassigned.devices/unassigned.devices-2022.01.21.tgz - MD5
Jan 24 02:18:57 Tower root: plugin: creating: /tmp/start_unassigned_devices - from INLINE content
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/start_unassigned_devices - mode to 0770
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/unassigned.devices.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/samba_mount.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/iso_mount.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/smb-settings.conf already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/config/smb-extra.conf already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/add-smb-extra already exists
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/unassigned.devices/add-smb-extra - mode to 0770
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/remove-smb-extra already exists
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/unassigned.devices/remove-smb-extra - mode to 0770
Jan 24 02:18:57 Tower root: plugin: running: anonymous

I've attached the SMART reports for both Disk 4 and the Parity Drive (which also has read errors).

Can someone please explain to me the problems? Did my disk fail? Is there anything that I should be doing?

tower-smart-20220124-0219 (disk 4).zip tower-smart-20220124-0221 (parity).zip

tower-diagnostics-20220124-0251.zip

Edited January 23, 2022 by Stubbs

Squid · January 23, 2022

You should post your entire diagnostics. SMART looks OK, and what can be inferred from the syslog snippet (full diagnostics would help much more) is a connection issue (poor cabling, a slightly loose connector, etc.). Reseat the cables.

Also be aware that non-locking cables tend to work better on WD drives due to their design (or a locking cable that also has the internal "bump", which is actually rather rare).
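
For anyone who wants to pull the relevant lines out of a long syslog, here is a minimal Python sketch that counts the kernel's ATA "exception", "failed command" and "error:" lines per port. The regular expression and the function name `summarize_ata_events` are illustrative assumptions based on the log lines posted above, not an Unraid tool. It only counts lines and does not diagnose the cause.

```python
import re
from collections import Counter

# Matches kernel libata lines such as:
#   "Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED"
# Assumption: the port name (ata6) is followed by an optional device suffix (.00).
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Message prefixes we count; chosen from the syslog excerpt in the first post.
KEYS = ("exception", "failed command", "error:")


def summarize_ata_events(lines):
    """Count exception / failed command / error lines per ATA port."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for key in KEYS:
            if message.startswith(key):
                counts[(port, key)] += 1
    return counts


# Sample input copied from the syslog excerpt in the first post.
sample = [
    "Jan 24 02:18:33 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0",
    "Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED",
    "Jan 24 02:18:33 Tower kernel: ata6.00: error: { IDNF }",
    "Jan 24 02:18:33 Tower kernel: ata6: EH complete",
]
summary = summarize_ata_events(sample)
for (port, kind), count in sorted(summary.items()):
    print(port, kind, count)
```

Running the same function over the full syslog from the diagnostics zip would show whether the errors are confined to one port or spread across several.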

Stubbs · January 23, 2022

6 minutes ago, Squid said:

You should post your entire diagnostics. SMART looks OK, and what can be inferred from the syslog snippet (full diagnostics would help much more) is a connection issue (poor cabling, a slightly loose connector, etc.). Reseat the cables.

Also be aware that non-locking cables tend to work better on WD drives due to their design (or a locking cable that also has the internal "bump", which is actually rather rare).

I forgot about the diagnostics log, sorry. Here it is (attached).

This drive is also in one of the hotswap bays. I had not had this issue before; it has only just occurred, abruptly.

Unraid also just automatically initiated a "read check". I don't know if that will help or not.

tower-diagnostics-20220124-0251.zip

Squid · January 23, 2022

Over the course of disk 4's life (4599 hours), it's had 34526 errors logged. Only the last couple appear in the drive's logs, but they all appear to be connection related. Advice stays the same: reseat the cabling to it / the bay, and also give the tray an extra little nudge. I'd say the same for parity (but only 14000 errors over 22000 hours). The other drives haven't suffered the same ills.

While the odd device reset is OK, continual ones will eventually lead to the drive being disabled, because a write failure is going to happen at some point.

It's also possible that the particular bay you're using just doesn't "like" the drive itself. (I've got one bay that refuses to work properly with one particular drive.)
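
To put those two counts on the same scale, here is a quick Python sketch that divides the logged errors by power-on hours. It uses only the figures quoted above; the function name `errors_per_hour` is just for illustration, and 14000 is the rounded parity figure, so treat the result as approximate.

```python
def errors_per_hour(error_count, power_on_hours):
    """Average number of logged errors per power-on hour."""
    return error_count / power_on_hours


# Figures as quoted in this thread (the parity count is a rounded value).
disk4_rate = errors_per_hour(34526, 4599)
parity_rate = errors_per_hour(14000, 22000)

print(f"disk 4: {disk4_rate:.2f} errors/hour")   # disk 4: 7.51 errors/hour
print(f"parity: {parity_rate:.2f} errors/hour")  # parity: 0.64 errors/hour
```

On these numbers disk 4 has logged errors roughly twelve times as often as parity, which fits with disk 4 being the drive that dropped out first.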

Stubbs · January 23, 2022

8 minutes ago, Squid said:

Over the course of disk 4's life (4599 hours), it's had 34526 errors logged. Only the last couple appear in the drive's logs, but they all appear to be connection related. Advice stays the same: reseat the cabling to it / the bay, and also give the tray an extra little nudge. I'd say the same for parity (but only 14000 errors over 22000 hours). The other drives haven't suffered the same ills.

While the odd device reset is OK, continual ones will eventually lead to the drive being disabled, because a write failure is going to happen at some point.

It's also possible that the particular bay you're using just doesn't "like" the drive itself. (I've got one bay that refuses to work properly with one particular drive.)

It's just weird that it was working perfectly fine for the last 6 or so months in this very bay, only to stop working now. Are you sure it's not some kind of XFS filesystem error? Because I actually did manage to re-mount the drive into my array; it's just marked with a red x saying "device disabled, contents emulated". Cue it starting a "Read check", which will take about 26 hours to complete.

Also my parity drive currently has 144 read errors, disk 3 has 168 read errors and disk 4 (the broken one) has 1024 read errors.

Edited January 23, 2022 by Stubbs

Squid · January 23, 2022

Cancel the read check... No big issue there. Reseat all the cabling, both ends etc. Then try rebuilding the parity drive onto itself.

Stubbs · January 23, 2022

1 hour ago, Squid said:

Cancel the read check... No big issue there. Reseat all the cabling, both ends etc. Then try rebuilding the parity drive onto itself.

I don't have the time to open up the server and adjust the cabling right now, but I did try putting the problem drive in a different bay (with a different cable). Same problem, same drive.

Just for the sake of it, I'm attaching the diagnostics .zip for when the problem disk is actually in the array, but not functioning.

tower-diagnostics-20220123-1728.zip

Stubbs · January 24, 2022

According to people in the Unraid Discord, it's literally a failed disk that needs to be replaced.

JorgeB · January 24, 2022

16 hours ago, Stubbs said:

Same problem, same drive.

Once a device gets disabled it needs to be rebuilt; just changing cables/slot won't fix anything. You can rebuild and see if the problem occurs again. If it does, replace the disk.

Stubbs · January 24, 2022

1 hour ago, JorgeB said:

Once a device gets disabled it needs to be rebuilt; just changing cables/slot won't fix anything. You can rebuild and see if the problem occurs again. If it does, replace the disk.

Yeah, I tried rebuilding, and at the 2% mark, it stopped, disabled the drive and started another "read check".

I'm just going to have to RMA and replace.

JorgeB · January 24, 2022

1 hour ago, Stubbs said:

I tried rebuilding, and at the 2% mark, it stopped, disabled the drive

Diags you posted didn't show a rebuild, but yeah, in that case you should replace it.

Stubbs · January 24, 2022

5 minutes ago, JorgeB said:

Diags you posted didn't show a rebuild, but yeah, in that case you should replace it.

I can't remember if the previous diagnostics was before or after I tried the rebuild.

This one attached is after the failed rebuild and completed read check:

tower-diagnostics-20220124-2349.zip

Edited January 24, 2022 by Stubbs

JorgeB · January 24, 2022

It's logged more like a connection/power issue, but since it failed in a different slot it's likely a disk problem.

Stubbs · January 24, 2022

I've just taken the drive out, connected it to my Windows computer with an HDD docking bay, formatted it as NTFS and it seems to be working fine.

At this point, I honestly have no clue what's going on. I'll try formatting it again on Unraid in yet another slot.

Stubbs · January 24, 2022

Well, looks like I fixed my problem. Thanks for the help.

I have a hunch for what caused it, but it's too stupid for words.

JorgeB · January 24, 2022

7 minutes ago, Stubbs said:

but it's too stupid for words.

Now you made me curious

JonathanM · January 24, 2022

52 minutes ago, Stubbs said:

Well, looks like I fixed my problem. Thanks for the help.

I have a hunch for what caused it, but it's too stupid for words.

Spill!

Novel ways of causing intermittent issues are always interesting, especially non-intuitive stuff.

Stubbs · January 24, 2022

59 minutes ago, JorgeB said:

Now you made me curious

12 minutes ago, JonathanM said:

Spill!

Novel ways of causing intermittent issues are always interesting, especially non-intuitive stuff.

Alright, I'm not 100% sure this is what caused it, but when removing the tray, I noticed the front two screws holding the hard drive in had slightly thicker heads than the back two. The drive was still connected to the SATA and power connector in the drive bay, but it probably wasn't the most stable connection.

I replaced those two thicker screws with the correct thinner-head ones, and all of a sudden it works fine.

JonathanM · January 24, 2022

Very plausible. The SATA slip fit connection is precise to fractions of a mm. Plastic and metal drive tray assemblies, not so much.

I make it a habit to do the final tightening of the screws in a drive tray while holding only the tray, letting the drive hang freely against the partially tightened screws. That way all the tolerances should stack up to put the drive at the absolute rear of the tray.

SATA / SAS connectors are an electrical nightmare; they cause WAY more drive errors than actual disk failure.
