Disk Failure? Help needed


Stubbs
Go to solution Solved by JorgeB,


Disk 4 in my array seems to have issues. It automatically dismounted at some point in the last day, and I don't know why or what's going on. Trying to remount it does not work.

 

8V4gb0Z.png

 

When I rebooted the server, it started a file system check. It prompted me to open a certain log, which kept going on and on infinitely:

 

K0927eX.png

 

MOoUjz6.png

 

Again, I do not understand what any of this means.

 

My Unraid's main log looks like this:

 

Jan 24 02:18:13 Tower kernel: md4: writeback error on inode 98313, offset 125083648, sector 674783544
Jan 24 02:18:13 Tower kernel: XFS (sdg1): Filesystem has duplicate UUID 0fb26a6d-0040-405b-92c9-4fe171b93e9b - can't mount
Jan 24 02:18:13 Tower unassigned.devices: Mount of 'sdg1' failed: 'mount: /mnt/disks/WD-WX22DB0KE762: wrong fs type, bad option, bad superblock on /dev/sdg1, missing codepage or helper program, or other error. '
Jan 24 02:18:13 Tower unassigned.devices: Partition 'WD-WX22DB0KE762' cannot be mounted.
Jan 24 02:18:33 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0
Jan 24 02:18:33 Tower kernel: ata6.00: irq_stat 0x40000008
Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:18:33 Tower kernel: ata6.00: cmd 61/20:30:20:35:d2/00:00:3b:00:00/40 tag 6 ncq dma 16384 out
Jan 24 02:18:33 Tower kernel: res 41/10:00:20:35:d2/00:00:3b:00:00/40 Emask 0x481 (invalid argument) <F>
Jan 24 02:18:33 Tower kernel: ata6.00: status: { DRDY ERR }
Jan 24 02:18:33 Tower kernel: ata6.00: error: { IDNF }
Jan 24 02:18:33 Tower kernel: ata6.00: configured for UDMA/133
Jan 24 02:18:33 Tower kernel: ata6: EH complete
Jan 24 02:18:52 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin update community.applications.plg
Jan 24 02:18:52 Tower root: plugin: running: anonymous
Jan 24 02:18:52 Tower root: plugin: running: anonymous
Jan 24 02:18:52 Tower root: plugin: creating: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz - downloading from URL https://raw.githubusercontent.com/Squidly271/community.applications/master/archive/community.applications-2022.01.22-x86_64-1.txz
Jan 24 02:18:53 Tower root: plugin: checking: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz - MD5
Jan 24 02:18:53 Tower root: plugin: running: /boot/config/plugins/community.applications/community.applications-2022.01.22-x86_64-1.txz
Jan 24 02:18:53 Tower root: plugin: running: anonymous
Jan 24 02:18:56 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugin update unassigned.devices.plg
Jan 24 02:18:56 Tower root: plugin: running: anonymous
Jan 24 02:18:56 Tower root: plugin: creating: /boot/config/plugins/unassigned.devices/unassigned.devices-2022.01.21.tgz - downloading from URL https://github.com/dlandon/unassigned.devices/raw/master/unassigned.devices-2022.01.21.tgz
Jan 24 02:18:56 Tower nginx: 2022/01/24 02:18:56 [error] 5625#5625: *4570 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 192.168.1.217, server: , request: "POST /plugins/unassigned.devices/UnassignedDevices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.5", referrer: "http://192.168.1.5/Main"
Jan 24 02:18:56 Tower nginx: 2022/01/24 02:18:56 [error] 5625#5625: *4086 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 10.10.20.217, server: , request: "POST /plugins/unassigned.devices/UnassignedDevices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.5", referrer: "http://192.168.1.5/Main"
Jan 24 02:18:57 Tower root: plugin: checking: /boot/config/plugins/unassigned.devices/unassigned.devices-2022.01.21.tgz - MD5
Jan 24 02:18:57 Tower root: plugin: creating: /tmp/start_unassigned_devices - from INLINE content
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/start_unassigned_devices - mode to 0770
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/unassigned.devices.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/samba_mount.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /boot/config/plugins/unassigned.devices/iso_mount.cfg already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/smb-settings.conf already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/config/smb-extra.conf already exists
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/add-smb-extra already exists
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/unassigned.devices/add-smb-extra - mode to 0770
Jan 24 02:18:57 Tower root: plugin: skipping: /tmp/unassigned.devices/remove-smb-extra already exists
Jan 24 02:18:57 Tower root: plugin: setting: /tmp/unassigned.devices/remove-smb-extra - mode to 0770
Jan 24 02:18:57 Tower root: plugin: running: anonymous
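
Most of the excerpt above is plugin-update noise; the lines that matter for the disk are the kernel `ata6` messages. As a rough sketch (not an Unraid tool), the snippet below groups libata messages by port so they are easier to spot. The function name `summarize_ata_events` and the regular expression are illustrative assumptions; the sample lines are copied from the log above.

```python
import re

# Matches libata kernel lines such as
# "Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED"
# and captures the port (ata6) and the message text.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(lines):
    """Group libata kernel messages by port name (hypothetical helper)."""
    events = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            port, message = match.groups()
            events.setdefault(port, []).append(message.strip())
    return events


# Sample lines taken from the syslog excerpt in this post.
sample = [
    "Jan 24 02:18:13 Tower kernel: XFS (sdg1): Filesystem has duplicate UUID 0fb26a6d-0040-405b-92c9-4fe171b93e9b - can't mount",
    "Jan 24 02:18:33 Tower kernel: ata6.00: failed command: WRITE FPDMA QUEUED",
    "Jan 24 02:18:33 Tower kernel: ata6.00: status: { DRDY ERR }",
    "Jan 24 02:18:33 Tower kernel: ata6.00: error: { IDNF }",
    "Jan 24 02:18:33 Tower kernel: ata6: EH complete",
    "Jan 24 02:18:52 Tower root: plugin: running: anonymous",
]

for port, messages in summarize_ata_events(sample).items():
    print(port)
    for message in messages:
        print("  " + message)
```

Every ATA message in the sample comes from the same port, which is consistent with one drive or one link misbehaving rather than a controller-wide fault, although the log alone does not prove which.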

 

I've attached the SMART reports for both Disk 4 and the Parity Drive (which also has read errors).

 

Can someone please explain to me the problems? Did my disk fail? Is there anything that I should be doing?

tower-smart-20220124-0219 (disk 4).zip tower-smart-20220124-0221 (parity).zip

tower-diagnostics-20220124-0251.zip

Edited by Stubbs
Link to comment

You should post your entire diagnostics. SMART looks OK, and what can be inferred from the syslog snippet (the full diagnostics would help much more) is a connection issue (poor cabling, a slightly loose connector, etc.). Reseat the cables.

 

Also be aware that non-locking cables tend to work better on WD drives due to their design (or a locking cable that also has the internal "bump", which is actually rather rare).

Link to comment
6 minutes ago, Squid said:

You should post your entire diagnostics. SMART looks OK, and what can be inferred from the syslog snippet (the full diagnostics would help much more) is a connection issue (poor cabling, a slightly loose connector, etc.). Reseat the cables.

 

Also be aware that non-locking cables tend to work better on WD drives due to their design (or a locking cable that also has the internal "bump", which is actually rather rare).

I forgot about the diagnostics log, sorry. Here it is (attached).

 

This drive is also in one of the hotswap bays. I had not had this issue before; it has only just occurred, abruptly.

Unraid also just automatically initiated a "read check". I don't know if that will help or not.

tower-diagnostics-20220124-0251.zip

Link to comment

Over the course of drive 4's life (4599 hours), it's had 34526 errors logged. Only the last couple appear within the drive's logs, but they all appear to be connection related. Advice stays the same: reseat the cabling to it / the bay, and also give the tray an extra little nudge. For parity I'd say the same (but only 14000 over 22000 hours). The other drives haven't suffered the same ills.
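
To put those two counts on the same scale, here is a small sketch that divides logged errors by power-on hours. The helper name `errors_per_hour` is made up for illustration, and the parity figures are the rounded ones quoted above, so treat the result as approximate.

```python
def errors_per_hour(error_count, power_on_hours):
    """Average number of logged errors per power-on hour (illustrative helper)."""
    return error_count / power_on_hours


# Figures quoted in this post; the parity numbers are rounded.
disk4_rate = errors_per_hour(34526, 4599)
parity_rate = errors_per_hour(14000, 22000)

print(f"disk 4: {disk4_rate:.2f} errors per hour")
print(f"parity: {parity_rate:.2f} errors per hour")
print(f"disk 4 logs errors about {disk4_rate / parity_rate:.0f}x as often as parity")
```

An average like this hides whether the errors arrived steadily or in bursts, so it only shows that disk 4's link has been far noisier than parity's over its lifetime.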

 

While an odd device reset is OK, continual ones will eventually lead to the drive being disabled, because a write failure is going to happen at some point.

 

It's also possible that the particular bay you're using just doesn't "like" the drive itself. (I've got one bay that refuses to work properly with one particular drive.)

 

 

Link to comment
8 minutes ago, Squid said:

Over the course of drive 4's life (4599 hours), it's had 34526 errors logged. Only the last couple appear within the drive's logs, but they all appear to be connection related. Advice stays the same: reseat the cabling to it / the bay, and also give the tray an extra little nudge. For parity I'd say the same (but only 14000 over 22000 hours). The other drives haven't suffered the same ills.

 

While an odd device reset is OK, continual ones will eventually lead to the drive being disabled, because a write failure is going to happen at some point.

 

It's also possible that the particular bay you're using just doesn't "like" the drive itself. (I've got one bay that refuses to work properly with one particular drive.)

 

 

It's just weird that it was working perfectly fine for the last 6 or so months in this very bay, only to stop working now. Are you sure it's not some kind of XFS filesystem error? I actually did manage to re-mount the drive into my array, but it's marked with a red X saying "device disabled, contents emulated", and it then started a "Read check" which will take about 26 hours to complete.

 

Also my parity drive currently has 144 read errors, disk 3 has 168 read errors and disk 4 (the broken one) has 1024 read errors.

Edited by Stubbs
Link to comment
1 hour ago, Squid said:

Cancel the read check...  No big issue there.  Reseat all the cabling, both ends etc.  Then try rebuilding the parity drive onto itself.

 

I don't have the time to open up the server and adjust the cabling right now, but I did try putting the problem drive in a different bay (with a different cable). Same problem, same drive.

 

ueMeNwH.png

 

Just for the sake of it, I'm attaching the diagnostics .zip for when the problem disk is actually in the array, but not functioning.

tower-diagnostics-20220123-1728.zip

Link to comment
1 hour ago, JorgeB said:

Once a device gets disabled it needs to be rebuilt; just changing cables or slots won't fix anything. You can rebuild and see if the problem occurs again; if it does, replace the disk.

Yeah, I tried rebuilding, and at the 2% mark, it stopped, disabled the drive and started another "read check".

 

I'm just going to have to RMA and replace.

Link to comment
59 minutes ago, JorgeB said:

Now you made me curious :)

 

 

12 minutes ago, JonathanM said:

Spill!

 

Novel ways of causing intermittent issues are always interesting, especially non-intuitive stuff.

Alright, I'm not 100% sure this is what caused it, but when removing the tray, I noticed the front two screws holding the hard drive in had slightly thicker heads than the back two. It was still connected to the SATA and power connector in the drive bay, but it probably wasn't the most stable connection.

I replaced those two thicker screws with the correct thinner-head ones, and all of a sudden it works fine.

Link to comment

Very plausible. The SATA slip fit connection is precise to fractions of a mm. Plastic and metal drive tray assemblies, not so much.

 

I make it a habit to do the final tightening of the drive tray screws while holding only the tray, so the drive hangs freely against the partially tightened screws. That way all the tolerances should stack up to put the drive at the absolute rear of the tray.

 

SATA / SAS connectors are an electrical nightmare; they cause WAY more drive errors than actual disk failure.

Link to comment
