Disk 2 down, disk 7 with various error messages and no physical access to the server: what to do?

January 24, 2023

Dear community members,

My server just reported that disk 2 of my array is being emulated. I am at a loss as to what caused it to fail, since the SMART report seems (mostly) fine. At the same time, I keep getting various error messages about disk 7. From what I know, CRC errors are typically related to bad cables, but I remember checking that particular cable and its sockets in the past and could not find anything obvious. Also, I don't know what "metadata I/O error" refers to.
I would like to ask you for advice: I won't be able to physically access my server for at least two weeks, but I have remote access through a VPN server (a separate machine). Is there anything I could or indeed should do?

Many thanks in advance!

tower-diagnostics-20230124-1849.zip

January 24, 2023

Community Expert

The logs are spammed with filesystem errors for disk7. Check the filesystem, reboot, and post new diagnostics after the array starts.

January 24, 2023

Author

Dear JorgeB,

Thank you for your quick reply. The XFS filesystem check with the -nv flags gave me a long output listing various problems (I'm not sure whether posting it here would be of any help). The final paragraph reads:

No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Tue Jan 24 20:41:37 2023

Phase		Start		End		Duration
Phase 1:	01/24 20:40:39	01/24 20:40:39
Phase 2:	01/24 20:40:39	01/24 20:40:40	1 second
Phase 3:	01/24 20:40:40	01/24 20:41:37	57 seconds
Phase 4:	01/24 20:41:37	01/24 20:41:37
Phase 5:	Skipped
Phase 6:	Skipped
Phase 7:	Skipped

Total run time: 58 seconds

The link you posted mentions that I would be given instructions as to how to proceed, but that doesn't seem to be the case here. What is the next step going to be?

Many thanks

Edited January 24, 2023 by bwv1058

January 25, 2023

Community Expert

Next step is to run it again without -n or nothing will be done, and if it asks for -L use it.
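To make the difference between the check-only run and the actual repair concrete, here is a minimal Python sketch that only builds the command line and does not execute anything. The device path `/dev/md7` is an assumption for illustration (the real device for disk7 depends on the Unraid version and should be taken from the GUI), and the helper name is hypothetical. The `-n`, `-v` and `-L` flags are the xfs_repair options discussed in this thread.

```python
def xfs_repair_argv(device, dry_run=True, verbose=True, zero_log=False):
    """Build an xfs_repair command line as a list of arguments.

    dry_run  -> adds -n (no modify: report problems, change nothing)
    verbose  -> adds -v
    zero_log -> adds -L (zero the log; only when xfs_repair asks for it)
    """
    if dry_run and zero_log:
        # A check-only run never touches the log, so this sketch
        # treats the combination as a mistake by the caller.
        raise ValueError("-L only makes sense for a real repair, not with -n")
    argv = ["xfs_repair"]
    if dry_run:
        argv.append("-n")
    if verbose:
        argv.append("-v")
    if zero_log:
        argv.append("-L")
    argv.append(device)
    return argv


# Hypothetical device path, for illustration only.
DEVICE = "/dev/md7"

check_only = xfs_repair_argv(DEVICE)                  # what was run first
real_repair = xfs_repair_argv(DEVICE, dry_run=False)  # the suggested next step
print(" ".join(check_only))
print(" ".join(real_repair))
```

The first command corresponds to the `-nv` check already posted above; the second is the same run with `-n` dropped, which is what allows phases 5 to 7 to do any work.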

January 25, 2023

Author

Thank you again!

Sorry if I'm being excessively cautious, but I just want to make sure I'm not causing any unnecessary damage. I'm now repeating the test just with verbose output and we'll see what comes next.
From the documentation I get the following:

Quote

-L

Force Log Zeroing. Forces xfs_repair to zero the log even if it is dirty (contains metadata changes). When using this option the filesystem will likely appear to be corrupt, and can cause the loss of user files and/or data.

Shouldn't it be concerning that the "filesystem will likely appear to be corrupt" etc.?

January 25, 2023

Author

Alright, it seems I don't need to rerun the repair with the -L flag after all... However, many "inodes" seem to have been moved to the lost+found folder. What exactly does that mean?

        XFS_REPAIR Summary    Wed Jan 25 10:17:39 2023

Phase		Start		End		Duration
Phase 1:	01/25 10:13:32	01/25 10:13:32
Phase 2:	01/25 10:13:32	01/25 10:13:33	1 second
Phase 3:	01/25 10:13:33	01/25 10:16:14	2 minutes, 41 seconds
Phase 4:	01/25 10:16:14	01/25 10:16:15	1 second
Phase 5:	01/25 10:16:15	01/25 10:16:29	14 seconds
Phase 6:	01/25 10:16:29	01/25 10:17:01	32 seconds
Phase 7:	01/25 10:17:01	01/25 10:17:01

Total run time: 3 minutes, 29 seconds
done

Should I now just restart the array normally or does the server need a reboot?

January 25, 2023

Community Expert

Start in normal mode and check lost+found folder.

January 25, 2023

Community Expert

Oh, and reboot first to clear the log.

January 25, 2023

Author

Disk7 is now unmountable, meanwhile disk2 is still disabled... I'm starting to be quite worried.

tower-diagnostics-20230125-1037.zip

January 25, 2023

Community Expert

Log shows issues with disk7:

Jan 25 10:37:31 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:31 Tower kernel: ata6: hard resetting link
Jan 25 10:37:37 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 25 10:37:41 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:41 Tower kernel: ata6: hard resetting link
Jan 25 10:37:42 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 25 10:37:42 Tower kernel: ata6.00: configured for UDMA/33
Jan 25 10:37:42 Tower kernel: ata6: EH complete
Jan 25 10:37:43 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x600 SErr 0x4890800 action 0xe frozen
Jan 25 10:37:43 Tower kernel: ata6.00: irq_stat 0x04400040, connection status changed
Jan 25 10:37:43 Tower kernel: ata6: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6.00: cmd 60/08:48:58:f3:28/00:00:5d:00:00/40 tag 9 ncq dma 4096 in
Jan 25 10:37:43 Tower kernel:         res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6.00: cmd 60/20:50:10:50:14/00:00:50:00:00/40 tag 10 ncq dma 16384 in
Jan 25 10:37:43 Tower kernel:         res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:43 Tower kernel: ata6: hard resetting link
Jan 25 10:37:49 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 25 10:37:53 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:53 Tower kernel: ata6: hard resetting link
Jan 25 10:37:54 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 25 10:37:54 Tower kernel: ata6.00: configured for UDMA/33
Jan 25 10:37:54 Tower kernel: ata6: EH complete
Jan 25 10:37:56 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x3000 SErr 0x4090800 action 0xe frozen
Jan 25 10:37:56 Tower kernel: ata6.00: irq_stat 0x00400040, connection status changed
Jan 25 10:37:56 Tower kernel: ata6: SError: { HostInt PHYRdyChg 10B8B DevExch }
Jan 25 10:37:56 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:56 Tower kernel: ata6.00: cmd 60/20:60:40:32:e3/00:00:12:00:00/40 tag 12 ncq dma 16384 in
Jan 25 10:37:56 Tower kernel:         res 40/00:68:30:d4:c4/00:00:0f:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:56 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:56 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:56 Tower kernel: ata6.00: cmd 60/08:68:30:d4:c4/00:00:0f:00:00/40 tag 13 ncq dma 4096 in
Jan 25 10:37:56 Tower kernel:         res 40/00:68:30:d4:c4/00:00:0f:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:56 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:56 Tower kernel: ata6: hard resetting link

These look more like a power/connection problem but if you don't have access to the server it's difficult to sort.
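For anyone who wants to see how often a given port is resetting, here is a small Python sketch that tallies the events from syslog lines like the ones above. The regular expression and the three event labels are my own choices for illustration, not an official classification, and the sample lines are copied from the excerpt in this post.

```python
import re
from collections import Counter

# Matches e.g. "kernel: ata6: ..." and "kernel: ata6.00: ..."
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(lines):
    """Count link resets, failed COMRESETs and failed commands per ATA port."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # e.g. the indented "res ..." continuation lines
        port, message = match.groups()
        if "hard resetting link" in message:
            counts[(port, "link_reset")] += 1
        elif "COMRESET failed" in message:
            counts[(port, "comreset_failed")] += 1
        elif message.startswith("failed command"):
            counts[(port, "failed_command")] += 1
    return counts


SAMPLE = [
    "Jan 25 10:37:31 Tower kernel: ata6: hard resetting link",
    "Jan 25 10:37:41 Tower kernel: ata6: COMRESET failed (errno=-16)",
    "Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED",
    "Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }",
]

for (port, event), n in sorted(summarize_ata_events(SAMPLE).items()):
    print(port, event, n)
```

Repeated resets and failed commands concentrated on a single port, as with ata6 here, fit the power/connection explanation better than a general controller problem, though the counts alone cannot prove the cause.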

January 25, 2023

Author

I see. I will of course look into that as soon as I can. Should I shut down the server in the meantime?

Edited January 25, 2023 by bwv1058

January 25, 2023

Community Expert

8 minutes ago, bwv1058 said:

shut down the server in the meantime?

Yes. You only have single parity and a disabled disk, so no protection until you rebuild. And you don't want to attempt to rebuild one disk when a different disk can't be read. Parity by itself cannot rebuild or protect anything. All disks are required.

https://wiki.unraid.net/Manual/Overview#Parity-Protected_Array
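A toy illustration of why all disks are required, assuming single parity behaves like a bytewise XOR across the data disks. The two-byte "disk contents" below are made-up values and the function name is hypothetical; this is a sketch of the principle, not of Unraid's actual implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents for three data disks.
disk1 = bytes([0x10, 0x20])
disk2 = bytes([0x0A, 0x0B])
disk7 = bytes([0xFF, 0x01])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk7])

# Rebuilding disk2 needs parity AND every other data disk.
rebuilt_disk2 = xor_blocks([parity, disk1, disk7])
print(rebuilt_disk2 == disk2)

# If disk7 cannot be read either, what is left is disk2 XOR disk7,
# which is not the content of disk2.
without_disk7 = xor_blocks([parity, disk1])
print(without_disk7 == disk2)
```

With disk7 readable the rebuilt block matches disk2 exactly. Without it, the result mixes the contents of both missing disks, which is why a rebuild should not be attempted while a second disk is unreadable.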

January 25, 2023

Author

Dear trurl,

Thank you for the link! I seriously hope that disk7 turns out to have a bad power cable. Should that not be the case, and should I still be left with one drive disabled and the other unmountable, what options do I have to retrieve as much data as possible? Also, how can I prevent such a scenario from happening again? Ever since last year, I've had drives randomly being disabled and I never could figure out what was causing the problem... I have replaced cables time and time again and have installed a SAS controller to avoid the onboard SATA connectors, but nothing seems to work reliably.

January 25, 2023

Community Expert

Unmountable filesystems can usually be repaired. Disabled disks can be rebuilt if all other disks are working well. In any case, data on all other disks should be fine.

February 16, 2023

Author

I am now able to access my server again and, after checking, it seems that one of the SATA cables had indeed come loose (although I'm not 100% sure about that; it might just be from my poking around in the system). However, upon rebooting the system freezes up:

EDIT:
Following the instructions on this link and renaming vfio-pci.cfg solved the issue but I still have an unmountable drive (disk 7) and a disabled one (disk 2, see diagnostics). What are the next steps going to be?

Thank you all in advance!

tower-diagnostics-20230216-2120.zip

Edited February 16, 2023 by bwv1058

February 16, 2023

Community Expert

You have connection problems on disk7. Fix those first. Then check filesystem on disk7. You have to rebuild disk2.

Unrelated: your appdata and system shares have files on the array. Ideally these default shares would be on a fast pool (cache) so Docker/VM performance isn't impacted by the slower array, and so array disks can spin down, since these files are always open.

https://wiki.unraid.net/Manual/Shares#Default_Shares

February 17, 2023

Author

Dear trurl,

Thank you for your continued support!

I've now attached the drive to my SAS controller, which has so far proved to be more reliable than the motherboard's internal SATA ports. After starting the array in maintenance mode, drive 7 no longer shows "unmountable", which I guess is good news. Also, the log no longer seems to report any connection errors. However, I'm getting spammed with "FastCGI sent in stderr", which is an error I've never seen before and have no idea what it stands for.

Should I still run xfs repair? And what are my next steps after that?

And, by the way, thank you for pointing out the problem with the appdata and system files. I thought I had set both folders to "prefer", but I will check.

tower-diagnostics-20230217-0926.zip

February 17, 2023

Community Expert

4 hours ago, bwv1058 said:

starting the array in maintenance mode, drive 7 no longer shows "unmountable"

It doesn't attempt to mount any drives in maintenance mode, so you can't see if any drives mount that way. Notice how no disks show any used and free space.

4 hours ago, bwv1058 said:

thought I had set both folders to "prefer", but I will check.

appdata is prefer, system is no. Nothing can move open files, so after setting system to prefer, you will have to disable Docker and VM Manager in Settings then run Mover.

4 hours ago, bwv1058 said:

Should I still run xfs repair?

Probably, but can't say for sure from these latest diagnostics in maintenance mode.

February 17, 2023

Author

3 minutes ago, trurl said:

It doesn't attempt to mount any drives in maintenance mode

Yes, obvious mistake on my part... or maybe just wishful thinking!
After restarting the array normally, I'm back to square one.

4 minutes ago, trurl said:

appdata is prefer, system is no

Alright, will do as you suggested after I can fix the drive issues.

I'm still getting the aforementioned error message. I don't recall reading it some weeks ago.

tower-diagnostics-20230217-1418.zip

February 17, 2023

Community Expert

4 hours ago, bwv1058 said:

spammed with "FastCGI sent in stderr"

Not sure about that either. Seems to be related to Unassigned Devices plugin, which you don't have since you are booted in SAFE mode. I suspect something in the browser cache.

5 hours ago, bwv1058 said:

attached the drive to my sas controller

Didn't notice any connection problems in these.

16 hours ago, trurl said:

check filesystem on disk7

February 17, 2023

Author

Dear trurl,

Thank you again! After running xfs_repair, disk 7 seems to have come back.

As expected, a lost+found folder was created with many files in it that I will have to assess individually... Is there any chance that other files besides those in the lost+found folder might have been corrupted in the process?

Since I normally rsync the most important folders on this server to another NAS, what rsync flags should I use to avoid overwriting the backup on the other NAS with corrupted or missing data (i.e. deleting the backup)? Is there any way I could use rsync to perform a bidirectional sync, so that any "healthy" files on the backup would be restored to my Unraid server? I realise that these questions might be better addressed in a separate topic. Let me know if that would be preferable.
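As a starting point for the rsync question, here is a minimal Python sketch that only builds candidate command lines and does not run them. As far as I know rsync copies in one direction per invocation, so a "bidirectional" sync would be two separate runs. The source and destination paths are hypothetical placeholders and the helper name is my own; `-a`, `--dry-run`, `--itemize-changes` and `--ignore-existing` are standard rsync options. Note that `--delete` is deliberately never added, so nothing is removed on the receiving side.

```python
def rsync_argv(src, dst, dry_run=True, only_missing=False):
    """Build a cautious rsync command line (never includes --delete).

    dry_run      -> adds --dry-run so rsync only reports what it would do
    only_missing -> adds --ignore-existing so files already present on
                    the receiving side are left untouched
    """
    argv = ["rsync", "-a", "--itemize-changes"]
    if dry_run:
        argv.append("--dry-run")
    if only_missing:
        argv.append("--ignore-existing")
    argv += [src, dst]
    return argv


# Hypothetical paths, for illustration only.
SERVER = "/mnt/user/important/"
BACKUP = "backupnas:/volume1/important/"

# Preview what a server -> backup run would change, without changing it.
preview_backup = rsync_argv(SERVER, BACKUP)

# Preview restoring only files that are missing on the server.
preview_restore = rsync_argv(BACKUP, SERVER, only_missing=True)

print(" ".join(preview_backup))
print(" ".join(preview_restore))
```

Reviewing the dry-run output before any real run is the main safeguard here. `--ignore-existing` only helps with files that are missing on the server; a file that exists but is corrupted would need to be compared and restored by hand.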

What should I do next about my "missing" disk 2?

Your help means a lot!

tower-diagnostics-20230217-1505.zip

February 17, 2023

Community Expert

4 hours ago, bwv1058 said:

What should I do next about my "missing" disk 2?

It isn't missing, it is just disabled. Emulated disk2 is mounted and SMART looks OK.

21 hours ago, trurl said:

You have to rebuild disk2.

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself

February 17, 2023

Author

Dear trurl,

I've followed the instructions you posted for rebuilding a drive onto itself. However when I start the array after reassigning disk 2, I still only get a red cross. I've tried again starting in maintenance mode and pressing sync but the data rebuild immediately switches to "paused 0.0%".

Any idea what could be wrong?

tower-diagnostics-20230217-2046.zip

February 17, 2023

Community Expert

The diagnostics show write errors on disk2, but for some reason do not show what led up to that. You should carefully check all cabling (power and SATA) to the drive, as that is the most common cause of such issues.

February 17, 2023

Community Expert

No need to do the rebuild in maintenance mode.

Maintenance mode doesn't mount any disks, so nothing can write to your array, but...

Maintenance mode doesn't mount any disks, so can't see if any disks are unmountable.

Disk2 was immediately disabled again since it couldn't be written.

3 minutes ago, itimpi said:

check all cabling (power and SATA)


Solved by JorgeB
