
Disk 2 down, disk 7 with various error messages and no physical access to the server: what to do?


bwv1058
Go to solution Solved by JorgeB,


Dear community members,

My server just reported that disk 2 of my array is being emulated. I am at a loss as to what caused it to fail, since the SMART report seems (mostly) fine. At the same time, I keep getting various error messages about disk 7. From what I know, CRC errors are typically related to bad cables, but I remember checking that particular cable and its sockets in the past and could not find anything obvious. Also, I don't know what "metadata I/O error" refers to.
I would like to ask you for advice: I won't be able to physically access my server for at least two weeks, but I have remote access through a VPN server (separate machine). Is there anything I could or indeed should do?

 

Many thanks in advance!

 

tower-diagnostics-20230124-1849.zip

Link to comment

Dear JorgeB,

 

Thank you for your quick reply. The XFS filesystem check with the -nv flags gave me a long output listing various problems (not sure if posting it here might actually be of any help). The final part reads:

 

No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Tue Jan 24 20:41:37 2023

Phase		Start		End		Duration
Phase 1:	01/24 20:40:39	01/24 20:40:39
Phase 2:	01/24 20:40:39	01/24 20:40:40	1 second
Phase 3:	01/24 20:40:40	01/24 20:41:37	57 seconds
Phase 4:	01/24 20:41:37	01/24 20:41:37
Phase 5:	Skipped
Phase 6:	Skipped
Phase 7:	Skipped

Total run time: 58 seconds

 

The link you posted mentions that I would be given instructions as to how to proceed, but that doesn't seem to be the case here. What is the next step going to be?

Many thanks

Edited by bwv1058
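
For anyone comparing this summary with a later run: a quick way to see whether xfs_repair actually got as far as modifying anything is to list the phases it reports as skipped. The snippet below is only a rough sketch; the function name `skipped_phases` and the regular expression are my own and assume the summary layout shown in the post above.

```python
import re

# Tail of the summary from the check-only (-n) run quoted above.
SUMMARY = """\
Phase 4:    01/24 20:41:37  01/24 20:41:37
Phase 5:    Skipped
Phase 6:    Skipped
Phase 7:    Skipped
"""

# Matches summary rows such as "Phase 5:   Skipped" (tabs or spaces).
SKIPPED_RE = re.compile(r"^Phase (\d+):\s+Skipped\s*$", re.MULTILINE)


def skipped_phases(summary_text):
    """Return the phase numbers that the summary marks as skipped."""
    return [int(number) for number in SKIPPED_RE.findall(summary_text)]


print(skipped_phases(SUMMARY))
```

A run that lists phases 5 to 7 as skipped, like the one above, only inspected the filesystem and did not repair it.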
Link to comment

Thank you again!

Sorry if I'm being excessively cautious, but I just want to make sure I'm not causing any unnecessary damage. I'm now repeating the test just with verbose output and we'll see what comes next.
From the documentation I get the following:

 

Quote

-L

Force Log Zeroing. Forces xfs_repair to zero the log even if it is dirty (contains metadata changes). When using this option the filesystem will likely appear to be corrupt, and can cause the loss of user files and/or data.

 

Shouldn't it be concerning that the "filesystem will likely appear to be corrupt" etc.?
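
To keep the three kinds of run discussed so far apart (check only with -n, a real repair, and a repair with -L), here is a small sketch that only builds the command line and does not run anything. The helper name `xfs_repair_args` and the device path `/dev/md7` are assumptions for illustration; use the device that your own system shows for the disk. Refusing to combine -n and -L is a choice made in this sketch, not something taken from the xfs_repair documentation.

```python
def xfs_repair_args(device, check_only=True, zero_log=False):
    """Build (but do not run) an xfs_repair command line.

    check_only adds -n (no modify mode), zero_log adds -L (force log zeroing).
    """
    if check_only and zero_log:
        # Guard chosen for this sketch: -L changes the filesystem,
        # so it makes no sense together with a check-only run.
        raise ValueError("-L cannot be combined with a check-only (-n) run")
    args = ["xfs_repair", "-v"]  # -v: verbose output
    if check_only:
        args.append("-n")
    if zero_log:
        args.append("-L")
    args.append(device)
    return args


# /dev/md7 is a hypothetical device path.
print(" ".join(xfs_repair_args("/dev/md7")))                    # check only
print(" ".join(xfs_repair_args("/dev/md7", check_only=False)))  # real repair
```

Defaulting to the check-only form means the destructive variants have to be asked for explicitly.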

Link to comment

Alright, it seems I don't need to rerun the test with the -L flag after all... However, it looks like many "inodes" have been moved to the lost+found folder. What exactly does that mean?

 

        XFS_REPAIR Summary    Wed Jan 25 10:17:39 2023

Phase		Start		End		Duration
Phase 1:	01/25 10:13:32	01/25 10:13:32
Phase 2:	01/25 10:13:32	01/25 10:13:33	1 second
Phase 3:	01/25 10:13:33	01/25 10:16:14	2 minutes, 41 seconds
Phase 4:	01/25 10:16:14	01/25 10:16:15	1 second
Phase 5:	01/25 10:16:15	01/25 10:16:29	14 seconds
Phase 6:	01/25 10:16:29	01/25 10:17:01	32 seconds
Phase 7:	01/25 10:17:01	01/25 10:17:01

Total run time: 3 minutes, 29 seconds
done

 

Should I now just restart the array normally or does the server need a reboot?

Link to comment

Log shows issues with disk7:

 

Jan 25 10:37:31 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:31 Tower kernel: ata6: hard resetting link
Jan 25 10:37:37 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 25 10:37:41 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:41 Tower kernel: ata6: hard resetting link
Jan 25 10:37:42 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 25 10:37:42 Tower kernel: ata6.00: configured for UDMA/33
Jan 25 10:37:42 Tower kernel: ata6: EH complete
Jan 25 10:37:43 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x600 SErr 0x4890800 action 0xe frozen
Jan 25 10:37:43 Tower kernel: ata6.00: irq_stat 0x04400040, connection status changed
Jan 25 10:37:43 Tower kernel: ata6: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6.00: cmd 60/08:48:58:f3:28/00:00:5d:00:00/40 tag 9 ncq dma 4096 in
Jan 25 10:37:43 Tower kernel:         res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6.00: cmd 60/20:50:10:50:14/00:00:50:00:00/40 tag 10 ncq dma 16384 in
Jan 25 10:37:43 Tower kernel:         res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:43 Tower kernel: ata6: hard resetting link
Jan 25 10:37:49 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 25 10:37:53 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:53 Tower kernel: ata6: hard resetting link
Jan 25 10:37:54 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 25 10:37:54 Tower kernel: ata6.00: configured for UDMA/33
Jan 25 10:37:54 Tower kernel: ata6: EH complete
Jan 25 10:37:56 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x3000 SErr 0x4090800 action 0xe frozen
Jan 25 10:37:56 Tower kernel: ata6.00: irq_stat 0x00400040, connection status changed
Jan 25 10:37:56 Tower kernel: ata6: SError: { HostInt PHYRdyChg 10B8B DevExch }
Jan 25 10:37:56 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:56 Tower kernel: ata6.00: cmd 60/20:60:40:32:e3/00:00:12:00:00/40 tag 12 ncq dma 16384 in
Jan 25 10:37:56 Tower kernel:         res 40/00:68:30:d4:c4/00:00:0f:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:56 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:56 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:56 Tower kernel: ata6.00: cmd 60/08:68:30:d4:c4/00:00:0f:00:00/40 tag 13 ncq dma 4096 in
Jan 25 10:37:56 Tower kernel:         res 40/00:68:30:d4:c4/00:00:0f:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:56 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:56 Tower kernel: ata6: hard resetting link

 

These look more like a power/connection problem, but if you don't have access to the server it's difficult to sort out.
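
If you can only reach the server remotely, one thing you can still do is keep an eye on how often these link resets happen. Below is a minimal sketch that counts a few of the messages from the excerpt above per ATA port. The list of messages to count is my own pick, and the function name `count_link_symptoms` is made up for this example.

```python
import re
from collections import Counter

# A few lines copied from the log excerpt above.
SAMPLE_LOG = """\
Jan 25 10:37:31 Tower kernel: ata6: hard resetting link
Jan 25 10:37:41 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel:         res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
"""

# Captures the port (ata6) and the message; the device suffix (.00) is optional.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Messages counted as link-level symptoms (an illustrative selection).
SYMPTOMS = (
    "hard resetting link",
    "COMRESET failed",
    "link is slow to respond",
    "failed command",
)


def count_link_symptoms(log_text):
    """Count the selected messages per (port, message) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PORT_RE.search(line)
        if not match:
            continue  # e.g. the indented "res ..." continuation lines
        port, message = match.groups()
        for symptom in SYMPTOMS:
            if symptom in message:
                counts[(port, symptom)] += 1
    return counts


for (port, symptom), n in sorted(count_link_symptoms(SAMPLE_LOG).items()):
    print(port, symptom, n)
```

To use it on a live system you would pass in the contents of the syslog file instead of `SAMPLE_LOG`. If the counts keep growing for one port only, that points at that drive's cabling or power rather than at the filesystem.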

Link to comment

Dear trurl,

 

Thank you for the link! I seriously hope that disk7 turns out to have a bad power cable. Should that not be the case, and should I still be left with one drive disabled and the other unmountable, what options do I have to retrieve as much data as possible? Also, how can I prevent such a scenario from happening again? Ever since last year, I've had drives randomly being disabled and I never could figure out what was causing the problem... I have replaced cables time and time again and have installed a SAS controller to avoid the onboard SATA connectors, but nothing seems to work reliably.

Link to comment
  • 4 weeks later...

I am now able to access my server again and, after checking, it seems that one of the SATA cables had indeed come loose (although I'm not 100% sure about that, it might just be from my poking around in the system). However, upon rebooting, the system freezes up:

 

IMG_20230216_205625855.jpg

 

 

EDIT:
Following the instructions on this link and renaming vfio-pci.cfg solved the issue but I still have an unmountable drive (disk 7) and a disabled one (disk 2, see diagnostics). What are the next steps going to be?

Thank you all in advance!

tower-diagnostics-20230216-2120.zip

Edited by bwv1058
Link to comment

You have connection problems on disk7. Fix those first. Then check filesystem on disk7. You have to rebuild disk2.

 

Unrelated, your appdata and system shares have files on the array. Ideally the default shares would be on a fast pool (cache), so Docker/VM performance isn't impacted by the slower array, and so array disks can spin down, since these files are always open.

https://wiki.unraid.net/Manual/Shares#Default_Shares

Link to comment

Dear trurl,

Thank you for your continued support!

I've now attached the drive to my SAS controller, which has so far proved to be more reliable than the motherboard's internal SATA ports. After starting the array in maintenance mode, drive 7 no longer shows "unmountable", which I guess is good news. Also, the log no longer seems to report any connection errors. However, I'm getting spammed with "FastCGI sent in stderr", which is an error I've never seen before, and I have no idea what it means.

 

Should I still run xfs repair? And what are my next steps after that?

And, by the way, thank you for pointing out the problem with the appdata and system files. I thought I had set both folders to "prefer", but I will check.

tower-diagnostics-20230217-0926.zip

Link to comment
4 hours ago, bwv1058 said:

starting the array in maintenance mode, drive 7 no longer shows "unmountable"

It doesn't attempt to mount any drives in maintenance mode, so you can't see if any drives mount that way. Notice how no disks show any used and free space.

 

4 hours ago, bwv1058 said:

thought I had set both folders to "prefer", but I will check.

appdata is prefer, system is no. Nothing can move open files, so after setting system to prefer, you will have to disable Docker and VM Manager in Settings then run Mover.

 

4 hours ago, bwv1058 said:

Should I still run xfs repair?

Probably, but can't say for sure from these latest diagnostics in maintenance mode.

Link to comment
3 minutes ago, trurl said:

It doesn't attempt to mount any drives in maintenance mode

Yes, obvious mistake on my part... or maybe just wishful thinking!
After restarting the array normally, I'm back to square one.
 

 

4 minutes ago, trurl said:

appdata is prefer, system is no

Alright, will do as you suggested after I can fix the drive issues.

I'm still getting the aforementioned error message. I don't recall reading it some weeks ago.

tower-diagnostics-20230217-1418.zip

Link to comment
4 hours ago, bwv1058 said:

spammed with "FastCGI sent in stderr"

Not sure about that either. Seems to be related to Unassigned Devices plugin, which you don't have since you are booted in SAFE mode. I suspect something in the browser cache.

 

5 hours ago, bwv1058 said:

attached the drive to my sas controller

Didn't notice any connection problems in these.

 

16 hours ago, trurl said:

check filesystem on disk7

 

Link to comment

Dear trurl,

Thank you again! After running xfs_repair, disk 7 seems to have come back.

 

As expected, a lost+found folder was created with many files in it that I will have to assess individually... Is there any chance that other files besides those in the lost+found folder might have been corrupted in the process?

Since I normally rsync the most important folders on this server to another NAS, what rsync flags should I use to avoid overwriting the other NAS with corrupted or missing data (i.e. deleting the backup)? Is there any way I could use rsync to perform a bidirectional sync, so that any "healthy" files on the backup would be restored to my Unraid server? I realise that these questions might be better addressed in a separate topic. Let me know if that would be preferable.



What should I do next about my "missing" disk 2?

Your help means a lot!
 

tower-diagnostics-20230217-1505.zip
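
On the rsync question in the post above: rsync copies in one direction per run, so a "bidirectional" restore would mean two separate runs with source and destination swapped. A cautious first step is a preview that transfers nothing. The sketch below only builds such a command; the helper name `build_rsync_preview` and both paths are hypothetical, and the choice of flags is a conservative suggestion rather than a recommendation from this thread.

```python
import shlex


def build_rsync_preview(source, dest):
    """Build an rsync command that only reports what it would copy."""
    return [
        "rsync",
        "--archive",          # keep permissions, timestamps, symlinks
        "--dry-run",          # report only, transfer nothing
        "--itemize-changes",  # one line per file that would be touched
        "--ignore-existing",  # never overwrite a file already on the receiver
        source,
        dest,
    ]  # note: no --delete, so nothing is ever removed on the receiver


# Hypothetical paths: pull files that exist only on the backup back to the server.
restore_preview = build_rsync_preview("backupnas:/backup/documents/",
                                      "/mnt/user/documents/")
print(shlex.join(restore_preview))
```

Leaving out `--delete` and adding `--ignore-existing` means a run can only add files that are missing on the receiving side, which keeps a possibly damaged source from replacing or removing good copies. Once the preview looks right, dropping `--dry-run` performs the copy.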

Link to comment

No need to do the rebuild in maintenance mode.

 

Maintenance mode doesn't mount any disks, so nothing can write to your array, but...

 

Maintenance mode doesn't mount any disks, so can't see if any disks are unmountable.

 

Disk2 was immediately disabled again since it couldn't be written.

 

3 minutes ago, itimpi said:

check all cabling (power and SATA)

 

Link to comment
