bwv1058 Posted January 24, 2023

Dear community members,

My server just reported that disk 2 of my array is being emulated. I am at a loss as to what caused it to fail, since the SMART report seems (mostly) fine. At the same time, I keep getting various error messages about disk 7. From what I know, CRC errors are typically related to bad cables, but I remember checking that particular cable and its sockets in the past and could not find anything obvious. Also, I don't know what "metadata I/O error" refers to.

I would like to ask you for advice: I won't be able to directly access my server for at least two weeks, but I have remote access through a VPN server (a separate machine). Is there anything I could or indeed should do?

Many thanks in advance!

tower-diagnostics-20230124-1849.zip
JorgeB Posted January 24, 2023

Logs are spammed with filesystem issues for disk7. Check filesystem, then reboot and post new diags after array start.
bwv1058 Posted January 24, 2023 (Author)

Dear JorgeB,

Thank you for your quick reply. The XFS filesystem check with the -nv flags gave me a long output listing various problems (not sure if posting it here would actually be of any help). The final part reads:

No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Tue Jan 24 20:41:37 2023

Phase           Start           End             Duration
Phase 1:        01/24 20:40:39  01/24 20:40:39
Phase 2:        01/24 20:40:39  01/24 20:40:40  1 second
Phase 3:        01/24 20:40:40  01/24 20:41:37  57 seconds
Phase 4:        01/24 20:41:37  01/24 20:41:37
Phase 5:        Skipped
Phase 6:        Skipped
Phase 7:        Skipped

Total run time: 58 seconds

The link you posted mentions that I would be given instructions as to how to proceed, but that doesn't seem to be the case here. What is the next step going to be?

Many thanks

Edited January 24, 2023 by bwv1058
JorgeB Posted January 25, 2023

Next step is to run it again without -n, or nothing will be done. If it asks for -L, use it.
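The difference between the check-only run and the actual repair comes down to which flags are passed. Here is a minimal Python sketch that only builds the two command lines so they can be compared side by side; it does not run anything. The helper name `build_xfs_repair_cmd` and the device path `/dev/md7` are assumptions for illustration, not taken from the thread. The flags -n, -v and -L are the ones discussed in the posts.

```python
import shlex


def build_xfs_repair_cmd(device, dry_run=False, zero_log=False):
    """Return the xfs_repair argument list for one disk (hypothetical helper).

    dry_run adds -n (check only, nothing is modified);
    zero_log adds -L (force log zeroing, only if xfs_repair asks for it).
    """
    if dry_run and zero_log:
        # Design choice of this sketch: refuse the combination, since a
        # check-only run is not supposed to change anything.
        raise ValueError("this sketch does not combine -n with -L")
    cmd = ["xfs_repair", "-v"]
    if dry_run:
        cmd.append("-n")
    if zero_log:
        cmd.append("-L")
    cmd.append(device)
    return cmd


# Assumed device path for disk7; verify the real one before running anything.
DEVICE = "/dev/md7"

check_cmd = build_xfs_repair_cmd(DEVICE, dry_run=True)
repair_cmd = build_xfs_repair_cmd(DEVICE)

print(shlex.join(check_cmd))
print(shlex.join(repair_cmd))
```

The first printed line corresponds to the check that was already run (with -n), the second to the repair run suggested above. Adding `zero_log=True` would only be appropriate if xfs_repair itself asks for -L.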
bwv1058 Posted January 25, 2023 (Author)

Thank you again! Sorry if I'm being excessively cautious, but I just want to make sure I'm not causing any unnecessary damage. I'm now repeating the test with just verbose output, and we'll see what comes next.

From the documentation I get the following:

Quote
-L Force Log Zeroing. Forces xfs_repair to zero the log even if it is dirty (contains metadata changes). When using this option the filesystem will likely appear to be corrupt, and can cause the loss of user files and/or data.

Shouldn't it be concerning that the "filesystem will likely appear to be corrupt" etc.?
bwv1058 Posted January 25, 2023 (Author)

Alright, it seems I don't need to rerun the test with the -L flag after all... However, it looks like many "inodes" have been moved to the lost+found folder. What exactly does that mean?

XFS_REPAIR Summary    Wed Jan 25 10:17:39 2023

Phase           Start           End             Duration
Phase 1:        01/25 10:13:32  01/25 10:13:32
Phase 2:        01/25 10:13:32  01/25 10:13:33  1 second
Phase 3:        01/25 10:13:33  01/25 10:16:14  2 minutes, 41 seconds
Phase 4:        01/25 10:16:14  01/25 10:16:15  1 second
Phase 5:        01/25 10:16:15  01/25 10:16:29  14 seconds
Phase 6:        01/25 10:16:29  01/25 10:17:01  32 seconds
Phase 7:        01/25 10:17:01  01/25 10:17:01

Total run time: 3 minutes, 29 seconds
done

Should I now just restart the array normally, or does the server need a reboot?
JorgeB Posted January 25, 2023

Start in normal mode and check the lost+found folder.
JorgeB Posted January 25, 2023

Oh, and reboot first to clear the log.
bwv1058 Posted January 25, 2023 (Author)

Disk7 is now unmountable; meanwhile, disk2 is still disabled... I'm starting to be quite worried.

tower-diagnostics-20230125-1037.zip
JorgeB Posted January 25, 2023

Log shows issues with disk7:

Jan 25 10:37:31 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:31 Tower kernel: ata6: hard resetting link
Jan 25 10:37:37 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 25 10:37:41 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:41 Tower kernel: ata6: hard resetting link
Jan 25 10:37:42 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 25 10:37:42 Tower kernel: ata6.00: configured for UDMA/33
Jan 25 10:37:42 Tower kernel: ata6: EH complete
Jan 25 10:37:43 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x600 SErr 0x4890800 action 0xe frozen
Jan 25 10:37:43 Tower kernel: ata6.00: irq_stat 0x04400040, connection status changed
Jan 25 10:37:43 Tower kernel: ata6: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6.00: cmd 60/08:48:58:f3:28/00:00:5d:00:00/40 tag 9 ncq dma 4096 in
Jan 25 10:37:43 Tower kernel: res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6.00: cmd 60/20:50:10:50:14/00:00:50:00:00/40 tag 10 ncq dma 16384 in
Jan 25 10:37:43 Tower kernel: res 40/00:48:58:f3:28/00:00:5d:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:43 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:43 Tower kernel: ata6: hard resetting link
Jan 25 10:37:49 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 25 10:37:53 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:53 Tower kernel: ata6: hard resetting link
Jan 25 10:37:54 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 25 10:37:54 Tower kernel: ata6.00: configured for UDMA/33
Jan 25 10:37:54 Tower kernel: ata6: EH complete
Jan 25 10:37:56 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x3000 SErr 0x4090800 action 0xe frozen
Jan 25 10:37:56 Tower kernel: ata6.00: irq_stat 0x00400040, connection status changed
Jan 25 10:37:56 Tower kernel: ata6: SError: { HostInt PHYRdyChg 10B8B DevExch }
Jan 25 10:37:56 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:56 Tower kernel: ata6.00: cmd 60/20:60:40:32:e3/00:00:12:00:00/40 tag 12 ncq dma 16384 in
Jan 25 10:37:56 Tower kernel: res 40/00:68:30:d4:c4/00:00:0f:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:56 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:56 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:56 Tower kernel: ata6.00: cmd 60/08:68:30:d4:c4/00:00:0f:00:00/40 tag 13 ncq dma 4096 in
Jan 25 10:37:56 Tower kernel: res 40/00:68:30:d4:c4/00:00:0f:00:00/40 Emask 0x50 (ATA bus error)
Jan 25 10:37:56 Tower kernel: ata6.00: status: { DRDY }
Jan 25 10:37:56 Tower kernel: ata6: hard resetting link

These look more like a power/connection problem, but if you don't have access to the server it's difficult to sort out.
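When working remotely, it can help to tally how often each ATA port logs link resets, so that a recurring connection problem stands out from the rest of the syslog. The following is a rough Python sketch that does this for a handful of lines copied from the log above. The function name `count_link_events` and the choice of which messages count as "link events" are assumptions made for illustration.

```python
import re
from collections import Counter

# A few lines copied verbatim from the log above.
LOG = """\
Jan 25 10:37:41 Tower kernel: ata6: COMRESET failed (errno=-16)
Jan 25 10:37:41 Tower kernel: ata6: hard resetting link
Jan 25 10:37:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Jan 25 10:37:43 Tower kernel: ata6: hard resetting link
Jan 25 10:37:53 Tower kernel: ata6: COMRESET failed (errno=-16)
"""

# Substrings treated as signs of a link problem (an assumed, incomplete list).
LINK_EVENTS = ("hard resetting link", "COMRESET failed", "link is slow to respond")

# Captures the port name (e.g. "ata6") and the message that follows it.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def count_link_events(log_text):
    """Count link-related kernel messages per ATA port."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for event in LINK_EVENTS:
            if event in message:
                counts[(port, event)] += 1
    return counts


counts = count_link_events(LOG)
for (port, event), n in sorted(counts.items()):
    print(port, event, n)
```

Repeated resets on a single port, as with ata6 here, are consistent with the power/connection diagnosis, though the tally alone cannot tell a bad cable from a bad port or power supply.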
bwv1058 Posted January 25, 2023 (Author)

I see. I will of course look into that as soon as I can. Should I shut down the server in the meantime?

Edited January 25, 2023 by bwv1058
trurl Posted January 25, 2023

8 minutes ago, bwv1058 said:
shut down the server in the meantime?

Yes. You only have single parity and a disabled disk, so no protection until you rebuild. And you don't want to attempt to rebuild one disk when a different disk can't be read.

Parity by itself cannot rebuild or protect anything. All disks are required.

https://wiki.unraid.net/Manual/Overview#Parity-Protected_Array
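Why all disks are required can be illustrated with a toy model. The Python sketch below assumes single parity is a plain bytewise XOR across the data disks; the disk contents are made up, and the variable names only mirror the disks discussed in this thread. It is an illustration of the principle, not Unraid's actual implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks" holding four bytes each (made-up contents).
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk7 = bytes([0x01, 0x02, 0x03, 0x04])

# Parity is computed across all data disks.
parity = xor_blocks([disk1, disk2, disk7])

# disk2 is disabled: its contents are emulated from parity plus ALL other disks.
emulated_disk2 = xor_blocks([parity, disk1, disk7])

# If disk7 returns wrong data, the emulated disk2 comes out wrong as well.
bad_disk7 = bytes([0xFF, 0x02, 0x03, 0x04])
broken_disk2 = xor_blocks([parity, disk1, bad_disk7])

print(emulated_disk2 == disk2)
print(broken_disk2 == disk2)
```

In this model the emulated disk2 is only correct while every other disk reads back correctly, which is why a rebuild should wait until disk7 is readable again.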
bwv1058 Posted January 25, 2023 (Author)

Dear trurl,

Thank you for the link! I seriously hope that disk7 turns out to have a bad power cable. Should that not be the case, and should I still be left with one drive disabled and the other unmountable, what options do I have to retrieve as much data as possible?

Also, how can I prevent such a scenario from happening again? Ever since last year, I've had drives randomly being disabled, and I never could figure out what was causing the problem... I have replaced cables time and time again and have installed a SAS controller to avoid the onboard SATA connectors, but nothing seems to work reliably.
trurl Posted January 25, 2023

Unmountable filesystems can usually be repaired. Disabled disks can be rebuilt if all other disks are working well.

In any case, data on all other disks should be fine.
bwv1058 Posted February 16, 2023 (Author)

I am now able to access my server again and, after checking, it seems that one of the SATA cables had indeed come loose (although I'm not 100% sure about that; it might just be from my poking around in the system). However, upon rebooting, the system freezes up.

EDIT: Following the instructions on this link and renaming vfio-pci.cfg solved the issue, but I still have an unmountable drive (disk 7) and a disabled one (disk 2, see diagnostics). What are the next steps going to be?

Thank you all in advance!

tower-diagnostics-20230216-2120.zip

Edited February 16, 2023 by bwv1058
trurl Posted February 16, 2023

You have connection problems on disk7. Fix those first.

Then check filesystem on disk7.

You have to rebuild disk2.

Unrelated: your appdata and system shares have files on the array. Ideally the default shares would be on a fast pool (cache), so Docker/VM performance isn't impacted by the slower array, and so array disks can spin down, since these files are always open.

https://wiki.unraid.net/Manual/Shares#Default_Shares
bwv1058 Posted February 17, 2023 (Author)

Dear trurl,

Thank you for your continued support! I've now attached the drive to my SAS controller, which has so far proved to be more reliable than the motherboard's internal SATA ports. After starting the array in maintenance mode, drive 7 no longer shows "unmountable", which I guess is good news. Also, the log no longer seems to report any connection errors. However, I'm getting spammed with "FastCGI sent in stderr", which is an error I've never seen before, and I have no idea what it stands for.

Should I still run xfs repair? And what are my next steps after that?

And, by the way, thank you for pointing out the problem with the appdata and system files. I thought I had set both folders to "prefer", but I will check.

tower-diagnostics-20230217-0926.zip
trurl Posted February 17, 2023

4 hours ago, bwv1058 said:
starting the array in maintenance mode, drive 7 no longer shows "unmountable"

It doesn't attempt to mount any drives in maintenance mode, so you can't see if any drives mount that way. Notice how no disks show any used and free space.

4 hours ago, bwv1058 said:
thought I had set both folders to "prefer", but I will check.

appdata is prefer, system is no. Nothing can move open files, so after setting system to prefer, you will have to disable Docker and VM Manager in Settings, then run Mover.

4 hours ago, bwv1058 said:
Should I still run xfs repair?

Probably, but can't say for sure from these latest diagnostics in maintenance mode.
bwv1058 Posted February 17, 2023 (Author)

3 minutes ago, trurl said:
It doesn't attempt to mount any drives in maintenance mode

Yes, obvious mistake on my part... or maybe just wishful thinking! After restarting the array normally, I'm back to square one.

4 minutes ago, trurl said:
appdata is prefer, system is no

Alright, will do as you suggested once I have fixed the drive issues.

I'm still getting the aforementioned error message. I don't recall seeing it some weeks ago.

tower-diagnostics-20230217-1418.zip
trurl Posted February 17, 2023

4 hours ago, bwv1058 said:
spammed with "FastCGI sent in stderr"

Not sure about that either. Seems to be related to the Unassigned Devices plugin, which you don't have since you are booted in SAFE mode. I suspect something in the browser cache.

5 hours ago, bwv1058 said:
attached the drive to my sas controller

Didn't notice any connection problems in these.

16 hours ago, trurl said:
check filesystem on disk7
bwv1058 Posted February 17, 2023 (Author)

Dear trurl,

Thank you again! After running xfs_repair, disk 7 seems to have come back. As expected, a lost+found folder was created with many files in it that I will have to assess individually...

Is there any chance that other files besides those in the lost+found folder might have been corrupted in the process?

Since I normally rsync the most important folders on this server to another NAS, what rsync flags should I use to avoid overwriting the other NAS with corrupted or missing data (i.e. deleting the backup)?

Is there any way I could use rsync to perform a bidirectional sync, so that any "healthy" files on the backup would be restored to my Unraid server?

I realise that these questions might be better addressed in a separate topic. Let me know if that would be preferable.

What should I do next about my "missing" disk 2?

Your help means a lot!

tower-diagnostics-20230217-1505.zip
trurl Posted February 17, 2023

4 hours ago, bwv1058 said:
What should I do next about my "missing" disk 2?

It isn't missing, it is just disabled. Emulated disk2 is mounted and SMART looks OK.

21 hours ago, trurl said:
You have to rebuild disk2.

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself
bwv1058 Posted February 17, 2023 (Author)

Dear trurl,

I've followed the instructions you posted for rebuilding a drive onto itself. However, when I start the array after reassigning disk 2, I still only get a red cross. I've tried again, starting in maintenance mode and pressing sync, but the data rebuild immediately switches to "paused 0.0%".

Any idea what could be wrong?

tower-diagnostics-20230217-2046.zip
itimpi Posted February 17, 2023

The diagnostics show write errors on disk2, but for some reason do not show what led up to that. You should carefully check all cabling (power and SATA) to the drive, as that is the most common cause of such issues.
trurl Posted February 17, 2023

No need to do the rebuild in maintenance mode.

Maintenance mode doesn't mount any disks, so nothing can write to your array, but...

Maintenance mode doesn't mount any disks, so can't see if any disks are unmountable.

Disk2 was immediately disabled again since it couldn't be written.

3 minutes ago, itimpi said:
check all cabling (power and SATA)