(SOLVED) Failed disk followed by I/O errors



    Last week I was sick and didn't even look at my server for a few days.  When I finally got back to it I noticed that disk 4 was disabled with its contents emulated.  It's my oldest disk and I bought it heavily used, so I didn't think much of it and ordered a replacement drive.  This morning I threw that drive into my second server to run a preclear and figured that in the meantime I might move some data around.  I installed a new disk 3 weeks ago that was still mostly empty, so I downloaded unBALANCE and, remembering that the last time I used it I needed to run the Docker Safe New Perms script, ran that script again here.

 

    I shut down my Docker containers but then decided against the whole thing, uninstalled unBALANCE, and tried to restart my containers.  A lot of them kept shutting back down after starting up, and the ones that did stay up weren't working.  I tailed my syslog and saw a loop of I/O errors.

Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/Clay
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/CommunityApplicationsAppdataBackup
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/Handbrake
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/Literature
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/PlexTranscode
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/Temp
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/Terraria
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/VMC
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/appdata
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/domains
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/isos
Dec  8 18:24:47 Yuki emhttpd: error: get_fs_sizes, 6306: Input/output error (5): statfs: /mnt/user/system

    This exact list of errors repeats every second in my syslog.  I also tried reinstalling unBALANCE a couple of times, mostly out of confusion about what was happening, but could never get it to run properly.  I had to leave for quite some time, but when I got back and started looking into it again I found this line in my syslog:

Dec  8 08:42:04 Yuki kernel: XFS (md4): Corruption of in-memory data detected.  Shutting down filesystem

    At one point I also wondered whether the new permissions script had messed something up, since my containers were giving errors about accessing their config files, so I looked into restoring the appdata folder from a backup.  The restore appdata tab only shows a backup from 2019-11-18 even though the job was set to run every week, and browsing the backup location from a Windows machine shows that same single backup.  Looking directly at disk 5 in Unraid shows I should also have backups from 2019-11-25 and 2019-12-02, while the 11-18 backup that does show up is stored on disk 1.  Digging around, other files that the Unraid GUI shows on disk 5 don't appear when browsing shares from Windows, including files I have copied from those shares to my Windows machine in the past few weeks.
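For anyone comparing the two views: in Unraid each /mnt/user share is the merged union of the matching top-level folders on /mnt/disk1, /mnt/disk2, and so on, so one way to see whether disk 5's files are really gone or just not being merged into the user share is to list both views side by side.  A minimal sketch, using the backup share from the error list above (which disk actually holds the backups is an assumption here):

```shell
# The merged user-share view, served by the shfs union mount
ls -l /mnt/user/CommunityApplicationsAppdataBackup

# The same top-level folder directly on individual data disks;
# files visible here but missing under /mnt/user point to a
# problem with the union mount, not with the disk itself
ls -l /mnt/disk1/CommunityApplicationsAppdataBackup
ls -l /mnt/disk5/CommunityApplicationsAppdataBackup
```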

 

    Attempting to write a folder to any share from Windows results in this error:

An unexpected error is keeping you from creating the folder.  If you continue to receive this error, you can use the error code to search for help with this problem.

Error 0x8007045D: The request could not be performed because of an I/O device error.

    I've read enough horror stories about people taking steps to fix problems and making them worse, so I hope I haven't already gone too far.  What should I do from here?  The replacement drive is 15% through step 2 of 5 of the preclear.  Did the system switch to read-only to protect itself, and will it fix itself once I shut down and replace the bad disk?  Is there more I need to do in the meantime to recover?  Is it even recoverable?  Or should I start copying off as much as I can while the system is still functioning?

yuki-diagnostics-20191209-0032.zip


 If you were told you have drive errors, you are most likely not in the right place.

Does this not apply to me? Disk 4 is the drive that failed.

 

It also says further down that for XFS I will have to start the array in maintenance mode.  I have been hesitant to shut down or stop the array since disk 5's files were not appearing when browsing shares, and I didn't know if something had happened to that disk as well.  Unraid hasn't told me that anything is wrong with it, but with the missing files and all I wasn't sure.

 

I don't mean to doubt you; I'll run the test if you say that's what's best.  I just want to make sure I lose as little data as possible.

 

On a side note, my VMs are located on an SSD mounted by Unassigned Devices and are still working.  Tailing my syslog shows 5 more shares in the repeating error list, and my syslog is 91% full according to the dashboard.
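On the full syslog: Unraid keeps /var/log on a small RAM-backed tmpfs, so a once-per-second error loop like the one above fills it quickly, and once it is full new messages stop being recorded.  A minimal sketch of freeing it without rebooting (this discards the live in-memory log, so copy it somewhere persistent first, here assumed to be the /boot flash drive):

```shell
# Save a copy of the current log to the flash drive before discarding it
cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M%S).txt

# Truncate the live log in place so logging can continue
: > /var/log/syslog
```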


I followed the instructions in your link and put the array in maintenance mode to run the test but it's just been sitting on the same step for a few hours.

Phase 1 - find and verify superblock...
couldn't verify primary superblock - not enough secondary superblocks with matching geometry !!!

attempting to find secondary superblock...

This is followed by a line with 3,708,871 periods.

29 minutes ago, Clay Smith said:

I followed the instructions in your link and put the array in maintenance mode to run the test but it's just been sitting on the same step for a few hours. [...]

Which command did you use?

xfs_repair -L?

15 minutes ago, Harro said:

Which command did you use? [...]

-n will just show a report  

-L  "as per the wiki" Force Log Zeroing. Forces xfs_repair to zero the log even if it is dirty (contains metadata changes). When using this option the filesystem will likely appear to be corrupt, and can cause the loss of user files and/or data.
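The two flags above map onto xfs_repair like this; a minimal sketch against the emulated disk 4 device from this thread, with the read-only pass run first:

```shell
# Dry run: scan the filesystem and report damage, change nothing
xfs_repair -n /dev/md4

# Actual repair run; add -L only if it refuses to start because of a
# dirty log, since zeroing the log discards unreplayed metadata changes
xfs_repair /dev/md4
```

Running against the md device (rather than the raw sdX device) matters on Unraid, because writes to md4 also update parity.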

26 minutes ago, Harro said:

-n will just show a report [...]

 

Did I misunderstand the page?  Should I have run it with -n first and then -L?  Would it be right to cancel it?  On the settings page for the disk it still says the check is running.

4 hours ago, johnnie.black said:

Post the complete command used.

I used the webGUI, so I didn't type a complete command.  I clicked 'Disk 4' on the Main page to get to the disk settings and then clicked the 'Check' button under the 'Check Filesystem Status' section.  In the options box I left just the -n that was there by default.  The syslog was full at the time so I can't grab what it said then, but I have since truncated the syslog.  When I ran it again last night with -L, this was the line in the syslog:

Dec  9 20:28:12 Yuki ool www[13176]: /usr/local/emhttp/plugins/dynamix/scripts/xfs_check 'start' '/dev/md4' 'WDC_WD60EZRX-00MVLB1_WD-WXL1H642CJCJ' '-L'

 


The check I started last night with -L just completed:

Phase 1 - find and verify superblock...
couldn't verify primary superblock - not enough secondary superblocks with matching geometry !!!

attempting to find secondary superblock...
...............(5.7 million dots)...........Sorry, could not find valid secondary superblock
Exiting now.

That's all it gives.  I assume this means it's not fixed, and I'm not really sure where to go from here.  Based on the syslog line in my last post, did it run properly, or do I need to use the terminal to run a better command?


My system has a weird quirk where it won't boot unless it's hooked to a monitor, and I'm not home to plug one in right now.  When I start the array, should I start it normally or in maintenance mode?  Is there anything I can do in the meantime before I reboot tonight, or should I just hold off and report back?


I've started it and so far it's doing the same thing:

Phase 1 - find and verify superblock...
couldn't verify primary superblock - not enough secondary superblocks with matching geometry !!!

attempting to find secondary superblock...

followed by it generating a bunch of periods.

 

While the array was started in normal mode I was able to browse the shares in Windows and noticed that the appdata backups that weren't showing before were now visible, and a video file that IIRC was located on disk 4 was also visible.


In normal mode there's a mount attempt and it warns of XFS corruption, but then xfs_repair can't find the superblock.  Very weird; it suggests there's a serious problem with the filesystem on the emulated disk, possibly because parity wasn't 100% in sync.

 

Try to mount the old disk with UD.  If it mounts correctly, then since the disk looks healthy it's best to do a new config and re-sync parity.  Note that any data written to that disk after it got disabled will be lost.
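For anyone following along, mounting the old physical disk read-only outside the array, whether through the Unassigned Devices plugin or by hand, looks roughly like this (sdX1 and the mount point are placeholders; check the real device name with lsblk first):

```shell
# Identify the old disk's partition before touching anything
lsblk -o NAME,SIZE,MODEL

# Mount it read-only so nothing on it can change while verifying the
# data; norecovery also skips XFS log replay, which would otherwise
# write to the disk even on a ro mount
mkdir -p /mnt/disks/old_disk4
mount -t xfs -o ro,norecovery /dev/sdX1 /mnt/disks/old_disk4

# Spot-check the contents
ls /mnt/disks/old_disk4
```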

1 hour ago, johnnie.black said:

Try to mount the old disk with UD [...]

Should I cancel the current xfs_repair operation to do this, or wait until it completes?  If starting the array in normal mode gives me access to the (emulated?) disk, would it be worth trying to copy any of my recent writes to another drive before attempting the mount with UD?

 
