Stable unRAID started having issues (corrupt cache/app)



The build is a few years old now and has been rock solid until recently. I have 2 VMs (Windows 7 and Windows 10) and a Plex docker. The Windows 7 VM runs my "house audio" via a pass-through audio PCI card, and also runs uTorrent.

 

The first sign of issues was extracted RAR files either being empty (extraction failed) or corrupted (or what I thought was corruption). I had 5 data drives, 1 parity drive, a cache drive and an app drive. The cache drive is a fairly old 1TB WD Black, and the app drive is a Samsung 128GB SSD. VMs/Dockers live on the app drive, and downloads would download to the VM, then get copied to a directory (through the cache).
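
For what it's worth, a quick way to tell a truncated download apart from real on-disk corruption is to CRC-test the archives directly. A minimal sketch, assuming unrar (or an equivalent tool) is available wherever the files end up, with an illustrative path:

# test the archive without extracting; a truncated or corrupt RAR fails the CRC check here
unrar t /mnt/user/downloads/example.rar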

 

I assumed my cache drive was dying even though there was no red ball, and decided to move into 2018: dump the separate cache/app drives and put in a 500GB SSD as a combined cache/app drive. That all went fairly smoothly, except one of my VMs didn't want to come up. I messed with it for a while, but eventually plugged the old app drive into another PC (booted an Ubuntu live USB) and copied the VM over again. I did that and boom, the VM came up. Fast forward: uTorrent (or some process) was still having issues. Since I had a nice new big cache/app drive, I decided to install the ruTorrent docker, got it up and running, and added some existing recent torrents pointing at their already-downloaded location. It was here I discovered that uTorrent didn't seem to have been completing the downloads; ruTorrent was finding them at 96-97% complete and then finishing them. So I assumed this was just some kind of uTorrent issue and moved on with life, using the ruTorrent docker instead of uTorrent on my VM. Around the same time I discovered Pi-hole and the Pi-hole docker, so I installed and got that working/set up too. So I'm feeling all good and things are working, etc.
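
In hindsight, before blaming the old cache drive and swapping it out, I should have pulled its full SMART report from the console (unRAID shows the same attributes on the drive's page in the GUI). A minimal sketch, with the device name as a placeholder:

# dump all SMART attributes and the self-test log for the suspect drive
smartctl -a /dev/sdX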

 

Yesterday my VMs/Dockers crashed, and I had all kinds of I/O errors in my logs. So I SSHed in and tried to look in /mnt/cache, but no go: I/O errors. OK, let's shut down the array, go into maintenance mode and do some checks. BUT, I couldn't get /mnt/cache unmounted. I let it sit, I tried lsof/fuser etc., but the kernel was the only thing that seemed to have it locked, so I pulled the plug and rebooted.
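
The checks looked roughly like this (the exact flags here are illustrative, not a transcript):

# list what still has files open on the cache filesystem (the path is a mount point)
lsof /mnt/cache
fuser -vm /mnt/cache

# try a normal unmount first, then a lazy unmount as a last resort
umount /mnt/cache
umount -l /mnt/cache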

 

Everything came up, but a parity check wanted to run. I stopped it, stopped the array, and put it in maintenance mode; I wanted to do some checks on the cache drive. xfs_repair seemed to find some errors it couldn't fix (and Google didn't help me a whole lot, honestly): "Metadata CRC error detected at xfs_bmbt block"

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata CRC error detected at xfs_bmbt block 0xec53a00/0x1000
Metadata CRC error detected at xfs_bmbt block 0xec53a00/0x1000
btree block 1/451648 is suspect, error -74
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 3
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

 

After trying to run xfs_repair without -n and ultimately with -L, those errors remained, so I restarted the array and let a parity check run through the night. It found 2.4 million errors, and when I just took the array offline (stopped, then restarted in maintenance mode to get the errors above), it told me I had an unclean shutdown and needed another parity check.
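
For reference, the repair attempts were roughly this sequence against the cache device, run with the array in maintenance mode (the device name is a placeholder):

# read-only check first, makes no changes
xfs_repair -n /dev/sdX1

# actual repair attempt
xfs_repair /dev/sdX1

# last resort: zero the log before repairing (can discard the most recent metadata changes)
xfs_repair -L /dev/sdX1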

 

I'm attaching some screenshots and a diagnostics report from just now, after the parity check.

 

Before I keep digging in this hole I'm in, I'm hoping someone can maybe help me climb out.

 

Thanks, sorry for the rambling.

cache_user_error.PNG

2million_parity.PNG

unclean_shutdown_need_parity.PNG

nas1-diagnostics-20180401-1056.zip


So, I cleaned up the cache drive: formatted it fresh, deleted the libvirt image, and started over, reinstalling Windows from scratch. While trying to install a service pack I saw this pop up in the logs. There was no indication of any issues until just now (no errors in the log before this).

 

I shut everything down, reseated the RAM, and am running Memtest86 on it. We'll see what happens, I guess. The MB/build is from 9/2014:

 

ASUS SABERTOOTH 990FX R2.0

AMD BOX AMD FX-8320 BLACK ED

CRUCIAL 8GB D3 1333 ECC x3

XFX HD5450 1GB D3 DVH PCIE
 

Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Metadata CRC error detected at xfs_buf_ioend+0x49/0x9c [xfs], xfs_bmbt block 0xdc5be0
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Unmount and run xfs_repair
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): First 64 bytes of corrupted metadata buffer:
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44000: 42 4d 41 33 00 00 00 fb 00 00 00 00 02 04 96 d8  BMA3............
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44010: 00 00 00 00 02 04 85 2e 00 00 00 00 00 dc 5b e0  ..............[.
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44020: 00 00 00 01 00 03 40 4c 87 94 a5 6c 31 ae 4b 35  [email protected]
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44030: 9c 4c 12 72 8e 1b 67 1c 00 00 00 00 00 00 00 65  .L.r..g........e
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): metadata I/O error: block 0xdc5be0 ("xfs_trans_read_buf_map") error 74 numblks 8
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): xfs_do_force_shutdown(0x1) called from line 315 of file fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa0254bea
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): I/O Error Detected. Shutting down filesystem
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Please umount the filesystem and rectify the problem(s)
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): writeback error on sector 246627368
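
Since the RAM is ECC, another thing worth watching is the kernel's corrected/uncorrected memory error counters, assuming the EDAC driver loads for this chipset:

# per-memory-controller error counts exposed by the EDAC subsystem
grep . /sys/devices/system/edac/mc/mc*/ce_count
grep . /sys/devices/system/edac/mc/mc*/ue_count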

 


So, as an update: I let Memtest run 4 passes (about 19 hours) and it found no errors. I've decided to try going back to my previous build of unRAID (6.3.5) by doing a full restore of the zip backup I took before upgrading. I put my app drive and cache spinner back in and rebooted.

 

I copied the data off the app drive, cleaned it up by reformatting it, then moved my VMs/Dockers back on. Things seem to be back up, and there are no errors in the log (yet). I'm about 20% through a parity check now. Once that completes I'll run some load tests within the VM to see if the problem returns.
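
For anyone doing the same, the copy off the old app drive can be done with a straightforward rsync along these lines (paths here are illustrative):

# copy everything off the old app drive before reformatting it
rsync -avh --progress /mnt/disks/old_app/ /mnt/user/app_backup/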

 

I'm starting to wonder if it's a 6.5.0 issue, but time will tell, I guess.

 

memtest_pass.jpg
