Dmtalon

Stable unRAID started having issues (corrupt cache/app)


Build is a few years old now and has been rock solid until recently.  I have two VMs (Windows 7 and Windows 10) and a Plex docker.  The Windows 7 VM runs my "house audio" via a passed-through PCI audio card, and also runs uTorrent.

 

The first sign of issues was extracted RAR files either being empty (extraction failed) or corrupt (or what I thought was corruption).  I had 5 data drives, 1 parity, a cache drive, and an app drive.  The cache drive is a fairly old 1TB WD Black; the app drive is a Samsung 128GB SSD.  VMs/dockers live on the app drive, and downloads would download to the VM, then copy to a directory (through cache).

 

I assumed my cache drive was dying even though there was no red ball, and decided to move into 2018: dump the separate cache/app drives and put in a 500GB SSD as a combined cache/app drive.  That all went fairly smoothly, but one of my VMs didn't want to come up.  I messed with it for a while, then eventually plugged the old app drive into another PC (booted an Ubuntu live USB) and copied the VM over again.  I did this and boom, the VM came up.

Fast forward: uTorrent (or some process) was still having issues.  Since I had a nice new big cache/app drive, I decided to install the ruTorrent docker.  I got that up and running and added some recent existing torrents, pointing them at their already-downloaded location.  It was here I discovered that uTorrent didn't seem to be completing the downloads: ruTorrent was finding them at 96-97% complete and then finishing them.  So I assumed this was just some kind of uTorrent issue and moved on with life, using the ruTorrent docker instead of uTorrent on my VM.  Around the same time I discovered pi-hole and the pi-hole docker, so I installed and set that up too.  So I was feeling good and things were working, etc.

 

Yesterday my VMs/dockers crashed, and I had all kinds of I/O errors in my logs.  So I SSHed in and tried to look in /mnt/cache, but no go: I/O errors.  OK, let's shut down the array, go into maintenance mode, and do some checks.  BUT, I couldn't get /mnt/cache unmounted.  I let it sit, I tried lsof/fuser, etc., but the kernel was the only process that seemed to have it locked, so I pulled the plug and rebooted.
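For anyone hitting the same stuck unmount, the sequence I tried looked roughly like this (a sketch, not exactly what I typed; /mnt/cache is unRAID's usual cache mount point):

```shell
MNT=/mnt/cache

# Show any processes holding files open under the mount.
# Both may print nothing if only the kernel has it busy.
fuser -vm "$MNT" || true
lsof +f -- "$MNT" || true

# Try a normal unmount first, then a lazy unmount, which detaches
# the mount point immediately and cleans up once it's no longer busy.
umount "$MNT" || umount -l "$MNT"
```

When even the lazy unmount hangs and nothing in userspace shows up, that usually points at the filesystem or the drive itself rather than a stuck process.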

 

Everything came up, but parity wanted to run.  I stopped it, stopped the array, and put it in maintenance mode; I wanted to do some checks on the cache drive.  xfs_repair found some errors it couldn't fix (and Google didn't help me a whole lot, honestly): "Metadata CRC error detected at xfs_bmbt block".

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata CRC error detected at xfs_bmbt block 0xec53a00/0x1000
Metadata CRC error detected at xfs_bmbt block 0xec53a00/0x1000
btree block 1/451648 is suspect, error -74
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 3
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

 

After running xfs_repair without -n, and ultimately with -L, those errors remained, so I restarted the array and let a parity check run through the night.  It found 2.4 million errors.  And when I just took the array offline (stopped it, then restarted in maintenance mode to get the errors above), it told me I'd had an unclean shutdown and needed another parity check.
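For anyone following along, the repair sequence was roughly this (the device name is an example; find your cache partition with lsblk or the unRAID GUI, and make sure the filesystem is unmounted first):

```shell
# Example device only -- substitute your actual cache partition.
DEV=/dev/sdh1

# Dry run: report what would be fixed without writing anything.
xfs_repair -n "$DEV"

# Real repair pass.
xfs_repair "$DEV"

# Last resort: zero the metadata log. This can lose the most recent
# transactions, so only use it when repair refuses to run otherwise.
xfs_repair -L "$DEV"
```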

 

I'm attaching some screenshots, and a diagnostics report from just now, after the parity check.

 

Before I keep digging in this hole I'm in, I'm hoping someone can maybe help me climb out.

 

Thanks, sorry for the rambling.

cache_user_error.PNG

2million_parity.PNG

unclean_shutdown_need_parity.PNG

nas1-diagnostics-20180401-1056.zip

Edited by Dmtalon


Corrupt cache filesystem and parity sync errors suggest a hardware problem to me. I would start by running memtest for a few hours, ideally 24.


So, I cleaned up the cache drive: formatted it clean, deleted the libvirt image, and started fresh, installing Windows again from scratch.  While trying to install a service pack I saw this pop up in the logs.  There was no indication of any issues until just now (no errors in the log).

 

I shut everything down, reseated the RAM, and am running memtest86 on it.  We'll see what happens, I guess.  The MB/build is from 9/2014:

 

ASUS SABERTOOTH 990FX R2.0

AMD BOX AMD FX-8320 BLACK ED

CRUCIAL 8GB D3 1333 ECC x3
XFX HD5450 1GB D3 DVH PCIE
 

Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Metadata CRC error detected at xfs_buf_ioend+0x49/0x9c [xfs], xfs_bmbt block 0xdc5be0
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Unmount and run xfs_repair
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): First 64 bytes of corrupted metadata buffer:
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44000: 42 4d 41 33 00 00 00 fb 00 00 00 00 02 04 96 d8  BMA3............
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44010: 00 00 00 00 02 04 85 2e 00 00 00 00 00 dc 5b e0  ..............[.
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44020: 00 00 00 01 00 03 40 4c 87 94 a5 6c 31 ae 4b 35  ......@L...l1.K5
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44030: 9c 4c 12 72 8e 1b 67 1c 00 00 00 00 00 00 00 65  .L.r..g........e
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): metadata I/O error: block 0xdc5be0 ("xfs_trans_read_buf_map") error 74 numblks 8
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): xfs_do_force_shutdown(0x1) called from line 315 of file fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa0254bea
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): I/O Error Detected. Shutting down filesystem
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Please umount the filesystem and rectify the problem(s)
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): writeback error on sector 246627368
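A quick way to catch these as they happen is to watch the syslog for XFS trouble lines (the log path is unRAID's default; adjust if yours differs):

```shell
# Follow the log and flag XFS CRC/I-O/corruption lines as they appear.
tail -f /var/log/syslog | grep --line-buffered -E 'XFS .*(CRC error|I/O [Ee]rror|corrupt)'
```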

 


So, as an update: I let memtest run 4 passes (about 19 hours) and it found no errors.  I've decided to try going back to my previous build of unRAID (6.3.5); I did a full restore of the zip I took before upgrading.  I put my app drive and cache spinner back in, and rebooted.

 

I copied the data off the app drive, cleaned it up by reformatting it, then moved my VMs/dockers back on.  Things seem to be back up, and there are no errors in the log (yet).  I'm about 20% through a parity check now.  Once that completes, I'll run some load tests within the VM to see if the problem returns.
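The copy-off step was roughly this (paths are examples, not my actual share names; the old app drive was mounted outside the array):

```shell
# Example paths only: SRC is the old app drive's mount,
# DST a share on the array with enough free space.
SRC=/mnt/disks/app
DST=/mnt/user/backup/app

# -a preserves permissions and timestamps, -H hard links, -X xattrs.
rsync -aHX --progress "$SRC/" "$DST/"

# Verify the copy before reformatting the source.
diff -r "$SRC" "$DST" && echo "copy verified"
```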

 

I'm starting to wonder if it's a 6.5.0 issue, but time will tell I guess.

 

memtest_pass.jpg

