Dmtalon

Stable unRAID started having issues (corrupt cache/app)


Build is a few years old now and has been rock solid until recently.  I have two VMs (Windows 7 and Windows 10) and a Plex docker.  The Windows 7 VM runs my "house audio" via a passed-through PCI audio card, and also runs uTorrent.

 

The first sign of issues was extracted RAR files either being empty (extraction failed) or corrupt (or what I thought was corruption).  I had 5 data drives, 1 parity, a cache drive, and an app drive.  The cache drive is a fairly old 1TB WD Black; the app drive is a Samsung 128GB SSD.  VMs/dockers live on the app drive, and downloads would download to the VM, then copy to a directory (through cache).

 

I assumed my cache drive was dying even though there was no red ball, and decided to move into 2018: dump the separate cache/app drives and put in a 500GB SSD as a combined cache/app drive.  That all went fairly smoothly, but one of my VMs didn't want to come up.  I messed with it for a while, then eventually plugged the old app drive into another PC (booted an Ubuntu live USB) and copied the VM over again.  I did this and boom, the VM came up.

Fast forward: uTorrent (or some process) was still having issues.  Since I had a nice new big cache/app drive, I decided to install the ruTorrent docker.  I got that up and running and added some recent existing torrents, pointing them at their already-downloaded location.  It was here I discovered that uTorrent didn't seem to be completing the downloads: ruTorrent was finding them at 96-97% complete and then finishing them.  So I assumed this was just some kind of uTorrent issue and moved on with life, using the ruTorrent docker instead of uTorrent on my VM.  Around the same time I discovered pi-hole and the pi-hole docker, so I installed and set that up too.  So I was feeling good and things were working, etc.

 

Yesterday my VMs/dockers crashed, and I had all kinds of I/O errors in my logs.  So I SSHed in and tried to look in /mnt/cache, but no go: I/O errors.  OK, let's shut down the array, go into maintenance mode, and do some checks.  BUT, I couldn't get /mnt/cache unmounted.  I let it sit, I tried lsof/fuser, etc., but the kernel was the only process that seemed to have it locked, so I pulled the plug and rebooted.
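For anyone hitting the same stuck unmount, the sequence I tried looked roughly like this (a sketch, not exactly what I typed; /mnt/cache is unRAID's usual cache mount point):

```shell
MNT=/mnt/cache

# Show any processes holding files open under the mount.
# Both may print nothing if only the kernel has it busy.
fuser -vm "$MNT" || true
lsof +f -- "$MNT" || true

# Try a normal unmount first, then a lazy unmount, which detaches
# the mount point immediately and cleans up once it's no longer busy.
umount "$MNT" || umount -l "$MNT"
```

When even the lazy unmount hangs and nothing in userspace shows up, that usually points at the filesystem or the drive itself rather than a stuck process.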

 

Everything came up, but parity wanted to run.  I stopped it, stopped the array, and put it in maintenance mode; I wanted to do some checks on the cache drive.  xfs_repair found some errors it couldn't fix (and Google didn't help me a whole lot, honestly): "Metadata CRC error detected at xfs_bmbt block".

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata CRC error detected at xfs_bmbt block 0xec53a00/0x1000
Metadata CRC error detected at xfs_bmbt block 0xec53a00/0x1000
btree block 1/451648 is suspect, error -74
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 3
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

 

After running xfs_repair without -n, and ultimately with -L, those errors remained, so I restarted the array and let a parity check run through the night.  It found 2.4 million errors.  And when I just took the array offline (stopped it, then restarted in maintenance mode to get the errors above), it told me I'd had an unclean shutdown and needed another parity check.
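For anyone following along, the repair sequence was roughly this (the device name is an example; find your cache partition with lsblk or the unRAID GUI, and make sure the filesystem is unmounted first):

```shell
# Example device only -- substitute your actual cache partition.
DEV=/dev/sdh1

# Dry run: report what would be fixed without writing anything.
xfs_repair -n "$DEV"

# Real repair pass.
xfs_repair "$DEV"

# Last resort: zero the metadata log. This can lose the most recent
# transactions, so only use it when repair refuses to run otherwise.
xfs_repair -L "$DEV"
```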

 

I'm attaching some screenshots, and a diagnostics report from just now, after the parity check.

 

Before I keep digging in this hole I'm in, I'm hoping someone can maybe help me climb out.

 

Thanks, sorry for the rambling.

cache_user_error.PNG

2million_parity.PNG

unclean_shutdown_need_parity.PNG

nas1-diagnostics-20180401-1056.zip

Edited by Dmtalon


Corrupt cache filesystem and parity sync errors suggest a hardware problem to me. I would start by running memtest for a few hours, ideally 24.


So, I cleaned up the cache drive: formatted it clean, deleted the libvirt image, and started fresh, installing Windows again from scratch.  While trying to install a service pack I saw this pop up in the logs.  There was no indication of any issues until just now (no errors in the log).

 

I shut everything down, reseated the RAM, and am running memtest86 on it.  We'll see what happens, I guess.  The MB/build is from 9/2014:

 

ASUS SABERTOOTH 990FX R2.0

AMD BOX AMD FX-8320 BLACK ED

CRUCIAL 8GB D3 1333 ECC x3
XFX HD5450 1GB D3 DVH PCIE
 

Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Metadata CRC error detected at xfs_buf_ioend+0x49/0x9c [xfs], xfs_bmbt block 0xdc5be0
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Unmount and run xfs_repair
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): First 64 bytes of corrupted metadata buffer:
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44000: 42 4d 41 33 00 00 00 fb 00 00 00 00 02 04 96 d8  BMA3............
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44010: 00 00 00 00 02 04 85 2e 00 00 00 00 00 dc 5b e0  ..............[.
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44020: 00 00 00 01 00 03 40 4c 87 94 a5 6c 31 ae 4b 35  ......@L...l1.K5
Apr  1 14:48:10 NAS1 kernel: ffff8805eaa44030: 9c 4c 12 72 8e 1b 67 1c 00 00 00 00 00 00 00 65  .L.r..g........e
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): metadata I/O error: block 0xdc5be0 ("xfs_trans_read_buf_map") error 74 numblks 8
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): xfs_do_force_shutdown(0x1) called from line 315 of file fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa0254bea
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): I/O Error Detected. Shutting down filesystem
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): Please umount the filesystem and rectify the problem(s)
Apr  1 14:48:10 NAS1 kernel: XFS (sdh1): writeback error on sector 246627368
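A quick way to catch these as they happen is to watch the syslog for XFS trouble lines (the log path is unRAID's default; adjust if yours differs):

```shell
# Follow the log and flag XFS CRC/I-O/corruption lines as they appear.
tail -f /var/log/syslog | grep --line-buffered -E 'XFS .*(CRC error|I/O [Ee]rror|corrupt)'
```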

 


So, as an update: I let memtest run 4 passes (about 19 hours) and it found no errors.  I've decided to try going back to my previous build of unRAID (6.3.5); I did a full restore of the zip I took before upgrading.  I put my app drive and cache spinner back in, and rebooted.

 

I copied the data off the app drive, cleaned it up by reformatting it, then moved my VMs/dockers back on.  Things seem to be back up, and there are no errors in the log (yet).  I'm about 20% through a parity check now.  Once that completes, I'll run some load tests within the VM to see if the problem returns.
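The copy-off step was roughly this (paths are examples, not my actual share names; the old app drive was mounted outside the array):

```shell
# Example paths only: SRC is the old app drive's mount,
# DST a share on the array with enough free space.
SRC=/mnt/disks/app
DST=/mnt/user/backup/app

# -a preserves permissions and timestamps, -H hard links, -X xattrs.
rsync -aHX --progress "$SRC/" "$DST/"

# Verify the copy before reformatting the source.
diff -r "$SRC" "$DST" && echo "copy verified"
```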

 

I'm starting to wonder if it's a 6.5.0 issue, but time will tell I guess.

 

memtest_pass.jpg

