DannyG Posted June 11, 2020

This is almost a weekly thing now. It appears that whatever disk Unraid is writing to eventually disables itself.

Below is a picture of all my drives. Disks 10 to 20 plus my Parity 2 drive are running off a NetApp shelf with a controller card. The rest of the disks are internal disks running off 2 old IBM SAS cards.

I always have the same issue with my NetApp shelf. Before Disk 10 was "full", Unraid was writing new data to it and it kept disabling itself. Now that Unraid is writing to Disk 11, that is the one disabling itself. Parity 2 seems to go along for the ride each time. What's causing this?

What I do to get everything back online is turn off the array, remove the drive, and then rebuild it. Then I turn off the array a 2nd time, remove the 2nd parity drive, and rebuild it again. That takes about 4 days.

I've uploaded my logs; hopefully someone can figure out why this is happening so I can fix it permanently.

tower-diagnostics-20200611-1854.zip
BRiT Posted June 11, 2020

My initial response: you need better hardware. I'm not sure if it's the power cables, the drive cables, or a power supply that is too small, but you're getting a huge number of read failures across many drives.

On a closer look: are you certain you have everything on a UPS? Right after your UPS did a self test, your disks dropped offline and could not be read from.

Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760

Also, somewhere inside this mess you have filesystem issues, likely from drives dropping entirely offline.

Jun 9 04:37:58 Tower kernel: XFS (md10): metadata I/O error in "xlog_iodone" at daddr 0xe8eb669b len 64 error 5
Jun 9 04:37:58 Tower kernel: XFS (md10): xfs_do_force_shutdown(0x2) called from line 1271 of file fs/xfs/xfs_log.c. Return address = 0000000015e00dc7
Jun 9 04:37:58 Tower kernel: XFS (md10): Log I/O Error Detected. Shutting down filesystem
Jun 9 04:37:58 Tower kernel: XFS (md10): Please umount the filesystem and rectify the problem(s)
BRiT Posted June 11, 2020

You also had actual WRITE errors.

Jun 9 01:26:13 Tower kernel: XFS (md11): writeback error on sector 1991695272
Jun 9 01:26:18 Tower kernel: XFS (md11): writeback error on sector 4082030256
Jun 9 01:26:33 Tower kernel: XFS (md11): writeback error on sector 1991821992
Jun 9 01:26:33 Tower kernel: XFS (md11): writeback error on sector 1991695488
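For anyone who wants to check their own syslog for the same pattern, here is a rough Python sketch that lists disk errors logged shortly after a UPS self test. The sample lines are copied from the excerpts above. The function names, the 60-second window, and the year passed to the parser are my own assumptions (syslog timestamps carry no year), so treat it as a starting point rather than a proper diagnostic tool.

```python
import re
from datetime import datetime, timedelta

# Sample lines copied from the syslog excerpts quoted in this thread.
SAMPLE_LOG = """\
Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: XFS (md11): writeback error on sector 1991695272
Jun 9 01:26:18 Tower kernel: XFS (md11): writeback error on sector 4082030256
Jun 9 04:37:58 Tower kernel: XFS (md10): Log I/O Error Detected. Shutting down filesystem
"""

# Leading syslog timestamp: month name, day, HH:MM:SS.
TIMESTAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")
# Error phrases taken from the log lines quoted above.
ERROR = re.compile(r"read error|writeback error|I/O error", re.IGNORECASE)


def parse_time(line, year=2020):
    """Return the line's timestamp, or None if it has no syslog prefix.

    The year is an assumption supplied by the caller.
    """
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    month, day, clock = match.groups()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def errors_after_self_test(log_text, window_seconds=60):
    """Return error lines logged within window_seconds after a UPS self test."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    test_times = [parse_time(line) for line in lines if "UPS Self Test" in line]
    test_times = [t for t in test_times if t is not None]
    window = timedelta(seconds=window_seconds)

    hits = []
    for line in lines:
        if not ERROR.search(line):
            continue
        when = parse_time(line)
        if when is None:
            continue
        if any(timedelta(0) <= when - start <= window for start in test_times):
            hits.append(line)
    return hits


if __name__ == "__main__":
    for hit in errors_after_self_test(SAMPLE_LOG):
        print(hit)
```

With the sample above, the read and writeback errors at 01:26 fall inside the window, while the XFS shutdown at 04:37 does not. To scan a real log, read the syslog file from the diagnostics zip into a string and pass it in place of SAMPLE_LOG.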
DannyG Posted June 12, 2020

Interesting...

1. Thank you.
2. Yes, everything is on a UPS. My shelf is on one, and my Unraid server is on a different one. My NetApp has 4 power supplies, and I only have 1 plugged in (I'm only using one half of the shelf). I'll try plugging the 2nd PS into my 3rd UPS.

Assuming this fixes the disks going offline, how do I fix the file system?

Edited June 12, 2020 by DannyG
JorgeB Posted June 12, 2020

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
DannyG Posted June 12, 2020

5 hours ago, johnnie.black said: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

Which drives do I run this on? The ones that are down? (Put them back online and then run this test/repair?)
JorgeB Posted June 12, 2020

On any unmountable drive, or any drive showing filesystem issues in the log.
DannyG Posted June 12, 2020

Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760

But notice how I'm getting errors on all my NetApp drives? (These are actually brand new drives, never used. They sat in a powered-off shelf for 4 years and were then moved to storage for 1.)
JorgeB Posted June 12, 2020

That is not a disk problem. Unraid is losing connection with the entire NetApp enclosure.
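One way to see this in a log is to group the read errors by timestamp and sector. When many disks report the same sector in the same second, a shared component (enclosure, cable, controller, power) is a more plausible suspect than each disk failing on its own. Below is a minimal Python sketch of that grouping; the function name and the `min_disks` threshold are made up for illustration, and the sample lines are taken from the log quoted above.

```python
import re
from collections import defaultdict

# Matches the md read-error lines quoted in this thread.
READ_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s.*"
    r"md: (?P<disk>disk\d+) read error, sector=(?P<sector>\d+)"
)


def simultaneous_failures(log_lines, min_disks=3):
    """Map (timestamp, sector) to the disks that failed there together.

    Only groups with at least min_disks distinct disks are returned.
    The threshold is an arbitrary illustrative choice.
    """
    groups = defaultdict(set)
    for line in log_lines:
        match = READ_ERROR.search(line)
        if match:
            groups[(match["stamp"], match["sector"])].add(match["disk"])
    return {
        key: sorted(disks)
        for key, disks in groups.items()
        if len(disks) >= min_disks
    }


# A few of the lines from the log excerpt above.
SAMPLE = [
    "Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.",
    "Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760",
    "Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760",
    "Jun 9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760",
    "Jun 9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760",
]

if __name__ == "__main__":
    for (stamp, sector), disks in simultaneous_failures(SAMPLE).items():
        print(f"{stamp} sector {sector}: {len(disks)} disks ({', '.join(disks)})")
```

A single bad disk would normally show up as errors on that one disk at varying sectors, so a group like the one above points away from the drives themselves.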
JonathanM Posted June 12, 2020

I think you said the disk enclosure is on a separate UPS from Unraid. That can be an issue. To troubleshoot, move all interconnected disk equipment to the same UPS and see what happens.