Why do my NetApp drives keep disabling themselves?

DannyG · June 11, 2020

This is almost a weekly thing now.

It appears that whatever disk Unraid is writing to eventually just disables itself.

Below is a picture of all my drives. Disks 10 to 20 plus my Parity 2 drive are running off a NetApp shelf with a controller card. The rest of the disks are internal, running off 2 old IBM SAS cards.

I always have the same issue with my netapp shelf.
Before Disk 10 was "full", Unraid was writing new data to it and it kept disabling itself.
Now that it's writing to Disk 11, that's the one disabling itself.
Parity 2 seems to go along for the ride each time.

What's causing this?

What I do to get everything back online is stop the array, remove the drive, and then rebuild it. Then I stop the array a 2nd time, remove the 2nd parity drive, and rebuild it again. That takes like 4 days.

I've uploaded my logs; hopefully someone can figure out why this is happening so I can fix it permanently.

tower-diagnostics-20200611-1854.zip

BRiT · June 11, 2020

My initial response: You need better hardware. Not sure if it's power cables, drive cables, or a power supply that's too small, but you're having a ton of read failures across a lot of drives.

My closer look: Are you certain you have everything on a UPS? Right after your UPS did a self test, your disks dropped offline and could not be read from.

Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760

Also somewhere inside this mess you have filesystem issues, likely from drives dropping entirely offline.

Jun 9 04:37:58 Tower kernel: XFS (md10): metadata I/O error in "xlog_iodone" at daddr 0xe8eb669b len 64 error 5
Jun 9 04:37:58 Tower kernel: XFS (md10): xfs_do_force_shutdown(0x2) called from line 1271 of file fs/xfs/xfs_log.c. Return address = 0000000015e00dc7
Jun 9 04:37:58 Tower kernel: XFS (md10): Log I/O Error Detected. Shutting down filesystem
Jun 9 04:37:58 Tower kernel: XFS (md10): Please umount the filesystem and rectify the problem(s)

BRiT · June 11, 2020

You also had actual WRITE errors.

Jun 9 01:26:13 Tower kernel: XFS (md11): writeback error on sector 1991695272
Jun 9 01:26:18 Tower kernel: XFS (md11): writeback error on sector 4082030256
Jun 9 01:26:33 Tower kernel: XFS (md11): writeback error on sector 1991821992
Jun 9 01:26:33 Tower kernel: XFS (md11): writeback error on sector 1991695488

DannyG · June 12, 2020

interesting...

1. Thank you.
2. Yes, everything is on a UPS. My shelf is on one, and my Unraid server is on a different one.

My NetApp has 4 power supplies, but I only have 1 plugged in. (I'm only using one half of the shelf.)
I'll try plugging the 2nd PS into my 3rd UPS.

Assuming this fixes the disks going offline, how do I fix the file system?

Edited June 12, 2020 by DannyG

JorgeB · June 12, 2020

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

DannyG · June 12, 2020

5 hours ago, johnnie.black said:

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

Which drive do I run this on? The ones that are down? (Put them back online and run this test/repair?)

JorgeB · June 12, 2020

On any unmountable drive, or any drive showing filesystem issues in the log.

DannyG · June 12, 2020

Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760

But notice how I'm getting errors on all my NetApp drives?
(These are actually brand new drives, never used; they sat in a powered-off shelf for 4 years, then moved to storage for 1.)

JorgeB · June 12, 2020

That is not a disk problem, Unraid is losing connection with the entire NetApp enclosure.
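One way to see this in the log itself: when many disks report a read error in the same second, and that second falls right after a UPS event, the enclosure or its power/link is the likelier suspect than any single drive. The following is only a rough sketch of that check, not an Unraid tool. The function name `summarize`, the 10-second window, the 3-disk threshold, and the assumed year are all arbitrary choices made for illustration; the sample lines are copied from the syslog excerpt earlier in the thread.

```python
import re
from collections import defaultdict
from datetime import datetime

# A few lines copied from the thread. In practice you would read the
# syslog file from the diagnostics zip instead (path not assumed here).
SAMPLE_LOG = """\
Jun 9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun 9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun 9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
"""

# "Mon D HH:MM:SS host message" as seen in the pasted excerpts.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ (.*)$")
READ_ERR_RE = re.compile(r"md: (disk\d+) read error")


def parse_time(stamp, year=2020):
    # Syslog stamps carry no year; 2020 is assumed from the post dates.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def summarize(log_text, window_seconds=10, min_disks=3):
    """Return (time, disks, near_ups) for each second in which at least
    min_disks different disks logged a read error. near_ups is True when
    a UPS switch-to-battery event happened within window_seconds before."""
    ups_events = []
    errors_by_time = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        when = parse_time(match.group(1))
        message = match.group(2)
        if "apcupsd" in message and "switch to battery" in message:
            ups_events.append(when)
        err = READ_ERR_RE.search(message)
        if err:
            errors_by_time[when].add(err.group(1))

    bursts = []
    for when, disks in sorted(errors_by_time.items()):
        if len(disks) < min_disks:
            continue
        near_ups = any(
            0 <= (when - ups).total_seconds() <= window_seconds
            for ups in ups_events
        )
        bursts.append((when, sorted(disks), near_ups))
    return bursts


if __name__ == "__main__":
    for when, disks, near_ups in summarize(SAMPLE_LOG):
        note = "right after a UPS event" if near_ups else "no UPS event nearby"
        print(f"{when}: {len(disks)} disks failed together ({note}): {disks}")
```

A burst that covers every disk on the shelf at once, as in the excerpt above, is what you would expect from the enclosure dropping out rather than from individual drives failing.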

JonathanM · June 12, 2020

I think you said the disk enclosure is on a separate UPS from Unraid. That can be an issue.

To troubleshoot, move all interconnected disk equipment to the same UPS and see what happens.
