
Why do my NetApp drives keep disabling themselves?



This is almost a weekly thing now.

It appears that whichever disk Unraid is currently writing to eventually disables itself.

 

Below is a picture of all my drives. Disks 10 to 20, plus my Parity 2 drive, run off a NetApp shelf with a controller card. The rest of the disks are internal and run off two old IBM SAS cards.

 

I always have the same issue with my NetApp shelf.
Before Disk 10 was "full", Unraid was writing new data to it and it kept disabling itself.
Now that Unraid is writing to Disk 11, that is the one disabling itself.
Parity 2 seems to go along for the ride each time.

 

What's causing this?

 

To get everything back online, I stop the array, remove the drive, and rebuild it. Then I stop the array a second time, remove the second parity drive, and rebuild that as well. The whole process takes about 4 days.

 

I've uploaded my logs; hopefully someone can figure out why this is happening so I can fix it permanently.

image.thumb.png.426691f027700c5af22e722265d531db.png 

tower-diagnostics-20200611-1854.zip

Link to comment

My initial response: you need better hardware. I'm not sure whether it's the power cables, the drive cables, or an undersized power supply, but you're getting a huge number of read failures across a large number of drives.

 

On a closer look: are you certain you have everything on a UPS? Right after your UPS ran a self-test, your disks dropped offline and could not be read.

 

Jun  9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun  9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760
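
If you want to check whether this pattern repeats elsewhere in the syslog, a rough Python sketch along these lines could help. It only looks for the two message formats shown in the excerpt above; the 30-second window, the year, and the function names are my own assumptions for illustration, not anything Unraid or apcupsd provides.

```python
import re
from datetime import datetime, timedelta

# Two lines copied from the syslog excerpt above, plus one more disk.
SAMPLE = """\
Jun  9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun  9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
"""

TIMESTAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")
READ_ERROR = re.compile(r"md: (disk\d+) read error")


def parse_ts(line, year=2020):
    """Parse the syslog timestamp; syslog omits the year, so one is assumed."""
    m = TIMESTAMP.match(line)
    if not m:
        return None
    stamp = " ".join(m.group(1).split())  # collapse the padded day field
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def read_errors_after_self_test(lines, window=timedelta(seconds=30)):
    """Return (disk, seconds_after_test) for read errors soon after a self-test."""
    last_test = None
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if "UPS Self Test switch to battery" in line:
            last_test = ts
            continue
        m = READ_ERROR.search(line)
        if m and last_test is not None and timedelta(0) <= ts - last_test <= window:
            hits.append((m.group(1), int((ts - last_test).total_seconds())))
    return hits


hits = read_errors_after_self_test(SAMPLE.splitlines())
for disk, delay in hits:
    print(f"{disk}: read error {delay}s after UPS self-test")
```

To run it on the real log, replace `SAMPLE.splitlines()` with the lines of the syslog file from the diagnostics zip.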

 

Also somewhere inside this mess you have filesystem issues, likely from drives dropping entirely offline.

 

Jun  9 04:37:58 Tower kernel: XFS (md10): metadata I/O error in "xlog_iodone" at daddr 0xe8eb669b len 64 error 5
Jun  9 04:37:58 Tower kernel: XFS (md10): xfs_do_force_shutdown(0x2) called from line 1271 of file fs/xfs/xfs_log.c.  Return address = 0000000015e00dc7
Jun  9 04:37:58 Tower kernel: XFS (md10): Log I/O Error Detected.  Shutting down filesystem
Jun  9 04:37:58 Tower kernel: XFS (md10): Please umount the filesystem and rectify the problem(s)

 

 

Link to comment

You also had actual WRITE errors.

 

Jun  9 01:26:13 Tower kernel: XFS (md11): writeback error on sector 1991695272
Jun  9 01:26:18 Tower kernel: XFS (md11): writeback error on sector 4082030256
Jun  9 01:26:33 Tower kernel: XFS (md11): writeback error on sector 1991821992
Jun  9 01:26:33 Tower kernel: XFS (md11): writeback error on sector 1991695488

Link to comment

interesting...

1. Thank you.
2. Yes, everything is on a UPS. My shelf is on one, and my Unraid server is on a different one.

My NetApp has 4 power supplies, but I only have 1 plugged in. (I'm only using one half of the shelf.)
I'll try plugging the 2nd power supply into my 3rd UPS.

Assuming this fixes the disks going offline, how do I fix the filesystem?

IMG_0179.jpeg

Edited by DannyG
Link to comment

Jun  9 01:26:10 Tower apcupsd[53882]: UPS Self Test switch to battery.
Jun  9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760

 

 

But notice how I'm getting errors on all my NetApp drives?
(These are actually brand-new drives, never used. They sat in a powered-off shelf for 4 years, then were moved to storage for 1.)
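
For what it's worth, here is a small Python sketch of how that could be checked against the log lines posted in this thread. It assumes that `mdN` in the XFS messages corresponds to Disk N, and that Disks 10 to 20 are the ones on the shelf, as described in my first post. The names are made up for illustration.

```python
import re

# Assumption from the first post: Disks 10 to 20 sit on the NetApp shelf.
NETAPP_DISKS = set(range(10, 21))

# One line per affected device, copied from the excerpts in this thread.
LOG = """\
Jun  9 01:26:13 Tower kernel: md: disk10 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk12 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk13 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk14 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk15 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk16 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk17 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk18 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk19 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: md: disk20 read error, sector=1991694760
Jun  9 01:26:13 Tower kernel: XFS (md11): writeback error on sector 1991695272
"""

PATTERNS = [
    re.compile(r"md: disk(\d+) read error"),
    re.compile(r"XFS \(md(\d+)\): writeback error"),  # mdN assumed to be Disk N
]


def affected_disks(lines):
    """Collect the disk numbers named in read-error or writeback-error lines."""
    found = set()
    for line in lines:
        for pattern in PATTERNS:
            m = pattern.search(line)
            if m:
                found.add(int(m.group(1)))
    return found


affected = affected_disks(LOG.splitlines())
print("disks with errors:", sorted(affected))
print("shelf disks without errors:", sorted(NETAPP_DISKS - affected))
print("internal disks with errors:", sorted(affected - NETAPP_DISKS))
```

On these lines, every shelf disk shows up and no internal disk does: Disk 11 has write errors rather than read errors, and the other ten have read errors.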
