Losing drive after drive in a very short time -- what to do?


alexricher

Good day Unraid Community,

 

Today was a weird (and bad) day for my Unraid 6.0.1 server. This morning, Disk 10 was in the "Device is disabled, contents emulated" state, and I have no idea why this suddenly happened. No one touched the server and nothing has been moved around (so I'm not pointing at hardware manipulation). I rebooted the server because I was getting spammed with tons of errors like this one:

 

REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error

 

Once rebooted, I ran a short SMART test and everything looked normal. So I unassigned Disk 10, started the array, stopped it, then reassigned the disk to its usual slot. A rebuild started. While I was watching a movie, I received a few email warnings about disks failing one after the other (3 disks failed... 5 disks failed... etc.). I got scared, as I'm not sure what's going on. The webGUI's main page shows only Disk 10 as redballed, but other disks are showing errors:

 

JFKqeba.png

 

I'm also attaching the syslog from the fresh restart so you can help me understand what's happening.

 

I'd prefer to shut down my server, as the syslog is getting spammed, but I've kept it in this redballed state so I can get proper support before doing anything wrong. Did I just lose all of my data?

 

Thanks for your help!

tower-syslog-20150729-2146.zip


Just a quick reassurance before I look at the syslog: when a bunch of drives suddenly drop off, it's always because a disk controller has crashed.  That means the drives are fine and the data is all fine, but without an operational controller the drives are temporarily out of touch.  A reboot usually fixes that immediately, but then you have to deal with the aftermath, plus figure out why it's crashing.  I note that the missing drives have the sequential symbols sdj through sdp: 7 drives, almost certainly on an 8-port add-on SATA/SAS card.
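For anyone who wants to apply the same reasoning to their own list of dropped drives, here is a minimal Python sketch of the "sequential symbols" check. It only tests whether the device names form an unbroken run of letters, which hints at (but does not prove) a shared controller. The helper name `is_consecutive_run` is made up for illustration, and the sketch assumes simple names of the form `sd` plus a single letter.

```python
def is_consecutive_run(devices):
    """Return True if names such as 'sdj', 'sdk', ... end in consecutive letters.

    Assumption: every name is 'sd' followed by exactly one letter.
    """
    letters = sorted(name[-1] for name in devices)
    return all(ord(b) - ord(a) == 1 for a, b in zip(letters, letters[1:]))


# The drives reported missing in this thread: sdj through sdp.
missing = ["sdj", "sdk", "sdl", "sdm", "sdn", "sdo", "sdp"]

print(len(missing), "drives missing; consecutive run:", is_consecutive_run(missing))
```

A run like this fits inside one 8-port card, which is why a single controller is the prime suspect rather than seven independent drive failures.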


* The syslog confirms it, at Jul 29 21:39:06, about 2 hours and 10 minutes after the rebuild started, the SAS card failed, and all of its drives were then unresponsive.  Unfortunately, that's about all I can say about it, because whenever this happens, I can't see any more clues as to what actually happened to the card, why it failed/crashed.

 

* Try checking for a firmware update for the SAS card.

 

* You have one 4 port SATA card that isn't being used at all, plus several other empty SATA ports.

 

* Minor issue - CacheDirs is caching things that probably don't need to be cached.  I strongly recommend that users go into the CacheDirs settings, and set the Includes, selecting only those folders that need to be cached.


Hi RobJ!

 

Thanks a lot for your help! :) Reading the syslog isn't that intuitive so I'm glad someone who understands can help me point out the potential culprit. :D

 

Although it's sad my hardware failed (8-port SATA/SAS controller), I'm relieved to hear that my data may still be intact. :D I'll try to troubleshoot the hardware issue with this card and hopefully resolve it. Any tips on how to troubleshoot such an issue (other than upgrading the firmware)? Luckily, I rebuilt my server a few months ago with a new motherboard with 10 SATA ports + a cheap 4-port SATA controller; therefore, I might be able to replug everything onto those ports instead.

 

In the meantime, assuming I move all of my drives to a working SATA controller, how do I deal with the multiple redballed hard drives? Do I need to do anything special to trust their content? I had already started rebuilding Disk 10 when the controller failed, so I'm not sure whether this drive is OK or not. Also, as far as I understand it, when multiple drives fail you may lose data. Is this truly the case?

 

Thanks again and have a great day!


The fact that the SAS controller appeared to have failed in use does not necessarily mean that it is defective - just that something went wrong when using it.  I had a SAS controller appearing to fail at random intervals unexpectedly for no apparent reason.  After a lot of trial and error I eventually realised that it was not plugged in quite squarely to the motherboard, so I guess under vibration it occasionally had intermittent contact.  Reseating it for better contact resolved the issue and it has since been rock-solid.


Thanks for your reply. :-) I'll give this a go tonight: reseat the card inside the server, check the cable connections and reboot. I'll then re-attempt to rebuild disk 10 and hope for the best. It's worrying, as I haven't touched the server hardware in a few months, but we'll see. Fingers crossed that no data is lost and the controller is fine.

 

One thing to note though: the server seemed to be under heavy load before the controller failed. I had a load of 17-18 with a disk rebuild and lots of dockers working simultaneously... Perhaps the controller couldn't keep up...

 

I'll report back and don't hesitate if you have other suggestions. :-)


So, the aftermath is that 2 drives are now redballed. While disk 10 was rebuilding (near the end), disk 12 failed. :(

 

I'm assuming it's the controller that failed again. How would I go about trusting the content of the drive though? I mean, I now have 2 drives that failed but I'm assuming as stated in the post above that it may be a false error.

 

Thanks for your help!

Technically, a redballed drive can't be trusted since it failed a write and that might mean file system corruption. But if you don't have any choice, New Config.

 

Best to fix whatever is causing this before doing that or anything else though.

 

Also, I didn't see it mentioned in the thread. Do you have backups?


Thanks for your reply. :)

 

Sadly, I don't have backups (It's a 40TB setup so I can't afford to mirror it for backup.) What kind of backup are we talking about?

 

Disk 12 failed while I was rebuilding Disk 10 (near the end of the rebuild, on top of that). I trusted my Supermicro AOC-SAS2LP-MV8 SATA card as I couldn't find anything wrong with it. I'm not really sure the drive is bad, as it seemed the controller kept failing earlier. Is the "New Config" feature my last resort? Once 2 disks are flagged as bad, is there any other way? I'm assuming I may lose content if I go with "New Config"?

 

In the meantime, I've installed all hard drives onto a new Marvell SATA controller (4 ports) + the 10 ports on the motherboard. I was short only 1 port, as I have a total of 15 drives, so I hooked that last drive back up to the Supermicro 8-port SATA card just to get the server booting. It booted fine.

 

To push my luck even further before going for the "New Config", I downloaded the latest firmware for my Supermicro AOC-SAS2LP-MV8 card. I had the original firmware that shipped with the card (4.0.0.1800). I upgraded it to the latest in the hope that it would help stabilize the card. I rebooted once the firmware was applied. Now, I can't boot anymore! :( Where I would usually see the card's BIOS (with a list of all of the connected drives), I now only see a flashing cursor.

 

Have I just bricked my SATA controller?

 

[EDIT]: Alright, I've changed the port for the SATA controller hoping it would boot and it worked. :)

 

Now, back to the initial problem: Do I need to do a "New Config" to trust all content? Or is there a better way around this?


Was there writing to that other disk during the rebuild? If not I don't understand how it would have redballed, resulting in 2 disabled disks. I remember some discussion where unRAID will try to write data that it failed to read by calculating it from the parity array, but with another disk already disabled it wouldn't be possible to calculate the data for another disk.
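To make the point about two disabled disks concrete, here is a small Python sketch of single XOR parity. It is a toy model on two-byte "disks" and not unRAID's actual code; the helper `xor_blocks` and the sample byte values are invented for illustration. It shows that one missing disk can be recomputed from parity plus the remaining disks, while with two disks missing only their XOR combination can be recovered.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy contents for three data disks (made-up values).
disk10 = bytes([0x12, 0x34])
disk11 = bytes([0xAB, 0xCD])
disk12 = bytes([0x0F, 0xF0])

# Single parity is the XOR of all data disks.
parity = xor_blocks([disk10, disk11, disk12])

# One disk missing: XOR of parity and the surviving disks gives it back.
rebuilt10 = xor_blocks([disk11, disk12, parity])
assert rebuilt10 == disk10

# Two disks missing: the same calculation only yields disk10 XOR disk12,
# which is not enough to separate the two original contents.
combined = xor_blocks([disk11, parity])
assert combined == xor_blocks([disk10, disk12])
```

This is why a second disabled disk during a rebuild is so much worse than the first: with one parity drive, there is no longer enough information to emulate either disk.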

 

Even if you could get unRAID to trust one of the disks so you could rebuild the other, I don't think the rebuild would be any more trustworthy than if you just did New Config. I would probably do New Config followed by check filesystems on both drives, but you might wait and see if anyone else has an idea.

 

As for backups, I don't back up everything on my array. As you noted, it's too much unless you are going to have another complete server for this (some do). Most of the media could be reacquired if necessary, so I don't back up any of that except for music, because it is more important to me than movies and TV. Our most important files, which are irreplaceable, are our personal documents and photography. For these, unRAID is the backup. The originals are all on the desktop PC. In addition to this, I back up some user shares (which contain those irreplaceable files) from unRAID to external disks for storage offsite.

 

So, at least consider what you can't afford to lose, and come up with a plan. The only real backup of a file is one or more extra copies of the file. unRAID's fault-tolerance doesn't change this fact.


I think it may have written to disk 12 unintentionally while it was rebuilding... My assumption is it may have been caused by the dockers running at auto-start (I should have stopped them after startup... :()

 

From my understanding, there isn't much to do other than "New Config"? I'm willing to give this a try. Will this make me lose content of disk 10 & 12? How does "New Config" behave? Does it just try to rebuild the parity from what is on the hard drives?

 

As for backups, I do have backups of my pictures (although I do it manually so it might not be up-to-date) but most of my music, tv and movies aren't backed up.

 

Thanks again and have a great day!

 

 


If you wanted to be really brave, there is a "trust parity" option when you set up a new array. You could do that and get the array all green again. Then, shut down, pull disk10 and restart to cause it to fail. At that point, check the contents of the virtual disk10 that unRAID creates from the other disks and see whether the files actually exist and are correct. If it looks good, then you could try the rebuild again.

This is along the lines of what I would recommend. When you say "pull disk10", I assume you mean unassign it from the slot, not physically pull it. Stop all Dockers and VMs from starting first.

 

But before going there - you said that the disk rebuild almost completed. Depending on how much data was on the disk I guess it's possible that all your data is on that rebuilt disk. md5s would be very handy.
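As a rough illustration of why md5s would help here, this Python sketch builds a checksum manifest for every file under a directory and compares two manifests. It is only a sketch using the standard `hashlib` module; the function names `md5_of_file`, `md5_manifest` and `changed_files` are made up, and nothing like this ships with unRAID as far as this thread shows.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """List files from the 'before' manifest that are missing or differ in 'after'."""
    return sorted(name for name in before if after.get(name) != before[name])
```

With a manifest saved while the array was healthy, calling `md5_manifest("/mnt/disk10")` on the rebuilt or emulated disk and passing both manifests to `changed_files` would list exactly which files no longer match, instead of having to guess.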

 

The other disk you say failed: did it really fail, or did unRAID just kick it out of the array? Often drives drop out of the array for reasons other than a failure (like a cabling or controller issue). (Sorry, tl;dr the entire thread.) So you may already have all your data.


Sounds like we are all saying New Config and it wouldn't hurt to Trust Parity since that could be rebuilt later if needed. Then after all green, check disk file systems might tell us whether anything further was needed.

 

I think I have seen it mentioned that even if you trust parity it will start a correcting parity check so if it does you should cancel it.


Thanks everyone for your reply. It's appreciated!

 

So, just to summarize, disk 10 failed when we suspect the SATA controller card crashed. Perhaps the data was fine at that point, but I proceeded with a rebuild that never completed. Question: Does a rebuild go over all bits and reset them to 0s and 1s based on the parity? If so, is it safe to assume that even if it never completed, the previous content is still present?

 

Disk 12 failed while rebuilding disk 10 most likely because a docker wrote to it. Question: If I decide to trust it, is there any way to see the content of that drive manually? Also, by doing "New Config", will it take it as-is or based on the parity?

 

Sorry for so many questions, but I'm worried about losing content, so I prefer taking my time to properly understand what may have happened. Personally, I'm willing to lose the latest data copied to disk 10/12 but I'm not willing to lose any previous content (as I'm not sure what was on there before).

 

Thanks again! Fingers crossed I'll be able to take action soon :)


New Config will take all of your data drives just as they are. When you do New Config, you will have to reassign all of your drives. Be VERY sure you do not assign one of your data drives to the parity slot or it will overwrite your data with parity.

 

If the drives are mountable then unRAID will mount them and you should be able to see your files. It is possible, however, that incomplete writes to the drive will have caused some corruption to the file system. We can check for that and try to fix it if needed after you get the "all green".

 

Still unclear whether the actual cause of these problems has been addressed.

Good day trurl,

 

Thanks for the info! :) So, New Config is safer than I thought if it takes the data as-is. Then, this is what I will do.

 

As for the source of the issue, to be honest, I'm not sure I have found it either. According to my syslog and RobJ's reading of it earlier in this thread, the Supermicro AOC-SAS2LP-MV8 SATA controller failed.

 

The controller is a year and a half old with its original firmware. The server was under heavy load when it failed. Here's what I did to work around the issue:

  • Moved all HDDs (except 1, for lack of other ports) to the motherboard's SATA ports (x10) + a new Marvell 4-port SATA controller (x4)
  • Upgraded my Supermicro AOC-SAS2LP-MV8 SATA controller to the latest available firmware
  • Moved it to a different slot on the motherboard
  • Removed the gfx card from the server, in case it was drawing too much power

 

The HDD plugged into the Supermicro card seems to work. In all honesty, I'm really not sure what could have made the card fail. :( If anyone has tips, I'm willing to give them a try and debug it. Or should I make a warranty claim on the card?

 

In the meantime, I will do the New Config but might need help to get the drives working properly. Is there anything special to be done (other than avoiding the parity check at Start?)

 

Thanks again and have a great day!


Alright :) It's done. I've done:

 

  • New Config
  • Reassigned all drives in proper position
  • Checked "Parity is valid" before starting the array
  • Started the array...
  • As soon as the UI refreshed, the parity sync was in progress so I've stopped it resulting in this: Last check incomplete on Tuesday, 04-08-2015, 19:02 (today), finding 3282 errors.

 

Should I be worried about the 3282 errors found so far?

 

Now, what's the next step? :)


Thanks again trurl!

 

I've done the file system check for disk 10 and disk 12:

 

Disk 10:

root@Tower:/dev# reiserfsck /dev/md10
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed Aug  5 08:10:29 2015
###########
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..  finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Checking Semantic tree:
finished
1 found corruptions can be fixed when running with --fix-fixable
###########
reiserfsck finished at Wed Aug  5 09:01:24 2015
###########

 

Disk 12:

root@Tower:~# xfs_repair -v /dev/md12
Phase 1 - find and verify superblock...
        - block cache size set to 1451400 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2674817 tail block 2674817
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

        XFS_REPAIR Summary    Wed Aug  5 08:16:05 2015

Phase           Start           End             Duration
Phase 1:        08/05 08:11:47  08/05 08:11:47
Phase 2:        08/05 08:11:47  08/05 08:13:47  2 minutes
Phase 3:        08/05 08:13:47  08/05 08:15:53  2 minutes, 6 seconds
Phase 4:        08/05 08:15:53  08/05 08:15:53
Phase 5:        08/05 08:15:53  08/05 08:15:53
Phase 6:        08/05 08:15:53  08/05 08:15:54  1 second
Phase 7:        08/05 08:15:54  08/05 08:15:54

Total run time: 4 minutes, 7 seconds
done

 

So, should I fix disk 10? :-) Thanks!


I haven't found any lost+found directory after running "xfs_repair -v /dev/md12". To test this, I've started the array (not in Maintenance mode as I cannot see the drive's content otherwise) and I haven't seen anything special in /mnt/disk12. Perhaps there wasn't anything recovered?

 

As for the "reiserfsck /dev/md10" command, it did mention: "1 found corruptions can be fixed when running with --fix-fixable". Is it safe to go ahead? Should I go ahead and run it?

 

Thanks again! I can't wait to get my server back in normal state. :) This server does so many things. :D


Archived

This topic is now archived and is closed to further replies.
