Weird random Drive Dropping.


bradtwo

First, I want to acknowledge the above-and-beyond support you guys have given me over the years. Thank you so much.

 

First, here is a quick overview of my setup. 

Unraid Plus

6-Core AMD6300 / 16GB ECC DDR3

12 Drives (11 - HDD, 1 Cache). 

Configured with Dual Parity (4TB). 

     2 - SATA/PCIE controllers Installed. 

 

 

 

So, here are the issues I am having with my Unraid. Hopefully I can fully describe what is going on.


I first started seeing drives randomly becoming unavailable in my rig. This would, of course, cause data to become unavailable (very worrisome). At first, I could take the server offline and re-add the drive. The rebuild would then run with no issue. Things would be great for about a month, then another drive would go offline.

 

I then had one drive fail (a new WD Blue 4TB). Bummer, it was only about two years old, but that happens. I swapped it out for a new one, which then failed, or was reported as failed/unavailable, within 3 months (from memory). The two drives were connected directly to my motherboard's SATA ports. I moved them to the controller and re-added them. Issue solved.

 

Most recently, I am seeing more than 2 drives failed at the same time, and my data is now unavailable. I was able to move one of the failed drives to the SATA/PCIe controller (which is now maxed out) and re-add it.

 

Parity Sync / Data Rebuild: it initially stated it would take about 1 day to complete (bummer, but I am OK with that). When I check now, the rebuild is running at 18 KB/s and the estimated time to complete is about 344 days. It is stuck at 50.0%.

 

I am up to date with all the settings on my Unraid. When I go to System Log, my browser doesn't load anything; it just spins indefinitely.

 

So... what  the heck is going on? 

 

 

 

 

 

Screen Shot 2018-11-17 at 3.41.40 PM.png

Screen Shot 2018-11-17 at 3.41.29 PM.png

Screen Shot 2018-11-17 at 3.41.09 PM.png

Edited by bradtwo
I wrote Parity rebuild not Data rebuild.
Link to comment

Update 01: I was able to pull this information from under "HISTORY". The drive that is reporting all the errors is a new drive.

 

 

Date | Duration | Speed | Status | Errors
2018-11-16, 12:29:42 | 15 hr, 49 min, 59 sec | 70.2 MB/s | OK | 39708657
2018-11-01, 14:30:19 | 14 hr, 30 min, 18 sec | 76.6 MB/s | OK | 0
2018-10-01, 14:12:31 | 14 hr, 12 min, 30 sec | 78.2 MB/s | OK | 0
2018-09-01, 14:28:44 | 14 hr, 28 min, 43 sec | 76.8 MB/s | OK | 19
2018-08-17, 10:08:40 | 1 day, 13 hr, 36 min | 29.6 MB/s | OK | 0
2018-08-08, 16:12:04 | 20 hr, 58 min, 33 sec | 53.0 MB/s | OK | 20
2018-08-05, 12:36:35 | 20 hr, 41 min, 57 sec | 53.7 MB/s | OK | 1
2018-08-02, 15:17:48 | 1 day, 15 hr, 17 min, 47 sec | 28.3 MB/s | OK | 185
2018-07-16, 13:55:37 | 17 hr, 57 min, 37 sec | 61.9 MB/s | OK | 0
2018-07-13, 18:04:04 | 22 hr, 14 min, 9 sec | 50.0 MB/s | OK | 0
2018-07-01, 15:31:56 | 15 hr, 31 min, 55 sec | 71.6 MB/s | OK | 0
2018-06-01, 19:31:33 | 19 hr, 31 min, 32 sec | 56.9 MB/s | OK | 0
2018-05-31, 18:20:28 | 20 hr, 56 min, 31 sec | 53.1 MB/s | OK | 0
2018-05-30, 21:18:47 | 1 min, 18 sec | Unavailable | Canceled | 0
2018-05-27, 22:09:40 | 23 hr, 30 min, 11 sec | 47.3 MB/s | OK | 0
2018-05-01, 18:30:09 | 18 hr, 30 min, 8 sec | 60.1 MB/s | OK | 0
2018-04-18, 05:05:53 | 7 hr, 46 min, 13 sec | 143.0 MB/s | OK | 0
2018-04-17, 15:14:56 | 19 hr, 26 min, 32 sec | 57.2 MB/s | OK | 0
2018-04-08, 10:36:56 | 15 hr, 1 min, 45 sec | 73.9 MB/s | OK | 0
2018-04-03, 15:03:07 | 19 hr, 43 min, 25 sec | 56.3 MB/s | OK | 0
2018-04-01, 17:06:13 | 17 hr, 6 min, 12 sec | 65.0 MB/s | OK | 0
2018-03-21, 08:23:15 | 14 hr, 16 min, 20 sec | 77.9 MB/s | OK | 0
2018-03-01, 13:42:00 | 13 hr, 41 min, 59 sec | 81.1 MB/s | OK | 0
2018-02-01, 14:04:08 | 14 hr, 4 min, 7 sec | 79.0 MB/s | OK | 0
2018-01-01, 14:27:42 | 14 hr, 27 min, 41 sec | 76.8 MB/s | OK | 0
2017-12-01, 13:20:42 | 13 hr, 20 min, 41 sec | 83.3 MB/s | OK | 0
2017-11-01, 13:23:18 | 13 hr, 23 min, 17 sec | 83.0 MB/s | OK | 512
2017-10-15, 12:54:55 | 1 day, 10 hr, 32 min, 15 sec | 32.2 MB/s | OK | 512
2017-10-07, 15:52:42 | 16 hr, 35 min, 57 sec | 67.0 MB/s | OK | 0
2017-10-01, 13:59:14 | 13 hr, 59 min, 13 sec | 79.5 MB/s | OK | 0
2017-09-01, 12:36:15 | 12 hr, 36 min, 14 sec | 88.2 MB/s | OK | 0
2017-08-05, 19:32:02 | 8 hr, 3 min, 11 sec | 138.0 MB/s | OK | 0
2017-08-05, 06:54:18 | 13 hr, 19 min, 13 sec | 83.4 MB/s | OK | 0
2017-08-01, 13:30:15 | 13 hr, 30 min, 14 sec | 82.3 MB/s | OK | 0
2017-07-01, 13:46:02 | 13 hr, 46 min, 1 sec | 80.7 MB/s | OK | 0
2017-06-01, 12:36:39 | 12 hr, 36 min, 38 sec | 88.1 MB/s | OK | 0
2017-05-01, 12:27:32 | 12 hr, 27 min, 31 sec | 89.2 MB/s | OK | 0
2017-04-15, 07:37:36 | 12 hr, 44 min, 51 sec | 87.2 MB/s | OK | 0
2017-04-01, 11:53:54 | 11 hr, 53 min, 53 sec | 93.4 MB/s | OK | 0
2017-03-18, 05:59:35 | 12 hr, 36 min, 2 sec | 88.2 MB/s | OK | 0
2017-03-01, 12:37:10 | 12 hr, 37 min, 9 sec | 88.1 MB/s | OK |
2017-02-01, 12:30:10 | 12 hr, 30 min, 9 sec | 88.9 MB/s | OK |
2017-01-01, 13:03:57 | 13 hr, 3 min, 56 sec | 85.1 MB/s | OK |
2016-12-31, 07:06:07 | 13 hr, 29 min, 6 sec | 82.4 MB/s | OK |
Dec, 01, 12:50:54 | 12 hr, 50 min, 53 sec | 86.5 MB/s | OK |
Nov, 19, 17:54:23 | Unavailable | Unavailable | Canceled |
Nov, 01, 00:30:28 | 12 hr, 9 min, 32 sec | 91.4 MB/s | OK |
Oct, 01, 13:31:55 | 13 hr, 31 min, 53 sec | 82.1 MB/s | OK |
Sep, 09, 08:38:26 | 14 hr, 37 min, 6 sec | 76.0 MB/s | OK |
Sep, 05, 11:42:14 | 15 hr, 2 min, 24 sec | 73.9 MB/s | OK |
Sep, 01, 12:30:41 | 12 hr, 30 min, 39 sec | 88.8 MB/s | OK |
Jul, 14, 21:31:49 | 12 hr, 12 min, 26 sec | 91.0 MB/s | OK |
Jul, 01, 12:20:29 | 12 hr, 20 min, 27 sec | 90.1 MB/s | OK |
Jun, 25, 10:39:40 | 12 hr, 17 min, 41 sec | 90.4 MB/s | OK |
Jun, 17, 21:19:18 | 12 hr, 42 min, 28 sec | 87.5 MB/s | OK |
Jun, 17, 08:05:54 | 14 hr, 53 min, 48 sec | 74.6 MB/s | OK |
Jun, 14, 19:56:35 | 18 hr, 24 min, 18 sec | 60.4 MB/s | OK |
Jun, 13, 14:47:56 | Unavailable | Unavailable | Canceled |
Jun, 01, 00:41:18 | Unavailable | Unavailable | Canceled |
May, 06, 07:20:56 | 11 hr, 54 min, 14 sec | 93.4 MB/s | OK |
May, 01, 11:47:52 | 11 hr, 47 min, 50 sec | 94.2 MB/s | OK |
Apr, 01, 12:38:51 | 12 hr, 38 min, 49 sec | 87.9 MB/s | OK |
Mar, 01, 11:58:07 | 11 hr, 58 min, 5 sec | 92.9 MB/s | OK |
Feb, 10, 06:19:30 | 12 hr, 4 min, 18 sec | 92.1 MB/s | OK |
Feb, 06, 08:31:49 | 12 hr, 8 min, 24 sec | 91.5 MB/s | OK |
Feb, 01, 13:04:23 | 13 hr, 4 min, 21 sec | 85.0 MB/s | OK |
Jan, 29, 09:57:17 | 12 hr, 31 min | 88.8 MB/s | OK |
Jan, 25, 18:01:57 | Unavailable | Unavailable | Canceled |

Link to comment
33 minutes ago, bradtwo said:

So... what  the heck is going on? 

I am not an expert on disk issues (that would be @johnnie.black), but I do see an explanation for your disks randomly dropping: you have a Marvell SATA controller on your MB and, it appears, Marvell-based PCIe SATA controllers as well. This is not good, as they are known to drop disks with unRAID. Any disk connected to SATA ports (whether those are MB ports or a PCIe HBA) controlled by a Marvell chipset may disappear for no apparent reason. They could be fine for a long time and then just go bye-bye until a reboot.

 

02:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 12)
    Subsystem: Gigabyte Technology Co., Ltd 88SE9172 SATA 6Gb/s Controller [1458:b000]

 

07:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9215] (rev 11)
    Subsystem: Marvell Technology Group Ltd. Device [1b4b:9215]
 
08:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9215] (rev 11)
    Subsystem: Marvell Technology Group Ltd. Device [1b4b:9215]
   

Unfortunately, you are playing Russian Roulette with disks attached to these controllers.
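
If you want to check a controller listing like the one above without reading it by eye, here is a minimal Python sketch. It assumes the text comes from something like `lspci -nn` and that any SATA controller with vendor ID 1b4b (the ID shown in the listing) is a Marvell part worth a second look. The function name `find_marvell_sata` is made up for illustration.

```python
import re

# Vendor ID taken from the listing above. Treating every 1b4b SATA
# controller as suspect is an assumption made for this sketch.
MARVELL_VENDOR_ID = "1b4b"

# Matches device lines such as:
#   02:00.0 SATA controller [0106]: Marvell ... [1b4b:9172] (rev 12)
# "Subsystem:" lines do not start with a slot address, so they are skipped.
DEVICE_LINE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+SATA controller\b.*"
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]",
    re.IGNORECASE,
)


def find_marvell_sata(lspci_text):
    """Return (slot, device_id) pairs for Marvell SATA controllers."""
    hits = []
    for line in lspci_text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match and match.group("vendor").lower() == MARVELL_VENDOR_ID:
            hits.append((match.group("slot"), match.group("device").lower()))
    return hits
```

Feeding it the three device lines quoted above should report slots 02:00.0, 07:00.0 and 08:00.0. It only identifies the chipset; it says nothing about whether a given drop was caused by the controller, a cable, or the disk.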

 

For peace of mind, the experts here recommend LSI chipset-based controllers flashed to IT mode. There are many recommendations in these forums, but they include models such as the LSI 9207-8i, LSI 9211-8i, IBM M1015, Dell PERC H200/H310 and others.

 

You may also wish to verify that your SATA cables and connections are all good.

 

As to your disk failure issues, I defer to the experts to diagnose the issues and the solutions.

 

 

 

 

Link to comment

Dude, you're awesome. I'm going to take that first step. The controller I picked was one that was on the original "approved hardware" list back in the day.

 

So my follow-up question is this: in the current state, what would be the risk of me taking the opportunity to migrate this over to all new hardware [1]? Honestly, my main concern is (of course) data loss (which is why I went with 2 parity drives).

As it currently sits, I would have to wait 344 days before the parity check completes. Could I stop this now [2]? I'm not sure I have the patience to wait that long to use my setup. (422 days until completion)

 

What is your recommended course of action for me [3]?

 

Another question (sorry).  I read this post here. 

 

 

Would this be a potential temporary fix[4]?

 

 

 

Basically, I would be OK with limping by and then rebuilding this setup completely, again.

 

 

 

 

Edited by bradtwo
adding in days until completion.
Link to comment
25 minutes ago, bradtwo said:

Dude, you're awesome. I'm going to take that first step. The controller I picked was one that was on the original "approved hardware" list back in the day.

True. With unRAID 5 and prior versions, Marvell-based controllers were fine. I had one. It worked without issues. However, ever since unRAID 6.x.x, Marvell controllers have had issues with the latest Linux kernels, drivers, etc.

 

25 minutes ago, bradtwo said:

So my follow-up question is this: in the current state, what would be the risk of me taking the opportunity to migrate this over to all new hardware [1]? Honestly, my main concern is (of course) data loss (which is why I went with 2 parity drives).

As it currently sits, I would have to wait 344 days before the parity check completes. Could I stop this now [2]? I'm not sure I have the patience to wait that long to use my setup.

There is no danger in cancelling the parity check.  Certainly something is amiss.  Your "failed" drives may not have actually been failed drives.  If you have controller and/or SATA cable issues, that could account for the problems.  UnRAID takes a disk offline when writes fail.  That could be a drive issue, but, it could also be a controller or cable issue.

 

As far as how to best proceed with hardware upgrades and restoring the health of your array, I do not want to give any erroneous advice and I am not the expert in these matters.  Hopefully, someone else will chime in on the best procedure.

 

I do think your system is recoverable and you likely have not lost any data.  The first thing I would do is replace the SATA controllers and get one of the ones I mentioned.  There are many of them available on eBay.  With the problems you currently have, there is no telling what can be attributed to those controllers.

 

After your disks are "stable" you can consider other hardware upgrades.

Edited by Hoopster
Link to comment

Thank you for all of your advice. 

I did make a mistake in my comment, which I only caught now when reading yours.

It states: Parity Sync / Data Rebuild 50.0% (this is where it has been stuck all day), with the estimated completion time climbing by days at 13 KB/s. When this started, it estimated 1 day to complete and was processing at 50 MB/s (if I recall). This morning when I woke up, I saw the updated time.

I will edit the comment above. Again, this is my error for mislabeling.

Link to comment

Modifying this comment to help avoid any confusion. 

 

AGAIN! Thank you. I absolutely love this community. I suppose moving forward I am going to invest in some stand-alone drives outside of unRaid for additional backup in case of future  unplanned incompatibility issues. I did not lose anything important in the issue. My family photos and documents are very well backed up outside of unRaid. 

 

 

Yesterday the estimate on the repair got up to 600+ days, rebuilding at 3 KB/s. I decided to restart the machine. It came back, began the process again, and completed it within 1 day, returning to no errors. I have taken the advice to heart and ordered LSI SAS 9201-8i HBA controllers, which supposedly work out of the box. This will save me a few steps flashing firmware to IT mode.

 

 

Now, all drives seem to be back to normal. One error persists, though.

My /media/movies folder seems to be completely empty; originally it had 1,000+ objects in it. Navigating manually through the web interface to the share itself shows 0 items as well. However (big catch), going to the individual drives (Disk 1 > media/movies), I see a portion of the items there.

So it appears that the /media/movies share is somehow broken, but the files still remain on the disks themselves.

Does anyone know how to recover this?  

 

My next go-to would be to browse to them manually via a terminal, copy them to a folder, and rebuild the share that way.

 

 

 

 

 

 

 

Edited by bradtwo
grammar is hard
Link to comment
17 hours ago, bradtwo said:

My /media/movies folder seems to be completely empty; originally it had 1,000+ objects in it. Navigating manually through the web interface to the share itself shows 0 items as well. However (big catch), going to the individual drives (Disk 1 > media/movies), I see a portion of the items there.

So it appears that the /media/movies share is somehow broken, but the files still remain on the disks themselves.

Does anyone know how to recover this?

My next go-to would be to browse to them manually via a terminal, copy them to a folder, and rebuild the share that way.

 

 

 

 

 

 

 

You might want to check that you do not have folders on the disks with the same spelling but different capitalization (e.g. 'movies' and 'Movies'). Under Linux, names are case sensitive. If you end up with two folders that differ only in capitalization, you can get the symptoms you describe, as Samba will choose one of them and ignore the other.
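
A quick way to look for that situation is to compare folder names across the disks while ignoring case. The Python sketch below is only an illustration. The function name `find_case_collisions` is hypothetical, and the mount points in the usage note (`/mnt/disk1`, `/mnt/disk2`, ...) are an assumption about where the array disks are mounted, so adjust them to your system.

```python
import os
from collections import defaultdict


def find_case_collisions(disk_roots, relative_dir=""):
    """Find folder names that differ only by capitalization.

    Looks at the sub-folders of relative_dir on every disk root and
    returns {lowercase_name: [spellings found]} for names that appear
    with more than one spelling.
    """
    spellings = defaultdict(set)  # lowercase name -> actual spellings seen
    for root in disk_roots:
        target = os.path.join(root, relative_dir)
        if not os.path.isdir(target):
            continue  # this disk has no such folder
        with os.scandir(target) as entries:
            for entry in entries:
                if entry.is_dir():
                    spellings[entry.name.lower()].add(entry.name)
    return {
        name: sorted(found)
        for name, found in spellings.items()
        if len(found) > 1
    }
```

For example, `find_case_collisions(["/mnt/disk1", "/mnt/disk2"], "media")` would return an entry for `movies` if one disk holds `media/movies` and another holds `media/Movies`. An empty result means no capitalization clash was found at that level; it only reads directory names and changes nothing.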

Link to comment

I agree, Linux is case sensitive.

The shares have the correct capitalization and worked for a very long time, up until my most recent issue (read above).

Now I can see the folders via the Unraid interface by going to the disks individually, but when I go to the share it is empty.

Edited by bradtwo
Link to comment

Lots of ATA errors on disk4; it might be the Marvell controller, or it might be cable related.

Both disk1 and disk9 have filesystem corruption and need to be checked.

As mentioned earlier, parity2 is failing. Since you rebuilt two disks at once, there were read errors during the rebuild, so there will be some data corruption on the rebuilt disks, possibly just on disk9, since the errors appear to be on the last sectors and disk1 is much smaller.

Link to comment

I'm new to this part of the system.

To confirm: my disk9 states it must be formatted before it can be mounted. Should I do this? [1]

Some searching brought me to this page: https://wiki.unraid.net/Check_Disk_Filesystems

Is this what you were referring to when you said to check the file system? [2]

Additionally, reading through this, it appears there are two methods: one via the command line, the second via the web GUI after restarting in Maintenance mode. Is Maintenance mode preferred? [3]

 

Thank you (as always).

 

 

Edited by bradtwo
Link to comment
6 minutes ago, bradtwo said:

Semi-noob question here. From my searching I found this article on checking disk filesystems for Unraid: https://wiki.unraid.net/Check_Disk_Filesystems

The command it gave was "reiserfsck --fix-fixable /dev/md5"

Is this the correct method?

 

You should be able to do it from the webUI. Just click on the disk to get to its page. As it explains on that page, you will have to start in Maintenance mode.

Link to comment

VERY QUICK RESPONSE! (Thank you!)

I had just edited my post after re-reading the bottom bit of that page, where I saw that. Thank you for the confirmation. Hopefully I can get this into a state where I can recover the data; then the plan is to replace those controllers and rebuild.

 

 

Link to comment
