Weird random Drive Dropping.

bradtwo · November 17, 2018

First, I want to acknowledge the above-and-beyond support you guys have given me over the years. Thank you so much.

First, here is a quick overview of my setup.

Unraid Plus

6-Core AMD6300 / 16GB ECC DDR3

12 Drives (11 - HDD, 1 Cache).

Configured with Dual Parity (4TB).

2 - SATA/PCIE controllers Installed.

So, here are the issues I am having with my Unraid. Hopefully I can fully describe what is going on.

I first started seeing drives randomly becoming unavailable in my rig. This would, of course, cause data to become unavailable (very worrisome). At first, I could take the server offline and re-add the drive. The parity would begin its process of rebuilding, no issue. Things would be great for about a month, then another drive would go offline.

I then started seeing an issue with one failed drive (new WD Blue 4TB). Bummer, it was about two years old, but that happens. I then swapped it out with a new one, which proceeded to fail or report back failed/unavailable within 3 months (from memory). The two drives were connected directly to my MB's SATA ports. I moved them to the controller and re-added them. Issue solved.

Most recently, I am seeing more than 2 drives failed at the same time, and my data is now unavailable. I was able to move one of the failed drives to the SATA/PCIe controller (which is now maxed out) and re-add it.

Parity Sync / Data Rebuild: it stated it would take about 1 day to complete (bummer, but I am OK with that). When I check now, the rebuild is running at 18 KB/s and the estimated time to complete is about 344 days. It is stuck at 50.0%.

I am up to date with all the settings on my Unraid. When I go to System Log, my browser doesn't pull anything; it just spins indefinitely.

So... what the heck is going on?

Edited November 17, 2018 by bradtwo
I wrote Parity rebuild not Data rebuild.

bradtwo · November 17, 2018

Update 01: I was able to pull this information under "HISTORY". The drive that is reporting back all the errors is a new drive.

| Date | Duration | Speed | Status | Errors |
|---|---|---|---|---|
| 2018-11-16, 12:29:42 | 15 hr, 49 min, 59 sec | 70.2 MB/s | OK | 39708657 |
| 2018-11-01, 14:30:19 | 14 hr, 30 min, 18 sec | 76.6 MB/s | OK | 0 |
| 2018-10-01, 14:12:31 | 14 hr, 12 min, 30 sec | 78.2 MB/s | OK | 0 |
| 2018-09-01, 14:28:44 | 14 hr, 28 min, 43 sec | 76.8 MB/s | OK | 19 |
| 2018-08-17, 10:08:40 | 1 day, 13 hr, 36 min | 29.6 MB/s | OK | 0 |
| 2018-08-08, 16:12:04 | 20 hr, 58 min, 33 sec | 53.0 MB/s | OK | 20 |
| 2018-08-05, 12:36:35 | 20 hr, 41 min, 57 sec | 53.7 MB/s | OK | 1 |
| 2018-08-02, 15:17:48 | 1 day, 15 hr, 17 min, 47 sec | 28.3 MB/s | OK | 185 |
| 2018-07-16, 13:55:37 | 17 hr, 57 min, 37 sec | 61.9 MB/s | OK | 0 |
| 2018-07-13, 18:04:04 | 22 hr, 14 min, 9 sec | 50.0 MB/s | OK | 0 |
| 2018-07-01, 15:31:56 | 15 hr, 31 min, 55 sec | 71.6 MB/s | OK | 0 |
| 2018-06-01, 19:31:33 | 19 hr, 31 min, 32 sec | 56.9 MB/s | OK | 0 |
| 2018-05-31, 18:20:28 | 20 hr, 56 min, 31 sec | 53.1 MB/s | OK | 0 |
| 2018-05-30, 21:18:47 | 1 min, 18 sec | Unavailable | Canceled | 0 |
| 2018-05-27, 22:09:40 | 23 hr, 30 min, 11 sec | 47.3 MB/s | OK | 0 |
| 2018-05-01, 18:30:09 | 18 hr, 30 min, 8 sec | 60.1 MB/s | OK | 0 |
| 2018-04-18, 05:05:53 | 7 hr, 46 min, 13 sec | 143.0 MB/s | OK | 0 |
| 2018-04-17, 15:14:56 | 19 hr, 26 min, 32 sec | 57.2 MB/s | OK | 0 |
| 2018-04-08, 10:36:56 | 15 hr, 1 min, 45 sec | 73.9 MB/s | OK | 0 |
| 2018-04-03, 15:03:07 | 19 hr, 43 min, 25 sec | 56.3 MB/s | OK | 0 |
| 2018-04-01, 17:06:13 | 17 hr, 6 min, 12 sec | 65.0 MB/s | OK | 0 |
| 2018-03-21, 08:23:15 | 14 hr, 16 min, 20 sec | 77.9 MB/s | OK | 0 |
| 2018-03-01, 13:42:00 | 13 hr, 41 min, 59 sec | 81.1 MB/s | OK | 0 |
| 2018-02-01, 14:04:08 | 14 hr, 4 min, 7 sec | 79.0 MB/s | OK | 0 |
| 2018-01-01, 14:27:42 | 14 hr, 27 min, 41 sec | 76.8 MB/s | OK | 0 |
| 2017-12-01, 13:20:42 | 13 hr, 20 min, 41 sec | 83.3 MB/s | OK | 0 |
| 2017-11-01, 13:23:18 | 13 hr, 23 min, 17 sec | 83.0 MB/s | OK | 512 |
| 2017-10-15, 12:54:55 | 1 day, 10 hr, 32 min, 15 sec | 32.2 MB/s | OK | 512 |
| 2017-10-07, 15:52:42 | 16 hr, 35 min, 57 sec | 67.0 MB/s | OK | 0 |
| 2017-10-01, 13:59:14 | 13 hr, 59 min, 13 sec | 79.5 MB/s | OK | 0 |
| 2017-09-01, 12:36:15 | 12 hr, 36 min, 14 sec | 88.2 MB/s | OK | 0 |
| 2017-08-05, 19:32:02 | 8 hr, 3 min, 11 sec | 138.0 MB/s | OK | 0 |
| 2017-08-05, 06:54:18 | 13 hr, 19 min, 13 sec | 83.4 MB/s | OK | 0 |
| 2017-08-01, 13:30:15 | 13 hr, 30 min, 14 sec | 82.3 MB/s | OK | 0 |
| 2017-07-01, 13:46:02 | 13 hr, 46 min, 1 sec | 80.7 MB/s | OK | 0 |
| 2017-06-01, 12:36:39 | 12 hr, 36 min, 38 sec | 88.1 MB/s | OK | 0 |
| 2017-05-01, 12:27:32 | 12 hr, 27 min, 31 sec | 89.2 MB/s | OK | 0 |
| 2017-04-15, 07:37:36 | 12 hr, 44 min, 51 sec | 87.2 MB/s | OK | 0 |
| 2017-04-01, 11:53:54 | 11 hr, 53 min, 53 sec | 93.4 MB/s | OK | 0 |
| 2017-03-18, 05:59:35 | 12 hr, 36 min, 2 sec | 88.2 MB/s | OK | 0 |
| 2017-03-01, 12:37:10 | 12 hr, 37 min, 9 sec | 88.1 MB/s | OK | |
| 2017-02-01, 12:30:10 | 12 hr, 30 min, 9 sec | 88.9 MB/s | OK | |
| 2017-01-01, 13:03:57 | 13 hr, 3 min, 56 sec | 85.1 MB/s | OK | |
| 2016-12-31, 07:06:07 | 13 hr, 29 min, 6 sec | 82.4 MB/s | OK | |
| Dec, 01, 12:50:54 | 12 hr, 50 min, 53 sec | 86.5 MB/s | OK | |
| Nov, 19, 17:54:23 | Unavailable | Unavailable | Canceled | |
| Nov, 01, 00:30:28 | 12 hr, 9 min, 32 sec | 91.4 MB/s | OK | |
| Oct, 01, 13:31:55 | 13 hr, 31 min, 53 sec | 82.1 MB/s | OK | |
| Sep, 09, 08:38:26 | 14 hr, 37 min, 6 sec | 76.0 MB/s | OK | |
| Sep, 05, 11:42:14 | 15 hr, 2 min, 24 sec | 73.9 MB/s | OK | |
| Sep, 01, 12:30:41 | 12 hr, 30 min, 39 sec | 88.8 MB/s | OK | |
| Jul, 14, 21:31:49 | 12 hr, 12 min, 26 sec | 91.0 MB/s | OK | |
| Jul, 01, 12:20:29 | 12 hr, 20 min, 27 sec | 90.1 MB/s | OK | |
| Jun, 25, 10:39:40 | 12 hr, 17 min, 41 sec | 90.4 MB/s | OK | |
| Jun, 17, 21:19:18 | 12 hr, 42 min, 28 sec | 87.5 MB/s | OK | |
| Jun, 17, 08:05:54 | 14 hr, 53 min, 48 sec | 74.6 MB/s | OK | |
| Jun, 14, 19:56:35 | 18 hr, 24 min, 18 sec | 60.4 MB/s | OK | |
| Jun, 13, 14:47:56 | Unavailable | Unavailable | Canceled | |
| Jun, 01, 00:41:18 | Unavailable | Unavailable | Canceled | |
| May, 06, 07:20:56 | 11 hr, 54 min, 14 sec | 93.4 MB/s | OK | |
| May, 01, 11:47:52 | 11 hr, 47 min, 50 sec | 94.2 MB/s | OK | |
| Apr, 01, 12:38:51 | 12 hr, 38 min, 49 sec | 87.9 MB/s | OK | |
| Mar, 01, 11:58:07 | 11 hr, 58 min, 5 sec | 92.9 MB/s | OK | |
| Feb, 10, 06:19:30 | 12 hr, 4 min, 18 sec | 92.1 MB/s | OK | |
| Feb, 06, 08:31:49 | 12 hr, 8 min, 24 sec | 91.5 MB/s | OK | |
| Feb, 01, 13:04:23 | 13 hr, 4 min, 21 sec | 85.0 MB/s | OK | |
| Jan, 29, 09:57:17 | 12 hr, 31 min | 88.8 MB/s | OK | |
| Jan, 25, 18:01:57 | Unavailable | Unavailable | Canceled | |
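
A quick way to sanity-check the history entries is to multiply each run's duration by its average speed: for a run that covered the whole parity drive, the product should land near the 4TB parity size from the first post. Below is a minimal Python sketch of that arithmetic. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), and the helper names are made up for illustration; the three sample rows are copied from the history above.

```python
import re

# Seconds per unit, for the unit labels used in the history entries.
UNIT_SECONDS = {"day": 86400, "hr": 3600, "min": 60, "sec": 1}

def duration_to_seconds(text):
    """Convert a string such as '15 hr, 49 min, 59 sec' to seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*(day|hr|min|sec)", text):
        total += int(amount) * UNIT_SECONDS[unit]
    return total

def data_read_tb(duration, speed_mb_s):
    """Approximate data covered by a run, in decimal terabytes."""
    return duration_to_seconds(duration) * speed_mb_s * 1e6 / 1e12

# Rows copied from the history (date, duration, average speed in MB/s).
rows = [
    ("2018-11-16", "15 hr, 49 min, 59 sec", 70.2),
    ("2018-08-17", "1 day, 13 hr, 36 min", 29.6),
    ("2018-04-18", "7 hr, 46 min, 13 sec", 143.0),
]

for date, duration, speed in rows:
    print(date, round(data_read_tb(duration, speed), 2), "TB")
```

All three sample runs work out to roughly 4.0 TB, so the very different durations reflect different average speeds rather than different amounts of data.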

Hoopster · November 17, 2018

4 minutes ago, bradtwo said:

So... what the heck is going on?

Post diagnostics. With something like this, the experts will want to see diagnostics.

bradtwo · November 17, 2018

What specific Diagnostic information are you asking for?

Hoopster · November 17, 2018

2 minutes ago, bradtwo said:

What specific Diagnostic information are you asking for?

The complete diagnostics unRAID compiles and downloads when you go to Tools --> Diagnostics in the GUI.

bradtwo · November 17, 2018

Thanks!

I forgot about this option. I was looking only at the Log, which won't pull.

Attaching Diagnostic .Zip.

diagnostics-20181117-1559.zip

Hoopster · November 17, 2018

33 minutes ago, bradtwo said:

So... what the heck is going on?

I am not an expert on disk issues (that would be @johnnie.black), but I do see an explanation for your disks randomly dropping: you have a Marvell SATA controller on your MB and, it appears, on your PCIe SATA controllers as well. This is not good, as they are known to drop disks with unRAID. Any disk connected to SATA ports (whether those are MB ports or a PCIe HBA) controlled by a Marvell chipset may disappear for no apparent reason. They could be fine for a long time and then just go bye-bye until a reboot.

02:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 12)
Subsystem: Gigabyte Technology Co., Ltd 88SE9172 SATA 6Gb/s Controller [1458:b000]

07:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9215] (rev 11)
   Subsystem: Marvell Technology Group Ltd. Device [1b4b:9215]

08:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9215] (rev 11)
   Subsystem: Marvell Technology Group Ltd. Device [1b4b:9215]

Unfortunately, you are playing Russian Roulette with disks attached to these controllers.
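
For anyone who wants to run the same check on their own box, here is a rough Python sketch that picks Marvell SATA controllers out of a listing in the format quoted above (the kind `lspci -nn` prints, with numeric IDs in brackets). It relies only on the vendor ID `1b4b` that appears in those lines; the function name is hypothetical and the sample text is simply the three controller lines from this post.

```python
import re

MARVELL_VENDOR_ID = "1b4b"  # vendor ID shown in the listing quoted in this post

# The three controller lines from the diagnostics, subsystem lines omitted.
LSPCI_TEXT = """\
02:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 12)
07:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9215] (rev 11)
08:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9215] (rev 11)
"""

# Slot, device class, then the [vendor:device] pair near the end of the line.
LINE = re.compile(
    r"^(?P<slot>\S+) SATA controller \[0106\]: "
    r".*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

def marvell_sata_controllers(lspci_text):
    """Return (slot, device_id) for every Marvell SATA controller line."""
    found = []
    for line in lspci_text.splitlines():
        match = LINE.match(line)
        if match and match.group("vendor") == MARVELL_VENDOR_ID:
            found.append((match.group("slot"), match.group("device")))
    return found

for slot, device in marvell_sata_controllers(LSPCI_TEXT):
    print(f"Marvell SATA controller at {slot} (device {device})")
```

Indented "Subsystem" lines are skipped because they do not start with a PCI slot followed by "SATA controller".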

For peace of mind, the experts here recommend LSI chipset-based controllers flashed to IT mode. There are many recommendations in these forums, but they include models such as LSI 9207-8i, LSI 9211-8i, IBM M1015, Dell PERC H200/H310 and others.

You may also wish to verify that your SATA cables and connections are all good.

As to your disk failure issues, I defer to the experts to diagnose the issues and the solutions.

bradtwo · November 17, 2018

Dude you're awesome. I'm going to take that first step. The Controller I picked was one that was on the original "approved hardware" list back in the day.

So my follow-up question is this: in the current state, what would be the risk of taking the opportunity to migrate this over to all new hardware[1]? Honestly, my main concern is (of course) data loss (which is why I went with 2 parity drives).

As it currently sits, I would have to wait 344 days before the parity check completes. Could I stop this now[2]? I'm not sure I have the patience to wait that long to use my setup. (422 days until completion)

What is your recommended course of action for me [3]?

Another question (sorry). I read this post here.

Would this be a potential temporary fix[4]?

Basically, I would be OK with limping by for now and then rebuilding this setup completely.

Edited November 17, 2018 by bradtwo
adding in days until completion.

Hoopster · November 17, 2018

25 minutes ago, bradtwo said:

Dude you're awesome. I'm going to take that first step. The Controller I picked was one that was on the original "approved hardware" list back in the day.

True. With unRAID 5 and prior versions, Marvell-based controllers were fine. I had one, and it worked without issues. However, since unRAID 6.x, Marvell controllers have had issues with the newer Linux kernels, drivers, etc.

25 minutes ago, bradtwo said:

So my follow-up question is this: in the current state, what would be the risk of taking the opportunity to migrate this over to all new hardware[1]? Honestly, my main concern is (of course) data loss (which is why I went with 2 parity drives).

As it currently sits, I would have to wait 344 days before the parity check completes. Could I stop this now[2]? I'm not sure I have the patience to wait that long to use my setup.

There is no danger in cancelling the parity check. Certainly something is amiss. Your "failed" drives may not have actually been failed drives. If you have controller and/or SATA cable issues, that could account for the problems. UnRAID takes a disk offline when writes fail. That could be a drive issue, but, it could also be a controller or cable issue.

As far as how to best proceed with hardware upgrades and restoring the health of your array, I do not want to give any erroneous advice and I am not the expert in these matters. Hopefully, someone else will chime in on the best procedure.

I do think your system is recoverable and you likely have not lost any data. The first thing I would do is replace the SATA controllers and get one of the ones I mentioned. There are many of them available on eBay. With the problems you currently have, there is no telling what can be attributed to those controllers.

After your disks are "stable" you can consider other hardware upgrades.

Edited November 17, 2018 by Hoopster

bradtwo · November 17, 2018

Thank you for all of your advice.

I did make a mistake in my comment, which I caught just now when I read yours.

It states: Parity Sync / Data Rebuild 50.0% (this is where it has been stuck all day), with the estimated completion time climbing by days at 13 KB/s. Originally, when this started, it stated it would take 1 day to complete (estimated time) and was proceeding at 50 MB/s (if I recall). When I woke up this morning, I saw the updated time.

I will edit the comment above. Again, this is my error for mislabeling.

JorgeB · November 18, 2018

I agree you should get rid of the Marvell controllers, but the array is in somewhat of a mess and you might lose some data: two disks are disabled, disk7 dropped offline (hopefully it's healthy), and parity2 is failing.

bradtwo · November 18, 2018

Modifying this comment to help avoid any confusion.

AGAIN! Thank you. I absolutely love this community. I suppose moving forward I am going to invest in some stand-alone drives outside of unRaid for additional backup in case of future unplanned incompatibility issues. I did not lose anything important in this incident. My family photos and documents are very well backed up outside of unRaid.

Yesterday my count on the repair got up to 600+ days, with the rebuild running at 3 KB/s. I decided to restart the machine. It came back, began the process again, and completed it within 1 day, returning to no errors. I have taken the advice to heart and ordered the LSI SAS 9201-8i HBA controllers, which supposedly work out of the box. This will save me a few steps on flashing firmware for IT mode.

Now, all drives seem to be back to normal. One error persists outside of that.

My /media/movies folder seems to be completely empty. Originally it had 1,000+ objects in it. Navigating manually through the web interface to the share itself also shows 0 items. However (big catch), going to the individual drives (Disk 1 > media/movies), I see a portion of the items there.

So it appears that the /media/movies share is somehow broken, but the files still remain on the disks themselves.

Does anyone know how to recover this?

My next go-to would be to browse to them manually via a terminal and copy them to a folder, then rebuild that way.

Edited November 18, 2018 by bradtwo
grammar is hard

JorgeB · November 19, 2018

Post current diags after trying to access that share.

itimpi · November 19, 2018

17 hours ago, bradtwo said:

Modifying this comment to help avoid any confusion.

AGAIN! Thank you. I absolutely love this community. I suppose moving forward I am going to invest in some stand-alone drives outside of unRaid for additional backup in case of future unplanned incompatibility issues. I did not lose anything important in this incident. My family photos and documents are very well backed up outside of unRaid.

Yesterday my count on the repair got up to 600+ days, with the rebuild running at 3 KB/s. I decided to restart the machine. It came back, began the process again, and completed it within 1 day, returning to no errors. I have taken the advice to heart and ordered the LSI SAS 9201-8i HBA controllers, which supposedly work out of the box. This will save me a few steps on flashing firmware for IT mode.

Now, all drives seem to be back to normal. One error persists outside of that.

My /media/movies folder seems to be completely empty. Originally it had 1,000+ objects in it. Navigating manually through the web interface to the share itself also shows 0 items. However (big catch), going to the individual drives (Disk 1 > media/movies), I see a portion of the items there.

So it appears that the /media/movies share is somehow broken, but the files still remain on the disks themselves.

Does anyone know how to recover this?

My next go-to would be to browse to them manually via a terminal and copy them to a folder, then rebuild that way.

You might want to check that you do not have folders on the disks with the same spelling but different capitalization (e.g. 'movies' and 'Movies'). Under Linux, names are case sensitive. If you end up with two folders that differ only in capitalization, you can get the symptoms you describe, as Samba will choose one of them and ignore the other.
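
The capitalization check described above can be scripted instead of eyeballed. This is only a sketch in Python: it assumes the array disks are mounted at paths like `/mnt/disk1`, `/mnt/disk2` and so on, and that the folder of interest sits under `media` on each disk, so adjust both to match your system. The function name is made up for illustration.

```python
import os
from collections import defaultdict

def case_conflicts(disk_roots, relative_dir=""):
    """Map lowercased folder name -> spellings, for names differing only by case.

    Looks at the sub-folders of relative_dir on every disk root and reports
    only those names that appear with more than one capitalization.
    """
    spellings = defaultdict(set)
    for root in disk_roots:
        base = os.path.join(root, relative_dir)
        if not os.path.isdir(base):
            continue  # this disk has no copy of the folder
        for entry in os.scandir(base):
            if entry.is_dir():
                spellings[entry.name.lower()].add(entry.name)
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}

if __name__ == "__main__":
    # Assumed mount points; extend the range to cover your own disks.
    disks = [f"/mnt/disk{n}" for n in range(1, 12)]
    for key, names in case_conflicts(disks, "media").items():
        print(f"Conflicting capitalization for '{key}': {names}")
```

An empty result means no two disks hold the same folder name with different capitalization at that level, so the cause lies elsewhere.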

bradtwo · November 19, 2018

I agree, Linux is case sensitive.

The shares have the correct capitalization and worked for a very long time, up until my most recent issue (read above).

Now I can see the folders via the Unraid interface by going to the disks individually, but when I go to the share it is empty.

Edited November 19, 2018 by bradtwo

trurl · November 19, 2018

12 hours ago, johnnie.black said:

Post current diags after trying to access that share.

bradtwo · November 22, 2018

Here is the updated diagnostic.

I attempted to access the share via a Windows computer and via the Unraid interface. Both report the same thing: no files available on the share.

Confirmed that Disk 1 still has media files available within that share.

cosmonaut-diagnostics-20181122-0013.zip

Edited November 22, 2018 by bradtwo
Added more information.

JorgeB · November 22, 2018

Lots of ATA errors on disk4; it might be the Marvell controller, or it might be cable related.

Both disk1 and disk9 have filesystem corruption and need to be checked.

As mentioned earlier, parity2 is failing. Since you rebuilt two disks at once, there were read errors during the rebuild, and there will be some data corruption on the rebuilt disks, possibly just on disk9, since the errors appear to be on the last sectors and disk1 is much smaller.
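
The reasoning about which rebuilt disk is affected comes down to position: a read error at a given offset during the rebuild can only corrupt a rebuilt disk that actually extends to that offset. The Python sketch below illustrates this with made-up numbers. The disk sizes and error offsets are hypothetical placeholders, not values taken from the diagnostics, and the function name is invented for the example.

```python
TB = 10**12  # decimal terabyte

def rebuilt_disks_at_risk(error_offsets, rebuilt_disk_sizes):
    """Names of rebuilt disks that extend past at least one failing offset."""
    return sorted(
        name
        for name, size in rebuilt_disk_sizes.items()
        if any(offset < size for offset in error_offsets)
    )

# Hypothetical sizes: a smaller disk1 and a larger disk9.
sizes = {"disk1": 2 * TB, "disk9": 4 * TB}

# Hypothetical read-error positions near the end of a 4 TB parity drive.
errors = [3_990 * 10**9, 3_995 * 10**9]

print(rebuilt_disks_at_risk(errors, sizes))
```

With these placeholder numbers only disk9 is listed, because disk1 ends well before the first failing offset.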

bradtwo · November 27, 2018

Sorry for the late response (holidays and everything).

What should I do for the next step? It seems that some of the data is there. How would one go about retrieving this data or rebuilding the folder enough to get it back?

JorgeB · November 27, 2018

Check the filesystem on both disks; you'll also need to replace parity2.

bradtwo · December 2, 2018

I am new to this part of the system.

To confirm: my disk9 states it must be formatted before it can be mounted. Should I do this? [1]

Some searching brought me to this page: https://wiki.unraid.net/Check_Disk_Filesystems

Is this what you were referring to when you said to check the filesystem?[2]

Additionally, reading through this, it appears there are two methods: one via the command line, the second via the web GUI after restarting in maintenance mode. Is the maintenance mode method preferred?[3]

thank you (as always).

Edited December 2, 2018 by bradtwo

trurl · December 2, 2018

6 minutes ago, bradtwo said:

Semi noob question here. From my searching I found this article on Checking Disk File System for Unraid: https://wiki.unraid.net/Check_Disk_Filesystems

The parameters it stated was " reiserfsck --fix-fixable /dev/md5 "

is this the correct method?

You should be able to do it from the webUI. Just click on the disk to get to its page. As it explains on that page, you will have to start in Maintenance mode.
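
For those who prefer the command line, here is a small Python sketch that only builds (and prints) a check command, so nothing runs by accident. It assumes array disk N is exposed as `/dev/mdN`, as in the `/dev/md5` example quoted above, and that the array has been started in Maintenance mode first. The ReiserFS flags are the standard `reiserfsck` ones; the XFS branch is an addition for disks formatted with XFS, which this thread does not confirm. The function name is hypothetical.

```python
def build_check_command(filesystem, disk_number, read_only=True):
    """Build a filesystem check command for array disk N (assumed at /dev/mdN)."""
    device = f"/dev/md{disk_number}"
    if filesystem == "reiserfs":
        # --check only reports problems; --fix-fixable repairs what it safely can.
        flag = "--check" if read_only else "--fix-fixable"
        return ["reiserfsck", flag, device]
    if filesystem == "xfs":
        # -n is xfs_repair's no-modify mode (report only).
        return ["xfs_repair", "-n", device] if read_only else ["xfs_repair", device]
    raise ValueError(f"unsupported filesystem: {filesystem}")

# Print the read-only checks for the two disks mentioned in this thread,
# assuming (not confirmed here) that they are ReiserFS.
for disk in (1, 9):
    print(" ".join(build_check_command("reiserfs", disk)))
```

Starting with the read-only form shows what the tool would report before anything on the disk is changed.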

bradtwo · December 2, 2018

VERY QUICK RESPONSE! (Thank you!)

I had just edited my post after re-reading the bottom bit, where I saw that. Thank you for the confirmation on this. Hopefully I can get this into a state where I can recover the data; then the plan is to replace those controllers and rebuild.
