Defective disks in my array?

mike_j1 · August 5, 2020

Hey guys,

So today I got some worrying notifications. First I got a warning that three of my disks have read errors. I have never had that before, and now three at once?

A minute later I got an error that disk three is in an error state (disk dsbl).

When I got home I took the array offline so there is no risk of more disks dying while the array is vulnerable (I hope this wasn't a bad decision), and while it was shutting down I got another warning, this time about the parity disk: "Spin retry count (failing now) is 14".

I really don't know what the best way forward would be right now, so I'm asking for your help!
Attached you'll find a diagnostics zip file.

Best regards

Mike

nas-diagnostics-20200805-1802.zip

Edited August 5, 2020 by mike_j1

JorgeB · August 5, 2020

Unfortunately the log is being spammed with spin down error messages, so the actual errors are not visible. It looks like disk2 isn't spinning down; best to set its spin down delay to never so at least it won't spam the log.

The disabled disk looks healthy, and errors on multiple disks suggest another issue. Make sure the emulated disk is mounting correctly and that the data looks correct, then rebuild on top. If it happens again post new diags, and don't forget to disable spin down for disk2.

mike_j1 · August 5, 2020

First of all thanks a lot for your quick answer!

So disk two definitely has a problem with not being able to spin down?

So right now I unassign disk 3, restart the array and rebuild, correct?

Also, should I be worried about the spin retry count thing?

JorgeB · August 5, 2020

10 minutes ago, mike_j1 said:

So disk two definitely has a problem with not being able to spin down?

Might not be a problem, or at least not affect normal usage, but it is strange.

11 minutes ago, mike_j1 said:

So right now I unassign disk 3, restart the array and rebuild, correct?

First start the array with the disk disabled/unassigned and check that the emulated disk is mounting correctly and the data looks OK; if yes, rebuild on top.

13 minutes ago, mike_j1 said:

Also, should I be worried about the spin retry count thing?

Never a good sign, but I've seen that attribute return to normal on its own after a few power cycles. If it doesn't, consider replacing the disk.

mike_j1 · September 10, 2020

Hello again,

Everything had been working fine since my last post, but now a different disk has entered an error state.

Could someone please take a look at the attached log and tell me whether I need to replace the disk, or whether it still looks fine?

Also, how do I tell this myself so I don't have to bother you guys every time I have a problem?

After the last rebuild I swapped my power supply because I suspected it was faulty, and the spin retry count error and the spin down errors went away.

Thanks in advance!

nas-diagnostics-20200910-1927.zip

JorgeB · September 10, 2020

The disk looks fine. There were issues with multiple disks simultaneously, and Intel SATA controllers are usually rock solid, so most likely either the board/controller is failing or you have a power problem.
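
One rough way to spot the "multiple disks at once" pattern yourself is to group the kernel I/O error lines in the syslog by minute and device. A minimal Python sketch, assuming syslog lines of the form "... kernel: ... I/O error, dev sdX, ..."; the function name and the sample lines are made up for illustration and are not taken from the attached diagnostics:

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Sep 20 21:56:52 NAS kernel: print_req_error: I/O error, dev sda, sector 0
IO_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:.*"
    r"I/O error, dev (?P<dev>\w+),"
)

def devices_by_minute(lines):
    """Map 'Mon DD HH:MM' -> set of devices that logged an I/O error then."""
    seen = defaultdict(set)
    for line in lines:
        m = IO_ERROR.match(line)
        if m:
            seen[m.group("stamp")].add(m.group("dev"))
    return dict(seen)

# Illustrative sample lines (hypothetical, not from the real log).
sample = [
    "Sep 10 19:01:02 NAS kernel: print_req_error: I/O error, dev sdb, sector 128",
    "Sep 10 19:01:40 NAS kernel: print_req_error: I/O error, dev sdc, sector 256",
    "Sep 10 19:05:13 NAS kernel: print_req_error: I/O error, dev sdb, sector 512",
]

for stamp, devs in devices_by_minute(sample).items():
    if len(devs) > 1:
        print(stamp, "multiple devices:", sorted(devs))
```

Errors on several devices within the same minute point more towards something shared (power, cabling, controller) than towards one bad disk, though this is only a heuristic.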

mike_j1 · September 10, 2020

Thanks for taking a look!

So a new motherboard is the way to go, since I already replaced the power supply?

Maybe the internal SATA power delivery is not strong enough, since this board is powered through a DC connector.

I think the problems started as soon as I added the fourth disk, so that would also point to a power problem.

JorgeB · September 10, 2020

3 minutes ago, mike_j1 said:

Maybe the internal SATA power delivery is not strong enough, since this board is powered through a DC connector.

It's possible, or the power supply isn't strong enough on the 12 V line. It could also be the board, but I would consider that less likely.

S80_UK · September 11, 2020

The log says Asus Q170T - right? Are you using an external 19 volt supply? If so, then it will be deriving the 12 volts for all of your drives via an on-board regulator, and I suspect that this regulator may be the limiting factor when connected to multiple drives. It would not surprise me if the drives showed strange errors from time to time. Another possibility could be the fact that all the drives get their power from that board via a single SATA power connector, which could also cause disk power issues. I have a very similar Intel board, but not in an Unraid application, and I decided to only lightly load the power supply for this very reason. These kinds of boards are great for small systems, but I feel they are probably at their limits with multiple drives connected.
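
To put rough numbers on the regulator concern: drives draw the most 12 V current while spinning up, so the worst case is all drives starting at the same time. A back-of-the-envelope sketch in Python; the per-drive spin-up current and the connector limit below are placeholder assumptions, not specs for this board or these drives, so substitute values from the datasheets:

```python
def spinup_current_12v(num_drives, amps_per_drive):
    """Worst-case 12 V current if all drives spin up at the same time."""
    return num_drives * amps_per_drive

def within_budget(num_drives, amps_per_drive, limit_amps, margin=0.8):
    """True if the simultaneous spin-up load stays under margin * limit."""
    return spinup_current_12v(num_drives, amps_per_drive) <= margin * limit_amps

# Placeholder numbers (assumptions): check the drive datasheet and board manual.
AMPS_PER_DRIVE = 2.0  # assumed peak 12 V spin-up current per 3.5" drive
LIMIT_AMPS = 8.0      # hypothetical limit of the on-board SATA power header

for n in (3, 4):
    status = "ok" if within_budget(n, AMPS_PER_DRIVE, LIMIT_AMPS) else "over budget"
    print(n, "drives:", spinup_current_12v(n, AMPS_PER_DRIVE), "A,", status)
```

The 0.8 margin is just a common rule of thumb for leaving headroom; the real limit is whatever the board's regulator and connector are rated for.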

mike_j1 · September 15, 2020

On 9/11/2020 at 2:34 AM, S80_UK said:

The log says Asus Q170T - right? Are you using an external 19 volt supply? If so, then it will be deriving the 12 volts for all of your drives via an on-board regulator, and I suspect that this regulator may be the limiting factor when connected to multiple drives. It would not surprise me if the drives showed strange errors from time to time. Another possibility could be the fact that all the drives get their power from that board via a single SATA power connector, which could also cause disk power issues. I have a very similar Intel board, but not in an Unraid application, and I decided to only lightly load the power supply for this very reason. These kinds of boards are great for small systems, but I feel they are probably at their limits with multiple drives connected.

Yes, correct.

That's exactly what I am thinking. It's a 12 V 120 W power brick, but there is probably still some regulation going on, and it probably limits the amperage through that one connector.

I can't find any specs on how much I'm allowed to draw from this connector, though...

I will probably just look into buying a new board and CPU.
Ryzen is fine with Unraid, right?

mike_j1 · September 15, 2020

I tried to do the rebuild on top like last time, but now the array won't even start...

Something about my cache drives? I'm getting kinda desperate right now...

nas-diagnostics-20200915-2041.zip

Edited September 15, 2020 by mike_j1

trurl · September 15, 2020

Looks like cache corruption has caused docker.img corruption.

If you think you have power issues you should deal with those before trying to do anything else.

mike_j1 · September 15, 2020

Thanks for the reply!

Since I already changed the power supply, my next step would be to change the mainboard, so I thought it would be best to get everything back into a working state before I do that.

Can I even change the mainboard in this state?

Edited September 15, 2020 by mike_j1

trurl · September 15, 2020

6 minutes ago, mike_j1 said:

Can I even change the mainboard in this state?

As long as the disks stay the same everything important about your configuration is on flash and should work on different hardware since Unraid figures out the hardware it is on each time it boots.

mike_j1 · September 21, 2020

Okay, I think I'm getting somewhere now. I don't know where yet, but hey.

I added a separate PSU for the hard drives and now the system seems to run stable. Yay 🙂

I was able to rebuild the array and all my data is fine.

Now the next step was to try to re-enable Docker, aaaand the web interface got really unresponsive and I was not able to shut the system down.

Then my next thought was to rebuild the docker.img following this guide:

So I deleted the old img and then went ahead and re-enabled Docker. Same behavior: unresponsive web interface and not able to shut the system down.

I was able to get a diagnostics zip using the command line (attached).

It looks like one of the cache drives is done for? How do I go about fixing this? Do I just remove it from the cache pool?

Thanks in advance, all your help is much appreciated!

nas-diagnostics-20200920-2200.zip

JorgeB · September 21, 2020

Problems with one of the cache devices:

Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: timing out command, waited 360s
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : 0x4 [current]
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 ASC=0x44 ASCQ=0x0
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
Sep 20 21:56:52 NAS kernel: print_req_error: I/O error, dev sda, sector 0
Sep 20 21:56:52 NAS kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 16, rd 0, flush 17, corrupt 0, gen 0

USB is not recommended for cache or array devices.
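
The BTRFS line in the log above carries per-device error counters (wr, rd, flush, corrupt, gen); on a healthy device they would normally all be zero. A small Python sketch that pulls those counters out of such a line; the function name is illustrative, and the sample line is the one from the log:

```python
import re

# Matches the counter part of a kernel line like:
#   BTRFS error (device sda1): bdev /dev/sda1 errs: wr 16, rd 0, flush 17, corrupt 0, gen 0
BDEV_ERRS = re.compile(r"bdev (?P<dev>\S+) errs: (?P<counters>.+)$")

def parse_btrfs_errs(line):
    """Return (device, {counter: value}) or None if the line has no counters."""
    m = BDEV_ERRS.search(line)
    if not m:
        return None
    counters = {}
    for item in m.group("counters").split(","):
        name, value = item.split()
        counters[name] = int(value)
    return m.group("dev"), counters

line = ("Sep 20 21:56:52 NAS kernel: BTRFS error (device sda1): "
        "bdev /dev/sda1 errs: wr 16, rd 0, flush 17, corrupt 0, gen 0")
dev, errs = parse_btrfs_errs(line)
nonzero = {name: count for name, count in errs.items() if count}
print(dev, nonzero)
```

On a running system the same counters can be read with `btrfs device stats` on the pool's mount point.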

mike_j1 · September 21, 2020

Okay good to know, but it did work just fine for over a year.

I should still be able to rescue my data, since it is a cache pool right?

Edited September 21, 2020 by mike_j1

JorgeB · September 21, 2020

Yes, if it's a redundant pool.

It can also be the SSD failing, but there's no SMART report to check in the diags.

mike_j1 · September 21, 2020

I think i got the report:

nas-smart-20200921-2005.zip

JorgeB · September 21, 2020

There are a lot of reported uncorrectable errors. Run an extended SMART test, though those are not always reliable on flash devices.

mike_j1 · September 21, 2020

The extended test didn't work, I think. It took some time, and after reaching 100% it still says 'No self-tests logged on this disk'.

JorgeB · September 22, 2020

SMART tests usually count down from 100% to 0.

mike_j1 · September 22, 2020

It starts at 10% and then counts up in 10% steps, I don't know what to tell you...

But that's beside the point. I probably need to somehow get the data off the drives and replace the dead one, right?

What's the best way to do so?

Thanks

JorgeB · September 23, 2020

If the pool is redundant you just need to remove that device, then assign another one later.
