Defective disks in my array?



Hey guys,

 

So today I got some worrying notifications. First I got a warning that three of my disks have read errors. I have never gotten that before, and now three at once?

A minute later I got an error that disk 3 is in an error state (disk dsbl).

 

When I got home I took the array offline so there is no risk of more disks dying now that the array is vulnerable (I hope that this wasn't a bad decision), and while shutting down I got another warning, this time about the parity disk: "Spin retry count (failing now) is 14".

 

I really don't know what the best way forward would be right now, so I'm asking for your help!
Attached you'll find a diagnostics zip file.
 

Best regards

Mike

nas-diagnostics-20200805-1802.zip


Unfortunately the log is being spammed with spin-down error messages; it looks like disk2 isn't spinning down. Best to set its spin-down to never so at least it won't spam the log; because of that spam, the actual errors are not visible.

 

The disabled disk looks healthy, and multiple disk errors suggest another issue. Make sure the emulated disk is mounting correctly and that the data looks correct, then rebuild on top. If it happens again, post new diags, and don't forget to disable spin-down for disk2.

 

10 minutes ago, mike_j1 said:

So disk two definitely has a problem with not being able to spin down?

It might not be a problem, or at least it may not affect normal usage, but it is strange.

 

11 minutes ago, mike_j1 said:

So right now I unassign disk 3, restart the array, and rebuild, correct?

First start the array with the disk disabled/unassigned and check that the emulated disk is mounting correctly and the data looks OK; if yes, rebuild on top.

 

13 minutes ago, mike_j1 said:

Also should i be worried about the spin retry count thing?

Never a good sign, but I've seen that attribute return to normal on its own after a few power cycles; if it doesn't, consider replacing the disk.
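For keeping an eye on attributes like this yourself, the raw data comes from `smartctl -A /dev/sdX` (smartmontools). Below is a minimal sketch of scanning that attribute table for anything failing; the function name and the sample output lines are illustrative only (the first sample line mirrors the "Spin retry count (failing now) is 14" warning from the first post), not output from this system.

```python
import re

# Flag SMART attributes that are failing now or have ever failed, from
# the attribute table printed by `smartctl -A /dev/sdX`.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
def failing_attributes(smartctl_output: str):
    failing = []
    for line in smartctl_output.splitlines():
        m = re.match(
            r"\s*(\d+)\s+(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)",
            line,
        )
        if not m:
            continue
        attr_id, name, value, worst, thresh, when_failed, raw = m.groups()
        # WHEN_FAILED is "-" for healthy attributes; a normalized value at
        # or below its threshold also means the attribute has failed.
        if when_failed != "-" or int(value) <= int(thresh):
            failing.append((name, when_failed, raw))
    return failing

# Illustrative sample output, not from the attached diagnostics:
sample = """\
 10 Spin_Retry_Count 0x0033 001 001 097 Pre-fail Always FAILING_NOW 14
190 Airflow_Temperature_Cel 0x0022 060 045 040 Old_age Always - 40
"""
print(failing_attributes(sample))
# → [('Spin_Retry_Count', 'FAILING_NOW', '14')]
```

Anything this reports is worth watching closely, even if it later returns to normal.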

 

 

  • 1 month later...

Hello again,

 

Everything has been working fine since my last post, but now a different disk has entered an error state.

Could someone please take a look at the attached log and tell me whether I need to replace the disk, or if it still looks fine?

Also, how do I tell this myself so I don't have to bother you guys every time I have a problem?

 

Last time, after the rebuild, I swapped my power supply because I suspected it was faulty, and the spin retry count error and the spin-down errors went away.

 

Thanks in advance!

 

 

nas-diagnostics-20200910-1927.zip


Thanks for taking a look!

So a new motherboard is the way to go, since I already replaced the power supply?

Maybe the internal SATA power delivery is not strong enough, since this board is powered through a DC connector.

I think that the problems started as soon as I added the fourth disk, so that would also point to a power problem.

 

 


The log says Asus Q170T, right? Are you using an external 19 volt supply? If so, then it will be deriving the 12 volts for all of your drives via an on-board regulator, and I suspect that this regulator may be the limiting factor when connected to multiple drives. It would not surprise me if drives showed strange errors from time to time. Another possibility is the fact that all the drives get their power from that board via a single SATA power connector, which could also cause disk power issues. I have a very similar Intel board, though not in an Unraid application, and I decided to only lightly load the power supply for this very reason. These kinds of boards are great for small systems, but I feel they are probably at their limits with multiple drives connected.
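To put rough numbers on that: the per-drive currents below are assumptions taken from typical 3.5" drive datasheets, not measurements from this system, but they show why four drives on a small DC-input board is tight.

```python
# Back-of-envelope 12 V budget for the on-board regulator feeding the
# drives. Per-drive figures are assumed typical 3.5" values, not measured.
SPINUP_A_PER_DRIVE = 2.0   # peak 12 V current at spin-up (assumed)
IDLE_A_PER_DRIVE = 0.5     # 12 V current while spinning idle (assumed)
DRIVES = 4

spinup_w = DRIVES * SPINUP_A_PER_DRIVE * 12   # worst case: all drives spin up at once
idle_w = DRIVES * IDLE_A_PER_DRIVE * 12

print(f"spin-up: ~{spinup_w:.0f} W, idle: ~{idle_w:.0f} W")
# → spin-up: ~96 W, idle: ~24 W
```

Against a 120 W brick that also has to feed the CPU, board, and RAM, a simultaneous four-drive spin-up of roughly 96 W leaves almost no headroom, which fits the "problems started with the fourth disk" observation.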

On 9/11/2020 at 2:34 AM, S80_UK said:

The log says Asus Q170T - right?  Are you using an external 19 volt supply?  If so, then it will be deriving the 12 volts for all of your drives via an on-board regulator [...]


Yes, correct.

That's exactly what I am thinking. It's a 12 V, 120 W power brick, but there is probably still some regulation going on, and it probably limits the amperage through that one connector.

I can't find any specs on how much I'm allowed to draw from this connector though...

I will probably just look into buying a new board and CPU.
Ryzen is fine with Unraid, right?


Thanks for the reply!

 

Since I already changed the power supply, my next step would be to change the mainboard, so I thought it would be best to get everything back into a working state before I do that.

Can I even change the mainboard in this state?

 


Okay, I think I'm getting somewhere now. I don't know where yet, but hey.

After I added a separate PSU for the hard drives, the system seems to run stable. Yay 🙂

I was able to rebuild the array and all my data is fine.

 

Now the next step was to try to re-enable Docker, aaaand the web interface got really unresponsive and I was not able to shut the system down.

Then my next thought was to rebuild the docker.img following this guide: 

 

 

So I deleted the old img and then went ahead and re-enabled Docker; same behavior: unresponsive web interface and not able to shut the system down.

I was able to get a diagnostics zip using the command line (attached).

It looks like one of the cache drives is done for? How do I go about fixing this? Do I just remove it from the cache pool?

 

Thanks in advance, all your help is much appreciated!

 

nas-diagnostics-20200920-2200.zip


Problems with one of the cache devices:

Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: timing out command, waited 360s
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : 0x4 [current]
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 ASC=0x44 ASCQ=0x0
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
Sep 20 21:56:52 NAS kernel: print_req_error: I/O error, dev sda, sector 0
Sep 20 21:56:52 NAS kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 16, rd 0, flush 17, corrupt 0, gen 0

 

USB is not recommended for cache or array devices.
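The error counters in that kernel line ("errs: wr 16, rd 0, flush 17, ...") are btrfs's cumulative per-device counters, which can also be read with `btrfs device stats /mnt/cache`. A minimal sketch of parsing that command's output and flagging any non-zero counter; the sample output is illustrative, with numbers matching the log line above, not taken from the attached diagnostics.

```python
import re

# Parse `btrfs device stats` output lines of the form
#   [/dev/sda1].write_io_errs   16
# and return only the non-zero counters per device.
def nonzero_stats(stats_output: str):
    bad = {}
    for line in stats_output.splitlines():
        m = re.match(r"\[(.+?)\]\.(\w+)\s+(\d+)", line.strip())
        if m:
            dev, counter, value = m.group(1), m.group(2), int(m.group(3))
            if value:
                bad.setdefault(dev, {})[counter] = value
    return bad

# Illustrative sample output, mirroring the counters in the kernel log:
sample = """\
[/dev/sda1].write_io_errs    16
[/dev/sda1].read_io_errs     0
[/dev/sda1].flush_io_errs    17
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0
"""
print(nonzero_stats(sample))
# → {'/dev/sda1': {'write_io_errs': 16, 'flush_io_errs': 17}}
```

Non-zero write or flush errors like these mean the device dropped writes, which is consistent with a USB-attached cache device timing out.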

