mike_j1 Posted August 5, 2020

Hey guys,

So today I got some worrying notifications. First I got a warning that three of my disks have read errors. I have never gotten that before, and now three? A minute later I got an error that disk three is in an error state (disk dsbl). When I got home I took the array offline so there is no risk of more disks dying now that the array is vulnerable (I hope this wasn't a bad decision), and while shutting it down I got another warning, this time about the parity disk: "Spin retry count (failing now) is 14". I really don't know what the best way forward would be right now, so I'm asking for your help! Attached you'll find a diagnostics zip file.

Best regards
Mike

nas-diagnostics-20200805-1802.zip

Edited August 5, 2020 by mike_j1
JorgeB Posted August 5, 2020

Unfortunately the log is being spammed with spin down error messages. It looks like disk2 isn't spinning down; best to set it to never so at least it won't spam the log, because with all that spam the actual errors are not visible. The disabled disk looks healthy, and errors on multiple disks suggest another issue. Make sure the emulated disk is mounting correctly and that the data looks correct, then rebuild on top. If it happens again post new diags, and don't forget to disable spin down for disk2.
mike_j1 Posted August 5, 2020 (Author)

First of all, thanks a lot for your quick answer! So disk two definitely has a problem with not being able to spin down? So right now I unassign disk 3, restart the array and rebuild, correct? Also, should I be worried about the spin retry count thing?
JorgeB Posted August 5, 2020

10 minutes ago, mike_j1 said: So disk two definitely has a problem with not being able to spin down?

Might not be a problem, or at least might not affect normal usage, but it is strange.

11 minutes ago, mike_j1 said: So right now I unassign disk 3, restart the array and rebuild, correct?

First start the array with the disk disabled/unassigned and check that the emulated disk is mounting correctly and the data looks OK. If yes, rebuild on top.

13 minutes ago, mike_j1 said: Also, should I be worried about the spin retry count thing?

Never a good sign, but I've seen that attribute return to normal on its own after a few power cycles. If it doesn't, consider replacing the disk.
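For anyone who wants to keep an eye on attributes like the spin retry count themselves, here is a minimal sketch in Python. It assumes the attribute table has the column layout printed by `smartctl -A`; the `SAMPLE` text and the `failed_attributes` helper are hypothetical and only for illustration, they are not taken from the diagnostics in this thread.

```python
# Sketch: list SMART attributes whose WHEN_FAILED column is not "-".
# SAMPLE is hypothetical text in the `smartctl -A` table layout,
# not data from the diagnostics posted in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0033   014   014   097    Pre-fail  Always   FAILING_NOW 86
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
"""


def failed_attributes(smart_table: str) -> list[tuple[str, int, int, str]]:
    """Return (name, value, threshold, when_failed) for every flagged row."""
    failed = []
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value = int(fields[3])       # normalized VALUE column
        threshold = int(fields[5])   # THRESH column
        when_failed = fields[8]      # "-" when the attribute has never failed
        if when_failed != "-":
            failed.append((name, value, threshold, when_failed))
    return failed


if __name__ == "__main__":
    for name, value, threshold, when in failed_attributes(SAMPLE):
        print(f"{name}: value {value} vs threshold {threshold} ({when})")
```

A flagged attribute is a prompt to look closer rather than a verdict on the disk, which is in line with the advice above to watch whether the value recovers after a few power cycles.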
mike_j1 Posted September 10, 2020 (Author)

Hello again,

Everything had been working fine since my last post, but now a different disk has entered an error state. Could someone please take a look at the attached log and tell me whether I need to replace the disk, or whether it still looks fine? Also, how do I tell this myself so I don't have to bother you guys every time I have a problem? Last time, after the rebuild, I swapped my power supply because I suspected it was faulty, and the spin retry count error and the spin down errors went away.

Thanks in advance!

nas-diagnostics-20200910-1927.zip
JorgeB Posted September 10, 2020

Disk looks fine. There were issues with multiple disks simultaneously, and Intel SATA controllers are usually rock solid, so most likely either the board/controller is failing or you have a power problem.
mike_j1 Posted September 10, 2020 (Author)

Thanks for taking a look! So a new motherboard is the way to go, since I already replaced the power supply? Maybe the internal SATA power delivery is not strong enough, since this board is powered through a DC connector. I think the problems started as soon as I added the fourth disk, so that would also point to a power problem.
JorgeB Posted September 10, 2020

3 minutes ago, mike_j1 said: Maybe the internal SATA power delivery is not strong enough, since this board is powered through a DC connector.

It's possible, or the power supply isn't strong enough on the 12 V line. It could also be the board, but I would consider that less likely.
S80_UK Posted September 11, 2020

The log says Asus Q170T, right? Are you using an external 19 volt supply? If so, then it will be deriving the 12 volts for all of your drives via an on-board regulator, and I suspect that this regulator may be the limiting factor when connected to multiple drives. It would not surprise me if the drives showed strange errors from time to time. Another possibility could be the fact that all the drives get their power from that board via a single SATA power connector, which could also cause disk power issues. I have a very similar Intel board, but not in an Unraid application, and I decided to only lightly load the power supply for this very reason. These kinds of boards are great for small systems, but I feel they are probably at their limits with multiple drives connected.
mike_j1 Posted September 15, 2020 (Author)

On 9/11/2020 at 2:34 AM, S80_UK said: The log says Asus Q170T, right? Are you using an external 19 volt supply? If so, then it will be deriving the 12 volts for all of your drives via an on-board regulator, and I suspect that this regulator may be the limiting factor when connected to multiple drives. It would not surprise me if the drives showed strange errors from time to time. Another possibility could be the fact that all the drives get their power from that board via a single SATA power connector, which could also cause disk power issues. I have a very similar Intel board, but not in an Unraid application, and I decided to only lightly load the power supply for this very reason. These kinds of boards are great for small systems, but I feel they are probably at their limits with multiple drives connected.

Yes, correct. That's exactly what I am thinking. It's a 12 V 120 W power brick, but there is probably still some regulation going on, and it probably limits the amperage through that one connector. I can't find any specs on how much I'm allowed to draw from this connector, though... I will probably just look into buying a new board and CPU. Ryzen is fine with Unraid, right?
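The power question above comes down to simple arithmetic, which a rough sketch in Python can make concrete. Only the 12 V / 120 W brick rating comes from the thread. The per-drive spin-up current and the connector limit below are placeholder assumptions, since no spec for the connector was found; swap in real figures from the drive and board datasheets before drawing any conclusion.

```python
# Sketch: compare worst-case 12 V spin-up current against a connector limit.
BRICK_WATTS = 120.0   # from the thread: 12 V / 120 W power brick
RAIL_VOLTS = 12.0

# Assumptions for illustration only, NOT measured or documented values:
SPINUP_AMPS_PER_DRIVE = 2.0   # placeholder peak 12 V draw of one drive at spin-up
CONNECTOR_LIMIT_AMPS = 6.0    # hypothetical limit of the single SATA power connector


def brick_total_amps() -> float:
    """Total current the brick can supply, shared by CPU, board and drives."""
    return BRICK_WATTS / RAIL_VOLTS


def peak_spinup_amps(drives: int, amps_per_drive: float = SPINUP_AMPS_PER_DRIVE) -> float:
    """Worst case: every drive spins up at the same moment."""
    return drives * amps_per_drive


def within_limit(drives: int, limit_amps: float = CONNECTOR_LIMIT_AMPS) -> bool:
    """True if the simultaneous spin-up current stays at or below the limit."""
    return peak_spinup_amps(drives) <= limit_amps


if __name__ == "__main__":
    print(f"Brick can supply {brick_total_amps():.1f} A in total")
    for n in range(1, 5):
        status = "ok" if within_limit(n) else "over the assumed limit"
        print(f"{n} drive(s): {peak_spinup_amps(n):.1f} A at spin-up, {status}")
```

With these placeholder numbers the fourth drive tips the connector over its assumed limit, but that result follows entirely from the assumptions and is not evidence about this particular board.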
mike_j1 Posted September 15, 2020 (Author)

I tried to do the rebuild on top like last time, but now the array won't even start... Something about my cache drives? I'm getting kind of desperate right now...

nas-diagnostics-20200915-2041.zip

Edited September 15, 2020 by mike_j1
trurl Posted September 15, 2020

Looks like cache corruption has caused docker.img corruption. If you think you have power issues you should deal with those before trying to do anything else.
mike_j1 Posted September 15, 2020 (Author)

Thanks for the reply! Since I already changed the power supply, my next step would be to change the mainboard, so I thought it would be best to get everything back into a working state before I do that. Can I even change the mainboard in this state?

Edited September 15, 2020 by mike_j1
trurl Posted September 15, 2020

6 minutes ago, mike_j1 said: Can I even change the mainboard in this state?

As long as the disks stay the same, everything important about your configuration is on flash and should work on different hardware, since Unraid figures out the hardware it is on each time it boots.
mike_j1 Posted September 21, 2020 (Author)

Okay, I think I'm getting somewhere now. I don't know where yet, but hey. I added a separate PSU for the hard drives and now the system seems to run stable. Yay 🙂 I was able to rebuild the array and all my data is fine.

The next step was to try to re-enable Docker, aaaand the web interface got really unresponsive and I was not able to shut the system down. My next thought was to rebuild the docker.img following this guide:

So I deleted the old img and then went ahead and re-enabled Docker. Same behavior: unresponsive web interface and not able to shut the system down. I was able to get a diagnostics zip using the command line (attached). It looks like one of the cache drives is done for? How do I go about fixing this? Do I just remove it from the cache pool?

Thanks in advance, all your help is much appreciated!

nas-diagnostics-20200920-2200.zip
JorgeB Posted September 21, 2020

Problems with one of the cache devices:

Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: timing out command, waited 360s
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : 0x4 [current]
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 ASC=0x44 ASCQ=0x0
Sep 20 21:56:52 NAS kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
Sep 20 21:56:52 NAS kernel: print_req_error: I/O error, dev sda, sector 0
Sep 20 21:56:52 NAS kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 16, rd 0, flush 17, corrupt 0, gen 0

USB is not recommended for cache or array devices.
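The last line of that log excerpt carries the per-device BTRFS error counters (write, read, flush, corruption, generation). As a minimal sketch, assuming syslog lines in exactly the format quoted above, they could be pulled out with Python like this; the `btrfs_error_counters` helper is a made-up name for illustration, not part of any Unraid or btrfs tooling.

```python
import re

# The BTRFS line quoted from the diagnostics above.
LOG_LINE = (
    "Sep 20 21:56:52 NAS kernel: BTRFS error (device sda1): "
    "bdev /dev/sda1 errs: wr 16, rd 0, flush 17, corrupt 0, gen 0"
)

# Matches the "bdev <device> errs: ..." part of the kernel message.
PATTERN = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def btrfs_error_counters(line: str):
    """Return (device, counters) for a BTRFS error line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    counters = {key: int(val) for key, val in match.groupdict().items() if key != "dev"}
    return match.group("dev"), counters


if __name__ == "__main__":
    device, counters = btrfs_error_counters(LOG_LINE)
    nonzero = {name: count for name, count in counters.items() if count}
    print(device, nonzero)
```

Non-zero write and flush counters with zero read and corruption counters, as in this line, fit a device that stopped accepting commands rather than one returning bad data, which matches the timeouts logged just before it.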
mike_j1 Posted September 21, 2020 (Author)

Okay, good to know, but it did work just fine for over a year. I should still be able to rescue my data, since it is a cache pool, right?

Edited September 21, 2020 by mike_j1
JorgeB Posted September 21, 2020

Yes, if it's a redundant pool. It can also be the SSD failing, but there's no SMART report to check in the diags.
mike_j1 Posted September 21, 2020 (Author)

I think I got the report:

nas-smart-20200921-2005.zip
JorgeB Posted September 21, 2020

There are a lot of reported uncorrectable errors. Run an extended SMART test, though they are not always reliable on flash devices.
mike_j1 Posted September 21, 2020 (Author)

The extended test didn't work, I think. It took some time, and after reaching 100% it still says 'No self-tests logged on this disk'.
JorgeB Posted September 22, 2020

SMART tests usually count down from 100% to 0.
mike_j1 Posted September 22, 2020 (Author)

It starts at 10% and then counts up in 10% steps, I don't know what to tell you... But that's beside the point. I probably need to somehow get the data off the drives and replace the dead one, right? What's the best way to do so?

Thanks
JorgeB Posted September 23, 2020

If the pool is redundant you just need to remove that device, then assign another one later.