Joseph Posted December 13, 2017

TL;DR Replaced power supply... hopefully that fixes it.

Hello unRAID community, I'm hoping you can help me out. My unRAID has never really been stable, but that's another matter. At the moment, I have dual parity with 2 array drives knocked offline and I have replacement drives on standby. I tried stopping the array via the GUI and now I can't access the main page on any computer. I went to the console on the unRAID box itself and saw I still had a directory path of "mnt/Disks" with nothing in it. I also got an email notification that has me worried sick now:

Quote
Event: unRAID array errors
Subject: Notice [TOWER] - array turned good
Description: Array has 0 disks with read errors
Importance: normal

I've tried to stop the array via CLI, running through a few commands and a script that Alex and Squid provided here. This is an excerpt from the syslog. I'm concerned there might be problems with a parity disk (sdq) and/or cache pool (sdd, sde):

Dec 13 09:36:07 Tower kernel: mdcmd (30): import 29 sdq 3907018532 0 ST4000VN0001
Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001 size: 3907018532
Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde
Dec 13 09:36:09 Tower emhttp: check_pool: /sbin/btrfs filesystem show 1e110a1d-e24f-4cdf-a087-31b6323b2a10 2>&1
Dec 13 09:38:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:41:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:41:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Dec 13 09:41:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a2 c0 00 00 00 08 00 00
Dec 13 09:41:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928
Dec 13 09:44:50 Tower sudo: root : TTY=pts/0 ; PWD=/boot ; USER=root ; COMMAND=/usr/bin/lsof
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a2 c0 00 00 00 01 00 00
Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928
Dec 13 09:44:50 Tower kernel: Buffer I/O error on dev sdi1, logical block 5860532864, async page read
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a2 c1 00 00 00 01 00 00
Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532929
Dec 13 09:44:50 Tower kernel: Buffer I/O error on dev sdi1, logical block 5860532865, async page read

I still can't access the GUI so I need to shut down. So, my questions are:

* If there are problems with parity and cache, what are next steps?
* What's the safest way to shut down at this point?
* How do I bring up the array with 2 failed disks if I encounter a situation where unRAID thinks parity needs to be rebuilt with 2 disks knocked offline and the contents are being emulated?

Please see the attached diags, run a few days ago, and the syslog generated today after trying to stop the array via GUI & CLI. Any help is greatly appreciated!! Thank you so much for your help!!

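The CDB lines in the excerpt can be cross-checked against the sector numbers the kernel reports. The following is a minimal sketch in Python, assuming the standard SCSI READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13). The function name `decode_read16` is made up for illustration and is not part of any unRAID tooling.

```python
def decode_read16(cdb_hex: str) -> dict:
    """Decode a READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are skipped
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    return {
        # Bytes 2..9: logical block address, big-endian
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13: number of blocks requested, big-endian
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# CDBs copied from the syslog excerpt above
first = decode_read16("88 00 00 00 00 01 5d 50 a2 c0 00 00 00 08 00 00")
last = decode_read16("88 00 00 00 00 01 5d 50 a2 c1 00 00 00 01 00 00")
print(first, last)
```

The decoded LBAs match the "sector 5860532928" and "sector 5860532929" values in the blk_update_request lines, so the failing reads all target the same small region of sdi. The "logical block" numbers reported for sdi1 are 64 lower, which would be consistent with the partition starting at sector 64; that is an inference from these numbers alone, not something confirmed by the diags.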
JorgeB Posted December 13, 2017

2 hours ago, Joseph said:
    If there are problems with parity and cache

Why do you say that? There's nothing about that in the log. Disk3 and disk5 (sdi) are disabled, and disk5 dropped offline. You'll want to check cables, reboot, and grab new diags so we can check the SMART report. To reboot, type reboot on the console.

Joseph (Author) Posted December 13, 2017

10 minutes ago, johnnie.black said:
    Why do you say that? There's nothing about that in the log. Disk3 and disk5 (sdi) are disabled, and disk5 dropped offline. You'll want to check cables, reboot, and grab new diags so we can check the SMART report. To reboot, type reboot on the console.

Bear with me, I'm a Linux newbie; I thought sdq was parity disk 2 and was concerned that today's syslog still has entries pertaining to the cache pool after trying to stop the array via CLI. Is it safe to execute the shutdown command from the CLI despite the fact "mdcmd stop" returns this error?

/usr/local/sbin/mdcmd stop
/usr/local/sbin/mdcmd: line 11: echo: write error: Invalid argument

Should I be concerned that there's still an empty mount point "/mnt/disks"? Also, does it matter that sdq shows up as disk29? (fwiw: I don't recall if I tried umount /mnt/disk29)

Sorry for all the questions... I just don't want to blow it by doing something wrong and throwing unRAID into a parity rebuild because it's currently emulating 2 of the array disks. Thank you.

JorgeB Posted December 13, 2017

These entries are normal and nothing error related:

Dec 13 09:36:07 Tower kernel: mdcmd (30): import 29 sdq 3907018532 0 ST4000VN0001-1SF178_Z4F0APX6
Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001-1SF178_Z4F0APX6 size: 3907018532
Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde

The only errors come after these and refer to sdi (disk5). I did see some timeouts on sdm (disk9), but it recovered. It's still a good idea to also check those cables, especially if it shares a power cable or adapter with disk5.

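The distinction made above, between routine import notices and real I/O errors, can be checked mechanically. Here is a rough sketch in Python that tallies kernel I/O errors per device. The sample lines are copied from the syslog excerpt in this thread; the regular expression only covers the "blk_update_request" message format seen there, and the name `count_io_errors` is hypothetical.

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpt earlier in the thread.
SAMPLE = """\
Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001 size: 3907018532
Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
Dec 13 09:41:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928
Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928
Dec 13 09:44:50 Tower kernel: Buffer I/O error on dev sdi1, logical block 5860532864, async page read
Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532929
"""

# Matches only the block-layer error format that appears in this thread.
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_text: str) -> Counter:
    """Count block-layer I/O errors per device name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


errors = count_io_errors(SAMPLE)
print(errors)
```

On this sample only sdi shows up; the sdq, sdd, and sde import lines do not match, which agrees with the reading that they are informational. Timeouts such as the ones mentioned for sdm use a different message format and would need their own pattern.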
Joseph (Author) Posted December 13, 2017

32 minutes ago, johnnie.black said:
    These entries are normal and nothing error related:
    Dec 13 09:36:07 Tower kernel: mdcmd (30): import 29 sdq 3907018532 0 ST4000VN0001-1SF178_Z4F0APX6
    Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001-1SF178_Z4F0APX6 size: 3907018532
    Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
    Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde
    The only errors come after these and refer to sdi (disk5). I did see some timeouts on sdm (disk9), but it recovered. It's still a good idea to also check those cables, especially if it shares a power cable or adapter with disk5.

ok. I gotta run some errands, but I'll run shutdown from the CLI when I return and will report back... thanks!

Joseph (Author) Posted December 14, 2017

On 12/13/2017 at 4:30 PM, Joseph said:

UPDATE: It took some time, but unRAID eventually shut down after issuing the powerdown command via CLI. I double checked power and data cables, then replaced Disk 5. The plan is to rebuild Disk 5 first, then rebuild Disk 3 (which only 'failed' because it was accidentally formatted; so I pulled it to attempt data recovery.)

FWIW, the data cables were swapped out earlier this year when I had a similar problem, and disks 3 - 7 are in a 5 bay drive cage that requires 2 power connectors that are also shared by the 2 cache SSDs that are in a separate cage.

Upon reboot, I ran diags... they're attached with a couple of photos. I'm probably overthinking this, but I just want to make sure when I start the array, it will rebuild Disk 5 and keep the contents of Disk 3 emulated. Thanks!!

JorgeB Posted December 14, 2017

1 minute ago, Joseph said:
    when I start the array, it will rebuild Disk 5 and keep the contents of Disk 3 emulated.

Correct

Joseph (Author) Posted December 14, 2017

16 minutes ago, johnnie.black said:
    Correct

ok, here we go...

Joseph (Author) Posted December 16, 2017

On 12/14/2017 at 11:58 AM, johnnie.black said:
    Correct

UPDATE: So Disk5 rebuilt from parity without incident... yay! However.... if you recall, a couple of weeks ago, I lost Disk3 because it was reformatted accidentally. That drive has since been pulled and I'm in the process of recovering whatever data I can get before sending it off for warranty replacement. Meanwhile, the contents of Disk3, which is now empty, are being emulated.

Today I noticed the mover put data on the emulated disk, and I thought as a precaution it would be a good idea to manually move the contents from the emulated disk3 to the now working disk5 via MC. At some point during this process, Disk6 got knocked offline and now its contents are being emulated. SMART is ok... and I'm not sure of next steps.

Event: unRAID array errors
Subject: Warning [TOWER] - array has errors
Description: Array has 1 disk with read errors
Importance: warning
Disk 6 - HGST_HDN726040ALE614_K4KE04SB (sdg) (errors 128)

I'm beginning to think it's the PS... but it was recently replaced. Any thoughts are greatly appreciated. Diags attached... and as always, thank you!!

Link to CoolerMaster G750M PS specs

JorgeB Posted December 16, 2017

Disk looks fine, so it's probably connection related. It could also be the power supply if it's not working correctly; it's more than adequate if it is.

SSD Posted December 17, 2017

The root cause of problems like the ones you are having normally traces back to cabling issues. I very highly recommend hot-swap style cages like the SuperMicro CSE-M35T-1B. Once installed and burned in, you can swap disks without risking nudging cables and creating the types of instability and risks of data loss you are experiencing. Dual parity may help recover in some situations, but it is not a substitute for these cages, which actually prevent the problems from happening in the first place.

I've been here a long time, and was skeptical of the need for these types of cages. I remember writing several posts debating others suggesting these cages. But I use them now on all my servers. Before, I had to be prepared to spend several hours running tests after a drive addition or exchange, to troubleshoot and ensure my drives were staying online and to resolve connection issues. Several frustrating times I found myself unplugging all the drives and connecting them one at a time to try to isolate squirrely issues. All this drama is over. I can now swap a drive out in less than 5 minutes and know that there are no side effects. I can even pull every disk from the server, move the server to a new location, and reinsert all the drives in 15-20 mins, without any issues. Priceless!

Check eBay, as the cages I referenced above are often available used at less than half the price of new. Most of my cages are used and they work great. This model is engineered extremely well and I recommend it. But there are other brands and you can decide for yourself.

Joseph (Author) Posted December 17, 2017

5 hours ago, SSD said:
    The root cause for problems like you are having normally trace back to cabling issues. I very highly recommend hot-swap style cages like the SuperMicro CSE-M35T-1B.

Thanks SSD... all of my drives are in HotSwap Trayless Cages from iStarUSA (BPN-DE series). They should be the same, no?

Joseph (Author) Posted December 17, 2017

On 12/16/2017 at 2:45 PM, johnnie.black said:
    Disk looks fine, so probably connection related, could also be the power supply if it's not working correctly, it's more than adequate if it is.

UPDATE... UGH!!! NOW WHAT!?!

So, I went through pretty much the same procedure to rebuild the contents of Disk6 back on the same Disk6 HDD. I got this in my email this morning:

Event: unRAID Disk 6 message
Subject: Notice [TOWER] - Disk 6 returned to normal operation
Description: HGST_HDN726040ALE614_K4KE04SB (sdq)
Importance: normal

then

Event: unRAID Parity sync / Data rebuild
Subject: Notice [TOWER] - Parity sync / Data rebuild finished (826448329 errors)
Description: Duration: 8 hours, 29 minutes, 15 seconds. Average speed: 130.9 MB/s
Importance: warning

Can anyone tell me if the data on Disk6 is ok? Also, what are next steps if it is not ok? Thanks.

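As a quick sanity check on the notification above, the reported duration and average speed can be multiplied out and compared with the disk size. This is only a back-of-the-envelope sketch in Python. It assumes that "MB" means 10**6 bytes and that the size 3907018532 shown in the earlier syslog import line is in 1 KiB units; both are assumptions, and the helper name `rebuild_bytes` is invented for illustration.

```python
def rebuild_bytes(hours: int, minutes: int, seconds: int, mb_per_s: float) -> float:
    """Bytes covered at the average speed, assuming 1 MB = 10**6 bytes."""
    duration_s = hours * 3600 + minutes * 60 + seconds
    return duration_s * mb_per_s * 1_000_000


# Duration and speed from the "Data rebuild finished" email
covered = rebuild_bytes(8, 29, 15, 130.9)

# Size from the syslog import line for the 4 TB parity disk,
# assumed here to be expressed in 1 KiB blocks
disk_bytes = 3907018532 * 1024

print(f"{covered / 1e12:.2f} TB covered vs {disk_bytes / 1e12:.2f} TB disk")
```

Under those assumptions both figures come out at roughly 4 TB, so the rebuild appears to have run across the whole disk. The duration therefore says nothing about whether the rebuilt data is valid; the error count in the subject line is the part that needs explaining.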
Joseph (Author) Posted December 17, 2017

UPDATE: Disk5 knocked offline again! I've stopped the array for now and will replace the power supply. At this point I don't know if the data exists any more on disk5 or disk6. Any advice would be most appreciated.... can I throw up now?

NewDisplayName Posted December 17, 2017

eh, I have no idea, but these 917893417391273189317893131293812312 writes on disk 5 don't look normal ?!?!

JorgeB Posted December 17, 2017

4 hours ago, Joseph said:
    Can anyone tell me if the data on Disk6 is ok?

It's not. Disk5 failed about one hour after the rebuild started, and with disk3 also disabled, unRAID can no longer correctly rebuild the disk.

SSD Posted December 17, 2017

I suggest running a memory test. Counts getting this large is a very strange symptom.

Joseph (Author) Posted December 18, 2017

3 hours ago, nuhll said:
    eh, i have no idea, but this 917893417391273189317893131293812312 writes on disk 5 doesnt look normal ?!?!

no, not at all.

Joseph (Author) Posted December 18, 2017

3 hours ago, johnnie.black said:
    It's not, disk5 failed about one hour after rebuild start, with disk3 also disable unRAID can no longer correctly rebuild the disk.

not the news I wanted to hear... if there's any silver lining, I ran a CLI command to capture the directory structure of both 5 and 6 prior to this madness.

Joseph (Author) Posted December 18, 2017

2 hours ago, SSD said:
    Suggest running a memory test. This is very strange symptom with counts getting so large.

running now.... will see what, if anything, it reveals... thanks.

JorgeB Posted December 18, 2017

5 hours ago, Joseph said:
    not the news I wanted to hear... if there's any silver lining, I ran a CLI command to capture the directory structure of both 5 and 6 prior to this madness.

If the array data is unchanged since the beginning of the rebuild (this includes no dockers/VMs running on the array), there's still a chance to rebuild disk6 again, assuming disk5 is OK. But you need to fix whatever is wrong with the server first.

Joseph (Author) Posted December 18, 2017

13 hours ago, Joseph said:
    running now.... will see what, if anything, it reveals... thanks.

UPDATE: I let it run overnight... 6 Passes, no errors so far.

JorgeB Posted December 18, 2017

Just now, Joseph said:
    UPDATE: I let it run overnight... 6 Passes, no errors so far.

It's unlikely that bad RAM would make disks drop offline. The prime suspects would be cabling, controller, and power.

Joseph (Author) Posted December 18, 2017

7 hours ago, johnnie.black said:
    If the array data was unchanged since the beginning of the rebuild (this includes no running dockers/VMs on the array) there's still a chance to rebuild disk6 again, assuming disk5 is OK, but you need to fix whatever is wrong with the server first.

I'm going to replace the PS and see if that fixes the issue... with the holidays coming up, it might be a while.

Joseph (Author) Posted December 18, 2017

1 minute ago, johnnie.black said:
    unlikely that bad RAM would make disks drop offline, prime suspects would cabling, controller and power.

So, you think it might be a good idea to replace the HBA as well as the PS?

Archived
This topic is now archived and is closed to further replies.