[SOLVED] My unRAID is a MESS... Please HELP!

Joseph · December 13, 2017

TL;DR Replaced power supply... hopefully that fixes it.

Hello unRAID community, I'm hoping you can help me out. My unRAID has never really been stable, but that's another matter. At the moment, I have dual parity with 2 array drives knocked offline and I have replacement drives on standby. I tried stopping the array via the GUI and now I can't access the main page on any computer. I went to the console on the unRAID box itself and saw I still had a directory path of "mnt/Disks" with nothing in it. I also got an email notification that has me worried sick now:

Quote

Event: unRAID array errors
Subject: Notice [TOWER] - array turned good
Description: Array has 0 disks with read errors
Importance: normal

I've tried to stop the array via CLI running through a few commands and a script that Alex and Squid provided here.

This is an excerpt from the syslog. I'm concerned there might be problems with a parity disk (sdq) and/or cache pool (sdd, sde):

Dec 13 09:36:07 Tower kernel: mdcmd (30): import 29 sdq 3907018532 0 ST4000VN0001
Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001 size: 3907018532 
Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde
Dec 13 09:36:09 Tower emhttp: check_pool: /sbin/btrfs filesystem show 1e110a1d-e24f-4cdf-a087-31b6323b2a10 2>&1
Dec 13 09:38:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:41:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:41:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Dec 13 09:41:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a2 c0 00 00 00 08 00 00
Dec 13 09:41:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928
Dec 13 09:44:50 Tower sudo:     root : TTY=pts/0 ; PWD=/boot ; USER=root ; COMMAND=/usr/bin/lsof
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a2 c0 00 00 00 01 00 00
Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928
Dec 13 09:44:50 Tower kernel: Buffer I/O error on dev sdi1, logical block 5860532864, async page read
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: timing out command, waited 180s
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Dec 13 09:44:50 Tower kernel: sd 5:0:3:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a2 c1 00 00 00 01 00 00
Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532929
Dec 13 09:44:50 Tower kernel: Buffer I/O error on dev sdi1, logical block 5860532865, async page read
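For anyone reading these kernel lines later: the CDB bytes can be decoded by hand to confirm which sector the failed read targeted. Here is a minimal Python sketch, assuming the standard 16-byte READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13). The function name decode_read16_cdb is made up for illustration and is not part of any unRAID tool.

```python
def decode_read16_cdb(cdb_hex: str) -> dict:
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("expected a 16-byte CDB starting with opcode 0x88")
    return {
        "opcode": cdb[0],
        # Bytes 2..9 hold the starting logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13 hold the number of blocks requested, big-endian.
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# CDB copied from the 09:41:50 syslog line above.
first = decode_read16_cdb("88 00 00 00 00 01 5d 50 a2 c0 00 00 00 08 00 00")
print(first["lba"], first["blocks"])  # prints: 5860532928 8
```

The decoded LBA is the same sector number that the blk_update_request line reports for sdi, so the CDB and the I/O error describe the same failed read.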

I still can't access the GUI, so I need to shut down. So, my questions are:

* If there are problems with parity and cache, what are next steps?

* What's the safest way to shutdown at this point?

* How do I bring up the array with 2 failed disks if I encounter a situation where unRAID thinks parity needs to be rebuilt with 2 disks knocked offline and the contents are being emulated?

Please see the attached diags, run a few days ago, and the syslog generated today after trying to stop the array via GUI & CLI. Any help is greatly appreciated!! Thank you so much for your help!!

JorgeB · December 13, 2017

2 hours ago, Joseph said:

If there are problems with parity and cache

Why do you say that? There's nothing about that in the log. Disk3 and disk5 (sdi) are disabled, and disk5 dropped offline. You'll want to check cables, reboot, and grab new diags so we can check the SMART report.

To reboot, type reboot on the console.

Joseph · December 13, 2017

10 minutes ago, johnnie.black said:

Why do you say that? There's nothing about that in the log. Disk3 and disk5 (sdi) are disabled, and disk5 dropped offline. You'll want to check cables, reboot, and grab new diags so we can check the SMART report.

To reboot, type reboot on the console.

Bear with me, I'm a Linux newbie. I thought sdq was parity disk 2, and I was concerned that today's syslog still has entries pertaining to the cache pool after trying to stop the array via CLI.

Is it safe to execute the shutdown command from the CLI despite the fact "mdcmd stop" returns this error?

/usr/local/sbin/mdcmd stop
/usr/local/sbin/mdcmd: line 11: echo: write error: Invalid argument

Should I be concerned that there's still an empty mount point "/mnt/disks"? Also, does it matter that sdq shows up as disk29? (FWIW, I don't recall if I tried umount /mnt/disk29.)

Sorry for all the questions... I just don't want to blow it by doing something wrong and throwing unRAID into a parity rebuild while it's currently emulating 2 of the array disks.

Thank you.

JorgeB · December 13, 2017

These entries are normal and nothing error related:

Dec 13 09:36:07 Tower kernel: mdcmd (30): import 29 sdq 3907018532 0 ST4000VN0001-1SF178_Z4F0APX6
Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001-1SF178_Z4F0APX6 size: 3907018532 
Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde

The only errors come after these and refer to sdi (disk5). I did see some timeouts on sdm (disk9), but it recovered. It's still a good idea to also check those cables, especially if it shares a power cable or adapter with disk5.
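To see at a glance which devices actually logged errors, the syslog can be tallied per device. This is a rough Python sketch, assuming error lines follow the "blk_update_request: I/O error, dev X, sector N" pattern seen in the excerpt; count_io_errors is a hypothetical helper, and the sample lines are copied from the syslog posted above.

```python
import re
from collections import Counter

# Pattern taken from the kernel lines quoted in this thread.
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name to number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd",
    "Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde",
    "Dec 13 09:41:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928",
    "Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532928",
    "Dec 13 09:44:50 Tower kernel: blk_update_request: I/O error, dev sdi, sector 5860532929",
]
print(dict(count_io_errors(sample)))  # prints: {'sdi': 3}
```

The import lines for sdq, sdd and sde never match the error pattern, which is the point being made above: they are informational, and only sdi shows I/O errors in this excerpt.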

Joseph · December 13, 2017

32 minutes ago, johnnie.black said:
These entries are normal and nothing error related:
Dec 13 09:36:07 Tower kernel: mdcmd (30): import 29 sdq 3907018532 0 ST4000VN0001-1SF178_Z4F0APX6
Dec 13 09:36:09 Tower kernel: md: import disk29: (sdq) ST4000VN0001-1SF178_Z4F0APX6 size: 3907018532 
Dec 13 09:36:09 Tower emhttp: import 30 cache device: sdd
Dec 13 09:36:09 Tower emhttp: import 31 cache device: sde
The only errors come after these and refer to sdi (disk5). I did see some timeouts on sdm (disk9), but it recovered. It's still a good idea to also check those cables, especially if it shares a power cable or adapter with disk5.

ok. I gotta run some errands, but I'll run shutdown from CLI when I return and will report back... thanks!

Joseph · December 14, 2017

On 12/13/2017 at 4:30 PM, Joseph said:

UPDATE:

It took some time, but unRAID eventually shut down after I issued the powerdown command via CLI. I double-checked power and data cables, then replaced Disk 5. The plan is to rebuild Disk 5 first, then rebuild Disk 3 (which only 'failed' because it was accidentally formatted, so I pulled it to attempt data recovery). FWIW, the data cables were swapped out earlier this year when I had a similar problem, and disks 3 - 7 are in a 5-bay drive cage that requires 2 power connectors, which are also shared by the 2 cache SSDs in a separate cage.

Upon reboot, I ran diags... they're attached with a couple of photos. I'm probably overthinking this, but I just want to make sure when I start the array, it will rebuild Disk 5 and keep the contents of Disk 3 emulated.

Thanks!!

JorgeB · December 14, 2017

1 minute ago, Joseph said:

when I start the array, it will rebuild Disk 5 and keep the contents of Disk 3 emulated.

Correct

Joseph · December 14, 2017

16 minutes ago, johnnie.black said:

Correct

ok, here we go...

[Attached image: waitforit.jpg]

Joseph · December 16, 2017

On 12/14/2017 at 11:58 AM, johnnie.black said:

Correct

UPDATE:

So Disk5 was rebuilt from parity without incident... yay! However.... if you recall, a couple of weeks ago, I lost Disk3 because it was reformatted accidentally. That drive has since been pulled and I'm in the process of recovering whatever data I can get before sending it off for warranty replacement. Meanwhile, the contents of Disk3, which is now empty, are being emulated.

Today I noticed the mover put data on the emulated disk and I thought as a precaution it would be a good idea to manually move the contents from the emulated disk3 to the now working disk5 via MC. At some point during this process, Disk6 got knocked offline and now its contents are being emulated. SMART is ok...and I'm not sure of next steps.

Event: unRAID array errors
Subject: Warning [TOWER] - array has errors
Description: Array has 1 disk with read errors
Importance: warning

Disk 6 - HGST_HDN726040ALE614_K4KE04SB (sdg) (errors 128)

I'm beginning to think it's the PS... but it was recently replaced. Any thoughts are greatly appreciated. Diags attached... and as always, thank you!!

Link to CoolerMaster G750M PS specs

JorgeB · December 16, 2017

The disk looks fine, so it's probably connection related. It could also be the power supply if it's not working correctly; it's more than adequate if it is.

SSD · December 17, 2017

The root cause of problems like the ones you are having normally traces back to cabling issues. I very highly recommend hot-swap style cages like the SuperMicro CSE-M35T-1B. Once installed and burned in, you can swap disks without risking nudging cables and creating the types of instability and risks of data loss you are experiencing. Dual parity may help recover in some situations, but it is not a substitute for these cages, which actually prevent the problems from happening in the first place.

I've been here a long time, and was skeptical of the need for these types of cages. I remember writing several posts debating others who suggested these cages. But I use them now on all my servers. Before, I had to be prepared to spend several hours running tests after a drive addition or exchange, to troubleshoot and ensure my drives were staying online and to resolve connection issues. Several frustrating times I found myself unplugging all the drives, and connecting them one at a time, to try to isolate squirrely issues. All this drama is over. I can now swap a drive out in less than 5 minutes and know that there are no side effects. I can even pull every disk from the server, move the server to a new location, and reinsert all the drives in 15-20 mins, without any issues. Priceless!

Check eBay, as the cages I referenced above are often available used at less than half the price of new. Most of my cages are used and they work great. This model is engineered extremely well and I recommend it. But there are other brands and you can decide for yourself.


Joseph · December 17, 2017

5 hours ago, SSD said:

The root cause of problems like the ones you are having normally traces back to cabling issues. I very highly recommend hot-swap style cages like the SuperMicro CSE-M35T-1B.

Thanks SSD... all of my drives are in HotSwap Trayless Cages from iStarUSA (BPN-DE series). They should be the same, no?

Joseph · December 17, 2017

On 12/16/2017 at 2:45 PM, johnnie.black said:

The disk looks fine, so it's probably connection related. It could also be the power supply if it's not working correctly; it's more than adequate if it is.

UPDATE... UGH!!! NOW WHAT!?!

So, I went through pretty much the same procedure to rebuild the contents of Disk6 back on the same Disk6 HDD. I got this in my email this morning:

Event: unRAID Disk 6 message
Subject: Notice [TOWER] - Disk 6 returned to normal operation
Description: HGST_HDN726040ALE614_K4KE04SB (sdq)
Importance: normal

then

Event: unRAID Parity sync / Data rebuild
Subject: Notice [TOWER] - Parity sync / Data rebuild finished (826448329 errors)
Description: Duration: 8 hours, 29 minutes, 15 seconds. Average speed: 130.9 MB/s
Importance: warning

Can anyone tell me if the data on Disk6 is ok? Also, what are next steps if it is not ok? Thanks.
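As a sanity check on that notification, the reported duration and average speed can be compared against the disk size. This is a small Python sketch, assuming Disk 6 is a 4 TB drive (inferred from the HGST model number, not stated in the email); rebuild_speed_mb_s is a hypothetical helper name.

```python
def rebuild_speed_mb_s(capacity_bytes, hours, minutes, seconds):
    """Average throughput in MB/s (decimal megabytes) for one full pass."""
    elapsed_seconds = hours * 3600 + minutes * 60 + seconds
    return capacity_bytes / elapsed_seconds / 1e6


# Assumption: nominal 4 TB capacity, taken as 4 * 10**12 bytes.
ASSUMED_CAPACITY_BYTES = 4 * 10**12

# Duration from the notification: 8 hours, 29 minutes, 15 seconds.
print(round(rebuild_speed_mb_s(ASSUMED_CAPACITY_BYTES, 8, 29, 15), 1))  # prints: 130.9
```

Under that assumption the numbers agree with the reported 130.9 MB/s, which suggests the rebuild ran across the whole disk. It says nothing about whether the rebuilt data is correct; the error count in the subject line is the part to worry about.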

Joseph · December 17, 2017

UPDATE: Disk5 knocked offline again! I've stopped the array for now and will replace the power supply. At this point I don't know if the data exists any more on disk5 or disk6. Any advice would be most appreciated....can I throw up now?

NewDisplayName · December 17, 2017

eh, I have no idea, but these 917893417391273189317893131293812312 writes on disk 5 don't look normal ?!?!

JorgeB · December 17, 2017

4 hours ago, Joseph said:

Can anyone tell me if the data on Disk6 is ok?

It's not. Disk5 failed about one hour after the rebuild started, and with disk3 also disabled, unRAID could no longer correctly rebuild the disk.
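The reasoning can be put as a simple count: with dual parity, at most two array disks can be missing at once and still be reconstructed. A toy Python sketch of that rule follows; can_reconstruct is an illustrative name, and this is a simplified model, not how unRAID itself is implemented.

```python
def can_reconstruct(parity_disks, unavailable_data_disks):
    """True if the number of unavailable data disks is within parity tolerance."""
    return len(unavailable_data_disks) <= parity_disks


# At the start of the rebuild: disk3 is disabled and disk6 is being rebuilt.
print(can_reconstruct(2, {"disk3", "disk6"}))           # prints: True

# About an hour in, disk5 also fails: three disks are now unavailable.
print(can_reconstruct(2, {"disk3", "disk5", "disk6"}))  # prints: False
```

Once the third disk dropped, everything written to disk6 after that point was computed from incomplete information, which is why the rebuilt contents can't be trusted.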

SSD · December 17, 2017

Suggest running a memory test. This is a very strange symptom, with counts getting so large.

Joseph · December 18, 2017

3 hours ago, nuhll said:

eh, I have no idea, but these 917893417391273189317893131293812312 writes on disk 5 don't look normal ?!?!

no, not at all.

Joseph · December 18, 2017

3 hours ago, johnnie.black said:

It's not. Disk5 failed about one hour after the rebuild started, and with disk3 also disabled, unRAID could no longer correctly rebuild the disk.

not the news I wanted to hear... if there's any silver lining, I ran a CLI command to capture the directory structure of both 5 and 6 prior to this madness.

Joseph · December 18, 2017

2 hours ago, SSD said:

Suggest running a memory test. This is a very strange symptom, with counts getting so large.

running now.... will see what, if anything, it reveals... thanks.

JorgeB · December 18, 2017

5 hours ago, Joseph said:

not the news I wanted to hear... if there's any silver lining, I ran a CLI command to capture the directory structure of both 5 and 6 prior to this madness.

If the array data was unchanged since the beginning of the rebuild (this includes no running dockers/VMs on the array) there's still a chance to rebuild disk6 again, assuming disk5 is OK, but you need to fix whatever is wrong with the server first.

Joseph · December 18, 2017

13 hours ago, Joseph said:

running now.... will see what, if anything, it reveals... thanks.

UPDATE: I let it run overnight... 6 Passes, no errors so far.

JorgeB · December 18, 2017

Just now, Joseph said:

UPDATE: I let it run overnight... 6 Passes, no errors so far.

It's unlikely that bad RAM would make disks drop offline; the prime suspects would be cabling, controller and power.

Joseph · December 18, 2017

7 hours ago, johnnie.black said:

If the array data was unchanged since the beginning of the rebuild (this includes no running dockers/VMs on the array) there's still a chance to rebuild disk6 again, assuming disk5 is OK, but you need to fix whatever is wrong with the server first.

I'm going to replace the PS and see if that fixes the issue... with the holidays coming up, it might be a while.

Joseph · December 18, 2017

1 minute ago, johnnie.black said:

It's unlikely that bad RAM would make disks drop offline; the prime suspects would be cabling, controller and power.

So, you think it might be a good idea to replace the HBA as well as the PS?
