Trying to add 2nd Parity drive crashes whole server around 70% done


Hi all

 

I'm having a weird problem adding a 2nd parity drive to my Unraid server, version 6.9.2.

 

Steps:

- Took 2x identical 4TB drives from my old Synology NAS, both healthy with no issues ever

- Inserted one, pre-cleared it (healthy, 0 errors), and am now using it as a data drive - OK

- Inserted the second and assigned it as the 2nd parity drive. On the first try it crashed at some point (it ran overnight, my WebGUI was closed, and syslog saving to cache wasn't set up).

 

On the 2nd try, I saw it hit just over 70% and then something crashed the server. I had left the WebGUI open, and when I refreshed it, it became apparent something had gone wrong as it couldn't see the server. Before the refresh it was showing 70% complete.

 

When I start the array again, the parity check starts over.

 

For the first crash I have no logs. For the 2nd, the last log entries are below. I only read the forum and turned on the 'mirror to cache' option for syslog at around 12pm, just before testing that my external CD-ROM drive would work. The parity build was kicked off at about 11am:

 

Jun  3 12:11:10 Tower kernel: usb-storage 1-1.5:1.0: USB Mass Storage device detected
Jun  3 12:11:10 Tower kernel: scsi host9: usb-storage 1-1.5:1.0
Jun  3 12:11:11 Tower kernel: scsi 9:0:0:0: CD-ROM            ASUS     SBW-06D5H-U      E101 PQ: 0 ANSI: 0
Jun  3 12:11:11 Tower kernel: sr 9:0:0:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
Jun  3 12:11:11 Tower kernel: sr 9:0:0:0: Attached scsi CD-ROM sr0
Jun  3 12:11:11 Tower kernel: sr 9:0:0:0: Attached scsi generic sg7 type 5
Jun  3 12:15:58 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 498rpm)
Jun  3 15:01:23 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 498rpm) to: 25 (9% @ 498rpm)
Jun  3 15:06:28 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 500rpm)
Jun  3 15:11:34 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 499rpm) to: 25 (9% @ 499rpm)
Jun  3 20:31:43 Tower kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Jun  3 20:31:43 Tower kernel: Linux version 5.10.28-Unraid (root@Develop) (gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021
Jun  3 20:31:43 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

 

I did a visual check at 4pm before going out and all was still ticking away nicely, so something happened between 4pm and 8:30pm when I got home, yet there are no logs for it?!
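
One way to pin down a silent window like this is to scan the saved syslog for large jumps between consecutive timestamps. A minimal Python sketch follows. The function name `find_gaps`, the one-hour threshold and the assumed year 2021 (syslog lines carry no year) are illustrative choices, not anything Unraid provides.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=1), year=2021):
    """Return (previous, next, gap) for consecutive entries further apart than min_gap."""
    gaps = []
    prev = None
    for line in lines:
        try:
            # Classic syslog timestamps are the first 15 characters, e.g. "Jun  3 15:11:34".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and stamp - prev > min_gap:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps


# Three lines copied from the log excerpt above.
sample = [
    "Jun  3 15:06:28 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 500rpm)",
    "Jun  3 15:11:34 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 499rpm) to: 25 (9% @ 499rpm)",
    "Jun  3 20:31:43 Tower kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13",
]

for before, after, gap in find_gaps(sample):
    print(f"no entries for {gap} between {before:%H:%M:%S} and {after:%H:%M:%S}")
# prints: no entries for 5:20:09 between 15:11:34 and 20:31:43
```

A gap only shows when logging stopped, not why. On a quiet server a long silence can also be perfectly normal, so the threshold needs adjusting to how chatty the log usually is.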

 

Couple of questions i guess:

 

- Should I pre-clear the drive and add it as a data drive to check it passes, before trying to set it as parity again?

- Can anyone think why a parity build for a 2nd drive would crash the server (leaving it powered up but not accessible via the WebGUI on a PC or via the console directly on the box) and leave no log trace?

 

Thanks for help and words of wisdom :)

 

 

Edited by Chrisola
2 minutes ago, Chrisola said:

- should i pre-clear then add as data drive to check it passes, before trying to set as parity again?

No, just run an extended SMART test. After that completes, download the diagnostics zip and attach it to your next post in this thread.

3 minutes ago, Chrisola said:

- can anyone think why a parity build for a 2nd drive would kill the server and leave no log trace?

Marginal power supply? Any time the server crashes to a completely powered-off condition, that points to power issues somewhere, board or PSU.

51 minutes ago, jonathanm said:

No, just run an extended SMART test. After that completes, download the diagnostics zip and attach it to your next post in this thread.

Marginal power supply? Any time the server crashes to a completely powered-off condition, that points to power issues somewhere, board or PSU.

 

Will do....running now.

 

PSU is a 550W Silverstone powering:

- Xeon CPU
- 8GB RAM
- 5x HDs
- 1x cache SSD
- 5 case fans

 

Cable-wise, the 2nd parity drive has taken the power connection that I used to have an unassigned-device HD on for internal backup, and that worked flawlessly. The server is about a year and a half old and was up 24/7 with no issues until 5 days ago, when I needed more storage.

 

The server wasn't powered off when it crashed: the case fans and disks were whirring away and the lights were on, but I couldn't 'find' the server via the WebGUI on my PC (I tried all the DNS resets etc.), and when I plugged my monitor in directly, there was no console output from the server on it.

 

 

2 minutes ago, Chrisola said:

The server wasn't powered off when it crashed

Ahh. That's not what I read.

1 hour ago, Chrisola said:

kill the server

Unclear, I assumed kill meant dead.

1 hour ago, Chrisola said:

turned the 'mirror to cache' option for syslog

So was that enabled during an event? I'm guessing not, or you would have already posted those results.

7 minutes ago, jonathanm said:

Ahh. That's not what I read.

Unclear, I assumed kill meant dead.

So was that enabled during an event? I'm guessing not, or you would have already posted those results.

 

Have updated OP to be clearer :)

 

SMART test will run into the night, will report back in the morning...

 

Logs - first fail, none.

 

On the 2nd fail, the final entry was an autofan message about disk temperature, and the next entry was from when it started up again after I powered it off and back on. That leaves a gap of about 5 hours in which it crashed but logged nothing. The entries either side of the gap:

 

Jun  3 15:06:28 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 500rpm)
Jun  3 15:11:34 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 499rpm) to: 25 (9% @ 499rpm)
Jun  3 20:31:43 Tower kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Jun  3 20:31:43 Tower kernel: Linux version 5.10.28-Unraid (root@Develop) (gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021
Jun  3 20:31:43 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bz

 

 

Edited by Chrisola
Link to comment

So the troubled drive pre-cleared fine with no errors!

 

Jun 4 13:08:10 Tower kernel: md: sync done. time=24636sec
Jun 4 13:08:10 Tower kernel: md: recovery thread: exit status: 0

 

Really stumped now. I don't want to keep trying to add the 2nd parity drive and getting a crash, in case something worse happens.... :(

Link to comment

So....the saga ends!

 

In the end I removed the Parity Tuning Plugin, then pre-cleared and formatted the drive and added it to the array as storage. I ran an in-depth SMART test and DiskSpeed tests: all passed OK, no issues. I'm confident the disk is fine mechanically.

 

So I then went through the 'Shrink Array' process to remove it from the array, and in the New Config I assigned the original parity drive and the new 2nd parity drive at the same time, then let it do the parity build.

 

I had the WebGUI open on my 2nd screen while at work today, throughout the whole process, to see if I could spot any issues. There were none, and 9hrs 46m later (avg speed 113.8 MB/s) it completed. Both drives are now set as Parity 1 and Parity 2, like I wanted.
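
As a rough sanity check on those figures, duration multiplied by average speed gives the amount of data the build covered. The short Python sketch below does that arithmetic. It assumes the reported speed is in decimal megabytes per second, and the helper name `data_covered_tb` is made up for illustration.

```python
def data_covered_tb(hours, minutes, avg_mb_per_s):
    """Total data processed, in decimal terabytes, for a run of this length and average speed."""
    seconds = hours * 3600 + minutes * 60
    return seconds * avg_mb_per_s / 1_000_000  # MB -> TB, decimal units


total = data_covered_tb(9, 46, 113.8)
print(f"{total:.2f} TB")  # prints: 4.00 TB
```

That works out to about 4 TB, which is what a build running across the full length of a 4 TB parity drive would be expected to cover.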

 

Very, very strange that it needed a roundabout route and failed twice to start with, but, touch wood, it's done and working!!!

 

 

Edited by Chrisola
On 6/5/2021 at 1:11 AM, jonathanm said:

Now you should do a parity check to ensure everything can be read accurately.

 

Scheduled Parity check kicked off at 1am, last log entry is:

 

Jun  7 01:00:01 Tower kernel: mdcmd (61): check 
Jun  7 01:00:01 Tower kernel: md: recovery thread: check P Q ...
Jun  7 01:00:09 Tower emhttpd: read SMART /dev/sdg
Jun  7 01:00:09 Tower emhttpd: read SMART /dev/sdd
Jun  7 01:00:09 Tower emhttpd: read SMART /dev/sde
Jun  7 01:00:09 Tower emhttpd: read SMART /dev/sdb
Jun  7 01:00:09 Tower emhttpd: read SMART /dev/sdc

 

I looked in on it just now (01:55am) and the server had crashed: still powered on, but no console, WebGUI or access to the shares.

 

Really, really baffled with this - mechanically everything checks out, and the parity builds fine.

 

Why would the check just cause a crash after starting?

Edited by Chrisola
13 minutes ago, jonathanm said:

Parity check puts a constant load on disk controllers and power supplies.

 

Do you have constant airflow over the motherboard and disk controllers?

 

Yeah - when I watched the stats the CPU never went over 20% usage, and a temp of around 37 was the max I saw. The case itself is currently open while I mess with the drives (I'm also adding more RAM and a GPU when they arrive in a couple of days). I have 5x case fans plus a CPU cooler (all be quiet ones), in a be quiet 600 tower case. Drives are all at 31 - 33 degrees at the moment.

 

Would it be worth swapping the SATA cable to another port?

 

 

 

 

Link to comment

It's not the CPU that's under stress, it's the I/O chips on the motherboard and any hard drive controller cards. Having the case open is a bad thing, it means the fans aren't forcing air through the case, and stagnant pools of hot air can develop. If you must leave the case open, try using a desk fan or something to force air over all the internals.
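
To watch more than the CPU and disk temperatures while a check runs, the Linux hwmon sysfs tree is one place to look: each `temp*_input` file under `/sys/class/hwmon` holds a reading in millidegrees Celsius. The Python sketch below simply lists whatever is exposed there. It is only a sketch under assumptions: which chips appear depends on the kernel drivers loaded, a chipset or controller card may expose nothing at all, Python may not be installed on a stock Unraid box, and the function name `read_hwmon_temps` is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN/chip/sensor": degrees_C} for every readable hwmon temperature input."""
    temps = {}
    base = Path(root)
    if not base.is_dir():
        return temps  # no hwmon tree on this system
    for sensor in sorted(base.glob("hwmon*/temp*_input")):
        name_file = sensor.parent / "name"
        try:
            chip = name_file.read_text().strip() if name_file.exists() else "unknown"
            millideg = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # some sensors refuse reads or return junk
        temps[f"{sensor.parent.name}/{chip}/{sensor.name}"] = millideg / 1000.0
    return temps


for label, deg in read_hwmon_temps().items():
    print(f"{label}: {deg:.1f} C")
```

Running it every minute or so during a parity check and saving the output somewhere that survives a reboot would show whether any reading climbs before a crash. A sensor that is not listed cannot be ruled out this way.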

 

 

12 hours ago, jonathanm said:

It's not the CPU that's under stress, it's the I/O chips on the motherboard and any hard drive controller cards. Having the case open is a bad thing, it means the fans aren't forcing air through the case, and stagnant pools of hot air can develop. If you must leave the case open, try using a desk fan or something to force air over all the internals.

 

 

 

So... the parity check completed fine.

I suspect it was due to the sides being off and poor air circulation: the stress of the parity check was causing something to overheat and crash things, as you noted.

 

I will monitor but touch wood, it's running fine now (sides back on). All temps look good, no error codes in the log.

 

Next up is adding ram and GPU, will do that at the weekend, then put case straight back together after.

 

Thanks for your advice.

8 minutes ago, Chrisola said:

Next up is adding ram and GPU,

Which will change airflow and add heat. You may need to do a full evaluation on how air is moving inside your case since you seem to be on the ragged edge of stability.

 

Remember, air will flow the easiest path, it doesn't go places unless forced. I know you said you have 5 case fans, but if you don't have them pointed optimally you may end up with one fan just cancelling out the work another is doing. You also may need ducting inside the case or a fan dedicated to blowing across the cards and board I/O chips.

3 hours ago, jonathanm said:

Which will change airflow and add heat. You may need to do a full evaluation on how air is moving inside your case since you seem to be on the ragged edge of stability.

 

Remember, air will flow the easiest path, it doesn't go places unless forced. I know you said you have 5 case fans, but if you don't have them pointed optimally you may end up with one fan just cancelling out the work another is doing. You also may need ducting inside the case or a fan dedicated to blowing across the cards and board I/O chips.

 

Yeah, I will. The main issue is that the 2x front fans are partially blocked by the drive cages due to the layout of the case. It's always been that way (the new drive is at the very top, so adding it doesn't affect airflow).

 

I had 2x top-of-case fans on exhaust, so I've swapped one around to intake. I now have 3x intake (2x front and the top middle) and 2x exhaust (rear and the top at the back of the case).

 

Got RAM and GPU (Nvidia P400) installed and working, will see how things go.

 

 

 

 

 
