Chrisola Posted June 3, 2021 (edited)

Hi all

Having a weird problem adding a 2nd parity drive to my UnRaid server, version 6.9.2.

Steps:
- Took 2x identical 4TB drives from my old Synology NAS, both healthy with no issues ever
- Inserted one, pre-cleared (healthy, 0 errors), now using as a data drive - OK
- Inserted second, set as a 2nd parity drive - first try it crashed at some point (it was overnight and my WebGUI was closed, syslog saving to cache wasn't set). 2nd try, I saw it hit just over 70% and then something crashed the server - I left the WebGUI open and when I refreshed it, it became apparent something had gone wrong as it couldn't see the server. Before it refreshed it was showing 70% complete. When I start the array again, the parity build starts over.

Now, the first time, I have no logs. The 2nd time, the last log entry is this (I only read the forum and turned on the 'mirror to cache' option for syslog around 12pm, before testing that my external CD-ROM would work). Parity rebuild was kicked off at about 11am:

Jun 3 12:11:10 Tower kernel: usb-storage 1-1.5:1.0: USB Mass Storage device detected
Jun 3 12:11:10 Tower kernel: scsi host9: usb-storage 1-1.5:1.0
Jun 3 12:11:11 Tower kernel: scsi 9:0:0:0: CD-ROM ASUS SBW-06D5H-U E101 PQ: 0 ANSI: 0
Jun 3 12:11:11 Tower kernel: sr 9:0:0:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
Jun 3 12:11:11 Tower kernel: sr 9:0:0:0: Attached scsi CD-ROM sr0
Jun 3 12:11:11 Tower kernel: sr 9:0:0:0: Attached scsi generic sg7 type 5
Jun 3 12:15:58 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 498rpm)
Jun 3 15:01:23 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 498rpm) to: 25 (9% @ 498rpm)
Jun 3 15:06:28 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 500rpm)
Jun 3 15:11:34 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 499rpm) to: 25 (9% @ 499rpm)
Jun 3 20:31:43 Tower kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Jun 3 20:31:43 Tower kernel: Linux version 5.10.28-Unraid (root@Develop) (gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021
Jun 3 20:31:43 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

I did a visual check at 4pm before going out and all was still ticking away nicely - so something happened between 4pm and 8:30pm when I got home, but I have no logs for it?!

Couple of questions I guess:
- Should I pre-clear then add as a data drive to check it passes, before trying to set it as parity again?
- Can anyone think why a parity build for a 2nd drive would crash the server (leaving it powered up but not accessible via the WebGUI on a PC or via the console directly from the box) and leave no log trace?

Thanks for help and words of wisdom

Edited June 3, 2021 by Chrisola
JonathanM Posted June 3, 2021

2 minutes ago, Chrisola said:
- should i pre-clear then add as data drive to check it passes, before trying to set as parity again?

No, just run an extended SMART test. After that completes, download the diagnostics zip and attach it to your next post in this thread.

3 minutes ago, Chrisola said:
- can anyone think why a parity build for a 2nd drive would kill the server and leave no log trace?

Marginal power supply? Any time the server crashes to a completely powered-off condition, it points to power issues somewhere, board or PSU.
Chrisola Posted June 3, 2021 Author

51 minutes ago, jonathanm said:
No, just run an extended smart test, after that completes download the diagnostics zip and attach it to your next post in this thread. Marginal power supply? Any time the server crashes to a completely power off condition points to power issues somewhere, board or psu.

Will do... running now.

PSU is a 550W Silverstone powering:
- Xeon CPU
- 8GB RAM
- 5x HDs
- 1x cache SSD
- 5 case fans

Cable-wise, the 2nd parity has taken the power connection that I used to have an unassigned-device HD on for internal backup, and that worked flawlessly (server is about a year and a half old and was up 24/7 with no issues until 5 days ago when I needed more storage).

The server wasn't powered off when it crashed - the case fans and disks were whirring away and the lights were on, but I couldn't 'find' the server via the WebGUI on my PC (tried doing all the DNS resets etc.) and when I plugged my monitor in directly, there was no console output from the server on it.
JonathanM Posted June 3, 2021

2 minutes ago, Chrisola said:
The server wasn't powered off when it crashed

Ahh. That's not what I read.

1 hour ago, Chrisola said:
kill the server

Unclear, I assumed kill meant dead.

1 hour ago, Chrisola said:
turned the 'mirror to cache' option for syslog

So was that enabled during an event? I'm guessing not, or you would have already posted those results.
Chrisola Posted June 3, 2021 Author (edited)

7 minutes ago, jonathanm said:
Ahh. That's not what I read. Unclear, I assumed kill meant dead. So was that enabled during an event? I'm guessing not, or you would have already posted those results.

Have updated the OP to be clearer.

SMART test will run into the night, will report back in the morning...

Logs - first fail, none. 2nd fail, the final entry was something about fan temp, then the next entry was when it started up again after I powered it off and back on - a gap of 5 hrs in which time it crashed, but left no log except:

Jun 3 15:06:28 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 500rpm)
Jun 3 15:11:34 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 499rpm) to: 25 (9% @ 499rpm)
Jun 3 20:31:43 Tower kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Jun 3 20:31:43 Tower kernel: Linux version 5.10.28-Unraid (root@Develop) (gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021
Jun 3 20:31:43 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bz

Edited June 3, 2021 by Chrisola
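For anyone wanting to find a silent window like this in a longer syslog, here is a minimal Python sketch. It assumes every line starts with a `Mon D HH:MM:SS` timestamp and that the year is 2021 (syslog does not record it). The function names `parse_ts` and `largest_gap` are made up for illustration and are not part of Unraid.

```python
from datetime import datetime, timedelta

# Three consecutive entries copied from the log excerpt in this thread.
SAMPLE = [
    "Jun 3 15:06:28 Tower autofan: Highest disk temp is 37C, adjusting fan speed from: 25 (9% @ 499rpm) to: 50 (19% @ 500rpm)",
    "Jun 3 15:11:34 Tower autofan: Highest disk temp is 36C, adjusting fan speed from: 50 (19% @ 499rpm) to: 25 (9% @ 499rpm)",
    "Jun 3 20:31:43 Tower kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13",
]


def parse_ts(line, year=2021):
    """Parse the leading 'Mon D HH:MM:SS' stamp; the year is an assumption."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2021):
    """Return (gap, time_before, time_after) for the longest silence between entries."""
    times = [parse_ts(line, year) for line in lines if line.strip()]
    best = (timedelta(0), None, None)
    for before, after in zip(times, times[1:]):
        if after - before > best[0]:
            best = (after - before, before, after)
    return best


if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE)
    print(f"Longest silence: {gap} (from {before} to {after})")
```

On the three sample lines the longest silence is 5 h 20 min 9 s, between the last autofan entry and the first boot message. That only bounds when the crash happened; it says nothing about the cause.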
Chrisola Posted June 4, 2021 Author

Extended SMART test results attached... says completed without error? Admittedly I don't really understand what everything means, so any advice is appreciated.

I decided to pre-clear anyway - by the time I finish work later it will be done and I can see if that shows any issues.

WDC_WD40EFRX-68N32N0_WD-WCC7K1RL1LLC-20210603-2230.txt
JorgeB Posted June 4, 2021

If there's nothing being logged when the server crashes, it's most likely a hardware issue.
Chrisola Posted June 4, 2021 Author

So the troubled drive pre-cleared fine with no errors!

Jun 4 13:08:10 Tower kernel: md: sync done. time=24636sec
Jun 4 13:08:10 Tower kernel: md: recovery thread: exit status: 0

Really stumped now. I don't want to keep trying to add the 2nd parity and getting a crash, in case something worse happens...
JorgeB Posted June 4, 2021

Unlike a clear, a parity sync is CPU intensive, so that could be why it didn't crash this time.
Chrisola Posted June 4, 2021 Author (edited)

So... the saga ends!

In the end I removed the Parity Tuning plugin, pre-cleared and formatted the drive, added it into the array as storage, then ran an in-depth SMART test and DiskSpeed tests - all passed OK, no issues. Confident the disk is fine, mechanically.

So I then went through the 'Shrink Array' process to remove it from the array, and then for the New Config I added the original parity + the new 2nd parity at the same time, then let it do the parity build.

I had the WebGUI open on my 2nd screen while at work today and throughout the whole process to see if I could spot any issues - none, and 9hrs 46m (avg 113.8 MB/s speed) later it completed and both drives are now set as Parity 1 and Parity 2, like I wanted.

Very, very strange that it took a roundabout way and failed twice to start with, but, touch wood, it's done and working!!!

Edited June 4, 2021 by Chrisola
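As a quick sanity check on those figures, duration multiplied by average speed should land close to the size of the parity drive. A small Python sketch, assuming the reported speed is in decimal megabytes per second (1 MB = 10^6 bytes) and that the WD40EFRX model named in the SMART report is a 4 TB drive; the function name `data_processed_tb` is made up for illustration.

```python
def data_processed_tb(hours, minutes, avg_mb_per_s):
    """Estimate how much data was covered, in decimal terabytes.

    Assumes avg_mb_per_s is in decimal megabytes per second.
    """
    elapsed_s = hours * 3600 + minutes * 60
    total_mb = elapsed_s * avg_mb_per_s
    return total_mb / 1_000_000  # 1 TB = 1,000,000 MB in decimal units


if __name__ == "__main__":
    # Figures quoted in the post above: 9hrs 46m at an average of 113.8 MB/s.
    print(f"{data_processed_tb(9, 46, 113.8):.2f} TB")
```

The result comes out at roughly 4.0 TB, which suggests the build ran across the full drive rather than stopping short.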
JonathanM Posted June 5, 2021

1 hour ago, Chrisola said:
it's done and working!!!

Now you should do a parity check to ensure everything can be read accurately.
Chrisola Posted June 7, 2021 Author (edited)

On 6/5/2021 at 1:11 AM, jonathanm said:
Now you should do a parity check to ensure everything can be read accurately.

Scheduled parity check kicked off at 1am. The last log entries are:

Jun 7 01:00:01 Tower kernel: mdcmd (61): check
Jun 7 01:00:01 Tower kernel: md: recovery thread: check P Q ...
Jun 7 01:00:09 Tower emhttpd: read SMART /dev/sdg
Jun 7 01:00:09 Tower emhttpd: read SMART /dev/sdd
Jun 7 01:00:09 Tower emhttpd: read SMART /dev/sde
Jun 7 01:00:09 Tower emhttpd: read SMART /dev/sdb
Jun 7 01:00:09 Tower emhttpd: read SMART /dev/sdc

I looked in on it just now (01:55am) and the server had crashed - still powered on, but no console, WebGUI or access to the shares.

Really, really baffled by this - mechanically everything checks out, and the parity builds fine. Why would the check just cause a crash after starting?

Edited June 7, 2021 by Chrisola
JonathanM Posted June 7, 2021

2 minutes ago, Chrisola said:
Why would the check just cause a crash after starting?

Parity check puts a constant load on disk controllers and power supplies. Do you have constant airflow over the motherboard and disk controllers?
Chrisola Posted June 7, 2021 Author

13 minutes ago, jonathanm said:
Parity check puts a constant load on disk controllers and power supplies. Do you have constant airflow over the motherboard and disk controllers?

Yeah - when I watched the stats the CPU never went over 20% usage, and a temp of around 37C was the max I saw - the case itself is currently open while I mess with the drives (also adding more RAM and a GPU when they arrive in a couple of days). I have 5x case fans plus the CPU cooler (all be quiet ones), in a be quiet 600 tower case. Drives are all at 31 - 33 degrees at the moment.

Would it be worth swapping the SATA cable to another port?
Chrisola Posted June 7, 2021 Author

Have done a Fix Common Problems scan and corrected a couple of findings with settings (although some of them I know have been set that way since day one without issue).

Parity check is currently at 14% with 1hr 1m elapsed, so it has gotten further than last time...
JonathanM Posted June 7, 2021

It's not the CPU that's under stress, it's the I/O chips on the motherboard and any hard drive controller cards. Having the case open is a bad thing: it means the fans aren't forcing air through the case, and stagnant pools of hot air can develop. If you must leave the case open, try using a desk fan or something to force air over all the internals.
Chrisola Posted June 7, 2021 Author

Have put the sides back on and put it back to normal, will see if it helps.
Chrisola Posted June 7, 2021 Author

12 hours ago, jonathanm said:
It's not the CPU that's under stress, it's the I/O chips on the motherboard and any hard drive controller cards. Having the case open is a bad thing, it means the fans aren't forcing air through the case, and stagnant pools of hot air can develop. If you must leave the case open, try using a desk fan or something to force air over all the internals.

So... completed the parity check fine. I suspect it was due to the sides being off and poor air circulation: the stress of the parity check was causing something to overheat and crash things - as you noted.

I will monitor but, touch wood, it's running fine now (sides back on). All temps look good, no error codes in the log.

Next up is adding RAM and GPU. Will do that at the weekend, then put the case straight back together after.

Thanks for your advice.
JonathanM Posted June 7, 2021

8 minutes ago, Chrisola said:
Next up is adding ram and GPU,

Which will change airflow and add heat. You may need to do a full evaluation of how air is moving inside your case, since you seem to be on the ragged edge of stability. Remember, air will follow the easiest path; it doesn't go places unless forced. I know you said you have 5 case fans, but if you don't have them pointed optimally you may end up with one fan just cancelling out the work another is doing. You may also need ducting inside the case or a fan dedicated to blowing across the cards and board I/O chips.
Chrisola Posted June 8, 2021 Author

3 hours ago, jonathanm said:
Which will change airflow and add heat. You may need to do a full evaluation on how air is moving inside your case since you seem to be on the ragged edge of stability. Remember, air will flow the easiest path, it doesn't go places unless forced. I know you said you have 5 case fans, but if you don't have them pointed optimally you may end up with one fan just cancelling out the work another is doing. You also may need ducting inside the case or a fan dedicated to blowing across the cards and board I/O chips.

Yeah, I will. The main issue is that the 2x front fans are partially blocked by the drive cages due to the layout of the case - it's always been that way (the new drive is at the very top, so adding it doesn't impact air flow).

I had 2x top-of-case fans on exhaust, so I've swapped one around to intake. I now have 3x intake (2x front, plus top in the middle) and 2x exhaust (rear, plus top at the back of the case).

Got the RAM and GPU (Nvidia P400) installed and working, will see how things go.