UnRaid Server crashing randomly



I am at my wits' end.  I have had an unRaid server for almost 10 years.  I recently started having crashes with my system, which had 10-year-old hardware.

 

When it crashes, it is not responsive in any way: no video, no keyboard response, nothing.  I have to do a cold shutdown and reboot to get back into it.

 

The first thing I did was upgrade unRaid to the latest version, and I didn't have any issues with that.  But it was still crashing, so I decided it was a good time to upgrade the hardware.  My friend and I purchased two sets of mobo, processor, memory and controllers from the same vendor.  He installed his and it worked fine.  I installed mine and it has still been crashing, especially when doing a parity sync.  I've tried everything: mem tests, a new PSU, new controllers.  It would still crash when running the parity sync/rebuild.

The only things I have not changed are the case and the actual drives.

 

I replaced the mobo with a new identical one and that seemed to help.  I was able to start the parity sync and it ran for over 24 hours, but then it crashed.

 

I have a feeling something is going on with a drive but I can't seem to fix it.  Every time I run a parity sync, it crashes.  Currently it shows one drive as unmountable and the parity drive as emulated.  I can't get through a full parity sync/rebuild.

 

I'm attempting an xfs_repair on the problem drive after reading numerous posts to try to figure out the issue.  That's been running for a while now; it's a 10 TB drive.

 

I've attached my diagnostics. If anyone has any thoughts, I could use the help. This has become very frustrating.

 

 

tower-diagnostics-20200702-1121.zip

Link to comment

The syslog shows errors on the LSI storage controller. Try updating its BIOS and firmware.

 

Jul  2 07:55:02 Tower kernel: mpt2sas_cm0: LSISAS2308: FWVersion(18.00.00.00), ChipRevision(0x05), BiosVersion(07.35.00.00)

 

Jul  2 09:40:39 Tower kernel: sd 9:0:0:0: [sdc] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:40:39 Tower kernel: sd 9:0:0:0: [sdc] tag#6379 CDB: opcode=0x88 88 00 00 00 00 03 a3 81 29 c0 00 00 00 08 00 00
Jul  2 09:40:39 Tower kernel: print_req_error: I/O error, dev sdc, sector 15628052928
Jul  2 09:40:57 Tower kernel: sd 9:0:1:0: [sdf] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:40:57 Tower kernel: sd 9:0:1:0: [sdf] tag#6379 CDB: opcode=0x88 88 00 00 00 00 05 74 ff ff 40 00 00 00 08 00 00
Jul  2 09:40:57 Tower kernel: print_req_error: I/O error, dev sdf, sector 23437770560
Jul  2 09:41:07 Tower kernel: sd 9:0:2:0: [sdj] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:41:07 Tower kernel: sd 9:0:2:0: [sdj] tag#6379 CDB: opcode=0x88 88 00 00 00 00 03 a3 81 29 c0 00 00 00 08 00 00
Jul  2 09:41:07 Tower kernel: print_req_error: I/O error, dev sdj, sector 15628052928
Jul  2 09:41:17 Tower kernel: sd 9:0:4:0: [sdl] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:41:17 Tower kernel: sd 9:0:4:0: [sdl] tag#6379 CDB: opcode=0x88 88 00 00 00 00 04 8c 3f ff 40 00 00 00 08 00 00
Jul  2 09:41:17 Tower kernel: print_req_error: I/O error, dev sdl, sector 19532873536
Jul  2 09:41:26 Tower kernel: sd 9:0:3:0: [sdk] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:41:26 Tower kernel: sd 9:0:3:0: [sdk] tag#6379 CDB: opcode=0x88 88 00 00 00 00 05 74 ff ff 40 00 00 00 08 00 00
Jul  2 09:41:26 Tower kernel: print_req_error: I/O error, dev sdk, sector 23437770560
Jul  2 09:41:36 Tower kernel: sd 9:0:5:0: [sdm] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:41:36 Tower kernel: sd 9:0:5:0: [sdm] tag#6379 CDB: opcode=0x88 88 00 00 00 00 04 8c 3f ff 40 00 00 00 08 00 00
Jul  2 09:41:36 Tower kernel: print_req_error: I/O error, dev sdm, sector 19532873536
Jul  2 09:41:46 Tower kernel: sd 9:0:6:0: [sdn] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jul  2 09:41:46 Tower kernel: sd 9:0:6:0: [sdn] tag#6379 CDB: opcode=0x88 88 00 00 00 00 03 a3 81 29 c0 00 00 00 08 00 00
Jul  2 09:41:46 Tower kernel: print_req_error: I/O error, dev sdn, sector 15628052928
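
For anyone reading these lines later, here is a rough sketch of how the CDB bytes relate to the reported sector numbers. It is only an illustration in Python, not part of any Unraid tooling, and the names `decode_read16` and `parse_log` are made up for this example. It assumes the standard READ(16) layout (opcode 0x88, bytes 2-9 = starting LBA, bytes 10-13 = number of blocks), so the decoded LBA should match the sector in the print_req_error line that follows.

```python
import re
from collections import defaultdict

# Patterns for the two kinds of syslog lines shown above.
CDB_RE = re.compile(r"\[(sd[a-z]+)\] .*CDB: opcode=0x88 ((?:[0-9a-f]{2} ?){16})")
ERR_RE = re.compile(r"I/O error, dev (sd[a-z]+), sector (\d+)")


def decode_read16(cdb_hex):
    """Decode a READ(16) CDB given as hex text into (lba, blocks)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # 8-byte starting block address
    blocks = int.from_bytes(cdb[10:14], "big")  # 4-byte transfer length
    return lba, blocks


def parse_log(lines):
    """Collect decoded reads and reported error sectors per device."""
    reads = defaultdict(list)   # dev -> [(lba, blocks), ...]
    errors = defaultdict(list)  # dev -> [sector, ...]
    for line in lines:
        m = CDB_RE.search(line)
        if m:
            reads[m.group(1)].append(decode_read16(m.group(2)))
            continue
        m = ERR_RE.search(line)
        if m:
            errors[m.group(1)].append(int(m.group(2)))
    return dict(reads), dict(errors)


# Three lines copied from the syslog excerpt above.
SAMPLE = [
    "Jul  2 09:40:39 Tower kernel: sd 9:0:0:0: [sdc] tag#6379 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00",
    "Jul  2 09:40:39 Tower kernel: sd 9:0:0:0: [sdc] tag#6379 CDB: opcode=0x88 88 00 00 00 00 03 a3 81 29 c0 00 00 00 08 00 00",
    "Jul  2 09:40:39 Tower kernel: print_req_error: I/O error, dev sdc, sector 15628052928",
]

if __name__ == "__main__":
    reads, errors = parse_log(SAMPLE)
    for dev, items in reads.items():
        for lba, blocks in items:
            print(f"{dev}: READ(16) lba={lba} blocks={blocks} errors={errors.get(dev)}")
```

In the excerpt, the failed reads hit seven different disks on the same host adapter (9:0:x:0) in a little over a minute, which is why the controller looks like the first thing to check rather than any single disk.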

 

2 hours ago, mdizzle said:

drive still shows as unmountable.

That is a separate problem: the file system is corrupt.

Link to comment
  • 1 month later...

First, please avoid writing to the disks. If it crashes again during a write, the file system may get corrupted again.

 

Next, troubleshoot:

 

- Stop the array from auto-starting

- Stress the memory, i.e. run a memory test

- Stress the CPU, e.g. mount a UD disk and run "md5sum file" in multiple sessions, or anything else that loads the CPU

- Start the array in maintenance mode and run a "dd" read test on all disks

- If possible, swap in another PSU

......

....

...

 

Perform the tests step by step to make sure each part is working normally ...... I notice you changed the HBA to a Marvell 88SE9485
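
For the read test in the list above, plain "dd" to /dev/null is all that is needed. If a scripted version is handier, the following is a minimal Python sketch of the same idea, assuming it is run as root with the array in maintenance mode. The function name `read_test` and its parameters are made up for this example, and the device paths are whatever applies on your system.

```python
import sys
import time


def read_test(path, chunk_size=1024 * 1024, max_bytes=None):
    """Sequentially read `path` and discard the data.

    Returns (bytes_read, seconds, error), where error is None if the
    read reached the end (or max_bytes) without an OSError.
    """
    total = 0
    error = None
    start = time.monotonic()
    try:
        # buffering=0 gives plain unbuffered reads; nothing is written.
        with open(path, "rb", buffering=0) as dev:
            while max_bytes is None or total < max_bytes:
                want = chunk_size
                if max_bytes is not None:
                    want = min(chunk_size, max_bytes - total)
                data = dev.read(want)
                if not data:  # end of device or file
                    break
                total += len(data)
    except OSError as exc:
        # An I/O error here is the interesting result; record where it hit.
        error = exc
    return total, time.monotonic() - start, error


if __name__ == "__main__":
    # Usage (paths are placeholders): python3 read_test.py /dev/sdX /dev/sdY
    for path in sys.argv[1:]:
        total, secs, err = read_test(path)
        rate = total / secs / 1e6 if secs > 0 else 0.0
        status = "OK" if err is None else f"FAILED after {total} bytes: {err}"
        print(f"{path}: {total} bytes in {secs:.1f}s ({rate:.0f} MB/s) {status}")
```

Testing one disk at a time first, then several in parallel, should help separate a single bad disk from a controller or PSU that only fails under load.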

Link to comment
  • 2 weeks later...

Johnnie,

Random crashes mostly, but they seem to happen more frequently when running a parity check.  In the past, I saw some drive issues and I replaced a problem drive hoping it would help. It did, but I still get the crashes.

 

I have tried 4 different controllers, including two different recommended LSI controllers. In addition, my friend and I bought the same mb, chip, memory and LSI controllers and has the same issues.  The SAS2LP controller you refer to, I had in my old machine, which ran for 9 years before it started having issues, hence the upgrade.

 

I've tested the memory before and it came up clean.  I haven't tested the HDDs specifically, but I would've expected some warning from unRaid if the drives were going bad.

 

It's frustrating because I've tried everything short of changing the chip, which is expensive, and possibly replacing the memory.

 

Link to comment
6 hours ago, mdizzle said:

the SAS2LP controller you refer to, i had in my old machine that ran for 9 years before it started having issues, hence the upgrade.

SAS2LP used to work great with v5, but since v6 many users have started having problems with it (same for the SASLP, which uses the same driver). Many times the issues begin out of the blue after an Unraid upgrade. That doesn't mean it's the problem here, since it doesn't usually make the server crash, but it's definitely not recommended.

 

The last diags you posted are from just after rebooting, so there is not much to see. If it keeps crashing, or you can make it crash, use this then post that syslog.

Link to comment
