
Unraid Server Errors


Solved by JorgeB


Posted (edited)

This is a followup to this thread:


I'm still having issues with my server. After rebuilding disk 5, I opted to replace disk 8 because it had 96 reallocated sectors. I ran a parity check after rebuilding disk 5 without issue. As soon as I replaced disk 8, I started running into problems: read errors and rising UDMA CRC error counts on multiple drives simultaneously. I thought the issue might be the SAS card or cables, since the log was being spammed with the following:

mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
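
The fields the driver prints (originator, code, sub_code) are just bit fields of the hex word, so they can be recovered from the number alone when scanning a long syslog. The following is a minimal sketch, assuming the bit layout used by the Linux mpt3sas driver source (sub_code in bits 0-15, code in bits 16-23, originator in bits 24-27, bus type in bits 28-31). The function names are illustrative and not part of any Unraid or driver tooling.

```python
import re

# Originator values as named by the Linux mpt3sas driver (assumption: same
# mapping applies to the mpt2sas messages shown in this thread).
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}

# Matches the hex word inside "log_info(0x........)".
LOG_INFO_RE = re.compile(r"log_info\((0x[0-9a-fA-F]+)\)")


def decode_log_info(value: int) -> dict:
    """Split a 32-bit log_info word into its four bit fields."""
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,  # 0x3 means SAS
        "originator": ORIGINATORS.get(originator, f"unknown({originator})"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_line(line: str):
    """Return decoded fields for a syslog line, or None if it has no log_info."""
    match = LOG_INFO_RE.search(line)
    if match is None:
        return None
    return decode_log_info(int(match.group(1), 16))


if __name__ == "__main__":
    sample = ("mpt2sas_cm0: log_info(0x31080000): originator(PL), "
              "code(0x08), sub_code(0x0000)")
    print(decode_line(sample))
```

Decoding only confirms which firmware layer raised the event (here PL, the protocol layer). It does not identify the faulty component on its own.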

 

I replaced the SAS card and cables and I'm still running into issues. I just have no idea what to do at this point. I've attached two things: the syslog captured right after replacing drive 8 (before I replaced the cables/SAS card), and diagnostics taken after I replaced the cables/SAS card.

 

 

Edited by clowncracker
removed diagnostic
Posted (edited)
On 9/13/2024 at 1:18 AM, JorgeB said:

Looks more like a power/connection issue, did you also check/replace the power cables?

I just tried replacing the power supply entirely and I'm still running into issues.  Still seeing this in my log a lot and more read errors:

 mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)

 

Edited by clowncracker
removed diagnostic
Posted (edited)
6 hours ago, JorgeB said:

Do you have a different controller you could test with? Could be a controller problem, or it doesn't like one or more of the disks.

This is the controller I'm using.  I replaced it with an exact copy and swapped out the cables and I'm still having issues.

 

Of note I was using the old controller for years without issue.

 

https://www.amazon.com/LSI-Logic-9207-8i-Controller-LSI00301/dp/B0085FT2JC

Edited by clowncracker
Posted

Counterfeit electronics often work fine; it's just that the QC is non-existent, and since they don't have the manufacturer's backing there is no real warranty. The people making them don't care how many work or not, as long as some work for a while.

 

Do you have forced ventilation over the heatsink? Those cards get rather warm, and marginal chips can fail when they get too hot.

Posted (edited)

I guess I'll bite the bullet and buy one from the art of the server.  All of the drives having issues are stemming from the SAS card (old and new).  After replacing the SAS cables and PSU, at this point it can only be the card or the motherboard is failing (I've tried both PCIe lanes).  The motherboard dying seems unlikely, it's less than a year old at this point.

Edited by clowncracker
Posted (edited)
On 9/14/2024 at 7:36 AM, JorgeB said:

That's fine as long as it's genuine, though it does look too cheap if it's new.

The new card came today, but the last time I started the system it caused a device to be disabled.  At this point I'm afraid to start the array since I already have two drives that are now disabled.  Is there an issue with me using a backup of my USB file to clear out the errors before I start the array?  I'm afraid there is going to be data loss, even though I haven't actually tried to write anything to the array.

I've attached diagnostics just in case.  Is there anything that would indicate that the new SAS card wouldn't solve the problem?  I did have a lot of UDMA CRC error counts when I turned the system on (not started the array), but I don't know if that's just leftover from the last SAS card.

 

 

Edited by clowncracker
removed diagnostic
Posted

If no data was written to the disabled disks after they were disabled, you could instead do a new config, trust parity, and then run a correcting parity check; that would be safer for the data.

 

I don't see any new disk errors for now, so the UDMA CRC errors are likely old. Keep monitoring the counts: if they keep increasing, there's still a problem.
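
One way to tell old CRC errors from new ones is to compare the raw value of SMART attribute 199 between two snapshots of `smartctl -A` output for the same disk. The sketch below assumes the usual ten-column attribute table printed by smartctl. The helper names and the sample rows (including their counts) are made up for illustration and are not taken from the diagnostics in this thread.

```python
def read_raw_value(smart_text: str, attribute_name: str):
    """Return the RAW_VALUE of a named attribute from `smartctl -A` style text."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID followed by the attribute name;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attribute_name:
            return int(fields[9])
    return None


def crc_errors_increased(before: str, after: str) -> bool:
    """True if UDMA_CRC_Error_Count grew between two snapshots of one disk."""
    old = read_raw_value(before, "UDMA_CRC_Error_Count")
    new = read_raw_value(after, "UDMA_CRC_Error_Count")
    if old is None or new is None:
        raise ValueError("UDMA_CRC_Error_Count not found in one of the snapshots")
    return new > old


# Hypothetical snapshots: same disk, taken before and after a period of load.
BEFORE = "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 12"
AFTER = "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 15"

if __name__ == "__main__":
    print(crc_errors_increased(BEFORE, AFTER))
```

The raw count never resets, so only the difference between snapshots matters: a stable number points to past errors, while a growing one points to a connection problem that is still present.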

Posted (edited)

Disk 8 is emulated because it's new, but disk 10 had read/write errors from starting the array and trying to rebuild disk 8.  Would a new config still be the best option?  I still have the old disk 8 if that helps.

The disk errors would have been in syslog-previous.txt, I restarted the server after the UDMA CRC errors came up (both boots with the new SAS card) just to see if I would get any more errors.

Edited by clowncracker
Posted
1 hour ago, clowncracker said:

Would a new config still be the best option?  I still have the old disk 8 if that helps.

Disk10 looks OK, so it should still be a good option, but you'd also need to use the old disk8, assuming that one is also OK.

Posted (edited)

I'm just concerned because the old disk 8 had 96 reallocated sectors. I've never used a new config before and, I'm not going to lie, the warning on this page scares me a bit.

Neither of the parity drives is on the SAS controller, so it makes me feel like doing a new config isn't the best option. What are the benefits of building a new config vs. using a USB backup from before all of these issues started vs. just trying to rebuild disks 8 and 10?

 

Ideally I want to go with the option that will protect the data, but also something that will verify the new SAS card will actually solve my problems.

 

Edited by clowncracker
Posted (edited)
On 9/17/2024 at 12:47 AM, JorgeB said:

If the old disk8 is suspect, before attempting a new config, unassign disks 8 and 10 and post new diags after array start.

I installed the old disk 8 and unassigned disks 8 and 10 from the array. Here is the new config:

 

Edited by clowncracker
removed diagnostic
  • Solution
Posted

I assume this is old disk8?

 

Device Model:     TOSHIBA HDWE150
Serial Number:    76J7KC8JF57D

 

If so, it has a few reallocated sectors, so I would recommend running an extended SMART test.

 

 

Both emulated disks are mounting. Assuming the contents look correct, you can rebuild. There are no controller errors so far, but any issues are more likely to appear under load, such as during a rebuild.

 

Keep old disk8 intact for now and rebuild to the new disk. Assuming there's no other spare, disk10 would need to be rebuilt on top of itself, so you could rebuild only disk8 first to see how it goes, keeping old disk10 intact as well.

Posted (edited)
On 9/17/2024 at 9:00 AM, JorgeB said:

I assume this is old disk8?

 


I ended up rebuilding over disks 8 and 10 without issue, and my server is back online and operational. I appreciate all of the help. My final question: do you think the airflow for the LSI card is good, or should I add another NF-A4x20 over the heatsink?

Edited by clowncracker
