SATA controller overheat during parity check

Fransysco · August 3, 2017

I had my SATA controller overheat during a parity check last night due to a failed fan.

The parity check reported 871590738 errors and unfortunately write corrections to parity was enabled. Based on the number of errors seen across the disks it looks like the disk that failed (disk9) dropped out fairly early on in the parity check.

My build has two parity drives, though I'm not sure it's going to matter for this one.

I attached a screenshot of the array prior to power down.

Running an extended SMART test shows the disk to be healthy.

Unfortunately, I didn't grab a system diagnostic or logs prior to shutdown due to how hot things were getting inside the server. I did attach the diagnostic file from after reboot, but not sure how helpful that will be.

My question is, should I trust parity and rebuild the failed disk or can I no longer trust parity?

Edited February 7, 2020 by Fransysco

JorgeB · August 3, 2017

11 minutes ago, Fransysco said:

My question is, should I trust parity and rebuild the failed disk or can I no longer trust parity?

No way to know, since the diags are from after rebooting. SMART for disk9 looks OK, though there are a lot of CRC errors, so there is (or there was in the past) a bad SATA cable.
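For anyone who wants to check the CRC counter themselves, here is a minimal sketch that pulls the raw value out of a saved SMART attribute listing. It assumes the usual ten-column attribute rows printed by `smartctl -A`; the sample row and its count are made up for illustration and are not from this thread's diagnostics. The raw value is a lifetime counter, so a non-zero number only says there were link errors at some point, which fits the "or there was in the past" caveat.

```python
def crc_error_count(smart_attributes: str):
    """Return the raw UDMA_CRC_Error_Count value, or None if the row is absent."""
    for line in smart_attributes.splitlines():
        fields = line.split()
        # Assumed row layout:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


# Hypothetical sample row; the count 1432 is invented for this example.
sample = (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000"
    "    Old_age   Always       -       1432"
)
print(crc_error_count(sample))
```

Comparing the value before and after a parity check would show whether the count is still climbing (cable or port still bad) or is only historical.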

The safest way would be to rebuild to a spare disk, or copy everything from the emulated disk to another disk or computer, then test the old disk9. If it's OK, do a new config with it and re-sync parity; if not, copy everything you can from it, overwriting the rebuilt files.

FreeMan · August 3, 2017

Frankly, I don't think 90°F is all that hot. I'll grant you that cooler is always better, but I have 1 drive that often gets to 37°C (98.6°F) and while I get annoyed that it's so warm, I've never (in 3yr, 10mo of spinning) had an issue with it. I wouldn't have panicked at those temps, but it is your server and you're free to do so. Just supplying a data point for you.

SSD · August 4, 2017

In plain English - your best chance of recovery is from the disk9 that dropped from the array. DO NOT WRITE TO THAT DISK.

DO NOT TRY TO DO A REBUILD OF DISK9 ON TOP OF THAT SAME DISK!!

Johnnie's advice to rebuild onto a spare disk does no harm but takes a long time. Instead I would try to mount the dropped disk9 outside the array and see if the contents look valid. Test a few files to be sure. If so, I'd do a new config including that disk. If it is garbage, then you'd want to do as Johnnie suggested, but I fear that result will be corrupted in about 871 million blocks (roughly 3.5T at 4K per block).
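For a rough sense of scale, the reported error count can be converted into an amount of data. This is only a sketch: it assumes each reported parity error corresponds to one 4 KiB block, which nothing in the diagnostics here confirms, and `affected_bytes` is just an illustrative helper name.

```python
BLOCK_SIZE_BYTES = 4096  # assumption: one reported error = one 4 KiB block


def affected_bytes(error_count: int, block_size: int = BLOCK_SIZE_BYTES) -> int:
    """Bytes of parity covered by the given number of reported errors."""
    return error_count * block_size


reported_errors = 871590738  # count reported by the parity check in the first post
size = affected_bytes(reported_errors)
print(f"{reported_errors / 1e6:.1f} million blocks, about {size / 1e12:.2f} TB")
```

If the block-size assumption is off, only the constant needs changing.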

Fransysco · August 5, 2017

On 8/3/2017 at 6:10 PM, FreeMan said:

Frankly, I don't think 90°F is all that hot. I'll grant you that cooler is always better, but I have 1 drive that often gets to 37°C (98.6°F) and while I get annoyed that it's so warm, I've never (in 3yr, 10mo of spinning) had an issue with it. I wouldn't have panicked at those temps, but it is your server and you're free to do so. Just supplying a data point for you.

The HDDs weren't what was getting hot (their fans were still spinning); it was the riser in my server (C2100), with the RAID card and Ethernet NIC on it, and the case surrounding it (seen on the SSDs, which are mounted to the case and started picking up heat from the RAID card). I actually scalded my hand when I touched the RAID card after power down.

Most of my HDDs run at 82-84°F. But thanks for the information!

Fransysco · August 5, 2017

On 8/3/2017 at 9:55 AM, johnnie.black said:

No way to know, since the diags are from after rebooting. SMART for disk9 looks OK, though there are a lot of CRC errors, so there is (or there was in the past) a bad SATA cable.

The safest way would be to rebuild to a spare disk, or copy everything from the emulated disk to another disk or computer, then test the old disk9. If it's OK, do a new config with it and re-sync parity; if not, copy everything you can from it, overwriting the rebuilt files.

On 8/3/2017 at 9:49 PM, bjp999 said:

In plain English - your best chance of recovery is from the disk9 that dropped from the array. DO NOT WRITE TO THAT DISK.

DO NOT TRY TO DO A REBUILD OF DISK9 ON TOP OF THAT SAME DISK!!

Johnnie's advice to rebuild onto a spare disk does no harm but takes a long time. Instead I would try to mount the dropped disk9 outside the array and see if the contents look valid. Test a few files to be sure. If so, I'd do a new config including that disk. If it is garbage, then you'd want to do as Johnnie suggested, but I fear that result will be corrupted in about 871 million blocks (roughly 3.5T at 4K per block).

What do you guys mean by "do a new config"?

JorgeB · August 5, 2017

6 minutes ago, Fransysco said:

What do you guys mean by "do a new config"?

Tools -> New config

SSD · August 5, 2017

A new config will basically cause unRAID to forget its array configuration. For convenience there is an option to leave the existing drives in their existing slots, but unRAID has completely forgotten what the configuration was, and you are free to put them in a different order. Remove disks. Add disks. Whatever. When you start the array, no disk zeroing occurs.

When you do the new config, you have an option to say whether to "trust parity" or not. If you trust it, unRAID assumes parity is already accurate. If not, parity builds.

In your case, it would enable you to put the disk that was kicked out of the array back into the array without rebuilding it. Which is what you probably want. You can trust parity, although it will not be anywhere near 100% accurate. You'd need to run a correcting parity check, and expect a sizable number of errors. Or don't trust parity and let it rebuild. Same net result.

I'd probably just assign all the drives and leave parity out altogether initially. Then look at each drive and confirm that it appears valid, and try to play several media files on each drive, trying to pick a variety of files that were copied at different times. If you have .zip files, you could do integrity checks on them. Or if (and I think it unlikely) you have been computing MD5 or other checksums on your files, this would be a very good time to use them to verify the integrity of each file.
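If you do have checksums recorded, the check could be scripted along these lines. This is a minimal sketch using only the Python standard library; the `expected` mapping of file path to previously recorded MD5 is a hypothetical format, so adapt it to however your checksums were actually stored.

```python
import hashlib
import zipfile


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(expected):
    """Return the paths whose current MD5 differs from the recorded one.

    `expected` maps file path -> recorded MD5 hex digest (hypothetical format).
    """
    return [p for p, want in expected.items() if md5_of(p) != want.lower()]


def zip_is_intact(path):
    """True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

An empty list from `mismatched_files` means every listed file still matches its recorded checksum. Files that have no recorded checksum can only be spot-checked by opening or playing them.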

FreeMan · August 5, 2017

7 hours ago, Fransysco said:

The HDDs weren't what was getting hot (their fans were still spinning); it was the riser in my server (C2100), with the RAID card and Ethernet NIC on it, and the case surrounding it (seen on the SSDs, which are mounted to the case and started picking up heat from the RAID card). I actually scalded my hand when I touched the RAID card after power down.

Most of my HDDs run at 82-84°F. But thanks for the information!

Based on your subject line I understood the card was hot, but the screen shot threw me off. Rereading your OP now, I see what I missed the first time through. Pretty impressive that one failed fan could allow the case temps to climb that high! Unfortunately, not the kind of impressing you're after, I'm sure.

Fransysco · August 6, 2017

On 8/5/2017 at 1:00 PM, bjp999 said:

A new config will basically cause unRAID to forget its array configuration. For convenience there is an option to leave the existing drives in their existing slots, but unRAID has completely forgotten what the configuration was, and you are free to put them in a different order. Remove disks. Add disks. Whatever. When you start the array, no disk zeroing occurs.

When you do the new config, you have an option to say whether to "trust parity" or not. If you trust it, unRAID assumes parity is already accurate. If not, parity builds.

In your case, it would enable you to put the disk that was kicked out of the array back into the array without rebuilding it. Which is what you probably want. You can trust parity, although it will not be anywhere near 100% accurate. You'd need to run a correcting parity check, and expect a sizable number of errors. Or don't trust parity and let it rebuild. Same net result.

I'd probably just assign all the drives and leave parity out altogether initially. Then look at each drive and confirm that it appears valid, and try to play several media files on each drive, trying to pick a variety of files that were copied at different times. If you have .zip files, you could do integrity checks on them. Or if (and I think it unlikely) you have been computing MD5 or other checksums on your files, this would be a very good time to use them to verify the integrity of each file.

On 8/5/2017 at 11:15 AM, johnnie.black said:

Tools -> New config

Okay, so I ran an extended SMART test on the failed disk when I got my replacement fan, and it passed without errors. I did a new config of just the disks, ran some file tests, and everything seemed fine.

I went to build parity and it failed with a large number of errors again:

After I stopped the array a large majority of the disks are showing missing:

I captured two diagnostics bundles (I wasn't sure if it was worthwhile to get one after the array was stopped, because the disk status changed). Both were captured prior to any power off. -0817 is from before the array was stopped and -0828 is from after.

Edited February 7, 2020 by Fransysco

JorgeB · August 6, 2017

The LSI controller completely dropped out. Try a different slot if available; if it fails again, it may need replacing.

Fransysco · August 7, 2017

12 hours ago, johnnie.black said:

The LSI controller completely dropped out. Try a different slot if available; if it fails again, it may need replacing.

That's what I was thinking, based on most of the disks going away.

Which log or alert indicates the card dropping? I'd like to be able to verify it myself if it happens in a different slot, without having to post another diag and have someone else look.

JorgeB · August 7, 2017

The syslog shows all disks connected to the controller stop responding, but the confirmation is in lspci, where the controller doesn't even appear anymore. According to the syslog, this was its address:

Quote

Aug 5 19:38:39 RIAAHQ kernel: mpt3sas 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control

So in the current slot it should appear in the lspci output between these:

00:1f.3 SMBus [0c05]: Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30]
    Subsystem: QUANTA Computer Inc 82801JI (ICH10 Family) SMBus Controller [152d:8975]
    Kernel driver in use: i801_smbus
    Kernel modules: i2c_i801
05:00.0 PCI bridge [0604]: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch [111d:8018] (rev 0e)
    Kernel driver in use: pcieport
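To check this yourself next time, something like the following could be run against a saved copy of the lspci output. It is only a sketch: the helper names are made up, and it assumes lspci prints addresses without the leading `0000:` domain (its default when there is a single PCI domain), while the syslog shows the full form.

```python
import re


def pci_addresses(lspci_output: str):
    """Collect the bus:device.function address at the start of each device line."""
    pattern = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.MULTILINE)
    return pattern.findall(lspci_output)


def controller_present(lspci_output: str, address: str) -> bool:
    """True if the given PCI address shows up as a device in the lspci output."""
    # Strip a leading domain such as "0000:" so the syslog form matches lspci's.
    short = address.split(":", 1)[1] if address.count(":") == 2 else address
    return short.lower() in pci_addresses(lspci_output)


# The two neighbouring entries quoted above, with the controller missing.
listing = """\
00:1f.3 SMBus [0c05]: Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30]
    Kernel driver in use: i801_smbus
05:00.0 PCI bridge [0604]: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch [111d:8018] (rev 0e)
    Kernel driver in use: pcieport
"""
print(controller_present(listing, "0000:04:00.0"))  # address taken from the syslog line
```

Note that the address changes when the card moves to another slot, so take the new one from the mpt3sas lines in the syslog after booting.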

Fransysco · August 7, 2017

On 8/7/2017 at 3:39 AM, johnnie.black said:
The syslog shows all disks connected to the controller stop responding, but the confirmation is in lspci, where the controller doesn't even appear anymore. According to the syslog, this was its address:

So in the current slot it should appear in the lspci output between these:
00:1f.3 SMBus [0c05]: Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30]
    Subsystem: QUANTA Computer Inc 82801JI (ICH10 Family) SMBus Controller [152d:8975]
    Kernel driver in use: i801_smbus
    Kernel modules: i2c_i801
05:00.0 PCI bridge [0604]: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch [111d:8018] (rev 0e)
    Kernel driver in use: pcieport

Okay, so the different slot seems to have worked... My concern is that it's the same riser, so I guess the one slot on the card could be bad, but I have my Ethernet NIC in the old slot and it seems to be operating fine.

I'm not sure if I can trust things, if I should get a new card, maybe just a whole new server at this point, or if I'm just being paranoid.... lol

Attached diagnostics if interested.

And thanks, everyone, for the help so far!

Edited February 7, 2020 by Fransysco
