
Unstable build - Server unresponsive after a few hours of runtime


fehler404


Hey guys,

 

I just built my first unRAID Server mostly with used parts:

 

Motherboard: Supermicro - X9DR3-F

CPU: Intel Xeon E5 2670 V1 (partially defective - 2 of 4 memory channels do not work)

RAM: 32 GB Multi-bit ECC (2x 16GB)

RAID Controller: LSI SAS9211-8i (the board only has 2 SATA III ports and I wanted to use 3 SSDs)

HDDs: 2x 10TB Seagate Ironwolf (1 Parity), 1x 6TB HDD, 1x 3TB HDD, 1x 2TB HDD.

SSDs: 2x 240GB SSD as cache pool, 1x 750GB SSD as unassigned device for downloads and VM images.

Flash: 16GB Cruzer Blade as unRAID boot medium, another 16GB Cruzer Fit is plugged into the internal USB slot because I wanted to move unRAID there but have not gotten around to it yet.  

 

My problem is as follows:

 

After a few hours of uptime (I'm guessing about 12) the server stops providing the Docker services. When I log into the web GUI I can see that one of the CPU cores is at 100% usage, and the log shows BTRFS errors. I attached a full diagnostics file to this post. I also attached a SMART report for the second 240GB SSD, which wasn't included in the diagnostics file for some reason. The first 240GB SSD has some SMART issues with a retired block count of 32 and a "Reported uncorrect" attribute of 214, but both values have been stable for the last few months.
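
For reference, this is roughly how I read those values from the console (just a sketch; /dev/sdX is a placeholder for the SSD's device node, and the exact attribute names may differ by drive model):

# full SMART report for one SSD
smartctl -a /dev/sdX
# or only the two attributes mentioned above
smartctl -A /dev/sdX | grep -E 'Retired_Block_Count|Reported_Uncorrect'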

 

I'm guessing there is something wrong with either the RAID controller or the SSDs, but I'm very much a newbie at this point, so I'm hoping some of you can give me some pointers on how to tackle this problem.

 

Best regards

Thanks in advance

 

fehler404

 

tower-smart-20180312-1740.zip

tower-diagnostics-20180311-0812.zip


The SSD with SMART errors is in the diagnostics, and the SSD without SMART errors isn't. My guess is the one that isn't in there has a connection issue, and the one that is might also be a problem. In any case, your cache pool looks like it has problems, which would probably explain everything else.

 

@johnnie.black will probably be here shortly to tell you how to fix it. :)


When you first started the array, one of the SSDs was considered missing despite being available, likely due to previous problems:

 

Mar 10 11:08:57 Tower kernel: BTRFS warning (device sdh1): devid 1 uuid 1192dc0d-7149-40d2-8dc9-3a99a2afc9d0 is missing

So the pool was working with only one device, and then that one dropped offline:
 

Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: Device offlined - not ready after error recovery
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: Device offlined - not ready after error recovery
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: Device offlined - not ready after error recovery
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: [sdh] tag#0 CDB: opcode=0x28 28 00 03 e3 75 08 00 00 78 00
Mar 11 04:02:59 Tower kernel: print_req_error: I/O error, dev sdh, sector 65238280

So you were left without any online cache devices, and that's obviously not good ;)

 

Start by checking the cables, as that is the number one reason for dropped SSDs. The cache may or may not mount when you bring sdh back online; if it does, wipe the other SSD and re-add it to the pool.
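
If you want to check what the pool sees before and after re-seating the cables, something like this from the console should do (a rough sketch; /mnt/cache is the usual unRAID cache mount point, adjust if yours differs):

# list btrfs filesystems and their member devices; a dropped device shows up as missing
btrfs filesystem show
# per-device read/write/corruption error counters for the mounted cache pool
btrfs device stats /mnt/cache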


Wow, thanks for the fast response!

 

Okay, so I just checked my cables/connections and reconnected everything. Nothing seems out of the ordinary. I also tried switching the SATA connectors. After a reboot the problem still seems to exist (see screenshot). Unfortunately I don't have a spare SAS/SATA cable lying around, so I can't replace it right now.

 

Could it be that the SMART errors are a problem? I would have thought those should just be monitored and aren't that worrisome right now.

 

I don't get why there is a device warning for sdh1 and not just sdh.

 

Would it be beneficial to disconnect the drive with the SMART errors and see if that solves the stability issues?

10 minutes ago, fehler404 said:

Could it be that the SMART errors are a problem?

That SSD does show some issues; the other was missing, so there's no SMART report for it.

 

11 minutes ago, fehler404 said:

I don't get why there is a device warning for sdh1 and not just sdh

The 1 is the partition number; that's normal.

 

11 minutes ago, fehler404 said:

Would it be beneficial to disconnect the drive with the SMART errors and see if that solves the stability issues?

It would be best to replace the cables to rule them out, since the second disconnect has nothing to do with the other SSD. Alternatively, you could connect it to the onboard SATA ports instead.

21 hours ago, johnnie.black said:
21 hours ago, fehler404 said:

Would it be beneficial to disconnect the drive with the SMART errors and see if that solves the stability issues?

It would be best to replace the cables to rule them out, since the second disconnect has nothing to do with the other SSD. Alternatively, you could connect it to the onboard SATA ports instead.

 

So I ruled out the cables and the RAID controller: I used another Mini-SAS to SATA cable, used the other Mini-SAS port on the RAID controller, and also connected the SSDs directly to the onboard SATA ports - the problem remains the same :/ Every time after booting there is the same error:

 

Mar 13 17:58:38 Tower kernel: BTRFS warning (device sde1): devid 1 uuid 1192dc0d-7149-40d2-8dc9-3a99a2afc9d0 is missing

 

What should be my next move?


Before beginning, back up any important data on the cache pool; there shouldn't be a problem, but btrfs sometimes acts weird. Then:

 

-stop the array

-unassign cache2 (sdd)

-start the array, check cache still mounts

-if yes, wipe the now unassigned SSD with (full sequence sketched after this list)

blkdiscard /dev/sdd

-stop the array

-re-assign cache2

-start the array
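
For reference, the console side of that looks roughly like this (a sketch only; it assumes the unassigned SSD is still sdd and the pool is mounted at /mnt/cache - device letters can change between boots, so double-check before wiping):

# confirm the remaining cache device still mounts and which devid is reported missing
btrfs filesystem show /mnt/cache
# wipe the unassigned SSD - this destroys everything on it, so make sure sdd is the right device
blkdiscard /dev/sdd
# after re-assigning cache2 and starting the array, unRAID should re-add and balance the pool;
# progress can be checked with
btrfs balance status /mnt/cache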

 


Nice! Wiping the disk seems to have done the trick! The error is gone!

 

Thank you very much @johnnie.black!

 

I attached another diagnostic file after a reboot. I swapped the disks back to the RAID controller after wiping, so the names are different again. 

 

I hope this solves my stability issues. I will report back if there are any sudden phases of unresponsiveness again. 

tower-diagnostics-20180313-2107.zip


I only checked the last one; one of the cache SSDs is still dropping offline. Try connecting it to the onboard SATA controller instead.

 

Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: Device offlined - not ready after error recovery
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: Device offlined - not ready after error recovery
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: Device offlined - not ready after error recovery
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: [sdh] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 03 26 00 00 00 00 00 20 00 00
Mar 17 04:03:05 Tower kernel: print_req_error: I/O error, dev sdh, sector 52822016
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: rejecting I/O to offline device

The pool was already damaged and was only using this device before it dropped.
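
If you want to be sure which physical SSD that is before moving cables, you can map the device node to a serial number (a quick sketch; sdh is just whatever node the dropped device had at the time, and smartctl may fail while the device is still offline):

# show model and serial number for the device that keeps dropping
smartctl -i /dev/sdh
# or list devices by id/serial together with their current sdX nodes
ls -l /dev/disk/by-id/ | grep sdh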


Okay, so I plugged the problematic device into an onboard SATA port. After successfully booting into the web interface, the disk log showed the same error as always, so I tried going the same route and wiping the disk again. The blkdiscard command put out:

 

BLKDISCARD ioctl failed: Input/output error

I tried formatting the drive, but that did not change anything.

 

As always I attached the diagnostics file...

 

tower-diagnostics-20180318-1044.zip


After connecting both SSDs to the onboard SATA controller the server has now been running for 52 hours without a hitch. 

 

I already ordered a replacement SAS/SATA controller based on the LSI 2308 chipset.

 

Most likely I'll just keep the cache SSDs connected to the motherboard ports for the future. 

 

Is there any way to test if my old controller is faulty? Is there any other component that could've caused the dropped cache disk?

 

Thank you for your help!

