
Unstable build - Server unresponsive after a few hours of runtime


fehler404


Hey guys,

 

I just built my first unRAID Server mostly with used parts:

 

Motherboard: Supermicro - X9DR3-F

CPU: Intel Xeon E5 2670 V1 (partially defective - 2 of 4 memory channels do not work)

RAM: 32 GB Multi-bit ECC (2x 16GB)

RAID Controller: LSI SAS9211-8i (the board only has 2 SATA III ports and I wanted to use 3 SSDs)

HDDs: 2x 10TB Seagate Ironwolf (1 Parity), 1x 6TB HDD, 1x 3TB HDD, 1x 2TB HDD.

SSDs: 2x 240GB SSD as cache pool, 1x 750GB SSD as unassigned device for downloads and VM images.

Flash: 16GB Cruzer Blade as unRAID boot medium, another 16GB Cruzer Fit is plugged into the internal USB slot because I wanted to move unRAID there but have not gotten around to it yet.  

 

My problem is as follows:

 

After a few hours of uptime (I'm guessing about 12) the server stops providing the Docker services. When I log into the web GUI I can see that one of the CPU cores is at 100% usage, and the log shows BTRFS errors. I attached a full diagnostics file to this post. I also attached a SMART report for the second 240GB SSD, which wasn't included in the diagnostics file for some reason. The first 240GB SSD has some SMART issues with a retired block count of 32 and a "Reported uncorrect" attribute of 214, but both values have been stable for the last few months.
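
For reference, this is roughly how I read those values from the console (just a sketch; /dev/sdX is a placeholder for the SSD's device node, and the exact attribute names may differ by drive model):

# full SMART report for one SSD
smartctl -a /dev/sdX
# or only the two attributes mentioned above
smartctl -A /dev/sdX | grep -E 'Retired_Block_Count|Reported_Uncorrect'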

 

I'm guessing there is something wrong with either the RAID controller or the SSDs, but I'm very much a newbie at this point, so I'm hoping some of you can give me some pointers on how to tackle this problem.

 

Best regards

Thanks in advance

 

fehler404

 

tower-smart-20180312-1740.zip

tower-diagnostics-20180311-0812.zip


The SSD with SMART errors is in the diagnostics, and the SSD without SMART errors isn't. My guess is the one that isn't in there has a connection issue, and the one that is might also be a problem. In any case, your cache pool looks like it has problems, which would probably explain everything else.

 

@johnnie.black will probably be here shortly to tell you how to fix it. :)


When you first started the array, one of the SSDs was considered missing despite being available, likely due to previous problems:

 

Mar 10 11:08:57 Tower kernel: BTRFS warning (device sdh1): devid 1 uuid 1192dc0d-7149-40d2-8dc9-3a99a2afc9d0 is missing

So the pool was working with only one device, and then that one dropped offline:
 

Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: Device offlined - not ready after error recovery
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: Device offlined - not ready after error recovery
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: Device offlined - not ready after error recovery
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Mar 11 04:02:59 Tower kernel: sd 9:0:0:0: [sdh] tag#0 CDB: opcode=0x28 28 00 03 e3 75 08 00 00 78 00
Mar 11 04:02:59 Tower kernel: print_req_error: I/O error, dev sdh, sector 65238280

So you were left without any online cache devices, and that's obviously not good ;)

 

Start by checking the cables, as that is the number one reason for dropped SSDs. The cache may or may not mount when you bring sdh back online; if it does, wipe the other SSD and re-add it to the pool.
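
If you want to check what the pool sees before and after re-seating the cables, something like this from the console should do (a rough sketch; /mnt/cache is the usual unRAID cache mount point, adjust if yours differs):

# list btrfs filesystems and their member devices; a dropped device shows up as missing
btrfs filesystem show
# per-device read/write/corruption error counters for the mounted cache pool
btrfs device stats /mnt/cache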


Wow, thanks for the fast response!

 

Okay, so I just checked my cables/connections and reconnected everything. Nothing seems out of the ordinary. I also tried switching the SATA connectors. After a reboot the problem still seems to exist (see screenshot). Unfortunately I don't have a spare SAS/SATA cable lying around, so I can't replace it right now.

 

Could it be that the SMART errors are a problem? I would have thought those should just be monitored and aren't that worrisome right now.

 

I don't get why there is a device warning for sdh1 and not just sdh.

 

Would it be beneficial to disconnect the drive with the SMART errors and see if that solves the stability issues?

10 minutes ago, fehler404 said:

Could it be that the SMART errors are a problem?

That SSD does show some issues; the other was missing, so there's no SMART report for it.

 

11 minutes ago, fehler404 said:

I don't get why there is a device warning for sdh1 and not just sdh

The 1 is the partition number; that's normal.

 

11 minutes ago, fehler404 said:

Would it be beneficial to disconnect the drive with the SMART errors and see if that solves the stability issues?

It would be best to replace the cables to rule them out, since the second disconnect has nothing to do with the other SSD. Alternatively, you could connect it to the onboard SATA ports instead.

21 hours ago, johnnie.black said:
21 hours ago, fehler404 said:

Would it be beneficial to disconnect the drive with the SMART errors and see if that solves the stability issues?

It would be best to replace the cables to rule them out, since the second disconnect has nothing to do with the other SSD. Alternatively, you could connect it to the onboard SATA ports instead.

 

So I ruled out the cables and the RAID controller: I used another Mini-SAS to SATA cable, used the other Mini-SAS port on the RAID controller, and also connected the SSDs directly to the onboard SATA ports - the problem remains the same :/ Every time after booting there is the same error:

 

Mar 13 17:58:38 Tower kernel: BTRFS warning (device sde1): devid 1 uuid 1192dc0d-7149-40d2-8dc9-3a99a2afc9d0 is missing

 

What should be my next move?


Before beginning, back up any important data on the cache pool; there shouldn't be a problem, but btrfs sometimes acts weird. Then:

 

-stop the array

-unassign cache2 (sdd)

-start the array, check cache still mounts

-if yes, wipe the now unassigned SSD with (full sequence sketched after this list)

blkdiscard /dev/sdd

-stop the array

-re-assign cache2

-start the array
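
For reference, the console side of that looks roughly like this (a sketch only; it assumes the unassigned SSD is still sdd and the pool is mounted at /mnt/cache - device letters can change between boots, so double-check before wiping):

# confirm the remaining cache device still mounts and which devid is reported missing
btrfs filesystem show /mnt/cache
# wipe the unassigned SSD - this destroys everything on it, so make sure sdd is the right device
blkdiscard /dev/sdd
# after re-assigning cache2 and starting the array, unRAID should re-add and balance the pool;
# progress can be checked with
btrfs balance status /mnt/cache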

 


Nice! Wiping the disk seems to have done the trick! The error is gone!

 

Thank you very much @johnnie.black!

 

I attached another diagnostic file after a reboot. I swapped the disks back to the RAID controller after wiping, so the names are different again. 

 

I hope this solves my stability issues. I will report back if there are any sudden phases of unresponsiveness again. 

tower-diagnostics-20180313-2107.zip


I only checked the last one; one of the cache SSDs is still dropping offline. Try connecting it to the onboard SATA controller instead.

 

Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: Device offlined - not ready after error recovery
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: Device offlined - not ready after error recovery
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: Device offlined - not ready after error recovery
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: [sdh] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: [sdh] tag#0 CDB: opcode=0x93 93 08 00 00 00 00 03 26 00 00 00 00 00 20 00 00
Mar 17 04:03:05 Tower kernel: print_req_error: I/O error, dev sdh, sector 52822016
Mar 17 04:03:05 Tower kernel: sd 2:0:0:0: rejecting I/O to offline device

The pool was already damaged and was only using this device before it dropped.
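
If you want to be sure which physical SSD that is before moving cables, you can map the device node to a serial number (a quick sketch; sdh is just whatever node the dropped device had at the time, and smartctl may fail while the device is still offline):

# show model and serial number for the device that keeps dropping
smartctl -i /dev/sdh
# or list devices by id/serial together with their current sdX nodes
ls -l /dev/disk/by-id/ | grep sdh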


Okay, so I plugged the problematic device into an onboard SATA port. After successfully booting into the web interface, the disk log showed the same error as always, so I tried going the same route and wiping the disk again. The blkdiscard command put out:

 

BLKDISCARD ioctl failed: Input/output error

I tried formatting the drive, but that did not change anything.

 

As always I attached the diagnostics file...

 

tower-diagnostics-20180318-1044.zip


After connecting both SSDs to the onboard SATA controller the server has now been running for 52 hours without a hitch. 

 

I already ordered a replacement SAS/SATA controller based on the LSI 2308 chipset.

 

Most likely I'll just keep the cache SSDs connected to the motherboard ports for the future. 

 

Is there any way to test if my old controller is faulty? Is there any other component that could've caused the dropped cache disk?

 

Thank you for your help!

