Cache Disk Access LED Solid Freezing Server

mbc0 · April 2, 2019

Hi,

It may or may not be a coincidence, but around the time I changed from a single cache SSD to a 4-disk RAID10 pool, I started having problems with the server freezing (grinding to a halt) with the status LED for the SSD pool solid on. Checking the CPU usage, I can see as many as 8-10 cores (out of 32) all red at 100%. When this happens I am unable to use the server at all: VMs, Plex, and in fact all dockers lock up. After a few minutes the cores drop back down to barely registering usage, the SSD activity LED drops to a flicker, and everything returns to normal. This can happen anywhere from every few minutes to a couple of times a day, so it is incredibly difficult to pinpoint.

You can see that the top command is not reflecting the usage I am seeing in the dashboard either?

Any pointers would be really appreciated!

unraidserver-diagnostics-20190402-1139.zip

Here is the CPU usage when the issue is occurring:

image.png.a8b8b61d263894cc5761dba2890a6473.png

Here is the log when the issue is occurring (I have only just noticed the rclone errors, and I have no idea what is causing them):

Apr 2 06:47:06 UNRAIDSERVER sSMTP[7046]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 06:47:09 UNRAIDSERVER sSMTP[7046]: Sent mail for [email protected] (221 2.0.0 closing connection v17sm3354314wmc.30 - gsmtp) uid=0 username=root outbytes=969
Apr 2 06:57:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 07:13:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 07:29:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 07:45:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:01:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:17:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:32:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:48:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:03:11 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:19:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:35:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:50:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:53:15 UNRAIDSERVER sSMTP[96530]: Creating SSL connection to host
Apr 2 09:53:15 UNRAIDSERVER sSMTP[96530]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:16 UNRAIDSERVER sSMTP[96932]: Creating SSL connection to host
Apr 2 09:53:16 UNRAIDSERVER sSMTP[96932]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:18 UNRAIDSERVER sSMTP[96530]: Sent mail for [email protected] (221 2.0.0 closing connection e1sm21728952wrw.66 - gsmtp) uid=0 username=root outbytes=950
Apr 2 09:53:18 UNRAIDSERVER sSMTP[96932]: Sent mail for [email protected] (221 2.0.0 closing connection v1sm16644591wrd.47 - gsmtp) uid=0 username=root outbytes=1016
Apr 2 09:53:20 UNRAIDSERVER sSMTP[97574]: Creating SSL connection to host
Apr 2 09:53:20 UNRAIDSERVER sSMTP[97574]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:22 UNRAIDSERVER sSMTP[97574]: Sent mail for [email protected] (221 2.0.0 closing connection o5sm4536408wmc.16 - gsmtp) uid=0 username=root outbytes=1082
Apr 2 10:03:01 UNRAIDSERVER sshd[26374]: Accepted none for root from 192.168.0.12 port 50657 ssh2
Apr 2 10:03:17 UNRAIDSERVER kernel: libfuse.so[28282]: segfault at 0 ip 000014f3669a4e50 sp 00007ffda97e46c0 error 4 in libfuse.so.2.9.8[14f36699c000+2a000]
Apr 2 10:03:17 UNRAIDSERVER kernel: Code: 68 ef 00 00 00 e9 f0 f0 ff ff ff 25 62 29 23 00 68 f0 00 00 00 e9 e0 f0 ff ff ff 25 a2 21 23 00 66 90 00 00 00 00 00 00 00 00 <48> 8b 04 25 00 00 00 00 0f 0b 8b 04 25 0c 00 00 00 0f 0b 66 2e 0f
Apr 2 10:06:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 10:21:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 10:36:11 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 10:52:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 11:08:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 11:17:00 UNRAIDSERVER login[34366]: ROOT LOGIN on '/dev/pts/4'
Apr 2 11:18:33 UNRAIDSERVER nginx: 2019/04/02 11:18:33 [error] 12206#12206: *2298705 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.11.32, server: , request: "POST /webGui/include/DashboardApps.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.11.33", referrer: "http://192.168.11.33/Dashboard"
Apr 2 11:23:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 11:31:43 UNRAIDSERVER login[119781]: ROOT LOGIN on '/dev/pts/5'

Here is the dashboard when the issue is occurring.

JorgeB · April 2, 2019

There are read/write errors with cache2:

Mar 28 09:49:39 UNRAIDSERVER kernel: sd 4:0:0:0: [sdr] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
Mar 28 09:49:39 UNRAIDSERVER kernel: sd 4:0:0:0: [sdr] tag#17 CDB: opcode=0x28 28 00 03 72 94 98 00 00 08 00
Mar 28 09:49:39 UNRAIDSERVER kernel: print_req_error: I/O error, dev sdr, sector 57840792
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS error (device sds1): bdev /dev/sdr1 errs: wr 10, rd 9, flush 0, corrupt 0, gen 0
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS info (device sds1): read error corrected: ino 5907182 off 691511296 (dev /dev/sdr1 sector 57840744)
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS info (device sds1): read error corrected: ino 5907182 off 691503104 (dev /dev/sdr1 sector 57840728)

Redundancy will correct the data, but only for COW shares. NOCOW data, like the docker image, will likely get corrupted. See here for more info:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
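The per-device counters in the `BTRFS error` line above can be pulled out of a syslog with a few lines of Python. This is only a minimal sketch: the regular expression is written against the one line format quoted in this thread, and the helper name `latest_error_counts` is made up for illustration.

```python
import re

# Matches the counters btrfs prints in the kernel log, e.g.
# "... bdev /dev/sdr1 errs: wr 10, rd 9, flush 0, corrupt 0, gen 0"
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def latest_error_counts(log_lines):
    """Return {device: counters}, keeping the last matching line per device."""
    counts = {}
    for line in log_lines:
        m = BTRFS_ERRS.search(line)
        if m:
            fields = m.groupdict()
            dev = fields.pop("dev")
            counts[dev] = {name: int(value) for name, value in fields.items()}
    return counts

# Two lines copied from the log excerpt in this thread.
sample = [
    "Mar 28 09:49:39 UNRAIDSERVER kernel: print_req_error: I/O error, dev sdr, sector 57840792",
    "Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS error (device sds1): "
    "bdev /dev/sdr1 errs: wr 10, rd 9, flush 0, corrupt 0, gen 0",
]
print(latest_error_counts(sample))
```

Non-zero `wr` or `rd` values point at the device that needs its cable or the drive itself checked.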

mbc0 · April 2, 2019

ah!

Thank you so much! I will pull cache 2 and replace!

Legend!

mbc0 · April 2, 2019

What is the procedure for this? I do not have a spare SSD here at the moment!

Stop Array, change cache 2 to no device and restart array?

Also, the drive is only a month old and SMART shows OK, so I will change the cable first and check for read errors again afterwards.

JorgeB · April 2, 2019

With SSDs new cables are first thing to try.

mbc0 · April 2, 2019

3 minutes ago, johnnie.black said:

With SSDs new cables are first thing to try.

OK, I just remembered that I am using a reverse breakout cable from my motherboard SATA3 ports to my SAS backplane, so I do not have another cable. I have swapped the positions of all the drives and reseated the connectors, so I will see if the read errors stay with the same drive.

mbc0 · April 2, 2019

9 minutes ago, johnnie.black said:

With SSDs new cables are first thing to try.

can I ask please, can I do a scrub with the array online & do I check the box for "repair corrupted blocks" ?

Thank you

JorgeB · April 2, 2019

8 minutes ago, mbc0 said:

can I do a scrub with the array online

yes.

8 minutes ago, mbc0 said:

do I check the box for "repair corrupted blocks"

Yes, but as noted, NOCOW shares (the default for the system share) can't be checked or fixed.

mbc0 · April 2, 2019

I'm not sure about NOCOW, so I will have to do a bit of research. How would I tell if anything is corrupt?

Is there a "Safer" way to run a SSD pool?

I do not like the idea of possible corruption, this is why I decided to run a cache pool in the first place, to avoid errors/failures etc

JorgeB · April 2, 2019

7 minutes ago, mbc0 said:

how would I tell if anything is corrupt?

That's the problem, if the share is set to NOCOW btrfs turns off checksums, so there's no way to know.

8 minutes ago, mbc0 said:

Is there a "Safer" way to run a SSD pool?

Make sure all shares are set to COW.
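One way to see which share folders are NOCOW is the `C` (No_COW) attribute that `lsattr -d` reports for a directory. The sketch below is an illustration only: the helper names are invented, and the share paths are hypothetical examples rather than paths taken from this server. Note that the attribute on a directory only affects files created after it was set.

```python
import subprocess

def is_nocow(lsattr_line):
    """True if an `lsattr -d` output line carries the 'C' (No_COW) attribute."""
    # The line looks like "<flags> <path>"; only the flags field is inspected,
    # so a capital C in the path does not cause a false positive.
    flags, _path = lsattr_line.split(None, 1)
    return "C" in flags

def check_dirs(paths):
    """Run `lsattr -d` on each directory and report whether it is NOCOW."""
    result = {}
    for path in paths:
        out = subprocess.run(
            ["lsattr", "-d", path], capture_output=True, text=True, check=True
        ).stdout
        result[path] = is_nocow(out.strip())
    return result

# Hypothetical lsattr output lines, for illustration only.
print(is_nocow("---------------C------ /mnt/cache/system"))     # NOCOW directory
print(is_nocow("---------------------- /mnt/cache/CacheData"))  # normal COW directory
```

On the server itself, `check_dirs(["/mnt/cache/system"])` would run the real command. That path is an assumption about where the share lives.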

mbc0 · April 2, 2019

Ah, I am with you now... COW (Copy On Write) so anything on the SSD could be corrupt,

here is the report, so looks ok!

scrub status for 5442ca45-3404-4f71-bc1a-a5b3e66f9188
        scrub started at Tue Apr 2 12:47:27 2019 and finished after 00:32:24
        total bytes scrubbed: 1.59TiB with 0 errors

Why is 1.59TB scrubbed when I have 4X 480GB SSD's? (960GB Total)

and is there a way to clear the errors reported on

root@UNRAIDSERVER:~# btrfs dev stats /mnt/cache
[/dev/sde1].write_io_errs 0
[/dev/sde1].read_io_errs 0
[/dev/sde1].flush_io_errs 0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
[/dev/sdf1].write_io_errs 25
[/dev/sdf1].read_io_errs 17
[/dev/sdf1].flush_io_errs 0
[/dev/sdf1].corruption_errs 0
[/dev/sdf1].generation_errs 0
[/dev/sdd1].write_io_errs 0
[/dev/sdd1].read_io_errs 0
[/dev/sdd1].flush_io_errs 0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0

Thanks for all your help (as usual!) 🙂

JorgeB · April 2, 2019

5 minutes ago, mbc0 said:

Why is 1.59TB scrubbed when I have 4X 480GB SSD's? (960GB Total)

Because data is mirrored.
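The arithmetic behind that answer, as a rough sketch: a two-copy profile such as raid10 stores every block twice and scrub reads both copies, so the scrubbed total is about double the data actually stored. The figures are the ones quoted in this thread. The assumption that metadata is also mirrored (so the whole total is simply halved) is mine.

```python
TIB = 2**40  # bytes in one TiB (binary unit, as printed by btrfs)
GB = 10**9   # bytes in one GB (decimal unit, as printed on the SSD label)

scrubbed_bytes = 1.59 * TIB   # "total bytes scrubbed" from the scrub report
copies = 2                    # raid10 keeps two copies of every block

stored_bytes = scrubbed_bytes / copies
raw_capacity_gb = 4 * 480                   # four 480 GB SSDs
usable_capacity_gb = raw_capacity_gb / copies

print(f"data stored once: {stored_bytes / GB:.1f} GB")
print(f"usable capacity:  {usable_capacity_gb:.0f} GB")
```

So the 1.59 TiB corresponds to roughly 874 GB of data stored once, which fits inside the 960 GB usable capacity.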

5 minutes ago, mbc0 said:

and is there a way to clear the errors reported on

There is, it's on the FAQ entry I linked above.
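For keeping an eye on the counters, the `btrfs dev stats` output can be reduced to just the devices with non-zero values. This is a minimal sketch with an invented helper name, written against the output format pasted earlier in this thread. For clearing the counters, btrfs-progs documents a `-z` (`--reset`) option for `btrfs device stats`. Whether that is the exact procedure in the linked FAQ is an assumption.

```python
import re

# One counter per line, e.g. "[/dev/sdf1].write_io_errs 25"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")

def devices_with_errors(stats_output):
    """Return {device: {counter: value}} for every non-zero counter."""
    bad = {}
    for line in stats_output.splitlines():
        m = STAT_LINE.match(line.strip())
        if m and int(m.group("value")) > 0:
            bad.setdefault(m.group("dev"), {})[m.group("name")] = int(m.group("value"))
    return bad

# Excerpt of the output posted earlier in this thread.
sample_stats = """\
[/dev/sde1].write_io_errs 0
[/dev/sde1].read_io_errs 0
[/dev/sdf1].write_io_errs 25
[/dev/sdf1].read_io_errs 17
[/dev/sdf1].flush_io_errs 0
"""
print(devices_with_errors(sample_stats))
```

If the same device shows higher numbers on a later run, the errors are still occurring. If the numbers stay flat, they are historical.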

mbc0 · April 2, 2019

I am obviously far too stupid,

I have read both pages you linked to and been through the main unRAID 6 FAQs searching for "btrfs error count" and "clear errors". I guess the answer I am looking for uses different terminology than what I know or understand, so I will have to leave the error counts as they are and write them down.

JorgeB · April 2, 2019

It's here:

mbc0 · April 2, 2019

Thank you @johnnie.black Sorry to be such a pain...

After all this, the error count has not increased, but I still have the same issue! The server has become unusable. Any ideas?

JorgeB · April 2, 2019

Try shutting down dockers/VMs. It might also be a good idea to recreate the docker image, since there could be undetectable corruption, though the same could be true for vdisks if they are in NOCOW shares.

mbc0 · April 2, 2019

Ok, I will do that... Thank you,

may take some time to find this problem as it can be every few minutes or a couple of times a day!

mbc0 · April 2, 2019

Ah! I think I have found what is causing it! I don't know why it is happening but I can make it happen on demand!

I have lots of automation and I have just forced it by moving some data from my Desktop to the server...

I can copy at 400 MB/s+ using my 10GbE cards, which as expected puts a demand on the drives, but I never had an issue when using a single cache drive. If I copy large amounts of data, everything is fine at first (400 MB/s+ and low CPU usage), then the speed drops literally to 0, everything locks up, and CPU usage goes through the roof. If I pause the file transfer, everything catches up and unRAID, dockers, VMs etc. return to normal. If I then resume the copy, the same thing happens again, and the VMs, Plex, and dockers all lock up!

Is there a way to test cache performance? this is clearly the issue!

M/B: Gigabyte Technology Co., Ltd. - X399 DESIGNARE EX-CF

CPU: AMD Ryzen Threadripper 2950X 16-Core @ 3700

HVM: Enabled

IOMMU: Enabled

Cache: 1536 kB, 8192 kB, 32768 kB

Memory: 32 GB (max. installable capacity 512 GB)

Network: bond0: transmit load balancing, mtu 1500
eth0: 1000 Mb/s, full duplex, mtu 1500
eth1: 1000 Mb/s, full duplex, mtu 1500
eth2: 10000 Mb/s, full duplex, mtu 9000

Kernel: Linux 4.18.20-unRAID x86_64

OpenSSL: 1.1.1a

Uptime:

JorgeB · April 2, 2019

1 hour ago, mbc0 said:

Is there a way to test cache performance? this is clearly the issue!

Copying from your desktop is a good test. Similar issues have been reported before, but I was never able to reproduce them, and I have 11 SSDs in my cache pool. If you temporarily shut down all dockers/VMs and just do a copy, is it the same?
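To take the network and the desktop out of the picture, a local sequential write can also be timed chunk by chunk on the server itself. The following is only a sketch under stated assumptions: the function name, the chunk sizes, and the 50 MB/s stall threshold are all invented for illustration, and the demo writes to a temporary directory rather than to the pool.

```python
import os
import tempfile
import time

def timed_write(path, total_mb=64, chunk_mb=4, stall_mb_s=50.0):
    """Write total_mb of random data in chunk_mb pieces, fsync after each one,
    and return (per-chunk MB/s list, indices of chunks slower than stall_mb_s)."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # random data, so compression cannot skew results
    speeds = []
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the chunk to disk before stopping the clock
            elapsed = time.perf_counter() - start
            speeds.append(chunk_mb / elapsed if elapsed > 0 else float("inf"))
    stalls = [i for i, s in enumerate(speeds) if s < stall_mb_s]
    return speeds, stalls

# Small demo run in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    speeds, stalls = timed_write(os.path.join(tmp, "write_test.bin"), total_mb=8, chunk_mb=2)
    print(f"slowest chunk: {min(speeds):.0f} MB/s, chunks below threshold: {len(stalls)}")
```

For a real test, the path would point at a file on the cache pool (for example somewhere under `/mnt/cache`, which is an assumption about the mount point) and `total_mb` would need to be far larger than the server's RAM so that the sustained write speed of the SSDs is what gets measured.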

mbc0 · April 3, 2019

Hi @johnnie.black

I tried with the vm & docker engine disabled and got the same problem, below is a link to a video of my problem which is easier than me explaining it!

Many Thanks For Your Time!

https://1drv.ms/v/s!AjnJhtmhYMJlh5R_iV33akLpvaRraw

JorgeB · April 3, 2019

I see the problem, but I have no idea what's causing it. As mentioned, I could never reproduce it, but you're not the first to complain of that issue when using a pool.

mbc0 · April 3, 2019

Wow, that's me screwed then! If you don't know, nobody does!! 😂

Are you able to transfer at constant speeds to your pool?

JorgeB · April 3, 2019

Are you able to transfer at constant speeds to your pool?

Yes, speed varies a little, but it never gets lower than around 700MB/s, and most importantly it never freezes for a few seconds like yours is doing.

1998188002_Screenshot2019-04-0313_26_18.png.a478ee3d872886f80f98bedc78e894fd.png

mbc0 · April 3, 2019

@johnnie.black Can I ask two things please?

1, what connection/disks are you using to get that kind of speed? I can not get above 500 MB/s with 4X Kingston SSD's in RAID 10

2, I understand you don't know what is causing this issue, but if it were you that had this problem what route would you take? I have 8 SATA3 connectors on my Motherboard and have tried them all with no difference, I cannot really use the expander to connect the 4 SSD's as I would lose the trim features.

Many Thanks

JorgeB · April 4, 2019

Currently using 11 Crucial MX500 in raid5.

In your case, the first thing I would try is faster SSDs. You're using TLC models, and they can get pretty slow during sustained writes. If you can, get some 3D TLC models to test, like the 860 EVO, MX500, WD Blue 3D, etc. Even 4 fast SSDs in raid10 won't give more than about 500MB/s, but maybe they can sustain that without those freezes.
