
Cache Disk Access LED Solid Freezing Server


mbc0


Hi, 

 

It may or may not be a coincidence, but around the time I changed from a single cache SSD to a 4-disk RAID10 pool, the server started freezing (grinding to a halt) with the status LED for the SSD pool solid on. Checking the CPU usage, I can see as many as 8-10 cores (out of 32) all red at 100%. When this happens I am unable to use the server at all: VMs, Plex and in fact all dockers lock up. After a few minutes the cores drop down to barely registering any usage, the SSD activity LED drops to a flicker, and everything returns to normal. This can happen anywhere from every few minutes to a couple of times a day, so it is incredibly difficult to pinpoint.

 

You can also see that the top command does not reflect the usage I am seeing in the dashboard.
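One way to check whether the busy cores are waiting on I/O rather than running processes is to compare two samples of /proc/stat, where the fifth value on each cpuN line is time spent in iowait. A minimal Python sketch, assuming a Linux kernel; the function names are hypothetical, and whether the dashboard counts iowait as usage is an assumption here, not something confirmed in this thread:

```python
def parse_proc_stat(text):
    """Return {cpu_name: list of jiffy counters} for each per-core 'cpuN' line."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate 'cpu' line and everything that is not a CPU line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


def iowait_percent(before, after):
    """Percentage of elapsed time each core spent in iowait between two samples."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        # Sum user..steal only; guest time is already counted inside user.
        total = sum(deltas[:8])
        result[cpu] = 100.0 * deltas[4] / total if total > 0 else 0.0
    return result
```

To use it, read /proc/stat twice a few seconds apart while the freeze is happening, pass both texts through `parse_proc_stat`, and hand the results to `iowait_percent`.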

 

Any pointers would be really appreciated!

 

unraidserver-diagnostics-20190402-1139.zip

 

Here is the CPU usage when the issue is occurring:

 

[Screenshot: CPU usage]

 

 

Here is the log when the issue is occurring (I have only just noticed the rclone errors and have no idea what is causing them):

 

Apr 2 06:47:06 UNRAIDSERVER sSMTP[7046]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 06:47:09 UNRAIDSERVER sSMTP[7046]: Sent mail for [email protected] (221 2.0.0 closing connection v17sm3354314wmc.30 - gsmtp) uid=0 username=root outbytes=969
Apr 2 06:57:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 07:13:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 07:29:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 07:45:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:01:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:17:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:32:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 08:48:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:03:11 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:19:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:35:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:50:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 09:53:15 UNRAIDSERVER sSMTP[96530]: Creating SSL connection to host
Apr 2 09:53:15 UNRAIDSERVER sSMTP[96530]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:16 UNRAIDSERVER sSMTP[96932]: Creating SSL connection to host
Apr 2 09:53:16 UNRAIDSERVER sSMTP[96932]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:18 UNRAIDSERVER sSMTP[96530]: Sent mail for [email protected] (221 2.0.0 closing connection e1sm21728952wrw.66 - gsmtp) uid=0 username=root outbytes=950
Apr 2 09:53:18 UNRAIDSERVER sSMTP[96932]: Sent mail for [email protected] (221 2.0.0 closing connection v1sm16644591wrd.47 - gsmtp) uid=0 username=root outbytes=1016
Apr 2 09:53:20 UNRAIDSERVER sSMTP[97574]: Creating SSL connection to host
Apr 2 09:53:20 UNRAIDSERVER sSMTP[97574]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:22 UNRAIDSERVER sSMTP[97574]: Sent mail for [email protected] (221 2.0.0 closing connection o5sm4536408wmc.16 - gsmtp) uid=0 username=root outbytes=1082
Apr 2 10:03:01 UNRAIDSERVER sshd[26374]: Accepted none for root from 192.168.0.12 port 50657 ssh2
Apr 2 10:03:17 UNRAIDSERVER kernel: libfuse.so[28282]: segfault at 0 ip 000014f3669a4e50 sp 00007ffda97e46c0 error 4 in libfuse.so.2.9.8[14f36699c000+2a000]
Apr 2 10:03:17 UNRAIDSERVER kernel: Code: 68 ef 00 00 00 e9 f0 f0 ff ff ff 25 62 29 23 00 68 f0 00 00 00 e9 e0 f0 ff ff ff 25 a2 21 23 00 66 90 00 00 00 00 00 00 00 00 <48> 8b 04 25 00 00 00 00 0f 0b 8b 04 25 0c 00 00 00 0f 0b 66 2e 0f 
Apr 2 10:06:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 10:21:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 10:36:11 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 10:52:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 11:08:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 11:17:00 UNRAIDSERVER login[34366]: ROOT LOGIN on '/dev/pts/4'
Apr 2 11:18:33 UNRAIDSERVER nginx: 2019/04/02 11:18:33 [error] 12206#12206: *2298705 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.11.32, server: , request: "POST /webGui/include/DashboardApps.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.11.33", referrer: "http://192.168.11.33/Dashboard"
Apr 2 11:23:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
Apr 2 11:31:43 UNRAIDSERVER login[119781]: ROOT LOGIN on '/dev/pts/5'
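The repeated crond message suggests the schedule part of the user.scripts entry is malformed: the logged line shows only four schedule fields before the command, and the first of them is `?`. A minimal Python sketch of a sanity check, assuming the classic five-field crontab layout (minute, hour, day of month, month, day of week) and a deliberately simplified character set; `check_cron_line` is a hypothetical helper, not part of Unraid or the User Scripts plugin:

```python
import re

# Simplified: digits, '*', ',', '/' and '-' only. Real crond implementations
# accept more (for example month and weekday names), so this is a rough check.
FIELD_RE = re.compile(r"^[0-9*,/-]+$")


def check_cron_line(line):
    """Return a list of problems found in the schedule part of a crontab line."""
    schedule = line.split()[:5]
    if len(schedule) < 5:
        return ["fewer than five fields"]
    problems = []
    for position, field in enumerate(schedule, start=1):
        if not FIELD_RE.match(field):
            problems.append(
                f"field {position} is not a valid schedule value: {field!r}"
            )
    return problems
```

Run against the line from the log, this flags field 1 (`?`) and field 5 (the start of the command, which means a schedule field is missing).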

 

Here is the dashboard when the issue is occurring:

 

 

 

[Screenshots: dashboard while the issue is occurring]

Link to comment

There are read/write errors with cache2:

Mar 28 09:49:39 UNRAIDSERVER kernel: sd 4:0:0:0: [sdr] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
Mar 28 09:49:39 UNRAIDSERVER kernel: sd 4:0:0:0: [sdr] tag#17 CDB: opcode=0x28 28 00 03 72 94 98 00 00 08 00
Mar 28 09:49:39 UNRAIDSERVER kernel: print_req_error: I/O error, dev sdr, sector 57840792
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS error (device sds1): bdev /dev/sdr1 errs: wr 10, rd 9, flush 0, corrupt 0, gen 0
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS info (device sds1): read error corrected: ino 5907182 off 691511296 (dev /dev/sdr1 sector 57840744)
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS info (device sds1): read error corrected: ino 5907182 off 691503104 (dev /dev/sdr1 sector 57840728)

Redundancy will correct the data, but only for COW shares; NOCOW shares, like docker, will likely get corrupted. See here for more info:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

 

Link to comment

What is the procedure for this? I do not have a spare SSD here at the moment!

 

Stop Array, change cache 2 to no device and restart array?

 

Also, the drive is only a month old and SMART shows OK, so I will change the cable first and check for read errors again afterwards.

 

Link to comment
3 minutes ago, johnnie.black said:

With SSDs new cables are first thing to try.

OK, I just remembered that I am using a reverse breakout cable from my motherboard SATA3 ports to my SAS backplane, so I do not have another cable. I have swapped the positions of all the drives and reseated the connectors, so I will see if the read errors remain on the same drive.

 

 

Link to comment

I am not sure about NOCOW and will have to do a bit of research. How would I tell if anything is corrupt?

 

Is there a "Safer" way to run a SSD pool? 

 

I do not like the idea of possible corruption. That is why I decided to run a cache pool in the first place: to avoid errors, failures, etc.

Link to comment
7 minutes ago, mbc0 said:

how would I tell if anything is corrupt?

That's the problem: if the share is set to NOCOW, btrfs turns off checksums, so there's no way to know.

 

8 minutes ago, mbc0 said:

Is there a "Safer" way to run a SSD pool? 

Make sure all shares are set to COW.
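To see which directories are currently NOCOW, the `C` flag in `lsattr` output can be checked; on btrfs it marks files and directories that have copy-on-write disabled. A minimal Python sketch, assuming `lsattr` is available on the server; the helper names are hypothetical:

```python
import subprocess


def has_nocow_flag(lsattr_line):
    """True if the attribute column of one lsattr output line contains 'C'."""
    attributes = lsattr_line.split(None, 1)[0]
    return "C" in attributes


def nocow_directories(paths):
    """Run `lsattr -d` on each path and return the ones flagged NOCOW."""
    flagged = []
    for path in paths:
        output = subprocess.run(
            ["lsattr", "-d", path], capture_output=True, text=True, check=True
        ).stdout
        if output.strip() and has_nocow_flag(output.splitlines()[0]):
            flagged.append(path)
    return flagged
```

Pass it the share directories under /mnt/cache. Note that the flag on a directory only applies to files created after it was set, so existing files keep whatever setting they were created with.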

Link to comment

Ah, I am with you now... COW (Copy On Write), so anything on the SSD could be corrupt.

 

Here is the scrub report, which looks OK!

 

scrub status for 5442ca45-3404-4f71-bc1a-a5b3e66f9188
        scrub started at Tue Apr 2 12:47:27 2019 and finished after 00:32:24
        total bytes scrubbed: 1.59TiB with 0 errors

 

Why is 1.59TiB scrubbed when I have 4x 480GB SSDs (960GB total)?
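A likely explanation, offered as an assumption rather than something confirmed in this thread: scrub reads every copy on every device, and raid10 keeps two copies of everything, so the scrub total reflects raw bytes rather than usable space. A quick Python check of the arithmetic, using only the numbers quoted above:

```python
TIB = 2**40
GB = 10**9


def logical_bytes(scrubbed_bytes, copies):
    """Data size implied by a scrub total when every block is stored `copies` times."""
    return scrubbed_bytes / copies


scrubbed = 1.59 * TIB               # "total bytes scrubbed" from the report
raw_capacity = 4 * 480 * GB         # four 480GB SSDs
usable_capacity = raw_capacity / 2  # raid10 keeps two copies of everything

data = logical_bytes(scrubbed, copies=2)
print(f"raw capacity:    {raw_capacity / GB:.0f} GB")
print(f"usable capacity: {usable_capacity / GB:.0f} GB")
print(f"data on pool:    {data / GB:.0f} GB")
```

Under that assumption the pool holds roughly 874 GB of data and metadata, which fits within the 960 GB of usable space.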

 

And is there a way to clear the errors reported here?

 

root@UNRAIDSERVER:~# btrfs dev stats /mnt/cache
[/dev/sde1].write_io_errs    0
[/dev/sde1].read_io_errs     0
[/dev/sde1].flush_io_errs    0
[/dev/sde1].corruption_errs  0
[/dev/sde1].generation_errs  0
[/dev/sdf1].write_io_errs    25
[/dev/sdf1].read_io_errs     17
[/dev/sdf1].flush_io_errs    0
[/dev/sdf1].corruption_errs  0
[/dev/sdf1].generation_errs  0
[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  0
[/dev/sdd1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
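These counters are cumulative, so old errors keep showing until they are reset; `btrfs dev stats -z /mnt/cache` prints the counters and then resets them to zero. To watch for new errors after a cable change, the output can be parsed. A minimal Python sketch, with `parse_dev_stats` and `devices_with_errors` as hypothetical helper names:

```python
import re

# Matches lines such as "[/dev/sdf1].write_io_errs    25"
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Parse `btrfs dev stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            stats.setdefault(match["dev"], {})[match["counter"]] = int(match["value"])
    return stats


def devices_with_errors(stats):
    """Return only the devices that have at least one non-zero counter."""
    return {
        dev: {name: count for name, count in counters.items() if count}
        for dev, counters in stats.items()
        if any(counters.values())
    }
```

For the output above, this reports only /dev/sdf1, with 25 write errors and 17 read errors.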
 

Thanks for all your help (as usual!)  🙂

Link to comment

I am obviously far too stupid.

 

I have read both pages you linked to and been through the main unRAID 6 FAQs searching for btrfs error count / clear errors. I guess the answer I am looking for uses different terminology than what I know or understand, so I will have to leave the error counts as they are and write them down.

Link to comment

Ah! I think I have found what is causing it!  I don't know why it is happening but I can make it happen on demand!

 

I have lots of automation, and I have just forced it by moving some data from my desktop to the server...

 

I can copy at 400MB/s+ using my 10GbE cards, which as expected puts a demand on the drives, but I never had an issue when using a single cache drive. If I copy large amounts of data, everything is fine at first (400MB/s+ and low CPU usage), then the speed drops literally to 0, everything locks up and CPU usage goes through the roof. If I pause the file transfer, everything catches up and unRAID, dockers, VMs, etc. return to normal. If I resume the file copy, the same happens again, making the VMs, Plex and dockers all lock up!

 

Is there a way to test cache performance? this is clearly the issue!
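One way to measure this independently of the network is to write a large file directly on the server and record throughput per interval, so a drop towards zero shows up clearly. A minimal Python sketch; `sustained_write_test` is a hypothetical helper, the sizes are placeholders, and it writes and then deletes a real file, so point it at a scratch file name on the pool:

```python
import os
import time


def sustained_write_test(path, total_mb=1024, chunk_mb=16, report_every_mb=256):
    """Write `total_mb` of random data to `path`; return MB/s for each interval."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # incompressible data
    rates = []
    written_mb = 0
    interval_mb = 0
    interval_start = time.monotonic()
    try:
        with open(path, "wb") as handle:
            while written_mb < total_mb:
                handle.write(chunk)
                written_mb += chunk_mb
                interval_mb += chunk_mb
                if interval_mb >= report_every_mb or written_mb >= total_mb:
                    handle.flush()
                    # Force data to disk so the page cache does not hide stalls.
                    os.fsync(handle.fileno())
                    elapsed = time.monotonic() - interval_start
                    rates.append(interval_mb / elapsed if elapsed > 0 else float("inf"))
                    interval_mb = 0
                    interval_start = time.monotonic()
    finally:
        # Always clean up the test file, even if the write fails part-way.
        if os.path.exists(path):
            os.remove(path)
    return rates
```

For example, `sustained_write_test("/mnt/cache/write_test.bin", total_mb=20480)` writes 20 GB and returns one rate per 256 MB; a healthy pool should give roughly steady numbers, while the freeze described above would show up as intervals with very low rates.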

 

M/B: Gigabyte Technology Co., Ltd. - X399 DESIGNARE EX-CF

CPU: AMD Ryzen Threadripper 2950X 16-Core @ 3700

HVM: Enabled

IOMMU: Enabled

Cache: 1536 kB, 8192 kB, 32768 kB

Memory: 32 GB (max. installable capacity 512 GB)

Network: bond0: transmit load balancing, mtu 1500 
 eth0: 1000 Mb/s, full duplex, mtu 1500 
 eth1: 1000 Mb/s, full duplex, mtu 1500 
 eth2: 10000 Mb/s, full duplex, mtu 9000

Kernel: Linux 4.18.20-unRAID x86_64

OpenSSL: 1.1.1a

Uptime: 

Link to comment
1 hour ago, mbc0 said:

Is there a way to test cache performance? this is clearly the issue!

Copying from your desktop is a good test. Similar issues have been reported before, but I was never able to reproduce them, and I have 11 SSDs in my cache pool. If you temporarily shut down all dockers/VMs and just do a copy, is it the same?

Link to comment

@johnnie.black Can I ask two things please?

 

1. What connection/disks are you using to get that kind of speed? I cannot get above 500 MB/s with 4x Kingston SSDs in RAID 10.

 

2. I understand you don't know what is causing this issue, but if it were you that had this problem, what route would you take? I have 8 SATA3 connectors on my motherboard and have tried them all with no difference. I cannot really use the expander to connect the 4 SSDs, as I would lose the trim features.

 

Many Thanks

Link to comment

Currently using 11 Crucial MX500 in raid5.

 

In your case, the first thing I would try is faster SSDs. You're using TLC models, which can get pretty slow during sustained writes. If you can, get some 3D TLC models to test, like the 860 EVO, MX500, WD Blue 3D, etc. Even 4 fast SSDs in raid10 won't give more than about 500MB/s, but maybe they can sustain that without those freezes.

Link to comment

Archived

This topic is now archived and is closed to further replies.
