mbc0 Posted April 2, 2019 Share Posted April 2, 2019
Hi,
It may be a coincidence, it may not, but around the time of changing from a single cache SSD to a 4-disk RAID10 pool I started having problems with the server freezing (grinding to a halt), with the LED status light for the SSD pool solid on. Checking the CPU usage I can see as many as 8-10 cores (out of 32) all red at 100%. When this happens I am unable to use the server at all - VMs lock up, Plex too (all dockers in fact) - and after a few minutes the cores drop down to barely registering any usage, the SSD activity LED drops to a flicker and everything comes back to normal. This can happen anywhere from every few minutes to a couple of times a day, so it is incredibly difficult to pinpoint. You can also see that the top command is not reflecting the usage I am seeing in the dashboard. Any pointers would be really appreciated!
unraidserver-diagnostics-20190402-1139.zip
Here is the CPU usage when the issue is occurring.
Here is the log when the issue is occurring (I have only just noticed the rclone errors - I have no idea what is causing them):

Apr 2 06:47:06 UNRAIDSERVER sSMTP[7046]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 06:47:09 UNRAIDSERVER sSMTP[7046]: Sent mail for [email protected] (221 2.0.0 closing connection v17sm3354314wmc.30 - gsmtp) uid=0 username=root outbytes=969
Apr 2 06:57:01 UNRAIDSERVER crond[2604]: failed parsing crontab for user root: ? * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
[the same crond parse error repeats roughly every 15-16 minutes from here up to 11:23:01; repeats omitted]
Apr 2 09:53:15 UNRAIDSERVER sSMTP[96530]: Creating SSL connection to host
Apr 2 09:53:15 UNRAIDSERVER sSMTP[96530]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:16 UNRAIDSERVER sSMTP[96932]: Creating SSL connection to host
Apr 2 09:53:16 UNRAIDSERVER sSMTP[96932]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:18 UNRAIDSERVER sSMTP[96530]: Sent mail for [email protected] (221 2.0.0 closing connection e1sm21728952wrw.66 - gsmtp) uid=0 username=root outbytes=950
Apr 2 09:53:18 UNRAIDSERVER sSMTP[96932]: Sent mail for [email protected] (221 2.0.0 closing connection v1sm16644591wrd.47 - gsmtp) uid=0 username=root outbytes=1016
Apr 2 09:53:20 UNRAIDSERVER sSMTP[97574]: Creating SSL connection to host
Apr 2 09:53:20 UNRAIDSERVER sSMTP[97574]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 2 09:53:22 UNRAIDSERVER sSMTP[97574]: Sent mail for [email protected] (221 2.0.0 closing connection o5sm4536408wmc.16 - gsmtp) uid=0 username=root outbytes=1082
Apr 2 10:03:01 UNRAIDSERVER sshd[26374]: Accepted none for root from 192.168.0.12 port 50657 ssh2
Apr 2 10:03:17 UNRAIDSERVER kernel: libfuse.so[28282]: segfault at 0 ip 000014f3669a4e50 sp 00007ffda97e46c0 error 4 in libfuse.so.2.9.8[14f36699c000+2a000]
Apr 2 10:03:17 UNRAIDSERVER kernel: Code: 68 ef 00 00 00 e9 f0 f0 ff ff ff 25 62 29 23 00 68 f0 00 00 00 e9 e0 f0 ff ff ff 25 a2 21 23 00 66 90 00 00 00 00 00 00 00 00 <48> 8b 04 25 00 00 00 00 0f 0b 8b 04 25 0c 00 00 00 0f 0b 66 2e 0f
Apr 2 11:17:00 UNRAIDSERVER login[34366]: ROOT LOGIN on '/dev/pts/4'
Apr 2 11:18:33 UNRAIDSERVER nginx: 2019/04/02 11:18:33 [error] 12206#12206: *2298705 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.11.32, server: , request: "POST /webGui/include/DashboardApps.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.11.33", referrer: "http://192.168.11.33/Dashboard"
Apr 2 11:31:43 UNRAIDSERVER login[119781]: ROOT LOGIN on '/dev/pts/5'

Here is the dashboard when the issue is occurring.
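A side note on the rclone errors in the log above: the repeated `failed parsing crontab` entries suggest the custom schedule saved for the User Scripts rclone entry starts with a `?`, which is a Quartz-style token that standard cron does not accept, so crond likely rejects the line (and the script never runs) each time it re-reads the crontab. A standard five-field schedule should fix it; the 15-minute interval below is only a guess at the intent:

```
# Custom schedule for the User Scripts plugin: five standard cron fields
# (minute hour day-of-month month day-of-week), no "?" tokens.
# "*/15 * * * *" = every 15 minutes; adjust as needed.
*/15 * * * * /usr/local/emhttp/plugins/user.scripts/startCustom.php /boot/config/plugins/user.scripts/scripts/rclone_custom_plugin/script > /dev/null 2>&1
```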
JorgeB Posted April 2, 2019 Share Posted April 2, 2019
There are read/write errors on cache2:

Mar 28 09:49:39 UNRAIDSERVER kernel: sd 4:0:0:0: [sdr] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
Mar 28 09:49:39 UNRAIDSERVER kernel: sd 4:0:0:0: [sdr] tag#17 CDB: opcode=0x28 28 00 03 72 94 98 00 00 08 00
Mar 28 09:49:39 UNRAIDSERVER kernel: print_req_error: I/O error, dev sdr, sector 57840792
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS error (device sds1): bdev /dev/sdr1 errs: wr 10, rd 9, flush 0, corrupt 0, gen 0
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS info (device sds1): read error corrected: ino 5907182 off 691511296 (dev /dev/sdr1 sector 57840744)
Mar 28 09:49:39 UNRAIDSERVER kernel: BTRFS info (device sds1): read error corrected: ino 5907182 off 691503104 (dev /dev/sdr1 sector 57840728)

Redundancy will correct the data, but only for COW shares; NOCOW data like the docker image will likely get corrupted. See here for more info:
https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019 ah! Thank you so much! I will pull cache 2 and replace! Legend!
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019 What is the procedure for this? I do not have a spare SSD here at the moment! Stop the array, change cache 2 to "no device" and restart the array? Also, the drive is only a month old and SMART shows OK, so I will change the cable first and check for read errors again after.
JorgeB Posted April 2, 2019 Share Posted April 2, 2019 With SSDs, new cables are the first thing to try.
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019
3 minutes ago, johnnie.black said: With SSDs new cables are first thing to try.
OK, just remembered that I am using a reverse breakout cable from my motherboard SATA3 ports to my SAS backplane, so I do not have another cable. I have swapped the positions of all the drives and reseated the connectors, so I will see if the read errors stay with the same drive.
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019
9 minutes ago, johnnie.black said: With SSDs new cables are first thing to try.
Can I ask, please: can I do a scrub with the array online, and do I check the box for "repair corrupted blocks"? Thank you
JorgeB Posted April 2, 2019 Share Posted April 2, 2019
8 minutes ago, mbc0 said: can I do a scrub with the array online
Yes.
8 minutes ago, mbc0 said: do I check the box for "repair corrupted blocks"
Yes, but as noted, NOCOW shares (the default for the system share) can't be checked or fixed.
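For reference, the same scrub can also be run from the console; this is a sketch assuming the pool is mounted at /mnt/cache (the usual Unraid mount point):

```
# -B runs the scrub in the foreground and prints a summary when done;
# without -B it runs in the background.
btrfs scrub start -B /mnt/cache

# For a background scrub, check progress and results with:
btrfs scrub status /mnt/cache
```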
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019 Not sure about NOCOW, I will have to do a bit of research. How would I tell if anything is corrupt? Is there a "safer" way to run an SSD pool? I do not like the idea of possible corruption; avoiding errors/failures is why I decided to run a cache pool in the first place.
JorgeB Posted April 2, 2019 Share Posted April 2, 2019
7 minutes ago, mbc0 said: how would I tell if anything is corrupt?
That's the problem: if the share is set to NOCOW, btrfs turns off checksums, so there's no way to know.
8 minutes ago, mbc0 said: Is there a "Safer" way to run a SSD pool?
Make sure all shares are set to COW.
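One quick way to see which top-level folders on the pool are NOCOW is to check the attribute flags; a sketch, again assuming the /mnt/cache mount point (the share names in the sample output are made up):

```
# Directories created NOCOW carry a "C" flag; new files inherit it from
# their directory, so a "C" on a share folder means files written there
# skip btrfs checksums.
lsattr -d /mnt/cache/*/

# Hypothetical output:
#   ---------------C---- /mnt/cache/system/
#   -------------------- /mnt/cache/media/
```

Note that removing the flag (`chattr -C`) only affects files created afterwards; an existing vdisk stays NOCOW until it is copied into a fresh COW directory.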
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019
Ah, I am with you now... COW (Copy On Write), so anything on the SSDs could be corrupt. Here is the report, so it looks OK!

scrub status for 5442ca45-3404-4f71-bc1a-a5b3e66f9188
scrub started at Tue Apr 2 12:47:27 2019 and finished after 00:32:24
total bytes scrubbed: 1.59TiB with 0 errors

Why is 1.59TiB scrubbed when I have 4x 480GB SSDs (960GB total)? And is there a way to clear the errors reported here:

root@UNRAIDSERVER:~# btrfs dev stats /mnt/cache
[/dev/sde1].write_io_errs 0
[/dev/sde1].read_io_errs 0
[/dev/sde1].flush_io_errs 0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
[/dev/sdf1].write_io_errs 25
[/dev/sdf1].read_io_errs 17
[/dev/sdf1].flush_io_errs 0
[/dev/sdf1].corruption_errs 0
[/dev/sdf1].generation_errs 0
[/dev/sdd1].write_io_errs 0
[/dev/sdd1].read_io_errs 0
[/dev/sdd1].flush_io_errs 0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0

Thanks for all your help (as usual!) 🙂
JorgeB Posted April 2, 2019 Share Posted April 2, 2019
5 minutes ago, mbc0 said: Why is 1.59TB scrubbed when I have 4X 480GB SSD's? (960GB Total)
Because the data is mirrored.
5 minutes ago, mbc0 said: and is there a way to clear the errors reported on
There is; it's in the FAQ entry I linked above.
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019 I am obviously far too stupid - I have read both pages you linked to and been through the main unRAID 6 FAQ searching for "btrfs error count" / "clear errors". I guess the answer I am looking for uses different terminology than what I know or understand. I will have to leave the error counts as they are and write them down.
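For anyone searching later: the reset being referred to is most likely the `-z` flag on the same command already run above (this is standard btrfs tooling rather than anything Unraid-specific):

```
# Print the per-device error counters and zero them afterwards (-z).
# Run it after fixing the cable, then re-check later: any counter that
# climbs again points at a still-bad link or drive.
btrfs dev stats -z /mnt/cache
```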
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019 Thank you @johnnie.black. Sorry to be such a pain... After all this, the error count has not increased, but I still have the same issue! The server has become unusable... any ideas?
JorgeB Posted April 2, 2019 Share Posted April 2, 2019 Try shutting down dockers/VMs. It might also be a good idea to recreate the docker image, since there could be undetectable corruption; the same could be true for vdisks if they are in NOCOW shares.
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019 OK, I will do that... Thank you. It may take some time to pin down this problem, as it can happen every few minutes or a couple of times a day!
mbc0 Posted April 2, 2019 Author Share Posted April 2, 2019
Ah! I think I have found what is causing it! I don't know why it is happening, but I can make it happen on demand! I have lots of automation, and I have just forced it by moving some data from my desktop to the server. I can copy at 400MB/s+ using my 10GbE cards, which as expected puts a demand on the drives, but I never had an issue when using a single cache drive. If I copy large amounts of data, everything is fine at first - 400MB/s+ speeds and low CPU usage - then the speed drops literally to 0 and everything locks up with CPU usage through the roof. If I pause the file transfer, everything catches up and Unraid/dockers/VMs return to normal; if I resume the copy, the same happens again, making the VMs, Plex and dockers all lock up! Is there a way to test cache performance? This is clearly the issue!

M/B: Gigabyte Technology Co., Ltd. - X399 DESIGNARE EX-CF
CPU: AMD Ryzen Threadripper 2950X 16-Core @ 3700
HVM: Enabled
IOMMU: Enabled
Cache: 1536 kB, 8192 kB, 32768 kB
Memory: 32 GB (max. installable capacity 512 GB)
Network: bond0: transmit load balancing, mtu 1500
eth0: 1000 Mb/s, full duplex, mtu 1500
eth1: 1000 Mb/s, full duplex, mtu 1500
eth2: 10000 Mb/s, full duplex, mtu 9000
Kernel: Linux 4.18.20-unRAID x86_64
OpenSSL: 1.1.1a
Uptime:
JorgeB Posted April 2, 2019 Share Posted April 2, 2019
1 hour ago, mbc0 said: Is there a way to test cache performance? this is clearly the issue!
Copying from your desktop is a good test. There have been similar issues reported before; I was never able to reproduce it, and I have 11 SSDs in my cache pool. If you temporarily shut down all dockers/VMs and just do a copy, is it the same?
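A rough console test for sustained write speed is also possible with dd; this is a sketch that assumes the pool is mounted at /mnt/cache (the `POOL` variable and file name are arbitrary):

```shell
# Write 1 GiB of zeros to the pool and report throughput.
# conv=fdatasync forces the data to disk before dd reports, so the page
# cache doesn't inflate the MB/s figure. Increase count to several GiB
# to push TLC drives past their SLC write cache, where sustained speeds
# usually drop.
POOL="${POOL:-/mnt/cache}"
dd if=/dev/zero of="$POOL/ddtest.bin" bs=1M count=1024 conv=fdatasync
rm -f "$POOL/ddtest.bin"
```

If the dd throughput also collapses partway through, the problem is in the pool itself rather than the network path.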
mbc0 Posted April 3, 2019 Author Share Posted April 3, 2019 Hi @johnnie.black, I tried with the VM & docker engines disabled and got the same problem. Below is a link to a video of the problem, which is easier than me explaining it! Many Thanks For Your Time! https://1drv.ms/v/s!AjnJhtmhYMJlh5R_iV33akLpvaRraw
JorgeB Posted April 3, 2019 Share Posted April 3, 2019 I see the problem, but I have no idea what's causing it. As mentioned, I could never reproduce it, but you're not the first to complain of this issue when using a pool.
mbc0 Posted April 3, 2019 Author Share Posted April 3, 2019 Wow, that's me screwed then! If you don't know, nobody does!! 😂 Are you able to transfer at constant speeds to your pool?
JorgeB Posted April 3, 2019 Share Posted April 3, 2019
mbc0 said: Are you able to transfer at constant speeds to your pool?
Yes, the speed varies a little, but it never drops below around 700MB/s, and most importantly it never freezes for a few seconds like yours does.
mbc0 Posted April 3, 2019 Author Share Posted April 3, 2019
@johnnie.black Can I ask two things please?
1. What connection/disks are you using to get that kind of speed? I cannot get above 500 MB/s with 4x Kingston SSDs in RAID10.
2. I understand you don't know what is causing this issue, but if it were you that had this problem, what route would you take? I have 8 SATA3 connectors on my motherboard and have tried them all with no difference. I cannot really use the expander to connect the 4 SSDs, as I would lose trim support.
Many Thanks
JorgeB Posted April 4, 2019 Share Posted April 4, 2019 Currently using 11 Crucial MX500s in RAID5. In your case the first thing I would try would be faster SSDs. You're using TLC models, and they can get pretty slow during sustained writes; if you can, get some 3D TLC models to test, like the 860 EVO, MX500 or WD Blue 3D. Even 4 fast SSDs in RAID10 won't give much more than about 500MB/s, but maybe they can sustain that, and without those freezes.
Archived
This topic is now archived and is closed to further replies.