daan_SVK Posted April 7, 2021

So I popped in a new pre-cleared parity drive, ran into a bump with pre-clearing the old parity drive as per this thread, and I thought I was golden. The system went unresponsive when the parity rebuild was around 18%. The last thing I noticed was that the CPU usage went to 100%, which never happens as I have an 8-core x5600 and just a bunch of lightweight dockers.

This is where I'm at:
- the GUI won't load, getting 500 Internal Server Error from the browser
- I hooked up a monitor to the server, logged in, ran htop, and that's where the console froze and stopped taking any input
- the server still seems to be writing to the array or parity, but I can only tell by the solid LED light on the chassis
- the old parity drive was hooked up for pre-clearing in an external USB dock and the last time I saw it, it was paused
- the server still pings fine

Getting a little spooked here, what are my options?
- should I just wait for 2 days in the hope the new parity will eventually recalculate and the array comes back alive?
- should I disconnect the old parity drive in the caddy that I was pre-clearing?

Thanks for reading, fingers crossed.
trurl Posted April 8, 2021

Since parity isn't valid anyway, might as well reboot and get us diagnostics.
daan_SVK Posted April 8, 2021 (edited)

14 hours ago, trurl said:
Since parity isn't valid anyway might as well reboot and get us diagnostics.

Thanks trurl, I waited overnight and the writing to the array stopped. I rebooted as you suggested and the server came back up claiming the parity rebuild finished, but I still get a "Parity Invalid" exclamation mark on the parity drive. I have attached the diagnostics as per your suggestion.

Should I just start the array and attempt to rebuild the parity again? The disk was properly precleared originally.

tower-diagnostics-20210408-0928.zip

Edited April 8, 2021 by daan_SVK
trurl Posted April 8, 2021

2 hours ago, daan_SVK said:
disk was properly precleared

Not relevant. Preclear is only a test/burn-in utility. When Unraid needs a clear disk (only when adding a disk to a new data slot in an array with valid parity, so parity will remain valid), Unraid will clear the disk itself.

Why do you have 100G allocated to docker.img? Have you had problems filling it? 20G is usually more than enough, and making it larger won't fix filling, it will only make it take longer to fill.

2 hours ago, daan_SVK said:
Should I just start the array attempting to rebuild the parity again?

Yes, and post new diagnostics after starting. There are lots of things we can't tell about your situation unless the array is started.
daan_SVK Posted April 8, 2021

1 hour ago, trurl said:
Why do you have 100G allocated to docker.img?

No particular reason to be honest, this is how it was originally set up.

I started the array again a few hours ago; it seems to be rebuilding without issues at the moment (at 8.5% now). The server is responsive and all dockers and services are up. I disconnected the USB caddy that was running the pre-clear on the old parity drive in case that might have been the issue.

The new diagnostics file is attached. Thanks again for reviewing it, much appreciated.

tower-diagnostics-20210408-1335.zip
trurl Posted April 8, 2021

1 hour ago, daan_SVK said:
this is how it was originally set up.

You mean somebody else set it up for you?

Why are some of your disks still ReiserFS?

Parity build looks good so far, but it looks like some cache corruption:

Apr 8 12:01:02 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 12:01:03 Tower crond[2026]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Apr 8 13:00:54 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 13:00:54 Tower kernel: btrfs_print_data_csum_error: 12 callbacks suppressed
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 1
Apr 8 13:00:54 Tower kernel: btrfs_dev_stat_print_on_error: 12 callbacks suppressed
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 2
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 34, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 1
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 2
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 1
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS warning (device sdb1): csum failed root 5 ino 5854828 off 5642928128 csum 0x75f244ca expected csum 0x98829387 mirror 2
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
Apr 8 13:00:55 Tower crond[2026]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

Why was mover running?

Didn't look at SMART. Do any disks have SMART warnings on the Dashboard page?
daan_SVK Posted April 8, 2021 (edited)

39 minutes ago, trurl said:
2 hours ago, daan_SVK said:
this is how it was originally set up.
You mean somebody else set it up for you?

I set it up a long time ago. Outside of being a waste of space, this is not an issue, is it?

39 minutes ago, trurl said:
Why are some of your disks still ReiserFS?

Again, this is the original configuration. I think some of those disks are over 6 years old. I tend not to fix things if they aren't broken. I was about to replace the oldest drive with the old parity drive.

39 minutes ago, trurl said:
Why was mover running?

This most likely has to do with the cache corruption you pointed out. It was most likely trying to move the movie listed in the log:

Apr 8 12:01:02 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 12:01:03 Tower crond[2026]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Apr 8 13:00:54 Tower shfs: copy_file: /mnt/cache/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv /mnt/disk3/Movies/Cronos (1993)/Cronos 1993 Bluray-1080p.mkv (5) Input/output error
Apr 8 13:00:54 Tower kernel: btrfs_print_data_csum_error: 12 callbacks suppressed

39 minutes ago, trurl said:
Didn't look at SMART. Do any disks have SMART warnings on the Dashboard page?

No SMART errors reported from any of the drives.

So the mover is the culprit here? Should the mover be disabled in similar cases? I was under the impression the array should be fully operational while parity is being calculated. I certainly never experienced any issues on my monthly parity checks.

Also, should I be concerned about the cache corruption? How would I go about addressing that? Should I scrub it in maintenance mode once the parity rebuild has been completed?

Edited April 8, 2021 by daan_SVK
JorgeB Posted April 9, 2021

8 hours ago, trurl said:
csum failed

Btrfs is detecting data corruption. This suggests a hardware issue, like bad RAM, and it's not that surprising since it's a known issue with Ryzen and overclocked RAM.
daan_SVK Posted April 9, 2021 (edited)

7 hours ago, JorgeB said:
known issue with Ryzen and overclocked RAM.

I will try to disable XMP and run the RAM at stock settings. The server was up for over 60 days prior, and it was also stress tested with memtest for two days before being put into production.

Is there something I can do to resolve the cache corruption? This being a pool of two devices, I should be able to run a corrective scrub, correct?

Edited April 9, 2021 by daan_SVK
JorgeB Posted April 9, 2021

You should run a scrub to identify the corrupt files. Most likely both copies will be corrupt, and in that case they can't be fixed, but you can delete them and restore from backups.
daan_SVK Posted April 9, 2021

1 hour ago, JorgeB said:
You should run a scrub to identify the corrupt files, most likely both copies will be corrupt, and in that case it can't be fixed, but you can delete them and restore from backups.

Thanks JorgeB, I deleted the file I thought caused the mover to hang and ran the read-only scrub, however I don't see any corrupted files listed in the output:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sdb1
UUID: d1294b70-d13c-4027-b0b5-12417226b0dc
cache and super generation don't match, space cache will be invalidated
found 128267644928 bytes used, no error found
total csum bytes: 20018700
total tree bytes: 368099328
total fs tree bytes: 303235072
total extent tree bytes: 33718272
btree space waste bytes: 64207774
file data blocks allocated: 1508281032704
 referenced 126905487360
JorgeB Posted April 9, 2021

1 hour ago, daan_SVK said:
however I dont see any corrupted files listed on the output

That's not a scrub, it's a file system check. There's also a GUI option for the scrub; corrupt files, if any, will be listed in the syslog.
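For reference, the same scrub can also be started from the command line with btrfs-progs; a minimal sketch, assuming the pool is mounted at /mnt/cache (the usual Unraid cache mountpoint):

```shell
# read-only scrub of the whole cache pool, run in the foreground
# (-B = don't background, -d = per-device stats at the end, -r = read-only)
btrfs scrub start -B -d -r /mnt/cache

# or start it in the background and poll progress
btrfs scrub start -r /mnt/cache
btrfs scrub status /mnt/cache

# any corrupt files it finds are reported in the syslog as checksum errors
grep -iE 'csum failed|checksum error' /var/log/syslog
```

Dropping the -r flag makes the scrub corrective, i.e. it will try to repair bad blocks from the other mirror, which only helps when at least one copy is still good.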
daan_SVK Posted April 9, 2021 (edited)

14 minutes ago, JorgeB said:
That's not a scrub, it's a file system check, there's also a GUI option for the scrub, corrupt files, if any, will be listed in the syslog.

Ah, yes, thank you for pointing that out. No errors in that one:

UUID: d1294b70-d13c-4027-b0b5-12417226b0dc
Scrub started: Fri Apr 9 09:41:13 2021
Status: finished
Duration: 0:04:15
Total to scrub: 238.90GiB
Rate: 959.28MiB/s
Error summary: no errors found

The cache log still shows 43 corrupted entries:

Apr 8 09:21:28 Tower kernel: sdc: sdc1
Apr 8 09:21:28 Tower kernel: sd 2:0:0:0: [sdc] Attached SCSI disk
Apr 8 09:21:28 Tower kernel: BTRFS: device fsid d1294b70-d13c-4027-b0b5-12417226b0dc devid 2 transid 9310751 /dev/sdc1 scanned by udevd (1642)
Apr 8 09:21:59 Tower emhttpd: MTFDDAK512TBN-1AR1ZABHA_UGXVL01J1BF7U0 (sdc) 512 1000215216
Apr 8 09:21:59 Tower emhttpd: import 31 cache device: (sdc) MTFDDAK512TBN-1AR1ZABHA_UGXVL01J1BF7U0
Apr 8 09:21:59 Tower emhttpd: read SMART /dev/sdc
Apr 8 11:21:28 Tower kernel: BTRFS info (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 23, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 24, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 25, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 26, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 27, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 28, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 29, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 30, gen 0
Apr 8 12:01:02 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 31, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
Apr 8 13:00:54 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
Apr 8 14:01:09 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
Apr 8 14:01:09 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 39, gen 0
Apr 8 14:01:09 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 40, gen 0
Apr 8 15:01:11 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0
Apr 8 15:01:11 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 42, gen 0
Apr 8 15:01:11 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0
Apr 9 09:22:25 Tower emhttpd: read SMART /dev/sdc
Apr 9 09:40:16 Tower emhttpd: read SMART /dev/sdc
Apr 9 09:40:30 Tower kernel: BTRFS info (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 43, gen 0

What are my options for addressing those?

Edited April 9, 2021 by daan_SVK
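For what it's worth, those "corrupt 43" counters are cumulative: btrfs stores the per-device error counts in the filesystem, so they keep their value across mounts even after the offending files are gone, and only new errors increment them. A minimal sketch for inspecting and, once everything checks out, zeroing them (assuming the pool is mounted at /mnt/cache):

```shell
# show the persistent per-device error counters for the pool
btrfs device stats /mnt/cache

# after the corruption is dealt with, reset the counters to zero
# so that any future errors stand out immediately
btrfs device stats -z /mnt/cache
```

After a reset, a clean subsequent scrub plus counters that stay at zero is a reasonable sign the remaining data is fine.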