Everything posted by sonofdbn

  1. Thanks for the quick replies. OK, I might just change the SSD - or is there a chance it's a SATA cable problem?
  2. Darn. Does this mean I should run memtest? I have ECC RAM, and I've come across a post (by @johnnie.black!) that says memtest doesn't really help. I have a Supermicro X10SDV-TLN4F, which has IPMI.
  3. So in the end I changed the BTRFS cache pool to a single SSD (also BTRFS), re-created everything and was fine for a few months. Unfortunately today I got error messages from the Fix Common Problems plug-in: A) unable to write to cache and B) unable to write to Docker Image. I'm assuming that B is a consequence of A, but anyway I've attached diagnostics. Looking at the GUI, Docker is 34% full and the 1 TB cache drive, a SanDisk SSD, has about 20% free space. But looking at the log for the cache drive, I get a large repeating list of entries like this:

     Jul 6 18:36:57 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
     Jul 6 18:36:57 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
     Jul 6 18:36:57 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
     Jul 6 18:36:58 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
     Jul 6 18:36:58 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
     Jul 6 18:36:58 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
     Jul 6 18:36:59 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
     Jul 6 18:36:59 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
     Jul 6 18:36:59 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
     Jul 6 18:36:59 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
     Jul 6 18:36:59 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
     Jul 6 18:36:59 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
     Jul 6 18:37:00 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968

     Should I replace the SSD or is there something I can do with BTRFS to try to fix any errors? tower-diagnostics-20200706-1928.zip
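     Before replacing the SSD, it may be worth checking whether the drive itself is reporting I/O errors and whether a scrub can repair anything. A minimal sketch, assuming the pool is mounted at the usual /mnt/cache and the SSD is /dev/sdd:

     ```bash
     # Per-device btrfs error counters (write/read/flush/corruption/generation)
     btrfs device stats /mnt/cache

     # Verify all data and metadata checksums; on a single-device pool this mostly
     # detects rather than repairs, since there is no second copy of the data
     btrfs scrub start /mnt/cache
     btrfs scrub status /mnt/cache

     # SMART health of the drive itself
     smartctl -a /dev/sdd
     ```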
  4. My APC UPS died a while ago and I'm only now getting round to replacing it. Probably it's just the battery, but there's not much information available from APC. I used to have my unRAID box, a Synology 4-bay box and my Win 10 PC running off it. In truth, it might have been too much for the APC (650VA, 390W). But since I have to replace something anyway, I was wondering whether it's better to have one UPS per device or one big(ger) UPS. I only need the UPS to enable a clean shutdown; our power doesn't go out often, but a few times a year it goes out when there's lightning. I'm thinking that one UPS per device is probably the way to go because:
     - it's cheaper, or at least in the same ballpark as having one big device
     - not risking a single point of failure
     - that big device will probably weigh a ton
     - easier to control the shutdowns (no need for software to link to a master device)
     - takes up a bit more space, but power-cable wise less messy when devices are not close to each other, as in my case.
     Any thoughts about this?
  5. Came across this issue on Reddit, and after doing a bit of reading in this thread and elsewhere on the forum I'm quite concerned. BUT I don't know how to check how badly - if at all - I'm affected. So far what I've done is installed iotop and libffi from the Nerd Tools plug-in and I've run iotop -oa, but I don't know how to interpret the results. loop2 does seem to be writing more than anything else, but how do I know which disk it's writing to? Does it write only to the cache disk? Could someone help out by posting what commands a casual user could try to see what's happening, and how to interpret the results?
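     For reference, a loop device writes to whatever file backs it, so the destination can be checked directly; on Unraid the docker image is normally a loop-mounted file on the cache. A quick sketch:

     ```bash
     # List all loop devices and the files backing them
     losetup -l

     # Or just loop2
     losetup /dev/loop2

     # Then watch cumulative writes per process, as above
     iotop -oa
     ```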
  6. The UD is 500GB, and it's likely to be too small, especially taking into account other stuff I want to put on it.
  7. Yes, it's a downloads share for torrents. I did try using cache-prefer, but then of course some files did, correctly, go to the array. But I didn't like keeping the array disk spinning for reads. What I'd like to do is download to my unassigned device (SSD) and then manually move things I want to seed longer back to the cache drive. But I can't find any way of doing this in the docker I use (rtorrentvpn).
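     If the container can't do the move itself, it can be done from the Unraid console; a rough sketch with rsync, assuming the unassigned SSD is mounted at /mnt/disks/ssd and using a made-up torrent folder name (the client would still need to be repointed at the new path):

     ```bash
     # Copy a finished torrent from the unassigned SSD back to the cache drive,
     # preserving attributes, then delete the source files once transferred
     rsync -avh --remove-source-files "/mnt/disks/ssd/downloads/SomeTorrent/" \
           "/mnt/cache/downloads/SomeTorrent/"
     ```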
  8. I'm trying to look at what's going wrong on my server, and previously enabled the local syslog server, writing to my cache disk as outlined here: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-781601. (It's in the FAQ for unRAID v6 topic.) The problem is that my cache drive might be part of the problem, so I'd like to avoid writing to it. Is there a way of writing the local syslog folder to an unassigned device? I have an unassigned SSD which I could use. When I try to select a local syslog folder through the GUI, I only get a choice of shared folders.
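     One possible workaround, sketched here as an assumption rather than a supported GUI option: the Unassigned Devices plugin mounts drives under /mnt/disks/<name>, and rsyslog will write to any path given in its config, so an extra rule could be appended by hand. Because Unraid runs from RAM, the change would need to be reapplied after a reboot (e.g. via the go file or a user script), and rsyslogd restarted to pick it up.

     ```bash
     # Hypothetical mount point for the unassigned SSD
     mkdir -p /mnt/disks/ssd/syslog

     # Duplicate all syslog messages to a file on that device
     echo '*.* /mnt/disks/ssd/syslog/syslog.log' >> /etc/rsyslog.conf
     ```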
  9. Now looking at Fix Common Problems, I see errors: Unable to write to cache (Drive mounted read-only or completely full.) and Unable to write to Docker Image (Docker Image either full or corrupted.). According to the Dashboard, it doesn't look like the cache is full (90% utilisation - about 100 GB free). This is what I get (now) when I click Container Size on the Docker page in the GUI:

     Name                 Container  Writable  Log
     ---------------------------------------------
     calibre              1.48 GB    366 MB    64.4 kB
     binhex-rtorrentvpn   1.09 GB    -1 B      151 kB
     plex                 723 MB     301 MB    6.26 kB
     CrashPlanPRO         454 MB     -1 B      44.9 kB
     nextcloud            354 MB     -1 B      4.89 kB
     mariadb              351 MB     -1 B      9.62 kB
     pihole               289 MB     -1 B      10.2 kB
     letsencrypt          281 MB     -1 B      10.1 kB
     QDirStat             210 MB     -1 B      19.4 kB
     duckdns              20.4 MB    9.09 kB   5.18 kB
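     For what it's worth, the fill level of the docker image itself (which is separate from free space on the cache drive) can be checked from the console; a quick sketch, assuming the usual loop-mounted docker.img:

     ```bash
     # How full the loop-mounted docker image is
     df -h /var/lib/docker

     # What the containers, images and volumes inside it are using
     docker system df
     ```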
  10. Took a quick look at the logs in the above diagnostics and they seem to have omitted docker.log.1. So I've attached it here after editing out many similar lines. (Think file size was too big to upload.) Plex container ID is 8138acb243f3 Pihole container ID is 358107cb7f64 These two seem to come up in the logs. docker.log.1.txt
  11. I've re-created the docker image and all seemed fine. But this morning the log usage on the Dashboard jumped from 1% to 30% when I refreshed the browser. I did reinstall Plex yesterday, and prior to that the log was at 1% of memory (I have 31.4 GiB of usable RAM). Unfortunately it seems that the Dashboard doesn't necessarily update the log until you refresh the browser, so it's possible that the log size was higher than 1% earlier. Is Log size on the Dashboard just the syslog or also docker log? Because docker log is at 36MB, syslog at only around 1MB. Diagnostics are attached. tower-diagnostics-20200427-0949.zip
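     One way to see what is actually filling the log space: /var/log on Unraid is a small tmpfs held in RAM, so the Dashboard figure presumably covers everything under it, not just syslog. A quick sketch:

     ```bash
     # Size and usage of the log filesystem
     df -h /var/log

     # The largest files under /var/log
     du -ah /var/log | sort -rh | head -20
     ```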
  12. Can't say I did check the settings. But I'm sure the docker image wasn't anywhere near full when I looked at the dashboard. I remember thinking that there was significantly more free space since I'd left out a number of dockers when I did the reinstall.
  13. OK, I can do that, but I already re-created the docker image when I reinstalled the dockers on the re-formatted cache drive. Any way of reducing the chance of this corruption happening again? Or could it be that there's some problem with the appdata files that I backed up and used to reinstall the dockers that is causing this?
  14. Sorry to be back again, but more problems. So I backed up what I could, then reformatted the cache drives and set up the same cache pool, then reinstalled most of the dockers and VMs. It was a pleasant surprise to find that just about everything that had been recovered was fine, including the VMs. As far as I could tell, nothing major was missing. Anyway, the server trundled along fine for a few days, then today the torrenting seemed a bit slow, so I looked at the dashboard and found that the log was at 100%. So I stopped my sole running VM and then tried to stop some dockers but found I was unable to; they seemed to restart automatically. In the end I used the GUI to stop the Docker service, then tried to stop the array (not shutdown), but the disks couldn't be unmounted. I got the message "Array Stopping - Retry unmounting user share(s)" and nothing else happened after that. I grabbed the diagnostics and in the end I used powerdown from the console and shut down the array. From what I can see in the diagnostics, it looks like there are a lot of BTRFS errors. So not sure what I should do at this point. Array is still powered down. tower-diagnostics-20200424-1108.zip
  15. Command seems to be OK. Am now happily copying files off /x to the array. I swear that some files that couldn't be copied before are now copying across at a reasonable - even very good - speed. A few folders seem to be missing entirely but everything I've tried so far has copied across with no problem. I'm hopeful that most of the files will be recovered. Thanks again for all the help.
  16. Thanks for all the help so far. Given that my cache pool had two drives, sdd and sdb, is this the correct command to mount them? mount -o usebackuproot /dev/sdd1 /x
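     For context, with a multi-device btrfs pool it is normally enough to mount any one member and the whole pool is assembled, provided the kernel has seen all the devices first; a sketch:

     ```bash
     # Make sure the kernel knows about every btrfs member device
     btrfs device scan

     # Mounting one member brings in the rest of the pool
     mkdir -p /x
     mount -o usebackuproot /dev/sdd1 /x

     # Confirm both devices appear in the mounted filesystem
     btrfs filesystem show /x
     ```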
  17. I changed a SATA cable on one of the cache drives in case that was a source of the weird online/offline access. Then I started the array with cache drives unassigned, and mounted the cache drives again at /x. Ran btrfs dev stats /x and got 0 errors (two drives in the pool):

     [/dev/sdd1].write_io_errs 0
     [/dev/sdd1].read_io_errs 0
     [/dev/sdd1].flush_io_errs 0
     [/dev/sdd1].corruption_errs 0
     [/dev/sdd1].generation_errs 0
     [/dev/sdb1].write_io_errs 0
     [/dev/sdb1].read_io_errs 0
     [/dev/sdb1].flush_io_errs 0
     [/dev/sdb1].corruption_errs 0
     [/dev/sdb1].generation_errs 0

     So time to scrub. I started the scrub, waited patiently for over an hour, then checked status, and found that it had aborted.

     Scrub started: Thu Apr 16 19:15:12 2020
     Status: aborted
     Duration: 0:00:00
     Total to scrub: 1.79TiB
     Rate: 0.00B/s
     Error summary: no errors found

     Basically it had aborted immediately without any error message. (I know I shouldn't really complain, but Linux is sometimes not too worried about giving feedback.) Thought this might be because I had mounted the drives as read-only, so remounted as normal read-write and scrubbed again. Waited a while, and the status looked good:

     Scrub started: Thu Apr 16 21:01:53 2020
     Status: running
     Duration: 0:18:06
     Time left: 1:07:25
     ETA: Thu Apr 16 22:27:27 2020
     Total to scrub: 1.79TiB
     Bytes scrubbed: 388.31GiB
     Rate: 366.15MiB/s
     Error summary: no errors found

     Waited patiently until I couldn't resist checking and found this:

     Scrub started: Thu Apr 16 21:01:53 2020
     Status: aborted
     Duration: 1:04:36
     Total to scrub: 1.79TiB
     Rate: 262.46MiB/s
     Error summary: no errors found

     Again there was no message that the scrub had aborted; I had to run scrub status to see that it had. Ran btrfs dev stats and again got no errors. Maybe this is an edge case, given that the drive is apparently full? So is there anything else worth trying? I'm not expecting to recover everything, but was hoping to avoid having to re-create some of the VMs. What if I deleted some files (if I can) to clear space and then tried scrub again?
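     When a scrub aborts without saying why, the kernel log usually has the reason (often an unreadable block, or a failed allocation on a nearly full filesystem); a sketch for checking, assuming the pool is still mounted at /x:

     ```bash
     # Per-device scrub result, including any uncorrectable error counts
     btrfs scrub status -d /x

     # Kernel messages from around the time the scrub stopped
     dmesg | grep -i btrfs | tail -50
     ```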
  18. In trying to do the btrfs restore, I realised that it might not be surprising that "1) Mount filesystem read only (non-destructive)" above didn't work, because the disk is already mounted. I haven't had a problem actually accessing the disk. And it's then not surprising to see the last error message above. So my problem was how to unmount the cache drives to try 1) again. Not sure if this is the best way, but I simply stopped the array and then tried 1) again. Now I have access to the cache drive at my /x mountpoint, at least in the console. But I was a bit stuck trying to use it in any practical way. I thought about starting up the array again so that I could copy the cache files to an array drive, but wasn't sure if the cache drive could be mounted both "normally" for unRAID and at mountpoint /x. In any case, I had earlier used mc to try to copy files from the cache drive to the array, and that hadn't worked. So I've now turned to WinSCP and am copying files from mountpoint /x to a local drive. The great thing is that it can happily ignore errors and continue, and it writes to a log. (No doubt there's some Linux way of doing this, but I didn't spend time looking.) Now I swear that some /appdata folders that generated errors when I tried copying earlier are now copying just fine, with no problems. Or perhaps the problem files are just not there any more ☹️. WinSCP can be very slow, but I think that's a result of the online/offline problem that I had with some files, and at least it keeps chugging away without the horrible flashing that Teracopy did. But to my earlier point, can I start the array again, say in safe mode with GUI? I'd like to read some files off it. What would happen to the cache drive at mountpoint /x?
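     For the record, the same copy-what-you-can approach can be done locally with rsync, which keeps going past files it cannot read and records the failures; a sketch, assuming /x is the mounted (read-only) pool and disk1 has room:

     ```bash
     # Copy everything readable from the damaged pool to an array disk;
     # unreadable files are reported and skipped, and the log on the flash
     # drive records which ones failed
     rsync -avh --log-file=/boot/rsync-cache-rescue.log /x/ /mnt/disk1/cache_rescue/
     ```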
  19. OK, so the "buffer" is actually used. I swear that at one time I knew all of this 🙂
  20. Thanks, that's useful information. Got to re-think my setup when I eventually sort out the cache disk.
  21. I guess what I'm saying is what's the use case for a Cache Minimum Free setting? If there's no temporary use of the minimum free space, isn't it just reducing the size of your cache?
  22. I don't use mover at all (so not really using cache disk as a cache). I move files manually. But the cache-prefer share idea is excellent. I only "discovered" cache preferences a short time ago and didn't have a good idea of how they could be used. I didn't even know there was a Cache Minimum Free setting - but what happens when you hit the minimum free (ignoring for this discussion any cache-prefer shares)? Does this trigger a warning (and continue writing into the "buffer") or does it just act like a hard limit to the cache drive size?
  23. Can you give more details on btrfs restore? If the pool is unmounted, how do I access the drives? Or do I unmount the drives and then mount them individually as - I don't know - unassigned devices?
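     In case it helps later readers: btrfs restore works on the raw, unmounted devices and copies whatever it can into a destination directory, so nothing needs to be mounted as an unassigned device first. A rough sketch, assuming one pool member is /dev/sdd1 and an array disk is the target:

     ```bash
     # The pool must not be mounted; point restore at any one member device.
     # -v lists files as they are recovered, -i ignores errors and continues.
     btrfs restore -v -i /dev/sdd1 /mnt/disk1/cache_restore/
     ```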
  24. Cache-only shares are a bit messed up because I set some shares up before I understood how the cache could be used. In reality, I have /appdata, /domains and /isos there, as well as torrents. And, yes, torrents are also seeded from the cache (didn't want to keep an array drive spinning). So that's probably the cause? I thought I left myself a reasonable margin (50GB) but perhaps I wasn't paying attention. I also left too many torrents seeding on the cache because the latest versions of unRAID unfortunately slowed down file transfers from cache to array, so I didn't do transfers out as often as I used to. The bottom line, though, is that a full BTRFS cache drive pool can be a pretty bad problem. Is there any notification that I could have enabled? My experience is with Windows, and there I usually get a "disk is low on space" message.
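     A low-space warning can also be scripted; a minimal sketch, assuming a cron job or the User Scripts plugin, a pool mounted at /mnt/cache, and Unraid's bundled notify script at the path shown (an assumption worth verifying):

     ```bash
     #!/bin/bash
     # Warn when the cache pool drops below ~50 GiB free (threshold is arbitrary)
     free_kb=$(df --output=avail /mnt/cache | tail -1 | tr -d ' ')
     if [ "$free_kb" -lt $((50 * 1024 * 1024)) ]; then
         # Unraid's notification helper (path assumed)
         /usr/local/emhttp/webGui/scripts/notify -i warning \
             -s "Cache space low" -d "Only $((free_kb / 1024 / 1024)) GiB free on cache"
     fi
     ```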
  25. So it sounds like I've pretty much lost the data on the cache drive, although I might be lucky with some files (and it would take forever to work out which files are OK). That being the case, I think I should just try to recreate the cache drive from scratch. A real pain, but doable. In that case, is there anything wrong with the drives themselves? And do you think it was the corruption that led to the cache drive being full or the other way round? Because if it was the other way round, I need to monitor what goes on in the cache drive more carefully in future. If the corruption was the issue, any idea what caused it?