sonofdbn

Everything posted by sonofdbn

  1. Sorry to be back again, but more problems. So I backed up what I could, then reformatted the cache drives and set up the same cache pool, then reinstalled most of the dockers and VMs. It was a pleasant surprise to find that just about everything that had been recovered was fine, including the VMs. As far as I could tell, nothing major was missing. Anyway, the server trundled along fine for a few days, but today the torrenting seemed a bit slow, so I looked at the dashboard and found that the log was at 100%. So I stopped my sole running VM and then tried to stop some dockers, but found I was unable to; they seemed to restart automatically. In the end I used the GUI to stop the Docker service, then tried to stop the array (not shut down), but the disks couldn't be unmounted. I got the message "Array Stopping - Retry unmounting user share(s)" and nothing else happened after that. I grabbed the diagnostics, and in the end I used powerdown from the console and shut down the array. From what I can see in the diagnostics, it looks like there are a lot of BTRFS errors. So I'm not sure what I should do at this point. Array is still powered down. tower-diagnostics-20200424-1108.zip
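     Side note in case it helps anyone else hitting the log-full problem: since /var/log on unRAID is a small RAM filesystem, something along these lines should show what is filling it (I haven't re-run this on my box yet; the paths are just the standard ones):
        df -h /var/log
        du -ah /var/log | sort -rh | head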
  2. Command seems to be OK. Am now happily copying files off /x to the array. I swear that some files that couldn't be copied before are now copying across at a reasonable - even very good - speed. A few folders seem to be missing entirely but everything I've tried so far has copied across with no problem. I'm hopeful that most of the files will be recovered. Thanks again for all the help.
  3. Thanks for all the help so far. Given that my cache pool had two drives, sdd and sdb, is this the correct command to mount them? mount -o usebackuproot /dev/sdd1 /x
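     From what I've read, it shouldn't matter which member of the pool I point mount at - btrfs assembles the whole pool from any one device - so my rough plan is the following (sdd1/sdb1 are just my pool members, and I'd add ro to be safe; not verified yet):
        btrfs device scan
        mkdir -p /x
        mount -o usebackuproot,ro /dev/sdd1 /x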
  4. I changed a SATA cable on one of the cache drives in case that was a source of the weird online/offline access. Then I started the array with cache drives unassigned, and mounted the cache drives again at /x. Ran btrfs dev stats /x and got 0 errors (two drives in the pool):
        [/dev/sdd1].write_io_errs 0
        [/dev/sdd1].read_io_errs 0
        [/dev/sdd1].flush_io_errs 0
        [/dev/sdd1].corruption_errs 0
        [/dev/sdd1].generation_errs 0
        [/dev/sdb1].write_io_errs 0
        [/dev/sdb1].read_io_errs 0
        [/dev/sdb1].flush_io_errs 0
        [/dev/sdb1].corruption_errs 0
        [/dev/sdb1].generation_errs 0
     So time to scrub. I started the scrub, waited patiently for over an hour, then checked status, and found that it had aborted:
        Scrub started: Thu Apr 16 19:15:12 2020
        Status: aborted
        Duration: 0:00:00
        Total to scrub: 1.79TiB
        Rate: 0.00B/s
        Error summary: no errors found
     Basically it had aborted immediately without any error message. (I know I shouldn't really complain, but Linux is sometimes not too worried about giving feedback.) Thought this might be because I had mounted the drives as read-only, so remounted as normal read-write and scrubbed again. Waited a while, and the status looked good:
        Scrub started: Thu Apr 16 21:01:53 2020
        Status: running
        Duration: 0:18:06
        Time left: 1:07:25
        ETA: Thu Apr 16 22:27:27 2020
        Total to scrub: 1.79TiB
        Bytes scrubbed: 388.31GiB
        Rate: 366.15MiB/s
        Error summary: no errors found
     Waited patiently until I couldn't resist checking and found this:
        Scrub started: Thu Apr 16 21:01:53 2020
        Status: aborted
        Duration: 1:04:36
        Total to scrub: 1.79TiB
        Rate: 262.46MiB/s
        Error summary: no errors found
     Again there was no message that the scrub had aborted; I had to run scrub status to see it. Ran btrfs dev stats again and got no errors. Maybe this is an edge case, given that the drive is apparently full? So is there anything else worth trying? I'm not expecting to recover everything, but was hoping to avoid having to re-create some of the VMs. What if I deleted some files (if I can) to clear space and then tried scrub again?
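     One thing I haven't tried yet, in case it's useful: I gather the kernel log should say why a scrub aborted even when scrub status doesn't, and scrub status has a per-device option. Roughly (the /x mountpoint is just where I have the pool mounted; not verified on my box):
        dmesg | grep -i btrfs | tail -n 20
        btrfs scrub status -d /x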
  5. In trying to do the btrfs restore, I realised that it might not be surprising that "1) Mount filesystem read only (non-destructive)" above didn't work, because the disk is already mounted. I haven't had a problem actually accessing the disk. And it's then not surprising to see the last error message above. So my problem was how to unmount the cache drives to try 1) again. Not sure if this is the best way, but I simply stopped the array and then tried 1) again. Now I have access to the cache drive at my /x mountpoint, at least in the console. But I was a bit stuck trying to use it in any practical way. I thought about starting up the array again so that I could copy the cache files to an array drive, but wasn't sure if the cache drive could be mounted both "normally" for unRAID and at mountpoint /x. In any case, I had earlier used mc to try to copy files from the cache drive to the array, and that hadn't worked. So I've now turned to WinSCP and am copying files from mountpoint /x to a local drive. The great thing is that it can happily ignore errors and continue, and it writes to a log. (No doubt there's some Linux way of doing this, but I didn't spend time looking.) Now I swear that some /appdata folders that generated errors when I tried copying earlier are now copying just fine, with no problems. Or perhaps the problem files are just not there any more ☹️. WinSCP can be very slow, but I think that's a result of the online/offline problem that I had with some files, and at least it keeps chugging away without the horrible flashing that Teracopy did. But to my earlier point, can I start the array again, say in safe mode with GUI? I'd like to read some files off it. What would happen to the cache drive at mountpoint /x?
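     For the record, the "Linux way" I was thinking of is probably rsync, which keeps going past files it can't read and can write a log of what it did and what failed. Something like this, I believe (the destination folder and log path are just example names; not tested on this pool yet):
        mkdir -p /mnt/disk4/cache_rescue
        rsync -av --log-file=/mnt/disk4/rsync-cache.log /x/ /mnt/disk4/cache_rescue/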
  6. OK, so the "buffer" is actually used. I swear that at one time I knew all of this 🙂
  7. Thanks, that's useful information. Got to re-think my setup when I eventually sort out the cache disk.
  8. I guess what I'm saying is what's the use case for a Cache Minimum Free setting? If there's no temporary use of the minimum free space, isn't it just reducing the size of your cache?
  9. I don't use mover at all (so not really using cache disk as a cache). I move files manually. But the cache-prefer share idea is excellent. I only "discovered" cache preferences a short time ago and didn't have a good idea of how they could be used. I didn't even know there was a Cache Minimum Free setting - but what happens when you hit the minimum free (ignoring for this discussion any cache-prefer shares)? Does this trigger a warning (and continue writing into the "buffer") or does it just act like a hard limit to the cache drive size?
  10. Can you give more details on btrfs restore? If the pool is unmounted, how do I access the drives? Or do I unmount the drives and then mount them individually as - I don't know - unassigned devices?
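     If I've understood the man page correctly, btrfs restore reads the raw devices and doesn't need the pool mounted at all, so the idea would be to stop the array (so unRAID lets go of the pool) and then point restore at one member, with the output going to an array disk. Roughly (device and destination are just my own; not tried yet):
        mkdir -p /mnt/disk4/restore
        btrfs restore -v /dev/sdd1 /mnt/disk4/restore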
  11. Cache-only shares are a bit messed up because I set some shares up before I understood how the cache could be used. In reality, I have /appdata, /domains and /isos there, as well as torrents. And, yes, torrents are also seeded from cache (didn't want to keep an array drive spinning). So that's probably the cause? I thought I left myself a reasonable margin (50GB) but perhaps I wasn't paying attention. I also left too many torrents seeding on the cache because the latest versions of unRAID unfortunately slowed down file transfers from cache to array, so I didn't do transfers out as often as I used to. Bottom line, though, is that a full BTRFS cache drive pool can be a pretty bad problem. Is there any notification that I could have enabled? My experience is with Windows, and there I usually get a "disk is low on space" message.
  12. So it sounds like I've pretty much lost the data on the cache drive, although I might be lucky with some files (and it would take forever to work out which files are OK). That being the case, I think I should just try to recreate the cache drive from scratch. A real pain, but doable. If I go that route, is there anything wrong with the drives themselves? And do you think it was the corruption that led to the cache drive being full, or the other way round? Because if it was the other way round, I need to monitor what goes on in the cache drive more carefully in future. If the corruption was the issue, any idea what caused it?
  13. Here's /var/log/syslog. There's an earlier syslog.1 as well, but it's 15MB. Let me know if you need that as well. syslog
  14. Just to add: the cache seems to be mounted as read-only, so I can't delete anything. Is there a way to set it back to "normal" and allow me to clear some space?
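     From what I can find, the usual way to flip it back would be a remount, something like this (assuming the pool is at the standard /mnt/cache location):
        mount -o remount,rw /mnt/cache
     but my understanding is that if btrfs forced itself read-only because of errors, the remount is likely to be refused and the filesystem needs a proper unmount and repair/scrub first - so treat this as a guess rather than a fix.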
  15. I can access the cache, but copying files usually produces errors. For example, this is what I get when I try to copy \appdata\MKVToolnix to a local drive: Trying to copy the same folder to /mnt/disk4 on the server via mc on console produces an error in the same place (?): The other weirdness is that some files seem to go offline and then online again very rapidly, and there's no visible copying progress. Unfortunately I can't seem to capture the Teracopy log as text, but here's a screencap: Some files do copy OK, but backing up what I can would be almost like trying to copy each file individually and seeing which ones work. So I'm a bit stuck at the moment.
  16. Thanks for looking into this. Here are the diagnostics. (It's getting late here; I'll be back in the morning.) tower-diagnostics-20200414-0058.zip
  17. Unfortunately no joy. I went to the link you provided, and tried the first two approaches. My cache drives (btrfs pool) are sdd and sdb.
     1) Mount filesystem read only (non-destructive)
     I created mount point x and then tried
        mount -o usebackuproot,ro /dev/sdd1 /x
     This gave me an error:
        mount: /x: can't read superblock on /dev/sdd1.
     (Same result if I tried sdb1.) Then I tried
        mount -o ro,notreelog,nologreplay /dev/sdd1 /x
     This produced the same error. So I moved to
     2) BTRFS restore (non-destructive)
     I created the directory /mnt/disk4/restore. Then entered
        btrfs restore -v /dev/sdd1 /mnt/disk4/restore
     After a few seconds I got this error message:
        /dev/sdd1 is currently mounted. Aborting.
     This looks odd (in that the disk is mounted and therefore presumably accessible), so I thought I should check whether I've missed anything so far.
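     One thing I'm thinking of trying next (not done yet): check where the device is actually mounted and release it before re-running restore, along these lines -
        findmnt /dev/sdd1
        umount /mnt/cache        # or stop the array so unRAID lets go of the pool
        btrfs restore -v /dev/sdd1 /mnt/disk4/restore
     The /mnt/cache path is just where I'd expect unRAID to have the pool mounted; please correct me if stopping the array is the better way.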
  18. I'm using ECC RAM, so I didn't run memtest. My problem is how to backup the files on the cache drive(s). I've tried various things to copy files off the drives, but everything I've tried throws up errors. (I've tried mc via SSH and Teracopy from my Windows PC, WinSCP and a few others.) I've also tried the CA Appdata Backup plugin, but it just flashes briefly that it's working and there's no error message, but the output folders are empty.
  19. My Win10 VM suddenly got disconnected, and when I checked the server dashboard, the log was at 100%. I downloaded a diagnostics file at that point. Then I rebooted and as expected, log was cleared, but now Fix Common Problems reports "Unable to write to cache" and "Unable to write to Docker Image". (I took a quick look at rTorrentVPN and while it starts, the rTorrentVPN GUI shows no activity, not even a list of torrents.) I have attached the post-reboot diagnostics file as well. What should I do next? Second - tower-diagnostics-20200413-1938.zip First - tower-diagnostics-20200413-1845.zip
  20. And now this endpoint has stopped completely for me.
  21. You can always add the variable manually. The option to add variables, ports, etc. is at the bottom of the settings screen.
  22. For me it's working, but speeds fluctuate a lot - sometimes stops altogether, then back up to normal. I'm using Sweden.
  23. Should I let the parity check finish? (The crash came roughly 50% of the way into the previous parity check.) Is there a possibility that the flash drive is corrupted? I did install a docker recently, about the time of the first crash. I haven't started it this time round. Don't want to name it and give it a bad rep when it might have nothing to do with the crashes.
  24. Fortunately I ran the syslog server tool as suggested - the server crashed again. This time I could ping, but no SSH. Also I couldn't see any of the shares, but my Win10 VM was still running. Weird? The GUI timed out with a 500 Internal Server Error. So I shut down the VM and then rebooted. I've attached the syslog, removing entries at the end which came after the reboot. My UPS is down, so those messages are not surprising. 192.168.134 is an Asus router. I have two Asus routers, one is the main one and one is configured as an access point (and also serves as a switch). I believe that internal IP address is the access point. The single Ethernet cable from the server is connected to the access point. unRAID came up again at 00:45 extract-syslog-192.168.1.14.log tower-diagnostics-20200401-0104.zip
  25. Unfortunately the server crashed again. I noticed it when my Win10 VM disconnected (wasn't using the VM, but it was running in an RDP window on my PC). Couldn't SSH in to the server and couldn't ping it. So I've rebooted and parity check started automatically. I've attached the latest diagnostics (after rebooting) and the previous one for easy reference. From what I can tell, on Disk 3 the UDMA CRC Error Count hasn't changed. I'm wondering whether there's anything that can tell what caused the crash. So far, 4.8% into the parity check, there are no errors. tower-diagnostics-20200331-1439.zip tower-diagnostics-20200325-1911.zip