Pauven

Members
  • Posts: 747
  • Joined
  • Last visited
  • Days Won: 7

Pauven last won the day on August 12, 2019

Pauven had the most liked content!

Converted

  • Gender: Male
  • Location: Atlanta Metro Area


Pauven's Achievements

Enthusiast (6/14)

Reputation: 124

  1. I would highly recommend at least starting with the original slots where the replacement drives are currently being detected: remove the replacements and install the originals there. For now, don't touch any of the other "good" drives, as that could compound the problem, especially if you start to lose track of which drives are which. Keep it simple.
  2. Okay, slow down. Up to this point, most of the advice has been either about running tests or hypothesizing options. Any time you think you're ready to take action, please post your planned steps here for review and approval. Every action you take is one step closer to losing data if it's the wrong action. I believe your data is still intact, so don't give up hope. But slow down and work with the guys here, and don't do anything that hasn't been reviewed and approved. Did you follow this guidance to back up the current flash drive first, before restoring from backup? From a planning perspective, we need to know what options remain.
  3. Hey guys, I'm just chiming in here as jkwaterman is a friend of mine, and we've already been chatting via email - I sent him here for expert advice. I'm super happy to see JorgeB, trurl and Frank1940 are helping out - you guys are sharp so I know he's in good hands. I read through everything, and I do have a few thoughts. Everything you guys are suggesting is pretty much a match for what I've advised via email as well, so we're all already on the same page.

     Restoring the super.dat from his Apr 2023 backup is a great idea, but I think that only applies if he didn't change any drives between the backup and the point where the drives failed out. If we send him down this path, I think he first needs to confirm he didn't upgrade/swap/add any drives post backup, and he should also take a new backup of his current (bad) config, in case this goes sideways and he wants to get back to the current state (a rough sketch of that backup step is at the end of this post list). I wanted to point this out since I didn't see anyone ask this particular question.

     I also strongly agree with trying to use the original failed drives, and that he should perform SMART tests to validate the drives are okay before re-using them. One thing I'm not sure about is whether, if he uses the old drives, he should use the Trust Parity feature (I assume that's still a feature, it's been a decade since I last did this). I'm imagining that he's got two paths forward with the old drives. He could recreate the array config using all the original drives, do a Trust Parity so it won't be rebuilt, and then immediately swap out the two suspect drives and rebuild onto the replacements. Basically, with this approach he's using the GUI to recreate the pre-failure drive config state, and then manually failing/upgrading the drives. Otherwise, he could again recreate the array config using all the original drives, but not Trust Parity, and instead rebuild new parity from the data on the suspect drives. This second approach sounds slightly riskier, as we're trusting the suspect drives to survive the parity rebuild, and unfortunately we don't know the nature of the errors that started this whole fiasco.

     I know for a fact that he has started the array numerous times in disk emulation mode, so data could have been written to the array. Additionally, we are both users of the My Movies software, which has a habit of updating local movie data from online web contributions that other users continually submit, and this metadata in turn gets written to the array. It's probably safe to assume that My Movies was running at some point during disk emulation mode, so the current parity data no longer matches the data on the failed drives. I just wanted to point this out, so that we all know to either trust the parity data or trust the suspect drive data, but expect the two data sources to be slightly out of sync with each other. Note that the updates from My Movies are trivial and will automatically be reapplied if he reverts to the old drive data, so there's no risk of data loss if he goes that route.

     One question I had myself: is it possible to manually fix the drive config, via text editing, so that the parity drives are re-added to the array in a trusted state, but the 2 failed drives are still shown as missing/wrong/replaced? I was thinking there was a way to accomplish this via text file edits, but I really don't know.

     On the power supply question: I helped with his server build, and this power supply has 62A on +12V if I'm not mistaken.

     Thanks for helping jkwaterman out, guys, I know we both really appreciate it!!!
  4. Thanks Rysz. Actually, it's my signature that's really outdated, hah! But I was still on 6.9.2, and I had to upgrade to 6.10+ even to use the URL method. I'm on the latest 6.12 now, and I was able to install from URL. I assume it's the same MergerFS release as the CA version.

     I like MergerFS, and it's working as I hoped, but it's not perfect. The "Create" minimum free space is the same for files and directories, and I was finding that it would create a directory, write some files to it, drop below the minimum free space, and then create a new directory on a different branch. Considering that I'm backing up uncompressed Blu-rays, typically around 45 GB in size, I need the minimum free space for creating a directory to be at least 45 GB higher than the minimum free space for creating files.

     To solve this, I customized the mirror.sh script someone else wrote, which creates each directory right before files are written to it, rather than creating all empty directories first and then copying files. I changed it to create directories based on a 100 GB minimum free space and to evaluate my MergerFS branches in a particular sequence. I was then able to configure MergerFS with a much lower 4 GB minimum free space, which only applies to files since my script creates the directories. Combined with MergerFS's "ep" Existing Path option, I now have MergerFS writing the backup files to wherever my backup script creates the directories (a rough sketch of this directory-creation logic is at the end of this post list). This lets me keep my Blu-ray disc directories whole on a single drive, and all my MergerFS branches fill up one-by-one. I'm in backup nirvana!!!
  5. A year ago I created an easy, affordable backup solution for my Unraid server: essentially just a stack of external USB drives that I mounted with Unassigned Devices and joined together in a BTRFS JBOD-style pool. With 5x 16TB drives, this gave me a single 80TB storage volume. At the time, this solution seemed perfect. I had a backup script that used rsync to copy my files to the single mount point, and I thought that BTRFS filled up each drive one-by-one. Since my Unraid data is basically already a backup of my physical data, having this portable backup volume that could be stored offsite was more than I needed, even without any built-in redundancy.

     This week, while adding a new 20TB drive to expand this pool up to 100TB, I learned I had made several mistakes in my backup solution. First, while adding the new drive I made a few missteps and ended up corrupting the BTRFS pool. And since my pool had no redundancy, BTRFS prohibits mounting it in RW mode to fix it, so the only option was to start over, recreate the entire pool, and re-backup the original 80TB of data. That was painful enough. But in redoing all this, I discovered that BTRFS automatically balances, writing each file to the drive with the most free space. With the nature of the data I'm storing, losing a single drive would now make the entire backup worthless, as I need each directory to remain whole on a single drive and can't lose any files inside each directory. While my BTRFS backup pool is better than nothing, this is way too fragile for me to continue using it.

     While researching solutions, I came across MergerFS and eventually this thread. This sounds like the right type of solution. My core requirements are to plug in my USB drives, mount them as a single filesystem, and run a backup script to copy any new/altered data to my backup pool, with data filling up each drive, one-by-one, before moving on to the next drive. That way, if I lose a drive, I only lose the data backed up to that one drive, plus any directories that happened to span the transition between drives.

     Sorry for the long lead-in. Now to my questions: Is the plugin on CA yet? I searched and can't find it, so I'm assuming I have to install it via URL. Can someone help me with the configuration? I read through the MergerFS GitHub page, and there are tons of options and the examples don't seem to apply to my use case. I'm a bit overwhelmed. I need commands for configuring, mounting, unmounting, and expanding the pool (a rough sketch of that kind of setup is at the end of this post list). Thanks! -Paul
  6. I did see that, but then you appended your edit and I thought you were changing your answer, hence my confusion. I currently have 78.6 TB of data backed up in this pool as-is. If I follow those steps, is there any risk I could lose that data and have to repopulate the backup? It took over a week of copying, and I don't want to have to do that again. If I'm understanding you correctly, I can remove the 5 disks, delete the history, then insert the disks one at a time and rename each to the same pool name, delete my history again just to make sure, and then the next time I bring in all 5 drives at the same time, they will appear as a single pool. Does that sound right?
  7. So does that mean this isn't possible? Sorry, I got confused. Since the mount point has to be the disk label, and it won't let me rename to an existing value, that makes it impossible to do the solution you offered, right?
  8. I just tried changing the Mount Point on all the disks to be the same, and it won't let me. It reports "Fail". I think it is because it's changing the disk label and the mount point at the same time. Is there a trick to doing this? Error in the log:

     Apr 10 17:24:51 Tower unassigned.devices: Error: Device '/dev/sdx1' mount point 'Frankenstore' - name is reserved, used in the array or by an unassigned device.
  9. Hey dlandon. Thanks so much for this awesome tool. I've been using it for years, and it's been a big help for certain tasks. One of the things I occasionally use it for is a removable btrfs JBOD drive pool of 5 USB HDDs. It's so easy to plug it in, mount it, run my rsync backup job, then put it back in offline storage when I'm done. I love that I don't have to stop/start the array to use it, that I don't get warnings when I unplug it, and that I don't get any Fix Common Problems warnings for duplicated data on a cache disk.

     I was recently sharing my solution with some fellow users, and I discovered that the tutorial for how to create the btrfs drive pool for UD was removed. I reached out to JorgeB and he restored that post, so now I have those instructions again. While working with these other users on this hot-pluggable backup pool, and comparing with how it works using stock Unraid pools, a few things cropped up that I wanted to ask you about. After all, UD is the best tool for creating hot-pluggable drive pools that are normally stored offline, but there are a couple of things Unraid pools do a bit better.

     First, when mounting the 1st pool device, the buttons to mount the other devices remain enabled. One of my fellow users got confused, clicked mount on all the devices, and then saw the pool was mounted multiple times. Would it be possible to both make it more obvious that all the drives in the pool are now mounted, and to disable/hide the mount button on the other drives? Currently the only indication is the partition size on the mounted drive. Perhaps the other drives that got mounted in the pool could even be inset to the right, beneath the parent, to better indicate what is going on.

     Second, would it be possible to add a feature in the GUI to add a partition to an existing pool? I believe that Unraid pools let you do this, but in UD you have to go out to the command line and run the btrfs dev add... command to add the partition to a mount point (a rough sketch of that command is at the end of this post list). I know it's a pretty easy command line, but some users are very uncomfortable with the command line and prefer the GUI approach.

     I know most people seem to think that Unraid pools are the only game in town now, and even your own documentation says to use them. But for hot-pluggable, removable drive pools, UD is so much better, and I hope you continue to support and enhance this capability. Thanks!!! Paul
  10. Awesome, thank you Johnnie (should I still call you Johnnie, or Jorge, or something else?), that's exactly what I needed. I was surprisingly close in my recreation of the steps based upon my research, but was full of doubt. You're extremely helpful as always. 😊 Another user did a test and discovered he was able to mount a pool created in Unraid using UD, no big surprise I guess since these are just standard btrfs pools. So for some users it might be easier to create the pool using Unraid, remove it, delete the definition, and then use UD from then on for hot-plugging. I would definitely use the Unraid pools feature if it more gracefully handled hot-pluggable backup pools, and didn't require the stop/start. I'm not complaining, though, since UD does this extremely well.
  11. Hey Johnnie/@JorgeB, I could use some help on this. Side note: your new username and logo had me all confused. I couldn't figure out how you seemed to have been here for years/decades, yet I didn't recognize the name. I finally figured out your provenance, though I'm still baffled by the username change.

     Anyway, to my issue. I created a portable backup drive pool, as described above, with Unassigned Devices back when I was running 6.8, using the directions you linked to above. I plug it in 2-4 times a year and do a backup, and it's fantastic. Those instructions (which I think you wrote) have since been deleted, since the preferred way is now to use the multiple drive pools feature in 6.9. But the functionality in 6.9 is not the same. If you create a drive pool and then unplug it, Unraid is unhappy about the missing drives. You can make the warnings go away if you delete the pool, but then you have to make sure you recreate the pool with all the drives back in the correct order before doing your next incremental. You also have to stop/start the array to make any changes to the pool. Unassigned Devices did this particular task so much better: no warnings, no need to delete the config, just plug it in and mount it, and no need to stop the array.

     While I can understand that the preferred method is to use Unraid for multiple permanent drive pools, I don't understand why the documentation for doing it with UD was deleted, as that still serves a niche. I'm trying to help some other users get up and going with the solution I'm using, and since I can't find the documentation I can't fully help them. I think there were some command lines I used when setting up the btrfs pool as JBOD, possibly related to formatting, but I don't recall (a rough sketch of the kind of commands involved is at the end of this post list). I also need to expand my UD backup drive pool soon, since I almost ran out of space on my last backup and need a 6th drive, and I'm worried I won't be able to do this correctly without the instructions. Even the UD support thread points to the now-deleted instructions, and the Internet Archive doesn't have any successful copies of the FAQ. Is this something you can help with, or can you point me to someone who can? Thanks! Paul
  12. Thanks JorgeB. I've followed your advice and ripped out the Highpoint 2760A. I installed a couple Dell H310's, combined with 8 SATA ports on my motherboard, to get back to 24 ports. So far it's been smooth sailing, but my Call Trace problems don't usually crop up for a couple weeks, so I'm not in the clear yet. Fingers crossed.
  13. I've been searching the forum, trying to see if any other users have the same issue. I do see plenty of call trace reports, but so far none have matched mine. My log just keeps repeating the same info over and over. What I posted above was just the errors; here's the full detail for a complete error segment:

     Apr 1 13:50:20 Tower kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
     Apr 1 13:50:20 Tower kernel: rcu: 10-....: (2 GPs behind) idle=61e/1/0x4000000000000002 softirq=118394498/118394499 fqs=10418875
     Apr 1 13:50:20 Tower kernel: (detected by 8, t=42541182 jiffies, g=291127741, q=55535900)
     Apr 1 13:50:20 Tower kernel: Sending NMI from CPU 8 to CPUs 10:
     Apr 1 13:50:20 Tower kernel: NMI backtrace for cpu 10
     Apr 1 13:50:20 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 13:50:20 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Professional Gaming, BIOS P4.80 07/18/2018
     Apr 1 13:50:20 Tower kernel: RIP: 0010:mvs_slot_complete+0x31/0x45f [mvsas]
     Apr 1 13:50:20 Tower kernel: Code: 00 00 41 56 41 55 41 54 55 53 89 c3 48 6b cb 58 48 83 ec 18 89 44 24 10 83 c8 ff 89 74 24 14 4c 8d 34 0f 4d 8b be 08 fd 00 00 <4d> 85 ff 0f 84 16 04 00 00 49 83 bf e8 00 00 00 00 0f 84 08 04 00
     Apr 1 13:50:20 Tower kernel: RSP: 0018:ffffc900003c0e78 EFLAGS: 00000286
     Apr 1 13:50:20 Tower kernel: RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
     Apr 1 13:50:20 Tower kernel: RDX: 0000000000000000 RSI: 0000000000010000 RDI: ffff888138a80000
     Apr 1 13:50:20 Tower kernel: RBP: ffff888138a80000 R08: 0000000000000001 R09: ffffffffa02eda65
     Apr 1 13:50:20 Tower kernel: R10: 00000000d007f000 R11: ffff8881049a9800 R12: 0000000000000000
     Apr 1 13:50:20 Tower kernel: R13: 0000000000000000 R14: ffff888138a80000 R15: 0000000000000000
     Apr 1 13:50:20 Tower kernel: FS: 0000000000000000(0000) GS:ffff888fdee80000(0000) knlGS:0000000000000000
     Apr 1 13:50:20 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Apr 1 13:50:20 Tower kernel: CR2: 00000000002a925a CR3: 0000000281a36000 CR4: 00000000003506e0
     Apr 1 13:50:20 Tower kernel: Call Trace:
     Apr 1 13:50:20 Tower kernel: <IRQ>
     Apr 1 13:50:20 Tower kernel: mvs_int_rx+0x85/0xf1 [mvsas]
     Apr 1 13:50:20 Tower kernel: mvs_int_full+0x1e/0xa4 [mvsas]
     Apr 1 13:50:20 Tower kernel: mvs_94xx_isr+0x4d/0x60 [mvsas]
     Apr 1 13:50:20 Tower kernel: mvs_tasklet+0x87/0xa8 [mvsas]
     Apr 1 13:50:20 Tower kernel: tasklet_action_common.isra.0+0x66/0xa3
     Apr 1 13:50:20 Tower kernel: __do_softirq+0xc4/0x1c2
     Apr 1 13:50:20 Tower kernel: asm_call_irq_on_stack+0x12/0x20
     Apr 1 13:50:20 Tower kernel: </IRQ>
     Apr 1 13:50:20 Tower kernel: do_softirq_own_stack+0x2c/0x39
     Apr 1 13:50:20 Tower kernel: __irq_exit_rcu+0x45/0x80
     Apr 1 13:50:20 Tower kernel: common_interrupt+0x119/0x12e
     Apr 1 13:50:20 Tower kernel: asm_common_interrupt+0x1e/0x40
     Apr 1 13:50:20 Tower kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
     Apr 1 13:50:20 Tower kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
     Apr 1 13:50:20 Tower kernel: RSP: 0018:ffffc9000016fea0 EFLAGS: 00000246
     Apr 1 13:50:20 Tower kernel: RAX: ffff888fdeea2380 RBX: 0000000000000002 RCX: 000000000000001f
     Apr 1 13:50:20 Tower kernel: RDX: 0000000000000000 RSI: 00000000238d7f23 RDI: 0000000000000000
     Apr 1 13:50:20 Tower kernel: RBP: ffff888105d0d800 R08: 00028e38c0a3fe38 R09: 00028e3ab9ddf5c0
     Apr 1 13:50:20 Tower kernel: R10: 0000000000000045 R11: 071c71c71c71c71c R12: 00028e38c0a3fe38
     Apr 1 13:50:20 Tower kernel: R13: ffffffff820c8c40 R14: 0000000000000002 R15: 0000000000000000
     Apr 1 13:50:20 Tower kernel: cpuidle_enter_state+0x101/0x1c4
     Apr 1 13:50:20 Tower kernel: cpuidle_enter+0x25/0x31
     Apr 1 13:50:20 Tower kernel: do_idle+0x1a6/0x214
     Apr 1 13:50:20 Tower kernel: cpu_startup_entry+0x18/0x1a
     Apr 1 13:50:20 Tower kernel: secondary_startup_64_no_verify+0xb0/0xbb

     Comparing with others, and taking a closer look at the output in my log, I'm noticing a few too many [mvsas]-related entries. That's for my Marvell-based Highpoint 2760A 24-port SAS controller. For years Fix Common Problems has been warning me about my Marvell-based controller, but I've ignored those warnings since I've never had any issues with it since I bought it in 2013. Almost 9 years of trouble-free operation, all the way through 6.8.3. Maybe I'm jumping to conclusions and the issue is something else. Can anyone tell?
  14. This appears to still be an issue for me, and I need help to move forward. Quick recap: last year I upgraded to 6.9.2 and had issues with Seagate IronWolf (actually Exos) drives, plus the issue described here. I thought it was all related. I ended up rolling back to 6.8.3, and the issues went away. A little over a month ago, 6.8.3 stopped working correctly for me, I believe due to an incompatible Unassigned Devices update. About a week ago I decided to apply the Seagate drive fix (disabling EPC) and try upgrading to 6.9.2 again. I thought everything was successful: multiple spin-ups/spin-downs, a record-fast parity check, and a perfectly working GUI, Dockers, and VMs. I thought I was in the clear.

     Which brings us to today. Being the 1st of the month, the parity check kicked off at 2am. When I woke up this morning, I found that parity check progress was stalled at 0.1% after 6+ hours, and several hours later I can confirm it's not moving. In general, the GUI feels responsive, letting me browse around, but I noticed that the Dashboard presents no data, the drive temps don't appear to be updating, and the CPU/MB temp and fan speeds are wrong and frozen. I connected to the Terminal and ran mdcmd status to see if the parity check was actually running, but mdResyncPos is frozen at 9283712 (a rough sketch of that check is at the end of this post list). Best I can tell, it seems like Unraid is frozen, even though the GUI isn't hung.

     First things first, I decided to run diagnostics. An hour later, it still reads "Starting diagnostics collection...". JorgeB is right: I checked Unraid's System Log, and it is full of Call Trace errors:

     Apr 1 10:53:07 Tower nginx: 2022/04/01 10:53:07 [error] 9804#9804: *2157788 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.1.218, server: , request: "GET /Dashboard HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "tower", referrer: "http://tower/Main"
     Apr 1 10:53:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 10:53:19 Tower kernel: Call Trace:
     Apr 1 10:56:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 10:56:19 Tower kernel: Call Trace:
     Apr 1 10:59:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 10:59:19 Tower kernel: Call Trace:
     Apr 1 11:02:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 11:02:19 Tower kernel: Call Trace:
     Apr 1 11:05:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 11:05:19 Tower kernel: Call Trace:
     Apr 1 11:08:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S 5.10.28-Unraid #1
     Apr 1 11:08:19 Tower kernel: Call Trace:

     Since running Diagnostics didn't work, I'm not sure what my next step should be. Do I need to gather more info, or is the issue already confirmed as a Ryzen on Linux kernel related issue? Are there any solutions?
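
Command sketches

The sketches below are hedged illustrations of steps described in the posts above; none of them come from the original posts, and all paths, device names, labels, and thresholds in them are assumptions.

For the flash-backup step mentioned in post 3: a minimal sketch of backing up the current (bad) Unraid flash config before restoring an older super.dat, assuming the usual /boot flash layout and an illustrative destination directory.

  # A minimal sketch, assuming the Unraid flash drive is mounted at /boot and
  # that /mnt/backups is some safe location NOT on the flash drive (illustrative).
  STAMP=$(date +%Y%m%d-%H%M%S)

  # Snapshot the entire current config directory before touching anything
  mkdir -p /mnt/backups/flash-config-$STAMP
  cp -a /boot/config/. /mnt/backups/flash-config-$STAMP/

  # Only after the snapshot is verified, and with the array stopped, restore the
  # old super.dat (the path to the Apr 2023 backup is a placeholder)
  # cp /path/to/apr2023-flash-backup/config/super.dat /boot/config/super.dat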
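
For the directory pre-creation logic described in post 4: a sketch of how a script could pick the first branch, in a fixed order, with at least 100 GB free and create the target directory there, so that a MergerFS "ep" (existing path) create policy then keeps the files on that branch. The branch paths, pooled mount point, and rsync call are illustrative, not the actual mirror.sh customization.

  # Illustrative branch list, in the preferred fill order
  BRANCHES=(/mnt/disks/bak1 /mnt/disks/bak2 /mnt/disks/bak3)
  MIN_DIR_FREE_KB=$((100 * 1024 * 1024))   # 100 GB threshold for creating a directory
  RELDIR="$1"                              # e.g. "Movies/Some Blu-ray Disc"

  # Create the directory on the first branch that has enough room
  for b in "${BRANCHES[@]}"; do
      free_kb=$(df -Pk "$b" | awk 'NR==2 {print $4}')
      if [ "$free_kb" -ge "$MIN_DIR_FREE_KB" ]; then
          mkdir -p "$b/$RELDIR"
          break
      fi
  done

  # The files then go through the pooled mount (illustrative path), where an
  # existing-path create policy keeps them with the directory created above:
  # rsync -a "/mnt/user/Movies/Some Blu-ray Disc/" "/mnt/backup/$RELDIR/"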
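
For the configure/mount/unmount/expand questions in post 5: a sketch of a mergerfs pool layered over drives already mounted by Unassigned Devices, using the "ff" (first found) create policy plus a minimum-free-space setting as one way to get the fill-one-drive-then-move-on behaviour. Mount points, the pool path, and the 4G threshold are assumptions.

  # Pool the UD-mounted USB drives into a single mount point (paths illustrative)
  mergerfs -o category.create=ff,minfreespace=4G,moveonenospc=true,fsname=backuppool \
      /mnt/disks/bak1:/mnt/disks/bak2:/mnt/disks/bak3:/mnt/disks/bak4:/mnt/disks/bak5 \
      /mnt/backup

  # Back up into the pooled mount point
  # rsync -a /mnt/user/Media/ /mnt/backup/Media/

  # Unmount the pool before unplugging the drives
  umount /mnt/backup

  # Expanding is just remounting with the new drive appended to the branch list,
  # e.g. adding :/mnt/disks/bak6 the next time the pool is mounted.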
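
For the command-line pool expansion mentioned in post 9: a sketch of adding another partition to an already-mounted btrfs pool, the step UD currently has no GUI button for. The device and mount point names are illustrative.

  # Add the new partition to the mounted pool (wipes any existing filesystem on it)
  btrfs device add -f /dev/sdx1 /mnt/disks/Frankenstore

  # Confirm the pool now includes the new device
  btrfs filesystem show /mnt/disks/Frankenstore
  btrfs filesystem df /mnt/disks/Frankenstore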
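
For the half-remembered pool-creation commands in post 11: a sketch of one common way to create a multi-device btrfs pool with no redundancy (single data and metadata profiles), the JBOD-style layout the deleted FAQ apparently covered. This is a guess at those instructions, not a copy of them; the device names and label are illustrative, and mkfs destroys whatever is on the drives.

  # Create one btrfs filesystem spanning all five partitions, no redundancy
  mkfs.btrfs -f -L Frankenstore -d single -m single \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

  # Mounting any one member brings up the whole pool (UD does this by label)
  mount /dev/sdb1 /mnt/disks/Frankenstore

  # A 6th drive can be added later with a device add, as in the sketch above
  # btrfs device add -f /dev/sdg1 /mnt/disks/Frankenstore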
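
For the stalled-parity-check observation in post 14: a sketch of sampling mdcmd status twice to confirm whether mdResyncPos is actually advancing. The 60-second interval is arbitrary.

  # If the position is identical after a minute, the parity check is stalled
  mdcmd status | grep mdResyncPos
  sleep 60
  mdcmd status | grep mdResyncPos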