December 14, 2017
I guess I'm not understanding what's going on under the hood in Unraid. I have 2x 500 GB Samsung Evo SSDs in my cache pool right now. They have been operating as expected, and total capacity is 500 GB. I bought 2 more 500 GB Samsung Evo SSDs and put them in. I expected the cache pool to expand to at least 1 TB, if not 1.5 TB, but that's not the case. The cache pool is still only using 2 devices and the size remains the same. All 4 cache drives are listed under Cache Devices, labeled Cache, Cache 2, Cache 3, Cache 4. There's no configuration option to change things up, so I'm not really sure what I should do at this point to capture that additional space.
Ideally, this would be running as a RAID 5 and I could use 1.5 TB and maintain some redundancy. But if it needs to run RAID 1 without any option to do RAID 5, that's OK, as it should give me 1 TB of space at least. Can anyone help? Thanks
December 14, 2017 Author
Which diagnostics do you mean? Apparently, I have to rebalance the cache pool? Upon reading more about btrfs raid 5, it seems it's unstable and prone to data loss, is that correct? So doing a raid 5 with btrfs is a bad idea?
December 14, 2017
10 minutes ago, joshz said: "Which diagnostics do you mean?"
There is only one: Tools -> Diagnostics
11 minutes ago, joshz said: "Apparently, I have to rebalance the cache pool?"
It should add the new devices automatically in raid1 mode, totaling 1 TB usable; that's why I'd like to see the diags.
11 minutes ago, joshz said: "Upon reading more about btrfs raid 5, it seems it's unstable and prone to data loss, is that correct? So doing a raid 5 with btrfs is a bad idea?"
Yes, it should only be used for testing. Here are the available profiles; with 4 devices I'd use raid10: https://forums.lime-technology.com/topic/46802-faq-for-unraid-v6/?do=findComment&comment=480421
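The capacity figures in the posts above (500 GB with two drives in raid1, 1 TB with four drives in raid1 or raid10, 1.5 TB with raid5) follow from simple arithmetic. Here is a minimal Python sketch of that arithmetic, assuming all devices are the same size and ignoring metadata and system chunk overhead. The function name `usable_gb` is made up for illustration and is not part of Unraid or btrfs-progs.

```python
# Rough usable-capacity estimate for a btrfs pool.
# Assumptions (not from the thread): equal-size devices, no metadata overhead,
# and enough devices for the chosen profile.

def usable_gb(profile: str, device_count: int, device_gb: float) -> float:
    """Return the approximate usable space in GB for a given btrfs profile."""
    if device_count < 1:
        raise ValueError("need at least one device")
    raw = device_count * device_gb
    if profile in ("single", "raid0"):
        return raw                   # no redundancy, all raw space is usable
    if profile in ("raid1", "raid10"):
        return raw / 2               # every block is stored twice
    if profile == "raid5":
        return raw - device_gb       # one device's worth of parity
    if profile == "raid6":
        return raw - 2 * device_gb   # two devices' worth of parity
    raise ValueError(f"unknown profile: {profile}")


if __name__ == "__main__":
    # The pool discussed in this thread: 500 GB SSDs, first two, then four.
    for profile, count in [("raid1", 2), ("raid1", 4), ("raid10", 4), ("raid5", 4)]:
        print(profile, count, "devices:", usable_gb(profile, count, 500), "GB")
```

With unequal device sizes the real numbers differ, so treat this only as a sanity check on what the GUI reports.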
December 14, 2017 Author
I rebalanced with the parameters --dconvert=raid10 --mconvert=raid10 and it's showing 1 TB now. It most definitely did not do it automatically, though; it required a forced rebalance. All appears to be well at this point, but the additional drives still don't show up in the diagnostics or listing.
newmediaserver-diagnostics-20171214-1634.zip
December 14, 2017
8 minutes ago, joshz said: "All appears to be well at this point,"
No, not well: the pool is using the single profile with only 2 drives.
Data, single: total=200.00GiB, used=92.63GiB
System, single: total=64.00MiB, used=48.00KiB
Metadata, single: total=2.00GiB, used=506.22MiB
GlobalReserve, single: total=111.39MiB, used=0.00B
Label: none uuid: 94563407-72d4-414d-94b1-174f382195f5
Total devices 2 FS bytes used 93.12GiB
devid 1 size 465.76GiB used 101.00GiB path /dev/sde1
devid 2 size 465.76GiB used 101.06GiB path /dev/sdf1
It is also corrupt:
Dec 14 14:08:24 NewMediaServer kernel: BTRFS info (device sdf1): read error corrected: ino 1 off 1805320192 (dev /dev/sde1 sector 3526016)
Dec 14 14:08:24 NewMediaServer kernel: BTRFS info (device sdf1): read error corrected: ino 1 off 1805324288 (dev /dev/sde1 sector 3526024)
Dec 14 14:08:24 NewMediaServer kernel: BTRFS info (device sdf1): read error corrected: ino 1 off 1805328384 (dev /dev/sde1 sector 3526032)
Dec 14 14:08:24 NewMediaServer kernel: BTRFS info (device sdf1): read error corrected: ino 1 off 1805332480 (dev /dev/sde1 sector 3526040)
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805352960 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS info (device sdf1): read error corrected: ino 1 off 1805352960 (dev /dev/sde1 sector 3526080)
Dec 14 14:08:24 NewMediaServer kernel: BTRFS info (device sdf1): read error corrected: ino 1 off 1805357056 (dev /dev/sde1 sector 3526088)
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805369344 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805467648 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805484032 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805500416 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805516800 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1805598720 wanted 114187 found 113260
Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): parent transid verify failed on 1806237696 wanted 114187 found 113261
You'll want to back up, re-format, and restore the data.
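For anyone reading a syslog like the one in the post above, the "parent transid verify failed" lines can be tallied to see how far behind the on-disk metadata is. In these messages "wanted" is the generation the parent block expects and "found" is the generation actually read, so a gap means stale metadata was found. The Python sketch below is only an illustration: the regular expression is written against the line format shown in this thread, and the function name `summarize_transid_errors` is hypothetical.

```python
import re
from collections import Counter

# Pattern written against the kernel messages quoted in this thread;
# other kernel versions may word the message differently (assumption).
TRANSID_RE = re.compile(
    r"BTRFS error \(device\s+(?P<dev>\w+)\):\s+parent transid verify failed\s+"
    r"on (?P<block>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def summarize_transid_errors(log_text: str) -> dict:
    """Count transid failures per (device, generation gap)."""
    counts = Counter()
    for m in TRANSID_RE.finditer(log_text):
        gap = int(m["wanted"]) - int(m["found"])
        counts[(m["dev"], gap)] += 1
    return dict(counts)


# Two lines copied from the log in this thread.
sample_log = (
    "Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): "
    "parent transid verify failed on 1805352960 wanted 114187 found 113260\n"
    "Dec 14 14:08:24 NewMediaServer kernel: BTRFS error (device sdf1): "
    "parent transid verify failed on 1806237696 wanted 114187 found 113261\n"
)

if __name__ == "__main__":
    for (dev, gap), n in summarize_transid_errors(sample_log).items():
        print(f"{dev}: {n} block(s) are {gap} generations behind")
```

A large gap like this suggests one device missed many writes at some point, but the log alone does not say why.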
December 14, 2017 Author
Hmm... well, BTRFS strikes again. It's a real shame you're forced into using it. It's not production ready and I really detest it as a file system for production use. It's great for hobbyists and testing, but it's garbage when it comes to a live environment. Any idea why Unraid forces the cache drives to be btrfs?
December 14, 2017
32 minutes ago, joshz said: "BTRFS strikes again. It's a real shame you're forced in to using it. It's not production ready and I really detest it as a file system for production use."
btrfs works very well (except raid5/raid6) in a stable server. It has issues when there are hardware issues in a multi-device pool. Can you post the output of:
btrfs dev stats /mnt/cache
December 15, 2017 Author
That's the problem with BTRFS: it has no graceful recovery from problems the way production file systems do. All software works great when there are no problems. Good software is differentiated from bad software when it can handle issues and not crap the bed. BTRFS craps the bed at the slightest provocation, as evidenced here. The fact that I have to literally blow out the whole raid, reformat, and recreate it is indicative of not being ready for prime time. Is there any way to switch my cache drive to something more stable? Anyway, here is the requested output:
[/dev/sde1].write_io_errs 0
[/dev/sde1].read_io_errs 0
[/dev/sde1].flush_io_errs 0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
[/dev/sdf1].write_io_errs 0
[/dev/sdf1].read_io_errs 0
[/dev/sdf1].flush_io_errs 0
[/dev/sdf1].corruption_errs 0
[/dev/sdf1].generation_errs 0
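The `btrfs dev stats` output in the post above is easy to check by eye with two devices, but with larger pools it can help to scan it automatically. Below is a small Python sketch that parses text in the `[device].counter value` format shown in this thread and lists devices with any non-zero counter. The helper names `parse_dev_stats` and `devices_with_errors` are made up for this example, and the parser assumes the format stays as posted here.

```python
import re

# Matches entries such as "[/dev/sde1].write_io_errs 0", whether they are
# on separate lines or run together on one line.
STAT_RE = re.compile(r"\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)")


def parse_dev_stats(text: str) -> dict:
    """Return {device: {counter_name: value}} from `btrfs dev stats` text."""
    stats = {}
    for m in STAT_RE.finditer(text):
        stats.setdefault(m["dev"], {})[m["counter"]] = int(m["value"])
    return stats


def devices_with_errors(stats: dict) -> list:
    """Return the sorted devices that have at least one non-zero counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))


# Output copied from the post above.
sample_stats = """
[/dev/sde1].write_io_errs 0
[/dev/sde1].read_io_errs 0
[/dev/sde1].flush_io_errs 0
[/dev/sde1].corruption_errs 0
[/dev/sde1].generation_errs 0
[/dev/sdf1].write_io_errs 0
[/dev/sdf1].read_io_errs 0
[/dev/sdf1].flush_io_errs 0
[/dev/sdf1].corruption_errs 0
[/dev/sdf1].generation_errs 0
"""

if __name__ == "__main__":
    bad = devices_with_errors(parse_dev_stats(sample_stats))
    print("devices with errors:", bad if bad else "none")
```

All-zero counters only mean no device errors were recorded since the counters were last reset; they do not rule out earlier trouble.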
December 15, 2017
Stats look normal. You can change to xfs, but you will be limited to a single device.
December 15, 2017
7 hours ago, joshz said: "That's the problem with BTRFS, it has no graceful recovery from problems like production file systems. All software works great when there's no problems. Good software is differentiated from bad software when it can handle issues and not crap the bed. BTRFS craps the bed at the slightest provocation. As evidenced here."
I have several thousand BTRFS file systems installed in vehicles and haven't seen any evidence that they are extra fragile. When the vehicle power is cut, the devices die without any chance to perform an ordered shutdown. However, the units seem to recover OK. I have also been using BTRFS in quite a number of server installations with no bad outcome. BTRFS isn't perfect, but it is at least a notch or two better than your post suggests.
December 15, 2017
22 minutes ago, pwm said: "BTRFS isn't perfect, but at least a notch or two better than your post suggests."
Agree, I've been using it as the only filesystem (array + cache + unassigned devices) on all my servers for some time without any major issues. Without the logs showing the start of the issues I can't guess what happened to your filesystem. Stats are OK, so there are no apparent hardware device issues, but most likely something serious happened.
Edited December 15, 2017 by johnnie.black
December 15, 2017
All file systems are quite vulnerable to transfer errors, or to software/hardware issues that make the machine send bad data. Journaling, copy-on-write, etc. are great for recovery after partial writes but can't protect against garbage writes. Garbage writes can corrupt not just file data but also the internal structures the file system uses to recover from a crash or power loss. That's also why critical servers use ECC memory, and why the internal caches and buses of server-class processors use ECC.
December 15, 2017
27 minutes ago, pwm said: "That's also why critical servers are making use of ECC memory, and why the internal cache and busses of server-class processors are making use of ECC."
Agree again, ECC is definitely recommended for a storage server and is what I use on all my servers.