WEHA Posted November 2, 2020

I shut down my Unraid server to upgrade the RAM (64GB to 128GB). Once restarted I enabled Docker and virtual machines... The Docker page says the service failed to start, and the VMs barely boot, claiming corruption everywhere. Checked the syslog and saw these messages everywhere:

BTRFS error (device nvme0n1p1): parent transid verify failed on 1326431879168 wanted 214008171 found 213104704
BTRFS error (device nvme0n1p1): parent transid verify failed on 612809326592 wanted 213820848 found 213103567

dev stats:

[/dev/nvme0n1p1].write_io_errs    291117377
[/dev/nvme0n1p1].read_io_errs     382285749
[/dev/nvme0n1p1].flush_io_errs    2039859
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     51
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0

fi usage:

Overall:
    Device size:         3.64TiB
    Device allocated:    3.46TiB
    Device unallocated:  177.91GiB
    Device missing:      0.00B
    Used:                3.00TiB
    Free (estimated):    326.36GiB  (min: 326.36GiB)
    Data ratio:          2.00
    Metadata ratio:      2.00
    Global reserve:      512.00MiB  (used: 0.00B)

                   Data     Metadata  System
Id Path            RAID1    RAID1     RAID1     Unallocated
-- --------------  -------  --------  --------  -----------
 1 /dev/nvme0n1p1  1.73TiB  4.00GiB   64.00MiB  88.95GiB
 2 /dev/nvme1n1p1  1.73TiB  4.00GiB   64.00MiB  88.95GiB
-- --------------  -------  --------  --------  -----------
   Total           1.73TiB  4.00GiB   64.00MiB  177.91GiB
   Used            1.50TiB  1.62GiB   272.00KiB

What would be the best course of action to make sure my data is not gone? I can still access the shares, but I'm concerned about their state. How can I identify the problem: is it the NVMe disk or the motherboard? The NVMe disk is in a motherboard slot, so no add-on card. Stopping and starting the array does not make the error count go up though...
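The dev stats above can be triaged mechanically: any counter above zero points at the suspect device. A minimal sketch, using the counters quoted in this post as a canned sample (on a live system you would pipe `btrfs dev stats /mnt/cache` in instead):

```shell
# Triage `btrfs dev stats` output: print every counter that is above zero.
# The sample below is the exact stats quoted in this post; on a live system,
# replace the printf with: btrfs dev stats /mnt/cache
stats='[/dev/nvme0n1p1].write_io_errs 291117377
[/dev/nvme0n1p1].read_io_errs 382285749
[/dev/nvme0n1p1].flush_io_errs 2039859
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 51
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0'
printf '%s\n' "$stats" | awk '$2 > 0 { print "nonzero:", $1, $2 }'
```

Here everything points at nvme0n1p1; nvme1n1p1's 51 read errors are minor by comparison.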
JorgeB Posted November 2, 2020

Start by running a scrub. More info here.
WEHA Posted November 2, 2020

34 minutes ago, JorgeB said: Start by running a scrub. More info here.

In the meantime I had disconnected the NVMe that generated the errors, so I could recover from the other NVMe disk; I had not started the array yet. Then I saw your message and reconnected it, but now Unraid is started and does not recognize the reconnected disk?? Am I f'ed now?

EDIT: just to be clear, the device is listed but blue, so when selecting it, it says "data will be overwritten".
JorgeB Posted November 2, 2020

If you didn't start the array without it, it should still accept both devices. If it doesn't, best to just leave the known good one; if all is well after the array starts, then wipe and re-add the other one.
WEHA Posted November 2, 2020

8 minutes ago, JorgeB said: If you didn't start the array without it, it should still accept both devices...

So I started the array with the "bad" NVMe unassigned, but it seems it is still using it? Status shows both devices, and the syslog contains many errors about it again. On the main page it still shows the "bad" one as unassigned. This scares me... When I click the cache disk in the GUI it says a balance is running? :s
JorgeB Posted November 2, 2020

Run a scrub, and if all errors are corrected, re-set the pool. To do that: stop the array; if the Docker/VM services are using the cache pool, disable them; unassign all cache devices; start the array to make Unraid "forget" the current cache config; stop the array; reassign all cache devices (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device); re-enable Docker/VMs if needed; start the array.
WEHA Posted November 2, 2020

3 minutes ago, JorgeB said: Run a scrub, and if all errors are corrected, re-set the pool...

Run the scrub while the balance is running?
JorgeB Posted November 2, 2020

After it finishes, though there's not much point in running a balance.
WEHA Posted November 2, 2020

1 minute ago, JorgeB said: After it finishes, though there's not much point in running a balance.

So should I just cancel it then? Because it will take a while to finish...
WEHA Posted November 2, 2020

3 minutes ago, JorgeB said: Might as well.

Ok, balance canceled and the scrub is now running. It now says it's 4TB instead of 2TB (it's 2x 2TB NVMe). While it runs, the status is counting up corruption & generation errors on the bad one; is this a problem? It was 0 just before the scrub.
JorgeB Posted November 2, 2020

3 minutes ago, WEHA said: is this a problem?

Not if all are corrected; if there are any uncorrectable errors you need to re-do the pool.
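The corrected/uncorrectable distinction can be read straight out of the scrub report rather than eyeballed in the GUI. A sketch against a hypothetical `btrfs scrub status` report (the UUID, timings and counts below are invented; only the field names follow btrfs-progs output):

```shell
# Hypothetical `btrfs scrub status /mnt/cache` report; the UUID, byte counts
# and error counts are invented, but the field names follow btrfs-progs.
status='scrub status for 2a5a8b9e-0000-0000-0000-000000000000
        scrub started at Mon Nov  2 10:00:00 2020 and finished after 01:02:03
        total bytes scrubbed: 1.50TiB with 1234 errors
        error details: verify=1200 csum=34
        corrected errors: 1234, uncorrectable errors: 0, unverified errors: 0'
uncorrectable=$(printf '%s\n' "$status" | sed -n 's/.*uncorrectable errors: \([0-9]*\).*/\1/p')
if [ "$uncorrectable" -eq 0 ]; then
  echo "all errors corrected: safe to wipe and re-add the removed device"
else
  echo "uncorrectable errors present: the pool needs to be re-done"
fi
```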
WEHA Posted November 2, 2020

Scrub has finished, no uncorrectable errors. I have started to move data off the cache. The VMs that were started before are still having problems; I suppose they were "permanently" corrupted when they were started before the scrub. Didn't try Docker yet, I'm assuming it will be the same. Is there a reason these errors are not taken into account when starting the array? I'm guessing I will have to spend hours getting my VMs fixed... not to mention possible other data that was corrupted while there was still a good drive, from what I can tell :(
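One way to find out which files were actually hit is to grep the syslog for btrfs checksum failures and resolve the inode numbers they name. A sketch against a single hypothetical log line (the device, inode and csum values are invented):

```shell
# Hypothetical btrfs checksum-failure line as it appears in the syslog; the
# device, inode and csum values here are invented for illustration.
log='BTRFS warning (device nvme1n1p1): csum failed root 5 ino 259 off 0 csum 0x8941f998 expected csum 0x3ab9f8a1 mirror 1'
ino=$(printf '%s\n' "$log" | grep -o 'ino [0-9]*' | awk '{ print $2 }')
echo "corrupted inode: $ino"
# On a live pool the inode maps back to a file path with:
#   btrfs inspect-internal inode-resolve "$ino" /mnt/cache
```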
JorgeB Posted November 3, 2020

Like mentioned in the link above, any share set to NOCOW (the default for the system shares) can't be fixed, since checksums are disabled.
WEHA Posted November 3, 2020

1 hour ago, JorgeB said: Like mentioned in the link above, any share set to NOCOW (the default for the system shares) can't be fixed, since checksums are disabled.

Sure, but like you also mentioned, the hope was to use btrfs stats in the near future... that was 2 years ago. I have the notification from the linked thread set up, only this error count appeared after a reboot (it was 0 before) and Unraid just happily started the array... So the notification was useless in this case.
JorgeB Posted November 3, 2020

That part can be corrected by the user: just create new shares with COW on, it's what I've always done.
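A quick way to see which shares are in the NOCOW state is the `C` attribute that `lsattr -d` reports on the share directory. A sketch against a canned lsattr-style line (the path and flag string are illustrative):

```shell
# Hypothetical `lsattr -d /mnt/cache/domains` output; the 'C' flag means NOCOW,
# i.e. btrfs keeps no checksums there, so a scrub cannot repair those files.
line='---------------C---- /mnt/cache/domains'
flags=${line%% *}             # first field: the attribute flags
case "$flags" in
  *C*) echo "NOCOW share: no checksums, scrub cannot fix corruption here" ;;
  *)   echo "COW share: checksummed, repairable from the RAID1 mirror" ;;
esac
```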
WEHA Posted November 3, 2020

55 minutes ago, JorgeB said: That part can be corrected by the user: just create new shares with COW on, it's what I've always done.

I never knew about COW & NOCOW, because those shares are created by default and all the rest are set to Auto by default... This should be flagged much more prominently in the shares list, like the warning sign. That warning sign showing a green light tells me the share is safe, but in reality it's not... I just spent 16 hours overnight with no sleep getting everything working again because of a silly parameter... Very disappointing. Thank you for your quick responses though.
JorgeB Posted November 3, 2020

COW enabled has some disadvantages for VMs, like increased fragmentation and write amplification; that's why LT disabled it by default for those shares, but I still prefer to have it enabled for data integrity.
WEHA Posted November 3, 2020

Just now, JorgeB said: COW enabled has some disadvantages for VMs, like increased fragmentation and write amplification; that's why LT disabled it by default for those shares, but I still prefer to have it enabled for data integrity.

True, but isn't that negated when we're talking about an SSD? I too prefer integrity and was not aware my data was not safe.
JorgeB Posted November 3, 2020

16 minutes ago, WEHA said: True, but isn't that negated when we're talking about an SSD?

Fragmentation, yes; the increased write amplification not so much, since it can reduce the SSD's life.
WEHA Posted November 3, 2020

Just now, JorgeB said: Fragmentation, yes; the increased write amplification not so much, since it can reduce the SSD's life.

There are arguments for both, but I still think the share list gives a false sense of security. I should never have started the array with the "bad" one still connected, as that corrupted everything on the domains & system shares. It's disappointing that after n years it seems too difficult to check btrfs stats and to show NOCOW/COW status more clearly. For something that is a NAS at its core, that seems quite relevant and important.
WEHA Posted November 7, 2020

On 11/3/2020 at 1:33 PM, JorgeB said: Fragmentation, yes; the increased write amplification not so much, since it can reduce the SSD's life.

So I read about the "bug" that causes many writes to SSDs, especially EVOs... Mine are rated for 1200TBW and are at around 1500TBW now (after 2 years). The new beta has a solution, but it also has issues. My thought is: can I upgrade to the new beta, recreate the cache (on new drives) with the new partition layout, and revert back to 6.8.3 if the need arises?
JorgeB Posted November 8, 2020

14 hours ago, WEHA said: My thought is: can I upgrade to the new beta, recreate the cache (on new drives) with the new partition layout, and revert back to 6.8.3 if the need arises?

Yes, if you're using a pool.