garycase Posted June 6, 2016

Well, I finally got around to setting up a test server for v6.2, and noted a couple of anomalies. Not sure if these have already been reported (I scanned this thread, but not all of the previous ones), but here's what I noted today while setting it up.

I used an old system with a C2SEE board with a Pentium E6300, 4GB, and 10 old 1.5TB and 2TB drives (reduced to 9 after 30 minutes, as I'll note below). I configured the system with the two newest 2TB drives as parity, and 6 2TB and 2 1.5TB data drives, and started the array. After about 30 minutes I checked on the status and noticed that one of the 2TB drives had already had 5 read errors, so I decided to simply reduce the array by one drive and stopped the parity sync.

Anomaly #1: I then did a New Config and assigned the same two parity drives and 7 of the data drives (excluding the one I didn't want to use) ... and then Started the array. The system did NOT start a parity sync, but claimed that parity was already good. Clearly it was NOT. So I then did another New Config and assigned only the 7 data drives, leaving parity unassigned. I then Started the array; formatted the data drives; stopped the array; assigned the two parity drives; and Started the array, and the system then began a parity sync with no problem.

Anomaly #2: During the parity sync, I would check on the status about every 30-60 minutes. More often than not, when I refreshed the page, only the "bottom section" was shown; the area where the disks are displayed on the Main tab (the section labeled "Array Devices") was blank. Another refresh of the page would fix this, but it happens VERY often.

The sync just finished before I wrote this, so I haven't had a chance to really do much with it yet. I'm planning to copy several TB of data; "fail" a drive (yank it out mid-operation); and then fail a 2nd drive while rebuilding the "failed" one over the next few days. I presume all will work perfectly, but just thought I'd mention the behaviors I noted while setting it up.
JorgeB Posted June 6, 2016

Anomaly #1: I then did a New Config, and assigned the same two parity drives and 7 of the data drives (excluding the one I didn't want to use) ... and then Started the array. The system did NOT start a parity sync, but claimed that parity was already good. Clearly it was NOT.

Are you sure you didn't check the "parity is already valid" checkbox by mistake? It has happened to me a few times, since it's in the same place as the "I'm sure I want to do this" checkbox (when displayed). I've done tens of new configs with single and dual parity, and a parity sync always begins unless the trust-parity box is checked.
garycase Posted June 6, 2016

Anything's possible, although I'm reasonably sure I didn't do that. I had the same thought, and I'm ALMOST certain I didn't, but it certainly could be what happened ...
tuxbass Posted June 6, 2016

Is there any way to still mount the drives and access my data without an Internet connection? I wasn't aware of the requirement, and now I'm unable to get at the shares because I don't have access to the net. Or is there documentation for configuring internal wifi cards?
JorgeB Posted June 6, 2016

Is there any way to still mount the drives without an Internet connection to access my data? Wasn't aware of the requirement and now am unable to get hold of the shares due to not having access to the net. Or is there documentation for configuring internal wifi cards?

Probably easiest to downgrade to v6.1, just copy bzroot and bzimage to the flash and reboot.
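In script form, the downgrade JorgeB describes amounts to copying two files onto the flash drive. A minimal sketch follows; the paths are assumptions (/boot is where unRAID normally mounts the flash, and you need your own backup copy of the v6.1 bzroot/bzimage), so verify them on your server before running anything like this.

```python
import pathlib
import shutil

def downgrade(flash: str, backup: str) -> None:
    """Copy the v6.1 kernel images back onto the flash drive.

    flash  -- the unRAID USB mount point (normally /boot)
    backup -- wherever you kept the v6.1 bzroot/bzimage
    Both paths are assumptions for illustration, not fixed locations.
    """
    for name in ("bzroot", "bzimage"):
        shutil.copy2(pathlib.Path(backup) / name, pathlib.Path(flash) / name)
    # After the copy, reboot the server to boot into v6.1.
```

Reverting later is the same operation in the other direction: keep the v6.2 files somewhere safe before overwriting them.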
tuxbass Posted June 6, 2016

Probably easiest to downgrade to v6.1, just copy bzroot and bzimage to the flash and reboot.

And later on simply revert back by copying the originals back? Any considerable risk involved with this move?
JorgeB Posted June 6, 2016

And later on simply revert back by copying originals back? Any considerable risk involved with this move?

No risk for the array; the only thing is that if you're using dual parity, it will be unassigned on v6.1. If you're using VMs, I'm not sure they'll keep working after downgrading without any config changes.
tuxbass Posted June 6, 2016

No risk for the array, only thing is if you're using dual parity it will be unassigned on v6.1. If you're using VMs not sure if they stay working after downgrading without any config changes.

I'll avoid spinning the VMs up. What about Dockers, should they be fine?
JorgeB Posted June 6, 2016

What about dockers, should be fine?

Yes.
dAigo Posted June 6, 2016

Anomaly #1: I then did a New Config, and assigned the same two parity drives and 7 of the data drives (excluding the one I didn't want to use) ... and then Started the array. The system did NOT start a parity sync, but claimed that parity was already good. Clearly it was NOT.

Are you sure you didn't check the "parity is already valid" checkbox by mistake? I've done tens of new configs with single and dual parity and a parity sync always begins if not checking the trust parity box.

Maybe related, but probably not. I just finished my scheduled two-month parity check and it came back with 60748078 errors...

I was one of the people who had trouble with the freezing server and the resulting hard-resets. I also reproduced these freezes and tried my best to make sure all of my disks were indeed spun down before cutting power. I always wondered why unRAID never started a parity check after rebooting; I assumed that was thanks to those efforts. I have not yet seen any corrupt data, in or outside of any VM, so I think/hope that the data was written correctly, but that due to the lockup in the unRAID driver the changes didn't make it to parity. No disks, parity or data, show any errors or SMART warnings. While the lock-ups were definitely annoying, I think not starting a parity check when one should be started is a more severe issue.

The other thing that may have something to do with it: I switched one of the disks from xfs to reiserfs and back again. But I always moved every file to another disk before I took the array offline, changed the filesystem, started the array, and hit "format" in the GUI. I don't think that should invalidate parity; or, since it's such an easy task through the GUI, it should at least warn the user that it does and start a new parity sync.
However, the most concerning thing for me is that the scheduled parity check runs with "nocorrect", in case I need to manually check whether parity or data is wrong. With 60748078 errors and nocorrect, shouldn't parity be invalidated? It still states "parity valid"... I am tempted to run a manual parity check with "write corrections", but I would be willing to run another "check only" if you think it helps to find the issue. The last option would be that I am confused and everything is as it should be, but an explanation would be nice in that case.

Jun 6 01:00:01 unRAID kernel: mdcmd (117): check NOCORRECT
Jun 6 01:00:01 unRAID kernel: md: recovery thread: check P ...
Jun 6 01:00:01 unRAID kernel: md: using 1536k window, over a total of 5860522532 blocks.
Jun 6 01:00:11 unRAID kernel: md: recovery thread: P incorrect, sector=0
Jun 6 01:00:11 unRAID kernel: md: recovery thread: P incorrect, sector=8
Jun 6 01:00:11 unRAID kernel: md: recovery thread: P incorrect, sector=16
Jun 6 01:00:11 unRAID kernel: md: recovery thread: P incorrect, sector=24
[... identical "P incorrect" lines for every 8th sector, continuing through sector=792 ...]
Jun 6 01:00:11 unRAID kernel: md: recovery thread: stopped logging
Jun 6 17:55:24 unRAID kernel: md: sync done. time=60923sec
Jun 6 17:55:24 unRAID kernel: md: recovery thread: completion status: 0

Diag. & screenshots attached... unraid-diagnostics-20160606-1915.zip
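For readers wondering what "P incorrect" means in that log: with single (P) parity, the parity block should equal the byte-wise XOR of the data blocks at the same offset, and a check pass recomputes that XOR and compares it to what the parity drive holds. The sketch below is a conceptual model of that rule (the general RAID-4-style scheme), not unRAID's actual driver code; the data values are made up for illustration.

```python
def p_parity(blocks):
    """Byte-wise XOR of equal-length data blocks -> expected P parity."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Toy example: three tiny data "blocks" and a parity drive that was never
# synced (reads back zeros), so the compared sector reports "P incorrect".
data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
stored_p = b"\x00\x00"
expected = p_parity(data)   # b"\x15\x2a"
print(expected != stored_p)  # True -> this sector would be logged as incorrect
```

If every sector mismatches from sector 0 onward, as in the log above, the simplest explanation is that parity was never built at all, which is what RobJ concludes later in the thread.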
tuxbass Posted June 8, 2016

What about dockers, should be fine?

Yes.

After removing the USB from the box, the filesystem got damaged somehow and wasn't mountable. After resolving that issue, the downgrade was a success. To a degree, anyway: no Docker containers or VM images can be seen under their respective tabs in the web UI. Can someone confirm whether this is expected?
RobJ Posted June 8, 2016

I just finished my scheduled two-month parity check and it came back with 60748078 errors... I was one of the people who had trouble with the freezing server and the resulting hard-resets. I also reproduced these freezes and tried my best to make sure all of my disks were indeed spun down before cutting power. I always wondered why unRAID never started a parity check after rebooting; I assumed that was thanks to those efforts. I have not yet seen any corrupt data, in or outside of any VM, so I think/hope that the data was written correctly, but that due to the lockup in the unRAID driver the changes didn't make it to parity. No disks, parity or data, show any errors or SMART warnings. While the lock-ups were definitely annoying, I think not starting a parity check when one should be started is a more severe issue. The other thing that may have something to do with it: I switched one of the disks from xfs to reiserfs and back again. But I always moved every file to another disk before I took the array offline, changed the filesystem, started the array, and hit "format" in the GUI. I don't think that should invalidate parity; or, since it's such an easy task through the GUI, it should at least warn the user that it does and start a new parity sync. However, the most concerning thing for me is that the scheduled parity check runs with "nocorrect", in case I need to manually check whether parity or data is wrong. With 60748078 errors and nocorrect, shouldn't parity be invalidated? It still states "parity valid"... I am tempted to run a manual parity check with "write corrections", but I would be willing to run another "check only" if you think it helps to find the issue. The last option would be that I am confused and everything is as it should be, but an explanation would be nice in that case.

* The syslog unfortunately has been edited, and while it's OK to edit out the files moved, we really need the rest of the syslog. If there's an issue, the 2 most important parts of the syslog are the part where the errors begin AND the initial setup, all of it. If either of those is missing, then we have to speculate and can't do a very good analysis. Your syslog is missing the entire beginning, so perhaps there's a syslog.1 or syslog.2 in /var/log?

* You booted some time on the 25th, but the syslog begins on the 29th, and very unfortunately there was some kind of kernel crash on the 28th, involving KVM. It caused a register dump in the KVM log, but it would be very useful to know what appeared in the syslog. Once a critical event like that happens, I would never consider the system to be stable, and that makes everything that occurred afterwards suspect. If you detect a kernel crash of any kind, it's best to reboot, even if the system appears to have recovered.

* The unusual behaviors you mentioned do seem wrong, but since the system is 'suspect', there's not much we can conclude; we don't know if it's truly from a bug or just a corrupted system from the earlier 'crash' event.

* *If* all was fine with the system, then I would have to say that parity had never been built, as every single parity block was wrong, up until logging was stopped.

* Something I noticed, and found in other 6.2-beta21 diagnostics with no second parity drive assigned: the second parity drive is marked as DISK_NP_DSBL, not DISK_NP. That is, it's marked as "disabled", not "not present". And the vars for the system do not count the second parity drive, but do count a disabled drive and an invalid drive, even though you don't have any invalid or disabled drives.

[mdDisabledDisk] => (null)
[mdInvalidDisk] => (null)
[mdMissingDisk] => (null)
[mdNumDisks] => 6
[mdNumDisabled] => 1
[mdNumInvalid] => 1
[mdNumMissing] => 0
...
[sbNumDisks] => 7

You have 5 data drives and 1 parity drive, making 6 array drives, none of which are invalid or disabled. I'm not sure what the 7 is counting.
One conjecture: when looping through the drive count, if the parity function checked whether there are disabled disks, it would see a positive count and possibly check the var (mdDisabledDisk) for its index number, which would evaluate to zero, meaning skip the parity drive. This conjecture seems improbable though, as more users would have hit this situation too.
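RobJ's conjecture can be sketched in a few lines. Everything here is a hypothetical illustration of the idea, not unRAID's actual driver code: if mdDisabledDisk is null and gets coerced to integer 0, a "skip the disabled disk" test would wrongly skip index 0, which is the parity drive in unRAID's numbering.

```python
def drives_to_sync(num_drives, num_disabled, disabled_disk):
    """Hypothetical sketch of the conjectured bug.

    disabled_disk may be None (the "(null)" in the diagnostics dump);
    coercing it to an integer yields 0, the parity drive's index.
    """
    disabled_index = int(disabled_disk or 0)  # null coerces to 0
    synced = []
    for i in range(num_drives):
        if num_disabled > 0 and i == disabled_index:
            continue  # believed to be disabled -> skipped
        synced.append(i)
    return synced

# With the values from the dump (mdNumDisabled=1, mdDisabledDisk=null),
# drive 0 (parity) is silently skipped even though nothing is disabled:
print(drives_to_sync(6, 1, None))  # [1, 2, 3, 4, 5]
```

With a correct count (mdNumDisabled=0) the same loop covers all six drives, which is why the spurious disabled/invalid counts in the vars are the suspicious part.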
JorgeB Posted June 8, 2016

After removing the usb from the box, the filesystem got damaged somehow and wasn't mountable. After resolving the issue, the downgrade was a success. To some degree - no docker container nor vm images can be seen under their respective tabs in the webui. Can someone confirm if this is expected?

Go to the Docker tab and add a container; all your previous Dockers should appear under the user-defined templates. Just add them again, and their settings will remain the same.
hawihoney Posted June 8, 2016

Just a brief update, we are pretty confident we have discovered the bug causing deadlocks and system hangs and are in the process of testing patched code now before rolling out a new release. Thank you all for your patience with us as we worked to get to the bottom of this very nasty bug.

May I vote for a beta release just fixing this nasty bug (e.g. 6.2.0-beta21a)? After uncounted hard-reboots because of freezing unRAID machines, several of my drives are gone. No cable problem, no adapter problem; they are just dead. If a machine freezes it is no longer usable. Even a graceful IPMI shutdown is not possible; I have to use the power-reset from IPMI. I don't know how many forced power-outages a typical drive survives ...

IMHO, it doesn't make sense to test Docker, KVM or whatever if a core function like copying one file over SMB can freeze a machine. Just my 0.02. Thanks for listening.
bungee91 Posted June 8, 2016

May I vote for beta release just fixing this nasty bug (e.g. 6.2.0-beta21a)? After uncounted hard-reboots because of freezing unRAID machines several of my drives are gone. No cable problem, no adapter problem, they are just dead. If a machine freezes it is no longer usable. Even a graceful IPMI down is not possible. I have to use the power-reset from IPMI. Don't know how many forced power-outages a typical drive survives ... IMHO, it doesn't make sense to test Docker, KVM or whatever if a core functionality like copying one file over SMB can freeze a machine. Just my 0.02. Thanks for listening.

Stop Array. Go to Settings --> Disk Settings and change the num_stripes tunable from 1028 to 8192. Save. Start Array. That should work around the deadlock / web UI unresponsive issues during heavy IO for now.

The setting above has fixed lockups for me. Have you performed this temporary workaround and still have lockups?
hawihoney Posted June 8, 2016

I didn't find a setting like that. Do you mean md_num_stripes, which defaults to 1280? Is this a LimeTech workaround suggestion? TIA
Frank1940 Posted June 8, 2016

I didn't find a setting like that. Do you mean md_num_stripes that defaults to 1280? Is this a LimeTech workaround suggestion? TIA

Yes and Yes!
nexusmaniac Posted June 9, 2016

"Dynamix GUI 2016.06.08 is available. Download Now"

Looks like it was taken back down though, as I can't download it. It did, however, have the unRAID version number listed as beta22 on GitHub.
bluepr0 Posted June 9, 2016

"Dynamix GUI 2016.06.08 is available. Download Now" Looks like it was taken back down though - As I can't download it It did however have the unRAID version number listed as beta22 on GitHub

yep same here... NEW BETA IS COMING!
testdasi Posted June 9, 2016

yep same here... NEW BETA IS COMING!

;D ;D
eschultz Posted June 11, 2016

FYI - Beta 22 has been released. Here is the announcement and release thread: http://lime-technology.com/forum/index.php?topic=49667.0