tillo Posted March 13, 2018
Hello Unraid forum. From time to time I get errors after a parity check (not always, but sometimes, in the range of 0-10 errors per check). The check is scheduled to run every 30 days.
1* This is not normal, right? I should expect 0 errors all the time, right?
2* When we talk about 0-10 errors, are we talking about blocks on the disk, files, or some other unit? What unit defines 1 error?
I also have the checksum plugin added to my Unraid, and that throws me a few errors from time to time too.
3* I assume that this is related to the same errors as the parity check gives me?
FYI: The server is running on an Intel S1200V3RPS with ECC memory, so hopefully no problem there. But if anyone knows of some magical switch in the BIOS that has to be enabled in order for the ECC memory to do its magic, please tell me. It might not be enabled by default.
I am grateful for all the help I can get.
JorgeB Posted March 13, 2018
Post the diagnostics after a parity check with errors; maybe there's some clue there.
tillo Posted March 15, 2018
@johnnie.black Logfile. Today I got a whole ton of sync errors! (It is still running the sync, and had made it about 50% when I downloaded the diagnostic file.)
JorgeB Posted March 15, 2018
Mar 14 17:14:13 Nano emhttpd: unclean shutdown detected

Sync errors are expected after an unclean shutdown.

PS: the syslog is filled with this:

Mar 14 17:17:38 Nano shfs: share cache full
Mar 14 17:17:39 Nano shfs: share cache full
Mar 14 17:17:40 Nano shfs: share cache full

Some share has a minimum free space setting higher than it should be.
tillo Posted March 15, 2018
I have 316785 errors during the sync. Is that amount expected after an unclean shutdown? Also, how would the minimum free space setting affect the share cache?
JorgeB Posted March 15, 2018
6 minutes ago, tillo said:
I have 316785 errors during the sync. Is that amount expected after an unclean shutdown?
That's a lot. I stopped reading the syslog after the unclean shutdown, but you already removed the diags...
7 minutes ago, tillo said:
Also, how would the minimum free space setting affect the share cache?
There's a share configured to use the cache but with a minimum free space setting higher than what is currently available on the cache, so data written to that share is going directly to the array, and that warning is constantly logged, flooding the syslog.
JorgeB Posted March 15, 2018
This error repeats several times; it's usually a hardware problem:

Mar 14 18:39:27 Nano kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Mar 14 18:39:27 Nano kernel: Do you have a strange power saving mode enabled?
Mar 14 18:39:27 Nano kernel: Dazed and confused, but trying to continue

One of those times it happened just before a lot of sync errors were detected, so it's possibly related:

Mar 15 01:23:35 Nano kernel: Uhhuh. NMI received for unknown reason 31 on CPU 0.
Mar 15 01:23:35 Nano kernel: Do you have a strange power saving mode enabled?
Mar 15 01:23:35 Nano kernel: Dazed and confused, but trying to continue
Mar 15 01:23:35 Nano kernel: md: recovery thread: PQ incorrect, sector=4570300664
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570310904
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570321144
[the same "PQ incorrect" line repeats for dozens of consecutive sectors, 10240 apart, through sector=4570996984]
Mar 15 01:23:39 Nano kernel: md: recovery thread: stopped logging

There's also filesystem corruption on disk1:

Mar 15 04:40:14 Nano kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0xae/0xbe [xfs], xfs_inode block 0x15dd76708
Mar 15 04:40:14 Nano kernel: XFS (md1): Unmount and run xfs_repair
tillo Posted March 15, 2018
Okay, could the first one be related to the new RAM modules I installed?
Quote
Mar 14 18:39:27 Nano kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Mar 14 18:39:27 Nano kernel: Do you have a strange power saving mode enabled?
Mar 14 18:39:27 Nano kernel: Dazed and confused, but trying to continue
The other problem, corrupt data on disk 1: what should I do with that, unmount and run xfs_repair? Should I replace the disk? Replace cables?
Quote
Mar 15 04:40:14 Nano kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0xae/0xbe [xfs], xfs_inode block 0x15dd76708
Mar 15 04:40:14 Nano kernel: XFS (md1): Unmount and run xfs_repair
tillo Posted March 15, 2018
Okay, I have done a complete teardown of the server.
1* I ran "xfs_repair -nv" on disk1; it did not show any faults or errors.
2* I ran memtest (memtest86 version 7.5) on the RAM; it showed no errors except on Test 10 (the fade test). Reading about that test in different forum threads, a few errors during that test seemed to be normal depending on firmware and motherboard.
3* I checked the cables; they are all good (original Super Micro cables with a built-in lock feature).
4* I am totally at a loss currently. What would be the next step?
Edited March 16, 2018 by tillo
JorgeB Posted March 15, 2018
2 hours ago, tillo said:
1* I ran "xfs_repair -nv" on disk1; it did not show any faults or errors.
It's frequently difficult to tell whether there are filesystem issues just by looking at the output of xfs_repair. You either need to check the exit code after running with -n (1 = corruption detected, 0 = no corruption detected), or just run it without -n (the no-modify flag) so any issues found are repaired.
2 hours ago, tillo said:
I am totally at a loss currently. What would be the next step?
The NMI errors can be caused by the PSU or the board, but that's hard to confirm unless you have spares, especially for the board. Maybe you can test with a PSU from another system. Also, if you haven't yet, check the system event log; there might be some clues there.
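The exit-code check described above can be sketched as a small shell snippet. This is a minimal sketch only: the device path /dev/md1 is an example and must be replaced with the mdX node of the disk actually being checked.

```shell
# Check-only pass: -n reports problems but changes nothing on disk.
# The printed report alone can look clean even when corruption exists,
# so test the exit status (0 = no corruption, 1 = corruption detected).
xfs_repair -n /dev/md1   # example device; substitute your disk's mdX node
status=$?
if [ "$status" -eq 0 ]; then
    echo "no corruption detected"
else
    echo "corruption detected (exit status $status)"
fi
```

Note that `$?` must be read immediately after the command, before any other command overwrites it.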
tillo Posted March 16, 2018
Quote
9 hours ago, tillo said:
1* I ran "xfs_repair -nv" on disk1; it did not show any faults or errors.
It's frequently difficult to tell whether there are filesystem issues just by looking at the output of xfs_repair. You either need to check the exit code after running with -n (1 = corruption detected, 0 = no corruption detected), or just run it without -n (the no-modify flag) so any issues found are repaired.
So should I run "xfs_repair -n" or "xfs_repair"? Just to confirm: neither of them, with the -n flag or with no flag at all, will change the files on my system ("non-correcting/altering checks")? Is this correct? Could I run the check from the Unraid web terminal?
I have currently started the array and am running a parity sync (without writing any corrections, just to get an idea of how extensive the errors are). I will upload the diagnostic file when it is done.
Quote
9 hours ago, tillo said:
I am totally at a loss currently. What would be the next step?
The NMI errors can be caused by the PSU or the board, but that's hard to confirm unless you have spares, especially for the board. Maybe you can test with a PSU from another system. Also, if you haven't yet, check the system event log; there might be some clues there.
I have a spare board and a spare PSU. The PSU is a Seasonic PRIME ULTRA TITANIUM. I'd rather not pull the board or PSU out at this point. I will let it run for a few days and then check the log file.
JorgeB Posted March 16, 2018
54 minutes ago, tillo said:
neither of them, with the -n flag or with no flag at all, will change the files on my system
Without -n it will fix the filesystem. This has nothing to do with the parity check; parity is maintained as long as xfs_repair is run on the mdX device.
tillo Posted March 16, 2018
Quote
root@Nano:~# sudo xfs_repair -n /dev/sde1 && echo ${PIPESTATUS[0]}
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
0
This is the output that I get.
Edited March 16, 2018 by tillo
JorgeB Posted March 16, 2018
7 minutes ago, tillo said:
No modify flag set
You're still using -n; with that it won't fix anything.
tillo Posted March 16, 2018
Quote
No modify flag set, skipping filesystem flush and exiting.
0
I could run it without -n, but in any case it showed an exit code of 0, and that would indicate no errors? 0 = no corruption detected? Would it be any different if I ran it without the -n flag?
JorgeB Posted March 16, 2018
I didn't notice the exit code. If it's 0, all is fine, unlike what the syslog indicated. Keep an eye on it for more errors; it could be related to your sync errors.
tillo Posted March 16, 2018
Okay, so should I re-run the parity check and accept all the errors (for this time), and then continue as normal if no more errors occur after that? Or should I continue to investigate the motherboard, CPU, and PSU?
trurl Posted March 16, 2018
36 minutes ago, tillo said:
sudo xfs_repair -n /dev/sde1
That is not the mdX device. You will invalidate parity if you repair the sd device. You need the md# of the disk, not the sdX letter of the disk.
tillo Posted March 16, 2018
@trurl Okay, when running sudo xfs_repair -n /dev/md1 I get an error that the disk is mounted and not accessible. How should I proceed in order to run xfs_repair correctly?
itimpi Posted March 16, 2018
34 minutes ago, tillo said:
@trurl Okay, when running sudo xfs_repair -n /dev/md1 I get an error that the disk is mounted and not accessible. How should I proceed in order to run xfs_repair correctly?
Stop the array; restart it in Maintenance mode; click on the drive on the Main tab; select the option to check the filesystem. If the check indicates an error, rerun it without the -n flag (and, if prompted, add the -L flag) to actually do the repair. When finished, stop the array and restart in normal mode. Doing it this way maintains parity, as all writes to 'md' type devices update parity appropriately.
Edited March 16, 2018 by itimpi
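The GUI steps above correspond roughly to the following commands, shown here as a hedged sketch: it assumes the array has already been started in Maintenance mode and that disk1 maps to /dev/md1 (adjust the number for the disk being repaired).

```shell
# 1. Check only: -n makes no changes; exit status 1 means corruption found.
xfs_repair -n /dev/md1

# 2. If corruption was reported, run the actual repair (no -n).
#    Running against the /dev/mdX device keeps parity valid.
xfs_repair /dev/md1

# 3. Only if xfs_repair refuses to run because of a dirty log:
#    -L zeroes the log and may lose the most recent metadata updates,
#    so treat it as a last resort.
# xfs_repair -L /dev/md1
```

The key design point is step 2's target: writes to the sdX device bypass Unraid's parity layer, while writes to mdX update parity as they go.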
JorgeB Posted March 16, 2018
1 hour ago, trurl said:
That is not the mdX device. You will invalidate parity if you repair the sd device.
Good spot, I didn't even notice that; I told the OP to run it on mdX.
tillo Posted March 16, 2018
@johnnie.black My bad. I tried, but since I received an error that I could not run it on that device, I googled, and it was suggested that I do it with "sde1". So I just assumed that that was the correct way to do it :-p
I will put Unraid in maintenance mode and give it a try within an hour, and we will see if that gives another result. Just to confirm: have I in any way damaged the filesystem by running xfs_repair on sde1? (Note: sde1 is Data Disk 1.)
JorgeB Posted March 16, 2018
Just now, tillo said:
have I in any way damaged the filesystem by running xfs_repair on sde1? (Note: sde1 is Data Disk 1.)
No, but you likely added some more sync errors.
trurl Posted March 16, 2018
Since you did it with the -n switch, it shouldn't have made any changes.
JorgeB Posted March 16, 2018
Good point.