Squazz Posted April 24, 2018
My array keeps spitting out sync errors that are getting corrected. The problem is that when I do a new parity check, it finds new errors. I fear that one of my drives might have gone bad, but I'm not sure how to determine this. I have attached the syslog and diagnostics, hoping this is where such information can be found. I'd greatly appreciate help reading the logs. If you can see what's wrong in them, please do guide me to what I should look for, so I'm better equipped next time.
nas-diagnostics-20180424-0748.zip nas-syslog-20180424-0748.zip
JorgeB Posted April 24, 2018
If you keep getting different sync errors, the first thing to do is to run memtest.
Squazz Posted April 24, 2018 (Author)
12 minutes ago, johnnie.black said: If you keep getting different sync errors, the first thing to do is to run memtest.
Noted, test started. I'll have it running during my entire work day, so in about 9 hours I should have a result for ya.
Squazz Posted April 24, 2018 (Author)
8 hours ago, johnnie.black said: If you keep getting different sync errors, the first thing to do is to run memtest.
Memtest didn't show any errors in an 8-hour run.
JorgeB Posted April 24, 2018
3 minutes ago, Squazz said: Memtest didn't show any errors in an 8-hour run.
Unfortunately that doesn't prove anything; only a positive result would be proof of a problem, and 24 hours is the recommended time for a test. Either way, run another parity check and post new diags so we can see if the errors are repeating or in completely different places.
Squazz Posted April 24, 2018 (Author)
Just now, johnnie.black said: Unfortunately that doesn't prove anything; only a positive result would be proof of a problem, and 24 hours is the recommended time for a test.
Just started a new run. After it's done I'll post it here, and then start a 24-hour memtest run.
Squazz Posted April 24, 2018 (Author)
1 hour ago, johnnie.black said: Either way, run another parity check and post new diags so we can see if the errors are repeating or in completely different places.
I've already got 2 sync errors, so I wanted to post right away in case it's already possible to see what's up.
nas-syslog-20180424-1832.zip nas-diagnostics-20180424-1832.zip
JorgeB Posted April 24, 2018
Two completely different sectors, very far away from the previous ones, so bad RAM would still be my main suspect, with a bad board next.
P.S. The diags already include the syslog, so there's no need to post it separately.
Squazz Posted April 24, 2018 (Author)
6 minutes ago, johnnie.black said: Two completely different sectors, very far away from the previous ones, so bad RAM would still be my main suspect, with a bad board next.
Thank you. How did you see this? What should I look for? I will start a 24h memtest run next.
JorgeB Posted April 24, 2018
Apr 24 17:33:33 NAS kernel: md: recovery thread: PQ corrected, sector=105681880
Look for lines like these.
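As a minimal sketch of what "look for these" means in practice, something like the following could pull every corrected-sector line out of a saved syslog. The helper name and the regex are illustrative, not part of Unraid; only the log-line format comes from the thread.

```python
import re

# Matches Unraid's parity-check correction lines quoted above, e.g.
#   "md: recovery thread: PQ corrected, sector=105681880"
CORRECTED = re.compile(r"md: recovery thread: (P|Q|PQ) corrected, sector=(\d+)")

def corrected_sectors(syslog_text):
    """Return a (parity, sector) pair for every correction logged."""
    return [(m.group(1), int(m.group(2)))
            for m in CORRECTED.finditer(syslog_text)]

line = "Apr 24 17:33:33 NAS kernel: md: recovery thread: PQ corrected, sector=105681880"
print(corrected_sectors(line))  # [('PQ', 105681880)]
```

Running this over the syslogs from two different checks makes it easy to see whether the same sectors keep coming back.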
Squazz Posted April 25, 2018 (Author)
22 hours ago, johnnie.black said: Apr 24 17:33:33 NAS kernel: md: recovery thread: PQ corrected, sector=105681880 — Look for lines like these.
After 20 hours and 7 passes, memtest is still not returning any errors. Shall I give it 10 more hours?
Squazz Posted April 26, 2018 (Author)
12 hours ago, Squazz said: After 20 hours and 7 passes, memtest is still not returning any errors. Shall I give it 10 more hours?
After 32 hours and 12 passed runs I've now stopped it, and I've started a new run where I'll enforce multi-threading (SMP). Let's see if that does something; I'm not expecting it to, though.
JorgeB Posted April 26, 2018
1 hour ago, Squazz said: After 32 hours and 12 passed runs I've now stopped it, and I've started a new run where I'll enforce multi-threading (SMP).
That's the problem with memtest, and also one of the reasons ECC is recommended for a server, e.g.: https://lime-technology.com/forums/topic/70437-reoccurring-issue-with-cache-drives/?do=findComment&comment=653702
Squazz Posted April 26, 2018 (Author, edited)
41 minutes ago, johnnie.black said: That's the problem with memtest, and also one of the reasons ECC is recommended for a server.
If the SMP run doesn't return anything either, what should I look for then? All the hardware is less than a month old, so I in no way expected this.
Edited April 26, 2018 by Squazz
JorgeB Posted April 26, 2018
If you have more than one DIMM, try with just one, but note that you'll need to run two parity checks: the first one can still find (and correct) errors; the second one can't if the problem was resolved.
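The two-check rule above can be sketched as pseudocode. The function names are illustrative only (there is no such Unraid API); the point is that the first correcting check may still repair damage written while the bad configuration was in place, so only a clean second check demonstrates the swap fixed anything.

```python
# Sketch of the isolation protocol described above (names are illustrative):
# run two correcting parity checks per DIMM configuration and trust only
# the second result.
def dimm_config_is_good(run_correcting_check):
    run_correcting_check()            # may still find and correct leftover errors
    second = run_correcting_check()   # must be clean if the RAM is now good
    return second == 0                # zero sync errors -> configuration is good

# Toy usage: parity was already corrupted once, but the new DIMM setup is fine,
# so the first check fixes 5 errors and the second comes back clean.
results = iter([5, 0])
print(dimm_config_is_good(lambda: next(results)))  # True
```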
Squazz Posted April 29, 2018 (Author)
On 26/4/2018 at 10:03 AM, johnnie.black said: If you have more than one DIMM, try with just one, but note that you'll need to run two parity checks.
This is weird. Now I'm not getting any errors at all, not with all DIMMs and not with the DIMMs separately. I suspect my VMs of generating the errors in the parity. Could this happen? I have a Win10 VM with 2 vdisks: one in domains at 100GB, where domains is set to prefer the cache drive, and one at 700GB assigned directly to my own user share. They are assigned like this:
/mnt/user/domains/Windows 10/vdisk1.img
/mnt/user/Squazz/vdisk1.img
John_M Posted April 29, 2018
5 minutes ago, Squazz said: This is weird. Now I'm not getting any errors at all, not with all DIMMs and not with the DIMMs separately.
Perhaps one of them wasn't seated properly, but now it is. It seems to be a modern trend to use DIMM sockets that only have latches on one side, especially on motherboards aimed primarily at gamers, such as yours. I understand it's so they don't foul long video cards, but I find they feel less positive than traditional "two-sided" sockets when inserting the DIMM. You have to hook the left-hand end in first and then press down with rather more force than I'm comfortable with: another step forward, followed by two backwards!
14 minutes ago, Squazz said: I suspect my VMs of generating the errors in the parity. Could this happen?
I really don't think so. Your problem had the symptoms of a hardware fault.
Squazz Posted May 2, 2018 (Author)
On 4/29/2018 at 10:05 PM, John_M said: Perhaps one of them wasn't seated properly, but now it is.
Another week of testing, and it is beginning to point in the direction of a specific RAM socket. I don't know how to verify this theory. I'll give it a couple more tests, but my RAM sticks seem to be fine. I'm just worried it might be the socket.
John_M Posted May 2, 2018
4 minutes ago, Squazz said: It is beginning to point in the direction of a specific RAM socket.
There isn't a foreign object in there by any chance? A sliver of paper or plastic, perhaps?
Squazz Posted May 2, 2018 (Author)
1 minute ago, John_M said: There isn't a foreign object in there by any chance? A sliver of paper or plastic, perhaps?
Not that I can see, no.
Squazz Posted May 4, 2018 (Author)
On 4/26/2018 at 10:03 AM, johnnie.black said: If you have more than one DIMM, try with just one, but note that you'll need to run two parity checks.
After a week of parity checks, I have not closed in on socket DIMMA1 (as stated earlier, I know). The previous 2 days I have run tests with my memory sticks in DIMMA2 without sync errors. Today I have run two checks with the same memory stick in DIMMA1, and I'm running a third now. Those two runs have both resulted in errors.
1. Is there anything new in the logs? I can't seem to find it.
2. If DIMMA1 is bad, would that result in errors in the first run on that DIMM? It scared me a little that I got errors as soon as I used it, even after so many runs without errors on DIMMA2; I didn't expect to find anything until the second run.
nas-diagnostics-20180504-0810.zip nas-diagnostics-20180504-1710.zip
JorgeB Posted May 4, 2018
4 minutes ago, Squazz said: 1. Is there anything new in the logs? I can't seem to find it.
Same as before, several sync errors:
May 4 08:03:33 NAS kernel: md: recovery thread: Q corrected, sector=598773104
May 4 08:03:55 NAS s3_sleep: Disk activity on going: sdd
May 4 08:03:55 NAS s3_sleep: Disk activity detected. Reset timers.
May 4 08:04:07 NAS kernel: md: recovery thread: PQ corrected, sector=605824952
May 4 08:04:12 NAS kernel: md: recovery thread: PQ corrected, sector=606961352
May 4 08:04:55 NAS s3_sleep: Disk activity on going: sdd
May 4 08:04:55 NAS s3_sleep: Disk activity detected. Reset timers.
May 4 08:05:28 NAS kernel: md: recovery thread: PQ corrected, sector=622892792
May 4 08:05:55 NAS s3_sleep: Disk activity on going: sdd
May 4 08:05:55 NAS s3_sleep: Disk activity detected. Reset timers.
May 4 08:05:55 NAS kernel: md: recovery thread: P corrected, sector=628580776
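The diagnostic question behind these dumps is whether any sector repeats between runs. A hypothetical comparison (the helper and sample strings are illustrative; only the log format comes from the thread) could look like this: if the same sectors appeared in two checks a disk would be suspect, while disjoint sets on every run point at RAM or the board.

```python
import re

# Extract the set of corrected sectors from one parity check's syslog.
SECTOR = re.compile(r"recovery thread: \w+ corrected, sector=(\d+)")

def sectors(syslog_text):
    return {int(m.group(1)) for m in SECTOR.finditer(syslog_text)}

run1 = """May  4 08:03:33 NAS kernel: md: recovery thread: Q corrected, sector=598773104
May  4 08:04:07 NAS kernel: md: recovery thread: PQ corrected, sector=605824952"""
run2 = "Apr 24 17:33:33 NAS kernel: md: recovery thread: PQ corrected, sector=105681880"

# Empty intersection -> no repeating sectors -> RAM/board remain the suspects.
print(sectors(run1) & sectors(run2))  # set()
```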
Squazz Posted May 4, 2018 (Author)
10 minutes ago, johnnie.black said: Same as before, several sync errors.
Sorry, yes, that was exactly what I meant. There are still a lot of errors, but the sectors are spread all over the place, without any of them repeating. So it doesn't seem that the errors introduced are the same ones being corrected afterwards, as far as I can see.
How about the 2nd bullet, do you know anything about that? Or do you know anyone who might?
JorgeB Posted May 4, 2018
1 minute ago, Squazz said: There are still a lot of errors, but the sectors are spread all over the place, without any of them repeating.
Random RAM errors will result in random sync errors.
1 minute ago, Squazz said: How about the 2nd bullet, do you know anything about that?
Only you can test and rule out the DIMM or the socket.
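To see why a random bit flip in RAM becomes a parity sync error, a toy single-parity model helps. This is an assumption-laden simplification, not Unraid's actual md driver: real arrays use P and Q parity over whole stripes, but the mechanism is the same XOR mismatch.

```python
# Toy model: parity P is the XOR of the data disks' bytes in a stripe.
# If faulty RAM flips one bit in a buffer before it reaches a disk, the
# stored stripe no longer XORs to P, and the next parity check reports a
# sync error at that (effectively random) sector.
data = [0b10110010, 0b01101100, 0b11100001]    # bytes from three data disks
p = 0
for byte in data:
    p ^= byte                                   # parity as originally written

flipped = data.copy()
flipped[1] ^= 0b00000100                        # one bit flipped by bad RAM

check = 0
for byte in flipped:
    check ^= byte                               # parity recomputed at check time

print(check == p)  # False -> the check flags this stripe as a sync error
```

Since the flipped bit lands wherever the corrupted buffer happened to be written, every check reports different sectors, which matches what the logs show.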
John_M Posted May 4, 2018
4 hours ago, Squazz said: After a week of parity checks, I have not closed in on port DIMMA1. The previous 2 days I have run tests with my memory sticks in DIMMA2 without sync errors. Today I have run two checks with the same memory stick in DIMMA1. Those two runs have both resulted in errors.
I don't understand your logic here. Doesn't this suggest that socket A1 is definitely bad?