Sync errors keep coming - how do I read the logs?



My array keeps producing sync errors, which are getting corrected. The problem is that when I run a new parity check, it finds new errors.

 

I'm afraid one of my drives might have gone bad, but I'm not sure how to determine this.

I have attached the syslog and diagnostics, hoping this is where such information can be found.

I'd greatly appreciate help reading the logs. If you can see what's wrong in them, please point me to what I should look for, so I'm better equipped next time :)

nas-diagnostics-20180424-0748.zip

nas-syslog-20180424-0748.zip

3 minutes ago, Squazz said:

 

Memtest didn't show any errors in an 8-hour run

Unfortunately that doesn't prove anything; only a positive result would be proof of a problem. Also, 24 hours is the recommended duration for a test.

 

Either way, run another parity check and post new diags so we can see if the errors are repeating or in completely different places.

Just now, johnnie.black said:

Unfortunately that doesn't prove anything; only a positive result would be proof of a problem. Also, 24 hours is the recommended duration for a test.

 

Either way, run another parity check and post new diags so we can see if the errors are repeating or in completely different places.

Just started a new run :)

After that is done, I'll post it here and then start a 24 hour memtest run :)

1 hour ago, johnnie.black said:

Unfortunately that doesn't prove anything; only a positive result would be proof of a problem. Also, 24 hours is the recommended duration for a test.

 

Either way, run another parity check and post new diags so we can see if the errors are repeating or in completely different places.

I've already got 2 sync errors, so I wanted to post right away in case it's already possible to see what's going on :)

nas-syslog-20180424-1832.zip

nas-diagnostics-20180424-1832.zip

6 minutes ago, johnnie.black said:

Two completely different sectors, very far from the previous ones, so bad RAM would still be my main suspect, with a bad board next.

 

P.S. diags already include the syslog, no need to post it separately.

Thank you :)

How did you see this? What should I look for?

 

Will start a 24h memtest run next

12 hours ago, Squazz said:

 

After 20 hours and 7 passes, memtest is still not returning any errors.

Shall I give it 10 more hours?

After 32 hours and 12 passed runs I've now stopped it and started a new run where I'll force multi-threading (SMP). Let's see if that does something; I'm not expecting it to, though.

1 hour ago, Squazz said:

After 32 hours and 12 passed runs I've now stopped it and started a new run where I'll force multi-threading (SMP). Let's see if that does something; I'm not expecting it to, though.

That's the problem with memtest, and also one of the reasons ECC is recommended for a server, e.g.:

https://lime-technology.com/forums/topic/70437-reoccurring-issue-with-cache-drives/?do=findComment&comment=653702

 

41 minutes ago, johnnie.black said:

That's the problem with memtest, and also one of the reasons ECC is recommended for a server, e.g.:

https://lime-technology.com/forums/topic/70437-reoccurring-issue-with-cache-drives/?do=findComment&comment=653702

 

If the SMP run doesn't return anything either, what should I then look for?

 

All the hardware is less than a month old, so I didn't expect this at all.

On 26/4/2018 at 10:03 AM, johnnie.black said:

If you have more than one DIMM, try with just one, but note that you'll need to run two parity checks: the first one can still find (and correct) errors; the second one shouldn't find any if the problem is resolved.

 

This is weird. Now I'm not getting any errors at all, neither with all DIMMs installed nor with the DIMMs tested separately.

 

I suspect my VMs of generating the parity errors. Could that happen?

 

I have a Win10 VM with 2 vDisks. One is 100 GB in the domains share, which is set to prefer the cache drive.

The other is 700 GB and is assigned directly to my own user share.

They are assigned like this:
/mnt/user/domains/Windows 10/vdisk1.img

/mnt/user/Squazz/vdisk1.img

5 minutes ago, Squazz said:

This is weird. Now I'm not getting any errors at all, neither with all DIMMs installed nor with the DIMMs tested separately.

 

Perhaps one of them wasn't seated properly but now it is. It seems to be a modern trend to use DIMM sockets that only have latches on one side, especially on motherboards that are aimed primarily at gamers, such as yours. I understand it's so they don't foul long video cards but I find they feel less positive than traditional "two-sided" sockets when inserting the DIMM. You have to hook the left hand end in first and then press down with rather more force than I'm comfortable with - another step forward, followed by two backwards!

 

14 minutes ago, Squazz said:

I suspect my VMs of generating the parity errors. Could that happen?

 

I really don't think so. Your problem had the symptoms of a hardware fault.

On 4/29/2018 at 10:05 PM, John_M said:

 

Perhaps one of them wasn't seated properly but now it is. It seems to be a modern trend to use DIMM sockets that only have latches on one side, especially on motherboards that are aimed primarily at gamers, such as yours. I understand it's so they don't foul long video cards but I find they feel less positive than traditional "two-sided" sockets when inserting the DIMM. You have to hook the left hand end in first and then press down with rather more force than I'm comfortable with - another step forward, followed by two backwards!

 

 

I really don't think so. Your problem had the symptoms of a hardware fault.

 

Another week of testing, and it is beginning to point in the direction of a specific RAM socket.

 

I don't know how to verify this theory.

 

I'll give it a couple more tests, but my RAM sticks seem to be fine. I'm just worried it might be the socket.

On 4/26/2018 at 10:03 AM, johnnie.black said:

If you have more than one DIMM, try with just one, but note that you'll need to run two parity checks: the first one can still find (and correct) errors; the second one shouldn't find any if the problem is resolved.

 

After a week of parity checks, I have now closed in on slot DIMMA1 (as stated earlier, I know).

 

For the previous 2 days I have run tests with my memory sticks in DIMMA2, without sync errors.
Today I have run two checks with the same memory stick in DIMMA1, and I'm running a third now.
Those two runs have both resulted in errors.

 

1. Is there anything new in the logs? I can't seem to find anything.
2. If DIMMA1 is bad, would that result in errors already in the first run on that DIMM? It scared me a little that I got errors as soon as I used it, after so many runs without errors in DIMMA2. I didn't expect to find anything until the second run.

nas-diagnostics-20180504-0810.zip

nas-diagnostics-20180504-1710.zip

4 minutes ago, Squazz said:

1. Is there anything new in the logs? I can't seem to find anything.

Same as before, several sync errors:

 

May  4 08:03:33 NAS kernel: md: recovery thread: Q corrected, sector=598773104
May  4 08:03:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:03:55 NAS s3_sleep: Disk activity detected. Reset timers.
May  4 08:04:07 NAS kernel: md: recovery thread: PQ corrected, sector=605824952
May  4 08:04:12 NAS kernel: md: recovery thread: PQ corrected, sector=606961352
May  4 08:04:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:04:55 NAS s3_sleep: Disk activity detected. Reset timers.
May  4 08:05:28 NAS kernel: md: recovery thread: PQ corrected, sector=622892792
May  4 08:05:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:05:55 NAS s3_sleep: Disk activity detected. Reset timers.
May  4 08:05:55 NAS kernel: md: recovery thread: P corrected, sector=628580776
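
For future reference, these lines are easy to pull out yourself: every parity correction is logged by the md driver as a "recovery thread: ... corrected" line. A minimal sketch of filtering them with grep (the sample file below is made up for illustration; in practice you would point grep at the syslog inside the diagnostics zip):

```shell
# Sample lines in the same format as above (hypothetical file, for illustration)
cat > /tmp/sample-syslog <<'EOF'
May  4 08:03:33 NAS kernel: md: recovery thread: Q corrected, sector=598773104
May  4 08:03:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:04:07 NAS kernel: md: recovery thread: PQ corrected, sector=605824952
EOF

# Keep only the parity-correction lines, dropping unrelated entries
grep 'recovery thread: .* corrected' /tmp/sample-syslog

# Pull out just the sector numbers
grep -o 'sector=[0-9]*' /tmp/sample-syslog | cut -d= -f2
```

The sector numbers are the interesting part: the same sector repeating across checks would point at a disk, while scattered sectors point at RAM or the board.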

 

10 minutes ago, johnnie.black said:

Same as before, several sync errors:

 


May  4 08:03:33 NAS kernel: md: recovery thread: Q corrected, sector=598773104
May  4 08:03:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:03:55 NAS s3_sleep: Disk activity detected. Reset timers.
May  4 08:04:07 NAS kernel: md: recovery thread: PQ corrected, sector=605824952
May  4 08:04:12 NAS kernel: md: recovery thread: PQ corrected, sector=606961352
May  4 08:04:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:04:55 NAS s3_sleep: Disk activity detected. Reset timers.
May  4 08:05:28 NAS kernel: md: recovery thread: PQ corrected, sector=622892792
May  4 08:05:55 NAS s3_sleep: Disk activity on going: sdd
May  4 08:05:55 NAS s3_sleep: Disk activity detected. Reset timers.
May  4 08:05:55 NAS kernel: md: recovery thread: P corrected, sector=628580776

 

 

Sorry, yes, that was exactly what I meant :) There were still a lot of errors, but the sectors are spread all over the place, with none of them repeating. So it doesn't seem that the errors being introduced are the same ones being corrected afterwards, as far as I can see.

 

How about the second bullet - do you know anything about that? :) Or do you know anyone who might?

1 minute ago, Squazz said:

Sorry, yes, that was exactly what I meant :) There were still a lot of errors, but the sectors are spread all over the place, with none of them repeating. So it doesn't seem that the errors being introduced are the same ones being corrected afterwards, as far as I can see.

Random RAM errors will result in random sync errors.
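
One way to confirm that nothing repeats between two checks is to compare the extracted sector lists with comm. A sketch with made-up sector numbers; in practice the printf lines would be replaced by extraction from each run's syslog (e.g. `grep -o 'sector=[0-9]*' syslog | cut -d= -f2 | sort`):

```shell
# Hypothetical sector lists from two parity-check runs (sorted, one per line)
printf '%s\n' 598773104 605824952 622892792 | sort > /tmp/run1.sectors
printf '%s\n' 123456784 234567896 345678908 | sort > /tmp/run2.sectors

# comm -12 prints only sectors present in BOTH runs; empty output means
# no sector was corrected twice, i.e. the errors are not repeating
comm -12 /tmp/run1.sectors /tmp/run2.sectors
```

Repeating sectors would suggest a fixed bad spot (disk or cabling); an empty result fits the random-corruption pattern of bad RAM.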

 

1 minute ago, Squazz said:

How about the second bullet, do you know anything about that?

Only you can test and rule out the DIMM or the socket.

4 hours ago, Squazz said:

After a week of parity checks, I have now closed in on slot DIMMA1 (as stated earlier, I know).

 

For the previous 2 days I have run tests with my memory sticks in DIMMA2, without sync errors.
Today I have run two checks with the same memory stick in DIMMA1, and I'm running a third now.
Those two runs have both resulted in errors.

 

If I'm reading this right, doesn't it suggest that socket A1 is definitely bad?

