November 4, 2009

Guys, I need some advice please.

I started my monthly parity check last night, and it completed today. The "Disk status" area of the main web page doesn't show any errors for individual disks. But in the "Command area" of that page I see:

Parity is valid. (Last checked on 11/3/2009 10:16:16 PM, finding 385 errors.)

Here is the tail of my syslog:

Nov 2 23:00:01 Tower kernel: mdcmd (70): check
Nov 2 23:00:01 Tower kernel: md: recovery thread woken up ...
Nov 2 23:00:01 Tower kernel: md: recovery thread checking parity...
Nov 2 23:00:01 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Nov 3 22:16:16 Tower kernel: md: sync done. time=83783sec rate=23316K/sec
Nov 3 22:16:16 Tower kernel: md: recovery thread sync completion status: 0
Nov 3 22:18:18 Tower in.telnetd[7555]: connect from 10.22.33.5 (10.22.33.5)
Nov 3 22:18:18 Tower login[7556]: ROOT LOGIN on `pts/0' from `10.22.33.5'

I inspected the whole syslog and there is nothing curious there.

So what can these 385 errors be? And on which disk? How worried should I be?

Purko

---
BTW, this unraid server is on a good UPS, and has always been CLEANLY restarted or shut down through the web interface.
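As a sanity check on the syslog figures above, here is a rough sketch of the arithmetic behind the reported rate. It assumes the "blocks" in the md line are 1 KiB each, which is an assumption and not something the log states; the variable names are made up for illustration.

```python
# Figures copied from the syslog tail in the post.
total_blocks = 1953514552   # "over a total of 1953514552 blocks"
elapsed_sec = 83783         # "time=83783sec"

# Assumption: one block is 1 KiB, so blocks per second equals K/sec.
rate_k_per_sec = total_blocks // elapsed_sec
elapsed_hours = elapsed_sec / 3600

print(f"rate: {rate_k_per_sec}K/sec over {elapsed_hours:.2f} hours")
```

Under that assumption the result matches the logged rate=23316K/sec, which suggests the check ran over the full drive at a steady speed and was not cut short.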
November 4, 2009

First, when unRaid hits a sync error it automatically updates parity to "fix" the out-of-sync condition. So if you were to rerun it, the parity check should come back clean.

Second, parity can detect an out-of-sync condition but cannot identify the culprit. UnRaid used to log sync errors but stopped several versions ago because, in some rare circumstances, the logging caused the syslog to grow too large and crash the server. But knowing was helpful for situations like this.

I would recommend running a SMART report on each drive to see if one of them is showing signs of failing. I would also run a memory test to confirm your RAM is all good. If all is well, I would then rerun the parity check. If you continue to get sync errors, we'll need to explore further. Otherwise this may remain an unsolved mystery.
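To illustrate why parity can flag a mismatch without naming the culprit, here is a minimal sketch assuming simple single-parity XOR across the data drives. The function names and the tiny four-byte "disks" are hypothetical and are not unRaid's actual code.

```python
from functools import reduce
from operator import xor

def compute_parity(blocks):
    """XOR the bytes at each offset across all data blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

def count_sync_errors(blocks, parity):
    """Count offsets where stored parity differs from recomputed parity."""
    expected = compute_parity(blocks)
    return sum(1 for e, p in zip(expected, parity) if e != p)

# Two toy data "disks" and the parity computed from them.
disk1 = bytes([0x0F, 0xA0, 0x33, 0x00])
disk2 = bytes([0xF0, 0x0A, 0x33, 0xFF])
parity = compute_parity([disk1, disk2])

# Case A: one bit flips on a data disk.
bad_disk1 = bytes([0x0F, 0xA1, 0x33, 0x00])
errors_data = count_sync_errors([bad_disk1, disk2], parity)

# Case B: the same bit flips on the parity disk instead.
bad_parity = bytes([parity[0], parity[1] ^ 0x01, parity[2], parity[3]])
errors_parity = count_sync_errors([disk1, disk2], bad_parity)

print(errors_data, errors_parity)
```

Both cases report one sync error at the same offset. From the check alone there is no way to tell whether a data disk or the parity disk changed, which is why SMART reports, memory tests, and file system checks are the next step.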
November 4, 2009 Author

SMART reports on all disks look perfect. Memory test passed without error.

So now I have started a new parity check. We'll see what it says when it finishes tomorrow.

I so dislike unsolved mysteries.
November 4, 2009

Quote: "UnRaid used to log sync errors but stopped several versions ago as it did, in some rare circumstances, cause the syslog to grow too large and crash the server. But knowing was helpful for situations like this."

I agree. I would like to see it added back, with a counter to limit the entries, perhaps logging only the first 1000 parity errors. It sometimes helps to know whether the errors are scattered randomly or clustered in specific regions, and in particular where those regions are, such as at the very beginning of the drive, within the file system structures. Also, when disk errors are logged as well, the timestamps can help to correlate the parity errors with specific disk errors.

Since this cluster of parity errors was unexpected, there must have been an undetected but damaging event recently.

By the way, when was the last parity check performed, and was it completely clean? This bunch of errors does indicate changes to either data or parity info, but not which one was changed, or which drive or drives were affected. The next test I would run is reiserfsck on each data drive ("Check Disk File systems").
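The capped logging and the scattered-versus-clustered question could be sketched roughly as follows. This is only an illustration: the block numbers are invented, and `cap_log`, `cluster`, and the `max_gap` threshold are hypothetical names, not anything unRaid provides.

```python
def cap_log(error_blocks, limit=1000):
    """Keep only the first `limit` errors; report how many were dropped."""
    logged = list(error_blocks[:limit])
    suppressed = max(0, len(error_blocks) - limit)
    return logged, suppressed

def cluster(blocks, max_gap=1024):
    """Group block numbers into (first, last, count) runs.

    Two errors fall in the same run when they are at most
    `max_gap` blocks apart.
    """
    clusters = []
    for b in sorted(blocks):
        if clusters and b - clusters[-1][1] <= max_gap:
            clusters[-1][1] = b      # extend the current run
            clusters[-1][2] += 1
        else:
            clusters.append([b, b, 1])  # start a new run
    return [tuple(c) for c in clusters]

# Invented example: a few errors near the start of the drive,
# a pair further in, and one isolated error near the end.
sample = [8, 16, 24, 500000, 500008, 1953514000]
logged, suppressed = cap_log(sample, limit=1000)
runs = cluster(logged)

for first, last, count in runs:
    print(f"blocks {first}..{last}: {count} error(s)")
```

A run sitting at very low block numbers would point toward the file system structures at the start of a drive, while many single-error runs spread across the whole range would look more like random corruption. The cap keeps the log bounded no matter how many errors a check finds.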
November 4, 2009 Author

Quote: "By the way, when was the last parity check performed, and was it completely clean? This bunch of errors does indicate changes to either data or parity info, but not which one was changed, or which drive or drives were affected. The next tests I would do is to run reiserfsck on each data drive ("Check Disk File systems")."

This server is only four months "new", and parity checks have always completed clean. The last one was about three weeks ago.

I'll check the file systems first thing after this new parity check finishes sometime tomorrow. Thanks for the suggestion.

It is indeed a little bit worrisome not being able to find out for sure what caused this to happen.

Purko