Persistent system crashes with ver. 4.5.6

May 2, 2011

I have been experiencing a worrying number of system shutdowns in recent weeks, including another one just today. The typical behavior has been for the system to run normally for several days and then freeze up, losing all network connectivity including Telnet. The only way to resuscitate the system is to do a power cycle. Invariably the system reboots normally and then launches into a parity check, which it is performing at the moment. This “automatic” parity check rarely finds any errors to correct, and so far that is also the case in this instance. About two weeks ago the system did report a number of sync errors, which I put down to a failing parity disc (it had several reallocated sectors). After I replaced the parity disc with a new, pre-cleared disc, the system ran without a hiccup until today.

According to the syslog ‘tail’ (attached) the system failure seems to have occurred almost simultaneously with an attempt to write a batch of files to the cache. I can’t see anything in the syslog which might shed any light on the incident, although I’m anything but an expert in such things. Once the system came back up I checked on the SmartView screen but all the discs appear to be healthy, with nothing to report except their load cycle counts.

The system is running release version 4.5.6. There are 21 discs in the array plus the cache, all 2TB. The motherboard is a Supermicro X8-SIL with two Supermicro 8-port SATA controllers, an 850W Corsair PSU and 4GB of memory, all housed in a Norco 4224.

I would very much appreciate it if one of the more experienced members could run a practised eye over the log and advise if anything jumps out as a possible cause. I thought I had put this issue behind me with the new parity disc but there are obviously still some lurking gremlins. Any advice on additional troubleshooting steps or information that I could be collecting would also be highly appreciated.

syslog20110501.zip


May 2, 2011

There's not much usable info there unfortunately, except that you definitely had a crash, with OS corruption ('tainted' modules). In my opinion, a reboot is always required after that. 2 drives are mentioned (sdu and sdt), but that means almost nothing; they were probably just the drives being accessed when the crashes occurred. The 2 main suspects here are bad memory and buggy software.

The first suggestion is always the same in a case like this - test your memory. Run the built-in Memtest from the boot screen, preferably for a number of hours, such as overnight. Your memory may be fine, but you need to eliminate it as a suspect.

What does concern me is the use of the mvsas & scst modules in v4.5.6. I consider them somewhat immature, and I would strongly recommend you upgrade to at least v4.7, and even better to the latest v5.0 beta, where the mvsas support has been reworked significantly, and may be somewhat more stable.

Just a comment, syslogs in textual form (.txt instead of .rtf) are much preferred; and as complete as possible, not just the tail.

Another comment, when you are crashing often, there's not much point in running long parity checks over and over, so I see no reason not to abort them. Once the system is stable again, then run a full parity check.


May 2, 2011

Author

OK thanks for the feedback, I will take your advice and initiate an extended memtest and upgrade to 4.7 once the system stabilizes.

I did want to query a couple of points though.

Just a comment, syslogs in textual form (.txt instead of .rtf) are much preferred; and as complete as possible, not just the tail.

I've never had any success with capturing a 'pre-crash' syslog except by using tail in a telnet session. Is there a better approach? Every time I've looked at the syslog following a crash it only seems to start with the post-crash restart events.

Another comment, when you are crashing often, there's not much point in running long parity checks over and over, so I see no reason not to abort them. Once the system is stable again, then run a full parity check.

I've wondered about that but have been wary about stopping them.

Thanks again for your help.


May 2, 2011

I've heard reports of system crashes when the cache drive is failing. Post a SMART report of the cache drive.


May 2, 2011

Author

Here is the SMART report for the cache drive (/dev/sdu). Nothing looks obviously untoward, at least to my untrained eye.

smart.txt


May 2, 2011

Here is the SMART report for the cache drive (/dev/sdu). Nothing looks obviously untoward, at least to my untrained eye.

Looks fine.

Has anything in the system been changed recently?


May 2, 2011

Author

Looks fine.

Has anything in the system been changed recently?

Only the new parity disk, a new Hitachi. Apart from that the system is all less than six months old, although there are a few disks about a year old that were brought over from an older NAS. I would mention though that I had a lot of infant mortality with the Norco hardware and had to replace 3 of the 6 SAS cards in the space of a few weeks.

Have upgraded to 4.7 and now running the memtest, which I'll run for the rest of the day. After two cycles no errors so far.

As a general question about the syslog, I understand that this gets cleared on rebooting; however, is there any way to have a periodic dump to the flash, say every few seconds? There's plenty of space even on a 2GB stick so I'm wondering why such a facility is not part of the standard system.


May 19, 2011

Author

Apologies for bumping this old thread, but I just wanted to report that since upgrading to 4.7 over two weeks ago the system has been operating perfectly, with not a single unplanned shutdown.


May 19, 2011

As a general question about the syslog, I understand that this gets cleared on rebooting, however is there any way to have a periodic dump to the flash say every few seconds? There's plenty of space even on 2GB stick so I'm wondering why such a facility is not part of the standard system.

Mostly because flash drives are only rated for a finite number of "write" operations, typically 100,000 or so, although some may be good for as many as 1,000,000. Even the higher figure does not go far: writing to the drive once every 5 seconds, a million writes takes 5,000,000 seconds = 57.87 days. (Odds are you do not want to wear out the drive in 2 months.)
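For anyone who wants to try other intervals, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: it assumes, as the estimate above does, that every log flush counts as one write against the rated figure, and the function name `days_until_worn_out` is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_worn_out(rated_writes: int, seconds_between_writes: float) -> float:
    """Days needed to issue `rated_writes` writes at one write per interval.

    Simplifying assumption: each flush costs exactly one rated write.
    """
    return rated_writes * seconds_between_writes / SECONDS_PER_DAY


# The two ratings mentioned above, at one write every 5 seconds.
for rated_writes in (100_000, 1_000_000):
    days = days_until_worn_out(rated_writes, 5)
    print(f"{rated_writes:>9,} writes, one every 5 s: {days:.2f} days")
```

Stretching the interval to 60 seconds only multiplies the result by 12, so under the same assumption a 100,000-write stick would still last roughly 69 days.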

You can make a simple temporary change to log ALSO to the flash drive and to a spare virtual system console as described in this post:

http://lime-technology.com/forum/index.php?topic=7758.msg75293#msg75293
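The linked post has its own method, which is not reproduced here. Purely as a separate illustration of the general idea, a small script could append whatever is new in the syslog to a file on the flash at a modest interval. Everything in this sketch is an assumption: that a Python 3 interpreter is available on the server (a stock install may not have one), that the log lives at `/var/log/syslog`, that the flash is mounted at `/boot`, and the function and file names are invented for the example.

```python
import os
import time


def append_new_bytes(src_path: str, dst_path: str, offset: int) -> int:
    """Append the bytes of src_path past `offset` to dst_path; return the new offset."""
    # If the source shrank (rotated or cleared), start again from the top.
    if os.path.getsize(src_path) < offset:
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:  # touch the flash only when there is something new to write
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())  # ask the OS to push the data to the device
    return offset + len(chunk)


def mirror_forever(src_path: str, dst_path: str, interval_s: float = 60.0) -> None:
    """Poll src_path every interval_s seconds and mirror new data to dst_path."""
    offset = 0
    while True:
        offset = append_new_bytes(src_path, dst_path, offset)
        time.sleep(interval_s)


# Hypothetical usage (both paths are assumptions, adjust to your system):
# mirror_forever("/var/log/syslog", "/boot/syslog-mirror.txt")
```

Because of the write-wear issue described above, this is something to run only while chasing a crash, and with an interval in the tens of seconds rather than every few seconds.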
