Red ball and syslog issue


Hi. I had a red ball last night and telnetted in to copy the syslog, but when I save it as a .txt file the formatting gets all messed up. Is there a way to format the log so that it is readable when I post it here for advice?

 

Edit: I was able to figure out the formatting issue.

 

Anyway, I got my first red ball during the overnight parity check. I did read the to-do on it, but I am still nervous about what to do next. When I tried to get the smartctl log, I got this response:

 

root@Tower:~# smartctl -t short /dev/sdl
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
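
For anyone hitting the same message: the output itself suggests retrying with '-T permissive'. Note also that `-t short` starts a self-test rather than printing a report, while `-a` prints the full SMART information. A minimal sketch of the retry follows. The `smart_report` wrapper and the `DRY_RUN` switch are made up for illustration and are not part of smartctl, and if the controller driver is what is misbehaving, the retry may well fail in the same way.

```shell
# Sketch only: smart_report and DRY_RUN are illustrative names, not smartctl features.
# -a prints all SMART information for the device.
# -T permissive tells smartctl to carry on after one failed mandatory command.
smart_report() {
    device="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show the command for review instead of touching the drive.
        echo "smartctl -a -T permissive $device"
    else
        smartctl -a -T permissive "$device"
    fi
}

# /dev/sdl is the device named in the post; adjust for your own system.
DRY_RUN=1 smart_report /dev/sdl
```

Dropping `DRY_RUN=1` runs the command for real, and redirecting its output to a file gives something that can be attached to a post.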

 

Syslog attached

 

I am running unRAID Version: unRAID Server Pro, Version 5.0.5

 

Fugginsyslog.txt

Link to comment

I cannot say for sure, as there is not enough data, but it looks like the mvsas driver crashed or was corrupted, which is quite a bit more serious than a cable issue!  I don't see direct evidence of any cabling problems.  Power could certainly be part of the problem, although there's no direct evidence of that either.  The drive is probably fine.

 

The first indication of trouble is at 3:45am: command issues, then a command timeout, followed immediately by parity errors that are probably not real.  At this point your parity drive cannot be trusted, as it may have been corrupted.  After some attempts to recover the drive, the system gave up and disabled it.  It then had trouble with another drive.

 

The error flags (UNC IDNF) being returned (or that it *says* are being returned, they look corrupted to me) do not make sense.  UNC should be returned with a Media error, but there's no error, just that the error return timed out.  And IDNF is really serious, indicating sectors cannot be found.  You should see a LOT more errors if that were true!

 

At the moment, I don't know what to advise (partly because I'm somewhat distracted, sitting in a jury duty auditorium)!  If you have a better power supply, try installing it.  Flaky power could definitely explain the issues.

Link to comment

Thanks for the input. I don't normally keep spare PSUs around, but the one I bought is a higher-quality Corsair 850W supply.

 

Just to clarify.. /dev/sdl is not the parity drive, but rather, one of the regular disks in the array. Just wanted to make sure you were aware of that.

 

What more information could I provide to help? Are there other logs I should attach?

Link to comment

Just to clarify.. /dev/sdl is not the parity drive, but rather, one of the regular disks in the array. Just wanted to make sure you were aware of that.

 

There were actually 2 drives affected, primarily sdl (Disk 10, ata11.00, sd 1.0.2.0), but also sdk (ata10.00, sd 1.0.1.0).  There's no info in this syslog excerpt to know which drive sdk is, but it's connected next to Disk 10.  It only had a few issues, nothing like the issues reported for sdl.  I should note that I don't think there are any issues with either drive.

 

Parity errors do not indicate an issue with the parity drive, rather that the parity calculations made after reading from all drives did not result in zero (even parity).  So in this case, UnRAID (not knowing of the problems at the lower level) used the misread info from Disk 10 for its parity calc's, reported a parity error and then rewrote the parity info on the parity drive, corrupting it.
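
To make the parity point concrete, here is a toy sketch using shell arithmetic. The three one-byte "disks" and the `parity_of` helper are invented for illustration; a real array does the same XOR across every bit position of every drive.

```shell
# Toy model: each "disk" is a single byte. All values are made up.
parity_of() {
    p=0
    for b in "$@"; do
        p=$(( p ^ b ))   # XOR accumulates even parity
    done
    echo "$p"
}

d1=165; d2=60; d3=15                 # 0xA5, 0x3C, 0x0F
parity=$(parity_of $d1 $d2 $d3)      # what a healthy parity drive would hold
echo "parity:            $parity"    # 150

# A clean check XORs all data plus parity and gets zero.
echo "clean check:       $(parity_of $d1 $d2 $d3 $parity)"    # 0

# Now disk 2 is misread (one bit flipped) during the check.
misread=$(( d2 ^ 4 ))
echo "check w/ misread:  $(parity_of $d1 $misread $d3 $parity)"   # 4, a "parity error"

# A correcting check rewrites parity from what it read, which is now wrong.
echo "rewritten parity:  $(parity_of $d1 $misread $d3)"       # 146, no longer 150
```

The last line is the corruption described above: the data on the disk was fine, but the parity byte now matches the misread value instead.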

Link to comment

I put in a replacement drive and the system rebuilt itself without any issues.

 

I took out the red-balled drive and ran all the Seagate tools on it, and it passed all the tests.

 

Here's the full syslog. I still don't understand it enough to know what actually caused the issue.

 

As for power issues, I am still hesitant on it because I do have the file server on a power conditioner/UPS system.

 

Should I reinstall the drive and run pre-clear on it again?

Mar_15_23_Syslog.zip

Link to comment

Don't see anything else relevant in the syslog.  Unrelated, but Cache_dirs loads twice, once at boot via Dynamix, and again the following day via the at command.  That seems rather odd!

 

I don't know of any reason Preclearing the drive would be useful, as I don't think it caused the trouble, and your testing found nothing.

 

The possible suspects are still the power supply, and system overheating.  You might check that all fans are spinning, and that the system is not getting too hot under load, as it was when mvsas appeared to have trouble.  And it never hurts to do a memtest, perhaps a very long one, to rule out slightly flaky memory.

Link to comment

Thanks for all the help so far.

 

I know I can uninstall Dynamix Cache_dirs via the UI, but how do I uninstall the command line version so that I can make sure I only use the Dynamix version?

 

I do have 1 drive that will get about 40 degrees C during parity checks, but the rest all run in the 30s....and all the fans are running.

 

I will do a memtest...is there a particular version I can use?

Link to comment

I know I can uninstall Dynamix Cache_dirs via the UI, but how do I uninstall the command line version so that I can make sure I only use the Dynamix version?

I would normally assume the other one is loading from a command in the go file, particularly since it appears to be starting from the at command.  Perhaps there is a cron item set up to start Cache_dirs instead; I don't know.
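
A quick way to look for the second launcher is to search the go file for any mention of cache_dirs. This is only a sketch: `find_refs` is a made-up helper name, and /boot/config/go is the usual go file location on unRAID, so adjust the path if yours differs.

```shell
# Sketch: find_refs is an illustrative helper, not an unRAID tool.
# Prints file:line:text for every case-insensitive match in the readable files given.
find_refs() {
    pattern="$1"
    shift
    for f in "$@"; do
        if [ -r "$f" ]; then
            grep -n -i -H -- "$pattern" "$f"
        fi
    done
    return 0
}

# Look in the go file for anything that starts cache_dirs.
find_refs cache_dirs /boot/config/go
```

If nothing turns up there, `crontab -l` and `atq` list scheduled cron and at jobs, which covers the other possibility mentioned above.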

 

I will do a memtest...is there a particular version I can use?

Use the one at the bottom of your UnRAID boot menu, before UnRAID loads.

Link to comment
  • 2 weeks later...

Something went weird last night. The system became unresponsive, so I putty'd into it and performed a clean powerdown, then rebooted. Once it rebooted, the parity started rebuilding itself again, but I had no other warnings.

 

This morning I went to check on the system and the red ball had returned on the same HDD slot (Disk 10, /dev/sdl), but this time with a different (new) HDD. Could the SATA cable or the SATA port on the controller card be bad? The cable is a Mini SAS 36-pin (SFF-8087) male to SATA 7-pin female cable from Monoprice, and the controller card is a SUPERMICRO AOC-SAS2LP-MV8.

 

Had to upload the file to a file host (even zipped was over 600k): http://s000.tinyupload.com/index.php?file_id=04724320411712833973

 

 

 

 

 

Link to comment

I thought it might be a power connector issue as well...I made sure it was seated well. The sata connectors are locking type and are shielded. I run them the opposite side of the power cables to minimize interference issues.

 

PSU is a Corsair HX850

 

However..the power connectors are this kind... http://www.amazon.com/StarTech-com-Power-Splitter-Adapter-PYO4SATA/dp/B0086OGN9E

 

The plugs are about 2 inches apart, so the cable is kind of looped between each HDD. Maybe the stress of that is causing this particular connector to come loose. Are there similar power splitters with connector spacing that better matches a standard HDD cage?

Link to comment

Problems seem similar to last time: the SAS driver suddenly reported trouble with Disk 10 (no response).  No unusual error codes this time, mostly timeouts (lack of response), but also 2 device errors.  Problems started at 06:27:34, and at 06:28:56 Disk 10 was disabled, so all further errors on Disk 10 are spurious and probably resulted in the red ball.  At 06:28:58, the parity check was aborted.  Then at 06:32:18, something went wrong with Disk 11, and it flooded the syslog with 130MB of the very same error in the Reiser file system.  You will need to run Check Disk File systems on Disk 11 (md11).
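
When a syslog is flooded like this, counting identical messages makes the repeated error (and anything hiding behind it) easier to spot. The sketch below assumes the traditional "Mon DD HH:MM:SS host message" line layout. `top_messages` is a made-up helper name and /var/log/syslog is an assumed location, so point it at wherever your copy of the log lives.

```shell
# Sketch: top_messages is an illustrative helper.
# Drops the month, day, time and host fields, then counts identical messages.
top_messages() {
    logfile="$1"
    count="${2:-10}"
    awk '{ $1 = $2 = $3 = $4 = ""; sub(/^ +/, ""); print }' "$logfile" \
        | sort | uniq -c | sort -rn | head -n "$count"
}

# Show the five most frequent messages, if the log is readable here.
if [ -r /var/log/syslog ]; then
    top_messages /var/log/syslog 5
fi
```

The same pipeline with `grep -v` on the flooding message in front of it leaves a much smaller log to read or attach.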

Link to comment

Finally...a sensible recommendation.

 

I did this but reiserfsck found no errors. I ran it on md10 and md11.

Link to comment
