unRAID unresponsive intermittently



Is the parity check progressing at a reasonable speed?

 

If you have any more problems be sure to get a syslog.

 

I think so.  92 MB/sec at the moment.

 

I certainly hope I can grab the syslog, but each of the four times it failed this past week it truly froze and I couldn't get to anything.  Once I rebooted, the syslog started fresh again.  Shouldn't the log files be saved somewhere in the event something like this happens?

Link to comment

Go ahead and get a syslog now and post it. It may or may not tell us anything.
Link to comment

It is attached.  42% done and seems to be working fine, but for whatever reason I never had a problem with parity checks.  It would freeze afterwards. 

 

There are 128 sync errors, and that seems very atypical compared with the past.  Since the freezes started I've had a ton of sync errors, whereas before I would rarely have them, and if I did there were only one or two.  I don't know if that information means anything, but I thought I would mention it.

syslog3.txt

Link to comment

I typically have zero, and I think any at all should be investigated. An unclean shutdown is a likely cause. In fact, this is why an unclean shutdown automatically starts a correcting parity check.

 

Your syslog shows that you have booted in SAFE mode, so that is working now. And it shows it started a correcting parity check due to an unclean shutdown, followed by a lot of messages about correcting parity.

 

So, all seems to be as expected for now.

 

Link to comment

Ugh...Safe Mode is exhibiting the same behavior.  The parity check completed just fine with the same 128 sync errors.  All drives were green with no errors.  I first tried transferring a 2 gig file to the server.  It went fine and the speeds were what I would expect (~28 megs/sec).  I then transferred about 6 gigs of smaller size files.  It also went through just fine.  Then I tried about 20 gigs of different file sizes (500 meg - 2 gig).  About 1/10 of the way through the transfer the server locked up again. 

 

- The transfer could not continue. 

- I had no access to the web interface page. 

- I had no access to the syslog. 

- The monitor connected to the server lost signal so there was no information there.

- Fans are still spinning and the Server is powered on. 

- The HDD activity light on the front of the server is doing nothing. 

- I cannot telnet in to the server.

 

Any thoughts on how to narrow down the culprit from here?

 

Link to comment

A sincere thanks for the help thus far.  I was able to walk my wife through starting the MEMTEST.  I'll just have it run for the next day or two and see what it comes back with.  If the MEMTEST reports a failure, does that provide definitive evidence that the issue resides in the RAM, or could it possibly be in the motherboard or CPU?

 

I'll circle back with what the power supply is once I get home from work.  It has been a few years since this server was built and has held up pretty well until now. 

Link to comment

The power supply appears to be an Antec Neo Eco 400C.  I guess it could be the issue here, but it has powered this server for five years ok.  The past 3 have been with the same drives that are in there now.

 

powersupply.jpg

 

The MEMTEST has gone for 12 hours now with no errors.  I'm about to leave to go out of town from Wednesday to Sunday.  Should I just let the MEMTEST run the whole time I'm gone or should I have someone stop by to stop it after a day or two?  There won't be much troubleshooting I'll be able to perform until Sunday.

Link to comment

~/Downloads/syslog3.txt:587: Jun  8 18:31:56 Tower kernel: ata3.00: ATA-8: SAMSUNG HD203WI, S1UYJ1RZ525479, 1AN10003, max UDMA/133
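
The line above has the shape of a numbered search hit in the saved syslog. As a rough sketch only (the helper name `find_ata_lines` is hypothetical and this is not necessarily how the line was found), the kernel messages for one ATA port could be pulled from a saved log like this. The port name `ata3` is taken from the quoted line.

```shell
#!/bin/bash
# Hypothetical helper: list every kernel message in a saved syslog that
# belongs to one ATA port, prefixed with file name and line number.
find_ata_lines() {
    local logfile="$1" port="$2"
    # -H prints the file name, -n the line number.
    # "[.:]" after the port name matches "ata3.00:" and "ata3:" but not "ata30:".
    grep -H -n -E "kernel: ${port}[.:]" "$logfile"
}

# Example (file name taken from the line quoted above):
#   find_ata_lines ~/Downloads/syslog3.txt ata3
```

Any link resets or error messages that follow the drive's identification line for the same port would show up in the same listing.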

 

Again, wow, and thank you very much.  I won't be able to do anything until Sunday since I'm on my way out of town, but I'll replace the SATA cable and see if that makes a difference. 

Link to comment

I just got back from being out of town.  The memtest ran for 5+ days while I was gone without any errors.  See below:

 

memtest_1.jpg

 

I did change the SATA cable for the Samsung drive (the only Samsung drive in the array).  When I started the server again I didn't put it into Safe Mode because the boot menu went by before I could select it.  Again, the same issue (freezes) exists with or without Safe Mode, so I wouldn't think the plugins are the source of the problem.  I started the array and it began to do a parity check.  It didn't take long: as soon as a file started getting saved to the server it froze up again, and I had to shut it down manually from the switch on the power supply.  Again, no luck getting to the syslog when it freezes.  I wish I could.

 

The only other thing I can think of to do right now is change which SATA port the Samsung drive is plugged in to on the motherboard.  I'll try that this evening and see if the freezing remains. 

Link to comment

Drat...still no joy.  Following dgaschk's recommendation, I checked for dust and loose SATA cables.  There was some dust, but I wouldn't consider it excessive.  Regardless, I hit it all with a considerable amount of compressed air.  I pulled some of the SATA cables (specifically, the one for the Samsung drive) and reattached them to ensure they were secure.  I then started up the server and it started to do a parity check.  It wasn't finding any errors, so I stopped the parity check early to test how transferring data went.

 

In Safe Mode I transferred about 20 gigs.  Everything worked fine, so I was optimistic.  I then rebooted to try transfers with the plugins enabled.  It hung while attempting to reboot and I had to shut it down manually.  Once it came back up it did not detect an unclean shutdown, which was good.  With the plugins enabled everything worked fine for a while, but then the issue resurfaced and the server froze again.  As usual I could not get to the syslog, and just before the freeze the drives showed no errors and were all green.

 

Since dgaschk noticed the SError with the Samsung drive, I thought I should try changing SATA ports to see if that makes any difference.  I'm 99% sure of the following, but if I don't ask I'll manage to screw something up: all the drives simply need to be assigned to the same slots (Disk 1, Disk 2, etc.) as before...correct?  As you can see from the screenshot below they are, but the device labels have changed.  Disk 1 (Samsung) was sdc and is now sdb.  This is normal and expected since I've changed SATA ports...correct?  Those labels can change, but the slot assignments must remain the same as before or the array won't work; that is, the Samsung drive must still be Disk 1.  Again, I would just like to be sure before starting the array again.  I haven't had to change SATA ports in 5 years.

 

array.jpg

Link to comment

The sdX assignments can change every boot. They have nothing to do with physical connections. The data drives can be in any order. Only the parity drive must be in the correct slot.
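
Because the sdX letters can move around, a drive is more reliably identified by its model and serial number. On a typical Linux system udev publishes these as symlinks under `/dev/disk/by-id`; whether the unRAID 5 console presents them the same way is an assumption here. A minimal sketch, with a hypothetical helper name, that prints which sdX letter each identifier currently points to:

```shell
#!/bin/bash
# Hypothetical helper: print "<by-id name> -> <kernel device>" for each
# whole-disk symlink in a by-id directory (default: /dev/disk/by-id).
list_disks_by_id() {
    local dir="${1:-/dev/disk/by-id}" link
    for link in "$dir"/*; do
        [ -L "$link" ] || continue                      # only symlinks are of interest
        case "$link" in *-part[0-9]*) continue ;; esac  # skip partition entries
        printf '%s -> %s\n' "$(basename "$link")" "$(basename "$(readlink "$link")")"
    done
}

# Example: run before and after moving a cable and compare the output.
#   list_disks_by_id
```

The identifier on the left stays the same across reboots and port changes, while the sdX name on the right may not.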

 

There is a Log button in the upper right corner of the unRAID webGUI. First make a copy of the current log (syslog.txt) then click on the Log button. Leave the Log window open until the server crashes. Copy and paste the entire contents of the log window into a second text file (syslogB.txt). Attach both syslog.txt and syslogB.txt to a post.
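
The same idea can be approximated from a telnet session, since the syslog lives in RAM and is lost on reboot. The sketch below copies the current log to a location that survives a power cycle. The helper name is hypothetical, and the default destination assumes the flash drive is mounted at `/boot`; frequent writes also add wear to the flash drive, so this is meant for a short troubleshooting window only.

```shell
#!/bin/bash
# Hypothetical helper: copy a log file to a persistent location.
# The copy is written to a temporary name first so that a freeze in the
# middle of the copy does not destroy the previous snapshot.
snapshot_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest="${2:-/boot/syslog-latest.txt}"   # assumed flash mount point
    cp "$src" "${dest}.tmp" || return 1
    mv "${dest}.tmp" "$dest" || return 1
    sync    # push the data out to the device before a possible freeze
}

# Example: refresh the snapshot once a minute until interrupted with Ctrl-C.
#   while sleep 60; do snapshot_syslog; done
```

A snapshot taken this way can be up to a minute old when the server freezes, so the last few messages before the crash may still be missing.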

Link to comment

Thanks.  This was in safe mode.  I started the array and a parity check began.  I cancelled the parity check after about 30 minutes.  There were no sync errors.  A few file transfers went across just fine.  Then I tried a 22 gig file.  At about 4 gigs the server froze again.  No access to the web interface.  The monitor connected to the server went dark.  The server itself had power and fans were spinning, but there didn't appear to be any hdd activity. 

 

The logs were created as you instructed and they are attached. 

sysloga.txt

syslogb.txt

Link to comment

Thanks.  The cache drive is not a SATA drive.  It is a PATA / IDE (I think that is what they called it) drive.  I don't use the "mover" option.  I use the cache drive pretty much for only running plugins since I don't want the rest of the drives spinning all the time. 

 

I only see one option for AHCI in the BIOS and it is selected.  I'll take a screenshot and pass it on once I test disabling the cache drive to see what happens.

Link to comment

I disabled the cache to see if that would provide any new information.  However, when attempting to write a 20 gig file directly to Disk 4 the server froze again.  I did start the log file just in case it might help.  It is below.  Rebooting now to try the same file with Disk 3, 2, and 1 to see if there is any difference.

 

/usr/bin/tail -f /var/log/syslog
Jun 16 18:09:58 Tower avahi-daemon[1521]: Files changed, reloading.
Jun 16 18:09:58 Tower avahi-daemon[1521]: Service group file /services/smb.service changed, reloading.
Jun 16 18:09:58 Tower emhttp: shcmd (108): ps axc | grep -q rpc.mountd
Jun 16 18:09:58 Tower emhttp: Restart NFS...
Jun 16 18:09:58 Tower emhttp: shcmd (109): exportfs -ra |& logger
Jun 16 18:09:58 Tower emhttp: shcmd (110): /usr/local/sbin/emhttp_event svcs_restarted
Jun 16 18:09:58 Tower emhttp_event: svcs_restarted
Jun 16 18:09:58 Tower emhttp: shcmd (111): /usr/local/sbin/emhttp_event started
Jun 16 18:09:58 Tower emhttp_event: started
Jun 16 18:09:58 Tower avahi-daemon[1521]: Service "Tower" (/services/smb.service) successfully established.

 

Link to comment

Well, Disk 1 and Disk 4 failed the 20 gig transfer.  The log file doesn't show any new information.  It is so odd.  None of the drives show any errors in the web interface.  I'm not sure what to narrow this down to.
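
One way to take the network and the client PC out of the picture would be to repeat the large write locally on the server. The following is only a sketch: the helper name and test file name are hypothetical, and the `/mnt/diskN` paths in the usage comment assume the usual unRAID per-disk mount points.

```shell
#!/bin/bash
# Hypothetical helper: write size_mb mebibytes of zeros to a test file in
# the given directory, flush it to disk, and print the file size in bytes.
write_test_file() {
    local dir="$1" size_mb="$2" target
    target="${dir}/write-test.bin"
    # bs is given in bytes (1 MiB) so the command does not depend on suffix support.
    dd if=/dev/zero of="$target" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    sync
    wc -c < "$target" | tr -d ' '
}

# Example (20480 MiB is roughly the 20 gig transfer described above):
#   for n in 1 2 3 4; do
#       write_test_file "/mnt/disk$n" 20480 && rm "/mnt/disk$n/write-test.bin"
#   done
```

If the server also freezes during a purely local write, the network path looks less likely to be involved; if it does not, that would point attention back toward the NIC or the transfer itself.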

Link to comment

Thanks again.  So I should run reiserfsck for all data drives and the cache drive...correct? 

 

I'm running unRAID 5.0.5.  This is after unRAID v5.0-beta8d...correct?

 

I'm a Linux idiot so if I don't ask I'll screw it up.  In the wiki it uses /dev/md1 as the example.  If I want to reiserfsck data drive 1 do I enter the same thing or should the syntax be different?  Typically, I associate sdb with drive 1.

 

The directions say not to, but what would be the verbiage to reiserfsck the parity drive?     
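
For reference while waiting on an answer, here is a sketch of how the wiki's `/dev/md1` example would extend to other data disks, on the assumption that data disk N corresponds to `/dev/mdN` rather than to an sdX name. The helper is hypothetical and only prints the commands; it does not run them. The disk numbers 1 to 4 come from the disks mentioned earlier in the thread, and the wiki's own prerequisites still apply.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the read-only reiserfsck command
# for a data disk number, following the wiki's /dev/md1 example.
reiserfsck_check_cmd() {
    local disk_no="$1"
    case "$disk_no" in
        ''|*[!0-9]*) echo "disk number must be numeric" >&2; return 1 ;;
    esac
    printf 'reiserfsck --check /dev/md%s\n' "$disk_no"
}

# Print the command for data disks 1 through 4.
for n in 1 2 3 4; do
    reiserfsck_check_cmd "$n"
done
```

The parity drive holds parity data rather than a file system, which is presumably why the directions exclude it: there is nothing on it for reiserfsck to check.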

Link to comment
