OS freezes up after large data transfer



So before I force a shutdown of my server, which would trigger a day-long parity check, I wanted to see if anyone has encountered their whole Unraid OS locking up. I'm talking no webGUI, no ping, no Docker containers, no SMB, no SSH, and even the console won't accept input from the keyboard (yet the cursor is still flashing on the console).

 

This is the second time this has happened, and I would love to post my syslog, but I can't access it unless I shut down.

 

I was transferring a bunch of data, multiple TBs. Would the OS lock up close to full capacity?

 

Thoughts and suggestions? Thanks!

Link to comment

Firstly, sorry you're having these issues. Let's see if we can help you debug.

 

#RANT: It is certainly an issue that you can't grab your diagnostics file. We really need a resolution to this.

 

I don't THINK that getting to FULL capacity would "lock" your server. I do however assume that you are not TRYING to transfer more than your server's capacity anyway??

 

I know you asked about recovery, BUT if you can't access the server at all, then short of cancelling the transfer (if it is still going??) and waiting it out for a while (perhaps even overnight) to see if things come back up, I see no other course of action but a HARD reset.

 

There is hope POST reset though. In my experience the main reason for a lock is a bad disk. Fortunately SMART reports do contain historical data and are a good indication of a potential lockup.

 

IF you do a Hard Reset can the first thing you do please be to Grab a Diagnostics File:

 

Tools>Diagnostics>Download

 

This will aid us in at least analysing your SMART reports and "potentially" other things.

 

If you're game, then before I or anyone else has had a look at the diagnostics, running an initial short SMART test on each of your disks, followed by an extended one (this can take a couple of hours, so don't worry), is not a bad way to go.
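If you would rather start those tests from a script than click through the GUI, something like the following might do. It is only a sketch: it assumes a Python 3 interpreter and `smartctl` are available on the server, the device names are hypothetical placeholders, and `smart_test_command` / `start_smart_test` are names made up for this example.

```python
import subprocess

# Map the test names used in this thread to smartctl's "-t" argument.
TEST_TYPES = {"short": "short", "extended": "long"}

def smart_test_command(device, kind="short"):
    """Build the smartctl command that starts a self-test on `device`."""
    return ["smartctl", "-t", TEST_TYPES[kind], device]

def start_smart_test(device, kind="short"):
    """Start the self-test (needs root and smartctl on the PATH)."""
    return subprocess.run(smart_test_command(device, kind), check=True)

# Hypothetical device names; substitute your own disks.
for dev in ["/dev/sdb", "/dev/sdc"]:
    print(" ".join(smart_test_command(dev, "short")))
```

The loop only prints the commands; call `start_smart_test(dev, "short")` to actually run them. The tests run on the drive itself, and `smartctl -l selftest <device>` shows the results once they finish.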

Link to comment

So after a hard reset, I attached my syslog and smart reports.

 

It seems to be continuously removing and reinstalling the unraid driver. Also, while my Disk 2 has an overall health of "passed", it is reporting these errors:

 

  • 187  Reported uncorrect        0x0032  001  001  000  Old age   Always  Never  65535
  •   5  Reallocated sector count  0x0033  100  100  036  Pre-fail  Always  Never  1136

 

Thoughts?

smart.zip

syslog.zip

Link to comment

#RANT: Can I PLEASE ask that when you (and anyone else reading this) post your logs, you do so in the expected way: Tools>Diagnostics>Download. There is no need to post them separately, as everything we would normally need to diagnose a fault is in that zip file.

 

http://lime-technology.com/forum/index.php?topic=39257.0

 

Rant over! ;) TBH mate, that syslog tells me nothing. We need to GRAB that log BEFORE the hard reset!

 

As a side note, you were the inspiration for this work I did last night:

 

http://lime-technology.com/forum/index.php?topic=47732.0

 

Anyway, from what you HAVE posted AND as you noted Disk ST3000DM001-9YN166-W1F0Q9G1 is POTENTIALLY in bad shape.

 

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1136 

 

I notice from the log that you are running a SMART test. Assuming this is an extended one, you are already doing the right thing.

 

It is feasible (I guess) that your crashes are a result of errors on that disk. HOWEVER, reallocated sectors are not immediately a bad thing IMHO. Some people replace drives as soon as they start showing any reallocated sectors. That said, a drive CAN show reallocated sectors for years and work perfectly fine.

 

You can consider Reallocated Sectors as INVISIBLE bad sectors that have been replaced / swapped with reserve sectors. These sectors are NO LONGER VISIBLE to unRAID and as such can NEVER cause any more problems. To go further, these are bad sectors from the past: they might have caused problems back then, but that does not mean they did. Disks also replace weak sectors as a precaution, and those may never have caused any problems. Note that smartctl shows this attribute is not FAILING NOW (normalized value 100 against a threshold of 036), which means the drive has NOT exhausted its pool of spare sectors, so I don't see that it is time to replace it immediately. That said, the pool of available reserve sectors is not endless and AFAIK is usually around ~1000, SO I believe some might say that, as your count is at 1136, the safest bet is to replace it NOW.

 

As an aside, the attribute which you WOULD want to worry about is Current Pending Sectors. Current Pending Sectors are bad sectors that CANNOT BE READ but are still visible to unRAID. These are VERY DANGEROUS and cause A LOT of problems. In unRAID specifically they CAN and DO invalidate and corrupt your parity and data. However, you DON'T have any of them! :)

 

Ultimately I can't tell from what you have posted what is causing your lockup. Let's see what the results of your extended SMART test are on that drive and take it from there.
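For anyone who wants to check these attributes without reading the whole report by eye, here is a rough sketch that scans a smartctl-style attribute table for the three IDs discussed in this thread (5, 187 and 197). The sample rows reuse the raw values posted above, laid out in smartctl's column format; the flag and threshold columns on the 197 row are made up for illustration, and `flag_attributes` is just a name chosen for this example.

```python
import re

# Attribute IDs discussed in this thread. Which ones to watch is a judgment call.
WATCHED = {5: "Reallocated_Sector_Ct",
           187: "Reported_Uncorrect",
           197: "Current_Pending_Sector"}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def flag_attributes(report):
    """Return {attribute_id: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        attr_id, raw = int(m.group(1)), int(m.group(10))
        if attr_id in WATCHED and raw > 0:
            flagged[attr_id] = raw
    return flagged

# Sample input: the 197 row is illustrative only (the thread just says there are none pending).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1136
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       65535
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(flag_attributes(SAMPLE))  # {5: 1136, 187: 65535}
```

A non-zero raw value is only a prompt to look closer, not a verdict on the disk.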

Link to comment

Thanks a lot for your help man. Sorry I didn't post the full diagnostic zip, I'm very privacy oriented and didn't want to share more than I needed to.

 

Yeah, I realized after the hard reset that all the syslogs were erased and that the ones I provided weren't gonna be very helpful. I couldn't access my frozen server in any way to get my hands on them, at least to my knowledge.

 

Wow, those scripts look brilliant; I hope they get made into an official plugin!

 

I read up on the SMART errors and found the same results, that it is probably not an issue ATM.

 

Here are my full diagnostics files (post extended smart test), which said it completed without errors.

 

UnRAID seemed like such a powerful, simple solution to so many traditional server needs and problems, but still has been a pain in the ass just like all of those. First Docker issues, then the OS freezing up. I guess only with your cron job script could I catch the syslog as it freezes...

 

Thanks!

tower-diagnostics-20160322-2058.zip

Link to comment

You don't need to worry about privacy of the diagnostics IMHO. A lot of work has been done on making them as private as possible by blocking out share names, files, emails etc.

 

I don't have time to review ALL of those logs right now BUT will do so tonight (10 hours from now) if no-one beats me to it.

 

Good to know the SMART report passed. More convinced now that the issue is probably not the disk, BUT without a set of logs before the lockup it is hard to speculate.

 

You could TRY that script AND then repeat what you did to cause the Freeze. I have run it for the past 12+ hours and it is fine. It works. Remember it is a WIP and is far from the best coding in the world.
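To give an idea of the approach (this is NOT the script from the linked thread, just a rough sketch of the same idea), the snippet below copies the live syslog to persistent storage under a timestamped name, so that a copy survives a hard reset. It assumes a Python 3 interpreter is available on the server and that the log lives at `/var/log/syslog` with the flash drive mounted at `/boot`; both paths, the `logs` folder and the `snapshot_log` name are assumptions for this example.

```python
import shutil
import time
from pathlib import Path

def snapshot_log(source, dest_dir, keep=10):
    """Copy `source` into `dest_dir` under a timestamped name, keeping the newest `keep` copies."""
    source, dest_dir = Path(source), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"syslog-{stamp}.txt"
    shutil.copy2(source, target)
    # Timestamped names sort chronologically, so the oldest copies come first.
    snapshots = sorted(dest_dir.glob("syslog-*.txt"))
    for old in snapshots[:-keep]:
        old.unlink()
    return target

# Example call with the assumed paths (run it from cron every minute or so):
# snapshot_log("/var/log/syslog", "/boot/logs")
```

The last snapshot taken before a freeze is the one worth posting. Frequent writes do add wear to the flash drive, so this is something to run while chasing a lockup rather than permanently.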

Link to comment

It seems to continuously removing and reinstalling the unraid driver.

You're getting uncorrectable read errors on that 3 TB Seagate (SMART parameter 187). Replace it.

 

Can't believe I missed that in my scan of the SMART report. That's what I get for doing two things at once. John_M is right. Time for a new disk.

 

Question to the OP though. This parameter is usually "Monitored" by unRAID. Did you get ANY notification on the GUI or otherwise for this?

 

I would have expected at the very least that on the Dashboard the "Green Thumbs Up" on the SMART Status Cell of the Disk would have been a "Yellow Warning Exclamation"?

Link to comment

Lots of failed SMART tests suggest he had an idea that something was wrong:

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20612         -
# 2  Short offline       Completed without error       00%     20602         -
# 3  Extended offline    Completed without error       00%     20540         -
# 4  Extended offline    Completed: read failure       90%     20443         1525732032
# 5  Short offline       Completed: read failure       90%     20443         1525732032
# 6  Extended offline    Completed: read failure       90%     20442         1525732032
# 7  Short offline       Completed: read failure       90%     20442         1525732032
4 of 4 failed self-tests are outdated by newer successful extended offline self-test # 1

 

How do you explain the recent passes? Well, my guess is that a piece of contamination became dislodged and flew off the platter. It might have been caught by the air filter or it might still be bouncing around.
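If you want to pull the failures out of a self-test log like the one above without reading it by eye, here is a rough sketch. It assumes the column layout shown in this post; `failed_self_tests` is a name made up for the example.

```python
import re

# "# N  description  status  remaining%  lifetime_hours  LBA"
ENTRY = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")

def failed_self_tests(log):
    """Return (test_number, description, lifetime_hours, first_error_lba) for each failed test."""
    failures = []
    for line in log.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue
        num, desc, status, _remaining, hours, lba = m.groups()
        if "failure" in status.lower():
            failures.append((int(num), desc, int(hours), int(lba) if lba.isdigit() else None))
    return failures

# The log posted above, copied verbatim.
LOG = """\
# 1  Extended offline    Completed without error       00%     20612         -
# 2  Short offline       Completed without error       00%     20602         -
# 3  Extended offline    Completed without error       00%     20540         -
# 4  Extended offline    Completed: read failure       90%     20443         1525732032
# 5  Short offline       Completed: read failure       90%     20443         1525732032
# 6  Extended offline    Completed: read failure       90%     20442         1525732032
# 7  Short offline       Completed: read failure       90%     20442         1525732032
"""

for failure in failed_self_tests(LOG):
    print(failure)
```

On this log it reports tests 4 to 7, all stopping at the same LBA (1525732032), which is what you would expect from a single bad spot on the disk rather than scattered damage.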

 

Link to comment

God, I missed that too! I wasn't having a very good day yesterday at all!

 

Anyway interesting speculation about the contamination of the platter.

Link to comment

...Question to the OP though. This parameter is usually "Monitored" by unRAID. Did you get ANY notification on the GUI or otherwise for this?

 

I would have expected at the very least that on the Dashboard the "Green Thumbs Up" on the SMART Status Cell of the Disk would have been a "Yellow Warning Exclamation"?

 

System notifications must be enabled to start background monitoring and receive notifications.

Link to comment

After I posted and re-read, I knew someone was going to be literal about my use of the word "notification". What I really "meant" was ANY "indication". This would include Notifications OR Status Icons etc. Am I right in assuming that the SMART Status Icon on the Dashboard is not linked to the "Notification" system!?

 

Sorry, given there is a function called Notifications I should choose my words better.

Link to comment

I wasn't very clear either. My remark was intended to tell people that notifications must be enabled first (default is off).

 

The display of the SMART icons on the dashboard runs independently of the notifications, indeed.

 

Link to comment
