johvabroson Posted March 22, 2016

So before I force a shutdown of my server, causing a day-long parity check, I wanted to see if anyone has encountered their whole Unraid OS locking up. I'm talking no webGUI, no ping, no dockers, no SMB, no SSH, and even the console won't accept input from the keyboard (yet the cursor is still flashing on the console?). This is the second time this has happened, and I would love to post my syslog, but I can't access it unless I shut down.

I was transferring a bunch of data, multiple TBs. Would the OS lock up close to full capacity? Thoughts and suggestions? Thanks!
danioj Posted March 22, 2016

Firstly, sorry you're having these issues. Let's see if we can help you debug.

#RANT: It is certainly an issue that you can't grab your diagnostics file. We really need a resolution to this.

I don't THINK that getting to FULL capacity would "lock" your server. I do, however, assume that you are not TRYING to transfer more than your server's capacity anyway?

I know you asked about recovery, BUT if you can't access the server at all, then short of cancelling the transfer (if it is still going?) and waiting it out for a while (perhaps even overnight) to see if things come back up, I see no other course of action but a HARD reset.

There is hope POST reset though. In my experience the main reason for a lock is a bad disk. Fortunately, SMART reports contain historical data and are a good indicator of a disk that could be behind a lockup.

IF you do a hard reset, can the first thing you do please be to grab a diagnostics file: Tools>Diagnostics>Download. This will aid us in at least analysing your SMART reports and "potentially" other things.

If you're game, before I or anyone else has had a look at the diagnostics, an initial short SMART test followed by an extended SMART test (which can take a couple of hours, so don't worry) of each of your disks is not a bad way to go.
johvabroson Posted March 22, 2016 Author

So after a hard reset, I attached my syslog and SMART reports. It seems to be continuously removing and reinstalling the unraid driver. Also, while my Disk 2 has an overall health of "passed", it is reporting these errors:

ID   Attribute                 Flag    Value  Worst  Thresh  Type      Updated  Failed  Raw
187  Reported uncorrect        0x0032  001    001    000     Old age   Always   Never   65535
5    Reallocated sector count  0x0033  100    100    036     Pre-fail  Always   Never   1136

Thoughts?

Attachments: smart.zip, syslog.zip
danioj Posted March 22, 2016

#RANT: Can I PLEASE ask that when you (and anyone else reading this) post your logs, you do so in the expected way: Tools>Diagnostics>Download. There is no need to post them separately, as everything we would normally need to diagnose a fault is in that zip file. http://lime-technology.com/forum/index.php?topic=39257.0 Rant over!

TBH mate, that syslog tells me nothing. We need to GRAB that log BEFORE the hard reset! As a side note, you were the inspiration for this work I did last night: http://lime-technology.com/forum/index.php?topic=47732.0

Anyway, from what you HAVE posted, AND as you noted, disk ST3000DM001-9YN166-W1F0Q9G1 is POTENTIALLY in bad shape:

5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1136

I notice from the log that you are running a SMART test. Assuming this is an extended one, you are already doing the right thing.

It is feasible (I guess) that your crashes are a result of errors on that disk; HOWEVER, reallocated sectors are not immediately a bad thing IMHO. There are some people who replace drives as soon as they start showing any reallocated sectors. That said, a drive CAN show reallocated sectors for years and work perfectly fine.

You can consider reallocated sectors as INVISIBLE bad sectors that have been replaced / swapped with reserve sectors. These sectors are NO LONGER VISIBLE to unRAID and as such can NEVER cause any more problems. To go further, these are bad sectors from the past: they might have caused problems back then, but that does not mean that they did. Disks also replace weak sectors as a precaution, and those may never have caused any problems.

Note that smartctl shows this attribute is not FAILING NOW, which means the drive has NOT exhausted its pool of spare sectors, so I don't really see that it's time to replace it immediately. That said, the pool of available reserve sectors is not endless and AFAIK is usually around 1000, SO I believe some might say that with your count at 1136 the safest bet is to replace it NOW.

As an aside, the attribute which you WOULD want to worry about is Current Pending Sector. Current pending sectors are bad sectors that CANNOT BE READ but are still visible to unRAID. These are VERY DANGEROUS and cause A LOT of problems. In unRAID specifically they CAN and DO invalidate and corrupt your parity and data. However, you DON'T have any of them!

Ultimately I can't tell from what you have posted what is causing your lockup. Let's see what the results of your extended SMART test are on that drive and take it from there.
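For anyone following along, the attribute triage described in this post can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not an unRAID or smartmontools tool: it assumes attribute lines in smartctl's ten-column layout, like the Reallocated_Sector_Ct line quoted in the post. The names SmartAttr, WATCHED, parse_attr_line and triage are hypothetical. In the sample, the 187 line carries the values reported in this thread re-expressed in smartctl's layout, and the 197 line is a made-up raw-zero line standing in for "no pending sectors".

```python
from typing import List, NamedTuple, Optional, Tuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value
    worst: int
    thresh: int
    raw: int


# Attributes discussed in this thread (hypothetical watch list).
WATCHED = {
    5: "reallocated sectors",
    187: "reported uncorrectable errors",
    197: "current pending sectors",
}

# Sample lines; the 197 line is hypothetical, see the lead-in.
SAMPLE = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1136",
    "187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 65535",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0",
]


def parse_attr_line(line: str) -> Optional[SmartAttr]:
    """Parse one ten-column attribute line; return None for anything else."""
    parts = line.split()
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    try:
        return SmartAttr(
            attr_id=int(parts[0]),
            name=parts[1],
            value=int(parts[3]),
            worst=int(parts[4]),
            thresh=int(parts[5]),
            raw=int(parts[9]),
        )
    except ValueError:
        # Raw values that are not plain integers are skipped in this sketch.
        return None


def triage(lines: List[str]) -> List[Tuple[int, int, bool]]:
    """Return (id, raw, failing_now) for watched attributes with a non-zero raw value.

    failing_now follows the usual rule: normalized value at or below a
    non-zero threshold.
    """
    findings = []
    for line in lines:
        attr = parse_attr_line(line)
        if attr is None or attr.attr_id not in WATCHED:
            continue
        if attr.raw > 0:
            failing_now = attr.thresh > 0 and attr.value <= attr.thresh
            findings.append((attr.attr_id, attr.raw, failing_now))
    return findings
```

Calling triage(SAMPLE) reports attributes 5 and 187 as non-zero but not failing now, and stays silent on 197, which matches the reading in this post. Whether a non-zero raw count justifies replacing the disk is still a judgment call; the sketch only surfaces the numbers.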
johvabroson Posted March 23, 2016 Author

Thanks a lot for your help, man. Sorry I didn't post the full diagnostics zip; I'm very privacy oriented and didn't want to share more than I needed to. Yeah, I realized after the hard reset that all the syslogs were erased and that the ones I provided weren't gonna be very helpful. I couldn't access my frozen server in any way to get my hands on them, at least to my knowledge.

Wow, those scripts look brilliant; I hope they get made into an official plugin! I read up on the SMART errors and found the same results, that it is probably not an issue ATM.

Here are my full diagnostics files (post extended SMART test), which said it completed without errors. UnRAID seemed like such a powerful, simple solution to so many traditional server needs and problems, but it has still been a pain in the ass just like all of those. First Docker issues, then the OS freezing up. I guess only with your cron job script could I catch the syslog as it freezes... Thanks!

Attachment: tower-diagnostics-20160322-2058.zip
danioj Posted March 23, 2016

You don't need to worry about the privacy of the diagnostics IMHO. A lot of work has been done on making them as private as possible by blocking out share names, files, emails, etc.

I don't have time to review ALL of those logs right now BUT will do so tonight (10 hours from now) if no one beats me to it.

Good to know the SMART test passed. I'm more convinced now that the issue is probably not the disk, BUT without a set of logs from before the lockup it is hard to speculate.

You could TRY that script AND then repeat what you did to cause the freeze. I have run it for the past 12+ hours and it is fine. It works. Remember it is a WIP and is far from the best coding in the world.
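To make the idea of catching the syslog before a freeze concrete, here is a minimal Python sketch of a periodic snapshot. It is NOT danioj's linked script and makes no claim about how that script works: the function name snapshot_log, the syslog-<stamp>.txt naming and the keep count are all hypothetical. The only idea it illustrates is copying the live log to storage that survives a hard reset, and pruning old copies so that storage does not fill up.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot_log(src: Path, dest_dir: Path, keep: int = 5,
                 stamp: Optional[str] = None) -> Path:
    """Copy src into dest_dir under a timestamped name, keeping the newest `keep` copies.

    `stamp` can be passed explicitly (useful for testing); by default the
    current local time is used.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"syslog-{stamp}.txt"
    shutil.copy2(src, target)

    # Timestamped names sort chronologically, so everything before the
    # last `keep` entries is an older snapshot that can be deleted.
    snapshots = sorted(dest_dir.glob("syslog-*.txt"))
    for old in snapshots[:-keep]:
        old.unlink()
    return target
```

A scheduler such as cron could then call something like snapshot_log(Path("/var/log/syslog"), Path("/boot/logs")) every few minutes. Both paths are assumptions about a typical setup and should be checked on the actual server. Frequent writes to a flash device also cause wear, so the interval and the keep count are a trade-off.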
John_M Posted March 23, 2016

Quoting johvabroson: It seems to be continuously removing and reinstalling the unraid driver.

That's quite normal until you start the array.

You're getting uncorrectable read errors on that 3 TB Seagate (SMART parameter 187). Replace it.
danioj Posted March 23, 2016

Can't believe I missed that in my scan of the SMART report. That's what I get for doing two things at once. John_M is right. Time for a new disk.

Question to the OP though. This parameter is usually "Monitored" by unRAID. Did you get ANY notification on the GUI or otherwise for this? I would have expected, at the very least, that on the Dashboard the "Green Thumbs Up" in the SMART status cell of the disk would have been a "Yellow Warning Exclamation"?
John_M Posted March 23, 2016

Lots of failed SMART tests suggest he had an idea that something was wrong:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        20612            -
# 2  Short offline       Completed without error  00%        20602            -
# 3  Extended offline    Completed without error  00%        20540            -
# 4  Extended offline    Completed: read failure  90%        20443            1525732032
# 5  Short offline       Completed: read failure  90%        20443            1525732032
# 6  Extended offline    Completed: read failure  90%        20442            1525732032
# 7  Short offline       Completed: read failure  90%        20442            1525732032
4 of 4 failed self-tests are outdated by newer successful extended offline self-test # 1

How do you explain the recent passes? Well, my guess is that a piece of contamination became dislodged and flew off the platter. It might have been caught by the air filter or it might still be bouncing around.
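The pattern in that self-test log (four read failures at one LBA, followed by three newer passes) can also be pulled out programmatically. The following is a rough Python sketch under stated assumptions: it only recognises the "Short offline" and "Extended offline" test types that appear in this thread, the names SelfTest, ENTRY, LOG and parse_selftest_log are hypothetical, and LOG is the log text from the post above.

```python
import re
from typing import List, NamedTuple, Optional


class SelfTest(NamedTuple):
    num: int                        # 1 is the most recent test
    kind: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]  # None when the log shows "-"


# One log entry: "# N  <test type>  <status>  NN%  <hours>  <LBA or ->"
ENTRY = re.compile(
    r"#\s*(\d+)\s+(Short offline|Extended offline)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(-|\d+)"
)

# Self-test log as posted in this thread.
LOG = """\
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 20612 -
# 2 Short offline Completed without error 00% 20602 -
# 3 Extended offline Completed without error 00% 20540 -
# 4 Extended offline Completed: read failure 90% 20443 1525732032
# 5 Short offline Completed: read failure 90% 20443 1525732032
# 6 Extended offline Completed: read failure 90% 20442 1525732032
# 7 Short offline Completed: read failure 90% 20442 1525732032
4 of 4 failed self-tests are outdated by newer successful extended offline self-test # 1
"""


def parse_selftest_log(text: str) -> List[SelfTest]:
    """Extract self-test entries in the order they appear (newest first)."""
    entries = []
    for m in ENTRY.finditer(text):
        lba = None if m.group(6) == "-" else int(m.group(6))
        entries.append(
            SelfTest(
                num=int(m.group(1)),
                kind=m.group(2),
                status=m.group(3).strip(),
                remaining_pct=int(m.group(4)),
                lifetime_hours=int(m.group(5)),
                first_error_lba=lba,
            )
        )
    return entries


def failed_tests(entries: List[SelfTest]) -> List[SelfTest]:
    """Entries that recorded an LBA for their first error."""
    return [e for e in entries if e.first_error_lba is not None]
```

Running failed_tests(parse_selftest_log(LOG)) returns the four older entries, all pointing at the same LBA, while the newest entry is a clean extended test. That is the same picture described in the post; the sketch says nothing about why the later tests passed.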
danioj Posted March 23, 2016

God, I missed that too! I wasn't having a very good day yesterday at all!

Anyway, interesting speculation about the contamination of the platter.
bonienl Posted March 23, 2016

Quoting danioj: ...Question to the OP though. This parameter is usually "Monitored" by unRAID. Did you get ANY notification on the GUI or otherwise for this? I would have expected at very least that on the Dashboard the "Green Thumbs Up" on the SMART Status Cell of the Disk would have been a "Yellow Warning Exclamation"?

System notifications must be enabled to start background monitoring and receive notifications.
danioj Posted March 23, 2016

After I posted and re-read, I knew someone was going to be literal about my use of the word "notification". What I really "meant" was ANY "indication". This would include Notifications OR Status Icons etc.

Am I right in assuming that the SMART Status Icon on the Dashboard is not linked to the "Notification" system!?

Sorry, given there is a function called Notifications, I should choose my words better.
bonienl Posted March 23, 2016

I wasn't very clear either; my remark was intended to tell people that notifications must be enabled first (the default is off).

The display of the SMART icons on the dashboard does indeed run independently of the notifications.