
Lost GUI on Reboot - Able to Ping/SSH/Log in Locally - 6.2.0b21


SuperW2


The drive associated with ata6 looks like Disk 11, sdg, serial ending in P74.  Most likely possibility is a bad SATA cable, less likely but possibly a poor power connection.
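For anyone wanting to confirm which sdX device sits behind an ataN port from the syslog, here is a minimal sketch. It assumes the disk's resolved sysfs path contains an `ataN` component, which holds for libata-driven ports but may not for every SAS HBA driver. The function names `ata_port_of` and `list_ata_ports` are made up for illustration.

```shell
# Sketch: find which libata port (e.g. ata6) a disk hangs off.
# Assumption: the resolved sysfs path of the disk contains an "ataN" component.
ata_port_of() {
    # $1 is a resolved sysfs path, e.g. the output of: readlink -f /sys/block/sdg
    printf '%s\n' "$1" | tr '/' '\n' | grep -E '^ata[0-9]+$' | head -n 1
}

# Print "sdX ataN" for every sd* block device on this machine.
list_ata_ports() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(ata_port_of "$(readlink -f "$dev")")"
    done
}
```

Running `list_ata_ports` on the server prints one line per disk; a disk with no `ataN` in its path shows an empty second column.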

 

Disk 6 looks physically fine, so it's not 'going out'.  The problems are software issues (file system corruption), not hardware issues.

 

I wish we had a good way to test an XFS drive, that would tell us the file system is 100% fine.  All you can do is run the xfs_repair one more time, and if it doesn't indicate any issues, assume it's OK and hope that's true.  If xfs_repair doesn't find anything, it quits without saying anything.
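A read-only pass is one way to do that "one more time" check without touching the disk. The sketch below relies on `xfs_repair -n` (no-modify mode), which exits with status 1 when corruption is detected and 0 otherwise. The wrapper name `xfs_check_readonly` is made up for illustration, and `/dev/md6` for Disk 6 is an assumption about how this server names its array devices.

```shell
# Sketch: read-only XFS check. -n means "no modify": report only, change nothing.
# The file system must not be mounted (on unRAID, start the array in maintenance mode).
xfs_check_readonly() {
    if xfs_repair -n "$1"; then
        echo "no corruption detected on $1"
    else
        echo "xfs_repair reported problems on $1 (or could not run)"
        return 1
    fi
}
# Example (as root; the device name is an assumption): xfs_check_readonly /dev/md6
```

A clean result here still only means xfs_repair found nothing, not that the file system is guaranteed perfect.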

 

Thanks, I'll check the data and power cables going to sdg/Disk 11 once I get to a reboot point (still have 1.6TB of data to move off of Disk 6)... Most likely it's one of the 5:1 reverse breakout cables going to one of my SAS cards, and the power is fed by three 4-pin connectors to the back of the 5-in-3 cage...

 

-W

Link to comment

Something is still off... it appears the GUI becomes unresponsive if I start more than one file copy/move operation (either via Windows share or from SSH), even when dealing with neither Disk 6 nor Disk 11 (still trying to move some data around on other disks to get stuff in order).  Woke up this morning after starting a couple of file move operations via SSH and have lost all my user shares in Windows. I can still ping, but a new SSH connection just asks for a username and then a password (I don't have one set for the root account, and it never normally asks me).

 

I'm unable to do a clean powerdown from the server prompt using the powerdown script: it creates the diag log file and then just hangs on "The System is going down for system halt NOW!" but never physically powers off.

 

New diag log attached... 22:28-22:33 was me starting SSH sessions for file moves after a freshish reboot (because of the locked/lost GUI); 6:57 was me trying to log in again.  The USB loss at 6:54:38 was me turning off the monitor that has the USB hub where the local server keyboard/mouse are plugged in; I turned it back on at 7:01:15.

 

Also see another time sync error at 22:31:17.

media-diagnostics-20160511-0703.zip

Link to comment

I'm sorry, I find it difficult to get back to things I worked on, once I put them down.  I'll *try* to do better.

 

Great to see Disk 6 working fine now!

 

So much simultaneous stuff running! (from syslog and ps report of 20160511-0703)

- Almost 400 instances of "/usr/sbin/smbd -D" running (a serious problem)

- 12 rsync moves happening, 2 from Disk 14 to Disk 13, and 10 from Disk 17 to Disk 18 (questionable)

- A Win VM is running

- Java running

- Multiple Plex processes running

- Other plugins/containers running

- Multiple Python processes running (mostly python2 but at least 2 python3) (questionable)

- A plugin check running (questionable, hung?)

 

The most obvious issue is the numerous smbd's running, which is a known issue for quite a few versions, but I don't remember what it's associated with or how to fix.  Hoping someone with experience in that will address it.
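To watch whether the smbd count is climbing again, a quick tally of running processes by command name can help. This is only a sketch; `count_by_name` is a made-up helper that expects one command name per line, which is what `ps -eo comm=` produces.

```shell
# Sketch: tally processes by command name, busiest first.
count_by_name() {
    # stdin: one command name per line; stdout: "count name" lines
    sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
# Example: ps -eo comm= | count_by_name | head
```

Running it a few minutes apart while a copy is in progress should show whether smbd instances keep piling up.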

 

But it looks like other issues too, and it really looks like you need to simplify temporarily, to isolate the issues.  Try eliminating most but not all, and testing, then proceed accordingly.

 

The rsync's look wrong to me.  Their parent process appears to be init (PID 1), which seems odd.  They each have significant CPU usage in the past, but aren't busy now, and are still hanging around.  Many of them have a defunct process ([rsync] <defunct>) tied to them.
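A rough way to pull the same picture out of a live system is to filter `ps` output for rsync rows and flag the zombies and the ones whose parent is PID 1. This is a sketch; `rsync_report` is a made-up helper, and it assumes input in the column order produced by `ps -eo pid=,ppid=,stat=,args=`.

```shell
# Sketch: summarize rsync processes from "PID PPID STAT COMMAND..." lines.
rsync_report() {
    awk '$4 ~ /rsync/ {
        tag = ($3 ~ /^Z/) ? "defunct" : "live"        # STAT starting with Z is a zombie
        orphan = ($2 == 1) ? " parent=init" : ""      # PPID 1 means reparented to init
        print $1, tag orphan
    }'
}
# Example: ps -eo pid=,ppid=,stat=,args= | rsync_report
```

Each output line is the PID followed by `live` or `defunct`, with `parent=init` appended when the parent is PID 1.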

 

Something that seems odd, near the end of the syslog of 20160511-0703 -

- At May 11 06:54:38, a USB set of ports (possibly a hub) went down, attached were your keyboard and mouse

- At May 11 06:57:18 (until 07:01:03), you tried to Telnet in 5 times, from station at 192.168.1.11, and each time it reports an "ILLEGAL ROOT LOGIN"

- At May 11 07:01:15, the USB ports are found again, and your keyboard and mouse are reattached

- At May 11 07:01:19, you login from the console

- At May 11 07:03:08, you initiate Powerdown

Edit: Sorry, I just read back through your posts, and you completely explained all that  :-[

 

You mention the TIME ERRORs; ignore them.  I see them often, in syslogs with correct time and incorrect time.  I experience them myself, with the time already correct.

 

------------------------------

In final diagnostics, you are simplifying, no VM, NTP turned off, no CrashPlan, perhaps others missing too.  You've added Firefox, fluxbox, and Xorg though.  (Thank you!  It's my first chance to compare syslogs with and without bzroot-gui.)

 

This time, it's Disk 4 (sdf, serial ending in Q32) with a cluster of 4 CRC errors.  Probably a bad SATA cable for it too.  (But this is minor, and explains nothing else.)
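To see whether the CRC count keeps growing after reseating or swapping the cable, the drive's own counter can be read from SMART. This is a sketch that assumes the drive exposes SMART attribute 199 (commonly labeled UDMA_CRC_Error_Count) and that `smartctl -A` prints the raw value in the tenth column; `crc_error_count` is a made-up helper name.

```shell
# Sketch: extract the raw value of SMART attribute 199 from `smartctl -A` output.
crc_error_count() {
    # Match on the attribute ID rather than the name, since the name varies by vendor.
    awk '$1 == 199 {print $10}'
}
# Example (as root): smartctl -A /dev/sdf | crc_error_count
```

If the number stays the same across a few days of normal use, the cable fix probably worked; the counter itself never resets.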

 

I'm sorry I don't see anything else, no more errors are reported that I can see, so no explanation for what is tormenting your system.

Link to comment

Thanks for the reply... I understand bits of that, and can explain some of it.

 

As far as the rsyncs, I use the Indexer App/Excel spreadsheet from http://lime-technology.com/forum/index.php?topic=33689.0 to help move sections of my file shares from one drive to another (with 18 data drives, I try to keep the files/folders somewhat organized, and I have to do that manually).  I can get an example of the rsync command that it uses.  I'm not changing the rsync commands at all, just a direct copy and paste from the spreadsheet into an SSH connection to the server.  I have previously, on occasion, had 5 or 6 SSH sessions, each with its own rsync process moving files from one drive to another, and never had an issue.

 

WinVM... I run a Win10 test VM occasionally, and it would likely have been running during some of these last hang-ups.  - Easy to disable

Java... No idea from where or what - ?

Plex is running from the "linuxserver" docker, but I don't think any of my devices would have actively been using it.

 

I only have a few plugins running now (removed several): Powerdown Package 2.20 (that doesn't work), Fix Common Problems (that doesn't find any with my system), Community Applications, and then the main unRAID Server OS and Dynamix WebGui ones.

 

For troubleshooting, I can disable Docker completely (where CrashPlan/Plex/SabNZB/Sickbeard, etc. are all running) along with the "extra" plugins and see if there is any change.  I've disabled NTP for now.

 

No clue on the smbd or the python processes.

 

I can almost "replicate" a GUI lockup/crash on command by starting just 2 simultaneous SMB file copy operations on my Win10 box from any one drive to another anywhere in my array.  Once that happens, the copy/move fails and the GUI is unreachable, but I never lose ping/SSH/server console access.

 

...and possibly unrelated, I still think it's goofy that the powerdown script run from an SSH command line won't actually shut down my server (maybe a BIOS setting?).  Also, when I change the Syslinux settings on the flash drive from the console, they never update: I don't see the GUI option in the boot menu, and can only enter GUI mode on the server when I halt the autostart, hit Tab, and add the GUI command line.

Link to comment

What program do you use at the Windows side to copy files?

 

I am using Total Commander and can copy 60+ GB of data over SMB at near 1Gbps speed without any issues or lock ups.

 

Perhaps trying different copy methods will reveal more?

 

Just using native Windows Explorer using either Disk or User shares.  I've never tried anything else.

Link to comment

Just for giggles, I tested Total Commander and it locked after attempting to move 800ish MB of a single 1GB file from Disk 18 to Disk 1...

 

Docker disabled, VMs disabled, NTP off. I forgot to disable my extra plugins, but almost nothing else is running.

 

Latest Diags attached after last hard reboot.

 

I also went into BIOS after boot, and my BIOS clock was off again, forward several hours... odd...

media-diagnostics-20160513-1532.zip

Link to comment

Archived

This topic is now archived and is closed to further replies.
