Unraid GUI dies consistently after 6 days



Hi all!

 

So, for the last little while now, Unraid 6.1.x has been crashing on me unexpectedly.  Since upgrading to 6.1.6, at least it's only the GUI that crashes (partially) and I can still access the terminal prompt, but the server goes offline.

 

I'm not sure what's causing this, but here's the syslog (attached), a screenshot of the GUI when it's dead, and a screenshot of the terminal after I issued the powerdown script.

 

Any idea which "device" Powerdown is complaining about?  Also, in the syslog, the mover complains that there is no space... I'll have to look at that drive to see whether it is actually full.

 

Thanks in advance!

syslog.zip

Unraid_Off-line.PNG.c972e266ade9fd0e9eb730d29439ba32.PNG

Unraid_CLI_Powerdown.PNG.5bc52fd0d6fae51b772dcb5a54f794d9.PNG

Link to comment

In the meantime, what do "df -h" and "df -h /" produce?

 

Also, going forward, always try using the "diagnostics" command to produce a zip file that contains more information, such as which processes are running. It also includes disk share information and what you may be running from the go script. The zip file is created under /boot/logs/machinename-diagnostics-yyyymmdd-HHMM.zip .

 

 

I don't know whether powerdown is having issues with a full log filesystem (/var/log/) or a full USB flash drive (/boot).
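
A quick way to check both candidates is to print how full each filesystem is. This is only a sketch: the helper name parse_use_pct is made up for illustration, and it assumes the POSIX output format of "df -P", where the fifth column is the usage percentage.

```shell
# Hypothetical helper (not part of Unraid): read `df -P` output on stdin
# and print the Use% of the first listed filesystem, without the % sign.
parse_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report how full the log filesystem and the flash drive are.
for path in /var/log /boot; do
    if [ -d "$path" ]; then
        echo "$path: $(df -P "$path" | parse_use_pct)% used"
    fi
done
```

If either one reports a value close to 100, that is probably the "device" with no space left.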

 

Link to comment

You're running out of ram on /var/ filesystem.

 

Also, on the last day, mover complained about a filesystem running out of space as well. Line 44 of the mover script writes to the /var/run/mover.pid file. If your /var/run is running out of space, then essentially you're running out of RAM, since /var/ is located in RAM.

 

You need to do some of the following:

  • Add more RAM to your system.
  • Figure out what docker or other process is writing to the /var/ filesystem instead of on your cache drive or array drives.
  • Cut down on the number of plugins you're running.
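
To narrow down the second item, it can help to see which directories under /var take up the most space. A rough sketch, assuming a "du" that supports -x (GNU coreutils does); the function name largest_dirs is just for illustration. The -x flag keeps du from descending into other mounted filesystems, such as a Docker loop device at /var/lib/docker, so that space is not counted against RAM.

```shell
# Hypothetical helper: list the largest entries directly under a directory.
# $1 = directory to inspect (default /var), $2 = number of entries (default 10).
# Sizes are printed in KiB, largest first.
largest_dirs() {
    dir=${1:-/var}
    count=${2:-10}
    du -xsk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

largest_dirs /var 10
```

Running it again a day later and comparing the two outputs should show which directory is growing.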

 

 

Jan  6 03:40:01 Tower logger: /usr/local/sbin/mover: line 44: echo: write error: No space left on device

 

Also, logrotate exited abnormally, not sure why:

 

Jan  6 04:40:01 Tower logrotate: ALERT - exited abnormally.

 

Your syslog does indicate you had (have?) some errors with your BTRFS on /dev/sdn1.

 

Jan  1 11:06:28 Tower kernel: BTRFS (device sdn1): parent transid verify failed on 1246248960 wanted 309699 found 303642
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246248960 (dev /dev/sdo1 sector 2434080)
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246253056 (dev /dev/sdo1 sector 2434088)
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246257152 (dev /dev/sdo1 sector 2434096)
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246261248 (dev /dev/sdo1 sector 2434104)
Jan  1 11:06:28 Tower kernel: BTRFS (device sdn1): parent transid verify failed on 1246232576 wanted 309699 found 303642
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246232576 (dev /dev/sdo1 sector 2434048)
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246236672 (dev /dev/sdo1 sector 2434056)
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246240768 (dev /dev/sdo1 sector 2434064)
Jan  1 11:06:28 Tower kernel: BTRFS: read error corrected: ino 1 off 1246244864 (dev /dev/sdo1 sector 2434072)
Jan  1 11:08:33 Tower kernel: BTRFS (device sdn1): parent transid verify failed on 1289977856 wanted 303764 found 300178
Jan  1 11:08:33 Tower kernel: BTRFS: read error corrected: ino 1 off 1289977856 (dev /dev/sdo1 sector 2519488)
Jan  1 11:08:33 Tower kernel: BTRFS: read error corrected: ino 1 off 1289981952 (dev /dev/sdo1 sector 2519496)
Jan  1 11:08:33 Tower kernel: BTRFS: read error corrected: ino 1 off 1289986048 (dev /dev/sdo1 sector 2519504)
Jan  1 11:08:33 Tower kernel: BTRFS: read error corrected: ino 1 off 1289990144 (dev /dev/sdo1 sector 2519512)

Link to comment
You're running out of ram on /var/ filesystem.

 

Ah, well, maybe that's because my go file has this line in it:

# resize tmpfs
mount -o remount,size=1024m /var/log

 

That was put in there following an earlier suggestion to increase it, as I have been having this problem for a while.

 

Here's the output from df -h and df -h /

root@Tower:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.0G   94M  931M  10% /var/log
/dev/sda1       7.5G  179M  7.3G   3% /boot
/dev/md1        2.8T  149G  2.6T   6% /mnt/disk1
/dev/md2        1.9T   99G  1.8T   6% /mnt/disk2
/dev/md3        1.9T  593G  1.3T  32% /mnt/disk3
/dev/md4        1.9T  225G  1.6T  13% /mnt/disk4
/dev/md5        1.9T  243G  1.6T  13% /mnt/disk5
/dev/md6        1.9T  1.5T  345G  82% /mnt/disk6
/dev/md7        1.9T  1.3T  593G  69% /mnt/disk7
/dev/md8        1.9T  1.6T  314G  84% /mnt/disk8
/dev/md9        1.9T  1.6T  315G  84% /mnt/disk9
/dev/md10       1.9T  1.5T  341G  82% /mnt/disk10
/dev/md11       1.9T  1.5T  328G  83% /mnt/disk11
/dev/md12       1.9T  1.8T   64G  97% /mnt/disk12
/dev/sdo1       224G   70G  154G  32% /mnt/cache
shfs             23T   12T   11T  52% /mnt/user0
shfs             23T   12T   12T  52% /mnt/user
/dev/loop0       10G  2.1G  6.3G  25% /var/lib/docker
root@Tower:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
-               7.9G  1.9G  6.0G  24% /

 

The latest diagnostics dump is located here:

https://drive.google.com/file/d/0B_6rmB4u7fHESGY4T2Y4N1RYMVk/view?usp=sharing

 

I notice that my syslog file is growing hugely, and I'm uncertain as to why.  That would certainly contribute to a lack of space!
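
One way to see which source is flooding the log is to count lines per sender. A minimal sketch, assuming the usual syslog layout seen in this thread (month, day, time, host, tag, message) and a log at /var/log/syslog; the function name top_loggers is illustrative only.

```shell
# Hypothetical helper: read syslog lines on stdin and print the most frequent
# "tag + first message word" pairs (e.g. "kernel: i801_smbus"), busiest first.
# $1 = number of entries to show (default 10).
top_loggers() {
    awk '{ count[$5 " " $6]++ }
         END { for (key in count) print count[key], key }' |
        sort -rn | head -n "${1:-10}"
}

# Only run against the live log if it exists and is readable.
if [ -r /var/log/syslog ]; then
    top_loggers 10 < /var/log/syslog
fi
```

Grouping on the first word after the tag matters for kernel messages, since that word is usually the driver name.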

 

But certainly the /var is something that I need to correct.

 

How do I get Docker to use /var on the cache drive?

 

Thanks!

 

Link to comment

Hm, looks like my log filled up due to Supermicro's IPMI fighting with the kernel's sensors module:

Jan  7 05:23:09 Tower kernel: w83795 0-002f: Failed to read from register 0x011, err -6
Jan  7 10:00:41 Tower kernel: i801_smbus 0000:00:1f.3: Timeout waiting for interrupt!
Jan  7 10:00:41 Tower kernel: i801_smbus 0000:00:1f.3: Transaction timeout
Jan  7 10:00:41 Tower kernel: i801_smbus 0000:00:1f.3: Failed terminating the transaction
Jan  7 10:00:41 Tower kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!
Jan  7 10:00:41 Tower kernel: w83795 0-002f: Failed to read from register 0x010, err -16
Jan  7 10:00:41 Tower kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

 

I googled around: the W83795 is the chip my Supermicro board uses, and it feeds into libsensors.

 

According to a German site, the log fills up with the "Failed to read from register xxxxx" error:

... because Supermicro's built-in IPMI tries to access the I2C bus (SMBus) at the same time. Disable either the sensor module or IPMI.

 

Haven't seen this before in the logs, so I'll remove these lines from my go script since I'm not using the sensors plugin anymore:

 

# Added for Sensors:
modprobe coretemp
modprobe jc42
modprobe w83795
/usr/bin/sensors -s
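
Since the go script only runs at boot, removing the lines does not unload modules that are already loaded. A small sketch for checking the running system, assuming a Linux /proc/modules file; module_loaded is a made-up helper name, and its optional second argument exists only so the check can be pointed at a different file.

```shell
# Hypothetical helper: succeed if the named kernel module is listed.
# $1 = module name, $2 = module list file (default /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}" 2>/dev/null
}

# Check the three modules the go script used to load.
for mod in coretemp jc42 w83795; do
    if module_loaded "$mod"; then
        echo "$mod is loaded (unload with: modprobe -r $mod)"
    else
        echo "$mod is not loaded"
    fi
done
```

A reboot after editing the go script should have the same effect as unloading them by hand.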

 

We'll see what happens after this! :)
