
Periodic Freeze until reboot, unregistered, key detected, unregistered...



Posted (edited)

EDIT: as of 2023-02-28 I have removed the old diagnostics and added a new one after removing redundant SAS cables

 

I have been having an issue for some time where every few weeks my Unraid will freeze and nothing will work.  I can sometimes navigate parts of the WebUI but it will be mostly unresponsive with broken or missing elements.

 

When this happens I will sometimes see errors written in the WebUI ( over random elements on random pages ) stating things like 'fork failed', 'resource temporarily unavailable', 'Unregistered - flash device error', or something to do with Docker containers ( I don't quite remember, as I have no screenshot and can't reproduce it at will ). The only way I have been able to recover from this error is to power off the server manually.

 

I have managed to get a log from when this happens by enabling syslog writing to the USB.  It is attached ( I truncated logs older than 1 day before the issue occurred )
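If you want to trim a mirrored syslog the same way ( keeping only the day before the freeze ), a minimal Python sketch is below. It assumes the classic `Mmm dd HH:MM:SS` timestamp prefix seen in the log excerpts later in this thread. That prefix carries no year, so the year and the time of the freeze are supplied by hand. `parse_stamp` and `window` are hypothetical helper names, and lines without a timestamp ( e.g. wrapped continuation lines ) are simply dropped.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def parse_stamp(line: str, year: int) -> Optional[datetime]:
    """Parse the 15-character 'Mmm dd HH:MM:SS' prefix of a classic syslog line.

    Classic syslog timestamps have no year, so the caller must supply one.
    Returns None when the line does not start with such a timestamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def window(lines: Iterable[str], year: int, end: datetime, hours: int = 24) -> List[str]:
    """Keep only lines stamped between `hours` before `end` and `end` itself."""
    start = end - timedelta(hours=hours)
    kept = []
    for line in lines:
        ts = parse_stamp(line, year)
        # Lines without a parseable timestamp are dropped in this sketch.
        if ts is not None and start <= ts <= end:
            kept.append(line)
    return kept
```

For example, `window(open("syslog.txt").read().splitlines(), 2023, datetime(2023, 4, 3, 1, 30))` would keep the 24 hours leading up to 01:30 on April 3 ( the file name and times here are only illustrative ).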

 

Please help me stop the issue from recurring. I am also open to trying other diagnostic steps or taking any reasonable advice on improving my setup.

 

Unraid 6.11.5 ( this issue happened on prior versions )

 

Hardware:

  • Model: Dell R730 ( I think )

    M/B:0JP31P Version A10

    BIOS:Version 2.7.0. Dated: 05/23/2018

    CPU:Intel® Xeon® CPU E5-2670 0 @ 2.60GHz

    HVM:Enabled

    IOMMU:Enabled

    Cache:256 KiB, 2 MB, 20 MB, 256 KiB, 2 MB, 20 MB

    Memory:64 GiB DDR3 Multi-bit ECC (max. installable capacity 1536 GiB)

  • LSI® SAS 9206-16e

  • (JBOD) EMC KTN-STL3

 

Plugins:

  • Community Applications
  • Compose Manager
  • Dynamix System Temperature
  • Nerd Tools

 

Things I have tried to fix this issue:

  • Stop some Docker containers I don't actively use ( I have not tried all of them though )
  • Stop VMs I don't actively use
  • Update Unraid and Plugins
  • Transfer License to a brand new USB device
  • Tried running the new USB device on a different USB port
  • Searched forum on a few occasions for similar issues but haven't really seen anything that looks like my issue

 

Additional notes:

  • My server does a memory test on every boot, I am not sure how reliable this is in terms of confirming this is not an issue with faulty memory.
  • This only started happening recently ( within the past 4-6 months ) and had run without issues for probably close to a year before that.
  • This error occurs every 2 weeks, give or take a few days. I usually discover it has already been in this state when Jellyfin doesn't load, and it doesn't seem to be triggered by any direct action I take on the server.
  • Since I initially put the server together, some disks have shown "udma crc error count" warnings, and the count has always remained the same number ( 196 ). These drives came used with my JBOD. I have not seen any reason to believe these errors matter or are related, but I think it's worth mentioning. I am under the impression they show up because of some feature, or lack of a feature, on the drive that I read about somewhere when I first installed the JBOD.
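A common rule of thumb is that a CRC count which stays constant is historical, while one that keeps climbing points to an ongoing cable or connection problem. A minimal sketch for checking that: it pulls the raw value of SMART attribute 199 ( `UDMA_CRC_Error_Count` ) out of `smartctl -A` text so that two snapshots of the same disk taken some time apart can be compared. It assumes the usual ATA attribute table layout, where RAW_VALUE is the tenth column; SAS drives don't report this attribute, and `udma_crc_count` / `crc_increased` are hypothetical names.

```python
from typing import Optional


def udma_crc_count(smartctl_a: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199 from `smartctl -A` output.

    Assumes the usual ATA table columns:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_a.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199" and fields[9].isdigit():
            return int(fields[9])
    return None  # attribute not reported (e.g. SAS drive) or unexpected layout


def crc_increased(before: str, after: str) -> bool:
    """True if the CRC count grew between two smartctl snapshots of the same disk."""
    old, new = udma_crc_count(before), udma_crc_count(after)
    return old is not None and new is not None and new > old
```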

 

Attached is my diagnostics after the last crash had happened, and the syslog from the day before up to the error and logs from the reboot.

syslog.txt

fatman-diagnostics-20230228-1607.zip

Edited by BlackMagicCoffee
Removed redundant SAS cables and uploaded new diagnostics zip
Posted
On 3/1/2023 at 2:30 AM, JorgeB said:

Still seeing the same errors, there's only one cable from one HBA going to the enclosure?

Sorry for the delayed response, I got married!

I have 1 cable per controller ( the JBOD has 2 controllers ).  I can try again with only one controller plugged in but I am under the assumption it would not allow all my drives to work.  I will remove one of the remaining cables tonight such that there is only 1 cable from the JBOD to the HBA and post an update.

Posted
28 minutes ago, BlackMagicCoffee said:

I have 1 cable per controller ( the JBOD has 2 controllers ). 

 

On 2/28/2023 at 8:36 AM, JorgeB said:

Unraid doesn't support SAS multipath, connect one cable only from the HBA to the enclosure

 

Posted
23 hours ago, JorgeB said:

 

 

Attached is an image of what happens when I remove one of the remaining two cables.  There is a single cable to each of 2 controllers on my JBOD, disconnecting one makes half the drives inaccessible. Should I plug one controller into the other controller?

 

image.thumb.png.5a1dc614d5c7223104468293f3eae9b5.png

Posted

Let's start over: you have two LSI controllers and one enclosure, correct? There should be a single cable from one HBA to one controller in the enclosure. If you do that and reboot, do you only see half of the disks?

Posted (edited)

Attached is an image of my setup in its current configuration ( half the drives are missing ). I have 1x LSI HBA; when I say "controller" I mean the JBOD's 2x modules that provide a SAS interface.

 

The image shows how it is configured right now. I previously had the following configurations:

  • Configuration 1 ( original ): a cable from the primary and expansion ports of each JBOD controller/module to the HBA, totalling 4 SAS cables from the JBOD to the HBA.
  • Configuration 2 ( after being told to remove cables because of SAS multipath ): a cable from the primary port of each JBOD controller/module to the HBA, totalling 2 SAS cables from the JBOD to the HBA.
  • Configuration 3 ( current ): a single SAS cable from the HBA to the JBOD; only half the drives are detected, as pictured.

 

Sorry if I have been unclear or I am making this difficult, I do really appreciate your help with this.

jbod.jpg

Edited by BlackMagicCoffee
Typo, additional details
Posted

OK, and are you sure the 2nd diags posted were the correct ones? Something is missing here: the latest diags still show duplicate devices, so it doesn't make sense that devices are now missing. Can you post new diags with configuration 2?

Posted

Here are the new diagnostics, using configuration 2. Both 'primary' SAS ports from the JBOD are connected to my HBA as per the picture. It looks like those 'cannot get id' issues are still present. I restarted the server at 2023-03-07 14:20; the timestamp in the logs confirms that this is the correct diagnostics zip.

20230307_141512.jpg

fatman-diagnostics-20230307-1425.zip

Posted
13 hours ago, BlackMagicCoffee said:

it looks like those 'cannot get id' issues are still present. 

Yep, that means the disks are being detected twice. I don't know that enclosure, but multiple controllers are usually for redundancy, so one controller should give access to all the disks. When you tried the single cable, did you try it connected to each controller?
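One way to confirm that disks are being detected twice is to look for the same serial number on more than one `/dev/sdX` node, since each SAS path to a disk shows up as its own device. A minimal Python sketch, assuming the JSON output of `lsblk --json --nodeps -o NAME,SERIAL`; `find_duplicate_disks` is a hypothetical helper, not an Unraid tool, and devices that report no serial are skipped.

```python
import json
from collections import defaultdict
from typing import Dict, List


def find_duplicate_disks(lsblk_json: str) -> Dict[str, List[str]]:
    """Return serial -> device names for every serial seen on more than one node.

    Expects the JSON printed by `lsblk --json --nodeps -o NAME,SERIAL`.
    """
    data = json.loads(lsblk_json)
    by_serial: Dict[str, List[str]] = defaultdict(list)
    for dev in data.get("blockdevices", []):
        serial = dev.get("serial")
        if serial:  # skip devices with no serial (e.g. loop devices)
            by_serial[serial].append(dev["name"])
    return {s: names for s, names in by_serial.items() if len(names) > 1}
```

On the server the input could come from `subprocess.run(["lsblk", "--json", "--nodeps", "-o", "NAME,SERIAL"], capture_output=True, text=True, check=True).stdout`. With a single path to each disk, the result should be an empty dict.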

Posted

I had not tried a single cable with the other controller.  I will attempt to do so.

 

I do have some manuals, and they do show 2 cables, but it's not very clear what information they're trying to convey, and they only show the controllers linked to each other.

  • 2 weeks later...
Posted (edited)

Well, it crashed again, and despite enabling the syslog server to write into one of my shares, it apparently didn't write anything. I have now enabled copy syslog to flash, because I know that worked before. Attached is my diagnostics zip after the restart ( which I assume is probably useless because it has no crash info ).

I should also add that when it crashed I was able to see the error messages I mentioned before showing up over random spots in the UI. Here are the ones I saw:
 

Warning: exec(): Unable to fork [docker network ls --format='{{.Name}}={{.Driver}}' 2>/dev/null] in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 1014 

Warning: exec(): Unable to fork [pgrep 'rc.docker'] in /usr/local/emhttp/plugins/dynamix/include/Helpers.php on line 225

Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/compose.manager/php/compose_manager_main.php on line 33

fatman-diagnostics-20230325-1524.zip

Edited by BlackMagicCoffee
More details
  • 2 weeks later...
Posted
On 3/26/2023 at 6:35 AM, JorgeB said:

Post the mirrored log after the next crash.

The issue happened again, and I was able to get the diagnostics after a few tries. ( The generate diagnostics modal would sometimes fail and say: /boot/logs Warning: shell_exec(): Unable to execute 'logger error: '/webGui/include/Download.php': missing csrf_token' in /usr/local/emhttp/plugins/dynamix/include/local_prepend.php on line 18 )

 

Attached is the diagnostics zip. I also have a copy of the log from my flash device that I made just before generating the diagnostics.

fatman-diagnostics-20230403-1138.zip
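If the mirrored syslog on the flash device contains more than one boot, the part that ends in the freeze has to be separated from the boot that followed it. A minimal sketch, assuming each boot can be recognised by the kernel banner line containing `kernel: Linux version`; `split_boots` and `BOOT_MARKER` are hypothetical names.

```python
from typing import List

# Kernel banner printed at the start of every boot (assumed boot marker).
BOOT_MARKER = "kernel: Linux version"


def split_boots(lines: List[str]) -> List[List[str]]:
    """Split a syslog that spans several boots into one list of lines per boot.

    Anything logged before the first boot banner becomes its own leading chunk.
    """
    sessions: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if BOOT_MARKER in line and current:
            # A new boot starts: close the session collected so far.
            sessions.append(current)
            current = []
        current.append(line)
    if current:
        sessions.append(current)
    return sessions
```

If the copy was made after rebooting, the session that ended in the freeze would be the second-to-last entry, `split_boots(lines)[-2]`; if it was made before rebooting, as here, it would be the last one.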

Posted

You are having flash drive issues:

 

Apr  3 01:01:13 fatman  emhttpd: Unregistered - flash device error (ENOFLASH7)
Apr  3 01:01:14 fatman  emhttpd: Pro key detected, GUID: 0..3 FILE: /boot/config/Pro.key
Apr  3 01:09:08 fatman  emhttpd: Unregistered - flash device error (ENOFLASH7)
Apr  3 01:09:09 fatman  emhttpd: Pro key detected, GUID: 0..3 FILE: /boot/config/Pro.key
Apr  3 01:09:17 fatman  emhttpd: Unregistered - flash device error (ENOFLASH7)
Apr  3 01:09:18 fatman  emhttpd: Pro key detected, GUID: 0..3 FILE: /boot/config/Pro.key
Apr  3 01:11:19 fatman  emhttpd: Unregistered - flash device error (ENOFLASH7)
Apr  3 01:11:20 fatman  emhttpd: Pro key detected, GUID: 0..3 FILE: /boot/config/Pro.key

 

Make sure it's using a USB 2.0 port and/or try a different port; if the issues continue, replace it. Unrelated, but you are also having multiple disk issues.
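To see how often these errors occur, and whether they stop after moving the flash drive to another port, the matching lines can be pulled out of a syslog. A minimal sketch; the regular expression is written against the `emhttpd` lines quoted above, so other Unraid versions may format them differently, and `flash_errors` is a hypothetical helper.

```python
import re
from typing import Iterable, List, Tuple

# Written against lines such as:
# "Apr  3 01:01:13 fatman  emhttpd: Unregistered - flash device error (ENOFLASH7)"
FLASH_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+\s+emhttpd: "
    r"Unregistered - flash device error \((?P<code>ENOFLASH\d+)\)"
)


def flash_errors(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, error code) for every flash device error line."""
    hits = []
    for line in lines:
        m = FLASH_ERROR.match(line)
        if m:
            hits.append((m.group("stamp"), m.group("code")))
    return hits
```

The "Pro key detected" lines that follow each error are ignored, so the length of the result is the number of times the flash device dropped out.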

Posted

I realized I could check from the terminal; here is the result. I'm pretty sure EHCI is USB 2.0?

image.thumb.png.bb1725aed2d4826c4f4fe66a35f68e02.png
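For reference, `lsusb -t` shows which host controller driver each bus's root hub uses, and the flash drive appears under it as a `usb-storage` device. A minimal sketch that maps each bus carrying a `usb-storage` device to its controller type. The sample lines used to check it are illustrative rather than taken from this server; any other USB storage ( for example iDRAC virtual media ) would also be listed; and an xHCI controller can still be driving a USB 2.0 port. `storage_buses` and `controller_kind` are hypothetical names.

```python
import re
from typing import Dict, Optional

# Root hub lines look like:
# "/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/2p, 480M"
ROOT_HUB = re.compile(r"Bus (\d+)\.Port \d+: Dev \d+, Class=root_hub, Driver=([\w-]+)")


def controller_kind(driver: str) -> str:
    """Classify a root hub driver name by USB host controller interface."""
    if "xhci" in driver:
        return "xHCI (USB 3.x controller, port may still be USB 2.0)"
    if "ehci" in driver:
        return "EHCI (USB 2.0)"
    if "ohci" in driver or "uhci" in driver:
        return "OHCI/UHCI (USB 1.1)"
    return "unknown"


def storage_buses(lsusb_tree: str) -> Dict[int, str]:
    """Map each bus with a usb-storage device to the kind of controller driving it."""
    result: Dict[int, str] = {}
    bus: Optional[int] = None
    driver = ""
    for line in lsusb_tree.splitlines():
        m = ROOT_HUB.search(line)
        if m:
            # A new root hub starts a new bus section in the tree.
            bus, driver = int(m.group(1)), m.group(2)
        elif bus is not None and "Driver=usb-storage" in line:
            result[bus] = controller_kind(driver)
    return result
```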
