USB Drive Fails After 24h

Hey all,

 

I recently moved my server into a Dell Precision 5820. The Unraid version is 6.11.5. The server works very well except that it fails after about 24h of being on. Sometimes this looks like an unresponsive system; other times I can access the GUI but there are no shares listed.

If I remove the USB and put it in a new port, the server boots fine and then fails about 24h later. If I simply try a reboot, the system will hang and not boot.

I've gone through the BIOS and disabled all C-states, but I have no idea what is causing this issue or where to start. Any help would be appreciated.

Thanks!

jarvis-diagnostics-20231126-2140.zip

Solved by Heffe

  • Author
31 minutes ago, JorgeB said:

Enable the syslog server and post that after a crash.

 

Will do. Thank you.

Is there a preferred location for the logs? Is saving them to the array the best option?

  • Community Expert

Since the flash drive could be the problem, save it to an array share.

  • Author

Well, it failed again, but I misconfigured my syslog server, so I'll have to wait another 24 to 48h until it crashes again. :S

  • Author
On 11/27/2023 at 7:30 AM, JorgeB said:

Enable the syslog server and post that after a crash.

 

Hey @JorgeB, here is the syslog. The server stopped working on Nov 30th @ 03:00 after completing an appdata backup operation. From reading the log, maybe a bad memory stick?

 

Thanks!

syslog-192.168.0.10.log

  • Community Expert

Could be, but I would expect that PC to use ECC RAM; still, there could be some related issue.

  • Author
14 minutes ago, JorgeB said:

Could be, but I would expect that PC to use ECC RAM; still, there could be some related issue.

Yes, the memory is ECC memory. I just ran the MEMTEST86 utility with no errors. Do you have any other suggestions?

  • Community Expert

There are various crashes, but it's not really clear to me what's causing them. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

  • Author

Good idea. I'll give that a try. Thank you for your suggestions. 

  • Author

So this weekend I updated to the latest unRAID version, updated all plugins and stopped all services on the server. I then let the server run until faults started appearing. Here's the log file.

 

It shows that CPU 0 and CPU 4 are causing some errors? Does that mean I have a bad CPU? Can it be fixed?

jarvis-syslog-20231203-1208.zip

  • Community Expert

Don't know if the CPU is the problem, but it does show multiple apps segfaulting, so most likely some hardware issue. I would try with just one stick of RAM; if it's the same, try a different one. That will basically rule out RAM issues.

  • Author

I'll give that a try. Thank you.

  • Author

I took one RAM stick out and ran the server all night. It still errored. This morning I swapped the RAM stick with another and it errored again. I suppose it could be the RAM but I doubt both sticks would be bad.

I do have an NVMe PCIe card that my cache drive is on. I'm going to try removing that and see if it makes a difference.

 

I've attached the latest log.

syslog-192.168.0.10.log

  • Community Expert

Unlikely the RAM is the problem, next suspects for me would be the board or CPU.

  • Author

The errors increase in frequency if I have the docker containers running. I have no way of troubleshooting whether the CPU or the motherboard is the issue. The server was a cold spare from a working environment, so I'm doubtful that it's a hardware issue.

 

I'll continue to update the software and hope that the updates fix it.

 

Thanks for your help @JorgeB.

  • Author

Just as an update, I'll keep this thread going in case someone else has similar issues. 

 

In order to troubleshoot the mobo and/or CPU issues, I ran the Dell diagnostic tool built into the BIOS on the motherboard. The tool ran all night (~12 hours) and it passed all tests (including RAM tests).

 

After the tests were done, I booted the server in safe mode (no GUI) and let it run. It's been on for about 5 hours and it has logged one "segfault".

 

Here's the error:

 

Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] likely on CPU 5
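For anyone reading along who wants to pull these lines apart, here is a minimal Python sketch that splits a kernel segfault message into its fields. It assumes the x86 Linux message layout seen in this thread and assumes the number after `error` is the hexadecimal x86 page-fault error code (bit 0 = protection violation, bit 1 = write, bit 2 = user mode, bit 3 = reserved bit, bit 4 = instruction fetch). The names `parse_segfault`, `SEGFAULT_RE` and `PF_BITS` are made up for this example and are not part of any Unraid tooling.

```python
import re

# Layout of the kernel "segfault" line as it appears in the logs in this thread.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[[^\]]*\])?"
    r"(?: likely on CPU (?P<cpu>\d+))?"
)

# Assumed meaning of the low bits of the x86 page-fault error code.
PF_BITS = [
    (0x01, "protection violation"),
    (0x02, "write access"),
    (0x04, "user-mode access"),
    (0x08, "reserved bit set"),
    (0x10, "instruction fetch"),
]


def parse_segfault(line):
    """Return a dict of fields from a segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: the kernel prints this in hex
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "object": m.group("obj"),
        "cpu": int(m.group("cpu")) if m.group("cpu") else None,
        "error": err,
        "flags": [name for bit, name in PF_BITS if err & bit],
    }


if __name__ == "__main__":
    sample = (
        "Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 "
        "ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] "
        "likely on CPU 5"
    )
    print(parse_segfault(sample))
```

Under those assumptions, the line above decodes as a user-mode instruction fetch from address 0, i.e. the process tried to execute code at a null pointer. That describes what the process did when it died; on its own it does not say whether software or hardware is to blame.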

 

I'm really frustrated at this issue and I'm lost as to a solution. I'm 99% sure there's no hardware issue but yet here we are.  

jarvis-syslog-20231207-1446.zip

Edited by Heffe
Formatting error.

  • Community Expert

If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd need to basically remove them one at a time to test.

  • Author
2 minutes ago, JorgeB said:

If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd need to basically remove them one at a time to test.

Would running a SMART disk scan provide any useful info?

  • Community Expert

Doubt it.

  • Author

Would it be possible that this issue is being caused by a BIOS update? I've looked across the forum and it seems others are using the same server but on another BIOS version. 

  • Community Expert

I would guess unlikely, but not impossible.

  • Author

Well, here are the changes that I've tried since:

 

1. I changed the boot mode to UEFI instead of Legacy.

2. I downgraded the BIOS to v2.0.2 since there was a post on here suggesting that this version was working for them.
3. I found some corrupted docker files and repaired them.

 

The server ran for about 10 hours and then crashed again. I'm going to run Memtest86 for an extended period of time and see if that finds anything. Lol, I'm so frustrated.
 

syslog-192.168.0.10_111223.log

  • Author

So I gave it a rest and returned to the problem after a few days.

 

I had a couple of "loop2" BTRFS errors that, from what I've read on the forum, are related to a corrupt docker image. I've since deleted my docker image and recreated it.

 

Secondly, after recreating my docker image, I pinned each docker to a specific CPU so I could find if there was an offending docker container. I think I found one. If I restart my nextcloud container while my mariadb container is running, I get:
 

kernel: php[8525]: segfault at ffffffffffffffff ip 000055871823b0b2 sp 00007ffd6737dd40 error 7 in php82[558718200000+2b4000] likely on CPU 5 (core 1, socket 0)

I get this error reliably when starting my nextcloud container.

Is there anything I can do to fix this? Is my nextcloud appdata corrupt? Thanks!
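Since the containers are now pinned to specific CPUs, one way to look for a pattern is to count segfaults per process and per CPU across a saved syslog. The following is only a rough Python sketch: `tally_segfaults`, `COMM_RE` and `CPU_RE` are hypothetical names, and it assumes the log lines follow the same layout as the segfault messages quoted in this thread.

```python
import re
from collections import Counter

# Process name in front of "[pid]: segfault at", as in the lines quoted above.
COMM_RE = re.compile(r"(\S+)\[\d+\]: segfault at ")
# CPU hint that the kernel appends to the same line, when present.
CPU_RE = re.compile(r"likely on CPU (\d+)")


def tally_segfaults(lines):
    """Count segfault lines by process name and by reported CPU."""
    by_comm, by_cpu = Counter(), Counter()
    for line in lines:
        m = COMM_RE.search(line)
        if m is None:
            continue  # not a segfault line
        by_comm[m.group(1)] += 1
        c = CPU_RE.search(line)
        if c is not None:
            by_cpu[int(c.group(1))] += 1
    return by_comm, by_cpu


if __name__ == "__main__":
    # The two segfault lines quoted in this thread, used as sample input.
    sample = [
        "Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 "
        "ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] "
        "likely on CPU 5",
        "kernel: php[8525]: segfault at ffffffffffffffff ip 000055871823b0b2 "
        "sp 00007ffd6737dd40 error 7 in php82[558718200000+2b4000] "
        "likely on CPU 5 (core 1, socket 0)",
    ]
    by_comm, by_cpu = tally_segfaults(sample)
    print(by_comm.most_common())
    print(by_cpu.most_common())
```

To run it over a real log, pass an open file instead of the sample list, e.g. `tally_segfaults(open(path))` with `path` pointing at the saved syslog. Faults that cluster on one process or one container's CPU would point toward that application or its data, while faults spread across many unrelated processes would fit the hardware suspicion raised earlier in the thread; either reading is a hint, not a diagnosis.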

syslog-192.168.0.10_171223.log

Edited by Heffe
Uploaded log file.
