USB Drive Fails After 24h

Heffe · November 27, 2023

Hey all,

I recently moved my server into a Dell Precision 5820. The Unraid version is 6.11.5. The server works very well except that it fails after about 24h of being on. Sometimes the system is completely unresponsive; other times I can access the GUI but there are no shares listed.

If I remove the USB and put it in a new port, the server boots fine and then fails about 24h later. If I simply try a reboot, the system will hang and not boot.

I've gone through the BIOS and disabled all C-states, but I have no idea what is causing this issue or where to start. Any help would be appreciated.

Thanks!

jarvis-diagnostics-20231126-2140.zip

JorgeB · November 27, 2023

Enable the syslog server and post that after a crash.

Heffe · November 27, 2023

31 minutes ago, JorgeB said:

Enable the syslog server and post that after a crash.

Will do. Thank you.

Is there a preferred location for the logs? Is saving them to the array the best option?

JorgeB · November 27, 2023

Since the flash drive could be the problem, save it to an array share.

Heffe · November 29, 2023

Well, it failed again, but I misconfigured my syslog server, so I'll have to wait another 24 to 48h until it crashes again.

Heffe · November 30, 2023

On 11/27/2023 at 7:30 AM, JorgeB said:

Enable the syslog server and post that after a crash.

Hey @JorgeB, here is the syslog. The server stopped working on Nov 30th @ 03:00 after completing an appdata backup operation. From reading the log, maybe a bad memory stick?

Thanks!

syslog-192.168.0.10.log

JorgeB · November 30, 2023

Could be, but I would expect that PC to use ECC RAM; still, there could be some related issue.

Heffe · November 30, 2023

14 minutes ago, JorgeB said:

Could be, but I would expect that PC to use ECC RAM; still, there could be some related issue.

Yes, the memory is ECC memory. I just ran the MEMTEST86 utility with no errors. Do you have any other suggestions?

JorgeB · November 30, 2023

There are various crashes, but it's not really clear to me what's causing them. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

Heffe · November 30, 2023

Good idea. I'll give that a try. Thank you for your suggestions.

Heffe · December 3, 2023

So this weekend I updated to the latest unRAID version, updated all plugins and stopped all services on the server. I then let the server run until faults started appearing. Here's the log file.

It shows that CPU 0 and CPU 4 are causing some errors? Does that mean I have a bad CPU? Can it be fixed?

jarvis-syslog-20231203-1208.zip

Heffe · December 3, 2023

Here's the non-zipped log file.

syslog-192.168.0.10 (2).log

JorgeB · December 3, 2023

Don't know if the CPU is the problem, but it does show multiple apps segfaulting, so most likely some hardware issue. I would try with just one stick of RAM; if it's the same, try a different one. That will basically rule out RAM issues.

Heffe · December 3, 2023

I'll give that a try. Thank you.

Heffe · December 4, 2023

I took one RAM stick out and ran the server all night. It still errored. This morning I swapped the RAM stick with another and it errored again. I suppose it could be the RAM but I doubt both sticks would be bad.

I do have an NVMe PCIe card that my cache drive is on. I'm going to try removing that to see if it makes a difference.

I've attached the latest log.

syslog-192.168.0.10.log

JorgeB · December 4, 2023

Unlikely the RAM is the problem, next suspects for me would be the board or CPU.

Heffe · December 6, 2023

The errors increase in frequency if I have the docker containers running. I have no way of troubleshooting whether the CPU or the motherboard is the issue. The server was a cold spare from a working environment, so I'm doubtful that it's a hardware issue.

I'll continue to update the software and hope that the updates fix it.

Thanks for your help @JorgeB.

Heffe · December 7, 2023

Just as an update, I'll keep this thread going in case someone else has similar issues.

In order to troubleshoot the mobo and/or CPU issues, I ran the Dell diagnostic tool built into the BIOS on the motherboard. The tool ran all night (~12 hours) and it passed all tests (including RAM tests).

After the tests were done, I booted the server in safe mode (no GUI) and let it run. It's been on for about 5 hours and it has logged one "segfault".

Here's the error:

Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] likely on CPU 5
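For anyone working through a long syslog like this, a small script can pull out every segfault line and count them per process and per CPU, which makes it easier to see whether the faults cluster on one program or one core. This is only a minimal sketch: the regular expression is an assumption based on the shape of the single line quoted above, and the names `parse_segfault` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern, modelled on the kernel segfault line quoted above.
# The "in <image>[...]" and "likely on CPU <n>" parts are treated as optional.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<image>[^\[\s]+)\[[^\]]*\])?"
    r"(?: likely on CPU (?P<cpu>\d+))?"
)


def parse_segfault(line):
    """Return the fields of a segfault line as a dict, or None if no match."""
    match = SEGFAULT_RE.search(line)
    return match.groupdict() if match else None


def tally(lines):
    """Count segfaults per process name and per CPU number."""
    by_proc = Counter()
    by_cpu = Counter()
    for line in lines:
        record = parse_segfault(line)
        if record is None:
            continue  # not a segfault line
        by_proc[record["proc"]] += 1
        if record["cpu"] is not None:
            by_cpu[int(record["cpu"])] += 1
    return by_proc, by_cpu


# The line from this post, used as sample input.
sample = (
    "Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 "
    "ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] "
    "likely on CPU 5"
)

if __name__ == "__main__":
    procs, cpus = tally([sample])
    print("by process:", dict(procs))
    print("by CPU:", dict(cpus))
```

To run it against a real log, pass the lines of the saved syslog file to `tally` instead of the one-line sample. Faults spread across many unrelated programs and cores would point more towards hardware; faults confined to one program would point more towards that program or its data.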

I'm really frustrated at this issue and I'm lost as to a solution. I'm 99% sure there's no hardware issue but yet here we are.

jarvis-syslog-20231207-1446.zip

Edited December 7, 2023 by Heffe
Formatting error.

JorgeB · December 7, 2023

If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd basically need to remove them one at a time to test.

Heffe · December 7, 2023

2 minutes ago, JorgeB said:

If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd basically need to remove them one at a time to test.

Would running a SMART disk scan provide any useful info?

JorgeB · December 7, 2023

Doubt it.

Heffe · December 10, 2023

Would it be possible that this issue is being caused by a BIOS update? I've looked across the forum and it seems others are using the same server but on another BIOS version.

JorgeB · December 11, 2023

I would guess unlikely, but not impossible.

Heffe · December 11, 2023

Well, here are the changes that I've tried since:

1. I changed the boot mode to UEFI instead of Legacy.
2. I downgraded the BIOS to v2.0.2 since there was a post on here suggesting that this version was working for them.
3. I found some corrupted docker files and repaired them.

The server ran for about 10 hours and then crashed again. I'm going to run Memtest86 for an extended period of time and see if that finds anything. Lol, I'm so frustrated.

syslog-192.168.0.10_111223.log

Heffe · December 18, 2023

So I gave it a rest and returned to the problem after a few days.

I had a couple of "loop2" BTRFS errors that, from what I've read on the forum, are related to a corrupt docker image. I've since deleted my docker image and recreated it.

Secondly, after recreating my docker image, I pinned each docker to a specific CPU so I could find if there was an offending docker container. I think I found one. If I restart my nextcloud container while my mariadb container is running, I get:

kernel: php[8525]: segfault at ffffffffffffffff ip 000055871823b0b2 sp 00007ffd6737dd40 error 7 in php82[558718200000+2b4000] likely on CPU 5 (core 1, socket 0)

I get this error reliably when starting my nextcloud container.
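As a side note for readers, the `error 7` field in that line is the x86 page-fault error code, and it can be broken down into its individual bits. The sketch below assumes the kernel prints the code in hexadecimal and uses the standard x86 bit meanings for the low five bits; the function name `decode_fault` is made up for illustration. It only describes what kind of memory access faulted, not why it happened.

```python
# Low bits of the x86 page-fault error code.
# Each entry: (bit mask, meaning if set, meaning if clear or None).
FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_fault(code_hex):
    """Decode the 'error N' field of a segfault line (assumed to be hex)."""
    code = int(code_hex, 16)
    parts = []
    for mask, if_set, if_clear in FLAGS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


if __name__ == "__main__":
    # "error 7" from the nextcloud/php line above.
    print(decode_fault("7"))
```

Under those assumptions, `7` decodes to a user-mode write that hit a protection violation. That is consistent with either a software bug or corrupted data in memory, so on its own it does not settle the hardware question.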

Is there anything I can do to fix this? Is my nextcloud appdata corrupt? Thanks!

syslog-192.168.0.10_171223.log

Edited December 18, 2023 by Heffe
Uploaded log file.
