USB Drive Fails After 24h

November 27, 2023

Hey all,

I recently moved my server into a Dell Precision 5820. The Unraid version is 6.11.5. The server works very well except that it fails after about 24h of being on. Sometimes this looks like an unresponsive system; other times I can access the GUI but there are no shares listed.

If I remove the USB and put it in a new port, the server boots fine and then fails about 24h later. If I simply try a reboot, the system will hang and not boot.

I've gone through the BIOS and disabled all C-states, but I have no idea what is causing this issue or where to start. Any help would be appreciated.

Thanks!

jarvis-diagnostics-20231126-2140.zip

November 27, 2023

Community Expert

Enable the syslog server and post that after a crash.

November 27, 2023

Author

31 minutes ago, JorgeB said:

Enable the syslog server and post that after a crash.

Will do. Thank you.

Is there a preferred location for the logs? Is saving them to the array the best option?

November 27, 2023

Community Expert

Since the flash drive could be a problem, save it to an array share.

November 29, 2023

Author

Well, it failed again, but I misconfigured my syslog server so I'll have to wait another 24 to 48h until it crashes again.

November 30, 2023

Author

On 11/27/2023 at 7:30 AM, JorgeB said:

Enable the syslog server and post that after a crash.

Hey @JorgeB, here is the syslog. The server stopped working on Nov 30th @ 03:00 after completing an appdata backup operation. From reading the log, maybe a bad memory stick?

Thanks!

syslog-192.168.0.10.log

November 30, 2023

Community Expert

Could be, but I would expect that PC to use ECC RAM. Still, there could be some related issue.

November 30, 2023

Author

14 minutes ago, JorgeB said:

Could be, but I would expect that PC to use ECC RAM. Still, there could be some related issue.

Yes, the memory is ECC memory. I just ran the MEMTEST86 utility with no errors. Do you have any other suggestions?

November 30, 2023

Community Expert

There are various crashes, but it's not really clear to me what's causing them. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

November 30, 2023

Author

Good idea. I'll give that a try. Thank you for your suggestions.

December 3, 2023

Author

So this weekend I updated to the latest unRAID version, updated all plugins and stopped all services on the server. I then let the server run until faults started appearing. Here's the log file.

It shows that CPU 0 and CPU 4 are causing some errors? Does that mean I have a bad CPU? Can it be fixed?

jarvis-syslog-20231203-1208.zip

December 3, 2023

Author

Here's the non-zipped log file.

syslog-192.168.0.10 (2).log

December 3, 2023

Community Expert

I don't know if the CPU is the problem, but the log does show multiple apps segfaulting, so it's most likely some hardware issue. I would try with just one stick of RAM; if it's the same, try a different one. That will basically rule out RAM issues.

December 3, 2023

Author

I'll give that a try. Thank you.

December 4, 2023

Author

I took one RAM stick out and ran the server all night. It still errored. This morning I swapped the RAM stick with another and it errored again. I suppose it could be the RAM but I doubt both sticks would be bad.

I do have an NVMe PCIe card that my cache drive is on. I'm going to try removing that to see if it makes a difference.

I've attached the latest log.

syslog-192.168.0.10.log

December 4, 2023

Community Expert

Unlikely the RAM is the problem, next suspects for me would be the board or CPU.

December 6, 2023

Author

The errors increase in frequency if I have the docker containers running. I have no way of troubleshooting whether the CPU or the motherboard is the issue. The server was a cold spare from a working environment, so I'm doubtful that it's a hardware issue.

I'll continue to update the software and hope that the updates fix it.

Thanks for your help @JorgeB.

December 7, 2023

Author

Just as an update, I'll keep this thread going in case someone else has similar issues.

In order to troubleshoot the mobo and/or CPU issues, I ran the Dell diagnostic tool built into the BIOS on the motherboard. The tool ran all night (~12 hours) and it passed all tests (including RAM tests).

After the tests were done, I booted the server in safe mode (no GUI) and let it run. It's been on for about 5 hours and it has had one "segfault".

Here's the error:

Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] likely on CPU 5

I'm really frustrated at this issue and I'm lost as to a solution. I'm 99% sure there's no hardware issue but yet here we are.

jarvis-syslog-20231207-1446.zip

Edited December 7, 2023 by Heffe
Formatting error.
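
When a syslog contains many of these kernel messages, tallying them by process and by CPU can show whether the faults cluster around one program or one core. The following is a minimal sketch, not part of the original thread: the regular expression is modelled only on the segfault line quoted above, and the names `SEGFAULT_RE`, `parse_segfault`, `tally` and `SAMPLE_LINE` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: segfault lines look like the one quoted in the post above.
# Other kernel versions may format the message slightly differently.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<image>[^\[\s]+)\[[^\]]*\])?"
    r"(?: likely on CPU (?P<cpu>\d+))?"
)

# The line from the post, copied verbatim.
SAMPLE_LINE = (
    "Dec  7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 "
    "ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] "
    "likely on CPU 5"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line as a dict, or None."""
    match = SEGFAULT_RE.search(line)
    return match.groupdict() if match else None


def tally(lines):
    """Count segfaults per process name and per CPU number."""
    by_proc, by_cpu = Counter(), Counter()
    for line in lines:
        hit = parse_segfault(line)
        if hit is None:
            continue  # not a segfault message
        by_proc[hit["proc"]] += 1
        if hit["cpu"] is not None:
            by_cpu[int(hit["cpu"])] += 1
    return by_proc, by_cpu


by_proc, by_cpu = tally([SAMPLE_LINE])
print("by process:", dict(by_proc))
print("by CPU:", dict(by_cpu))
```

To run it over a saved log, pass the open file object to `tally` in place of the one-element list. Faults spread across many unrelated programs and cores would point toward hardware, while faults confined to one program would point toward that program or its data. Treat that as a rough heuristic rather than a diagnosis.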

December 7, 2023

Community Expert

If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd basically need to remove them one at a time to test.

December 7, 2023

Author

2 minutes ago, JorgeB said:

If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd basically need to remove them one at a time to test.

Would running a SMART disk scan provide any useful info?

December 7, 2023

Community Expert

Doubt it.

December 10, 2023

Author

Would it be possible that this issue is being caused by a BIOS update? I've looked across the forum and it seems others are using the same server but on another BIOS version.

December 11, 2023

Community Expert

I would guess unlikely, but not impossible.

December 11, 2023

Author

Well, here are the changes that I've tried since:

1. I changed the boot mode to UEFI instead of Legacy.
2. I downgraded the BIOS to v2.0.2, since there was a post on here suggesting that this version was working for them.
3. I found some corrupted docker files and repaired them.

The server ran for about 10 hours and then crashed again. I'm going to run Memtest86 for an extended period of time and see if that finds anything. Lol, I'm so frustrated.

syslog-192.168.0.10_111223.log

December 18, 2023

Author

So I gave it a rest and returned to the problem after a few days.

I had a couple of "loop2" BTRFS errors that, from what I've read on the forum, are related to a corrupt docker image. I've since deleted my docker image and recreated it.

Secondly, after recreating my docker image, I pinned each docker to a specific CPU so I could find if there was an offending docker container. I think I found one. If I restart my nextcloud container while my mariadb container is running, I get:

kernel: php[8525]: segfault at ffffffffffffffff ip 000055871823b0b2 sp 00007ffd6737dd40 error 7 in php82[558718200000+2b4000] likely on CPU 5 (core 1, socket 0)

I get this error reliably when starting my nextcloud container.

Is there anything I can do to fix this? Is my nextcloud appdata corrupt? Thanks!

syslog-192.168.0.10_171223.log

Edited December 18, 2023 by Heffe
Uploaded log file.
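
The `error` field in the kernel segfault lines quoted in this thread can be decoded to see what kind of memory access failed. The sketch below is not from the original posts. It assumes an x86-64 system and that the kernel prints the field in hexadecimal (so `error 14` means 0x14, not decimal 14), and it uses the architecture's page-fault error-code bits. `FLAGS` and `decode_error` are illustrative names.

```python
# x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
# A meaning of None for "clear" means the bit is only reported when set.
FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "not a write"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_error(code_hex):
    """Translate the hex 'error' field of a segfault line into descriptions."""
    code = int(code_hex, 16)
    parts = []
    for mask, if_set, if_clear in FLAGS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


for field in ("14", "7"):  # the two error values seen in this thread
    print(f"error {field}: {', '.join(decode_error(field))}")
```

Under these assumptions, `error 14` together with `ip 0000000000000000` reads as a user-mode attempt to execute code at address 0, and `error 7` as a user-mode write that hit a protection violation. This says how the process faulted, not why, so it cannot by itself distinguish a software bug or corrupt data from faulty hardware.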


Solved by Heffe
