
Server suddenly not accessible and some errors I don't understand


Falk


Yesterday I noticed that my Nextcloud client wasn't able to connect to my Unraid home server.

So I tried to open the Unraid admin dashboard in the browser to find out what had happened to the Nextcloud container, and that didn't work either - ERR_CONNECTION_TIMED_OUT.

 

Drama: None of my services were accessible :( and the server itself was running warmer than usual.

Measure: I wasn't able to do anything on my Unraid server, so I restarted it (hard reset).

 

In less than a minute everything was booted up, and of course I was able to access all my services via browser again (websites, Nextcloud, admin dashboard, etc.). I decided to check the logs to see whether anything interesting was in there.

 

The only errors I found in my logs (I don't know what they mean):

Sep 25 13:37:01 Tower kernel: traps: sh[16809] trap invalid opcode ip:4e9f8a sp:7ffe91c09300 error:0 in bash[426000+c5000]
Sep 25 13:37:01 Tower kernel: traps: sh[16810] trap invalid opcode ip:4e9f8a sp:7ffd30f2c8d0 error:0 in bash[426000+c5000]
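
To see whether more of these traps or segfaults pile up over time, the live syslog can simply be grepped from the console (a simple sketch, nothing Unraid-specific):

grep -iE "invalid opcode|segfault|general protection" /var/log/syslog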

 

Side-Problem: Then I noticed that "Tower" appears in every log line. I don't know since when, but the server had renamed itself to the default (from Homeserver to Tower).

Side-Measure: So I changed it back.

 

Side-Problem: I also noticed that the OS version was outdated (one minor version behind).

Side-Measure: I updated the OS to the newest version and restarted everything.

 

After all that, the parity check started (as it should). I decided to leave the server alone overnight to complete the parity check, since I couldn't see any weird behaviour in the running services or any critical messages in the logs.

Drama: The next morning, the same behaviour as before (can't access anything) - but this time the temperature was completely fine.

Measure: Again, I wasn't able to do anything on my Unraid server, so I restarted it (hard reset).

 

No errors, just some warnings in the logs (I don't know what they mean):

Sep 25 20:38:42 Homeserver kernel: tsc: Fast TSC calibration failed
Sep 25 20:38:42 Homeserver kernel: ACPI: Early table checksum verification disabled
Sep 25 20:38:42 Homeserver kernel: xor: measuring software checksum speed
Sep 25 20:38:42 Homeserver kernel: floppy0: no floppy controllers found
Sep 25 20:38:42 Homeserver kernel: ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20220331/utaddress-204)
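
For reference, these boot-time messages can be re-listed later by filtering the kernel log to warnings and errors only (a small sketch, assuming the usual util-linux dmesg):

dmesg --level=warn,err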

 

I also decided to export diagnostics after this happened again (see attachment).

 

After that I had the same plan: just let the parity check run until it finishes, since everything looked normal again, and keep an eye on it while it runs.

 

Info: The parity check found 3 errors before the next drama happened :D


Drama: After approx. 3.5 hours of uptime, the same behaviour again - I guess the running parity check stopped as well.

Measure: Asking for help!! :(
 

Whenever I kept an eye on the dashboard, the server never looked overloaded.

RAM Total: ~20%

CPU Load: ~35% (sometimes short spikes on a single core - I guess that's fine)

The array and cache also look quite good.

SMART: all healthy
Temp: ~30°C

Utilization: Array (65%) and Cache (16%)

homeserver-diagnostics-20230926-0548_1695700160309_0.zip


Thank you for that hint. I didn't know that Unraid has supported syslog for a while now - well, I don't read the patch notes.

 

I configured it like so - I hope it's correct:

 

[Screenshot: syslog settings]
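
To verify that the mirror is actually writing, a quick test should be enough (a sketch; assumes the default mirror location /boot/logs/syslog on the flash drive - adjust if your settings point elsewhere):

logger "syslog mirror test"
tail -n 5 /boot/logs/syslog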

 

Afterwards I restarted the server, and unfortunately it crashed again just a few seconds after booting. This happened 4-5 times.

After a short break, the restart finally worked and I was able to access the server again for a while.

 

I was able to observe the following points:

  • After enabling "Mirror syslog to flash" I wanted to check whether it was working correctly. Then I got confused by the logged time on each line. As I mentioned at the top, apparently not only my server name had changed; the timezone had also been reset to its default value. I updated it to my current timezone.
  • After another reboot I started a parity check again and kept an eye on the dashboard the whole time:
    • Weird CPU load: sometimes a single core sits at 100% for 1-2 minutes.
    • I get an error which I don't really understand:
       
      Sep 26 16:36:27 Homeserver cache_dirs: ERROR: included directory 'networkshare' does not exist.

      The folder/share hasn't existed for months now, but I don't know where it is still being referenced (see the grep sketch right after this list).
       
    • And after approx. 1 hour and 6 minutes of parity check: parity disk failure - finally something meaningful that started to make sense.

      [Screenshot: parity disk failure notification]
       
    • The parity check paused because Unraid stopped the parity disk
    • Then I tried to see what happens when I resume the parity check -> crash
  • After the reboot, the parity disk is still disabled by Unraid
    • [Screenshot: parity disk shown as disabled]
    • I started a self-test on the disabled disk -> strangely, Unraid's result says everything is fine
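
Side note on the cache_dirs error above: that line presumably comes from the Folder Caching (Dynamix Cache Directories) plugin, so the old share is probably still listed in its include setting. A quick way to find the leftover reference could be to grep the plugin configs on the flash drive (a sketch; assumes they live under /boot/config/plugins/):

grep -ril "networkshare" /boot/config/plugins/

Removing 'networkshare' from the plugin's include list (or on its settings page) should silence the error.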

So I guess it's clear that one disk is not working properly. Now I need advice on what to do. My plan for now is to shut down the server until I have new disks. I have no parity anymore and I don't want to risk anything.

 

When the new disks arrive, is it important to run a parity check afterwards to build up the parity data on the new disk - and will that be necessary for Disk 1?

I ask because my plan for the future is to build a ZFS pool so I can then move my data from Disk 1 to the new pool (new disks) and get rid of the array.

 

Am I right with the result of my analysis, or can you find other critical points? Maybe the broken disk is just a symptom of another, more serious problem?

 

I have attached the diagnostics and syslog.

Thank you!!

syslog homeserver-diagnostics-20230926-1824.zip homeserver-diagnostics-20230926-1639.zip

On 9/26/2023 at 7:39 PM, JorgeB said:

Parity dropped offline, most likely a power/connection problem. The syslog shows some random call traces and segfaults; start by running memtest.

 

Alright!

 

I unplugged my array disks. I don't want to stress my healthy disk because I don't feel safe now that my parity is broken. (Both disks have the same purchase date, though.)

 

I installed the newest version of Memtest86+ on a USB stick, and it has been running for a few hours now.

 

[Photo: Memtest86+ running]

 

So far it has found no errors. From what I've read online, the recommended memtest duration is at least 24 hours. But I guess the RAM sticks are fine, idk. I think I will leave it running until tomorrow when I get back from work.

 

Is there maybe something similar to test the CPU as well? Do you have any recommendations?

 

On 9/26/2023 at 7:39 PM, JorgeB said:

also check this.

 

I wasn't able to find any power management (C-state) settings in my BIOS, but I also had only a little time to check. Maybe I need to do more research - or do you have a hint? I have a Gigabyte Technology Co., Ltd. B450 I AORUS PRO WIFI-CF with BIOS version F51.
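
In the meantime, to at least see from Linux which idle states the CPU actually uses, the cpuidle entries in sysfs can be read (a sketch; these paths exist on mainline kernels with cpuidle enabled):

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage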

 

I think I just have a corrupted disk - my failure looks pretty similar to what PC Tech Hustle shows in his video. Funnily enough, his "broken disk" wasn't really damaged like Unraid assumed: he analyzed the disk afterwards on another Windows machine and it seemed to still work perfectly.

 

But I am still curious about the segfaults and invalid opcode errors. Can they happen because of a disk failure? Or is something else broken, sigh?

 

Some measures I want to take - please scream if it's dumb:

  • update the BIOS, because it is really quite a few versions behind now (3 years)
  • buy new disks tomorrow
  • when the disks arrive: replace the parity disk, try the parity rebuild/check again and hope that Unraid runs stable
12 hours ago, Falk said:

Can it happen because of Disk fail?

Very unlikely.

 

12 hours ago, Falk said:

Some measures I want to take - please scream if it's dumb:

  • update the BIOS, because it is really quite a few versions behind now (3 years)
  • buy new disks tomorrow
  • when the disks arrive: replace the parity disk, try the parity rebuild/check again and hope that Unraid runs stable

That looks good, but note that the old parity disk can still be fine; disks dropping is most often a power/connection issue. You can post a SMART report.


I ran Memtest86+ for 22.5 hours. The memory looks pretty fine, I think.

 

[Photo: Memtest86+ result after 22.5 hours]

 

Afterwards I updated the BIOS to the newest version.

 

Then I found an image loaded with maintenance/diagnostic tools (Ultimate Boot CD). I booted into this image and ran several CPU stress tests.
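
For reference, an equivalent CPU burn-in from a plain Linux shell would be something along these lines (a sketch - stress-ng is not part of stock Unraid, so this assumes a live Linux USB or an additionally installed package):

stress-ng --cpu 0 --timeout 10m --metrics-brief   # --cpu 0 = use all CPU threads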

Every stress test passed, so I decided to start an extended SMART test (via UBCD - Ultimate Boot CD) on both of my disks.

 

I exported the SMART reports from both UBCD and Unraid. (You can find them in the attachments, tagged by source: ubcd or unraid.)
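
For reference, the console equivalent for exporting such a report would look roughly like this (sdX is just a placeholder for the actual device):

smartctl -a /dev/sdX > /boot/smart-report-unraid.txt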

 

I asked ChatGPT to analyze each file. Nothing notable, except the UDMA_CRC_Error_Count for disk1.

 

Quote

The "UDMA_CRC_Error_Count" value of 31 indicates that there have been some transmission errors during data transfer. A low value is normal, but it's essential to keep an eye on it to ensure that it doesn't continue to increase.

In simpler terms, UDMA_CRC_Error_Count is a SMART attribute that tracks errors that occur during the transfer of data between your hard drive and other parts of your computer, such as the motherboard or cables. A value of 31 is not excessively high, but it suggests that there have been some data transfer errors in the past. It's important to monitor this value because a significant increase could indicate potential issues with data integrity or the health of your hard drive's connection to the rest of your system. If you notice a steady rise in this count, it's worth investigating the cause and potentially replacing cables or checking for other hardware issues.

 

I also ran xfs_repair with -n (see attachments).
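
For reference, a check-only run against Disk 1 would look roughly like this (assuming it maps to /dev/md1 - newer Unraid releases use /dev/md1p1 - and the array is started in maintenance mode):

xfs_repair -n /dev/md1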

 

Conclusion:

  • CPU might be fine ---> checked with different stress tests
  • RAM might be fine ---> checked with Memtest86+ (passed 30 times)
  • Both disks themselves could be fine (I am still suspicious) ---> both passed the extended SMART test (but I will replace the parity disk anyway when the new disks arrive)
  • SATA port/cable might be damaged ---> I need to check the cable/port now, because the UDMA_CRC_Error_Count attribute in the SMART report may point to that (see the sketch below this list)
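
To keep an eye on that counter, re-reading just the one attribute after reseating or swapping the cable should be enough (sdX is again a placeholder for the actual device):

smartctl -A /dev/sdX | grep -i udma_crc

The counter never resets, so what matters is whether it keeps climbing after the cable swap - if it does, the port or controller side becomes the more likely suspect.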

Can you find anything else in the reports that looks weird or that should be investigated as well?

smart-report-disk1-unraid.txt smart-report-parity-ubcd.txt smart-report-parity-unraid.txt smart-report-disk1-ubcd.txt xfs-repair-check-report-disk1.txt

14 minutes ago, Falk said:

RAM might be fine ---> checked with memtest86+ (passed 30 times)

This is the one that is not always conclusive. A failure definitely means you have a problem, whereas a pass is merely indicative.

 

Having said that, it is a good resource - I may make a note of it and suggest it in the future for those experiencing potential hardware problems.

13 minutes ago, itimpi said:

This is the one that is not always conclusive. A failure definitely means you have a problem, whereas a pass is merely indicative.

 

Ok. That means that in my current state the hard drives, the SATA connections and the RAM all remain under suspicion of damage, and I can't really find out what exactly is affected with the diagnostic tools... :D Dang it!

 

Do you have any other ideas on how to investigate?

15 minutes ago, Falk said:

SATA connections

This is always the most likely. Do not forget, however, that power cabling to drives can also cause difficult-to-diagnose issues, particularly if splitters are used to get the required number of connections. The PSU itself is sometimes the culprit as well if failures tend to occur at times of peak load.

