Disk array in the disk cabinet randomly drops disks during verification.

January 11, 2024

Disk array in the disk cabinet randomly drops disks during verification. However, the disk checks out fine, and the array starts normally after a reboot.

Disks 9 and 11 are in an external disk cabinet. Even when I use a 'new configuration', these two disks randomly drop out during the synchronization process, their temperature cannot be measured, and read errors continue to accumulate.

After the error occurs, I stop the array and reboot. The affected disks then mount automatically, the array starts, and everything appears normal. The disks have been checked and no issues were found.

I attempted to start the array and automatically sync three times. Each time, these two disks had issues, whereas the other disks not in the external disk cabinet did not encounter any problems.

My server is a Gen8 ML310e v2, with a P222 serving as the HBA that connects both the internal hard drives and the external disk cabinet.

I'm seeking advice on where the problem might be occurring.

Thank you.

unraid-diagnostics-20240111-1051.zip

Edited January 11, 2024 by yuelpl

January 11, 2024

Community Expert

Attach Diagnostics to your NEXT post in this thread.

January 11, 2024

Author

thank you

unraid-diagnostics-20240111-1051.zip

January 11, 2024

Community Expert

I don't see any I/O errors logged during the parity sync

Jan 11 09:44:18 UNRAID kernel: md: recovery thread: recon P ...
Jan 11 09:44:21 UNRAID tips.and.tweaks: Tweaks Applied
Jan 11 09:44:21 UNRAID sudo:     root : PWD=/ ; USER=root ; COMMAND=/bin/bash -c '/usr/local/emhttp/plugins/unbalance/unbalance -port 6237'
Jan 11 09:44:21 UNRAID sudo: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Jan 11 09:44:26 UNRAID kernel: eth0: renamed from veth32e14d3
Jan 11 09:44:26 UNRAID kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethd9736a9: link becomes ready
Jan 11 09:44:26 UNRAID kernel: docker0: port 1(vethd9736a9) entered blocking state
Jan 11 09:44:26 UNRAID kernel: docker0: port 1(vethd9736a9) entered forwarding state
Jan 11 09:44:26 UNRAID kernel: mdcmd (37): nocheck cancel
Jan 11 09:44:26 UNRAID kernel: md: recovery thread: exit status: -4
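For anyone who wants to pull the same lines out of a longer syslog, here is a minimal Python sketch. It assumes the log has been saved as plain text; the function name `md_events` and the regular expression are illustrative and not part of Unraid. It only filters for the `md: recovery thread` and `mdcmd` kernel lines shown in the excerpt above.

```python
import re

# Matches kernel lines from the md driver or mdcmd, as in the excerpt above.
# The pattern is an assumption based only on those sample lines.
MD_EVENT = re.compile(r"kernel: (md: recovery thread: .+|mdcmd \(\d+\): .+)$")


def md_events(lines):
    """Return the md/mdcmd message part of each matching syslog line."""
    events = []
    for line in lines:
        match = MD_EVENT.search(line.rstrip())
        if match:
            events.append(match.group(1))
    return events


# Sample lines copied from the log excerpt in this post.
sample = [
    "Jan 11 09:44:18 UNRAID kernel: md: recovery thread: recon P ...",
    "Jan 11 09:44:21 UNRAID tips.and.tweaks: Tweaks Applied",
    "Jan 11 09:44:26 UNRAID kernel: mdcmd (37): nocheck cancel",
    "Jan 11 09:44:26 UNRAID kernel: md: recovery thread: exit status: -4",
]

for event in md_events(sample):
    print(event)
```

On the sample this keeps the sync start, the cancel command, and the exit status, which is the sequence the question below is about.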

Did you cancel it or did it just stop?

How is the external cabinet powered?

January 11, 2024

Author

After the error, the read error count keeps accumulating, and many of my services become inaccessible. Therefore, I clicked 'Cancel' to stop the verification and restarted the array after rebooting, which restored normal operation of services like Docker.

During a previous attempt, I also tried to directly stop the array, but the UI froze, and I ultimately had to resort to a hard power reset.

The external disk cabinet has its own power supply, a Sea Sonic 350W SS-350M1U. It is powered on and off together with the main server through a UPS. This power supply has been in use for less than 2 years.

January 11, 2024

Author

The cable is connected from the hard drive backplane through an 8087 to 8088 cable to the P222 on the Gen8.

January 11, 2024

Author

Could it be that because I rebooted the server, the previous error diagnostics were lost?

January 11, 2024

Community Expert

Diagnostics only cover what has happened since the last reboot.

Set up the syslog server.

January 11, 2024

Author

I set the local syslog server to "Enable". Is it possible to view the logs from the past few days after a reboot? If so, I will reboot the server now.

Edited January 11, 2024 by yuelpl

January 11, 2024

Author

If that's not possible, do I only have the option to start the disk verification again and wait for the issue to occur, then download the diagnostic logs before rebooting?

January 11, 2024

Community Expert

The syslog is in RAM, just like the rest of the OS. Unless you have the syslog server set up to store it somewhere, it is gone after a reboot.

Better post a screenshot of your syslog server settings; the setup can be confusing.

January 11, 2024

Author

Screenshot as follows.

January 11, 2024

Author

I have about 500GB of space left in my cache. Will doing this cause the cache to fill up and lead to Docker running abnormally?

January 11, 2024

Author

Are the settings correct? If so, I will start the disk verification again and promptly download the diagnostic logs if there is an error.

January 11, 2024

Community Expert

5 minutes ago, yuelpl said:

Screenshot as follows.

Can't read that. Are you logging to flash drive? Or what do you have set as the remote server?

Also, diagnostics will not include the syslog from the syslog server; you have to get that from where it is stored and post it.

January 11, 2024

Author

Sorry, I have taken the screenshot again. I have set the logs to output to my cache disk. Please check if this setup is correct.

January 11, 2024

Author

Should I directly reboot next, or start a new verification and wait for the error to occur?

January 11, 2024

Community Expert

6 minutes ago, yuelpl said:

the screenshot again

You have to tell it what remote server to log to. Put the IP address of your server to get it to send the log to itself.
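One way to confirm that the syslog server is actually receiving messages is to send it a test line from another machine. The following is a minimal sketch using Python's standard `logging.handlers.SysLogHandler`. It assumes the server listens on UDP port 514 (the conventional syslog port; check your own settings), and the function name `send_test_message` is hypothetical. The address `10.0.0.10` in the usage comment is only inferred from the attachment name later in this thread.

```python
import logging
import logging.handlers


def send_test_message(host, port=514, text="syslog test message"):
    """Send one UDP syslog message to host:port (port 514 is an assumption)."""
    logger = logging.getLogger("syslog-test")
    logger.setLevel(logging.INFO)
    # A (host, port) tuple makes SysLogHandler send over UDP by default.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        logger.removeHandler(handler)
        handler.close()


# Example (replace with your own server's IP address):
# send_test_message("10.0.0.10")
```

If the test line shows up in the stored log file, the remote server setting is working. UDP gives no delivery confirmation, so a missing line does not say where the message was lost.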

January 11, 2024

Author

Is that so? Then, should I reboot the server and wait for downloadable content to appear in the folder?

January 11, 2024

Community Expert

Wait for the error to occur, then get the log, zip it, and post it.
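If you would rather script the zip step, a minimal Python sketch could look like this. The function name `zip_log` and the log file name in the usage comment are hypothetical; use whatever path your syslog server actually writes to.

```python
import zipfile
from pathlib import Path


def zip_log(log_path, zip_path=None):
    """Compress a single log file into a zip archive and return the archive path."""
    log_path = Path(log_path)
    # Default: same folder and name as the log, with a .zip extension.
    zip_path = Path(zip_path) if zip_path else log_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store only the file name, not the full directory path.
        archive.write(log_path, arcname=log_path.name)
    return zip_path


# Example with a hypothetical path:
# zip_log("/mnt/user/syslog/syslog-10.0.0.10.log")
```

Copy the log before rebooting only if it is stored in RAM; a file written by the syslog server to a share should survive the reboot.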

January 11, 2024

Author

OK, I will start a new verification and update this topic when the error occurs. Thank you

January 11, 2024

Author

It happened again 😵

unraid-diagnostics-20240111-1544.zip

syslog-10.0.0.10.zip

Edited January 11, 2024 by yuelpl

January 11, 2024

Author

I have now paused the verification. When I click the arrow in front of the disk to view its contents, it shows 'Invalid Path'.

Strangely, it appears to be readable on the UI. After stopping the array, the disk shows as missing.

There is always a read error on either the 9th or the 11th disk. In 5 verification attempts, they have never both experienced read errors at the same time.

Once rebooted, the disk list appears to be normal, but I have to manually stop the verification to prevent the 9th and 11th disks from experiencing read errors again.

Diagnostic logs and syslog have been uploaded above for your review. Thank you.

Edited January 11, 2024 by yuelpl

January 11, 2024

Community Expert

The disk is dropping offline. This is most often a power/connection issue; try replacing the cables or connecting that disk to a different controller.

January 11, 2024

Author

9 minutes ago, JorgeB said:

The disk is dropping offline. This is most often a power/connection issue; try replacing the cables or connecting that disk to a different controller.

I have purchased a new cable and power supply, and I will replace them once they arrive.


Solved by yuelpl
