Parity Check/Docker/User Share errors

Sirkyle · September 1, 2021

For the past 2 months I have had issues with my monthly parity checks (a few thousand errors, but no individual drive errors). My docker also seems to go offline, forcing me to reinstall all the containers, and some shares go offline as well. I've attached the diagnostics. Can anyone make sense of what is going on?

tower-diagnostics-20210901-1302.zip

JorgeB · September 1, 2021

There are issues with the SAS2LP controller. Those controllers are not recommended for v6, so if possible replace it with an LSI HBA. If you're not using ECC RAM, it's also a good idea to run memtest.

Sirkyle · September 1, 2021

Ah, yeah, they have caused a bit of an issue for me in the past. I'm trying to find an affordable upgrade now. Any suggestions?

JorgeB · September 1, 2021

Sirkyle · September 1, 2021

Thanks, I ordered 2 SAS9211-8I cards today. After getting them on IT mode, should it be pretty easy to just plug n play?

JorgeB · September 2, 2021

7 hours ago, Sirkyle said:

after getting them on IT mode should it be pretty easy to just plug n play?

Yep.

Sirkyle · September 8, 2021

Ok, got my new cards, installed them, and ran a non-correcting parity check just to make sure everything went ok. As soon as I started the parity check, the user shares all dropped offline, I picked up a few errors, and I now have a drive that's unassigned. Attached fresh diagnostics.

tower-diagnostics-20210908-0619.zip

Sirkyle · September 8, 2021

Looks like the unassigned device is my cache drive, although it's using a different letter code (SDE and SDU).

JorgeB · September 8, 2021

Replace or swap cables on that disk.

Sirkyle · September 8, 2021

Ok, I think I have another breakout cable I can try. I'll report back in a bit.

itimpi · September 8, 2021

1 hour ago, Sirkyle said:

although it's using a different letter code (SDE and SDU)

The letter code can always change between boots, as it is assigned dynamically during the boot process. It can also change if a drive drops offline temporarily, then reconnects and is given a different letter.
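Since the sdX letters are not stable, it can help to look at the persistent names instead. The following is a minimal sketch, assuming a Linux system where udev populates /dev/disk/by-id (the usual location, but an assumption here); the function name `stable_names` is made up for illustration. It lists each persistent name next to the kernel letter it currently resolves to.

```python
import os

def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent by-id name to the kernel device it currently points at."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        # Directory is an assumption; return nothing if it is absent.
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            # Resolve the symlink and keep only the final component, e.g. "sde".
            mapping[name] = os.path.basename(os.path.realpath(link))
    return mapping

if __name__ == "__main__":
    for name, dev in stable_names().items():
        print(f"{dev:<8} {name}")
```

Running it before and after a reconnect should show the same by-id name pointing at a different letter, which is the behaviour described above.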

Sirkyle · September 8, 2021

57 minutes ago, itimpi said:

The letter code can always change between boots, as it is assigned dynamically during the boot process. It can also change if a drive drops offline temporarily, then reconnects and is given a different letter.

So swap the cable, reboot, and see what happens?

Sirkyle · September 8, 2021

Ok, cables swapped, and no unassigned drive is showing up. User shares are back. Do you think I need to run any sort of parity check?

trurl · September 8, 2021

Parity hasn't got anything to do with cache of course, but sounds like you never got to finish the parity check you wanted to do earlier to test your new controller.

Sirkyle · September 8, 2021

Just now, trurl said:

Parity hasn't got anything to do with cache of course, but sounds like you never got to finish the parity check you wanted to do earlier to test your new controller.

It did complete, with 2020 errors. After the test is when I noticed the cache issue.

trurl · September 8, 2021

5 hours ago, Sirkyle said:

non-correcting parity check

Doesn't look like it was non-correcting

Sep  7 12:21:36 Tower kernel: mdcmd (36): check 
Sep  7 12:21:36 Tower kernel: md: recovery thread: check P Q ...
Sep  7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897064
Sep  7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897072
...
Sep  7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206899896
Sep  7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206899904
Sep  7 12:35:11 Tower kernel: md: recovery thread: stopped logging

9 minutes ago, Sirkyle said:

2020 errors

You can't allow sync errors.

Post new diagnostics, then run a NON-correcting parity check.
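To check from the syslog whether a run wrote corrections, one option is to count the "corrected" lines like the ones quoted above. This is a rough sketch only: the regular expression is derived from the lines in this thread, and `summarize_check` is a hypothetical helper, not part of Unraid. Because the log ends with "stopped logging", the count is a lower bound, not the total number of sync errors.

```python
import re

# Pattern taken from the syslog lines quoted in this thread.
CORRECTED = re.compile(r"md: recovery thread: (\w+) corrected, sector=(\d+)")

def summarize_check(lines):
    """Return (count of 'corrected' lines, first sector, last sector)."""
    sectors = [int(m.group(2)) for line in lines if (m := CORRECTED.search(line))]
    if not sectors:
        return 0, None, None
    return len(sectors), sectors[0], sectors[-1]

sample = [
    "Sep  7 12:21:36 Tower kernel: md: recovery thread: check P Q ...",
    "Sep  7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897064",
    "Sep  7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897072",
    "Sep  7 12:35:11 Tower kernel: md: recovery thread: stopped logging",
]

count, first, last = summarize_check(sample)
print(count, first, last)  # 2 206897064 206897072
```

Any nonzero count means parity was being written during the check, so the run was not a non-correcting one.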

Sirkyle · September 8, 2021

Apologies, I swore I had unchecked the write corrections to parity. New diagnostics attached, and I'm starting the non-correcting check now.

tower-diagnostics-20210908-1151.zip

Sirkyle · September 8, 2021

I notice that when I start a parity check, a lot of my user shares go offline. I'll post more diagnostics when it's done, probably in 11 hours.

trurl · September 8, 2021

Just now, Sirkyle said:

I notice that when I start a parity check, a lot of my user shares go offline. I'll post more diagnostics when it's done, probably in 11 hours.

Go ahead and post them now. It's probably a symptom of a disconnected disk.

Sirkyle · September 8, 2021

Attached

tower-diagnostics-20210908-1157.zip

trurl · September 8, 2021

Sep  8 11:23:30 Tower root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token

Unrelated, but you should close all browsers to your server after a reboot and start with a new browser session.

trurl · September 8, 2021

Sep  8 11:53:46 Tower kernel: XFS (sde1): log I/O error -5
Sep  8 11:53:46 Tower kernel: XFS (sde1): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = 000000009975f2dd
Sep  8 11:53:46 Tower kernel: XFS (sde1): Log I/O Error Detected. Shutting down filesystem
Sep  8 11:53:46 Tower kernel: XFS (sde1): Please unmount the filesystem and rectify the problem(s)

Why are you using an SMR drive as cache?

Model Family:     Seagate BarraCuda 3.5 (SMR)
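The XFS lines above are the sort of thing worth searching for in a syslog when shares drop out. Here is a small sketch, assuming the message format shown in this thread; `xfs_shutdown_devices` is a hypothetical helper name. It reports which devices had their XFS filesystem shut down.

```python
import re

# Matches kernel messages of the form "XFS (sde1): <message>".
XFS_EVENT = re.compile(r"XFS \((?P<dev>[^)]+)\): (?P<msg>.*)")

def xfs_shutdown_devices(lines):
    """Return the set of devices whose XFS filesystem reported a shutdown."""
    devices = set()
    for line in lines:
        m = XFS_EVENT.search(line)
        if m and "Shutting down filesystem" in m.group("msg"):
            devices.add(m.group("dev"))
    return devices

sample = [
    "Sep  8 11:53:46 Tower kernel: XFS (sde1): log I/O error -5",
    "Sep  8 11:53:46 Tower kernel: XFS (sde1): Log I/O Error Detected. Shutting down filesystem",
    "Sep  8 11:53:46 Tower kernel: XFS (sde1): Please unmount the filesystem and rectify the problem(s)",
]

print(xfs_shutdown_devices(sample))  # {'sde1'}
```

The device reported here is a partition name like sde1, and as noted earlier in the thread the letter can change after a reconnect, so match it against the drive's serial rather than the letter alone.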

Sirkyle · September 8, 2021

19 minutes ago, trurl said:

Sep  8 11:53:46 Tower kernel: XFS (sde1): log I/O error -5
Sep  8 11:53:46 Tower kernel: XFS (sde1): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = 000000009975f2dd
Sep  8 11:53:46 Tower kernel: XFS (sde1): Log I/O Error Detected. Shutting down filesystem
Sep  8 11:53:46 Tower kernel: XFS (sde1): Please unmount the filesystem and rectify the problem(s)

Why are you using an SMR drive as cache?

Model Family:     Seagate BarraCuda 3.5 (SMR)

To be honest, I'm not really sure what that is.

JorgeB · September 8, 2021

The cache device dropped again. If you swapped both the power and SATA cables, it could be a device problem. You can also try connecting it to the onboard SATA ports, but I've used the same disk model with an LSI without issues.

Sirkyle · September 8, 2021

So, what's the recommendation? I have some brand new drives around, but they're 4TB, and I don't have any other small drives. I swapped the cables from the backplane to the LSI. I'm happy to format the cache drive and try some specific diagnostics for that. Or should I keep letting the non-correcting parity check roll?
