Sirkyle Posted September 1, 2021
For the past 2 months I have had issues on my monthly parity checks (a few thousand errors, but no individual drive errors). My Docker service also seems to go offline, so I have to reinstall all the containers, and some shares go offline as well. I've attached the diagnostics. Can anyone make sense of what is going on?
tower-diagnostics-20210901-1302.zip
JorgeB Posted September 1, 2021
There are issues with the SAS2LP controller. Those controllers are not recommended for v6; if possible, replace it with an LSI HBA. If you're not using ECC RAM, it's also a good idea to run memtest.
Sirkyle Posted September 1, 2021
Ah, yeah, they have caused a bit of an issue for me in the past. I'm working on finding an affordable upgrade now. Any suggestions?
Sirkyle Posted September 1, 2021
Thanks, I ordered 2 SAS9211-8I cards today. After getting them into IT mode, should it be pretty easy to just plug and play?
JorgeB Posted September 2, 2021
7 hours ago, Sirkyle said: after getting them on IT mode should it be pretty easy to just plug n play?
Yep.
Sirkyle Posted September 8, 2021
OK, I got my new cards, installed them, and ran a non-correcting parity check just to make sure everything went OK. As soon as I started the parity check, the user shares all dropped offline. The check picked up a few errors, and I now have a drive that's unassigned. Fresh diagnostics attached.
tower-diagnostics-20210908-0619.zip
Sirkyle Posted September 8, 2021
Looks like the unassigned device is my cache drive, although it's using a different letter code (SDE and SDU).
JorgeB Posted September 8, 2021
Replace or swap cables on that disk.
Sirkyle Posted September 8, 2021
OK, I think I have another breakout cable I can try. I'll report back in a bit.
itimpi Posted September 8, 2021
1 hour ago, Sirkyle said: although its using a different letter code SDE, and SDU
The letter code can always change between boots, as it is assigned dynamically during the boot process. It can also change if a drive drops offline temporarily, then reconnects and is given a different letter.
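Since the sdX letters can move around, one way to keep track of a drive is to look at its persistent name instead. The following is a minimal Python sketch, assuming the server populates /dev/disk/by-id with symlinks the way most udev-based Linux systems do (an assumption, not something confirmed in this thread). The function name stable_names is made up for illustration.

```python
import os

def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each kernel device name (e.g. 'sde') to its persistent by-id names.

    Assumes by_id_dir holds symlinks that point at the kernel device nodes.
    Returns an empty dict if the directory does not exist.
    """
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        if not os.path.islink(path):
            continue
        # Resolve the symlink and keep only the final component, e.g. 'sde'.
        target = os.path.basename(os.path.realpath(path))
        mapping.setdefault(target, []).append(name)
    return mapping

# Print the current letter -> persistent name mapping for this boot.
for dev, ids in stable_names().items():
    print(dev, "->", ", ".join(ids))
```

Running this before and after a reboot (or a cable swap) should show the same persistent names attached to whatever letters the drives received that time.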
Sirkyle Posted September 8, 2021
57 minutes ago, itimpi said: The letter code can always change between boot as it is assigned dynamically during boot process. It can also happen if a drive drops of line temporarily and then reconnects and is given a different letter.
So swap the cable, reboot, and see what happens?
Sirkyle Posted September 8, 2021
OK, cables swapped, and no unassigned drive is showing up. User shares are back. Do you think I need to run any sort of parity check?
trurl Posted September 8, 2021
Parity hasn't got anything to do with cache, of course, but it sounds like you never got to finish the parity check you wanted to do earlier to test your new controller.
Sirkyle Posted September 8, 2021
Just now, trurl said: Parity hasn't got anything to do with cache of course, but sounds like you never got to finish the parity check you wanted to do earlier to test your new controller.
It did complete, with 2020 errors. It was after the test that I noticed the cache issue.
trurl Posted September 8, 2021
5 hours ago, Sirkyle said: non-correcting parity check
Doesn't look like it was non-correcting:
Sep 7 12:21:36 Tower kernel: mdcmd (36): check
Sep 7 12:21:36 Tower kernel: md: recovery thread: check P Q ...
Sep 7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897064
Sep 7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897072
...
Sep 7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206899896
Sep 7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206899904
Sep 7 12:35:11 Tower kernel: md: recovery thread: stopped logging
9 minutes ago, Sirkyle said: 2020 errors
You can't allow sync errors. Post new diagnostics, then run a NON-correcting parity check.
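The same check can be done on a saved syslog without reading it by eye. This is a rough Python sketch that looks for the "corrected" lines quoted in the post above. The regular expression is modelled only on those quoted lines; treating other prefixes such as P or Q alone the same way is an assumption. The function name corrected_sectors is made up for illustration.

```python
import re

# Modelled on the quoted lines, e.g.
# "md: recovery thread: PQ corrected, sector=206897064".
# Accepting any word before "corrected" (P, Q, PQ) is an assumption.
CORRECTED = re.compile(r"md: recovery thread: \w+ corrected, sector=(\d+)")

def corrected_sectors(log_text):
    """Return the sector numbers the md driver logged as corrected."""
    return [int(m.group(1)) for m in CORRECTED.finditer(log_text)]

sample = """\
Sep 7 12:21:36 Tower kernel: mdcmd (36): check
Sep 7 12:21:36 Tower kernel: md: recovery thread: check P Q ...
Sep 7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897064
Sep 7 12:35:11 Tower kernel: md: recovery thread: PQ corrected, sector=206897072
"""

sectors = corrected_sectors(sample)
# Any logged correction means the check was writing to parity.
print(len(sectors), "corrected sectors logged; correcting check:", bool(sectors))
```

Because the log ends with "stopped logging", the number of matching lines is only a lower bound on the corrections made, so the useful signal here is whether any such line exists at all.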
Sirkyle Posted September 8, 2021
Apologies, I could have sworn I had unchecked the option to write corrections to parity. New diagnostics are attached, and I'm starting the non-correcting check now.
tower-diagnostics-20210908-1151.zip
Sirkyle Posted September 8, 2021
I notice that when I start a parity check, a lot of my user shares go offline. I'll post more diagnostics when it's done, probably in 11 hours.
trurl Posted September 8, 2021
Just now, Sirkyle said: I notice when I start a parity check a lot of my user shares go offline. Ill post more diagnostics when its done, probably 11 hours
Go ahead and post them now. It's probably a symptom of a disconnected disk.
Sirkyle Posted September 8, 2021
Attached
tower-diagnostics-20210908-1157.zip
trurl Posted September 8, 2021
Sep 8 11:23:30 Tower root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token
Unrelated, but you should close all browser windows open to your server after a reboot and start with a new browser session.
trurl Posted September 8, 2021
Sep 8 11:53:46 Tower kernel: XFS (sde1): log I/O error -5
Sep 8 11:53:46 Tower kernel: XFS (sde1): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = 000000009975f2dd
Sep 8 11:53:46 Tower kernel: XFS (sde1): Log I/O Error Detected. Shutting down filesystem
Sep 8 11:53:46 Tower kernel: XFS (sde1): Please unmount the filesystem and rectify the problem(s)
Why are you using an SMR drive as cache?
Model Family: Seagate BarraCuda 3.5 (SMR)
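To spot this kind of event in a longer syslog, a small Python sketch like the one below can list which filesystems XFS reported shutting down. The pattern is modelled only on the lines quoted in the post above, and the function name xfs_shutdowns is made up for illustration.

```python
import re

# Modelled on the quoted line:
# "XFS (sde1): Log I/O Error Detected. Shutting down filesystem"
SHUTDOWN = re.compile(r"XFS \((\w+)\): .*Shutting down filesystem")

def xfs_shutdowns(log_text):
    """Return the sorted, de-duplicated partitions XFS reported shutting down."""
    return sorted({m.group(1) for m in SHUTDOWN.finditer(log_text)})

sample = """\
Sep 8 11:53:46 Tower kernel: XFS (sde1): log I/O error -5
Sep 8 11:53:46 Tower kernel: XFS (sde1): Log I/O Error Detected. Shutting down filesystem
Sep 8 11:53:46 Tower kernel: XFS (sde1): Please unmount the filesystem and rectify the problem(s)
"""

print(xfs_shutdowns(sample))
```

A partition showing up here only says the filesystem went away. It does not say whether the cause was a cable, the controller, or the drive itself.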
Sirkyle Posted September 8, 2021
19 minutes ago, trurl said:
Sep 8 11:53:46 Tower kernel: XFS (sde1): log I/O error -5
Sep 8 11:53:46 Tower kernel: XFS (sde1): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = 000000009975f2dd
Sep 8 11:53:46 Tower kernel: XFS (sde1): Log I/O Error Detected. Shutting down filesystem
Sep 8 11:53:46 Tower kernel: XFS (sde1): Please unmount the filesystem and rectify the problem(s)
Why are you using an SMR drive as cache?
Model Family: Seagate BarraCuda 3.5 (SMR)
To be honest, I'm not really sure what that is.
JorgeB Posted September 8, 2021
The cache device dropped again. If you swapped both the power and SATA cables, it could be a device problem. You can also try connecting it to the onboard SATA ports, but I've used the same disk model with an LSI without issues.
Sirkyle Posted September 8, 2021
So what's the recommendation? I have some brand new drives around, but they're 4TB, and I don't have any other small drives. I swapped the cables from the backplane to the LSI. I'm happy to format the cache drive and try some specific diagnostics for that. Or should I keep letting the non-correcting parity check roll?