niwmik Posted August 23, 2021 I'm trying to figure out what is causing the following parity check errors on my server. So far I have replaced the motherboard, CPU, memory, and PSU, hoping this would solve the issue (I wanted to upgrade the CPU anyway). That was before the 8/23/2021 parity check, but I'm still getting errors. I'm currently running another parity check and sync errors are still coming up. I have not replaced the hard drives, the 2 LSI SAS9201-16e cards, the 6 external Mini SAS cables, or the DIY DAS, which is made up of 3 Dual Mini SAS SFF-8088 to SAS36P SFF-8087 adapters and 6 Mini SAS 26-pin (SFF-8088) male to 4x SATA 7-pin female cables. Any ideas on where I should look next? atlas-diagnostics-20210823-0831.zip
JorgeB Posted August 23, 2021 First you need to fix this:
Aug 22 17:25:13 Atlas kernel: md: disk10 read error, sector=228586584
Aug 22 17:25:13 Atlas kernel: md: disk10 write error, sector=228589648
Aug 22 17:25:13 Atlas kernel: md: disk10 write error, sector=228589656
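For readers who want to scan their own syslog for the same pattern, here is a minimal Python sketch. It is not taken from the diagnostics; the function name `summarize_md_errors` is made up for illustration, and the regular expression only assumes the line format shown in the three log lines above.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted in the post, e.g.
# "... kernel: md: disk10 read error, sector=228586584"
MD_ERROR = re.compile(r"md: disk(\d+) (read|write) error, sector=(\d+)")

def summarize_md_errors(lines):
    """Count md read/write errors per (disk number, error kind)."""
    counts = Counter()
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts

# The three lines quoted in the post above.
sample = [
    "Aug 22 17:25:13 Atlas kernel: md: disk10 read error, sector=228586584",
    "Aug 22 17:25:13 Atlas kernel: md: disk10 write error, sector=228589648",
    "Aug 22 17:25:13 Atlas kernel: md: disk10 write error, sector=228589656",
]

for (disk, kind), n in sorted(summarize_md_errors(sample).items()):
    print(f"disk{disk}: {n} {kind} error(s)")
```

Feeding it the whole syslog instead of `sample` would show at a glance whether the errors are confined to one disk or spread across several.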
niwmik Posted August 23, 2021 In order to fix disk10, would I replace it with another drive and let unRaid rebuild the new drive? How would the previous parity check sync errors affect this rebuild? Is it possible that the new disk10 rebuild will contain bad data?
trurl Posted August 23, 2021 Disk10 has disconnected and doesn't appear in SMART. Check all connections, power and SATA, both ends, including splitters. Then post new diagnostics. How long has disk10 been disabled? Did it just happen with this latest parity check or was it already disabled before you started the parity check? Since your parity is invalid your best hope is to get disk10 working again.
JorgeB Posted August 23, 2021 The disk dropped offline, so there's no SMART report. The first thing to do is power cycle the server to see if the disk comes back online. If it does, post new diags. If the disk is healthy, it might be better to do a new config instead of a rebuild, since parity is suspect.
niwmik Posted August 23, 2021 disk10 became disabled during the latest parity check. After building the new computer, I reattached the 6 external Mini SAS cables to the DAS, and on boot-up 4 of the hard drives were not recognized. I unplugged and re-plugged the 6 external Mini SAS cables, and on the next boot-up all drives were recognized. I then ran the latest parity check. I suspect there's something going on with the SAS connection. I will check all connections and reboot to see if disk10 comes back.
niwmik Posted September 19, 2021 I have replaced the SATA and power connections, and disk ST4000DM000-1F2168_Z304M0WL is now showing up in Unassigned Devices. I then assigned ST4000DM000-1F2168_Z304M0WL to disk 10 and got a warning that the disk data will be erased when the array is started. I'm assuming that instead of doing this, I need to do a "New Config" and recreate the parity drives. I'm just not sure exactly which options I need to select in "New Config". Also, will my other settings like shares, dockers, and VMs still be there, or will I really need to start from scratch?
trurl Posted September 19, 2021 Post new diagnostics
niwmik Posted September 19, 2021 Attached is the latest diagnostics. atlas-diagnostics-20210919-1551.zip
JorgeB Posted September 20, 2021 Disk10 looks OK. You should do a new config (Tools -> New config) and re-sync parity. After that's done, run a non-correcting parity check. If errors are found, run another one and post new diags, all without rebooting.
niwmik Posted September 22, 2021 I performed a "New Config" and parity was rebuilt. I then ran a non-correcting parity check and it returned errors. I ran a second non-correcting parity check and it also returned errors. I've attached the latest diagnostics. Could the LSI 9201-16e cards have gone bad? They are about the only things left that I haven't replaced besides the hard drives and power supply. atlas-diagnostics-20210922-0759.zip
JorgeB Posted September 22, 2021 Since the sync errors from both checks don't match, start by running memtest.
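The comparison behind that advice can be sketched in Python. This is only an illustration under stated assumptions: it assumes each sync error is logged on a line containing `incorrect, sector=<number>` (check the wording in your own syslog and adjust the pattern), the helper names are hypothetical, and the sample lines below are invented, not copied from the diagnostics.

```python
import re

# ASSUMPTION: sync errors appear as "... incorrect, sector=<n>" in the syslog.
SECTOR = re.compile(r"incorrect, sector=(\d+)")

def error_sectors(lines):
    """Collect the set of sector numbers reported as incorrect."""
    sectors = set()
    for line in lines:
        match = SECTOR.search(line)
        if match:
            sectors.add(int(match.group(1)))
    return sectors

def compare_checks(first_check, second_check):
    """Split reported sectors into those seen in both checks or only one."""
    a, b = error_sectors(first_check), error_sectors(second_check)
    return {"both": a & b, "only_first": a - b, "only_second": b - a}

# Invented sample lines, for illustration only.
check1 = ["md: recovery thread: P incorrect, sector=1000",
          "md: recovery thread: P incorrect, sector=2000"]
check2 = ["md: recovery thread: P incorrect, sector=2000",
          "md: recovery thread: P incorrect, sector=3000"]

result = compare_checks(check1, check2)
print(result)
```

If two non-correcting checks report the same sectors, the mismatch is stable on disk. If the sectors differ between runs, as they did here, the data is being read inconsistently, which points toward hardware such as RAM, a controller, a cable, or a disk.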
niwmik Posted September 23, 2021 I ran a memtest overnight and it did not return any errors. I went ahead and ordered some LSI 9201-16e replacement cards.
JorgeB Posted September 23, 2021 1 hour ago, niwmik said: I went ahead and ordered some LSI 9201-16e replacement cards. Could be, but if it's not RAM the next likely candidate is a disk. Unfortunately there's no easy way to find out which one except by testing.
niwmik Posted September 23, 2021 How would you go about testing the disks? Thanks for your help by the way.
JorgeB Posted September 23, 2021 Basically you'd need to remove one disk at a time and test without it, but you have a lot of disks, like so:
niwmik Posted September 24, 2021 I'm in the process of testing the disks, starting with changing the parity disk. I'm in the middle of a parity rebuild and I see the following in the system log. I can't tell which disk has the corruption. Not sure what "dm-2" is.
Sep 24 11:08:22 Atlas kernel: XFS (dm-2): Metadata corruption detected at xfs_buf_ioend+0x51/0x284 [xfs], xfs_inode block 0x1796d41c8 xfs_inode_buf_verify
Sep 24 11:08:22 Atlas kernel: XFS (dm-2): Unmount and run xfs_repair
Sep 24 11:08:22 Atlas kernel: XFS (dm-2): First 128 bytes of corrupted metadata buffer:
Sep 24 11:08:22 Atlas kernel: 00000000: 53 f8 8c c5 e2 3f 2c ba bf f3 6c 7f 50 4b 18 fa S....?,...l.PK..
Sep 24 11:08:22 Atlas kernel: 00000010: 4c c8 06 8d 5b b5 0a 13 f6 e4 57 9d 8e e1 b0 86 L...[.....W.....
Sep 24 11:08:22 Atlas kernel: 00000020: d9 7e 70 f0 75 a8 8e 17 da b5 51 3a 59 31 38 f9 .~p.u.....Q:Y18.
Sep 24 11:08:22 Atlas kernel: 00000030: 2d 20 3f ef 04 d2 89 e5 57 67 5b 9d 6c 92 e7 72 - ?.....Wg[.l..r
Sep 24 11:08:22 Atlas kernel: 00000040: 3f 73 f8 9b b4 50 6e ae 74 11 01 27 40 76 3b 38 ?s...Pn.t..'@v;8
Sep 24 11:08:22 Atlas kernel: 00000050: ec 89 37 25 9d 42 11 e3 d3 28 2c 93 a8 e6 5c df ..7%.B...(,...\.
Sep 24 11:08:22 Atlas kernel: 00000060: 01 77 8e a9 22 e2 bf 8b 6b 03 f2 c4 ce 23 3f 1e .w.."...k....#?.
Sep 24 11:08:22 Atlas kernel: 00000070: ab 06 41 e8 81 d0 07 47 7f 3b ec 97 ba 47 f9 df ..A....G.;...G..
Sep 24 11:08:22 Atlas kernel: XFS (dm-2): metadata I/O error in "xfs_imap_to_bp+0x5c/0xa2 [xfs]" at daddr 0x1796d41c8 len 32 error 117
atlas-diagnostics-20210924-1308.zip
JorgeB Posted September 24, 2021 21 minutes ago, niwmik said: Not sure what "dm-2" is. It's disk3.
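One way to check a mapping like this yourself is to ask the kernel for the device-mapper name behind `dm-2`. The sketch below reads the standard Linux sysfs attribute `/sys/block/dm-N/dm/name`; the function name `dm_name` is made up for illustration, and which name it returns for a given array disk on this particular server is an assumption to verify, not something shown in the thread.

```python
from pathlib import Path

def dm_name(dm_device, sysfs_root="/sys/block"):
    """Return the device-mapper name for e.g. "dm-2", or None if unavailable."""
    name_file = Path(sysfs_root) / dm_device / "dm" / "name"
    try:
        return name_file.read_text().strip()
    except OSError:
        # Device does not exist, or this is not a Linux system with sysfs.
        return None

if __name__ == "__main__":
    # Prints the mapper name backing dm-2, or None if there is no such device.
    print(dm_name("dm-2"))
```

The returned name can then be matched against the entries under `/dev/mapper` to see which array disk the XFS messages refer to.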