Jaster Posted January 9, 2019 (edited January 10, 2019)

Hello guys,

From prior experience I learned not to act too quickly and to ask for advice here first. I had started a parity check, which was showing quite a lot of sync errors (30k+), and after it completed about 70% a disk seems to have failed, so I grabbed the diagnostics and attached them here. Can/should I just replace the failing disk, or is there something else I should try first?

Best regards

knowlage-diagnostics-20190109-1058.zip
JorgeB Posted January 9, 2019

Several issues:
- Are the sync errors expected? They are unrelated to the failing disk. Any recent unclean shutdown?
- disk3 needs a new SATA cable.
- disk1 appears to be failing despite a healthy SMART report; run an extended SMART test.
- You're using a Marvell controller with a port multiplier; that should be replaced ASAP by an LSI.
- disk9 dropped offline, most likely because of the Marvell/port multiplier controller, but since there's no SMART report I will need new diags after rebooting.
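For readers following along, the extended SMART test suggested above can be started and checked with smartmontools. The snippet below is only a minimal sketch: it assumes `smartctl` is installed and on the PATH, the device path `/dev/sdb` is a placeholder rather than the actual disk1 in this thread, and the helper names are made up for illustration.

```python
import subprocess
from typing import List

# smartctl options used here:
#   -t long      start an extended (long) self-test
#   -l selftest  print the self-test log, including aborted or failed runs
SMARTCTL_ACTIONS = {
    "start_extended": ["-t", "long"],
    "selftest_log": ["-l", "selftest"],
}


def smartctl_command(device: str, action: str) -> List[str]:
    """Build the smartctl argument list for a device and a named action."""
    return ["smartctl", *SMARTCTL_ACTIONS[action], device]


def run_smartctl(device: str, action: str) -> str:
    """Run smartctl and return its text output (usually requires root)."""
    result = subprocess.run(
        smartctl_command(device, action),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    return result.stdout


# Hypothetical usage, with /dev/sdb standing in for the disk under test:
#   run_smartctl("/dev/sdb", "start_extended")
#   print(run_smartctl("/dev/sdb", "selftest_log"))
```

The test runs inside the drive itself, so the first call returns immediately; the result has to be read from the self-test log later.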
Jaster Posted January 9, 2019

1 hour ago, johnnie.black said: "-are the sync errors expected? They are unrelated to the failing disk, any recent unclean shutdown?"
Yes. I have issues with an AMD GPU that seems to deadlock the whole server. I couldn't find any solution or advice, so I had several unclean shutdowns.

1 hour ago, johnnie.black said: "-disk3 needs a new SATA cable"
Will do.

1 hour ago, johnnie.black said: "-disk1 appears to be failing, despite a healthy SMART report, run an extended SMART test."
Running it. I'll post as soon as it's done; it seems to take a while.

1 hour ago, johnnie.black said: "-you're using a Marvell controller with a port multiplier, that should be replaced ASAP by an LSI"
I wasn't aware of any issues and had been running this controller for about 3 years (in a different setup). Do I need to stick to LSI, or are JMicron or Broadcom acceptable as well?

1 hour ago, johnnie.black said: "-disk9 dropped offline, most likely because of the Marvell/port multiplier controller but since there's no SMART report will need new diags after rebooting."
The disk is gone after the reboot. I attached the diags taken after rebooting.

knowlage-diagnostics-20190109-1555.zip
JorgeB Posted January 9, 2019

Disk9 looks fine, so the drop was likely controller related. When there's an error on one disk behind a port multiplier, it can time out and cause issues on the other disks there; it appears to me that's what happened.

4 hours ago, Jaster said: "Do I need to stick to LSI or are JMicron or Broadcom acceptable as well?"
JMicron is so-so, not the best choice. LSI was bought by Avago, and Avago then bought Broadcom and took the Broadcom name, so Broadcom (LSI) is the best option for Unraid. Marvell controllers by themselves are not recommended, and Marvell with a port multiplier is just asking for trouble.
Jaster Posted January 9, 2019

The extended SMART test on disk 1 runs for hours and then just times out ('canceled by host'). So Broadcom will do?
JorgeB Posted January 9, 2019

1 hour ago, Jaster said: "So Broadcom will do?"
Yes, Broadcom is LSI. Any HBA with a SAS2008/2308/3008 chipset in IT mode will do, e.g., 9201-8i, 9211-8i, 9207-8i, 9300-8i, etc., and clones like the Dell H200/H310 and IBM M1015; these latter ones need to be crossflashed.
Jaster Posted January 9, 2019

My server keeps crashing on several events, usually connected to switching from VM to Docker load or working a lot with VMs (shutdowns/boots). All VMs run on the cache, but the Docker containers access the array. Could those crashes be connected to the controller?
JorgeB Posted January 9, 2019

8 minutes ago, Jaster said: "Could those crashes be connected to the controller?"
Possible but not very likely, though it would be visible in the syslog, so post new diags right after that happens.
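As a rough illustration of what "visible in the syslog" could look like in practice, the sketch below filters syslog lines for messages the Linux libata layer commonly prints when a SATA link or controller misbehaves. The pattern list is an assumption chosen for illustration and is not exhaustive, the function name is hypothetical, and a clean result does not prove the controller is healthy.

```python
import re
from typing import Iterable, List

# Assumed patterns: phrases commonly seen in libata kernel messages when a
# SATA link resets or a command fails. Extend or adjust as needed.
ATA_TROUBLE = re.compile(
    r"ata\d+(\.\d+)?: "
    r"(hard resetting link|exception Emask|failed command|link is slow to respond)"
)


def controller_trouble_lines(syslog_lines: Iterable[str]) -> List[str]:
    """Return the syslog lines that match one of the assumed ATA error patterns."""
    return [line.rstrip("\n") for line in syslog_lines if ATA_TROUBLE.search(line)]


# Hypothetical usage; the path to the extracted syslog is a placeholder:
#   with open("syslog.txt", encoding="utf-8", errors="replace") as handle:
#       for hit in controller_trouble_lines(handle):
#           print(hit)
```

Matching on the `ataN:` prefix also shows which port the messages belong to, which helps when several disks share one port multiplier.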
Jaster Posted January 10, 2019

It's hard to catch. I asked for advice in the KVM section but didn't get any response. Google also didn't bring up many results, probably because I'm not sure what I am looking for. That's my KVM crash thread.
JorgeB Posted January 10, 2019

There's nothing in those logs about controller issues, so it's likely not that. It could be related to this:

Dec 19 13:23:46 Knowlage kernel: resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
Dec 19 13:23:46 Knowlage kernel: caller pci_map_rom+0x68/0xaf mapping multiple BARs
Dec 19 13:23:46 Knowlage kernel: resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
Dec 19 13:23:46 Knowlage kernel: caller pci_map_rom+0x68/0xaf mapping multiple BARs
Dec 19 13:23:46 Knowlage kernel: vfio-pci 0000:65:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem

But VMs are not my forte and I don't really know what the above means; it could be harmless.
Jaster Posted January 10, 2019 (edited January 10, 2019)

I moved disk 9 from the controller to a mainboard port and it returned. I did the same for disk 3 and also changed the SATA cable, but I still seem to have some CRC errors. Just to be safe, I started a parity check. Once it's complete (~30h) I'll try to fetch the disk 1 SMART data one more time.

knowlage-diagnostics-20190110-1432.zip

I also ordered a 9211-8i controller, as I can already see the difference just from using the mainboard ports instead of my current controller. Thanks for that one too!
JorgeB Posted January 10, 2019

16 minutes ago, Jaster said: "but still seem to have some crc errors."
CRC errors don't reset, and as long as they don't keep increasing the problem is solved.
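Since the CRC counter is cumulative, the useful check is whether it grows between two readings rather than whether it is zero. The sketch below compares two saved `smartctl -A` reports. It assumes the drive reports the counter as attribute 199 (UDMA_CRC_Error_Count), which is common on SATA disks but not guaranteed, and that the report uses the standard ten-column attribute table; the function names are hypothetical.

```python
from typing import Optional


def crc_error_count(smartctl_attributes: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_attributes.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the tenth column is RAW_VALUE.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def crc_still_increasing(before: str, after: str) -> bool:
    """Compare two 'smartctl -A' reports taken at different times."""
    old = crc_error_count(before)
    new = crc_error_count(after)
    if old is None or new is None:
        raise ValueError("attribute 199 not found in one of the reports")
    return new > old
```

Taking one report before the parity check and one after it, an unchanged count would suggest the new cable or port fixed the link, while a higher count would point to a remaining connection problem.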