[Solved] Errors on Disk 8 (AGAIN!) but During Parity Rebuild

AgentXXL · July 15, 2019

So today I added another Ironwolf 10TB to my system and started pre-clearing/stress testing before I allocate it as a 2nd parity drive. I powered down to add the drive to a bay on my enclosure. As I wanted both parity drives on the motherboard SATA controllers, I had to move my recently rebuilt 8TB (Disk 8; rebuilt successfully earlier last week) to another bay.

Upon rebooting, the recently rebuilt 8TB (Disk #8) threw a bunch of read errors. I suspected the bay might be at fault so I stopped the array and set it to not auto-start. I then powered down and moved the 8TB drive to another bay that had been previously occupied by a HGST 4TB which never threw any errors, read/write/UDMA CRC or otherwise. The HGST 4TB was relocated to another shelf and appears to be fine.

As I had just finished rebuilding onto this 8TB drive earlier in the week, I decided it was safe to go ahead and use the 'Re-enable a drive' procedure. Note that on 6.7.2, the last step doesn't seem to apply as there was no confirmation box to select. Just the Start button and a description saying that a replacement disk was found and starting the array will do a parity sync/rebuild. This is what I wanted.

After starting the array, the disc did start the rebuild. It did appear to be working fine, but I just checked and about 6 hrs in it started throwing some write errors. unRAID appears to have paused the parity rebuild but without thinking about it, I just clicked on Resume. ~~The parity rebuild~~ is continuing but I'm concerned about the write errors. EDIT: After I clicked resume, it's no longer doing a parity rebuild on Disk 8 - it's doing a parity read check. Not sure how this affects the partially completed rebuild on this drive before it started throwing write errors.

What should I do? Should I pause the rebuild? Cancel the rebuild? Let the rebuild continue? I'm also seeing a lot of the following errors from during the parity rebuild and at least one since I resumed the rebuild:

Jul 14 22:18:03 AnimNAS kernel: sd 2:0:1:0: Power-on or device reset occurred

Jul 14 22:18:03 AnimNAS rc.diskinfo[5009]: SIGHUP received, forcing refresh of disks info.

Diagnostics attached. I do still have one more 8TB drive that I could use as another replacement, but I need some suggestions on how to safely proceed. I'm fairly certain the write errors are not actual disk errors but controller/cable/SATA backplane related. I do have spare bays on other shelves that I can move the current rebuilding 8TB to, or to add the spare 8TB to.

One other note: the new 10TB Ironwolf is also doing a pre-clear/stress test run before I add it as a 2nd parity drive. I'm willing to stop this procedure and leave it alone until I get the system back to having a valid Disk 8, on either the existing rebuilding disk, or the spare replacement. Note that the bay which has been producing previous errors is currently unused and will remain so until I can get to doing a full disassembly and clean of all of the SATA connections on all 5 backplane shelves. I've also got new 6G certified MiniSAS cables on the way as the extended length cables that came with my LSI 9201-16i are too long, looped back and forth a couple of times inside the case and potentially a cause of cross-talk/errors.

Any help appreciated. Thanks!

animnas-diagnostics-20190715-0426.zip

Edited July 15, 2019 by AgentXXL
Edit regarding resume funtion - now a parity read check instead of a rebuild

AgentXXL · July 15, 2019

I'm concerned about letting the parity read check continue. As I mentioned when I edited the post above, unRAID paused the rebuild of Disk 8 from parity, but when I clicked Resume, it's no longer doing a rebuild of Disk 8, but a read check.

As Disk 8 is still showing as disabled and contents emulated, I've got a feeling I would be better off cancelling the read check. Then stopping the array, removing Disk 8 that had a partial rebuild and replacing it with my spare 8TB. Especially since with the write errors, I now may have corrupted data on Disk 8.

If I cancel the parity read check and remove this disk, I assume the array will let me replace it with my spare and start the rebuild again from scratch. My only concern is that I see the parity drive has had some writes while this issue occurred, and now I'm uncertain if it's actually valid.

My head is spinning on this one.... what should I do?

Vr2Io · July 15, 2019

You should fix unstable problem first, but now with a disk in emulate and need handle ASAP ....

Pls check power cable, connector, or change PSU. I assume no problem in preclear because most disk in sleep, so power requirement less critical, but in parity check/rebuild, then problem occur.

Edited July 15, 2019 by Benson

Vr2Io · July 15, 2019

3 hours ago, AgentXXL said:

5 backplane shelves.

4 to HBA and 1 to mainboard ？20 bay

3 hours ago, AgentXXL said:

cables that came with my LSI 9201-16i are too long

Actually how long ？But I don't think it is the cause.

-----------------------

Got it in your previous post

Norcotek 20 bay

30“ cable, really not a problem

I will suggest change PSU first.

Edited July 15, 2019 by Benson

JorgeB · July 15, 2019

Disk8 CRC error count is climbing a lot, you should be getting a notification about it, it means there's a connections problem.

Previous diags you posted:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   100   100   000    -    18 (3 203 0)
199 UDMA_CRC_Error_Count    -OSRCK   200   197   000    -    12

Now:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   100   100   000    -    101 (157 167 0)
199 UDMA_CRC_Error_Count    -OSRCK   200   196   000    -    2181

AgentXXL · July 15, 2019

53 minutes ago, Benson said:

4 to HBA and 1 to mainboard ？20 bay

Actually how long ？But I don't think it is the cause.

Yes, each shelf has connectors for 4 SAS/SATA drives, fed by one SFF-8087 miniSAS connector per shelf. The 1st shelf is connected to the motherboard SATA ports using a SFF-8087 to 4 SATA breakout cable. There are 2 more MB SATA ports, 1 used for a 1TB SSD cache drive and 1 for an unassigned devices SSD for some Dockers and a VM. The other 4 shelves are direct connected to the LSI 9201-16i via the 4 SFF-8087 to SFF-8087 MiniSAS cables. 3 shelves have independent power supplied via different rails of my power supply, with the remaining 2 shelves sharing one power connector/rail from the power supply.

The current cables are 30" long each, the new ones I've ordered are 11.5" each. In my past experience, I've always tried to use cables with the shortest required length as cross-talk is far more likely when excess length is looped to make them fit somewhat cleanly in the case.

As mentioned the parity rebuild of disk 8 had write errors after about 2TB had been re-constructed. When I noticed it was paused on the Main tab of the unRAID webgui, I instinctively clicked Resume. That didn't resume the parity rebuild of disk 8 but instead continued it as a parity read check. Disk 8 is still shown as disabled and contents emulated.

I'm going to assume that because the rebuild failed, the original contents of disk 8 are still what's emulated. I think my best option is to move 1 of my 2 10TB drives that are currently connected to the motherboard SATA to another bay. They are full of data so are essentially read-only devices and are not yet part of the array. They are mounted using the Unassigned Devices plugin until I can migrate their data to the array.

I'm not saying that the current miniSAS cables are entirely the cause, but they may play a role. I'm definitely thinking the bigger problems are the bays that haven't had any drives connected to them for the last 6+ years. Oxidation on the SATA connectors for those unused bays is highly probable. But for now, the real issue is getting the array back to not having that 1 drive emulated.

So at this point, my plan is to:

1. cancel the parity read check

2. stop the array and unassign the failed disk 8 and remove it from the system entirely

3. insert my spare 8TB into the bay vacated by 1 of my UD mounted 10TB drives

4. assign the spare 8TB to the Disk 8 slot and start the array

5. wait for the rebuild of Disk 8 to complete

6. when the preclear/stress test of the new 10TB Ironwolf completes, stop the array and assign it to the 2nd parity slot

7. restart the array and let the 2nd parity drive build/sync

This should allow me to continue using the system until I'm ready to do the major disassembly and cleaning/contact restoration. Does this seem reasonable? Obviously my main goal is to troubleshoot and stabilize the system and reduce the potential for more failures.

Edited July 15, 2019 by AgentXXL

JorgeB · July 15, 2019

13 minutes ago, AgentXXL said:

Does this seem reasonable?

Yes, just make sure system notifications are enable and you're getting them for more CRC errors.

AgentXXL · July 15, 2019

Just now, johnnie.black said:

Yes, just make sure system notifications are enable and you're getting them for more CRC errors.

I am seeing the notification regarding UDMA CRC errors... definitely that's one issue that overall I hope to resolve too. At least they aren't too system critical as the system/drive auto-corrects these, but they're a good indicator of connection issues. When they stop, I'll know I have likely corrected the problem(s). Thanks!

Vr2Io · July 15, 2019

1 hour ago, AgentXXL said:

Does this seem reasonable?

I will suggest only rebuild the emulated disk and power/connect necessary disk.

Then start trouble shoot the unstable issue. In this stage, I will power up all disk but may not connect all to system.

Preclear other disk will be final.

Edited July 15, 2019 by Benson

JorgeB · July 15, 2019

55 minutes ago, AgentXXL said:

At least they aren't too system critical as the system/drive auto-corrects these, but they're a good indicator of connection issues.

Yes, and likely related to the problems you're seeing, i.e., the disk eventually becoming disabled.

AgentXXL · July 15, 2019

Before I follow the procedure completely, I decided to spend a little more time investigating the cause of the UDMA CRC errors. It looks like my LSI controller might not be genuine, but I'm not able to confirm that yet. I did cancel the parity read check but decided before I try the rebuild on my spare drive, I've decided to run one more zero pass. I know it isn't strictly necessary, but I wanted to confirm the drive is OK.

I've been doing the zero pass using the Preclear plugin and just disabled the pre and post read stages. At first I was seeing some more UDMA CRC errors, but no actual read or write errors. And no re-allocated sectors at current. I've got another couple of hours to wait before the zero stage completes.

At that time I'm definitely going to shut down and move one of the 10TB UD mounted drives off the MB SATA. Temporarily I'll re-attach it via USB rather than take my chance on more errors from the possibly fake LSI controller. I will of course attach the spare 8TB to the MB SATA for the data rebuild.

Does anyone have any suggestions on how I can test the LSI to see if it's actually causing some of these issues? I'm planning to order a genuine LSI card from a US vendor as a backup, but have to wait until I get my next disability payment next week. In the meantime, if there are procedures anyone can recommend, I'm willing to give it a go.

Before I try and go back to eBay and/or the Chinese supplier of my LSI controller, I want to try and confirm if it's part of the problem. I'll take some close-up photos so I can try and verify if the card is indeed a fake or if it was just pulled from surplussed servers. It did come in a plain brown box like a lot of generic items, but I've found one supplier in California selling them in the genuine LSI box:

https://www.ebay.com/itm/312653733132

Suggestions?

AgentXXL · July 15, 2019

One other option: replace my 6+ year old Norcotek RPC-4220 with a new storage case. This would help eliminate the issue of oxidation/poor contact that might also be part or entirely the cause of my issues. I'll research the current solutions that might be available but if anyone has any suggestions, please let me know. Thanks!

AgentXXL · July 15, 2019

Just heard back from Broadcom/LSI and they are fairly certain that the controller is a fake. I'm going to wait until the zero pass finishes on the replacement 8TB spare drive and then I'll shut the system down and take some photos of the card to send back to LSI for confirmation. Looks like I should order the 'genuine' card ASAP.

Vr2Io · July 15, 2019

There are many possible, may be single or multiple. If shorter 8087-8087 cable available soon, just waiting it for testing.

If suspect HBA problem, could you use reverse 8087-SATA cable to connect BP to MB for test. ( you already have )

If suspect BP problem, could you use stright 8087-SATA cable to connect HBA to HDD.

So two different kind cable could help troubleshoot most parts, I always have that and ready for test need. All my HBA, Expander, 10GNIC, SAS cable .... come from china, may be fake, old pull whatever .... just well test would be fine. Usually they are new, I won't expect it is orginal as this price.

Test plan also importance, to avoid loopping ( I think you in that ), you may need divide whole system in small part, i.e. test HBA, test 1st BP, test 4 HDD ...... and may not only use Unraid for shooting, the main goal was as fast as found out the failure point.

Edited July 15, 2019 by Benson

AgentXXL · July 15, 2019

7 minutes ago, Benson said:

There are many possible, may be single or multiple. If shorter 8087-8087 cable available soon, just waiting it for testing.

.

.

Test plan also importance, to avoid loopping ( I think you in that ), you may need divide whole system in small part, i.e. test HBA, test 1st BP, test 4 HDD ...... and may not only use Unraid for shooting, the main goal was as fast as possible found out the failure point.

The new cables will be here later this week. At least they're coming from a vendor that clearly states they're 6G certified. I ordered one spare just in case.

As for reverse breakout testing, that's definitely a possibility too. I have 4 more of the SFF-8087 to 4 SATA breakout cables so I could use them if I can figure out a solution to easily power the drives under test. I do have lots of hardware on hand so I'm sure I can make up a decent test rig.

As mentioned above though, LSI thinks the controller is a fake based on the serial number. I'll post the photos I take later both here and of course send them to LSI for more help with identifying it (they asked for this). It's still possible it's a re-badged version from IBM/Dell/HP.

And yes, I'm going to be very cautious in my testing so as to not introduce more problems than I'm attempting to solve. That's one of the reasons I'm also going to wait until I have the 2nd parity drive online before doing ANY major disassembly or contact cleaning of the backplanes. Having dual drive failure protection in place before doing much else is a bit more of a safety net.

Thanks for the suggestions!

Dale

AgentXXL · July 16, 2019

Update: the zero of the 8TB completed with no more UDMA CRC errors and no reallocated sectors or anything else to worry about. I shut the array down and moved the disks so that the 8TB replacement was attached to a motherboard SATA port. I did place the 10TB drive into another bay, but since it's UD mounted and read-only data (movies/TV) for Plex, that shouldn't cause any real issues.

While the array was powered down, I also removed the LSI controller and took some pics that I've now sent to LSI to see if they can identify the card. Also, since I had removed all the cables from it, I then booted off a Win10 SSD as the only device attached to the motherboard. From there I proceeded to use the LSI sas2flash utility for x64 so I could download the firmware, bios and nvram files just in case LSI wants to take a look at them. I then proceeded to update the 9201-16i with the latest IT firmware (P20) and the latest bios that came with the firmware package.

After verifying it all looked fine, I shut down again and disconnected the Win10 SSD. I then re-connected all of the cables to the motherboard SATA and LSI controller. The boot was normal from what I could see. Once booted to the login screen, I opened the webgui from my main system and assigned the replacement 8TB to the Disk 8 slot.

Upon starting the array it proceeded with the start of the parity rebuild for Disk 8. I'll be waiting about 15 - 20 hrs for that to complete. Once it's complete, I'll then stop the array and add the new 10TB Ironwolf (which also finished its preclear/stress test run) and assign it as my 2nd parity drive. I'm sure that will take another 20 hrs or so to build the 2nd parity drive.

But for now, the system looks good. I'll try to not add any new data to the array and instead use the cache drive and/or a temporary disk mounted in UD if I need to. I'm attaching the pics of the suspected fake 9201-16i controller, just in case anyone here wants to take a gander. Thanks again @johnnie.black, @Benson and also @Squid who responded in a related post in the Hardware forum.

Dale

AgentXXL · July 17, 2019

The rebuild of Disk 8 was successful while connected to the motherboard SATA port. I've now stopped the array, added the new 10TB Ironwolf as the 2nd parity drive, and restarted the array. The parity sync/build for Parity 2 is underway.... 20 hrs to go for dual failure protection to be active.

LSI got back to me re: the pictures and although the card looks good physically, the labelling is good except for the serial number. The number is not in their manufactured database and is in a format that's wrong for this series of card. They've given me some instructions and tools that should work from a terminal session under unRAID to do some diagnostics. I will wait for the 2nd parity disk to finish syncing before doing anything else.

I also got shipping confirmation on the replacement miniSAS cables and they could be here by the end of this week. Once they arrive I'll shutdown, swap out the cables, disconnect all unRAID disks and do some testing with the controller and the tools that LSI supplied. I have two spare drives that I can use for testing purposes, one 10TB and one 8TB which I can move between bays of the enclosure to run the tests on each of the 4 SAS/SATA connections per backplane.

The only ones I'll not bother testing are the ones that are connected to motherboard SATA ports as none of them have ever reported errors. I may use them initially, since they have been reliable and would be a good indicator if they failed while attached to the LSI controller vs the MB SATA. When attached to the MB SATA ports, those drive bays have been error-free - that would almost definitely point to the LSI controller as the faulting unit.

When I get my next disability check, I'm going to go ahead and order a new LSI card from the eBay vendor that guarantees a new-in-box unit from LSI. It won't hurt to have a spare even if the existing controller proves to be OK.

More to come when I get some of the testing done.

AgentXXL · July 23, 2019

Well, I received the shorter miniSAS cables. I shut the system down after setting the array to NOT autostart. Good thing as it turns out that a few of the disks didn't re-seat properly in their bays on the Norco RPC-4220. After I swapped in the new cables, I applied some contact restorer to the drives before re-insertion. 3 of them were shown as 'Missing', which would have failed since there's only 2 disk failure with the 2 parity disks.

I shut down and decided that there might be a better connection if I remove the door catch frame from the drive trays. Sure enough, after removal of the door catch assembly from each of the drive trays, the drives definitely seated deeper into the case and were definitely mating with the backplane SATA connectors. Upon reboot, 2 of the missing drives had returned, but Disk 8 failed again.

I did some brief testing and alas it does appear that the bay I've assigned to Disk 8 does have another issue, so for now I'm marking it as bad and just not putting a drive in that bay anymore. Of course that means I also ended up with another 'disabled' Disk 8. Since moving it to one of the other bays, it's recognized and unRAID is rebuilding its contents once again.

I thought the disk re-enable process would just do a parity check of Disk 8 but it appears it just went directly into a rebuild of the data from the parity set. Both of my parity drives seem fine, as do all the others. So for now, I'll continue to watch for more UDMA CRC errors. If they continue, I'm going to look into a new storage enclosure and get the genuine LSI 9201-16i controller.

Thanks again to all that helped with troubleshooting this.

Dale

Vr2Io · July 23, 2019

3 hours ago, AgentXXL said:

the drives definitely seated deeper

Seems you found out the root cause.

3 hours ago, AgentXXL said:

I'm going to look into a new storage enclosure

Must.

Edited July 23, 2019 by Benson

AgentXXL · July 23, 2019

12 hours ago, Benson said:

Seems you found out the root cause.

Alas I'm still getting random UDMA CRC errors from a few drives. One of them (an older 4TB) might be the drive itself, but it's almost full so it's not getting new data written to it any longer. That said, I'll replace it soon. I'm also still suspecting the 3rd party LSI controller, so in addition to a new storage enclosure, I'm also considering the genuine LSI controller from an eBay seller in the US. The Chinese vendor has offered to let me return the one from him and will refund my money. I'll likely go ahead with getting the replacement controller.

Any suggestions for 20 - 24 bay storage enclosures anyone? I posted in the hardware topic about this, but the threads mentioning cases all seem to focus on standard PC cases with a lot of drive bay options, but rarely have I seen one that even takes 16 drives. I'm not adverse to buying a used enclosure from eBay as long as the seller guarantees all drive bays are functional.

JorgeB · July 23, 2019

9 minutes ago, AgentXXL said:

One of them (an older 4TB) might be the drive itself

Very unlikely, in fact I don't remember ever seeing CRC errors caused by a drive.

Supermicro rack cases are very well regarded, but expensive.

AgentXXL · July 23, 2019

3 minutes ago, johnnie.black said:

Very unlikely, in fact I don't remember ever seeing CRC errors caused by a drive.

Supermicro rack cases are very well regarded, but expensive.

The only reason I suspect this 4TB drive is I've moved it to new bays 3 times. And it produces errors no matter which bay it's been in. At least these are correctable errors by the drive/system, but it's still annoying.

I've been browsing the Supermicro cases but that means also getting a different controller than my current one, also more expensive. Looks like there are some older SKUs on eBay and Amazon at reasonable prices, complete with dual Xeon motherboards, some RAM and some with a 8 port internal SAS/SATA LSI controller. A full 24 port controller is more than one of these systems, so it'll take a while to save up enough cash so I can purchase. I'll just keep watching eBay and Amazon and see what pops up. I'm sure eventually I'll find a decent deal for the case and controller.

AgentXXL · August 1, 2019

Well, I'm still struggling with errors from my system... I disassembled it completely to clean the connectors on the backplanes. When re-assembling I tested each port on each backplane while connected to my motherboard SATA controllers. This helped me identify that 5 of the 20 drive bays have bad connections. As mentioned above by @Benson it appears that a number of the errors are due to these bad connections.

So a new storage enclosure is a must. I did order the genuine LSI 9201-16i controller as a backup but now I'm thinking that most of my issues are the bad connections on some of the backplane SATA/power connectors. I'm struggling to find another affordable storage enclosure that I can trust. A new Supermicro 24 bay enclosure is way out of my ballpark so I've been watching eBay to see what comes up. Alas even most of those offerings are iffy as there's no guarantee that all 24 bays will still function.

I've thought about buying 4 of the 5x3 adapters that 'converts' 3 x 5.25" bays to 5 x 3.5" bays. That would give me the 20 ports of my current system, but cost wise they are still quite expensive - about $85 CAD each is the best price I've seen so 4 of them would be around $350 CAD. They would only see use until I can find an alternate storage enclosure to the Norco RPC-4220. Norco still sells the 20 and 24 bay units but reviews are still iffy as some users have had bad bays right from new.

Any suggested enclosures? I've posted in the Hardware forum area in the threads about enclosures, but most of them have dated info for cases that are no longer in production. I've found a few on eBay but again they're used so no guarantee that some ports haven't gone bad.

Vr2Io · August 1, 2019

I use China made rack case, not common worldwide.
Really not recommend those 5x3 4x3 cage.

Does swaptable must ？

Edited August 1, 2019 by Benson

AgentXXL · August 1, 2019

50 minutes ago, Benson said:

I use China made rack case, not common worldwide.
Really not recommend those 5x3 4x3 cage.

Does swaptable must ？

Hot swap is not necessary. Only time I want/need to replace a drive is when it fails or I'm upgrading to a larger capacity. I'm looking at Rosewill cases as an alternative to the Norco RPC-4220.

EDIT: Rosewill RSV-L4500 looks to only support 15 drives max. And it looks like they use 3 of the 5x3 drive cages to offer that. Definitely would be cheaper than a new Norco, but alas not enough bays for my existing 17 x 3.5" drives and 2 x 2.5" SSDs (one for cache, one for UD mounted Docker apps/vm).

I might have to look at buying 2 more 10TB drives and migrate my data off the existing 1 x 6TB and 4 x 4TB drives. That would drop my drive count to 14 drives so a Rosewill would work.

Edited August 1, 2019 by AgentXXL
Update Rosewill options...

[Solved] Errors on Disk 8 (AGAIN!) but During Parity Rebuild

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation