January 25, 2023 Hi everyone, I had a drive fail and I'm trying to replace it. However, the data rebuild essentially stops, and I've had 4 failed attempts at the rebuild. Each time it stops, always at a different point, it says the rebuild speed is < 1 MB/s and the array shows no activity at all on any drive. In this state, the rebuild won't pause or cancel, nor will the system shut down or reboot. Pulling the cable is the only option. In my last attempt, I was booted in safe mode with the array in maintenance mode and still had the same result. I read a suggestion on another thread that 'new config' might fix the problem, but considering the failed drive is being emulated, I would think that would result in data loss. My diag is attached, and I'm open to any advice. Thanks! diagnostics-20230125-1403.zip
January 25, 2023 Community Expert 9 minutes ago, PZ303 said: "I read a suggestion on another thread that 'new config' might fix the problem, but considering the failed drive and it being emulated, i would think that would result in data loss." New Config would make it impossible to rebuild. You misunderstood that other thread, or it doesn't apply to your situation, or someone is giving bad advice.
January 25, 2023 Author @trurl that's what I thought. The other thread had some more nuance to it, since it wasn't a replacement and rebuild.
January 25, 2023 Community Expert
    Jan 25 11:55:21 SilentCricket kernel: md: recovery thread: multiple disk errors, sector=5291262368
Disable Autostart in Disk Settings. Shut down and check all disk connections, SATA and power, at both ends, including splitters. Unassign disk1, start the array in normal mode (not maintenance mode), and post new diagnostics.
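A line like the one quoted above can be pulled out of a saved syslog with a few lines of Python. This is only a sketch: the regular expression is an assumption based on that single quoted line, and `MdEvent` and `find_md_events` are made-up names, not part of any Unraid tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class MdEvent(NamedTuple):
    """One kernel 'md:' message (hypothetical helper type)."""
    timestamp: str
    message: str
    sector: Optional[int]


# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> kernel: md: <message>"
_MD_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+md:\s+(?P<msg>.*)$"
)
_SECTOR = re.compile(r"sector=(\d+)")


def find_md_events(lines: Iterable[str]) -> Iterator[MdEvent]:
    """Yield every md driver message, with its sector number if one is present."""
    for line in lines:
        match = _MD_LINE.match(line.strip())
        if match is None:
            continue
        sector = _SECTOR.search(match["msg"])
        yield MdEvent(
            match["ts"],
            match["msg"],
            int(sector.group(1)) if sector else None,
        )


if __name__ == "__main__":
    quoted = (
        "Jan 25 11:55:21 SilentCricket kernel: md: recovery thread: "
        "multiple disk errors, sector=5291262368"
    )
    for event in find_md_events([quoted]):
        print(event.timestamp, event.sector, event.message)
```

Feeding it the lines of the syslog from a diagnostics zip would list every md message in order, which makes it easier to see what was logged just before a rebuild stalled.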
January 25, 2023 Author Thanks for the quick reply, @trurl. This is in a racked array with a backplane and I did a shuffle of which bay the disks live in. Disk 1 is unassigned now and that was the replacement disk. The previous occupant of disk 1 threw tons of errors. New diag attached. diagnostics-20230125-1643.zip
January 25, 2023 Community Expert 2 minutes ago, PZ303 said: "The previous occupant of disk 1 threw tons of errors" Do you still have that disk? Possibly nothing wrong with it. Hang on to it in case we need its contents.
January 25, 2023 Community Expert Do any of your disks have SMART warnings on the Dashboard page? (save me the trouble of examining every SMART report).
January 25, 2023 Author I have the original disk. I just mounted it and its replacement outside the array to SMART test them, and it's erroring pretty badly. It's running an extended SMART test now. I ran an extended SMART test on all drives yesterday to troubleshoot and found no issues, including on the new disk.
January 25, 2023 Community Expert Emulated disk1 mounted and shows plenty of data. Didn't see much in that previous syslog except that one line. Makes me wonder if the controller is dead or needs reseating. Usually you would see multiple lines showing the multiple disks involved.
    06:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
        Subsystem: Super Micro Computer Inc AOC-S3008L-L8e [15d9:0808]
        Kernel driver in use: mpt3sas
        Kernel modules: mpt3sas
January 25, 2023 Community Expert Looks like all of your array is on that same controller. Can you read the other disks OK?
January 25, 2023 Community Expert Other disks must be readable or disk1 couldn't be emulated. Did you check the connections?
January 25, 2023 Community Expert NVM, I see what happened. md driver crashed right after that line in syslog. Have you done memtest recently?
January 25, 2023 Author I just shut down, checked and reseated the LSI (correct, there is only one) and rebooted. But like you mentioned, all other disks have been fine, and Disk 1 has been moved to 2 or 3 locations on the backplane to validate that wasn't the issue. I did run a memtest as a part of this a few days ago. I only let it do 1 full pass because there were zero errors. And it wasn't the included memtest; I downloaded memtest onto a new USB and ran it that way.
January 25, 2023 Author One more thing worth mentioning is that both of these drives, the failed one and its replacement, are Renewed drives from Amazon. https://www.amazon.com/gp/product/B0BLJVKQKJ/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1 The first one failed SMART the second I installed it, but I foolishly continued with it, and within a month the system got completely unstable with it just being present. It's the one with the SMART errors I posted. The replacement is the warranty replacement, still renewed, but it's passing SMART. It is throwing some errors though. I'm starting to wonder if I just got double unlucky with renewed drives.
January 25, 2023 Community Expert 1 hour ago, trurl said: "Do any of your disks have SMART warnings on the Dashboard page? (save me the trouble of examining every SMART report)."
January 26, 2023 Author Yeah, sorry. I hadn't seen these errors, and the dashboard shows the drive as having passed SMART. It doesn't show the drive as healthy because it shows it as emulated. Are you also thinking it's just a bad drive?
January 26, 2023 Community Expert Suggest trying a different Unraid release, like v6.10.3; the current kernel might not like your hardware. If the same happens with v6.10.3, that would suggest a hardware problem.
January 26, 2023 Community Expert Are you sure you're talking about the Dashboard page? You might have to spin up the drives to see the thumbs-up/thumbs-down indicator.
January 26, 2023 Author @trurl - correct. All green checks. I was confusing the emulated orange dot with the green check, but the drive doesn't report as failed, and when I ran the SMART test it said passed, and still does. Despite that, the drive's raw read error rate and seek error rate are high. The other drives all show zero while the numbers on this drive are climbing. @JorgeB I was on 6.11.5 and it gives me a downgrade path to 6.11.4, so I'll start there. This is an upgraded build with a new 13700K and an ASRock Z690 Steel Legend with the latest BIOS to support the new 13th-gen chips, plus 128 GB of RAM memtested with zero errors on just one pass. These drives are both Seagate Exos from 2018, so I doubt they are unsupported. I've ordered other drives to replace these, so in a few days I'll be able to tell if I just had 2 bad renewed drives. I guess one question is: could data corruption be causing this issue on rebuild?
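To check whether raw SMART values are actually climbing rather than just large, two attribute reports taken some time apart can be compared. The sketch below assumes the usual ten-column attribute table printed by `smartctl -A` (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value); `parse_raw_values` and `climbing` are hypothetical helper names, and the code says nothing about whether a given raw value is a problem for a particular drive model.

```python
from typing import Dict


def parse_raw_values(report: str) -> Dict[str, int]:
    """Map attribute name -> leading integer of the RAW_VALUE column.

    Assumes the ten-column table layout of `smartctl -A`; lines that do not
    start with a numeric attribute ID (headers, blank lines) are skipped.
    """
    values: Dict[str, int] = {}
    for line in report.splitlines():
        fields = line.split(None, 9)  # at most 10 fields; the raw value is last
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw_tokens = fields[9].split()
        # Raw values can carry trailing text, so only the first token is used.
        if raw_tokens and raw_tokens[0].isdigit():
            values[fields[1]] = int(raw_tokens[0])
    return values


def climbing(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return the increase for every attribute whose raw value went up."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }
```

In practice one would save the output of `smartctl -A` for the same device twice, for example an hour apart, read both files into strings, and call `climbing(parse_raw_values(first), parse_raw_values(second))` to list only the attributes that moved.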
January 26, 2023 Community Expert Just now, PZ303 said: "a downgrade path to 6.11.4" You can download the zip of any current release and just replace all the bz* files on flash to get that version. Of course you should back up the flash first.
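The manual version change described above (back up the flash, then overwrite the bz* files with the ones from the release zip) could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the release zip has already been extracted to a folder, all three paths are supplied by the caller, and `replace_bz_files` is a made-up name rather than an Unraid tool.

```python
import shutil
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def replace_bz_files(flash_dir: PathLike, release_dir: PathLike,
                     backup_dir: PathLike) -> List[str]:
    """Back up the whole flash folder, then copy bz* files from the release.

    backup_dir must not exist yet and must live outside flash_dir.
    Returns the names of the files that were copied onto the flash.
    """
    flash, release, backup = Path(flash_dir), Path(release_dir), Path(backup_dir)

    new_files = sorted(p for p in release.glob("bz*") if p.is_file())
    if not new_files:
        raise FileNotFoundError(f"no bz* files found in {release}")

    # Full copy first, so the previous version can be restored if needed.
    # copytree raises FileExistsError if the backup folder already exists.
    shutil.copytree(flash, backup)

    for src in new_files:
        shutil.copy2(src, flash / src.name)  # overwrite the file of the same name
    return [p.name for p in new_files]
```

Only files whose names start with `bz` are touched; everything else on the flash, including the configuration, is left in place, and the untouched copy in the backup folder allows a rollback.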
January 26, 2023 Author Thanks. I'll try that next. I had seen that option, but the in-GUI downgrade was a bit quicker and I'm going to be too busy with work today to mess with the flash. I presume the minor version downgrade won't really change things, but I figured why not while I'm busy today.
January 26, 2023 Author So I just downgraded to 6.10.3 and the issue persists. The rebuild got unstable and stopped within 5 minutes. I'll wait for the (now third) replacement drive from a different manufacturer and vendor and see if that solves the problem. 2 bad drives, while wildly bad luck, seems to be the most likely scenario here.
January 27, 2023 Community Expert Post new diags from v6.10.3 to see if the Unraid driver is still crashing.
January 28, 2023 Author Here is the diag when I try the rebuild on 6.10.3 silentcricket-diagnostics-20230127-2037.zip