Data rebuild stopping


PZ303

Hi everyone, I had a drive fail and I'm trying to replace it. However, the data rebuild essentially stops, and I've had 4 failed attempts at the rebuild. Each time it stops, always at a different point, it reports a rebuild speed of < 1 MB/s and the array shows no activity at all on any drive. In this state, the rebuild won't pause or cancel, nor will the system shut down or reboot. Pulling the cable is the only option.

 

In my last attempt, I was booted in safe mode with the array in maintenance mode and still had the same result.

 

I read a suggestion on another thread that 'new config' might fix the problem, but considering the drive has failed and is being emulated, I would think that would result in data loss.

 

My diagnostics are attached, and I'm open to any advice.

 

Thanks!

diagnostics-20230125-1403.zip

Link to comment
9 minutes ago, PZ303 said:

I read a suggestion on another thread that 'new config' might fix the problem, but considering the drive has failed and is being emulated, I would think that would result in data loss.

New Config would make it impossible to rebuild. You misunderstood that other thread, or it doesn't apply to your situation, or someone is giving bad advice.

 

Link to comment

I have the original disk. I just mounted it and its replacement outside the array to SMART test them, and the original is erroring pretty badly. It's running an extended SMART test now.

image.thumb.png.36eed2311a1156c857e73f9d15e03139.png

 

I ran an extended SMART test on all drives yesterday to troubleshoot and found no issues, including on the new disk.

Link to comment

Emulated disk1 mounted and shows plenty of data.

 

Didn't see much in that previous syslog except that one line. Makes me wonder if the controller is dead or needs reseating. Usually you would see multiple lines showing the multiple disks involved.

 

06:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
	Subsystem: Super Micro Computer Inc AOC-S3008L-L8e [15d9:0808]
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

 

 

Link to comment

I just shut down, checked and reseated the LSI (correct, there is only one) and rebooted. But like you mentioned, all the other disks have been fine, and Disk 1 has been moved to 2 or 3 different locations on the backplane to rule that out as the issue.

 

I did run a memtest as part of this a few days ago. I only let it do 1 full pass because there were zero errors. And it wasn't the included memtest; I downloaded memtest onto a new USB and ran it that way.

Link to comment

One more thing worth mentioning is that both of these drives, the failed one and its replacement, are Renewed drives from Amazon. https://www.amazon.com/gp/product/B0BLJVKQKJ/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1 The first one failed SMART the second I installed it, but I foolishly continued with it, and within a month the system got completely unstable with it just being present. It's the one with the SMART errors I posted.

The replacement is the warranty replacement, still renewed, but it's passing SMART. It is throwing some errors though.

 

image.thumb.png.900a951d228ec2dc93cd5b9d304ec190.png 

 

I'm starting to wonder if I just got double unlucky with renewed drives. 

Link to comment

@trurl - correct. All green checks. I was confusing the Emulated orange dot with the green check, but it doesn't report as failed, and when I ran the SMART test it passed, and it still says passed. Despite that, the drive is logging a lot of raw read errors and seek errors. The other drives all show zero, while the numbers on this drive keep climbing.
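
For anyone reading those raw values: on Seagate drives, the Raw_Read_Error_Rate and Seek_Error_Rate raw numbers are widely reported by the community to be packed 48-bit counters, with the upper 16 bits holding the error count and the lower 32 bits holding the number of operations. This layout is an assumption, not something confirmed in this thread or by vendor documentation. If it holds, a large and climbing raw value can simply mean more reads or seeks, not more errors. A minimal sketch of that arithmetic, where the function name `split_seagate_raw` and the sample values are hypothetical:

```python
from typing import Tuple


def split_seagate_raw(raw: int) -> Tuple[int, int]:
    """Split a 48-bit SMART raw value into (error_count, operation_count).

    ASSUMPTION: upper 16 bits = errors, lower 32 bits = operations.
    This is the commonly reported Seagate layout, not a documented guarantee.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    errors = raw >> 32              # upper 16 bits
    operations = raw & 0xFFFFFFFF   # lower 32 bits
    return errors, operations


# Hypothetical raw values, not taken from the drives in this thread.
print(split_seagate_raw(200222222))          # big number, zero errors
print(split_seagate_raw((3 << 32) | 1000))   # three errors over 1000 operations
```

Under that assumption, the number to watch is the first element of the result; the second will keep growing on any healthy drive that is in use.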

 

@JorgeB I was on 6.11.5 and it gives me a downgrade path to 6.11.4, so I'll start there. This is an upgraded build with a new 13700K and an ASRock Z690 Steel Legend with the latest BIOS to support the new 13th-gen chips. 128 GB of RAM, memtested with zero errors on just one pass. These drives are both Seagate Exos from 2018, so I doubt they are unsupported.

 

I've ordered other drives to replace these, so in a few days I'll be able to tell if I just had 2 bad renewed drives. I guess one question is: could data corruption be causing this issue on rebuild?

 

image.png.a522441ba2896cdf45eee90593518ce0.png

image.thumb.png.88468846be1ddc0317ee6a3d1add527b.png

Link to comment

Thanks. I'll try that next. I had seen that option, but the in-GUI downgrade was a bit quicker and I'm going to be too busy with work today to mess with the flash. I presume the minor version downgrade won't really change things, but I figured why not while I'm busy today.

Link to comment

So I just downgraded to 6.10.3 and the issue persists. The rebuild got unstable and stopped within 5 minutes.

 

I'll wait for the (now third) replacement drive, from a different manufacturer and vendor, and see if that solves the problem. Two bad drives, while wildly bad luck, seem to be the most likely scenario here.

Link to comment
