Any new data added to my disk corrupts the XFS file system (6.9.0-rc2)


s449

I added a new drive to my array 3 weeks ago. It replaced an older, smaller drive, so because of the high-water allocation method it’s my emptiest drive and all new data is being written to it. Everything was working fine on it until I started having issues with rTorrentVPN. I have no clue whether the two are related, a coincidence, or whether one caused the other. rTorrentVPN was slow, crashed a lot, and barely started up. I think it was overloaded with 1000 torrents.

 

Anyway, while trying to download torrents it started to crash my shares, with Unraid saying I had no shares available. The only fix was restarting the array. I ended up bringing my torrent list down to 400 and the stability of the software improved. However, any new data written to the drive kept corrupting it, forcing me to run an XFS Repair, whether the data came from ruTorrent, Deluge, or a plain transfer to the array. The data would start to be added just fine, but it wouldn’t get further than about 20-40GB before it crashed and I found corruption. Removing files was fine. The hard drive otherwise ran just fine. It was only when adding data that it would eventually crash, corrupt itself, and make rTorrentVPN recheck all my torrents. This happened maybe 7 or 8 times in the past few days.

 

XFS Repair always seems to repair it just fine. It usually doesn’t have too many errors and the error logs usually reference the new files I was adding. The most recent one had thousands of lines of the same error oddly. However it managed to fix itself. I checked all my other disks for good measure, no errors.
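For anyone who wants to check a disk without changing anything on it, here is a minimal Python sketch around `xfs_repair -n` (no-modify mode). It relies on the documented behaviour that, with `-n`, an exit status of 1 means corruption was detected and 0 means none was. The function names are made up for illustration, and the device path in the usage note is an assumption: the filesystem must be unmounted, which on Unraid typically means starting the array in maintenance mode.

```python
import subprocess
from typing import List


def xfs_check_command(device: str) -> List[str]:
    """Build a read-only check command; -n tells xfs_repair not to modify anything."""
    return ["xfs_repair", "-n", device]


def describe_check_status(returncode: int) -> str:
    """Interpret the exit status of `xfs_repair -n`."""
    if returncode == 0:
        return "no corruption detected"
    if returncode == 1:
        return "corruption detected"
    return f"unexpected exit status {returncode}"


def check_filesystem(device: str) -> str:
    """Run the dry-run check and summarise the result (needs root and an unmounted filesystem)."""
    result = subprocess.run(
        xfs_check_command(device),
        capture_output=True,
        text=True,
        check=False,  # a non-zero status is an expected outcome here, not a crash
    )
    return describe_check_status(result.returncode)
```

A call would look like `check_filesystem("/dev/md1")`, where `/dev/md1` is only an assumed device name for disk 1; check the actual device on your own system before running anything.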

 

I ran a SMART short test a few times and it never brings up any errors. I’ve tried running the extended test but it seems to time itself out before it finishes oddly enough. Since it’s a 12TB drive I know it’ll take a day.
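The same tests can be started and checked from the command line with `smartctl`. Below is a minimal Python sketch that only wraps three standard invocations: `-t long` starts the extended self-test, `-l selftest` prints the self-test log, and `-H` prints the overall health assessment. The helper names are hypothetical, and `/dev/sdf` is used in the usage note only because that is the device letter of the problem disk in this thread.

```python
import subprocess
from typing import Dict, List


def smart_commands(device: str) -> Dict[str, List[str]]:
    """Return the smartctl invocations for an extended test and its follow-up checks."""
    return {
        "start_extended": ["smartctl", "-t", "long", device],
        "selftest_log": ["smartctl", "-l", "selftest", device],
        "health": ["smartctl", "-H", device],
    }


def run_smartctl(command: List[str]) -> str:
    """Run one smartctl command and return its text output (needs root)."""
    # smartctl encodes status bits in its exit code, so a non-zero code is not treated as a failure
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, `run_smartctl(smart_commands("/dev/sdf")["start_extended"])` starts the test, and running the `selftest_log` command later shows whether it completed, was aborted, or is still in progress.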

 

It’s a shucked WD EasyStore 12TB so I don’t think I can just replace it at Best Buy.

 

My question is, is there anything at all I can do here? Or does this just sound like a bad drive that I need to replace?

 

I can actually afford the space to move all the data off the drive with unbalance and then find a way to safely remove it from my array, although it’ll nearly max out my array. But I am worried it’ll just keep corrupting itself if I mess with it. I’m not sure if I can wait for a warranty replacement for rebuilding my array so it’s either that or buying a new drive and eventually getting the old one warranty replaced and now having 24TB of free space lol.

 

Any advice is appreciated.

 

UPDATE (2/27/2021): Thread 2:

 

UPDATE (3/2/2021): For anyone finding these threads in the future, it's very possible this was simply a power issue. Specifically with using a Molex to Sata cable to power my drives. See thread 3:

 

Edited by s449
Link to comment

You should never keep getting repeated corruption with no obvious disk problem without some other factor being at play.   My main suspects would be:

  • a failing RAM stick.   Have you run memtest recently?
  • power supply related issues (either the PSU itself or the cabling from it).
  • Disk controller issue
  • cabling issue.

You should attach your system’s diagnostics zip file (obtained via Tools -> Diagnostics) to your next post, making sure it covers a period where the problem occurs, to see if that can provide any clues.

Link to comment
9 hours ago, itimpi said:

You should never keep getting repeated corruption with no obvious disk problem without some other factor being at play.   My main suspects would be:

  • a failing RAM stick.   Have you run memtest recently?
  • power supply related issues (either the PSU itself or the cabling from it).
  • Disk controller issue
  • cabling issue.

You should attach your system’s diagnostics zip file (obtained via Tools -> Diagnostics) to your next post, making sure it covers a period where the problem occurs, to see if that can provide any clues.

  • I haven't run a memtest recently, I'll have to do that! I have two 8GB ECC RAM sticks if that matters.
  • My power supply is a fairly cheap EVGA N1 400W, but I did buy it new when I built the server a little under a year ago. Still, is there any way to test whether it's the PSU? None of my other drives are having issues, although like I mentioned, this drive is the one that everything has been writing to because of the high-water allocation method. And in case it wasn't clear, my media share, which gets the most new data, is the only non-system share that doesn't go to the cache first and then get moved to the array by the mover. I haven't seen any mover issues; the corruption has always happened while new data was being added to that share, and therefore to that disk.
  • As far as disk controller, this drive isn't on a PCI card or anything, it's connected straight to the motherboard (SuperMicro X10SLL-F).
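To illustrate why one disk can end up taking every new write under high-water, here is a simplified Python sketch of the rule as I understand it from the Unraid documentation: the mark starts at half the size of the largest data disk, the lowest-numbered disk with free space above the mark is chosen, and the mark is halved when no disk qualifies. The function name is made up, and the free-space figures in the usage note are hypothetical, not taken from the diagnostics.

```python
from typing import Optional, Sequence


def pick_disk_high_water(sizes_gb: Sequence[float], free_gb: Sequence[float]) -> Optional[int]:
    """Return the index of the disk that would receive the next write, or None if all are full."""
    if not sizes_gb or len(sizes_gb) != len(free_gb):
        raise ValueError("sizes_gb and free_gb must be non-empty and the same length")

    mark = max(sizes_gb) / 2  # initial high-water mark: half of the largest data disk
    while mark >= 1:
        # choose the lowest-numbered disk whose free space is still above the mark
        for index, free in enumerate(free_gb):
            if free > mark:
                return index
        mark /= 2  # nobody qualifies, so lower the mark and look again
    return None


# Data disks as in this thread: two 12TB and two 4TB (sizes in GB)
sizes = [12000, 12000, 4000, 4000]
# Hypothetical free space: the new disk (index 0) is nearly empty
print(pick_disk_high_water(sizes, [9000, 2000, 500, 500]))
```

With a nearly empty 12TB disk in the first slot, it is the only disk above the initial 6000GB mark, so it keeps being selected until its free space drops to the mark.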

I attached my diagnostics. I looked through it a bit and yeah it should have included at least 1 or 2 corruptions since it dates back to February 21st. The problem disk is disk 1/sdf if that's helpful info.

 

Thank you for the help!! I really appreciate it.

apollo-diagnostics-20210223-0817.zip

Link to comment
21 minutes ago, JorgeB said:

If it happens always to the same filesystem I would recommend backing up that disk (or moving the data to other disks) and re-formatting.

 

Yeah I'm waiting for the extended SMART self-test to finish (60%) then if that shows no errors, I think I'll use the Unbalance plug-in to move all the data off the disk to other disks. At least my sensitive data. I know this disk contains a lot of that.

 

If anyone's curious or if it helps with future Google searches, the disk in question is a shucked Western Digital EasyStore 12TB whose model number is WD120EDAZ. It's my hottest running disk.

Edited by s449
Link to comment

I started to use Unbalance to move the important stuff from that drive to another drive, just in case. However, Unbalance is stuck moving a larger 60GB file and for some reason is only running at 6-7MB/s, which I think is really slow. That’s the first file, so I’m not sure if it’s just that file bottlenecking and it’ll speed up afterwards. But its ETA is 24hrs for about 700GB of important data to move off the drive.

 

I also noticed that since running Unbalance my server has started making an audible, mechanical chirping sound. It happens every 30 seconds or so. It’s clearly from a hard drive but I’m honestly not too sure which one. I have actually heard this sound before recently, but I’m not sure what caused it and I’m honestly not sure if it was before or after the new drive that’s been having issues. I have two drives active right now: one is reading and the other is writing. Google tells me chirping is a sign of failure.

 

I attached an audio clip of the sound that I recorded on my phone.

 

Once again it’s weird the extended SMART test found no errors. I’ll have to try a memtest soon. I’m pretty baffled right now.

 

Does this new info give anyone any ideas what it could be?

chirping server sound.m4a

Link to comment

Unbalance was extremely slow (5MB/s) so I ended up stopping it to check drive speeds with the DiskSpeed docker. Everything looks fine:

 

benchmark-speeds.png

 

Parity, Disk 1, and Disk 2 are shucked 12TB WD EasyStores. Disk 1 is the one that keeps getting corrupted and that was added to my array in the past 3 weeks. Disks 3 and 4 are WD 4TB Reds. Disk 4 is on the LSI controller.

 

I also double checked short S.M.A.R.T. tests, every drive was okay. Also double checked XFS_Repair, every drive was okay except Disk 1 had like 2-3 errors that were fixed just fine. I still am not too sure if Disk 1 is the chirping because I've heard it before and it went away and I am not confident the first time I heard it wasn't before installing that drive. So I'm going to see if I can dial in which one is doing that. If it is Disk 1, I think I'll conclude Disk 1 is just bad and needs to be taken out of the array.

 

I still need to try a memtest.

 

If anyone has any advice or recommendations for things to try, please let me know. Thank you!

 

Edit: Running Memtest86+ from the Unraid boot menu. 1 pass so far with no errors but I'll leave it overnight just in case.

 

Untitled.png.5698c8091ce02bf22ea67c0c18cc1183.png

Edited by s449
More info
Link to comment
23 hours ago, JorgeB said:

Not much point in running memtest with ECC RAM unless there's an option to disable ECC in the BIOS.

 

That makes sense. I'm not sure there's an option to disable ECC, what would the point of that be?

 

Regardless, I ran it overnight and it did 6 passes with zero errors:

unknown.png.62e2f068ac6880e4d43ad374bf14ef2a.png

 

I just tried to run Unbalance again, this time from my new disk to a different destination disk than the one in the first run that set off the chirping sound. I heard the chirping again, so it sounds like it is coming from the new disk that keeps getting corrupted.

 

I think I'm just going to initiate an RMA for the disk. To avoid having to buy a new disk, I am going to attempt to use Unbalance to move the data off it and then safely remove it from the array using this guide: https://wiki.unraid.net/Shrink_array Although the speed is once again so slow (5MB/s) that I'm worried it'll break while I'm trying to move data off of it. There's about 5TB on it, so at that speed it would take 12.14 days... I'll probably end up just replacing it.
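As a rough check on transfer estimates like these, here is a minimal Python sketch. It assumes decimal units (1MB = 10^6 bytes) and a constant transfer rate, and the function names are made up for illustration. At 5MB/s, a 12.14-day transfer corresponds to roughly 5.2TB of data, and the earlier 700GB of important data at 6-7MB/s comes to a bit over a day.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def eta_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate, in seconds."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


def eta_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Same estimate expressed in hours."""
    return eta_seconds(size_bytes, rate_bytes_per_s) / 3600


def eta_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Same estimate expressed in days."""
    return eta_seconds(size_bytes, rate_bytes_per_s) / 86400


# 700GB at 6MB/s and at 7MB/s, in hours
print(round(eta_hours(700 * GB, 6 * MB), 1), round(eta_hours(700 * GB, 7 * MB), 1))
# about 5.2TB at 5MB/s, in days
print(round(eta_days(5.244 * TB, 5 * MB), 2))
```

The estimate is only as good as the assumed rate; a drive that is retrying reads or writes will not hold a steady speed.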

 

Regardless, not sure what else I could test or try. I hope my detailed updates are helpful to future debuggers who find this thread on Google.

 

Thanks for all the help!

 

Edit, epilogue: I tried another Unbalance run to move data off the drive. However, it ended up crashing the shares and corrupting a whole lot of files, including very sensitive ones. Oddly enough, the files that the XFS Repair logs flagged as corrupted were okay; I tested them and they were fine. A good number of files went to lost+found. It still stressed me out a lot. It’s clear that writing to the drive isn’t the only thing that will corrupt it. I clearly can’t reliably move the data and shrink my array; the drive needs to be replaced.

 

So I instigated an RMA with Western Digital. However they don’t seem to do advance RMAs for drives over $200. I’d need to send in the drive first. And from what I read it could take 2-3 weeks to get a replacement.

 

So I purchased a new Western Digital 12TB EasyStore. It’s the same model number, WD120EDAZ. I normally don’t test drives before putting them in, but for this one I’m running a single pre-clear with no pre-read or post-read. It should be done in about 20 hours, which is stressful, but I want to be sure I’m not just throwing in another unstable drive.

 

I am thinking I can just use the drive in the enclosure running through USB 3.0 as a replacement for the array but Best Buy’s return policy is only 15 days. Not sure if I would get a replacement drive in time. Might as well shuck it.

 

Thanks everyone.

Edited by s449
Epilogue