(SOLVED) Server locks up on parity swap copy


Hello,

 

I'm trying to replace my existing 10 TB parity drive with a 12 TB drive while moving the 10 TB drive down to replace a smaller drive. I've done this a few times before (to get up to 10 TB) and I'm following the guide. Once I swap the drives in the GUI and start the copy, the server locks up. I tried tracking the progress this time, and the last report I saw in the GUI was 86% done. I have attempted the copy 3 times now. Any ideas on what I should look at to find the cause?

nas4-diagnostics-20220322-0829.zip

Edited by Runaround
Issue resolved
Link to comment
3 hours ago, Runaround said:

I'm trying to replace my existing 10 TB parity drive with a 12 TB drive while moving the 10 TB drive down to replace a smaller drive. I've done this a few times before (to get up to 10 TB) and I'm following the guide. Once I swap the drives in the GUI and start the copy, the server locks up. I tried tracking the progress this time, and the last report I saw in the GUI was 86% done. I have attempted the copy 3 times now. Any ideas on what I should look at to find the cause?

You will get better-informed suggestions if you attach your diagnostics to your next post.

Link to comment
Mar 21 21:28:25 NAS4 kernel: BTRFS error (device sdi1): block=892079783936 write time tree block corruption detected

 

Write time corruption detected by btrfs is 9 out of 10 times due to bad RAM.

 

Ryzen with overclocked RAM, like you have, is known to corrupt data (see here). Also make sure Power Supply Idle Control is set correctly.
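
For anyone checking their own logs for the same symptom, here is a minimal sketch of how a saved syslog could be scanned for BTRFS error lines like the one quoted above. The regular expression is an assumption based on that single sample line, the function name `find_btrfs_errors` is hypothetical, and the second sample line is made-up filler. Other kernel versions may format these messages differently.

```python
import re

# Pattern is an assumption derived from the one log line quoted in this thread.
BTRFS_ERROR = re.compile(
    r"kernel: BTRFS error \(device (?P<device>[^)]+)\): (?P<message>.*)"
)


def find_btrfs_errors(lines):
    """Return (device, message) pairs for every BTRFS error line found."""
    hits = []
    for line in lines:
        match = BTRFS_ERROR.search(line)
        if match:
            hits.append((match.group("device"), match.group("message").strip()))
    return hits


sample = [
    # Line quoted from the diagnostics in this thread.
    "Mar 21 21:28:25 NAS4 kernel: BTRFS error (device sdi1): "
    "block=892079783936 write time tree block corruption detected",
    # Hypothetical filler line, only here to show that non-matches are skipped.
    "Mar 21 21:28:26 NAS4 kernel: some unrelated message",
]

for device, message in find_btrfs_errors(sample):
    print(f"{device}: {message}")
```

To run it against a real log, pass an open file object instead of `sample`. A hit only tells you the error was logged; it does not by itself confirm that RAM is the cause.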

 

 

 

Link to comment
1 minute ago, JorgeB said:
Mar 21 21:28:25 NAS4 kernel: BTRFS error (device sdi1): block=892079783936 write time tree block corruption detected

 

Write time corruption detected by btrfs is 9 out of 10 times due to bad RAM.

 

Ryzen with overclocked RAM, like you have, is known to corrupt data (see here). Also make sure Power Supply Idle Control is set correctly.

 

 

 

 

Thanks for the info and the detailed post you linked. I'll be adjusting the memory timings today. I moved my old desktop CPU and motherboard over 4 months ago and didn't think about the memory timings. I only noticed this round of btrfs issues today. It got corrupted last week, and I was able to recover the data and reformat the disk. I thought it was working normally after that. I have a bit of a history of issues with this particular SSD I'm using for cache. I was getting tons of CRC errors from the firmware on my HBA card when I initially installed it, but it had been working well until it had an issue last week. I'm thinking of replacing it just in case.

 

Do you think that is contributing to the server locking up on the parity copy as well? This server hasn't actually locked up in several years.

Link to comment
4 hours ago, JorgeB said:

Could be; it's difficult to say. Lockups with Ryzen are usually more related to the other issue I mentioned.

 

I set the CPU and memory speeds to "auto" in the BIOS. The memory is now running at the speeds from the post you linked earlier. The BTRFS errors are gone from the logs now. Sadly, my server is now more unstable. I see errors showing read issues on sda, so maybe the flash drive hasn't liked the lockups and power cycles? I had the syslog viewer up on the screen when it crashed, so I saved it, and there is a screenshot from a monitor I've connected.

 

Not sure if it's related, but a pre-clear was running every time it crashed. It was in the zero-writing phase.

IMG_0573.JPEG

System Log 2.html

Link to comment

Next update: I powered off the server and took out the flash drive. I ran a check disk on the flash drive on my PC and booted the server back up. There are no more flash drive errors in the log, and the server is no longer crashing within an hour. Curiously, a number of settings were changed or out of date after booting, though. I've corrected them all now.

 

I also changed the Power Supply Idle Control setting (I think; it's named a little oddly).

 

I picked up a smaller drive so I can try to replace my bad disk without having to copy parity first. It's clearing now, so hopefully I'll have more progress tomorrow. 

 

Thanks again for the help provided so far.

Edited by Runaround
Link to comment
On 3/26/2022 at 7:47 AM, JorgeB said:

You can enable the syslog server and then post that after a crash to see if there's anything logged, but if it's a hardware issue there probably won't be.

 

Here is the syslog from the flash drive and a fresh copy of diagnostics. I didn't really see anything in the syslog though. It's really strange that this is only happening for the cache drive copy. I was able to rebuild a disk just fine.

syslog nas4-diagnostics-20220327-1118.zip

Link to comment
5 hours ago, JorgeB said:

Same, this suggests a hardware issue.

I really appreciate your assistance with this. I have already corrected a lot of configuration issues that I didn't know about.

 

I ran memtest and there was an error, so I'll go through each of the DIMMs to figure out what's wrong. For now, I tested one stick and it passed. I have it installed alone and I'm trying to pre-clear one of these 12 TB disks. It's so odd to me that I only see an issue when dealing with these 12 TB disks, though.

Link to comment
  • Runaround changed the title to (SOLVED) Server locks up on parity swap copy
