Please help with parity check errors


Hi!

 

Last night my monthly parity check started, and this morning I found the server reporting errors:

- the parity check is currently paused because two disks, Disk1 and Disk5, have read errors

- the array is up (I cannot read SMART values though: "A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.")

- Disk1 is disabled

 

I am not sure what the best way to proceed is right now. Any help would be greatly appreciated.

Edited by taalas

Hi @JorgeB

Sorry, I have now attached a diagnostic log. I have restarted the server in the meantime, though, since I couldn't access the SMART logs. I can provide the relevant part of my syslog from last night if needed. It seems to me that the server might have had problems with the SATA controller the drives are attached to (in addition to the 2 drives I mentioned earlier, my SSD cache did not work properly either).

After restarting the server, everything seems fine (except for Disk1 still being disabled).

Please let me know whether I can provide any more information.

Thanks!

spire-diagnostics-20220101-1146.zip

7 minutes ago, taalas said:

I can provide the relevant part of my syslog from last night if needed.

That could help us see the cause of the problem. My first guess would be the Marvell controller: those are not recommended, and not surprisingly it's the one disks 1 and 5 are using.

 

As for the current status, disk1 looks fine and the emulated disk is mounting, so the next step would be to rebuild on top of the same disk, though I would recommend replacing that controller with one of the recommended ones.

 

 


This is the part of the syslog where the problems started (during the scheduled parity check). When I checked this morning, the parity check was paused, disk1 showed 1550 errors, disk5 showed 169 errors, and the parity check showed 169 errors (the same as disk5).

 

Since /dev/sdj (my SSD cache) logged the same errors over and over again, I cut the log after a couple of those.

 

Since restarting the array (and copying some files to another server for safety), no more errors have been logged on any drive. Before the restart this was different: errors on disk5 kept increasing, albeit slowly.

 

Some questions to clarify:

- Since the array is working fine for now, should I stop using the system or can I read from the array and have the docker applications running? I stopped the mover process to reduce the workload though.

- disk1 is almost empty, and none of the data currently on it matters. Does that help in any way?

- Should I wait for rebuilding/etc. until I have the replacement controller?

- What should I expect in terms of data loss (since there were parity errors logged)?

 

Sorry, just trying to get a clearer picture of the situation.

syslog-127.0.0.1.log.zip

Edited by taalas
8 minutes ago, taalas said:

Since /dev/sdj (my SSD cache) logged the same errors over and over again, I cut the log after a couple of those.

Yeah, that one is also using the Marvell controller, and the log does appear to confirm it was a problem with that controller.

 

9 minutes ago, taalas said:

- Since the array is working fine for now, should I stop using the system or can I read from the array and have the docker applications running? I stopped the mover process to reduce the workload though.

You can keep using it, but the array is unprotected if another disk fails. Also avoid using disk1: since it is emulated, every access to it requires reading all the other disks, so it will not perform as well.

 

10 minutes ago, taalas said:

- disk1 is almost empty, and none of the data currently on it matters. Does that help in any way?

Still needs to be rebuilt, but there's less data at risk.

 

11 minutes ago, taalas said:

- Should I wait for rebuilding/etc. until I have the replacement controller?

If you can get a new controller soon I would wait. If it's going to take more than a week or so, it's probably best to try to rebuild as is.

 

12 minutes ago, taalas said:

- What should I expect in terms of data loss (since there were parity errors logged)?

Those errors are not in the log you posted, but as long as the parity check was non-correcting it will be fine. If it was correcting, there's a (very) small chance of parity corruption, which in turn would result in a corrupt disk1 rebuild.
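
To illustrate that last point, here is a minimal sketch of single-parity reconstruction, assuming plain XOR parity across equally sized disks. The byte values and the helper name `xor_blocks` are made up for illustration, and this is not Unraid's actual code. It shows that a missing disk is recomputed from parity plus every other data disk, so a wrong parity byte turns into a wrong byte in the rebuilt disk.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks"; real disks work the same way, just much larger.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
disk5 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk5])

# disk1 is disabled: emulate it from parity and all remaining data disks.
emulated_disk1 = xor_blocks([parity, disk2, disk5])

# Assumed failure case: a correcting check wrote one bad byte to parity.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
corrupt_rebuild = xor_blocks([bad_parity, disk2, disk5])
```

With intact parity, `emulated_disk1` equals the original `disk1`. With the damaged parity, only the byte at the damaged position differs in `corrupt_rebuild`, which is why the risk is limited to the sectors a correcting check actually rewrote.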


Hi @JorgeB

 

thanks for your support, very much appreciated!

 

Just to make sure I am doing the right thing:

 

I ordered an ASM1064 controller, which has arrived by now. To rebuild the disabled disk I would now do the following:

 

- make a screenshot of my disk assignments

- power down the server

- replace the controller

- start the server

- check that all disk assignments are still correct (this should theoretically work even with a different controller, correct?)

- stop the array

- unassign the disabled disk

- start the array

- stop the array

- reassign the disabled disk to the slot

- start the array

- rebuild disk from parity

 

Is this the correct order of steps?
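
The assignment check in the steps above can also be done against a written record instead of a screenshot. This is a minimal sketch, assuming each slot's disk identifier was noted before the controller swap and again afterwards. The function name, slot names and identifiers are placeholders for illustration and are not taken from this server.

```python
def changed_assignments(before, after):
    """Return {slot: (old_id, new_id)} for every slot whose identifier differs."""
    slots = sorted(set(before) | set(after))
    return {
        slot: (before.get(slot), after.get(slot))
        for slot in slots
        if before.get(slot) != after.get(slot)
    }

# Placeholder records; fill in the real identifiers shown for each slot.
before_swap = {"parity": "EXAMPLE_ID_A", "disk1": "EXAMPLE_ID_B", "disk5": "EXAMPLE_ID_C"}
after_swap = {"parity": "EXAMPLE_ID_A", "disk1": "EXAMPLE_ID_B", "disk5": "EXAMPLE_ID_C"}

differences = changed_assignments(before_swap, after_swap)
```

An empty `differences` dictionary means every slot still reports the same identifier. Any entry in it is a slot to look at before starting the array.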

 

I am still a bit worried about the parity errors the check reported (before it aborted at about 27%). Can I assume these were just read errors caused by the controller malfunction? The parity check was non-correcting.

 

Thanks!


Basically looks correct.   

 

However, before the step where you reassign the disabled disk, I would add a check that the emulated disk has the expected contents, since all the rebuild process does is make the physical disk match the emulated one. If the emulated disk does NOT contain what you expect (or does not mount), then a different action might be more appropriate.
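
One way to make that check less of an eyeball exercise is sketched below. It assumes the emulated disk is mounted at a path such as `/mnt/disk1` and that you have a list of files you expect to find on it. The helper names and the example paths are hypothetical.

```python
import os

def list_files(mount_point):
    """Collect the paths of all files under mount_point, relative to it."""
    found = set()
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            found.add(os.path.relpath(os.path.join(root, name), mount_point))
    return found

def missing_files(mount_point, expected):
    """Return the expected relative paths that are not present on the disk."""
    return sorted(set(expected) - list_files(mount_point))
```

For example, `missing_files("/mnt/disk1", ["share/example.txt"])` returns an empty list when the file is there. A non-empty result, or a mount point that turns out to be empty, is a reason to stop and ask before rebuilding.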

 
