
Mechanical disk failure then separate disk file system corruption with one parity disk


cyberstyx
Solved by JorgeB


30 minutes ago, cyberstyx said:

Rebuild finished without errors

Unfortunately not:


 

Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119384
Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119392
Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119400
Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119408
...
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911496
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911504
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911512
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911520

 

So the rebuilt disk will once again be corrupt. You have multiple disk issues, which suggests a hardware problem such as a bad controller or PSU. You can try again with a new PSU if one is available.
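To see at a glance how many disks are affected, the read errors in the syslog can be tallied per disk. This is only a minimal sketch in Python: the function name `summarize_read_errors` is made up for illustration, and it assumes the kernel lines look exactly like the excerpt above.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119384
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def summarize_read_errors(lines):
    """Return {disk: (error_count, lowest_sector, highest_sector)}."""
    sectors = defaultdict(list)
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            sectors[match.group(1)].append(int(match.group(2)))
    return {disk: (len(s), min(s), max(s)) for disk, s in sectors.items()}


# The eight lines quoted in the post above.
sample = [
    "Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119384",
    "Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119392",
    "Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119400",
    "Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119408",
    "Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911496",
    "Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911504",
    "Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911512",
    "Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911520",
]

for disk, (count, low, high) in sorted(summarize_read_errors(sample).items()):
    print(f"{disk}: {count} read errors, sectors {low}..{high}")
```

Errors on more than one disk within the same minute, as here, are what point toward shared hardware (controller, power, cabling) rather than a single failing drive.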

12 minutes ago, JorgeB said:

So the rebuilt disk will once again be corrupt. You have multiple disk issues, which suggests a hardware problem such as a bad controller or PSU. You can try again with a new PSU if one is available.

Since Disk1 was on the MB controller and Disk5 was on the PCI controller, it is probably a PSU issue.

 

Will come back to this as soon as I get a new PSU and replace all power cabling.

 

Thanks for your help, JorgeB. Have a nice Sunday.


Just thought I would comment on this:

On 9/16/2023 at 4:26 AM, cyberstyx said:

Disk 8 is an SSD disk (for VMs and containers) on an expansion PCI card.

SSDs in the parity array cannot be trimmed, and can only be written at parity speed.

 

The usual place for VMs and containers is an SSD in the cache or another pool. If you keep them on the array, they won't perform as well because of parity, and they will also keep array disks spun up since these files are always open.
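A toy model may help show where both limits come from. It assumes single parity computed as a plain XOR across the data disks, uses short byte strings in place of disk blocks, and the helper names are invented for this sketch; it is not Unraid's actual code. It also assumes a trimmed block reads back as zeros, which depends on the drive.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_block(disks, parity, index, new_data):
    """Read-modify-write: update one data disk and return the matching parity.

    Every data write also needs the old data and the old parity,
    so the parity disk sets the pace for all array writes.
    """
    old_data = disks[index]
    new_parity = xor_blocks(parity, old_data, new_data)
    disks[index] = new_data
    return new_parity


disks = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(*disks)

# A normal write keeps parity consistent with the data disks.
parity = write_block(disks, parity, 1, b"\xaa\xbb")
assert parity == xor_blocks(*disks)

# Simulated TRIM: the SSD now returns zeros for a block, but nobody
# told the parity disk, so parity no longer matches the data.
disks[2] = b"\x00\x00"
assert parity != xor_blocks(*disks)
```

Once parity and data disagree like this, rebuilding any other disk from parity would reconstruct wrong contents, which is why trimming is not done on array devices.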

14 minutes ago, trurl said:

Just thought I would comment on this:

SSDs in the parity array cannot be trimmed, and can only be written at parity speed.

 

The usual place for VMs and containers is an SSD in the cache or another pool. If you keep them on the array, they won't perform as well because of parity, and they will also keep array disks spun up since these files are always open.

Thank you for that info.

 

I will read more about this and ask you again once I have restored the system.

  • 2 weeks later...
On 9/17/2023 at 12:28 PM, JorgeB said:

So the rebuilt disk will once again be corrupt. You have multiple disk issues, which suggests a hardware problem such as a bad controller or PSU. You can try again with a new PSU if one is available.

After the shop took an extra week to send the PSU I bought, I received the replacement a few days ago and had time today to remove the server PC from its installed location and swap the PSU.

 

I followed the suggested steps again: I made a new config, started the array, and then replaced the disk. The array is now being rebuilt.

 

Hopefully it will finish tomorrow morning without any new surprises, and I will share my news then.

 

 

 


I stopped the operation: after 195,362,860 writes, Disk 9 was reporting errors. I have attached the diagnostics file.

 

While the rebuild was running, the shared folders were not working properly. Their configuration was still there (in the Shares tab) and I could see the shared folders over the network, but they were empty. When I checked a share folder from the console I got "/bin/ls: reading directory '.': Input/output error". The contents of the individual disks under /mnt were there. When I stopped the rebuild, the share folder contents were visible again.

 

I started the Array in Maintenance Mode so I could do a file system check on Disk 9 (with flag -n). I got this:

 

Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1
fatal error -- Input/output error
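This looks like output from `xfs_repair` in no-modify mode (`-n`). As a rough sketch, the Python below builds that read-only check and separates "the device could not be read at all" from "the filesystem was inspected and has problems". The function names are hypothetical, and the device path in the usage comment is an assumption because the post does not give one.

```python
import subprocess


def build_check_command(device):
    """Command for a read-only XFS check; -n means no modifications are made."""
    return ["xfs_repair", "-n", device]


def device_unreadable(output):
    """True if the check died on an I/O error before inspecting the filesystem."""
    return "superblock read failed" in output and "Input/output error" in output


def run_check(device):
    """Run the check and return (exit_code, combined_output)."""
    result = subprocess.run(
        build_check_command(device), capture_output=True, text=True
    )
    return result.returncode, result.stdout + result.stderr


# The output reported in this post:
reported = (
    "Phase 1 - find and verify superblock...\n"
    "superblock read failed, offset 0, size 524288, ag 0, rval -1\n"
    "fatal error -- Input/output error\n"
)
assert device_unreadable(reported)

# Usage (device path is an assumption, adjust to your system):
#   code, output = run_check("/dev/md9p1")
```

An I/O error at offset 0 says nothing yet about the state of the filesystem itself; it means the disk could not be read, so the hardware path has to be fixed before any repair is attempted.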

 

I've stopped here and will wait for further instructions.

 

 

tower-diagnostics-20231001-0005.zip

4 hours ago, JorgeB said:

Disk 9 dropped offline, and this:

Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }

usually means a bad SATA cable, replace it and try again.

 

Also the libvirt.img is corrupt, you'll need to restore from a backup if available.

Hello JorgeB,

 

I replaced the SATA cable, ran a filesystem check with -n on the disk, and restarted the rebuild.
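For anyone scanning their own syslog for the same symptom, here is a small Python sketch that pulls the port and the flags out of an `SError` line like the one quoted above. The helper names are invented for illustration, and the cable hint simply encodes JorgeB's rule of thumb rather than a definitive diagnosis.

```python
import re

# Matches kernel lines such as:
#   Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }
SERROR = re.compile(r"(ata\d+): SError: \{([^}]*)\}")


def serror_flags(line):
    """Return (port, [flags]) for an SError line, or None if it is not one."""
    match = SERROR.search(line)
    if not match:
        return None
    return match.group(1), match.group(2).split()


def suggests_cable_problem(flags):
    """Rule of thumb from this thread: BadCRC usually means a bad SATA cable."""
    return "BadCRC" in flags


line = "Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }"
port, flags = serror_flags(line)
if suggests_cable_problem(flags):
    print(f"{port}: check or replace the SATA cable")
```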

On 10/1/2023 at 12:11 PM, JorgeB said:

Disk 9 dropped offline, and this:

Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }

usually means a bad SATA cable, replace it and try again.

 

Also the libvirt.img is corrupt, you'll need to restore from a backup if available.

The rebuild finished successfully with 0 errors.

 

I also did a sample check of the file structure: all files are there and all are working as they should.
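A sample check like this can also be scripted. The Python below is only a sketch of one way to do it, not what was actually run here: it picks random files under a directory and reads each one to the end, collecting any I/O errors. The function name and the path in the usage comment are assumptions.

```python
import os
import random


def sample_read_check(root, sample_size=20, seed=None):
    """Fully read a random sample of files under root.

    Returns a list of (path, error_message) for every directory that
    could not be listed and every sampled file that could not be read.
    """
    failures = []
    paths = []
    # onerror records directories that fail to list (for example with
    # "Input/output error") instead of silently skipping them.
    walker = os.walk(root, onerror=lambda e: failures.append((e.filename, str(e))))
    for directory, _subdirs, files in walker:
        paths.extend(os.path.join(directory, name) for name in files)

    chosen = random.Random(seed).sample(paths, min(sample_size, len(paths)))
    for path in chosen:
        try:
            with open(path, "rb") as handle:
                while handle.read(1024 * 1024):  # read in 1 MiB chunks
                    pass
        except OSError as error:
            failures.append((path, str(error)))
    return failures


# Usage (path is an assumption, adjust to the rebuilt disk):
#   for path, error in sample_read_check("/mnt/disk9", sample_size=50):
#       print(path, error)
```

An empty result only means the sampled files were readable; it does not verify that their contents are correct, which would need checksums taken before the failure.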

 

I have also attached the diagnostics file in case you want to have a look.

 

I will probably start tackling the libvirt.img corruption tomorrow. I have some Unraid OS backups if needed, and the configuration of the VMs has not changed in a long time.

 

I will also look into the suggestion about the proper use of SSDs in Unraid.

 

Thank you all again for your help, especially JorgeB. Four or more hardware failures one after the other (one PSU and various SATA cables), plus the many errors they caused, are something only experts could have untangled.

 

 

Christos.

tower-diagnostics-20231002-1857.zip
