Mechanical disk failure then separate disk file system corruption with one parity disk

JorgeB · September 17, 2023

30 minutes ago, cyberstyx said:

Rebuild finished without errors

Unfortunately not:

Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119384
Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119392
Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119400
Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119408
...
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911496
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911504
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911512
Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911520

So the rebuilt disk will once again be corrupt, you have multiple disk issues, suggesting some hardware problem, like bad controller or PSU, you can try again with a new PSU if available.
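
For anyone wanting to check their own syslog for the same pattern, here is a minimal Python sketch that tallies md read errors per disk. The regex is based only on the line format shown above, and the function name `count_read_errors` and the sample list are made up for illustration. It is not an Unraid tool.

```python
import re
from collections import Counter

# Matches lines like: "... kernel: md: disk9 read error, sector=219119384"
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")

def count_read_errors(lines):
    """Return a Counter mapping disk name -> number of read-error lines."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the log excerpt above.
sample = [
    "Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119384",
    "Sep 16 15:29:13 Tower kernel: md: disk9 read error, sector=219119392",
    "Sep 16 15:29:37 Tower kernel: md: disk8 read error, sector=3911496",
]
counts = count_read_errors(sample)
for disk, n in sorted(counts.items()):
    print(disk, n)
```

Read errors on more than one disk during a single rebuild, as here, point away from one bad drive and toward something the disks share.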

cyberstyx · September 17, 2023

12 minutes ago, JorgeB said:

So the rebuilt disk will once again be corrupt, you have multiple disk issues, suggesting some hardware problem, like bad controller or PSU, you can try again with a new PSU if available.

Since Disk1 was on the motherboard controller and Disk5 was on the PCI controller, it is probably a PSU issue.

Will come back to this as soon as I get a new PSU and replace all power cabling.

Thanks for your help, JorgeB, have a nice Sunday.

trurl · September 17, 2023

Just thought I would comment on this:

On 9/16/2023 at 4:26 AM, cyberstyx said:

Disk 8 is an SSD disk (for VMs and containers) on an expansion PCI card.

SSDs in the parity array cannot be trimmed, and can only be written at parity speed.

The usual place for VMs and containers is an SSD in cache or another pool. If you have these on the array, they won't perform as well due to parity, and will also keep array disks spun up since these files are always open.
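
To illustrate why every write to an array disk is tied to the parity disk, here is a simplified Python sketch of single XOR parity with a read-modify-write update. It is a toy model under my own assumptions (three hypothetical data disks with one 4-byte block each, made-up helper names), not Unraid's actual implementation.

```python
from functools import reduce

def parity_of(blocks):
    """XOR all data blocks together, byte by byte, to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def update_parity(old_parity, old_data, new_data):
    """Read-modify-write: new parity = old parity XOR old data XOR new data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Three hypothetical data disks, one 4-byte block each.
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = parity_of(disks)

# Writing to disk 0 needs the old data and the old parity first,
# then both the data block and the parity block are written.
new_block = b"\xff\x00\xff\x00"
parity = update_parity(parity, disks[0], new_block)
disks[0] = new_block

# The incrementally updated parity matches a full recomputation.
assert parity == parity_of(disks)
```

In this model each data write also costs a read and a write on the parity disk, so an SSD in the array cannot go faster than the parity disk allows.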

cyberstyx · September 17, 2023

14 minutes ago, trurl said:

Just thought I would comment on this:

SSDs in the parity array cannot be trimmed, and can only be written at parity speed.

The usual place for VMs and containers is an SSD in cache or another pool. If you have these on the array, they won't perform as well due to parity, and will also keep array disks spun up since these files are always open.

Thank you for that info.

I will read more about this and ask you again once I have restored the system.

cyberstyx · September 30, 2023

On 9/17/2023 at 12:28 PM, JorgeB said:

So the rebuilt disk will once again be corrupt, you have multiple disk issues, suggesting some hardware problem, like bad controller or PSU, you can try again with a new PSU if available.

After a week's delay in the shop shipping the PSU I ordered, I received the replacement a few days ago and had time today to remove the server PC from its installed location and swap the PSU.

I followed the suggested steps again: made a new config, started the array, and then replaced the disk. The array is now being rebuilt.

Hopefully it will finish tomorrow morning without any new surprises, and I will share my news then.

cyberstyx · September 30, 2023

I stopped the operation: after 195,362,860 writes, Disk9 started reporting errors. I have attached the diagnostics file.

While the rebuild was running, the shared folders were not working properly. Their configuration was still there (in the Shares tab) and I could see the shared folders over the network, but they were empty. When I checked a share folder from the console I got "/bin/ls: reading directory '.': Input/output error". The disk contents under the /mnt disks were there. When I stopped the rebuild, the share folder contents were visible again.

I started the Array in Maintenance Mode so I could do a file system check on Disk 9 (with flag -n). I got this:

Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1
fatal error -- Input/output error

I've stopped here and will wait for further instructions.

tower-diagnostics-20231001-0005.zip

JorgeB · October 1, 2023

Disk 9 dropped offline, and this:

Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }

usually means a bad SATA cable, replace it and try again.

Also the libvirt.img is corrupt, you'll need to restore from a backup if available.
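
As a rough way to see which SATA port is affected, here is a small Python sketch that counts BadCRC SError lines per ata port. The regex is an assumption based on the single log line quoted above, and `count_bad_crc` is a made-up helper name, so treat it as a starting point only.

```python
import re
from collections import Counter

# Matches lines like: "... kernel: ata3: SError: { BadCRC }"
SERROR = re.compile(r"(ata\d+)(?:\.\d+)?: SError: \{([^}]*)\}")

def count_bad_crc(lines):
    """Return a Counter mapping ata port -> number of SError lines with BadCRC."""
    counts = Counter()
    for line in lines:
        match = SERROR.search(line)
        # The braces may hold several space-separated flags; look for BadCRC.
        if match and "BadCRC" in match.group(2).split():
            counts[match.group(1)] += 1
    return counts

# Sample line copied from the log excerpt above.
sample = ["Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }"]
bad_crc = count_bad_crc(sample)
for port, n in sorted(bad_crc.items()):
    print(port, n)
```

If the count for a port keeps growing after the cable has been swapped, the cable was probably not the only problem on that link.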

cyberstyx · October 1, 2023

4 hours ago, JorgeB said:
Disk 9 dropped offline, and this:
Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }
usually means a bad SATA cable, replace it and try again.

Also the libvirt.img is corrupt, you'll need to restore from a backup if available.

Hello JorgeB,

I replaced the SATA cable, ran a filesystem check with -n on the disk, and restarted the rebuild.

cyberstyx · October 2, 2023

On 10/1/2023 at 12:11 PM, JorgeB said:
Disk 9 dropped offline, and this:
Sep 30 21:55:52 Tower kernel: ata3: SError: { BadCRC }
usually means a bad SATA cable, replace it and try again.

Also the libvirt.img is corrupt, you'll need to restore from a backup if available.

The rebuild finished successfully with 0 errors.

I also spot-checked the file structure: all files are there and working as they should.

I have also attached the diagnostics file in case you want to have a look.

I will probably start tackling the libvirt.img corruption tomorrow. I have some Unraid OS backups if needed, and the VM configuration has not changed in a long time.

I will also look into the suggestion about proper SSD usage in Unraid.

Thank you all again for your help, especially JorgeB. Having 4+ hardware failures one after the other (one PSU and various SATA cables), along with the many errors they caused, was something only experts could tackle.

Christos.

tower-diagnostics-20231002-1857.zip

JorgeB · October 2, 2023

Everything looks good, except the already mentioned libvirt corruption.
