[SOLVED] Stuck on stopping array (6.8.3 trial)

trumpets · August 2, 2020

We set up a new Unraid system using a repurposed motherboard with 2 new Seagate IronWolf drives. After the setup was done, we started copying data from a Windows Server to Unraid. After the first drive was copied, we mounted an old second data drive as Disk 2 and, after formatting it, started copying again. However, in the middle of the copy the system seems to have crashed, and the original Disk 1 was empty with the Unmountable message. Making a rookie mistake, we just formatted Disk 1 again and restarted the copy. Since copying over the network was slow, we decided to mount the drive as an Unassigned drive and copy from an SSH terminal using Midnight Commander.

The copying started out fine, one directory at a time, until we started on a directory that holds about 130 GB of data. We left the system to copy, and when I came back I saw the error screen below.

image.png.414964b40a58c00a851f7c65b7a02506.png

After aborting, I got back to the copy screen, which showed a segmentation fault error.

image.png.02461c816bbfc5116973f253f0fe0f35.png

So I tried to stop the array, but that doesn't seem to work. I tried to reboot, but although I was kicked out of the SSH session, I could log back in and the system was still running; it had not actually rebooted. The load on the system is currently still at 3.00.

image.png.8bfb86b43d770d1a9cbcbc3bf08c9019.png

We cannot do a hard reset since we are accessing it remotely while copying.
We are also worried that after the reboot we might again lose a disk or, worse, data.

Any advice on what we might be doing wrong?

The diagnostics file is attached.

Appreciate any advice. TIA.

tower-diagnostics-20200802-1400.zip

JorgeB · August 2, 2020

You might need to force a reboot. There are also lots of ATA errors from disk1; start by replacing the cables.

Aug  2 13:12:38 Tower kernel: ata2.00: exception Emask 0x50 SAct 0x7e000 SErr 0x4090800 action 0xe frozen
Aug  2 13:12:38 Tower kernel: ata2.00: irq_stat 0x00400040, connection status changed
Aug  2 13:12:38 Tower kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Aug  2 13:12:38 Tower kernel: ata2.00: cmd 60/40:68:40:e2:83/05:00:06:00:00/40 tag 13 ncq dma 688128 in
Aug  2 13:12:38 Tower kernel:         res 40/00:08:a0:f5:3a/00:00:3a:00:00/40 Emask 0x50 (ATA bus error)
Aug  2 13:12:38 Tower kernel: ata2.00: status: { DRDY }
Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Aug  2 13:12:38 Tower kernel: ata2.00: cmd 60/40:70:80:e7:83/05:00:06:00:00/40 tag 14 ncq dma 688128 in
Aug  2 13:12:38 Tower kernel:         res 40/00:08:a0:f5:3a/00:00:3a:00:00/40 Emask 0x50 (ATA bus error)
Aug  2 13:12:38 Tower kernel: ata2.00: status: { DRDY }
Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Aug  2 13:12:38 Tower kernel: ata2.00: cmd 60/40:78:c0:ec:83/05:00:06:00:00/40 tag 15 ncq dma 688128 in
Aug  2 13:12:38 Tower kernel:         res 40/00:08:a0:f5:3a/00:00:3a:00:00/40 Emask 0x50 (ATA bus error)
Aug  2 13:12:38 Tower kernel: ata2.00: status: { DRDY }
Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Aug  2 13:12:38 Tower kernel: ata2.00: cmd 60/40:80:00:f2:83/05:00:06:00:00/40 tag 16 ncq dma 688128 in
Aug  2 13:12:38 Tower kernel:         res 40/00:08:a0:f5:3a/00:00:3a:00:00/40 Emask 0x50 (ATA bus error)
Aug  2 13:12:38 Tower kernel: ata2.00: status: { DRDY }
Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
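
To get a quick feel for how many of these errors a syslog contains, and on which port, the lines can be tallied with a short script. The following is only a minimal sketch: the function name `count_failed_commands` and the regular expression are illustrative assumptions based on the log format shown above, not part of Unraid or its diagnostics.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   "... kernel: ata2.00: failed command: READ FPDMA QUEUED"
FAILED_CMD = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")


def count_failed_commands(lines):
    """Return a Counter keyed by (port, command) for failed ATA commands."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line.rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Three lines copied from the excerpt above; only two are "failed command" lines.
sample = [
    "Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED",
    "Aug  2 13:12:38 Tower kernel: ata2.00: status: { DRDY }",
    "Aug  2 13:12:38 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED",
]

for (port, command), n in count_failed_commands(sample).items():
    print(f"{port}: {n} x {command}")
```

If every failure lands on the same port (here `ata2`), that points at one cable, port, or drive rather than a system-wide problem.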

trumpets · August 2, 2020

Waiting for the new cables, then we will give it another go. Thank you.

trumpets · August 19, 2020

We are under lockdown due to the pandemic, so I was only able to replace all the cables yesterday.
I ran SMART tests and the drives appear to PASS.

Now I have started copying the data off the unassigned drive to Disk 1, and the logs are flooding again with the kernel: ata2.00 failed command messages.

image.png.f6094e5a56f2913bdf7f19c74d876c32.png

The latest diagnostics file is also attached.

Thanks in advance.

tower-diagnostics-20200802-1400.zip

JorgeB · August 19, 2020

Those are CRC errors, usually a bad SATA cable, but could also be the controller or in extremely rare cases the disk itself.

trumpets · August 19, 2020

I replaced it with new cables yesterday. I'm beginning to suspect it's the drive. Previously it had 166 UDMA CRC errors. Now, after copying 130 GB of data, I got a notice:

Tower: Unraid Parity disk SMART health [199]
Warning [TOWER] - udma crc error count is 2096
ST1000VN002-2EY102_Z9CBWT4L (sdc)
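
As far as I understand it, the raw value of SMART attribute 199 is a running total, so what matters is how much it grew between two readings, not the absolute number. A minimal sketch of that arithmetic with the two counts mentioned above (the helper name `crc_error_growth` is made up for illustration):

```python
def crc_error_growth(previous, current):
    """Return how many new UDMA CRC errors were logged between two readings."""
    if current < previous:
        # Assumption: the counter is cumulative, so it should never go down.
        raise ValueError("current reading is lower than the previous one")
    return current - previous


# Readings from this post: 166 before the copy, 2096 after it.
growth = crc_error_growth(166, 2096)
print(f"New CRC errors during the copy: {growth}")
```

A count that stays flat after a cable or port change would suggest the link problem is fixed, even though the old total remains on the drive.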

JorgeB · August 19, 2020

Still much more likely to be the cable (or port) than the drive, you should try another cable in a different port.

trumpets · August 19, 2020

Ah, I didn't think of the port. Will try that tomorrow.
Can I just shut it down, change the ports, and turn it on again, or are there other steps I need to do as well, like removing it from the array?

JorgeB · August 19, 2020

6 minutes ago, trumpets said:

Can I just shut it down and change the ports and turn on again

Just this.

trumpets · August 29, 2020

As recommended, I have moved the drive to a different port and also swapped in a different cable.

The system has so far been stable and is not showing any of the ATA errors.

Thank you.
