bus related errors during mover activity

Marcel · July 6, 2016

Hi,

While the mover was busy transferring a large amount of data from the cache drive to the array, the error messages listed below appeared in the log from time to time. The moving process did continue, however.

Should I be worried about this - in terms of data integrity?

Any idea what is happening there?

All drives are S-ATA ones. Parity and Cache drive are connected to S-ATA ports on the motherboard.

The two data drives are connected to the S-ATA ports of the onboard Promise RAID controller (running in IDE mode).

Thanks & regards,

Marcel

/usr/bin/tail -f /var/log/syslog

Jul 6 14:13:43 MBSRV kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

Jul 6 14:13:43 MBSRV kernel: ata5.00: BMDMA stat 0x26

Jul 6 14:13:43 MBSRV kernel: ata5.00: failed command: READ DMA EXT

Jul 6 14:13:43 MBSRV kernel: ata5.00: cmd 25/00:00:10:dc:5e/00:01:15:00:00/e0 tag 0 dma 131072 in

Jul 6 14:13:43 MBSRV kernel: res 51/84:00:10:dc:5e/84:01:15:00:00/e0 Emask 0x30 (host bus error)

Jul 6 14:13:43 MBSRV kernel: ata5.00: status: { DRDY ERR }

Jul 6 14:13:43 MBSRV kernel: ata5.00: error: { ICRC ABRT }

Jul 6 14:13:43 MBSRV kernel: ata5: soft resetting link

Jul 6 14:13:44 MBSRV kernel: ata5.00: configured for UDMA/100

Jul 6 14:13:44 MBSRV kernel: ata5: EH complete

Before the mover continues, the log shows these kinds of messages:

/usr/bin/tail -f /var/log/syslog

Jul 6 14:19:59 MBSRV kernel: ata5.00: error: { ICRC ABRT }

Jul 6 14:19:59 MBSRV kernel: ata5: soft resetting link

Jul 6 14:20:00 MBSRV kernel: ata5.00: configured for UDMA/33

Jul 6 14:20:00 MBSRV kernel: ata5: EH complete

So apparently the system is switching to UDMA33 mode which would explain the very slow speeds.

The specs for the S-ATA ports on the ASUS P4P800E deluxe motherboard:

ICH5R Southbridge:

2x S-ATA

Promise 20378 onboard RAID Controller

2x S-ATA

The Southbridge also has support for 2x UDMA100.

The Promise Controller supports 1x UDMA133.

But as mentioned all the drives are connected to the S-ATA ports.

Marcel · July 6, 2016

Unfortunately, about 10 minutes after writing the initial post the server went entirely offline. No web GUI, no SMB shares, no telnet, no ping.

The hard drives didn't make any noise at this point, so I assume the mover had stopped working as well.

Since I didn't have the chance to connect a monitor at this point, I performed an unclean shutdown.

System is performing parity check now. Seems there is still some data left on the cache drive. So the mover did indeed stop.

I found that one ~1GB vdisk file is still on the cache drive but also on the disk within the array. That's probably where things stopped.

Squid · July 6, 2016

Since you're on v5, you're going to have to post your syslog. http://lime-technology.com/forum/index.php?topic=9880.0

Sight unseen, with the errors you mentioned, along with the fact that the shares wound up disappearing, odds are your cache drive dropped offline.

Check your cabling to the drive.

Marcel · July 6, 2016

Please find syslog.txt attached. Obviously this was created after the reboot.

Cables are new and properly attached.

Also, don't all these messages about different UDMA modes point to the onboard Promise Controller not working properly with unRAID? Otherwise the drives should just show as S-ATA (?)

Thanks a lot!

syslog.txt

RobJ · July 6, 2016

Both the ICRC and BadCRC error flags indicate communications issues across the cable, i.e. corrupted packets. The exception handler then slows the protocol down, because *sometimes* that improves the communication link, although at a slower speed. The indication here is that the SATA cable or one of its connectors is defective, no matter how new it is. However, packets that fail a CRC test are resent, so there should not be any data loss or corruption. Each packet is retried until its CRC check passes, in each direction.
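
The two symptoms described above, ICRC flags followed by a drop in the negotiated UDMA mode, can be tallied from a syslog with a short script. The following is a minimal Python sketch and not part of unRAID. The names summarize_ata and speed_dropped and the two regular expressions are illustrative assumptions based only on the log lines quoted in this thread; other kernels may format these messages differently.

```python
import re

# "ata5.00: configured for UDMA/100" -> port "ata5", mode "UDMA/100"
MODE_RE = re.compile(r"(ata\d+)\.\d+: configured for (UDMA/\d+)")
# "ata5.00: error: { ICRC ABRT }" -> port "ata5", flags " ICRC ABRT "
ERR_RE = re.compile(r"(ata\d+)\.\d+: error: \{([^}]*)\}")


def summarize_ata(lines):
    """Count ICRC errors and collect negotiated UDMA modes per ATA port."""
    summary = {}
    for line in lines:
        mode = MODE_RE.search(line)
        if mode:
            entry = summary.setdefault(mode.group(1), {"crc_errors": 0, "modes": []})
            entry["modes"].append(mode.group(2))
            continue
        err = ERR_RE.search(line)
        if err and "ICRC" in err.group(2).split():
            entry = summary.setdefault(err.group(1), {"crc_errors": 0, "modes": []})
            entry["crc_errors"] += 1
    return summary


def speed_dropped(modes):
    """True if the last negotiated UDMA mode is slower than the first."""
    speeds = [int(m.split("/")[1]) for m in modes]
    return len(speeds) >= 2 and speeds[-1] < speeds[0]


# Sample lines copied from the log excerpts earlier in this thread.
SAMPLE = [
    "Jul 6 14:13:43 MBSRV kernel: ata5.00: error: { ICRC ABRT }",
    "Jul 6 14:13:44 MBSRV kernel: ata5.00: configured for UDMA/100",
    "Jul 6 14:19:59 MBSRV kernel: ata5.00: error: { ICRC ABRT }",
    "Jul 6 14:20:00 MBSRV kernel: ata5.00: configured for UDMA/33",
]

result = summarize_ata(SAMPLE)
for port, info in result.items():
    print(port, info["crc_errors"], info["modes"], speed_dropped(info["modes"]))
```

To scan a real log, pass an open file object instead of SAMPLE, for example summarize_ata(open("/var/log/syslog")). A port that shows both CRC errors and a falling mode sequence matches the cable or connector pattern described above.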

Marcel · July 7, 2016

@Rob: Thank you for the explanation. The cable seems to be OK, but the drive seems to have issues. I took it out and, still assuming it had bad sectors, tried to run a check for them on a Windows machine. Windows reported I/O interface timeouts several times during the process. So, pretty much the same behaviour as with unRAID.

Question: I didn't know that drives that are linked via an S-ATA port can also be accessed via UDMA modes. Is this a specific unRAID feature, or is the S-ATA implementation keeping the old legacy modes as a subset in general?

I have started the array without the parity disk (so no parity check again). Then I copied all the data still on the cache drive over the network.

After replacing the cache drive with a new disk, I let the parity sync run overnight.

What is getting annoying, however, is that the webGUI seems to fail eventually during almost every longer operation. It did so again this morning.

Trying to restart it I got segmentation faults.

I ended up safely shutting down unRAID from the command line and after reboot everything came up normally.

I have attached my current syslog.

Is the instability of the webGUI a known issue that I have to live with? Or can I do something about it?

Thanks a lot!

syslog.txt

RobJ · July 7, 2016

@Rob: Thank you for the explanation. Cable seems to be ok but the drive seems to have issues. I took it out and - still assuming it had bad sectors - tried to run check for them on a Windows machine. Windows reported I/O interface timeouts several times during the process. So, pretty much the same behaviour as with unRAID.

Exceptions due to bad sectors are clearly marked as such with characteristic error flags. You may wish to read The Analysis of Drive Issues.

Question: I didn't know that drives that are linked via a S-ATA port can also be accessed via UDMA modes. Is this a specific unRAID feature or is the S-ATA implementation keeping the old legacy modes as a subset in general?

It's Linux. Even the latest kernel (with AHCI etc) uses the UDMA modes and routines. After all, there's nothing faster than the DMA assisted transfers. I looked and only found the mpt*sas modules not mentioning UDMA, although they do mention using "Direct Access ATA", which could be an indirect reference to DMA usage. But even mvsas uses UDMA.

Your BIOS however is not configured to use native SATA modes, it's still using an IDE emulating mode. In the BIOS settings, change the SATA support to a native SATA mode, AHCI if available, anything but an IDE emulating mode.

What is getting annoying however is, that it seems during almost every longer operation the webGUI will fail eventually. So it did this morning.
Trying to restart it I got segmentation faults.

Is the instability of the webGUI a known issue and I have to live with it? Or can I do something about it?

You do realize you are using a really old version of unRAID? The current versions have hundreds of fixes, both in the Linux kernel and in the unRAID-specific modules.

But no, almost everyone had perfect or near perfect usage with 5.0.6.

Segmentation faults have 2 main causes - memory defects and dependency conflicts. The first is easy to test, run the Memtest from the unRAID boot menu, all night at least. It should run perfectly with absolutely no errors at all. The second is harder to find, especially if you are running clean, and I don't see any plugins or packages loaded. You might try 'unraidsafemode', and see if it still happens. Are you running any packages or plugins (check /extra, /packages, /plugins, and /config/plugins)?

That's old hardware, with a BIOS from 2004(!), and a Pentium 4. The motherboard should support 64 bit, but I don't know about your CPU. Have you tested it for 64 bit support? Upgrading to unRAID v6 would be better, if possible. (But you would need at least 1GB of RAM; 2GB would be better.)
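
One common way to test a CPU for 64 bit support on Linux is to look for the lm (long mode) flag in /proc/cpuinfo. The following is a minimal Python sketch of that check; the function name supports_64bit and the sample text are illustrative assumptions, and the sample flags are made up for the example rather than taken from the poster's machine.

```python
def supports_64bit(cpuinfo_text):
    """Return True if an x86 /proc/cpuinfo dump lists the 'lm' (long mode) flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Flags are whitespace-separated; match 'lm' as a whole word only.
            return "lm" in value.split()
    return False


# Hypothetical excerpts for illustration only.
SAMPLE_64 = "model name\t: Example CPU\nflags\t\t: fpu vme sse sse2 lm\n"
SAMPLE_32 = "model name\t: Example CPU\nflags\t\t: fpu vme sse sse2\n"

print(supports_64bit(SAMPLE_64))
print(supports_64bit(SAMPLE_32))
```

On the server itself, the same function can be fed the real file contents, for example supports_64bit(open("/proc/cpuinfo").read()).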

It's probably not related to the faults, but you are trying to run with 512MB of RAM, which is extremely tight. I wouldn't; I would prefer 1GB. It's possible to do, but you won't have any significant caching available.

You're having serious network issues, with 3 to 5 periods when the network was down (periods of minutes and hours). It's not a loose cable, as some of the outages would be in seconds, not minutes and longer. Appears to be an issue outside your server.
