High load caused by disk I/O wait

November 16, 2014

After a couple of weeks running I am finding that my system (5.0.5) stops responding on the network. Unraid is running on ESXi with a hardware controller passed directly through.

I have 3 disks. Disk3 takes nightly backups from an ESXi box over NFS as well as continual backups from my Mac via Time Machine (AFP). The first sign that something has gone wrong is Time Machine no longer running. When I check after that, the AFP shares are not working and the SMB shares appear to be partially working: they give a share listing but no files beneath it. The web interface does not respond.

I then log into the machine on its local console and find the load is 70+, with 99.8% I/O wait. Looking at /mnt, all the disks are there; disks 1 and 2 list OK, but trying to ls disk3 hangs the console. I am unable to do a clean shutdown, presumably because everything has stalled waiting for this disk.
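A load of 70+ with almost all CPU time in wait usually means many processes are stuck in uninterruptible sleep (state D) on blocked I/O. As a rough sketch of how to see which ones, the script below reads /proc/<pid>/stat and lists every process in state D. It assumes Python 3 is available on the box, which may not be true of a stock Unraid install, and the function names are my own.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    left, right = line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state

def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return blocked

if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

If the list is full of afpd, nfsd or smbd entries, that at least shows which services were waiting on the disk when it stalled.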

I do a hard reset and the system comes back up with disk3 missing from /mnt; the web interface still does not respond. I log in to the console and reboot, after which the system comes back up and runs as normal.

I have done SMART checks on disk3 that have come back fine, no errors are reported on the web interface. Could this be a hardware issue with disk3?

If anyone could give me some pointers to look at next time it happens that would be much appreciated. I will certainly check the syslog before rebooting next time. It has just happened for the second time so am doing a bit more digging as it's no longer a one-off.


November 18, 2014

The first order of business would be to post a syslog captured during the issue and a smartctl log of the drive in question.

You can still do the smartctl capture and post so members can assist in reviewing drive health.
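Both captures can be scripted so they are quick to run while the problem is happening. This is only a sketch: it assumes Python 3.7+ is available, that the syslog lives at /var/log/syslog, and that the destination is somewhere not on the stalled disk (for example the flash drive). The device name /dev/sdX is a placeholder, and the function names are my own. `smartctl -a` prints all SMART information for a device.

```python
import shutil
import subprocess
from pathlib import Path

def smartctl_command(device):
    """Build the smartctl call; -a prints all SMART information."""
    return ["smartctl", "-a", device]

def capture(device, dest_dir, syslog_path="/var/log/syslog"):
    """Copy the syslog and a SMART report into dest_dir; return saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []

    # Copy the syslog first: it is the cheapest step and the most useful.
    syslog = Path(syslog_path)
    if syslog.is_file():
        target = dest / "syslog.txt"
        shutil.copyfile(syslog, target)
        saved.append(target)

    # smartctl may be missing or may stall on a hung drive, so guard it.
    try:
        result = subprocess.run(
            smartctl_command(device),
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("smartctl did not complete:", exc)
    else:
        target = dest / "smart.txt"
        target.write_text(result.stdout)
        saved.append(target)
    return saved

if __name__ == "__main__":
    # Placeholder device and destination; adjust both before running.
    print(capture("/dev/sdX", "/boot/diagnostics"))
```

The timeout is best effort. If the drive is truly hung, smartctl can itself end up blocked on I/O and may not exit when killed, which is why the syslog is copied first.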

I would suggest looking at the ESXi host via the vSphere Client to see whether any events are posted for the drive. Significant delays on a drive are posted as events, but that may not be the case with passthrough.

While it may not be a hardware issue with disk3 itself, a cabling issue could cause the drive to intermittently disconnect and reconnect. I once had a drive whose cable was not making good contact, and any heavy I/O to that drive would vibrate the cable, causing the drive to go offline and online a lot. You can spot that one in a syslog review (not the physical cable fault itself, but the intermittent drive errors it causes).
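As a starting point for that syslog review, something like the following pulls out lines that often accompany a drive dropping off the bus or I/O stalling. The patterns are my own guess at common Linux kernel messages (libata link resets, block-layer I/O errors, hung-task warnings); they are not an exhaustive or authoritative list, and the exact wording varies by kernel version and driver.

```python
import re

# Illustrative patterns only; extend them to match what your syslog shows.
PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"hard resetting link"),
    re.compile(r"link is slow to respond"),
    re.compile(r"exception Emask"),
    re.compile(r"task .+ blocked for more than \d+ seconds"),
]

def suspicious_lines(lines):
    """Return (line_number, text) for every line matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(pattern.search(text) for pattern in PATTERNS):
            hits.append((number, text))
    return hits

def scan_file(path="/var/log/syslog"):
    """Scan a syslog file; the default path is an assumption."""
    with open(path, errors="replace") as fh:
        return suspicious_lines(fh)

if __name__ == "__main__":
    for number, text in scan_file():
        print(number, text)
```

Repeated link resets clustered around the time the shares stop responding would point toward cabling, the controller or the passthrough setup rather than the disk itself.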


December 22, 2014

Author

Thanks for your advice on that.

It turned out I had a drive available, so I swapped disk3 for a brand new drive of larger capacity. A few weeks passed and I'm in the same place again. Granted, this does not rule out the intermittent connectivity you suggested.

Having had a chance to look at the syslog this time, I think it could be something to do with AFP. I was in the middle of extracting some AFP errors from the syslog when the machine locked up on me and I had to reset. Same procedure as before, with two reboots required, then all back to normal.

I guess I'll wait until next time and transfer the log off first.

