6 month old Toshiba disk running at < 50% of its original speed, array operations painfully slow.

-C- · December 12, 2023

In July I added a new 18TB Toshiba MG09ACA18TE.

I was pleased with its speed- as good as, if not better than the 20TB WD drives I have, which were pretty speedy.

Since then, any array operations have felt painfully slow. With 20TB parity drives, the last parity check took over 4 days and the one that's currently running is estimated to take over 5 days. It's fluctuating between 35 & 100 MB/sec.

Diskspeed docker is showing that its speed is now less than half of what it was when first installed:

The most recent test (Nov 15) was made with all other Docker apps disabled and the VM service off.

Speeds of the other disks have remained stable.

Only thing I can think that happened between the Jul & Sep tests is that I cleared the drive and formatted it as ZFS- could that have had this affect? If so, is there anything I can do about it?

JorgeB · December 12, 2023

50 minutes ago, -C- said:

and formatted it as ZFS- could that have had this affect?

No, the diskspeed test it's filesystem agnostic.

-C- · December 12, 2023

6 hours ago, JorgeB said:

it's filesystem agnostic.

As I had hoped/ assumed, which begs the question- what else could be causing this?

JorgeB · December 13, 2023

Could be a disk issue, also check that the disk is still linking at SATA3 speeds.

-C- · December 15, 2023

I'm seeing this for the disk

image.png.2c73e5b08ab730f045014a1e7d03f7b6.png

There are no SMART errors.

Is it worth trying a different SATA cable, or is there no way it's a cabling issue if it's syncing at 6 Gb/s?

JorgeB · December 15, 2023

Unlikely that a cable would be the problem without leaving anything logged in the syslog.

-C- · April 22

Following multiple strange issues with my server and increasing weird things going on with this drive, I became ever more suspicious of it. Some examples:

Unreliable reading or writing and huge usage of resources when attempting to
unable to copy files off it using Unbalanced (it would eventually time out while doing its initial scan of certain directories and return to its start page).
Strange permissions issues or permissions changing and trying to set a chmod command would hang indefinitely.
If I tried to get the size of a directory using File Manager's calculate button on certain directories, it would hang indefinitely.
There are some directories I've successfully used Unbalanced to clear files out of and onto another drive, but now can't delete the empty directories, even via CLI.

So I've been moving data off. Long, slow process as I'm not sure whether a move is going slowly due to large backup image files, or whether it has hung.

I've nearly finished and what's left on there is of low importance, so I'm about to remove the drive.

My suspicion though is that there's something wrong with Unraid's management of the file system/ FUSE regarding this particular disk and it's been causing huge performance issues system-wide. For another clue to something being awry, have a look at the transfer rate of it, taken a few minutes ago (there was no actual activity on that drive that I know of):

This is a regular SATA HDD- there's no way it could ever attain that kind of speed.

Now that I've got any valuable data off and am at a point of being able to clear it, what is the best method of clearing the drive completely so that any FUSE linking to it is destroyed and I can try it again as if it was a fresh drive to Unraid?

JorgeB · April 23

9 hours ago, -C- said:

My suspicion though is that there's something wrong with Unraid's management of the file system/ FUSE regarding this particular disk and it's been causing huge performance issues system-wide.

That seems very unlikely to me, my first guess would be a disk problem.

9 hours ago, -C- said:

This is a regular SATA HDD- there's no way it could ever attain that kind of speed.

Not sure what you mean, it's reporting 800KB/s, not MB/s.

-C- · April 23

2 hours ago, JorgeB said:

That seems very unlikely to me, my first guess would be a disk problem.

Thanks Jorge, I don't know much about the inner workings, but I figure if I can take this disk out of use and try deleting all links to it from Unraid, then try again as if it was a fresh disk to Unraid- if it continues to have issues then I know it's disk issues, if not then it was a software issue. It certainly isn't happy as it is and causing system-wide issues, so I have to do something.

2 hours ago, JorgeB said:

Not sure what you mean, it's reporting 800KB/s, not MB/s.

Haha- oh yes, it was late! Still shouldn't have been anything accessing that drive (the only things that are left on there are old archive files), yet it's had this constant read rate at idle for a while now. Nothing listed for this HDD under disk activity or Open Files.

Can you help with how to remove, clear and replace/ retry?

Thanks

JorgeB · April 23

If you have a spare, and assuming parity is valid, the best way would be to do a direct replacement, if that disk is the problem, it would start to work better after it's removed.

-C- · April 23

OK, thanks. I will remove it and see how that goes. I don't have a spare drive and prices are not favourable at the moment, so I won't be buying another. If this drive's bad, I'd like to get it replaced by Amazon or the manufacturer.

My worry is- how do I prove there's an issue when there are no SMART errors or anything other than poor performance to show there's an issue with it?

JorgeB · April 23

You can run an extended SMART test, if it takes at longer than the expected time, and that time is in the SMART report, if suggests a disk problem.

-C- · May 3

It took over a week, but it's finished without error:

SMART Extended Self-test Log Version: 1 (1 sectors)

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed without error       00%      7154

This is the only ZFS formatted drive in the array, formatted as ZFS so I could take snapshots of ZFS cache drives.

During the extended SMART test I could see the disk IO was pretty much constant using htop. Only on this disk, and as said previously, this drive only has backups, archives and snapshots; nothing that should be being accessed, so I don't see why there's constant drive IO. It's the only drive that behaves like this.

I think what I'd like to do is swap things around- I have a slow 10TB drive that's currently in the array being used as a media drive I'd like to reformat in ZFS and use for backups, archiving and snapshots. I'd like to format this 18TB drive as XFS and put media shares on it.

Makes most sense to me if I reformat the 18TB drive as XFS, move the media from the 10TB drive onto it, then reformat the 10TB drive as ZFS. Can you suggest the easiest/ best way to achieve this?

Thanks

JorgeB · May 3

You can use you favorite copy tool, if you don't have one, use the Dynamix File Manager plugin or Midnight Commander.

-C- · May 4

Thanks Jorge- I've reformatted the drive as XFS and run a new speedtest:

image.png.b36cff3b30b1732d134ccc9bf7cb8c1e.png

The 10/7/23 & 5/5/24 lines are pretty much exactly on top of each other.

So my hunch turned out to be correct- it was a software issue, which is good as I don't have to worry about returning/ replacing. It's the fastest spinning drive in the server too, so good to have it back to its former sprightliness .

My next step is to move the data from the 10TB drive onto this one, then reformat it to ZFS and set it up to receive snapshots from the ZFS cache drives. I've just run a DiskSpeed test on it, which gave the same result as all previous tests. Will be interesting to see if I have any issues with it once it's formatted as ZFS and receiving snapshots.

Edit: To say that the server generally is feeling much more responsive again. There was definitely a system-wide issue related to the fact that this drive was constantly being accessed, crushing the system's performance. The Load Average was always 4 or more, even at idle. now it's back down to under 1!

Edited May 4 by -C-

JorgeB · May 5

10 hours ago, -C- said:

So my hunch turned out to be correct- it was a software issue

I find it very strange that a filesystem would affect the diskspeed test, but hopefully it remains good after the change.

-C- · May 5

6 hours ago, JorgeB said:

I find it very strange that a filesystem would affect the diskspeed test, but hopefully it remains good after the change.

Yes, certainly strange.

I followed Space Invader One's guide to use his ZFS Snapshot and Replication Script for snapshotting from cache to the ZFS disk array. Although that all seemed to work OK, It's possible that I messed something up during setup. It's also possible it could be something else to do with the various ZFS systems required for the script to function. Maybe FUSE didn't like something?

I really don't know, but I doubt it was to do with the filesystem, more likely an issue/ glitch in something that was interacting with it.

-C- · May 10

Have now reformatted the slower drive as ZFS to accept the snapshots. Glad to see that this hasn't affected its speed, although I will be keeping an eye on it.

image.png.d4846b6abd8dd87ad9f2bd32600d22ee.png

For the first time in a very long time, when I checked Unraid first thing yesterday (a couple of days after setting up the new snapshotting, after things had settled), all the drives were spun down. Since then though, every time I check the drive access, I'm seeing constant disk reading and writing affecting all drives, so I need to investigate this further. It's possible that it's just related to me accessing the GUI...

6 month old Toshiba disk running at < 50% of its original speed, array operations painfully slow.

Recommended Posts

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

JorgeB

Link to comment

-C-

Link to comment

-C-

Link to comment

Join the conversation