New Unraid user - One drive is super slow and I don't know how to fix it


sdub


I've been an Unraid user for about a month now... see attached diagnostics.  My array consists of 8 WD SATA drives ranging from 4 to 12TB, one 8TB Seagate drive, and one 12TB WD parity drive.  I'm using a single 1TB Samsung NVMe drive for cache.  It's worth noting that the Seagate drive is a freshly shucked STEB8000100, purchased in May 2018; I never had issues with it when it was connected via USB to my Windows box.  SMART diagnostics on all drives show healthy.  Write caching to the drives IS enabled.  The Fix Common Problems plugin shows no warnings.

 

The migration of all 40TB of my data into the array went slowly but very smoothly... write speeds of about 80MBps with parity enabled during the entire copy.  By chance, "Drive 7", the Seagate drive, didn't end up with any data when the copy finished.  The system has been up and running fully for about a week.

 

This week, as new data was being written to the cache drive and moved to the various disks overnight, data started getting written to "Drive 7".  My first sign of trouble was that the mover was never finishing.  Coincidentally, I'd turned on mover logging, so I was able to watch what was going on with "tail -f /var/log/syslog | grep -i move".  Several gigabytes would get moved over at full speed, then it would slow to a crawl... upwards of 7 hours for a 4GB file, about 150kBps.  I also noticed that my CPU iowait was upwards of 10%.
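
In case it helps, this is roughly how I've been keeping an eye on it from the console (a rough sketch; "sdg" is just a placeholder for whatever device letter disk 7 actually gets, and the iostat line assumes the sysstat tools are available):

    # Follow mover entries in the syslog
    tail -f /var/log/syslog | grep -i move
    # Watch the "wa" (iowait) figure
    top -d 5
    # Per-device utilization and latency every 5 seconds, if iostat is installed
    iostat -x sdg 5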

 

Since there's no way to gracefully kill the mover script, I did a "ps -aux | grep move" and a "kill -9" on the PIDs, in an attempt to then do a graceful shutdown and reboot of the server.  This was not successful... the top-level /usr/local/sbin/move script died in the "D" state, requiring a hard reboot.  Upon reboot Unraid detected the unclean shutdown and initiated a parity check.  The reboot did not magically fix my "Drive 7" IO problems, however... the parity check was claiming it would take 306 days, so I canceled it.  Overnight, the mover script kicked off again, and is running into the same brick wall.  I killed the mover sub-process successfully, but this time didn't try to kill the "move" script.
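
For reference, this is roughly how I was checking what was stuck (a sketch using standard ps options, nothing Unraid-specific):

    # List any processes sitting in uninterruptible sleep (the "D" state)
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'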

 

 I'm trying to move the files off the drive manually, but it's not going any faster.  For what it's worth, I copied a file off another drive and got the expected 160+MBps over my 10GbE connection.
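
To take the network and SMB out of the picture, a purely local read test would look something like this (the path is just an example; /mnt/diskN is where Unraid mounts each individual array disk):

    # Timed sequential read of a large file straight off one data disk
    dd if=/mnt/disk3/some_large_file.mkv of=/dev/null bs=1M status=progress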

 

Any suggestions on what to do?

yellowstone-diagnostics-20201009-0857.zip

31 minutes ago, JorgeB said:

Disk7 looks healthy, but don't do a parity check and attempt to move data at the same time, do one or the other first.

Yes, I stopped the parity check to move the data off once I saw it would take 300+ days to complete.


To get back to a working system, what's the best way to get that data off "drive 7"?  I was thinking of the following procedure:

  1. Remove Drive 7
  2. Let parity emulate it virtually
  3. Move all of the parity-emulated Drive 7 data off to other drives (see the sketch after this list)
  4. Reformat the physical Drive 7 in Windows or elsewhere
  5. Bring it back into the array, allowing Unraid to clear it and resynchronize it (no data anymore)
  6. Go into the share settings and prohibit any share from using "drive 7" 
  7. Run the diskspeed diagnostics you linked above
  8. If I'm able to get it fixed, allow drive 7 to be re-included in the shares.  If not, I guess either replace it with a new 8TB HDD or define a new config without it.
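
For step 3, I was picturing something like this from the console (a sketch only; /mnt/disk7 is where the emulated disk shows up, and disk3 is just an arbitrary destination with free space):

    # Copy everything off the emulated disk 7 onto another data disk, preserving attributes
    rsync -avP /mnt/disk7/ /mnt/disk3/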

Does that make sense?

 

 


There is no point in step 4 (probably some confusion about the meaning of format), and at step 5, it won't clear the disk, it will rebuild it. Even an empty disk is rebuilt since it needs to be made in sync with parity again. If the disk has an empty filesystem, it will be rebuilt with an empty filesystem.

 

Another possibility would be, after step 3, New Config without the disk and rebuild parity. Then you would be protected again, and could proceed however you wanted with testing that disk, and later add it or a new disk (Unraid will clear the added disk so parity is maintained).

6 minutes ago, trurl said:

Another possibility would be, after step 3, New Config without the disk and rebuild parity.

That makes more sense...  new plan:

  1. Restart array, marking Disk 7 as "missing", letting the parity emulate the drive
  2. Move all of the parity-emulated Drive 7 data off to other drives (and double-check it's empty; see the sketch after this list)
  3. Restart array, marking Disk 7 as "available", allowing the parity to rebuild it (as empty)
  4. Go into the share settings and prohibit any share from using "drive 7" 
  5. Run the diskspeed diagnostics you linked above
  6. If I'm able to get it fixed, allow drive 7 to be re-included in the shares.  If not, I guess either replace it with a new 8TB HDD or define a new config without it.
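
Before step 3, a quick check that the emulated disk really is empty would be something like:

    # Make sure nothing is left on the emulated disk before letting it rebuild as empty
    find /mnt/disk7 -type f | head
    du -sh /mnt/disk7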

 


I suppose the source (cache drive) could have been the problem and not the Seagate SATA destination drive... With the drive still offline I’ll try to rerun the mover. 
 

This seemed unlikely as the cache is a brand new Samsung 970 EVO Plus, and it’s been otherwise fine. 
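
If I want to rule the cache out properly, a quick local write test would be something like this (the test file name is made up, and /mnt/cache is where the pool is mounted):

    # Rough direct write test against the cache pool; delete the test file afterwards
    dd if=/dev/zero of=/mnt/cache/ddtest.bin bs=1M count=4096 oflag=direct status=progress
    rm /mnt/cache/ddtest.bin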


Did as I mentioned previously... removed the drive from the array, used the parity emulation to move the data off.

 

While the drive was out, I did SMART tests and ran DiskSpeed. No problems at all. Threw it back into the array and started the rebuild of that disk. The behavior was very similar... for the first hour, it was full speed... 100+MBps. After that it started slowing, and by hour 4 it was down to 200kBps.

 

I think this drive is just garbage. Never again will I buy a Seagate. Here’s a somewhat related thread that gives me déjà vu.

 

https://forums.tomshardware.com/threads/seagate-barracuda-2tb-slow-issue.3443725/

 

I’m running a preclear on it, just begging it to actually give an error, but I doubt it will.
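
While the preclear runs I'm also keeping half an eye on the SMART error log, roughly like this (sdX is a placeholder for whatever device letter the Seagate gets):

    # Dump SMART attributes and the drive's error log
    smartctl -a /dev/sdX
    smartctl -l error /dev/sdX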

 

I may pull it out and throw it in my Windows machine to do more testing with CrystalDiskMark, just to be sure it’s not a bad controller channel or something.

 

 

