Preclear plugin



I have never used the "Erase and Clear" option. Is the erase cycle normally significantly slower? I am running this on an 8TB 7200rpm NAS drive, model ST8000VN002, and I am 38 hours into step 1. It's staying right around 50 MB/s.

Edit: It finished erasing after 45 hours and moved on to zeroing. Zeroing is running at 240 MB/s. Thanks.
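The timings above are consistent with simple arithmetic: one full pass over a drive takes roughly its capacity divided by the average throughput. A quick Python check, assuming decimal units (1 TB = 10^12 bytes, 1 MB/s = 10^6 bytes/s) and a constant rate, which real drives only approximate since they slow down toward the inner tracks:

```python
def pass_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours for one full sequential pass, assuming decimal units and a constant rate."""
    capacity_bytes = capacity_tb * 1e12
    rate_bytes_per_s = rate_mb_s * 1e6
    return capacity_bytes / rate_bytes_per_s / 3600


if __name__ == "__main__":
    print(f"erase at 50 MB/s:  {pass_hours(8, 50):.1f} h")
    print(f"zero at 240 MB/s:  {pass_hours(8, 240):.1f} h")
```

At 50 MB/s an 8TB pass works out to about 44 hours, in line with the 45 hours the erase actually took; at a steady 240 MB/s, zeroing would need a little over 9 hours, though the real average over the whole disk will be lower.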

Edited by allischalmersman

I am also having an issue with the plugin stalling. I am in the post-read step of a preclear of an 8TB disk, 22% in, and it seems to have stalled. The percentage hasn't changed in hours, and the elapsed time is also stuck.

htop shows the preclear script as running (pegging a CPU core at 100% constantly), and there are two other related active processes: "cmp - /dev/zero" and "dd if=/dev/sdb bs=2097152 skip=1 iflag=direct".

Is the preclear still ongoing? Is it only the reporting aspect that is borked? Should I wait another 10 hours for it to finish?

The preclear log only shows that the post-read has started, nothing after that.

I checked the folder /tmp/.preclear/sdb/, and the cmp_out, dd_output and display_output files haven't been modified in hours.

Thanks
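One way to tell whether a run is still making progress, rather than only the display being stuck, is the check described above: look at how long ago the status files in the per-disk folder were last modified. A minimal Python sketch, assuming the files live directly in `/tmp/.preclear/sdb/` (the file names are the ones mentioned above; the one-hour threshold is an arbitrary choice):

```python
import os
import time
from typing import List, Optional


def stale_files(folder: str, max_age_s: float = 3600, now: Optional[float] = None) -> List[str]:
    """Return the names of regular files in `folder` not modified in the last `max_age_s` seconds."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(folder) as entries:
        for entry in entries:
            # st_mtime is the last modification time, in seconds since the epoch
            if entry.is_file() and now - entry.stat().st_mtime > max_age_s:
                stale.append(entry.name)
    return sorted(stale)


if __name__ == "__main__":
    folder = "/tmp/.preclear/sdb"  # assumed per-disk status folder
    if os.path.isdir(folder):
        for name in stale_files(folder):
            print(f"{name} has not been modified in over an hour")
```

An empty result means every file was touched within the threshold. Stale files point to a stalled run, even though, as described above, the `dd` and `cmp` processes can still exist while nothing advances.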


@gfjardim:

 

Similar issue to aptalca, except mine hung during zeroing. I was actually running a preclear using my script and a preclear using your plugin/script on two different computers. Your script was about 2% ahead (despite running on the slower computer) when I went to bed last night. It seemed to pick up speed during the pre-read and hold steady during zeroing. Today that preclear is stalled at 73% zeroing, while the other computer is 36% through the post-read/verify. It looks like it stopped a mere 3% after I last checked last night. A few screenshots are attached to help you diagnose it.

I was a little skeptical of the preclear plugin and especially the new script, but after reviewing it I was impressed with the restructuring and updates you made. It is now easier to maintain, IMO. I especially liked your post-read method. JoeL and I had a dialog (back in the day) about how to do the post-verify more efficiently without resorting to a custom program (which I ultimately implemented as readvz) and did not find a way, but you did, and I like it better. JoeL basically summed the byte values using an awk program, which was pretty slow. My run never got to the point where yours is used, so I'm not 100% sure it is faster, but I assume your comparison against /dev/zero is faster than, or at least comparable to, readvz. I plan to use your method going forward, but this bug needs to be fixed first!
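The post-read idea praised above is to stream the disk with `dd` and let `cmp` compare it against `/dev/zero`, so any non-zero byte shows up as the first difference. A minimal Python sketch of the same check, for illustration only: it reads in 2 MiB blocks like `bs=2097152` and skips the first block like `skip=1` (presumably because the start of the disk holds the preclear signature rather than zeros):

```python
from typing import BinaryIO, Optional

BLOCK = 2097152  # 2 MiB, the bs= value from the dd command quoted in this thread


def first_nonzero_offset(f: BinaryIO, skip_blocks: int = 1, block: int = BLOCK) -> Optional[int]:
    """Scan `f` after the first `skip_blocks` blocks and return the byte offset of
    the first non-zero byte, or None if everything read back as zeros."""
    offset = skip_blocks * block
    f.seek(offset)
    while True:
        chunk = f.read(block)
        if not chunk:
            return None  # reached end of device/file: all zeros
        rest = chunk.lstrip(b"\x00")  # drop leading zero bytes
        if rest:
            # like cmp, report where the first difference from /dev/zero is
            return offset + (len(chunk) - len(rest))
        offset += len(chunk)
```

On the server this could be called as `first_nonzero_offset(open("/dev/sdb", "rb"))` (read-only; the device name is just the one from this thread). Unlike the `dd ... iflag=direct` command, this sketch goes through the page cache, so it illustrates the check rather than replacing it.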

 

BTW, thanks for crediting my work. I basically made three enhancements to preclear, which is otherwise JoeL's creative work, and he deserves huge credit for creating it (I can't even say how many years ago). He is missed. :(

 

But my three contributions were:

- Fast preclear (-f)

- "Real time" status updates to the /tmp folder (ones that even the stock GUI uses, and neither Joe nor I are credited. This irks me)

- "Hidden" option (not so hidden anymore) to bypass all of the prompts and just get to the preclearing directly. Something I added so that preclear could be started from a GUI.

 

FYI - I have been stuck in the relative dark ages of the very stable 6.0.1 and am only now looking at this for the first time. Good work. Fix the bug. :)

Stall Preclear TMP Capture.JPG

Stall Preclear PCP Capture.JPG

Stall Preclear TOP Capture.JPG


Guys, I've been so busy these last two months that I couldn't even read this topic. I see many are having trouble with my script, but I need some time to compile all the info to see if I can reproduce all the bugs. If someone could help me compile all the relevant info, I'd greatly appreciate it.

I also need someone to share the development with me. Those interested, please send me a PM.

On 2017-6-9 at 3:58 PM, aptalca said:

I am also having an issue with the plugin stalling. I am in the post-read step of a preclear of an 8TB disk, 22% in, and it seems to have stalled. The percentage hasn't changed in hours, and the elapsed time is also stuck.

 


Still having the issue, @aptalca? I could use your help to debug it.

1 hour ago, gfjardim said:

 

Still having the issue, @aptalca? I could use your help to debug it.

 

After seeing reports from others in other threads, I canceled that stalled preclear and started a new one using JoeL's script in a screen session. It is currently in post-read and has another 15 hours or so to go (8TB drives take forever). After that's done, I'd be happy to help debug.


@gfjardim

 

Suggestions ...

 

1 - Add logging to your script so you can figure out where the hang is occurring when it happens again. (Joe's script uses some subshells to work around an unexplained bug. You might want to look at that; you may be able to wrap some of your code in subshells and avoid your bug.)

 

2 - Add a "resume" feature, so if people use your script and it does hang, they could simply resume their preclear from the plugin page without having to start from scratch. There may already be enough info in the /tmp files to do so, but if not, a "resume" file could be created. People would be more likely to use it if they knew they weren't going to waste all that time preclearing. This would actually be a very nice feature in preclear: being able to stop it and resume later, even after a reboot. I was able to resume my failed preclear with a quick patch to my version of the preclear script, but few would know enough to do that and would instead be forced to restart at least the current stage.

 

3 - Add a resumer process that detects the hang, kills the old one, and resumes. Users would never know except for a very brief slowdown. You could use your "phone home" system to share the logs when the resumer has to do its thing.
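Suggestions 2 and 3 could be built on the same small piece of state: a checkpoint recording the current stage and how far it has got, refreshed only when progress actually advances. Below is a minimal Python sketch of that idea. The JSON checkpoint file, its fields and the 30-minute idle threshold are all hypothetical, not something the plugin provides; an actual resume would also need to restart `dd` at the recorded offset (via its `skip=`/`seek=` operands), and killing the hung process is left out.

```python
import json
import os
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Checkpoint:
    stage: str        # e.g. "pre-read", "zeroing", "post-read"
    bytes_done: int   # progress within the current stage
    updated: float    # time.time() when progress last advanced


def save_checkpoint(path: str, cp: Checkpoint) -> None:
    """Write to a temp file and rename, so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(asdict(cp), f)
    os.replace(tmp, path)  # atomic on POSIX when both paths are on the same filesystem


def load_checkpoint(path: str) -> Optional[Checkpoint]:
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return Checkpoint(**json.load(f))


def record_progress(path: str, stage: str, bytes_done: int, now: Optional[float] = None) -> Checkpoint:
    """Save new progress; keep the old timestamp if nothing has moved forward."""
    now = time.time() if now is None else now
    prev = load_checkpoint(path)
    if prev is not None and prev.stage == stage and prev.bytes_done >= bytes_done:
        return prev
    cp = Checkpoint(stage, bytes_done, now)
    save_checkpoint(path, cp)
    return cp


def is_stalled(cp: Checkpoint, max_idle_s: float = 1800, now: Optional[float] = None) -> bool:
    """True if progress has not advanced for more than `max_idle_s` seconds."""
    now = time.time() if now is None else now
    return now - cp.updated > max_idle_s
```

A watchdog would call `record_progress` from the status updates and check `is_stalled` periodically; after a restart, `load_checkpoint` tells it which stage and offset to resume from.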

 

BTW - I pretty much hate shell scripting. Awk is so much easier to use. I had thought about rewriting preclear in awk, but for a variety of reasons decided to let the preclear script stay as it was. In awk you can use the "system()" function to do the dirty little I/O commands and still use awk's simple syntax and structuring features to organize the code. Awk might have a small performance impact in the control logic, but IMO it would be well worth it given the added maintainability.

 

Sorry, I don't have the patience to get into the weeds with you on this. I muddle through shell scripting but have no interest in wasting the brain cells I have left on becoming an expert. :)

 

Good luck. If I have other preclears to do I will help test if you've got something in place to find bugs.

39 minutes ago, bjp999 said:

@gfjardim

 

Suggestions ...

 

1 - Add logging to your script so you can figure out where the hang is occurring when it happens again. (Joe's script has some subshells to work around an unexplained bug - might want to look at that - you may be able to wrap some of your code in sub shells and avoid your bug??)

 

2 - Add a "resume" feature, so if people use your script and it does hang, they could simply resume their preclear from the plugin page without having to start from scratch. There may already be enough info in the /tmp files to do so, but if not, a "resume" file could be created. People would be more likely to use it if they knew they weren't going to waste all that time preclearing. This would actually be a very nice feature in preclear: being able to stop it and resume later, even after a reboot. I was able to resume my failed preclear with a quick patch to my version of the preclear script, but few would know enough to do that and would instead be forced to restart at least the current stage.

 

3 - Add a resumer process that detects the hang, kills the old one, and resumes. Users would never know except for a very brief slowdown. You could use your "phone home" system to share the logs when the resumer has to do its thing.

 


1) I've just figured out that it is a much more complicated problem to deal with. Sometimes dd saturates the disk's I/O to the point where a simple S.M.A.R.T. probe takes as much as 1 minute to complete, and if multiple probes are launched at the same time, this period increases dramatically. I'll have to think of a workaround for this problem, since it involves emhttp and Unassigned Devices too.

2) I've already thought about that, and it's on my TODO list.

3) I would prefer users to be alerted to any issues, but I'll think about it.
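The pile-up described in 1) comes from several SMART probes hitting the same busy disk at once. One common mitigation, shown here only as an assumption about how it could be handled and not as what emhttp, Unassigned Devices or the plugin actually do, is "single-flight" probing: while a probe for a disk is running, further requests for that disk wait for it and share its result instead of launching another `smartctl`.

```python
import subprocess
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """One in-progress probe that other callers can wait on."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Collapse concurrent calls with the same key into a single underlying call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fn()  # only the leader actually probes the disk
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()  # followers block until the leader's probe finishes
        if call.error is not None:
            raise call.error
        return call.result


def read_smart(disk: str) -> str:
    """Hypothetical helper: `smartctl -A` prints the SMART attribute table."""
    return subprocess.run(
        ["smartctl", "-A", f"/dev/{disk}"], capture_output=True, text=True
    ).stdout


probes = SingleFlight()
```

A caller would use it as `probes.do("sdb", lambda: read_smart("sdb"))`. This does not make a single probe faster on a saturated disk; it only stops simultaneous probes from stacking up on top of each other.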

 

Thanks a lot for your reply and your kind words!


Actually, I have had problems with frequent SMART reports and heavy I/O, worse on add-on controllers than on motherboard ports. I can't prove it, but a new Seagate 2TB drive I had years ago, which I was preclearing on an add-on controller while pulling frequent SMART reports, got all screwed up and had the freakiest problems I've ever seen, with very long delays responding. I ultimately returned it as defective, but have always thought it was due in some way to preclearing while pulling SMART reports. I experimented with something called the "permissive" flag in smartctl on that Seagate, and that did not help; it may have made things worse. But I do know it's not a solution that allows one to pull constant SMART reports. Since then I never preclear on anything but a motherboard port, and I am very gentle about pulling SMART reports.

 

I kind of forgot all that, as I avoided the problems for years. I had disabled updates of the stock GUI (it was causing hangs with my version of unRAID, 6.0.1), so background SMART checks were not happening. And with myMain, I never let it do the auto-updates, and I implemented a refresh button that remembered the temperatures from refresh to refresh and avoided pulling a new SMART report. This results in almost no SMART reports unless I explicitly ask for one. Asking for one every few hours is very low risk.
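The refresh-button approach above amounts to caching the last SMART reading and only probing the disk again on an explicit request or once the cached value is old. A minimal Python sketch, where `probe` is a hypothetical stand-in for whatever actually calls smartctl and the 3-hour minimum interval is an arbitrary choice matching "every few hours":

```python
import time
from typing import Callable, Dict, Tuple


class CachedReading:
    """Remember the last reading per disk and re-probe only when forced or stale."""

    def __init__(self, probe: Callable[[str], int], min_interval_s: float = 3 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._min_interval = min_interval_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, int]] = {}  # disk -> (time read, value)

    def get(self, disk: str, force: bool = False) -> int:
        now = self._clock()
        hit = self._cache.get(disk)
        if hit is not None and not force and now - hit[0] < self._min_interval:
            return hit[1]  # reuse the remembered value; no SMART traffic to the disk
        value = self._probe(disk)
        self._cache[disk] = (now, value)
        return value
```

Passing `force=True` plays the role of the explicit refresh; every other call is served from memory until the interval has elapsed.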

 

But the machine I used to preclear yesterday was a new build running 6.3.5, with the stock GUI on default settings, so it was doing its background updates. Putting all the facts together with your conclusion, it makes a lot of sense that the constant SMART reports would screw things up.

It would be an interesting test to do with the GUI turned off or its settings adjusted. It probably would not hang.


I'm having the same issue as everyone else. I'm trying to preclear an 8TB drive. It seems to hang after about the same amount of time every time. I can get through 1 full phase and about 92-95% of the next phase before preclear completely freezes and won't update the time or progress. It also causes the webui to stop updating (I can navigate the webui, but no information is populated; for instance, all of my disks disappear). I can ssh in and see progress using the preclear command once the webui freezes, but eventually that freezes too.

1 hour ago, aptalca said:

 


Hi @gfjardim

I started another preclear through the plugin, and it got stuck during the post-read. One CPU core is pegged at 100%. I'll leave it as is for now. Let me know what you need me to do to debug.

 

 

 

I've made some changes in the code. If you can, please cancel this preclear instance, upgrade the plugin and start a new one.

 

Thanks a lot!


So something weird happened: a full pre-read, zeroing, and post-read successfully completed on an 8TB Red. Nothing changed, other than that I made sure not to touch anything else on the unRAID box while it ran. I checked the status at several points in the process, and the webui was functional each time. Not sure what changed for me.


I updated preclear, and I don't know if this is related or not, but I'm having issues. I'm unable to remove any plugins: the check box does not appear, and the "Remove" button stays unselectable. At the same time, I am unable to select a notification type or trigger when starting a preclear. I tried a reboot. Any advice? Thanks, Andrew

preclear snip.JPG

Plugin snip.JPG
