
Posts posted by Joe L.

  1. The preclear_disk.sh script is failing on the v6.2 betas because sfdisk -R is no longer supported.  Preclear_disk.sh reports the disk as busy and will not preclear it.  It looks like 'blockdev --rereadpt' is the replacement, according to the sfdisk man page here: http://man7.org/linux/man-pages/man8/sfdisk.8.html.

    "Since version 2.26 sfdisk no longer provides the -R or --re-read option to force the kernel to reread the partition table.  Use blockdev --rereadpt instead."

     

    EDIT: There is also an issue with reads failing.  I changed the following:

     

    read_entire_disk( ) {
      # Get the disk geometry (cylinders, heads, sectors)
      fgeometry=`fdisk -l $1 2>/dev/null`
      units=`echo "$fgeometry" | grep Units | awk '{ print $9 }'`
    

     

    to

     

    read_entire_disk( ) {
      # Get the disk geometry (cylinders, heads, sectors)
      fgeometry=`fdisk -l $1 2>/dev/null`
      units=`echo "$fgeometry" | grep Units | awk '{ print $8 }'`
    

     

    and the reads will work.

     

    Joe L. - Can we get an official fix and an update from you?

    I only (very) recently put 6.2 beta on my server.

     

    I did not have any issue pre-clearing the second parity disk I have just added to my array.

     

    The fix will need to wait until I add/replace one of the existing disks with a larger one.

    (Otherwise, I have no way to test the process. )

     

    Whatever the fix might be, it must be backwards compatible with the older releases of unRAID.

     

In the interim, you can run the following command to "patch" the preclear_disk.sh script.

First, change to the directory holding the preclear_disk.sh script.  For most, it will be

    cd /boot

    then type (or copy from here and paste) the following:

    sed -i -e "s/print \$9 /print \$8 /" -e "s/sfdisk -R /blockdev --rereadpt /" preclear_disk.sh

     

Your preclear disk script will be edited and should work with the two changes you mentioned.  (Actually, each change occurs in two places, so a total of 4 lines are changed.)
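    If you want to confirm the patch took effect, a quick check (not part of the script itself) is to count the patched lines; each grep below should report 2 matches, for the 4 changed lines total:

    grep -cF 'print $8' preclear_disk.sh
    grep -cF 'blockdev --rereadpt' preclear_disk.sh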

     

    Joe L.

  2.  

    Many of us feel that running two or three preclear cycles will get the drive past the 'infant mortality' portion of the bathtub curve (google for further discussion).  Uncovering an early HD failure before putting that drive into an array is much less stressful than finding a compromised array in the first week after introducing a new drive into the mix. 

     

PS--- I could tell a story about how the concept of infant mortality came into general knowledge in the military during WWII, but that would be completely off topic...

     

    I agree that there is value in stress testing the drive and checking to make sure nothing is failing after the first few writes.

     

    That said, maybe this signals that a new plugin needs to be made that removes the clearing portion of the plugin and instead focuses entirely on stress testing. Leave the clearing entirely to the OS since that's not an issue anymore.

     

This should allow more cycles of stress testing without that long post-read cycle (the one that verifies the drive is zeroed), meaning you can do more cycles faster... I think.

     

I think you are missing a part of the equation.  It is not only the stress introduced by the testing; the elapsed time is an integral part of the entire process.

    You are both missing an important part of the equation.

     

    Un-readable sectors are ONLY marked as un-readable when they are read.  Therefore, unRAID's writing of zeros to the disk does absolutely nothing to ensure all the sectors on the disk can be read.  (Brand new disks have no sectors marked as un-readable)

     

Sectors marked as un-readable are ONLY re-allocated when they are subsequently written to.    It is the reason the preclear process I wrote first reads the entire disk and then writes zeros to it.  (This allows it to identify un-readable sectors, and fix them where possible.)
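    (For the curious, the read-then-write idea boils down to something like this sketch.  This is NOT the preclear script itself, and /dev/sdX is a placeholder; be absolutely certain of the device name before the write pass, since it destroys everything on that disk:)

    # read pass: any sector the drive cannot read gets flagged as pending re-allocation
    dd if=/dev/sdX of=/dev/null bs=1M
    # write pass: writing zeros over the flagged sectors lets the drive re-allocate them
    dd if=/dev/zero of=/dev/sdX bs=1M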

     

The entire reason for the post-read phase is that quite a number of disks failed when subsequently read after being written.

     

    If you rely on unRAID to write zeros to the disk and then put it into service, the first time you'll learn of an un-readable sector error is when you go to read the disk after you've put your data on it. (or during a subsequent parity check)

     

The new feature in this release of unRAID will help some users avoid a lengthy, un-anticipated outage of their server if they had not pre-cleared a disk, and for that it is a great improvement.  This improvement in unRAID 6.2 does not, however, test the disk's reliability in any way, nor identify un-readable sectors (since it only writes the sectors, and does not read them at all).

     

Additional discussion about whether the unRAID 6.2 initial zeroing of drives replaces the preclear process should continue in another thread... and not clutter up this thread in the announcement forum.

     

    Joe L.

     

  3. I personally would run it again.  As I understand it, because unRAID runs from RAM, logs don't survive a reboot (or loss of power), so there is no way to tell what the previous preclear indicated.  If you really don't want to, you could run a SMART test, and if everything in there is OK you will probably be fine; what you won't be able to tell is whether anything changed between the start of the preclear and the end, which can indicate possible problems.

     

*edit* Looking at that SMART test, there is nothing that stands out to me, but I'm no expert.  Current_Pending_Sector and Reallocated_Sector_Ct are both 0, though, so it will probably be OK.

    Actually, the preclear script logs its reports on the flash drive in

    /boot/preclear_reports

     

You might look there.  If the report is not there, then it finished the clearing step but not the post-read phase that checks whether the disk was successfully zeroed.
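    For example, from the console:

    ls -l /boot/preclear_reports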

     

    Joe L.

  4. Yes, I think Joe L. has moved on.  What changes have you made to the cache_dirs script, and are there any incompatibilities with the original script?

     

    EDIT: So far all I see is that you changed the 'B' to 'b' for disks busy.

     

    Ha! I think you need glasses if that's all the change you see :) But I totally get the need for a better description than the change-log.

     

I don't have the time right now to go into details, and I don't remember everything I did, but this is what I wrote earlier:

     

I have added an adaptive depth level, to prevent cache_dirs from thrashing disks when they are otherwise occupied and the cache is evicted. I found the cache was often evicted, given the number of files I had, when the system became occupied with other things.

I added the ability to adjust the depth automatically, based on whether scans are judged to cause disk access or not. It judges that a disk has been accessed during a scan if the scan takes a long time, or if any recent disk access was made (and no recent disk access was made before scanning). The purpose is to avoid situations where cache_dirs continuously searches through my files, keeping disks busy all the time. Before, it was also rather difficult to tell if cache_dirs was 'thrashing' my disks; now it's quite clear from the log if logging is enabled (though the log is rather large at the moment). If disks are kept spinning for some consecutive scans, the depth is decreased, and a future rescan is scheduled at the higher depth.
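    (In rough bash terms, the adaptive-depth idea looks something like the sketch below; the variable names are illustrative, not taken from the actual script, and it assumes depth and slow_seconds have been set elsewhere:)

    # time one scan pass at the current depth
    start=$(date +%s)
    find /mnt/disk* -maxdepth "$depth" >/dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))

    # a slow scan suggests the cache was evicted and the disks were actually hit:
    # back off the depth now, and schedule a future rescan at the deeper level
    if [ "$elapsed" -gt "$slow_seconds" ]; then
      depth=$(( depth - 1 ))
      retry_depth=$(( depth + 1 ))
    fi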

     

If the file '/var/log/cache_dirs_lost_cache.log' exists, then it will write a log that is easily imported into a spreadsheet (Excel), so it's easier to check whether it thrashes disks with the current settings. I also added the kill I mentioned and some other quite minor bug-fixes.

     

If you need more, let me know, and I might supply more detail over Christmas. If you think it looks good and useful, I might do a clean-up pass on the script. I haven't felt like spending more time on the script if nobody but me uses it.

     

    Best Alex

No, not moved on... I just have precious little free time to be as heavily involved as I was a few years ago (when I was not working).

My servers are both built with outdated hardware, so I cannot contribute in the same way I did in the past.  (One is an original server sold by Limetech, with IDE-based drives; the second is newer, but incapable of handling virtualization.)

     

    I do follow the threads... and respond occasionally...

     

    Joe L.

  5. Or, the SATA cables are picking up noise from adjacent cables. (adjacent power OR SATA cables)

     

This often occurs when a user attempts to make their server look neat by bundling all the SATA cables together.  Doing so creates a situation where induced noise is very likely.

     

Therefore, cut the tie-wraps bundling the cables together.  Yes, it looks less neat, but... you'll see far fewer noise-induced CRC errors.

     

    Joe L.

  6.  

    Thanks for the info, I had never seen this.

     

I am trying it now, although I think it is not the perfect choice for my case, since the errors appear in different places on the HDDs.

    If the errors are in different places each time, it is more likely to be a memory problem, disk controller problem, or a power supply problem.

     

The very first thing to check is to run a memory test, preferably overnight (or at least several full passes).  As often as not, a bad memory stick is the issue.

     

    Joe L.

  7. Now the strange problem: 

     

My initial (1st) drive seemed to exhibit a similar problem, so I was not particularly worried, albeit frustrated at the loss of time.  So I decided I would just re-boot (when the drive fails preclear it ceases to be seen), and lo and behold, the Windows machine would not boot off any USB stick, regardless of location and/or boot selection attempt.  Very strange behavior.  I finally re-attached my Windows HD and booted successfully, then totally cleared the 2nd 5TB drive, thinking that something was amiss.  Still no change, despite 3 different valid bootable USB sticks.

     

Finally, this evening I went into the BIOS again and "Restored Default Settings", and lo and behold, unRAID now boots and I've re-started the PreClear plugin.

     

So the "burning" question is: how did a PreClear failure somehow write "code" to the BIOS?!  Is there anything this will do to the system in the future?

     

    First, no software (including Preclear) writes to the BIOS.

     

This is actually a common problem with many motherboards.  Whenever you change the system's installed drives, the BIOS may decide to "help" you and change the boot order so that the most likely hard drive will be booted, which is usually NOT the USB drive you had configured!  You did the right thing by going into the BIOS and correcting the boot order, making sure the right drive is booted, not what the BIOS *thinks* is the right drive.

     

Thanks Rob - I agree in a sense, but I actually selected a "seen" bootable USB drive and it/they still failed.  Maybe the BIOS still changed it to the cleared (not precleared) hard drive, as it showed "no bootable disc found".

     

Still an interesting and "freaky" thing to witness.  It worked fine until the PreClear "failed", then would not boot until the BIOS was reset.

     

    Dave

Even though the pre-clear had failed (it detected that it had not filled the disk as expected), it could have written what looks to the BIOS like a valid master-boot-record to the hard-disk being cleared.

     

In other words, as RobJ said, your BIOS was trying to "help" you by choosing to boot from one of your hard-disks that it thought had a valid master-boot-record, and since none of them contain actual code to boot from, nothing would boot until you set the BIOS back to boot from the correct USB flash drive.

     

     

  8. Is NUT UPS support still planned for 6.1? Really looking forward to this.

    Since lime-tech is in release-candidate-2 of 6.1, I'd not expect new features, but instead just tiny bug-fixes so they can get to 6.1 final.

    (I can't speak for lime-tech, as I'm a customer, just like you, so it is always possible they would throw in something at the last moment... but I would look to a community plugin rather than something in 6.1 natively)

     

    Joe L.

  9.            

    /boot/preclear_bjp.sh -f -A -D /dev/sdz

     

    Thanks,

     

    Dave

    I think ANY problem with the preclear is an issue. 

     

    The results of the preclear run should be stored on the flash drive in the preclear_reports folder.

     

    Hi Guys,

     

About 10 days ago, I wrote about my problem preclearing a 5TB drive with a V6 key, using a separate computer.  The first drive got all the way through 2x, but indicated that it could not preclear the MBR, hence itimpi's comment.

     

I've checked; there are no log reports at all.  After rebooting and adding the PreClear plugin & script, I started over several times, but at some point 20+ hours into the preclear, the drive and system disconnect (lose communication), rendering Preclear useless.

     

    This has now happened to a new 2nd 5TB Drive (fresh out of the box).  Anyone have a thought?

     

    Dave

Yes, your drive is losing contact with the disk controller.

     

    You are lucky you are discovering the issue before you start loading your data to the drive.

     

In many cases in the past, the issue was poor or intermittent SATA cabling to the drive, or an intermittent power splitter or power connection, or an intermittent drive tray or back-plane, or a power supply inadequate for the number of drives connected.  Occasionally, it was traced to a flaky SATA controller port.

     

    What exact power supply are you using?  How many disks are being powered from it?

     

Do not get confused by the preclear report stating it could not clear the MBR.  The script must write a protective MBR to the drive, for older utilities that expect it to be there, even though the actual partition is located further up on the disk.  Apparently, at the point where the MBR is being written, the drive has already stopped communicating with the disk controller.  (So no writes to the drive would work, regardless of their purpose.)
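    (If you are curious what that protective MBR looks like on a drive that did preclear successfully, reading the first sector is harmless; /dev/sdX is a placeholder for your drive:)

    dd if=/dev/sdX bs=512 count=1 2>/dev/null | od -x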

     

     

     

    Joe L.

  10. I don't know for SURE, but if I recall correctly Joe L.'s configuration includes a 1430SA... and I know he just updated to v6 with no problem.  If he notices this, perhaps he'll confirm that he's using a 1430SA.  [I'll send him a PM and a link to this comment.]

That board is not in the server I just upgraded, so I cannot answer whether it works or not.
  11. There is this wiki: Upgrading to UnRAID v6

Yes, I know, and I followed its instructions specifically (the section for ADVANCED users who did not wish to re-format the flash drive).

     

    Tom...

    root@Tower:~# ls -l /boot/extra

    total 0

    root@Tower:~# ls -l /boot/plugins

    total 352

    -rwxrwxrwx 1 root root  1510 Aug 10  2013 webGui-latest.plg*

    -rwxrwxrwx 1 root root 333600 Aug 10  2013 webGui-latest.txz*

    root@Tower:~# grep -v "#" /boot/config/go

    /usr/local/sbin/emhttp &

    root@Tower:~#

     

    I'm guessing the files in the plugins directory should not be there.  (I've never used unRAID plugins, so I'm guessing these were for the stock unRAID interface)

I did not install them specifically.  I did copy the "plugins" folder from the distribution to the flash drive, but that would have left the previous contents in place.

     

    I'll try removing those two files and let you know what happens when I reboot.

     

    Joe L.

  12. Upgrading from 5.0.6 to 6.0-rc4.

     

    Which method did you use to upgrade?  Did you format your flash drive, or manually move things around?

     

    I did the whole format process moving from 5.0.6 a few weeks ago, and it went very smoothly.

    Did you clear your browser cache?

    I did not.  But clearing it made no difference. (I just tried)

     

I did not do a complete reformat.  I did rename everything and disable everything, and I never used dynamix previously.

  13. Upgrading from 5.0.6 to 6.0-rc4.

     

    Many small issues...

     

First, "failed to load COM32 file menu.c32" when I first attempted to boot my flash drive.

    To get it to boot I had to copy menu.c32 from the syslinux folder on the flash drive to the root of the flash drive.
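    (From a console with the flash drive mounted at /boot, the usual unRAID mount point, that copy is simply the following; the same can be done from any machine with the flash drive mounted:)

    cp /boot/syslinux/menu.c32 /boot/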

     

    Then, once that was resolved, it booted and allowed me to see the disk assignment page by invoking //tower

     

    after assigning each of the data drives, I get:

    URL: tower/undefined

    404 File Not Found

     

    hitting the back button and then refreshing the browser, I see the assigned disk, but at the bottom of the main disk assignment page is:

    Fatal error: Call-time pass-by-reference has been removed in /usr/local/emhttp/plugins/dynamix/include/DefaultPageLayout.php(278) : eval()'d code on line 56

     

    After assigning all the disks as they were previously, I'm stuck...

    I cannot start the array now that I've assigned my drives because of this error, as there are no buttons present to start the array.

     

    Help is requested.  This is my first attempt to boot on 6.X and I'm not impressed so far with the experience.  I can click around on the various tabs in the interface and can get to all the pages...  Apparently, it is just the main disk assignment page with the error so far.

     

    It appears to me as if going straight from 5.0.6 to 6-rc4 is going to be an issue for some. 

     

(I should add I did not completely re-format the flash drive.  I did disable all the packages/add-ons, etc., in the config/go script, re-named the packages folder to packages_v5, etc.  There was an older menu.c32 in the root directory of the flash drive... perhaps it was confusing syslinux.  I did run the make_bootable.bat script from within Windows Vista, using "run as administrator", before my first attempt to boot 6-rc4.)

     

    Joe L.

  14. I'm trying to execute a simple copy command in my go script to copy an sabnzbd skin into the right location. Anyone know why this wouldn't work?

     

    cp /mnt/cache/applications/sabnzbd/skinsholding/Knockstrap /usr/local/sabnzbd/interfaces/ -r

     

    I can literally copy and paste that text into the terminal from the go script and the copy works fine. Is the cache not mounted yet when the script runs?

     

    Thanks!

It might be because the correct way to supply the "recursive" argument (or any argument, for that matter) is

    cp -r source destination

Options to the copy command must come before the source and destination directories/file-names.
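    So, in your case, the line in the go script should read:

    cp -r /mnt/cache/applications/sabnzbd/skinsholding/Knockstrap /usr/local/sabnzbd/interfaces/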

  15. So I've started a preclear on a 4TB drive, but I just read that my building is shutting off the power sometime overnight for maintenance.  The preread will finish, but I'll have to shut the server down during the zeroing.  Is it OK to just do a ctrl-c on the script during the zeroing, shut down the server, start up in the morning, and preclear again but skipping the preread?  I presume that if I get to the zeroing stage tonight, it'll mean the preread was error free?

     

    Thanks in advance.

Yes, you can do exactly as you stated.
  16. Depends on what you call clean... This basically makes sure all writing is stopped... I think (but could be wrong) that a parity check would still start after the reboot...

    You are wrong.  If stopped as described in that link to the wiki instructions, the array will not need to perform a parity check upon restart.

     

    The key command is

    /root/mdcmd stop

which you'll only be able to perform successfully after un-mounting all the disks.  (Those are the first steps in the wiki link.)
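    (Roughly, the tail end of that procedure looks like this sketch; follow the full wiki steps, which also stop anything still using the disks:)

    # un-mount every mounted data disk first
    for d in /mnt/disk*; do umount "$d"; done
    # then stop the array cleanly
    /root/mdcmd stop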

     

Joe L.

  17. Disk definitely virginal. Never seen such low numbers!

     

    Self-tests are a good idea as was already mentioned.

     

    If the self-test passes, the behavior might be due to bad cabling. Although this doesn't have the normal symptoms, I'd definitely try replacing the SATA cable.

I would try running the short SMART test before doing anything that would power cycle the disk.

     

    type

    smartctl -t short /dev/sdi

then wait for the time it indicates and get a new SMART report

    smartctl -A /dev/sdi

     

followed by the same steps for the long test, waiting several hours or more (as indicated when it is invoked) before getting a subsequent smartctl -A report.

    (Don't forget to disable any spin-down timers, as spinning down the disk will terminate the long test.)

    smartctl -t long /dev/sdi

    waiting hours as needed, then

    smartctl -A /dev/sdi
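    (While a test is running, you can also check its progress in the full SMART output rather than guessing at the elapsed time, for example:)

    smartctl -a /dev/sdi | grep -A 1 'Self-test execution'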

It might have stopped responding to read requests for now, but might start again if power cycled.  The actual issue could be with the disk controller OR the disk itself.

(That is not good behavior; ceasing to be able to read a disk is a very bad thing in any network-storage device.)

  18. I would not use either drive.

     

The first has over 900 sectors already re-allocated (it will not get better with use), and the second is already FAILING the SMART test.

    184 End-to-End_Error        0x0032  099  099  099    Old_age  Always  FAILING_NOW 1

     

    Joe L.

     

     

  19. Actually, it said "0 bytes copied" so it could not read the disk when it was trying to.

    Might be fine in operation, but I expect you might want to keep an eye on it.

    Can you get a smart report on the drive right now?

    (does it respond at all to read requests?)

     

    What do you see when you run this command that attempts to read the disk's first 195 sectors:

    (it will print, at most, 30 lines of text)

dd if=/dev/sdi count=195 | od -c -A d | sed 30q

     

     

  20. Based on the SMART report, it is highly likely to be bad cabling to the drive, or cabling picking up noise from adjacent cables.

(If you neatly tie-wrapped all the drive cables, you've caused the problem.  Do NOT run the cables all parallel to each other or to power cables.)

     

It might also indicate a power supply at its limits, with the power supplied to the drive being noisy, causing the checksum errors in communication with the drive that are showing in the SMART report:

    199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      172
  21. I am having trouble preclearing a Seagate 4TB drive.  Every time I attempt to perform a preclear, it hangs on step two.  When it freezes, if I go to the console, the monitor keeps refreshing "No such file or directory exists dev/sde."

     

    Please see my original thread here with error logs - http://lime-technology.com/forum/index.php?topic=38014.0

Then it indicates your disk is dropping off-line in some way and can no longer be accessed.  (Either a bad disk, or a power supply that cannot supply proper power to the disk, or a disk controller that has stopped responding, or a loose cable or connector, or a loose drive tray or back-plane.)

     

Sorry to say, it is difficult to isolate which it might be.

     

Thanks Joe.  I think that I am just going to RMA the drive, even though it passes all of the Seagate SeaTools tests.

     

I am using a Norco 4224 case, and I have tried preclearing the drive in multiple slots to eliminate the possibility of a bad cable or PCI-E card, with no luck.  I was able to preclear an old 2TB drive just fine, so I suspect the drive.

Better to RMA the drive now than to have it stop responding when attempting to load it with your data (or have it fail when using it to recover another failed disk).

     

    Joe L.
