
Posts posted by pwm

  1. Note that a disk reporting one offline uncorrectable sector doesn't mean the disk has to be toast.

     

     But it does mean that, statistically, there is an increased probability that the drive will degrade further - or fail completely - within a limited time span. Some disks just pick up a bad sector from a surface defect that wasn't noticed during the original factory scan. But the problem may not be just a tiny spot: it can be a larger bad surface area, or an issue with a head or some other part of the drive, in which case it is dangerous to continue using the drive.

     

     It also means there is one sector that can't be read back correctly, because the error correction code (ECC) for that sector isn't strong enough to correct the bit errors. If you already know the contents of that sector and overwrite it, the disk can remap the data to a spare sector, giving your RAID a full set of disks with all-correct data again.

     

     As johnnie.black notes, you most definitely do not want to rebuild your parity at this stage, since the current parity is one way to recompute the contents that should have been stored in the offline uncorrectable sector (unless you happen to have a backup of the specific file that uses this particular disk sector).

     

     Anyway - after an extended SMART scan, the disk will be able to tell which sector it finds the first error on. The scan may also turn up additional bad sectors.
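     As a rough sketch of how to run those checks with smartmontools (smartctl must be installed, and /dev/sdX is a placeholder for the affected disk - not something from the original post):

```shell
#!/bin/sh
# check_sectors filters a `smartctl -A` attribute dump down to the three
# counters relevant here. It reads stdin, so it can also be fed a saved
# report. smartctl comes from the smartmontools package; /dev/sdX is a
# placeholder device name.
check_sectors() {
    grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
}

# Typical use against the live disk (needs root):
#   smartctl -A /dev/sdX | check_sectors
#   smartctl -t long /dev/sdX      # start the extended self-test
#   smartctl -l selftest /dev/sdX  # afterwards: log lists the LBA of the first error
```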

  2. Your problem sounds a lot like a heat issue. If heat is the cause, the first test run lasts longer, while an immediate restart of the test fails earlier because the affected component reaches its critical temperature more quickly.

     

     Note that the load on the PSU is high during a parity scan. Unstable power can make the CPU, memory, motherboard, ... malfunction and give exactly this kind of issue. Yet you can run any number of CPU or memory tests afterwards without seeing a problem, since individual component tests put a lower total load on the system, letting the PSU supply stable voltages. Component tests also don't raise the temperature inside the box as much. During a parity scan you have both a high power load on the CPU and generally warmer intake air for the PSU.
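     One way to catch this is to log temperatures while the parity scan runs and see whether the failure point lines up with a climb. A minimal sketch - it assumes lm-sensors and smartmontools are installed, and both /dev/sdb and the "Core 0" sensor label are placeholders, not anything from the original post:

```shell
#!/bin/sh
# drive_temp extracts the raw value (column 10) of SMART attribute 194
# (Temperature_Celsius) from a `smartctl -A` dump on stdin.
drive_temp() {
    awk '$2 == "Temperature_Celsius" { print $10 }'
}

# Log CPU and drive temperature once a minute; start in the background
# just before the parity check, e.g.:  log_temps >> /boot/temp.log &
log_temps() {
    while :; do
        printf '%s cpu:%s drive:%s\n' \
            "$(date '+%F %T')" \
            "$(sensors | awk '/^Core 0/ { print $3; exit }')" \
            "$(smartctl -A /dev/sdb | drive_temp)"
        sleep 60
    done
}
```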

     

     

  3. Thanks for the answer. I don't run any docker apps on the unit - I have a separate application server machine. But beta4 is so old that it still suffers from a timing issue in the RAID code that makes it regularly report a write error when copying new files to the unit. I more-or-less stopped using this specific NAS for anything other than streaming out data, as I got tired of waiting for a fix for this specific problem.

     

     But if this update goes well, then I have an unRAID 5 installation to update too - and there is documentation covering the steps for going from version 5 to version 6.

  4.  

    Ok, first and foremost, can you reboot into safe mode to see if the parity check performance issue continues?  I want to see if this is plugin related since I see you're using a bunch.  Just a quick "sanity check" before we move on in troubleshooting...

     

    Thought I uninstalled the old plugins from v5, guess not. I did have apcupsd and unmenu running, disabled unmenu and booted into safe mode, showing 1.1MB/s now, current position 533MB after 5 minutes, going to let it continue to run.

     

     EDIT: Think I might've found my problem. Went through and read a large file from /mnt/disk* for each drive with dd; all drives came back with 100+ MB/s except one. That drive is dying, isn't it?

     

    That fixed it, swapped drives around now rebuilding at ~110MB/s. Guess I had a drive decide to fail at the exact same time I decided to try beta6. Thanks for the help jonp

     Have you verified that you really had a bad drive? You might also have had an issue with the cabling, resulting in transfer errors and the disk transfer mode switching from a fast DMA mode to an extremely slow and CPU-intensive PIO mode.

     

    On the other hand - lack of automatic supervision of the disks means that a disk can be bad for quite some time without it being noticed.
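     The per-disk read test described above can be sketched like this (the exact dd invocation and the dmesg patterns are assumptions about typical Linux kernel messages, not unRAID specifics):

```shell
#!/bin/sh
# read_speed reads a file sequentially and prints dd's summary line,
# which includes the achieved throughput.
read_speed() {
    dd if="$1" of=/dev/null bs=1M 2>&1 | tail -n 1
}

# Test one large file from each data disk:
#   for d in /mnt/disk*; do
#       f=$(find "$d" -type f -size +100M 2>/dev/null | head -n 1)
#       [ -n "$f" ] && { echo "== $d =="; read_speed "$f"; }
#   done
#
# A cabling problem often shows up in the kernel log as the transfer
# mode being stepped down:
#   dmesg | grep -iE 'limiting SATA link speed|PIO[0-9]|exception Emask'
```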

  5. I definitely hope plugin support will not go away.

     

    I don't want virtual machines for the task. I have no cache disk and if I had room for a cache disk I would use that space/cable for one more data disk.

     

    And I want my disks to sleep which is an advantage we get from booting from a USB thumb drive. With virtual machines, we'll get lots of extra processes that will regularly want to make disk accesses.

  6. Just noticed something interesting while wondering why a new N54L with two 4TB Seagate disks only builds parity at 40MB/s.

     

     Most subsystems think that

    /dev/sda = flash thumb drive

    /dev/sdb = 4TB disk

    /dev/sdc = 4TB disk

     

     When using hdparm -t (or -T) to test transfer speeds, the mapping is:

    /dev/sda = 4TB disk

    /dev/sdb = flash thumb drive

    /dev/sdc = 4TB disk

     

    For other hdparm commands, the devices are in the expected order.

     

     I have never seen hdparm goof like this on any other system - is there any special driver-layer code in unRAID that might give this weird behavior?
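     One way to rule out confusion about which physical drive each letter currently points at is to go through the persistent symlinks instead of the letters. A sketch, assuming GNU ls column layout and standard udev /dev/disk/by-id links (none of this is from the original post):

```shell
#!/bin/sh
# resolve_letters turns `ls -l /dev/disk/by-id` output on stdin into
# "persistent-id -> sdX" pairs, skipping partition entries. Column
# positions ($9 = name, $10 = "->", $11 = target) assume GNU ls
# long-listing format.
resolve_letters() {
    awk '$10 == "->" && $9 !~ /-part[0-9]+$/ {
        n = split($11, p, "/"); print $9, "->", p[n]
    }'
}

# Typical use, plus a serial-number cross-check of a suspect device:
#   ls -l /dev/disk/by-id | resolve_letters
#   hdparm -I /dev/sda | grep -i 'serial number'
```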
