Posts posted by Husker_N7242C

  1. Hi guys,

    I recently added a new parity disk and rebuilt parity OK, then a couple of days later I had an issue with a power cable which caused read errors on multiple disks. I shut down the server before any disks were disabled and replaced the bad cable. After replacing the cable and rebooting, disk4 is unmountable.

    Parity should be valid, but the contents of the disk aren't being emulated. I don't understand why.

     

    I tried running xfs_repair -n /dev/sdp twice, but the disk hasn't come good.
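    For reference, here's roughly what I ran, plus what I understand the actual repair step would look like. I'm assuming disk4 corresponds to /dev/md4 and that the repair should be run against the emulated md device with the array started in maintenance mode so that parity stays in sync; please correct me if that's wrong.

    # What I ran so far (read-only check, makes no changes):
    xfs_repair -n /dev/sdp

    # What I understand the actual repair would be, run against the emulated
    # md device in maintenance mode (my assumption - please confirm first):
    xfs_repair -v /dev/md4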

     

    Is there a way to emulate the disk so that I can copy the data to other disks and try re-formatting the unmountable disk?

     

     Attached is a screenshot of the GUI and diagnostics. 

     

    Thanks in advance

    image.png

    nas-diagnostics-20231102-1954.zip

  2. G'day ich777,

    I'm testing out your LuckyBackup container and I think there is a bug...

    When I back up from one UNRAID server to another UNRAID server using "backup source inside destination" with 10 snapshots enabled, the following happens:

    1. The files are copied successfully

    2. Modified files are stored in the snapshots folder within the destination correctly

    The issue is with restorations. No matter which snapshot I select, it only restores the latest backup set and seems to ignore that I've selected an older snapshot. This makes it impossible to restore any version of a file/directory apart from the latest backup.
    I can confirm that the older version of a file does get copied to the snapshots folder and then the date/time folder, and I can grab an older copy of the file out manually; however, restore simply doesn't work as expected.
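    In the meantime, my manual workaround looks roughly like this - pulling an older copy straight out of the snapshot folder with rsync instead of using the restore function (the paths are made up for the example; the real ones follow the snapshots folder / date-time folder layout described above):

    # Example only - restore a directory from a specific snapshot by hand:
    rsync -av "/mnt/user/backups/myprofile/snapshots/2023-10-28-0300/some/folder/" \
              "/mnt/user/restore/some/folder/"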

    Do you know of a way to fix this issue, or an alternate container that can do UNRAID-to-UNRAID backups with snapshots?

    Many thanks for all of your contributions!

    Dan

  3. On 12/15/2022 at 10:50 PM, Mushin said:

    However, there is no option to exclude shares

    Same. Something has changed very recently and "Docker Safe New Permissions" is now broken because you can't set the include/exclude shares any longer.

     

    I'm stuck because I have all of my shares barring one excluded.
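    In case it helps anyone else who's stuck, the manual fallback I'm considering is applying what I understand to be the same permissions the tool sets, share by share, and simply skipping the share I need excluded (the chown/chmod values below are my understanding of the standard new-permissions recipe, so please double-check them before running anything):

    # Apply the new-permissions recipe to every share except the excluded one
    # ("MyExcludedShare" is just an example name):
    for share in /mnt/user/*; do
        [ "$(basename "$share")" = "MyExcludedShare" ] && continue
        chown -R nobody:users "$share"
        chmod -R u-x,go-rwx,go+u,ugo+X "$share"
    done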

     

    (screenshots attached)

  4. 28 minutes ago, JorgeB said:

    Any vdisks or other image-type files there can grow over time if not trimmed/unmapped. Do you have that type of file there?

    G'Day @JorgeB,

    All VM vdisks are on the other cache pool. The docker image, System folder, appdata and a few folders waiting for mover are all that is on this cache pool. All folders had been selected when I grabbed the total file size in Krusader. 
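    If it's useful, I can re-check the numbers with something like the commands below to compare on-disk usage against apparent size; a big difference would point at sparse image files (this assumes the pool is mounted at /mnt/cache):

    # On-disk (allocated) usage per top-level folder on the pool:
    du -sh /mnt/cache/*
    # Apparent (logical) size for comparison - sparse files like docker.img
    # or vdisks can show a large gap between the two:
    du -sh --apparent-size /mnt/cache/*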

    Thanks again @JorgeB. I've disabled spin-down now and will go through the process for disabling EPC and low current spinup again to be sure they're still applied. I was quite careful when I originally did this to make sure all the commands succeeded, but you never know.

     

    I'll mark your response as the solution and get back in touch if re-applying the workarounds doesn't prevent this.

     

    I want to avoid having the drives continuously spun up, as we're just coming into summer in Australia and the server's room isn't air-conditioned. I try to keep the server as dormant as possible during the middle of the day: no downloads, no transcoding, just a couple of VMs on the SSDs (with 12x 140mm Noctua fans), then let it all rip at night.

    Thanks @JorgeB, as always, for responding; however, I've been through the process HERE to disable EPC and Low Current Spinup for all of these drives in the past. Is there anything else in the diagnostics that looks wrong?

    I'll disable spin-down, add the drive back to the array and wait for it to happen again; that way I can be positive it is/isn't this known issue.

     

    Just to double-check - if I stop the array, remove the disk, start the array and then stop the array again, re-add the disk and let the array rebuild, will I lose any data written to disk2 (currently emulated) since it was disabled, or does the drive get rebuilt from the parity information instead?

  7. Hi everyone,

    My old server had continuous issues with disks being disabled, yet being found healthy. I never could get to the bottom of it.

    I rebuilt the server:

    -New motherboard (brand new, old stock, SuperMicro X10-DRH-CT)

    -New CPU(s) (used E5-2690V3 x2)

    -New RAM (128GB DDR4 ECC)

    -New HBAs (and SAS-Sata cables) (LSI 9207-8i x2)

    -New USB Flash Drive (Sandisk Cruzer Blade 32GB)

    -New Power Supply (850w Corsair)

     

    I ran it for two weeks on Windows (no HDDs) and ran many hours of stress-testing to ensure the temps stayed low; nothing failed out of the box.

     

    I transferred over the disks and flash, and all seemed well for a week, but I've just had disk2 disabled due to read errors. The drive isn't that old and the dashboard reports "Healthy"... but disabled.

     

    I'd really appreciate it if anyone with more knowledge than me could have a look at the diagnostics (attached) and make some recommendations.

     

    If you need any more info, please let me know.

     

    Dan

    nas-diagnostics-20221202-1659.zip

  8. Hi everyone,

     

    Can anyone tell me if I'd be able to use the built-in LSI 3108 controller on the Supermicro X10DRH-C motherboard for UNRAID? Is it possible to flash this to IT mode? (I have no experience with built-in raid cards and only a little with PCI-e add-in cards. Also no experience with Supermicro BIOS or flashing them).

     

    Product Page - https://www.supermicro.com/en/products/motherboard/X10DRH-C

    PDF User Manual - https://www.supermicro.com/manuals/motherboard/C600/MNL-1628.pdf

  9. Hi everyone,

    I found that Docker had crashed somehow. Stopping and starting Docker didn't resolve it, so I rebooted the server. It took about 10 minutes to stop the array, and then when it came back online it got stuck at "Starting array" for a VERY long time. It seemed to be doing something, but super slowly.

    There is an I/O error repeated at the end of the attached syslog; I'm not sure if it is related or not (or what it means).

    Any help interpreting the diagnostics would be gratefully received.

    Daniel

    syslog nas-diagnostics-20220725-1904.zip

  10. 13 hours ago, ChatNoir said:

    AFAIK, the Memtest included with Unraid is an older version (licensing issues) and does not detect errors on ECC modules.

    As suggested, you should try the version from the official website: https://www.memtest86.com/

    Looks like you are right! Version 5 doesn't support ECC. Higher versions don't support older BIOSes... guess Unraid is stuck between a rock and a hard place... or should include both and describe the choice in the boot menu... or just remove it completely. I think I would have seen this if I'd had to download and create a boot USB myself, but I missed it because I know UNRAID recommends ECC memory, so I incorrectly assumed that the memory tool packaged with the OS would work with ECC memory. Probably needs a re-think.

  11. 5 minutes ago, JorgeB said:

    It will depend on the board; Supermicro server boards, for example, usually list the affected DIMM in the SEL.

    It's just an ASRock X79 Extreme 11. Pretty cool board (for its time) but I don't know of any diagnostic tools like that for it. It threw a boot code last week before I ran all of the memory tests, but the code was a general memory error that didn't indicate a particular DIMM.

  12. 11 minutes ago, JorgeB said:

    You should replace that DIMM. If there's an uncorrectable error the server will halt. The system event log might have more info on the affected DIMM; if not, just remove one at a time until the errors stop.

    Thanks for replying again. Of the 8 DIMMs, it's possible that more than one has a fault. It's also quad channel, so I'd have to remove more than one at a time, and it will be a LONG process to eliminate them. There MUST be a way to scan the memory properly; if UNRAID can encounter the issues within 24 hours so consistently, surely a memory testing program can too. I'll check the system event log as well for a hint.
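    If the SEL doesn't give a hint, I'll also try reading the kernel's EDAC counters, which I believe break corrected errors down per csrow/channel (assuming the EDAC driver actually loads for this chipset):

    # Corrected-error counters exposed by the EDAC driver; a non-zero value
    # should at least narrow the fault down to a row/channel:
    grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ce_count
    grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count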

  13. 14 minutes ago, itimpi said:

    Are you using the memtest that comes with Unraid or did you download a version from memtest86.com that can test ECC ram?

    The one that comes with UNRAID. From what I can see, memtest86 supports testing ECC memory. Is there another version that would be better for DDR3 ECC?

  14. On 4/23/2022 at 5:42 PM, JorgeB said:

    Next time please use the forum.

     

    Not seeing anything logged relevant to the crashing; it could be hardware related. One thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

     

    As for the disks getting disabled see below:

     

     

    I wasn't able to pin down a hardware fault at first. I ran memtest86 with 3 passes on each of the modules one at a time, then for 2 days straight with all 8 modules, and got 0 errors. I replaced the HX850 PSU with a new RM850x single-rail unit and checked that my total power draw is within spec (it is, easily). I booted again and had a freeze within a day.
    Syslog now shows memory errors (as below). Unfortunately, though, it doesn't show which module (I understand why it doesn't).

    So I probably have a memory issue, but if memtest86 can't find a fault I'm not sure how to work out which module is bad. I can't really afford to replace 64GB of ECC memory, TBH. Any ideas on how to narrow this down?

    Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274769153 in 24h
    Apr 29 07:29:50 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:50 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:50 NAS mcelog: Too many trigger children running already
    Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 17062 exceeded threshold: 2274704085 in 24h
    Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274736619 in 24h
    Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:50 NAS mcelog: Cannot collect child 9515: No child processes
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
    Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
    Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274838278 in 24h
    Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:51 NAS mcelog: Too many trigger children running already
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
    Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS kernel: mce_notify_irq: 120 callbacks suppressed
    Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
    Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
    Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
    Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274880692 in 24h
    Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:52 NAS mcelog: Too many trigger children running already
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
    Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 31707 exceeded threshold: 2274912400 in 24h
    Apr 29 07:29:53 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:53 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 23544 exceeded threshold: 2274935945 in 24h

  15. On 4/23/2022 at 5:42 PM, JorgeB said:

    Next time please use the forum.

     

    Not seeing anything logged relevant to the crashing; it could be hardware related. One thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

     

    As for the disks getting disabled see below:

     

     

    Thanks again.
    I've disabled EPC and low current spinup on all drives (8 of which are the model mentioned in the guide, and another is a 3TB which had both features enabled). This is currently the best-value NAS Seagate drive in Australia (and has been for some time now). Is there a way to build the SeaChest executables into UNRAID and have them automatically check and disable these features on all Seagate drives, including when a new drive is added? It seems like this would be a relatively simple script to embed as a bug fix. (I know nothing about this stuff really; it just seems silly to expect users to fix this bug for themselves and repeat it every time they add a drive.)
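    Something along the lines of the sketch below is what I had in mind - a script that loops over the drives, picks out the Seagates and disables both features (the SeaChest flag names are from my memory of the guide, so please treat them as assumptions and verify against the actual tools before running):

    #!/bin/bash
    # Disable EPC and low current spinup on every Seagate drive found.
    # The SeaChest flags below are as I remember them from the guide - verify first.
    for dev in /dev/sg*; do
        # Crude Seagate check against the SMART identity info; adjust to suit.
        smartctl -i "$dev" | grep -Eqi 'seagate|^device model:.*ST' || continue
        SeaChest_PowerControl -d "$dev" --EPCfeature disable
        SeaChest_Configure -d "$dev" --lowCurrentSpinup disable
    done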

    Regarding the crashing, I've ordered a new PSU (I haven't tested the old one, but it's due for replacement anyway) and will run a memory test and go from there.

    Are there any hardware diagnostic tools that you would suggest for checking the LSI controllers and the motherboard? Anywhere else that helpful logs might be hiding?


     
