Posts posted by Husker_N7242C

  1. Hi guys,

    I recently added a new parity disk and rebuilt parity OK, then a couple of days later I had an issue with a power cable which caused read errors on multiple disks. I shut down the server before any disks were disabled and replaced the bad cable. After replacing the cable and rebooting, disk4 is unmountable.

    Parity should be valid, but the contents of the disk aren't being emulated. I don't understand why.

     

    I tried running xfs_repair -n /dev/sdp twice, but the disk hasn't come good.
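    For reference, here's roughly what I ran, plus what I understand the actual repair step would look like. I'm assuming disk4 corresponds to /dev/md4 and that the repair should be run against the emulated md device with the array started in maintenance mode so that parity stays in sync; please correct me if that's wrong.

    # What I ran so far (read-only check, makes no changes):
    xfs_repair -n /dev/sdp

    # What I understand the actual repair would be, run against the emulated
    # md device in maintenance mode (my assumption - please confirm first):
    xfs_repair -v /dev/md4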

     

    Is there a way to emulate the disk so that I can copy the data to other disks and try re-formatting the unmountable disk?

     

     Attached is a screenshot of the GUI and diagnostics. 

     

    Thanks in advance

    image.png

    nas-diagnostics-20231102-1954.zip

  2. G'day ich777,

    I'm testing out your LuckyBackup container and I think there is a bug...

    When I back up from one UNRAID server to another UNRAID server using "backup source inside destination" with 10 snapshots enabled, the following happens:

    1. The files are copied successfully

    2. Modified files are stored in the snapshots folder within the destination correctly

    The issue is with restorations. No matter which snapshot I select, it only restores the latest backup set and seems to ignore that I've selected an older snapshot. This makes it impossible to restore any version of a file/directory apart from the latest backup.
    I can confirm that the older version of a file does get copied to the snapshots folder and then the date/time folder, and I can grab an older copy of the file out manually; however, restore simply doesn't work as expected.
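    In the meantime, my manual workaround looks roughly like this - pulling an older copy straight out of the snapshot folder with rsync instead of using the restore function (the paths are made up for the example; the real ones follow the snapshots folder / date-time folder layout described above):

    # Example only - restore a directory from a specific snapshot by hand:
    rsync -av "/mnt/user/backups/myprofile/snapshots/2023-10-28-0300/some/folder/" \
              "/mnt/user/restore/some/folder/"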

    Do you know of a way to fix this issue, or an alternate container that can do UNRAID-to-UNRAID backups with snapshots?

    Many thanks for all of your contributions!

    Dan

  3. On 12/15/2022 at 10:50 PM, Mushin said:

    However, there is no option to exclude shares

    Same. Something has changed very recently and "Docker Safe New Permissions" is now broken because you can't set the include/exclude shares any longer.

     

    I'm stuck because I have all of my shares barring one excluded.
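    In case it helps anyone else who's stuck, the manual fallback I'm considering is applying what I understand to be the same permissions the tool sets, share by share, and simply skipping the share I need excluded (the chown/chmod values below are my understanding of the standard new-permissions recipe, so please double-check them before running anything):

    # Apply the new-permissions recipe to every share except the excluded one
    # ("MyExcludedShare" is just an example name):
    for share in /mnt/user/*; do
        [ "$(basename "$share")" = "MyExcludedShare" ] && continue
        chown -R nobody:users "$share"
        chmod -R u-x,go-rwx,go+u,ugo+X "$share"
    done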

     

    (screenshots attached)

  4. 28 minutes ago, JorgeB said:

    Any vdisks or other image-type files there can grow over time if not trimmed/unmapped. Do you have that type of file there?

    G'Day @JorgeB,

    All VM vdisks are on the other cache pool. The docker image, System folder, appdata and a few folders waiting for mover are all that is on this cache pool. All folders had been selected when I grabbed the total file size in Krusader. 
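    If it's useful, I can re-check the numbers with something like the commands below to compare on-disk usage against apparent size; a big difference would point at sparse image files (this assumes the pool is mounted at /mnt/cache):

    # On-disk (allocated) usage per top-level folder on the pool:
    du -sh /mnt/cache/*
    # Apparent (logical) size for comparison - sparse files like docker.img
    # or vdisks can show a large gap between the two:
    du -sh --apparent-size /mnt/cache/*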

    Thanks again @JorgeB. I've disabled spin-down now and will go through the process for disabling EPC and low current spinup again to be sure they're still applied. I was quite careful when I originally did this to make sure all the commands succeeded, but you never know.

     

    I'll mark your response as the solution and get back in touch if re-applying the workarounds doesn't prevent this.

     

    I want to avoid having the drives continuously spun up, as we're just coming into summer in Australia and the server's room isn't air-conditioned. I try to keep the server as dormant as possible during the middle of the day: no downloads, no transcoding, just a couple of VMs on the SSDs (with 12x 140mm Noctua fans), then let it all rip at night.

    Thanks @JorgeB, as always, for responding; however, I've been through the process HERE to disable EPC and Low Current Spinup for all of these drives in the past. Is there anything else in the diagnostics that looks wrong?

    I'll disable spin-down, add the drive back to the array and wait for it to happen again; that way I can be positive it is/isn't this known issue.

     

    Just to double-check - if I stop the array, remove the disk, start the array and then stop the array again, re-add the disk and let the array rebuild, will I lose any data written to disk2 (currently emulated) since it was disabled, or does the drive get rebuilt from the parity information instead?

  7. Hi everyone,

    My old server had continuous issues with disks being disabled, yet being found healthy. I never could get to the bottom of it.

    I rebuilt the server:

    -New motherboard (brand new, old stock, SuperMicro X10-DRH-CT)

    -New CPU(s) (used E5-2690V3 x2)

    -New RAM (128GB DDR4 ECC)

    -New HBAs (and SAS-Sata cables) (LSI 9207-8i x2)

    -New USB Flash Drive (Sandisk Cruzer Blade 32GB)

    -New Power Supply (850w Corsair)

     

    I ran it for two weeks on Windows (no HDDs) and ran many hours of stress-testing to ensure the temps stayed low; nothing failed out of the box.

     

    I transferred over the disks and flash, and all seemed well for a week, but I've just had disk2 disabled due to read errors. The drive isn't that old and the dashboard reports "Healthy"... but disabled.

     

    I'd really appreciate it if anyone with more knowledge than me could have a look at the diagnostics (attached) and make some recommendations.

     

    If you need any more info, please let me know.

     

    Dan

    nas-diagnostics-20221202-1659.zip

  8. Hi everyone,

     

    Can anyone tell me if I'd be able to use the built-in LSI 3108 controller on the Supermicro X10DRH-C motherboard for UNRAID? Is it possible to flash this to IT mode? (I have no experience with built-in raid cards and only a little with PCI-e add-in cards. Also no experience with Supermicro BIOS or flashing them).

     

    Product Page - https://www.supermicro.com/en/products/motherboard/X10DRH-C

    PDF User Manual - https://www.supermicro.com/manuals/motherboard/C600/MNL-1628.pdf

  9. Hi everyone,

    I found that Docker had crashed somehow. Stopping and starting Docker didn't resolve it, so I rebooted the server. It took about 10 minutes to stop the array, and then when it came back online it got stuck at "Starting array" for a VERY long time. It seemed to be doing something, but super slowly.

    There is an I/O error repeated at the end of the attached syslog; I'm not sure if it is related or not (or what it means).

    Any help interpreting the diagnostics would be gratefully received.

    Daniel

    syslog nas-diagnostics-20220725-1904.zip

  10. 13 hours ago, ChatNoir said:

    AFAIK, the Memtest included with Unraid is an older version (licensing issues) and does not detect errors on ECC modules.

    As suggested, you should try the version from the official website: https://www.memtest86.com/

    Looks like you are right! Version 5 doesn't support ECC. Higher versions don't support older BIOSes... guess Unraid is stuck between a rock and a hard place... or should include both and describe the choice in the boot menu... or just remove it completely. I think I would have seen this if I'd had to download and create a boot USB myself, but I missed it because I know UNRAID recommends ECC memory, so I incorrectly assumed that the memory tool packaged with the OS would work with ECC memory. Probably needs a re-think.

  11. 5 minutes ago, JorgeB said:

    It will depend on the board; Supermicro server boards, for example, usually list the affected DIMM in the SEL.

    It's just an ASRock X79 Extreme 11. Pretty cool board (for its time) but I don't know of any diagnostic tools like that for it. It threw a boot code last week before I ran all of the memory tests, but the code was a general memory error that didn't indicate a particular DIMM.

  12. 11 minutes ago, JorgeB said:

    You should replace that DIMM. If there's an uncorrectable error the server will halt. The system event log might have more info on the affected DIMM; if not, just remove one at a time until the errors stop.

    Thanks for replying again. Of the 8 DIMMs, it's possible that more than one has a fault. It's also quad channel, so I'd have to remove more than one at a time, and it will be a LONG process to eliminate them. There MUST be a way to scan the memory properly; if UNRAID can encounter the issues within 24 hours so consistently, surely a memory testing program can too. I'll check the system event log as well for a hint.
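    If the SEL doesn't give a hint, I'll also try reading the kernel's EDAC counters, which I believe break corrected errors down per csrow/channel (assuming the EDAC driver actually loads for this chipset):

    # Corrected-error counters exposed by the EDAC driver; a non-zero value
    # should at least narrow the fault down to a row/channel:
    grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ce_count
    grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count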

  13. 14 minutes ago, itimpi said:

    Are you using the memtest that comes with Unraid or did you download a version from memtest86.com that can test ECC ram?

    The one that comes with UNRAID. From what I can see, memtest86 supports testing ECC memory. Is there another version that would be better for DDR3 ECC?

  14. On 4/23/2022 at 5:42 PM, JorgeB said:

    Next time please use the forum.

     

    Not seeing anything logged relevant to the crashing; it could be hardware related. One thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

     

    As for the disks getting disabled see below:

     

     

    I wasn't able to pin down a hardware fault at first. I ran memtest86 with 3 passes on each of the modules one at a time, then for 2 days straight with all 8 modules, and got 0 errors. I replaced the HX850 PSU with a new RM850x single-rail unit and checked that my total power draw is within spec (it is, easily). I booted again and had a freeze within a day.
    Syslog now shows memory errors (as below). Unfortunately, though, it doesn't show which module (I understand why it doesn't).

    So I probably have a memory issue, but if memtest86 can't find a fault I'm not sure how to work out which module is bad. I can't really afford to replace 64GB of ECC memory, TBH. Any ideas on how to narrow this down?

    Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274769153 in 24h
    Apr 29 07:29:50 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:50 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:50 NAS mcelog: Too many trigger children running already
    Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 17062 exceeded threshold: 2274704085 in 24h
    Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274736619 in 24h
    Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:50 NAS mcelog: Cannot collect child 9515: No child processes
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
    Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
    Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274838278 in 24h
    Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:51 NAS mcelog: Too many trigger children running already
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
    Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
    Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS kernel: mce_notify_irq: 120 callbacks suppressed
    Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
    Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
    Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
    Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274880692 in 24h
    Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:52 NAS mcelog: Too many trigger children running already
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
    Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
    Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 31707 exceeded threshold: 2274912400 in 24h
    Apr 29 07:29:53 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
    Apr 29 07:29:53 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
    Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 23544 exceeded threshold: 2274935945 in 24h

  15. On 4/23/2022 at 5:42 PM, JorgeB said:

    Next time please use the forum.

     

    Not seeing anything logged relevant to the crashing; it could be hardware related. One thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

     

    As for the disks getting disabled see below:

     

     

    Thanks again.
    I've disabled EPC and low current spinup on all drives (8 of which are the model mentioned in the guide, and another is a 3TB which had both features enabled). This is currently the best-value NAS Seagate drive in Australia (and has been for some time now). Is there a way to build the SeaChest executables into UNRAID and have them automatically check and disable these features on all Seagate drives, including when a new drive is added? It seems like this would be a relatively simple script to embed as a bug fix. (I know nothing about this stuff really; it just seems silly to expect users to fix this bug for themselves and repeat it every time they add a drive.)
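    Something along the lines of the sketch below is what I had in mind - a script that loops over the drives, picks out the Seagates and disables both features (the SeaChest flag names are from my memory of the guide, so please treat them as assumptions and verify against the actual tools before running):

    #!/bin/bash
    # Disable EPC and low current spinup on every Seagate drive found.
    # The SeaChest flags below are as I remember them from the guide - verify first.
    for dev in /dev/sg*; do
        # Crude Seagate check against the SMART identity info; adjust to suit.
        smartctl -i "$dev" | grep -Eqi 'seagate|^device model:.*ST' || continue
        SeaChest_PowerControl -d "$dev" --EPCfeature disable
        SeaChest_Configure -d "$dev" --lowCurrentSpinup disable
    done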

    Regarding the crashing, I've ordered a new PSU (I haven't tested the old one, but it's due for replacement anyway) and will run a memory test and go from there.

    Are there any hardware diagnostic tools that you would suggest for checking the LSI controllers and the motherboard? Anywhere else that helpful logs might be hiding?


     
