6.12.3 - Intermittent System Issues, Root Cause Trouble Shooting Help Needed

July 30, 2023

I would really appreciate some help on how to proceed with troubleshooting. I've been having intermittent issues with Unraid over the last 1.5 years. Any time I resolve what seems to be the immediate issue, another one crops up. I am now back where I started. I am concerned that, despite all my effort, I am just treating symptoms and not the underlying root cause.

The system was stable for over 3 years prior to these issues, with no hardware changes. I am running this as a bare-bones setup: an Intel NUC with one internal HDD, one M.2 cache drive, and one external USB HDD. No parity drive (I acknowledge the data loss risk in doing this).

Troubleshooting attempts: checked / repaired a bad filesystem, replaced a bad HDD (new USB cable w/ new HDD), wiped and rebuilt the array / dockers from scratch, moved key directories to the cache drive, reviewed system logs, reviewed drive logs, ran SMART tests on the drives, ran MemTest86, ran disk repair on the Unraid thumb drive, updated the BIOS.

Issues: Unraid UI slow / unresponsive, docker start errors, disk errors, disks missing, docker permission errors, Unraid failing to update (intermittent network error), TPM error, other intermittent weird issues.

If someone could advise on how to solve what I assume is a root problem causing these issues, I'd greatly appreciate it.

Logs as of today are attached.

nuc-diagnostics-20230730-1219.zip

Edited July 30, 2023 by wlsn0chrs

July 30, 2023

Looks like your USB drive is disconnected. This is one of the many reasons USB is not recommended for the array or pools. Since you have no parity it won't be out of sync; maybe a reboot will reset the connection.

July 31, 2023

Author

Thank you, that appears to have resolved the external USB HDD not mounting. After rebooting it has remounted and docker is running.

Checking the system log, I now have some BTRFS warnings again. Does this indicate the cache drive is failing? Any advice on how to resolve the issue?

A few weeks ago I had a BTRFS warning for a different sector on the cache drive which mapped to the docker.img file. I removed the file, performed a balance and a scrub, checked the file system status, and ran a SMART test of the cache drive. For good measure, I ran MemTest86 a few times. All OK. I then reinstalled the previous dockers via CA. This took several attempts, with the installers failing mid-download and reporting a "Network failure", until it finally succeeded. Everything seemed OK after that. The cache drive is an internal M.2 NVMe SSD.

I tried to see which file is mapped to 144457728, but nothing is returned when I run the command in the terminal.

Quote

root@nuc:~# find /mnt/cache -inum 144457728
root@nuc:~#

Quote

Jul 30 18:13:23 nuc kernel: BTRFS warning (device loop2): checksum verify failed on logical 144457728 mirror 1 wanted 0x0d7e520c found 0x6946c52d level 0
Jul 30 18:13:23 nuc kernel: BTRFS info (device loop2): read error corrected: ino 0 off 144457728 (dev /dev/loop2 sector 298528)
Jul 30 18:13:23 nuc kernel: BTRFS info (device loop2): read error corrected: ino 0 off 144461824 (dev /dev/loop2 sector 298536)
Jul 30 18:13:23 nuc kernel: BTRFS info (device loop2): read error corrected: ino 0 off 144465920 (dev /dev/loop2 sector 298544)
Jul 30 18:13:23 nuc kernel: BTRFS info (device loop2): read error corrected: ino 0 off 144470016 (dev /dev/loop2 sector 298552)
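A note on the lookup above: the number in the `checksum verify failed on logical ...` message is a BTRFS logical byte address, not an inode number, so `find -inum` would not be expected to match it. The sketch below pulls the logical addresses out of the kernel log so they can be passed to `btrfs inspect-internal logical-resolve`. The function name `btrfs_logical_addrs`, the log path `/var/log/syslog`, and the mount point `/var/lib/docker` for loop2 are assumptions for illustration, not details confirmed in this thread. The `ino 0` and `level 0` fields suggest the block may be metadata, in which case no file path would be returned either way.

```shell
#!/bin/bash
# Print the unique logical addresses found in BTRFS
# "checksum verify failed on logical N" lines read from stdin.
btrfs_logical_addrs() {
    sed -n 's/.*checksum verify failed on logical \([0-9][0-9]*\).*/\1/p' | sort -un
}

# Hypothetical usage on the server (not run here; paths are assumptions):
#   btrfs_logical_addrs < /var/log/syslog | while read -r addr; do
#       btrfs inspect-internal logical-resolve "$addr" /var/lib/docker
#   done
```

Only the `checksum verify failed` lines are matched; the `read error corrected` lines carry offsets in a different field and are ignored by this sketch.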

nuc-diagnostics-20230730-2156.zip

Edited July 31, 2023 by wlsn0chrs
typo / wording

July 31, 2023

loop2 is the docker image; you should recreate it. Also run a correcting scrub on the pool and post the result.
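For anyone following along from a terminal, here is a minimal sketch of the scrub step. It assumes the pool is mounted at `/mnt/cache` (the path used earlier in the thread); the helper name `scrub_is_clean` is hypothetical. The helper only checks the text of the status report and does not start anything itself.

```shell
#!/bin/bash
# Succeed only if `btrfs scrub status` output (read from stdin)
# contains a clean "Error summary" line.
scrub_is_clean() {
    grep -q '^[[:space:]]*Error summary:[[:space:]]*no errors found'
}

# Hypothetical usage on the server (not run here; mount point is an assumption):
#   btrfs scrub start -B /mnt/cache    # -B waits until the scrub finishes
#   btrfs scrub status /mnt/cache | scrub_is_clean && echo "scrub clean"
```

Without the `-r` (read-only) flag, a scrub attempts repairs where a good copy of the data exists.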

August 7, 2023

Author

1. Recreated the docker image per your linked instructions. Ran scrub w/ no errors found. Logs attached.

Quote

UUID: 5ef80807-f297-43ce-9616-72ee886e3443
Scrub started: Sun Aug 6 22:02:27 2023
Status: finished
Duration: 0:00:20
Total to scrub: 7.66GiB
Rate: 392.14MiB/s
Error summary: no errors found

2. Reinstalled docker apps via CA. During the installs, many of the dockers had to repeatedly retry downloading their files. Two dockers fail with "Error: local error: tls: bad record MAC" and continue to fail to install. I will retry later in the week (see screenshot).

3. The system log is being spammed with "kernel: tpm tpm0: A TPM error (257) occurred attempting get random".

4. In the archived notifications I have two disk warnings: "Array has 1 disk with read errors warning Disk 2 - WD_Game_Drive_57583332443331443556374A-0:0 (sda) (errors 1)". However, the disk log and attributes do not reflect this.

nuc-diagnostics-20230806-2203.zip

Edited August 7, 2023 by wlsn0chrs

August 7, 2023

Diags don't show any read errors, did you reboot?

August 7, 2023

Author

Yes, unfortunately the machine was inadvertently unplugged last week, around 8/1. I've enabled the local syslog server to avoid losing the logs going forward.

Does the disk read error count get reset on a system reboot? I've gone through my dump of old log files and unfortunately don't have any covering the periods of the listed disk read errors. Screenshot of the archived disk error notifications below.

August 7, 2023

8 minutes ago, wlsn0chrs said:

Does the disk read error count get reset on a system reboot?

Yes. If it happens again, post new diags.

August 10, 2023

Author

Disk read error has occurred again. Please see attached diags.

nuc-diagnostics-20230809-2341.zip

August 10, 2023

The log is filled with docker image corruption errors and other spam. Delete and re-create the docker image, reboot to clear the log, and post new diags after more errors appear.

September 1, 2023

Author

Docker image deleted and recreated. System rebooted 3 days ago. Please find logs attached.

Network error when trying to update to 6.12.4:

Quote

plugin: unRAIDServer-6.12.4-x86_64.zip download failure: Network failure

BTRFS error:

Quote

BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 51800, gen 0
BTRFS warning (device loop2): csum failed root 327 ino 1297 off 143360 csum 0xcfbb2b59 expected csum 0x11bf5696 mirror 1
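The `errs:` line above carries the per-device error counters (write, read, flush, corruption, generation). As a rough sketch, the corruption count can be extracted from saved logs to see whether it keeps growing between diagnostics. The function name `btrfs_corrupt_counts` is hypothetical, and the pattern assumes the message format shown in this post.

```shell
#!/bin/bash
# Print "<device> <corruption count>" for every BTRFS
# "bdev ... errs: ..." line read from stdin.
btrfs_corrupt_counts() {
    sed -n 's/.*bdev \([^ ]*\) errs: .*corrupt \([0-9][0-9]*\),.*/\1 \2/p'
}

# Hypothetical usage (log path and mount point are assumptions):
#   btrfs_corrupt_counts < /var/log/syslog | tail -n 1
#   btrfs device stats /var/lib/docker    # same counters on a mounted filesystem
```

On a mounted filesystem, `btrfs device stats` reports these counters directly, and they persist until reset with its `-z` option, so a nonzero value can include errors from before the latest reboot.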

nuc-diagnostics-20230901-1009.zip

September 1, 2023

The docker image started detecting new corruption just after start, and the cache filesystem also has corruption, which suggests a hardware problem. Start by running memtest.

September 17, 2023

Author

MemTest86 was run four times using the default test. No errors. Reports for the last 3 tests are attached (s/n removed).

MemTest86-Report-20230913-053317 [redacted].rtfd.zip MemTest86-Report-20230913-151514 [redacted].rtfd.zip MemTest86-Report-20230913-233352 [redacted].rtfd.zip

Edited September 17, 2023 by wlsn0chrs
style

September 18, 2023

Recreate the pool to make sure it's a new filesystem, then monitor for new errors. If more corruption errors appear, there's likely still some hardware issue.

October 9, 2023

@wlsn0chrs did you ever solve this? I have very similar issues.

October 26, 2023

Author

@Yivey_unraid I have not resolved what is going on yet. I had a few things come up and haven't gotten around to recreating the pool. Please let me know if you figure out what is going on with your setup.
