enmesh-parisian-latest

Members
  • Posts: 121
Everything posted by enmesh-parisian-latest

  1. Hi, I haven't gone through this process before, so it seems a little scary. The device is "disabled, contents emulated"; are you suggesting the disk may be fine and doesn't need replacing? If I start the array and view the disk contents, everything looks OK. Should I try rebuilding?
  2. One of my kids accidentally kicked my server right in the motherboard. The mobo appears dead, so I upgraded the whole case, mobo, RAM & CPU and migrated my RAID card, reconnecting my drive array. Trouble is, one of my array disks is now disabled. I pulled the disabled disk out and reinserted it, and swapped drive bays, but it's not coming back online. It seems like too much of a coincidence that the drive would fail at the same time the case was damaged. Could the RAID card be damaged? Are there any tests I can do? tobor-server-diagnostics-20231205-2347.zip
  3. Hey, the ipvlan switch fixed the main crashes (I was getting one every 1-2 days). There's still some nagging problem causing a random crash every month or so, but I think that's somehow related to my CPU.
  4. I'm still rebuilding parity, but I noticed some kernel errors in the system log last night:
     Jul 19 02:38:07 tobor-server kernel: PMS LoudnessCmd[31931]: segfault at 0 ip 000014da6a0d7060 sp 000014da658460d8 error 4 in libswresample.so.4[14da6a0cf000+18000] likely on CPU 47 (core 13, socket 1)
     Jul 19 02:38:07 tobor-server kernel: Code: 01 cf 4c 39 c7 72 e3 c3 cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 75 f2 0f 10 05 22 05 ff ff 66 90 <0f> bf 16 0f 57 c9 f2 0f 2a ca f2 0f 59 c8 f2 0f 11 0f 0f bf 14 06
     Jul 19 02:38:08 tobor-server kernel: PMS LoudnessCmd[32119]: segfault at 0 ip 0000150d92c2f060 sp 0000150d8e5b80d8 error 4 in libswresample.so.4[150d92c27000+18000] likely on CPU 23 (core 13, socket 1)
     Jul 19 02:38:08 tobor-server kernel: Code: 01 cf 4c 39 c7 72 e3 c3 cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 75 f2 0f 10 05 22 05 ff ff 66 90 <0f> bf 16 0f 57 c9 f2 0f 2a ca f2 0f 59 c8 f2 0f 11 0f 0f bf 14 06
     Jul 19 02:38:08 tobor-server kernel: PMS LoudnessCmd[32151]: segfault at 0 ip 00001498864b8900 sp 0000149881cd00d8 error 4 in libswresample.so.4[1498864b0000+18000] likely on CPU 16 (core 4, socket 1)
     Jul 19 02:38:08 tobor-server kernel: Code: cc cc cc cc cc cc cc cc cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 7c 66 2e 0f 1f 84 00 00 00 00 00 <f3> 0f 10 06 f3 0f 5a c0 f2 0f 11 07 f3 0f 10 04 06 48 01 c6 f3 0f
     Jul 19 02:38:40 tobor-server kernel: PMS LoudnessCmd[32179]: segfault at 0 ip 000014ae7be78060 sp 000014ae779440d8 error 4 in libswresample.so.4[14ae7be70000+18000] likely on CPU 11 (core 13, socket 0)
     Jul 19 02:38:40 tobor-server kernel: Code: 01 cf 4c 39 c7 72 e3 c3 cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 75 f2 0f 10 05 22 05 ff ff 66 90 <0f> bf 16 0f 57 c9 f2 0f 2a ca f2 0f 59 c8 f2 0f 11 0f 0f bf 14 06
     Jul 19 02:39:22 tobor-server kernel: PMS LoudnessCmd[34204]: segfault at 0 ip 000014b820278060 sp 000014b81bf970d8 error 4 in libswresample.so.4[14b820270000+18000] likely on CPU 47 (core 13, socket 1)
     Jul 19 02:39:22 tobor-server kernel: Code: 01 cf 4c 39 c7 72 e3 c3 cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 75 f2 0f 10 05 22 05 ff ff 66 90 <0f> bf 16 0f 57 c9 f2 0f 2a ca f2 0f 59 c8 f2 0f 11 0f 0f bf 14 06
     Jul 19 02:39:23 tobor-server kernel: PMS LoudnessCmd[36896]: segfault at 0 ip 000014e50e890060 sp 000014e50a00b0d8 error 4 in libswresample.so.4[14e50e888000+18000] likely on CPU 42 (core 8, socket 1)
     Jul 19 02:39:23 tobor-server kernel: Code: 01 cf 4c 39 c7 72 e3 c3 cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 75 f2 0f 10 05 22 05 ff ff 66 90 <0f> bf 16 0f 57 c9 f2 0f 2a ca f2 0f 59 c8 f2 0f 11 0f 0f bf 14 06
     Is this some clue to the original problem? tobor-server-diagnostics-20230719-1146.zip
  5. It's true everything appears to be working now, but with my cache drives and parity failing within two days of each other, I feel like something bigger is the problem; I'm only addressing the symptoms and haven't found the cause. I'm hoping the diagnostics and logs can help identify it. Attached is the system log; however, it's missing the period when my parity failed. syslog-10.0.0.200.log
  6. Hey, I've been having issues since 6.12.0 (now on 6.12.3). The system was regularly crashing, which I posted about here. While attempting to apply a recommended fix, it became clear that the docker image was corrupted, which led me to realise the problem was bigger: the cache drive partition was corrupted (it was only operating read-only). I cleared and reformatted the cache drives and had begun transferring my data back when I noticed the parity drive was no longer readable. I couldn't generate a SMART report or perform any checks on the parity drive, so I shut down, checked cables and connections, and rebooted. The parity drive was no longer visible in the UI, so as an experiment I switched the parity drive's bay with another drive; now the parity drive is back and can generate SMART reports, but it needs to be formatted and parity rebuilt. I'm now rebuilding the parity, but I get the feeling I might be missing some bigger issue. I've attached diagnostics and a SMART report for the parity drive; is there anything here I should be worried about? As a small side note, I noticed that FCP is reporting "Write Cache is disabled" on the parity drive and drives 1-22, however I have 23 disks in my array (plus the parity); it seems odd that one disk would not be reporting the same "Write Cache is disabled"... tobor-server-diagnostics-20230717-1503.zip WD161KFGX-68AFSM_2VGD275B_3500605ba011718e8-20230717-1500.txt
  7. Interesting, I've never considered that. I have plenty of containers with custom IPs. I'll switch it now and report back. Thanks
  8. Hey, so I had a crash yesterday, and one just before these logs were generated. Is there anything here which could identify the problem? syslog-10.0.0.200.log
  9. My Unraid box has been stable for years; however, with the latest 6.12.x updates something is causing random crashes. The whole system becomes unresponsive, with no output to the monitor and no terminal access. It's occurred 4 times now, on both 6.12.0 and 6.12.1. Attached is my most recent diagnostics, generated after my last crash (sometime in the last hour or two). Any advice would be fantastic, thanks. tobor-server-diagnostics-20230629-1742.zip
  10. I have an hourly rsync script running in the User Scripts plugin; it used to create files/directories with permissions of 655 and 755, owned by nobody/users. Now, after the latest Unraid OS update to 6.10.2, it's creating files/directories with 600 and 755, owned by root/root. Any ideas how to fix this and what has changed? My containers running as nobody/users can't access the files.
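      For reference, the direction I'm leaning for a workaround is to force ownership and permissions inside the script itself rather than relying on whatever defaults it now runs with. A minimal sketch, assuming rsync 3.1+ (for --chown) and using placeholder paths rather than my real script:
          #!/bin/bash
          # Hourly sync (illustrative paths only).
          # Force destination ownership/permissions so containers running as
          # nobody/users can read everything, regardless of which user/umask
          # the script runs under after the update.
          rsync -a \
            --chown=nobody:users \
            --chmod=D755,F644 \
            /mnt/user/source/ /mnt/user/backup/
      That keeps the fix in one place instead of chown/chmod-ing everything afterwards, but I'd still like to know what actually changed in 6.10.2.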
  11. I'm using an Adaptec RAID 71605, which has served me well for years, although after a forced reboot the controller was giving me a high-pitched alert beep indicating it had overheated. I'll shut down, let it cool off a bit, then try again. The missing drives came back, but the parity drive is still listed as being in an error state. Can you suggest how best to deal with the filesystem corruption? I assume this is somehow related to the RAID controller messing up. EDIT: regarding the fs corruption, I'm following the instructions here: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
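      For anyone reading along later, the command-line equivalent of that wiki procedure is roughly the following. This is only a sketch, assuming the disks are XFS and using disk 1 as the example; the array has to be started in maintenance mode first.
          # Dry run first: report corruption but write no changes.
          xfs_repair -n /dev/md1
          # If fixable problems are reported, run the actual repair (verbose).
          xfs_repair -v /dev/md1
      As I understand it, running against the /dev/mdX device rather than the raw disk is what keeps parity in sync with any repairs.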
  12. As mentioned, I first received an email about the parity drive being in an error state, followed by 5 other disks. Several docker containers are reporting they have no read access to the appdata directory, which is on the cache drive. The array is currently stalled at "unmounting disks" as I try to stop it. Any ideas? tobor-server-diagnostics-20211021-1329.zip
      ls -la /mnt/
      /bin/ls: cannot access '/mnt/disk18': Input/output error
      /bin/ls: cannot access '/mnt/disk16': Input/output error
      /bin/ls: cannot access '/mnt/disk10': Input/output error
      /bin/ls: cannot access '/mnt/disk7': Input/output error
      /bin/ls: cannot access '/mnt/disk6': Input/output error
      total 16
      drwxr-xr-x 26 root   root  520 Sep 15 18:57 ./
      drwxr-xr-x 21 root   root  440 Oct 21 12:12 ../
      drwxrwxrwx  1 nobody users  66 Oct 17 04:30 cache/
      drwxrwxrwx  7 nobody users 108 Oct 17 04:30 disk1/
      d?????????  ? ?      ?       ?            ? disk10/
      drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk11/
      drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk12/
      drwxrwxrwx  5 nobody users  69 Oct 10 04:30 disk13/
      drwxrwxrwx  5 nobody users  69 Oct 17 04:30 disk14/
      drwxrwxrwx  5 nobody users  53 Oct 17 04:30 disk15/
      d?????????  ? ?      ?       ?            ? disk16/
      drwxrwxrwx  4 nobody users  36 Oct 17 04:30 disk17/
      d?????????  ? ?      ?       ?            ? disk18/
      drwxrwxrwx  5 nobody users  67 Oct 17 04:30 disk19/
      drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk2/
      drwxrwxrwx  5 nobody users  51 Oct 17 04:30 disk3/
      drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk4/
      drwxrwxrwx  5 nobody users  51 Oct 17 04:30 disk5/
      d?????????  ? ?      ?       ?            ? disk6/
      d?????????  ? ?      ?       ?            ? disk7/
      drwxrwxrwx  5 nobody users  51 Oct 17 04:30 disk8/
      drwxrwxrwx  7 nobody users 109 Oct 17 04:30 disk9/
      drwxrwxrwt  2 nobody users  40 Sep 15 18:57 disks/
      drwxrwxrwt  2 nobody users  40 Sep 15 18:57 remotes/
      drwxrwxrwx  1 nobody users 108 Oct 17 04:30 user/
      drwxrwxrwx  1 nobody users 108 Oct 17 04:30 user0/
  13. The disk is under warranty; I forked out for a replacement and will RMA it soon. Thanks for the input.
  14. === START OF READ SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED
      General SMART Values:
      Offline data collection status: (0x80) Offline data collection activity was never started. Auto Offline Data Collection: Enabled.
      Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
      Total time to complete Offline data collection: ( 101) seconds.
      Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
      SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
      Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
      Short self-test routine recommended polling time: ( 2) minutes.
      Extended self-test routine recommended polling time: (1250) minutes.
      SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.
      SMART Attributes Data Structure revision number: 16
      Vendor Specific SMART Attributes with Thresholds:
      ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
        1 Raw_Read_Error_Rate     0x000b   100   100   ---    Pre-fail  Always       -       0
        2 Throughput_Performance  0x0004   135   135   ---    Old_age   Offline      -       108
        3 Spin_Up_Time            0x0007   080   080   ---    Pre-fail  Always       -       397 (Average 397)
        4 Start_Stop_Count        0x0012   100   100   ---    Old_age   Always       -       49
        5 Reallocated_Sector_Ct   0x0033   100   100   ---    Pre-fail  Always       -       0
        7 Seek_Error_Rate         0x000a   100   100   ---    Old_age   Always       -       0
        8 Seek_Time_Performance   0x0004   133   133   ---    Old_age   Offline      -       18
        9 Power_On_Hours          0x0012   100   100   ---    Old_age   Always       -       6700
       10 Spin_Retry_Count        0x0012   100   100   ---    Old_age   Always       -       0
       12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       49
       22 Helium_Level            0x0023   100   100   ---    Pre-fail  Always       -       100
      192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       2787
      193 Load_Cycle_Count        0x0012   100   100   ---    Old_age   Always       -       2787
      194 Temperature_Celsius     0x0002   100   100   ---    Old_age   Always       -       22 (Min/Max 18/50)
      196 Reallocated_Event_Count 0x0032   100   100   ---    Old_age   Always       -       0
      197 Current_Pending_Sector  0x0022   100   100   ---    Old_age   Always       -       0
      198 Offline_Uncorrectable   0x0008   100   100   ---    Old_age   Offline      -       0
      199 UDMA_CRC_Error_Count    0x000a   100   100   ---    Old_age   Always       -       0
      Read SMART Log Directory failed: scsi error medium or hardware error (serious)
      Read SMART Error Log failed: scsi error medium or hardware error (serious)
      Read SMART Self-test Log failed: scsi error medium or hardware error (serious)
      Read SMART Selective Self-test Log failed: scsi error medium or hardware error (serious)
      I tried plugging the drive into my Adaptec RAID card but the disk didn't even appear. I ran smartctl -a /dev/sdv and the output is pasted above. There are a lot of errors in the drive log too, for example:
      Jun 5 11:18:17 t-server kernel: blk_update_request: I/O error, dev sdv, sector 23437770624 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
      Jun 5 11:18:17 t-server kernel: Buffer I/O error on dev sdv, logical block 2929721328, async page read
      The recurring and most common error is:
      Jun 5 13:56:36 t-server kernel: ata5.00: failed to get NCQ Send/Recv Log Emask 0x1
      Jun 5 13:56:36 t-server kernel: ata5.00: failed to get NCQ Non-Data Log Emask 0x1
  15. Same controller, just a different port on motherboard. I'll try switching power and controllers.
  16. The disk appears in Unassigned Devices but I can't seem to perform any operation on it. I swapped SATA cables and moved it to a different mobo port, but still no dice. The disk is only about 18 months old; any ideas? tobor-server-diagnostics-20210604-2105.zip tobor-server-smart-20210604-2103.zip
  17. I added the :nightly tag to the container and it's working again https://github.com/linuxserver/docker-beets/issues/80
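      For anyone else hitting this, the change is just the image tag on the container. A rough example with docker pull (the exact repository path depends on where you pull the linuxserver image from):
          # Switch from the default tag to the nightly build that has the fix:
          docker pull linuxserver/beets:nightly
          # Then set the container's repository/tag to match, e.g.
          #   linuxserver/beets:nightly
      Once the fix lands in a stable release you can presumably drop back to the default tag.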
  18. This may sound odd, but could the error be linked to GitHub being down? I noticed that while the error was spamming, I couldn't load my Plugins page; I looked into that further and found it can occur when GitHub is down. Twelve hours later, GitHub is fine, the Plugins page can be accessed, and the errors have stopped. In addition, several people appear to get the error at the same time.
  19. Having this problem too. I get an error every 1-2 seconds, all day, every day.