Everything posted by MrChip

  1. I recently upgraded to 6.12, and I'm now at 6.12.6 on 3 servers. I'm seeing these log messages repeating over and over on 2 of the servers:

        Jan 4 13:56:08 bailey-un monitor: Stop running nchan processes

     The TL;DR version: I see the nchan log messages repeating over and over on one server after the client PC goes to sleep and the browser there stops accessing the web GUI without logging out. This has happened on a second server, but I can't confirm at this time whether it happened again today - I'll have to confirm that later, when the server is accessible again. Today it did not occur on a third server - just one message there.

     In detail: on the server shown here, the message repeats every 33 seconds (a quick way to confirm that cadence is sketched at the end of this post).

        Jan 4 13:56:08 bailey-un monitor: Stop running nchan processes
        Jan 4 13:56:41 bailey-un monitor: Stop running nchan processes
        Jan 4 13:57:14 bailey-un monitor: Stop running nchan processes
        Jan 4 13:57:47 bailey-un monitor: Stop running nchan processes
        Jan 4 13:58:20 bailey-un monitor: Stop running nchan processes
        Jan 4 13:58:54 bailey-un monitor: Stop running nchan processes
        Jan 4 13:59:27 bailey-un monitor: Stop running nchan processes
        Jan 4 14:00:00 bailey-un monitor: Stop running nchan processes
        Jan 4 14:00:33 bailey-un monitor: Stop running nchan processes

     I've seen an earlier thread (https://forums.unraid.net/topic/144710-since-6124-monitor-stop-running-nchan-processes) but it didn't seem to come to a resolution for what I'm experiencing.

     These messages start after the client PC is put to sleep. I didn't close the browser tab or log out of the session - I just put the PC to sleep. There should be no further activity from the browser session on that PC until it wakes up again, but these messages keep being generated over and over. The messages stopped when a new GUI session to the server was opened from a different client PC on a different local subnet:

        Jan 4 20:35:41 bailey-un monitor: Stop running nchan processes
        Jan 4 20:36:14 bailey-un monitor: Stop running nchan processes
        Jan 4 20:36:47 bailey-un monitor: Stop running nchan processes
        Jan 4 20:36:55 bailey-un webGUI: Successful login user root from 192.168.x.y
        Jan 4 20:36:58 bailey-un emhttpd: WDC_WD10EZRX-00A8LB0_WD-WCC1U2905249 (sdg) 512 1953525168
        Jan 4 20:36:58 bailey-un emhttpd: read SMART /dev/sdg

     (This is the end of the log, copied at 21:06 local time, so there's nothing in the log after 20:36:58.)

     One of my servers didn't get into this state today. I recorded a single instance of the log message when the client PC (the same client PC as above) went to sleep:

        Jan 4 13:46:29 flint-un emhttpd: spinning down /dev/sdb
        Jan 4 13:55:43 flint-un monitor: Stop running nchan processes

     My third server has shown the repeating nchan log messages in recent days, but it is currently inaccessible and likely down. (I'm not there to investigate it further right now, so I can't say whether it started these messages today or not. The same client PC was connected there too.)

     I think the repeating messages are a bug - a single message makes sense when the client PC goes to sleep, but repeating messages don't make sense, as the client is now silent. Your thoughts?
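     For anyone else chasing this, a quick way to confirm the cadence is to diff the message timestamps - just a sketch, assuming the default /var/log/syslog location on Unraid:

        # Print the gap (in seconds) between consecutive nchan messages.
        grep 'monitor: Stop running nchan processes' /var/log/syslog \
          | awk '{print $1" "$2" "$3}' \
          | while read -r ts; do date -d "$ts" +%s; done \
          | awk 'NR>1 {print $1 - prev} {prev = $1}'

     On the server above this should print a steady ~33; after a single non-repeating message it prints nothing.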
  2. After replacing disk1 and rebuilding the parity, and waiting for several parity check cycles, I conclude that my parity errors reported in the OP were due to a failing hard drive. Since replacing the drive I have no parity errors reported over multiple parity checks. I consider this issue to be resolved.
  3. Disk1 has now failed (several days ago). A file system check showed tons of errors, and the root of the file system was unreadable. I took disk1 out of the array so the array had just disk2 and parity. I rebuilt the parity and restored the data from backup. I've run several parity checks since then and all are zero errors. I bought a new disk, zeroed it, and added it to the array today. A parity-sync is in progress. I'll run a parity check daily for a time to keep an eye on it. I think disk1 was the source of the parity errors that prompted my OP, but I'll wait a few parity check cycles before I conclude that. I want to thank JorgeB for his input and suggestions - much appreciated.
  4. Hmm, I just found these entries in the disk log for the parity drive. They show wsdd failing at times - segfault and general protection fault. I see them only in the disk log for the parity drive, not the other drives. I'm not sure whether they are significant to my parity correction problem or not. They do also appear in the syslog, but fewer of them - that is, there are more wsdd segfaults in the parity drive's disk log than in the syslog over the same time period. Here's what appears in the parity drive log (a quick grep for pulling these faults out of a log is sketched after the excerpt):

        Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
        Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] 4096-byte physical blocks
        Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Write Protect is off
        Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Mode Sense: 7f 00 10 08
        Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
        Aug 17 11:15:33 chip-un kernel: sdd: sdd1
        Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Attached SCSI disk
        Aug 17 11:18:35 chip-un emhttpd: ST2000DM008-2FR102_ZK303HMX (sdd) 512 3907029168
        Aug 17 11:18:35 chip-un kernel: mdcmd (1): import 0 sdd 64 1953514552 0 ST2000DM008-2FR102_ZK303HMX
        Aug 17 11:18:35 chip-un kernel: md: import disk0: (sdd) ST2000DM008-2FR102_ZK303HMX size: 1953514552
        Aug 17 11:18:35 chip-un emhttpd: read SMART /dev/sdd
        Aug 17 11:18:38 chip-un root: /usr/sbin/wsdd
        Aug 17 11:24:21 chip-un root: /usr/sbin/wsdd
        Aug 17 11:24:23 chip-un kernel: wsdd[16602]: segfault at 1004d ip 0000000000403c92 sp 00007fffcd51c250 error 4 in wsdd[402000+4000]
        Aug 18 13:19:13 chip-un emhttpd: read SMART /dev/sdd
        Aug 18 13:19:55 chip-un root: /usr/sbin/wsdd
        Aug 18 13:43:37 chip-un emhttpd: read SMART /dev/sdd
        Aug 18 13:43:39 chip-un root: /usr/sbin/wsdd
        Aug 18 13:44:01 chip-un root: /usr/sbin/wsdd
        Aug 18 13:44:03 chip-un kernel: wsdd[14070]: segfault at f0007624 ip 0000000000403c92 sp 00007ffe41a216b0 error 4 in wsdd[402000+4000]
        Aug 18 13:44:08 chip-un root: /usr/sbin/wsdd
        Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast
        Aug 18 13:44:08 chip-un wsdd[15554]: Failed to add multicast for WSDD: Address already in use
        Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast
        Aug 18 18:48:28 chip-un emhttpd: read SMART /dev/sdd
        Aug 18 18:49:09 chip-un root: /usr/sbin/wsdd
        Aug 18 18:51:06 chip-un emhttpd: read SMART /dev/sdd
        Aug 18 18:51:06 chip-un root: /usr/sbin/wsdd
        Aug 18 18:51:27 chip-un root: /usr/sbin/wsdd
        Aug 18 18:51:29 chip-un kernel: wsdd[28652]: segfault at e08ade5f ip 0000000000403c92 sp 00007ffea454b4a0 error 4 in wsdd[402000+4000]
        Aug 18 22:29:04 chip-un emhttpd: read SMART /dev/sdd
        Aug 18 22:29:25 chip-un root: /usr/sbin/wsdd
        Aug 18 22:31:45 chip-un emhttpd: read SMART /dev/sdd
        Aug 18 22:31:46 chip-un root: /usr/sbin/wsdd
        Aug 18 22:32:07 chip-un root: /usr/sbin/wsdd
        Aug 18 22:32:09 chip-un kernel: traps: wsdd[19417] general protection fault ip:403c92 sp:7fff4f11f7c0 error:0 in wsdd[402000+4000]
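     If it helps anyone correlate these, the faults can be pulled out of a log with a one-liner - a sketch, run against whichever log file you're examining:

        # List wsdd segfaults and general protection faults with timestamps.
        grep -E 'wsdd\[[0-9]+\].*(segfault|general protection fault)' /var/log/syslog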
  5. I'm glad I'm well backed up ... I've put the array into maintenance mode and performed a file system check (through the GUI). There were a lot of issues that were fixed. I restored affected files from backup. All disks now pass a file system check cleanly. I brought the array online normally and did a parity check (with correction): over 16k corrections, but I guess that may not be a big surprise given the file system damage that had occurred. I did a second parity check right after the first one. It'll finish in a few minutes, but it's already over 17k corrections. That's way out of whack. I'm not sure how to approach this issue from here (for comparing counts across checks, see the history sketch below). JorgeB suggested a memory issue, but my feeling is that if a memory issue were creating that many parity corrections, there would be other signs and symptoms in the system.
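     For tracking whether the correction counts are trending, Unraid appears to keep a parity check history on the flash drive - the path below is my assumption about where the GUI's history page reads from:

        # Parity check history (date, duration, speed, error count) - path is an assumption.
        cat /boot/config/parity-checks.log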
  6. Thanks. I'm now getting file system errors on disk1:

        root@chip-un:~# cd /mnt/disk1
        root@chip-un:/mnt/disk1# ls ./User/kntc/Accounting
        /bin/ls: cannot access './User/kntc/Accounting': Structure needs cleaning
        root@chip-un:/mnt/disk1#

     I'm starting another thread to ask about fixing a file system on an Unraid array. Edit: there are plenty of posts/info about fixing filesystem issues, so I didn't start a new post (the repair flow, as I understand it, is sketched below). It's not likely a coincidence to have both the parity errors and the file system errors, but the cause isn't clear. I'm leaning toward disk or controller, but memory is still possible too.
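     For reference, the flow as I understand it: stop the array, restart it in maintenance mode, then run the check against the md device so parity stays in sync. A sketch only - the md device name varies by Unraid version (/dev/md1 on older releases, /dev/md1p1 on 6.12):

        # With the array started in maintenance mode:
        xfs_repair -n /dev/md1   # -n = dry run, report problems without changing anything
        xfs_repair /dev/md1      # actual repair (may require -L if the journal is dirty)

     Repairing via the md device rather than the raw /dev/sdX device is what keeps parity consistent with the changes.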
  7. One thought I've had is to take disk2 out of the array and then check parity on just disk1 for a few days. If the parity errors cease, then disk2 is suspect; otherwise disk1 or parity is suspect. Could the disk controller be a source of these issues? I have these disks on an HBA.
  8. I've been watching the server more carefully today, and things have changed quite a bit from the OP. In the OP the Raw_Read_Error_Rate was 79448/79456/183312320 (parity/disk1/disk2). Today I have:

        root@chip-un:~# smartctl -a -v 1,raw48 /dev/sdd | grep Raw_Read
          1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       165959888
        root@chip-un:~# smartctl -a -v 1,raw48 /dev/sde | grep Raw_Read
          1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       168463584
        root@chip-un:~# smartctl -a -v 1,raw48 /dev/sdf | grep Raw_Read
          1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       173822392
        root@chip-un:~#

     (sdd = parity, sde = disk1, sdf = disk2.) Quite the big change for parity and disk1 - now they are in the same ballpark as disk2. So I'm no longer thinking that disk2 is the most likely culprit. I did another parity check today: 1516 parity errors this time. Today I did not write corrections to parity, but last time I did. I'm not sure which way to go on that question. If one of my data drives is failing, could writing corrections to parity mess things up even further?
  9. Okay, that's helping me understand the numbers. Thanks. Using the smartctl command suggested (sketched below for future readers), I see an error count of 0 for all 3 disks. But now I'm still wondering why the figure for disk2 is so much higher than for the other two disks. For parity/disk1/disk2, the Raw_Read_Error_Rate was (in the OP) 79448/79456/183312320. If I understand correctly, since the error count is 0 for all three, these numbers are counts of reads/seeks. I don't understand the difference across the three disks. If the values for the parity disk and disk1 are sensible, then the value of 183312320 for disk2 doesn't make sense, as that disk is essentially empty. And all 3 disks were bought at the same time to populate this server - their Power_On_Hours are 21736/21752/21710, so disk2 isn't any older. Disk2 should have fewer reads/seeks, not several orders of magnitude more. Or maybe the parity and disk1 figures are too low for Raw_Read_Error_Rate? Only 79k after years of service and plenty of parity checks?
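     For future readers, the suggested command was presumably along these lines - it splits Seagate's packed 48-bit raw value into an error count (upper bits) and an operation count (lower 32 bits); that interpretation is my wording, not an official spec:

        # Display the raw value as "errors/operations" instead of one packed number.
        smartctl -A -v 1,raw24/raw32 /dev/sdf | grep Raw_Read
        # Sanity check by hand: 183312320 < 2^32 (4294967296), so the upper
        # (error) bits are zero and the whole value is an operation count.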
  10. Thanks for the input, but what's the number 2 suspect? I ran Memtest86 (10.5) - 4 passes, 5 hours and 8 minutes elapsed time - and there were zero errors. What's the way to read the Seagate drive figures? All 3 drives are Seagate and bought at the same time, but just one has the error rate in the millions. The other two are in the 74000 range. I'm again thinking that disk 2 has issues.
  11. Hello all. Today (Tuesday) I ran a parity check and the result was 2128 parity errors. Uh oh. After running some self-tests I think I've identified the culprit disk, but I'm seeking input from more experienced users to make sure my conclusions are reasonable.

      I've been running weekly parity checks for about a year (daily checks prior to that). They all had 0 errors until recently: Aug 6 reported 226 errors, and Aug 13 reported 128. This concerned me, but I wasn't getting other disk errors that gave me reason to suspect a particular disk, so I started keeping a closer eye on the array disks. I ran extended self-tests on all 3 disks - 1 parity and 2 data disks (command-line equivalents are sketched at the end of this post). The self-tests on the parity disk and disk 1 finished normally and nothing jumped out at me. But disk 2 had some trouble finishing its self-test - I had to turn off spin-down to get the test to finish, and it finally did.

      Yesterday I ran another parity check and the result was zero errors. Today I ran the parity check again, and there were a whopping 2128 errors. At least that looks whopping to me. What is interesting is that I was also running the extended self-test on disk 2 at the same time. I thought the self-test wouldn't/shouldn't affect Unraid processes. Is that wrong?

      So I think disk 2 is having some issues, and when I look at the self-test report I see:

         SMART Attributes Data Structure revision number: 10
         Vendor Specific SMART Attributes with Thresholds:
         ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
           1 Raw_Read_Error_Rate     POSR--   083   064   006    -    183312320
         ...
         195 Hardware_ECC_Recovered  -O-RC-   083   064   000    -    183312320

      The other 2 disks are around 79,400 for both of these attributes. (I'm assuming the numbers for these 2 attributes are a real count of errors.) Since an ECC recovery appears to have been accomplished for all the read errors, I would expect the data to be fine, but the parity check errors suggest that errors are being introduced onto the disk.

      I'd like to ask the community for some guidance. Is the conclusion that disk 2 is having problems sound? Are there other diagnostics I should check? Are there other possible causes of the parity errors I should rule out before replacing disk 2?

      There appears to be nothing on disk 2. I have the isos share configured to go there, but I am not running any VMs, so there are no ISOs. My Unraid configuration appears to be storing all my data on disk 1 - 635 GB (on 2 TB drives) - which is consistent with the High-water allocation setting.

      I am attaching diagnostics and the self-test reports for all 3 disks. Thank you for any and all help and feedback.

      chip-un-diagnostics-20230816-0004.zip
      Parity-smart-20230814-2304.zip
      Disk1-smart-20230814-2305.zip
      Disk2-smart-20230815-2303.zip
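      For anyone wanting to run the same checks from the command line, a sketch of the plain smartctl invocations (device names as in this thread):

         smartctl -t long /dev/sdf      # start the extended self-test on disk 2
         smartctl -l selftest /dev/sdf  # check progress and the final result
         smartctl -c /dev/sdf           # capabilities, incl. estimated test duration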
  12. Thank you very much. Removing the directory and its files too made the difference. Both my servers are running well now with unassigned.devices.preclear too.
  13. Thank you, that got things back to normal. Much appreciated. Deleting unassigned.devices.preclear.plg appears to have removed the plugin from my servers (which I expected). I tried re-installing the plugin through Community Applications, but it went back into the problem state.
  14. Hi. I have 3 Unraid systems and I just updated unassigned.devices.preclear on 2 of them, and those two are now showing serious UI issues. The Dashboard page has headings but no content under them. The Main page looks fine, but the button to stop the array doesn't work: the page refreshes, I see the same device display, the Stop button is still showing, and there are no status messages indicating the array is stopping. The Shares page has headings but no content. The Users page looks normal. The Settings page looks fine - I didn't visit all the settings pages, but the ones I did visit showed the expected content. The Plugins page has headings but no content. The Docker page has headings but no content. I don't run VMs, so I didn't check that page. The Apps page has the side menu, the search box, and a single heading "Updating Content", but no content. The Stats page has no content. The Tools page looks okay.

      I was able to generate the diagnostics (attached). I am able to run the web terminal. I can also log in via ssh. I think the dockers are running - I see processes I associate with my dockers in the ps output.

      I am seeing some messages like this in the logs:

         Aug 8 15:52:01 flint-un nginx: 2023/08/08 15:52:01 [error] 8140#8140: *1894 open() "/usr/local/emhttp/plugins/unassigned.devices.preclear/assets/javascript.js" failed (2: No such file or directory) while sending to client, client: 192.168.52.67, server: , request: "GET /plugins/unassigned.devices.preclear/assets/javascript.js?v=autov_fileDoesntExist HTTP/2.0", host: "flint-un.pc.kntc.ca", referrer: "https://flint-un.pc.kntc.ca/Main"

      The contents of /usr/local/emhttp/plugins/unassigned.devices.preclear/assets is:

         root@flint-un:/usr/local/emhttp/plugins/unassigned.devices.preclear/assets# ls -l
         total 56
         -rwxr-xr-x 1 root root  5099 Jun 21  2017 arrive.min.js*
         -rwxr-xr-x 1 root root  4492 Jan  2  2022 sweetalert2.css*
         -rwxr-xr-x 1 root root 40887 Mar 14  2022 sweetalert2.js*
         root@flint-un:/usr/local/emhttp/plugins/unassigned.devices.preclear/assets#

      No javascript.js file here, but the one server on which I haven't updated the unassigned.devices.preclear plugin does have this file. So something in the plugin update process appears to have trashed that file. How can I get the needed file? I'll need an out-of-band method (one possibility is sketched below), because the Plugins page isn't showing the list of plugins, so there's no update ability through the GUI at this time. Thanks for any help!

      flint-un-diagnostics-20230808-1553.zip
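      One out-of-band stopgap would be copying the file over from the server that still has it (the hostname below is illustrative). Since /usr/local/emhttp lives in RAM on Unraid, this wouldn't survive a reboot and doesn't fix the plugin package itself:

         # Copy the missing asset from the not-yet-updated server.
         scp root@good-server:/usr/local/emhttp/plugins/unassigned.devices.preclear/assets/javascript.js \
             /usr/local/emhttp/plugins/unassigned.devices.preclear/assets/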
  15. The report from the extended SMART test is in the attachment included in the original post.
  16. During the weekly parity check, Unraid reported read errors on the parity drive. The message in FCP suggested that the data had been rewritten back to the drive successfully. I checked the SMART reports and ran both a short and an extended self-test. Both came back "Completed without error", yet both test reports show a set of UNC errors, alongside "SMART overall-health self-assessment test result: PASSED" (the extended test report is attached). The Main page of the server UI states parity is valid. From all this I think the disk is not in need of immediate replacement, but I should keep an eye on it (the attributes I'm watching are sketched below). I'm also thinking it might be wise to replace it anyway, but maybe keep the current disk as a spare? I welcome any suggestions and insights you care to share.

      flint-un-smart-20230402-1811.zip
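      For the keeping-an-eye-on-it part, a sketch of the attributes generally treated as replace-soon signals when they go non-zero (substitute the right device name):

         smartctl -A /dev/sdX | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'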
  17. Okay !!WHEW!! -- Solved! When I did the right search, I discovered super.dat. For those who may have this problem and discover this post in the future: super.dat is where the array configuration is stored, but it's not backed up under that name when I do a flash backup - rather, it's copied to super.dat.CA_BACKUP. When I recreated my flash drive and restored my config directory from my flash backup, I needed to copy super.dat.CA_BACKUP back to super.dat. Because I hadn't done that, a new, empty super.dat was created, which had no array configuration in it. That is why my disks were not showing in the array, but in the Unassigned Devices list. I shut down the server, put the flash drive into my PC, copied super.dat.CA_BACKUP back to super.dat (see the sketch below), then rebooted the server (with the flash drive, of course). And it all came up with the disks in the right place. I started the array in maintenance mode, and the array came up and everything is fine. It's reporting the parity is valid, without a parity check. (I'm a bit paranoid, though, so I think I'll do a parity check anyway.) So I now have my server back and a much better understanding of what to do to replace a faulty flash drive when one of my other Unraid servers has its flash drive fail.
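      The actual fix, as a one-line sketch (the path assumes the stock config location on the flash drive, mounted at /boot on the server):

         # Restore the array configuration from the flash-backup copy.
         cp /boot/config/super.dat.CA_BACKUP /boot/config/super.dat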
  18. My mistake. I actually passed over this card because I thought it required 8 lanes. I can use it if it can run on 4 lanes. Thanks for the correction.
  19. Your HBA card is x8, so it needs an x8 or larger slot. So your E5 slot, which is x4, won't support your HBA card.
  20. Interesting .... When I unassign the disks again, Unassigned Devices appears to figure out that they are encrypted. But Unraid still thinks that those device slots are to be encrypted.
  21. In the GUI I've now assigned the disks according to the DISK_ASSIGNMENTS.txt file in /boot/config - see the screen capture below. (This file matches what I have in multiple backup copies, so I believe it's correct.) I believe the parity "should" be fine, but I am seeing a notice that a Parity Sync will be done (not a parity check), implying that Unraid doesn't think there is valid parity. There is a check box to say parity is already valid, and one to enter maintenance mode. I'm inclined to check both boxes and run a parity check on the array with the disks unmounted. Is that sane and reasonable?

      BUT ... BUT ... The array is already encrypted, AND the disk assignment list has icons on both disk 1 and disk 2 indicating the devices will be encrypted. There is also a prompt to enter a new encryption key, not an existing key, to start the array. Encrypting already-encrypted disks sounds like a very bad idea to me. So now I'm not inclined to start the array. Any chance Unraid will discover that the disks are already encrypted and not re-encrypt them? Or can I tell it that somehow? (A way to verify the existing encryption headers first is sketched below.) I am well backed up, so I can recover from data loss on the array.

      Here is DISK_ASSIGNMENTS.txt again:
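      Before starting the array, the existing LUKS headers can be verified from a terminal - a sketch, with the device and partition numbering being my assumption:

         # Exit status 0 (and the echo) confirms a LUKS header is present.
         cryptsetup isLuks /dev/sdb1 && echo "sdb1 has a LUKS header"
         cryptsetup luksDump /dev/sdb1   # show header details (cipher, key slots)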
  22. They were up to date when the flash drive failed, and I'd seen nothing before then to make me think there were any parity issues. Can I validate parity other than by reassigning the disks and running a parity check on the assembled array? The 8TB disk is not part of the array or cache; it is always unassigned.
  23. Here are my diagnostics ...

      chip-un-diagnostics-20230310-2251.zip