
Pjhal

Posts posted by Pjhal

  1. So I used the BIOS setting that Squid linked to.

    Quote

    look for "Power Supply Idle Control" (or similar) and set it to "typical current idle"

    But I just got this: ''ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen''

    See the short remote log:

    All_2022-2-14-20 11 1.html

    full Unraid logs:

    silverstone-diagnostics-20220214-2013_anon.zip

    The system hasn't crashed (yet), but I am a bit concerned that it might do so soon.

     

    Edit: this seems like the same issue to me:

    6.9.0/6.9.1 - KERNEL PANIC DUE TO NETFILTER (NF_NAT_SETUP_INFO) - DOCKER STATIC IP (MACVLAN)

    At least, it is the same one I am seeing in the logs right now; I am not sure whether it is the same as my original issue.

    I have now deactivated my second physical Ethernet port.

    How do I need to change my setup to fix the above?

    I am using at least 7 different IP addresses on Docker containers, sometimes more, using the macvlan feature (custom br0).

    Several more use the default bridge.

     

     

     

  2. 16 minutes ago, Squid said:

    To get the obvious out of the way,

     

    Thanks for your response, I will try the BIOS options.

    I should note, though, that it was never an issue until recently, and I have had this server running Unraid since 2019.

    Also, the post suggests updating the BIOS, but I cannot update the BIOS on this system.

    I updated to a P version; this was a bug-fix/beta BIOS released because of issues with the IPMI KVM, and apparently you cannot upgrade from that version.

    And the motherboard has been end of life for a while now.

    The memory part of that post also doesn't apply to me, seeing as it passed such a long memtest.

    My hardware btw:

    ASRockRack X470D4U
    American Megatrends Inc., Version P3.30
    BIOS dated: Monday, 2019-11-04

    CPU: AMD Ryzen 7 3700X 8-Core @ 3600 MHz

    My RAM is 2 sticks of 16GB ECC UDIMM from the motherboard's supported list.

  3. My system froze/hung several times, requiring a forced shutdown. It was completely unresponsive to everything (no web UI, no SSH, no SMB shares) but clearly still powered on. I did ping the system, but I think that also got me nothing (not 100% sure, it's been a few days).

    Quote

    2022-01-31 18:21:12 Warning Silverstone kern kernel Code: ff 48 8b 15 ef 6a 00 00 89 c0 48 8d 04 c2 48 8b 10 48 85 d2 74 80 48 81 ea 98 00 00 00 48 85 d2 0f 84 70 ff ff ff 8a 44 24 46 <38> 42 46 74 09 48 8b 92 98 00 00 00 eb d9 48 8b 4a 20 48 8b 42 28

    2022-01-31 18:21:12 Warning Silverstone kern kernel RIP: 0010:nf_nat_setup_info+0x129/0x6aa [nf_nat]

    2022-01-31 18:21:12 Warning Silverstone kern kernel Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS P3.30 11/04/2019

    2022-01-31 18:21:12 Warning Silverstone kern kernel CPU: 9 PID: 15782 Comm: python3 Tainted: G D W 5.10.28-Unraid #1

    2022-01-31 18:21:12 Warning Silverstone kern kernel general protection fault, probably for non-canonical address 0xa52fb99018bdb8aa: 0000 [#2] SMP NOPTI

    The full local logs are lost, but the quote above is part of the log as captured by my log server on a Synology NAS (I tried to format it, hope it's readable). This seemed to me like it might be the issue, so I wanted to include the full logs as captured over the network.

    But I think the external version of the log file is too big at 50 MB; the upload fails.
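A minimal sketch of one way to cut a large captured log down to the recent entries before uploading. It assumes every useful line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as the lines in the quote above do after formatting; the function names and the file paths passed in are hypothetical, and a raw Synology export may need a different timestamp position.

```python
from datetime import datetime

# Assumed layout: each line starts with "YYYY-MM-DD HH:MM:SS" (19 characters).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19


def keep_recent_lines(lines, cutoff):
    """Yield lines whose leading timestamp is at or after `cutoff`.

    Lines without a parseable leading timestamp are dropped.
    """
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue
        if stamp >= cutoff:
            yield line


def trim_log_file(source_path, target_path, cutoff):
    """Copy only the recent lines to target_path and return how many were kept."""
    kept = 0
    with open(source_path, encoding="utf-8", errors="replace") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for line in keep_recent_lines(source, cutoff):
            target.write(line)
            kept += 1
    return kept
```

Calling `trim_log_file("full.log", "recent.log", datetime(2022, 1, 31))` (both file names are placeholders) would keep only entries from 31 January 2022 onward.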

    I did fix the cache file system (XFS) several times because of the dirty shutdowns; I used this guide by Spaceinvader One on YouTube.

    And I did a parity check on the array.

     

    Now, the external log made it look like it might have been a memory issue to a layman like myself, so I ran a memtest.

    I ran the built-in memtest on my 2 x 16GB ECC UDIMMs. It ran for well over 150 hours, partly because I didn't have the time or energy to keep troubleshooting and ''fixing'' the system, so I just left it to its own devices.

    I forgot to take a picture of the screen, but trust me, it was a lot of passes with zero errors!

    On a side note, somewhere in the midst of all this I replaced 2 x 8TB discs (shucked WD white label), 1 data and 1 parity, with 2 x 18TB discs (also shucked WD white label).

    Since then the S.M.A.R.T. part of the webGUI has stopped working: the discs do pass the checks, but the GUI doesn't display the data properly. I did find a post on the forum about this, but none of the fixes worked for me (changing the default SMART controller type, etc.; I tried all of them).

    All new discs have recently passed the short and extended SMART tests at least 2 times, and they were precleared with the preclear script using the binhex-preclear Docker image.

     

    Edit: I have added a cut-down version (older entries removed) of the externally captured log file: All_2022-2-12-21 28 6 - Copy.csv

    Logs of the system as it is ''now'':

    silverstone-diagnostics-20220212-2021_anon.zip

  4. 9 hours ago, JorgeB said:

    Still looks like a power/connection issue.

    I shut down the server and re-plugged the HBA and all disks, then started it up again.

    After some time, new errors appeared:

    Quote

    May 22 18:02:58 Silverstone kernel: mdcmd (58): spindown 7
    May 22 18:09:15 Silverstone kernel: mdcmd (59): spindown 6
    May 22 18:15:53 Silverstone kernel: sd 13:0:6:0: [sdh] tag#1409 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    May 22 18:15:53 Silverstone kernel: sd 13:0:6:0: [sdh] tag#1409 Sense Key : 0x5 [current]
    May 22 18:15:53 Silverstone kernel: sd 13:0:6:0: [sdh] tag#1409 ASC=0x20 ASCQ=0x0
    May 22 18:15:53 Silverstone kernel: sd 13:0:6:0: [sdh] tag#1409 CDB: opcode=0x88 88 00 00 00 00 01 0b b7 0b 50 00 00 00 08 00 00
    May 22 18:15:53 Silverstone kernel: print_req_error: critical target error, dev sdh, sector 4491512656
    May 22 18:15:53 Silverstone kernel: md: disk6 read error, sector=4491512592
    May 22 18:15:53 Silverstone kernel: sd 13:0:5:0: [sdg] tag#1414 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    May 22 18:15:53 Silverstone kernel: sd 13:0:5:0: [sdg] tag#1414 Sense Key : 0x5 [current]
    May 22 18:15:53 Silverstone kernel: sd 13:0:5:0: [sdg] tag#1414 ASC=0x20 ASCQ=0x0
    May 22 18:15:53 Silverstone kernel: sd 13:0:5:0: [sdg] tag#1414 CDB: opcode=0x88 88 00 00 00 00 01 0b b7 0b 50 00 00 00 08 00 00
    May 22 18:15:53 Silverstone kernel: print_req_error: critical target error, dev sdg, sector 4491512656
    May 22 18:15:53 Silverstone kernel: md: disk7 read error, sector=4491512592

    The weird thing that stands out to me is that the errors occur after the 2 disks happen to spin down. Could that be related?
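To check the spin-down question against a longer syslog, here is a rough sketch that measures how long after each disk's last spindown its first read error shows up. It assumes the number in `mdcmd (...): spindown N` refers to the same disk as `diskN` in the `md:` read error lines, and that the year (missing from syslog timestamps) can be supplied by hand; the function names are made up for illustration.

```python
import re
from datetime import datetime

# Patterns modelled on the syslog lines quoted above.
SPINDOWN = re.compile(r"(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*?mdcmd \(\d+\): spindown (\d+)")
READ_ERROR = re.compile(r"(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*?md: disk(\d+) read error")


def parse_time(text, year):
    """Parse a syslog timestamp such as 'May 22 18:02:58' using the given year."""
    normalized = " ".join(text.split())  # collapse the double space used for single-digit days
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def seconds_from_spindown_to_error(lines, year=2021):
    """Map disk number -> seconds between its last spindown and its first later read error."""
    last_spindown = {}
    delays = {}
    for line in lines:
        match = SPINDOWN.search(line)
        if match:
            last_spindown[int(match.group(2))] = parse_time(match.group(1), year)
            continue
        match = READ_ERROR.search(line)
        if match:
            disk = int(match.group(2))
            if disk in last_spindown and disk not in delays:
                delta = parse_time(match.group(1), year) - last_spindown[disk]
                delays[disk] = int(delta.total_seconds())
    return delays
```

On the lines quoted above this gives 398 seconds for disk 6 and 775 seconds for disk 7, so both errors follow a spindown by several minutes. That alone does not prove a link; it would be more convincing if the same pattern repeated over several days of logs.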

    Also, if it is a hardware defect... I don't have a spare HBA, a properly sized power supply, or a SAS cable to do any testing (by swapping them out), so I am at a loss as to how I should handle this right now.

    Is there anything I can do?

     

    silverstone-diagnostics-20210522-1828.zip

  5. 5 hours ago, JorgeB said:

    Read errors on multiple disks:

     

    
    
    
    May 20 22:48:29 Silverstone kernel: md: disk4 read error, sector=8
    May 20 22:48:29 Silverstone kernel: md: disk4 read error, sector=16
    May 20 22:48:29 Silverstone kernel: md: disk4 read error, sector=24
    May 20 22:48:29 Silverstone kernel: md: disk7 read error, sector=8
    May 20 22:48:29 Silverstone kernel: md: disk7 read error, sector=16
    May 20 22:48:29 Silverstone kernel: md: disk7 read error, sector=24
    May 20 22:48:29 Silverstone kernel: md: disk6 read error, sector=8
    May 20 22:48:29 Silverstone kernel: md: disk6 read error, sector=16
    May 20 22:48:29 Silverstone kernel: md: disk6 read error, sector=24
    May 20 22:48:29 Silverstone kernel: Buffer I/O error on dev md1, logical block 0, async page read
    ### [PREVIOUS LINE REPEATED 1 TIMES] ###
    May 20 22:48:29 Silverstone kernel: md: disk1 read error, sector=32
    May 20 22:48:29 Silverstone kernel: md: disk1 read error, sector=40
    May 20 22:48:29 Silverstone kernel: md: disk1 read error, sector=48

     

    This is a likely a power, connection or controller problem.

    But this issue happened after downgrading from 6.9.2 back to 6.8.3; nothing else changed. I also read that some people had compatibility issues with the newer version.

    I use a:

    https://www.broadcom.com/products/storage/host-bus-adapters/sas-9300-8i

    What can I do to fix this? I understand that it is hypothetically possible that my power supply failed or that a cable failed, but it seems incredibly unlikely to me that this would happen at the exact time that I run into OS issues due to updating and downgrading my OS version.

    Edit: OK, I disconnected and reconnected the HBA and my array is back, so maybe it was a badly plugged-in connector?

     

     

    Schermafbeelding 2021-05-21 141120.png

  6. 13 hours ago, JorgeB said:

    Don't see any controller issues logged, so most likely a power/connection problem, power down the server, check all connections and power back up, array should be accessible after that.

    Thank you for your response.

    I have rebooted; Unraid then reported zero errors. Then I started the array in maintenance mode and am now doing a parity check (read-only).

    After that I'll try starting the array normally.

     

     

  7. As the title says.

    The screenshot shows the discs; diagnostics logs are included.

    Also:

    When using Krusader I cannot access disk 3.

    /mnt/disk 3

    is somehow a file of zero bytes and not a folder

    2021-05-18_disk errors.png

    silverstone-diagnostics-20210518-2122.zip

     

    I put Unraid in maintenance mode and ran a short SMART test on disk 3; it completed with no errors.

    1352393918_2021-05-18_21.35_Unraid-disk3aftershortSMARTtest.thumb.png.df086e7ef3242beca3df955b76939e9e.png


     

     

    Disk 3 XFS check with -n

    **********************************************************************************************************

     

        Phase 1 - find and verify superblock...
        Phase 2 - using internal log
                - zero log...
        ALERT: The filesystem has valuable metadata changes in a log which is being
        ignored because the -n option was used.  Expect spurious inconsistencies
        which may be resolved by first mounting the filesystem to replay the log.
                - scan filesystem freespace and inode maps...
        sb_fdblocks 456165658, counted 458312794
                - found root inode chunk
        Phase 3 - for each AG...
                - scan (but don't clear) agi unlinked lists...
                - process known inodes and perform inode discovery...
                - agno = 0
                - agno = 1
                - agno = 2
                - agno = 3
                - agno = 4
                - agno = 5
                - agno = 6
                - agno = 7
                - process newly discovered inodes...
        Phase 4 - check for duplicate blocks...
                - setting up duplicate extent list...
                - check for inodes claiming duplicate blocks...
                - agno = 0
                - agno = 1
                - agno = 2
                - agno = 3
                - agno = 6
                - agno = 4
                - agno = 7
                - agno = 5
        No modify flag set, skipping phase 5
        Phase 6 - check inode connectivity...
                - traversing filesystem ...
                - traversal finished ...
                - moving disconnected inodes to lost+found ...
        Phase 7 - verify link counts...
        No modify flag set, skipping filesystem flush and exiting.

     

    **********************************************************************************************************

    Edit: I added the SMART diagnostics of all 5 disks with errors after running a short SMART test on each of them. Disk numbers are appended to the file names (Disk 1, etc.).

    Disk 3 is also running the extended SMART test at the moment.

    WDC_WD80EMAZ-00W_7HKJT7EJ_35000cca257f1e771-20210518-2240 - Disk 3.txt

    WDC_WD80EMAZ-00W_7HKJWUXJ_35000cca257f1f4f1-20210518-2244 - Disk 4.txt

    WDC_WD80EZAZ-11T_2SG8U7JJ_35000cca27dc401ba-20210518-2245 Disk 7.txt

    WDC_WD80EZAZ-11T_2SG9465F_35000cca27dc4271a-20210518-2244 Disk 6.txt

    WDC_WD80EZAZ-11T_7HJJ6AVF_35000cca257e38cc8-20210518-2243 Disk 1.txt

    What should I do?

    How bad is this?

  8. After recently upgrading to the latest version (6.9.2), my server crashed (not immediately, but at a time when I wasn't using it, so I don't know exactly when). I lost the logs, so I set up my Synology NAS to receive logs from my Unraid server in the future. Now, only days later, the Unraid server didn't fully crash: my Docker containers are still running! But Unraid cannot be accessed at all, neither through the web UI nor through WinSCP.

    Included are the Unraid logs as captured by my Synology NAS, hence the unusual format.

    All_2021-5-14-23 15 38.html

    Edit: oops, I just noticed the typo in the title; it should be a server error.

    Edit2: ultimately I did gain access through WinSCP and used shutdown now.

    Edit3: I couldn't figure out how to fix it and got no response on this thread, so I am simply going to roll back to an older version.

    Hopefully whatever this is will get fixed.

  9. I just noticed that bond0 is my IPMI interface on my ASRock Rack X470D4U motherboard: https://www.asrockrack.com/general/productdetail.asp?Model=X470D4U#Download

    bond0 and eth0 both list exactly the same numbers, but they have different MAC addresses and the system has 2 cables hooked up. So perhaps this is just related to the IPMI/KVM, some weird internal traffic? (Sorry, I didn't think about mentioning this sooner.)

    I thought bond0 was some ''Unraid thing''.

    For the record, I was not using the IPMI/KVM functions and it still showed the constant 7.2 Mbps inbound activity. Also, the KVM function won't work at the moment for some reason (''powered off, no signal'').

    And just now it seems the speed dropped again to 30/40 kbps inbound on both bond0 and eth0.

    Weird...

     

     

  10. 2 minutes ago, trurl said:

    There is a Help (?) button on the menu bar. It will turn on/off Help for all pages in the webUI. You can also toggle help for a specific setting by simply clicking on its label.

     

    Here is a link to a post in the FAQ that explains the nuances of the Use cache setting in much more detail:

     

    https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#comment-537383

     

    That FAQ is pinned near the top of this same subforum. Lots of other useful info in that FAQ and also in the Docker FAQ pinned near the top of the Docker Engine subforum.

    I had found the Help button, I just hadn't gotten around to reading up on the cache yet. Thank you for the link; I'll save it and dig into all the cache information this weekend.

  11. 5 minutes ago, BRiT said:

    When you see the traffic inbound, have you tried looking at what "netstat -a" says ?

    I had not! I am new to Unraid and the Linux command line. Can I safely post those results? I don't really understand what it is saying.

    2 minutes ago, trurl said:

    Yes is definitely wrong, since Yes means move from cache to array.

     

    Prefer is OK or Only. Prefer is the only setting that can help you get it moved to cache. Prefer means move from array to cache, but you can't move open files so more would need to be done to make that work. Specifically, while docker service and VM service is enabled you can't move docker and libvirt images.

     

    Simplest is probably to just disable and delete those images, set system to cache-only, and recreate them so they will go on cache.

     

     

    Alright, will do that then. 

  12. 8 hours ago, trurl said:

    Not related but I noticed some FCP warnings in your syslog. You shouldn't ignore these unless you know exactly why they don't apply to your specific use (and these do apply):

    
    Dec  5 21:10:01 Tower root: Fix Common Problems Version 2019.11.22
    Dec  5 21:10:01 Tower root: Fix Common Problems: Warning: Share Disk2 test is set for both included (disk2) and excluded (disk1,disk3,disk4) disks
    Dec  5 21:10:01 Tower root: Fix Common Problems: Warning: Share Films is set for both included (disk2) and excluded (disk1) disks
    ....
    Dec  5 21:10:06 Tower root: Fix Common Problems: Warning: Dynamix SSD Trim Plugin Not installed
    Dec  5 21:10:06 Tower root: Fix Common Problems: Warning: Syslog mirrored to flash
    

    You shouldn't set both include and exclude and there is never any good reason to do so. Include means ONLY and Exclude means EXCEPT, so using one or the other not both, covers all possibilities. Remove one or the other. In fact, your setting for Films isn't even consistent.

     

    And I don't know why you would even have a share named Disk2. It doesn't appear in your user shares, and include / exclude settings don't make any sense for a disk share. Maybe you removed that share after it was logged. Hope you can clarify this one for me.

     

    And you don't want to mirror syslog to flash permanently since you will wear out your flash drive. That should only be done temporarily as a troubleshooting measure. Better yet set syslog server to write to one of your user shares.

     

    Also, your system share has files on the array instead of all on cache where they belong, so your dockers won't perform as well due to parity and they will keep array disks spinning.

    Oh, that was purely a test, hence the share name ''disk2 test''. I don't use it. But thanks for the tip!
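The "Include means ONLY, Exclude means EXCEPT" point from the quote can be modelled with plain sets. This is only an illustration of that reading, not how Unraid itself evaluates share settings; the function name is hypothetical and the four-disk array is an assumption taken from the disks named in the warnings.

```python
def allowed_disks(all_disks, include=(), exclude=()):
    """Return the disks a share may use: include means ONLY, exclude means EXCEPT."""
    candidates = set(include) if include else set(all_disks)
    return sorted(candidates - set(exclude))


# Assumed array layout, based on the disks mentioned in the FCP warnings.
DISKS = ["disk1", "disk2", "disk3", "disk4"]

# ''disk2 test'': either setting alone already says the same thing.
only_include = allowed_disks(DISKS, include=["disk2"])
only_exclude = allowed_disks(DISKS, exclude=["disk1", "disk3", "disk4"])

# Films: the two settings describe different sets, so they are not consistent.
films_include = allowed_disks(DISKS, include=["disk2"])
films_exclude = allowed_disks(DISKS, exclude=["disk1"])
```

For the test share both settings select the same single disk, so one of them is redundant. For Films, include alone selects one disk while exclude alone would allow three, which is the inconsistency mentioned in the quote.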

    And I didn't realize it was so easy to set the syslog server to write to one of my user shares. I changed it, thanks!

    In general my shares are a mess because I was messing with the settings to see what happens. The system is still new, with no important data on it (yet).

    The system share was set to Prefer, but I guess it just needs to be ''Yes''. Fixed.

     

    9 hours ago, Frank1940 said:

    From your screen capture of the Main tab, nothing is being written to any device on the array.  (I am assuming that it was taken when this activity was going on.)

     

    Now, 7.2Mbps is not very fast (for a data transfer).  I would suspect that it is GUI activity.  Did you have more than one GUI screen open?  Were you 'watching' the preclear progress on those three drives? 

     

     

    Yes, it was at the same time. The strange thing was that it lasted so long and was so constant. Isn't 7.2 Mbps a lot for incoming traffic if it's just the GUI?

    Also, it seems it's back again! I do not (and did not) have more than 1 tab open.

    2019-12-05.sysstats.Untitled.png

    2019-12-07.Untitled.png

  13. No discs are being written to.

    I haven't set up any copy jobs. 

    Nothing on my network is uploading at anything close to those speeds, not even in total, whether to the Unraid server or elsewhere.

    And I even blocked the server from accessing the internet on my router.

    I turned FTP off, stopped all Docker containers, and removed read permissions from the shares. Still the same steady incoming 7-ish Mbps.

    I am running the preclear plugin, so I would rather not reboot.

    And I assume that the preclear plugin should not cause incoming traffic on an Ethernet port.

    Maybe I am missing something super obvious and I'm just an idiot.

    But I don't understand what is going on here.

     

     

    2019-12-05.sysstats.Untitled.png

    2019-12-05.incoming Untitled.png

    2019-12-05.Untitled.png

    tower-diagnostics-20191205-2223.Anonymized.zip

     

    Edit: It stopped after doing this constantly for at least 1 hour and 10 minutes.
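One way to cross-check the inbound figure from the console, independent of the GUI graphs, is to sample the kernel's per-interface byte counters twice. A minimal sketch, assuming the standard Linux `/proc/net/dev` layout (two header lines, then `interface: rx_bytes ...`) and that Python is available on the server; the function names are made up for illustration.

```python
import time


def read_rx_bytes(text):
    """Parse /proc/net/dev content into {interface: received_bytes}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, _, fields = line.partition(":")
        if fields:
            counters[name.strip()] = int(fields.split()[0])
    return counters


def inbound_mbps(before, after, seconds):
    """Convert two byte-counter snapshots into megabits per second per interface."""
    return {
        name: (after[name] - before[name]) * 8 / seconds / 1_000_000
        for name in after
        if name in before
    }


def sample_inbound(interval=5.0, path="/proc/net/dev"):
    """Take two snapshots `interval` seconds apart and return the inbound rates."""
    with open(path) as handle:
        before = read_rx_bytes(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = read_rx_bytes(handle.read())
    return inbound_mbps(before, after, interval)
```

Running `sample_inbound()` while the traffic is visible shows which interface the bytes actually arrive on. If bond0 and eth0 report the same rate, they are probably counting the same packets rather than two separate streams.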

  14. On 12/3/2019 at 1:06 PM, johnnie.black said:

    Possibly parity is just invalid and needs syncing, this can also be confirmed by the parity checks, if the errors are exactly the same on both runs.

    So I completed one parity check with the ''Write corrections to parity'' box unchecked, and there were no errors!

    Which is good but also scary, because I still don't understand where the old errors came from.

    Should I run it again?

    tower-diagnostics-20191204-1405.Anonymized.zip

  15. 5 minutes ago, johnnie.black said:

    Not yet, there have been Ryzen users before with overclocked memory that would give sync errors despite memtest not detecting anything.

     

    Run a couple of consecutive non correct parity check and compare the errors, or better yet post diags after they finish.

    Thank you, I am running the test right now. My memory is running at default settings, by the way. How will we know that it's the RAM and not, say, a SAS cable, the HBA or the backplane?

    When I had the 8,900,000 errors, 4 discs were in the right backplane and 1 was external via USB. I moved the discs into the left backplane and left the SAS cables in place (meaning the discs were now connected to a different SAS cable, backplane and port on the HBA), and it recovered to zero errors.

    I have put the external disc into the right backplane now.

    And I did get the 500,000 errors on a rebuild after that.

    So I am worried that it might be the right backplane, the SAS cable or that part (port?) of the HBA.
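The suggested comparison of two non-correcting parity checks can be sketched as a set comparison of the sectors each run reports. This assumes each sync error line in the syslog contains `sector=<number>`, the same notation as the read error lines quoted elsewhere on this page; the function names are hypothetical.

```python
import re

# Assumption: every error line carries the affected sector as "sector=<number>".
SECTOR = re.compile(r"sector=(\d+)")


def error_sectors(lines):
    """Collect the sector numbers mentioned in one run's error lines."""
    return {int(match.group(1)) for line in lines for match in SECTOR.finditer(line)}


def compare_runs(first_run, second_run):
    """Split the reported sectors into those common to both runs and those unique to one."""
    first, second = error_sectors(first_run), error_sectors(second_run)
    return {
        "same_in_both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }
```

Following the advice quoted in these threads: if everything lands in `same_in_both`, parity is probably just out of sync and a correcting check should fix it, while sectors that differ between runs point toward RAM or another unstable component.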

  16. 19 hours ago, johnnie.black said:

    I remember reading that kernel on current Unraid, even with v6.8rc, doesn't support ECC with Ryzen3, and sync errors with frequent filesystem corruption makes me suspicious of RAM, start by running a memtest.

     

    Another good test is to run a couple of non correct parity checks, if the errors are not the same it's again likely bad RAM.

    OK, so I ran memtest86 for almost 16 hours on my single stick of 16 GB ECC unregistered RAM, and after almost 16 hours there was not even a single error.

    Had the system been doing a parity check in the same period, it would have reported hundreds of thousands of errors, if not millions.

    Can RAM now be ruled out?

     

    2019-12-03.11.18.memtest86.CaptureScreen.jpeg

  17. 4 hours ago, johnnie.black said:

    I remember reading that kernel on current Unraid, even with v6.8rc, doesn't support ECC with Ryzen3, and sync errors with frequent filesystem corruption makes me suspicious of RAM, start by running a memtest.

     

    Another good test is to run a couple of non correct parity checks, if the errors are not the same it's again likely bad RAM.

    Thank you very much for your advice! I am running Unraid v6.7.2, by the way; I forgot to mention that.

    I am currently letting memtest86 run.

    So far no errors, but I will probably just let it run overnight.

    2019-12-02.20.52.memtest86.CaptureScreen.jpeg
