
bobobeastie

Members
  • Posts: 132

Posts posted by bobobeastie

  1. On 3/9/2020 at 4:23 AM, johnnie.black said:

    We can't see what happened, but it was likely a controller error. When errors on multiple drives happen at the same time, Unraid will disable as many drives as there are parity devices, and which drives get disabled is luck of the draw. It's unlikely to be upgrade related, especially since, if I understood correctly, the problem was before rebooting, and the new release would only be loaded after the reboot, when the disks were already disabled. Just re-sync parity, and try to get the diags before rebuilding if it happens again.

    I suppose there was some issue, and perhaps it was controller errors keeping the array from stopping so it could reboot.  There were no errors sent in emails or in the webgui, but it's possible I didn't notice the parity drives being disabled.

     

    This is the third issue I have had with this server recently and it is making me a bit concerned/anxious.  So here are some paranoid ideas I have had about what might be causing the issues:

     

    1. My motherboard temperature is being reported at around 96 C. This was initially concerning, then I learned that Ryzen CPUs report 27 C over actual temps, but now I am back to being concerned.  My setup is a Threadripper 1900X on an X399 Designare EX, the temp drivers are it87 and k10temp, and the one reporting high is "k10temp-mb temp". Is this real?  It seems unlikely; am I using the wrong driver?

     

    2. My SAS card (link) (LSI 9201-16e) is a fairly recent purchase from October last year. The listing says P20, but there are different P20 revisions, and it still has its BIOS, which I did not flash on the last card I had when I updated it.  I don't think that makes a difference after booting.  Still, I'm going to try to figure out which version I'm on and update if possible; it looks like the base P20 release might be bad, and the newest is 20.00.07.00 (I'll put a command sketch for checking the version at the end of this post).

     

    3. Power related: the server is plugged into a CyberPower CPS1215RMS surge protector, and the PSU is an EVGA SuperNOVA 1000 G1+, which seems like a good quality unit.  I have a Powerwall 2, which I see takes a 300-2000 millisecond relay switch-over, so based on https://teslamotorsclub.com/tmc/threads/powerwall-2-ups-connundrum-and-solution.130085/ it looks like I may want a UPS with a small battery that accepts 65 Hz power. I don't currently have a UPS.

     

    4. Gremlins/Underpants Gnomes/Static Electricity?

     

    Probably grasping, but I'd like to try to avoid further issues.
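
    On the SAS card version question, I'm assuming something like this would report the current firmware, using Broadcom's Linux sas2flash utility (which I don't believe ships with Unraid, so the binary would have to be copied onto the server first):

    # list every detected LSI SAS2 controller with its firmware and BIOS versions
    ./sas2flash -listall
    # full details for controller 0, including the firmware version (e.g. 20.00.07.00)
    ./sas2flash -c 0 -list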

  2. I'm guessing that ultimately I need to disable and re-enable both parity drives and then rebuild parity, but my guesses with Unraid have made things much worse in the past. I'm also concerned about what went wrong, and if this is a bug I want to help if possible.

  3. Either Friday or yesterday I downloaded 6.8.3 (I was on 6.8.2 at the time) and did not reboot.  I woke up this morning and decided to reboot; the web GUI was not going to the reboot countdown screen, so I used PuTTY to send the reboot command.  It boots back up and both parity drives are disabled. Docker and auto disk start were enabled, so it looked like writes were happening.  I have 24 drives connected to the same 6 backplanes, 1 SAS card for 16 drives and 8 motherboard SATA ports, so this doesn't seem like a cabling/controller issue that would randomly affect only the parity drives.  I rebooted another time to see if that did anything.  It's hard for me not to suspect the new version under these circumstances.

    nastheripper-diagnostics-20200308-0821.zip

  4. 54 minutes ago, trurl said:

    If you have appdata backup that plus the templates on flash are all you need to get your dockers going again using the Previous Apps feature on the Apps page. You could do this with a different cache or even without cache.

    Yes, and it looks like mover was able to move almost all of appdata; all that's left is a single file, nzbdrone.db-wal, in binhex sonarr.  I'm assuming that's the kind of thing that will fix itself, or can be fixed by reinstalling the container.  Everything else is torrents, which should fix themselves, and one recently transcoded backed-up Blu-ray.  Not too bad.  I think I will run cache-less while I iron out the things below.  Is the procedure for this as simple as stop the array, select no device for cache, then start the array?

     

    The future cache drive is a 1TB ADATA_SX6000NP.  I posted about the issue I was having with it a while ago and it received no responses.  The issue is that when the trim plugin runs, I get an email saying "fstrim: /mnt/disks/ADATA_SX6000NP_XXXXXXXXXXX: FITRIM ioctl failed: Remote I/O error".  I would feel much better about using it as a cache if that problem were solved.  If I can get trim working, I'm thinking about switching to BTRFS and running it in RAID1 with the replacement NVMe drives, as I understand I would then be able to lose one of them.
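
    To narrow down whether it's the trim plugin or the drive itself, I'm planning to run trim by hand against the same mount point from the email and see if the same error comes back:

    # -v reports how many bytes were discarded; the same FITRIM error here would point at the drive/firmware rather than the plugin
    fstrim -v /mnt/disks/ADATA_SX6000NP_XXXXXXXXXXX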

  5. Alrighty, apparently HP doesn't handle support for their SSDs; their support gives a phone number with a full voicemail box and an outdated email address belonging to a security camera company ([email protected]).  When I googled the phone number I found a mention of [email protected], which I found was actually connected to HP.  The only reason I'm mentioning this here is so that when someone googles any of these things they might be helped by this info.  This has taught me not to be an HP customer.

     

    In the meantime, it would be great if I could offload from the cache, but it just disabled itself again after mover was running very slowly.  The only thing I can think of trying, other than continuing to run mover after fixing the fs errors, is to use the backup/restore appdata plugin to make a copy, or maybe copying the share over works just as well? Is there a way to ignore fs errors?

    nastheripper-diagnostics-20200302-1534.zip

  6. I examined both NVMe drives and swapped them.  Attached is an image of what I was talking about, and I don't think I'm concerned about it as a cause.  I rubbed one of the pads with my hand and it started to disintegrate, so I stopped.  It only makes contact with the tops of chips or stickers, and the residue on those was minimal.

     

    Once I started it back up, everything appears to be okay, at least for now.  I am letting mover run with all shares set to Yes for cache so that I can set them to No when done.  At that point I am open to suggestions as to what to do.

    IMG_20200302_094737.jpg

  7. Great, thanks. I was worried for a minute about Plex because my previous config didn't appear to be saved, but it was.  This is my second -L rodeo, so I knew what lost+found was.  What I meant was that maybe the lack of files was a bad sign, because perhaps some were lost.  But maybe that's not how it works.  Anyway, thank you @Squid and @trurl

     

    ...it doesn't look like I'm out of the woods yet; I'm guessing the drive is bad:

    Mar 1 20:48:32 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 1161546720
    Mar 1 20:48:32 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 1161546728
    Mar 1 20:48:32 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 1161546728
    Mar 1 20:48:32 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 1161546728
    Mar 1 20:51:24 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 237741416
    Mar 1 20:51:25 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 237741448
    Mar 1 20:51:26 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 237741544
    Mar 1 20:51:26 NAStheRIPPER kernel: print_req_error: critical medium error, dev nvme0n1, sector 237741544
    Mar 1 21:00:04 NAStheRIPPER kernel: print_req_error: critical target error, dev nvme1n1, sector 1997853383

     

    The last error is from my former cache drive, which I left installed; it is not able to run trim for some reason.  The current cache drive (nvme0n1) is an HP SSD EX950 1TB.  One thing worth noting: when I last swapped NVMe drives around on this board (Designare Threadripper), I noticed that there is something like a black rubber-feeling thermal pad on the included NVMe heatsink, and it seemed kind of wet, like from separated glue or something between the metal and the "rubber". I just wiped them off and kept using them.  Rubber isn't thermally conductive, so I'm sure it's something else.

     

    When I click either SMART test button it very quickly changes back to what it started as, without even enough time to read that it says stop; I had to run it on another drive to check how it normally works, so I don't think the tests are running.  I'm going to switch the cache setting from Prefer to Yes for all applicable shares to start offloading.  Should I run a SMART check in the terminal?  If so, how?
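
    In case the terminal is the way to go, I'm assuming it would look something like this, with the device name taken from the syslog above (and I'm not sure the self-test part is supported on every NVMe drive or smartmontools version):

    # show overall health, attributes, and the error information log for the cache drive
    smartctl -a /dev/nvme0n1
    # attempt an extended self-test (NVMe self-tests need a reasonably recent smartmontools)
    smartctl -t long /dev/nvme0n1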

     

    edit: Not sure it matters, but I just noticed I have the first BIOS version, and I'm 90% sure I updated it at some point.  Maybe it kept a base copy and reverted at some point.  Looking at the change log here, https://www.gigabyte.com/us/Motherboard/X399-DESIGNARE-EX-rev-10/support#support-dl-bios it doesn't mention NVMe, except in relation to RAID, which obviously I'm not using.

     

    nastheripper-diagnostics-20200301-2132.zip

  8. Thanks, that seems to have worked.  The array started, and I have a lost+found folder with almost nothing in it, which I'm not sure is good or bad. The Docker service won't start:

    Warning: stream_socket_client(): unable to connect to unix:///var/run/docker.sock (Connection refused) in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 658
    Couldn't create socket: [111] Connection refused
    Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 830

    Warning: stream_socket_client(): unable to connect to unix:///var/run/docker.sock (Connection refused) in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 658
    Couldn't create socket: [111] Connection refused
    Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 894

     

    I'm pretty sure I need to delete the Docker image, and then hopefully the templates are still there for adding the containers back?  Easier and/or better solutions are welcome.

  9. I had to reboot as it was stuck on something like "unmounting fs" when I tried to stop the array.  I ran the check with -n and got 903 lines of output, plenty of which are obvious errors, like "would move to lost+found". The output includes filenames (torrent files and Plex docker files), so I don't want to attach it, at least not before obfuscating it.  After that, running the check with no options gets this:

     

     

    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
    ERROR: The filesystem has valuable metadata changes in a log which needs to
    be replayed. Mount the filesystem to replay the log, and unmount it before
    re-running xfs_repair. If you are unable to mount the filesystem, then use
    the -L option to destroy the log and attempt a repair.
    Note that destroying the log may cause corruption -- please attempt a mount
    of the filesystem before doing this.

     

    Assuming this means I should try to start the array normally, stop the array, and start it in maintenance mode, I have done that, and I get the same message.  So if that is correct, should I run it with the -L flag next? Please let me know if I should post the original output of -n. I'm going to wait so I don't make this worse by being hasty, like I have in the past.
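
    For my own reference, if I end up doing this from the terminal instead of the webGUI, I'm assuming the commands would look roughly like this (the device path is just a placeholder for the cache partition; the -n run is read-only):

    # read-only check first; reports problems without changing anything
    xfs_repair -n /dev/nvme0n1p1
    # only if mounting still fails: zero the log and repair (unreplayed metadata in the log is lost)
    xfs_repair -L /dev/nvme0n1p1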

  10. Thank you for the response.  I need to wait for the data rebuild to finish before using maintenance mode, as my cache drive is XFS.  I tried to run the extended SMART test, but it doesn't look like it did anything; below, it says "No self-tests logged on this disk".

     

    I was able to download a file, and maybe it is useful regarding the 6 errors, though this was probably already in my diagnostics:

    Error Information (NVMe Log 0x01, max 256 entries)
    Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
      0          6     7  0x0357  0x0281      -   1011004192     1     -
      1          5     7  0x0357  0x0281      -   1011004192     1     -
      2          4     7  0x0357  0x0281      -   1011004192     1     -
      3          3     7  0x0357  0x0281      -   1011004192     1     -
      4          2     7  0x0357  0x0281      -   1011004192     1     -
      5          1     7  0x0357  0x0281      -   1011004192     1     -

     

    I'll try both steps after the rebuild, in maintenance mode, and if neither works I'll reboot and try again.
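
    In case it's useful, I believe the same error log can also be pulled with the nvme-cli tool, though I haven't confirmed it's available on Unraid and the device name is my assumption:

    # dump the drive's Error Information log (the same NVMe log page 0x01 as above)
    nvme error-log /dev/nvme0
    # overall health and wear counters
    nvme smart-log /dev/nvme0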

  11. I woke up and found Fix Common Problems notifications: "Unable to write to cache" and "Unable to write to Docker image". Then I went to the Shares page and it's empty, which explains the Docker part.  Through SMB I am able to see the flash share and a UD drive share, nothing else.

     

    I briefly, and uneducatedly, looked through the diagnostics files and see that the shares are present, and I can also see them if I look at files on the individual drives, which makes me guess everything will be fine after a reboot.  I would have rebooted already after getting diagnostics, but I am 81% through a data rebuild of a new 12TB disk, so I have 6 hours left.

     

    I did notice some errors in my syslog, not sure if any of them are serious or related.

     

    Running 6.8.2

    nastheripper-diagnostics-20200301-0825.zip

  12. I have a disabled drive on version 6.8.2. I found this page: https://wiki.unraid.net/Troubleshooting#What_if_I_get_an_error.3F, ran both types of SMART tests, and got this:

    Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error

    # 1  Extended offline    Completed without error        00%            22171  -

    # 2  Short offline       Completed without error        00%            22156  -

     

    So that looks good, I think.  The attached files were generated on 2/20; today is 2/22. I noticed that the computer had rebooted while I was away, and now a notification says:

    Notice [] - array turned good
    Array has 0 disks with read errors

     

    But the drive is still disabled.  There are maybe 3 troubling SMART attributes, but they are listed as pre-fail.  Some might be from previous faulty controllers/cables. Other than checking the cables, which I will do, is there anything else I can do?  Should I try re-enabling the drive?  I ordered a new 12TB drive for my main server so that one of its 8TB drives can be used to replace the "failing" drive in my secondary server, but the failing drive (8TB) is the same size as the parity drive, while there are a couple of 4TB drives I would prefer to get rid of first, so there's an opportunity cost of 4TB.  But if the drive is really toast, then obviously replacing it is the right thing to do.

     

    I'm adding a second diagnostics file from today, 2/22, taken after the unwanted reboot, just in case it's useful. Thank you.

    flags-smart-20200222-1419.zip flags-diagnostics-20200220-1121.zip flags-diagnostics-20200222-1507.zip

  13. The weird part to me is that my 2 parity drives are 12TB Easystores purchased recently when they went on sale (money is tight, so I was only able to purchase 2), all of the data drives are still 8 or 10TB, and the parity check is beyond the 10TB mark with the number of errors still growing; I'm not sure how many errors were from below 10TB.  It seems like the only thing being checked on the parity drives above the highest data drive would be the parity drives themselves(?).  The 12TB drives have no SMART issues and both had 3 rounds of preclear before being added, and I believe the current configuration has passed a sync check, probably on version 6.7.4 of Unraid.  Should I be concerned?  If so, is there anything I should do?  It looks like memtest might be recommended.

     

    Sync errors corrected:46627778

    nastheripper-diagnostics-20200102-0729.zip

  14. Looks like I'm having the same issue:

    Event: Preclear on xxxxxxxx
    Subject: FAIL! Invalid unRAID`s MBR signature on xxxxxxxxx (/dev/sdu)
    Description: FAIL! Invalid unRAID`s MBR signature on xxxxxxxxxx (/dev/sdu).
    Importance: alert

    Invalid unRAID`s MBR signature on xxxxxxxx (/dev/sdu) - Aborted

     

    The drive is a shucked 12 TB WD Easystore, a WDC_WD120EMFZ-11A6JA0.

  15. 2 hours ago, dlandon said:

    Ok.  Go to a terminal and enter these commands:

    
    /sbin/cryptsetup luksOpen /dev/sdX1 HGST_HDN726060ALE614_K1H90MAD -d /root/keyfile
    xfs_admin -U generate /dev/mapper/HGST_HDN726060ALE614_K1H90MAD
    /sbin/cryptsetup luksClose HGST_HDN726060ALE614_K1H90MAD

    sdX1 is the current partition designation.

    How do I tell what X should be in sdX1?  And is this telling it to use the same encryption key as the array?
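
    While I wait, I'm guessing the X can be found by matching the drive's serial number to a device letter with something like:

    # list block devices with model and serial so the drive can be matched to its sdX name
    lsblk -o NAME,SIZE,MODEL,SERIAL
    # or follow the by-id symlink for this particular drive
    ls -l /dev/disk/by-id/ | grep HGST_HDN726060ALE614_K1H90MAD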

  16. 25 minutes ago, johnnie.black said:

    xfs_repair looks normal, disk should mount normally.

    Not sure if this should go in my thread, but could this just be due to a duplicate UUID?  I understand that issue is usually caused by a bad file system, but now that that is fixed and the UUID issue remains, should it be addressed?  Even if I can get around it now, I'm guessing I might have the same issue when I'm done with the drive and want to add it as a new drive to my array, so I might as well try to deal with it now.  I could try running "xfs_admin -U generate /dev/mapper/mdX" on the array drive that is a duplicate of the UD drive, or, probably better, try changing the UUID of the UD drive.
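
    Something like this is what I have in mind, using the mapper name from the earlier commands and only touching the UD copy, not the array disk (this is just my guess at the procedure):

    # confirm which two filesystems share a UUID
    blkid
    # write a new random UUID to the unmounted UD filesystem
    xfs_admin -U generate /dev/mapper/HGST_HDN726060ALE614_K1H90MAD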

  17. 3 minutes ago, dlandon said:

    Update once more.  The file system check should be working now.

    Thanks, it works for me now.  The drive will not mount while the array is started normally; xfs_repair through your tool was also run with the array mounted normally.  I think I already have the files in lost+found, so I'm going to work on verifying that, and if there are any files I need, I can transfer them to a secondary location while the array isn't mounted, and then back to the array.  However, if it would help you make any changes, I'd be more than happy to help test them in this situation; please let me know.  Perhaps it would be useful for your xfs_repair tool to be able to pass it different flags?

     

     

    Log when I try to mount:

    Nov 5 06:52:40 Tower unassigned.devices: Mount of '/dev/mapper/HGST_HDN726060ALE614_K1H90MAD' failed. Error message: mount: /mnt/disks/HGST_HDN726060ALE614_K1H90MAD: wrong fs type, bad option, bad superblock on /dev/mapper/HGST_HDN726060ALE614_K1H90MAD, missing codepage or helper program, or other error.

     

     

    Your tool's output when I click on "Run with correct flag":

    FS: crypto_LUKS

    /sbin/xfs_repair /dev/mapper/HGST_HDN726060ALE614_K1H90MAD 2>&1

    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
    - zero log...
    - scan filesystem freespace and inode maps...
    - found root inode chunk
    Phase 3 - for each AG...
    - scan and clear agi unlinked lists...
    - process known inodes and perform inode discovery...
    - agno = 0
    - agno = 1
    - agno = 2
    - agno = 3
    - agno = 4
    - agno = 5
    - agno = 6
    - agno = 7
    - agno = 8
    - agno = 9
    - agno = 10
    - agno = 11
    - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
    - setting up duplicate extent list...
    - check for inodes claiming duplicate blocks...
    - agno = 4
    - agno = 6
    - agno = 7
    - agno = 2
    - agno = 3
    - agno = 8
    - agno = 5
    - agno = 9
    - agno = 0
    - agno = 1
    - agno = 11
    - agno = 10
    Phase 5 - rebuild AG headers and trees...
    - reset superblock...
    Phase 6 - check inode connectivity...
    - resetting contents of realtime bitmap and summary inodes
    - traversing filesystem ...
    - traversal finished ...
    - moving disconnected inodes to lost+found ...
    Phase 7 - verify and correct link counts...
    done

  18. 55 minutes ago, dlandon said:

    Update to the latest version of UD and then reboot.  You should see better handling of the encrypted disk when the mount fails, and the file system check should work properly now. 

    I was able to get it mounted and have files in lost+found, but only while in maintenance mode, which might be normal.  Ideally I would be able to mount it with the array started so I could transfer files to the array.  I suspect the reason I can't mount it at the same time is a duplicate UUID, which I hear is caused by something wrong with the file system, but it looks like a previous -L repair worked, based on the presence of the lost+found folder, although I do see this in the log:

     

    Nov 5 04:31:41 Tower unassigned.devices: Mount of '/dev/mapper/HGST_HDN726060ALE614_K1H90MAD' failed. Error message: mount: /mnt/disks/HGST_HDN726060ALE614_K1H90MAD: wrong fs type, bad option, bad superblock on /dev/mapper/HGST_HDN726060ALE614_K1H90MAD, missing codepage or helper program, or other error.

     

    I did try using UD's built-in check, but it didn't seem to do anything:

     

    FS: crypto_LUKS

    /sbin/fsck /dev/mapper/HGST_HDN726060ALE614_K1H90MAD 2>&1

    fsck from util-linux 2.33.2
    If you wish to check the consistency of an XFS filesystem or
    repair a damaged filesystem, see xfs_repair(8).

    tower-diagnostics-20191105-1259.zip blkid.txt
