tjiddy

Everything posted by tjiddy

  1. Check completed (pretty quickly). Output attached: disk_check_output.txt
  2. I have had nothing but trouble for a few weeks now. After finally resolving this mess, two days ago another disk appeared to have died (diagnostics here: jabba-diagnostics-20231111-1557.zip). I replaced disk 1 and rebuilt everything. Everything looked great: all bubbles green, no errors in syslog. Then I ran the mover and started seeing errors:
     Nov 13 09:26:33 JABBA kernel: XFS (md13p1): Metadata CRC error detected at xfs_allocbt_read_verify+0x12/0x5a [xfs], xfs_bnobt block 0x26b8
     Nov 13 09:26:33 JABBA kernel: XFS (md13p1): Unmount and run xfs_repair
     Nov 13 09:26:33 JABBA kernel: XFS (md13p1): First 128 bytes of corrupted metadata buffer:
     Here are today's diagnostics: jabba-diagnostics-20231113-0935.zip
     1) From my first diagnostics, is there a way to tell what happened to disk 1?
     2) Could the disk 1 which died have written corrupted data to parity, so that when I rebuilt it onto a new disk, it is now corrupt?
     With SO many disk-related problems in the past couple of weeks, is it valid to think the disks might not be at fault, but that the cables/controller card are dying?
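     In case it helps, this is roughly the check I plan to run next, per the "Unmount and run xfs_repair" message above (a sketch only; I'm assuming the array gets stopped and restarted in maintenance mode, and that the device node is /dev/md13p1 as shown in the log):
     [pre]
     # read-only dry run first, so nothing is modified yet
     xfs_repair -n /dev/md13p1

     # if that reports fixable problems, run the actual repair afterwards
     # xfs_repair -v /dev/md13p1
     [/pre]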
  3. I ran an extended SMART test. Completed: read failure. I'll attach the report: WDC_WD60EFRX-68L0BN1_WD-WX31D17DJK65-20231030-1343.txt
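     For reference, this is roughly how an extended test like that is kicked off and collected from the command line (a sketch; /dev/sdX and the output filename are placeholders, and the GUI's self-test button amounts to the same thing):
     [pre]
     # start the long (extended) SMART self-test; it runs in the background on the drive
     smartctl -t long /dev/sdX

     # once it finishes (hours later), pull the full report, including the self-test log
     smartctl -a /dev/sdX > extended_smart_report.txt
     [/pre]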
  4. Thank you. How could this have been avoided? I thought the whole point of parity was to be able to rebuild the contents of failed drives?
  5. Array is started. Disk 22 only contains a lost+found directory, with what looks like most of the old contents of the drive spread across numbered folders. I've noted the filenames in case they need to be replaced. Does this mean the data is lost, or will parity be able to rebuild this disk once replaced? Thank you so much for getting me this far! What would be my next step?
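     For the record, this is roughly how I captured the filenames out of lost+found (a sketch; the paths are my best guess at where the disk is mounted):
     [pre]
     # list every file under lost+found and save the list to the flash drive
     find /mnt/disk22/lost+found -type f > /boot/disk22_lostfound_files.txt

     # quick count as a sanity check
     wc -l /boot/disk22_lostfound_files.txt
     [/pre]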
  6. Sorry to be overly paranoid, but it's still OK to start with this message?
  7. Phase 1 - find and verify superblock...
             - block cache size set to 717200 entries
     sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
     resetting superblock root inode pointer to 128
     sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
     resetting superblock realtime bitmap inode pointer to 129
     sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
     resetting superblock realtime summary inode pointer to 130
     Phase 2 - using internal log
             - zero log...
     zero_log: head block 114006 tail block 114002
     ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
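     Before resorting to -L, the sequence that error message describes would look something like this (a sketch only; I'm assuming this is disk 22 and that its device node is /dev/md22p1, with the array in maintenance mode):
     [pre]
     # mount the filesystem once so the XFS log gets replayed, then unmount it
     mkdir -p /mnt/test
     mount /dev/md22p1 /mnt/test
     umount /mnt/test

     # re-run the repair after the log has been replayed
     xfs_repair -v /dev/md22p1

     # only if the mount above fails: destroy the log and repair (may lose the pending metadata changes)
     # xfs_repair -vL /dev/md22p1
     [/pre]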
  8. Thank you for all of your help so far. It's really appreciated. I'm a little confused how to check the filesystem for disk 22. This is what my UI is showing when the array is started in maintenance mode. When I go into the drive settings, I don't even see the check filesystem section, even though I'm in maintenance mode. (I do see it for the other drives). Do you want me to rebuild the drive from parity, then check the filesystem?
  9. I must have missed a step. Disk 22 is currently emulated. That is the drive which was making a bad sound, had SMART errors, and when I tried to attach it, it showed up as "Unmountable: unsupported or no filesystem". Now when I assign the device to the disk 22 slot, it warns me "All existing data on this device will be OVERWRITTEN when array is Started". Currently, checking the filesystem on that device isn't possible because Unraid doesn't recognize the filesystem on it. (Looks like it hasn't completely died yet; maybe it's just corrupted to the point where the fs is unrecognized due to the controller errors/hard shutdown?) Should I re-add it and let Unraid rebuild it from parity before checking the filesystem? If so, since I'm planning on replacing that drive, would it be better to put the replacement in, let it rebuild, then check the fs?
  10. Thanks! I followed your steps and the parity drives are now enabled. The missing disk is being emulated. Here are the new diagnostics. Just curious, why start in maintenance mode the first time?? jabba-diagnostics-20231029-0935.zip
  11. Last night I noticed that my shares were not responding. After looking into it, all the drives attached to one of my LSI controllers were having issues. My log partition was full due to samba and subsequently syslog errors, and I was unable to generate a diagnostics. Any command which accessed the filesystem would just hang. I shut down the system (after being unable to do it gracefully despite trying multiple ways, I did the power-button hold of shame). I reseated both of my controller cards and restarted. As the system was spinning back up, I heard the telltale clicking sound of a dead 6TB drive. Once the system rebooted, I was greeted with my 2 parity drives disabled, and another "Unmountable: unsupported or no filesystem" drive. The drive which it thought needed to be formatted was indeed the 6TB drive making the dying sounds. I ran an extended SMART test on both parity drives and they turned out OK. A SMART test on the questionable 6TB drive resulted in an error. I do have a replacement for the 6TB in hand.
     Here is my dilemma. The common fix for disabled parity drives is to remove/re-add/rebuild parity. I don't want to rebuild parity because I have a bad drive I need to rebuild. Any thoughts on:
     1) Should I just assume that the controller card unseated itself? Are there any telltale signs of a failing card? I've been running Unraid for over 13 years and have never had a card unseat itself like this. It seems like a pretty big coincidence that I had a drive die at the same time as the controller malfunction. One couldn't lead to the other, could it?
     2) Is there a way to tell why my 2 parity drives were disabled (and only those 2 drives)? I know that might be tough/impossible without the logs from before my restart.
     3) Is it safe to / is there a way to re-enable my 2 parity drives without having to rebuild what is on them, so I can rebuild the dead drive?
     4) If my best bet is to abandon the dead drive, rebuild parity without it, then replace it, is there a way to see at least the filenames which were on it so I know what I need to replace?
     5) I know about the LSI/IronWolf drive-disabling issue, but my 2 parity drives are Exos, and my card is a SAS3 card, which I believe doesn't suffer from the issue, correct?
     6) What would be the best way to make sure this doesn't happen again? Could I have gotten an alert as soon as the problem started?
     I'll attach my diagnostics. Unfortunately it's from after the system reboot, as they wouldn't generate before (they hung trying to query the drive controller), but it should include the extended SMART test of the parity drives. Thanks in advance! jabba-diagnostics-20231028-2240.zip
  12. My system was rebuilding a drive today when I noticed my syslog was full. When I looked at it, I saw a bunch of errors saying "REISERFS error (device md1): vs-5150 search_by_key: invalid format found in block 1001443464. Fsck?" I stopped the rebuild, started in maintenance mode, and ran reiserfsck --check /dev/md1. The final output said "16 found corruptions can be fixed only when running with --rebuild-tree". I have 2 questions:
     1) Which disk is this? Does md1 map to sda? If so, that's my flash drive, which is definitely not formatted as ReiserFS. If it's sdb, then it's one of my two parity drives. That sounds scary.
     2) The wiki says to be very careful with this command, and I was wondering if you guys thought it would be safe/recommended to run?
     I'll attach the full output from the reiserfsck as well as the diagnostics zip. Thanks in advance! jabba-diagnostics-20221118-1441.zip reiserfsck_output.txt
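     (For my own notes, the md-to-disk mapping can probably be checked from the command line too. A sketch, and I'm going from memory on the exact field names Unraid's md driver reports, so treat the details as an assumption:)
     [pre]
     # Unraid's mdcmd reports the array state, including which sdX device backs each md slot;
     # I believe the relevant fields are named rdevName.N
     mdcmd status | grep rdevName

     # e.g. a line like "rdevName.1=sdg" would mean md1 is array disk 1 on /dev/sdg
     [/pre]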
  13. I had a drive go kaput. I followed the normal steps for replacing a drive, but during the data sync, while it was rebuilding the new drive, 2 other drives showed errors. I cancelled the rebuild and tried reseating and replacing the cables on the drives in question, but when I initiated the rebuild again, the same 2 drives had errors. The rebuild completed, but I'm curious what the errors mean: does that mean my disk was rebuilt with errors on it? Is my parity drive now hosed? Since reseating/replacing the cables didn't fix the problem, am I looking at 2 more failing drives? I've attached my diagnostics zip file. Below is what my main page looks like after the rebuild completed. Thanks in advance for any help! jabba-diagnostics-20220812-2227.zip
  14. I turned off auto array start and restarted the server. That allowed the GUI to start. I started the server in maintenance mode and ran [pre]xfs_repair -v /dev/md13[/pre]. It quickly exited, complaining "The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair". I found an earlier thread you replied to saying "You need to run xfs_repair with -L, usually there's no data loss." I then ran [pre]xfs_repair -v -L /dev/md13[/pre]. After about 5 minutes it completed successfully with no apparent errors or recommended action. I started the array and it's currently doing a Parity-Sync / Data Rebuild. Hopefully that was the next step. If that completes successfully in the next day or two, I'll mark the thread solved. Thank you so much for your help!
  15. OK, I upgraded to 6.3.1. Same behavior. I can boot into GUI mode, but the Firefox page never shows anything, it just spins waiting, which really renders it no more useful than the normal boot. I'll attach the diagnostics from 6.3.1. jabba-diagnostics-20170208-2104.zip
  16. Hello. I'm running 6.3 and woke up the other day to an unresponsive web GUI. Network shares weren't working either. I SSHed into the machine and noticed there was no /mnt/user directory, which leads me to think the array isn't started. I noticed in the logs it was complaining about /mnt/disk13, but a faulty disk shouldn't stop the server from working, right? There were also some other XFS errors in there with a stack trace. I restarted the machine and encountered the same result: SSH works, but no GUI and no array. I'll attach my diagnostics.zip file. Any help would be appreciated! jabba-diagnostics-20170208-1535.zip
  17. I'm running 6.1.6, but have been encountering this since I upgraded to 6. My Unraid server is strictly a NAS; no Docker containers running on it. I'm exposing my shares via SMB for all my Windows machines, and via NFS for mounting on my download server. My download machine is running XenServer (v. XS65ESP1016) with a CoreOS (v. 835.9.0) VM running on it. Inside the CoreOS VM, I'm running the usual suspects of Docker containers (nzbget, sickrage, etc...). After a (seemingly random) period of time, the UI of these apps starts acting wonky, and looking in the logs I see the infamous Stale File Handle error. It happens multiple times a day on my two CouchPotato containers (I run one for kids' movies and one for non-kids), but I have seen the Stale File Handle in other containers as well, just not nearly as frequently. A restart of the Docker container temporarily fixes the issue. My cloud-config has entries like this to accomplish the mounting:
     - name: mnt-jabba-movies.mount
       command: start
       content: |
         [Mount]
         What=192.168.0.16:/mnt/user/Movies
         Where=/mnt/jabba/movies
         Type=nfs
     I've done quite a bit of Google searching but haven't seen a good recent solution. Any ideas what I can do to eliminate the Stale File Handle errors when using NFS? If any more info would be helpful, ask away. I'd like to keep using NFS, so unless it's really the only option, I'm hoping not to hear "switch to SMB". Thanks in advance!
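     In case it matters for suggestions: if the answer turns out to involve mount options, I assume they would go into the same unit like this (a sketch only, with example options I've seen mentioned for NFS, not something I've confirmed helps with stale handles):
     [pre]
     - name: mnt-jabba-movies.mount
       command: start
       content: |
         [Mount]
         What=192.168.0.16:/mnt/user/Movies
         Where=/mnt/jabba/movies
         Type=nfs
         # Options= is where NFS mount options would go, e.g. forcing NFSv3 and hard mounts;
         # these particular values are just an illustration
         Options=nfsvers=3,hard,timeo=600
     [/pre]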
  18. I pulled the drive that seemed to be causing the hang. (It was one of those new 3TB WD red drives.) Restarted the array, and started rebuilding the parity drive. Again, the system became unresponsive, with this message in syslog:
     Jul 25 22:56:30 JABBA kernel: md: recovery thread syncing parity disk ...
     Jul 25 22:56:30 JABBA kernel: md: using 1536k window, over a total of 2930266532 blocks.
     Jul 26 00:18:56 JABBA kernel: sas: command 0xf43d4e40, task 0xf74b3540, timed out: BLK_EH_NOT_HANDLED
     Jul 26 00:18:56 JABBA kernel: sas: Enter sas_scsi_recover_host
     Jul 26 00:18:56 JABBA kernel: sas: trying to find task 0xf74b3540
     Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: aborting task 0xf74b3540
     Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: querying task 0xf74b3540
     Jul 26 00:18:56 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
     Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: task 0xf74b3540 failed to abort
     Jul 26 00:18:56 JABBA kernel: sas: task 0xf74b3540 is not at LU: I_T recover
     Jul 26 00:18:56 JABBA kernel: sas: I_T nexus reset for dev 0000000000000000
     Jul 26 00:18:56 JABBA kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
     Jul 26 00:18:58 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[0]:rc= 0
     Jul 26 00:18:58 JABBA kernel: sas: I_T 0000000000000000 recovered
     Jul 26 00:18:58 JABBA kernel: sas: sas_ata_task_done: SAS error 8d
     Jul 26 00:18:58 JABBA kernel: ata5: sas eh calling libata port error handler
     Jul 26 00:18:58 JABBA kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0
     Jul 26 00:18:58 JABBA kernel: ata5.00: failed command: WRITE FPDMA QUEUED
     Jul 26 00:18:58 JABBA kernel: ata5.00: cmd 61/00:00:98:e7:c4/02:00:3a:00:00/40 tag 0 ncq 262144 out
     Jul 26 00:18:58 JABBA kernel: res 41/10:00:98:e7:c4/00:00:3a:00:00/40 Emask 0x481 (invalid argument) <F>
     Jul 26 00:18:58 JABBA kernel: ata5.00: status: { DRDY ERR }
     Jul 26 00:18:58 JABBA kernel: ata5.00: error: { IDNF }
     Jul 26 00:18:58 JABBA kernel: ata5.00: configured for UDMA/133
     Jul 26 00:18:58 JABBA kernel: ata5: EH complete
     Jul 26 00:18:58 JABBA kernel: ata6: sas eh calling libata port error handler
     Jul 26 00:18:58 JABBA kernel: ata7: sas eh calling libata port error handler
     Jul 26 00:18:58 JABBA kernel: ata8: sas eh calling libata port error handler
     Jul 26 00:18:58 JABBA kernel: ata9: sas eh calling libata port error handler
     Jul 26 00:18:58 JABBA kernel: ata10: sas eh calling libata port error handler
     Jul 26 00:18:58 JABBA kernel: sas: --- Exit sas_scsi_recover_host
     Jul 26 00:18:58 JABBA kernel: sas: sas_ata_task_done: SAS error 2
     I found a thread of people having very similar issues; I'm going to link to this thread in that one.
  19. Long time listener, first time caller. My Unraid server (5.0-rc5), which consists of parity, cache, and 6 drives (two of which were just added on Friday), became mostly unresponsive yesterday. I couldn't access any of the shares and the web interface was not responding, yet I didn't notice anything out of place in the syslog. I could telnet into the machine, but any attempt to do anything with the disks (even an ls) would result in a hung telnet session. I tried for hours to stop the array cleanly, but nothing worked. I eventually just pushed the reset button on the server. The server started up OK, and began its "unclean shutdown" parity check. I kept an eye on the logs this time and saw quite a few "kernel: md: parity incorrect: 1942702520" messages. I assumed this was because of my hard reset. After some time, the system became unresponsive again. This time I noticed this in the logs:
     Jul 24 19:51:40 JABBA kernel: md: parity incorrect: 1942702576
     Jul 24 19:52:11 JABBA kernel: sas: command 0xf29fc6c0, task 0xf41a03c0, timed out: BLK_EH_NOT_HANDLED
     Jul 24 19:52:11 JABBA kernel: sas: Enter sas_scsi_recover_host
     Jul 24 19:52:11 JABBA kernel: sas: trying to find task 0xf41a03c0
     Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: aborting task 0xf41a03c0
     Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: querying task 0xf41a03c0
     Jul 24 19:52:11 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
     Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: task 0xf41a03c0 failed to abort
     Jul 24 19:52:11 JABBA kernel: sas: task 0xf41a03c0 is not at LU: I_T recover
     Jul 24 19:52:11 JABBA kernel: sas: I_T nexus reset for dev 0000000000000000
     Jul 24 19:52:13 JABBA kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
     Jul 24 19:52:13 JABBA kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
     Jul 24 19:52:13 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[0]:rc= 0
     Jul 24 19:52:13 JABBA kernel: sas: I_T 0000000000000000 recovered
     Jul 24 19:52:13 JABBA kernel: sas: sas_ata_task_done: SAS error 8d
     Jul 24 19:52:13 JABBA kernel: ata5: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0
     Jul 24 19:52:13 JABBA kernel: ata5.00: failed command: READ FPDMA QUEUED
     Jul 24 19:52:13 JABBA kernel: ata5.00: cmd 60/d8:00:38:4b:cb/01:00:73:00:00/40 tag 0 ncq 241664 in
     Jul 24 19:52:13 JABBA kernel: res 41/40:00:48:4c:cb/00:00:73:00:00/40 Emask 0x409 (media error) <F>
     Jul 24 19:52:13 JABBA kernel: ata5.00: status: { DRDY ERR }
     Jul 24 19:52:13 JABBA kernel: ata5.00: error: { UNC }
     Jul 24 19:52:13 JABBA kernel: ata5.00: configured for UDMA/133
     Jul 24 19:52:13 JABBA kernel: ata5: EH complete
     Jul 24 19:52:13 JABBA kernel: ata6: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: ata7: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: ata8: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: ata9: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: ata10: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: ata11: sas eh calling libata port error handler
     Jul 24 19:52:13 JABBA kernel: sas: --- Exit sas_scsi_recover_host
     Jul 24 19:52:44 JABBA kernel: sas: command 0xf29fc6c0, task 0xf41a03c0, timed out: BLK_EH_NOT_HANDLED
     I also noticed that the system is only recognizing one of my two memory sticks.
     root@JABBA:~# free -m
                  total       used       free     shared    buffers     cached
     Mem:          1770       1635        134          0         13       1473
     -/+ buffers/cache:        148       1622
     Swap:
     root@JABBA:~# vmstat
     procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
      r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
      0 46      0 138436  14120 1508744    0    0   158     0  695  284  0 13 47 39  0  0  0
     After that long-winded story, I have two questions:
     1) Do the above errors mean I have a dead drive? If so, why would a dead drive bring the whole system to its knees? I saw some SAS references in there; could my backplane have suddenly gone bad?
     2) Is there a way to gracefully shut down that I haven't thought of? I've tried the powerdown script, /etc/rc.d/rc.unRAID stop, the stop script, even a shutdown now -r. I get REALLY nervous pushing that reset button.
     I have attached my full syslog. Any help would be greatly appreciated. syslog.txt