tazman

Members
  • Posts: 68
  • Joined
  • Last visited

Everything posted by tazman

  1. I have experienced this a couple of times now: the internal search function doesn't return anything relevant. Here is a solution: switch to Google, add 'site:forums.unraid.net' to the search term and you will get much better results. Recent example, searching for "remove cache pool drive":
     Internal search, first three hits:
     1. Correct Procedure to remove Cache Drive?
     2. Unraid OS version 6.6.6 available
     3. Drive went from 100% full to 'Unmountable: No file system' after a clean shutdown and reboot
     Google, first three hits:
     1. (Solved) Removing a Disk from Cache Pool - General Support - Unraid
     2. Removing Cache Drive (SOLVED) - General Support - Unraid
     3. Remove drive from Cache pool - General Support - Unraid
     Merry x-mas! Hope you all find some time to unWIND.
  2. The WebUI refuses to connect in the current version, or it gives a "Connection failed" error. There seem to be two problems:
     1. PIA needs to be connected to a port-forwarding endpoint as described here: https://www.privateinternetaccess.com/helpdesk/kb/articles/how-do-i-enable-port-forwarding-on-my-vpn. This fixes the first error. Note that the settings do not allow all of those endpoints to be selected. For example, there is an invalid "Germany" choice; instead there should be a "DE Frankfurt" and a "DE Berlin".
     2. From what I could find out, there also seems to be a problem with 2.94. Reverting to 2.93 by entering "linuxserver/transmission:121" as the repository and applying the change fixed this. But that version doesn't use PIA. Does anybody have an idea how to fix this, e.g. by reverting to an older version of haugene/transmission-openvpn? A rough sketch of what I mean is below. Thanks! tazman
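     In case it helps, this is roughly how pinning an older image tag could look from the command line; the tag "2.13" is only a placeholder, check Docker Hub for the tags that actually exist:
         # pull a specific (example) tag instead of :latest
         docker pull haugene/transmission-openvpn:2.13
         # in the unRAID Docker template this corresponds to changing the
         # "Repository" field to haugene/transmission-openvpn:2.13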
  3. I found a solution and have updated the first post accordingly.
  4. Same issue here. The log window does not work. I suspect Bitdefender, but even disabling all of its functions doesn't fix it.
  5. Sure, this is a possibility. It just takes longer to do it this way. But I am still wondering how we can use the sector number reported in the log to identify the affected file(s).
  6. (I have updated this first post to reflect the solution I have found.) My syslog shows read errors like these:
     Sep 3 08:17:01 SS kernel: print_req_error: critical medium error, dev sdr, sector 3989616
     Sep 3 08:17:01 SS kernel: md: disk17 read error, sector=3989552
     A parity check also confirms read errors on the same drive. I wanted to find out which files are affected and used the following approach (a condensed command sketch follows below):
     - Start maintenance mode.
     - Mount the drive partition, e.g. mount /dev/sdr1 /mnt/test. Get the device name (sdr in my case) from the unRAID Main GUI and append 1 for the first partition.
     - Check the block size with xfs_info /mnt/test - look for data = bsize=[block size]. In my case, on a 4TB drive, it was 4096.
     - Check the start sector of the partition with fdisk -lu /dev/sdr. In my case 64.
     - Calculate the block number of the sector as (int)(([sector] - [start sector]) * 512 / [block size]). My bad sector 3989616 is in block 498694.
     - Unmount the partition so that xfs_db will run, e.g. umount /mnt/test.
     - Run xfs_db -r /dev/sdr1 (-r is for read-only).
     - On the xfs_db command line, get the information for the block with blockget -n -b [block number], e.g. blockget -n -b 498694. This will run for a while as it reads the entire disk. At the beginning it will output the inode number for the block; in my case it was 35676. The larger the drive, the more memory it needs. With 4GB on a 4TB disk I got an out-of-memory error: xfs_db: ___initbuf can't memalign 32768 bytes: Cannot allocate memory. Upgrading to 16GB allowed the command to run.
     - Get the file name for the inode with ncheck -i [inode].
     - Enter quit or press Ctrl-D to exit xfs_db.
     I have not figured out:
     - How to get blockget to run faster or with less memory usage. Maybe there is an alternative way to determine the inode of a block.
     - How to check additional blocks without exiting xfs_db and running blockget again. I tried convert but couldn't get it to work.
     Maybe someone from the community has an idea about those. Unfortunately, xfs_db is not very well documented on the web beyond man pages, e.g. https://linux.die.net/man/8/xfs_db. Kind regards, Tazman
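     For convenience, here is the same procedure condensed into shell commands (device name, sector, block size and inode are the values from my example; adjust them for your own drive):
         mount /dev/sdr1 /mnt/test                  # mount the data partition (array in maintenance mode)
         xfs_info /mnt/test | grep bsize            # note the data block size, 4096 here
         fdisk -lu /dev/sdr                         # note the partition start sector, 64 here
         echo $(( (3989616 - 64) * 512 / 4096 ))    # block number for the bad sector -> 498694
         umount /mnt/test                           # xfs_db needs the partition unmounted
         xfs_db -r /dev/sdr1                        # open the filesystem read-only
         #   xfs_db> blockget -n -b 498694          # reports the inode that owns the block (35676 here)
         #   xfs_db> ncheck -i 35676                # reports the file name for that inode
         #   xfs_db> quit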
  7. Really, really sorry to hear this. This was my personal nightmare a while ago. Not only was I concerned that an attack could come from my own machine, but from any (Windows) machine on my network used by my family or guests. Since then I have done the following to prevent/reduce the likelihood of this happening (I hope):
     - Do not export any disks, only the shares I need to.
     - Created a special user that is the only one allowed to write to the shares.
     - Made all shares read-only with the exception of this special user (except a Public share writeable by everybody). A rough sketch of the underlying Samba settings follows below.
     - Have my backup program (Syncovery) perform the backups of my Windows machines with the special user's credentials. I trust that an attacker will not find the login information buried deep inside the backup program. Syncovery has a ransomware detection that notices when data has changed massively in a directory and refuses to copy.
     - When I need to perform small copy jobs, I copy the data to the Public share, then telnet into unRAID and copy the data over to where I need it.
     - On my Windows work machine, created a separate Windows user identical to the special user on unRAID. When I need to work on the array data I open a Windows session with that special user and refrain from any surfing.
     It's a bit of work, but this way I feel confident that my data is safe from a ransomware attack. Independent of unRAID, I am backing up my most important personal data with Crashplan, which only writes changes and has versioning. I hope you will find a way to get your data back and that the description above helps you secure your data for the future. All the best! Tazman
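     To illustrate the read-only-except-one-user idea: on unRAID this is normally configured per share in the GUI, but the raw Samba equivalent (e.g. in the SMB extras) would look roughly like this - "Backups" and "backupuser" are placeholder names:
         [Backups]
            path = /mnt/user/Backups
            read only = yes          ; everyone else gets read-only access
            write list = backupuser  ; only this dedicated account may write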
  8. Thanks, bonienl, that makes a lot of sense! I am relieved now.
  9. Hi, I am running a parity sync after a new configuration to rebuild my two parity drives. The parity sync stands at 62% and I just noticed that drives have spun down. See attached picture. This is very strange, as I thought that all drives would have to be accessed continuously to perform a parity sync. Is this expected behavior, or do I need to worry that something is going wrong at the moment? Right now, all drives are up again. Could it be that data is read from the cache? Thanks for sharing your opinions about this! Tazman.
  10. To close this one out, I would like to report that I have replaced the three Marvell controllers with an LSI SAS9201-16i and everything is working fine again. Replacing the controller was straightforward: just swap the controller and start unRAID. Before the replacement I had two more drives fail, so even with my second parity drive I was down to no redundancy. Like the first failed drive, the two additional drives reported write errors in the log despite the fact that nothing was being written to them when they failed. I left the failed drives untouched and replaced them with new ones after the controller was swapped. I then compared their contents with the rebuilt drives. Mounting the old drives failed at first because they had the same filesystem ID as the rebuilt new ones, so I had to generate new IDs for them. This also failed because both drives had unreplayed XFS journal entries. I suspected that the journal entries would lead to data loss anyway and deleted them so I could mount the drives (the commands are sketched below). One drive had a current pending sector. Both old drives (without the journal entries replayed) did not show any data loss. Looking back, I am still unclear why I had data loss on my first drive. When I precleared the old drives, both showed changing current pending sector counts during the several preclear cycles, but at the end they settled at zero with an overall 100% clean SMART report. I suspect/hope that the drives are now OK again and that the pending sectors were a consequence of the controller problems.
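      For anyone facing the same situation, the commands I am referring to look roughly like this (sdX1 is a placeholder for the old drive's data partition; note that zeroing the log discards the unreplayed journal entries, so use with care):
          xfs_repair -L /dev/sdX1           # zero the XFS log so the filesystem can be used without replaying it
          xfs_admin -U generate /dev/sdX1   # give the old partition a new filesystem UUID to avoid the clash
          mkdir -p /mnt/old && mount /dev/sdX1 /mnt/old   # mount it for comparison with the rebuilt drive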
  11. You were right, the emulation is the same as physical. Which it should be. Thanks so much for your advice!
  12. Thanks, SSD, very good points. I am still unclear about why the data loss has occurred.
  13. The above is from the unstarted array. I have just started it and the loop seemed to have gone away. I am posting new diagnostics below. Apologies for a false alarm if this is normal behaviour. ss-diagnostics-20171105-1754.zip
  14. In the meantime, I was able to find out that the syslog becoming unresponsive is due to a loop around the "unRAID driver removed" message, as reported in a separate thread here: https://forums.lime-technology.com/topic/61209-unraid-driver-removed-loop/. Thanks for your continued support!
  15. Hi, my server is in a constant loop around "kernel: md: unRAID driver removed". The block below occurs in the syslog every few seconds. I am running in safe mode at the moment.
     Nov 5 17:25:27 SS emhttp: shcmd (161): rmmod md-mod |& logger
     Nov 5 17:25:27 SS kernel: md: unRAID driver removed
     Nov 5 17:25:27 SS emhttp: shcmd (162): modprobe md-mod super=/boot/config/super.dat |& logger
     Nov 5 17:25:27 SS kernel: md: unRAID driver 2.7.2 installed
     Nov 5 17:25:27 SS emhttp: Pro key detected, GUID: 0781-5571-0013-540703115442 FILE: /boot/config/Pro.key
     Nov 5 17:25:27 SS emhttp: Device inventory:
     Nov 5 17:25:27 SS emhttp: shcmd (163): udevadm settle
     Nov 5 17:25:27 SS emhttp: WDC_WD30EFRX-68EUZN0_WD-WMC4N0J1EL17 (sdm) 2930266532
     Nov 5 17:25:27 SS emhttp: WDC_WD30EFRX-68AX9N0_WD-WMC1T1851777 (sdj) 2930266532
     Nov 5 17:25:27 SS emhttp: WDC_WD30EFRX-68EUZN0_WD-WCC4N2KHEZ6V (sdk) 2930266532
     Nov 5 17:25:27 SS emhttp: WDC_WD20EARX-00PASB0_WD-WCAZA9597369 (sdg) 1953514552
     Nov 5 17:25:27 SS emhttp: WDC_WD20EARX-00PASB0_WD-WCAZA8522113 (sdh) 1953514552
     Nov 5 17:25:27 SS emhttp: WDC_WD30EZRS-00J99B0_WD-WCAWZ0366602 (sdd) 2930266532
     Nov 5 17:25:27 SS emhttp: HGST_HDN724040ALE640_PK1334PCKAMVUX (sde) 3907018532
     Nov 5 17:25:27 SS emhttp: HGST_HDN724040ALE640_PK2334PEK0UZLT (sdb) 3907018532
     Nov 5 17:25:27 SS emhttp: HGST_HDN724040ALE640_PK2338P4HY28PC (sdr) 3907018532
     Nov 5 17:25:27 SS emhttp: WDC_WD40EFRX-68WT0N0_WD-WCC4E3RZT82X (sdf) 3907018532
     Nov 5 17:25:27 SS emhttp: ST2000NM0011_Z1P0AKCM (sdc) 1953514552
     Nov 5 17:25:27 SS emhttp: HGST_HDN724040ALE640_PK1334PCK5556X (sds) 3907018532
     Nov 5 17:25:27 SS emhttp: HGST_HDN724040ALE640_PK1334PCK8YD3X (sdq) 3907018532
     Nov 5 17:25:27 SS emhttp: WDC_WD20EVDS-63T3B0_WD-WCAVY5175470 (sdn) 1953514552
     Nov 5 17:25:27 SS emhttp: Hitachi_HDS723020BLA642_MN1220F325TLLD (sdo) 1953514552
     Nov 5 17:25:27 SS emhttp: HGST_HDN724040ALE640_PK2334PEJZMJ0T (sdl) 3907018532
     Nov 5 17:25:27 SS emhttp: WDC_WD30EZRX-00MMMB0_WD-WCAWZ0687859 (sdi) 2930266532
     Nov 5 17:25:27 SS emhttp: SanDisk_Cruzer_Fit_4C530013540703115442-0:0 (sda) 7815136
     Nov 5 17:25:27 SS emhttp: Hitachi_HDS723030ALA640_MK0301YHGUAYWA (sdp) 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (1): import 0 sde 3907018532 0 HGST_HDN724040ALE640_PK1334PCKAMVUX
     Nov 5 17:25:27 SS kernel: md: import disk0: (sde) HGST_HDN724040ALE640_PK1334PCKAMVUX size: 3907018532
     Nov 5 17:25:27 SS kernel: mdcmd (2): import 1 sdf 3907018532 0 WDC_WD40EFRX-68WT0N0_WD-WCC4E3RZT82X
     Nov 5 17:25:27 SS kernel: md: import disk1: (sdf) WDC_WD40EFRX-68WT0N0_WD-WCC4E3RZT82X size: 3907018532
     Nov 5 17:25:27 SS kernel: mdcmd (3): import 2 sdl 3907018532 0 HGST_HDN724040ALE640_PK2334PEJZMJ0T
     Nov 5 17:25:27 SS kernel: md: import disk2: (sdl) HGST_HDN724040ALE640_PK2334PEJZMJ0T size: 3907018532
     Nov 5 17:25:27 SS kernel: mdcmd (4): import 3 sdn 1953514552 0 WDC_WD20EVDS-63T3B0_WD-WCAVY5175470
     Nov 5 17:25:27 SS kernel: md: import disk3: (sdn) WDC_WD20EVDS-63T3B0_WD-WCAVY5175470 size: 1953514552
     Nov 5 17:25:27 SS kernel: mdcmd (5): import 4 sdd 2930266532 0 WDC_WD30EZRS-00J99B0_WD-WCAWZ0366602
     Nov 5 17:25:27 SS kernel: md: import disk4: (sdd) WDC_WD30EZRS-00J99B0_WD-WCAWZ0366602 size: 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (6): import 5 sdj 2930266532 0 WDC_WD30EFRX-68AX9N0_WD-WMC1T1851777
     Nov 5 17:25:27 SS kernel: md: import disk5: (sdj) WDC_WD30EFRX-68AX9N0_WD-WMC1T1851777 size: 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (7): import 6 sdq 3907018532 0 HGST_HDN724040ALE640_PK1334PCK8YD3X
     Nov 5 17:25:27 SS kernel: md: import disk6: (sdq) HGST_HDN724040ALE640_PK1334PCK8YD3X size: 3907018532
     Nov 5 17:25:27 SS kernel: mdcmd (8): import 7 sdg 1953514552 0 WDC_WD20EARX-00PASB0_WD-WCAZA9597369
     Nov 5 17:25:27 SS kernel: md: import disk7: (sdg) WDC_WD20EARX-00PASB0_WD-WCAZA9597369 size: 1953514552
     Nov 5 17:25:27 SS kernel: mdcmd (9): import 8 sdo 1953514552 0 Hitachi_HDS723020BLA642_MN1220F325TLLD
     Nov 5 17:25:27 SS kernel: md: import disk8: (sdo) Hitachi_HDS723020BLA642_MN1220F325TLLD size: 1953514552
     Nov 5 17:25:27 SS kernel: mdcmd (10): import 9 sdp 2930266532 0 Hitachi_HDS723030ALA640_MK0301YHGUAYWA
     Nov 5 17:25:27 SS kernel: md: import disk9: (sdp) Hitachi_HDS723030ALA640_MK0301YHGUAYWA size: 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (11): import 10 sdh 1953514552 0 WDC_WD20EARX-00PASB0_WD-WCAZA8522113
     Nov 5 17:25:27 SS kernel: md: import disk10: (sdh) WDC_WD20EARX-00PASB0_WD-WCAZA8522113 size: 1953514552
     Nov 5 17:25:27 SS kernel: mdcmd (12): import 11 sdi 2930266532 0 WDC_WD30EZRX-00MMMB0_WD-WCAWZ0687859
     Nov 5 17:25:27 SS kernel: md: import disk11: (sdi) WDC_WD30EZRX-00MMMB0_WD-WCAWZ0687859 size: 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (13): import 12 sds 3907018532 0 HGST_HDN724040ALE640_PK1334PCK5556X
     Nov 5 17:25:27 SS kernel: md: import disk12: (sds) HGST_HDN724040ALE640_PK1334PCK5556X size: 3907018532
     Nov 5 17:25:27 SS kernel: mdcmd (14): import 13 sdm 2930266532 0 WDC_WD30EFRX-68EUZN0_WD-WMC4N0J1EL17
     Nov 5 17:25:27 SS kernel: md: import disk13: (sdm) WDC_WD30EFRX-68EUZN0_WD-WMC4N0J1EL17 size: 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (15): import 14 sdr 3907018532 0 HGST_HDN724040ALE640_PK2338P4HY28PC
     Nov 5 17:25:27 SS kernel: md: import disk14: (sdr) HGST_HDN724040ALE640_PK2338P4HY28PC size: 3907018532
     Nov 5 17:25:27 SS kernel: mdcmd (16): import 15 sdk 2930266532 0 WDC_WD30EFRX-68EUZN0_WD-WCC4N2KHEZ6V
     Nov 5 17:25:27 SS kernel: md: import disk15: (sdk) WDC_WD30EFRX-68EUZN0_WD-WCC4N2KHEZ6V size: 2930266532
     Nov 5 17:25:27 SS kernel: mdcmd (17): import 16
     Nov 5 17:25:27 SS kernel: mdcmd (18): import 17
     Nov 5 17:25:27 SS kernel: mdcmd (19): import 18
     Nov 5 17:25:27 SS kernel: mdcmd (20): import 19
     Nov 5 17:25:27 SS kernel: mdcmd (21): import 20
     Nov 5 17:25:27 SS kernel: mdcmd (22): import 21
     Nov 5 17:25:27 SS kernel: mdcmd (23): import 22
     Nov 5 17:25:27 SS kernel: mdcmd (24): import 23
     Nov 5 17:25:27 SS kernel: mdcmd (25): import 24
     Nov 5 17:25:27 SS kernel: mdcmd (26): import 25
     Nov 5 17:25:27 SS kernel: mdcmd (27): import 26
     Nov 5 17:25:27 SS kernel: mdcmd (28): import 27
     Nov 5 17:25:27 SS kernel: mdcmd (29): import 28
     Nov 5 17:25:27 SS kernel: mdcmd (30): import 29 sdb 3907018532 0 HGST_HDN724040ALE640_PK2334PEK0UZLT
     Nov 5 17:25:27 SS kernel: md: import disk29: (sdb) HGST_HDN724040ALE640_PK2334PEK0UZLT size: 3907018532
     Nov 5 17:25:27 SS emhttp: import 30 cache device: sdc
     Nov 5 17:25:27 SS emhttp: import flash device: sda
     This eventually causes the syslog to become very long and unresponsive (a quick way to gauge how fast it grows is sketched below). This may or may not be related to my data loss issues reported here: https://forums.lime-technology.com/topic/61153-data-loss-after-drive-failure-during-parity-check I appreciate your suggestions about how to debug this. I am attaching the full diagnostics. Note that I have three AOC-SAS2LP-MV8 cards and am aware of the issues with Marvell controllers by now. Thanks for your help! Tazman ss-diagnostics-20171105-1716.zip
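     A quick, generic way to gauge how fast this loop is filling the log (standard shell, nothing unRAID-specific):
         grep -c "unRAID driver removed" /var/log/syslog   # number of times the loop has run so far
         ls -lh /var/log/syslog                            # current size of the syslog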
  16. I did not format the drive. I don't think unRAID even offered it. I stopped the array, set Drive 12 to 'no device', then precleared the old drive and reassigned it. At this point in time - this is my assumption! - it was accepted as a new drive and rebuilt.
  17. Yes, the status of the drive was still redballed and its contents were emulated.
  18. I never touched the failed drive while it was still part of the array. The preclear was only done after the disk was removed from the array. While the disk was being precleared, it was emulated by the array. After the preclear, unRAID accepted the drive as a new one and started the rebuild. I don't remember that there was a format. These were the steps:
     - Drive 12 redballed during a parity check
     - Let the parity check finish
     - Stop array, remove and check drive - everything was OK
     - Reinsert drive to check if anything had changed; unRAID didn't accept the drive as good and still showed it as redballed
     - Stop the array, set Drive 12 to "no device", start the array
     - At this point disk 12 showed under unassigned devices and I could preclear it
     - After the preclear finished, I stopped the array
     - Assign the drive to disk 12 and start the array
     - Start the rebuild
     Is there a mistake in that sequence that could have caused the data loss? I cannot say for sure that the data loss occurred during the rebuild, as I did not check the disk 12 contents while it was being emulated. Thanks!
  19. Correct, but the rebuild should reconstruct the data on that failed drive. Why did this not happen in this case?
  20. I have just lost about 3TB of data from a single disk failure during a parity check/rebuild on a system with two parity drives. The other day, I ran a parity check. During the parity check, Drive 12 (4TB) was disabled (red cross). The parity check reported more than 2000 errors. I checked the drive, ran an extended SMART test and then precleared the disabled drive once. No problem at all. So I opened the case, checked for any problem and suspected a cable bent at 90 degrees. I then reinstalled the disk in the same slot and unRAID accepted it as a new drive, so I started the rebuild. The rebuild took about 12 hours and finished without any errors. But then I noticed that the drive was listed as having nearly all its space free. I am using high-water allocation and know for sure that the drive had at least 3TB on it. I checked the drive and, yes, the files on it were only from a recent backup job. So I fear that this data is lost - despite dual parity and despite a normal scenario like a single disk failure, which has happened several times before. It sounds similar to the issue reported here: https://forums.lime-technology.com/topic/60963-disk-3-error-disk-3-rebuilt-from-parity-disk-3-missing-almost-everything/. The array was running normally during the rebuild. Here is what I found out myself that may indicate possible other problems or even causes:
     - My first place to go was the syslog. I could access the tail of it via Log on the Main page without problems. Then I wanted to check the full syslog on Tools/System Log, but I only got a "page not responsive" message from the browser. I was able to get a larger chunk of the syslog with tail -f /var/log/syslog > /boot/syslog.txt (a simpler way to save the whole log is sketched below). The parts that I could get to were from after the rebuild and didn't show any problems. I rebooted, but disk 12 was still empty. Syslog worked again.
     - I suspected memory problems. A 48-hour memtest didn't produce any errors. The BIOS reports several "Smbios 0x01 DIMM_2B (Single Bit ECC Memory Error)" - many while the memory test was running and also before, but not one during the time when the drive failed. It seems like a single-bit ECC error is normal if it occurs only a few times a day. The BIOS is set to stop the machine when a dual-bit ECC error occurs. That has never happened.
     - Fix Common Problems reports: "It appears that your server has a Marvell based hard drive controller installed within it. Some users with Marvell based controllers exhibit random drives dropping offline, recurring parity errors during checks etc. This tends to be exacerbated if VT-D / IOMMU is enabled in the BIOS. Generally, LSI based controllers would be preferred over Marvell based controllers because of these issues. Note that these issues are out of Limetech's hands. Depending upon the exact combination of hardware present in your server, you may not have any problems whatsoever. If you have no problems, then this warning can be safely ignored, but future versions of unRaid (and later Kernel versions) may (or may not) present you with the previously mentioned issues."
     - On the Marvell issue (based on https://forums.lime-technology.com/topic/59091-marvel-issues-starting-point-for-investigation/ and others): I do have three Marvell SATA cards (AOC-SAS2LP-MV8) installed (great!!!). The failure occurred on one of them. I don't use any VMs. IOMMU is not supported; VT-D is enabled.
     - The drive is a HGST HDN724040ALE640 with ATA8-ACS T13/1699-D revision 4.
     So from this I conclude/wonder about the following and welcome your recommendations and any suggestions on how to analyze or remediate this situation further:
     - Is there any way to get the data back? I fear the answer is no.
     - It seems that the most likely cause is the Marvell controller, not the bent cable. I have seen no indication on the boards that a fix is on the horizon, so replacing it seems warranted. Which card(s) would you go for?
     - While the Marvell controller is still in place, it seems prudent to disable VT-D / IOMMU and not run any parity checks.
     - Maybe replacing DIMM_2B, which gives the ECC error, is also in order.
     - I am not sure at all what to do about the unresponsive syslog problem.
     I am attaching the syslog part I was able to salvage after the incident and the most recent one. Thanks!
     syslog incident.txt
     syslog last.txt
     _________________
     unRAID 6.3.5 - Board: Supermicro X9SCM-F - CPU: Intel i3 2120T - RAM: 4GB, 2x 2GB Kingston KVR1333D3S8E9S/2GEC SDRAM DIMM with up to 32GB DDR3, unbuffered, ECC 1333/1066 - PSU: Corsair Professional Series Gold AX850 - SATA Card: 3x AOC-SAS2LP-MV8 - Backplane: 6x icyDock MB455SPF 5in3 - Case: Lian Li PC-343B
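     As an aside on salvaging the syslog: rather than tailing it, the whole current log can simply be copied to the flash drive for later analysis, roughly like this (the target file name is just an example):
         cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M).txt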
  21. I just had exactly the same experience as Living Legend. My drive 12 (4TB) was disabled during a parity check, which reported more than 2000 errors. I suspect a cable that was bent too sharply. The SMART report on the disk is as clean as it can be. So I reinstalled the disk, precleared it once - no problems - and then reinserted it as a 'new' disk back into the array. The rebuild took about 12 hours and finished without any errors. Before, the drive had at least 3TB of data on it. Now it's only 100GB, which probably came from a backup job that ran after it was rebuilt. The array was running during the rebuild. I tried to look at the syslog, but the syslog page was unresponsive. The parts that I could get to were from after the rebuild and didn't show any problems. I rebooted, but disk 12 is still empty. So I lost 3TB! What is going on here? This severely shatters my trust in unRAID!
  22. Hi, I am also having issues with the preclear script stopping. I am using the most recent version 2017.03.31 with the scripts gfjardim 0.8.5-beta and bjp999 1.15b. Unassigned Devices is installed and up to date with 2017.05.21. The progress simply stalls, no values change anymore, and the disk access LED indicates that there is no activity on the disk. The GUI/server remains responsive. It's happening on different disks and controllers and with both scripts. Starting the script via the preclear plugin or Unassigned Devices makes no difference. It happens in the second or third cycle; e.g. my last try stalled at 95% during the post-read of the second cycle. I did not observe any changes in memory or CPU usage, so I don't think the process has an issue with resources (a simple way to check whether the script is still running is sketched below). Stopping and restarting the preclear does not work: the script says "Starting..." and doesn't do anything. Instead it displays the following error message:
     /boot/config/plugins/preclear.disk/preclear_disk.sh -c 2 -J /dev/sdq
     root@SS:/usr/local/emhttp# /boot/config/plugins/preclear.disk/preclear_disk.sh -c 2 -J /dev/sdq
     sfdisk: invalid option -- 'R'
     The script seems to continue to run, which is also indicated by this line in the log:
     May 26 10:54:29 SS preclear.disk: Pausing preclear of disk 'sdm'
     which occurred after the preclear stalled and I stopped the array. I didn't see anything else blatantly obvious in the log, but close to the stall the following warning occurs:
     May 24 10:46:35 SS kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
     To fix this, a shutdown is required; a reboot is not sufficient. I initially thought that this points to a hardware issue, but I tried different controllers on my Supermicro X9SCM-F and AOC-SAS2LP-MV8 with the same results, and then I saw others also reporting the script stalling, so I thought I'd share this. I trust that the script still performs its tasks properly, so I just start another run to get to my expected number of cycles. So it's not a real problem for me, more an annoyance. Best regards, Thomas
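     When a run stalls like this, it can be checked from the command line whether the script process is still alive before restarting anything (generic shell; the bracket trick just keeps grep from matching itself):
         ps aux | grep "[p]reclear_disk"    # lists any preclear_disk.sh processes still running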
  23. Ok, good to know. I read that maintenance mode is needed for that as well. Anyhow, I have been wondering for a while: if you suspect that your device is faulty and you run corrections on it, doesn't "maintaining parity" make the parity data faulty as well? If something is really wrong with a disk, wouldn't it make more sense not to touch the parity disk at all, assuming that it still holds better/more accurate data that could be used to rebuild the faulty disk from scratch if needed?
  24. Thanks johnnie.black. I rebooted into 6.2 and performed xfs_repair against all 14 md devices (a check-only loop is sketched below). When I use sd* I still get the same error message as before, but I assume that was wrong to begin with and that those devices should have been sd*1, which produces the same result as md*. None of the checks returned any errors and I am really wondering what was wrong in the first place. I will continue monitoring.
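      For completeness, checking all array devices can be scripted roughly like this (run with the array started in maintenance mode; the -n flag only reports problems and does not modify anything - adjust the device count to your array):
          for i in $(seq 1 14); do
            echo "=== /dev/md$i ==="
            xfs_repair -n /dev/md$i
          done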