RallyGallery Posted July 1

Hello all,

A very strange sequence of events happened and I would like some advice. I decided to try UnRAID 7 Beta on my backup server. It installed and ran with no problems for about 30 minutes. This server has been absolutely solid, with zero issues, for the last 12 months.

The GUI then stopped responding and I was unable to connect to it. The screen on the server rack showed pages of errors I had never seen before. I was able to reboot the server after some time, and one of the hard drives had 'failed' without any warning. No errors were reported on the drive and SMART gave no hints. Before installing the UnRAID 7 beta there were no errors at all.

I decided to roll back to version 6.12 of UnRAID and everything then seemed to function, apart from disk 3 'failing' again. I removed the disk, put it in a USB caddy and inspected it in Windows with data software: the disk had no bad sectors and no errors of any kind. I put it back in the array and turned on the server. I did a new config with the drives in their original slots, and parity had to be rebuilt, which completed with no errors. All seemed good.

Then, after an hour, all the drives had 160 errors on them and the cache pool had completely failed, with no docker containers functioning at all. The data is showing on the drives, but in the Shares tab every single share has disappeared. In 6 years of using UnRAID I have never seen such a situation.

Note: I have a backup of all the data, so I am not losing anything. The USB key also appears to be perfectly fine.

My thought on a possible solution is basically to start again from scratch. I have a backup of my key, and I have a record of which disk goes in which slot of the array in unRAID. The plan would be:

1. Install 6.12 on the USB stick using the unRAID creation tool and copy my key file onto it.
2. Boot the server.
3. Configure the array and format the disks.
4. Install dockers and VM again.
5. Copy data.
6. Install the parity disk in its slot.

Any other advice? I really have no idea what happened! I have attached the last diagnostics file that I managed to get from the server before I shut it down, in case that helps. Any help or insights anyone can give are greatly appreciated.

server-diagnostics-20240701-0828.zip
JorgeB Posted July 2

The log is being spammed with btrfs errors. Reboot to clear the log, then post new diags after array start.
RallyGallery Posted July 2

3 hours ago, JorgeB said: Log is being spammed with btrfs errors, reboot to clear the log and post new diags after array start

OK, will do. Thanks for your help. I will report back here.
RallyGallery Posted July 2

I have just booted the server back up and downloaded two sets of diagnostics: one as the server booted up and another as the array came online. The two files are attached and named accordingly. However, as I type this, the array seems to be working OK, which seems incredibly strange. The shares have come back and are displaying correctly. All the docker containers are working, as well as the VMs. Maybe it was a case of 'turn it off and on again'. Maybe the diagnostics will show something?

UPDATE: after 30 minutes the server is starting to fail, with containers not starting and the VMs no longer working. The log files keep showing this:

server booted_array offline-diagnostics-20240702-1508.zip
array online-diagnostics-20240702-1509.zip

Edited July 2 by RallyGallery
RallyGallery Posted July 2

The cache pool is two SSDs in RAID 1. I wonder if one of the disks has failed in some way. I took this from the log for the second of the cache drives:

Jul 2 15:06:52 Moog kernel: sd 10:0:4:0: [sdf] 468862128 512-byte logical blocks: (240 GB/224 GiB)
Jul 2 15:06:52 Moog kernel: sd 10:0:4:0: [sdf] Write Protect is off
Jul 2 15:06:52 Moog kernel: sd 10:0:4:0: [sdf] Mode Sense: 7f 00 10 08
Jul 2 15:06:52 Moog kernel: sd 10:0:4:0: [sdf] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 2 15:06:52 Moog kernel: sdf: sdf1
Jul 2 15:06:52 Moog kernel: sd 10:0:4:0: [sdf] Attached SCSI disk
Jul 2 15:06:52 Moog kernel: BTRFS: device fsid e2f21b80-f3ed-4181-a646-6a5f79acd47a devid 2 transid 1645519 /dev/sdf1 scanned by udevd (954)
Jul 2 15:07:39 Moog emhttpd: ADATA_SU630_2K45291E89ED (sdf) 512 468862128
Jul 2 15:07:39 Moog emhttpd: import 31 cache device: (sdf) ADATA_SU630_2K45291E89ED
Jul 2 15:07:39 Moog emhttpd: read SMART /dev/sdf
Jul 2 15:08:50 Moog emhttpd: #011devid 2 size 223.57GiB used 84.03GiB path /dev/sdf1
Jul 2 15:08:50 Moog kernel: BTRFS info (device sdg1): bdev /dev/sdf1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2340 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2340 CDB: opcode=0x28 28 00 04 ff 44 90 00 00 68 00
Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 83838096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2342 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2346 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2342 CDB: opcode=0x28 28 00 04 fe dd 48 00 00 80 00
Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 83811656 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 2
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2346 CDB: opcode=0x28 28 00 04 93 4c b8 00 00 08 00
Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 76762296 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2343 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2343 CDB: opcode=0x28 28 00 07 7b 0d 00 00 00 08 00
Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 125504768 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Jul 2 15:36:41 Moog kernel: BTRFS error (device sdg1): bdev /dev/sdf1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2344 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2344 CDB: opcode=0x2a 2a 00 08 78 5e 00 00 00 40 00
Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 142106112 op 0x1:(WRITE) flags 0x800 phys_seg 6 prio class 2
Jul 2 15:36:41 Moog kernel: BTRFS error (device sdg1): bdev /dev/sdf1 errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2345 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] tag#2345 CDB: opcode=0x2a 2a 00 08 79 58 f0 00 00 40 00
Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 142170352 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
Jul 2 15:36:41 Moog kernel: BTRFS error (device sdg1): bdev /dev/sdf1 errs: wr 3, rd 1, flush 0, corrupt 0, gen 0
Jul 2 15:36:41 Moog kernel: BTRFS error (device sdg1): bdev /dev/sdf1 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] Synchronizing SCSI cache
Jul 2 15:36:41 Moog kernel: sd 10:0:4:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
Jul 2 15:36:46 Moog kernel: BTRFS error (device sdg1: state EA): bdev /dev/sdf1 errs: wr 29, rd 97, flush 0, corrupt 0, gen 0
Jul 2 15:36:46 Moog kernel: BTRFS error (device sdg1: state EA): bdev /dev/sdf1 errs: wr 29, rd 98, flush 0, corrupt 0, gen 0
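
For readers triaging a similar log, here is a minimal Python sketch of how the excerpt above could be summarised: it counts kernel "I/O error" lines per device and operation, and keeps the most recent btrfs error counters per device. The function name `summarise` and the sample list are illustrative assumptions, not part of Unraid or its diagnostics tooling, and the regular expressions are written only against the line formats shown in this post.

```python
import re
from collections import Counter

# Kernel block-layer error line, e.g.
# "I/O error, dev sdf, sector 83838096 op 0x0:(READ) flags 0x80700 ..."
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+) op 0x[0-9a-f]+:\((\w+)\)")

# btrfs per-device counters, e.g.
# "bdev /dev/sdf1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0"
BTRFS_ERRS = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def summarise(lines):
    """Return (I/O error counts keyed by (device, op), latest btrfs counters per device)."""
    io_counts = Counter()
    btrfs_latest = {}
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            io_counts[(match.group(1), match.group(3))] += 1
            continue
        match = BTRFS_ERRS.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last counters win.
            values = map(int, match.groups()[1:])
            btrfs_latest[match.group(1)] = dict(zip(COUNTER_NAMES, values))
    return io_counts, btrfs_latest


if __name__ == "__main__":
    # A few lines copied from the log excerpt above (hypothetical selection).
    sample = [
        "Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 83838096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2",
        "Jul 2 15:36:41 Moog kernel: I/O error, dev sdf, sector 142106112 op 0x1:(WRITE) flags 0x800 phys_seg 6 prio class 2",
        "Jul 2 15:36:46 Moog kernel: BTRFS error (device sdg1: state EA): bdev /dev/sdf1 errs: wr 29, rd 98, flush 0, corrupt 0, gen 0",
    ]
    io_counts, btrfs_latest = summarise(sample)
    for (device, op), count in sorted(io_counts.items()):
        print(f"{device} {op}: {count}")
    for device, counters in btrfs_latest.items():
        print(device, counters)
```

A summary like this only shows which device the errors were reported against. It cannot tell a failing SSD apart from a fault in the controller or cabling the drive is attached to.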
JorgeB Posted July 2

Please post new diags, since those errors are not in the previous ones.
RallyGallery Posted July 4

Sorry it's taken a day or so. Latest diagnostics attached. This time when I started the array, Disk 1 was offline, whereas the initial problems started with Disk 3, and the second time I booted there were no errors at all. The shares all showed up correctly this time, but after 2 minutes they were no longer visible on the Shares page. The VM starts but the docker containers are not working.

I do not believe the array hard drives have any errors or problems. Perhaps one of the SSDs has an issue, but then again, when I booted the second time the dockers all worked before the cache drive started flooding the log with errors.

Any help is gratefully received, but I am wondering about reinstalling the USB stick from a backup taken a week ago and going from there. Plus, maybe converting both cache SSDs to zfs and starting my dockers from scratch.

moog-diagnostics-20240704-1450.zip
JorgeB Posted July 4 (Solution)

Jul 4 14:49:40 Moog kernel: mpt2sas_cm0: SAS host is non-operational !!!!

HBA troubles. Make sure it's well seated and sufficiently cooled. You can also try a different PCIe slot if available.
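
The line quoted above is easy to miss among thousands of btrfs messages. As a rough sketch, assuming classic syslog-formatted lines like the one in this post, the following Python pulls out every report of a non-operational SAS host along with its timestamp and controller name. The function name `find_hba_faults` and the pattern are assumptions written against this single example line, not a list of every message the driver can emit.

```python
import re

# Matches the controller name in lines such as
# "Jul 4 14:49:40 Moog kernel: mpt2sas_cm0: SAS host is non-operational !!!!"
HBA_FAULT = re.compile(r"kernel: (mpt\dsas_cm\d+): SAS host is non-operational")


def find_hba_faults(lines):
    """Return a (timestamp, controller) tuple for each non-operational SAS host report."""
    hits = []
    for line in lines:
        match = HBA_FAULT.search(line)
        if match:
            # Classic syslog lines begin "Mon D HH:MM:SS host ...";
            # the first three whitespace-separated fields form the timestamp.
            timestamp = " ".join(line.split()[:3])
            hits.append((timestamp, match.group(1)))
    return hits


if __name__ == "__main__":
    # The first line is copied from this thread; the second is unrelated noise.
    sample = [
        "Jul 4 14:49:40 Moog kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
        "Jul 2 15:06:52 Moog kernel: sd 10:0:4:0: [sdf] Attached SCSI disk",
    ]
    for timestamp, controller in find_hba_faults(sample):
        print(f"{timestamp}: {controller} reported a non-operational SAS host")
```

To run it against a real log, the lines could be read from the syslog file inside a diagnostics zip. The exact path of that file is not stated in this thread, so it is left to the reader.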
RallyGallery Posted July 4

JorgeB, many thanks for looking through the files. I have re-seated the HBA card and rebooted the server: same fault. I swapped the card into another slot: same fault. The HBA card detects all the drives, and you can go into the configuration utility at bootup and everything seems in order. It sees all the drives and gives correct information.

This started with me going to unRAID beta 7, and even though I rolled back to the older stable version, it seems odd that everything has coincided. My last roll of the dice is to take the USB stick and copy the backup files from unRAID 6.12 onto it and see what happens. That is exactly the system that was working before I went to the beta version. I am clutching at straws but worth a go? Otherwise, maybe the card has failed, in which case I will purchase a new card and see what that does.

Thanks again for your help, time and assistance.
JorgeB Posted July 5

12 hours ago, RallyGallery said: I am clutching at straws but worth a go?

It's worth a try.
RallyGallery Posted July 5

Hi JorgeB. An update: I may have sorted the problem based on your help and advice. I took out the USB stick, deleted all the files and copied over a backup of the stick I had made before the upgrade. I rebooted and the system came up fine, but the errors returned almost immediately. This has led me to believe that the HBA card has a fault, even though it boots up and sees the drives.

I have taken the HBA card out and used normal SATA cables to connect all the drives, then rebooted, and the array has started with no errors so far. It has been up for 30 minutes and the error logs are clear. I will leave it going overnight and see what happens.

The original HBA card came from eBay. It was second hand and had been flashed to IT mode, so it wasn't brand new. I will buy another one in time, once I am happy with the server. Thanks again.
itimpi Posted July 5

Something that is always worth checking is that an HBA is well seated. I have had some in the past that could end up not quite properly seated, which could cause intermittent issues (I think due to vibration causing momentary disconnects on pins).
RallyGallery Posted July 6

JorgeB - The server has been running with no errors for 24 hours now. The HBA card was the issue. Disk 4 was offline, but I have got it back online and rebuilt. There are zero errors on any of the drives, the SMART info on all drives is OK, and the cache pool is functioning with no issues. A clean install from backup also, I think, took away any doubts. Thanks for all your help and time. Really appreciated.

itimpi - thanks also for the advice. I will try the HBA card in another PC to test it. I will purchase another HBA card from eBay, as I can't trust this one in the server, though I can use it in a non-critical system. Thanks to you too for your help and input.