July 3, 2017
Well this is frustrating. Three different times my BTRFS cache pool wound up with some kind of corruption (each presumably due to a hardware issue like a misbehaving SATA controller, IDK), which required me to move all appdata, etc. off the cache, wipe the file system, and reformat the cache pool before moving everything back, rebuilding dockers from template, etc. So after the last time I decided I'd had enough, and when I did the cache replace procedure I went back to a single cache drive formatted as XFS instead.
Everything had been working fine for a week or so since then, until this morning I woke up to see that my BTSync transfer to the server was interrupted overnight and the cache drive was unavailable (the drive itself is still green balled and I can run a SMART test, so I know it's still physically connected, but I can't access anything on it and my dockers are all unavailable too). And what's all over the system log (attached)? BTRFS errors! How the heck do you get BTRFS errors on a server that has the cache as well as all data drives formatted as XFS??
There are some I/O errors preceding the BTRFS errors, as well as a series of power failures and UPS battery kick-ins that happened last night (which I suspect might have contributed to the problem too, since I was uploading via BTSync constantly while that was going on). However, note the particular point in the log where the BTRFS errors start:
Jul 3 07:52:27 JBOX kernel: XFS (sdb1): metadata I/O error: block 0x1d20d883 ("xlog_iodone") error 5 numblks 64
Jul 3 07:52:27 JBOX kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c. Return address = 0xffffffff812b4a91
Jul 3 07:52:27 JBOX kernel: XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Jul 3 07:52:27 JBOX kernel: XFS (sdb1): metadata I/O error: block 0x1d20d892 ("xlog_iodone") error 5 numblks 64
Jul 3 07:52:27 JBOX kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c. Return address = 0xffffffff812b4a91
Jul 3 07:52:27 JBOX shfs/user: err: shfs_write: write: (5) Input/output error
Jul 3 07:52:27 JBOX kernel: XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 5, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 8, flush 0, corrupt 0, gen 0
Jul 3 07:52:27 JBOX shfs/user: err: shfs_write: write: (5) Input/output error
What is this force shutdown and umounting of XFS? Did unRaid just spontaneously decide to change the file system of my cache back to BTRFS?
Anyway, please advise on what I should do from here. Reboot, hope the contents of the cache are still available and not hopelessly corrupted, then do the cache replace procedure so I can once again wipe the file system and copy back to a clean and re-formatted XFS cache drive, rebuild dockers again, etc.? Looking forward to your input.
syslog_7-3-17.zip
July 3, 2017 BTRFS is the file system inside the Docker image. I haven't experienced this problem, but I'd try removing the Docker image and reinstalling your Dockers from CA. Not sure what is causing the instability, though.
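To check which error came first, one can scan the syslog for the first error reported by each filesystem. The following is a minimal Python sketch, assuming the syslog line layout shown in the first post; the function name `first_errors` and the regular expression are illustrative and not part of unRAID. The sample lines are copied from the log excerpt above.

```python
import re

# Matches kernel filesystem messages such as
#   "... kernel: XFS (sdb1): metadata I/O error: ..."
#   "... kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: ..."
FS_ERROR = re.compile(
    r"kernel: (?P<fs>XFS|BTRFS)\b[^(]*\((?:device )?(?P<dev>\w+)\): (?P<msg>.*)"
)

def first_errors(lines):
    """Return {(fs, dev): (line_index, message)} for the first error per source."""
    seen = {}
    for idx, line in enumerate(lines):
        m = FS_ERROR.search(line)
        if m and "error" in line.lower():
            # setdefault keeps only the earliest hit for each (fs, dev) pair
            seen.setdefault((m.group("fs"), m.group("dev")), (idx, m.group("msg")))
    return seen

sample = [
    'Jul 3 07:52:27 JBOX kernel: XFS (sdb1): metadata I/O error: block 0x1d20d883 ("xlog_iodone") error 5 numblks 64',
    "Jul 3 07:52:27 JBOX kernel: XFS (sdb1): Log I/O Error Detected. Shutting down filesystem",
    "Jul 3 07:52:27 JBOX kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0",
]

result = first_errors(sample)
for (fs, dev), (idx, msg) in sorted(result.items(), key=lambda kv: kv[1][0]):
    print(f"line {idx}: {fs} on {dev}: {msg}")
```

On the excerpt from this thread, the XFS error on sdb1 sorts ahead of the BTRFS error on loop0. That ordering fits the explanation that the BTRFS errors come from the Docker image, presumably a loop-mounted file sitting on the XFS cache, which becomes unreadable once XFS shuts down.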
July 3, 2017 The cache drive is having problems. Could be bad cabling / whatnot, or could be the drive itself. Either way that is the root of your problems. (BTW, diagnostics are far better than merely a syslog)
July 3, 2017 Community Expert Btrfs errors are a symptom of an underlying hardware issue, probably a bad cable.
July 3, 2017 Author
Okay, thanks everyone. A bad cable was the theory last week when I still had the cache pool, but after I unassigned the particular cache SSD that was identified with all those errors in the log, changed cache slots to 1, and formatted and assigned the other cache SSD as the single XFS cache device, I thought that had solved the problem, especially when the logs after that were all clean until now. So now I guess I have to assume one of the following:
- the cable or connection to the cache device currently in use was the bad one all along, even though the previous errors were all identified with the other cache SSD
- the cable or connection to both SSDs was/is bad
- there's a problem with the SATA controller (LSI 9211-8i) that both SSDs are attached to, even though it hasn't been causing any problems for the data drives attached to the same controller
- something else??
Anyway, I run this server remotely so it'll probably be a couple of days until I can get out to pop the hood. When I do I'll free up a SATA port on the motherboard for the cache and see if connecting to that doesn't solve these I/O errors going forward. If you have any other suggestions in the meantime please let me know.
July 3, 2017 Author BTW attached is the diagnostics log in case it sheds any further light, and I'll remember to include this going forward. Thanks again. jbox-diagnostics-20170703-1217.zip
July 3, 2017 Community Expert
21 minutes ago, ElJimador said: especially when the logs after that were all clean until now.
Not really clean, it's full of these:
Jun 27 02:22:33 JBOX kernel: mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Jun 27 02:24:56 JBOX kernel: mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Jun 27 02:25:15 JBOX kernel: mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Jun 27 02:26:32 JBOX kernel: mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: attempting task abort! scmd(ffff88011ca65c80)
Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: [sdb] tag#1 CDB: opcode=0x2a 2a 00 00 64 d1 78 00 00 d0 00
Jun 27 02:32:22 JBOX kernel: scsi target1:0:0: handle(0x0009), sas_address(0x4433221102000000), phy(2)
Jun 27 02:32:22 JBOX kernel: scsi target1:0:0: enclosure_logical_id(0x500605b002c8a9e3), slot(2)
Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: task abort: SUCCESS scmd(ffff88011ca65c80)
Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: attempting task abort! scmd(ffff8802a6f8f200)
Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: [sdb] tag#3 CDB: opcode=0x2a 2a 00 0e 90 26 b8 00 00 18 00
Jun 27 02:32:22 JBOX kernel: scsi target1:0:0: handle(0x0009), sas_address(0x4433221102000000), phy(2)
Jun 27 02:32:22 JBOX kernel: scsi target1:0:0: enclosure_logical_id(0x500605b002c8a9e3), slot(2)
Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: task abort: SUCCESS scmd(ffff8802a6f8f200)
sdb is the cache device, try connecting it onboard.
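The `CDB` bytes on the aborted commands can be decoded to see what the controller was being asked to do. Here is a minimal Python sketch, assuming the standard 10-byte SCSI READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte transfer length, control); `decode_cdb10` is an illustrative helper name, and the two sample lines are copied from the log excerpt above.

```python
import re

# Only the two opcodes relevant here; anything else is reported as raw hex.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Captures the device name and the hex dump that follows "opcode=0x..".
CDB_LINE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ CDB: opcode=0x[0-9a-f]+ (?P<bytes>(?:[0-9a-f]{2} ?)+)"
)

def decode_cdb10(line):
    """Decode a 10-byte CDB from a kernel log line, or return None if absent."""
    m = CDB_LINE.search(line)
    if not m:
        return None
    raw = bytes.fromhex(m.group("bytes"))
    if len(raw) != 10:
        return None
    return {
        "dev": m.group("dev"),
        "op": OPCODES.get(raw[0], hex(raw[0])),
        "lba": int.from_bytes(raw[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(raw[7:9], "big"),  # transfer length in blocks
    }

aborted = [
    "Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: [sdb] tag#1 CDB: opcode=0x2a 2a 00 00 64 d1 78 00 00 d0 00",
    "Jun 27 02:32:22 JBOX kernel: sd 1:0:0:0: [sdb] tag#3 CDB: opcode=0x2a 2a 00 0e 90 26 b8 00 00 18 00",
]

decoded = [decode_cdb10(line) for line in aborted]
for d in decoded:
    print(d)
```

Both aborted commands decode as WRITE(10) to sdb, so these are writes to the cache device timing out at the controller. That is consistent with a link-level problem (cable, connection, or controller port) on that device's path, though it does not by itself rule out the drive.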
July 3, 2017 Community Expert Also are you sure the Samsung SSD was the problem? I recall seeing a user with one Crucial and one Samsung and IIRC the Crucial was the problem, not sure it was you. Edited July 3, 2017 by johnnie.black
July 3, 2017 Author
11 minutes ago, johnnie.black said: Also are you sure the Samsung SSD was the problem? I recall seeing a user with one Crucial and one Samsung and IIRC the Crucial was the problem, not sure it was you.
No, in my case the Samsung was cache 1, the Crucial was cache 2, and though I forget the sdX designation of each, the log errors were definitely all tracked to the device assigned as cache 1. (Though maybe with a cache pool all errors are going to read as cache 1 regardless?)
Anyway, I still don't think either SSD itself is actually the problem when neither is reporting any SMART errors. Is there anything else I should be doing though to rule out whether the drive itself might be bad? It seems like once the Crucial was re-formatted as XFS and the contents of the cache pool were moved back to it from the array, if the drive itself was the problem there would have been something to indicate that right away and not just a week later. Though I admit I'm hardly expert enough on any of this to know how and when the different possible sources of the problem would normally present themselves.
While we're brainstorming though, is there any chance there's some kind of corruption that was copied back when the cache pool contents were moved back from the array, and that it's only masquerading as a HW issue? Before I retired the cache pool I ran BTRFS scrubs until all errors were corrected, and when I then moved the cache data to the array everything copied over fine (unlike one time previously when I wound up having to nuke all my appdata and rebuild my Plex server from scratch). So nothing in that seems like it should be any kind of red flag. It would be nice though if I could at least rule it out as a source of the problem, if nothing else.
July 3, 2017 Community Expert
37 minutes ago, ElJimador said: Is there anything else I should be doing though to rule out whether the drive itself might be bad?
1 hour ago, johnnie.black said: try connecting it onboard.
July 3, 2017 Community Expert I checked your older thread: there are similar errors on both sdb and sdc, so on both cache devices. It's not likely that both are bad; IMO it's a cable/enclosure or controller problem.
July 3, 2017 Author
2 minutes ago, johnnie.black said: I check your older thread: There are similar errors on both sdb and sdc, so both cache devices, not likely both are bad, IMO cable/enclosure or a controller problem.
OK, thanks again Johnnie. At this point I'm thinking it might just be that the SAS leads from the LSI card were a little too tight to both SSDs due to the drives' location at the front of the case (Fractal Design Node 804), and that maybe the connections just came loose at some point due to vibration? All the SATA and SAS cables I use have locking latches to clip to the drive, so it doesn't seem like something I'd have to worry about; however, it's the best theory I've got at this point. Anyway, I'll report back after I get out to take a look at it and get the cache drive connected to the motherboard instead.
July 9, 2017 Author Hey. Just wanted to report back that I attached the cache SSD onboard and it's been working fine since. Nothing looked amiss in the connections, so I assume it was the SAS cable that was bad. Looking back, I wonder if that might also have been the source of the earlier problems that I attributed to the SAS2LP-MV8 controller, which I replaced with the LSI. It would make sense, since it was the same cable I'd been using for my cache drives all along. Makes me glad I hung on to that card, and also tempted to go back to the cache pool again now that I know where the instability was coming from. Anyway, thanks again for helping me get it sorted.