ssb201 Posted December 14, 2017 (edited)

A few days ago I had what appeared to be a system lockup, with the host unresponsive to ping. The system came back up and seemed to be working fine. The parity check due to the unclean shutdown seemed to be proceeding apace, but was not complete when I checked on it the next day. The web console has become somewhat unresponsive, and the same goes for all docker containers and shares. The parity check started at 124 MB/sec and is now running at 84 KB/sec. I reviewed each of the drives using smartctl but see no errors, and nothing apparent in dmesg or the system log. An SSH session hangs if I run ps, partway through printing the process list. I kicked off diagnostics collection from the command line, but it had not completed after 5 minutes. If it eventually completes I will update this post.

root@Tower:/dev# /root/mdcmd status | egrep "mdResync|mdState|sbSync"
sbSynced=1513116254
sbSyncErrs=0
sbSynced2=0
sbSyncExit=0
mdState=STARTED
mdResyncAction=check P
mdResyncSize=7814026532
mdResyncCorr=1
mdResync=7814026532
mdResyncPos=7595181832
mdResyncDt=115
mdResyncDb=4608
root@Tower:/dev# diagnostics
Starting diagnostics collection...
tower-diagnostics-20171213-1626.zip

Edited December 14, 2017 by ssb201: Update with diagnostics
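For reference, the mdcmd counters above are enough to work out both overall progress and the recent speed. A minimal sketch using the values from this post (the field meanings are my assumption: sizes and positions in 1 KiB units, with mdResyncDb blocks moved in the last mdResyncDt seconds):

```shell
# Values hard-coded from the mdcmd output above.
mdResyncSize=7814026532   # total size of the check, 1 KiB units
mdResyncPos=7595181832    # current position
mdResyncDt=115            # seconds in the last sample window
mdResyncDb=4608           # 1 KiB blocks moved in that window

# progress in tenths of a percent (integer math only)
pct=$(( mdResyncPos * 1000 / mdResyncSize ))
echo "progress: $(( pct / 10 )).$(( pct % 10 ))%"

# recent throughput
speed=$(( mdResyncDb / mdResyncDt ))
echo "recent speed: ${speed} KiB/s"
```

With these numbers that comes out to roughly 97.1% done at about 40 KiB/s, which matches the crawl described in the post.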
pwm Posted December 14, 2017

It was unexpected that ps hung. Process hangs normally happen when accessing something protected by a lock that hasn't been released, but ps is normally quite careful about what data it goes hunting for. What arguments did you give to the ps command?
ssb201 Posted December 14, 2017 (Author)

One time I did 'ps -ef'. Another time I did 'ps aux | more'. I was not terribly patient, so it is possible it would have completed if I had waited a few minutes. The diagnostics did finally complete, and they do a ps, so it is just really, really slow.
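A ps that stalls mid-listing (alongside a crawling parity check) often points at processes stuck in uninterruptible D state, waiting on disk I/O. A small sketch of a filter for spotting them; the d_state helper name is mine, not a stock tool:

```shell
# Reduce "ps -eo stat,pid,comm" output to processes whose STAT
# column starts with "D" (uninterruptible sleep, usually disk wait).
d_state() {
  awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# usage on a live system:
#   ps -eo stat,pid,comm | d_state
```

Anything that shows up here persistently is worth correlating with the slow drive or controller.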
Squid Posted December 14, 2017

Don't know why this is (assuming you're running AD), but every minute on the minute, the system is changing the permissions on your Downloads share.

It appears that you started the check at 2:04 pm on the 12th. It got through the first 4 TB by 12:39 am. At that point, CA Auto Turbo Mode kicked in and started to enable/disable turbo mode depending upon which of your drives were spun up. At 8:38 am on the 13th the parity drive dropped offline, but it recovered and came back right away with no problems.

The problem with this is that the permissions being changed every minute *may* cause a drive to spin up for unRaid to find the /mnt/user/Downloads share. I'm surmising this because auto turbo keeps enabling/disabling turbo mode, which means that a drive is spinning up/down. It's possible the spin-ups are what's giving you the problems, as very often with controllers, all transfers to/from the other attached drives will pause until the drive being spun up is up to speed (which takes a couple of seconds). Turbo mode being enabled/disabled shouldn't account for any significant slowdown itself.
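If the once-a-minute permission resets are coming from something on the server itself, one way to catch the culprit is to watch the share for attribute-change events. This is only a sketch: inotifywait comes from inotify-tools (not stock unRAID), and the path is an assumption:

```shell
# Build the watch command rather than run it here:
# -m = monitor continuously, -r = recurse into the share,
# -e attrib = report chmod/chown-style attribute changes.
watch_cmd() {
  printf 'inotifywait -m -r -e attrib %s\n' "$1"
}

watch_cmd /mnt/user/Downloads   # prints the command to run as root
```

While that runs, cross-referencing event timestamps against cron and the syslog should point at whichever job is doing the chmod.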
Squid Posted December 14, 2017

Also, while not directly related to your problem, you do have this:

184 End-to-End_Error 0x0032 097 097 099 Old_age Always FAILING_NOW 3

on ST4000VN000-1H4168_Z300T4WM-20171213-1626. This is a failure of the electronics (cache) on the drive itself.

If you don't already have it, enable notifications in Settings. Drives with attributes in a failing state should never be ignored.
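To avoid missing an attribute like this again, the WHEN_FAILED column of `smartctl -A` output can be scanned mechanically. A sketch, assuming the standard ten-column attribute table (column 9 = WHEN_FAILED, column 10 = raw value):

```shell
# Print any SMART attribute currently reported as FAILING_NOW.
flag_failing() {
  awk '$9 == "FAILING_NOW" { print "FAILING: " $2 " raw=" $10 }'
}

# usage on a live system (the device glob is an assumption):
#   for dev in /dev/sd?; do smartctl -A "$dev" | flag_failing; done
```

Fed the attribute line quoted above, this would print the End-to-End_Error failure.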
ssb201 Posted December 14, 2017 (Author)

39 minutes ago, Squid said:
Don't know why this is (assuming you're running AD), but every minute on the minute, the system is changing the permissions on your downloads share. …

Yes, I am running AD. I am not sure why the permissions are changing. I have not explicitly set anything to do that.
ssb201 Posted December 14, 2017 (Author)

21 hours ago, Squid said:
Also, while not directly related to your problem, you do have this: 184 End-to-End_Error … FAILING_NOW …

Yes, I noticed the SMART issue and will have to address it, but it does not explain the slow speed of the parity check. It took me 10 minutes, but I was finally able to disable Turbo Mode. That did not seem to improve anything. The system finally went from 97.2% to 97.3%! I cannot wait 30-50 more days for it to finish. I am probably going to try a reboot tonight unless someone can offer another thing to try. I am scratching my head as to why it is running so slowly, without any other indicator in the logs to signify a problem.
JorgeB Posted December 14, 2017

You should cancel and reboot in safe mode, then start another check. You have syncthing and crashplan running, and crashplan is using a lot of CPU.
ssb201 Posted December 14, 2017 (Author)

4 minutes ago, johnnie.black said:
You should cancel and reboot in safe mode …

I will give that a shot.
ssb201 Posted December 15, 2017 (Author)

I had difficulty shutting the system down. I finally managed to do so after force-killing Docker ('killall docker'). After rebooting and starting in safe mode, bringing up the array and starting a parity sync was still incredibly slow. <diagnostics attached>. I then tried shutting down, removing the drive with the SMART error, and starting up again (safe mode). I initiated a parity rebuild with a new 8TB drive that was previously precleared. The console is estimating 8130 days to complete. I still see no relevant hardware errors registering in dmesg or the system log.

A couple of things I did note:
- hdparm -I /dev/sdg returns a Sense I/O error.
- The permissions on each of the disks under /mnt are 1069023732:1069023745. The same goes for some, but not all, shares under /mnt/user. I never noticed this before and do not know if this is proper.
- System operations like going from one web page to another in the web portal take 5-10 minutes each. Most operations on the command line seem to run just as snappily as ever.

tower-diagnostics-20171214-1949.zip
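On the odd numeric owners: IDs like 1069023732 look like Samba/winbind AD ID mapping rather than local unRAID users, so they may be expected rather than a symptom. A sketch of a range sanity check; the idmap range below is purely an assumption for illustration, since the real range lives in smb.conf:

```shell
# Hypothetical AD idmap range (check smb.conf for the real values).
IDMAP_LO=1000000000
IDMAP_HI=2000000000

# Return success if a numeric uid falls inside the idmap range.
in_idmap_range() {
  [ "$1" -ge "$IDMAP_LO" ] && [ "$1" -le "$IDMAP_HI" ]
}

if in_idmap_range 1069023732; then
  echo "uid 1069023732 maps to a domain account, not a local user"
fi
```

If the IDs resolve to domain accounts (e.g. via `getent passwd 1069023732` on a winbind-joined box), the ownership is probably benign.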
ssb201 Posted December 15, 2017 (Author)

Looks like I am well and truly screwed. The parity drive is now showing errors on the array page. It looks like it was the problem, despite the absence of any SMART errors. Not sure how I can dig myself out of this mess now. Do I let the rebuild continue and hope that there are only a limited number of errors from a failing parity drive? It is still saying 2000 days, so that will take forever.
JorgeB Posted December 15, 2017

Grab and post new diags after the errors.
ssb201 Posted December 15, 2017 (Author)

39 minutes ago, johnnie.black said:
Grab and post new diags after the errors.

Attached new diagnostics, though I gave up on the prior status quo. The parity rebuild was going nowhere, with just tons of errors on the parity drive. I ended up rebooting and leaving the array in a downed state. In the meantime, I reformatted the new drive (the one that had been precleared and added to the array to replace the SMART-error drive); since the parity rebuild had barely begun, the file system on it was entirely corrupt. I kicked off a btrfs restore from the drive I had replaced to the new drive, to preserve a copy of as much of my data as possible. Since the parity drive seemed to actually be the problem, I am hoping that despite some btrfs corruption (caused by the less-than-graceful initial shutdown?), the vast majority of my data will be recovered. Between the three good data drives and the new one I restored to, I should have all or almost all of my data, without any dependency on the f--g parity drive.

tower-diagnostics-20171215-0012.zip
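For anyone following along, `btrfs restore` is a good fit here because it copies files off an unmountable btrfs filesystem read-only, without attempting repair. A sketch of the invocation; the device and destination names are mine, not the poster's actual paths:

```shell
# Build the restore command rather than run it here:
# -v = verbose, -i = ignore errors and keep going past corrupted
# extents. A -D dry run first is a sensible extra step.
restore_cmd() {
  printf 'btrfs restore -v -i %s %s\n' "$1" "$2"
}

restore_cmd /dev/sdX1 /mnt/disks/new8tb   # run the printed command as root
```

The destination must be a separate, already-mounted filesystem with enough free space for everything recovered.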
JorgeB Posted December 15, 2017

Yeah, the parity drive looks bad. No pending sectors, but these are bad signs:

1 Raw_Read_Error_Rate 0x000b 053 053 --- Pre-fail Always - 922755730
22 Helium_Level 0x0023 076 076 --- Pre-fail Always - 76

The raw read error rate should be zero on a healthy drive, and it's leaking helium; I've never seen a helium drive where this value wasn't 100.
JorgeB Posted December 15, 2017

You can also see the difference from the previous diags:

1 Raw_Read_Error_Rate 0x000b 080 080 016 Pre-fail Always - 715 -> early signs of trouble
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 100
ssb201 Posted December 16, 2017 (Author)

14 hours ago, johnnie.black said:
You can also see the difference from the previous diags: …

Yeah. The drive went south quickly, even without a SMART notification warning. So I managed to complete the btrfs restore. Ten or twenty files had warnings about loops and may not have restored properly; the rest hopefully are good. I will need to double-check their contents once I have things up and running. Thankfully most are just Blu-ray rips that I can redo. Now that I have four drives with all my data, any pointers on recreating the array? Currently the new drive shows up as Disk 3 (drive contents emulated). I am guessing I will need to create a brand new array and slowly copy my data over drive by drive, before importing the next drive.
ssb201 Posted December 16, 2017 (Author, edited)

I just discovered New Config. It looks like I should be able to just do New Config -> Retain Data and Cache slots, then apply. Will that work when the array thinks that disk 3 is emulated, even if disk 3 does have all the data it should? Do I have to remove that drive and then add it in separately? I assume once the array is good I would then be able to add a new parity drive and build parity.

Edited December 16, 2017 by ssb201
JorgeB Posted December 16, 2017

6 hours ago, ssb201 said:
Now that I have four drives with all my data, any pointers on recreating the array? …

New config will lose any emulated disk, but didn't you un-assign parity? How do you have an emulated disk? You can also post current diags so I can see.
ssb201 Posted December 16, 2017 (Author)

12 hours ago, johnnie.black said:
New config will lose any emulated disk, but didn't you un-assign parity? …

Attached. (Please note I have removed the failed parity drive and the other drive that was reporting a SMART end-to-end error, the one I did a btrfs restore from.) Steps to get here:

1) Parity check slowed to a crawl at 97.3%; VM running fine, Dockers unresponsive.
2) Web UI unresponsive; from SSH, an attempt to reboot hangs.
3) Power cycle into safe mode. System seems fine, web UI responsive, and no errors showing in the log. Array down.
4) Decide to replace the 4TB drive showing the end-to-end SMART error. Pull the drive, put the new 8TB drive in. Start the array and select rebuild from parity.
5) Rebuild slows to a crawl; read errors start to appear on the parity drive.
6) Cancel rebuild, reboot, confirm the parity drive is bad. (Freak a bit.)
7) Put the 4TB drive back in; unable to mount it due to FS corruption.
8) Clean btrfs format of the 8TB replacement disk (since the parity rebuild had started onto it, a superblock and some filesystem metadata were there, but no data).
9) btrfs restore of almost all files from the 4TB to the 8TB.
10) Remove the 4TB drive.

As it stands now, I have 3 original data drives and a new copy (not clone) of the 4th data drive. No parity drive, but an array configuration expecting a parity rebuild of drive 3 (hence emulation) and a missing parity drive.

tower-diagnostics-20171216-1221.zip
JorgeB Posted December 16, 2017

OK, I think I understand now: disk3 is invalid (yellow), not emulated (red). The array is in an unusual state, but as long as the data on disk3 looks correct when you start the array as-is, without parity, you can do a new config, retain all, and start the array. You can later (or now) add a new parity drive, but keep the old disk3 intact in case anything goes wrong.
ssb201 Posted December 16, 2017 (Author)

2 minutes ago, johnnie.black said:
OK, I think I understand now, disk3 is invalid (yellow) not emulated (red) …

Correct. Apologies for any confusion; the tooltip for the yellow says emulated:
JorgeB Posted December 16, 2017 (edited)

You're right; I meant the disk status is invalid, not disabled. But without parity the disk can't be emulated, so it's kind of a bug. Like I said, the array is in an unusual state.

Edited December 16, 2017 by johnnie.black
ssb201 Posted December 16, 2017 (Author)

1 hour ago, johnnie.black said:
You're right, I meant the disk status is invalid, not disabled …

The array was definitely in an unusual state. New Config worked well. The Plex docker was missing and I had to rejoin AD, but everything else looked fine. A few movies are corrupt and will need to be re-ripped, but nothing major. I have a new parity drive building now. So glad I pulled the trigger on an extra drive for Black Friday. Unfortunately, the drive that failed, while under warranty, is flagged as needing to be returned to the system vendor rather than Hitachi, and I have no idea who that is. Thanks everyone for the assistance.
loungelizard Posted May 29, 2021

I just want to add my experience, as I have been struggling for a week to get my unraid system up and running. My setup is a mini tower with an 8-bay backplane; everything I purchased was new. My 16TB Seagate IronWolf was acting as parity, the other disks as data storage.

Everything seemed to be working fine, and I could not see any particular entries in the log files. However, my system was behaving erratically. As soon as I tried to build parity, the throughput decreased to 30/40 kbps, which would have taken 2-3 years to complete. Other trouble also occurred, like drives not showing up in the BIOS, not being able to connect anymore, formatting issues, dockers unable to download, etc.

I was scratching my head and started to take components out of the system one by one. Since the disk was new and causing issues, I was focusing too much on it, as well as on SATA addressing topics. It took me a week to identify the root cause, which also left me puzzled. After I disconnected my backplane and hooked up the disks manually with SATA cables, everything seemed to be working fine.

The backplane is connected with 4x SATA-to-SFF-8087 cables; power comes from a SATA cable going to a Molex splitter feeding the two backplane power Molex connectors. Changing the SFF-8087 cable didn't solve the issue either, so I reached for the Molex power feed and noticed that the two red cables on the Y-splitter had come off. After fixing this, the system was up and running as it should be. So the 5V supply to the backplane was dead, but since the disks still spun up I never suspected this to be the trouble. I'm now a happy unraid user and wanted to share this issue, in case somebody else experiences something like it.