thegreatgumbino Posted January 4, 2017

I'm running unRAID 5.0.5 and started up Plex today and got an error that the server wasn't available. Server was working fine last night. Pulled up the web interface and see the server is stopped and says Disk 1 is missing. Here's my build thread with configuration.

Read through the Analysis of Drive Issues Wiki page and searched the Syslog (attached) for the provided keywords. I found two from the list:
1) qc timeout
2) revalidation failed

Here's an excerpt from the Syslog with the keywords:

Jan 4 18:19:20 Tower kernel: sas: ata7: end_device-2:0: dev error handler
Jan 4 18:19:20 Tower kernel: ata7.00: ATA-8: ST31500541AS, 5XW0270P, CC32, max UDMA/133
Jan 4 18:19:20 Tower kernel: ata7.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 31/32)
Jan 4 18:19:20 Tower kernel: ata7.00: qc timeout (cmd 0x27)
Jan 4 18:19:20 Tower kernel: ata7.00: failed to read native max address (err_mask=0x4)
Jan 4 18:19:20 Tower kernel: ata7.00: HPA support seems broken, skipping HPA handling
Jan 4 18:19:20 Tower kernel: ata7.00: revalidation failed (errno=-5)
Jan 4 18:19:20 Tower kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Jan 4 18:19:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1527:mvs_I_T_nexus_reset for device[0]:rc= 0
Jan 4 18:19:20 Tower kernel: ata7.00: qc timeout (cmd 0xef)
Jan 4 18:19:20 Tower kernel: ata7.00: failed to set xfermode (err_mask=0x4)
Jan 4 18:19:20 Tower kernel: ata7.00: limiting speed to UDMA/133:PIO3
Jan 4 18:19:20 Tower kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Jan 4 18:19:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1527:mvs_I_T_nexus_reset for device[0]:rc= 0
Jan 4 18:19:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1957:Release slot [0] tag[0], task [f75c3900]:
Jan 4 18:19:20 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Jan 4 18:19:20 Tower kernel: ata7.00: failed to set xfermode (err_mask=0x11)
Jan 4 18:19:20 Tower kernel: ata7.00: disabled

It points to Disk #1 as noted on the web interface.

Per the Analysis of Drive Issues:

Drive interface issue #4
This is an example of what is probably a loose backplane or cable connection issue (could be either the SATA connection or the power connection or both):
[SNIP]
Note: There are no CRC errors here, which normally implicate a bad cable or two. These problems are often related to a backplane, perhaps loose, perhaps vibration-related, perhaps defective. If the SATA link remains up for a while, but communications are clearly bad, then the emphasis should probably be on the power connection. The easiest way to test whether it is the fault of the backplane is to reinstall the drive outside of the backplane. If there is no backplane involved, then the same considerations apply to the cable connections, each end of both the SATA and power cables, including any power cable splitters that may be involved. It is common after opening a computer case, to jostle the cables, and SATA cables are notorious for coming loose, if they aren't the locking type. It is a good habit to check all SATA connections just before closing a case up. Good quality SATA and power cables and splitters are strongly recommended. Always make certain that they are firmly connected, and not subject to vibration. The same is even more important for backplanes, make sure that drives are firmly and well seated in their trays, and cannot be vibrated loose.

I haven't been in the case recently, but I'm assuming it's a bad connection or cable.
1) Can I open the box and check connections while the server is still running or do I need to shut it off?
2) If it is the backplane, wouldn't the other three disks on that row be showing missing? Can I shut down the server and move the "missing" disk to an empty drive slot and see if it works?

Any other insight you folks can offer? Thanks

Attachment: syslog-2017-01-04.txt.zip
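For anyone repeating the keyword search on their own syslog, here is a minimal sketch of it as a shell function. The function name scan_syslog is made up for illustration, and the pattern holds only the two keywords found in this thread, not the wiki's full list.

```shell
# Hypothetical helper: print (with line numbers) the syslog lines that match
# drive-interface keywords. Only the two keywords found in this thread are
# listed; extend the pattern with others from the wiki as needed.
scan_syslog() (
    file="$1"
    grep -n -E 'qc timeout|revalidation failed' "$file"
)

# Example (the path is an assumption; point it at wherever the syslog was saved):
#   scan_syslog /var/log/syslog
```

The ataN.NN prefix on each matching line (ata7.00 here) is what ties the errors back to one particular drive.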
trurl Posted January 4, 2017

Post the SMART report.
thegreatgumbino Posted January 5, 2017 (Author)

I'm having difficulty figuring out the disk identifier for the "missing" disk. I think I screwed up: I started the array in maintenance mode, which I believe disabled the missing disk. I tried running ls -l /dev/disk/by-id over telnet to find the missing disk's identifier, but the disk doesn't show up. Is there a way to obtain the disk identifier now? Prior to starting in maintenance mode, the Main > Array Devices page in the web interface didn't list the disk's identifier.
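A rough sketch of how the by-id listing maps names to kernel devices, for reference. It assumes the usual Linux convention that entries in /dev/disk/by-id are symlinks named after the drive model and serial (for example ata-ST31500541AS_5XW0270P for the drive in the syslog above); the function name list_disk_ids is hypothetical. If the drive has dropped off the bus entirely, no entry will exist for it and nothing here can recover the identifier.

```shell
# Hypothetical helper: list each by-id name next to the kernel device it
# points at. The directory argument defaults to /dev/disk/by-id.
list_disk_ids() (
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (also covers an empty directory)
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        printf '%s -> %s\n' "$(basename "$link")" "$(basename "$target")"
    done
)

# Example: show whole disks only, hiding partition entries
#   list_disk_ids | grep -v -- '-part'
```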
trurl Posted January 5, 2017

It looks like it disappeared from your syslog after that snippet you posted. Shut down and check connections as you planned to do.
thegreatgumbino Posted January 5, 2017 (Author)

Shut down, checked connections, powered on. Main > Array Devices shows Disk 1 "not installed / unassigned". I am able to see it as a selection in the pull-down menu. It also gives me the disk identifier as sdd.

Ran smartctl -a -d ata /dev/sdd. SMART report attached. I reviewed the important attributes to check outlined in Obtaining a SMART Report and didn't see anything that is a red flag for the disk, nor did I see any error codes that signal bad cabling.

Assuming the drive is OK, the next step is to reconstruct the drive, correct?

You can re-enable the hard drive and reconstruct it as follows:
1) Stop the array.
2) Go to the Main page (Devices in version 4.7) and unassign the disk.
3) Go to the Main page (Array Operations section) and start the array.
4) Stop the array again.
5) Go to the Main page (Devices in version 4.7) and re-assign the disk.
6) Go to the Main page (Array Operations section) - the system should indicate there is a "new" drive to replace the disabled one. Check the confirmation box and click the Start button to start a reconstruct/rebuild of the disk.

Running smartctl -t long /dev/sdd for a detailed report and will report back when complete. Thanks for your help!
thegreatgumbino Posted January 5, 2017 (Author)

Tried running the long SMART report when I went to bed and it didn't complete. Trying to re-run it this morning.
thegreatgumbino Posted January 5, 2017 (Author)

I'm not having success getting an extended SMART report. When run, it says it will take 368 minutes but never gives me an update after that.

root@Tower:~# smartctl -t long /dev/sdd
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 368 minutes for test to complete.
Test will complete after Thu Jan 5 16:01:27 2017

Use smartctl -X to abort test.
root@Tower:~#

Does it matter if the array is started, stopped, or started in maintenance mode when performing the SMART test?
RobJ Posted January 6, 2017

The response looks correct. It indicates the testing has begun, and to wait at least 368 minutes before requesting a SMART report with the results.
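The start-then-wait-then-report sequence can be sketched roughly as below. The helper name selftest_wait_minutes is made up for illustration, and it assumes the "Please wait N minutes" wording shown in the smartctl 6.2 output in this thread; other versions may word it differently.

```shell
# Hypothetical helper: read the output of "smartctl -t long" on stdin and
# print the estimated test duration in minutes.
selftest_wait_minutes() (
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes.*/\1/p'
)

# Example (device name taken from this thread; needs root and a real drive):
#   mins=$(smartctl -t long /dev/sdd | selftest_wait_minutes)
#   sleep $(( mins * 60 ))
#   smartctl -l selftest /dev/sdd   # self-test log, including the result
```

The test runs inside the drive itself, so smartctl returns to the prompt immediately; the result only appears when a report is requested afterwards.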
thegreatgumbino Posted January 6, 2017 (Author)

Duh, I knew that and spaced. Sorry, I'm running on very little sleep this week. Full SMART report attached. Shows "Passed". Is the next step to re-enable the drive and reconstruct?

Attachment: smart.txt
trurl Posted January 6, 2017

This is not an attribute we often monitor, but it does say failing:

184 End-to-End_Error 0x0032 084 084 099 Old_age Always FAILING_NOW 16

thegreatgumbino said: "Is the next step to re-enable the drive and reconstruct?"

You have this exactly backwards. Rebuilding will re-enable the drive. I think I would use a new disk, and preclear it to test, of course. Maybe someone else will have a different opinion on whether to re-use that disk or not.
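A row like the one for attribute 184 is easy to miss in a long report, so here is a small sketch that flags such rows automatically. It assumes the usual smartctl attribute-table column order (ID, name, flag, VALUE, WORST, THRESH, type, updated, WHEN_FAILED, raw value) and treats a row as suspect when the normalized value is at or below its threshold, or when WHEN_FAILED is anything other than "-". The function name flag_smart_attrs is hypothetical.

```shell
# Hypothetical helper: read a smartctl attribute table on stdin and print
# the rows that look failed. Assumed column order:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
flag_smart_attrs() (
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0 <= $6 + 0 || $9 != "-") {
        printf "%s %s value=%d thresh=%d when_failed=%s\n", $1, $2, $4, $6, $9
    }'
)

# Example (device name taken from this thread):
#   smartctl -A /dev/sdd | flag_smart_attrs
```

An overall "Passed" result and a flagged attribute can appear in the same report, which is why the attribute table is worth reading on its own.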
RobJ Posted January 6, 2017

I share trurl's concern about that drive. You've got 5.73 years of 24/7 operation on it, and it's showing its age, with a number of troubling SMART attributes. None are clearly critical right now, but they're worrisome. Attribute 184 is seriously lower than the maker ever thought it should go, and lower than I've ever seen. I wouldn't trust that drive without some serious Preclear testing, 2 or 3 passes at least.
thegreatgumbino Posted January 6, 2017 (Author)

Thanks for the insight. I guess this is a good time to shrink my array. I've been planning on doing it anyway. I have a total of 13.5TB and only about 7TB used (down from 10TB). I'd like to reduce the number of drives and then upgrade to unRAID OS 6 with dual parity. Guess it's time to do some reading:

Shrink Array
Clear an Array Drive Script

From my limited reading, I *believe* shrinking is possible instead of installing a new drive to replace the failing one.
trurl Posted January 6, 2017

Your first post in the thread said the array was stopped with the disk missing. Can you start the array and read from the emulated drive? If so, you should be able to copy its contents to other drives and then proceed with shrinking.
thegreatgumbino Posted January 7, 2017 (Author)

Correct. After shutting down and checking connections, Main > Array Devices shows Disk 1 "not installed / unassigned". The disk does show as a selection in the pull-down menu. If I add it to the array, it shows blue / "New Disk, Not In Array".

I'm uncertain of how to proceed from here. If the drive is still accessible, it should have the original data on it. If I can get the array to accept it as is (I haven't written anything to the disk in weeks), I can copy the files off the drive. Is there a way to accomplish this or am I stuck having to install a new HD and rebuild the disk from the parity?
trurl Posted January 7, 2017

No, I'm not saying you should do anything with the disk at the moment. But it shouldn't say New Disk either if it is the same disk you had in that slot before.

What I was saying is that if you have parity, and you only have one missing disk, then unRAID should be able to emulate the disk's data, even if it is completely removed from the server. And you should be able to copy that data from the emulated disk without even worrying about, or even actually still possessing, the physical disk.

Post a screenshot.
thegreatgumbino Posted January 7, 2017 (Author)

Interesting. This is the first time I've run into a disk issue. I didn't realize the parity would emulate a single missing drive. I haven't started the server except for maintenance mode at this point. Screenshots attached.

To clarify, if I remove the troubled Disk 1 and bring the array on-line, the parity should emulate the missing drive and allow me to copy the missing data to the other drives? Does the parity emulate a folder for "Disk 1"? I ask because my server is set up with high-water allocation. Sorry for the questions, just trying to wrap my mind around it.
trurl Posted January 7, 2017

When a disk is missing, or if it is just disabled (redball), unRAID can calculate the disk's data from parity plus all the other disks. This is how it is able to rebuild a disk. If you try to read the disk, unRAID calculates its data in just the same way. And if you try to write to the disk, unRAID will actually update parity as if the disk had been written, so it can later read that written data by calculating it in just the same way.

I'm not entirely clear about the colored indicators next to the disk in your screenshots. I don't think V5 was always consistent in the way these were used, regardless of what the descriptions for these colors are. A better idea of what unRAID would do if you started the array can be seen in the Array Operation tab, which we don't have a screenshot for.
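The calculation described here can be illustrated with a toy example, assuming single parity computed as a bitwise XOR across the data disks. The byte values and the xor_all helper are made up for illustration; a real array applies the same idea to every bit position on every disk.

```shell
# XOR any number of integer byte values together and print the result.
xor_all() (
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
)

# Three made-up "data disk" bytes at the same offset
d1=90 d2=60 d3=240

# Parity is the XOR of all data disks
parity=$(xor_all "$d1" "$d2" "$d3")

# With disk 1 missing, its byte is recovered from parity plus the others
emulated_d1=$(xor_all "$parity" "$d2" "$d3")
echo "parity=$parity emulated_d1=$emulated_d1"   # prints: parity=150 emulated_d1=90
```

The recovery only works while every other disk is readable, which is why an array with one missing disk has no protection left against a second failure.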
thegreatgumbino Posted January 7, 2017 (Author)

Here's a screenshot of the Array Operation tab with no Disk #1 selected (the red square screenshot previously posted).

Here's a screenshot of the Array Operation tab with Disk #1 selected (the blue circle screenshot previously posted).

So if I:
1) Bring the array on-line with the questionable Disk #1 removed from the server (or left unselected, as in the red square screenshot previously posted)
2) Copy the files from the emulated Disk #1 to existing hard drives
3) Is there another step after this to shrink the array and remove bad Disk #1 for good (effectively reducing the size of the array by 1.5TB)?
trurl Posted January 7, 2017

1 and 2 correct.

For 3, since you are already unprotected, there is no point in trying to maintain parity. The simple thing would be to go to Utils - New Config and reassign the drives you want to keep to the slots you want them in; when you Start, it will rebuild parity. It is very important that you don't accidentally assign a data disk to the parity slot, or it will be overwritten.

Here it is in the wiki: Shrink Array
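Step 2 (copying off the emulated disk) might be sketched as follows. This assumes unRAID's usual per-disk mount points (/mnt/disk1, /mnt/disk2, and so on), with the emulated disk still appearing at /mnt/disk1 once the array is started; the share name and the helper copy_and_verify are hypothetical.

```shell
# Hypothetical helper: copy a directory tree, then compare source and
# destination file by file. Returns non-zero if the copy or the compare fails.
copy_and_verify() (
    src="$1"
    dst="$2"
    mkdir -p "$dst" &&
    cp -Rp "$src"/. "$dst"/ &&
    diff -r "$src" "$dst" >/dev/null
)

# Example (paths are assumptions; SomeShare is a placeholder folder name):
#   copy_and_verify /mnt/disk1/SomeShare /mnt/disk2/SomeShare
```

The compare step expects the destination folder to be new or empty; if it already holds other files, diff will report them as differences even though the copy itself succeeded. Copying disk to disk rather than through the user share keeps the files off the disk being emptied.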
thegreatgumbino Posted January 10, 2017 (Author)

Thanks for your patience and assistance, trurl. Got everything transferred off the emulated drive, shrank the array, and rebuilt the parity. All is working again. Now to read up on upgrading to unRAID OS 6!
trurl Posted January 10, 2017

There just happens to be an upgrade guide linked in my sig.