JustinAiken Posted August 9, 2010

A Samsung 1.5 TB drive died when I was copying files to it... or at least, I think it did. The light next to it in the web server is red, its temperature shows up as 0 degrees, and under errors it says 1,538. Thankfully it only has 150GB of stuff on it; it died early on while I was copying files to it...

Is this what I do?
* Request an RMA from Samsung
* Remove the device from the array
* Turn off the unRaid
* Swap out the drive in the same slot
* Turn on the unRaid
* Assign the drive in the array
* Let parity rebuild it?

Or is there a way to fix the drive? I'm surprised to see problems with this one, because the Samsung 1.5TBs have been running cooler than my WD Greens...
Rajahal Posted August 9, 2010

The red ball means that a write to that disk has failed and the server has disabled it. The first thing you should do is power down, reseat the drive, then power back up. It could be something as simple as a loose cable. If that doesn't fix it then try running SMART on the drive. Post the results here.

If SMART indicates that the drive has in fact failed, then the procedure you wrote out is correct. Technically you can skip the second step (remove the device from the array), but it won't hurt anything to do that.

You will probably want to replace the drive as soon as possible, not wait weeks for the Samsung RMA to be processed. As of now you haven't lost any data, but if another drive were to fail right now you likely would. If you are extra paranoid you may want to keep the server powered off until you have your replacement drive ready to go.

If you can afford it, I would recommend buying a new 1.5 or 2 TB drive today and replacing this failed drive as soon as possible. Then you can take your time with the 1.5 TB Samsung running SMART tests, etc., to see if it has actually failed. If it has, RMA it. If it hasn't, add it back to the array as a new disk or keep it as a spare.
JustinAiken Posted August 9, 2010

Actually, looking, I only have 100 GB of data on there, 75 of which matters... Would it be better to copy that 75GB off the disc, remove the disc from the array, and rebuild the parity without it? That way I wouldn't be at risk waiting even 2 days for a new drive to come from newegg...
JustinAiken Posted August 9, 2010

Okay, turned off the machine, reseated the drive, turned it back on... the disk5 light was still red, with 0 reads, writes, or errors...

Here's the SMART report: http://pastebin.com/3dBySjgY
Joe L. Posted August 9, 2010

Quoting JustinAiken: "Actually, looking, I only have 100 GB of data on there, 75 of which matters... Would it be better to copy that 75GB off the disc, remove the disc from the array, and rebuild the parity without it? That way I wouldn't be at risk waiting even 2 days for a new drive to come from newegg..."

That would work. After removing the disk from the array you'll need to initialize a new disk configuration.

On an older version of unRAID, when the array is stopped you'll see a button labeled "restore". You'll need to check the box under it and press it. In the last couple of releases the button was replaced with an exactly equivalent command, which you type after logging in as "root" via telnet or at the system console.

Either way, you'll first:
* Stop the array
* Go to the devices page and un-assign the drive you'll be removing

Then either:
* Log in and type: initconfig
  Answer Yes if it prompts you to confirm your desire to initialize a new configuration.
or
* Check the box under the button labeled "restore" and then press the "restore" button to initialize a new disk configuration.

Then:
* Go back to the main web-management page (or refresh it); all the remaining assigned drives will appear as "blue"
* Press "Start" to build a new set of parity data based on the new disk configuration.
Joe L. Posted August 9, 2010

Quoting JustinAiken: "Okay, turned off the machine, reseated the drive, turned it back on... the disk5 light was still red, with 0 reads, writes, or errors... Here's the SMART report: http://pastebin.com/3dBySjgY"

It is responding, therefore it might be/have been a loose cable, power splitter, drive cage, etc.

It was disabled because a "write" to it failed. You must rebuild it from parity and the remaining drives if you want the data that was not written (and also if you want the disk's file system to be sane, as the block it could not write might be part of the file system or part of a file).

So to get the disk to be recognized as its own replacement you must have the array forget the model/serial number of the existing drive. To do that:
* Stop the array
* Un-assign the drive that is marked as "red"
* Start the array by pressing "Start" (this causes the array to forget the model/serial number of the un-assigned drive)
* Stop the array once more
* Re-assign the drive that was marked as "red"
* Start the array once more. It will reconstruct the contents of the failed drive, including what it could not write originally.

Once the reconstruction is complete you'll once again be protected against another drive failure.

Joe L
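To see why the rebuild described above also recovers the block that failed to write, here is a minimal Python sketch of single-parity (XOR) reconstruction. It is an illustration under stated assumptions, not unRAID's actual code: the tiny byte-string "disks", the function names `xor_blocks` and `reconstruct`, and the premise that parity was updated even though the write to the data disk failed are all hypothetical stand-ins for the behavior Joe L. describes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_disks, parity):
    """Rebuild the missing disk: XOR of parity with every surviving disk."""
    return xor_blocks(surviving_disks + [parity])


# Hypothetical array with three data disks, four bytes per disk.
disks = [b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d", b"\x00\x00\x00\x00"]
parity = xor_blocks(disks)

# A write of new data to disk index 2 fails on the drive itself, but
# (by assumption) parity is still updated to reflect the intended data.
intended = b"\xde\xad\xbe\xef"
parity = xor_blocks([disks[0], disks[1], intended])

# The physical disk still holds stale zeros; rebuilding from parity and
# the other disks yields the data that could not be written.
rebuilt = reconstruct([disks[0], disks[1]], parity)
print(rebuilt == intended)
```

The same XOR is applied to every byte position, which is also why a rebuild has to walk the whole drive rather than only the used portion.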
JustinAiken Posted August 9, 2010

Ok, thanks Joe, did that... it's rebuilding now, got past the point where it errored before, no errors now...

Now I just have to wait for it to finish its 7-hour rebuild (even though the drive was only 10 percent full)

I have the Norco case, so the SATA connection was over a SAS -> SAS cable, and the power was going to the whole backplane... any idea on what I should look out for?
Rajahal Posted August 9, 2010

Quoting JustinAiken: "Now I just have to wait for it to finish its 7-hour rebuild (even though the drive was only 10 percent full)"

Every bit must be rebuilt no matter if it is a 1 or a 0. Therefore empty space takes just as long to rebuild as your actual data.

Quoting JustinAiken: "I have the Norco case, so the SATA connection was over a SAS -> SAS cable, and the power was going to the whole backplane... any idea on what I should look out for?"

And I assume that there are other drives on the same backplane that are working, right? Maybe try a different drive bay? I know there's one bay in my Norco 4220 in which the drive just doesn't want to seat quite right. For now I'm just not using it, but I'm sure I'll be annoyed once I get to 19 drives...

Also, is your PSU powerful enough to run all your drives? You should have a PSU with at least 30A on a single +12V rail.
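On the rebuild-time point above, a rough back-of-the-envelope check: the duration depends on the drive's capacity and the sustained throughput, not on how full the drive is. The 60 MB/s average below is an assumed figure chosen for illustration, not a measurement from this server, and `rebuild_hours` is a hypothetical helper.

```python
def rebuild_hours(capacity_bytes, throughput_bytes_per_s):
    """Hours to rewrite every byte of a drive at a constant throughput."""
    return capacity_bytes / throughput_bytes_per_s / 3600


CAPACITY = 1.5e12    # 1.5 TB drive (decimal terabytes)
THROUGHPUT = 60e6    # assumed average rebuild speed: 60 MB/s

# Fill level never enters the formula: zeros are rewritten just like data.
for used_fraction in (0.10, 1.00):
    hours = rebuild_hours(CAPACITY, THROUGHPUT)
    print(f"{used_fraction:.0%} full: about {hours:.1f} hours")
```

At that assumed speed the estimate comes out just under 7 hours for both fill levels, in line with the rebuild time reported in the thread.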
JustinAiken Posted August 9, 2010

Yeah, all the other drives on that backplane are working... I'm out of slots until I get another SAS cable... I guess I could try moving all four drives in the plane down to the next one...

I should have plenty of power, I'm using a Corsair 750W with a single rail...
JustinAiken Posted August 10, 2010

Okay, so I rebuilt the drive from the parity, it is now green, 0 errors...

But if I try to write to it, it gives me permissions errors... I think the whole drive is stuck as read-only now... how do I fix that?
Joe L. Posted August 10, 2010

Quoting JustinAiken: "Okay, so I rebuilt the drive from the parity, it is now green, 0 errors... But if I try to write to it, it gives me permissions errors... I think the whole drive is stuck as read-only now... how do I fix that?"

First, post a syslog: http://lime-technology.com/wiki/index.php?title=Troubleshooting

Odds are you have some file-system corruption and it was mounted as read-only to prevent further damage.

Instructions on how to check the file-systems are in the wiki here: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems
JustinAiken Posted August 10, 2010

Here's the syslog: http://pastebin.com/y5nRm1HD

And here's the output of the reiserfsck check and the fix-fixable: http://pastebin.com/F5dAtb1n

Upon remount the drive was still unwriteable... should I do the tree fixing option?
Joe L. Posted August 10, 2010

Quoting JustinAiken: "Here's the syslog: http://pastebin.com/y5nRm1HD And here's the output of the reiserfsck check and the fix-fixable: http://pastebin.com/F5dAtb1n Upon remount the drive was still unwriteable... should I do the tree fixing option?"

You must follow its advice:

    2 found corruptions can be fixed only when running with --rebuild-tree

    reiserfsck --rebuild-tree /dev/md5
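As a small illustration of "follow its advice", the sketch below picks the next reiserfsck option from the text of a previous run. `next_reiserfsck_option` is a hypothetical helper, not part of unRAID or reiserfsprogs, and it assumes only that the tool names the required option in its summary line, as in the message quoted above.

```python
def next_reiserfsck_option(report):
    """Return the repair option named in a reiserfsck report, if any."""
    # --rebuild-tree is checked first: it is the heavier repair, so if the
    # report asks for it, --fix-fixable alone will not be enough.
    for option in ("--rebuild-tree", "--fix-fixable"):
        if option in report:
            return option
    return None


report = "2 found corruptions can be fixed only when running with --rebuild-tree"
option = next_reiserfsck_option(report)
print(f"reiserfsck {option} /dev/md5")
```

The helper only suggests a command; it does not run anything, so the decision to rebuild the tree stays with whoever is reading the report.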