(SOLVED) 6.4.0 Parity Rebuild slowed to 3.6MB/s. Was I wrong about this drive?


elecgnosis

A few days ago, Disk 5 (serial ending in 4555) came up as disabled, with its write count in the quintillions. The SMART data did not look bad to me, and the system log reported I/O errors, which I interpreted as a bad connection to the SATA backplane at that drive port. So I shut everything down and moved drives around, but I did not open the case to check cable connections. Clearly, that was a mistake.

Yesterday, Disk 12 (serial ending in GY9P) came up disabled. I made a diagnostics report but didn't look very closely at it, assuming that I had put Disk 12 into the same drive port that I had failed to identify as bad the first time. So I shut down the server and moved Disk 12 to another port. Without any further troubleshooting (another mistake), I started a data rebuild: I unassigned Disk 12 from the array, started the array, stopped the array, assigned Disk 12 again, and restarted the array. Before I went to bed last night, the data rebuild had less than eight hours left, so I expected it to be finished this morning.

However, when I checked this morning, the speed of the rebuild had dropped to around 3.6MB/s and the time to complete had risen to five days. I took a closer look at the previous diagnostics report and found a raft of write errors, read errors, and I/O errors. That report also has errors for Disk 14, as well as messages from my SAS card about my SAS expander that I don't understand but that could explain why that drive port caused me trouble to begin with.

 

I'm not sure what to do at this point. I've posted two diagnostics reports: the one from yesterday, when Disk 12 was first disabled, and another from this morning. I made mistakes here, but since the parity drive is still good, I'm hoping that you all can show me a way out of this situation without losing my data. If there is no such path to success...well, I could use some advice on the mistakes that I made beyond those that I already recognized, so that I don't repeat them the next time that I have drive trouble. Also, what should I be looking for in SMART reports? Why would the green thumbs up in the web interface be displayed if there are problems in a drive's SMART data?

 

Syslog is spammed with mover logging (turn that off) and wrong CSRF token errors like these:

 

Feb 18 07:36:47 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:36:47 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:36:47 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:36:51 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:36:52 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:36:57 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:01 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:37:03 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:03 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:03 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:37:09 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:14 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:15 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:18 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:37:18 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:37:19 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token
Feb 18 07:37:21 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token
Feb 18 07:37:25 OMNI root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token

Because of these, there is time missing from the syslog, but I do see that the emulated disk was already unmountable when you started, so you shouldn't have rebuilt on top of the old disk. Close any stale browser windows open on the webGUI, reboot, and post new diagnostics.
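
For anyone reading a syslog that is flooded like this, a rough way to see what is left once the noise is removed is to filter out the csrf_token lines and count them per script. This is only a sketch in Python, not an unRAID tool; the function name `split_csrf_noise` is made up for illustration, and the regular expression assumes the exact line shape shown in the excerpt above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
# "Feb 18 07:36:47 OMNI root: error: /webGui/include/DashUpdate.php: wrong csrf_token"
CSRF_RE = re.compile(r"error: (\S+): wrong csrf_token")


def split_csrf_noise(lines):
    """Return (count of csrf_token errors per script path, all other lines)."""
    counts = Counter()
    kept = []
    for line in lines:
        match = CSRF_RE.search(line)
        if match:
            counts[match.group(1)] += 1
        else:
            kept.append(line)
    return counts, kept


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py /path/to/syslog
    with open(sys.argv[1], errors="replace") as handle:
        counts, kept = split_csrf_noise(handle.read().splitlines())
    for path, n in counts.most_common():
        print(f"{n:6d}  {path}")
    print(f"{len(kept)} other lines remain")
```

The per-script counts also show which stale browser page is generating the errors (the dashboard or the preclear plugin page in the excerpt above).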

Nevermind, I went ahead and shut down anyway. I checked cable connections and everything seemed secure from power and SAS connections on the backplane to the controller and expander cards. I have started the server and unassigned the questionable drive, but I have not started the array yet.

 

Here are the current diagnostic reports: 

I can't see what happened in the 2nd diags because of the missing time, but it's likely the same thing that happened in the 1st ones: the SAS controller crashed and dropped multiple disks. You're using a Marvell-based HighPoint HBA that uses the same driver as the SASLP/SAS2LP. That is a known problem with unRAID v6, and you should replace it with an LSI ASAP.

 

The low speed during the last rebuild was likely due to IRQ16 getting disabled, another known issue with those controllers.

 

The main problem is that you rebuilt an unmountable disk on top of old disk12, so if it is still unmountable you might lose that data.

OK, now there's a more usual error, so there's a chance you can recover the data. You'll need an extra disk mounted outside the array, though you can use old disk12 since it's been overwritten. See here for how:

 

https://lime-technology.com/forums/topic/46802-faq-for-unraid-v6/?do=findComment&comment=543490

 

P.S. I recommend only doing this after you get that controller replaced, or it may cause more damage, since all disks need to be read simultaneously to emulate the missing disk.

32 minutes ago, elecgnosis said:

Offhand, can you make a recommendation between an LSI 9211 and an LSI 9240?

By the way, you might find a flashed LSI 9211-8i up on eBay for US$70-90. You want to specify 'LSI 9211-8i IT Mode' in the search parameters. Be sure to check out the vendor's reputation, location, and shipping costs. I have purchased two LSI cards in the past nine months via eBay and have gotten what I wanted without any issues. I paid a bit more than the cheapest available price but...

I was taking a quick look at the SMART reports for your disks and they are mostly OK, but there are a couple of things you want to monitor. Some disks have run way too hot in the past, especially disk17:

 

Lifetime    Min/Max Temperature:     20/68 Celsius

 

Disks should be kept under 40C, 45C tops.
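
If you want to check every SMART report in a diagnostics zip for this, a small script can pull out the lifetime maximum and compare it to the limit. The following is a sketch only: the function names are invented for illustration, and it assumes the report contains a line in exactly the form quoted above, which not every drive reports.

```python
import re

# Assumed line shape, copied from the SMART report quoted above:
# "Lifetime    Min/Max Temperature:     20/68 Celsius"
TEMP_RE = re.compile(
    r"Lifetime\s+Min/Max\s+Temperature:\s+(-?\d+)/(-?\d+)\s+Celsius"
)


def lifetime_max_temp(smart_text):
    """Return the lifetime maximum temperature in Celsius, or None if absent."""
    match = TEMP_RE.search(smart_text)
    return int(match.group(2)) if match else None


def ran_too_hot(smart_text, limit_c=45):
    """True if the drive has ever exceeded limit_c (45C is the ceiling given above)."""
    max_temp = lifetime_max_temp(smart_text)
    return max_temp is not None and max_temp > limit_c


report = "Lifetime    Min/Max Temperature:     20/68 Celsius"
print(lifetime_max_temp(report), ran_too_hot(report))
```

This only tells you that the drive was hot at some point in its life, not when, so it is a prompt to check current temperatures and airflow rather than a verdict on the disk.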

 

There are also various disks with a very high CRC_UDMA error count. These could be old errors, but you should update to v6.4.1 and make sure notifications are enabled, since that attribute is now monitored. If it keeps increasing for any disk, it means there's still a problem, usually a bad SATA cable.
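
Until notifications are set up, the same check can be done by hand by comparing two SMART reports taken some time apart. This sketch assumes the usual `smartctl -A` attribute table, where the CRC counter is attribute ID 199 (typically named UDMA_CRC_Error_Count) and the raw value is the tenth column; the function names and the sample rows are illustrative, not taken from the diagnostics in this thread.

```python
def crc_error_count(smartctl_attrs):
    """Return the raw value of attribute 199 from `smartctl -A` text, or None."""
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def crc_increased(before, after):
    """True if the CRC error count grew between two reports for the same disk."""
    old, new = crc_error_count(before), crc_error_count(after)
    return old is not None and new is not None and new > old


# Illustrative rows only (hypothetical values):
earlier = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1520"
later = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1534"
print(crc_increased(earlier, later))
```

A count that stays flat after reseating or replacing the cable suggests the errors were old; one that keeps climbing means the link is still bad.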

I received my LSI 9211 in IT mode today and started running btrfs restore on the errant device.

The files seem to be coming through okay.

In the meantime, can you tell me what I'll have to do to get that data back into the array? I have another 4TB drive ready to take the place of the original, so I could mount it in place of the missing drive, format it, then copy the data into the array, which should preserve the parity drive. Is there another way I should consider instead?

2 hours ago, elecgnosis said:

so I could mount it in place of the missing drive, format it, then copy the data into the array, which should preserve the parity drive.

That's the only way to do it while maintaining parity. The disk will need to be rebuilt first, though.

 

2 hours ago, elecgnosis said:

Is there another way I should consider instead?

You could also do a new config, but then parity would need to be resynced and the array would be unprotected until it's done.

I was restoring the wrong device. I thought sdg1 was still the device that I'm having trouble with, but that letter was assigned to a different device. Now I have no idea which device letter corresponds with the Disk 12 that needs to be restored. Is there some table associating the two pieces of information?

3 minutes ago, elecgnosis said:

I was restoring the wrong device. I thought sdg1 was still the device that I'm having trouble with, but that letter was assigned to a different device. Now I have no idea which device letter corresponds with the Disk 12 that needs to be restored. Is there some table associating the two pieces of information?

I haven't been involved in this thread until now, but how is the "drive letter" relevant to what you have been doing and what you have been told to do?

 

The "drive letter" is assigned at boot time depending on the order the drives respond and is not guaranteed to be the same between boots. unRAID tracks the "disk number" based on its serial number, and it is the disk number that you should usually be working with.

12 minutes ago, elecgnosis said:

I was restoring the wrong device. I thought sdg1 was still the device that I'm having trouble with, but that letter was assigned to a different device. Now I have no idea which device letter corresponds with the Disk 12 that needs to be restored. Is there some table associating the two pieces of information?

The disk identifier is shown on the main page for both assigned and unassigned disks. I don't know if the disk is still part of the array, but you can tell by the serial number.
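
From the command line, one way to get the same serial-to-letter table is to read the symlinks under /dev/disk/by-id, which Linux udev normally names after the drive model and serial number and points at the current sdX device. This is a sketch under that assumption; the function names are made up for illustration, and the letters it reports are only valid until the next reboot.

```python
import os


def devices_by_id(by_id_dir="/dev/disk/by-id"):
    """Map each by-id link name (model and serial) to its current device name."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        if os.path.islink(path):
            mapping[name] = os.path.basename(os.path.realpath(path))
    return mapping


def find_by_serial_suffix(mapping, suffix):
    """Keep whole-disk entries whose name ends with the given serial suffix.

    Partition links end in "-partN", so they are excluded by the suffix test.
    """
    return {name: dev for name, dev in mapping.items() if name.endswith(suffix)}


if __name__ == "__main__":
    # "GY9P" is the last four of the Disk 12 serial given in the first post.
    for name, dev in find_by_serial_suffix(devices_by_id(), "GY9P").items():
        print(f"{dev}  {name}")
```

Run this right before starting a restore so that the sdX name matches the disk you mean to read from.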
