Problem rebuilding drive - Unraid website unresponsive - General Support (V5 and Older)

March 21, 2013

Hi,

I'm replacing my 1.5 TB drives with 3 TB drives. During the rebuild, the Unraid server's web interface became unresponsive. I saved the log file via telnet. After a hard restart (poweroff command), the data rebuild started all over again. I'm kind of worried about data integrity.

I would be thankful if someone more knowledgeable could take a look at my syslog.

Thx

syslog-2013-03-20.txt

March 21, 2013

Was the server otherwise responsive? Could you view your shares? If so, wait for it to complete the rebuild. Granted, it can be frustrating not knowing the current progress....

Edit: Just realized that the array may not be "live" but if you can telnet in, things should still be running.

March 22, 2013

Author

Well, the LEDs indicate that there are no reads/writes going on (no flashing). Eventually I restarted the server. The data rebuild started all over, but the same behavior occurred again. The web page is not responding. Any clues?

March 22, 2013

Several ports are throwing errors, could be a cable not seated well, cable itself, controller, driver. Are these ports on the SASLP? Check in that area. Could be hardware failure.

March 22, 2013

Author

Yes, all the backplanes are connected to the two SAS cards, 8087->8087. I'll check when I get home. The cables should be OK. They snap in.

March 22, 2013

Mar 20 07:27:15 Tower kernel: md: recovery thread rebuilding disk5 ...

Mar 20 07:27:15 Tower kernel: md: using 1152k window, over a total of 2930266532 blocks.

Mar 21 08:53:54 Tower kernel: sas: command 0xf42b06c0, task 0xf438d2c0, timed out: BLK_EH_NOT_HANDLED

Mar 21 08:53:54 Tower kernel: sas: Enter sas_scsi_recover_host

Mar 21 08:53:54 Tower kernel: sas: trying to find task 0xf438d2c0

Mar 21 08:53:54 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf438d2c0

Mar 21 08:53:54 Tower kernel: sas: sas_scsi_find_task: querying task 0xf438d2c0

Mar 21 08:53:54 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5

Mar 21 08:53:54 Tower kernel: sas: sas_scsi_find_task: task 0xf438d2c0 failed to abort

Mar 21 08:53:54 Tower kernel: sas: task 0xf438d2c0 is not at LU: I_T recover

Mar 21 08:53:54 Tower kernel: sas: I_T nexus reset for dev 0400000000000000

Mar 21 08:53:54 Tower kernel: sas: sas_form_port: phy4 belongs to port4 already(1)!

Mar 21 08:53:56 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[4]:rc= 0

Mar 21 08:53:56 Tower kernel: sas: I_T 0400000000000000 recovered

Mar 21 08:53:56 Tower kernel: sas: sas_ata_task_done: SAS error 8d

Mar 21 08:53:56 Tower kernel: ata15.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0

Mar 21 08:53:56 Tower kernel: ata15.00: failed command: READ FPDMA QUEUED

Mar 21 08:53:56 Tower kernel: ata15.00: cmd 60/00:00:88:e4:c1/02:00:e4:00:00/40 tag 0 ncq 262144 in

Mar 21 08:53:56 Tower kernel: res 41/40:00:08:e6:c1/00:00:e4:00:00/40 Emask 0x409 (media error) <F>

Mar 21 08:53:56 Tower kernel: ata15.00: status: { DRDY ERR }

Mar 21 08:53:56 Tower kernel: ata15.00: error: { UNC }

Mar 21 08:53:56 Tower kernel: ata15.00: configured for UDMA/133

Mar 21 08:53:56 Tower kernel: ata15: EH complete

Mar 21 08:53:56 Tower kernel: sas: --- Exit sas_scsi_recover_host

Mar 21 08:54:27 Tower kernel: sas: command 0xf42b06c0, task 0xf438d2c0, timed out: BLK_EH_NOT_HANDLED

Mar 21 22:13:13 Tower in.telnetd[6419]: connect from 192.168.178.27 (192.168.178.27)

Mar 21 22:13:16 Tower login[6420]: ROOT LOGIN on '/dev/pts/0' from '192.168.178.27'

You can see that at 7:27 the morning of the 20th, the rebuild of Disk 5 began. Then the next morning at almost 8:54, a sas-related issue occurs BUT it is not handled! A timeout is reported, then a sas recovery function is called, and it has 'task' problems, can't abort some task that it should be able to, then a reset is called for, and it apparently succeeds. Now the regular ata exception handler is able to report a problem, a 'media error' with UNC flag raised, so probably a bad sector found (on ata15, sdj, Disk 2). It completes its report and closes, then the sas recovery function is complete and exits. But half a minute later, sas tries to report another issue, and it isn't being handled either! Worse yet, there is no recovery attempt at all(!), and no error reported here at all! The sas module appears to be frozen, and I suspect this is when much of the system became frozen too. You checked in much later at 22:13. There is no direct evidence of any issues with the Web management page.
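The pattern described above (a sas timeout that is never followed by a recovery attempt) can be searched for in a longer syslog. What follows is only a rough Python sketch, not part of any Unraid tooling. The function name `unhandled_timeouts` and the matching rules are assumptions derived from the excerpt above, and other driver versions may word these messages differently.

```python
import re

# Matches lines such as:
#   sas: command 0xf42b06c0, task 0xf438d2c0, timed out: BLK_EH_NOT_HANDLED
TIMEOUT = re.compile(r"sas: command \S+ task (0x[0-9a-f]+), timed out")
RECOVERY = "Enter sas_scsi_recover_host"


def unhandled_timeouts(lines):
    """Return the timeout lines that no later recovery attempt follows.

    Assumption: a recovery attempt shows up as an
    'Enter sas_scsi_recover_host' line before the next timeout.
    """
    pending = None
    stuck = []
    for line in lines:
        if TIMEOUT.search(line):
            if pending is not None:
                # A second timeout arrived before any recovery started.
                stuck.append(pending)
            pending = line
        elif RECOVERY in line:
            pending = None
    if pending is not None:
        stuck.append(pending)
    return stuck


# Shortened sample taken from the syslog excerpt in this thread.
SAMPLE = [
    "Mar 21 08:53:54 Tower kernel: sas: command 0xf42b06c0, task 0xf438d2c0, timed out: BLK_EH_NOT_HANDLED",
    "Mar 21 08:53:54 Tower kernel: sas: Enter sas_scsi_recover_host",
    "Mar 21 08:53:56 Tower kernel: sas: --- Exit sas_scsi_recover_host",
    "Mar 21 08:54:27 Tower kernel: sas: command 0xf42b06c0, task 0xf438d2c0, timed out: BLK_EH_NOT_HANDLED",
    "Mar 21 22:13:13 Tower in.telnetd[6419]: connect from 192.168.178.27 (192.168.178.27)",
]

stuck = unhandled_timeouts(SAMPLE)
for line in stuck:
    print(line)
```

On the sample it prints only the 08:54:27 line, which is the timeout that was never followed by a recovery attempt. To scan a saved log, the lines of the file can be passed in instead of `SAMPLE`.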

At the moment, I'm not sure what the best advice is, have to think about it. It would have been good to know what percentage the rebuild was at, it had run almost 25 and a half hours. There is an issue on Disk 2 that seems to be stopping the rebuild, and that will have to be fixed, but may be difficult when you already have one drive down. Just curious, how full was Disk 5, the one being rebuilt?
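The "almost 25 and a half hours" figure comes straight from the two syslog timestamps. A small Python sketch of that arithmetic, assuming both lines are from 2013 (syslog lines carry no year) and an English locale for the month abbreviation:

```python
from datetime import datetime

FMT = "%Y %b %d %H:%M:%S"


def elapsed(start, end, year=2013):
    """Time between two syslog timestamps; the year is an assumption."""
    t0 = datetime.strptime(f"{year} {start}", FMT)
    t1 = datetime.strptime(f"{year} {end}", FMT)
    return t1 - t0


# Rebuild start and first sas timeout, as shown in the excerpt.
delta = elapsed("Mar 20 07:27:15", "Mar 21 08:53:54")
hours = delta.total_seconds() / 3600
print(delta, round(hours, 2))  # 1 day, 1:26:39 25.44
```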

Edit: others may have better ideas, but this may be a good time to run a non-destructive badblocks test on Disk 2, see if we can force it to repair/remap the bad sectors. Do obtain a SMART report for sdj.

March 23, 2013

Author

I could swap the new 3 TB drive no. 5 for the old 1.5 TB drive no. 5 (not yet formatted), replace disk no. 2 (2 TB) with a brand-new drive, and start rebuilding. But would the parity still be good for rebuilding the brand-new drive no. 2 after reinserting the old drive no. 5? I'm kind of frightened of losing 2 TB of data.

March 23, 2013

I don't have time just now, but I want to say that I don't believe that any of your data is in danger YET, so long as we are very careful. The right set of steps should completely and fully restore your system to perfect health. On the other hand, the wrong steps *could* lead to data loss, so make sure there is a consensus in the advice given, and that you yourself are confident in the advice.

Great news that you have the original Disk 5, can't believe I did not think of that. I'll be back later... but there are many other knowledgeable advisers here...

March 23, 2013

I could swap the new 3 TB drive no. 5 for the old 1.5 TB drive no. 5 (not yet formatted), replace disk no. 2 (2 TB) with a brand-new drive, and start rebuilding. But would the parity still be good for rebuilding the brand-new drive no. 2 after reinserting the old drive no. 5? I'm kind of frightened of losing 2 TB of data.

This should work. Put the original drives back and select Utils->New Config. There should then be a checkbox to indicate the parity is good before starting the array. Then you can replace the problem disk.

March 23, 2013

Author

Well, I'm kind of unsure. Utils->New Config tells me something about wishing to rebuild parity based on a new configuration. So I'd rather check back with you guys before I do something wrong. Better to ask twice than to swear once. Is that the right option? Do I have to rearrange all drives afterwards?

Thx

March 24, 2013

I have never seen the screen myself, the docs on v5 are still woefully incomplete, I can't remember ever seeing anyone post a picture of this screen, and my own UnRAID is down for maintenance (some changes I was making), so I cannot be definitive as to what you *should* be seeing. But as dgaschk mentioned, there should have been another option to indicate that the current parity is valid. Can you try that again and look for that option? Other than reconnecting your old Disk 5, you should not have to rearrange any other drives.

One thing I would recommend, make sure that nothing writes to the array until all restoration is done. That includes any automated processes on other machines, such as automated backups.

But would the parity still be good for rebuilding the brand-new drive no. 2 after reinserting the old drive no. 5?

Is there any chance you have written or modified any files on Disk 5 since pulling the old Disk 5?

March 24, 2013

Author

No, I did not write any files to the array. I hope there is no misunderstanding: I was referring to the Utils screen, not the main screen yet. I'll post the screen as soon as I can. I can't resize the picture from my iPhone right now.

March 24, 2013

You're running an old release. I don't know if the "parity is correct" option is available on rc5.

March 24, 2013

Author

Well, I reset the system config. Afterwards I had to rearrange all drives by hand. I ticked the box that parity is valid and the system came up. I can access the files without problems. The parity check started immediately. I'm getting speeds of 28 MB/s. The parity check will take approximately 1800 min. Should I stop the parity check and replace the faulty drive no. 2?
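The roughly 1800 minute estimate is consistent with a 3 TB parity pass at 28 MB/s. A quick Python sketch of the arithmetic, under two assumptions that are not stated in the thread: the block count from the earlier syslog line ("over a total of 2930266532 blocks") is in 1 KiB blocks, and the speed is in decimal megabytes per second. The helper name `minutes_to_finish` is made up for this example.

```python
BLOCK_BYTES = 1024          # assumption: md reports 1 KiB blocks
TOTAL_BLOCKS = 2930266532   # from the syslog line quoted earlier in the thread


def minutes_to_finish(total_blocks, mb_per_s, done_fraction=0.0):
    """Estimated minutes left at a constant speed (decimal MB/s assumed)."""
    total_bytes = total_blocks * BLOCK_BYTES
    remaining_bytes = total_bytes * (1.0 - done_fraction)
    return remaining_bytes / (mb_per_s * 1_000_000) / 60


print(round(minutes_to_finish(TOTAL_BLOCKS, 28)))  # 1786
```

Real runs rarely hold a constant speed, since drives slow down toward the inner tracks, so this is only a ballpark figure.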

I'm running RC5 because of the slow parity speed. Some people went back from RC11 to RC5 because of reported speed issues. I was hoping that would fix the problem too. Maybe it's just the faulty drive that slows the system down. Otherwise I'll swap my SAS cards for IBM cards. But one step after another. Just for info: I moved to the new chassis and plugged in a second Supermicro SAS card to use the 8087 backplanes. The speed problems appeared after switching the case. Hopefully the speed problems vanish after replacing drive no. 2. Otherwise I'll have a new Supermicro board and the two IBM SAS controllers to do some serious testing.

Sorry for the spelling, but the iPad autocorrection can really be an annoying thing.

Thx so much for your time.

March 24, 2013

Author

Well, I just wanted to check the progress of the parity check and couldn't open the Unraid web page once again. The HDD LEDs don't blink anymore.

I extracted the syslog once again. What happened during the parity check? To me it looks like the same thing that happened during the rebuild. Should I try to replace disk 2 and start a rebuild, even though the parity check didn't complete?

thx

syslog-2013-03-24.txt

March 24, 2013

The system is slow and unresponsive because of the faulty drive. The parity check will not complete. Stop the check and do the replacement.

March 24, 2013

Author

It did not work. Same thing. The web GUI is unresponsive after replacing the HDD. The HDD LEDs are off. No activity... Dikiedirk was right, it seems. Funny that not all 4 drives are faulty. There are 4 drives per backplane/SAS cable, so why just one faulty drive?

That leaves SAS-Controller and Backplane. I hate it!

March 24, 2013

Author

Any ideas how I could find the faulty link?

March 25, 2013

Well, I just wanted to check the progress of the parity check and couldn't open the Unraid web page once again. The HDD LEDs don't blink anymore.

I extracted the syslog once again. What happened during the parity check? To me it looks like the same thing that happened during the rebuild. Should I try to replace disk 2 and start a rebuild, even though the parity check didn't complete?

The syslog shows a very similar handling of a drive or controller issue, including its inability to properly handle it, and the subsequent freeze, which probably froze much of your system again. However the drive error is different this time (HSM violation), occurs much earlier (only 3 hours into it), and involves a different drive on a different controller (the 4 port sas controller, ata10, sde, Disk 3). The previous issue was a Media error that occurred about 25.5 hours into the run and involved the 8 port sas controller and Disk 2 (ata15, sdj). The Media error is a legitimate problem, and one you need to fix. The HSM violation, on the other hand, is a slippery thing, very hard to pin down. Please see The Analysis of Drive Issues, and look up HSM violation. Two things you can do here, check for newer firmware versions for both of your SAS controllers, and upgrade to the latest UnRAID to take advantage of a more recent version of mvsas. You are using an older version of mvsas, v0.8.2. Recent UnRAID releases have contained v0.8.16, somewhat more mature, and perhaps better at handling the errors you had, without completely crashing and taking down the system. I did some research on mvsas versions, added it to the wiki here.

As to what happened and your questions, you probably did not want to run a parity check anyway, because it was going to quit when it got to the bad sector. The parity check was started automatically this time because of the previous crash and bad shutdown, and it is probably going to do the same thing when you restart. Just let it run for a few minutes, then cancel it. I think you will also have to inform it that parity is valid once more. In this syslog, it does not look like you have replaced Disk 2 yet, but your later posts imply that you may have done that?

Try to upgrade to RC12a, and try to upgrade the firmware on both cards, if possible.

March 25, 2013

Author

Well, I tried the new SAS firmware (.21) but experienced a freeze before Unraid even booted. So I installed version .15 on both controllers.

Right now the rebuild of drive no. 2 is running all over again, because I failed to extract a new syslog. It is currently at 39% (1.2 TB), with a rebuild speed of 10 MB/s. The previous attempt froze at around 40-45%. As soon as I have a new syslog with the failed rebuild of disk no. 2, I'll attach it. After that, I'll update Unraid to rc-12a. The link concerning the HSM violation isn't very promising.

I'll do more tests, when I get home.

1. update Unraid to rc-12a

2. update firmware of both SAS-Controller to .21

3a. unplug the SAS-Controller with only four drives attached and connect the drives directly via SATA to the Motherboard.

3b. switch SAS-Controller and repeat the test

As an indicator of success, I should see the parity check speed rise well above 30 MB/s once the fault is isolated.

March 25, 2013

Author

So here is the syslog from the last attempt to rebuild the faulty drive no. 2 onto a brand-new drive. The rebuild stopped at a progress of 40-45%.

Should I replace the not-yet-rebuilt drive with the old drive no. 2? Or can I abort the automatic rebuild gracefully after each restart, so I don't have to wait for the rebuild before I restart the system after each test/firmware update? I just don't want to corrupt anything by forcefully restarting the system during the rebuild.

syslog-2013-03-25.txt

March 25, 2013

Author

Well, I updated Unraid to rc-12a and the SAS cards from .15 to .21.

Started my first test, 3a. The automated rebuild kicked right in. I didn't have to assign any drives. The rebuild started at 45 MB/s. At least it's 50% faster. I'll let the system try to rebuild the faulty drive and see if it freezes again.

March 26, 2013

Author

Well, I'm very happy to inform you that test 3a was successful. The rebuild of drive no. 2 completed. All balls are green. I'll post a syslog from the completion of the rebuild in a second. Maybe someone could look into it and see whether any abnormal behavior is reported. The parity check is twice as fast at approximately 58 MB/s. Not very fast, but much better.

After someone has checked the logs, I'll continue to isolate the problem.

Thx for all the support.

Log posted.

syslog-2013-03-26.txt

March 26, 2013

Author

Well, test 3b: same thing. That is strange... Maybe the backplane...

March 26, 2013

Well, I'm very happy to inform you that test 3a was successful. The rebuild of drive no. 2 completed. All balls are green. I'll post a syslog from the completion of the rebuild in a second. Maybe someone could look into it and see whether any abnormal behavior is reported. The parity check is twice as fast at approximately 58 MB/s. Not very fast, but much better.

After someone has checked the logs, I'll continue to isolate the problem.

This is good! That speed seems relatively normal, for a 3TB parity and 12 drives. Syslog looks clean, no drive issues! There is a Reiser file system issue on Disk 2 (appeared on the previous syslog too). At some point, you will need to run Check Disk File systems on Disk 2, before you will be able to write to it.

Well, test 3b: same thing. That is strange... Maybe the backplane...

I'm unsure whether you meant same good thing, or same bad thing...
