Trying to restore array after accidentally unassigning all my drives!


Acps

Recommended Posts

So I'll go ahead and say it: I was clicking buttons I didn't understand, lol. This is what I did:

 

[Screenshot attached: Screenshot_3.jpg]

unraid-syslog-20191126-1931.zip

unraid-diagnostics-20191126-1919.zip

 

So one of my 5TB drives is definitely dead dead, but I'm comfortable holding out till next week to maximize Cyber Week deals. What I was actually messing around with was my cache pool. I had 4 SSDs in RAID 0 (2x 250GB and 2x 128GB). I yanked one last night for another PC that I needed access to, and 4 was overkill for this setup as it is. I was having issues, though, rebuilding my btrfs cache pool with the 3 SSDs I have left. Luckily I had my appdata folders backing up every week, so the only data loss from the cache drives would be whatever was still waiting for the mover schedule to kick in.

But now I'm afraid to touch it, because I'm fairly certain I haven't totally bricked all my data yet. I thought that, if done correctly, this is still salvageable, but one more screwup and I could be dead in the water, losing roughly 20TB of data, a lot of it irreplaceable. I spent some time trying to figure it out on my own, but I'm worried that I could misinterpret something and still ruin all the data. So I wanted to check in with the pros here who can point me in the right direction to salvage my data.

Thanks in advance for the input! 

 

~Acps

Link to comment

I would suspect that the unformatted 5TB drive (in the unassigned section) is actually your parity drive. The other six unassigned drives, which are formatted, are the data drives.

 

I would also suspect that the TOSHIBA MD04ACA500 (sdg) drive with the serial number ending in FS9A is on its last legs. Do you think that your parity was valid before this hiccup occurred? That would affect how you might approach the recovery operation. If parity was good, then you could just rebuild this disk onto a replacement. But be careful with what you do, as you could lose that option by not doing it correctly. (In my opinion, the last thing you want to do at this point is to attempt to rebuild parity with that disk (Serial# ...FS9A) installed, as I suspect that the parity operation will fail...)

Link to comment
On 11/26/2019 at 3:59 PM, ashman70 said:

Do you have a backup or second copy of your data?

I wish, but my array was 25TB; I thought by running dual parity I was protecting my data. I wasn't expecting to mess up the array myself by getting all the disks unassigned from their slots. I wish I had the internet connection to back it up to the cloud, but uploading 25TB at a speed of 0.75MB/s would take roughly 300+ days, and no one would be able to use the internet during that time, lol.
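(Checking that estimate, 25TB at 0.75MB/s really does work out to roughly 385 days; a quick shell sanity check:)

# 25 TB ≈ 25,000,000 MB; at 0.75 MB/s that is 25,000,000 / 0.75 ≈ 33,333,333 s
echo $(( 25000000 * 4 / 3 / 86400 ))   # seconds -> days: prints 385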

I haven't tried to start the array since the drives got unassigned. I haven't even tried to reboot anything. I thought it could be put back together if done correctly, but I don't want to try anything out of fear of the one mistake that would cost me all the data. The cache pool data is gone and I'm fine with that; I backed up my appdata weekly, assuming I can recover the array. I did put my HDDs back in the same slots they were in before.

[Screenshot attached: Screenshot_18.png]

 

Now Disk2, which is sdg, is definitely bad; it's been dead dead. It's been like that for several weeks, and I tried 3-4 times to rebuild the array with it to see if it was a fluke, but it always ended up failing with the data being emulated. I had dual parity, though, so I wasn't worried. Parity2 (sde), the drive that looks unformatted, was the other drive I had an issue with. But that was a one-time fluke, because it hasn't had a single issue since it reported a few errors.

 

Disk2 sdg

[Screenshot attached: Screenshot_19.png]

 

 

Parity2 sde

[Screenshot attached: Screenshot_20.png]

 

Looking at the SMART health, though, it's showing they both have errors:

 

[Screenshot attached: Screenshot_17.png]

 

But I think that's because they got unassigned and reassigned. I had acknowledged the errors after each rebuild attempt, to see if they'd occur again.

 

Parity2 (sde), the one showing unformatted, was confusing me too, but once I assigned it back into its slot in the array it showed as formatted just fine.

[Screenshot attached: Screenshot_21.png]

 

Now, as far as I know my parity was always valid; I cancelled a few parity syncs by hand, but I back up all those logs as well. My syslog always backs up, assuming I do a clean shutdown; I've got 2 years of them so far, lol. Here are the rest of the syslogs from Nov 26, which is when this all occurred!

syslog-20191126-100643.txt
syslog-20191126-105428.txt

 

So after all the pics and walls of text, I hope that clarifies what I did, and hopefully I just need to start the array and let the parity sync run. But I was worried because, in the first pic I posted of the drives assigned but not mounted, the two red warnings contradict each other. That red text saying I'd lose all my data didn't pop up until after I assigned both parity drives and then assigned disk1. So I'm afraid to try it until I hear back from someone who has a much better understanding of this than me. Thank you for the help so far, and hopefully someone can confirm I can rebuild it just fine. If that's not the case, I'd want to try and recover as much data as possible from the disks without trying to rebuild the array and possibly writing over the good data that's still there.

 

~Acps

 

 

Link to comment
10 hours ago, ashman70 said:

You may have heard this before, but RAID, or unRAID, is not a backup, meaning you should always have a second copy of your data. I know this isn't easy for many people as it costs $$, but if your data is worth anything to you, then please seriously consider it.

Or divide your data into two categories: Irreplaceable Data and everything else. Make sure you have a well thought-out strategy for the Irreplaceable Data, with an off-site location being a part of that strategy. (I personally use portable hard drives stored in a safety deposit box for off-site.)

Link to comment

I think you can recover from this without data loss. Do you have the replacement disks yet? The procedure is relatively simple, but it has to be done in exactly the proper sequence. I am going to ping @johnnie.black as he has led several other folks through it.

 

Now for a word of caution. Never run for several weeks depending on dual parity to bail you out. Either fix it immediately, or power down the server until you can address the issue. The principal advantage of dual parity is that it protects against a second disk failure occurring during the recovery from the first failure!

Link to comment

The easiest way forward, assuming all assignments are known to be correct, is to check "parity is already valid" and start the array. Then, if you already have a spare to replace disk2, do it now; if not, unassign disk2 and use the array like that until you have a replacement, but try to replace it as soon as possible.

Link to comment

OK, so I was able to bring my array back online without disk2 by checking "parity is already valid"; however, disk2's data isn't being emulated like it was before. I do have a replacement HDD that'll be here Tuesday. Is the data stored on disk2 still recoverable? Or how do I emulate the data from the parity drives?

Edited by Acps
Link to comment
On 11/29/2019 at 8:34 AM, Frank1940 said:

Or divide your data into two categories: Irreplaceable Data and everything else. Make sure you have a well thought-out strategy for the Irreplaceable Data, with an off-site location being a part of that strategy. (I personally use portable hard drives stored in a safety deposit box for off-site.)

 

I knew that I was at risk with one disk offline, and the thought had crossed my mind to power down till I had a replacement. I know that good practice is to have an offsite backup for disaster recovery, whether on LTO tapes or in the cloud; I just never thought I'd be the cause of the disaster, lol. But I didn't think the cost was worth the redundancy to back up 25TB of data. It didn't occur to me that backing up just the sensitive data would be another option. Most of my data is media/software that, while important, is replaceable; my irreplaceable data is mostly medical/legal/personal documents. So I think I can easily set up some new shares and move data around to make it possible to back up to an external drive quite easily, or even online with the right encryption protection. I rarely delete anything and would be considered a data hoarder by far. I was an IT in the Coast Guard for 10 years with TS/SCI clearance, so I got to work on enterprise-level networks, but I was never involved with the implementation of our networks, just administering and maintaining them. A mistake like this would have cost me my job in the military, possibly with criminal charges for negligence. So hopefully I can recover from this, learn, and make changes so that it doesn't happen again. I was waiting for Black Friday/Cyber Monday deals to pick up a few drives, to replace the two I had issues with as well as a spare or two for the future.

Link to comment
1 hour ago, Acps said:

OK, so I was able to bring my array back online without disk2 by checking "parity is already valid"; however, disk2's data isn't being emulated like it was before. I do have a replacement HDD that'll be here Tuesday. Is the data stored on disk2 still recoverable?

Stop at this point. Don't allow any writes to the array. What happened to the physical disk2? Did you take it out? I believe you should have included it in the assignment of disks if the server could detect it (even if it is bad!), so that its contents could be emulated. I am going to ping @johnnie.black again, but I believe he is in Europe so he probably won't see this until tomorrow. (I have had very few issues with my servers, so I don't have much experience fixing the more complex problems. He has many servers running and seems to have accumulated a lot of knowledge in how to deal with these types of things.)

Link to comment
1 hour ago, Acps said:

OK, so I was able to bring my array back online without disk2 by checking "parity is already valid"; however, disk2's data isn't being emulated like it was before. I do have a replacement HDD that'll be here Tuesday. Is the data stored on disk2 still recoverable? Or how do I emulate the data from the parity drives?

If disk2 is not being emulated, then its data is not recoverable by simply rebuilding onto a new disk.

 

Without knowing exactly what steps you have already taken to get your array operable on the other drives, it is impossible to determine whether there is any path forward that might lead to the contents of disk2 being recovered, or whether you have already made this impossible.
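(For intuition: single parity is just a byte-wise XOR across the data disks, so a missing disk can only be emulated while parity still agrees with the surviving disks; dual parity adds a second, independent syndrome but relies on the same consistency. A toy single-parity example in shell:)

# parity = d1 ^ d2 ^ d3, so a missing d2 can be emulated as parity ^ d1 ^ d3.
# Rebuilding parity from the current, disk2-less array destroys that identity.
echo $(( (0x5A ^ 0x3C ^ 0x77) ^ 0x5A ^ 0x77 ))   # prints 60 (0x3C), the "missing" byte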

Link to comment
On 11/29/2019 at 8:59 AM, johnnie.black said:

The easiest way forward, assuming all assignments are known to be correct, is to check "parity is already valid" and start the array. Then, if you already have a spare to replace disk2, do it now; if not, unassign disk2 and use the array like that until you have a replacement, but try to replace it as soon as possible.

 

I literally did this. Since I did not have a spare, I unassigned disk2. I didn't remove it; it's still sitting in my server, plugged in, just as an unassigned disk.

Link to comment

It sounds as if you did not start the array with disk2 assigned, then stop it to unassign the disk and restart the array? That would have left you with disk2 being emulated. If you did the unassignment without that initial start, then it would not be emulated (and parity would be invalid).

 

Has anything been written to the array in the meantime? Have you tried to rebuild parity?

Link to comment

That's exactly what I did; I guess I misunderstood Johnnie. I did some appdata backups, but as far as I know that's all on the cache drive, so I don't think anything's been written to the array. Will rebuilding the parity make disk2's data unrecoverable?

If there's no way to salvage the array, as a last resort I'd like to pull the drive and see if I can recover whatever data was written to it, if any.

Link to comment
25 minutes ago, Acps said:

Will rebuilding the parity make disk2's data unrecoverable?

Yes.

 

Quote

I did some appdata backups, but as far as I know that's all on the cache drive.

Appdata backups are normally to the array (in case there is a problem with the cache drive). Check what you have set for the target. If it is the array, then it will no longer be possible to rebuild disk2, and you should rebuild parity to make sure it matches your current drive assignments in case another drive has issues (I suspect that at the moment it does not).
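One way to double-check whether anything actually landed on the array disks is a quick console search; this is just a sketch, so widen the -mmin window to cover everything since the mishap:

find /mnt/disk* -type f -mmin -1440 2>/dev/null | head -n 20   # files modified in the last 24 hours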

25 minutes ago, Acps said:

If there's no way to salvage the array, as a last resort I'd like to pull the drive and see if I can recover whatever data was written to it, if any.

Can the drive be mounted when it is shown as an Unassigned device?   
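For example, a minimal read-only check from the console, assuming the old disk2's data partition turns up as /dev/sdX1 (a stand-in; substitute the real device):

mkdir -p /mnt/test                        # scratch mount point, name is arbitrary
mount -t xfs -o ro /dev/sdX1 /mnt/test    # read-only, so nothing on the disk changes
ls /mnt/test                              # are the top-level folders still there?
umount /mnt/test
# If the mount fails on a dirty XFS log, adding norecovery (-o ro,norecovery) may help.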

Link to comment

Disk1

Quote

Phase 1 - find and verify superblock...
        - block cache size set to 1496680 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3466131 tail block 3425645
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:3467420) is ahead of log (1:3466131).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Dec 2 14:22:18 2019

Phase       Start           End             Duration
Phase 1:    12/02 14:21:58  12/02 14:21:58
Phase 2:    12/02 14:21:58  12/02 14:21:59  1 second
Phase 3:    12/02 14:21:59  12/02 14:22:10  11 seconds
Phase 4:    12/02 14:22:10  12/02 14:22:10
Phase 5:    Skipped
Phase 6:    12/02 14:22:10  12/02 14:22:18  8 seconds
Phase 7:    12/02 14:22:18  12/02 14:22:18

Total run time: 20 seconds

 

Disk3

Quote

Phase 1 - find and verify superblock...
        - block cache size set to 1504136 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1594967 tail block 1594963
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:1595140) is ahead of log (1:1594967).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Dec 2 14:30:26 2019

Phase       Start           End             Duration
Phase 1:    12/02 14:30:05  12/02 14:30:05
Phase 2:    12/02 14:30:05  12/02 14:30:05
Phase 3:    12/02 14:30:05  12/02 14:30:16  11 seconds
Phase 4:    12/02 14:30:16  12/02 14:30:16
Phase 5:    Skipped
Phase 6:    12/02 14:30:16  12/02 14:30:26  10 seconds
Phase 7:    12/02 14:30:26  12/02 14:30:26

Total run time: 21 seconds

 

So do I go ahead and run the xfs_repair tool given these results?
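(My understanding is that the actual repair is run with the array started in Maintenance mode, so the md devices exist but nothing is mounted; /dev/md1 and /dev/md3 are my guess at the device names for disk1 and disk3:)

xfs_repair -v /dev/md1    # disk1; using /dev/mdX rather than /dev/sdX keeps parity in sync
xfs_repair -v /dev/md3    # disk3
# If it refuses to run because of the dirty log, mounting the filesystem once
# replays the log; xfs_repair -L (zero the log) is the destructive last resort.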

 

Link to comment
