
rc11 RAID-1 array won't come up on boot
Closed, ResolvedPublic

Description

When installing rc11 on a 5-year-old Supermicro server with two 1 TB drives in software RAID1, everything looks good during the install phase: the array comes up properly and the installation proceeds normally. However, when rebooting from the drives, mdadm does not recognize the array and the installation is booted with 'sda1' mounted as the primary disk.
Looking at 'cat /proc/mdstat', the array is recognized, but with only one member (sdb1).

If I boot from the ISO again, the array is properly mounted and mdstat looks good, with both sda1 and sdb1 as members.
This is unfortunately not reproducible on VMware. Please ask and I will provide any output required from the server.

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.2.0-rc11
Why the issue appeared?
Will be filled on close

Event Timeline

danhusan created this object in space S1 VyOS Public.

During install (booted from ISO):

vyos@vyos:~$ cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb1[0] sda1[1]
      976758720 blocks [2/2] [UU]
      [>....................]  resync =  3.0% (29548736/976758720) finish=154.6min speed=102046K/sec
      bitmap: 8/8 pages [32KB], 65536KB chunk

unused devices: <none>


vyos@vyos:/tmp$ cat install-2741.log 
turning off swaps...
Creating filesystem on /dev/md0...
Done!
Mounting /dev/md0...
Done!
Setting up grub...
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.

Done!

After reboot, running from local disks:

vyos@vyos:~$ cat /proc/mdstat 
Personalities : [raid1] 
md127 : active (auto-read-only) raid1 sdb1[0]
      976758720 blocks [2/1] [U_]
      bitmap: 8/8 pages [32KB], 65536KB chunk

unused devices: <none>


vyos@vyos:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            13G  9.0M   13G   1% /run
/dev/sda1       917G  424M  870G   1% /lib/live/mount/persistence
/dev/loop0      307M  307M     0 100% /lib/live/mount/rootfs/1.2.0-rc11.squashfs
tmpfs            32G     0   32G   0% /lib/live/mount/overlay
overlay          32G   12M   32G   1% /
devtmpfs         10M     0   10M   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
tmpfs            32G  4.0K   32G   1% /tmp
none             32G   72K   32G   1% /opt/vyatta/config
kroy added a subscriber: kroy. Edited Dec 20 2018, 6:28 AM

@danhusan

I’ve been trying to research this a little, and I can’t duplicate it. But I suspect that’s because I have fast disks. Your first output says it’s going to take over two hours for a resync.

If you run it for a while, does the array stay in that “auto-read-only” state? From some cursory searches, this can happen until the array hits its first write, especially since the output does mention the other disk. I’m wondering if this happens with slower disks or something.

I did manage to find a bug here when reusing an existing array that was built under BIOS and using EFI, but that’s probably unrelated.

I can let it sit forever; the status won't change, as the array is missing one disk. Notice how after "(auto-read-only) raid1" it only says sdb1[0] (sda1 is missing).
We can also see from the df -h output that this won't fix itself: the system has clearly booted straight from sda1 instead of from the RAID array (/dev/md127).
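For anyone debugging this later, the degraded state is visible directly in /proc/mdstat: a healthy two-member mirror shows [2/2] [UU], while the broken boot shows [2/1] [U_]. A quick sketch of checking this programmatically (the helper name and sample text are mine, just for illustration):

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose [total/active] counters show a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r'(md\d+) :', line)
        if m:
            current = m.group(1)  # remember which array the following lines describe
            continue
        m = re.search(r'\[(\d+)/(\d+)\]', line)
        # degraded when active members (second number) < configured members (first)
        if m and current and int(m.group(2)) < int(m.group(1)):
            degraded.append(current)
    return degraded

sample = (
    "md127 : active (auto-read-only) raid1 sdb1[0]\n"
    "      976758720 blocks [2/1] [U_]\n"
)
print(degraded_arrays(sample))  # -> ['md127']
```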

pasik added a subscriber: pasik. Dec 20 2018, 12:25 PM
c-po added a subscriber: c-po. Dec 20 2018, 2:45 PM

Does the problem also occur in a virtual environment, e.g. ESXi?

I am not able to reproduce it in VMware Workstation.

https://forum.vyos.io/t/vyos-1-2-0rc10-raid-1-fresh-install-unable-to-save-config/3059
These users might have the same issue. I noticed that one symptom of the system being in this state is that the configuration is lost between reboots.

Some data I wasn't able to find in any log files:

kroy added a comment. Dec 20 2018, 5:12 PM

@danhusan

While I'm not sure if it's related, it looks like your system has a buggy ACPI implementation. Sometimes that can cause some weird behaviour.

Can you try hitting "e" in GRUB when it pops up, and dropping acpi=off onto the kernel line as pictured?

Then just hit Ctrl-X to boot. It might not be the problem here, but it's worth a try as a troubleshooting step.
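For reference, the edited entry in the GRUB edit screen would look roughly like this, with acpi=off appended to the end of the kernel line (the path and existing options here are illustrative, not copied from the actual VyOS image):

```
linux /boot/vmlinuz boot=live ... acpi=off
```

This change is one-off: it only applies to the current boot and is gone after the next reboot unless made persistent in the GRUB configuration.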

Spot on: booting with acpi=off makes the array come up properly, and configurations are actually saved.
I'm still getting some errors, but I'm not sure whether they matter:

kroy added a comment. Dec 20 2018, 5:38 PM

Yeah, I'm not familiar enough with things to understand why it would be trying to mount the bare RAID partitions, not to mention the actual bare drives.

@c-po I'm not really sure of the implications of adding acpi=off as a default GRUB option. Pinging the guys from the forum thread who were having a similar issue.

kroy added a comment. Dec 20 2018, 5:46 PM

@danhusan I'd be curious whether adding rootdelay=10 instead of acpi=off also works. That may or may not do anything, depending on how the kernel is built, though.

@kroy you are clearly on a roll today; rootdelay=10 also did the trick.

danhusan added a comment. Edited Dec 20 2018, 9:43 PM

And FYI, just to make sure nothing was triggered by hanging out in GRUB:
rootdelay=0 fails.
rootdelay=2 works, so only a small pause is needed, on my system at least.
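On a stock Debian-style system, the same parameter could be made persistent roughly like this (illustrative only; VyOS generates its grub.cfg through its own image-build scripts, which is where an actual fix would have to land):

```
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=5"
# then regenerate the config:
#   update-grub
```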

c-po added a comment. Dec 31 2018, 1:51 PM

Instead of differentiating between RAID and non-RAID installations, why not always wait 5 seconds for the disks to settle? As this is only done once on startup, this is IMHO better than a special case.

In T1120#29573, @c-po wrote:

Instead of differentiating between RAID and non-RAID installations, why not always wait 5 seconds for the disks to settle? As this is only done once on startup, this is IMHO better than a special case.

That works too; I was hesitant to propose a global change when we currently have only one verified user experiencing the issue.

c-po added a comment. Dec 31 2018, 5:24 PM

I propose to proceed with a global change. Special-case handling is always harder to test, and the impact is at most 5 seconds of startup time. Who cares on a 24/7 device that is rarely rebooted?
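The "wait for the disks to settle" idea can be sketched like this (illustrative only; the function name is mine, and the actual fix went into the VyOS boot scripts via the pull requests mentioned in this task):

```python
import os
import time

def wait_for_device(path, timeout=5.0, interval=0.1):
    """Poll for a device node, returning True once `path` exists
    or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    # one final check in case the node appeared right at the deadline
    return os.path.exists(path)

print(wait_for_device("/dev/null"))  # an always-present node -> True
```

The upside of doing this unconditionally, as proposed above, is that there is a single code path to test, and on a system where the disks are already up the loop returns immediately.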

Added new pull requests; built and tested, working fine.

c-po added a comment. Jan 2 2019, 4:20 PM

Also cherry-picked into crux

c-po assigned this task to danhusan. Jan 2 2019, 4:21 PM
c-po closed this task as Resolved.

@c-po https://github.com/vyos/vyos-build/pull/35 is also needed. Without it, upgrades (install image ...) will fail.

c-po added a comment. Jan 2 2019, 4:38 PM

THX for the update. Done on current and crux as well.