Router responding to arp requests for all addresses, breaks Windows networking!
Closed, Resolved | Public | BUG

Description

I've been chasing this for a couple of days now. At first I thought it was a DHCP issue, but then I couldn't get things to work with a static IP either. Linux boxes seem to handle the issue better than Windows, where IPv4 completely breaks.

Looks like the router is responding with its MAC for every address on the subnet. All 3 addresses were unused.

packet capture:
https://i.imgur.com/irFJ6yU.png

Details

Difficulty level: Unknown (require assessment)
Version: 1.2.0-rolling+201809180337
Why the issue appeared? Will be filled on close
mb300sd created this task. Sep 18 2018, 7:36 PM
hagbard claimed this task. Edited Sep 18 2018, 7:54 PM
hagbard added a subscriber: hagbard.

Can you share your config please.

The config's pretty huge, so here's the LAN interface; I need to go through and sanitize the rest. There's no proxy ARP or similar anywhere, and "arp" doesn't appear in the config at all. The issue occurs on all VLANs.

show interfaces ethernet eth1
 duplex auto
 hw-id 00:02:c9:23:b4:90
 speed auto
 vif 2001 {
     address 10.2.1.1/24
     address x:x:x:x::1/64
     description LAN
     ipv6 {
         dup-addr-detect-transmits 1
         router-advert {
             cur-hop-limit 64
             link-mtu 0
             managed-flag false
             max-interval 600
             other-config-flag false
             prefix x:x:x:x::/64 {
                 autonomous-flag true
                 on-link-flag true
                 valid-lifetime 2592000
             }
             reachable-time 0
             retrans-timer 0
             send-advert true
         }
     }
 }
 vif 2002 {
     address 10.2.2.1/24
     address x:x:x:x::/64
     description LAN_Guest
     ipv6 {
         dup-addr-detect-transmits 1
         router-advert {
             cur-hop-limit 64
             link-mtu 0
             managed-flag false
             max-interval 600
             other-config-flag false
             prefix x:x:x:x::/64 {
                 autonomous-flag true
                 on-link-flag true
                 valid-lifetime 2592000
             }
             reachable-time 0
             retrans-timer 0
             send-advert true
         }
     }
 }
 vif 2003 {
     address 10.2.3.1/24
     address x:x:x:x::/64
     description LAN_Unsecure
     ipv6 {
         dup-addr-detect-transmits 1
         router-advert {
             cur-hop-limit 64
             link-mtu 0
             managed-flag false
             max-interval 600
             other-config-flag false
             prefix x:x:x:x::/64 {
                 autonomous-flag true
                 on-link-flag true
                 valid-lifetime 2592000
             }
             reachable-time 0
             retrans-timer 0
             send-advert true
         }
     }
 }
Merijn added a subscriber: Merijn. Edited Sep 19 2018, 7:09 AM

I have seen this occur sometimes when the same subnet is configured on multiple interfaces. The IPv4 addressing seems to be OK in your config, but the IPv6 addresses are hidden and missing the last digit as well.
Also, maybe you have created some strangely behaving NAT rule. If you have NAT configured, could you post that config, or go through it yourself?

rps added a subscriber: rps. Sep 19 2018, 7:59 AM

I haven't tested this to verify but some initial thoughts:

Sounds like proxy ARP ... can you provide values of: /proc/sys/net/ipv4/conf/*/proxy_arp

Also if that is what's happening this would only be coming into play if the different IP networks were not being isolated from each other. Can you verify that each sub interface is on a separate broadcast domain (VLAN) and that you aren't bridging any VLANs together by accident?

Could also be a driver issue with the NIC not handling 802.1q tagging correctly ...

mb300sd added a comment. Edited Sep 19 2018, 4:42 PM

IPv6 is all on different subnets, and I actually have working IPv6 networking while IPv4 is broken. v6 uses NDP instead of ARP, so it shouldn't be able to cause this.

NAT rules are pretty basic.

nat {
    destination {
        rule 100 {
            description "TCP to 10.2.1.xx"
            destination {
                port "xxx,xxx"
            }
            inbound-interface "eth0"
            protocol "tcp"
            translation {
                address "10.2.1.xxx"
            }
        }
    }
    source {
        rule 1000 {
            description "NAT"
            outbound-interface "eth0"
            source {
                address "10.2.0.0/16"
            }
            translation {
                address "masquerade"
            }
        }
    }
}

No proxy arp anywhere.

cat /proc/sys/net/ipv4/conf/*/proxy_arp
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
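
Because cat with a glob prints the values without saying which interface each one belongs to, a labelled listing can be easier to read. A minimal Python sketch, assuming the standard Linux /proc/sys/net/ipv4/conf layout; the helper name read_conf_values and its base parameter are made up for illustration:

```python
from pathlib import Path

def read_conf_values(setting, base="/proc/sys/net/ipv4/conf"):
    """Return {interface: value} for one per-interface IPv4 sysctl."""
    values = {}
    for entry in sorted(Path(base).glob(f"*/{setting}")):
        # The parent directory is the interface name (or "all"/"default").
        values[entry.parent.name] = entry.read_text().strip()
    return values

if __name__ == "__main__":
    # Prints nothing if the directory does not exist (e.g. not on Linux).
    for iface, value in read_conf_values("proxy_arp").items():
        print(f"{iface:16} proxy_arp={value}")
```

Any interface that shows a non-zero value would be the first place to look.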

I've checked over the switch config multiple times, and it happens while there's nothing except the VyOS box and my laptop connected. Nothing else on the network should be capable of bridging. The problem goes away when the VyOS box is disconnected and every other device is connected.

The tagging is correct, from what I can tell with tcpdump/wireshark, but you might be on to something with the driver issue. The timing seems to line up with when I first upgraded to a version with the new kernel, and I didn't notice anything until a Windows box tried to renew its IP. I was chasing DHCP for a while until I realized static was broken too. Is there a download anywhere of a version with the old kernel to test?

hagbard added a comment. Edited Sep 19 2018, 4:59 PM

I tested your snippet and can't reproduce your issue.
Why is your ARP requester a broadcast address?

Can you please share the version you're running?

It's what Windows does when you first assign an IP address - checks if it's in use and refuses to use it if it is. Linux boxes don't so they have working IPv4.
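
For readers unfamiliar with this check: it corresponds to the ARP probe described in RFC 5227, in which the sender IP field is left as 0.0.0.0 and the target IP is the address being tested. A minimal Python sketch of such a probe payload, assuming Windows follows that format here; the MAC and target address are placeholders, not values from the capture:

```python
import socket
import struct

def build_arp_probe(sender_mac: bytes, target_ip: str) -> bytes:
    """Build the 28-byte ARP payload of an RFC 5227 style probe."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                             # hardware type: Ethernet
        0x0800,                        # protocol type: IPv4
        6,                             # hardware address length
        4,                             # protocol address length
        1,                             # opcode: request
        sender_mac,                    # sender MAC
        socket.inet_aton("0.0.0.0"),   # sender IP left unspecified in a probe
        b"\x00" * 6,                   # target MAC is unknown
        socket.inet_aton(target_ip),   # address being checked
    )

# Placeholder values for illustration only.
probe = build_arp_probe(bytes.fromhex("020000000001"), "10.2.1.50")
print(len(probe), probe.hex())
```

If any host answers such a probe, the prober treats the address as taken, which is why a router that answers for every address leaves Windows without a usable IPv4 address.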

Currently 1.2.0-rolling+201809180337 but I'm having the issue on every version.

Does anyone know where I can find a build on the old kernel? I deleted mine swapping images at some point. Starting to think that might be when it started.

Hmm, sorry, I don't have any Windows machine; actually I haven't had any Windows since 1996, so I can't test that. I tested with the build from the 17th and also with your NAT rules, and still can't reproduce the issue you're seeing. The ISOs on https://downloads.vyos.io/?dir=rolling/current/amd64 go back to Sept 5th, so you can check there. The other option is to install just a different kernel.
If you need help with installing your own kernel, let me know.

I've tried the oldest build, but it still has the issue. Is there any way to extract an image from another router? The timing does line up for it being a driver issue. I'm going to see if swapping to a different NIC helps next time I drive over; I'm debugging remotely at the moment, so no rebooting allowed for a few days.

Sure thing, I'll leave the bug ticket open.

rps added a comment. Sep 19 2018, 11:52 PM

Can you provide the output of tcpdump -eni <sub-interface> 'arp' (e.g. eth1.2001) from a root shell on VyOS when this is happening? I would like to see the capture with MAC addresses included for the specific sub-interface involved (text rather than screenshot please).

If you can not see it happening that way try using the parent interface (e.g. eth1) instead.

Also can you verify the values of /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan which might also have similar behavior.

Does your box have a Mellanox card? Is there any virtualization involved? Can you check the driver revision in the kernel in both the working and the non-working state? Use ethtool to find out the driver servicing your interface and then modinfo the kernel driver name to get its version.

syncer triaged this task as Normal priority. Tue, Sep 25, 1:58 PM
vlesk added a subscriber: vlesk. Tue, Oct 2, 5:21 AM

Finally got back up here to test. Swapped out the Mellanox NIC with a Solarflare card on latest, works. 201807292210 image with Mellanox card, works. Latest image and different Mellanox card, broken. Definitely looks like a driver issue, the new kernel seems to have a far older version. No virtualization involved.

Not working:

filename: /lib/modules/4.18.11-amd64-vyos/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
version: 4.0-0
license: Dual BSD/GPL
description: Mellanox ConnectX HCA Ethernet driver

Working image:

filename: /lib/modules/4.14.26-amd64-vyos/updates/dkms/mlx4_en.ko
version: 4.3-1.0.1
license: Dual BSD/GPL
description: Mellanox ConnectX HCA Ethernet driver
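
To make the difference between the two driver builds easy to spot, the modinfo output can be parsed and compared field by field. A small Python sketch using only the filename and version lines quoted above; parse_modinfo is a hypothetical helper, and it keeps only the last value for keys that modinfo repeats (such as alias):

```python
def parse_modinfo(text):
    """Parse 'key: value' lines as printed by modinfo into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

broken = parse_modinfo("""\
filename: /lib/modules/4.18.11-amd64-vyos/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
version: 4.0-0
""")
working = parse_modinfo("""\
filename: /lib/modules/4.14.26-amd64-vyos/updates/dkms/mlx4_en.ko
version: 4.3-1.0.1
""")

# Print every field whose value differs between the two images.
for key in sorted(set(broken) | set(working)):
    if broken.get(key) != working.get(key):
        print(f"{key}: {broken.get(key)!r} -> {working.get(key)!r}")
```

The filename field is worth a look as well as the version: the working image loads the module from updates/dkms, while the non-working one loads it from the kernel's own drivers tree.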

oliko added a subscriber: oliko. Wed, Oct 3, 5:28 PM

Ok, I'm going to build the official driver for the 4.18.11 kernel then.

vlesk added a comment. Thu, Oct 4, 3:45 AM

I have the same issue with a bridge, which consists of Intel NICs (drivers igb and e100). The router responds to all ARP requests.
My NICs are: Intel Corporation 82575GB Gigabit Network Connection (rev 02) and Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 08)

Can you please share the output of the command 'show conf comm'. thx.

vlesk added a comment. Sat, Oct 6, 12:30 PM

I prepared the config and a pcap file. I removed the e100 NIC from the bridge, but the problem is still present. Windows can't get an address, either static or dynamic; it reports an address conflict.

set firewall all-ping 'enable'
set firewall broadcast-ping 'disable'
set firewall config-trap 'disable'
set firewall ip-src-route 'disable'
set firewall ipv6-receive-redirects 'disable'
set firewall ipv6-src-route 'disable'
set firewall log-martians 'enable'
set firewall name OUTSIDE-IN ...
set firewall name OUTSIDE-TO-LOCAL ...
set firewall receive-redirects 'disable'
set firewall send-redirects 'enable'
set firewall source-validation 'disable'
set firewall syn-cookies 'enable'
set firewall twa-hazards-protection 'disable'
set interfaces bridge br0 address '192.168.101.254/24'
set interfaces bridge br0 aging '300'
set interfaces bridge br0 description 'LAN Bridge'
set interfaces bridge br0 hello-time '2'
set interfaces bridge br0 max-age '20'
set interfaces bridge br0 priority '0'
set interfaces bridge br0 stp 'false'
set interfaces ethernet eth0 address 'wwww'
set interfaces ethernet eth0 description 'WAN Link'
set interfaces ethernet eth0 duplex 'auto'
set interfaces ethernet eth0 firewall in name 'OUTSIDE-IN'
set interfaces ethernet eth0 firewall local name 'OUTSIDE-TO-LOCAL'
set interfaces ethernet eth0 hw-id '48:5b:39:19:fe:72'
set interfaces ethernet eth0 ipv6 address autoconf
set interfaces ethernet eth0 ipv6 dup-addr-detect-transmits '1'
set interfaces ethernet eth0 smp-affinity 'auto'
set interfaces ethernet eth0 speed 'auto'
set interfaces ethernet eth1 bridge-group bridge 'br0'
set interfaces ethernet eth1 duplex 'auto'
set interfaces ethernet eth1 hw-id '00:1b:21:1f:e1:68'
set interfaces ethernet eth1 smp-affinity 'auto'
set interfaces ethernet eth1 speed 'auto'
set interfaces ethernet eth2 bridge-group bridge 'br0'
set interfaces ethernet eth2 duplex 'auto'
set interfaces ethernet eth2 hw-id '00:1b:21:1f:e1:69'
set interfaces ethernet eth2 smp-affinity 'auto'
set interfaces ethernet eth2 speed 'auto'
set interfaces ethernet eth3 bridge-group bridge 'br0'
set interfaces ethernet eth3 duplex 'auto'
set interfaces ethernet eth3 hw-id '00:1b:21:1f:e1:6c'
set interfaces ethernet eth3 smp-affinity 'auto'
set interfaces ethernet eth3 speed 'auto'
set interfaces ethernet eth4 bridge-group bridge 'br0'
set interfaces ethernet eth4 duplex 'auto'
set interfaces ethernet eth4 hw-id '00:1b:21:1f:e1:6d'
set interfaces ethernet eth4 smp-affinity 'auto'
set interfaces ethernet eth4 speed 'auto'
set interfaces ethernet eth5 duplex 'auto'
set interfaces ethernet eth5 hw-id '00:0d:88:ff:dd:8c'
set interfaces ethernet eth5 smp-affinity 'auto'
set interfaces ethernet eth5 speed 'auto'
set interfaces ethernet eth6 duplex 'auto'
set interfaces ethernet eth6 hw-id '00:0d:88:ff:dd:8d'
set interfaces ethernet eth6 smp-affinity 'auto'
set interfaces ethernet eth6 speed 'auto'
set interfaces ethernet eth7 duplex 'auto'
set interfaces ethernet eth7 hw-id '00:0d:88:ff:dd:8e'
set interfaces ethernet eth7 smp-affinity 'auto'
set interfaces ethernet eth7 speed 'auto'
set interfaces ethernet eth8 duplex 'auto'
set interfaces ethernet eth8 hw-id '00:0d:88:ff:dd:8f'
set interfaces ethernet eth8 smp-affinity 'auto'
set interfaces ethernet eth8 speed 'auto'
set interfaces loopback lo address '192.168.100.254/32'
set interfaces openvpn vtun0 ...
set interfaces openvpn vtun1 ...
set interfaces openvpn vtun10 ...
set interfaces openvpn vtun11 ...
set interfaces vti vti0 address 'xxx'
set interfaces vti vti1 address 'xxx'
set interfaces vti vti1 mtu '1438'
set nat destination ...
set nat source ...
set protocols static interface-route ...
set service dhcp-server shared-network-name LAN subnet 192.168.101.0/24 default-router '192.168.101.254'
set service dhcp-server shared-network-name LAN subnet 192.168.101.0/24 dns-server '192.168.101.254'
set service dhcp-server shared-network-name LAN subnet 192.168.101.0/24 lease '86400'
set service dhcp-server shared-network-name LAN subnet 192.168.101.0/24 range 0 start '192.168.101.129'
set service dhcp-server shared-network-name LAN subnet 192.168.101.0/24 range 0 stop '192.168.101.191'
set service dns forwarding ...
set service lldp interface br0 location civic-based ca-type 0 ca-value 'RU'
set service lldp interface br0 location civic-based country-code 'RU'
set service lldp interface eth0 disable
set service lldp legacy-protocols cdp
set service snmp ...
set service ssh ...
...
set vpn ipsec ...
set vpn l2tp  ...
set vpn pptp  ...

Thanks a lot. I tried to reproduce it on various machines without success, which leads me to the assumption that the issue might be the NIC firmware. I just had e1000 to test with, but that's all working fine.
Can you please check the following:

sudo cat /proc/sys/net/ipv4/conf/all/arp_announce
sudo cat /proc/sys/net/ipv4/conf/all/arp_ignore

If they are not set to 0, can you please try to set them to 0 and see if the issue is still present?

ThomasB added a subscriber: ThomasB. Edited Sat, Oct 6, 11:35 PM

I have the exact same problem since updating from 1.2.0-rolling+201805280337 to 1.2.0-rolling+201810060337. My VyOS VM is running on ESXi 6.5 with two virtual VMXNET3 interfaces. The (shared) physical interface is an Intel i210 Gigabit interface.

sudo cat /proc/sys/net/ipv4/conf/all/arp_announce
2

sudo cat /proc/sys/net/ipv4/conf/all/arp_ignore
1

I tried setting both to 0 in /etc/sysctl.conf and rebooting but the problem persists.
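
One thing worth keeping in mind when zeroing these: per the kernel's ip-sysctl documentation, the value applied to an interface is the maximum of conf/all and conf/<interface>, so setting only conf/all to 0 has no effect if the per-interface value is still non-zero. A minimal Python sketch of that rule; the function name and the base parameter are made up for illustration:

```python
from pathlib import Path

def effective_arp_setting(setting, iface, base="/proc/sys/net/ipv4/conf"):
    """Return max(conf/all, conf/<iface>) for arp_ignore or arp_announce."""
    root = Path(base)
    all_value = int((root / "all" / setting).read_text())
    iface_value = int((root / iface / setting).read_text())
    # The kernel uses the larger of the two values for this interface.
    return max(all_value, iface_value)
```

For example, effective_arp_setting("arp_ignore", "eth1") would report the value that actually governs ARP replies on eth1, where eth1 is just a placeholder interface name.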

[EDIT]
If it's of any help, here is the output of modinfo vmxnet3:

1.2.0-rolling+201805280337

filename:       /lib/modules/4.14.26-amd64-vyos/kernel/drivers/net/vmxnet3/vmxnet3.ko
version:        1.4.a.0-k
license:        GPL v2
description:    VMware vmxnet3 virtual NIC driver
author:         VMware, Inc.
srcversion:     871CAF0D12CEB5CB75607D1
alias:          pci:v000015ADd000007B0sv*sd*bc*sc*i*
depends:
intree:         Y
name:           vmxnet3
vermagic:       4.14.26-amd64-vyos SMP mod_unload modversions

1.2.0-rolling+201810060337

filename:       /lib/modules/4.18.11-amd64-vyos/kernel/drivers/net/vmxnet3/vmxnet3.ko
version:        1.4.16.0-k
license:        GPL v2
description:    VMware vmxnet3 virtual NIC driver
author:         VMware, Inc.
srcversion:     83BF94F153CB4DCE8A3B6D4
alias:          pci:v000015ADd000007B0sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           vmxnet3
vermagic:       4.18.11-amd64-vyos SMP mod_unload modversions
syncer raised the priority of this task from Normal to High. Sat, Oct 6, 11:42 PM
syncer moved this task from Need Triage to Backlog on the VyOS 1.2.x board.
syncer added a subscriber: syncer. Sun, Oct 7, 4:15 AM

@dmbaturin @c-po I suspect the new kernel.
Can we build an image with a pre-4.18 kernel to confirm?

c-po added a comment. Sun, Oct 7, 7:30 AM

We can switch back to 4.14 for testing, yes.

c-po added a comment. Sun, Oct 7, 11:54 AM

Okay, can you please test if the issue exists with the following image:

http://www.mybll.net/vyos-1.2.0-rolling+201810071039-amd64.iso sha1sum dee5207af12fe01d9a9de4a1b040726590ff7005

It's the latest rolling with kernel 4.14 instead of 4.18.
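
Before booting a test image like this, it is worth checking the download against the posted checksum. A short Python sketch using hashlib; the local file path is whatever you saved the ISO as, and the helper names are invented for this example:

```python
import hashlib

# SHA-1 posted alongside the test image in this thread.
EXPECTED = "dee5207af12fe01d9a9de4a1b040726590ff7005"

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file without loading it whole."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED):
    """True if the file's SHA-1 equals the expected hex digest."""
    return sha1_of_file(path) == expected.lower()
```

Calling matches_expected on the downloaded ISO should return True if the file arrived intact.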

I’ll see if I can test it tonight (CET) and report back.

vlesk added a comment. Sun, Oct 7, 1:36 PM

I checked the image vyos-1.2.0-rolling+201810071039-amd64.iso: kernel 4.14.65 works fine. The router responds only to ARP requests for its own addresses.

I can confirm that the test image works for me as well.

c-po added a comment. Sun, Oct 7, 2:19 PM

Thanks for testing. I'm currently building a second image for testing and will keep you updated! The second one ships Linux kernel 4.19-rc6, so we can see whether the issue still exists or is gone!

In T852#20403, @ThomasB wrote:

I can confirm that the test image works for me as well.

I spoke too soon. VyOS still responds to ARP requests for any IP address with its own MAC on my setup.

I rechecked the Solarflare card, and the issue still exists. I didn't catch it last time because my config got a little messed up with all the image swapping.

syncer added a comment. Sun, Oct 7, 7:07 PM

So not specific to kernel version

Just a thought... I noticed that vlesk has IPSEC tunnels over vti in his config, and so do I. I remember previous 1.2.0-rolling builds had issues with IPSEC/vti where the router wouldn't respond to ARP because of some routing table mumbo-jumbo I don't quite understand. This has since been worked around, because it can't really be fixed at the moment.

I wonder if that fix could be the source of the new ARP issue?

I'll see if I can find some time to mess around with it in the coming week, but if anyone else can confirm or deny that this is related to IPSEC/vti, that would at least be a step forward.

syncer added a comment. Sun, Oct 7, 8:13 PM

@ThomasB can you just test config without vti/ipsec in place?

@syncer I would be happy to give it a try, even if it's a little risky for me. My router VM is running on an ESXi server, and when the ARP issue strikes I lose direct access to ESXi and have to go through the tunnel. That basically means that if the router VM breaks, I won't be able to access anything on that server. The server is located in a datacenter 1.500 km away.

So I'll need a (spare time) window large enough to have time to think everything through before doing anything, and also time enough to try and fix it if it goes south. I hope you understand.

syncer added a comment. Sun, Oct 7, 8:26 PM

@ThomasB we have a similar issue. My point is that maybe you can spin up an isolated copy with a similar config in your env.
We don't have issues on 6.5, so I'm wondering if this is env-specific (just pointing a finger at the sky).

I also have ipsec/vti.

vlesk added a comment. Edited Mon, Oct 8, 12:43 PM

I checked the issue with and without IPSEC/vti.
Without vti, ARP worked fine. When I applied the config with vti, the router began to respond to all ARP requests with all of its own MACs.

I have a test router on ESXi 5.5 with 3 NICs. Interface eth0 is used for the "Internet/IPSec head" connection, I checked ARP on eth1, and eth2 was unconfigured. With vti, my router replies to ARP with the MACs of both eth1 and eth2 (??? strange behaviour).

ARP began to work correctly again after removing the vti config and rebooting the router.

I checked it on the 4.18.11-amd64-vyos kernel.

I've also managed to test ipsec/vti today by setting up a test environment like this:

                                   *
                                   | 10.2.200.2/30
                                   | eth0
                          +--------+--------+
                          |       R1        |
                          +-+-------------+-+
                       eth1 |             | eth2
              10.2.200.1/24 |             | 10.2.204.1/24
                            |             |    +------------------+
                            |             *----+ PC 10.2.204.5/24 |
                            |             |    +------------------+
              10.2.200.2/24 |             | 10.2.204.2/24
                       eth0 |             | eth0
             vti0 +---------+--+        +-+---------+ vti0
10.99.1.1/30   *--+     R2     |        |     R3    +--* 10.99.1.2/30
                  +------+-----+        +-----+-----+
                    eth1 |                    | eth1
           10.2.201.1/24 |                    | 10.2.205.1/24
                         *                    *
  • R1 is a gateway for the test network, and allows routing to the internet. It's set up using 1.2.0-rolling+201805280337, which I know doesn't have the ARP problem (I use it in my production environment).
  • R2 is running 1.2.0-rolling+201810070337 and uses R1 as a gateway
  • R3 is running 1.2.0-rolling+201810070337 and uses R1 as a gateway
  • PC is just an Ubuntu 16.04 server that lets me manipulate its ARP table (I can remove the ARP entry for R1 to trigger a new ARP request).

For testing, I ran tcpdump -ni eth2 'arp' on R1 to monitor ARP requests and responses on eth2 (R3 and PC).

When no vti tunnel was configured, I was unable to trigger an ARP response from R3 for 10.2.204.1 when flushing the ARP cache on PC. When the ipsec/vti tunnel between R2 and R3 was configured and up, I could trigger a response from R3 for 10.2.204.1 when clearing the ARP table on PC.
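
A check like this can be automated by comparing each ARP reply against the MAC that is supposed to own the address. A rough Python sketch, assuming tcpdump -n text output of the form "ARP, Reply <ip> is-at <mac>, length 28"; the MAC addresses in the example are invented placeholders from the documentation range, not values from the test network:

```python
import re

REPLY = re.compile(r"ARP, Reply (\S+) is-at ([0-9a-f:]{17})")

def unexpected_replies(lines, owners):
    """Yield (ip, mac) for ARP replies whose MAC is not the known owner of ip."""
    for line in lines:
        match = REPLY.search(line)
        if match:
            ip, mac = match.groups()
            if owners.get(ip) != mac:
                yield ip, mac

# Hypothetical data: R1 owns 10.2.204.1, a second host also answers for it.
owners = {"10.2.204.1": "00:00:5e:00:53:01"}
sample = [
    "12:00:00.000001 ARP, Reply 10.2.204.1 is-at 00:00:5e:00:53:01, length 28",
    "12:00:00.000050 ARP, Reply 10.2.204.1 is-at 00:00:5e:00:53:03, length 28",
]
for ip, mac in unexpected_replies(sample, owners):
    print(f"unexpected reply for {ip} from {mac}")
```

Only the second sample line is reported, because its MAC differs from the one registered as the owner of the address.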

I tried doing set vpn ipsec options disable-route-autoinstall on both R2 and R3, but it didn't change anything.

Oddly, I couldn't reproduce the issue on my latest image. Perhaps there is more to it. We are preparing a release candidate now, please re-test when it's out.

I've tested the above setup with 1.2.0-rc1 and the problem persists.

I also noticed that R2 will send 2 apparently identical responses when someone tries to resolve its MAC:

18:31:59.457989 ARP, Request who-has 10.2.204.2 tell 10.2.204.1, length 46
18:31:59.458033 ARP, Reply 10.2.204.2 is-at 00:0c:29:f5:9c:59, length 28
18:31:59.458108 ARP, Reply 10.2.204.2 is-at 00:0c:29:f5:9c:59, length 28
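
Duplicate replies like these are easy to miss in a longer capture. A small Python sketch that counts identical replies in tcpdump text output, using the three lines above as input; the regular expression is an assumption about the output format and only covers lines of this shape:

```python
from collections import Counter
import re

capture = """\
18:31:59.457989 ARP, Request who-has 10.2.204.2 tell 10.2.204.1, length 46
18:31:59.458033 ARP, Reply 10.2.204.2 is-at 00:0c:29:f5:9c:59, length 28
18:31:59.458108 ARP, Reply 10.2.204.2 is-at 00:0c:29:f5:9c:59, length 28
"""

# Each match is an (ip, mac) pair taken from one reply line.
reply = re.compile(r"ARP, Reply (\S+) is-at (\S+),")
counts = Counter(reply.findall(capture))

for (ip, mac), n in counts.items():
    if n > 1:
        print(f"{ip} answered {n} times by {mac}")
```

This simple count does not pair replies with individual requests, so on a busy link it is best run over a short capture window.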

Sorry for spamming this thread, but I found this bug report that might be relevant: https://bugzilla.redhat.com/show_bug.cgi?id=1488421

This suggests that the farp module in StrongSwan is to blame.

If you disable the StrongSwan farp module, it appears to resolve the problem:

sudo cat /etc/strongswan.d/charon/farp.conf

farp {

    # Whether to load the plugin. Can also be an integer to increase the
    # priority of this plugin.
#    load = yes
    load = no
}

I rebooted after changing load = yes to load = no. After rebooting it looks like the ARP issue is resolved. I haven't finished testing whether there are any side-effects, but it looks good so far. The ipsec tunnel still works, as does the rest of the basic networking/routing.
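
To confirm which load value is actually active in a file like this, where the old line is only commented out, the comments can be stripped before reading the setting. A simplistic Python sketch, not a full strongSwan config parser; farp_load_setting is a made-up helper and the sample text mirrors the snippet above:

```python
def farp_load_setting(conf_text):
    """Return the active 'load' value from a farp.conf style snippet, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if line.startswith("load") and "=" in line:
            value = line.split("=", 1)[1].strip()
    return value

sample = """\
farp {
    # Whether to load the plugin. Can also be an integer to increase the
    # priority of this plugin.
#    load = yes
    load = no
}
"""
print(farp_load_setting(sample))
```

On the router, the same function could be fed the contents of /etc/strongswan.d/charon/farp.conf.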

hagbard changed the task status from Open to In progress. Fri, Oct 12, 5:06 PM

Should it be disabled globally, or just not loaded via config?

@hagbard from the description I'm not sure we will ever need it, so I propose to do it globally.
We can enable it later if we need to.

syncer closed this task as Resolved.