
Mellanox cards, problem with interrupts
In progress, Normal, Public

Description

Today we tried to put VyOS with a 40GbE Mellanox card into production, but it crashed and rebooted under load (~6-8 Gbit/s of traffic). The IPMI console showed the error "CPU: 5 PID: 36 Comm: ksoftirqd/5 Not tainted 4.19.0-amd64-vyos". We then rolled back to the old network scheme and tried to find the cause of the VyOS crash. We suspect the Mellanox driver and its interrupts, which do not appear for the mlnx cards:

vyos@xx-gw2:~$ cat /proc/interrupts | grep eth
 25:          0          0         46          0          0          0          6          0   PCI-MSI 409600-edge      eth1
 26:          0          1          0          0          0          0          0          0   PCI-MSI 2097152-edge      eth0
 27:      22515          0          3          0          0          0          0          0   PCI-MSI 2097153-edge      eth0-TxRx-0
 28:          0          0        163          3          0          0          0          0   PCI-MSI 2097154-edge      eth0-TxRx-1
 29:          0          0          0          0         51          0          0          0   PCI-MSI 2097155-edge      eth0-TxRx-2
 30:          0          0          0          0          0          3        233          0   PCI-MSI 2097156-edge      eth0-TxRx-3
vyos@xx-gw2:~$
vyos@xx-gw2:~$ show interfaces 
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth0             xx.xx.xx.xx/20                  u/u  MGMT 
eth1             -                                 u/D  
eth2             -                                 u/u  
eth2.700         xx.xx.xx.xx/22                    u/u  
                 xxxx:xxxx::x:xx/48
eth2.703         xx.xx.xx.xx/30                  u/u   
                 xxxx:xxxx::x:xx/126
eth2.704         xx.xx.xx.xx/30                  u/u   
                 xxxx:xxxx::x:xx/126
eth2.712         xx.xx.xx.xx/30                  u/u   
                 xxxx:xxxx:xxxx:1::a/126
eth3             -                                 u/u  
eth3.100         xx.xx.xx.xx/30                     u/u  
eth3.704         -                                 u/u   
lo               127.0.0.1/8                       u/u  
                 ::1/128
vyos@xx-gw2:~$
root@xx-gw2:~# lsmod | grep mlx
mlx5_core             557056  0 
mlxfw                  20480  1 mlx5_core
ipv6                  417792  78 ip6table_mangle,mlx5_core
ptp                    20480  3 igb,e1000e,mlx5_core
root@xx-gw2:~#
01:00.0 Ethernet controller: Mellanox Technologies MT27620 Family
01:00.1 Ethernet controller: Mellanox Technologies MT27620 Family
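
The `grep eth` filter above only matches interrupt rows whose name contains "eth", so it cannot tell a card with no interrupts from a card whose driver registered its vectors under another name. As a rough sketch, the IRQ numbers can instead be taken from the device's `msi_irqs` directory in sysfs and looked up in `/proc/interrupts`. `irqs_for_pci_dev` is a hypothetical helper name, and the PCI domain `0000` in the usage line is an assumption (the lspci output above only shows `01:00.0`).

```shell
#!/bin/sh
# Print the /proc/interrupts rows that belong to one PCI device,
# regardless of how the driver named its vectors.
# irqs_for_pci_dev is a hypothetical helper; the sysfs root and the
# interrupts file are optional arguments so it can be tried on sample data.
irqs_for_pci_dev() {
    dev="$1"                            # full PCI address, e.g. 0000:01:00.0
    sysfs="${2:-/sys}"
    interrupts="${3:-/proc/interrupts}"
    for path in "$sysfs/bus/pci/devices/$dev/msi_irqs"/*; do
        [ -e "$path" ] || continue      # no MSI/MSI-X vectors allocated
        irq="${path##*/}"               # file name is the IRQ number
        # Keep only the row whose first field is "<irq>:"
        awk -v irq="$irq:" '$1 == irq' "$interrupts"
    done
}
```

For the first port this would be called as `irqs_for_pci_dev 0000:01:00.0`. Empty output would suggest that no MSI/MSI-X vectors are allocated to the device at all. Rows without "eth" in the name would suggest the interrupts exist but were hidden by the grep.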

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.2.0-rc7
Why the issue appeared?
Will be filled on close
oliko created this task. Thu, Nov 15, 10:14 AM
oliko created this object in space S1 VyOS Public.
oliko triaged this task as High priority.
oliko updated the task description.
syncer added a subscriber: syncer. Wed, Nov 21, 2:31 AM

Can you provide info about firmware level?
Thanks
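
One common way to read the firmware level is the `firmware-version` field printed by `ethtool -i <interface>`. A minimal sketch, assuming `ethtool` is installed and that `eth2` is one of the Mellanox ports (the task does not confirm which interface maps to which card). `fw_version` is a hypothetical helper name.

```shell
#!/bin/sh
# fw_version (hypothetical helper) reads `ethtool -i` output on stdin
# and prints only the value of the firmware-version field.
fw_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}

# Usage on the router (interface name is an assumption):
#   ethtool -i eth2 | fw_version
```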

syncer lowered the priority of this task from High to Normal.
oliko added a comment. Wed, Nov 21, 1:39 PM

There is a problem with how the interface name is displayed. Not critical.

dmbaturin changed the task status from Open to In progress. Wed, Nov 28, 11:23 PM
dmbaturin added a subscriber: dmbaturin.

@oliko Could you retest it with rc9, which uses a 4.19.4 kernel?

oliko added a comment. Thu, Nov 29, 8:54 AM

@dmbaturin Yes. We'll try tomorrow morning and give you feedback.

oliko added a comment. Wed, Dec 5, 6:10 AM

@dmbaturin Hello, sorry for the delay. We tested rc10 today. It did not crash, but it is still writing a lot of errors to the logs (see the attachment).