ovn metadata (by quqi99)


Author: Zhang Hua. First published: 2022-08-25
Copyright: this article may be freely reproduced, provided the original source, the author information, and this copyright notice are kept as a hyperlink.

Problem

The customer reports that the metadata service inside their VMs fails intermittently; they suspect: https://bugs.launchpad.net/networking-ovn/+bug/1953591
They also say that once the problem appears, restarting the neutron-ovn-metadata-agent service on the host running the affected VM restores service. Restarting that service re-runs:

/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2 haproxy -f /var/lib/neutron/ovn-metadata-proxy/63e2c276-60dd-40e3-baa1-c16342eacce2.conf

How OVN metadata works

I know neutron metadata well and have not dealt with OVN metadata nearly as much, but the underlying principle should be similar.

  • For an SR-IOV port there is a corresponding external port that provides the OVN metadata service, so OVN metadata for SR-IOV is handled on a gateway chassis, not on the same node as the compute chassis. For the flow see: 20220607 - sriov ovn metadata - https://blog.csdn.net/quqi99/article/details/116893909
  • For a normal VM, metadata is served by the ovnmeta-xxx namespace on the compute chassis where the VM runs.

1, Inside the VM (192.168.21.75), DHCP (192.168.21.2) can push a host route that forwards metadata traffic directly to the metadata proxy neutron-ovn-metadata-agent (192.168.21.2); if DHCP is not enabled, the gateway forwards it to neutron-ovn-metadata-agent instead.

$ ssh -i ~/testkey.priv ubuntu@10.5.150.219 -- ip route list |grep 169
169.254.169.254 via 192.168.21.2 dev ens2 proto dhcp src 192.168.21.75 metric 100
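
From inside the VM the path can be verified end to end; here is a minimal Python probe (curl against 169.254.169.254 works just as well):

import urllib.request
# 169.254.169.254 is reachable via the DHCP host route shown above
resp = urllib.request.urlopen(
    'http://169.254.169.254/openstack/latest/meta_data.json', timeout=5)
print(resp.status, resp.read()[:200])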

$ openstack port list
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
| ID                                   | Name | MAC Address       | Fixed IP Addresses                                                           | Status |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
| 48244606-524d-4c24-bc54-0143869d4267 |      | fa:16:3e:43:81:23 | ip_address='192.168.21.2', subnet_id='7a3260aa-e2a6-4721-8f0a-5b06aec71b53'  | DOWN   |
| 5043ab8b-184d-4d83-abf8-fa044aec9a6d |      | fa:16:3e:e2:8c:a0 | ip_address='10.5.152.84', subnet_id='a0f40836-715c-4dcc-9df1-fb4093aa0f79'   | DOWN   |
| 544e4423-d3d8-418e-8b6c-ea294ff0d1b3 |      | fa:16:3e:a2:83:76 |                                                                              | DOWN   |
| 70b96954-32a2-488b-b67b-4bf039189b02 |      | fa:16:3e:42:29:d4 | ip_address='192.168.21.1', subnet_id='7a3260aa-e2a6-4721-8f0a-5b06aec71b53'  | ACTIVE |
| c99885ae-1778-46cb-af07-29cf32113ef7 |      | fa:16:3e:4d:03:e6 | ip_address='10.5.150.219', subnet_id='a0f40836-715c-4dcc-9df1-fb4093aa0f79'  | N/A    |
| dd6bd055-3460-459e-b9b2-8da7211dacdc |      | fa:16:3e:53:b7:77 | ip_address='192.168.21.75', subnet_id='7a3260aa-e2a6-4721-8f0a-5b06aec71b53' | ACTIVE |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+

2, Each compute node runs one neutron-ovn-metadata-agent; for every network the agent creates an ovnmeta-xxx namespace and starts a haproxy instance inside it.
  If the ovnmeta-xxx namespace does not exist it gets created, and a veth pair connects the namespace end (e.g. tap8ea7d54e-51) to the host end (e.g. tap8ea7d54e-50, which is plugged into br-int).
The haproxy instance listens on 169.254.169.254:80, adds the X-OVN-Network-ID header, and forwards the request to the backend (the unix socket /var/lib/neutron/metadata_proxy).

# nova-compute/0 has three metadata-related processes
/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/neutron_ovn_metadata_agent.ini --log-file=/var/log/neutron/neutron-ovn-metadata-agent.log
/usr/bin/python3 /usr/bin/nova-api-metadata --config-file=/etc/nova/nova.conf --log-file=/var/log/nova/nova-api-metadata.log
haproxy -f /var/lib/neutron/ovn-metadata-proxy/8ea7d54e-5fe0-408d-9de8-5ccbed8ede84.conf

$ sudo ip netns exec ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 ip addr show tap8ea7d54e-51 |grep -E 'link|inet'
    link/ether fa:16:3e:43:81:23 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.169.254/32 brd 169.254.169.254 scope global tap8ea7d54e-51
    inet 192.168.21.2/24 brd 192.168.21.255 scope global tap8ea7d54e-51
$ ip addr show tap8ea7d54e-50 |grep link
    link/ether ca:cb:5c:af:dd:22 brd ff:ff:ff:ff:ff:ff link-netns ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84
$ sudo ovs-vsctl show |grep 8ea7d54e
        Port tap8ea7d54e-50
            Interface tap8ea7d54e-50

$ sudo ip netns exec ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 route -n
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.21.0    0.0.0.0         255.255.255.0   U     0      0        0 tap8ea7d54e-51

$ cat /var/lib/neutron/ovn-metadata-proxy/8ea7d54e-5fe0-408d-9de8-5ccbed8ede84.conf
global
    log         /dev/log local0 info
    user        root
    group       root
    maxconn     1024
    pidfile     /var/lib/neutron/external/pids/8ea7d54e-5fe0-408d-9de8-5ccbed8ede84.pid.haproxy
    daemon
defaults
    log global
    mode http
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor
    retries                 3
    timeout http-request    30s
    timeout connect         30s
    timeout client          32s
    timeout server          32s
    timeout http-keep-alive 30s
listen listener
    bind 169.254.169.254:80
    server metadata /var/lib/neutron/metadata_proxy
    http-request add-header X-OVN-Network-ID 8ea7d54e-5fe0-408d-9de8-5ccbed8ede84

3, neutron-ovn-metadata-agent serves the unix socket (/var/lib/neutron/metadata_proxy) and forwards each request to the local nova-api-metadata (in DVR mode every compute node runs its own nova-api-metadata).
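
To isolate a fault between haproxy and the agent, you can emulate what haproxy does by talking to the unix socket directly. A minimal sketch (run as root on the compute node; the header values are examples taken from the test environment above):

import http.client
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection over a unix domain socket."""
    def __init__(self, path):
        super().__init__('localhost')
        self.unix_path = path
    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.unix_path)

# the same two headers haproxy adds: the requesting VM IP and the network id
conn = UnixHTTPConnection('/var/lib/neutron/metadata_proxy')
conn.request('GET', '/openstack/latest/meta_data.json', headers={
    'X-Forwarded-For': '192.168.21.75',
    'X-OVN-Network-ID': '8ea7d54e-5fe0-408d-9de8-5ccbed8ede84'})
resp = conn.getresponse()
print(resp.status, resp.read()[:200])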

Analysis of the available data

Unfortunately, the customer restarted neutron-ovn-metadata-agent at 17:15 on 07-21 when they hit the problem, and the sosreport was collected only after that workaround had been run, so its reference value is limited.
They mentioned the namespace ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2, so I searched for the information below, but because the sosreport postdates the workaround it reveals nothing conclusive.

$ grep -r 'tap63e2c276' sos_commands/networking/namespaces/ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2/ip_netns_exec_ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2_ip_-d_address_show
2: tap63e2c276-61@if20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 100.94.98.4/23 brd 100.94.99.255 scope global tap63e2c276-61
    inet 169.254.169.254/32 brd 169.254.169.254 scope global tap63e2c276-61

Searching for this IP (100.94.98.4) turns up nothing (find var/log/neutron/* |xargs zgrep -i -E '100.94.98.4,'), but a log line about 100.94.98.40 does show up:

var/log/neutron/neutron-ovn-metadata-agent.log.1:2022-07-20 21:15:56.758 3872 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.40

This log line comes from ./neutron/agent/ovn/metadata/server.py#_get_instance_and_project_id: the Port_Binding lookup in the SB DB returned 0 rows.

    def _get_instance_and_project_id(self, req):
        remote_address = req.headers.get('X-Forwarded-For')
        network_id = req.headers.get('X-OVN-Network-ID')

        ports = self.sb_idl.get_network_port_bindings_by_ip(network_id,
                                                            remote_address)
        num_ports = len(ports)
        if num_ports == 1:
            external_ids = ports[0].external_ids
            return (external_ids[ovn_const.OVN_DEVID_EXT_ID_KEY],
                    external_ids[ovn_const.OVN_PROJID_EXT_ID_KEY])
        elif num_ports == 0:
            LOG.error("No port found in network %s with IP address %s",
                      network_id, remote_address)

    def get_network_port_bindings_by_ip(self, network, ip_address):
        rows = self.db_list_rows('Port_Binding').execute(check_error=True)
        # TODO(twilson) It would be useful to have a db_find that takes a
        # comparison function

        def check_net_and_ip(port):
            # If the port is not bound to any chassis it is not relevant
            if not port.chassis:
                return False

            # TODO(dalvarez): Remove the comparison to port.datapath.uuid in Y
            # cycle when we are sure that all namespaces will be created with
            # the Neutron network UUID and not anymore with the OVN datapath
            # UUID.
            is_in_network = lambda port: (
                str(port.datapath.uuid) == network or
                utils.get_network_name_from_datapath(port.datapath) == network)

            return port.mac and is_in_network(port) and (
                    ip_address in port.mac[0].split(' '))

        return [r for r in rows if check_net_and_ip(r)]

However, searching for get_network_port_bindings_by_ip turns up nothing either (find var/log/neutron/* |xargs zgrep -i -E 'get_network_port_bindings_by_ip'), so this should not be a case of failing to reach the SB DB: if the SB DB were unreachable, get_network_port_bindings_by_ip would raise, and the resulting traceback would contain that function name. So most likely the Port_Binding query really did return 0 rows. In my test environment 'ovn-sbctl list Port_Binding' shows a record like the following (the metadata port there is 192.168.21.2):

_uuid               : a011fdcd-15ba-48c3-aec0-2ef590aee864
chassis             : []
datapath            : 0f509be3-9b1f-4c66-9c59-ac4211cf5ef4
encap               : []
external_ids        : {"neutron:cidrs"="192.168.21.2/24", "neutron:device_id"=ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84, "neutron:device_owner"="network:distributed", "neutron:network_name"=neutron-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84, "neutron:port_name"="", "neutron:project_id"="728f2ae8aa8443b5a255574a5e236746", "neutron:revision_number"="2", "neutron:security_group_ids"=""}
gateway_chassis     : []
ha_chassis_group    : []
logical_port        : "48244606-524d-4c24-bc54-0143869d4267"
mac                 : ["fa:16:3e:43:81:23 192.168.21.2"]
nat_addresses       : []
options             : {requested-chassis=""}
parent_port         : []
tag                 : []
tunnel_key          : 1
type                : localport
up                  : false
virtual_parent      : []
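
On a busy cloud, a crude helper like the following mimics the agent's lookup for a given VM IP (assuming ovn-sbctl on the node can reach the SB DB; the matching is purely textual, so an IP mentioned only in external_ids also matches):

import subprocess

def port_binding_records_with_ip(ip):
    # `ovn-sbctl list Port_Binding` separates records with blank lines
    out = subprocess.run(['ovn-sbctl', 'list', 'Port_Binding'],
                         capture_output=True, text=True, check=True).stdout
    return [rec for rec in out.split('\n\n') if ip in rec]

# an empty result corresponds to the "No port found" error above
for rec in port_binding_records_with_ip('192.168.21.75'):
    print(rec, end='\n\n')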

Who creates the Port_Binding rows, then?
Separately, there are plenty of log entries like the following, which suggest nova-api-metadata could not reach neutron-api (9696):

var/log/nova/nova-api-metadata.log:2022-07-21 13:41:24.086 1790270 ERROR nova.api.metadata.handler [req-42f38810-e2c3-4df0-8de6-754d1f6d5472 - - - - -] Failed to get metadata for instance id: 80f1bb26-f5db-4919-abb2-6b61d13f9267

var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:21:46.371 3516 ERROR oslo.messaging._drivers.impl_rabbit [-] [e96b9581-fa77-4f0f-a580-05a13b18bfad] AMQP server on 192.168.0.124:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
...
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.928 3514 ERROR nova.api.metadata.handler [req-c4d5a1b4-1c36-4864-b729-d68de13ef59d - - - - -] Failed to get metadata for instance id: cc1c7466-b5d4-47d0-8e23-e91c785fafb3: neutronclient.common.exceptions.NeutronClientException: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler neutronclient.common.exceptions.NeutronClientException: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler <html><head>
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler <title>502 Proxy Error</title>
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler </head><body>
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler <h1>Proxy Error</h1>
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler <p>The proxy server received an invalid^M
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler response from an upstream server.<br />^M
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler <hr>
var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler <address>Apache/2.4.41 (Ubuntu) Server at neutron.internal.api.xxx.infra.xxx.net Port 9696</address>

There are also many entries like this:

2022-07-20 23:59:18.828 1790159 INFO nova.compute.resource_tracker [req-25cb1333-d268-4870-877d-8465aea2b451 - - - - -] Instance 4d35b4ff-177c-4659-b6a4-d4d1d5ddecab has allocations against this compute host but is not found in the database.

And these logs were found as well:

2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/backend/ovs_idl/command.py", line 42, in execute
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     t.add(self)
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     next(self.gen)
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/api.py", line 110, in transaction
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     del self._nested_txns_map[cur_thread_id]
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/api.py", line 61, in __exit__
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     raise exceptions.TimeoutException(
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command
2022-07-11 16:37:25.780 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connected
2022-07-11 16:37:25.817 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: clustered database server is not cluster leader; trying another server
2022-07-11 16:37:25.818 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connection closed by client
2022-07-11 16:37:33.831 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.204:6642: connected
2022-07-11 16:37:34.566 2361 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync

So the metadata agent timed out talking to the SB DB, which left the Port_Binding query returning 0. I found lp bug https://bugs.launchpad.net/neutron/+bug/1951841, which says an error is raised when the system-id is not a UUID; we also saw the following log:

2022-07-11 18:57:45.926 798838 WARNING neutron.agent.ovn.metadata.agent [-] Can't read ovn-bridge external-id from OVSDB. Using br-int instead.
2022-07-11 18:57:45.926 798838 INFO neutron.agent.ovn.metadata.agent [-] Loaded chassis name xxx.maas (UUID: 4fe633c2-0c52-5c86-92b2-0ee1f5e71a90) and ovn bridge br-int.
2022-07-11 18:57:45.927 798838 INFO oslo_service.service [-] Starting 2 workers

    def _get_ovn_bridge(self):
        """Return the external_ids:ovn-bridge value of the Open_vSwitch table.

        This is the OVS bridge used to plug the metadata ports to.
        If the key doesn't exist, this method will return 'br-int' as default.
        """
        ext_ids = self.ovs_idl.db_get(
            'Open_vSwitch', '.', 'external_ids').execute()
        try:
            return ext_ids['ovn-bridge']
        except KeyError:
            LOG.warning("Can't read ovn-bridge external-id from OVSDB. Using "
                        "br-int instead.")
            return 'br-int'

But isn't the failure to read external_ids, again, down to the Port_Binding not existing?

Ruling out lp bug 1951841

https://bugs.launchpad.net/neutron/+bug/1951841
Besides adding a Chassis record for every compute node (ovn-sbctl list Chassis), OVN also adds a Chassis_Private record for the metadata agent (the agent-id of Chassis_Private is derived from the Chassis system-id). The code below assumes self.chassis_id is a UUID, but charm deployments do not necessarily use a UUID; as a result the neutron-ovn-metadata-agent daemon stays running but disconnected, and newly launched VMs won't have access to the metadata service.

    def register_metadata_agent(self):
        # NOTE(lucasagomes): db_add() will not overwrite the UUID if
        # it's already set.
        table = ('Chassis_Private' if self.has_chassis_private else 'Chassis')
        # Generate unique, but consistent metadata id for chassis name
        agent_id = uuid.uuid5(self.chassis_id, 'metadata_agent')
        ext_ids = {ovn_const.OVN_AGENT_METADATA_ID_KEY: str(agent_id)}
        self.sb_idl.db_add(table, self.chassis, 'external_ids',
                           ext_ids).execute(check_error=True)

# ovn-sbctl list Chassis_Private
_uuid               : 45524f21-0ca2-4736-a851-e5db69dad805
chassis             : 4e2fabf4-3834-4137-8454-3525e541d652
external_ids        : {"neutron:ovn-metadata-id"="26e2474c-75fc-5977-a311-828b9ad92180", "neutron:ovn-metadata-sb-cfg"="5"}
name                : juju-3df241-ovn-8.cloud.sts
nb_cfg              : 5
nb_cfg_timestamp    : 1661337217145

The following experiment failed to reproduce the problem; perhaps a single attempt is not enough to hit it reliably?

$ juju ssh nova-compute/0 sudo ip netns | grep ovnmeta
ovnmeta-0211506b-233e-4773-a034-3950dfefe23d
ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 (id: 0)

$ openstack server delete focal-103100
$ juju ssh nova-compute/0 sudo ip netns | grep ovnmeta
<empty>

$ juju status ovn-central | grep leader
ovn-central  21.09.0  active      3  ovn-central  21.09/stable   24  no       Unit is ready (leader: ovnnb_db)
ovn-central/0   active    idle   9        10.5.1.22       6641/tcp,6642/tcp  Unit is ready (leader: ovnnb_db)
ovn-central/2*  active    idle   11       10.5.3.101      6641/tcp,6642/tcp  Unit is ready (leader: ovnsb_db)
$ juju ssh $(juju status ovn-central | grep leader | tail -n 1 | awk '{print $1}' | tr -d '*') sudo reboot
#wait a few minutes
$ juju status ovn-central | grep leader
ovn-central  21.09.0  active      3  ovn-central  21.09/stable   24  no       Unit is ready (leader: ovnnb_db)
ovn-central/0   active    idle   9        10.5.1.22       6641/tcp,6642/tcp  Unit is ready (leader: ovnnb_db)
ovn-central/1   active    idle   10       10.5.0.186      6641/tcp,6642/tcp  Unit is ready (leader: ovnsb_db northd: active)

$ ./tools/instance_launch.sh 1 focal
$ ./tools/float_all.sh
$ juju ssh nova-compute/0 sudo ip netns | grep ovnmeta
ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 (id: 0)

When a Port_Binding row is updated in OVN, neutron receives a PortBindingUpdateVirtualPortsEvent, and neutron's update_virtual_port_host then creates the corresponding PortBinding record on the neutron side.

#https://bugs.launchpad.net/neutron/+bug/1951841/comments/28
Previously neutron-ovn-metadata-agent created its Chassis_Private entry with a random agent-id on every start; this patch (https://github.com/openstack/neutron/commit/da3ce7319866e8dc874d405e91f9af160e2c3d31) made the id deterministic, but it computes chassis_id = uuid.UUID(self._get_own_chassis_name()) and agent_id = uuid.uuid5(chassis_id, 'metadata_agent'), which requires _get_own_chassis_name() to return a UUID and throws an exception otherwise.
Why does the metadata agent never reconnect to the SB DB and re-register Chassis_Private after the exception? Because on its first invocation ChassisCreateEventBase#run sees first_time=True and deliberately skips self.agent.register_metadata_agent():

    def run(self, event, row, old):
        if self.first_time:
            self.first_time = False
        else:
            # NOTE(lucasagomes): Re-register the ovn metadata agent
            # with the local chassis in case its entry was re-created
            # (happens when restarting the ovn-controller)
            self.agent.register_metadata_agent()
            LOG.info("Connection to OVSDB established, doing a full sync")
            self.agent.sync()

The fix is to support a non-UUID system-id - https://opendev.org/openstack/neutron/commit/6da4432fed255f3bcf3831f5d0520ab389ce36e5
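
The essence of both the failure and the fix can be reproduced with plain uuid calls (a sketch of the idea only, not the exact upstream patch; the fallback namespace here is illustrative):

import uuid

chassis_name = 'juju-3df241-ovn-8.cloud.sts'  # charm deployments often use hostnames
try:
    # before the fix: this raises ValueError for a non-UUID system-id,
    # so the agent never registered its Chassis_Private entry
    chassis_id = uuid.UUID(chassis_name)
except ValueError:
    # after the fix: derive a stable UUID from the name instead
    chassis_id = uuid.uuid5(uuid.NAMESPACE_DNS, chassis_name)
agent_id = uuid.uuid5(chassis_id, 'metadata_agent')
print(agent_id)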

But that fix already landed in 19.1.0, and the customer runs 19.3.0, so this is clearly not the bug.

Ruling out lp bug 1953510

#https://bugs.launchpad.net/neutron/+bug/1953510/comments/2
The ovn-central side always transfers leadership when it writes a snapshot - https://github.com/openvswitch/ovs/commit/3c2d6274bceecb65ec8f2f93f2aac26897a7ddfe

On the SB DB side the logs show:

$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'leadership'
sosreport-juju-2752e1-6-lxd-24-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80967|raft|INFO|received leadership transfer from 9b15 in term 295
sosreport-juju-2752e1-6-lxd-24-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|92695|raft|INFO|received leadership transfer from 0f3c in term 294
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296

The neutron side shows matching logs; does this mean a leader switch happened exactly while Chassis_Private was being written?

2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command
2022-07-11 16:37:25.780 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connected
2022-07-11 16:37:25.817 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: clustered database server is not cluster leader; trying another server

Further flow analysis

1, chassis_run in ovn-controller reports chassis information and creates the Chassis record in the SB DB.

2, sbrec_port_binding_insert in ovn-northd inserts ports (both lsp and lrp) from the NB DB into the SB DB's Port_Binding table; see: ovn-northd source analysis - http://events.jianshu.io/p/e2139ecd0463

ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=ssl:100.94.0.158:6641,ssl:100.94.0.99:6641,ssl:100.94.0.204:6641 --ovnsb-db=ssl:100.94.0.158:16642,ssl:100.94.0.99:16642,ssl:100.94.0.204:16642 -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --no-chdir --log-file=/var/log/ovn/ovn-northd.log --pidfile=/var/run/ovn/ovn-northd.pid --detach

2022-08-18T09:15:46.193Z|28332|ovsdb_cs|INFO|ssl:100.94.0.204:16642: clustered database server is not cluster leader; trying another server
2022-08-18T09:15:46.193Z|28333|reconnect|INFO|ssl:100.94.0.204:16642: connection attempt timed out
2022-08-18T09:15:46.194Z|28334|reconnect|INFO|ssl:100.94.0.99:16642: connecting...
2022-08-18T09:15:46.199Z|28335|reconnect|INFO|ssl:100.94.0.99:16642: connected
2022-08-18T09:15:46.201Z|28336|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-08-18T09:15:46.202Z|28337|northd|WARN|Dropped 3 log messages in last 86113 seconds (most recently, 86113 seconds ago) due to excessive rate
2022-08-18T09:15:46.202Z|28338|northd|WARN|Chassis does not exist for Chassis_Private record, name: juju-2752e1-5-lxd-24.maas
2022-08-18T09:15:46.202Z|28339|ovsdb_cs|INFO|ssl:100.94.0.99:16642: clustered database server is not cluster leader; trying another server
2022-08-18T09:15:46.203Z|28340|reconnect|INFO|ssl:100.94.0.99:16642: connection attempt timed out
2022-08-18T09:15:46.203Z|28341|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-08-18T09:15:46.203Z|28342|reconnect|INFO|ssl:100.94.0.158:16642: connecting...
2022-08-18T09:15:46.207Z|28343|reconnect|INFO|ssl:100.94.0.158:16642: connected
2022-08-18T09:15:46.209Z|28344|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-08-18T09:15:46.210Z|28345|ovsdb_cs|INFO|ssl:100.94.0.158:16642: clustered database server is not cluster leader; trying another server
2022-08-18T09:15:46.210Z|28346|reconnect|INFO|ssl:100.94.0.158:16642: connection attempt timed out
2022-08-18T09:15:46.210Z|28347|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-08-18T09:15:46.210Z|28348|reconnect|INFO|ssl:100.94.0.204:16642: connecting...
2022-08-18T09:15:46.215Z|28349|reconnect|INFO|ssl:100.94.0.204:16642: connected

3, ovn-northd keeps logging, as above, that it cannot stay connected to the SB DB, so it cannot insert Port_Binding rows; consequently neutron never receives a PortBindingUpdateVirtualPortsEvent and never creates the PortBinding record via update_virtual_port_host.

4, This leaves _get_instance_and_project_id in neutron-ovn-metadata-agent with 0 Port_Binding rows, which produces:
var/log/neutron/neutron-ovn-metadata-agent.log.1:2022-07-20 21:15:56.758 3872 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.40

5, The SB DB leader keeps switching; if a VM happens to be created during a switch, errors like the following may appear.
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty

6, Logs like this one also come down to the missing Port_Binding, which is why its external_ids can't be read - Can't read ovn-bridge external-id from OVSDB

7, With Port_Binding missing we are back at step 2 above (Port_Binding is created by ovn-northd, which cannot reach the SB DB), so the problem lies with the SB DB.

8, Why is the SB DB unhealthy?

$ grep -r 'leader' sosreport-juju-2752e1-*/var/log/ovn/ovsdb-server-sb.log
sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.863Z|31963|raft|INFO|server 0f3c is leader for term 294
sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|32086|raft|INFO|server 9b15 is leader for term 295
sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.613Z|32088|raft|INFO|server 0f3c is leader for term 296
sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.379Z|32327|raft|INFO|server 9b15 is leader for term 297
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.862Z|80388|raft|INFO|term 294: elected leader by 2+ of 3 servers
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.324Z|80963|raft|INFO|rejected append_reply (not leader)
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.609Z|80964|raft|INFO|rejected append_reply (not leader)
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.609Z|80965|raft|INFO|rejected append_reply (not leader)
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80966|raft|INFO|server 9b15 is leader for term 295
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80967|raft|INFO|received leadership transfer from 9b15 in term 295
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.613Z|80969|raft|INFO|term 296: elected leader by 2+ of 3 servers
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.367Z|82383|raft|INFO|rejected append_reply (not leader)
sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|82384|raft|INFO|server 9b15 is leader for term 297
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:51.200Z|92562|raft|INFO|server 0f3c is leader for term 294
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|92695|raft|INFO|received leadership transfer from 0f3c in term 294
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92697|raft|INFO|term 295: elected leader by 2+ of 3 servers
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.334Z|92699|raft|INFO|rejected append_reply (not leader)
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.728Z|92700|raft|INFO|server 0f3c is leader for term 296
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers

Then this turned up:

$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'inactivity'
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-nb.log:2022-08-18T09:15:49.882Z|657323|reconnect|ERR|ssl:100.94.0.99:40308: no response to inactivity probe after 5 seconds, disconnecting
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-nb.log:2022-08-18T09:15:49.883Z|657324|reconnect|ERR|ssl:100.94.0.158:51592: no response to inactivity probe after 5 seconds, disconnecting
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:47.313Z|92550|reconnect|ERR|ssl:100.94.0.158:6644: no response to inactivity probe after 5 seconds, disconnecting
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.815Z|92555|reconnect|ERR|ssl:100.94.0.99:45744: no response to inactivity probe after 5 seconds, disconnecting
sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.816Z|92556|reconnect|ERR|ssl:100.94.0.158:44928: no response to inactivity probe after 5 seconds, disconnecting

See '20220427 - ovn performance issue' - https://blog.csdn.net/quqi99/article/details/103194137 ; the following tuning is worth trying:

# on compute chassis
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=0
ovs-vsctl set open . external_ids:ovn-monitor-all=true
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=100000
$ sudo ovs-vsctl get open . external_ids
{hostname=juju-3df241-ovn-8.cloud.sts, ovn-bridge-mappings="physnet1:br-data", ovn-encap-ip="10.5.0.110", ovn-encap-type=geneve, ovn-match-northd-version="true", ovn-remote="ssl:10.5.1.22:6642,ssl:10.5.0.186:6642,ssl:10.5.3.101:6642", rundir="/var/run/openvswitch", system-id=juju-3df241-ovn-8.cloud.sts}

#on SB_DB and NB_DB
ovn-nbctl set connection . inactivity_probe='180000'
ovn-sbctl set connection . inactivity_probe='180000'
ovn-nbctl get connection . inactivity_probe

#juju
juju config ovn-central ovsdb-server-inactivity-probe

Confirming the SB DB problem

You can run "ovn-appctl -t ovn-northd status" on every ovn-central unit; it reports whether that unit is active or standby. If the command hangs, ovn-northd itself is in a bad state and may need a restart; this helps confirm whether all ovn-central units are affected.
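
A quick way to script that check on a unit (a sketch; the timeout guards against the stuck case):

import subprocess

# 'Status: active' should appear on exactly one ovn-central unit and
# 'Status: standby' on the others; a hang suggests ovn-northd is stuck
try:
    out = subprocess.run(['ovn-appctl', '-t', 'ovn-northd', 'status'],
                         capture_output=True, text=True, timeout=10)
    print(out.stdout.strip() or out.stderr.strip())
except subprocess.TimeoutExpired:
    print('ovn-northd did not answer within 10s - likely stuck; consider a restart')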
There are also the following commands:

#https://github.com/sosreport/sos/commit/499b1ebd1341b524eca294762332939fe0408416
#cat sos_commands/ovn_central/ovs-appctl_-t_.var.run.ovn.ovnnb_db.ctl_cluster.status_OVN_Northbound
$ ls /var/run/ovn/*.ctl
$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
2394
Name: OVN_Northbound
Cluster ID: 74bc (74bcea93-e506-4c88-97bf-e0094b9ca634)
Server ID: 2394 (23940573-b5be-465a-9e0f-b3ab9941335d)
Address: ssl:10.5.3.101:6643
Status: cluster member
Role: follower
Term: 7
Leader: ad8a
Vote: ad8a
Last Election started 5567642 ms ago, reason: timeout
Election timer: 4000
Log: [93, 100]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->f38b ->ad8a <-f38b <-ad8a
Disconnections: 0
Servers:
    f38b (f38b at ssl:10.5.1.22:6643) last msg 5564502 ms ago
    ad8a (ad8a at ssl:10.5.0.186:6643) last msg 197 ms ago
    2394 (2394 at ssl:10.5.3.101:6643) (self)
$ sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
3dd6
Name: OVN_Southbound
Cluster ID: 0993 (09934a9f-1948-48ec-9cba-da7c6efba970)
Server ID: 3dd6 (3dd60330-5607-4817-bc53-1e08ad972c02)
Address: ssl:10.5.3.101:6644
Status: cluster member
Role: follower
Term: 10
Leader: 9c8f
Vote: self
Last Election started 5620622 ms ago, reason: timeout
Election timer: 4000
Log: [114, 125]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->8ff3 ->9c8f <-9c8f <-8ff3
Disconnections: 0
Servers:
    3dd6 (3dd6 at ssl:10.5.3.101:6644) (self)
    8ff3 (8ff3 at ssl:10.5.0.186:6644) last msg 5619344 ms ago
    9c8f (9c8f at ssl:10.5.1.22:6644) last msg 222 ms ago

Event Timeline

1, on 2022-07-11, the SB DB was presumably unhealthy, since neutron-ovn-metadata-agent reported the error 'exceeded timeout 180 seconds, cause: Result queue is empty'

2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command
2022-07-11 16:37:25.780 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connected
2022-07-11 16:37:25.817 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: clustered database server is not cluster leader; trying another server
2022-07-11 16:37:25.818 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connection closed by client
2022-07-11 16:37:33.831 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.204:6642: connected
2022-07-11 16:37:34.566 2361 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync

This error should have nothing to do with the current issue, because it happened on 2022-07-11, not on 2022-08-18.

2, on 2022-08-18, the SB DB looked healthy: the error 'exceeded timeout 180 seconds, cause: Result queue is empty' appears only in sosreport-xxx-2022-07-21-baldxuf, not in sosreport-xxx-2022-08-18-bgnovqu

3, at 2022-08-18T09:15:47, an inactivity probe error occurred

var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:47.313Z|92550|reconnect|ERR|ssl:100.94.0.158:6644: no response to inactivity probe after 5 seconds, disconnecting

The sosreport shows 100.94.0.204 was the leader at that time, so this inactivity probe message means the leader 100.94.0.204 did not receive a reply from 100.94.0.158:6644 within 5 seconds and dropped the TCP connection between 100.94.0.204 and 100.94.0.158 to avoid wasting resources. So this should not be the key to the problem.

    5-lxd-23: f801 100.94.0.99:6644 follower term=297
    6-lxd-24: 0f3c 100.94.0.158:6644 follower term=297
    7-lxd-27: 9b15 100.94.0.204:6644 leader term=297

4, at 2022-08-18 09:15:50, neutron-ovn-metadata-agent started calling self.agent.register_metadata_agent() to create the Chassis_Private record in the SB DB

$ find sosreport-srv1ibm00*/var/log/neutron/* |xargs zgrep -i -E 'Connection to OVSDB established, doing a full'
xxx/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 09:15:50.726 75556 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync

(this message is printed by the same run() handler quoted earlier:)
    def run(self, event, row, old):
        if self.first_time:
            self.first_time = False
        else:
            # NOTE(lucasagomes): Re-register the ovn metadata agent
            # with the local chassis in case its entry was re-created
            # (happens when restarting the ovn-controller)
            self.agent.register_metadata_agent()
            LOG.info("Connection to OVSDB established, doing a full sync")
            self.agent.sync()

5, at 2022-08-18T12:30:35, a re-election took place, triggered by snapshot writing

$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'received leadership transfer from'
sosreport-juju-2752e1-6-lxd-24-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80967|raft|INFO|received leadership transfer from 9b15 in term 295
sosreport-juju-2752e1-7-lxd-27-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|92695|raft|INFO|received leadership transfer from 0f3c in term 294
sosreport-juju-2752e1-7-lxd-27-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296

# the timestamps show the re-elections were triggered by snapshot writes
$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'Transferring leadership'
sosreport-juju-2752e1-6-lxd-24-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.

6, at 2022-08-18 12:30:36, neutron-ovn-metadata-agent could not complete its transaction against the SB DB

$ find sosreport-srv1ibm00*/var/log/neutron/* |xargs zgrep -i -E 'OVSDB Error'
sosreport-srv1ibm002d-xxx/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.103 75556 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
sosreport-srv1ibm006d-xxx/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.104 2171 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available

7, at 2022-08-18 12:57:53, the port lookup for a metadata request failed

2022-08-18 12:57:53.800 3730 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.135

8, at 2022-08-18 18:30:01, the workaround was run: restart neutron-ovn-metadata-agent.service

So the problem is that neutron was creating the metadata port at exactly the moment the SB DB transferred leadership before writing a snapshot. We can of course make snapshots less frequent (https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1990978; that feature is available in ovs 2.17.3, while the customer runs focal-xena: 2.16.0-0ubuntu2.1~cloud0). But even while the SB DB is transferring leadership, creating the metadata port should not simply fail; this lp bug (https://bugs.launchpad.net/neutron/+bug/1961013) lets neutron connect to a read-only non-leader node to reduce the failure rate.
But isn't neutron's creation of the metadata port a write operation?
