Once again, a potentially catastrophic accident has been averted. Those who don’t want to see the process can skip to the end and see the solution.

1. A network error

One day, a test application was built on KplCloud. After the build, the new pod failed to start, throwing the following error:

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "xxxxxx-fc4cb949f-gpkm2_xxxxxxx" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

Our K8s ops colleague had been away for quite a while; what do we do when something suddenly breaks?

Time to start troubleshooting.

First, an image pull failure was ruled out, so start narrowing down the problem (a quick sketch of these checks follows the list):

  1. Log in to every server in the cluster and check whether disk space is full (it is not)
  2. Check the network status of every server in the cluster (no problems found)
  3. Try starting another pod (it will not come up either)
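
A rough sketch of those checks, assuming SSH access to each node plus kubectl at hand; the pod name and test image are placeholders:

$ df -h                                  # on each node: is any filesystem full?
$ ping -c 3 10.xx.xx.2                   # basic connectivity between nodes
$ kubectl run net-test --image=busybox --restart=Never -- sleep 3600
$ kubectl get pod net-test -o wide       # does a fresh pod get an IP and start?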

This is embarrassing…… Could it be Calico?

2. Check the error information reported by the server

Try the following command to see the server error message:

$ journalctl -exf 

The output does contain some error messages, but they are far too generic to act on, so keep looking for the problem elsewhere.
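
Narrowing journalctl to the kubelet and container runtime units usually surfaces more relevant lines than tailing everything; a quick sketch (unit names depend on how the nodes were installed):

$ journalctl -u kubelet -f      # kubelet logs only
$ journalctl -u docker -f       # container runtime logs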

Already thinking about how to run away…

Could a simple restart fix it?

The risk is too great to take. Restarting Docker or the K8s components is not the best option in this case, even though a reboot does solve most problems in many situations.

After digging through the logs and guessing that this was an IP assignment problem, attention turned to Calico.

3. Find the problem in calico-node

Query whether the IP address pool is used up.

Run the calicoctl command to check whether Calico is running properly:

$ calicoctl get ippools -o wide
CIDR            NAT    IPIP
172.20.0.0/16   true   false

$ calicoctl node status
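
To get a rough sense of how much of that pool is actually in use, counting pod IPs from the Kubernetes side is a quick sanity check; a sketch, where the grep pattern simply matches the pool prefix shown above:

$ kubectl get pods --all-namespaces -o wide | grep -c '172\.20\.'

A /16 pool holds roughly 65,000 addresses, so unless that count is enormous, pool exhaustion is unlikely to be the culprit.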

There seems to be no problem.

Time to start asking for outside help……

Nothing came of it.

Since calico-node was running properly, calico-etcd should, in theory, be fine as well.

4. Try calico-etcd

Still, in the spirit of questioning everything, it was worth a look, so the next step was a round of poking at calico-etcd.

To keep the examples short and readable, note that every etcdctl command below actually needs the certificate and endpoint flags shown here; they are omitted from the rest of the examples:

ETCDCTL_API=3 etcdctl --cacert=/etc/etcd/ssl/ca.pem \
  --cert=/etc/etcd/ssl/etcd.pem \
  --key=/etc/etcd/ssl/etcd-key.pem \
  --endpoints=http://10.xx.xx.1:2379,http://10.xx.xx.2:2379,http://10.xx.xx.3:2379
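
One way to avoid retyping those flags is to export them as environment variables; etcdctl (API v3) generally honors ETCDCTL_-prefixed variables for its flags. A sketch using the same paths and endpoints:

$ export ETCDCTL_API=3
$ export ETCDCTL_CACERT=/etc/etcd/ssl/ca.pem
$ export ETCDCTL_CERT=/etc/etcd/ssl/etcd.pem
$ export ETCDCTL_KEY=/etc/etcd/ssl/etcd-key.pem
$ export ETCDCTL_ENDPOINTS=http://10.xx.xx.1:2379,http://10.xx.xx.2:2379,http://10.xx.xx.3:2379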

First, take a look at the calico-etcd cluster members:

$ ETCDCTL_API=3 etcdctl member list
bde98346d77cfa1: name=node-1 peerURLs=http://10.xx.xx.1:2380 clientURLs=http://10.xx.xx.1:2379 isLeader=true
299fcfbf514069ed: name=node-2 peerURLs=http://10.xx.xx.2:2380 clientURLs=http://10.xx.xx.2:2379 isLeader=false
954e5cdb2d25c491: name=node-3 peerURLs=http://10.xx.xx.3:2380 clientURLs=http://10.xx.xx.3:2379 isLeader=false

The cluster also appears to be working well, and reading data with get works fine.
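
For example, reading existing keys back still works; a sketch, where the /calico prefix assumes Calico is using etcd as its datastore with the default key layout:

$ ETCDCTL_API=3 etcdctl get /calico --prefix --keys-only | head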

Everything looked and felt so normal that there seemed to be nothing wrong with it.

Forget it, forget it. Maybe I should just go polish my resume and clear my head.

How about trying to write a piece of data to etcd?

$ ETCDCTL_API=3 etcdctl put /hello world

Error:  etcdserver: mvcc: database space exceeded

Error: etcdserver: mvcc: database space exceeded??

It looks like we have found the cause. Now that the problem has been located, the rest is easy. *(No need to run away after all (⁎⁍̴̛ᴗ⁍̴̛⁎))* The resume can wait.

Thanks to the mighty Google, some clues and a solution turned up on etcd's official website. Following the official documentation, let's fix the immediate problem first:

Run the etcdctl endpoint status command to query the status of each etcd node:

$ ETCDCTL_API=3 etcdctl endpoint status
http://10.xx.xx.1:2379, 299fcfbf514069ed, 3.2.18, 2.1 GB, false, 7, 8701663
http://10.xx.xx.2:2379, bde98346d77cfa1, 3.2.18, 2.1 GB, true, 7, 8701683
http://10.xx.xx.3:2379, 954e5cdb2d25c491, 3.2.18, 2.1 GB, false, 7, 8701687

As you can see, each member's database size has reached 2.1 GB, and that is the number to pay attention to.
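
As an aside, the same command can print a labelled table, which makes those comma-separated fields (endpoint, member ID, version, DB size, leader flag, raft term, raft index) easier to read:

$ ETCDCTL_API=3 etcdctl endpoint status --write-out=table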

Run the etcdctl alarm list command to check whether any alarms are active on etcd:

$ ETCDCTL_API=3 etcdctl alarm list
memberID:2999344297460918765 alarm:NOSPACE

It shows alarm:NOSPACE, which means we are out of space. But out of which space? Disk or memory? Let's find out.

Disk and memory both turn out to have plenty of room. The real limit is etcd itself: the default backend quota of etcd v3 is 2 GB, in other words the database may grow to at most 2 GB by default. Once the quota is exceeded, no more data can be written until old data is deleted or compacted.

Refer to the official solution

etcd.io/docs/v3.2.1…

  1. Get etcd's current revision number

    $ ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
    5395771
    5395771
    5395771
  2. Compact the old revisions

    $ ETCDCTL_API=3 etcdctl compact 5395771
    compacted revision 5395771
  3. Defragment

    $ ETCDCTL_API=3 etcdctl defrag
    Finished defragmenting etcd member[http://10.xx.xx.1:2379]
    Finished defragmenting etcd member[http://10.xx.xx.2:2379]
    Finished defragmenting etcd member[http://10.xx.xx.3:2379]
  4. Turn off the alarm

    $ ETCDCTL_API=3 etcdctl alarm disarm
    memberID:2999344297460918765 alarm:NOSPACE
    
    $ ETCDCTL_API=3 etcdctl alarm list
  5. Test whether data can be written

    $ ETCDCTL_API=3 etcdctl put /hello world
    OK
    
    $ ETCDCTL_API=3 etcdctl get /hello
    /hello
    world
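
For convenience, the recovery steps above can be strung together in one small script; a minimal sketch, assuming the certificate and endpoint flags from earlier are supplied (for example via the exported ETCDCTL_ variables):

#!/bin/bash
# Compact away old revisions, defragment, then disarm the NOSPACE alarm.
rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" \
      | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*' | head -n 1)
ETCDCTL_API=3 etcdctl compact "$rev"
ETCDCTL_API=3 etcdctl defrag
ETCDCTL_API=3 etcdctl alarm disarm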

Go back to K8s, delete the failed pods, and check whether IPs can now be assigned properly.

Everything is right and perfect.

To avoid similar problems in the future, automatic compaction needs to be configured. This is done by adding the auto-compaction parameter (--auto-compaction-retention=1) to the etcd startup parameters.

skyao.gitbooks.io/learning-et…

etcd does not compact automatically by default; you have to set a startup parameter or run the compact command yourself. If your data changes frequently, enabling auto-compaction is strongly recommended, otherwise space and memory are wasted and you will eventually hit the error "etcdserver: mvcc: database space exceeded", after which no data can be written.
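
As an illustration, the flag goes alongside the cluster's existing startup parameters (in the systemd unit, static pod manifest, or wherever etcd is launched). A minimal sketch of an etcd invocation with only the relevant flags shown, the other values being placeholders:

$ etcd --name=node-1 --data-dir=/var/lib/etcd --auto-compaction-retention=1

With etcd v3.2, a retention value of 1 keeps roughly the last hour of history. If 2 GB is genuinely too small even with compaction, the quota itself can also be raised with etcd's --quota-backend-bytes flag.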

As for why so much garbage data was generated: scheduling in our cluster is very frequent. A large number of CronJobs run constantly and very actively, and every newly created pod is assigned an IP. Because those pods are extremely short-lived, or were not deregistered in time, calico-etcd accumulated a large amount of stale data.

Epilogue

Because the calico-etcd cluster's quota was used up, the IPs that Calico assigned during pod creation could not be written to etcd. As a result, pod creation failed and the pods could not be registered with CoreDNS.

Monitoring is essential if you want to avoid this kind of pit. We did have monitoring on the etcd cluster, but we had overlooked monitoring its quota usage. Fortunately, no restarts or upgrades were applied during that window, so no real damage was done.
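
Until proper quota metrics are wired into the monitoring system, even a periodic shell check against the DB size reported by etcd can help; a rough sketch, where the threshold and the JSON parsing are purely illustrative and the ETCDCTL_ variables from earlier are assumed:

#!/bin/bash
# Warn when any member's database size (in bytes) approaches the 2 GB quota.
threshold=1700000000   # ~1.7 GB, illustrative
for size in $(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" \
              | egrep -o '"dbSize":[0-9]+' | egrep -o '[0-9]+'); do
    if [ "$size" -gt "$threshold" ]; then
        echo "etcd DB size ${size} bytes is approaching the quota" >&2
    fi
done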

One last customary tip: give it a click on your way out, and you may be in for an unexpected surprise (or shock).

Author: Wang Cong, Creditease Institute of Technology