[TOC]

Docker container platform selection research

Orchestration selection

Swarm

  1. Swarm can build an image from a Dockerfile, but the built image exists only on a single node and is not distributed to the other nodes in the cluster; the application is therefore handled at whole-container granularity rather than a finer grain.

  2. Scaling one of Swarm's containers requires `docker-compose scale`, and the new replicas are placed according to the scheduler's rules. Docker Swarm does not scale a container automatically when it is overloaded (see the sketch after this list).

    • In practice you have to watch for containers hitting a bottleneck yourself and scale out in time.
  3. As of Docker 1.12, a Swarm cluster can automatically detect failed nodes and evict them.

  4. Swarm can now use overlay networks to support multi-host container networking.
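
A minimal sketch of the manual scaling workflow described in point 2, assuming a Compose service named `web` (the service name and scale factor are hypothetical):

```bash
# Start the stack, then scale one service by hand; Swarm's scheduler
# decides node placement, but nothing triggers this automatically on load.
docker-compose up -d
docker-compose scale web=5
```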

Mesos & Marathon (latest version 1.4.1)

  1. Mesos provides resource management and a scheduling-framework abstraction. Third-party applications must implement the Mesos API to use the resources Mesos manages.

    • Mesos adds a lightweight sharing layer underneath, providing a unified API that other frameworks in the cluster can access.
    • Mesos itself is not responsible for scheduling; it delegates that authority, and many frameworks implement sophisticated scheduling, such as Marathon.
  2. Mesos is more fault-tolerant than Swarm, because health checks can be declared for an application in its JSON definition (see the sketch after this list).

  3. Marathon has a web UI and can be thought of as a framework for managing containers: containers are operated through Marathon's REST API, which eases operations and maintenance.

    • The new version of Marathon supports application update and rollback well; it removes the container startup dependence on static configuration files, making publishing, updating, and rolling back application containers easier.
  4. After elastic scale-out and scale-in, Mesos leaves a large number of exited Docker containers on the hosts, which consumes resources and affects system stability.

    • By default Mesos only has a time-based cleanup policy (e.g. after a few hours or days). There is no load-based cleanup policy (e.g. when the system is idle), and no way to customize a cleanup policy per service.
    • Marathon's source code can be modified to add a Docker container garbage-collection interface that cleans up containers in the Exit state per service according to a specified strategy.
  5. Mesos does not support preemption, so task priorities cannot be set.

    • Apache Aurora already supports priorities and resource preemption, but it is a framework at the same level as Marathon.
  6. For stateful storage applications such as MySQL/Kafka, Mesos + Marathon support is not yet mature.

    • On a failure or even a simple service restart, Marathon restarts the service on a random resource offer that satisfies the service's constraints. This does not suit stateful services, because migrating local state to the new instance is operationally expensive.
    • However, Marathon's local persistent volumes do allow stateful services to be deployed.
      • With a local persistent volume, Marathon and Mesos redeploy the container to its original host rather than to another machine the next time it starts.
  7. A custom framework can be written for plug-in style processing. However, Marathon itself is written in Scala and its UI in React, which is not ideal for in-house (secondary) development.

  8. Other components include Mesos-DNS and Marathon-LB.

    • Mesos-DNS is a service discovery tool.
    • Marathon-LB is both a service discovery and a load-balancing tool. To use Marathon-LB, each group of apps must set a HAPROXY_GROUP label, and HAProxy performs the load balancing (see the sketch after this list).
  9. Companies using Mesos: mesos.apache.org/documentati…
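
A hedged sketch of points 2 and 8: a Marathon app definition submitted over the REST API, carrying both a health check and a HAPROXY_GROUP label for Marathon-LB (the app id, image, host, and values are hypothetical):

```bash
# POST an app definition to Marathon's /v2/apps endpoint.
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H 'Content-Type: application/json' -d '{
  "id": "/web",
  "instances": 2,
  "cpus": 0.5,
  "mem": 256,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "nginx:1.11",
      "network": "BRIDGE",
      "portMappings": [ { "containerPort": 80, "hostPort": 0 } ]
    }
  },
  "labels": { "HAPROXY_GROUP": "external" },
  "healthChecks": [ {
    "protocol": "HTTP",
    "path": "/",
    "gracePeriodSeconds": 30,
    "intervalSeconds": 10,
    "maxConsecutiveFailures": 3
  } ]
}'
```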

Kubernetes (latest version 1.6)

  1. Kubernetes uses a ReplicationController to ensure that at least one instance of an application is always running. When you create a Pod in a Kubernetes cluster, you also create a Kubernetes Service for load balancing, which forwards traffic to the containers (see the sketch after this list).

    • Even with a single instance this Service is worth having, because it guarantees that traffic is forwarded correctly to a Pod whose IP is dynamic.
  2. Kubernetes adds Pod and Replica logic. This gives schedulers and orchestration tools richer functionality, such as load balancing and scaling applications out or in, and running container instances can be updated. Kubernetes offers self-healing, automated rollout and rollback, and storage orchestration. Main advantages:

    • AutoScale: decides whether to scale automatically based on collected business metrics.
    • Rolling Deployments: a new version is rolled out without interrupting service, and the old version exits once the new one is deployed.
    • Work Queue: extends the 1:1 relationship between services to 1:N; the queue provides a proxy layer in front of the services and forwards requests to them.
  3. Kubernetes is good at fixing problems automatically and restarts containers quickly, so users may not notice when a container crashes.

    • To compensate, add a centralized logging system or other means of monitoring.
  4. Stateful service sets: StatefulSets (called PetSets in version 1.4).

    • Each Pod in a PetSet mounts its own independent storage. If a Pod fails, a Pod of the same name is started on another node, the storage attached to the original Pod is reattached, and it continues serving from its previous state (IP/hostname stay unchanged).
    • Services suited to PetSets include databases such as MySQL/PostgreSQL and cluster-coordination services such as ZooKeeper and etcd — stateful services in general.
    • With PetSets, Pods still gain high availability by drifting to other nodes, and storage gains high reliability through pluggable storage backends. What PetSet adds is binding an identified Pod to identified storage, ensuring continuity of state.
  5. Kubernetes is written in Go, which helps in-house (secondary) development; community activity is high, and joining the community can raise a company's profile.

  6. A rough list of companies using Kubernetes: eBay, Yahoo, Microsoft, IBM, Intel, Huawei, VMware, HPE, Mirantis, NetEase, Primeton, AsiaInfo, LeEco, Tencent, JD.com.
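
A minimal sketch of the pattern in point 1, assuming `kubectl` access to a Kubernetes 1.6-era cluster (names, image, and port are hypothetical):

```bash
# A ReplicationController keeps one nginx instance alive; a Service gives it
# a stable virtual IP even though the Pod's own IP is dynamic.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ReplicationController
metadata:
  name: web
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.11
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
EOF
```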

Container stack

| Technical point | Technical solutions |
| --- | --- |
| Resource scheduling & orchestration | Mesos + Marathon, Swarm (Compose), Kubernetes |
| Image & registry | Harbor, Docker Hub, Quay.io, Artifactory |
| Monitoring | cAdvisor, Graphite, Sysdig, Datadog, Prometheus, Zabbix + Python agent |
| Logging | ELK, Graylog, Flume, Heka, Fluentd |
| Service registration/discovery & load balancing | HAProxy, etcd, Consul, Confd/DNS |
| Storage | Devicemapper, OverlayFS, Volume, Ceph, Gluster, NFS, OpenStack Swift, Glance, iSCSI SAN |
| Network | Host, VXLAN, IPsec, OpenFlow, Flannel, Docker bridge, Calico, Neutron, Pipework, Weave, SocketPlane |
| Distributed scheduled tasks | Chronos |
| Security | Notary, Vault |
| Automation tools | Ansible, Salt |

Frameworks in industry use

  1. JD.com

    • OpenStack Icehouse + Docker 1.3 + OVS 2.1.3/2.3.2 + CentOS 6.6 ==> K8s + Docker + Flannel + Neutron + OVS + DPDK + JFS
    • When a container fails, the RC is triggered automatically (the IP stays unchanged: a "migration")
    • OVS-VLAN
  2. Zhihu

    • Git + Jenkins (CI/CD) + Mesos + self-developed framework + groups (isolation) + Consul + HAProxy + DNS + Graphite + cAdvisor
      • Fault isolation is performed through groups
      • The image repository gains high availability and horizontal scaling through HDFS
      • The Mesos cluster scales horizontally
    • Docker network
      • bridge
      • NAT works well enough
      • iptables has some pitfalls
    • Service discovery
      • DNS client
    • Auto scaling
      • Burst response & efficient resource utilization
      • Adjusts the number of containers based on CPU metrics
      • Scales up fast, scales down slowly
      • Max & min hard limits
      • Supports user-defined metrics
  3. Ctrip

    • OpenStack + Mesos + Docker + Chronos + ELK
    • Monitoring: Telegraf -> InfluxDB -> Grafana
    • Logs: ELK
      • Mesos stdout and stderr
  4. Qunar

    • OpenStack + nova-docker + VLAN => Mesos + Marathon + Docker (--net=host) + random ports => Mesos + Marathon + Docker + Calico
  5. Alibaba e-commerce cloud

    • EWS, developed in-house, based on Compose and informed by Kubernetes' design. Supports multiple regions.
      • cAdvisor + InfluxDB + Prometheus
      • etcd + Consul + ZooKeeper + Docker overlay
        • Uses managed storage services such as RDS, OSS, and OCS
    • The right way to use Docker containers
      • Rebuild the image on every code commit
      • Do not modify a running image
      • Use volumes to store persistent data
    • Storage management
      • Use Docker volume plugins to support different storage types
        • Block storage: cloud disk
        • Object storage: OSSFS
        • Network file system: NFS
  6. Tongcheng Travel

    • Swarm + Swarm agent + etcd + Zabbix + Jenkins + (Nginx + Lua) +
    • Current usage
      • About 5,000 containers, scaled up to 8,000 at peak
      • 600 applications on Docker; MongoDB, Redis, and MySQL are containerized as well
      • CPU utilization rose from 20% to 80%
    • Resource isolation layer
      • Improves physical machine utilization and places applications sensibly
      • Resources are isolated between applications, avoiding environment and resource conflicts and improving security
      • Bursts of incoming traffic: rapid scale-out and migration
      • Application migration: reduces the cost of buying servers
      • Operations work: more automation, finer-grained monitoring and alerting
    • Optimizations
      • Dockerfile optimization: reduced the number of layers from 20 to 5, making builds twice as fast
      • Switching the storage driver from Devicemapper to OverlayFS made builds twice as fast again
      • A large application launches in only 40 seconds
      • The automated test system calls the container system directly to deploy test environments, and the containers can be recycled for other tests
      • There is almost no performance loss between a physical machine and a container
        • redis-benchmark -h 127.0.0.1 -p 6379 -q -d 100
    • Image management
      • Build a pool of base images
      • Build application images on top of the base images
      • Application images are rebuilt on every release
      • Multiple versions of application images are stored
      • Build an image once, use it everywhere
      • Application rollback and scale-out are performed from application images
    • Thoughts on the network
      • Network controllability in a private cloud is inherently high
      • Multi-tenant isolation means little in a private cloud
      • Stability, controllability, and scalability are the urgent needs
      • Overall bandwidth is strongly guaranteed
      • Network considerations for Docker containers
        • Local network mode and OVS mode
          • Local network mode: e.g. web services
          • OVS mode: e.g. data analysis
  7. NetEase Honeycomb

    • OpenStack + K8s + etcd + OpenFlow + iSCSI + Ceph + billing + multiple data centers
  8. Tencent

    • Kubernetes + network (bridge + VLAN / SR-IOV / overlay) + LXCFS + Ceph + ConfigMap/Secret + Blue Whale management platform
    • Currently about 15,000 resident Docker containers; dozens of client, web, and mobile games have been run on the Docker platform
    • Clusters are compatible with both Docker and non-Docker applications
    • Gaia puts the network under unified management as a resource dimension, alongside CPU and memory. Services specify their own network I/O requirements when submitting applications. Outbound bandwidth is controlled with Traffic Control (tc) + cgroups, and the kernel is modified to add control of inbound bandwidth
    • Specific network choices
      • Pod-to-Pod communication inside the cluster needs no intranet IPs (virtual IPs suffice), so the overlay network is implemented with the Flannel component.
      • Pods that communicate with the company intranet, such as HAProxy and some game modules, use an SR-IOV network implemented by a customized SR-IOV CNI component. These Pods have dual networks: eth0 on the overlay network and eth1 on the SR-IOV network.
      • Pod-to-intranet communication: in the microservice scenario, the game's data storage and peripheral systems are deployed on physical or virtual machines, so Pods reach these modules and systems through NAT.
      • (Internet) access uses the company's TGW solution.
  9. Didi

    • Kubernetes
    • As far as we know, Didi has not been Docker-oriented for long, so its architecture offers limited reference value
  10. Uber

    • To be added
  11. Mogujie

    • Kubernetes + VLAN
  12. Qiniu Cloud

    • Mesos + homegrown container scheduling framework (DoraFramework) + bridge + NAT + Open vSwitch + Consul + Prometheus + Ansible
    • Qiniu has reached a scale of nearly a thousand physical machines, and Mesos is a good fit for scheduling at that scale
    • Chose self-development over Marathon, Mesos's core framework
      • Marathon did not support some things the way we expected, e.g. its service discovery is not seamless enough
      • Marathon is developed in Scala, which makes troubleshooting inconvenient and hinders our in-house development
      • Had we chosen Marathon, we would still have had to wrap it in another layer to serve as Dora's scheduling service, adding modules and complicating deployment, operations, and maintenance
  13. Meizu Cloud

    • OVS & VLAN + SR-IOV + Ceph (to ensure the reliability of image storage) + its existing monitoring system
    • Hosts communicate over a large layer-2 network and are isolated by VLANs
    • Remote image synchronization
    • Container design concept
      • Containers are treated like VMs: a created container is expected to run for a long time
      • Each container has an independent, unique IP address
      • Hosts communicate over a large layer-2 network and are isolated by VLANs
      • Containers run an SSH service; you can log in to a container through a bastion host
      • Containers run other common services too, such as crond
    • Network
      • iperf test: Bridge < OVS veth pair < OVS internal port
      • iperf test: Native > SR-IOV > OVS > Bridge
      • Docker with DPDK
        • Polling processes packets, avoiding interrupt overhead
        • User-space drivers avoid memory copies and system calls; CPU affinity and huge pages are used
      • Ideas
        • virtio serves as the backend interface
        • The user socket is mounted into the container
        • DPDK applications run inside the container
    • Container storage
      • Devicemapper: mature and stable; raw devices; snapshots
      • IOPS: native is basically on par with Devicemapper
      • Data disk storage: LVM
        • Quotas can be changed online per container
    • Image storage and synchronization
      • Image storage
        • LVS load balancing at the front end ensures high availability
        • Distribution manages the images
        • Ceph at the back end ensures reliable image storage
      • Remote image synchronization
        • Webhook notification mechanism
        • Strongly consistent synchronization mechanism
    • Container cluster scheduling system
      • Scheduling requests land on the appropriate nodes in the cluster
      • Hosts are selected by IDC, resources, zone, and container
      • Dynamic scheduling based on host resource state and the requested CPU/memory/disk size
      • Rack awareness: containers of the same service are deployed to different racks
  14. UCloud

    • Kubernetes + Jenkins
      • -v mounts Elasticsearch data onto the host; Flume/Logstash/rsyslog + Elasticsearch
      • VSwitch overlay "large layer-2" SDN networking solution + ipvlan
    • Main problem types and solutions
      • Module configuration

        • Module upstream/downstream relationships, backend services
        • Runtime environment, per-data-center configuration differences, etc.
      • Consistency and dependencies

        • Inconsistencies across development, test, and runtime environments
        • Dependencies on different base libraries
      • Deployment

        • Deployment is inefficient, takes many steps, and takes a long time
        • There is no mechanism to check deployment status
        • Application management
          • Managing, scaling out, and scaling in a large number of container instances is costly
          • Unified management of program build, packaging, and operations
          • Monitoring and log analysis
      • Solutions

        • Module configuration
          • Separate configuration items such as environment, IDC, and resource class
          • Configuration templates are submitted to Cedebase for version management
          • Different deploys derive different configuration values, which fill in the templates and start scripts
          • Values that differ across deploys are simply passed to the container via environment variables
        • Consistency and dependencies
          • Development, test, and production all run images built by Docker, ensuring consistency
          • Layered builds: base system, base tools, framework
          • Base images are pre-deployed in the development, test, and production environments
        • Private image registry
          • Registry V2
          • Supports a UFile driver
          • Periodically pulls the latest images
      • Some lessons learned

        • Docker logs
          • Log printing costs performance
          • It is best to turn off the log driver and have applications print logs in the background
        • Docker daemon
          • When upgrading the Docker daemon, killing the running containers is optional
        • Docker network
          • In NAT mode, nf_conntrack is enabled, which degrades performance; tune the kernel parameters (see the sysctl sketch after this list)
        • Docker images
          • Write Dockerfiles to a standard: reduce the number of layers, with the basic, stable parts first
          • The image registry is deployed per region
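
A sketch of the kind of kernel tuning the NAT-mode note above refers to; the parameter name is real, but the value is illustrative, not a recommendation:

```bash
# Raise the connection-tracking table ceiling so NAT-heavy hosts don't drop
# connections when nf_conntrack fills up; persist it across reboots.
sysctl -w net.netfilter.nf_conntrack_max=1048576
echo 'net.netfilter.nf_conntrack_max = 1048576' >> /etc/sysctl.conf
```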

Main problems

  1. Single-instance performance tuning, plus bringing the performance of 10 GbE NICs into play. Some improvements to OVS (Open vSwitch) are needed.

  2. Multiple data centers: support for multiple data centers and availability zones.

  3. Container network requirements

    • iptables has some pitfalls
    • Network access between containers across hosts
    • Whether the container network needs to support IP address drift
  4. Problems container networks face

    • Docker host mode: port conflicts when applications are co-located.
    • Docker NAT mode: IP-address-sensitive services require heavy rework, and service discovery cannot be supported.
    • Overlay networks involve IP address planning, MAC address allocation, and the oversubscription ratio of network devices.
    • Overlay network security, maintainability, and capacity planning.
  5. Version upgrades (Docker/Mesos/K8s): the upgrade itself.

  6. Docker containerization of stateful services

    • Kafka / MySQL

Network selection (K8s and Mesos)

Considerations && pain points

  1. Can containers be reached across machines? Across network segments?

    • Flannel enables container-to-container communication
    • Containers interconnect across hosts
    • Containers connect to the outside world
  2. Are static or fixed IP addresses supported? Domain-name access?

    • With a fixed IP, the IP must stay the same across every deployment, update, and reboot
    • Overlay network: Docker 1.6 can communicate across hosts
  3. Is DNS supported?

  4. Layer-4 / layer-7 access

  5. Network behavior after containers scale out

  6. IPs and ports: preferably not planned by hand

  7. Network policy, defense, isolation?

    • Network isolation and rate limiting between different applications in a container cluster
  8. Docker network modes (illustrated after this list)

    • Host mode: the container shares the host's network namespace directly, so its ports are exposed on the host without mapping. Two containers listening on port 80 cannot both be started, and there is no isolation
    • Container mode: a container reuses the network configuration of an existing container; all network information, such as IP address and network ports, is shared
    • Bridge mode: the container is assigned an IP address from the docker0 subnet, and docker0's IP address is set as the container's default gateway
      • The container's IP changes when the container is restarted
      • Communication between containers on different hosts depends on a third-party solution, such as Pipework
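
Hedged one-liners illustrating the three modes in point 8 (image and container names are arbitrary examples):

```bash
docker run -d --net=host nginx                      # host mode: shares the host's network stack
docker run -d --name app busybox sleep 3600
docker run --rm --net=container:app busybox ip addr # container mode: reuses app's network namespace
docker run -d -p 8080:80 nginx                      # bridge mode (default): docker0 IP + port mapping
```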

Solutions

  1. Solution categories

    • Tunnel solutions (overlay networking):
      • Weave: UDP broadcast; sets up a new bridge locally; interconnects via pcap.
      • Open vSwitch (OVS): based on the VXLAN and GRE protocols, but with serious performance loss.
      • Flannel: UDP broadcast, VXLAN.
    • Routing solutions
      • Calico: a BGP-based routing solution that supports fine-grained ACL control and has high affinity with hybrid clouds.
      • Macvlan: in terms of logic and the kernel layer, the solution with the best isolation and performance. It is based on layer-2 isolation and requires layer-2 router support; most cloud service providers do not offer this, so it is hard to use on hybrid clouds.
      • Good performance, no NAT, high efficiency; but limited by routing table size, and since every container gets an IP, service IPs may run out.
  2. There are two camps on the network side

    • The Docker Libnetwork Container Network Model (CNM) camp (Docker Libnetwork's advantage is being native and tightly integrated with the Docker container lifecycle)

      • Docker Swarm overlay
      • Macvlan & IP network drivers
      • Calico
      • Contiv (from Cisco)
    • The Container Network Interface (CNI) camp (CNI's advantage is compatibility with other container runtimes (e.g. rkt) and upper-layer orchestration systems (Kubernetes & Mesos))

      • Kubernetes
      • Weave
      • Macvlan
      • Flannel
      • Calico
      • Contiv
      • Mesos CNI
  3. Common solutions are:

    • Flannel VXLAN: overlay mode

    • Calico

      • Container networks are isolated at layer 3, so there is no need to worry about ARP storms
      • Packet forwarding based on iptables/the Linux kernel is efficient, with low loss
      • Calico has no concept of multi-tenancy; all container nodes must be routable, and IP addresses cannot be repeated
    • ipvlan/macvlan: physical layer-2/layer-3 isolation. Currently it must be configured per node with the Pipework tool. Only VLAN isolation is implemented; ARP broadcast is not addressed

    • Swarm's native VXLAN: similar to Flannel VXLAN

    • Neutron SDN: ML2 + OVS plugin, MidoNet; VLAN or VXLAN

    • Weave

      • Creates a virtual network connecting Docker containers deployed across multiple hosts; external devices can access services provided by application containers on the Weave network, and existing internal systems can be exposed to the application containers
    • Contiv

      • A Cisco-led SDN solution: either pure-software OVS, or OVS + a Cisco hardware SDN controller
      • Based on Open vSwitch; supports container network access as a plug-in, with VLAN, VXLAN, multi-tenancy, host access-control policies, etc.
      • The SDN capability allows finer control over the container's network access
      • JD.com already runs 100,000+ containers on the same technology stack (OVS + VLAN)
    • Linux Bridge + layer-3 switches: put the host's Linux bridge into the subnet segment of the layer-3 switches. Containers communicate with each other through layer-2 switching, and with the outside through the layer-3 switch gateway.

  4. Network selections commonly used in the industry

    • Kubernetes + Flannel (a Flannel config sketch follows this list)

      • Kubernetes uses a flat network model that requires every Pod to have a globally unique IP; Pods can communicate directly across hosts. The mature solution at present is Flannel
      • Flannel supports data forwarding modes such as UDP, VXLAN, AWS VPC, and GCE routing.
      • Under Kubernetes, Flannel, Open vSwitch, and Weave can all implement the overlay network
      • Vipshop: Contiv NetPlugin solution (fixed external IPs) + Flannel
      • JD.com: Flannel + Neutron + OVS
      • Flannel performance, per the official claim: bandwidth does not decrease, latency increases
    • Mesos + Calico

      • Mesos supports the CNI standard specification
      • One IP per container, network isolation, DNS service discovery, IP assignment, L3 virtual networks
      • Qunar: Mesos + Calico
      • Bridge + NAT + Open vSwitch
    • Meizu Cloud: OVS & VLAN + SR-IOV

    • UCloud: VSwitch overlay "large layer-2" SDN networking solution + ipvlan
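
A hedged sketch of configuring Flannel for the Kubernetes + Flannel selection above: Flannel reads its network definition from etcd under its default /coreos.com/network key (the subnet range here is illustrative):

```bash
# Write the overlay network definition; each flanneld picks a per-host subnet
# from this range and forwards cross-host traffic over VXLAN.
etcdctl set /coreos.com/network/config \
  '{ "Network": "10.244.0.0/16", "Backend": { "Type": "vxlan" } }'
```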

Log and monitoring selection (including monitoring and statistics)

Because of Docker's layered design, data inside a container is not persisted: it is lost when the container is destroyed. It is therefore advisable to mount log directories onto the host, or to use distributed storage such as Ceph.

stdout/stderr logs can be forwarded by Logspout to a central syslog for collection.

Docker's log driver can ship logs to specific endpoints such as Fluentd, syslog, or journald. Logspout can route container logs to syslog or to third-party systems such as Redis, Kafka, or Logstash.
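
A hedged example of the Logspout pattern just described, using the documented gliderlabs/logspout invocation (the syslog endpoint is hypothetical):

```bash
# Logspout watches the Docker socket and routes every container's
# stdout/stderr to the given syslog endpoint.
docker run -d --name logspout \
  -v /var/run/docker.sock:/var/run/docker.sock \
  gliderlabs/logspout \
  syslog+tcp://logs.example.com:514
```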

  1. Proposed approach

    • Collect outside the container: mount host directories as the log directories and files inside the container.
    • Applications in the container also redirect their logs to that log directory.
    • Collect and rotate the application log directory and the Docker log directory on the host.
  2. Monitoring options (a cAdvisor example follows this list)

    • cAdvisor + InfluxDB + Grafana
    • cAdvisor + Prometheus + Grafana
    • Graphite
    • Zabbix
    • Datadog
  3. Logging options

    • Logstash
    • ELK
    • Graylog
    • Flume
    • Heka
    • Fluentd
  4. Industry solutions

    • Aliyun: cAdvisor + InfluxDB + Prometheus
    • Ctrip: ELK
    • Zhihu: Graphite + cAdvisor
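
A hedged example of the cAdvisor piece shared by the first two monitoring options above; the mounts follow cAdvisor's documented invocation, while the port and version tag are illustrative:

```bash
# cAdvisor exports per-container resource metrics on :8080, which
# InfluxDB or Prometheus can then receive or scrape.
docker run -d --name=cadvisor -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor:v0.24.1
```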

Image management

  1. Images are always built from Dockerfiles.
  2. Images should not depend on each other too deeply. Three layers are recommended: a base operating-system image, a middleware image, and an application image (see the sketch after this list).
  3. All images should have a Git repository behind them for subsequent updates.
  4. Registry
    • Single-point-of-failure problems can be solved with DRBD, distributed storage, or cloud storage.
    • Registry performance problems: speed up layer downloads with an HTTP reverse-proxy cache, or provide registry mirrors.
    • Registry user permissions: Nginx + Lua offers a simple and quick implementation.
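
A sketch of the three-layer convention in point 2; the registry host, image names, and tags are hypothetical:

```bash
# Application layer builds on a middleware image, which itself builds on
# a base OS image (e.g. centos:6.6 -> middleware/tomcat:8 -> apps/myapp).
cat > Dockerfile <<'EOF'
FROM registry.example.com/middleware/tomcat:8
COPY app.war /usr/local/tomcat/webapps/
EOF
docker build -t registry.example.com/apps/myapp:1.0 .
docker push registry.example.com/apps/myapp:1.0
```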

Personal understanding

  1. Selection should not look only at orchestration, but also at storage and network support

    • Swarm has had bugs in the past, such as failing to detect failed nodes and restart containers
    • K8s is only for scheduling Docker
    • Mesos manages clusters of machines; Docker can be managed indirectly through Marathon
  2. Corresponding network support

    • Whether cross-host / cross-segment communication is possible
    • Whether fixed IPs / DNS resolution are supported
    • CNI standard support?
  3. Support for storage

    • Can data be persisted?
    • Is distributed storage supported?
  4. Orchestration/scheduling/upgrades

    • Are rollback, online upgrade, and rolling upgrade supported?
    • Whether CPU/memory allocation can be fine-grained (see the sketch at the end of this section)
    • Whether containerization and scheduling of stateful services are supported
    • Automatic scale-out and scale-in?
  5. Service registration / discovery mechanisms / load balancing

    • Is a service registration mechanism in place?
    • Can load balancing meet the needs of various service scenarios?
  6. Other

    • Isolation: besides cgroups and namespaces, other kinds of isolation, such as network isolation
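
On the fine-grained CPU/memory point above, Docker itself already exposes per-container knobs; a hedged example with illustrative values:

```bash
# Relative CPU weight, a hard memory cap, and no extra swap beyond it.
docker run -d --cpu-shares=512 --memory=256m --memory-swap=256m nginx
```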