Suppose I want to deploy Nginx to a Kubernetes cluster. I might type a command like this in a terminal:

$ kubectl run nginx --image=nginx --replicas=3

and then press Enter. After a few seconds, three Nginx Pods should be running, spread across the worker nodes. It works like magic, but it is not obvious what is going on behind the scenes.

The magic of Kubernetes is that it handles deployments across infrastructure through a user-friendly API, while the underlying complexity is hidden in simple abstractions. But to fully appreciate the value it provides us, we need to understand its inner workings.

This guide walks you through the full life cycle of a request, from the client to the kubelet, referring to the source code where necessary to explain what happens behind the scenes.

1. kubectl

Validation and generator

When you press Enter, kubectl first performs client-side validation so that requests that are bound to fail (for example, creating an unsupported resource or using a malformed image name) fail fast and are never sent to kube-apiserver. This improves system performance by reducing unnecessary load.

After validation, kubectl begins assembling the HTTP request that it will send to kube-apiserver. Every attempt to read or change state in a Kubernetes system goes through kube-apiserver, which in turn communicates with etcd, and kubectl is no exception. kubectl uses generators to construct the HTTP request; a generator is an abstraction that takes care of serialization.

kubectl run can create more than Deployments: many other resource types can be deployed by specifying the --generator parameter. If --generator is not specified, kubectl infers the resource type automatically.

For example, a resource with the parameter --restart-policy=Always is deployed as a Deployment, and one with --restart-policy=Never is deployed as a Pod. kubectl also checks whether other actions need to be triggered, such as recording the command (for rollbacks or auditing).

Once kubectl has determined that a Deployment is to be created, it uses the DeploymentV1Beta1 generator to build a runtime object from the parameters we provided.

API version negotiation and API group

To make it easier to remove fields or restructure resources, Kubernetes supports multiple API versions, each under a different API path, such as /api/v1 or /apis/extensions/v1beta1. Different API versions indicate different levels of stability and support; see the Kubernetes API Overview for a more detailed description.

API groups categorize similar resources so that the Kubernetes API is easier to extend. The API group name is specified in the REST path or in the apiVersion field of the serialized object. For example, the API group name for Deployment is apps and the latest version of the API is v1beta2, which is why you write apiVersion: apps/v1beta2 at the top of Deployment manifests.
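As a minimal sketch of how an apiVersion value maps onto a REST path, the Python function below splits the value into group and version. The function name rest_prefix is hypothetical; the only facts it relies on are that the core group (an apiVersion without a slash, such as v1) is served under /api and that named groups are served under /apis.

```python
def rest_prefix(api_version: str) -> str:
    """Return the REST path prefix for an apiVersion string."""
    if "/" in api_version:
        # Named group, e.g. "apps/v1beta2" -> group "apps", version "v1beta2".
        group, version = api_version.split("/", 1)
        return f"/apis/{group}/{version}"
    # The core group has no group name, e.g. "v1".
    return f"/api/{api_version}"
```

For instance, rest_prefix("apps/v1beta2") gives the prefix under which Deployment routes are registered in this article's example.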

Once the runtime object has been generated, kubectl looks for the appropriate API group and API version for it, and then assembles a versioned client that knows the REST semantics of the resource. In this phase, called version negotiation, kubectl scans the /apis path on the remote API to retrieve all possible API groups. Because kube-apiserver exposes its specification document in OpenAPI format at this path, clients can easily discover suitable APIs.

To improve performance, kubectl caches the OpenAPI specification in the ~/.kube/cache directory. To watch API discovery in action, delete that directory and run a kubectl command with the -v parameter set to its maximum value; you will see every HTTP request made to find these API versions. See the kubectl cheat sheet for reference.

The last step is to actually send the HTTP request. Once the request is sent and a successful response is received, kubectl prints the success message in the desired output format.

Client identity authentication

One thing left out of the previous step is client authentication, which must happen before the HTTP request is sent, so let’s look at it now.

To send the request successfully, kubectl needs to authenticate. User credentials are stored in a kubeconfig file, which kubectl looks for in the following order:

If the --kubeconfig parameter is provided, kubectl uses the kubeconfig file it names.

If the --kubeconfig parameter is not provided but the environment variable $KUBECONFIG is set, kubectl uses the kubeconfig file named by that variable.

If neither the --kubeconfig parameter nor the $KUBECONFIG environment variable is provided, kubectl uses the default kubeconfig file, $HOME/.kube/config.
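The lookup order above can be sketched in a few lines of Python. This is an illustration rather than kubectl's actual code: the function name resolve_kubeconfig is hypothetical, and the sketch treats $KUBECONFIG as a single path even though the real variable may hold a list of paths.

```python
from typing import Mapping, Optional

def resolve_kubeconfig(flag: Optional[str], env: Mapping[str, str]) -> str:
    """Pick a kubeconfig path using the three-step order described above."""
    if flag:
        # 1. An explicit --kubeconfig parameter wins.
        return flag
    if env.get("KUBECONFIG"):
        # 2. Otherwise fall back to the $KUBECONFIG environment variable.
        return env["KUBECONFIG"]
    # 3. Otherwise use the default file under the user's home directory.
    return f"{env.get('HOME', '')}/.kube/config"
```

Passing the environment in as a mapping (for example os.environ) keeps the function easy to exercise without touching the real environment.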

After parsing the kubeconfig file, kubectl determines the current context to use, the cluster it points to, and any authentication information associated with the current user. If the user provides additional parameters (such as --username), they take precedence over the values specified in kubeconfig. Once this information is available, kubectl populates the HTTP request to be sent:

X509 certificates are sent using tls.TLSConfig (which also includes the CA certificate).

Bearer tokens are sent in the Authorization HTTP request header.

The user name and password are sent through HTTP basic authentication.

The OpenID authentication process is handled manually by the user beforehand and produces a token that is sent like a bearer token.
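To make the header-based cases concrete, here is a small Python sketch of how a client could build the Authorization header for a bearer token or for HTTP basic authentication. The function name auth_headers and the rule that a token takes priority over a user name and password are assumptions made for illustration; certificate-based authentication happens at the TLS layer and is not shown.

```python
import base64
from typing import Dict, Optional

def auth_headers(token: Optional[str] = None,
                 username: Optional[str] = None,
                 password: Optional[str] = None) -> Dict[str, str]:
    """Build the Authorization header for token or basic authentication."""
    if token:
        # Bearer tokens travel verbatim in the Authorization header.
        return {"Authorization": f"Bearer {token}"}
    if username is not None and password is not None:
        # Basic authentication is base64("user:password").
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    # No header-based credentials were supplied.
    return {}
```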

2. kube-apiserver


Authentication

Now that our request has been sent successfully, what happens next? This is where kube-apiserver comes in. kube-apiserver is the primary interface that clients and system components use to persist and retrieve cluster state. To do its job, it must be able to verify that the requester is who it claims to be, a process called authentication.

So how does the apiserver authenticate requests? When kube-apiserver first starts, it looks at all the CLI parameters provided by the user and assembles a list of suitable authenticators.

For example, if the --client-ca-file parameter is provided, the x509 client certificate authenticator is added to the list; if the --token-auth-file parameter is provided, the bearer token authenticator is added to the list.

Each time it receives a request, the apiserver runs it through the authenticator chain until one of the authenticators succeeds:

The x509 handler verifies that the HTTP request is encrypted with a TLS key signed by the CA root certificate.

The bearer token handler verifies that the token supplied in the request exists in the token file specified by the --token-auth-file parameter.

The basic authentication handler ensures that the HTTP request’s basic authentication credentials match its own local state.

If authentication fails, the request fails and an error message is returned. If authentication succeeds, the Authorization header is removed from the request and the user information is added to its context. This gives the subsequent authorization and admission control steps access to the previously established identity of the user.
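The chain behavior described above (try each authenticator in turn, stop at the first success, then strip the header and attach the user) can be sketched as follows. This is a simplified model, not kube-apiserver code: the request is a plain dictionary, and authenticate and bearer_token are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

# An authenticator inspects the request and returns a user name, or None.
Authenticator = Callable[[Dict[str, str]], Optional[str]]

def authenticate(request: Dict[str, str], chain: List[Authenticator]) -> Optional[str]:
    """Return the user from the first authenticator that succeeds, else None."""
    for authenticator in chain:
        user = authenticator(request)
        if user is not None:
            # On success the credentials are dropped and the identity attached.
            request.pop("Authorization", None)
            request["user"] = user
            return user
    return None

def bearer_token(tokens: Dict[str, str]) -> Authenticator:
    """Build an authenticator backed by a token -> user table."""
    def check(request: Dict[str, str]) -> Optional[str]:
        header = request.get("Authorization", "")
        if header.startswith("Bearer "):
            return tokens.get(header[len("Bearer "):])
        return None
    return check
```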


Authorization

OK, the request has been sent and kube-apiserver has successfully verified who we are. What a relief!

However, we are not done yet. We have proved who we are, but do we have permission to perform this operation? After all, identity and permission are not the same thing. kube-apiserver therefore also has to authorize the request before it can continue.

kube-apiserver handles authorization in much the same way as authentication: based on its --authorization-mode startup parameter, it assembles a chain of authorizers that is run against each incoming request. If all authorizers deny the request, the request is rejected with a Forbidden response and goes no further. If a single authorizer approves the request, the request proceeds.

kube-apiserver currently supports the following authorization methods:

Webhook: It interacts with an HTTP(S) service outside the cluster.

ABAC: It enforces policies defined in static files.

RBAC: It uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing administrators to configure policies dynamically through the Kubernetes API.

Node: It ensures that a kubelet can only access resources on its own node.
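The "one approval is enough" rule of the authorizer chain can be expressed very compactly. The sketch below is illustrative only: the (user, verb, resource) signature and the two example authorizers are hypothetical and much simpler than the real modes listed above.

```python
from typing import Callable, List

# An authorizer returns True to approve the (user, verb, resource) request.
Authorizer = Callable[[str, str, str], bool]

def authorize(user: str, verb: str, resource: str, chain: List[Authorizer]) -> bool:
    """Approve as soon as one authorizer approves; deny if none does."""
    return any(authorizer(user, verb, resource) for authorizer in chain)

# Two hypothetical authorizers, for illustration only.
def admin_can_do_anything(user: str, verb: str, resource: str) -> bool:
    return user == "admin"

def anyone_can_get_pods(user: str, verb: str, resource: str) -> bool:
    return verb == "get" and resource == "pods"
```

Because any() short-circuits, later authorizers are not consulted once one of them has approved the request.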

Admission control

Once the authentication and authorization hurdles are cleared, will the API server actually act on the client’s request? Not yet!

From kube-apiserver’s point of view, it has verified our identity and confirmed that we have permission to continue, but other components in Kubernetes have strong opinions about what should and should not be allowed to happen. The request therefore also has to pass through a chain of admission controllers. There are nearly ten of these built in, and you can add your own.

While authorization answers whether a user has permission, admission controllers intercept the request to ensure that it meets the broader expectations and rules of the cluster. They are the last line of defense before an object is persisted to etcd, encapsulating a series of additional checks to ensure that an operation does not produce unexpected or negative results. Unlike authentication and authorization, which are concerned only with the requesting user and the operation, admission control also looks at the content of the request, and it applies only to operations such as create, update, delete, or connect (for example, proxy), not to reads. Admission controllers work in a similar way to authenticators and authorizers, with one difference: if a single admission controller fails, the whole chain is broken, and the request is immediately rejected with an error returned to the end user.
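The difference from the previous two chains is that here a single failure is fatal. A minimal Python sketch of that rule follows; the object is modeled as a dictionary, and admit, require_image, and limit_replicas are hypothetical names that do not correspond to real admission plug-ins.

```python
from typing import Callable, Dict, List, Optional

# An admission check returns an error message to reject, or None to pass.
AdmissionCheck = Callable[[Dict], Optional[str]]

def admit(obj: Dict, chain: List[AdmissionCheck]) -> Optional[str]:
    """Return the first rejection message, or None if every check passes."""
    for check in chain:
        error = check(obj)
        if error is not None:
            return error  # one failure rejects the whole request
    return None

# Two hypothetical checks that look at the content of the request.
def require_image(obj: Dict) -> Optional[str]:
    return None if obj.get("image") else "image is required"

def limit_replicas(obj: Dict) -> Optional[str]:
    return "too many replicas" if obj.get("replicas", 1) > 5 else None
```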

The design of admission controllers focuses on extensibility. Each controller is stored as a plug-in in the plugin/pkg/admission directory, implements a small interface, and is compiled into the kube-apiserver binary.

Most admission controllers are fairly easy to understand, so I will focus on SecurityContextDeny, ResourceQuota, and LimitRanger.

SecurityContextDeny: This plug-in prohibits the creation of Pods that set a SecurityContext.

ResourceQuota: This plug-in limits not only the number of resources created in a Namespace but also the total amount of resources requested by the Pods in that Namespace. It implements quota management together with the ResourceQuota resource object.

LimitRanger: Similar to ResourceQuota, this plug-in applies resource limits to each individual object (such as a Pod or a Container) in a Namespace. It implements this together with the LimitRange resource object.

3. etcd

By now, Kubernetes has thoroughly vetted the client’s request and allowed it to proceed. In the next step, kube-apiserver deserializes the HTTP request, uses the result to construct a runtime object (a bit like the inverse of the kubectl generator), and saves it to etcd. Let’s break this down.

How does kube-apiserver know what to do when it receives a request? A fairly complex series of steps takes place before any client request is served. Let’s start from the moment the kube-apiserver binary is first run:

When the kube-apiserver binary is run, it creates a server chain that allows apiserver aggregation, which is one way to extend the Kubernetes API.

A generic apiserver is also created to act as the default apiserver.

The generated OpenAPI specification is then used to populate the apiserver’s configuration.

kube-apiserver then iterates through all the API groups specified in the data structure and configures, for each one, a storage provider backed by etcd that serves as a generic storage abstraction. kube-apiserver talks to these providers when you access or change the state of a resource.

For each API group, it iterates through all the group versions and installs a REST mapping for each HTTP route.

When the request method is POST, kube-apiserver delegates the request to the resource creation handler.

kube-apiserver now knows all the routes and their corresponding REST paths, so it knows which handlers and key-value stores to call when a request matches. What a clever design! Now assume that the client’s HTTP request has reached kube-apiserver:

If the handler chain can match the request to a registered route, it passes the request to the dedicated handler registered for that route. If no route matches, the request falls back to a path-based handler (for example, when calling /apis). If no path-based handler is registered for the path either, the request is passed to the not-found handler and a 404 is returned.
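The three-level fallback just described can be sketched as a small dispatcher. This is a toy model under stated assumptions: routes are keyed by (method, path), path-based handlers are matched on the exact path, and the name dispatch and the status codes returned by the example handlers are hypothetical.

```python
from typing import Callable, Dict, Tuple

# A handler takes no arguments here and returns (status code, body).
Handler = Callable[[], Tuple[int, str]]

def dispatch(method: str, path: str,
             routes: Dict[Tuple[str, str], Handler],
             path_handlers: Dict[str, Handler]) -> Tuple[int, str]:
    """Try a registered route, then a path-based handler, then 404."""
    handler = routes.get((method, path))
    if handler is not None:
        return handler()            # dedicated handler for this route
    fallback = path_handlers.get(path)
    if fallback is not None:
        return fallback()           # path-based handler, e.g. for /apis
    return 404, "not found"         # not-found handler
```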

Luckily, we have a registered route called createHandler. What does it do? First it decodes the HTTP request and performs basic validation, such as ensuring that the JSON provided in the request matches the version of the API resource.

Next come the auditing and admission control phases.

The resource is then saved to etcd through the storage provider. By default, keys saved to etcd are in the format of /, which you can also customize.

Any errors that occur while creating the resource are caught, and finally the storage provider performs a get call to verify that the resource was actually created. If additional finalization is required, the post-create handlers and decorators are then invoked.

Finally, the HTTP response is constructed and returned to the client.

It turns out that the apiserver does a great deal of work that is easy to overlook! At this point, the Deployment resource we created has been saved to etcd, but it is still not visible through the apiserver.

4. Initialize

After a resource object has been persisted to the data store, the apiserver cannot fully see it or schedule it until a series of Initializers has run. Initializers are controllers associated with a resource type that perform logic on a resource before it becomes available. If a resource type has no Initializers registered, this initialization step is skipped and the resource is visible immediately.

Initializers are a powerful feature because they allow us to perform generic bootstrap operations. For example:

Inject a proxy sidecar container into a Pod that exposes port 80 or that carries a specific annotation.

Inject a volume that holds test certificates into all Pods in a specific namespace.

Prevent the creation of a Secret whose password is shorter than 20 characters.

InitializerConfiguration resource objects let you declare which Initializers should run for which resource types. If you want a custom Initializer to run every time a Pod is created, you can declare this:

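apiVersion: admissionregistration.k8s.io/v1alpha1
kind: InitializerConfiguration
metadata:
  name: custom-pod-initializer
initializers:
  - name:
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        resources:
          - pods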

After this configuration object is created, custom-pod-initializer is appended to the metadata.initializers.pending field of every Pod. The initializer controller periodically scans for new Pods; when it detects its own name in a Pod’s pending field, it executes its logic and then removes its name from the pending field.

Only the first Initializer in the pending list may operate on the resource. When all Initializers have finished and the pending field is empty, the object is considered initialized.
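The pending-list protocol can be modeled in a few lines. In this sketch the object is a plain dictionary shaped like the metadata.initializers.pending field mentioned above, and run_initializer and is_initialized are hypothetical helper names, not part of any Kubernetes client.

```python
from typing import Callable, Dict, List

def run_initializer(name: str, obj: Dict, logic: Callable[[Dict], None]) -> bool:
    """Run `logic` only if `name` is first in the pending list.

    Returns True if the initializer ran and removed itself from the list.
    """
    pending: List[str] = obj["metadata"]["initializers"]["pending"]
    if not pending or pending[0] != name:
        return False      # not our turn yet (or nothing left to do)
    logic(obj)            # perform the initialization logic
    pending.pop(0)        # then remove our own name from the list
    return True

def is_initialized(obj: Dict) -> bool:
    """An object counts as initialized once its pending list is empty."""
    return not obj["metadata"]["initializers"]["pending"]
```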

You may have spotted a problem: if kube-apiserver does not expose these resources, how can a user-level controller process them?

To solve this, kube-apiserver exposes an ?includeUninitialized query parameter, which returns all resource objects, including uninitialized ones.

5. Control loop

Deployments controller

At this stage, our Deployment record is stored in etcd and all initialization logic has run. The next stages are about setting up the resource topology that a Deployment depends on. In Kubernetes, a Deployment is really just a collection of ReplicaSets, and a ReplicaSet is a collection of Pods. So how does Kubernetes create this hierarchy from a single HTTP request? All of this work is done by Kubernetes’ built-in controllers.

Kubernetes makes heavy use of controllers throughout the system. A controller is an asynchronous script that works to reconcile the current state of the system with the desired state. All controllers are run in parallel by the kube-controller-manager component, and each controller is responsible for a specific control flow. Let’s start with the Deployment controller:

Once the Deployment record has been stored in etcd and initialized, it becomes visible through kube-apiserver and is then detected by the Deployment controller, whose job is to listen for changes to Deployment records. In our example, the controller registers a callback for creation events through an Informer (more on this below).

When the Deployment first becomes visible, the controller adds the resource object to an internal work queue and starts processing it:

It checks whether the Deployment has any ReplicaSet or Pod records associated with it by querying kube-apiserver with label selectors.

Interestingly, this synchronization process is state agnostic: it reconciles new records in exactly the same way as existing ones.

Upon realizing that no ReplicaSet or Pod records are associated with the Deployment, the Deployment controller begins a scaling process:

It creates a ReplicaSet resource, assigns it a label selector, and sets its revision number to 1.

The ReplicaSet’s PodSpec is copied from the Deployment manifest, along with other relevant metadata. Sometimes the Deployment record also needs to be updated afterwards (for example, if a progress deadline is set).

When these steps are complete, the status of the Deployment is updated and the controller re-enters the same reconciliation loop, waiting for the Deployment to match the desired state. Since the Deployment controller is only concerned with creating ReplicaSets, reconciliation has to be continued by the ReplicaSet controller.

ReplicaSets controller

In the previous step, the Deployment controller created the first ReplicaSet, but there are still no Pods, so it is time for the ReplicaSet controller to take the stage. Its job is to monitor the life cycle of ReplicaSets and their dependent resources (Pods). Like most other controllers, it does this through handlers that are triggered on certain events.

When a ReplicaSet is created (in our case, by the Deployment controller), the ReplicaSet controller inspects the state of the new ReplicaSet and detects any difference between the current state and the desired state. It then adjusts the number of Pod replicas to reach the desired state.

Pods are created in batches, starting with SlowStartInitialBatchSize and doubling the batch size after each successful iteration, in a kind of slow-start operation. The goal is to reduce the risk of kube-apiserver being swamped by a large number of unnecessary HTTP requests when many Pods fail to start (for example, because of resource quotas). If creation is going to fail, it is better for it to fail gracefully and with minimal impact on other system components.
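The slow-start idea is easy to see in code. The following Python sketch is an illustration of the doubling-batch technique, not the controller's implementation: slow_start_batch and its parameters are hypothetical names, and each batch is run sequentially here for simplicity, whereas a real controller could issue the calls in a batch concurrently.

```python
from typing import Callable

def slow_start_batch(count: int, initial_batch_size: int,
                     create: Callable[[], bool]) -> int:
    """Call `create` up to `count` times in doubling batches.

    Stops after the first batch that contains a failure and returns the
    number of successful calls.
    """
    remaining = count
    successes = 0
    batch_size = min(remaining, initial_batch_size)
    while batch_size > 0:
        results = [create() for _ in range(batch_size)]
        successes += sum(results)
        if not all(results):
            break  # a failure in this batch: do not send any more requests
        remaining -= batch_size
        batch_size = min(2 * batch_size, remaining)
    return successes
```

With an initial batch size of 1, seven Pods are created in batches of 1, 2, and 4; if the very first creation fails, only one request is ever sent instead of seven.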

Kubernetes enforces a strict object hierarchy through Owner References, a field in the child resource that refers to the ID of its parent. This ensures that child resources are garbage-collected once the resource managed by a controller is deleted (cascading deletion), and it also gives parent resources an effective way to avoid fighting over the same children (imagine a scenario where two sets of parents both think they own the same child).

Another benefit of Owner References is that they are stateful. If any controller is restarted, the system keeps running stably because the topology of resource objects is independent of the controller. This emphasis on isolation is also reflected in the design of controllers themselves: a controller must not operate on resources it does not explicitly own, and it should stake its claim of ownership without interfering with or sharing other controllers’ resources.

Sometimes orphaned resources appear in the system, usually in one of two ways:

The parent resource is deleted, but its children are not.

Garbage collection policies prohibit the deletion of child resources.

When this happens, controllers ensure that the orphan is adopted by a new owner. Multiple parents can race to adopt the same orphan, but only one will succeed (the others receive a validation error).


Informers

You may have noticed that some components (such as the RBAC authorizer or the Deployment controller) need to retrieve cluster state in order to function. Take the RBAC authorizer: when a request comes in, the authenticator saves an initial representation of the user’s state, and the RBAC authorizer later uses it to retrieve all the roles and role bindings associated with the user in etcd. How do controllers access and modify such resource objects? Kubernetes solves this problem with the Informer mechanism.

An Informer is a pattern that lets controllers look up data in a local in-memory cache (maintained by the Informer itself) and list the resources they are interested in.

Although the design of the Informer is abstract, it implements a lot of detailed handling logic internally (such as caching). Caching is important because it not only reduces direct calls to the Kubernetes API but also saves a lot of repetitive work for the server and the controllers. By using Informers, different controllers can also interact in a thread-safe manner, without having to worry about conflicts when multiple threads access the same resource.
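As a rough illustration of the caching idea, the toy class below keeps a local copy of the objects it has seen and fires callbacks only for objects that are new. It is a deliberately simplified sketch: TinyInformer and its methods are hypothetical, it is refreshed from a full listing passed in by the caller, and it leaves out the thread safety that real Informers provide.

```python
from typing import Callable, Dict, List, Optional

class TinyInformer:
    """A toy local cache with add-event callbacks."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}
        self.on_add: List[Callable[[dict], None]] = []

    def sync(self, listed: Dict[str, dict]) -> None:
        """Reconcile the cache with a fresh listing, firing add callbacks."""
        for key, obj in listed.items():
            if key not in self.cache:
                for callback in self.on_add:
                    callback(obj)          # only new objects trigger callbacks
            self.cache[key] = obj
        for key in list(self.cache):
            if key not in listed:
                del self.cache[key]        # drop objects that no longer exist

    def get(self, key: str) -> Optional[dict]:
        """Read from the local cache instead of calling the API."""
        return self.cache.get(key)
```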

For a more detailed analysis of Informer, see Kubernetes: Controllers, Informers, Reflectors and Stores


Scheduler

Once all controllers have run, one Deployment, one ReplicaSet, and three Pod resource records are stored in etcd and can be viewed through kube-apiserver. However, the Pods are still in the Pending state because they have not yet been scheduled onto a suitable node in the cluster. This last problem is resolved by the scheduler.

The scheduler runs as a standalone component of the cluster control plane and works like any other controller: it listens for events and tries to reconcile the system state with the desired state. Specifically, it picks out Pods whose NodeName field in the PodSpec is empty, binds each one to a suitable node in the cluster according to a specific algorithm and scheduling policy, and writes the binding information to etcd. The default scheduling algorithm works as follows:

When the scheduler starts, it registers a chain of default predicates (pre-selection policies). These predicates evaluate candidate nodes to determine whether they meet the requirements of the Pod being scheduled. For example, if the PodSpec requests CPU and memory resources and a candidate node’s remaining capacity cannot satisfy them, the Pod will not be scheduled onto that node (remaining capacity = total capacity of the candidate node minus the sum of the resources (CPU and memory) requested by all containers of the Pods already on the node).

Once the candidate nodes that meet the requirements have been selected, a chain of priority functions (preferred policies) computes a score for each remaining node, the nodes are ranked, and the one with the highest score wins. For example, to spread workloads across the system, these priority functions favor the node with the least resource consumption among the candidates. Each priority function produces a score for every node, the scores are added up, and the node with the highest total is selected.
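The two phases (filter, then score) can be sketched in Python. This is a simplified model, not the scheduler's algorithm: the dictionary layout for nodes and Pods, the function name schedule, and the single scoring formula (fraction of CPU and memory left free after placement) are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

def schedule(pod: Dict[str, int], nodes: List[Dict]) -> Optional[str]:
    """Return the name of the best node for `pod`, or None if none fits."""
    def free(node: Dict, resource: str) -> int:
        # Remaining capacity = total capacity minus requests of existing Pods.
        used = sum(p[resource] for p in node["pods"])
        return node["capacity"][resource] - used

    # Phase 1 (predicates): keep only nodes with enough free CPU and memory.
    feasible = [n for n in nodes
                if free(n, "cpu") >= pod["cpu"]
                and free(n, "memory") >= pod["memory"]]
    if not feasible:
        return None

    # Phase 2 (priorities): sum per-resource scores; higher means more
    # capacity would remain free after placing the Pod (hypothetical scoring).
    def score(node: Dict) -> float:
        return sum((free(node, r) - pod[r]) / node["capacity"][r]
                   for r in ("cpu", "memory"))

    return max(feasible, key=score)["name"]
```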

Once a suitable node is found, the scheduler creates a Binding object whose Name and UID match the Pod and whose ObjectReference field contains the name of the selected node. It then sends the object to the apiserver in a POST request.

When kube-apiserver receives this Binding object, the registry deserializes it and updates the following fields on the Pod resource:

It sets NodeName to the node name in the ObjectReference.

It adds any relevant annotations.

It sets the PodScheduled status condition to True. You can view it with kubectl:

$ kubectl get -o go-template='{{range .status.conditions}}{{if eq .type "PodScheduled"}}{{.status}}{{end}}{{end}}'

Once the scheduler has assigned a Pod to a node, the kubelet on that node takes over and begins the deployment. Both the pre-selection policies (predicates) and the preferred policies (priority functions) can be extended with the --policy-config-file parameter. If the default scheduler does not meet your needs, you can deploy a custom scheduler: if spec.schedulerName in the PodSpec is set to another scheduler, Kubernetes hands the scheduling of that Pod over to that scheduler.

6. Kubelet

Pod synchronization

Now that all the controllers have done their work, let’s summarize:

The HTTP request passed through the authentication, authorization, and admission control phases.

One Deployment, one ReplicaSet, and three Pod resources were persisted to etcd.

A series of Initializers then ran.

Finally, each Pod was scheduled to a suitable node.

So far, however, all the state changes have applied only to resource records stored in etcd. The next steps involve actually running the Pods across the worker nodes, which is the whole point of a distributed system like Kubernetes. This work is done by the kubelet component, so let’s get started!

In a Kubernetes cluster, a kubelet service process runs on every node. It handles the tasks the scheduler assigns to the node and manages the life cycle of Pods, which includes mounting volumes, container logging, garbage collection, and other Pod-related events.

Another way to think of the kubelet is as a special kind of controller: every 20 seconds (this is configurable) it queries kube-apiserver for the list of Pods whose NodeName matches its own node. Once it has the list, it detects newly added Pods by comparing the list with its own internal cache and, if there is a discrepancy, starts synchronizing the Pod list. Let’s look at the synchronization process in detail:

If the Pod is being created, the kubelet records some startup metrics that are used in Prometheus to track Pod startup latency.

It then generates a PodStatus object, which represents the state of the Pod’s current phase. The phase of a Pod is a high-level summary of where the Pod is in its life cycle; its values are Pending, Running, Succeeded, Failed, and Unknown. Generating this state is quite involved, so it is worth understanding how it works:

First, a chain of PodSyncHandlers is executed serially, with each handler checking whether the Pod should still be running on the node. If any of them decides that the Pod no longer belongs there, the Pod’s phase changes to PodFailed and the Pod is evicted from the node. For example, when you create a Job, a Pod whose failure-retry time exceeds the value set in spec.activeDeadlineSeconds is evicted from the node.

Next, the Pod’s phase is determined by the status of both its init containers and its application containers. Because no containers have been started yet, they are all classed as waiting, and any Pod with at least one waiting container has the Pending phase.

Finally, the Pod’s condition is determined by the status of all of its containers. Since our containers have not yet been created by the container runtime, the PodReady condition is set to False. You can view it with kubectl:

$ kubectl get -o go-template='{{range .status.conditions}}{{if eq .type "Ready"}}{{.status}}{{end}}{{end}}'

Once the PodStatus (the status field of the Pod) has been generated, the kubelet sends it to the Pod’s status manager, whose job is to update the record in etcd asynchronously through the apiserver.

Next, a series of admission handlers is run to ensure that the Pod has the correct security permissions (including enforcing AppArmor profiles and NO_NEW_PRIVS). Pods rejected at this stage stay in the Pending state.

If the kubelet was started with the --cgroups-per-qos parameter, it creates a cgroup for the Pod and applies the corresponding resource limits. This makes quality of service (QoS) management easier for Pods.

It then creates the data directories for the Pod: the Pod directory (/var/run/kubelet/pods/), its volumes directory (/volumes), and its plug-ins directory (/plugins).

The volume manager mounts the volumes defined in spec.volumes and waits for the mounts to succeed. Depending on the type of volume being mounted, some Pods may need to wait longer (for example, with NFS volumes).

All Secrets listed in spec.imagePullSecrets are retrieved from the apiserver so that they can later be injected into the container.

Finally, the container is started through the Container Runtime Interface (CRI), described in more detail below.

CRI and Pause containers

At this point, most of the setup is done and the container is ready to be started. The software that does this is called the container runtime (for example, Docker or rkt).

To make it more extensible, the kubelet has interacted with container runtimes through the Container Runtime Interface (CRI) since version 1.5.0. In short, CRI provides an abstraction between the kubelet and a specific runtime implementation, communicating over protocol buffers (think of them as a faster JSON) and a gRPC API (an API style well suited to Kubernetes operations). This is a really neat idea: because the kubelet and the runtime agree on a defined contract, the implementation details of how containers are orchestrated become largely irrelevant. And since no Kubernetes core code needs to change, developers can add new runtimes with minimal overhead.

Sorry for the digression; let’s get back to starting the container. When a Pod is first started, the kubelet invokes the RunPodSandbox remote procedure call (RPC). A sandbox is the CRI term for a set of containers, which in Kubernetes means a Pod. The term is deliberately broad so that it still makes sense for runtimes that do not use containers (for example, in a hypervisor-based runtime, a sandbox might be a virtual machine).

The container runtime in our example is Docker, where creating a sandbox means creating a pause container first. The pause container serves as the parent of all the other containers in the Pod because it hosts many of the Pod-level resources that the business containers will use. These resources are Linux namespaces (including the network, IPC, and PID namespaces).

The pause container provides a way to host all of these namespaces and lets the business containers share them. One benefit of sharing a network namespace is that containers in the same Pod can reach each other over localhost. The second role of the pause container has to do with how PID namespaces work. In a PID namespace, processes form a tree, and if a child process is orphaned because its parent died, it is adopted by the init process and eventually reaped. For details on how pause works, see The Almighty Pause Container.

Once the pause container has been created, the next step is to check the disk state and then start the business containers.

CNI and Pod networks

Our Pod now has its basic skeleton: a pause container that hosts all the namespaces so that the business containers in the same Pod can communicate. One question remains, though: how is the containers’ network set up?

When the kubelet sets up networking for a Pod, it delegates the task to a CNI plug-in. CNI stands for Container Network Interface. Much like the Container Runtime Interface, CNI is an abstraction that allows different network providers to supply different network implementations for containers. The kubelet configures the pause container’s network by streaming the data from a JSON configuration file (under /etc/cni/net.d by default) to the relevant CNI binary (under /opt/cni/bin by default) through stdin. All the other containers in the Pod then use the pause container’s network. Here is a simple example configuration file:

{
    "cniVersion": "0.3.1",
    "name": "bridge",
    "type": "bridge",
    "bridge": "cnio0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "${POD_CIDR}"}]
        ],
        "routes": [{"dst": ""}]
    }
}

Kubelet also passes additional metadata for the Pod, including the Pod name and namespace, to the CNI plug-in via the CNI_ARGS environment variable.
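To make the invocation more concrete, here is a rough Python sketch of how a caller might assemble the pieces: the path of the plug-in binary, the environment variables, and the JSON sent on stdin. The CNI_* variable names come from the CNI specification, and the K8S_POD_* keys are the ones Kubelet is generally known to use. The function name, the container ID, and the namespace path are hypothetical, and nothing is actually executed.

```python
import json


def build_cni_invocation(network_config, container_id, netns_path,
                         pod_name, pod_namespace, cni_bin_dir="/opt/cni/bin"):
    """Return (binary path, environment, stdin payload) for a CNI ADD call."""
    env = {
        "CNI_COMMAND": "ADD",            # ask the plug-in to attach the container
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,         # network namespace of the pause container
        "CNI_IFNAME": "eth0",            # interface name inside the container
        "CNI_PATH": cni_bin_dir,
        # Extra metadata is a semicolon-separated list of KEY=VALUE pairs.
        "CNI_ARGS": ";".join([
            "IgnoreUnknown=1",
            f"K8S_POD_NAMESPACE={pod_namespace}",
            f"K8S_POD_NAME={pod_name}",
            f"K8S_POD_INFRA_CONTAINER_ID={container_id}",
        ]),
    }
    # The "type" field of the configuration selects the plug-in binary.
    binary = f"{cni_bin_dir}/{network_config['type']}"
    stdin_payload = json.dumps(network_config)
    return binary, env, stdin_payload


if __name__ == "__main__":
    config = {"cniVersion": "0.3.1", "name": "bridge", "type": "bridge"}
    binary, env, payload = build_cni_invocation(
        config, "abc123", "/proc/4242/ns/net", "nginx", "default")
    print(binary)
    print(env["CNI_ARGS"])
    print(payload)
```

A real caller would then run the binary with this environment and write the payload to its standard input.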

The steps that follow vary between CNI plug-ins. Using the bridge plug-in as an example:

The plug-in first sets up the local Linux bridge in the root network namespace (that is, the host’s network namespace) to provide network services to all containers on that host.

It then inserts a network interface (one end of the VETH device pair) into the pause container’s network namespace and connects the other end to the bridge. You can think of a VETH device pair this way: It is like a long pipe, with one end connected to the container and the other to the root network namespace, through which packets are propagated.

The IPAM Plugin specified in the JSON file then assigns an IP address to the pause container’s network interface and sets up the corresponding route. The Pod now has its own IP address.

The IPAM Plugin works much like the CNI Plugin: it is invoked as a binary through a standardized interface. Each IPAM Plugin must determine the IP address, subnet, gateway, and routes of the container's network interface and return this information to the CNI Plugin. The most common IPAM Plugin is host-local, which assigns IP addresses to containers from a predefined address pool. It stores the pool and assignment information in the host's file system, which ensures that every container on the same host gets a unique IP address.
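The idea behind host-local can be sketched in a few lines of Python. This is a simplified imitation, not the plug-in's real code: it reserves an address by creating one file per IP in a directory, so an address that already has a file is never handed out twice. The function name, the directory layout, and the choice of the first usable address as gateway are assumptions made for the example.

```python
import ipaddress
import os
import tempfile


def allocate_ip(subnet, store_dir, container_id):
    """Reserve the next free address in `subnet`, recording it under `store_dir`."""
    network = ipaddress.ip_network(subnet)
    hosts = iter(network.hosts())
    gateway = next(hosts)  # assumption: the first usable address is the gateway
    os.makedirs(store_dir, exist_ok=True)
    for candidate in hosts:
        path = os.path.join(store_dir, str(candidate))
        try:
            # O_EXCL makes creation fail if the address is already reserved.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as handle:
            handle.write(container_id)  # remember who owns the address
        return {"ip": f"{candidate}/{network.prefixlen}", "gateway": str(gateway)}
    raise RuntimeError(f"no free addresses left in {subnet}")


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    print(allocate_ip("10.22.0.0/29", store, "container-a"))
    print(allocate_ip("10.22.0.0/29", store, "container-b"))
```

Keeping the state on the local file system is what makes the scheme host-local: uniqueness is guaranteed per host, so each host needs its own non-overlapping subnet.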

Finally, Kubelet passes the IP addresses of the cluster's internal DNS servers to the CNI plug-in, which then writes them to the container's /etc/resolv.conf file.

Once the above steps are completed, the CNI plug-in returns the results of the operation to Kubelet in JSON format.
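As an illustration of what a caller could do with that JSON, the following Python sketch extracts the Pod IP and DNS servers from a result. The field names follow my reading of the 0.3.x CNI result format, and the sample values are invented, so treat both as assumptions and not as output captured from a real plug-in.

```python
import json

# Hypothetical result, shaped like a CNI 0.3.x success response.
SAMPLE_RESULT = """
{
  "cniVersion": "0.3.1",
  "ips": [
    {"version": "4", "address": "10.22.0.2/24", "gateway": "10.22.0.1"}
  ],
  "dns": {"nameservers": ["10.96.0.10"]}
}
"""


def summarize_result(raw):
    """Pull the assigned addresses and DNS servers out of a CNI result."""
    result = json.loads(raw)
    # Each entry carries the address in CIDR form; strip the prefix length.
    pod_ips = [entry["address"].split("/")[0] for entry in result.get("ips", [])]
    nameservers = result.get("dns", {}).get("nameservers", [])
    return {"pod_ips": pod_ips, "nameservers": nameservers}


if __name__ == "__main__":
    print(summarize_result(SAMPLE_RESULT))
```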

Cross-host container networking

So far, we have described how containers communicate with hosts, but what about containers that communicate across hosts?

Containers on different hosts typically communicate over an overlay network, which is a way to dynamically synchronize routes between multiple hosts. A popular overlay network plug-in is flannel. For details about how flannel works, please refer to the CoreOS documentation.

Container startup

With all the networks configured, it’s time to actually start the business container!

Once the sandbox is initialized and active, Kubelet can start creating containers for it. It first starts the init containers defined in the PodSpec and then starts the business containers. The process is as follows:

First, the container image is pulled. If the image lives in a private registry, the Secrets specified in the PodSpec are used to pull it.

The container is then created through the CRI interface. Kubelet populates a ContainerConfig data structure (which defines the command, image, labels, volume mounts, devices, environment variables, and so on) from the PodSpec and sends it to the CRI interface via protobufs. In Docker's case, the CRI implementation deserializes this payload, populates its own configuration structures, and sends them to the dockerd daemon. In the process, it adds metadata labels (such as container type, log path, sandbox ID, and so on) to the container.

The container is then registered with the CPU manager, an alpha feature introduced in Kubelet 1.8 that uses the UpdateContainerResources CRI method to assign the container to a set of CPUs on the node.

Finally, the container is actually started.

If a Pod is configured with container lifecycle hooks, these hooks run after the container is started. There are two types of hooks: Exec (executing a command) and HTTP (sending an HTTP request). If the PostStart hook takes too long to run, hangs, or fails, the container will never reach the Running state.
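The ordering described in this section can be summarized with a toy Python model. It makes no real runtime calls: it only records the sequence of steps (init containers first, one at a time, then business containers, each followed by its optional PostStart hook). The class, function, and state names are invented for the example, and the CPU manager step is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Container:
    name: str
    image: str
    # Hypothetical stand-in for a PostStart hook; returns True on success.
    post_start: Optional[Callable[[], bool]] = None


def _launch(container: Container, events: List[str]) -> None:
    """Record the pull, create, and start steps for one container."""
    events.append(f"pull {container.image}")
    events.append(f"create {container.name}")
    events.append(f"start {container.name}")


def start_pod(init_containers: List[Container],
              app_containers: List[Container]) -> Tuple[List[str], Dict[str, str]]:
    events: List[str] = []
    states: Dict[str, str] = {}
    # Init containers run one by one, each to completion.
    for container in init_containers:
        _launch(container, events)
        events.append(f"completed {container.name}")
    # Business containers start only after every init container has finished.
    for container in app_containers:
        _launch(container, events)
        if container.post_start is not None:
            events.append(f"poststart {container.name}")
            if not container.post_start():
                states[container.name] = "NotRunning"  # failed hook
                continue
        states[container.name] = "Running"
    return events, states


if __name__ == "__main__":
    events, states = start_pod(
        [Container("init-db", "busybox")],
        [Container("web", "nginx", post_start=lambda: True)],
    )
    print(events)
    print(states)
```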

7. Conclusion

If all goes well, you should now have three containers running on your cluster, with all of their networking, volumes, and Secrets attached via the CRI interface and configured successfully.

The flow chart of the entire Pod creation process described above is as follows:

Kubelet process for creating a Pod