NP
nodepools !
Getting Node Pools
A node pool, as we can see from get_node_pools.go (in https://github.com/vmware-tanzu/tanzu-framework/blob/main/cmd/cli/plugin/cluster/get_node_pools.go), is just a view over existing CAPI MachineDeployments .
func listNodePoolsInternal(cmd *cobra.Command, server *configapi.Server, clusterName string) error {
mdOptions := tkgclient.GetMachineDeploymentOptions{
ClusterName: clusterName,
Namespace: lnp.namespace,
}
machineDeployments, err := tkgctlClient.GetMachineDeployments(mdOptions)
...
for _, md := range machineDeployments
t.AddRow(md.Name, md.Namespace, md.Status.Phase, md.Status.Replicas, md.Status.ReadyReplicas, md.Status.UpdatedReplicas, md.Status.UnavailableReplicas)
Setting Node Pools
The Tanzu Framework codebase defines 3 inputs to a node pool.
type clusterSetNodePoolCmdOptions struct {
FilePath string
Namespace string
BaseMachineDeployment string
}
You thus create a node pool by: - defining its namespace - defining its parent MachineDeployment - defining a FilePath with its customizations
How Node pool files are interpretted
The node pool input value is a golang struct. There are two layers to it:
The first layer is infrastructure independent:
// NodePool a struct describing a node pool
type NodePool struct {
Name string `yaml:"name"`
Replicas *int32 `yaml:"replicas,omitempty"`
AZ string `yaml:"az,omitempty"`
NodeMachineType string `yaml:"nodeMachineType,omitempty"`
WorkerClass string `yaml:"workerClass,omitempty"`
Labels *map[string]string `yaml:"labels,omitempty"`
VSphere VSphereNodePool `yaml:"vsphere,omitempty"`
Taints *[]corev1.Taint `yaml:"taints,omitempty"`
VMClass string `yaml:"vmClass,omitempty"`
StorageClass string `yaml:"storageClass,omitempty"`
TKRResolver string `yaml:"tkrResolver,omitempty"`
Volumes *[]tkgsv1alpha2.Volume `yaml:"volumes,omitempty"`
TKR tkgsv1alpha2.TKRReference `yaml:"tkr,omitempty"`
NodeDrainTimeout *metav1.Duration `yaml:"nodeDrainTimeout,omitempty"`
BaseMachineDeployment string `yaml:"baseMachineDeployment,omitempty"`
}
The second layer is specific to your cloud: Vsphere, AWS, Azure, etc...
// VSphereNodePool a struct describing properties necessary for a node pool on vSphere
type VSphereNodePool struct {
CloneMode string `yaml:"cloneMode,omitempty"`
Datacenter string `yaml:"datacenter,omitempty"`
Datastore string `yaml:"datastore,omitempty"`
StoragePolicyName string `yaml:"storagePolicyName,omitempty"`
Folder string `yaml:"folder,omitempty"`
Network string `yaml:"network,omitempty"`
Nameservers []string `yaml:"nameservers,omitempty"`
TKGIPFamily string `yaml:"tkgIPFamily,omitempty"`
ResourcePool string `yaml:"resourcePool,omitempty"`
VCIP string `yaml:"vcIP,omitempty"`
Template string `yaml:"template,omitempty"`
MemoryMiB int64 `yaml:"memoryMiB,omitempty"`
DiskGiB int32 `yaml:"diskGiB,omitempty"`
NumCPUs int32 `yaml:"numCPUs,omitempty"`
[]string `yaml:"nameservers,omitempty"`
}
The input for a node pool file
The input for a node pool file is documented here:
https://docs-staging.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.1/using-tkg-21/workload-clusters-pool.html#sample-config
You can see in this file that the schema is something like
name: <-- first layer
replica: <-- first layer
labels:
...
vsphere:
... <-- second layer of infra specific parameters
Example: Evolving VsphereNodePool to support GPUs
Every time new data is added to a VsphereMachineTemplate, we need to update the code in tanzu framework to have coverage of the new types of inputs we provide to the underlying machine tempaltes. A good example of this is PCI Passthrough, wherein these parameters. Specifically, lets look at how GPUs work.
What changes in a PCI/GPU enabled VSphereMachineTemplate ?
One typical way that you'd customize GPUs would be to add a configuration like this (see Itay's post on how he does node pools https://cloudnativeapps.blog/tkg-gpu-integration/ for details)...
...
- name: worker
value:
count: 1
machine:
customVMXKeys:
pciPassthru.64bitMMIOSizeGB: "16"
pciPassthru.RelaxACSforP2P: "true"
pciPassthru.allowP2P: "true"
pciPassthru.use64bitMMIO: "true"
diskGiB: 300
memoryMiB: 16384
numCPUs: 4
- name: pci
value:
worker:
devices:
- deviceId: 7864
vendorId: 4318
hardwareVersion: vmx-17
In this case, the following fields are modified when making new VSphere machines that allow GPU workloads:
1) /spec/template/spec/hardwareVersion
vmx-17
2) /spec/template/spec/customVMXKeys:
- pciPassthru.allowP2P:true
pciPassthru.RelaxACSforP2P:true
pciPassthru.use64bitMMIO:true
pciPassthru.64bitMMIOSizeGB:512
3) /spec/template/spec/pciDevices
- deviceId: 0x10DE <-- this defines "nvidia"
vendorId: 0x1EB8 <-- this identifies that it is a "T4 GPU"
- deviceId: ...
vendorId: ...
In otherwords, the three fields: hardwareVersion, customVMXKeys, and pciDevices, all must be modified when making a GPU compatible VSphereVM. And thus, we will add these keys in future versions of TKG to support the creation of node pools, on the fly, which are able to inject vGPUs.
0) CAPV ensures that tanzu cli boots your VM in mode vmx 17, by setting the
spec.template.spec.hardwareVersion=vmx-17.
1) CAPV then launches a VM with VMX Keys which are sent to your device (you send this in as a string) to tanzu CLI.
/spec/template/spec/customVMXKeys
- pciPassthru.allowP2P:true
pciPassthru.RelaxACSforP2P:true
pciPassthru.use64bitMMIO:true
pciPassthru.64bitMMIOSizeGB:512
2) VSphere then boots VM which has a PCI card attached to it, with PCI settings that allow it to work in a GPU context. 3) You then install the nvidia gpu driver pods 4) The PC card is mounted into the nvidia driver pod as a file 5) NVIDIA driver pod then runs a C program that scans PCI devices . This C Driver has the ability to read and parse PCI devices (using nvml.h) 6) It then tells the kubelet via GRPC
service Registration {
rpc Register(RegisterRequest) returns (Empty) {}
}
about this new device 7) The Kubelet then adds metadata to the nodes APIServer object. 8) The scheduler can now allocate GPUs, by querying the node.status "nvidia.com/gpu" field:
apiVersion: v1
kind: Node
metadata:
name: gpu-node-1
labels:
kubernetes.io/hostname: gpu-node-1
spec:
taints:
- effect: NoSchedule
key: nvidia.com/gpu
value: "true"
status:
capacity:
cpu: "8"
memory: 32100Mi
nvidia.com/gpu: "1" <-- this is what is scheduled to the pods
Note that above, we have exactly 1 gpu, because with PCI Passthrough, you only get one schedulable GPU card per device.
An example of how to make a cluster with two node pools
When making a new TKG Cluster, you can define a 2nd machinedeployment, as shown below...
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
annotations:
osInfo: ubuntu,20.04,amd64
tkg/plan: dev
labels:
tkg.tanzu.vmware.com/cluster-name: l02-tkg-wld-gpu
name: l02-tkg-wld-gpu
namespace: default
spec:
clusterNetwork:
pods:
cidrBlocks:
- 100.96.0.0/11
services:
cidrBlocks:
- 100.64.0.0/13
topology:
class: tkg-vsphere-default-v1.0.0
controlPlane:
metadata:
annotations:
run.tanzu.vmware.com/resolve-os-image: image-type=ova,os-name=ubuntu
replicas: 1
variables:
- name: cni
value: antrea
- name: controlPlaneCertificateRotation
value:
activate: true
daysBefore: 90
- name: auditLogging
value:
enabled: false
- name: trust
value:
additionalTrustedCAs:
- data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURkekNDQWwrZ0F3SUJBZ0lRSDR3TmRUaEtDSnBPdjhLMUx6UDdzREFOQmdrcWhraUc5dzBCQVFzRkFEQk8KTVJVd0V3WUtDWkltaVpQeUxHUUJHUllGYkc5allXd3hGekFWQmdvSmtpYUprL0lzWkFFWkZnZDBaWEpoYzJ0NQpNUnd3R2dZRFZRUURFeE4wWlhKaGMydDVMVXhCUWkxQlJEQXhMVU5CTUI0WERURTRNVEl5TURFME5UVTBORm9YCkRUTXpNVEl5TURFMU1EVTBNMW93VGpFVk1CTUdDZ21TSm9tVDhpeGtBUmtXQld4dlkyRnNNUmN3RlFZS0NaSW0KaVpQeUxHUUJHUllIZEdWeVlYTnJlVEVjTUJvR0ExVUVBeE1UZEdWeVlYTnJlUzFNUVVJdFFVUXdNUzFEUVRDQwpBU0l3RFFZSktvWklodmNOQVFFQkJRQURnZ0VQQURDQ0FRb0NnZ0VCQUxWWHAwUlhyT09DZmRVZElUNmF1aDU0CmFTNXN2STNPVml0VGVmUFFiRTQxd0U0Y1FRNll6SDB2cnQ3QjZscnlMSFF0L0VROGxVVTNQTEdEOU4rT25rWWwKa2tKcmZTZ2FTMHlLU3htaXJkaFlRNHZ6Z3psL2hyRXMxZkFQWVo2NkUra3lBc29aQTRsQnZrR0wxNFZ3MVNBMAo5TkV3eTlqOTZsOU9WdFlQcDV4R1c0SWJGZHdLMk96bW9SWFFGRmdBd3JlQkdOS2l0M3BJZkRPby82bWZxZTVXCnFYNUNZbGNVOXJjR3VzWnBoc0U0WTVkS1FRelF2dFMwOUxnSEZlNjA2Wm5QMXUyc1FUdFp2VzdUSzBDYW81ZnMKUkF4T0NmS1ZzZ1Z1SVVNbUhKd0lRK2x4MEdQS3ppVWx6K2F0Qm05Z09UWXBjZmQzcFlQMUJPQnBOTzROUTk4QwpBd0VBQWFOUk1FOHdDd1lEVlIwUEJBUURBZ0dHTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZISE4rVlp6L1hGR0JJZEZmOEV6UnYyNnd1WFFNQkFHQ1NzR0FRUUJnamNWQVFRREFnRUFNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCS2pFQlJRSERzY01aZnBhVEJmRDhLcjBMU2dpbnJzU0JNWmtiUHdMMmt1bTlPdVM5RgpzVVZQOEd4OVc3dDBJdzMveGFHN01qL1ZOUDVHSDVVNFAzUW56dXRRMUo5L1Vsb1hrRThlZHlqYlEvUzVLeFU4CjBERFBZSXk2aTdBR09nR3RuYm4zbDZabEtXd0R5cGQzVFcwVFEwVVFObTlqNFhuTm9vM0xwbTFpV3NKVFRCMGwKbFdNZVNxMzZOeHAyTzlZcnh6amhxdnRLTmZKNjAzY0gycVlMK1UvZ0V4OEp1VGVLdEpQOTI3bnNqSlUzbGltegpYSzFjdndxQlpSbEFXcGd0RGxRSWw4U3B1Nnc2TDJBaWdVdHg0ZjhZMFFWV29EWEw3dmxXN1JxSlQ0THpCUlRzCkl5MUdGVjlxdnF3NFdlTWs3TXl2QWRrM3d3Vy9nQ01RY1BzSAotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
name: proxy
- name: podSecurityStandard
value:
audit: baseline
deactivated: false
warn: baseline
- name: apiServerEndpoint
value: ""
- name: aviAPIServerHAProvider
value: true
- name: vcenter
value:
cloneMode: fullClone
datacenter: /Main
datastore: /Main/datastore/LAB-V3-vSANDatastore
folder: /Main/vm/LABS/itay/l02/tkg/workload-clusters
network: /Main/network/itay-k8s-nodes
resourcePool: /Main/host/LAB-V3/Resources/US
server: ts-vc-01.terasky.local
storagePolicyID: k8s-storage-policy-vsan
template: /Main/vm/LABS/itay/l01/tkg/templates/ubuntu-2004-efi-kube-v1.24.10+vmware.1
tlsThumbprint: ""
- name: user
value:
sshAuthorizedKeys:
- ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC6TPWbdTQQGQzypvuhYCdUK0ZEnjgaCi0ilQbgHlgiicAFG6Nlw5NqBi7UtYm5fFurzQ4sNHl5ysgQM9lIODHt/RsdL0hZFjxpnQGRpHZU856s8DKbGuv7Sm/7M8E7oQxqWqzlhwauddFCI+wy6jVxAZhFpFraM5kcP7bPNUnRwb70hatxgeDvHrDjwO/qPIu6i9E5bGwCMJ8dvOS3Ujv//YuLh2IRecPZoFqLF98nxXOHCX38lPikEtzd7sLf3t+rRYOMlJVgkBz5URbM0VWvHgeXSvBiGogxVrDD8PLDfUPJ01v4WoC+hxkq0F8YwfQcCNi4vYzZKEDgauzs+TWbG0jnZ2SSw3vIgUPsgn+W+8PBrQne5YaNRUzpGNhk9VkhTPsPhSuis6B1sxpJ+m5jZdNw1LxkEKclslN5jbFfBviSuLMNi7jbzfFJeGJxAqyMxlblGyxnt/FlqSkmS5Owjsqz1UNtJFfoqhz73VApLzaG379CjX3Z2TX2zMgWjec=
- name: controlPlane
value:
machine:
diskGiB: 50
memoryMiB: 8192
numCPUs: 2
- name: worker
value:
count: 1
machine:
diskGiB: 300
memoryMiB: 16384
numCPUs: 4
- name: pci #### <--- variables.overrides will override this value below...
value: {}
version: v1.24.10+vmware.1-tkg.2
workers:
machineDeployments:
- class: tkg-worker ### <-- note this tkg-worker is reused in the 2nd node pool
metadata:
annotations:
run.tanzu.vmware.com/resolve-os-image: image-type=ova,os-name=ubuntu
name: md-0
replicas: 1
- class: tkg-worker ### <-- reusing the tkg-worker , we will pretend to customize these nodes to support GPUs
metadata:
annotations:
run.tanzu.vmware.com/resolve-os-image: image-type=ova,os-name=ubuntu
name: md-1-gpu
replicas: 1
### Note that the below in tkg 2.1.1 must be created via kubectl create -f , rather then tanzu create -f...
variables:
overrides:
- name: worker
value:
count: 1
machine:
customVMXKeys:
pciPassthru.64bitMMIOSizeGB: "16"
pciPassthru.RelaxACSforP2P: "true"
pciPassthru.allowP2P: "true"
pciPassthru.use64bitMMIO: "true"
diskGiB: 300
memoryMiB: 16384
numCPUs: 4
- name: pci
value:
worker:
devices:
deviceId: 7864
vendorId: 4318
hardwareVersion: vmx-17
Note that in the below machineDeployments, we created 2 node pools, and in the second we override: - worker - pci Inside of the vspheremachinetemplate for the GPU node pool. This correlated to the pci variable we added above in our cluster variables
- name: pci #### <--- variables.overrides will override this value below...
value: {}
You don't necessarily need to add EVERYTHING you want to override to the cluster variables, but pci parameters (for some reason) are one such variable that DO need to be explicitly given a placeholder.
What about TKGS ?
In TKGS, the TanzuKubernetesCluster has a nodepool object that is explicitly defined with regard to a vmClass.
nodePools:
- name: string
labels: map[string]string
taints:
- key: string
value: string
effect: string
timeAdded: time
replicas: int32
vmClass: string
storageClass: string
volumes:
- name: string
mountPath: string
capacity:
storage: size in GiB
tkr:
reference:
name: string
nodeDrainTimeout: string