Version: v1.3

Linux Operation Guide

Linux server information acquisition

Get operating system version

Input: cat /etc/redhat-release
Output: CentOS Linux release 7.9.2009 

From the above results, we know that the operating system version is CentOS 7.9
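
When a script needs the version as a value rather than a full sentence, the release string can be parsed. The following is a minimal sketch; the function name release_version is hypothetical, and it assumes the release string contains a "major.minor" number as in the output above.

```shell
# Hypothetical helper: extract "major.minor" from a Red Hat style release string.
release_version() {
    # Print the first "digits.digits" sequence found in the argument.
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+' | head -n 1
}

# Usage on a real host, only if the release file exists.
if [ -r /etc/redhat-release ]; then
    release_version "$(cat /etc/redhat-release)"
fi
```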

Get local intranet IP

Input: ip r get 1 | awk 'NR==1 {print $NF}'
Output: 172.17.228.252

According to the above results, 172.17.228.252 is the intranet IP address
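
The command above prints the last field of the route line. On systems with a newer iproute2, the line ends with an extra "uid" field, so the last field is no longer the address. The sketch below looks for the value after the "src" keyword instead; the function name route_src_ip is hypothetical.

```shell
# Hypothetical helper: read "ip route get" output on stdin and
# print the address that follows the "src" keyword.
route_src_ip() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Usage on a real host:
#   ip r get 1 | route_src_ip
```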

Get operating system kernel

Input: uname -r
Output: 3.10.0-1160.el7.x86_64

From the above results, we know that the system kernel version is 3.10.0-1160 and the CPU architecture is x86_64
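
If only the base kernel version is needed, it can be cut from the release string, and the architecture can be read directly with uname -m instead of being inferred from the suffix. A minimal sketch, assuming the release string has the "version-build" form shown above; kernel_version is a hypothetical name.

```shell
# Hypothetical helper: keep the text before the first "-" of a kernel release string.
kernel_version() {
    printf '%s\n' "${1%%-*}"
}

kernel_version "$(uname -r)"   # base kernel version of this host
uname -m                       # machine architecture of this host
```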

Get the number of CPU cores

Input: grep processor /proc/cpuinfo|wc -l
Output: 16

From the above results, we know that the number of logical CPU cores is 16
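
The same count can be obtained without a pipe by letting grep count the matching lines. This is a sketch only; count_cpus is a hypothetical name, and anchoring the pattern to the start of the line avoids counting other lines that happen to contain the word.

```shell
# Hypothetical helper: count "processor" entries in a cpuinfo-style file.
# Defaults to /proc/cpuinfo when no file is given.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Usage on a real host.
if [ -r /proc/cpuinfo ]; then
    count_cpus
fi
```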

Get CPU instruction set

Input: lscpu|grep -E 'avx2|avx|fma'
Output:
Flags:  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni

From the above results, we know that the CPU supports the avx, avx2 and fma instruction sets
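
The grep above prints the whole flags line as soon as any one of the three names appears, and "avx" also matches inside "avx2" or "avx512f". To check each flag separately with an exact match, something like the following sketch can be used; has_cpu_flag is a hypothetical name.

```shell
# Hypothetical helper: succeed if the space-separated flag list ($1)
# contains exactly the flag ($2).
has_cpu_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on a real host: take the flags from the first "flags" line of /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    flags=$(grep -m 1 '^flags' /proc/cpuinfo | cut -d: -f2)
    for f in avx avx2 fma; do
        if has_cpu_flag "$flags" "$f"; then echo "$f: yes"; else echo "$f: no"; fi
    done
fi
```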

Get memory size

Input: free -h|awk '{if (NR==2) print $2}'
Output: 30G

From the above results, we know that the memory size is 30 GiB
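
The total can also be read from /proc/meminfo, which reports MemTotal in kB. The sketch below converts that value to GiB with one decimal place; mem_total_gib is a hypothetical name.

```shell
# Hypothetical helper: print MemTotal from a meminfo-style file in GiB.
# Defaults to /proc/meminfo when no file is given.
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage on a real host.
if [ -r /proc/meminfo ]; then
    mem_total_gib
fi
```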

Check disk boot mount

Input: cat /etc/fstab

Output:

UUID=9f2d3e15-a78a-4f3d-8385-0165b4b67864 /  ext4 defaults 1 1
/dev/sdb1 /home  ext4 defaults 1 1

As you can see from the above, the device /dev/sdb1 is mounted on the /home directory with the ext4 file system, and it is mounted automatically at boot
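
On hosts with a longer /etc/fstab, a short summary of the active entries is easier to read than the raw file. A minimal sketch follows; list_fstab_mounts is a hypothetical name, and it simply skips comment lines and prints the first three fields.

```shell
# Hypothetical helper: print "device on mountpoint (fstype)" for every
# non-comment entry of an fstab-style file. Defaults to /etc/fstab.
list_fstab_mounts() {
    awk '!/^[[:space:]]*#/ && NF >= 3 { printf "%s on %s (%s)\n", $1, $2, $3 }' "${1:-/etc/fstab}"
}

# Usage on a real host.
if [ -r /etc/fstab ]; then
    list_fstab_mounts
fi
```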

Check whether the firewall starts automatically

Input: systemctl is-enabled firewalld
Output: disabled

From the above results, we know that the firewall is not set to start automatically at boot

Check whether the user has sudo permission

Input: sudo -l -U laiye
Output:
    (ALL) ALL

From the above results, we know that user laiye has sudo permission

Check the graphics card driver

Input: nvidia-smi
Output: -bash: nvidia-smi: command not found

From the above results, we know that the nvidia-smi command was not found. This indicates that the graphics card driver is not installed or the driver is abnormal
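
In a script, the presence of the command can be tested before calling it, so that a missing driver produces a clear message instead of a shell error. A minimal sketch; check_command is a hypothetical name.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

check_command nvidia-smi
```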

Linux Installation and maintenance

Installing the graphics card driver

Get graphics card model

Input: lspci|grep -i nvidia
Output:
    0000:3f:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    0000:40:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    0000:43:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)

From the above results, we know that the hexadecimal device ID of the graphics card is 1eb8. Searching a PCI ID database for 1eb8 shows that the model of the graphics card is Tesla T4

Get graphics card driver

On the official NVIDIA driver download page, search for the driver of the specified graphics card model to obtain the binary driver file. The recommended CUDA Toolkit version is 10.2

Install the driver

Transfer the downloaded driver to the GPU server for installation

Check the driver

Input: nvidia-smi

If the command returns output, the driver installation is complete

Install NVIDIA docker

If a container needs to use GPU resources, it depends on NVIDIA docker. You need to install docker before you can install NVIDIA docker. Remember to back up the /etc/docker/daemon.json file first, because installing NVIDIA docker replaces all the contents of this file
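
The backup can be as simple as a copy next to the original file. The sketch below is one way to do it; backup_file and the ".bak" suffix are assumptions, not part of the original procedure.

```shell
# Hypothetical helper: copy a file to "<file>.bak", preserving mode and timestamps.
# Returns non-zero if the file does not exist.
backup_file() {
    [ -f "$1" ] || { echo "nothing to back up: $1" >&2; return 1; }
    cp -p "$1" "$1.bak"
}

# Usage (run as root on the docker host):
#   backup_file /etc/docker/daemon.json
```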

Input: wget -c https://private-deploy.oss-cn-beijing.aliyuncs.com/pengyongshi/nvidia/nvidia-docker2.tar.gz

Input: tar -zxvf nvidia-docker2.tar.gz

Input: cd nvidia-docker2

Input: rpm -ivUh *.rpm --force --nodeps

After the above steps are completed, /etc/docker/daemon.json is replaced by a default JSON file. Append the contents of the previously backed-up daemon.json to this file

/etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "insecure-registries":[
        "10.116.28.239:8888"
    ]
}

Restart docker

Input: systemctl restart docker
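
After docker is running again, the output of docker info can be used to confirm that the default runtime from daemon.json took effect. A sketch under the assumption that docker info prints a "Default Runtime:" line; default_runtime is a hypothetical name.

```shell
# Hypothetical helper: read "docker info" output on stdin and
# print the value of the "Default Runtime" line.
default_runtime() {
    awk -F': ' '/Default Runtime/ { print $2; exit }'
}

# Usage on the docker host (expected to print "nvidia" with the configuration above):
#   docker info 2>/dev/null | default_runtime
```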

Install k8s NVIDIA plug-in

To use GPUs in Kubernetes, the NVIDIA device plugin is required. The NVIDIA device plugin runs as a DaemonSet, which automatically enumerates the GPUs on each node of the cluster and allows pods to use them. It only needs to be installed on master1

Input: wget -c https://private-deploy.oss-cn-beijing.aliyuncs.com/pengyongshi/nvidia/k8s-device-plugin.tar

Input: docker load -i k8s-device-plugin.tar

Write the following to the file nvidia-device-plugin.yaml and place it in /var/lib/kubelet/plugins/

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Input: kubectl apply -f /var/lib/kubelet/plugins/nvidia-device-plugin.yaml
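
Once the plug-in pods are running, each GPU node should report an nvidia.com/gpu resource. The sketch below extracts that number from the output of kubectl describe node; node_gpu_capacity is a hypothetical name, and the node name in the usage line is a placeholder.

```shell
# Hypothetical helper: read "kubectl describe node" output on stdin and
# print the first nvidia.com/gpu value (the Capacity entry).
node_gpu_capacity() {
    awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Usage on master1:
#   kubectl describe node <node-name> | node_gpu_capacity
```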

Install cudnn

Input: wget -c https://private-deploy.oss-cn-beijing.aliyuncs.com/pengyongshi/nvidia/cudnn-10.2-linux-x64-v7.6.5.32.tgz

Input: tar xvf cudnn-10.2-linux-x64-v7.6.5.32.tgz

Input: cp cuda/include/cudnn.h /usr/local/cuda/include/
    
Input: cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
    
Input: chmod -R a+r /usr/local/cuda

Input: export PATH=$PATH:/usr/local/cuda/bin

Add sudo account

By default, users in the wheel group have sudo permission, so we only need to add the ordinary user to the wheel group

Input: usermod -aG wheel laiye
Output: 

If the return value is empty, it means that the ordinary user laiye has joined the wheel group
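
An empty return value only shows that usermod did not fail. To confirm the membership itself, the group list printed by id -nG can be checked for an exact match. A minimal sketch; in_group_list is a hypothetical name.

```shell
# Hypothetical helper: read a space-separated group list on stdin and
# succeed if it contains exactly the group given as $1.
in_group_list() {
    tr ' ' '\n' | grep -Fxq "$1"
}

# Usage on a real host:
#   id -nG laiye | in_group_list wheel && echo "laiye is in wheel"
```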

Turn off and disable the firewall

Input: setenforce 0
Output: 

Input: sed -i 's/\(SELINUX=\)[a-z].*/\1disabled/' /etc/selinux/config
Output:

Input: systemctl stop firewalld
Output: 

Input: systemctl disable firewalld
Output: 
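
The first two commands above switch off SELinux (immediately with setenforce, and persistently in /etc/selinux/config), and the last two stop and disable firewalld. All four return nothing on success, so a follow-up check is useful. The sketch below reads the persistent SELinux mode back from the configuration file; selinux_config_mode is a hypothetical name.

```shell
# Hypothetical helper: print the value of the SELINUX= line of an
# SELinux config file. Defaults to /etc/selinux/config.
selinux_config_mode() {
    awk -F= '$1 == "SELINUX" { print $2 }' "${1:-/etc/selinux/config}"
}

# Usage on a real host (expected to print "disabled" after the steps above):
#   selinux_config_mode
# The firewall state can be checked with the command used earlier in this guide:
#   systemctl is-enabled firewalld
```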
