Linux Operation Guide
Linux server information acquisition
Get operating system version
Input: cat /etc/redhat-release
Output: CentOS Linux release 7.9.2009 
From the above results, we know that the operating system version is CentOS 7.9
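When this check is scripted, the version can be pulled out of the release string instead of read by eye. The following is a minimal sketch; the function name parse_release_version is hypothetical, and it assumes the string contains the word "release" followed by a dotted version, as in the output above.

```shell
# Print "major.minor" from a release string such as the one in /etc/redhat-release.
parse_release_version() {
    # Keep the "digits.digits" pair that follows the word "release".
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
# Usage on the server: parse_release_version "$(cat /etc/redhat-release)"
```

If the string does not match the assumed pattern, the function prints nothing, which a calling script can treat as "unknown version".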
Get local intranet IP
Input: ip r get 1 | awk 'NR==1 {print $NF}'
Output: 172.17.228.252
According to the above results, 172.17.228.252 is the intranet IP address
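The command above prints the last field of the line, which works on this system. Newer iproute2 versions may append extra fields (such as uid) after the source address, so a more defensive variant is to print the value that follows the "src" keyword. A minimal sketch, with extract_src_ip as a hypothetical helper name:

```shell
# Read `ip route get` output on stdin and print the value that follows "src".
extract_src_ip() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}
# Usage on the server: ip r get 1 | extract_src_ip
```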
Get operating system kernel
Input: uname -r
Output: 3.10.0-1160.el7.x86_64
From the above results, we know that the system kernel version is 3.10.0-1160 and the CPU architecture is x86_64
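The two values can also be separated in a script with shell parameter expansion. This is a sketch only: split_kernel_release is a hypothetical name, and it assumes the RHEL/CentOS naming style in which the distribution tag starts with ".el" and the architecture is the text after the last dot.

```shell
# Split a kernel release string into "version architecture".
split_kernel_release() {
    rel=$1
    arch=${rel##*.}        # text after the last dot, e.g. x86_64
    version=${rel%%.el*}   # text before the first ".el", e.g. 3.10.0-1160
    printf '%s %s\n' "$version" "$arch"
}
# Usage on the server: split_kernel_release "$(uname -r)"
```

On systems where the architecture alone is needed, uname -m prints it directly.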
Get the number of CPU cores
Input: grep processor /proc/cpuinfo|wc -l
Output: 16
From the above results, we know that the system has 16 logical CPUs. This count includes hyper-threads, so the number of physical cores may be lower
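The same count can be obtained without a pipe by letting grep count the matching lines. A minimal sketch follows; count_logical_cpus is a hypothetical helper, and anchoring the pattern at the start of the line is a defensive choice rather than something the original command requires.

```shell
# Count "processor" entries in a cpuinfo-style file (default: /proc/cpuinfo).
count_logical_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}
# Usage on the server: count_logical_cpus
```

The nproc command reports the number of processing units available to the current process and is usually the shortest option.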
Get CPU instruction set
Input: lscpu|grep -E 'avx2|avx|fma'
Output: 
Flags:  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni
From the above results, we know that the CPU supports the fma, avx, and avx2 instruction sets
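Because the grep pattern matches substrings, "avx" also matches avx2 and the avx512 flags, and the whole Flags line is printed even if only one pattern matches. To test for one exact flag, the list can be matched as whole words. A minimal sketch, with has_cpu_flag as a hypothetical helper name:

```shell
# Return success if the space-separated flag list in $1 contains the exact flag $2.
has_cpu_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
# Usage on the server (hypothetical wiring):
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   has_cpu_flag "$flags" avx2 && echo "avx2 supported"
```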
Get memory size
Input: free -h|awk '{if (NR==2) print $2}'
Output: 30G
From the above results, we know that the memory size is 30 GB
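The free -h output is rounded for readability. When a script needs a number to compare against, the MemTotal value in /proc/meminfo (reported in kB) can be converted instead. A minimal sketch; mem_total_gib is a hypothetical helper name and the one-decimal format is an arbitrary choice.

```shell
# Read /proc/meminfo-style text on stdin and print total memory in GiB.
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }'
}
# Usage on the server: mem_total_gib < /proc/meminfo
```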
Check disk mounts at boot
Input: cat /etc/fstab
Output:
UUID=9f2d3e15-a78a-4f3d-8385-0165b4b67864 /  ext4 defaults 1 1
/dev/sdb1 /home  ext4 defaults 1 1
As shown above, the device /dev/sdb1 is mounted on the /home directory with the ext4 file system, and it is mounted automatically at boot
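Each fstab entry has six fields: device, mount point, file system type, mount options, dump flag, and fsck order. For a quick overview, the first three fields can be listed while skipping comment lines. A minimal sketch, with list_fstab_entries as a hypothetical helper name:

```shell
# Read fstab-style text on stdin; print "device mountpoint fstype" per entry.
list_fstab_entries() {
    # Skip comment lines (first field starts with "#") and lines that are too short.
    awk '$1 !~ /^#/ && NF >= 3 { print $1, $2, $3 }'
}
# Usage on the server: list_fstab_entries < /etc/fstab
```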
Check whether the firewall starts automatically
Input: systemctl is-enabled firewalld
Output: disabled
From the above results, we know that the firewall is not set to start automatically at boot
Check whether the user has sudo permission
Input: sudo -l -U laiye
Output:
    (ALL) ALL
From the above results, we know that user laiye has sudo permission
Check the graphics card driver
Input: nvidia-smi
Output: -bash: nvidia-smi: command not found
From the above results, we know that the nvidia-smi command was not found. This indicates that the graphics card driver is not installed or that the driver is abnormal
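In a script, it is cleaner to test whether the command exists before calling it than to parse the shell's error message. A minimal sketch; have_command is a hypothetical helper name built on the standard command -v.

```shell
# Return success if the named command can be found by the shell.
have_command() {
    command -v "$1" >/dev/null 2>&1
}
# Usage on the server:
#   have_command nvidia-smi || echo "nvidia-smi not found; the driver may not be installed"
```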
Linux Installation and maintenance
Installing the graphics card driver
- Get graphics card model
Input: lspci|grep -i nvidia
Output:
    0000:3f:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    0000:40:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    0000:43:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
From the above results, we know that the hexadecimal device ID of the graphics card is 1eb8. Searching a PCI ID database for 1eb8 shows that the graphics card model is Tesla T4
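When a server has several cards, the distinct device IDs can be extracted from the lspci output in one step. This is a sketch under an assumption: it only matches lines in the "NVIDIA Corporation Device xxxx" form shown above, which lspci prints when its local ID database does not know the device name. The helper name nvidia_device_ids is hypothetical.

```shell
# Read lspci output on stdin; print each distinct NVIDIA device ID once.
nvidia_device_ids() {
    sed -n 's/.*NVIDIA Corporation Device \([0-9a-f]\{4\}\).*/\1/p' | sort -u
}
# Usage on the server: lspci | nvidia_device_ids
```

If lspci already prints the model name, the function prints nothing; in that case lspci -nn shows the numeric vendor and device IDs next to the name.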
- Get graphics card driver
On the official NVIDIA driver download page, search for the driver for the specified graphics card model and download the binary driver file. The recommended CUDA Toolkit version is 10.2
- Install the driver
Transfer the downloaded driver to the GPU server for installation
- Check the driver
Input: nvidia-smi
If the command returns GPU information, the driver installation is complete
- Install NVIDIA docker
A container that uses GPU resources depends on NVIDIA docker. Docker must be installed before NVIDIA docker can be installed. Remember to back up the /etc/docker/daemon.json file first, because installing NVIDIA docker replaces the entire contents of this file
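The backup can be taken with a small helper before running the installation commands. A minimal sketch; the name backup_file and the ".bak" suffix are assumptions, not part of the original procedure.

```shell
# Copy a file to "<file>.bak", preserving mode and timestamps; fail if it is missing.
backup_file() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    cp -p "$1" "$1.bak"
}
# Usage on the server: backup_file /etc/docker/daemon.json
```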
Input: wget -c https://private-deploy.oss-cn-beijing.aliyuncs.com/pengyongshi/nvidia/nvidia-docker2.tar.gz
Input: tar -zxvf nvidia-docker2.tar.gz
Input: cd nvidia-docker2
Input: rpm -ivUh *.rpm --force --nodeps
After the above steps, /etc/docker/daemon.json is replaced by a default file. Append the contents of the previously backed-up daemon.json to it. The resulting file looks like this:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "insecure-registries":[
        "10.116.28.239:8888"
    ]
}
Restart docker
Input: systemctl restart docker
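After docker is running again, it is worth confirming that the new default runtime took effect. The docker info command prints a "Default Runtime" line; a minimal sketch for extracting it follows, with default_runtime as a hypothetical helper name.

```shell
# Read `docker info` output on stdin and print the default runtime name.
default_runtime() {
    sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}
# Usage on the server: docker info 2>/dev/null | default_runtime
```

With the daemon.json shown above, the expected value is nvidia.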
- Install k8s NVIDIA plug-in
To use GPUs in Kubernetes, the NVIDIA device plugin is required. The NVIDIA device plugin is a DaemonSet that automatically enumerates the GPUs on each node of the cluster and allows pods to run on GPUs. It only needs to be installed on master1
Input: wget -c https://private-deploy.oss-cn-beijing.aliyuncs.com/pengyongshi/nvidia/k8s-device-plugin.tar
Input: docker load -i k8s-device-plugin.tar
Write the following to a file named nvidia-device-plugin.yaml and place it in /var/lib/kubelet/plugins/
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
Input: kubectl apply -f /var/lib/kubelet/plugins/nvidia-device-plugin.yaml
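Once the DaemonSet is running, each GPU node should advertise an nvidia.com/gpu resource. The sketch below reads the output of kubectl describe node and prints the first count it finds. The helper name gpu_count is hypothetical, and the sketch assumes the usual "nvidia.com/gpu:  N" layout in the Capacity section.

```shell
# Read `kubectl describe node` output on stdin; print the first nvidia.com/gpu count.
gpu_count() {
    awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}
# Usage on the server: kubectl describe node <node-name> | gpu_count
```

An empty result suggests that the plugin has not registered any GPUs on that node yet.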
- Install cudnn
Input: wget -c https://private-deploy.oss-cn-beijing.aliyuncs.com/pengyongshi/nvidia/cudnn-10.2-linux-x64-v7.6.5.32.tgz
Input: tar xvf cudnn-10.2-linux-x64-v7.6.5.32.tgz
Input: cp cuda/include/cudnn.h /usr/local/cuda/include/
Input: cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
Input: chmod -R a+r /usr/local/cuda
Input: export PATH=$PATH:/usr/local/cuda/bin
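The export above only affects the current shell session. To keep the setting, the same line is typically added to a profile script, and running it repeatedly would append the directory again each time. A minimal sketch of an idempotent variant; path_append is a hypothetical helper name.

```shell
# Append a directory to PATH only if it is not already present.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                         # already present, nothing to do
        *) PATH="$PATH:$1"; export PATH ;;
    esac
}
# Usage (for example in a profile script): path_append /usr/local/cuda/bin
```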
Add sudo account
By default, users in the wheel group have sudo permission, so we only need to add the ordinary user to the wheel group. The -a option appends the group; without it, -G replaces the user's existing supplementary groups
Input: usermod -aG wheel laiye
Output: 
If the output is empty, the ordinary user laiye has been added to the wheel group
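An empty output only shows that the command did not fail. To confirm the membership itself, the group list printed by id -nG can be checked for an exact match. A minimal sketch; group_listed is a hypothetical helper name.

```shell
# Read a space-separated group list on stdin; succeed if group $1 is in it.
group_listed() {
    # Put one group per line, then require an exact whole-line match.
    tr ' ' '\n' | grep -qxF "$1"
}
# Usage on the server: id -nG laiye | group_listed wheel && echo "laiye is in wheel"
```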
Turn off and disable SELinux and the firewall
Input: setenforce 0
Output: 
Input: sed -i 's/\(SELINUX=\)[a-z].*/\1disabled/' /etc/selinux/config
Output:
Input: systemctl stop firewalld
Output: 
Input: systemctl disable firewalld
Output:
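Since all of these commands print nothing on success, a short check helps confirm the result. The setenforce 0 command changes only the running state, while the configuration file decides the mode after the next reboot, so the file is worth reading back. A minimal sketch; selinux_config_mode is a hypothetical helper name.

```shell
# Print the SELINUX= value from a config file (default: /etc/selinux/config).
selinux_config_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/selinux/config}"
}
# Usage on the server: selinux_config_mode
```

After the sed command above, the expected value is disabled. For the firewall, systemctl is-enabled firewalld should again report disabled.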