
What is a GPU?

A graphics processing unit (GPU), also known as a display core, visual processor, or display chip, is a microprocessor specialized for image and graphics computation in personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). It converts the display information the computer system needs into signals that drive the display, supplies line-scan signals to the monitor, and controls correct display output. It is an important component connecting the display to the PC motherboard, and one of the key devices for “human-computer interaction”. As an important part of the computer, the graphics card handles graphics output. It matters a great deal to people doing professional graphic design, and it is also widely used in deep learning.

Background knowledge

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility built on the NVIDIA Management Library (NVML), designed to help manage and monitor NVIDIA GPU devices.

The utility lets an administrator query GPU device state and, with the appropriate privileges, modify it. It targets Tesla, GRID, Quadro, and Titan X products, and offers limited support for other NVIDIA GPUs. nvidia-smi ships with the NVIDIA GPU display driver on Linux and on 64-bit Windows, starting with Windows Server 2008 R2 and Windows 7. It can report query results as XML or as human-readable plain text, written either to standard output or to a file.
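As a rough illustration of the XML report format, the sketch below parses a report such as the one produced by `nvidia-smi -q -x`. The element names used here (`product_name`, `uuid`, `fan_speed`, `temperature/gpu_temp`) are assumptions based on typical driver output and may differ between driver versions; the sample document is a hand-written, trimmed-down illustration, not real output.

```python
import subprocess
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed sample (illustrative values only).
SAMPLE_XML = """
<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>GeForce GTX 750</product_name>
    <uuid>GPU-00000000-0000-0000-0000-000000000000</uuid>
    <fan_speed>40 %</fan_speed>
    <temperature>
      <gpu_temp>40 C</gpu_temp>
    </temperature>
  </gpu>
</nvidia_smi_log>
"""


def parse_nvidia_smi_xml(xml_text):
    """Return one dict per <gpu> element found in the XML report."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.iter("gpu"):
        gpus.append({
            # findtext returns None when an element is missing.
            "name": gpu.findtext("product_name"),
            "uuid": gpu.findtext("uuid"),
            "fan_speed": gpu.findtext("fan_speed"),
            "temperature": gpu.findtext("temperature/gpu_temp"),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi (must be on PATH) and parse its XML report."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-x"],
        capture_output=True, text=True, check=True,
    )
    return parse_nvidia_smi_xml(result.stdout)


if __name__ == "__main__":
    # Parse the bundled sample so the script runs without a GPU.
    for gpu in parse_nvidia_smi_xml(SAMPLE_XML):
        print(gpu)
```

On a machine with the driver installed, calling `query_gpus()` instead of parsing `SAMPLE_XML` should return the same structure for the real devices.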

Example nvidia-smi output:

How to use nvidia-smi on Windows?

nvidia-smi is installed together with the NVIDIA graphics driver, so you can find nvidia-smi.exe in the default driver installation directory, C:\Program Files\NVIDIA Corporation\NVSMI. Drag the file into a CMD window and run it to display information about the GPU, as shown below:

The GPU here is an NVIDIA GeForce GTX 750. The fields in the output table have the following meanings:

  • GPU: the GPU index.
  • Name: the GPU model.
  • Fan: the fan speed, from 0 to 100%.
  • Temp: the temperature, in degrees Celsius.
  • Perf: the performance state, from P0 to P12. P0 is maximum performance and P12 is minimum performance.
  • Pwr:Usage/Cap: the current power draw and the power limit.
  • Memory-Usage: video memory usage.
  • Bus-Id: the GPU's PCI bus ID, in the form domain:bus:device.function.
  • Disp.A: Display Active, indicating whether a display is initialized on the GPU.
  • Volatile GPU-Util: GPU utilization (GPU load).
  • Uncorr. ECC: uncorrected ECC (Error Correcting Code) errors, from error checking and correction.
  • Compute M.: the compute mode.
  • Processes: the GPU memory usage of each process.
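The same fields can also be read programmatically. The following is a minimal sketch, assuming nvidia-smi is on PATH, that uses the `--query-gpu` and `--format=csv,noheader,nounits` options and parses the result. The field selection and the helper names (`QUERY_FIELDS`, `parse_query_output`, `query_gpus`) are choices made for this example, and the sample row is made up for illustration.

```python
import csv
import io
import subprocess

# Properties to request; each maps to a column in the table above.
QUERY_FIELDS = [
    "index", "name", "fan.speed", "temperature.gpu", "pstate",
    "power.draw", "memory.used", "memory.total", "utilization.gpu",
]

# Made-up sample row in the csv,noheader,nounits layout.
SAMPLE_OUTPUT = "0, GeForce GTX 750, 40, 35, P8, 1.50, 120, 1024, 3\n"


def parse_query_output(text):
    """Turn CSV output (one GPU per line) into a list of dicts."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [value.strip() for value in row]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi with the selected fields and parse the output."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_query_output(result.stdout)


if __name__ == "__main__":
    # Uses the bundled sample so the script runs without a GPU.
    for gpu in parse_query_output(SAMPLE_OUTPUT):
        print(gpu["index"], gpu["name"], gpu["temperature.gpu"], gpu["pstate"])
```

All values come back as strings; some cards report `[N/A]` for properties they do not support, so convert to numbers only after checking.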

Monitoring NVIDIA GPUs with Telegraf + Grafana

Telegraf provides an nvidia_smi input plug-in that collects GPU performance data.

1. Configure the plug-in

[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"

  ## Optional: timeout for GPU polling
  timeout = "5s"

2. Collected metrics

Measurement: nvidia_smi

  • tags
    • name (the GPU model, for example GeForce GTX 1070 Ti)
    • compute_mode (the GPU compute mode, for example Default)
    • index (the index of the port through which the GPU connects to the motherboard, for example 1)
    • pstate (the GPU performance state, for example P0)
    • uuid (for example, GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
  • fields
    • fan_speed (integer, percentage)
    • memory_free (integer, MiB)
    • memory_used (integer, MiB)
    • memory_total (integer, MiB)
    • power_draw (float, W)
    • temperature_gpu (integer, ℃)
    • utilization_gpu (integer, percentage)
    • utilization_memory (integer, percentage)

Sample of collected data:

Grafana Dashboard

Related information:

  • Github.com/zuozewei/bl…