1 Background

NVIDIA provides the nvidia-smi command-line tool for querying and monitoring GPU data. However, checking GPU data manually in real time is inconvenient for users.

This article describes how to use the custom monitoring feature of the Aliyun cloud monitoring service to visualize GPU monitoring data and set up alarms for GPU cloud servers.

2 Custom monitoring and alarm

The Aliyun cloud monitoring service provides a custom monitoring feature that users can use to monitor their own data and raise alarms on it. GPU data collected on a GPU cloud host can be reported through the API or SDK that custom monitoring provides. After the corresponding GPU monitoring items are added on the cloud monitoring console, the data of a specified GPU in a specified GPU instance can be monitored, and alarm rules can be set on each monitoring item so that alarms on the monitoring data are raised automatically.

For example, key metrics such as GPU utilization, video memory utilization, video memory usage, power, and temperature can be monitored and alarmed on. For details, see Creating custom monitoring items and alarm rules.
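The alarm rules themselves are configured and evaluated on the cloud monitoring console, not in user code. Purely to illustrate what a threshold rule means, the sketch below checks whether the most recent samples of a metric all exceed a limit. The function name, the "consecutive periods" semantics, and the sample values are assumptions made for illustration and do not describe the service's actual evaluation logic.

```python
def should_alarm(values, threshold, consecutive):
    """Return True if the last `consecutive` samples all exceed `threshold`.

    Hypothetical helper for local experiments only; the real rules are
    evaluated by the cloud monitoring service.
    """
    if consecutive <= 0 or len(values) < consecutive:
        return False
    return all(v > threshold for v in values[-consecutive:])
```

For instance, with a temperature threshold of 80 and three consecutive periods, the samples 70, 85, 86, 90 would trigger the rule, while a single reading of 90 would not.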

3 Monitoring data reporting

The custom monitoring SDK supports Python and bash. You can write scripts that call the SDK interfaces to report monitoring data.

Run the script on a schedule so that data is reported at the reporting period defined when the monitoring items were created. On Linux you can use crontab; on Windows you can use Quartz.NET. For details, see Monitoring Data Reporting.
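Where a cron entry is not convenient, the same period-aligned reporting can be approximated inside a long-running Python process. The following is a minimal sketch under that assumption; crontab remains the approach this article uses. The function names and the injected `clock` and `sleep` parameters are hypothetical and exist only to keep the loop easy to test.

```python
import time


def seconds_until_next_tick(now, period_s):
    """Seconds to wait from `now` until the next multiple of `period_s`."""
    return period_s - (now % period_s)


def run_periodically(task, period_s, iterations, clock=time.time, sleep=time.sleep):
    """Call `task` once per period, aligned to period boundaries.

    `iterations` bounds the loop so the sketch terminates; a real
    reporter would loop until it is stopped.
    """
    for _ in range(iterations):
        sleep(seconds_until_next_tick(clock(), period_s))
        task()
```

Here `task` would be the script's reporting function, and `period_s` should match the reporting period chosen when the monitoring item was created.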

4 GPU Data Collection

The NVIDIA driver ships the NVIDIA Management Library (NVML), which provides an interface for collecting GPU data, along with the nvidia-smi command for collecting GPU-related data. NVML is officially supported for Perl and Python. Because the custom monitoring SDK supports Python, we can download the NVML Python bindings and write Python scripts to collect GPU data.

NVML Python bindings can be downloaded from the following link: pypi.python.org/pypi/nvidia…
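As a sketch of how the five metrics mentioned earlier might be read through the NVML Python bindings, the code below queries each GPU and converts the raw readings to the units used by the monitoring items in this article. The dictionary keys and both function names are hypothetical. The conversion treats one megabyte as 1024 * 1024 bytes, which is an assumption, and relies on NVML reporting power in milliwatts. Note that the memory figure returned by nvmlDeviceGetUtilizationRates is the share of time the memory was being read or written, not the share of memory in use.

```python
def to_metrics(gpu_util, mem_util, mem_used_bytes, power_mw, temp_c):
    """Convert raw NVML readings to the units of the monitoring items."""
    return {
        "gpu_utilization_percent": gpu_util,
        "memory_utilization_percent": mem_util,
        # Assumption: "Megabytes" means 1024 * 1024 bytes.
        "memory_used_megabytes": mem_used_bytes / (1024 * 1024),
        # NVML reports power draw in milliwatts.
        "power_watts": power_mw / 1000.0,
        "temperature_celsius": temp_c,
    }


def collect_gpu_metrics():
    """Return one metrics dict per GPU; needs an NVIDIA driver and pynvml."""
    import pynvml  # imported lazily so to_metrics works without a GPU

    pynvml.nvmlInit()
    try:
        metrics = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)
            temp_c = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            metrics.append(
                to_metrics(util.gpu, util.memory, mem.used, power_mw, temp_c)
            )
        return metrics
    finally:
        # Always release NVML, even if a query fails.
        pynvml.nvmlShutdown()
```

Some GPUs do not support every query (power readings in particular), in which case the bindings raise an NVML error that a production script would need to handle.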

5 Sample

5.1 Creating Custom Monitoring Items

Create a custom monitoring item on the cloud monitoring console as shown in the following figure:

5.2 Viewing Monitoring Item Data

View monitoring items on the cloud monitoring console, as shown below:

GPU utilization of GPU 0 in an instance (unit: Percent):

Video memory utilization of GPU 0 in an instance (unit: Percent):

Video memory usage of GPU 0 in an instance (unit: Megabytes):

Power of GPU 1 in an instance (unit: Watt):

Temperature of GPU 1 in an instance (unit: Celsius):

5.3 Setting Alarm Rules

Click Alarm Management on the temperature monitoring item:

Set the temperature alarm rule:

Set the notification object:

Complete setup:

6 Reference code

Data collection:

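from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates,
)


def get_gpu_information():
    # NVML must be initialized before any other NVML call.
    nvmlInit()
    try:
        deviceCount = nvmlDeviceGetCount()
        util_list = []
        for i in range(deviceCount):
            handle = nvmlDeviceGetHandleByIndex(i)
            # Each entry exposes .gpu and .memory utilization in percent.
            util_list.append(nvmlDeviceGetUtilizationRates(handle))
    finally:
        # Release NVML even if a query fails.
        nvmlShutdown()
    return deviceCount, util_list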

Information reporting:

    for i in range(0, GPU_Count):
        gpuid = i

        cms_post.post(userid,"GPUUtilization",util_list[i].gpu,"Percent",s.format(ecsid=ecs
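Because the reporting call above is cut off, the sketch below shows one way the per-GPU loop could be completed. It takes the reporting function as a parameter and passes arguments in the same positional order as the cms_post.post call in the fragment (user ID, metric name, value, unit, dimension string). The dimension template, the "MemoryUtilization" metric name, and the function name report_utilization are assumptions made for illustration, not values confirmed by the SDK.

```python
# Assumed dimension format; the original template string is not shown.
DIMENSIONS_TEMPLATE = "ecsid={ecsid},gpuid={gpuid}"


def report_utilization(post, userid, ecsid, util_list):
    """Report GPU and memory utilization for every GPU in `util_list`.

    `post` is any callable that accepts (userid, metric, value, unit,
    dimensions); each entry in `util_list` must expose .gpu and .memory.
    """
    for gpuid, util in enumerate(util_list):
        fields = DIMENSIONS_TEMPLATE.format(ecsid=ecsid, gpuid=gpuid)
        post(userid, "GPUUtilization", util.gpu, "Percent", fields)
        # "MemoryUtilization" is a hypothetical metric name.
        post(userid, "MemoryUtilization", util.memory, "Percent", fields)
```

Passing the reporting function in, rather than importing it, lets the loop be exercised with a stub before it is pointed at the real SDK.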


This article is original content of the cloud habitat community and may not be reproduced without permission.