Preface

Business monitoring matters in many scenarios. A business monitoring dashboard lets us see the real-time state of the business at a glance, so operations staff can adjust the business in time and head off major problems.

Old Huang once had an embarrassing “accident”.

One of the businesses on one of our lines switched most of its traffic elsewhere, and our operations staff were completely unaware of it that day. Only the next day, after reading the previous day’s statistical report, did they notice that the volume of this business had dropped sharply, and only then could they follow up and coordinate.

PS: At the time there were no real-time reports. Each day’s data report was generated in the early hours of the next day, and there was no alerting mechanism.

So I put a real-time business monitoring dashboard on a big screen, so everyone could see what was going on.

Let’s take a look at the finished dashboard first.

It mainly contains the following:

  1. Total number of orders
  2. Total number of canceled orders
  3. How often orders are created over time
  4. Order volume per channel
  5. Canceled order volume per channel

Here’s how to implement such business monitoring.

Building infrastructure

Two pieces of infrastructure are involved here: Prometheus and Grafana.

First, start Prometheus; here we use Docker, driven by a PowerShell script.

# Resolve paths relative to the directory this script lives in
$base = Split-Path -Parent $MyInvocation.MyCommand.Definition
$prometheusyml = Join-Path $base prometheus.yml
$fileconfig = Join-Path $base "config"

Write-Host $prometheusyml
Write-Host $fileconfig

# Mount the main config file and the file-based service discovery directory
docker run `
    --name prom `
    -p 9090:9090 `
    -v ${prometheusyml}:/etc/prometheus/prometheus.yml `
    -v ${fileconfig}:/etc/prometheus/fileconfig `
    prom/prometheus:v2.20.1

Below is the prometheus.yml:

global:
  scrape_interval:     15s 
  evaluation_interval: 15s
  
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
      
rule_files:

scrape_configs:
  - job_name: 'file_ds'
    file_sd_configs:
    - refresh_interval: 10s
      files:
      - ./fileconfig/*.yml

File-based service discovery is used here instead of static configuration. For the other discovery mechanisms, see prometheus.io/docs/promet…

At this point Prometheus is already up and running.

Next is Grafana, which is even easier to start.

docker run -d --name grafana -p 3000:3000 grafana/grafana:7.1.3

Once it’s up, visit localhost:3000 and you’ll see the login screen.

Determine business metrics

Determining the metrics is arguably the single most important part of the whole business monitoring effort. Only when we are clear about what we want to monitor can we instrument the business and get the data we want.

This is just like the requirements we face every day: if the requirement is clear, what we build is likely to be what we want; if it is vague, what we build may well not be.

To keep things easy to follow, here is a simple monitoring example: tracking orders and canceled orders across different channels.

These counts essentially only increase within a day and never decrease, so the counter metric type is the usual choice.

One metric for orders and one for cancellations, so let’s define two:

  • yyyorder_created_total
  • yyyorder_canceled_total

For counter types, the convention is to end the metric name with _total.

What about different channels?

We will use a label to identify the channel.

The final display format is roughly as follows:

yyyorder_created_total{appkey="mt",operator="cw"} 1
yyyorder_canceled_total{appkey="pdd",operator="cw"} 2

One more thing deserves attention here: when determining metrics, avoid defining too many of them. Where possible, use labels to distinguish values of the same nature under a single metric.
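For example, instead of minting one metric per channel, keep a single metric and put the channel in a label. The per-channel names below are made up purely to show the anti-pattern:

# Avoid: one metric per channel (hypothetical names)
yyyorder_created_mt_total 1
yyyorder_created_pdd_total 2

# Prefer: one metric, channel as a label
yyyorder_created_total{appkey="mt"} 1
yyyorder_created_total{appkey="pdd"} 2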

Instrumenting the business

Once the business metrics are defined, the corresponding business code needs to be instrumented, which is somewhat intrusive. Of course, if the business code is well written and loosely coupled, the instrumentation may be done with AOP, reducing the intrusion.
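As a rough illustration of the AOP idea, here is a minimal sketch using an ASP.NET Core action filter so the controller itself stays free of metric code. The filter name and the idea of reading the channel and operator from request headers are assumptions for illustration, not part of the sample project:

using Microsoft.AspNetCore.Mvc.Filters;
using Prometheus;

public class OrderMetricsFilter : IActionFilter
{
    private static readonly Counter OrderCreatedCount = Metrics
        .CreateCounter("yyyorder_created_total", "Number of created orders.",
            new CounterConfiguration { LabelNames = new[] { "appkey", "operator" } });

    public void OnActionExecuting(ActionExecutingContext context) { }

    public void OnActionExecuted(ActionExecutedContext context)
    {
        // Hypothetical convention: channel and operator arrive as request headers.
        var appKey = context.HttpContext.Request.Headers["x-appkey"].ToString();
        var op = context.HttpContext.Request.Headers["x-operator"].ToString();
        if (!string.IsNullOrEmpty(appKey) && !string.IsNullOrEmpty(op))
        {
            OrderCreatedCount.WithLabels(appKey, op).Inc();
        }
    }
}

The filter would be registered once, e.g. services.AddControllers(o => o.Filters.Add<OrderMetricsFilter>()), instead of sprinkling counter calls through every action. For the rest of this article, though, we instrument the controller directly to keep the example obvious.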

Let’s write a simple example to simulate this instrumentation.

Create an ASP.NET Core project and install the NuGet package prometheus-net.AspNetCore.

<ItemGroup>
    <PackageReference Include="prometheus-net.AspNetCore" Version="3.6.0" />
</ItemGroup>

Next, enable the ASP.NET Core exporter middleware:

public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    if (env.IsDevelopment())
    {
        app.UseDeveloperExceptionPage();
    }

    app.UseRouting();
    app.UseAuthorization();

    app.UseEndpoints(endpoints =>
    {
        // This line exposes the metrics endpoint.
        endpoints.MapMetrics();
        endpoints.MapControllers();
    });
}

Finally, the instrumentation itself:

[ApiController]
[Route("")]
public class HomeController : ControllerBase
{
    private static readonly Counter OrderCreatedCount = Metrics
        .CreateCounter("yyyorder_created_total", "Number of created orders.", new CounterConfiguration
        {
            LabelNames = new[] { "appkey", "operator" }
        });

    private static readonly Counter OrderCanceledCount = Metrics
        .CreateCounter("yyyorder_canceled_total", "Number of canceled orders.", new CounterConfiguration
        {
            LabelNames = new[] { "appkey", "operator" }
        });

    [HttpGet]
    public string Get()
    {
        var appKeys = new[] { "ali", "pdd", "mt" };
        var operators = new[] { "cw", "pz" };

        // Pick a pseudo-random channel and operator, then count one created order.
        var rd = new Random((int)DateTimeOffset.Now.ToUnixTimeMilliseconds()).Next(0, 2000);
        var appKeyIdx = rd % 3;
        var operatorIdx = rd % 2;
        OrderCreatedCount.WithLabels(appKeys[appKeyIdx], operators[operatorIdx]).Inc();

        // Roughly 30% of visits also count a canceled order.
        var cRd = new Random((int)DateTimeOffset.Now.ToUnixTimeMilliseconds()).NextDouble();
        if (cRd < 0.3d)
        {
            OrderCanceledCount.WithLabels(appKeys[appKeyIdx], operators[operatorIdx]).Inc();
        }

        return "ok";
    }
}

In the controller above, two counters are created, matching the definitions from the metrics section.

On each visit an order is created; at the same time a random number is generated, and if it is less than 0.3 the visit is also counted as a canceled order. This lets us simulate both metrics.

Out of the box, the program already exposes some default metrics.

When we access the metrics endpoint, we can see that our custom business metrics already have data.
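The relevant part of the output looks roughly like this (the values are illustrative):

# HELP yyyorder_created_total Number of created orders.
# TYPE yyyorder_created_total counter
yyyorder_created_total{appkey="mt",operator="cw"} 3
# HELP yyyorder_canceled_total Number of canceled orders.
# TYPE yyyorder_canceled_total counter
yyyorder_canceled_total{appkey="pdd",operator="pz"} 1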

So now that we have the data, how do we present it?

To present it, we first need Prometheus to store our business metrics.

Writing the data

Data is written to Prometheus in two ways: pull and push.

Pull is the recommended approach: we expose an endpoint and Prometheus scrapes the data we produce from it.

Push means the application pushes its metrics to a Pushgateway, which Prometheus then scrapes in turn.
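For completeness, prometheus-net also ships a MetricPusher for the push model. A minimal sketch, assuming a Pushgateway is reachable at the address below (both the address and the job name are made up for illustration):

using Prometheus;

// Periodically pushes the default registry to a Pushgateway.
var pusher = new MetricPusher(
    endpoint: "http://localhost:9091/metrics",  // hypothetical Pushgateway address
    job: "yyyorder_demo");                      // hypothetical job name

pusher.Start();

We don’t use this in the article; pull is the road taken here.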

By default, once we call endpoints.MapMetrics();, the data is exposed at http://ip:port/metrics.

Now that we know the data will be pulled, what’s left? Configuring Prometheus, of course.

Since scrape_configs uses file-based service discovery, we only need to add a corresponding YML file under the mounted directory.

Old Huang added an nc-service.yml here; the details are as follows:

- labels:
    service: nc
    project: demo
  targets: 
  - 192.168.1.103:9874
  - 192.168.1.103:9875

At this point, you can see our two addresses on the Targets page.

From Prometheus’ built-in UI, we can also see that the data is being scraped normally.

Now comes the real data query and presentation.

Presenting the data

With the steps above ensuring that data can be written and queried properly, we are ready to build the business monitoring dashboard in Grafana.

Configure our data source in Grafana.

Fill in the address of our Prometheus instance and save it. A green prompt tells us the data source is working properly.

Let’s start with the total order count.

Create a new Dashboard and create a Panel.

Fill in the panel details and pick the visualization we want.

Then write the query, and we can see the result we want.

The query is as follows:

sum(ceil(increase(yyyorder_created_total[1d])))

Three functions are at work here: sum, ceil, and increase.

increase measures the increment over a time range; the [1d] selector that follows says we want the one-day increment.

ceil rounds the result of increase up to an integer.

If you look at the graph below, you’ll see the raw results carry a lot of decimals.

This comes from the way Prometheus calculates increase (it extrapolates over the range); we won’t expand on that here, we just smooth it over with ceil.

sum is just a sum: the metric has many label combinations, and summing across all of them gives the real total.
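To see why the sum is needed, run the query without it and you get one time series per label combination (the values are illustrative):

ceil(increase(yyyorder_created_total[1d]))

# => {appkey="mt", operator="cw"}  12
#    {appkey="pdd", operator="pz"}  7
#    ...

sum() collapses these series into the single total the panel should display.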

So here we have the following result.

The total number of canceled orders works the same way; just swap in the canceled-order metric name:

sum(ceil(increase(yyyorder_canceled_total[1d])))

Next, let’s look at the order statistics per channel.

Since we want per-channel statistics, we need the label defined earlier: appkey represents the channel, so we can group by it.

This gives the following query:

sum by (appkey) (ceil(increase(yyyorder_created_total[1d])))

The results are as follows:

Similarly, canceled orders per channel follow the same pattern:

sum by (appkey) (ceil(increase(yyyorder_canceled_total[1d])))

PS: If you want orders and canceled orders in the same graph, just add multiple queries to the panel.

The following is an example:

Now that we have the total across all channels and the total per channel, is there a way to see the trend over time?

There certainly is; let Old Huang explain.

Anyone who asks this has likely lived through it: in some periods the volume is heavy, in others it is nearly zero, tracing out a heartbeat pattern.

We can describe this as the growth of orders over time.

So we turn to the rate function, which computes the growth rate for us.

rate gives the average per-second growth rate, which is a bit too fine-grained, so we multiply by 60 to get a per-minute rate.

Then we sum across the label combinations and round the result up.

ceil(sum(rate(yyyorder_created_total[5m]) * 60))

The results are as follows:

From this result we can see that for most of the time no orders were coming in; only in the middle period did some arrive.
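If you also want this trend broken down by channel, the grouping trick from earlier applies just as well (a variant query, not one of the original panels):

ceil(sum by (appkey) (rate(yyyorder_created_total[5m]) * 60))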

Now that the main panels are done, all that’s left is resizing and repositioning them on the dashboard.

Conclusion

With that, we have a reasonably good monitoring dashboard, but a few issues still deserve attention:

  1. Prometheus stores data locally and will eventually hit the storage ceiling; either delete old data periodically or write to remote storage (a retention-flag sketch follows this list).
  2. PromQL, Prometheus’ own query language, can feel awkward at first and may not give the results you want right away; consult the docs and practice against real data.
  3. For instrumentation, keep the intrusion into existing business code as small as possible.
  4. Business metrics must be nailed down first; otherwise instrumentation is painful, and querying is painful too.
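As a minimal sketch of capping local retention, Prometheus 2.x accepts a --storage.tsdb.retention.time flag at startup; the 15d value below is only an example. Note that passing any arguments to the Docker image overrides its default command, so the config-file flag has to be restated:

docker run `
    --name prom `
    -p 9090:9090 `
    -v ${prometheusyml}:/etc/prometheus/prometheus.yml `
    prom/prometheus:v2.20.1 `
    --config.file=/etc/prometheus/prometheus.yml `
    --storage.tsdb.retention.time=15d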

Alerting isn’t covered here; when there’s time, Old Huang will write a follow-up on it.

Sample code:

github.com/catcherwong…

This article first appeared on my personal WeChat official account, Not Just Old Huang, where content goes out from time to time. Follow it if you’re interested!