Summary: Fluent's best practices for building a unified observability platform on Log Service (SLS)
Current state of the online education industry
Online education has depended on the Internet since its introduction in the 1990s, and as Internet technology has matured, new models of online education have emerged. Products that began as simple text have expanded to images and audio. Online education has in turn accelerated digitalization: content, the core asset of education companies, keeps growing in both the degree and the scale of its digitalization. At the same time, rising usage time provides a large volume of source data for educational AI. According to statistics, online education users spent more than 2 million days online in March this year. Data at this scale is fertile ground for making the industry more intelligent, driving intelligent analysis of teaching content, course marketing, teacher management, quality assessment, and more.
Introduction to Fluent
Fluent is a world-leading technology-driven education company. As an advocate of intelligent education, Fluent has an industry-leading AI team. Over the years it has built a huge “Chinese English pronunciation database”, which holds approximately 3.7 billion minutes of conversation and 50.4 billion recorded sentences.
In 2013, Fluent launched its first product, Fluent English, which integrates core technologies such as speech recognition, scoring, and adaptive learning. It offers rich content such as contextual dialogues and pronunciation courses, along with an AI English teacher and a game-like learning experience, so that users can enjoy learning English. This fun and effective product quickly won over the market and earned high recognition from users.
However, as the business grew rapidly, the number of users on the platform rose from millions to more than 100 million. The swings in data flow between business peaks and troughs, the complexity of the business, and the difficulty of analysis have all brought great challenges to the IT architecture.
Challenges Fluent faced in building a unified monitoring platform
Fluent has no separate operations department, so the unified monitoring platform for its infrastructure is built mainly by the Cloud-Infra R&D team. The team's core needs go beyond SLA tracking, performance monitoring, alerting, and data for locating problems. They also include operating Cloud-Infra for technical value, such as utilization, cost savings, and the business relationship network. These needs place high demands on the unified monitoring platform:
1. Collect and monitor various heterogeneous data sources, including K8s, machine metrics and utilization on ECS, Istio call logs, metrics from self-built middleware, metrics provided by cloud services, and business Trace data. Real-time collection of various cost data is also required.
2. Dynamically discover and collect all kinds of resources. Data such as organizational relationships and department information must also be updated in real time, so that the most accurate metrics and ownership relationships are always available.
3. Large-scale data storage and analysis: because of the scale of the business, the variety of cloud resources in use, and the huge volume of data the business generates (tens of TB per day), the solution must support real-time analysis and presentation at this scale.
4. The monitoring platform is responsible for the stability of everything else, so it must itself be highly stable: every component must be free of single points of failure and able to recover quickly from faults.
Technology selection
A unified monitoring platform is not only about time-series data. Core business availability data has to be computed from various kinds of logs, so the overall design covers two kinds of data: Logs and Metrics. Both have a range of open source and commercial solutions, such as ES, Loki, SLS, Prometheus, OpenTSDB, and InfluxDB. Alibaba Cloud SLS was chosen as the log solution, and Prometheus plus SLS as the time-series solution. The main reasons are as follows:
1. SLS can store and analyze all kinds of data in one place and can correlate Metrics and Logs within SLS, a capability the other platforms did not offer
2. SLS handles very large data volumes and performs better than ES. It is also an O&M-free service, which removes the burden of keeping ES highly reliable
3. The time-series solution is built around Prometheus, which has a very complete ecosystem and a concise query language, PromQL. The SLS time-series store can serve as highly reliable remote storage for Prometheus, which addresses Prometheus's own reliability problem
4. SLS provides data processing, which can join logs with external data sources, making it easier to handle complex logs and enrich them with Catalog information
Overall architecture
The architecture of Fluent's current unified monitoring platform is shown in the figure above:
1. For automation, we developed a dynamic discovery mechanism for IaaS and PaaS resources in cloud scenarios. It adds newly purchased or created resources to monitoring and collection in real time, avoiding most manual operations
2. Logs:
• Logs from different services are collected by SLS Logtail into different Logstores
• Not all logs need to be stored and indexed for a long time, so we classify them. Logs that must be kept for auditing are delivered to OSS for long-term storage. Service troubleshooting logs are kept for two weeks with full-text indexing enabled. AccessLog has indexing enabled only on selected fields, which saves a great deal of indexing cost.
• For NGINX access logs from which SLA and PXX metrics are calculated, SLS data processing maps the URLs in each log to the corresponding department, application, and method, based on Catalog information (mapping rules, departments, applications, and so on) stored in RDS.
3. Monitoring
• Prometheus was chosen as the monitoring solution. For Fluent's scenario, we developed several exporters that extract Metrics from various cloud products and self-built components
• To make better use of Prometheus and integrate it with the internal CI/CD system, we added a sidecar to Prometheus that watches a Git repository for changes and dynamically reloads the Prometheus configuration when it changes
• Various recording rules are configured on Prometheus to speed up queries, and all recording rules are managed in Git
• Alerts from AlertManager are sent to the internal alert center, which provides advanced functions such as orchestration and escalation
• To solve the Prometheus single-point problem and to support later correlation analysis with the Catalog, we use Prometheus Remote Write to write data directly into the SLS time-series store
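The Git-watching sidecar mentioned in the monitoring bullets could look roughly like the sketch below. It is only an illustration under stated assumptions, not Fluent's actual implementation: the polling approach and every function name here are hypothetical. The one external interface it relies on is the Prometheus `/-/reload` HTTP endpoint, which is available only when Prometheus is started with `--web.enable-lifecycle`.

```python
import subprocess
import time
import urllib.request


def current_revision(repo_dir):
    """Return the commit hash currently checked out in repo_dir."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def pull(repo_dir):
    """Fast-forward the local checkout to its upstream branch."""
    subprocess.run(["git", "-C", repo_dir, "pull", "--ff-only"], check=True)


def reload_prometheus(base_url):
    """Ask Prometheus to re-read its configuration and rule files."""
    req = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


def sync_once(last_rev, fetch_rev, reload):
    """Call reload() only if the revision changed; return the revision now in effect."""
    rev = fetch_rev()
    if rev != last_rev:
        reload()
    return rev


def watch(repo_dir, prometheus_url, interval_s=30):
    """Poll the repository forever and reload Prometheus after each new commit."""
    def fetch():
        pull(repo_dir)
        return current_revision(repo_dir)

    last_rev = None
    while True:
        try:
            last_rev = sync_once(
                last_rev, fetch, lambda: reload_prometheus(prometheus_url)
            )
        except (subprocess.CalledProcessError, OSError) as exc:
            # last_rev is left unchanged, so the reload is retried next round.
            print(f"sync failed, will retry: {exc}")
        time.sleep(interval_s)
```

A call such as `watch("/etc/prometheus/repo", "http://localhost:9090")` (both values are placeholders) would run the loop. Keeping `sync_once` free of I/O makes the change-detection logic easy to test separately from Git and HTTP.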
4. Metric calculation
• Core metrics are calculated in part from the NGINX AccessLog. At the entry point, we can obtain the QPS and latency (average, PXX, and so on) of each business without intruding on the business itself
• Resource utilization, middleware, infrastructure, and other metrics come from the time-series store written by Prometheus, and can be aggregated per department and per business based on the Catalog
• Because the computed metrics are small in volume, they can easily be stored in MySQL and ES and delivered to OSS for backup
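The access-log path described above, mapping each URL to its owner through the Catalog and then deriving QPS, availability, and PXX latency, can be illustrated with a small Python sketch. In the platform itself this work is done by SLS data processing and queries; the code below only shows the logic. The Catalog rules, the record fields (`url`, `status`, `request_time`), the rule that any status below 500 counts as a success, and the nearest-rank percentile are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical Catalog rules: URL prefix -> (department, application).
CATALOG_RULES = [
    ("/api/course/", ("education", "course-service")),
    ("/api/pay/", ("commerce", "payment-service")),
]


def lookup(url):
    """Map a URL to (department, application) by the longest matching prefix."""
    best = None
    for prefix, owner in CATALOG_RULES:
        if url.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, owner)
    return best[1] if best else ("unknown", "unknown")


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list; p is an integer in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]


def summarize(records, window_s):
    """Aggregate access-log records into per-owner QPS, availability, and latency."""
    groups = defaultdict(list)
    for rec in records:
        groups[lookup(rec["url"])].append(rec)

    report = {}
    for owner, recs in groups.items():
        latencies = [r["request_time"] for r in recs]
        ok = sum(1 for r in recs if r["status"] < 500)
        report[owner] = {
            "qps": len(recs) / window_s,
            "availability": ok / len(recs),
            "avg_latency": sum(latencies) / len(latencies),
            "p99_latency": percentile(latencies, 99),
        }
    return report
```

Because the grouping key is the Catalog owner rather than the raw URL, the same report can be rolled up per department or per application without touching the business code.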
Results
This monitoring platform now carries almost all of the company's core monitoring. It has run stably since launch and easily copes with sudden surges in data volume during campaigns. Its business value shows mainly in three areas:
1. Monitoring: the first value of the platform is monitoring and alerting of all kinds, especially for SLA. Because the data is already associated with specific departments and business applications, the SLA of each department and application is easy to obtain, and improvements can be driven uniformly across the company
2. Troubleshooting and fault isolation: the call relationships between applications can be computed from Istio access logs and Catalog information. The business relationship network can therefore be generated in real time, along with the quality of each relationship (edge). With the service relationships known, the root cause can be located quickly and faults isolated when problems occur
3. FinOps: for the Cloud-Infra department, cost is the biggest challenge, so cost optimization is also core work for us. The main method is to calculate the resource utilization of each department and team, including average utilization and utilization at various percentiles (PXX) (as shown in the table below), in order to assess how each department uses its resources and to drive cost optimization in each department.
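The business relationship network described in item 2 can be sketched as a simple aggregation over call records. This is an illustration only: the field names (`source_app`, `destination_app`, `response_code`) stand in for Istio access-log fields after Catalog enrichment and are assumptions, as are the choice of error rate as the edge quality measure and the 5% threshold.

```python
from collections import defaultdict


def build_call_graph(records):
    """Aggregate call records into edges: (caller, callee) -> request and error counts."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for rec in records:
        edge = edges[(rec["source_app"], rec["destination_app"])]
        edge["requests"] += 1
        if rec["response_code"] >= 500:
            edge["errors"] += 1
    return dict(edges)


def unhealthy_edges(graph, max_error_rate=0.05):
    """Return (edge, error_rate) pairs above the threshold, worst first."""
    bad = [
        (edge, stats["errors"] / stats["requests"])
        for edge, stats in graph.items()
        if stats["errors"] / stats["requests"] > max_error_rate
    ]
    return sorted(bad, key=lambda item: item[1], reverse=True)
```

During an incident, listing the unhealthy edges narrows the search to the callee side of the worst edge, which is the kind of root-cause shortcut the relationship network is meant to provide.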
The technology behind Fluent's unified monitoring
The unified monitoring platform is built on Alibaba Cloud SLS, which is positioned as a cloud native observability and analysis platform providing large-scale, low-cost, real-time platform services for Log/Metric/Trace data. It offers one-stop data collection, processing, analysis, alerting, visualization, and delivery, comprehensively improving digital capabilities in R&D, O&M, operations, and security scenarios. Unified monitoring uses a variety of core SLS functions, including:
Comprehensive log collection
SLS supports unified collection of Log/Metric/Trace data from sources such as servers, Kubernetes, applications, mobile devices, web pages, and IoT, as well as log data from Alibaba Cloud products, open source systems, other clouds, and off-cloud environments. Its core features are:
1. Convenience: more than 40 mature access solutions, unified collection across multiple clients, and multiple transmission modes such as intranet, public network, and global acceleration
2. Reliability: it runs on the infrastructure Alibaba uses itself, which has been proven many times by Double 11 and Spring Festival Gala events. It supports resumable data transfer and elastic scaling based on service traffic
3. Openness: seamless access over protocols such as HTTP, Syslog, Prometheus, and OpenTelemetry, with full integration into the open source ecosystem
Prometheus time-series solution
SLS time-series storage was designed from the start to meet the time-series storage needs of Alibaba and many leading enterprise customers. Backed by Alibaba's years of technical experience, it can handle most enterprise-level time-series monitoring and analysis requirements. SLS time-series storage has the following characteristics:
1. Rich upstream and downstream integration: SLS supports many collection methods for data access, including various open source agents and Alibaba Cloud's internal monitoring data channels. The stored time-series data can also be connected to various stream computing and offline computing engines, so the data is completely open
2. High performance: the SLS architecture separates storage from compute and makes full use of cluster capacity, so end-to-end speed improves significantly, especially at large data volumes
3. O&M-free: SLS time-series storage is a fully managed service, so users do not need to operate and maintain instances themselves. In addition, all data is stored in three replicas for high reliability, so there is no need to worry about data reliability
4. Open source friendly: SLS time-series storage natively supports Prometheus writes and queries, supports SQL92 analysis, and integrates natively with Grafana and other visualization tools
5. Intelligence: SLS provides a variety of AIOps time-series algorithms, such as multi-period estimation, prediction, anomaly detection, and time-series classification, which can be used to quickly build an intelligent alerting and diagnosis platform suited to the company's business
Real-time data analysis
Query and analysis offers multiple methods, such as keywords, SQL92, and AIOps functions. It supports real-time query and analysis of text and structured data, as well as anomaly inspection and intelligent analysis. The main features are as follows:
1. High performance: analyzes billion-scale data in seconds, with full support for analysis interfaces such as SQL and PromQL and for protocols such as HTTP, Kafka, JDBC, and Prometheus
2. Stable and reliable: enterprise-grade design, multi-tenant isolation, PB-level capacity design, chosen by tens of thousands of enterprise users
3. Intelligence: AIOps capabilities proven in practice across the Alibaba ecosystem, supporting intelligent anomaly inspection and root cause analysis
Data processing
With its flexible syntax, data processing supports extraction, parsing, enrichment, distribution, and other requirements for complex data without writing code, and supports structured analysis. The main characteristics of data processing are as follows:
1. Flexibility: provides a wide range of operators and scenario-specific UDFs (such as Syslog, non-standard JSON, and AccessLog UA/URI/IP parsing) out of the box. Its extensible syntax handles a variety of complex formats
2. O&M-free: a fully managed cloud service that needs no additional O&M resources. It supports automatic elastic scaling based on traffic
3. Scalability: supports multi-level nesting, branching, and other logic, meeting complex data distribution and orchestration requirements
In the cloud native era, digitalization is driving business innovation across industries. Only by improving user experience, accelerating innovation, updating infrastructure and architecture, and making use of diverse data can a company stand out. The intelligent O&M platform launched by Alibaba Cloud is meant not only to reduce engineers' workload but also to free O&M engineers from all kinds of mechanical work. We take care of all the “dirty work”, which greatly reduces failure time and lets O&M staff devote more of their creativity to digital innovation and business innovation, making the enterprise more competitive.
This article is the original content of Aliyun and shall not be reproduced without permission.