How do you build a remote multi-active application?

Overview

As a business grows rapidly, a technology architecture built in a single region runs into a variety of problems. For example, limited infrastructure caps the scalability of the business, and machine-room-level or city-level failures threaten service continuity.

To address these problems, a company can build a remote multi-active architecture, in which multiple units (business centers) are set up in the same city or in different cities. Because the service units can be spread across regions, this removes the limits on infrastructure expansion and service continuity that come with single-region deployment.

Remote multi-active deployment has been a hot topic in recent years. So when does a real business need it, how should it be done, and what should you think about along the way?

When to do it?

In my opinion, it depends on the following:

  • Business development
  • State of the infrastructure
  • Accumulated technical expertise

How to do it?

Most of the remote multi-active schemes you can find online come from large Internet companies such as Alibaba, Ele.me, JD.com, and Weibo. These large projects have one thing in common: they rely on a large number of components to handle data synchronization, business partitioning, and so on. So what should traditional enterprises or relatively small companies do?

  • Leverage appropriate public cloud services based on business characteristics

What do you need to pay attention to when doing this?

  • Which services really need remote multi-active deployment?
  • What is the state of the infrastructure?
  • How much downtime can you tolerate?

Business background

  • The user center is a core service across all systems, because it is the entry point to many other services.
  • Our IDC is not very stable. It has had several large-scale failures, for example a network outage that made the entire machine room unavailable.

These two points are the motivation for remote disaster recovery (DR) of the user center: keeping the service available in the face of machine-room-level failures.

Business overview

Overall, the user center mainly provides registration, login, and user information query services. These services have the following characteristics:

  • Login has the highest priority
  • Low transactional requirements

The main shared components involved are:

  • MySQL: stores user data
  • Redis: stores authorization codes, SMS verification codes, account locks, and access tokens
  • ZooKeeper: a Dubbo dependency

Plan

The user center was developed by an outsourced team; it is now live and has been handed over to another outsourcing provider for operation and maintenance. The Phase I disaster recovery (DR) scheme therefore has to change the code as little as possible.

Goals

Phase I goals

When the Beijing machine room fails, traffic can be switched to the Qingdao machine room within a certain period of time, ensuring basic availability of the user center's core services.

Phase II goals

The user center achieves high availability through remote multi-active deployment (this requires intelligent DNS support from the group).

Architecture design

Phase I architecture

When the Beijing machine room fails, traffic can be switched quickly to Qingdao so that the user center's core services remain available.

The specific plan is as follows:

  • Use Otter to synchronize the core business data of the Beijing machine room to the Qingdao machine room in near real time.
  • Deploy middleware such as Redis and ZooKeeper in the Qingdao machine room.
  • Deploy the user center's core applications in the Qingdao machine room (the instances are deployed and running, but receive no traffic under normal conditions).

The specific structure is as follows:

What this achieves:

  • If the Beijing machine room fails, traffic can be switched to the Qingdao machine room within a certain period of time, ensuring basic availability of the user center's core services. However, users who were logged in must log in again.
  • Switchover time: the time needed to change the IP address in DNS plus the DNS TTL. The TTL is 10 minutes. Manually changing the IP address takes 10 to 20 minutes.

Drawbacks:

  • While the Beijing machine room is healthy, the machines in the Qingdao machine room do nothing but database synchronization, which wastes some resources.
  • When the Beijing machine room fails and traffic is switched to the Qingdao machine room, only the login service is guaranteed. Services that modify the database, such as registration, are not supported, and calling them during this period may raise exceptions.

Phase II architecture

The purpose of Phase II is to fix the shortcomings of the Phase I architecture and achieve high availability through remote multi-active deployment.

In Phase II, the Qingdao machine room is replaced by an Alibaba Cloud (Aliyun) machine room.

The specific plan is as follows:

  • Use Alibaba Cloud's DTS service to synchronize the databases of the two machine rooms, keeping the data in Beijing and Alibaba Cloud consistent in near real time.
  • Serve online traffic from both Beijing and Alibaba Cloud to improve resource utilization.
  • Sort out service priorities and modify the application code to support service degradation.
  • When one machine room (Alibaba Cloud or Beijing) fails, switch traffic to the other machine room through the DNS service.
    • If the two-site deployment has no redundant hardware resources, degrade services.
    • The group DNS currently cannot detect automatically whether a service is available, so it cannot switch automatically.
      • Service availability can be monitored through our multi-point probing (dial testing). When the probes report the service as unavailable, an alarm is sent to the relevant personnel for manual intervention.
      • There are two types of multi-point probe alarms: 1. When a probe point is unavailable. 2.
    • The group DNS has a fixed TTL of 10 minutes; the TTL cannot be customized.
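
The degradation step in the plan above could be sketched roughly as follows. This is only an illustration, not the user center's actual code. The service names come from the business overview and login is given the top priority as stated there, but the other priorities, the `capacity_units` cost model, and the function name `plan_degradation` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    priority: int        # lower number = more important (assumed scale)
    capacity_units: int  # hypothetical cost of keeping the service running


def plan_degradation(services, available_units):
    """Keep the most important services that fit into the remaining capacity.

    Returns (kept, degraded) as two lists of service names.
    """
    kept, degraded = [], []
    remaining = available_units
    # Walk the services from most to least important.
    for svc in sorted(services, key=lambda s: s.priority):
        if svc.capacity_units <= remaining:
            kept.append(svc.name)
            remaining -= svc.capacity_units
        else:
            degraded.append(svc.name)
    return kept, degraded


# Hypothetical numbers; only "login first" is taken from the text.
SERVICES = [
    Service("login", priority=1, capacity_units=5),
    Service("user_info_query", priority=2, capacity_units=3),
    Service("registration", priority=3, capacity_units=4),
]

if __name__ == "__main__":
    # One machine room is down, so only part of the total capacity is left.
    kept, degraded = plan_degradation(SERVICES, available_units=8)
    print("kept:", kept)
    print("degraded:", degraded)
```

With 8 units left, login and user information query stay up and registration is degraded. A real implementation would more likely flip feature switches per service than compute a capacity budget, so treat the cost model as a placeholder.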

The specific structure is as follows:

What this achieves:

  • If the group DNS could provide website monitoring similar to Alibaba Cloud DNS and allowed the TTL to be set flexibly, then when the Beijing or Alibaba Cloud machine room fails, traffic could be switched automatically within a very short time (see "Maximum abnormal time of some services" below).

Alibaba Cloud DNS is used here only as an example; any service that provides similar functions will do.

  • If the group DNS cannot provide website monitoring and a flexible TTL setting similar to Alibaba Cloud DNS, the maximum abnormal time of some services still depends on the time needed to change the IP address in DNS plus the DNS TTL.

Glossary

What is site monitoring?

Site monitoring performs real-time HTTP/HTTPS probing of domain name resolution records, supports user-defined ports, and raises downtime alarms in real time. Monitoring is distributed across the network: probes in each region of China simulate real client requests, so the results are trustworthy. It also supports downtime-triggered disaster recovery (DR) switchover, minimizing the loss caused by service interruptions. DR switchover supports A records and CNAME records, which covers DR switchover requirements in various scenarios.

When does site monitoring consider a site down and send an alarm notification?

An alarm is generated when the monitoring result shows an HTTP/HTTPS return code greater than or equal to 500. For example, suppose four probe points are configured: Beijing Unicom, Shenzhen Alibaba, Shanghai Telecom, and Chongqing Unicom.

  • Scenario 1: Your website is considered down only when at least 50% of the four probe points cannot get a response from your server or receive a return code greater than or equal to 500.
  • Scenario 2: If more than 50% of the four probe points receive a return code less than 500, your website is not considered down.
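
The down/not-down rule described here can be written as a small function. This is a sketch of the rule as this article states it, not Alibaba Cloud's implementation. It assumes that a missing response is represented as `None` and that a return code of exactly 500 counts as a failure, following the scenario wording; the function names are made up for the example.

```python
from typing import Optional, Sequence


def probe_failed(status_code: Optional[int]) -> bool:
    """A probe fails if there was no response (None) or the code is >= 500."""
    return status_code is None or status_code >= 500


def site_is_down(status_codes: Sequence[Optional[int]]) -> bool:
    """Return True when at least 50% of the probe points failed."""
    if not status_codes:
        raise ValueError("at least one probe result is required")
    failed = sum(1 for code in status_codes if probe_failed(code))
    # Integer comparison avoids floating-point rounding on the 50% threshold.
    return failed * 2 >= len(status_codes)


if __name__ == "__main__":
    # Order: Beijing Unicom, Shenzhen Alibaba, Shanghai Telecom, Chongqing Unicom
    print(site_is_down([200, None, 503, 200]))  # two of four probes failed
    print(site_is_down([200, 200, 502, 200]))   # one of four probes failed
```

The first call reports the site as down (two of four probes failed), the second does not (three of four probes saw a code below 500).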

Alibaba Cloud DNS “Traffic Management”

Traffic Management in Alibaba Cloud DNS returns resolution results by weighted round-robin over the records of each resolution line you configure. When an IP address on a line goes down, monitoring detects it automatically and removes the address from that line until it is healthy again. If all IP addresses on a line are down, resolution switches to another healthy line. This maximizes the availability of your site and reduces losses.
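
The behavior described above can be modeled in a few lines. This is a toy model, not the DNS service's API: the line names, the IP addresses (reserved documentation ranges), and the functions `healthy_records` and `resolve` are hypothetical, weights are assumed to be positive integers, and a weighted random pick stands in for weighted round-robin to keep the sketch short.

```python
import random
from typing import Dict, Optional, Set

# line name -> {ip: weight}; every value here is made up for illustration
Lines = Dict[str, Dict[str, int]]

LINES: Lines = {
    "telecom": {"192.0.2.10": 2, "192.0.2.11": 1},
    "default": {"198.51.100.20": 1},
}


def healthy_records(lines: Lines, line: str, down: Set[str]) -> Dict[str, int]:
    """Records of one line, minus the IPs that monitoring marked as down."""
    return {ip: w for ip, w in lines.get(line, {}).items() if ip not in down}


def resolve(lines: Lines, line: str, down: Set[str],
            rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick one IP by weight; fall back to another line if this one is all down."""
    if rng is None:
        rng = random.Random()
    # Try the requested line first, then the other lines in definition order.
    candidates = [line] + [name for name in lines if name != line]
    for name in candidates:
        records = healthy_records(lines, name, down)
        if records:
            ips = list(records)
            weights = [records[ip] for ip in ips]
            return rng.choices(ips, weights=weights, k=1)[0]
    return None  # every IP on every line is down


if __name__ == "__main__":
    # One address on the line is down: only the healthy one is returned.
    print(resolve(LINES, "telecom", down={"192.0.2.10"}))
    # The whole line is down: resolution falls back to another line.
    print(resolve(LINES, "telecom", down={"192.0.2.10", "192.0.2.11"}))
```

The fallback order (requested line first, then the remaining lines in definition order) is a choice made for this sketch; the text only says that resolution switches to another healthy line.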

Maximum abnormal time of some services

For example, if the Beijing machine room fails, requests routed to the Alibaba Cloud machine room are still served normally; only requests routed to the Beijing machine room fail.

If website monitoring or a similar service is in use, with the probe interval set to 1 minute and the TTL set to 1 second, some services will be abnormal for at most 60 + 1 = 61 seconds. After that, DNS automatically removes the Beijing machine room's IP address and all traffic goes to Alibaba Cloud.
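
The arithmetic behind this bound is simply the detection delay plus the DNS TTL. The helper below is a back-of-the-envelope sketch with a made-up function name. It assumes a failure is detected within one probe interval and that resolvers honor the TTL exactly; a monitor that requires several consecutive failed probes, or a resolver that caches longer than the TTL, would push the real figure higher.

```python
def worst_case_abnormal_seconds(probe_interval_s: float, ttl_s: float) -> float:
    """Upper bound on the abnormal window: one full probe interval to notice
    the failure, plus one TTL for cached DNS answers to expire."""
    if probe_interval_s < 0 or ttl_s < 0:
        raise ValueError("durations must be non-negative")
    return probe_interval_s + ttl_s


if __name__ == "__main__":
    # Numbers from the example above: 1-minute probe interval, 1-second TTL.
    print(worst_case_abnormal_seconds(60, 1))    # 61
    # Hypothetical comparison: same probing with the group DNS's 10-minute TTL.
    print(worst_case_abnormal_seconds(60, 600))  # 660
```

For a manual switchover, the probe interval term is replaced by the time it takes a person to notice the alarm and change the DNS record, which is why the TTL and the manual change time dominate when automatic switching is not available.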

Additional notes

  1. Both the Phase I and Phase II schemes depend heavily on the group's DNS service.

  2. In a machine-room-level failure, neither the Phase I nor the Phase II scheme can guarantee the availability of user center services that are exposed directly by IP address.

  3. Besides the DNS approach, another option is to use a device such as F5 for cross-machine-room load balancing, but it must be a GSLB, and both ends must use the same device.

Summary

For companies that are not top-tier Internet giants, it is well worth using public cloud services to implement remote disaster recovery. For example:

  • Data can be synchronized across machine rooms using Alibaba Cloud's Data Transmission Service (DTS). DTS currently supports data transmission between relational databases, NoSQL stores, and OLAP data sources, and combines data migration, data subscription, and real-time data synchronization in one service.

  • For a distributed database that spans machine rooms, use OceanBase. Financial environments usually place higher demands on data reliability. For every transaction committed in OceanBase, the corresponding log is synchronized to multiple data centers in real time and persisted. Even after a data-center-level disaster, every completed transaction can be recovered in another data center, which meets financial-grade reliability requirements.
  • Or simply use the cloud database RDS for MySQL.

Different companies have different businesses, different infrastructure, and different problems to solve, so choose what works for you.

Personal GitHub:

github.com/jiankunking

Personal Blog:

jiankunking.com