Reliability analysis and design deals with the problems that should be considered during the system analysis, system design, and system integration stages.

The related concepts mainly include: reliability, availability, maintainability, mean time to failure (MTTF), mean time to repair (MTTR), and mean time between failures (MTBF).
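As a rough illustration of how the time-based measures relate, the sketch below computes MTBF and availability from MTTF and MTTR. It assumes one common convention, MTBF = MTTF + MTTR and availability = MTTF / (MTTF + MTTR); the function names and the figures are hypothetical and not taken from these notes.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean time between failures, assuming MTBF = MTTF + MTTR."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


# Hypothetical figures: 990 hours of operation per failure, 10 hours per repair.
print(mtbf(990.0, 10.0))          # 1000.0
print(availability(990.0, 10.0))  # 0.99
```

Shortening the repair time raises availability even when the failure rate stays the same, which is why maintainability appears in the list alongside reliability.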

2. System fault model

1, Logical fault model

2, Data structure fault

3

3. System configuration method

(1) Single-machine fault tolerance

1, Self-checking: when a non-fatal fault occurs, the system automatically detects the fault, determines its nature and location, and takes measures to replace or isolate the failed part.

2, Redundancy

1) Hardware redundancy

2) Software redundancy

3) Time redundancy: repeated execution of instructions or programs

4) Information redundancy: adding extra data bits, and so on

The two most common forms of redundancy are duplication and backup. Duplication runs units in parallel for double protection; backup keeps a spare so that a failure can be remedied after it occurs.

Self-checking is often used together with redundancy.
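The combination of self-checking and time redundancy can be sketched as follows: an operation is executed again whenever a check rejects its result. This is a minimal sketch under stated assumptions; `run_with_time_redundancy`, the check function, and the simulated readings are hypothetical names chosen for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(
    operation: Callable[[], T],
    self_check: Callable[[T], bool],
    max_attempts: int = 3,
) -> T:
    """Re-execute `operation` until `self_check` accepts its result."""
    for _ in range(max_attempts):
        result = operation()
        if self_check(result):
            return result
    raise RuntimeError(f"no valid result after {max_attempts} attempts")


# Simulated transient fault: the first execution yields an invalid reading.
readings = iter([-1, 42])
value = run_with_time_redundancy(lambda: next(readings), lambda r: r >= 0)
print(value)  # 42
```

Re-execution only helps with transient faults; a permanent fault fails every attempt and has to be handled by replacing or isolating the part.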

(2) Dual-system hot backup: a highly fault-tolerant application solution that combines software and hardware. It consists of two servers, an external shared disk array cabinet, and hot standby software. The disk array cabinet is optional; a RAID (redundant array of independent disks) card can be used in each of the two servers instead.

In dual-system hot backup, the operating system and application software are installed on the local disks of the two servers, and data is centrally managed and backed up through the disk array. When one server fails, the other takes over, so that service is not interrupted.

Dual-system hot backup uses a “heartbeat” mechanism to keep the active system and the standby system in contact, so that each can tell whether the other is still running.
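A heartbeat check can be sketched as a timeout on the last message received from the peer. The class below is an illustrative assumption, not the interface of any particular hot standby product; the class name, the 3-second timeout, and the injected clock are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks heartbeats from a peer and reports when they stop arriving."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat message arrived from the peer."""
        self._last_beat = self._clock()

    def peer_alive(self) -> bool:
        """True while the last heartbeat is no older than the timeout."""
        return self._clock() - self._last_beat <= self.timeout_s


# A fake clock makes the example deterministic.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: now[0])
now[0] = 2.0
monitor.beat()
now[0] = 4.0
print(monitor.peer_alive())  # True: the last heartbeat was 2 s ago
now[0] = 6.0
print(monitor.peer_alive())  # False: 4 s of silence exceeds the 3 s timeout
```

When `peer_alive()` turns false on the standby machine, that is the point at which it would start taking over the service.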

Based on the working modes of the two servers, there are three hot backup modes:

1, Dual-machine hot standby: one server works while the other is always ready. Data is written to both machines at the same time. If the working machine fails, the standby machine is switched in, either automatically by software or manually. This is the most commonly used mode, but the standby machine may sit idle for long periods.

2, Dual-machine mutual backup: two independent applications run on the two servers, one on each. If one server fails, the other takes over its application. This mode places high performance demands on the servers.

3, Dual-machine duplex: a form of cluster. Both servers are active and run the same application (unlike dual-machine mutual backup, where the applications differ), balancing the load and backing each other up. Disk array storage is commonly used, and WEB servers and FTP servers often run in this mode.

(3) Server cluster: cluster technology combines a group of independent servers on a network into a single system that works and is managed as one, so as to provide highly reliable service.

In most cases, all the computers in the cluster share a common name, and a service running on any machine in the cluster can be used by all network users.

The node servers in the cluster communicate with one another over an internal LAN. When a node fails, the applications running on it are automatically taken over by another node. If an application service fails, it is restarted or taken over by another server.
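The takeover step can be sketched as moving the failed node's applications onto the surviving nodes. This is a minimal sketch: the function name, the node and application names, and the policy of choosing the node with the fewest applications are all assumptions made for illustration, not something these notes specify.

```python
def fail_over(placement: dict[str, list[str]], failed_node: str) -> dict[str, list[str]]:
    """Return a new placement in which each application of the failed node
    is moved to the surviving node that currently runs the fewest applications."""
    survivors = {n: list(apps) for n, apps in placement.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving node to take over")
    for app in placement.get(failed_node, []):
        target = min(survivors, key=lambda n: len(survivors[n]))
        survivors[target].append(app)
    return survivors


placement = {"node-a": ["web"], "node-b": ["ftp", "db"], "node-c": ["mail"]}
print(fail_over(placement, "node-b"))
# {'node-a': ['web', 'ftp'], 'node-c': ['mail', 'db']}
```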

Measures to improve system reliability

(1) Techniques for preventing failure

Two kinds of techniques keep the system from failing:

1, Fault masking, which prevents faults from causing errors

2, System reorganization, which prevents errors from causing failures

Both technologies are based on redundancy of resources. As mentioned earlier, resource redundancy includes hardware redundancy, software redundancy, time redundancy, and information redundancy.

(2) Hardware redundancy

The most common form of hardware redundancy is triple modular redundancy (TMR): three identical modules receive the same input, and their three results are sent to a voter. The voter takes a majority vote, so if one module fails while the other two work normally, the normal result is still output. As long as each module is more likely to be right than wrong, a correct output is more likely than with a single module.
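A majority voter and the resulting reliability can be sketched as below. The reliability formula assumes three independent modules with the same reliability r and a voter that never fails, which gives 3r^2 - 2r^3; the function names are hypothetical.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value that at least two of the three module outputs agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("all three modules disagree; no majority")
    return value


def tmr_reliability(r: float) -> float:
    """Probability that at least two of three independent modules are correct,
    each with reliability r, assuming a perfect voter."""
    return 3 * r**2 - 2 * r**3


print(majority_vote(7, 7, 9))            # 7: the faulty third module is outvoted
print(round(tmr_reliability(0.9), 3))    # 0.972, better than a single module at 0.9
print(round(tmr_reliability(0.4), 3))    # 0.352, worse than a single module at 0.4
```

The last line shows the limit of the technique: voting improves on a single module only when r is above 0.5.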

(3) Information redundancy: redundant information is added to the data for the purpose of fault detection, fault masking, or fault tolerance. The most widely used codes are: 1, Hamming check codes; 2, parity check codes; 3, cyclic redundancy check (CRC) codes.
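The difference between detecting and masking a fault can be sketched with small examples of each code. The parity and Hamming(7,4) functions are illustrative helpers with hypothetical names; the CRC uses the CRC-32 from Python's standard `zlib` module, which is one CRC among many and is an assumption here.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return sum(bin(byte).count("1") for byte in data) % 2


def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code: list[int]) -> list[int]:
    """Fix at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 0 means no error was detected
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


message = b"hello"
parity = even_parity_bit(message)
crc = zlib.crc32(message)

one_bit = bytes([message[0] ^ 0x01]) + message[1:]    # one flipped bit
two_bits = bytes([message[0] ^ 0x03]) + message[1:]   # two flipped bits

print(even_parity_bit(one_bit) == parity)   # False: parity detects the error
print(even_parity_bit(two_bits) == parity)  # True: parity misses a double error
print(zlib.crc32(two_bits) == crc)          # False: the CRC detects it

received = hamming74_encode([1, 0, 1, 1])
received[4] ^= 1                            # flip one bit in transit
print(hamming74_correct(received))          # [1, 0, 1, 1]
```

Parity and CRC only detect errors, so the data must be resent or recomputed; the Hamming code corrects a single-bit error in place, which is a simple case of fault masking.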

1, Online backup (hot backup)

2, Offline backup (cold backup)