Problem analysis

All five StarRocks BE nodes suddenly went offline within a few minutes. Checking the be.out log on a BE node shows:

tcmalloc: large alloc 1811947520 bytes == 0x77f9f0000 @ 0x384f94f 0x39ce2dc 0x399646a
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
*** Aborted at 1641348199 (unix time) try "date -d @1641348199" if you are using GNU date ***
PC: @ 0x7fa8c7db4387 __GI_raise
*** SIGABRT (@0x2ab9) received by PID 10937 (TID 0x7fa7f0658700) from PID 10937; stack trace: ***
    @ 0x2da5562 google::(anonymous namespace)::FailureSignalHandler()
    @ 0x7fa8c99cc630 (unknown)
    @ 0x7fa8c7db4387 __GI_raise
    @ 0x7fa8c7db5a78 __GI_abort
    @ 0x12e91ff _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @ 0x391d6f6 __cxxabiv1::__terminate()
    @ 0x391d761 std::terminate()
    @ 0x391d8b5 __cxa_throw
    @ 0x12e80de _ZN12_GLOBAL__N_110handle_oomEPFPvS0_ES0_bb.cold
    @ 0x39ce27e tcmalloc::allocate_full_cpp_throw_oom()
    @ 0x399646a std::__cxx11::basic_string<>::_M_mutate()
    @ 0x3996e90 std::__cxx11::basic_string<>::_M_replace_aux()
    @ 0x1c5c4fd apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
    @ 0x1c5c6ac apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
    @ 0x1e3d3c9 apache::thrift::TDispatchProcessor::process()
    @ 0x2d91062 apache::thrift::server::TConnectedClient::run()
    @ 0x2d88d13 apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
    @ 0x2d8ab10 apache::thrift::concurrency::Thread::threadMain()
    @ 0x2d7c500 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvSt10shared_ptrIN6apache6thrift11concurrency6ThreadEEES8_EEEEE6_M_runEv
    @ 0x3998d40 execute_native_thread_routine
    @ 0x7fa8c99c4ea5 start_thread
    @ 0x7fa8c7e7c9fd __clone
std::bad_alloc is the cascading result of running out of memory; when there are many nodes, they may not all fail. The BE is written in C++; for background on this error, see: https://www.zhihu.com/question/24926411

A bad_alloc from operator new signals a serious resource problem: memory cannot be allocated, objects cannot be constructed, the program certainly cannot keep running as before, and there may not even be enough memory left to clean up. In this case, letting the application die is the right thing to do…
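As a first triage step, the same OOM signature can be pulled from each node's be.out. This is a minimal sketch; `BE_LOG` is an assumed default install path, so adjust it for your deployment:

```shell
# Scan be.out for the OOM signature seen above.
# BE_LOG is an assumed default path -- adjust for your deployment.
BE_LOG="${BE_LOG:-/opt/starrocks/be/log/be.out}"
grep -nE "bad_alloc|large alloc|SIGABRT" "$BE_LOG" | tail -n 20
```

Running this on every BE node quickly confirms whether all of them died for the same reason.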

Solution

Increase the memory

The best way is definitely to increase memory. After all, as the amount of data increases, memory usage inevitably increases, and you may not be able to handle a sudden increase in the amount of imported data.

Optimize import configuration

The current version of StarRocks (1.19) provides the following configuration items:

# mem_limit=80%                                     # memory limit for the BE process
load_process_max_memory_limit_bytes=107374182400    # 100GB
load_process_max_memory_limit_percent=80            # upper limit on memory used by all load processes on a node, as a percentage

You can set these items to limit the memory footprint of loads.
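For reference, the byte value above is exactly 100 GiB, and some items can be changed on a live BE through its HTTP port. This is a hedged sketch: the `update_config` API and the default HTTP port 8040 are assumptions; items that are not dynamically modifiable still require editing be.conf and restarting the BE:

```shell
# 100 GiB in bytes -- matches load_process_max_memory_limit_bytes above.
echo $((100 * 1024 * 1024 * 1024))    # 107374182400

# Attempt a live update over the BE HTTP port (assumed default 8040).
# Falls back with a hint if the BE is unreachable or the item is static.
curl -s --max-time 3 -XPOST \
  "http://127.0.0.1:8040/api/update_config?load_process_max_memory_limit_percent=80" \
  || echo "live update failed; edit be.conf and restart the BE instead"
```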

Other memory optimization parameters are documented at:

docs.starrocks.com/zh-cn/main/…

Set memory allocation parameters

You are advised to set /proc/sys/vm/overcommit_memory to 1.

echo 1 | sudo tee /proc/sys/vm/overcommit_memory
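To verify the current setting and keep it across reboots, a sketch follows; it assumes a distro that applies /etc/sysctl.d at boot, and the file name 99-starrocks.conf is arbitrary:

```shell
# Show the current policy: 0 = heuristic, 1 = always allow, 2 = strict.
cat /proc/sys/vm/overcommit_memory

# Persist the setting across reboots (assumes a sysctl.d-aware distro).
if [ "$(id -u)" -eq 0 ]; then
  echo "vm.overcommit_memory = 1" > /etc/sysctl.d/99-starrocks.conf
  sysctl --system >/dev/null 2>&1 || true
else
  echo "re-run as root (or via sudo) to write /etc/sysctl.d/99-starrocks.conf"
fi
```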

Table optimization

Memory tables: StarRocks supports caching all table data in memory to speed up queries. Memory tables are suitable for tables with a small number of rows.

However, memory tables are not yet well optimized in practice, so it is recommended not to use them for the time being.

Upgrade StarRocks

The new version, StarRocks 2.0, optimizes memory management and mitigates this problem to some extent:

  • Memory management optimization
    • Refactored the memory statistics/control framework to track memory usage accurately and thoroughly resolve OOM
    • Optimized metadata memory usage
    • Fixed an issue where an execution thread stalled for a long time while releasing a large chunk of memory
    • Added a graceful process exit mechanism and support for memory leak checking (#1093)

Welcome to follow the WeChat official account: Data Architecture Exploration