Background

Recently, the primary node of a production MySQL 5.7.20 cluster has been crashing intermittently (sometimes after three weeks, sometimes after only one or two days), each time triggering a primary/secondary switchover. The stack trace shows that the crash is triggered inside MDL_context::try_acquire_lock_impl().

Analysis

No similar report turned up in the official bug database, so I searched the code base instead and found the corresponding fix, commit 8bc828b982f678d6b57c1853bbe78080c8f84e84:

BUG#26502135: MYSQLD SEGFAULTS IN
              MDL_CONTEXT::TRY_ACQUIRE_LOCK_IMPL

ANALYSIS:
=========
Server sometimes exited when multiple threads tried to
acquire and release metadata locks simultaneously (for
example, necessary to access a table). The same problem
could have occurred when new objects were registered/
deregistered in Performance Schema.

The problem was caused by a bug in LF_HASH - our lock free
hash implementation which is used by metadata locking
subsystem in 5.7 branch. In 5.5 and 5.6 we only use LF_HASH
in Performance Schema Instrumentation implementation. So
for these versions, the problem was limited to P_S.

The problem was in my_lfind() function, which searches for
the specific hash element by going through the elements
list. During this search it loads information about element
checked such as key pointer and hash value into local
variables. Then it confirms that they are not corrupted by
concurrent delete operation (which will set pointer to 0)
by checking if element is still in the list. The latter
check did not take into account that compiler (and
processor) can reorder reads in such a way that load of key
pointer will happen after it, making result of the check
invalid.

FIX:
====
This patch fixes the problem by ensuring that no such
reordering can take place. This is achieved by using
my_atomic_loadptr() which contains compiler and processor
memory barriers for the check mentioned above and other
similar places.

The default (for non-Windows systems) implementation of
my_atomic*() relies on old __sync intrisics and implements
my_atomic_loadptr() as read-modify operation. To avoid
scalability/performance penalty associated with addition of
my_atomic_loadptr()'s we change the my_atomic*() to use
newer __atomic intrisics when available. This new default
implementation doesn't have such a drawback.
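The last paragraph of the commit message contrasts two families of compiler builtins. The sketch below illustrates the difference, assuming GCC or Clang. The helper names load_ptr_sync() and load_ptr_atomic() are hypothetical and are not MySQL's my_atomic_loadptr(). The commit message does not say which __sync builtin MySQL used, so the compare-and-swap here is only one possible way to emulate a load with a read-modify-write.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not MySQL source code. */

/* Legacy style: the __sync builtins have no plain atomic load, so a load
   has to be emulated with a read-modify-write. A compare-and-swap of NULL
   with NULL never changes the stored value and returns what was there. */
static void *load_ptr_sync(void **addr)
{
  return __sync_val_compare_and_swap(addr, (void *)NULL, (void *)NULL);
}

/* Newer style: __atomic_load_n() is a genuine load with the requested
   memory ordering; it does not write to the location. */
static void *load_ptr_atomic(void **addr)
{
  return __atomic_load_n(addr, __ATOMIC_SEQ_CST);
}
```

Both helpers return the same value. The difference is cost: a read-modify-write typically needs exclusive ownership of the cache line even though nothing changes, which is the kind of scalability penalty the commit message wants to avoid.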

In summary:

The crash can occur when multiple threads acquire and release metadata locks at the same time, or when new objects are registered or deregistered in the Performance Schema.

The root cause is a bug in LF_HASH, MySQL's lock-free hash implementation. Where is LF_HASH used?

  1. In 5.5 and 5.6, only by the Performance Schema instrumentation.
  2. In 5.7, also by the metadata locking subsystem.

The bug is in my_lfind(). After copying an element's key pointer and hash value into local variables, it re-reads *cursor->prev to confirm that the element is still in the list. That re-read was a plain load, so the compiler or processor could reorder it relative to the loads it was meant to validate. The patch resolves this by performing the check through my_atomic_loadptr():

diff --git a/mysys/lf_hash.c b/mysys/lf_hash.c
index dc019b07bd9..3a3f665a4f1 100644
--- a/mysys/lf_hash.c
+++ b/mysys/lf_hash.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2006, 2016, Oracle and/or its affiliates. All rights reserved.
+/* Copyright (c) 2006, 2017, Oracle and/or its affiliates. All rights reserved.
 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -83,7 +83,8 @@ retry:
   do { /* PTR() isn't necessary below, head is a dummy node */
     cursor->curr= (LF_SLIST *)(*cursor->prev);
     _lf_pin(pins, 1, cursor->curr);
-  } while (*cursor->prev != (intptr)cursor->curr && LF_BACKOFF);
+  } while (my_atomic_loadptr((void**)cursor->prev) != cursor->curr &&
+                              LF_BACKOFF);
   for (;;)
   {
     if (unlikely(!cursor->curr))
@@ -97,7 +98,7 @@ retry:
     cur_hashnr= cursor->curr->hashnr;
     cur_key= cursor->curr->key;
     cur_keylen= cursor->curr->keylen;
-    if (*cursor->prev != (intptr)cursor->curr)
+    if (my_atomic_loadptr((void**)cursor->prev) != cursor->curr)
     {
       (void)LF_BACKOFF;
       goto retry;
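
The order of operations that the patch enforces in my_lfind() is: snapshot the current element, copy its fields into locals, then confirm with an ordered load that the element is still linked before trusting the copies. The sketch below shows that sequence in isolation. It is written with C11 <stdatomic.h> rather than MySQL's my_atomic_loadptr(), and the node type and find_validated() are hypothetical simplifications. The hazard-pointer pinning (_lf_pin) and deleted-node marking of the real code are left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node; not the real LF_SLIST layout. */
typedef struct node {
  _Atomic(struct node *) next;
  _Atomic(const char *) key;   /* a concurrent delete could reset this */
  _Atomic size_t keylen;
} node;

/* Look up `key` in the list headed by *head; returns the node or NULL. */
static node *find_validated(_Atomic(node *) *head,
                            const char *key, size_t keylen)
{
  for (;;) {                   /* restart from the head after a failed check */
    _Atomic(node *) *prev = head;
    node *curr = atomic_load(prev);
    int restart = 0;

    while (curr != NULL) {
      /* 1. Copy the candidate's fields into locals. */
      const char *cur_key =
          atomic_load_explicit(&curr->key, memory_order_relaxed);
      size_t cur_keylen =
          atomic_load_explicit(&curr->keylen, memory_order_relaxed);
      node *next = atomic_load_explicit(&curr->next, memory_order_relaxed);

      /* 2. Keep the loads above from being reordered after the re-check. */
      atomic_thread_fence(memory_order_acquire);

      /* 3. Re-check that curr is still linked; otherwise the copies
            may describe a node that was removed in the meantime. */
      if (atomic_load(prev) != curr) {
        restart = 1;
        break;
      }

      if (cur_key != NULL && cur_keylen == keylen &&
          memcmp(cur_key, key, keylen) == 0)
        return curr;

      prev = &curr->next;
      curr = next;
    }
    if (!restart)
      return NULL;
  }
}
```

The original code performed steps 1 and 3 with plain reads and nothing in between, so nothing prevented step 3 from effectively running first, which is the reordering described in the commit message.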

Solution

According to the change log, the bug was fixed in 5.7.22:

A server exit could result from simultaneous attempts by multiple threads to register and deregister metadata Performance Schema objects, or to acquire and release metadata locks. (Bug #26502135)

After MySQL was upgraded to 5.7.29 and monitored for one month, the problem did not recur and is considered resolved.
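
To check whether a given 5.7 server still predates the fix, its version string (for example, the result of SELECT VERSION()) can be compared against 5.7.22. Here is a small sketch, assuming the string starts with the usual major.minor.patch prefix. The helper is_57_before_fix() is hypothetical, and it deliberately covers only the 5.7 series, because the change log entry quoted above names the fixed release only for 5.7.

```c
#include <stdio.h>

/* Hypothetical helper for illustration.
   Returns  1 if `version` is a 5.7.x release older than 5.7.22,
            0 if it is 5.7.22 or a later 5.7.x release,
           -1 if it cannot be parsed or is not in the 5.7 series. */
static int is_57_before_fix(const char *version)
{
  int major, minor, patch;

  /* Suffixes such as "-log" are ignored once three numbers are read. */
  if (sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
    return -1;
  if (major != 5 || minor != 7)
    return -1;
  return patch < 22;
}
```

With this helper, "5.7.20" is reported as predating the fix, while "5.7.29" is not.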

PS:

Space is limited here, so follow-up articles will analyze the MDL and LF_HASH source code separately. Stay tuned.


Follow my WeChat official account [database kernel] for articles on mainstream open-source databases and storage engines.

Title         URL
GitHub        dbkernel.github.io
Zhihu         www.zhihu.com/people/dbke…
SegmentFault  segmentfault.com/u/dbkernel
Juejin        juejin.im/user/5e9d3e…
OsChina       my.oschina.net/dbkernel
CNBlogs       www.cnblogs.com/dbkernel