This article is participating in “Java Theme Month – Java Development in Action”, see the activity link for details
This is the fourth day of my participation in Gwen Challenge
Make it a habit to like it first
background
Company a rule engine system, mainly used to do some cost calculation and business logic check function. However, due to some inexplicable historical reasons, the system has no documentation and no source code, and even the configuration is word-of-mouth.
Although this system is relatively old, but after all is a commercial product, the function is relatively perfect and easy to use. After the system is connected to the business system’s calculation fee and verification rules, it goes online soon.
However, after the launch, the service will occasionally take too long. Under normal circumstances, the service response time is less than 20ms. However, when problems occur, the service time can increase to tens of seconds, and the time point for each occurrence of long time is not fixed. And during the long duration, all requests arriving at the service will be long, not just individual requests.
In theory there should be environments like problem testing; But because the test environment restarts very frequently, the occasional long time may be considered as a restart, and no one has been paying attention to it.
We don’t have any data, but the problem has to be solved, and every now and then it takes a few seconds. Docking this rule engine is still the core system of the business, one extra minute less earn insurance premiums…
With a try-and-try mentality, research first.
Screen after
This kind of occasional long time consuming problem, the investigation is more troublesome. It can’t reproduce stably, and there is no rule. It’s not easy to simulate other environments even if you do it. In addition, the system has no documentation and no source code, and no one understands its structure and flow, which makes it even harder to check.
See the monitor
After looking at the monitoring history, CPU & disk I/O utilization went up each time there was a long elapsed time, but not by a lot. When the CPU is high, it is only 50 or 60, while the disk is lower, with only a slight fluctuation, which can bring tens of seconds of time, but it is a little unreasonable.
Look at the log
The CPU block is a little high, but that doesn’t mean anything. We need to find the root cause. So I went through the system’s logs again to see if I could find any clues.I ran out of time and looked at the log for half an hour. Finally, I saw a key clue in the log:
2021-05-08 10:12:35.897 INFO [com.mosaic.rules.res.execution] (default-THREADs-65) execution set / ansNcreckonRuleApp / 1.0/61.0 / ansncreckonrule have finished parsing in 58 seconds.Copy the code
58 seconds… The point in time of this log line matches the point in time of the actual occurrence of the request. Within this 58 second range, there are a lot of request times, and all of them are less than 58 seconds or slightly more than 58 seconds.Aside from this time-consuming printing, the GC log is also a bit suspect.There is a GC activity log before each occurrence of this time-consuming log, but not for long:
The 2021-06-07 T17:36:44. 755 + 0800: [GC2021-06-07 T17:36:44. 756 + 0800: [DefNew: 879616K-> 1234K (3531079K), secs] 3549012K-> 123430k (10467840K), secs] [Times: User sys = = 0.05 0.00, real = 0.2 secs]Copy the code
Stranger still… Not every GC is followed by a time log
The GC problem
A close look at the GC logs shows that each slow response is preceded by a GC, but not FULL GC, and each pause is not long enough to be anywhere near as long as the request takes. In addition, THE CPU utilization during the GC process is not high, which is relatively normal from the data.
Okay, now we have a clue:
- The CPU usage increases during the long duration
- The log is printed for long periods of time, and when you load something, the duration can match
- Before each long run, there is GC activity
Guess the possible cause
After all, there is no source code, no one understands, want to tear the code have no chance, can only rely on guess…
From the above clues, there is an increase in CPU utilization after GC activity, followed by a line of load logs that correspond to the system’s long-running requests.
There is also a GC activity shortly before each long-running log, so the timing of the resource load is related to the GC, which affects the resource load…
Think of this, also probably guess why. Weak references are most likely used to maintain this resource cache. Weak references are reclaimed after GC and therefore need to be reloaded (see Strong/Soft/Weak/virtual references and GC Strategies in Java for detailed explanations of Weak references and test results).
Since this may be the cause, if I find the weak reference to maintain the resource cache, it can be changed to a strong reference to solve the problem! Although there is no source code, but still can decompile ah, decompile to change the cache of that block of code, not a big problem.
Locate the resource loading point
I finally found the probable cause, but there was just one problem… How do I know where this time log was printed? In which class? I don’t even know what Web container this service is using.
No way, go to Arthas, but definitely not in production. So I set up a new set of temporary test environment to troubleshoot this problem.
Arthas traces all classes in the package name as wildcards: com.mosaic.rules.res.execution
# trace all classes and methods in this package name. Only methods that take more than 1000ms are displayed
Since this output may be quite large, > output to a file and run in the backgroundTrace com.mosaic. Rules.res.* *'#cost > 1000' > /app/slow_trace.log &
Copy the code
trace
The command took more than 10 seconds to return, affecting a total of 169 classes and 1617 methods. I would have been taken to heaven if it had been played like this.
Given the situation described above, there is a time problem after GC, so let’s manually trigger a GC,
Arthas’s vmTool can be forceGC directly (jmap or other means are also available) :
vmtool --action forceGc
Copy the code
After forceGC, retest the system interface
As in the above case, the elapsed time again occurs and the request returns and prints the same elapsed time log, only longer than in the production environment (because I used arthas Trace enhancement)
Rules.res. execution] (default-THREADs-65 / ansNcreckonRuleApp / 1.0/61.0 / ansncreckonrule have finished parsing in 70 seconds.Copy the code
At the same time Arthastrace
There is also content in the file where the log is written (link is too long to post code, only screenshot) :thistrace
There are more than 1,000 lines in the log, some of which have been deleted here.
Arthas’s trace command does not do full link tracing, so a trace here is just the current path. But that’s good enough. It Narrows the problem again.
From the picture above you can see IlrRulesetProvider: getRuleset this method are the main time-consuming points, then let’s look at what the game is in this method.
To keep things simple, here is Arthas jad to decompress this class directly and see what the logic is:
jad ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider
Copy the code
It’s a bit too much code, so I’m going to cut it down to the key parts:
public final XURulesetImpl getRuleset(IlrXURulesetArchiveInformation archive, IlrXUContext xuCtx, boolean waitParsing, Listener listener) throws IlrRulesetArchiveInformationNotFoundException, IlrRulesetCreationException, IlrRulesetAlreadyParsingException, IlrRulesetCacheException, XUException {
String canonicalPath = archive.getCanonicalPath().toString();
this.logger.finest("IlrRulesetProvider.getRuleset " + canonicalPath + "" + waitParsing);
if(! archive.getProperties().isShareable()) {return this.factory.createRuleset(archive, xuCtx);
} else {
ClassLoader cl = archive.getXOMClassLoader();
XURulesetImpl ruleset;
while(true) {
synchronized(this.parsingRulesets) {
// Return the rule set from the cache
ruleset = (XURulesetImpl)this.getCache().getRuleset(canonicalPath, cl);
if(ruleset ! =null) {
return ruleset;
}
// The first loaded thread adds the current resource to parsingRulesets and jumps out of the while
if (!this.parsingRulesets.contains(archive)) {
this.parsingRulesets.add(archive);
break;
}
if(! waitParsing) {throw new IlrRulesetAlreadyParsingException("XU.ERROR.10406", (String[])null);
}
// Wait... To prevent concurrent access, wait for the first thread to complete loading while other threads also enter the code block
try {
this.parsingRulesets.wait();
} catch (InterruptedException var20) {
throw new IlrRulesetCreationException("XU.ERROR.10009", (String[])null, var20); }}}if (!this.useWorkManager(archive)) {
this.logger.finest("IlrRulesetProvider.getRuleset doesn't use the workmanager " + this.workManager, (Object[])null, xuCtx);
XURulesetImpl var9;
try {
// Create a resource
ruleset = this.factory.createRuleset(archive, xuCtx);
// Add to cache
this.getCache().addRuleset(ruleset);
var9 = ruleset;
} finally {
this.parsingStopped(archive);
}
returnvar9; }}}Copy the code
Although a lot of cuts, but look at the code is still a little bit unclear, after all, decomcompiled code to read a bit of interference, here is a brief explanation of the above code logic:
- The resource is first fetched from the cache container
- If not, perform the create logic
- Once created, add to cache again
- Wait for the first thread to wake up if there are other threads executing the load
Combine that with our guess above:
It is likely that the system maintains this resource cache with Weak references, which are reclaimed after GC and therefore need to be reloaded
The problem is the cache container! Just look at the cache to see that there must be weak reference code!
Decompile to find the code
Jad decomcompiling is still too cumbersome to look at in a terminal, not as intuitive as in an IDE, so you still have to pull the code down and analyze it, or you’ll struggle to find an associated class.
Arthas has a number of Jar packages, and it is very easy to find out which Jar these related classes are in. Arthas’s sc command is also used:
Sc views information about classes that have been loaded by the JVM
sc -d ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider
Copy the code
Class - the info ilog. Rules. Res., xu, ruleset. Internal. IlrRulesetProvider # here is the information we care about, Jar code-source /content/ jrules-res-xu-jboss61eap.rar /ra.jar name ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider isInterface false isEnum false isAnonymousClass false isArray false isLocalClass false isMemberClass false isPrimitive false isSynthetic false simple-name IlrRulesetProvider modifier final,public annotation interfaces com.ibm.rules.res.xu.ruleset.internal.RulesetParsingWork$Listener super-class +-java.lang.Object class-loader +-ModuleClassLoader for Module "deployment.jrules-res-xu-JBOSS61EAP.rar:main" from Service Module Loader +-sun.misc.Launcher$AppClassLoader@2d74e4b3 +-sun.misc.Launcher$ExtClassLoader@5552bb15 classLoaderHash 2d022d73 Affect(row-cnt:1) cost in 79 ms.Copy the code
Once you find the ra.jar, drag the jar into the IDE to decompile it, but there are other classes associated with the class that are not in the ra.jar. Again, find the associated class, then sc-d finds the jar location and copies it to the local IDE to decompile
After a few twists and turns, I finally got all four jars associated with it back home, and now I can have fun looking at the code in the IDE.
Analyze the mechanism of the cache container
IlrRulesetCacheImpl = IlrRulesetCacheImpl = IlrRulesetCacheImpl = IlrRulesetCacheImpl;
public void addRuleset(IlrXURuleset executableRuleset) {
synchronized(this.syncObject) {
/ /...
this.entries.add(new IlrRulesetCacheEntry(executableRuleset, this.maxIdleTimeOutSupport));
/ /...}}public IlrXURuleset getRuleset(String canonicalRulesetPath, ClassLoader xomClassLoader) {
// ...
List<IlrRulesetCacheEntry> cache = this.entries;
synchronized(this.syncObject) {
Iterator iterator = cache.iterator();
while(iterator.hasNext()) {
IlrRulesetCacheEntry entry = (IlrRulesetCacheEntry)iterator.next();
IlrXURuleset ruleset = (IlrXURuleset)entry.rulesetReference.get();
if (ruleset == null) {
iterator.remove();
} else if(entry.canonicalRulesetPath.equals(canonicalRulesetPath) && (entry.xomClassLoader == xomClassLoader || entry.xomClassLoader ! =null && entry.xomClassLoader.getParent() == xomClassLoader)) {
returnruleset; }}}// ...
return null;
}
Copy the code
After looking at these two methods, it is clear that entries are the key data store set, and let’s see how it works:
protected transient List<IlrRulesetCacheEntry> entries = new ArrayList();
Copy the code
It’s just an ArrayList. Keep looking at IlrRulesetCacheEntry:
public class IlrRulesetCacheEntry {
protected String canonicalRulesetPath = null;
protected ClassLoader xomClassLoader = null;
protected IlrReference<IlrXURuleset> rulesetReference = null;
public IlrRulesetCacheEntry(IlrXURuleset executableRuleset, boolean maxIdleTimeOutSupport) {
this.canonicalRulesetPath = executableRuleset.getCanonicalRulesetPath();
this.xomClassLoader = executableRuleset.getXOMClassLoader();
long maxIdleTime = executableRuleset.getRulesetArchiveProperties().getMaxIdleTime();
// Note that the key here is to select strong and weak references based on the value of maxIdleTime
if(maxIdleTime ! =0L&& (! maxIdleTimeOutSupport || maxIdleTime == -1L)) {
this.rulesetReference = new IlrWeakReference(executableRuleset);
} else {
this.rulesetReference = newIlrStrongReference(executableRuleset); }}}Copy the code
If maxIdleTime is equal to 0, weak reference is used. If maxIdleTime is equal to 0, strong reference is used. However, take a look at the code for the two Reference classes:
// WeakReference inherits WeakReference
public class IlrWeakReference<T> extends WeakReference<T> implements IlrReference<T> {
public IlrWeakReference(T t) {
super(t); }}/ / strong reference
public class IlrStrongReference<T> implements IlrReference<T> {
private T target;
IlrStrongReference(T target) {
this.target = target;
}
public T get(a) {
return this.target; }}Copy the code
There is nothing special about these two classes, which means the same thing as the class name; IlrWeakReference inherits WeakReference, that is WeakReference, when GC occurs, the referenced object will be deleted.
Even though we found the weak reference, we still need to verify that the weak reference is actually used
Verify that weak references are used
Arthas’s vmtool command is used here to look at live objects in the cache:
vmtool --action getInstances --className ilog.rules.res.xu.ruleset.cache.internal.IlrRulesetCacheImpl --express 'instances[0].entries.get(0)' @IlrRulesetCacheEntry[ CanonicalRulesetPath = @ String [/ ansNcreckonRuleApp / 1.0 / ansncreckonrule / 1.0]. XomClassLoader = @ xomClassLoader [com. IBM. Rules. Res. Persistence. Internal. XomClassLoader @ 18794875], # here you can see, An instance of rulesetRef is IlrWeakReference rulesetReference=@IlrWeakReference[ilog.rules.res.xu.ruleset.cache.internal.IlrWeakReference@dbd2972], ]Copy the code
As a result, stone hammer is weak citation.
After all, it is maxIdleTime that causes weak references, so we still need to find the source of maxIdleTime…
Looking for maxIdleTime
In the constructor of IlrRulesetCacheEntry can see maxIdleTime has been obtained from the IlrRulesetArchiveProperties:
long maxIdleTime = executableRuleset.getRulesetArchiveProperties().getMaxIdleTime();
Copy the code
Then continue to see IlrRulesetArchiveProperties this class:
public long getMaxIdleTime(a) {
// Obtain the ruleset. MaxIdleTime value from properties
String result = this.get("ruleset.maxIdleTime");
return result == null ? -1L : Long.valueOf(result);
}
public String get(Object key) {
String result = (String)this.properties.get(key);
return result == null && this.defaults ! =null ? (String)this.defaults.get(key) : result;
}
Copy the code
GetMaxIdleTime returns a default value of -1, which means that if the maxIdleTime value is not configured, the weak reference policy is used by default.
So far, the problem has been pinpointed. Weak reference caching policies cause the resource cache to be cleared during GC, and reloading resources takes a long time.
MaxIdleTime = maxIdleTime = maxIdleTime = maxIdleTime
Not come and all come, all have seen IlrRulesetArchiveProperties this class, you look at what is configured with the values in this class, do you have any maxIdleTime
vmtool --action getInstances --className com.ibm.rules.res.xu.ruleset.internal.CRERulesetImpl --express 'instances[0].getRulesetArchiveProperties()' @IlrRulesetArchivePropertiesImpl[ @String[ruleset.engine]:@String[cre], @String[ruleset.status]:@String[enabled], @String[ruleset.bom.enabled]:@String[true], @ String [ruleset managedxom. Uris,] : @ String [resuri: / / ans - nc - xom. Zip / 54.0, resuri: / / ruleapp - core - model - 1.5.2. Jar / 2.0],]Copy the code
The maxIdleTime attribute does not exist in the properties file, which matches the result above. The maxIdleTime attribute is not configured and defaults to -1, so weak references are used
Find a way to configure maxIdleTime
Decompiled code, although can’t see the comments, but still you can guess from the name of the class, IlrRulesetArchiveProperties this class name should be the meaning of “rule set file properties”.
I don’t understand the business rules of this system, but someone does! So I found the elder brother who is responsible for the configuration of this system and explained to me the various concepts of this system.Brother is also very friendly, directly to me to draw a pictureThis rule engine system has the concept of App, which is often said in the saas platform application meaning; Multiple rule sets can be created under each App, which is the RuleSet in the figure. Each RuleSet is our business rule, such as cost calculation formula or logic verification rule, and each RuleSet also records multiple historical versions of each change.
See this picture, I also understand that a 7788, the positioning of IlrRulesetArchiveProperties is not rule set attribute? This product has a console, there should be a place to configure this property, otherwise why make such a class? Rule sets are created by the console, so rule set properties should also be configurable!
As I expected, there was a concept of attributes on the console’s rule set, and attributes could be added.
. Most important of all, the console ruleset bom, enabled this property, I use the above arthas vmtool see have, that can prove the configuration here, should be configuration IlrRulesetArchiveProperties this class; Is it ok to use strong references just by adding maxIdleTime to the rule set?
Then I gingerly clicked the add attribute button…
Select maxIdleTime and set it to 0.
The verification results
After the configuration, or restart to verify whether it is effective, who knows this system does not support hot refresh, after the change or restart verification is safer.
Verify the properties of the rule set first:
vmtool --action getInstances --className com.ibm.rules.res.xu.ruleset.internal.CRERulesetImpl --express 'instances [0]. GetRulesetArchiveProperties ()' @ IlrRulesetArchivePropertiesImpl [# new configuration @String[ruleset.maxIdleTime]:@String[0], @String[ruleset.engine]:@String[cre], @String[ruleset.status]:@String[enabled], @String[ruleset.bom.enabled]:@String[true], @ String [ruleset managedxom. Uris,] : @ String [resuri: / / ans - nc - xom. Zip / 54.0, resuri: / / ruleapp - core - model - 1.5.2. Jar / 2.0],]Copy the code
As you can see from the above, our new configuration has taken effect and the rule set already has the maxIdleTime.
A reference in the cache has become a strong reference:
vmtool --action getInstances --className ilog.rules.res.xu.ruleset.cache.internal.IlrRulesetCacheImpl --express 'instances[0].entries.get(0).rulesetreference '# target=@CRERulesetImpl[com.ibm.rules.res.xu.ruleset.internal.CRERulesetImpl@28160472], ]Copy the code
After increasing maxIdleTime, the rule set cache becomes strong references, and strong references no longer occur because the GC was collected!
Run the fullGC command through vmTool to see if it can be reloaded:
vmtool --action forceGc
Copy the code
Executed more than ten times, also did not appear a long time consuming problem, the problem should be solved.
I then synchronized this configuration to the test environment and ran for three days. I also went up to manual forceGc from time to time and did not have this long time consuming problem again.
Why doesn’t time occur after every GC
As mentioned at the beginning of this article, each long run is preceded by a GC activity, but not always followed by a long run.
Objects maintained by weak references are not meant to be emptied during GC; It is only during GC that weakly referenced objects are reclaimed if there are no other references left, as in this example:
Map<String,Object> dataMap = new HashMap<>();
WeakReference ref = new WeakReference(dataMap);
System.gc();
System.out.println(ref.get());
dataMap = null;
System.gc();
System.out.println(ref.get());
//output
{}
null
Copy the code
During the first gc, the ref will not be cleared, while the weakly referenced data before the second GC will be cleared if there are no other references.
In combination with the problems of this system, although weak reference caches are used in the rule set, if the caller still holds the rule set object at GC, then the weak reference rule set cache is also not cleared. That’s where this comes in: not every GC causes a reload, but every reload is because of a GC problem
But I don’t think there’s any point in following up, because it doesn’t really matter where you hold the rule set object; After making it a strong reference, the GC will no longer recycle the object, so it doesn’t matter who owns it.
Write in the back
I thought I would have to decomcompile, modify and repackage the code to fix the problem, but in the end I solved the problem without changing a single line of code. It was a bit of a surprise.
However, the design of the rule engine is good, using a configurable cache strategy to save memory, but using weak references to maintain rule sets by default is a bit conservative. For most server scenarios, use as much memory as it frees, which can cause even bigger problems, and the time taken to reload is unacceptable.
Original is not easy, prohibit unauthorized reprint. Like/like/follow my post if it helps you ❤❤❤❤❤❤