Abstract: The function of formula tree module is to generate a formula tree by random sampling from the training set X and function_set, and provide methods of subtree mutation, crossover, hoist mutation and point mutation.
This article is shared by Huawei cloud community “Formula Tree Open Source Library Analysis”, author: Lei Jun.
1. Formula tree module
The function of formula tree module is to generate a formula tree by random sampling from the training set X and function_set, and provide methods of subtree mutation, crossover, hoist mutation and point mutation at the same time.
1.1 Basic Attributes
1.2 methods
1.3 Build_tree algorithm principle
Data structures used:
Terminal_stack: Storage is a stack of several operations
Lisp lists are based on generalized tables, so it’s easy to represent a list as a tree. S-expressions are perhaps best known for their use in the Lisp family of programming languages, invented by John McCarthy in 1958 and first implemented by Steve Russell on the IBM704 computer, it is particularly suitable for use in artificial intelligence schemes because of its efficient processing of symbolic information.
In prefix notation, the operator is written before its operand. For example, expressions
a * ( b + c ) / d
Copy the code
Be written in
(/ (* a (+ b c) ) d)
Copy the code
For example, the formula:
It can also be written as:
So if I write this as an S expression, this becomes this
The corresponding binary tree
That is, the S-expression corresponds to the sequential traversal of the symbol tree
Algorithm input: function_set [‘ add ‘, ‘sub’, ‘the mul], arities {2: [‘ add’, ‘sub’, ‘the mul]}, method: turns, max_depth: 3 (a random number within 2-4)
method: grow n_features: 10 max_depth:2 function_index:0 Function address :<functions._Function object at 0x000001E538356EB0> function_name:add Program: [add] terminal_stack: [2] cycle part # # # # # # # # # # # # LOOP# # # # # # # # # # # # # # # # # # # # # # # # # # 1 times the depth: 1 # program = len(terminal_stack) choice_index = 13 1 #depth < max_depth or choice_index <= len(self.function_set) Add 1_program: [add, add] 1_terminal_stack: [2, 2] depth: 2 choice: 13 choice_index: 11 2_terminal: 3_terminal: 0.8395650516882855 2_program: [add, add, 0.8395650516882855] 2_terminal_stack: [2, 1]# add constant second value subtract 1 depth: 2 choice: 13 choice_index: 3_terminal: 8 2_program: [add, DDD, 0.8395650516882855, 8] 2_terminal_stack: [2, 0] 2_terminal_stack.pop(): 0 # = 0 and pop it off and then stack minus 1 depth: 1 choice: 13 choice_index: 5 2_terminal: 0 3_terminal: 0 2_program: 0 [add, add, 0.8395650516882855, 8, 0] 2_terminal_stack: [0] 2_terminal_stack.pop(): 0 Finally the tree shape becomes this tree1: Add (add(0.840, X8), X0) Grow max_depth:4 function_index:0 Function address :<functions._Function object at 0x000001E538356EB0> function_name:add Program: [add] terminal_stack: [2] # # # # # # # # # # # # # LOOP# # # # # # # # # # # # # # # # # # # # # # # # # 1 times the depth: 1 choice: 13 choice_index: 4 2_terminal: 3_terminal: 3_program: [add, 3] 2_terminal_stack: [1] depth: 1 choice: 13 choice_index: 4 2_terminal: 6 3_terminal: 6 2_program: [add, 3, 6] 2_terminal_stack.pop(): 0 [add, 3, 6]Copy the code
It’s a flow chart
1.4 get_subtree Obtains a random subtree
Give the elements in symbol_tree a weight of 0.9 if it’s an operator or 0.1 if it’s not
Such as the tree is
tree1: mul(X6, sub(X3, X1))
Copy the code
It takes that long to give weight
Probs: [0.9 0.1 0.9 0.1 0.1]Copy the code
And then we normalize by dividing by sum
_probs: [0.42857143 0.47619048 0.9047619 0.95238095 1.]Copy the code
Here, the roulette wheel is used to select the cutting point
steps
1) generate a random value in (0,1) such as generate
S_index: 0.8299421213898753Copy the code
2) Find the position where the random value should be in probs. This position is the position of start
start: 2
Copy the code
End = start=2
stack = 1
Copy the code
If end-start < stack then
node = program[end]
Copy the code
If node were an operator, stack would have to add a tuple
stack += node.arity
Copy the code
End itself adds all the way to the end of program
1.5 Principle of crossover algorithm module
The purpose of crossover is to cross subtrees
The first step is to obtain a random subtree from the symbol tree module
start, end = self.get_subtree(random_state)
Copy the code
The second step is to obtain random subtrees from other symbol tree individuals
donor_start, donor_end = self.get_subtree(random_state, donor)
Copy the code
The third step is to obtain the crossed symbol tree
self.symbol_tree[: start] + donor[donor_start : donor_end] + self.symbol_tree[end : ] tree1: add(mul(X9, X4), X6) start, end: 1, 4 removed:range(1, 4) donor: [MUl, 6, 0.6656811846792283] Donor_start, Donor_end :(0, 3) Donor_removed :[] Add mul x6 0.6656811846792283 x6 self.symbol_tree[: start] + donor[: donor_start: donor_end] + self.symbol_tree[end : ]Copy the code
1.6 subtree_mutation subtree mutation
Controlled by the p_subtree_mutation parameter. This is a more aggressive mutation strategy: one of the winner’s subtrees is replaced by a completely new and random subtree.
Chicken = self.build_tree(random_state) create a new subtree self.crossover(chicken, random_state) #Copy the code
1.7 hoist_mutation hoist mutation
Hoist mutation is an anti-bloating method: randomly select A subtree A from the winner formula tree, then select A subtree B from A, and then promote B to A’s original position and replace A with B. Hoist means to raise.
The first step is to get A random subtree A
start, end = self.get_subtree(random_state)
subtree = self.symbol_tree[start:end]
Copy the code
The second step is to get A subtree B from subtree A
Hoist = subtree[sub_start:sub_end] hoist = subtree[sub_start:sub_end]Copy the code
The third step is to elevate B to A’s original position and replace A with B
self.symbol_tree[:start] + hoist + self.symbol_tree[end:] tree1: add(X6, sub(add(X0, X2), X6)) start, end: 0, 7 subtree : [add, x6, sub, add, x0, x2, x6] mutation: sub_start, sub_end: 3, 6 mutation_hoist : [add, x0, x2] removed: [0, 1, 2, 6] new_tree: [add, x0, x2] tree1: mul(X8, x0) mutation_start, end: 0, 3 mutation_subtree : [mul, x8, x0] mutation: sub_start, sub_end: 1, 2 mutation_hoist : [x8] removed: [0, 2] new_tree: [8]Copy the code
1.8 POINt_mutation point mutation
Controlled by the p_point_replace parameter. A random node will be changed, such as addition can be replaced with division, and the variable X0 can be replaced with a constant -2.5. Point variation can reintroduce some previously obsolete functions and variables, thus promoting the diversity of formulas.
The first step is to copy the symbol tree and get a random point
Program = copy(self.symbol_tree) # copy(self.symbol_tree) Where (random_state.uniform(size = len(program)) < self.p_point_replace)[0] # Get a random pointCopy the code
The second step is to iterate over the mutate node and replace it if it is a Function, or add constants or features if it is not
if isinstance(program[node], _Function): Arity = program[node]. Arity # replacement_len = len(self. arity [arity]) # Find a number of replacement_index = operators that are the same as a arity element Randint (replacement_len) # Select a replacement = self.arities[arity][replacement_index] # find the operator corresponding to index Program [node] = replacement #Copy the code
If it’s not function
Case one
If self.const_range is not None: terminal = random_state.randint(self.n_features + 1) else: Randint (self.n_features) # generate a number in (0, n_features) if terminal == self.n_features: Uniform (*self.const_range) if self.const_range = random_state.uniform(*self.const_range) if self.const_range is None: raise ValueError('A constant was produced with const_range=None.') program[node] = terminalCopy the code
2. Fitness module obtains the adaptability of symbol tree
2.1 get_all_indices interface to get the index of all data
Step 1: verify the parameters
if self._indices_state is None and random_state is None: Show sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat indices not available.') if n_samples is not None and self._n_samples is None: # If n_samples is not None self._n_samples = n_samples if max_samples is not None and self._max_samples is None: # n_samples represents the number of rows of the dataset self._max_samples = max_samples # max_samples Maximum sampling tree if random_state is not None and self._indices_state is None: self._indices_state = random_state.get_state()Copy the code
The second step is to get the random seeds, and then get the index of the data outside the bag and the data inside the bag
Indices_state = check_random_state(None) indices_state.set_state(self._indices_state) # get random_state not_indices = sample_without_replacement( self._n_samples, self._n_samples - self._max_samples, Self._n_samples - self._max_samples Is selected from [0,self._n_samples] and sample_counts = np.bincount(not_indices, Indices = np. Where (sample_counts == 0)[0] # sample_counts == 0Copy the code
Other functions
sample_without_replacement(n_population, n_samples,random_state,method): Sampling function, randomly obtain the data outside the bag, select N_samples data from the set [0, n_population], and put back the sample
Parameter is introduced
2.2 raw_fitness
interface
raw_fitness(self, X, y, sample_weight)
Copy the code
First execute X algorithm to get y_pred, then calculate error according to y,y_pred and weight
Return fitness y_pred = self.execute(x) raw_fitness = self.metric(y, y_pred, sample_weight)Copy the code
2.3 the fitness module
interface
fitness(self, parsimony_coefficient=None)
Copy the code
First execute X algorithm to get y_pred, then calculate error according to y,y_pred and weight
if parsimony_coefficient is None: parsimony_coefficient = self.parsimony_coefficient penalty = parsimony_coefficient * len(self.symbol_tree) * Self.metric. Sign # here is the saving factor times the length of the tree if the bigger the better sign is 1 otherwise -1 return self.raw_fitness_ -penalty # fitness is subtracted here to punish the formula for the tree being too bloatedCopy the code
3. Parallel modules
3.1 the parallel parallel_evolve
interface
_parallel_evolve(n_programs, parents, X, y, sample_weight, seeds, params)
Copy the code
The ginseng
Properties:
3.2 Built-in interface _tournament
Purpose: To find the symbol tree that performs best
Nets = random_state.randint(0, len(parents), tournament_size) # generate, such as this: The len(parents) number is equal to the number of fitness = [parents[p]. Fitness_ for p in the brackets] # metric.greater_is_better: Short [np.argmax(fitness)] the nets are built to search for the largest index value else: The short answer is that parent_index = contenders[np.argmin(fitness)] #Copy the code
3.3 Operation Process:
Step 1: n_programs indicates how many trees there are in a group in the population, and loops n_programs
Initialize the
Method = random_state.uniform() # method select probability from crossover subtree hoist, Parent_index = _tournament() #Copy the code
The second step is to select the type of mutation according to the probability of Method
Method_probs definition
self._method_probs = np.array([self.p_crossover, self.p_subtree_mutation, self.p_hoist_mutation, self.p_point_mutation]) self._method_probs = np.cumsum(self._method_probs) if method < method_probs[0]: Method method donor, donor_index = _tournament() Remains = parent. Crossover (donor. Symbol_tree, random_state) # parent_idx = {'method':' parent_idx': parent_index, 'parent_nodes':removed, 'donor_idx':donor_index, 'Donor_nodes ':remains} elif method < method_probs[1]:# if method is less than crossover _ = parent.subtree_mutation(random_state) genome = {'method':'Subtree Mutation', 'parent_idx':parent_index, 'parent_nodes':removed } elif method < method_probs[2]: Hoist program, removed = paren. hoist_mutation(random_state) genome = {'method':' hoist ', 'parent_idx': parent_index, 'parent_node': removed } elif method < method_probs[3]: # Mutation program, mutated = parent. Point_mutation (random_state) genome = {' Mehtod ':' point Mutation', 'parent_idx':parent_index, 'parent_nodes':mutated} else: # Reproduce program = parent. Reproduce () genome = {'mehtod': 'Reproduction', 'parent_idx':parent_index, 'parent_nodes':[] }Copy the code
The third step generates the formula tree according to the parameters and the program obtained in the second step
program = _SymbolTree(function_set=function_set,
arities=arities,
init_depth=init_depth,
init_method=init_method,
n_features=n_features,
metric=metric,
const_range=const_range,
p_point_replace=p_point_replace,
parsimony_coefficient=parsimony_coefficient,
feature_names=feature_names,
random_state=random_state,
symbol_tree = program)
Copy the code
then
program.parents = genome
Copy the code
Genome stores information that was deleted during the subtree generation process and assigns it to parents
The fourth step assigns weights to the feature columns based on the weight information in sample_weight.
Curr_sample_weight = Np.ones ((n_samples,))) # If sample_weight is None else: curr_sample_weight = Np.ones ((n_samples,)) Copy () oob_sample_weight = curr_sample_weight.copy() # Data outside the bagCopy the code
Calculate the fitness of the data outside the bag
indices, not_indices = program.get_all_indices(n_samples, max_samples, Curr_sample_weight [not_indices] = 0 Oob_sample_weight [indices] = 0 program.raw_fitness_ = program.raw_fitness(X, y, y) Curr_sample_weight) # Calculate the fitness of the data in the bagCopy the code
Calculate the fitness of the data outside the bag
if max_samples < n_samples: Oob_fitness_ = program.raw_fitness(X, y, oob_sample_weight) # Calculate the fitness of the data in the bagCopy the code
Finally, n modified subtree programs are obtained after n cycles, in which two private properties raw_Fitness_ and oob_FITness_ store the applicability of data inside and outside the bag respectively
4. SymbolicTransformer module
4.1 Initializing modules
4.1.1 Basic Attributes
4.1.2 method
_verbose_reporter: controls log output
4.1.3 fit modules
interface
fit(self, X, y, sample_weight = None)
Copy the code
The arguments:
Step 1: Verify the data
Check: check whether X and y are of the same length, whether hall_of_FAME, function_set, _filming are normal, and whether metric is a Fitness type that inherits from the TransformerMixin abstract class
And then we’re going to put the probabilities in the list, incrementally
self._method_probs = np.array([self.p_crossover, self.p_subtree_mutation, self.p_hoist_mutation, self.p_point_mutation])
self._method_probs = np.cumsum(self._method_probs)
Copy the code
Then check _method_probs, init_method, CONST_range, init_depth, and feature_names for type and range checks
Step 2: Take the parameter and assign it a value
params = self.get_params()
print(f'params: {params}')
params['_metric'] = self._metric
params['function_set'] = self._function_set
params['arities'] = self._arities
params['method_probs'] = self._method_probs
Copy the code
Prepare _programs and run_DETAILs_ dictionaries if not warm_START mode
if not self.warm_start or not hasattr(self, '_programs'): Start again if # warm_start is false, Self._programs = [] # _programs list[list1,list2... listn] self.run_details_ = {'generation':[], self._details_ = {'generation':[], 'average_length':[], 'average_fitness':[], 'best_length':[], 'best_fitness': [], 'best_oob_fitness':[], 'generation_time':[]} prior_generations = len(self._programs) # _programs list[list1,list2... listn] The length represents how many generations are iterated. N_more_generations = self.generations - prior_generations # How many more generations need to be iteratedCopy the code
Next is the validation of n_more_GENERATIONS
Population_size represents the number of populations in each generation
If self.warm_start: # Discard random seeds used before for I in range(len(self._programs)): Randint (MAX_INT, size = self.population_size) if self.verbose: self._verbose_reporter()#Copy the code
Step 3: Run the program in parallel
Loop layer (from prior_generations to generations)
1) Record time to find the parent class
Start_time = time() if gen == 0: # parents None parents = None else: Parents = self._programs[gen-1] # _programs the last generation of programsCopy the code
2) Run programs in parallel
n_jobs, n_programs, starts = _partition_estimators(self.population_size, Seeds = random_state. Randint (MAX_INT, Size = self.population_size) # generate population_size population = Parallel(n_jobs=n_jobs, verbose=int(self.verbose > 1))( delayed(_parallel_evolve)(n_programs[i],parents, X, y,sample_weight, seeds[starts[i]:starts[i + 1]], params) for i in range(n_jobs))Copy the code
The data are combined to get the next generation of mutant populations
population[<symboltree._SymbolTree object at 0x00000118ABB89E20>, <symboltree._SymbolTree object at 0x00000118ABB89E80>, <symboltree._SymbolTree object at 0x00000118ABB89F10>, <symboltree._SymbolTree object at 0x00000118ABB89FD0>]
population = list(itertools.chain.from_iterable(population))
Copy the code
To get the fitness and length of all the individuals in the population is a list
fitness = [program.raw_fitness_ for program in population]
length = [program.length_ for program in population]
Copy the code
3) Penalty factor constrains FITNESS
parsimony_coefficient = None if self.parsimony_coefficient == 'auto': Parsimony_coefficient = (np.cov(length, fitness)[1, 0] / np.var(length)) Row 2, column 1 of covariance matrix divided by variance for program in population: Program. Fitness_ = program. Fitness (parsimony_coefficient) # shrink fitness self._programs. Append (population) # Newly generated information for this generation goes into _programsCopy the code
4) Delete the eliminated individuals
if not self.low_memory: for old_gen in np.arange(gen, 0, -1): Indices = [] for programs in self._programs[old_gen]: If program is not None:# If not None for idx in program. Parents: Parents_idx: if 'idx' in idx: if 'idx' in idx: if 'idx' in idx: if 'idx' in idx: If idx in range(self.population_size): if idx in range(self.population_size) Self._programs [old_gen-1][idx] = None elif gen > 0: self._programs[old_gen-1][idx] = None elif gen > 0: Self._programs [gen-1] = NoneCopy the code
Step 4 runs the information
The corresponding code
If self._metric. Greater_is_better: # best_program = population[np.argmax(fitness)] else: best_program = population[np.argmin(fitness)] self.run_details_['generation'].append(gen) self.run_details_['average_length'].append(np.mean(length)) self.run_details_['average_fitness'].append(np.mean(fitness)) self.run_details_['best_length'].append(best_program.length_) Self.run_details_ ['best_fitness'].append(best_program. Raw_fitness_) oob_fitness = Np.nan if self.max_samples < 1.0: Oob_fitness = BEST_program. Oob_fitness_ self.run_details_[' BEST_OOb_FITNESS ']. Append (OOb_FITNESS) # Data accuracy outside the bag Generation_time = time() -start_time self.run_details_['generation_time'].append(generation_time) # Run timeCopy the code
Dealing with early stopping
if self._metric.greater_is_better: best_finess = fitness[np.argmax(fitness)] if best_finess >= self.stopping_criteria: Elsebest_finess = fitness[np.argmix(fitness)] if best_finess <= self.stopping_criteria: # break the stop standardCopy the code
So we end the cycle here, and we get all the generations.
Step 5 if it’s a transformation
A) get hall_of_fame index first
Fitness = np.array(fitness) # find fitness if self._metric. Greater_is_better: Hall_of_fame = fitness. Argsort ()[::-1][:self.hall_of_fame] [:self.hall_of_fame] Hall_of_fame = fitness.argsort()[:self.hall_of_fame] # The smaller the better to select in orderCopy the code
All the individuals in the last generation of the population (including those belonging to Hall_of_FAME) were calculated to get the predicted value
evaluation = np.array([gp.execute(X) for gp in [self._programs[-1][i] for i in hall_of_fame]])
Copy the code
If the index is Spearman’s coefficient, the ranking value of each group of data of evaluation can be calculated
if self.metric == 'spearman': evaluation = np.apply_along_axis(rankdata, 1, Stats import rankdata evaluation = np.array([[1,2,3,4], [6,5,7,8], ,10,11,12 [9]]) print (np) apply_along_axis (rankdata, 1, evaluation)) # output [[1. 2. 3. 4.] [2. 1. 3. 4.] [1. 2. 3. 4.]]Copy the code
Then the correlation coefficient matrix is calculated
with np.errstate(divide = 'ignore', invalid = 'ignore'): Correlation = Np. abs(NP. corrcoef(evaluation)) 1. 3. 4.] [1. 2. 3. 4.]] [[1. 0.8 1.] [0.8 1. 0.8] [1. 0.8 1.]] np. Fill_diagonal (correlations, Hall_of_frame = list(range(self.hall_of_fame)) # hall_of_frame = list(range(self.hall_of_fame)) # hall_of_frame = hall_of_frame list(range(self.hall_of_fame)) # X_shape(354, 13) # evaluation:(50, Len (components) > self.n_components: Correlations_shape (50, 50) MOST_correlation = Np. unravel_index(NP. argmax(correlated)) Correlations can be used to calculate the correlations between the correlations and the correlations. Worst = Max (MOST_correlated) # worst is the column whose index is larger components. Pop (worst) indices.remove(worst) # delete from sequence number correlations = correlations[:, indices][indices, :] indices = list(range(len(components))) self._best_programs = [self._programs[-1][i] for i in hall_of_fame[components]]Copy the code
Click to follow, the first time to learn about Huawei cloud fresh technology ~