Abstract: The function of formula tree module is to generate a formula tree by random sampling from the training set X and function_set, and provide methods of subtree mutation, crossover, hoist mutation and point mutation.

This article is shared by Huawei cloud community “Formula Tree Open Source Library Analysis”, author: Lei Jun.

1. Formula tree module

The function of formula tree module is to generate a formula tree by random sampling from the training set X and function_set, and provide methods of subtree mutation, crossover, hoist mutation and point mutation at the same time.

1.1 Basic Attributes

1.2 methods

1.3 Build_tree algorithm principle

Data structures used:

Terminal_stack: Storage is a stack of several operations

Lisp lists are based on generalized tables, so it’s easy to represent a list as a tree. S-expressions are perhaps best known for their use in the Lisp family of programming languages, invented by John McCarthy in 1958 and first implemented by Steve Russell on the IBM704 computer, it is particularly suitable for use in artificial intelligence schemes because of its efficient processing of symbolic information.

In prefix notation, the operator is written before its operand. For example, expressions

a * ( b + c ) / d
Copy the code

Be written in

(/ (* a (+ b c) ) d)
Copy the code

For example, the formula:

It can also be written as:

So if I write this as an S expression, this becomes this

The corresponding binary tree

That is, the S-expression corresponds to the sequential traversal of the symbol tree

Algorithm input: function_set [‘ add ‘, ‘sub’, ‘the mul], arities {2: [‘ add’, ‘sub’, ‘the mul]}, method: turns, max_depth: 3 (a random number within 2-4)

method: grow n_features: 10 max_depth:2 function_index:0 Function address :<functions._Function object at 0x000001E538356EB0> function_name:add Program: [add] terminal_stack: [2] cycle part # # # # # # # # # # # # LOOP# # # # # # # # # # # # # # # # # # # # # # # # # # 1 times the depth: 1 # program = len(terminal_stack) choice_index = 13 1 #depth < max_depth or choice_index <= len(self.function_set) Add 1_program: [add, add] 1_terminal_stack: [2, 2] depth: 2 choice: 13 choice_index: 11 2_terminal: 3_terminal: 0.8395650516882855 2_program: [add, add, 0.8395650516882855] 2_terminal_stack: [2, 1]# add constant second value subtract 1 depth: 2 choice: 13 choice_index: 3_terminal: 8 2_program: [add, DDD, 0.8395650516882855, 8] 2_terminal_stack: [2, 0] 2_terminal_stack.pop(): 0 # = 0 and pop it off and then stack minus 1 depth: 1 choice: 13 choice_index: 5 2_terminal: 0 3_terminal: 0 2_program: 0 [add, add, 0.8395650516882855, 8, 0] 2_terminal_stack: [0] 2_terminal_stack.pop(): 0 Finally the tree shape becomes this tree1: Add (add(0.840, X8), X0) Grow max_depth:4 function_index:0 Function address :<functions._Function object at 0x000001E538356EB0> function_name:add Program: [add] terminal_stack: [2] # # # # # # # # # # # # # LOOP# # # # # # # # # # # # # # # # # # # # # # # # # 1 times the depth: 1 choice: 13 choice_index: 4 2_terminal: 3_terminal: 3_program: [add, 3] 2_terminal_stack: [1] depth: 1 choice: 13 choice_index: 4 2_terminal: 6 3_terminal: 6 2_program: [add, 3, 6] 2_terminal_stack.pop(): 0 [add, 3, 6]Copy the code

It’s a flow chart

1.4 get_subtree Obtains a random subtree

Give the elements in symbol_tree a weight of 0.9 if it’s an operator or 0.1 if it’s not

Such as the tree is

tree1: mul(X6, sub(X3, X1))
Copy the code

It takes that long to give weight

Probs: [0.9 0.1 0.9 0.1 0.1]Copy the code

And then we normalize by dividing by sum

_probs: [0.42857143 0.47619048 0.9047619 0.95238095 1.]Copy the code

Here, the roulette wheel is used to select the cutting point

steps

1) generate a random value in (0,1) such as generate

S_index: 0.8299421213898753Copy the code

2) Find the position where the random value should be in probs. This position is the position of start

start: 2
Copy the code

End = start=2

stack = 1
Copy the code

If end-start < stack then

node = program[end]
Copy the code

If node were an operator, stack would have to add a tuple

stack += node.arity
Copy the code

End itself adds all the way to the end of program

1.5 Principle of crossover algorithm module

The purpose of crossover is to cross subtrees

The first step is to obtain a random subtree from the symbol tree module

start, end = self.get_subtree(random_state)
Copy the code

The second step is to obtain random subtrees from other symbol tree individuals

donor_start, donor_end = self.get_subtree(random_state, donor)
Copy the code

The third step is to obtain the crossed symbol tree

self.symbol_tree[: start] + donor[donor_start : donor_end] + self.symbol_tree[end : ] tree1: add(mul(X9, X4), X6) start, end: 1, 4 removed:range(1, 4) donor: [MUl, 6, 0.6656811846792283] Donor_start, Donor_end :(0, 3) Donor_removed :[] Add mul x6 0.6656811846792283 x6 self.symbol_tree[: start] + donor[: donor_start: donor_end] + self.symbol_tree[end : ]Copy the code

1.6 subtree_mutation subtree mutation

Controlled by the p_subtree_mutation parameter. This is a more aggressive mutation strategy: one of the winner’s subtrees is replaced by a completely new and random subtree.

Chicken = self.build_tree(random_state) create a new subtree self.crossover(chicken, random_state) #Copy the code

1.7 hoist_mutation hoist mutation

Hoist mutation is an anti-bloating method: randomly select A subtree A from the winner formula tree, then select A subtree B from A, and then promote B to A’s original position and replace A with B. Hoist means to raise.

The first step is to get A random subtree A

start, end = self.get_subtree(random_state)
 subtree = self.symbol_tree[start:end]
Copy the code

The second step is to get A subtree B from subtree A

Hoist = subtree[sub_start:sub_end] hoist = subtree[sub_start:sub_end]Copy the code

The third step is to elevate B to A’s original position and replace A with B

self.symbol_tree[:start] + hoist + self.symbol_tree[end:] tree1: add(X6, sub(add(X0, X2), X6)) start, end: 0, 7 subtree : [add, x6, sub, add, x0, x2, x6] mutation: sub_start, sub_end: 3, 6 mutation_hoist : [add, x0, x2] removed: [0, 1, 2, 6] new_tree: [add, x0, x2] tree1: mul(X8, x0) mutation_start, end: 0, 3 mutation_subtree : [mul, x8, x0] mutation: sub_start, sub_end: 1, 2 mutation_hoist : [x8] removed: [0, 2] new_tree: [8]Copy the code

1.8 POINt_mutation point mutation

Controlled by the p_point_replace parameter. A random node will be changed, such as addition can be replaced with division, and the variable X0 can be replaced with a constant -2.5. Point variation can reintroduce some previously obsolete functions and variables, thus promoting the diversity of formulas.

The first step is to copy the symbol tree and get a random point

Program = copy(self.symbol_tree) # copy(self.symbol_tree) Where (random_state.uniform(size = len(program)) < self.p_point_replace)[0] # Get a random pointCopy the code

The second step is to iterate over the mutate node and replace it if it is a Function, or add constants or features if it is not

if isinstance(program[node], _Function): Arity = program[node]. Arity # replacement_len = len(self. arity [arity]) # Find a number of replacement_index = operators that are the same as a arity element Randint (replacement_len) # Select a replacement = self.arities[arity][replacement_index] # find the operator corresponding to index Program [node] = replacement #Copy the code

If it’s not function

Case one

If self.const_range is not None: terminal = random_state.randint(self.n_features + 1) else: Randint (self.n_features) # generate a number in (0, n_features) if terminal == self.n_features: Uniform (*self.const_range) if self.const_range = random_state.uniform(*self.const_range) if self.const_range is None: raise ValueError('A constant was produced with const_range=None.') program[node] = terminalCopy the code

2. Fitness module obtains the adaptability of symbol tree

2.1 get_all_indices interface to get the index of all data

Step 1: verify the parameters

if self._indices_state is None and random_state is None: Show sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat sat indices not available.') if n_samples is not None and self._n_samples is None: # If n_samples is not None self._n_samples = n_samples if max_samples is not None and self._max_samples is None: # n_samples represents the number of rows of the dataset self._max_samples = max_samples # max_samples Maximum sampling tree if random_state is not None and self._indices_state is None: self._indices_state = random_state.get_state()Copy the code

The second step is to get the random seeds, and then get the index of the data outside the bag and the data inside the bag

Indices_state = check_random_state(None) indices_state.set_state(self._indices_state) # get random_state not_indices = sample_without_replacement( self._n_samples, self._n_samples - self._max_samples, Self._n_samples - self._max_samples Is selected from [0,self._n_samples] and sample_counts = np.bincount(not_indices, Indices = np. Where (sample_counts == 0)[0] # sample_counts == 0Copy the code

Other functions

sample_without_replacement(n_population, n_samples,random_state,method): Sampling function, randomly obtain the data outside the bag, select N_samples data from the set [0, n_population], and put back the sample

Parameter is introduced

2.2 raw_fitness

interface

raw_fitness(self, X, y, sample_weight)
Copy the code

First execute X algorithm to get y_pred, then calculate error according to y,y_pred and weight

Return fitness y_pred = self.execute(x) raw_fitness = self.metric(y, y_pred, sample_weight)Copy the code

2.3 the fitness module

interface

fitness(self, parsimony_coefficient=None)
Copy the code

First execute X algorithm to get y_pred, then calculate error according to y,y_pred and weight

if parsimony_coefficient is None: parsimony_coefficient = self.parsimony_coefficient penalty = parsimony_coefficient * len(self.symbol_tree) * Self.metric. Sign # here is the saving factor times the length of the tree if the bigger the better sign is 1 otherwise -1 return self.raw_fitness_ -penalty # fitness is subtracted here to punish the formula for the tree being too bloatedCopy the code

3. Parallel modules

3.1 the parallel parallel_evolve

interface

_parallel_evolve(n_programs, parents, X, y, sample_weight, seeds, params)
Copy the code

The ginseng

Properties:

3.2 Built-in interface _tournament

Purpose: To find the symbol tree that performs best

Nets = random_state.randint(0, len(parents), tournament_size) # generate, such as this: The len(parents) number is equal to the number of fitness = [parents[p]. Fitness_ for p in the brackets] # metric.greater_is_better: Short [np.argmax(fitness)] the nets are built to search for the largest index value else: The short answer is that parent_index = contenders[np.argmin(fitness)] #Copy the code

3.3 Operation Process:

Step 1: n_programs indicates how many trees there are in a group in the population, and loops n_programs

Initialize the

Method = random_state.uniform() # method select probability from crossover subtree hoist, Parent_index = _tournament() #Copy the code

The second step is to select the type of mutation according to the probability of Method

Method_probs definition

self._method_probs = np.array([self.p_crossover, self.p_subtree_mutation, self.p_hoist_mutation, self.p_point_mutation]) self._method_probs = np.cumsum(self._method_probs) if method < method_probs[0]: Method method donor, donor_index = _tournament() Remains = parent. Crossover (donor. Symbol_tree, random_state) # parent_idx = {'method':' parent_idx': parent_index, 'parent_nodes':removed, 'donor_idx':donor_index, 'Donor_nodes ':remains} elif method < method_probs[1]:# if method is less than crossover _ = parent.subtree_mutation(random_state) genome = {'method':'Subtree Mutation', 'parent_idx':parent_index, 'parent_nodes':removed } elif method < method_probs[2]: Hoist program, removed = paren. hoist_mutation(random_state) genome = {'method':' hoist ', 'parent_idx': parent_index, 'parent_node': removed } elif method < method_probs[3]: # Mutation program, mutated = parent. Point_mutation (random_state) genome = {' Mehtod ':' point Mutation', 'parent_idx':parent_index, 'parent_nodes':mutated} else: # Reproduce program = parent. Reproduce () genome = {'mehtod': 'Reproduction', 'parent_idx':parent_index, 'parent_nodes':[] }Copy the code

The third step generates the formula tree according to the parameters and the program obtained in the second step

program = _SymbolTree(function_set=function_set,
                               arities=arities,
                               init_depth=init_depth,
                               init_method=init_method,
                               n_features=n_features,
                               metric=metric,
                               const_range=const_range,
                               p_point_replace=p_point_replace,
                               parsimony_coefficient=parsimony_coefficient,
                               feature_names=feature_names,
                               random_state=random_state,
                               symbol_tree = program)
Copy the code

then

program.parents = genome
Copy the code

Genome stores information that was deleted during the subtree generation process and assigns it to parents

The fourth step assigns weights to the feature columns based on the weight information in sample_weight.

Curr_sample_weight = Np.ones ((n_samples,))) # If sample_weight is None else: curr_sample_weight = Np.ones ((n_samples,)) Copy () oob_sample_weight = curr_sample_weight.copy() # Data outside the bagCopy the code

Calculate the fitness of the data outside the bag

indices, not_indices = program.get_all_indices(n_samples, max_samples, Curr_sample_weight [not_indices] = 0 Oob_sample_weight [indices] = 0 program.raw_fitness_ = program.raw_fitness(X, y, y) Curr_sample_weight) # Calculate the fitness of the data in the bagCopy the code

Calculate the fitness of the data outside the bag

if max_samples < n_samples: Oob_fitness_ = program.raw_fitness(X, y, oob_sample_weight) # Calculate the fitness of the data in the bagCopy the code

Finally, n modified subtree programs are obtained after n cycles, in which two private properties raw_Fitness_ and oob_FITness_ store the applicability of data inside and outside the bag respectively

4. SymbolicTransformer module

4.1 Initializing modules

4.1.1 Basic Attributes

4.1.2 method

_verbose_reporter: controls log output

4.1.3 fit modules

interface

fit(self, X, y, sample_weight = None)
Copy the code

The arguments:

Step 1: Verify the data

Check: check whether X and y are of the same length, whether hall_of_FAME, function_set, _filming are normal, and whether metric is a Fitness type that inherits from the TransformerMixin abstract class

And then we’re going to put the probabilities in the list, incrementally

self._method_probs = np.array([self.p_crossover, self.p_subtree_mutation, self.p_hoist_mutation, self.p_point_mutation])
self._method_probs = np.cumsum(self._method_probs)
Copy the code

Then check _method_probs, init_method, CONST_range, init_depth, and feature_names for type and range checks

Step 2: Take the parameter and assign it a value

params = self.get_params()
 print(f'params: {params}')
 params['_metric'] = self._metric
 params['function_set'] = self._function_set
 params['arities'] = self._arities
 params['method_probs'] = self._method_probs
Copy the code

Prepare _programs and run_DETAILs_ dictionaries if not warm_START mode

if not self.warm_start or not hasattr(self, '_programs'): Start again if # warm_start is false, Self._programs = [] # _programs list[list1,list2... listn] self.run_details_ = {'generation':[], self._details_ = {'generation':[], 'average_length':[], 'average_fitness':[], 'best_length':[], 'best_fitness': [], 'best_oob_fitness':[], 'generation_time':[]} prior_generations = len(self._programs) # _programs list[list1,list2... listn] The length represents how many generations are iterated. N_more_generations = self.generations - prior_generations # How many more generations need to be iteratedCopy the code

Next is the validation of n_more_GENERATIONS

Population_size represents the number of populations in each generation

If self.warm_start: # Discard random seeds used before for I in range(len(self._programs)): Randint (MAX_INT, size = self.population_size) if self.verbose: self._verbose_reporter()#Copy the code

Step 3: Run the program in parallel

Loop layer (from prior_generations to generations)

1) Record time to find the parent class

Start_time = time() if gen == 0: # parents None parents = None else: Parents = self._programs[gen-1] # _programs the last generation of programsCopy the code

2) Run programs in parallel

n_jobs, n_programs, starts = _partition_estimators(self.population_size, Seeds = random_state. Randint (MAX_INT, Size = self.population_size) # generate population_size population = Parallel(n_jobs=n_jobs, verbose=int(self.verbose > 1))( delayed(_parallel_evolve)(n_programs[i],parents, X, y,sample_weight, seeds[starts[i]:starts[i + 1]], params) for i in range(n_jobs))Copy the code

The data are combined to get the next generation of mutant populations

population[<symboltree._SymbolTree object at 0x00000118ABB89E20>, <symboltree._SymbolTree object at 0x00000118ABB89E80>, <symboltree._SymbolTree object at 0x00000118ABB89F10>, <symboltree._SymbolTree object at 0x00000118ABB89FD0>]
population = list(itertools.chain.from_iterable(population))
Copy the code

To get the fitness and length of all the individuals in the population is a list

fitness = [program.raw_fitness_ for program in population]
length = [program.length_ for program in population]
Copy the code

3) Penalty factor constrains FITNESS

parsimony_coefficient = None if self.parsimony_coefficient == 'auto': Parsimony_coefficient = (np.cov(length, fitness)[1, 0] / np.var(length)) Row 2, column 1 of covariance matrix divided by variance for program in population: Program. Fitness_ = program. Fitness (parsimony_coefficient) # shrink fitness self._programs. Append (population) # Newly generated information for this generation goes into _programsCopy the code

4) Delete the eliminated individuals

if not self.low_memory: for old_gen in np.arange(gen, 0, -1): Indices = [] for programs in self._programs[old_gen]: If program is not None:# If not None for idx in program. Parents: Parents_idx: if 'idx' in idx: if 'idx' in idx: if 'idx' in idx: if 'idx' in idx: If idx in range(self.population_size): if idx in range(self.population_size) Self._programs [old_gen-1][idx] = None elif gen > 0: self._programs[old_gen-1][idx] = None elif gen > 0: Self._programs [gen-1] = NoneCopy the code

Step 4 runs the information

The corresponding code

If self._metric. Greater_is_better: # best_program = population[np.argmax(fitness)] else: best_program = population[np.argmin(fitness)] self.run_details_['generation'].append(gen) self.run_details_['average_length'].append(np.mean(length)) self.run_details_['average_fitness'].append(np.mean(fitness)) self.run_details_['best_length'].append(best_program.length_) Self.run_details_ ['best_fitness'].append(best_program. Raw_fitness_) oob_fitness = Np.nan if self.max_samples < 1.0: Oob_fitness = BEST_program. Oob_fitness_ self.run_details_[' BEST_OOb_FITNESS ']. Append (OOb_FITNESS) # Data accuracy outside the bag Generation_time = time() -start_time self.run_details_['generation_time'].append(generation_time) # Run timeCopy the code

Dealing with early stopping

if self._metric.greater_is_better: best_finess = fitness[np.argmax(fitness)] if best_finess >= self.stopping_criteria: Elsebest_finess = fitness[np.argmix(fitness)] if best_finess <= self.stopping_criteria: # break the stop standardCopy the code

So we end the cycle here, and we get all the generations.

Step 5 if it’s a transformation

A) get hall_of_fame index first

Fitness = np.array(fitness) # find fitness if self._metric. Greater_is_better: Hall_of_fame = fitness. Argsort ()[::-1][:self.hall_of_fame] [:self.hall_of_fame] Hall_of_fame = fitness.argsort()[:self.hall_of_fame] # The smaller the better to select in orderCopy the code

All the individuals in the last generation of the population (including those belonging to Hall_of_FAME) were calculated to get the predicted value

evaluation = np.array([gp.execute(X) for gp in [self._programs[-1][i] for i in hall_of_fame]])
Copy the code

If the index is Spearman’s coefficient, the ranking value of each group of data of evaluation can be calculated

if self.metric == 'spearman': evaluation = np.apply_along_axis(rankdata, 1, Stats import rankdata evaluation = np.array([[1,2,3,4], [6,5,7,8], ,10,11,12 [9]]) print (np) apply_along_axis (rankdata, 1, evaluation)) # output [[1. 2. 3. 4.] [2. 1. 3. 4.] [1. 2. 3. 4.]]Copy the code

Then the correlation coefficient matrix is calculated

with np.errstate(divide = 'ignore', invalid = 'ignore'): Correlation = Np. abs(NP. corrcoef(evaluation)) 1. 3. 4.] [1. 2. 3. 4.]] [[1. 0.8 1.] [0.8 1. 0.8] [1. 0.8 1.]] np. Fill_diagonal (correlations, Hall_of_frame = list(range(self.hall_of_fame)) # hall_of_frame = list(range(self.hall_of_fame)) # hall_of_frame = hall_of_frame list(range(self.hall_of_fame)) # X_shape(354, 13) # evaluation:(50, Len (components) > self.n_components: Correlations_shape (50, 50) MOST_correlation = Np. unravel_index(NP. argmax(correlated)) Correlations can be used to calculate the correlations between the correlations and the correlations. Worst = Max (MOST_correlated) # worst is the column whose index is larger components. Pop (worst) indices.remove(worst) # delete from sequence number correlations = correlations[:, indices][indices, :] indices = list(range(len(components))) self._best_programs = [self._programs[-1][i] for i in hall_of_fame[components]]Copy the code

Click to follow, the first time to learn about Huawei cloud fresh technology ~