The road of research is long. Follow Xiao Zeng so you don't get lost, and let's make progress together.
ALiPy is a Python toolbox for active learning developed by the Key Laboratory of Pattern Analysis and Machine Intelligence of the Ministry of Industry and Information Technology, School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics.
ALiPy: Active Learning in Python
ALiPy provides a module-based implementation of the active learning framework, which lets users conveniently evaluate, compare, and analyze the performance of active learning methods. It implements more than 20 algorithms and makes it easy for users to implement their own methods under different settings.
Features of ALiPy
- Model independence: there are no restrictions on the classification model. You can use an SVM from scikit-learn or a deep model from TensorFlow as needed.
- Module independence: you can modify one or more modules of the toolbox without affecting the others.
- Implement your own algorithm without inheriting anything: user-defined functions face few restrictions on parameters or names.
- Variant settings supported: noisy oracles, multi-label data, cost-sensitive querying, feature querying, etc.
- Powerful tools: saving and loading intermediate results, multithreading, and analysis of experimental results.
ALiPy modules
The active learning implementation is decomposed into components, and ALiPy is built from multiple modules, each corresponding to one component of the active learning process.
Module | Basic function |
---|---|
alipy.data_manipulate | Provides basic functions for data preprocessing and partitioning |
alipy.query_strategy | Contains 25 commonly used query strategies |
alipy.index.IndexCollection | Helps manage the indexes of labeled and unlabeled examples |
alipy.metrics | Provides multiple criteria for evaluating model performance |
alipy.experiment.state and alipy.experiment.state_io | Help save the intermediate results after each query and resume the program from a breakpoint |
alipy.experiment.stopping_criteria | Implements some common stopping criteria |
alipy.oracle | Supports different oracle settings |
alipy.experiment.experiment_analyser | Provides collection, processing, and visualization of experimental results |
alipy.utils.multi_thread | Provides a parallel implementation of k-fold experiments |
The modules above are designed and implemented independently, so the code of one part places no restrictions on the others. In addition, each module can be replaced by the user's own implementation, and within each module a high degree of flexibility is provided so that the toolbox can adapt to different settings.
[Figures omitted: AL implementation frameworks for (1) instance selection, (2) noisy oracles, (3) datasets with different costs, and (4) instance queries.]
Installing ALiPy
Requirements: Python >= 3.4 and the base libraries numpy, scipy, scikit-learn, matplotlib, and prettytable. There are two main ways to install ALiPy: with pip or by building from source.
pip installation (choose one of the three)
- Install ALiPy from PyPI (recommended):

    sudo pip install alipy

- Install with pip into your home directory:

    pip install --user alipy

- Install the latest source from the GitHub repository with pip:

    pip install git+https://github.com/NUAA-AL/alipy.git

Building from source
- Clone ALiPy to a local directory, cd into the ALiPy folder, and run the install command:

    git clone https://github.com/NUAA-AL/ALiPy.git
    cd ALiPy
    sudo python setup.py install

- Build and install from source into your home directory:

    python setup.py install --user

- Build and install from source for all users on Unix/Linux:

    python setup.py build
    sudo python setup.py install

Special settings in ALiPy
The most striking feature of ALiPy is its low coupling, which makes it easy to run experiments in other, special settings.
Active learning setting | Description |
---|---|
AL with Noisy Oracles | The oracle may sometimes return a wrong label |
AL for Multi-Label Data | An instance is associated with multiple labels at the same time |
AL with Different Costs | The cost of querying different labels can vary |
AL by Querying Features | Queries the missing feature values of instances |
AL with Novel Query Types | Queries information about an instance other than its label |
AL for Large Scale Tasks | Active learning on large-scale data |
Algorithms implemented in ALiPy
ALiPy provides more than 20 advanced algorithms for different active learning settings.
Code walkthrough
The code walkthrough is divided into an ALiPy primer and an advanced guide.
Introduction to ALiPy
I'll walk through a simple example of customizing an active learning experiment with the tools in ALiPy, starting with a unified framework for active learning experiments and then introducing the corresponding tools in ALiPy.
Unified framework for active learning experiments
1. Get the feature matrix X with shape [n_samples, n_features] and the corresponding label vector y with shape [n_samples] (if a concrete feature matrix is hard to obtain, you can work on the instance indexes alone), then split the data into training and test sets for the experiment. The random split should be repeated several times. In active learning, the training set is further split into an initial labeled set and an unlabeled pool for querying. Note that in most active learning settings the initial labeled set is small.
2. Start the query process for each fold of the experiment and record its results. In each query iteration, a subset of the unlabeled data is queried and added to the labeled set; the model is then retrained on the updated labeled set and tested to evaluate the query.
3. After all folds are finished, the learning curve of the query strategy is obtained by averaging the performance curves of the folds.
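The three steps above can be sketched without ALiPy at all. The snippet below is only an illustration under stated assumptions: the toy two-blob dataset, the 1-nearest-neighbour stand-in model, the random query rule, and the helper names `one_nn_predict` and `run_fold` are all made up for this sketch and are not part of ALiPy.

```python
import numpy as np


def one_nn_predict(X_train, y_train, X_test):
    """Predict with a 1-nearest-neighbour rule (a stand-in for any classifier)."""
    # Squared Euclidean distances, shape (n_test, n_train).
    dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dist.argmin(axis=1)]


def run_fold(X, y, test_ratio=0.3, initial_label_rate=0.1, n_queries=5, seed=0):
    """Run one fold and return its accuracy curve (initial point plus one per query)."""
    rng = np.random.default_rng(seed)

    # Step 1: train/test split, then a small initial labeled set and an unlabeled pool.
    perm = rng.permutation(len(X))
    n_test = int(round(test_ratio * len(X)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    n_init = max(1, int(round(initial_label_rate * len(train_idx))))
    label_idx = list(train_idx[:n_init])
    unlab_idx = list(train_idx[n_init:])

    def evaluate():
        pred = one_nn_predict(X[label_idx], y[label_idx], X[test_idx])
        return float((pred == y[test_idx]).mean())

    # Step 2: query, move the instance to the labeled set, refit, and test.
    curve = [evaluate()]
    for _ in range(n_queries):
        if not unlab_idx:
            break
        picked = unlab_idx.pop(int(rng.integers(len(unlab_idx))))  # random query
        label_idx.append(picked)
        curve.append(evaluate())
    return curve


# Toy data: two Gaussian blobs (assumed for illustration only).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, (50, 2)), data_rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Step 3: average the per-fold curves to get the learning curve.
curves = np.array([run_fold(X, y, seed=fold) for fold in range(10)])
learning_curve = curves.mean(axis=0)
print(learning_curve)
```

Swapping the random pick for an informed query rule is the only change needed to compare strategies on the same splits.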
Modules in ALiPy
- Use alipy.query_strategy to call traditional and state-of-the-art methods.
- Use alipy.index.IndexCollection to manage the labeled and unlabeled indexes.
- Use alipy.metrics to calculate your model's performance.
- Use alipy.experiment.state and alipy.experiment.state_io to save the intermediate results after each query and to resume the program from a breakpoint.
- Use alipy.experiment.stopping_criteria to get some example stopping criteria.
- Use alipy.experiment.experiment_analyser to collect, process, and visualize your experimental results.
For experienced users, here is a complete example of an experiment implemented with ALiPy. After it, we explain the code piece by piece and introduce the common methods of the tools listed above.

    import copy
    from sklearn.datasets import load_iris
    from alipy import ToolBox

    X, y = load_iris(return_X_y=True)
    alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

    # Split the data
    alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

    # Use the default logistic regression classifier
    model = alibox.get_default_model()

    # The cost budget is 50 queries
    stopping_criterion = alibox.get_stopping_criterion('num_of_queries', 50)

    # Use a predefined strategy
    uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')
    unc_result = []

    for round in range(10):
        # Get the data split of one fold
        train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
        # Get the intermediate-result saver for this fold
        saver = alibox.get_stateio(round)

        # Set the initial performance point
        model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
        pred = model.predict(X[test_idx, :])
        accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                                  y_pred=pred,
                                                  performance_metric='accuracy_score')
        saver.set_initial_point(accuracy)

        while not stopping_criterion.is_stop():
            # Select a subset of the unlabeled pool according to the query strategy
            select_ind = uncertainStrategy.select(label_ind, unlab_ind, model=model, batch_size=1)
            # Or pass your own probabilistic prediction result:
            # prob_pred = model.predict_proba(X[unlab_ind.index])
            # select_ind = uncertainStrategy.select_by_prediction_mat(unlabel_index=unlab_ind,
            #                                                         predict=prob_pred, batch_size=1)
            label_ind.update(select_ind)
            unlab_ind.difference_update(select_ind)

            # Update the model and compute the performance for the model you use
            model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
            pred = model.predict(X[test_idx, :])
            accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                                      y_pred=pred,
                                                      performance_metric='accuracy_score')

            # Save the intermediate result to file
            st = alibox.State(select_index=select_ind, performance=accuracy)
            saver.add_state(st)
            saver.save()

            # Pass the current progress to the stopping criterion object
            stopping_criterion.update_information(saver)
        # Reset the progress in the stopping criterion object
        stopping_criterion.reset()
        unc_result.append(copy.deepcopy(saver))

    analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
    analyser.add_method(method_name='uncertainty', method_results=unc_result)
    print(analyser)
    analyser.plot_learning_curves(title='Example of AL', std_area=True)

To use the modules, first create a ToolBox object and specify the query type of the experiment ('AllLabels' means querying all labels of an instance):

    alibox = ToolBox(X=X, y=y, query_type='AllLabels')

Manage labeled and unlabeled indexes
alipy.index.IndexCollection is a list-like container for managing your labeled and unlabeled indexes. An IndexCollection object can be created simply by passing a list or a numpy.ndarray.

    a = [1, 2, 3]
    a_ind = alibox.IndexCollection(a)
    # Or create it by importing the module
    from alipy.index import IndexCollection
    a_ind = IndexCollection(a)

The common methods of IndexCollection are:
- a_ind.index returns the indexes as a list, which can be used to index a matrix.
- a_ind.update() adds a batch of indexes to the IndexCollection object.
- a_ind.difference_update() removes a batch of indexes from the IndexCollection object.
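To make the semantics of these three operations concrete, here is a small stand-in container. `SimpleIndexCollection` is a hypothetical class written for this post; it only mimics the behavior described above and is not ALiPy's implementation.

```python
class SimpleIndexCollection:
    """Hypothetical stand-in that mimics index / update / difference_update."""

    def __init__(self, data=None):
        self._items = []
        self._seen = set()
        if data is not None:
            self.update(data)

    @property
    def index(self):
        # A plain list copy that can be used to index a matrix, e.g. X[ind.index].
        return list(self._items)

    def update(self, other):
        # Add a batch of indexes, skipping the ones already present.
        for item in other:
            if item not in self._seen:
                self._seen.add(item)
                self._items.append(item)
        return self

    def difference_update(self, other):
        # Remove a batch of indexes; indexes that are not present are ignored.
        drop = set(other)
        self._items = [i for i in self._items if i not in drop]
        self._seen -= drop
        return self


label_ind = SimpleIndexCollection([1, 2, 3])
unlab_ind = SimpleIndexCollection([4, 5, 6, 7])

# After a query, the selected indexes move from the unlabeled pool to the labeled set.
selected = [5, 6]
label_ind.update(selected)
unlab_ind.difference_update(selected)
print(label_ind.index, unlab_ind.index)  # → [1, 2, 3, 5, 6] [4, 7]
```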
Split the data
There are two ways to split the data with a ToolBox object.
1. Call alibox.split_AL() with some options:

    alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

This splits the dataset randomly into training, test, labeled, and unlabeled sets 10 times.
2. Use your own split function and pass the indexes train_idx, test_idx, label_idx, and unlabel_idx when initializing the ToolBox object:

    train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
    alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels',
                           train_idx=train_idx, test_idx=test_idx,
                           label_idx=label_idx, unlabel_idx=unlabel_idx)

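If you go the second way, a custom split function could look roughly like the sketch below. The body of `my_own_split_fun`, its extra keyword parameters, and the "one index list per fold" return layout are assumptions made for illustration; check the ToolBox documentation for the exact shapes it expects.

```python
import random


def my_own_split_fun(X, y, test_ratio=0.3, initial_label_rate=0.1,
                     split_count=10, seed=0):
    """Return four lists (train, test, labeled, unlabeled), one index list per fold."""
    n_samples = len(X)
    rng = random.Random(seed)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = list(range(n_samples))
        rng.shuffle(perm)
        n_test = int(round(test_ratio * n_samples))
        test, train = perm[:n_test], perm[n_test:]
        # Keep at least one labeled instance in the initial labeled set.
        n_init = max(1, int(round(initial_label_rate * len(train))))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_init])
        unlabel_idx.append(train[n_init:])
    return train_idx, test_idx, label_idx, unlabel_idx


X = [[float(i)] for i in range(100)]
y = [i % 2 for i in range(100)]
train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y, split_count=3)
print(len(train_idx[0]), len(test_idx[0]), len(label_idx[0]), len(unlabel_idx[0]))
# → 70 30 7 63
```

A more careful split would also check that the initial labeled set covers every class; this sketch does not.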
Use a predefined strategy to select samples
The query strategy is arguably the core of an active learning algorithm. You can get a query strategy object from the ToolBox object just by providing the strategy name:

    uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')

If you manage your indexes with alipy.index.IndexCollection, with the labeled index container Lind and the unlabeled index container Uind, a predefined strategy can be used like this (plain list types also work):

    select_ind = uncertainStrategy.select(label_index=Lind,
                                          unlabel_index=Uind,
                                          batch_size=1)

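To show what an uncertainty strategy does with a prediction matrix, here is a minimal sketch of the least-confident rule: pick the instances whose highest class probability is lowest. The function name `select_least_confident` and the toy probabilities are made up for this post; this is not ALiPy's own code.

```python
import numpy as np


def select_least_confident(unlabel_index, predict, batch_size=1):
    """Return the batch_size indexes whose top class probability is lowest."""
    unlabel_index = np.asarray(unlabel_index)
    predict = np.asarray(predict, dtype=float)
    # Confidence of each instance = probability of its most likely class.
    confidence = predict.max(axis=1)
    # A stable sort keeps the original order among ties.
    order = np.argsort(confidence, kind="stable")
    return unlabel_index[order[:batch_size]].tolist()


Uind = [10, 11, 12, 13]
prob_pred = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.5, 0.5]]
print(select_least_confident(Uind, prob_pred, batch_size=2))  # → [13, 11]
```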
Update and test the model
The available metric functions are 'accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds', 'hamming_loss', 'one_error', 'coverage_error', 'label_ranking_loss', and 'label_ranking_average_precision_score'. There are two ways to use them:
- Import the function from alipy.metrics and call it directly:

    from alipy.metrics import accuracy_score
    acc = accuracy_score(y_true=y, y_pred=model.predict(X))

- Call the calc_performance_metric() method of the ToolBox object:

    acc = alibox.calc_performance_metric(y_true=y, y_pred=model.predict(X),
                                         performance_metric='accuracy_score')

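As a reminder of what two of these metrics measure, here is a plain-Python sketch of accuracy (single-label) and Hamming loss (multi-label). The helper names `simple_accuracy` and `simple_hamming_loss` are hypothetical and only follow the standard definitions; they are not the ALiPy functions.

```python
def simple_accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def simple_hamming_loss(Y_true, Y_pred):
    """Fraction of individual label entries that are predicted wrongly."""
    wrong = 0
    total = 0
    for row_true, row_pred in zip(Y_true, Y_pred):
        for t, p in zip(row_true, row_pred):
            wrong += int(t != p)
            total += 1
    return wrong / total


print(simple_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # → 0.75

# Two instances with three labels each; two of the six entries are wrong.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 1]]
print(simple_hamming_loss(Y_true, Y_pred))
```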
Advanced guide
High-level encapsulation
ToolBox: initialize one object to get any tool
ToolBox, mentioned earlier, is a class that provides all the available tool classes. Through a ToolBox object you can get them without passing redundant parameters.
1. Initialize a ToolBox object

    # query_type can be one of ['AllLabels', 'PartLabels', 'Features']
    from sklearn.datasets import load_iris
    from alipy import ToolBox

    X, y = load_iris(return_X_y=True)
    alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

2. Get the default model
ALiPy provides a logistic regression model with default parameters, implemented by scikit-learn.

    lr_model = alibox.get_default_model()
    lr_model.fit(X, y)
    pred = lr_model.predict(X)
    # Get probabilistic output
    pred = lr_model.predict_proba(X)

3. Split the data

    alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

4. Create an IndexCollection object

    # alipy.index.IndexCollection is ALiPy's index management tool
    a = [1, 2, 3]
    a_ind = alibox.IndexCollection(a)

5. Get Oracle and Repository objects
The ToolBox class provides initialization of a clean oracle.

    # To query by feature vector, set query_by_example=True.
    # cost_mat=None means no cost matrix is passed.
    clean_oracle = alibox.get_clean_oracle(query_by_example=False, cost_mat=None)
    # A repository object is obtained with get_repository(round, instance_flag=False)
    alibox.get_repository(round=0, instance_flag=False)

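Conceptually, a clean oracle returns the ground-truth label (and a cost) for each queried index, while a noisy oracle may return a wrong label. The sketch below illustrates that difference; the classes `SimpleOracle` and `NoisyOracle`, the method name `query_by_index`, and the default cost of 1 are hypothetical choices for this post, not the ALiPy API.

```python
import random


class SimpleOracle:
    """Hypothetical clean oracle: answers with the ground-truth label."""

    def __init__(self, labels, cost=None):
        self._labels = dict(enumerate(labels))
        self._cost = dict(enumerate(cost)) if cost is not None else {}

    def query_by_index(self, indexes):
        labels = [self._labels[i] for i in indexes]
        # Assume a unit cost when no per-instance cost was given.
        costs = [self._cost.get(i, 1) for i in indexes]
        return labels, costs


class NoisyOracle(SimpleOracle):
    """Hypothetical noisy oracle: flips each answer with probability flip_prob."""

    def __init__(self, labels, classes, flip_prob, seed=0):
        super().__init__(labels)
        self._classes = list(classes)
        self._flip_prob = flip_prob
        self._rng = random.Random(seed)

    def query_by_index(self, indexes):
        labels, costs = super().query_by_index(indexes)
        noisy = []
        for label in labels:
            if self._rng.random() < self._flip_prob:
                # Replace the true label with a different class.
                label = self._rng.choice([c for c in self._classes if c != label])
            noisy.append(label)
        return noisy, costs


y_true = [0, 1, 1, 0, 1]
clean = SimpleOracle(y_true)
print(clean.query_by_index([0, 2, 4]))  # → ([0, 1, 1], [1, 1, 1])

always_wrong = NoisyOracle(y_true, classes=[0, 1], flip_prob=1.0)
print(always_wrong.query_by_index([0, 2, 4]))  # → ([1, 0, 0], [1, 1, 1])
```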
6. Get the State and StateIO objects

    saver = alibox.get_stateio(round=1)
    # When adding a query to the StateIO object, you need a State object.
    # It is a dict-like container that holds the necessary information about one query
    # (the state of the current iteration), such as cost, performance, and selected indexes.
    st = alibox.State(select_index=select_ind, performance=accuracy,
                      cost=cost, queried_label=queried_label)

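The idea of saving after every query so that an interrupted run can resume can be sketched with the standard library alone. `SimpleStateIO`, its file naming, and its pickle payload are assumptions invented for this illustration; ALiPy's own StateIO stores more information and has a different interface.

```python
import os
import pickle
import tempfile


class SimpleStateIO:
    """Hypothetical saver that keeps one dict-like state per query."""

    def __init__(self, round, saving_path):
        self.round = round
        self.path = os.path.join(saving_path, f"fold_{round}.pkl")
        self.initial_point = None
        self.states = []

    def set_initial_point(self, performance):
        self.initial_point = performance

    def add_state(self, state):
        self.states.append(dict(state))

    def save(self):
        payload = {"initial_point": self.initial_point, "states": self.states}
        with open(self.path, "wb") as f:
            pickle.dump(payload, f)

    @classmethod
    def load(cls, round, saving_path):
        # Restore a saver from disk, or return an empty one if nothing was saved.
        saver = cls(round, saving_path)
        if os.path.exists(saver.path):
            with open(saver.path, "rb") as f:
                payload = pickle.load(f)
            saver.initial_point = payload["initial_point"]
            saver.states = payload["states"]
        return saver


with tempfile.TemporaryDirectory() as tmp_dir:
    saver = SimpleStateIO(round=1, saving_path=tmp_dir)
    saver.set_initial_point(0.60)
    saver.add_state({"select_index": [3], "performance": 0.65})
    saver.add_state({"select_index": [8], "performance": 0.70})
    saver.save()

    # Later (for example after a crash), resume from the saved file.
    resumed = SimpleStateIO.load(round=1, saving_path=tmp_dir)
    print(len(resumed.states), resumed.states[-1]["select_index"])  # → 2 [8]
```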
7. Get a predefined QueryStrategy object
This was covered above, so here is only a brief reminder:

    QBCStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceQBC')

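Query-by-committee (QBC) selects the instances on which a committee of models disagrees most. One common disagreement measure is vote entropy, sketched below. The function names and the toy votes are made up for this post, and ALiPy's QueryInstanceQBC may use a different disagreement measure by default.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's votes for one instance (0 means full agreement)."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_by_committee(unlabel_index, committee_votes, batch_size=1):
    """Return the batch_size indexes with the highest vote entropy."""
    scores = [vote_entropy(v) for v in committee_votes]
    ranked = sorted(range(len(unlabel_index)), key=lambda i: scores[i], reverse=True)
    return [unlabel_index[i] for i in ranked[:batch_size]]


Uind = [20, 21, 22]
# One row per unlabeled instance, one vote per committee member.
votes = [
    ["a", "a", "a"],  # full agreement
    ["a", "b", "c"],  # maximal disagreement
    ["a", "a", "b"],  # partial disagreement
]
print(select_by_committee(Uind, votes, batch_size=1))  # → [21]
```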
8. Calculate performance

    # Example of using the calc_performance_metric() method of the ToolBox object:
    acc = alibox.calc_performance_metric(y_true=y, y_pred=model.predict(X),
                                         performance_metric='accuracy_score')

9. Get a stopping criterion
ALiPy implements some common stopping criteria:
- No unlabeled samples are available (default)
- The preset number of queries is reached
- The preset cost limit is reached
- A preset percentage of the unlabeled pool has been labeled
- The preset running time (CPU time) is reached

    # stopping_criteria can be one of
    # [None, 'num_of_queries', 'cost_limit', 'percent_of_unlabel', 'time_limit']
    stopping_criterion = alibox.get_stopping_criterion(stopping_criteria='num_of_queries', value=50)

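The way a stopping criterion object is used in the query loop (check, update, reset between folds) can be illustrated with a tiny stand-in. `NumOfQueriesCriterion` is a hypothetical class for this post; in particular, its `update_information` takes a plain counter here, whereas the ALiPy example earlier passes the saver object.

```python
class NumOfQueriesCriterion:
    """Hypothetical criterion: stop once a preset number of queries is reached."""

    def __init__(self, value):
        self.value = value
        self._done = 0

    def is_stop(self):
        return self._done >= self.value

    def update_information(self, queries_done):
        # Record the progress made so far in the current fold.
        self._done = queries_done

    def reset(self):
        # Forget the progress so that the next fold starts from zero.
        self._done = 0


criterion = NumOfQueriesCriterion(value=3)
queries_per_fold = []
for fold in range(2):
    queries = 0
    while not criterion.is_stop():
        queries += 1  # one query iteration would happen here
        criterion.update_information(queries)
    queries_per_fold.append(queries)
    criterion.reset()  # without this, the second fold would stop immediately

print(queries_per_fold)  # → [3, 3]
```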
10. Get the experiment analyser

    # Get the experiment analyser tool
    analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')

11. Get the aceThreading object

    # aceThreading is a class that runs your k-fold experiments in parallel
    # and prints the state of each thread.
    acethread = alibox.get_ace_threading()

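The general idea of running the folds in parallel and reporting on each one can be sketched with the standard library. This is not how aceThreading is implemented; `run_one_fold` and its stand-in workload are placeholders invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_one_fold(fold):
    """Placeholder for one fold's query loop; returns (fold, result)."""
    result = sum(range(fold + 1))  # stand-in work, not a real experiment
    return fold, result


with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order, so they line up with the fold numbers.
    fold_results = dict(pool.map(run_one_fold, range(10)))

for fold in sorted(fold_results):
    print(f"fold {fold}: done, result={fold_results[fold]}")
```

In a real experiment, each call would run the complete query loop for its fold and return that fold's saver.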
12. Save and load ToolBox objects

    # Save the ToolBox object
    alibox.save()
    # Load a ToolBox object from its path
    alibox = ToolBox.load('./al_settings.pkl')

AlExperiment: run an AL algorithm in a few lines of code
ALiPy provides a class that encapsulates the various tools and directly implements the main loop of active learning: alipy.experiment.AlExperiment. Note: AlExperiment only supports the most common setting, querying all labels of an instance.
Code implementation

    # Initialization: the model parameter is a classification model object
    # that follows the scikit-learn API
    from sklearn.datasets import load_iris
    from alipy.experiment.al_experiment import AlExperiment

    X, y = load_iris(return_X_y=True)
    al = AlExperiment(X, y, stopping_criteria='num_of_queries', stopping_value=50)

    # Use the built-in function to generate a new split
    al.split_AL()

    # Classic and advanced query strategies are already implemented.
    # The available strategy names are:
    # ['QueryInstanceQBC', 'QueryInstanceUncertainty', 'QueryRandom',
    #  'QureyExpectedErrorReduction', 'QueryInstanceGraphDensity', 'QueryInstanceQUIRE',
    #  'QueryInstanceBMDR', 'QueryInstanceSPAL', 'QueryInstanceLAL']
    # The GraphDensity and QUIRE methods require additional parameters.
    al.set_query_strategy(strategy="QueryInstanceUncertainty", measure='least_confident')

    # Set the performance metric. ALiPy implements many classic metrics:
    # ['accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds', 'hamming_loss',
    #  'one_error', 'coverage_error', 'label_ranking_loss',
    #  'label_ranking_average_precision_score', 'zero_one_loss']
    al.set_performance_metric('accuracy_score')

    # By default, k folds of active learning are run
    al.start_query(multi_thread=True)

    # Get the experimental results with al.get_experiment_result(),
    # which returns a list of k StateIO objects, one per fold.
    # You can also draw the learning curves of the k folds with
    # al.plot_learning_curve(title=None).

Utility classes in ALiPy
If you are unfamiliar with a module or have questions about it, you can go straight to this address: Parnec.nuaa.edu.cn/_upload/tpl… The usage of each module is introduced and analyzed there in detail.
If you found this article helpful, please follow, comment, and bookmark. Thank you!
The paper has been uploaded: download.csdn.net/download/qq…
GitHub link: github.com/NUAA-AL/ali…
ALiPy website link: parnec.nuaa.edu.cn/_upload/tpl…
While preparing this post I also read Inji's article and learned a lot from it. If you are interested, have a look: blog.csdn.net/weixin_4457…