from Machine Learning Testing: Survey, Landscapes and Horizons


Content

  • PRELIMINARIES OF MACHINE LEARNING

    • elements

      • Dataset
        • Training data
        • Validation data
          • choose your model / fine-tune your hyper-parameters
          • prevent overfitting (e.g., by deciding when to stop training)
        • Test data
      • Learning program
      • Framework
    • different types of machine learning

      • Supervised learning

        a type of machine learning that learns from training data with labels as learning targets. It is the most widely used type of machine learning.

      • Unsupervised learning

        a learning methodology that learns from training data without labels and relies on understanding the data itself.

      • Reinforcement learning

        a type of machine learning where the data are in the form of sequences of actions, observations, and rewards, and the learner learns how to take actions to interact in a specific environment so as to maximise the specified rewards.

    • other classifications

      • classic machine learning

        • Decision Tree

        • SVM

          The main idea is to find a hyperplane that separates the data samples while maximising the margin, i.e., the distance from the closest samples to the hyperplane.

          Finding this hyperplane translates into a constrained optimisation problem.
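          For reference, the standard hard-margin formulation (textbook form, not taken from the survey):

          ```latex
          \min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
          \quad \text{s.t.} \quad y_i \, (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n
          ```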

        • linear regression

          Least-squares solution: obtained by setting the partial derivatives of the squared error to zero
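          For reference, the standard derivation (the zero-gradient condition gives the normal equations):

          ```latex
          \min_w \lVert Xw - y \rVert^2
          \;\Rightarrow\;
          2 X^\top (Xw - y) = 0
          \;\Rightarrow\;
          \hat{w} = (X^\top X)^{-1} X^\top y
          ```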

        • Naive Bayes

          Computes the posterior probability from the prior probability and the likelihood via Bayes' theorem, assuming conditionally independent features
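          The underlying rule, for reference (Bayes' theorem with the naive conditional-independence assumption over features x_1, ..., x_d):

          ```latex
          P(y \mid x_1, \dots, x_d) \;\propto\; P(y) \prod_{j=1}^{d} P(x_j \mid y),
          \qquad
          \hat{y} = \arg\max_y \; P(y) \prod_{j=1}^{d} P(x_j \mid y)
          ```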

      • deep learning

        • DNNs: neural network with many layers
        • CNNs: convolution operation, feature extraction
        • RNNs: introduce the notion of time; the output depends not only on the previous layer but also on the network's own previous state
      • comparison

        • The similarity: machine learning is the use of algorithms to analyse data, learn from the data, and make inferences or predictions.
        • The difference: deep learning applies Deep Neural Networks (DNNs) that use multiple layers of nonlinear processing units for feature extraction and transformation.
  • testing workflow

    • two stage

      • Offline testing
      • Online Testing: to help find out which model is better, or whether the new model is superior to the old model under certain application contexts.
        • A/B testing: a split testing technique that compares two versions of a system (e.g., web pages) with real customers.
        • MAB (Multi-Armed Bandit): first conducts A/B testing for a short time to find the best model, then puts more resources on the chosen model.
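          A minimal epsilon-greedy sketch of the bandit idea above (the serve() reward function and the rates are invented for illustration):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          # Hypothetical online reward (e.g., a click) for serving one of two model versions.
          def serve(model_id):
              true_rates = {0: 0.10, 1: 0.12}   # model 1 is (unknown to us) slightly better
              return float(rng.random() < true_rates[model_id])

          counts = np.zeros(2)    # how often each model has been served
          rewards = np.zeros(2)   # accumulated reward per model
          epsilon = 0.1           # small exploration budget (the short "A/B phase")

          for _ in range(10_000):
              if counts.min() == 0 or rng.random() < epsilon:
                  arm = int(rng.integers(2))                   # explore, like plain A/B testing
              else:
                  arm = int(np.argmax(rewards / counts))       # exploit the best model so far
              counts[arm] += 1
              rewards[arm] += serve(arm)

          print("traffic share:", counts / counts.sum())
          print("estimated reward rates:", rewards / counts)
          ```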
    • Test Input Generation Techniques

      • Domain-specific Test Input Synthesis: synthesises test inputs tailored to a specific application domain

        • DeepXplore: whitebox testing for deep learning systems; introduced neuron coverage
        • DeepTest: autonomous driving systems, greedy search with nine different realistic image transformations
        • Generative adversarial networks (GANs): test generation with various weather conditions
          • Generating model G and discriminating model D
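          For context, the standard GAN minimax objective that couples the generator G and the discriminator D (textbook form, not specific to the testing work above):

          ```latex
          \min_G \max_D \;
          \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
          + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
          ```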
      • Fuzz and Search-based Test Input Generation

        • fuzz testing vs. random testing: fuzzing is a special form of random testing that aims to break the software (e.g., by feeding unexpected or malformed inputs).
      • Symbolic Execution Based Test Input Generation

        A program analysis technique that analyses a program to obtain inputs that make specific areas of code execute. As the name implies, symbolic execution runs the program with symbolic values as inputs rather than the concrete values used in normal execution. When the target code is reached, the analyser obtains the corresponding path constraint and then uses a constraint solver to compute concrete values that trigger the target code.
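        A tiny illustration of the idea on ordinary (non-ML) code, assuming the z3-solver Python package; the target function and its branch condition are made up for illustration:

        ```python
        # pip install z3-solver
        from z3 import Int, Solver, sat

        # Concrete target code: we want an input that reaches the 'reached' branch.
        def target(a, b):
            if a * 2 + b == 10 and b > 3:
                return "reached"
            return "not reached"

        # Treat a and b as symbols, collect the path constraint of the desired
        # branch, and ask the constraint solver for concrete values.
        a, b = Int("a"), Int("b")
        solver = Solver()
        solver.add(a * 2 + b == 10, b > 3)

        if solver.check() == sat:
            m = solver.model()
            concrete = (m[a].as_long(), m[b].as_long())
            print("generated input:", concrete, "->", target(*concrete))
        ```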

      • Synthetic Data to Test Learning Program: Synthetic input is synthesized according to sample distribution.

    • Test Oracle

      • Oracle Problem: how to obtain an oracle that enables the judgement of whether a bug exists
      • Metamorphic Relations as Test Oracles
      • Cross-Referencing as Test Oracles: detects bugs by observing whether similar applications yield different outputs for identical inputs
      • Measurement Metrics for Designing Test Oracles
    • Test Adequacy

      • Test Coverage
        • Neuron coverage: the proportion of neurons activated above a threshold by the test inputs (see the sketch after this list)
        • MC/DC coverage variants
        • Layer-level coverage: checks the combinatorial activation status of the neurons in each layer
        • Limitations of Coverage Criteria: it is not clear how such criteria directly relate to the system decision logic.
      • Mutation Testing
      • Surprise Adequacy: they argued that a 'good' test input should be 'sufficiently but not overly surprising' compared with the training data.
      • E.g., the rule-based checking of tests in China is inadequate to meet the demand
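      A rough numpy sketch of the neuron-coverage idea referenced above (the activation arrays and threshold are illustrative; DeepXplore's exact definition and tooling differ in detail):

      ```python
      import numpy as np

      def neuron_coverage(layer_activations, threshold=0.25):
          """layer_activations: list of (num_test_inputs, num_neurons) arrays.
          A neuron counts as covered if, for at least one test input, its
          activation (scaled to [0, 1] within that input's layer) exceeds
          the threshold."""
          covered, total = 0, 0
          for layer in layer_activations:
              lo = layer.min(axis=1, keepdims=True)
              hi = layer.max(axis=1, keepdims=True)
              scaled = (layer - lo) / (hi - lo + 1e-12)
              covered += int(np.sum((scaled > threshold).any(axis=0)))
              total += layer.shape[1]
          return covered / total

      # Toy example: random "activations" of a 2-layer network on 100 test inputs.
      rng = np.random.default_rng(0)
      acts = [rng.normal(size=(100, 64)), rng.normal(size=(100, 10))]
      print(f"neuron coverage: {neuron_coverage(acts):.2%}")
      ```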
    • Test Prioritisation and Reduction

      • prioritise test inputs so that the more fault-revealing ones are tested first
      • rank the test instances based on their sensitivity to noise, for test generation and reduction
    • Bug Report Analysis

  • testing properties

    • basic functional requirements

      • correctness

        • principle: to isolate test data via data sampling to check whether the trained model fits new cases

          • cross-validation (see the sketch below)
          • Bootstrap: resampling with replacement
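          A minimal k-fold cross-validation sketch (plain numpy splitting around a scikit-learn classifier; the data and model are toy examples):

          ```python
          import numpy as np
          from sklearn.linear_model import LogisticRegression

          def k_fold_accuracy(X, y, k=5, seed=0):
              """Hold each fold out once, train on the rest, and average accuracy."""
              idx = np.random.default_rng(seed).permutation(len(X))
              folds = np.array_split(idx, k)
              scores = []
              for i in range(k):
                  test_idx = folds[i]
                  train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
                  model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
                  scores.append(model.score(X[test_idx], y[test_idx]))
              return float(np.mean(scores))

          # Toy data: two noisy Gaussian blobs.
          rng = np.random.default_rng(0)
          X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
          y = np.array([0] * 100 + [1] * 100)
          print("5-fold CV accuracy:", k_fold_accuracy(X, y))
          ```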
        • correctness measurements

          • accuracy = (TP + TN) / (TP + TN + FP + FN): the proportion of samples that are classified correctly

            For example, consider predicting whether an earthquake will occur in a certain area on a certain day, with a bunch of features as attributes and only two classes: 0 (no earthquake) and 1 (earthquake). A thoughtless classifier that assigns every test case to class 0 may be 99 percent accurate, yet when a real earthquake strikes it notices nothing, and the cost of that misclassification is huge. A classifier with 99% accuracy is therefore not what we want here: the data is not evenly distributed, there is too little data for class 1, and high accuracy can be achieved while completely misclassifying class 1, which is exactly the class we care about.

          • precision = TP / (TP + FP): of the samples predicted as positive, how many are actually positive

          • recall = TP / (TP + FN): of the actual positive samples, how many are predicted correctly

          • F-measure = 2 * P * R / (P + R): the harmonic mean that considers both precision (P) and recall (R)

          • Receiver Operating Characteristic (ROC) and AUC

      • overfitting: leads to high correctness on the existing training data yet low correctness on unseen data.

        • Cross-validation

        • Perturbed Model Validation (PMV)

          PMV operates by injecting noise to the training data, re-training the model against the perturbed data, then using the training accuracy decrease rate to assess model relevance. A larger decrease rate indicates better concept-hypothesis fit.
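          A rough sketch of the PMV idea as described above, using label-flipping as the injected noise (the classifier, noise levels, and data are illustrative, not the authors' implementation):

          ```python
          import numpy as np
          from sklearn.linear_model import LogisticRegression

          def training_accuracy(X, y):
              return LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

          def pmv_decrease(X, y, noise_levels=(0.1, 0.2, 0.3), seed=0):
              """Flip a fraction of the (binary) labels, retrain, and report the drop
              in training accuracy; a larger drop suggests a better concept-hypothesis fit."""
              rng = np.random.default_rng(seed)
              base = training_accuracy(X, y)
              drops = {}
              for p in noise_levels:
                  y_noisy = y.copy()
                  flip = rng.random(len(y)) < p
                  y_noisy[flip] = 1 - y_noisy[flip]
                  drops[p] = base - training_accuracy(X, y_noisy)
              return drops

          # Toy binary data.
          rng = np.random.default_rng(0)
          X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(1.5, 1, (200, 5))])
          y = np.array([0] * 200 + [1] * 200)
          print(pmv_decrease(X, y))
          ```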

    • non-functional requirements

      • robustness: check the correctness of the system in the presence of noise or perturbations

        • adversarial robustness
          • Perturbation Targeting Test Data: adversarial example generation approaches (see the sketch after this list)
          • Perturbation Targeting the Whole System
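          A minimal FGSM-style sketch of one common adversarial example generation approach (assumes a differentiable PyTorch classifier `model` with inputs in [0, 1]; all names are placeholders):

          ```python
          import torch
          import torch.nn.functional as F

          def fgsm_perturb(model, x, y, eps=0.03):
              """Fast Gradient Sign Method: nudge the input in the direction that
              increases the loss, then check whether the prediction changes."""
              x_adv = x.clone().detach().requires_grad_(True)
              loss = F.cross_entropy(model(x_adv), y)
              loss.backward()
              with torch.no_grad():
                  x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0)
              return x_adv.detach()

          # Usage sketch (model, x, y are assumed to already exist):
          #   x_adv = fgsm_perturb(model, x, y)
          #   robust_acc = (model(x_adv).argmax(dim=1) == y).float().mean()
          ```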
      • security

        Low robustness is just one cause of high security risk.

      • data privacy

        current research on data privacy mainly focuses on how to provide privacy-preserving machine learning, rather than on detecting privacy violations

      • interpretability

        • Manual Assessment of Interpretability
        • Automatic Assessment of Interpretability
          • The metric measures whether the model has actually learned the object in an object-identification scenario, by occluding the surroundings of the object.
          • Several model types are identified as having good interpretability, including linear regression, logistic regression and decision tree models.
  • Testing Components

    • Bug Detection in Data

      • purpose

        • whether the data is sufficient for training or testing a model
        • whether the data is representative of future data
        • whether the data contains a lot of noise such as biased labels
        • whether there is skew between training data and test data
        • whether there is data poisoning or adversary information that may affect the model’s performance
      • aspects

        • Bug Detection in Training Data

        • Bug Detection in Test Data

        • Skew Detection in Training and Test Data

          The training instances and the instances that the model predicts on should be consistent in aspects such as features and distributions.

        • Frameworks in Detecting Data Bugs

    • Bug Detection in Learning Program

      • purpose
        • the algorithm is designed, chosen, or configured improperly
        • the developers make typos or errors when implementing the designed algorithm
      • aspect
        • Unit Tests for ML Learning Program
        • Algorithm Configuration Examination: compatibility problems
        • Algorithm Selection Examination: compare deep learning and classic learning
        • Mutant Simulations of Learning Program Faults
    • Bug Detection in Frameworks

      • purpose: checks whether the frameworks of machine learning have bugs that may lead to problems in the final system
      • Solutions towards Detecting Implementation Bugs:
        • use multiple implementations or differential testing to detect bugs
        • metamorphic testing
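        A minimal metamorphic-testing sketch, using a simple order-invariance relation checked against scikit-learn as an example framework (real framework testing combines richer relations with differential testing across implementations):

        ```python
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(200, 5))
        y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
        X_test = rng.normal(size=(50, 5))

        model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

        # Metamorphic relation: shuffling the order of the test instances must not
        # change the prediction made for any individual instance.
        perm = rng.permutation(len(X_test))
        pred_original = model.predict(X_test)
        pred_shuffled = model.predict(X_test[perm])
        assert np.array_equal(pred_original[perm], pred_shuffled), "metamorphic relation violated"
        print("metamorphic relation holds on this run")
        ```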
  • Software Testing vs. ML Testing

    • Component to test: traditional software testing detects bugs in the code; ML testing also needs to detect bugs in the data, the learning program, and the framework

    • Behaviours under test: the behaviours of an ML model may frequently change as the training data is updated

    • Test input: when testing the data, the test input could be a learning program

    • Test oracle

      • traditional software testing usually assumes the presence of a test oracle
      • in traditional testing the oracle is usually determined beforehand
      • in ML testing the expected answers are usually unknown (the oracle problem)
    • Test adequacy criteria

      • line coverage, branch coverage, dataflow coverage
      • new test adequacy criteria are required so as to take the characteristics of machine learning software into consideration.
    • False positives in detected bugs

      ML testing tends to yield more false positives

    • Roles of testers

      data scientists or algorithm designers could also play the role of testers

  • application scenarios: domain-specific testing approaches

    • Autonomous Driving

    • Machine Translation

      Machine translation automatically translates text or speech from one language to another.

      • translation consistency
      • the algorithm for detecting machine translation violations
    • Natural Language Inference

      A Natural Language Inference (NLI) task judges the inference relationship of a pair of natural language sentences. For example, the sentence 'A person is in the room' could be inferred from the sentence 'A girl is in the room'.

      • robustness test
  • research distribution

    • General Machine Learning and Deep Learning: before 2017, papers mostly focus on general machine learning; after 2018, both general machine learning testing and deep learning testing notably increase.

    • Supervised/Unsupervised/Reinforcement Learning Testing: almost all the work we identified in this survey focused on testing supervised machine learning

      reason:

      • First, supervised learning is a widely-known learning scenario associated with classification, regression, and ranking problems. It is natural that researchers would emphasise the testing of widely-applied, known and familiar techniques at the beginning.
      • Second, supervised learning usually has labels in the dataset. It is thereby easier to judge and analyse test effectiveness.
    • Different Learning Tasks: almost all of them focus on classification

    • Different Testing Properties: around one-third (32.1%) of the papers test …; another one-third of the papers focus on robustness and security problems; fairness testing ranks third among all the properties, with 13.8% of the papers.

  • CHALLENGES

    • Test Input Generation
      • applying SBST to generate test inputs for testing ML systems (Search-Based Software Testing (SBST) uses a metaheuristic optimising search technique, such as a Genetic Algorithm, to automatically generate test inputs; see the sketch after this list)
      • how to generate natural test inputs and how to automatically measure the naturalness of the generated inputs (existing test input generation techniques focus more on generating adversarial inputs to test the robustness of an ML system)
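      A toy search-based sketch: an evolutionary loop that mutates candidate test inputs towards a fitness function; the fitness here is a made-up placeholder standing in for a real testing goal:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def fitness(x):
          """Placeholder objective standing in for a real testing goal
          (e.g., coverage gained or closeness to a decision boundary)."""
          return -float(np.sum((x - 0.7) ** 2))

      # Simple evolutionary search: keep the fittest candidates, mutate them,
      # and repeat. Real SBST tools add crossover, constraints, and better fitness.
      population = rng.random((20, 8))             # 20 candidate test inputs, 8 features
      for _ in range(200):
          scores = np.array([fitness(x) for x in population])
          parents = population[np.argsort(scores)[-5:]]
          children = parents[rng.integers(5, size=15)] + rng.normal(0, 0.05, (15, 8))
          population = np.vstack([parents, np.clip(children, 0.0, 1.0)])

      best = max(population, key=fitness)
      print("best candidate test input:", np.round(best, 2))
      ```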
    • Oracle Problem
      • Metamorphic relations: currently proposed manually by human beings
      • A big challenge is thus to automatically identify and construct reliable test oracles for ML testing.
    • Testing Cost Reduction
      • A possible research direction for reducing cost is to represent an ML model in some kind of intermediate state that is easier to test.
      • We could also apply traditional cost reduction techniques such as test prioritisation to reduce the number of test cases while retaining test effectiveness.
  • OPPORTUNITIES

    • More research works are highly desired for unsupervised learning and reinforcement learning.
    • Testing More Properties
    • there are very few benchmarks like CleverHans that are specially designed for the ML testing research (i.e., adversarial example construction) purpose.
    • no work has explored how to better design mutation operators for machine learning code so that the mutants could better simulate real-world machine learning bugs

Reference

  • Evaluation metrics of machine learning: Precision, Recall, F-measure, ROC curve, etc.
  • What does "test oracle" stand for in software engineering?
  • Difference between machine learning and deep learning
  • Deep learning vs. classical machine learning
  • SVM algorithm principle and implementation
  • What is the difference between validation data and test data, and why?
  • Understanding the validation set (validation_data) in neural network training
  • Generative Adversarial Networks (GANs)
  • Bootstrap method details: techniques and examples