By James Wexler, Google AI software engineer
Source | TensorFlow official WeChat account
Building effective machine learning (ML) systems means asking a lot of questions. It is not enough to train a model and walk away. Good practitioners act like detectives, constantly probing to understand their models better: How would changes to a data point affect the model’s predictions? Does the model perform differently for different groups, such as historically marginalized people? How diverse is the dataset used to test the model?
Answering these kinds of questions is not easy. Probing “what if” scenarios often means writing one-off custom code to analyze a specific model. This is not only inefficient, it also makes it hard for non-programmers to take part in shaping and improving ML models. A key focus of Google AI’s PAIR (People + AI Research) initiative is making it easier for a broad set of users to inspect, evaluate, and debug ML systems.
Today we are releasing the What-If Tool (https://pair-code.github.io/what-if-tool), a new feature of the TensorBoard web application that lets users analyze an ML model without writing any code. Given pointers to a TensorFlow model and a dataset, the What-If Tool provides an interactive visual interface for exploring model results.
The What-If Tool’s capabilities include automatically visualizing the dataset with Facets, manually editing examples from the dataset and viewing the effect of those changes, and automatically generating partial dependence plots that show how the model’s predictions change as any single feature is varied. Two of these features are explored in more detail below.
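In the TensorBoard setting, the tool reads examples serialized as tf.Example records, typically from a TFRecord file, and queries a served model for predictions. The sketch below shows one way such a file might be prepared; the feature names, values, and output path are illustrative assumptions, not part of any demo.

```python
import tensorflow as tf

# Hypothetical tabular rows; the feature names and values are illustrative only.
rows = [
    {"age": 38, "occupation": "Sales", "hours_per_week": 40},
    {"age": 52, "occupation": "Exec-managerial", "hours_per_week": 60},
]

def row_to_example(row):
    """Pack one dict of features into a tf.train.Example proto."""
    feature = {}
    for name, value in row.items():
        if isinstance(value, str):
            feature[name] = tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[value.encode("utf-8")]))
        else:
            feature[name] = tf.train.Feature(
                int64_list=tf.train.Int64List(value=[int(value)]))
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write the serialized examples to a TFRecord file that the tool
# can then be pointed at from the TensorBoard dashboard.
with tf.io.TFRecordWriter("census_sample.tfrecord") as writer:
    for row in rows:
        writer.write(row_to_example(row).SerializeToString())
```

The dashboard then takes the path to a file like this together with the address of the served model.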
Counterfactuals
With a click of a button, you can compare a data point to the most similar point for which the model predicts a different result. We call such points “counterfactuals,” and they shed light on the model’s decision boundaries. Alternatively, you can edit a data point by hand and explore how the model’s prediction changes. In the screenshot below, the tool is used on a binary classification model that predicts, from public data in the UCI census dataset, whether a person earns more than $50,000. This is a benchmark prediction task commonly used by ML researchers, especially when analyzing algorithmic fairness, a topic we will get to shortly. Here, for the selected data point, the model predicted with 73% confidence that the person earns more than $50,000. The tool automatically located the most similar person in the dataset for which the model predicted earnings of less than $50,000 and displays the two side by side. In this case, small differences in age and occupation led to a large change in the model’s prediction.
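At its core, the counterfactual button surfaces the most similar example that receives a different prediction. The sketch below illustrates that idea with a simple L1 nearest-neighbor search over numeric features; the distance metric and the toy data are assumptions for illustration, not the tool’s exact implementation.

```python
import numpy as np

def nearest_counterfactual(x, X, preds, pred_x):
    """Return the example in X most similar to x (L1 distance)
    whose model prediction differs from x's prediction."""
    # Keep only candidates the model classifies differently from x.
    mask = preds != pred_x
    candidates = X[mask]
    if candidates.size == 0:
        return None
    # L1 distance is a simple stand-in for the tool's similarity measure.
    distances = np.abs(candidates - x).sum(axis=1)
    return candidates[np.argmin(distances)]

# Toy usage: features could be, e.g., age and hours worked per week.
X = np.array([[38.0, 40.0], [39.0, 45.0], [52.0, 60.0]])
preds = np.array([0, 1, 1])            # hypothetical model outputs
x, pred_x = X[0], preds[0]
print(nearest_counterfactual(x, X, preds, pred_x))  # -> [39. 45.]
```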
Performance and algorithmic fairness analysis
You can also explore the effects of different classification thresholds, taking into account constraints such as different numerical fairness criteria. The screenshot below shows the results of a smile-detection model trained on the open-source CelebA dataset of annotated celebrity face images. Here the face images are split into two groups by whether the person has brown hair, with an ROC curve and a confusion matrix of the predictions drawn for each group, along with sliders that set how confident the model must be before it labels an image as smiling. In this case, the tool automatically set the confidence thresholds for the two groups to optimize the model for equal opportunity.
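As a rough illustration of this kind of analysis, the sketch below computes a per-group confusion matrix and picks per-group score thresholds that reach the same true positive rate, one simplified reading of “equal opportunity.” The synthetic labels, scores, and group attribute are stand-ins for real model outputs, and the tool’s own threshold optimization is more involved than this.

```python
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix

# Synthetic stand-ins: binary labels, model scores in [0, 1], and a
# binary group attribute (e.g. brown hair or not).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.25, size=1000), 0.0, 1.0)
group = rng.integers(0, 2, size=1000)

def threshold_for_tpr(y, s, target_tpr):
    """Largest score threshold whose true positive rate reaches target_tpr."""
    fpr, tpr, thresholds = roc_curve(y, s)
    first_ok = np.where(tpr >= target_tpr)[0][0]
    return thresholds[first_ok]

# Simplified "equal opportunity": give both groups (approximately) the
# same true positive rate by choosing a separate threshold for each.
target = 0.80
for g in (0, 1):
    in_group = group == g
    t = threshold_for_tpr(y_true[in_group], scores[in_group], target)
    preds = (scores[in_group] >= t).astype(int)
    print(f"group {g}: threshold = {t:.2f}")
    print(confusion_matrix(y_true[in_group], preds))
```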
Demos
To illustrate the capabilities of the What-If Tool, we have released a set of demos using pre-trained models:
Detecting misclassifications: a multi-class classification model that predicts a plant’s species from four measurements of its flower. The tool helps show the model’s decision boundaries and what causes misclassifications. The model is trained on the UCI iris dataset; a rough programmatic analogue of this kind of analysis appears after this list.
Assessing fairness in binary classification: the smile-detection image classification model mentioned above. The tool helps assess algorithmic fairness across different subgroups. The model was deliberately trained without any examples from a particular subset of the population, to show how the tool can help uncover such bias in a model. Assessing fairness requires careful consideration of the broader context, but this is a useful quantitative starting point.
Investigating model performance across subgroups: a regression model that predicts a subject’s age from census information. The tool helps show the model’s relative performance across subgroups and how the different features individually affect the prediction. The model is trained on the UCI census dataset.
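The hosted demos require no code, but as a rough programmatic analogue of the first demo, the sketch below trains a simple multi-class model on the iris dataset and lists its misclassified test examples, the cases the What-If Tool lets you inspect interactively. The model choice and data split here are illustrative assumptions, not the demo’s actual setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train a simple stand-in classifier on the iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# List the misclassified test examples, the kind of cases one would
# click on and probe in the What-If Tool's visual interface.
for features, true_label, pred in zip(X_test, y_test, model.predict(X_test)):
    if true_label != pred:
        print(f"features={features}, true={true_label}, predicted={pred}")
```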
What-If in practice
We tested the What-If Tool with teams inside Google and saw its immediate value. One team quickly discovered that their model was mistakenly ignoring an entire feature of the dataset, which led them to fix a previously undiscovered code bug. Another team used it to visually rank their examples from best to worst performance, uncovering patterns in the kinds of examples on which their model underperformed.
We hope people inside and outside Google will use this tool to better understand ML models and begin assessing their fairness. The code is open source, and contributions to the tool are welcome.
Acknowledgements
The What-If Tool is a collaborative effort; its success owes much to the user experience designed by Mahima Pushkarna, the updates to Facets by Jimbo Wilson, and input from many others. We would like to thank the Google teams that tested the tool and provided valuable feedback, and the TensorBoard team for all their help.