Following on from yesterday’s ten suggestions for AI projects – Concept, today we are going to talk about ten guiding principles for AI projects.
When planning an AI project, it is critical to identify the goals the model is being built to serve, but that understanding alone cannot guarantee a successful solution. To truly deliver, the AI project team must also follow a sound implementation path when executing the project. To help you stay on that path, let’s discuss how to execute on the previous ten suggestions.
1. Know your problem
The most fundamental part of solving any problem is knowing exactly what problem you are solving. Make sure you understand what you are predicting, what the constraints are, and what the ultimate goal of the project is. Ask questions early and validate your understanding with peers, business experts, and end users. If the answers you get are consistent with your understanding, you know you’re on the right track.
2. Know your data
By understanding what your data means, you can judge which models are likely to work well and which features are worth using. The questions behind the data affect which model will be most successful, and computation time affects project cost. By using and creating meaningful features, you can emulate or even improve on human decisions. Knowing what each field means is important to the problem, especially in regulated industries, where data may need to be anonymized and is therefore less transparent. If you are not sure what a feature means, consult a business expert in the field.
3. Split your data
How will your model handle unseen data? If it can’t generalize to new data, its performance on the data it has already seen doesn’t matter much. By never letting your model see part of the data during training, you can verify how it performs under unknown conditions. This approach is critical for choosing the right model architecture and tuning hyperparameters for optimal performance.
For supervised learning problems, you need to split the data into two or three parts.
The training data — the data from which the model learns — is typically a randomly selected 75–80% of the original data.
The test data — the data you evaluate the model on — is the rest.
Depending on the type of model you are building, you may also need a third hold-out set, called the validation set, to compare multiple supervised learning models that have each been tuned against the test data. In this case, split the non-training data into two sets: test and validation. Use the test data to compare iterations of the same model, and the validation data to compare final versions of different models.
In Python, the easiest way to split data correctly is to use scikit-learn’s train_test_split function.
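As a minimal sketch of the two- or three-way split (using scikit-learn’s bundled iris data purely as a stand-in for your own features and labels):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in data; use your own features and labels

# First split: hold out 25% of the data from training.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# Second split (optional): divide the hold-out into equal test and validation sets.
X_test, X_val, y_test, y_val = train_test_split(
    X_holdout, y_holdout, test_size=0.5, random_state=42
)
```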
4. Don’t give away test data
It is important that no information from the test data leaks into your model. The leak can be as blatant as training on the entire data set, or as subtle as performing transformations (such as scaling) before splitting. For example, if you normalize the data before splitting, the model gains information about the test set, because the global minimum or maximum may lie in the held-out data.
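Here is a hedged sketch of handling the scaling example correctly with scikit-learn’s StandardScaler (the data set is again just a placeholder): fit the transformation on the training data only, then apply it to the test data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/variance from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; no peeking at test data

# Wrong: scaler.fit_transform(X) before splitting would leak the test set's
# statistics (its mean, minimum, maximum) into the training process.
```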
5. Use the right metrics
Because every problem is different, the appropriate evaluation metric must be selected based on context. The most naive — and perhaps most dangerous — classification metric is accuracy. Consider a test for cancer. If all we want is a fairly accurate model, we can always predict “not cancer” and be right more than 99% of the time. But that is not a useful model: we actually want to detect cancer. Think carefully about which metrics to use for both classification and regression problems.
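To make the cancer example concrete, here is a tiny illustration (the labels are fabricated for demonstration only) of how accuracy can flatter a useless model while recall exposes it:

```python
from sklearn.metrics import accuracy_score, recall_score

# Fabricated labels: 1 = cancer, 0 = not cancer; the disease is rare.
y_true = [0] * 99 + [1]    # one positive case in a hundred
y_pred = [0] * 100         # a "model" that always predicts "not cancer"

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -- catches no cancer at all
```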
6. Keep it simple
When dealing with a problem, it’s important to choose the right solution for the job, not the most complex model. Management, customers, and even you may want to use the “latest and greatest,” but you should use the simplest model that meets your needs: this is Occam’s razor. Not only does this give you more transparency and shorter training times, it can actually improve performance. In short, don’t shoot a fly with a rocket launcher, and don’t try to kill Godzilla with a fly swatter.
7. Don’t overfit (or underfit) your model
Overfitting, associated with high variance, causes models to perform poorly on data they haven’t seen before: the model simply memorizes the training data. Underfitting, associated with high bias, occurs when the model has too little capacity or information to learn the correct representation of the problem. Balancing the two — the “bias-variance trade-off” — is an important part of the AI process, and different problems require different equilibria.
Let’s take a simple image classifier as an example. Its job is to decide whether there is a dog in an image. If you overfit the model, it will not recognize an image as a dog unless it has seen that exact image before. If you underfit the model, it may fail to recognize a dog even in an image it has seen before.
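One practical way to spot both failure modes (a sketch on synthetic data; the depth values are arbitrary) is to compare training and test scores while varying model capacity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data so both failure modes are visible.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

for depth in (1, 5, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    # Underfitting: both scores low. Overfitting: high train score, much lower test score.
    print(f"depth={depth}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```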
8. Experiment with different model architectures
Most of the time, it is beneficial to consider different model architectures for a problem. What works best for one problem may not work well for another. Try mixing simple and complex algorithms: if you are building a classification model, try something as simple as a random forest alongside something as complex as a neural network. Interestingly, extreme gradient boosting (XGBoost) often far outperforms neural network classifiers. A simple problem is usually best solved with a simple model.
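A quick sketch of trying candidates of different complexity side by side (the specific models and the synthetic data are examples, not a prescription):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "neural network": MLPClassifier(max_iter=2000, random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

If the xgboost package is installed, its scikit-learn-compatible XGBClassifier can be dropped into the same dictionary.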
9. Adjust your hyperparameters
Hyperparameters are the configuration values a model is trained with, set before training rather than learned from the data. For example, one hyperparameter of a decision tree is the depth of the tree: how many questions it asks before deciding on an answer. A model’s default hyperparameters are those that give the best performance on average, but it is unlikely that your problem sits exactly in that sweet spot; your model can often perform better with different values. The most common methods for tuning hyperparameters are grid search, random search, and Bayesian optimization, along with many more advanced techniques.
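As a hedged sketch of the simplest of these, grid search (the grid values below are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

# Arbitrary example grid; real search spaces depend on the model and the data.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # winning combination and its CV score
```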
10. Compare models correctly
The ultimate goal of machine learning is to develop a model that generalizes well. That’s why it’s so important to compare models and choose the best one. As mentioned above, when evaluating, you need to use a hold-out set different from the one used to tune hyperparameters. In addition, you need to use appropriate statistical tests to evaluate the results.
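One simple, common option (a sketch only; a paired t-test over matched cross-validation folds has known caveats, since folds overlap and are not fully independent) looks like this:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

# Score both final models on identical folds so the comparison is paired.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # a small p-value suggests a real difference
```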
Now that you’ve mastered these guiding principles for executing AI projects, try them out on your next AI project. I’d love to know whether any of this advice helped you. Please share your own tips in the comments or send me a private message below!