Knowledge from theoretical calculations is distilled into materials design to minimize the number of laborious experiments.

Photo: Louis-Etienne Foy on Unsplash

Do you want better materials? Yes, we do. We want better batteries with more capacity and a longer life. We want better solar panels that generate more energy. We want better semiconductors with less Joule heating. All of these performance limits come from the materials the products are made of. So, yes, we want to provide better materials for any product.

Motivation

But how can we find them among an almost infinite number of candidate materials? One way is to make use of simulated databases such as the Materials Project. These databases use DFT (density functional theory) to calculate material properties at scale, which helps narrow the scope of exploration. However, some properties, such as band gaps, are notoriously hard to predict this way, so we cannot take the computed values at face value. And although the cumulative number of crystals in the Crystallography Open Database has reached 476,995, it is not feasible to check all of their properties experimentally. Ah, I wish we had a good experimental database and a material-discovery search engine! Unfortunately, it doesn't exist (at least at this stage). So we should create a way to carry knowledge from theoretical databases over to experimental crystal-structure databases. With the help of machine learning, let's think about how to explore the materials we need effectively.

Problem setting

Let’s consider exploring materials with an ideal band gap. This quantity is the basic guide when we want a better semiconductor, and its ideal value varies depending on the design of the product. Therefore, we can reduce the discovery of semiconductor materials to the following question.

When E is set as the ideal value of the band gap, how can we find the crystal with the closest band gap in a crystal database using as few exploration experiments as possible?

Let’s make this more specific. The points we need to clarify are the following.

1. How close do we need to get to the target?
2. Which indicator is most appropriate?
3. How do we build the data sets?
4. How far do we need to explore?

First, we should aim to estimate the target value within 0.01 eV; the electron volt (eV) is the unit of the band gap. A measured band gap can vary within this range depending on the manufacturing method and conditions, so pursuing higher precision may be futile.

Second, we can use the MAE (mean absolute error) as the indicator, since we care about absolute deviations rather than ratios. As the name implies, it is obtained by averaging the absolute value of the difference between the target value and the estimate.

Third, we can build the data set from CIF (Crystallographic Information File) data. CIF is an international format for describing crystal structures; it contains the basic structural information but no ready-made quantitative descriptors. For the band gap labels, we can use the Materials Project database as a mock experimental data set. If we can explore materials with band gap E in this hypothetical experimental data set, we can apply the same strategy to real experiments.

Fourth, considering a realistic situation, we can survey scientific papers to gather preliminary information on the target value; let’s say we find about 100 such data points. In addition, we can eliminate many candidates on other, less desirable grounds such as synthesis cost or reactivity. This gives us 100 initial pieces of information and narrows our exploration space to about 6,000 candidates. We can then state the experimental material-exploration problem as follows.

When E is set as the ideal value of the band gap, how can we use the 100 pieces of prior information to find crystals with an MAE below 0.01 eV among about 6,000 candidates, with as few exploratory experiments as possible?
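As a concrete illustration of the MAE criterion, here is a minimal sketch (the band gap numbers are made-up examples, not real measurements):

```python
# Mean absolute error between a target band gap E and measured/estimated values.
import numpy as np

def mae(bandgaps, E):
    return float(np.mean(np.abs(np.asarray(bandgaps) - E)))

# Hypothetical example: three explored crystals around E = 2.534 eV.
print(mae([2.51, 2.55, 2.60], 2.534))  # ~0.035 eV, not yet within the 0.01 eV goal
```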

Machine learning strategies

Bayesian optimization

As a quick Google search shows, Bayesian optimization seems to be a good fit for efficient exploration. Bayesian optimization is an algorithm designed to find better data points with as few experiments as possible. For example, the Google Brain team used the algorithm to intelligently optimize chocolate chip cookie recipes. It clearly works in practice, so we should apply it to material discovery. But wait, we need descriptors, in other words, a set of variables that uniquely identifies each crystal. The Google team used quantified values for each cookie recipe, such as the weight ratio of cassava starch. How do we quantify crystals?
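To get a feel for the workflow, here is a minimal, self-contained example using the GPyOpt library (which we will also use later); the toy one-dimensional objective is purely illustrative, and the materials problem comes afterwards:

```python
# Minimal GPyOpt illustration: Bayesian optimization proposes where to "measure"
# next so that a good point is found with few expensive evaluations.
import numpy as np
import GPyOpt

def objective(x):
    # Pretend each call is one expensive experiment (x has shape (n, 1)).
    return np.sin(3 * x) + 0.3 * x ** 2

domain = [{"name": "x", "type": "continuous", "domain": (-3, 3)}]
bo = GPyOpt.methods.BayesianOptimization(
    f=objective, domain=domain,
    acquisition_type="EI",        # expected improvement
    initial_design_numdata=5,     # random points before the surrogate takes over
)
bo.run_optimization(max_iter=15)
print("best x:", bo.x_opt, "best value:", bo.fx_opt)
```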

Crystal graph convolutional neural network

One simple way to do this is to use a deep-learning model pre-trained on crystals. CGCNN, the Crystal Graph Convolutional Neural Network, is a pioneering deep-learning architecture in materials science. In their GitHub repository, the authors provide pre-trained models for everyone to use. In the pre-trained models folder, we can find the band gap model (band-gap.pth.tar). By using this model as a feature extractor, we can automatically convert CIF files into 128 quantitative descriptors.

Principal component analysis

Unfortunately, 128 descriptors are too many for Bayesian optimization. Although there are cutting-edge algorithms for high-dimensional optimization, low-dimensional optimization generally works better and requires no extra effort. Furthermore, these 128 descriptors are only used to quantitatively identify crystals, so such a high dimensionality is not essential. Therefore, we can use PCA, or principal component analysis, to reduce the dimensionality. By reducing the 128 dimensions to three, we can set up a more efficient exploration space.

Code

Python library requirements:

  • pymatgen
  • pytorch
  • scikit-learn
  • GPyOpt
  • collector

Data set construction

We will use the Materials Project API to build the data set. First, you need to create an account on the Materials Project and get an API key; this can be done by following the official instructions. We will then compile two data sets: one for prior information and one for exploration. You should replace MY_API_KEY with your own key.

In this code, we search for crystals with band gaps between 2.3 and 2.8 eV and find 6,749 materials. They are then divided into two folders, “cif_prior” and “cif_experiment”, which contain 100 and 6,649 CIF files, respectively. In addition, the band gap values are stored in each folder as “id_prop.csv”.

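A minimal sketch of this step is shown below. It assumes the legacy pymatgen MPRester query interface (newer Materials Project accounts use the mp-api client instead, so adapt the query call to your setup), and the folder and file names follow the convention above:

```python
# Sketch: build prior/experiment CIF data sets from the Materials Project.
# Assumes the legacy pymatgen MPRester interface (MPRester.query).
import os
import csv
import random

from pymatgen.ext.matproj import MPRester

MY_API_KEY = "MY_API_KEY"  # replace with your Materials Project API key
N_PRIOR = 100              # stand-in for the 100 pieces of prior information

with MPRester(MY_API_KEY) as mpr:
    # Query crystals whose calculated band gap lies between 2.3 and 2.8 eV.
    entries = mpr.query(
        criteria={"band_gap": {"$gte": 2.3, "$lte": 2.8}},
        properties=["material_id", "band_gap", "cif"],
    )

random.seed(0)
random.shuffle(entries)
splits = {"cif_prior": entries[:N_PRIOR], "cif_experiment": entries[N_PRIOR:]}

for folder, items in splits.items():
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "id_prop.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        for e in items:
            # One CIF file per crystal, plus its band gap in id_prop.csv.
            cif_path = os.path.join(folder, f"{e['material_id']}.cif")
            with open(cif_path, "w") as cif_file:
                cif_file.write(e["cif"])
            writer.writerow([e["material_id"], e["band_gap"]])
```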

Convert the CIF files into 128 descriptors using the pre-trained CGCNN model

You can clone the CGCNN repository by following the official instructions here. You need to copy atom_init.json into the “cif_prior” and “cif_experiment” folders. You can then create the feature-extraction script, extract_feature.py, by modifying predict.py; I based it on the validate function in predict.py. The code is too long to write down in full here, so I’ll just outline the changes.

First, modify the last part of the main function so that it writes the extracted crystal features to a CSV file instead of only the predictions.

Then, modify the middle part of the validate function so that it collects the 128-dimensional crystal feature vector of each batch.
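As a rough guide, here is a self-contained sketch of the same idea. Instead of editing predict.py line by line, it uses a forward hook to capture the 128-dimensional crystal features; the CIFData, collate_pool, and CrystalGraphConvNet interfaces and the checkpoint layout are assumed to match the CGCNN repository, so verify them against your clone:

```python
# Sketch: extract 128-dimensional crystal features with a forward hook.
# Assumes the CIFData / collate_pool / CrystalGraphConvNet interfaces and the
# checkpoint layout used by the CGCNN repository's predict.py.
import argparse
import csv
import sys

import torch
from torch.utils.data import DataLoader

from cgcnn.data import CIFData, collate_pool
from cgcnn.model import CrystalGraphConvNet

model_path, cif_dir = sys.argv[1], sys.argv[2]

dataset = CIFData(cif_dir)   # the folder needs id_prop.csv and atom_init.json
loader = DataLoader(dataset, batch_size=64, collate_fn=collate_pool)

# Rebuild the model with the hyperparameters stored in the checkpoint.
checkpoint = torch.load(model_path, map_location=torch.device("cpu"))
model_args = argparse.Namespace(**checkpoint["args"])
structures, _, _ = dataset[0]
model = CrystalGraphConvNet(
    structures[0].shape[-1], structures[1].shape[-1],
    atom_fea_len=model_args.atom_fea_len, n_conv=model_args.n_conv,
    h_fea_len=model_args.h_fea_len, n_h=model_args.n_h,
)
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Capture the input of the final linear layer: the 128-d crystal descriptor.
features = []
model.fc_out.register_forward_hook(
    lambda module, inputs, output: features.append(inputs[0].detach())
)

crystal_ids = []
with torch.no_grad():
    for inputs, _, batch_ids in loader:
        model(*inputs)
        crystal_ids.extend(batch_ids)

with open("cgcnn_features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for cid, vec in zip(crystal_ids, torch.cat(features)):
        writer.writerow([cid] + vec.tolist())
```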

You can then execute extract_feature.py with the following arguments.

python3 extract_feature.py ./pre-trained/band-gap.pth.tar ./cif_experiment

This gives you the 128 descriptors as cgcnn_features.csv. We should create features for both “cif_prior” and “cif_experiment”, renaming the outputs to cgcnn_features_prior.csv and cgcnn_features_experiment.csv.

Reduce the dimension from 128 to 3 through PCA

We will convert the 128 features into 3-D data. Run the code for both data sets and rename the outputs to cgcnn_pca_prior.csv and cgcnn_pca_experiment.csv.

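A sketch of this step is shown below. As a design choice it fits the PCA once on the experiment set and reuses the same transformation for the prior set, so both data sets end up in the same 3-D exploration space; it also appends the band gap from id_prop.csv to each row for later convenience (an assumption about the file layout, not the author's exact script):

```python
# Sketch: reduce the 128 CGCNN descriptors to three principal components.
# Assumes each feature CSV has one row per crystal (id followed by 128 values)
# and that id_prop.csv in each folder maps id -> band gap, as built earlier.
import pandas as pd
from sklearn.decomposition import PCA

def load(features_csv, folder):
    feats = pd.read_csv(features_csv, header=None)
    gaps = pd.read_csv(f"{folder}/id_prop.csv", header=None,
                       names=["id", "band_gap"]).set_index("id")["band_gap"]
    ids = feats.iloc[:, 0]
    return ids, feats.iloc[:, 1:].to_numpy(), gaps.reindex(ids).to_numpy()

exp_ids, exp_X, exp_gaps = load("cgcnn_features_experiment.csv", "cif_experiment")
pri_ids, pri_X, pri_gaps = load("cgcnn_features_prior.csv", "cif_prior")

# Fit PCA on the large experiment set and reuse it for the prior set.
pca = PCA(n_components=3).fit(exp_X)

for name, ids, X, gaps in [("experiment", exp_ids, exp_X, exp_gaps),
                           ("prior", pri_ids, pri_X, pri_gaps)]:
    out = pd.DataFrame(pca.transform(X), columns=["pc1", "pc2", "pc3"])
    out.insert(0, "id", ids.values)
    out["band_gap"] = gaps
    out.to_csv(f"cgcnn_pca_{name}.csv", header=False, index=False)
```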

Bayesian optimization

Finally, we will use Bayesian optimization to explore better materials. The experimental setup is done by defining the following class instance.

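A hedged sketch of such a class follows. The class name MaterialsDiscoveryExperiment, its constructor arguments, and the losses attribute are illustrative assumptions (only exp.expred_bandgaps and exp.crystals appear in the text); it couples the 3-D PCA candidate pool to a GPyOpt loop, snapping each continuous suggestion to the nearest unexplored crystal:

```python
# Hypothetical experiment class: the name and most attributes are illustrative
# reconstructions, not the author's original code. It wraps the discrete
# candidate pool in a GPyOpt Bayesian-optimization loop.
import numpy as np
import pandas as pd
import GPyOpt


class MaterialsDiscoveryExperiment:
    def __init__(self, prior_csv, experiment_csv, target_bandgap, tolerance=0.01):
        # Assumed CSV layout (no header): crystal id, pc1, pc2, pc3, band gap.
        prior = pd.read_csv(prior_csv, header=None)
        pool = pd.read_csv(experiment_csv, header=None)
        self.X_obs = prior.iloc[:, 1:4].to_numpy()              # prior PCA coords
        self.y_obs = np.abs(prior.iloc[:, 4].to_numpy()
                            - target_bandgap).reshape(-1, 1)    # prior losses
        self.pool_ids = pool.iloc[:, 0].to_numpy()
        self.pool_X = pool.iloc[:, 1:4].to_numpy()
        self.pool_bandgaps = pool.iloc[:, 4].to_numpy()         # mock "measurements"
        self.target, self.tolerance = target_bandgap, tolerance
        # Search box spanning the 3-D PCA exploration space.
        self.domain = [
            {"name": f"pc{i + 1}", "type": "continuous",
             "domain": (float(self.pool_X[:, i].min()),
                        float(self.pool_X[:, i].max()))}
            for i in range(3)
        ]
        # Attribute names follow the article's exp.expred_bandgaps / exp.crystals.
        self.crystals, self.expred_bandgaps, self.losses = [], [], []

    def run(self, n_experiment=30):
        explored = np.zeros(len(self.pool_ids), dtype=bool)
        for _ in range(n_experiment):
            # Refit the surrogate on all observations and ask for the next point.
            bo = GPyOpt.methods.BayesianOptimization(
                f=None, domain=self.domain,
                X=self.X_obs, Y=self.y_obs, acquisition_type="EI",
            )
            suggestion = bo.suggest_next_locations()[0]
            # Snap the continuous suggestion to the nearest unexplored crystal.
            dist = np.linalg.norm(self.pool_X - suggestion, axis=1)
            dist[explored] = np.inf
            idx = int(np.argmin(dist))
            explored[idx] = True
            bandgap = float(self.pool_bandgaps[idx])   # the mock "experiment"
            loss = abs(bandgap - self.target)
            self.crystals.append(self.pool_ids[idx])
            self.expred_bandgaps.append(bandgap)
            self.losses.append(min(loss, self.losses[-1]) if self.losses else loss)
            self.X_obs = np.vstack([self.X_obs, self.pool_X[idx]])
            self.y_obs = np.vstack([self.y_obs, [[loss]]])
            if loss <= self.tolerance:                 # within 0.01 eV of E: done
                break
        return self
```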

All the preparatory work has been done. The following code automatically explores better materials. In this setting, the target band gap is set to E = 2.534 eV. The desired material should be within an MAE of 0.01 eV, so the target range is between 2.524 and 2.544 eV. The Bayesian optimization loop is repeated up to 30 times (n_experiment). The discovered materials and corresponding values are stored in the instance attributes below.

  • Explored band gap values: exp.expred_bandgaps
  • Crystal names: exp.crystals
  • Cumulative loss curve (also stored in the instance)

You can freely visualize or export these results by adding code that reads these instance attributes.

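A usage sketch based on the hypothetical class above (the attribute and file names follow the earlier sketches); plotting the losses attribute gives the cumulative loss curve:

```python
# Usage sketch: E = 2.534 eV, MAE target 0.01 eV, up to 30 iterations.
exp = MaterialsDiscoveryExperiment(
    prior_csv="cgcnn_pca_prior.csv",
    experiment_csv="cgcnn_pca_experiment.csv",
    target_bandgap=2.534,
    tolerance=0.01,
)
exp.run(n_experiment=30)

# Explored crystals, their band gaps, and the cumulative best loss so far.
for name, bg in zip(exp.crystals, exp.expred_bandgaps):
    print(f"{name}: {bg:.3f} eV")
print("best |band gap - E| so far:", exp.losses[-1])
```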

In my setup, Bayesian optimization found the required material within five experiments. So this approach seems useful for practical materials-exploration schemes. Enjoy the fun of exploring materials!


Exploring Better Materials using Deep Graph Convolutional Networks and Bayesian Optimization was originally published in Towards Data Science.