Databases

Database manipulation

db_file (character(len=80))

The file that defines the database used for fitting the potential. Each line has the syntax: class KLM number_of_files number_of_selected_files

Default "db_model_in"

db_path (character(len=60))

Path to the directory where the .poscar files are located. With the default value, the .poscar files should be in the DB directory.

Default "./DB/".

selection_type (integer)

Selection mode applied inside each class and KLM subdatabase listed in db_file:

1 selects the first “ns” elements of the database;
2 selects the last “ns” elements of the database;
3 randomly selects “ns” subsets of “kelem” elements of the database;
4 selects the first “ns” subsets of “kelem” elements of the database, starting from the configuration defined in db_file.

With the new input type, which uses db_file, “ns” is given by number_of_selected_files for the corresponding class and KLM entry.

Default 3.
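The four selection modes can be illustrated with a minimal Python sketch. It is not MiLaDy's implementation: it assumes that every subset holds a single file (kelem = 1), the function name `select_files` and its `start` argument are hypothetical, and Python's random generator will not reproduce the configurations that MiLaDy draws for the same seed.

```python
import random

def select_files(files, ns, selection_type, seed=11, start=1):
    """Return the training subset of one class/KLM subdatabase.

    Simplifying assumption: every subset holds a single file (kelem = 1).
    `start` is the 1-based first configuration used by selection_type=4.
    """
    if ns > len(files):
        raise ValueError("ns exceeds the number of files in the subdatabase")
    if selection_type == 1:    # first ns elements
        return files[:ns]
    if selection_type == 2:    # last ns elements
        return files[len(files) - ns:]
    if selection_type == 3:    # ns elements drawn at random, reproducible via seed
        rng = random.Random(seed)
        return rng.sample(files, ns)
    if selection_type == 4:    # first ns elements counted from `start`
        return files[start - 1:start - 1 + ns]
    raise ValueError(f"unknown selection_type: {selection_type}")

# Hypothetical subdatabase with ten configurations.
files = [f"conf_{i:02d}" for i in range(1, 11)]
print(select_files(files, 3, 1))
print(select_files(files, 3, 2))
print(select_files(files, 3, 4, start=5))
```

In this sketch the files that are not returned would form the test set.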

seed (integer)

Seed for random number generator.

Default 11.

iread_energy (integer)

This option selects which energy from the database MiLaDy takes as the target. The first line of each .poscar file contains three energies, as described in Section Database files. A value of 1, 2 or 3 selects the first, second or third energy, respectively.

Default 2.

ref_energy_per_element (character(len=80))

This option sets the reference energy of each species. These reference energies are useful when a shift is to be applied to the total energy read by MiLaDy from the .poscar files.

This change of reference can be very useful in many situations, of which we give two examples: (i) to reduce the magnitude of the DFT total energy by a rigid shift (some ab initio codes return total energies of large magnitude, which can induce numerical instabilities in the fitting procedure); (ii) for physical reasons, to obtain clean atomization energies for molecules, such that the energy of atoms separated at infinity is zero. There are many other situations in which this rigid shift can be useful.

How does it work? Consider a target energy \(E_{\textrm{DFT}}\) read by MiLaDy using the option iread_energy. Suppose that the system has N atoms distributed over S species, with \(n_1\), \(n_2\), \(\ldots\), \(n_S\) atoms of each species, and that each species has a reference energy \(E_{\textrm{ref},s}\) with \(s=1, \ldots, S\). The MiLaDy target energy is then given by:

\[E_{\textrm{target}} = E_{\textrm{DFT}} - \sum_{s=1}^{S} n_s E_{\textrm{ref},s}\]

The number of values provided in ref_energy_per_element must be equal to fix_no_of_elements (this option is described in Atomic systems); otherwise MiLaDy stops with a fatal error. For example, ref_energy_per_element="-3.d0 2.12d0 -1.d0" provides three values, -3.d0, 2.12d0 and -1.d0, for species 1, 2 and 3, respectively.

Default ref_energy_per_element="0.d0".
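The shift can be checked by hand with a short Python sketch. The helper names `parse_fortran_reals` and `target_energy` are hypothetical, and the atom counts and the DFT energy below are invented for illustration; only the three reference energies come from the example above.

```python
def parse_fortran_reals(text):
    """Convert a string such as "-3.d0 2.12d0 -1.d0" into Python floats."""
    return [float(tok.lower().replace("d", "e")) for tok in text.split()]

def target_energy(e_dft, atom_counts, ref_energies):
    """E_target = E_DFT - sum over species of n_s * E_ref,s."""
    if len(atom_counts) != len(ref_energies):
        raise ValueError("one reference energy is required per species")
    return e_dft - sum(n * e for n, e in zip(atom_counts, ref_energies))

refs = parse_fortran_reals("-3.d0 2.12d0 -1.d0")

# Hypothetical system: 2, 1 and 4 atoms of species 1, 2 and 3, E_DFT = -10 eV.
print(target_energy(-10.0, [2, 1, 4], refs))
```

For these invented numbers the sum over species is 2(-3) + 1(2.12) + 4(-1) = -7.88, so the target energy is -10 - (-7.88) = -2.12 eV, up to floating-point rounding.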

Database files

Database file format

Database files for MiLaDy are stored in the .poscar format.

Besides the standard information usually included in .poscar files (cell vectors, number of atoms, atomic coordinates and forces), our database files also contain explicit information about the chemical composition and the energy of the system in the first line, as well as the stress tensor (six independent components in the \(\sigma_{xx}\), \(\sigma_{yy}\), \(\sigma_{zz}\), \(\sigma_{yz}\), \(\sigma_{xz}\), \(\sigma_{xy}\) order, as provided by VASP) and the ISPIN tag at the end of the file.

The energies, forces and stress are provided in eV, eV/Å and eV/Å\(^{3}\), respectively.

An example of a typical database .poscar file is reported below.

111 1 Fe 26 -15.7255500 0.7884238 0 # EFS-tag n element mass E_1 E_2 E_3
1.00000000 # unit = 1Å
2.63475324 0.00000000 0.00000000 # cell vectors
0.00000000 2.63475324 0.00000000
0.00000000 0.00000000 2.63475324
2 # number of atoms
Cartesian
0.00000000 0.00000000 0.00000000 # atomic positions
1.31737662 1.31737662 1.31737662
# empty line
0.00000000 0.00000000 0.00000000 # forces
0.00000000 0.00000000 0.00000000
# empty line
-0.42315918 -0.42315918 -0.42315918 0 0 0 # stress
# empty line
2 # ISPIN tag: 2 - magnetic, 1 - non magnetic; 0 - not known

The first line of this file (treated as a comment by VASP) indicates that the file contains information about the energy (E=1 in EFS), forces (F=1 in EFS) and stress (S=1 in EFS); that the system consists of 1 chemical element, Fe, listed with atomic mass 26; that the total energy of the system is -15.7255500; that the target energy value for training (\(E^{tot}-E^{ref}_{1}\)) is 0.7884238; and that the alternative target value (\(E^{tot}-E^{ref}_{2}\)) is 0. In this example, the reference energy \(E^{ref}_{1}\) is the total energy of a perfect crystal.

For systems that contain more than one chemical element (alloys, oxides, etc.), the first and sixth lines change slightly. For instance, the first line for Fe\(_{3}\)C cementite with known energies and forces (but no stress tensor) is:

110 2 Fe 26 C 12 -15.7255500 0.7884238 0

and the sixth line for the 4-atom Fe\(_{3}\)C system is

3 1

Database files in this format can be used directly as input configurations for VASP calculations and for structure visualization in OVITO. Simply renaming the .poscar files to POSCAR also allows them to be visualized in VESTA.
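The layout of the first line described above can be read with a few lines of Python. This is only a sketch based on the two example headers shown in this section: the function name `parse_poscar_header` and the returned dictionary keys are hypothetical, and stripping a trailing `#` comment is an assumption that mirrors the annotated example rather than a documented rule.

```python
def parse_poscar_header(line, iread_energy=2):
    """Parse the first line of a MiLaDy-style .poscar file."""
    tokens = line.split("#", 1)[0].split()   # drop a trailing comment, if any
    efs_tag = tokens[0]                      # e.g. "111" or "110"
    n_elements = int(tokens[1])
    species = []
    for i in range(n_elements):              # (symbol, mass field) pairs
        species.append((tokens[2 + 2 * i], float(tokens[3 + 2 * i])))
    first = 2 + 2 * n_elements               # index of the first energy
    energies = [float(t) for t in tokens[first:first + 3]]
    if len(efs_tag) != 3 or len(energies) != 3:
        raise ValueError("unexpected header layout")
    if iread_energy not in (1, 2, 3):
        raise ValueError("iread_energy must be 1, 2 or 3")
    return {
        "has_energy": efs_tag[0] == "1",
        "has_forces": efs_tag[1] == "1",
        "has_stress": efs_tag[2] == "1",
        "species": species,
        "energies": energies,
        "target": energies[iread_energy - 1],  # energy chosen by iread_energy
    }

fe = parse_poscar_header("111 1 Fe 26 -15.7255500 0.7884238 0")
fe3c = parse_poscar_header("110 2 Fe 26 C 12 -15.7255500 0.7884238 0")
print(fe["target"], fe3c["species"], fe3c["has_stress"])
```

With the default iread_energy=2 the second energy of the header is taken as the target, which matches the default documented for that option.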

Descriptor calculations without derivatives, i.e. desc_forces=.false. In this particular case only the atomic positions and the box information are read. Any information about forces, spin, etc. is ignored and is not compulsory.

Other file formats. Some of the files can be stored in the binary .traj format, generated and read by the Atomic Simulation Environment (ASE). The data can then be extracted to the database .poscar format with the Python script extract_traj.py, which is provided together with the .traj files. Conversion of the .poscar DB files (compatible with MiLaDy) into the extended .xyz format can be performed using DB_poscar2xyz.py. The inverse conversion from .xyz to .poscar can be done with DB_xyz2poscar.py.

Database file names

The database files for MiLaDy are generally named CCKLMXXXXXX.poscar. In this notation, the class CC is a number that can vary from 01 to 99. This part of the file name indicates a physical property that can be derived from these files (e.g., elasticity, point defects, etc.). The class is directly linked to the characteristics of the system that should be fitted (E=energy, F=forces, S=stress). Thus, for example, for classes 01 and 02, corresponding to equations of state (EOS) and elasticity, the energies and stress (ES) are the important quantities to fit (forces are equal to zero in this case), while for class 04, with Generalized Stacking Faults (GSF), only the energies are of interest.

The KLM notation in the file names describes the type of system, which includes the composition, the structure, and the source of the database. The first index K indicates a material (composition + structure). For instance, bcc Fe corresponds to K=1, hcp Fe to K=2, bcc W to K=3, etc.
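The naming scheme can be decoded with a regular expression, as in the following Python sketch. It assumes that CC, KLM and XXXXXX are made of two, three and six digits respectively; the file name used in the example and the function name `split_db_name` are hypothetical.

```python
import re

# Two digits for the class, three for KLM, six for the configuration index.
NAME_RE = re.compile(r"^(?P<cc>\d{2})(?P<klm>\d{3})(?P<index>\d{6})\.poscar$")

def split_db_name(filename):
    """Split CCKLMXXXXXX.poscar into its (CC, KLM, XXXXXX) parts."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a CCKLMXXXXXX.poscar name: {filename}")
    return match.group("cc"), match.group("klm"), match.group("index")

# Hypothetical file name: class 01, KLM 120, configuration 000010.
print(split_db_name("01120000010.poscar"))  # ('01', '120', '000010')
```

The parts are kept as strings so that leading zeros, which are significant in the file names, are not lost.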

Input file db_model.in

The input file db_model.in provides a detailed summary of the database that will be used to fit an ML potential in MiLaDy. Each line in the file stands for one category of the database, identified by its class CC and KLM. After providing the relevant CC and KLM attributes, one should indicate the total number of files in the database belonging to this category and how many of them should be used for training the potential. The rest of the files will be used for testing.

For each category listed in db_model.in, one can independently define its EFS tag in the form of T or F flags standing for energies, forces and stress, respectively. The final EFS fitting scheme for a given system is the superposition of the EFS flags provided in db_model.in and of the EFS-tag in the first line of the .poscar file (see Section 4.1): a quantity is fitted only if it is enabled in both. For instance, if the EFS-tag in the .poscar file is 110 and the flags in db_model.in are TFF, the fit will be performed only for the energies.
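The superposition of the two tags amounts to an element-wise logical AND, which the following Python sketch makes explicit. The function name `combine_efs` is hypothetical; the input values are those of the example above.

```python
def combine_efs(poscar_tag, model_flags):
    """Return which of (energy, forces, stress) are actually fitted.

    poscar_tag  : EFS-tag from the first line of the .poscar file, e.g. "110"
    model_flags : T/F flags from db_model.in, e.g. "T F F"
    """
    in_file = [c == "1" for c in poscar_tag]
    requested = [f.upper() == "T" for f in model_flags.split()]
    if len(in_file) != 3 or len(requested) != 3:
        raise ValueError("both tags must have exactly three entries")
    # A quantity is fitted only if it is present in the file AND requested.
    return tuple(a and b for a, b in zip(in_file, requested))

print(combine_efs("110", "T F F"))  # (True, False, False): energies only
```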

The last six numbers in each line, \(w^{min}_E, w^{min}_F, w^{min}_S\) followed by \(w^{max}_E, w^{max}_F, w^{max}_S\), define the three ranges within which the regression weights will be varied for energies, forces and stress, respectively. The search for the optimum regression errors is performed using an evolutionary algorithm. In the example of db_model.in below, the weights are set to vary between 1e2 and 1e6 for energy, 1e1 and 1e2 for forces, and 1e2 and 1e4 for stress. Setting 1.e0 everywhere results in a simple fit without regression weights.

120 614 425 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
110 22  15  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
120 22  15  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
130 22  15  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
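As a cross-check of the column order, the weight ranges can be extracted from a line of the example with the Python sketch below. It relies only on the last six tokens of the line, so it does not depend on the leading columns; the function name `parse_weight_ranges` is hypothetical.

```python
def parse_weight_ranges(line):
    """Return the (min, max) regression-weight range for E, F and S.

    The last six tokens are read as w_min(E, F, S) followed by w_max(E, F, S).
    """
    values = [float(tok) for tok in line.split()[-6:]]
    if len(values) != 6:
        raise ValueError("expected six weight values at the end of the line")
    w_min, w_max = values[:3], values[3:]
    ranges = {}
    for name, low, high in zip(("energy", "forces", "stress"), w_min, w_max):
        if low > high:
            raise ValueError(f"empty weight range for {name}")
        ranges[name] = (low, high)
    return ranges

line = "120 614 425 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4"
for name, (low, high) in parse_weight_ranges(line).items():
    print(f"{name}: {low:g} to {high:g}")
```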

Warning

For the particular case of selection_type=4, each line of the above file must also provide the first configuration used for the training selection, as in the following example, where the first configuration is 10 for the class 01 and 1 for all the others, except the last class, for which it is 2.

120 614 425 10 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
110 22  15  1  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
120 22  15  1  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
130 22  15  2  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4

Database in descriptor space: writing outputs.

write_desc (logical)

Whether or not to write the local atomic descriptors. The data are written to the descDB directory. This option writes the descriptors for the local energy as well as for the atomic forces. To write only the descriptors of the local atomic environment, set desc_forces=.false.

Default write_desc=.false.

Note

Not the whole database is written: only the configurations requested in db_model.in are written.

desc_file_format (integer)

The type of descriptor file that is written if write_desc=.true. Depending on the value of desc_file_format, which can be 1, 2 or 3, the descriptor files written in descDB have the extension eml, csv or npz, respectively.

1 The files are named descDB/CC_KLM_XXXXXX.eml and contain a nat x (dim_desc + 1) matrix, where nat is the number of atoms in the corresponding atomic system and dim_desc is the dimension of the descriptor. The first column of the matrix gives the atomic id in the system (the same id as in the corresponding poscar) and the other dim_desc columns are the components of the descriptor.
2 The files are written in csv format as descDB/CC_KLM_XXXXXX.csv. The shape of the data is the same as for option 1.
3 The files are written in the binary npz format as descDB/CC_KLM_XXXXXX.npz. The shape of the data is the same as for option 1.

Default desc_file_format=1
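The nat x (dim_desc + 1) layout can be split into atomic ids and descriptor components as in the Python sketch below. It assumes that the eml files are plain text with whitespace-separated columns and that the csv files are comma-separated, which is not stated above; the function name `read_descriptor_rows` and the sample values are hypothetical.

```python
def read_descriptor_rows(lines, delimiter=None):
    """Split each row into an atomic id and its descriptor components.

    delimiter=None splits on whitespace (assumed eml layout);
    delimiter="," handles the csv variant.
    """
    atom_ids, descriptors = [], []
    for line in lines:
        if not line.strip():           # skip blank lines
            continue
        fields = line.split(delimiter)
        atom_ids.append(int(float(fields[0])))              # first column: atomic id
        descriptors.append([float(x) for x in fields[1:]])  # dim_desc columns
    return atom_ids, descriptors

# Invented data: nat = 2 atoms, dim_desc = 3.
sample = ["1 0.10 0.20 0.30", "2 0.40 0.50 0.60"]
ids, desc = read_descriptor_rows(sample)
print(ids, len(desc[0]))  # [1, 2] 3
```

For real files the same function could be fed the lines of an open file object, for example `read_descriptor_rows(open(path))`.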

Note

npz files are about 10 times smaller than eml or csv files. However, problems can sometimes occur when the files are generated on one computer and read on another.