Databases
Database manipulation
- db_file (character(len=80))
-
The file that defines the database chosen for the potential fitting. Each line has the syntax:
class KLM number_of_files number_of_selected_files
Default
"db_model_in"
- db_path (character(len=60))
-
Path to the directory where the database .poscar files are located. With the default value, the .poscar files are expected in the DB directory.
Default
"./DB/"
.
- selection_type (integer)
-
Inside each class and KLM subdatabase mentioned in db_file:
  - 1 selects the first “ns” elements of the database;
  - 2 selects the last “ns” elements of the database;
  - 3 randomly selects “ns” subsets of “kelem” elements of the database;
  - 4 selects the first “ns” subsets of “kelem” elements of the database, from a starting configuration defined in the db_file.
With the new type of input, using db_file, “ns” is given by number_of_selected_files for the corresponding class and KLM.
Default
3
.
- seed (integer)
-
Seed for random number generator.
Default
11
.
- iread_energy (integer)
-
This option sets which energy from the database is taken as the target by MiLaDy. Each .poscar file has three energies in its first line, as described in Section Database files. A value of 1, 2 or 3 chooses the first, second or third energy, respectively.
Default
2
.
- ref_energy_per_element (character(len=80))
-
This option sets the reference energy of each species. These reference energies are useful when we intend to apply a shift to the total energy read by MiLaDy from the .poscar files. This change of reference can be useful in many situations, of which we give two examples: (i) when we want to reduce the absolute value of the DFT total energy by a rigid shift (some ab initio codes provide large numbers for the total energy), since large numbers can induce numerical instabilities in the fitting procedure; (ii) when, for physical reasons, we want clean atomization energies for molecules, such that the energy of atoms separated at infinity is zero. There are many other situations in which this rigid shift can be useful.
How does it work? Let us take the case of a target energy \(E_{\textrm{DFT}}\) read by MiLaDy using the option iread_energy. We suppose that this system has N atoms distributed over S species, with \(n_1\), \(n_2\), \(\ldots\), \(n_S\) atoms of each species. If each species has a reference energy \(E_{\textrm{ref},s}\) with \(s=1, \ldots, S\), then the MiLaDy target energy is given by the equation:
\[E_{\textrm{target}} = E_{\textrm{DFT}} - \sum_{s=1}^{S} n_s E_{\textrm{ref},s}\]
The number of values provided by ref_energy_per_element should be equal to fix_no_of_elements (this option is described in Atomic systems), otherwise MiLaDy ends with a fatal error. For example, ref_energy_per_element=" -3.d0 2.12d0 -1.d0" provides three values, -3.d0, 2.12d0 and -1.d0, for the species 1, 2 and 3, respectively.
Default
ref_energy_per_element="0.d0"
.
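The energy shift defined for ref_energy_per_element can be illustrated with a short Python sketch. This is not MiLaDy code: the function names target_energy and parse_fortran_reals are hypothetical and only mirror the equation for the target energy, assuming the reference energies are given as a string of Fortran-style reals.

```python
def parse_fortran_reals(text):
    """Convert a string such as "-3.d0 2.12d0 -1.d0" into Python floats.

    Fortran double-precision literals use 'd' as the exponent marker,
    which Python's float() does not accept, so it is replaced by 'e'.
    """
    return [float(token.lower().replace("d", "e")) for token in text.split()]


def target_energy(e_dft, atoms_per_species, ref_energy_per_species):
    """Return E_target = E_DFT - sum_s n_s * E_ref,s.

    Hypothetical helper: raises ValueError when the number of reference
    energies differs from the number of species.
    """
    if len(atoms_per_species) != len(ref_energy_per_species):
        raise ValueError("one reference energy is required per species")
    shift = sum(n * e_ref
                for n, e_ref in zip(atoms_per_species, ref_energy_per_species))
    return e_dft - shift


# Illustrative numbers only: 4 atoms over 3 species (2, 1 and 1 atoms).
refs = parse_fortran_reals("-3.d0 2.12d0 -1.d0")
shifted = target_energy(-10.0, [2, 1, 1], refs)
```

With these illustrative numbers the shift is 2(-3) + 2.12 - 1 = -4.88, so the shifted energy is -10 + 4.88.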
Database files
Database file format
Database files for MiLaDy are stored in the .poscar format.
Besides the standard information (cell vectors, number of atoms, atomic coordinates and forces) that is usually included in .poscar, our database files also contain explicit information about the chemical compound and the energy of the system in the first line, as well as the stress tensor (the six independent components in the \(\sigma_{xx}\), \(\sigma_{yy}\), \(\sigma_{zz}\), \(\sigma_{yz}\), \(\sigma_{xz}\), \(\sigma_{xy}\) order, as provided by VASP) and the ISPIN tag at the end of the file.
The energies, forces and stress are provided in eV, eV/Å and eV/Å\(^{3}\), respectively.
An example of a typical database .poscar file is reported below.
111 1 Fe 26 -15.7255500 0.7884238 0 # EFS-tag n element mass E_1 E_2 E_3
1.00000000 # unit = 1Å
2.63475324 0.00000000 0.00000000 # cell vectors
0.00000000 2.63475324 0.00000000
0.00000000 0.00000000 2.63475324
2 # number of atoms
Cartesian
0.00000000 0.00000000 0.00000000 # atomic positions
1.31737662 1.31737662 1.31737662
# empty line
0.00000000 0.00000000 0.00000000 # forces
0.00000000 0.00000000 0.00000000
# empty line
-0.42315918 -0.42315918 -0.42315918 0 0 0 # stress
# empty line
2 # ISPIN tag: 2 - magnetic, 1 - non magnetic; 0 - not known
The first line of this file (treated as a comment by VASP) indicates that the file contains information about energy (E=1 in EFS), forces (F=1 in EFS) and stress (S=1 in EFS); that the system is built from 1 chemical element, which is Fe with atomic mass 26; that the total energy of the system is -15.7255500; that the target energy value for training (\(E^{tot}-E^{ref}_{1}\)) is 0.7884238; and that the alternative target value (\(E^{tot}-E^{ref}_{2}\)) is 0.
In this example, the reference energy \(E^{ref}_{1}\) is the total energy of a perfect crystal.
For systems that contain more than one chemical element (alloys, oxides, etc.), the structure of the first and sixth lines changes slightly. For instance, the first line for Fe\(_{3}\)C cementite with known energies and forces (but no stress tensor) is:
110 2 Fe 26 C 12 -15.7255500 0.7884238 0
and the sixth line for the 4-atom Fe\(_{3}\)C system is
3 1
Database files in this format can be used directly as input configurations for calculations in VASP and for structure visualization in OVITO. Simply renaming the .poscar files to POSCAR also allows their visualization in VESTA.
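The layout of the first line can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the parser used by MiLaDy: the name parse_poscar_header and the returned dictionary keys are hypothetical, and the line is assumed to hold the EFS tag, the number of elements n, then n (element, number) pairs, then the three energies, with anything after # treated as a comment.

```python
def parse_poscar_header(line):
    """Split the first line of a database .poscar file into its fields."""
    tokens = line.split("#", 1)[0].split()  # drop the trailing comment
    efs_tag = tokens[0]
    if len(efs_tag) != 3 or any(c not in "01" for c in efs_tag):
        raise ValueError("EFS tag must be three characters, each 0 or 1")
    n_elements = int(tokens[1])
    expected = 2 + 2 * n_elements + 3  # tag, n, n pairs, three energies
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} fields, found {len(tokens)}")
    # One (symbol, number) pair per chemical element.
    species = [(tokens[2 + 2 * i], float(tokens[3 + 2 * i]))
               for i in range(n_elements)]
    energies = tuple(float(t) for t in tokens[2 + 2 * n_elements:])
    return {
        "has_energy": efs_tag[0] == "1",
        "has_forces": efs_tag[1] == "1",
        "has_stress": efs_tag[2] == "1",
        "species": species,
        "energies": energies,  # (E_1, E_2, E_3)
    }


# The two header lines shown in this section.
fe = parse_poscar_header("111 1 Fe 26 -15.7255500 0.7884238 0 # EFS-tag n ...")
fe3c = parse_poscar_header("110 2 Fe 26 C 12 -15.7255500 0.7884238 0")
```

The option iread_energy would then correspond to picking index 0, 1 or 2 of the energies tuple.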
The case of calculations for the descriptors without derivatives, i.e. desc_forces=.false.
In this particular case only the positions of the atoms and the box information are read. Any information about forces, spin, etc. is ignored and is not compulsory.
Other file formats. Some of the files can be stored in the binary format .traj, generated and read by the Atomic Simulation Environment (ASE). The data can then be extracted to the database .poscar format with the Python script extract_traj.py, which is provided together with the .traj files.
Conversion of the .poscar DB files (compatible with MiLaDy) into the extended .xyz format can be performed using DB_poscar2xyz.py.
The inverse conversion from .xyz to .poscar can be done with DB_xyz2poscar.py.
Database file names
The database files for MiLaDy are generally named CCKLMXXXXXX.poscar. In this notation, the class CC is defined by a number that can vary from 01 to 99. This part of the file name indicates a physical property that can be derived from these files (e.g., elasticity, point defects, etc.). The class is directly linked to the characteristics of the system that should be fitted (E=energy, F=forces, S=stress). Thus, for example, for the classes 01 and 02, corresponding to equations of state (EOS) and elasticity, the energies and stress (ES) are the important quantities to fit (forces are equal to zero in this case), while for the class 04, with Generalized Stacking Faults (GSF), only the energies are of interest.
The KLM notation in the file names describes the type of the system, which includes the composition, the structure and the source of the database. The first index K indicates a material (composition + structure). For instance, bcc Fe corresponds to K=1, hcp Fe to K=2, bcc W to K=3, etc.
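As an illustration of the naming scheme, the following Python sketch splits a file name into its CC, KLM and XXXXXX parts. It is a sketch under assumptions rather than part of MiLaDy: the function name parse_db_file_name is hypothetical, and the name is assumed to consist of exactly two digits for CC, three digits for KLM and six digits for XXXXXX.

```python
import re
from pathlib import Path

# Assumed layout: 2 digits (CC) + 3 digits (KLM) + 6 digits (XXXXXX).
_DB_NAME = re.compile(r"^(?P<cc>\d{2})(?P<klm>\d{3})(?P<index>\d{6})\.poscar$")


def parse_db_file_name(path):
    """Return the class, KLM code, material index K and file index."""
    match = _DB_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"not a CCKLMXXXXXX.poscar name: {path}")
    cc, klm = match["cc"], match["klm"]
    if cc == "00":
        raise ValueError("the class CC ranges from 01 to 99")
    return {
        "class": cc,                  # e.g. "01" for equations of state
        "klm": klm,                   # composition, structure and source
        "K": int(klm[0]),             # material index, e.g. 1 for bcc Fe
        "index": int(match["index"]),
    }


# Hypothetical file name built from the class 01 and KLM 120 used in this section.
info = parse_db_file_name("./DB/01120000010.poscar")
```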
Input file db_model.in
The input file db_model.in provides a detailed summary of the database that will be used to fit an ML potential in MiLaDy. Each line in the file stands for one category of the database, identified by its class CC and KLM. After providing the relevant CC and KLM attributes, one should indicate the total number of files in the database belonging to this category and how many of them should be used for training the potential. The rest of the files will be used for testing.
For each category listed in db_model.in, one can independently define its EFS tag in the form of three T or F flags, standing for energies, forces and stress, respectively. The final EFS fitting scheme for a given system is a superposition of the EFS provided in db_model.in and of the EFS-tag in the first line of the .poscar file (see Section 4.1). For instance, if the EFS-tag in the .poscar file is 110 and the tag in db_model.in is TFF, the fit will be performed only for the energies.
The last six numbers in each line define the three ranges \([w^{min}_E, w^{min}_F, w^{min}_S], [w^{max}_E, w^{max}_F, w^{max}_S]\) within which the regression weights will be varied for energies, forces and stress, respectively. The search for the optimum regression errors is performed using an evolutionary algorithm. In the example of db_model.in below, the weights are set to vary between 1e2 and 1e6 for energy, 1e1 and 1e2 for forces, and 1e2 and 1e4 for stress. Setting all six values to 1.e0 results in a simple fit without regression weights.
01 120 614 425 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
02 110 22 15 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
02 120 22 15 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
02 130 22 15 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
Warning
For the particular case of selection_type=4, the above file should also provide the first configuration used for the training selection, as in the following example, where the first configuration is 10 for the class 01 and 1 for all the others, except the last line, for which it is 2.
01 120 614 425 10 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
02 110 22 15 1 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
02 120 22 15 1 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
02 130 22 15 2 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
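The two line layouts of db_model.in shown in this section can be read with a short Python sketch. It is an illustration only, not the reader implemented in MiLaDy: the names parse_db_model_line and DbModelEntry are hypothetical, and the layout is assumed to be exactly 13 whitespace-separated fields, or 14 when the first configuration for selection_type=4 is present.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DbModelEntry:
    db_class: str                   # CC
    klm: str                        # KLM
    n_files: int                    # total number of files in the category
    n_selected: int                 # number of files used for training
    first_config: Optional[int]     # only present for selection_type=4
    fit_efs: Tuple[bool, bool, bool]
    w_min: Tuple[float, float, float]   # lower bounds for (E, F, S) weights
    w_max: Tuple[float, float, float]   # upper bounds for (E, F, S) weights


def parse_db_model_line(line):
    """Parse one db_model.in line in either of the two layouts."""
    tokens = line.split()
    if len(tokens) == 13:
        first_config, rest = None, tokens[4:]
    elif len(tokens) == 14:
        first_config, rest = int(tokens[4]), tokens[5:]
    else:
        raise ValueError(f"expected 13 or 14 fields, found {len(tokens)}")
    flags = rest[:3]
    if any(f not in ("T", "F") for f in flags):
        raise ValueError("the EFS flags must be T or F")
    # Accept both 1.e2 and Fortran-style 1.d2 exponents.
    weights = [float(w.lower().replace("d", "e")) for w in rest[3:]]
    return DbModelEntry(
        db_class=tokens[0],
        klm=tokens[1],
        n_files=int(tokens[2]),
        n_selected=int(tokens[3]),
        first_config=first_config,
        fit_efs=tuple(f == "T" for f in flags),
        w_min=tuple(weights[:3]),
        w_max=tuple(weights[3:]),
    )


plain = parse_db_model_line("02 110 22 15 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4")
with_start = parse_db_model_line(
    "01 120 614 425 10 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4")
```

For the first category of the example, n_files - n_selected gives the 189 files left for testing.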
Database in descriptor space: writing outputs.
- write_desc (logical)
-
Whether or not to write the local atomic descriptors. The data will be written in the directory descDB. This option writes the descriptors for the local energy as well as for the atomic forces. In order to write only the descriptors of the local atomic environment, set desc_forces=.false.
Default
write_desc=.false.
Note
Not the whole database is written: only the configurations requested in db_model.in are written.
- desc_file_format (integer)
-
The type of descriptor file that is written if write_desc=.true. Depending on the value of desc_file_format, which can be 1, 2 or 3, the descriptor files written in descDB have the extension eml, csv or npz, respectively.
  - 1 The file names are of the form descDB/CC_KLM_XXXXXX.eml and the content is a nat x (dim_desc + 1) matrix, where nat is the number of atoms in the corresponding atomic system and dim_desc is the dimension of the descriptor. The first column of the matrix gives the atomic id in the system (the same id as in the corresponding poscar) and the other dim_desc columns are the components of the descriptor.
  - 2 The files are written in csv format as descDB/CC_KLM_XXXXXX.csv. The shape of the data is the same as for option 1.
  - 3 The files are written in the binary npz format as descDB/CC_KLM_XXXXXX.npz. The shape of the data is the same as for option 1.
Default
desc_file_format=1
Note
npz files are about 10 times smaller than eml or csv files. However, be aware that problems can sometimes occur when the files are generated on one computer and then read on another.
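The nat x (dim_desc + 1) layout described for desc_file_format can be unpacked with a few lines of Python. This is a sketch under assumptions, not a documented reader: the function name read_descriptor_rows is hypothetical, and the csv files are assumed to be comma-separated with no header row, which should be checked against an actual file in descDB.

```python
import csv
import io


def read_descriptor_rows(handle, delimiter=","):
    """Split a nat x (dim_desc + 1) table into atomic ids and descriptors.

    The first column is taken as the atomic id and the remaining dim_desc
    columns as the descriptor components of that atom.
    """
    ids, descriptors = [], []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        values = [float(x) for x in row]
        ids.append(int(values[0]))
        descriptors.append(values[1:])
    if len({len(d) for d in descriptors}) > 1:
        raise ValueError("all rows must have the same descriptor dimension")
    return ids, descriptors


# Made-up content for a 2-atom system with dim_desc = 2.
sample = io.StringIO("1,0.5,0.25\n2,0.1,0.2\n")
atom_ids, desc = read_descriptor_rows(sample)
```

For a file on disk, the same function can be called on a handle opened with open(path, newline="").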