Databases
#########

.. _`sec:database`:

Database manipulation
=====================

.. option::  db_file (character(len=80))

   The file that describes the database
   chosen for fitting the potential. Each line has the syntax: ``class KLM number_of_files number_of_selected_files``

   Default ``"db_model_in"``

.. option::  db_path (character(len=60))

   Path to the directory where the ``.poscar`` database files are
   located. With the default value, the ``.poscar`` files are expected in the
   ``DB`` directory.

   Default ``"./DB/"``.

.. option::  selection_type (integer)

   Selection rule applied inside each ``class`` and ``KLM`` sub-database listed in ``db_file``:

   #. ``1`` selects the first "ns" elements of the database;

   #. ``2`` selects the last "ns" elements of the database;

   #. ``3`` randomly selects "ns" subsets of "kelem" elements of the
      database;

   #. ``4`` selects the first "ns" subsets of "kelem" elements of the
      database, starting from a configuration defined in the
      ``db_file``.

   With the new type of input, which uses ``db_file``, "ns" is given
   by ``number_of_selected_files`` for each ``class`` and ``KLM`` entry.

   Default ``3``.

.. option::  seed (integer)

   Seed for random number generator.

   Default ``11``.
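
The selection modes ``1`` and ``2`` of ``selection_type``, together with the role of ``seed``, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ``MiLaDy`` implementation. The function name ``select_training`` is hypothetical. The random branch is a simplification of mode ``3`` that draws "ns" individual files and ignores "kelem". Python's random generator will not reproduce the configurations picked by ``MiLaDy`` for the same seed.

.. code-block:: python

   import random


   def select_training(files, ns, selection_type, seed=11):
       """Return the training subset of ``files`` (hypothetical helper)."""
       if not 0 <= ns <= len(files):
           raise ValueError("ns must lie between 0 and the number of files")
       if selection_type == 1:
           # First "ns" elements of the sub-database.
           return list(files[:ns])
       if selection_type == 2:
           # Last "ns" elements of the sub-database.
           return list(files[len(files) - ns:])
       if selection_type == 3:
           # Simplified random draw without replacement; a fixed seed makes
           # the selection reproducible from one run to the next.
           rng = random.Random(seed)
           return sorted(rng.sample(list(files), ns))
       raise ValueError("selection_type not covered by this sketch")


   # Made-up file labels standing in for one class/KLM sub-database.
   files = [f"conf_{i:02d}" for i in range(1, 11)]
   first = select_training(files, 3, 1)
   last = select_training(files, 3, 2)

Keeping the seed fixed is what makes a random training/test split repeatable between runs.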

.. option::  iread_energy (integer)

   This option selects which energy from the database is used as the target by ``MiLaDy``.
   The first line of each ``.poscar`` file contains three energies, as described in :ref:`Database files<db-format>`. The values ``1``, ``2`` and ``3``
   select the first, second or third energy, respectively.

   Default ``2``.

.. option::  ref_energy_per_element (character(len=80))

   This option sets the reference energy of each species.
   These reference energies are used to apply a shift to the total energy
   read by ``MiLaDy`` from the ``.poscar`` files.

   This change of reference can be useful in many situations, of which we give two examples:
   (i) to lower the absolute value of the DFT total energy by a rigid shift
   (some ab initio codes provide large numbers for the total energy, and those large numbers
   can induce numerical instabilities in the fitting procedure); (ii) for physical reasons,
   to obtain clean atomization energies for molecules, such that the energy of atoms separated
   at infinity is zero. There are many other situations in which this rigid shift can be useful.

   How does it work? Consider a target energy :math:`E_{\textrm{DFT}}` read by ``MiLaDy`` using the
   option ``iread_energy``. Suppose that this system has :math:`N` atoms distributed over :math:`S` species, with
   :math:`n_1`, :math:`n_2`, :math:`\ldots`, :math:`n_S` atoms of each species.
   If each species has a reference energy :math:`E_{\textrm{ref},s}`, with :math:`s=1, \ldots, S`, then the
   ``MiLaDy`` target energy is given by:

   .. math::

      E_{\textrm{target}} = E_{\textrm{DFT}} - \sum_{s=1}^{S} n_s E_{\textrm{ref},s}
   
   The number of values provided by ``ref_energy_per_element`` must be equal to ``fix_no_of_elements`` (this option is described in
   :ref:`Atomic systems <sec:atomicsys>`); otherwise ``MiLaDy`` stops with a fatal error. For example,
   ``ref_energy_per_element=" -3.d0 2.12d0 -1.d0"`` provides three
   values, ``-3.d0``, ``2.12d0`` and ``-1.d0``, for the species 1, 2 and 3, respectively.
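
   The shift defined by the equation above can be checked by hand with a few lines of Python.
   This is a minimal sketch, not ``MiLaDy`` code: the helper names ``parse_fortran_reals`` and
   ``target_energy`` are hypothetical, and the atom counts and the DFT energy are made-up values.

   .. code-block:: python

      def parse_fortran_reals(text):
          """Convert a string such as " -3.d0 2.12d0 -1.d0" to a list of floats."""
          return [float(tok.lower().replace("d", "e")) for tok in text.split()]


      def target_energy(e_dft, n_atoms, e_ref):
          """Return E_DFT - sum_s n_s * E_ref,s for one configuration."""
          if len(n_atoms) != len(e_ref):
              raise ValueError("one reference energy is required per species")
          return e_dft - sum(n * e for n, e in zip(n_atoms, e_ref))


      e_ref = parse_fortran_reals(" -3.d0 2.12d0 -1.d0")
      # Made-up system: 2, 1 and 1 atoms of species 1, 2 and 3, E_DFT = -10 eV.
      e_target = target_energy(-10.0, [2, 1, 1], e_ref)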


   Default ``ref_energy_per_element="0.d0"``.


Database files
==============

.. _db-format:

Database file format
--------------------

Database files for MiLaDy are stored in the ``.poscar`` format.

Besides the standard information (cell vectors, number of atoms, atomic
coordinates and forces) that is usually included in ``.poscar``, our
database files also contain explicit information about the chemical
compound and the energy of the system in the first line, as well as the
stress tensor (six independent components in the :math:`\sigma_{xx}`,
:math:`\sigma_{yy}`, :math:`\sigma_{zz}`, :math:`\sigma_{yz}`,
:math:`\sigma_{xz}`, :math:`\sigma_{xy}` order, as provided by VASP)
and the ISPIN tag at the end of the file.

The energies, forces and stress
are provided in eV, eV/Å and eV/Å\ :math:`^{3}`, respectively.

An
example of a typical database ``.poscar`` file is reported below.

.. code-block:: python
   :linenos:

   111 1 Fe 26 -15.7255500 0.7884238 0 # EFS-tag n element mass E_1 E_2 E_3
   1.00000000 # unit = 1Å
   2.63475324 0.00000000 0.00000000 # cell vectors
   0.00000000 2.63475324 0.00000000
   0.00000000 0.00000000 2.63475324
   2 # number of atoms
   Cartesian
   0.00000000 0.00000000 0.00000000 # atomic positions
   1.31737662 1.31737662 1.31737662
   # empty line
   0.00000000 0.00000000 0.00000000 # forces
   0.00000000 0.00000000 0.00000000
   # empty line
   -0.42315918 -0.42315918 -0.42315918 0 0 0 # stress
   # empty line
   2 # ISPIN tag: 2 - magnetic, 1 - non magnetic; 0 - not known


The first line of this file (treated as a comment by VASP) indicates
that the file contains information about energy (``E=1`` in ``EFS``),
forces (``F=1`` in ``EFS``) and stress (``S=1`` in ``EFS``); that the
system is built from ``1`` chemical element, which is ``Fe`` with atomic
mass ``26``; that the total energy of the system is ``-15.7255500``; that the target
energy value for training (:math:`E^{tot}-E^{ref}_{1}`) is ``0.7884238``;
and that the alternative target value (:math:`E^{tot}-E^{ref}_{2}`) is ``0``.
In this example, the reference energy :math:`E^{ref}_{1}` is the total
energy of a perfect crystal.

For systems which contain more than one chemical element (alloys,
oxides *etc.*), the structure of the first and sixth lines
changes slightly. For instance, the first line for Fe\ :math:`_{3}`\ C
cementite with known energies and forces (but no stress tensor) is:


.. code-block:: python

   110 2 Fe 26 C 12 -15.7255500 0.7884238 0

and the sixth line for the 4-atom Fe\ :math:`_{3}`\ C system is

.. code-block:: python

   3 1
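
The layout of the first line described above can be read with a short Python function. This is a hedged sketch rather than the parser used by ``MiLaDy``: the name ``parse_header`` and the returned dictionary keys are hypothetical, and the ``iread_energy`` argument simply mimics the option of the same name by picking one of the three energies.

.. code-block:: python

   def parse_header(line, iread_energy=2):
       """Split the first line of a database ``.poscar`` file into its fields."""
       tokens = line.split("#")[0].split()  # drop any trailing comment
       efs_tag = tokens[0]
       n_species = int(tokens[1])
       # One (symbol, mass) pair per chemical element.
       pairs = tokens[2:2 + 2 * n_species]
       species = [(pairs[i], float(pairs[i + 1])) for i in range(0, len(pairs), 2)]
       # The three energies follow the element list.
       energies = [float(t) for t in tokens[2 + 2 * n_species:5 + 2 * n_species]]
       if len(energies) != 3:
           raise ValueError("expected three energies in the header line")
       return {
           "efs": tuple(c == "1" for c in efs_tag),
           "species": species,
           "energies": energies,
           "target": energies[iread_energy - 1],
       }


   header = parse_header("110 2 Fe 26 C 12 -15.7255500 0.7884238 0")

With the default ``iread_energy=2``, ``header["target"]`` holds the second energy of the line.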

The database files of this format can be directly used as input
configurations for calculations in VASP and for structure visualization in
OVITO. Simply renaming the ``.poscar`` files to ``POSCAR`` also allows
their visualization in VESTA.

**The case of descriptor calculations without derivatives,
i.e.** ``desc_forces=.false.`` In this particular case, only the
positions of the atoms and the box information are read. Any information
about forces, spin, etc. is ignored and is not compulsory.

**Other file formats.** Some of the files can be stored in the binary format
``.traj``, generated and read by the `Atomic Simulation
Environment <https://wiki.fysik.dtu.dk/ase/>`__ (ASE). The data can
then be extracted to the database ``.poscar`` format with the Python script
``extract_traj.py``, which is provided together with the ``.traj`` files.
Conversion of the ``.poscar`` DB files (compatible with MiLaDy) into the
extended ``.xyz`` format can be performed using ``DB_poscar2xyz.py``.
The inverse conversion from ``.xyz`` to ``.poscar`` can be done with
``DB_xyz2poscar.py``.

.. _`sec:dbnames`:

Database file names
-------------------

The database files for MiLaDy are generally named as
``CCKLMXXXXXX.poscar``. In this notation, the *class* ``CC`` is defined
by a number that can vary from 01 to 99. This part of the file name
indicates a physical property that can be derived from these files
(*e.g.*, elasticity, point defects, *etc.*). The class is directly
linked to the characteristics of the system that should be fitted
(E=energy, F=forces, S=stress). For example, for the classes ``01``
and ``02``, corresponding to equations of state (EOS) and elasticity, the
energies and stress (ES) are the important quantities to fit (forces are
equal to zero in this case), while for the class ``04``, with Generalized
Stacking Faults (GSF), only the energies are of interest.

The ``KLM`` notation in the file names describes the *type* of the system,
which includes the composition, the structure and the source of the database. The
first index ``K`` indicates the material (composition + structure). For
instance, bcc Fe corresponds to ``K``\ =1, hcp Fe to ``K``\ =2 and bcc W
to ``K``\ =3, *etc.*
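
The naming scheme can be decoded with a regular expression, as in the Python sketch below. It assumes that ``CC``, ``KLM`` and ``XXXXXX`` are made of exactly two, three and six digits; this is an assumption read off the notation, not a documented rule. The helper name ``parse_db_name`` and the example file name are hypothetical.

.. code-block:: python

   import re

   # Assumed layout: 2-digit class, 3-digit KLM, 6-digit index.
   NAME_PATTERN = re.compile(r"^(\d{2})(\d{3})(\d{6})\.poscar$")


   def parse_db_name(filename):
       """Split a database file name into class, KLM and configuration index."""
       match = NAME_PATTERN.match(filename)
       if match is None:
           raise ValueError(f"unexpected database file name: {filename}")
       cc, klm, index = match.groups()
       return {"class": cc, "K": int(klm[0]), "KLM": klm, "index": int(index)}


   # Made-up file name: class 01, KLM 120, configuration 42.
   info = parse_db_name("01120000042.poscar")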

.. _`sec:db-model`:

Input file db_model.in
----------------------

The input file ``db_model.in`` provides a detailed summary of the
database which will be used to fit an ML potential in MiLaDy. Each line
in the file stands for one category of the database, identified by its
class ``CC`` and its ``KLM``. After providing the relevant ``CC`` and
``KLM`` attributes, one should indicate the total number of files in the
database belonging to this category and how many of them should be used
for training the potential. The rest of the files will be used for
testing.

For each category listed in ``db_model.in``, one can independently
define its ``EFS`` tag as three flags, ``T`` or ``F``, standing for
energies, forces and stress, respectively. The final EFS fitting scheme
for a given system is a superposition of the EFS flags provided in
``db_model.in`` and of the ``EFS`` tag in the first line of the
``.poscar`` file (see :ref:`Database file format <db-format>`). For instance, if
the ``EFS`` tag in the ``.poscar`` file is ``110`` and the flags in
``db_model.in`` are ``TFF``, the fit will be performed only for the energies.
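
One way to read this superposition is as a logical AND of the two sets of flags, which reproduces the ``110`` and ``TFF`` example above. The Python sketch below rests on that interpretation, which is an assumption and is not taken from the ``MiLaDy`` source; the name ``combine_efs`` is hypothetical.

.. code-block:: python

   def combine_efs(poscar_tag, model_flags):
       """Combine a ``.poscar`` EFS tag ("110") with ``db_model.in`` flags ("TFF")."""
       from_file = [c == "1" for c in poscar_tag]
       from_model = [c.upper() == "T" for c in model_flags]
       # Assumption: a quantity is fitted only if both sources enable it.
       return dict(zip("EFS", (a and b for a, b in zip(from_file, from_model))))


   fit = combine_efs("110", "TFF")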

The last six numbers in each line define the lower bounds
:math:`[w^{min}_E, w^{min}_F, w^{min}_S]` followed by the upper bounds
:math:`[w^{max}_E, w^{max}_F, w^{max}_S]` of the three ranges
within which the regression weights will be varied for energies, forces
and stress, respectively. The search for the optimum regression errors is
performed using an evolutionary algorithm. In the example of the
``db_model.in`` below, the weights are set to vary between ``1e2`` and
``1e6`` for energy, ``1e1`` and ``1e2`` for forces and ``1e2`` and
``1e4`` for stress. Setting ``1.e0`` everywhere will result in a simple
fit without regression weights.

.. code-block::

   01 120 614 425 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
   02 110 22  15  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
   02 120 22  15  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
   02 130 22  15  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4


.. warning::

   For the particular case of ``selection_type=4``, the above
   file should also provide the first configuration used for the training
   selection, as in the following example, where the first configuration
   is ``10`` for the class ``01`` and ``1`` for all the others, except
   the last entry, for which it is ``2``.

   .. code-block::

      01 120 614 425 10 T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
      02 110 22  15  1  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
      02 120 22  15  1  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4
      02 130 22  15  2  T F T 1.e2 1.e1 1.e2 1.e6 1.e2 1.e4


Database in descriptor space: writing outputs
=============================================

.. option::  write_desc (logical)

   Whether or not to write the local atomic descriptors. The data will be written in the
   directory ``descDB``. This option writes the descriptors for the local energy as well as for the
   atomic forces.
   In order to write only the descriptors of the local atomic environment, set
   ``desc_forces=.false.``

   Default ``write_desc=.false.``

.. note::

      Not all the database is written: only the configurations
      requested in ``db_model.in`` are written.

.. option::  desc_file_format (integer)

   The type of descriptor file that is written if ``write_desc=.true.``. Depending on the
   value of ``desc_file_format``, which can be ``1``, ``2`` or ``3``, the descriptor files written in
   ``descDB`` have the extension ``eml``, ``csv`` or ``npz``, respectively.

   - ``1`` The files are named ``descDB/CC_KLM_XXXXXX.eml`` and contain
     a ``nat x (dim_desc + 1)`` matrix, where ``nat`` is the number of atoms in the
     corresponding atomic system and ``dim_desc`` is the dimension of the descriptor.
     The first column of the matrix gives the atomic id in the system
     (the same id as in the corresponding ``poscar``) and the other ``dim_desc`` columns are the
     components of the descriptor.

   - ``2`` The files are written in ``csv`` format in ``descDB/CC_KLM_XXXXXX.csv``.
     The shape of the data is the same as for option ``1``.

   - ``3`` The files are written in the binary ``npz`` format in ``descDB/CC_KLM_XXXXXX.npz``.
     The shape of the data is the same as for option ``1``.
   
   Default ``desc_file_format=1``
   
.. note::

      ``npz`` files are about 10 times smaller than ``eml`` or ``csv`` files. However, note that
      problems can sometimes occur when the files are generated on one computer and then read on another
      computer.
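
The ``nat x (dim_desc + 1)`` layout can be split into atomic ids and descriptor components with the Python standard library, as sketched below for the ``csv`` case. The sketch assumes comma-separated values without a header line, which is not stated above and should be checked against an actual file. The function name ``read_descriptor_csv`` is hypothetical and the sample content is made up.

.. code-block:: python

   import csv
   import io


   def read_descriptor_csv(handle):
       """Return (atomic ids, descriptor rows) from a ``nat x (dim_desc + 1)`` table."""
       ids, descriptors = [], []
       for row in csv.reader(handle):
           if not row:
               continue  # skip blank lines
           ids.append(int(float(row[0])))  # first column: atomic id
           descriptors.append([float(x) for x in row[1:]])  # remaining columns
       return ids, descriptors


   # Made-up content standing in for a descDB/CC_KLM_XXXXXX.csv file
   # with nat = 2 atoms and dim_desc = 3.
   sample = io.StringIO("1,0.10,0.20,0.30\n2,0.40,0.50,0.60\n")
   ids, descriptors = read_descriptor_csv(sample)

For a file on disk, the same function can be called on a handle opened with ``open(path, newline="")``.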