Metadata-Version: 2.1
Name: visual-graph-datasets
Version: 0.7.2
Summary: Datasets for the training of graph neural networks (GNNs) and subsequent visualization of attributional explanations of XAI methods
License: MIT
Keywords: graph neural networks,datasets,explainable AI
Author: awa59kst120df
Author-email: awa59kst120df@gmail.com
Maintainer: awa59kst120df
Maintainer-email: awa59kst120df@gmail.com
Requires-Python: >=3.8
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: cairosvg (>=2.5.0)
Requires-Dist: click (>=7.1.2)
Requires-Dist: imageio (>=2.22.4)
Requires-Dist: jinja2 (>=3.0.3)
Requires-Dist: matplotlib (>=3.5.3)
Requires-Dist: networkx (>=2.8.8)
Requires-Dist: numpy (>=1.23.2)
Requires-Dist: orjson (>=3.8.1)
Requires-Dist: poetry-bumpversion (>=0.3.0)
Requires-Dist: psutil (>=5.7.2)
Requires-Dist: pycomex (>=0.6.1)
Requires-Dist: pytest (>=7.2.0)
Requires-Dist: pyyaml (>=0.6.0)
Requires-Dist: rdkit (>=2022.9.0)
Description-Content-Type: text/x-rst

|made-with-python| |python-version| |os-linux|

.. |os-linux| image:: https://img.shields.io/badge/os-linux-orange.svg
   :target: https://www.python.org/

.. |python-version| image:: https://img.shields.io/badge/Python-3.8.0-green.svg
   :target: https://www.python.org/

.. |made-with-kgcnn| image:: https://img.shields.io/badge/Made%20with-KGCNN-blue.svg
   :target: https://github.com/aimat-lab/gcnn_keras

.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
   :target: https://www.python.org/

=====================
Visual Graph Datasets
=====================

This package manages a collection of datasets intended primarily for training graph neural networks.
Each dataset is represented by one *folder*. Inside that folder, each element of the dataset is
represented by *two* files: (1) a metadata JSON file which contains the full graph representation as
well as additional metadata such as the canonical index and the target value to be predicted, and
(2) a PNG image file which shows a domain-specific illustration of the graph
(for example, a drawing of the molecule for chemical datasets). These visualizations make it easy
to display the results of attributional graph XAI methods, which assign importance values to each
node and edge of the original input graph.
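
The folder layout described above can be explored with nothing but the standard library. The following
is a minimal sketch, not part of the package: ``pair_element_files`` is a hypothetical helper, and it
*assumes* that the two files of an element share the same file name stem (for example ``0.json`` and
``0.png``), which may differ from the actual naming scheme.

.. code-block:: python

    import json
    import tempfile
    from pathlib import Path


    def pair_element_files(dataset_dir):
        """Group the files of a dataset folder into (metadata, image) pairs.

        ASSUMPTION: both files of an element share the same file name stem.
        Metadata files without a matching image are skipped.
        """
        dataset_dir = Path(dataset_dir)
        pairs = {}
        for json_path in sorted(dataset_dir.glob("*.json")):
            png_path = json_path.with_suffix(".png")
            if png_path.exists():
                pairs[json_path.stem] = (json_path, png_path)
        return pairs


    # Build a tiny stand-in dataset folder so the sketch runs without a download.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for name in ("0", "1"):
            (folder / f"{name}.json").write_text(json.dumps({"index": int(name)}))
            (folder / f"{name}.png").write_bytes(b"")  # empty placeholder, not a real image
        (folder / "2.json").write_text("{}")  # no matching image, so it is skipped

        pairs = pair_element_files(folder)
        print(sorted(pairs))

In practice the loading function shown in the Usage section should be preferred, since it also parses
the metadata files.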

Motivation
==========

Datasets are usually packaged as compactly as possible.
Chemical graph datasets, for example, are usually distributed as CSV files which contain only the
index, a SMILES representation of the molecule and the target value, looking something like this:

.. code-block:: csv

    index, smiles, value
    0, ccc, 0.24
    1, ccc, 0.52
    2, ccc, 1.77


This has the major advantage that even large datasets will have file sizes of only a few MB. These files are
easy to download online and easy to store. The disadvantage however is that these files need to be processed
to be usable to train graph neural networks (GNNs): The encoded SMILES representation first has to be
transformed into a graph representation where node and edge features have to be generated by some kind of
chemical pre-processor. Instead of putting the major storage and bandwidth requirements on the user, this
puts the major processing requirements on the user. Additionally, this method places a greater burden on the
visualization step of generated explanations.

Ultimately we decided to put the burden of downloading larger amounts of data on the user a
single time in exchange for simplifying and reducing the burden of pre-processing and
data visualization for *each* training process.

Additionally, by distributing both canonical indexing and canonical visualizations we aim to make
explanation results more comparable in the future.

Installation
============

First clone this repository:

.. code-block:: console

    git clone https://github/username/visual_graph_datasets.git

Then install it like this:

.. code-block:: console

    cd visual_graph_datasets
    pip3 install -e .


Download datasets
-----------------

    **NOTE**: We *strongly* recommend storing datasets on an SSD instead of an HDD, as this can make a
    difference of multiple hours(!) when loading especially large datasets.

Datasets can simply be downloaded by name by using the ``download`` command:

.. code-block:: console

    # Example for the dataset 'rb_dual_motifs'
    python3 -m visual_graph_datasets.cli download "rb_dual_motifs"

By default the dataset will be downloaded into the folder ``$HOME/.visual_graph_datasets/datasets``,
where ``HOME`` is the current user's home directory.

The dataset download destination can be changed in a config file by using the ``config`` command:

.. code-block:: console

    python3 -m visual_graph_datasets.cli config

This command will open the config file at ``$HOME/.visual_graph_datasets/config.yaml`` using the system's
default text editor.

List available datasets
-----------------------

You can display a list of all the currently available datasets of the current remote file share provider
and some metadata information about them by using the command ``list``:

.. code-block:: console

    python3 -m visual_graph_datasets.cli list

Running the unit tests
----------------------

After installation you can optionally run the unit tests to confirm that all datasets have been correctly
downloaded and that everything works properly:

.. code-block:: console

    cd visual_graph_datasets
    pytest ./tests/*

Usage
=====

The datasets are mainly intended to be used in combination with other packages, but this package provides
some basic utilities to load and explore the datasets themselves within Python programs.

.. code-block:: python

    import os

    from visual_graph_datasets.config import Config
    from visual_graph_datasets.data import load_visual_graph_dataset

    # The function only needs the absolute path to the dataset folder and loads the entire dataset
    # from the files within that folder.
    # It returns two dictionaries. The first maps the string names of the elements to their content
    # dictionaries and the second maps the integer indices of the elements to the very same content
    # dictionaries. Both are returned because different situations call for different ways of
    # accessing the elements.
    dataset_path = os.path.join(Config().get_datasets_path(), 'rb_dual_motifs')
    data_name_map, data_index_map = load_visual_graph_dataset(dataset_path)

Each content dictionary (the values of the two dicts returned by the function) has the
following nested structure:

- ``image_path``: The absolute path to the image file that visualizes this element
- ``metadata_path``: The absolute path to the metadata file
- ``metadata``: A dict which contains all the metadata for that element

  - ``value``: The target value for the element, which can be a single value (usually for regression) or
    a one-hot vector for classification.
  - ``index``: The canonical index of this element within the dataset
  - (``split``): If defined, either "train" or "test", the assignment in the canonical train-test split
  - ``graph``: A dictionary which contains the entire graph representation of this element.

    - ``node_attributes``: tensor of shape (V, N)
    - ``edge_attributes``: tensor of shape (E, M)
    - ``edge_indices``: tensor of shape (E, 2) which holds the tuples of integer node indices that
      determine the edges
    - ``node_coordinates``: tensor of shape (V, 2) which holds the xy position of each node in pixel
      values within the corresponding image visualization of the element. This is the crucial
      information which is required to use the existing image representations to visualize attributional
      explanations!

With the following variable definitions:

- V - the number of nodes in a graph
- E - the number of edges in a graph
- N - the number of node attributes / features associated with each node
- M - the number of edge attributes / features associated with each edge
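
To illustrate how ``node_coordinates`` connects the graph to its image, the following is a minimal
sketch of an overlay plot. It is not part of the package: ``plot_node_importances`` is a hypothetical
helper, the importance values stand in for the output of some XAI method, and the data is synthetic so
that the example runs without a downloaded dataset. It also *assumes* that the pixel coordinates use
the same top-left origin as ``imshow``.

.. code-block:: python

    import io

    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt
    import numpy as np


    def plot_node_importances(image, node_coordinates, node_importances):
        """Draw one semi-transparent marker per node on top of the element image.

        image: array of shape (H, W, 3) or (H, W, 4)
        node_coordinates: array of shape (V, 2) with xy pixel positions
        node_importances: array of shape (V,) with values in [0, 1] (hypothetical)
        """
        node_coordinates = np.asarray(node_coordinates, dtype=float)
        node_importances = np.asarray(node_importances, dtype=float)
        if node_coordinates.shape != (len(node_importances), 2):
            raise ValueError("expected one (x, y) position per importance value")

        fig, ax = plt.subplots(figsize=(4, 4))
        ax.imshow(image)
        # Color encodes the importance; the marker size is fixed.
        ax.scatter(
            node_coordinates[:, 0], node_coordinates[:, 1],
            s=200, c=node_importances, cmap="Reds", vmin=0.0, vmax=1.0, alpha=0.5,
        )
        ax.axis("off")
        return fig, ax


    # Synthetic stand-in for one element: a white image and three nodes.
    image = np.ones((100, 100, 3))
    node_coordinates = [[20, 30], [50, 50], [80, 70]]
    node_importances = [0.1, 0.9, 0.4]

    fig, ax = plot_node_importances(image, node_coordinates, node_importances)
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")

With a loaded dataset, the image could instead be read with ``plt.imread(element['image_path'])`` and
the positions taken from ``element['metadata']['graph']['node_coordinates']``, where ``element`` is one
of the content dictionaries described above.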


Datasets
========

Here is a list of the datasets currently included.

For more information about the individual datasets use the ``list`` command in the CLI (see above).

* rb_dual_motifs
* tadf


