numbox.core.variable
====================

Overview
++++++++

A framework for building Directed Acyclic Graphs (DAGs) in pure Python.
This module contains no JIT-compiled code and imports nothing from numba.
Computationally heavy parts can nevertheless be placed on the graph as
JIT-compiled functions via the `formula` key of the variable specifications (see below).

Modules
++++++++

numbox.core.variable.variable
-----------------------------

Overview
********

A graph can be defined as follows::

    from numbox.core.variable.variable import Graph

    def derive_x(y_):
        return 2 * y_

    def derive_a(x_):
        return x_ - 74

    def derive_u(a_):
        return 2 * a_

    x = {"name": "x", "inputs": {"y": "basket"}, "formula": derive_x}
    a = {"name": "a", "inputs": {"x": "variables1"}, "formula": derive_a}
    u = {"name": "u", "inputs": {"a": "variables1"}, "formula": derive_u}

    graph = Graph(
        variables_lists={
            "variables1": [x, a],
            "variables2": [u],
        },
        external_source_names=["basket"]
    )

Here we have the variable `y` sourced externally from the `basket`, and calculated variables
`x` and `a` in the `variables1` namespace, and `u` in the `variables2` namespace.

The dictionaries `x`, `a`, and `u` are called variable specifications.
On their own, these specs are agnostic about the namespace they are placed in.
The namespaces are assigned via the `variables_lists`
argument given to the `Graph` at initialization time.

The full and unambiguous way to denote the variables is via their qualified
names, applicable both to externally sourced variables, `basket.y`, as well as
the calculated ones, `variables1.x`,
`variables1.a`, `variables2.u`.
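
As a rough illustration of the naming scheme only, a qualified name can be thought of as
a namespace (or source) name and a variable name joined by a dot. The helper `split_qualified`
below is hypothetical and not part of numbox::

    def split_qualified(qualified_name):
        """Split 'namespace.name' into a (namespace, name) tuple."""
        namespace, _, name = qualified_name.partition(".")
        if not namespace or not name:
            raise ValueError(f"Not a qualified name: {qualified_name!r}")
        return namespace, name

    qualified = ["basket.y", "variables1.x", "variables1.a", "variables2.u"]
    scoped = [split_qualified(q) for q in qualified]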

Within a variable specification, the key `formula` holds the function that calculates
the variable, and the key `inputs` lists the input variables (this graph node's dependencies)
whose values are passed to that function. The parameter names of the function assigned
to the `formula` key do not have to match the names given in `inputs`, but their order is
expected to follow a one-to-one correspondence: the first input is bound to the first
parameter, the second input to the second parameter, and so on. This way the graph is instructed
which inputs to use to get the values to be assigned to the parameters of the `formula`.
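
The positional correspondence can be sketched in plain Python. The variable names `d`, `p`, `q`,
the `available` mapping and the helper `call_formula` below are hypothetical and only mimic
the binding rule described above; they do not show how numbox is implemented::

    def derive_d(first_, second_):
        # Parameter names differ from the input names on purpose.
        return first_ - second_

    spec = {
        "name": "d",
        "inputs": {"p": "basket", "q": "basket"},
        "formula": derive_d,
    }

    # Values keyed by (source, name); an assumed layout for this sketch only.
    available = {("basket", "p"): 10, ("basket", "q"): 3}

    def call_formula(spec_, available_):
        # Dictionaries preserve insertion order, so the i-th input
        # is passed as the i-th positional argument of the formula.
        args = [
            available_[(source, name)]
            for name, source in spec_["inputs"].items()
        ]
        return spec_["formula"](*args)

    d_value = call_formula(spec, available)

Swapping the order of `p` and `q` in `inputs` would swap the arguments and change the result.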

The Python function specified by the `formula`
can be a wrapper around a numba JIT-compiled function, i.e.,
a proxy to numba's `FunctionType` or `CPUDispatcher` objects [#f1]_.

The `inputs` entry of a variable specification (if any) includes both the names of the dependency variables
required to calculate the given variable via the function given by the `formula`,
and the namespaces in which these variables are to be looked up.

Graph end nodes, located at the edge of the graph (a.k.a. leaf nodes), have neither `inputs`
nor `formula` in their specifications. Specifying `formula` without `inputs`
will result in an exception. It is possible, however, to specify `inputs`
but no `formula`. This defines the placement of the node
on the graph but lets the developer defer specifying the node's calculation
logic until later at runtime.
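
The combinations described above can be summarised with a small, hypothetical classifier.
It only encodes the rules as stated in this section and is not numbox code. The exception type
raised by numbox itself is not specified here, so `ValueError` is an assumption, and `abs`
is used merely as a placeholder callable::

    def classify_spec(spec):
        has_inputs = bool(spec.get("inputs"))
        has_formula = spec.get("formula") is not None
        if has_formula and not has_inputs:
            raise ValueError(f"'{spec['name']}': formula given without inputs")
        if not has_inputs:
            return "leaf"
        return "calculated" if has_formula else "deferred"

    kinds = [
        classify_spec({"name": "y"}),
        classify_spec({"name": "x", "inputs": {"y": "basket"}, "formula": abs}),
        classify_spec({"name": "w", "inputs": {"x": "variables1"}}),
    ]

    try:
        classify_spec({"name": "bad", "formula": abs})
        rejected = False
    except ValueError:
        rejected = True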

A variable can be marked as `cacheable` if the value calculated for a given tuple of
arguments can be cached and later retrieved without recalculation, provided
the arguments have not changed. The argument types of the corresponding `formula` then need to be hashable.
In certain cases this may require a custom subclass with its own `__hash__`, which thereby defines
the identity of the argument values.
When `cacheable=True` (the default is `False`), the graph avoids recalculating the
value as long as the inputs have not changed. Overusing the cache is not recommended, especially
when the parameters of the node's `formula` range over continuous or large-cardinality spaces of identities.
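
The hashability requirement can be illustrated without numbox. In the sketch below,
`functools.lru_cache` merely stands in for a cache keyed by the arguments, and `Curve` is a
hypothetical argument type whose identity is defined by `__eq__` and `__hash__`. None of this
reflects how numbox implements `cacheable`::

    from functools import lru_cache

    class Curve:
        """Hypothetical argument type; identity is given by its points."""

        def __init__(self, points):
            self.points = tuple(points)

        def __eq__(self, other):
            return isinstance(other, Curve) and self.points == other.points

        def __hash__(self):
            return hash(self.points)

    calls = []

    @lru_cache(maxsize=None)
    def total(curve):
        calls.append(curve)  # record each real evaluation
        return sum(curve.points)

    first = total(Curve([1, 2, 3]))
    second = total(Curve([1, 2, 3]))  # equal identity, served from the cache
    third = total(Curve([4, 5]))      # different identity, evaluated again

Only two evaluations are recorded in `calls`, since the second call hits the cache.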

Note that the `cacheable` key is a rather brute-force way
to avoid identical re-computations.
It is completely unrelated to the graph's dependency structure.
The graph's `recompute` method, discussed below, by contrast
only recomputes the values of variables that depend on the nodes
that have been updated. That is, the strategy of the `recompute` method
is determined by the graph's topology alone
and is independent of the `cacheable` specifications of the nodes'
variables.
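
The topology-driven selection of nodes can be sketched as follows. The `dependents` mapping
and the `affected` helper are hypothetical and written for the example graph above; the actual
`recompute` implementation may differ::

    from collections import deque

    # Direct dependents of each node, keyed by qualified name.
    dependents = {
        "basket.y": ["variables1.x"],
        "variables1.x": ["variables1.a"],
        "variables1.a": ["variables2.u"],
        "variables2.u": [],
    }

    def affected(changed, dependents_):
        """Collect every node downstream of the changed ones (breadth-first)."""
        seen = set()
        queue = deque(changed)
        while queue:
            node = queue.popleft()
            for child in dependents_[node]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    downstream_of_y = affected(["basket.y"], dependents)
    downstream_of_a = affected(["variables1.a"], dependents)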

Names of the external sources (of data values) need to be given to the `Graph` as well,
via the `external_source_names` argument.
When the :class:`numbox.core.variable.variable.Graph` is compiled
to the :class:`numbox.core.variable.variable.CompiledGraph`, it automatically figures out which variables need to be sourced
from each of the specified external sources (such as `basket`) in order to perform the
required calculation::

    from numbox.core.variable.variable import CompiledGraph

    # What is required from this calculation, the names of qualified variables
    required = ["variables2.u"]

    # Compile the graph for the required variables
    compiled = graph.compile(required)
    assert isinstance(compiled, CompiledGraph)

    # The graph will figure out what external variables it needs to do the calculation
    required_external_variables = compiled.required_external_variables
    assert list(required_external_variables.keys()) == ["basket"]
    basket = required_external_variables["basket"]
    assert list(basket.keys()) == ["y"]
    assert basket["y"].name == "y"
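
Conceptually, the required external variables are the leaf nodes reachable from the requested
variables. The sketch below walks the specifications of the first example in plain Python.
The `specs` lookup table and the `required_externals` helper are hypothetical and do not mirror
the internals of `CompiledGraph`::

    # Calculated variables keyed by (namespace, name), mapped to their inputs.
    specs = {
        ("variables1", "x"): {"y": "basket"},
        ("variables1", "a"): {"x": "variables1"},
        ("variables2", "u"): {"a": "variables1"},
    }
    external_source_names = ["basket"]

    def required_externals(required_, specs_, externals_):
        found = {}
        visited = set()
        stack = list(required_)
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            for name, source in specs_[node].items():
                if source in externals_:
                    # Externally sourced leaf: record it under its source.
                    found.setdefault(source, set()).add(name)
                else:
                    # Calculated dependency: keep walking upstream.
                    stack.append((source, name))
        return found

    needed = required_externals(
        [("variables2", "u")], specs, external_source_names
    )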

`Graph` uses the variable specifications given to it to create instances of :class:`numbox.core.variable.variable.Variable`.
Namespaces of calculated variables are instances of :class:`numbox.core.variable.variable.Variables`.
Namespaces of externally sourced variables are instances of
:class:`numbox.core.variable.variable.External`.

Semantically, each `Variable` is defined by its scoped name, that is, a tuple of its namespace / source
name and its own name.

In DAG terminology, `External` scopes contain variables with no inputs, that is, edge (or end / leaf) nodes.

Instances of `Variables` and `External` are stored in the `registry` of the `Graph` instance::

    from numbox.core.variable.variable import Variables, Variable

    registry = graph.registry

    # Get the namespaces...
    variables1 = registry["variables1"]
    variables2 = registry["variables2"]

    # ... and the variables defined in these namespaces
    assert list(variables1.variables.keys()) == ["x", "a"]
    assert list(variables2.variables.keys()) == ["u"]

    assert isinstance(variables1, Variables)
    assert isinstance(variables1.variables["x"], Variable)

    basket_ = registry["basket"]
    # `basket_` is the same `basket` object as retrieved above
    assert basket_["y"] is basket["y"]

That is, users are not expected to instantiate `Variable` or `Variables` directly,
although they are certainly allowed to do so if needed. It is recommended to design
one's code so that `Variable` instances, when needed, are simply retrieved from the `registry` of the
`Graph` instance.
Instead, users provide variable specifications, such as the dictionaries `x`, `a`, and `u`
in the example above (together with the variable name `y`, which is referred to in `inputs` and thereby
implied to be external), and give them to the `Graph`. The `Graph` then creates instances of `Variables`
(one per namespace) and instances of `External` (one per external source). Finally, `Variables` and `External`
in turn create instances of `Variable` and store them.

To calculate the required variables, one first needs to instantiate
:class:`numbox.core.variable.variable.Values`, the execution-scoped storage for the values of all variables
in the `Variables` and `External` namespaces. This storage gets populated automatically
with all calculated nodes,
as a mapping from the corresponding `Variable` to instances of :class:`numbox.core.variable.variable.Value`.
The latter wraps the data. The data of all non-external variables is initialized to
`_null`, the instance of :class:`numbox.core.variable.variable._Null`.
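
A sentinel of this kind lets "not calculated yet" be told apart from a legitimate value such as
`None` or `0`. The sketch below shows the general pattern only; `_Missing`, `_missing` and `Slot`
are hypothetical stand-ins, not the numbox `_Null` and `Value` classes::

    class _Missing:
        """Hypothetical sentinel type marking a value that is not set yet."""

        def __repr__(self):
            return "_missing"

    _missing = _Missing()

    class Slot:
        """Hypothetical wrapper around a piece of data."""

        def __init__(self):
            self.value = _missing

        @property
        def is_set(self):
            return self.value is not _missing

    slot = Slot()
    before = slot.is_set
    slot.value = None  # None is a valid payload, distinct from the sentinel
    after = slot.is_set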

Then, one needs to supply `external_values` for the leaf nodes that are needed for the calculation.
As discussed above, these required external variables are identified programmatically. Once their values
have been provided, the graph can be calculated as::

    from numbox.core.variable.variable import Values

    # Instantiate the storage
    values = Values()

    # Request the calculation by executing the graph
    compiled.execute(
        external_values={"basket": {"y": 137}},
        values=values,
    )

This populates the `values` with the correct data::

    x_var = variables1["x"]
    a_var = variables1["a"]
    u_var = variables2["u"]

    assert values.get(x_var).value == 274
    assert values.get(a_var).value == 200
    assert values.get(u_var).value == 400

The graph can be recomputed after the values of some of its nodes have changed.
Only the affected nodes will be re-evaluated::

    compiled.recompute({"basket": {"y": 1}}, values)
    assert values.get(basket["y"]).value == 1
    assert values.get(x_var).value == 2
    assert values.get(a_var).value == -72
    assert values.get(u_var).value == -144
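
The numbers asserted above can be cross-checked by composing the three formulas directly,
without any graph machinery. The helper `chain` is introduced here for illustration only::

    def derive_x(y_):
        return 2 * y_

    def derive_a(x_):
        return x_ - 74

    def derive_u(a_):
        return 2 * a_

    def chain(y):
        # Follow the dependency order y -> x -> a -> u.
        x = derive_x(y)
        a = derive_a(x)
        u = derive_u(a)
        return x, a, u

    first_run = chain(137)  # initial execution
    second_run = chain(1)   # after `basket.y` is updated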


.. rubric:: References

.. [#f1] It is straightforward to adapt the variable specifications given here in pure Python to build a fully JIT'ed graph of :class:`numbox.core.work.work.Work` nodes by using :class:`numbox.core.work.builder.Derived`. See :ref:`builder`.

.. automodule:: numbox.core.variable.variable
   :members:
   :show-inheritance:
   :undoc-members: