numbox.core.variable

Overview

A framework for building and evaluating Directed Acyclic Graphs (DAGs) in pure Python. This module contains no JIT-compiled code and imports nothing from numba. Computationally heavy parts can nonetheless be placed on the graph as JIT-compiled functions via the formula key of the graph's variable specifications (see below).

Modules

numbox.core.variable.variable

Overview

A graph can be defined as follows:

from numbox.core.variable.variable import Graph

def derive_x(y_):
    return 2 * y_

def derive_a(x_):
    return x_ - 74

def derive_u(a_):
    return 2 * a_

x = {"name": "x", "inputs": {"y": "basket"}, "formula": derive_x}
a = {"name": "a", "inputs": {"x": "variables1"}, "formula": derive_a}
u = {"name": "u", "inputs": {"a": "variables1"}, "formula": derive_u}

graph = Graph(
    variables_lists={
        "variables1": [x, a],
        "variables2": [u],
    },
    external_source_names=["basket"]
)

Here we have the variable y sourced externally from the basket, and calculated variables x and a in the variables1 namespace, and u in the variables2 namespace.

The dictionaries x, a, and u are called variable specifications. The specs themselves are agnostic about the namespace they end up in. Namespaces are assigned via the variables_lists argument passed to the Graph at initialization time.

The full and unambiguous way to denote the variables is via their qualified names, applicable both to externally sourced variables, basket.y, as well as the calculated ones, variables1.x, variables1.a, variables2.u.
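As a minimal illustration of the naming scheme (not the library's implementation), the sketch below joins a namespace and a variable name with a dot, which is the form used in the examples on this page, and splits the result back into its parts. The helper names join_qual_name and split_qual_name are hypothetical and exist only for this sketch.

```python
def join_qual_name(namespace_name: str, var_name: str) -> str:
    """Hypothetical helper: build 'namespace.name' as written on this page."""
    return f"{namespace_name}.{var_name}"


def split_qual_name(qual_name: str) -> tuple[str, str]:
    """Hypothetical helper: split on the first dot into (namespace, name)."""
    namespace_name, var_name = qual_name.split(".", 1)
    return namespace_name, var_name


qualified = [join_qual_name("basket", "y"), join_qual_name("variables1", "x")]
scoped = [split_qual_name(q) for q in qualified]
```

The (namespace, name) tuples in scoped are hashable, so they can serve directly as dictionary keys.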

The formula key of a variable specification names the function that calculates the variable. Its parameters correspond to the input variables (this graph node’s dependencies), which are in turn listed under the inputs key. The names of the formula’s parameters do not have to match the names of the inputs, but their order is expected to follow a one-to-one correspondence with the order of the inputs. This tells the graph which input supplies the value for each parameter of the formula.
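The positional correspondence can be sketched without the library. The example below assumes a hypothetical two-input variable ratio and a hypothetical flat store known keyed by (namespace, name). It only illustrates the ordering rule described above, not how Graph resolves inputs internally.

```python
def derive_ratio(numerator, denominator):
    return numerator / denominator


# Spec in the same shape as the examples above; the names are hypothetical.
ratio = {
    "name": "ratio",
    "inputs": {"p": "basket", "q": "basket"},
    "formula": derive_ratio,
}

# Hypothetical store of already-known values keyed by (namespace, name).
known = {("basket", "p"): 6.0, ("basket", "q"): 3.0}

# Inputs are taken in declaration order and passed positionally, so the
# parameter names need not match the input names "p" and "q".
args = [known[(source, name)] for name, source in ratio["inputs"].items()]
ratio_value = ratio["formula"](*args)
```

Swapping the order of "p" and "q" in inputs would swap numerator and denominator, which is why the order of the inputs matters.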

The Python function specified by the formula can be a wrapper around a numba JIT-compiled function, i.e., a proxy to numba’s FunctionType or CPUDispatcher objects [1].
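A hedged sketch of such a wrapper is shown below. It assumes numba is installed and falls back to the plain Python function otherwise, so that the snippet runs either way. The function _double is a hypothetical name, and the spec mirrors the x variable from the example above.

```python
try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs when numba is not available.
    def njit(func):
        return func


@njit
def _double(y_):
    return 2 * y_


def derive_x(y_):
    # Plain Python proxy; the computation itself runs in the compiled function.
    return _double(y_)


x = {"name": "x", "inputs": {"y": "basket"}, "formula": derive_x}
result = x["formula"](137)
```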

The inputs entry of a variable specification (if any) gives both the names of the dependency variables required to calculate the given variable via its formula and the namespaces in which these variables will be looked up.

Graph end nodes, located at the edge of the graph (a.k.a. leaf nodes), have neither inputs nor formula in their specifications. Specifying formula without inputs will result in an exception. It is possible, however, to specify inputs but no formula, which technically defines the placement of the node on the graph and lets the developer defer specifying the node’s calculation logic until later at runtime.

A variable can be marked as cacheable, in which case its value calculated for a given tuple of arguments is cached and later retrieved without recalculation, provided the arguments have not changed. The default is cacheable=False. The argument types of the corresponding formula then need to be hashable. In certain cases a custom subclass with its own __hash__ might be needed, thereby defining the identity of the arguments’ values. Use the cache sparingly, especially when the parameters of the node’s formula range over continuous or large-cardinality spaces of identities.
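The caching idea can be sketched independently of the library. The example below keys a plain dictionary on the variable name and the argument tuple, and uses a frozen dataclass as a hashable argument type. Quote, derive_notional, and evaluate are hypothetical names, and the graph's actual cache is not claimed to work this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical argument type; frozen dataclasses are hashable by value."""
    symbol: str
    price: float


calls = []  # records every real evaluation of the formula


def derive_notional(quote, quantity):
    calls.append((quote, quantity))
    return quote.price * quantity


cache = {}  # (variable name, argument tuple) -> cached value


def evaluate(name, formula, args, cacheable=False):
    if not cacheable:
        return formula(*args)
    key = (name, tuple(args))
    if key not in cache:
        cache[key] = formula(*args)
    return cache[key]


first = evaluate("notional", derive_notional, (Quote("ABC", 10.0), 3), cacheable=True)
second = evaluate("notional", derive_notional, (Quote("ABC", 10.0), 3), cacheable=True)
```

The second call hits the cache because the two Quote instances compare and hash equal. With continuous inputs, nearly every call would create a new key, which is the reason for the caution above.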

It is worth noting that the cacheable key is a rather brute-force way to avoid identical re-computations, and it is completely unrelated to the graph’s dependency structure. The graph’s recompute method, discussed below, instead recomputes only the values of variables that depend on the nodes that have been updated. That is, the strategy of the recompute method is determined by the graph’s topology alone and is independent of the cacheable specifications of the nodes’ variables.
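To make the topology-driven strategy concrete, the sketch below collects everything downstream of a changed node with a breadth-first walk over the example graph's edges. The direct_dependents map and the affected_by function are hypothetical and written out by hand. This illustrates the idea only and is not the library's recompute implementation.

```python
from collections import deque

# Direct dependents in the example graph, keyed by qualified name.
direct_dependents = {
    "basket.y": ["variables1.x"],
    "variables1.x": ["variables1.a"],
    "variables1.a": ["variables2.u"],
    "variables2.u": [],
}


def affected_by(changed):
    """Breadth-first walk collecting every node downstream of `changed`."""
    seen = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in direct_dependents.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


after_y = affected_by(["basket.y"])
after_a = affected_by(["variables1.a"])
```

A change to basket.y touches all three calculated variables, while a change to variables1.a only requires variables2.u to be re-evaluated.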

Names of the ‘external’ sources (of data values) need to be given to the Graph as well, via the external_source_names argument. When the numbox.core.variable.variable.Graph is compiled to a numbox.core.variable.variable.CompiledGraph, it automatically figures out which variables need to be sourced from each of the specified external sources (such as ‘basket’) in order to perform the required calculation:

from numbox.core.variable.variable import CompiledGraph

# What is required from this calculation, the names of qualified variables
required = ["variables2.u"]

# Compile the graph for the required variables
compiled = graph.compile(required)
assert isinstance(compiled, CompiledGraph)

# The graph will figure out what external variables it needs to do the calculation
required_external_variables = compiled.required_external_variables
assert list(required_external_variables.keys()) == ["basket"]
basket = required_external_variables["basket"]
assert list(basket.keys()) == ["y"]
assert basket["y"].name == "y"
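The discovery of required external variables can be pictured as a walk upstream from the requested variables. The sketch below does this for the example graph using hand-written hypothetical structures (inputs_of, external_sources, required_externals). It does not reproduce how compile works internally.

```python
# Inputs of each calculated variable as (namespace, name) pairs.
inputs_of = {
    ("variables1", "x"): [("basket", "y")],
    ("variables1", "a"): [("variables1", "x")],
    ("variables2", "u"): [("variables1", "a")],
}
external_sources = {"basket"}


def required_externals(required):
    """Walk upstream from `required`, grouping external leaves by source."""
    found = {}
    visited = set()
    stack = list(required)
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        source, name = node
        if source in external_sources:
            found.setdefault(source, set()).add(name)
        else:
            stack.extend(inputs_of.get(node, []))
    return found


needed = required_externals([("variables2", "u")])
```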

Graph uses the variable specifications given to it to create instances of numbox.core.variable.variable.Variable. Namespaces of calculated Variable instances are numbox.core.variable.variable.Variables. Namespaces of externally sourced Variable instances are numbox.core.variable.variable.External.

Semantically, each Variable is defined by its scoped name, that is, a tuple of its namespace / source name and its own name.

In DAG terminology, External scopes contain variables with no inputs, that is, edge (or end / leaf) nodes.

Instances of Variables and External are stored in the registry of the Graph instance:

from numbox.core.variable.variable import Variables, Variable

registry = graph.registry

# Get the namespaces...
variables1 = registry["variables1"]
variables2 = registry["variables2"]

# ... and the variables defined in these namespaces
assert list(variables1.variables.keys()) == ["x", "a"]
assert list(variables2.variables.keys()) == ["u"]

assert isinstance(variables1, Variables)
assert isinstance(variables1.variables["x"], Variable)

basket_ = registry["basket"]
... # same `basket` as above
assert basket_["y"] is basket["y"]

That is, users are not expected to instantiate either Variable or Variables themselves, although they are certainly allowed to do so if needed (it is recommended to design one’s code so that Variable instances, when needed, are simply retrieved from the registry of the Graph instance). Instead, users provide variable specifications, such as the dictionaries x, a, and u in the example above (together with the variable name “y”, which is referred to as an input and thereby implied to be ‘external’), and give them to the Graph. The Graph then creates instances of Variables (one per namespace) and instances of External (one per ‘external’ source). Finally, Variables and External in turn create instances of Variable and store them.

To calculate the required variables, one first needs to instantiate an execution-scoped storage, numbox.core.variable.variable.Values, which holds the values of all variables scoped in the Variables and External namespaces. This storage is automatically populated for all calculated nodes as a mapping from the corresponding Variable to an instance of numbox.core.variable.variable.Value, which wraps the data. The data of all non-external variables is initialized to _null, an instance of numbox.core.variable.variable._Null.

Then, one needs to supply the external_values of the leaf nodes that are needed for the calculation. As discussed above, these required external variables are identified programmatically. Once their values have been provided, the graph can be calculated as:

from numbox.core.variable.variable import Values

# Instantiate the storage
values = Values()

# Request the calculation by executing the graph
compiled.execute(
    external_values={"basket": {"y": 137}},
    values=values,
)

This populates the values with the correct data:

x_var = variables1["x"]
a_var = variables1["a"]
u_var = variables2["u"]

assert values.get(x_var).value == 274
assert values.get(a_var).value == 200
assert values.get(u_var).value == 400
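To see where these numbers come from, the same three formulas can be evaluated by hand in dependency order. The sketch below uses the standard library's graphlib.TopologicalSorter and drops the namespaces for brevity. It illustrates ordered evaluation under those simplifications and is not the library's execute method.

```python
from graphlib import TopologicalSorter

formulas = {
    "x": lambda y_: 2 * y_,
    "a": lambda x_: x_ - 74,
    "u": lambda a_: 2 * a_,
}
# Each node mapped to the nodes it depends on (its predecessors).
inputs = {"x": ["y"], "a": ["x"], "u": ["a"]}

computed = {"y": 137}  # externally supplied value
# static_order() yields each node only after all of its predecessors.
for node in TopologicalSorter(inputs).static_order():
    if node in formulas:
        computed[node] = formulas[node](*(computed[i] for i in inputs[node]))
```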

The graph can be recomputed if the values of some of its nodes have changed. Only the affected nodes will be re-evaluated:

compiled.recompute({"basket": {"y": 1}}, values)
assert values.get(basket["y"]).value == 1
assert values.get(x_var).value == 2
assert values.get(a_var).value == -72
assert values.get(u_var).value == -144

References

class numbox.core.variable.variable.CompiledGraph(ordered_nodes: list[numbox.core.variable.variable.CompiledNode], required_external_variables: dict[str, dict[str, numbox.core.variable.variable.Variable]], dependents: dict[numbox.core.variable.variable.Variable, list[numbox.core.variable.variable.CompiledNode]] = <factory>)[source]

Bases: object

dependents: dict[Variable, list[CompiledNode]]
execute(external_values: dict[str, dict[str, Any]], values: Storage)[source]

Main entry point to calculate values of nodes of the compiled graph. Calculation requires the following inputs:

Parameters:
  • external_values – actual values of all required external variables; this can be a superset of what is really needed for the calculation. The map is first from the name of the external source and then from the name of the variable within that source to the variable’s actual value.

  • values – runtime storage of all values, e.g., an instance of Values.

ordered_nodes: list[CompiledNode]
recompute(changed: dict[str, dict[str, Any]], values: Storage)[source]
Parameters:
  • changed – dict of sources to names to new values of changed Variable instances coming from either an External or a Variables source.

  • values – storage of all the Variable values.

required_external_variables: dict[str, dict[str, Variable]]
class numbox.core.variable.variable.CompiledNode(variable: numbox.core.variable.variable.Variable, inputs: list[numbox.core.variable.variable.Variable])[source]

Bases: object

inputs: list[Variable]
variable: Variable
class numbox.core.variable.variable.External(name: str)[source]

Bases: Namespace

An ‘external’ namespace that facilitates discovery of requested names.

When requesting a Variable with the given name via a typical __getitem__ call, if the Variable is not found, it will be created and added to this dictionary. This way the graph will be able to infer which variables are required from the external source abstracted by this namespace.
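The create-on-lookup behaviour described above resembles the __missing__ hook of a dict subclass. The sketch below uses a hypothetical class DiscoveringNamespace and a plain tuple as a placeholder for a variable object. It is an analogy only and is not how External is implemented.

```python
class DiscoveringNamespace(dict):
    """Hypothetical stand-in that records every name it is asked for."""

    def __init__(self, name):
        super().__init__()
        self.name = name

    def __missing__(self, key):
        # Called by dict.__getitem__ when `key` is absent.
        created = (self.name, key)  # placeholder for a real variable object
        self[key] = created
        return created


basket = DiscoveringNamespace("basket")
first = basket["y"]   # not present yet, so it is created and stored
second = basket["y"]  # now found, the same object is returned
```

After the two lookups, the keys of basket list exactly the names that were requested, which is what lets a graph infer its external requirements.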

class numbox.core.variable.variable.Graph(variables_lists: dict[str, list[VarSpec]], external_source_names: list[str])[source]

Bases: object

compile(required: list[str] | str) CompiledGraph[source]
Parameters:

required – list of qualified variable names that need to be calculated.

dependents_of(qual_names: list[str] | set[str] | str) set[str][source]

Return qualified names of Variable instances that directly or indirectly depend on any of qual_names.

explain(qual_name: str, right_to_left: bool = True) str[source]

Follow the dependencies chain to explain how the given variable is derived.

Uses metadata of the Variable instances.

Parameters:
  • qual_name – qualified name of the Variable.

  • right_to_left – when True (default), begin the explanation with qual_name. That is, move towards the ends of the graph.

class numbox.core.variable.variable.Namespace[source]

Bases: ABC

keys()[source]
name: str
update(key: str, var: Variable) None[source]

Post-initialization update for dynamically generated Variable instances.

class numbox.core.variable.variable.Storage(*args, **kwargs)[source]

Bases: Protocol

cache: dict[tuple[Variable, tuple[Any, ...]], Any]
get(variable: Variable) Value[source]

Principal access point to the requested variable. Instantiates the corresponding value when first invoked for the given variable.

class numbox.core.variable.variable.Value(variable: ~numbox.core.variable.variable.Variable, value: ~typing.Any | ~numbox.core.variable.variable._Null = <factory>)[source]

Bases: object

Value of the corresponding Variable. Best used when created indirectly by the Values storage.

value: Any | _Null
variable: Variable
class numbox.core.variable.variable.Values[source]

Bases: object

Values of all Variable instances, computed and external, will be held here.

get(variable: Variable) Value[source]
class numbox.core.variable.variable.VarSpec[source]

Bases: VarSpecBase

cacheable: bool
formula: Callable
inputs: dict[str, str]
metadata: str
name: str
class numbox.core.variable.variable.VarSpecBase[source]

Bases: TypedDict

name: str
class numbox.core.variable.variable.Variable(name: str, source: str = '', inputs: ~typing.Mapping[str, str] = <factory>, formula: ~typing.Callable = None, metadata: str | None = None, cacheable: bool = False)[source]

Bases: object

An instance of Variable is anything that can be calculated from the values of the given inputs dependencies using the provided formula (i.e., a Python function).

A calculated value can be None, which is why a non-calculated value is designated with _null.

An instance of Variable is best created within the given Namespace. For example, when the Variables subtype of the Namespace is instantiated, it gets populated with the freshly created Variable instances per the VarSpec specifications passed to it. Or, when the External subtype of the Namespace is queried for the given variable name, if a Variable with such a name is not already present in that external namespace, it will be created and stored there.

Parameters:
  • name – name of the Variable instance.

  • source – name of the Namespace instance which is the namespace / source of this Variable.

  • inputs – (optional) map from names of the Variable inputs (which are names of other Variable instances) to names of their Namespace instances.

  • formula – (optional) function that calculates the value of this Variable from its sources.

  • metadata – any possible metadata associated with this variable.

  • cacheable – (default False) when True, the corresponding Value (see below) will be cached during calculation. When attempting to recompute with the same inputs, the cached value will be returned instead. Use sparingly!

cacheable: bool = False
formula: Callable = None
inputs: Mapping[str, str]
metadata: str | None = None
name: str
qual_name() str[source]

Qualified name of Variable incorporates both the name of the Variable and the name of its source / namespace.

source: str = ''
class numbox.core.variable.variable.Variables(name: str, variables: list[VarSpec])[source]

Bases: Namespace

numbox.core.variable.variable.make_qual_name(namespace_name: str, var_name: str) str[source]

Each Variable instance is best initialized in and owned by a Namespace object (such as, instances of External and Variables), with the given namespace_name.

This function returns the qualified name of the Variable instance.