numbarrow.core.mapinarrow_factory

Overview

Factory for PySpark mapInArrow UDF functions. Bridges PySpark’s Arrow-based batch processing with Numba JIT-compiled functions by converting each pyarrow.RecordBatch column through the adapter layer before passing data to a user-supplied computation function.

Usage:

from numbarrow.core.mapinarrow_factory import make_mapinarrow_func

def my_func(data_dict, bitmap_dict, broadcasts):
    # data_dict:   {col_name: np.ndarray}
    # bitmap_dict: {col_name: np.ndarray (uint8 bitmap)}
    # broadcasts:  {key: value}
    result = ...
    return {"output_col": result}

udf = make_mapinarrow_func(my_func, broadcasts={"scale": 1.5})
df_out = df_in.mapInArrow(udf, output_schema)

See test/demo_map_in_arrow.py for a complete runnable example.

Module

Factory for PySpark mapInArrow UDF functions.

Bridges PySpark’s Arrow-based batch processing with Numba JIT-compiled functions by converting each pyarrow.RecordBatch column through arrow_array_adapter() before passing the data to a user-supplied computation function.

numbarrow.core.mapinarrow_factory.make_mapinarrow_func(main_func: Callable, input_columns: List[str] | None = None, broadcasts: Dict | None = None)[source]

Creates a function that can be passed as the first argument to DataFrame.mapInArrow.

Parameters:
  • main_func – user computation function with the signature main_func(data_dict, bitmap_dict, broadcasts), where:
      - data_dict: Dict[str, np.ndarray], values are data arrays of the supported types
      - bitmap_dict: Dict[str, np.ndarray], values are uint8-aligned Arrow validity bitmaps
      - broadcasts: Optional[Dict[str, Any]], broadcast values
    Returns a Dict[str, np.ndarray] that will be used to create the output pyarrow.RecordBatch.

  • input_columns – optional list of column names expected in data_dict for the calculation done by main_func. When not given, all columns of the PySpark DataFrame being iterated over are used.

  • broadcasts – optional dictionary of broadcast values passed through to main_func.
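To illustrate the main_func contract described above, here is a minimal, self-contained sketch of a user function that scales one column and respects the Arrow validity bitmap. The column name "x", the output name "output_col", and the broadcast key "scale" are hypothetical choices for this example, and the inputs are simulated with plain NumPy arrays rather than a real RecordBatch:

```python
import numpy as np

def scale_column(data_dict, bitmap_dict, broadcasts):
    """Example main_func: multiply column "x" by a broadcast scale factor,
    zeroing out slots whose Arrow validity bit is unset.
    Column and key names here are illustrative, not part of the API."""
    x = data_dict["x"]
    # Arrow validity bitmaps are little-endian bit-packed uint8 arrays:
    # bit i of the bitmap marks whether value i is valid (non-null).
    valid = np.unpackbits(bitmap_dict["x"], bitorder="little")[: len(x)].astype(bool)
    out = np.where(valid, x * broadcasts["scale"], 0.0)
    return {"output_col": out}

# Simulated inputs for a single batch: four values, the third one null.
data = {"x": np.array([1.0, 2.0, 3.0, 4.0])}
bitmap = {"x": np.packbits(np.array([1, 1, 0, 1], dtype=np.uint8),
                           bitorder="little")}
result = scale_column(data, bitmap, {"scale": 1.5})
```

A function with this shape is what make_mapinarrow_func wraps; the factory handles the RecordBatch-to-ndarray conversion on either side of the call.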