numbarrow.core.mapinarrow_factory
Overview
Factory for PySpark mapInArrow UDF functions. Bridges PySpark’s
Arrow-based batch processing with Numba JIT-compiled functions by
converting each pyarrow.RecordBatch column through the
arrow_array_adapter() layer before passing the data to a
user-supplied computation function.
Usage:

    from numbarrow.core.mapinarrow_factory import make_mapinarrow_func

    def my_func(data_dict, bitmap_dict, broadcasts):
        # data_dict: {col_name: np.ndarray}
        # bitmap_dict: {col_name: np.ndarray (uint8 validity bitmap)}
        # broadcasts: {key: value}
        result = ...
        return {"output_col": result}

    udf = make_mapinarrow_func(my_func, broadcasts={"scale": 1.5})
    df_out = df_in.mapInArrow(udf, output_schema)
See test/demo_map_in_arrow.py for a complete runnable example.
- numbarrow.core.mapinarrow_factory.make_mapinarrow_func(main_func: Callable, input_columns: List[str] | None = None, broadcasts: Dict | None = None)[source]
- Creates a function that can be passed to DataFrame.mapInArrow().
- Parameters:
  - main_func – callable with the signature main_func(data_dict, bitmap_dict, broadcasts) -> Dict[str, np.ndarray]:
    - data_dict: Dict[str, np.ndarray] – column data arrays of the supported types
    - bitmap_dict: Dict[str, np.ndarray] – aligned uint8 validity bitmaps, one per column
    - broadcasts: Optional[Dict[str, Any]] – broadcast values
    The returned dictionary is used to create the output pyarrow.RecordBatch.
  - input_columns – optional list of column names to include in data_dict for the calculation done by main_func. When not given, all columns of the iterated-over PySpark DataFrame are used.
  - broadcasts – optional dictionary of broadcast values, passed through to main_func on every batch.
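To show how a main_func consumes the packed validity bitmaps together with a broadcast value, here is a small illustrative example (the column name "x", the key "scale", and the function name scale_valid are illustrative choices, not part of the API):

```python
import numpy as np

def scale_valid(data_dict, bitmap_dict, broadcasts):
    # Example main_func: multiply column "x" by the broadcast "scale",
    # writing 0.0 at positions marked null in the validity bitmap.
    x = data_dict["x"]
    # Unpack the Arrow-style packed bitmap (LSB-first) to one flag per row.
    valid = np.unpackbits(bitmap_dict["x"], bitorder="little")[: x.size].astype(bool)
    out = np.where(valid, x * broadcasts["scale"], 0.0)
    return {"output_col": out}
```

A function like this could then be wrapped with make_mapinarrow_func(scale_valid, broadcasts={"scale": 2.0}) and handed to mapInArrow as in the usage example above.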