cudf.core.groupby.groupby.GroupBy.apply

GroupBy.apply(function, *args, engine='cudf')

Apply a Python transformation function to each grouped chunk.

Parameters

function : callable
    The Python transformation function that will be applied to each grouped chunk.
args : tuple
    Optional positional arguments to pass to the function.
engine : {'cudf', 'jit'}, default 'cudf'
    Selects the GroupBy.apply implementation. Use 'jit' to select the Numba JIT pipeline. With this option, only certain operations are allowed within the function: the reductions min, max, sum, mean, var, std, idxmax, and idxmin, and any arithmetic formula involving their results. Binary operations on columns are not yet supported, so syntax like df['x'] * 2 is not allowed. For more information, see the cuDF guide to user-defined functions.

Examples

from cudf import DataFrame
df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'])

# Define a function to apply to each group (it receives the group's rows as a DataFrame)
def mult(df):
    df['out'] = df['key'] * df['val']
    return df

result = groups.apply(mult)
print(result)

Output:

key  val  out
  0    0    0
  0    1    0
  1    2    2
  1    3    3
  2    4    8
  2    5   10
  2    6   12
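
The args parameter described above can be illustrated with a minimal sketch. The function name scaled_total and its factor argument are hypothetical, chosen only for this illustration. The fallback to pandas is likewise an assumption made so the sketch runs on machines without a GPU; index layout and group ordering may differ between the two libraries.

```python
# Prefer cuDF; fall back to pandas (an assumption made only so this sketch
# can run without a GPU).
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

df = xdf.DataFrame({
    "key": [0, 0, 1, 1, 2, 2, 2],
    "val": [0, 1, 2, 3, 4, 5, 6],
})


def scaled_total(group, factor):
    # `group` holds the rows for one key; `factor` arrives through *args.
    return group["val"].sum() * factor


# The extra positional argument 10 is forwarded to scaled_total as `factor`.
result = df.groupby("key").apply(scaled_total, 10)
print(result)
```

Because the function returns one scalar per group, the result holds one value per key rather than one row per input row.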

Pandas Compatibility Note

groupby.apply

cuDF's groupby.apply is limited compared to pandas. In some situations, pandas returns the group keys as part of the index, while cuDF does not because they would be redundant. For example:

>>> df = pd.DataFrame({
...     'a': [1, 1, 2, 2],
...     'b': [1, 2, 1, 2],
...     'c': [1, 2, 3, 4],
... })
>>> gdf = cudf.from_pandas(df)
>>> df.groupby('a').apply(lambda x: x.iloc[[0]])
     a  b  c
a
1 0  1  1  1
2 2  2  1  3
>>> gdf.groupby('a').apply(lambda x: x.iloc[[0]])
   a  b  c
0  1  1  1
2  2  1  3
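
When comparing results across the two libraries, one option is to ask pandas to leave the group keys out of the index. The sketch below is pandas-only and uses the group_keys parameter of DataFrame.groupby; it is an illustration of one way to line up the index, not a statement about how cuDF works internally. Depending on the pandas version, the grouping column may or may not appear in the frame passed to the function, so only the index is inspected here.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [1, 2, 1, 2],
    "c": [1, 2, 3, 4],
})

# group_keys=False asks pandas not to prepend the group key to the index,
# so only the original row labels remain.
first_rows = df.groupby("a", group_keys=False).apply(lambda x: x.iloc[[0]])
print(first_rows.index.tolist())
```

With the keys omitted, the remaining row labels correspond to the index shown in the cuDF output above.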

engine='jit' may be used to accelerate certain functions, initially those that contain reductions and arithmetic operations between the results of those reductions:

>>> import cudf
>>> df = cudf.DataFrame({'a':[1,1,2,2,3,3], 'b':[1,2,3,4,5,6]})
>>> df.groupby('a').apply(
...     lambda group: group['b'].max() - group['b'].min(),
...     engine='jit'
... )
   a  None
0  1     1
1  2     1
2  3     1