libcudf  24.04.00
cudf::groupby::groupby Class Reference

Groups values by keys and computes aggregations on those groups.

#include <groupby.hpp>

Classes

struct  groups
 The grouped data corresponding to a groupby operation on a set of values.
 

Public Member Functions

 groupby (groupby const &)=delete
 
 groupby (groupby &&)=delete
 
groupby & operator= (groupby const &)=delete
 
groupby & operator= (groupby &&)=delete
 
 groupby (table_view const &keys, null_policy null_handling=null_policy::EXCLUDE, sorted keys_are_sorted=sorted::NO, std::vector< order > const &column_order={}, std::vector< null_order > const &null_precedence={})
 Construct a groupby object with the specified keys.
 
std::pair< std::unique_ptr< table >, std::vector< aggregation_result > > aggregate (host_span< aggregation_request const > requests, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Performs grouped aggregations on the specified values.
 
std::pair< std::unique_ptr< table >, std::vector< aggregation_result > > aggregate (host_span< aggregation_request const > requests, rmm::cuda_stream_view stream, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Performs grouped aggregations on the specified values.
 
std::pair< std::unique_ptr< table >, std::vector< aggregation_result > > scan (host_span< scan_request const > requests, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Performs grouped scans on the specified values.
 
std::pair< std::unique_ptr< table >, std::unique_ptr< table > > shift (table_view const &values, host_span< size_type const > offsets, std::vector< std::reference_wrapper< scalar const >> const &fill_values, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Performs grouped shifts for specified values.
 
groups get_groups (cudf::table_view values={}, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Get the grouped keys and values corresponding to a groupby operation on a set of values.
 
std::pair< std::unique_ptr< table >, std::unique_ptr< table > > replace_nulls (table_view const &values, host_span< cudf::replace_policy const > replace_policies, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Replaces nulls in the specified values within each group.
 

Detailed Description

Groups values by keys and computes aggregations on those groups.

Definition at line 94 of file groupby.hpp.

Constructor & Destructor Documentation

◆ groupby()

cudf::groupby::groupby::groupby ( table_view const &  keys,
null_policy  null_handling = null_policy::EXCLUDE,
sorted  keys_are_sorted = sorted::NO,
std::vector< order > const &  column_order = {},
std::vector< null_order > const &  null_precedence = {} 
)
explicit

Construct a groupby object with the specified keys.

If the keys are already sorted, better performance may be achieved by passing keys_are_sorted == sorted::YES and indicating the ascending/descending order and the null order of each column in column_order and null_precedence, respectively.

Note
This object does not maintain the lifetime of keys. It is the user's responsibility to ensure the groupby object does not outlive the data viewed by the keys table_view.
Parameters
keys  Table whose rows act as the groupby keys
null_handling  Indicates whether rows in keys that contain NULL values should be included
keys_are_sorted  Indicates whether rows in keys are already sorted
column_order  If keys_are_sorted == sorted::YES, indicates whether each column is ascending or descending. If empty, all columns are assumed to be ascending. Ignored if keys_are_sorted == sorted::NO.
null_precedence  If keys_are_sorted == sorted::YES, indicates the ordering of null values in each column. If empty, all columns are assumed to use null_order::AFTER. Ignored if keys_are_sorted == sorted::NO.

Member Function Documentation

◆ aggregate() [1/2]

std::pair<std::unique_ptr<table>, std::vector<aggregation_result> > cudf::groupby::groupby::aggregate ( host_span< aggregation_request const >  requests,
rmm::cuda_stream_view  stream,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Performs grouped aggregations on the specified values.

The values to aggregate and the aggregations to perform are specified in an aggregation_request. Each request contains a column_view of values to aggregate and a set of aggregations to perform on those elements.

For each aggregation in a request, values[i] is aggregated with all other values[j] where rows i and j in keys are equivalent.

The size() of the request column must equal keys.num_rows().

For every aggregation_request, an aggregation_result is returned. The aggregation_result holds the resulting column(s) for each aggregation requested on the request's values. The columns in each result appear in the same order as the aggregations were specified in the request.

The returned table contains the group labels for each group, i.e., the unique rows from keys. Element i across all aggregation results belongs to the group at row i in the group labels table.

The order of the rows in the group labels is arbitrary. Furthermore, successive groupby::aggregate calls may return results in different orders.

Exceptions
cudf::logic_error  If requests[i].values.size() != keys.num_rows().

Example:

Input:
  keys:    {1 2 1 3 1}
           {1 2 1 4 1}
  request:
    values:       {3 1 4 9 2}
    aggregations: {{SUM}, {MIN}}
result:
  keys:    {3 1 2}
           {4 1 2}
  values:
    SUM: {9 9 1}
    MIN: {9 2 1}
Parameters
requests  The set of columns to aggregate and the aggregations to perform
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned table and columns' device memory
Returns
Pair containing the table with each group's unique key and a vector of aggregation_results for each request in the same order as specified in requests.
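
The grouping semantics described for aggregate() can be illustrated on the host with the standard library alone. The following is a minimal reference sketch, not libcudf code: the names key_row, grouped_result, and grouped_sum_min are hypothetical, keys are modeled as a two-column table of int pairs, and only SUM and MIN are covered. It returns groups in sorted key order because it uses std::map, whereas the group order returned by groupby::aggregate is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical host-side types for illustration; not part of libcudf.
using key_row = std::pair<int, int>;  // one row of a two-column keys table

struct grouped_result {
  std::vector<key_row> keys;  // unique key rows (the group labels)
  std::vector<int> sum;       // SUM of the values in each group
  std::vector<int> min;       // MIN of the values in each group
};

// Reference semantics only: values[i] is combined with every values[j]
// whose key row is equal to keys[i].
inline grouped_result grouped_sum_min(std::vector<key_row> const& keys,
                                      std::vector<int> const& values)
{
  // Mirrors the documented precondition values.size() == keys.num_rows().
  assert(keys.size() == values.size());

  std::map<key_row, std::pair<int, int>> acc;  // key row -> (sum, min)
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto [it, inserted] = acc.try_emplace(keys[i], values[i], values[i]);
    if (!inserted) {
      it->second.first += values[i];
      it->second.second = std::min(it->second.second, values[i]);
    }
  }

  // Element g of every output vector belongs to the group labelled keys[g].
  grouped_result out;
  for (auto const& [key, sum_min] : acc) {
    out.keys.push_back(key);
    out.sum.push_back(sum_min.first);
    out.min.push_back(sum_min.second);
  }
  return out;
}
```

Applied to the keys and values of the example, this sketch produces the groups (1,1), (2,2), (3,4) with SUM 9, 1, 9 and MIN 2, 1, 9, which is the documented result up to group order.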

◆ aggregate() [2/2]

std::pair<std::unique_ptr<table>, std::vector<aggregation_result> > cudf::groupby::groupby::aggregate ( host_span< aggregation_request const >  requests,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Performs grouped aggregations on the specified values.

The values to aggregate and the aggregations to perform are specified in an aggregation_request. Each request contains a column_view of values to aggregate and a set of aggregations to perform on those elements.

For each aggregation in a request, values[i] is aggregated with all other values[j] where rows i and j in keys are equivalent.

The size() of the request column must equal keys.num_rows().

For every aggregation_request, an aggregation_result is returned. The aggregation_result holds the resulting column(s) for each aggregation requested on the request's values. The columns in each result appear in the same order as the aggregations were specified in the request.

The returned table contains the group labels for each group, i.e., the unique rows from keys. Element i across all aggregation results belongs to the group at row i in the group labels table.

The order of the rows in the group labels is arbitrary. Furthermore, successive groupby::aggregate calls may return results in different orders.

Exceptions
cudf::logic_error  If requests[i].values.size() != keys.num_rows().

Example:

Input:
  keys:    {1 2 1 3 1}
           {1 2 1 4 1}
  request:
    values:       {3 1 4 9 2}
    aggregations: {{SUM}, {MIN}}
result:
  keys:    {3 1 2}
           {4 1 2}
  values:
    SUM: {9 9 1}
    MIN: {9 2 1}
Parameters
requests  The set of columns to aggregate and the aggregations to perform
mr  Device memory resource used to allocate the returned table and columns' device memory
Returns
Pair containing the table with each group's unique key and a vector of aggregation_results for each request in the same order as specified in requests.

◆ get_groups()

groups cudf::groupby::groupby::get_groups ( cudf::table_view  values = {},
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Get the grouped keys and values corresponding to a groupby operation on a set of values.

Returns a groups object representing the grouped keys and values. If values is not provided, only a grouping of the keys is performed, and the values of the groups object will be nullptr.

Parameters
values  Table representing values on which a groupby operation is to be performed
mr  Device memory resource used to allocate the device memory of the tables in the returned groups
Returns
A groups object representing grouped keys and values
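
As a rough host-side picture of what "grouped keys and values" means, the sketch below reorders a single key column so that equal keys are adjacent and applies the same reordering to an optional value column. It is an illustration under assumptions, not the libcudf data structure: host_groups and make_groups are hypothetical names, and the offsets field is added here only to mark where each group starts and ends in this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side stand-in for a grouped result; not part of libcudf.
struct host_groups {
  std::vector<int> keys;             // keys reordered so that equal keys are adjacent
  std::vector<std::size_t> offsets;  // assumption: group g spans [offsets[g], offsets[g + 1])
  std::vector<int> values;           // values reordered the same way; empty if none were given
};

inline host_groups make_groups(std::vector<int> const& keys,
                               std::vector<int> const& values = {})
{
  assert(values.empty() || values.size() == keys.size());

  // A stable sort of row indices keeps the original row order inside each group.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  host_groups out;
  for (std::size_t pos = 0; pos < order.size(); ++pos) {
    std::size_t const row = order[pos];
    // A new group starts wherever the key differs from the previous sorted row.
    if (pos == 0 || keys[row] != keys[order[pos - 1]]) { out.offsets.push_back(pos); }
    out.keys.push_back(keys[row]);
    if (!values.empty()) { out.values.push_back(values[row]); }
  }
  out.offsets.push_back(order.size());  // closing offset of the last group
  return out;
}
```

Calling make_groups with only a key column mirrors the documented case where values is not provided: the keys are still grouped, and the values member of the sketch stays empty.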

◆ replace_nulls()

std::pair<std::unique_ptr<table>, std::unique_ptr<table> > cudf::groupby::groupby::replace_nulls ( table_view const &  values,
host_span< cudf::replace_policy const >  replace_policies,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Replaces nulls in the specified values within each group.

For each value[i] == NULL in group j, value[i] is replaced with the first non-null value in group j that precedes or follows value[i]. If a non-null value is not found in the specified direction, value[i] is left NULL.

The returned pair contains a table of the sorted keys and a table of the result columns. In each result column, values of the same group are contiguous in memory. Within each group, values keep their original order. The order of the groups is not guaranteed.

Example:

//Inputs:
keys:             {3 3 1 3 1 3 4}
                  {2 2 1 2 1 2 5}
values:           {3 4 7 @ @ @ @}
                  {@ @ @ "x" "tt" @ @}
replace_policies: {FORWARD, BACKWARD}
//Outputs (group orders may be different):
keys:   {3 3 3 3 1 1 4}
        {2 2 2 2 1 1 5}
result: {3 4 4 4 7 7 @}
        {"x" "x" "x" @ "tt" "tt" @}
Parameters
[in]  values  A table whose column null values will be replaced
[in]  replace_policies  Specify the position of replacement values relative to null values, one for each column
[in]  mr  Device memory resource used to allocate device memory of the returned column
Returns
Pair that contains a table with the sorted keys and the result column
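
The forward and backward replacement rule can be written out as a small host-side reference. This is a sketch under assumptions rather than libcudf code: fill_policy, nullable_int, and grouped_replace_nulls are hypothetical names, nulls are modeled with std::optional, and a single int key column and a single value column are used. Groups come out in ascending key order here, whereas the documented group order is not guaranteed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not part of libcudf.
enum class fill_policy { forward, backward };  // models FORWARD / BACKWARD in the example
using nullable_int = std::optional<int>;       // std::nullopt models a null (@)

// Returns (sorted keys, values with nulls replaced inside each group).
inline std::pair<std::vector<int>, std::vector<nullable_int>> grouped_replace_nulls(
  std::vector<int> const& keys, std::vector<nullable_int> const& values, fill_policy policy)
{
  assert(keys.size() == values.size());
  std::size_t const n = keys.size();

  // Stable sort of row indices: equal keys become adjacent, original order is kept per group.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  std::vector<int> out_keys(n);
  std::vector<nullable_int> out_values(n);
  for (std::size_t pos = 0; pos < n; ++pos) {
    out_keys[pos]   = keys[order[pos]];
    out_values[pos] = values[order[pos]];
  }

  if (policy == fill_policy::forward) {
    // Carry the nearest preceding non-null value forward, never across a group boundary.
    for (std::size_t pos = 1; pos < n; ++pos) {
      if (!out_values[pos] && out_keys[pos] == out_keys[pos - 1]) {
        out_values[pos] = out_values[pos - 1];
      }
    }
  } else {
    // Carry the nearest following non-null value backward, never across a group boundary.
    for (std::size_t pos = n; pos-- > 1;) {
      if (!out_values[pos - 1] && out_keys[pos - 1] == out_keys[pos]) {
        out_values[pos - 1] = out_values[pos];
      }
    }
  }
  return {out_keys, out_values};
}
```

A null with no non-null neighbor in the chosen direction inside its own group stays null, matching the last rows of the documented example.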

◆ scan()

std::pair<std::unique_ptr<table>, std::vector<aggregation_result> > cudf::groupby::groupby::scan ( host_span< scan_request const >  requests,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Performs grouped scans on the specified values.

The values to scan and the scan aggregations to perform are specified in a scan_request. Each request contains a column_view of values to scan and a set of aggregations to perform on those elements.

For each aggregation in a request, values[i] is scan aggregated with all previous values[j] where rows i and j in keys are equivalent.

The size() of the request column must equal keys.num_rows().

For every scan_request, an aggregation_result is returned. The aggregation_result holds the resulting column(s) for each aggregation requested on the request's values. The columns in each result appear in the same order as the aggregations were specified in the request.

The returned table contains the group labels for each row, i.e., the keys given to the groupby object. Element i across all aggregation results belongs to the group at row i in the group labels table.

The order of the rows in the group labels is arbitrary. Furthermore, successive groupby::scan calls may return results in different orders.

Exceptions
cudf::logic_error  If requests[i].values.size() != keys.num_rows().

Example:

Input:
  keys:    {1 2 1 3 1}
           {1 2 1 4 1}
  request:
    values:       {3 1 4 9 2}
    aggregations: {{SUM}, {MIN}}
result:
  keys:    {3 1 1 1 2}
           {4 1 1 1 2}
  values:
    SUM: {9 3 7 9 1}
    MIN: {9 3 3 2 1}
Parameters
requests  The set of columns to scan and the scans to perform
mr  Device memory resource used to allocate the returned table and columns' device memory
Returns
Pair containing the table with each group's key and a vector of aggregation_results for each request in the same order as specified in requests.
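
Unlike aggregate(), a grouped scan yields one output row per input row. The host-side sketch below shows an inclusive running SUM and MIN that restart at each group. It is a reference illustration only, with hypothetical names (scan_result, grouped_scan_sum_min) and a single int key column assumed for brevity; groups are emitted in ascending key order here, while the documented order is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side result type; not part of libcudf.
struct scan_result {
  std::vector<int> keys;  // group label of every output row
  std::vector<int> sum;   // inclusive running SUM within the group
  std::vector<int> min;   // inclusive running MIN within the group
};

inline scan_result grouped_scan_sum_min(std::vector<int> const& keys,
                                        std::vector<int> const& values)
{
  assert(keys.size() == values.size());

  // Stable sort of row indices: equal keys become adjacent, original order is kept per group.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  scan_result out;
  for (std::size_t pos = 0; pos < order.size(); ++pos) {
    std::size_t const row = order[pos];
    int const value       = values[row];
    // The running state restarts whenever a new group begins.
    bool const new_group = (pos == 0) || (keys[row] != out.keys.back());
    int const running_sum = new_group ? value : out.sum.back() + value;
    int const running_min = new_group ? value : std::min(out.min.back(), value);
    out.keys.push_back(keys[row]);
    out.sum.push_back(running_sum);
    out.min.push_back(running_min);
  }
  return out;
}
```

For the values {3 1 4 9 2} of the example, the group containing 3, 4, 2 yields running sums 3, 7, 9 and running minimums 3, 3, 2, as in the documented result.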

◆ shift()

std::pair<std::unique_ptr<table>, std::unique_ptr<table> > cudf::groupby::groupby::shift ( table_view const &  values,
host_span< size_type const >  offsets,
std::vector< std::reference_wrapper< scalar const >> const &  fill_values,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Performs grouped shifts for specified values.

In the jth column, for each group, the ith element is taken from element i - offsets[j] of the same group. If i - offsets[j] < 0 or i - offsets[j] >= group_size, the value is given by fill_values[j].

Note
The first returned table stores the keys passed to the groupby object. Row i of the key table corresponds to the group labels of row i in the shifted columns. The key order in each group matches the input order. The order of the groups is arbitrary. The group order in successive calls to groupby::shift may be different.

Example:

keys:       {1 4 1 3 4 4 1}
            {1 2 1 3 2 2 1}
values:     {3 9 1 4 2 5 7}
            {"a" "c" "bb" "ee" "z" "x" "d"}
offset:     {2, -1}
fill_value: {@, @}
result (group order may be different):
  keys:     {3 1 1 1 4 4 4}
            {3 1 1 1 2 2 2}
  values:   {@ @ @ 3 @ @ 9}
            {@ "bb" "d" @ "z" "x" @}
-------------------------------------------------
keys:       {1 4 1 3 4 4 1}
            {1 2 1 3 2 2 1}
values:     {3 9 1 4 2 5 7}
            {"a" "c" "bb" "ee" "z" "x" "d"}
offset:     {-2, 1}
fill_value: {-1, "42"}
result (group order may be different):
  keys:     {3 1 1 1 4 4 4}
            {3 1 1 1 2 2 2}
  values:   {-1 7 -1 -1 5 -1 -1}
            {"42" "42" "a" "bb" "42" "c" "z"}
Parameters
values  Table whose columns are to be shifted
offsets  The offsets by which to shift the input
fill_values  Fill values for indeterminable outputs
mr  Device memory resource used to allocate the returned table and columns' device memory
Returns
Pair containing the tables with each group's key and the columns shifted
Exceptions
cudf::logic_error  If the dtype of fill_values[i] does not match the dtype of column i of values
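
The index rule for shift() can be checked against a short host-side reference. The sketch below is illustrative only and makes several assumptions: grouped_shift is a hypothetical name, a single int key column and a single int value column are used, and a plain int stands in for the fill scalar. Groups are emitted in ascending key order, whereas the documented group order is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical host-side reference; not part of libcudf.
// Returns (grouped keys, shifted values). Within a group of size group_size,
// output element i is input element i - offset, or fill_value when
// i - offset falls outside [0, group_size).
inline std::pair<std::vector<int>, std::vector<int>> grouped_shift(
  std::vector<int> const& keys, std::vector<int> const& values,
  std::ptrdiff_t offset, int fill_value)
{
  assert(keys.size() == values.size());
  std::size_t const n = keys.size();

  // Stable sort of row indices: equal keys become adjacent, original order is kept per group.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  std::vector<int> out_keys;
  std::vector<int> out_values;
  std::size_t begin = 0;
  while (begin < n) {
    // Find the end of the current group in the sorted order.
    std::size_t end = begin;
    while (end < n && keys[order[end]] == keys[order[begin]]) { ++end; }
    auto const group_size = static_cast<std::ptrdiff_t>(end - begin);

    for (std::ptrdiff_t i = 0; i < group_size; ++i) {
      std::ptrdiff_t const src = i - offset;
      bool const in_group      = (src >= 0) && (src < group_size);
      out_keys.push_back(keys[order[begin + static_cast<std::size_t>(i)]]);
      out_values.push_back(
        in_group ? values[order[begin + static_cast<std::size_t>(src)]] : fill_value);
    }
    begin = end;
  }
  return {out_keys, out_values};
}
```

With the first value column of the example and an offset of 2, the group holding 3, 1, 7 becomes fill, fill, 3; with an offset of -2 it becomes 7, fill, fill.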

The documentation for this class was generated from the following file: