ENH: Add smart_groupby() method for automatic grouping by categorical columns and aggregating numerics #61420

Closed
1 of 3 tasks
rit4rosa opened this issue May 9, 2025 · 3 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs), Enhancement, Groupby

Comments

@rit4rosa

rit4rosa commented May 9, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pandas.DataFrame.groupby() requires users to explicitly specify both the grouping columns and the aggregation functions. This can be repetitive and inefficient, especially during exploratory data analysis on large DataFrames with many columns. A common use case like “group by all categorical columns and compute the mean of numeric columns” requires verbose, manual setup.

Feature Description

Add a new method to DataFrame called smart_groupby(), which intelligently infers grouping and aggregation behavior based on the column types of the DataFrame.

Proposed behavior:

  • If no parameters are passed:
    • Group by all columns of type object, category, or bool
    • Aggregate all remaining numeric columns using the mean
  • Optional keyword parameters:
    • by: specify grouping columns explicitly
    • agg: specify aggregation function(s) (default is "mean")
    • exclude: exclude specific columns from grouping or aggregation
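The proposed behavior can be sketched as a standalone helper. This is only an illustration under stated assumptions: smart_groupby here is a hypothetical free function, not a pandas API, and passing observed=True as well as raising an error when no grouping column is found are choices made for the sketch rather than part of the proposal.

```python
import pandas as pd


def smart_groupby(df, by=None, agg="mean", exclude=None):
    """Hypothetical helper sketching the proposal; not part of pandas."""
    excluded = set(exclude or [])
    if by is None:
        # Infer grouping keys from object, category, and bool columns.
        by = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
    elif isinstance(by, str):
        by = [by]
    else:
        by = list(by)
    by = [col for col in by if col not in excluded]
    if not by:
        raise ValueError("no grouping columns found")  # assumption of this sketch
    # Aggregate numeric columns that are neither grouping keys nor excluded.
    numeric = df.select_dtypes(include="number").columns
    values = [col for col in numeric if col not in by and col not in excluded]
    # observed=True (an assumption) keeps only category combinations present in the data.
    return df.groupby(by, observed=True)[values].agg(agg)


# Made-up data for illustration only.
df = pd.DataFrame({
    "team": pd.Categorical(["a", "a", "b"]),
    "active": [True, True, False],
    "score": [1.0, 3.0, 5.0],
    "age": [10, 20, 30],
})
print(smart_groupby(df))
print(smart_groupby(df, by="team", agg="sum", exclude=["age"]))
```

With no arguments the sketch groups by team and active and averages score and age; the second call groups by team only and sums score.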

Alternative Solutions

Currently, users must write more verbose code to get the same result:

import pandas as pd

# Group by every categorical column, then average every numeric column.
group_cols = [col for col in df.columns if df[col].dtype == "category"]
agg_cols = [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])]
df.groupby(group_cols)[agg_cols].mean()
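One detail worth checking when adapting this snippet: pd.api.types.is_numeric_dtype treats bool columns as numeric, so a bool column ends up among the aggregated columns, whereas the proposed default would group by it. The sketch below uses made-up data to compare that check with select_dtypes(include="number"), which leaves bool out. The column names and observed=True are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "city": pd.Categorical(["Lisbon", "Porto", "Lisbon"]),
    "promo": [True, False, True],
    "sales": [10.0, 20.0, 30.0],
})

# Column selection as in the snippet above.
group_cols = [col for col in df.columns if df[col].dtype == "category"]
agg_cols = [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])]

# Alternative selection that does not count bool as numeric.
number_cols = df.select_dtypes(include="number").columns.tolist()

print(group_cols)   # grouping keys
print(agg_cols)     # includes the bool column "promo"
print(number_cols)  # excludes the bool column

# observed=True is an assumption; it keeps only categories present in the data.
means = df.groupby(group_cols, observed=True)[number_cols].mean()
print(means)
```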

Additional Context

No response

@rit4rosa added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 9, 2025
@rit4rosa
Author

rit4rosa commented May 9, 2025

I would like to work on this feature if you agree.

@mroeschke
Member

Thanks for the suggestion, but I would be -1 on including this in pandas. pandas is moving toward explicit and less automatic behaviors, and the snippet you posted is short enough to be wrapped in a custom helper function.

@mroeschke added the Closing Candidate (may be closeable, needs more eyeballs) label and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on May 9, 2025
@rhshadrach
Member

Agreed @mroeschke. Closing.
