API: inconsistencies between grouped and grouped[['col']] in groupby #21790

Closed
jorisvandenbossche opened this issue Jul 7, 2018 · 2 comments
Labels: API - Consistency (Internal Consistency of API/Behavior), Bug, Closing Candidate (May be closeable, needs more eyeballs), Groupby

@jorisvandenbossche
Member

On a DataFrameGroupBy object, you can already select the columns you want to apply your functions to, but apart from that, I would not expect the selection to influence the output structure:

In [56]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar'],
    ...:                    'B' : np.random.randn(4),
    ...:                    'C' : np.random.randn(4)})

In [57]: gr = df.groupby('A')

In [58]: gr.agg({'B': {'r': np.sum}, 'C': {'r2': np.sum}})
/home/joris/scipy/pandas/pandas/core/groupby/groupby.py:4650: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[58]: 
            C         B
           r2         r
A                      
bar -0.371496  1.230346
foo  0.091779  2.487224

In [59]: gr[['C', 'B']].agg({'B': {'r': np.sum}, 'C': {'r2': np.sum}})
/home/joris/scipy/pandas/pandas/core/groupby/groupby.py:4650: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[59]: 
           r2         r
A                      
bar -0.371496  1.230346
foo  0.091779  2.487224

So in the above, doing gr[['C', 'B']] causes the final result not to have the MultiIndex, which IMO is very surprising.

The above example is for a deprecated case, so I'm not sure how important it is (I didn't directly see a similar case for a non-deprecated call), but when we clean-up the deprecations, we should certainly check this.
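
For a non-deprecated comparison, here is a minimal sketch using named aggregation (added in pandas 0.25), which took over from the dict-of-dicts renaming. The fixed values in `df` and the variable names `full` and `selected` are illustrative assumptions, not taken from the session above. It checks whether selecting columns first changes the structure of the result:

```python
import pandas as pd

# Illustrative data: same shape as the example above, but with fixed values
# instead of np.random.randn so the result is reproducible.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [10.0, 20.0, 30.0, 40.0],
})
gr = df.groupby("A")

# Named aggregation: output_name=(input_column, aggregation).
full = gr.agg(r=("B", "sum"), r2=("C", "sum"))
selected = gr[["C", "B"]].agg(r=("B", "sum"), r2=("C", "sum"))

# Compare the column structure of both results.
print(full.columns.nlevels, selected.columns.nlevels)
print(full)
print(selected)
```

With this spelling the output column names come from the keyword arguments, so there is no second column level to lose in either call.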

cc @jreback @WillAyd

@WillAyd
Member

WillAyd commented Jul 7, 2018

There's generally some inconsistency in naming here:

In [2]: df = pd.DataFrame([['foo', 1, 1], ['bar', 1, 1]], columns=['A', 'B', 'C'])
In [4]: df.groupby('A').agg({'B': 'sum', 'C': 'min'})
Out[4]: 
     B  C
A        
bar  1  1
foo  1  1

In [5]: df.groupby('A').agg({'B': 'sum', 'C': ['min']})
Out[5]: 
      B   C
    sum min
A          
bar   1   1
foo   1   1

In [5]: df.groupby('A').agg({'B': ['sum'], 'C': ['min']})
Out[5]: 
      B   C
    sum min
A          
bar   1   1
foo   1   1

In [6]: df.groupby('A')[['B', 'C']].agg({'B': 'sum', 'C': 'min'})
Out[6]: 
     B  C
A        
bar  1  1
foo  1  1

In [8]: df.groupby('A')[['B', 'C']].agg({'B': 'sum', 'C': ['min']})
Out[8]: 
      B   C
    sum min
A          
bar   1   1
foo   1   1

In [8]: df.groupby('A')[['B', 'C']].agg({'B': ['sum'], 'C': ['min']})
Out[8]: 
      B   C
    sum min
A          
bar   1   1
foo   1   1

I'm under the belief that all of these should return a MultiIndex so as not to lose sight of the aggregation performed on individual columns while also ensuring a consistent return value and simpler coding. I hope to have a PR soon.
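
The pattern in the outputs above can be summarized as a rough sketch that records the number of column levels for each spec, with and without the column selection. The labels in `specs` and the `levels` dict are illustrative names only:

```python
import pandas as pd

df = pd.DataFrame([["foo", 1, 1], ["bar", 1, 1]], columns=["A", "B", "C"])

# The three spec shapes from the session above (labels are illustrative).
specs = {
    "all scalars": {"B": "sum", "C": "min"},
    "mixed": {"B": "sum", "C": ["min"]},
    "all lists": {"B": ["sum"], "C": ["min"]},
}

levels = {}
for label, spec in specs.items():
    plain = df.groupby("A").agg(spec)
    selected = df.groupby("A")[["B", "C"]].agg(spec)
    # Store (levels without selection, levels with selection).
    levels[label] = (plain.columns.nlevels, selected.columns.nlevels)
    print(label, levels[label])
```

In these examples the column selection makes no difference; only the presence of a list in the spec decides whether the result has flat or MultiIndex columns.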

@mroeschke added the API - Consistency and Bug labels and removed the API Design label on Jun 20, 2021
@rhshadrach
Member

when we clean-up the deprecations, we should certainly check this.

I've toyed around with similar examples, and I am not seeing inconsistent results.

I'm under the belief that all of these should return a MultiIndex so as not to lose sight of the aggregation performed on individual columns while also ensuring a consistent return value and simpler coding.

I see there is value in keeping track of the aggregation performed, but I don't think df.groupby('A')[['B', 'C']].agg({'B': 'sum', 'C': 'min'}) should have a second level. I think the expectation that the result should be a "concat of df.groupby('A')['B'].sum() and df.groupby('A')['C'].min()" should take precedence, and we should only add an additional level when necessary to differentiate results. MultiIndex columns can be quite difficult to work with.
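
As a minimal sketch of that expectation, the following compares the dict-based `agg` call with a manual concatenation of the per-column results, using `min` for column C to match the `agg` spec. The variable names `via_agg` and `via_concat` are illustrative:

```python
import pandas as pd

df = pd.DataFrame([["foo", 1, 1], ["bar", 1, 1]], columns=["A", "B", "C"])

# Result of the dict-based aggregation with scalar specs.
via_agg = df.groupby("A")[["B", "C"]].agg({"B": "sum", "C": "min"})

# The same result assembled column by column.
via_concat = pd.concat(
    [df.groupby("A")["B"].sum(), df.groupby("A")["C"].min()],
    axis=1,
)

# Raises AssertionError if the two frames differ in values, index, or columns.
pd.testing.assert_frame_equal(via_agg, via_concat)
print(via_agg)
```

The two frames match only because the scalar spec yields flat columns; with a second level on the `agg` result, the comparison would fail.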

I'll mark this as a closing candidate for now.

@rhshadrach added the Closing Candidate label on Jul 15, 2023