-
-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[aggr-] allow ranking rows by key column #2417
Open
midichef
wants to merge
7
commits into
saulpw:develop
Choose a base branch
from
midichef:aggr_rank
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
fad6a09
[aggr-] cap runtime when formatting memo status
midichef 9e37ffd
[aggr-] fix chooser lacking aggs starting with 'p'
midichef b41afba
[aggr-] display stdev error note for lists of size 1
midichef ec28446
[aggr-] add rank aggregator, cmds addcol-aggregate/sheetrank
midichef d35738b
Merge branch 'develop' into aggr_rank
saulpw 20ba354
[aggr-] undo accidental reversion of using custom stdev()
midichef 9ed9f4b
[aggr-] improve errors relating to rank with key columns
midichef File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
#!vd -p | ||
{"sheet": "global", "col": null, "row": "disp_date_fmt", "longname": "set-option", "input": "%b %d, %Y", "keystrokes": "", "comment": null} | ||
{"longname": "open-file", "input": "sample_data/test.jsonl", "keystrokes": "o"} | ||
{"sheet": "test", "col": "key2", "row": "", "longname": "key-col", "input": "", "keystrokes": "!", "comment": "toggle current column as a key column"} | ||
{"sheet": "test", "col": "key2", "row": "", "longname": "addcol-aggregate", "input": "count", "comment": "add column(s) with aggregator of rows grouped by key columns"} | ||
{"sheet": "test", "col": "qty", "row": "", "longname": "type-float", "input": "", "keystrokes": "%", "comment": "set type of current column to float"} | ||
{"sheet": "test", "col": "qty", "row": "", "longname": "addcol-aggregate", "input": "rank sum", "comment": "add column(s) with aggregator of rows grouped by key columns"} | ||
{"sheet": "test", "col": "qty_sum", "row": "", "longname": "addcol-sheetrank", "input": "", "comment": "add column with the rank of each row based on its key columns"} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
key2 key2_count key1 qty qty_rank qty_sum test_sheetrank amt | ||
foo 2 2016-01-01 11:00:00 1.00 1 31.00 5 | ||
0 2016-01-01 1:00 2.00 1 66.00 2 3 | ||
baz 3 4.00 1 292.00 4 43.2 | ||
#ERR 0 #ERR #ERR 1 0.00 1 #ERR #ERR | ||
bar 2 2017-12-25 8:44 16.00 2 16.00 3 .3 | ||
baz 3 32.00 2 292.00 4 3.3 | ||
0 2018-07-27 4:44 64.00 2 66.00 2 9.1 | ||
bar 2 2018-07-27 16:44 1 16.00 3 | ||
baz 3 2018-07-27 18:44 256.00 3 292.00 4 .01 | ||
foo 2 2018-10-20 18:44 30.00 2 31.00 5 .01 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
import itertools | ||
|
||
from visidata import Sheet, ListAggregator, SettableColumn | ||
from visidata import vd, anytype, asyncthread | ||
|
||
class RankAggregator(ListAggregator): | ||
''' | ||
Ranks start at 1, and each group's rank is 1 higher than the previous group. | ||
When elements are tied in ranking, each of them gets the same rank. | ||
''' | ||
def aggregate(self, col, rows) -> [int]: | ||
return self.aggregate_list(col, rows) | ||
|
||
def aggregate_list(self, col, rows) -> [int]: | ||
if not col.sheet.keyCols: | ||
vd.error('ranking requires one or more key columns') | ||
return self.rank(col, rows) | ||
|
||
def rank(self, col, rows): | ||
if col.keycol: | ||
vd.warning('rank aggregator is uninformative for key columns') | ||
# compile row data, for each row a list of tuples: (group_key, rank_key, rownum) | ||
rowdata = [(col.sheet.rowkey(r), col.getTypedValue(r), rownum) for rownum, r in enumerate(rows)] | ||
# sort by row key and column value to prepare for grouping | ||
try: | ||
rowdata.sort() | ||
except TypeError as e: | ||
vd.fail(f'elements in a ranking column must be comparable: {e.args[0]}') | ||
rowvals = [] | ||
#group by row key | ||
for _, group in itertools.groupby(rowdata, key=lambda v: v[0]): | ||
# within a group, the rows have already been sorted by col_val | ||
group = list(group) | ||
# rank each group individually | ||
group_ranks = rank_sorted_iterable([col_val for _, col_val, rownum in group]) | ||
rowvals += [(rownum, rank) for (_, _, rownum), rank in zip(group, group_ranks)] | ||
# sort by unique rownum, to make rank results match the original row order | ||
rowvals.sort() | ||
rowvals = [ rank for rownum, rank in rowvals ] | ||
return rowvals | ||
|
||
vd.aggregators['rank'] = RankAggregator('rank', anytype, helpstr='list of ranks, when grouping by key columns', listtype=int) | ||
|
||
def rank_sorted_iterable(vals_sorted) -> [int]: | ||
'''*vals_sorted* is an iterable whose elements form one group. | ||
The iterable must already be sorted.''' | ||
|
||
ranks = [] | ||
val_groups = itertools.groupby(vals_sorted) | ||
for rank, (_, val_group) in enumerate(val_groups, 1): | ||
for _ in val_group: | ||
ranks.append(rank) | ||
return ranks | ||
|
||
@Sheet.api | ||
@asyncthread | ||
def addcol_sheetrank(sheet, rows): | ||
''' | ||
Each row is ranked within its sheet. Rows are ordered by the | ||
value of their key columns. | ||
''' | ||
if not sheet.keyCols: | ||
vd.error('ranking requires one or more key columns') | ||
colname = f'{sheet.name}_sheetrank' | ||
c = SettableColumn(name=colname, type=int) | ||
sheet.addColumnAtCursor(c) | ||
rowkeys = [(sheet.rowkey(r), rownum) for rownum, r in enumerate(rows)] | ||
rowkeys.sort() | ||
ranks = rank_sorted_iterable([rowkey for rowkey, rownum in rowkeys]) | ||
row_ranks = sorted(zip((rownum for _, rownum in rowkeys), ranks)) | ||
row_ranks = [rank for rownum, rank in row_ranks] | ||
c.setValues(sheet.rows, *row_ranks) | ||
|
||
Sheet.addCommand('', 'addcol-sheetrank', 'sheet.addcol_sheetrank(rows)', 'add column with the rank of each row based on its key columns') |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a separate
listtype
? Seems like we could just use the sameaggr.type
in both cases, and remove thisisinstance
(which is usually a code smell for me).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way list aggregators work now, there is a need for two distinct types,
type
andlisttype
.type
is for the result of the aggregator. For example, this is used bymemo-aggregate
. That's whytype
isanytype
forListAggregators
. This type would be used whenever we want to hold the entire result (a list) in a cell.But we also need a separate type for the elements of the list. This is for when the aggregator result goes in a column, like for
addcol-aggregate
, where each cell holds not the result itself, but an element of the list result.If I try to get rid of one or the other types, I run into problems. For
RankAggregator
, if I get rid of thelisttype=int
switch totype=int
, I get an error in the statusbar forz+
rank
:'''
text_rank=int() argument must be a string, a bytes-like object or a real number, not 'list'
'''
But if instead I make
RankAggregator
usetype=anytype
, the column added byaddcol-aggregate
rank
does not get the typeint
.The need for two types is awkward. And I see your point about
isinstance
being a code smell. (That is a helpful heuristic, and I'll use it in the future.) It's accurately pointing out strain in the design: most aggregators produce a single value, list aggregators produce a list.Maybe
rank
should not be an aggregator. It's unlikely people want a list object holding the ranks. Most people want a column holding the ranks. What if we replaceaddcol-aggregate
+rank
with an equivalent commandaddcol-grouprank
(in addition to the existingaddcol-sheetrank
)? And we would reserveaddcol-aggregate
for finding group values likesum
,mean
,median
, as you suggested earlier. What do you think?(I would also consider changing the name
addcol-aggregate
. Maybe toaddcol-group-aggregate
.)