tacl diff is better done as ratios #23

ajenhl · 2014-10-16T03:22:41Z

Rather than a binary (not-)unique-to-sub-corpus result, it would be useful to provide a graded one, based on, perhaps, frequency of occurrence. This would largely avoid the issue whereby a single instance of an n-gram in a single witness, among potentially thousands of other texts in a sub-corpus, ensures that the n-gram is excluded from the results despite otherwise appearing solely in another sub-corpus.

ajenhl · 2014-12-31T20:33:18Z

The approach I'm going to take with this is to modify the existing diff functionality in the following ways:

A new column is added to the results, "ratio", which holds the ratio of tokens in all occurrences of the n-gram in the text to the number of tokens in the text.
The "tacl diff" command gains another option, --threshold, which lets the user specify the proportional difference between the highest and second-highest ratios for each n-gram across separate subcorpora that must be reached for that n-gram to be included in the results. This would default to the special value "infinity", which gives the existing behaviour whereby each n-gram must appear in only one subcorpus. (It might also occasionally be useful to specify a value of 0, which would mean that the results contain every n-gram that occurs in any text in any of the subcorpora.)
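A minimal sketch of how the ratio and the threshold test might fit together. It is not tacl's code: the function names `ngram_ratio` and `passes_threshold` are hypothetical, and "proportional difference" is read here as the highest ratio divided by the second highest, which is only one possible interpretation.

```python
import math


def ngram_ratio(count, size, text_tokens):
    """Tokens covered by all occurrences of an n-gram, over the tokens in the text.

    count: occurrences of the n-gram in the text
    size: number of tokens in the n-gram
    text_tokens: total number of tokens in the text
    """
    return count * size / text_tokens


def passes_threshold(ratios_by_subcorpus, threshold=math.inf):
    """Decide whether an n-gram is kept in the diff results.

    ratios_by_subcorpus maps a subcorpus label to the n-gram's ratio there
    (0.0 if it is absent). Assumption: the test is highest / second_highest
    >= threshold.
    """
    ranked = sorted(ratios_by_subcorpus.values(), reverse=True)
    highest = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    if highest == 0:
        return False  # the n-gram occurs nowhere
    if second == 0:
        return True  # occurs in one subcorpus only: passes any threshold
    return highest / second >= threshold
```

Under this reading the default of infinity keeps only n-grams confined to one subcorpus, and a threshold of 0 keeps every n-gram that occurs anywhere, matching the two edge cases described above.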

Further, the report function would gain an extra element, --threshold, which filters the results passed to it in the same way as --threshold above.

Note that the new column would be added to intersect results, and the new report function could also filter those results.

So, by default, nothing would change for the end user (a diff would give exactly the same results). But one could get a fuzzy diff by supplying a numeric value for --threshold, guaranteeing that each n-gram exists predominantly, rather than exclusively, in one subcorpus.

It might be that the threshold needs to apply not to two individual texts' ratios, but to the ratio across all texts within each subcorpus.
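One way the subcorpus-level variant could be computed, as a sketch only: pool the tokens of every text in the subcorpus before dividing. The function name `subcorpus_ratio` and the input shape are assumptions made for illustration.

```python
def subcorpus_ratio(texts, size):
    """Ratio of an n-gram across a whole subcorpus rather than a single text.

    texts: iterable of (ngram_count, text_token_count) pairs, one per text
    size: number of tokens in the n-gram
    """
    texts = list(texts)
    covered = sum(count * size for count, _ in texts)
    total = sum(tokens for _, tokens in texts)
    return covered / total if total else 0.0
```

Pooled this way, a single stray occurrence in one witness is diluted by the size of the rest of the subcorpus instead of standing out as that one text's ratio.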

ajenhl added the enhancement label Oct 16, 2014

ajenhl self-assigned this Oct 16, 2014

ajenhl changed the title ~~tacl diff is better done as ratios~~ tacl diff is better done as ratios Jun 18, 2015
