tacl diff is better done as ratios #23

Open
ajenhl opened this issue Oct 16, 2014 · 1 comment

ajenhl commented Oct 16, 2014

Rather than a binary result (an n-gram either is or is not unique to a sub-corpus), it would be useful to provide a graded result, based on, perhaps, frequency of occurrence. This would largely avoid the problem whereby a single instance of an n-gram in a single witness, among potentially thousands of other texts in a sub-corpus, keeps that n-gram out of the results even though it otherwise appears solely in another sub-corpus.

@ajenhl ajenhl self-assigned this Oct 16, 2014

ajenhl commented Dec 31, 2014

The approach I'm going to take with this is to modify the existing diff functionality in the following ways:

  • A new column is added to the results, "ratio", which holds the ratio of tokens in all occurrences of the n-gram in the text to the number of tokens in the text.
  • The "tacl diff" command gains another option, --threshold, which lets the user specify the proportional difference between the highest and second highest ratios for each n-gram in separate subcorpora that must be reached for that n-gram to be included in the results. This would default to the special value of "infinity", which gives the existing behaviour whereby each n-gram must appear in only one subcorpus. (It might also occasionally be useful to specify a value of 0, which would mean that the results contain every n-gram that occurs in any text in any of the subcorpora.)
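
The two points above can be sketched roughly as follows. This is only an illustration, not tacl code: the function names are hypothetical, "proportional difference" is assumed to mean the relative gap (highest - second) / second, and the token count of an n-gram is approximated as count × size, which ignores overlapping occurrences.

```python
import math


def ngram_ratio(count, ngram_size, text_token_count):
    """Tokens in all occurrences of an n-gram divided by tokens in the text.

    Assumption: occurrences do not overlap, so they cover count * size tokens.
    """
    return count * ngram_size / text_token_count


def proportional_difference(ratios_by_subcorpus):
    """Relative gap between the highest and second highest subcorpus ratios.

    Assumes at least one ratio is non-zero (the n-gram occurs somewhere).
    Returns infinity when the n-gram occurs in only one subcorpus.
    """
    ordered = sorted(ratios_by_subcorpus.values(), reverse=True)
    highest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0.0
    if second == 0:
        return math.inf
    return (highest - second) / second


def passes_threshold(ratios_by_subcorpus, threshold=math.inf):
    """True if the n-gram is dominant enough to be kept in the diff results."""
    return proportional_difference(ratios_by_subcorpus) >= threshold


# Hypothetical figures: a bigram found 10 times in a 1000-token text labelled
# "A" and once in a 2000-token text labelled "B".
ratios = {"A": ngram_ratio(10, 2, 1000), "B": ngram_ratio(1, 2, 2000)}
print(passes_threshold(ratios))      # default "infinity": strict diff
print(passes_threshold(ratios, 5))   # fuzzy diff
```

Under these assumptions the default threshold reproduces the current strict diff (only an n-gram absent from every other subcorpus has an infinite gap), and a threshold of 0 admits every n-gram.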

Further, the report function would gain an extra option, --threshold, which filters the results passed to it in the same way as the diff option above.

Note that the new column would also be added to intersect results, so the new report option could filter those results too.

So, by default, nothing would change for the end user (a diff would give exactly the same results). But one could get a fuzzy diff by supplying a numeric value for --threshold, which would guarantee that each n-gram in the results occurs predominantly, rather than exclusively, in one subcorpus.
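
As a rough sketch of how such a filter might behave on a set of results: the row layout here ("ngram", "label", "ratio" keys) and the function name are hypothetical, the gap is again assumed to be (highest - second) / second, and each label's ratio is taken as the highest per-text ratio under that label. Whether the minority label's rows should be kept for a surviving n-gram is not settled above; this sketch keeps them.

```python
import math
from collections import defaultdict


def filter_by_threshold(rows, threshold=math.inf):
    """Keep the rows of n-grams whose leading label dominates by >= threshold."""
    # Highest per-text ratio seen for each n-gram under each label.
    best = defaultdict(dict)
    for row in rows:
        label_ratios = best[row["ngram"]]
        current = label_ratios.get(row["label"], 0.0)
        label_ratios[row["label"]] = max(current, row["ratio"])

    keep = set()
    for ngram, label_ratios in best.items():
        ordered = sorted(label_ratios.values(), reverse=True)
        second = ordered[1] if len(ordered) > 1 else 0.0
        gap = math.inf if second == 0 else (ordered[0] - second) / second
        if gap >= threshold:
            keep.add(ngram)
    return [row for row in rows if row["ngram"] in keep]


# Hypothetical results rows, one per (n-gram, text).
rows = [
    {"ngram": "ab", "label": "A", "ratio": 0.02},
    {"ngram": "ab", "label": "B", "ratio": 0.001},
    {"ngram": "cd", "label": "A", "ratio": 0.01},
    {"ngram": "ef", "label": "A", "ratio": 0.01},
    {"ngram": "ef", "label": "B", "ratio": 0.01},
]
strict = filter_by_threshold(rows)       # existing behaviour
fuzzy = filter_by_threshold(rows, 5)     # fuzzy diff
```

With the default, only "cd" survives, since it is the only n-gram confined to one label. With a threshold of 5, "ab" is also kept because its single stray occurrence under "B" is far outweighed by its ratio under "A".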

It might be that the threshold needs to apply not to two individual texts' ratios, but to the ratio across all texts within each subcorpus.
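
To illustrate the difference between the two readings, here is a small sketch with made-up figures. The function name and the pooled definition (total n-gram tokens over total tokens for the whole subcorpus) are assumptions, not something decided in this issue.

```python
def subcorpus_ratio(texts):
    """Pooled ratio for one subcorpus.

    texts is a list of (ngram_tokens, text_tokens) pairs, one per text.
    """
    total_ngram_tokens = sum(ngram for ngram, _ in texts)
    total_tokens = sum(size for _, size in texts)
    return total_ngram_tokens / total_tokens if total_tokens else 0.0


# Hypothetical subcorpus: the n-gram covers 2 tokens of one short witness
# and is absent from nine longer texts.
subcorpus_b = [(2, 100)] + [(0, 1000)] * 9

highest_text_ratio = max(n / size for n, size in subcorpus_b)  # per-text view
pooled_ratio = subcorpus_ratio(subcorpus_b)                    # subcorpus view
print(highest_text_ratio, pooled_ratio)
```

Here the per-text view sees a ratio of 0.02 from the single short witness, while the pooled view dilutes it across the whole subcorpus (2 tokens out of 9100), so the same stray occurrence counts for much less against a threshold.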

@ajenhl ajenhl changed the title tacl diff is better done as ratios tacl diff is better done as ratios Jun 18, 2015