Rather than a binary (not-)unique-to-sub-corpus, it would be useful to provide a graded result, based on, perhaps, frequency of occurrence. This would at least mostly avoid the issue whereby a single instance of an n-gram in a single witness among potentially thousands of other texts in a sub-corpus will ensure that that n-gram does not occur in the results, despite otherwise appearing solely in another sub-corpus.
The approach I'm going to take with this is to modify the existing diff functionality in the following ways:
A new column is added to the results, "ratio", which holds the ratio of tokens in all occurrences of the n-gram in the text to the number of tokens in the text.
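As a rough illustration of the proposed "ratio" column, here is a minimal sketch. The function name and parameters are hypothetical, not part of tacl, and it assumes occurrences are counted without adjusting for overlap, so each occurrence contributes the full n-gram size in tokens.

```python
def ngram_ratio(count: int, ngram_size: int, text_token_count: int) -> float:
    """Share of a text's tokens that fall inside occurrences of one n-gram.

    Hypothetical helper: count is the number of occurrences of the n-gram
    in the text, ngram_size is the number of tokens in the n-gram.
    Overlapping occurrences are not corrected for (an assumption).
    """
    if text_token_count <= 0:
        raise ValueError("text_token_count must be positive")
    return (count * ngram_size) / text_token_count


# A 3-gram occurring 4 times in a 1,200-token text covers 12 tokens.
print(ngram_ratio(4, 3, 1200))  # 0.01
```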
The "tacl diff" command gains another option, --threshold, which allows the user to specify the proportional difference between the highest and second highest ratios for each n-gram within separate subcorpora that must be reached for that n-gram to be included in the results. This would default to the special value of "infinity", which would lead to the existing behaviour whereby each n-gram must appear only in one subcorpus. (It might also be occasionally useful to specify a value of 0, which would mean that the results contained every n-gram that occurred in any text in any of the subcorpora.)
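One possible reading of the threshold test, sketched below, treats the "proportional difference" as the highest ratio divided by the second highest, with a missing or zero second ratio counting as infinity. That reading, the function name, and the input shape (one best ratio per subcorpus label) are all assumptions made for illustration, not tacl's implementation.

```python
import math
from typing import Dict


def passes_threshold(ratios_by_label: Dict[str, float],
                     threshold: float = math.inf) -> bool:
    """Decide whether one n-gram survives the fuzzy diff.

    ratios_by_label maps each subcorpus label to the best ratio the
    n-gram reaches in that subcorpus (hypothetical input shape).
    """
    ranked = sorted(ratios_by_label.values(), reverse=True)
    highest = ranked[0] if ranked else 0.0
    second = ranked[1] if len(ranked) > 1 else 0.0
    if highest <= 0:
        return False  # the n-gram does not occur anywhere
    # Assumed definition: how many times larger the top ratio is.
    proportion = math.inf if second == 0 else highest / second
    return proportion >= threshold


# Default (infinity): only n-grams found in a single subcorpus pass.
print(passes_threshold({"A": 0.01, "B": 0.0}))                   # True
print(passes_threshold({"A": 0.01, "B": 0.0001}))                # False
# A finite threshold tolerates a stray occurrence elsewhere.
print(passes_threshold({"A": 0.01, "B": 0.0001}, threshold=50))  # True
# A threshold of 0 keeps every n-gram that occurs at all.
print(passes_threshold({"A": 0.01, "B": 0.009}, threshold=0))    # True
```

Under this reading the default reproduces the current strict diff, because a finite proportion can never reach infinity.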
Further, the report function would gain an extra element, --threshold, which filters the results passed to it in the same way as --threshold above.
Note that the new column would be added to intersect results, and the new report function could also filter those results.
So, by default, nothing would change for the end user (a diff would give exactly the same results). But one could get a fuzzy diff by proposing a numeric value for --threshold, guaranteeing that each n-gram exists in one subcorpus preponderantly, rather than entirely.
It might be that there is a need to allow for the threshold to apply not to two individual texts' ratios, but to the ratio across all texts within each subcorpus.
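A subcorpus-level ratio could, for instance, pool the counts before dividing, as in the sketch below. The function name and the (occurrence count, token count) input pairs are hypothetical, and pooling is only one way to aggregate; averaging per-text ratios would weight short texts differently.

```python
from typing import Iterable, Tuple


def pooled_ratio(texts: Iterable[Tuple[int, int]], ngram_size: int) -> float:
    """Ratio of one n-gram across a whole subcorpus.

    texts yields (occurrence_count, text_token_count) pairs, one per
    text in the subcorpus (hypothetical input shape).
    """
    covered = 0
    total = 0
    for count, token_count in texts:
        covered += count * ngram_size
        total += token_count
    return covered / total if total else 0.0


# One stray 2-gram in a 500-token witness, in a subcorpus of 1,000,000
# tokens overall, gives a very small pooled ratio.
print(pooled_ratio([(1, 500), (0, 999_500)], ngram_size=2))  # 2e-06
```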
ajenhl changed the title to "tacl diff is better done as ratios" on Jun 18, 2015