-
Notifications
You must be signed in to change notification settings - Fork 9
/
CHANGES
406 lines (244 loc) · 13.2 KB
/
CHANGES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
5.3.0 2023-09-06 Jamie Norrish <[email protected]>
* Fixed Report issue that prevented non-tacl assets from being
copied to an output directory.
* Updated docs build requirements.
5.2.0 2023-06-15 Jamie Norrish <[email protected]>
* Fixed catalogue relabelling to not incorrectly delete relabelled
works when mapped to an existing label that is not included in the
relabel map.
5.1.0 2023-06-14 Jamie Norrish <[email protected]>
* Modified results reduce to properly handle the same witness under
different labels, namely to treat them as entirely distinct.
* Changed from deprecated pkg_resources code to importlib.resources
for handling package files.
* Changed from deprecated Bio.pairwise2 to Bio.Align.
* Added logging statement when about to output CSV results from tacl
results.
5.0.6 2023-05-10 Jamie Norrish <[email protected]>
* Reworked build configuration.
* Updated corpus preparation for Taisho, both markup handling and
further splits.
5.0.5 2022-09-24 Jamie Norrish <[email protected]>
* Improved performance of denormalisation sufficiently to be useful
in real cases.
5.0.4 2022-08-06 Jamie Norrish <[email protected]>
* Updated templating to work with recent versions of Jinja2.
* Changed DataFrame.append to pd.concat when adding rows to a
DataFrame, to suit recent versions of pandas.
* Added the name of the missing label to the error message when
referencing a non-existent label in a catalogue.
* Added split of T1646 into "pin" divs.
* Updated build process.
* Fixed unwanted inclusion of text material within apparatus
criticus markers.
* Added ability to specify a pattern of works to return from a Corpus.
* Allowed catalogues not to be sorted when saved.
5.0.3 2022-03-31 Jamie Norrish <[email protected]>
* Removed characters that are significant within regular expressions
when preparing CBETA TEI texts. These are used within regular
expression contexts and so are not safe.
5.0.2 2022-02-26 Jamie Norrish <[email protected]>
* Rebuild wheel from clean.
* No source changes.
5.0.1 2022-02-24 Jamie Norrish <[email protected]>
* Fixed bug when preparing a CBETA TEI text that includes a tei:g
within anchors demarcating an apparatus criticus entry, such that
the tei:g was included after the tei:app.
* Added --version to tacl command to print out the version number.
* Bumped minimum Python version to 3.8.
* Added split of T0328 into verse and prose sub-works.
* Added raising an error when building a database from a catalogue
when a referenced work is not in the corpus.
* Removed all instances of characters that are illegal in Windows
filenames when preparing CBETA TEI texts.
5.0.0 2021-09-27 Jamie Norrish <[email protected]>
* Updated Corpus API and its use within DataStore to allow for
user-provided Text subclasses when getting witnesses.
* Changed some methods of Text and WitnessText to be properties:
get_content is now content, get_names is now siglum and work.
* Updated the process for generating a corpus from CBETA TEI XML
files. Certain parts of texts are extracted into new texts based
on the markup-defined properties (such as containing div/@type).
* Added the tacl split command to split texts based on user-supplied
configuration files.
* Added the tacl join-works command to join two or more prepared TEI
XML works together into a new work.
* Added the tacl query command to run supplied SQL commands on the
data store with supplied parameters.
* Added mulu title to prepared filename for extracted div where one
exists.
* Improved updating of the database when adding n-grams: all
witnesses that no longer exist in the corpus will be deleted.
* Added handling of some errors in order to provide useful error
messages.
* Reimplemented tacl results' extend operation to be faster and use
less RAM. Its determination of whether the initial results are
intersect results or not (and thus whether the results are
automatically run through reciprocal remove) has become
stricter. Only those results with more than one label and where
every n-gram occurs in every label now count as intersect results.
* Added the tacl normalise command, to normalise a corpus according
to a user-supplied mapping. This is an as yet unoptimised feature.
* Added options to tacl results to denormalise a set of
results. This is an as yet unoptimised feature.
4.2.0 2018-12-04 Jamie Norrish <[email protected]>
* Added option to restrict Results prune requirements to a labelled
subset of results.
* Added a lifetime report, showing details about in which labelled
corpora n-grams occurred in.
* Moved the jitc command to the tacl-extra repository.
* Renamed the tacl.command package to tacl.cli.
4.1.0 2018-10-18 Jamie Norrish <[email protected]>
* Added remove_label method to Catalogue.
* Added method to relable a catalogue.
* Added --relabel option to tacl results, to relabel results
according to a su pplied catalogue.
* Added stripping of cb:mulu contents from prepared TEI.
* Modified Results to also accept a pandas DataFrame as well as a
path or buffer of results.
* Added get_works_by_label method to Catalogue.
* Modified tacl search to allow both for multiple n-gram files and
no n-gram f ile, in which case all n-grams are returned.
* Added get_works method to Corpus to return a list of work names.
4.0.3 2018-05-01 Jamie Norrish <[email protected]>
* Removed obsolete reference to tacl-helper command in setup.py.
4.0.2 2018-04-21 Jamie Norrish <[email protected]>
* Corrected release date for version 4.0.0.
4.0.1 2018-04-21 Jamie Norrish <[email protected]>
* UNRELEASED
* Updated references to documentation, pointing now to
tacl.readthedocs.io.
4.0.0 2018-04-21 Jamie Norrish <[email protected]>
* Refactored search to output results with the same columns as for
diff/intersect, and added grouping options to tacl results.
* Added add-label-work-count option to tacl results to add a column
to results giving the count of works per label.
* Moved label-count from tacl-helper to tacl results as the option
--add-label-count.
* Added tacl excise command (and excise method on Text) to remove
n-grams from a text.
* Added --excise option to tacl results, to remove any results whose
n-gram contains the supplied n-gram.
* Removed tacl-helper command. Its remaining functionality could be
duplicated through simple scripts (mostly concatenation of tacl
commands).
* Added reference to tacl-catalogue-manager project in
documentation.
* Removed support for CBETA 2011 files.
* Changed witness handling so that all witnesses are explicit.
* Added "latin" tokenizer for whitespace delimited sequences of
non-punctuation characters.
* Added check for each Result method that the supplied results have
the required columns.
* Improved error handling when tacl results is supplied either
results without the necessary columns (eg, after
collapse-witnesses), or an empty set of results.
3.0.0 2016-12-12 Jamie Norrish <[email protected]>
* Moved useful script functions to tacl.command.utils, and added
this to the API documentation.
* Added colour to logging output from the command line scripts.
* Added validate-catalogue as a tacl-helper subcommand, to validate
a catalogue file with respect to a corpus.
* Added label-count as a tacl-helper subcommand, to add a column to
results with a count within a label for each n-gram.
* Added -c/--catalogue option to tacl ngrams, to limit the texts
added to a database to those labelled in a catalogue.
* Added experimental command, tacl-jitc, to generate a report on
overlap between pairs of texts in a sub-corpus, against the
background of a second sub-corpus.
* Modified tacl highlight command to allow for either a heatmap
display (based on a results file) or a simple highlight (based on
a file of n-grams).
* Added support for preparing and stripping CBETA TEI files from
their GitHub repository. This also changes the XML expected by the
markup stripping operation.
* Renamed "tacl report" to "tacl results", and tacl.Report to
tacl.Results. (#46)
* Added --ngrams option to tacl results, to exclude results whose
n-gram occurs in the supplied list of n-grams.
* Added --min-count-text and --max-count-text options to tacl
results, to filter results that do not have at least one text
carrying an n-gram with a count within the specified range.
* Added bifurcated-extend options to "tacl results", to
generate results containing n-grams extended from the provided
results, but including only those that occur at sizes that mark a
bifurcation (change in label count between an n-gram and its
containing (n+1)-grams).
* Clarified the language used through code, comments, and docs, to
distinguish clearly between a "work" (abstract 'text', such as
T0220, distinct from any particular expression in a witness), a
"witness" (particular expression of a work), and "text" (the
actual words etc). This has changed the names of some command line
options to tacl results, and the API (Corpus.get_text is now
Corpus.get_witness, for example). The database schema has also
changed, meaning that databases must be recreated from
scratch. Any existing results will need to have the "text name"
header in the first line changed to "work". (#50)
* Added display of aligned sequences in text order of one of the
texts to tacl align. (#32)
2.3.2 2016-02-26 Jamie Norrish <[email protected]>
* Added autogenerated API documentation.
* Added convenience method to Text to get the tokenized text as a
string.
* Expanded test of extend to cover more cases.
2.3.1 2016-01-06 Jamie Norrish <[email protected]>
* Fixed bug that meant extended diff results were immediately
removed after being generated. (#44)
* Noted on the tacl report help the reason for --extend coming
before --reduce when both options are specified.
2.3.0 2015-12-23 Jamie Norrish <[email protected]>
* Added automatic removal of filler n-grams from diff results. Only
n-grams that are entirely composed of (n-1)-grams, or which do not
overlap with any (n-1)-grams, are kept. This then allows for
reduce and extend to work correctly on diff results.
As a result of this change, the API for all difference queries has
changed.
* Add example invocations of tacl subcommands. (#30)
2.2.0 2015-11-20 Jamie Norrish <[email protected]>
* Added basic formatting of highlighted text to match that of the
stripped text.
* Changed calls to pandas' to_csv method to keep up with current API
(use 'columns' instead of 'cols').
* Changed call to pandas' sort_index to sort_values to keep up with
current API.
* Specified minimum version of pandas (0.17.0) in setup.py.
* Changed arguments to sdiff and sintersect subcommands, to work
around Python bug https://bugs.python.org/issue9338
* Changed tacl highlight and align to use Jinja2 templates. This
paves the way for the HTML/CSS/JS-rich output of future scripts.
2.1.0 2015-06-27 Jamie Norrish <[email protected]>
* Added zero-fill option to tacl report, supplying result rows with
a count of zero to witnesses of a text that has at least one
witness with a non-zero count.
* Added collapse-witnesses subcommand to tacl-helper, to collapse
result rows for witnesses to the same text with the same count
into a single row.
* Fixed a bug in tacl report's extend if the same witness is
associated with more than one label in the results.
* Added check for texts being relabelled in a catalogue file.
* Added check for only a single label being supplied to a query.
2.0.0 2015-05-27 Jamie Norrish <[email protected]>
* Added support for multiple witnesses to a text. Witnesses are
automatically extracted from the CBETA texts, and results now
include a 'siglum' field indicating which witness of the text a
result comes from.
* Added total count of n-grams occurrences per text to 'tacl search'
results.
* Split 'tacl strip' into 'tacl prepare' and 'tacl strip'.
* Reimplemented supplied queries to be more useful and not have
hidden gotchas.
* Fixed a bug causing 'tacl stats' to sometimes return incorrect
results.
1.1.0 2014-05-27 Jamie Norrish <[email protected]>
* Added a tacl align command, to output aligned sequences of
intersections between all pairs of differently labelled texts in a
set of results.
* Added LICENCE and CHANGES to MANIFEST.in.
1.0.2 2014-05-20 Jamie Norrish <[email protected]>
* Added install_requires to setup.py, so that installation via pip
installs the dependences.
1.0.1 2014-05-20 Jamie Norrish <[email protected]>
* Added a MANIFEST.in to explicitly include README.rst in the
distribution.
1.0.0 2014-05-20 Jamie Norrish <[email protected]>
* Initial publicised release.