PyGAAP is the Python port of JGAAP, the Java Graphical Authorship Attribution Program, by Patrick Juola et al.
See https://evllabs.github.io/JGAAP/
Updated: 2022.07.14
- Unlike JGAAP, PyGAAP does not (currently) use dedicated classes for module parameters. See Class variables.
name description (Tkinter module used)
topwindow name of main window (Tk)
├── menubar name of top menu bar (Menu)
├── workspace main Frame under topwindow that contains notebook Tabs (Frame)
│ ├── tabs This sets up the tabs (Notebook)
│ │ ├── Tab_Documents Holds widgets in Documents tab (Frame)
│ │ │ ├── Tab_Documents_UnknownAuthors_Frame Contains Listbox for unknown authors (Frame)
│ │ │ ├── Tab_Documents_doc_buttons Buttons for unknown authors (Frame)
│ │ │ ├── Tab_Documents_KnownAuthors_Frame Contains Listbox for known authors (Frame)
│ │ │ ├── Tab_Documents_knownauth_buttons Buttons for known authors (Frame)
│ │ |
│ │ │ The 4 tabs below are generated by create_feature_tab(). The widgets in these tabs are saved in "objects".
│ │ ├── Tab_Canonicizers Holds widgets in Canonicizers tab (Frame) widgets in generated_widgets['Canonicizers']
│ │ ├── Tab_EventDrivers Holds widgets in Event Drivers tab (Frame) widgets in generated_widgets['EventDrivers']
│ │ ├── Tab_EventCulling Same Setup as Event Drivers Tab widgets in generated_widgets['EventCulling']
│ │ ├── Tab_AnalysisMethods Holds widgets in Analysis Methods tab (Frame) widgets in generated_widgets['AnalysisMethods']
│ │ |
│ │ │ Within each dictionary entry listed above (written as % below), the structure is as follows:
│ │ ├── %
│ │ │ ├── %["top_frame"]
│ │ │ │ ├── %["available_frame"] Contains the listboxes where available features are displayed.
| | | | | └── %["available_listboxes"] = [
| | | | | [Frame, Label, Listbox, Scrollbar],
| | | | | [Frame, Label, Listbox, Scrollbar] # for analysis methods (two listboxes to choose from)
| | | | | ]
│ │ │ │ ├── %["buttons_frame"] Contains the add/remove/clear buttons.
│ │ │ │ ├── %["selected_frame"] Contains the listbox where selected features are displayed.
| | | | | └── %["selected_listboxes"] = [Frame, Label, Listbox/Treeview, Scrollbar]
│ │ │ │ └── %["parameters_frame"] Contains the frame where parameters of a feature are displayed.
│ │ │ |
│ │ │ ├── %["description_frame"] Contains the text box where the description of a feature is displayed.
│ │ |
│ │ ├── Tab_ReviewProcess Holds widgets in Review & Process tab (Frame)
│ │ │ ├── Tab_ReviewProcess_Canonicizers Contains corresponding listbox
│ │ │ ├── Tab_ReviewProcess_EventDrivers
│ │ │ ├── Tab_ReviewProcess_EventCulling
│ │ │ ├── Tab_ReviewProcess_AnalysisMethods
│ │ |
├── bottomframe Holds buttons at the bottom: Notes, Next, and Finish.
└── status_bar Contains the label (text) for status.
In the GUI code, set GUI_debug to 3 to see function calls printed to the terminal.
Notepad()
├── -# NotepadWindow_SaveButton
├── -NotepadWindow_SaveButton -> Notepad_Save(text)
edit_known_authors(.., mode) #called when a button in [Tab_Documents_knownauth_buttons] is pressed. The mode distinguishes the buttons.
├── -# AuthorAddDocButton
│ ├── -addFile() # opens OS's file browser
|
├── -# AuthorRmvDocButton
│ ├── -select_features(..., "remove")
|
├── -#AuthorOKButton
├── -@ if mode=="add": # when "Add Author" button is pressed
│ authorSave(..., "add") # updates global list (backend) of authors and their documents
│ ├── -authorsListUpdater() # refreshes the listbox used to display authors
|
├── -@ else if mode=="edit" # when "Edit Author" button is pressed
authorSave(..., "edit") # updates global list (backend) of authors and their documents
├── -authorsListUpdater() # refreshes the listbox used to display authors
Add new modules to ./generics/modules for the API to pick up while loading. Always add a line to import the generic type from ~/generics. For example, for a set of analysis methods:
from generics.AnalysisMethod import *
Add package dependencies and their version numbers to ~/requirements.txt, if applicable.
As a readability consideration, it's recommended that the files in ~/generics/modules be prefixed with the following:
- cc for canonicizers
- ed for event drivers
- ec for event cullers
- nc for number converters
- am for analysis methods
- df for distance functions
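The prefix convention above can be written down as a small lookup table. This is only an illustrative sketch: the helper `module_type_from_filename` and the file name `am_centroid.py` are hypothetical, and nothing here implies that PyGAAP itself inspects file names this way.

```python
from pathlib import Path

# Recommended two-letter prefixes, mapped to the module type they indicate.
MODULE_PREFIXES = {
    "cc": "canonicizer",
    "ed": "event driver",
    "ec": "event culler",
    "nc": "number converter",
    "am": "analysis method",
    "df": "distance function",
}


def module_type_from_filename(filename):
    """Return the module type implied by the first two letters, or None.

    Hypothetical helper for illustration only; it just looks at the first
    two characters of the file's stem.
    """
    stem = Path(filename).stem.lower()
    return MODULE_PREFIXES.get(stem[:2])


print(module_type_from_filename("am_centroid.py"))
```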
These are the expected input and output types.
Canonicizers (pre-processors)
String -> String
save to Document.text, return is not required
Event drivers (feature extractors)
String (Document.text) -> list of strings
save to Document.eventSet, return is not required
Event cullers (feature filtering/culling)
list of strings (Document.eventSet) -> list of strings
save to Document.eventSet, return is not required
Number converters (text embedders)
list of strings (Document.eventSet) -> numpy.array (1D)
save to Document.numbers, returning a 2D numpy.array is recommended, with shape (known categories, unknown categories)
Distance functions
numpy.array (1D or 2D) -> numpy.array (2D), shape (known categories, unknown categories)
must return
Analysis methods
numpy.array (Document.numbers) -> list[dict[string:float]]
a list of dicts, one per unknown category, whose keys are authors and whose values are scores; a lower score is ranked higher.
must return
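To make the analysis-method return type concrete, here is a minimal sketch of what such a result might look like and how it can be read. The author names and scores are made up for illustration, and `rank_authors` is a hypothetical helper, not part of PyGAAP.

```python
# Made-up result for two unknown documents: one dict per unknown document,
# mapping each candidate author to a score.
results = [
    {"Author A": 0.12, "Author B": 0.48, "Author C": 0.91},
    {"Author A": 0.77, "Author B": 0.05, "Author C": 0.33},
]


def rank_authors(scores):
    """Order authors from best to worst match; a lower score ranks higher."""
    return sorted(scores, key=scores.get)


for index, scores in enumerate(results):
    ranking = rank_authors(scores)
    print(f"unknown document {index}: best match is {ranking[0]}")
```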
- The text string is read from file and saved to Document.text. The canonicizers process the text & save it back into (overwrite) Document.text.
- Event drivers read from Document.text and convert it into a list of strings. This is saved into Document.eventSet.
- Event cullers read from Document.eventSet, process the list, and save it back into (overwrite) Document.eventSet.
- Number converters read from Document.eventSet and convert the list into a NumPy array. These 1D arrays are the numerical representations of the documents and are saved into Document.numbers. At the same time, two aggregate 2D NumPy arrays containing data from the known document set (training data) and the unknown document set (testing data) are passed to the next steps. Returning these aggregate arrays from number converters is optional but recommended, because vectorized representations may help the analysis run faster.
- The analysis modules receive the entire set of unknown documents, and optionally the aggregate testing data, and perform classification. It's up to the developer to decide whether to process them all at once or one-by-one. The result is a list of dictionaries where each dictionary has the scores for each candidate author.
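The data flow described above can be sketched end to end with toy stand-ins. Everything in this block is an assumption made for illustration: the `Document` dataclass only mirrors the three attributes named above, and the functions are simplified placeholders rather than PyGAAP's actual modules or signatures. For simplicity it also assumes one known document per author.

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class Document:
    """Stand-in with only the attributes named in this section."""
    author: str
    text: str
    eventSet: list = field(default_factory=list)
    numbers: Optional[np.ndarray] = None


def canonicize(doc):
    doc.text = doc.text.lower()          # String -> String, overwrites text


def create_event_set(doc):
    doc.eventSet = doc.text.split()      # String -> list of strings


def cull(doc):
    # list of strings -> list of strings, overwrites eventSet
    doc.eventSet = [event for event in doc.eventSet if len(event) > 1]


def convert(known, unknown):
    """Save a 1D count vector per document; return two aggregate 2D arrays."""
    vocab = sorted({e for d in known + unknown for e in d.eventSet})
    for d in known + unknown:
        d.numbers = np.array([d.eventSet.count(w) for w in vocab], dtype=float)
    return (np.vstack([d.numbers for d in known]),
            np.vstack([d.numbers for d in unknown]))


def analyze(known, known_matrix, unknown_matrix):
    """Return one {author: score} dict per unknown document (lower is better)."""
    results = []
    for row in unknown_matrix:
        distances = np.linalg.norm(known_matrix - row, axis=1)
        results.append({d.author: float(s) for d, s in zip(known, distances)})
    return results


known = [Document("Author A", "The cat sat on the mat"),
         Document("Author B", "A dog ran in the park")]
unknown = [Document("", "The cat sat")]

for doc in known + unknown:
    canonicize(doc)
    create_event_set(doc)
    cull(doc)

known_matrix, unknown_matrix = convert(known, unknown)
results = analyze(known, known_matrix, unknown_matrix)
print(min(results[0], key=results[0].get))
```

Each stage overwrites or fills exactly one attribute of the document, which is why later stages only need the `Document` objects plus the optional aggregate arrays.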
Class variables are declared within the class definition.
Each user parameter is a class variable exposed to the GUI. These variables must also have corresponding entries in _variable_options, and their names cannot begin with a "_".
Conversely, to hide a class variable from the GUI, prefix the name with a "_".
- _global_parameters: API parameters to be passed to all modules, like language.
- _variable_options (dictionary): lists the options, GUI type, and default values of variables. The variables' names are the keys and their attributes are dicts. Each dict for a variable must have "options" for the range of available choices, "type" for the GUI widget type (currently only OptionMenu is supported), and "default" for the default value as an index of the "options" list (for the example below, the default is 0, which picks the item at index 0 in the "options" list, i.e. the default value for the variable is 3). Optionally, add a display name if different from the variable name.
Example: {"variable_1": {"options": list(range(3, 10)), "type": OptionMenu, "default": 0, "displayed_name": "The First Variable"}}
- _NoDistanceFunction_ (boolean): if an analysis method does not allow a distance function to be set, add this and set it to True. It's False if omitted.
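As a rough illustration of how the "default" index in _variable_options maps to an actual value, consider the sketch below. The class `ExampleModule` and the helpers `default_value` and `gui_parameters` are hypothetical, and the local `OptionMenu` class is only a placeholder for the Tkinter widget so the snippet runs without a GUI. How PyGAAP itself reads these entries may differ.

```python
class OptionMenu:
    """Placeholder for tkinter.OptionMenu, used here only as a marker."""


class ExampleModule:
    # User parameter: no leading underscore, so it would be exposed to the GUI.
    variable_1 = 3

    # Leading underscore: hidden from the GUI.
    _variable_options = {
        "variable_1": {
            "options": list(range(3, 10)),
            "type": OptionMenu,
            "default": 0,
            "displayed_name": "The First Variable",
        }
    }


def default_value(module_class, name):
    """Resolve the default: "default" is an index into the "options" list."""
    spec = module_class._variable_options[name]
    return spec["options"][spec["default"]]


def gui_parameters(module_class):
    """List non-callable class variables whose names do not start with "_"."""
    return [name for name, value in vars(module_class).items()
            if not name.startswith("_") and not callable(value)]


print(default_value(ExampleModule, "variable_1"))
print(gui_parameters(ExampleModule))
```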
The __init__() method for module classes contains initialization for required parameters. These are handled in the abstract (base) class at the top of the generic module files (~/generics/...). Use an after_init(**options) function if there are extra steps for a module right after initialization. It takes the keyword arguments passed into __init__().
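The after_init() hook might be used along the following lines. The base class here is a simplified stand-in written for this sketch, not the real class from the generics files. The assumption that the base __init__() calls after_init() with the same keyword arguments is inferred from the description above, and the "language" key is used only as an example option.

```python
from abc import ABC


class AnalysisMethod(ABC):
    """Simplified stand-in for a generic base class (assumption, not the real one)."""

    def __init__(self, **options):
        # Shared, required initialization would live here in the base class.
        self._options = options
        # Hand the same keyword arguments to the subclass hook.
        self.after_init(**options)

    def after_init(self, **options):
        """Default hook: no extra steps."""


class ExampleMethod(AnalysisMethod):
    def after_init(self, **options):
        # Extra setup that should run right after initialization.
        self._cache = {}
        self._language = options.get("language", "English")


method = ExampleMethod(language="Spanish")
print(method._language)
```

Keeping the extra steps in after_init() means a module does not need to override __init__() and repeat the base-class setup.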
- All modules are required to have displayName() and displayDescription().
  - displayName() (nothing → String) returns the name of the module. Note that the name of a distance function cannot be NA, which is reserved as a placeholder for analysis methods that don't use distance functions.
  - displayDescription() (nothing → String) returns a description of the module.
❗ Make sure to return and not (just) print the names and descriptions.
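A minimal module with both required functions could look like the sketch below. The `Canonicizer` base class is a bare stand-in for the generic type, `LowercaseExample` is a hypothetical module, and declaring the two display functions as static methods is an assumption made so they take no arguments. Check an existing module for the exact form PyGAAP expects.

```python
class Canonicizer:
    """Bare stand-in for the generic canonicizer base class."""


class LowercaseExample(Canonicizer):
    @staticmethod
    def displayName():
        # Return the string; printing it alone would hand None to the caller.
        return "Lowercase (example)"

    @staticmethod
    def displayDescription():
        return "Converts all text to lowercase."

    def process(self, text):
        """String -> String."""
        return text.lower()


print(LowercaseExample.displayName())
print(LowercaseExample().process("Some TEXT"))
```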
Functions by types of module:
- Canonicizers: process() (String → String)
- Event drivers: CreateEventSet() (String → List), setParams()
- Event cullers: process() (List → List)
- Number converters: convert() (List → numpy.array)
- Analysis methods: train(), analyze(), setDistanceFunction() (optional)
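As one possible shape for an analysis method, the sketch below pairs a distance function with a nearest-centroid classifier. The argument lists of train(), analyze(), and setDistanceFunction() are assumptions chosen for this example, since the section does not specify them. `CentroidExample` and `euclidean` are hypothetical names, and the class omits the generic base class to stay self-contained.

```python
import numpy as np


def euclidean(known, unknown):
    """Distance function: 2D output of shape (known categories, unknown categories)."""
    diff = known[:, None, :] - unknown[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


class CentroidExample:
    _NoDistanceFunction_ = False  # this method accepts a distance function

    def __init__(self):
        self._distance = None
        self._authors = []
        self._centroids = None

    def setDistanceFunction(self, distance):
        self._distance = distance

    def train(self, authors, known_matrix):
        """Average the rows of each author into one centroid per author."""
        self._authors = sorted(set(authors))
        labels = np.array(authors)
        self._centroids = np.vstack(
            [known_matrix[labels == a].mean(axis=0) for a in self._authors]
        )

    def analyze(self, unknown_matrix):
        """Return one {author: score} dict per unknown row; lower is better."""
        distances = self._distance(self._centroids, unknown_matrix)
        return [
            {a: float(d) for a, d in zip(self._authors, column)}
            for column in distances.T
        ]


method = CentroidExample()
method.setDistanceFunction(euclidean)
method.train(["A", "A", "B"], np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]))
scores = method.analyze(np.array([[1.0, 0.0], [9.0, 9.0]]))
print([min(s, key=s.get) for s in scores])
```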
To reload all modules while PyGAAP is running, go to the top menu bar: "Developer"
There will be a confirmation in the status bar or an error message window.
❗ Reloading will remove all selected modules.
❗ This does not reload libraries that the modules may import, e.g. SpaCy.
- CLI - Command line interface