PyGAAP is the Python port of JGAAP, Java Graphical Authorship Attribution Program by Patrick Juola et al.
See https://evllabs.github.io/JGAAP/

Updated: 2022.07.14

PyGAAP Developer Manual

Table of contents

  1. Differences from JGAAP
  2. Widget structures
    1. Outline of tkinter widgets
    2. Function Calls
  3. Adding a new module
    1. The analysis process
    2. Class variables
    3. Class initialization
    4. Class functions
    5. Reload modules while PyGAAP is running

Differences from JGAAP

Module parameters

  1. Unlike JGAAP, PyGAAP does not (currently) use dedicated classes for module parameters. See Class variables.

Widget structures

Outline of tkinter widgets

name                                         description (Tkinter module used)

topwindow                                             name of main window (Tk)
├── menubar                                           name of top menu bar (Menu)
├── workspace                                         main Frame under topwindow that contains notebook Tabs (Frame)
│   ├── tabs                                          This sets up the tabs (Notebook)
│   │   ├── Tab_Documents                             Holds widgets in Documents tab (Frame)
│   │   │   ├── Tab_Documents_UnknownAuthors_Frame   Contains Listbox for unknown authors (Frame)
│   │   │   ├── Tab_Documents_doc_buttons            Buttons for unknown authors (Frame)
│   │   │   ├── Tab_Documents_KnownAuthors_Frame     Contains Listbox for known authors (Frame)
│   │   │   ├── Tab_Documents_knownauth_buttons      Buttons for known authors (Frame)
│   │   |
│   │   │      The 4 tabs below are generated by create_feature_tab(). The widgets in these tabs are saved in generated_widgets.
│   │   ├── Tab_Canonicizers           Holds widgets in Canonicizers tab (Frame)           widgets in generated_widgets['Canonicizers']
│   │   ├── Tab_EventDrivers           Holds widgets in Event Drivers tab (Frame)          widgets in generated_widgets['EventDrivers']
│   │   ├── Tab_EventCulling           Same Setup as Event Drivers Tab                     widgets in generated_widgets['EventCulling']
│   │   ├── Tab_AnalysisMethods        Holds widgets in Analysis Methods tab (Frame)       widgets in generated_widgets['AnalysisMethods']
│   │   |
│   │   │      within the dictionary entry listed above as % the structure is as follows:
│   │   ├── %
│   │   │   ├── %["top_frame"]
│   │   │   │   ├── %["available_frame"]                Contains the listboxes where available features are displayed.
│   │   │   │   │   └── %["available_listboxes"]      = [
│   │   │   │   │                                           [Frame, Label, Listbox, Scrollbar],
│   │   │   │   │                                           [Frame, Label, Listbox, Scrollbar] # for analysis methods (two listboxes to choose from)
│   │   │   │   │                                       ]
│   │   │   │   ├── %["buttons_frame"]                  Contains the add/remove/clear buttons.
│   │   │   │   ├── %["selected_frame"]                 Contains the listbox where selected features are displayed.
│   │   │   │   │   └── %["selected_listboxes"]      = [Frame, Label, Listbox/Treeview, Scrollbar]
│   │   │   │   └── %["parameters_frame"]               Contains the frame where parameters of a feature are displayed.
│   │   │   |
│   │   │   ├── %["description_frame"]               Contains the text box where the description of a feature is displayed.
│   │   |
│   │   ├── Tab_ReviewProcess                      Holds widgets in Review & Process tab (Frame)
│   │   │   ├── Tab_ReviewProcess_Canonicizers       Contains corresponding listbox
│   │   │   ├── Tab_ReviewProcess_EventDrivers
│   │   │   ├── Tab_ReviewProcess_EventCulling
│   │   │   ├── Tab_ReviewProcess_AnalysisMethods
│   │   |
├── bottomframe                                Holds the buttons at the bottom: Notes, Next, and Finish.
└── status_bar                                 Contains the label (text) for status.

Map of some function calls

In the GUI code, set GUI_debug to 3 to see function calls printed to the terminal.

Notepad()
├── -# NotepadWindow_SaveButton
   ├── -NotepadWindow_SaveButton -> Notepad_Save(text)

edit_known_authors(.., mode)                            # called when a button in [Tab_Documents_knownauth_buttons] is pressed. The mode distinguishes the buttons.
├── -# AuthorAddDocButton
│    ├── -addFile()                                     # opens OS's file browser
|
├── -# AuthorRmvDocButton
│    ├── -select_features(..., "remove")
|
├── -# AuthorOKButton
   ├── -@ if mode=="add":                             # when "Add Author" button is pressed
   │    authorSave(..., "add")                        # updates global list (backend) of authors and their documents
   │    ├── -authorsListUpdater()                       # refreshes the listbox used to display authors
   |
   ├── -@ else if mode=="edit"                        # when "Edit Author" button is pressed
      authorSave(..., "edit")                       # updates global list (backend) of authors and their documents
      ├── -authorsListUpdater()                       # refreshes the listbox used to display authors

Adding a new module

Add new modules to ./generics/modules for the API to pick up while loading. Always add a line to import the generic type from ~/generics. For example, for a set of analysis methods:

from generics.AnalysisMethod import *

Add package dependencies and their version numbers to ~/requirements.txt, if applicable.
For readability, it's recommended that the files in ~/generics/modules be prefixed as follows:

  • cc for canonicizers
  • ed for event drivers
  • ec for event cullers
  • nc for number converters
  • am for analysis methods
  • df for distance functions

The analysis process

Data types

These are the expected input and output types.

Canonicizers (pre-processors)
   String -> String
   save to Document.text, returning is not required

Event drivers (feature extractors)
   String (Document.text) -> list of strings
   save to Document.eventSet, return is not required

Event cullers (feature filtering/culling)
   list of strings (Document.eventSet) -> list of strings
   save to Document.eventSet, return is not required

Number converters (text embedders)
   list of strings (Document.eventSet) -> numpy.array (1D)
   save to Document.numbers; returning a 2D numpy.array is recommended, with shape (known categories, unknown categories)

Distance functions
   numpy.array (1D or 2D) -> numpy.array (2D), shape (known categories, unknown categories)
   must return

Analysis methods
   numpy.array (Document.numbers) -> list[dict[string:float]]
   a list of dicts, one for each unknown category; in each dict the keys are authors and the values are their scores, where a lower score ranks higher.
   must return
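
To make the last two type signatures concrete, here is a minimal sketch of a distance function producing a (known categories, unknown categories) array and of that array being turned into the list of dicts an analysis method returns. The function names, author names, and sample vectors are hypothetical and not part of PyGAAP; only the shapes and the "lower score ranks higher" rule come from the descriptions above.

```python
import numpy as np


def manhattan_distances(known: np.ndarray, unknown: np.ndarray) -> np.ndarray:
    """Return an array of shape (known categories, unknown categories)."""
    # Broadcasting: (K, 1, F) - (1, U, F) -> (K, U, F); sum over the feature axis.
    return np.abs(known[:, None, :] - unknown[None, :, :]).sum(axis=2)


def to_results(distances: np.ndarray, authors: list) -> list:
    """Build one {author: score} dict per unknown category (one per column)."""
    return [
        {author: float(distances[k, u]) for k, author in enumerate(authors)}
        for u in range(distances.shape[1])
    ]


authors = ["Author A", "Author B"]            # hypothetical known categories
known = np.array([[1.0, 0.0], [0.0, 1.0]])    # one row per known category
unknown = np.array([[0.9, 0.1]])              # one row per unknown category

distances = manhattan_distances(known, unknown)
results = to_results(distances, authors)

# A lower score ranks higher, so the top candidate is the minimum.
best = min(results[0], key=results[0].get)
```

Here `distances` has shape (2, 1), and `best` is the author whose vector is closest to the single unknown document.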

The process

  1. The text string is read from file and saved to Document.text. The canonicizers process the text & save it back into (overwrite) Document.text.
  2. Event drivers read from Document.text and convert it into a list of strings. This is saved into Document.eventSet.
  3. Event cullers read from Document.eventSet, process the list, and save it back into (overwrite) Document.eventSet.
  4. Number converters read from Document.eventSet and convert the list into a 1D NumPy array. This array is the numerical representation of the document and is saved into Document.numbers. At the same time, two aggregate 2D NumPy arrays, one for the known document set (training data) and one for the unknown document set (testing data), are passed to the next steps. Returning these aggregate arrays from a number converter is optional but recommended: analysis methods can then operate on whole arrays (vectorized) instead of looping over documents, which may improve performance.
  5. The analysis module receives the entire set of unknown documents and, optionally, the aggregate testing data, and performs classification. It's up to the developer to decide whether to process the documents all at once or one by one. The result is a list of dictionaries, where each dictionary has the scores for each candidate author.
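
The first four steps can be sketched with plain functions. This is only an illustration of the data flow: the `Document` class below is a stand-in that borrows the attribute names text, eventSet, and numbers from this manual, and the four functions are hypothetical, not PyGAAP modules.

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class Document:
    """Stand-in for PyGAAP's Document; only the attribute names are taken from the manual."""
    text: str = ""
    eventSet: list = field(default_factory=list)
    numbers: Optional[np.ndarray] = None


def canonicize(doc: Document) -> None:
    # Step 1: String -> String, overwriting Document.text.
    doc.text = doc.text.lower()


def create_events(doc: Document) -> None:
    # Step 2: String -> list of strings, saved to Document.eventSet.
    doc.eventSet = doc.text.split()


def cull_events(doc: Document, min_length: int = 2) -> None:
    # Step 3: list of strings -> list of strings, overwriting Document.eventSet.
    doc.eventSet = [e for e in doc.eventSet if len(e) >= min_length]


def convert(docs: list) -> np.ndarray:
    # Step 4: each eventSet -> 1D array in Document.numbers,
    # plus a 2D aggregate array with one row per document.
    vocabulary = sorted({e for d in docs for e in d.eventSet})
    for d in docs:
        d.numbers = np.array([d.eventSet.count(v) for v in vocabulary], dtype=float)
    return np.vstack([d.numbers for d in docs])


docs = [Document(text="The cat sat"), Document(text="A cat ran")]
for d in docs:
    canonicize(d)
    create_events(d)
    cull_events(d)
aggregate = convert(docs)
```

After this runs, each document carries its own 1D `numbers` array, and `aggregate` stacks them into the 2D form that step 4 recommends passing on.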

Class variables

Class variables are declared within the class definition.

User parameters

Each user parameter is a class variable exposed to the GUI. These variables must also have corresponding entries in _variable_options, and their names cannot begin with a "_". Conversely, to hide a class variable from the GUI, prefix the name with a "_".

  • _global_parameters API parameters to be passed to all modules, like language.
  • _variable_options (dictionary) lists the options, GUI widget type, and default value of each variable. The keys are the variables' names and each value is a dict of attributes. Each dict must have "options" (the list of available choices), "type" (the GUI widget type; currently only OptionMenu is supported), and "default" (the default value, given as an index into the "options" list). Optionally, add "displayed_name" if the display name differs from the variable name. In the example below, "default" is 0, which selects the item at index 0 of "options", so the variable defaults to 3.
    Example: {"variable_1": {"options": list(range(3, 10)), "type": OptionMenu, "default": 0, "displayed_name": "The First Variable"}}
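
As a rough sketch of how these rules fit together, the class below declares one user parameter and one hidden variable. The class name and the two helper functions are hypothetical, and the widget type is written as the string "OptionMenu", which is an assumption; this manual does not show how PyGAAP itself reads these entries.

```python
class ExampleModule:
    """Hypothetical module; a real one would subclass a generic from ~/generics."""

    variable_1 = 3        # user parameter: no leading underscore, has an entry below
    _cache_size = 128     # hidden from the GUI because the name starts with "_"

    _variable_options = {
        "variable_1": {
            "options": list(range(3, 10)),
            "type": "OptionMenu",       # assumption: type given as a string
            "default": 0,               # index into "options", not the value itself
            "displayed_name": "The First Variable",
        }
    }


def default_value(module_cls, name):
    """Resolve the default by indexing into the options list."""
    spec = module_cls._variable_options[name]
    return spec["options"][spec["default"]]


def gui_parameters(module_cls):
    """List class variables that would be exposed: non-callable, no leading underscore."""
    return [
        name
        for name, value in vars(module_cls).items()
        if not name.startswith("_") and not callable(value)
    ]
```

With this class, `default_value(ExampleModule, "variable_1")` resolves to the first entry of the options list, and `gui_parameters(ExampleModule)` lists only `variable_1`.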

Class variables for Analysis Methods

  • _NoDistanceFunction_ (boolean) if an analysis method does not allow a distance function to be set, add this and set it to True. It's False if omitted.

Class initialization

The __init__() method of a module class initializes the required parameters. This is handled in the abstract (base) class at the top of each generic module file (~/generics/...). If a module needs extra steps right after initialization, define an after_init(**options) function. It receives the keyword arguments passed into __init__().
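
The hook pattern described here might look like the sketch below. The `Module` base class is a hypothetical stand-in, not the real abstract class in ~/generics, and copying the options onto the instance is an assumption made only so the example has something to show.

```python
class Module:
    """Hypothetical stand-in for a generic base class."""

    def __init__(self, **options):
        # Assumption: keyword arguments override class-level parameter values.
        for name, value in options.items():
            setattr(self, name, value)
        # Give the subclass a chance to finish its own setup.
        self.after_init(**options)

    def after_init(self, **options):
        """Default hook: nothing extra to do."""


class SlidingWindowDriver(Module):
    n = 2  # user parameter with a class-level default

    def after_init(self, **options):
        # Extra setup that depends on parameters already being set.
        self._window = [None] * self.n


driver = SlidingWindowDriver(n=3)
```

The subclass never overrides `__init__()`, so the base class keeps control of parameter handling while `after_init()` handles module-specific setup.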

Class functions

  • All modules are required to have displayName() and displayDescription().
    • displayName() (nothing → String) returns the name of the module. Note that the name of a distance function cannot be NA, which is reserved as a placeholder for analysis methods that don't use distance functions.
    • displayDescription() (nothing → String) returns a description of the module.

❗ Make sure to return and not (just) print the names and descriptions.
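
A short sketch of the difference between returning and printing. Both classes are hypothetical. The methods are written as static methods so they work whether called on the class or on an instance; which of the two PyGAAP does is not stated in this manual.

```python
class StripPunctuation:
    """Hypothetical canonicizer showing the two required functions."""

    @staticmethod
    def displayName():
        return "Strip Punctuation"

    @staticmethod
    def displayDescription():
        return "Removes punctuation marks from the text."


class BrokenModule:
    """Counter-example: printing leaves the caller with None."""

    @staticmethod
    def displayName():
        print("Broken Module")  # no return statement, so the result is None


good_name = StripPunctuation.displayName()
bad_name = BrokenModule.displayName()
```

`good_name` holds the string, while `bad_name` is `None` even though the name appeared in the terminal.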

Functions by types of module:

  • Canonicizers:
    • process() (String → String)
  • Event drivers:
    • CreateEventSet() (String → List)
    • setParams()
  • Event cullers:
    • process() (List → List)
  • Number Converters:
    • convert() (List → numpy.array)
  • Analysis methods:
    • train()
    • analyze()
    • setDistanceFunction() (optional)
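
For the analysis-method functions, a minimal sketch follows. The method names train(), analyze(), and setDistanceFunction() and the _NoDistanceFunction_ flag come from this manual; the argument lists, the class itself, and the `euclidean` helper are assumptions for illustration, since the signatures are not given here.

```python
import numpy as np


class NearestNeighborSketch:
    """Hypothetical analysis method; a real one would subclass the AnalysisMethod generic."""

    _NoDistanceFunction_ = False  # this method does accept a distance function

    @staticmethod
    def displayName():
        return "Nearest Neighbor (sketch)"

    @staticmethod
    def displayDescription():
        return "Scores each author by the distance to that author's closest known document."

    def setDistanceFunction(self, distance):
        self._distance = distance

    def train(self, known_numbers, authors):
        # Assumed signature: one row of numbers and one author label per known document.
        self._known = np.asarray(known_numbers, dtype=float)
        self._authors = list(authors)

    def analyze(self, unknown_numbers):
        # Returns one {author: score} dict per unknown document; lower is better.
        results = []
        for row in np.asarray(unknown_numbers, dtype=float):
            scores = {}
            for vector, author in zip(self._known, self._authors):
                d = float(self._distance(vector, row))
                scores[author] = min(d, scores.get(author, d))
            results.append(scores)
        return results


def euclidean(a, b):
    return np.linalg.norm(a - b)


method = NearestNeighborSketch()
method.setDistanceFunction(euclidean)
method.train([[0, 0], [1, 1], [5, 5]], ["A", "A", "B"])
results = method.analyze([[1, 2]])
```

Each author keeps the smallest distance over their known documents, so the output matches the list[dict[string:float]] form described under Data types.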

Reload modules while PyGAAP is running

To reload all modules while PyGAAP is running, go to the top menu bar: "Developer" → "Reload all modules".
There will be a confirmation in the status bar or an error message window.

❗ Reloading will remove all selected modules.
❗ This does not reload libraries that the modules may import, e.g. SpaCy.

Abbreviations, Initialisms, and Acronyms

  • CLI - Command line interface