NOTES

Architecture notes...

1. Internal identification of users and groups
----------------------------------------------

Users and groups are both identified by 2- or 3-element tuples of the
form:

  (name, persistent identifier [, DN])

For example:

  ("dryad", "ark:/13030/foo", "uid=dryad,ou=People,ou=uc3,dc=cdlib,dc=org")

The last tuple element is present only if LDAP is enabled.

In the UI, if the user is not logged in, the user and group are both
set to ("anonymous", "anonymous") for the purposes of identifier
ownership and access control.
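
The tuple layout can be sketched in Python as follows.  The helper
names here (make_agent, ANONYMOUS, etc.) are hypothetical and are not
taken from EZID's code; they only illustrate the 2- versus 3-element
forms.

```python
# Hypothetical helpers illustrating the tuple layout; not EZID's code.

# Stand-in used for both user and group when nobody is logged in.
ANONYMOUS = ("anonymous", "anonymous")

def make_agent(name, pid, dn=None):
    """Return a 2-tuple, or a 3-tuple if an LDAP DN is supplied."""
    if dn is None:
        return (name, pid)
    return (name, pid, dn)

def local_name(agent):
    """First element: the local name, e.g. "dryad"."""
    return agent[0]

def persistent_id(agent):
    """Second element: the persistent identifier."""
    return agent[1]

def has_dn(agent):
    """True only for the 3-element (LDAP-enabled) form."""
    return len(agent) == 3
```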

2. Session cookies
------------------

Session cookies store the following key/value pairs:

  auth
    A userauth.AuthenticatedUser object which has 'user' and 'group'
    attributes, each of which is a tuple as described above.
    Presence of this key indicates that the user is authenticated.

  redirect_to
    The full URL path to which the user should be redirected following
    a successful login.  UI only.  May or may not be present; not
    cleared.

3. Caching
----------

Caching is employed in several places.  All caches are emptied when
EZID is reloaded.

  ezid.conf settings
    The settings used in modules are cached by those modules.  Loaded
    at module load time.  Reloading EZID causes all settings to be
    reloaded except Django and logging settings.

  userauth._ldapCache authentication cache
    A dictionary that maps local usernames to (hashed password, time,
    AuthenticatedUser) tuples.  (The time in a tuple effectively puts
    a lifetime on the associated cached password.)  Loaded on demand
    as usernames are encountered.  Individual entries are removed when
    users change passwords, change groups, etc.

  policy._groups cache
    A dictionary that maps groups (identified by tuples; see above) to
    group information tuples (containing shoulders, CrossRef
    attributes, etc.).  Loaded on demand as groups are encountered.
    Individual entries are removed when groups are modified through
    the admin interface.

  policy._coOwners cache
    A dictionary that maps users (identified by simple names) to
    co-owner lists (in which users are also identified by simple
    names).  Loaded on demand as users are encountered.  Individual
    entries are removed when co-owner lists are modified through the
    account and admin interfaces.

  session cookies
    See above.  With a couple of caveats, a session cookie is deleted
    if and only if a user explicitly logs out
    (SESSION_EXPIRE_AT_BROWSER_CLOSE is set to true, but that only
    directs browsers to drop cookies; it doesn't have any server-side
    effect).  First caveat: expired cookies are deleted by a weekly
    cron job.  Second caveat: all of a user's session cookies are
    deleted if the user's account is disabled or if the user's group
    changes.

  idmap.py: _idMap, _groupMap, _userMap
    These dictionaries cache correspondences between agent identifiers
    and user and group local names.  Loaded on demand.

  LDAP information
    Cached in agent identifiers for the purposes of storage redundancy
    and locality only.  Written only, never read.

  shoulder.py
    Caches shoulder and datacenter objects from the store database;
    the database itself caches the content of the external shoulder
    file.  Loaded when shoulders are first referenced.  Note that
    shoulders and datacenters are never changed within EZID.

  store_group.py
    Caches group objects from the store database.  Loads objects on
    demand, as they're referenced.  Emptied when EZID is reloaded and
    when groups are modified or deleted.

  search database
    The search database is not a cache, strictly speaking, but as a
    quasi-clone of the store database it engenders the same kinds of
    issues that caches do.  In the search database, profiles and
    datacenters are added as they are encountered (and they are never
    deleted, so it is possible for extraneous entries to remain in the
    database).  Users, groups, and realms are kept in sync between the
    two databases.

  search_identifier.py
    Caches user, group, datacenter, and profile objects from the
    search database.  Loads objects on demand, as they're referenced.

4. Identifier metadata
----------------------

See ezid.py.

5. Agent identifiers
--------------------

"Agents" (users and groups) are internally referred to and stored as
ARK identifiers (e.g., "ark:/99166/foo"), but are externally referred
to by local names (e.g., "dryad").  Identifiers that identify agents
are termed "agent identifiers;" see ezid.py and idmap.py for more
information.  Because potentially sensitive LDAP information is cached
in agent identifiers (see above), not only are agent identifiers not
revealed to clients, they are owned by the EZID administrator and can
only be viewed by the EZID administrator.

6. Case sensitivity of LDAP UIDs
--------------------------------

LDAP UIDs are case-insensitive.  Whenever EZID stores a UID (e.g., in
session cookies and in co-owner lists), it always uses the UID as
retrieved from LDAP.  In other words, when LDAP is enabled, EZID's
behavior regarding UIDs is case-insensitive and case-preserving.
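
A small sketch of "case-insensitive and case-preserving" follows.  The
dictionary stands in for an LDAP directory, and the function name is
hypothetical; the point is only that matching ignores case while the
value stored afterwards is the one the directory returned.

```python
# Stand-in for an LDAP directory (an assumption for illustration):
# lowercased UID -> UID exactly as the directory holds it.
_directory = {"dryad": "Dryad"}

def canonical_uid(entered):
    """Match case-insensitively; return the UID with the directory's casing."""
    return _directory.get(entered.lower())

# Whatever casing the user types, the stored form is the directory's.
stored = canonical_uid("DRYAD")
```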

7. Use of DataCite's active flag
--------------------------------

DataCite's 'active' flag (a DataCite-specific attribute of a DOI)
works as follows.  It is true by default, and set to false by
performing an HTTP DELETE on the identifier.  Note, though, that a
DELETE may be performed only if the identifier has metadata.
Performing a DELETE on an already deactivated identifier has no
effect.  An identifier is (and can only be) reactivated by posting
metadata to it.

A deactivated identifier continues to exist in DataCite, but it is in
many ways deleted: an attempt to view the identifier returns 410 Gone,
and the identifier is removed from every DataCite service, including
the CrossRef/DataCite content resolver.  It is not entirely deleted,
however, as the identifier continues to exist in the Handle System and
therefore continues to resolve.

Note that the above API behavior has no effect on setting a DOI's
target URL: the target URL may be set whether the identifier is active
or not, and whether it has metadata or not.  Starting 2013-01-01
DataCite will disallow a new registration if the identifier has no
metadata.  Our understanding is that nothing else about the DataCite
API will change; in particular, the target URL will continue to be
settable if the identifier is not active.  It is unclear at the
time of this writing if the target URL for a legacy identifier lacking
metadata may be set without first uploading metadata.

With this background, EZID's manipulation of the active flag can be
summarized as follows:

  event                           actions
  ------------------------------  -------------------------
  _status: public -> unavailable  url=tombstone; DEACTIVATE
  _status: unavailable -> public  restore url; ACTIVATE
  delete                          url=invalid; DEACTIVATE
  _export: yes -> no              DEACTIVATE
  _export: no -> yes              ACTIVATE

In the above, _status takes precedence over _export.
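
One reading of the table and the precedence rule is that the flag
should be active exactly when the identifier is public and exported,
so that an unavailable identifier stays deactivated no matter how
_export changes.  The sketch below encodes that reading only; it is
an interpretation, the function names are hypothetical, and it
ignores the URL changes and the delete event.

```python
def is_active(status, export):
    """Desired flag state; _status is consulted first, then _export."""
    if status != "public":
        return False
    return export == "yes"

def flag_action(old_status, old_export, new_status, new_export):
    """Return "ACTIVATE", "DEACTIVATE", or None if the flag is unchanged."""
    before = is_active(old_status, old_export)
    after = is_active(new_status, new_export)
    if before == after:
        return None
    return "ACTIVATE" if after else "DEACTIVATE"
```

Under this reading, flipping _export on an unavailable identifier
yields no action, because _status already holds the flag off.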

There are two differences between an unavailable identifier and a
public-but-not-exported identifier.  First, an unavailable
identifier's target URL is overridden with a tombstone URL.  Second, a
public-but-not-exported identifier's metadata is still uploaded to
DataCite.

8. Offline scripts
------------------

Offline scripts (dump-store, stats, dashboard, populate-store-2, etc.)
import EZID modules and directly call EZID functions.  This generally
doesn't cause problems, with two exceptions.  The first is logging: to
avoid appending to and possibly corrupting the running server's
transaction log file, offline scripts use the
settings/logging.offline.conf settings to log to standard error
instead.  Second, script update actions may conflict with those of the
running server, even though SQLite locking works across processes,
because offline scripts don't participate in the server's locking
mechanism and won't necessarily interact properly with server
background processing daemons.  This explains why, for example, the
expunge script performs its update actions through the EZID API.

See .../SITE_ROOT/PROJECT_ROOT/tools/offline.py for more information.

9. Log file formats
-------------------

There are two slightly different log file formats.  The transaction
log written by the running server (by module log.py) stores start,
progress, and end records for every transaction, for both read and
write operations, both successful and not, as well as server error and
server status records.  But for space efficiency, historical
transaction logs are converted to a more compact form.  The striplog
tool discards all records except those for transactions that
successfully minted, created, or modified non-test identifiers.
Furthermore, the multiple records comprising a transaction are
collapsed into a single record.  For example, the following two
transactions (records have been wrapped here for clarity):

  2014-01-06 20:58:11,383 4ec86f4a775811e3bdd610ddb1cf39e7 BEGIN mintArk
    13030/c7 gjanee ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v
  2014-01-06 20:58:11,715 4ec86f4a775811e3bdd610ddb1cf39e7 END SUCCESS
    13030/c7b56d41k
  2014-01-06 20:58:11,715 4efb1a59775811e3a95e10ddb1cf39e7 BEGIN createArk
    13030/c7b56d41k gjanee ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v
    erc.what An%20example
  2014-01-06 20:58:12,338 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
    noid_egg.setElements
  2014-01-06 20:58:12,342 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
    store.insert
  2014-01-06 20:58:12,345 4efb1a59775811e3a95e10ddb1cf39e7 END SUCCESS

get compacted into:

  2014-01-06 20:58:11,383 mintArk 13030/c7 gjanee ark:/99166/p92z12p14 cdl
    ark:/99166/p9z60c16v -> 13030/c7b56d41k
  2014-01-06 20:58:11,715 createArk 13030/c7b56d41k gjanee
    ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v erc.what An%20example

Note that record arguments in both types of log files are separated by
single spaces, and thus an empty argument will result in adjacent
spaces.
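
The practical consequence for anyone parsing these logs in Python is
that a record should be split on a single space rather than on runs
of whitespace, or empty arguments will silently disappear.  In the
sketch below the function name is hypothetical, and the record is the
createArk example above with a made-up empty erc.what value and a
made-up extra element appended.

```python
def split_record(line):
    """Split one log record into fields, keeping empty arguments."""
    return line.rstrip("\n").split(" ")

# Made-up compact record: note the two adjacent spaces after erc.what.
record = (
    "2014-01-06 20:58:11,715 createArk 13030/c7b56d41k gjanee "
    "ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v erc.what  erc.who x"
)

fields = split_record(record)

# Splitting on arbitrary whitespace drops the empty argument and shifts
# every later field by one position.
lossy = record.split()
```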

10. Database dump formats
-------------------------

There are two slightly different database dump formats.  A "raw" dump
(produced by 'dump -r' and 'dump-store -r') lists identifiers as
stored in the bind or store database: in unqualified form, with shadow
ARKs representing non-ARK identifiers, using internal labels, with all
internal labels included.  Here's an example identifier record
(wrapped here for clarity):

  99999/fk4030wkq _is reserved _p erc
    _o ark:/99166/p92z12p14 _g ark:/99166/p9z60c16v
    _c 1389071897 _u 1389071897
    _t1 http://a.target/
    _t http://ezid.cdlib.org/id/ark:/99999/fk4030wkq

A "normal" dump (produced by 'dump -n' and 'dump-store -n', or
converted from a raw dump by 'convert-dump -n') uses a record
representation that is more human-readable and more easily processed.
It lists identifiers in qualified form, with non-ARK identifiers
representing themselves, using external labels, with internal labels
related to identifier status omitted.  The same example in normal
form:

  ark:/99999/fk4030wkq _status reserved _profile erc
    _owner ark:/99166/p92z12p14 _ownergroup ark:/99166/p9z60c16v
    _created 1389071897 _updated 1389071897
    _target http://a.target/

A normal dump may optionally have agent identifiers converted to local
names (by omitting the '-n' option in the above commands), as in:

  ark:/99999/fk4030wkq _status reserved _profile erc
    _owner gjanee _ownergroup cdl
    _created 1389071897 _updated 1389071897
    _target http://a.target/

The select and project tools are intended to work on normal dumps,
though certain operations work on raw dumps as well.
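
For readers who want to process a normal dump without those tools, a
record can be read as an identifier followed by label/value pairs.
The sketch below assumes that each record occupies one line, that
fields are separated by single spaces, and that values contain no
literal spaces; none of these is stated in these notes for dump
files, and the function name is hypothetical.

```python
def parse_record(line):
    """Return (identifier, {label: value}) for one normal dump record."""
    fields = line.rstrip("\n").split(" ")
    identifier, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected label/value pairs after the identifier")
    # Even positions are labels, odd positions are the matching values.
    return identifier, dict(zip(rest[0::2], rest[1::2]))

# The normal-form example from above, unwrapped onto one line.
line = (
    "ark:/99999/fk4030wkq _status reserved _profile erc "
    "_owner ark:/99166/p92z12p14 _ownergroup ark:/99166/p9z60c16v "
    "_created 1389071897 _updated 1389071897 "
    "_target http://a.target/"
)

identifier, metadata = parse_record(line)
```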

11. CrossRef
------------

CrossRef does not provide an 'active' flag like DataCite does, and
this limits our ability to implement identifier status changes.  Our
next-best-thing approach is as follows:

  - The _crossref element may be set and unset while the identifier is
    reserved.

  - If the _crossref element is set, the identifier must be exported.

  - When the identifier is made public, it is registered with
    CrossRef.  And once the identifier is public, the _crossref
    element may not be unset.

  - If the identifier's status is set to unavailable, the identifier
    remains registered with CrossRef, but its target URL is set to the
    tombstone URL and the resource title is set to "WITHDRAWN".  If
    the identifier is deleted (by the EZID administrator), the same
    is done, except that the target URL is set to
    http://datacite.org/invalidDOI.
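
The first three rules above amount to a validity check on a proposed
metadata change, which can be sketched as follows.  The function name
and messages are hypothetical.  Two points are assumptions: any
status other than "reserved" is treated as "once public", and newly
setting _crossref on a public identifier is allowed because the rules
above do not address it.

```python
def check_crossref(status, was_set, will_be_set, exported):
    """Return an error message for a disallowed change, else None."""
    if will_be_set and not exported:
        return "_crossref requires the identifier to be exported"
    if was_set and not will_be_set and status != "reserved":
        # Assumption: "unavailable" counts as having been public.
        return "_crossref may not be unset once the identifier is public"
    return None
```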