Architecture notes...
1. Internal identification of users and groups
----------------------------------------------
Users and groups are both identified by 2- or 3-element tuples of the
form:
  (name, persistent identifier [, DN])
For example:
  ("dryad", "ark:/13030/foo", "uid=dryad,ou=People,ou=uc3,dc=cdlib,dc=org")
The last tuple element is present only if LDAP is enabled.
In the UI, if the user is not logged in, the user and group are both
set to ("anonymous", "anonymous") for the purposes of identifier
ownership and access control.
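The tuple shapes described above can be sketched as follows. This is
illustrative only: the helper name 'agent_tuple' and the constant
'ANONYMOUS' are hypothetical and are not taken from EZID's code.

```python
# Hypothetical names; only the tuple shapes come from the notes above.
ANONYMOUS = ("anonymous", "anonymous")


def agent_tuple(name, pid, dn=None):
    """Return (name, pid), or (name, pid, dn) when a DN is supplied.

    The DN is expected to be supplied only when LDAP is enabled.
    """
    return (name, pid) if dn is None else (name, pid, dn)


user = agent_tuple(
    "dryad", "ark:/13030/foo", "uid=dryad,ou=People,ou=uc3,dc=cdlib,dc=org"
)
# The first two elements are always present, with or without LDAP.
name, pid = user[:2]
```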
2. Session cookies
------------------
Session cookies store the following key/value pairs:
  auth
    A userauth.AuthenticatedUser object which has 'user' and 'group'
    attributes, each of which is a tuple as described above.
    Presence of this key indicates that the user is authenticated.
  redirect_to
    The full URL path to which the user should be redirected following
    a successful login. UI only. May or may not be present; not
    cleared.
3. Caching
----------
Caching is employed in several places. All caches are emptied when
EZID is reloaded.
  ezid.conf settings
    The settings used in modules are cached by those modules. Loaded
    at module load time. Reloading EZID causes all settings to be
    reloaded except Django and logging settings.
  userauth._ldapCache authentication cache
    A dictionary that maps local usernames to (hashed password, time,
    AuthenticatedUser) tuples. (The time in a tuple effectively puts
    a lifetime on the associated cached password.) Loaded on demand
    as usernames are encountered. Individual entries are removed when
    users change passwords, change groups, etc.
  policy._groups cache
    A dictionary that maps groups (identified by tuples; see above) to
    group information tuples (containing shoulders, CrossRef
    attributes, etc.). Loaded on demand as groups are encountered.
    Individual entries are removed when groups are modified through
    the admin interface.
  policy._coOwners cache
    A dictionary that maps users (identified by simple names) to
    co-owner lists (in which users are also identified by simple
    names). Loaded on demand as users are encountered. Individual
    entries are removed when co-owner lists are modified through the
    account and admin interfaces.
  session cookies
    See above. With a couple of caveats, a session cookie is deleted
    if and only if a user explicitly logs out.
    (SESSION_EXPIRE_AT_BROWSER_CLOSE is set to true, but that only
    directs browsers to drop cookies; it has no server-side effect.)
    First caveat: expired cookies are deleted by a weekly cron job.
    Second caveat: all of a user's session cookies are deleted if the
    user's account is disabled or if the user's group changes.
  idmap.py: _idMap, _groupMap, _userMap
    These dictionaries cache correspondences between agent identifiers
    and user and group local names. Loaded on demand.
  LDAP information
    Cached in agent identifiers for the purposes of storage redundancy
    and locality only. Written only, never read.
  shoulder.py
    Caches shoulder and datacenter objects from the store database;
    the database itself caches the content of the external shoulder
    file. Loaded when shoulders are first referenced. Note that
    shoulders and datacenters are never changed within EZID.
  store_group.py
    Caches group objects from the store database. Loads objects on
    demand, as they're referenced. Emptied when EZID is reloaded and
    when groups are modified or deleted.
  search database
    The search database is not a cache, strictly speaking, but as a
    quasi-clone of the store database it engenders the same kinds of
    issues that caches do. In the search database, profiles and
    datacenters are added as they are encountered (and they are never
    deleted, so it is possible for extraneous entries to remain in the
    database). Users, groups, and realms are kept in sync between the
    two databases.
  search_identifier.py
    Caches user, group, datacenter, and profile objects from the
    search database. Loads objects on demand, as they're referenced.
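Most of the caches above follow the same pattern: load on demand,
remove individual entries on change, empty everything on reload. As a
rough sketch of that pattern, modeled on the description of the
authentication cache, the following may help. The names
'authenticate', 'backend', 'invalidate', 'clear', and 'CACHE_LIFETIME',
the lifetime value, and the use of SHA-256 are all assumptions made for
illustration; none of them is taken from userauth.py.

```python
import hashlib
import time

CACHE_LIFETIME = 300  # seconds; hypothetical value
_cache = {}  # username -> (hashed password, time cached, user object)


def _hash(password):
    # Illustrative only; a real implementation may hash differently.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def authenticate(username, password, backend, now=time.time):
    """Return the cached user if the entry is fresh and the password
    matches; otherwise ask 'backend' (a stand-in for the LDAP lookup)."""
    entry = _cache.get(username)
    if entry is not None:
        hashed, cached_at, user = entry
        if now() - cached_at < CACHE_LIFETIME and hashed == _hash(password):
            return user
    user = backend(username, password)
    if user is not None:
        # Loaded on demand, as usernames are encountered.
        _cache[username] = (_hash(password), now(), user)
    return user


def invalidate(username):
    """Remove one entry, e.g. after a password or group change."""
    _cache.pop(username, None)


def clear():
    """Empty the whole cache, as happens on reload."""
    _cache.clear()
```

Storing the time alongside the hashed password is what gives each entry
a lifetime: a stale entry is simply ignored and then overwritten by the
next successful backend lookup.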
4. Identifier metadata
----------------------
See ezid.py.
5. Agent identifiers
--------------------
"Agents" (users and groups) are internally referred to and stored as
ARK identifiers (e.g., "ark:/99166/foo"), but are externally referred
to by local names (e.g., "dryad"). Identifiers that identify agents
are termed "agent identifiers;" see ezid.py and idmap.py for more
information. Because potentially sensitive LDAP information is cached
in agent identifiers (see above), not only are agent identifiers not
revealed to clients, they are owned by the EZID administrator and can
only be viewed by the EZID administrator.
6. Case sensitivity of LDAP UIDs
--------------------------------
LDAP UIDs are case-insensitive. Whenever EZID stores a UID (e.g., in
session cookies and in co-owner lists), it always uses the UID as
retrieved from LDAP. In other words, when LDAP is enabled, EZID's
behavior regarding UIDs is case-insensitive and case-preserving.
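A minimal sketch of case-insensitive, case-preserving behavior follows.
The class 'UidIndex' is hypothetical, and lowercasing is used here only
as a simple stand-in for LDAP's own matching rules, which may differ.

```python
class UidIndex:
    """Illustrative case-insensitive, case-preserving UID lookup."""

    def __init__(self):
        # Lowercased UID -> UID exactly as retrieved from LDAP.
        self._canonical = {}

    def add(self, uid_from_ldap):
        self._canonical[uid_from_ldap.lower()] = uid_from_ldap

    def resolve(self, uid_as_typed):
        """Return the stored form of the UID, or None if unknown."""
        return self._canonical.get(uid_as_typed.lower())


index = UidIndex()
index.add("Dryad")
stored = index.resolve("DRYAD")  # matches regardless of case
```

Whatever case the user types, the value that ends up in session cookies
and co-owner lists would be the stored form.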
7. Use of DataCite's active flag
--------------------------------
DataCite's 'active' flag (a DataCite-specific attribute of a DOI)
works as follows. It is true by default, and set to false by
performing an HTTP DELETE on the identifier. Note, though, that a
DELETE may be performed only if the identifier has metadata.
Performing a DELETE on an already deactivated identifier has no
effect. An identifier is (and can only be) reactivated by posting
metadata to it.
A deactivated identifier continues to exist in DataCite, but it is in
many ways deleted: an attempt to view the identifier returns 410 Gone,
and the identifier is removed from every DataCite service, including
the CrossRef/DataCite content resolver. It is not entirely deleted,
however, as the identifier continues to exist in the Handle System and
therefore continues to resolve.
Note that the above API behavior has no effect on setting a DOI's
target URL: the target URL may be set whether the identifier is active
or not, and whether it has metadata or not. Starting 2013-01-01
DataCite will disallow a new registration if the identifier has no
metadata. Our understanding is that nothing else about the DataCite
API will change; in particular, the target URL will continue to be
settable even if the identifier is not active. It is unclear at the
time of this writing if the target URL for a legacy identifier lacking
metadata may be set without first uploading metadata.
With this background, EZID's manipulation of the active flag can be
summarized as follows:
event                          actions
------------------------------ -------------------------
_status: public -> unavailable url=tombstone; DEACTIVATE
_status: unavailable -> public restore url; ACTIVATE
delete                         url=invalid; DEACTIVATE
_export: yes -> no             DEACTIVATE
_export: no -> yes             ACTIVATE
In the above, _status takes precedence over _export.
There are two differences between an unavailable identifier and a
public-but-not-exported identifier. First, an unavailable
identifier's target URL is overridden with a tombstone URL. Second, a
public-but-not-exported identifier's metadata is still uploaded to
DataCite.
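The table above can be read as a small state computation. The sketch
below is one interpretation, not EZID's actual code: it assumes that an
identifier should be active exactly when it is public and exported,
which is how "_status takes precedence over _export" is read here (an
unavailable identifier stays deactivated whatever _export says). The
function names and the action strings are hypothetical.

```python
def is_active(status, export):
    # Assumption: active if and only if public and exported.
    return status == "public" and export


def update_actions(old, new):
    """Return the actions for a change from 'old' to 'new'.

    Each state is a (status, export) pair, where status is "public" or
    "unavailable" and export is a bool.
    """
    old_status, _ = old
    new_status, _ = new
    actions = []
    if old_status == "public" and new_status == "unavailable":
        actions.append("url=tombstone")
    elif old_status == "unavailable" and new_status == "public":
        actions.append("restore url")
    # Touch the flag only when the desired active state changes.
    if is_active(*old) != is_active(*new):
        actions.append("ACTIVATE" if is_active(*new) else "DEACTIVATE")
    return actions


def delete_actions():
    """Deletion always invalidates the URL and deactivates."""
    return ["url=invalid", "DEACTIVATE"]
```

Under this reading, changing _export on an unavailable identifier
yields no flag action, since the identifier is already deactivated.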
8. Offline scripts
------------------
Offline scripts (dump-store, stats, dashboard, populate-store-2, etc.)
import EZID modules and directly call EZID functions. This generally
doesn't cause problems, with two exceptions. The first is logging: to
avoid appending to and possibly corrupting the running server's
transaction log file, offline scripts use the
settings/logging.offline.conf settings to log to standard error
instead. Second, script update actions may conflict with those of the
running server, even though SQLite locking works across processes,
because offline scripts don't participate in the server's locking
mechanism and won't necessarily interact properly with server
background processing daemons. This explains why, for example, the
expunge script performs its update actions through the EZID API.
See .../SITE_ROOT/PROJECT_ROOT/tools/offline.py for more information.
9. Log file formats
-------------------
There are two slightly different log file formats. The transaction
log written by the running server (by module log.py) stores start,
progress, and end records for every transaction, for both read and
write operations, both successful and not, as well as server error and
server status records. But for space efficiency, historical
transaction logs are converted to a more compact form. The striplog
tool discards all records except those for transactions that
successfully minted, created, or modified non-test identifiers.
Furthermore, the multiple records comprising a transaction are
collapsed into a single record. For example, the following two
transactions (records have been wrapped here for clarity):
  2014-01-06 20:58:11,383 4ec86f4a775811e3bdd610ddb1cf39e7 BEGIN mintArk
    13030/c7 gjanee ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v
  2014-01-06 20:58:11,715 4ec86f4a775811e3bdd610ddb1cf39e7 END SUCCESS
    13030/c7b56d41k
  2014-01-06 20:58:11,715 4efb1a59775811e3a95e10ddb1cf39e7 BEGIN createArk
    13030/c7b56d41k gjanee ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v
    erc.what An%20example
  2014-01-06 20:58:12,338 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
    noid_egg.setElements
  2014-01-06 20:58:12,342 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
    store.insert
  2014-01-06 20:58:12,345 4efb1a59775811e3a95e10ddb1cf39e7 END SUCCESS
get compacted into:
  2014-01-06 20:58:11,383 mintArk 13030/c7 gjanee ark:/99166/p92z12p14 cdl
    ark:/99166/p9z60c16v -> 13030/c7b56d41k
  2014-01-06 20:58:11,715 createArk 13030/c7b56d41k gjanee
    ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v erc.what An%20example
Note that record arguments in both types of log files are separated by
single spaces, and thus an empty argument will result in adjacent
spaces.
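The collapsing step can be sketched as below. This is not the striplog
tool itself: the function name 'compact' is hypothetical, the field
layout is inferred from the example records above (each assumed to
occupy a single unwrapped line), and the filtering by operation type
and by test identifier is omitted.

```python
def compact(lines):
    """Collapse each BEGIN ... END SUCCESS pair into one record."""
    pending = {}  # transaction ID -> (timestamp, [operation, arg, ...])
    compacted = []
    for line in lines:
        # Split on single spaces so that empty arguments survive as
        # empty strings rather than being swallowed.
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 4:
            continue  # not a transaction record
        timestamp = " ".join(fields[:2])  # date and time are two fields
        txid, kind, rest = fields[2], fields[3], fields[4:]
        if kind == "BEGIN":
            pending[txid] = (timestamp, rest)
        elif kind == "END":
            begin = pending.pop(txid, None)
            if begin is None or rest[:1] != ["SUCCESS"]:
                continue  # unmatched or unsuccessful transaction
            begin_time, call = begin
            record = [begin_time] + call
            if len(rest) > 1:
                # Anything after SUCCESS is the returned value.
                record += ["->"] + rest[1:]
            compacted.append(" ".join(record))
        # PROGRESS and all other record types are dropped.
    return compacted
```

Using split(" ") rather than split() matters here: because arguments
are separated by single spaces, an empty argument shows up as an empty
string between two adjacent spaces and is written back unchanged.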
10. Database dump formats
-------------------------
There are two slightly different database dump formats. A "raw" dump
(produced by 'dump -r' and 'dump-store -r') lists identifiers as
stored in the bind or store database: in unqualified form, with shadow
ARKs representing non-ARK identifiers, using internal labels, with all
internal labels included. Here's an example identifier record
(wrapped here for clarity):
  99999/fk4030wkq _is reserved _p erc
    _o ark:/99166/p92z12p14 _g ark:/99166/p9z60c16v
    _c 1389071897 _u 1389071897
    _t1 http://a.target/
    _t http://ezid.cdlib.org/id/ark:/99999/fk4030wkq
A "normal" dump (produced by 'dump -n' and 'dump-store -n', or
converted from a raw dump by 'convert-dump -n') uses a record
representation that is more human-readable and more easily processed.
It lists identifiers in qualified form, with non-ARK identifiers
representing themselves, using external labels, with internal labels
related to identifier status omitted. The same example in normal
form:
  ark:/99999/fk4030wkq _status reserved _profile erc
    _owner ark:/99166/p92z12p14 _ownergroup ark:/99166/p9z60c16v
    _created 1389071897 _updated 1389071897
    _target http://a.target/
A normal dump may optionally have agent identifiers converted to local
names (by omitting the '-n' option in the above commands), as in:
  ark:/99999/fk4030wkq _status reserved _profile erc
    _owner gjanee _ownergroup cdl
    _created 1389071897 _updated 1389071897
    _target http://a.target/
The select and project tools are intended to work on normal dumps,
though certain operations work on raw dumps as well.
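As a rough illustration of the raw-to-normal conversion, the sketch
below reproduces the example above for an ARK identifier. It is not the
convert-dump tool. The label mapping is inferred solely from that one
example, the rule that '_t1' (when present) supplies the target while
'_t' is dropped is an assumption, and shadow ARKs for non-ARK
identifiers are not handled. Each record is assumed to occupy a single
line with space-separated fields.

```python
# Internal -> external labels, inferred from the example record only.
LABELS = {
    "_is": "_status",
    "_p": "_profile",
    "_o": "_owner",
    "_g": "_ownergroup",
    "_c": "_created",
    "_u": "_updated",
}


def raw_to_normal(record):
    """Convert one raw ARK record to normal form (illustrative only)."""
    fields = record.split(" ")
    identifier, rest = fields[0], fields[1:]
    # The remaining fields alternate between labels and values.
    raw = dict(zip(rest[0::2], rest[1::2]))
    parts = ["ark:/" + identifier]  # qualified form
    for label, value in raw.items():
        if label in ("_t", "_t1"):
            continue  # targets are handled below
        parts += [LABELS.get(label, label), value]
    # Assumption: '_t1', when present, holds the user-supplied target.
    target = raw.get("_t1", raw.get("_t"))
    if target is not None:
        parts += ["_target", target]
    return " ".join(parts)
```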
11. CrossRef
------------
CrossRef does not provide an 'active' flag like DataCite does, and
this limits our ability to implement identifier status changes. Our
next-best-thing approach is as follows:
- The _crossref element may be set and unset while the identifier is
  reserved.
- If the _crossref element is set, the identifier must be exported.
- When the identifier is made public, it is registered with
  CrossRef. Once the identifier is public, the _crossref element
  may not be unset.
- If the identifier's status is set to unavailable, the identifier
  remains registered with CrossRef, but its target URL is set to the
  tombstone URL and the resource title is set to "WITHDRAWN". If
  the identifier is deleted (by the EZID administrator), the same
  happens, except that the target URL is set to
  http://datacite.org/invalidDOI.
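The first three rules above amount to a validity check on updates. A
minimal sketch follows, assuming a hypothetical function
'check_crossref_update' and treating any status other than "reserved"
as "has been made public". The error strings are invented for
illustration.

```python
def check_crossref_update(status, export, crossref_before, crossref_after):
    """Return None if the update is permitted, else an error message.

    status: the identifier's current status ("reserved", "public", or
    "unavailable"); export: whether the identifier is exported;
    crossref_before/after: whether _crossref is set before and after.
    """
    unsetting = crossref_before and not crossref_after
    if unsetting and status != "reserved":
        return "_crossref may not be unset once the identifier is public"
    if crossref_after and not export:
        return "an identifier with _crossref set must be exported"
    return None
```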