Architecture notes...
1. Internal identification of users and groups
----------------------------------------------
Users and groups are both identified by 2- or 3-element tuples of the
form:
(name, persistent identifier [, DN])
For example:
("dryad", "ark:/13030/foo", "uid=dryad,ou=People,ou=uc3,dc=cdlib,dc=org")
The last tuple element is present only if LDAP is enabled.
In the UI, if the user is not logged in, the user and group are both
set to ("anonymous", "anonymous") for the purposes of identifier
ownership and access control.
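As a minimal sketch of the tuple convention above, the following
Python builds 2- or 3-element agent tuples. The names make_agent,
has_dn, and ANONYMOUS are hypothetical and are not taken from EZID's
code.

```python
# Hypothetical helpers illustrating the tuple convention; not EZID's code.

ANONYMOUS = ("anonymous", "anonymous")  # used for both user and group


def make_agent(name, pid, dn=None):
    """Return (name, pid) or, if an LDAP DN is supplied, (name, pid, dn)."""
    if dn is None:
        return (name, pid)
    return (name, pid, dn)


def has_dn(agent):
    """True if the tuple carries the optional third (DN) element."""
    return len(agent) == 3


user = make_agent("dryad", "ark:/13030/foo",
                  "uid=dryad,ou=People,ou=uc3,dc=cdlib,dc=org")
```

Code that consumes such tuples can index elements 0 and 1
unconditionally, but should check the length before reading the DN.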
2. Session cookies
------------------
Session cookies store the following key/value pairs:
auth
A userauth.AuthenticatedUser object which has 'user' and 'group'
attributes, each of which is a tuple as described above.
Presence of this key indicates that the user is authenticated.
redirect_to
The full URL path to which the user should be redirected following
a successful login. UI only. May or may not be present; not
cleared.
3. Caching
----------
Caching is employed in several places:
ezid.conf settings
Cached in various modules. Loaded at module load time. Invoking
the "Reload EZID" admin function causes all settings to be
reloaded (except logging settings).
shoulder._shoulders dictionary
shoulders_cache.txt file
The former is a dictionary that maps shoulders to
shoulder_parser.Entry objects. Loaded on demand. Emptied when
EZID is reloaded. The latter is a local copy of the (remote)
shoulder database file.
userauth._ldapCache authentication cache
A dictionary that maps local usernames to AuthenticatedUser
objects. Loaded on demand as usernames are encountered. Emptied
when EZID is reloaded. Also, individual entries are removed when
users change groups.
policy._groups shoulder cache
A dictionary that maps groups (identified by tuples; see above) to
lists of shoulders (shoulder_parser.Entry objects). Loaded on
demand as groups are encountered. Emptied when EZID is reloaded.
Also, individual entries are removed when shoulder lists are
modified through the admin interface.
policy._coOwners cache
A dictionary that maps users (identified by simple names) to
co-owner lists (in which users are also identified by simple
names). Loaded on demand as users are encountered. Emptied when
EZID is reloaded. Also, individual entries are removed when
co-owner lists are modified through the account and admin
interfaces.
search._coOwnershipMap cache
A dictionary that maps users to lists of other users that have
named them as account-level co-owners. I.e., if users A and B
both name user C as an account-level co-owner, then this map will
contain an entry C -> [A, B]. Users are identified by ARK
identifiers. Computed once on demand. Emptied when EZID is
reloaded and when any account-level co-ownership list is modified.
session cookies
See above. With a couple of caveats, a session cookie is deleted if
and only if a user explicitly logs out
(SESSION_EXPIRE_AT_BROWSER_CLOSE is set to true, but that only
directs browsers to drop cookies; it has no server-side
effect). First caveat: expired cookies are deleted by a weekly
cron job. Second caveat: all of a user's session cookies are
deleted if the user's account is disabled or if the user's group
changes.
idmap.py: _idMap, _groupMap, _userMap
These dictionaries cache correspondences between agent identifiers
and user and group local names. Loaded on demand. Emptied when
EZID is reloaded.
LDAP information
Cached in agent identifiers for the purposes of storage redundancy
and locality only. Written only, never read.
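To illustrate the search._coOwnershipMap entry in the list above,
here is a sketch of inverting co-owner lists into a "named by" map
that is computed once on demand and emptied on modification. The
input dictionary, the function names, and the placeholder ARKs are
all assumptions made for illustration; the actual module may be
organized differently.

```python
from collections import defaultdict

# Hypothetical input: each user's account-level co-owner list, keyed by
# the user who named the co-owners. The identifiers are placeholders.
co_owners = {
    "ark:/99166/userA": ["ark:/99166/userC"],
    "ark:/99166/userB": ["ark:/99166/userC"],
}

_co_ownership_map = None  # computed once on demand


def get_co_ownership_map():
    """Return the inverted map, building it on first use."""
    global _co_ownership_map
    if _co_ownership_map is None:
        inverted = defaultdict(list)
        for owner in sorted(co_owners):
            for co_owner in co_owners[owner]:
                inverted[co_owner].append(owner)
        _co_ownership_map = dict(inverted)
    return _co_ownership_map


def invalidate():
    """Empty the cache, e.g. after any co-owner list is modified."""
    global _co_ownership_map
    _co_ownership_map = None
```

With the sample data, the map holds a single entry C -> [A, B]. The
whole map is discarded on any change rather than patched, since a
single edit can affect several entries.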
These can be considered caches as well:
store database
Serves as a local backup for the primary "bind" noid database;
also supports identifier harvesting. Stores all identifiers and
all metadata.
search database
Supports searching and browsing over identifier metadata. Stores
some identifiers (most are included, but, e.g., anonymously-owned
identifiers are excluded) and just that metadata needed for
searching.
4. Identifier metadata
----------------------
See ezid.py.
5. Agent identifiers
--------------------
"Agents" (users and groups) are internally referred to and stored as
ARK identifiers (e.g., "ark:/99166/foo"), but are externally referred
to by local names (e.g., "dryad"). Identifiers that identify agents
are termed "agent identifiers"; see ezid.py and idmap.py for more
information. Because potentially sensitive LDAP information is cached
in agent identifiers (see above), not only are agent identifiers not
revealed to clients, they are owned by the EZID administrator and can
only be viewed by the EZID administrator.
The search database contains agent identifiers; see
.../etc/search-schema.sql for a potential privacy/security hole.
6. Case sensitivity of LDAP UIDs
--------------------------------
LDAP UIDs are case-insensitive. Whenever EZID stores a UID (e.g., in
session cookies and in co-owner lists), it always uses the UID as
retrieved from LDAP. In other words, when LDAP is enabled, EZID's
behavior regarding UIDs is case-insensitive and case-preserving.
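The following sketch shows one way to get case-insensitive but
case-preserving behavior. The list standing in for the LDAP directory
and the function name canonical_uid are hypothetical, and lowercasing
is a simplification of LDAP's actual matching rules.

```python
# Hypothetical stand-in for an LDAP directory: UIDs as stored there.
_DIRECTORY_UIDS = ["dryad", "GJanee"]

# Index by lowercased UID so that lookups ignore case.
_BY_FOLDED = {uid.lower(): uid for uid in _DIRECTORY_UIDS}


def canonical_uid(entered):
    """Return the UID as the directory spells it, or None if unknown."""
    return _BY_FOLDED.get(entered.lower())
```

Storing only the value returned by such a lookup, never the string
the user typed, keeps session cookies and co-owner lists consistent
with each other.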
7. Use of DataCite's active flag
--------------------------------
DataCite's 'active' flag (a DataCite-specific attribute of a DOI)
works as follows. It is true by default, and set to false by
performing an HTTP DELETE on the identifier. Note, though, that a
DELETE may be performed only if the identifier has metadata.
Performing a DELETE on an already deactivated identifier has no
effect. An identifier is (and can only be) reactivated by posting
metadata to it.
A deactivated identifier continues to exist in DataCite, but it is in
many ways deleted: an attempt to view the identifier returns 410 Gone,
and the identifier is removed from every DataCite service, including
the CrossRef/DataCite content resolver. It is not entirely deleted,
however, as the identifier continues to exist in the Handle System and
therefore continues to resolve.
Note that the above API behavior has no effect on setting a DOI's
target URL: the target URL may be set whether the identifier is active
or not, and whether it has metadata or not. Starting 2013-01-01,
DataCite will disallow a new registration if the identifier has no
metadata. Our understanding is that nothing else about the DataCite
API will change; in particular, the target URL will continue to
be settable if the identifier is not active. It is unclear at the
time of this writing if the target URL for a legacy identifier lacking
metadata may be set without first uploading metadata.
With this background, EZID's manipulation of the active flag can be
summarized as follows:
event                           actions
------------------------------  -------------------------
_status: public -> unavailable  url=tombstone; DEACTIVATE
_status: unavailable -> public  restore url; ACTIVATE
delete                          url=invalid; DEACTIVATE
_export: yes -> no              DEACTIVATE
_export: no -> yes              ACTIVATE
In the above, _status takes precedence over _export.
There are two differences between an unavailable identifier and a
public-but-not-exported identifier. First, an unavailable
identifier's target URL is overridden with a tombstone URL. Second, a
public-but-not-exported identifier's metadata is still uploaded to
DataCite.
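One possible reading of the table and the precedence rule, sketched
in Python: an identifier is active at DataCite only if it is public
and exported, and a flag action is needed only when that derived
state changes. This interpretation and the function names are
assumptions, not EZID's code. The sketch covers only the public and
unavailable statuses, and it omits the delete event and the target
URL actions.

```python
def should_be_active(status, export):
    """Assumed rule: active only if public and exported."""
    if status not in ("public", "unavailable"):
        raise ValueError("sketch covers only public and unavailable")
    if status != "public":
        return False  # _status takes precedence over _export
    return export == "yes"


def flag_action(old, new):
    """Return 'ACTIVATE', 'DEACTIVATE', or None for a state change.

    old and new are (status, export) pairs.
    """
    was, now = should_be_active(*old), should_be_active(*new)
    if was == now:
        return None
    return "ACTIVATE" if now else "DEACTIVATE"
```

Under this reading, changing _export on an unavailable identifier
produces no flag action, because the identifier stays deactivated
either way.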
8. Offline scripts
------------------
Offline scripts (expunge, stats, dashboard, etc.) do not perform
identifier actions through the EZID API; rather, they import EZID
modules and directly call EZID functions. This technique generally
shouldn't cause any problems, as SQLite locking mechanisms work across
processes, with two exceptions. The first exception is logging: to
avoid appending to and possibly corrupting the running server's
transaction log file, offline scripts use the
settings/logging.offline.conf settings to log to standard error
instead. An implication of this is that script actions do not appear
in the transaction log file. The second exception is that expunge
actions may conflict with those of the running server. The
possibility is remote; nevertheless, the expunge script should
probably be rewritten to go through the EZID API.
9. Log file formats
-------------------
There are two slightly different log file formats. The transaction
log written by the running server (by module log.py) stores start,
progress, and end records for every transaction, for both read and
write operations, both successful and not, as well as server error and
server status records. But for space efficiency, historical
transaction logs are converted to a more compact form. The striplog
tool discards all records except those for transactions that
successfully minted, created, or modified non-test identifiers.
Furthermore, the multiple records comprising a transaction are
collapsed into a single record. For example, the following two
transactions (records have been wrapped here for clarity):
  2014-01-06 20:58:11,383 4ec86f4a775811e3bdd610ddb1cf39e7 BEGIN mintArk
    13030/c7 gjanee ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v
  2014-01-06 20:58:11,715 4ec86f4a775811e3bdd610ddb1cf39e7 END SUCCESS
    13030/c7b56d41k
  2014-01-06 20:58:11,715 4efb1a59775811e3a95e10ddb1cf39e7 BEGIN createArk
    13030/c7b56d41k gjanee ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v
    erc.what An%20example
  2014-01-06 20:58:12,338 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
    noid.setElements
  2014-01-06 20:58:12,342 4efb1a59775811e3a95e10ddb1cf39e7 PROGRESS
    store.insert
  2014-01-06 20:58:12,345 4efb1a59775811e3a95e10ddb1cf39e7 END SUCCESS
get compacted into:
  2014-01-06 20:58:11,383 mintArk 13030/c7 gjanee ark:/99166/p92z12p14 cdl
    ark:/99166/p9z60c16v -> 13030/c7b56d41k
  2014-01-06 20:58:11,715 createArk 13030/c7b56d41k gjanee
    ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v erc.what An%20example
Note that record arguments in both types of log files are separated by
single spaces, and thus an empty argument will result in adjacent
spaces.
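Because empty arguments show up as adjacent spaces, a reader of these
logs has to split on single spaces rather than on runs of whitespace.
A minimal sketch for a compact record follows. The function name is
hypothetical. It assumes that a record occupies one physical line
(unwrapped, unlike the examples above) and that arguments are
percent-encoded, as "An%20example" suggests.

```python
from urllib.parse import unquote


def parse_compact_record(line):
    """Split one unwrapped compact-log record into its parts."""
    # split(" ") rather than split(): adjacent spaces mark an empty argument.
    fields = line.rstrip("\n").split(" ")
    date, time, operation = fields[0], fields[1], fields[2]
    args = [unquote(a) for a in fields[3:]]
    return date + " " + time, operation, args


record = ("2014-01-06 20:58:11,715 createArk 13030/c7b56d41k gjanee "
          "ark:/99166/p92z12p14 cdl ark:/99166/p9z60c16v erc.what An%20example")
timestamp, operation, args = parse_compact_record(record)
```

Using str.split() with no argument would silently drop empty
arguments and shift every later argument by one position.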
10. Database dump formats
-------------------------
There are two slightly different database dump formats. A "raw" dump
(produced by 'dump -r' and 'dump-store -r') lists identifiers as
stored in the bind or store database: in unqualified form, with shadow
ARKs representing non-ARK identifiers, using internal labels, with all
internal labels included. Here's an example identifier record
(wrapped here for clarity):
  99999/fk4030wkq _is reserved _p erc
    _o ark:/99166/p92z12p14 _g ark:/99166/p9z60c16v
    _c 1389071897 _u 1389071897
    _t1 http://a.target/
    _t http://ezid.cdlib.org/id/ark:/99999/fk4030wkq
A "normal" dump (produced by 'dump -n' and 'dump-store -n', or
converted from a raw dump by 'convert-dump -n') uses a record
representation that is more human-readable and more easily processed.
It lists identifiers in qualified form, with non-ARK identifiers
representing themselves, using external labels, with internal labels
related to identifier status omitted. The same example in normal
form:
  ark:/99999/fk4030wkq _status reserved _profile erc
    _owner ark:/99166/p92z12p14 _ownergroup ark:/99166/p9z60c16v
    _created 1389071897 _updated 1389071897
    _target http://a.target/
A normal dump may optionally have agent identifiers converted to local
names (by omitting the '-n' option in the above commands), as in:
  ark:/99999/fk4030wkq _status reserved _profile erc
    _owner gjanee _ownergroup cdl
    _created 1389071897 _updated 1389071897
    _target http://a.target/
The select and project tools are intended to work on normal dumps,
though certain operations work on raw dumps as well.
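A normal dump record of the shape shown above could be read as in the
following sketch. It assumes, which these notes do not confirm, that
each record is a single line of space-separated tokens: the
identifier followed by alternating label/value pairs. The function
name parse_dump_record is hypothetical and is not one of the tools
mentioned above.

```python
def parse_dump_record(line):
    """Return (identifier, {label: value}) for one unwrapped record."""
    tokens = line.rstrip("\n").split(" ")
    identifier, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected label/value pairs after the identifier")
    return identifier, dict(zip(rest[0::2], rest[1::2]))


record = ("ark:/99999/fk4030wkq _status reserved _profile erc "
          "_owner gjanee _ownergroup cdl "
          "_created 1389071897 _updated 1389071897 "
          "_target http://a.target/")
identifier, metadata = parse_dump_record(record)
```

A selection over a dump then reduces to a filter on the parsed
dictionary, for example keeping records whose _status is "reserved".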