Abstractions
--------------------
Client files
two main files:
prefs.xml
user preferences.
includes list of projects; for each:
master URL
authenticator
project-specific prefs
resource share
prefs mod time
client_state.xml
hostid
per-project info
list of sched servers for project
project name
hostid
next_request_time
rpc_seqno (specific to this host)
work info
files, WUs, results etc.
NOTES
- On startup, if there's no prefs.xml, the client prompts
for a master URL and authenticator,
and creates an initial prefs.xml with a zero mod time
(so that any web-created prefs file will override)
- We need to safeguard against a buggy scheduling server
sending back an incomplete or empty prefs file.
Suggestions:
1) verify that at least the responding project is present in the prefs;
(or contain at least 1 project)
2) back up the old prefs file (prefs.xml.date)
- prefs.xml has priority over client_state.xml
If there's a project in prefs with no counterpart in client_state,
a new entry in client_state is created.
Entries in client_state absent from prefs are deleted.
- to "clone" an installation on a new computer,
just need to copy the core client (or run the installer)
then copy the account.xml file.
- a scheduler request can specify that no client_state.xml
was found, so a new host record should be created.
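A minimal sketch of two of the notes above: suggestion 1 (check that the responding project is in the returned prefs) and the rule that prefs.xml has priority over client_state.xml. The function names, the use of master URLs as keys, and the shape of a fresh per-project entry are assumptions made for illustration only.

```python
def reply_prefs_look_valid(prefs_projects, responding_url):
    """Suggestion 1: reject a prefs file that omits the responding project."""
    return responding_url in prefs_projects


def reconcile(prefs_projects, client_state):
    """Make client_state follow prefs: add missing projects, drop extras.

    prefs_projects: set of master URLs listed in prefs.xml
    client_state:   dict mapping master URL -> per-project info dict
    """
    reconciled = {}
    for url in sorted(prefs_projects):
        # Keep existing per-project info, or create a fresh entry
        # (hypothetical layout) for a project that is new in prefs.
        reconciled[url] = client_state.get(
            url, {"sched_urls": [], "rpc_seqno": 0}
        )
    # Anything in client_state but not in prefs is simply not copied over.
    return reconciled
```

Building a new dict rather than editing client_state in place keeps the old state available if the new prefs turn out to be bad.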
--------------------
When does client contact scheduling server?
Each result has a max notification delay,
so when a client completes it there's a deadline for notification.
Contact a scheduling server if:
- you're below the low-water mark in work for that project,
or you have a result past its deadline
- AND there's no delay in effect for that project.
A delay may be explicitly returned by the scheduling server,
or may be because of exponential backoff after failed attempts.
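The rule above can be written as a single predicate. This is only a sketch; the parameter names and the use of seconds as the unit are assumptions.

```python
def should_contact(work_secs, low_water_secs, has_overdue_result,
                   now, delay_until):
    """Return True if the client should contact a project's scheduler.

    work_secs:          estimated work on hand for this project
    low_water_secs:     low-water mark for this project
    has_overdue_result: a result is past its notification deadline
    delay_until:        time before which no contact is allowed
                        (server-requested delay or backoff)
    """
    needs_rpc = work_secs < low_water_secs or has_overdue_result
    no_delay_in_effect = now >= delay_until
    return needs_rpc and no_delay_in_effect
```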
--------------------
Given that we can estimate the time it will take to get back
a result from a given host, it might be possible to assign
deadlines to results, and only send them to hosts that are fast enough
--------------------
Client logging
write events to log file:
start/stop client
start/finish file xfer
start/finish application execution
start/finish scheduling server call
error messages
logging flag is part of preferences
--------------------
file xfer commands
implemented as WU/result pairs whose app is "file_xfer".
Can have just one input file, one output.
Application servers can leave these in a "message" directory,
where the scheduling server can find them and give them to
the client the next time it contacts the server.
--------------------
result states in client
don't have files yet
have files, not started
have files, started
completed, sending output files
output files sent
output files sent, some sticky files deleted
--------------------
result attributes in DB, sched server
state:
unsent
sent, in progress
timed out
file state
all output files are openly available
(i.e. have been uploaded)
WU attributes in DB, sched server
input file state (set by app server)
all input files are available
not all input files available
--------------------
Client logic
["network xfer" object encapsulates a set of file xfers in progress]
["processor" object: one for each CPU]
read config file
loop
check user activity - turn off computations if needed
start a computation if possible
all necessary files present,
and workunit not done or in progress.
check processes (fail, done)
start new network xfers if possible
xfer 16KB if possible (use select)
if xfer complete, update state
if estimated work below low-water mark
while estimated work below high-water mark
pick a project with work due whose dont_contact_until has passed
contact a control server; request (high-water mark - current) work
if can't get connection, update dont_contact_until
end
end
end
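A sketch of the work-fetch part of the loop above. It makes a single pass over the projects instead of looping until the high-water mark is reached, and it takes the server call as a parameter; `request_work`, `retry_delay` and the dict layout are hypothetical.

```python
def fetch_work(projects, est_work, low_water, high_water, now,
               request_work, retry_delay=60.0):
    """Top up the work buffer; return the new estimated work.

    projects:     list of dicts, each with a 'dont_contact_until' time
    request_work: callable(project, amount) returning the amount of work
                  obtained, or None if no connection could be made
    """
    if est_work >= low_water:
        return est_work  # nothing to do until we drop below low water
    for project in projects:
        if est_work >= high_water:
            break
        if now < project["dont_contact_until"]:
            continue  # a delay is in effect for this project
        got = request_work(project, high_water - est_work)
        if got is None:
            # Can't get a connection: push back the next attempt.
            project["dont_contact_until"] = now + retry_delay
        else:
            est_work += got
    return est_work
```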
--------------------
Application logic
--------------------
Control RPC protocol
--------------------
Web site functions
--------------------
Startup scenarios
- How a user initially signs up:
Visit the project's URL.
Create an account:
enter email address
wait for password to arrive in email.
download installer
installer installs agent, initial config file
run agent; type in password.
- How a user adds a project
Same as above, but don't download agent.
Go to "home" web site and add project.
- How a user removes a project
Go to "home" web site and remove project
------------------------------
Versions
Core client:
When and how does a scheduler tell a core agent
that a newer version can/should be downloaded?
How is compatibility between application agents
and core agents represented?
--------------------------------------
Distributed storage
Projects can use clients for storage using "sticky" files
(which are either sent to clients, or generated by the client).
The core client is free to delete sticky files any time.
Scheduler requests include a list of the sticky files held by the host.
This list is stored in a blob in the host record.
Scheduler replies can include <file_info> tags
instructing the client to download files.
These files need not be associated with applications or workunits.
Scheduler replies can include <file_info> tags
instructing the client to upload files.
--------------------------------
Preferences
CPU usage
don't run or communicate if on batteries
don't run or communicate if user is active
confirm before making network connection
minimum, maximum work buffer
Disk usage
use at most X GB
leave at least X GB free
leave at least X% free
Projects
For each project:
user name
project's master URL
email address
authenticator
resource %
show email address on web site?
accept emails from project?
project-specific prefs
------------------------------
retry policies:
general issues:
when and where to retry?
when to declare overall failure?
what to do if overall failure?
what needs to be saved in state file?
file xfer
download
round-robin through URLs with random exponential backoff
after connection failure or HTTP error.
2X from 1 minute up to 256 minutes
Overall failure after 1 week since last successful xfer
flag result as "file download failed",
abort other file xfers,
delete other files.
write log entry
State file:
record time of last successful xfer
upload
same as for download?
Use HTTP features to find file size on server
scheduler RPC
order projects according to host's "debt" to them.
Attempt to contact them in this order.
For each project, try all URLs in sequence with no delay.
If still need more work after a given RPC,
keep going to next project.
If still not enough work after a given "round",
do exponential backoff
2X from 1 minute up to 256 minutes
If the delay reaches 256 minutes, show error message to user and write to log
nothing saved in state file
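A sketch of the backoff schedule used above (2X from 1 minute up to 256 minutes) and of the one-week overall-failure test for downloads. The notes do not say how the "random" part is applied; drawing uniformly between zero and the current ceiling is an assumption, as are all the names.

```python
import random

MIN_DELAY_MIN = 1
MAX_DELAY_MIN = 256
WEEK_SECS = 7 * 24 * 3600


def backoff_minutes(n_failures):
    """Delay ceiling after n_failures consecutive failures (n_failures >= 1)."""
    return min(MIN_DELAY_MIN * 2 ** (n_failures - 1), MAX_DELAY_MIN)


def random_backoff_minutes(n_failures, rng=random):
    """Randomized delay: uniform between 0 and the ceiling (an assumption)."""
    return rng.uniform(0, backoff_minutes(n_failures))


def download_has_failed(now, last_success_time):
    """Overall failure: more than 1 week since the last successful xfer."""
    return now - last_success_time > WEEK_SECS
```

The ceiling doubles on each failure and hits the 256-minute cap on the ninth consecutive failure.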
------------------
Core/App connection
two unidirectional message streams.
files "core_to_app.xml" and "app_to_core.xml".
core->app:
initially:
requested frequency of app->core messages
app preferences
name of graphics shared-mem segment
recommended graphics parameters
frame rate
size
recommended checkpoint period
whether to do graphics
thereafter:
recommended graphics params
app->core
percent done
I just checkpointed
CPU time so far
-------------------
File upload security
Each project has a "file upload key pair"
Scheduling server has private key;
data servers have public key.
The key pair may be changed periodically;
data servers have to store old and new during transitions
- in DB, result XML has format
<result>
<file_info>
<max_size>123123</max_size>
</file_info>
...
</result>
- RPC reply: result XML info has format
<result>
<file_info>
...
<expiration>...</expiration> (added by server)
</result>
<result_signature>
<name>foo</name>
... (digital signature of <result> element; added by server)
</result_signature>
- Client stores:
for each result (in state file, in memory)
exact text of <result> tag
signature
- On file upload, client sends
<result> element (exact text)
<result_signature>
...
</result_signature>
<filename>blah</filename>
<offset>123</offset>
<total_size>1234</total_size>
<data_start/>
... data
- file upload handler does:
parse header (up to <data_start/>)
validate signature of <result>
verify that filename is in list of file_infos
verify that total size is within limit
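A sketch of the file upload handler checks listed above. The signature check is passed in as a callable because the notes do not name a signature scheme. The sketch also assumes each <file_info> carries a <name> element next to <max_size>, which the format above does not show; `check_upload` and the reason strings are hypothetical.

```python
import re


def check_upload(request, verify_signature):
    """Validate an upload request (bytes). Return (ok, reason).

    verify_signature: callable(result_xml, signature) -> bool,
                      supplied by the caller (holds the public key).
    """
    # Parse header (everything up to <data_start/>).
    header, sep, _data = request.partition(b"<data_start/>")
    if not sep:
        return False, "no data_start"
    text = header.decode("utf-8", errors="replace")
    result = re.search(r"<result>.*?</result>", text, re.S)
    sig = re.search(r"<result_signature>(.*?)</result_signature>", text, re.S)
    name = re.search(r"<filename>(.*?)</filename>", text, re.S)
    total = re.search(r"<total_size>(\d+)</total_size>", text)
    if not (result and sig and name and total):
        return False, "malformed header"
    # Validate signature over the exact text of the <result> element.
    if not verify_signature(result.group(0), sig.group(1).strip()):
        return False, "bad signature"
    # Collect per-file size limits from the signed <file_info> elements.
    limits = {}
    for info in re.findall(r"<file_info>(.*?)</file_info>",
                           result.group(0), re.S):
        n = re.search(r"<name>(.*?)</name>", info, re.S)
        m = re.search(r"<max_size>(\d+)</max_size>", info)
        if n and m:
            limits[n.group(1).strip()] = int(m.group(1))
    fname = name.group(1).strip()
    if fname not in limits:
        return False, "unknown file"
    if int(total.group(1)) > limits[fname]:
        return False, "file too large"
    return True, "ok"
```

Because the limits are read from the signed <result> text, a client cannot raise its own size limit without invalidating the signature.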
----------------------------
Project main URL scheme
Each project advertises (and is identified by) a "root URL".
This URL returns a browser-visible "root page" describing the project,
linking to the registration, etc.
It also contains (in elements inside HTML comments)
one or more <scheduler_server_url> elements,
each containing the URL of a scheduling server.
When the core client initially runs, it fetches and parses
the root page, and records the scheduling server URLs.
Whenever it can't contact any scheduling server, it reloads
the root page; the scheduling server URLs may have changed.
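A sketch of how the core client might pull the scheduling server URLs out of the root page. It only looks inside HTML comments, as described above; the function name and the example URL are hypothetical.

```python
import re


def scheduler_urls(root_page_html):
    """Return the scheduling server URLs found inside HTML comments."""
    urls = []
    for comment in re.findall(r"<!--(.*?)-->", root_page_html, re.S):
        for url in re.findall(
            r"<scheduler_server_url>(.*?)</scheduler_server_url>",
            comment, re.S,
        ):
            urls.append(url.strip())
    return urls


page = """<html><body>Project description, registration link ...
<!--
<scheduler_server_url>http://example.org/cgi/sched</scheduler_server_url>
-->
</body></html>"""
found = scheduler_urls(page)
```

An empty list would mean the root page advertises no scheduling server, and the client has nothing to contact until the page changes.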