.. _debug_faq:
=========================================
Debugging Theano: FAQ and Troubleshooting
=========================================
There are many kinds of bugs that might come up in a computer program.
This page is structured as a FAQ. It provides recipes to tackle common
problems, and introduces some of the tools that we use to find problems in our
own Theano code, and even (it happens) in Theano's internals, in
:ref:`using_debugmode`.
Isolating the Problem/Testing Theano Compiler
---------------------------------------------
You can run your Theano function in a :ref:`DebugMode<using_debugmode>`.
This tests the Theano optimizations and helps to find where NaN, inf and other problems come from.
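For instance, a minimal sketch (the expression is illustrative):

.. code-block:: python

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')
    # Compile in DebugMode: every Apply node is checked against
    # reference computations, and optimizations are verified not
    # to change the results.
    f = theano.function([x], 10 * x, mode='DebugMode')
    f(np.array([5.0]))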
Interpreting Error Messages
---------------------------
Even in its default configuration, Theano tries to display useful error
messages. Consider the following faulty code.
.. testcode::
import numpy as np
import theano
import theano.tensor as T
x = T.vector()
y = T.vector()
z = x + x
z = z + y
f = theano.function([x, y], z)
f(np.ones((2,)), np.ones((3,)))
Running the code above we see:
.. testoutput::
:options: +ELLIPSIS
Traceback (most recent call last):
...
ValueError: Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 2)
Apply node that caused the error: Elemwise{add,no_inplace}(<TensorType(float64, vector)>, <TensorType(float64, vector)>, <TensorType(float64, vector)>)
Inputs types: [TensorType(float64, vector), TensorType(float64, vector), TensorType(float64, vector)]
Inputs shapes: [(3,), (2,), (2,)]
Inputs strides: [(8,), (8,), (8,)]
Inputs scalar values: ['not scalar', 'not scalar', 'not scalar']
HINT: Re-running with most Theano optimization disabled could give you a back-traces when this node was created. This can be done with by setting the Theano flags 'optimizer=fast_compile'. If that does not work, Theano optimization can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
Arguably the most useful information is approximately half-way through
the error message, where the kind of error is displayed along with its
cause (`ValueError: Input dimension mis-match. (input[0].shape[0] = 3,
input[1].shape[0] = 2`).
Below it, some other information is given, such as the apply node that
caused the error, as well as the input types, shapes, strides and
scalar values.
The two hints can also be helpful when debugging. Using the Theano flag
``optimizer=fast_compile`` or ``optimizer=None`` can often reveal
the faulty line, while ``exception_verbosity=high`` will display a
debugprint of the apply node. Using these hints, the end of the error
message becomes:
.. code-block:: none
Backtrace when the node is created:
File "test0.py", line 8, in <module>
z = z + y
Debugprint of the apply node:
Elemwise{add,no_inplace} [id A] <TensorType(float64, vector)> ''
|Elemwise{add,no_inplace} [id B] <TensorType(float64, vector)> ''
| |<TensorType(float64, vector)> [id C] <TensorType(float64, vector)>
| |<TensorType(float64, vector)> [id C] <TensorType(float64, vector)>
|<TensorType(float64, vector)> [id D] <TensorType(float64, vector)>
We can see here that the error can be traced back to the line ``z = z + y``.
For this example, using ``optimizer=fast_compile`` worked. If it had not,
you could set ``optimizer=None`` or use test values. A short sketch of how
to set these flags follows.
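There are two common ways to set these flags: through the ``THEANO_FLAGS``
environment variable, or from Python before compiling. A minimal sketch
(the script name is illustrative):

.. code-block:: python

    # From the shell, for a single run:
    #   THEANO_FLAGS='optimizer=fast_compile,exception_verbosity=high' python test0.py

    # Or from Python, before building and compiling the graph:
    import theano
    theano.config.optimizer = 'fast_compile'
    theano.config.exception_verbosity = 'high'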
Using Test Values
-----------------
As of v.0.4.0, Theano has a new mechanism by which graphs are executed
on-the-fly, before a ``theano.function`` is ever compiled. Since optimizations
haven't been applied at this stage, it is easier for the user to locate the
source of some bug. This functionality is enabled through the config flag
``theano.config.compute_test_value``. Its use is best shown through the
following example. Here, we use ``exception_verbosity=high`` and
``optimizer=fast_compile``, which in this case would not tell you the faulty
line. ``optimizer=None`` would, and it could therefore be used instead of test values.
.. testcode:: testvalue
import numpy
import theano
import theano.tensor as T
# compute_test_value is 'off' by default, meaning this feature is inactive
theano.config.compute_test_value = 'off' # Use 'warn' to activate this feature
# configure shared variables
W1val = numpy.random.rand(2, 10, 10).astype(theano.config.floatX)
W1 = theano.shared(W1val, 'W1')
W2val = numpy.random.rand(15, 20).astype(theano.config.floatX)
W2 = theano.shared(W2val, 'W2')
# input which will be of shape (5,10)
x = T.matrix('x')
# provide Theano with a default test-value
#x.tag.test_value = numpy.random.rand(5, 10)
# transform the shared variable in some way. Theano does not
# know off hand that the matrix func_of_W1 has shape (20, 10)
func_of_W1 = W1.dimshuffle(2, 0, 1).flatten(2).T
# source of error: dot product of 5x10 with 20x10
h1 = T.dot(x, func_of_W1)
# do more stuff
h2 = T.dot(h1, W2.T)
# compile and call the actual function
f = theano.function([x], h2)
f(numpy.random.rand(5, 10))
Running the above code generates the following error message:
.. testoutput:: testvalue
Traceback (most recent call last):
File "test1.py", line 31, in <module>
f(numpy.random.rand(5, 10))
File "PATH_TO_THEANO/theano/compile/function_module.py", line 605, in __call__
self.fn.thunks[self.fn.position_of_error])
File "PATH_TO_THEANO/theano/compile/function_module.py", line 595, in __call__
outputs = self.fn()
ValueError: Shape mismatch: x has 10 cols (and 5 rows) but y has 20 rows (and 10 cols)
Apply node that caused the error: Dot22(x, DimShuffle{1,0}.0)
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(5, 10), (20, 10)]
Inputs strides: [(80, 8), (8, 160)]
Inputs scalar values: ['not scalar', 'not scalar']
Debugprint of the apply node:
Dot22 [id A] <TensorType(float64, matrix)> ''
|x [id B] <TensorType(float64, matrix)>
|DimShuffle{1,0} [id C] <TensorType(float64, matrix)> ''
|Flatten{2} [id D] <TensorType(float64, matrix)> ''
|DimShuffle{2,0,1} [id E] <TensorType(float64, 3D)> ''
|W1 [id F] <TensorType(float64, 3D)>
HINT: Re-running with most Theano optimization disabled could give you a back-traces when this node was created. This can be done with by setting the Theano flags 'optimizer=fast_compile'. If that does not work, Theano optimization can be disabled with 'optimizer=None'.
If the above is not informative enough, by instrumenting the code ever
so slightly, we can get Theano to reveal the exact source of the error.
.. code-block:: python
# enable on-the-fly graph computations
theano.config.compute_test_value = 'warn'
...
# input which will be of shape (5, 10)
x = T.matrix('x')
# provide Theano with a default test-value
x.tag.test_value = numpy.random.rand(5, 10)
In the above, we are tagging the symbolic matrix *x* with a special test
value. This allows Theano to evaluate symbolic expressions on-the-fly (by
calling the ``perform`` method of each op), as they are being defined. Sources
of error can thus be identified with much more precision and much earlier in
the compilation pipeline. For example, running the above code yields the
following error message, which properly identifies *line 24* as the culprit.
.. code-block:: none
Traceback (most recent call last):
File "test2.py", line 24, in <module>
h1 = T.dot(x, func_of_W1)
File "PATH_TO_THEANO/theano/tensor/basic.py", line 4734, in dot
return _dot(a, b)
File "PATH_TO_THEANO/theano/gof/op.py", line 545, in __call__
required = thunk()
File "PATH_TO_THEANO/theano/gof/op.py", line 752, in rval
r = p(n, [x[0] for x in i], o)
File "PATH_TO_THEANO/theano/tensor/basic.py", line 4554, in perform
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: matrices are not aligned
The ``compute_test_value`` mechanism works as follows (see the sketch after the note below):
* Theano ``constants`` and ``shared`` variables are used as is. No need to instrument them.
* A Theano *variable* (i.e. ``dmatrix``, ``vector``, etc.) should be
given a special test value through the attribute ``tag.test_value``.
* Theano automatically instruments intermediate results. As such, any quantity
derived from *x* will be given a ``tag.test_value`` automatically.
``compute_test_value`` can take the following values:
* ``off``: Default behavior. This debugging mechanism is inactive.
* ``raise``: Compute test values on the fly. Any variable for which a test
value is required, but not provided by the user, is treated as an error. An
exception is raised accordingly.
* ``warn``: Idem, but a warning is issued instead of an *Exception*.
* ``ignore``: Silently ignore the computation of intermediate test values, if a
variable is missing a test value.
.. note::
This feature is currently incompatible with ``Scan`` and also with ops
which do not implement a ``perform`` method.
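To make these rules concrete, here is a minimal sketch (shapes are chosen
arbitrarily):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    theano.config.compute_test_value = 'raise'

    x = T.matrix('x')
    # A plain variable needs an explicit test value...
    x.tag.test_value = numpy.random.rand(5, 10)

    # ...while a shared variable is used as is.
    W = theano.shared(numpy.random.rand(10, 3), 'W')

    # Intermediate results are instrumented automatically:
    y = T.dot(x, W)
    print(y.tag.test_value.shape)  # prints (5, 3)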
"How do I Print an Intermediate Value in a Function?"
-----------------------------------------------------
Theano provides a 'Print' op to do this.
.. testcode::
import numpy
import theano
x = theano.tensor.dvector('x')
x_printed = theano.printing.Print('this is a very important value')(x)
f = theano.function([x], x * 5)
f_with_print = theano.function([x], x_printed * 5)
#this runs the graph without any printing
assert numpy.all( f([1, 2, 3]) == [5, 10, 15])
#this runs the graph with the message, and value printed
assert numpy.all( f_with_print([1, 2, 3]) == [5, 10, 15])
.. testoutput::
this is a very important value __str__ = [ 1. 2. 3.]
Since Theano runs your program in a topological order, you won't have precise
control over the order in which multiple ``Print()`` ops are evaluated. For a more
precise inspection of what's being computed where, when, and how, see the discussion
:ref:`faq_monitormode`.
.. warning::
Using the ``Print`` Theano Op can prevent some Theano
optimizations from being applied, including stability
optimizations. So if you use ``Print`` and see NaNs, try
removing the ``Print`` ops to check whether they are the cause.
"How do I Print a Graph?" (before or after compilation)
-------------------------------------------------------
.. TODO: dead links in the next paragraph
Theano provides two functions (:func:`theano.pp` and
:func:`theano.printing.debugprint`) to print a graph to the terminal before or after
compilation. These two functions print expression graphs in different ways:
:func:`pp` is more compact and math-like, :func:`debugprint` is more verbose.
Theano also provides :func:`theano.printing.pydotprint` that creates a png image of the function.
You can read about them in :ref:`libdoc_printing`.
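For example, a minimal sketch of the first two (the expression is illustrative):

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.dmatrix('x')
    y = x ** 2 + x

    # Compact, math-like rendering of the symbolic graph
    print(theano.printing.pp(y))

    # Verbose, tree-shaped rendering of the same graph
    theano.printing.debugprint(y)

    # debugprint also accepts a compiled function, in which
    # case the optimized graph is shown
    f = theano.function([x], y)
    theano.printing.debugprint(f)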
"The Function I Compiled is Too Slow, what's up?"
-------------------------------------------------
First, make sure you're running in ``FAST_RUN`` mode. Even though
``FAST_RUN`` is the default mode, insist by passing ``mode='FAST_RUN'``
to ``theano.function`` (or ``theano.make``) or by setting :attr:`config.mode`
to ``FAST_RUN``.
Second, try the Theano :ref:`using_profilemode`. This will tell you which
``Apply`` nodes and which Ops are eating up your CPU cycles (a short
profiling sketch follows the tips below).
Tips:
* Use the flag ``floatX=float32`` to compute with *float32* instead of *float64*;
  use the Theano constructors ``matrix()``, ``vector()``, ... instead of ``dmatrix()``,
  ``dvector()``, ..., since the former use the configurable ``floatX`` type while the
  latter are fixed to *float64*.
* Check in the profile output that there is no ``Dot`` op in the post-compilation
  graph when you are multiplying two matrices of the same type. ``Dot`` should be
  optimized to ``dot22`` when the inputs are matrices of the same type. A plain
  ``Dot`` can still appear even with ``floatX=float32`` if one of the inputs of
  the graph is of type *float64*.
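As a sketch, one way to get a per-Op timing breakdown is to compile with
``profile=True`` (the expression below is illustrative):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    f = theano.function([x], T.exp(x).sum(), profile=True)
    f(numpy.random.rand(1000, 1000).astype(theano.config.floatX))

    # Print a per-Op timing breakdown for this function
    f.profile.summary()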
"Why does my GPU function seem to be slow?"
-------------------------------------------
When you compile a Theano function and do not get the speedup you expect
over the CPU performance of the same code, it is oftentimes because some Ops
are running on the CPU instead of the GPU. If that is the case, you can use
``assert_no_cpu_op`` to check whether there is a CPU Op in your computational
graph. If a CPU Op is found, ``assert_no_cpu_op`` reacts according to one of
the following three options:
* ``warn``: Raise a warning
* ``pdb``: Drop into a pdb prompt during compilation
* ``raise``: Raise an error
You can enable this check by providing the flag in ``THEANO_FLAGS``, for example:
``THEANO_FLAGS="floatX=float32,device=gpu,assert_no_cpu_op='raise'" python test.py``
Note, however, that this check will not catch all CPU Ops; it might miss
some of them.
.. _faq_monitormode:
"How do I Step through a Compiled Function?"
--------------------------------------------
You can use ``MonitorMode`` to inspect the inputs and outputs of each
node being executed when the function is called. The code snippet below
shows how to print all inputs and outputs:
.. testcode::
import theano
def inspect_inputs(i, node, fn):
    print(i, node, "input(s) value(s):", [input[0] for input in fn.inputs],
          end='')

def inspect_outputs(i, node, fn):
    print(" output(s) value(s):", [output[0] for output in fn.outputs])
x = theano.tensor.dscalar('x')
f = theano.function([x], [5 * x],
mode=theano.compile.MonitorMode(
pre_func=inspect_inputs,
post_func=inspect_outputs))
f(3)
.. testoutput::
0 Elemwise{mul,no_inplace}(TensorConstant{5.0}, x) input(s) value(s): [array(5.0), array(3.0)] output(s) value(s): [array(15.0)]
When using these ``inspect_inputs`` and ``inspect_outputs`` functions
with ``MonitorMode``, you should see [potentially a lot of] printed output.
Every ``Apply`` node will be printed out,
along with its position in the graph, the arguments to the functions ``perform`` or
``c_code`` and the output it computed.
Admittedly, this may be a huge amount of
output to read through if you are using big tensors... but you can choose to
add logic that would, for instance, print
something out only if a certain kind of op were used, at a certain program
position, or only if a particular value showed up in one of the inputs or outputs.
A typical example is to detect when NaN values are added into computations, which
can be achieved as follows:
.. testcode:: compiled
import numpy
import theano
# This is the current suggested detect_nan implementation to
# show you how it works. That way, you can modify it for your
# needs. If you want exactly this method, you can use
# ``theano.compile.monitormode.detect_nan``, which will always
# contain the current suggested version.

def detect_nan(i, node, fn):
    for output in fn.outputs:
        if (not isinstance(output[0], numpy.random.RandomState) and
                numpy.isnan(output[0]).any()):
            print('*** NaN detected ***')
            theano.printing.debugprint(node)
            print('Inputs : %s' % [input[0] for input in fn.inputs])
            print('Outputs: %s' % [output[0] for output in fn.outputs])
            break
x = theano.tensor.dscalar('x')
f = theano.function([x], [theano.tensor.log(x) * x],
mode=theano.compile.MonitorMode(
post_func=detect_nan))
f(0) # log(0) * 0 = -inf * 0 = NaN
.. testoutput:: compiled
:options: +NORMALIZE_WHITESPACE
*** NaN detected ***
Elemwise{Composite{(log(i0) * i0)}} [id A] ''
|x [id B]
Inputs : [array(0.0)]
Outputs: [array(nan)]
To help understand what is happening in your graph, you can
disable the ``local_elemwise_fusion`` and all ``inplace``
optimizations. The first is a speed optimization that merges elemwise
operations together. This makes it harder to know which particular
elemwise causes the problem. The second optimization makes some ops'
outputs overwrite their inputs. So, if an op creates a bad output, you
will not be able to see the input that was overwritten in the ``post_func``
function. To disable those optimizations (with a Theano version after
0.6rc3), define the MonitorMode like this:
.. testcode:: compiled
mode = theano.compile.MonitorMode(post_func=detect_nan).excluding(
'local_elemwise_fusion', 'inplace')
f = theano.function([x], [theano.tensor.log(x) * x],
mode=mode)
.. note::
The Theano flags ``optimizer_including``, ``optimizer_excluding``
and ``optimizer_requiring`` aren't used by the MonitorMode, they
are used only by the ``default`` mode. You can't use the ``default``
mode with MonitorMode, as you need to define what you monitor.
To be sure all inputs of the node are available during the call to
``post_func``, you must also disable the garbage collector. Otherwise,
executing a node can garbage collect inputs that are no longer needed
by the rest of the Theano function. This can be done with the Theano
flag:
.. code-block:: python
allow_gc=False
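For example, a short sketch (to be placed before the function call,
alongside the ``MonitorMode`` setup above):

.. code-block:: python

    import theano

    # Keep intermediate storage alive so post_func can still
    # inspect the inputs of each node
    theano.config.allow_gc = False

    # Equivalently, from the shell:
    #   THEANO_FLAGS='allow_gc=False' python script.py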
.. TODO: documentation for link.WrapLinkerMany
How to Use pdb
--------------
In the majority of cases, you won't be executing from the interactive shell
but from a set of Python scripts. In such cases, the use of the Python
debugger can come in handy, especially as your models become more complex.
Intermediate results don't necessarily have a clear name and you can get
exceptions which are hard to decipher, due to the "compiled" nature of the
functions.
Consider this example script ("ex.py"):
.. testcode::
import theano
import numpy
import theano.tensor as T
a = T.dmatrix('a')
b = T.dmatrix('b')
f = theano.function([a, b], [a * b])
# matrices chosen so dimensions are unsuitable for multiplication
mat1 = numpy.arange(12).reshape((3, 4))
mat2 = numpy.arange(25).reshape((5, 5))
f(mat1, mat2)
.. testoutput::
:hide:
:options: +ELLIPSIS
Traceback (most recent call last):
...
ValueError: Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 5)
Apply node that caused the error: Elemwise{mul,no_inplace}(a, b)
Toposort index: 0
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(3, 4), (5, 5)]
Inputs strides: [(32, 8), (40, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [['output']]
Backtrace when the node is created:
File "<doctest default[0]>", line 8, in <module>
f = theano.function([a, b], [a * b])
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
This example is simple enough that it could be debugged by inspection,
but it serves for illustration. Since the matrices can't be multiplied
element-wise (unsuitable shapes), we get the following exception:
.. code-block:: none
File "ex.py", line 14, in <module>
f(mat1, mat2)
File "/u/username/Theano/theano/compile/function_module.py", line 451, in __call__
File "/u/username/Theano/theano/gof/link.py", line 271, in streamline_default_f
File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute ValueError: ('Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains some useful information to trace back the source
of the error. There's the script where the compiled function was called --
but if you're using (improperly parameterized) prebuilt modules, the error
might originate from ops in these modules, not this script. The last line
tells us about the op that caused the exception. In this case it's a "mul"
involving variables with names "a" and "b". But suppose we instead had an
intermediate result to which we hadn't given a name.
After learning a few things about the graph structure in Theano, we can use
the Python debugger to explore the graph, and then we can get runtime
information about the error. Matrix dimensions, especially, are useful to
pinpoint the source of the error. In the printout, there are also 2 of the 4
dimensions of the matrices involved, but for the sake of example say we'd
need the other dimensions to pinpoint the error. First, we re-launch with
the debugger module and run the program with "c":
.. code-block:: text
python -m pdb ex.py
> /u/username/experiments/doctmp1/ex.py(1)<module>()
-> import theano
(Pdb) c
Then we get back the above error printout, but the interpreter breaks in
that state. Useful commands here are
* "up" and "down" (to move up and down the call stack),
* "l" (to print code around the line in the current stack position),
* "p variable_name" (to print the string representation of 'variable_name'),
* "p dir(object_name)", using the Python dir() function to print the list of an object's members
Here, for example, I do "up", and a simple "l" shows me there's a local
variable "node". This is the "node" from the computation graph, so by
following the "node.inputs", "node.owner" and "node.outputs" links I can
explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it
abstractly). To get information about the actual parameters, you explore the
"thunk" objects, which bind the storage for the inputs (and outputs) with
the function itself (a "thunk" is a concept related to closures). Here, to
get the current node's first input's shape, you'd therefore do "p
thunk.inputs[0][0].shape", which prints out "(3, 4)".
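As an illustrative transcript of such a session (values correspond to the
example above; output abridged):

.. code-block:: text

    (Pdb) up
    (Pdb) l
    (Pdb) p node
    Elemwise{mul,no_inplace}(a, b)
    (Pdb) p node.inputs
    [a, b]
    (Pdb) p thunk.inputs[0][0].shape
    (3, 4)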
.. _faq_dump_fct:
Dumping a Function to help debug
--------------------------------
If you are reading this, there is a high chance that you emailed our
mailing list and we asked you to read this section. This section
explains how to dump all the parameters passed to
``theano.function()``. This is useful to help us reproduce a problem
during compilation, and it doesn't require you to make a self-contained
example.
For this to work, we need to be able to import the code for all the Ops in
the graph. So if you created your own Op, we will need its code;
otherwise, we won't be able to unpickle the dump. We already have all
the Ops from Theano and Pylearn2.
.. code-block:: python
# Replace this line:
theano.function(...)
# with
theano.function_dump(filename, ...)
# Where filename is the path of the file that we will write to.
Then send us that file.
.. autoclass:: theano.tests.breakpoint.PdbBreakpoint