```
# From pytorch root
cd benchmarks/instruction_counts
python main.py
```
Currently `main.py` contains a very simple threadpool (so that run time isn't
unbearably onerous) and simply prints the results. These components will be
upgraded in subsequent PRs.
* `TimerArgs`: Low-level definition which maps directly to `torch.utils.benchmark.Timer`.
* `GroupedStmts`: Benchmark a snippet (Python, C++, or both). Can automatically generate TorchScript and autograd variants.
* `GroupedModules`: Like `GroupedStmts`, but takes `nn.Module`s.
* `GroupedVariants`: Benchmark-per-line, to define many related benchmarks in a single code block.
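To make the benchmark-per-line idea concrete, here is a hypothetical sketch of how a single code block could be split into labeled benchmarks. The `parse_variants` helper and the `# @label` comment convention are illustrative only, not the actual `GroupedVariants` API:

```python
# Hypothetical sketch: split one code block into labeled groups of statements.
# Not the real `GroupedVariants` implementation; for illustration only.
from typing import Dict, List


def parse_variants(block: str) -> Dict[str, List[str]]:
    """Group the lines of `block` under the most recent `# @label` comment."""
    variants: Dict[str, List[str]] = {}
    label = "unlabeled"
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# @"):
            label = line[3:].strip()
            variants.setdefault(label, [])
        else:
            variants.setdefault(label, []).append(line)
    return variants


block = """
# @matmul
y = x @ w

# @elementwise
y = x * w
"""
print(parse_variants(block))
# {'matmul': ['y = x @ w'], 'elementwise': ['y = x * w']}
```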
One primary goal of this suite is to make it easy to define semantically
related clusters of benchmarks. The crux of this effort is the
`GroupedBenchmark` class, which is defined in `core/api.py`. It takes a
definition for a set of related benchmarks, and produces one or more concrete
cases. It's helpful to see an example to understand how the machinery works.
Consider the following benchmark:
```python
# `GroupedStmts` is an alias of `GroupedBenchmark.init_from_stmts`
benchmark = GroupedStmts(
    py_stmt=r"y = x * w",
    cpp_stmt=r"auto y = x * w;",

    setup=GroupedSetup(
        py_setup="""
            x = torch.ones((4, 4))
            w = torch.ones((4, 4), requires_grad=True)
        """,
        cpp_setup="""
            auto x = torch::ones((4, 4));
            auto w = torch::ones((4, 4));
            w.set_requires_grad(true);
        """,
    ),

    signature="f(x, w) -> y",
    torchscript=True,
    autograd=True,
)
```
It is trivial to generate Timers for the eager forward mode case (ignoring
`num_threads` for now):
```python
Timer(
    stmt=benchmark.py_fwd_stmt,
    setup=benchmark.setup.py_setup,
)

Timer(
    stmt=benchmark.cpp_fwd_stmt,
    setup=benchmark.setup.cpp_setup,
    language="cpp",
)
```
Moreover, because `signature` is provided we know that creation of `x` and `w`
is part of setup, and the overall computation uses `x` and `w` to produce `y`.
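As a rough illustration of what the suite extracts from a signature string, here is a hypothetical sketch of parsing `"f(x, w) -> y"` into argument names and an output name. The real parsing lives in `core/api.py`; `parse_signature` below is illustrative only:

```python
# Hypothetical sketch of signature parsing; not the actual `core/api.py` code.
import re
from typing import List, Tuple


def parse_signature(signature: str) -> Tuple[List[str], str]:
    """Split a signature like "f(x, w) -> y" into (["x", "w"], "y")."""
    match = re.match(r"^\w+\((.*)\)\s*->\s*(\w+)$", signature.strip())
    if match is None:
        raise ValueError(f"Invalid signature: {signature}")
    args = [a.strip() for a in match.group(1).split(",") if a.strip()]
    return args, match.group(2)


args, output = parse_signature("f(x, w) -> y")
print(args, output)  # ['x', 'w'] y
```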
As a result, we can derive TorchScript'd and AutoGrad variants as well. We can
deduce that a TorchScript model will take the form:

```python
@torch.jit.script
def f(x, w):
    # Paste `benchmark.py_fwd_stmt` into the function body.
    y = x * w
    return y  # Set by `-> y` in signature.
```
And because we will want to use this model in both Python and C++, we save it to disk and load it as needed. At this point Timers for TorchScript become:
```python
Timer(
    stmt="""
        y = jit_model(x, w)
    """,
    setup="""
        # benchmark.setup.py_setup
        # jit_model = torch.jit.load(...)
        # Warm up jit_model
    """,
)
```
```python
Timer(
    stmt="""
        std::vector<torch::jit::IValue> ivalue_inputs(
            torch::jit::IValue({x}),
            torch::jit::IValue({w})
        );
        auto y = jit_model.forward(ivalue_inputs);
    """,
    setup="""
        # benchmark.setup.cpp_setup
        # jit_model = torch::jit::load(...)
        # Warm up jit_model
    """,
    language="cpp",
)
```
While nothing above is particularly complex, there is non-trivial bookkeeping (managing the model artifact, setting up IValues) which, if done manually, would be rather bug-prone and hard to read.
The story is similar for autograd: because we know the output variable (`y`)
and we make sure to assign it when calling TorchScript models, testing AutoGrad
is as simple as appending `y.backward()` (or `y.backward();` in C++) to the
stmt of the forward-only variant. Of course this requires that `signature` be
provided, as there is nothing special about the name `y`.
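The append step is small enough to sketch directly. A simplified, illustrative version (the real logic lives in `core/api.py`, and `add_backward` is a hypothetical name):

```python
# Simplified sketch of deriving an autograd stmt from the forward stmt and the
# signature's output name. Illustrative only; not the actual `core/api.py` code.
def add_backward(fwd_stmt: str, output: str, language: str = "python") -> str:
    """Append a backward call on `output` to the forward statement."""
    backward = f"{output}.backward();" if language == "cpp" else f"{output}.backward()"
    return f"{fwd_stmt}\n{backward}"


print(add_backward("y = x * w", "y"))
# y = x * w
# y.backward()

print(add_backward("auto y = x * w;", "y", language="cpp"))
# auto y = x * w;
# y.backward();
```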
The logic for the manipulations above is split between `core/api.py` (for
generating `stmt` based on language, Eager/TorchScript, with or without
AutoGrad) and `core/expand.py` (for larger, more expansive generation). The
benchmarks themselves are defined in `definitions/standard.py`. The current set
is chosen to demonstrate the various model definition APIs, and will be
expanded when the benchmark runner infrastructure is better equipped to deal
with a larger run.
Once `expand.materialize` has flattened the abstract benchmark definitions into
`TimerArgs`, they can be sent to a worker (`worker/main.py`) subprocess for
execution. This worker has no concept of the larger benchmark suite; `TimerArgs`
is a direct, one-to-one mapping to the `torch.utils.benchmark.Timer` instance
that the worker instantiates.
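To illustrate what "one-to-one" means here, a hypothetical sketch of such a container: each field mirrors a `torch.utils.benchmark.Timer` constructor argument, so the worker can construct the Timer mechanically. The field set below is illustrative; the actual `TimerArgs` is defined in the suite itself:

```python
# Hypothetical sketch of a TimerArgs-style container. Field names mirror
# `torch.utils.benchmark.Timer` constructor arguments; illustrative only.
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TimerArgs:
    stmt: str
    setup: str
    language: str = "python"
    num_threads: int = 1


args = TimerArgs(
    stmt="y = x * w",
    setup="x = torch.ones((4, 4)); w = torch.ones((4, 4))",
)

# In the worker, this would become roughly: Timer(**asdict(args))
print(asdict(args))
```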