Fix/update the SimProcedures guide
rhelmot committed Nov 14, 2017
1 parent c23d1f3 commit 01fe98f
Showing 1 changed file with 46 additions and 72 deletions.
118 changes: 46 additions & 72 deletions docs/simprocedures.md
@@ -20,17 +20,17 @@ Here's an example that will remove all bugs from any program:
... return 0

# this assumes we have symbols for the binary
>>> project.hook(project.kb.labels.lookup('main'), BugFree)
>>> project.hook_symbol('main', BugFree)

# Run a quick execution!
>>> sm = project.factory.simgr()
>>> sm.run() # step until no more active paths
>>> simgr = project.factory.simulation_manager()
>>> simgr.run() # step until no more active states
Program running with argc=<SAO <BV64 0x0>> and argv=<SAO <BV64 0x7fffffffffeffa0>>
<SimulationManager with 1 deadended>
```

Now, whenever program execution reaches the main function, instead of executing the actual main function, it will execute this procedure!
This procedure just prints out a message, and returns.
It just prints out a message, and returns.

Now, let's talk about what happens on the edge of this function!
When entering the function, where do the values that go into the arguments come from?
@@ -40,27 +40,18 @@ It should be clear at this point that the SimProcedure we just wrote is meant to
In fact, the original use case for SimProcedures was replacing library functions.
More on that later.

## Clarifying the Hierarchy

We've been using the words Hook and SimProcedure sort of interchangeably. Let's fix that.

- `SimProcedure` is a class that describes a set of actions to take on a state.
Its crux is the `run()` method.
- `Hook` is an angr class that holds a SimProcedure along with information about how to instantiate it.

On a `Project` class, the dict `project._sim_procedures` is a mapping from address to `Hook` instances.
(The name is a historical artifact - SimProcedure is one of the oldest classes in the angr suite, while Hook is relatively new.)
When the [execution pipeline](pipeline.md) reaches an address that is present in that dict, that is, an address that is hooked, it will execute `procedure = project._sim_procedures[address].instantiate(address, arch)`.
The result of this call is a `SimProcedure` instance!
I recommend you look at the source code for the Hook class, at the bottom of [angr/project.py](https://github.com/angr/angr/blob/master/angr/project.py) in order to see exactly how this works.
## Implementation Context

On a `Project` class, the dict `project._sim_procedures` is a mapping from address to `SimProcedure` instances.
When the [execution pipeline](pipeline.md) reaches an address that is present in that dict, that is, an address that is hooked, it will execute `project._sim_procedures[address].execute(state)`.
This will consult the calling convention to extract the arguments, make a copy of itself in order to preserve thread safety, and run the `run()` method.
Each run must use a fresh instance of the SimProcedure: running one necessarily mutates state on the instance, so sharing a single instance across steps would invite race conditions in multithreaded environments.
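
The copy-per-run behaviour described above can be pictured with a small plain-Python model. This is only a sketch: `ToyProcedure`, the `hooks` dict, and the dict used as a "state" are hypothetical stand-ins for illustration, not angr's actual classes or internals.

```python
import copy


class ToyProcedure:
    """Hypothetical stand-in for a SimProcedure; not angr's real class."""

    def __init__(self):
        self.state = None  # only ever filled in on a per-run copy

    def execute(self, state):
        inst = copy.copy(self)  # fresh instance, so the hooked one is never mutated
        inst.state = state
        inst.run()
        return inst

    def run(self):
        self.state["visited"] = True


# Assumed shape of the hook table: address -> procedure instance
hooks = {0x400000: ToyProcedure()}

state = {"addr": 0x400000}
ran = hooks[state["addr"]].execute(state)
```

After this runs, the instance stored in `hooks` is untouched, while `ran` is the throwaway copy that carried the per-run state.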

### kwargs

This hierarchy implies that you might want to reuse a single SimProcedure in multiple hooks.
What if you want to use the same SimProcedure in several hooks, but tweaked slightly each time?
angr's support for this is that any additional keyword arguments you pass to the `Hook()` initializer will end up getting passed as keyword args to your SimProcedure's `run()` method.
What if you want to hook the same SimProcedure in several places, but tweaked slightly each time?
angr supports this directly: any additional keyword arguments you pass to the constructor of your SimProcedure are passed as keyword arguments to its `run()` method.
Pretty cool!
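
To make the keyword-argument forwarding concrete, here is a minimal plain-Python sketch of the idea. The names `ToyProcedure` and `ReturnConstant` and the `hooks` dict are hypothetical and only model the behaviour described above; real angr code would subclass `SimProcedure` instead.

```python
import copy


class ToyProcedure:
    """Hypothetical model of constructor kwargs being forwarded to run()."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs  # remembered at hook time

    def execute(self, *args):
        inst = copy.copy(self)  # per-run copy, as in the section above
        return inst.run(*args, **inst.kwargs)


class ReturnConstant(ToyProcedure):
    def run(self, value=0):
        return value


# The same procedure class hooked at two addresses, tweaked per hook
hooks = {
    0x1000: ReturnConstant(value=1),
    0x2000: ReturnConstant(value=42),
}
results = {addr: proc.execute() for addr, proc in hooks.items()}
```

Each hook keeps its own keyword arguments, so one procedure class can serve several slightly different purposes.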

## Data Types
@@ -75,7 +66,7 @@ This will automatically be promoted to a word-sized bitvector!
You can return a native number, a bitvector, or a SimActionObject.

When you want to write a procedure that deals with floating point numbers, you will need to specify the calling convention manually.
It's not too hard, just provide a cc to the hook: [`cc = project.factory.cc_from_arg_kinds((True, True), ret_fp=True)`](http://angr.io/api-doc/angr.html#angr.factory.AngrObjectFactory.cc_from_arg_kinds) and `project.hook(Hook(ProcedureClass, cc=mycc))`
It's not too hard: just provide a cc to the hook, using [`cc = project.factory.cc_from_arg_kinds((True, True), ret_fp=True)`](http://angr.io/api-doc/angr.html#angr.factory.AngrObjectFactory.cc_from_arg_kinds) and `project.hook(address, ProcedureClass(cc=cc))`.
This method for passing in a calling convention works for all calling conventions, so if angr's autodetected one isn't right, you can fix that.

## Control Flow
@@ -103,21 +94,20 @@ In order to do that, you'll need to work directly with the SimSuccessors object

The interface for this is [`self.successors.add_successor(state, addr, guard, jumpkind)`](http://angr.io/api-doc/angr.html#angr.engines.successors.SimSuccessors.add_successor).
All of these parameters should have an obvious meaning if you've followed along so far.
Keep in mind that the state you pass in will NOT be copied, so be sure to make a copy if you want to use it again!
Keep in mind that the state you pass in will NOT be copied and WILL be mutated, so be sure to make a copy beforehand if there will be more work to do!

### SimProcedure Continuations

How can we call a function in the binary and have execution resume within our SimProcedure?
There is a whole bunch of infrastructure called the "SimProcedure Continuation" that will let you do this.
When you use `self.call(addr, args, continue_at)`, `addr` is expected to be the address you'd like to call, `args` are the arguments you'd like to call it with, and `continue_at` is the name of another method in your SimProcedure class that you'd like execution to continue at when it returns.
When you use `self.call(addr, args, continue_at)`, `addr` is expected to be the address you'd like to call, `args` is the tuple of arguments you'd like to call it with, and `continue_at` is the name of another method in your SimProcedure class that you'd like execution to continue at when it returns.
This method must have the same signature as the `run()` method.
Furthermore, you can pass the keyword argument `cc` as the calling convention that ought to be used to communicate with the callee.

When you do this, you finish your current step, and execution will start again at the next step at the function you've specified.
When that function returns, it has to return to some concrete address!
That address is specified by the SimProcedure runtime.
Each SimProcedure which would like to use the continuation subsystem is allocated a "continuation address", the address which is specified as the return address for any calls that are made out of that procedure.
When control flow hits that address again, the SimProcedure is started back up again, and the specified `continue_at` function is called instead of `run()`, with the same args and kwargs as the first time.
That address is specified by the SimProcedure runtime: an address is allocated in angr's externs segment to serve as the return site, where the callee returns and the given method resumes.
It is then hooked with a copy of the procedure instance tweaked to run the specified `continue_at` function instead of `run()`, with the same args and kwargs as the first time.

There are two pieces of metadata you need to attach to your SimProcedure class in order to use the continuation subsystem correctly:

@@ -127,48 +117,47 @@ There are two pieces of metadata you need to attach to your SimProcedure class i

You may have guessed by now that there exists some sort of auxiliary storage in order to hold on to all this data.
You would be right!
The state plugin `state.procedure_data` exists to hold all the data that SimProcedures need to store in order to go about their business that must persist between runs.
It's stuff that ought to be stored in memory, but the data can't be serialized and/or memory allocation is hard.
In this case, `state.procedure_data.callstack` is a list of "call frames".
Whenever we jump to a continuation address and try to resume a SimProcedure, we pop a frame off this "call stack" and use its data to re-initialize the SimProcedure instance we're now working with.
The state plugin `state.callstack` has an entry called `.procedure_data` which is used by the SimProcedure runtime to store information local to the current call frame.
angr tracks the stack pointer in order to make the current top of the `state.callstack` a meaningful local data store.
It's stuff that ought to be stored in memory in a stack frame, but the data can't be serialized and/or memory allocation is hard.

As an example, let's look at the SimProcedure that angr uses internally to run all the shared library initializers for a `full_init_state`:
As an example, let's look at the SimProcedure that angr uses internally to run all the shared library initializers for a `full_init_state` for a Linux program:

```python
class LinuxLoader(SimProcedure):
class LinuxLoader(angr.SimProcedure):
    NO_RET = True
    IS_FUNCTION = True

    # pylint: disable=unused-argument,arguments-differ,attribute-defined-outside-init
    local_vars = ('initializers',)
    def run(self, project=None):
        self.initializers = project.loader.get_initializers()
        self.run_initializer(project)

    def run_initializer(self, project=None):
    def run(self):
        self.initializers = self.project.loader.initializers
        self.run_initializer()

    def run_initializer(self):
        if len(self.initializers) == 0:
            project._simos.set_entry_register_values(self.state)
            self.jump(project.entry)
            self.project._simos.set_entry_register_values(self.state)
            self.jump(self.project.entry)
        else:
            addr = self.initializers[0]
            self.initializers = self.initializers[1:]
            self.call(addr, (self.state.posix.argc, self.state.posix.argv, self.state.posix.environ), 'run_initializer')
```

This is a particularly clever usage of the SimProcedure continuations.
First, notice that the current project is passed in as a keyword arg, so we have access to internal loader logic.
We start by getting a list of initializers.
Then, for as long as the list isn't empty, we pop a single function pointer out of the list (being careful not to mutate the list), and then call it, returning to the `run_initializer` function again.
First, notice that the current project is available for use on the procedure instance.
This is some powerful stuff you can get yourself into; for safety you generally only want to use the project as a read-only or append-only data structure.
Here we're just getting the list of dynamic initializers from the loader.
Then, for as long as the list isn't empty, we take a single function pointer off the front of the list and call it, returning to the `run_initializer` function again.
We are careful not to mutate the list in place, since the list object is shared across states.
When we run out of initializers, we set up the entry state and jump to the program entry point.

Very cool!
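
The control flow described above (call out, return to a freshly hooked address, resume in another method) can be mimicked in plain Python. This is a toy model under loud assumptions: `ToyRuntime`, `ToyProcedure`, `RunAll`, the addresses, and the `binary` table of callables are all hypothetical and do not reflect angr's real engine or API.

```python
import copy


class ToyRuntime:
    """Hypothetical miniature of an execution loop; not angr's engine."""

    def __init__(self):
        self.hooks = {}            # address -> ToyProcedure instance
        self.binary = {}           # address -> callable standing in for real code
        self.next_extern = 0x9000  # pretend "externs segment"
        self.trace = []

    def allocate_extern(self):
        addr = self.next_extern
        self.next_extern += 8
        return addr

    def run(self, addr):
        ret_addr = None
        while addr is not None:
            if addr in self.hooks:
                addr, ret_addr = self.hooks[addr].execute(self)
            else:
                self.binary[addr](self)          # run the "real" function...
                addr, ret_addr = ret_addr, None  # ...then return to its caller


class ToyProcedure:
    def __init__(self, run_func="run"):
        self.run_func = run_func

    def execute(self, runtime):
        inst = copy.copy(self)  # per-run copy; attributes act as local variables
        inst.runtime = runtime
        inst.next = (None, None)
        getattr(inst, inst.run_func)()
        return inst.next

    def call(self, target, continue_at):
        ret_addr = self.runtime.allocate_extern()
        cont = copy.copy(self)       # copy tweaked to resume in continue_at
        cont.run_func = continue_at
        self.runtime.hooks[ret_addr] = cont
        self.next = (target, ret_addr)


class RunAll(ToyProcedure):
    def run(self):
        self.pending = [0x100, 0x200]
        self.step()

    def step(self):
        if not self.pending:
            self.runtime.trace.append("done")
        else:
            # Slice instead of pop(): never mutate the shared list in place
            target, self.pending = self.pending[0], self.pending[1:]
            self.call(target, "step")


rt = ToyRuntime()
rt.binary[0x100] = lambda r: r.trace.append("init@0x100")
rt.binary[0x200] = lambda r: r.trace.append("init@0x200")
rt.hooks[0x1000] = RunAll()
rt.run(0x1000)
```

In this model each `call` hooks a new return address with a copy of the procedure, and that copy carries the remaining `pending` list forward, which is roughly the role the text assigns to local variables in a continuation.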

## Global Variables

As a brief aside, you can store global variables in `state.procedure_data.global_variables`.
As a brief aside, you can store global variables in `state.globals`.
This is a dictionary that just gets shallow-copied from state to successor state.
Because it's only a shallow copy, its members are the same instances, so the same rules as local variables in SimProcedure continuations apply.
You need to be careful not to mutate any item that is used as a global variable.
You need to be careful not to mutate any item that is used as a global variable unless you know exactly what you're doing.
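
The shallow-copy caveat is ordinary Python behaviour and can be seen without angr at all. In this sketch the two dicts are hypothetical stand-ins for the globals of a state and of its successor.

```python
import copy

parent_globals = {"counter": 0, "seen": [1, 2]}

# Shallow copy, as happens from a state to its successor
child_globals = copy.copy(parent_globals)

# Rebinding a key only affects the child...
child_globals["counter"] += 1

# ...but mutating a member in place leaks into the parent,
# because both dicts hold the very same list object.
child_globals["seen"].append(3)

# Safer pattern: build a new object and rebind the key.
safe_globals = copy.copy(parent_globals)
safe_globals["seen"] = safe_globals["seen"] + [4]
```

The in-place `append` is visible from `parent_globals`, while the rebinding in `safe_globals` leaves the parent's list alone.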

## Helping out static analysis

@@ -188,7 +177,7 @@ This is meant to be a quick, best-effort analysis, and you shouldn't try to do a

The process of writing and using a SimProcedure largely assumes that you want to hook an entire function.
What if you don't?
There's an alternate interface for hooking, a user hook, that lets you streamline the process of hooking sections of code.
There's an alternate interface for hooking, a _user hook_, that lets you streamline the process of hooking sections of code.

```python
>>> @project.hook(0x1234, length=5)
@@ -201,51 +190,36 @@ This is a lot simpler!
The idea is to use a single function instead of an entire SimProcedure subclass.
No extraction of arguments is performed, no complex control flow happens.

Control flow is controlled by the length argument to `Hook.wrap`.
After the function finishes executing, the next step will start at 5 bytes after the hooked address.
If the length argument is omitted or set to zero, execution will resume executing the binary code at exactly the hooked address, without re-triggering the hook. The `Ijk_NoHook` jumpkind allows this to happen.
Control flow is controlled by the length argument.
After the function finishes executing in this example, the next step will start at 5 bytes after the hooked address.
If the length argument is omitted or set to zero, execution will resume in the binary code at exactly the hooked address, without re-triggering the hook.
The `Ijk_NoHook` jumpkind allows this to happen.

If you want more control over control flow coming out of a user hook, you can return a list of successor states.
Each successor will be expected to have `state.ip`, `state.scratch.guard`, and `state.scratch.jumpkind` set.
Each successor will be expected to have `state.regs.ip`, `state.scratch.guard`, and `state.scratch.jumpkind` set.
The IP is the target instruction pointer, the guard is a symbolic boolean representing a constraint to add to the state related to it being taken as opposed to the others, and the jumpkind is a VEX enum string, like `Ijk_Boring`, representing the nature of the branch.

The general rule is, if you want your SimProcedure to either be able to extract function arguments or cause a program return, use a SimProcedure.
The general rule is, if you want your hook to be able to either extract function arguments or cause a program return, write a full SimProcedure class.
Otherwise, use a user hook.

## Hooking Symbols

First, some background.

In binary loading, there is the notion of a symbol - a range of memory in the address space which has a name.
For dynamically linked binaries, there is the notion of an *import symbol*, which is a symbol which has no address, just a name.
These symbols are used to mark dependencies among binaries and shared objects, usually for functions.
During the loading process, each import symbol is supposed to be *provided* by a different binary, one that *exports* a symbol of the same name.
The process of matching import symbols to export symbols is called *dependency resolution*.

When a symbol has been resolved, pointers to the provider's symbol need to be injected into the providee's address space.
This is a process known as *relocation*.
A dynamically linked binary contains a table of *relocations*, individual directives stating that, as soon as you resolve import symbol X, please update the code or data at address Y to refer to the corresponding export symbol.
There are many types of relocations, and so this process is complicated!

When angr loads a program and it gets the `Loader` object from CLE, it wants to do two things:
Make sure every imported function gets resolved to _something_, and replace as many imported functions with SimProcedures as possible.
To do this, CLE exports an interface called `provide_symbol`, which allows angr to advertise its own export symbols and cause the relocation process to point wherever we want for certain symbols.

This happens on the angr level with the method `Project.hook_symbol`.
You can use this function to allocate a fresh address, hook that address with whatever you want, and re-point any import symbol of a given name to your hook!
As you should recall from the [section on loading a binary](loading.md), dynamically linked programs have a list of symbols that they must import from the libraries they have listed as dependencies, and angr will make sure, rain or shine, that every import symbol gets resolved by _some_ address, whether it's a real implementation of the function or just a dummy address hooked with a do-nothing stub.
As a result, you can just use the `Project.hook_symbol` API to hook the address referred to by a symbol!

This means that you can replace library functions with your own code.
For instance, to replace `rand()` with a function that always returns a consistent sequence of values:

```python
>>> class NotVeryRand(SimProcedure):
...     def run(self, return_values=None):
...         if 'rand_idx' in self.state.procedure_data.global_variables:
...             rand_idx = self.state.procedure_data.global_variables['rand_idx']
...         if 'rand_idx' in self.state.globals:
...             rand_idx = self.state.globals['rand_idx']
...         else:
...             rand_idx = 0
...
...         out = return_values[rand_idx % len(return_values)]
...         self.state.procedure_data.global_variables['rand_idx'] = rand_idx + 1
...         self.state.globals['rand_idx'] = rand_idx + 1
...         return out

>>> project.hook_symbol('rand', NotVeryRand(return_values=[413, 612, 1025, 1111]))