[pull] master from BinaryAnalysisPlatform:master #114

Open
wants to merge 257 commits into master from BinaryAnalysisPlatform:master

Conversation

@pull pull bot commented Apr 23, 2021

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Apr 23, 2021
ivg and others added 29 commits May 25, 2021 09:53
The new version might give us much better performance, see the
[discussion][1] for further details.

[1]: https://discuss.ocaml.org/t/ann-set-up-ocaml-2-0-0-alpha/7895
* Use semver-style version matching on CI workflow

Signed-off-by: Sora Morimoto <[email protected]>

* Update docker/build-push-action to v2

Signed-off-by: Sora Morimoto <[email protected]>

* Format GitHub workflow files

Signed-off-by: Sora Morimoto <[email protected]>

* Enable dune cache on the Ubuntu runners

Signed-off-by: Sora Morimoto <[email protected]>
* switches to FrontC.4.x and adds the preprocessing option

Since FrontC.4.x supports the new types from C99 and later, we had to
update BAP as well, both its representation of the C types and the
parsing procedure.

We also added an option (disabled by default) to apply
preprocessing (with the ability to specify the path to the
preprocessor), since FrontC.4.1+ is much more robust and is capable of
digesting Linux system headers.

* adds the FrontC constraint
The old script was following the old share folder layout and, as a
result, bap does not work on Debian because it is missing the necessary
lisp files.

It is not guaranteed that sudo is present, and it is not really
necessary, since those actions are run by a superuser anyway.
The Docker images are no longer pushed to Docker Hub after #1312:
first, because the secret names were accidentally changed, and second,
because the `push` parameter defaults to `false` in the `v2` version of
the action.
This file assumes the presence of the testing snapshot opam
repository, which is not always available. It is also not required to
get ppx_bap from it, as it has long been available in the upstream
repository.
We reuse the existing arm and thumb abi but make sure that the correct
architecture is passed to the memory-related operations.
* initial scaffolding

* adds support for nameless registers in the disasm backend

We don't want to model unique (temporary) registers as normal physical
registers, so we add an option to set the name to -1 to indicate that
the register is a unique anonymous variable.

* the minimal implementation

* introduces sub-instructions to the disassembler

There was some provision for that in the backend but it was never fully
implemented. This feature enables seamless integration of the ghidra
backend but may also be used for VLIW architectures, e.g., the hexagon
llvm backend is using it.

* fixes the handling of the unique namespace

* implements semantics of sequential instructions

* specifies semantics for most of the pcode instructions

* passes the basic instruction info to the sequential semantics value

* fixes the semantics of LOAD

* improves handling of pcode namespaces

Fixes varnode classification (no longer treating addresses as
registers).

Also, to prevent name clashes between virtual variables from different
scopes, we use unique prefixes (aka shortcuts in ghidra's parlance) to
distinguish between them. This also makes the generated code more
readable and closer to the originally generated pcode.

* fixes the subpiece implementation

* puts all pcode opcodes into the pcode namespace

Since they are the same for each architecture and now we have this feature.

* implements support for user-defined opcodes

we translate

`CALLOTHER(<name>,<arg1>,...,<argN>)`

to

`<name>(<arg1>,...,<argN>)`

* translates local (intra-instruction) branches into GOTOs

Branches in p-code are overloaded by the type of their destination. If
the destination is a virtual address then it is a normal branch, and if
it is a constant (a varnode from the constant namespace) then it is an
intra-instruction branch that represents the inner instruction
logic.
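
The classification can be sketched as follows; this is a minimal
illustration with hypothetical types, not the actual backend code:

```
(* Hypothetical varnode representation: the address space the value
   lives in, and its offset within that space. *)
type varnode = {space : string; offset : int64}

type destination =
  | Virtual_address of int64   (* a normal branch *)
  | Local_goto of int          (* an intra-instruction branch *)

(* A destination in the constant space encodes a local p-code offset;
   any other destination is a real virtual address. *)
let classify_destination (v : varnode) = match v.space with
  | "const" -> Local_goto (Int64.to_int v.offset)
  | _ -> Virtual_address v.offset
```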

* fixes the satisfies function to account for subinstructions

Before that, it was only looking at the top-level instruction.

* an attempt to pack subinstructions inside an instruction

(breaks lots of stuff)

* publishes subinstructions

I will probably forfeit this approach, still investigating.

* introduces the null object to the knowledge base

It was actually already there, hidden in the [obj] domain. Now it is
properly documented with well defined semantics.

* uses the null object to represent unlabeled blocks in the lifters

Also, more lifters now respect the label passed to the blk operator.

* removes the subinstruction slot from instruction

* adds label reification to the BIL semantics, also reifies gotos

all using a special encoding

* implements intra-instruction gotos

* adds the sequence number documentation.

* fixes error handling in the goto-subinstruction primitive

* implements a proper disassembler factory that scans for ldefs files

So far not working quite correctly, as the default variables (like
word size, etc) are not properly set. Investigating...

* adds a proper processor initialization

* implements proper command-line interface

Now the plugin is able to list the targets and pass the path to the ghidra root.

* adds a tentative --x86-backend option to enable ghidra for x86

* fixes offset and address calculation

* adds `is-symbol` semantic primitive

* fixes overloading of the Primus Lisp semantic definitions

The overloading was prevented by the attributes computation, which
expected no overloads. Also, makes error message more readable.

* tries to overload p-code operations based on their operand types

* passes operands types per each operand, removes extra opcodes

It looks like not only branches are overloaded in p-code but all
operations, e.g., we can have `INT_ADD (mem:x) (mem:y) (mem:z)` that
represents `mem[x] <- mem[y] + mem[z]`.

We could also resolve this overloading by adding suffixes to the
operations, e.g., `INT_ADDmr`, `INT_ADDmm`, and so on, but that would
explode the number of opcodes, especially in the presence of
user-defined opcodes.

* catches the bad or unimplemented instructions during decoding

ghidra raises an exception if an instruction is not valid or there is
no semantics for it.

* adds signed ordering Primus Lisp primitives

and implements the corresponding pcode operations

* passes full information about the operand type from the backend

In p-code the semantics of an operation is defined by the types of its
operands. In the case of unique variables the type is not known to us,
so we have to pass it. This commit extends the previous approach, where
we were passing only the kind of the operand (mem vs imm), to full type
qualification, where the type of a memory operand is represented with
Nil and the type of an immediate is represented by its size in bits.
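
As a rough illustration (the constructor names here are hypothetical,
not the actual encoding used by the backend):

```
(* A memory operand has no statically known size, so its type is Nil;
   an immediate operand is qualified by its width in bits. *)
type operand_type =
  | Nil           (* memory operand *)
  | Bits of int   (* immediate operand of the given width *)

let pp_operand_type ppf = function
  | Nil -> Format.pp_print_string ppf "mem"
  | Bits n -> Format.fprintf ppf "imm:%d" n
```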

* adds the missing BOOL_NEGATE operation

* fixes the negation operator (pcode represents bool as a byte)

* removes aliased registers from the register table

* fixes the selection of the default backend for x86

it should be llvm if not specified otherwise

* switches to caseless ordering of variables

Changes and documents the ordering of variables. Variables are now
compared caselessly, and the ordering is made loosely compatible with
the caseless lexicographical ordering of the textual representation of
the variables' identifiers.
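
A minimal sketch of such a comparison (the actual ordering used by BAP
may differ in details):

```
(* Compare variable identifiers caselessly, falling back to the exact
   spelling only to keep distinct identifiers distinct. *)
let compare_vars x y =
  match String.compare (String.lowercase_ascii x) (String.lowercase_ascii y) with
  | 0 -> String.compare x y
  | n -> n
```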

* enables ghidra backend for the arm targets

Right now it is disabled by default; use `--arm-backend=ghidra` to
enable it.

* uses pcode-x86 as the CT language for pcode in x86 targets

* adds ghidra backend to mips

* minor pretty-printing tweaks

to make things more readable

* improves primitives performance in Primus Lisp

With this optimization Primus Lisp-based lifters run five times
faster. This is especially important for the ghidra backend lifters,
which are fully dependent on Primus Lisp.

The idea is to let the primitive implementors provide the body of
their primitive, so that every primitive is not computed via the
semantics promise but is invoked directly. Another big idea is to
provide an interface that allows factoring out computations that are
common to the target. The same idea could be extrapolated to all
promises.

* optimization: improves name resolution in Primus Lisp

Uses maps for names, not lists. Surprisingly, not much of an
improvement, only about 5%.

* optimizes unit computation

by hoisting it out of the loop

* optimization: do not request lisp arguments if not necessary

To compute the lisp arguments we need to invoke the theory and reflect
them into it. That work is discarded if the name is not bound. The
optimization checks whether the name is bound by the program and only
after that asks for the list of arguments.
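
Schematically (the parameter names here are hypothetical, not the
actual Primus Lisp API):

```
(* Reflect the arguments into the theory only after we know that the
   name is actually bound, so the costly reflection is never performed
   for unbound names. *)
let eval_if_bound ~is_bound ~reflect_args ~eval name =
  if is_bound name then Some (eval name (reflect_args ()))
  else None
```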

* adds ghidra to powerpc (works out of the box)

* adds the `--x86-64-backend` option

it is just an alias for `--x86-backend`, for consistency

* enables ghidra for riscv (doesn't run out of the box)

* enables the ghidra backend in CI/CD

We will build Ghidra only on supported targets, right now it is Ubuntu
Bionic. We will soon add more targets and more packages. The bap
packages will now be split into `bap-core`, `bap`, and `bap-extra`.

The `bap-core` package will contain the minimal part of the platform
without analyses. The `bap` package will include most of the analyses,
and finally `bap-extra` will include the heavy-weight analyses (in
terms of extra dependencies and build times), e.g., the symbolic
executor and the ghidra backend.

* moves the ghidra install section lower

* downgrades the ubuntu version on CI/CD

* tries to fix the macOS build

* disables ghidra in the opam/opam used in CI
The `popn` was always popping from the first register and `align_even`
was also doing `popn` instead of aligning.
It should be using the address size everywhere (instead it was using
the number of bits in the architecture, which is 8 for AVR8).
The `getBytesInAddress` virtual member function for ELF files is
implemented incorrectly in LLVM,
```
template <class ELFT>
uint8_t ELFObjectFile<ELFT>::getBytesInAddress() const {
  return ELFT::Is64Bits ? 8 : 4;
}
```

It is actually a static value: 4 bytes (32 bits) for ELF32 and 8
bytes (64 bits) for ELF64, which has nothing to do with the actual
bitness of the target architecture's addresses.

The fix uses the information obtained from the target (triple) and
falls back to `getBytesInAddress` only if the target is not 16, 32, or
64 bits (the target interface is also strange, as it doesn't allow any
other values).
* Added flatten/unroll pass.

* Update plugins/flatten/flatten_main.ml

Co-authored-by: Ivan Gotovchits <[email protected]>

* Update plugins/flatten/flatten_main.ml

Co-authored-by: Ivan Gotovchits <[email protected]>

* Update plugins/flatten/flatten_main.ml

Co-authored-by: Ivan Gotovchits <[email protected]>

* Update plugins/flatten/flatten_main.ml

Co-authored-by: Ivan Gotovchits <[email protected]>

* applies `make indent`

* adds the missing core_kernel dependency

Co-authored-by: Ivan Gotovchits <[email protected]>
For context, since the Ghidra PR the semantics of the [blk] operation
in Core Theory has been refined and the corresponding theories were
updated. Before #1326 all theories were just ignoring the labels
passed to the [blk] operator. The reason for that was that the [blk]
operation serves two purposes. It enables concatenation of data and
control flow effects, and it attaches a label to that sequence so that
it can be referenced elsewhere. Before #1326 it was only used for the
first purpose, i.e., to merge data and control effects into a single
effect. But for Ghidra we needed the ability to create labels
everywhere (as ghidra relies on Branch/Cbranch instructions), even to
express the intra-instruction logic, not real control flow.

Now, the theories have to take into account the label passed to the
blk operation when they produce their denotations, unless the label is
`Label.null`. If the label is `Label.null` then the operation denotes
just a sequence of data and control flow effects. Moreover,
denotations are allowed to coalesce several blocks together. But if
the label is non-null then the denotation has to preserve it.

Before this PR the BIR theory wasn't fully respecting the passed
labels and was sometimes optimizing them away, for example, when the
label was attached to an empty denotation. This PR takes care of
keeping the passed labels while still preserving the minimal form of
the generated IR. This, of course, assumes that lifters use the `blk`
operation correctly, i.e., that they do not pass non-null labels to
blocks that they do not plan to invoke later.

In other words, if you have a lifter that uses the `blk` operation you
need to update it and pass `Theory.Label.null` there instead of a
fresh label. This will have the same semantics as before. Passing a
non-null label now has a different semantics.
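
For lifter authors, the change amounts to something like the following
sketch, where CT is any module implementing Theory.Core; this is an
illustration, not code taken from an actual lifter:

```
open Bap_core_theory

(* An anonymous sequence of data and control effects is built with the
   null label and may be coalesced freely with its neighbours. *)
let sequence (module CT : Theory.Core) data ctrl =
  CT.blk Theory.Label.null data ctrl

(* A block that will be referenced later, e.g., by an intra-instruction
   goto, gets a real label, which the theory must now preserve. *)
let labeled (module CT : Theory.Core) label data ctrl =
  CT.blk label data ctrl
```
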
The problem was not in the heuristic per se, but in a cyclic dependency
between the abi pass that discovers __libc_start_main, the discovery
of main, the abi passes, and the api passes. Ideally, we should
reimplement our ABI/API infrastructure using the Knowledge Base, which
was originally designed to handle such dependencies, but that will
take much more time than we currently have.

Therefore, right now we discover `__libc_start_main` in the ABI pass,
to be sure that we have types applied to it (which we need, as we will
use them later to get the storage for the first argument). And we
delay the `main` function discovery until after the api pass. After we
find main we also want types to be applied. Unfortunately, we can't
call the api pass again (which is responsible for that), but we know
the prototype of main, so we don't really need to parse the C headers
anymore and can manually apply the prototype and translate it to the
arg terms.

Now the main function discovery works perfectly for the programs that
use the glibc runtime, so yeah, we can finally execute `/bin/true` and
`/bin/false` :)

```
$ bap /bin/false --run --primus-print-obs=call-return | grep main
(call-return (main 0 0x40000008 1))
$ bap /bin/true --run --primus-print-obs=call-return | grep main
(call-return (main 0 0x40000008 0))
```
This theory is used for debugging and introspection purposes. The bug
was in the denotation of the `branch` and `repeat` operations, which
were denoted as empty if any of their effects were empty. It is now
corrected.
This operation doesn't make sense for graphs with unlabeled nodes and
shall return the graph unchanged. A corresponding clarification
comment is added. Also, cleans up the documentation a little bit.
* Introduces context variables to the knowledge library

* makes theory a part of the context

Introduces the [Theory.current] and [Theory.with_current] functions.
* adds the register aliasing

* fixes the msb function in the lisp library

* uses sign extension during operand unification in Primus Lisp

Before that it was unsigned, but in general we should put more control
on it, as right now we have 32-bit aarch64 operations performed in 64
bits.
The Primus Lisp interpreter uses some state and was previously
implemented as the knowledge monad transformed into the state
monad. The new implementation just uses the newly introduced context
variables, which is much simpler and a little bit more performant. It
also enables us to persist the interpreter state between calls, so
that we can generate more efficient code (propagate constants between
instructions), especially for the ghidra backend.
Although ocaml/odoc#539 was fixed a long time ago, the fix never landed
in the 1.x branch of odoc, which is the stable, released version. All
beta versions do not have the stack overflow bug on core_kernel, but
they render unreadable documentation (we tried 2.0.0-beta through the
latest master). Therefore, we had to fall back to the `--no-pkg-deps`
workaround, so that we can get at least the BAP documentation. Note,
this workaround means that the links to core_kernel and other external
libraries will be broken; only links across the bap universe will work.
which is presumably fixed, but the fix is only available in master and
is not released.

We will start with the master branch and, if it breaks, stick to the
last working sha.
The run plugin was resetting the state of the toplevel knowledge base
to its initial state, effectively discarding the knowledge that was
accumulated when the Program.t and Symtab.t were constructed, as well
as discarding any information that might have been added to the
knowledge base in project passes.

The run pass was capturing the state at the time when the Project.t
was not yet fully ready, and at the end was using `Toplevel.set` to
reset any accumulated knowledge.

Effectively, this bug is a race condition, which we could prevent if
the toplevel interface provided safe functions. This will be addressed
in the next couple of pull requests (they will be more aggressive and
change the semantics of several constructs, so I decided to split them
out of this hotfix).

Big thanks to @anwarmamat for providing a minimal reproducible
example; without it, it would have been nearly impossible to identify
the bug.
bmourad01 and others added 30 commits December 12, 2022 14:48
…1559)

* Fix not propagating term attrs in `mangle_names` during lifting

* Fix style

* Update testsuite revision

* Log linking of lisp definitions

* Emit relocations for data symbols in LLVM

* KB interface for image loaders

* KB-ify the LLVM loader and prefer existing bias for the unit

* Add deps for `bap_llvm` library

* Add bitvec dep for `bap-std`

* Allocate libraries automatically

We will just round up to the nearest page after the previous
binary. Note that the main executable can go wherever it wants.
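
A sketch of the described placement policy (the 4 KiB page size here
is an assumption for illustration):

```
(* The next library is mapped at the first page boundary at or past
   the end of the previously loaded image. *)
let page_size = 0x1000L

let next_library_base ~previous_end =
  let open Int64 in
  div (add previous_end (sub page_size 1L)) page_size |> mul page_size
```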

* Log the mangling of names in `stub-resolver`

* Load and map library segments in the Primus loader + apply relocations

* Ignore empty filename

* Rename and also log the steps of the Primus loader

* Adds `relative-relocation` to the schema

* Fix style

* Mark `llvm:relative-relocation` only when a symbol is unavailable

* Remove unused headers

* Fix for LLVM >= 12

Co-authored-by: bmourad01 <[email protected]>
* fixes the raw recipes (a plain file rather than a directory or zip)

* drops llvm depext (as it already should come from bap-conf-llvm)

* switch from ubuntu-latest to focal

* downgrades macos workers from latest to 11
For convenience, we can convert a BIL type to a Core Theory sort.
* Fixes the stub resolver tests

* Preserves transitivity

* Drop `bap-relation`

Co-authored-by: bmourad01 <[email protected]>
…1567)

When we perform relocations on object files, we do the linker's work,
not the loader's; therefore we do not need to check whether the memory
is writable or even loadable. The easiest option is to extend the
Primus memory interface with primitives that perform writes without
any checks.
Switches to dune as the main build system but keeps support for
OASIS with ocamlbuild and omake backends.

The implementation utilizes the dune sites and plugins system, but it
still supports plugins that are built with bapbuild. To ease the
transition, the OASIS-built BAP is also able to load dune
plugins. Later the OASIS build system will be removed from BAP, but
the support for plugins bundled with bapbuild will remain.

Since dune sites impose their own restrictions, we had to introduce a
new bap-common package that manages all the sites, starting from
plugins and ending with semantic files.
Removes duplicate packages, provided that they have the same
implementation.

When dune executes a binary (either via `dune exec` or `dune test`) it
sets an internal `DUNE_DIR_LOCATIONS` environment variable, which
affects the selection of plugins in bap, even if bap is installed from
an independent repository. I am still not sure if this is a bug or a
feature of dune, but to prevent double-loading we have to deduplicate
plugins. We silently remove duplicate names only if all implementations
that match the name are the same. We compare implementations using the
sum of the md5 sums of the files that comprise the plugin folder. The
choice of md5 sums as the implementation witness, instead of parsing
the META files, allows us to preserve the abstraction of dune plugins,
as the contents of the folder are an implementation detail of dune.
This increases the probability that this code will still work when
dune changes its implementation.
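
A rough sketch of such a witness (not the actual implementation; it
hashes the concatenation of per-file digests rather than literally
summing them):

```
(* Derive a witness of a plugin's implementation from the digests of
   the files in its folder: two plugin folders with equal witnesses
   are considered the same plugin. *)
let implementation_witness folder =
  Sys.readdir folder
  |> Array.to_list
  |> List.sort String.compare
  |> List.map (fun file -> Digest.file (Filename.concat folder file))
  |> String.concat ""
  |> Digest.string
  |> Digest.to_hex
```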

In addition, this change works around another subtle bug in the dune
plugin system. When a plugin is removed with `opam remove`, opam leaves
an empty folder in the plugins directory, which we should ignore,
otherwise the dune plugin loader will fail on it.
Normalized Keys
---------------

1) Keys passed to `ocaml tools/configure.ml` can now use dashes
or underscores interchangeably
2) The keys are case-insensitive
3) During rewriting the keys can use either underscores or dashes,
   and can use any letter case.

Basically the comparison function makes the following keys equal:
`foo-bar`, `FOO_BAR`, `FOO-bar`, and `foo bar`.
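
A minimal sketch of such a comparison (an illustration, not the actual
configure.ml code):

```
(* Dashes, underscores, and spaces are treated as the same separator,
   and letter case is ignored. *)
let normalize_key key =
  String.lowercase_ascii key
  |> String.map (function '-' | ' ' -> '_' | c -> c)

let equal_keys k1 k2 = String.equal (normalize_key k1) (normalize_key k2)
```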

New Configuration Parameters
----------------------------

Publishes several configuration parameters via Bap_main.Configuration
and makes them available via `bap config`, e.g.,
```
$ bap config
version: 2.6.0-alpha
build_id: f4a6c54
datadir: /home/ivg/.local/share/bap
cachedir: /home/ivg/.cache/bap
sysdatadir: /home/ivg/factory/bap-dune/_opam/share/bap-common
libdir: /home/ivg/factory/bap-dune/_opam/lib
confdir: /home/ivg/factory/bap-dune/_opam/etc/bap-common
bindir: /home/ivg/factory/bap-dune/_opam/bin
plugindir: /home/ivg/factory/bap-dune/_opam/lib/bap-common/plugins
```

Updates bapbundle to Handle dune Plugins
----------------------------------------

```
bapbundle remove foo
```

Will now correctly delete the plugin `foo` even if it is a dune
plugin. (Note that it will only remove the folder from the plugins
directory and won't remove any installed libraries; for that, use opam
or your build system.)
When bap is installed from the testing repository, it fails on the
frontends due to a missing dependency

```
Bap_plugins_loader_backend referenced from
/home/runner/work/bap/bap/_opam/lib/bap-plugins/bap_plugins.cmxa(Bap_common)
```

Also improves the opam workflow and adds functional checks.
This will break too much downstream code, as well as our own tooling
like bapbuild
If bundled plugins are loaded before dune-packaged plugins, the latter
will fail on modules that were already linked by the former. The old
bap plugin system has no way to notify the dune plugin system about
which units are loaded (no worries, the dynlink system used by both
will prevent double-loading). Instead of catching the
Unit_already_loaded exception from dune, we better first load all dune
packages and only after that handle the old bundled plugins with our
old loader, which is capable of handling this exception correctly.
It's unclear how it was passing the tests, but there is still a missing
module in the bap-plugins OASIS specification that breaks downstream
builds.
* Fix use of `set-nzcv-from-registers` in Thumb `cmp`

* Fix docstring

* Fix `t2ADDrs`
* Adds missing POSIX API definition for `strchrnul`

* `__stack_chk_fail`
* Adds compatibility with LLVM 15

* Update `bap-llvm.opam`
This must be manually enabled by setting the variable `llvm-shared` to a
true value in the current opam switch, similar to the existing
`llvm-config` variable used by conf-bap-llvm.

* bap_llvm/config/llvm_configurator.ml: use `llvm-config --shared-mode`
  to get the appropriate linking mode

* bap_llvm/llvm_disasm.cpp: add missing include for Optional.h
Some were deprecated, some removed, and others were just failing.
Occasionally, llvm refines their instruction definitions and changes
the number of operands; for example, this [commit][1] added an extra
destination register operand. For the old lifters we used pretty
strict rules for operand deconstruction, and for the bts, btc, and btr
instructions we were expecting two operands (now there are three of
them).

I decided to relax the deconstructor instead of changing the
definition of the instructions, first to preserve backward
compatibility with the older versions of llvm, and second because many
other instructions could be affected by the same changes on the llvm
side, and I don't see any harm in accepting instructions with one more
argument, knowing that destinations are always prepended.

[1]: llvm/llvm-project@fe96ff7
Eases local dune-based development, so that you can easily start local
development without too much hassle with external dependencies. Allows
you to disable the bap-extra packages and just do `dune build`, and
everything should work.

For context, Dune doesn't provide nice options for multi-project
monorepos. The dune files represent the static information about the
project, and it is presumed that there is an external tool that
generates them. Cf. the tezos manifest that generates dune files from
an OCaml EDSL, or whatever they use at Janestreet (jenga?). Originally,
the plan was to implement the same mechanism, e.g., generate the
dune-project file during the configuration phase. But right now I
don't have enough time for that, so I am pushing a simple
workaround. All packages in bap-extra are now marked as allow_empty,
indicating that they may produce no artifacts. Their corresponding
libraries are now optional (if a dependency is not met, then the
library is silently not built). The only special case is ghidra, which
doesn't have dependencies trackable by Dune (i.e., OCaml libraries,
besides base and bap), so to prevent it from building I added a small
check that the ghidra library is installed.
Since v4 they all must have distinct names across all workflows.