cpu_rec is a tool that recognizes CPU instructions in an arbitrary binary file. It can be used as a standalone tool, or as a plugin for binwalk (https://github.com/devttys0/binwalk).
- Copy cpu_rec.py and cpu_rec_corpus in the same directory.
- If you don't have the lzma module installed for your python (this tool works either with python3 or with python2 >= 2.4) then you should unxz the corpus files in cpu_rec_corpus.
- If you want to enhance the corpus, you can add new data in the corpus directory. If you want to create your own corpus, please look at the method build_default_corpus in the source code.
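The check described in the second item can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the .xz suffix of the corpus files is an assumption inferred from the unxz instruction.

```python
import glob
import os


def lzma_available():
    """Return True if this interpreter can import the lzma module."""
    try:
        import lzma  # noqa: F401  (only testing that the import works)
    except ImportError:
        return False
    return True


def still_compressed(corpus_dir="cpu_rec_corpus"):
    """List the .xz files in corpus_dir (assumed suffix), i.e. files not yet unxz'ed."""
    return sorted(glob.glob(os.path.join(corpus_dir, "*.xz")))
```

If lzma_available() is False and still_compressed() is not empty, the listed files are the ones to decompress with unxz.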
To use the tool as a binwalk module, install it as above, but the installation directory must be the binwalk module directory: $HOME/.config/binwalk/modules.
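A minimal sketch of that copy step, assuming cpu_rec.py and cpu_rec_corpus sit together in a source directory. The function name install_binwalk_module and its parameters are hypothetical and not part of cpu_rec or binwalk.

```python
import os
import shutil


def install_binwalk_module(src_dir, modules_dir=None):
    """Copy cpu_rec.py and cpu_rec_corpus from src_dir into modules_dir."""
    if modules_dir is None:
        # Default location named in the text: $HOME/.config/binwalk/modules
        modules_dir = os.path.join(
            os.path.expanduser("~"), ".config", "binwalk", "modules")
    if not os.path.isdir(modules_dir):
        os.makedirs(modules_dir)
    shutil.copy(os.path.join(src_dir, "cpu_rec.py"), modules_dir)
    corpus_dst = os.path.join(modules_dir, "cpu_rec_corpus")
    # Leave an existing corpus untouched rather than overwriting it.
    if not os.path.exists(corpus_dst):
        shutil.copytree(os.path.join(src_dir, "cpu_rec_corpus"), corpus_dst)
    return modules_dir
```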
You'll need a recent version of binwalk, that includes the patch provided by ReFirmLabs/binwalk#241 .
Add the flag -% when using binwalk.
Be patient. Waiting a few minutes for the result is to be expected. On my laptop the tool takes 25 seconds and 1 GB of RAM to create the signatures for 70 architectures, and then the analysis of a binary takes one minute per MB. If you want the tool to be faster, you can remove some architectures if you know that your binary is not one of them (typically, Cray or MMIX are not found in firmware).
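As a rough back-of-the-envelope estimate, the figures above can be turned into a small helper. This sketch assumes that the cost scales linearly with file size and that one MB means 2**20 bytes; both are assumptions, and the constants are only the laptop measurements quoted above.

```python
def estimated_seconds(file_size_bytes, startup_s=25.0, seconds_per_mb=60.0):
    """Rough runtime estimate: signature creation plus per-MB analysis time."""
    megabytes = file_size_bytes / float(1024 * 1024)
    return startup_s + seconds_per_mb * megabytes
```

With these defaults, a 2 MB firmware image would be expected to take a little under two and a half minutes.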
Just run the tool with the binary file(s) to analyze as argument(s). The tool will try to match an architecture for the whole file, and then to detect the largest binary chunk that corresponds to a CPU architecture; usually it is the right answer.
If the result is not satisfying, prepending -v twice to the arguments makes the tool very verbose; this is helpful when adding a new architecture to the corpus.
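When scripting the standalone tool, the command line described above can be assembled as in the following sketch. The helper cpu_rec_command is hypothetical, and the script path is assumed to be cpu_rec.py in the current directory.

```python
import sys


def cpu_rec_command(files, verbose=False, script="cpu_rec.py"):
    """Build the argument list for running cpu_rec on one or more files."""
    cmd = [sys.executable, script]
    if verbose:
        # The text asks for -v twice, placed before the file arguments.
        cmd += ["-v", "-v"]
    return cmd + list(files)
```

The resulting list can be passed to subprocess.call, for instance subprocess.call(cpu_rec_command(["firmware.bin"], verbose=True)).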
If https://github.com/LRGH/elfesteem is installed, then the tool also extracts the text section from ELF, PE, Mach-O or COFF files, and outputs the architecture corresponding to this section; the possibility of extracting the text section is also used when building a corpus from full binary files.
Option -d followed by a directory dumps the corpus in that directory; using this option one can reconstruct the default corpus.
The function which_arch takes a bytestring as input and outputs the name of the architecture, or None. Loading the training data is done during the first call of which_arch, and calling which_arch with no argument does this precomputation only.
For example:
>>> from cpu_rec import which_arch
>>> which_arch()
>>> which_arch(b'toto')
>>> which_arch(open('/bin/sh', 'rb').read())
'X86-64'
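Since which_arch expects a bytestring, files should be opened in binary mode. A small wrapper along these lines keeps that detail in one place; identify_file is a hypothetical helper, and the classifier is passed in as a parameter so the sketch does not depend on cpu_rec being importable.

```python
def identify_file(path, classifier):
    """Read path as raw bytes and return whatever classifier says about them."""
    with open(path, "rb") as handle:
        data = handle.read()
    return classifier(data)
```

With cpu_rec on the Python path, this would be called as identify_file('/bin/sh', which_arch).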
Running the tool as a binwalk module typically results in:
shell_prompt> binwalk -% corpus/PE/PPC/NTDLL.DLL corpus/MSP430/goodfet32.hex
Target File: .../corpus/PE/PPC/NTDLL.DLL
MD5 Checksum: d006a2a87a3596c744c5573aece81d77
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x5800, entropy=0.620536)
22528         0x5800          PPCel (size=0x4c800, entropy=0.737337)
335872        0x52000         None (size=0x1000, entropy=0.720493)
339968        0x53000         IA-64 (size=0x800, entropy=0.491011)
342016        0x53800         None (size=0x22000, entropy=0.727501)
Target File: .../corpus/MSP430/goodfet32.hex
MD5 Checksum: 4b295284024e2b6a6257b720a7168b92
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x8000, entropy=0.473132)
32768         0x8000          MSP430 (size=0x5000, entropy=0.473457)
53248         0xD000          None (size=0x3000, entropy=0.489337)
We can notice that during the analysis of PPC/NTDLL.DLL a small chunk has been identified as IA-64. This is an erroneous detection, due to the fact that the IA-64 architecture has statistical properties similar to data sections.
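When post-processing such listings, small spurious chunks like the IA-64 one can be set aside by keeping only the largest recognized chunk. The following sketch parses result rows of the shape shown above; the regular expression and helper names are assumptions based on that sample output, not part of binwalk or cpu_rec.

```python
import re

# One result row: decimal offset, hex offset, architecture, size and entropy.
ROW = re.compile(
    r"^\s*(\d+)\s+0x([0-9A-Fa-f]+)\s+(\S+)\s+"
    r"\(size=0x([0-9A-Fa-f]+),\s*entropy=([0-9.]+)\)\s*$"
)


def parse_rows(text):
    """Return one dict per result row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        arch = match.group(3)
        rows.append({
            "offset": int(match.group(1)),
            "arch": None if arch == "None" else arch,
            "size": int(match.group(4), 16),
            "entropy": float(match.group(5)),
        })
    return rows


def largest_chunk(rows):
    """Return the biggest row with a recognized architecture, or None."""
    known = [row for row in rows if row["arch"] is not None]
    return max(known, key=lambda row: row["size"]) if known else None
```

Applied to the NTDLL.DLL listing, largest_chunk selects the PPCel row, whose size of 0x4c800 is far larger than the 0x800 bytes reported as IA-64.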
If the entropy value is above 0.9, the data is probably encrypted or compressed, and the result of cpu_rec is therefore likely to be meaningless.
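To illustrate the kind of value involved, here is a sketch of a Shannon entropy over byte values, normalized to the range 0 to 1. Whether cpu_rec computes its entropy in exactly this way is an assumption; the function names are hypothetical and the code targets Python 3.

```python
import math
from collections import Counter


def normalized_entropy(data):
    """Shannon entropy of the byte values in data, divided by 8 bits."""
    values = bytearray(data)
    if not values:
        return 0.0
    total = float(len(values))
    entropy_bits = 0.0
    for count in Counter(values).values():
        p = count / total
        entropy_bits -= p * math.log2(p)
    return entropy_bits / 8.0


def looks_packed(data, threshold=0.9):
    """Apply the 0.9 rule of thumb from the text to a block of bytes."""
    return normalized_entropy(data) > threshold
```

A block made of a single repeated byte has entropy 0, while a block in which all 256 byte values are equally frequent reaches 1.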
The tool has been presented at SSTIC 2017, with a full paper describing why this technique has been used for the recognition of architectures. A video of the presentation and the slides are available.
This presentation was made in French. An English translation of the slides is available; an English translation of the paper is in progress.
6502
68HC08
68HC11
8051
Alpha
ARcompact
ARM64
ARMeb
ARMel
ARMhf
AVR
AxisCris
Blackfin
Cell-SPU
CLIPPER
CompactRISC
Cray
Epiphany
FR-V
FR30
FT32
H8-300
H8S
HP-Focus
HP-PA
i860
IA-64
IQ2000
M32C
M32R
M68k
M88k
MCore
Mico32
MicroBlaze
MIPS16
MIPSeb
MIPSel
MMIX
MN10300
Moxie
MSP430
NDS32
NIOS-II
OCaml
PDP-11
PIC10
PIC16
PIC18
PIC24
PPCeb
PPCel
RISC-V
RL78
ROMP
RX
S-390
SPARC
STM8
Stormy16
SuperH
TILEPro
TLCS-90
TMS320C2x
TMS320C6x
TriMedia
V850
VAX
Visium
WE32000
X86-64
X86
Xtensa
Z80
#6502#cc65
Because of licensing issues, the following architectures are not in
the default corpus, but they can be manually added:
78k
The cpu_rec.py file is licensed under the Apache License, Version 2.0.
The files in the default corpus have been built from various sources. The corpus is a collection of compressed files. Each compressed file is dedicated to the recognition of one architecture and is made by compressing the concatenation of one or more binary chunks, which come from various origins and have various licences. Therefore, the default corpus is a composite document, each sub-document (a chunk) being redistributed under the appropriate licence.
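The structure of one corpus file, as described above, can be sketched with the standard lzma module (Python 3). The helper names are hypothetical; the way cpu_rec actually builds its files is in build_default_corpus, which this sketch does not reproduce.

```python
import lzma


def pack_chunks(chunks):
    """Concatenate binary chunks for one architecture and xz-compress the result."""
    return lzma.compress(b"".join(chunks))


def unpack(blob):
    """Recover the concatenated chunks from one compressed corpus blob."""
    return lzma.decompress(blob)
```

Note that the chunk boundaries are not preserved: after decompression only the concatenation is available.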
The origin of each chunk is described in cpu_rec.py, in the function build_default_corpus. The licences are:
- The files libgmp.so, libc.so, libm.so come from Debian binary distributions and are distributed under GPLv2 (and LGPLv3 for recent versions of libgmp); the source code is available from http://archive.debian.org/.
- busybox binaries come from https://busybox.net/downloads/binaries/ and are distributed under GPLv2.
- C-Kermit binaries come from ftp://kermit.columbia.edu/kermit/bin/ and are distributed under GPLv2 (according to ftp://kermit.columbia.edu/kermit/archives/COPYING but the status of each binary is not always clear).
- All files identified in build_default_corpus as part of the CROSS_COMPILED subdirectory have been built by myself. The corresponding source code is zlib (from http://zlib.net/, distributed under the zlib licence) or libjpeg (from http://www.ijg.org/, distributed under an unknown licence) or some other code based on public sources (e.g. https://anonscm.debian.org/cgit/pkg-games/bsdgames.git/tree/arithmetic/arithmetic.c modified to work with SDCC compilers).
- The camlp4 binary is built from https://github.com/ocaml/camlp4 and distributed under LGPLv2.
- The binary for TMS320C2x comes from https://github.com/slavaprokopiy/Mini-TMS320C28346/blob/master/For_user/C28346_Load_Program_to_Flash/Debug/C28346_Load_Program_to_Flash.out where it is distributed under an unknown licence.
- The binary for RISC-V comes from https://riscv.org/software-tools/, is distributed under GPLv2, and can be downloaded at https://github.com/radare/radare2-regressions/blob/master/bins/elf/analysis/guess-number-riscv64
- The binaries for PIC10 and PIC16 come from http://www.pic24.ru/doku.php/en/osa/ref/examples/intro where they are distributed under an unknown licence.
- The binary for PIC18 comes from https://github.com/radare/radare2-regressions/blob/master/bins/pic18c/FreeRTOS-pic18c.hex where it seems to be distributed under GPLv3 (or later).
- The binary for PIC24 comes from https://raw.githubusercontent.com/mikebdp2/Bus_Pirate/master/package_latest/BPv4/firmware/bpv4_fw7.0_opt0_18092016.hex distributed under Creative Commons Zero.
- The binary for 6502 comes from https://raw.githubusercontent.com/RolfRolles/Atredis2018/master/MemoryDump/data-4000-efff.bin and was distributed for the Atredis BlackHat 2018 challenge, under an unknown licence.
- The binary for H8S comes from #4 and was distributed by Dell, under an unknown licence.
- The binary for TriMedia comes from https://github.com/crackinglandia/trimedia/blob/master/tm-linux/tmlinux-kernel-obj-latest.tar.bz2 where it is distributed under an unknown licence.
- A binary for Nec/Renesas 78k can be found at https://www.metz-mecatech.de/en/lighting/firmware-download-flash-units/mecablitz-50-af-1-digital.html where it is distributed under a restrictive licence. The file named MB50AF1_NikonV12.mtz is a nibble-swapped Intel-HEX firmware (cf. https://debugmo.de/2011/10/whats-inside-metz-50-af-1-n/) with 0x7d5a bytes of 78k code starting at offset 0x2ba.
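For readers who want to add 78k manually, the two decoding steps named above can be sketched as follows (Python 3). This is a simplified illustration under stated assumptions: nibble_swap and ihex_payload are hypothetical helpers, the Intel-HEX decoder only concatenates data records (type 00) and ignores load addresses and checksums, and it has not been checked against the actual .mtz file.

```python
def nibble_swap(data):
    """Swap the high and low 4-bit halves of every byte."""
    return bytes(((b << 4) & 0xF0) | (b >> 4) for b in data)


def ihex_payload(text):
    """Concatenate the payload of Intel-HEX data records found in text."""
    out = bytearray()
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(":"):
            continue
        # Record layout: count (1 byte), address (2), type (1), data, checksum (1).
        record = bytes.fromhex(line[1:])
        count, record_type = record[0], record[3]
        if record_type == 0x00:
            out += record[4:4 + count]
    return bytes(out)
```

How the offset 0x2ba quoted above maps onto the decoded data is not covered by this sketch and should be verified on the file itself.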