introduction

kuangkaiyuan · Sep 27, 2016 · bc2fd41
1 parent 3802d4b
commit bc2fd41
Showing 8 changed files with 162 additions and 182 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+*.synctex
+DCVP.pdf
+DCVP.pdf
diff --git a/DCVP.pdf b/DCVP.pdf
diff --git a/DCVP.tex b/DCVP.tex
diff --git a/abstract.tex b/abstract.tex
@@ -0,0 +1,22 @@
+\begin{abstract}
+Code virtualization built upon virtual machine (VM) technologies is emerging
+as a viable method for implementing code obfuscation to protect programs
+against unauthorized analysis. State-of-the-art VM-based protection
+approaches use a fixed set of virtual instructions and bytecode interpreters
+across programs. This, however, opens up a security hole where an experienced
+attacker can use knowledge extracted from other programs to quickly uncover
+the mapping between virtual instructions and native code for applications
+protected under the same scheme. In this paper, we propose a novel VM-code
+obfuscation system to address this problem. The core idea of our approach is
+to obfuscate the mapping between the opcodes of bytecode instructions and
+their semantics. We achieve this by partitioning each protected code region
+into multiple segments where the mapping of opcodes and their semantics is
+randomized in different ways in different segments. In this way, each
+bytecode instruction will be translated into different native code in
+different sections of the obfuscated code. This significantly increases the
+diversity of the program behavior. As a result, the knowledge of bytecode to
+native code mappings obtained from other programs is unlikely to be useful
+for a new program. We evaluate our approach on a set of real-world
+applications and compare it against two state-of-the-art VM-based code
+obfuscation approaches. Experimental results show that our simple approach is effective: it provides stronger protection at the cost of little extra overhead.
+\end{abstract}
diff --git a/background.tex b/background.tex
@@ -0,0 +1,52 @@
+\section{Background}\label{sec:background}
+Virtualization techniques have been used in many fields, e.g.,
+virtual memory for resource virtualization, VMware and VirtualBox for CPU virtualization,
+and Java bytecode and .NET CIL for application virtualization.
+This paper discusses another application of virtualization techniques:
+protecting software programs from unauthorized analysis,
+known as code virtualized obfuscation or VM-based obfuscation
+and implemented by tools such as VMProtect \cite{vmp} and Code Virtualizer \cite{cv}.
+As software does not have uniform security requirements throughout its execution \cite{geneiatakis2012adaptive}
+and protecting the whole program is too costly,
+code virtualized obfuscation usually protects only the critical part(s) of the whole software program,
+which could be a critical algorithm or a piece of processing logic.
+VM-based obfuscation protects a target program by transforming its native machine code into bytecode
+for a self-defined virtual instruction set architecture.
+At runtime, the semantics of the original instructions are carried out
+by a native interpreter bundled with the bytecode.
+In this section, we will look into the internal working mechanism of VM-based obfuscation.
+
+%\subsection{The Internals of VM-based Obfuscation}
+Figure \ref{fig:vmprotection} shows the architecture of a VM-based obfuscation system.
+The core components of a VM-based obfuscation system are the virtual instruction set (IS) and the native interpreter.
+Virtual instructions are used to emulate native instructions.
+The virtual IS must be able to emulate all the semantics of the native IS;
+formally speaking, the virtual IS should be Turing-equivalent to the native IS.
+The native interpreter fetches and executes bytecode instructions.
+It follows the \textit{decode-dispatch} approach \cite{ghosh2012replacement}
+and consists of a set of \texttt{handlers} and a \texttt{VMloop}.
+\texttt{VMloop} is the main \textit{decode-dispatch} loop: in each iteration,
+it fetches a bytecode instruction, decodes it, and dispatches a \texttt{handler} to interpret it.
+Unlike native instructions, bytecode instructions operate on a virtual context,
+namely \texttt{VMcontext}, which contains the virtual registers and flags.
+Virtual registers and flags are related to the native registers and flags.
+At runtime, \texttt{VMinit} first saves the native context and uses it to initialize the virtual context.
+In the simplest implementation, the virtual context could be a block of memory
+that stores the exact values of the native context.
+More complex layouts are possible \cite{falliere2009inside},
+but the conversion between the native context and the virtual context must be reversible,
+since \texttt{VMexit} will restore the native context from the virtual context upon exiting the virtual interpreter.
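
The decode-dispatch structure described above can be illustrated with a minimal sketch. It is not the implementation of any particular protector: the opcode values, the stack-based handler set, and the names `VMContext`, `vm_init`, `vm_loop`, `vm_exit` and `run_protected` are all assumptions made for illustration, and the "native context" is modelled as a plain dictionary of register values.

```python
from dataclasses import dataclass, field


@dataclass
class VMContext:
    """Virtual context: virtual registers plus an operand stack (hypothetical layout)."""
    regs: dict = field(default_factory=dict)
    stack: list = field(default_factory=list)


# Handlers: each one interprets a single virtual instruction.
def h_push_imm(ctx, operand):
    ctx.stack.append(operand)


def h_push_reg(ctx, operand):
    ctx.stack.append(ctx.regs[operand])


def h_add(ctx, operand):
    b = ctx.stack.pop()
    a = ctx.stack.pop()
    ctx.stack.append(a + b)


def h_pop_reg(ctx, operand):
    ctx.regs[operand] = ctx.stack.pop()


# Fixed opcode-to-handler mapping (illustrative opcode values).
HANDLERS = {0x01: h_push_imm, 0x02: h_push_reg, 0x03: h_add, 0x04: h_pop_reg}
OP_EXIT = 0xFF


def vm_init(native_regs):
    """Save the native context and use it to initialize the virtual context."""
    return VMContext(regs=dict(native_regs))


def vm_exit(ctx):
    """Restore the native context from the virtual context."""
    return dict(ctx.regs)


def vm_loop(bytecode, ctx):
    """Fetch, decode and dispatch until the exit opcode is reached."""
    pc = 0
    while True:
        opcode, operand = bytecode[pc]  # fetch and decode
        pc += 1
        if opcode == OP_EXIT:
            return
        HANDLERS[opcode](ctx, operand)  # dispatch to the handler


def run_protected(bytecode, native_regs):
    ctx = vm_init(native_regs)
    vm_loop(bytecode, ctx)
    return vm_exit(ctx)


# Bytecode for the native instruction "add eax, 5".
program = [(0x02, "eax"), (0x01, 5), (0x03, None), (0x04, "eax"), (OP_EXIT, None)]
result = run_protected(program, {"eax": 37, "ebx": 1})
```

In this sketch the conversion between the two contexts is a plain copy, which is trivially reversible; `result` holds the restored register values after the protected region has run.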
+
+
+\begin{figure}[!t]
+\centering
+\includegraphics[width=0.9\textwidth]{fig/vmprotection.pdf}
+\caption{The architecture of code virtualized obfuscation and the execution view of a VM-obfuscated program. The main work of this paper is to improve the core steps of VM-based protection (areas ``a'' and ``b''). In the ``a'' region, we adopt partitioned bytecode encoding schemes and obfuscate \texttt{handlers} to generate multiple sets of \texttt{handlers}. In the ``b'' region, we use a variety of obfuscation and anti-taint-analysis techniques to protect the important components of the virtual interpreter.}
+\label{fig:vmprotection}
+\end{figure}
+
+
+Figure \ref{fig:vmprotection} also depicts the workflow of the obfuscation process. It starts by extracting the critical code from the target program. This is typically done with the help of the program author, who marks the location and scope of the critical code to be protected during programming; the obfuscation system then searches for these marks to locate the critical code at obfuscation time. The critical code is disassembled into native assembly instructions so that it can later be converted from native instructions to virtual instructions in an instruction-by-instruction fashion. The conversion rules are set ahead of protection and are fixed within a VM-based obfuscation system. These rules depend on the semantics of the virtual IS and guarantee that the semantics of the resulting virtual instructions are equivalent to those of the native ones. Subsequently, the virtual instructions are encoded into a bytecode program. Finally, the bytecode program and the other VM components are assembled into the target program through binary rewriting.
+This paper improves the core steps of code virtualization protection. We modify the encoding scheme, partition the bytecode program, and generate multiple sets of \texttt{handlers}, so that a bytecode instruction has different semantics in different parts of the bytecode program. We also use a variety of obfuscation and anti-taint-analysis techniques to protect the critical components of the virtual interpreter (Section~\ref{sec:VI-Bytecode}).
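
As a rough illustration of the partitioning idea, the sketch below encodes a sequence of virtual instructions segment by segment, drawing a fresh random opcode table for each segment. Everything in it is an assumption made for illustration: the mnemonic list, the fixed segment size, the seeded generator, and the helper names `make_table`, `encode` and `decode`. It does not reproduce the paper's actual encoding or handler-generation scheme.

```python
import random

# Hypothetical virtual instruction set (mnemonics only).
MNEMONICS = ["push", "pop", "add", "sub", "jmp"]


def make_table(rng):
    """Draw a random one-to-one mapping from mnemonics to opcode values."""
    opcodes = list(range(len(MNEMONICS)))
    rng.shuffle(opcodes)
    return dict(zip(MNEMONICS, opcodes))


def encode(instrs, segment_size, seed):
    """Split the instruction stream into segments, each with its own opcode table."""
    rng = random.Random(seed)
    segments = []
    for start in range(0, len(instrs), segment_size):
        table = make_table(rng)
        chunk = instrs[start:start + segment_size]
        segments.append((table, [table[m] for m in chunk]))
    return segments


def decode(segments):
    """Recover the mnemonics; only possible with the table of each segment."""
    out = []
    for table, codes in segments:
        inverse = {op: m for m, op in table.items()}
        out.extend(inverse[c] for c in codes)
    return out


program = ["push", "push", "add", "pop", "push", "sub", "jmp"]
segments = encode(program, segment_size=3, seed=2016)
recovered = decode(segments)
```

Because every segment carries its own table, the same opcode value may denote different virtual instructions in different segments, so a table learned from one segment (or one program) does not in general decode another.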
+
+At runtime, upon reaching the ``critical code'', an instruction, \texttt{jmp VMinit}, transfers control to \texttt{VMinit} (Step \ding{182}). \texttt{VMinit} saves the native context and initializes the virtual context. Next, \texttt{VMloop} starts to work. It fetches a bytecode instruction, decodes it (Step \ding{183}) and dispatches a \texttt{handler} to interpret it (Step \ding{184}). Step \ding{183} and Step \ding{184} are iterated until all the bytecode instructions have been interpreted. Then, \texttt{VMloop} transfers execution to \texttt{VMexit} (Step \ding{185}), where the native context is restored and the program jumps back to the native instruction following the critical code (Step \ding{186}) and continues to execute the rest of the program code.
diff --git a/clean.bat b/clean.bat
@@ -7,4 +7,3 @@ del *.blg /s
 del *.thm /s
 del *.toc /s
 del *.out /s
-del *.synctex /s
diff --git a/motivation.tex b/motivation.tex
@@ -0,0 +1,32 @@
+\section{Motivation}
+\begin{figure}[!t]
+\centering
+\includegraphics[width=0.75\columnwidth]{fig/figone.pdf}
+\caption{The process of reusing attacking knowledge for code reverse engineering.
+Here we have four different target programs, A, B, C and D.
+On the right side of the figure, all programs are obfuscated with a scheme in which each virtual instruction is deterministically translated to a fixed sequence of native code.
+This allows an attacker to reuse knowledge obtained from one program to efficiently reverse engineer other programs.
+In the other scenario, the mapping between virtual instructions and native code is different for different programs.
+In this way, the attacker is unable to reuse the previously extracted knowledge to perform reverse analysis across programs.}
+\label{fig:one}
+\end{figure}
+
+Figure~\ref{fig:one} depicts a reverse-analysis scenario where an analyst can
+reuse \textit{analysis knowledge} to attack applications protected by
+the same VM-based code obfuscation scheme. In this example, there are four different programs
+to be protected, labelled A, B, C and D. On the right side of the diagram,
+all four programs are protected using an identical set of virtual
+instructions and bytecode handlers. Under this setting, an experienced analyst would be able to
+use the knowledge of the mapping between virtual instructions and bytecode handlers obtained
+from one program to reverse-engineer the other three programs. Bear in mind that
+uncovering the mapping between virtual instructions and native code is often the most
+time-consuming step in attacking VM-based code obfuscation. Being able to
+reuse the attacking knowledge can thus significantly reduce the cost of the
+attack.
+In the other scenario, the translations between virtual instructions and native code
+vary among programs. Therefore, the
+knowledge obtained from one program will be inapplicable to others.
+This forces the analyst to start from scratch when reverse engineering a new program.
+This example shows that shuffling the relationship between the virtual instructions and bytecode handlers
+can significantly increase the effort and cost involved in performing the attack.
+In the remainder of the paper, we describe in detail how such a scheme can be constructed.
diff --git a/threat_model.tex b/threat_model.tex
@@ -0,0 +1,26 @@
+\section{Threat Model}\label{sec:threat-model}
+In our threat model, an analyst owns a copy of the target VM-obfuscated software program and runs it in a malicious host environment \cite{collberg2002watermarking}. In a malicious host environment, also referred to as the white-box attack context \cite{chow2003white,liem2008compiler}, the analyst has fully privileged access to the system: she can execute the software program at will and take advantage of any static and dynamic analysis tools (such as ``\texttt{IDA}''\footnote{IDA Pro, https://www.hex-rays.com/index.shtml.}, ``\texttt{OllyDbg}''\footnote{OllyDbg, http://www.ollydbg.de/.} and ``\texttt{Sysinternals Suite}''\footnote{Sysinternals Suite, https://technet.microsoft.com/en-us/sysinternals/bb842062/.}) to trace and analyze instructions, monitor registers and process memory, and even change instruction bytes and control flow at runtime. The analyst in our threat model is defined as an entity that seeks to reverse engineer and understand the inner implementation and logic of a software program, since code understanding is the basis for later tampering, cracking, or behavior pattern extraction. The ultimate goal of the analyst is to automate the entire analysis process. At present, there are two main types of attacks on VM-based protection systems. The first is based on virtual execution analysis; the other is based on behavior and semantic analysis.
+
+\subsection{Attack based on the virtual execution analysis}
+This method, proposed by Rolles et al. \cite{rolles2009unpacking}, is based on analyzing how the bytecode program is interpreted, and requires the attacker to have some understanding of the principles of code virtualization.
+
+It can be summarized in three steps. First, reverse engineer the virtual interpreter. The purpose of this step is to obtain the location of each component of the virtual interpreter, the interactions between the components, and the mapping between the real CPU context and \texttt{VMcontext}. Second, use this information to work out the semantics of individual bytecode instructions. %By using the dynamic analysis tool to record the decoding process of the bytecode and find its corresponding \texttt{handlers} which reflect the semantic information of the bytecode.
+Finally, recover the original program's logic embedded in the bytecode program, eliminate redundant information, and restore a program that is similar to the original. Nicolas Falliere \cite{falliere2009inside} presented an example of this analysis process, applied to Trojan.Clampi protected by VMProtect \cite{vmp}.
+
+Since the virtual interpreter consists of \texttt{VMloop} and \texttt{handlers}, an analyst needs to locate them and analyze how \texttt{VMloop} works and what each handler does. By tracing the \textit{decode-dispatch} loop in \texttt{VMloop}, the analyst can figure out the correspondence between bytecode instructions and handlers, and thus the semantics of individual bytecode instructions. Combining this information, the analyst can figure out what the bytecode program does, and after some simplification (for example, constant folding and dead code elimination \cite{fightingoreans}), she can reveal the original program's logic.
+
+In classical code virtualized obfuscation, the relationships between the opcodes of bytecode instructions and their semantics (the \texttt{handlers}) are fixed, which means that a bytecode instruction has identical semantics in different obfuscated programs. Once an analyst becomes aware of these relationships and of the semantics of each handler, either from previous analysis work or from materials published by another analyst, she can reuse them to analyze another VM-obfuscated program more efficiently, just like the first scenario in Figure~\ref{fig:one}.
+
+
+\subsection{Attack based on the behavior and semantic analysis}
+This type of attack can be used not only against code virtualization protection but also against other obfuscation methods.
+Coogan et al.~\cite{coogan2011deobfuscation} put forward a behavior-based analysis method that aims to analyze the important behavior of the code but is not concerned with restoring the original code. Its steps are as follows: (1) Dynamically trace the program's execution using debugging tools, and collect instruction execution information such as instruction addresses and register values. (2) Identify the system calls and their parameters from this information. (3) Mark all instructions that influence the system calls. (4) Extract the marked instructions and analyze their behavior.
+
+This type of approach is usually used for malicious code analysis because it is based on analyzing the interaction between the program and the system. Malicious code interacts with the system frequently in order to achieve its malicious purpose, whereas benign code need not do so. In other words, if the protected code does not interact with the system frequently, this approach will not be very effective for reverse analysis.
+
+Another attack method, based on semantics, is proposed by Yadegari et al.~\cite{Yadegari2015A}. It uses taint propagation to track the flow of input values, and semantics-preserving code transformations to simplify the logic of the instructions. The steps are as follows: (1) Dynamically trace the program's execution using debugging tools, and identify the inputs and outputs of the program. (2) Perform taint propagation with the program's inputs as taint sources, and extract the affected instruction sequences. (3) Simplify these instruction sequences using code simplification techniques, then construct and optimize the control flow graph of the program to obtain the final result.
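
Step (2) can be sketched as forward taint propagation over a recorded trace. The sketch is a deliberate simplification and not the algorithm of Yadegari et al.: the trace format (one destination and a list of source locations per instruction), the location names, and the function `taint_slice` are assumptions made for illustration, and memory, flags and control dependencies are ignored.

```python
def taint_slice(trace, taint_sources):
    """Forward taint propagation over a trace of (dest, sources) pairs.

    Returns the indices of the instructions that consume tainted data and
    the set of locations that are tainted once the trace has finished.
    """
    tainted = set(taint_sources)
    affected = []
    for index, (dest, sources) in enumerate(trace):
        if any(src in tainted for src in sources):
            tainted.add(dest)        # dest now depends on the input
            affected.append(index)
        else:
            tainted.discard(dest)    # dest is overwritten with clean data
    return affected, tainted


# Hypothetical trace: "vpc" stands for interpreter bookkeeping, "k" for a constant.
trace = [
    ("t0", ["input"]),    # 0: derived from the input
    ("vpc", ["vpc"]),     # 1: interpreter bookkeeping, independent of the input
    ("t1", ["t0", "k"]),  # 2: derived from t0
    ("t0", ["k"]),        # 3: t0 overwritten with a constant
    ("out", ["t1"]),      # 4: output derived from t1
    ("t2", ["t0"]),       # 5: reads the now clean t0
]
affected, tainted = taint_slice(trace, {"input"})
```

Only instructions 0, 2 and 4 are kept for later simplification; the interpreter bookkeeping at index 1 is dropped because it never touches input-derived data.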
+
+The result obtained from a single run is functionally equivalent to the original program, but only for that one execution; it does not cover all execution branches. The resulting control flow graph is therefore only part of that of the original program. The analyst needs to trace the program multiple times with different input values and combine the results to obtain a more complete control flow graph.
+
+\paragraph{In conclusion}
+The first type of attack, based on virtual execution, is closely tied to the principles and structure of code virtualization and yields the most realistic and comprehensive results. The second type has wider applicability, but it is hard to obtain comprehensive analysis results with it. This paper therefore mainly targets the first kind of attack, but also provides some measures against the second. In our threat model, we assume that the analyst is familiar with the mechanism of code virtualized obfuscation and follows the above steps when reverse engineering a VM-obfuscated program. The ultimate goal of the analyst is to fully reverse engineer the VM-obfuscated application and automate the reverse analysis process.