GraalVM NativeImage Reverse Reconstruction
Java code restoration and protection is an old topic of conversation. Due to the bytecode format in which Java class files are saved, they contain a lot of metadata, making it easy to restore them to their original code. In order to protect Java code, the industry has adopted many methods, such as obfuscation, bytecode encryption, JNI protection, and so on. However, regardless of the method used, there are still ways and means of cracking it.
Binary compilation has long been considered an effective method of code protection. Java's binary compilation is known as AOT (Ahead of Time) technology, which means pre-compilation.
However, due to the dynamic nature of the Java language, binary compilation has to deal with reflection, dynamic proxies, JNI loading, and similar features, which poses many difficulties. As a result, Java long lacked a mature, reliable, and widely applicable AOT compilation tool for production environments. (There used to be one called Excelsior JET, but it seems to be no longer maintained.)
In May 2019, Oracle released GraalVM 19.0, a multi-language supported virtual machine. 19.0 is its first version intended for production environments. GraalVM provides a NativeImage tool that can achieve Ahead-of-Time (AOT) compilation for Java programs. After several years of development, NativeImage has become very mature. SpringBoot 3.0 can now use it to compile and generate an executable file for the entire SpringBoot project. The compiled file has fast startup speed, low memory consumption, and excellent performance.
So, for Java programs that have already entered the era of binary compilation, is their code still as easily reversed and restored as in the bytecode era? What are the characteristics of binary files compiled by NativeImage, and is the strength of binary compilation sufficient to protect important code?
In order to explore the above-mentioned issues, I recently developed a NativeImage analysis tool, which has achieved certain reverse restoration effects.
Project address
https://github.com/vlinx-io/NativeImageAnalyzer
Generate NativeImage
First, we need to generate a NativeImage, which requires GraalVM. Visit https://www.graalvm.org/ and download the Java 17 version, then set up the environment variables. GraalVM also includes a JDK, so you can use it directly to run Java commands.
Add $GRAALVM_HOME/bin to the PATH environment variable, then run:
gu install native-image
Write a simple Java program
Write a simple Java program, for example
public class Hello {
    public static void main(String[] args){
        System.out.println("Hello World!");
    }
}
Compile and run the above Java program
javac Hello.java
java -cp . Hello
You can obtain the program output
Hello World!
Preparation of the compilation environment
If you are a Windows user, you need to install Visual Studio in advance. If you are a Linux or macOS user, you need to install gcc and clang tools in advance.
Before executing the native-image command, Windows users need to set up the environment variables for Visual Studio. This can be done by using the following command
"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvars64.bat"
If the installation path and version of Visual Studio are different, please adjust the relevant path information yourself.
Compile with native-image
Now use the native-image command to compile the above Java program into a binary file. The native-image command takes the same format as the java command and also accepts parameters such as -cp and -jar. However you would run the program with the java command, compile it the same way, only changing the command from java to native-image. The command is as follows.
native-image -cp . Hello
Compilation takes a while and may consume a significant amount of CPU and memory. When it finishes, you get a compiled binary file. The default output file name is the main class name in lowercase, in this case "hello"; on Windows, it is "hello.exe". Use the "file" command to check the type of this file, and you will see that it is indeed a native binary.
file hello
hello: Mach-O 64-bit executable x86_64
Run this file; its output is the same as that of the earlier java -cp . Hello run.
Hello World!
Analysis of NativeImage
Using IDA for analysis
Open the compiled "hello" with IDA, click on "Exports" to view the symbol table. You can see the symbol `svm_code_section`, and its address is the entry point of the Java Main function.
Go to this address and view the assembly code.
You can see that it already looks like a standard assembly function. Press F5 to decompile it.
It can be seen that some function calls have been made and some arguments have been passed, but the logic is not easy to determine.
We double-click on sub_1000C0020 to look inside the called function, but IDA reports that the analysis failed.
NativeImage Decompilation Logic
Because the compilation of NativeImage is based on the compilation of JVM, it can also be understood as wrapping the binary code with a layer of VM protection. Therefore, tools like IDA, without corresponding information and targeted processing measures, are unable to effectively reverse-engineer it.
However, regardless of the format, whether in bytecode or binary form, certain basic elements of JVM execution must exist, such as class information, field information, function invocation and parameter passing, etc. Based on this approach, the analysis tool I have developed can achieve a certain level of reverse engineering effect, and with further improvements, it is capable of achieving a sufficiently high level of reverse engineering.
Using NativeImageAnalyzer for analysis
Visit https://github.com/vlinx-io/NativeImageAnalyzer to download NativeImageAnalyzer.
Execute the following command for reverse analysis. Currently, only the Main function of the main class is analyzed.
native-image-analyzer hello
The output is as follows
java.io.PrintStream.writeln(java.io.PrintStream@0x554fe8, "Hello World!", rcx)
return
Let's take a look at the original code again.
public static void main(String[] args){
    System.out.println("Hello World!");
}
Let's now take a look at the definition of System.out.
public static final PrintStream out = null;
You can see that the variable "out" in the System class is a variable of type PrintStream, and it is a static variable. During the compilation of NativeImage, the instance of this class is compiled into an area called Heap, and the binary code directly obtains and calls the instance of this class from the Heap area. Let's take a look at the restored code.
java.io.PrintStream.writeln(java.io.PrintStream@0x554fe8, "Hello World!", rcx)
return
Here, java.io.PrintStream@0x554fe8 is the java.io.PrintStream instance read from the Heap area; its memory address is 0x554fe8.
Now let's look at the definition of the java.io.PrintStream.writeln function.
private void writeln(String s) {
    ......
}
We can see that the writeln function takes a single String parameter, so why are three arguments passed in the restored code? First, writeln is an instance method, which has a hidden this parameter pointing to the receiver; that is the first argument passed, java.io.PrintStream@0x554fe8. As for the third argument, rcx, the tool concluded while analyzing the assembly code that the function is called with three arguments. However, from the definition we know that the function actually takes only two. This is an area where the tool needs improvement.
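The hidden this argument can also be observed from plain Java. The following is a minimal sketch and not part of NativeImageAnalyzer: it uses the standard java.lang.invoke API to look up the public PrintStream.println(String) method (chosen here because writeln is private) and shows that the resulting handle takes two parameters, the receiver followed by the String. The class name HiddenThisDemo and its helper method are made up for this illustration.

```java
import java.io.PrintStream;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class HiddenThisDemo {
    // Looks up PrintStream.println(String) as a method handle.
    static MethodHandle printlnHandle() {
        try {
            return MethodHandles.lookup().findVirtual(
                    PrintStream.class, "println",
                    MethodType.methodType(void.class, String.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle println = printlnHandle();
        // Declared with one parameter, but the handle takes two:
        // the receiver (the hidden "this") comes first.
        System.out.println(println.type().parameterCount());
        System.out.println(println.type().parameterType(0).getName());
        // The receiver is passed explicitly, as in the restored call.
        println.invoke(System.out, "Hello World!");
    }
}
```

At the machine level the same thing happens: the receiver travels in the first argument register, which is why the restored call lists java.io.PrintStream@0x554fe8 before the string.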
A more complicated program
We will now analyze a more complex program, one that computes a Fibonacci sequence. The code is as follows
class Fibonacci {
    public static void main(String[] args) {
        int count = Integer.parseInt(args[0]);
        int n1 = 0, n2 = 1, n3;
        System.out.print(n1 + " " + n2);
        for (int i = 2; i < count; ++i){
            n3 = n1 + n2;
            System.out.print(" " + n3);
            n1 = n2;
            n2 = n3;
        }
        System.out.println();
    }
}
Compile and execute
javac Fibonacci.java
native-image -cp . Fibonacci
./fibonacci 10
0 1 1 2 3 5 8 13 21 34
The code obtained after restoration using NativeImageAnalyzer is as follows
rdi = rdi[0]
ret_0 = java.lang.Integer.parseInt(rdi, 10)
sp_0x44 = ret_0
ret_1 = java.lang.StringConcatHelper.mix(1, 1)
ret_2 = java.lang.StringConcatHelper.mix(ret_1, 0)
sp_0x20 = java.io.PrintStream@0x554fe8
sp_0x18 = Class{[B}_1
tlab_0 = Class{[B}_1
tlab_0.length = ret_2<<ret_2>>32
sp_0x10 = tlab_0
ret_28 = ?java.lang.StringConcatHelper.prepend(tlab_0, " ", ret_2)
ret_29 = java.lang.StringConcatHelper.prepend(ret_28, sp_0x10, 0)
ret_30 = ?java.lang.StringConcatHelper.newString(sp_0x10, ret_29)
java.io.PrintStream.write(sp_0x20, ret_30)
if(sp_0x44>=3)
{
ret_7 = java.lang.StringConcatHelper.mix(1, 1)
tlab_1 = sp_0x18
tlab_1.length = ret_7<<ret_7>>32
sp_0x10 = " "
sp_0x8 = tlab_1
ret_22 = ?java.lang.StringConcatHelper.prepend(tlab_1, " ", ret_7)
ret_23 = ?java.lang.StringConcatHelper.newString(sp_0x8, ret_22)
rsi = ret_23
java.io.PrintStream.write(sp_0x20, ret_23)
rdi = 1
rdx = 1
rcx = 3
while(true)
{
if(sp_0x44<=rcx)
{
break
}
else
{
sp_0x34 = rcx
rdi = rdi+rdx
r9 = rdi
sp_0x30 = rdx
sp_0x2c = r9
ret_11 = java.lang.StringConcatHelper.mix(1, r9)
tlab_2 = sp_0x18
tlab_2.length = ret_11<<ret_11>>32
sp_0x8 = tlab_2
ret_17 = ?java.lang.StringConcatHelper.prepend(tlab_2, sp_0x10, ret_11)
ret_18 = ?java.lang.StringConcatHelper.newString(sp_0x8, ret_17)
rsi = ret_18
java.io.PrintStream.write(sp_0x20, ret_18)
rcx = sp_0x34+1
rdi = sp_0x30
rdx = sp_0x2c
}
}
}
java.io.PrintStream.newLine(sp_0x20, rsi)
return
Compare the restored code with the original code
rdi = rdi[0]
ret_0 = java.lang.Integer.parseInt(rdi, 10)
sp_0x44 = ret_0
Corresponding to
int count = Integer.parseInt(args[0]);
rdi is the register used to pass the first argument of a function (on Windows it is rdx). rdi = rdi[0] corresponds to args[0]. Next, java.lang.Integer.parseInt is called to parse an int value, and the return value is assigned to the stack variable sp_0x44.
int n1 = 0, n2 = 1, n3;
System.out.print(n1 + " " + n2);
Corresponding to
ret_1 = java.lang.StringConcatHelper.mix(1, 1)
ret_2 = java.lang.StringConcatHelper.mix(ret_1, 0)
sp_0x20 = java.io.PrintStream@0x554fe8
sp_0x18 = Class{[B}_1
tlab_0 = Class{[B}_1
tlab_0.length = ret_2<<ret_2>>32
sp_0x10 = tlab_0
ret_28 = ?java.lang.StringConcatHelper.prepend(tlab_0, " ", ret_2)
ret_29 = java.lang.StringConcatHelper.prepend(ret_28, sp_0x10, 0)
ret_30 = ?java.lang.StringConcatHelper.newString(sp_0x10, ret_29)
java.io.PrintStream.write(sp_0x20, ret_30)
The very simple string concatenation in the Java code is actually turned into calls to three functions: StringConcatHelper.mix, StringConcatHelper.prepend, and StringConcatHelper.newString. StringConcatHelper.mix calculates the length of the concatenated string, StringConcatHelper.prepend copies the byte[] arrays that carry the actual string content into place, and StringConcatHelper.newString creates a new String object from the resulting byte[] array.
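To make the three steps concrete, here is a simplified sketch of the same idea in plain Java. It is not the JDK implementation: the methods mix, prepend, and newString below are hypothetical stand-ins with simplified signatures, and the sketch assumes that every piece is Latin-1 text. It measures the total length first, then fills one byte[] from the end toward the start, then builds the String.

```java
import java.nio.charset.StandardCharsets;

class ConcatSketch {
    // Step 1 ("mix"): add the length of one more piece to the running total.
    static int mix(int lengthSoFar, String piece) {
        return lengthSoFar + piece.length();
    }

    // Step 2 ("prepend"): copy a piece so that it ends at index `end`.
    // Returns the index where the piece starts, i.e. the new `end`.
    static int prepend(int end, byte[] buf, String piece) {
        byte[] bytes = piece.getBytes(StandardCharsets.ISO_8859_1);
        int start = end - bytes.length;
        System.arraycopy(bytes, 0, buf, start, bytes.length);
        return start;
    }

    // Step 3 ("newString"): wrap the filled buffer in a String.
    static String newString(byte[] buf) {
        return new String(buf, StandardCharsets.ISO_8859_1);
    }

    // Equivalent of: n1 + sep + n2
    static String concat(int n1, String sep, int n2) {
        String a = Integer.toString(n1);
        String b = Integer.toString(n2);
        int length = mix(mix(mix(0, a), sep), b);
        byte[] buf = new byte[length];
        int index = length;
        index = prepend(index, buf, b);   // last piece goes in first
        index = prepend(index, buf, sep);
        index = prepend(index, buf, a);
        return newString(buf);
    }

    public static void main(String[] args) {
        System.out.println(concat(0, " ", 1)); // prints "0 1"
    }
}
```

Filling the buffer back to front explains why the restored code shows the prepend calls after the array has been allocated with its final length.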
We saw two kinds of variable names in the above code, sp_0x18 and tlab_0. Variables beginning with sp_ are allocated on the stack. Variables beginning with tlab_ are allocated in the thread-local allocation buffer (TLAB).
Here tlab_0 is assigned Class{[B}_1. Class{[B}_1 means that this is an instance of the byte[] type. [B is the Java descriptor for the byte[] type, and _1 indicates that it is the first variable of this type. If more variables of the same type are defined later, the index increases accordingly, such as Class{[B}_2, Class{[B}_3, and so on. Other types are expressed in the same way, such as Class{java.lang.String}_1, Class{java.util.HashMap}_2, etc.
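The [B notation is the name the JVM itself uses for byte[], and it can be checked with standard reflection. The short sketch below prints the runtime names of a few types; the class name ArrayDescriptorDemo and the helper runtimeName are made up for this example.

```java
class ArrayDescriptorDemo {
    // Returns the JVM's runtime name for a type.
    static String runtimeName(Class<?> type) {
        return type.getName();
    }

    public static void main(String[] args) {
        System.out.println(runtimeName(byte[].class));  // [B
        System.out.println(runtimeName(int[].class));   // [I
        System.out.println(runtimeName(String.class));  // java.lang.String
    }
}
```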
The logic of the above code is simply to create a byte[] array instance and assign it to tlab_0, with an array length of ret_2 << ret_2 >> 32. The length takes this form because, when the length of the String is computed, the array length has to be converted according to the encoding; you can find the relevant code in java.lang.String.java. Next, the prepend function writes 0, 1, and the space into tlab_0, then a new String object, ret_30, is created from tlab_0 and passed to java.io.PrintStream.write for printing. In fact, the arguments of the restored prepend calls are not quite right, and their positions are also incorrect; this needs further improvement.
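One plausible reading of the ret_2 << ret_2 >> 32 expression is sketched below. It assumes that the value returned by mix is a 64-bit number whose low 32 bits hold the character count and whose upper bits hold the string coder (0 for Latin-1, 1 for UTF-16), so that the byte count is the character count shifted left by the coder. This layout is an assumption based on how JDK 17 appears to implement string concatenation, and the class and method names are hypothetical.

```java
class LengthCoderDemo {
    // Packs a character count and a coder (0 = Latin-1, 1 = UTF-16)
    // into a single long. Assumed layout, see the text above.
    static long pack(int length, int coder) {
        return ((long) coder << 32) | (length & 0xFFFFFFFFL);
    }

    // Number of bytes needed for the backing array:
    // one byte per char for Latin-1, two bytes per char for UTF-16.
    static int byteArraySize(long lengthCoder) {
        int length = (int) lengthCoder;          // low 32 bits
        int coder = (int) (lengthCoder >> 32);   // upper bits
        return length << coder;
    }

    public static void main(String[] args) {
        // "0 1" is three Latin-1 characters.
        System.out.println(byteArraySize(pack(3, 0))); // 3
        // The same three characters stored as UTF-16.
        System.out.println(byteArraySize(pack(3, 1))); // 6
    }
}
```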
These two lines of Java code turn into fairly complex execution logic. In the future, the restored code can be simplified through further analysis and statement merging.
Moving on:
for (int i = 2; i < count; ++i){
    n3 = n1 + n2;
    System.out.print(" " + n3);
    n1 = n2;
    n2 = n3;
}
System.out.println();
Corresponding to
if(sp_0x44>=3)
{
ret_7 = java.lang.StringConcatHelper.mix(1, 1)
tlab_1 = sp_0x18
tlab_1.length = ret_7<<ret_7>>32
sp_0x10 = " "
sp_0x8 = tlab_1
ret_22 = ?java.lang.StringConcatHelper.prepend(tlab_1, " ", ret_7)
ret_23 = ?java.lang.StringConcatHelper.newString(sp_0x8, ret_22)
rsi = ret_23
java.io.PrintStream.write(sp_0x20, ret_23)
rdi = 1
rdx = 1
rcx = 3
while(true)
{
if(sp_0x44<=rcx)
{
break
}
else
{
sp_0x34 = rcx
rdi = rdi+rdx
r9 = rdi
sp_0x30 = rdx
sp_0x2c = r9
ret_11 = java.lang.StringConcatHelper.mix(1, r9)
tlab_2 = sp_0x18
tlab_2.length = ret_11<<ret_11>>32
sp_0x8 = tlab_2
ret_17 = ?java.lang.StringConcatHelper.prepend(tlab_2, sp_0x10, ret_11)
ret_18 = ?java.lang.StringConcatHelper.newString(sp_0x8, ret_17)
rsi = ret_18
java.io.PrintStream.write(sp_0x20, ret_18)
rcx = sp_0x34+1
rdi = sp_0x30
rdx = sp_0x2c
}
}
}
java.io.PrintStream.newLine(sp_0x20, rsi)
return
sp_0x44 is the argument we passed to the program, that is, count. The for loop in the Java code only executes if count is greater than or equal to 3. Here the for loop has been restored as a while loop, which means essentially the same thing. Outside the while loop, the code handles the count = 3 case, that is, the first iteration of the loop (i = 2). If count is less than or equal to 3, the program finishes without entering the while loop. This may be an optimization performed by GraalVM during compilation.
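The shape of the restored control flow can be written back in Java to check that it matches the original loop. The sketch below is a hand translation, not tool output: original mirrors the source program, while restoredShape follows the restored code with the first iteration pulled out in front of a while(true) loop. The class and method names are made up, and both methods return the text instead of printing it so that the results can be compared.

```java
class LoopPeelingDemo {
    // The loop as written in the original source.
    static String original(int count) {
        StringBuilder out = new StringBuilder();
        int n1 = 0, n2 = 1, n3;
        out.append(n1 + " " + n2);
        for (int i = 2; i < count; ++i) {
            n3 = n1 + n2;
            out.append(" " + n3);
            n1 = n2;
            n2 = n3;
        }
        return out.append('\n').toString();
    }

    // The shape of the restored code: first iteration handled
    // before the loop, remaining iterations in while(true).
    static String restoredShape(int count) {
        StringBuilder out = new StringBuilder();
        out.append("0 1");
        if (count >= 3) {
            out.append(" 1");            // iteration for i = 2
            int n1 = 1, n2 = 1, i = 3;   // rdi, rdx, rcx in the restored code
            while (true) {
                if (count <= i) {
                    break;
                }
                int n3 = n1 + n2;
                out.append(" " + n3);
                i = i + 1;
                n1 = n2;
                n2 = n3;
            }
        }
        return out.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(original(10));
        System.out.print(restoredShape(10));
    }
}
```

Both calls in main print 0 1 1 2 3 5 8 13 21 34, matching the output of the compiled program.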
Let's take a look at the loop's exit condition again.
if(sp_0x44<=rcx)
{
break
}
This corresponds to
i < count
Meanwhile, rcx is incremented on each iteration.
sp_0x34 = rcx
rcx = sp_0x34+1
which corresponds to
++i
Next, let's take a look at how the logic of adding numbers in the loop body is reflected in the restored code. The original code is
for(......){
    ......
    n3 = n1 + n2;
    n1 = n2;
    n2 = n3;
    ......
}
The restored code is
while(true){
......
rdi = rdi+rdx -> n3 = n1 + n2
r9 = rdi -> r9 = n3
sp_0x30 = rdx -> sp_0x30 = n2
sp_0x2c = r9 -> sp_0x2c = n3
rdi = sp_0x30 -> n1 = sp_0x30 = n2
rdx = sp_0x2c -> n2 = sp_0x2c = n3
......
}
Other code in the loop body performs string concatenation and output operations as before, and the restored code basically reflects the execution logic of the original code.
Further improvements are needed
Currently, this tool can partially restore program control flow, perform some data flow analysis, and recover function names. To become a comprehensive and usable tool, it still needs the following:
- More accurate function name, function parameters, and function return value restoration
- Accurate object information and field restoration
- More accurate expression and object type inference
- Statement integration and simplification
Thoughts on Binary Protection
The purpose of this project is to explore the feasibility of reverse engineering NativeImage. Based on the current results, reverse engineering of NativeImage is possible, which poses a greater challenge to code protection. Many developers believe that compiling software into binary code ensures security, and overlook the need to protect the binary code itself. For software written in C/C++, tools such as IDA already have excellent reverse engineering capabilities, sometimes exposing even more information than Java programs do. I have even seen software distributed in binary form without removing symbol information such as function names. This is equivalent to being completely exposed.
Any code is composed of logic, and as long as it contains logic, that logic can be recovered through reverse engineering. The only difference lies in how difficult the recovery is. The job of code protection is to maximize that difficulty.