Saturday, November 16, 2013

Reverse Engineering InternalCall Methods in .NET

Oftentimes, when attempting to reverse engineer a particular .NET method, I will hit a wall because I’ll dig far enough into the method’s implementation that I’ll reach a private method marked [MethodImpl(MethodImplOptions.InternalCall)]. For example, I was interested in seeing how the .NET framework loads PE files in memory via a byte array using the System.Reflection.Assembly.Load(Byte[]) method. When viewed in ILSpy (my favorite .NET decompiler), it will show the following implementation:
So the first thing it does is check to see if you’re allowed to load a PE image in the first place via the CheckLoadByteArraySupported method. Basically, if the executing assembly is a tile app, then you will not be allowed to load a PE file as a byte array. It then calls the RuntimeAssembly.nLoadImage method. If you click on this method in ILSpy, you will be disappointed to find that there does not appear to be a managed implementation.
As you can see, all you get is a method signature and an InternalCall attribute. To begin to understand how we might be able to reverse engineer this method, we need to know the definition of InternalCall. According to MSDN documentation, InternalCall refers to a method call that “is internal, that is, it calls a method that is implemented within the common language runtime.” So it would seem likely that this method is implemented as a native function in clr.dll. To validate my assumption, let’s use Windbg with sos.dll – the managed code debugger extension. My goal using Windbg will be to determine the native pointer for the nLoadImage method and see if it jumps to its respective native function in clr.dll. I will attach Windbg to PowerShell since PowerShell will make it easy to get the information needed by the SOS debugger extension. The first thing I need to do is get the metadata token for the nLoadImage method. This will be used in Windbg to resolve the method.
As you can see, the Get-ILDisassembly function in PowerSploit conveniently provides the metadata token for the nLoadImage method. Now on to Windbg for further analysis…
The following commands were executed:
1) .loadby sos clr
Load the SOS debugging extension from the directory that clr.dll is loaded from
2) !Token2EE mscorlib.dll 0x0600278C
Retrieves the MethodDesc of the nLoadImage method. The first argument (mscorlib.dll) is the module that implements the nLoadImage method and the hex number is the metadata token retrieved from PowerShell.
3) !DumpMD 0x634381b0
I then dump information about the MethodDesc. This will give the address of the method table for the object that implements nLoadImage
4) !DumpMT -MD 0x636e42fc
This will dump all of the methods for the System.Reflection.RuntimeAssembly class with their respective native entry point. nLoadImage has the following entry:
635910a0 634381b0   NONE System.Reflection.RuntimeAssembly.nLoadImage(Byte[], Byte[], System.Security.Policy.Evidence, System.Threading.StackCrawlMark ByRef, Boolean, System.Security.SecurityContextSource)
So the native address for nLoadImage is 0x635910a0. Now, set a breakpoint on that address, let the program continue execution and use PowerShell to call the Load method on a bogus PE byte array.
PS C:\> [Reflection.Assembly]::Load(([Byte[]]@(1,2,3)))
You’ll then hit your breakpoint in Windbg and if you disassemble from where you landed, the function that implements the nLoadImage method will be crystal clear – clr!AssemblyNative::LoadImage
You can now use IDA for further analysis and begin digging into the actual implementation of this InternalCall method!
After digging into some of the InternalCall methods in IDA you’ll quickly see that most functions use the fastcall convention. In x86, this means that a static function will pass its first two arguments via ECX and EDX. If it’s an instance function, the ‘this’ pointer will be passed via ECX (as is standard in thiscall) and its first argument via EDX. Any remaining arguments are pushed onto the stack.
So for the handful of people that have wondered where the implementation for an InternalCall method lies, I hope this post has been helpful.

Tuesday, October 1, 2013

Simple CIL Opcode Execution in PowerShell using the DynamicMethod Class and Delegates

tl;dr version

It is possible to assemble .NET methods with CIL opcodes (i.e. .NET bytecode) in PowerShell in only a few lines of code using dynamic methods and delegates.

I’ll admit, I have a love/hate relationship with PowerShell. I love that it is the most powerful scripting language and shell but at the same time, I often find quirks in the language that consistently bother me. One such quirk is the fact that integers don’t wrap when they overflow. Rather, they are promoted: they are cast into the next largest type that can accommodate them. To demonstrate what I mean, observe the following:

You’ll notice that [Int16]::MaxValue (i.e. 0x7FFF) understandably remains an Int16. However, rather than wrapping when adding one, it is upcast to an Int32. Admittedly, this is probably the behavior that most PowerShell users would desire. I, on the other hand, wish I had the option to perform math on integers that wrapped. To solve this, I originally thought that I would have to write an addition function using complicated binary logic. I opted not to go that route and decided to assemble a function using raw CIL (common intermediate language) opcodes. What follows is a brief explanation of how to accomplish this task.

Common Intermediate Language Basics

CIL is the bytecode that describes .NET methods. A description of all the opcodes implemented by Microsoft can be found here. Every time you call a method in .NET, the runtime either interprets its opcodes or it executes the assembly language equivalent of those opcodes (as a result of the JIT process - just-in-time compilation). The calling convention for CIL is loosely related to how calls are made in X86 assembly – arguments are pushed onto a stack, a method is called, and a return value is returned to the caller.

Since we’re on the subject of addition, here are the CIL opcodes that would add two numbers of similar type together and would wrap in the case of an overflow:

IL_0000: Ldarg_0 // Loads the argument at index 0 onto the evaluation stack.
IL_0001: Ldarg_1 // Loads the argument at index 1 onto the evaluation stack.
IL_0002: Add // Adds two values and pushes the result onto the evaluation stack.
IL_0003: Ret // Returns from the current method, pushing a return value (if present) from the callee's evaluation stack onto the caller's evaluation stack.

Per Microsoft documentation, “integer addition wraps, rather than saturates” when using the Add instruction. This is the behavior I was after in the first place. Now let’s learn how to build a method in PowerShell that uses these opcodes.
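
For comparison outside of .NET, the same wrap-around semantics can be sketched in C, where unsigned arithmetic is defined to wrap modulo 2^32. This is only an illustration of the behavior the Add opcode provides; the helper name uint32_add is made up for this example and is not part of the PowerShell technique described in this post.

```c
#include <stdint.h>

/* Hypothetical helper: adds two 32-bit unsigned integers.
   Unsigned arithmetic in C wraps modulo 2^32 instead of being
   promoted to a wider type, which mirrors the CIL add opcode. */
static uint32_t uint32_add(uint32_t a, uint32_t b)
{
    return (uint32_t)(a + b);
}
```

Under these assumptions, adding 2 to the largest UInt32 value wraps around to 1 rather than growing into a larger type.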

Dynamic Methods

In the System.Reflection.Emit namespace, there is a DynamicMethod class that allows you to create methods without having to first go through the steps of creating an assembly and module. This is nice when you want a quick and dirty way to assemble and execute CIL opcodes. When creating a DynamicMethod object, you will need to provide the following arguments to its constructor:

1) The name of the method you want to create
2) The return type of the method
3) An array of types that will serve as the parameters

The following PowerShell command will satisfy those requirements for an addition function:

$MethodInfo = New-Object Reflection.Emit.DynamicMethod('UInt32Add', [UInt32], @([UInt32], [UInt32]))

Here, I am creating an empty method that will take two UInt32 variables as arguments and return a UInt32.

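Next, I will actually implement the logic of the method by emitting the CIL opcodes into the method:

$ILGen = $MethodInfo.GetILGenerator()
$ILGen.Emit([Reflection.Emit.OpCodes]::Ldarg_0)
$ILGen.Emit([Reflection.Emit.OpCodes]::Ldarg_1)
$ILGen.Emit([Reflection.Emit.OpCodes]::Add)
$ILGen.Emit([Reflection.Emit.OpCodes]::Ret)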

Now that the logic of the method is complete, I need to create a delegate from the $MethodInfo object. Before this can happen, I need to create a delegate in PowerShell that matches the method signature for the UInt32Add method. This can be accomplished by creating a generic Func delegate with the following convoluted syntax:

$Delegate = [Func``3[UInt32, UInt32, UInt32]]

The previous command states that I want to create a delegate for a function that accepts two UInt32 arguments and returns a UInt32. Note that the Func delegate wasn't introduced until .NET 3.5 which means that this technique will only work in PowerShell 3+. With that, we can now bind the method to the delegate:

$UInt32Add = $MethodInfo.CreateDelegate($Delegate)

And now, all we have to do is call the Invoke method to perform normal integer math that wraps upon an overflow:

$UInt32Add.Invoke([UInt32]::MaxValue, 2)

Here is the code in its entirety:

For additional information regarding the techniques I described, I encourage you to read the following articles:

Introduction to IL Assembly Language
Reflection Emit Dynamic Method Scenarios
How to: Define and Execute Dynamic Methods

Friday, August 16, 2013

Writing Optimized Windows Shellcode in C

Download: PIC_Bindshell


I’ll be the first to admit: writing shellcode sucks. While you have the advantage of employing some cool tricks to minimize the size of your payload, writing shellcode is still error prone and difficult to maintain. For example, I find it quite challenging having to track register allocations (especially in x86) and ensure proper stack alignment (especially in x86_64). Eventually, I got fed up, stepped back, and asked myself, “Why can’t I just write my shellcode payloads in C and let the compiler and linker take care of the rest?” That way, you only have to write your payload once and you can target it to any architecture – x86, x86_64, and ARM. Also, you would have the following added benefits:
  1. You can subject your payload to static analysis tools.
  2. You can unit test your code.
  3. You can employ heavy compiler and linker optimizations to your payload.
  4. The compiler is much better at optimizing assembly for size and/or speed than you are.
  5. You can write your payload in Visual Studio. Intellisense, FTW!
Now, you could say I’m a bit of a Microsoft fan boy. That said, considering the majority of the shellcode I’ve written has been for Windows, I decided to take on the challenge of using only Microsoft tools to emit position independent shellcode. The fundamental challenge, however, is that the Microsoft C compiler (cl.exe) does not emit position independent code (with the exception of Itanium). Ultimately, to achieve this goal, we’re going to have to rely upon some C coding tricks and some carefully crafted compiler and linker switches.

Shellcode – Back to the Basics

When writing shellcode, whether you do it in C or assembly, the following rules apply:

1) It must be position independent.

In most cases, you cannot know a priori the address at which your shellcode is going to land. Therefore, all branching instructions and instructions that dereference memory must be executed relative to the base address of where you were loaded. The gcc compiler has the option of emitting position independent code (PIC) but unfortunately, Microsoft’s compiler does not.

2) Your payload is on the hook for resolving external references.

If you want your payload to do anything useful, at some point, you’re going to have to call Win32 API functions. In your typical executable, external symbolic references are satisfied in one of two ways: either they are resolved by the loader at startup by walking the import directory of the executable or they are resolved dynamically at runtime using GetProcAddress. Shellcode neither has the luxury of being loaded by a loader nor can it just call GetProcAddress since it has no idea what the address of kernel32!GetProcAddress is in the first place – a classic chicken and the egg problem.

In order to resolve the addresses of library functions, shellcode must resolve function names on its own. This is typically accomplished in shellcode with a function that takes a 32-bit module and function hash, gets the PEB (Process Environment Block) address, walks a linked list of the loaded modules, scans the export directory of each module, hashes each function name, compares it against the hash provided, and if there is a match, the function address is calculated by adding its RVA to the base address of the loaded module. I’m obviously glossing over the details of the process in the interest of space but fortunately, this process is widely used (e.g. in Metasploit) and well documented.

3) Your payload must save stack and register state upon entry and restore state upon exiting the shellcode.

We will get this for free by writing the payload in C by virtue of having function prologs and epilogs emitted by the compiler for each function.
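
The hash-based name resolution described under rule 2 can be sketched in portable C against a mock export table. This is a simplified illustration, not the code from the download: the mock_export type and the hash_name and resolve_by_hash functions are hypothetical names, and the hash is a plain rotate-right-by-13 additive hash over the function name only, in the general style of the Metasploit hash. A real implementation would instead walk the PEB module list and each module's PE export directory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of a module's export directory. */
typedef struct {
    const char *name; /* exported function name */
    uint32_t    rva;  /* address relative to the module base */
} mock_export;

/* Simplified rotate-right-by-13 additive hash over an ASCII name. */
static uint32_t hash_name(const char *name)
{
    uint32_t h = 0;
    while (*name) {
        h = (h >> 13) | (h << 19);   /* rotate right by 13 bits */
        h += (uint8_t)*name++;
    }
    return h;
}

/* Scan the mock export table, hashing each name. On a match, the
   function address is the module base plus the export's RVA.
   Returns 0 when no export matches the requested hash. */
static uintptr_t resolve_by_hash(uintptr_t base, const mock_export *exports,
                                 size_t count, uint32_t wanted)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (hash_name(exports[i].name) == wanted)
            return base + exports[i].rva;
    }
    return 0;
}
```

Because only a 32-bit hash is compared, the payload never has to carry the function names themselves, which keeps it small and free of string references.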

GetProcAddressWithHash Function in C

In the download provided, the GetProcAddressWithHash function resolves Win32 API exported function addresses. I adapted the logic of the function from the Metasploit block_api assembly function:

Going from top to bottom, you may notice a few things:

• I defined ROTR32 as a macro.

The Metasploit payload uses a rotate-right hashing function. Unfortunately, there is no rotate right operator in C. There are several rotate right compiler intrinsics but they are not consistent across processor architectures. The ROTR32 macro implements the logic of a rotate right operation using the equivalent logical operators available to us in C. What’s cool is that the compiler will recognize that this macro performs a rotate right operation and it will actually compile down to a single rotate right assembly instruction. That’s pretty badass, in my opinion.

• I redefine two structure definitions.

Both of those structures are defined in winternl.h but Microsoft’s public definition is incomplete so I simply redefined the structures with the fields I needed.

• There is a different method of getting the PEB address depending upon the processor architecture you’re targeting.

The PEB address is the first step in resolving exported function addresses. The PEB is a structure that contains several pointers to the loaded modules of a process. In x86 and x86_64, the PEB address is obtained by dereferencing an offset into the fs and gs segment registers, respectively. On ARM, the PEB address is obtained by reading a specific register from the system control processor (CP15). Fortunately, there is a respective compiler intrinsic for each processor architecture. For whatever reason though, the compiler was not emitting correct ARM assembly instructions so I had to tweak the instructions in a very counterintuitive manner.
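
As an illustration of the first point above, a rotate-right macro built from shifts and a bitwise OR might look like the following. This is a sketch of the idea only; the macro in the download may be written differently, and this version assumes a shift count between 1 and 31.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by `shift` bits (valid for 1..31).
   The bits shifted out on the right are OR'ed back in on the left. */
#define ROTR32(value, shift) \
    ((((uint32_t)(value)) >> (shift)) | (((uint32_t)(value)) << (32 - (shift))))
```

Optimizing compilers commonly recognize this shift-and-OR pattern and emit a single rotate instruction, which can be confirmed in the assembly listing.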

Implementing Your Primary Payload in C

I’m going to be using a simple bind shell payload as an example for this post. Here is my implementation in C:

There are a few things I needed to be mindful of while writing the payload in order to satisfy the requirements imposed by position independent shellcode:

• I defined HTONS as a macro.

It was easier to define this as a macro versus incurring the overhead of calling ws2_32.dll!htons. Besides, HTONS is ideally suited for a macro since all it does is convert a USHORT from host to network byte order.

• I had to manually define the function signatures for each Win32 API function.

This was necessary since each call to GetProcAddressWithHash needs to be cast to a function pointer. Also, with Intellisense, calling the function has the look and feel of calling a normal Win32 function in Visual Studio. This part is admittedly a pain in the ass. It certainly beats the guess and check method though when writing assembly by hand!

• "ExecutePayload" is the function that implements the primary logic of the bind shell.

Normally, you would call the function "main". One of the problems I ran into though is that when the linker encounters a function named “main,” it expects to be linked against the C runtime library. Obviously, shellcode shouldn’t and doesn’t require the CRT so renaming the entry point to something besides “main” and explicitly telling the linker your entry point function obviates the need to link against the CRT.

• “cmd” and “ws2_32.dll” are explicitly defined as null-terminated character arrays.

This technique was first described by Nick Harbour as a way to force the compiler to allocate strings on the stack. By default, strings are stored in the .rdata section of a binary and relocations are defined in the executable for any references to those strings. Storing strings on the stack allows for references to be made in a position independent manner.

• SecureZeroMemory is used to initialize stack variables

SecureZeroMemory is basically a memset that cannot be compiled out. It is also an inline function meaning I am spared the overhead of having to resolve the address of memset.

• The rest of the payload is your typical, run-of-the-mill C… only slightly malicious.
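
Three of the points above (the HTONS macro, character-array strings, and zeroing that cannot be compiled out) can be sketched in a few lines of C. These are illustrations under stated assumptions, not the code from the download: copy_stack_string and secure_zero are hypothetical names, secure_zero is only a stand-in for the idea behind SecureZeroMemory, and the HTONS macro assumes a little-endian target.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. On a little-endian target
   this converts a port number from host to network byte order. */
#define HTONS(x) ((uint16_t)((((uint16_t)(x) & 0x00FFu) << 8) | \
                             (((uint16_t)(x) & 0xFF00u) >> 8)))

/* Builds "cmd" from individual character initializers with an explicit
   terminator, then copies it into `out`. The intent is for the compiler
   to build the array in the stack frame instead of referencing a string
   literal; check the assembly listing to confirm.
   Returns the number of bytes copied. */
static size_t copy_stack_string(char *out, size_t out_len)
{
    char cmd[] = { 'c', 'm', 'd', 0 };
    size_t i;
    for (i = 0; i < sizeof(cmd) && i < out_len; i++)
        out[i] = cmd[i];
    return i;
}

/* Zero `len` bytes through a volatile pointer so the stores cannot be
   discarded as dead by the optimizer. */
static void *secure_zero(void *ptr, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)ptr;
    while (len--)
        *p++ = 0;
    return ptr;
}
```

None of these helpers call into the C runtime or a DLL, so they add no external references that would have to be resolved at run time.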

Ensuring Proper Stack Alignment in 64-bit Shellcode

32-bit architectures (i.e. x86 and ARMv7) require that function calls be made with 4-byte stack alignment. It is pretty much guaranteed that your shellcode will land with 4-byte alignment. 64-bit shellcode however, needs to have 16-byte stack alignment. This is due to a requirement imposed by utilizing 128-bit XMM registers. Those who have written 64-bit shellcode have most likely experienced crashes at an instruction using an XMM register upon calling a Win32 function. This is due to stack misalignment.

Executable files, when loaded, are afforded the luxury of having guaranteed alignment during CRT initialization. Shellcode is afforded no such luxury, however. So, in order to ensure that my shellcode hits its entry point with proper stack alignment on 64-bit, I had to write a short assembly stub that guaranteed alignment. Then, as a pre-build event in Visual Studio, I assemble the stub with ml64 (MASM – the Microsoft Assembler) and specify the resulting object file as a dependency for the linker.

Here is the code that performs the alignment:

Basically, what’s happening here is I am preserving the original stack value, and’ing RSP (the stack pointer) to achieve 16-byte alignment, allocating homing space, and then calling the original entry point – ExecutePayload (i.e. the bind shell code).
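
The arithmetic behind the AND step can be sketched in C on a plain integer standing in for RSP. This only models the masking; it is not the assembly stub itself, and the align_down_16 name is made up for the illustration.

```c
#include <stdint.h>

#define STACK_ALIGNMENT ((uintptr_t)16)

/* Round a stack pointer value down to a 16-byte boundary by clearing
   its low four bits, the same effect as AND'ing RSP with a mask. */
static uintptr_t align_down_16(uintptr_t sp)
{
    return sp & ~(STACK_ALIGNMENT - 1);
}
```

Rounding down rather than up matters because the stack grows toward lower addresses, so the adjusted pointer never moves back into memory that is already in use.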

I also have a small helper function in C that simply calls AlignRSP:

This little helper function will then serve as the new entry point that will be specified to the linker. I will explain shortly why this wrapper function is necessary.

Compiling the Shellcode

I use the following compiler (cl.exe) command line switches in my Visual Studio 2012 project:

/GS- /TC /GL /W4 /O1 /nologo /Zl /FA /Os

Each switch warrants an explanation as it is relevant to the shellcode that will be generated.

/GS-: Disables stack buffer overrun checks. If enabled, external stack cookie setter and getter functions would be called which would no longer make the shellcode position independent.

/TC: Tells the compiler to treat all files as C source files. One of the quirks of this command-line switch is that all local variables must be defined at the beginning of a function. If they are not, unintuitive errors will occur when attempting to compile.

/GL: Whole program optimization. This option tells the linker (via the /LTCG option) to optimize across function calls. I chose this option because I just really like the idea of fully-optimized shellcode. :D

/W4: Enables the highest warning level. This is just good practice.

/O1: Tells the compiler to favor small code over fast code – an ideal attribute of shellcode.

/FA: Outputs an assembly listing. This is optional. I just prefer to validate the assembly code emitted by the compiler.

/Zl: Omit the default C runtime library name from the resulting object file. This serves to tell the linker that you don’t intend to link against the C runtime.

/Os: Another way to tell the compiler to favor small code.

Linking the Shellcode

The following linker (link.exe) switches are used for x86/ARM and x86_64, respectively:


/LTCG "x64\Release\\AdjustStack.obj" /ENTRY:"Begin" /OPT:REF /SAFESEH:NO /SUBSYSTEM:CONSOLE /MAP /ORDER:@"function_link_order64.txt" /OPT:ICF /NOLOGO /NODEFAULTLIB

Each switch warrants an explanation as it is relevant to the shellcode that will be generated.

/LTCG: Enables global optimizations by the linker. The compiler has little to no control over optimizations across function calls since it compiles on a function-by-function basis. Therefore, the linker is ideally suited to perform optimizations across function calls since it receives all of the object files emitted by the compiler.

/ENTRY: Specifies the entry point of the binary. This is “ExecutePayload” (the bind shell logic) in x86 and ARM. However, in x86_64, it is “Begin”, which calls the stack alignment stub, “AlignRSP”. The “Begin” function in 64BitHelper.h is necessary because, since we’re eventually emitting shellcode, we have to explicitly set the link order (via the /ORDER switch). The Microsoft linker doesn’t allow you to specify link order for extern functions (i.e. AlignRSP). To get around this, I simply wrapped AlignRSP in a function. “Begin” is then specified as the first function to be linked. That way, it will be the first code to be called in the shellcode.

/OPT:REF: Eliminates functions and/or data that are never referenced. We want our shellcode to be as small as possible. This linker optimization will reduce shellcode size by eliminating dead code/data.

/SAFESEH:NO: Do not emit SafeSEH handlers. Shellcode has no need for registered exception handling.

/SUBSYSTEM:CONSOLE: As far as shellcode goes, the subsystem is irrelevant. Specifying “CONSOLE” though will allow you to test the compiled exe from the command line.

/MAP: Generate a map file. This file is used to pull out the size of the shellcode.

/ORDER: Because we are generating shellcode, the order in which functions are linked is extremely important. Originally, it was my assumption that the entry point function would be the first function to be linked. This, however, did not turn out to be the case. The /ORDER switch takes a text file containing the functions in the order in which they should be linked. You’ll notice that the function at the top of each list is the entry point function.

/OPT:ICF: Removes redundant functions. This is optional.

/NODEFAULTLIB: Explicitly tells the linker not to attempt to use default libraries when resolving external references. This switch is handy if you accidentally have an external reference in your code. The linker will throw an error which will bring to your attention the fact that your payload cannot have any external references!

Extracting the Shellcode

After the code is compiled and linked, the final step is to pull the shellcode out of the resulting exe. This requires a tool that can parse a PE file and pull the bytes out of the .text section. Fortunately, Get-PEHeader already does this. The only caveat though is that if you were to pull out the entire .text section, you would be left with a bunch of null padding. That’s why I wrote another script that parses the map file which contains the actual length of the code in the .text section.

For those who enjoy analyzing PE files, it is worth investigating the exe files generated. It will only contain a single section - .text and it will not have any entries in the data directories in the optional header. This is exactly what I sought after – a binary without any relocations, extraneous sections, or imports.

Build Steps: PIC_Bindshell includes a Visual Studio 2012 project. I tested it on both VS2012 Express and Ultimate Edition. Just load the solution file (*.sln) in Visual Studio, select the architecture you want to target, and then build. What is output is an exe and a shellcode (*.bin) payload.

The Express Edition of Visual Studio 2012 does not support compiling for ARM. Also, if this is your first time compiling for ARM, Visual Studio will throw the following error upon attempting to compile:

C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V110\Platforms\ARM\PlatformToolsets\v110\Microsoft.Cpp.ARM.v110.targets(36,5): error MSB8022: Compiling Desktop applications for the ARM platform is not supported.

You also need to remove the following line from “C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\crtdefs.h” (338):

#error Compiling Desktop applications for the ARM platform is not supported.

Simply remove those lines, restart Visual Studio, and then you’ll be good to go.


With a clearer understanding of how the Microsoft compiler and linker work in concert, it is possible to write fully optimized Windows shellcode in C that can be targeted to any supported processor architecture. That doesn’t mean that you shouldn’t have a clear understanding of the assembly language you’re targeting, though. It just means you don’t have to waste cycles writing large quantities of assembly language by hand. Also, I’ll trust the compiler over my feeble brain any day.

BTW, my 64-bit shellcode uses XMM registers. Does yours? :P

Sunday, July 28, 2013

Windows RT ARMv7-based Shellcode Development

Recently, I've taken an interest in gaining code execution on my Surface RT tablet. I have found Windows RT to be rather enticing since Microsoft has made a concerted effort to prevent the execution of unsigned code. A couple weeks ago I discovered a way to gain arbitrary shellcode execution via PowerShell. I will have a separate blog post on that topic once I get a thumbs up from Microsoft. And for the record, I am aware of the awesome public Windows RT jailbreak.

Anyway, seeing as I'm already fairly comfortable writing x86 and x86_64 shellcode, I wanted to take on the challenge of writing ARMv7-based shellcode for Windows since no one else seems to be doing it publicly. Knowing that writing shellcode from scratch would have been rather painful and prone to error, I decided to write my payload in C and then modify the resulting assembly listing slightly in order to achieve a position independent payload. Here is the result (noting that this is merely a working proof-of-concept):

You may have noticed that this shellcode is written in MASM format; therefore, it can only be assembled using armasm.exe (the ARM equivalent of ml.exe). Unfortunately, armasm doesn't provide the option of outputting to a raw bin file. It will only emit an object (.obj) file. I wrote Get-ObjDump with the intent of pulling out raw payload bytes, but armasm doesn't apply relocations. This means that any calls to functions present in the payload won't be fixed up and the payload will crash upon executing.

So rather than writing my own linker, the natural choice was to leverage Microsoft's linker, link.exe. In theory, all I would need to do is call `link bindshell.obj` and pull out the raw bytes from the '.foo' section of the resulting binary. However, I ran into a couple issues in practice:

1. link.exe requires that you specify an entry point.

Solution: Easy. Provide the '/ENTRY:"main"' switch

2. Depending on the subsystem you choose, link.exe requires certain functions in the CRT to be present. For example, the following subsystems require the following entry point functions:

/DLL - _DllMainCRTStartup
/SUBSYSTEM:NATIVE - NtProcessStartup

Solution: Obviously, I don't care about the C runtime library in my shellcode. The solution I came up with was to specify EFI_APPLICATION as the subsystem since it doesn't require the CRT. In the end, I don't care about the type of PE file I output. I just need the linker to fix up relocations for me. I can take care of pulling out the bytes from the .foo section of the resulting executable. Fortunately, Get-PEHeader can rip raw bytes from a PE file.
Wrapping things up, here is the process of obtaining fully-functioning ARM-based Windows RT shellcode from end to end:

Saturday, June 22, 2013

Undocumented NtQuerySystemInformation Structures (Updated for Windows 8)

Those familiar with Windows internals are likely to have used the NtQuerySystemInformation function in ntdll. This function is extremely valuable for getting system information that would otherwise not be made available via the Win32 API. The MSDN documentation only documents a minimal subset of the structures returned by this powerful function, however. To date, one of the best references for the undocumented features of this function has been the “Windows NT/2000 Native API Reference.” Despite being published in 2000, many of the structures documented in this book are still relevant today. In recent history though, Microsoft has quietly expanded the number of structures returned by NtQuerySystemInformation. Thankfully, the vast majority of them have been made public via symbols present in uxtheme.dll (64-bit structures) and combase.dll (32-bit structures) in Windows 8. At last check, it appears as though Microsoft pulled these symbols from the latest versions of the respective dlls.
I did my best to document these structures and fill in as many holes as possible in the SystemInformationClass enum. What resulted is the following image – a mapping of SystemInformationClass constants to their respective 32-bit structure and a header file – NtQuerySystemInformation.h. I validated that the header file is properly parsed by IDA (Ctrl+F9). To view the result of what was parsed in IDA, press Shift+F1 (Local Types Subview). The most notable structures are the ones that return pointers. In many cases, these are pointers to kernel memory. >D