(or, "Okay Maybe Size Isn't Quite Everything")
All right. We have taken on the task of manually constructing a Linux ELF executable that dynamically links with libc in order to call its _exit() function. The idea is to produce an executable that is as small as possible while still rigidly adhering to the requirements of the ELF specification and the Linux ABI.
Dynamic linking is a rather involved process, something akin to jet planes hooking up and refueling in midair. A lot of machinery has to be in place to do the business that ld takes care of when linking statically. Back in the original tutorial, we shed a kilobyte from our program size when we threw out libc and started making direct system calls. Now it's time to bite the bullet and see how much of it we have to put back in.
(By the way, note that we're going to continue to keep the focus on 32-bit executables. Most everything discussed here also applies equally to 64-bit executables, though.)
So, what does an executable need, in order to dynamically link with a shared object?
Well, for starters, it needs a
The ELF specification explicitly spells out which entries in the dynamic structure are mandatory, fortunately. Quite a few of the mandatory entries are values that refer to other ELF structures — and that means we're going to need to include those structures, too. Ack. We need to back up for a moment here, before we get in over our heads too quickly. One new structure at a time.
For starters, what exactly needs to happen for dynamic linking to work? Well, a dynamically linked function (or any other symbol) has an address that isn't determined until the calling program is actually in memory. At some point after that the dynamic linker has to get involved, so that when a dynamic address is needed, it can intervene. The linker must be able to retrieve the name of the symbol, figure out which library contains the symbol, find the address of the symbol in the process's memory space (mapping the library into the process's memory space first if it isn't there already), and finally insert the retrieved address where the process expects to find it.
Right off we can see that the executable is going to need to contain some text strings. Somewhere, it needs to have the names of the dynamically-linked symbols, as well as the names of the dynamic libraries. Presumably there also needs to be some kind of mapping between the symbols and the places in the assembly code where the symbols are referred to.
Let's start with the storing of the names. These text strings are kept
in a
That sounds simple enough. Let's put together a string table right now. We'll put two strings in it — the name of the function we want to link to, and the name of the library to find it in:
strtab: db 0 libc_name equ $ - strtab db 'libc.so.6', 0 exit_name equ $ - strtab db '_exit', 0
The libc_name and exit_name values provide the offsets to the strings within the table. Presumably there will be some kind of "symbol table" that will use those offsets, right?
Right. A
; Elf32_Sym dd 0 ; st_name dd 0 ; st_value dd 0 ; st_size db 0 ; st_info db 0 ; st_other dw 0 ; st_shndx
st_name is the string table offset to the symbol's actual name. We've got that covered already. st_value and st_size provide the symbol's "value" and "size". What these things actually mean varies for different kinds of symbols. However, the value and size of an externally-defined symbol is always zero, meaning unknown. So, we're good there. st_other is not currently used and so is always zero. Check. st_shndx provides the index of the section in the section header table to which that symbol belongs. We don't have a section header table, so: Zero. Check. That just leaves st_info.
The st_info byte actually holds two values (thus the vague name), each four bits in size. The low nybble indicates the type of object this symbol refers to. The two important possibilities are STT_OBJECT — i.e., some kind of variable — and STT_FUNC — i.e., what we're interested in. The high nybble of the st_info field indicates the symbol's binding. We have three choices here: STB_LOCAL, STB_GLOBAL, and STB_WEAK. Weak? What's that? A weak symbol mostly works like a global symbol. It differs in that it is not an error for there to be more than one weak symbol definition with the same name. Since we're importing and not exporting, the distinction isn't important to us. We'll stick with global.
The ELF specification tells us that STT_FUNC has a value of 2 and STB_GLOBAL has a value of 1. So we have all the information we need to fill in our symbol table now, don't we? Hold on: One more detail. All symbol tables need to have an empty first entry. As with string tables, a symbol table index of zero always indicates an undefined or invalid symbol.
So now we can create our symbol table:
%define STT_FUNC 2 %define STB_GLOBAL 1 %define ST_INFO(b, t) (((b) << 4) | (t)) symtab: ; Elf32_Sym dd 0 ; st_name dd 0 ; st_value dd 0 ; st_size db 0 ; st_info db 0 ; st_other dw 0 ; st_shndx dd exit_name ; st_name dd 0 ; st_value dd 0 ; st_size db ST_INFO(STB_GLOBAL, STT_FUNC) ; st_info db 0 ; st_other dw 0 ; st_shndx
Well, that was easy.
Too easy, really. Weren't we expecting there to be some kind of mapping between symbol names and their use in the actual code? There doesn't seem to be anything for that in this table.
And in fact there isn't. This table just describes the type of each
symbol. After all, a single symbol can get referenced many times. It
would be a waste to repeat all of this stuff for each and every
reference. So for the actual mapping, we have … yet another table!
The
An entry in the relocation table is very simple, and looks like this:
; Elf32_Rel dd 0 ; r_offset dd 0 ; r_info
The r_offset field, in the case of an executable, is generally the memory address where the symbol's true value needs to be inserted. (The precise meaning of this field can vary.) The r_info field is another field with two values crammed together, this time an 8-bit value combined with a 24-bit value. The low byte is a number indicating the relocation type, and the high three bytes contain the index of the symbol in the symbol table.
There are quite a number of relocation types for the i386 architecture. There's R_386_32, which indicates a direct insertion of the symbol's value, which is useful for obtaining pointers to objects. There is R_386_PC32, which obtains the symbol's value relative to the address of the relocation. This is much more useful for composing jumps and calls, which is what we want to do. There are also relocations that calculate values relative to various ELF structures, relocations that make individual copies of objects, and so on. Note also that relocations are additive. In other words, they are added to whatever value is already at the offset, rather than just overwriting it. (It may be worth briefly noting that ELF files can have an alternate kind of relocation table, called Elf32_Rela, in which the initial offset value is stored in a third field in the relocation record instead. Since those take up more space without adding any new capabilities, we're going to ignore them.) As we're importing a function address, what we want is R_386_PC32. So let's set the other relocation types to one side.
We should be able to build our relocation table now … wait a minute! We need an empty entry in the first position, right? No? No, apparently not. Okay, never mind.
%define R_386_PC32 2 %define R_INFO(s, t) (((s) << 8) | (t)) reltab: ; Elf32_Rel dd exit_call ; r_offset dd R_INFO(1, R_386_PC32) ; r_info
For the r_offset field, we have a label that hasn't been placed yet. We'll need to define it when we write the actual code, at the point of the call to the _exit() function. The 1 in the r_info field is the index of our 1 and only symbol in the symbol table.
Okay! Now it looks like it's starting to come together. We've defined enough data for a linker to move from a relocation record to the symbol table entry, and from there to the actual name of the symbol. Maybe now we're ready to take a closer look at what goes into that dynamic section hodgepodge.
The dynamic section has 24 different types of entries. (Yeesh!) Nine of these are marked as mandatory for executables. Let's look at those.
One entry is labelled DT_STRTAB. It provides the address of the string table. We have one of those, so check. There's also a DT_STRSZ, which specifies the size of the string table. We have that too. Next up is DT_SYMTAB, giving the address of the symbol table, and DT_SYMENT, giving the size of a single entry in the symbol table. Check and check. Then there's DT_REL, DT_RELSZ, and DT_RELENT, which specify the address of the relocation table, the size of said table, and the size of a single entry in the table. Got it, got it, got it. So far, so good. The eighth entry type is DT_HASH, which specifies the address of the hash table. Choke. What hash table?! Looks like we have yet another ELF structure to learn about.
Since your average ELF file contains hundreds if not thousands of symbols, ELF files come with a simple hash table to expedite the process of doing lookups. The hash table is simply an array of 32-bit values. The first value in the array specifies the number of buckets in the hash table. The second value in the array specifies the total number of entries in the hash table's chains; this will be the same as the number of entries in the symbol table (including the null entry at the top). After these two values is the array of buckets, which is then followed by the array of chain links.
The ELF specification lays out exactly what hash function to use:
unsigned long elf_hash(const unsigned char *name) { unsigned long h = 0, g; while (*name) { h = (h << 4) + *name++; if (g = h & 0xf0000000) h ^= g >> 24; h &= ~g; } return h; }
In the interest of speed, it is a very simple function, using nothing more than addition and bit-twiddling. To look up a symbol name in the hash table, a program runs the symbol name through the hash function, and uses the resulting number as an index into the bucket array, modulo the number of buckets. The number stored at that index is the index into the symbol table for a symbol with that hash value. The program then retrieves that symbol's name and compares it with the one it seeks. If it's not the right symbol, the program then uses the same number as an index into the chain array. The value there will be the index for another symbol whose name also hashes to that value. The program tries the new index; if it's still not the right symbol, it uses this latest index on the chain array again. It continues to follow the chain links until it finds the symbol it wants, or until it retrieves a zero value from the chain array, at which point the program knows that no more symbols are present with that hash value. (Since the first chain link always corresponds to the null entry, zero can never be a link to a valid symbol.)
A program that is building a hash table can choose whatever number of buckets it thinks will provide the best coverage with the fewest entries per bucket. Or if — as in our case — size is a concern instead of speed, it can build a table with a single bucket, which every symbol will hash to. In our case, there is only one symbol in our symbol table, so both of these strategies point to the same decision: a hash table with one bucket.
hashtab: dd 1 ; the number of buckets dd 2 ; the number of symbols dd 1 ; the lone bucket: symbol 1 for _exit dd 0 ; the null symbol table entry dd 0 ; the end of the only chain
The one bucket entry points to our one symbol. There are two chain entries, and both of them are zero because neither of them chain to the other.
We have a hash table. I mean, it's not much of a hash table, but it's valid in the eyes of the ELF specification. Hurrah!
Now, back to the dynamic section mishmash. We now have something to fill in for DT_HASH. But there is one more required entry left. What is it? As it happens, the ninth required entry is DT_NULL, which comes last in the table and marks its end. So that's it for the dynamic section!
Or is it? There's something we're still missing.
At this point, there's still nothing that points to the name of the shared-object library in the string table! Certainly that belongs in the dynamic section, or maybe as some kind of specially marked symbol? Actually, yes, it does belong in the dynamic section, under a DT_NEEDED identifier. DT_NEEDED is one of the optional entries — probably because, unlike the other entries, you can have more than one DT_NEEDED entry. Each one provides a string table offset to the name of a dynamic library, and the order they appear in the table determines the order that they will be searched for symbols. We just need the one, thanks.
(And you may have noticed, by this point, that there is truly nothing in the ELF file that connects a symbol to the library that supplies it. This is not an oversight, and in fact reflects one of the more noteworthy facts about dynamic linking in Unix — an exported symbol is truly global, and a library can usurp another library's symbol definition by getting loaded first. Check out the documentation for the LD_PRELOAD environment variable if you don't already know about it.)
So. Here is our dynamic section, all filled out:
%define DT_NULL 0 %define DT_NEEDED 1 %define DT_HASH 4 %define DT_STRTAB 5 %define DT_SYMTAB 6 %define DT_STRSZ 10 %define DT_SYMENT 11 %define DT_REL 17 %define DT_RELSZ 18 %define DT_RELENT 19 dyntab: dd DT_STRTAB, strtab dd DT_STRSZ, strtabsz dd DT_SYMTAB, symtab dd DT_SYMENT, symentsz dd DT_REL, reltab dd DT_RELSZ, reltabsz dd DT_RELENT, relentsz dd DT_HASH, hashtab dd DT_NEEDED, libc_name dd DT_NULL, 0
We need to go back and insert some labels and/or equates for a few of these symbols, but none of these require us to add content to the file that we don't already have.
As we can see now, the dynamic section functions as a kind of information kiosk for the dynamic linker, so it can find everything after it's been loaded into memory. Yes, it's all beginning to make sense.
(You may note that while the relocation table requires three entries, to fully specify its dimensions, the symbol table only requires two. Why isn't there a value that specifies the number of symbols in the symbol table? Because that number is included in the hash table. So that's one reason why the hash table isn't optional.)
So, are we done bloating up our executable with all these new ELF structures? Do we have everything we need to cooperate with the dynamic linker?
Not quite. One question: how does the linker find the dynamic section in the first place?
Answer: The dynamic section requires its own entry in the program segment header table. Remember the program segment header table? We've been getting by with only one program segment for the entire file, but that's not good enough any more. The linker needs to have the dynamic section called out directly with a program segment header table entry. (And if you ask how the dynamic linker finds the program segment header table, I will remind you that the location of the program segment header table is stashed in the ELF header itself. And if you ask how the linker finds the ELF header, go back and review the first tutorial: It's always the very first thing in the ELF file.)
Having a second program header table entry is a bit of a downer, because it means we won't be able to overlap it with the ELF header anymore: The first two bytes of the overlap are, in the ELF header, the number of entries in said table, and that number will now be two instead of one. As a small consolation, though, there will be two p_paddr fields in which we can hide bits of code.
In any case, there's no escaping the need for it, so let's quit dawdling and cobble together a new program header table:
%define PT_LOAD 1 %define PT_DYNAMIC 2 %define PF_R 4 %define PF_W 2 %define PF_X 1 phdr: ; Elf32_Phdr dd PT_LOAD ; p_type dd 0 ; p_offset dd $$ ; p_vaddr dd $$ ; p_paddr dd filesz ; p_filesz dd memsz ; p_memsz dd PF_R | PF_W | PF_X ; p_flags dd 0x1000 ; p_align dd PT_DYNAMIC ; p_type dd dyntab - $$ ; p_offset dd dyntab ; p_vaddr dd dyntab ; p_paddr dd dyntabsz ; p_filesz dd dyntabsz ; p_memsz dd PF_R | PF_W ; p_flags dd 4 ; p_align
I've gone ahead and replaced some of the magic numbers in the old version with named definitions, so we can better see what's going on. The original entry is marked PT_LOAD, meaning that this segment is to be loaded into memory at startup. It will still contain the entire file's contents. (That doesn't let us off the hook of having a separate entry for the dynamic section.) The new entry is marked PT_DYNAMIC, so the linker will have no trouble finding it. (Hey! PT_DYNAMIC has a value of two, so if we put this entry first we could overlap with the ELF header again! But no: The second field in this entry isn't zero, so it won't actually mesh with the end of the ELF header. Dang.) This entry covers just the dynamic section, and is, somewhat surprisingly, typically marked as being writable. (I'm not sure why, but it matters; having a read-only dynamic table will cause your executable to segfault during initialization.) Thus the PT_LOAD entry itself needs to be marked as writable, since it's the one that actually determines the flags that are set for that page of memory.
So now we're done! We're ready to put all these disparate pieces together into a single file.
Well, almost. But not yet.
Here's a question: How does the dynamic linker ever get invoked?
This may come as a bit of a surprise (it certainly did to me), but the dynamic linker does not get handed every executable with a dynamic section by default. Dynamic linking is an opt-in affair, and you must specifically request it.
One of the issues here is that the dynamic linker is just another program, and occasionally changes in the kernel require changes in the linker. In order to maintain binary backwards-compatibility, it can be necessary to have more than one version of the dynamic linker program available. Thus, each and every program must specify which linker they were made to work with. In fact, the full pathname to the dynamic linker has to be embedded within the executable. And not within a string table, mind you. It needs to be involved too early in the process, so as a result it gets its own program header table entry. That's right: we need yet another entry to the program segment header table.
Grumble, grumble. Well, if we must:
%define PT_INTERP 3 dd PT_INTERP ; p_type dd interp - $$ ; p_offset dd interp ; p_vaddr dd interp ; p_paddr dd interpsz ; p_filesz dd interpsz ; p_memsz dd PF_R ; p_flags dd 0 ; p_align interp: db '/lib/ld-linux.so.2', 0 interpsz equ $ - interp
Why is it called PT_INTERP? Apparently the dynamic linker is being treated as a kind of interpreter. The dynamic linker is handed the program while the execve() system call is still running, and it does its work before execve() hands control over to the new executable. (In a manner of speaking, that is. A good chunk of the activity involved in dynamic linking is typically delayed until a given symbol is actually referenced. This saves time during startup, and can save time overall as well, if a number of the imported symbols don't get used. The dynamic linker arranges for this by having linked symbols call back into itself, at which time the real symbol's location is looked up. We aren't using this feature in our program, though, in the interests of minimizing our size.)
Note that this diminuitive segment only needs to be readable. And because it's a simple string, it has no aligment requirements.
Okay. I think that really is the last of the pieces. Now we're ready to put it all together. Or almost ready … hmm. Do you get the feeling that we've overlooked something?
Oh yeah. The program!
Originally the version of our program that used libc looked like this:
_start: push byte 42 call _exit
This time, however, we're importing the _exit symbol ourselves, at runtime. So what needs to go here?
Remember our entry in the relocation table? We put a label there, exit_call, and noted that we would need to place it in the code, at the point where the _exit() function is called. Well, we're at that place now.
So our program should look like this:
_start: push byte 42 exit_call: call $
Or maybe not. A bit of thought should make it clear that this won't work. The exit_call label isn't quite pointing at the call address; instead it's pointing at the call instruction itself. We don't want to overwrite that part. So let's push exit_call forward one byte:
_start: push byte 42 call $ exit_call equ $ - 4
Much better. However, there's still an error here.
The dynamic linker is going to insert the relative address of the _exit() function there. But relative to what? Relative to the address that it's modifying — that is, relative to exit_call. But the x86 call instruction doesn't work relative to that location; it calculates the destination relative to the instruction pointer, which by that time is pointing to the byte after the call instruction. So the value that the dynamic linker will insert there will be off by four bytes.
How do we repair this? Do we have to fixup the fixup? In a way, yes. Remember that I mentioned that relocations are added to whatever value is already embedded in the code, rather than just overwriting it. So we just need to store a value of -4 at exit_call. We could do that this way:
_start: push byte 42 db 0xE8 exit_call: dd -4
But writing it like this will also work, and is more clear as to what's going on:
_start: push byte 42 call exit_call exit_call equ $ - 4
And now … I think we have all the pieces! Finally. Let's put them together and see what we get.
; tiny.asm BITS 32 %define ET_EXEC 2 %define EM_386 3 %define EV_CURRENT 1 %define PT_LOAD 1 %define PT_DYNAMIC 2 %define PT_INTERP 3 %define PF_X 1 %define PF_W 2 %define PF_R 4 %define STT_FUNC 2 %define STB_GLOBAL 1 %define R_386_PC32 2 %define DT_NULL 0 %define DT_NEEDED 1 %define DT_HASH 4 %define DT_STRTAB 5 %define DT_SYMTAB 6 %define DT_STRSZ 10 %define DT_SYMENT 11 %define DT_REL 17 %define DT_RELSZ 18 %define DT_RELENT 19 %define ST_INFO(b, t) (((b) << 4) | (t)) %define R_INFO(s, t) (((s) << 8) | (t)) shentsz equ 0x28 org 0x08048000 ;; The ELF header ehdr: ; Elf32_Ehdr db 0x7F, "ELF", 1, 1, 1 ; e_ident times 9 db 0 dw ET_EXEC ; e_type dw EM_386 ; e_machine dd EV_CURRENT ; e_version dd _start ; e_entry dd phdr - $$ ; e_phoff dd 0 ; e_shoff dd 0 ; e_flags dw ehdrsz ; e_ehsize dw phentsz ; e_phentsize dw 3 ; e_phnum dw shentsz ; e_shentsize dw 0 ; e_shnum dw 0 ; e_shstrndx ehdrsz equ $ - ehdr ;; The program segment header table phdr: ; Elf32_Phdr dd PT_INTERP ; p_type dd interp - $$ ; p_offset dd interp ; p_vaddr dd interp ; p_paddr dd interpsz ; p_filesz dd interpsz ; p_memsz dd PF_R ; p_flags dd 0 ; p_align phentsz equ $ - phdr dd PT_LOAD ; p_type dd 0 ; p_offset dd $$ ; p_vaddr dd $$ ; p_paddr dd filesz ; p_filesz dd memsz ; p_memsz dd PF_R | PF_W | PF_X ; p_flags dd 0x1000 ; p_align dd PT_DYNAMIC ; p_type dd dyntab - $$ ; p_offset dd dyntab ; p_vaddr dd dyntab ; p_paddr dd dyntabsz ; p_filesz dd dyntabsz ; p_memsz dd PF_R | PF_W ; p_flags dd 4 ; p_align ;; The hash table hashtab: dd 1 ; no. of buckets dd 2 ; no. of symbols dd 1 ; the bucket: symbol #1 dd 0, 0 ; two links, both zero ;; The symbol table symtab: ; Elf32_Sym dd 0 ; st_name dd 0 ; st_value dd 0 ; st_size db 0 ; st_info db 0 ; st_other dw 0 ; st_shndx symentsz equ $ - symtab dd exit_name ; st_name dd 0 ; st_value dd 0 ; st_size db ST_INFO(STB_GLOBAL, STT_FUNC) ; st_info db 0 ; st_other dw 0 ; st_shndx ;; The relocation table reltab: ; Elf32_Rel dd exit_call ; r_offset dd R_INFO(1, R_386_PC32) ; r_info relentsz equ $ - reltab reltabsz equ $ - reltab ;; The dynamic section dyntab: dd DT_STRTAB, strtab dd DT_STRSZ, strtabsz dd DT_SYMTAB, symtab dd DT_SYMENT, symentsz dd DT_REL, reltab dd DT_RELSZ, reltabsz dd DT_RELENT, relentsz dd DT_HASH, hashtab dd DT_NEEDED, libc_name dd DT_NULL, 0 dyntabsz equ $ - dyntab ;; The interpreter segment interp: db '/lib/ld-linux.so.2', 0 interpsz equ $ - interp ;; The string table strtab: db 0 libc_name equ $ - strtab db 'libc.so.6', 0 exit_name equ $ - strtab db '_exit', 0 strtabsz equ $ - strtab ;; Our program _start: push byte 42 call exit_call exit_call equ $ - 4 ;; End of the file image. filesz equ $ - $$ memsz equ filesz
Note that I've placed all of the structures at the top of the file, before the strings sections and the program. All ELF structures which contain 32-bit fields (which is basically everything except string tables) need to be dword-aligned in memory, which means they also need to be dword-aligned in the file. The easiest way to ensure that is to move all the alignment-agnostic sections to the end.
When this program is executed, what will happen? (Deep breath.) The kernel will map it into memory, whereupon it will see the entry in the program segment header table marked PT_INTERP. It will hand the image of our process-to-be over to the program named in that segment, which is the dynamic linker. The dynamic linker will refer to the dynamic section (thanks to it also having an entry in the program segment header table) to find the relocation table. The relocation table's one entry tells it to compute a relative address for the first symbol in the symbol table, the location for which is also retrieved from the dynamic section. The first entry in the symbol table provides a position in the string table (and again, the location for the string table comes from the dynamic section) where the symbol's name is stored, namely "_exit". To find that address the dynamic linker starts going down the list of shared-object libraries that our program has requested; this list contains only a single library, namely "libc.so.6". The dynamic linker will now shift its attention to that library's ELF structures instead of ours. It will compute the hash value of "_exit", and consult libc's hash table, symbol table, and string table in order to find an entry for that symbol. (If it failed to find this symbol, it would then try the next library mapped into our process space. Since there are no other such libraries, the dynamic linker would display an error message and quit the process.) The symbol table entry, once found, provides a value for the symbol, which the dynamic linker adjusts to match the address of the shared-object in our process space, and finally stores a relative version of that value at the address specified in our relocation table entry. (Phew.)
Then, when the kernel starts us running at our entry point, our program's call instruction executes, and it actually jumps to the desired _exit routine. Or so we hope, anyway.
Here we go:
$ nasm -f bin -o a.out tiny.asm $ chmod +x a.out $ ./a.out ; echo $? 42
It worked. How about that? We have dynamically linked!
Now let's check on the size of this thing:
$ wc -c a.out 331 a.out
Well, that's considerably larger than 45 bytes. On the other hand, it's quite a lot smaller than the executable created for us by the linker. So, I'd say this still counts as a complete success.
The next question is, of course, can we make this any smaller? However, I'm going to step away from my usual role and tell you up front that there isn't a lot we can do to reduce this. We're playing with the net up, as the saying goes. There aren't any fields in the new ELF structures that permit arbitrary values, so we're not going to be able to overlap things much. And we don't have any optional structures that we can remove and still do dynamic linking. (Which is really to say that I've already omitted any optional structures like the procedure linkage table.)
Still, let's see what we can do.
We're still free to overlap sections, and there is some scope for that. A couple of the sections end in several zero bytes, for starters, which immediately brings the symbol table to mind, as it begins with no less than 16 NULs. The structures ending with the most NULs are the hash table and the dynamic structure — both have 11 NUL bytes at their tails. Let's use the hash table, since it's already placed above the symbol table, and overlap them, like so:
;; The hash table and the symbol table, overlapped hashtab: dd 1 ; no. of buckets dd 2 ; no. of symbols db 1 ; the bucket: symbol #1 ; two links, both zero symtab: ; Elf32_Sym dd 0 ; st_name dd 0 ; st_value dd 0 ; st_size ; etc.
But this won't quite work, because now the symbol table is no longer dword-aligned. In practice we can only overlap 8 of the 11 bytes:
;; The hash table and the symbol table, overlapped hashtab: dd 1 ; no. of buckets dd 2 ; no. of symbols dd 1 ; the bucket: symbol #1 ; two links, both zero symtab: ; Elf32_Sym dd 0 ; st_name dd 0 ; st_value dd 0 ; st_size ; etc.
Another trick we can do with structures ending in zeros is to put them at the end of the file. You may recall that the entries in the program segment header table specify both a file size and a memory size (the fields p_filesz and p_memsz). When the memory size is larger than the file size, the loader is required to explicitly make up the difference with NUL bytes.
We can only do this at the end of the file, even though we have three segments, since everything gets loaded in a single segment. (Only the PT_LOAD entry in the table actually causes anything to be copied out of the file and into memory. The PT_INTERP and PT_DYNAMIC entries in the table are really nothing more than pointers.) Since there are no alignment requirements for the ends of structures, we can chop all eleven bytes off the tail of the dynamic structure:
align 4 ;; The dynamic section dyntab: dd DT_STRTAB, strtab dd DT_STRSZ, strtabsz dd DT_SYMTAB, symtab dd DT_SYMENT, symentsz dd DT_REL, reltab dd DT_RELSZ, reltabsz dd DT_RELENT, relentsz dd DT_HASH, hashtab dd DT_NEEDED db libc_name dyntabsz equ ($ - dyntab) + 11 _end equ $ + 11 ;; End of the file image. filesz equ $ - $$ memsz equ _end - $$
Of course, moving the dynamic section to the end of the file means that we may need to insert a few bytes of padding after the unaligned structures. (Thus the "align 4" directive just before the dynamic section.) But it's still worth doing. Also note that we need to be careful to adjust the values of all affected assembler symbols, so that the missing bytes are still counted as part of the structure.
What about overlapping something with the front of the dynamic section? The nice thing about the dynamic section is that its key-value pairs don't have to be in any particular order; anything can be moved to the front. There are a few possible entries that provide overlapping opportunities, but the one that I want to use is the hardest one to see — the end of the symbol table:
%define STT_FUNC 2 %define STB_GLOBAL 1 %define ST_INFO(b, t) (((b) << 4) | (t)) dd exit_name ; st_name dd 0 ; st_value dd 0 ; st_size db ST_INFO(STB_GLOBAL, STT_FUNC) ; st_info db 0 ; st_other dw 0 ; st_shndx
The last four bytes of our symbol table represent the value 18. That happens to match up with the value of DT_RELSZ. And so:
dd exit_name ; st_name dd 0 ; st_value dd 0 ; st_size ; st_info = 18 ; st_other = 0 ; st_shndx = 0 dyntab: dd DT_RELSZ, reltabsz dd DT_RELENT, relentsz dd DT_REL, reltab ; etc.
Of course, the symbol table is already overlapping on the other side with the hash table, so both of them will get dragged down to the end of the file.
And moving on to the two string sections, we can of course overlap the single NUL at the end of the interpreter string with the one at the start of the string table:
;; The interpreter segment and the string table, overlapped interp: db '/lib/ld-linux.so.2' ; plus a terminating NUL interpsz equ ($ - interp) + 1 strtab: db 0 libc_name equ $ - strtab db 'libc.so.6', 0 exit_name equ $ - strtab db '_exit', 0 strtabsz equ $ - strtab
Looks like we've already run out of things to overlap! Well, not quite. There is one more overlap we can create. The program segment header table contains three entries. The order of these entries is umimportant. We can shuffle them around so that the final field in the table will be the interpreter segment's p_align field. Because this segment has no alignment requirements, the value of this field is zero. However, the ELF specification notes that unaligned segments can be indicated with a value of either zero or one. Since the first field in the hash table is a one, let's use the latter.
align 4 ;; The program segment header table, the hash table, the symbol ;; table, and the dynamic section, overlapped phdr: ; Elf32_Phdr dd PT_LOAD ; p_type dd 0 ; p_offset dd $$ ; p_vaddr dd $$ ; p_paddr dd filesz ; p_filesz dd memsz ; p_memsz dd PF_R | PF_W | PF_X ; p_flags dd 0x1000 ; p_align dd PT_DYNAMIC ; p_type dd dyntab - $$ ; p_offset dd dyntab ; p_vaddr dd dyntab ; p_paddr dd dyntabsz ; p_filesz dd dyntabsz ; p_memsz dd PF_R | PF_W ; p_flags dd 4 ; p_align dd PT_INTERP ; p_type dd interp - $$ ; p_offset dd interp ; p_vaddr dd interp ; p_paddr dd interpsz ; p_filesz dd interpsz ; p_memsz dd PF_R ; p_flags ; p_align = 1 hashtab: dd 1 ; no. of buckets dd 2 ; no. of symbols dd 1 ; the bucket: symbol #1 ; etc.
With these overlaps in place, we can verify that we still have a working executable:
$ nasm -f bin -o a.out tiny.asm $ chmod +x a.out $ ./a.out ; echo $? 42 $ wc -c a.out 305 a.out
We reclaimed 8 bytes from the hash table, 11 bytes off the end of the dynamic structure, 1 byte from the interpreter segment, 4 bytes off the end of the symbol table, and 4 bytes out of the program segment header table … minus two bytes of padding that needed to be introduced. Our program is now 26 bytes smaller.
There is, of course, one last reduction that we can make, and that is (as before) squeezing the program itself into the ELF structures.
Despite all this new machinery, we don't have any addtional fields that we can cram arbitrary values into. Just our old standby, the p_paddr field in the program segment header table. However, we now have three p_paddr fields we can use. And our program is only two instructions long, so no problem.
;; Our program _start: push byte 42 call exit_call exit_call equ $ - 4
Well, there is one problem. The second instruction is five bytes long. We don't have five contiguous bytes we can use.
In the previous essay, we selected a load address with a upper byte that matched an instruction, so we could use part of the p_vaddr field as well. But that won't work this time. The initial byte of the call instruction is 0xE8, and this we can't have a load address above 0x80000000. We could embed 0xE8 as the second-highest byte of the load address, except that what comes after the call instruction is the 32-bit value -4, or 0xFFFFFFFC, which is no better. On top of this, that -4 value is going to be overwritten by the dynamic linker. The ELF spec doesn't explicitly say that you can't alter the p_vaddr field after it's been loaded … but neither does it say that you can. (This feels like one of those things that never occurred to anyone as needing to be explicitly forbidden. Sort of like how the ELF specification doesn't explicitly say that you can't expect your executable to work if you set the hard drive on fire while it's loading.) If we're really trying to play by the rules, then it's probably best to avoid altering ELF structures during runtime.
Is there any other field, anywhere else in the file, where we can squeeze a five-byte call instruction? No, there doesn't seem to be.
Is there another instruction we can use instead of call? Well, we can push a return value onto the stack manually and then jump. But a jump is also five bytes long, so that gets us nothing. (There is also the issue of getting a return value to put onto the stack in the first place. Normally that couldn't be done without a five-byte instruction either. In fact, the usual way of getting a return value is to execute a call to a pop instruction! But this is a special case: Since the _exit() function never returns, we likely could have gotten away with pushing any random value, or even subtracting four from the stack pointer.)
Well, is there another encoding of the call instruction we could use? A "call eax" instruction is only two bytes long, for example. But then we're still stuck with a five-byte instruction to load an address into eax.
Maybe we could use a "call [eax]" instruction? We still need to load eax with an address, of course. But before we were dealing with an address that doesn't exist until the dynamic linker comes along and inserts it. This way, we would be dealing with an address that we choose, one located in our file and known ahead of time. This is a possibility.
The only problem is that we don't have a pointer to anyplace within our file (or anywhere else, for that matter). So we'll need to pick a load address that we can create from scratch using short instructions. After experimenting with the instruction set for a bit, the best valid address I can come up with is 0x01000000, which I can generate like so:
xor eax, eax inc eax bswap eax
This is still five bytes, but since there are three instructions now, we don't need the five bytes to be contiguous. This would set eax to point to the beginning of the file image in memory. We still need to offset it to where the actual address gets stored, of course:
call [byte eax + (exit_call - $$)]
This instruction is three bytes long. We need another two bytes to push the argument, for a grand total of ten bytes. Argh! Once we account for the jumps we need to move from one p_paddr field to the next, we only have eight bytes available to us!
But wait. Our program now has a different encoding:
00000000 6A2A push byte 42 00000002 31C0 xor eax, eax 00000004 40 inc eax 00000005 0FC8 bswap eax 00000007 FF507F call [byte eax + 0x7F]
Most of these byte values are below 0x80, providing opportunities for overlap with the p_vaddr field instead. That might give us enough room … except that this whole scheme depends on having a load address of 0x01000000. So much for that idea.
On the other hand, this does suggest a new avenue of attack. Is there a different version of the call instruction that would be more friendly to being embedded in p_vaddr?
Indeed there is: the "call [mem]" instruction is six bytes long, and consists of 0xFF 0x15 followed by a four-byte address. By setting our binary's load address to 0x15FF0000, we could fit the rest of the instruction in the p_paddr field:
phdr: ; Elf32_Phdr dd PT_LOAD ; p_type dd 0 ; p_offset dw 0 ; p_vaddr part2: call [exit_ptr] ; p_paddr dd filesz ; p_filesz dd memsz ; p_memsz dd PF_R | PF_W | PF_X ; p_flags dd 0x1000 ; p_align dd PT_DYNAMIC ; p_type dd dyntab - $$ ; p_offset dd dyntab ; p_vaddr _start: push byte 42 ; p_paddr jmp short part2 dd dyntabsz ; p_filesz dd dyntabsz ; p_memsz dd PF_R | PF_W ; p_flags dd 4 ; p_align
Notice carefully that this version of the call instruction will need to be given an absolute address, instead of the usual call instruction's relative address. This means that we need to change the entry in the relocation table, so that the linker will actually store an absolute address. In other words, we need to use R_386_32 instead of R_386_PC32:
%define R_386_32 1 reltab: ; Elf32_Rel dd exit_ptr ; r_offset dd R_INFO(1, R_386_32) ; r_info
This also means that the four bytes at exit_ptr need to be initialized to zero, instead of -4. That's actually convenient, because now we can locate those bytes in the zero-initialized bss memory at the end:
dyntabsz equ ($ - dyntab) + 11 exit_ptr equ $ + 11 _end equ $ + 15 ;; End of the file image. filesz equ $ - $$ memsz equ _end - $$
And with that, I believe we are ready.
Here's the program in its entirety:
; tiny.asm BITS 32 %define ET_EXEC 2 %define EM_386 3 %define EV_CURRENT 1 %define PT_LOAD 1 %define PT_DYNAMIC 2 %define PT_INTERP 3 %define PF_X 1 %define PF_W 2 %define PF_R 4 %define STT_FUNC 2 %define STB_GLOBAL 1 %define R_386_32 1 %define DT_NULL 0 %define DT_NEEDED 1 %define DT_HASH 4 %define DT_STRTAB 5 %define DT_SYMTAB 6 %define DT_STRSZ 10 %define DT_SYMENT 11 %define DT_REL 17 %define DT_RELSZ 18 %define DT_RELENT 19 %define R_INFO(s, t) (((s) << 8) | (t)) shentsz equ 0x28 org 0x15FF0000 ehdr: ; Elf32_Ehdr db 0x7F, "ELF", 1, 1, 1 ; e_ident times 9 db 0 dw ET_EXEC ; e_type dw EM_386 ; e_machine dd EV_CURRENT ; e_version dd _start ; e_entry dd phdr - $$ ; e_phoff dd 0 ; e_shoff dd 0 ; e_flags dw ehdrsz ; e_ehsize dw phentsz ; e_phentsize dw 3 ; e_phnum dw shentsz ; e_shentsize dw 0 ; e_shnum dw 0 ; e_shstrndx ehdrsz equ $ - ehdr ;; The interpreter segment interp: db '/lib/ld-linux.so.2' interpsz equ $ - interp + 1 ;; The string table strtab: db 0 libc_name equ $ - strtab db 'libc.so.6', 0 exit_name equ $ - strtab db '_exit', 0 strtabsz equ $ - strtab align 4 ;; The relocation table reltab: ; Elf32_Rel dd exit_ptr ; r_offset dd R_INFO(1, R_386_32) ; r_info relentsz equ $ - reltab reltabsz equ $ - reltab ;; The program segment header table, hash table, symbol table, ;; and dynamic section. phdr: ; Elf32_Phdr dd PT_LOAD ; p_type dd 0 ; p_offset dw 0 ; p_vaddr part2: call [exit_ptr] ; p_paddr dd filesz ; p_filesz dd memsz ; p_memsz dd PF_R | PF_W | PF_X ; p_flags dd 0x1000 ; p_align phentsz equ $ - phdr dd PT_DYNAMIC ; p_type dd dyntab - $$ ; p_offset dd dyntab ; p_vaddr _start: push byte 42 ; p_paddr jmp short part2 dd dyntabsz ; p_filesz dd dyntabsz ; p_memsz dd PF_R | PF_W ; p_flags dd 4 ; p_aligndd PT_INTERP ; p_type dd interp - $$ ; p_offset dd interp ; p_vaddr dd 0 ; p_paddr dd interpsz ; p_filesz dd interpsz ; p_memsz dd PF_R ; p_flags ; p_align = 1 hashtab: dd 1 ; no. of buckets dd 2 ; no. of symbols dd 1 ; the bucket: symbol #1 ; two links, both zero symtab: ; Elf32_Sym dd 0 ; st_name dd 0 ; st_value dd 0 ; st_size db 0 ; st_info db 0 ; st_other dw 0 ; st_shndx symentsz equ $ - symtab dd exit_name ; st_name dd 0 ; st_value dd 0 ; st_size ; st_info = 18 ; st_other = 0 ; st_shndx = 0 ;; The dynamic section dyntab: dd DT_RELSZ, reltabsz dd DT_RELENT, relentsz dd DT_REL, reltab dd DT_STRSZ, strtabsz dd DT_STRTAB, strtab dd DT_SYMENT, symentsz dd DT_SYMTAB, symtab dd DT_HASH, hashtab dd DT_NEEDED db libc_name dyntabsz equ $ - dyntab + 11 exit_ptr equ $ + 11 _end equ $ + 15 ;; End of the file image. filesz equ $ - $$ memsz equ _end - $$
And behold, it works:
$ nasm -f bin -o a.out tiny.asm $ chmod +x a.out $ ./a.out ; echo $? 42 $ wc -c a.out 297 a.out
This reduction netted us seven bytes, plus one less byte of alignment padding, raising to 34 the grand total of bytes saved from our first working version, and allowing our executable to slip below the 300-byte mark.
Is this the smallest possible size for this program? It might be, but to be honest I don't know. It's a much more complicated executable, and things aren't as clear-cut as they were in the original tutorial. All I can say for certain is that I haven't been able to make it any smaller yet.