================================================================================
B A S T A R D                                            disassembly environment


                         Bastard  Disassembly  HOWTO



================================================================================
 Contents

 1. "I don't know a thing about disassembling!"
 2. Down and Dirty with the CLI
 3. Streetwalking with the Bastard API
 4. Modifying an Initial Disassembly
 5. Producing Output Files
 6. DB Queries and Reports
 
 A. CLI Tricks
 B. The Little Grey Bastard
 C. The Son of the Bastard
 
================================================================================
 1: An introduction to disassembly in general and the bastard in particular

Most people who are interested in disassembly are trying to either recover or
approximate the source code for a compiled binary executable. When a compiler
turns text-based source code into a binary executable, it removes what it
views as frivolous information in the code: function names, variable names,
comments, elegant coding style. What's left is binary code.

Binary code is not, strictly speaking, assembly language. When you look at
a compiled executable, you will see a series of bytes such as
  20 73 74 72 69 6e 
rather than assembly language instructions like
  20 73 74     and    %dh,0x74(%ebx)
  72 69        jb     0x804cc43
  6e           outsb  %ds:(%esi),(%dx)
which is why binary files look like a bunch of random garbage when viewed
in a hex editor.

Since assembly language instructions are really mnemonics with a more or less
one-to-one correspondence to specific machine opcodes (and associated encoded
operands), it would seem a simple matter to translate binary opcodes to
assembly language instructions.

It is easy, however, to make a mistake and generate incorrect assembly language.
Imagine starting the above disassembly a byte early or a byte late, for example
at
  6e 20 73 74 72 69 6e
or
  73 74 72 69 6e
instead of starting at the correct byte (0x20).

If such a mistake were made, the disassembly would become
   6e           outsb  %ds:(%esi),(%dx)
   20 73 74     and    %dh,0x74(%ebx)
   72 69        jb     0x804cc43
   6e           outsb  %ds:(%esi),(%dx)
in the first case, and
   73 74        jae    0x804cc4c
   72 69        jb     0x804cc43
   6e           outsb  %ds:(%esi),(%dx)
in the second. In both cases the disassembly is fundamentally wrong, even
though some of the instructions are the same. In the first example, the
instruction before "and %dh,0x74(%ebx)" should be "20 69 6e  and
%ch,0x6e(%ecx)", of which the leading 6e is only the last byte, rather than
an outsb.
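
The effect of the starting offset can be reproduced with a toy decoder. The
Python sketch below is not part of the bastard: it knows only the four opcodes
used in this section, and the fixed three-byte length it gives opcode 0x20 is
an assumption that holds only for ModRM bytes with a one-byte displacement,
like the ones shown here. The names LENGTHS and linear_sweep are made up for
the example.

```python
# Toy decoder for the four opcodes in the example above (a sketch, not a
# real x86 decoder). Opcode 0x20 is given a fixed length of 3, which is
# only true for ModRM bytes that carry a one-byte displacement.
LENGTHS = {
    0x20: ("and", 3),    # and r/m8, r8 : opcode + ModRM + disp8
    0x72: ("jb", 2),     # jb rel8
    0x73: ("jae", 2),    # jae rel8
    0x6E: ("outsb", 1),  # outsb
}

def linear_sweep(code, start):
    """Decode sequentially from 'start'; return (offset, mnemonic, hex bytes)."""
    listing = []
    pos = start
    while pos < len(code):
        mnemonic, length = LENGTHS.get(code[pos], ("db", 1))
        if pos + length > len(code):
            break  # truncated instruction at the end of the buffer
        listing.append((pos, mnemonic, code[pos:pos + length].hex(" ")))
        pos += length
    return listing

code = bytes.fromhex("6e 20 73 74 72 69 6e")
for start in (0, 1, 2):
    print("start at offset", start)
    for offset, mnemonic, raw in linear_sweep(code, start):
        print("   %-10s %s" % (raw, mnemonic))
```

All three listings end with the same jb and outsb, which is why a misaligned
disassembly can look plausible at first glance.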

It is also possible that the above might not have been code at all, but rather
data; for example, the sequence of bytes
    20 73 74 72 69 6e 67 32 00
is the same as the string
    " string2"
... meaning that what in the past few examples was disassembled as code is,
more than likely, data in the form of an ASCII string. 
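
A common heuristic for spotting such data is to look for runs of printable
bytes that end in a NUL. The Python sketch below illustrates the idea only; it
is not how the bastard finds strings, and the function name and the minimum
length of 4 are arbitrary assumptions.

```python
import string

# Printable ASCII, keeping space and tab but no other control whitespace.
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\n\r\x0b\x0c")

def find_c_strings(data, min_len=4):
    """Return (offset, text) for each run of at least min_len printable
    bytes that is terminated by a NUL byte."""
    found = []
    start = None
    for pos, byte in enumerate(data):
        if byte in PRINTABLE:
            if start is None:
                start = pos          # a new candidate run begins here
        else:
            if byte == 0 and start is not None and pos - start >= min_len:
                found.append((start, data[start:pos].decode("ascii")))
            start = None             # any non-printable byte ends the run
    return found

data = bytes.fromhex("20 73 74 72 69 6e 67 32 00")
print(find_c_strings(data))
```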

Knowing this, the fundamental problem in disassembly becomes distinguishing
code from data, and knowing the correct starting point of the code. A basic
disassembler will accomplish this by parsing the file format header -- the
first byte of the binary executable is the start of the file header, which
contains information on the structure of the code and data contained in the
executable, for use by the OS loader -- to determine the boundaries of 
sections within the program, which sections are code and which are data, and
what the entry point (the byte where execution starts when the OS loader
transfers control to the executable) is.
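
For a 32-bit little-endian ELF file, reading the entry point out of the file
header takes only a few lines. The Python sketch below follows the header
layout from the ELF specification; the function name is made up, and the
header it parses is a synthetic one built on the spot, with a made-up entry
address, so that the example does not depend on any particular file.

```python
import struct

# Layout of the 52-byte ELF32 file header (little-endian): e_ident[16],
# e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
# e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx.
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")

def elf32_entry_point(image):
    """Return the entry point stored in an ELF32 little-endian header."""
    fields = ELF32_HEADER.unpack_from(image, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 1 or ident[5] != 1:   # EI_CLASS, EI_DATA
        raise ValueError("only 32-bit little-endian ELF is handled here")
    return fields[4]                      # e_entry

# Synthetic header for illustration; 0x8048a2c is a made-up entry address.
ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
header = ELF32_HEADER.pack(ident, 2, 3, 1, 0x8048A2C, 52, 0, 0, 52, 32, 0, 40, 0, 0)
print(hex(elf32_entry_point(header)))
```

Section boundaries and types come from the section header table that the same
header points to; the sketch stops at the entry point.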

For very basic, well-behaved programs (e.g. anything not compiled with gcc)
this is usually enough; for example, this is the approach that objdump uses.
Such methods can lead to confusing or incorrect code in cases where code
and data are mixed:
    jmp _exit

    db " string2",0
becomes
    jmp    0x8048a2c
    and    %dh,0x74(%ebx)
    jb     0x804cc43
    outsb  %ds:(%esi),(%dx)
when disassembled. For this reason, more advanced disassemblers use a "flow of
execution" approach, only disassembling sequences of bytes that will be
executed (and ignoring those that will not, such as the bytes following an
unconditional jump).
Even this approach has flaws, since sophisticated programs, interpreters, and
object-oriented code will often have runtime indirection such as
     call_table:
         dd open_file   ;address of open_file()
         dd close_file  ;address of close_file()
         dd exit        ;address of exit()

     ...

     ;  *** do_user_request( int req ) : eax holds "req" ***
     do_user_request:
         shl    eax, 2            ;multiply req by sizeof code address (4)
         mov    ebx, call_table
         add    ebx, eax          ;ebx = address of the table entry
         call   [ebx]             ;call through the table entry
         ret
...which will not be disassembled using strict flow-of-execution analysis,
since the actual value of ebx is not known until run time.
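
The flow-of-execution idea itself can be sketched in a few lines of Python.
This is an illustration, not the bastard's algorithm: the opcode table covers
only four one- and two-byte instructions, and the names OPCODES and
flow_disassemble are made up. The test program jumps over an embedded string,
so only the jump and the two instructions after the string are visited.

```python
# Toy opcode table (an assumption for illustration): name and total length.
OPCODES = {
    0xEB: ("jmp", 2),   # jmp rel8 (unconditional)
    0x72: ("jb", 2),    # jb rel8 (conditional)
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
}

def flow_disassemble(code, entry=0):
    """Return the sorted offsets of instructions reachable from 'entry'."""
    seen = set()
    work = [entry]
    while work:
        pos = work.pop()
        while 0 <= pos < len(code) and pos not in seen:
            if code[pos] not in OPCODES:
                break                      # unknown byte: stop this path
            name, length = OPCODES[code[pos]]
            if pos + length > len(code):
                break                      # truncated instruction
            seen.add(pos)
            nxt = pos + length
            if name in ("jmp", "jb"):
                rel = int.from_bytes(code[pos + 1:pos + 2], "little", signed=True)
                work.append(nxt + rel)     # queue the branch target
            if name in ("jmp", "ret"):
                break                      # execution never falls through
            pos = nxt
    return sorted(seen)

# jmp over an embedded string, then nop and ret
code = bytes.fromhex("eb09") + b" string2\x00" + bytes.fromhex("90c3")
print(flow_disassemble(code))
```

An indirect call through a table has no target encoded in the instruction
bytes, so a traversal like this one has nothing to queue for it.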

There is no single solution. The lesson to be learned is to never trust the
disassembler; always take its output with a grain of salt. More enlightened
disassemblers know that the experienced user often knows more about what is
going on than they themselves do; such disassemblers will do their best to
provide an accurate disassembly, then will allow the user to review and modify
the disassembly until it is reasonably accurate. Bubble Chamber, IDA, and the
bastard are examples of such disassemblers.


Once code and data are known, the problem arises of providing context to the
assembly language listing. What is the name of the subroutine at 0x8041043?
What instructions modify the data at 0x804cbd5? How many arguments were pushed
on the stack? All of this information is lost when a program is compiled. The
job of a good disassembler is to get it back.

The tasks themselves are simple enough to delineate:

   1. Find or guess local and global (imported) function names
   2. Find and label ASCII strings
   3. Maintain a list of cross references between code and data addresses

To provide more meaningful information about the disassembly, additional
high-level concepts can be brought in:

   4. Apply data structure definitions to appropriate addresses
   5. Apply symbolic constant names to appropriate immediate values
   6. Find or guess the data types for arguments and addresses
   7. Build function prototypes based on arguments and return values
   8. Build library dependency lists based on global/imported functions

And, finally, some accommodation for user changes should be provided:

   9.  Allow commenting of the disassembled listing
   10. Allow user definition of program sections, functions, names, data, etc.
   11. Allow patching and subsequent redisassembly of program bytes
	
Using these basic building blocks, it is possible for the experienced user to
use the disassembler to generate an assembly language representation of the
binary executable that can with a reasonable amount of effort be translated
into a higher-level language such as C or FORTRAN.
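
Task 3, the cross-reference list, amounts to a map from each referenced
address to the instructions that refer to it. The Python sketch below shows
one way such a table might be built; it does not describe the bastard's
database, and the function name, the tuple layout, and the listing are all
made up for the example.

```python
from collections import defaultdict

def build_xrefs(instructions):
    """instructions: iterable of (address, mnemonic, referenced address or
    None). Return a dict mapping each referenced address to the sorted list
    of (address, mnemonic) pairs that refer to it."""
    xrefs = defaultdict(list)
    for address, mnemonic, target in instructions:
        if target is not None:
            xrefs[target].append((address, mnemonic))
    return {target: sorted(refs) for target, refs in xrefs.items()}

# Hypothetical listing: the addresses and operands are invented.
listing = [
    (0x8041000, "push", 0x804CBD5),   # pushes the address of a string
    (0x8041005, "call", 0x8041043),
    (0x804100A, "mov",  0x804CBD5),
    (0x804100F, "ret",  None),
]
xrefs = build_xrefs(listing)
for ref_addr, mnemonic in xrefs[0x804CBD5]:
    print("0x804cbd5 referenced by %s at %#x" % (mnemonic, ref_addr))
```

With such a table, a question like "what instructions touch the data at this
address?" becomes a single lookup.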

  


================================================================================
 2: Using the CLI

The bastard is designed to operate as an interpreter: you can type commands
into it, script it, and pipe to it; in the future you will even be able to cat
binary files to it. In general, however, it will be controlled by a front-end
such as a script.

The utils/ directory contains a few sample shell scripts which can be used to
produce the disassembly of an ELF file to stdout:
   
   disasm.color.sh      produces a 'full' disassembly in ANSI color
   disasm.full.sh       produces a 'full' disassembly in mono
   disasm.text.sh       produces a disassembly of .text in mono
   
These scripts can be run from any location. The bastard installs itself to
/usr/local/bastard with a symlink in /usr/local/bin. Left at these settings, it
will correctly locate its shared libraries in /usr/local/bastard/lib; if its
home directory is moved, however, these libraries must be in LD_LIBRARY_PATH
as usual.
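
A front-end that launches a relocated bastard could set this variable itself.
The Python sketch below only builds the environment; it assumes the lib/
directory moved along with the home directory, and the function name and the
/opt/bastard path are made up for the example.

```python
import os
import posixpath

def bastard_env(home, base_env=None):
    """Return a copy of the environment with <home>/lib prepended to
    LD_LIBRARY_PATH. 'home' is wherever the bastard directory now lives."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = posixpath.join(home, "lib")
    current = env.get("LD_LIBRARY_PATH", "")
    # Keep any existing search path after the bastard's own lib directory.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + current if current else "")
    return env

print(bastard_env("/opt/bastard", {})["LD_LIBRARY_PATH"])
```

The returned dictionary can be handed to whatever call starts the program,
for example as the env argument of subprocess.run.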

When run as a shell -- e.g. by just typing `./bastard` -- the full CLI commands
can be used:

   LOAD [target [format [arch]]]  : Load and disassemble a file
   BDB  [bdbname]                 : Load previously-disassembled file

   DISASM [outfile]               : Print a disassembly of target
   DUMP                           : Print a hexdump of target
   STRINGS                        : Print list of found strings
   SECTION name                   : Print disassembly of section 'name'
   RANGE   start end              : Print disassembly of 'start' to 'end' rva
   HEADER                         : Print program file header
   
   DEBUG                          : Run target in a debugger
   EXEC   [command]               : Execute a shell command
   RUN    script                  : Run a .bc script file
   {{     \n script \n  }}        : Code and execute a script

   
   API                            : Display bastard.h in pager
   HOWTO                          : Display this HOWTO in pager
   MAN                            : Display the manual in pager

   QUIT                           : Exit the bastard

Some of these commands have one-character aliases for convenience:
      L LOAD               R RANGE
      B BDB                S SECTION
      D DISASM             A STRINGS
      U DUMP               Q QUIT
      H HEADER             M MAN
      ? HELP
Neither the commands nor their aliases are case-sensitive.

The LOAD command is the most important; its complete invocation is
    LOAD target  file_format  cpu_arch
and the parameters default to
    LOAD "./a.out"  "ELF" "i386"
...which is what the command will be if you just type "L" (though if the 
target name is passed on the command line, as in `./bastard /usr/bin/tr`,
it will replace "./a.out"). For Linux x86 ELF files, the syntax
    LOAD target
will be sufficient. For other files, "format" must be a .bc or a .so in
the formats/ directory, and "arch" must be a .bc or a .so in the arch/
directory. Note that only the base name should be specified, so that "ELF.bc"
and "libELF.so" both are referred to as "ELF".
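
The defaulting and base-name rules can be summed up in a short Python sketch
of the kind a front-end script might use to build a LOAD line. Both function
names are made up, and stripping a leading "lib" only from .so names is an
assumption drawn from the "libELF.so" example above.

```python
import posixpath

def module_base_name(filename):
    """Map a format/arch module file name to the name LOAD expects:
    'ELF.bc' and 'libELF.so' both become 'ELF'."""
    name = posixpath.basename(filename)
    for ext in (".bc", ".so"):
        if name.endswith(ext):
            name = name[:-len(ext)]
            if ext == ".so" and name.startswith("lib"):
                name = name[3:]       # drop the shared-library prefix
            break
    return name

def load_command(target="./a.out", fmt="ELF", arch="i386"):
    """Build a LOAD line, filling in the documented defaults."""
    return "LOAD %s %s %s" % (target, module_base_name(fmt), module_base_name(arch))

print(load_command())
print(load_command("/usr/bin/tr", "formats/libELF.so"))
```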

When a file is loaded, the display will look like this:

   =========================================================================
      B A S T A R D                    disassembly environment
      brought to you by the proud folks at the HCU linux forum
   ;>l
   Disassembling named symbols. Instruction stack:
   --------------/

The "---" line displays a visual stack representing the nesting of the program;
when it unwinds all the way, the disassembly will finish, and another prompt
will appear. Disassembly time is usually 2-3 minutes per MB.

At this point, various information about the file can be displayed with 'H'
(file header), 'A' (strings), 'D' (disassembly), and so on. Further modification
of the file requires running external scripts or using the API directly, and 
will be treated in other sections.

To exit the disassembler, type 'Q' for QUIT. The .bdb file for the target will
be saved automatically, for future use. 
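
Since the CLI can be piped to, a whole session can be driven from a script.
The Python sketch below builds the command text for a load, disassemble, and
quit sequence. The function names and the output file name are made up, and
run_session assumes that an executable called "bastard" is on the PATH and
reads its commands from standard input; it is defined but not called here.

```python
import subprocess

def session_script(target, outfile):
    """Build the command text for one non-interactive session."""
    return "LOAD %s\nDISASM %s\nQUIT\n" % (target, outfile)

def run_session(script, exe="bastard"):
    """Pipe 'script' to the bastard on stdin and return what it printed.
    Assumes 'exe' is on PATH and reads CLI commands from standard input."""
    result = subprocess.run([exe], input=script, capture_output=True, text=True)
    return result.stdout

script = session_script("/usr/bin/tr", "tr.lst")
print(script, end="")
# run_session(script) would start the real program, so it is not called here.
```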


================================================================================
 3: Using the API

    The Bastard API
    ---------------
     bastard.h

    Calling API Routines in the CLI
    -------------------------------

    Using Multiline Scripts
    -----------------------

    Running External Scripts
    ------------------------

    Running Plugins
    ---------------
    
    Direct Extension Access
    -----------------------
     extension.h


 
================================================================================
 4: Working with the Disassembly

    Naming Addresses
    ----------------

    Adding Comments
    ---------------

    Identifying Strings
    -------------------

    Defining Functions
    ------------------

    Using Structures
    ----------------

    Constants and Data Types
    ------------------------

    Inline Functions
    ----------------


================================================================================
 5: Producing Output Files

The easiest way to produce an output file with the CLI is to use the DISASM
command:
   ;>D disasm.out
This will write the disassembly output to the file disasm.out.

A second method is to use the redirection operator (>) built into the CLI; this
will redirect stdout to an arbitrary file:
   ;>D > disasm.out
Note that the file will be appended to if it exists, not truncated; thus the >
operator acts more like >> in shell.

The final method for creating output files is to use the API routines directly:

   target_save_asm(char *filename);   /* Save as asm suitable for re-assembly */
   target_save_lst(char *filename);   /* Save as a disassembled listing */
   target_save_hll(char *filename);   /* Save as code in High Level Lang */
   target_save_hex(char *filename);   /* Save as a Hex Dump */
   target_save_diff(char *filename);  /* Save as a binary diff */
   target_save_binary(char *filename); /* Save as an executable */

Note that target_save_asm and target_save_hll are simply wrapper functions for
the file generation routines in the ASM and LANG extensions; target_save_diff
and target_save_binary are intended for binary patching, and are currently not
supported.


================================================================================
 6: Working with the Database
 

 The CLI DB Interface

 The DB API Interface

   
================================================================================
 A: CLI Tricks

 * Use pipes : |
   This works very well with grep, wc, head, and sed.

 * Use redirection : >

 * Use 'set' and 'show' to change runtime settings


================================================================================
 B: LGB

The lgb, aka Saraboos, is a very crude Tk front-end to the bastard residing in 
the utils directory; it is intended to simplify the very basic operations of 
the bastard. The usage is a pretty straightforward, menu-driven approach; those
not wishing to brave the CLI are invited to suffer the limitations of lgb (which
still provides a console for direct CLI command entry).


================================================================================
 C: SOB

The Son Of the Bastard -- an unfinished Gtk front-end to the bastard.
