================================================================================
B A S T A R D                                            disassembly environment


                                 T  F  M




================================================================================
 Contents

    1. Introduction
    2. Installation
    3. Quickstart
    4. Background
    5. Command Line Parameters
    6. Using the CLI
    7. The Disassembled Listing
    8. Interfacing 
    9. Scripting
   10. Extending
   11. The Bastard API
   12. Internal Representation
   13. Other Documentation
   14. Responsible Parties
   





================================================================================
 Introduction
Well, this is going to double as a manual and a spec ... bear with.



================================================================================
 Installation


 Installation from Source
 ------------------------
   1. Unpack the bastard
     tar -zxf bastard.tgz -C/usr/src
   2. Build the bastard
     cd /usr/src/bastard
     make install
     
     
     If problems occur, and you know they will, here are some manual
     building steps that should build the app correctly:

   1. Build typhoon
      cd /usr/src/bastard
      mkdir lib
      cd src/typhoon
      ./configure
      make install
      cd util
      cp dbdview ddlp tyexport tyimport /usr/src/bastard/utils

   2. Build EiC
      cd /usr/src/bastard
      mkdir include/script/eic
      cd src/EiC
      config/makeconfig
      make install
      cp lib/*.a /usr/src/bastard/lib
      cp include/eic* /usr/src/bastard/include/script
      cp -r include/* /usr/src/bastard/include/script/eic
      cd, mkdir...
      
   3. Build the .so
      cd /usr/src/bastard
      make libbastard

   4. Build the extensions
      cd /usr/src/bastard
      make modules

   5. Build the bastard
      cd /usr/src/bastard
      make bastard


 Installation from Binary
 ------------------------
   1. Untar the bastard
     tar -zxf bastard.tgz -C/usr/local
   2. Set up symlinks to access the bastard globally
     ln -sf /usr/local/bastard/bastard /usr/local/bin/bastard
     ln -sf /usr/local/bastard/lib/libbastard.so /usr/local/lib/libbastard.so
     ldconfig
   3. Link any desired utilities
     ln -sf /usr/local/bastard/utils/lgb /usr/local/bin/lgb
     ln -sf /usr/local/bastard/utils/disasm_full.sh /usr/local/bin/disasm_full
   4. Link the docs to your doc tree
     ln -sf /usr/local/bastard/doc /usr/doc/bastard



================================================================================
 QuickStart

The most common use of the bastard is to load an x86 ELF file and perform a
full disassembly on it; this is done by the 'LOAD' command [aliased to 'L'
to save typing]. The file can then be printed to STDOUT using the 'DISASM'
command [or 'D']. The following demonstrates this as performed on the file
/usr/bin/tr:

   [/home/work/bastard]$>./bastard
   ;>l /usr/bin/tr
   ;>d

Notice how the output fills the screen buffer rather quickly. The command
line has a pipe ['|'] metacharacter that will pass the output from a bastard
command to a shell command line; thus, it is possible to send output to 
various unixland commands:

   [/home/work/bastard]$>./bastard
   ;>l /usr/bin/tr
   ;>d | more
   ;>d | wc -l
   ;>d | grep call | head
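
How the CLI implements the pipe is not documented here. The following is a
minimal sketch in C of one way to do it, assuming the line is split at the
first '|' and everything to the right is handed to the shell through popen().
The names split_pipe() and send_to_shell() are hypothetical and are not part
of the bastard.

```c
#define _POSIX_C_SOURCE 200809L   /* for popen()/pclose() */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split a CLI line at the first '|'. The line is
 * modified in place so that it holds only the bastard command. Returns a
 * pointer to the shell command (leading blanks skipped), or NULL if the
 * line contains no pipe. */
char *split_pipe(char *line)
{
    char *bar = strchr(line, '|');
    char *end;

    if (bar == NULL)
        return NULL;
    *bar = '\0';                    /* terminate the bastard command */

    /* trim blanks between the bastard command and the pipe */
    end = bar;
    while (end > line && (end[-1] == ' ' || end[-1] == '\t'))
        *--end = '\0';

    /* skip blanks after the pipe */
    bar++;
    while (*bar == ' ' || *bar == '\t')
        bar++;
    return bar;
}

/* Hypothetical helper: feed text to a shell command through popen().
 * Returns the status reported by pclose(), or -1 if popen() failed. */
int send_to_shell(const char *text, const char *shell_cmd)
{
    FILE *p = popen(shell_cmd, "w");

    if (p == NULL)
        return -1;
    fputs(text, p);
    return pclose(p);
}
```

Splitting only at the first '|' means a line such as "d | grep call | head"
hands "grep call | head" to the shell, which then handles the second pipe
itself.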

The file can be saved using the target_save_asm() command:

   [/home/work/bastard]$>./bastard
   ;>l /usr/bin/tr
   ;>target_save_asm("tr.asm");

Disassembled files are saved in .bdb database files when unloaded, when
exiting the disassembler, or by calling target_save_db() or target_save_db_as():

   [/home/work/bastard]$>./bastard
   ;>l /usr/bin/tr
   ;>target_save_db_as("tmp.bdb");
   ;>q
   [/home/work/bastard]$>./bastard
   ;>target_load_bdb("tmp.bdb");
   ;>d

This last example also introduces the highly influential 'QUIT' command, or
'q' for short. Additional commands can be found by typing 'HELP' or '?'; the
more interesting ones are:

   LOAD     -- Load a target for disassembly
   BDB      -- Load a previously-disassembled target
   MAN      -- Display the bastard manual using 'less'
   DISASM   -- Display the disassembled listing of the target
   DUMP     -- Hex dump the target
   DEBUG    -- Run the target in gdb
   EXEC     -- The good old shell escape
   HEADER   -- Print the program header
   SECTION  -- Print the disassembly of a specific section
   STRINGS  -- Print all strings found in the target

Note that the DB can be examined from the command line with the DB command 
set:

   DB HELP
   DB COUNT [table]       -- Count number of records in 'table'
   DB DESC  [table]       -- Print the description [ala struct] of 'table'
   DB DUMP  [table]       -- Dump the contents of 'table'
   DB MAX   [table] [key] -- Print the maximum value of column 'key' in 'table'
   DB MIN   [table] [key] -- Print the minimum value of column 'key' in 'table'

In case you haven't guessed it by now, the pipe feature was added in order to
make dealing with the DB a lot easier.


The intent of the bastard is by and large to be controlled by a shell script
or a parent process [read "front end"]. The bastard is fully scriptable via
the command line, with all of its internal API commands as well as an
embedded C interpreter available via STDIN. As such, it can be controlled
simply by redirecting its input from a file:
   
   ./bastard < commands.in > asm.out
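
A front end can generate the input file itself. The sketch below writes one
using only commands shown in this section (l, target_save_asm, q). The helper
name write_command_file() and the file names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: write a command file that loads `target`, saves the
 * listing to `asm_out` and quits. Returns 0 on success, -1 on error. */
int write_command_file(const char *path, const char *target,
                       const char *asm_out)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    fprintf(f, "l %s\n", target);                        /* LOAD alias   */
    fprintf(f, "target_save_asm(\"%s\");\n", asm_out);   /* save listing */
    fprintf(f, "q\n");                                   /* QUIT alias   */
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_command_file("commands.in", "/usr/bin/tr", "tr.asm") produces a
file that can be fed to the bastard with the redirection shown above.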

Some sample shell scripts for controlling the bastard are provided in the 
directory $BASTARD_HOME/utils :

   utils/disasm.color.sh      -- demonstrates the 'colorized' mode  
   utils/disasm.full.sh       -- does a full disassembly
   utils/disasm.text.sh       -- disassembles the .text section

For more details, see the Disasm-HOWTO [printed by the 'HOWTO' command].



================================================================================
 Background

The action all begins when target_load() is used to open a file to disassemble. 
In the future, support may be provided for binary input via STDIN; this will
be made available through a command line option, and may provide the ability
to disassemble arbitrary sequences of bytes.

A .bdb directory is created in the current directory; this will contain the
database and temporary files needed during disassembly. When the database is
saved, the contents of .bdb will be tar-gz'ed into a file with a '.bdb'
extension. The target is copied to a binary image inside .bdb [1], a Target
database is created, and a record is added to the ADDRESS object table with
ADDRESS->PA = 0 and ADDRESS->size = sizeof(target). A TARGET struct is created
as well:

   struct TARGET{
      char* name;
      char* path;
      long  size;
      long  entry;
       ...
   } target;

The next step is to apply a file format structure to the file -- since an
executable is in essence a predictable structure composed of blocks of code and
data, different file formats will be considered different structures rather
than different disassembly operations. Hopefully this will make support for
future file formats more straightforward.

Each file format will be defined by a BC script containing the structure
definition and a brief read_header() routine for creating sections:

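 /* File Format Sample ------------------------------------------------ */
   #include <bastard.h>
   #include <stdlib.h>
   #include <stdio.h>
   #include <unistd.h>   /* read() */

   struct FILEFMT{
      ....
   } FileHeader;
   
   struct FILESEC{
      ....
   } FileSection;

   void read_header(void) {
      int flags;
      /* buffer for the raw file header */
      char *ptr = malloc( sizeof(struct FILEFMT) );
      /* FD is the global file descriptor of the binary image */
      read( FD, ptr, sizeof(struct FILEFMT));
      for (/* each section */) {
         /* schematic call; see sec_new() in the API section for the
            actual parameter list */
         sec_new( Section.Name, Section.start, Section.End, flags);
      }
      free( ptr );
   }
 /* EOF ----------------------------------------------------------------*/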

As you can see, the code for parsing the header is actually part of this
script -- this allows arbitrarily complex programs or files to be ported
without recompiling the disassembler. This could be used, in theory, for
writing unpackers.

When the File Format is selected using target_set_format(), the BC script for the
specified type is executed using a file descriptor to the binary image as a
global variable. A record is added to the SECTION table of the database for
each section created with sec_new; likewise, an ADDRESS record is added 
and sized for the start of each section. [2]

At this point, the target can be considered 'loaded' and ready for disassembly.
There are many disassembly options available, but in general one will begin
with a disassembly of whichever section contains the program entry point, 
starting of course at that entry point:

      

Notes:
   1. All of the bytes in the file must be retained in order to produce a 
      patched version of the program, or to do a manual disassembly. As such,
      a perfect duplicate of the original executable is stored as an image in
      the .bdb; all address objects will consist of an offset into this image
      and a size attribute to be used for reading data.
   2. Since a section is the only distinct block of code available at this
      point, the start of each section is used to break the original single 
      ADDRESS record up into smaller, component ADDRESS objects. Subsequent 
      disassembly will use functions, structures, code, and data objects to
      create further ADDRESS objects.
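
The splitting described in note 2 can be sketched in C as follows. This is
only an illustration: struct addr_rec is a hypothetical stand-in for an
ADDRESS record (an offset into the image plus a size), and
split_address_at() is not a bastard API routine.

```c
/* Hypothetical stand-in for an ADDRESS record: an offset into the image
 * (the PA) and a size. The real record layout is defined by the database. */
struct addr_rec {
    long pa;
    long size;
};

/* Split the record containing `pa` so that a record starts exactly at `pa`.
 * `recs` holds `n` records sorted by pa and has room for `cap` records.
 * Returns the new record count; returns `n` unchanged if a record already
 * starts at `pa`; returns -1 if the table is full or `pa` lies outside
 * every record. */
int split_address_at(struct addr_rec *recs, int n, int cap, long pa)
{
    int i, j;

    for (i = 0; i < n; i++) {
        long start = recs[i].pa;
        long end = start + recs[i].size;

        if (pa == start)
            return n;               /* already a record boundary */
        if (pa > start && pa < end) {
            if (n >= cap)
                return -1;
            /* shift the following records up by one slot */
            for (j = n; j > i + 1; j--)
                recs[j] = recs[j - 1];
            recs[i].size = pa - start;      /* first half  */
            recs[i + 1].pa = pa;            /* second half */
            recs[i + 1].size = end - pa;
            return n + 1;
        }
    }
    return -1;
}
```

Starting from a single record { 0, sizeof(target) } and calling this once per
section start yields one record per section, which is the state described
above for a freshly loaded target.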


For more info, see TheoryOfOp.txt



================================================================================
 Command Line Parameters
 
 bastard [-acdq] [-i file] [-o file] [-e file] [-s file] [-p prompt] [filename]
   -i [file]   -- use [file] instead of STDIN
   -o [file]   -- use [file] instead of STDOUT
   -e [file]   -- use [file] instead of STDERR
   -s [file]   -- run BC script [file] on target, then exit [disabled]
   -p [prompt] -- use [prompt] for interpreter prompt, e.g. "" for no prompt.
   -q          -- quiet [suppress unnecessary output]
   -a          -- annoy user ["do you want to save?", etc]
   -c          -- colorize output
   -d          -- use debugging code
   --          -- take code to assemble from STDIN [disabled]
   
   filename    -- file or database to open



================================================================================
 Using the Bastard CLI

 --Readline line-editor commands--

 --'Native' commands and aliases--

  Command   Alias   Parameters                 Description
  ----------------------------------------------------------------------------
   HELP      ?                                  Give command summary
   API                                          Print API.txt
   BDB       B      filename                    Open .bdb file
   DB               command                      N/A 
     HELP                                       Give DB command summary
     COUNT          table                       Print # records in table
     DESC           table                       Print definition for table
     DUMP           table                       Print contents of table
     FIND           table key val               Not operational
     MAX            table key                   Print last row of table
     MIN            table key                   Print first row of table
   DEBUG                                        Run target in debugger
   DISASM    D      [output file]               Print disassembly of target
   DUMP      U                                  Print hex dump
   EXEC      !      [command]                   Run command in shell
   HEADER    H                                  Print file header
   HOWTO                                        Print Disasm-HOWTO
   INTCODE   I                                  Print intermediate code
   LOAD      L      [target] [format] [arch]    Load & Disassemble target
   MAN       M                                  Print bastard.txt
   QUIT      Q                                  Exit the bastard
   RANGE     R      start end                   Print disassembly of range
   RUN              script                      Run BC script
   SECTION   S      name                        Print Disassembly of section
   SET              var                         Set bastard ENV flag
   SHOW             var                         Show bastard ENV flag
   STRINGS   A                                  Print strings in target
   {{                                           Write & execute a script
   
 --Setting Options--
   'set' shows all options
   'set option-name' toggles 'option-name' on and off
   'show' shows the settings of all options
   'show option-name' shows the setting of 'option-name'

 --Config File--
   This is stored as ~/.bastard/bastard.rc or /etc/bastard.rc
   Blank lines and lines beginning with a comment char (#) are ignored
   There are two types of configurable settings: "settings" and "options"

   Settings are any of the members of disasm_env, disasm_prefs, and target.
   These are specified in the config file with
      'struct.member=value'
   e.g., to set the prompt p1 
      'disasm_env.p1=#'

   Options are any of the values for disasm_prefs.options. These are
   specified in the config file with
      'option.option-name=1'
   ...where option-name is the name of the option, e.g.
      'option.paginate=1'
    Note that 'option' and '1' can actually be any value; any line not
    interpreted as a setting is tried as an option, and options are toggled on
    and off -- not set. Since all options are off by default this is not as
    bad as it sounds.

    All entries are case-insensitive, and spaces around the '=' are allowed.
    Bad options will result in an error message by the CLI.
    Here is a list of all possible config file values:
         # bastard.rc : Items which are commented are best left that way 
         #    Target Options: These will be default values. Currently Disabled
         target.info.endian
         target.info.size
         target.info.entry
         target.info.name
         target.info.path
         target.info.dbname
         target.arch.name
         target.assembler.name
         target.comp.name
         target.format.name
         target.lang.name
         target.os.name
         #    Disassembler Environment: State of the bastard
         disasm_env.home
         disasm_env.p1
         disasm_env.p2
         disasm_env.p3
         disasm_env.p4
         disasm_env.dbpath
         disasm_env.output
         # disasm_env.flags
         # disasm_env.endian
         #    Disassembler Preferences: Various User Preferences 
         disasm_prefs.options
         disasm_prefs.pager
         disasm_prefs.editor
         disasm_prefs.debugger
         # disasm_prefs.num_hist
         #    Disassembler Options: Runtime User Options/Flags (SET & SHOW)
         options.disable_exec
         # options.quit_now
         options.paginate
         options.color_tty
         options.quiet
         options.annoy_user
         options.debug_time
         # options.disasm_redo
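
The line rules for the config file (blank lines and '#' comments ignored,
'key=value' entries, case-insensitive keys, optional spaces around '=') can be
sketched in C as below. This is not the bastard's own parser. The name
rc_parse_line() is hypothetical, and two details are assumptions: leading
whitespace before a '#' is tolerated, and only the key is lower-cased while
the value is kept verbatim.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical parser for one bastard.rc line. Returns 0 for a blank or
 * comment line, 1 when key and value were extracted, and -1 for a line
 * without '=' or with an empty key, or when a buffer is too small. */
int rc_parse_line(const char *line, char *key, size_t klen,
                  char *val, size_t vlen)
{
    const char *eq, *kend, *vend;
    size_t i, n;

    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0' || *line == '#')
        return 0;                       /* blank line or comment */

    eq = strchr(line, '=');
    if (eq == NULL)
        return -1;

    /* key: text before '=', trailing blanks trimmed, lower-cased */
    kend = eq;
    while (kend > line && isspace((unsigned char)kend[-1]))
        kend--;
    n = (size_t)(kend - line);
    if (n == 0 || n >= klen)
        return -1;
    for (i = 0; i < n; i++)
        key[i] = (char)tolower((unsigned char)line[i]);
    key[n] = '\0';

    /* value: text after '=', blanks trimmed on both sides */
    eq++;
    while (isspace((unsigned char)*eq))
        eq++;
    vend = eq + strlen(eq);
    while (vend > eq && isspace((unsigned char)vend[-1]))
        vend--;
    n = (size_t)(vend - eq);
    if (n >= vlen)
        return -1;
    memcpy(val, eq, n);
    val[n] = '\0';
    return 1;
}
```

With this sketch, "Disasm_Env.P1 = #" yields the key "disasm_env.p1" and the
value "#". Deciding whether the key names a setting or an option is left to
the caller.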


 --Dealing with the database--
   `db dump address | grep 804089AC`
   `db dump code | grep 804089AC`
 
 --Dealing with the disassembler--
    Environment Variables:
     PAGINATE
     COLOR_TTY
     QUIET
     ANNOY_USER
     DISABLE_EXEC
   `d | grep 0x804089AC`
   `d | more` 
   

 --The Instruction Stack--
The "Instruction Stack" appears when disasm_forward is called, unless the
QUIET flag is set. It prints a character to STDOUT for each recursion; i.e.,
when following a program jump or subroutine call. The characters change during
the course of disassembly to give an animation that is at once neither useful
nor entertaining. The characters are as follows:

   "+" = enter new stack level
   "|" = Step 1: Disassemble current instruction 
   "/" = Step 2: Analyze current instruction
   "-" = Step 3: Follow branch of execution [if applicable]
   "\" = Step 4: Advance to the next instruction

Each iteration of disasm_forward cleans its character from the screen 
using a backspace character; this represents a successful disassembly.
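
As an illustration only, the character sequence for one stack level might be
produced as in the C sketch below. It assumes that each step overwrites the
previous character by emitting a backspace first, and that the final backspace
is the "cleaning" described above. The name render_stack_level() is
hypothetical; the actual output code may differ.

```c
#include <stddef.h>

/* The five indicator characters listed above, in order. */
static const char stack_chars[] = { '+', '|', '/', '-', '\\' };

/* Hypothetical sketch: render the indicator for one stack level into `out`
 * as a NUL-terminated string. Returns the number of bytes written (not
 * counting the NUL), or -1 if `out` is too small. */
int render_stack_level(char *out, size_t len)
{
    size_t i, pos = 0;
    size_t n = sizeof stack_chars;

    for (i = 0; i < n; i++) {
        if (i > 0) {                    /* step back over previous char */
            if (pos + 1 >= len)
                return -1;
            out[pos++] = '\b';
        }
        if (pos + 1 >= len)
            return -1;
        out[pos++] = stack_chars[i];
    }
    if (pos + 1 >= len)
        return -1;
    out[pos++] = '\b';                  /* clean up: level is finished */
    out[pos] = '\0';
    return (int)pos;
}
```

Writing the resulting string to a terminal with fputs() shows the five
characters in turn in a single column and leaves the cursor on that column.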


 --Single-line BC scripts--
Any arbitrary text typed into the CLI which is not recognized as a CLI
command will be sent to the EiC interpreter for handling. This means that
arbitrary C code can be executed:
 
   ;> char string[] = "...Test String\n";
   ;> printf(string);
   ...Test String 
   ;>

This is mostly useful for executing bastard API routines directly:

   ;> target_load("/usr/bin/tr");
   ;> target_set_format("ELF");
   ;> target_set_arch("i386");
   ;> target_set_lang("C");
   ;> target_set_asm("intel");
   ;> target_apply_format();
   ;> disasm_section(".text");

More complex scripting, including loops and conditionals, will tend to take
more than a single line.


 --Multi-line BC scripts--
A script can span multiple lines if it is begun with the "{{" command;
when "{{" is entered on a line all on its own, the CLI enters multiline
script mode until a line is encountered containing only "}}". All text
between the double braces will be inserted into a dummy script as a main()
taking no arguments; the script will be executed when the closing "}}" braces
are entered, and will then be discarded.

   ;>{{
    -int x;
    -for (x = 0; x < 5; x ++) {
    -   printf(" + ");
    -}
    -printf("\n");
    -}}
     +  +  +  +  + 
   ;>

Note that the multiline script mode is indicated by the " -" prompt.


================================================================================
 The Disassembled Listing

 Notational Conventions:
 <08040000[x]        -- This address references or is referenced by another.
                        Legend:
                         < or >  : Reference is to (<) or from (>) this address
                         08040000: The rva referencing/referenced by this one
                         [x|r|w] : Type of reference: execute, read, or write
 (8040000 was 1233)  -- The relative offset 1233 was converted to the rva
                        8040000
 str_*:              -- This address is the start of a string.
 StrRef:             -- Operand is a reference to a string
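
For front ends that need to produce or recognize the cross-reference
annotation, the C sketch below formats one according to the legend above:
direction character, eight hex digits, and a bracketed type. The name
format_xref() is hypothetical, and the use of upper-case hex digits is an
assumption.

```c
#include <stdio.h>

/* Hypothetical helper: format an annotation such as "<08040000[x]".
 * `to_this` selects '<' (non-zero) or '>' (zero); `type` must be one of
 * 'x', 'r' or 'w'. Returns the string length, or -1 on a bad type or a
 * buffer that is too small. */
int format_xref(char *buf, size_t len, int to_this,
                unsigned long rva, char type)
{
    int n;

    if (type != 'x' && type != 'r' && type != 'w')
        return -1;
    n = snprintf(buf, len, "%c%08lX[%c]", to_this ? '<' : '>', rva, type);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For example, format_xref(buf, sizeof buf, 1, 0x08040000UL, 'x') fills buf with
the annotation shown at the top of the legend.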



================================================================================
 Interfacing With The Bastard

Detailed Information can be found in FrontEndProg.txt



================================================================================
 Scripting The Bastard

 [basically a discussion of EiC and such]
 Detailed information about scripting can be found in BCScript.txt
 C Syntax
 EiC Gotchas



================================================================================
 Extending the Bastard

 
 Adding API Routines
 -------------------
  Add to bastard.h or another header file
  Add wrappers to src/eic_wrappers.c

 Adding Database Tables
 ----------------------
  Add the table to bdb.ddl
  Use ddlp to generate a new .bdb and .h
  Run utils/fix_bdb.sh on the include
  Add a Get$(TABLE)Object() routine to src/db.c
  Add an EiC wrapper for the GetObject() routine
  Add a Print Routine to src/cli/cli_db.c
  Add a sequence to include/db.h and init it in src/db.c
 
 New Databases
 -------------
  Create a .ddl
  Add to Makefile
 
 Writing Extensions
 ------------------
  Detailed information can be found in ExtensionProg.txt




================================================================================
 The Bastard API


   1.  Basic API Routines
   2.  Names
   3.  Comments
   4.  Strings
   5.  Structures
   6.  Constants
   7.  Sections
   8.  Functions
   9.  Imports and Exports 
   10. Advanced API Routines
   
   
   
   
 Detailed information on the exported API functions in bastard.h
 + The following naming conventions are used:
       rva       Relative Virtual Address -- the address of a location of
                   code or data once the program has been loaded by the OS
       pa        Physical Address -- the offset of a location of code or
                   data within the executable file/image.

 + Most routines return 0 on error or failure; the reason for the error can
   be found by calling sys_get_lasterr();
   
 + Routines that take character arrays as arguments generally do not require
   the length of the array. This is because the strings in the DB are of
   fixed length [see bdb.h], and thus it is simple enough for the caller to
   know the size of array they should use. Naturally this goes against all
   sane security practice; however anyone incorporating a program such as
   the bastard into their firewalling policy is pretty much beyond saving.
   Be warned: you are expected to know what you are doing.
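
To make the rva/pa distinction concrete, here is a small C sketch of
converting between the two with a section table. It assumes each section maps
linearly from its start rva to its pa, using the start, size and pa values
that sec_new() takes. struct sec_map and both functions are hypothetical;
this is not how addr_pa() and addr_rva() are necessarily implemented.

```c
/* Hypothetical section descriptor: start rva, size in bytes, and the
 * offset of the section within the file image (pa). */
struct sec_map {
    long start;
    long size;
    long pa;
};

/* Convert an rva to a pa. Returns -1 if the rva lies in no section. */
long rva_to_pa(const struct sec_map *secs, int n, long rva)
{
    int i;

    for (i = 0; i < n; i++) {
        if (rva >= secs[i].start && rva - secs[i].start < secs[i].size)
            return secs[i].pa + (rva - secs[i].start);
    }
    return -1;
}

/* Convert a pa to an rva. Returns -1 if the pa lies in no section. */
long pa_to_rva(const struct sec_map *secs, int n, long pa)
{
    int i;

    for (i = 0; i < n; i++) {
        if (pa >= secs[i].pa && pa - secs[i].pa < secs[i].size)
            return secs[i].start + (pa - secs[i].pa);
    }
    return -1;
}
```

With a section starting at rva 0x08049000 and stored at pa 0x1000, the rva
0x08049004 corresponds to pa 0x1004.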


A lot of the operations in the API can be performed using the DB commands;
the majority of the API consists of wrappers and combined/oft-used functions.

   1.  Basic API Routines
   
   
      Loading the Target
   int target_load( char *target );
   int target_load_bdb( char *dbname );
   
      Loading Extensions Specific to Target
   int target_set_format( char *file_format );
   int target_set_arch( char *file_arch );
   int target_set_asm( char *asm_output );
   int target_set_lang( char *language );
   int plugin_load( char *name );
   
      Saving the Target
   void target_save_db( );
   void target_save_bak( );
   void target_save_db_as( char *filename );
   int target_save_asm( char *filename );
   int target_save_lst( char *filename );
   int target_save_hex( char *filename );
   int target_save_diff( char *filename );
   int target_save_binary( char *filename );

      Closing and Exiting
   void target_close_db( );
   void sys_quit( ) ;

       Disassembly Routines:
   int target_set_header( char *header );
   char * target_header( );
   int target_apply_format( void );
   int disasm_target(char *disasm, void *param);
   int disasm_section( char *name );
   int disasm_range( long rva, int size );
   int disasm_forward( long rva );
   int disasm_pass( int pass );
   int disasm_all_passes();

      Target Architecture Settings
   int disasm_prologue(struct code **c);
   int disasm_epilogue(struct code **c);
   int disasm_byte_order();
   int disasm_addr_size();
   int disasm_byte_size();
   int disasm_word_size();
   int disasm_dword_size();
   int disasm_get_sp();
   int disasm_get_ip();

      Communicating with the User 
   void sys_msg( const char *str ) ;
   void sys_errmsg( int num, const char *str ) ;
   
      ADDRESS Object Manipulation Routines:
   int addr_is_valid( long rva );
   int addr_is_valid_code( long rva );
   int addr_exists( long rva );
   long addr_find_closest( long rva );
   int addr_new( long rva, int size, long pa, int flags );
   int addr_del( long rva );
   int asmsprintf(char *buf, char *format, struct address *addr);
   int addr_print( long rva );
   int addr_make_code( long addr );
   int addr_make_data( long addr );
   long addr_pa( long rva );
   long addr_rva( long pa );
   int addr_type( long rva );
   int addr_size( long rva );
   int addr_flags( long addr );
   int addr_bytes( long rva, char *buf );
   long addr_next( long rva );
   long addr_prev( long rva );
   int addr_set_flags( long rva, int flags );

      CODE Object Manipulation
   int code_new(long rva, struct code *c);
   int code_del(long rva);
   long addr_next_code( long rva );
   long addr_next_data( long rva );
   int code_sprint( char *buf, char *fmt, long rva ) ;
   int code_opcmp(long op1, int flag1, long op2, int flag2);
   int code_effect_new(long rva, int reg, int change);
   int addrexp_new(int scale, int index, int base, int disp, int flags);
   int addrexp_get( int id, struct addr_exp *exp );
   int addrexp_get_str( int id, char *string, int len);

      Cross References
   long xref_to( long rva, int type ) ;       //get first
   long xref_to_next( int type ) ; //get array of structs; return num
   long xref_from( long rva, int type ) ;
   long xref_from_next( int type ) ;
   int xref_new( long from, long to, int type );


   2.  Names
      blahblah name
   int name_new( long rva, char* name, int type );
   int name_get( long rva, char* buffer );
   int name_get_type(long addr);
   long addr_get_by_name( char *name );

   3.  Comments
      blahblahcomment
   int addr_add_comment( long rva, char *buf );
   int addr_comment( long rva );
   int addr_set_comment( long rva, int id );
   int comment_get( int id, char *buf );
   int comment_new( char *buf, int type );
   int comment_change( int id, char *buf );
   int comment_del( int id ) ;

   4.  Strings
   int str_get( long rva, char *buf, int len );
   int str_get_raw(long rva, char **buf);
   int str_new( long rva );
   int str_print( ); /* print all strings/addresses to stdout */
   int str_find_in_range(long rva, int size, int type);

   5.  Structures
   int struct_new( char *name, short size );
   int struct_add_member( int struct_id, long type, int size, int order, 
                           char *name );
   int struct_get( unsigned long id, struct structure *s);
   unsigned long struct_get_id(char *name);
   int struct_get_member( int struct_id, int order, struct struct_member *m) ;
   int struct_apply(long rva, unsigned long id);
   int struct_del( unsigned long id);
   int struct_member_del( unsigned long id);


   6.  Constants
   int const_new( char *name, int value);
   int const_get_by_val(int value, char buf[32]);
   int const_get_by_valNext( char buf[32]);
   int const_get_name( unsigned long id, char buf[32]);
   unsigned long const_get_by_name( char *name);
   int const_del( unsigned long id);

   7.  Sections
   int sec_new( char *name, long start, int size, long pa, int type );
   int sec_del( char *name );
   int sec_get( long rva, struct section *s);
   long sec_get_start( char *name );
   long sec_get_end( char *name );
   int  sec_get_by_rva( long rva, char* buf );
   int  sec_flags( char *name );
   int  sec_set_start( char *name, long start );
   int  sec_set_end( char *name, long end );
   int  sec_rename( char *name, char *new_name );
   int  sec_set_flags( char *name, int flags );

   8.  Functions
   int func_new( long rva, char * name, int size );
   int func_get_name( long rva, char *buf );
   int func_get_start( long rva );
   int func_get_end( long rva );
   int func_get_by_name( char *name );
   int func_set_name( long rva, char *name );
   int func_set_size( long rva, int size );

   9. Imports and Exports

   int imp_new( long rva, char *lib, char* name, int type );
   int imp_get_name( long rva, char *buf );
   int imp_get_lib( long rva, char *buf );
   int imp_type( long rva );
   int imp_print( );
   int lib_new( char *name, int v_hi, int v_lo );
   int lib_print( );   /* print all libraries to stdout */
   int exp_new( long rva, char *name );
   int exp_get( char *name );
   int exp_print( ); /* list all entry points */

   10. Advanced API Routines

      Playing with the Target
   int sys_exec( char *args );
   int sys_debug( );

      DATA_TYPE Manipulation
   int dtype_new(char *name, int size, int flags) ;
   int dtype_del(char *name);
   int dtype_get(char *name);

      BC Scripts and Macros
   int script_file_exec( char *name );
   int script_text_exec( char *script );
   int script_compiled_exec(char *name);
   int macro_new( char *name, char *macro );
   int macro_del( char *name );
   int macro_exec( char *name );
   int macro_record( char *name );

      User Preferences
   int env_set_pager(char *name);
   int env_set_editor(char *name);
   int env_set_debugger(char *name); 

      Disassembler Environment
   char * env_get_home();
   int target_fd();
   char * target_image();
   int target_size();
   char * env_get_pager();
   char * env_get_editor();
   char * env_get_debugger(); 
   struct DISASM_ENV   * env_get_env();
   struct DISASM_PREFS * env_get_prefs();
   struct DISASM_TGT   * env_get_target();
   struct DISASM_ENG   * env_get_engine();
   struct DISASM_ASM   * env_get_asm();
   struct DISASM_LANG  * env_get_lang();
   int env_clear();
   int env_clear_target();
   int env_defaults();
   int env_target_defaults();
   int env_pref_defaults();
   int env_get_flag( int flag );
   int env_set_flag( int flag );
   int env_clear_flag( int flag );
   int env_get_option( int flag );
   int env_set_option( int flag );
   int env_clear_option( int flag );
   void env_tty_asm( char *fmt );
   void env_tty_data( char *fmt );
   void env_file_asm( char *fmt );
   void env_file_data( char *fmt );
   void env_lpr_asm( char *fmt );
   void env_lpr_data( char *fmt );
   void sys_re_init(char *home);

      Error Handling
   int sys_set_lasterr( int errnum );
   int sys_get_lasterr( );
   char* sys_lasterr_str( );
   void sys_print_errmsg(int errnum);
   void sys_sprint_errmsg(char *buf, int errnum);

      Low Level DB Access
   void* db_save_state();
   int db_restore_state(void *state);
   int db_index_first(int index, void *dest);
   int db_index_next(int index, void *dest);
   int db_index_prev(int index, void *dest);
   int db_index_last(int index, void *dest);
   int db_index_find(int index, void *value, void *dest);
   int db_table_first(int table, void *dest);
   int db_table_next(int table, void *dest);
   int db_table_prev(int table, void *dest);
   int db_table_last(int table, void *dest);
   int db_record_insert(int table, void *src);
   int db_record_update(int index, void *value, void *src);
   int db_record_delete(int index, void *value);
   int db_find_closest_prev(int index, void *value, void *dest);
   int db_find_closest_next(int index, void *value, void *dest);

      Quick Access to Objects by Primary Index
   struct address * GetAddressObject(long rva);
   struct code * GetCodeObject(long rva);
   struct addr_exp * GetAddrExpObject(int id);
   struct code_effect * GetCodeEffectObject(int id);
   struct section * GetSectionObject(char *name);
   struct xref * GetXrefObject(int id);
   struct int_code * GetIntCodeObject(int id);
   struct fin_code * GetFinCodeObject(int id);
   struct export_addr * GetExportObject(long rva);
   struct import_addr * GetImportObject(long rva);
   struct string * GetStringObject(long rva);
   struct name * GetNameObject(long rva);
   struct comment * GetCommentObject(int id);
   struct function * GetFunctionObject(int id);
   struct func_param * GetFuncParamObject(int id);
   struct func_local * GetFuncLocalObject(int id);
   struct func_effect * GetFuncEffectObject(int id);
   struct f_inline * GetFInlineObject(int id);
   struct inline_type * GetInlineTypeObject(int id);
   struct structure * GetStructureTypeObject(int id);
   struct struct_member * GetStructMemberObject(int id);
   struct data_type * GetDataTypeObject(int id);
   struct constant * GetConstantObject(int id);
   struct bc_macro * GetBCMacroObject(int id);
   



 
================================================================================
 Internal Representation


   The .BDB File
   -------------
Once disassembled, the target is represented by a .bdb file, which is simply
a tarball containing information about the target, an image of the target, and
a database containing the disassembly of the target. Note that plugins, scripts
and users can add arbitrary files and directories to a .bdb without impacting
the bastard's use of the file.

A typical BDB contains the following files:

   root@localhost >tar -ztf a.out.bdb
      ./.a.out.bdb/$TABLE.db           /* typhoon data file for $TABLE */
      ./.a.out.bdb/$TABLE.ix?          /* typhoon index file for $TABLE */
      ./.a.out.bdb/sequences.dat       /* typhoon sequences file */
      ./.a.out.bdb/a.out               /* image of target */
      ./.a.out.bdb/.info               /* target loading info */
      ./.a.out.bdb/header.txt          /* ASCII representation of header */

Note that the .bdb file is decompressed into ./.$FILENAME while the target is
loaded. Also, a copy of the target is stored in the .bdb; this prevents the 
need to store a copy of the binary data for each program address in the DB,
which would require a storage area at least double the size of the target.


   The Target Structure
   --------------------
The target of the disassembly is represented by a single, currently global
structure:

   struct DISASM_TGT {   
      int      status;        
      int      endian;         
      int      FD;           
      char *   image;         
      int      size;               
      long     entry;            
      char     arch[32];        
      char     assembler[32]; 
      char     format[32];     
      char     lang[32];       
      char *   header;     
      char     name[64];   
      char     path[PATH_MAX];
      char     dbname[PATH_MAX];   
   } target;

At some point in the future this may be expanded to a linked list of target
structures in order to allow an executable and all of its required libraries
to be examined simultaneously; however this is hardly necessary for standard
disassembly and decompilation.
   
Most of the fields are pretty straightforward: name, path, and dbname refer to
the original location of the target and the name of its .bdb file; header points
to an ASCII representation of the file header (created by the FORMAT extension)
in memory; status contains the current status of the target disassembly; endian
contains the byte order of the target (0 = big endian, 1 = little endian);
entry contains the entry point of the target (set by the FORMAT extension); and
size contains the size of the target in bytes.

The arch, assembler, format, and lang fields correspond to the ARCH, ASM,
FORMAT, and LANG extensions; these allow specific extension modules to be
associated with the target, and loaded when the target is loaded. FD and image 
hold a file descriptor to an opened copy of the target, and a pointer to the 
start of a memory mapped image of that copy.

When the target or the .bdb is loaded, the .bdb copy of the target is memory
mapped to a location pointed to by target.image; to read the raw data contained
at an address in the file --as opposed to the interpreted data such as strings
or code-- the disassembler uses the Physical Address (PA) of the ADDRESS object
as an offset into target.image; a number of bytes equal to the size of the 
ADDRESS object is then read from that offset.
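
The raw read described above can be sketched as follows. This is only an
illustration: read_raw_bytes is a hypothetical helper, not a Bastard API
routine, and its image and image_size parameters stand in for target.image and
target.size. The bounds check is an added assumption; the manual does not say
how out-of-range addresses are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Bastard API: copy the raw bytes of
 * one address out of a memory-mapped image. 'pa' and 'size' stand in for
 * the pa and size fields of an ADDRESS object. Returns the number of bytes
 * copied, or -1 if the request runs past the end of the image. */
int read_raw_bytes(const char *image, size_t image_size,
                   unsigned long pa, unsigned short size,
                   unsigned char *out)
{
    assert(image != NULL && out != NULL);

    /* refuse reads that would run past the end of the image */
    if (pa > image_size || size > image_size - pa)
        return -1;

    memcpy(out, image + pa, size);
    return (int)size;
}
```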


   The Target in the DB
   --------------------
In the Bastard Database (BDB), the target is represented as a series of memory
addresses with the following structure:

   struct address {
      ulong rva,                    -- relative virtual address
      ulong pa,                     -- physical address
      ushort size,                  -- number of bytes in ADDRESS
      int flags,                    -- type of ADDRESS [code, data...] 
      ulong dataType,               -- id of DATA_TYPE applied to this addr
      int   dataSize,               -- Size of Data Type (>1 = array)
      ulong dataConst,              -- id of CONSTANT associated with ADDRESS
      ulong structure,              -- id of STRUCTURE associated with ADDRESS
      ulong comment                 -- id of COMMENT associated with ADDRESS
   }
      Address Flags:
         ADDR_CODE          0x01
         ADDR_DATA          0x02
         ADDR_IMPORT        0x10
         ADDR_EXPORT        0x20
         ADDR_SECTION       0x40
         ADDR_STRUCT        0x100
         ADDR_STRING        0x200
         ADDR_FUNCTION      0x400
         ADDR_INLINE        0x800
         ADDR_NAME          0x1000
         ADDR_COMMENT       0x2000
         ADDR_SYMBOL        0x4000

When the target is first loaded, a single ADDRESS object of target.size
bytes at RVA 0, PA 0 is created. At this point, the read_header routine of the
FORMAT extension associated with the target is invoked; its job is to create
an ADDRESS and a SECTION object for each section in the target:

   struct section {
      char *name,                   -- name of SECTION
      ulong rva,                    -- starting ADDRESS of SECTION
      int flags,                    -- SECTION type and permissions
      ulong size                    -- number of bytes in SECTION
   }

      Section Flags:
         SECTION_EXECUTE    0x01
         SECTION_WRITE      0x02
         SECTION_READ       0x04
         SECTION_HDR        0x1000
         SECTION_CODE       0x1100
         SECTION_DATA       0x1200
         SECTION_RSRC       0x1300
         SECTION_IMPORT     0x1400
         SECTION_EXPORT     0x1500
         SECTION_DEBUG      0x1600
         SECTION_CMT        0x1700
         SECTION_RELOC      0x1800

The ADDRESS object will have the RVA, PA, and size of the SECTION object; it 
will be broken up into smaller addresses representing code and data during the
actual disassembly. The SECTION objects represent the underlying structure of
the file, and will be used during disassembly and .asm file output to break
the file into logical sequences of addresses; a SECTION may be code or data,
and a target may consist of only binary data with no code to disassemble
(which would be the case in disassembling a binary data file, such as a 
database or an application data file).
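
As a small illustration of the section flags, the sketch below renders the
permission bits of a SECTION as an "rwx"-style string. The flag values are the
ones listed above; section_perm_str is a hypothetical helper, and the sketch
assumes the permission bits are OR'ed into the same flags field as the section
type.

```c
#define SECTION_EXECUTE 0x01
#define SECTION_WRITE   0x02
#define SECTION_READ    0x04

/* Hypothetical helper, not part of the Bastard API: write the permissions
 * encoded in a SECTION's flags into 'out', which must hold 4 bytes. */
void section_perm_str(int flags, char *out)
{
    out[0] = (flags & SECTION_READ)    ? 'r' : '-';
    out[1] = (flags & SECTION_WRITE)   ? 'w' : '-';
    out[2] = (flags & SECTION_EXECUTE) ? 'x' : '-';
    out[3] = '\0';
}
```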

At this point arbitrary addresses can be created within the target; while 
ADDRESS objects should be contiguous within a section, this is not enforced.

ADDRESS objects can be given symbolic labels or NAME objects:

   struct name {
      ulong rva,                    -- ADDRESS this NAME applies to
      int type,                     -- type of NAME
      char *name                    -- NAME
   }

      Name Types:
         NAME_AUTO          0x01
         NAME_USER          0x02
         NAME_SUB           0x10
         NAME_SYMBOL        0x20

NAME objects create a namespace which is separate from SECTION, CONSTANT,
FUNCTION local variable, FUNCTION parameter, and STRUCTURE member names, as
well as from COMMENT text; they will, however, collide with the names of
IMPORT and FUNCTION objects, which actually reference NAME objects for their
symbolic labels.

Note that the type of a NAME is important; names which are generated by the
user (NAME_USER) or which are derived from symbols in the file header 
(NAME_SYMBOL) are protected in the parts of the bastard which automatically
generate names; such names must be explicitly changed by the user or a script.
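
A minimal sketch of that protection rule, assuming the name type is tested as
a bit mask (the listed values suggest this, but the manual does not say so).
name_is_protected is a hypothetical helper, not a Bastard API routine.

```c
#define NAME_AUTO    0x01
#define NAME_USER    0x02
#define NAME_SUB     0x10
#define NAME_SYMBOL  0x20

/* Hypothetical check: returns 1 if a NAME of this type must not be
 * replaced by automatic name generation, 0 if it may be. */
int name_is_protected(int type)
{
    return (type & (NAME_USER | NAME_SYMBOL)) != 0;
}
```

An automatic naming pass would call this before renaming and leave the
existing NAME alone whenever it returns 1.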

ADDRESS objects can also be assigned a comment:

   struct comment {
      ulong id,                     -- unique number in sequence
      int type,                     -- type of COMMENT
      char text[]                   -- COMMENT
   }

      Comment Types:
         CMT_AUTO           0x01
         CMT_USER           0x02

Comments will be truncated at 256 characters; since a comment is referenced by
ID and not by RVA, it is possible for multiple addresses to share a comment. In
addition, different objects which share the same RVA --such as ADDRESS and 
FUNCTION objects-- can have separate comments.


In an executable, it is expected for addresses to reference each other; it is
quite common for a piece of code to read, write, or execute the contents of 
another address. These cross references are stored in the XREF table, which
contains an entry for each reference from one address to another, including
the type of reference:

   struct xref {
      ulong id,                     -- unique number in xref sequence
      ulong from_rva,               -- rva the reference is from
      ulong to_rva,                 -- rva the reference is to
      int type                      -- type of reference
   }

      Xref Types:
         XREF_READ          0x01
         XREF_WRITE         0x02
         XREF_EXEC          0x04

Note that implied references, such as an 'execute' reference from one address
to the code address immediately following it that occurs in standard serial
execution, are not stored in the XREF table, but rather are implied by the
instruction type of a CODE object in the sense that some instruction types 
(unconditional jumps and returns) are exceptions to the general serial execution
rule. Since XREFs are indexed by id and not by rva, it is possible to have 
many XREFs from or to a single address.
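
The implied reference can be recovered from the mnemonic type alone. The sketch
below uses the INS_BRANCH and INS_RET values from the Mnemonic Types list in
the Code Addresses section; code_falls_through is a hypothetical helper, and
treating calls (INS_SUB) as falling through is an assumption.

```c
#define INS_BRANCH 0x01     /* Unconditional branch */
#define INS_RET    0x08     /* Return from subroutine */

/* Hypothetical helper: returns 1 if execution can continue at the address
 * immediately following an instruction of this type (the implied 'execute'
 * reference), 0 for unconditional jumps and returns. */
int code_falls_through(int mnemType)
{
    return (mnemType & (INS_BRANCH | INS_RET)) == 0;
}
```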


Every executable has a single entry point or exported symbol; shared libraries
will have many such exports. These global symbols are stored in the EXPORT_ADDR
table:

   struct export_addr {
      ulong rva                     -- ADDRESS that is exported
   }

The table is very simple: merely the RVA of the exported symbol is recorded, 
with the name of the symbol being associated with the rva in the NAME table.

Executables which are dynamically linked will import symbols from other binary
files such as shared libraries; these symbols are stored in the IMPORT_ADDR
table:

   struct import_addr {
      int library,                  -- id of LIBRARY owning this import
      int type,                     -- type of import
      ulong rva                     -- ADDRESS of import [fn pointer]
   }

      Import Types
         IMPORT_UNKNOWN  0x00
         IMPORT_FUNCTION 0x01
         IMPORT_DATA     0x02

The RVA in the IMPORT_ADDR object refers to the address within the target's
address space which will be used to reference the imported symbol; usually this
is a pointer to a relocation or fixup table, such as the ELF PLT or GOT. Since
imported symbols are associated with a specific source, usually a shared 
library, the IMPORT_ADDR object contains the ID of an entry in the LIBRARY 
table which identifies the library or file owning the import:

   struct library {
      ulong id,                     -- unique number in sequence
      char name[],                  -- name of library
      int ver_hi,                   -- high-order version number [#.]
      int ver_lo                    -- low-order version number [.#]
   }

The import and library tables can be combined to list the dependencies of the
executable, in addition to providing symbolic names for external addresses.


   Data Addresses
   --------------
In the bastard, addresses which are not of type ADDR_CODE are assumed to be data
addresses. Note that the actual data for an address is not stored in the DB;
instead, it must be accessed as an offset into the memory-mapped target image
using the PA of the address. In order to manage data in an endian-independent
fashion, the following routines are provided:

   #include <util.h>

   /* endian-independent memcmp */
   int encmp(char *a, char *b, int len);
   int encmp_short(short a, short b);
   int encmp_int(int a, int b);

   /* endian-independent memcpy */
   void * encpy(char *dest, char *src, int len);
   void * encpy_short(short dest, short src);
   void * encpy_int(int dest, int src);

It is recommended that these routines be used for accessing data which may be
affected by byte ordering (i.e., any data type larger than a byte) in order to
ensure portability.
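
To show why byte order matters when reading from the target image, here is a
sketch that assembles a 32-bit value from four raw bytes according to the
target.endian convention (0 = big endian, 1 = little endian). It does not use
or reproduce the encpy routines, whose exact semantics are not described here;
read_u32 and the TGT_* macro names are hypothetical.

```c
#include <stdint.h>

/* Byte-order values as documented for target.endian; the macro names
 * themselves are hypothetical. */
#define TGT_BIG_ENDIAN    0
#define TGT_LITTLE_ENDIAN 1

/* Hypothetical helper: build a 32-bit value from 4 bytes of target data,
 * independent of the byte order of the host. */
uint32_t read_u32(const unsigned char *p, int endian)
{
    if (endian == TGT_LITTLE_ENDIAN)
        return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```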

In high-level programming languages, data is referred to by type rather than by
size, with the size of a piece of data being implied by its type. In order to
support this, the bastard maintains a DATA_TYPE table which contains all of the
known data types applicable to the target:

   struct data_type {
      ulong id,                     -- unique number in sequence
      int size,                     -- size of this DATA_TYPE
      int flags,                    -- attributes of the data type
      char name[]                   -- name identifying this type
   }

      Data_Type Types:
         DT_SIGNED          0x01
         DT_UNSIGNED        0x02
         DT_POINTER         0x04
      
It is the responsibility of the Language extension to seed this table with basic
and default data types; subsequent scripts or user actions may define additional
data types, which will usually equate to system-specific data types or to 
typedefs. Note that data types have their own separate namespace, and that flags
and size can be combined to create new data types:

   struct data_type char =    {1, 1, DT_SIGNED,              "char" };
   struct data_type uchar =   {2, 1, DT_UNSIGNED,            "unsigned char" };
   struct data_type charptr = {3, 4, DT_SIGNED | DT_POINTER, "char *" };
   struct data_type int =     {4, 4, DT_SIGNED,              "int" };
   struct data_type uint =    {5, 4, DT_UNSIGNED,            "unsigned int" };
   struct data_type intptr =  {6, 4, DT_SIGNED | DT_POINTER, "int *" };

Structures are not considered data types at this time, although a pointer to a
structure could be declared as an arbitrary data type:

   struct data_type ElfHdrPtr = {7, 4, DT_POINTER, "struct Elf32_Ehdr *" };

While this retains no direct links to the associated STRUCTURE object, it can 
be applied to an ADDRESS object to clarify its contents for the user.


In addition to information about data types, many data values --as well as 
immediate values embedded in CODE objects-- will refer to compile-time constants
that specify encodings or limits which may be apparent from context, or from
knowledge of the target. For this reason, a CONSTANT table is kept in the DB
which contains symbolic names for known constants in the target:

   struct constant {
      ulong id,                     -- unique number in sequence
      char name[],                  -- name identifying this constant value
      long value                    -- value of CONSTANT
   }

Note that the constants have their own namespace, which is contrary to the rules
of most high-level programming languages. When a value is suspected of being a 
symbolic constant, its value is looked up on the CONSTANT table; from this, a
list of possible symbolic names will be produced, and the chosen symbol will
be referenced by the ADDRESS or CODE object by the ID of its CONSTANT object. 
No constants are initially generated by the bastard, nor are they required of 
any extension; however, Format and Language extensions may be written which 
provide known constants for a programming language or operating system.
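
The value lookup described above might look like the following sketch, which
scans an in-memory copy of the table rather than the DB itself. The
constant_rec type and constant_find_by_value are hypothetical; the actual DB
lookup routines are not shown in this section.

```c
#include <stddef.h>

/* In-memory stand-in for a CONSTANT record (hypothetical type). */
struct constant_rec {
    unsigned long id;       /* unique number in sequence */
    const char   *name;     /* name identifying this constant value */
    long          value;    /* value of CONSTANT */
};

/* Collect the ids of every constant whose value matches 'value'.
 * At most 'max_ids' ids are stored; the number stored is returned. */
size_t constant_find_by_value(const struct constant_rec *tbl, size_t n,
                              long value, unsigned long *ids, size_t max_ids)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < max_ids; i++)
        if (tbl[i].value == value)
            ids[found++] = tbl[i].id;

    return found;
}
```

Several names can share a value, so the result is a list of candidates; the
user or a script then picks the id to store in the ADDRESS or CODE object.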


The most familiar type of data to disassembler users is the string; often the
strings of a target will provide more information than the code itself. The
problem with strings is that each programming language treats them differently;
some use a termination character, some include a length prefix, and some provide
no information aside from the ASCII characters themselves (e.g. FORTRAN). For
this reason, it is the responsibility of the Language extension to provide 
string detection routines; these routines will be called by the bastard (or, 
more accurately, by an Engine extension for string recognition) in the course of
analyzing a section in the target.

Once recognized, strings are stored in the DB in their own table:

   struct string {
      ulong rva,                    -- ADDRESS where string starts
      int length,                   -- size of string in bytes
      char text[]                   -- first 255 bytes of string
   }

It should be noted that the starting address of the string is given a name 
consisting of the prefix "str_" and the first 28 bytes of the string; currently,
the name is not checked for illegal characters.
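
A sketch of how such a name could be built. The "str_" prefix and the 28-byte
limit come from the text above; replacing non-alphanumeric characters with '_'
is an addition of this sketch (the bastard itself does not currently check the
name), and string_make_name is a hypothetical helper.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define STR_NAME_PREFIX "str_"
#define STR_NAME_BYTES  28

/* Hypothetical helper: build a label from the start of a string.
 * 'out' must hold at least 4 + 28 + 1 = 33 bytes. */
void string_make_name(const char *text, size_t length, char *out)
{
    size_t n   = length < STR_NAME_BYTES ? length : STR_NAME_BYTES;
    size_t pos = strlen(STR_NAME_PREFIX);

    memcpy(out, STR_NAME_PREFIX, pos);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)text[i];
        /* assumption: anything outside [A-Za-z0-9_] becomes '_' */
        out[pos++] = (isalnum(c) || c == '_') ? (char)c : '_';
    }
    out[pos] = '\0';
}
```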


   Structured Data
   ---------------
While data types and constants are useful in reconstructing source code, a great
deal of data is stored in structures (imperative languages) or objects (OOP 
languages). In order to reconstruct the original source code of the target, it 
is useful to be able to declare data structures and to apply them to addresses
within the target, so that offsets from those addresses will be recognized as
structure members rather than as random, unrelated data.

The fact that the bastard uses a database to store its representation of the 
target causes the representation of objects with variable sizes and members
(such as structures and functions) to be non-obvious, if not confusing. In the 
DB, each structure is given an ID, a name, and a total size in the STRUCTURE
table:

   struct structure {
      ulong id,                     -- unique number in sequence
      char name[],                  -- name of STRUCTURE
      ushort size                   -- size of STRUCTURE in bytes
   }
   
Once the ID is known, structure members can be created by adding a record to the
STRUCT_MEMBER table, using that ID to reference the structure owning the member:

   struct struct_member {
      ulong id,                     -- unique number in sequence
      ulong type,                   -- id of DATA_TYPE associated with this
      int size,                     -- # of data items (>1 == array)
      ulong structure,              -- id of STRUCTURE associated with this
      int order,                    -- which member # is this
      char name[]                   -- name of this struct member
   }

Some details should be noted here: first, each STRUCT_MEMBER has an ID which is
independent of the STRUCTURE ID. Second, structures and structure members each
have a namespace which is separate from address names, and from each other; 
these namespaces are never checked for conflicts, so it is possible for two
structures or structure members to have the same name. Third, a size field is
associated with the data type of each structure member; the total size of the
member is STRUCT_MEMBER.SIZE * DATA_TYPE.SIZE, so that an "int" member of size
1 would be a standard 4-byte "int", while an "int" with size 12 would be a
48-byte array of 12 ints, or "int [12]". This enables arrays to be specified
much as they are in ADDRESS objects.
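
The size rule can be written out as a short sketch. member_rec is a
hypothetical in-memory record that carries the DATA_TYPE size alongside the
member, and member_offset assumes members are packed back to back in 'order'
sequence with no padding, which the manual does not state.

```c
#include <stddef.h>

/* Hypothetical in-memory view of a STRUCT_MEMBER joined with its DATA_TYPE. */
struct member_rec {
    int order;        /* STRUCT_MEMBER.order: which member # this is */
    int count;        /* STRUCT_MEMBER.size: # of data items */
    int type_size;    /* DATA_TYPE.size: bytes per data item */
};

/* Total bytes occupied by a member: STRUCT_MEMBER.SIZE * DATA_TYPE.SIZE */
size_t member_bytes(const struct member_rec *m)
{
    return (size_t)m->count * (size_t)m->type_size;
}

/* Byte offset of member number 'order', assuming no padding. */
size_t member_offset(const struct member_rec *members, size_t n, int order)
{
    size_t off = 0;

    for (size_t i = 0; i < n; i++)
        if (members[i].order < order)
            off += member_bytes(&members[i]);

    return off;
}
```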

Structures are applied to data addresses using the struct_apply() API routine;
this will set the STRUCTURE field of the address to the ID of the associated
structure, add ADDR_STRUCT to the address type field, then create an ADDRESS 
object for each structure member and rename those addresses to match the 
structure member offsets. It is the responsibility of the Assembler or the 
Language extension to display the address in a structured format based on the 
information in the STRUCTURE and STRUCT_MEMBER tables.


   Code Addresses
   --------------
Every address in the target that contains executable code must have a 
corresponding CODE object:

   struct code {
      ulong rva,                    -- ADDRESS where this CODE is
      char mnemonic[],              -- text of mnemonic
      long dest[],                  -- destination operand
      long src[],                   -- source operand
      long aux[],                   -- additional operand
      int mnemType,                 -- type of mnemonic
      int destType,                 -- type of destination operand
      int srcType,                  -- type of source operand
      int auxType,                  -- type of additional operand
      ulong destConst,              -- id of CONSTANT for dest, if used
      ulong srcConst,               -- id of CONSTANT for src, if used
      ulong auxConst                -- id of CONSTANT for aux, if used
   }

      Mnemonic Types:
         INS_BRANCH   0x01     /* Unconditional branch */
         INS_COND     0x02     /* Conditional branch */
         INS_SUB      0x04     /* Jump to subroutine */
         INS_RET      0x08     /* Return from subroutine */
         INS_ARITH    0x10     /* Arithmetic inst */
         INS_LOGIC    0x20     /* logical inst */
         INS_FPU      0x40     /* Floating Point inst */
         INS_FLAG     0x80     /* Modify flags */
         INS_MOVE     0x0100   /* Basic load/store ops */
         INS_ARRAY    0x0200   /* String and XLAT ops */
         INS_PTR      0x0400   /* Load EA/pointer */
         INS_STACK    0x1000   /* PUSH, POP, etc */
         INS_FRAME    0x2000   /* ENTER, LEAVE, etc */
         INS_SYSTEM   0x4000   /* CPUID, WBINVD, etc */
         INS_BYTE     0x10000  /* operand is  8 bits/1 byte  */
         INS_WORD     0x20000  /* operand is 16 bits/2 bytes */
         INS_DWORD    0x40000  /* operand is 32 bits/4 bytes */
         INS_QWORD    0x80000  /* operand is 64 bits/8 bytes */

      Operand Types:
         OP_R         0x001    /* operand is READ */
         OP_W         0x002    /* operand is WRITTEN */
         OP_X         0x004    /* operand is EXECUTED */
         OP_UNK       0x000    /* unknown operand */
         OP_REG       0x100    /* register */
         OP_IMM       0x200    /* immediate value */
         OP_REL       0x300    /* relative Address [offset] */
         OP_PTR       0x400    /* Pointer */
         OP_EXPR      0x500    /* Address Expression [e.g. SIB byte] */
         OP_IND       0x1000   /* operand is a memory reference */
         OP_STRING    0x2000   /* operand is a string */
         OP_SIGNED    0x3000   /* operand is signed */
         OP_CONST     0x4000   /* operand is a constant */
         OP_BYTE      0x10000  /* operand is  8 bits/1 byte  */
         OP_WORD      0x20000  /* operand is 16 bits/2 bytes */
         OP_DWORD     0x30000  /* operand is 32 bits/4 bytes */
         OP_QWORD     0x40000  /* operand is 64 bits/8 bytes */

Most of the fields will be self-explanatory: RVA is a unique mapping of the CODE
object to an ADDRESS object (which will be used to obtain size and naming info);
the mnemonic, src, dest, and aux fields are used to represent the instruction 
and up to three operands; the mnemType, srcType, destType, and auxType fields
are used to store additional information about the instruction and operands for
later analysis and comparison; and the destConst, srcConst, and auxConst fields
are used to associate a symbolic constant with an immediate value stored in any
of the three operands.


The most important detail to note concerning the CODE object is that only the 
mnemonic is stored as text; the operands are stored as numeric values, such
that a printf statement for a CODE object would look like

   printf("%08X \t %s \t %X, %X\n", rva, code.mnemonic, code.dest, code.src);

rather than

   printf("%08X \t %s \t %s, %s\n", rva, code.mnemonic, code.dest, code.src);

as it would if the operands were all strings. This of course leads to the
question of how string values such as registers and address expressions will be
represented in the CODE object.

The immediate solution is the operand type field -- either destType, srcType, or
auxType depending on the operand. If this type field contains a flag specifying
that the operand is either a register or an address expression (OP_REG or 
OP_EXPR, respectively), then special handling is required to associate the 
numeric value of the operand with a register or an address expression.

All of the known CPU registers are kept in a register table managed by code in
src/vm.c -- as the name implies, the management of register names will be
enhanced to manage register contents in the future. The table of register names
and sizes is the responsibility of the Architecture extension, and is filled
during the init phase of that extension. The following routines are provided
to allow lookups to the register table:

   #include <vm.h>

   /* return a constant character string for the specified register */
   char * GetRegMnemonic(int index);

   /* return the size of the register in bytes */
   int GetRegSize(int index);

The 'index' parameter required by these routines is the numeric value of the 
operand with the OP_REG type; since the Architecture module manages the 
disassembly of binary code as well as the register table, these indexes are
constant and will not change from target to target providing the same 
Architecture extension is used. The process for printing out an operand of type
OP_REG therefore becomes

   if ( opType & OP_REG )
      printf("%s", GetRegMnemonic(operand));

...which can be integrated into a larger instruction formatting routine such as
asmsprintf().


Address expressions must be handled differently; since these are generated from
base and index registers along with a numeric scale and a displacement that can
be either a register or an integer, they are far less predictable than the CPU
registers. For this reason, address expressions are stored in their own table
in the DB:

   struct addr_exp {
      int id,                       -- Unique number in sequence
      int scale,                    -- Scale value
      int index,                    -- Index register/value
      int base,                     -- Base register/address
      long displacement,            -- Displacement register/address
      int flags                     -- Flags for each field
   }

      Address Expression Types:
         ADDREXP_BYTE   0x01    /* Field contains a BYTE value */
         ADDREXP_WORD   0x02    /* Field contains a WORD value */
         ADDREXP_DWORD  0x03    /* Field contains a DWORD value */
         ADDREXP_QWORD  0x04    /* Field contains a QWORD value */
         ADDREXP_REG    0x10    /* Field contains a Register encoding */

The proper way to handle address expressions would be to store only unique 
address expressions, so that operands could be compared by numeric value in
the same manner as registers. However, this requires extra time to insert and
thereby slows down the disassembly process --which occurs far more frequently
than the comparison of two operands-- and so for the time being all ADDR_EXP
objects are non-unique.

Note that the flags field contains 4 byte-size fields, one for each field in
the ADDR_EXP object:

      ADDEXP_SCALE_MASK  0x000000FF
      ADDEXP_INDEX_MASK  0x0000FF00
      ADDEXP_BASE_MASK   0x00FF0000
      ADDEXP_DISP_MASK   0xFF000000

The flags for a given field determine the size of the value for the field, or
if the field contains a numeric register id for lookup in the register table.
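
Unpacking the per-field flags might look like the sketch below. The mask and
flag values are copied from the listings above; the shift amounts are inferred
from the masks, and the accessor functions are hypothetical. Unsigned
arithmetic is used so that the top byte shifts down cleanly.

```c
#define ADDEXP_SCALE_MASK  0x000000FFu
#define ADDEXP_INDEX_MASK  0x0000FF00u
#define ADDEXP_BASE_MASK   0x00FF0000u
#define ADDEXP_DISP_MASK   0xFF000000u

#define ADDREXP_REG        0x10u   /* Field contains a Register encoding */

/* Hypothetical accessors: return the one-byte flags for each field. */
unsigned addrexp_scale_flags(unsigned flags)
{
    return flags & ADDEXP_SCALE_MASK;
}

unsigned addrexp_index_flags(unsigned flags)
{
    return (flags & ADDEXP_INDEX_MASK) >> 8;
}

unsigned addrexp_base_flags(unsigned flags)
{
    return (flags & ADDEXP_BASE_MASK) >> 16;
}

unsigned addrexp_disp_flags(unsigned flags)
{
    return (flags & ADDEXP_DISP_MASK) >> 24;
}

/* Returns 1 if a field's flags mark it as a register table index. */
int addrexp_field_is_reg(unsigned field_flags)
{
    return (field_flags & ADDREXP_REG) != 0;
}
```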

Needless to say, formatting an address expression can be pretty complex. The 
general rule for Intel syntax is
       [base] + (scale * [index]) + disp
where base and index are registers. Supporting AT&T syntax as well as other
CPUs makes this more complicated; as a result, the Bastard API provides a 
routine for formatting address expressions:

   #include <bastard.h>

   int addrexp_get_str( int id, char *string, int len);

This relies on the Assembler extension to format the address expression based
on string representations of the various address expression fields.
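
For illustration only, an Intel-syntax formatter working from string
representations of the fields could be sketched as below. This is not the
implementation behind addrexp_get_str; addrexp_format_intel is a hypothetical
routine, and the compact "[base+scale*index+disp]" layout is one common
rendering of the general rule given above.

```c
#include <stdio.h>

/* Hypothetical formatter: 'base' and 'index' are register names, or NULL
 * if the field is unused. Returns what snprintf returns for the final
 * string (the length that was, or would have been, written). */
int addrexp_format_intel(char *buf, size_t len, const char *base,
                         int scale, const char *index, long disp)
{
    char idx[32] = "";
    char dsp[32] = "";

    if (index) {
        if (scale > 1)
            snprintf(idx, sizeof idx, "%s%d*%s", base ? "+" : "", scale, index);
        else
            snprintf(idx, sizeof idx, "%s%s", base ? "+" : "", index);
    }

    /* always print the displacement when it is the only component */
    if (disp != 0 || (!base && !index)) {
        unsigned long mag = disp < 0 ? 0UL - (unsigned long)disp
                                     : (unsigned long)disp;
        const char *sign = disp < 0 ? "-" : ((base || index) ? "+" : "");
        snprintf(dsp, sizeof dsp, "%s0x%lX", sign, mag);
    }

    return snprintf(buf, len, "[%s%s%s]", base ? base : "", idx, dsp);
}
```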


In the database, all instruction operands are stored as a numeric value with a
qualifying flags field. This makes it trivial to compare numeric operands or
register operands (e.g. `if (op1Type == op2Type && op1 == op2)` ); however, 
operands which are address expressions are harder to compare since entries in
the ADDR_EXP table are non-unique. The code_opcmp routine has been provided to 
make comparing operands easier:

   #include <bastard.h>

   /* Returns 1 if the operands are equal, 0 otherwise */
   int code_opcmp(long op1, int flag1, long op2, int flag2);

The code_opcmp routine handles all register table and ADDR_EXP lookups required to
do the comparison.
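
Because ADDR_EXP entries are non-unique, two operands with different ids may
still describe the same expression, so the comparison has to be made field by
field. The sketch below shows that idea on an in-memory record; it is not the
code_opcmp implementation, and addr_exp_rec and addrexp_equal are hypothetical.

```c
/* In-memory stand-in for an ADDR_EXP record (hypothetical type). */
struct addr_exp_rec {
    int  id;
    int  scale;
    int  index;
    int  base;
    long displacement;
    int  flags;
};

/* Returns 1 if the two expressions are equivalent, 0 otherwise. The id is
 * deliberately ignored; the flags are compared because the same number can
 * mean a register in one expression and a plain value in another. */
int addrexp_equal(const struct addr_exp_rec *a, const struct addr_exp_rec *b)
{
    return a->scale == b->scale &&
           a->index == b->index &&
           a->base == b->base &&
           a->displacement == b->displacement &&
           a->flags == b->flags;
}
```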


In order to perform more advanced analysis of code, the DB maintains a table
which records all registers changed by a specific instruction:

   struct code_effect {
      ulong id,                     -- Unique number in sequence
      ulong rva,                    -- Rva which causes this effect
      int reg,                      -- Register which is modified
      int change                    -- Amount of change (signed int)
   }

Each CODE_EFFECT reflects a register which is changed upon execution of the code
at RVA; the reg field identifies the register via its index in the register
table, and the change field contains either the change applied to the register 
(in the case of an increment, decrement, or non-destructive add) or 0 for an
unknown change or overwritten register.

Currently the code effect table is unused; however it will eventually play a 
role in CPU emulation, and in determining the effect of a function within the
target (see FUNC_EFFECT, below).


   Structured Code
   ---------------
Just as program addresses are structured into sections and data is structured 
into structure types, so is code structured into subroutines and macros. The
DB allows for two different representations of code structures: the standard
function, and the inline function.


Most functions or subroutines within the target will be stored in the FUNCTION
table of the DB:

   struct function {
      ulong id,                     -- unique number in sequence
      ulong rva,                    -- ADDRESS where function starts
      ushort size,                  -- size of function in bytes
      ulong ret_type,               -- id of DATA_TYPE this func returns
      ulong comment                 -- id of COMMENT associated with FUNCTION
   }

This represents only the most basic information about a function: its address, 
its size, its return type, and the ID of any associated comment; this should be
enough for basic manipulation of functions.

Functions suffer from the same problem as structures, as far as the DB is 
concerned: each function can have any number of parameters or local variables.
As with structures, a separate DB table is kept which links a parameter or a
local variable with a function based on the function ID:

   struct func_param {
      ulong id,                     -- unique number in sequence
      ulong func,                   -- id of FUNCTION associated with param
      ulong type,                   -- id of DATA_TYPE associated with param
      int size,                     -- # of data_type items
      int addr_exp,                 -- id of ADDR_EXP with offset from frame ptr
      int flags,                    -- flags field for future use
      char name[]                   -- name associated with this param
   }

   struct func_local {
      ulong id,                     -- unique number in sequence
      ulong func,                   -- id of FUNCTION associated with var
      ulong type,                   -- id of DATA_TYPE associated with var
      int size,                     -- # of data_type items
      int addr_exp,                 -- ADDR_EXP w/ offset from stack frame
      int flags,                    -- flags field for future use
      char name[]                   -- name associated with this var
   }

In light of the STRUCT_MEMBER definition, these should be pretty clear: each
parameter or local variable has a name in its own namespace, a unique ID, a link
to the owning function by ID, a data type and size, and flags. The only odd 
field is the addr_exp field, which contains the ID of an ADDR_EXP object that 
represents the parameter or local variable as an offset from the stack frame;
this is used for tracking access to the parameter or variable throughout the
function, by comparing it to operands in instructions within the function.
   

As with individual CODE objects, each FUNCTION will affect registers which it
does not save -- the most common being the return register, for example EAX in
C code on the Intel platform. For this reason, a DB table exists to associate
register changes with FUNCTION objects:

   struct func_effect {
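      ulong id,                     -- unique number in sequence
      ulong func,                   -- id of FUNCTION which causes this effect
      int reg,                      -- register which is modified
      int change                    -- amount of change (signed int)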
   }

Careful construction of the entries in this table will provide quick information
on which registers a function destroys, as well as whether or not the function
properly cleans up the stack on returning. Note that instructions which call
a function will have CODE_EFFECT entries equivalent to those of the function.
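
One way a function-level entry could be derived is by summing the per-
instruction effects on a register, for example the stack pointer, to see
whether the function leaves it balanced. The sketch below works on an
in-memory array; effect_rec and net_register_change are hypothetical, and the
register index used by a caller depends on the Architecture extension.

```c
#include <stddef.h>

/* In-memory stand-in for a CODE_EFFECT record (hypothetical type). */
struct effect_rec {
    unsigned long rva;      /* rva which causes this effect */
    int           reg;      /* index of register which is modified */
    int           change;   /* signed delta, or 0 for unknown/overwritten */
};

/* Sum the known changes to 'reg'. Returns 1 and stores the net change in
 * *net if every effect on 'reg' is a known delta; returns 0 (leaving *net
 * untouched) if any effect is unknown or overwrites the register. */
int net_register_change(const struct effect_rec *fx, size_t n,
                        int reg, long *net)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (fx[i].reg != reg)
            continue;
        if (fx[i].change == 0)
            return 0;
        sum += fx[i].change;
    }

    *net = sum;
    return 1;
}
```

A net change of 0 for the stack pointer would suggest the function cleans up
the stack before returning.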


A great deal of source code contains macros which wrap a sequence of code to a
simple function-like call; these macros are expanded inline when the code is 
compiled. While such macros are generally easy to spot, they are difficult if
not impossible to replace in standard disassembly representations; for this
reason, the bastard provides support for specifying such inline functions, with
the option of "folding" instances of them in order to make disassembled code 
less tedious to read.

An F_INLINE object consists of a unique ID, a size in bytes, and the rva of an
instance of the object which can be used in comparisons to find further 
occurrences:
      
   struct f_inline {
      ulong id,                     -- unique number in sequence
      char name[],                  -- name of this type
      ulong rva,                    -- ADDRESS of 'template' for this type
      ushort size                   -- size of type, in bytes
   }

If an ADDRESS object has the F_INLINE attribute, its structure field 
will refer to the ID of an F_INLINE object (which is essentially the code
equivalent of a data structure). When displaying this address, all subsequent
addresses up to F_INLINE.size can be "swallowed" and replaced with the name
of the inline function.
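
A minimal sketch of the "find further occurrences" and "swallow" steps follows.
It assumes the simplest possible comparison, a byte-for-byte match against the
template, which will miss instances where the compiler picked different
registers or operands. It also treats the rva as a plain offset into a buffer
holding the image. The names find_inline, fold_end, and NOT_FOUND are
hypothetical, not part of the bastard API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Mirrors the f_inline table; rva is treated as an offset into 'image'. */
struct f_inline {
    unsigned long id;     /* unique number in sequence */
    const char *name;     /* name of the inline function */
    unsigned long rva;    /* location of the template instance */
    unsigned short size;  /* size of the template, in bytes */
};

/* Return the offset of the first byte-identical copy of the template at or
 * after 'start', skipping the template itself, or NOT_FOUND if none. */
size_t find_inline(const unsigned char *image, size_t image_len,
                   const struct f_inline *fi, size_t start)
{
    assert(image != NULL && fi != NULL);
    /* The template must lie entirely inside the image. */
    if (fi->size == 0 || fi->rva > image_len ||
        fi->size > image_len - fi->rva || start > image_len)
        return NOT_FOUND;
    for (size_t off = start; off <= image_len - fi->size; off++)
        if (off != fi->rva &&
            memcmp(image + off, image + fi->rva, fi->size) == 0)
            return off;
    return NOT_FOUND;
}

/* Offset of the first byte after a folded instance starting at 'off':
 * a listing would print fi->name at 'off' and resume output here. */
size_t fold_end(const struct f_inline *fi, size_t off)
{
    return off + fi->size;
}
```

Calling find_inline repeatedly, each time starting from the fold_end of the
previous hit, walks all non-overlapping matches; each hit is a candidate for
being marked with the F_INLINE attribute.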


   High-Level Code
   ---------------

   Right now, this is one big TODO.

   INT_CODE      -- Representation of CODE in an intermediate language
   Intermediate code for an entire function is generated by the Assembler
   Extension.

   struct int_code {
      ulong id,                     -- unique number in sequence
      ulong rva,                    -- ADDRESS where this code starts
      int size,                     -- number of bytes this code represents
      char lvalue[],                -- L-Value [destination]
      char rvalue[]                 -- R-Value [expression]
   }

   FIN_CODE    -- Representation of INT_CODE in the final, high-level language
   Final Code is generated for an entire function by the Language Extension.

   struct fin_code {
      ulong id,                     -- unique number in sequence
      ulong rva,                    -- ADDRESS where this code starts
      int size,                     -- number of bytes this code represents
      char line[]                   -- final line of source code
   }



================================================================================
 Other Documentation

   Manuals
      BC Scripting Manual                BCScript.txt
      Database Schematics                DBSchema.txt
      Extension Programming Guide        ExtensionProg.txt
      Front End Programming Guide        FrontEndProg.txt
      Source Code Layout                 SourceLayout.txt
      Theory Of Operations               TheoryOfOp.txt

   Tutorials
      Basic Disassembly Tutorial         Disasm-HOWTO.txt
      Using Functions HOWTO              Function-HOWTO.txt
      Using Structures HOWTO             Structure-HOWTO.txt

   Third-party Documentation
      The EiC Interpreter Manual         EiC.ps
      The Typhoon RDBMS Manual           typhoon.txt


================================================================================
 Responsible Parties

mammon_   -- design, coding, documentation, project mgt
ReZiDeNt  -- catalyst, initial co-design 
grugq     -- makefile, sanity consultant
FBJ       -- tester, bug-finder
