A Decompiler Project

Undergraduate Thesis
By
Mohsen Hariri
1. Introduction
2. Boomerang Project
3. Binary Translation
3.1 The Front-end
3.2 SLED
3.3 SSL and RTL
3.4 CFG
3.5 Backends
4. Static Library Detection
4.1 Application
4.2 Library Files
4.3 Matching Methods
4.3.1 Raw Object/Library Matching
4.3.2 Signature Matching
4.4 Symbols Demangling
5. Terminology
5.1 Terms
5.2 Acronyms
6. Project Discussions
1. Introduction
A compiler is a program that takes as input a program written in a high level language
and produces as output an executable program for a target machine; in other words,
the input is language dependent and the output is machine dependent. A decompiler, or
reverse compiler, attempts to perform the inverse process: given an executable
program the aim is to produce a high level language program that performs the same
function as the executable program. The input in this case is machine dependent, and
the output is language dependent.
Decompilation techniques were initially used to aid in the migration of programs from
one platform to another. Since then, decompilation techniques have been used to aid
in many other fields such as the recovery of lost source code, debugging of programs,
comprehending programs, recovery of high level views of programs and worm and
virus analysis.
In general, the decompilation problem is equivalent to the halting problem.
A naive approach to decompilation attempts to enumerate all valid
phrases of an arbitrary attribute grammar, and then to perform a
reverse match of these phrases to their original source code. An
algorithm to solve this problem has been proved to be halting problem
equivalent. A more sensible approach is to try to determine which
addresses contain data and which ones contain instructions in the given
binary program. Given that in a Von Neumann machine, data and
instructions are represented in the same way in the computer memory,
an algorithm that solves this data/instruction problem would also solve
the halting problem, and that is impossible. This means that the
decompilation problem belongs to the class of non-computable
problems; it is equivalent to the halting problem, and is therefore only
partially computable. In other words, we can build a decompiler which
produces the right output for some input programs, but not for all input
programs in general.
(from “A Methodology for Decompilation”, Cristina Cifuentes and
K. John Gough)
Today, many generic disassemblers are available to the public, but decompilers,
being much harder to design and implement, are very few, and almost all of them
are designed for a specific programming language and compiler. A disassembler is
a program that reads an executable program and translates it into an equivalent
assembler program; a decompiler goes a step further by translating the program
into an equivalent high level language (such as C or Pascal) program.
As with most reverse engineering tools, disassemblers and decompilers are
semi-automated tools rather than fully automated tools. In effect, decompilation
is merely an extension of disassembly. If you can produce assembly language
source for a program, then you can produce high-level language source for that
program with more effort.
Figure 1 – A Decompilation System
2. Boomerang Project
The Boomerang project is an attempt to develop a real decompiler for
machine code programs through the open source community. A
decompiler takes as input an executable file, and attempts to create a
high level, compilable, possibly even maintainable source file that does
the same thing. It is therefore the opposite of a compiler, which takes a
source file and makes an executable. However, a general decompiler
does not attempt to reverse every action of the compiler; rather it
transforms the input program repeatedly until the result is high level
source code. It therefore won't recreate the original source file;
probably nothing like it. It does not matter if the executable file has
symbols or not, or was compiled from any particular language.
(However, declarative languages like ML are not considered.)
The intent is to create a retargetable decompiler (i.e. one that can
decompile different types of machine code files with modest effort,
e.g. X86-windows, sparc-solaris, etc). It was also intended to be highly
modular, so that different parts of the decompiler can be replaced with
experimental modules. It was intended to eventually
become interactive, a la IDA Pro, because some things (not just
variable names and comments, though these are obviously very
important) require expert intervention. Whether the interactivity
belongs in the decompiler or in a separate tool remains unclear.
By transforming the semantics of individual instructions, and using
powerful techniques such as Static Single Assignment dataflow
analysis, Boomerang should be (largely) independent of the exact
behavior of the compiler that happened to be used. Optimization
should not affect the results. Hence, the goal is a general decompiler.
(From Boomerang project homepage,
http://boomerang.sourceforge.net/)
The main theories behind this project are taken from research on binary
translation, mainly the UQBT project. Research on UQBT has been under way since
1995 with the support of Sun Microsystems Laboratories.
The Boomerang project started in 2002 and, for the time being, it can decompile
only (very) simple and small programs, generating the equivalent C code.
One major limitation is that there is as yet no recognition of statically
linked library functions (a la FLIRT of IDA Pro). This is a severe
limitation for many Windows programs, most of which have large
amounts of library code statically linked (as well as plenty that are
dynamically linked). That means that until this is implemented, then
without significant manual guidance, Boomerang can't even decompile
its own test/windows/hello.exe.
(From Boomerang project homepage,
http://boomerang.sourceforge.net/)
In this project, I am going to implement library detection for Boomerang. First,
however, I need to understand the parts of the project and have an overview of
the system. So I will start by describing the parts already implemented, and
then focus on my part of the whole project.
3. Binary Translation
Binary translation is the process of automatically translating a binary
executable program from one machine to another. This process
normally involves different machines, Mi, different operating systems,
OSi, and different binary-file formats, BFFi, in order to translate
programs from (M1, OS1, BFF1) to (M2, OS2, BFF2).
Like a compiler, a binary translator can be loosely divided into front
end, analyzer and optimizer, and back end. The front end decodes a
source-machine binary file, produces RTLs, and then lifts the level of
abstraction to HRTL (the high-level intermediate representation) by
using knowledge of the source-machine calling conventions and
instruction set. The analyzer and optimizer map from source-machine
locations to target-machine locations, and it may apply other machine-specific
optimizations to prepare for the back end. The back end
translates the intermediate HRTL to target-machine instructions, and it
writes a binary file in the required format.
Compilers and other tools have traditionally been called retargetable
when they can support multiple target machines at low cost. By
extension, we call a binary-analysis tool resourceable if it can analyze
binaries from multiple source machines at low cost. Retargetability is
supported in the UQBT framework through specifications of properties
of machines and operating system conventions.
(From UQBT project homepage,
http://www.itee.uq.edu.au/~cristina/uqbt.html)
Figure 2 - UQBT Abstract Architecture
Figure 3 - UQBT Detailed Architecture (2001)
3.1 The Front-end
The front end module deals with machine dependent features and produces
a machine independent representation. It takes as input a binary program
for a specific machine, loads it into virtual memory, parses it, and
produces an intermediate representation of the program.
The loader is an operating system program that loads an executable
program into memory, sets up the segment registers and the stack, and
transfers control to the program. Note that executable files do not contain
much information about which segments are used as data and which ones
are used as code, data segments can contain code and/or addresses. The
parser decides the type of machine instruction at a given memory location,
determines its operands and any offsets involved. The parsing of machine
instructions is not as easy as it might appear. First of all, there are
addressing modes that depend on the value of variables or registers at
runtime. Second, indexed and indirect access to memory locations are
difficult to resolve. Third, the complex machine instruction sets in today's
machines utilize almost all combination of bytes, and therefore it is very
hard to determine if a given byte is an instruction or is data. Fourth, there
is no difference as to how data and instructions are stored in memory in a
Von Neumann machine. Finally, idioms are used by compiler writers to
perform a function in the minimal number of machine cycles, and
therefore a group of instructions will make sense only in a logical way, but
not individually.
In order to determine which bytes of information are instructions and
which ones are data, we start at the unique entry point to the program,
given by the loader. This entry point must be the first instruction for the
program, in order to begin execution. From there on, instructions are
parsed sequentially, until the flow of control changes due to a branch, a
procedure call, etc. In this case, the target location is like a new entry point
to part of the program, and from there onwards, instructions can be parsed
in the previous way. Once there are no more instructions to parse, due to
an end of procedure or end of program, we return to where the branch of
control occurred and continue parsing at that level. This method traverses
all possible instruction paths. At the same time, data references are placed
in a global or local symbol table, depending on where the data is stored
(i.e. as an offset on the stack, or at a definite memory location).
A major problem is introduced by the access of indexed and indirect
memory instructions and locations. An idiom is a sequence of instructions
which forms a logical entity and has a meaning that cannot be derived by
considering the primary meanings of the individual instructions.
To handle these, heuristic methods need to be implemented to determine
as much information as possible; analytic methods, such as emulation,
cannot provide the whole range of solutions anyway. In general, it is
impossible to solve these types of problems as they are equivalent to
solving the halting problem, as previously mentioned.
Different problems are introduced by self-modifying code and virus tricks.
A way to tackle these cases is to flag the sections of code involved, and
comment them in the final program. Assembler code might be all that can
be produced in these cases. Even more, a suggested optimal algorithm for
parsing consists in finding the maximum number of trees that contain
instructions; this is a combinatorial method that has been proved to be
NP-complete. For dense machine instruction sets, this algorithm does not
solve the problem of data residing in code segments. The intermediate
code generator produces an intermediate representation of the program. It
works close together with the parser, invoking it to get the next
instruction. Each machine instruction gets translated into an intermediate
code instruction, such representation being machine and language
independent.
Defined/used (du) chains of registers are also attached to the intermediate
instruction; these are used later in the data flow analysis phase.
The quality of the intermediate code can be improved by an optimization
stage that eliminates any redundant instructions, finds probable idioms,
and replaces them by an appropriate intermediate instruction. Many
idioms are machine dependent and reveal some of the semantics
associated with the program at hand. Such idioms represent low level
functions that are normally provided by the compiler at a higher level
(e.g. multiplication and division of integers by powers of 2). Other idioms
are machine independent and they reflect a shortcut used by the compiler
writer in order to get faster code (i.e. fewer machine cycles for a given
function), such as the addition and subtraction of long numbers. Some of
these idioms are widely known in the compiler community, and should be
coded into the decompiler.
(from “A Methodology for Decompilation”, Cristina Cifuentes and
K. John Gough)
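The traversal described in the passage above can be illustrated with a small sketch. It is not Boomerang's decoder: the two-byte instruction format, the Opcode values and the name findInstructions are all invented for this example. Starting at the entry point, it parses sequentially, queues every branch or call target as a new entry point, and never touches bytes that no path reaches.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy instruction set: every instruction is two bytes,
// an opcode followed by an absolute target offset (ignored by NOP and RET).
enum Opcode : std::uint8_t { NOP = 0, JMP = 1, BRANCH = 2, CALL = 3, RET = 4 };

// Returns the offsets that are reached as instructions when control flow
// is followed from `entry`. Everything else is left classified as data.
std::set<std::size_t> findInstructions(const std::vector<std::uint8_t>& image,
                                       std::size_t entry) {
    std::set<std::size_t> code;               // offsets known to hold instructions
    std::vector<std::size_t> pending{entry};  // entry points still to be parsed
    while (!pending.empty()) {
        std::size_t pc = pending.back();
        pending.pop_back();
        // Parse sequentially until control flow ends, an invalid opcode is
        // met, or we run into code that has already been parsed.
        while (pc + 1 < image.size() && image[pc] <= RET && code.insert(pc).second) {
            const std::uint8_t op = image[pc];
            const std::size_t target = image[pc + 1];
            if (op == RET) break;
            if (op == JMP) { pending.push_back(target); break; }
            if (op == BRANCH || op == CALL) pending.push_back(target);
            pc += 2;  // fall through to the next instruction
        }
    }
    return code;
}
```

For an image laid out as BRANCH 6, NOP, RET, JMP 10, two bytes of 0xFF, RET, the two 0xFF bytes at offset 8 are never reached and are therefore not reported as code. Indirect jumps, which the passage identifies as the hard case, are deliberately left out of this toy model.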
3.2 SLED
SLED (Specification Language for Encoding and Decoding) defines the mapping
between symbolic, assembly-language, and binary representations of machine
instructions.
The New Jersey Machine Code (NJMC) toolkit allows users to write
machine descriptions of assembly instructions and their associated binary
representations using the Specification Language for Encoding and
Decoding (SLED). SLED provides for compact specifications of RISC
and CISC machines; with 127, 193 and 460 lines of specification for the
MIPS, SPARC and Pentium respectively. The toolkit also provides extra
support for encoding of assembly to binary code, and decoding of binary
to assembly code. For decoding purposes, the toolkit provides a matching
statement which resembles the C switch statement. The toolkit generates C
and Modula-3 code from matching statements, hence automating part of
the disassembly process. The generated code can be integrated as a
module of a binary-decoding application.
(From UQBT project documents)
3.3 SSL and RTL
Register transfer lists (RTL) is an intermediate language that describes
transfers of information between register-based instructions. RTL assumes
an infinite number of registers hence it is not constrained to a particular
machine representation.
More recently, RTL has been used as an intermediate representation in
different system tools such as the link-time optimizer OM, GNU’s
compilers, and the editing library EEL. In all these tools, RTL stands for
register transfer language, and the representations vary widely.
(From UQBT project documents)
SSL (Syntax/Semantic Language, or Semantic Specification Language) was
developed to describe the semantics of machine instructions. In UQBT,
SSL is defined in terms of RTLs. The syntax of SSL is defined in Extended
Backus-Naur Form (EBNF).
HRTL is the result of applying transformations to RTLs in order to build a more
abstract representation.
3.4 CFG
A control flow graph (CFG) is a directed graph that represents the flow of control
of a program, thus, it only represents the flow of instructions (code) of the
program and excludes data information. The nodes of a CFG represent basic
blocks of the program, and the edges represent the flow of control between nodes.
A basic block is a sequence of consecutive statements in which flow of control
enters at the beginning and leaves at the end without halt or possibility of
branching except at the end.
Figure 4 - Data Structures to Represent a Binary Program
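As an illustration of the definitions in this section, the following sketch splits a list of decoded instructions into basic blocks and connects them into a CFG. It uses the classic "leader" method and is not taken from Boomerang or UQBT; the types Flow, Instr and BasicBlock and the function buildCfg are hypothetical names, targets are assumed to be valid instruction indices, and calls are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical decoded instruction: only its effect on control flow matters.
enum class Flow { Sequential, Jump, Branch, Return };
struct Instr {
    Flow flow;
    std::size_t target;  // instruction index, used by Jump and Branch only
};

struct BasicBlock {
    std::size_t first, last;              // inclusive range of instruction indices
    std::vector<std::size_t> successors;  // indices into the block list
};

std::vector<BasicBlock> buildCfg(const std::vector<Instr>& code) {
    // 1. Leaders: the entry, every transfer target, and every instruction
    //    that follows a transfer of control.
    std::set<std::size_t> leaders;
    if (!code.empty()) leaders.insert(0);
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].flow == Flow::Sequential) continue;
        if (code[i].flow != Flow::Return) {
            assert(code[i].target < code.size());  // targets are assumed valid
            leaders.insert(code[i].target);
        }
        if (i + 1 < code.size()) leaders.insert(i + 1);
    }
    // 2. Each block runs from one leader up to just before the next leader.
    std::vector<BasicBlock> blocks;
    std::vector<std::size_t> blockOf(code.size());
    for (auto it = leaders.begin(); it != leaders.end(); ++it) {
        auto next = std::next(it);
        const std::size_t last = (next == leaders.end() ? code.size() : *next) - 1;
        for (std::size_t i = *it; i <= last; ++i) blockOf[i] = blocks.size();
        blocks.push_back({*it, last, {}});
    }
    // 3. Edges are determined by the last instruction of each block.
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const Instr& end = code[blocks[b].last];
        if (end.flow == Flow::Jump || end.flow == Flow::Branch)
            blocks[b].successors.push_back(blockOf[end.target]);
        if ((end.flow == Flow::Sequential || end.flow == Flow::Branch) &&
            b + 1 < blocks.size())
            blocks[b].successors.push_back(b + 1);
    }
    return blocks;
}
```

A block that ends in a conditional branch gets two out-edges (taken and fall-through), while a block that ends in a return gets none.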
3.5 Backends
The Back-end generates the final output of the program. It can be an executable
file in another machine’s architecture, or a high-level language representation of
the input executable file.
4. Static Library Detection
4.1 Application
Executable files contain both user code and library code. Being able to
distinguish them lets us generate smaller and simpler source files, which are
also more readable. Furthermore, many library functions are written in assembler
and contain instructions and structures that are difficult or impossible to
decompile. All programs are more readable when calls to library functions use
their symbolic names (e.g. “strcmp” rather than “proc003”).
4.2 Library Files
There are many object and library file formats, and free software exists for
reading some of them. The most widely used object file formats today are:
1. COFF
Common Object File Format, used by Microsoft compilers. The standard for this
format is not completely documented, but there are some free tools to read it.
2. OMF
Designed by Intel, and used by Microsoft before it switched to COFF. Borland
compilers kept using this format long after Microsoft had stopped. This format
is also documented to some extent, but I have not seen any free tools that read
it.
3. ELF
Used by GNU gcc, both for object files and for executable files. The standard
for this format is freely available.
There are other formats, such as the relocatable a.out used on BSD UNIX systems
and IBM 360 objects, which are not covered here.
BFD is the GNU library for working with object, library and executable files. It
supports a wide range of object file formats such as COFF and ELF, but it does
not support OMF. I am using BFD to read object files.
BFD provides a uniform API for object files, so using it allows Boomerang to
read any object file format that BFD supports. The drawback is that we cannot
use the format-specific extra information that could be extracted if we read the
files directly.
4.3 Matching Methods
Matching a function from an object file consists of these steps:
1. Reading the function op-codes from the object file
2. Flagging out the relocatable bytes
3. Matching the remaining byte pattern against the executable sections
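The three steps above can be sketched as a masked byte search. This is a minimal illustration, not the Boomerang implementation: the Pattern structure and the findMatches function are hypothetical names, and reading the bytes and relocations from the object file (which would be done through BFD) is assumed to have happened already.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Function bytes read from an object file. wild[i] is true where byte i is
// covered by a relocation and must be ignored during matching.
struct Pattern {
    std::vector<std::uint8_t> bytes;
    std::vector<bool> wild;
};

// Returns every offset in `section` at which the pattern matches.
std::vector<std::size_t> findMatches(const std::vector<std::uint8_t>& section,
                                     const Pattern& p) {
    assert(p.bytes.size() == p.wild.size());
    std::vector<std::size_t> hits;
    const std::size_t n = p.bytes.size();
    if (n == 0 || n > section.size()) return hits;
    for (std::size_t start = 0; start + n <= section.size(); ++start) {
        std::size_t i = 0;
        // Relocated bytes match anything; all other bytes must be equal.
        while (i < n && (p.wild[i] || section[start + i] == p.bytes[i])) ++i;
        if (i == n) hits.push_back(start);
    }
    return hits;
}
```

For example, the x86 sequence "call rel32; ret" gives the pattern E8 ?? ?? ?? ?? C3, where the four displacement bytes are filled in by the linker and are therefore flagged as wild. This search costs one pass over the section per function, which is what makes raw matching slow when many libraries are involved.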
When a match occurs, we execute the following steps:
1. Create a virtual procedure in Boomerang using Prog::newProc.
2. Recover the prototype where possible. In C++, the parameters and return value
are encoded and appended to the procedure name, so if we know the mangling
scheme of the compiler, we can even extract the prototype of the function. For
more information see section 4.4.
3. Attach names to any references made by this procedure using
BinaryFile::AddSymbol. For example:
int g_mm;        // global variable referenced by myproc
int myproc(){
    g_mm = 10;
    return g_mm;
}
If we locate myproc in an executable, then by following the references made by
the function we can also locate g_mm, which is a global variable here.
Here we have assumed that we only match function code against the executable
sections of the executable file. It is also possible to match data patterns
against executable files, for example to search for the MD5 constant table or
the Blowfish substitution tables. This is useful only in rare cases, where we
have the data but the code is encrypted or otherwise inaccessible, because such
tables are usually referenced by functions and can be found by following
function references.
The interface of a general symbol matching module is defined in the
SymbolMatcher abstract class. To apply a symbol container to the executable,
we use SymbolMatcherFactory, which returns a suitable SymbolMatcher object for
the specified symbol container file.
4.3.1 Raw Object/Library Matching
This method is used for raw object/library files whose format is supported by
BFD. It is implemented in BfdObjMatcher and BfdArchMatcher. BfdObjMatcher
is used to read and match object files, and BfdArchMatcher does the same for
library files.
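The relationship between these classes can be sketched as follows. Only the four class names come from the text above; every member function, the kind() method and the choice of selecting a matcher by file extension are assumptions made for this illustration, and a real implementation would more likely ask BFD to identify the file format.

```cpp
#include <memory>
#include <string>

// Class names follow the text; all member signatures are assumptions.
class SymbolMatcher {
public:
    virtual ~SymbolMatcher() = default;
    // Stand-in for the real matching entry point, which is not shown here.
    virtual std::string kind() const = 0;
};

class BfdObjMatcher : public SymbolMatcher {
public:
    std::string kind() const override { return "object"; }
};

class BfdArchMatcher : public SymbolMatcher {
public:
    std::string kind() const override { return "archive"; }
};

class SymbolMatcherFactory {
public:
    // Picks a matcher from the file extension (an assumption of this sketch);
    // returns nullptr when no matcher is suitable.
    static std::unique_ptr<SymbolMatcher> create(const std::string& path) {
        auto endsWith = [&path](const std::string& suffix) {
            return path.size() >= suffix.size() &&
                   path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
        };
        if (endsWith(".o") || endsWith(".obj")) return std::make_unique<BfdObjMatcher>();
        if (endsWith(".a") || endsWith(".lib")) return std::make_unique<BfdArchMatcher>();
        return nullptr;
    }
};
```

The point of the factory is that the caller only deals with the SymbolMatcher interface, so a signature-file matcher can later be added without changing the calling code.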
4.3.2 Signature Matching
Considering the number of libraries, the number of versions of each library, and
the number of functions each contains, matching every available library against
an executable is very time consuming. For automatic library function detection,
raw library matching is therefore impractical. To reduce the cost, we can
consider the following points:
1. Limit the number of libraries to be matched against the executable by
checking whether the executable could have used a given library at all, for
example by comparing the platform of the executable with that of the library.
2. Some libraries, when used in an executable, always add a particular function
to the executable. This primary function can be searched for before any other
function of that library; if it is not found, we can assume that the library has
not been used.
Even after applying these considerations, the problem of the large number of
functions to match remains. So we use another method for matching function
data: function signatures are stored in a tree structure, and all library
functions are matched at once using that structure. This eliminates searching
the whole code section once for each function.
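The idea of matching all functions at once can be sketched with a byte-keyed tree in which a special key stands for a wild byte. This is a simplified in-memory illustration, not the on-disk signature format of this project: SigNode, addSignature, matchAt and kWild are hypothetical names, each node holds a single byte rather than a run of bytes, and the first (shortest) signature that matches wins.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

constexpr int kWild = -1;  // stands for one relocation ("wild") byte

struct SigNode {
    std::map<int, std::unique_ptr<SigNode>> children;  // key: byte value or kWild
    std::string symbol;  // non-empty only where a signature ends
};

// Adds one signature (values 0..255 or kWild) to the tree.
void addSignature(SigNode& root, const std::vector<int>& sig, const std::string& name) {
    SigNode* node = &root;
    for (int b : sig) {
        std::unique_ptr<SigNode>& child = node->children[b];
        if (!child) child = std::make_unique<SigNode>();
        node = child.get();
    }
    node->symbol = name;
}

// Walks the tree along the bytes starting at code[pos].
// Returns the matched symbol, or an empty string if nothing matches.
std::string matchAt(const SigNode& node, const std::vector<std::uint8_t>& code,
                    std::size_t pos) {
    if (!node.symbol.empty()) return node.symbol;
    if (pos >= code.size()) return "";
    // Try the exact byte first, then fall back to the wild-byte branch.
    for (int key : {static_cast<int>(code[pos]), kWild}) {
        auto it = node.children.find(key);
        if (it == node.children.end()) continue;
        std::string hit = matchAt(*it->second, code, pos + 1);
        if (!hit.empty()) return hit;
    }
    return "";
}
```

With this structure, each candidate offset in the code section needs one walk down the tree, regardless of how many functions the signature file contains. Functions that share a prefix share the corresponding path and diverge only at the first differing byte.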
Building a tree structure for functions while accounting for relocation bytes,
which we call wild-bytes, can be done in many different ways and requires many
design decisions. For example: how many bytes of a function should be used for
matching? When two functions are identical in those bytes, how should we
distinguish them? When two functions differ only in relocation bytes, how can we
find the difference?
The signature file structure is shown below:
struct SIG_FILE_HEADER{
    // library platform
    platform platform_id; // should I use "MACHINE" enum? (from binaryfile.h)
    // format of the executable file
    // that signatures can be matched against
    LOAD_FMT file_format;
    // name of the library this
    // signature is extracted from
    SIG_STRING_IDX library_name;
    // name of header file which
    // defines possible structures
    // used in this library
    SIG_STRING_IDX header_file;
    // start of symbols section
    int symbols_section_start;
    // total number of symbols
    // in this signature file
    int total_symbols;
    // start of nodes section
    int tree_section_start;
    // total number of nodes
    // in this signature file
    int total_nodes;
    // start of references section
    int references_section_start;
    // total number of references
    // in this signature file
    int total_references;
    // start of strings section
    int strings_section_start;
    // total number of strings
    // in this signature file
    int total_strings;
    // start of index arrays section
    int index_array_section_start;
    // total number of index arrays
    // in this signature file
    int total_index_arrays;
    // primary symbol of this signature
    // file.
    // primary symbol is the symbol which
    // is always linked when this library
    // is used.
    SIG_SYMBOL_IDX primary_symbol;
};
struct SIG_SYMBOL{
    // prototype of the symbol
    SIG_STRING_IDX prototype;
    // comments
    SIG_STRING_IDX comments;
    SIG_NODE_IDX matching_node;
};
struct SIG_NODE{
    // if this node is just a reference
    bool is_ref;
    // parent of this node
    SIG_NODE_IDX parent;
    union{
        // if not just a reference
        struct{
            // size of node contents
            int contents_size;
            // signature contents
            unsigned char *contents;
        };
        // if just a reference
        SIG_SYMBOL_REF_IDX sym_ref;
    };
    // is it a leaf? if so, we
    // have to flag the matched symbol
    bool is_leaf;
    union{
        // if this node is non-leaf,
        // specifies the child nodes
        struct{
            int childs_count;
            SIG_ARRAY_IDX child_nodes;
        };
        // if this node is a leaf,
        // specifies which symbol
        // is matched
        SIG_SYMBOL_IDX symbol;
    };
};
struct SIG_SYMBOL_REF
{
    // if the offset of this symbol
    // is stored relative to current
    // location
    bool is_relative;
    // true if the reference
    // is already in this signature file,
    // as a symbol, and false if not
    bool is_resolved;
    union{
        // if not resolved,
        // we just hold the name
        // or prototype
        SIG_STRING_IDX name;
        // if resolved, we hold the
        // symbol index
        SIG_SYMBOL_IDX symbol;
    };
};
struct SIG_STRING{
    // size of the structure, in bytes
    int size;
    // string contents, variable size
    // items count is (size - 1)
    char content[1];
};
struct SIG_ARRAY{
    // size of the structure, in bytes
    int size;
    // array contents, variable size
    // items count is : ((size / sizeof(int)) - 1)
    int data[1];
};
As in dcc, the signature file is divided into sections (they could be put in
separate files, but I consider the single-file layout better).
The sections are:
1. Symbols Section: contains symbol information.
2. Signature Tree Section: contains the nodes of the signature tree, which are
matched against the executable code.
3. References Section: references made by functions are stored here; they can be
variables or other functions.
4. Strings Section: all strings in the signature file are kept here, for
performance reasons, to avoid variable-sized structures elsewhere.
5. Indexes Section: all index arrays are kept here, again for performance
reasons, to avoid variable-sized structures elsewhere.
Every index is a byte offset relative to the start of its section.
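As a sketch of how such an index would be resolved, the following function reads one SIG_ARRAY record from a signature file held in memory. The function name readIndexArray is hypothetical, and the sketch assumes that int is 32 bits wide, that the file uses the host byte order, and that the size field counts the whole record including itself, as the item-count formula in the SIG_ARRAY comment suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reads the index array stored at byte offset `index` inside the section
// that starts at byte offset `sectionStart` of the file image.
std::vector<std::int32_t> readIndexArray(const std::vector<unsigned char>& file,
                                         std::int32_t sectionStart,
                                         std::int32_t index) {
    const std::size_t at =
        static_cast<std::size_t>(sectionStart) + static_cast<std::size_t>(index);
    assert(at + sizeof(std::int32_t) <= file.size());

    // memcpy avoids unaligned reads, since records sit at arbitrary offsets.
    std::int32_t size = 0;
    std::memcpy(&size, file.data() + at, sizeof size);
    const std::size_t bytes = static_cast<std::size_t>(size);
    assert(bytes >= sizeof(std::int32_t) && at + bytes <= file.size());

    // Item count is (size / sizeof(int)) - 1, as in the SIG_ARRAY comment.
    std::vector<std::int32_t> items(bytes / sizeof(std::int32_t) - 1);
    if (!items.empty())
        std::memcpy(items.data(), file.data() + at + sizeof size,
                    items.size() * sizeof(std::int32_t));
    return items;
}
```

A non-leaf SIG_NODE would be handled by passing index_array_section_start and the node's child_nodes value to such a function, and then treating each returned item as an offset into the tree section.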
The signature tree is built from byte sequences and references, which are stored
in separate nodes.
For example:
int len2x(char *str)
{
int i = strlen(str);
return i*2;
}
int len3x(char *str)
{
int i = strlen(str);
return i*3;
}
are compiled into the following code:
Figure 5 - Functions Code-Bytes
If we have only these two functions in our signature file, the signature tree would look
like this:
Figure 6 - Signature Tree
4.4 Symbols Demangling
In C, functions can be distinguished by their names alone. In C++, however,
different functions can have the same name and differ only in their arguments,
so compilers need to distinguish them using both the name and the parameters.
Mangling is the process of adding tokens for the parameters to the name of the
function. There is no public standard for mangling, and each compiler has its
own mangling scheme. For example, with the Microsoft compiler:
“?kk@@YAXXZ” stands for “void __cdecl kk(void)”
For a decompiler, the name, parameters, return value and calling convention of a
function are a precious source of information. So we need demangling support at
least for the commonly used compilers. Fortunately, free implementations exist
for the demangling schemes of popular compilers. We use the Wine demangler for
the Microsoft compiler and the BFD demangler for gcc.
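The gcc side of demangling can be tried out with a few lines of code. Note that this sketch does not use the BFD demangler that the project relies on; it substitutes abi::__cxa_demangle from the C++ runtime shipped with gcc and clang, which handles the Itanium C++ ABI scheme used by modern versions of those compilers. The wrapper name demangle is my own.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangles an Itanium C++ ABI name. Returns the input unchanged when it
// is not a valid mangled name in that scheme.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) return mangled;
    std::string result(text);
    std::free(text);  // the buffer is allocated with malloc by the runtime
    return result;
}
```

Under this scheme the function from the example above, void kk(void), is mangled as _Z2kkv, and demangle("_Z2kkv") gives back kk(). Unlike the Microsoft name, the Itanium name of an ordinary function does not encode the return type or the calling convention, so less of the prototype can be recovered from it. The Microsoft name ?kk@@YAXXZ is not understood by this routine and is returned unchanged, which is why a separate demangler is needed for each compiler family.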
5. Terminology
5.1 Terms
"Original source code", as opposed to the decompiled output (which, although
an output, is still referred to as "source" code). This is the original, usually
high level, source code that the program was written in.
"Input program", for the program that the decompiler reads. Terms such as
"source program" just create confusion.
"Decompiled output", which as noted above is a form of source code.
"Executable file" refers to a general class of files that could be decompiled.
Here, the term includes both machine code programs, and programs compiled
to a virtual machine form. The term "native" distinguishes machine code
executables from others. Most people understand "machine code" as meaning
an executable that is executed directly by the processor, and so means the
same but is clearer than "native executable".
5.2 Acronyms
UQBT: University of Queensland Binary Translator. Boomerang is based in
part on code from UQBT.
SSL: Semantic Specification Language. This is the language used in SSL files,
which specify the semantics (meaning) of instructions.
IR: Internal Representation; a representation of the input program in a form
that is convenient for the current analysis or transformation.
RTL: Register Transfer List (sometimes Register Transfer Language). The
term Register Transfers actually comes from hardware design, where registers
are arrays of single bit storage elements, but in software engineering has come
to mean a style of program representation at the register and memory level.
Every transfer (assignment) is explicit, including to flags registers.
DFA: Data Flow Analysis.
SSA: Static Single Assignment. A representation variation that makes certain
kinds of DFA easier to perform.
CFG: Control Flow Graph. Nodes in the CFG are Basic Blocks, and edges
represent possible control flow (execution paths that the program could take).
For example, a basic block ending in a conditional branch would have two
out-edges, one each for the case where the branch is taken and not taken.
BB: Basic Block. Usually a list of statements or RTL which are always
executed together. A basic block is terminated by a conditional or
unconditional branch or call, an indirect branch or call (including n-way
branches or switch statements), return instructions, or labels (where other
control flow enters). If the label is not explicit, a "fall through" basic block
could terminate in an ordinary (non control flow altering) instruction.
TA: Type Analysis.
HLL: High Level Language; in a decompiler, typically the output language.
AST: Abstract Syntax Tree. This is an IR close to the HLL, typically at the
statement level. For example, a node of the AST might be labeled as a
pre-tested-while node, and children of that node could represent the loop
conditional expression, and a block node representing statements in the loop.
In a compiler, an AST typically results from parsing the input HLL program.
6. Project Discussions
Mohsen Hariri <m.hariri@gmail.com>
About Boomerang
4 messages
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: emmerik@users.sourceforge.net, quantumg@users.sourceforge.net
Fri, Jul 15, 2005 at 1:30 PM
Hi
As my undergraduate project, I want to participate in
development of Boomerang project. I have read the stuff
in the site and downloaded the current CVS codes, and
will inspect them in a few days.
My skills/experiences are:
1. Programming in C/C++, Python, PHP, Assembly
for about 7 years
2. Familiar with many disassembling/debugging tools from old
turbo debugger to current IDA, w32dasm and OllyDBG
4. Familiar with Intel cpu architecture/instruction set
5. Familiar with Windows programming for about 5 years
6. Familiar with PE/COFF/OMF executable/object file formats
If there are any startup tips, or which part of the project is
better to start reading, please tell me.
I hope that I can do something in this bright-future project.
Regards
Mohsen Hariri
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Cc: quantumg@users.sourceforge.net
Sun, Jul 17, 2005 at 4:39 PM
> Hi
>
> As my undergraduate project, I want to participate in
> development of Boomerang project.
Great!
> My skills/experiences are:
> 1. Programming in C/C++, Python, PHP, Assembly
> for about 7 years
> 2. Familiar with many disassembling/debugging tools from old
> turbo debugger to current IDA, w32dasm and OllyDBG
> 4. Familiar with Intel cpu architecture/instruction set
> 5. Familiar with Windows programming for about 5 years
> 6. Familiar with PE/COFF/OMF executable/object file formats
Your skills seem well suited to working on Boomerang, apart from your
counting skills :-)
> If there are any startup tips, or which part of the project is
> better to start reading, please tell me.
The "still to be done" Boomerang page
(http://boomerang.sourceforge.net/tobedone.html) has all the details on
this. There are also some general ideas on the "information for students"
page: http://boomerang.sourceforge.net/students.html .
But these don't prioritise the things that need to be done, or consider
them from the point of view of an undergraduate project.
For a moderately stand alone project, perhaps consider some solution to the
detection of static library files. This problem is holding Boomerang back,
although the -sf switch allows the information to be entered manually. You
could perhaps just use the BinaryFile interface (include/BinaryFile.h) to
read a binary file, and perhaps use small parts of the decoder to find call
statements. From this, you could generate a list of addresses and static
library signatures, suitable for including in a -sf file. That way, you
would not have to make any changes to Boomerang at all, and not have to
understand all that much about how it works in great detail. Consider for
example that three students recently worked on a PowerPC front end, and
found it difficult to get it working to the extent even of allowing "hello
world" to decompile. That was a one semester project; perhaps yours is a
full year project. It can be frustrating trying to understand significant
parts of a 60,000 line project, especially if parts of it (e.g. the front
end) use unfamiliar tools such as the New Jersey Machine Code Toolkit.
(These students took well over a week just to make the toolkit, for
example).
So a front end would make a good whole year project, but I'd avoid it for a
one semester project. A Java front end might be interesting, so that
Boomerang could be compared with Java decompilers, some of which are very
good. (However, you may get bogged down in design matters, e.g. how to
convey some kinds of type information from the front end to the
intermediate representation.)
The structuring code (converting to loops and if/then/else, handling break
and continue, etc) needs some work. The advantage of this is that the code
is relatively clean, and is fairly well isolated from the rest of the
decompiler. The main needs here are for loops (or at least while loops
without the if/then around them), similarly changing the if statement
around switch statements to become the default case, and short circuit
conditionals (e.g. if (p1 || p2) { ... }).
If you are interested in parsers, the three or four parsers in Boomerang
need to use something better supported than Coetmeur's Bison++. There is a
Sourceforge project called Bisonc++ that sounds promising. The disadvantage
of this project is that when completed there is no change visible from the
outside; it's all "inner beauty".
A new back end would be interesting, and Boomerang only has the one. There
may be some design issues, since Boomerang is moderately strongly tied to
"C-like" output.
I would not attempt yet another GUI for Boomerang, unless you have really
strong ideas about what is needed, and I'd recommend running it past us
first.
Similarly, implementing new analyses for Boomerang would probably be beyond
an undergrad project. Also, it might clash with changes I may implement as
part of my thesis research.
> I hope that I can do something in this bright-future project.
Well, there are several ideas in the above that would make interesting
projects. Of course, in reading the code, you may come up with something
completely different. For example, some Danish students recently finished a
project on how "phi loops" can happen, and what to do about them. A fellow
from Italy is looking in to how Boomerang might be able to deal with
malware. I've had some ideas recently on how to parallelise decompilation,
but these could well be difficult to implement.
So consider all these ideas, and by all means discuss them with me if you
like. Good luck!
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Jul 31, 2005 at 10:11 AM
Hi
Sorry for the so late reply and thanks for your long and descriptive reply.
I think the static library detection would be a fine project. After a little digging
around in boomerang sources, and taking a look how IDA does library function
detection, I've come up with this:
I will make another "project" in the "solution" (I'm using the visual studio
names, as I will be using that for development), which will contain these
tools:
1. A program to make library signature files (something like IDA FLIRT
signatures) from given libraries
2. An interface for the Boomerang Win32BinaryFile to be able to use those
signatures (we have to access the contents of the file, so I think this
feature is very dependent on executable file format; maybe we can
just use the library signature generation files independent of the executable
format)
and one more thing. I've read about Microsoft compiler global optimizations,
I read it in one of Matt Pietrek's MSDN Magazine issues (I did a little search but couldn't find
it). I haven't seen what would happen to library functions when global optimization is
activated but I remember Matt said "Global optimization makes functions exported from
different files to be optimized by passing arguments by registers, or eliminating
arguments.... " here is the MSDN link, it doesn't say anything about libraries:
http://winfx.msdn.microsoft.com/library/default.asp?url=/library/enus/dv_vccomp/html/d10630cc-b9cf-4e97-bde3-8d7ee79e9435.asp
Anyway I will test that. Please tell me if I'm doing it the right way.
One more thing, when DLL functions are imported by number(Ordinal import),
we can get their names if we have the DLL files. Is it something related
to what I do?
bye
Mohsen
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Jul 31, 2005 at 3:04 PM
> Hi
>
> Sorry for the so late reply and thanks for your long and descriptive reply.
No problem.
> I think the static library detection would be a fine project. After a little
> digging around in boomerang sources, and taking a look how IDA does library
> function detection, I've come up with this:
>
> I will make another "project" in the "solution" (I'm using the visual studio
> names, as I will be using that for development), which will contain these
> tools:
>
> 1. A program to make library signature files( something like IDA FLIRT
> signatures ) from given libraries
Good. The problem with this is finding the library files. I wonder
if it might be possible to create signatures from executable files, where
you can specify some start addresses and library names, e.g. from a
symbols.h file (as used by the -sf switch). Of course you need some way of
identifying those library functions, e.g. using IDA Pro or other tools, or
figuring out what it does from a disassembly and guessing. Just a thought.
> 2. An interface for the Boomerang Win32BinaryFile to be able to use those
> signatures (We have to access the contents of the file, so I think this
> feature is very dependent on executable file format, maybe we can
> just use the library signature generation files independent of the
> executable format)
Well, just make an interface; I don't think it should be accessed from
Win32BinaryFile, as signatures can be generated for any architecture in any
executable file format. You may only be able to generate signatures for
Win32 binary files with Tool 1 above, but other tools can be written to
generate signatures for other combinations, and those signatures should be
applicable by tool 2. Functions are in a sense just byte streams; it should
not matter what the architecture is or how they are encoded in the
executable file.
>
> and one more thing. I've read about microsoft compiler global optimizations,
> ...
> Anyway I will test that.
Good idea, though I'd be very surprised if the library functions are
accessed with register calling convention. I've certainly seen user
programs that have such parameters.
> Please tell me if I'm doing it the right way.
You're on the right track, as far as I can tell.
> One more thing, when DLL functions are imported by number(Ordinal import),
> we can get their names if we have the DLL files. Is it something related
> to what I do?
Yes and no. No in the sense that this is not signature matching, so you
don't need to include this as part of your project. Yes in the sense that
this achieves the same purpose (i.e. identifying library functions), just
for a different class of library functions (dynamically linked functions
using ordinal numbers as opposed to statically linked functions). They are
similar in that each signature can have some sort of ID associated with it
(perhaps a checksum, index in a special hash table, etc), and associated
with that ID is a function signature (unfortunate clash of terminology
here; I mean a function's name, names and types of parameters, and type of
return value if any). So it would seem fairly easy to extend the
identification of static library functions to those that are dynamically
linked and accessed with ordinal numbers.
I'll leave it up to you to decide whether to make this extension.
Good luck!
- Mike
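The pairing described in the reply above, an ID derived from a function's bytes that is associated with a function signature (name, parameters, return type), could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `FunctionPrototype`, `SignatureTable` and `fnv1a`, and the choice of an FNV-1a hash over the function bytes as the ID, are invented for this example and are not taken from Boomerang, dcc or IDA.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record for a "function signature" in the prototype sense:
// return type, name, and parameter declarations.
struct FunctionPrototype {
    std::string returnType;
    std::string name;
    std::vector<std::string> params;
};

// FNV-1a hash over a byte range. It stands in for "some sort of ID";
// a checksum or an index into a hash table would serve the same role.
inline std::uint32_t fnv1a(const std::uint8_t* bytes, std::size_t len) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= bytes[i];
        h *= 16777619u;
    }
    return h;
}

// Maps the ID of a function's bytes to its prototype (hypothetical class).
class SignatureTable {
public:
    void add(const std::vector<std::uint8_t>& code, FunctionPrototype proto) {
        table_[fnv1a(code.data(), code.size())] = std::move(proto);
    }

    // Returns nullptr when no entry has the same ID. A real tool would also
    // compare the bytes themselves, because distinct functions can collide.
    const FunctionPrototype* lookup(const std::uint8_t* bytes, std::size_t len) const {
        auto it = table_.find(fnv1a(bytes, len));
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::uint32_t, FunctionPrototype> table_;
};
```

With a table of this shape, extending the lookup to ordinal imports would presumably only change the key: the ordinal number of the DLL export would take the place of the hash.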
Mohsen Hariri <m.hariri@gmail.com>
news
2 messages
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Aug 2, 2005 at 9:17 PM
Hi
Now I've read all I could find about IDA FLIRT, there was
a very good document(http://www.datarescue.com/idabase/flirt.htm)
it tells everything about it. I didn't know the "main" function detection
is also a part of these signatures.
I found the article about cross file function calling optimizations.
(http://msdn.microsoft.com/msdnmag/issues/02/05/Hood/)
it is called LTCG(Link Time Code Generation) not Global Optimization.
it says at the end:"Next, the OBJ files produced when using LTCG
aren't standard COFF format OBJs." and he says they are compiler
version dependent, so we don't have to worry, the standard LIBs
don't seem to be using those features, at least not in near future.
Now I want to discuss how should we do library detection. I want
the method to have these features:
1. main function detection
something like IDA, but it cannot be done automatically, so signature
detections can be implemented in three parts: one to detect if the executable
is using specified library, one to locate the main function, and one to detect
functions locations. the first two parts are optional for each library.
2. parameter and return value type/name identification
for library functions, we can also save parameter names and use them when
a function is detected, I think it is done in IDS files for IDA, am I right?
3. Ability to use a LIB or OBJ file directly, without the need to go
through those signature creation procedures, in case we want to.
So I will implement this feature in two parts:
1. A Detection Method
2. A Detection Input
Detection methods would be:
1. Normal Library Detection, which takes a LIB or OBJ file format.
In this method we can also find variable names, for example we
apply an object file using this method, and some functions from
that object file are found in the executable, and those functions
use some global variables, and we can flag type/name of those
global variables also.
But this method would be slow, as we would be checking whole
functions against each function position in the executable.
The input files can be in OMF/COFF format.
2. Signatured Library Detection
something like what IDA does, but first we have to decide on our
signature files format.
3. Dynamic Library Detection
this method flags the function names along with parameters and
return values. the inputs are DLL files, and they may be provided
automatically if they exist in system folders when Boomerang
processes imports
But should library detection be applied after functions locations detection,
I mean after first analysis of boomerang or before that?
Anyway, I think for the first step, I will implement normal library detection
for COFF files, and see how it will be.
Is there any other way to contact you like using mirc or IMs or alike?
Bye
Mohsen
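The "normal library detection" method in the message above, checking a whole library function against each position in the executable, can be sketched as a brute-force byte comparison. The sketch below is a minimal illustration, not the LibId implementation: the function names are hypothetical, and the `mask` parameter (used to skip bytes that a linker would patch through relocations) is an assumption that the message itself does not mention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True if `pattern` matches `image` starting at `offset`.
// mask[i] == false marks a byte to ignore, e.g. one covered by a relocation
// (the masking is an assumption of this sketch).
inline bool matchesAt(const std::vector<std::uint8_t>& image, std::size_t offset,
                      const std::vector<std::uint8_t>& pattern,
                      const std::vector<bool>& mask) {
    assert(pattern.size() == mask.size());
    if (offset > image.size() || pattern.size() > image.size() - offset)
        return false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (mask[i] && image[offset + i] != pattern[i])
            return false;
    }
    return true;
}

// Tries every offset in the image and returns those where the function matches.
inline std::vector<std::size_t> findFunction(const std::vector<std::uint8_t>& image,
                                             const std::vector<std::uint8_t>& pattern,
                                             const std::vector<bool>& mask) {
    std::vector<std::size_t> hits;
    if (pattern.empty() || pattern.size() > image.size())
        return hits;
    for (std::size_t offset = 0; offset + pattern.size() <= image.size(); ++offset) {
        if (matchesAt(image, offset, pattern, mask))
            hits.push_back(offset);
    }
    return hits;
}
```

The nested loops make the cost proportional to image size times function size for every library function, which is consistent with the concern above that this method would be slow compared with a signature-based lookup.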
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Aug 3, 2005 at 5:24 AM
> But should library detection be applied after functions locations detection,
> I mean after first analysis of boomerang or before that?
I think it should probably be done first up. Signature detection, assuming
it's pretty robust, should arguably (perhaps optionally) override
directions from the user (e.g. in a symbols.h -sf file), since the user was
possibly guessing, and the signature detection probably is right. In any
case, you should probably do signature detection first if only to point out
any conflicts.
> Is there any other way to contact you like using mirc or IMs or alike?
I'm not a great fan of chat/messaging, but I've been known to use Mozilla's
messaging client. Actually, I now use Firefox, and it looks like I might
have to hunt one down.
- Mike
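The policy suggested in the reply above, running signature detection first, letting its results optionally override user-supplied symbols, and pointing out any conflicts, might look roughly like the following. This is a sketch under assumptions: `SymbolMap`, `Conflict` and `mergeSymbols` are hypothetical names, and a plain address-to-name map stands in for whatever structure Boomerang actually uses.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table: address -> symbol name.
using SymbolMap = std::map<std::uint32_t, std::string>;

// One disagreement between the user's symbol file and signature detection.
struct Conflict {
    std::uint32_t address;
    std::string userName;
    std::string detectedName;
};

// Merges detected symbols into the user's symbols and reports every address
// where the two disagree. When detectedWins is true, the detected name
// replaces the user's name at conflicting addresses.
inline std::vector<Conflict> mergeSymbols(SymbolMap& user, const SymbolMap& detected,
                                          bool detectedWins = true) {
    std::vector<Conflict> conflicts;
    for (const auto& entry : detected) {
        auto it = user.find(entry.first);
        if (it == user.end()) {
            user.insert(entry);  // new information, no conflict
            continue;
        }
        if (it->second != entry.second) {
            conflicts.push_back({entry.first, it->second, entry.second});
            if (detectedWins)
                it->second = entry.second;
        }
    }
    return conflicts;
}
```

Keeping the override behind a flag reflects the "perhaps optionally" in the reply: the conflict list is produced either way, so the user can still be told where a guess in the symbol file disagrees with a detected signature.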
Mohsen Hariri <m.hariri@gmail.com>
progress
2 messages
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Aug 6, 2005 at 11:48 AM
Hi
Today I wrote a tiny program and tried to decompile it, and see the flow of
boomerang, to understand its parts better.
I used:
J:\moh\decompiler\boomerang\debug>console -E 0x00411020 -sf
J:\moh\decompiler\tiny\sig.h J:\moh\decompiler\tiny\Debug\tiny.exe
contents of the sig.h file:
0x411020 __cdecl int myfunc(void);
when boomerang decompiles this function, it doesn't
care the return type I specify, and tells this function
is a void function. why?
By the way, I've attached a diff file to prevent access violation
when import table of an executable is empty, as my tiny program
was.
now I understand what you meant by saying I have to
generate something like a signature file(-sf). I think
now I'm able to start some coding, I will start by
reading a COFF object file and matching the whole
functions against the executable, and generating a
signature file, just for the start.
I want to name the library detection project "LIBID",
(Library Identification). any opinions?
bye
Mohsen
Attachment: Win32BinaryFile.diff (4K)
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Aug 7, 2005 at 1:40 AM
> when boomerang decompiles this function, it doesn't
> care the return type I specify, and tells this function
> is a void function. why?
Oops, thought I fixed that one. Basically, I made a fairly major change of
how Boomerang detects parameters and returns, but it doesn't fit well with
the -sf symbol file. So I have to make explicit exceptions to the logic for
parameters, for returns (which you found are not working), for the function
name, argument names, etc. I'll see if I can fix that in the next few days.
- Mike
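Since the plan in this thread is to have the library detector generate a file suitable for the -sf switch, the output step can be sketched as below. The sketch only mirrors the single line shown in sig.h above (`0x411020 __cdecl int myfunc(void);`); the function name `symbolFileLine` is hypothetical, and whether Boomerang's symbol file parser accepts other layouts is not covered here.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Formats one symbol-file entry in the shape of the sig.h line quoted above:
//   <hex address> <calling convention> <prototype>;
// The prototype is passed without the trailing semicolon.
inline std::string symbolFileLine(std::uint32_t address,
                                  const std::string& callingConvention,
                                  const std::string& prototype) {
    std::ostringstream out;
    out << "0x" << std::hex << address << ' '
        << callingConvention << ' ' << prototype << ';';
    return out.str();
}
```

For example, `symbolFileLine(0x411020, "__cdecl", "int myfunc(void)")` reproduces the sig.h line used in the test run above, so a detector could write one such line per matched library function.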
Mohsen Hariri <m.hariri@gmail.com>
Re: diff
3 messages
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Aug 7, 2005 at 1:41 AM
> By the way, I've attached a diff file to prevent access violation
> when import table of an executable is empty, as my tiny program
> was.
Heh, thanks. I must never have had a tiny enough executable :-) Diffs like
this are always welcome.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Aug 7, 2005 at 10:34 AM
seems the only opensource library I can find is BFD, for reading
object files. but it does not support OMF (borland's format).
tooo bad!!
do you know any other libs? should I use this and hope in future
we can use some other ways for OMF?
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Mon, Aug 8, 2005 at 5:55 AM
> seems the only opensource library I can find is BFD, for reading
> object files. but it does not support OMF (borland's format).
> tooo bad!!
Ick! I didn't realise that. What Borland compilers still use OMF? I'm sure
the latest ones use PE, e.g. test/windows/switch_borland.exe . This page:
http://csharpcomputing.com/Tutorials/Lesson20.htm
seems to indicate that you can effectively use the MSVC linker to convert
OMF to PE, but it's not clear if you can just convert and save the result;
perhaps it just allows linking of OMF .obj files.
Ah, OMF files, I remember those. I have some simple source code that
parses part of them, still available in the file makedsig.c, in this
archive:
http://www.itee.uq.edu.au/~cristina/dcc/distribution/makedsig.zip
It would be a lot of work to make a complete loader from this, though. The
above code just scans a .lib file in OMF format, and pulls out some names,
from memory.
> do you know any other libs?
You should look at cgen (http://sourceware.org/cgen/), though it might use
BFD as the loader, and ISDL
(http://www.princeton.edu/~mescal/spam/pubs/ISDL-TR.html), which is
supposed to be able to generate an assembler from a spec, though I don't
know if that includes the binary file writer (and of course you want a
reader, but perhaps they come together.)
Apart from that, it's a bit dismal, I agree. Also, BFD has a GPL license I
think, and I'm not sure if that was one of the reasons I rejected using it
for Boomerang.
What about the BinaryFile part of Boomerang? It may be possible to add OMF
support to that without too much effort.
> should I use this and hope in future we can use some other ways for OMF?
Is OMF really important? Searching the web, it seems really obsolete.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Re: last email
1 message
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Mon, Aug 8, 2005 at 6:00 AM
Mohsen,
sorry with the last email, of course OMF is relevant, because people want
to decompile old executables, and therefore they would want signatures for
it. I forgot what you wanted to do with the files...
Anyway, that makes the makedsig.c file even more relevant. This is a
signature generator for dcc that I wrote about 11 years ago (when OMF was
still the dominant format). You might even get some ideas from the perfect
hashing function stuff that I used, though there are probably other storage
formats that are better these days (e.g. something based on trees, as per
IDA).
- Mike
Mohsen Hariri <m.hariri@gmail.com>
compiling libbfd
2 messages
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Aug 12, 2005 at 3:53 PM
Hi
I'm just stuck compiling libbfd as a dll under cygwin. I'm not familiar with
cygwin, actually it's my first time compiling something under cygwin.
I run these commands under bfd folder of binutils:
./configure --enable-shared=yes
checking build system type... i686-pc-cygwin
checking host system type... i686-pc-cygwin
checking target system type... i686-pc-cygwin
checking for gcc... gcc
checking for C compiler default output file name... a.exe
[...]
checking if libtool supports shared libraries... yes
checking if package supports dlls... no
checking whether to build shared libraries... no [WHY????]
checking whether to build static libraries... yes
creating libtool
[...]
I tried anything I could, searching internet for hints,
changing the configure script, or trying to make
a dll out of statically compiled libbfd.a (actually it works, but the dll has
no exported function, even with gcc's --export-all parameter),
all unsuccessful.
maybe you have more experience with cygwin. anyway, if I'm making
any stupid mistake please inform me.
thanks in advance
bye
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Aug 13, 2005 at 12:59 AM
> Hi
>
> I'm just stuck compiling libbfd as a dll under cygwin. I'm not familiar with
> cygwin, actually it's my first time compiling something under cygwin.
> I run these commands under bfd folder of binutils:
>
> ./configure --enable-shared=yes
Um, I've never used that method. I believe that it requires the Makefile to
pick up some define (e.g. in include/config.h) and make the appropriate
things happen.
What I'd do is change the link command to include "-shared". This is the
Linux linker's command to generate a shared object, which for Cygwin will
generate a .dll file. (You may have to rename the output with the dll
name). That's what we do in Boomerang. That part of the Makefile was sorted
out by some people more expert at these sorts of things than me. Here is
the command that generates lib/libWin32BinaryFile.dll (from the loader/
directory) under CygWin:
g++ -Wall -g -O0 -o ../lib/libWin32BinaryFile.dll -shared Win32BinaryFile.o
microX86dis.o -lBinaryFile -L../lib
Perhaps this will help.
> [...]
> checking if libtool supports shared libraries... yes
> checking if package supports dlls... no
> checking whether to build shared libraries... no [WHY????]
Sorry, no idea. I'm really lame with configure scripts. You could try to
read the actual script (it's just a shell script) to try to understand the
logic, and maybe insert some echo commands or the like to debug it. I
remember doing that for a while before those experts came along and fixed
that part of Boomerang.
> maybe you have more experience with cygwin. anyway, if I'm making
> any stupid mistake please inform me.
I wouldn't say a stupid mistake, but I think that the approach of fiddling
with the configure script will surely be less successful than changing the
Makefile or modifying the final link command by hand.
Good luck!
- Mike
Mohsen Hariri <m.hariri@gmail.com>
libid
4 messages
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Aug 21, 2005 at 1:01 PM
Hi,
This week I have done the following:
1. at last bfd was compiled under cygwin, I used mingw target under cygwin to make
windows native DLL(the cygwin target just hung when I tried to load the compiled dll)
I made some test projects and now I'm ready to make library signatures.
2. I thought it's better to mention the already done parts of the project in my
undergraduate thesis, and write 'how it works', so I read UQBT and some
other documents mostly written by 'Cristina Cifuentes', they were excellent.
Now I'm much more familiar with the Boomerang and UQBT, I think I know
all the parts and their functionalities, at least at an abstract level. I didn't
know how BIG the project is.
3. I saw dcc sources, they were simple and I could understand them
easily. I'll take a deeper look in a few days as I start my coding.
(those files were from 1993... at that time I was an 11-year-old boy
while you were looking for signatures :p)
Seems your university is very active on this subject.
4. I tried to write how to implement the LibId, and attached the document I wrote.
I have some Q's about (Win32)BinaryFile (they are in attached document too) :
1. dlprocptrs: is it the place to hold dynamic library functions(as it's mentioned
in its comments)? so why the "main" symbol is added and searched for in this map?
if it's general purpose, can I use it for variables also? or I have to use boomerang::symbols?
(the ones I can extract from object and library files). anyway what's the difference?
2. when a signature detector faces a callback function, should it add that
function to entry points to be decompiled?(boomerang::entrypoints)
3. dynamically imported functions, are loaded in IAT itself. so I will use
BinaryFile::AddSymbol on the dynamically linked function
address container variable, not on function address. is it ok?
(In Win32BinaryFile::findJumps, I guess you have done this)
By the way, some mistypings in http://boomerang.sourceforge.net/terminology.html page:
Acronyms: CFG : ... one each for for the case where ...
Acronyms: BB : ... A basic block is terminated by a conditional or undonditional branch ...
Acronyms: AST : ... might be labelled as ...
not important, I know ;)
Attachment: libid.txt (2K)
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Aug 26, 2005 at 5:36 PM
did you receive my mail? please inform me if you got this.
Attachment: libid.txt (2K)
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Aug 27, 2005 at 1:42 AM
On Fri, 26 Aug 2005, Mohsen Hariri wrote:
> did you receive my mail? please inform me if you got this.
Oh, sorry, I did, but the whole family has had the 'flu and things have
been a bit crazy.
I'll reply next.
- Mike
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Aug 27, 2005 at 2:12 AM
On Sun, 21 Aug 2005, Mohsen Hariri wrote:
> Hi,
>
> This week I have done the following:
> 1. at last bfd was compiled under cygwin, I used mingw target under
> cygwin to make windows native DLL
A good trick to remember.
> , so I read UQBT and some
> other documents mostly written by 'Cristina Cifuentes', they were excellent.
Yes, she's a good writer.
> Now I'm much more familiar with the Boomerang and UQBT, I think I know
> all the parts and their functionalities, at least at an abstract level. I
> didn't know how BIG the project is.
Yes, and it seems to keep getting bigger.
> 3. I saw dcc sources, they were simple and I could understand them
> easily.
Wow, I found them quite tricky myself :-)
> Seems your university is very active on this subject.
Well, it all stems from Prof. John Gough (a compilers person, recently
retired but still writing books) wondering about decompilers back about
1990. From there came Cristina's thesis, published in 1994, and from that
many projects, including UQBT and Boomerang, and Boomerang seems to have
started several other decompilers (Anakrino, exetoc, perhaps two in Japan).
REC was influenced by Cristina's work, and even the Flirt part of IDA Pro
was influenced by dcc's signatures. Not a bad chain of events! But UQ
hasn't been directly involved in the last 5 years apart from my own thesis.
Hopefully when I publish in the next year or so, there might be some more
influencing.
> I have some Q's about (Win32)BinaryFile (they are in attached document too):
> 1. dlprocptrs: is it the place to hold dynamic library functions(as it's
> mentioned in its comments)?
Be aware that this code was thrown together by the other main author
(QuantumG). So I'm not as familiar with this loader as with the others.
From memory, this map holds one entry for each entry in the import address
table (IAT). I don't recall if it has entries for the export table. When
gathering symbols, you would be interested in exports, not imports.
> so why the "main" symbol is added and searched for in this
> map?
We have this idea of attempting to find an entry point for everything that
gets decompiled. As you can see, it doesn't always work out. It's OK when
the program being decompiled is an executable. Then we are just adding a
sort of synthetic import.
> if it's general purpose, can I use it for variables also? or I have to use
> boomerang::symbols?
Ah, Boomerang::symbols (also QuantumG code) are the symbols loaded from the
symbols.h file(s) (via the -sf switch; see the web page on that switch for
an overview).
I think that these symbols may be suitable; somehow, when addresses are
used and they match with the entries in Boomerang::symbols, they become
associated with the name from the symbol (e.g. when a new procedure is
made, that name would be used instead of proc99).
As to whether to use an existing data structure or not, that's a design
decision you have to make. I can see points in favour of either approach;
for example, it would be nice to keep the signatures code fairly separate,
perhaps mostly in one class. There could be thousands of signature symbols,
and they may need special data structures because of their number. For
starters, though, I'd consider trying Boomerang::symbols first. Perhaps the
-sf logic and the signatures could be combined, either by design or after
implementation.
> 2. when a signature detector faces a callback function, should it add that
> function to entry points to be decompiled?(boomerang::entrypoints)
That sounds reasonable, especially if you put the signatures into the
Boomerang::symbols map. I think that entrypoints comes from -e and -E
switches.
> 3. dynamically imported functions, are loaded in IAT itself. so I will use
> BinaryFile::AddSymbol on the dynamically linked function
> address container variable, not on function address. is it ok?
> (In Win32BinaryFile::findJumps, I guess you have done this)
Wow, that is my code, but written a long time ago. It looks like a pretty
nasty hack. I guess I was irritated by all the one line procedures that
result from jumps to IAT entries (where the real symbols are). Any way of
making this a little more general would be good. I believe it's possible
for some calls to use the jump tables, with others using the IAT entry
directly. I don't recall why they do this.
> By the way, some mistypings in
> http://boomerang.sourceforge.net/terminology.html page:
>
> Acronyms: CFG : ... one each for for the case where ...
> Acronyms: BB : ... A basic block is terminated by a conditional or
> undonditional
> branch ...
> Acronyms: AST : ... might be labelled as ...
>
> not important, I know ;)
Ah, but it is. I'll try and get to those this weekend. Thanks!
- Mike
Mohsen Hariri <m.hariri@gmail.com>
another thing
2 messages
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 3, 2005 at 9:54 AM
binaryfile.h needs this:
#ifndef _WIN32
#include <dlfcn.h>
#else
#include <windows.h> // include before types.h: name collision of NO_ADDRESS and WinSock.h
#endif
for the HINSTANCE you are using, like binaryfilefactory.cpp.
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 3, 2005 at 3:49 PM
Except that windows.h has a name collision with something in the objective
C code as well. So now it's a void* and I cast it (ugly, but there are only
two places it's used anyway).
- Mike
Mohsen Hariri <m.hariri@gmail.com>
compiling problem
10 messages
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 7, 2005 at 10:18 AM
hi
Still my boomerang doesn't compile. There are errors while
compiling prog.cpp in ansi-c-parser.h ( I wonder why this
header works when it comes to compiling ansi-c-parser.cpp
or ansi-c-parser.cpp ). Here are the errors :
Compiling...
prog.cpp
e:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\PlatformSDK\Include\WinSock.h(691) :
warning C4005: 'NO_ADDRESS' : macro redefinition
j:\moh\decompiler\boomerang\include\types.h(22) : see previous definition of
'NO_ADDRESS'
/home/38/binary/u1.luna.tools/lib/bison.h(363) : error C2143: syntax error : missing '}' before '='
/home/38/binary/u1.luna.tools/lib/bison.h(363) : error C2059: syntax error : '='
/home/38/binary/u1.luna.tools/lib/bison.h(186) : error C2143: syntax error : missing ';' before '}'
/home/38/binary/u1.luna.tools/lib/bison.h(186) : error C2238: unexpected token(s) preceding ';'
...
and after commenting out those #line directives:
Compiling...
prog.cpp
e:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\PlatformSDK\Include\WinSock.h(691) :
warning C4005: 'NO_ADDRESS' : macro redefinition
j:\moh\decompiler\boomerang\include\types.h(22) : see previous definition of
'NO_ADDRESS'
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(487) : error C2143: syntax error : missing '}'
before '='
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(487) : error C2059: syntax error : '='
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(553) : error C2143: syntax error : missing ';'
before '}'
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(553) : error C2238: unexpected token(s) preceding
';'
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(555) : error C2059: syntax error : 'public'
...
It complains about CDECL definition in the enum.( again I wonder why this
header works when it comes to compiling ansi-c-parser.cpp
or ansi-c-parser.cpp )
by the way, I've attached a patch for the project to compile
when 'max' function was defined somewhere(As on my system).
Attachment: dfa.cpp.patch (1K)
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 7, 2005 at 10:31 AM
> hi
>
> Still my boomerang doesn't compile. There are errors while
> compiling prog.cpp in ansi-c-parser.h ( I wonder why this
> header works when it comes to compiling ansi-c-parser.cpp
> or ansi-c-parser.cpp ).
It's an ordering issue; sometimes it's just not a good idea to #include
"windows.h" (which includes the whole universe and makes name collisions
much more likely).
I'm feeling a little guilty here; I suspect that I #included windows.h
somewhere, noted problems with Windows compilation, removed it, and then
forgot to check in all the changes. It will have to wait till I get home,
sorry.
> by the way, I've attached a patch for the project to compile
> when 'max' function was defined somewhere(As on my system).
Good; thanks. What compiler are you using exactly?
- Mike
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 7, 2005 at 10:40 AM
Microsoft Visual C++ from visual studio 2003,
no PSDK installed. I guess you use this version
too, right? donno why my compiler complains too
much... ;)
Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 7, 2005 at 11:16 AM
don't worry about compilation, I can test my codes without
compiling the whole project (the project took too long
to compile, so I made a test platform)
currently I work only 6-7 hours a week on this project,
but from next week I will have more time to put on this project.
hope I can get something done in next week.
one more thing, about c++ mangled names. do you know
any library I can use to demangle them? cause it seems
there is no standard for it, and every compiler does what
it likes. am I wrong?
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 7, 2005 at 11:23 AM
> one more thing, about c++ mangled names. do you know
> any library I can use to demangle them? cause it seems
> there is no standard for it, and every compiler does what
> it likes. am I wrong?
They do all do whatever they want. However, I think that the latest gcc
scheme is somewhat standardised now. The c++filt program demangles; I have
not had success finding a tool other than the compiler (which is NOT
convenient) for the mangling process. You could do worse than exec'ing this
tool, or finding the source code for it (part of binutils, I would guess).
- Mike
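For the GCC scheme mentioned above, there is also an in-process alternative to exec'ing c++filt: GCC and Clang expose the demangler through abi::__cxa_demangle in <cxxabi.h>. The following is only a sketch of that swapped-in technique, not what libid does; the wrapper name demangle is made up for illustration, and the call handles only the GCC-style (Itanium ABI) scheme, so MSVC names pass through unchanged.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>   // GCC/Clang only; not available with MSVC
#include <string>

// Hypothetical helper: returns the demangled form of a GCC-style name,
// or the input unchanged when the name cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With null buffer arguments the result is malloc'ed and must be freed.
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

Under these assumptions, demangle("_Z3fooi") gives "foo(int)", while an MSVC-style name such as "?foo@@YAHH@Z" is returned as it was passed in.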
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 7, 2005 at 11:26 AM
On Wed, 7 Sep 2005, Mohsen Hariri wrote:
> Microsoft Visual C++ from visual studio 2003,
Yep, same as mine. Though on the machine at work, I have the 2002 Microsoft
Development Environment, version 7.0.9466. It seems to be a very early .NET
version, and it won't read Boomerang's solution files. I'm thinking of
trying to redo the solution files on that compiler, so it should be
compatible with all .NET versions.
> no PSDK installed. I guess you use this version
> too, right? dunno why my compiler complains too
> much... ;)
Just lucky, I guess :-)
- Mike
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Thu, Sep 8, 2005 at 4:12 PM
On Wed, 7 Sep 2005, Mohsen Hariri wrote:
> hi
>
> Still my boomerang doesn't compile. There are errors while
> compiling prog.cpp in ansi-c-parser.h ( I wonder why this
> header works when it comes to compiling ansi-c-parser.cpp
> or ansi-c-parser.cpp ). Here are the errors :
> Compiling...
> prog.cpp
> e:\Program Files\Microsoft Visual Studio .NET
> 2003\Vc7\PlatformSDK\Include\WinSock.h(691) : warning C4005: 'NO_ADDRESS' :
> macro redefinition
Presumably this means you have windows.h included. Actually, I think this
should not be included for anything other than windows.cpp.
Where does your garbage collector come from? Is it the one that we have
checked in, or did you compile it yourself or something? Just that gc.h
will include windows.h if GC_WIN32_THREADS is set. A bit of a long shot, I
will admit.
I can't find any problem; it compiles fine for me with the latest updates.
And I've checked again to make sure that every change has been checked in.
Are you sure you have the latest CVS now? In particular, there should be no
#include of windows.h in include/BinaryFile.h, except in comments.
> j:\moh\decompiler\boomerang\include\types.h(22) : see previous definition of
> 'NO_ADDRESS'
> /home/38/binary/u1.luna.tools/lib/bison.h(363) : error C2143: syntax error :
> ...
> and after commenting out those #line directives:
>
> ...
>
> It complains about CDECL definition in the enum.( again I wonder why this
> header works when it comes to compiling ansi-c-parser.cpp
> or ansi-c-parser.cpp )
I'd say because something before the #include for ansi-c-parser.h in
db/prog.cpp is #including windows.h. I think you'll have to comment them
out one by one until the error goes away, then decide why windows.h is
getting included, and try to prevent this.
You could try #undef NO_ADDRESS and #undef CDECL before the #include
of ansi-c-parser.h in db/prog.cpp, to make sure that this is what the
problem is. Maybe it's not such a bad permanent solution, either, if it's
enough to fix the problem.
- Mike
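The collision and the suggested workaround can be reproduced in miniature. This is a sketch only: the two #define lines stand in for what windows.h is assumed to pull in, and the typedef, the second NO_ADDRESS definition and the ParserToken enum are hypothetical stand-ins for include/types.h and ansi-c-parser.h, not copies of them.

```cpp
#include <cassert>

// Stand-ins for macros assumed to arrive via <windows.h> (illustrative values).
#define CDECL
#define NO_ADDRESS 11004

// The workaround: drop both macros before the project's own headers are seen.
// Without these two lines the enum below expands to "{ = 258, ..." and the
// compiler reports a syntax error before '=', much like C2143/C2059 above.
#undef NO_ADDRESS
#undef CDECL

// Hypothetical stand-ins for the project's declarations.
typedef unsigned ADDRESS;
#define NO_ADDRESS (static_cast<ADDRESS>(-1))
enum ParserToken { CDECL = 258, PASCAL_TOK = 259 };

// Small helper so the effect of the redefinition can be checked.
bool isValid(ADDRESS a) { return a != NO_ADDRESS; }
```

With the #undef lines in place, CDECL is an ordinary enumerator again and NO_ADDRESS has the project's meaning rather than the WinSock one.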
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 10, 2005 at 12:46 PM
It compiles now. Thanks for the help.
I've attached a changed version of Boomerang, with the
library detection codes. It compiles only on windows
currently (needs some minor changes, like replacing #pragma once,
not a structural change). My codes are in symbols directory.
Some info on implementation:
1. I've interfaced the libid with prog class. It will currently use
addProc and BinaryFile::AddSymbol to add its matched
signatures.
2. I have implemented just two matching classes currently:
BfdObjMatcher: Can match any object file supported by BFD,
so this is platform independent.
BfdArchMatcher: Can match any archive file supported by BFD,
again platform independent.
These modules use original obj and lib files, and I will implement
generated signature matching next.
3. Detection steps are like this:
a. every function in the object files is searched for in the binary
code sections, after the relocatable and non-resolved bytes (
you called them "wild bytes" in dcc, so I used that term) are
masked out. (Implementation of matching in BytePattern class)
b. when a match occurs, the function is created using Prog::newProc.
every symbol referenced by this function is also added to the symbols
using BinaryFile::AddSymbol. I have to demangle names here, but
for the time being I just remove the first '_' (underscore) from the names.
I will use wine's demangling codes for MSVC next.
4. The changes I've made to other parts of the project:
a. In prog.cpp & prog.h, I've added this method:
// Search for library signatures from sig_file and match them
// 'hint' is used to force usage of specified signature matching
// module
void MatchSignatures(const char * sig_file, const char * hint);
and also:
FrontEnd * getFrontEnd();
( I know you will ask why: I needed that in boomerang to separate
loading and decompiling of the binary, unless you suggest a better
way )
b. in boomerang I added these commands:
loadbinary : just load the binary, no decoding
decode : (without parameter) decode the loaded binary
matchsym : matches the specified symbol container
against the loaded binary
and added these methods:
Prog *Boomerang::load(const char *fname)
void Boomerang::decode(Prog *prog, const char *pname)
just to separate loading and decoding phases.
c. in Win32BinaryFile, I've commented out main function
detection. Because now when all libraries are
detected(specifically the ones that have startup
functions), the main function is automatically
found, so no need to find it that way.
Am I right?
d. some minor changes to the msvc project files, to
help it find bfd.lib or libid.lib
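Step 3a above (searching the code section for a function's bytes while ignoring the wild bytes) can be sketched roughly as follows. This is an illustration under assumptions, not the BytePattern class from the attached sources: the WildPattern type and findPattern function are hypothetical names, and a plain linear scan is used for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pattern type: the function's bytes from the object file plus
// a flag per byte marking relocated or unresolved ("wild") positions.
struct WildPattern {
    std::vector<std::uint8_t> bytes;
    std::vector<bool> wild;   // true where the byte must be ignored
};

// Returns the offset of the first match of `p` in `code`, or -1 if none.
long findPattern(const std::vector<std::uint8_t>& code, const WildPattern& p) {
    const std::size_t n = p.bytes.size();
    if (n == 0 || n > code.size() || p.wild.size() != n) {
        return -1;
    }
    for (std::size_t start = 0; start + n <= code.size(); ++start) {
        std::size_t i = 0;
        // A position matches if it is wild or the bytes are equal.
        while (i < n && (p.wild[i] || code[start + i] == p.bytes[i])) {
            ++i;
        }
        if (i == n) {
            return static_cast<long>(start);
        }
    }
    return -1;
}
```

For example, a pattern for "push ebp; call <wild>; ret" with the four call-target bytes marked wild matches wherever that shape occurs, whatever address the linker filled in.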
some more notes:
1. BinaryFile.h contains implementation for some methods,
so when it is included in multiple places, the linker
will complain about duplicate definitions. I think we have
to move those implementations to BinaryFile.cpp
2. when a library proc is detected, while adding its referenced
symbols, if we see the symbol is a function, we can
newProc again. But we have to check if it's already created
or not.
3. we can find out library function parameters (after I implement demangling).
do you think the parameters are useful here? I think this duplicates
what you have implemented in header file parsing. I mean when we have
a library, we probably have its header file. so just the names are taken from
the library, and parameters and calling convention are taken from the header file.
but this won't work for the dll files which have mangled export tables. In
this case we have to generate the header file based on the dll.(cause
when we have a dll, we cannot assume that we have its header file too)
Anyway, if we don't want to ignore parameters extracted from library files,
we have to implement a method like this:
ParseAndAddProc(ADDRESS address, char *proc_proto)
that can be called like this:
ParseAndAddProc(0x123123, "int myproc(int *p1, void *p2)")
but then we have some hard times when we have something
like this:
ParseAndAddProc(0x123123, "MyObj myptoc(MyObj2 *obj)")
here we don't have any idea about classes, we just have their names.
I suggest we stick to the first method(header files), and add the demangled proto just
in the function comments.
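To make the ParseAndAddProc idea a little more concrete, here is a rough sketch of the string-splitting half only. Everything in it is hypothetical: ProtoParts and splitPrototype are illustrative names, nothing here is part of Boomerang, and the class-type problem described above remains, since a name like MyObj2 is kept as an opaque string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for a split prototype string.
struct ProtoParts {
    std::string returnType;
    std::string name;
    std::vector<std::string> params;
};

// Strips leading and trailing blanks and tabs.
static std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Splits "ret name(p1, p2)" into its parts; returns false on malformed input.
bool splitPrototype(const std::string& proto, ProtoParts& out) {
    const std::size_t open = proto.find('(');
    const std::size_t close = proto.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return false;
    }
    const std::string head = trim(proto.substr(0, open));
    const std::size_t sep = head.find_last_of(" *&");
    if (sep == std::string::npos) return false;   // no return type present
    out.returnType = trim(head.substr(0, sep + 1));
    out.name = head.substr(sep + 1);
    if (out.name.empty()) return false;
    out.params.clear();
    const std::string args = proto.substr(open + 1, close - open - 1);
    std::size_t start = 0;
    while (start <= args.size()) {
        std::size_t comma = args.find(',', start);
        if (comma == std::string::npos) comma = args.size();
        const std::string p = trim(args.substr(start, comma - start));
        if (!p.empty()) out.params.push_back(p);
        start = comma + 1;
    }
    return true;
}
```

A real implementation would still have to map each parameter string to a type known to the decompiler, which is exactly where the header-file route is easier.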
my todo:
1. add demangling codes
2. add signature generation & detection
I want to use the same schema for
signatures too(I mean I will preserve
the referenced symbols with the function
signature). What do you think?
3. add dll ordinal import function resolving
(some dlls exports are mangled c++
names, we can decode them too)
4. change codes to compile on other
compilers
You can test my codes using the test file in /test/symbols/test_sig
directory. This executable only uses libc(I've included libc too).
you just run boomerang -k and issue these commands:
loadbinary <path>test_sig.exe
matchsym <path>libc.lib
the main function is detected, but I see the parameters are incorrect.
that's all. I guess this was the longest email I've ever written :p
PS. the attached file should be renamed to "zip"
Waiting for your suggestions
Mohsen
boomerang_libid.zipx
4693K Download
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Mon, Sep 12, 2005 at 4:20 AM
> FrontEnd * getFrontEnd();
> ( I know you say why, I needed that in boomerang to seperate
> loading and decompiling of the binary, unless you suggest a better
> way )
Without examining the code in detail, that sounds fine.
> b. in boomerang i added these commands:
>
> loadbinary : just load the binary, no decoding
> decode : (without parameter) decode the loaded binary
> matchsym : matches the specified symbol container
> against the loaded binary
>
> and added these methods:
>
> Prog *Boomerang::load(const char *fname)
> void Boomerang::decode(Prog *prog, const char *pname)
>
> just to seperate loading and decoding phases.
Cool.
> c. in Win32BinaryFile, I've commented out main function
> detection. Because now when all libraries are
> detected(specifically the ones that have startup
> functions), the main function is automatically
> found, so no need to find it that way.
> Am i right?
Well, it depends on the details. When you say "the main function is
automatically found", do you mean that you already have special high level
pattern matching for main? (A la what I did in dcc; I needed special
patterns for finding main, and they were somewhat compiler specific). It
might make sense to implement some of the logic in Win32BinaryFile.
> some more notes:
>
> 1. BinaryFile.h contains implementation for some methods,
> so when it is included in multiple places, the linker
> will complain about duplicate definitions. I think we have
> to move those implementations to BinaryFile.cpp
Compilers are supposed to deal with this. Gcc has a "linkonce" tag for
these sorts of functions. The idea is that these are great candidates for
inlining; if you never define functions in .h files, you'll probably never
get inlining (unless your compiler has a special pass between compiling and
linking, or does the inlining during linking, both of which seem uncommon).
Are you getting errors from MSVC with this, or are you just assuming that
this has to be trouble but have not seen it yet?
> 2. when a library proc is detected, while adding its referenced
> symbols, if we see the symbol is a function, we can
> newProc again. But we have to check if it's already created
> or not.
I'm sure there is a function that already does that.
> 3. we can find out library function parameters(after I implement
> demangling). do you think the parameters are useful here? I think this is
> a duplicate for what you have implemented in header file parsing. I mean
> when we have a library, we have its header file probably. so just the
> names are got from the library, and parameters and calling conversion are
> got from the header file.
Presumably, both should get you the same answer, and I don't know what to
do if they don't match. Perhaps just an error message. So I think for now
what you have done is fine.
> but this won't work for the dll files which
> have mangled export tables.
Are these called by ordinal then? Otherwise, how can the caller know what
to call?
> In this case we have to generate the header
> file based on the dll.(cause when we have a dll, we cannot assume that we
> have its header file too)
Yes.
>
> Anyway, if we don't want to ignore parameters extracted from library files,
> we have to implement a method like this:
> ParseAndAddProc(ADDRESS address, char *proc_proto)
>
> that can be called like this:
> ParseAndAddProc(0x123123, "int myproc(int *p1, void *p2)"
>
> but then we have some hard times when we have something
> like this:
> ParseAndAddProc(0x123123, "MyObj myptoc(MyObj2 *obj)")
>
> here we don't have any idea about classes, we just have their names.
Well, just having the name is useful for a decompiler (unless the name is
obfuscated). The compiler needs to know the size of elements of the class,
so that it can calculate how to access the member variables, how to cast
from one class to another (with multiple inheritance, there is sometimes a
correction needed to the "this" pointer). It would be great to have the
names and types of all the member variables, and the names and parameters
of all the methods, but as you say this is typically not available.
> I suggest we stick to the first method(header files), and add the
> demangled proto just in the function comments.
Sounds fine to start with.
> my todo:
>
> 1. add demangling codes
>
> 2. add signature generation & detection
> I want to use the same schema for
> signatures too(I mean I will preserve
> the referenced symbols with the function
> signature). What do you think?
Sorry, I don't really understand the question.
> You can test my codes using the test file in /test/symbols/test_sig
> directory. This executable only uses libc(I've included libc too).
> you just run boomerang -k and issue these commands:
> loadbinary <path>test_sig.exe
> matchsym <path>libc.lib
>
> the main function is detected, but I see the parameters are incorrect.
OK, I'll try and find some time to try this out.
By the way, I see that you have included source code in the zip file;
is it your intention to donate this code to the Boomerang project? I'd be
happy to use it, I think it will fill a big need. But it's your code, and
Boomerang isn't GPL'd, so you don't have to contribute your changes. If
it's a bit unstable, as long as it compiles on all platforms, it can be
disabled unless a runtime switch is used. Then when it's ready for real
world use, we just take out the command line switch.
Good work!
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Mon, Sep 12, 2005 at 11:13 PM
> > c. in Win32BinaryFile, I've commented out main function
> > detection. Because now when all libraries are
> > detected(specifically the ones that have startup
> > functions), the main function is automatically
> > found, so no need to find it that way.
> > Am i right?
>
> Well, it depends on the details. When you say "the main function is
> automatically found", do you mean that you already have special high level
> pattern matching for main? (A la what I did in dcc; I needed special
> patterns for finding main, and they were somewhat compiler specific). It
> might make sense to implement some of the logic in Win32BinaryFile.
look at these:
http://msdn.microsoft.com/msdnmag/issues/01/01/hood/
http://www.codeguru.com/article.php/c6945__1/
But for a short summary:
In MSVC for example, a startup function is executed first, which then calls the program's main
function. The name of this function is hardcoded and depends on the
linker /SUBSYSTEM and /ENTRY arguments. For example when the parameter
is /SUBSYSTEM:CONSOLE and no /ENTRY is used, the linker looks for
"mainCRTStartup" whose address will be put in PE's header as PE's
entry point. (I'm sure you have seen a GetVersion/GetVersionEx call in all MSVC programs
a little after EP; it is in mainCRTStartup. That's why all MSVC programs are
dynamically linked with kernel32.dll, and that's why you never had an empty
IAT table)
sources for CRT startup codes can be found in here:
Microsoft Visual Studio .NET 2003\Vc7\crt\src
mainCRTStartup resides in crt0.c
crt0.c is compiled into libc.lib, so if we check signatures of any
MSVC executable
against the correct version of libc.lib, we will find mainCRTStartup (or the
equivalent function for other SUBSYSTEMs); then, using the references of that function,
we can locate the main function.
I guess you didn't understand what I mean by preserving the references
of a function:
> > my todo:
> >
> > 1. add demangling codes
> >
> > 2. add signature generation & detection
> > I want to use the same schema for
> > signatures too(I mean I will preserve
> > the referenced symbols with the function
> > signature). What do you think?
>
> Sorry, I don't really understand the question.
A library function has this information that can be retrieved:
1. function name and prototype
2. the references of that function
for example, in a library if we have:
int atoi(char *x){
return (int)atol(x);
}
and we locate atoi, then we know atoi references atol,
so we will find atol without the need for matching signatures.
that is already done in libid codes. you see
these outputs from the matchsym command:
...
Matched: mainCRTStartup
...
Symbol added: 4086fc -> __argv
Symbol Reference: __argv
Symbol added: 4086f8 -> __argc
Symbol Reference: __argc
Symbol added: 400ffc -> main
Symbol Reference: main
Symbol added: 401d8a -> exit
Symbol Reference: exit
....
here we have detected mainCRTStartup by matching signatures,
and then using its references we have detected the location of the main
function and some other symbols. dcc and IDAPro signature files just
hold the function signatures, and throw away their references, but references
seem useful. I want to preserve them in libid signature files.
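A signature record that keeps its references might look roughly like the sketch below. The layout is an assumption made for illustration, since the libid signature format is not described here: Reference, Signature and resolveReferences are hypothetical names, and every reference is taken to be a 32-bit little-endian absolute address, which ignores relative call targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: a symbol referenced from inside a library function.
struct Reference {
    std::string symbol;   // e.g. "main"
    std::size_t offset;   // offset within the function of the address field
};

// Hypothetical signature that preserves the function's references.
struct Signature {
    std::string name;
    std::vector<Reference> refs;
};

// After `sig` has matched at `matchAt` in `image`, read each referenced
// address (assumed 32-bit little-endian, absolute) and attach its name.
std::map<std::uint32_t, std::string> resolveReferences(
        const std::vector<std::uint8_t>& image,
        std::size_t matchAt,
        const Signature& sig) {
    std::map<std::uint32_t, std::string> found;
    for (const Reference& r : sig.refs) {
        const std::size_t pos = matchAt + r.offset;
        if (pos + 4 > image.size()) continue;   // reference lies outside the image
        const std::uint32_t addr =
            static_cast<std::uint32_t>(image[pos]) |
            (static_cast<std::uint32_t>(image[pos + 1]) << 8) |
            (static_cast<std::uint32_t>(image[pos + 2]) << 16) |
            (static_cast<std::uint32_t>(image[pos + 3]) << 24);
        found[addr] = r.symbol;
    }
    return found;
}
```

With a matched mainCRTStartup whose references include main and __argc, the returned map plays the role of the "Symbol added" lines in the matchsym output above: one matched function names several other addresses without any further byte matching.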
> Compilers are supposed to deal with this. Gcc has a "linkonce" tag for
> these sorts of functions. The idea is that these are great candidates for
> inlining; if you never declare functions in .h files, you'll probably never
> get inlining (unless your compiler has a special pass between compiling and
> linking, or does the inlining during linking, both of which seem uncommon).
I didn't know that.
> Are you getting errors from MSVC with this, or are you just assuming that
> this has to be trouble but have not seen it yet?
I just get warnings, so there is no problem here.
> > but this won't work for the dll files which
> > have mangled export tables.
>
> Are these called by ordinal then? Otherwise, how can the caller know what
> to call?
They are imported by names. I had some doubts, so I created
a project. I've attached my project, if you liked to take a look.
> By the way, I see that you have included source code in the zip file;
> is it your intention to donate this code to the Boomerang project?
YES. This is not the first open-source project I'm participating in. I've
already done some coding in IBS(http://ibs.sf.net), which is a product of
our company. The difference is while writing IBS, I just
had fun when I was with my friends talking and working on the project, and
thinking about the future of the project. But I guess participating in
Boomerang is what I have been made for. ;) I'm not working on it just for my
undergraduate project, I like it very much, even when no one
is here to chat about it or there will be no money making in the future from it.
- Mohsen
mangleddll.zipx
99K Download
Mohsen Hariri <m.hariri@gmail.com>
Re: Boomarang, symbols, and licenses
2 messages
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 14, 2005 at 11:01 AM
>> By the way, I see that you have included source code in the zip file;
>> is it your intention to donate this code to the Boomerang project?
>
> YES. This is not the first open-source project I'm participating in.
> ...
> But I guess participating in
> Boomerang is what I have been made for. ;) I'm not working on it just for my
> undergraduate project, I like it very much, even when no one
> is here to chat about it or there will be no money making in the future from it.
I love your enthusiasm!
But there is a problem that needs working through.
Your code apparently depends on BFD. Presumably, you could rewrite it to not
use BFD, but what a pain. I looked at using the opcodes library at one
stage; it's part of binutils.
The problem is that BFD is GPL'd. Nothing wrong with the GPL, as far as I
am concerned, except that it is my understanding that the GPL is not
compatible with BSD-like licences, such as large parts of Boomerang are
released under.
Perhaps I could explain a bit of Boomerang's history. It is derived in
large part from UQBT, the University of Queensland Binary Translator. This
was subsidised in part by Sun Microsystems; it paid my salary for at least
a year. As a result, Sun has a say over the disposition of code written
during that time. It was a long and painful process getting the Sun lawyers
to agree to release the UQBT code under a BSD-like license.
So large parts of Boomerang have to have comments at the top indicating
that Sun part owns the copyright for those files, and they can only be used
in conjunction with the license. Basically, it means that the headers have
to be preserved, that you can't use Sun's name to promote your product, and
things like that. Essentially, the code can be used for any purpose,
including commercial purposes, and you don't have to contribute changes.
The GPL is a similar license, except that you DO have to contribute
changes. It makes no sense to me that these are incompatible; it seems to
me that by putting Sun (where they already exist) comments and GNU headers
on the files, and distributing a copy of the GPL with the code, would
satisfy all the requirements. However, I've been told that this is not the
case, and that making Boomerang GPL is not feasible.
I'll look into this in more detail, and if you happen to have any
expertise in this area, I'd love to hear about it. Surely there is some
way that this issue can be resolved. But until then, I don't think it's a
good idea to check in your code. Perhaps checked into a CVS branch while
this is sorted out would be OK.
It would be good to have this sorted out, so that other GPL'd pieces of
code can be used in Boomerang. If it loses the ability to be used with
commercial proprietary products, I don't think that this will be a great
loss. (Initially, we thought this might be quite important, e.g. someone
like Data Rescue (producers of IDA Pro) might want to use Boomerang code
to enhance an existing proprietary product).
Let's hope that this is sorted soon.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 14, 2005 at 12:14 PM
We are using BFD as a dynamically linked library.
(there was some exception for dynamically linked libraries
in LGPL licence, but BFD is GPLed)
Can it help?
Mohsen Hariri <m.hariri@gmail.com>
Re: Boomerang and GPL code (fwd)
2 messages
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 14, 2005 at 11:50 AM
Mohsen,
I asked my Boomerang co-author about the GPL and BSD question, and it looks
like it is possible to release Boomerang under a combined license.
So that's good news!
I think that this will need a few words on the Boomerang page.
- Mike
---------- Forwarded message ----------
Date: Wed, 14 Sep 2005 18:11:21 +1000
From: QuantumG <qg@biodome.org>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Subject: Re: Boomerang and GPL code
Mike van Emmerik wrote:
>
> So the question is: do you know for sure that I can't include GPL code? If
> it's just a case of RTFM, I'm happy to do that; it's just that I thought you
> looked into this and decided pretty definitively that it could not be done.
> I can't see what the problem is, but that could easily be ignorance on my
> part.
You can indeed distribute it as BSD+GPL code.. however, you must apply the GPL
to the whole work. So yeah, it means that if someone writes an extension for
Boomerang they have to distribute it under the terms of the GPL (they have to
provide source). But if they want to separate the BSD parts from the GPL parts
and then make an extension from the BSD parts only, they can and then they are
not bound by the GPL.
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 14, 2005 at 2:07 PM
VERY GOOD.
GPL is somehow better for such a project, when you are
sure you are helping the whole opensource society, and
not just providing something that can be abused to make
money from your work.
Mohsen Hariri <m.hariri@gmail.com>
libid as a separate module?
5 messages
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 17, 2005 at 2:43 AM
Mohsen,
I'm thinking of making libid a separate module in Boomerang, with its own
license. It would be a separate library, like libWin32BinaryFile.so /
Win32BinaryFile.dll, and the main code would check for its presence. If not
present, it would continue on as now.
That way, there is no need to change the license on the main part of
Boomerang. This can only work if the signatures code can be well separated
from the rest of Boomerang. This might be OK, or it might end up being a
royal pain. What do you think?
I was also thinking of calling for comments on changing the license of
Boomerang on the web page. It could be as simple as asking, waiting a week,
and if there are no major objections, change the license for Boomerang and
libid can be a required part of Boomerang. It could be a separate library
or not, as convenient. Note that as I implemented it, the code in the
symbols directory is just statically linked with the rest of the code. The
earlier vision for Boomerang is that most functionality would be
implemented as separate dynamically linked libraries, but it hasn't
happened that way, and I'm not all that sure that it would be easier to
develop that way, anyway. It would be good to use lots of libraries if
there were separate possible implementations. For example, perhaps type
analysis should have been done that way; you use data-flow based or
constraint based or ad-hoc, and the other 2 modules don't have to take up
memory in your executable.
Either way, I think it makes sense to put the library code on a cvs branch
if it's not a separate module, so that people downloading boomerang don't
get any problems caused by the new code not being well tested on all
platforms. It makes things easier for you too, since you can check in
changes even if they don't compile, and you don't have to test on many
platforms before checking in. (Witness the hassles with MinGW already).
Once the code gets more settled, you can move off the cvs branch, since the
probability of surprises on other platforms becomes very low.
I'd be interested in your comments on this.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 17, 2005 at 10:04 AM
> from the rest of Boomerang. This might be OK, or it might end up being a
> royal pain. What do you think?
That's good. At first I tried to compile it as a DLL, it's easy I guess.
> Either way, I think it makes sense to put the library code on a cvs branch
> if it's not a separate module, so that people downloading boomerang don't
> get any problems caused by the new code not being well tested on all
> platforms. It makes things easier for you too, since you can check in
> changes even if they don't compile, and you don't have to test on many
> platforms before checking in. (Witness the hassles with MinGW already).
> Once the code gets more settled, you can move off the cvs branch, since the
> probability of surprises on other platforms becomes very low.
I agree. At least till the codes become more stable. So please
set up a branch, and I'll use that.
-Mohsen
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 17, 2005 at 3:58 PM
> I agree. At least till the codes become more stable. So please
> set up a branch, and I'll use that.
OK, but the file server for the machine I use at Uni is down, sorry.
Hmmm... looks like this email might not even make it, I'm getting error
messages at the bottom of the screen. Sigh.
So there may be a bit of a delay. In the meantime, you can continue
developing, I guess.
Sorry about the delay.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 17, 2005 at 4:05 PM
I guess it's 10 o'clock there... you're still at uni?
Here is 5 pm, and I'm at work. :p
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Sep 18, 2005 at 1:39 AM
On Sat, 17 Sep 2005, Mohsen Hariri wrote:
> I guess it's 10 o'clock there... you're still at uni?
Not physically at Uni, no; I access my Linux machine at Uni from home
using ssh. With the price of petrol so high, I may do more and more of
that where possible.
Your time zone comes through as +0430, which would indicate that you are
5.5 hours behind me, on the same day, but I'm not sure that gmail gets the
time zones correct. I suspect that the mail server was affected by the
outage, so it's hard to tell.
> Here is 5 pm, and I'm at work. :p
Right now it's 8am Sunday, and I'm in bed :-)
The file server is back now. They mentioned some maintenance that they
needed to do on a hard drive last Friday, looks like they had some
teething issues or replaced some more of the raid drives.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Re: running matchsym
10 messages
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Thu, Sep 15, 2005 at 10:29 AM
> You can test my codes using the test file in /test/symbols/test_sig
> directory. This executable only uses libc(I've included libc too).
> you just run boomerang -k and issue these commands:
> loadbinary <path>test_sig.exe
I've made some changes so it compiles and links on Sparc/Solaris. (I wanted
to test on something big-endian; endianness problems are common with
loaders).
This command seems to work; it says "loading...".
> matchsym <path>libc.lib
No matter what I do (slashes versus reverse slashes, etc), I can't get this
to work. It always says "No default symbol matcher module for
test/symbols/libc.lib".
I'm open to suggestions as to how to proceed from here.
I could send you a patch file for the changes I had to make to get it to
compile on Solaris.
I could check the code in on a CVS branch. This has a few advantages; you
could update from the CVS branch and get my changes rather painlessly, and
I can also check out the code and test it on MinGW, Linux, and OS X.
However, once you check out from a branch, it's more difficult to get
updates from the main trunk (ordinary non-branch changes).
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Thu, Sep 15, 2005 at 2:10 PM
I've tested that on my windows box.
You mean it doesn't work at all or just when the system
is big endian?
anyway, please give me the make files so that I can compile
it on my linux box, I can do it myself, but I may make some
mistakes as I have little experience with that.
please send me the patches.
[Quoted text hidden]
Mike van Emmerik <emmerik@itee.uq.edu.au>
Thu, Sep 15, 2005 at 4:17 PM
To: Mohsen Hariri <m.hariri@gmail.com>
On Thu, 15 Sep 2005, Mohsen Hariri wrote:
> I've tested that on my windows box.
> You mean it doesn't work at all or just when the system
> is big endian?
I don't know; I haven't attempted to run it under Windows. I assume it
would work, so I suspect an endianness problem.
> anyway, please give me the make files so that I can compile
> it on my linux box, I can do it myself, but I may make some
> mistakes as I have little experience with that.
>
> please send me the patches.
>
OK, no problem. I'll do it in the next few days; I'm a bit rusty on making
patch files and it's bed time now.
- Mike
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Fri, Sep 16, 2005 at 5:14 AM
On Thu, 15 Sep 2005, Mohsen Hariri wrote:
> please send me the patches.
Well, I had no end of trouble with the carriage returns that Windows
inserts into all text files. Also, I could not figure out how to stop diff
-r from comparing directories, telling me for example that I don't have .o
files and so on. I had to change all the CVS/Repository files, so they all
showed up, but I deleted the CVS files in the "original" directory to save
spurious output, but of course then I get "only in this directory: CVS"
messages.
I think that maybe patch will just ignore these lines, or maybe you will
have to delete them manually. It should be easy with a few search and
replace commands. Or maybe it might be easier to apply the patches
manually. I've used -c to make this a little easier.
Do let me know how this goes; I don't expect problems but there usually
are.
You will need libiberty.a (or .so etc) as well as libbfd.a; I don't
understand why. I think I found this last time I tried to use libbfd. Or
maybe that's just a Solaris thing; it might be nice to fiddle with it under
Linux. (Without libiberty, I had about 15-20 undefined symbols such as
xexit, concat, hex_init, _objalloc_alloc etc). I had to put a soft link
from a libbfd.a I found to lib/, like this:
% ln -s /opt/local/lib/libbfd.a lib
I didn't need it with libiberty. You could also fiddle with the
LD_LIBRARY_PATH environment variable, but I find the soft link easier and
you don't need to fiddle with ~/.profile or the like so it works next time.
Good luck!
- Mike
boomerang_libid.diff
17K View Download
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Sep 16, 2005 at 12:41 PM
I guess all file attributes are broken now that I've checked out
the files on my windows, so I think it's better to checkout
all the files from cvs in linux and then replace changed files
with mine. I did so.
I did the diffs manually in the files. and everything compiles well.
but I had a linker error where bfd was needed. That was because
I had run configure before making the Makefile.in changes. So
I tried to run configure again, but this time I get this error:
checking size of char... configure: error: cannot compute sizeof (char), 77
See `config.log' for more details.
I looked at config.log, but found nothing useful. I googled the error and again
nothing useful, some people said it was because of some library files
( I don't understand how can "size of char" be related to "libraries")
Anyway, it's lunch time and I'm going to have it. If you have any
idea of what the solution is, please tell me. Otherwise I will investigate
the problem myself.
One problem I encountered while compiling:
I'm compiling on a Fedora 4 linux. The default bfd version on FC4
is 2.15.92.0.2.
When I was on my windows, I had tried to compile BFD 2.16 under mingw
without success, because bfd needed a function in libiberty which
didn't exist (the default libiberty included with BFD cannot be compiled
under mingw, and mingw has an already compiled version, so I had to use
that). So I tried with BFD 2.15 and it compiled well.
Now the problem: in BFD 2.15+, some bfd structures have changed. So I need
to know which version of BFD exists. I guess it's something 'configure'
should deal with, right? anyway, I have put #define BFD_2_15 on my
bfdobjmatcher.h which should be commented out when linking with BFD > 2.15
I guess I have to use linux for development, because having to change
the attributes every time I want to commit changes is not practical. Maybe
I use windows to develop, and then copy just the changed files to my linux and
then commit them, so when will you ever give me CVS access? :P
I've attached the changed project(tarred from my linux, so file attributes
are OK). If you found what is the problem with my configure, please
let me know.
PS. the strcmpi function you sent me needs a change, here it is:
(I guess this won't solve the problem with the matchsym error
you get, i dunno why that error happens, that's just a simple
string comparison and when it cannot find ".lib" at the end of
library file, it shows that error)
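#include <ctype.h>
#include <string.h>

/* Case-insensitive comparison; returns <0, 0 or >0 like strcmp. */
int strcmpi(const char* s1, const char* s2) {
    int n1 = strlen(s1);
    int n2 = strlen(s2);
    for (int i = 0; i < n1; ++i) {
        if (i >= n2) return 1;      /* s2 is a proper prefix of s1 */
        int c1 = toupper((unsigned char)*s1++);
        int c2 = toupper((unsigned char)*s2++);
        if (c1 < c2) return -1;
        if (c1 > c2) return 1;
    }
    if (n2 > n1)
        return -1;                  /* s1 is a proper prefix of s2 */
    return 0;
}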
thanks
-Mohsen
[Quoted text hidden]
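The message above says that matchsym picks a matcher by testing whether the library path ends in ".lib" with a case-insensitive comparison. A minimal sketch of such a test follows; hasExtension is a hypothetical helper written for illustration and is not the code used in Boomerang. It compares only the tail of the path, so it gives the same answer for "libc.lib" and "LIBC.LIB".

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical helper: true if `path` ends with `ext`, ignoring case.
bool hasExtension(const char* path, const char* ext) {
    assert(path != nullptr && ext != nullptr);
    std::size_t np = std::strlen(path);
    std::size_t ne = std::strlen(ext);
    if (ne > np) return false;               // path is shorter than the extension
    const char* tail = path + (np - ne);     // last `ne` characters of the path
    for (std::size_t i = 0; i < ne; ++i) {
        // Cast to unsigned char: toupper is undefined for negative char values.
        if (std::toupper(static_cast<unsigned char>(tail[i])) !=
            std::toupper(static_cast<unsigned char>(ext[i])))
            return false;
    }
    return true;
}
```

With this helper, hasExtension("test/symbols/libc.lib", ".lib") is true, while a path such as "libc.a" is rejected.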
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Sep 16, 2005 at 1:05 PM
[Quoted text hidden]
boomerang_configure_error.tar.bz2
3544K Download
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 17, 2005 at 2:18 AM
> I guess all file attributes are broken now that I've checked out
> the files on my windows, so I think it's better to checkout
> all the files from cvs in linux and then replace changed files
> with mines. I did so.
Most programs can handle the changed line endings, so usually it doesn't
matter. What program are you using for cvs access under Windows? Since you
seem to have MinGW working, it probably makes sense to use that for the cvs
access. Assuming that it preserves the line endings of files. I use the
Cygwin cvs for when I'm using MSVC; that works out well. (You can access
Windows files like c:\foo with /cygdrive/c/foo under Cygwin, and /c/foo
under MinGW, I believe).
> I did the diffs manually in the files. and everything compiles well.
> but I had a linker error where bfd was needed. That was because
> I had run configure before making the Makefile.in changes. So
> I tried to run configure again, but this time I get this error:
>
> checking size of char... configure: error: cannot compute sizeof (char), 77
> See `config.log' for more details.
Bizarre. Actually, I don't think Boomerang uses those anyway, so it might
be easiest to just get rid of them (all the sizeof(<type>) tests). Sorry, I
don't have any other suggestions.
> One problem I encountered while compiling:
> I'm compiling on a Fedora 4 linux.
Same as I use. Though I haven't tried using BFD on Core 4.
> The default bfd version on FC4 is 2.15.92.0.2.
> When I was on my windows, I had tried to compile BFD 2.16 under mingw
> without success, because bfd needed a function in libiberty which
> didn't exist
> (the default libiberty included with BFD cannot be compiled under mingw, and
> mingw has an already compiled version, so I had to use that). So I tried with
> BFD 2.15 and it compiled well.
I was looking at what libiberty is; it seems to be a collection of
miscellaneous stuff that some GNU programs, including libbfd, need. It
doesn't sound like it should be hard to compile.
> Now the problem: in BFD 2.15+, some bfd structures have changed.
I notice that you were using bfd_section, which used to be called sec. But
if you use asection, it should work with either version. Perhaps there is a
similar solution that doesn't need messing with configuration.
What is the structure that has changed?
> So I need
> to know which version of BFD exists. I guess it's something
> 'configure' should
> deal with, right? anyway, I have put #define BFD_2_15 on my bfdobjmatcher.h
> which should be commented out when linking with BFD > 2.15
Ick. Let's avoid that if possible. But if needed, I can hack something into
configure.
> I guess I have to use linux for development, because having to change
> the attributes every time I want to commit changes is not practical.
Oh, if you're more comfortable developing in Windows, then please continue
to do so. I think we just need to solve the line ending issue some other
way.
> Maybe
> I use windows to develop, and then copy just the changed files to my linux and
> then commit them, so when will you ever give me CVS access? :P
As soon as you ask. What's your sourceforge username? You're contributing
at least as much as other developers.
> I've attached the changed project (tarred from my linux, so file
> attributes are OK). If you found what is the problem with my configure,
> please let me know.
Ah, OK. I'll look into it.
> PS. the strcmpi function you sent me needs a change, here it is:
Oops, OK, I see it. Actually, that might fix the problem, since it affects
strings that are the same up to a certain point... well maybe. Sorry about
the bug.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 17, 2005 at 10:15 AM
> Most programs can handle the changed line endings, so usually it doesn't
> matter. What program are you using for cvs access under Windows? Since you
> seem to have MinGW working, it probably makes sense to use that for the cvs
> access. Assuming that it preserves the line endings of files. I use the
> Cygwin cvs for when I'm using MSVC; that works out well. (You can access
> Windows files like c:\foo with /cygdrive/c/foo under Cygwin, and /c/foo
> under MinGW, I believe).
I was using a windows native cvs.exe. I'll use cvs under cygwin from now.
> I notice that you were using bfd_section, which used to be called sec. But
> if you use asection, it should work with either version. Perhaps there is a
> similar solution that doesn't need messing with configuration.
>
> What is the structure that has changed?
The bfd_section structure had a 'comdat' member, but now we have to use a method
to access this member, and the 'comdat' structure has been renamed.
See the source files I sent to you.
> As soon as you ask. What's your sourceforge username? You're contributing
> at least as much as other developers.
My username on sf is peakxx.
-Mohsen
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Sep 18, 2005 at 4:16 AM
Mohsen,
I've gotten a configure script test working for the BFD 2.15 define, and I
had it compiling. But somehow with the files you sent me, several changes
were lost (e.g. to boomerang.cpp, and I think now db/prog.cpp as well). It
will take a little longer to get the cvs branch set up and the files
checked in.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Sep 18, 2005 at 7:46 AM
yes. seems I forgot some changes. cause configure didn't work and ...
please make a branch from your current cvs and I will update
the branch with my changes.
[Quoted text hidden]
Mohsen Hariri <m.hariri@gmail.com>
Branch created (finally)
9 messages
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Sep 18, 2005 at 3:23 PM
> My username on sf is peakxx.
Welcome, developer Mohsen!
I've also created the branch tag; it is called libid.
To get write access to CVS, you basically need
to change your CVS/Root files. You can check out a new working copy with
cvs -d :ext:peakxx@cvs.sourceforge.net:/cvsroot/boomerang co -r libid -d dirname boomerang
Or you can change the file CVS/Root to contain the :ext... string. You
need to change all the CVS/Root files, and there are tens of these, so use a
command like
find . -name Root -exec cp /path/to/boomerang/CVS/Root {} \;
The {} expands to the path to the current file.
When you have checked it out, your files (except the ones that are the
same as the main branch versions) will have a sticky tag. This means that
you are "stuck on the branch", which is what you want. You can check in
your changes with
cvs ci -m "Check in message"
You don't need to mention -b libid, since the sticky tag implies it.
Later, when all is well, I can guide you through the process of updating to
the main branch.
You have full cvs write access now, so you can cause some damage,
accidentally or otherwise, so please use caution. Almost everything in CVS
is readily reversible, so no need to panic if something goes wrong. In
particular, please make sure you are "on the branch" before your first
check in. For example:
% cvs status boomerang.cpp
Enter passphrase for key '/home/44/emmerik/.ssh/id_dsa':
===================================================================
File: boomerang.cpp     Status: Up-to-date

   Working revision:    1.125.2.1
   Repository revision: 1.125.2.1       /cvsroot/boomerang/boomerang/boomerang.cpp,v
   Sticky Tag:          libid (branch: 1.125.2)
   Sticky Date:         (none)
   Sticky Options:      (none)
You'll be asked for your Sourceforge password unless you have set up an
ssh key. If you set up an ssh key and an ssh agent, you can avoid having
to type in passwords and/or pass phrases for every cvs command. As a
developer, you need to enter a password or pass phrase even for read only
commands (such as cvs status above). But I guess you know all that from
the other project you were working on.
With the files as checked in, I get
boomerang.cpp: In member function `void Boomerang::decode(Prog*, const char*)':
boomerang.cpp:1062: error: 'class Prog' has no member named 'printSymbols'
Sounds like a problem with prog.cpp not getting the latest update; I
renamed printSymbols to printSymbolsToFile. The diff for that is so small
you may as well do it manually (just beware of spaces versus tabs; we like
to be pretty strict about 4 column tabs; my emailer won't paste tabs):
% cvs diff -r 1.135 -r 1.136 db/prog.cpp
< * $Revision: 1.135 $ // 1.126.2.14
---
> * $Revision: 1.136 $ // 1.126.2.14
581a582,588
> void Prog::dumpGlobals() {
>     for (std::set<Global*>::iterator it = globals.begin(); it != globals.end(); it++) {
>         (*it)->print(std::cerr, this);
>         std::cerr << "\n";
>     }
> }
>
1311,1312c1318,1319
< void Prog::printSymbols() {
<     std::cerr << "entering Prog::printSymbols\n";
---
> void Prog::printSymbolsToFile() {
>     std::cerr << "entering Prog::printSymbolsToFile\n";
1333c1340
<     std::cerr << "leaving Prog::printSymbols\n";
---
>     std::cerr << "leaving Prog::printSymbolsToFile\n";
1442a1450,1454
> void Global::print(std::ostream& os, Prog* prog) {
>     Exp* init = getInitialValue(prog);
>     os << nam << " at " << std::hex << uaddr << std::dec << " initial value " << (init ? init->prints() : "<none>");
> }
>
I've rarely done this, but I believe you can do the above with
% cvs update -j 1.135 -j 1.136 db/prog.cpp
which is supposed to merge in the changes between 1.135 and 1.136 into the
current checked out copy. I'm not sure what the resultant revision is, or
how it affects sticky tags (I imagine it would not affect either). So use
with caution.
I got the revision number from cvs log. There may be one or two more like
this. Ah, thinking about this, this one is not your fault; it's just that
I checked in that change and you had not updated to that change before
making the tar/zip files. I did some overwriting of files, which is always
dangerous. These things can get tricky. Sigh. One day I'll be an expert at
cvs branches.
Good luck! Please let me know when you've done the first few checkins by
email. That way I can make sure that there are no hassles that might
affect people. These days, it doesn't take long between a small mistake
and a bug report.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Mon, Sep 19, 2005 at 12:21 PM
I've checked in my changes. Now libid works as a DLL under windows,
and is loaded when needed. I used the codes in the BinaryFileFactory.cpp
so for other platforms also it should work OK, it just needs the makefile to
be changed.
Look if I'm doing things the right way in cvs.
[Quoted text hidden]
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Tue, Sep 20, 2005 at 1:54 AM
> I've checked in my changes. Now libid works as a DLL under windows,
> and is loaded when needed. I used the codes in the BinaryFileFactory.cpp
> so for other platforms also it should work OK, it just needs the makefile to
> be changed.
>
> Look if I'm doing things the right way in cvs.
I've booked in a handful of changes, to
symbols/*
include/SymbolMatcher.h
Makefile.in
to enable making under Unix. I haven't tested yet; I need to copy over
some test files. It made for me with no problems under MinGW as well.
I removed carriage returns from symbols/lididloader.cpp; this might make
for a difficult merge if you have made any changes. Assuming you haven't,
it's probably easier to rename your copy and update from cvs.
You have checked in bfd.dll but not libgc.dll and libexpat.dll (to the
branch). I changed bfd.dll to have the sticky option -kb (so cvs knows it
is a binary and doesn't try to do keyword substitution).
Your files will need the GPL disclaimer at the top of each file, but that
can wait. We will also need the LICENSE file updated with the GPL.
You have not checked in the test/symbols files. Have you decided if you
want all those checked in?
The problems are all minor, and you successfully checked in on the branch.
Well done!
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 9:06 AM
> You have checked in bfd.dll but not libgc.dll and libexpat.dll (to the
> branch). I changed bfd.dll to have the sticky option -kb (so cvs knows it
> is a binary and doesn't try to do keyword substitution).
should I do that? I mean adding libgc and libexpat.dll?
seems they are statically linked under windows, with default
configuration, right?
> Your files will need the GPL disclaimer at the top of each file, but that
> can wait. We will also need the LICENSE file updated with the GPL.
Please do it for me. I don't know still how to put that revision history
on top of my files, I guess it must be easy. Will you do that?
> You have not checked in the test/symbols files. Have you decided if you
> want all those checked in?
I don't know if I should put that test in cvs, because that test used
libc.lib, which was sold with VS2003, so that will have a copyright
problem, right?
I'm working on signatures. As soon as I'm done, I will make some sig files
and put them in the tests directory.
So about signatures: I will make matchsym command without arguments to
autodetect any possible library in the executable.
my next mail is the format of signature files, in 1 or 2 hours,
I guess it would be long :p
-Mohsen
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Tue, Sep 20, 2005 at 9:50 AM
On Tue, 20 Sep 2005, Mohsen Hariri wrote:
>> You have checked in bfd.dll but not libgc.dll and libexpat.dll (to the
>> branch). I changed bfd.dll to have the sticky option -kb (so cvs knows it
>> is a binary and doesn't try to do keyword substitution).
>
> should I do that? I mean adding libgc and libexpat.dll?
> seems they are statically linked under windows, with default
> configuration, right?
Err, not as far as I know. What is the Windows equivalent of a .a file?
.lib? I think the idea was to make it really easy for Windows people, since
they are likely to be the least capable, and things like libexpat are more
Unix friendly than Windows friendly.
>> Your files will need the GPL disclaimer at the top of each file, but that
>> can wait. We will also need the LICENSE file updated with the GPL.
>
> Please do it for me. I don't know still how to put that revision history
> on top of my files, I guess it must be easy. Will you do that?
OK, when I get a bit of time. Should be pretty easy.
>> You have not checked in the test/symbols files. Have you decided if you
>> want all those checked in?
>
> I don't know if I should put that test in cvs, because that test used
> libc.lib, which was sold with VS2003, so that will have a copyright
> problem, right?
Yes, let's not check that in. In fact, I think we should think about what
files if any to check in for library testing. Certainly, a functional test
of some sort would be good.
> I'm working on signatures. As soon as I'm done, I will make some sig files
> and put them in the tests directory.
Hmmm. I suppose the signatures themselves could be generated in a totally
stand alone program, not just a separate library. That tool could be GPL'd,
and Boomerang could stay as is (BSD-like license). Do you need bfd for the
signature *reading* code? Surely not.
Would you consider making the signature generator a separate program under
Windows? I can modify the Unix makefile to do the same thing. I don't know
why this didn't come to me before. It would sure simplify the licensing.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 10:17 AM
> Err, not as far as I know. What is the Windows equivalent of a .a file?
> .lib? I think the idea was to make it really easy for Windows people, since
> they are likely to be the least capable, and things like libexpat are more
> Unix friendly than Windows friendly.
.lib files stand for both .a and .la files. They can contain the whole code,
or they can contain a simple stub for the dll file.
> Would you consider making the signature generator a separate program under
> Windows? I can modify the Unix makefile to do the same thing. I don't know
> why this didn't come to me before. It would sure simplify the licensing.
I'm making a separate program somehow. But I think the non-signatured lib
file matching and obj file matching are useful, and they can be a part of
boomerang. anyway, maybe the overhead of changing the license is more
than the benefits.
- Mohsen
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 10:58 AM
One more thing. I want to use wine's demangling codes, and
it's LGPL. I can do this also in the separate signature generator
program. But tomorrow we will be trying to read debug information
from an executable and that's again GPL.
Anyway I think not using any GPL code in the whole boomerang
project is somehow limiting ourselves, and what would boomerang
get from being under pure BSD license? other than being misused by
proprietary software?
I think technology edge programs like boomerang MUST be GPLed,
because they have a high potential for being misused.
anyway you are the boss ;) you decide.
-Mohsen
[Quoted text hidden]
Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Tue, Sep 20, 2005 at 12:11 PM
On Tue, 20 Sep 2005, Mohsen Hariri wrote:
> One more thing. I want to use wine's demangling codes, and
> it's LGPL. I can do this also in the separate signature generator
> program. But tomorrow we will be trying to read debug information
> from an executable and that's again GPL.
Well, you could extend the Win32BinaryFile etc classes to handle it. But
then again, it probably makes sense to get rid of these libraries and just
re-use BFD anyway. Any improvements to BFD will automatically be available,
any new formats likewise, and so on.
We considered using BFD early on, but decided it was just too hard to use
and/or learn how to use. So we wrote our own. Sounds lame now, but that's
how it happened.
> Anyway I think not using any GPL code in the whole boomerang
> project is somehow limiting ourselves, and what would boomerang
> get from being under pure BSD license? other than being misused by
> proprietary software?
Yes, I can see that not using any GPL'd code is quite restrictive. If not
now, then more and more so in the future. So we may as well prepare for the
inevitable.
> I think technology edge programs like boomerang MUST be GPLed,
> because they have a high potential for being misused.
Interesting point of view.
> anyway you are the boss ;) you decide.
I've started writing up a web page explaining the new license. It will take
a few days, as spare time is short.
- Mike
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 10:35 PM
> We considered using BFD early on, but decided it was just too hard to use
> and/or learn how to use. So we wrote our own. Sounds lame now, but that's
> how it happened.
YES. BFD is hard to use. The documentation is very poor, you
have to just explore it yourself. Considering BFD documentation
these days, I can imagine how it was in 1992-3 when you were
starting UQBT. BFD has done a hard job in C, which could be
done in a much more readable and maintainable way in C++. But I know
they have many reasons to do so. Anyway, current binary file
handling in boomerang is much more readable than BFD.
Maybe BFD will use the BinaryFile class from boomerang someday :p
- Mohsen
Mohsen Hariri <m.hariri@gmail.com>
Library Signatures
2 messages
Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Sep 23, 2005 at 6:55 PM
Hi,
I'm still thinking about the signature files.
The method you used in dcc was very interesting,
(minimal perfect hashing), but if we want to do
the same in boomerang we would be somehow
cpu instruction dependent, cause we have to
identify which instructions can get relocatable
arguments.
The way IDA signatures work is storing signatures
in a tree-structure format. That would be a more
practical method for our purpose.
Now I know much more about compiling templates. Didn't know
templates have that many problems, for example you cannot
generally store a function which uses templates in a library/object.
That's why all stl headers contain codes. some interesting point
here is that the linker doesn't warn you about multiple definitions
if you have two object files, that in each of them you have used
a vector<int> for example. (here the vector<int> class
will compile and will be stored in both object files, and there will
be many duplicate functions). So for classes with templates,
microsoft linker works like gcc with link once option. So if
we use a template tag for binaryfile class, we won't get any
warnings about duplicate definitions. :p
Anyway, I feel I can start coding now. Please just tell me what
you think about these problems:
1. (again)Can a library be used on more than one platform?
Is it common? If it's common, we have to put more
than one platform id in header of signature files.
IDA sig files store only one platform in each sig file.
2. What should we do with functions with the same name
but different parameters? We have to use both function
name and parameters to look up the function in header
file to find the prototype, or some other method like
putting a hint number in the signature function and
related header function, anyway we cannot
find the function by using just the function name.
-Mohsen
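The point above about relocatable arguments can be sketched as follows. This is an illustration under a stated assumption, not the method used in dcc, IDA or Boomerang: it assumes the signature generator works from object files, so the offsets of relocated operands are already known from the relocation entries and no instruction decoding is needed. The names makePattern and WILDCARD are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker for a byte that may differ between linked executables.
const int WILDCARD = -1;

// Build a pattern from the raw bytes of a function. Every byte covered by a
// relocation (assumed here to be `relocSize` bytes wide) becomes a WILDCARD,
// because the linker will overwrite it with a final address.
std::vector<int> makePattern(const std::vector<unsigned char>& code,
                             const std::vector<std::size_t>& relocOffsets,
                             std::size_t relocSize = 4) {
    std::vector<int> pattern(code.begin(), code.end());
    for (std::size_t off : relocOffsets) {
        for (std::size_t i = 0; i < relocSize && off + i < pattern.size(); ++i)
            pattern[off + i] = WILDCARD;
    }
    return pattern;
}
```

For example, an x86 call instruction E8 followed by a 4-byte relocated displacement would yield a pattern that keeps the E8 opcode and wildcards the four operand bytes.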
Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 24, 2005 at 3:17 AM
To: Mohsen Hariri <m.hariri@gmail.com>
> Hi,
>
> I'm still thinking about the signature files.
> The method you used in dcc was very interesting,
> (minimal perfect hashing), but if we want to do
> the same in boomerang we would be somehow
> cpu instruction dependent, cause we have to
> identify which instructions can get relocatable
> arguments.
Yes, but you need to identify the "wildcards" either way. I do think that
the tree method is better; I just didn't think of it at the time. Plus, it
allows different sized patterns (you might sometimes need 100 bytes in a
pattern to distinguish a difference in the 101st byte, but most patterns
can be about 20 bytes).
> The way IDA signatures work is storing signatures
> in a tree-structure format. That would be a more
> practical method for our purpose.
Agreed.
> Now I know much more about compiling templates. Didn't know
> templates have that many problems, for example you cannot
> generally store a function which uses templates in a library/object.
Interesting.
> That's why all stl headers contain codes. some interesting point
> here is that the linker doesn't warn you about multiple definitions
> if you have two object files, that in each of them you have used
> a vector<int> for example. (here the vector<int> class
> will compile and will be stored in both object files, and there will
> be many duplicate functions). So for classes with templates,
> microsoft linker works like gcc with link once option. So if
> we use a template tag for binaryfile class, we won't get any
> warnings about duplicate definitions. :p
Heh.
> Anyway, I feel I can start coding now. Please just tell me what
> you think about these problems:
>
> 1. (again)Can a library be used on more than one platform?
Yes, most libraries would be like that: curses, bfd, expat, gc, etc.
> Is it common?
Yes.
> If it's common, we have to put more
> than one platform id in header of signature files.
But signatures are specific to a platform. For example, the PowerPC curses
library will have very different patterns to the Windows one, and probably
the Linux one will be somewhat different to the Win32 one.
Oh, perhaps we are talking about different "signatures" (unfortunate name
clash). If you are thinking about prototypes (spec of names and parameter
types), then yes, they should generally be the same across all platforms.
There may be minor exceptions (e.g. on a 16-bit platform, there may be a
long where on 32 bit platforms there is an int), but they should equate to
the same prototype in Boomerang terms. It might be good to allow for
exceptions when required.
> IDA sig files store only one platform in each sig file.
I'm pretty amazed that they can even combine so many versions, e.g. MFC
versions 2-4. (Signatures, not prototypes).
I wonder if prototypes should be stored in the signature (pattern) file?
> 2. What should we do with functions with the same name
> but different parameters? We have to use both function
> name and parameters to look up the function in header
> file to find the prototype, or some other method like
> putting a hint number in the signature function and
> related header function, anyway we cannot
> find the function by using just the function name.
Maybe this goes away too if you put the prototype in the pattern file. I
think you only ever start with a pattern and want to know the prototype,
if any. I can't see a need (in Boomerang, perhaps in some stand alone
tools) to ask "do you have a pattern for this prototype?".
- Mike
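The tree idea discussed in this thread, with wildcards, patterns of different lengths, and the prototype stored next to the pattern, can be sketched as a small trie. This is only an illustration of the idea; it is not IDA's signature format or the Boomerang implementation, and the names SigNode, SigTree and WILDCARD, as well as the byte sequences and prototypes in the usage note, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

const int WILDCARD = -1;  // hypothetical marker: matches any byte

struct SigNode {
    std::map<int, std::unique_ptr<SigNode>> children;  // key: byte 0..255 or WILDCARD
    std::string prototype;                             // non-empty if a pattern ends here
};

class SigTree {
public:
    // Store a pattern (bytes and wildcards) together with its prototype.
    void insert(const std::vector<int>& pattern, const std::string& prototype) {
        SigNode* node = &root_;
        for (int b : pattern) {
            std::unique_ptr<SigNode>& child = node->children[b];
            if (!child) child.reset(new SigNode);
            node = child.get();
        }
        node->prototype = prototype;
    }

    // Return the prototype of a pattern matching the start of `code`, or ""
    // if none matches. Deeper matches win; at each byte the exact branch is
    // tried before the wildcard branch.
    std::string match(const std::vector<unsigned char>& code) const {
        return matchFrom(&root_, code, 0);
    }

private:
    static std::string matchFrom(const SigNode* node,
                                 const std::vector<unsigned char>& code,
                                 std::size_t pos) {
        if (pos < code.size()) {
            const int keys[2] = { code[pos], WILDCARD };
            for (int k : keys) {
                auto it = node->children.find(k);
                if (it != node->children.end()) {
                    std::string r = matchFrom(it->second.get(), code, pos + 1);
                    if (!r.empty()) return r;
                }
            }
        }
        return node->prototype;  // pattern ending at this node, if any
    }

    SigNode root_;
};
```

Inserting an 8-element pattern {0x55, 0x8B, 0xEC, 0xE8, WILDCARD, WILDCARD, WILDCARD, WILDCARD} and a 4-element pattern {0x55, 0x8B, 0xEC, 0x33} shows how patterns of different lengths share the common prefix. Because the lookup starts from bytes and ends at a prototype, two functions with the same name but different parameters are simply two different leaves.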