A Decompiler Project
Undergraduate Thesis
By Mohsen Hariri

1. Introduction
2. Boomerang Project
3. Binary Translation
   3.1 The Front-end
   3.2 SLED
   3.3 SSL and RTL
   3.4 CFG
   3.5 Backends
4. Static Library Detection
   4.1 Application
   4.2 Library Files
   4.3 Matching Methods
       4.3.1 Raw Object/Library Matching
       4.3.2 Signature Matching
   4.4 Symbols Demangling
5. Terminology
   5.1 Terms
   5.2 Acronyms
6. Project Discussions

1. Introduction
A compiler is a program that takes as input a program written in a high level language and produces as output an executable program for a target machine; in other words, the input is language dependent and the output is machine dependent. A decompiler, or reverse compiler, attempts to perform the inverse process: given an executable program, the aim is to produce a high level language program that performs the same function as the executable program. The input in this case is machine dependent, and the output is language dependent.
Decompilation techniques were initially used to aid in the migration of programs from one platform to another. Since then, they have been used in many other fields, such as the recovery of lost source code, debugging of programs, program comprehension, recovery of high level views of programs, and worm and virus analysis.
In general, the decompilation problem is equivalent to the halting problem. A naive approach to decompilation attempts to enumerate all valid phrases of an arbitrary attribute grammar, and then to perform a reverse match of these phrases to their original source code. An algorithm to solve this problem has been proved to be halting problem equivalent. A more sensible approach is to try to determine which addresses contain data and which ones contain instructions in the given binary program.
Given that in a Von Neumann machine data and instructions are represented in the same way in memory, an algorithm that solves this data/instruction problem would also solve the halting problem, which is impossible. This means that the decompilation problem belongs to the class of non-computable problems; it is equivalent to the halting problem, and is therefore only partially computable. In other words, we can build a decompiler which produces the right output for some input programs, but not for all input programs in general. (from “A Methodology for Decompilation”, Cristina Cifuentes and K. John Gough)
Today, many generic disassemblers are publicly available, but decompilers, being much harder to design and implement, are few, and almost all of them target a specific programming language and compiler. A disassembler is a program that reads an executable program and translates it into an equivalent assembler program; a decompiler goes a step further by translating the program into an equivalent high level language (such as C or Pascal) program. As with most reverse engineering tools, disassemblers and decompilers are semiautomated tools rather than fully automated tools.
In effect, decompilation is merely an extension of disassembly. If you can produce assembly language source for a program, then you can produce high-level language source for that program with more effort.

Figure 1 – A Decompilation System

2. Boomerang Project
The Boomerang project is an attempt to develop a real decompiler for machine code programs through the open source community. A decompiler takes as input an executable file, and attempts to create a high level, compilable, possibly even maintainable source file that does the same thing. It is therefore the opposite of a compiler, which takes a source file and makes an executable.
However, a general decompiler does not attempt to reverse every action of the compiler; rather it transforms the input program repeatedly until the result is high level source code. It therefore won't recreate the original source file; probably nothing like it. It does not matter if the executable file has symbols or not, or was compiled from any particular language. (However, declarative languages like ML are not considered.)
The intent is to create a retargetable decompiler (i.e. one that can decompile different types of machine code files with modest effort, e.g. X86-windows, sparc-solaris, etc). It was also intended to be highly modular, so that different parts of the decompiler can be replaced with experimental modules. It was intended to eventually become interactive, a la IDA Pro, because some things (not just variable names and comments, though these are obviously very important) require expert intervention. Whether the interactivity belongs in the decompiler or in a separate tool remains unclear.
By transforming the semantics of individual instructions, and using powerful techniques such as Static Single Assignment dataflow analysis, Boomerang should be (largely) independent of the exact behavior of the compiler that happened to be used. Optimization should not affect the results. Hence, the goal is a general decompiler. (From Boomerang project homepage, http://boomerang.sourceforge.net/)
The main theories for this project are taken from research on binary translation, mainly the UQBT project. Research on UQBT has been under way since 1995 with the support of Sun Microsystems Laboratories. The Boomerang project started in 2002 and, at the time of writing, can decompile (very) simple and small programs, generating equivalent C code.
One major limitation is that there is as yet no recognition of statically linked library functions (a la FLIRT of IDA Pro).
This is a severe limitation for many Windows programs, most of which have large amounts of library code statically linked (as well as plenty that are dynamically linked). That means that until this is implemented, then without significant manual guidance, Boomerang can't even decompile its own test/windows/hello.exe. (From Boomerang project homepage, http://boomerang.sourceforge.net/)
In this project, I am going to implement library detection for Boomerang. First, however, I need to know the parts of the project and have an overview of the system. So I will start by describing the parts that are already implemented, and then focus on my part of the whole project.

3. Binary Translation
Binary translation is the process of automatically translating a binary executable program from one machine to another. This process normally involves different machines, Mi, different operating systems, OSi, and different binary-file formats, BFFi, in order to translate programs from (M1, OS1, BFF1) to (M2, OS2, BFF2).
Like a compiler, a binary translator can be loosely divided into front end, analyzer and optimizer, and back end. The front end decodes a source-machine binary file, produces RTLs, and then lifts the level of abstraction to HRTL (the high-level intermediate representation) by using knowledge of the source-machine calling conventions and instruction set. The analyzer and optimizer map from source-machine locations to target-machine locations, and it may apply other machine-specific optimizations to prepare for the back end. The back end translates the intermediate HRTL to target-machine instructions, and it writes a binary file in the required format.
Compilers and other tools have traditionally been called retargetable when they can support multiple target machines at low cost. By extension, we call a binary-analysis tool resourceable if it can analyze binaries from multiple source machines at low cost. Retargetability is supported in the UQBT framework through specifications of properties of machines and operating system conventions. (From UQBT project homepage, http://www.itee.uq.edu.au/~cristina/uqbt.html)

Figure 2 - UQBT Abstract Architecture
Figure 3 - UQBT Detailed Architecture (2001)

3.1 The Front-end
The front end module deals with machine dependent features and produces a machine independent representation. It takes as input a binary program for a specific machine, loads it into virtual memory, parses it, and produces an intermediate representation of the program.
The loader is an operating system program that loads an executable program into memory, sets up the segment registers and the stack, and transfers control to the program. Note that executable files do not contain much information about which segments are used as data and which ones are used as code; data segments can contain code and/or addresses.
The parser decides the type of machine instruction at a given memory location, and determines its operands and any offsets involved. The parsing of machine instructions is not as easy as it might appear.
First of all, there are addressing modes that depend on the value of variables or registers at runtime. Second, indexed and indirect accesses to memory locations are difficult to resolve. Third, the complex machine instruction sets in today's machines utilize almost all combinations of bytes, and therefore it is very hard to determine if a given byte is an instruction or is data. Fourth, there is no difference as to how data and instructions are stored in memory in a Von Neumann machine. Finally, idioms are used by compiler writers to perform a function in the minimal number of machine cycles, and therefore a group of instructions will make sense only in a logical way, but not individually. (An idiom is a sequence of instructions which forms a logical entity and has a meaning that cannot be derived by considering the primary meanings of the individual instructions.)
In order to determine which bytes of information are instructions and which ones are data, we start at the unique entry point to the program, given by the loader. This entry point must be the first instruction of the program, in order to begin execution. From there on, instructions are parsed sequentially, until the flow of control changes due to a branch, a procedure call, etc. In this case, the target location is like a new entry point to part of the program, and from there onwards, instructions can be parsed in the previous way. Once there are no more instructions to parse, due to an end of procedure or end of program, we return to where the branch of control occurred and continue parsing at that level. This method traverses all possible instruction paths. At the same time, data references are placed in a global or local symbol table, depending on where the data is stored (i.e. as an offset on the stack, or at a definite memory location).
A major problem is introduced by the access of indexed and indirect memory instructions and locations.
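The traversal described above (start at the entry point, decode sequentially, and treat every branch or call target as a new entry point) can be sketched as follows. This is only an illustration under strong assumptions: the two-byte instruction encoding and the opcode values are invented for the example and do not correspond to any real machine or to Boomerang's decoder, and a worklist is used in place of explicit recursion. Indirect targets, the problem just mentioned, are exactly what such a sketch cannot follow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy encoding: every instruction is [opcode, operand].
enum Opcode : std::uint8_t {
    OP_PLAIN = 0x01,  // ordinary instruction, falls through
    OP_JMP   = 0x02,  // unconditional jump to the operand address
    OP_JCC   = 0x03,  // conditional jump: operand address or fall through
    OP_CALL  = 0x04,  // call the operand address, then fall through
    OP_RET   = 0x05   // end of procedure, no successor
};

// Returns the addresses that are reached as instructions from `entry`.
// Every byte not covered by the result is presumed to be data.
std::set<std::size_t> traverse(const std::vector<std::uint8_t>& mem,
                               std::size_t entry) {
    assert(entry < mem.size());
    std::set<std::size_t> code;                // addresses decoded so far
    std::vector<std::size_t> pending{entry};   // entry points still to follow
    while (!pending.empty()) {
        std::size_t pc = pending.back();
        pending.pop_back();
        // Decode sequentially; stop at the end of memory or at an
        // address that has already been decoded.
        while (pc + 1 < mem.size() && code.insert(pc).second) {
            std::uint8_t op = mem[pc];
            std::size_t target = mem[pc + 1];
            if (op == OP_RET) break;                      // path ends here
            if (op == OP_JMP) { pc = target; continue; }  // follow the jump
            if (op == OP_JCC || op == OP_CALL)
                pending.push_back(target);                // new entry point
            pc += 2;                                      // fall through
        }
    }
    return code;
}
```

Called on a small image whose last two bytes are never reached by any path, `traverse` returns only the instruction addresses, so the unreached bytes stay classified as data.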
To handle indexed and indirect accesses, heuristic methods need to be implemented to determine as much information as possible; analytic methods, such as emulation, cannot provide the whole range of solutions anyway. In general, it is impossible to solve these types of problems as they are equivalent to solving the halting problem, as previously mentioned. Different problems are introduced by self-modifying code and virus tricks. A way to tackle these cases is to flag the sections of code involved, and comment them in the final program. Assembler code might be all that can be produced in these cases. Furthermore, a suggested optimal algorithm for parsing consists of finding the maximum number of trees that contain instructions; this is a combinatorial method that has been proved to be NP-complete. For dense machine instruction sets, this algorithm does not solve the problem of data residing in code segments.
The intermediate code generator produces an intermediate representation of the program. It works closely together with the parser, invoking it to get the next instruction. Each machine instruction gets translated into an intermediate code instruction, such representation being machine and language independent. Defined/used (du) chains of registers are also attached to the intermediate instruction; these are used later in the data flow analysis phase.
The quality of the intermediate code can be improved by an optimization stage that eliminates any redundant instructions, finds probable idioms, and replaces them by an appropriate intermediate instruction. Many idioms are machine dependent and reveal some of the semantics associated with the program at hand. Such idioms represent low level functions that are normally provided by the compiler at a higher level (e.g. multiplication and division of integers by powers of 2). Other idioms are machine independent and they reflect a shortcut used by the compiler writer in order to get faster code (i.e. fewer machine cycles for a given function), such as the addition and subtraction of long numbers. Some of these idioms are widely known in the compiler community, and should be coded into the decompiler. (from “A Methodology for Decompilation”, Cristina Cifuentes and K. John Gough)

3.2 SLED
SLED (Specification Language for Encoding and Decoding) defines the mapping between symbolic, assembly-language, and binary representations of machine instructions.
The New Jersey Machine Code (NJMC) toolkit allows users to write machine descriptions of assembly instructions and their associated binary representations using SLED. SLED provides for compact specifications of RISC and CISC machines, with 127, 193 and 460 lines of specification for the MIPS, SPARC and Pentium respectively. The toolkit also provides extra support for encoding of assembly to binary code, and decoding of binary to assembly code. For decoding purposes, the toolkit provides a matching statement which resembles the C switch statement. The toolkit generates C and Modula-3 code from matching statements, hence automating part of the disassembly process. The generated code can be integrated as a module of a binary-decoding application. (From UQBT project documents)

3.3 SSL and RTL
Register transfer lists (RTL) is an intermediate language that describes transfers of information between register-based instructions. RTL assumes an infinite number of registers, hence it is not constrained to a particular machine representation. More recently, RTL has been used as an intermediate representation in different system tools such as the link-time optimizer OM, GNU’s compilers, and the editing library EEL. In all these tools, RTL stands for register transfer language, and the representations vary widely.
(From UQBT project documents)
SSL (Syntax/Semantic Language, or Semantic Specification Language) has been developed in order to describe the semantics of machine instructions. In UQBT, SSL is defined in terms of RTLs. The syntax of SSL is defined in Extended Backus-Naur Form (EBNF).
HRTL is the result of applying some transformations on RTLs, to build a more abstract representation of RTLs.

3.4 CFG
A control flow graph (CFG) is a directed graph that represents the flow of control of a program; thus, it only represents the flow of instructions (code) of the program and excludes data information. The nodes of a CFG represent basic blocks of the program, and the edges represent the flow of control between nodes. A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end.

Figure 4 - Data Structures to Represent a Binary Program

3.5 Backends
The back end generates the final output of the program. It can be an executable file for another machine’s architecture, or a high-level language representation of the input executable file.

4. Static Library Detection

4.1 Application
Executable files contain both user code and library code. Being able to distinguish them lets us generate simpler, smaller, and more readable source files. Furthermore, many library functions are written in assembler and contain instructions and structures that are difficult or impossible to decompile. All programs are more readable when calls to library functions use their symbolic names (e.g. “strcmp” rather than “proc003”).

4.2 Library Files
There are many object and library file formats, and free software exists to read and use some of them. The most widely used object file formats today are:
1. COFF: Common Object File Format, used by Microsoft compilers. The standards for this format are not completely documented, but there are some free tools to read it.
2. OMF: Designed by Intel, and used by Microsoft before switching to COFF. Borland compilers used this format long after Microsoft. This format is also documented to some extent, but I have not seen any free tools to read it.
3. ELF: Used by GNU gcc, both for object files and executable files. The standards for this format are available and freely accessible.
Other formats, such as relocatable a.out (used on BSD UNIX systems) and IBM 360 objects, are not covered here.
BFD is the GNU library for dealing with object, library and executable files. It supports a wide range of object file formats such as COFF and ELF, but it does not support the OMF format. I am using the BFD library to read object files. BFD provides a uniform API for working with object files, so using it allows Boomerang to read any object file format that BFD supports. However, this approach prevents us from using format-specific extra information that could be extracted if we were not using BFD.

4.3 Matching Methods
Matching a function in an object file is composed of these steps:
1. Read the function op-codes from the object file.
2. Flag the relocatable bytes so that they are excluded from the comparison.
3. Match the remaining data pattern against the executable sections.
When a match occurs, we execute the following steps:
1. Create a virtual procedure in Boomerang using Prog::newProc.
2. Recover the prototype where possible. In C++, parameters and return value are encoded and appended to the procedure name, so if we know the mangling scheme of the compiler, we can even extract the prototype of the function. For more information see section 4.4.
3. Attach names to any references made by this procedure using BinaryFile::AddSymbol. For example:

    int g_mm;
    int myproc() { g_mm = 10; return 0; }

If we locate myproc in an executable, then using the function's references we can also locate g_mm, which is a global variable here.
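The three matching steps can be illustrated with a small sketch. It is not Boomerang's or BFD's code: the `BytePattern` type and the `makePattern` and `findMatches` functions are hypothetical names introduced here, relocations are assumed to be given as plain byte offsets of a fixed size, and the search is the naive one that tries every start offset in the section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation of one library function prepared for matching.
struct BytePattern {
    std::vector<std::uint8_t> bytes;  // step 1: op-codes read from the object file
    std::vector<bool> wild;           // step 2: true where a relocation covers the byte
};

// Step 2: flag every byte covered by a relocation so it is ignored later.
BytePattern makePattern(const std::vector<std::uint8_t>& code,
                        const std::vector<std::size_t>& relocOffsets,
                        std::size_t relocSize) {
    assert(relocSize > 0);
    BytePattern p{code, std::vector<bool>(code.size(), false)};
    for (std::size_t off : relocOffsets)
        for (std::size_t i = off; i < off + relocSize && i < code.size(); ++i)
            p.wild[i] = true;
    return p;
}

// Step 3: return every offset in `section` at which the pattern matches,
// comparing only the bytes that are not flagged as relocatable.
std::vector<std::size_t> findMatches(const BytePattern& p,
                                     const std::vector<std::uint8_t>& section) {
    std::vector<std::size_t> hits;
    if (p.bytes.empty() || p.bytes.size() > section.size()) return hits;
    for (std::size_t start = 0; start + p.bytes.size() <= section.size(); ++start) {
        std::size_t i = 0;
        while (i < p.bytes.size() &&
               (p.wild[i] || p.bytes[i] == section[start + i]))
            ++i;
        if (i == p.bytes.size()) hits.push_back(start);
    }
    return hits;
}
```

For instance, a six-byte function consisting of an x86 `call rel32` followed by `ret` has its four displacement bytes flagged, so it still matches inside an executable where the linker has filled in a concrete displacement. Each hit would then be handed to the post-match steps listed above.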
Here we have assumed that we are only matching function code against the executable sections of the executable file. It is also possible to match data patterns against executable files, such as searching for MD5 or Blowfish substitution tables. This is useful only in rare conditions, where we have just the data and the code is somehow encrypted or inaccessible, because usually these tables are referenced by functions and can be found by following function references.
The general symbol matching interface is defined in the SymbolMatcher abstract class. To apply a symbol container to the executable, we use SymbolMatcherFactory, which returns a suitable SymbolMatcher object for the specified symbol container file.

4.3.1 Raw Object/Library Matching
This method is used for raw object/library files whose format is supported by BFD. It is implemented in BfdObjMatcher and BfdArchMatcher. BfdObjMatcher is used to read and match object files, and BfdArchMatcher does the same for library files.

4.3.2 Signature Matching
Considering the number of libraries, the number of versions of each library, and the number of functions each contains, matching every available library against an executable is very time consuming. So for automatic library function detection, raw library matching is impractical. To reduce the work, we can consider the following points:
1. Limit the number of libraries to be matched against the executable by checking whether the executable could have used the specified library at all, for example by checking the platform of the executable against that of the library.
2. Some libraries, when used in an executable, always add a particular function to the executable. This primary function can be searched for before any other function of that library; if it is not found, we can assume that the library has not been used.
Even after applying these considerations, the problem of the large number of functions to match remains.
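The two filters listed above could be combined along the following lines. This is a sketch under assumptions, not the thesis implementation: `LibraryCandidate` and `plausibleLibraries` are hypothetical names, the platform is compared as a plain string, and the primary function is assumed to contain no relocatable bytes, so an exact byte search is enough.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical summary of one library that might have been linked in.
struct LibraryCandidate {
    std::string name;
    std::string platform;                    // e.g. "x86-windows"
    std::vector<std::uint8_t> primaryBytes;  // code of the always-linked function
};

// Returns the names of the libraries still worth matching in full.
std::vector<std::string> plausibleLibraries(
        const std::vector<LibraryCandidate>& candidates,
        const std::string& exePlatform,
        const std::vector<std::uint8_t>& exeCode) {
    assert(!exePlatform.empty());
    std::vector<std::string> keep;
    for (const LibraryCandidate& lib : candidates) {
        // Filter 1: the library must target the executable's platform.
        if (lib.platform != exePlatform) continue;
        // Filter 2: the primary function must occur in the code section.
        // A library with no known primary function is kept.
        bool found = lib.primaryBytes.empty() ||
            std::search(exeCode.begin(), exeCode.end(),
                        lib.primaryBytes.begin(),
                        lib.primaryBytes.end()) != exeCode.end();
        if (found) keep.push_back(lib.name);
    }
    return keep;
}
```

Both filters only shrink the list of libraries; every function of each remaining library still has to be matched, which is the cost that remains.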
So we use another method for matching function data: we store function signatures in a tree structure and match all library functions at once using that structure. This eliminates searching the whole code section once per function. Building a tree structure for functions while taking relocation bytes (which we call wild-bytes) into account can be done in many different ways and requires many decisions. For example: how many bytes of a function should we use for matching? When two functions are the same in those bytes, how should we distinguish them? When two functions differ only in relocation bytes, how can we find the difference?
The signature file structure is shown below:

    struct SIG_FILE_HEADER {
        // library platform
        platform platform_id;  // should I use "MACHINE" enum? (from binaryfile.h)
        // format of the executable file that signatures can be matched against
        LOAD_FMT file_format;
        // name of the library this signature is extracted from
        SIG_STRING_IDX library_name;
        // name of header file which defines possible structures used in this library
        SIG_STRING_IDX header_file;
        // start of symbols section
        int symbols_section_start;
        // total number of symbols in this signature file
        int total_symbols;
        // start of nodes section
        int tree_section_start;
        // total number of nodes in this signature file
        int total_nodes;
        // start of references section
        int references_section_start;
        // total number of references in this signature file
        int total_references;
        // start of strings section
        int strings_section_start;
        // total number of strings in this signature file
        int total_strings;
        // start of index arrays section
        int index_array_section_start;
        // total number of index arrays in this signature file
        int total_index_arrays;
        // primary symbol of this signature file: the symbol which is
        // always linked when this library is used
        SIG_SYMBOL_IDX primary_symbol;
    };

    struct SIG_SYMBOL {
        // prototype of the symbol
        SIG_STRING_IDX prototype;
        // comments
        SIG_STRING_IDX comments;
        SIG_NODE_IDX matching_node;
    };

    struct SIG_NODE {
        // if this node is just a reference
        bool is_ref;
        // parent of this node
        SIG_NODE_IDX parent;
        union {
            // if not just a reference
            struct {
                // size of node contents
                int contents_size;
                // signature contents
                unsigned char *contents;
            };
            // if just a reference
            SIG_SYMBOL_REF_IDX sym_ref;
        };
        // is it a leaf? if so, we have to flag the matched symbol
        bool is_leaf;
        union {
            // if this node is non-leaf, specifies the child nodes
            struct {
                int childs_count;
                SIG_ARRAY_IDX child_nodes;
            };
            // if this node is a leaf, specifies which symbol is matched
            SIG_SYMBOL_IDX symbol;
        };
    };

    struct SIG_SYMBOL_REF {
        // if the offset of this symbol is stored relative to current location
        bool is_relative;
        // true if the reference is already in this signature file,
        // as a symbol, and false if not
        bool is_resolved;
        union {
            // if not resolved, we just hold the name or prototype
            SIG_STRING_IDX name;
            // if resolved, we hold the symbol index
            SIG_SYMBOL_IDX symbol;
        };
    };

    struct SIG_STRING {
        // size of the structure, in bytes
        int size;
        // string contents, variable size
        // items count is (size - 1)
        char content[1];
    };

    struct SIG_ARRAY {
        // size of the structure, in bytes
        int size;
        // array contents, variable size
        // items count is : ((size / sizeof(int)) - 1)
        int data[1];
    };

The signature file, like that of dcc, has several sections (they could be put in separate files, but I thought the current style is better). The sections are:
1. Symbols Section: Contains symbol information.
2. Signature Tree Section: Contains the nodes of the signature tree, which will be matched against executable code.
3. References Section: References made by functions are stored here; they can be variables or other functions.
4. Strings Section: All strings in the signature file are kept here, for performance purposes, to avoid variable-sized structures.
5. Indexes Section: All index arrays are kept here, again for performance purposes, to avoid variable-sized structures.
All indexes are calculated relative to their section start, and contain the byte offset.
The signature tree is composed of some bytes and some references, which are saved in separate nodes. For example:

    int len2x(char *str) {
        int i = strlen(str);
        return i*2;
    }

    int len3x(char *str) {
        int i = strlen(str);
        return i*3;
    }

are compiled into the following code:

Figure 5 - Functions Code-Bytes

If we have only these two functions in our signature file, the signature tree would look like this:

Figure 6 - Signature Tree

4.4 Symbols Demangling
In C, different functions can be distinguished by their names alone. In C++, however, different functions can have the same name and differ only in their arguments, so compilers need to differentiate them using both the name and the parameters. Mangling is the process of adding tokens for the parameters to the name of the function. There is no single standard for mangling, and each compiler has created its own mangling scheme. For example, “?kk@@YAXXZ” stands for “void __cdecl kk(void)”.
For a decompiler, the name, parameters, return value and calling convention of a function are a precious source of information, so we needed demangling support at least for the commonly used compilers. Fortunately, there are free implementations of the demangling schemes of popular compilers. We used the Wine demangler for the Microsoft compiler and the BFD demangler for gcc.

5. Terminology

5.1 Terms
"Original source code", as opposed to the decompiled output (which, although an output, is still referred to as "source" code). This is the original, usually high level source code that the program was written in.
"Input program", for the program that the decompiler reads.
Terms such as "source program" just create confusion.
"Decompiled output", which as noted above is a form of source code.
"Executable file" refers to a general class of files that could be decompiled. Here, the term includes both machine code programs, and programs compiled to a virtual machine form. The term "native" distinguishes machine code executables from others. Most people understand "machine code" as meaning an executable that is executed directly by the processor, and so means the same but is clearer than "native executable".

5.2 Acronyms
UQBT: University of Queensland Binary Translator. Boomerang is based in part on code from UQBT.
SSL: Semantic Specification Language. This is the language used in SSL files, which specify the semantics (meaning) of instructions.
IR: Internal Representation; a representation of the input program in a form that is convenient for the current analysis or transformation.
RTL: Register Transfer List (sometimes Register Transfer Language). The term Register Transfers actually comes from hardware design, where registers are arrays of single bit storage elements, but in software engineering has come to mean a style of program representation at the register and memory level. Every transfer (assignment) is explicit, including to flags registers.
DFA: Data Flow Analysis.
SSA: Static Single Assignment. A representation variation that makes certain kinds of DFA easier to perform.
CFG: Control Flow Graph. Nodes in the CFG are Basic Blocks, and edges represent possible control flow (execution paths that the program could take). For example, a basic block ending in a conditional branch would have two out-edges, one each for the case where the branch is taken and not taken.
BB: Basic Block. Usually a list of statements or RTL which are always executed together. A basic block is terminated by a conditional or unconditional branch or call, an indirect branch or call (including n-way branches or switch statements), return instructions, or labels (where other control flow enters). If the label is not explicit, a "fall through" basic block could terminate in an ordinary (non control flow altering) instruction.
TA: Type Analysis.
HLL: High Level Language; in a decompiler, typically the output language.
AST: Abstract Syntax Tree. This is an IR close to the HLL, typically at the statement level. For example, a node of the AST might be labeled as a pretested-while node, and children of that node could represent the loop conditional expression, and a block node representing statements in the loop. In a compiler, an AST typically results from parsing the input HLL program.

6. Project Discussions
Mohsen Hariri <m.hariri@gmail.com>
About Boomerang
4 messages

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: emmerik@users.sourceforge.net, quantumg@users.sourceforge.net
Fri, Jul 15, 2005 at 1:30 PM

Hi
As my undergraduate project, I want to participate in development of Boomerang project. I have read the stuff in the site and downloaded the current CVS codes, and will inspect them in a few days.
My skills/experiences are:
1. Programming in C/C++, Python, PHP, Assembly for about 7 years
2. Familiar with many disassembling/debugging tools from old turbo debugger to current IDA, w32dasm and OllyDBG
4. Familiar with Intel cpu architecture/instruction set
5. Familiar with Windows programming for about 5 years
6. Familiar with PE/COFF/OMF executable/object file formats
If there are any startup tips, or which part of the project is better to start reading, please tell me.
I hope that I can do something in this bright-future project.
Regards
Mohsen Hariri

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Cc: quantumg@users.sourceforge.net
Sun, Jul 17, 2005 at 4:39 PM

> Hi
>
> As my undergraduate project, I want to participate in
> development of Boomerang project.
Great!
> My skills/experiences are:
> 1. Programming in C/C++, Python, PHP, Assembly
> for about 7 years
> 2. Familiar with many disassembling/debugging tools from old
> turbo debugger to current IDA, w32dasm and OllyDBG
> 4. Familiar with Intel cpu architecture/instruction set
> 5. Familiar with Windows programming for about 5 years
> 6. Familiar with PE/COFF/OMF executable/object file formats
Your skills seem well suited to working on Boomerang, apart from your counting skills :-)
> If there are any startup tips, or which part of the project is
> better to start reading, please tell me.
The "still to be done" Boomerang page (http://boomerang.sourceforge.net/tobedone.html) has all the details on this. There are also some general ideas on the "information for students" page: http://boomerang.sourceforge.net/students.html . But these don't prioritise the things that need to be done, or consider them from the point of view of an undergraduate project.
For a moderately stand alone project, perhaps consider some solution to the detection of static library files. This problem is holding Boomerang back, although the -sf switch allows the information to be entered manually. You could perhaps just use the BinaryFile interface (include/BinaryFile.h) to read a binary file, and perhaps use small parts of the decoder to find call statements. From this, you could generate a list of addresses and static library signatures, suitable for including in a -sf file. That way, you would not have to make any changes to Boomerang at all, and not have to understand all that much about how it works in great detail.
Consider for example that three students recently worked on a PowerPC front end, and found it difficult to get it working to the extent even of allowing "hello world" to decompile. That was a one semester project; perhaps yours is a full year project. It can be frustrating trying to understand significant parts of a 60,000 line project, especially if parts of it (e.g. the front end) use unfamiliar tools such as the New Jersey Machine Code Toolkit. (These students took well over a week just to make the toolkit, for example.) So a front end would make a good whole year project, but I'd avoid it for a one semester project.

A Java front end might be interesting, so that Boomerang could be compared with Java decompilers, some of which are very good. (However, you may get bogged down in design matters, e.g. how to convey some kinds of type information from the front end to the intermediate representation.)

The structuring code (converting to loops and if/then/else, handling break and continue, etc) needs some work. The advantage of this is that the code is relatively clean, and is fairly well isolated from the rest of the decompiler. The main needs here are for loops (or at least while loops without the if/then around them), similarly changing the if statement around switch statements to become the default case, and short circuit conditionals (e.g. if (p1 || p2) { ... }).

If you are interested in parsers, the three or four parsers in Boomerang need to use something better supported than Coetmeur's Bison++. There is a Sourceforge project called Bisonc++ that sounds promising. The disadvantage of this project is that when completed there is no change visible from the outside; it's all "inner beauty".

A new back end would be interesting, and Boomerang only has the one. There may be some design issues, since Boomerang is moderately strongly tied to "C-like" output.
I would not attempt yet another GUI for Boomerang, unless you have really strong ideas about what is needed, and I'd recommend running it past us first. Similarly, implementing new analyses for Boomerang would probably be beyond an undergrad project. Also, it might clash with changes I may implement as part of my thesis research.

> I hope that I can do something in this bright-future project.

Well, there are several ideas in the above that would make interesting projects. Of course, in reading the code, you may come up with something completely different. For example, some Danish students recently finished a project on how "phi loops" can happen, and what to do about them. A fellow from Italy is looking into how Boomerang might be able to deal with malware. I've had some ideas recently on how to parallelise decompilation, but these could well be difficult to implement.

So consider all these ideas, and by all means discuss them with me if you like.

Good luck!

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Jul 31, 2005 at 10:11 AM

Hi

Sorry for the soo late reply and Thanks for your long and descriptive reply.

I think the static library detection would be a fine project. After a little digging around in boomerang sources, and taking a look how IDA does library function detection, I've come up to this:

I will make another "project" in the "solution" (I'm using the visual studio names, as I will be using that for development), which will contain these tools:

1. A program to make library signature files (something like IDA FLIRT signatures) from given libraries
2. An interface for the boomerang Win32BinaryFile to be able to use those signatures (We have to access the contents of the file, so I think this feature is very dependent on executable file format, maybe we can just use the library signature generation files independent of the executable format)

and one more thing.
I've read about microsoft compiler global optimizations, I read it in one of Matt Pietrek's MSDN mag issues (I did a little search but couldn't find it). I haven't seen what would happen to library functions when global optimization is activated but I remember Matt said "Global optimization makes functions exported from different files to be optimized by passing arguments by registers, or eliminating arguments.... " here is the MSDN link, it doesn't say anything about libraries:
http://winfx.msdn.microsoft.com/library/default.asp?url=/library/enus/dv_vccomp/html/d10630cc-b9cf-4e97-bde3-8d7ee79e9435.asp
Anyway I will test that.

Please tell me if I'm doing it the right way.

One more thing, when DLL functions are imported by number (Ordinal import), we can get their names if we have the DLL files. Is it something related to what I do?

bye
Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Jul 31, 2005 at 3:04 PM

> Hi
>
> Sorry for the soo late reply and Thanks for your long and descriptive reply.

No problem.

> I think the static library detection would be a fine project. After a little
> digging around in boomerang sources, and taking a look how IDA does library
> function detection, I've come up to this:
>
> I will make another "project" in the "solution" (I'm using the visual studio
> names, as I will be using that for development), which will contain these
> tools:
>
> 1. A program to make library signature files( something like IDA FLIRT
> signatures ) from given libraries

Good. The problem with this is finding the library files. I wonder if it might be possible to create signatures from executable files, where you can specify some start addresses and library names, e.g. from a symbols.h file (as used by the -sf switch). Of course you need some way of identifying those library functions, e.g. using IDA Pro or other tools, or figuring out what it does from a disassembly and guessing. Just a thought.

> 2. An interface for the boomerang Win32BinaryFile to be able to use those
> signatures (We have to access the contents of the file, so I think this
> feature is very dependent on executable file format, maybe we can
> just use the library signature generation files independent of the
> executable
> format)

Well, just make an interface; I don't think it should be accessed from Win32BinaryFile, as signatures can be generated for any architecture in any executable file format. You may only be able to generate signatures for Win32 binary files with Tool 1 above, but other tools can be written to generate signatures for other combinations, and those signatures should be applicable by tool 2. Functions are in a sense just byte streams; it should not matter what the architecture is or how they are encoded in the executable file.

> and one more thing. I've read about microsoft compiler global optimizations,
> ...
> Anyway I will test that.

Good idea, though I'd be very surprised if the library functions are accessed with register calling convention. I've certainly seen user programs that have such parameters.

> Please tell me if I'm doing it the right way.

You're on the right track, as far as I can tell.

> One more thing, when DLL functions are imported by number(Ordinal import),
> we can get their names if we have the DLL files. Is it something related
> to what I do?

Yes and no. No in the sense that this is not signature matching, so you don't need to include this as part of your project. Yes in the sense that this achieves the same purpose (i.e. identifying library functions), just for a different class of library functions (dynamically linked functions using ordinal numbers as opposed to statically linked functions).
They are similar in that each signature can have some sort of ID associated with it (perhaps a checksum, index in a special hash table, etc), and associated with that ID is a function signature (unfortunate clash of terminology here; I mean a function's name, names and types of parameters, and type of return value if any). So it would seem fairly easy to extend the identification of static library functions to those that are dynamically linked and accessed with ordinal numbers. I'll leave it up to you to decide whether to make this extension.

Good luck!

- Mike

Mohsen Hariri <m.hariri@gmail.com>
news
2 messages

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Aug 2, 2005 at 9:17 PM

Hi

Now I've read all I could find about IDA FLIRT, there was a very good document (http://www.datarescue.com/idabase/flirt.htm), it tells everything about it. I didn't know the "main" function detection is also a part of these signatures.

I found the article about cross file function calling optimizations (http://msdn.microsoft.com/msdnmag/issues/02/05/Hood/). It is called LTCG (Link Time Code Generation), not Global Optimization. It says at the end: "Next, the OBJ files produced when using LTCG aren't standard COFF format OBJs." and he says they are compiler version dependent, so we don't have to worry, the standard LIBs don't seem to be using those features, at least not in near future.

Now I want to discuss how should we do library detection. I want the method to have these features:

1. main function detection something like IDA, but it cannot be done automatically, so signature detections can be implemented in three parts: one to detect if the executable is using specified library, one to locate the main function, and one to detect functions locations. the first two parts are optional for each library.
2. parameter and return value type/name identification for library functions, we can also save parameter names and use them when a function is detected, I think it is done in IDS files for IDA, am I right?
3. Ability to use a LIB or OBJ file directly, without the need to go through those signature creation procedures, in case we want to.

So I will implement this feature in two parts:
1. A Detection Method
2. A Detection Input

Detection methods would be:

1. Normal Library Detection, which takes a LIB or OBJ file format. In this method we can also find variable names, for example we apply an object file using this method, and some functions from that object file are found in the executable, and those functions use some global variables, and we can flag type/name of those global variables also. But this method would be slow, as we would be checking whole functions against each function position in the executable. The input files can be in OMF/COFF format.
2. Signatured Library Detection, something like what IDA does, but first we have to decide on our signature files format.
3. Dynamic Library Detection, this method flags the function names along with parameters and return values. the inputs are DLL files, and they may be provided automatically if they exist in system folders when boomerang processes imports

But should library detection be applied after functions locations detection, I mean after first analysis of boomerang or before that?

Anyway, I think for the first step, I will implement normal library detection for COFF files, and see how it will be.

Is there any other way to contact you like using mirc or IMs or alike?

Bye
Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Aug 3, 2005 at 5:24 AM

> But should library detection be applied after functions locations detection,
> I mean after first analysis of boomerang or before that?

I think it should probably be done first up.
Signature detection, assuming it's pretty robust, should arguably (perhaps optionally) override directions from the user (e.g. in a symbols.h -sf file), since the user was possibly guessing, and the signature detection probably is right. In any case, you should probably do signature detection first if only to point out any conflicts.

> Is there any other way to contact you like using mirc or IMs or alike?

I'm not a great fan of chat/messaging, but I've been known to use Mozilla's messaging client. Actually, I now use Firefox, and it looks like I might have to hunt one down.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
progress
2 messages

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Aug 6, 2005 at 11:48 AM

Hi

Today I wrote a tiny program and tried to decompile it, and see the flow of boomerang, to understand its parts better.

I used:
J:\moh\decompiler\boomerang\debug>console -E 0x00411020 -sf J:\moh\decompiler\tiny\sig.h J:\moh\decompiler\tiny\Debug\tiny.exe

contents of the sig.h file:
0x411020 __cdecl int myfunc(void);

when boomerang decompiles this function, it doesn't care the return type I specify, and tells this function is a void function. why?

By the way, I've attached a diff file to prevent access violation when import table of an executable is empty, as my tiny program was.

now I understand what you meant by saying I have to generate something like a signature file (-sf). I think now I'm able to start some coding, I will start by reading a COFF object file and matching the whole functions against the executable, and generating a signature file, just for the start.

I want to name the library detection project "LIBID" (Library Identification). any opinions?

bye
Mohsen

Win32BinaryFile.diff (4K)

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Aug 7, 2005 at 1:40 AM

> when boomerang decompiles this function, it doesn't
> care the return type I specify, and tells this function
> is a void function. why?

Oops, thought I fixed that one. Basically, I made a fairly major change of how Boomerang detects parameters and returns, but it doesn't fit well with the -sf symbol file. So I have to make explicit exceptions to the logic for parameters, for returns (which you found are not working), for the function name, argument names, etc. I'll see if I can fix that in the next few days.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Re: diff
3 messages

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Aug 7, 2005 at 1:41 AM

> By the way, I've attached a diff file to prevent access violation
> when import table of an executable is empty, as my tiny program
> was.

Heh, thanks. I must never have had a tiny enough executable :-) Diffs like this are always welcome.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Aug 7, 2005 at 10:34 AM

seems the only opensource library I can find is BFD, for reading object files. but it does not support OMF (borland's format). tooo bad!!
do you know any other libs?
should I use this and hope in future we can use some other ways for OMF?

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Mon, Aug 8, 2005 at 5:55 AM

> seems the only opensource library I can find is BFD, for reading
> object files. but it does not support OMF (borland's format).
> tooo bad!!

Ick! I didn't realise that. What Borland compilers still use OMF? I'm sure the latest ones use PE, e.g. test/windows/switch_borland.exe . This page:
http://csharpcomputing.com/Tutorials/Lesson20.htm
seems to indicate that you can effectively use the MSVC linker to convert OMF to PE, but it's not clear if you can just convert and save the result; perhaps it just allows linking of OMF .obj files.

Ah, OMF files, I remember those.
I have some simple source code that parses part of them, still available in the file makedsig.c, in this archive:
http://www.itee.uq.edu.au/~cristina/dcc/distribution/makedsig.zip
It would be a lot of work to make a complete loader from this, though. The above code just scans a .lib file in OMF format, and pulls out some names, from memory.

> do you know any other libs?

You should look at cgen (http://sourceware.org/cgen/), though it might use BFD as the loader, and ISDL (http://www.princeton.edu/~mescal/spam/pubs/ISDL-TR.html), which is supposed to be able to generate an assembler from a spec, though I don't know if that includes the binary file writer (and of course you want a reader, but perhaps they come together.)

Apart from that, it's a bit dismal, I agree. Also, BFD has a GPL license I think, and I'm not sure if that was one of the reasons I rejected using it for Boomerang.

What about the BinaryFile part of Boomerang? It may be possible to add OMF support to that without too much effort.

> should I use this and hope in future we can use some other ways for OMF?

Is OMF really important? Searching the web, it seems really obsolete.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Re: last email
1 message

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Mon, Aug 8, 2005 at 6:00 AM

Mohsen, sorry with the last email, of course OMF is relevant, because people want to decompile old executables, and therefore they would want signatures for it. I forgot what you wanted to do with the files...

Anyway, that makes the makedsig.c file even more relevant. This is a signature generator for dcc that I wrote about 11 years ago (when OMF was still the dominant format). You might even get some ideas from the perfect hashing function stuff that I used, though there are probably other storage formats that are better these days (e.g. something based on trees, as per IDA).
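The idea of a tree-based signature store can be illustrated with a byte-prefix trie. The following is only a sketch under stated assumptions: a signature is taken to be the leading bytes of a function, with None marking bytes that may differ between links (for example relocated call targets). This layout and the class below are hypothetical and do not reproduce the actual formats used by dcc, IDA, or Boomerang.

```python
from typing import Dict, List, Optional, Sequence

# A pattern is a sequence of byte values; None is a wildcard byte.
# This representation is an assumption made for the illustration.
Pattern = Sequence[Optional[int]]


class SignatureTrie:
    """Trie keyed on successive function bytes, with wildcard support."""

    def __init__(self) -> None:
        self.children: Dict[Optional[int], "SignatureTrie"] = {}
        self.names: List[str] = []  # names of patterns that end at this node

    def add(self, pattern: Pattern, name: str) -> None:
        """Store a pattern, sharing common prefixes with earlier ones."""
        node = self
        for byte in pattern:
            node = node.children.setdefault(byte, SignatureTrie())
        node.names.append(name)

    def match(self, code: bytes, offset: int = 0) -> List[str]:
        """Return the names of all patterns fully matched at code[offset:]."""
        found = list(self.names)
        if offset < len(code):
            # Try the exact byte first, then the wildcard branch.
            for key in (code[offset], None):
                child = self.children.get(key)
                if child is not None:
                    found.extend(child.match(code, offset + 1))
        return found


if __name__ == "__main__":
    trie = SignatureTrie()
    # Illustrative x86 prologue followed by a call with a wildcarded target.
    trie.add([0x55, 0x8B, 0xEC, 0xE8, None, None, None, None], "calls_helper")
    print(trie.match(bytes([0x55, 0x8B, 0xEC, 0xE8, 1, 2, 3, 4, 0xC3])))
```

Because signatures that share a prefix share a path, one walk over the bytes at a candidate address tests every stored signature at once, rather than comparing against each signature in turn.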
- Mike

Mohsen Hariri <m.hariri@gmail.com>
compiling libbfd
2 messages

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Aug 12, 2005 at 3:53 PM

Hi

I'm just stuck compiling libbfd as a dll under cygwin. I'm not familiar with cygwin, actually it's my first time compiling something under cygwin.
I run these commands under bfd folder of binutils:

./configure --enable-shared=yes
checking build system type... i686-pc-cygwin
checking host system type... i686-pc-cygwin
checking target system type... i686-pc-cygwin
checking for gcc... gcc
checking for C compiler default output file name... a.exe
[...]
checking if libtool supports shared libraries... yes
checking if package supports dlls... no
checking whether to build shared libraries... no [WHY????]
checking whether to build static libraries... yes
creating libtool
[...]

I tried anything I could, searching internet for hints, changing the configure script, or trying to make a dll out of statically compiled libbfd.a (actually it works, but the dll has no exported function, even with gcc's --export-all parameter), all unsuccessful.

maybe you have more experience with cygwin. anyway, if I'm making any stupid mistake please inform me.

thanks in advance
bye

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Aug 13, 2005 at 12:59 AM

> Hi
>
> I'm just stuck compiling libbfd as a dll under cygwin. I'm not familiar with
> cygwin, actully it's my first time compiling something under cygwin.
> I run these commands under bfd folder of binutils:
>
> ./configure --enable-shared=yes

Um, I've never used that method. I believe that it requires the Makefile to pick up some define (e.g. in include/config.h) and make the appropriate things happen.

What I'd do is change the link command to include "-shared". This is the Linux linker's command to generate a shared object, which for Cygwin will generate a .dll file. (You may have to rename the output with the dll name).
That's what we do in Boomerang. That part of the Makefile was sorted out by some people more expert at these sorts of things than me. Here is the command that generates lib/libWin32BinaryFile.dll (from the loader/ directory) under Cygwin:

g++ -Wall -g -O0 -o ../lib/libWin32BinaryFile.dll -shared Win32BinaryFile.o microX86dis.o -lBinaryFile -L../lib

Perhaps this will help.

> [...]
> checking if libtool supports shared libraries... yes
> checking if package supports dlls... no
> checking whether to build shared libraries... no [WHY????]

Sorry, no idea. I'm really lame with configure scripts. You could try to read the actual script (it's just a shell script) to try to understand the logic, and maybe insert some echo commands or the like to debug it. I remember doing that for a while before those experts came along and fixed that part of Boomerang.

> maybe you have more experience with cygwin. anyway, if I'm making
> any stupid mistake please inform me.

I wouldn't say a stupid mistake, but I think that the approach of fiddling with the configure script will surely be less successful than changing the Makefile or modifying the final link command by hand.

Good luck!

- Mike

Mohsen Hariri <m.hariri@gmail.com>
libid
4 messages

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Aug 21, 2005 at 1:01 PM

Hi,

In this week I have done the followings:

1. at last bfd was compiled under cygwin, I used mingw target under cygwin to make windows native DLL (the cygwin target just hung when I tried to load the compiled dll). I made some test projects and now I'm ready to make library signatures.
2. I thought it's better to mention the already done parts of the project in my undergraduate thesis, and write 'how it works', so I read UQBT and some other documents mostly written by 'Cristina Cifuentes', they were excellent.
Now I'm much more familiar with the Boomerang and UQBT, I think I know all the parts and their functionalities, at least at an abstract level. I didn't know how BIG the project is.
3. I saw dcc sources, they were simple and I could understand them easily. I'll take a deeper look in a few days as I start my coding. (those files were from 1993... at that time I was an 11 year old boy while you were looking for signatures :p) Seems your university is very active on this subject.
4. I tried to write how to implement the LibId, and attached the document I wrote.

I have some Q's about (Win32)BinaryFile (they are in attached document too):

1. dlprocptrs: is it the place to hold dynamic library functions (as it's mentioned in its comments)? so why the "main" symbol is added and searched for in this map? if it's general purpose, can I use it for variables also? or I have to use boomerang::symbols? (the ones I can extract from object and library files). anyway what's the difference?
2. when a signature detector faces a callback function, should it add that function to entry points to be decompiled? (boomerang::entrypoints)
3. dynamically imported functions, are loaded in IAT itself. so I will use BinaryFile::AddSymbol on the dynamically linked function address container variable, not on function address. is it ok? (In Win32BinaryFile::findJumps, I guess you have done this)

By the way, some mistypings in http://boomerang.sourceforge.net/terminology.html page:

Acronyms: CFG : ... one each for for the case where ...
Acronyms: BB : ... A basic block is terminated by a conditional or undonditional branch ...
Acronyms: AST : ... might be labelled as ...

not important, I know ;)

libid.txt (2K)

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Aug 26, 2005 at 5:36 PM

did you receive my mail? please inform me if you got this.
libid.txt (2K)

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Aug 27, 2005 at 1:42 AM

On Fri, 26 Aug 2005, Mohsen Hariri wrote:
> did you receive my mail? please inform me if you got this.

Oh, sorry, I did, but the whole family has had the 'flu and things have been a bit crazy. I'll reply next.

- Mike

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Aug 27, 2005 at 2:12 AM

On Sun, 21 Aug 2005, Mohsen Hariri wrote:
> Hi,
>
> In this week I have done the followings:
> 1. at last bfd was compiled under cygwin, I used mingw target under
> cygwin to make windows native DLL

A good trick to remember.

> , so I read UQBT and some
> other documents mostly written by 'Cristina Cifuentes', they were excellent.

Yes, she's a good writer.

> Now I'm much more familiar with the Boomerang and UQBT, I think I know
> all the parts and their functionalities, at least at an abstract level. I
> didn't know how BIG the project is.

Yes, and it seems to keep getting bigger.

> 3. I saw dcc sources, they were simple and I could understand them
> easily.

Wow, I found them quite tricky myself :-)

> Seems your university is very active on this subject.

Well, it all stems from Prof. John Gough (a compilers person, recently retired but still writing books) wondering about decompilers back about 1990. From there came Cristina's thesis, published in 1994, and from that many projects, including UQBT and Boomerang, and Boomerang seems to have started several other decompilers (Anakrino, exetoc, perhaps two in Japan). REC was influenced by Cristina's work, and even the Flirt part of IDA Pro was influenced by dcc's signatures. Not a bad chain of events! But UQ hasn't been directly involved in the last 5 years apart from my own thesis. Hopefully when I publish in the next year or so, there might be some more influencing.
> I have some Q's about (Win32)BinaryFile (they are in attached document too):
> 1. dlprocptrs: is it the place to hold dynamic library functions(as it's
> mentioned in its comments)?

Be aware that this code was thrown together by the other main author (QuantumG). So I'm not as familiar with this loader as with the others. From memory, this map holds one entry for each entry in the import address table (IAT). I don't recall if it has entries for the export table. When gathering symbols, you would be interested in exports, not imports.

> so why the "main" symbol is added and searched for in this
> map?

We have this idea of attempting to find an entry point for everything that gets decompiled. As you can see, it doesn't always work out. It's OK when the program being decompiled is an executable. Then we are just adding a sort of synthetic import.

> if it's general purpose, can I use it for variables also? or I have to use
> boomerang::symbols?

Ah, Boomerang::symbols (also QuantumG code) are the symbols loaded from the symbols.h file(s) (via the -sf switch; see the web page on that switch for an overview). I think that these symbols may be suitable; somehow, when addresses are used and they match with the entries in Boomerang::symbols, they become associated with the name from the symbol (e.g. when a new procedure is made, that name would be used instead of proc99).

As to whether to use an existing data structure or not, that's a design decision you have to make. I can see points in favour of either approach; for example, it would be nice to keep the signatures code fairly separate, perhaps mostly in one class. There could be thousands of signature symbols, and they may need special data structures because of their number. For starters, though, I'd consider trying Boomerang::symbols first. Perhaps the -sf logic and the signatures could be combined, either by design or after implementation.

> 2. when a signature detector faces a callback function, should it add that
> function to entry points to be decompiled?(boomerang::entrypoints)

That sounds reasonable, especially if you put the signatures into the Boomerang::symbols map. I think that entrypoints comes from -e and -E switches.

> 3. dynamically imported functions, are loaded in IAT itself. so I will use
> BinaryFile::AddSymbol on the dynamically linked function
> address container variable, not on function address. is it ok?
> (In Win32BinaryFile::findJumps, I guess you have done this)

Wow, that is my code, but written a long time ago. It looks like a pretty nasty hack. I guess I was irritated by all the one line procedures that result from jumps to IAT entries (where the real symbols are). Any way of making this a little more general would be good. I believe it's possible for some calls to use the jump tables, with others using the IAT entry directly. I don't recall why they do this.

> By the way, some mistypings in
> http://boomerang.sourceforge.net/terminology.html page:
>
> Acronyms: CFG : ... one each for for the case where ...
> Acronyms: BB : ... A basic block is terminated by a conditional or
> undonditional
> branch ...
> Acronyms: AST : ... might be labelled as ...
>
> not important, I know ;)

Ah, but it is. I'll try and get to those this weekend. Thanks!

- Mike

Mohsen Hariri <m.hariri@gmail.com>
another thing
2 messages

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 3, 2005 at 9:54 AM

binaryfile.h needs this:

    #ifndef _WIN32
    #include <dlfcn.h>
    #else
    #include <windows.h>   // WinSock.h
    #endif

// include before types.h: name collision of NO_ADDRESS
and for the HINSTANCE you are using, like binaryfilefactory.cpp.

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 3, 2005 at 3:49 PM

Except that windows.h has a name collision with something in the objective C code as well.
So now it's a void* and I cast it (ugly, but there are only two places it's used anyway).

- Mike

Mohsen Hariri <m.hariri@gmail.com>
compiling problem
10 messages

Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 7, 2005 at 10:18 AM

hi

Still my boomerang doesn't compile. There are errors while compiling prog.cpp in ansi-c-parser.h ( I wonder why this header works when it comes to compiling ansi-c-parser.cpp or ansi-c-parser.cpp ). Here are the errors :

Compiling...
prog.cpp
e:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\PlatformSDK\Include\WinSock.h(691) : warning C4005: 'NO_ADDRESS' : macro redefinition
j:\moh\decompiler\boomerang\include\types.h(22) : see previous definition of 'NO_ADDRESS'
/home/38/binary/u1.luna.tools/lib/bison.h(363) : error C2143: syntax error : missing '}' before '='
/home/38/binary/u1.luna.tools/lib/bison.h(363) : error C2059: syntax error : '='
/home/38/binary/u1.luna.tools/lib/bison.h(186) : error C2143: syntax error : missing ';' before '}'
/home/38/binary/u1.luna.tools/lib/bison.h(186) : error C2238: unexpected token(s) preceding ';'
...

and after commenting out those #line directives:

Compiling...
prog.cpp
e:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\PlatformSDK\Include\WinSock.h(691) : warning C4005: 'NO_ADDRESS' : macro redefinition
j:\moh\decompiler\boomerang\include\types.h(22) : see previous definition of 'NO_ADDRESS'
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(487) : error C2143: syntax error : missing '}' before '='
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(487) : error C2059: syntax error : '='
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(553) : error C2143: syntax error : missing ';' before '}'
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(553) : error C2238: unexpected token(s) preceding ';'
j:\moh\decompiler\boomerang\c\ansi-c-parser.h(555) : error C2059: syntax error : 'public'
...
It complains about CDECL definition in the enum.( again I wonder why this header works when it comes to compiling ansi-c-parser.cpp or ansi-c-parser.cpp ) by the way, I've attached a patch for the project to compile when 'max' function was defined somewhere(As on my system). dfa.cpp.patch 1K Download Mike van Emmerik <emmerik@itee.uq.edu.au> To: Mohsen Hariri <m.hariri@gmail.com> Wed, Sep 7, 2005 at 10:31 AM > hi > > Still my boomerang doesn't compile. There are errors while > compiling prog.cpp in ansi-c-parser.h ( I wonder why this > header works when it comes to compiling ansi-c-parser.cpp > or ansi-c-parser.cpp ). It's an ordering issue; sometimes it's just not a good idea to #include "windows.h" (which includes the whole universe and makes name collisions much more likely). I'm feeling a little guilty here; I suspect that I #included windows.h somewhere, noted problems with Windows compilation, removed it, and then forgot to check in all the changes. It will have to wait till I get home, sorry. > by the way, I've attached a patch for the project to compile > when 'max' function was defined somewhere(As on my system). Good; thanks. What compiler are you using exactly? - Mike Mohsen Hariri <m.hariri@gmail.com> To: Mike van Emmerik <emmerik@itee.uq.edu.au> Wed, Sep 7, 2005 at 10:40 AM Microsoft Visual C++ from visual studio 2003, no PSDK installed. I guess you use this version too, right? donno why my compiler complains too much... ;) [Quoted text hidden] Mohsen Hariri <m.hariri@gmail.com> To: Mike van Emmerik <emmerik@itee.uq.edu.au> don't worry about compilation, I can test my codes without compiling the whole project(the project took too long to compile, so I made a test platform) currently I work only 6-7 hours a week on this project, but from next week I will have more time to put on this project. hope I can get something done in next week. one more thing, about c++ mangled names. do you know any library I can use to demangle them? 
cause it seems there is no standard for it, and every compiler does what it likes. am I wrong?

Wed, Sep 7, 2005 at 11:16 AM

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 7, 2005 at 11:23 AM

> one more thing, about c++ mangled names. do you know
> any library I can use to demangle them? cause it seems
> there is no standard for it, and every compiler does what
> it likes. am I wrong?

They do all do whatever they want. However, I think that the latest gcc scheme is somewhat standardised now. The c++filt program demangles; I have not had success finding a tool other than the compiler (which is NOT convenient) for the mangling process. You could do worse than exec'ing this tool, or finding the source code for it (part of binutils, I would guess).

- Mike

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 7, 2005 at 11:26 AM

On Wed, 7 Sep 2005, Mohsen Hariri wrote:

> Microsoft Visual C++ from visual studio 2003,

Yep, same as mine. Though on the machine at work, I have the 2002 Microsoft Development Environment, version 7.0.9466. It seems to be a very early .NET version, and it won't read Boomerang's solution files. I'm thinking of trying to redo the solution files on that compiler, so it should be compatible with all .NET versions.

> no PSDK installed. I guess you use this version
> too, right? dunno why my compiler complains too
> much... ;)

Just lucky, I guess :-)

- Mike

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Thu, Sep 8, 2005 at 4:12 PM

On Wed, 7 Sep 2005, Mohsen Hariri wrote:

> hi
>
> Still my boomerang doesn't compile. There are errors while
> compiling prog.cpp in ansi-c-parser.h ( I wonder why this
> header works when it comes to compiling ansi-c-parser.cpp
> or ansi-c-parser.cpp ). Here are the errors :
> Compiling...
> prog.cpp
> e:\Program Files\Microsoft Visual Studio .NET
> 2003\Vc7\PlatformSDK\Include\WinSock.h(691) : warning C4005: 'NO_ADDRESS' :
> macro redefinition

Presumably this means you have windows.h included. Actually, I think this should not be included for anything other than windows.cpp.

Where does your garbage collector come from? Is it the one that we have checked in, or did you compile it yourself or something? Just that gc.h will include windows.h if GC_WIN32_THREADS is set. A bit of a long shot, I will admit.

I can't find any problem; it compiles fine for me with the latest updates. And I've checked again to make sure that every change has been checked in. Are you sure you have the latest CVS now? In particular, there should be no #include of windows.h in include/BinaryFile.h, except in comments.

> j:\moh\decompiler\boomerang\include\types.h(22) : see previous definition of
> 'NO_ADDRESS'
> /home/38/binary/u1.luna.tools/lib/bison.h(363) : error C2143: syntax error :
> ...
> and after commenting out those #line directives:
>
> ...
>
> It complains about CDECL definition in the enum.( again I wonder why this
> header works when it comes to compiling ansi-c-parser.cpp
> or ansi-c-parser.cpp )

I'd say because something before the #include for ansi-c-parser.h in db/prog.cpp is #including windows.h. I think you'll have to comment them out one by one until the error goes away, then decide why windows.h is getting included, and try to prevent this.

You could try #undef NO_ADDRESS and #undef CDECL before the #include of ansi-c-parser.h in db/prog.cpp, to make sure that this is what the problem is. Maybe it's not such a bad permanent solution, either, if it's enough to fix the problem.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>

It compiles now. Thanks for the help. I've attached a changed version of Boomerang, with the library detection codes.
It compiles only on Windows currently (it needs some minor changes, like replacing #pragma once, not a structural change). My codes are in the symbols directory.

Some info on implementation:

1. I've interfaced the libid with the prog class. It will currently use addProc and BinaryFile::AddSymbol to add its matched signatures.

2. I have implemented just two matching classes currently:
   BfdObjMatcher: Can match any object file supported by BFD, so this is platform independent.
   BfdArchMatcher: Can match any archive file supported by BFD, again platform independent.
   These modules use original obj and lib files, and I will implement generated signature matching next.

Sat, Sep 10, 2005 at 12:46 PM

3. Detection steps are like this:
   a. every function in the object files is searched for in the binary code sections, after the relocatable and non-resolved bytes (you called them "wild bytes" in dcc, so I used that term) are masked out. (Implementation of matching is in the BytePattern class.)
   b. when a match occurs, the function is created using Prog::newProc. every symbol referenced by this function is also added to the symbols using BinaryFile::AddSymbol. I have to demangle names here, but for the time being I just remove the first '_' (underscore) from the names. I will use wine's demangling codes for MSVC next.

4. The changes I've made to other parts of the project:
   a. In prog.cpp & prog.h, I've added this method:

      // Search for library signatures from sig_file and match them
      // 'hint' is used to force usage of specified signature matching
      // module
      void MatchSignatures(const char * sig_file, const char * hint);

      and also:

      FrontEnd * getFrontEnd();

      ( I know you say why, I needed that in boomerang to separate loading and decompiling of the binary, unless you suggest a better way )
   b.
in boomerang I added these commands:

      loadbinary : just load the binary, no decoding
      decode : (without parameter) decode the loaded binary
      matchsym : matches the specified symbol container against the loaded binary

      and added these methods:

      Prog *Boomerang::load(const char *fname)
      void Boomerang::decode(Prog *prog, const char *pname)

      just to separate loading and decoding phases.
   c. in Win32BinaryFile, I've commented out main function detection. Because now when all libraries are detected (specifically the ones that have startup functions), the main function is automatically found, so no need to find it that way. Am I right?
   d. some minor changes to the msvc project files, to help it find bfd.lib or libid.lib

some more notes:

1. BinaryFile.h contains implementation for some methods, so when it is included in multiple places, the linker will complain about duplicate definitions. I think we have to move those implementations to BinaryFile.cpp

2. when a library proc is detected, while adding its referenced symbols, if we see the symbol is a function, we can newProc again. But we have to check if it's already created or not.

3. we can find out library function parameters (after I implement demangling). do you think the parameters are useful here? I think this is a duplicate for what you have implemented in header file parsing. I mean when we have a library, we have its header file probably. so just the names are got from the library, and parameters and calling convention are got from the header file. but this won't work for the dll files which have mangled export tables.
In this case we have to generate the header file based on the dll. (cause when we have a dll, we cannot assume that we have its header file too)

Anyway, if we don't want to ignore parameters extracted from library files, we have to implement a method like this:

   ParseAndAddProc(ADDRESS address, char *proc_proto)

that can be called like this:

   ParseAndAddProc(0x123123, "int myproc(int *p1, void *p2)")

but then we have some hard times when we have something like this:

   ParseAndAddProc(0x123123, "MyObj myproc(MyObj2 *obj)")

here we don't have any idea about classes, we just have their names. I suggest we stick to the first method (header files), and add the demangled proto just in the function comments.

my todo:

1. add demangling codes
2. add signature generation & detection
   I want to use the same schema for signatures too (I mean I will preserve the referenced symbols with the function signature). What do you think?
3. add dll ordinal import function resolving (some dlls exports are mangled c++ names, we can decode them too)
4. change codes to compile on other compilers

You can test my codes using the test file in /test/symbols/test_sig directory. This executable only uses libc (I've included libc too). you just run boomerang -k and issue these commands:

   loadbinary <path>test_sig.exe
   matchsym <path>libc.lib

the main function is detected, but I see the parameters are incorrect.

that's all. I guess this was the longest email I've ever written :p

PS. the attached file should be renamed to "zip"

Waiting for your suggestions
Mohsen

boomerang_libid.zipx 4693K

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Mon, Sep 12, 2005 at 4:20 AM

> FrontEnd * getFrontEnd();
> ( I know you say why, I needed that in boomerang to seperate
> loading and decompiling of the binary, unless you suggest a better
> way )

Without examining the code in detail, that sounds fine.

> b.
in boomerang i added these commands:
>
> loadbinary : just load the binary, no decoding
> decode : (without parameter) decode the loaded binary
> matchsym : matches the specified symbol container
> against the loaded binary
>
> and added these methods:
>
> Prog *Boomerang::load(const char *fname)
> void Boomerang::decode(Prog *prog, const char *pname)
>
> just to seperate loading and decoding phases.

Cool.

> c. in Win32BinaryFile, I've commented out main function
> detection. Because now when all libraries are
> detected(specifically the ones that have startup
> functions), the main function is automatically
> found, so no need to find it that way.
> Am i right?

Well, it depends on the details. When you say "the main function is automatically found", do you mean that you already have special high level pattern matching for main? (A la what I did in dcc; I needed special patterns for finding main, and they were somewhat compiler specific). It might make sense to implement some of the logic in Win32BinaryFile.

> some more notes:
>
> 1. BinaryFile.h contains implementation for some methods,
> so when it is included in multiple places, the linker
> will complain about duplicate definitions. I think we have
> to move those implementations to BinaryFile.cpp

Compilers are supposed to deal with this. Gcc has a "linkonce" tag for these sorts of functions. The idea is that these are great candidates for inlining; if you never define functions in .h files, you'll probably never get inlining (unless your compiler has a special pass between compiling and linking, or does the inlining during linking, both of which seem uncommon).

Are you getting errors from MSVC with this, or are you just assuming that this has to be trouble but have not seen it yet?

> 2. when a library proc is detected, while adding its referenced
> symbols, if we see the symbol is a function, we can
> newProc again. But we have to check if it's already created
> or not.
I'm sure there is a function that already does that.

> 3. we can find out library function parameters(after I implement
> demangling). do you think the parameters are useful here? I think this is
> a duplicate for what you have implemented in header file parsing. I mean
> when we have a library, we have its header file probably. so just the
> names are got from the library, and parameters and calling convention are
> got from the header file.

Presumably, both should get you the same answer, and I don't know what to do if they don't match. Perhaps just an error message. So I think for now what you have done is fine.

> but this won't work for the dll files which
> have mangled export tables.

Are these called by ordinal then? Otherwise, how can the caller know what to call?

> In this case we have to generate the header
> file based on the dll.(cause when we have a dll, we cannot assume that we
> have its header file too)

Yes.

>
> Anyway, if we don't want to ignore parameters extracted from library files,
> we have to implement a method like this:
> ParseAndAddProc(ADDRESS address, char *proc_proto)
>
> that can be called like this:
> ParseAndAddProc(0x123123, "int myproc(int *p1, void *p2)")
>
> but then we have some hard times when we have something
> like this:
> ParseAndAddProc(0x123123, "MyObj myproc(MyObj2 *obj)")
>
> here we don't have any idea about classes, we just have their names.

Well, just having the name is useful for a decompiler (unless the name is obfuscated). The compiler needs to know the size of elements of the class, so that it can calculate how to access the member variables, how to cast from one class to another (with multiple inheritance, there is sometimes a correction needed to the "this" pointer). It would be great to have the names and types of all the member variables, and the names and parameters of all the methods, but as you say this is typically not available.
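As an illustration of the "this" pointer correction mentioned in the reply above, the following is a minimal sketch and is not part of the original correspondence or of Boomerang. The classes Base1, Base2 and Derived and the helper functions are hypothetical names chosen for the example. It shows that converting a pointer to a derived object into a pointer to its second base class changes the address, which is the kind of adjustment a decompiler may see in the binary without knowing the class layout.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classes, used only to illustrate the adjustment.
struct Base1 { int a = 1; virtual ~Base1() = default; };
struct Base2 { int b = 2; virtual ~Base2() = default; };
struct Derived : Base1, Base2 { int c = 3; };

// Byte distance between the start of the Derived object and its
// Base2 subobject. The implicit upcast is where the compiler emits
// the pointer correction.
std::ptrdiff_t base2Offset(const Derived& d) {
    const Base2* asBase2 = &d;
    return reinterpret_cast<const char*>(asBase2) -
           reinterpret_cast<const char*>(&d);
}

// Casting back down must undo the same correction.
bool roundTrips(Derived& d) {
    Base2* asBase2 = &d;
    return static_cast<Derived*>(asBase2) == &d;
}

// Small self-check; on mainstream ABIs the Base2 subobject is placed
// after Base1, so the offset is expected to be non-zero.
void checkAdjustment() {
    Derived d;
    assert(roundTrips(d));
    assert(base2Offset(d) != 0);
}
```

Calling checkAdjustment() from a test program exercises both helpers. The exact offset depends on the compiler and target, which is one reason the class layout is hard to recover from a binary alone.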
> I suggest we stick to the first method(header files), and add the
> demangled proto just in the function comments.

Sounds fine to start with.

> my todo:
>
> 1. add demangling codes
>
> 2. add signature generation & detection
> I want to use the same schema for
> signatures too(I mean I will preserve
> the referenced symbols with the function
> signature). What do you think?

Sorry, I don't really understand the question.

> You can test my codes using the test file in /test/symbols/test_sig
> directory. This executable only uses libc(I've included libc too).
> you just run boomerang -k and issue these commands:
> loadbinary <path>test_sig.exe
> matchsym <path>libc.lib
>
> the main function is detected, but I see the parameters are incorrect.

OK, I'll try and find some time to try this out.

By the way, I see that you have included source code in the zip file; is it your intention to donate this code to the Boomerang project? I'd be happy to use it, I think it will fill a big need. But it's your code, and Boomerang isn't GPL'd, so you don't have to contribute your changes.

If it's a bit unstable, as long as it compiles on all platforms, it can be disabled unless a runtime switch is used. Then when it's ready for real world use, we just take out the command line switch.

Good work!

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Mon, Sep 12, 2005 at 11:13 PM

> > c. in Win32BinaryFile, I've commented out main function
> > detection. Because now when all libraries are
> > detected(specifically the ones that have startup
> > functions), the main function is automatically
> > found, so no need to find it that way.
> > Am i right?
>
> Well, it depends on the details. When you say "the main function is
> automatically found", do you mean that you already have special high level
> pattern matching for main?
(A la what I did in dcc; I needed special
> patterns for finding main, and they were somewhat compiler specific). It
> might make sense to implement some of the logic in Win32BinaryFile.

look at these:
http://msdn.microsoft.com/msdnmag/issues/01/01/hood/
http://www.codeguru.com/article.php/c6945__1/

But for a short summary:
In MSVC for example, a function is executed which will then call the program's main function. The name of this function is hardcoded and depends on the linker /SUBSYSTEM and /ENTRY arguments. For example when the parameter is /SUBSYSTEM:CONSOLE and no /ENTRY is used, the linker looks for "mainCRTStartup" whose address will be put in the PE header as the PE's entry point. (I'm sure you have seen GetVersionInfo in all MSVC programs a little after EP, it is in mainCRTStartup. That's why all MSVC programs are dynamically linked with kernel32.dll, and that's why you never had an empty IAT table)

sources for CRT startup codes can be found in here:
Microsoft Visual Studio .NET 2003\Vc7\crt\src

mainCRTStartup resides in crt0.c. crt0.c is compiled into libc.lib, so if we check signatures of any MSVC executable against the correct version of libc.lib, we will find mainCRTStartup (or a similar function for other SUBSYSTEMs). then using the references of that function, we can locate the main function.

I guess you didn't understand what I mean by preserving the references of a function:

> > my todo:
> >
> > 1. add demangling codes
> >
> > 2. add signature generation & detection
> > I want to use the same schema for
> > signatures too(I mean I will preserve
> > the referenced symbols with the function
> > signature). What do you think?
>
> Sorry, I don't really understand the question.

A library function has this information that can be retrieved:
1. function name and prototype
2.
the references of that function

for example, in a library if we have:

   int atoi(char *x){
      return (int)atol(x);
   }

and we locate atoi, then we know atoi references atol, so we will find atol without the need for matching signatures. that is already done in libid codes. you see these outputs from the matchsym command:

...
Matched: mainCRTStartup
...
Symbol added: 4086fc -> __argv
Symbol Reference: __argv
Symbol added: 4086f8 -> __argc
Symbol Reference: __argc
Symbol added: 400ffc -> main
Symbol Reference: main
Symbol added: 401d8a -> exit
Symbol Reference: exit
....

here we have detected mainCRTStartup by matching signatures, and then using its references we have detected the location of the main function and some other symbols.

dcc and IDAPro signature files just hold the function signatures, and throw away their references, but references seem useful. I want to preserve them in libid signature files.

> Compilers are supposed to deal with this. Gcc has a "linkonce" tag for
> these sorts of functions. The idea is that these are great candidates for
> inlining; if you never define functions in .h files, you'll probably never
> get inlining (unless your compiler has a special pass between compiling and
> linking, or does the inlining during linking, both of which seem uncommon).

I didn't know that.

> Are you getting errors from MSVC with this, or are you just assuming that
> this has to be trouble but have not seen it yet?

I just get warnings, so there is no problem here.

> > but this won't work for the dll files which
> > have mangled export tables.
>
> Are these called by ordinal then? Otherwise, how can the caller know what
> to call?

They are imported by names. I had some doubts, so I created a project. I've attached my project, if you'd like to take a look.

> By the way, I see that you have included source code in the zip file;
> is it your intention to donate this code to the Boomerang project?

YES. This is not the first open-source project I'm participating in.
I've already done some coding in IBS (http://ibs.sf.net), which is a product of our company. The difference is while writing IBS, I just had fun when I was with my friends talking and working on the project, and thinking about the future of the project. But I guess participating in Boomerang is what I have been made for. ;) I'm not working on it just for my undergraduate project, I like it very much, even when no one is here to chat about it or there will be no money making in the future from it.

- Mohsen

mangleddll.zipx 99K

Mohsen Hariri <m.hariri@gmail.com>
Re: Boomarang, symbols, and licenses
2 messages

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 14, 2005 at 11:01 AM

>> By the way, I see that you have included source code in the zip file;
>> is it your intention to donate this code to the Boomerang project?
>
> YES. This is not the first open-source project I'm participating in.
> ...
> But I guess participating in
> Boomerang is what I have been made for. ;) I'm not working on it just for my
> undergraduate project, I like it very much, even when no one
> is here to chat about it or there will be no money making in the future from it.

I love your enthusiasm!

But there is a problem that needs working through. Your code apparently depends on BFD. Presumably, you could rewrite it to not use BFD, but what a pain. I looked at using the opcodes library at one stage; it's part of binutils.

The problem is that BFD is GPL'd. Nothing wrong with the GPL, as far as I am concerned, except that it is my understanding that the GPL is not compatible with BSD-like licences, such as large parts of Boomerang are released under.

Perhaps I could explain a bit of Boomerang's history. It is derived in large part from UQBT, the University of Queensland Binary Translator. This was subsidised in part by Sun Microsystems; it paid my salary for at least a year.
As a result, Sun has a say over the disposition of code written during that time. It was a long and painful process getting the Sun lawyers to agree to release the UQBT code under a BSD-like license. So large parts of Boomerang have to have comments at the top indicating that Sun part-owns the copyright for those files, and they can only be used in conjunction with the license. Basically, it means that the headers have to be preserved, that you can't use Sun's name to promote your product, and things like that. Essentially, the code can be used for any purpose, including commercial purposes, and you don't have to contribute changes.

The GPL is a similar license, except that you DO have to contribute changes. It makes no sense to me that these are incompatible; it seems to me that putting Sun comments (where they already exist) and GNU headers on the files, and distributing a copy of the GPL with the code, would satisfy all the requirements. However, I've been told that this is not the case, and that making Boomerang GPL is not feasible.

I'll look into this in more detail, and if you happen to have any expertise in this area, I'd love to hear about it. Surely there is some way that this issue can be resolved. But until then, I don't think it's a good idea to check in your code. Perhaps checking it into a CVS branch while this is sorted out would be OK.

It would be good to have this sorted out, so that other GPL'd pieces of code can be used in Boomerang. If it loses the ability to be used with commercial proprietary products, I don't think that this will be a great loss. (Initially, we thought this might be quite important, e.g. someone like Data Rescue (producers of IDA Pro) might want to use Boomerang code to enhance an existing proprietary product).

Let's hope that this is sorted soon.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>

We are using BFD as a dynamically linked library.
(there was some exception for dynamically linked libraries in the LGPL licence, but BFD is GPLed) Can it help?

Wed, Sep 14, 2005 at 12:14 PM

Mohsen Hariri <m.hariri@gmail.com>
Re: Boomerang and GPL code (fwd)
2 messages

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Wed, Sep 14, 2005 at 11:50 AM

Mohsen,

I asked my Boomerang co-author about the GPL and BSD question, and it looks like it is possible to release Boomerang under a combined license. So that's good news!

I think that this will need a few words on the Boomerang page.

- Mike

---------- Forwarded message ----------
Date: Wed, 14 Sep 2005 18:11:21 +1000
From: QuantumG <qg@biodome.org>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Subject: Re: Boomerang and GPL code

Mike van Emmerik wrote:
>
> So the question is: do you know for sure that I can't include GPL code? If
> it's just a case of RTFM, I'm happy to do that; it's just that I thought you
> looked into this and decided pretty definitively that it could not be done.
> I can't see what the problem is, but that could easily be ignorance on my
> part.

You can indeed distribute it as BSD+GPL code.. however, you must apply the GPL to the whole work. So yeah, it means that if someone writes an extension for Boomerang they have to distribute it under the terms of the GPL (they have to provide source). But if they want to separate the BSD parts from the GPL parts and then make an extension from the BSD parts only, they can and then they are not bound by the GPL.

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Wed, Sep 14, 2005 at 2:07 PM

VERY GOOD. GPL is somehow better for such a project, when you are sure you are helping the whole opensource society, and not just providing something that can be abused to make money from your work.

Mohsen Hariri <m.hariri@gmail.com>
libid as a separate module?
5 messages

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 17, 2005 at 2:43 AM

Mohsen,

I'm thinking of making libid a separate module in Boomerang, with its own license. It would be a separate library, like libWin32BinaryFile.so / Win32BinaryFile.dll, and the main code would check for its presence. If not present, it would continue on as now. That way, there is no need to change the license on the main part of Boomerang.

This can only work if the signatures code can be well separated from the rest of Boomerang. This might be OK, or it might end up being a royal pain. What do you think?

I was also thinking of calling for comments on changing the license of Boomerang on the web page. It could be as simple as asking, waiting a week, and if there are no major objections, change the license for Boomerang and libid can be a required part of Boomerang. It could be a separate library or not as convenient.

Note that as I implemented it, the code in the symbols directory is just statically linked with the rest of the code. The earlier vision for Boomerang is that most functionality would be implemented as separate dynamically linked libraries, but it hasn't happened that way, and I'm not all that sure that it would be easier to develop that way, anyway. It would be good to use lots of libraries if there were separate possible implementations. For example, perhaps type analysis should have been done that way; you use data-flow based or constraint based or ad-hoc, and the other 2 modules don't have to take up memory in your executable.

Either way, I think it makes sense to put the library code on a cvs branch if it's not a separate module, so that people downloading boomerang don't get any problems caused by the new code not being well tested on all platforms. It makes things easier for you too, since you can check in changes even if they don't compile, and you don't have to test on many platforms before checking in.
(Witness the hassles with MinGW already). Once the code gets more settled, you can move off the cvs branch, since the probability of surprises on other platforms becomes very low.

I'd be interested in your comments on this.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 17, 2005 at 10:04 AM

> from the rest of Boomerang. This might be OK, or it might end up being a
> royal pain. What do you think?

That's good. At first I tried to compile it as a DLL, it's easy I guess.

> Either way, I think it makes sense to put the library code on a cvs branch
> if it's not a separate module, so that people downloading boomerang don't
> get any problems caused by the new code not being well tested on all
> platforms. It makes things easier for you too, since you can check in
> changes even if they don't compile, and you don't have to test on many
> platforms before checking in. (Witness the hassles with MinGW already).
> Once the code gets more settled, you can move off the cvs branch, since the
> probability of surprises on other platforms becomes very low.

I agree. At least till the codes become more stable. So please set up a branch, and I'll use that.

-Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 17, 2005 at 3:58 PM

> I agree. At least till the codes become more stable. So please
> set up a branch, and I'll use that.

OK, but the file server for the machine I use at Uni is down, sorry. Hmmm... looks like this email might not even make it, I'm getting error messages at the bottom of the screen. Sigh.

So there may be a bit of a delay. In the meantime, you can continue developing, I guess. Sorry about the delay.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sat, Sep 17, 2005 at 4:05 PM

I guess it's 10 o'clock there... you're still at uni? Here is 5 pm, and I'm at work.
:p

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Sep 18, 2005 at 1:39 AM

On Sat, 17 Sep 2005, Mohsen Hariri wrote:

> I guess it's 10 o'clock there... you're still at uni?

Not physically at Uni, no; I access my Linux machine at Uni from home using ssh. With the price of petrol so high, I may do more and more of that where possible.

Your time zone comes through as +0430, which would indicate that you are 5.5 hours behind me, on the same day, but I'm not sure that gmail gets the time zones correct. I suspect that the mail server was affected by the outage, so it's hard to tell.

> Here is 5 pm, and I'm at work. :p

Right now it's 8am Sunday, and I'm in bed :-)

The file server is back now. They mentioned some maintenance that they needed to do on a hard drive last Friday, looks like they had some teething issues or replaced some more of the raid drives.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Re: running matchsym
10 messages

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Thu, Sep 15, 2005 at 10:29 AM

> You can test my codes using the test file in /test/symbols/test_sig
> directory. This executable only uses libc(I've included libc too).
> you just run boomerang -k and issue these commands:
> loadbinary <path>test_sig.exe

I've made some changes so it compiles and links on Sparc/Solaris. (I wanted to test on something big-endian; endianness problems are common with loaders). This command seems to work; it says "loading...".

> matchsym <path>libc.lib

No matter what I do (slashes versus reverse slashes, etc), I can't get this to work. It always says "No default symbol matcher module for test/symbols/libc.lib".

I'm open to suggestions as to how to proceed from here. I could send you a patch file for the changes I had to make to get it to compile on Solaris. I could check the code in on a CVS branch.
This has a few advantages; you could update from the CVS branch and get my changes rather painlessly, and I can also check out the code and test it on MinGW, Linux, and OS X. However, once you check out from a branch, it's more difficult to get updates from the main trunk (ordinary non-branch changes).

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Thu, Sep 15, 2005 at 2:10 PM

I've tested that on my windows box. You mean it doesn't work at all or just when the system is big endian?

anyway, please give me the make files so that I can compile it on my linux box, I can do it myself, but I may make some mistakes as I have little experience with that.

please send me the patches.

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Thu, Sep 15, 2005 at 4:17 PM

On Thu, 15 Sep 2005, Mohsen Hariri wrote:

> I've tested that on my windows box.
> You mean it doesn't work at all or just when the system
> is big endian?

I don't know; I haven't attempted to run it under Windows. I assume it would work, so I suspect an endianness problem.

> anyway, please give me the make files so that I can compile
> it on my linux box, I can do it myself, but I may make some
> mistakes as I have little experience with that.
>
> please send me the patches.
>

OK, no problem. I'll do it in the next few days; I'm a bit rusty on making patch files and it's bed time now.

- Mike

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Fri, Sep 16, 2005 at 5:14 AM

On Thu, 15 Sep 2005, Mohsen Hariri wrote:

> please send me the patches.

Well, I had no end of trouble with the carriage returns that Windows inserts into all text files. Also, I could not figure out how to stop diff -r from comparing directories, telling me for example that I don't have .o files and so on.
I had to change all the CVS/Repository files, so they all showed up, but I deleted the CVS files in the "original" directory to save spurious output, but of course then I get "only in this directory: CVS" messages. I think that maybe patch will just ignore these lines, or maybe you will have to delete them manually. It should be easy with a few search and replace commands. Or maybe it might be easier to apply the patches manually. I've used -c to make this a little easier.

Do let me know how this goes; I don't expect problems but there usually are.

You will need libiberty.a (or .so etc) as well as libbfd.a; I don't understand why. I think I found this last time I tried to use libbfd. Or maybe that's just a Solaris thing; it might be nice to fiddle with it under Linux. (Without libiberty, I had about 15-20 undefined symbols such as xexit, concat, hex_init, _objalloc_alloc etc).

I had to put a soft link from a libbfd.a I found to lib/, like this:

    % ln -s /opt/local/lib/libbfd.a lib

I didn't need it with libiberty. You could also fiddle with the LD_LIBRARY_PATH environment variable, but I find the soft link easier and you don't need to fiddle with ~/.profile or the like so it works next time.

Good luck!

- Mike

boomerang_libid.diff 17K

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Sep 16, 2005 at 12:41 PM

I guess all file attributes are broken now that I've checked out the files on my windows, so I think it's better to checkout all the files from cvs in linux and then replace changed files with mines. I did so.

I did the diffs manually in the files. and everything compiles well. but I had a linker error where bfd was needed. That was because I had ran configure before making the Makefile.in changes. So I tried to run configure again, but this time I get this error:

    checking size of char... configure: error: cannot compute sizeof (char), 77
    See `config.log' for more details.
I looked at config.log, but nothing useful. I googled the error and again nothing useful, some people said it was because of some library files (I don't understand how can "size of char" be related to "libraries").

Anyway, it's lunch time and I'm going to have it. If you have any idea of what the solution is, please tell me. Otherwise I will investigate the problem myself.

One problem I encountered while compiling:
I'm compiling on a Fedora 4 linux. The default bfd version on FC4 is 2.15.92.0.2. When I was on my windows, I had tried to compile BFD 2.16 under mingw without success, because bfd needed a function in libiberty which didn't exist (the default libiberty included with BFD cannot be compiled under mingw, and mingw has an already compiled version, so I had to use that). So I tried with BFD 2.15 and it compiled well.

Now the problem: in BFD 2.15+, some bfd structures have changed. So I need to know which version of BFD exists. I guess it's something 'configure' should deal with, right? anyway, I have put #define BFD_2_15 on my bfdobjmatcher.h which should be commented out when linking with BFD > 2.15

I guess I have to use linux for development, because having to change the attributes every time I want to commit changes is not practical. Maybe I use windows to develop, and then copy just the changed files to my linux and then commit them, so when will you ever give me CVS access? :P

I've attached the changed project (tarred from my linux, so file attributes are OK). If you found what is the problem with my configure, please let me know.

PS.
the strcmpi function you sent me needs a change, here it is: (I guess this won't solve the problem with the matchsym error you get, i dunno why that error happens, that's just a simple string comparison and when it cannot find ".lib" at the end of library file, it shows that error)

    int strcmpi(const char* s1, const char* s2) {
        int n1 = strlen(s1);
        int n2 = strlen(s2);
        for (int i=0; i < n1; ++i) {
            if (i >= n2) return -1;
            char c1 = toupper(*s1++);
            char c2 = toupper(*s2++);
            if (c1 < c2) return -1;
            if (c1 > c2) return 1;
        }
        if(n2>n1) return 1;
        return 0;
    }

thanks
-Mohsen

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Sep 16, 2005 at 1:05 PM

boomerang_configure_error.tar.bz2 3544K

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 17, 2005 at 2:18 AM

> I guess all file attributes are broken now that I've checked out
> the files on my windows, so I think it's better to checkout
> all the files from cvs in linux and then replace changed files
> with mines. I did so.

Most programs can handle the changed line endings, so usually it doesn't matter. What program are you using for cvs access under Windows? Since you seem to have MinGW working, it probably makes sense to use that for the cvs access. Assuming that it preserves the line endings of files. I use the Cygwin cvs for when I'm using MSVC; that works out well. (You can access Windows files like c:\foo with /cygdrive/c/foo under Cygwin, and /c/foo under MinGW, I believe).

> I did the diffs manually in the files. and everything compiles well.
> but I had a linker error where bfd was needed. That was because
> I had ran configure before making the Makefile.in changes. So
> I tried to run configure again, but this time I get this error:
>
> checking size of char...
> configure: error: cannot compute sizeof (char), 77
> See `config.log' for more details.

Bizarre. Actually, I don't think Boomerang uses those anyway, so it might be easiest to just get rid of them (all the sizeof(<type>) tests). Sorry, I don't have any other suggestions.

> One problem I encountered while compiling:
> I'm compiling on a Fedora 4 linux.

Same as I use. Though I haven't tried using BFD on Core 4.

> The default bfd version on FC4 is 2.15.92.0.2.
> When I was on my windows, I had tried to compile BFD 2.16 under mingw
> without success, because bfd needed a function in libiberty which
> didn't exist
> (the default libiberty included with BFD cannot be compiled under mingw, and
> mingw has an already compiled version, so I had to use that). So I tried with
> BFD 2.15 and it compiled well.

I was looking at what libiberty is; it seems to be a collection of miscellaneous stuff that some GNU programs, including libbfd, need. It doesn't sound like it should be hard to compile.

> Now the problem: in BFD 2.15+, some bfd structures have changed.

I notice that you were using bfd_section, which used to be called sec. But if you use asection, it should work with either version. Perhaps there is a similar solution that doesn't need messing with configuration.

What is the structure that has changed?

> So I need
> to know which version of BFD exists. I guess it's something
> 'configure' should
> deal with, right? anyway, I have put #define BFD_2_15 on my bfdobjmatcher.h
> which should be commented out when linking with BFD > 2.15

Ick. Let's avoid that if possible. But if needed, I can hack something into configure.

> I guess I have to use linux for development, because having to change
> the attributes every time I want to commit changes is not practical.

Oh, if you're more comfortable developing in Windows, then please continue to do so. I think we just need to solve the line ending issue some other way.
> Maybe > I use windows to develop, and then copy just the changed files to my linux and > then commit them, so when will you ever give me CVS access? :P As soon as you ask. What's your sourceforge username? You're contributing at least as much as other developers. > I've attached the changed project (tarred from my linux, so file > attributes are OK). If you found what is the problem with my configure, > please let me know. Ah, OK. I'll look into it. > PS. the strcmpi function you sent me needs a change, here it is: Oops, OK, I see it. Actually, that might fix the problem, since it affects strings that are the same up to a certain point... well maybe. Sorry about the bug. - Mike Mohsen Hariri <m.hariri@gmail.com> Reply-To: m.hariri@gmail.com To: Mike van Emmerik <emmerik@itee.uq.edu.au> Sat, Sep 17, 2005 at 10:15 AM > Most programs can handle the changed line endings, so usually it doesn't > matter. What program are you using for cvs access under Windows? Since you > seem to have MinGW working, it probably makes sense to use that for the cvs > access. Assuming that it preserves the line endings of files. I use the > Cygwin cvs for when I'm using MSVC; that works out well. (You can access > Windows files like c:\foo with /cygdrive/c/foo under Cygwin, and /c/foo > under MinGW, I believe). I was using a windows native cvs.exe. I'll use cvs under cygwin from now. > I notice that you were using bfd_section, which used to be called sec. But > if you use asection, it should work with either version. Perhaps there is a > similar solution that doesn't need messing with configuration. > > What is the structure that has changed? the bfd_section structure had a 'comdat' member, but now we have to use a method to access this member. and the 'comdat' structure has been renamed. see the source files I sent to you. > As soon as you ask. What's your sourceforge username? You're contributing > at least as much as other developers. My username on sf is peakxx. 
-Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Sep 18, 2005 at 4:16 AM

Mohsen,

I've gotten a configure script test working for the BFD 2.15 define, and I had it compiling. But somehow with the files you sent me, several changes were lost (e.g. to boomerang.cpp, and I think now db/prog.cpp as well).

It will take a little longer to get the cvs branch set up and the files checked in.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Sun, Sep 18, 2005 at 7:46 AM

yes. seems I forgot some changes. cause configure didn't work and ... please make a branch from your current cvs and I will update the branch with my changes.

Mohsen Hariri <m.hariri@gmail.com>
Branch created (finally)
9 messages

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sun, Sep 18, 2005 at 3:23 PM

> My username on sf is peakxx.

Welcome, developer Mohsen!

I've also created the branch tag; it is called libid.

To get write access to CVS, you basically need to change your CVS/Root files. You can check out a new working copy with

    cvs -d :ext:peakxx@cvs.sourceforge.net:/cvsroot/boomerang co -r libid -d dirname boomerang

Or you can change the file CVS/Root to contain the :ext... string. You need to change all the CVS/Root files, and there are tens of these, so use a command like

    find . -name Root -exec cp /path/to/boomerang/CVS/Root {} \;

The {} expands to the path to the current file.

When you have checked it out, your files (except the ones that are the same as the main branch versions) will have a sticky tag. This means that you are "stuck on the branch", which is what you want. You can check in your changes with

    cvs ci -m "Check in message"

You don't need to mention -b libid, since the sticky tag implies it. Later, when all is well, I can guide you through the process of updating to the main branch.
You have full cvs write access now, so you can cause some damage, accidentally or otherwise, so please use caution. Almost everything in CVS is readily reversible, so no need to panic if something goes wrong.

In particular, please make sure you are "on the branch" before your first check in. For example:

    % cvs status boomerang.cpp
    Enter passphrase for key '/home/44/emmerik/.ssh/id_dsa':
    ===================================================================
    File: boomerang.cpp     Status: Up-to-date

       Working revision:    1.125.2.1
       Repository revision: 1.125.2.1   /cvsroot/boomerang/boomerang/boomerang.cpp,v
       Sticky Tag:          libid (branch: 1.125.2)
       Sticky Date:         (none)
       Sticky Options:      (none)

You'll be asked for your Sourceforge password unless you have set up an ssh key. If you set up an ssh key and an ssh agent, you can avoid having to type in passwords and/or pass phrases for every cvs command. As a developer, you need to enter a password or pass phrase even for read only commands (such as cvs status above). But I guess you know all that from the other project you were working on.

With the files as checked in, I get

    boomerang.cpp: In member function `void Boomerang::decode(Prog*, const char*)':
    boomerang.cpp:1062: error: 'class Prog' has no member named 'printSymbols'

Sounds like a problem with prog.cpp not getting the latest update; I renamed printSymbols to printSymbolsToFile.
The diff for that is so small you may as well do it manually (just beware of spaces versus tabs; we like to be pretty strict about 4 column tabs; my emailer won't paste tabs):

    % cvs diff -r 1.135 -r 1.136 db/prog.cpp
    < * $Revision: 1.135 $ // 1.126.2.14
    ---
    > * $Revision: 1.136 $ // 1.126.2.14
    581a582,588
    > void Prog::dumpGlobals() {
    >     for (std::set<Global*>::iterator it = globals.begin(); it != globals.end(); it++) {
    >         (*it)->print(std::cerr, this);
    >         std::cerr << "\n";
    >     }
    > }
    >
    1311,1312c1318,1319
    < void Prog::printSymbols() {
    <     std::cerr << "entering Prog::printSymbols\n";
    ---
    > void Prog::printSymbolsToFile() {
    >     std::cerr << "entering Prog::printSymbolsToFile\n";
    1333c1340
    <     std::cerr << "leaving Prog::printSymbols\n";
    ---
    >     std::cerr << "leaving Prog::printSymbolsToFile\n";
    1442a1450,1454
    > void Global::print(std::ostream& os, Prog* prog) {
    >     Exp* init = getInitialValue(prog);
    >     os << nam << " at " << std::hex << uaddr << std::dec << " initial value " << (init ? init->prints() : "<none>");
    > }
    >

I've rarely done this, but I believe you can do the above with

    % cvs update -j 1.135 -j 1.136 db/prog.cpp

which is supposed to merge in the changes between 1.135 and 1.136 into the current checked out copy. I'm not sure what the resultant revision is, or how it affects sticky tags (I imagine it would not affect either). So use with caution. I got the revision number from cvs log.

There may be one or two more like this. Ah, thinking about this, this one is not your fault; it's just that I checked in that change and you had not updated to that change before making the tar/zip files. I did some overwriting of files, which is always dangerous. These things can get tricky. Sigh. One day I'll be an expert at cvs branches.

Good luck! Please let me know when you've done the first few checkins by email. That way I can make sure that there are no hassles that might affect people. These days, it doesn't take long between a small mistake and a bug report.
- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Mon, Sep 19, 2005 at 12:21 PM

I've checked in my changes. Now libid works as a DLL under windows, and is loaded when needed. I used the codes in the BinaryFileFactory.cpp so for other platforms also it should work OK, it just needs the makefile to be changed.

Look if I'm doing things the right way in cvs.

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Tue, Sep 20, 2005 at 1:54 AM

> I've checked in my changes. Now libid works as a DLL under windows,
> and is loaded when needed. I used the codes in the BinaryFileFactory.cpp
> so for other platforms also it should work OK, it just needs the makefile to
> be changed.
>
> Look if I'm doing things the right way in cvs.

I've booked in a handful of changes, to

    symbols/*
    include/SymbolMatcher.h
    Makefile.in

to enable making under Unix. I haven't tested yet; I need to copy over some test files. It made for me with no problems under MinGW as well.

I removed carriage returns from symbols/lididloader.cpp; this might make for a difficult merge if you have made any changes. Assuming you haven't, it's probably easier to rename your copy and update from cvs.

You have checked in bfd.dll but not libgc.dll and libexpat.dll (to the branch). I changed bfd.dll to have the sticky option -kb (so cvs knows it is a binary and doesn't try to do keyword substitution).

Your files will need the GPL disclaimer at the top of each file, but that can wait. We will also need the LICENSE file updated with the GPL.

You have not checked in the test/symbols files. Have you decided if you want all those checked in?

The problems are all minor, and you successfully checked in on the branch. Well done!
- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 9:06 AM

> You have checked in bfd.dll but not libgc.dll and libexpat.dll (to the
> branch). I changed bfd.dll to have the sticky option -kb (so cvs knows it
> is a binary and doesn't try to do keyword substitution).

should I do that? I mean adding libgc and libexpat.dll? seems they are statically linked under windows, with default configuration, right?

> Your files will need the GPL disclaimer at the top of each file, but that
> can wait. We will also need the LICENSE file updated with the GPL.

Please do it for me. I don't know still how to put that revision history on top of my files, I guess it must be easy. Will you do that?

> You have not checked in the test/symbols files. Have you decided if you
> want all those checked in?

I don't know if I put that test in the cvs. Because that test used libc.lib which was sold with VS2003, so that will have a copyright problem, right?

I'm working on signatures. As soon as I'm done, I will make some sig files and put them in the tests directory.

So about signatures: I will make matchsym command without arguments to autodetect any possible library in the executable.

my next mail is the format of signature files, in 1 or 2 hours, I guess it would be long :p

-Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Tue, Sep 20, 2005 at 9:50 AM

On Tue, 20 Sep 2005, Mohsen Hariri wrote:
>> You have checked in bfd.dll but not libgc.dll and libexpat.dll (to the
>> branch). I changed bfd.dll to have the sticky option -kb (so cvs knows it
>> is a binary and doesn't try to do keyword substitution).
>
> should I do that? I mean adding libgc and libexpat.dll?
> seems they are statically linked under windows, with default
> configuration, right?

Err, not as far as I know. What is the Windows equivalent of a .a file? .lib?
I think the idea was to make it really easy for Windows people, since they are likely to be the least capable, and things like libexpat are more Unix friendly than Windows friendly.

>> Your files will need the GPL disclaimer at the top of each file, but that
>> can wait. We will also need the LICENSE file updated with the GPL.
>
> Please do it for me. I don't know still how to put that revision history
> on top of my files, I guess it must be easy. Will you do that?

OK, when I get a bit of time. Should be pretty easy.

>> You have not checked in the test/symbols files. Have you decided if you
>> want all those checked in?
>
> I don't know if I put that test in the cvs. Because that test used
> libc.lib which
> was sold with VS2003, so that will have a copyright problem, right?

Yes, let's not check that in. In fact, I think we should think about what files if any to check in for library testing. Certainly, a functional test of some sort would be good.

> I'm working on signatures. As soon as I'm done, I will make some sig files
> and put them in the tests directory.

Hmmm. I suppose the signatures themselves could be generated in a totally stand alone program, not just a separate library. That tool could be GPL'd, and Boomerang could stay as is (BSD-like license). Do you need bfd for the signature *reading* code? Surely not.

Would you consider making the signature generator a separate program under Windows? I can modify the Unix makefile to do the same thing. I don't know why this didn't come to me before. It would sure simplify the licensing.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 10:17 AM

> Err, not as far as I know. What is the Windows equivalent of a .a file?
> .lib? I think the idea was to make it really easy for Windows people, since
> they are likely to be the least capable, and things like libexpat are more
> Unix friendly than Windows friendly.
.lib files stand for both .a and .la files. They can contain the whole code, or they can contain a simple stub for the dll file.

> Would you consider making the signature generator a separate program under
> Windows? I can modify the Unix makefile to do the same thing. I don't know
> why this didn't come to me before. It would sure simplify the licensing.

I'm making a separate program somehow. But I think the non-signatured lib file matching and obj file matching are useful, and they can be a part of boomerang. anyway, maybe the overhead of changing the license is more than the benefits.

- Mohsen

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 10:58 AM

One more thing. I want to use wine's demangling codes, and it's LGPL. I can do this also in the separate signature generator program. But tomorrow we will be trying to read debug information from an executable and that's again GPL.

Anyway I think not using any GPL code in the whole boomerang project is somehow limiting ourselves, and what would boomerang get from being under pure BSD license? other than being misused by proprietary softwares?

I think technology edge programs like boomerang MUST be GPLed, because they have a high potential for being misused.

anyway you are the boss ;) you decide.

-Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Tue, Sep 20, 2005 at 12:11 PM

On Tue, 20 Sep 2005, Mohsen Hariri wrote:
> One more thing. I want to use wine's demangling codes, and
> it's LGPL. I can do this also in the separate signature generator
> program. But tomorrow we will be trying to read debug information
> from an executable and that's again GPL.

Well, you could extend the Win32BinaryFile etc classes to handle it. But then again, it probably makes sense to get rid of these libraries and just re-use BFD anyway.
Any improvements to BFD will automatically be available, any new formats likewise, and so on.

We considered using BFD early on, but decided it was just too hard to use and/or learn how to use. So we wrote our own. Sounds lame now, but that's how it happened.

> Anyway I think not using any GPL code in the whole boomerang
> project is somehow limiting ourselves, and what would boomerang
> get from being under pure BSD license? other than being misused by
> proprietary softwares?

Yes, I can see that not using any GPL'd code is quite restrictive. If not now, then more and more so in the future. So we may as well prepare for the inevitable.

> I think technology edge programs like boomerang MUST be GPLed,
> because they have a high potential for being misused.

Interesting point of view.

> anyway you are the boss ;) you decide.

I've started writing up a web page explaining the new license. It will take a few days, as spare time is short.

- Mike

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: m.hariri@gmail.com
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Tue, Sep 20, 2005 at 10:35 PM

> We considered using BFD early on, but decided it was just too hard to use
> and/or learn how to use. So we wrote our own. Sounds lame now, but that's
> how it happened.

YES. BFD is hard to use. The documentation is very poor, you have to just explore it yourself. Considering BFD documentation these days, I can imagine how it was in 1992-3 when you were starting UQBT. BFD has done a hard job in C, which could be done much more readable and maintainable in C++. But I know they have many reasons to do so. Anyway, current binary file handling in boomerang is much more readable than BFD.
Maybe BFD use BinaryFile class from boomerang someday :p

- Mohsen

Mohsen Hariri <m.hariri@gmail.com>
Library Signatures
2 messages

Mohsen Hariri <m.hariri@gmail.com>
Reply-To: Mohsen Hariri <m.hariri@gmail.com>
To: Mike van Emmerik <emmerik@itee.uq.edu.au>
Fri, Sep 23, 2005 at 6:55 PM

Hi,

I'm still thinking about the signature files. The method you used in dcc was very interesting, (minimal perfect hashing), but if we want to do the same in boomerang we would be somehow cpu instruction dependent, cause we have to identify which instructions can get relocatable arguments.

The way IDA signatures work is storing signatures in a tree-structure format. That would be a more practical method for our purpose.

Now I know much more about compiling templates. Didn't know templates have that many problems, for example you cannot generally store a function which uses templates in a library/object. That's why all stl headers contain codes. some interesting point here is that the linker doesn't warn you about multiple definitions if you have two object files, that in each of them you have used a vector<int> for example. (here the vector<int> class will compile and will be stored in both object files, and there will be many duplicate functions). So for classes with templates, microsoft linker works like gcc with link once option. So if we use a template tag for binaryfile class, we won't get any warnings about duplicate definitions. :p

Anyway, I feel I can start coding now. Please just tell me what you think about these problems:

1. (again) Can a library be used on more than one platform? Is it common? If it's common, we have to put more than one platform id in header of signature files. IDA sig files store only one platform in each sig file.

2. What should we do with functions with the same name but different parameters?
We have to use both function name and parameters to look up the function in header file to find the prototype, or some other method like putting a hint number in the signature function and related header function, anyway we cannot find the function by using just the function name.

-Mohsen

Mike van Emmerik <emmerik@itee.uq.edu.au>
To: Mohsen Hariri <m.hariri@gmail.com>
Sat, Sep 24, 2005 at 3:17 AM

> Hi,
>
> I'm still thinking about the signature files.
> The method you used in dcc was very interesting,
> (minimal perfect hashing), but if we want to do
> the same in boomerang we would be somehow
> cpu instruction dependent, cause we have to
> identify which instructions can get relocatable
> arguments.

Yes, but you need to identify the "wildcards" either way. I do think that the tree method is better; I just didn't think of it at the time. Plus, it allows different sized patterns (you might sometimes need 100 bytes in a pattern to distinguish a difference in the 101st byte, but most patterns can be about 20 bytes).

> The way IDA signatures work is storing signatures
> in a tree-structure format. That would be a more
> practical method for our purpose.

Agreed.

> Now I know much more about compiling templates. Didn't know
> templates have that many problems, for example you cannot
> generally store a function which uses templates in a library/object.

Interesting.

> That's why all stl headers contain codes. some interesting point
> here is that the linker doesn't warn you about multiple definitions
> if you have two object files, that in each of them you have used
> a vector<int> for example. (here the vector<int> class
> will compile and will be stored in both object files, and there will
> be many duplicate functions). So for classes with templates,
> microsoft linker works like gcc with link once option. So if
> we use a template tag for binaryfile class, we won't get any
> warnings about duplicate definitions. :p

Heh.

> Anyway, I feel I can start coding now.
> Please just tell me what
> you think about these problems:
>
> 1. (again)Can a library be used on more than one platform?

Yes, most libraries would be like that: curses, bfd, expat, gc, etc.

> Is it common?

Yes.

> If it's common, we have to put more
> than one platform id in header of signature files.

But signatures are specific to a platform. For example, the PowerPC curses library will have very different patterns to the Windows one, and probably the Linux one will be somewhat different to the Win32 one.

Oh, perhaps we are talking about different "signatures" (unfortunate name clash). If you are thinking about prototypes (spec of names and parameter types), then yes, they should generally be the same across all platforms. There may be minor exceptions (e.g. on a 16-bit platform, there may be a long where on 32 bit platforms there is an int), but they should equate to the same prototype in Boomerang terms. It might be good to allow for exceptions when required.

> IDA sig files store only one platform in each sig file.

I'm pretty amazed that they can combine so many versions even, e.g. MFC versions 2-4. (Signatures, not prototypes).

I wonder if prototypes should be stored in the signature (pattern) file?

> 2. What should we do with functions with the same name
> but different parameters? We have to use both function
> name and parameters to look up the function in header
> file to find the prototype, or some other method like
> putting a hint number in the signature function and
> related header function, anyway we cannot
> find the function by using just the function name.

Maybe this goes away too if you put the prototype in the pattern file. I think you only ever start with a pattern and want to know the prototype, if any. I can't see a need (in Boomerang, perhaps in some stand alone tools) to ask "do you have a pattern for this prototype?".

- Mike
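
Editorial sketch: the tree-structured signature idea discussed in this thread can be illustrated with a small byte trie in which some positions are wildcards (standing in for relocated operands) and each stored pattern carries its own prototype string, so two functions with the same name but different parameters are simply two entries. This is only a minimal illustration under stated assumptions, not Boomerang's or IDA's actual format; the names SigTree, PatByte, LibFunc, lit and wild are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One position of a pattern: a concrete byte, or a wildcard matching any byte.
struct PatByte {
    bool wildcard;
    std::uint8_t value;
};
inline PatByte lit(std::uint8_t b) { return {false, b}; }
inline PatByte wild() { return {true, 0}; }

// What a matched pattern yields: the name plus the prototype stored with it.
struct LibFunc {
    std::string name;
    std::string prototype;
};

class SigTree {
    struct Node {
        std::map<std::uint8_t, std::unique_ptr<Node>> exact;  // concrete-byte edges
        std::unique_ptr<Node> any;                            // wildcard edge
        std::optional<LibFunc> func;                          // set where a pattern ends
    };
    Node root;

    // Depth-first walk that remembers the longest pattern matching a prefix of code.
    static void walk(const Node* n, const std::vector<std::uint8_t>& code,
                     std::size_t pos, const LibFunc*& best, std::size_t& bestLen) {
        if (n->func && (best == nullptr || pos > bestLen)) {
            best = &*n->func;
            bestLen = pos;
        }
        if (pos == code.size()) return;
        auto it = n->exact.find(code[pos]);
        if (it != n->exact.end()) walk(it->second.get(), code, pos + 1, best, bestLen);
        if (n->any) walk(n->any.get(), code, pos + 1, best, bestLen);
    }

public:
    // Patterns may have different lengths; each one ends at its own node.
    void add(const std::vector<PatByte>& pattern, LibFunc f) {
        Node* n = &root;
        for (const PatByte& p : pattern) {
            std::unique_ptr<Node>& next = p.wildcard ? n->any : n->exact[p.value];
            if (!next) next = std::make_unique<Node>();
            n = next.get();
        }
        n->func = std::move(f);
    }

    // Returns the matched function, or nullptr if no stored pattern matches.
    const LibFunc* match(const std::vector<std::uint8_t>& code) const {
        const LibFunc* best = nullptr;
        std::size_t bestLen = 0;
        walk(&root, code, 0, best, bestLen);
        return best;
    }
};
```

In use, one would add a pattern such as {lit(0x55), lit(0x8B), lit(0xEC), lit(0xE8), wild(), wild(), wild(), wild(), lit(0xC3)} together with a name and prototype, then call match() on the bytes at a candidate function entry. Which bytes become wildcards still has to be decided per processor, as noted above; the sketch assumes that decision has already been made when the pattern is built.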