Instrumentation and Optimization of WIN32/Intel Executables

Etch is an application program performance evaluation and optimization system, developed for Intel x86 platforms running the Windows/NT operating system. The system allows you to annotate existing binaries with arbitrary instructions (for example, to trace execution or perform coverage analysis), or to rewrite an existing binary so that it executes more efficiently.

Etch works directly on x86 executables. It does not require program source code for either measurement or optimization.

Some Results

If you'd like to see some traces we've generated from a few popular Windows programs on x86, click here.

Who uses Etch?

Etch is targeted at two different user groups: developers, who wish to understand the performance of their programs during the development cycle, and users, who wish to understand and improve the performance of common applications executing in their environment.

Etch provides both groups with measurement tools to evaluate performance at several levels of detail, and optimization tools to automatically restructure programs to improve performance, where possible.

How Etch Works

Etch reads executable binaries (and, under Win32, DLLs) for an application, modifies the image, and writes a new one that has been enhanced for measurement or optimization. The transformations performed on the binary by Etch do not change program correctness, although a program transformed to collect performance measurements will run more slowly. Etch does not require changes to the operating system, but an Etch-modified binary may use OS facilities, such as software timers, or even implementation-specific facilities, such as Intel Pentium performance counters.

Using Etch

There are three key concepts in using Etch:

Instrumentation
Instrumentation transforms a binary according to arbitrary criteria. For example, a program may be instrumented to count instructions, or to count the occurrences of each instruction, or to simulate a cache by tracking memory references.
Data collection
Once instrumented, an executable can be run. At that time, instrumentation routines collect data about the program.
Data processing
Once run, any data generated by an instrumented executable can be processed. Trace-based optimization is a typical data processing phase made possible by Etch.
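
To make the cache-simulation example above more concrete, here is a minimal sketch that replays a list of memory addresses through a direct-mapped cache model and counts hits and misses. The trace format, the DirectMappedCache class, and the cache parameters are assumptions chosen for illustration; they are not part of Etch, and a real tool would receive its addresses from instrumented code rather than from a hard-coded list.

```python
class DirectMappedCache:
    """Toy direct-mapped cache model driven by a memory-reference trace."""

    def __init__(self, num_lines=256, line_size=32):
        # Both parameters are illustrative defaults, not Etch settings.
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # tag currently held by each line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record one memory reference as a hit or a miss."""
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag  # the new block replaces the old one


def simulate(trace, cache):
    """Replay every address in the trace and return (hits, misses)."""
    for address in trace:
        cache.access(address)
    return cache.hits, cache.misses


if __name__ == "__main__":
    # Hypothetical trace: a few nearby references plus one conflicting address.
    trace = [0x1000, 0x1004, 0x1020, 0x1000, 0x3000, 0x1000]
    hits, misses = simulate(trace, DirectMappedCache())
    print(f"hits={hits} misses={misses}")
```

In this model the instrumentation phase would only need to arrange for each memory reference to be reported; the hit and miss bookkeeping happens entirely in the collection routine.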


To instrument a program, Etch is invoked with the name of an executable and a DLL. The DLL provides a set of routines that are invoked for each instruction in the executable. Roughly, Etch operates as follows:

        for each instruction in executable
            call the tool's "Before" and "After" functions for that instruction

The instrumentation tool provides implementations of these "Before" and "After" functions. The callback functions can in turn direct Etch to modify the executable with respect to the specific instruction. The directions in effect say "before (or after) this instruction runs, please call some specific function with some specific set of arguments." For example, to count instructions, the InstrumentBefore procedure would direct Etch to insert code that increments a counter at runtime. These inserted instructions do not change the correctness of the program.
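
The split between instrumentation time and run time can be sketched with a small model. The code below is not the Etch API: the Instruction and CountingTool classes, the instrument_before and instrument_after method names, and the run function are all hypothetical stand-ins. It only illustrates the idea that a callback is invoked once per instruction while the binary is being rewritten, and that the call it requests is executed every time that instruction runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Instruction:
    """Hypothetical stand-in for one instruction in the executable."""
    address: int
    opcode: str
    # Calls that the tool asked to have inserted before/after this instruction.
    before: List[Tuple[Callable, tuple]] = field(default_factory=list)
    after: List[Tuple[Callable, tuple]] = field(default_factory=list)


class CountingTool:
    """Hypothetical tool that counts executed instructions."""

    def __init__(self):
        self.count = 0

    def increment(self):
        # Run time: executed each time an instrumented instruction runs.
        self.count += 1

    def instrument_before(self, inst):
        # Instrumentation time: request a call to increment() before inst.
        inst.before.append((self.increment, ()))

    def instrument_after(self, inst):
        # This tool inserts nothing after the instruction.
        pass


def instrument(program, tool):
    """Visit each instruction once and let the tool attach its calls."""
    for inst in program:
        tool.instrument_before(inst)
        tool.instrument_after(inst)
    return program


def run(program):
    """Stand-in for executing the rewritten binary (straight-line code only)."""
    for inst in program:
        for fn, args in inst.before:
            fn(*args)
        # ... the original instruction would execute here ...
        for fn, args in inst.after:
            fn(*args)


if __name__ == "__main__":
    program = [Instruction(0x401000, "push"),
               Instruction(0x401001, "mov"),
               Instruction(0x401003, "ret")]
    tool = CountingTool()
    run(instrument(program, tool))
    print(tool.count)
```

Keeping the two phases separate is the point of the design: the per-instruction decision is made once, and only the requested call is paid for at run time.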

Once the entire executable has been scanned and instrumented, Etch writes a new version of the executable that can be run. Any functions referenced in the callback routines, as well as the Etch runtime library, are included in the new executable.

Data Collection

The executable written by Etch can be run, and any instrumentation routines will run as a side effect of running the program. While the program is running, instrumentation routines can inspect the state of the program, such as the contents of registers or effective addresses. All addresses, whether text or data, are relative to the original binary, so the collection routines do not have to compensate for the fact that they are part of a modified executable.
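
One way a rewriter could keep collected addresses relative to the original binary is to pass each instruction's original address as a constant argument to the inserted call. How Etch achieves this internally is not described here, so the sketch below is an assumption: the rewrite and collect functions, the fixed CALL_SIZE, and the example addresses are all hypothetical.

```python
CALL_SIZE = 5  # assumed size in bytes of one inserted call


def rewrite(instructions, base=0x401000):
    """Lay out instrumented code and return (new_address, original_address) pairs.

    instructions: list of (original_address, size_in_bytes) tuples.
    Every instruction is preceded by one inserted call, so it moves to a new
    address, but the call carries the original address as a constant.
    """
    layout = []
    cursor = base
    for original_address, size in instructions:
        cursor += CALL_SIZE  # room for the inserted call
        layout.append((cursor, original_address))
        cursor += size
    return layout


def collect(layout):
    """Simulated run: the collection routine only ever sees original addresses."""
    seen = []
    for _new_address, original_address in layout:
        seen.append(original_address)
    return seen


if __name__ == "__main__":
    original = [(0x401000, 1), (0x401001, 2), (0x401003, 1)]
    layout = rewrite(original)
    for new_address, original_address in layout:
        print(f"now at {new_address:#x}, reported as {original_address:#x}")
```

Because the reported values match the unmodified program, data gathered this way can be compared directly against symbol tables or disassembly of the original executable.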

Data Processing

When an Etched program terminates, its data collection routines can save information about the executable to disk. Later, post-processing utilities can examine the data. For example, a predicted execution time can be determined after the fact based on hypothetical processor, cache, and memory speeds. At a lower level, detailed information about a program's performance can be obtained, as shown in the graph below of instruction cache performance for a collection of popular Win32 programs. The graph shows the miss penalty of the first level instruction cache and a second level unified cache for the Perl interpreter, three commercial C++ compilers, and MS-WORD.
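
As a rough illustration of predicting execution time after the fact, the sketch below combines counts that a collection run might produce with hypothetical processor, cache, and memory speeds. The cost model (a fixed cycles-per-instruction figure plus a fixed penalty per miss), the function names, and all of the numbers are assumptions for illustration; they are not Etch's model or measured results.

```python
def predicted_cycles(instructions, l1_misses, l2_misses,
                     cpi=1.0, l2_hit_cycles=10, memory_cycles=60):
    """Estimate total cycles from collected counts and hypothetical speeds.

    Every first-level miss pays l2_hit_cycles to reach the second-level cache;
    every second-level miss pays an additional memory_cycles.
    """
    return (instructions * cpi
            + l1_misses * l2_hit_cycles
            + l2_misses * memory_cycles)


def predicted_seconds(cycles, clock_mhz):
    """Convert a cycle count to seconds for a given clock rate in MHz."""
    return cycles / (clock_mhz * 1_000_000)


if __name__ == "__main__":
    # Hypothetical counts from one instrumented run.
    cycles = predicted_cycles(instructions=1_000_000,
                              l1_misses=10_000,
                              l2_misses=1_000)
    print(cycles, predicted_seconds(cycles, clock_mhz=90))
```

Because the counts are saved to disk, the same run can be re-evaluated under different assumed speeds without executing the program again.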


Etch also provides facilities for rewriting an executable in order to improve its performance. For example, the instrumentation phase, rather than adding new instructions, can direct Etch to write the executable out according to a different code layout optimized for cache and VM behavior.

The impact of optimization

The graph below shows the reduction in instruction cache misses and execution time (in cycles) for a collection of popular Win32 programs that have been optimized for code layout using Etch on a 90 MHz Pentium. Etch was first used to discover the programs' locality while they executed against a training set, and the programs were then rewritten in order to achieve a tighter cache and VM packing. Infrequently executed basic blocks were moved out of line, and frequently interacting basic blocks were laid out contiguously in the executable. The results were measured using inputs different from those used during training.
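
The layout idea described above can be sketched with a simple greedy heuristic. This is not Etch's algorithm, which is not specified here; it is a generic chain-building approach under stated assumptions. The layout_blocks function, the cold_threshold parameter, and the block names and counts in the example are all hypothetical.

```python
def layout_blocks(counts, edges, cold_threshold=1):
    """Return a block order: hot blocks chained by heaviest edge, cold blocks last.

    counts: {block: execution count observed in the training run}
    edges:  {(src, dst): number of times control passed from src to dst}
    """
    hot = [b for b in counts if counts[b] >= cold_threshold]
    cold = [b for b in counts if counts[b] < cold_threshold]

    placed = []
    remaining = set(hot)
    # Start each chain at the hottest block that has not been placed yet.
    for start in sorted(hot, key=lambda b: -counts[b]):
        block = start
        while block in remaining:
            placed.append(block)
            remaining.remove(block)
            # Follow the most frequently taken edge to an unplaced hot block.
            successors = [(weight, dst) for (src, dst), weight in edges.items()
                          if src == block and dst in remaining]
            if not successors:
                break
            block = max(successors)[1]
    # Infrequently executed blocks are moved out of line, after the hot code.
    return placed + cold


if __name__ == "__main__":
    # Hypothetical profile: C is a rarely taken error path between A and D.
    counts = {"A": 1000, "B": 990, "C": 10, "D": 1000}
    edges = {("A", "B"): 990, ("A", "C"): 10, ("B", "D"): 990, ("C", "D"): 10}
    print(layout_blocks(counts, edges, cold_threshold=50))
```

Placing blocks that frequently follow one another side by side lets them share cache lines and pages, while the rarely used blocks no longer dilute the hot region.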

The User Interface

In addition to a programming interface, Etch also offers a graphical user interface for performing common instrumentation and optimization operations. The user interface can drive the measurement process: it runs Etch on the original binary to produce a new binary, modified to collect the necessary behavioral data; it executes the modified binary to produce the data; and it feeds the data to analysis tools that produce graphs or charts that help to pinpoint problems. Once a problem has been identified, the user may instruct Etch to perform a performance-optimization transformation. For example, Etch may rewrite the original binary to change the layout of data or code in order to improve cache or virtual memory performance.

Sample dialog box from the user interface

Sample results showing distribution of instruction opcodes


Etch runs on Intel 486, Pentium and P6 processors with at least 24 MB of memory. Etch works on 32-bit (Win32) binaries. It has been used for programs built by MSVC, Borland, and Intel compilers.


If you are interested in obtaining more information about Etch, please contact

Project Members

Etch is due to the efforts of people at Harvard University and the University of Washington. These include: Dennis Lee, Ted Romer, Geoff Voelker, Alec Wolman, Wayne Wong, Brad Chen, Brian Bershad, and Hank Levy.

Copyright (c) 1997 The University of Washington. All rights reserved.