

Toward a QEMU and SystemC-based Framework for Virtual Platform Construction and Design Space Exploration on SoC Development

Ming-Chao Chiang, Tse-Chen Yeh, and Guo-Fu Tseng

M.-C. Chiang, T.-C. Yeh, and G.-F. Tseng are with the Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan, R.O.C. (e-mail: mcchiang@cse.nsysu.edu.tw; sdgp03@ms18.hinet.net; cooldavid@cooldavid.org).

Abstract—In this paper, we present a framework for virtual platform construction and design space exploration for system-on-chip (SoC) development based on QEMU and SystemC. The framework is motivated by the following observations: (1) all the components of a virtual platform should be replaceable without too much effort; (2) design space exploration should be enabled not only for the system designer but also for the hardware designer at the early stage of a design; and (3) the hardware/software co-simulation should be fast enough, even when co-simulating a full-fledged operating system. We show in the paper that these requirements can be easily fulfilled by the framework described herein, by giving either a demonstration or experimental results. The first requirement is demonstrated by replacing the vector interrupt controller on the virtual platforms of QEMU with external hardware modeled in SystemC. The second requirement is demonstrated by the waveform of the AMBA on-chip bus ported from QEMU-SystemC and by the statistics collected while co-simulating a full-fledged Linux kernel. For the third requirement, our experimental results for booting up the Linux kernel above on a laptop show that, with every instruction executed and every memory access since power-on traced, the hardware/software co-simulation takes less than 15 minutes on average; even in the worst case, it takes no more than 16 minutes.

Index Terms—SoC, QEMU, SystemC, platform-based design, transaction-level modeling, hardware/software co-simulation

I. INTRODUCTION

Because of cost considerations and time-to-market pressure, most System-on-Chip (SoC) designs, be they single- or multi-processor systems, are hardly ever developed from scratch nowadays. Rather, except for some of the components, they are more often than not constructed from IPs available in the marketplace. Moreover, most system design teams strongly prefer that the associated software be designed and developed before the physical hardware is available, partially because of the long lead time required for custom silicon development and partially because the development of the software may shed light on the hardware design.

Even though many electronic design automation (EDA) companies have delivered a lot of platform-based hardware/software co-simulation environments to facilitate the Electronic System Level (ESL) design flow [1], open source solutions are either absent or far less powerful. The gap that prevents the hardware and software from being co-simulated seamlessly is, to a large degree, an issue of the differing design requirements of software designers, hardware designers, and system architects. The situation is made even more complicated because software designers generally need not know many of the details required by a hardware simulator, such as instruction-accurate information, when developing the operating system (OS), device drivers, and related applications. The hardware designers and system architects, however, need to take into account all of the low-level details, such as instruction-accurate or cycle-accurate information. For instance, most hardware designers are forced to pay special attention to the register transfer level (RTL) and to the advance of the design process. These different requirements on the level of detail reveal the trade-off between simulation accuracy and speed. As a result, quite a few system level description languages (SLDLs) [2], [3], [4], [5] have been invented, the main purpose of which is to solve this problem.

The emergence of SystemC provides a balance between hardware and software design. On one hand, it is capable of modeling hardware that spans a wide range of levels, from the behavioral level to the transaction level to the register transfer level. On the other hand, as a C++ class library, it integrates seamlessly with software written in either C or C++, which can be co-simulated with the SystemC simulation engine.

Among the platform-based designs, the most useful construct is the core-based methodology. The idea is to use an Instruction Set Simulator (ISS) together with hardware models (usually modeled at the transaction level) for SoC or Multi-Processor SoC (MPSoC) co-simulation [6], [7], [8], [9], [10], [11]. Several EDA tools available in the marketplace today use platform-based design and transaction level modeling (TLM) [12], [13], [14]. Examples include CoWare's Platform Architect, ARM's RealView SoC Designer, and Synopsys's Innovator. All of them provide a SystemC model library and the capability of design space exploration [15], [16], [17]. In addition, several other tools use the ISS of a specific processor. For instance, SPACE and MPARM support ARM ISS simulation [18], [19]. Platune supports the MIPS architecture [20].



In 2007, QEMU-SystemC was proposed as a software/hardware SoC emulation framework [21].¹ The original intent of QEMU-SystemC is to facilitate the development of software and device drivers for whatever OS happens to run on QEMU, without spending too much effort on modifying the virtual platform itself.

A. Motivation of the Work

However, our observation shows that there exist several constraints when QEMU-SystemC is applied to virtual platform construction and design space exploration. First, the hardware models that can be plugged in are limited by the plug-in interface of QEMU-SystemC, which reserves a region of the virtual address space of QEMU for the I/O of hardware models. Second, the plug-in interface allows a hardware interrupt signal to be sent, but not received. Third, the transactions of the on-chip bus model do not carry sufficient processor information, such as the instructions executed and the memory accessed. To make QEMU-SystemC a more flexible and extensible framework and to make design space exploration possible for SoC development, we need to overcome the aforementioned limitations of QEMU-SystemC. In this paper, not only do we present a much more flexible and extensible plug-in interface with send/receive interrupt channels for hardware modeling, but we also adapt QEMU, as a virtual machine, into an instruction-accurate ISS to facilitate both virtual platform construction and design space exploration. Moreover, we also address one of the major challenges in designing such a framework; that is, how to eliminate the communication overhead between QEMU (as an instruction-accurate ISS) and SystemC, especially when design space exploration is concerned.

B. Contribution of the Paper

Most of the research on frameworks for SoC development has been based on a specific processor, except for the commercial ones. Even though QEMU-SystemC has the potential of extending the processor types, its primary focus has been on software and device driver development. The main contributions of the paper are therefore fourfold:
1) We propose a framework for SoC development based on QEMU and SystemC. The proposed framework has several advantages compared to many of the commercially available tools. Among them are: (1) the proposed framework is at least as flexible as many of the commercial tools as far as model replacement is concerned; (2) the proposed framework can benefit from the advance of QEMU and SystemC, in the sense that any enhancements to QEMU and SystemC can be easily incorporated into the framework; and (3) unlike an ISS that is generally designed for a specific processor, the proposed framework is potentially more extensible and configurable in terms of the number of processors available in QEMU, though we use only the ARM target and the x86 host of QEMU as an example throughout this paper.
2) We provide the memory mapping interface and the interrupt connections that make it extremely easy to replace hardware models, a big step toward facilitating the establishment and extensibility of a virtual platform.
3) We achieve design space exploration by extracting all the necessary information from QEMU, thus in one sense elaborating a virtual machine into an instruction-accurate ISS. We present the techniques used to extract the information from QEMU.
4) Our experimental results show that the virtual platform described herein is much faster than many of the commercially available tools, in the sense that even on a laptop it takes on average less than 15 minutes of co-simulation time to boot up a full-fledged Linux kernel, with every instruction executed and every memory access since power-on traced. By implementing QEMU and SystemC as two threads running in the same process space and communicating through shared memory remapping, we eliminate not only the overhead associated with pipes and sockets but also the overhead of context switches between (heavy-weight) processes.

¹Throughout this paper, we use QEMU-SystemC to refer to the framework described in [21], and QEMU and SystemC to refer to QEMU and SystemC in general.

C. Organization of the Paper

The remainder of the paper is organized as follows. Related work is given in Section II. The proposed framework is presented in Section III. The experimental results are summarized in Section IV. Section V concludes the work.

II. RELATED WORK

In this section, we begin with a brief introduction to SystemC. Then, ISSs for system emulation are discussed. In order to ensure that the framework described herein is capable of running an OS, we consider only ISSs that are able to emulate a complete system, and we present several techniques commonly used in realizing system emulation. Next, several representative processor-based SoC development frameworks that can facilitate the ESL design flow are reviewed, including both commercial and open source tools. Finally, the communication mechanisms between QEMU (as an instruction-accurate ISS) and SystemC are introduced, which have a strong impact on the performance of hardware/software co-simulation.

A. SystemC

SystemC is an ANSI standard C++ class library developed by the Open SystemC Initiative (OSCI) [22] in 1999 and approved as an IEEE standard in 2005 [23]. Although SystemC is a relatively new standard, it has become one of the most popular modeling languages in the ESL design flow [24]. Because SystemC can simulate the concurrency, events, and signals of hardware, the abstraction of a hardware model can be raised up to the transaction level without the need to consider signal-level details [25], [26].
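As a purely illustrative aside (not taken from the paper), the sketch below shows the kind of event-driven concurrency SystemC layers on top of C++: a module whose process re-evaluates whenever one of its input signals changes. The module and signal names are our own assumptions, and the example presumes a SystemC installation providing systemc.h.

    #include <systemc.h>
    #include <iostream>

    // Minimal event-driven SystemC module: a 2-input gate whose process
    // re-evaluates whenever either input signal changes.
    SC_MODULE(AndGate) {
        sc_in<bool>  a, b;
        sc_out<bool> y;

        void evaluate() { y.write(a.read() && b.read()); }

        SC_CTOR(AndGate) {
            SC_METHOD(evaluate);
            sensitive << a << b;   // event sensitivity gives hardware-like concurrency
        }
    };

    int sc_main(int argc, char* argv[]) {
        sc_signal<bool> a, b, y;
        AndGate g("g");
        g.a(a); g.b(b); g.y(y);

        a = true; b = false; sc_start(10, SC_NS);      // run the simulation kernel
        b = true;            sc_start(10, SC_NS);
        std::cout << "y = " << y.read() << std::endl;  // prints y = 1
        return 0;
    }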



Fig. 1. The steps taken by QEMU to dynamically translate the target code to the host code. The first three steps are done at compile-time while the last two steps are invoked at run-time. (1) Each instruction in the instruction set architecture (ISA) of the target processor is translated into a sequence of micro-operations, each of which is hand-coded as a tiny function (TF) in C. (2) The set of TFs written in C is compiled by gcc into an object file in the host object file format. (3) The tiny code generator (TCG) is launched to generate, from the object file created in step 2, the Dynamic Code Generator (DCG) that will be invoked at run-time to generate the host code from a sequence of indices produced by an intermediate code generator (ICG). (4) The intermediate code generator (ICG) is invoked by QEMU to translate the target code into a sequence of indices, each of which uniquely identifies a TF. (5) The DCG generated at step 3 is launched to translate the sequence of indices into the host code by memory-copying the TF code corresponding to each index into the host code buffer referred to as a Translation Block (or TB for short).

From the perspective of the ESL design flow, a platform-based design together with SystemC can satisfy the requirements of hardware/software partitioning, post-partition analysis, and verification using TLM and/or RTL modeling [5].

B. ISS for System Emulation

Since our goal is to develop a framework for system simulation, we focus on ISSs that are open source and able to run a full-fledged operating system. The most famous software for system emulation is VMware [27], which can emulate a lot of guest OSs on Windows, Linux, and BSD, but it is commercial software and has nothing to do with SystemC and hardware/software co-simulation. Bochs [28] is an open source system emulator written completely in C++, but it can only emulate x86 and is slower than QEMU [29]. QEMU is also an open source system emulator, which can emulate several target CPUs on several hosts. Moreover, quite a few OSs have been ported to the virtual platforms supported by QEMU. There are several reasons why we choose QEMU instead of Bochs as the ISS. First, the simulation speed of QEMU is faster than that of Bochs. Second, QEMU supports more processors than Bochs. Finally, the QEMU-SystemC framework has proved the feasibility of integrating QEMU and SystemC.

1) Dynamic Binary Translation Used by QEMU: Before we discuss the modifications required to convert QEMU from a system simulator to an ISS, we take a look at how the so-called Dynamic Binary Translation (DBT) [30], [31], [32] used by QEMU works. By using DBT, QEMU is able to emulate several target CPUs on several hosts; ideally, it can be extended to emulate as many target CPUs on as many hosts as we wish. The purpose of DBT is to convert the code compiled for a target processor into code for the host processor at run-time. This is fundamentally different from the so-called static binary translation (SBT) [33], [34], [35] technique, which translates the code off-line.

As shown in Fig. 1, the DBT used by QEMU to translate the target code to the host code is basically composed of two steps [36]. The details are given below.
1) The first step, composed of three sub-steps, is to generate the dynamic code generator (DCG) off-line, as follows:
   a) The first sub-step is to split each of the target instructions into a sequence of micro-operations. Each micro-operation is then hand-coded as a tiny function (TF) in C.
   b) The second sub-step is to have the set of TFs compiled by gcc [37] into host code in an object file in the host object file format, such as ELF on Linux.
   c) The third sub-step is that either dyngen or the Tiny Code Generator (TCG)², taking as input the object file containing all the micro-operations, is carried out to generate the dynamic code generator (DCG).
2) The second step, consisting of two sub-steps, is to translate the target code to the host code at run-time, as follows:
   a) The first sub-step is to translate each of the target CPU instructions into a sequence of indices, each of which uniquely identifies a TF.
   b) The second sub-step is for the DCG to copy the host code corresponding to each TF whose index is given in the first sub-step into a Translation Block (TB), in the order given in the first sub-step and with constant parameters patched.

²Note that dyngen is for QEMU version 0.9.x while TCG is for QEMU version 0.10.x. The major difference between dyngen and TCG is that the former is gcc version dependent while the latter has been designed as a replacement for dyngen to make it gcc version independent.

In short, the DBT used by QEMU can be divided into two steps: the first is the preparation step, while the second is the translation step.
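To make the run-time half of this process concrete, the following self-contained sketch (our own illustration, not QEMU code) mimics step 2 with ordinary C++: a front end maps each "target instruction" to an index into a table of pre-built tiny functions, and a stand-in for the DCG assembles a translation block from those indices. Real DBT copies host machine code into an executable buffer; here the copying is approximated with function pointers.

    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // One "tiny function" per micro-operation; a real DCG would copy the host
    // machine code of each TF into an executable buffer instead.
    using TinyFn = void (*)(uint32_t operand, uint32_t regs[]);

    static void tf_load_imm(uint32_t imm, uint32_t regs[]) { regs[0] = imm; }
    static void tf_add_r1  (uint32_t,     uint32_t regs[]) { regs[0] += regs[1]; }
    static void tf_store_r2(uint32_t,     uint32_t regs[]) { regs[2] = regs[0]; }

    static const TinyFn kTinyFns[] = { tf_load_imm, tf_add_r1, tf_store_r2 };

    struct MicroOp { unsigned index; uint32_t operand; };   // output of the ICG
    using TranslationBlock = std::vector<MicroOp>;          // output of the "DCG"

    // Step 2a: the front end (ICG) turns a target basic block into indices.
    TranslationBlock translate(const std::vector<std::pair<unsigned, uint32_t>>& target) {
        TranslationBlock tb;
        for (const auto& insn : target) tb.push_back({insn.first, insn.second});
        return tb;
    }

    // Step 2b: "execute" the translation block by dispatching the tiny functions.
    void run(const TranslationBlock& tb, uint32_t regs[]) {
        for (const MicroOp& op : tb) kTinyFns[op.index](op.operand, regs);
    }

    int main() {
        uint32_t regs[3] = {0, 5, 0};
        // A toy basic block: load 7 into r0, add r1, store r0 into r2.
        TranslationBlock tb = translate({{0, 7}, {1, 0}, {2, 0}});
        run(tb, regs);
        std::printf("r2 = %u\n", (unsigned)regs[2]);   // prints r2 = 12
        return 0;
    }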



The first step is responsible for generating the DCG off-line. The second step is responsible for actually translating each of the target instructions into host code, by first translating the target instructions into a sequence of indices to TFs and then translating the sequence of indices into host code. In other words, if we think of the sequence of indices as the intermediate code, or the watershed between the front end and the back end, then the DCG plays the role of the back end of a compiler.

From the viewpoint of QEMU, the target instructions to be executed have to be loaded into memory first. Then, they are translated instruction by instruction into host code at run-time, up to the next branch, jump, or instruction that interrupts the flow of execution. At that point, the translation is suspended, and the host code in the Translation Block Cache (TBC), which is composed of one or more TBs (the basic block of QEMU) chained together, is executed. The purpose of chaining TBs together is so that they can be executed together without returning control to the QEMU monitor. In addition, to speed up performance, QEMU does not check at every TB whether a virtual hardware interrupt is pending. Instead, the virtual interrupt hardware needs to call a specific function asynchronously to announce that an interrupt is pending.

A simplified view of the CPU main loop of QEMU is given in Fig. 2, assuming that the virtual machine has just been powered on or reset. In that case, the interrupt signals of the virtual machine are all disabled, and the TBC is empty. Then, as Fig. 2 shows, the very first thing the CPU main loop of QEMU does is process all the pending interrupts, if any, by setting up calls to the corresponding interrupt service routines (ISRs) so that they will get translated by the DBT into host code. Then, it starts executing whatever host code is in the TBC (see Fig. 4). However, if there is a TBC miss, the DBT is launched. After that, the CPU main loop is re-entered, and the execution of the host code is re-started; this time, it will be a TBC hit because QEMU has just filled some of the TBs in the TBC. In other words, the execution of the host code, and thus the DBT, to some extent alternates between the host code of the non-ISR target code and the host code of the ISR target code. As such, the target code, be it an application or an operating system, can be emulated by the host. The CPU main loop loops forever until the virtual CPU aborts, which in turn halts the processor.

2) Hardware Emulation Features in QEMU: QEMU provides two execution modes: the user mode and the system mode [36]. The user mode is provided to execute programs directly. The system mode is provided to execute an OS of the target CPU with a software memory management unit (MMU). Since our goal is to build a full-fledged system simulation framework, we focus on the system mode with a software MMU. The way the load and store instructions of the target CPU access memory depends on how the virtual address space of the target OS is mapped to the virtual address space of the host processor. As for I/O, QEMU defines a set of callback functions in C acting as the I/O interface, which can be used to model the virtual hardware devices of a QEMU virtual platform.

Fig. 2. A simplified view of the CPU main loop of QEMU: process pending interrupts; if there is a TBC miss, launch the DBT; otherwise, execute the host code in the TBC.
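The loop in Fig. 2 can be paraphrased in code. The sketch below is our own simplified rendering under assumed names (it is not QEMU source): a PC-indexed cache of translation blocks, a translate-on-miss path, and an asynchronously set interrupt-pending flag checked at the top of the loop.

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <unordered_map>

    // Hypothetical stand-ins for QEMU internals, for illustration only.
    struct TB { std::function<uint32_t(uint32_t)> host_code; };   // takes PC, returns next PC

    static std::unordered_map<uint32_t, TB> tbc;                  // translation block cache
    static std::atomic<bool> interrupt_pending{false};            // set asynchronously by devices
    static std::atomic<bool> cpu_aborted{false};

    // A TBC miss "launches the DBT": here we fabricate a block that just advances the PC.
    static TB translate_with_dbt(uint32_t pc) {
        std::printf("DBT: translating block at PC=0x%04x\n", (unsigned)pc);
        return TB{ [](uint32_t p) { return p + 4; } };
    }

    static uint32_t isr_entry_for_pending_interrupt() { return 0x0018; }  // fake vector

    static void cpu_main_loop(uint32_t pc) {
        while (!cpu_aborted) {
            if (interrupt_pending.exchange(false))        // pending interrupts are handled first
                pc = isr_entry_for_pending_interrupt();   // the ISR is translated like any other code

            auto it = tbc.find(pc);
            if (it == tbc.end())                          // TBC miss: translate up to the next branch
                it = tbc.emplace(pc, translate_with_dbt(pc)).first;

            pc = it->second.host_code(pc);                // execute the (possibly chained) host code
            if (pc > 0x0020) cpu_aborted = true;          // toy exit so the example terminates
        }
    }

    int main() {
        interrupt_pending = true;   // pretend a device raised an interrupt at power-on
        cpu_main_loop(0x0000);
        return 0;
    }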

Most of the hardware interrupt sources are connected to the virtual interrupt controller modeled through the I/O interface of QEMU. The virtual interrupt controller is ultimately connected to the virtual CPU, which in turn calls a specific function asynchronously to inform the CPU main loop of QEMU that an interrupt is pending.

C. Processor-Based SoC Development Framework

Our experience shows that state-of-the-art tools for processor-based SoC development, whether or not they are virtual platforms, all contain certain fundamental elements:
• Abundant Intellectual Property (IP) libraries: The more IP models the libraries provide, the more versatile the system configurations can be. In addition, most of the processor models are composed of an ISS encapsulated in a wrapper written in SystemC.
• Design space exploration: If the design space exploration cannot provide sufficient and accurate statistics, the system designer will not be able to determine which Pareto optimal configuration fits the system requirements. (As defined in [38], a solution is Pareto optimal if there exists no feasible solution for which an improvement in one objective does not lead to a simultaneous degradation in one or more of the other objectives.)
• Essential abstraction levels of models: In addition to facilitating the productivity of the ESL design flow, hardware IP models need to be delivered at different levels of detail. For example, transaction level models are more appropriate for software development or hardware modeling at an early stage of the design. On the other hand, bus-cycle-accurate models are usually used for co-simulation with other RTL models or for hardware/software co-verification.

1) Commercial EDA Tools: As is typical of commercial EDA tools for SoC development, CoWare's Platform Architect [15] consists of several libraries. One of them is called the CoWare Model Library [39], which is made up of several processor packages, on-chip buses, memory subsystems, and several on-chip peripherals at both the transaction and cycle accurate levels.



The processor packages, including ARM, PowerPC, MIPS, and so on, all contain an ISS to which an executable can be loaded to simulate an SoC connected to other devices via a transaction-level or pin-level on-chip bus, such as AMBA, AXI, or CoreConnect. Also, they all allow certain statistics to be collected for design space exploration, namely, bus transactions, bus utilization, bus contention, function call traces, and so on. Moreover, the TLM Application Programming Interface (API), a low-level library also provided by CoWare, helps designers develop their own hardware models [40], [41].

Another commercial EDA tool is ARM's RealView Development Suite (RVDS) [16]. Although it focuses only on the ARM processor family, RVDS provides ARM's RealTime System Models [42], which use the so-called System Generator [43] to create the virtual platform for a specific machine configuration that can be modified later. Because most of the peripheral models are associated with the ARM PrimeCells, RVDS concentrates more on profiling, software debugging, and optimization.

Still another commercial EDA tool is Synopsys's Innovator [17]. It provides the DesignWare System-Level Library, which contains tool-independent and SystemC TLM-2.0 [44] compatible models [45]. The processor packages it provides for virtual platform configuration are ARM, MIPS, and PowerPC. Moreover, Innovator is not only capable of mapping virtual I/O to real world devices such as a keyboard and mouse, it is also capable of platform analysis at the transaction level.

2) QEMU-SystemC: QEMU-SystemC [21] is an open source software/hardware emulation framework for SoC development. It allows devices to be inserted into specific addresses of QEMU and communicates with them by means of the PCI/AMBA bus interface, as shown in Fig. 3 [21].

[Fig. 3 block diagram: a Linux application and device driver running on QEMU access the PCI/AMBA interface, which is connected through a PCI/AMBA-to-SystemC bridge to the SystemC module.]

Fig. 3. The block diagram of QEMU-SystemC [21]. The functional descriptions of the PCI/AMBA interface and the PCI/AMBA-to-SystemC bridge in the block diagram are different from those in the original paper, but they are identical from the implementation perspective.

The advantage of QEMU-SystemC is that the hardware designers need not worry about the detailed configuration of the virtual platform, such as the system memory mapping table, because the mapping is fixed on a portion of the address space of QEMU. The disadvantages, or limitations, of QEMU-SystemC can be summarized as follows:

• Most of the system configurations cannot be changed: This is because QEMU-SystemC registers a fixed address space with QEMU. As such, most of the system configurations cannot be changed as far as QEMU-SystemC is concerned.
• Only a channel for sending hardware interrupts is provided: Under certain circumstances, the hardware designers need to model hardware that must not only send but also receive interrupts, such as an interrupt controller, but no such interface is provided by QEMU-SystemC.
• It provides insufficient information for design space exploration: Although QEMU-SystemC can facilitate the development of device drivers, it provides insufficient information for design space exploration. For example, it provides no trace of the instructions executed, the addresses of the memory accessed, and so on, for the system designers.

D. Communication between QEMU and SystemC

As noted earlier, to provide the capability of design space exploration and to play the role of an ISS, QEMU has to be able to send all the information needed for design space exploration, such as the instructions executed and the addresses of the memory accessed, to SystemC. As such, the communication between QEMU and SystemC needs to be implemented in such a way that the communication overhead is either completely eliminated or greatly reduced. To address this issue, we tested several communication mechanisms. Socket-based IPC is often used for frameworks that need to communicate with GDB or with external devices, for example through the GDB remote debug interface (RDI) [6], [7], [9], [10] or a board-based Ethernet interface [11], and is suitable for different simulation environments. On the other hand, pipe-based IPC can be used with an ISS with or without GDB support [7], [46]. Shared memory remapping provides still another communication mechanism between QEMU and SystemC [8], and most of the memory address translations required by this mechanism are taken care of internally by the software MMU of QEMU [36].

III. THE PROPOSED FRAMEWORK


The original idea behind QEMU is to provide a virtual machine that can do whatever a physical machine can do. In other words, the virtual machine should be able to run a full-fledged OS and all the applications that can run on that OS, and run them fast. Therefore, most of the design considerations of QEMU are about enhancing the performance of QEMU itself, especially when emulating an OS; as such, several tricky techniques are used. However, such design choices also increase the difficulty of turning QEMU into an instruction-accurate ISS for the framework we propose herein for virtual platform construction and design space exploration on SoC development. Another performance enhancement in QEMU is that all the transactions between individual hardware models connected by the on-chip infrastructure, such as a bus bridge or on-chip bus, are made transparent in QEMU.

Instead of modifying the internals of QEMU, all QEMU-SystemC does is reserve a region of the virtual address space of QEMU for the memory-mapped I/O of devices modeled in SystemC. However, the price to pay for such a design choice is that it limits the possibilities of platform analysis and design space exploration at the system level, as well as the possibilities of hardware model configuration.



A. From System Emulator to Instruction-Accurate ISS

In order to fulfill the requirements of design space exploration, the very first thing we have to do is make QEMU behave as an instruction-accurate ISS, in the sense that it provides all the information SystemC needs for hardware/software co-simulation. This requires that the internals of QEMU be modified. Although the modifications depend somewhat on the type of target processor QEMU supports, the general ideas are basically the same. In other words, the modifications can be divided into two parts. First comes the part of implicitly extracting the information SystemC needs for co-simulation from QEMU, such as all the target instructions executed, all the virtual addresses referenced, all the data accessed, and so on. Then comes the part of explicitly providing interfaces to support things like virtual hardware interrupts and I/O. Although the first part is sufficient to make QEMU an instruction-accurate ISS, the second part is indispensable as far as hardware/software co-simulation is concerned.

To make the information extraction totally transparent, we rely on the so-called helper functions of TCG. The purpose of the helper functions of TCG is to wrap up whatever functions are to be called, such as the library functions provided by QEMU, to ensure that they comply with the coding rules imposed by the TCG of QEMU version 0.10.x. The most important difference between the tiny functions and the helper functions is that the tiny functions are manually created to map the target ISA to the host code, while the helper functions are designed so that such a translation is done automatically by TCG. In other words, the purpose of the helper functions is to make the calls to the library functions consistent with the coding rules of TCG.

To use the helper functions of TCG to extract the information SystemC needs for co-simulation, several things have to be done, and we discuss them one at a time below. However, instead of giving the low-level details, we focus on the high-level view of what has to be done. To simplify the discussion that follows, let us assume that f is the name of the helper function to be defined, tr the type of the return value, and ti the type of parameter i.
1) For each helper function f to be defined, the first thing to do is to use the macro DEF_HELPER_n(f, tr, t1, ..., tn), where n >= 0 is the number of input parameters (n = 0 indicating that there is no input parameter), to generate three pieces of code: (1) the prototype of the helper function helper_f, (2) the 'op' helper function gen_helper_f to be called by the DBT to generate the host code that calls the helper function, and (3) the code to register the helper function at run-time for the purpose of debugging. For instance, the macro

    DEF_HELPER_2(fetch_insn, void, i32, i32)

will generate the following code:

     1  void helper_fetch_insn (uint32_t, uint32_t);
     2
     3  static inline void gen_helper_fetch_insn(TCGv_i32 arg1, TCGv_i32 arg2)
     4  {
     5      TCGArg args[2];
     6      int sizemask;
     7      sizemask = 0;
     8      args[1 - 1] = GET_TCGV_I32(arg1);
     9      sizemask |= 0 << 1;
    10      args[2 - 1] = GET_TCGV_I32(arg2);
    11      sizemask |= 0 << 2;
    12      tcg_gen_helperN(helper_fetch_insn, 0, sizemask,
    13                      TCG_CALL_DUMMY_ARG, 2, args);
    14  }
    15
    16  tcg_register_helper(helper_fetch_insn, "fetch_insn");

The prototype of the helper function is given in line 1. The 'op' helper function is defined in lines 3 through 14. The code for registering the helper function is given in line 16.
2) The second thing is to define the helper function helper_f declared above, by wrapping whatever is to be executed inside the helper function. The helper function will get called by the host code generated by the 'op' helper function defined above and executed together with the host code of each target instruction, as shown in Fig. 5.
3) The third thing is to use the tcg_const_i32 function or its variants to turn values that vary from instruction to instruction at the time of dynamic binary translation into constants at run-time. This can be exemplified by the so-called Virtual Program Counter (VPC) or Virtual Instruction Pointer (VIP); though the two terms mean the same thing and can be used interchangeably, we use VPC throughout the paper. The value of the VPC is the address of the next target instruction to be executed, i.e., before it is translated to the host code. In other words, its value varies every time an instruction is fetched. In this case, we can use the tcg_const_i32 function or its variants to save the value somewhere else so that it can be retrieved as a constant at run-time. Similarly, each target instruction can be saved away and retrieved as a constant at run-time using exactly the same mechanism.

Once we have defined all the helper functions and the 'op' helper functions for retrieving the information SystemC needs, all we have to do is find the right place to insert each 'op' helper function so that it will get called at the time of DBT to generate the host code that calls the helper function associated with it. As such, the helper functions will be executed along with the target instructions. The key information an instruction-accurate ISS needs to provide consists of the addresses and data of the target instructions fetched and the memory accessed; we discuss how they are retrieved below.
1) Simulating the target instruction fetch stage: As far as QEMU is concerned, the target instructions are not fetched; rather, they are dynamically translated to host code. In other words, the instructions fetched at run-time are host instructions instead of the target instructions SystemC needs. To solve this problem, all we have to do is insert an 'op' helper function, the parameters of which are the address of the target instruction and the target instruction itself, right after the place where the target instruction is fetched for DBT.


IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, VOL. XX, NO. X, MMM YYYY

TABLE I
NOTATIONS USED IN FIGS. 4 AND 5

TB          Translation block.
TI          Target instruction.
TF          Tiny function.
IXC         Information extraction code.
TB′         TB with IXC.
IXC_a^ml    Memory load address IXC.
IXC_d^ml    Memory load data IXC.
TF^ml       TF for memory load, i.e., with IXC_a^ml and IXC_d^ml.
IXC^ms      Memory store IXC.
TF^ms       TF for memory store, i.e., with IXC^ms.
TBC         TB cache, which is composed of one or more TBs chained together.

The 'op' helper function so inserted will get called at the time of DBT to generate the host code, placed right before each target instruction, that calls the associated helper function and passes it the parameters above, so that the helper function can send the address of the target instruction and the target instruction itself to SystemC, as shown in Fig. 5 (cf. Fig. 4). Note that the notations used in Figs. 4 and 5 are summarized in Table I.
2) Simulating the memory access stage: Compared with QEMU version 0.9.1, the memory access interface of QEMU version 0.10.x has been simplified. As a result, all we have to do is define two helper functions: one to extract the address and one to extract the data. For a memory load, the 'op' helper functions for extracting the address and the data are inserted, respectively, right before and right after the memory load function. For a memory store, the 'op' helper functions for extracting the address and the data are inserted right before the memory store function. After that, the 'op' helper functions will be called by the DBT to generate the host code that calls the corresponding helper functions, and the results are as shown in Fig. 5(b) and (c).

Since the sole purpose of the helper functions is to extract the addresses and data from the target instructions executed by QEMU, not to modify their values, the behavior of executing the target instructions is guaranteed to be the same as if the helper functions did not exist, except that execution now takes longer because it consists of not only the "original" host code but also the host code for extracting the information SystemC needs.
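The paper does not list the bodies of these helper functions, so the sketch below is our own guess at their general shape: plain C-style functions (written here as C++) that do nothing except push a trace record toward SystemC. The record layout, the function names other than helper_fetch_insn, and the enqueue_for_systemc() call are all assumptions for illustration.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical trace record handed from QEMU to SystemC (layout is our assumption).
    struct TraceRecord {
        enum Kind { INSN_FETCH, MEM_LOAD_ADDR, MEM_LOAD_DATA, MEM_STORE } kind;
        uint32_t addr;
        uint32_t value;
    };

    // Stand-in for pushing into the shared-memory FIFO of Section III-D; here it just prints.
    void enqueue_for_systemc(const TraceRecord& r) {
        std::printf("kind=%d addr=0x%08x value=0x%08x\n",
                    (int)r.kind, (unsigned)r.addr, (unsigned)r.value);
    }

    // Called (via the generated gen_helper_fetch_insn) for each translated target
    // instruction: reports the instruction's address and its encoding.
    void helper_fetch_insn(uint32_t vpc, uint32_t insn) {
        enqueue_for_systemc({TraceRecord::INSN_FETCH, vpc, insn});
    }

    // Inserted right before a memory load: reports the load address.
    void helper_mem_load_addr(uint32_t addr) {
        enqueue_for_systemc({TraceRecord::MEM_LOAD_ADDR, addr, 0});
    }

    // Inserted right after a memory load: reports the loaded data.
    void helper_mem_load_data(uint32_t data) {
        enqueue_for_systemc({TraceRecord::MEM_LOAD_DATA, 0, data});
    }

    // Inserted right before a memory store: reports the address and the data to be stored.
    void helper_mem_store(uint32_t addr, uint32_t data) {
        enqueue_for_systemc({TraceRecord::MEM_STORE, addr, data});
    }

    int main() {                                   // tiny demonstration
        helper_fetch_insn(0x8000, 0xE3A00001u);    // e.g., "mov r0, #1" on ARM
        helper_mem_store(0x10140000u, 0x1u);       // a write into the PL190 VIC's I/O range
        return 0;
    }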

B. I/O Interface

To fulfill the requirements of being a system emulator, QEMU provides an I/O interface for porting the target processor into different virtual platforms. Although undocumented, most of the existing virtual platforms have been modeled and constructed based on the I/O interface of QEMU. The I/O interface can be divided into two categories, PCI and memory-mapped I/O, which are suitable for different target processors. For QEMU-SystemC [21] and the framework we propose herein, the communication mechanism between QEMU and SystemC is implemented based on this I/O interface.
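As background for readers unfamiliar with memory-mapped I/O, the sketch below shows the general idea in plain C++ (it is not QEMU's actual callback interface): loads and stores whose addresses fall inside a registered region are routed to a device's read/write callbacks instead of RAM. The region base, size, and device behavior are made up for illustration.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // A device exposes read/write callbacks for its memory-mapped registers.
    struct MmioDevice {
        uint32_t base, size;
        uint32_t (*read)(uint32_t offset);
        void (*write)(uint32_t offset, uint32_t value);
    };

    static std::vector<MmioDevice> mmio_regions;
    static uint8_t ram[1 << 16];                      // toy system RAM

    // A toy timer-like device with one status register at offset 0.
    static uint32_t timer_status;
    static uint32_t timer_read(uint32_t) { return timer_status; }
    static void timer_write(uint32_t, uint32_t v) { timer_status = v; }

    static uint32_t bus_load(uint32_t addr) {
        for (const MmioDevice& d : mmio_regions)      // I/O space takes precedence over RAM
            if (addr >= d.base && addr < d.base + d.size)
                return d.read(addr - d.base);
        return ram[addr % sizeof(ram)];               // otherwise treat it as a RAM access
    }

    static void bus_store(uint32_t addr, uint32_t value) {
        for (const MmioDevice& d : mmio_regions)
            if (addr >= d.base && addr < d.base + d.size) { d.write(addr - d.base, value); return; }
        ram[addr % sizeof(ram)] = (uint8_t)value;
    }

    int main() {
        mmio_regions.push_back({0x10001000, 0x100, timer_read, timer_write});  // register the device
        bus_store(0x10001000, 42);                    // lands in the device, not in RAM
        std::printf("timer status = %u\n", (unsigned)bus_load(0x10001000));    // prints 42
        return 0;
    }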

Fig. 4. TBC before insertion of IXC (cf. Fig. 5). Note that the lower layer is an enlargement of its immediate upper layer. In other words, the top layer shows that TBC is composed of an unknown number of TBs chained together. The second layer shows that each TB at the top layer is composed of an unknown number of TIs. The bottom layer shows that each TI in the second layer is composed of an unknown number of TFs. Or in short, TBC is composed of an unknown number of TFs.

Fig. 5. TBC after insertion of IXC (cf. Fig. 4). Note that the lower layer is an enlargement of its immediate upper layer. (a) Non-load and non-store instructions, (b) memory load instructions, and (c) memory store instructions.

C. Hardware Interrupt Mechanism

To make the proposed framework capable of model replacement, hardware models with interrupts are divided into three categories: (1) hardware models that can only receive interrupts, (2) hardware models that can only send interrupts, and (3) hardware models that can both receive and send interrupts. In the first category are the target processors, whose interrupts are handled internally by QEMU. In the second category are the peripherals that need to announce that the tasks they perform are completed by sending an interrupt signal to the target processor. In the third category are the interrupt controllers, which are responsible for gathering the interrupts raised by different peripherals and then either masking them or sending the interrupts gathered to either the upper level of the interrupt hierarchy or the target processor.


Since QEMU-SystemC models in SystemC only the peripherals that raise interrupts, it only needs to provide the capability of sending interrupts from SystemC to QEMU. The framework we propose herein has been made capable of both sending and receiving interrupts; as such, it can handle hardware models like the interrupt controllers discussed above.

The built-in interrupt mechanism of QEMU requires that each interrupt handler be registered by calling the qemu_allocate_irqs() function. The interrupt handler may process the incoming interrupt signals and then transmit the results to the upstream components. The return value of the qemu_allocate_irqs() function is a pointer used by the downstream peripherals to find their upstream components when they need to announce an interrupt. All the virtual hardware devices in QEMU use this mechanism to cascade all the interrupt controllers together to form an interrupt hierarchy. In order to maintain the built-in interrupt hierarchy after we replace models of QEMU with models written in SystemC, we need to exchange the interrupt settings between QEMU and SystemC. The interrupt connections need to be fixed before the system simulation begins, and they remain unchanged throughout the entire simulation. This can easily be done by sending a request to SystemC for the interrupt settings and then using the information sent back by SystemC to set up the interrupt connections between QEMU and SystemC. All we need to modify is the virtual hardware initialization and the interrupt handler provided for the downstream peripherals.

As Fig. 6 shows, as far as the Versatile/PB926EJ-S development board [47] is concerned, the interrupt hierarchy is composed of the PL190 Vectored Interrupt Controller (VIC), the Secondary Interrupt Controller (SIC), the PrimeCell PL011 UART controller, and the PL050 Keyboard/Mouse Interface (KMI) controller. From the viewpoint of the VIC, the SIC, UART, and KMI are downstream, while the IRQ and FIQ of the target processor are upstream of the VIC.

Fig. 6. Block diagram of the hardware interrupt hierarchy on Versatile/PB926EJ-S development board.
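For readers who have not seen QEMU's IRQ plumbing, the self-contained sketch below mimics the idea in plain C++ rather than reproducing QEMU's code: an "IRQ line" bundles a handler with an opaque pointer to its upstream component, so that a downstream device can raise an interrupt without knowing what sits above it, and controllers can be cascaded. All type and function names here are our own stand-ins, not QEMU's.

    #include <cstdio>
    #include <vector>

    // An IRQ line: raising it invokes the upstream component's handler.
    struct IrqLine {
        void (*handler)(void* opaque, int line, int level);
        void* opaque;     // the upstream component (e.g., a VIC or the CPU model)
        int   line;       // which input of the upstream component this is
    };

    static void set_irq(IrqLine& irq, int level) { irq.handler(irq.opaque, irq.line, level); }

    // A toy CPU: its single handler just reports the IRQ pin level.
    static void cpu_irq_handler(void*, int, int level) {
        std::printf("CPU IRQ pin driven to %d\n", level);
    }

    // A toy interrupt controller: ORs its enabled inputs onto one upstream line.
    struct Vic {
        unsigned pending = 0, enabled = 0;
        IrqLine  upstream{};                       // connection toward the CPU
        static void handler(void* opaque, int line, int level) {
            Vic* vic = static_cast<Vic*>(opaque);
            if (level) vic->pending |= (1u << line); else vic->pending &= ~(1u << line);
            set_irq(vic->upstream, (vic->pending & vic->enabled) != 0);
        }
        // Counterpart of qemu_allocate_irqs(): hand out input lines to downstream devices.
        std::vector<IrqLine> allocate_inputs(int n) {
            std::vector<IrqLine> lines;
            for (int i = 0; i < n; ++i) lines.push_back({&Vic::handler, this, i});
            return lines;
        }
    };

    int main() {
        Vic vic;
        vic.enabled  = 0x2;                                   // only input 1 is unmasked
        vic.upstream = {cpu_irq_handler, nullptr, 0};
        std::vector<IrqLine> inputs = vic.allocate_inputs(4); // e.g., UART, KMI, SIC, ...
        set_irq(inputs[0], 1);                                // masked: CPU pin reported as 0
        set_irq(inputs[1], 1);                                // unmasked: CPU pin reported as 1
        return 0;
    }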


D. Communication between ISS and SystemC

QEMU uses the DBT technique to support several target CPUs. It also specifies an interface for adding virtual hardware written in C. Although it may seem that hardware can be modeled without SystemC, the C programming language essentially provides no facilities for precisely modeling the behavior of hardware, e.g., concurrency, even though it has been widely used to create so-called C models. SystemC, on the other hand, has been designed from scratch for precisely modeling the behavior of hardware. As such, it provides all the facilities required by hardware designers and can compensate for whatever is lacking in C and QEMU, especially where hardware models are concerned. Moreover, since most of the memory address translations have been taken care of by QEMU, we can focus our attention on the communication mechanism and the information exchange between QEMU and SystemC.

In practice, the performance of an IPC mechanism, whether our concern is latency or bandwidth, depends to a certain degree on the host operating system. As a consequence, the experimental results will vary from system to system [48]. The socket-based IPC mechanism allows QEMU and SystemC to be executed on different hosts, whereas the pipe-based IPC mechanism and the shared memory mechanism only allow co-simulation on the same host. No matter which approach is adopted, context switches between QEMU and SystemC are unavoidable unless QEMU and SystemC are implemented as a single thread running in a single process, which is too restrictive. Thus, we adopt the approach of implementing QEMU and SystemC as two threads, since context switches between threads are generally much faster than those between processes. Moreover, as far as the proposed framework is concerned, the shared memory mechanism is adopted and used as a FIFO between QEMU and SystemC, as shown in Fig. 7. In other words, by design, the communication between QEMU and SystemC is one-way, so that the relative order of the instructions executed, the memory accessed, and the I/O write operations performed is retained. In addition, the synchronization of the I/O read operations is achieved by having QEMU call the I/O read function, which passes a pointer to the data to be read to SystemC, and then block until SystemC returns. Our experimental results, given in Section IV-C, show that the framework we propose in the paper can boot up a full-fledged Linux kernel in less than 15 minutes on average; even in the worst case, it takes no more than 16 minutes.

Fig. 7. The inter-process communication (IPC) mechanism used by the framework we proposed in the paper: QEMU and SystemC run as two threads within a single process and communicate through shared memory.
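The one-way FIFO described above can be pictured with the following self-contained sketch: a single-producer, single-consumer ring buffer shared by two threads, with the producer standing in for QEMU and the consumer for SystemC. The buffer size, record type, and busy-wait synchronization are our own simplifications; they are not the paper's implementation.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    // Single-producer/single-consumer ring buffer: "QEMU" pushes trace records,
    // "SystemC" pops them, so the order of instructions, memory accesses, and
    // I/O writes is preserved.
    struct Record { uint32_t addr, data; };

    constexpr std::size_t kSlots = 1024;                 // must be a power of two
    Record buffer[kSlots];
    std::atomic<std::size_t> head{0}, tail{0};           // head: next write, tail: next read

    void push(const Record& r) {                         // called from the "QEMU" thread
        std::size_t h = head.load(std::memory_order_relaxed);
        while (h - tail.load(std::memory_order_acquire) == kSlots) { /* full: spin */ }
        buffer[h % kSlots] = r;
        head.store(h + 1, std::memory_order_release);
    }

    bool pop(Record& r) {                                // called from the "SystemC" thread
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;   // empty
        r = buffer[t % kSlots];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    int main() {
        std::thread producer([] {                        // stands in for the QEMU thread
            for (uint32_t pc = 0; pc < 8; ++pc) push({pc * 4, 0xE1A00000u});  // ARM NOPs
        });
        std::thread consumer([] {                        // stands in for the SystemC thread
            for (int n = 0; n < 8;) {
                Record r;
                if (pop(r)) { std::printf("0x%08x: 0x%08x\n", (unsigned)r.addr, (unsigned)r.data); ++n; }
            }
        });
        producer.join();
        consumer.join();
        return 0;
    }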

E. Portability of Proposed Framework

At the time of this writing, a new release of QEMU (version 0.10.0) was announced. Our proposed framework was originally built on QEMU version 0.9.1. QEMU version 0.10.0 provides a new tool called the Tiny Code Generator (TCG) as a replacement for the dyngen tool, which is strongly coupled with gcc version 3. In other words, TCG successfully gets rid of this undesirable dependency.



Although the helper function interface and the handling of constant parameters differ from the previous version, the concept of DBT is unchanged. Thus, as far as the proposed framework is concerned, our experience shows that the migration from QEMU version 0.9.1 to QEMU version 0.10.0 took less than a week, which includes first understanding TCG and then finishing the implementation. In other words, this shows that the framework we propose herein can benefit from any enhancements made to QEMU, and of course from any enhancements made to SystemC, because as far as the proposed framework is concerned, QEMU and SystemC are loosely coupled.

IV. EXPERIMENTAL RESULTS

In this section, we demonstrate the performance of the proposed framework in terms of three different measures: (1) flexibility for model replacement, (2) the statistics that can be collected while the system is up and running, and (3) the time it takes to boot up a full-fledged Linux kernel. For all the experimental results given in this section, a machine with a 2.00 GHz Intel Core 2 Duo T7200 processor and 1.5 GB of memory is used as the host, and the target OS is built using the BuildRoot package [49], which is capable of automatically generating almost everything we need, including the cross-compilation tool chain, the target kernel image, and the initial RAM disk. The Linux distribution of the host is Fedora 11, and its kernel is Linux version 2.6.29.4-167. QEMU version 0.10.5 and SystemC version 2.2.0 (including the reference simulator provided by OSCI) are all compiled with gcc version 4.4.0.

A. Flexibility for Model Replacement

The flexibility for model replacement can be demonstrated by migrating the PrimeCell PL190 VIC of the virtual platform provided by QEMU to a PL190 VIC written in SystemC, as shown in Fig. 8, which is essentially the same as Fig. 6 except that the PL190 VIC is moved out of QEMU and is written completely in SystemC. Since the PL190 VIC is one of the two interrupt controllers on the virtual Versatile/PB926EJ-S platform and is the one that actually interacts with the ARM926EJ-S, the correctness of its implementation in SystemC can be easily verified. In other words, any bug in the PL190 VIC written in SystemC will sooner or later crash the OS in question. Thereby, a few runs of ARM Linux on the virtual Versatile/PB926EJ-S platform of QEMU essentially give us a hint regarding the correctness of the PL190 VIC in SystemC. A snapshot of running ARM Linux on QEMU with the PL190 VIC hardware model implemented in SystemC is shown in Fig. 9.

B. Statistics of Instructions and Waveform of AMBA On-Chip Bus

The major difference between QEMU-SystemC and the framework we describe herein can easily be seen from the demonstration given in this section.

Fig. 8. The block diagram of virtual Versatile/PB926EJ-S platform on QEMU with the PL190 VIC model (in C) replaced by a hardware model written in SystemC.
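To give a flavor of what a SystemC replacement model looks like, here is a heavily simplified, self-contained sketch of a VIC-like module; the port names, the single combined output, and the omission of the bus interface, vectored priorities, and the FIQ path are all our own simplifications, not the PL190 model used in the paper (it assumes a SystemC installation providing systemc.h).

    #include <systemc.h>
    #include <iostream>

    // Minimal VIC-like module: whenever any enabled interrupt source is active,
    // the combined IRQ output toward the processor is asserted.
    SC_MODULE(SimpleVic) {
        sc_in<sc_uint<32> >  irq_sources;   // one bit per downstream peripheral
        sc_in<sc_uint<32> >  int_enable;    // enable mask (normally written over the bus)
        sc_out<bool>         irq_to_cpu;    // combined interrupt toward the ARM core

        void evaluate() {
            irq_to_cpu.write((irq_sources.read() & int_enable.read()) != 0);
        }

        SC_CTOR(SimpleVic) {
            SC_METHOD(evaluate);
            sensitive << irq_sources << int_enable;
        }
    };

    int sc_main(int, char*[]) {
        sc_signal<sc_uint<32> > sources, enable;
        sc_signal<bool> irq;

        SimpleVic vic("vic");
        vic.irq_sources(sources);
        vic.int_enable(enable);
        vic.irq_to_cpu(irq);

        enable = 0x2;  sources = 0x1;  sc_start(1, SC_NS);   // masked source: no interrupt
        std::cout << "irq = " << irq.read() << std::endl;    // prints irq = 0
        sources = 0x3;                 sc_start(1, SC_NS);   // unmasked source becomes active
        std::cout << "irq = " << irq.read() << std::endl;    // prints irq = 1
        return 0;
    }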

QEMU-SystemC can only trace the waveforms of hardware models written in SystemC, as given in Fig. 10(a) and (b), whereas the framework we propose herein can trace not only the waveforms of hardware models written in SystemC but also all the instructions executed and all the memory accessed, as shown in Fig. 10(c) and (d), using exactly the same AMBA bus protocol. Note that the waveforms given in Fig. 10 are displayed by another open source tool called GTKWave [50], a fully featured GTK+-based wave viewer for Unix and Win32 that reads standard Verilog VCD/EVCD files and allows their contents to be viewed.

The primary differences between QEMU-SystemC and the framework we propose herein are shown by the waveforms of the address and data signals of the AMBA interface. QEMU-SystemC can only provide the signal transitions of the I/O, whereas the framework we propose herein can provide not only the signal transitions of the I/O but also all the instructions executed and all the memory accessed, starting from the very first moment the system is booted up, which is probably what system designers care about most. Moreover, the load and store operations and the I/O accesses to the hardware models written in SystemC can all be recognized from the sequence of instructions executed. These statistics are extremely useful as far as design space exploration for SoC development is concerned.

As a demonstration, the statistics are shown in Tables III and IV, and the notations are defined in Table II. In order to gather the statistics, the initial shell script is modified to enable the option of rebooting the virtual machine automatically as soon as the booting sequence is completed. Furthermore, the pre-defined no-reboot option of QEMU catches the reboot signal once the OS executes the reboot command in the shell script and then shuts QEMU down. Thus, the test bench can easily measure the co-simulation time of QEMU and SystemC at the OS level.

As shown in Table III, the column labeled "N_TI" shows the number of target instructions actually executed by the virtual ARM processor. In other words, instructions fetched but not executed are not counted.


Fig. 9. A snapshot of running ARM Linux on QEMU with the PL190 VIC written in SystemC.

The columns labeled "N_LD" and "N_ST" present, respectively, the number of load and store operations of the virtual processor, including memory-mapped I/O. The column labeled "N_TI+LD+ST" gives the total number of target instructions executed and load and store operations performed; since the read/write operations of the VIC (PL190 in this case) are counted among the load and store operations of the virtual processor, N_TI+LD+ST = N_TI + N_LD + N_ST. The columns labeled "N_RD" and "N_WT" give an idea of the number of read/write transactions between the virtual processor and the VIC. They are given to show that the proposed framework can provide information as detailed as the number of read/write operations of the VIC. Note that all the numbers given are, as the names of the rows suggest, the min, max, and average of booting up ARM Linux and shutting it down immediately on the proposed framework 30 times. The percentages given in parentheses are computed as

    (N_α / N_TI+LD+ST) × 100%

where the subscript α is TI, LD, ST, TI+LD+ST, RD, or WT. For instance, the percentage given in the column labeled "N_TI" of the row labeled "µ" of Table III is computed as

    672,447,292.77 / 990,179,276.60 × 100% = 67.91%.

The others are computed similarly. They are shown to give an idea of what percentage of all the target instructions executed and load/store operations performed are target instructions, what percentage are load and store operations, and so on. As Table III shows, N_TI, N_LD, and N_ST account for about 68%, 24%, and 8%, respectively. The number of read/write operations between the virtual processor and the VIC, denoted N_RD and N_WT, accounts for an extremely small percentage.

Table IV shows the same results as Table III except that the numbers have been normalized per second instead of per run. This makes it easier to understand exactly how many instructions are executed or how many load and store operations are performed in a second in the worst, best, and average case. For instance, as the row labeled "µ" of Table IV shows, in the average case the proposed framework executes about 756,285.83 instructions and performs about 265,844.27 load and 91,501.37 store operations per second. Even in the worst case, as the row labeled "min" of Table IV shows, it still executes about 727,662.00 instructions and performs about 256,299.00 load and 88,724.00 store operations per second. However, simulation efficiency is not necessarily the best metric for measuring the simulation speed, because the number of instructions or operations simulated is not necessarily proportional to the time it takes.
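As a quick sanity check on how Table IV follows from Table III, the snippet below divides the mean counts by the mean co-simulation time (14m49.317s); the small residual differences against Table IV are expected, since the table averages the per-run rates rather than dividing the averages.

    #include <cstdio>

    int main() {
        // Mean values taken from Table III (30 boot-and-shutdown runs).
        const double co_sim_seconds = 14 * 60 + 49.317;    // mean co-simulation time, 14m49.317s
        const double n_ti = 672447292.77;                  // mean number of target instructions
        const double n_ld = 236356464.13;                  // mean number of load operations
        const double n_st = 81375519.70;                   // mean number of store operations

        // Per-second rates, comparable to the µ row of Table IV.
        std::printf("instructions/s : %.2f\n", n_ti / co_sim_seconds);
        std::printf("loads/s        : %.2f\n", n_ld / co_sim_seconds);
        std::printf("stores/s       : %.2f\n", n_st / co_sim_seconds);

        // Percentage breakdown, comparable to the parenthesized values in Table III.
        const double total = n_ti + n_ld + n_st;
        std::printf("instruction share: %.2f%%\n", n_ti / total * 100.0);   // about 67.91%
        return 0;
    }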


TABLE II
NOTATIONS USED IN TABLES III AND IV

min     The best-case co-simulation time and the worst-case simulation efficiency of the 30 runs, where the "simulation efficiency" is defined to be the number of instructions or operations simulated per second as far as this paper is concerned.
max     The worst-case co-simulation time and the best-case simulation efficiency of the 30 runs.
µ       The mean co-simulation time and simulation efficiency of the 30 runs.
σ       The standard deviation of the co-simulation time and simulation efficiency of the 30 runs.
N_TI    The number of target instructions simulated.
N_LD    The number of load operations of the virtual processor.
N_ST    The number of store operations of the virtual processor.
N_RD    The number of times the virtual processor reads data from the VIC (PL190).
N_WT    The number of times the virtual processor writes data to the VIC (PL190).

TABLE III
THE EXPERIMENTAL RESULTS OF BOOTING UP THE LINUX KERNEL ON THE PROPOSED FRAMEWORK FOR 30 TIMES

Statistics  Co-simulation time  N_TI                     N_LD                     N_ST                    N_TI+LD+ST                   N_RD                N_WT
min         14m01.994s          658,473,617.00 (67.96%)  231,778,725.00 (23.92%)  78,616,811.00 (8.11%)   968,869,153.00 (100.00%)     132,556.00 (0.01%)  198,556.00 (0.02%)
max         15m37.327s          701,925,646.00 (67.94%)  244,722,965.00 (23.69%)  86,537,514.00 (8.38%)   1,033,186,125.00 (100.00%)   150,312.00 (0.01%)  225,370.00 (0.02%)
µ           14m49.317s          672,447,292.77 (67.91%)  236,356,464.13 (23.87%)  81,375,519.70 (8.22%)   990,179,276.60 (100.00%)     141,016.27 (0.01%)  211,326.10 (0.02%)
σ           00m22.461s          13,997,515.54            3,911,802.32             2,405,392.30            20,314,710.16                4,264.56            6,421.29

TABLE IV
SIMULATION EFFICIENCY OF BOOTING UP THE LINUX KERNEL ON THE PROPOSED FRAMEWORK FOR 30 TIMES (I.E., TRANSACTIONS PER SECOND)

Statistics  N_TI                 N_LD                 N_ST               N_TI+LD+ST              N_RD            N_WT
min         727,662.00 (67.84%)  256,299.00 (23.89%)  88,724.00 (8.27%)  1,072,685.00 (100.00%)  157.00 (0.01%)  235.00 (0.02%)
max         782,040.00 (67.90%)  275,273.00 (23.90%)  94,408.00 (8.20%)  1,151,721.00 (100.00%)  160.00 (0.01%)  241.00 (0.02%)
µ           756,285.83 (67.91%)  265,844.27 (23.87%)  91,501.37 (8.22%)  1,113,631.47 (100.00%)  158.00 (0.01%)  237.17 (0.02%)
σ           10,303.83            3,565.36             1,277.69           15,146.88               1.06            1.63

C. Performance of Co-simulation on the Proposed Framework

One of the major concerns about a framework for virtual platform construction and design space exploration on SoC development is certainly the hardware/software co-simulation speed, especially when co-simulating a full-fledged OS. The co-simulation times shown in the column labeled "Co-simulation time" of Table III are collected using Linux's time command, and the configuration of the virtual platform is identical to that used in Section IV-B. The rows labeled "min," "max," and "µ" present, respectively, the best-case, the worst-case, and the average-case running time of booting up the kernel and shutting it down immediately, over 30 runs. The row labeled "σ" gives the variability. Note that only the real time of the time command, i.e., the time elapsed between invocation and termination of a co-simulation, is shown. From the perspective of the hardware designer, being able to co-simulate a full-fledged operating system in less than 15 minutes on average, and in less than 16 minutes in the worst case, is much faster than merely acceptable, especially in the early stage of SoC development.


V. CONCLUSION

This paper presents a framework for virtual platform construction and design space exploration on SoC development based on QEMU and SystemC. The proposed framework is not only flexible for model replacement, but it also provides the capability of design space exploration. Moreover, our experience shows that the proposed framework can even benefit from whatever enhancements are made to QEMU and SystemC. Our experimental results show that the vector interrupt controller of the existing virtual platform of QEMU, written in C, can easily be replaced by a hardware interrupt controller modeled in SystemC. Furthermore, design space exploration can easily be achieved by making QEMU play the role of an ISS, which sends all the information required for design space exploration to SystemC, such as the instructions executed, the addresses of the memory accessed, the I/O operations performed, and so on. Finally, the "faster than fast" co-simulation time indicates that the framework we propose in the paper makes it possible to co-simulate a full-fledged operating system in the early stage of SoC development.



Fig. 10. A snapshot of the waveform of the AMBA on-chip bus. (a) and (b) give a snapshot of QEMU-SystemC, while (c) and (d) show a snapshot of the proposed framework. (a) The address jumps directly from 0x0000 0000 to the I/O read/write address of the virtual hardware device modeled in SystemC, and there is no information associated with the instructions executed. (b) The waveform reveals no further information about the instructions executed and the memory accessed other than that of the virtual hardware modeled in SystemC. (c) The address is increased by 4 every time an instruction is fetched since the start-up of the virtual machine, and the waveform of HRDATA reveals the sequence of machine code executed by the ARM processor when running Linux. (d) HADDR, HRDATA, and HWDATA reveal that the operations alternate among the interconnect, the processor, and the virtual hardware modeled in SystemC. The addresses in the range from 0x1014 0000 to 0x1014 0400, i.e., 1 KB, are accesses to the internal memory-mapped I/O of the PL190 VIC modeled in SystemC.
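A waveform such as the one in Fig. 10 can be inspected with GTKWave once the bus signals are dumped to a VCD file. The sketch below, which assumes plain sc_signal-level modeling rather than the framework's actual AMBA AHB models, shows one way to trace HADDR, HRDATA, and HWDATA; the addresses driven here are arbitrary examples, not the actual bus traffic.

// Sketch: tracing AHB-style signals to a VCD file viewable in GTKWave.
#include <systemc.h>

int sc_main(int, char*[]) {
    sc_signal<sc_dt::sc_uint<32> > haddr("HADDR"), hrdata("HRDATA"), hwdata("HWDATA");

    sc_trace_file* tf = sc_create_vcd_trace_file("ahb_wave");  // writes ahb_wave.vcd
    sc_trace(tf, haddr,  "HADDR");
    sc_trace(tf, hrdata, "HRDATA");
    sc_trace(tf, hwdata, "HWDATA");

    // Drive a few sample transactions: sequential instruction fetches
    // (address + 4 each cycle) followed by a memory-mapped I/O access.
    for (unsigned i = 0; i < 4; ++i) {
        haddr.write(0x00000000 + 4 * i);
        sc_start(sc_time(10, SC_NS));
    }
    haddr.write(0x10140000);  // example access to the PL190 VIC register window
    sc_start(sc_time(10, SC_NS));

    sc_close_vcd_trace_file(tf);
    return 0;
}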



Ming-Chao Chiang received the B.S. degree in Management Science from National Chiao Tung University, Hsinchu, Taiwan in 1978 and the M.S., M.Phil., and Ph.D. degrees in Computer Science from Columbia University, New York, NY, U.S.A. in 1991, 1998, and 1998, respectively. He had over 12 years of experience in the software industry encompassing a wide variety of roles and responsibilities in both large and start-up companies before joining the faculty of the Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan in 2003, where he is currently an Assistant Professor. His current research interests include system software, web mining, image warping, and multimedia systems.

Tse-Chen Yeh received the B.S. and M.S. degrees, both in Information Engineering, from I-Shou University, Kaohsiung, Taiwan, in 1996 and 1998, respectively. He is currently working toward the Ph.D. degree in Computer Science and Engineering at National Sun Yat-sen University, Kaohsiung, Taiwan. His current research interests include system modeling, hardware/software co-simulation, and design space exploration.

Guo-Fu Tseng received the B.S. degree in Computer Science and Engineering from National Sun Yat-sen University, Kaohsiung, Taiwan, in 2007. He is currently working toward the M.S. degree in Computer Science and Engineering at National Sun Yat-sen University, Kaohsiung, Taiwan. His current research interests include computer networks, distributed systems, and the Linux operating system.
