
On the Interfacing between QEMU and SystemC for Virtual Platform Construction: Using DMA as a Case

Tse-Chen Yeh, Ming-Chao Chiang∗

Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan, R.O.C.

Abstract

In this paper, we present an interface for the hardware modeled in SystemC to access the hardware modeled in QEMU on a QEMU and SystemC-based virtual platform. By using QEMU as the instruction-accurate instruction set simulator (IA-ISS), with its capability to run a full-fledged operating system such as Linux, the virtual platform with the proposed interface can be used to facilitate the co-design of hardware models and device drivers at an early stage of the Electronic System Level (ESL) design flow. In other words, by using such a virtual platform, the hardware models and the associated device drivers can be cross-verified while they are being developed, so that malfunctions in the hardware models or the device drivers can be easily detected. Moreover, the virtual platform with the proposed interface is capable of providing statistics on the instructions executed, the memory accessed, and the I/O performed at the instruction-accurate level, thus not only making it easy to evaluate the performance of the hardware models but also making design space exploration possible.

Keywords: QEMU, SystemC, ESL, SoC, hardware modeling, DMA, OS, device driver

1. Introduction

To deal with the increasing complexity of System-on-Chip (SoC) designs, hardware/software co-simulation based on virtual platforms has become a popular approach in the Electronic System Level (ESL) design flow. In [18], three approaches to modeling the processor of a virtual platform are addressed: HDL, instruction set simulator (ISS), and formal. In general, the simulation speed of the HDL approach is far slower than that of the ISS and formal approaches. The formal approach, which uses "compiled simulation" to simulate software statically, is always faster than the ISS approach, which uses "interpretive simulation" to simulate software dynamically. In terms of simulation speed, although a formal approach such as LISA [20, 21, 16] is the fastest, an ISS approach such as QEMU-SystemC [28],1 which is capable of booting up a full-fledged Linux kernel in about 11 seconds, is generally fast enough to be acceptable to system architects and software designers. Another simulation framework [31, 24], which combines QEMU-SystemC with CoWare's Platform Architect, was proposed in 2009. Although the authors claim that they implement a so-called local master interface to access the host memory, no details whatsoever are provided. Fig. 1 shows the main differences in building a co-simulation environment on the ISS-based and virtual machine (VM) based virtual platforms. For the ISS-based virtual platform shown in Fig. 1(a), a lot of hardware and interconnect models need to be

∗ Corresponding author. Email address: mcchiang@cse.nsysu.edu.tw (Ming-Chao Chiang)
1 Throughout this paper, we will use QEMU-SystemC to refer to the virtual platform proposed in [28].
Preprint submitted to Systems Architecture, November 3, 2010

built so as to be able to run a full-fledged operating system (OS). Although it is not trivial to adapt the hardware models and the system functionalities to fit an unmodified OS, the accessibility between the hardware models can be retained if they are all implemented in a single language, say, SystemC. On the other hand, for the VM-based virtual platform given in Fig. 1(b), almost all the hardware models required to run an OS are already contained in the virtual platform; thus, all we have to do is extract the information from the processor model in the virtual machine to make it behave as an ISS [32]. Some ISSs have a predefined interface for connecting the memory and peripheral models; a good example is ARMulator [7]. Others, created from a VM such as QEMU [10], provide no documented interface for connecting external peripheral models, because they were designed to mimic a physical machine capable of running a full-fledged system rather than to serve merely as a simulator. A good example is the virtual platform described herein. One of the problems with such a virtual platform is that the hardware models are spread across QEMU and SystemC. The consequence is that it is impossible for models in SystemC to access models in QEMU without a core interface such as the one described herein.

Figure 1: The flow of building a co-simulation environment on the ISS-based and VM-based virtual platforms. (a) Conventional ISS-based virtual platform and (b) VM-based virtual platform. The shaded parts indicate the portions of the virtual platform that need to be built.

1.1. Motivation of the Work

In order to make the combination of QEMU and SystemC capable of accessing all the hardware modeled in QEMU and SystemC for hardware/software co-simulation, QEMU must provide SystemC with interfaces for memory access, for I/O operations initiated by the processor, and for interrupt handling, as well as for peripherals to access memory directly. Proposed in 2007, QEMU-SystemC was successful in exporting the I/O interface for virtual hardware devices modeled in SystemC; however, the I/O interface provided by QEMU-SystemC is only capable of simulating the operations of slave devices accessed by the processor model. In other words, the I/O interface provided by QEMU-SystemC is incapable of modeling master devices, which need to access other slave devices. To overcome this limitation, we propose a much more generic interface for connecting the master and slave ports of hardware devices modeled in SystemC to QEMU. To make the idea more concrete, we use a Direct Memory Access Controller (DMAC) modeled in SystemC as an example to illustrate how the proposed interface works.

1.2. Contribution of the Paper

The main contributions of the paper are threefold:
1. We propose an interface for connecting the master/slave ports of hardware devices modeled in SystemC to QEMU, which overcomes the limitations of QEMU-SystemC.
2. The virtual platform2 can facilitate the co-design of hardware models and device drivers at an early stage of the ESL design flow, even before the hardware platform is available. It can even be used to co-verify the correctness of the hardware models and the associated device drivers under development.
3. The virtual platform is capable of providing the statistics of booting up a full-fledged Linux kernel and of handling data movement using the DMAC. In other words, the virtual platform can even be used to benchmark the performance of the attached hardware.

1.3. Organization of the Paper

The remainder of the paper is organized as follows. The related work is given in Section 2. The proposed interface is presented in Section 3. The experimental results are summarized in Section 4. Section 5 concludes the work.

2. Related Work

In this section, we begin with a brief introduction to SystemC and the hardware emulation features the original version of QEMU provides, which to the best of our knowledge are not described in any published documents of QEMU. Next comes a brief description of QEMU-SystemC, the first virtual platform based on QEMU and SystemC, followed by a brief description of a virtual platform that combines the enhanced QEMU-SystemC wrapper with CoWare's Platform Architect. Then we introduce the ISS based on QEMU and SystemC, which is used as the processor model in the virtual platform we propose. Finally, we compare the simulation speed of several known ISS-based virtual platforms.

2.1. SystemC

SystemC is an ANSI standard C++ class library developed by the Open SystemC Initiative (OSCI) [29] in 1999 and approved as an IEEE standard in 2005 [22]. Due to the need for abstraction at different levels of detail, it has become one of the most popular modeling languages in the ESL design flow [4]. Because SystemC can simulate the concurrency, events, and signals of hardware, the abstraction of a hardware model can be achieved at the transaction level without the need to consider details down to the signal level [17, 13]. From the perspective of the ESL design flow, a platform-based design together with SystemC can satisfy the requirements of hardware/software partitioning, post-partition analysis, and verification using

2 Since there is no confusion possible, we will use “the virtual platform” to refer to “the virtual platform with the proposed interface” throughout the paper.



TLM and/or RTL modeling [9]. In this paper, SystemC is the language of choice for modeling hardware features, such as clocks and concurrency, that C cannot model.
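The transaction-level idea above can be conveyed with a small sketch. The following is not SystemC code; it is a plain C++ caricature (all names are ours) of a TLM-2.0-style blocking transport, in which a whole read or write completes in one function call instead of cycle-by-cycle signal activity.

```cpp
#include <cstdint>
#include <vector>
#include <cassert>

// Toy transaction payload, loosely modeled after TLM-2.0's generic payload.
struct Payload {
    bool     is_write;
    uint32_t address;
    uint32_t data;
};

// A target (say, a memory): a whole transaction completes in one function
// call -- the transaction-level idea -- instead of toggling signals.
struct Memory {
    std::vector<uint32_t> words = std::vector<uint32_t>(256, 0);
    void b_transport(Payload &p) {   // named after TLM's blocking transport
        if (p.is_write) words[p.address / 4] = p.data;
        else            p.data = words[p.address / 4];
    }
};

// An initiator (say, a processor model) issuing transactions to the target.
struct Initiator {
    Memory *bus;
    void write(uint32_t addr, uint32_t v) { Payload p{true, addr, v}; bus->b_transport(p); }
    uint32_t read(uint32_t addr)          { Payload p{false, addr, 0}; bus->b_transport(p); return p.data; }
};
```

The point of the sketch is only the call discipline: the initiator never sees address or data wires, just a payload handed across one function boundary.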


2.2. Hardware Emulation Features of QEMU

QEMU provides two execution modes: the user mode and the system mode [10]. The user mode is provided to execute programs directly. The system mode is provided to execute an OS of the target CPU with a software memory management unit (MMU). Since our goal is to simulate a full-fledged system, we will focus on the system mode with a software MMU. The way the load and store instructions of the target CPU access the memory depends on how the virtual address of the target OS is mapped to the virtual address of the host processor. As for the slave ports of I/O, QEMU predefines a set of callback functions in C to act as the slave I/O interface, which can be used to model the virtual hardware devices for the virtual platforms of QEMU. Hardware devices with master ports may access the other peripherals or memory areas directly. Because the memory of QEMU is managed by a software MMU, the most convenient way to access the memory of QEMU is thus to utilize the memory access functions defined by QEMU, which have been applied to a variety of virtual platforms provided by QEMU. In addition, most of the hardware interrupt sources are connected to a virtual interrupt controller modeled through the I/O interface of QEMU. The virtual interrupt controller is ultimately connected to the virtual CPU, which in turn calls a specific function asynchronously to inform the CPU main loop of QEMU that an interrupt is pending.
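To make the interrupt cascading concrete, here is a minimal, self-contained C++ model of the idea, not QEMU's actual code: each downstream device is handed an input line of the interrupt controller (the role qdev_init_gpio_in() plays in QEMU), and the controller ORs its pending inputs into one upstream line toward the CPU (raised or lowered the way qemu_set_irq() would). All type and variable names are illustrative.

```cpp
#include <functional>
#include <cassert>

// One interrupt line is just a callable taking a level (0 or 1).
using IrqLine = std::function<void(int)>;

// A toy interrupt controller: it hands out input lines to downstream
// devices and ORs the currently pending inputs into a single upstream
// line, which would ultimately reach the CPU main loop.
struct IntController {
    unsigned pending = 0;   // bitmask of currently asserted inputs
    IrqLine  upstream;      // e.g., the processor's nIRQ signal

    IrqLine input(int n) {  // give a downstream device its input line n
        return [this, n](int level) {
            if (level) pending |=  (1u << n);
            else       pending &= ~(1u << n);
            upstream(pending != 0 ? 1 : 0);   // cascade upward
        };
    }
};
```

A device "sitting in the middle" is exactly this pattern: it both receives interrupts (its input lines) and forwards them (its upstream line), which is why both directions are needed.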

2.4. Co-Simulating QEMU-SystemC with CoWare

Another framework [31, 24] that combines the enhanced QEMU-SystemC wrapper with CoWare's Platform Architect [3] is shown in Fig. 3. The QEMU-SystemC wrapper communicates with the CoWare-SystemC wrapper by using the interprocess communication (IPC) socket interface. This framework utilizes the bus models provided by the off-the-shelf Model Library [2], which supports many profiling and analysis capabilities. However, no details whatsoever about the proposed CoWare-SystemC wrapper are provided.


2.3. QEMU-SystemC

QEMU-SystemC [28] is an open source software/hardware emulation framework for SoC development. It allows devices to be inserted into specific addresses of QEMU and to communicate by means of the PCI/AMBA bus interface, as shown in Fig. 2. The bus interface was upgraded to the TLM-2.0 interface [8] in a follow-up version [27] in 2009.


Figure 3: The block diagram of the framework that combines QEMU-SystemC with CoWare’s Platform Architect. Note that “M” and “S” denote, respectively, the master and slave ports connected to the on-chip bus model.


2.5. QEMU and SystemC-based ISS


QEMU is in essence an instruction-accurate virtual machine (IA-VM); however, the instructions executed are only available for offline debugging. Fortunately, by leveraging the strengths of QEMU and SystemC, our implementation shows that this problem can be easily solved by converting QEMU from an IA-VM into an instruction-accurate instruction set simulator (IA-ISS). In practice, the performance of the ISS, whether our concern is latency or bandwidth, depends to a certain degree on the IPC mechanism of the host operating system in use. As a consequence, the experimental results will vary from system to system [23]. The socket-based IPC mechanism allows QEMU and

Figure 2: The block diagram of QEMU-SystemC [28]. The functional descriptions of PCI/AMBA interface and PCI/AMBA to SystemC bridge in the block diagram are different from those in the original paper but identical from the implementation perspective.

Although the waveform of the AMBA on-chip bus of the QEMU-SystemC framework can be used to trace the accesses to slave devices, no information about the processor and the master devices is available for the virtual platform. Because the follow-up TLM-2.0 version only changes the interface for modeling the bus connection, the same problem exists as in the PCI/AMBA version. For instance, the instructions executed, the memory accessed, and so on, which can be valuable to the system designers, are unfortunately not provided.


SystemC to be executed on different hosts, whereas the pipe-based IPC mechanism and the shared memory mechanism only allow co-simulation on the same host. No matter which approach is adopted, the context switches between QEMU and SystemC are unavoidable unless QEMU and SystemC are implemented as a single thread running in a single process, which, however, is too restrictive. Thus, we adopt the approach of implementing QEMU and SystemC as two threads in a single process, since context switches between threads are generally much faster than those between processes. Moreover, as far as this paper is concerned, the shared memory mechanism is designed and used as a unidirectional FIFO between QEMU and SystemC, as shown in Fig. 4. In other words, the communication between QEMU and SystemC is one-way, so that the relative order of the instructions executed, the memory accessed, and the I/O write operations performed is retained by the packet receiver within both the ISS wrapper and the infrastructure interface. Because the interface can simulate different bus transactions by using the information in the received packets, it can be used to build different Bus Functional Models (BFMs). In addition, synchronization between QEMU and SystemC is only needed for the I/O read operations, which can be achieved by having QEMU call the I/O read function, which passes the pointer to the data to be read to SystemC, and then block until SystemC returns. Furthermore, it is the infrastructure interface that is discussed in this paper; the details will be given in Section 3.
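The unidirectional FIFO between the QEMU thread and the SystemC thread can be sketched as a lock-free single-producer/single-consumer ring buffer. This is an illustrative stand-in, not the implementation from the paper; it shows why a one-way shared-memory FIFO preserves the relative order of the packets without any lock.

```cpp
#include <array>
#include <atomic>
#include <thread>
#include <cassert>
#include <cstddef>

// Fixed-size single-producer/single-consumer ring buffer. The producer
// (the QEMU thread) pushes packets and the consumer (the SystemC thread)
// pops them strictly in order, so the relative order of instructions,
// memory accesses, and I/O writes is preserved with no lock at all.
template <typename T, std::size_t N>
struct SpscFifo {
    std::array<T, N> buf{};
    std::atomic<std::size_t> head{0};   // next slot to write (producer only)
    std::atomic<std::size_t> tail{0};   // next slot to read (consumer only)

    bool push(const T &v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        std::size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire)) return false;  // full
        buf[h] = v;
        head.store(next, std::memory_order_release);   // publish the slot
        return true;
    }
    bool pop(T &v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;     // empty
        v = buf[t];
        tail.store((t + 1) % N, std::memory_order_release);   // free the slot
        return true;
    }
};
```

With exactly one writer of head and one writer of tail, the acquire/release pairs are all the synchronization this one-way channel needs; the blocking I/O read path described above is the only place where the two threads must rendezvous.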




3. Interfacing with Attached Hardware


Before we turn our discussion to the proposed interface, we will first look at the virtual platform of QEMU, as shown in Fig. 5(a). Basically, the virtual platform of QEMU is made up of the processor model, the software MMU, and the memory and memory-mapped I/O models managed by the software MMU. Moreover, a mechanism for cascading the interrupts is used to connect the downstream interrupt-driven hardware models to the topmost one, i.e., the interrupt signals of the processor model. As the block diagram of QEMU-SystemC in Fig. 5(b) shows, the interface of the external memory-mapped I/O and the upward-sending interrupt mechanism provides the fundamental capability to attach "simple" hardware models written in SystemC. However, the interface is not versatile enough for hardware models that need to access the memory model of QEMU, such as a DMAC, or for upstream hardware models that are capable of receiving the interrupts triggered by downstream devices, such as a vector interrupt controller (VIC). In this section, we turn our discussion to the interface we propose for attaching the virtual hardware modeled in SystemC to QEMU. The functions involved can be divided into three categories:


Figure 4: The IPC mechanism used by the ISS wrapper and infrastructure interface described herein.

2.6. Simulation Speed of Different Virtual Platforms

As described in [26], the simulation of a functional model at the instruction-accurate level can be made 1,000 to 100,000 times faster than a full cycle-accurate RTL simulation. Most of the processor-based platforms need to take into account the instruction simulation techniques of the selected ISS, and most of the fastest ISSs using interpretive simulation rely on dynamic binary translation to increase their simulation speed. Table 1 compares the simulation speed of several MPSoC/SoC-based virtual frameworks proposed by academia and commercial sectors. In Table 1, the row labeled "QSC2" refers to the QEMU and SystemC-based framework we propose. The column labeled "Instruction-Accurate ISS" refers to the simulation speed of the instruction-accurate ISS, while the column labeled "OS Simulated" indicates that the simulation efficiency is gathered by simulating the indicated OS on the virtual platform. The numbers given in the sub-columns labeled "w/o trace" and "with trace" of the column labeled "instructions/sec" of Table 1 give the simulation speed with the capability of instruction trace turned off and on, respectively. The column labeled "transactions/sec" of Table 1 adds the number of memory accesses to the counts in the column labeled "instructions/sec." It can be easily seen from Table 1 that QSC2 without trace is only slower than RealView [16], whereas QSC2 with trace is only faster than Benini et al. [12]. This is expected because QSC2 provides much more information about all the instructions executed, all the memory accessed, and even all the I/O operations performed. As also shown in Table 1, most of the platforms report their simulation efficiency without having an OS run on the virtual platform, except RealView, Simics, Mambo, and QSC2. As described in Table 1, the row labeled "RealView" indicates that the RealView Real-Time System Model for the ARM1176JZ(F)-S processor can simulate a Linux boot at more than 100 MIPS [16]. Furthermore, the simulation speed of LISA with static scheduling [19] is several orders of magnitude faster than that of LISA with dynamic scheduling [19]. Although Simics can be used to boot up a variety of OSs, the simulation efficiency of booting up Linux is in the range of 3.2–9.3M instructions per second. Also shown in Table 1, ARMulator can provide a simulation efficiency of 2M instructions per second at the instruction-accurate level [19]. It is important to note that all the statistics given in Table 1 are calculated based on the assumption that only one processor is used for all the platforms.


Table 1: Comparison of the simulation speed of several MPSoC/SoC-based virtual platforms with one processor.

  Simulation    Virtual Platform      instructions/sec             transactions/sec            OS
  Technique                           w/o trace     with trace     w/o trace     with trace    Simulated
  -----------------------------------------------------------------------------------------------------
  Compiled      RealView [16]         100M          n/a            n/a           n/a           Linux
  Simulation    LISA (static) [19]    11-36M        n/a            n/a           n/a           no
                LISA (dynamic) [19]   4-6M          n/a            n/a           n/a           no
  Reflective    ReSP [11]             < 2.9M        n/a            n/a           n/a           no
  Simulation    Benini et al. [12]    31.7K         n/a            n/a           n/a           no
  Interpretive  Simics [25]           3.2-9.3M      n/a            n/a           n/a           Linux
  Simulation    OVP [5]               4.2M          n/a            n/a           n/a           no
                Mambo [14]            4M            n/a            n/a           n/a           Linux
                ARMulator [19]        2M            n/a            n/a           n/a           no
                QSC2 [32]             36.21-38.31M  0.75-0.78M     49.44-52.26M  1.10-1.15M    Linux


Figure 5: The block diagram of QEMU vs. the block diagram of QEMU and SystemC-based virtual platform. (a) QEMU and (b) QEMU-SystemC. The differences between QEMU and QEMU-SystemC are shaded.

1. Processor-associated access. This refers to access initiated by the processor model, which includes reads from peripherals to the processor and writes from the processor to peripherals. In this case, the virtual device plays the role of a slave device of the BFM.

2. Memory-associated access. Because the memory of QEMU is managed by the software MMU described in Section 2.2, all accesses to the memory of the virtual platform need to go through the address translation mechanisms of the software MMU.

3. Interrupt cascading mechanism. Although the interrupt line has nothing to do with any data access to the BFM, it is indispensable for a system with interrupt-driven hardware models to work properly.

3.1. Processor-Associated Access

To fulfill the requirements of being a system emulator, QEMU provides an I/O interface for connecting the target processor to the virtual platforms provided by QEMU. Although this interface is undocumented, most of the existing virtual platforms are modeled and constructed based on it.

The I/O interface can be divided into two categories: PCI and memory-mapped I/O. Because the virtual platform we propose is aimed at SoC development, we will only present the interface for memory-mapped I/O. Our implementation of the I/O interface is similar in principle to that of QEMU-SystemC [28], except that the interrupt mechanism we provide is much more complete.

To ensure the portability of QEMU, the I/O interface provides callback functions to access 8-, 16-, and 32-bit data. The set of callback functions can be registered by calling the function cpu_register_io_memory(). Most hardware devices require a physically contiguous memory region, and the return value of cpu_register_io_memory() is used as the third argument to the function sysbus_init_mmio(). The purpose of this function call is to register the memory-mapped I/O space with QEMU. After that, the read/write functions can be called by the virtual processor to access the internal state of the hardware model.
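The callback-table idea can be illustrated with a self-contained C++ mock-up. The real tables would be registered with QEMU via cpu_register_io_memory(); here they are dispatched directly, and all names other than that idea are ours: one read and one write callback per access width, with the device state passed back through an opaque pointer.

```cpp
#include <cstdint>
#include <cassert>

// Device state handed back to the callbacks through the opaque pointer.
struct DevState { uint32_t ctrl_reg = 0; };

// One read and one write callback signature, mirroring the shape of
// QEMU's slave I/O interface.
typedef uint32_t (*ReadFn)(void *opaque, uint32_t offset);
typedef void     (*WriteFn)(void *opaque, uint32_t offset, uint32_t value);

static uint32_t dev_read(void *opaque, uint32_t offset) {
    DevState *s = static_cast<DevState *>(opaque);
    return offset == 0 ? s->ctrl_reg : 0;          // one readable register
}
static void dev_write(void *opaque, uint32_t offset, uint32_t value) {
    DevState *s = static_cast<DevState *>(opaque);
    if (offset == 0) s->ctrl_reg = value;
}

// Index 0/1/2 selects the 8-, 16-, or 32-bit access callback; a device
// that does not care about width can reuse one function for all three.
static ReadFn  readfn[3]  = { dev_read,  dev_read,  dev_read  };
static WriteFn writefn[3] = { dev_write, dev_write, dev_write };
```

When the virtual processor touches the registered region, QEMU's software MMU picks the callback of the matching width and calls it with the opaque pointer supplied at registration time.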



of the implementation or the computer organization as shown in Fig. 6(a) and (b). The “internal memory access interface” shown in Fig. 6(b) is responsible for registering the functions exported for the attached hardware models written in SystemC, i.e., the same interface proposed in QEMU-SystemC except the interrupt handler used to receive the interrupts triggered by the downstream components. The code fragment of the interface of the “internal memory access interface” is as given below. Although the macro FROM SYSBUS() acts like the macro container of() used in the Linux kernel, it is only useful where the pre-defined structure SysBusDevice in the structure soc state is used.

3.2. Memory-Associated Access In order to handle the diversity of virtual platforms, the memory access mechanism of QEMU is complicated in the sense that it needs to handle endianness, alignment, virtual to physical translation, memory-mapped I/O, and so forth. Moreover, some of the memory access functions need to invoke the dynamic binary translation (DBT) to generate the executable code at runtime to simulate the execution of instructions. However, our purpose is to initiate a transaction on behalf of the master port of a virtual device, we will not discuss any further the memory access functions, which need to deal with DBT. A good example is ldl code(), the purpose of which is to fetch instructions from the virtual address space where each instruction occupies 4 bytes. Instead, the following four functions

1 2 3 4 5

• cpu physical memory read(),

··· typedef struct { SysBusDevice busdev; ··· } soc state;

6 7 8

• cpu physical memory write(),

9 10

• ldx phys(), and

static void sc soc irq hdlr(void ∗opaque, int irq, int level) { ··· }

11 12 13

• stx phys()

14 15

are used for the master ports of a virtual device to access the physical memory of QEMU. The first two are capable of handling variable-length data while the last two can only be used for fixed-length data. Note that x in the name of the functions ldx phys() and stx phys() can be either b, w, l, or q to indicate, respectively, byte, word, long, and quad-word.

static uint32 t sc soc read(void ∗opaque, uint32 t offset) { ··· }

16 17 18 19 20

static void sc soc write(void ∗opaque, uint32 t offset, uint32 t value) { ··· }

21 22 23 24 25 26

3.3. Interrupt Cascading Mechanism Due to the complexity of system architecture, the interrupt mechanism of a system is generally complicated. Although not all the hardware models need the interrupt line to preempt the execution of a program, it is unavoidable for interrupt-driven hardware models because the only mechanism to signal the completion of their operation is by interrupt. For convenience of replacing a hardware model, the interrupt cascading mechanism needs to be two-way. One is for receiving interrupt from QEMU while the other is for sending interrupt to QEMU. Most of the downstream devices (the devices sending interrupt to QEMU) need only to trigger the interrupt. However, to model the devices sitting in the middle, such as interrupt controller or second interrupt controller, both of the sending and receiving directions are indispensable. In QEMU, the interrupt processing uses the qdev init gpio in() function to register the interrupt handler, which can receive the interrupt from other downstream hardware devices. Then, the function qemu set irq() will be called to determine if the incoming interrupt has to be sent upward. To make it easier to understand the proposed interface described herein, all the functions defined as part of the interface are summarized in Table 2.

static CPUReadMemoryFunc ∗sc soc readfn[] = { sc soc read, sc soc read, sc soc read };

27 28 29 30 31 32

static CPUWriteMemoryFunc ∗sc soc writefn[] = { sc soc write, sc soc write, sc soc write };

33 34 35 36 37 38 39 40 41 42 43 44

static void sc soc init(SysBusDevice ∗dev) { ··· soc state ∗s = FROM SYSBUS(soc state, dev); int iomemtype = cpu register io memory(sc soc readfn, sc soc writefn, s); sysbus init mmio(dev, 0x1000, iomemtype); ··· qdev init gpio in(&dev−>qdev, sc soc irq hdlr, 32); ··· } ···

In essence, the implementation of QEMU relies heavily on the function pointers, such as CPUReadMemoryFunc and CPUWriteMemoryFunc in lines 22 and 28. Although the arrays of the function pointers sc soc readfn[] and sc soc writefn[] are set for three read/write functions for 8, 16, and 32-bit data, the virtual platform we describe herein only needs the 32-bit access due to the on-chip bus width. The cpu register io memory() function will associate the arrays of the read/write functions sc soc readfn[] and sc soc writefn[] with the structure soc state and set fields that will be accessed and managed by the software MMU accordingly. Eventually, the initialization function qdev init gpio in() will register the sc soc irq hdlr() function as the interrupt handler for receiving the interrupts

3.4. Integrating Interface of the Virtual Platform Although the processor-associated access is introduced as part of the infrastructure interface we proposed, it should be part of the data port of the ISS wrapper from the perspective 6


Table 2: The proposed interface for the QEMU and SystemC-based virtual platform. Function Description cpu register io memory() For registering I/O read/write functions Processor-Associated Interface sysbus init mmio() For registering a memory region for memory-mapped I/O cpu physical memory read() For reading variable-length data from memory cpu physical memory write() For writing variable-length data to memory Memory-Associated Interface ldn phys() For reading fixed-length data from memory stn phys() For writing fixed-length data to memory For registering interrupt handler for receiving interrupts from qdev init gpio in() Interrupt Cascading Mechanism downstream hardware devices qemu set irq() For sending interrupts to upstream hardware devices Category

ISS wrapper processor model

instruction fetch extractor

software MMU

memory access extractor

instruction bus port virtual address physical address

ISS wrapper processor model

M

data bus M port

instruction fetch extractor

bus functional model (BFM) software MMU

memory

memory access extractor

memory

infrastructure interface

memory-mapped I/O interface for SystemC .. .

memory-mapped I/O #1 .. .

hardware model interrupt mechanism

virtual address physical address

S memory-mapped I/O #1 .. .

instruction bus port

memory-mapped I/O #K

data bus M port

infrastructure interface internal memory S access interface interrupt propagation

memory-mapped I/O interface for SystemC .. .

interrupt propagation

M bus functional model (BFM)

M

S

hardware model interrupt mechanism

memory-mapped I/O #K

interrupt cascading

interrupt cascading

QEMU virtual machine

ISS wrapper & infrastructure in SystemC

QEMU virtual machine

(a)

ISS wrapper & infrastructure in SystemC

(b)

Figure 6: The block diagram of the QEMU and SystemC-based virtual platform. Note that “M” and “S” denote, respectively, the master and slave ports of the hardware models. Note that the revised parts of the virtual platform are shaded. (a) Addition of the bidirectional interrupt mechanism. (b) Addition of the internal memory access interface.

from the downstream hardware models. In addition, the third argument of qdev_init_gpio_in() shows that the hardware model reserves 32 interrupt pins to be connected by the downstream hardware models. As we have previously discussed in Section 2.5, the shared memory mechanism is used as a unidirectional FIFO between QEMU and SystemC, and the function pointer technique can be used to set up the proposed interface between QEMU and SystemC. In order to avoid null function pointers, all the functions to be used in the QEMU and SystemC co-simulation need to be set before any meaningful transaction can be initiated, as the following code fragment of the initialization function of the virtual Versatile/PB926EJ-S platform shows.

    ...
    static void versatile_init(ram_addr_t ram_size,
                               const char *boot_device,
                               const char *kernel_filename,
                               const char *kernel_cmdline,
                               const char *initrd_filename,
                               const char *cpu_model,
                               int board_id)
    {
        ...
        irq2 irq2_temp;
        ...
        dev = sysbus_create_varargs("pl190", 0x10140000,
                                    cpu_pic[0], cpu_pic[1], NULL);
        irq2_temp.irq[0] = cpu_pic[0];
        irq2_temp.irq[1] = cpu_pic[1];
        soc_int_init(0x10140000, &irq2_temp.irq[0], 2, qemu_set_irq);
        ...
        sysbus_create_simple("pl080", 0x10130000, pic[17]);
        soc_int_init(0x10130000, &pic[17], 1, qemu_set_irq);
        soc_mmio_init(cpu_physical_memory_read,
                      cpu_physical_memory_write,
                      ldl_phys);
        ...
    }
    ...

The declaration of irq2_temp sets aside an array of pointers to IRQ to indicate the nFIQ and nIRQ signals of the ARM processor. The sysbus_create_varargs() function is used for registering the initialization functions of hardware modeled in QEMU. After the sysbus_create_varargs() function is called, the soc_int_init() function is called to deliver packets that contain the information for initialization via the unidirectional FIFO from QEMU to SystemC, as shown in Fig. 7. Upon receiving the packets (by the packet receiver on the SystemC side), the information in the packets is retrieved to initialize the function pointers of the infrastructure interface we proposed. The first and third arguments of the soc_int_init() function indicate, respectively, the base address of the memory-mapped hardware model and the number of signals used to send interrupts upward to the upstream hardware models. Because the



Figure 7: The architecture and the inter-process communication (IPC) mechanism used by the ISS presented in the paper. Both exported from QEMU, the instruction bus port and the data bus port are for connecting the BFM. Furthermore, the internal memory access interface and the interrupt interface are for modeling, respectively, the memory access to the memory model within QEMU and the interrupt-driven hardware in QEMU.

DMAC model only uses the cpu_physical_memory_read(), cpu_physical_memory_write(), and ldl_phys() functions, the code fragment shows that only these three function pointers, which will be used by the hardware modeled in SystemC, are initialized. In other words, access to the hardware devices modeled in SystemC and in QEMU can be achieved via the internal memory access interface shown in Fig. 6(b) and Fig. 7.

4. Experimental Results

In this section, we turn our attention to the experimental results of using the proposed interface to connect an ARM PrimeCell PL080 DMAC [6] modeled in SystemC to the virtual Versatile/PB926EJ-S platform, which is used as the experimental virtual platform throughout the paper and the details of which are shown in Fig. 8. The processor model is based on the ARM9 processor without cache. Because the Linux kernel does not provide a device driver for the PrimeCell PL080 of the Versatile/PB926EJ-S platform, we need to develop our own device driver for the purpose of testing. The performance of the target virtual platform is evaluated based on two different measures: (1) the time it takes to boot up a full-fledged Linux kernel, and (2) the statistics that can be collected while the system is being booted up. For all the experimental results given in this section, a 2.40GHz Intel Core 2 Quad Q6600 machine with 2GB of memory is used as the host, and the target OS is built using the BuildRoot package [1], which is capable of automatically generating almost everything we need, including the cross-compilation tool chain, the target kernel image, and the initial RAM disk. The host Linux distribution is Fedora 9, and its kernel is Linux version 2.6.27.12-78. QEMU version 0.11.0-rc1 and SystemC version 2.2.0 (including the reference simulator provided by OSCI) are both compiled with gcc version 4.3.1.

Figure 8: The block diagram of the virtual Versatile/PB926EJ-S platform of QEMU with the PL080 DMAC model written in C replaced by a hardware model written in SystemC. The processor, memory, and interrupt mechanisms are exported from QEMU to interface with the BFM and the hardware devices modeled in SystemC via the ISS wrapper and the infrastructure interface.

4.1. Device Driver for DMAC

To make it easier to test, the device driver for the DMAC is implemented as a char device [15]. Thereby, application programs can easily be written to control its behavior by calling the ioctl() function defined in the device driver. In practice, most DMA device drivers cannot be characterized as either a char or a block device. For example, on the Samsung S3C6410 SoC [30], the DMA device driver is used as the base of a driver stack by exporting functions to be used by other drivers such as the sound driver or the memory controller driver. That is why we choose to implement the device driver for the DMAC as a char device, so that it is easier to control such a device. By using the virtual platform, the device driver and the DMAC hardware model can be used to cross-verify the functionality of each other while they are being developed.³

³ The consequence is that we eventually found two long-standing bugs in the read/write operations of the PL080 DMAC model in QEMU, which are inconsistent with what is specified in the PL080 DMAC Technical Reference Manual [6]. These two bugs can terminate the simulation when some of the control registers, which are located at a specific address region of the memory-mapped I/O, are accessed.

A byproduct of this is that it proves that the hardware/software


co-simulation on a virtual platform can be used to verify the functionality of hardware models and device drivers at the early stage of the ESL design flow, even before the physical hardware is available. Moreover, the number of words moved by the DMAC can be observed from the statistics reported by the IA-ISS.

4.2. Time to Boot up Linux

In order to gather the statistics, the initial shell script is modified to enable the option of executing the DMAC test bench and then rebooting the virtual machine automatically as soon as the booting sequence is completed. Furthermore, the predefined no-reboot option of QEMU will catch the reboot signal once the OS executes the reboot command after completing the DMAC test in the shell script, and then shut QEMU down. Thus, the test bench can easily estimate the co-simulation time of QEMU and SystemC at the OS level. For the purpose of comparison, we use two test benches to test the data movement: one uses DMA to move the data; the other does not. The amount of data moved is 2,048,000 words (4 bytes per word), half of which are read while the other half are written. When moving the data via DMA, an interrupt will be raised to signal the end of the transfer.

The time it takes to boot up a full-fledged Linux kernel is as shown in the column labeled "Co-simulation time" of Tables 4 and 6. The rows labeled "min," "max," and "µ" present, respectively, the best-case, the worst-case, and the average running time of booting up the kernel and shutting it down immediately for 30 times. The row labeled "σ" gives the variability. As described in Tables 4 and 6, the column labeled "N_TI" shows the number of target instructions actually executed by the virtual ARM processor. The columns labeled "N_LD" and "N_ST" present, respectively, the number of load and store operations of the virtual processor, including the memory-mapped I/O. The columns labeled "N_DMAR" and "N_DMAW" give, respectively, the number of reads and the number of writes initiated by the DMAC, i.e., by the master ports of the DMAC. The column labeled "N_TX" gives the total number of target instructions executed and load and store operations performed. Because the number of read/write operations of the slave port of the DMAC (PL080 in this case) has been counted as the load and store operations of the virtual processor, only the number of read/write operations of the master ports has to be counted. That is,

    N_TX = N_TI + N_LD + N_ST + N_DMAR + N_DMAW.

The columns labeled "N_RD" and "N_WT" give an idea about the number of read/write transactions between the virtual processor and the DMAC. Note that all the numbers given are, as the names of the rows suggest, the min, max, and average of booting up the ARM Linux and shutting it down immediately on our virtual platform for 30 times. It is worth pointing out that if DMA is not used in moving the data, then the virtual platform cannot provide any information because the data is moved by the load and store instructions of the virtual processor. That explains why the columns labeled "N_DMAR" and "N_DMAW" in Table 6 remain zero. The simulation times of booting up Linux with data movement using DMA and not using DMA are, respectively, 12m48.265s and 11m58.602s in the worst case. The slowdown of the simulation speed with data movement via DMA is due to the I/O synchronization needed for communicating with the DMAC modeled in SystemC. The percentages given in parentheses are defined as

    N_α / N_TX × 100%

where the subscript α is either TI, LD, ST, DMAR, DMAW, TX, RD, or WT. For instance, the percentage given in the column labeled "N_TI" of the row labeled "µ" of Table 4 is computed as

    460,050,883.57 / 658,223,885.83 × 100% = 69.89%.

4.3. Simulation Efficiency

Tables 5 and 7 show the simulation efficiency of the same results as given in Tables 4 and 6 except that the numbers have been normalized so that they indicate the simulation efficiency instead of the numbers per run. This makes it easier to understand exactly how many instructions are executed or how many load and store operations are performed per second in the worst, best, and average case. For instance, as the row labeled "µ" of Table 5 shows, on average, the instruction-accurate ISS we proposed can execute about 718,129.55 instructions and perform about 214,426.27 load and 91,382.35 store operations per second. Even in the worst case, as the row labeled "max" of Table 5 shows, it can still execute about 624,327.88 instructions and perform about 189,278.53 load and 83,234.48 store operations per second.

4.4. Performance Evaluation of Target Virtual Platform

As can be seen from Tables 5 and 7, without normalization, the experimental results cannot be easily compared. As such, all the instruction and transaction counts are normalized in terms of the simulation time before they are compared, by

    N_β^Norm = N_β × T_cosim

where N_β indicates the statistics shown in Table 7 except N_DMAR, N_DMAW, N_RD, and N_WT, because the number of words transferred is fixed. Moreover, T_cosim indicates the co-simulation time in the column "Co-simulation time" of Table 4. For instance, the number given in the column labeled "N_TI" of the row labeled "min" of Table 8 is computed as

    878,977.19 per sec × 527.359 sec = 463,536,531.94.

The statistics of the different data movement settings shown in Tables 4, 6, and 8 are compared in Fig. 9. It is interesting to note that N_DMAR and N_DMAW are visible only for the simulation setting "with DMA." Compared to the case of not using DMA, the performance gain of using DMA in terms of N_TI can be defined by

    G_TI = N_TI^Norm − N_TI^DMA;



Table 3: Notations used in Tables 4, 5, 6, 7, and 8

min:    The best-case co-simulation time and the best-case simulation efficiency of 30 runs, where the "simulation efficiency" is defined to be the number of instructions or operations simulated per second as far as this paper is concerned.
max:    The worst-case co-simulation time and the worst-case simulation efficiency of 30 runs.
µ:      The mean of the co-simulation time and the simulation efficiency of 30 runs.
σ:      The standard deviation of the co-simulation time and the simulation efficiency of 30 runs.
N_TX:   The total number of transactions.
N_TI:   The number of target instructions simulated.
N_LD:   The number of load operations of the virtual processor.
N_ST:   The number of store operations of the virtual processor.
N_DMAR: The number of read operations initiated by the master ports of the DMAC.
N_DMAW: The number of write operations initiated by the master ports of the DMAC.
N_RD:   The number of times the virtual processor reads data from the DMAC (PL080).
N_WT:   The number of times the virtual processor writes data to the DMAC (PL080).

Table 4: Simulation time of booting up the Linux kernel plus data movement with DMA on our virtual platform for 30 times.

Statistics         | min                      | max                      | µ                        | σ
Co-simulation time | 08m47.359s               | 12m48.265s               | 10m44.956s               | 01m00.20s
N_TI               | 445,011,311.00 (70.31%)  | 479,649,276.00 (69.41%)  | 460,050,883.57 (69.89%)  | 10,081,646.36
N_LD               | 131,257,961.00 (20.74%)  | 145,416,069.00 (21.04%)  | 137,461,532.53 (20.88%)  | 4,074,998.19
N_ST               | 54,641,476.00 (8.63%)    | 63,946,144.00 (9.25%)    | 58,663,469.73 (8.91%)    | 2,684,633.76
N_DMAR             | 1,024,000.00 (0.16%)     | 1,024,000.00 (0.15%)     | 1,024,000.00 (0.16%)     | 0.00
N_DMAW             | 1,024,000.00 (0.16%)     | 1,024,000.00 (0.15%)     | 1,024,000.00 (0.16%)     | 0.00
N_TX               | 632,958,748.00 (100.00%) | 691,059,489.00 (100.00%) | 658,223,885.83 (100.00%) | 16,841,278.31
N_RD               | 3,008.00 (0.00%)         | 3,008.00 (0.00%)         | 3,008.00 (0.00%)         | 0.00
N_WT               | 9,002.00 (0.00%)         | 9,002.00 (0.00%)         | 9,002.00 (0.00%)         | 0.00

Table 5: Simulation efficiency of booting up the Linux kernel plus data movement with DMA on our virtual platform for 30 times (i.e., transactions per second).

Statistics | min                    | max                  | µ                      | σ
N_TI       | 843,848.88 (70.31%)    | 624,327.88 (69.41%)  | 718,129.55 (69.92%)    | 52,477.67
N_LD       | 248,896.78 (20.74%)    | 189,278.53 (21.04%)  | 214,426.27 (20.88%)    | 14,152.60
N_ST       | 103,613.43 (8.63%)     | 83,234.48 (9.25%)    | 91,382.35 (8.90%)      | 4,831.34
N_DMAR     | 1,941.75 (0.16%)       | 1,332.87 (0.15%)     | 1,601.47 (0.16%)       | 148.75
N_DMAW     | 1,941.75 (0.16%)       | 1,332.87 (0.15%)     | 1,601.47 (0.16%)       | 148.75
N_TX       | 1,200,242.59 (100.00%) | 899,506.64 (100.00%) | 1,027,141.10 (100.00%) | 71,759.12
N_RD       | 5.70 (0.00%)           | 3.92 (0.00%)         | 4.70 (0.00%)           | 0.44
N_WT       | 17.07 (0.00%)          | 11.72 (0.00%)        | 14.08 (0.00%)          | 1.31

Table 6: Simulation time of booting up the Linux kernel plus data movement without DMA on our virtual platform for 30 times.

Statistics         | min                      | max                      | µ                        | σ
Co-simulation time | 08m15.351s               | 11m58.602s               | 10m15.390s               | 01m04.551s
N_TI               | 435,402,217.00 (70.74%)  | 472,317,649.00 (69.84%)  | 452,457,224.70 (70.30%)  | 10,355,227.14
N_LD               | 127,911,363.00 (20.78%)  | 142,224,553.00 (21.03%)  | 134,521,372.77 (20.90%)  | 4,193,136.67
N_ST               | 52,185,122.00 (8.48%)    | 61,773,665.00 (9.13%)    | 56,644,735.00 (8.80%)    | 2,765,526.43
N_DMAR             | 0.00 (0.00%)             | 0.00 (0.00%)             | 0.00 (0.00%)             | 0.00
N_DMAW             | 0.00 (0.00%)             | 0.00 (0.00%)             | 0.00 (0.00%)             | 0.00
N_TX               | 615,498,702.00 (100.00%) | 676,315,867.00 (100.00%) | 643,623,332.47 (100.00%) | 17,313,890.24
N_RD               | 8.00 (0.00%)             | 8.00 (0.00%)             | 8.00 (0.00%)             | 0.00
N_WT               | 1.00 (0.00%)             | 1.00 (0.00%)             | 1.00 (0.00%)             | 0.00

Table 7: Simulation efficiency of booting up the Linux kernel plus data movement without DMA on our virtual platform for 30 times (i.e., transactions per second).

Statistics | min                    | max                  | µ                      | σ
N_TI       | 878,977.19 (70.74%)    | 657,272.94 (69.84%)  | 741,861.21 (70.33%)    | 63,968.33
N_LD       | 258,223.69 (20.78%)    | 197,918.39 (21.03%)  | 220,382.04 (20.89%)    | 17,364.13
N_ST       | 105,349.79 (8.48%)     | 85,963.67 (9.13%)    | 92,639.92 (8.78%)      | 5,959.48
N_DMAR     | 0.00 (0.00%)           | 0.00 (0.00%)         | 0.00 (0.00%)           | 0.00
N_DMAW     | 0.00 (0.00%)           | 0.00 (0.00%)         | 0.00 (0.00%)           | 0.00
N_TX       | 1,242,550.66 (100.00%) | 941,155.00 (100.00%) | 1,054,883.18 (100.00%) | 87,291.94
N_RD       | 0.02 (0.00%)           | 0.01 (0.00%)         | 0.01 (0.00%)           | 0.00
N_WT       | 0.00 (0.00%)           | 0.00 (0.00%)         | 0.00 (0.00%)           | 0.00

Table 8: Normalized simulation statistics of booting up the Linux kernel plus data movement without DMA on our virtual platform for 30 times.

Statistics | min            | max
N_TI       | 463,536,531.94 | 504,959,795.25
N_LD       | 136,176,586.93 | 152,053,771.89
N_ST       | 55,557,159.90  | 66,042,878.93
N_DMAR     | 0.00           | 0.00
N_DMAW     | 0.00           | 0.00
N_TX       | 655,270,273.51 | 723,056,446.08
N_RD       | 8.00           | 8.00
N_WT       | 1.00           | 1.00

Table 9: Performance gain in booting up the Linux kernel on our virtual platform with DMA for 30 times.

Statistics | min                   | max
G_TI       | 18,525,220.94 (4.16%) | 25,310,519.25 (5.28%)
G_LD       | 3,894,625.93 (2.94%)  | 5,613,702.89 (3.83%)
G_ST       | −108,316.10 (−0.19%)  | 1,072,734.93 (1.65%)
G_TX       | 20,263,525.51 (3.19%) | 29,948,957.07 (4.32%)


Figure 9: The statistics (N_TI, N_LD, N_ST, N_TX, N_DMAR, and N_DMAW, in millions of transactions) of the different data movement settings in a logarithmic plot.

Figure 10: Performance gain (G_TI, G_LD, G_ST, and G_TX, in percent) of data movement "with DMA" with respect to that "without DMA."

the performance gain of using DMA in terms of N_LD and N_ST by

    G_γ = N_γ^Norm − (N_γ^DMA + 1,024,000)

where γ indicates either LD or ST; and the performance gain of using DMA in terms of N_TX by

    G_TX = N_TX^Norm − (N_TX^DMA + 2,048,000)

where N_·^Norm and N_·^DMA indicate, respectively, the normalized simulation results of not using DMA as shown in Table 8 and the simulation results of using DMA as shown in Table 4. The results are as shown in Table 9. Furthermore, the performance gain of using DMA in terms of N_TI in percentage can be defined accordingly by

    (N_TI^Norm − N_TI^DMA) / N_TI^DMA × 100%;

the performance gain of using DMA in terms of N_LD and N_ST in percentage by

    (N_γ^Norm − (N_γ^DMA + 1,024,000)) / (N_γ^DMA + 1,024,000) × 100%

where γ is as defined above; and the performance gain of using DMA in terms of N_TX in percentage by

    (N_TX^Norm − (N_TX^DMA + 2,048,000)) / (N_TX^DMA + 2,048,000) × 100%

because without DMA, all the transfers are handled by the load and store instructions of the virtual processor, which have been counted as part of the total number of transactions. For instance, the performance gain given in the column labeled "G_LD" of the row labeled "min" of Table 9 is

    136,176,586.93 − (131,257,961.00 + 1,024,000) = 3,894,625.93,

and the performance gain in percentage is

    (136,176,586.93 − (131,257,961.00 + 1,024,000)) / (131,257,961.00 + 1,024,000) × 100% = 2.94%.

The performance gain of data movement "with DMA" with respect to that "without DMA" in percentage is as shown in Fig. 10, which is essentially a summary of Table 9. Our experimental results given in Table 9 show that, with DMA, the performance gain with respect to the number of instructions required to boot up a Linux kernel is about 18.53–25.31M instructions for transferring 2,048,000 words, or 4.16–5.28%, compared to without DMA. In other words, with DMA, the number of instructions saved in transferring a word is

    N_min / 2,048,000 = 18,525,220.94 / 2,048,000 = 9.05

in the worst case and

    N_max / 2,048,000 = 25,310,519.25 / 2,048,000 = 12.36

in the best case, where N_min and N_max indicate, respectively, the statistics given in the column labeled "G_TI" of the rows "min" and "max" of Table 9. In other words, using DMA to transfer a word can save 9.05–12.36 instructions on the ARM9 processor. The other columns labeled "G_LD," "G_ST," and "G_TX" in Table 9 show, respectively, the performance gain of using DMA in terms of "N_LD," "N_ST," and "N_TX," that is, the number of load operations, the number of store operations, and the total number of transactions saved. The performance loss shown in the column labeled "G_ST" of Table 9 is caused by the fact that booting up a kernel is a non-deterministic procedure.

5. Conclusion

This paper presents an interface for connecting QEMU to SystemC for a QEMU and SystemC-based virtual platform. The proposed interface can be used to enable the master/slave ports of the attached hardware modeled in SystemC to access the virtual platform. As a consequence, such a virtual platform can facilitate the co-design of the hardware models written in SystemC and the associated device drivers while they are being



developed. Moreover, it can be used to co-verify the correctness of the hardware models and the associated device drivers. For concreteness, we use the DMAC as an example to show how the proposed interface works. By using such a virtual platform, we were eventually able to fix two long-standing bugs in the DMAC model of QEMU. Furthermore, the virtual platform we proposed can even provide instruction-accurate statistics for measuring the performance of the attached hardware. More importantly, with all the transactions traced, the virtual platform takes only 12m48.265s to boot up a full-fledged kernel, even in the worst case.

Acknowledgment

This work was supported in part by the National Science Council, Taiwan, ROC, under Contract No. NSC98-2221-E-110-049.

References

[1] BuildRoot. http://buildroot.uclibc.org/.
[2] CoWare Model Library. http://www.coware.com/products/modellibrary.php.
[3] CoWare Platform Architect. http://www.coware.com/products/platformarchitect.php.
[4] Nine Reasons to Adopt SystemC ESL Design. http://www.eetimes.com/news/design/columns/eda/showArticle.jhtml?articleID=47212187.
[5] P. Agrawal. Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML. http://web.iitd.ac.in/~vdtt/Research/projects/thesis/jvl072170.pdf.
[6] ARM. PrimeCell DMA Controller (PL080) Technical Reference Manual, 2003. http://infocenter.arm.com/help/index.jsp.
[7] ARM. RealView ARM ISS User Guide, 2007. http://infocenter.arm.com/help/index.jsp.
[8] J. Aynsley. OSCI TLM-2.0 Language Reference Manual, 2009. http://www.systemc.org/members/download_files/check_file?agreement=tlm_2-0_080606.
[9] B. Bailey, G. Martin, and A. Piziali. ESL Design and Verification. Morgan Kaufmann Publishers, 2007.
[10] F. Bellard. QEMU, a Fast and Portable Dynamic Translator. In Proceedings of the USENIX Annual Technical Conference, pages 41–46, June 2005.
[11] G. Beltrame, C. Bolchini, L. Fossati, A. Miele, and D. Sciuto. ReSP: A Non-Intrusive Transaction-Level Reflective MPSoC Simulation Platform for Design Space Exploration. In Proceedings of the 13th Asia and South Pacific Design Automation Conference, pages 673–678, 2008.
[12] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. Legacy SystemC Co-Simulation of Multi-Processor Systems-on-Chip. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, pages 494–499, 2002.
[13] D. C. Black and J. Donovan. SystemC: From The Ground Up. Springer Science+Business Media, 2004.
[14] P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith, R. Rockhold, C. Lefurgy, H. Shafi, T. Nakra, R. Simpson, E. Speight, K. Sudeep, E. V. Hensbergen, and L. Zhang. Mambo: A Full System Simulator for the PowerPC Architecture. ACM SIGMETRICS Performance Evaluation Review, 31:8–12, March 2004.
[15] J. Corbet, A. Rubini, and G. Kroah-Hartman. Linux Device Drivers. O'Reilly, February 2005.
[16] Design & Reuse. ARM Expands RealView Product Family with Fast Simulation Technology to Speed Up Software Development. http://www.design-reuse.com/news/10602/armrealview-fast-simulation-technology-speed-up-softwaredevelopment.html.
[17] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic Publishers Group, 2002.
[18] G. R. Hellestrand. Systems Engineering: The Era of the Virtual Processor Model (VPM). http://www.vastsystems.com/solutionstechnical-papers.html.
[19] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, and H. Meyr. A Novel Methodology for the Design of Application-Specific Instruction-Set Processors (ASIPs) Using a Machine Description Language. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20:1338–1354, November 2001.
[20] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr. A Methodology for the Design of Application Specific Instruction Set Processors (ASIP) Using the Machine Description Language LISA. In Proceedings of the International Conference on Computer Aided Design, pages 625–630, 2001.
[21] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr. A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation. In Proceedings of the 39th Annual Design Automation Conference, pages 22–27, 2002.
[22] IEEE Computer Society. IEEE Standard SystemC Language Reference Manual. Design Automation Standards Committee, 2005. http://standards.ieee.org/getieee/1666/download/1666-2005.pdf.
[23] P. K. Immich, R. S. Bhagavatula, and R. Pendse. Performance Analysis of Five Interprocess Communication Mechanisms Across UNIX Operating Systems. Journal of Systems and Software, 68:27–43, October 2003.
[24] J.-W. Lin, C.-C. Wang, C.-Y. Chang, C.-H. Chen, K.-J. Lee, Y.-H. Chu, J.-C. Yeh, and Y.-C. Hsiao. Full System Simulation and Verification Framework. In Proceedings of the Fifth International Conference on Information Assurance and Security, pages 165–168, August 2009.
[25] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. Computer, 35(2):50–58, February 2002.
[26] G. Martin. Overview of the MPSoC Design Challenge. In Proceedings of the 43rd ACM/IEEE Design Automation Conference, pages 274–279, 2006.
[27] M. Montón, J. Carrabina, and M. Burton. Mixed Simulation Kernels for High Performance Virtual Platforms. In Proceedings of the Forum on Specification and Design Languages, pages 1–6, September 2009.
[28] M. Montón, A. Portero, M. Moreno, B. Martínez, and J. Carrabina. Mixed SW/SystemC SoC Emulation Framework. In Proceedings of the IEEE International Symposium on Industrial Electronics, pages 2338–2341, June 2007.
[29] OSCI. Open SystemC Initiative. http://www.systemc.org/.
[30] Samsung. Mobile SoC Application Processor S3C6410. http://www.samsung.com/global/business/semiconductor/productInfo.do?fmly_id=229&partnum=S3C6410.
[31] C.-C. Wang, R.-P. Wong, J.-W. Lin, and C.-H. Chen. System-Level Development and Verification Framework for High-Performance System Accelerator. In Proceedings of the International Symposium on VLSI Design, Automation and Test, pages 359–362, April 2009.
[32] T.-C. Yeh, G.-F. Tseng, and M.-C. Chiang. A Fast Cycle-Accurate Instruction Set Simulator Based on QEMU and SystemC for SoC Development. In Proceedings of the 15th IEEE Mediterranean Electrotechnical Conference, pages 1033–1038, April 2010.


