Languages for Embedded Systems
and Their Applications
Lecture Notes in Electrical Engineering
Volume 36
For other titles published in this series, go to
www.springer.com/series/7818
Martin Radetzki
Editor
Languages
for Embedded Systems
and Their Applications
Selected Contributions on Specification,
Design, and Verification from FDL’08
Editor
Prof. Dr. Martin Radetzki
Institut für Technische Informatik
Universität Stuttgart
Pfaffenwaldring 47
70569 Stuttgart
Germany
martin.radetzki@informatik.uni-stuttgart.de
ISSN 1876-1100 Lecture Notes in Electrical Engineering
ISBN 978-1-4020-9713-3
e-ISBN 978-1-4020-9714-0
DOI 10.1007/978-1-4020-9714-0
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2009920328
©Springer Science+Business Media B.V. 2009
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written
permission from the Publisher, with the exception of any material supplied specifically for the purpose
of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Embedded systems take over complex control and data processing tasks in diverse
application fields such as automotive, avionics, consumer products, and telecommunications. They are the primary driver for improving overall system safety,
efficiency, and comfort. The demand for further improvement in these aspects can
only be satisfied by designing embedded systems of increasing complexity, which
in turn necessitates the development of new system design methodologies based on
specification, design, and verification languages.
The objective of the book at hand is to provide researchers and designers with
an overview of current research trends, results, and application experiences in computer languages for embedded systems. The book builds upon the most relevant
contributions to the 2008 conference Forum on Design Languages (FDL), the premier international conference specializing in this field. These contributions have
been selected based on the results of reviews provided by leading experts from research and industry. In many cases, the authors have improved their original work
by adding breadth, depth, or explanation.
System development includes the tasks of defining an initial, high-level specification, designing system architecture and functional blocks, and verifying that
architecture and functionality meet the specified properties and requirements. The
designers working on these tasks, and the electronic design automation tools deployed in the design process, have to take into account software, digital logic, and
analog system components and their complex interactions in heterogeneous, mixed
discrete/continuous systems. This book therefore addresses related issues in four
parts, dedicated to specification, heterogeneity, design, and verification.
Part I, Model-Based System Specification Languages, focuses on two high-level
specification languages which are emerging as standards for embedded systems: the
Architecture Analysis and Design Language (AADL), and the Modeling and Analysis of Real-Time and Embedded Systems (MARTE) profile for the Unified Modeling
Language (UML). Beyond their syntax and semantics, the methods built upon these
languages, and initial applications are presented in three chapters. Two further chapters are dedicated to competing approaches using an abstract state machine based
language and Matlab/Simulink driven modeling, respectively.
Part II, Languages for Heterogeneous System Design, is devoted to two promising languages that provide the means to describe heterogeneous systems. The
discrete-time and continuous-time worlds are brought together by SystemC-AMS
on system level, whereas VHDL-AMS provides all it takes to describe mixed analog and digital circuits.
Part III, Digital Systems Design Methodologies Based on C++, is the largest
part of this book, based on the substantial impact that the SystemC library and
its methodology-specific additions continue to make in the digital (hardware/software) design community. This part comprises eight chapters devoted to
the subjects of transaction-level modeling and its applications, architecture and
performance evaluation, design and scheduling of functional blocks, as well as
programming and modeling approaches for (run-time) reconfigurable FPGA architectures.
Part IV, Verification and Requirements Evaluation, features contributions addressing both functional and beyond-functional properties. Functional aspects include the verification of circuitry implementing arithmetic operations and the
debugging of contradictory functional constraints specified with the SystemC Verification Library (SCV). Analysis of beyond-functional properties such as timing
behavior, performance, area cost, and power dissipation, is covered for Multi
Processor Systems-on-Chip (MPSoC) as well as on-chip interconnection networks.
The selection of the contributions to the aforementioned parts has been guided
by the reviews provided by FDL reviewers and programme committee members.
I would like to thank everybody involved in these reviews, and in particular the
FDL’08 chairpersons responsible for the conference tracks that relate to the four
parts, namely Dominique Borrione (PDV track, Property Driven Verification),
Pierre Boulet (UMES track, UML and MDE for Embedded Systems), Sorin Huss
(DCS track, Discrete and Continuous Systems), and Frank Oppenheimer (CSD
track, C-Based System Design). Moreover, I would like to acknowledge the extra
effort made by the authors to lay out, and in most cases revise and extend, their original work as a contribution to this book.
Universität Stuttgart
Martin Radetzki
FDL’08 General Chair
Contents

Part I  Model-Based System Specification Languages

1  Power and Energy Estimations in Model-Based Design   3
   Eric Senn, Saadia Douhib, Dominique Blouin, Johann Laurent, Skander Turki and Jean-Philippe Diguet
   1.1  Introduction   3
   1.2  AADL Component Based Design Flow   5
   1.3  Consumption Analysis: the Methodology   7
   1.4  Power Estimation   8
        1.4.1  Power Models   9
        1.4.2  Multi-level Estimation   11
   1.5  Power Estimation for Complex DSP   14
   1.6  Power Estimation for Field Programmable Gate Array   16
   1.7  Power Estimation for Operating System Services   17
        1.7.1  Ethernet Communications Consumption Modelling   18
        1.7.2  Models   19
   1.8  Consumption Analysis Tool   20
        1.8.1  Property Sets   20
   1.9  Conclusion   23
   References   24

2  MARTE vs. AADL for Discrete-Event and Discrete-Time Domains   27
   Frédéric Mallet and Robert de Simone
   2.1  Introduction   27
   2.2  Marte Time Model   28
        2.2.1  Definitions   29
        2.2.2  Event-Triggered Communications   29
        2.2.3  Time-Triggered Communications   30
        2.2.4  Periodic Tasks and Physical Time   31
        2.2.5  TimeSquare   31
   2.3  AADL   31
        2.3.1  Modeling Elements   31
        2.3.2  AADL Application Software Components   32
        2.3.3  AADL Flows   33
        2.3.4  AADL Ports   33
   2.4  Three Different Configurations   34
        2.4.1  The Aperiodic Case   34
        2.4.2  The Mixed Event–Data Flow Case   37
        2.4.3  The Periodic Case   38
   2.5  Conclusion   39
   Glossary   40
   References   40

3  Generation of MARTE Allocation Models from Activity Threads   43
   Andreas W. Liehr, Klaus J. Buchenrieder, Heike S. Rolfs and Ulrich Nageldinger
   3.1  Introduction   43
   3.2  Related Work   45
   3.3  Building System Models with MARTE   46
   3.4  Utilizing Activity Threads for Design Space Exploration   47
   3.5  Generating MARTE Allocation Models with Activity Threads   48
   3.6  A Prototypic Implementation of the Method   51
   3.7  Visualization of Performance Feedback   53
   3.8  Summary and Outlook   54
   References   55

4  Model-Driven System Validation by Scenarios   57
   A. Carioni, A. Gargantini, E. Riccobene and P. Scandurra
   4.1  Introduction   57
   4.2  ASMs and ASMETA   59
   4.3  Scenario-Based Validation of ASM Models   60
        4.3.1  The AVALLA Language   60
   4.4  The Model-Driven Validation Environment   61
        4.4.1  From SystemC UML Models to ASM Models   62
        4.4.2  Model Validator   64
   4.5  The Simple Bus Case Study   64
   4.6  Related Work   66
   4.7  Conclusions and Future Work   68
   References   68

5  An Advanced Simulink Verification Flow Using SystemC   71
   Kai Hylla, Jan-Hendrik Oetjens and Wolfgang Nebel
   5.1  Introduction   71
   5.2  Related Work   72
   5.3  Extended Verification Flow   73
        5.3.1  Conventional Flow   73
        5.3.2  Extending the Verification Flow   76
   5.4  Implementation   77
        5.4.1  Synchronization   77
        5.4.2  Data Type Conversion   80
   5.5  Evaluation   80
        5.5.1  Implementation   80
        5.5.2  Extended Verification Flow   82
   5.6  Conclusion   83
   References   84

Part II  Languages for Heterogeneous System Design

6  VHDL–AMS Implementation of a Numerical Ballistic CNT Model   87
   Dafeng Zhou, Tom J. Kazmierski and Bashir M. Al-Hashimi
   6.1  Introduction   87
   6.2  Mobile Charge Density and Self-Consistent Voltage   88
   6.3  Numerical Piece-Wise Approximation of the Charge Density   89
   6.4  Performance of Numerical Approximations   91
   6.5  VHDL–AMS Implementation   93
   6.6  Conclusion   98
   References   99

7  Wide-Band Sigma–Delta ADC Design in Superconducting Technology   101
   R. Guelaz, P. Desgreys and P. Loumeau
   7.1  Introduction   101
   7.2  Sigma–Delta Second Order Architecture   102
        7.2.1  Bandpass Sigma–Delta Modulator   102
        7.2.2  The Josephson Junction   103
        7.2.3  The RSFQ Balanced Comparator   105
        7.2.4  Sigma Delta Modulator Operation with Josephson Junctions   105
        7.2.5  System Modeling with VHDL–AMS   106
   7.3  The Sigma–Delta ADC Design   107
        7.3.1  Clock and Comparator Design   107
   7.4  Simulation Results   109
   7.5  Conclusion   111
   References   112

8  Heterogeneous and Non-linear Modeling in SystemC–AMS   113
   Ken Caluwaerts and Dimitri Galayko
   8.1  Introduction   113
        8.1.1  SystemC–AMS Modeling Platform   114
        8.1.2  Summary of Electrostatic Harvester Operation   116
   8.2  SystemC–AMS Modeling of the Harvester   118
        8.2.1  Resonator Modeling   118
        8.2.2  Implementation of the Conditioning Circuit Model   119
        8.2.3  Model of the Whole System   123
   8.3  Modeling Results   124
        8.3.1  Description of the Modeling Experiment   124
        8.3.2  Modeling Results Validation   126
   8.4  Conclusion   127
   References   127

Part III  Digital Systems Design Methodologies Based on C++

9  Application Workload and SystemC Platform Modeling for Performance Evaluation   131
   Jari Kreku, Mika Hoppari, Tuomo Kestilä, Yang Qu, Juha-Pekka Soininen and Kari Tiensyrjä
   9.1  Introduction   131
   9.2  Performance Modeling and Simulation   133
        9.2.1  Application and Workload Modeling   133
        9.2.2  Execution Platform Modeling   134
        9.2.3  Allocation and Transformation to SystemC   137
        9.2.4  Performance Simulation   138
   9.3  Mobile Video Player Case Example   138
        9.3.1  Modeling of the Execution Platform Components   139
        9.3.2  Modeling of the Services   141
        9.3.3  Modeling of the Application   143
        9.3.4  Analysis of Simulation Results   144
   9.4  Conclusions   145
   References   146

10  Adaptive Interconnect Models for Transaction-Level Simulation   149
    Rauf Salimi Khaligh and Martin Radetzki
    10.1  Introduction   149
    10.2  Related Work   151
    10.3  Adaptive Interconnect Models   152
          10.3.1  Point-to-Point Communication   152
          10.3.2  Bus-Based Communication   154
    10.4  Model Implementation   157
          10.4.1  An Adaptive FSL Model   157
          10.4.2  An Adaptive AHB Model   158
    10.5  Experimental Results   160
    10.6  Conclusion   164
    References   164

11  Efficient Architecture Evaluation Using Functional Mapping   167
    C. Kerstan, N. Bannow and W. Rosenstiel
    11.1  Introduction   167
          11.1.1  Functional Mapping   168
          11.1.2  Timing Behavior   169
    11.2  Conventional Code Transformation   169
    11.3  Optimization Approach   171
          11.3.1  Class Unitized   171
    11.4  Customize and Apply Unitized   173
          11.4.1  Application of u_trace   174
    11.5  Using the Approach in the Design Flow   175
          11.5.1  Handling Arrays   175
          11.5.2  Design Example   176
          11.5.3  Simulation Results   178
    11.6  Limitations and Experiences   179
    11.7  Summary   181
          11.7.1  Outlook   181
    References   181

12  Symbolic Scheduling of SystemC Dataflow Designs   183
    Jens Gladigau, Christian Haubelt and Jürgen Teich
    12.1  Introduction   183
    12.2  Model of Computation   184
    12.3  Symbolic Representation   187
    12.4  QSS of SysteMoC Models   189
          12.4.1  Transition Graphs   190
          12.4.2  Path Searching   191
          12.4.3  Scheduling Algorithm   193
    12.5  Related Work   195
    12.6  Example   196
    12.7  Conclusions and Further Work   197
    References   198

13  SystemC Simulation of Networked Embedded Systems   201
    Francesco Stefanni, Davide Quaglia and Franco Fummi
    13.1  Introduction   201
    13.2  The Architecture of SCNSL   203
          13.2.1  Main Components of SCNSL   204
    13.3  Main Problems Solved by SCNSL   207
          13.3.1  Simulation of RTL Models   207
          13.3.2  Assessment of Transmission Validity   207
          13.3.3  Simulation Planning   208
          13.3.4  Application to a Wireless Scenario   208
    13.4  Experimental Results   210
    13.5  Conclusions   210
    References   211

14  Modeling of Embedded Software Multitasking in SystemC/OSSS   213
    Philipp A. Hartmann, Philipp Reinkemeier, Henning Kleen and Wolfgang Nebel
    14.1  Introduction   213
    14.2  Related Work   214
    14.3  The OSSS Design Flow   216
          14.3.1  Application Layer   216
          14.3.2  Virtual Target Architecture Layer   217
    14.4  Modeling Software in OSSS   218
          14.4.1  Abstraction of Run-time System   218
          14.4.2  Software Tasks   219
          14.4.3  Software Shared Objects   220
          14.4.4  Software Execution Times   221
    14.5  Exploration of Platform Effects   222
    14.6  Simulation Results   223
          14.6.1  Accuracy and Performance   223
          14.6.2  Lazy Synchronization   224
    14.7  Conclusion   225
    References   225

15  High-Level Reconfiguration Modeling in SystemC   227
    Andreas Raabe and Armin Felke
    15.1  Introduction   227
    15.2  Related Work   228
    15.3  Basic Reconfiguration Modeling   229
          15.3.1  Interpreting Reconfiguration as Circuit Switch   229
          15.3.2  Creating Reconfigurable Modules from Static Ones   230
          15.3.3  Control   231
    15.4  Advanced ReChannel Features   231
          15.4.1  Exportals   231
          15.4.2  Synchronization   232
    15.5  Explicit Description of Reconfiguration   233
          15.5.1  Resettable Processes   234
          15.5.2  Resettable Components   236
          15.5.3  Binding Groups of Switches   237
    15.6  Case Study   238
    15.7  Conclusion and Future Work   239
    References   240

16  Stream Programming for FPGAs   241
    Franjo Plavec, Zvonko Vranesic and Stephen Brown
    16.1  Introduction   241
    16.2  Stream Computing   243
          16.2.1  Streaming on FPGAs   244
    16.3  Compiling Brook to Hardware   244
          16.3.1  Example Brook Program   246
          16.3.2  Exploiting Data Parallelism   248
    16.4  Experimental Evaluation   250
          16.4.1  Results   251
    16.5  Concluding Remarks   252
    References   253

Part IV  Verification and Requirements Evaluation

17  A New Verification Technique for Custom-Designed Components at the Arithmetic Bit Level   257
    Evgeny Pavlenko, Markus Wedler, Dominik Stoffel, Wolfgang Kunz, Oliver Wienand and Evgeny Karibaev
    17.1  Introduction   257
    17.2  Normalization Method   259
          17.2.1  ABL Normalization   259
          17.2.2  Mixed ABL/Gate-Level Problems   262
    17.3  Synthesis of ABL Descriptions from Gate-Level Models   263
          17.3.1  Generation of the Equivalent ABL Descriptions for Boolean Functions in Reed–Muller Form   264
    17.4  Experimental Results   268
    17.5  Conclusion and Future Work   271
    References   272

18  Debugging Contradictory Constraints in Constraint-Based Random Simulation   273
    Daniel Große, Robert Wille, Robert Siegmund and Rolf Drechsler
    18.1  Introduction   273
    18.2  SystemC Verification Library   275
    18.3  Contradiction Analysis   276
          18.3.1  Problem Formulation   276
          18.3.2  Concepts for Contradiction Analysis   277
    18.4  Implementation   280
    18.5  Experimental Evaluation   282
          18.5.1  Types of Contradictions   283
          18.5.2  Effect of Property 1 and Property 2   284
          18.5.3  Real-Life Example   285
    18.6  Conclusions   288
    References   289

19  Design of Communication Infrastructures for Reconfigurable Systems   291
    Alessandro Meroni, Vincenzo Rana, Marco D. Santambrogio and Francesco Bruschi
    19.1  Introduction   291
    19.2  Related Works   293
    19.3  Real World Applications Analysis   293
          19.3.1  Applications Layer   294
          19.3.2  Scenarios Layer   295
          19.3.3  Characteristics Layer   295
          19.3.4  Metrics Layer   296
    19.4  The Proposed Solution   297
          19.4.1  High Level Description   298
          19.4.2  High Level Network Simulation   299
          19.4.3  Evaluation and Selection   301
          19.4.4  Verification and Validation   302
    19.5  Results   302
    19.6  Concluding Remarks   306
    References   306

20  Analysis of Non-functional Properties of MPSoC Designs   309
    Alexander Viehl, Björn Sander, Oliver Bringmann and Wolfgang Rosenstiel
    20.1  Introduction   309
    20.2  Related Work   311
    20.3  Preliminaries   312
          20.3.1  Activity Model   312
          20.3.2  Power Management Model   313
    20.4  Design Flow   313
    20.5  Abstraction of System Functionality   314
    20.6  Simulation Model Generation   315
          20.6.1  Communication Dependency Graphs   315
          20.6.2  Temporal Environment Models   316
          20.6.3  Integration of Power Consumption and Power Management   318
          20.6.4  Battery Models, Placement and Chip Environment   319
    20.7  Experimental Results   320
    20.8  Conclusions   322
    References   323
Part I
Model-Based System Specification
Languages
Chapter 1
Power and Energy Estimations in Model-Based
Design
Eric Senn, Saadia Douhib, Dominique Blouin,
Johann Laurent, Skander Turki and
Jean-Philippe Diguet
Abstract The aim of our work is to provide methods and tools to quickly estimate the power consumption at the first steps of a system design. We introduce multi-level power models and show how to use them at different levels of the specification refinement in the model-based AADL (Architecture Analysis & Design Language) design flow. Those power models, with the underlying methodology for
power estimation, are currently being integrated in the Open Source AADL Tool Environment (OSATE) under the name CAT: Consumption Analysis Toolbox. Its first
prototype gives, in the case of a processor binding, power consumption estimations,
for software components in the AADL component assembly model, with a maximal
error ranging roughly from 5% at the lowest refinement level (the source code of
the software component is known), to 30% at the highest level (only the operating
frequency and basic target configuration parameters are considered). We illustrate
our approach with the power model of a simple RISC (PowerPC 405), of a complex
DSP (TI C62), and of an FPGA (from ALTERA). We show how those models can
be used at different levels in the AADL flow. Obviously, the power consumption
of Operating System (OS) services is also to be considered here. We show that the
OS's principal impact on the overall consumption is mainly due to services that imply data transfers. We introduce a methodology to model the power and energy consumption of Inter-Process Communications (IPC), and illustrate this methodology on the
building and use of a model for Ethernet based inter-process communications.
Keywords Power and energy consumption modelling · Model driven engineering ·
AADL · Embedded system · Processors · FPGA
1.1 Introduction
AADL (Architecture Analysis & Design Language) is an input modeling language
for real-time embedded systems [13]. It is now commonly used in the avionics domain for the design of safety-critical systems. The aim of AADL models is to provide
E. Senn ()
Lab-STICC, CNRS UMR 3192, Université de Bretagne Sud, Lorient, France
e-mail: eric.senn@univ-ubs.fr
a framework in order to verify functional and non-functional properties of a system,
from early analysis of the specification, to code generation for the targeted hardware
platform [15, 23, 30].
One objective of the European project ITEA/SPICES (Support for Predictable
Integration of mission Critical Embedded Systems) [2] is to improve the modeling
and analysis capabilities of AADL. In association with the AADL Standardization
Committee [3], enrichments of the language are being proposed, to appear in the
next AADL release (2.0).
We are currently working in this context to provide energy and power consumption estimations at different levels in the AADL refinement process. The advantages
of early verifications are well-known. They are however only possible if estimations
can be performed at these high levels within a reasonable time, allowing the user a fast exploration of the design space.
Knowing the power consumption of every component in the system, or at least
its upper bound, is indeed essential to guarantee the embedded system's reliability. This knowledge makes it possible to avoid the risk of burning a component in case its maximal power dissipation is exceeded (if its activity goes beyond its critical threshold, for instance), since the temperature of a component is directly related to its power dissipation. If the temperature rises, even if the component is not destroyed, its timing features can be altered. Typically, the delay of critical paths rises, and a fault can occur that causes the entire system to stall [14]. The increase of clock skews in VLSI circuits with non-uniform substrate temperature is also a source of major breakdowns [20]. Analyzing power consumption in a complete system is also necessary to avoid the risk of overloading a supply bus or a power supply source in case its capacity is exceeded. This ensures that the power consumption of the system and of all its components (including power sources, power buses, etc.) stays within the allowed global power budget, whatever the configuration of the system.
Significant research efforts have been devoted to develop tools for power consumption estimation at different abstraction levels in embedded system design. A lot
of those tools, however, work at the Register Transfer Level (RTL) (this is the case for the Diesel [11] and Petrol [21] tools), at the Cycle Accurate Bit Accurate (CABA)
level [6, 19], and a few tools at the architectural level (Wattch [8] and Simplepower
[31]). Such approaches cannot be used at high levels because simulation times at
such low abstraction levels become enormous for complete and complex systems,
like multiprocessor heterogeneous platforms.
In [12] and [18], the authors present a characterization methodology for generating power models within TLM for peripheral components. The pertinent activities are identified at several levels and granularities. The characterization phase
of the activities is performed at the gate level and is used to deduce the power of
coarse-grained activities at a higher level. Again, applying such approaches to complex processors or complete systems is not feasible. Instruction level or functional
level approaches have been proposed [22, 27, 29]. They however only work at the
assembly level, and need to be improved to take into account pipelined architectures,
large VLIW instruction sets, and internal data and instruction caches.
We introduced the Functional Level Power Analysis (FLPA) methodology which
we have applied to the building of high-level power models for different hardware
components, from simple RISC processors to complex superscalar VLIW DSP [16,
17], and for different FPGA circuits [25].
We present here our new approach to allow for system level power and energy
estimations in model-based design flow. We show how it fits into the AADL design
flow and how our power models, being inter-operable, are used at different refinement levels. The AADL design flow is presented in Sect. 1.2, and the methodology
to perform power and energy estimations throughout this flow in Sect. 1.3. The first
phase of our consumption analysis approach, the power estimation, is detailed in
Sect. 1.4. We introduce the power model of the GPP PowerPC 405 and show how
multi-level power models are used at different levels in the specification refinement.
The power models of the DSP TI C62 and the FPGA Altera Stratix EP1S80 are introduced and used as examples in Sects. 1.5 and 1.6. Section 1.7 presents the modeling of energy consuming OS services, that ought to be included in the software
components consumption. The model of the IPC Ethernet communication is given
as an example. The software development and integration in OSATE is sketched in
Sect. 1.8. Dedicated packages and property sets for power estimation are exhibited
here. The accuracy of our power estimations is finally evaluated in comparison with
physical measurements.
1.2 AADL Component Based Design Flow
A component-based system is a component-based software application managed
by a component framework and deployed on a target platform [5]. The component-based AADL design flow is presented in Fig. 1.1. It relies mainly on the three following models. The AADL component assembly model contains all the component and connection instances of the application. It also references the implementation models of all the component instances, found in the "AADL models library". The AADL target platform model describes the hardware of the physical target platform. This platform is composed of at least one processor, one memory, and one bus entity to host process and thread execution. The AADL deployment plan model describes the AADL-PSM composition process. It defines all the binding properties that are necessary to deploy the processes and services model of the component-based application on the target platform. Those models are combined to obtain the
AADL-PSM model of the complete component-based system. From there, the final
implementation of the system is obtained through model transformations and code
generation.
The Open Source AADL Environment Tool (OSATE) [1] provides a framework
to specify a complete system using AADL. It also permits checking some of its
functional and non-functional properties. Those verifications rely on the use of different plug-ins included in the tool set. During the deployment, software components in the AADL component assembly model are bound to hardware components
Fig. 1.1 AADL component based design
in the AADL target platform model. OSATE scheduling analysis plug-in uses information embedded in the software components description to propose a binding
for the system [26]. Figure 1.2 shows the typical binding of an application on a
multi-processor architecture. In this example, process main_process is composed of six threads. It is bound to the memory sysRAM, as well as the data
block data_b. Threads control_thread, ethernet_driver_thread
and post_thread are bound to the first general purpose processor GPP1. Thread
pre_thread is bound to GPP2. Thread hw_thread1 is, like hw_thread2 a
hardware thread. It will be implemented in the reconfigurable FPGA circuit FPGA1.
One connection between pre_thread and post_thread has been declared
using in and out data ports in the threads. This connection is bound to bus
sys_bus since it connects two threads bound to two different components in the
platform. Intra-component connections, like between threads control_thread
and ethernet_driver_ thread, do not need to be bound to specific buses.
They will however consume hardware resources while being used.
In addition to communication buses, dedicated supply buses can also be declared. The resources analysis plug-in in OSATE comes with an Analyze Bus Power Draw command that makes it possible to check that the power capacity of a supply bus is not
exceeded. Indeed, a power capacity property (SEI::PowerCapacity) can be
declared for a bus, and every component that requires an access to this bus declares
a power budget (property SEI::PowerBudget) that it draws from the bus. The
plug-in then adds all the power budgets declared for a bus and compares the result
with the bus power capacity. This mechanism, even if it is interesting, is extremely
limited: power information is only a guess from the user and only applies to supply
buses.
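To make the bus power-draw check concrete, the following sketch (in Python, with invented component names and numbers) reproduces the summation that the resources analysis plug-in performs with the SEI::PowerBudget and SEI::PowerCapacity properties; it only illustrates the mechanism and is not the plug-in's actual code.

```python
# Minimal sketch of the "Analyze Bus Power Draw" check: sum the
# SEI::PowerBudget values declared by components requiring access to a
# supply bus and compare the total with the bus SEI::PowerCapacity.
# Component names and numbers are illustrative, not taken from the chapter.

def check_bus_power_draw(capacity_mw, budgets_mw):
    """Return (ok, total) where ok is True if the capacity is respected."""
    total = sum(budgets_mw.values())
    return total <= capacity_mw, total

declared_budgets = {        # SEI::PowerBudget of each component (mW)
    "GPP1": 450.0,
    "GPP2": 450.0,
    "FPGA1": 900.0,
}
ok, total = check_bus_power_draw(capacity_mw=2000.0, budgets_mw=declared_budgets)
print(f"drawn {total} mW of 2000 mW -> {'OK' if ok else 'capacity exceeded'}")
```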
Fig. 1.2 Binding components to the target platform
1.3 Consumption Analysis: the Methodology
We propose to use realistic power estimates for power analysis by using an accurate
power estimation tool and precise power consumption models for every component
in the targeted hardware platform. The challenge for the power estimation tool is
to provide a realistic power budget for every component in the AADL component
assembly model. Eventually, two computation phases are necessary to analyze the
power consumption of the system:
1. In the first phase, the power budget is computed for every software component.
This phase is the power estimation.
2. In the second phase, the power budgets of software components are combined to
get the power budgets for hardware components. This second phase is the power
analysis. Using timing information, the energy analysis is performed afterwards.
Those two phases are presented in Fig. 1.3 in the case of the binding of a thread
to a processor. Plain edges represent phase 1 (estimation), dotted edges represent
phase 2 (analysis). The power estimation tool gathers several pieces of information from the
Fig. 1.3 Power and energy consumption estimation in the AADL design flow
system’s AADL specification at different places, here from the process and thread
descriptions, and from the processor specification. It also uses binding information
that may come from a preliminary scheduling analysis. The result of scheduling
analysis indeed gives the percentage load of processors for each thread. Whenever a
processor is idle, its power consumption is at the minimum level. Scheduling analysis is performed using basic information on the threads properties defined as properties for each thread implementation in the AADL component assembly model:
dispatch_protocol (periodic, aperiodic, sporadic, background), period, deadline, execution_time, etc. The tool uses the power model of the targeted hardware component (here a processor) to compute the power budget for the application component
(which can be a software or hardware thread). In fact, it decides the input parameters
of the model from the set of information it has gathered.
Once the power budgets have been computed for every component in the application, the power analysis can be done. The power analysis tool retrieves all the
component power budgets, together with additional information from the specification, and computes the power budget for every hardware component in the system.
Then it computes the power estimation for the whole system. Energy analysis can
be performed using information from the timing analysis tools currently being developed in the SPICES project.
1.4 Power Estimation
We have developed a specific power estimation tool to compute the power budget for every software component in the AADL component assembly model. This
tool, which is an evolution of our former power estimation tool SoftExplorer [24],
comes with a library of power models for every hardware component on the platform. The method to build a model is based on the Functional Level Power Analysis
(FLPA) [17]. This approach relies on the decomposition of a complex system into
functional blocks, independent regarding the power consumption. Parameters are
identified that characterize the way the functional blocks will be excited by the input specification. A set of measurements is performed, where the system consumption is measured for different values of the input parameters. Consumption charts
are plotted and mathematical equations are computed by regression. This method is
interesting because it links low-level measures and observations of the physical implementation with high-level parameters from earlier steps in the design flow. This
is an efficient way of evaluating how an application will use the resources in the targeted implementation. As depicted before, the building of a model, and the choice
of its parameters, are based on a set of physical measurements. Measurements again
will be used to validate the model. The model output is compared with measured
values over a large set of realistic applications. This makes it possible to define precisely the
maximal and average errors introduced by the model.
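The regression step of FLPA can be illustrated as follows; the measurement points are invented, and a simple linear law of the form P = a·Fbus + b is assumed purely for the example.

```python
# FLPA sketch: fit a linear consumption law P = a*Fbus + b from a set of
# physical measurements taken at different bus frequencies.
# The sample points are invented; a real model uses measured data.
import numpy as np

f_bus = np.array([25.0, 33.0, 50.0, 66.0, 100.0])            # MHz
p_meas = np.array([1723.0, 1766.0, 1858.0, 1943.0, 2125.0])  # mW (illustrative)

a, b = np.polyfit(f_bus, p_meas, deg=1)                      # least-squares fit
p_fit = a * f_bus + b
max_err = np.max(np.abs(p_fit - p_meas) / p_meas) * 100.0    # model error in %

print(f"P(mW) = {a:.2f}*Fbus + {b:.0f}   (max error {max_err:.1f}%)")
```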
1.4.1 Power Models
A power model is thus a set of consumption laws that make it possible to compute the power consumption of a component as a function of a reduced set of input parameters. As
an example, we present in Table 1.1, Table 1.2 and Table 1.3 the set of consumption laws that constitute the power model of the PowerPC 405 processor that we
have developed in the frame of SPICES. Those laws come with an accuracy better
than 5% (the average maximum error found between the physical measurements
and the results of a law). The average error is 2%. Input parameters are the processor frequency (Fprocessor ) and the frequency of the bus it is connected to (Fbus ), the
type of instruction executed (calculation (Calc) or memory accesses (LD and ST)),
the configuration of the memory hierarchy associated to the processor’s core (i.e.
which caches are used (data and/or instruction) and where the primary memory is
(internal/external)), and the cache miss rate (γ ).
Table 1.1 presents the consumption laws for the 2.5 V power supply when caches
are disabled, or one of the two caches is enabled. When the two caches are disabled, the program and/or data can be stored into the internal FPGA memory (using
BRAM) or in an external memory (an SDRAM on our board). Even if it is possible
to store the program and data into the external memory, programmers never use this
possibility. Indeed, the cost of one access (in power and time) in the external memory is too high to make this solution acceptable. As a result, we have considered in
Table 1.1 Consumption laws for the 2.5 V power supply

BRAM                                     P (mW) = 5.37Fbus + 1588                            (1)
SDRAM, instruction cache, γ = 0          PCalc (mW) = 5.37Fbus + 1588                        (2)
SDRAM, instruction cache, γ > 0
  Fbus = 50                              PCalc (mW) = 0.76Fprocessor + 3.19γ + 2000          (3)
  Fbus = 66                              PCalc (mW) = 154.78 ln(γ) + 1837                    (4)
  Fbus = 100                             PCalc (mW) = 0.99Fprocessor + 4.47γ + 2155          (5)
                                         PLD/ST (mW) = 4.96γ + 6.36Fbus + 1599               (6)
SDRAM, data cache, LD                    P (mW) = 4.1γ + 6.3Fbus + 1599                      (7)
SDRAM, data cache, ST                    P (mW) = 6.88γ + 7.27Fbus + 1635                    (8)
Table 1.2 Consumption laws for the 1.5 V power supply

Without cache, BRAM
  PCalc (mW) = 0.32Fprocessor + 2.35Fbus + 63.95        (9)
  PLD/ST (mW) = 0.55Fprocessor + 2.68Fbus + 60          (10)
Instruction cache, SDRAM
  PCalc (mW) = 0.46Fprocessor + 2.21Fbus + 61           (11)
  PLD (mW) = 0.66Fprocessor + 3.28Fbus + 55             (12)
  PST (mW) = 0.39Fprocessor + 3.59Fbus + 40             (13)
Instruction cache, BRAM
  PCalc (mW) = 0.46Fprocessor + 2.21Fbus + 61           (14)
  PLD (mW) = 0.70Fprocessor + 3.05Fbus + 70             (15)
  PST (mW) = 0.44Fprocessor + 3.14Fbus + 65             (16)
Data cache, SDRAM
  PLD/ST (mW) = 0.38Fprocessor + 3.45Fbus + 65          (17)
Data cache, BRAM
  PLD (mW) = 0.40Fprocessor + 3.24Fbus + 74             (18)
  PST (mW) = 0.44Fprocessor + 3.14Fbus + 65             (19)
Table 1.3 Consumption laws with the data and instruction caches enabled

BRAM, 1.5 V     P (mW) = 0.40Fprocessor + 3.24Fbus + 74       (20)
BRAM, 2.5 V     P (mW) = 5.37Fbus + 1588                      (21)
SDRAM, 1.5 V    P (mW) = 0.38Fprocessor + 3.45Fbus + 79       (22)
SDRAM, 2.5 V    P (mW) = 4.1γ + 6.3Fbus + 1599                (23)
this case that the instructions and data are stored in the BRAM, either connected to
the processor through the OCM bus, or through the PLB bus. If the BRAM is used
then the consumption only depends on the bus frequency. If the SDRAM is used,
one of the two caches (for instruction or data) must be selected. With the instruction
cache, the consumption depends on the instruction type (LD/ST or computation instructions) and for computation instructions, it also depends on the cache miss rate
and the bus frequency. With the data cache, the consumption depends on the memory access instruction type (LD or ST).
Table 1.2 presents the consumption laws for the 1.5 V power supply again when
caches are disabled, or one of the two caches is enabled. The consumption depends
on which cache is used and on the type of primary memory (BRAM or SDRAM). If
the BRAM is used, then, without cache, the consumption depends on the instruction
type (computation or memory access). With the instruction cache, the consumption
depends on the instruction type, and with the data cache, it depends on the type
of memory access instruction (LD or ST). If the SDRAM is used, then, with the
instruction cache the consumption depends on the type of instruction (same law as with the BRAM memory). With the data cache the consumption is the same for
LD and ST instructions.
Table 1.3 presents the consumption laws for the 2.5 V and 1.5 V supplies when
both the data and instructions caches are enabled. In this situation, the consumption
depends on the primary memory used, but not on the type of instructions. On the
1.5 V supply, the consumption depends on the processor and bus frequencies. It only
depends on the bus frequency on the 2.5 V line.
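As an illustration of how these laws are evaluated, the sketch below encodes laws 20–23 of Table 1.3 and returns the total power drawn on both supply rails when the two caches are enabled; the frequencies and miss rate used in the example calls are arbitrary.

```python
# Sketch: total PowerPC 405 power with both caches enabled (Table 1.3).
# Laws 20-23 are taken from the table; the example inputs are arbitrary.

def p405_both_caches_mw(f_proc, f_bus, primary_memory, gamma=0.0):
    if primary_memory == "BRAM":
        p_1v5 = 0.40 * f_proc + 3.24 * f_bus + 74      # law 20
        p_2v5 = 5.37 * f_bus + 1588                    # law 21
    elif primary_memory == "SDRAM":
        p_1v5 = 0.38 * f_proc + 3.45 * f_bus + 79      # law 22
        p_2v5 = 4.1 * gamma + 6.3 * f_bus + 1599       # law 23
    else:
        raise ValueError("primary memory must be BRAM or SDRAM")
    return p_1v5 + p_2v5

# total power in mW for Fprocessor = 300 MHz, Fbus = 100 MHz
print(p405_both_caches_mw(300, 100, "BRAM"))
print(p405_both_caches_mw(300, 100, "SDRAM", gamma=10))
```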
1.4.2 Multi-level Estimation
As described before, the tool, when it is invoked, extracts relevant information (a set
of parameters) from the AADL specification, then computes the components’ power
consumption, and finally returns them to fill the power budget properties for the software components. The information that is extracted from the specification depends
on (i) the refinement level and (ii) the targeted hardware component. For instance,
information needed is not the same if the processor to which the software component (thread) is bound is an ARM7, a PowerPC, or a TI-C6x DSP. During the
deployment, software components are bound to hardware components. Thus, the
power estimation tool, as previously illustrated in Fig. 1.3 in the case of a thread to
processor binding, must resolve any possible binding of software components onto
hardware components, and that means: (i) threads onto processors, (ii) processes
and data onto memories, and (iii) connections onto buses. The binding makes it possible to relate components in the AADL component assembly model to
the power models of hardware components on the targeted platform.
In the following, still using the PowerPC 405 as an example, we show how a
power model is used at different refinement levels of the AADL specification. To
use a power model, we have to extract some relevant information in the AADL
specification, in order to determine the model’s input parameters. Depending on the
specification refinement, it might not be possible to determine precisely the value
of every parameter. It is important to determine the accuracy of our estimations at
every proposed level. In order to do that, we fix the values of parameters that are
known at the considered level, and then perform estimations with all the possible
values of the remaining unknown parameters. The maximum error is the difference
between the average and the maximum estimations. This is repeated for another set
of values for the known input parameters. The final maximum error is the maximum
of the maximum errors. This is a pessimistic worst case because, most of the time, the user, even if he cannot determine them precisely at a given refinement level, can give a realistic value for every unknown parameter. We finally defined three refinement levels for
the PowerPC 405.
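The error-evaluation procedure described above can be written down directly; in the sketch below the model function and the parameter grids are placeholders standing in for a real power model and its admissible parameter values.

```python
# Sketch of the maximum-error computation at a given refinement level:
# fix the known parameters, sweep every admissible value of the unknown
# ones, and take the gap between the maximum and the average estimate.
from itertools import product

def max_error_percent(model, known, unknown_grid):
    estimates = [model(**known, **dict(zip(unknown_grid, values)))
                 for values in product(*unknown_grid.values())]
    avg, worst = sum(estimates) / len(estimates), max(estimates)
    return (worst - avg) / avg * 100.0

def final_max_error(model, known_sets, unknown_grid):
    # repeat for every valid set of known parameters, keep the maximum
    return max(max_error_percent(model, k, unknown_grid) for k in known_sets)

toy = lambda f_bus, gamma: 6.3 * f_bus + 4.1 * gamma + 1599     # placeholder law
print(max_error_percent(toy, {"f_bus": 100}, {"gamma": [0, 20, 40, 60]}))
```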
1.4.2.1 Refinement Level 1
At the first refinement level, our model gives a rough estimate of the power consumption for the software component, only from the knowledge of the processor
and some basic information on its operating conditions. In the case of the PowerPC
405, the maximum error we get here is 27%. The only information we need is the
processor frequency and the frequency of the internal bus (OCM or PLB) to which
the processor is connected inside the FPGA. Those are the known parameters that
were used to determine the maximum error: valid processor/bus frequencies couples
were fixed and estimations performed with all the others parameters tested. Those
two parameters are in fact directly related to the target platform and the hardware
component. They can be changed according to the user’s will. They constitute what
we call hardware configuration parameters for this processor. They will be defined
as a property of the AADL processor implementation of the PowerPC 405 in the
AADL specification. To calculate the maximum error, power estimations are performed with one processor/bus frequencies couple, and with all the possible values
of the remaining parameters which are not known at this refinement level. The maximum error then comes from the difference between the average and the maximum
estimations. This is repeated for every valid processor/bus frequencies couple. The
final maximum error is the maximum of the maximum errors.
1.4.2.2 Refinement Level 2
At this refinement level, we have to add some information about the memories used.
We have to indicate what caches will be used in the PowerPC 405 (data cache, instructions cache, or both), and if its primary memory is internal (using the FPGA
BRAM memory bank) or external (using a SDRAM accessed through the FPGA
I/O). Indeed, while building the power model for the PowerPC 405, we have observed that it draws quite different power and energy in those various situation.
Four different operating conditions are identified (see the law-selection sketch after this list):
• Caches disabled: If caches are disabled then, as explained before, instructions and
data are stored in the internal memory (BRAM). Consumption laws 1, 9 and 10
are used in this case.
• Instruction cache enabled: When an instruction is not in the cache, a cache miss
is generated and an access is done to the global memory (the external memory
in most cases). The instruction is fetched from this memory and stored into the
cache. If the global memory is (function of the development board) the internal
BRAM of the FPGA connected by the PLB bus, then laws 1, 14, 15, 16 are used.
If the global memory is the external SDRAM (accessed through the FPGA I/O)
then laws 2, 3, 4, 5, 6 (depending on the cache miss rate γ and the bus frequency
Fbus ) and laws 11, 12 and 13 are used.
• Data cache enabled: When a data item is not found in the cache, a cache miss is generated. The data is read from the global memory and written to the cache. The
program instructions are stored in the internal BRAM connected to the OCM
bus. The global memory can be the internal BRAM connected to the PLB bus
(laws 1, 9, 18, 19 are used), or the external SDRAM (through FPGA I/O). In this
last situation, laws 7, 8, 9 and 17 are used.
• Data and instruction caches enabled: If the primary memory is the BRAM, laws
20 and 21 are used. Laws 22 and 23 are used if the primary memory is the external
SDRAM.
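A compact way to capture these four cases is a selection function returning the law numbers to evaluate for a given cache configuration and primary memory; the mapping in the sketch below only restates the case list above, and the function name is ours.

```python
# Sketch: select the PowerPC 405 consumption laws (Tables 1.1-1.3) that
# apply at refinement level 2, from the cache configuration and the
# location of the global/primary memory.

def applicable_laws(icache, dcache, memory):
    """memory is 'BRAM' or 'SDRAM'; returns the law numbers to evaluate."""
    if not icache and not dcache:                 # caches disabled (BRAM only)
        return [1, 9, 10]
    if icache and not dcache:                     # instruction cache enabled
        return [1, 14, 15, 16] if memory == "BRAM" else [2, 3, 4, 5, 6, 11, 12, 13]
    if dcache and not icache:                     # data cache enabled
        return [1, 9, 18, 19] if memory == "BRAM" else [7, 8, 9, 17]
    return [20, 21] if memory == "BRAM" else [22, 23]   # both caches enabled

print(applicable_laws(icache=True, dcache=False, memory="SDRAM"))
```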
Table 1.4 Maximal errors with level 2 refinement (power estimates in mW)

Fprocessor/Fbus (MHz)   300/100 200/100 100/100 200/66 200/50 150/50 100/50 100/33 100/25  Max. error
2 caches BRAM             2595    2555    2515   2262   2134   2104   2084   1938   1869      0%
2 caches SDRAM_MAX        3129    3091    3053   2760   2604   2585   2566   2400   2321
2 caches SDRAM_MIN        2719    2681    2643   2350   2194   2175   2156   1990   1911
  Error                   7.0%    7.1%    7.2%   8.0%   8.5%   8.6%   8.7%   9.3%   9.7%     9.7%
BRAM_MAX                  2570    2515    2460   2241   2112   2084   2057   1920   1856
BRAM_MIN                  2472    2440    2408   2177   2053   2037   2021   1890   1829
  Error                   1.9%    1.5%    1.1%   1.4%   1.4%   1.1%   0.9%   0.8%   0.7%     1.9%
Icache_BRAM&BRAM_MAX      2662    2592    2522   2305   2170   2135   2100   1957   1890
Icache_BRAM&BRAM_MIN      2497    2451    2405   2193   2071   2048   2025   1897   1836
  Error                   3.2%    2.8%    2.4%   2.5%   2.3%   2.1%   1.8%   1.6%   1.4%     3.2%
Icache_SDRAM_MAX          3432    3267    3102   3155   3103   3020   2938   2882   2856
Icache_SDRAM_MIN          2600    2478    2356   2403   2367   2306   2239   2208   2190
  Error                  13.8%   13.7%   13.7%  13.5%  13.5%  13.4%  13.5%  13.2%  13.2%    13.8%
Dcache_BRAM_MAX           2595    2555    2515   2262   2124   2104   2084   1938   1869
Dcache_BRAM_MIN           2588    2544    2500   2254   2118   2096   2074   1930   1861
  Error                   0.1%    0.2%    0.3%   0.2%   0.1%   0.2%   0.2%   0.2%   0.2%     0.3%
Dcache_SDRAM_MAX          3535    3497    3459   3133   2960   2941   2922   2741   2622
Dcache_SDRAM_MIN          2744    2706    2668   2375   2218   2199   2180   2015   1936
  Error                  12.6%   12.8%   12.9%  13.8%  14.3%  14.4%  14.5%  15.3%  15.1%    15.3%
Table 1.4 shows the maximal errors we obtain for every valid set of known input
parameters, the others being unknown. The maximum error we obtain is 15.3% and
the average error is 6.6%. The first line indicates 0% because, in this configuration, there are no remaining unknown parameters that can change the power consumption of the processor.
The result of the scheduling analysis (which gives the %load of processors)
is also taken into account at this level. Indeed, for the percentage of time a processor is idle, its power consumption is at the minimum level. Scheduling analysis
is performed using basic information on the threads properties defined as properties for each thread implementation in the AADL component assembly model:
dispatch_protocol (periodic, aperiodic, sporadic, background), period, deadline, execution_time, etc.
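One simple reading of how the scheduling result enters the estimation is a linear weighting between the estimated busy power and the minimum (idle) power; the weighting and the numbers below are our assumption for illustration, not a formula given in the chapter.

```python
# Sketch: account for processor load obtained from scheduling analysis.
# For the fraction of time the processor is idle, its consumption is taken
# at the minimum level; the linear weighting and numbers are assumptions.

def load_weighted_power_mw(p_busy_mw, p_idle_mw, load):
    """load is the processor utilisation in [0, 1] from scheduling analysis."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")
    return load * p_busy_mw + (1.0 - load) * p_idle_mw

print(load_weighted_power_mw(p_busy_mw=2600.0, p_idle_mw=1700.0, load=0.6))
```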
1.4.2.3 Refinement Level 3
At this refinement level, the actual code of the software component is parsed. In the
case of the PowerPC 405, what is important is not exactly which instruction is executed, but rather the type of instruction being executed. We have indeed shown
that the power consumption changes noticeably from memory access instructions
(load or store in memory), to computation instructions (multiplication or addition).
As we have seen before, the place where the data is stored in memory is also important, so the data mapping is also parsed here. The average error we get at this
level is 2%. The maximum error is 5%. Logically, that corresponds to the max and
average errors for the set of consumption laws for the component.
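The principle of this level 3 estimation can be sketched as follows; the two instruction categories are the ones discussed above, but the per-category power figures, the crude mnemonic-based classification and the uniform weighting are all placeholder assumptions, not the actual PowerPC 405 laws:

# Hypothetical per-category power figures (mW); the real laws belong to the
# PowerPC 405 model and depend on the memory configuration.
POWER_BY_CATEGORY = {"memory_access": 2600, "computation": 2100}

def classify(instruction):
    """Very crude classification into the two categories discussed above."""
    return "memory_access" if instruction.startswith(("lwz", "stw")) else "computation"

def level3_power(instructions):
    """Average power over the instruction trace (uniform weighting assumed)."""
    powers = [POWER_BY_CATEGORY[classify(i)] for i in instructions]
    return sum(powers) / len(powers)

print(level3_power(["lwz", "add", "mullw", "stw"]))  # -> 2350.0 mW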
1.5 Power Estimation for Complex DSP
The TI C62 processor has a complex architecture. It has a VLIW instruction set, a deep pipeline (up to 15 stages), fixed-point operators, and parallelism capabilities (up to 8 operations in parallel). Its internal program memory can be used like a cache in several modes, and an External Memory Interface (EMIF) is used to load and store data and program from the external memory [28]. In the case of the C62, the following six parameters are considered. The clock frequency (F) and the memory mode (MM) are what we call architectural parameters. They are directly related to the target platform and the hardware component, and can be changed at the user's discretion. The influence of F is obvious. The C62 maximum frequency is 200 MHz (for our version of the chip); the designer can adjust this parameter to trade off consumption against performance.
The remaining parameters are called algorithmic parameters; they directly depend on the application code itself. The parallelism rate α assesses the flow between
the processor’s instruction fetch stages and its internal program memory controller
inside its IMU (Instruction Management Unit). The activity of the processing units
is represented by the processing rate β. This parameter links the IMU and the PU
(Processing Unit). The activity rate between the IMU and the MMU (Memory Management Unit) is expressed by the program cache miss rate γ . The pipeline stall
rate (PSR) counts the number of pipeline stalls during execution. It depends on the
mapping of data in memory and on the memory mode.
The memory mode MM illustrates the way the internal program memory is used.
Four modes are available. All the instructions are in the internal memory in the
mapped mode (MM M ). They are in the external memory in the bypass mode (MM B ).
In the cache mode, the internal memory is used like a direct mapped cache (MM C ),
as well as in the freeze mode where no writing in the cache is allowed (MM F ).
The internal logic components used to fetch instructions (for instance, tag comparison in cache mode) actually depend on the memory mode, and so does the power consumption.
A precise description of the C62 power model and its construction may be found in [16].
Table 1.5 Default algorithmic parameters for the C62

Application                                                α       β       PSR
LMSBV_1024                                                 1       0.625   0.385
MPEG_1                                                     0.687   0.435   0.206
MPEG_2_ENC                                                 0.847   0.507   0.28
FFT_1024                                                   0.5     0.39    0.529
DCT                                                        0.503   0.475   0.438
FIR_1024                                                   1       0.875   0.666
EFR_Vocoder_GSM                                            0.578   0.344   0.045
HISTO (image equalisation by histogram)                    0.506   0.346   0.499
SPECTRAL (signal spectral power density estimation)        0.541   0.413   0.288
TREILLIS (Soft Decision Sequential Decoding)               0.55    0.351   0.038
LPC (Linear Predictive Coding)                             0.684   0.468   0.171
ADPCM (Adaptive Differential Pulse Code Modulation)        0.96    0.489   0.194
DCT_2 (imag 128 × 128)                                     0.991   0.709   0.435
EDGE DETECTION                                             0.976   0.838   0.173
G721 (Marcus Lee)                                          1       0.682   0.032
AVERAGE VALUES                                             0.7549  0.5298  0.2919
The variation of the power consumption with the input parameters, more precisely the fact that the estimation is not equally sensitive to every parameter, makes it possible to use the model in three different situations.
In the first situation, only the operating frequency is known. The tool returns the average value of the power consumption, computed from the minimum and maximum values obtained when all the other parameters are varied. The designer can also ask for the maximum value if an upper bound on the power consumption is needed.
In the second situation, we suppose that the architectural parameters (here F and MM) are known. We also assume that the code is not known but that the designer is able to give realistic values for every algorithmic parameter. If not, default values are proposed, derived from the values we observed when running different representative applications on this DSP (see Table 1.5).
In the third situation, the source code is known. It is then parsed by our power estimation tools: the value of every algorithmic parameter is computed and the power consumption is estimated, using the power model and the values entered by the user for the frequency and the memory mode.
The error introduced by our tool obviously differs in these three situations. To calculate the maximum error, estimations are performed with the given values of the parameters known in the situation, and with all the possible values of the remaining unknown parameters. The maximum error then comes from the difference between the average and the maximum estimations. This is repeated for every valid set of known input parameters. The final maximum error is the maximum of these maximum errors. Table 1.6 gives the maximum error in the three situations above, which correspond to the three levels of specification refinement.
Table 1.6 Maximum errors for the C62 power model (power in mW)

         Known parameters               Memory Mode   Max Power   Min Power   Average Power   Max Error
Level 1  Frequency                      X             3037        848         2076            59%
Level 2  Frequency, MM, α, β, γ, PSR    Mapped        2954        848         1809            53%
                                        Cache         2955        756         1778            57%
                                        Freeze        3018        882         1801            51%
                                        Bypass        3037        2014        2397            21%
Level 3  F, MM, and the source code is provided: Max Error = 8%, Average Error = 4%
Note that the maximal errors computed at level 2 are very pessimistic, since we assume here that the designer's evaluation of all the input parameters is completely (100%) wrong. If that evaluation is only 50% or 25% off, the error introduced by our tool is reduced accordingly.
1.6 Power Estimation for Field Programmable Gate Array
FPGAs (Field Programmable Gate Arrays) are now very common in electronic systems. They are often used in addition to GPPs (General Purpose Processors) and/or DSPs (Digital Signal Processors) to tackle data-intensive, dedicated parts of an application. They act as hardware accelerators where and when the application is very demanding in terms of performance, typically for signal or image processing algorithms. In this case again, power estimation can be performed at different refinement levels.
At the highest levels, the code of the application is not known yet. The designer however needs to quickly evaluate the application against power, energy and/or thermal constraints. A fast estimation is necessary here, and a much larger error is acceptable. The parameters we can use from the high-level specifications are the frequency F and the occupation ratio β of the targeted FPGA implementation, which we consider as architectural parameters, and the activity rate α. The experienced designer is indeed able to provide, even at this very high level, a realistic guess of those parameters' values. As explained before, to obtain the model, i.e. the mathematical equation linking its output to the parameters, we performed a set of measurements on the targeted FPGA. For different values of the occupation ratio, and for different values of the frequency, we varied the activity rate and measured the power consumption.
Table 1.7 Maximum errors for the Altera Stratix EP1S80 (power in mW)

         Known parameters             Max Power   Min Power   Average Power   Max Error
Level 1  Frequency (F = 10 MHz)       789         307         548             44%
         Frequency (F = 90 MHz)       4824        835         2830            70%
Level 2  Frequency, α, β
         F = 10 MHz, β = 0.1          353         307         324             8.8%
         F = 10 MHz, β = 0.9          789         544         667             24.1%
         F = 90 MHz, β = 0.1          931         835         883             6.9%
         F = 90 MHz, β = 0.9          4824        2435        3630            44.8%
Level 3  F and the source code is provided: Max Error = 4.2%, Average Error = 1.3%
At our first refinement level, only the frequency is known. Our power estimation tool uses the model to estimate, at the given frequency, the power consumption with α = β = 0.1 and with α = β = 0.9. It then returns the average of those minimal and maximal values. The maximal errors we obtain for F = 10 MHz and F = 90 MHz (upper bound for the Altera Stratix EP1S80) are given in Table 1.7.
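A sketch of this level 1 procedure follows; power_model stands in for the actual Stratix model, and the toy model used in the example is ours and only mimics the orders of magnitude of Table 1.7:

def level1_estimate(frequency_mhz, power_model):
    """Return (min, max, average) power at the given frequency, letting
    alpha and beta span their assumed extreme values 0.1 and 0.9."""
    p_min = power_model(frequency_mhz, alpha=0.1, beta=0.1)
    p_max = power_model(frequency_mhz, alpha=0.9, beta=0.9)
    return p_min, p_max, (p_min + p_max) / 2.0

# Hypothetical stand-in for the real FPGA power model (mW)
toy_model = lambda f, alpha, beta: 250 + 60 * f * alpha * beta
print(level1_estimate(90, toy_model))  # -> (304.0, 4624.0, 2464.0)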
At the next refinement level, the two architectural parameters F and β are known to the user. As in the case of the processor models above, default values, coming from a set of representative applications, are proposed for α (and also for β). The maximal error introduced in this case ranges from 6.9% to 44.8%. To determine this error, we compute the maximum and minimum estimations for the four extreme (F, β) couples, and compare them to the estimations with the default value of α.
At the lowest refinement level, the source code (a synthesizable hardware description of the component behaviour, which may be written in VHDL, SystemC, etc.) is used. A High-Level Synthesis tool [9] makes it possible to estimate the amount of resources necessary to implement the application and, given the targeted circuit, to obtain its occupation ratio (β) and its activity rate (α). Those two parameters and the frequency are finally used with the model.
1.7 Power Estimation for Operating System Services
We have presented how to obtain a precise estimation of the power and energy consumed by every thread in the application. There is still work to do to estimate the
consumption for the whole system. We have shown in [10] that a large part of a
system’s power consumption is due to data transfers. Those transfers may come
directly from the application, or may be induced by the operating system. Data
transfers from the application are either embedded in the source code through direct
memory accesses by threads, or they can be supported by inter-process communication services from the operating system. In the first situation, the data transfer power
consumption is directly included in the power model of the processor that runs the
code. The consumption of the external memory is easily computed from the number
of external accesses and the basic power features of the memory component from
its datasheet.
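A minimal sketch of this computation, assuming hypothetical per-access energy figures taken from the memory datasheet:

# Hypothetical datasheet figures; the method only needs the access counts and
# the per-access energy derived from the memory component's datasheet.
ENERGY_PER_READ_NJ = 8.0    # nJ per external read access (assumed)
ENERGY_PER_WRITE_NJ = 10.0  # nJ per external write access (assumed)

def external_memory_energy_uj(n_reads, n_writes):
    """Energy (microjoules) spent in the external memory."""
    return (n_reads * ENERGY_PER_READ_NJ + n_writes * ENERGY_PER_WRITE_NJ) / 1000.0

print(external_memory_energy_uj(120000, 40000))  # -> 1360.0 µJ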
Power consumption of inter-process communication (IPC) services is however
not included in the processors’ model. As a result, we have developed specific power
models for IPC. Moreover, a specific AADL package was also developed to allow
the user to describe how IPC services are called by the application. As an example, we show
here the model that we have developed for Ethernet IPC.
1.7.1 Ethernet Communications Consumption Modelling
We select a standard peripheral device, the Ethernet interface, as a representative example implemented in most embedded systems. As a first step, we identify the key parameters that can influence the power and energy consumption of Ethernet communications. Then we conduct physical power measurements on a XUP Pro development board, and take execution time values from the traces obtained with the Linux Trace Toolkit [32]. Measurements were taken while running different testbenches that contain RTOS routines stimulating the Ethernet interface. The RTOS that we analyze in this study is MontaVista embedded Linux 3.1. Once all measurements were obtained, we built the power and energy consumption model of the Ethernet interface. The following sections explain in detail the three steps that led to the Ethernet communications model.
1.7.1.1 Analysis of Relevant Parameters
Our study is focused on the effect of the operating system on power and energy
consumption of embedded system components. In the case of the Ethernet interface component, we identified hardware and software parameters influencing energy consumption. Hardware parameters for our models are processor frequency,
bus frequency and primary memory. Software parameters are related to the applicative tasks and the operating system services. They correspond to the IP packet data size and the transmission protocol (UDP or TCP).
1.7.1.2 Power and Energy Characterisation
We performed power consumption characterisation for two components of the XUP
board. The first component corresponds to the Ethernet MAC controller which is
embedded in the FPGA and powered by a 2.5 V power supply. The second is the physical Ethernet controller which is powered by a 3.3 V power supply.
We used test programs that only stimulate the OS networking services. Therefore, only the processor, RAM and Ethernet interface are solicited.

Table 1.8 Power model of Ethernet communications

MAC controller    PMAC (mW) = 0.65 · Fproc (MHz) + 2100
PHY controller    PPHY (mW) = 1096.22

Table 1.9 Energy model of Ethernet communications, E (µJ/byte) = a · Psize^b

MAC Controller (2.5 V)
Proc. Freq.   Model                                              Error
100 MHz       EUDP = 304.87 · Psize^−0.88, if Psize < 1500 b     6.28%
              EUDP = 1.47 · Psize^−0.13, if Psize ≥ 1500 b
              ETCP = 126.29 · Psize^−0.68, if Psize < 1500 b     10.16%
              ETCP = 1.80 · Psize^−0.107, if Psize ≥ 1500 b
200 MHz       EUDP = 181.26 · Psize^−0.86, if Psize < 1500 b     7.14%
              EUDP = 0.85 · Psize^−0.11, if Psize ≥ 1500 b
              ETCP = 60.73 · Psize^−0.65, if Psize < 1500 b      8.26%
              ETCP = 1.17 · Psize^−0.1, if Psize ≥ 1500 b
300 MHz       EUDP = 144.11 · Psize^−0.84, if Psize < 1500 b     8.59%
              EUDP = 0.79 · Psize^−0.11, if Psize ≥ 1500 b
              ETCP = 63.44 · Psize^−0.65, if Psize < 1500 b      8.68%
              ETCP = 1.09 · Psize^−0.09, if Psize ≥ 1500 b
1.7.2 Models
The mathematical power consumption models of the MAC and PHY controllers are represented by the equations in Table 1.8.
For the power model, the average error between the measured power consumption and the values estimated by the models is 3.5%; the maximum error is 9%.
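The power laws of Table 1.8 translate directly into a couple of lines of code. A minimal sketch (our own illustration; summing the two contributions for a total is our assumption):

def p_mac_mw(f_proc_mhz):
    """MAC controller power (mW) as a function of the processor frequency (Table 1.8)."""
    return 0.65 * f_proc_mhz + 2100

P_PHY_MW = 1096.22  # PHY controller power (mW), constant (Table 1.8)

print(p_mac_mw(300))              # -> 2295.0 mW
print(p_mac_mw(300) + P_PHY_MW)   # total Ethernet interface power, assuming simple addition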
Following the methodology defined in Sect. 1.3, we performed a performance analysis of the whole system. Then we calculated energy dissipation values in relation to the variation of all the model parameters. We obtained energy values for the MAC and PHY controllers. In Table 1.9, we give the model related to the MAC controller. For each transmission protocol and each processor frequency, there are two laws: the first is for IP packet data sizes smaller than 1500 bytes, the second for IP packet data sizes of 1500 bytes or more. Since the maximum transmission unit (MTU) of the Ethernet network is 1500 bytes, the Internet layer fragments IP packets larger than the MTU. On the other hand, there is more encapsulation and no fragmentation for IP packets smaller than the MTU. We can deduce from Table 1.9 that encapsulation yields more energy dissipation than fragmentation.
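For illustration, the 100 MHz MAC laws of Table 1.9 can be evaluated as follows; the helper function and the choice of the 100 MHz row are ours:

def mac_energy_uj_per_byte(packet_size_bytes, protocol="UDP"):
    """Energy per byte (µJ/byte) for the MAC controller at 100 MHz (Table 1.9)."""
    if protocol == "UDP":
        if packet_size_bytes < 1500:
            return 304.87 * packet_size_bytes ** -0.88
        return 1.47 * packet_size_bytes ** -0.13
    if packet_size_bytes < 1500:
        return 126.29 * packet_size_bytes ** -0.68
    return 1.80 * packet_size_bytes ** -0.107

# Total energy for a 1000-byte UDP payload at 100 MHz (about 700 µJ with this law)
print(1000 * mac_energy_uj_per_byte(1000, "UDP"))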
The model we propose has some fitting error with respect to the measured energy values it is based on. We use the following average error metric, where the Ẽi are the energy values given by the model and the Ei are the energy values based on power and performance measurements:

(1/n) · Σ_{i=1..n} |Ẽi − Ei| / Ei
After building the power and energy models of Ethernet communications, we
have integrated them into the model library of our power estimation tool. Following the methodology presented in Sect. 1.3, the tool estimates power and energy
consumption of applicative tasks communicating through Ethernet. To perform this
estimation, the tool extracts pertinent parameters from the AADL specification such
as processor frequency and transmission protocol type.
Despite its precise modelling semantics, we noticed that AADL has some shortcomings regarding software modelling. For example, communications between threads
or processes are modelled through ports (event, data or event data ports). But this
modelling facility is not sufficient to describe all the communication mechanisms
supported by an operating system. Therefore, it is necessary to extend the AADL
language to enable precise modelling of process communications. These extensions
will be presented in future publications.
1.8 Consumption Analysis Tool
Our power models for processors, FPGAs and OS services, together with the underlying power estimation methodology, are integrated into the Open Source AADL Tool Environment (OSATE) in the form of a global Consumption Analysis Toolbox (CAT). For an AADL system component, or for one of its subcomponents selected in the graphical model editor window, the tool computes the power consumption and displays the results in a component-specific Eclipse view (Fig. 1.4).
1.8.1 Property Sets
As we have seen in the previous sections of this chapter, our power consumption
models require specific input parameters for power consumption estimation to be
performed. These parameters must be stored in the AADL specification so that they
can be extracted later on by CAT. This is achieved by providing custom AADL
library extensions added to the Osate design environment when CAT is installed in
the Eclipse workbench.
Fig. 1.4 Power estimation results view shown for the selected component in the Osate editor window
Fig. 1.5 CAT AADL extensions defining families of processor classifiers and their corresponding property sets
Typically, a set of predefined component classifiers with their corresponding property sets is provided (Fig. 1.5). Those classifiers represent the actual components for which power can be estimated. The AADL "extends" mechanism is used to mark or type the model components of interest with one of our component library classifiers. The extension classifiers also provide a place to set default predefined property associations for the component. As an example, we show in Fig. 1.6 the AADL textual representation of a processor component extending our PowerPC 405 library classifier.
processor XUPProcessorType
extends cat::processors::ibm::Processor_POWERPC405_Type
features
Bus_PLB: requires bus access Bus_OnChip.Bus_PLB;
Data_jtag: in out data port;
properties
CAT_Processor_IBM_Properties::PC405_Processor_Bus_Freq_Couple =>
P300_B100;
end XUPProcessorType;
Fig. 1.6 AADL specification of a PowerPC processor
Table 1.10 Property set for the PowerPC 405
PowerPC 405 property set
Data_Memory_Config: CAT_Processor_IBM_Properties::Memory_Configurations
applies to (processor);
Inst_Memory_Config: CAT_Processor_IBM_Properties::Memory_Configurations
applies to (processor);
PC405_Processor_Bus_Freq_Couple:
CAT_Processor_IBM_Properties::PC405_Supported_Processor_Bus_Frequencies
applies to (processor);
PC405_Supported_Processor_Bus_Frequencies: type enumeration (P300_B100,
P200_B100, P200_B66, P200_B50, P150_B50, P100_B100, P100_B50);
Data_Cache_Miss_Rate: aadlreal applies to (processor);
Memory_Configurations: type enumeration (OCM_BRAM, PLB_BRAM,
PLB_BRAM_CACHE, PLB_SDRAM_CACHE);
The set of properties that are used by the estimation tool actually depends on the
processor itself, and more precisely, on its power model. For another processor, another set of specific properties might be necessary since another set of configuration
parameters might apply. The property set of the processor thus comes as a part of its power model and, as such, remains separate from the general property set associated with the current AADL working project for the application being designed in the OSATE environment.
Table 1.10 shows the property set for the PowerPC 405 classifier. This processor can be clocked at 100, 150, 200 or 300 MHz, and, depending on the processor frequency, the bus (OCM or PLB) frequency can take different values between 25 and 100 MHz. This is modelled as an enumeration type property (PC405_Supported_Processor_Bus_Frequencies) ensuring that only the predefined processor/bus frequency couples can be set.
Tables 1.11, 1.12 and 1.13 respectively show the property sets for generic digital signal processors, for the TI C62 processor family, and for the Altera Stratix EP1S80 FPGA that we modelled as a system component.
Table 1.11 Property set for digital signal processors
Texas Instruments DSPs Property Set
Memory_Mode: CAT_Processor_TI_Properties::Memory_Modes applies to (processor);
Memory_Modes: type enumeration (CACHE, FREEZE, BYPASS, MAPPED);
Table 1.12 Property set for the TI C62
TI C62 property set
Processor_Frequency: aadlreal applies to (processor);
Processor_Memory_Mode: CAT_Processor_TI_Properties::Memory_Modes
applies to (processor);
Processor_Parallelism_Rate: aadlreal applies to (processor);
Processor_Processing_Rate: aadlreal applies to (processor);
Processor_Cache_Miss_Rate: aadlreal applies to (processor);
Processor_Pipeline_Stall_Rate: aadlreal applies to (processor);
Processor_Memory_Mode_Type: type enumeration (CACHE,FREEZE,
BYPASS,MAPPED);
Processor_Parallelism_Rate_Default: constant aadlreal => 0.7549;
Processor_Processing_Rate_Default: constant aadlreal => 0.5298;
Processor_Cache_Miss_Rate_Default: constant aadlreal => 0.25;
Processor_Pipeline_Stall_Rate_Default: constant aadlreal => 0.2919;
Table 1.13 Property set for the Altera Stratix EP1S80
FPGA Altera Stratix EP1S80 property set
FPGA_Frequency: aadlreal applies to (system);
FPGA_Activity_Rate: aadlreal applies to (system);
FPGA_Occupation_Ratio: aadlreal applies to (system);
FPGA_Activity_Rate_Default: constant aadlreal => 0.4;
FPGA_Occupation_Ratio_Default: constant aadlreal => 0.5;
1.9 Conclusion
We have presented a method to perform power consumption estimations in the component-based AADL design flow. The power consumption of components in the AADL component assembly model is estimated whatever the targeted hardware resource in the AADL target platform model is: a DSP (Digital Signal Processor), a GPP (General Purpose Processor), an FPGA (Field Programmable Gate Array), or a peripheral device such as an Ethernet controller.
A power estimation tool has been developed with a library of multi-level power
models for those (hardware) components. These models can be used at different levels in the AADL specification refinement process. We have currently defined three refinement levels in the AADL flow.
Table 1.14 Maximal errors summary

Component               Max Error Level 1   Max Error Level 2   Max Error Level 3
TI C62                  59%                 57%                 8%
PowerPC 405             27%                 15.3%               5%
Altera Stratix EP1S80   70%                 44.8%               4.2%
At the lowest level, level 3, the (software) component's actual business code is considered and an accurate estimation is performed. This code, written in C or C++ for standard threads, can also be written in VHDL or SystemC for hardware threads. At level 2, the power consumption is only estimated from the component's operating frequency and its architectural parameters (mainly linked to its memory configuration in the case of processors). At level 1, the highest level, only the operating frequency of the component is considered.
Three power models have been presented, for the TI C62 DSP, the PowerPC 405 GPP, and the Altera Stratix EP1S80 FPGA. The maximum errors introduced by these models, at the three refinement levels, are given in Table 1.14.
Our approach is however not limited to those architectures, for it has been used successfully to develop power models for many other processors, some of them being even more complex (superscalar, pipelined, VLIW, with floating point units and L1 and L2 caches): the TI Digital Signal Processors C62, C64, C67, and C55, and the General Purpose Processors ARM7, ARM9, PowerPC and XScale (details on these power models can be found in former publications) [16].
We have also considered the operating system's effect on the power and energy consumption of the whole system. We have noticed that the major sources of consumption are data transfers, and we modeled the most consuming type of data transfer, which is between I/O devices and applicative tasks. We presented the case of Ethernet communications.
Within the SPICES project, our power estimation methodology and power models are being integrated into the Open Source AADL Tool Environment OSATE, under the name CAT: Consumption Analysis Toolbox. A first prototype of this tool was released in September 2008.
We are also interested in modelling-in-the-large approaches [7] like the Eclipse AM3 project [4] (Atlas Megamodel Management Tool). In such approaches, we will have a repository of models, metamodels, model transformations and services where we can publish bridges to/from AADL. This is the future basis for tool interchange and interoperability. Integrating our consumption analysis tool CAT into such an environment will give it greater visibility and accessibility.
References
1. The SAE AADL Standard Info Site. http://www.aadl.info/
2. The SPICES ITEA Project Website. http://www.spices-itea.org/
3. SAE—Society of Automotive Engineers. SAE AS5506, v1.0. Embedded Computing Systems Committee, SAE, November 2004.
4. AM3, ATLAS MegaModel Management. http://www.eclipse.org/gmt/am3/
5. H. Balp, É. Borde, G. Haïk, and J.-F. Tilman. Automatic composition of AADL models for the
verification of critical component-based embedded systems. In Proc. of the Thirteenth IEEE
Int. Conf. on Engineering of Complex Computer Systems (ICECCS), Belfast, Ireland, 2008.
6. R. BenAtitallah, S. Niar, A. Greiner, S. Meftali, and J.L. Dekeyser. Estimating energy consumption for an MPSoC architectural exploration. In ARCS06, Frankfurt, Germany, 2006.
7. J. Bezivin, F. Jouault, and P. Valduriez. On the need for megamodels. In Proceedings of the
OOPSLA/GPCE: Best Practices for Model-Driven Software Development Workshop, 19th Annual ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2004.
8. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power
analysis and optimizations. In Proc. International Symposium on Computer Architecture
ISCA’00, pages 83–94, 2000.
9. P. Coussy, G. Corre, P. Bomel, E. Senn, and E. Martin. High-level synthesis under I/O timing
and memory constraints. In ISCAS05, International Symposium on Circuits and Systems, May
2005, Kobe, Japan, 2005.
10. S. Dhouib, J.P. Diguet, E. Senn, and J. Laurent. Energy models of real time operating systems
on FPGA. In Fourth International Workshop on Operating Systems Platforms for Embedded
Real-Time Applications, OSPERT 2008, Prague, July 2–4, 2008.
11. Philips Research. Diesel user manual. Technical Report, Philips Electronic Design and Tools
Group, June 2001.
12. N. Dhanwada, I. Lin, and V. Narayanan. A power estimation methodology for SystemC transaction level models. In International Conference on Hardware/Software Codesign and System
Synthesis, 2005.
13. P.H. Feiler, B.A. Lewis, and S. Vestal. The SAE architecture analysis & design language
(AADL). A standard for engineering performance critical systems. In IEEE International Symposium on Computer-Aided Control Systems Design, pages 1206–1211. Munich, 2006.
14. W. Huang, M.R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy.
Compact thermal modeling for temperature aware design. In Proceedings of DAC 2004, June
7–11, San Diego, California, USA, 2004.
15. J. Hugues, B. Zalila, and L. Pautet. Rapid prototyping of distributed real-time embedded systems using the AADL and ocarina. In Proceedings of the 18th IEEE International Workshop
on Rapid System Prototyping (RSP’07), pages 106–112. IEEE Computer Society Press, Porto
Alegre, 2007.
16. N. Julien, J. Laurent, E. Senn, and E. Martin. Power consumption modeling of the TI C6201
and characterization of its architectural complexity. IEEE Micro, Special Issue on Power- and
Complexity-Aware Design, 2003.
17. J. Laurent, N. Julien, E. Senn, and E. Martin. Functional level power analysis: an efficient
approach for modeling the power consumption of complex processors. In Proc. Design Automation and Test in Europe DATE, Paris, France, 2004.
18. I. Lee, H. Kim, P. Yang, S. Yoo, E. Chung, K. Choi, J. Kong, and S. Eo. PowerVip: Soc power
estimation framework at transaction level. In Proc. ASP-DAC, 2006.
19. M. Loghi, M. Poncino, and L. Benini. Cycle-accurate power analysis for multiprocessor systems on a chip. In Proceedings of the GLSVLSI, Boston, Massachusetts, USA, April 2004.
20. J. Long, J.C. Ku, S.O. Memik, and Y. Ismail. A self-adjusting clock tree architecture to cope
with temperature variations. In Proceedings of the 2007 IEEE/ACM International Conference
on Computer-Aided Design, pages 75–82. IEEE Press, San Jose, 2007.
21. R. Peset-Lopis and K. Goossens. The petrol approach to high-level power estimation. In Proceedings of the ISLPED, Monterey, California, USA, August 1998.
22. G. Qu, N. Kawabe, K. Usami, and M. Potkonjak. Function-level power estimation methodology for microprocessors. In Proc. Design Automation Conf. DAC’00, pages 810–813, 2000.
23. A.E. Rugina, K. Kanoun, and M. Kaâniche. AADL-based dependability modelling. Technical
Report 06209, LAAS, 2006.
24. E. Senn, J. Laurent, N. Julien, and E. Martin. SoftExplorer: Estimating and optimizing the
power and energy consumption of a C program for DSP applications. The EURASIP Journal
on Applied Signal Processing, Special Issue on DSP-Enabled Radio (16), 2005.
25. E. Senn, N. Julien, N. Abdelli, D. Elleouet, and Y. Savary. Building and using system, algorithmic, and architectural power and energy models in the FPGA design-flow. In Intl. Conf. on
Reconfigurable Communication-Centric SoCs 2006, Montpellier, France, July 2006.
26. F. Singhoff, J. Legrand, L. Nana, and L. Marcé. Scheduling and memory requirements analysis
with AADL. In Proceedings of the 2005 Annual ACM SIGAda International Conference on
Ada, Atlanta, GA, USA, 2005.
27. S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. An accurate and fine grain instructionlevel energy model supporting software optimizations. In Proc. Int. Workshop on Power and
Timing Modeling, Optimization and Simulation PATMOS’01, pages 3.2.1–3.2.10, 2001.
28. TMS320C6x User’s Guide. Texas Instruments Inc., 1999.
29. V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: a first step towards
software power minimization. IEEE Trans. VLSI Systems, 2:437–445, 1994.
30. T. Vergnaud. Modélisation des systèmes temps-réel embarqués pour la génération automatique d'applications formellement vérifiées. PhD thesis, Ecole Nationale Supérieure des Télécommunications de Paris, France, 2006.
31. W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin. The design and use of SimplePower:
a cycle accurate energy estimation tool. In Proc. Design Automation Conference DAC’00, June
2000.
32. K. Yaghmour and M.R. Dagenais. Measuring and characterizing system behavior using
kernel-level event logging. In 2000 USENIX Annual Technical Conference, USENIX, San
Diego, CA, USA, June 18–23, 2000.
Chapter 2
MARTE vs. AADL for Discrete-Event
and Discrete-Time Domains
Frédéric Mallet and Robert de Simone
Abstract Real-time embedded applications tend to combine periodic and aperiodic computations. Modeling standards must then support both discrete-time and
discrete-event models of computation and communication whereas they historically
pertain to two different communities: asynchronous and synchronous designers. In
this article, two emerging standards of the domain (MARTE and AADL) are compared and their ability to tackle this issue is assessed. We plead for combining both
standards and show how MARTE can be extended to integrate AADL features required for end-to-end flow latency analysis.
Keywords UML Marte · AADL · MoCC · Time requirement
2.1 Introduction
Embedded applications often combine aperiodic (or sporadic) and periodic computations. In the automotive industry, this has led to bus standards like FlexRay
(http://www.flexray.org) or TT-CAN [10] that combine event-triggered messages
(for aperiodic computations) with time-triggered messages (for periodic computations). In the avionic industry, an application generally mixes aperiodic events
(e.g., interactions with the pilot, switching between air/ground modes . . . ) together
with periodic events when updating the system (e.g., fuel quantity, update system
data . . . ). On the one hand, time-triggered approaches enhance predictability by reducing latency jitters and provide higher dependability by making it easier to detect
missed messages or illegal accesses to the bus. On the other hand, event-triggered
systems are more flexible, supporting configuration changes without a complete redesign, and adapt faster to asynchronous events. In electronic design automation (EDA), event-driven simulators (like those for VHDL or Verilog) provide great flexibility and support the design of both synchronous and asynchronous architectures. Cycle-based simulators, though, have better performance, provided that architectures are mainly synchronous.
In the EDA, avionic and automotive industries, designers need models able to describe and compose these two communication models. The main point is that, even
F. Mallet ()
Aoste Project I3S-INRIA, INRIA Sophia Antipolis Méditerranée, Université de Nice
Sophia Antipolis, Sophia Antipolis Cedex, France
e-mail: Frederic.Mallet@sophia.inria.fr
in time-triggered sampled communications, the propagation of data in logically instantaneous communications introduces specific phenomena akin to event-based
communication features. Indeed, the consumer must wait for data availability to
start computing on it. This may introduce well-known problems of priority inversion when component blocks have to be executed atomically. All these phenomena
deserve careful semantic treatment to be handled correctly, which is the true essence
of this work.
Considering the large number of actors in the design of very large systems (or even systems of systems), standard-based approaches are required to provide
interoperability between models and to cover the whole design flow, from system
requirements to code generation. These models must be precise enough to support
various analyses at different refinement levels. We focus here on two particular standards: AADL (Architecture Analysis & Design Language) [12], standardized by the Society of Automotive Engineers, and the UML (Unified Modeling Language) profile for MARTE (Modeling and Analysis of Real-Time and Embedded systems) [14], recently adopted by the Object Management Group (OMG). Both standards focus
on modeling and analysis of embedded systems. Both offer constructs to model the
application, the execution platform and to allocate the former onto the latter.
The expressiveness of MARTE and AADL is compared and their ability to combine periodic and aperiodic computations is assessed. Concerning MARTE, the discussion focuses on its Time Model [2], specifically devised to specify timed domains of computation and communication in a formal way. This is the continuation of some of our previous work [3, 9] comparing both formalisms. An AADL example [6], which comes with its implementation, is used for the comparison. In the selected example, several threads, periodic or not, are connected through event, data or event–data ports. The combination of various parameters induces either asynchronous or sampled communications.
The AADL two-layered model is compared to our three-layered UML-based approach. The latter gives more flexibility and avoids mixing different levels in the same model. Rather than opposing the two languages, we have investigated how the two standards can be combined and how gateways can be created. Indeed, a subset of MARTE can be combined with AADL to cover a larger scope than the one currently covered by AADL, thus benefiting AADL users. Such a combination would also benefit MARTE users, because some of their models could then be analyzed by existing AADL tools.
Section 2.2 gives an overview of the MARTE time model. Section 2.3 introduces AADL and shows how MARTE is used to model the AADL constructs that address the two communication schemes under focus. Section 2.4 presents MARTE models for three configurations of an AADL-inspired example.
2.2 Marte Time Model
Time and time-related concepts of the UML profile for MARTE have been previously described [2]. This section recalls the main definitions.
2.2.1 Definitions
In MARTE, Time can be physical, and considered as continuous or discretized. It can also be logical, and related to user-defined clocks. Time may even be multiform, allowing different times to progress in a non-uniform fashion, and possibly independently of any (direct) reference to physical time. The interest of tackling multiform time has been demonstrated in synchronous languages [4]. The MARTE Time subprofile, inspired by the theory of tag systems [8], provides a set of general mechanisms to define Models of Computation and Communication (MoCC). These modeling aspects should be hidden from end-users but not from model architects. This work intends to build a MoCC suitable for AADL.
The time structure is defined by a set of clocks and relations on these clocks. Here, a clock is not a device used to measure the progress of physical time; it is rather a mathematical object lending itself to formal processing. Clocks referring to physical time are called chronometric clocks. A distinguished chronometric clock named idealClk is provided as part of the MARTE time library. This clock represents the "ideal" physical time used, for instance, in the laws of physics and mechanics. At the design level, most of the clocks are logical. For instance, we consider the processor cycle or the bus cycle as being logical clocks. Making a distinction between chronometric and logical clocks is important. For chronometric clocks, a metric is associated with instant labels, thus making the distance between two instants relevant. For logical clocks, the distance between two successive instants is irrelevant.
More precisely, a Clock is an ordered set of instants I and a quasi-order relation ≺ on I, named strict precedence. ≺ is a total, irreflexive, and transitive binary
relation on I. A discrete-time clock is a clock with a discrete set of instants I.
A Time Structure is a set of clocks C with a binary, reflexive and transitive relation ≼ named precedence. ≼ is a partial order relation on the set of all instants of all clocks within a time structure. From ≼ we derive another instant relation named coincidence (≡ ≜ ≼ ∩ ≽).
Clocks are independent of each other unless some instant relations are imposed.
To impose many—or infinitely many—instant relations at once, clock relations are
used. A comprehensive description of clock relations is available as a research report [1]; we focus here on those required to represent event-triggered and time-triggered communications. Connections between UML model elements and MARTE clock relations are made through the stereotype ClockConstraint, which extends the metaclass UML::Constraint. The language to be used in these clock constraints is called the Clock Constraint Specification Language, CCSL. It is defined as an annex of the MARTE specification and its formal semantics is briefly introduced separately [11].
2.2.2 Event-Triggered Communications
Two clocks (ts, tf) are associated with each task t: the first contains the instants at which the task starts, and the other the instants at which it finishes. A task cannot
Fig. 2.1 Clock relation alternatesWith
end before having started, and every time a task starts it must finish, in one way or another (normal ending, preemption, interruption). The clock relation alternatesWith can represent this causality relation between ts and tf. Equation (2.1) denotes that every ith instant of ts strictly precedes every ith instant of tf, which in turn (weakly) precedes every (i + 1)th instant of ts. This relation is not symmetrical and does not assume that task t is periodic.
ts alternatesWith tf    (ts ∼ tf)    (2.1)
t1f alternatesWith t2s    (t1f ∼ t2s)    (2.2)
Alternation is a very general relation and can also represent an event-triggered
communication from a task t1 to another task t2, Eq. (2.2). Task t1 runs to completion and sends an event that triggers the execution of t2.
Figure 2.1 illustrates the alternation relation. Horizontal lines represent the clocks
and their instants. Dashed arrows with a filled triangle as an arrowhead are strict
precedence relations whereas arrows with a hollow triangle as an arrowhead are
(weak) precedence relations. The precedence relations are directly induced by
Eq. (2.2).
In that example, the termination of t1 asynchronously triggers the start of t2.
Note that we only have partial orders, i.e., no instant relation is induced between the
start or end of t2 and the next start of t1.
2.2.3 Time-Triggered Communications
With time-triggered communications, the data is sampled from a buffer according
to a triggering condition. Clock relation sampledOn is used to represent sampling
and the triggering condition is given by the instants of clocks.
Following our previous example, we replace Eq. (2.2) by Eq. (2.3). clk is the
sampling condition, i.e., the triggering clock.
t2s ≡ t1f sampledOn clk    (2.3)
Figure 2.2 illustrates the use of the clock relation sampledOn. It does not show the start of t1 since it is not relevant here. The start of task t2 is precisely given by the sampling clock clk; however, some events may be missed if the sampling clock is not fast enough. Vertical lines denote coincidence relations.
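The effect of sampledOn can be sketched in a few lines of Python (our own illustration, not TimeSquare and not CCSL): each finish instant of t1 is mapped to the next tick of the periodic sampling clock, and finishes that fall before the same tick collapse into a single start of t2, which is how events get missed:

import math

def sampled_on(finish_instants, sampling_period):
    """Map each finish instant of t1 to the next tick of a periodic sampling
    clock; a finish exactly on a tick is assumed to be taken at that tick."""
    ticks = {math.ceil(t / sampling_period) * sampling_period for t in finish_instants}
    return sorted(ticks)

# t1 finishes at these (logical) times; t2 is dispatched by a period-10 clock
print(sampled_on([3, 7, 12, 31], 10))  # -> [10, 20, 40]: the finishes at 3 and 7 share one start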
Fig. 2.2 Clock relation sampledOn
2.2.4 Periodic Tasks and Physical Time
Logical clocks are infinite sets of instants but we do not assume any periodicity, i.e.,
the distance between successive instants is not known. The relation discretizedBy
is used to discretize idealClk, a dense chronometric (related to physical time) perfect (with no jitter or any other flaw) clock. Equation (2.4) creates, as an example,
a 100 Hz clock.
c100 ≡ idealClk discretizedBy 0.01    (2.4)
Equation (2.4) states that the distance (duration) between two successive instants
of clock c100 is 0.01 s. The unit second (s) is implied by the use of idealClk.
2.2.5 TimeSquare
TimeSquare is a software environment for modeling and analyzing timed systems. It supports an implementation of the Time Model introduced in MARTE and its companion Clock Constraint Specification Language (CCSL). TimeSquare displays possible time evolutions—solutions to the clock constraint specification—as waveforms generated in the VCD format [7]. The VCD format has been chosen because it is an IEEE standard defined as part of the Verilog language and, as such, is often used in EDA. TimeSquare is available at http://www.inria.fr/sophia/aoste/.
2.3 AADL
2.3.1 Modeling Elements
AADL supports the modeling of application software components (thread, subprogram, process), execution platform components (bus, memory, processor, device)
and the binding of software onto execution platform. Each model element (software or execution platform) must be defined by a type and comes with at least one
implementation.
Initially, there were plans to create a specific UML profile for AADL. However, the
emerging profile for MARTE is now expected to be the basis for the UML representation
Fig. 2.3 MARTE model library for AADL threads
of AADL models [5]. The adopted MARTE specification provides guidelines in this direction. The main goal of this contribution is to further investigate how specific AADL concepts required for end-to-end flow latency analysis can be represented in MARTE. As such, this work is not (yet?) included in the official OMG specification.
2.3.2 AADL Application Software Components
Threads are executed within the context of a process; therefore, a process implementation must specify the number of threads it executes and their interconnections. Type and implementation declarations also provide a set of properties that
characterizes model elements. For threads, AADL standard properties include the
dispatch protocol (periodic, aperiodic, sporadic, background), the period (if the dispatch protocol is periodic or sporadic), the deadline, the minimum and maximum
execution times, along with many others.
We have created a UML library (see Fig. 2.3) to model AADL application software
components [9]. Only elements of our library concerning the periodic and aperiodic
threads are shown.
AADL threads are modeled using the stereotype SwSchedulableResource from
the MARTE Software Resource Modeling sub-profile. Its meta-attributes deadlineElements and periodElements explicitly identify the actual properties used to represent the deadline and the period. Using a meta-attribute of type Property avoids a premature choice of the type of such properties. This makes it easier for the transformation tools to be language and domain independent. In our library, the MARTE type
NFP_Duration is used as an equivalent for AADL type Time.
Fig. 2.4 The example in AADL
2.3.3 AADL Flows
AADL end-to-end flows explicitly identify a data-stream from sensors to the external
environment (actuators). Figure 2.4 shows an example previously used [6] to discuss
flow latency analysis with AADL models.
This flow starts from a sensor (Ds, an aperiodic device instance) and sinks in
an actuator (Da, also aperiodic) through two process instances. The first process
executes the first two threads and the last thread is executed by the second process.
The two devices are part of the execution platform and communicate via a bus (db1)
with two processors (cpu1 and cpu2), which host the three processes with several
possible bindings. All processes are executed by either the same processor, or any
other combination. One possible binding is illustrated by the dashed arrows. The
component declarations and implementations are not shown. Several configurations
derived from this example are modeled with MARTE and discussed in Sect. 2.4.
2.3.4 AADL Ports
There are three kinds of ports: data, event and event–data. Data ports are for data
transmissions without queueing. Connections between data ports are either immediate or delayed. Event ports are for queued communications. The queue size may
induce transfer delays that must be taken into account when performing latency
analysis. Event–data ports are for message transmissions with queueing; here again, the queue size may induce transfer delays. In our example, all components have data ports, represented as solid triangles. We have omitted the ports of the processes since they are required to be of the same type as the connected port declared within the thread declaration and are therefore redundant.
UML components are linked together through ports and connectors. No queues
are specifically associated with connectors. The queueing policy is better repre-
sented on a UML activity diagram, which models the algorithm. Activities are made
of actions. The execution sequence is given by the control flow. Data communications between the actions are represented with object flows. In UML, by default, an
object flow has a queue, the size of which can be parameterized with its property
upperBound. So object flows can be used to represent both event and event–data
AADL communication links. UML allows the specification of a customized selection
policy to select which token is read among the ones stored in the object node. Unfortunately, the selection behavior is only allowed to select one single token, making it impossible to represent the AADL dequeue protocol AllItems. This protocol
dequeues all items from the port every time the port is read. Therefore, only the
dequeue protocol OneItem is supported.
To model data ports, UML provides «datastore» object nodes. On these nodes,
tokens are never consumed, thus allowing for multiple readings of the same token.
Using a data store node with an upper bound equal to one is a good way to represent
AADL data port communications.
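The difference between the two kinds of object nodes can be pictured with a plain programming analogy (ours, not the UML or AADL runtime): an event–data port behaves like a bounded FIFO whose capacity plays the role of upperBound, whereas a data port behaves like a one-place data store whose token is overwritten and never consumed:

from collections import deque

class EventDataPort:
    """FIFO behaviour of an object node (OneItem dequeue protocol assumed)."""
    def __init__(self, upper_bound=None):
        # upper_bound plays the role of upperBound; None means unbounded.
        # When the deque is full, the oldest token is discarded (one possible policy).
        self.queue = deque(maxlen=upper_bound)
    def write(self, token):
        self.queue.append(token)
    def read(self):
        return self.queue.popleft()   # consumes exactly one token

class DataPort:
    """Data-store behaviour: upperBound = 1, the token is read but never consumed."""
    def __init__(self):
        self.value = None
    def write(self, token):
        self.value = token            # the previous token is overwritten
    def read(self):
        return self.value             # the same token can be read many times

port = DataPort()
port.write("sample")
print(port.read(), port.read())       # -> sample sample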
2.4 Three Different Configurations
This section illustrates the use of M ARTE on three different configurations of the
AADL example. First, we address a case where all threads are aperiodic. Then, we
consider a mixed periodic/aperiodic case. We finish with a case where all threads
are periodic and harmonic.
2.4.1 The Aperiodic Case
We rely on a model with three layers (see Fig. 2.5) where each layer denotes a
different aspect of the system. The top-most layer represents the algorithm, i.e., different actions to be executed, and includes the data and control flow. The algorithm
gives causal relations among the actions. All communications are through event–
data ports with infinite queues, represented as object nodes. The two actions acquire
and release model the behavior of the two devices.
The middle layer is a composite structure diagram that models AADL software
components and represents the actual configuration under study. Here, all threads
are aperiodic and therefore the classifier AperiodicThread defined in Fig. 2.3 is used.
The bottom layer represents the execution platform.
This layer-oriented approach significantly differs from the AADL model where all
the layers are combined. AADL models do not consider the pure applicative part but
rather merge it either within the second or the third level (compare with Fig. 2.4).
Layers can be changed independently of the others, which gives great flexibility.
If required, new layers can be added to model virtual machines or middleware, for
instance.
Fig. 2.5 MARTE model, fully aperiodic case
The AADL binding mechanism is equivalent to the MARTE allocation. Actions and object nodes are allocated (dashed arrows) to software components. All threads are aperiodic; therefore, all communications are asynchronous and we only use the clock
relation alternatesWith (Eqs. (2.5)–(2.8)).
Ds alternatesWith T1s    (Ds ∼ T1s)    (2.5)
T1f alternatesWith T2s    (T1f ∼ T2s)    (2.6)
T2f alternatesWith T3s    (T2f ∼ T3s)    (2.7)
T3f alternatesWith Da    (T3f ∼ Da)    (2.8)
These clock relations are extracted using model-driven engineering techniques and fed into TimeSquare. TimeSquare checks the consistency of the relations and proposes, when possible, one execution conformant to the clock relations (see Fig. 2.6).
Fig. 2.6 VCD result, all aperiodic case
Fig. 2.7 Timing diagrams, all aperiodic case
With TimeSquare, the VCD output is annotated to embed clock relations so that instant relations are displayed. Dashed arrows denote precedences. The transformation could also target other analysis tools like, for instance, Cheddar [13], often used with AADL models.
After the analysis, it is important to bring back the results into the UML model.
The model elements closest to VCD waveforms are the UML timing diagrams. By
combining the two clocks relative to each task (e.g., t1s and t1f ), the whole information relative to the task itself (e.g., t1) is built. The result, illustrated in Fig. 2.7,
represents a family of possible schedules for a given execution flow and a given pair
application/execution platform.
Computation execution times (thick horizontal lines) equal the latency for devices and range between the MinimumExecutionTime (MinET) and the Deadline for
threads.
Fig. 2.8 MARTE model, mixed case
Fig. 2.9 VCD result, mixed case
2.4.2 The Mixed Event–Data Flow Case
We study here a second configuration that only differs by making thread t2 periodic (Fig. 2.8). Only the second layer of our model needs to be modified, whereas with AADL the whole model had to be rebuilt.
The communication from step1 to step2 becomes a sampled one. In CCSL,
Eq. (2.6) is replaced by Eqs. (2.9)–(2.10), where P is the sampling period of t2.
clk = idealClk discretizedBy P    (2.9)
t2s = t1f sampledOn clk    (2.10)
The new simulation run processed by TimeSquare is shown in Fig. 2.9. Vertical
plain lines with diamonds represent coincidence relations on instants.
We also get a different timing diagram (see Fig. 2.10). Oblique lines linking
two computation lines represent communications and sampling delays. For sampled communications, this amounts to waiting for the next tick of the receiver clock. The maximal sampling delay occurs when the communication waits for the full sampling period because the previous tick has just been missed. "Oblique" lines are not normative in UML timing diagrams, but they are a convenient notation to represent intermediate communication states between two steady processing states (e.g., between
t1 and t2).
Fig. 2.10 Timing diagram, mixed case
Additionally, on this representation it is easy to compute latencies for a given flow. For this configuration and assuming, as in [6], that the sampling delays are always maximal, we get the formulas given in the following equations.
maximal, we get the formulas given in the following equations.
Latencyworst-case = Ds.latency +
(ti .deadline)
i
+ t2.period + Da.latency
Latencybest-case = Ds.latency +
(ti .MinET)
(2.11)
+ t2.period + Da.latency
Latencyjitter =
(ti .deadline − ti .MinET)
(2.12)
i
(2.13)
i
The jitter (Eq. (2.13)) is identical to that of the fully asynchronous case, even though,
due to the synchronization, the best-case and worst-case latencies are increased.
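With hypothetical figures, Eqs. (2.11)–(2.13) can be evaluated directly; only the structure of the computation below comes from the equations, the numbers are placeholders:

# Hypothetical figures in milliseconds; structure follows Eqs. (2.11)-(2.13)
ds_latency, da_latency = 1.0, 1.0
deadline = {"t1": 4.0, "t2": 5.0, "t3": 4.0}
min_et = {"t1": 2.0, "t2": 3.0, "t3": 2.0}
t2_period = 10.0  # sampling period of the periodic thread t2

worst = ds_latency + sum(deadline.values()) + t2_period + da_latency   # Eq. (2.11)
best = ds_latency + sum(min_et.values()) + t2_period + da_latency      # Eq. (2.12)
jitter = sum(deadline[t] - min_et[t] for t in deadline)                 # Eq. (2.13)
print(worst, best, jitter)  # -> 25.0 19.0 6.0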
2.4.3 The Periodic Case
Finally, we address here the case where all threads are periodic but not fully synchronous. Thread t2 is twice as slow as threads t1 and t3, i.e., its period is twice as large. All threads being periodic, their deadline is assumed to be smaller than their
period. Only the second layer needs to be replaced by a new configuration where all
Fig. 2.11 Timing diagram, fully periodic case
threads are periodic. The timing diagram obtained with this configuration is shown
in Fig. 2.11.
The first three communications in the flow (from acquire to step1, from step1 to
step2, and from step2 to step3) are sampled communications. The last one (from
step3 to release) is data-driven, since actuator Da is aperiodic. In this last configuration and as expected, becoming synchronous makes the system more predictable
since the latency jitter is much smaller (Eq. (2.16)). However, both the best-case and
worst-case latencies are larger than in the two previous cases.
Latency_worst-case = Ds.latency + t1.period + t2.period + t2.period + t3.deadline + Da.latency    (2.14)
Latency_best-case = Ds.latency + t1.period + t2.period + t2.period + t3.MinET + Da.latency    (2.15)
Latency_jitter = t3.deadline − t3.MinET    (2.16)
2.5 Conclusion
AADL offers many features that are very important for modeling and analyzing the computations and communications of embedded systems. However, combining all these features without a guideline (not part of the standard) can lead to completely meaningless models that are impossible to analyze. We have shown how the MARTE Time model could be used to obtain the same expressiveness with fewer modeling concepts. More generally, MARTE and its time model could be used to model various timed models of computation and communication.
It is important to have specifications free, as much as possible, of implementation choices (platform independent models). To achieve this goal, we need model
elements of a higher level of abstraction than AADL threads. AADL's two-level models assume that part of the application has already been allocated to a software execution platform made of threads. Our approach makes such an allocation explicit when required and allows alternative solutions. We propose to use UML activities for that purpose. Making a link to the software execution platform (runtime executive) is
not a refinement but rather an allocation. The former implies models of the same
nature, whereas the latter makes links between models of different natures.
Additionally, rather than building a specific graphical editor for AADL models, it is more cost-effective to customize existing editors, give them the right semantics and perform model transformations towards analysis tools. For now, the graphical customization supported by profiling tools is limited and significant improvements are required. However, making AADL UML-friendly paves the way to interoperability with several other modeling standards like SysML [15], which is appropriate to model
at the system level.
Glossary
AADL Architecture Analysis and Design Language, standardized by SAE
EDA Electronic Design Automation
MARTE Modeling and Analysis of Real-Time and Embedded systems
MET/MinET Minimum Execution Time
MoCC Model of Computation and Communication
NFP Non Functional Property
OMG The Object Management Group
SAE Society of Automotive Engineers
TT-CAN Time-Triggered Controller Area Network
UML The Unified Modeling Language, adopted by the OMG
VCD Value Change Dump
VHDL Very high speed integrated circuit Hardware Description Language
References
1. C. André and F. Mallet. Clock constraints in UML MARTE CCSL. Research Report 6540,
INRIA, May 2008. https://hal.inria.fr/inria-00280941.
2. C. André, F. Mallet, and R. de Simone. Modeling time(s). In MoDELS’07, LNCS 4735, pages
559–573. Springer, Berlin, 2007.
3. C. André, F. Mallet, and R. de Simone. Modeling of AADL data-communications with UML
Marte. In E. Villar, editor, Embedded Systems Specification and Design Languages, Selected
Contributions from FDL’07, LNEE 10, pages 150–170. Springer, Berlin, 2008.
4. A. Benveniste, P. Caspi, S. A. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone. The
synchronous languages 12 years later. Proceedings of the IEEE, 91(1):64–83, 2003.
5. M. Faugère, T. Bourbeau, R. de Simone, and S. Gérard. Marte: Also a UML profile for modeling AADL applications. In ICECCS, pages 359–364. IEEE Comput. Soc., Los Alamitos,
2007.
6. P.H. Feiler and J. Hansson. Flow latency analysis with the architecture analysis and design
language. Technical Report CMU/SEI-2007-TN-010, CMU, June 2007.
7. IEEE Standards Association. IEEE Standard for Verilog Hardware Description Language.
IEEE Std 1364TM-2005, Design Automation Standards Committee, 2005.
8. E.A. Lee and A.L. Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on CAD of Integrated Circuits and Systems, 17(12):1217–1229,
1998.
9. S.-Y. Lee, F. Mallet, and R. de Simone. Dealing with AADL end-to-end flow latency with
UML Marte. In ICECCS, pages 228–233. IEEE Comput. Soc., Los Alamitos, 2008.
10. G. Leen and D. Heffernan. TTCAN: a new time-triggered controller area network. Microprocessors and Microsystems, 26(2):77–94, 2002.
11. F. Mallet, C. André, and R. de Simone. CCSL: specifying clock constraints with UML Marte.
ISSE, 4(3):309–314, 2008.
12. SAE. Architecture analysis and design language (AADL). AS5506/1, 2006. http://www.sae.
org.
13. F. Singhoff and A. Plantec. AADL modeling and analysis of hierarchical schedulers. In A.
Srivastava and L.C. Baird III, editors, SIGAda, pages 41–50. Assoc. Comput. Mach., New
York, 2007.
14. The ProMARTE Consortium. UML profile for MARTE, beta 2. OMG document number:
ptc/08-06-08, Object Management Group, 2008.
15. T. Weilkiens. Systems Engineering with SysML/UML: Modeling, Analysis, Design. The
MK/OMG Press, Burlington, 2008.
Chapter 3
Generation of MARTE Allocation Models
from Activity Threads
Andreas W. Liehr, Klaus J. Buchenrieder,
Heike S. Rolfs and Ulrich Nageldinger
Abstract UML and specialized profiles, such as MARTE, are established specification and modeling means in the system development process. While language-based system specification and resource modeling shorten the design cycle, the exploration of the design space is time-consuming. The most expensive step is the generation of the system models, i.e., of the architectural alternatives, for subsequent exploration. This work contributes a method that utilizes activity threads to reduce the effort needed to build such a set. With this method, a group of system models, each representing one design alternative, can be generated automatically. Only one architecture model and one function model, in combination with an activity thread, are required. The proposed method is a first step towards the automated comparison of the performance of design alternatives at an early stage in the development process.
Keywords Component-based system modeling · Design-space exploration ·
Unified modeling language · UML MARTE
3.1 Introduction
In the early development stages of embedded systems, fundamental decisions concerning the system architecture and a proper hardware/software partitioning must be made. These choices determine, e.g., the power consumption, the performance, and other system features. For this reason, a method plus supporting tools for the fast and efficient specification and automated evaluation of system design alternatives is required. Simulating and analyzing the performance- and power-relevant parts of the system is one way to determine which alternative best fits the constraints for
This work has been supported within a subcontract between Infineon Technologies AG
and the Universität der Bundeswehr München. This contract is part of the Project
“Verteilte integrierte Systeme und Netzwerkarchitekturen für die Applikationsdomänen
Automobil und Mobilkommunikation” (VISION), 01 M 3078.
A.W. Liehr ()
Fakultät für Informatik, Universität der Bundeswehr München, 85577 Neubiberg,
Germany
e-mail: andreas.liehr@unibw.de
Fig. 3.1 Utilizing activity threads for the mapping of application to resources
its realization. To evaluate such system alternatives, each of them has to be specified
in a modeling language fitting this purpose. The MARTE profile [18] for UML [17]
provides the features required for system modeling.
The specification of the architecture is carried out with the Hardware Resource Modeling (HRM) mechanism of the MARTE specification. The Real-Time Execution (RTE) Model of Computation and Communication allows for a specification of the behavior with real-time constraints. The mapping of behavior to architecture can
be achieved as declared in the Allocation Modeling (Alloc) section of the MARTE
profile.
The MARTE profile provides mechanisms to enrich the models with physical, logical, and functional information. These enable the generation of simulation models for performance and power estimation. Using this approach to support the decision making of hardware/software codesign requires one set of models for every design alternative. Since the construction of alternatives is expensive, we present a solution that drastically reduces manual interaction.
In this work, we present a method to build a set of MARTE allocation models, each representing one design alternative for a hardware/software system. As input to this method, only one MARTE hardware resource model and a single real-time execution model must be specified. The alternative spatial allocations of function to architecture are declared with an Activity Thread, as presented by Liehr and Buchenrieder [10]. As a result, one allocation model per design alternative is built, as depicted in Fig. 3.1. This approach speeds up system development, in that system models and design alternatives can be generated much faster.
This chapter is organized as follows. Section 3.2 discusses the relationship between this approach and related research activities. Section 3.3 introduces the system specification process based on the UML MARTE profile. Section 3.4 presents
activity threads and illustrates their application in the process of hardware/software
codesign. In Sect. 3.5, the utilization of activity threads for design space exploration with UML MARTE system models is demonstrated with an example. Section 3.6 provides detailed information concerning the prototypic implementation
and the choice of tools. It is followed by an example application for our approach.
We conclude with a summary and an outlook.
3.2 Related Work
In the early 1980s, researchers recognized the potential of system evaluation methods that utilize formal models of computer-based systems. It became clear that such methods speed up the system development process and reduce the cost, especially by reducing the demand for prototyping and testing of the real-life system. An early example of this technique is the Software Performance Engineering method introduced by Smith [21]. This method consists of two models: (1) a software execution model, expressed with execution graphs, representing the software behavior; and (2) a system execution model, based on queuing networks, that describes the system behavior. The input data of the second model is a combination of the results from the software execution model and information about the system hardware. Extensions of this approach have been published by Cortellessa et al. [2, 3]. Using such methods requires broad knowledge in the field of system architecture and the definition of formal graph-based models.
In the 1990s, a multitude of pattern-based approaches emerged. These frameworks, in which developers define system models from predefined and reusable system patterns, rely on executable simulation models. While such approaches improve the efficiency of the modeling process through their prepackaged nature, their versatility suffers from the fact that pattern definitions are required for every system part. Therefore, the pool of available patterns has to be extended each time novel architecture components emerge. Approaches of this kind have been published, inter alia, by Petriu et al. [19, 20], Balsamo et al. [1], and Liehr and Buchenrieder [10].
With the increasing dissemination of the Unified Modeling Language (UML), system design approaches utilizing UML not only for software but also for hardware specification gained popularity. At the turn of the century, UML was already well known by most system developers. Computer-based systems were expressed with UML and annotated with information for the system evaluation process. Subsequently, executable models for the purpose of system evaluation were built from the UML descriptions without further user interaction. Representatives of this approach were introduced by Kabajunga and Pooley [8], Kähkipuro [9], Gu and Petriu [6, 7], and Woodside et al. [23].
UML was initially developed to support a team-based software development process and the modeling of relations in large projects utilizing object-oriented programming paradigms. Therefore, plain UML was of limited value for modeling holistic systems. Several concurrent approaches to building more suitable UML dialects for the specification demands of architecture modeling were brought forward. The UML Profile for Schedulability, Performance, and Time (SPT) and the OMG Systems Modeling Language (SysML) gained broader acceptance within the research community and among developers of system modeling tools.
The OMG UML Profile for Schedulability, Performance, and Time specifies a UML profile that defines standard paradigms for modeling time-, schedulability-, and performance-related aspects of real-time systems. It (1) enables the construction of models to make quantitative predictions; (2) facilitates communication of design intent between developers in a standard way; and (3) permits interoperability between various analysis and design tools [15].
The OMG Systems Modeling Language is a general-purpose graphical modeling language for specifying, analyzing, designing, and verifying complex systems,
that may include hardware, software, information, personnel, procedures, and facilities. In particular, the language provides graphical representations with a semantic
foundation for modeling system requirements, behavior, structure, and parametrics,
which is used to integrate with other engineering analysis models [16].
To add capabilities for the model-driven development of real-time and embedded systems to UML, the UML profile for Modeling and Analysis of Real-time and Embedded systems (MARTE) was introduced. It provides support for the specification, design, and verification and validation stages of the system development process. MARTE is intended to replace the UML SPT profile [18].
Researching enhancements of the system modeling process with UML MARTE that address the efficient specification of a set of competing design approaches for one computer-based system is the next logical step towards the development of even more user-friendly system modeling tools.
3.3 Building System Models with MARTE
The modeling of computer systems is vital in the initial stages of system development, because design alternatives can be compared and rated, and decisions can be made. This shortens the development time and lowers the development effort.
The UML MARTE profile extends UML with a detailed hardware resource model. In this model, the logical view classifies the hardware resources with respect to their functionality, while the physical view captures their physical properties. MARTE adopts the Y-model as presented by Dumoulin et al. [4] and structures the system model into three design views [22]:
• Application model, specifying the system functionality
• Resource model, representing the execution platform
• Allocation model, mapping the function to the architecture
In this work, UML activity diagrams are utilized as the application model. MARTE provides UML stereotypes to include constraints for real-time execution modeling, as illustrated by Frederic et al. [5]. While this representation of the application is not sufficient for the performance simulation process, it is adequate for demonstrating our approach.
As the resource model representation, we use UML composite structure diagrams. The components of these diagrams are extended with stereotypes from the logical view and the physical view of the MARTE HRM profile. Although
the Y-model as adopted by MARTE comprises three resource models (General Resource Model, Software Resource Model, and Hardware Resource Model), only the HRM is needed to demonstrate our method.
The third component of the Y-model is the allocation of the function to the architecture. In this work, the allocation is realized with UML component diagrams
in the context of MARTE Allocation Modeling (Alloc).
Using activity threads to compose a set of these allocation models modifies the Y-model for system specification as illustrated in Fig. 3.1.
3.4 Utilizing Activity Threads for Design-Space Exploration
In previous work, we presented a method for performance prediction that utilizes UML-based software representations, pattern-based hardware models, and activity threads, which map functional components to the hardware architecture for the generation of performance simulation models [10, 11].
The hardware model employed in this approach defines the system hardware architecture for a computer system under construction. The choice of a pattern-based approach fosters the fast assembly of the model from vendor-defined hardware components, which contain performance and interface information and are stored in a hardware pattern database. The hardware model is a superset of the required hardware for all system design approaches to be evaluated. Depending on the considered design approach, only a subset of the whole hardware model is taken into account.
The software model contains the performance-related information concerning the software components of the computer system to be designed. The control flow of the intended functionality of the system is represented as a UML activity diagram. The activities of this diagram are enriched with the performance information needed for the composition of the performance simulation model.
The functional descriptions from the software model are linked to hardware modules by activity threads. Activity threads are realized as graphs with a start node and an end node. Every path from the start node to the end node contains intermediate nodes, each denoting an activity from the activity diagram that represents the software. For every activity from the software model, exactly one node must exist in every path of the activity thread. Each such path represents one design alternative for evaluation. The number of distinct paths is the number of system models to build. The specification of ATs can be achieved with UML activity diagrams, which allows specification with off-the-shelf UML modeling tools.
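As an illustration of this structure, the following Python sketch represents a toy activity thread as a directed graph and enumerates its start-to-end paths; the activities, mapping nodes, and component names are hypothetical and only mirror the kind of information an AT carries, not the actual XMI representation used by the prototype.

# A toy activity thread: every path from "start" to "end" visits exactly one
# mapping node per activity of the application model (activities A and B here).
# Node and component names are made up for illustration.
activity_thread = {
    "start": ["A on ASIC1/BUS/RAM1", "A on CPU/BUS/RAM2"],
    "A on ASIC1/BUS/RAM1": ["B on CPU/BUS/RAM1"],
    "A on CPU/BUS/RAM2": ["B on CPU/BUS/RAM2"],
    "B on CPU/BUS/RAM1": ["end"],
    "B on CPU/BUS/RAM2": ["end"],
    "end": [],
}

def paths(graph, node="start", prefix=()):
    """Enumerate all start-to-end paths; each path is one design alternative."""
    if node == "end":
        yield prefix
        return
    nxt = prefix if node == "start" else prefix + (node,)
    for successor in graph[node]:
        yield from paths(graph, successor, nxt)

for i, alternative in enumerate(paths(activity_thread), start=1):
    print(f"design alternative {i}: {' -> '.join(alternative)}")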
After the user has supplied the three models, the system model generation is carried out without user interaction. In the original work, this method results in a set of Extended Queuing Network Models (EQNs) representing the related design approaches. The simulation of the EQNs is realized with standard simulation tools or an XML-based EQN simulator [12]. Using EQNs for the simulation models allows for the simulation of concurrent systems with shared resources, such as buses or memory.
Fig. 3.2 The architecture composite structure diagram with MARTE extensions
3.5 Generating MARTE Allocation Models with Activity Threads
As shown in the previous section, activity threads can be exploited to automate the spatial association of functional components to architectural units with the goal of exploring design alternatives. We adapt this established approach to system models represented in UML with MARTE extensions.
Figure 3.2 depicts a system resource model, represented as a composite structure diagram. The stereotype hwComponent is applied to all components of the resource model to detail physical properties. Functional properties are specified with
Fig. 3.3 The application as UML activity diagram
the appropriate stereotypes from the logical view. The stereotype hwResource is applied to the CPU, the stereotype hwBUS to the BUS, the stereotype hwRAM to RAM1 and RAM2, etc.
The architecture model contains the superset of the hardware components used in all system design approaches. Every single design approach uses only a subset of the hardware model. Hardware components that are not used for a system design approach are omitted in the generation process of the allocation model for this single approach.
To enable the evaluation of core dimensions of the system to be modeled, the annotation of dimension-specific information in the architectural model is required. If power consumption were one of the dimensions to evaluate, such information would comprise the power consumed by specific parts of the architecture at different levels of resource utilization. If the performance of the system is the focus of the evaluation process, the model components must be annotated with performance-related data supplied by component vendors for import.
In this field, the SPIRIT Consortium breaks new ground with its standardization efforts relating to the description of multi-sourced IP blocks. The intention of SPIRIT is to provide a unified set of specifications based on IP meta-data. These contain the specifications to import complex IP bundles into SoC design tool sets and to exchange design descriptions between tools. Utilizing SPIRIT descriptions as an information source for the architectural model will lessen the amount of information the user has to deliver. It also ensures that information about novel architecture components becomes readily available, as members of the SPIRIT Consortium are committed to supplying descriptions of their IP components.
The behavior of the considered system is modeled with UML activity diagrams. As an example, consider Fig. 3.3. It shows the activity diagram of an application in which the actions A and B are activated concurrently and the result is synchronized. Subsequently, action C executes before action D concludes. The stereotypes rtf and rtAction from the real-time execution model of the MARTE profile are applied to the activities to model real-time features.
As for the architecture model, the activities of the functional model must also be enriched with information specific to the intended evaluation. The information comprises statements regarding the complexity and the I/O behavior of the functional parts that the activities represent. Current approaches encompass the analysis of pseudo-code and formal specification.
Analysis of pseudo-code requires the user to provide simplified code for every activity. The pseudo-code is compiled and executed to deduce timing measures. For a formal analysis, complexity and I/O behavior are expressed symbolically. While
Fig. 3.4 An activity thread represented as UML activity diagram
system engineers prefer the first method, experts from the system analysis domain prefer the second. The architectural model and the functional model are related by Activity Threads (ATs). By defining the activity thread, the user states which mappings of function to architecture must be investigated. Figure 3.4 shows an example of an AT, fitted to the composite structure diagram of Fig. 3.2 and the activity diagram given in Fig. 3.3.
The AT is represented as a UML activity diagram. The semantics of the activities in the AT differs from the common usage of activity diagrams: an activity of an AT defines to which hardware components an activity of the activity diagram that is part of the application model will be mapped. Each activity of the AT resides within an activity partition, denoted with a dashed box. Functional descriptions within a partition can be referenced by the AT's partition name. As an example, consider the four activities on the left side of Fig. 3.4. The topmost maps the function represented by activity A from Fig. 3.3 to the hardware components ASIC1, BUS, and RAM1 of the resource model in Fig. 3.2. The lowest maps the same functional part to the hardware components CPU, BUS, and RAM2.
The implementation of the concept of activity threads, as part of a framework for system evaluation in the development process, enables user-guided design-space exploration for holistic computer-based systems. The user narrows the range of possible design approaches from a full permutation over all possibilities to a selected set of design alternatives. Each path through the activity diagram from an initial node to a final node represents one design alternative. Obviously, the AT in Fig. 3.4 defines a set of different deployments of functionality to hardware. The eight deployments are illustrated in Fig. 3.5.
Each line represents one system design approach and results in an automatically generated allocation model. Note that a permutation over all possible solutions would lead to 1296 potential system design approaches, in contrast to eight unfolded activity threads. Hence, activity threads not only reduce the complexity but also provide the user with the flexibility and convenience of an automated specification of alternative system models.
Fig. 3.5 The unfolded activity thread represented as UML activity diagram
The allocation model for the first mapping of the unfolded threads is shown in Fig. 3.6. For its visual representation, we chose a UML composite structure diagram, using the methods for allocation modeling as described in the specification of the MARTE profile.
3.6 A Prototypic Implementation of the Method
To build allocation models from the application model, the resource model, and the activity thread, we utilize a Python program and the Python interpreter in version 2.5.
We chose the LXML Pythonic XML processing library in combination with
libxml2 to gather information from the XML structure that represents the UML
models. This configuration is also employed for the prototype that generates the
XML structure, which represents the allocation model.
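As a rough illustration of this step, the sketch below uses lxml to pull element names and ids out of an XMI export; the file name is hypothetical, and the namespace URI shown is the commonly used XMI 2.1 one, but the exact URIs and element names depend on the exporting tool and its version.

from lxml import etree

# Hypothetical input file; in our setting this would be the XMI 2.1 export
# of the UML model. The XMI namespace URI below is the commonly used one,
# but it may differ depending on the exporting tool.
XMI_NS = "http://schema.omg.org/spec/XMI/2.1"

tree = etree.parse("system_model.xmi")

# Collect the xmi:id and name of every packaged element (modules,
# activities, components, ...) as a first step of model extraction.
elements = {}
for elem in tree.iter("packagedElement"):
    xmi_id = elem.get("{%s}id" % XMI_NS)
    name = elem.get("name")
    if xmi_id and name:
        elements[xmi_id] = name

for xmi_id, name in elements.items():
    print(xmi_id, name)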
As the UML modeling tool, we use Enterprise Architect from Sparx Systems in version 7.1. The UML models we use as input for our prototype are exported as XMI 2.1 XML files [14].
Our current version of the program enables the generation of the allocation model from a UML activity diagram as the application model, a UML composite structure
Fig. 3.6 The physical deployment of the application as MARTE allocation model
diagram as the resource model, and a UML activity diagram with activity partitions as activity threads.
The activity thread of Fig. 3.4 is unfolded into single threads without forks by a graph transformation. This results in the eight threads shown in Fig. 3.5, each representing one system design alternative. In the following, each of these alternatives is treated separately with the algorithm described below (a condensed code sketch is given after the list):
• A new composite structure diagram, used as the allocation model for the particular system alternative under evaluation, is built. Inside this diagram, two empty classes are generated, the first serving as a container for the application and the second as a container for the architecture. Inside the application class, an empty class is generated for each activity of the application model. Within this class, a single UML part is created for each hardware component utilized by this application. It holds the MARTE stereotype app_allocated and represents the utilization of a component (bus, CPU, memory, etc.) by the particular application.
• As the algorithm steps through the activities of the activity thread, the mappings of application parts to architectural components are resolved. The first time a component of the hardware model occurs as the target of an allocation, a UML part is created to represent this component. This UML part is placed inside the hardware container of this model. The generated part carries the MARTE stereotype ep_allocated.
• The mapping of functional parts to architectural components is defined by the corresponding activity thread. The mapping connects the application to the hardware. The connection starts at the UML part within the representation of the application and ends at the UML part representing the hardware component. This connection is stereotyped as an allocation.
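A condensed sketch of this generation step, in Python like the prototype, is given below; the data structures are simplified stand-ins (an unfolded thread is assumed to be a plain mapping from activity names to lists of hardware component names), not the XMI structures that the actual implementation manipulates.

def build_allocation_model(name, unfolded_thread):
    """Build one allocation model for a single unfolded activity thread.

    `unfolded_thread` maps every activity of the application model to the
    hardware components it is allocated to, for example
    {"A": ["ASIC1", "BUS", "RAM1"], "B": ["CPU", "BUS", "RAM2"]}.
    """
    model = {
        "name": name,
        "application": {},    # container class for the application side
        "architecture": {},   # container class for the hardware side
        "allocations": [],    # connections stereotyped as allocations
    }

    for activity, components in unfolded_thread.items():
        # One class per activity, holding one app_allocated part per
        # hardware component this activity uses.
        model["application"][activity] = {
            comp: {"stereotype": "app_allocated"} for comp in components
        }
        for comp in components:
            # Create the ep_allocated part the first time the component
            # occurs as the target of an allocation.
            model["architecture"].setdefault(comp, {"stereotype": "ep_allocated"})
            # Connect the application-side part to the hardware-side part.
            model["allocations"].append((activity, comp, "allocate"))
    return model

example = build_allocation_model(
    "alternative_1",
    {"A": ["ASIC1", "BUS", "RAM1"], "B": ["CPU", "BUS", "RAM2"]},
)
print(example["allocations"])

A serialization step (not shown) would then write such a structure back to an XMI 2.1 file so that it can be re-imported into the UML tool, as described next.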
As a result, we obtain a set of composite structure diagrams defining the allocation of the application to the resources, as seen in Fig. 3.6. To improve the readability of the diagram, we omitted the visualization of the allocation of the BUS component. The diagram is serialized as an XML file in the XMI 2.1 format and can be fed back into the UML design environment as immediate visual feedback of the process.
3.7 Visualization of Performance Feedback
The introduced method for the generation of allocation models was successfully integrated into an approach to visualize the suitability of design alternatives for a hardware/software system from the viewpoint of performance evaluation [13].
This method, depicted in Fig. 3.7, delivers information about the fulfillment of performance goals directly into the system description modeled with the UML MARTE profile. To this end, a novel UML stereotype was introduced that enables a model-valid integration of our approach into UML-supported system development processes. The contributed method fosters a guided design-space exploration and is seamlessly integrable into a system design flow with off-the-shelf tools.
Fig. 3.7 Generating performance simulation feedback for UML models
As we have shown, our approach of utilizing activity threads in the system modeling process with UML MARTE can be adopted into this framework, which was built to support the development process of embedded systems.
3.8 Summary and Outlook
In this work, we presented an effective method to automatically generate architectural alternatives for hardware/software systems. To streamline the hardware/software codesign process, we extended our established work, based on the activity thread approach, with the UML MARTE profile [10]. As a result, the contributed method fosters a guided design-space exploration and reduces the complexity and the work that the system developer would otherwise have to contribute manually.
For illustration, we designed an example system with the codesign method presented here. For the implementation, we generated a set of allocation
models as UML composite structure diagrams in XML. These diagrams define the
deployment of the system function to the system architecture.
Our current and future research efforts focus on the automated generation of simulation models from design alternatives. For this reason, we will include mechanisms from the Performance Analysis Modeling of MARTE. The generated simulation models, based on Extended Queuing Network Models, will provide information about the performance behavior and the power consumption of the system under design. This will enable us to predict whether prespecified timing and power-consumption goals can be met.
Furthermore, we will enhance our system modeling approach so that our estimation capability encompasses the dimension of power consumption as well.
References
1. S. Balsamo, M. Marzolla, and R. Mirandola. Efficient performance models in component-based software engineering. In SEAA ’06: Proceedings of the 32nd Euromicro Conference on
Software Engineering and Advanced Applications, Cavtat/Dubrovnik, Croatia, 2006.
2. V. Cortellessa and R. Mirandola. Deriving a queueing network based performance model from
UML diagrams. In WOSP ’00: Proceedings of the 2nd International Workshop on Software
and Performance, pages 58–70. Assoc. Comput. Mach., New York, 2000.
3. V. Cortellessa, A. D’Ambrogio, and G. Iazeolla. Automatic derivation of software performance models from CASE documents. Performance Evaluation, 45(2–3):81–105, 2001.
4. C. Dumoulin, P. Boulet, J.-L. Dekeyser, and P. Marquet. UML 2.0 structure diagram for intensive signal processing application specification. Rapport de recherche 4766, Institut National
de Recherche en Informatique et en Automatique (INRIA), March 2003.
5. T. Frederic, G. Sebastien, D. Jerôme, and T. François. Software real-time resource modeling.
In Forum on Specification and Design Languages (FDL), Barcelona, Spain, pages 231–236.
European Electronic Chips & Systems design Initiative (ECSI), September 2007.
6. G.P. Gu and D.C. Petriu. Early evaluation of software performance based on the UML performance profile. In CASCON ’03: Proceedings of the 2003 Conference of the Centre for
Advanced Studies on Collaborative Research, pages 66–79. IBM Press, Indianapolis, 2003.
7. G.P. Gu and D.C. Petriu. From UML to LQN by XML algebra-based model transformations.
In WOSP ’05: Proceedings of the 5th International Workshop on Software and Performance,
pages 99–110. Assoc. Comput. Mach., New York, 2005.
8. C. Kabajunga and R. Pooley. Simulating UML sequence diagrams. In 14th UK Performance
Engineering Workshop, Edinburgh, England, pages 198–207, July 1998.
9. P. Kaehkipuro. UML-based performance modeling framework for component-based distributed systems. In Performance Engineering—State of the Art and Current Trends, LNCS 2047,
pages 167–184. Springer, Berlin, 2001.
10. A.W. Liehr and K.J. Buchenrieder. Generation of related performance simulation models at
an early stage in the design cycle. In 14th IEEE International Conference on the Engineering
of Computer-Based Systems (ECBS), pages 7–14. IEEE Comput. Soc., Tucson, 2007.
11. A.W. Liehr and K.J. Buchenrieder. Performance evaluation of HW/SW-system alternatives. In
Design, Automation and Test in Europe (DATE) University Booth—Demonstration and Poster
Exhibition, Munich, Germany, March 2008.
12. A.W. Liehr and K.J. Buchenrieder. An XML based simulation method for extended queuing
networks. In L.S. Louca, editor, 22nd European Conference on Modelling and Simulation,
pages 322–328, Nicosia, Cyprus. European Council for Modelling and Simulation, June 2008.
13. A.W. Liehr, K.J. Buchenrieder, and U. Nageldinger. Visual feedback for design-space exploration with UML MARTE. In The Fifth International Conference on Innovations in Information Technology, Al Ain, UAE. IEEE Comput. Soc., Los Alamitos, 2008.
14. OMG. MOF 2.0/XMI Mapping Specification, v2.1. OMG Document 05-09-01, Object Management Group, September 2005.
15. OMG. The unified modeling language (UML), version 2.0. http://www.omg.org/technology/
documents/formal/uml.htm, July 2005.
16. OMG. OMG Systems Modeling Language (OMG SysML), v1.0. OMG Document 2007-09-01,
Object Management Group, September 2007.
17. OMG. OMG Unified Modeling Language. OMG Document 2007-11-04, Object Management
Group, November 2007.
18. OMG. UML profile for MARTE. OMG Document 07-08-04, Object Management Group,
August 2007.
19. D.B. Petriu and M. Woodside. Analysing software requirements specifications for performance. In WOSP ’02: Proceedings of the 3rd International Workshop on Software and Performance, pages 1–9. Assoc. Comput. Mach., New York, 2002.
20. D.C. Petriu and X. Wang. Deriving software performance models from architectural patterns
by graph transformations. In TAGT’98: Selected Papers from the 6th International Workshop
on Theory and Application of Graph Transformations, London, UK, pages 475–488. Springer,
Berlin, 2000.
21. C.U. Smith. The evolution of software performance engineering: a survey. In ACM ’86: Proceedings of 1986 ACM Fall Joint Computer Conference, pages 778–783. IEEE Comput. Soc.,
Los Alamitos, 1986.
22. S. Taha, A. Radermacher, S. Gerard, and J.-L. Dekeyser. An open framework for detailed
hardware modeling. In The IEEE Second International Symposium on Industrial Embedded
Systems (SIES), Lisbon, Portugal, vol. 1, pages 118–125. IEEE Comput. Soc., Los Alamitos,
2007.
23. M. Woodside, D.C. Petriu, D.B. Petriu, H. Shen, T. Israr, and J. Merseguer. Performance by
unified model analysis (PUMA). In WOSP ’05: Proceedings of the 5th International Workshop
on Software and Performance, pages 1–12. Assoc. Comput. Mach., New York, 2005.
Chapter 4
Model-Driven System Validation by Scenarios
A. Carioni, A. Gargantini, E. Riccobene and
P. Scandurra
Abstract The chapter presents a method for scenario-based validation of embedded system designs provided in terms of UML models. This approach is based on
model transformations from SystemC UML graphical models into Abstract State
Machine (ASM) formal models, and exploits the scenario-based model validation of
the ASMs. This validation approach complements an existing model-driven design
methodology for embedded systems based on the SystemC UML profile. A validation tool integrated into an existing model-driven co-design environment to support
the proposed scenario-based validation flow is also presented. It allows the designer
to functionally validate system components from SystemC UML designs early at
high levels of abstraction.
Keywords Model-based design · System validation · Abstract state machines ·
UML
4.1 Introduction
In the Embedded System (ES) and System-on-Chip (SoC) design area, conventional system-level design flows usually start by developing a functional executable model of the system from a system specification written in natural language. It is an emerging practice to develop the functional model and refine it with SystemC (built upon C++), which is considered the de facto, open [20], industry-standard language for functional system-level models [27]. The functional executable model, as program code, introduces design decisions that should be postponed until a commitment between the software applications and the hardware platform has been established, and it suffers from all the limitations of coding with respect to modeling: less flexibility, limited reuse, and unreadable documentation. Furthermore, a system design given in terms of code is hardly traceable with respect to the initial specification and prevents a meaningful analysis of the system.
This work is supported in part by the project Model-driven methodologies and techniques
for embedded system design through UML, ASMs and SystemC at STMicroelectronics.
E. Riccobene ()
Dipartimento di Tecnologie dell’Informazione, Università degli Studi di Milano, Crema,
Italy
e-mail: riccobene@dti.unimi.it
Improving current system-level design would require new design methods, languages, and tools capable of raising the level of abstraction to a point where productivity can be improved, errors are easier to identify and correct, better documentation can be provided, and embedded system designers can collaborate more effectively. Furthermore, the early stages of the design process would benefit from the use of graphical interface tools that visualize the system specification and allow multiple team members to share the relevant information [1]. All these reasons have therefore caused increasing interest in visual software modeling languages like the UML (Unified Modeling Language) [28], which is able to capture and visualize system structure and behavior at multiple levels of abstraction and to generate executable models in C/C++/SystemC from system specifications.
Along this research line, we defined a model-driven methodology [25] and a development process [26] for embedded system design. The new design flow is based on the principles of high-level modeling, model transformation, and automatic code generation of the Model Driven Engineering (MDE) approach. As modeling languages, it involves UML 2, a SystemC UML profile (for the hardware side), and a multi-thread C UML profile (for the software side). It allows system modeling from a functional executable level down to the Register Transfer Level (RTL).
We here address the problem of analyzing high-level UML-based embedded system descriptions, namely, finding techniques for system model validation and verification. Validation is intended as the process of investigating a model with respect to the user's perception, in order to ensure that the specification really reflects the user needs and statements about the application, and to detect faults in the specification as early as possible with limited effort. Validation should precede the application of more expensive and accurate methods, like the formal verification of properties, which should be applied only once the designer is sufficiently confident that the requirements are satisfied. There exist different techniques for system design validation. The scenario-based one allows the designer to build critical scenarios reflecting the system requirements to be guaranteed and to check for requirements satisfaction. Of course, this technique requires tools able to support automatic scenario execution.
UML-based design methods are not yet well supported by effective validation methods, and, in general, formal model validation and verification techniques are not directly applicable to UML-based models, due to their lack of a precise semantics. Formal methods and analysis tools have most often been applied to low-level hardware design. However, these techniques are not applicable to system descriptions given as programs in system-level languages like SystemC, since such system descriptions are closer to software programs than to traditional hardware descriptions [29]. So far, the focus in the literature has been mainly on traditional code-based simulation rather than on design model validation.
To tackle the problem of validating UML-based system models, we combine our SystemC UML modeling language with the Abstract State Machine (ASM) [6] formal notation in order to automatically map a visual UML model into a formal ASM model, and then to exploit well-established techniques for ASM model analysis. This approach allows us to functionally validate SystemC UML designs early, at
high levels of abstraction. In particular, we here present the scenario-based validation of embedded system designs provided as SystemC UML models. As a proof of concept, the chapter reports the results of the scenario-based validation of the Simple Bus case study from the SystemC distribution.
We also present a validation tool, integrated into our model-driven HW-SW codesign environment originally presented in [24], to support the scenario-based validation flow. It makes use of the ASMETA (ASM mETAmodeling) toolset [4] as
supporting tools around ASMs.
The choice of the ASMs among other formal methods is intentional and due to
the fact that this method (a) comes with a rigorous scientific foundation [6], (b) provides executable specifications and, therefore, it is suitable for high-level model
validation, (c) is endowed with a metamodel [10] defining the ASM abstract syntax
in terms of an object-oriented representation, and the metamodel availability allows
automatic mapping of SystemC UML models into ASM models by exploiting MDE
techniques of automatic model transformations [30].
A preliminary version of this work was presented in [11]. We here provide more details on the language for scenario modeling and on the tool components that allow transformations from visual to formal models and model validation.
This chapter is organized as follows. Section 4.2 provides some background on ASMs and their supporting toolset. Section 4.3 presents our basic idea on how to target validation in the ASM context and presents the language for scenario construction. Section 4.4 focuses on the model validation flow by describing the mapping from SystemC UML models to ASM models and the scenario-based approach for the high-level validation of SystemC UML models. Section 4.5 provides some results of the scenario-based validation of the Simple Bus case study. Section 4.6 discusses some relevant related work. Finally, Sect. 4.7 concludes the chapter.
4.2 ASMs and ASMETA
Abstract State Machines are an extension of FSMs, where unstructured control states are replaced by states with arbitrarily complex data. The states of an ASM are multi-sorted first-order structures, i.e., domains of objects with functions and predicates defined on them, while the transition relation is specified by “rules” describing the modification of the functions from one state to the next. A complete mathematical definition of the ASM method can be found in [6]. The notion of ASMs ranges from a definition that formalizes simultaneous parallel actions of a single agent, either in an atomic way (Basic ASMs) or in a structured and recursive way (Structured or Turbo ASMs), to a generalization where multiple agents interact (Multi-agent ASMs). Appropriate rule constructors also allow non-determinism and unrestricted synchronous parallelism.
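To give a concrete intuition of this semantics, the toy Python sketch below represents a state as a mapping from locations to values and applies the update set produced by a rule in one atomic step; it illustrates only the simultaneous-update idea and is not related to the ASMETA tools.

def rule_swap(state):
    """A basic ASM rule: both updates are evaluated in the *current* state,
    so applying them together swaps x and y."""
    return {"x": state["y"], "y": state["x"]}

def asm_step(state, rule):
    """One ASM step: compute the update set in the current state, then
    apply all updates simultaneously to obtain the next state."""
    updates = rule(state)
    next_state = dict(state)
    next_state.update(updates)
    return next_state

state = {"x": 1, "y": 2}
print(asm_step(state, rule_swap))   # {'x': 2, 'y': 1}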
The ASMETA (ASM mETAmodeling) toolset [4, 10] is a set of tools around
ASMs developed according to the model-driven development principles. At the
core of the toolset, the AsmM metamodel [4], available in both meta-languages
OMG/MOF [17] and EMF/Ecore [3], provides a complete object-oriented representation of ASM concepts.
The ASMETA toolset includes: a notation, AsmetaL, to write ASM models conforming to the AsmM in a textual and human-comprehensible form; a text-to-model compiler, AsmetaLc, to parse AsmetaL models and check their consistency w.r.t. the AsmM OCL constraints; a simulator, AsmetaS, to execute ASM models; the AVALLA language, a domain-specific modeling language for scenario-based validation of ASM models, with its supporting tool, the ASMETAV validator; and the ATGT tool, a test case generator based on the SPIN model checker [15].
4.3 Scenario-Based Validation of ASM Models
Scenario-based validation of ASM models [9] requires the formalization (complete
or incomplete) of the system behavior in terms of an ASM specification, and a scenario representing a description of external actor actions and system reactions.
We support two kinds of external actors: the user, who has only a black-box (i.e., outside) view of the system, and the observer, who instead has a gray-box (i.e., also internal) view. By allowing two types of actors, we are able to build scenarios useful for classical validation (those including user actions and machine reactions), and scenarios useful for testing activities (those also including observer actions), which require the inspection of the internal configurations of the machine. Therefore, our scenario-based validation approach goes beyond the UML use cases it was inspired by, and has the twofold goal of model validation and model testing.
A user actor interacts with the system in a black-box manner: he or she sets the values of the external environment, thus asking for a particular service, waits for a step of the machine as a reaction to the request, and can check the values output by the system. An observer actor has the further capabilities of inspecting the internal state of the system (i.e., the values of machine functions and locations), requiring the execution of particular system (sub-)services of the machine, and checking the validity of possible invariants of a certain scenario. We describe scenarios in an algorithmic way as interaction sequences consisting of actions, where each action is in turn an activity of a user or observer actor, and an activity of the machine as a reaction to the actor's actions.
4.3.1 The AVALLA Language
The AVALLA language has been defined in [9] as a domain-specific modeling language in the context of scenario-based validation of ASM models written in AsmetaL.
Figure 4.1 shows the AVALLA metamodel, which defines the language abstract syntax in terms of an (object-oriented) model. For a formal definition of the AVALLA semantics, see [9].
An instance of the class Scenario represents a scenario of a provided ASM
specification. A scenario has an attribute name, an attribute spec denoting the
Fig. 4.1 The AVALLA metamodel
ASM specification to validate, and a list of target commands of type Command. Additionally, a scenario may contain the specification of some critical properties, here referred to as scenario invariants, that should always hold (and are therefore checked) for the particular scenario. The composite associations between the Scenario class (the whole) and its component classes (the parts) Invariant and Command ensure that each part is included in at most one Scenario instance.
The abstract class Command and its concrete sub-classes provide a classification of scenario commands. The Set command allows the user actor to set the external environment, i.e., to supply values of monitored or shared functions as input signals to the system. The Check class represents commands supplied either by the user actor to inspect external property values, or by the observer actor to further inspect internal property values in the current state of the underlying ASM. By an Exec command, an observer actor may require the execution of particular ASM transition rules performing given system (sub-)services. Finally, the commands Step and StepUntil represent the reaction of the system, which can execute a single ASM step or ASM steps iteratively until a specified condition becomes true.
Examples of scenario scripts are provided in Sect. 4.5 for the Simple Bus case study.
4.4 The Model-Driven Validation Environment
The scenario-based validation environment has been developed as a component of
a more complex co-design environment [24], which allows embedded system modeling, at different levels of abstraction, by using the SystemC UML profile [23] and
forward/reverse engineering to/from C/C++/SystemC programming languages.
Figure 4.2 shows the architecture of the validation component.
The scenario-based validation process starts by applying (phase 1) the
UML2AsmM transformation to the SystemC-UML model of the system (exported
Fig. 4.2 Architecture of the validation environment
from the UML modeler component of the co-design environment [24]). This automatic mapping transforms the input visual model into a corresponding ASM model written in AsmetaL.
Once the ASM model is generated, system validation (phase 2) is possible by supplying suitable scenarios written in AVALLA.
A brief description of each activity follows. Note that, in terms of required skills and expertise, the designer only has to become familiar with the SystemC UML profile (embedded in the UML modeler) and with a very few commands of the AVALLA textual notation to write pertinent validation scenarios.
4.4.1 From SystemC UML Models to ASM Models
SystemC UML models, provided as input from the co-design tool [24], are transformed into corresponding ASM models (instances of the AsmM metamodel). This transformation is defined (once and for all) by establishing a set of semantic mapping rules between the SystemC UML profile and the AsmM metamodel. This UML2AsmM transformation is completely automated by means of the ATL transformation engine [2], developed as a possible implementation of the OMG QVT [22] standard.
In order to provide a one-to-one mapping (for both the structural and the behavioral aspects), we first had to express in terms of ASMs the SystemC discrete (absolute and integer-valued) and event-based simulation semantics. To this end, we took inspiration from the ASM formalization of the SystemC 2.0 simulation semantics in [19] to define a precise and executable semantics of the SystemC UML profile and, in particular, of the SystemC scheduler and the SystemC process state machines (an extension of UML statecharts for modeling the behavior of the reactive SystemC processes). We then proceeded to model in ASMs the predefined set of interfaces, ports, and primitive channels (the SystemC layer 1), and the SystemC-specific data types. The resulting SystemC-ASM component library is available as the target of the UML2AsmM transformation.
Exploiting the SystemC-ASM component library, a SystemC module M is
mapped into an ASM containing in its signature a dynamic abstract domain M.
This domain is the set of instances that can be created by the corresponding module.
Module attributes and ports of type T are mapped into controlled ASM functions
declared in the signature of the ASM corresponding to the module. Basically, these
functions have M as domain, and T as codomain. Multiplicity and properties (like
ordered, unique, etc.) of attributes and ports are captured by the codomain types of
the corresponding functions. A multi-port of type T, for example, is mapped into
a controlled ASM function with codomain P(T ), i.e. the mathematical powerset
of T . A hierarchical channel is treated as a module. A primitive channel is mapped,
instead, into a concrete sub-domain of the predefined abstract domain PrimChannel,
which is part of the SystemC-ASM component library. An event is mapped into an
element of a predefined abstract domain Event.
For the behavioral part, a process (a sc_thread or a sc_method) is mapped
into an element of a predefined abstract domain Process. A process behavior
within a module is defined by a named, possibly parameterized, transition rule
declared within the ASM corresponding to the container module. Moreover, since
in the SystemC process state machines, control structures (like if-then-else,
while loop, etc.) and process synchronization points (statements like wait,
static_wait, dont_initialize, etc.) are modeled in terms of stereotyped
pseudo-states (junction or choice) and states, respectively, a one-to-one mapping is
defined between the state-like diagram of the process behavior and the basic ASM
rule constructs (if-then-else rule, seq rule, etc.). Some special ASM rule constructs,
however, have been introduced in the SystemC-ASM component library in order
to capture in ASMs the semantics underlying all possible forms of synchronization
calls (which require dealing with the ASM agent representing the SystemC scheduler). In particular, the infinite loop mechanism of a thread has been modeled with
a specific design pattern of ASM rule constructors.
As an example of the application of this mapping, Fig. 4.3 shows the UML notation, the SystemC code, and the resulting ASM (in AsmetaL) for a module.
Fig. 4.3 A UML module (A), its SystemC code (B) and its corresponding ASM (C)
4.4.2 Model Validator
Scenarios written in AVALLA are executed by means of the ASMETAV validator. It is a Java application which makes use of the AsmetaS simulator to run scenarios. ASMETAV reads a user scenario written in AVALLA (see Fig. 4.2), builds the scenario as an instance of the AVALLA metamodel by means of a parser, and transforms the scenario and the AsmetaL specification the scenario refers to into an executable AsmM model. Then, ASMETAV invokes the AsmetaS interpreter to simulate the scenario. During simulation, the user can pause the simulation and watch the current state and the value of the update set at every step through a watching window. ASMETAV captures any check violation and, if none occurs, it finishes with a “PASS” verdict. Besides the “PASS”/“FAIL” verdict, during the scenario run ASMETAV collects in a final report some information about the coverage of the original model; this is useful to check which transition rules have been exercised.
4.5 The Simple Bus Case Study
The Simple Bus case study is a well-known transaction-level example, designed to also allow cycle-accurate simulation. It consists of about 1200 lines of code that implement a high-performance, abstract bus model. The complete code is available at the official SystemC web site [20].
The Simple Bus system was modeled [23] in a forward engineering flow using the SystemC UML profile. The UML object diagram in Fig. 4.4 shows the internal collaboration structure of the objects involved in a specific configuration of the Simple Bus design: three master blocks (a blocking master master_b, a non-blocking master master_nb, and a monitor master_d); two slave memories (a fast one, mem_fast, and a slow one, mem_slow); a bus connecting masters and slaves; an arbiter with a priority-based arbitration to select a request to serve and with bus-locking support; and a clock generator C1.¹ Every master submits read/write requests to the bus at regular time instants. The designer assigns a unique priority to each master: master_nb has priority 3, while master_b has priority 4. Masters can issue a request at the same time, so the arbiter must choose one request according to some deterministic rules. In the simplest case, precedence is accorded to the device with the higher priority;² in our case, the non-blocking master has priority 3, which is higher (priorities follow a decreasing order) than the priority 4 of the blocking master. When a master occupies the bus, an incoming request is therefore queued and served later at a different time instant, or served from the next clock cycle if it has a higher priority (and the current request will be terminated later).
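A compact sketch of this arbitration rule is given below; it assumes pending requests are simple (priority, master) pairs and treats the bus-locking case in a simplified way, so it only illustrates the policy described in the text, not the actual simple_bus arbiter code.

def arbitrate(pending_requests, bus_locked_by=None):
    """Select the request to serve at a negative clock edge.

    `pending_requests` is a list of (priority, master) pairs; a lower
    numeric priority wins (3 beats 4), and priorities are unique, so the
    choice is deterministic. If a master has locked the bus, its request
    is served regardless of priority (simplified view of bus locking).
    """
    if not pending_requests:
        return None
    if bus_locked_by is not None:
        for request in pending_requests:
            if request[1] == bus_locked_by:
                return request
    return min(pending_requests)

print(arbitrate([(4, "master_b"), (3, "master_nb")]))   # (3, 'master_nb')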
To illustrate the typical use of the AVALLA language in writing validation scenarios, below we report two scenario examples and their related validation results for the Simple Bus design.
¹ Note that all connectors are intended as stereotyped with «sc_connector».
² Two devices cannot have the same priority, so determinism is assured.
Fig. 4.4 Simple Bus—UML Object Diagram
The first scenario shows how high-level modeling tools like AsmetaV/AVALLA are helpful to abstract and factor out monitoring and debugging functionality that is typically embedded within the SystemC design (in our case within the master_d monitor, the arbiter, and the bus) by inserting C++ code lines, thus further alleviating the designers’ burden of writing code. The second scenario shows instead how to validate the fairness of the arbitration rules adopted for scheduling the masters’ requests.
Scenario s1 At given time instants, the memory locations between address 120
and address 132 are read (directReadBus). The actual values must match the
expected values.
scenario s1 load Top.asm
step until time = 0 and phase = TIMED_NOTIFICATION;
check directReadBus(bus, 120) = 0
and directReadBus(bus, 124) = 0
and directReadBus(bus, 128) = 0
and directReadBus(bus, 132) = 0;
step until time = 1600 and phase = TIMED_NOTIFICATION;
check directReadBus(bus, 120) = 16
and directReadBus(bus, 124) = 0
and directReadBus(bus, 128) = 0
and directReadBus(bus, 132) = 0;
Scenario s2 At time 0, the master_nb (with priority 3) issues a read request
(status = SIMPLE_BUS_REQUEST and do_write = false) at address
56 (address = 56), and the master_b (with priority 4) issues a burst read request from address 76 to 136. We assume that the clock period is 15 time units. The bus checks the requests at each negative clock edge. At time 15, the bus must serve the master with the higher priority, i.e., the master_nb, and complete its request (status = SIMPLE_BUS_OK). At time 30, the master_nb issues a new write request at address 56. At time 45, the bus again serves the master_nb, ignoring for the second time the still pending read request of the master_b.
scenario s2 load Top.asm
step until time = 0 and phase = TIMED_NOTIFICATION;
check (exist $r00 in Request with priority($r00) = 3
and do_write($r00) = false
and address($r00) = 56
and status($r00) = SIMPLE_BUS_REQUEST);
check (exist $r01 in Request with priority($r01) = 4
and do_write($r01) = false
and address($r01) = 76
and end_address($r01) = 136
and status($r01) = SIMPLE_BUS_REQUEST);
step until time = 15 and phase = TIMED_NOTIFICATION;
check (exist $r02 in Request with priority($r02) = 3
and status($r02) = SIMPLE_BUS_OK);
step until time = 30 and phase = TIMED_NOTIFICATION;
check (exist $r03 in Request with priority($r03) = 3
and do_write($r03) = true
and address($r03) = 56
and status($r03) = SIMPLE_BUS_REQUEST);
step until time = 45 and phase = TIMED_NOTIFICATION;
check (exist $r04 in Request with priority($r04) = 3
and status($r04) = SIMPLE_BUS_OK);
Both scenarios ended with verdict PASS and achieved coverage of all ASM rules
of the Simple Bus model.
4.6 Related Work
In [21], the authors present a model-driven development and validation process which begins by creating (from a natural language specification of the system requirements) a functional abstract model and (still manually) a SystemC implementation model. The abstract model is described using the Abstract State Machine Language (AsmL), another implementation language for ASMs. Our methodology, instead, benefits from the use of UML as the design entry level and of model translators which provide automation and ensure consistency among descriptions in different notations (such as those in SystemC and ASMs). Moreover, the latter can remain hidden from the designer, making the process completely transparent to users who do not want to deal with them. In [21], a designer can visually explore the actions of interest in the ASM model using the Spec Explorer tool and generate tests. These tests are used to drive the SystemC implementation from the ASM
model to check whether the implementation model conforms to the abstract model
(conformance testing). The test generation capability is limited and not scalable. In
order to generate tests, the internal algorithm of Spec Explorer extracts a finite state
machine from ASM models and then uses test generation techniques for FSMs. The
effectiveness of their methodology is therefore severely constrained by the limits of
Spec Explorer. The authors themselves say that the main difficulty is in using Spec
Explorer and its methods for state space pruning/exploration. The ASMETA ATGT
tool that we want to use for the same goal exploits, instead, the method of model
checking to generate test sequences, and it is based on a direct encoding of ASMs
in PROMELA, the language of the model checker SPIN [15].
The work in [13] also uses AsmL and Spec Explorer to establish a development and
verification methodology for SystemC. They focus on assertion based verification
of SystemC designs using the Property Specification Language (PSL), and although
they mention test case generation as a possibility, the validation aspect is largely
ignored. We were not able to investigate their work carefully as their tools are unavailable. Moreover, it should be noted that the approaches in [13, 21], although using
the Spec Explorer tool, do not exploit the scenario-based validation feature of Spec
Explorer. Indeed, [5, 12] show how Spec Explorer allows scenario-oriented
modeling.
In [16], a model-driven methodology for development and validation of system-level SystemC designs is presented. The development and validation flow is entirely
based on the specification of a functional model (reference model) in the ESTEREL
language, a state machine formalism, and on the use of the ESTEREL Studio development environment for the purpose of test generation. The proposed approach
concentrates on providing coverage-directed test suite generation for system level
design validation.
The authors of [7] provide test case generation by performing static analysis on SystemC designs. This approach is limited by the strength of the static analysis tools,
and the lack of flexibility in describing the reachable states of interest for directed
test generation. Moreover, static analysis requires sophisticated syntactic analysis
and the construction of a semantic model, which for a language like SystemC (built
on C++) is difficult due to the lack of formal semantics.
The SystemC Verification Library [20] provides an API for transaction-based verification, constrained and weighted randomization, exception handling, and HDL connection. We aim, however, at developing formal techniques to augment SystemC
verification.
The Message Sequence Chart (MSC) notation [18], originally developed for
telecommunication systems, can be adapted to embedded systems to allow validation. For instance, in [8] MSC is adopted to visualize the simulation of SystemC
models. The traces are only displayed and not validated, and the authors report the
difficulties of adopting a graphical notation like MSC. Our approach is similar to
that presented in [14], where the MSCs are validated against the SDL model, from
which a SystemC implementation is derived. MSCs are also generated by the SDL
model and replayed for cross-validation and regression testing.
4.7 Conclusions and Future Work
We proposed a scenario-based validation approach to system-level design by the
use of the SystemC UML profile (for the modeling part) and the ASM formal
method and its related ASMETA toolset (for the validation part). We have been
testing our validation technique on case studies taken from the standard SystemC
distribution, such as the Simple Bus presented here, and on some case studies of industrial interest. Thanks to the ease of raising the abstraction level using ASMs, we believe our
approach scales effectively to industrial systems.
This work is part of our ongoing effort to enact design flows that start with system
descriptions using UML-notations and produce C/C++/SystemC implementations
of the SW and HW components as well as their communication interfaces, and that
are complemented by formal analysis flows for system validation and verification.
As a future step, we plan to integrate ASMETAV with the ATGT tool of the ASMETA toolset, to be able to automatically generate some scenarios by using ATGT and ask for a certain type of coverage (rule coverage, fault detection, etc.). Test cases generated by ATGT and the validation scenarios can be transformed into concrete SystemC test cases to test the conformance of the implementations with respect to their specification. Moreover, we plan to support formal verification of system properties by
model checking techniques. This requires transforming ASM models into models
in the language of the model checkers, such as the Promela language of the SPIN
model checker.
References
1. R. Chen, M. Sgroi, G. Martin, L. Lavagno, A.L. Sangiovanni-Vincentelli, and J. Rabaey. Embedded system design using UML and platforms. In E. Villar and J. Mermet, editors, System
Specification and Design Languages, CHDL Series. Kluwer Academic, Dordrecht, 2003.
2. The ATL language. www.eclipse.org/m2m/atl/.
3. Eclipse Modeling Framework. www.eclipse.org/emf/.
4. The ASMETA toolset. http://asmeta.sf.net/, 2006.
5. M. Barnett et al. Validating use-cases with the AsmL test tool. In QSIC Int. Conference on
Quality Software, pages 238–246. IEEE Press, New York, 2003.
6. E. Börger and R. Stärk. Abstract State Machines: A Method for High-Level System Design and
Analysis. Springer, Berlin, 2003.
7. F. Bruschi, F. Ferrandi, and D. Sciuto. A framework for the functional verification of SystemC
models. Int. J. Parallel Program., 33(6):667–695, 2005.
8. T. Kogel et al. Virtual architecture mapping: a SystemC based methodology for architectural
exploration of system-on-chip designs. In A.D. Pimentel and S. Vassiliadis, editors, Computer Systems: Architectures, Modeling, and Simulation, SAMOS, LNCS 3133, pages 138–
148. Springer, Berlin, 2004.
9. A. Gargantini, E. Riccobene, and P. Scandurra. A scenario-based validation language for
ASMs. In ABZ’ 08: Proc. of the 1st International Conference on Abstract State Machine,
B and Z, LNCS 5238, pages 71–84. Springer, Berlin, 2008.
10. A. Gargantini, E. Riccobene, and P. Scandurra. A language and a simulation engine for abstract
state machines based on metamodeling. Journal of Universal Computer Science, 14(12):1949–
1983, 2008.
11. A. Gargantini, E. Riccobene, P. Scandurra, and A. Carioni. Scenario-based validation of Embedded Systems. In FDL’ 08: Proc. of Forum on Specification and Design Languages, pages
191–196. IEEE Press, New York, 2008.
12. W. Grieskamp, N. Tillmann, and M. Veanes. Instrumenting scenarios in a model-driven development environment. Information & Software Technology, 46(15):1027–1036, 2004.
13. A. Habibi and S. Tahar. Design and verification of SystemC transaction-level models. IEEE
Transactions on VLSI Systems, 14:57–68, 2006.
14. M. Haroud et al. HW accelerated ultra wide band MAC protocol using SDL and SystemC. In
IEEE Radio and Wireless Conference, pages 525–528. IEEE Press, Los Alamitos, 2004.
15. G.J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering,
23(5):279–295, 1997.
16. D. Mathaikutty, S. Ahuja, A. Dingankar, and S. Shukla. Model-driven test generation for system level validation. In HLVDT’07: High Level Design Validation and Test Workshop, pages
83–90. IEEE Press, New York, 2007.
17. OMG. The meta object facility, formal/2002-04-03.
18. Message Sequence Charts (MSC). ITU-T. Z.120, 1999.
19. W. Müller, J. Ruf, and W. Rosenstiel. SystemC Methodologies and Applications. Kluwer Academic, Dordrecht, 2003.
20. Open SystemC Initiative. http://www.systemc.org.
21. H.D. Patel and S.K. Shukla. Model-driven validation of SystemC designs. In DAC’07: Proc.
of the 44th Design Automation Conference, pages 29–34. Assoc. Comput. Mach., New York,
2007.
22. OMG. Query/Views/Transformations, ptc/07-07-07.
23. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. A UML2 profile for SystemC 2.1.
STMicroelectronics Technical Report, April 2007.
24. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. A model-driven design environment
for embedded systems. In DAC’06: Proc. of the 43rd Design Automation Conference, pages
915–918. Assoc. Comput. Mach., New York, 2006.
25. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. A model-driven co-design flow for embedded systems. In Advances in Design and Specification Languages for Embedded Systems
(Best of FDL’06), 2007.
26. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. Designing a unified process for embedded systems. In Fourth Int. Workshop on Model-Based Methodologies for Pervasive and
Embedded Software. IEEE Press, New York, 2007.
27. T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic,
Dordrecht, 2002.
28. OMG. The Unified Modeling Language. www.uml.org.
29. M.Y. Vardi. Formal techniques for SystemC verification. In DAC’07: Proc. of the 44th Design
Automation Conference, pages 188–192. IEEE Press, New York, 2007.
30. T. Zhang, F. Jouault, J. Bézivin, and J. Zhao. An MDE based approach for bridging formal
models. In Proc. 2nd IFIP/IEEE International Symposium on Theoretical Aspects of Software
Engineering. IEEE Comput. Soc., Los Alamitos, 2008.
Chapter 5
An Advanced Simulink Verification Flow
Using SystemC
Kai Hylla, Jan-Hendrik Oetjens and
Wolfgang Nebel
Abstract Functional verification is a major part of today’s system design task. Several approaches are available for verification on a high level of abstraction, where
designs are often modeled using MATLAB/Simulink, as well as for RT-level verification. These differing approaches are a barrier to a unified verification flow. For simulation-based RT-level verification, an extended test bench concept has been developed at Robert Bosch GmbH. This chapter describes how this SystemC-based test
bench concept can be applied to Simulink models. The implementation of the resulting verification flow addresses the required synchronization of both simulation
environments, as well as data type conversion. An example is used to evaluate the
implementation and the whole verification flow. It is shown that using the extended
verification flow saves a significant amount of time during development. Reusing
test bench modules and test cases preserves consistency of the test bench. Verification is done automatically rather than by inspecting the waveform manually. The
extended verification flow unifies system-level and RT-level verification, yielding a
holistic verification flow.
Keywords SystemC · Simulink · Verification · Co-simulation · Test bench
5.1 Introduction
Chip complexity and size have been increasing ever since the first chip was developed. Verification effort tends to increase exponentially with the size of the design.
Today’s verification effort is about 70% of the total project effort [2]. Alongside the development of new chip design methodologies, methodologies for verification have been developed. The first verifications were done by inspecting the design manually. The increasing complexity of designs led to the introduction of test benches. Test benches stimulate the design under verification (DUV) in a reproducible way. The increasing demand for security and safety led to formal verification methodologies. However, these cannot yet handle larger designs in an appropriate amount of time. Until they become usable, simulation-based verification remains the first choice.
K. Hylla
OFFIS Institute, Escherweg 2, 26129 Oldenburg, Germany
e-mail: kai.hylla@offis.de
Since a lot of design constraints and requirements are to be met, many test cases
are needed. Verification of designs needs to be done automatically and repeatably. Concepts allowing reuse of test benches are needed. Ideally, these concepts provide constrained random data generation and are usable across different levels of
abstraction.
This contribution addresses a simulation-based verification flow, starting at system level with MATLAB/Simulink models. It describes concisely how a
test bench concept [3], already used in the company’s verification flow, is extended
to cover different levels of abstraction. The extended test bench concept is usable at
RT-level as well as at a higher level, where designs are modeled using Simulink.
The following Sect. 5.2 describes some solutions available for verification of
system models. It also states why it is difficult to apply them to the Simulink-based
verification flow currently used at Robert Bosch GmbH. Section 5.3 describes the
weaknesses of a conventional verification flow, a way to improve the test bench concept, and how this could be applied to Simulink models. Section 5.4 describes how
the extended verification flow has been implemented, considering synchronization
and data type conversion. Section 5.5 shows how the implementation has been evaluated. An example is used to evaluate the extended verification flow presented in
this chapter. Finally, Sect. 5.6 gives a conclusion and identifies further work.
5.2 Related Work
Often, verification is done in an unstructured way. However, there are several tools
and methodologies for structured system verification available.
A recent methodology for verifying models is provided with the Open Verification Methodology (OVM) [5]. It is jointly developed by Cadence and Mentor
Graphics. OVM is based upon the Universal Reuse Methodology (uRM) [11] and the
Advanced Verification Methodology (AVM) [1]. OVM provides an object-oriented
concept for verifying models. Communication between components of the verification environment is done by using transaction-level modeling (TLM). Stimuli can
be generated using different layers and transactions can be randomized automatically.
The Verification Methodology Manual for SystemVerilog (VMM) [12], which is
available from Synopsys, specifies a functional verification methodology. The standard library defined by the VMM includes constrained random stimulus generation,
functional coverage collection, assertions, and TLM. The layered structure allows
creating both simple and complex test benches.
A well-known tool for verifying Simulink models is Simulink Verification and
Validation [9]. It allows designs to be created based upon requirements and provides integrated requirements management. During simulation, user-extendable verification blocks determine whether or not the requirements have been implemented correctly. This can be done by assuring that the assertions on test signals hold. It also allows a coverage analysis of the models. A similar library is Simulink Design
Verifier [6], which allows verification of requirements based on property proving.
OVM and VMM are unsuitable for Simulink models, while Simulink Verification and Validation as well as Simulink Design Verifier address the verification of
Simulink models. However, these Simulink tools have a significantly different concept compared to the extended test bench concept, which is used at the company for
RT-level verification. A combination of both is hardly feasible.
5.3 Extended Verification Flow
As mentioned in the previous section, an extension of the existing verification flow
is necessary. Before implementing such an extension, it is required to understand
the problems and shortcomings of the existing flow. The company’s constraints and requirements must also be considered. The next section shows a conventional flow
and describes the extended test bench concept developed at the company. The subsequent section shows how the existing verification flow can be extended to comply
with the company’s needs.
5.3.1 Conventional Flow
A conventional verification flow starts with the specification of the system to be built.
The specification is usually provided in a textual and often non-formalized form.
In many instances, it consists of a set of requirements and constraints. As shown in
Fig. 5.1, on the following page, the description of the system is brought into an executable specification using Simulink. During this step, the system is partitioned into
analog and digital components. Then, in order to verify the system, a test bench is
created. This test bench consists of several test bench modules (TBM), which generate stimuli for the system and observe the system’s output. The partitioning into
TBMs is typically done based upon the identified functionality and interfaces. A test case (TC) describes the stimuli that should be created by the TBMs
and the design’s expected output. Ideally, the TC is completely separated from the
test bench. After the verification of the system has yielded satisfying results, the
system can be brought to a lower level of abstraction. Typically, a hardware description language (HDL) like VHDL or Verilog is used for that purpose. Here, the
verification flow is usually split into an analog and a digital verification flow. Aside
from the digital components, TBMs and TCs need to be ported, too. Normally this
is done manually. Therefore, each component needs to be rewritten by a developer.
The same applies to the test cases. In order to assure that porting has been done
accurately, the equivalence of the ported and the original component has to be shown, which is a serious task of its own. As the analog components no longer exist
in the digital flow, their behavior must be modeled. Therefore, new TBMs have to
be written. The test cases must be adapted to provide the information required by
the new TBMs. Again, the system is verified using the test bench. After satisfying
results are achieved, the results from the analog and the digital flow are merged and
further development towards the final chip design is done.
Fig. 5.1 Conventional verification flow
Fig. 5.2 Conventional test bench concept
5.3.1.1 Conventional Test Bench Concept
As just mentioned, test bench and TC should be clearly separated from each other.
However, developers often do not follow that rule. Test bench and test case are frequently mixed up together, especially when writing test benches in an HDL. Signals are manipulated directly from within the test bench, and even partitioning the test bench into TBMs is not performed regularly. Often each TC is implemented in its own test bench, as shown in Fig. 5.2.
The test bench only stimulates the DUV. The output is traced and finally viewed using a waveform viewer. The developer decides whether the behavior of the DUV is correct or not. The stimulus is reproducible. However, verification has to be done manually each time the simulation has run. On larger designs, manual
verification can hardly be done for the whole design. There are too many signals
to be considered. As test bench and test case are indistinguishable from each other,
there is a large number of test benches, one for each test case. This leads to a
confusing verification environment. In order to write a new test case, a complete
test bench has to be written, which is more time consuming than writing the new test
case only. If the interfaces of one of the components change, all existing test benches
need to be adjusted. This causes a significant effort. Changing a test case requires
changing the test bench, too. This is more complicated as the test bench contains a
lot more information than required for specifying the test case only. Changing the
test bench requires re-compiling it, which, depending on the size of the design, can
be time consuming.
5.3.1.2 Extended Test Bench Concept
In order to deal with the problems just mentioned, an extended test bench concept
has been developed at Robert Bosch GmbH [3]. The concept forces a strict separation of test bench and test case. Test cases are implemented as single text files,
the so-called command files. The command files contain the instructions to be executed by the test bench. The test bench itself consists of several TBMs and a shared
controller. The controller processes the command file and operates the TBMs. The
TBMs implement commands that can then be used within the command file. Complex operations can be provided by the TBMs by means of simple commands. Because of these user-defined commands, the syntax of the command file is highly flexible and extensible. Test cases for verifying interfaces can be reused independently of the concrete implementation of the protocol; TBMs implementing different protocols simply need to provide the same commands. The same TBMs could
be reused to verify different designs, if the components of the design implement the
same interface. Figure 5.3 shows the conceptual design of a test bench using the
extended test bench concept.
The concept was first implemented using VHDL. This implementation has been used within a production environment for several years. Currently, the concept is being
adapted to SystemC [10]. Due to the usage of SystemC, the controller is enhanced.
It provides a set of new features like constrained randomization, which can be used
within the command files. Co-simulation of VHDL and SystemC TBMs is possible. Existing VHDL TBMs can be used within a SystemC test bench environment.
The system to be built can be implemented using either SystemC or VHDL. This
allows a smooth transition from the VHDL test benches towards the more powerful
SystemC test benches.
Fig. 5.3 Extended test bench concept
Fig. 5.4 Extended verification flow
5.3.2 Extending the Verification Flow
The extended test bench concept described above has achieved several improvements. A pure SystemC-based approach is not suitable, since the existing flow
should be modified as little as possible. Therefore, the presented test bench concept should be used for the Simulink models as well. The resulting verification flow
is shown in Fig. 5.4.
The TBMs used for verifying the Simulink model are implemented as SystemC-based test bench modules according to the extended test bench concept. The TBMs
can be made available to Simulink using so-called S-functions. S-functions allow the
usage of C/C++ source code from within Simulink and appear as ordinary Simulink
blocks within the model. Again, the TBMs use the common controller, which is not
shown in the figure. Test cases are still implemented in a single command file each.
As shown in Table 5.1, the TBMs used for verifying Simulink can be used for verifying the VHDL description of the system, as well. No modifications are necessary.
Thus, development time and effort can be saved. Additionally, the likelihood of errors can be reduced. Using the TBMs that had stimulated the analog components at RT-level as well requires the analog components to be ported. Since the resulting code will be part of the test bench, and thus does not need to be synthesizable
in later steps, the Real-time Workshop [8] can be used. The Real-time Workshop allows generation of C-code from Simulink models. The generated source code is to
be combined with the original TBMs to form new ones. The new TBMs provide the
Table 5.1 Steps to be done when switching from Simulink level to RT level

Component                  Conventional flow                             Extended flow
digital                    ported                                        ported
analog                     rewritten as TBMs with a digital interface    connected to TBM
TBM (digital interface)    ported                                        reused
TBM (analog interface)     discarded                                     reused
Test case                  rewritten to adapt to new TBMs                reused
same commands as the old ones, but their output is equal to the output formerly generated by the analog components. Thus, no modifications of the TCs are required.
This concept allows a smooth integration into the company’s existing verification
flow. It can be used together with existing test bench modules, implemented using
Simulink components. This allows the developer to change gradually towards the
extended test bench concept. Models that had been developed using a conventional
verification flow can still be used.
5.4 Implementation
Co-simulation between SystemC and VHDL is already provided by the extended
test bench concept. Therefore the implementation addresses the co-simulation between SystemC and Simulink. It is described in two parts: The first one discusses the
synchronization of Simulink and SystemC, while the second one describes how data
could be exchanged between both environments. In order to integrate the extended
test bench modules into Simulink, it is required to write wrappers that implement
the S-function. Advanced base classes, which provide the functionality described
below, allow the developer to implement the module wrapper in a short amount of
time. In order to create flexible TBMs, the type of data conversion can be chosen by
a runtime parameter.
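As an indication of what such a wrapper involves, the following minimal Level-2 C MEX S-function skeleton uses the standard MathWorks S-function API (compiled as C or C++). The wrapper name is hypothetical, and the coupling to the encapsulated SystemC module is only indicated in comments; it is a sketch, not the base classes developed at the company.

/* Minimal sketch of a Level-2 C MEX S-function using the standard MathWorks API.
 * The SystemC coupling is only indicated in comments. */
#define S_FUNCTION_NAME  sc_tbm_wrapper   /* hypothetical wrapper name */
#define S_FUNCTION_LEVEL 2
#include "simstruc.h"

static void mdlInitializeSizes(SimStruct *S)
{
    ssSetNumSFcnParams(S, 1);   /* e.g. the data-conversion mode as a runtime parameter */
    if (ssGetNumSFcnParams(S) != ssGetSFcnParamsCount(S)) return;

    if (!ssSetNumInputPorts(S, 1)) return;
    ssSetInputPortWidth(S, 0, 1);
    ssSetInputPortDirectFeedThrough(S, 0, 1);

    if (!ssSetNumOutputPorts(S, 1)) return;
    ssSetOutputPortWidth(S, 0, 1);

    ssSetNumSampleTimes(S, 1);
}

static void mdlInitializeSampleTimes(SimStruct *S)
{
    /* Scenario 3 style: inherit the sample time from the driving blocks. */
    ssSetSampleTime(S, 0, INHERITED_SAMPLE_TIME);
    ssSetOffsetTime(S, 0, 0.0);
}

static void mdlOutputs(SimStruct *S, int_T tid)
{
    (void)tid;
    InputRealPtrsType u = ssGetInputPortRealSignalPtrs(S, 0);
    real_T           *y = ssGetOutputPortRealSignal(S, 0);
    /* A real wrapper would convert *u[0] to the SystemC data type, advance the
     * encapsulated SystemC simulation up to ssGetT(S), and convert the module's
     * outputs back; as a placeholder the input is simply passed through. */
    y[0] = *u[0];
}

static void mdlTerminate(SimStruct *S) { (void)S; }

#ifdef  MATLAB_MEX_FILE
#include "simulink.c"   /* MEX-file interface mechanism */
#else
#include "cg_sfun.h"    /* code generation registration */
#endif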
5.4.1 Synchronization
The first important step when implementing the co-simulation is to synchronize both
environments. This is a complex task, as Simulink and SystemC use different models of simulation. Simulink uses a so-called sample time, which defines the points
in time when a component of the model is to be updated. SystemC uses an event
queue, which is processed event by event. Since the SystemC module is integrated
into the Simulink environment as an S-function, it is necessary that the S-function is
updated each time an incoming or outgoing signal of the SystemC module changes.
There are several variants of how the environments can be synchronized. Altogether
13 different variants have been evaluated. They can be clustered as follows:
Fixed Sample Time Variants of this type assign the S-function a fixed sample time.
The sample time can be set to the time resolution of SystemC, resulting in a very
low performance. If a value larger than the time resolution is chosen, an error might
occur. In this case, the TBM will not be updated at the correct time. Obtaining the
adequate value for the sample time from the TBM is difficult. The knowledge of the
TBM’s developer is required to set up the correct sample time. As internal events
might occur randomly, for some modules even this is not sufficient. Their behavior
cannot be represented adequately using a fixed sample time.
Variable Sample Time These types of synchronization variants inform Simulink
when the next update of the S-function should occur. Therefore, the internal event
queue can be easily mapped to the sample time. The next update of the S-function
should occur at the time the next event is scheduled for. However, the time at which an incoming signal changes cannot be predicted. A maximum time interval between
two consecutive events can be specified. The S-function is updated either if an event
occurs, or if the maximum time interval is reached. This interval assures that the
incoming signals are read with an adequate rate. Finding the optimal value for the
maximum time can be a difficult task as no information about the incoming signals
is available.
Inherited In these variants, the S-function inherits its sample time from its driving
Simulink components. This way it can be assured that no signal changes were
missed. Internal events of the TBM that cause changes on outgoing signals cannot be handled correctly by these variants. The TBM cannot force an update
of the S-function if an event from the event queue occurs between two consecutive
updates of the S-function, determined by the inherited sample time.
Base Period It cannot be assumed that all driving Simulink components of a TBM
have the same sample time. Therefore, the obvious solution is to set up the sample time for each signal individually. In addition, it cannot be predicted which outgoing signal is affected by which incoming signal. Hence, the sample time of each outgoing signal must fit the sample time of all incoming signals. This sample time is the greatest common divisor (GCD) of all incoming signals' sample times. As the sample time
of each incoming signal might have an offset, the calculation of the GCD is more
complex. If the internal events of the TBM can be considered periodic, these periods
are taken into account during the calculation. In this case the TBM has a so-called
base period, which consists of the sample times of the incoming signals and the
internal periods of the module. This base period is assigned to each outgoing signal.
It is also possible to assign an offset to the base period. The offset corresponds to
the smallest offset of the inherited sample times. This might help to improve the
performance, as fewer values must be considered when calculating the GCD. The
following equations illustrate that.
α = {p, o_1, ..., o_{k-1}, o_k, o_{k+1}, ..., o_n}    (5.1)
β = {p, o_1 - o_k, ..., o_{k-1} - o_k, o_{k+1} - o_k, ..., o_n - o_k}    (5.2)
gcd(β) ≥ gcd(α)    (5.3)
Table 5.2 Synchronization methods

Variant                    Fix   Variable   Inherit   Base Period
performance                −     +          +         ◦
handle incoming signals    −     −          +         +
handle outgoing signals    ◦     +          −         +
handle internal period     +     +          −         +

+ = good, ◦ = neutral, − = bad
The set α contains all periods p_i and offsets o_i. In set β the smallest offset o_k of all periods is excluded and all offsets are shifted by that minimal offset. Based upon the rules applying to the GCD, it can be proven that the GCD of β is greater than or equal to the GCD of set α. The larger the GCD, the larger the sample time, and thus the fewer updates of the S-function are required.
Each of the evaluated variants has its own advantages and disadvantages.
Table 5.2 gives a short overview. No variant can handle all scenarios that might occur. Moreover, due to limitations of Simulink some combinations of sample times
are not possible. For example, the preferred variant, where the incoming signals
inherit their sample time and outgoing signals have a variable sample time, is not
supported. Therefore, the sample time of the S-function is chosen depending on the
kind of the TBM. Five scenarios can be identified:
Scenario 1 The TBM has no incoming signals. In this case only internal events
can occur. Therefore, the S-function gets a variable sample time assigned. The
S-function uses the event queue of the encapsulated SystemC module to predict
the time the next event occurs. This is the time the S-function needs an update.
Scenario 2 If Scenario 1 applies but a solver supporting a variable sample time
is not available, a fixed sample time is used. The sample rate is chosen by the developer. Since information about the internal behavior of the SystemC module is
required in order to determine an appropriate sample rate, this cannot be done automatically.
Scenario 3 The TBM has incoming signals but no internal period. Since there
are no internal periods, the TBM only reacts to incoming signals. Therefore,
the inherited sample time is chosen. In order to allow components that operate on
different sample times to be connected to the S-function, the sample time is inherited
for each port individually. Outgoing signals are updated, whenever an incoming
signal changes.
Scenario 4 The TBM has incoming signals and an internal period. In this case,
the incoming signals inherit their sample time. Based upon the sample times of the
incoming signals and the internal rates, the base period is calculated and assigned
to the outgoing signals.
Scenario 5 This scenario is a special case of Scenario 4. In this case the module
does have incoming signals and internal processes, but no outgoing signals. In addition to the sample times inherited from the incoming signals, an additional sample time
must be specified that covers the internal processes of the module. Due to restrictions of Simulink it is necessary to add an outgoing port to the S-function. This port
allows the specification of the additional sample time. It does not carry any data and
should be terminated within the Simulink model.
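The selection among these five cases can be summarized compactly; the small sketch below is an assumed helper for illustration only, not the actual wrapper code, and simply encodes the decision rules of Scenarios 1-5.

// Illustrative sketch (assumed helper): choosing the sample-time configuration
// of the S-function according to Scenarios 1-5.
enum class SampleTimeMode {
    Variable, Fixed, Inherited, BasePeriod, BasePeriodWithDummyPort
};

SampleTimeMode choose_mode(bool has_inputs, bool has_outputs,
                           bool has_internal_period, bool variable_step_solver) {
    if (!has_inputs)                               // Scenarios 1 and 2
        return variable_step_solver ? SampleTimeMode::Variable : SampleTimeMode::Fixed;
    if (!has_internal_period)                      // Scenario 3
        return SampleTimeMode::Inherited;
    return has_outputs ? SampleTimeMode::BasePeriod                 // Scenario 4
                       : SampleTimeMode::BasePeriodWithDummyPort;   // Scenario 5
}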
5.4.2 Data Type Conversion
The second step in implementing the co-simulation is data type conversion. Conversion is done implicitly by the wrapper. Different types of conversion are available. The conversion is independent of the modules and may change from model to model. Therefore, it is interchangeable by a runtime parameter, which can be set for each module individually. Since Simulink internally uses default C/C++ types, these can easily be mapped. The mapping of SystemC data types like bit vectors or fixed-point types is done in different ways. The easiest way is to convert them into the default Simulink type double. Logic and bit vectors can be treated as integer numbers, as long as the vectors are small enough. A more advanced conversion uses the Simulink Fixed Point extension [7], which also provides a sufficient API. This extension provides data types commonly used in chip development. Each conversion has built-in value checking, in order to assure that only valid values are received from the driving blocks. More conversions are conceivable and can be
easily implemented, due to the provided class structure.
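The sketch below indicates how such conversions might look using standard SystemC data-type methods (sc_lv::is_01(), sc_lv::to_uint(), sc_fixed::to_double()); the helper names and the value-checking policy are assumptions for illustration and do not reproduce the wrapper classes described above.

// Sketch with assumed helper names: mapping SystemC data types to the default
// Simulink type double, with simple value checking.
#define SC_INCLUDE_FX          // enable the SystemC fixed-point types
#include <systemc.h>
#include <cmath>
#include <stdexcept>

// Logic vectors are treated as unsigned integers, valid as long as the vector is
// small enough and carries no 'X'/'Z' bits.
template <int W>
double lv_to_double(const sc_dt::sc_lv<W>& v) {
    static_assert(W <= 32, "vector too wide to be mapped onto an integer");
    if (!v.is_01()) throw std::runtime_error("logic vector contains X/Z");
    return static_cast<double>(v.to_uint());
}

template <int W>
sc_dt::sc_lv<W> double_to_lv(double d) {
    if (d < 0.0 || d >= std::ldexp(1.0, W) || d != std::floor(d))
        throw std::runtime_error("value not representable in the target vector");
    sc_dt::sc_lv<W> out;
    out = static_cast<unsigned int>(d);   // integer assignment is supported by sc_lv
    return out;
}

// Fixed-point types already provide a double conversion.
template <int W, int I>
double fx_to_double(const sc_dt::sc_fixed<W, I>& f) { return f.to_double(); }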
5.5 Evaluation
Our approach has been evaluated in two steps. First, the correctness of the implementation was verified. Secondly, the extended flow was applied to an already existing benchmark, in order to prove the assumed benefits of the flow.
5.5.1 Implementation
Example systems have been modeled in order to verify the correctness of the implementation. Each example covers one of the scenarios described in the previous
section.
Scenario 1 The TBM generates a number of signal changes. The interval between two consecutive changes is a randomized value between 1 ns and 1 000 ns.
The time the next event is scheduled for is the time at which the S-function should be updated. During the next update, the TBM determines the difference between the current Simulink time and the expected SystemC time. More than 10 000 events have
Fig. 5.5 Evaluation of Scenario 1: (a) normal scaling; (b) logarithmic scaling
been simulated in a single simulation. The results are shown in Fig. 5.5a. The figure
shows that the simulation was not free from errors. The maximal difference amounts to 8.67 × 10⁻¹⁹ ns and thus is ten orders of magnitude smaller than the smallest distance between two consecutive signal changes. The errors have a double-logarithmic
behavior as shown in Fig. 5.5b. This leads to the conclusion that the error results
from the calculation errors, which occur when comparing the Simulink and the SystemC time. The calculation error is caused by limitations of the data type double
used for representing time in both environments.
Scenario 2 This scenario is similar to Scenario 1. Since no variable step solver
is available, a fixed sample time is used. The sample time is chosen in a way that
the smallest distance between two consecutive events can be handled appropriately.
Thus an oversampling is possible, which has a negative influence on the performance of the simulation. The occurring error is similar to the error that had occurred
in Scenario 1. This error is also based on the aforementioned calculation errors.
Scenario 3 This model consists of a TBM directly driven by two other components,
generating stimuli. Both components contain two signals each and both perform the
same operation. The value of the first signal set by the component corresponds to
the Simulink time the signal was set. The second signal contains a simple counter.
The counter is incremented each time the first signal is changed. This allows missed signal changes to be counted. The two components operate on different sample times. The TBM should be updated each time one of the incoming
signals changes. On an update, the TBM calculates the difference between the value
of the time signal that had changed and the current time in Simulink. As the update
should occur at the same time the signal has been written, these values should be
equal. The counter should be incremented by one since the last update. As expected,
no errors occurred and no signal changes were missed at all.
Scenario 4 The TBM combines the tests from Scenarios 1 and 3. The errors caused
by the incoming signals and the internal period are logged separately. The evaluation
of the results shows that no errors occur and no signal changes are missed when
synchronizing the incoming signals. Synchronizing the internal period of the TBM
leads to errors similar to the errors shown in Fig. 5.5.
Scenario 5 The calculated base period equals exactly the one from Scenario 4. As
in this scenario the TBM has no outgoing ports, the base period is assigned to the
dummy port, which has been added. As expected, the occurring errors match exactly
the errors that had occurred in Scenario 4.
To sum up, it has been shown that the errors measured in Scenarios 1, 3 and 5
are not caused by the implementation. They are induced by the characteristics of the
floating-point representation. All scenarios have shown the expected results. Thus,
the implementation can be considered correct.
5.5.2 Extended Verification Flow
In order to evaluate the extended verification flow, an example Simulink model has
been implemented. To achieve impartiality, an existing model [4] has been reimplemented. This model implements an overhead crane mounted on a track. The crane
carries a load that is connected by means of a free running cable. The whole system
is shown in Fig. 5.6. The crane contains sensors, a diagnosis unit and a control unit. The control unit is the design to be verified. Sensor values, the behavior of the load
and the positions of car and load are modeled as differential equations using default
Simulink components. TBMs implement the job control, the external force fd as
well as the possibility to overwrite the value of the α-sensor.
The test bench was implemented twice: First using pure Simulink components
as it would have been done using a conventional flow. The test cases are implemented as simple tables, containing time/value-tuples. Verification is done by manually comparing the waveform with the implicitly expected behavior. Secondly, the
TBMs have been rewritten as test bench modules using SystemC according to the extended test bench concept. Accordingly, the TCs are realized as the aforementioned
command files. Additionally the second design has been enhanced by an extended
TBM that verifies the behavior. Thus, the TBM allows the automatic verification of
the model, superseding a manual verification.
Using SystemC within a Simulink model influences the speed of the simulation. We chose the time simulated divided by the time required for the simulation
Fig. 5.6 Crane with load
Table 5.3 Comparing both verification flows in respect of man hours

Task                                   Conventional   Extended
Implementing the test bench modules    1.5 h          2.5 h
Implementing the test case             1.0 h          0.5 h
Porting the test bench modules         2.0 h          0.5 h
Porting the test case                  1.5 h          0.0 h
Sum                                    6.0 h          3.5 h
as a metric of performance. Using SystemC-based TBMs slows down the simulation speed by a factor of about 2.33 for this benchmark. However, the factor depends on the number of SystemC-based TBMs and thus cannot be generalized. Moreover, not only the time required for simulation must be considered.
Table 5.3 compares the times required for implementing the example using the
conventional and the extended verification flow respectively. Writing the TBMs as
extended test bench modules takes longer than implementing them using Simulink
components. Test cases can be written in much shorter time using the extended
flow, as they can be described in a more natural way than using tables. While the
conventional flow requires TBMs and TCs to be ported when switching from system
to RT-level, the extended flow reuses the already existing TBMs and TCs. For the
presented example, using the extended flow saves about 41% of the time compared
to the time required when using the conventional flow. The time for creating and
porting the system model is not taken into account. It will decrease the percentage
of the saved time, depending on the complexity of the model.
For this example, only simple TBMs and TCs are necessary. If a more complex
behavior of the TBMs is required, SystemC will have an advantage over Simulink.
Complex behavior can be implemented in a simpler way using SystemC. Thus, implementing complex TBMs using SystemC will take less time. Aside from the saved
amount of time, the usage of the extended verification flow has provided other advantages: (i) due to the reuse of TBMs and TCs, there was no need to prove the
equivalence of the TBMs and TCs that had been ported and (ii) verification is done
automatically rather than by inspecting the waveform manually.
5.6 Conclusion
In this chapter, the weaknesses of a conventional verification flow have been pointed
out. An extended test bench concept has been presented. This concept provides
reusable TBMs and TCs, as well as an automatic verification of the system. It has
also been presented how this concept has been applied to Simulink models. It has been shown how the resulting extended verification flow has been implemented. The implementation of the resulting verification flow addressed the required synchronization of both simulation environments as well as data type conversion. Models covering each scenario mentioned have been used to evaluate the implementation. The results of the evaluation show that the implementation is correct.
The whole verification flow has been evaluated using an example. It has been
shown that the usage of the extended verification flow saves a significant amount
of time during the development process. Reusing test bench modules and test cases
preserves consistency of the test bench and thus reduces the likelihood of errors.
Verification is done automatically rather than by inspecting the waveform manually.
Future development will have a focus on verification of analog components.
These had not been considered within this work and the presented verification flow.
This will be addressed by the integration of AMS-capable tools into the flow.
References
1. Advanced Verification Methodology. http://www.mentor.com.
2. J.-F. Boland, C. Thibeault, and Z. Zilic. Using MATLAB and Simulink in a SystemC verification environment. In Proceedings of Design and Verification Conference, DVCon, February
2005.
3. R. Lissel and J. Gerlach. Introducing new verification methods into a company’s design flow:
an industrial user’s point of view. In Design, Automation & Test in Europe Conference &
Exhibition, DATE ’07, pages 1–6, April 2007.
4. E. Moser and W. Nebel. Case study: system model of crane and embedded control. In Proceedings of the Conference on Design, Automation and Test in Europe, page 721, 1999.
5. Open Verification Methodology. http://www.ovmworld.org/.
6. Simulink Design Verifier. http://www.mathworks.com/products/sldesignverifier/.
7. Simulink Fixed Point. http://www.mathworks.com/products/simfixed/.
8. Simulink Real-time Workshop. http://www.mathworks.com/products/rtw/.
9. Simulink Verification and Validation. http://www.mathworks.com/products/simverification/.
10. SystemC Library. http://www.systemc.org.
11. Universal Reuse Methodology. http://www.cadence.com.
12. Verification Methodology Manual for SystemVerilog. http://www.synopsys.com.
Part II
Languages for Heterogeneous System
Design
Chapter 6
VHDL–AMS Implementation of a Numerical
Ballistic CNT Model
Dafeng Zhou, Tom J. Kazmierski and
Bashir M. Al-Hashimi
Abstract This contribution presents a VHDL–AMS implementation of a novel numerical carbon nanotube transistor (CNT) modeling approach which relies on a
flexible and efficient cubic spline non-linear approximation of the non-equilibrium
mobile charge density. The underlying algorithm provides a rapid and accurate solution of the numerical relationship between the charge density and the self-consistent
voltage. This leads to a speed-up in the calculation of the current through the channel by about two orders of magnitude without losing much accuracy. The numerical approximation is accurate within less than 1.5% of the normalized RMS error
compared with a previously reported theoretical modeling approach. The proposed
VHDL–AMS implementation has been used in simulations of a logic inverter in
SystemVision to demonstrate the feasibility of applying the spline-based technique
in the development of efficient and accurate CNT models for applications in circuit-level simulators.
Keywords VHDL–AMS · Ballistic transport · CNT model · Circuit level
simulation
6.1 Introduction
Transistors using carbon nanotubes are expected to become the basis of next generation integrated circuits [1, 11]. These expectations are motivated by the growing difficulties in overcoming physical limits of silicon-based transistors fabricated
using current technologies. A number of theoretical models have been created to
describe the interplay between different physical effects within the nanotube channel and their effect on the performance of the device [3, 5–7, 10, 13]. The standard
methodology of modeling carbon nanotube transistors (CNTs) is to derive the channel current from the non-equilibrium mobile charge injected into the channel when
voltages are applied to the transistor terminals [11]. However, a common problem
these models are facing is the complexity of calculating the Fermi–Dirac integral
T.J. Kazmierski
School of Electronics and Computer Science, University of Southampton, Southampton,
SO17 1BJ, UK
e-mail: tjk@ecs.soton.ac.uk
and non-linear algebraic equations which express the relationships between charge
densities and the current. Moreover, the channel current between the source and
drain is affected not only by the non-equilibrium mobile charge in the nanotube
but also by the charges present at terminal capacitances, thus adding to the complexity of the current calculation, which requires a time-consuming iterative approach.
Recently, the standard theoretical methodology has been improved by approaches
where the slow Newton–Raphson iterations and the numerical evaluation of the
Fermi–Dirac integral are replaced by numerical approximations while still maintaining good accuracy compared with the theoretical models. These new techniques suggest piece-wise approximation of charge densities, either linear [6] or non-linear [8], to
simplify the numerical calculation. However, while both these approaches accelerate current calculations significantly, they are not flexible enough to allow the user
to control the trade-offs between the modeling accuracy and implementation speed.
Here we generalize our earlier piece-wise non-linear approach [8] and propose a
cubic spline piece-wise approximation of the non-equilibrium mobile charge density, and develop a very accurate technique where an accuracy better than 1.5% in
terms of average RMS error can be achieved with just a 5-piece spline, which compares favorably with the 5% obtained by the simple non-linear approximation [8].
The spline-based approach still achieves a speed-up of around two orders of magnitude compared with a reported implementation of the theoretical model [12] and
allows an easy trade-off between accuracy and speed. The spline approximation is
not only capable of describing the performance of ideal ballistic CNT models, but is also extendable with non-ballistic effects. The model has been implemented and tested
in MATLAB and VHDL–AMS. As an example, we show how our VHDL–AMS
model can be used to simulate a CMOS-like inverter made of two complementary
CNTs. This illustrates the feasibility of using this novel model in circuit-level simulators for future logic circuit analysis.
6.2 Mobile Charge Density and Self-Consistent Voltage
When an electric field is applied between the drain and the source of a CNT, a non-equilibrium mobile charge is generated in the carbon nanotube channel. It can be described as follows [9, 11, 15]:

Q = q(N_S + N_D - N_0)    (6.1)

where NS is the density of positive velocity states filled by the source, ND is the density of negative velocity states filled by the drain and N0 is the equilibrium electron density. These densities are determined by the Fermi–Dirac probability distribution:
N_S = \frac{1}{2}\int_{-\infty}^{+\infty} D(E)\, f(E - U_{SF})\, dE    (6.2)

N_D = \frac{1}{2}\int_{-\infty}^{+\infty} D(E)\, f(E - U_{DF})\, dE    (6.3)

N_0 = \int_{-\infty}^{+\infty} D(E)\, f(E - E_F)\, dE    (6.4)
where D(E) is the density of states, f is the Fermi probability distribution, E represents the energy levels per nanotube unit length, and USF and UDF are defined
as
U_{SF} = E_F - qV_{SC}    (6.5)

U_{DF} = E_F - qV_{SC} - qV_{DS}    (6.6)
where EF is the Fermi level, q is the electronic charge and VSC represents the self-consistent voltage [11] whose presence in these equations illustrates that the CNT energy band is affected by the external terminal voltages. The self-consistent voltage VSC is determined by the terminal voltages and charges at terminal capacitances by the following non-linear algebraic equation [6, 11]:

V_{SC} = -\frac{Q_t + qN_S(V_{SC}) + qN_D(V_{SC}) + qN_0}{C}    (6.7)
where Qt represents the charge stored in terminal capacitances and is defined as
Q_t = V_G C_G + V_D C_D + V_S C_S    (6.8)
where CG , CD , CS are the gate, drain, and source capacitances respectively and the
total terminal capacitance C can be derived by
C = C_G + C_D + C_S    (6.9)
6.3 Numerical Piece-Wise Approximation of the Charge Density
The standard approach to the solution of Eq. (6.7) is to use the Newton–Raphson
iterative method and in each iteration evaluate the integrals in Eqs. (6.2) and (6.3)
to obtain the state densities ND and NS . This approach has been proved effective in
CNT transistor modeling [6, 12]. However, the iterative computation and repeated
integrations consume immense CPU resources and thus are unsuitable for circuit
simulation.
Our earlier work [8] proposed a piece-wise non-linear approximation technique
that eliminates the need for these complex calculations. It suggested calculating the
charge densities and self-consistent voltage by dividing the continuous density function into a number of linear and non-linear pieces which together compose a fitting
approximation of the original charge density curve. Then the VSC equation (6.7) is
simplified to a group of linear, quadratic and cubic equations, which can be solved
easily and quickly.
Although this approach has been shown to be efficient and accurate [8], its weakness is that it requires an optimal fitting process when deciding on the number of
approximation pieces and the intervals of the ranges, which makes the model inflexible
and awkward to use. Here we propose to use a cubic spline piece-wise approximation to overcome these difficulties.
For a set of n (n ≥ 3) discrete points (x0 , y0 ), (x1 , y1 ), . . . , (xi+1 , yi+1 ) (i = 0, 1,
. . . , n − 2), cubic splines can be constructed as follows [2]:
y = A y_i + B y_{i+1} + C \ddot{y}_i + D \ddot{y}_{i+1}    (6.10)
where A, B, C and D are the coefficients for each piece of the cubic spline. For simplicity of demonstration, the horizontal interval between every two neighboring points is assumed equal to h, so that x_1 - x_0 = x_2 - x_1 = \cdots = x_{i+1} - x_i = h.
Therefore, the cubic spline coefficients can be expressed as functions of x:
A \equiv \frac{x_{i+1} - x}{x_{i+1} - x_i} = \frac{x_{i+1} - x}{h}    (6.11)

B \equiv 1 - A = \frac{x - x_i}{x_{i+1} - x_i} = \frac{x - x_i}{h}    (6.12)

C \equiv \frac{1}{6}\left(A^3 - A\right)(x_{i+1} - x_i)^2    (6.13)

D \equiv \frac{1}{6}\left(B^3 - B\right)(x_{i+1} - x_i)^2    (6.14)
These equations show that A and B are linearly dependent on x, while C and D
are cubic functions of x. To derive the y(x) expression, the second-order derivatives of y have to be computed via a tridiagonal system:
\begin{bmatrix}
1 & 4 & 1 & & & \\
 & 1 & 4 & 1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1 & 4 & 1
\end{bmatrix}
\begin{bmatrix} \ddot{y}_0 \\ \ddot{y}_1 \\ \vdots \\ \ddot{y}_{n-1} \end{bmatrix}
= \frac{6}{h^2}
\begin{bmatrix} y_2 - 2y_1 + y_0 \\ y_3 - 2y_2 + y_1 \\ \vdots \\ y_{n-1} - 2y_{n-2} + y_{n-3} \end{bmatrix}    (6.15)
Now that the cubic spline coefficients and the second derivatives have been obtained, the function of each spline can be derived, with the coefficients a_i, b_i, c_i and d_i calculated by using Eqs. (6.11)-(6.15):
y_i = a_i x^3 + b_i x^2 + c_i x + d_i    (6.16)
The two linear regions that extend the cubic splines on both sides can be described as follows:
y = y_n \quad (x > x_n)    (6.17)

y = a_l x + b_l \quad (x < x_0)    (6.18)

where a_l = \dot{y}_0 = 3a_0 x_0^2 + 2b_0 x_0 + c_0 and b_l = y_0 - a_l x_0. To demonstrate the performance of this approach, we have compared the speed and accuracy of an example
model with results of other reported approaches.
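For illustration, a minimal standalone sketch of the spline construction is given below; it is an assumption about one possible implementation, not the authors' MATLAB or VHDL-AMS code. The tridiagonal system of Eq. (6.15) is solved with the Thomas algorithm for a natural spline on equally spaced samples, and Eq. (6.10) is then evaluated piece-wise.

// Sketch (assumed implementation): natural cubic spline through n >= 3 samples
// y[0..n-1] taken at x0, x0 + h, x0 + 2h, ...
#include <cstddef>
#include <vector>

struct Spline {
    double x0, h;
    std::vector<double> y, ypp;   // sample values and second derivatives
};

Spline build_spline(double x0, double h, const std::vector<double>& y) {
    const std::size_t n = y.size();
    std::vector<double> ypp(n, 0.0);           // natural spline: ypp[0] = ypp[n-1] = 0
    std::vector<double> c(n, 0.0), d(n, 0.0);  // Thomas algorithm sweep coefficients
    for (std::size_t i = 1; i + 1 < n; ++i) {  // interior rows (1 4 1) of Eq. (6.15)
        const double rhs = 6.0 / (h * h) * (y[i + 1] - 2.0 * y[i] + y[i - 1]);
        const double m = 4.0 - c[i - 1];
        c[i] = 1.0 / m;
        d[i] = (rhs - d[i - 1]) / m;
    }
    for (std::size_t i = n - 2; i >= 1; --i)   // back substitution
        ypp[i] = d[i] - c[i] * ypp[i + 1];
    return {x0, h, y, ypp};
}

// Piece-wise evaluation according to Eq. (6.10); the linear/constant
// extensions of Eqs. (6.17)-(6.18) are omitted for brevity.
double eval_spline(const Spline& s, double x) {
    const std::size_t n = s.y.size();
    std::size_t i = (x <= s.x0) ? 0 : static_cast<std::size_t>((x - s.x0) / s.h);
    if (i > n - 2) i = n - 2;
    const double A = (s.x0 + (i + 1) * s.h - x) / s.h;
    const double B = 1.0 - A;
    const double C = (A * A * A - A) * s.h * s.h / 6.0;
    const double D = (B * B * B - B) * s.h * s.h / 6.0;
    return A * s.y[i] + B * s.y[i + 1] + C * s.ypp[i] + D * s.ypp[i + 1];
}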
Fig. 6.1 Piece-wise cubic spline approximation with n = 4 (circlet line) of mobile charge compared with the theoretical result (solid line)
6.4 Performance of Numerical Approximations
An example model which uses three cubic splines, n = 4, and two linear pieces at
the ends was compared with the theoretical curves calculated from Eqs. (6.2) and
(6.3) correspondingly.
To solve the resulting 3rd order polynomial equations, Cardano’s method [4] is
applied to determine the appropriate root which represents the correct value of VSC .
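A hedged sketch of this step is shown below: a generic real-root cubic solver in the spirit of Cardano's method, with the selection of the root lying within the current spline segment left to the caller. It is an illustration under these assumptions, not the authors' implementation.

// Sketch (assumed implementation): real roots of a*x^3 + b*x^2 + c*x + d = 0 (a != 0).
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> cardano_real_roots(double a, double b, double c, double d) {
    const double pi = std::acos(-1.0);
    // Depressed cubic t^3 + p*t + q = 0 with x = t - b/(3a).
    const double p = (3.0 * a * c - b * b) / (3.0 * a * a);
    const double q = (2.0 * b * b * b - 9.0 * a * b * c + 27.0 * a * a * d)
                     / (27.0 * a * a * a);
    const double shift = -b / (3.0 * a);
    const double disc = q * q / 4.0 + p * p * p / 27.0;
    std::vector<double> roots;
    if (disc > 0.0) {                       // one real root
        const double s = std::sqrt(disc);
        roots.push_back(shift + std::cbrt(-q / 2.0 + s) + std::cbrt(-q / 2.0 - s));
    } else if (p == 0.0) {                  // triple root (p = q = 0 when disc <= 0)
        roots.assign(3, shift);
    } else {                                // three real roots (trigonometric form)
        const double r = std::sqrt(-p * p * p / 27.0);
        const double phi = std::acos(std::clamp(-q / (2.0 * r), -1.0, 1.0));
        for (int k = 0; k < 3; ++k)
            roots.push_back(shift + 2.0 * std::sqrt(-p / 3.0)
                                   * std::cos((phi + 2.0 * pi * k) / 3.0));
    }
    return roots;
}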
According to the ballistic CNT transport theory [11, 12] the drain current caused
by the transport of the non-equilibrium charge across the nanotube can be calculated
using the Fermi–Dirac statistics as follows:
I_{DS} = \frac{2qkT}{\pi\hbar}\left[ F_0\!\left(\frac{U_{SF}}{kT}\right) - F_0\!\left(\frac{U_{DF}}{kT}\right) \right]    (6.19)
where F0 represents the Fermi–Dirac integral of order 0, k is Boltzmann’s constant,
T is the temperature and ℏ is the reduced Planck constant.
Since the self-consistent voltage VSC is directly obtained from the spline model,
the evaluation of the drain current poses no numerical difficulty as energy levels
USF , UDF can be found quickly from Eqs. (6.5), (6.6) and IDS can be calculated
using:
I_{DS} = \frac{2qkT}{\pi\hbar}\left[ \log\!\left(1 + e^{(E_F - qV_{SC})/kT}\right) - \log\!\left(1 + e^{(E_F - qV_{SC} - qV_{DS})/kT}\right) \right]    (6.20)
These calculations are direct and therefore considerably fast, as there are no
Newton–Raphson iterations or integrations of the Fermi–Dirac probability distribution.
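The sketch below shows how directly IDS follows from Eq. (6.20) once VSC is known; the physical constants and the unit convention (EF supplied in joules, e.g. -0.32 eV multiplied by q) are assumptions for illustration only.

// Sketch: drain current of the ballistic CNT model according to Eq. (6.20).
#include <cmath>

double cnt_ids(double vsc, double vds, double ef, double temperature) {
    const double q    = 1.602176634e-19;   // electron charge [C]
    const double kB   = 1.380649e-23;      // Boltzmann constant [J/K]
    const double hbar = 1.054571817e-34;   // reduced Planck constant [J s]
    const double pi   = std::acos(-1.0);
    const double kT   = kB * temperature;  // thermal energy [J]
    const double usf  = ef - q * vsc;              // Eq. (6.5)
    const double udf  = ef - q * vsc - q * vds;    // Eq. (6.6)
    const double f0s  = std::log1p(std::exp(usf / kT));   // F0(x) = ln(1 + e^x)
    const double f0d  = std::log1p(std::exp(udf / kT));
    return (2.0 * q * kT) / (pi * hbar) * (f0s - f0d);
}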
Table 6.1 Average CPU time comparison between different models

Loops   FETToy      3-piece PWNL Model   4-piece PWNL Model   CS Model n = 4   CS Model n = 5
5       64.43 s     0.02 s               0.06 s               0.57 s           0.95 s
10      128.78 s    0.04 s               0.12 s               1.15 s           1.91 s
50      642.44 s    0.19 s               0.56 s               5.82 s           9.59 s
100     1287.45 s   0.38 s               1.12 s               11.69 s          19.33 s
Table 6.2 Average RMS errors in piece-wise and cubic spline approximations for 1 nm nanotube at EF = −0.32 eV and T = 300 K

VG [V]   3-piece PWNL Model   4-piece PWNL Model   CS Model n = 4   CS Model n = 5
0.1      4.4%                 2.0%                 1.3%             0.9%
0.2      3.6%                 1.7%                 1.0%             0.8%
0.3      2.7%                 1.4%                 0.8%             0.6%
0.4      1.9%                 1.0%                 0.6%             0.5%
0.5      1.6%                 1.2%                 0.9%             0.7%
0.6      2.2%                 1.6%                 1.1%             1.0%
For performance comparison, we have also tried a 4-piece cubic spline approximation (with n = 5), which is expected to be more accurate but slower than the first model. Table 6.1 shows the average CPU times for both models and those from FETToy [12] and previously reported piece-wise models [8], while Table 6.2 compares the accuracy of both numerical model types. It can be seen from Tables 6.1 and 6.2 that, although the spline models sacrifice some speed compared with the simple piece-wise non-linear models, they are still more than two orders of magnitude faster than FETToy. They also achieve a much better accuracy than the simple piece-wise non-linear models. The extent to which the modeling accuracy was compromised by the numerical approximation was measured by calculating the average RMS errors in the simulations; the results are shown in Table 6.2. As expected, the spline models are more accurate, with errors not exceeding 1.0% at T = 300 K and EF = −0.32 eV throughout the typical ranges of drain voltage VDS and gate bias VG. Figure 6.2 shows the IDS characteristics calculated by FETToy compared with the 3-piece spline model.
The performance of this approach can be affected by the values of EF , T , d and
terminal voltages. The choice of the number of cubic spline approximation pieces is
an obvious trade-off between speed and accuracy as slightly more operations need
to be performed with more pieces while the shape of the mobile charge curve is
reflected more accurately.
Fig. 6.2 Drain current characteristics at T = 300 K and EF = −0.32 eV for FETToy (solid lines)
and a 3-piece cubic spline approximation (circlet lines)
Fig. 6.3 Schematics of the
simulated inverter
6.5 VHDL–AMS Implementation
The proposed approach has been used to implement both n-type-like and p-type-like
CNT transistor models in VHDL–AMS and to simulate a CMOS-like inverter shown
in Fig. 6.3. The bulk voltage was also considered to take into account the effects on
the charge densities generated by the substrate voltage. This is especially important
for the p-type-like transistor. Figure 6.4 shows IDS characteristics of the n-type-
Fig. 6.4 VHDL–AMS simulation results on drain current characteristics at T = 300 K and
EF = −0.32 eV for a 3-piece cubic spline model
Fig. 6.5 Inverter simulation result; input ramps from 0 V to 0.6 V
like transistor implemented in VHDL–AMS which match closely the MATLAB
calculations shown in Fig. 6.2.
The VHDL–AMS test-bench for the inverter invokes the two transistors as well as
a ramp voltage source and a constant voltage source. The constant source provides the supply voltage VCC for the gate, while the ramp source is used to produce the output characteristic of the inverter. The simulation result is shown in Fig. 6.5. Since the transport characteristics of the two transistors are not the same, the inverter output is not symmetrical about VCC/2; the n-type-like transistor is the stronger device.
The VHDL–AMS code of the transistor top model is shown below.
-- VHDL-AMS model of CNT Transistor I-V Characteristics
-- using cubic spline approximation of S/D charge densities
-- (c) Southampton University 2008
-- Southampton VHDL-AMS Validation Suite
-- Author: Dafeng Zhou, Tom Kazmierski and Bashir M Al-Hashimi
--         School of Electronics and Computer Science
--         University of Southampton
--         Highfield, Southampton SO17 1BJ, United Kingdom
--         Tel. +44 2380 593520  Fax +44 2380 592901
--         e-mail: dz05r@ecs.soton.ac.uk, tjk@ecs.soton.ac.uk
-- Created: 17 October 2007
-- Last revised: November 2008 (by Dafeng Zhou)
------------------------------------------------------
-- Description:
-- This is a fast numerical model of ballistic transport
-- in carbon nanotube transistors. The default value of the
-- Ef_i parameter (Fermi level) produces n-type-like behavior;
-- a p-type-like transistor can be obtained by modifying the
-- Fermi level. Package cntcurrent provides the spline data and
-- the body of function Fcnt which calculates current Ids
-- from the splines.
------------------------------------------------------
-- VHDL-AMS Model of Ballistic CNT Transistor
library IEEE;
use IEEE.math_real.all;
use IEEE.electrical_systems.all;
library work;
use work.cntcurrent.all;
use work.SolveVscEquation_pack.all;
use work.coeff_pack.all;

entity CNTTransistor is
  generic ( -- model parameters
    T     : real := 300.0;
    dcnt  : real := 1.0E-9;
    Ef_i  : real := -0.32*1.6E-19;
    xmax  : real := -0.2;
    xmin  : real := -0.5;
    n     : integer := 4);
  port (terminal drain, gate, source, bulk : electrical);
end entity CNTTransistor;

architecture Characteristic of CNTTransistor is
  -- terminal values
  quantity Vdi across drain to bulk;
  quantity Vgi across gate to bulk;
  quantity Vsi across source to bulk;
  quantity Ids through drain to source;
begin
  Ids == Fcnt(Vgi, Vsi, Vdi, Ef_i, T, dcnt, xmax, xmin, n);
end architecture Characteristic;
The coefficients for the cubic spline approximation pieces are derived using a MATLAB script which generates the text of the VHDL–AMS package coeff_pack. The generated package is included in the simulation.
Combining Eqs. (6.7), (6.16), (6.17) and (6.18), a series of continuous linear and 3rd order polynomial equations for the self-consistent voltage is derived from the following relations:

$$N_D(V_{SC}) = N_S(V_{SC} - V_{DS}) \qquad (6.21)$$

$$V_{SC} = -\bigl[Q_t + q\bigl(a_i V_{SC}^3 + b_i V_{SC}^2 + c_i V_{SC} + d_i\bigr) + q\bigl(a_j (V_{SC}-V_{DS})^3 + b_j (V_{SC}-V_{DS})^2 + c_j (V_{SC}-V_{DS}) + d_j\bigr) - qN_0\bigr]\big/C \qquad (6.22)$$
From Eq. (6.21), N_D(V_SC) can be treated as a shift of N_S(V_SC) along the x-axis, the discrepancy between them being V_DS. It can be noticed from Eq. (6.22) that, when all the parameters are fixed, the value of V_SC is determined only by V_DS and the spline coefficients. For a given V_DS, the sum of N_S(V_SC) and N_D(V_SC) can be expressed as $q(a_i V_{SC}^3 + b_i V_{SC}^2 + c_i V_{SC} + d_i) + q[a_j (V_{SC}-V_{DS})^3 + b_j (V_{SC}-V_{DS})^2 + c_j (V_{SC}-V_{DS}) + d_j]$, which consists of several regions depending on the values of i and j, represented as QsRange and QdRange in an inner function respectively. QsRange and QdRange only change when V_DS shifts from one spline piece to another, and in total there are 2n + 1 regions for the expression. Below are the combination coefficients of the different QsRange and QdRange values due to the shifting of V_DS.
The package listed below contains the code of the function Fcnt, which solves the spline approximation of the VSC equation (Eq. (6.22)) and evaluates the drain current.
-- Package cntcurrent
library IEEE;
use IEEE.math_real.all;
use IEEE.electrical_systems.all;
library work;
use work.SolveVscEquation_pack.all;
use work.FindQRange_pack.all;
use work.coeff_pack.all;

package cntcurrent is
  function Fcnt (Vgi, Vsi, Vdi, Ef_i, T, dcnt, xmax, xmin : real; n : integer)
    return real;
end package cntcurrent;

package body cntcurrent is
  -- some physical constants:
  constant e0   : real := 8.854E-12;
  constant pi   : real := 3.1415926;
  constant t0   : real := 1.5E-9;
  constant L    : real := 3.0E-8;
  constant q    : real := 1.6E-19;
  constant k    : real := 3.9;
  constant acc  : real := 1.42E-10;
  constant Vcc  : real := 3.0*1.6E-19;
  constant h    : real := 6.625E-34;
  constant hbar : real := 1.05E-34;
  constant KB   : real := 1.380E-23;

  function Fcnt (Vgi, Vsi, Vdi, Ef_i, T, dcnt, xmax, xmin : real; n : integer) return real is
    variable EF, Vd, Vg, Vs, Vds, Ids, Ef_t, Efi, N0, c,
             Cox, Cge, Cse, Cde, Ctot, qC, qCN0, int, Vsc : real;
    variable yy : real_vector(0 to 1);
    variable QsRange, QdRange : integer;
  begin
    Cox  := 2.0*pi*k*e0/log((t0 + dcnt/2.0)*2.0/dcnt);
    Cge  := Cox;
    Cse  := 0.097*Cox;
    Cde  := 0.040*Cox;
    Ctot := Cge + Cse + Cde;
    Efi  := Ef_i;
    if Efi > 0.0 then
      Ef_t := -Efi; Vd := -Vdi; Vs := -Vsi; Vg := -Vgi;
    else
      Ef_t := Efi; Vd := Vdi; Vs := Vsi; Vg := Vgi;
    end if;
    EF := Ef_t/q; Vd := Vdi; Vs := Vsi; Vg := Vgi;
    N0 := 1.1431E3;
    c  := -q*(Vg*Cge + Vs*Cse + Vd*Cde)/Ctot;
    Vds := Vd - Vs;
    qC := q*q/Ctot;
    qCN0 := qC*N0;
    int := (xmax - xmin)/real(n-1);
    -- Find the ranges of the Qs and Qd approximations
    -- where the solution of Vsc is located
    yy := FindQRange(Vds, q, c, qC, qCN0, xmax, xmin, int, n);
    QsRange := integer(yy(0));
    QdRange := integer(yy(1));
    -- Calculate Vsc using Cardano's method from the 3rd order
    -- polynomial and linear equations
    Vsc := SolveVscEquation(Vds, q, c, qC, qCN0, QsRange, QdRange);
    -- Obtain the drain/source current
    if Efi <= 0.0 then
      Ids := 4.0*q*KB*T/h*(log(1.0 + exp(q*(EF - Vsc)/KB/T))
             - log(1.0 + exp(q*(EF - Vsc - Vds)/KB/T)));
    elsif Efi > 0.0 then
      Ids := -4.0*q*KB*T/h*(log(1.0 + exp(q*(EF - Vsc)/KB/T))
             - log(1.0 + exp(q*(EF - Vsc - Vds)/KB/T)));
    else
      Ids := 0.0;
    end if;
    return Ids;
  end function Fcnt;
end package body cntcurrent;
6.6 Conclusion
This contribution proposes, and investigates the numerical performance of, cubic splines for the calculation of the CNT ballistic transport current, with the aim of providing a practical and numerically efficient model for implementation in SPICE-like circuit simulators. The cubic spline approximation is more flexible and easier to use than the earlier piece-wise models [6, 8], and the presented results further reinforce the suggestion that numerical integrations and internal Newton–Raphson iterations can be avoided in the calculation of the self-consistent voltage in the CNT. The cubic spline parameters assure the continuity of the first derivative everywhere and were optimized for fitting accuracy. When compared with FETToy [12], a reference theoretical CNT model, we have demonstrated that the proposed approximation approach, although marginally slower than our earlier models, still leads to a computational cost saving of more than two orders of magnitude while increasing the modeling accuracy. To verify the feasibility of the proposed model, VHDL–AMS implementations for both n-type-like and p-type-like transistors were derived and used to calculate their IDS characteristics as well as the output characteristic of a simple logic inverter, using the SystemVision simulator from Mentor Graphics. The results matched closely those from MATLAB simulations. The new VHDL–AMS model is now available on the Southampton VHDL–AMS Validation Suite website [14] for public use.
Acknowledgements The authors would like to acknowledge the support of EPSRC/UK for
funding this project in part under grant EP/E035965/1.
References
1. P. Avouris, J. Appenzeller, R. Martel, and S.J. Wind. Carbon nanotube electronics. Proceedings of the IEEE, 91(11):1772–1784, 2003.
2. R. Bulirsch and J. Stoer. Introduction to Numerical Analysis, 2nd edition. Springer, Berlin,
1996.
3. T. Dang, L. Anghel, and R. Leveugle. CNTFET basics and simulation. In IEEE International
Conference on Design and Test of Integrated Systems in Nanoscale Technology (DTIS), Tunis,
Tunisia, 5–7 September 2006.
4. U.K. Deiters. Calculation of densities from cubic equations of state. AIChE Journal,
48(4):882–886, 2002.
5. C. Dwyer, M. Cheung, and D.J. Sorin. Semi-empirical SPICE models for carbon nanotube
FET logic. In 4th IEEE Conference on Nanotechnology, Munich, Germany, 16–19 August
2004.
6. H. Hashempour and F. Lombardi. An efficient and symbolic model for charge densities in
ballistic carbon nanotube FETs. In IEEE-NANO 2006, Sixth IEEE Conference on Nanotechnology, volume 1, pages 23–26.
7. A. Hazeghi, T. Krishnamohan, and H.-S.P. Wong. Schottky-barrier carbon nanotube field-effect transistor modeling. IEEE Transactions on Electron Devices, 54:439–445, 2007.
8. T.J. Kazmierski, D. Zhou, and B.M. Al-Hashimi. Efficient circuit-level modeling of ballistic
CNT using piecewise non-linear approximation of mobile charge density. In IEEE International Conference on Design, Automation and Test in Europe (DATE), Munich, Germany,
10–14 March 2008.
9. P.L. McEuen, M.S. Fuhrer, and H. Park. Single-walled carbon nanotube electronics. IEEE
Transactions on Nanotechnology, 1(1):78–85, 2002.
10. B.C. Paul, S. Fujita, M. Okajima, and T. Lee. Modeling and analysis of circuit performance
of ballistic CNFET. In 2006 Design Automation Conference, San Francisco, CA, USA, 24–28
July 2006.
11. A. Rahman, J. Guo, S. Datta, and M.S. Lundstrom. Theory of ballistic nanotransistors. IEEE
Transactions on Electron Devices, 50(9):1853–1864, 2003.
12. A. Rahman, J. Wang, J. Guo, S. Hasan, Y. Liu, A. Matsudaira, S.S. Ahmed, S. Datta, and M. Lundstrom. FETToy 2.0—online tool, 14 February 2006. https://www.nanohub.org/resources/220/.
13. A. Raychowdhury, S. Mukhopadhyay, and K. Roy. A circuit-compatible model of ballistic
carbon nanotube field-effect transistors. Applied Physics Letters, 23(10):1411–1420, 2004.
14. S. Wang and T.J. Kazmierski. Southampton VHDL–AMS validation suite, 18 October 2007.
https://www.syssim.ecs.soton.ac.uk/index.htm.
15. M.-H. Yang, K.B.K. Teo, L. Gangloff, W.I. Milne, D.G. Hasko, Y. Robert, and P. Legagneux. Advantages of top-gate, high-k dielectric carbon nanotube field-effect transistors. Applied Physics Letters, 88(11):113507, 2006.
Chapter 7
Wide-Band Sigma–Delta ADC Design
in Superconducting Technology
R. Guelaz, P. Desgreys and P. Loumeau
Abstract This chapter presents the design of a bandpass sigma–delta ADC in superconducting technology. We study an architecture based on VHDL–AMS modeling of Josephson junctions. Starting from the standard linear model, we propose hierarchical models based on a functional analysis. To simplify the comparator behaviour and the simulations, we assimilate SFQ pulses to ideal rectangular pulses when specifying the ADC performance. The Josephson junction model is based on writing the RCSJ electrical model in accordance with the VHDL–AMS language. Each functional element of the ADC is isolated and simulated to validate its behaviour. The pulse duration is a physical parameter determined directly by the technology. We relate the ADC performance (SNR) to the form and duration of the comparator SFQ pulses using normalized physical parameters.
Keywords ADC · Sigma–delta bandpass · Superconducting · VHDL–AMS
7.1 Introduction
Recent advances in RSFQ (Rapid Single Flux Quantum) technology promise better performance than CMOS technology [1] for specific applications such as space telecommunications. A bandpass sigma–delta ADC [2] could be implemented in software radio communication systems [3] as a replacement for the classic time-interleaved architecture. The sigma–delta converter is well known to be efficient because of its simple architecture, which is easy to implement and composed of basic elements. RSFQ technology exploits these advantages thanks to its capability to operate at several hundred GHz [4–7]. To realize an actual converter, we propose to map the classic sigma–delta ADC structure onto a converter built from Josephson junctions, which principally compose the ADC comparator and the clock generator. To reproduce the sigma–delta principle, a feedback path must be established between the output and the input. The SFQ (Single Flux Quantum) pulses generated by the Josephson junctions of the balanced comparator are studied and quantized to reproduce this feedback effect.
P. Desgreys ()
L.T.C.I., CNRS-UMR 5141, Institut Telecom, Telecom-Paristech, Paris, France
e-mail: patricia.desgreys@telecom-paristech.fr
The pulse form and duration directly determine the ADC performance; this dependence is expressed through the signal-to-noise ratio (SNR). To simplify the ADC design and enable a rapid performance estimation, we assimilate SFQ pulses to rectangular pulses whose area and duration agree with the real case. We study the effect of the pulses on the current across the resonator. The VHDL–AMS language [8] permits us, in this particular case, to reproduce the effect of the pulses in the system model through the use of the "break" statement. Previous work [9] considered SFQ (Single Flux Quantum) pulses as Dirac impulses, but the model can be improved by incorporating a pulse-duration parameter.
7.2 Sigma–Delta Second Order Architecture
7.2.1 Bandpass Sigma–Delta Modulator
The principle of sigma–delta modulation is based on oversampling and noise shaping. Its usual block diagram representation is presented in Fig. 7.1. The principle relies on a feedback loop that permits a prediction based on the previous information. In our case, we study a bandpass sigma–delta modulator. For a second-order ADC with 1-bit quantization, a simple comparator at the output generates the feedback with normalized +1/−1 values. The transfer function of the resonator is given in the z-domain by Eq. (7.1):
$$B(z) = \frac{-z^{-2}}{1 + z^{-2}} \qquad (7.1)$$
The system can be interpreted as a linearized model where the quantization is considered as a source of additive white noise Q(z). The output Y(z), given in Eq. (7.2), is the sum of the response due to the input and the response due to the quantization noise source:

$$Y(z) = \mathrm{STF}(z)X(z) + \mathrm{NTF}(z)Q(z) \qquad (7.2)$$
with the signal transfer function STF(z) and the noise transfer function NTF(z) given by

$$\mathrm{STF}(z) = \frac{B(z)}{1 + B(z)C(z)}, \qquad \mathrm{NTF}(z) = \frac{1}{1 + B(z)C(z)} \qquad (7.3)$$

Fig. 7.1 2nd order sigma–delta classic architecture
Fig. 7.2 SNRmax as a function of the OSR and the modulator order N
With the oversampling principle, the modulator provides a gain of 1 in the input signal band with minimal quantization error, so the signal-to-noise ratio (SNR), and hence the resolution, is significantly improved. To estimate the ADC performance, we consider that the noise is superposed on the signal and that the SNR is the ratio between the signal power and the quantization noise power. The power spectral density of the quantization noise, PSD_N, given by Eq. (7.4), is the product of the modulator noise transfer function NTF and the quantization noise PSD_Q:

$$\mathrm{PSD}_N = |\mathrm{NTF}(f)|^2\,\mathrm{PSD}_Q = 2\left(1 + \cos\frac{4\pi f}{F_{clk}}\right)\frac{q^2}{12} \qquad (7.4)$$

with f the frequency and q the quantizer resolution.
The maximum SNR, SNRmax, is expressed by

$$\mathrm{SNR}_{dB,max} = 10\log\left(\frac{3\,(2n+1)\,\mathrm{OSR}^{2N+1}}{2\pi^{2N}}\right) \qquad (7.5)$$

where n is the quantizer bit resolution and 2N is the filter order in the case of a resonator. The SNR as a function of the oversampling ratio (OSR) and of the filter order in the general case is shown in Fig. 7.2.
For example, N = 1, which corresponds to a second-order resonator, and OSR = 128 lead to SNRmax ≈ 60 dB.
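The example can be checked numerically with the short C++ fragment below (a sketch only, following the reconstructed form of Eq. (7.5); n is the number of quantizer bits and 2N the filter order):

#include <cmath>
#include <cstdio>

// Eq. (7.5): SNR_dB,max = 10*log10( 3*(2n+1)*OSR^(2N+1) / (2*pi^(2N)) )
double snr_db_max(int n, int N, double osr) {
    const double pi = std::acos(-1.0);
    return 10.0 * std::log10(3.0 * (2.0 * n + 1.0) * std::pow(osr, 2 * N + 1)
                             / (2.0 * std::pow(pi, 2 * N)));
}

int main() {
    std::printf("n=1, N=1, OSR=128 -> %.1f dB\n", snr_db_max(1, 1, 128.0)); // about 60 dB
    return 0;
}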
7.2.2 The Josephson Junction
The basis of RSFQ technology lies in the properties of the Josephson junction, which is a three-layer structure composed of two superconducting materials separated by a
Fig. 7.3 RCSJ Josephson
junction electric model
Fig. 7.4 Current–voltage
characteristic of an
overdamped junction
thin metallic layer (SNS). The principal property of the junction is its perfect voltage-oscillator behavior when it is biased with a constant current above its critical current Ic. The oscillation frequency can reach several hundred GHz. The junction model is based on writing each branch of the RCSJ (Resistively and Capacitively Shunted Junction) circuit, composed of an ideal current source with critical current Ic in parallel with a resistance R and a capacitance C, as illustrated in Fig. 7.3.
The relation between the junction phase φ and the RCSJ model can be written as

$$I_{sj} = I_c \sin(\phi) + \frac{\phi_0}{2\pi R_n}\frac{d\phi}{dt} + \frac{\phi_0 C}{2\pi}\frac{d^2\phi}{dt^2} \qquad (7.6)$$
where Isj is the current across the junction and φ0 is the flux quantum constant, equal to 2.07 µV/GHz. The junction behavior (Fig. 7.4) is non-hysteretic; when the current exceeds the critical current value, the junction generates a voltage pulse.
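The pulse generation can be illustrated by integrating Eq. (7.6) directly. The following plain C++ sketch (not the VHDL–AMS model of this chapter) biases a junction slightly above Ic and prints the resulting voltage; the junction values reuse the JJ1 parameters given later in Sect. 7.3.1, while the bias level and the time step are illustrative choices only.

#include <cmath>
#include <cstdio>

int main() {
    const double pi   = std::acos(-1.0);
    const double phi0 = 2.07e-15;   // flux quantum [Wb]
    const double Ic   = 542e-6;     // critical current [A] (JJ1, Sect. 7.3.1)
    const double Rn   = 0.55;       // normal resistance [Ohm]
    const double C    = 0.32e-12;   // junction capacitance [F]
    const double Ib   = 1.1 * Ic;   // bias current above Ic -> voltage oscillation
    double phi = 0.0, dphi = 0.0;   // phase and its time derivative
    const double dt = 1e-15;        // time step [s] (illustrative)

    for (long k = 0; k < 200000; ++k) {
        // Eq. (7.6): (phi0*C/2pi)*phi'' = Ib - Ic*sin(phi) - (phi0/(2pi*Rn))*phi'
        double ddphi = (Ib - Ic * std::sin(phi) - phi0 / (2.0 * pi * Rn) * dphi)
                       / (phi0 * C / (2.0 * pi));
        dphi += ddphi * dt;         // semi-implicit Euler update
        phi  += dphi * dt;
        if (k % 20000 == 0)
            std::printf("t = %g s  V = %g V\n", k * dt, phi0 / (2.0 * pi) * dphi);
    }
    return 0;
}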
To reproduce a 2nd order sigma–delta modulator, we isolate each functional element of the ADC, mainly the comparator and the clock generator, and we propose to implement these elements with Josephson-junction-based circuitry.
7.2.3 The RSFQ Balanced Comparator
The comparator is composed of two junctions, JJ2 and JJ3. The decision instants are fixed by each clock edge: when the sum of the currents exceeds the JJ3 critical current, a pulse is generated; otherwise junction JJ2 switches. To generate the needed modulator feedback, we use the voltage pulse generated at each clock edge. To assimilate the comparator behavior to the classical +1/−1 feedback, we consider that the unipolar pulses, noted Vq(t), can be decomposed into the sum of a bipolar pulse train, noted Vq2(t), and a periodic pulse train, noted Vq1(t), presented in Fig. 7.6. The principal information is carried by Vq2(t); the periodic effect of Vq1(t) can be compensated by an offset in the comparator that is fixed by the JJ2 critical current. Bulzacchelli [9] proposed this simplified representation to study the impact of the pulses in a sigma–delta modulator.
7.2.4 Sigma Delta Modulator Operation with Josephson Junctions
As seen in the theoretical analysis of the sigma–delta modulator, the heart of the operation is the feedback loop between the modulator output and input. In the classical architecture, the comparator generates normalized +1/−1 voltages synchronized with the clock.
In a superconducting implementation, we use an L–C resonator to realize the 2nd order bandpass filter B(z), and we consider that the feedback effect can be applied directly to the current across the resonator thanks to the RSFQ comparator. In fact, the voltage pulse created at the comparator node is integrated by the L–C resonator. The result is a change of the L–C current by a value proportional to the pulse area. In the case of an RSFQ pulse, the area is always equal to φ0 and the pulse duration is negligible (like a Dirac pulse), so the L–C current is decremented by a constant value.
Finally, the transposition between current and voltage reproduces the same behavior, as presented in Fig. 7.5.
To identify the sigma–delta principle in the RSFQ design, we consider that each SFQ pulse delivered by the comparator results in the addition or subtraction of a value ΔIL to the resonator current IL:

$$\Delta I_L = \phi_0 / L \qquad (7.7)$$

Fig. 7.5 Assimilation with use of Josephson junctions
Fig. 7.6 ADC modeling with a comparator generating rectangular bipolar pulses
The resolution of the circuit equation (Eq. (7.8)), when considering a rectangular pulse of duration τ for Vq2(t), is described by Eq. (7.9). We denote by p the Laplace variable.

$$V_{in}(p) - V_{JJ3}(p) = I_L(p)\cdot\left(\frac{1}{Cp} + Lp\right) = I_L(p)\cdot\frac{1 + LCp^2}{Cp} \qquad (7.8)$$

During the time τ ≪ Tclk, it can be demonstrated that the current IL(t) has a linear variation:

$$I_L(t) = -A\sqrt{\frac{C}{L}}\cdot\frac{1}{\sqrt{LC}}\cdot t = -\frac{\phi_0}{2\tau}\cdot\frac{1}{L}\cdot t \qquad (7.9)$$
7.2.5 System Modeling with VHDL–AMS
To model the ADC behavior, we consider two approaches. The first one supposes that the comparator effect on the current across the L–C resonator is produced at each clock edge with a fixed value ±φ0/(2L). The simulation of these abrupt changes is possible through the "break" statement. This VHDL–AMS instruction stops the analog simulator and reinitializes it with a new value of the current IL; this new value is determined at each clock pulse and is a function of the present value. The second modeling approach is to implement the pulse duration with rectangular pulse shapes. In the physical case, pulses have a duration that is set by the technology, and varying this parameter informs us about its impact on the ADC operation.
Fig. 7.7 SFQ comparator
with clock stimulation
7.3 The Sigma–Delta ADC Design
7.3.1 Clock and Comparator Design
As presented in Fig. 7.7, the RSFQ comparator is based on the use of two junctions, one of which (JJ3) has a higher critical current. The comparison is made on the input current: if it is positive, an SFQ pulse is generated at the output; if it is negative, the output is unchanged. An SFQ pulse is interpreted as a logic "1", otherwise the level is a logic "0".
The clock is generated by biasing junction JJ1 with a constant current. JJ1's critical current is fixed at approximately the sum of the JJ2 and JJ3 critical currents. The L4 inductance is set to a value ensuring that the comparison results appear between two SFQ clock pulses. Ip is the comparator bias current. In accordance with the Nb/AlOx/Nb technology, with parameter RnIc = 300 µV, we obtain a clock at 60 GHz (Tclk = 16.6 ps) with:

Vc = 1.15 mV
Rc = 1 Ohm
L3 = 1 pH
JJ1: Ic = 542 µA, Rn = 300 µV/542 µA = 0.55 Ohm
C = φ0·Ic/(2π·Rn·Ic)² = 0.32 pF
We implemented this clock separately and obtained the simulation results shown in Figs. 7.8–7.10. The simulations are executed with the Simplorer V7 software.
When the sum of the input current Ix and the current across junction JJ2 exceeds the JJ3 critical current, an SFQ pulse is generated at the comparator output. With an ideal input current Ix, we analyze the comparator output obtained at JJ2
Fig. 7.8 Clock generated at
JJ1 junction for a fixed
frequency 60 GHz
Fig. 7.9 JJ3—comparator output with input Ix [−300 µA; +300 µA]
and JJ3 with the following circuit values:
JJ1: Ic = 900 µA
JJ2: Ic = 300 µA
JJ3: Ic = 310 µA
Vc = 1.2 mV
Rc = 1 Ohm
L3 = 1 pH
L4 = 4 pH
Coupling the comparator to the clock requires adjusting the Ic and Vc values to obtain the same form for the clock. The symmetrical signal resulting from the JJ3 junction is presented in Fig. 7.9.
These simulation results show that when the input current is negative there are no pulses, only small oscillations resulting from JJ2 switching. In this simulation case, we are at the frequency limit for proper comparator operation. This is visible because there is a small preliminary pulse before the switching of JJ3 and because the pulses do not have the same level when JJ3 commutes. For the last part of our study, the SFQ pulses generated by junction JJ3 are assimilated to rectangular pulses with the same area and duration; these parameters are controlled by the fabrication technology.
7.4 Simulation Results
To illustrate the bandpass sigma–delta ADC operation, we implement the structure of Fig. 7.6, written in VHDL–AMS, in the Smash–Dolphin® software. The converter is specified for a clock system fixed at 120 GHz. Simulations are done for an oversampling ratio of 120 and a bandwidth of 500 MHz. We specifically study the SNR as a function of the pulse duration generated by the comparator. The pulse duration covers the range from 1 fs to 2.5 ps. Figure 7.10 shows the effect of the pulses on the current and on the modulator output.
At each clock edge, we observe a step, as demonstrated in theory; it respects the sign of the current and it is constant. The effect of the pulse duration must be taken into account and is illustrated in Fig. 7.11.
When the pulse duration is not negligible, the evolution of the input signal during the pulse modifies the step value. In Fig. 7.11, we can observe two different steps, noted 1 and 2, so the quantization is not operated correctly. The conclusion is that the use of SFQ pulses to reproduce the sigma–delta operation has a frequency limit. To estimate this limit for a specified resolution, we simulate the impact of the pulse duration on the SNR, as shown in Fig. 7.12.
This result shows that the SNR is a quasi-linear function of the comparator pulse
duration. In particular, if the duration is under 2.2 ps, the SNR obtained is in the
Fig. 7.10 Simulation results with ideal pulses effect on the current
Fig. 7.11 Simulation with pulse duration effect
Fig. 7.12 SNR as a function of the pulse duration for an oversampling ratio of 120
range of 50–55 dB, which corresponds to an 8-bit resolution. The best results are around 55 dB and are obtained when τ ≪ Tclk/100, where Tclk is the clock period.
We now simulate the real circuit implemented with Josephson junctions and described in VHDL–AMS; we simulate the complete ADC represented in Fig. 7.13, and the results are shown in Fig. 7.14.
The simulation results show, first, the alternating switching of the comparator junctions. The clock is generated by junction JJ1 and results in constant pulses. The most difficult aspect of the design is to avoid the perturbation generated by the output switching: the decision instant must lie strictly between two SFQ clock pulses. At the limit working frequency, as shown in the present simulation, the steps on the current differ slightly from one another, but we measure approximately ΔIL = φ0/L. The SNR is obtained from the spectral analysis of the output voltage. For a Blackman windowing with 16384 points, we obtain the result illustrated by Fig. 7.15.
Fig. 7.13 Sigma–delta
modulator operating
at 60 GHz
Fig. 7.14 Comparison with real SFQ pulses
The noise transfer function is highlighted in this result by its characteristic shape, with the noise rejected outside the bandwidth of interest (14.75–15.25 GHz). The signal is identified at the particular frequency of 15.1 GHz. In this example, the SNR obtained is 54 dB.
7.5 Conclusion
In this work, we propose an approach to estimate the performance of a superconducting sigma–delta bandpass ADC in RSFQ technology. We assimilate the comparator behavior to a pulse generator parameterized by the pulse duration, and we proceed by modeling at different levels to capture the specific consequences of using SFQ pulses to reproduce the sigma–delta modulator. We show that increasing the comparator pulse duration implies a quasi-linear decrease of the SNR. In the next step, simulations should rapidly give the ADC performance when using the complete circuit with Josephson junctions.
Fig. 7.15 Output spectral analysis
References
1. O.A. Mukhanov, V.K. Semenov, et al. High-resolution ADC operating at 19.6 GHz clock
frequency. Superconductor Science and Technology, 14:1065–1070, 2001.
2. R. Schreier, G.C. Temes. Understanding Delta-Sigma Data Converters. Wiley Interscience,
New York, 2005.
3. P. Loumeau, L. Naviner, and J.F. Naviner. Analog-to-digital conversion for software radio. In
G. Vivier, editor, Reconfigurable Mobile Radio Systems: A Snapshot of Key Aspects Related
to Reconfigurability in Wireless Systems. Iste Publishing Company, London, 2007.
4. E. Baggetta, R. Setzu, J.-C. Villégier, and M. Maignan. Implementation of basic NBN RSFQ
logic gates of a wide-band sigma–delta modulator. In Applied Superconductivity Conference,
Seattle, 2006.
5. E. Baggetta, R. Setzu, and J.-C. Villégier. Study of SNS and SIS NbN Josephson junctions
coupled to a microwave band-pass filter. Journal of Physics. Conference Series, 43:1167–
1170, 2006.
6. K.K. Likharev, V.K. Semenov. RSFQ logic/memory family, a new Josephson-junction technology for sub-Terahertz-clock-frequency digital systems. IEEE Transactions on Applied Superconductivity, 1(1):3–28, 1991.
7. P. Febvre, J.-C. Berthet, D. Ney, A. Roussy, J. Tao, G. Angenieux, N. Hadacek, and J.-C. Villegier. On-chip high-frequency diagnostic of RSFQ logic cells. IEEE Transactions on Applied
Superconductivity, 11(1):284–287, 2001.
8. IEEE STD 1076.1-1999: IEEE standard VHDL analog and mixed signal extensions. SH94731,
IEEE, 1999.
9. J. Bulzacchelli. A superconducting bandpass delta-sigma modulator for direct analog-to-digital
conversion of microwave radio. Ph.D. thesis, Massachusetts Institute of Technology, 2003.
Chapter 8
Heterogeneous and Non-linear Modeling
in SystemC–AMS
Ken Caluwaerts and Dimitri Galayko
Abstract This contribution presents a SystemC–AMS model of a mixed non-linear,
strongly coupled and multidomain electromechanical system designed to scavenge
the energy of ambient vibrations and to generate an electrical supply for an embedded microsystem. The system operates in three domains: purely mechanical (the
resonator), coupled electromechanical (electrostatic transducer associated with the
moving mass) and electrical circuit, including switches, diodes and linear electrical components with varying parameters. The modeling difficulties related with
the non-linear discontinuous behavior of the system and with the limitations of
SystemC–AMS are resolved by simultaneous use of the two modeling domains
available in SystemC–AMS: the one allowing SDF (Synchronous Data Flow) modeling and the one allowing LIN ELEC (linear electrical) circuit analysis. The modeling results are compared with VHDL–AMS and Matlab Simulink models.
Keywords Vibration energy harvesting/scavenging · Capacitive transducer ·
Flyback · Charge pump · SystemC–AMS · Heterogeneous modeling · Non-linear
modeling · SDF
8.1 Introduction
Harvesters of mechanical energy are complex heterogeneous multiphysics systems that are difficult to describe analytically. They are non-linear and time-varying and exhibit complex, discontinuous behavior. Hence, a design optimizing their energy performance requires reliable and, if possible, simple models.
This work focuses on modeling of harvesters which use capacitive transducers
for the energy conversion from mechanical into electrical form. A typical capacitive
harvester includes (Fig. 8.1):
• a mechanical resonator allowing an accumulation of the mechanical energy,
• an electrostatic (capacitive) transducer, with one electrode attached to a mobile
mass, and the other fixed to the system which is submitted to external vibrations,
• a conditioning electrical circuit managing the flow of electrical charges on the
transducer electrodes.
D. Galayko ()
LIP6, University of Paris-VI, Paris, France
e-mail: dimitri.galayko@lip6.fr
Fig. 8.1 General structure of
the vibration energy harvester
Electromechanical harvesters cannot generally be modeled using a purely electrical simulator, since they include non-linear electromechanical elements—typically
an electromechanical transducer requiring a behavioral model. Existing modeling
approaches use signal data flow diagrams (e.g. Simulink models) [1], Spice-level
descriptions including Spice macromodels to model the transducer [5, 6] and behavioral VHDL–AMS descriptions [3].
However, as shown by recent research, in autonomous SOCs/SIPs the mechanical energy harvester is likely to be part of a multisource block which will include other energy sources (thermal, solar, etc.), a rechargeable battery and an "intelligent" energy management unit (probably a low-power microprocessor) [8]. Thus, the model of the mechanical energy harvester must be compatible with the model of the global system.
SystemC–AMS is one of the only modeling platforms that make it possible to
describe physical, electrical analog, digital and software blocks in the same model
[10, 11]. Hence, it is an excellent candidate for modeling complex energy generators, including digital and software blocks.
In this contribution we present a SystemC–AMS model of a vibrational energy harvester system whose conditioning circuit architecture was proposed in [13]
(Fig. 8.1). In particular, we describe our solutions for the difficulties related with the
non-linearity and the switching operation of the conditioning circuit.
The chapter is organized as follows. In Sects. 8.1.1 and 8.1.2, we introduce the
SystemC–AMS platform and the modeled system. In Sect. 8.2 we present the model
of all the blocks of the harvester and in Sect. 8.3 we discuss the modeling results
and compare them with a VHDL–AMS and a Simulink model.
8.1.1 SystemC–AMS Modeling Platform
The platform used is the first experimental version, 0.15RC5, of the SystemC–AMS extension [10]. The authors of this prototype participated in releasing the first draft standard of the SystemC–AMS extension. This new standard enables system-level design and modeling of analog/mixed-signal systems by defining
Fig. 8.2 SystemC–AMS
models: a example of
an algebraic non-linear
system modeled in SDF,
b illustration of a connection
between SDF and LinElec
models
additional language constructs and execution semantics for the efficient simulation of discrete- and continuous-time behavior [4, 12].
The current SystemC–AMS prototype allows two kinds of models to be defined: Synchronous Data Flow (SDF) models and Linear Electrical Network (LinElec) models.
An SDF model is defined in multirate synchronous data flow domain (SDF),
which can be used to describe analog non-conservative (signal-flow) behaviors: each
block has one or several inputs and outputs, and the data are propagated throughout
the blocks. Multirate means that the model designer can define a time step equal to
an integer multiple of the minimal model time step. This multiple can be different
for each block, but neither the minimal time step nor the individual block multiples
can be changed during the simulation. To run an SDF simulation, SystemC–AMS
uses a static scheduling algorithm.
Another important point is that the SystemC–AMS imposes some limits on the
modeling of non-linear systems. In fact, the only available method for non-linear
equation resolution is the fixed-point iteration method [2]. For example, an algebraic
system described by the equation
y = f (x, y),
(8.1)
where x is a known input, can only be modeled if a one-step delay is inserted in the
loop (Fig. 8.2a). Hence, the output y exhibits a transient process which converges to
the solution of (8.1). In fact, the system of Fig. 8.2a models the fixed-point iterative
process:
yi = f (xi , yi−1 ).
(8.2)
It is known that this process converges provided that the initial (guess) value of y is in the attraction basin of the solution and that x changes slowly compared with the delay. In particular, if f exhibits strong non-linearities (like the exponential models of diodes) and if the delay (step) is not small enough, convergence problems can occur.
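The delayed loop of Fig. 8.2a can be emulated in a few lines of plain C++ (a sketch of the iteration of Eq. (8.2), not SystemC–AMS code); the function f used here is an arbitrary contraction chosen so that the iteration converges, not a function from the harvester model.

#include <cmath>
#include <cstdio>

int main() {
    double y = 0.0;                        // initial guess (must lie in the attraction basin)
    for (int i = 0; i < 30; ++i) {
        double x = 1.0;                    // slowly varying (here constant) input
        y = 0.5 * std::cos(y) + x;         // y_i = f(x_i, y_{i-1}): a contraction, so it converges
        if (i % 5 == 0) std::printf("iter %2d: y = %.6f\n", i, y);
    }
    return 0;
}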
The Linear Electrical Network description level is provided for modeling of linear electrical networks. The network is defined by its electrical netlist (described by
a specific C++ code). Before the start of a simulation, the SystemC–AMS core sets
up the matrix equation corresponding to the linear network, and during the simulation, closed expressions are used for computing the network quantities.
Time-varying and even non-linear electrical networks can be modeled using the
connection between SDF and LinElec models (Fig. 8.2b). This is possible thanks to
the mechanism allowing the inclusion of components with time-varying parameters
(resistor, capacitor, voltage and current source) in the linear circuit models. The
connection does exist in both directions: for example, a SDF domain signal can
control the resistance value in a linear circuit, and an electrical value (current or
voltage) measured in the linear circuit can be converted to an input of a SDF domain
block as well (Fig. 8.2b).
In this work we show that although SystemC–AMS does not offer a direct possibility to model electrical networks with non-linear elements in a charge conservative
domain, this limitation can be circumvented using the coupling mechanism between
SDF and LinElec models.
8.1.2 Summary of Electrostatic Harvester Operation
The modeled harvester can be seen as a conjunction of three blocks: an electromechanical resonant transducer including a resonator and a variable capacitor, a
charge pump and a flyback circuit. The last two blocks form the conditioning circuit
(Fig. 8.1).
8.1.2.1 Resonant Electromechanical Transducer
The idea of a vibration energy harvester using an electrostatic transducer is the following. The energy stored in the capacitor is given by:

$$W = \frac{Q^2}{2C} \qquad (8.3)$$
where Q is the charge and C is the capacitance. If C varies thanks to some mechanical force modifying the capacitor geometry, it is possible to charge the capacitor when C is high (at the price of some energy W1) and to discharge it when C is low (getting back some energy W2). Since the charge stays constant while the mechanical force reduces C, formula (8.3) shows that W2 > W1, and the difference W2 − W1 corresponds to the energy harvested from the mechanical domain.
Figure 8.1 presents the diagram of the resonant transducer block. It is modeled
as a second-order lumped parameter system including a mass, a spring, a damper
(not shown) and a parallel-plate capacitor with a mobile electrode. The mechanical
behavior is described by Newton’s second law:
−kx − µẋ + Ft (x) − maext = mẍ,
(8.4)
where x is the mass displacement from the equilibrium position, m is the mass,
k is the stiffness of the spring, µ is the viscous damping constant of the resonator,
Ft is the force generated by the capacitive transducer and aext is the acceleration of
the external vibrations. The term −maext takes into account the fact that the equation is written with regard to a non-inertial frame reference moving with acceleration aext [7].
The transducer force is given by the following formula:

$$F_t = \frac{V_{var}^2}{2}\,\frac{dC_{var}}{dx} \qquad (8.5)$$
where Vvar is the voltage applied on the transducer and Cvar is the transducer’s
capacitance. The gradient of capacitance depends on the transducer geometry and
in our model, the capacitance varies linearly with the displacement (the numerical
values are valid for the device presented in [9]):
Cvar (x) = 75 · 10−12 + 1 · 10−6 x (Farads).
(8.6)
8.1.2.2 Conditioning Circuit Operation
The role of the conditioning circuit is to manage the electrical charge flow on the
variable capacitor. The circuit proposed in [13] is composed of the charge pump and
of the flyback circuit (Fig. 8.3).
The charge pump is composed of two diodes and three capacitors Cres , Cvar and
Cstore whose values are such that: Cres ≫ Cstore ≫ Cvar .
The role of the charge pump is to make use of the mechanical energy to transfer
the charges from Cres to Cstore . Since Cres ≫ Cstore , in this process the electrical
system accumulates the energy coming from the mechanical domain.
The circuit starts the operation from the state where all capacitors are charged at
some initial voltage V0 , the switch is blocked (opened) and the variable capacitor
Cvar is at its maximal value.
Fig. 8.3 Modeled conditioning circuit [13]
When Cvar decreases, the voltage Vvar increases, the diode D2 turns on and the
charges flow from Cvar to Cstore increasing Vstore . When Cvar is minimal and starts
to increase, its voltage decreases, D2 is off and until Vvar = Vres , both the diodes
are off. When Vvar ≤ Vres , D1 is on and the charges from Cres flow towards Cvar ,
discharging Cres. This is repeated during the next capacitance variation cycle.
When Vstore approaches the saturation value [13], the flyback is activated by closing the switch. The natural oscillation period of the LC network is much smaller than
the vibration frequency, and very quickly Cstore discharges on Cres . At the moment
where Vstore = Vres , the switch turns off, and the system returns to its initial state.
Now, all the capacitors have equal voltages again, but slightly higher than those at
the start of the pumping process; this increase in the capacitor voltages corresponds
to the harvested energy.
Switching is driven by the event corresponding to the crossing of the threshold
values by the voltage Vstore , and can be modeled by a finite-state automaton with a
one bit memory register [3].
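The pumping phase described above can be checked with a small idealized calculation (plain C++, ideal diodes, charge conservation applied once per half vibration cycle; the Cvar swing between Cmin and Cmax is an assumed value and the flyback phase is not modeled). In this idealized picture, Vstore rises towards a saturation value of roughly V0·Cmax/Cmin while Vres slowly decreases.

#include <cstdio>

int main() {
    const double Cres = 1e-6, Cstore = 3.3e-9;       // values from Table 8.1
    const double Cmax = 200e-12, Cmin = 100e-12;     // assumed Cvar swing
    double Vres = 5.0, Vstore = 5.0, Vvar = 5.0;     // common initial voltage V0

    for (int cycle = 1; cycle <= 50; ++cycle) {
        // Cvar: Cmax -> Cmin, D2 conducts once Vvar reaches Vstore
        Vstore = (Cmax * Vvar + Cstore * Vstore) / (Cmin + Cstore);
        // Cvar: Cmin -> Cmax, D1 conducts once Vvar falls back to Vres
        Vres = (Cres * Vres + Cmin * Vstore) / (Cres + Cmax);
        Vvar = Vres;
        if (cycle % 10 == 0)
            std::printf("cycle %2d: Vstore = %.2f V, Vres = %.3f V\n", cycle, Vstore, Vres);
    }
    return 0;
}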
8.2 SystemC–AMS Modeling of the Harvester
8.2.1 Resonator Modeling
Equation (8.4) was modeled as a feedback SDF system, whose structure is given in
Fig. 8.4.
To define the operation of the resonator, two input values have to be provided:
the voltage of the transducer’s capacitor (Vvar ) and the external acceleration (aext ).
The output value of this block is the capacitance of the transducer calculated in the
block Cvar from the mobile mass position through the known function Cvar (x) (8.6).
This block is implemented as a SystemC module (specified as SC_MODULE in
the code). The delay at the input of the first integrator is required by the SystemC
SDF solver which doesn’t tolerate delay-less loops.
Fig. 8.4 SDF diagram of the
resonator
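The behavior captured by this SDF diagram can be sketched outside SystemC–AMS with a plain C++ integration loop of Eq. (8.4) (illustration only: the transducer voltage Vvar is held at an assumed constant value instead of being fed back from the conditioning circuit, and a simple semi-implicit Euler step is used).

#include <cmath>
#include <cstdio>

int main() {
    // parameters from Table 8.1 and Eqs. (8.5), (8.6)
    const double k = 152.6, m = 46e-6, mu = 2.19e-3;
    const double w = 2.0 * std::acos(-1.0) * 298.0;
    const double dCdx = 1e-6;                 // dCvar/dx [F/m], Eq. (8.6)
    const double Vvar = 5.0;                  // assumed constant transducer voltage [V]
    const double dt = 4e-9;                   // simulation time step (Table 8.1)
    double x = 0.0, v = 0.0;                  // displacement and velocity

    for (long i = 0; i * dt < 0.02; ++i) {    // 20 ms of vibration
        double t    = i * dt;
        double aext = 10.0 * std::sin(w * t);
        double Ft   = Vvar * Vvar / 2.0 * dCdx;               // Eq. (8.5)
        double a    = (-k * x - mu * v + Ft - m * aext) / m;  // Eq. (8.4)
        v += a * dt;
        x += v * dt;
        if (i % 1000000 == 0)
            std::printf("t = %.4f s  x = %.3e m  Cvar = %.2f pF\n",
                        t, x, (75e-12 + dCdx * x) * 1e12);    // Eq. (8.6)
    }
    return 0;
}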
8.2.2 Implementation of the Conditioning Circuit Model
The conditioning circuit contains several components whose modeling is tricky. In
the following sections we describe it for each block.
8.2.2.1 Variable Capacitor
The SystemC–AMS LinElec tool provides a model of a variable capacitor. This is
the most important component of our application, thus we carried out several simple test cases before incorporating it in a complex model. The main question was
whether or not the variable capacitor model was charge conservative. For example, if the capacitor voltage is fixed, variations of the capacitance must generate an
electrical current in the voltage source.
The response to this question was positive. Indeed, in SystemC–AMS, when a
variable capacitor is added to the circuit, two state variables are created: the capacitor voltage and the charge equal to the product of the capacitance and the voltage.
A variable capacitor is controlled by a signal issued from the SDF domain. Thus,
it is natural to connect the signal Cvar issued from the Fig. 8.4 model to the input of
the variable capacitor.
8.2.2.2 Diode Implementation
The implementation of the diodes is the most challenging part of the modeling task,
since they have a very non-linear behavior. At first, we tried to model the diodes
as current sources controlled by their own voltages with an exponential functional
relation between both values. The implemented scheme is given in Fig. 8.5a. The
voltage on the diode is measured in the linear electrical domain and converted to
the SDF domain using the predefined SystemC–AMS block sca_vd2sdf (Voltage
Difference To Signal Data Flow). Then the current is calculated and after a necessary
delay step, the current passing through the current source is updated.
Fig. 8.5 Different
implementations of a diode in
SystemC–AMS
Although correct in theory, the presented diode model did not work, mainly because of the very steep exponential I–V diode characteristic in conjunction with the
fixed delay between the voltage measurement and the current generation. So, if at
step k the voltage becomes, for example, 1 V, the corresponding very high current
will be generated throughout the whole time step k + 1, which will definitively cause
the modeling process to fail.
For this reason, we proposed a different scheme using a variable resistance
(Fig. 8.5b). In this case, the diode is modeled as a switch with on and off resistances, and switching is ordered by the voltage on the switch itself. Although the
value of the resistance will also be updated with a one time step delay after the voltage change, there is no delay between the current and the voltage calculation, thus
the model is accurate enough if the time step is low.
The listing of the diode implementation is given below.
Two different diode models are implemented. The first one is the two-state diode described above. The second one implements a more general diode law: instead of using the input voltage directly, we first limit its precision in order to limit the number of different states (resistance values) the diode can be in.
It is very important to limit the number of states, as every state transition forces SystemC–AMS to reinitialize its network matrix. To further limit the number of states, one should choose a cutoff voltage: if the voltage on the diode is lower than this value, the diode is considered to be completely blocking. The values used depend on the diode model one wants to approximate. With this technique, one can approximate many different diode models with SystemC–AMS.
#define TWO_STATE_DIODE
// This code implements a two state diode.
// To implement a general diode model,
// uncomment the next line.
// #undef TWO_STATE_DIODE

// calculates the diode's state
SCA_SDF_MODULE(Electrical_diode_function)
{
  sca_tdf_in<double>  voltage;
  sca_tdf_out<double> resistance;

  void sig_proc()
  {
    double vc = voltage.read(); // voltage on the diode (input)
    double res;                 // the resistance of the diode (output)
#ifdef TWO_STATE_DIODE
    if (vc > 0.0) {
      res = 1e-9;               // open resistance
    } else {
      res = 1e10;               // closed resistance
    }
#else
    double volt =
      round_to_n_decimals(voltage.read(), <precision>);
    double current =
      <realistic current diode law using volt>;
    if (current == 0. || volt < <cutoff_voltage>)
      res = 1e10;               // diode closed
    else
      res = volt / current;     // calculate resistance
#endif
    resistance.write(res);
  }

  void attributes()
  {
    resistance.set_delay(1);    // obligatory 1 step delay
  }

  Electrical_diode_function(sc_module_name name_) :
    sca_tdf_module()
  {}
};

SC_MODULE(Electrical_diode)
{
  sca_elec_port p;              // electric terminals
  sca_elec_port n;

  Electrical_diode(sc_module_name name_) :
    sc_module(name_)
  {
    // construct blocks
    resistor  = new sca_tdf2r("source", 1e10);
    converter = new sca_vd2tdf("converter");
    function  = new Electrical_diode_function("function");
    // settings
    converter->scale = 1.0;
    // connect converter
    converter->p(p);
    converter->n(n);
    converter->tdf_voltage(voltage);
    // connect function
    function->voltage(voltage);
    function->resistance(resistance);
    // connect resistor
    resistor->p(p);
    resistor->n(n);
    resistor->ctrl(resistance);
  }

  // variable resistor
  sca_tdf2r* resistor;
  // converter
  sca_vd2tdf* converter;
  // resistance calculation block
  Electrical_diode_function* function;
  // signals
  sca_tdf_signal<double> voltage;
  sca_tdf_signal<double> resistance;
};
This model is used as a normal SystemC–AMS component. For example, an insertion of a diode between the nodes p and n is achieved as follows:
Electrical_diode diode("diode");
sca_elec_node p, n;
diode.p(p);
diode.n(n);
8.2.2.3 Initial Charge of the Capacitors
When modeling power electronic systems, it is very important to be able to control the initial energy of the reactive elements. In SystemC–AMS, the default initial values are zero, and no mechanism is provided to modify them.
Thus, for each capacitor we use the scheme shown in Fig. 8.6: a voltage source is connected to the capacitor through a switch (a variable resistance) during several step periods. The calculation of the initial pre-charge time is encapsulated in the model. The switch is on during at least the first time step; then the time constant (the RC product) is calculated, and if the latter is large, the switch stays on for a time equal to at least 10RC.
Fig. 8.6 Implementation of
a capacitor with initial
pre-charge
8.2.2.4 Conditioning Circuit/Flyback Switch Modeling
As shown in [3], the switch should be driven by the energy state of the charge pump. Thus, in our model we implemented the switch as a finite-state automaton (with two possible states), driven by the events of threshold crossing by the voltage Vstore. The switch turns on when it is off and Vstore rises above the upper threshold V2, and it turns off when it is on and Vstore falls below the lower threshold V1, with V1 < V2.
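A minimal C++ sketch of this automaton is given below (not the actual SystemC–AMS module; it only shows the hysteresis logic, with the thresholds and on/off resistances taken from Table 8.1 and the on/off ordering following the charge-pump description of Sect. 8.1.2.2).

#include <cstdio>

struct FlybackSwitch {
    bool   on = false;
    double V1 = 6.0, V2 = 11.0;              // low / high thresholds [V] (Table 8.1)

    // returns the switch resistance to be used in the LinElec network
    double update(double Vstore) {
        if (!on && Vstore > V2) on = true;   // charge pump saturated -> start the flyback
        if (on  && Vstore < V1) on = false;  // Cstore discharged onto Cres -> stop
        return on ? 1.0 : 1e11;              // RONSW / ROFFSW [Ohm] (Table 8.1)
    }
};

int main() {
    FlybackSwitch sw;
    // synthetic Vstore trajectory: slow rise, then fast discharge
    double trace[] = {5.0, 8.0, 10.5, 11.2, 9.0, 7.0, 5.5, 6.5};
    for (double v : trace)
        std::printf("Vstore = %5.1f V -> R = %g Ohm\n", v, sw.update(v));
    return 0;
}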
8.2.2.5 Modeling of the Diode D3
A particular problem arises when simulating the circuit’s behavior when the flyback/
charge pump switch is turning off. Before this moment, the current in the inductor
is maximal. When the switch turns off, the current path is broken, which normally
provokes a high negative voltage on the inductor. Since the voltage of the left node
of the inductor is fixed by the Cres capacitor, a negative voltage glitch is generated
on D3. However, the latter is connected so as to turn on when a negative voltage is
generated on the node fly (Fig. 8.3). By turning on, the D3 diode allows the inductor
current to continue. One can say that the D3 diode “absorbs” the negative voltage
glitch.
However, the described phenomenon takes place instantaneously, whereas the
model in SystemC–AMS is strictly causal: for the diode D3 to turn on, its voltage
should become negative during the preceding step. Since the switching off is abrupt
(the off resistance of the switch is high), the generated voltage is very high and being
present throughout the step, seriously disturbs the circuit’s state vector.
To limit this negative effect, we connected the node fly to a small parallel-to-ground capacitor Cp = 1 pF, whose role is to maintain the voltage on the node fly at a reasonable level during the transition step. Adding this capacitor does not invalidate the model, since it naturally represents the node's parasitic capacitance.
When the switch turns off, the current continues to flow through the capacitor
Cp , charging it at a negative voltage, and at the next step the diode turns on and the
circuit operates normally.
8.2.3 Model of the Whole System
The model of the global system is presented in Fig. 8.7: it is composed of the conditioning circuit and resonator models connected through their input and output terminals. The delay is necessary since there is a loop; a sca_v2sdf block converts the voltage of the Cstore capacitor to the SDF domain.
Fig. 8.7 Diagram of the
complete harvester model in
SystemC–AMS
Table 8.1 Numerical values of the modeling test case

k [N m^-1]       152.6         Cres [F]     10^-6         t (time step) [s]   4·10^-9
m [kg]           46e-6         Cstore [F]   3.3·10^-9     RONDI [Ω]           10^-9
µ [N s m^-1]     2.19e-3       Cp [F]       10^-13        ROFFDI [Ω]          10^10
ω [rad·s^-1]     2π·298        L [H]        2.5·10^-3     RONSW [Ω]           1
aext [m s^-2]    10·sin(ωt)    RL [Ω]       10^-1         ROFFSW [Ω]          10^11

Switch low threshold V1 = 6 V; switch high threshold V2 = 11 V
8.3 Modeling Results
8.3.1 Description of the Modeling Experiment
Table 8.1 gives the numerical values for the circuit model.
With a time step of 4 ns, a 1 s simulation requires 250 million steps. Such a small
step was chosen to accurately model the quick processes involving highly non-linear
elements (diodes).
In the first version of our model, the capacitance of the variable capacitor Cvar
calculated by the SDF resonator was updated in the LIN ELEC domain at each step
(every 4 ns). This required a reinitialization of the network matrix at each step, leading to an excessively long simulation time (the linear solver of SystemC–AMS is optimized for modeling time-invariant systems, or systems whose parameters change
rarely). This problem was identified using the GNU profiler (gprof).
Therefore, we modified the variable capacitor’s mixed SDF-LIN ELEC model:
in the new version, the LIN ELEC capacitance value is updated only once every 400 ns. This modification does not affect the accuracy of the model, since the capacitance varies sinusoidally with a frequency of only 298 Hz. This optimization made the simulation run about 9 times faster.
Fig. 8.8 Global view of the harvester operation: a Vstore voltage, b Vres voltage evolution highlighting an energy accumulation
Fig. 8.9 Zoom on the circuit behavior: a Vvar and Vstore at the end of the first cycle, b flyback circuit operation
With this modification, modeling 1 second of system operation required 120 minutes of machine time on a 2 GHz Intel Core 2 Duo computer (4 MB L2 cache, 2.0 GB RAM) running Mac OS X 10.4; only one processor core was used to run the simulation. SystemC 2.2.0 and SystemC–AMS 0.15 RC5 were used.
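The rate decoupling of the capacitance update described above can be sketched roughly as follows. This is an illustrative plain C++ fragment written for this discussion, not the authors' SystemC–AMS code; the class name, helper function and the way the decimation counter is handled are our assumptions.

// Illustrative sketch of decoupling the LIN ELEC capacitance update rate
// from the 4 ns SDF step: the network element is rewritten only every
// 100th sample (400 ns), so the solver reinitializes its matrix rarely.
class CvarUpdater {
public:
    explicit CvarUpdater(int decimation = 100) : decimation_(decimation) {}

    // Called once per 4 ns SDF sample with the capacitance value
    // computed by the resonator model.
    void sample(double c_new) {
        if (++count_ >= decimation_) {
            count_ = 0;
            write_lin_elec_capacitance(c_new);   // triggers a matrix re-setup
        }
    }

private:
    void write_lin_elec_capacitance(double /*c*/) {
        // would set the value of the variable capacitor in the LIN ELEC
        // network, forcing the solver to rebuild its equation system
    }
    int decimation_;
    int count_ = 0;
};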
Figure 8.8 presents the global view of the simulation results, showing the evolution of Vstore and Vres during 1 s. This is typical behavior for a charge pump,
apparently identical to the results obtained by the VHDL–AMS simulation [3]. The
evolution of Vres highlights an accumulation of the harvested energy in the system.
Figure 8.9a presents an enlarged view of the Vstore and Vvar voltage evolution at the
end of the first cycle, Fig. 8.9b shows the evolution of the flyback current (in the
inductor).
8.3.2 Modeling Results Validation
To verify our model we compared the SystemC–AMS model with a VHDL–AMS
and a Matlab Simulink model. The former was described in [3]; the latter was created using the electrical network equations.
Every model used the same block parameters (resistances, initial voltages, etc.),
but with different diode models. Simulink used a quadratic diode law, whereas
VHDL–AMS used a model with 3 zones (linear on and off zones, and a quadratic
transition zone to keep the first derivative continuous). The VHDL–AMS model did
not include the Cp capacitor, as it caused the simulation to slow down too much.
For SystemC–AMS we used the two-state diode model.
The values of three system quantities were compared: Vstore , Vres and IL (the
quantities defining the energy state of the system). From the results of each model,
Vstore was measured every 10 ms to compare global system operation, Vres ’s peak
value minus 5 V (the initial voltage) was measured after three cycles of the charge
pump/flyback operation to compare only the harvested energy, and for IL the peak
value was taken from the first cycle to compare the flyback operation. We also
performed a number of other global and detailed tests. In all tests SystemC–AMS
showed results similar to those of Simulink and VHDL–AMS (under 2% relative difference).
Table 8.2 presents the relative differences of Simulink and SystemC–AMS modeling versus VHDL–AMS modeling.
These results show that the SystemC–AMS model is correct. The use of various
diode models is reflected in the different energy yields (visible in the values of Vres after 4 flyback cycles, Fig. 8.8). Nonetheless, these differences are very small
(less than 2%), the harvested energy value obtained with Simulink being a bit closer
to that obtained with VHDL–AMS (since Simulink’s diode model is more similar
to that of VHDL–AMS).
Simulation times were 3.5 minutes for Matlab Simulink using the ode23s solver
on a Dual Intel Xeon 3 GHz (4 cores) with 6 GB of memory and 5.75 minutes for
VHDL–AMS using the ADVance MS simulator on a Sun Ultra-80.
Compared with SystemC–AMS modeling, Simulink and VHDL–AMS simulations consume much less machine time. This is explained by the fact that the corresponding solvers use a variable simulation step, allowing a dramatic time step
reduction only during the flyback circuit operation. The SystemC–AMS model uses
a fixed step, which is chosen to accurately model the quickest process of the system.
Table 8.2 Relative differences of the values obtained in SystemC–AMS and Simulink with VHDL–AMS
          SystemC–AMS   Simulink
Vstore    0.495%        0.886%
Vres      1.468%        0.316%
IL        0.595%        0.100%
8.4 Conclusion
This study presented a complex SystemC–AMS model of a vibration energy harvester with a capacitive transducer. It was demonstrated that complex non-linear
electrical circuits coupled with non-electrical domain subsystems can be accurately
modeled with SystemC–AMS. The modeling results were compared with VHDL–
AMS and Simulink simulation outputs.
We explored the possibilities of this promising new extension for SystemC. By
introducing new reusable components we extended the modeling possibilities of this
simulator to non-linear electrical circuits.
The main limitation of the current version of the SystemC–AMS simulator for
modeling complex AMS systems is that the simulation step cannot be varied dynamically. Also, if a time-variable linear system is modeled with the LIN ELEC
solver, the equation system matrix is reinitialized at each variation of the system parameters. This initialization is a very time-consuming operation which, if executed
at each time step, can dramatically increase the simulation time.
Nevertheless, SystemC–AMS, being a SystemC extension, offers the unique possibility to model systems from the hardware level (electrical circuit) up to the software level using the same standardized language (C++).
The results of this study suggest that the current version of SystemC–AMS can, if necessary, be used for precise modeling of complex systems and circuits. However, the highlighted difficulties show that SystemC–AMS is probably more appropriate for behavioral models with a higher abstraction level. For example, the simulation time can be reduced at the price of a small precision loss if quick processes, e.g., the flyback operation, are simulated using simplified behavioral models. The final model architecture and the precision/detail level depend on the particular goals of the model designer.
References
1. Y. Chiu, Y.-S. Chu, and C.-T. Kuo. MEMS design and fabrication of an electrostatic vibration-to-electricity energy converter. Journal of Microsystem Technologies, 13(11–12):1663–1669,
2007.
2. G. Dahlquist and Å. Björck. Numerical Methods in Scientific Computing, Volume 1. SIAM,
Philadelphia, 2008.
3. D. Galayko, R. Pizarro, P. Basset, A.M. Paracha, and G. Amendola. AMS modeling of controlled switch for design optimization of capacitive vibration energy harvester. In BMAS’2007
International Workshop Proceedings, pages 115–120, 2007.
4. C. Grimm, M. Barnasconi, A. Vachoux, and K. Einwich. An introduction to modeling embedded analog/mixed-signal systems using SystemC AMS extensions. In DAC2008 International
Conference, June 2008.
5. E. Halvorsen, L.-C.J. Blystad, S. Husa, and E. Westby. Simulation of electromechanical systems driven by large random vibrations. In MEMSTECH2007 Conference Proceedings, pages
117–122, 2007.
6. G. Kondala Rao, P.D. Mitcheson, and T.C. Green. Simulation toolkit for energy scavenging
inertial micro power generators. In PowerMEMS 2007 Workshop Proceedings, pages 137–140,
November 2007.
7. L.D. Landau and E.M. Lifshitz. Course of Theoretical Physics: Mechanics. Butterworth-Heinemann, Burlington, 1982.
8. H. Lhermet, C. Condemine, M. Plissonier, R. Salot, P. Audebert, and M. Rosset. Efficient
power management circuit: From thermal energy harvesting to above-IC microbattery energy
storage. IEEE Journal of Solid-State Circuits, 43(1):246–255, 2008.
9. A.M. Paracha, P. Basset, P.C.L. Lim, F. Marty, and T. Bourouina. A bulk silicon-based
vibration-to-electric energy converter using an in-plane overlap plate (IPOP) mechanism. In
PowerMEMS’2006 Workshop Proceedings, 2006.
10. A. Vachoux, C. Grimm, and K. Einwich. Extending SystemC to support mixed discrete–
continuous system modeling and simulation. In ISCAS2005 International Conference Proceedings, pages 5166–5169, 2005.
11. M. Vasilevski, F. Pêcheux, H. Aboushady, and L. De Lamarre. Modeling heterogeneous systems using SystemC–AMS case study: A wireless sensor network node. In BMAS2007 International Workshop Proceedings, pages 11–16, 2007.
12. www.systemc.org. Official web site of Open SystemC Initiative (OSCI) group, 2008.
13. B.C. Yen and J.H. Lang. A variable-capacitance vibration-to-electric energy harvester. IEEE
Transactions on Circuits and Systems, 53(2):288–295, 2006.
Part III
Digital Systems Design Methodologies
Based on C++
Chapter 9
Application Workload and SystemC Platform
Modeling for Performance Evaluation
Jari Kreku, Mika Hoppari, Tuomo Kestilä,
Yang Qu, Juha-Pekka Soininen and
Kari Tiensyrjä
Abstract An increasing number of concurrent applications in future mobile devices will be based on parallel heterogeneous multiprocessor system-on-chip platforms using network-on-chip communication to achieve scalability. A performance modeling and simulation approach is described to explore the application-platform solution/design space efficiently at system level. The application behavior is abstracted to workload models that are mapped onto performance models of the execution platform for transaction-level simulation. The approach provides separation of application and platform through service-oriented modeling. The experimentation of the approach in a mobile video player case study is presented.
Keywords Workload · Platform · Transaction-level · Performance · Simulation ·
Abstract instruction · Service modeling · UML2 · SystemC
9.1 Introduction
The digital processing architectures of future handheld mobile multimedia devices
will evolve from current System-on-Chips (SoC) and Multi-Processor-SoCs with a
few processor cores to massively parallel computers that consist mostly of heterogeneous sub-systems, but may also contain homogeneous computing sub-systems.
The Network-on-Chip (NoC) communication paradigm will replace the bus-based
communication to allow scalability, but it will increase uncertainties due to latencies when large centralized or external memories are required. Moreover, several
currently independent mobile devices, e.g. phone, music player, television, movie
player, desktop and Internet tablet, will converge to one. The sets and types of applications running on the terminal are dependent on the context of the user. To deliver
requested services to the user, some of the applications run sequentially and independently, while many others execute concurrently and interact with each other. As a consequence of the above trends, the overall complexity of system development
will increase by orders of magnitude.
J. Kreku ()
VTT Technical Research Center of Finland, Kaitoväylä 1, 90571, Oulu, Finland
e-mail: jari.kreku@vtt.fi
Platform-based design [1] was introduced in the late 1990s to address the challenges
of increasing design complexity of SoCs that consist typically of a few processor
cores, hardware accelerators, memories and I/O peripherals communicating through
a shared bus. To alleviate scalability problems of the shared bus, the NoC architecture paradigm was proposed as a communication centric approach for systems
requiring multiple processors or integration of multiple SoCs.
Model-based approaches [2] emphasize the separation of application and execution platform modeling. The principles of the Y-chart model [3] are usually applied in design space exploration, i.e. a model of the application is mapped onto a model of the platform and the resulting allocated model is analyzed.
Due to the change from a vertical to a horizontal business model in the mobile industry [4], a service-oriented and subsystem-based development methodology [5] is being adopted. The end-user interactions and the associated applications are modeled in terms of services they require from the underlying execution platform. An obvious consequence is that the execution platform also needs to be modeled in terms of
services it provides for the applications.
Both the application and platform designers are facing a large number of
design alternatives and need systematic approaches for the exploration of the design
space. Efficient methods and tools for early system-level performance analysis are
necessary to avoid wrong decisions at the critical stage of system development.
Design time application mapping and platform exploration addressing run-time
management of MP-SoC platforms is presented in [6]. The idea is to generate at design time a set of Pareto-optimal application mappings that the run-time manager
uses when switching applications on the platform. An extension to the Nostrum
simulator environment for system-level optimization of dynamic data and communication is presented in [7]. Measured or statistical communication traces are used in
[8] for the analysis of network and computing node communication. The application
mapping approach presented in [9] uses static profiling and co-simulation to achieve
co-exploration of the application design space including both the architecture model
and the application source code.
Performance evaluation has been approached in many ways at different levels of
refinement. SPADE [10] implements a trace-driven, system-level co-simulation of
application and architecture. Artemis [11] extends this by introducing the concept
of virtual processors and bounded buffers. TAPES performance evaluation approach
[12] abstracts the functionalities by processing latencies and covers only the interaction of the associated sub-functions on the architecture without actually running the
corresponding program code. MESH [13] looks at resources, software, and schedulers/protocols as three abstraction levels that are modeled by software threads on
the evaluation host.
Our approach differs from the above in the way the application is modeled and abstracted. The workload model faithfully mimics the control structures of the applications, but the leaf-level load data is represented like traces. Also, the execution platform is modeled at transaction level rather than at instruction level. In our case, the timing information is resolved during the simulation of the system model.
In this chapter we present a model-based approach for system-level performance evaluation. The approach combines the top-down refinement type application
modeling and bottom-up composition type platform modeling. Both are based on a
service-oriented approach with defined service interfaces, which brings scalability
to both of the sides. The application models are abstracted to workload models that
are mapped onto the platform performance models and the resulting system model is
simulated at transaction-level to obtain performance data. The tool support is based
on a commercial UML2 tool, Telelogic Tau G2 [14], and OSCI SystemC [15].
The rest of the contribution is structured as follows. Section 9.2 describes our
performance modeling approach. Section 9.3 presents the mobile video player case
study. Section 9.4 draws conclusions and discusses future work.
9.2 Performance Modeling and Simulation
The performance modeling and evaluation approach of ABSOLUT follows the
Y-chart model as depicted in Fig. 9.1 [16, 17]. The layered hierarchical workload
models represent the computation and communication loads the applications cause
on the platform when executed. The layered hierarchical platform models represent
the computation and communication capacities the platform offers to the applications. The workload models are mapped onto the platform models and the resulting
system model is simulated at transaction-level to obtain performance data.
9.2.1 Application and Workload Modeling
Starting from the end-user requirements, a hierarchically layered service-oriented
application model is created using UML2 use case, state machine, composite structure and sequence diagrams (Fig. 9.1). The top layer, called abstract use case model,
consists of system level services visible to the user.
Fig. 9.1 Performance modeling and evaluation approach
These services are decomposed to sub-services in the refined use case model. Further refinement results in primitive
services that are requested from the execution platform model. The functional simulation of the model in the Telelogic Tau UML2 tool provides sequence diagrams which
are needed for verification and for building the workload model.
The purpose of workload modeling is to capture the load an application causes on an execution platform when executed. The workload models accurately reflect the control structures of the applications, but the computing and communication loads are abstractions derived either analytically [18], from measured traces [19], or using a source code-compilation tool approach [16].
Due to load abstraction, the models can be simulated before the applications are
finalized, enabling early performance evaluation. As opposed to most performance
simulation approaches, the workload models do not contain timing information. It
is left to the platform model to find out how long it takes to process the workloads.
This arrangement results in enhanced modeling and simulation speed. It is also easy
to modify the models, which facilitates easier evaluation of various use cases with
minor differences. For example, it is possible to parameterize the models so that the
execution order of applications varies from one use case to another.
The workload models have a hierarchical structure. The top-level workload
model divides into application workloads, which are constructed of one or more
process workloads. These are comprised of function workloads that are basically
control flow graphs of basic blocks and branches. Basic blocks are ordered sets
of load primitives used for load characterization. Load primitives are abstract instructions read and write for modeling memory accesses and execute for modeling
data processing. The process and function workload models can also be statistical,
in which case the model will describe the total number of different types of load
primitives and the control is a statistical distribution for the primitives.
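For illustration, the workload hierarchy and its load primitives could be represented by data structures along the following lines. This is a hypothetical C++ sketch, not the actual implementation of the described approach, and all type names are ours.

// Hypothetical data structures mirroring the workload hierarchy described
// above (illustrative only; all names are ours).
#include <vector>

enum class PrimitiveType { Read, Write, Execute };

struct LoadPrimitive {            // abstract instruction
    PrimitiveType type;
    unsigned      count;          // words to read/write or instructions to execute
};

struct BasicBlock {               // ordered set of load primitives
    std::vector<LoadPrimitive> primitives;
};

struct FunctionWorkload {         // control flow graph of basic blocks
    std::vector<BasicBlock> blocks;
    // branch/successor information omitted for brevity
};

struct ProcessWorkload {
    std::vector<FunctionWorkload> functions;
};

struct ApplicationWorkload {
    std::vector<ProcessWorkload> processes;
};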
The entire hierarchy of workload models is collected in a UML2 class diagram,
which presents the associations, dependences, and compositions of the workloads.
Control inside the workloads is described with state machine diagrams and composite structure diagrams are used to connect the control with the corresponding
workload model.
UML2 is a standard language used in software development; the possibility to use UML2-based workload models therefore enables reuse of existing UML2 application models, reducing the effort and making the performance simulation approach more accessible in general.
9.2.2 Execution Platform Modeling
The platform model is an abstracted hierarchical representation of the actual platform architecture consisting of component, subsystem, and platform layers. Each
layer has its own services, which are abstraction views of the architecture models.
They describe the platform behaviors and related attributes, e.g. performance, but
hide other details. Services in the subsystem and platform architecture layers can be
invoked by workload models.
Component Layer
This layer consists of processing (e.g. processors, DSPs, dedicated hardware and
reconfigurable logic), storage, and interconnection (e.g. bus and network structure)
elements. An element must implement one or more types of component-layer services. The component-layer read, write and execute services are the primitive services, based on which higher level services are built.
All the component models contain cycle-approximate timing information. However, modeling applications as workloads causes certain limitations to what needs to
be modeled on the platform side as the workloads do not express the dependences
between abstract instructions. Therefore the data paths of processing units should
not be modeled in detail; instead the processor models have a cycles-per-instruction
(CPI) value, which is used to estimate the execution time of the workloads. Furthermore, the workload models often do not have exact address information; instead
they define which memory block they are addressing. As a result, cache hits and misses and e.g. SDRAM page misses must be modeled statistically in such a situation.
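As an illustration of how a processor model could translate load primitives into simulated time, consider the following sketch. It is an assumption-laden C++ fragment written for this discussion, not the code of the described models, and all parameter names are ours.

// Illustrative sketch of CPI-based timing and a statistical cache model.
#include <random>

struct ProcessorTiming {
    double clock_hz;        // clock frequency parameter
    double cpi;             // cycles per instruction
    double cache_hit_prob;  // statistical cache model
    double hit_cycles;
    double miss_cycles;

    std::mt19937 rng{12345};
    std::uniform_real_distribution<double> uni{0.0, 1.0};

    // Time consumed by execute(N): N data processing instructions.
    double execute_time(unsigned n) const {
        return n * cpi / clock_hz;
    }

    // Time consumed by a single memory access, decided statistically
    // because exact addresses are not available in the workloads.
    double access_time() {
        double cycles = (uni(rng) < cache_hit_prob) ? hit_cycles : miss_cycles;
        return cycles / clock_hz;
    }
};

Because only hit probabilities rather than addresses are modeled, different random seeds yield slightly different but statistically equivalent timings.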
Subsystem Layer
The subsystem layer is built on top of the component layer and describes the components of the subsystem and how they are connected. The services used at this
layer include e.g. video pre- and post-processing and decoding for a video acceleration subsystem. The model is presented as a composite structure diagram that
instantiates elements from the component library. The subsystems are connected to
the communication network via network interfaces.
Platform Architecture Layer
This layer is built on top of the subsystem layer by incorporating communication
network and platform software. Platform-layer services consist of service declaration and instantiation information. The service declaration describes the functionalities that the platform can provide. The instantiation information describes how a
service is instantiated in the platform.
COGNAC is a custom C++ tool, which reads a text-based platform configuration
file and generates the top-level SystemC models for the platform and base classes
for the subsystems. The platform configuration file describes (1) the subsystems,
(2) what components are instantiated inside each subsystem, and (3) how the subsystems and components are connected to each other. The generated subsystem models
are stubs that should be extended by the designer by implementing subsystem-level
services.
Another configuration file is used to configure the parameters of the instantiated
components. The number of parameters varies on a component-by-component basis and can be anything from a few parameters to tens of parameters. However, the
Table 9.1 Low-level interface functions
Interface        Description
read(A, W, B)    read W words of B bits from address A
write(A, W, B)   write W words of B bits to address A
execute(N)       simulate N data processing instructions
Table 9.2 High-level interface functions
Interface                  Return value            Description
use_service(name, attr)    service identifier id   request service name using attr as parameters
wait_service(id)           N/A                     wait until the completion of service id
designer can also set up default values inside the models. Each component has at
least parameters for the clock frequency and latencies for accessing the interconnection. Furthermore, processor models typically have parameters for instruction
and/or data cache hit probability and latency in addition to the CPI value. Correspondingly, memory models have parameters for access latencies and interconnect
models for data transfer and arbitration latencies.
Interfaces
The platform model provides two interfaces for utilizing its resources from the
workload models (Fig. 9.1). The low-level interface of the processing elements at
the component layer is intended for transferring load primitives as listed in Table 9.1
[20]. The functions of the low-level interface are blocking—in other words a load
primitive level workload model is not able to issue further primitives before the
previous primitives have been executed.
The high-level interface enables workload models to request services from the
platform model (Table 9.2). The use_service call is used to request the given service and is non-blocking, so that the workload model can continue while the service
is being processed. It returns a unique service identifier, which can be given as a
parameter to the blocking wait_service call to wait until the requested service has
completed.
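The two interfaces can be pictured as the following abstract C++ classes. The concrete argument and return types are assumptions made for this sketch; the real models additionally involve SystemC modules and the OS model, which are not shown.

// Sketch of the two platform interfaces of Tables 9.1 and 9.2.
#include <cstdint>
#include <string>

struct LowLevelIf {                       // blocking load-primitive interface
    virtual void read(uint64_t addr, unsigned words, unsigned bits) = 0;
    virtual void write(uint64_t addr, unsigned words, unsigned bits) = 0;
    virtual void execute(unsigned n_instructions) = 0;
    virtual ~LowLevelIf() = default;
};

struct ServiceAttributes;                 // service-specific parameters

struct HighLevelIf {                      // service interface
    using ServiceId = unsigned;
    // Non-blocking: the workload may continue while the service runs.
    virtual ServiceId use_service(const std::string& name,
                                  const ServiceAttributes* attr) = 0;
    // Blocking: returns when the identified service has completed.
    virtual void wait_service(ServiceId id) = 0;
    virtual ~HighLevelIf() = default;
};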
The platform model includes operating system (OS) models, which control accesses to the processing unit models of the platform by scheduling the execution of
process workload models. The OS model supports both low-level and high-level interfaces to the workloads and relays interface function calls to the processor or other
models which realize those interfaces. The OS model allows only those process
workloads which have been scheduled for execution to call the interface functions
and performs rescheduling periodically according to the scheduling policy implemented in the model.
The services inside the platform can model either hardware or software services.
In ABSOLUT, software services are modeled as workload models, but unlike application models, they are integrated in the platform model and easily reusable by
the applications. If the service is provided by a process or a set of processes running
in the system, the service model consists of application or process layer workloads.
If the service is implemented as a library, the model will be at the function layer.
Service models can utilize other services, but eventually they consist of the same
read/write/execute load primitives as the application models.
There are two alternatives for implementing a HW service: it can be implemented simply as a delay in the associated component, if the processing of the service does not affect the other parts of the system at all. In this case the service must
not perform I/O operations or request other services. The second alternative is to
implement the service as read, write and possibly execute primitives like the SW
services, but in this case they are executed inside the HW component and not inside
a process workload running on one of the processor models.
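The two alternatives can be sketched as follows for a hypothetical accelerator module. The sketch uses standard SystemC but is not taken from the MVP model, and the issue_* helpers stand in for the component-internal primitive implementations.

// Illustrative sketch of the two HW service implementation alternatives.
#include <systemc>
#include <cstdint>

struct HwAccelSketch : sc_core::sc_module {
    SC_CTOR(HwAccelSketch) {}

    // Alternative 1: the service is just a delay in this component.
    // Valid only if it performs no I/O and requests no other services.
    void service_as_delay(const sc_core::sc_time& processing_time) {
        wait(processing_time);            // called from an SC_THREAD context
    }

    // Alternative 2: the service is modeled with primitives executed
    // inside the HW component itself, so its traffic is visible to the
    // rest of the platform.
    void service_as_primitives(uint64_t src, uint64_t dst, unsigned words) {
        issue_read(src, words);
        issue_execute(words);             // data processing on the block
        issue_write(dst, words);
    }

    void issue_read(uint64_t, unsigned)  { /* internal read primitive  */ }
    void issue_write(uint64_t, unsigned) { /* internal write primitive */ }
    void issue_execute(unsigned)         { /* internal execute primitive */ }
};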
9.2.3 Allocation and Transformation to SystemC
Figure 9.2 depicts the flow from UML2 application models to generated SystemC
workload models. A skeleton model of the platform is created in UML2 to facilitate
mapping between the workload models with service requirements and the platform
models with service provisions. The skeleton model describes what components
exist in the platform and what services the components provide. In the mapping
phase, each workload entity (function, process, application) is linked to a processor
or other component, which is able to provide the services required by that entity.
This is realized in the UML2 model using composite structure diagrams.
The allocated UML2 workload models are transformed to SystemC [21] using
automatic SystemC code generation from the Telelogic Tau UML2 tool.
Fig. 9.2 Transformation from UML2 to SystemC
The generator has been developed by Lund University [22] and it produces SystemC code
files and Makefiles for building the models. In the SystemC domain, the workload
models have pointers to the platform model components they have been allocated to
and utilize their services via the low- or high-level interfaces.
9.2.4 Performance Simulation
The executable simulation model of the combined workload and execution platform
models (Fig. 9.1) is based on the OSCI SystemC library, extended with configurable
instrumentation. During the simulation of the system model the workloads send load
primitives and service calls to the platform model. The platform model processes
the primitives and service calls, advancing the simulation time while doing so. The
simulation run will continue until the top-level workload model stops it when the
use case has been completed.
The platform model is instrumented with counters, timers and probes, which
record the status of the components during the simulation. These performance
probes are manually inserted in the component models where appropriate and are
flexible so that they can be used to gather information about platform performance
as needed: Status probes collect information about utilization of components and
scheduling of processes performed by the operating system models. Counters are
used to calculate the number of load primitives, service calls, requests and responses
performed by the components. Timers keep track of the task switch times of the OS
models and processing times of services.
After simulation the performance probes output the collected performance data to
the standard output. A C++-based tool, VODKA, is used for graphically viewing e.g. processor utilization, bus and memory traffic, and execution time.
9.3 Mobile Video Player Case Example
In the mobile video player (MVP) case example, a mobile terminal user wants to
view a movie on the device and selects one from a list of movies available on the
mobile terminal. The execution platform provides services for storing and playing
of movie files.
The platform in the mobile video player case consists of four subsystems
(Fig. 9.3): (1) General purpose (GP) subsystem, which is used for executing an
operating system and generic applications and services, (2) Image (IM) subsystem,
which accelerates image and video processing, (3) Storage (ST) subsystem, which
contains a repository for video clips, and (4) Display (DP) subsystem, which takes
care of displaying the UI and video. The subsystems are interconnected by a network using a ring topology.
The general purpose subsystem has two ARM11 general purpose processors for
executing the OS and applications. There is also a subsystem-local SDRAM memory controller and memory to be used by the two processors.
Fig. 9.3 Execution platform for the mobile video player
For communicating with the other subsystems, each subsystem has a network interface. The image subsystem is built around the video accelerator. The services provided by the subsystem
are controlled by a simple ARM7 microcontroller. There is also some SRAM memory for the ARM and a DMA controller for offloading large data transfers. Storage
and display subsystems are mostly similar to the image subsystem and they contain
a simple ARM and a DMA controller. The storage subsystem has local memory for
storage and metadata, and the display subsystem has a graphics accelerator, local
SRAM for graphics, and a display interface for the screen.
9.3.1 Modeling of the Execution Platform Components
The modeling of the MVP platform began with the implementation of the component models. There are 23 components in total inside the four subsystems (Fig. 9.3),
but some of them are identical and others resemble other blocks closely. For example, the bus and network interface components are the same in each subsystem. Internal SRAM memory and an ARM7 controller can be found in three of the subsystems, and the ARM11 is close enough to the ARM7 that it can use the same model with different parameter values. As a result, only 7 different models of the components (excluding parent classes) were needed.
Fig. 9.4 Class hierarchy of the MVP platform model
The Component model is an abstract base class for all other models and defines
the parameters every model must have: clock frequency and address (Fig. 9.4). The
MVP platform uses the OCP protocol for communication, hence the Master and
Slave models add OCP master and slave ports respectively. They are still intended
as base classes from which processor or memory models can be derived easily. The
Master model contains methods for setting up OCP read and write requests, sending them to the port and receiving responses. Correspondingly, the Slave model has
methods for getting requests from the port, processing them, preparing responses
and sending them back to the port. Both models add parameters, which define latencies for accessing the port for read and write transactions. The Bus model is the third
model derived from the Component and simulates a basic OCP bus with round-robin
scheduling. It contains both master and slave OCP multiports and methods for arbitration, address decoding, and sending and receiving requests and responses. The
parameters of the Bus define the latencies for moving requests and responses across
the bus.
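A much simplified sketch of this hierarchy is shown below. It is illustrative only; the OCP ports and the SystemC infrastructure of the real models are omitted, and the member names are ours.

// Simplified sketch of the component-model class hierarchy described above.
#include <cstdint>

class Component {                       // abstract base of all models
public:
    Component(double clock_hz, uint64_t base_address)
        : clock_hz_(clock_hz), base_address_(base_address) {}
    virtual ~Component() = default;
protected:
    double   clock_hz_;                 // mandatory parameter: clock frequency
    uint64_t base_address_;             // mandatory parameter: address
};

class Master : public Component {       // adds an OCP master port (omitted)
public:
    using Component::Component;
protected:
    double port_read_latency_  = 0.0;   // latencies for accessing the port
    double port_write_latency_ = 0.0;
};

class Slave : public Component {        // adds an OCP slave port (omitted)
public:
    using Component::Component;
protected:
    double port_read_latency_  = 0.0;
    double port_write_latency_ = 0.0;
};

class Bus : public Component {          // round-robin OCP bus
public:
    using Component::Component;
protected:
    double request_latency_  = 0.0;     // latency for moving a request
    double response_latency_ = 0.0;     // latency for moving a response
};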
The general purpose processor (GPP) model provides the primitive interface to
the workload models and abstracts the data processing capability of processors with
a CPI value. The CPI is a workload-controllable parameter, which is used when the
model calculates the execution time for the read, write and execute calls from the
workloads. In the MVP case the GPP model is used only as a base class for the
Table 9.3 Clock frequencies of the MVP platform components
GP subsystem   ST subsystem   IM subsystem   DP subsystem
ARM 0: 333 MHz, 50 MHz, 83 MHz, 100 MHz
ARM 1: 333 MHz, 25 MHz, 83 MHz, 100 MHz
83 MHz, 83 MHz, 83 MHz, 83 MHz
Bus: 166 MHz, 25 MHz, 83 MHz, 100 MHz
SDRAM: 166 MHz, 50 MHz, 166 MHz, 100 MHz
Video_accel: 166 MHz
DMA controller, Display IF, Network IF, SRAM: 100 MHz, 25 MHz
ARM model, which adds statistical instruction and data cache models on top of the
GPP. The ARM model has additional parameters for hit probability, hit latency and
line size for the caches. It is also possible to define whether the data cache operates
in write-through or write-back mode and whether or not the cache is allocated on a
write miss.
The HW_accel model is a pure virtual base class for hardware accelerators and
provides the higher-level service interface (Table 9.2). The DMA_controller, Display_IF, and Video_accel models extend the HW_accel by implementing services
for DMA transfers, display updates, and video decoding and encoding, respectively.
The SRAM and SDRAM models extend the Slave model by overloading the calculation of the data access latency. The SRAM model adds parameters for read,
write, and burst access latencies, whereas the SDRAM model has a different set of
parameters consisting of RAS precharge, RAS-to-CAS, CAS, and burst latencies.
Furthermore, the SDRAM model simulates page misses statistically because the
workloads do not generally provide exact addresses to the platform model. Thus, a
parameter defining the probability of a page miss is needed. An interesting implication of the modeling approach is that the memory models do not actually provide
data storage. It is not required because the functionality of applications is not simulated and data is not moved in the simulated transactions.
There were 213 parameters in total for the platform’s components, so it is not
feasible to display all of them here. Table 9.3 reveals the clock frequency of each
component as an example of the configuration.
9.3.2 Modeling of the Services
In the MVP case, three of the four subsystems have a DMA controller, which is used
for offloading large data transfers from the ARM7 microcontroller and/or the hardware accelerators in those subsystems. The DMA controller provides a component-layer dma_transfer service implemented in hardware (Table 9.4). The service takes
Table 9.4 Services provided by the MVP platform
Subsystem     Service           Provider         Description
all           memcpy            process WL       Copy memory contents
ST, IM, DP    dma_transfer      DMA controller   Offload memory copying
DP            display_update    Display_IF       Update screen contents from the framebuffer
IM            video_decode      Video_accel      MPEG4 video decoding
IM            video_encode      Video_accel      MPEG4 video encoding
IM            preprocess        Video_accel      Color conversion
IM            postprocess       Video_accel      Color conversion
ST            list_video_files  process WL       Provides a list of available films
the source and target addresses and the size of the transfer as parameters. The implementation of the service utilizes the methods inherited from the Master model
for sending and receiving OCP requests to transfer the requested amount of data.
The Display_IF provides a display update service for viewing the user interface of the entire system on the screen. However, it does not provide hardware-accelerated drawing services. The display update service takes framebuffer address,
display resolution and color depth as parameters. The Display_IF will read the contents of the framebuffer using the same approach as the DMA controller does every
time the service is requested.
The video accelerator provides hardware-accelerated video decoding and encoding services (Fig. 9.5).
The service parameters include source (m_attr->from) and target (m_attr->to)
addresses for the original and processed video data respectively, video resolution
(m_attr->X, m_attr->Y), macroblock size (m_attr->x, m_attr->y), bits per pixel
(m_attr->B) and compression ratio (m_attr->C). The implementation also relies on
the model parameters, which define the number of cycles required for the processing of discrete cosine transform (m_dct_cycles), quantization (m_q_cycles), motion compensation (m_mc_cycles), and variable length coding (m_vlc_cycles). The
decoding service contains several calls to read_macroblock, write_macroblock,
and process_macroblock methods; these are implemented on top of the internal
read, write, and execute primitive services respectively. The video accelerator provides also video preprocessing and postprocessing services for converting between
YUV420, YUV422 and RGB formats.
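Since Fig. 9.5 is not reproduced in the text, the following sketch only indicates a plausible shape of such a decoding loop; it is not the authors' code. The m_attr fields and m_*_cycles parameters are those quoted above, while the class name, loop structure, and helper bodies are our assumptions.

// Illustrative reconstruction of a macroblock-based decode service.
#include <cstdint>

struct DecodeAttr {
    int X, Y;            // video resolution
    int x, y;            // macroblock size
    int B;               // bits per pixel
    double C;            // compression ratio
    uint64_t from, to;   // source and target addresses
};

class VideoAccelSketch {
public:
    void decode(const DecodeAttr* m_attr) {
        for (int by = 0; by < m_attr->Y / m_attr->y; ++by) {
            for (int bx = 0; bx < m_attr->X / m_attr->x; ++bx) {
                read_macroblock(m_attr->from, bx, by);     // fetch compressed block
                process_macroblock(m_vlc_cycles + m_q_cycles +
                                   m_dct_cycles + m_mc_cycles);
                write_macroblock(m_attr->to, bx, by);      // store decoded block
            }
        }
    }
private:
    // per-stage cycle parameters of the accelerator model
    unsigned m_dct_cycles = 0, m_q_cycles = 0, m_mc_cycles = 0, m_vlc_cycles = 0;
    // helpers built on the internal read/write/execute primitive services
    void read_macroblock(uint64_t, int, int)  { /* read primitives  */ }
    void write_macroblock(uint64_t, int, int) { /* write primitives */ }
    void process_macroblock(unsigned)         { /* execute primitive */ }
};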
Each subsystem provides a SW service for memory copying. This is implemented
using the dma_transfer component-layer service in the three subsystems with a
DMA controller. The GP subsystem does not have DMA so the memory copying
service is implemented as an OS kernel process workload, which sends the required
number of read/write/execute primitives to one of the ARM11 processors when the
service is requested. The ST subsystem provides a list_video_files service for generating the list of films available in the film storage; the memcpy service is used to
move the actual film data.
Fig. 9.5 Extract from the implementation of the video decoding service
Implementing SW-based video decoding and encoding services utilizing the ARM11 processors in the GP subsystem will be considered later to explore their effects on performance.
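For illustration, a memcpy service issuing load primitives through the low-level interface could look roughly as follows. The chunk size, word width, and bookkeeping cost are assumptions made for this sketch and are not taken from the MVP model; LowLevelIf is the blocking interface sketched in Sect. 9.2.2.

// Sketch of a SW memcpy service realized as a process workload that issues
// load primitives to a processor model (illustrative only).
#include <cstdint>

struct LowLevelIf {
    virtual void read(uint64_t addr, unsigned words, unsigned bits) = 0;
    virtual void write(uint64_t addr, unsigned words, unsigned bits) = 0;
    virtual void execute(unsigned n_instructions) = 0;
    virtual ~LowLevelIf() = default;
};

// Copy 'bytes' bytes from 'src' to 'dst' in fixed-size chunks, charging a
// few bookkeeping instructions per chunk to the processor model.
void memcpy_service(LowLevelIf& cpu, uint64_t src, uint64_t dst, uint64_t bytes)
{
    const unsigned chunk_words = 8;          // assumed chunk size (32-bit words)
    uint64_t words = (bytes + 3) / 4;
    for (uint64_t done = 0; done < words; done += chunk_words) {
        unsigned n = static_cast<unsigned>(
            words - done < chunk_words ? words - done : chunk_words);
        cpu.read(src + 4 * done, n, 32);     // fetch a chunk
        cpu.execute(4);                      // loop/address bookkeeping
        cpu.write(dst + 4 * done, n, 32);    // store the chunk
    }
}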
The MVP platform model consisted of 12,935 lines of SystemC code after implementing the components and services. The figure includes comments but does
not take into account the OCP protocol models.
9.3.3 Modeling of the Application
Major parts of the activity in the MVP case example are modeled as services in the
platform model. The control part of the video playback application was modeled
using UML2 and then transformed to SystemC.
The MVP application provides a PlayVideo service to the user (Fig. 9.6). It is
decomposed into DisplayListOfVideos and PlaySelectedVideo service components.
Further decomposition of DisplayListOfVideos reveals that the list of video files is
produced by the list_video_files platform service, but showing the list on the display
has to be implemented by the application itself. The implementation consists of load
primitives used to write the display data to the framebuffer. PlaySelectedVideo, on
the other hand, is decomposed into ReadVideoFrame, decoding, and ShowDecodedFrame. Decode is provided by the platform service, whereas the other two can be provided by using the memcpy service.
Fig. 9.6 Decomposition of the MVP services
The next step was to generate SystemC models from the UML2 using Telelogic
Tau and Lund University's SystemC generator. Finally, the load information was
inserted into the model in the form of read, write, and execute load primitives for
those parts of the application that were not provided by the platform services.
9.3.4 Analysis of Simulation Results
The simulation was run for the duration of the video player start up and playback of
one second (25 frames). Table 9.5 presents the utilization of every component in the
platform. The second ARM11 in the GP subsystem and the video accelerator have
the highest load average, about 50 and 60 per cent respectively. The ARM processors
of the GP subsystem are somewhat burdened due to the video application start up
load. Also generic background load running in the GP subsystem and modeling the
execution of other applications is affecting the result. The video accelerator load
is relatively high considering that the use case consisted of only video decoding.
Getting the system to handle video encoding might require increasing the clock
frequency of the accelerator.
None of the components in ST and DP subsystems have been at the limit of
their capacity. It is clearly possible to execute more demanding applications on this
platform from the point of view of these subsystems, or the clock frequency of these
components could be reduced to decrease power consumption of the device. Furthermore, combining the functionality of the ST and DP subsystems should be considered because both subsystems have low utilization. Table 9.6 shows the processing
times of services from the subsystem layer.
Table 9.5 Utilization of MVP platform components
Component   GP subsystem   ST subsystem   IM subsystem   DP subsystem
ARM 0: 11%, 20%, 10%, 15%
ARM 1: 51%
Video_accel: 60%
DMA_controller: 5%, 0%, 6%
Display_IF, Network IF: 31%, 1%, 3%, 2%, 5%, 7%, 24%, 3%, 18%
Bus: 14%, 10%
SDRAM: 16%, 3%
SRAM: 2%
Table 9.6 Examples of processing times of services
Service        Subsystem   Average   Min      Max
dma_transfer   ST          2.4 ms    0.1 ms   55 ms
memcpy         ST          12 ms     10 ms    65 ms
video_decode   IM          14 ms     14 ms    14 ms
dma_transfer   DP          3.1 ms    2.9 ms   3.1 ms
memcpy         DP          10 ms     10 ms    11 ms
The simulation results have not been validated with measurements since the execution platform is an invented platform intended to portray a future architecture.
Consequently, there are no cycle-accurate simulators for the platform which could
be used to obtain reference performance data. However, we have modeled MPEG4
video processing and the OMAP platform earlier and compared those simulation results to measurements from a real application in a real architecture [20]. In that case
the average difference between simulations and measurements was about 12%. The
accuracy of the simulation approach has been validated with other case examples in
[18, 19].
9.4 Conclusions
An application-platform performance modeling and evaluation approach has been presented. It allows the application and the platform to be modeled at several levels of abstraction to enable early performance evaluation of the resulting system. Applications
are modeled in UML2 as workloads consisting of load primitives, whereas platform
models are cycle-approximate transaction-level SystemC models. Mapping between
UML2 application models and the SystemC platform models is based on automatic
generation of simulation models for system-level performance evaluation. The tool
support is based on the UML2 tool, Telelogic Tau, and the SystemC simulation tool
of OSCI.
The approach has been experimented with in a mobile video player case study. The
utilization of all of the components was obtained with simulations that also yielded
the processing times of the platform’s services. Performance bottlenecks and power
saving possibilities were identified based on the simulations. The results from MVP
case have not been verified, but typically the average and maximum errors have been
about 15% and 25% respectively in the other modeled case examples.
The approach enables performance evaluation early, exhibits light modeling effort, allows fast exploration iteration, reuses application and platform models, and
provides performance results that are accurate enough for system-level exploration.
In the future, the approach will be expanded to other criteria besides performance,
like power consumption. Further tool support for automation of some steps of the
approach is in progress.
Acknowledgements This work is supported by Tekes (Finnish Funding Agency for Technology
and Innovation) and VTT under the EUREKA/ITEA contract 04006 MARTES, and partially by
the European Community under the grant agreement 215244 MOSART.
References
1. K. Keutzer, A. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1523–1543, 2000.
2. MDA guide version 1.0.1, June 2003, document number: omg/2003-06-01. Available at http://
www.omg.org/mda/.
3. F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone,
A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware-Software Co-Design of Embedded Systems: The Polis Approach. Kluwer Academic, Dordrecht, 1997.
4. R. Suoranta, New directions in mobile device architectures. In The 9th Euromicro Conference
on Digital System Design (DSD’06), 2006.
5. K. Kronlöf, S. Kontinen, I. Oliver, and T. Eriksson. A method for mobile terminal platform
architecture development. In Advances in Design and Specification Languages for Embedded
Systems, pages 285–300. Springer, Dordrecht, 2007.
6. C. Ykman-Couvreur, V. Nollet, T. Marescaux, E. Brockmeyer, F. Catthoor, and H. Corporaal,
Design-time application mapping and platform exploration for MP-SoC customized run-time
management. IET Comput. Digit. Tech., 1(2):120–128, 2007.
7. L. Papadopoulos, S. Mamagkakis, F. Catthoor, and D. Soudris. Application-specific NoC platform design based on system level optimization. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI’07), 2007.
8. J. Xu, W. Wolf, J. Henkel, and S. Chakradhar. A design methodology for application-specific
networks-on-chip. ACM Transactions on Embedded Computing Systems, 5(2):263–280, 2006.
9. G. Beltrame, D. Sciuto, C. Silvano, P. Paulin, and E. Bensoudane. An application mapping
methodology and case study for multi-processor on-chip architectures. In 2006 IFIP International Conference on Very Large Scale Integration, pages 146–151, 16–18 October 2006.
10. P. Lieverse, P. van der Wolf, K. Vissers, and E. Deprettere. A methodology for architecture
exploration of heterogeneous signal processing systems. VLSI Signal Processing, 29(3):197–
207, 2001.
11. A. Pimentel and C. Erbas. A systematic approach to exploring embedded system architectures
at multiple abstraction levels. IEEE Transactions on Computers, 55(2):99–112, 2006.
12. T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES—Trace-based architecture performance evaluation with SystemC. Design Automation for Embedded Systems, 10(2–3):157–179, 2006.
13. J.M. Paul, D.E. Thomas, and A.S. Cassidy. High-level modeling and simulation of single-chip
programmable heterogeneous multiprocessors. ACM Transactions on Design Automation of
Electronic Systems, 10(3):431–461, 2005.
14. Telelogic Tau 3.0 User Guide (December 2006). Telelogic AB, 1998 pp.
15. Open SystemC Initiative Website. http://www.systemc.org
16. J. Kreku, M. Hoppari, T. Kestilä et al. Combining UML2 application and SystemC platform
modelling for performance evaluation of real-time embedded systems. EURASIP Journal on
Embedded Systems, 2008(3), 2008. doi:10.1155/2008/712329
17. J. Kreku, Y. Qu, J.-P. Soininen, and K. Tiensyrjä. Layered UML workload and SystemC platform models for performance simulation. In Proceedings of Forum on Specification and Design Languages, pages 223–228. ECSI, Gières, 2006.
18. J. Kreku, T. Kauppi, and J.-P. Soininen. Evaluation of platform architecture performance using
abstract instruction-level workload models. In International Symposium on System-on-Chip,
pages 43–48, 2004.
19. J. Kreku, J. Penttilä, J. Kangas, and J.-P. Soininen. Workload simulation method for evaluation
of application feasibility in a mobile multiprocessor platform. In The Euromicro Symposium
on Digital System Design, pages 532–539, 2004.
20. J. Kreku, M. Eteläperä, and J.-P. Soininen. Exploitation of UML 2.0-based platform service
model and SystemC workload simulation in MPEG-4 partitioning. In International Symposium on System-on-Chip, pages 167–170, 2005.
21. J. Kreku, M. Hoppari, K. Tiensyrjä, and P. Andersson. SystemC workload model generation
from UML for performance simulation. In Proceedings of Forum on Specification and Design
Languages. ECSI, Gières, 2007.
22. P. Andersson and M. Höst. UML and SystemC—comparison and mapping rules for automatic
code generation. In Proceedings of Forum on Specification and Design Languages. ECSI,
Gières, 2007.
Chapter 10
Adaptive Interconnect Models
for Transaction-Level Simulation
Rauf Salimi Khaligh and Martin Radetzki
Abstract Transaction level models are constructed for efficient simulation of complex embedded systems and systems-on-chip. Traditionally, the use case of a transaction level model dictates its accuracy and abstraction level, which are fixed during
simulation. Although the chosen level of accuracy may be required in some intervals, in some other intervals the model may simply be too accurate for the scenario
being simulated. This makes the model a simulation bottleneck and unnecessarily impedes the simulation performance. In this contribution we present an adaptive
approach for modeling interconnects. The abstraction level of an adaptive model dynamically adapts to the simulation scenario, increasing the simulation performance
without sacrificing the accuracy. We have developed adaptive models for point-to-point, FIFO based communication channels widely used in modern GALS and multiprocessor systems as well as models for complex, pipelined buses. We have applied
the proposed approach to two real-world communication protocols and developed
adaptive models of the AMBA AHB bus and the Fast Simplex Link (FSL) in SystemC, based on the recent OSCI TLM 2 standard. Our experiments clearly show the
increase in simulation performance compared to existing, non-adaptive models.
Keywords Transaction level modeling · Adaptive models · On-chip interconnect
models · SystemC
10.1 Introduction
Transaction level modeling enables us to simulate complete hardware and software
systems at very high speeds, which are often orders of magnitude faster than the
simulation speed of RTL and lower level models. Transaction level models are used
by different people for different use cases. Abstraction level of the models and their
accuracy depend on the use case at hand and there is a clear trade-off between the
simulation speed and accuracy.
The definition of the TLM abstraction levels and the terminology are still subject
of debate and standardization but speaking generally, transaction level models can
R. Salimi Khaligh ()
Embedded Systems Engineering Group, Institute of Computer Architecture and
Computer Engineering (ITI), Universitaet Stuttgart, Pfaffenwaldring 47, 70569 Stuttgart,
Germany
e-mail: salimi@informatik.uni-stuttgart.de
be divided into three categories. They are either untimed, approximately timed or timing accurate.
Fig. 10.1 Sufficient model accuracy
The untimed (UT) models are very abstract models which achieve
very high simulation speeds at the cost of having no timing accuracy at all. At the
lowest TLM abstraction level are the timing accurate models, for example the cycle
accurate (CA) models in the domain of synchronous systems. In such models, it is
possible to observe or reconstruct the state of the model at each clock cycle, making
the model accurate enough for fine grained performance analysis and verification.
This level of timing accuracy comes at the price of simulation performance, which is
often several orders of magnitude slower than the simulation of the untimed models.
Between the two levels are models with approximate timing, which are used for
example for early software profiling or architectural analysis.
In this work we are concerned with transaction level models of interconnects
used in state of the art embedded systems and systems-on-chip and the following
discussions will focus on specifics of these models. Figure 10.1 depicts examples of
different abstraction levels and their accuracies for a bus TLM. The untimed models of buses typically model addressing accurately but arbitration and timing details
are absent. In more accurate, timed bus models (e.g. the PVT model in Fig. 10.1),
arbitration is modeled approximately but the transfers are atomic with an approximate total latency. Preemption of transfers is usually not modeled in these models which may compromise functional correctness, or may leave some errors undetected. A simple example of such a situation is a read burst preempted by a write
burst to an overlapping region of the memory. If the bus model does not model
preemption of transfers, the simulation results will be inaccurate. In more accurate
models (e.g. the CX model in Fig. 10.1), arbitration and preemptions may be modeled accurately but transfers are still as chunks with aggregate timings. Accurate
timing of beats in a burst is usually modeled only in the slower, cycle accurate (CA)
bus models (e.g. Fig. 10.1(e)). Traditionally, designers either have to develop and
use multiple models of the same entity at different abstraction levels, one for each
use case, or in order to have an acceptable level of accuracy, they have to use a
single, accurate but slow model. The abstraction level of the chosen model will be
fixed during simulation; for the simulation session as a whole, this does not always represent the most efficient abstraction level and will unnecessarily degrade
the simulation performance. Consider again the bus example in Fig. 10.1. During
periods with a single master active (a) PVT accuracy is sufficient. Similarly, simulation of the preemption logic is only necessary when preemptions occur (b, d).
Cycle accurate timing of data beats during burst transfers may only be required in
some intervals (e). As an example consider a system where a display controller periodically reads the frame buffer in fixed length bursts. We may only be interested
in the effect of those bursts on bus utilization and timing of other transfers, but not
in the intra-burst timing details. In contrast, in the same system we may need the
timing and order of the beats in cache fill bursts which are performed by the processor. In this contribution we present a novel approach for modeling interconnects
which enables us to construct accurate and at the same time very efficient simulation models. This is achieved by incorporating a certain degree of adaptivity into the
models. Put simply, an adaptive model contains all the necessary levels of detail for
simulation at the most accurate abstraction level, but the details are only simulated
when required and hence the abstraction level of the model adapts to the simulation scenario dynamically. This is in contrast to traditional transaction level models with static abstraction levels, and is also different from multi-level, multiplexed
transaction level models which require several distinct models. We have focused on
two classes of communication, point-to-point communication and shared medium,
bus-based communication. The remainder of this chapter is organized as follows.
Section 10.2 gives an overview of the related work. Section 10.3 presents our modeling approach. In Sect. 10.4 details of SystemC and OSCI TLM 2 based models of
concrete bus (AMBA AHB) and point-to-point communication (FSL) protocols are
shown. Section 10.5 presents the simulation performance of the developed models
and compares them to existing models. Section 10.6 concludes this chapter with a
summary and directions for potential future work.
10.2 Related Work
Transaction level modeling (e.g. [7, 10, 17]) and TLM based design flows have become established design methods in industrial and research communities. TLM is
enabled by languages such as SystemC [12], SpecC [9] and SystemVerilog [1]. SystemC and the OSCI TLM standard [17] are currently the most general purpose and
the most widely accepted ones. The definition of the abstraction levels, terminology
and some core concepts are still subject of debate [6] and formalization [15]. An
interesting, systematic representation of transactions at different abstraction levels
and their relationships has been presented in the GreenBus model [14].
Designers are often only interested in certain time intervals or certain components during system operation. To accelerate simulation during uninteresting phases or to focus on the behavior of certain components, several approaches
have been proposed. An early pre-TLM example is SimOS [21] for simulation of
complete computer systems. This approach requires multiple models of each system component at different levels of accuracy. Another early example is the work
of Hines et al. [11] which is based on a layered model of communication between
processor and its peripherals.
In TLM, Beltrame et al. [5] have recently proposed a multi-accuracy approach
in which several models at different abstraction levels are wrapped in one model,
and the dynamic change of accuracy is based on multiplexing and demultiplexing
those models. In their model, switching between abstraction levels can only occur on transaction boundaries, so the models cannot react to conditions such as transaction
preemption, and such conditions can be missed during more abstract simulation
phases.
Models based on result oriented modeling (ROM) [23] are adaptive to the simulation scenario in that, although data transfers in ROM are atomic, the estimated duration of a transfer is extended should the transfer get preempted. Considering transaction boundaries only, the model is timing accurate. However, since the data is transferred at the boundaries without taking the preemption into account, data transfer is not accurate and functional correctness can be impaired. The dynamic change
of abstraction level in the existing approaches is not bumpless and may result in
loss of accuracy. An approach to accuracy adaptive simulation of transaction level
models which does not require multiplexing of distinct models was recently introduced in [20], and was demonstrated with an abstract communication protocol. The
applicability of the approach to real-world protocols, data-dependent latencies and cycle accuracy of the data transfers was not addressed.
We have used concrete industrial communication protocols to evaluate our modeling approach. The Fast Simplex Link (FSL) [25] is a point-to-point FIFO based
communication protocol which is used in FPGA-based multiprocessor systems
(e.g. [13]). The AMBA AHB [3] is a representative of complex high performance
industrial buses, widely used in the evaluation of modeling methodologies. In [18], Pasricha et al. used AHB in the evaluation of their proposed CCATB abstraction
level. In [24] SpecC-based models of AHB at different abstraction levels are compared. The AHB model of Caldari et al. [8] is an early SystemC 2.0 model which is
not based on the OSCI TLM standard. A model of the AHB protocol based on an
object oriented TLM approach [19] is presented in [22].
10.3 Adaptive Interconnect Models
In this section we begin with the relatively simple case of point-to-point communication and then move on to more complex bus-based communication protocols.
10.3.1 Point-to-Point Communication
Point-to-point communication is widely used in modern, complex systems-on-chip.
For example, mixed clock FIFOs are used for communication between locally synchronous islands in Globally Asynchronous, Locally Synchronous (GALS) systems
(e.g. [2, 16]). Another example is use of point-to-point FIFO-based channels for
communication between processors in multiprocessor systems (e.g. [13, 26]) or between processors and dedicated co-processors. In this chapter we focus on communication channels that follow the general logical structure shown in Fig. 10.2.
That is, a FIFO-based uni-directional channel which consists of internal memory,
different producer and consumer clocks and status and handshaking signals. A bidirectional channel can be modeled as two uni-directional channels in opposite directions. The clocks may be from different clock domains and the production and
Fig. 10.2 A FIFO-based point-to-point communication channel
consumption rates may be different. The status of the FIFO is reflected in the status
signals (e.g. full and empty). The physical architecture of the channel is not relevant
to our discussion.
In an abstract untimed model, point-to-point communication would usually be
modeled with a single function call from the producer to the consumer, transferring
all data items at once. A separate model of the channel most probably will not be
present. A certain level of timing accuracy can be achieved by using a FIFO model
which accounts for the total latency of data transfer. For example, let Tp and Tc be
the periods of the producer-side and consumer-side clocks respectively and assume
that the FIFO is unbounded, and that writing or reading a single data item requires
one clock cycle. A lower bound on the total latency of transferring N items from
the producer to the consumer under the above assumptions can be easily estimated.
For Tp < Tc , the time required by the consumer to read the N items is greater than
the time required by the producer to transfer the items to the FIFO. The first item
can be retrieved by the consumer Tp time units after the producer has started the
transfer. It can be easily seen that the total duration of the transfer in this case (i.e.
no congestion, no producer-side or consumer-side idle cycles) is Tp + N · Tc (the
case for Tp > Tc is similar). Such an estimate can be incorporated in the model, for
example by means of a wait() inside the channel model. To model details such as
congestion or producer side and consumer side idle cycles, a more accurate model
is required. The most natural model usually used for this purpose is a FIFO which
is written to and read from one item at a time. The accuracy of such a model comes
at the price of impeding the simulation performance. More importantly, in some intervals during simulation the model will be unnecessarily accurate. Figure 10.3
shows two simple scenarios of data transfer between a producer and a consumer
using a FIFO channel. In scenario (a) the FIFO is initially empty and there is no
congestion. Here the accuracy of a model which only takes into account the total
latency would be sufficient. In scenario (b) however the FIFO is initially not empty
and there is an interval of congestion. The producer is blocked after transferring two
data items, until the consumer removes some items from the FIFO (x and y in the
figure). Eventually, the producer transfers the remaining items to the FIFO. To have
this level of detail, the more accurate but slower model is required.
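As a minimal illustration of the latency estimate derived above (the function name and the unbounded-FIFO assumption are ours, not part of the original model), the lower bound could be computed and then consumed by a single wait() inside a channel model:

#include <systemc>
using sc_core::sc_time;

// Sketch: lower bound on the total latency of moving n_items through an
// unbounded FIFO with producer clock period tp and consumer clock period tc,
// assuming one item per cycle and no idle cycles on either side.
sc_time transfer_latency(unsigned n_items, const sc_time& tp, const sc_time& tc)
{
    if (tp < tc)
        return tp + tc * double(n_items);   // consumer side dominates (Tp < Tc)
    else
        return tc + tp * double(n_items);   // producer side dominates (Tp > Tc)
}

// Inside a channel model the estimate could be consumed in one call:
//     wait(transfer_latency(n, producer_period, consumer_period));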
Based on these observations, our proposed model is capable of maintaining the
highest level of accuracy required in TLM use cases, without unnecessarily
Fig. 10.3 Communication over a FIFO with and without congestion
sacrificing the simulation speed. In the proposed model the FIFO slots do not store individual data items, but chunks of data belonging to an atomic transfer. This makes
item-by-item reading and writing superfluous and will increase the simulation performance significantly. Without congestion (e.g. Fig. 10.3(a)) data will be transferred in one chunk from the producer to the channel, and in one chunk from the
channel to the consumer. In case of a congestion (e.g. Fig. 10.3(b)) the transfer will
be divided into multiple chunks. To transfer a set of data items, the producer initially makes an optimistic attempt to transfer all of the items in one chunk. If
there is enough room in the FIFO, all of the items will be stored in the FIFO as one
chunk, to be retrieved later by the consumer. However, if the FIFO cannot receive
all items at once, only the fitting number of items will be stored. The producer will
be notified and must transfer the remaining items later. The result is that from the
producer and consumer points of view, the size of the transferred chunks and their
timings will always be accurate, without the need to transfer data one item at a time.
The underlying principle and a comparison with related approaches can be found
in [20]. To summarize, the model is adaptive in the sense that it always stays at the lowest abstraction level which is required at any instant. In the example shown in Fig. 10.3, in interval (a) the model will be equivalent to an abstract but fast model, and in interval (b) it will be equivalent to a more accurate but slower model. The difference is that in scenarios such as interval (a), the adaptive model will be much more efficient than a cycle accurate model or a model which requires item-by-item transfer.
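The following plain C++ sketch illustrates the chunk-based principle behind this behavior; it is our own simplification (all names included) rather than the authors' implementation, and it omits the actual payload data.

#include <algorithm>
#include <deque>

struct chunk { unsigned n_items; };   // payload data omitted for brevity

class adaptive_fifo {
public:
    explicit adaptive_fifo(unsigned depth_in_items) : free_items_(depth_in_items) {}

    // Optimistic producer-side write: as many of n_items as currently fit are
    // stored as a single chunk; the return value tells the producer how many
    // items were accepted, so it can transfer the remainder in a later attempt.
    unsigned write(unsigned n_items) {
        unsigned accepted = std::min(n_items, free_items_);
        if (accepted > 0) {
            chunks_.push_back(chunk{accepted});
            free_items_ -= accepted;
        }
        return accepted;
    }

    // Consumer-side read: a whole chunk is retrieved at once instead of
    // item-by-item, which is what makes the model fast without congestion.
    bool read(chunk& c) {
        if (chunks_.empty()) return false;
        c = chunks_.front();
        chunks_.pop_front();
        free_items_ += c.n_items;
        return true;
    }

private:
    std::deque<chunk> chunks_;
    unsigned free_items_;
};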
10.3.2 Bus-Based Communication
In contrast to point-to-point communication, bus-based communication (i.e. shared-medium, multi-master, multi-slave communication) involves arbitration and address decoding.
Complex timing characteristics of state-of-the-art bus protocols are usually only representable at very low (often CA) TLM abstraction levels (e.g. [22]) or in cycle
based models (e.g. [4]). In the following discussions we focus on pipelined bus protocols but the concepts can be easily applied to simpler non-pipelined protocols.
Figure 10.4 shows an eight beat burst in a typical pipelined bus from request to
completion, which is preempted by a second transfer and resumed afterwards to
completion. Some time after requesting bus access at req, the initiator M0 is granted
access to the address bus at a1 (i.e. the transfer enters the address phase) and later
Fig. 10.4 A preempted pipelined burst
to the data bus at d1. At dp the transfer is prematurely preempted (e.g. because of
a higher priority transfer or being split). Later at a2 and d2, the master is granted
access to the address and data buses respectively and at de the transfer is completed.
Not all of these details are used, or are relevant for most TLM applications even
at lower abstraction levels. The total latency of the transfer (L), the duration of the
uninterrupted parts (here L1, L2), the amount of data transferred in each part and the
additional latency caused by preemption (Tp) almost always provide a sufficient level
of accuracy in TLM. It is sufficient that the timing of the atomic (i.e. non-preempted)
parts and the data associated with each part be accurate. Accurate timing of each
beat (e.g. different transfer latencies of D3 and D4) is not required in all cases. Observing these facts and that bursts always target groups of related addresses, we can
abstract away unnecessary timing details and at the same time accurately model arbitration, latencies and data transfer timings by incorporating degrees of adaptivity,
without the need for cycle accuracy. Unnecessary notifications and interactions between models are avoided to increase simulation performance. For example, unless
timing accurate data transfer for bursts is required, masters are only notified after
completion and preemption of transfers, and slaves are only notified when the data
phase of the transfers start.
Similar to the model used in [22], we use an abstract model in the form of a pipeline,
whose stages represent the address and data bus(es) and occupation of those buses
by transactions (here a transaction corresponds to an uninterrupted portion of a bus
transfer). Figure 10.5 shows an abstraction of the same burst from Fig. 10.4 in terms
of its atomic parts T1 and T2 together with the pipeline model. It can be seen that
the state of the pipeline only changes upon address and data bus handovers (hvr
points in the figure). We use a global event to indicate the instances at which bus
handover needs to take place. Using this event and the pipeline model, the bus model
keeps track of the transactions currently in the address and data phases, and controls
the timing of the transactions. This event may be triggered by the slaves (e.g. upon
completion of single transfers, at the last address phase of bursts or when splitting
transfers), the masters (e.g. when a request arrives at an idle bus) or the bus model
itself (e.g. when the data bus is idle). This makes a clock event and clock-based
operations superfluous and increases the simulation performance significantly. From
the bus handover point of view the model adapts to the simulation scenario by being
Fig. 10.5 An abstract model of bus handover
Fig. 10.6 Arbitration
at the most accurate level when necessary (e.g. early in the burst in Fig. 10.5) and at
the higher abstraction levels otherwise.
Another level of adaptivity can be added to the model by simulating the arbitration mechanism only when necessary. As an example, in Fig. 10.6 the next master
to be granted the address bus is determined at arb1 and does not change until hvr1.
The result of arbitration is not used until the actual handover of the address and data
buses. In our proposed model, all requests are annotated with a time stamp and upon
occurrence of a handover event the bus is able to retroactively check the requests and
determine the master to be granted the bus. In case of sequential arbitration mechanisms this would be the closest previous clock edge. A similar retroactive check of
requests is also possible in case of combinational arbitration mechanisms in which
a request can be granted in the same cycle. In Fig. 10.6, time arb2 shows another
example where a request arrives at an idle bus and is granted immediately, causing
a handover notification.
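A sketch of such a retroactive check is shown below; the fixed-priority scheme and all names are assumptions made only for illustration.

#include <systemc>
#include <cmath>
#include <cstddef>
#include <vector>

// Each pending request is annotated with the simulation time at which it was issued.
struct request { unsigned master_id; sc_core::sc_time issued; };

// At a handover event, grant the highest-priority master whose request was
// already visible at the closest previous clock edge (sequential arbitration).
// Returns an index into 'pending', or -1 if nothing can be granted.
int arbitrate(const std::vector<request>& pending,
              const sc_core::sc_time& now,
              const sc_core::sc_time& clock_period)
{
    sc_core::sc_time edge = clock_period * std::floor(now / clock_period);
    int granted = -1;
    for (std::size_t i = 0; i < pending.size(); ++i) {
        if (pending[i].issued <= edge) {
            // assumption: fixed priority, lower master_id wins
            if (granted < 0 ||
                pending[i].master_id < pending[std::size_t(granted)].master_id)
                granted = int(i);
        }
    }
    return granted;
}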
We exploit two observations regarding timing of beats in bursts and preemption
of transfers to add further levels of adaptivity to the bus model. The first observation is that the timing of individual data beats during bursts is not always required.
For example, when benchmarking software on a virtual platform, for the processor
model the total latency of memory access is sufficient, but for other components in
the model more accurate timing may be required. We allow the accuracy of a data
transfer to be indicated by the initiators and targets of a transfer. For what we term
Fig. 10.7 Accuracy of data transfer
an Aggregate Timed Transaction (ATT), only the aggregate duration will be known
by the master and all data will be transferred at the beginning (for write) or at the end
of the duration of the transaction (for read). For the read bursts shown in Fig. 10.7,
the data phase of the ATT starts at d3 and the slave transfers data of both beats at d5.
We call a transaction with detailed beat-wise timing a Segment Timed Transaction
(STT). For the STT shown in Fig. 10.7 for example, data transfer occurs at d1 and d2.
Slaves may support ATTs only, STTs only, or both. Supporting only STTs, for example, would force the data transfer to or from the slave to be beat-wise. Using STTs
and ATTs the transfer of data will be accurately timed only when necessary.
The second observation is that not all transfers are preempted and cycle accuracy
is not necessary to ensure accurate transfer of data corresponding to the atomic parts.
Here once again we use the adaptive approach introduced in [20]. Figure 10.7 also shows a burst preempted by a higher priority transfer. At the
beginning of the data phase at d6 the slave optimistically calculates the duration of
the transfer L and waits for that amount of time and simultaneously for a preemption
event. When notified of preemption, the slave transfers only the data corresponding
to the non-preempted part. Handling of preemption in case of STTs is similar and
straightforward.
10.4 Model Implementation
10.4.1 An Adaptive FSL Model
Fast Simplex Link (FSL) [25] is a FIFO based communication channel from Xilinx
which can be used for unidirectional communication between components in FPGA
based designs. For example, the MicroBlaze processor core, also from Xilinx, has
input and output FSL interfaces which can be used in multiprocessor designs, or
for communication between the processor and dedicated co-processors. An FSL
channel is a multi-clock FIFO with a configurable depth which can be used in synchronous mode (i.e. same producer and consumer clocks) or asynchronous mode
(i.e. different producer and consumer clocks). Based on the principles introduced
in Sect. 10.3 we have developed an adaptive model of FSL. The model is based on
the OSCI TLM 2 standard [17]. The model uses the blocking transport interfaces
Fig. 10.8 Transport calls for a FSL transfer
and is modeled as a passive, target-only sc_module. Internally a FIFO data structure
is used to keep track of the chunks, and each chunk may occupy more than one
FIFO slot. The model is not clocked and only requires the periods of the producer
and consumer clocks as constructor parameters. Each chunk has a write timestamp
which is the simulation time at which the producer has started writing the chunk
to the FIFO. Similarly, each chunk has a read timestamp which is the simulation
time at which the consumer has started reading the chunk. Using the timestamps,
and the internal FIFO data structure, the number of free FIFO slots at any instant in
simulation time can be determined. This is used to implement the adaptive behavior
explained previously in Sect. 10.3.
Figure 10.8 shows the sequence of b_transport calls from the producer and consumer for the example shown in Fig. 10.3(b).
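To make the structure described above more concrete, the following skeleton sketches one possible shape of such a passive, chunk-based target module using the TLM-2.0 blocking transport interface. It is only a sketch: the names are ours, and the free-slot bookkeeping based on the timestamps, as well as the actual data handling, are omitted.

#include <deque>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>

struct fsl_channel : sc_core::sc_module {
    // Producer and consumer both act as TLM initiators towards the channel.
    tlm_utils::simple_target_socket<fsl_channel> producer_socket;
    tlm_utils::simple_target_socket<fsl_channel> consumer_socket;

    fsl_channel(sc_core::sc_module_name name,
                sc_core::sc_time producer_period,
                sc_core::sc_time consumer_period)
        : sc_core::sc_module(name),
          producer_socket("producer_socket"),
          consumer_socket("consumer_socket"),
          tp_(producer_period), tc_(consumer_period)
    {
        producer_socket.register_b_transport(this, &fsl_channel::producer_transport);
        consumer_socket.register_b_transport(this, &fsl_channel::consumer_transport);
    }

private:
    // One entry per chunk; a chunk may represent many FSL words. The full model
    // also records a read timestamp when the consumer starts reading the chunk.
    struct chunk { unsigned n_items; sc_core::sc_time write_stamp; };

    void producer_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        // Optimistic write of the whole transfer as one chunk; the full model
        // would accept only the items that fit into the free slots.
        chunk c;
        c.n_items     = trans.get_data_length();
        c.write_stamp = sc_core::sc_time_stamp() + delay;
        chunks_.push_back(c);
        delay += tp_ * double(c.n_items);            // producer-side write latency
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }

    void consumer_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        if (chunks_.empty()) {                        // nothing to read yet
            trans.set_response_status(tlm::TLM_GENERIC_ERROR_RESPONSE);
            return;
        }
        chunk c = chunks_.front();
        chunks_.pop_front();
        delay += tc_ * double(c.n_items);             // consumer-side read latency
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }

    sc_core::sc_time tp_, tc_;
    std::deque<chunk> chunks_;
};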
10.4.2 An Adaptive AHB Model
We have implemented an adaptive model of the AMBA AHB [3] in SystemC based
on the OSCI TLM 2 [17] standard. The implemented subset of features covers most
of the AHB specification and is large enough to validate our proposed model and
the missing features (e.g. RETRY transfers) do not affect our results. The bus model
is implemented as a single sc_module with an array of tlm_initiator_socket and
tlm_target_socket sockets for connection with the masters and slaves respectively.
Our model uses the nonblocking transport interfaces and utilizes the backward path.
The bus model itself does not require a clock event, since all timing is handled based
on the aforementioned handover event and pipeline model. Only the period of the
bus clock is needed in some special cases (e.g. a timed notification of a handover
event when a request arrives at an idle bus), which is set as an attribute of the model.
A bus request starts with an nb_transport_fw call from the master to the bus carrying
the payload for that transaction. The generic payload is extended with the relevant
AHB specific details and information specific to the adaptive model. The most important extended attributes are the time stamp of the request, a preemption event and
Fig. 10.9 Bus handover
the transaction type (ATT or STT). The time stamp is set at the time of the request
by the bus model and is used later by the retrospective arbitration mechanism. The
preemption event is notified by the bus model in case of preemption and is used by
the slaves for accurate transfer of data corresponding to the atomic part. The transaction type determines the requested accuracy of data transfers. In AHB, timing of
the request signal is crucial for performing back-to-back pipelined transfers. To enable this, inside the bus model one FIFO data structure per master is used to hold
the incoming payloads.
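As an illustration of how the generic payload could carry these attributes, the following sketch uses the standard TLM-2.0 extension mechanism; the class and member names are our own and not taken from the original model.

#include <systemc>
#include <tlm>

struct ahb_extension : tlm::tlm_extension<ahb_extension> {
    enum accuracy_t { ATT, STT };            // aggregate vs. segment timed

    sc_core::sc_time request_stamp;          // set by the bus on nb_transport_fw
    sc_core::sc_event preemption;            // notified by the bus on preemption
    accuracy_t        accuracy = ATT;        // requested data transfer accuracy

    tlm::tlm_extension_base* clone() const override {
        ahb_extension* ext = new ahb_extension;
        ext->request_stamp = request_stamp;
        ext->accuracy      = accuracy;
        return ext;                           // the event itself is not copied
    }
    void copy_from(const tlm::tlm_extension_base& other) override {
        const ahb_extension& ext = static_cast<const ahb_extension&>(other);
        request_stamp = ext.request_stamp;
        accuracy      = ext.accuracy;
    }
};

Such an extension would be attached to the payload with set_extension() before the transaction is forwarded.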
Bus handover is implemented in an SC_METHOD sensitive to the handover event and is shown in simplified form in Fig. 10.9. AD and DT are the transactions in the
address and data phases prior to notification and AD’ and DT’ those after handover.
Here the internal pipeline model (Sect. 10.3) is updated (a). The payload of the transaction entering the data phase is transported to the corresponding slave (b) (only
once per burst) and the transaction entering the address phase is determined (c). In
case of an idle data bus a timed handover event is set by the bus model (d) so that
the transaction currently in the address phase can resume accordingly. Activity of
the slaves is implemented in an SC_THREAD and is shown in Fig. 10.10 for a simple, memory-like slave performing a single data transfer. Processing starts with the reception
of a payload (a), optimistic estimation of the latency of the transfer (b), determining the correct amount of data (c) and transfer of data (d). The activity for bursts is
similar. In case of bursts, for accurate timing of bus handover a timed notification
of the handover event by the slave at the end of the last address phase is necessary. Figure 10.11 shows a typical sequence of transport calls along the forward and
backward paths for a burst transfer. The master requests access (a), but a higher priority
transaction occupies the address bus until (b) where the handover event is notified
and retrospective arbitration is performed. Later (c) the payload is transported to the
slave and the data phase begins. The nb_transport call at (d) is only used by the bus
model to notify a handover event, the last call (e) causes a handover event and is
then forwarded to the master on the backward path (f).
Fig. 10.10 Processing in slaves
Fig. 10.11 Transport calls for an AHB transfer
10.5 Experimental Results
To evaluate the simulation performance of our models, we have compared the adaptive models to existing, fixed accuracy models, whose accuracies were comparable
to the best level of accuracy which the adaptive models were able to deliver. For
FSL we have used a sc_fifo based cycle accurate model and for AHB we have used
a cycle accurate OSCI TLM 1 model.
The adaptive FSL model achieves its highest simulation speed (i.e. it is at the
highest abstraction level) when there is no congestion. To measure the simulation
performance in this case we used a small model consisting of a producer, a consumer
and a FIFO model which was effectively unbounded for the simulation scenario.
Figure 10.12 shows the results. Here the producer performs back-to-back transfers
of increasing sizes without idle cycles, and similarly the consumer retrieves the
items from the FIFO without any idle cycles. Since the cycle accurate model transfers the data byte-by-byte in all situations, its simulation performance stays effectively constant. However, without congestion, in the adaptive model data is always transferred in chunks, which results in a significant increase in simulation speed for
larger chunks.
To measure the effect of congestion, we repeated the previous tests with different FIFO sizes. Predictably, the simulation performance of the cycle accurate model
did not change considerably with a reduction in FIFO size. Figure 10.13 shows the
measurement results for the adaptive FSL model (the curves for the cycle accurate
Fig. 10.12 Maximum simulation performance of the FSL model
Fig. 10.13 Effect of FIFO congestion
Fig. 10.14 Maximum simulation performance of the AHB model
model have been omitted here for clarity). Using an adaptive model with a FIFO
size of 1, data is transferred byte-by-byte, similar to the cycle accurate model. This
represents the most accurate level of the adaptive model and its worst case simulation performance. The reduction in the simulation speed with a limited FIFO
size which allows data transfers in chunks larger than one byte can be seen in the
curve for a FIFO size of 50. For transfer sizes larger than 50 bytes, the transfer will be broken into multiple chunks, and this results in a drop in the simulation performance. However, the model is still much more efficient than the cycle accurate
model.
Figure 10.14 compares the maximum simulation performance of the adaptive
AHB model with that of a cycle accurate AHB model [22] in a single master, single slave setup (a comparison of the used cycle accurate model with AHB models
mentioned in the related work can be found in [22]). The master performs transfers of increasing sizes using the most efficient combination of AHB transactions
(single data transfers and valid fixed length bursts). Again, the best performance of
the adaptive model is achieved for larger transfers where it is almost an order of
magnitude faster than the cycle accurate model. This is the result of a reduction in
handover events per transaction, which is especially significant for large bursts. The
worst case performance of the adaptive model is reached for smaller transfer sizes
where one handover event is required per transaction.
As a result of bus traffic and eventual preemption of transfers, the adaptive bus
model will be at lower abstraction levels to maintain the accuracy, and predictably
the simulation performance will be lower than the best case. This can be seen in
Fig. 10.15 Effect of preemptions
Fig. 10.15. Here a high priority master performing single data transfers with varying
degrees of bus utilization is added to the system. In this setup, the worst starvation-free case occurs at u = 66%, where all preemptable transfers are preempted, which
results in a decrease in simulation performance as seen in the figure. For lower bus
utilizations, the decrease in the simulation performance is also lower.
The above-mentioned scenarios are, however, rather synthetic. To measure the
performance of the bus model in more realistic situations, we simulated different communication architectures for an ARM926EJ-S based implementation of an
MPEG decoder. The four masters in the system were the Processor (instruction and
data interfaces), a DMA controller and an LCD controller, which communicated
with six slaves. The traffic generators used in the simulations initiated bus transactions based on the traces captured by a logic analyzer from a running system
and therefore represent real transaction patterns. The simulated communication architectures were chosen such that the interdependency of the transactions was respected as much as possible. Table 10.1 summarizes the results. Arch1 represents
the communication architecture of the real system which uses a multi-port memory
controller (MPMC) and has four single-master buses. In Arch2, the DMA Controller
and the LCD controller (no overlapping transactions) share the same bus. In Arch3
and Arch4 the LCD controller uses the same data bus as the CPU; these architectures differ only in the priority assignments of the LCD controller and the Processor and are therefore directly comparable. The slight decrease in performance
due to the larger number of preemptions can be seen by comparing the results for Arch3
and Arch4. In Arch1 the bus model is always at the highest abstraction level and
Table 10.1 An ARM-based MPEG case study

                                Arch 1    Arch 2    Arch 3    Arch 4
Number of Buses                 4         3         3         3
Number of Transfers             305459    305459    305459    305459
Number of Preemptions           0         0         540       3038
Simulation Speed (MCycles/S)    9.11      8.94      9.09      8.76
therefore the highest performance is achieved (considering the number of buses in
the model).
10.6 Conclusion
We have shown that by incorporating adaptivity in a transaction level model, it is
possible to construct a single model which is capable of delivering the required level
of accuracy for different situations during simulation. More importantly, an adaptive model does not unnecessarily sacrifice the simulation performance. Our experimental results clearly show the benefits of using adaptive models and the potential
increase in simulation speed, which is the most important motivation for using transaction level models. This has proven to be a very promising research direction and
we are currently applying our proposed approach to other interconnect models such
as networks-on-chip, processor models and reconfigurable architectures.
References
1. Accellera Organization, Inc. SystemVerilog 3.1a Language Reference Manual, May 2004.
2. R.W. Apperson, Z. Yu, M.J. Meeuwsen, T. Mohsenin, and B.M. Baas. A scalable dual-clock
FIFO for data transfers between arbitrary and haltable clock domains. IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, 15(10):1125–1134, 2007.
3. ARM Limited. AMBA Specification, Version 2.0, May 1999.
4. ARM Limited. AMBA AHB Transaction Level Modeling Specification, Version 1.0.1, May
2006.
5. G. Beltrame, D. Sciuto, and C. Silvano. Multi-accuracy power and performance transaction-level modeling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 26:1830–1842, 2007.
6. M. Burton, J. Aldisy, R. Guenzel, and W. Klingauf. Transaction level modelling: a reflection
on what TLM is and how TLMs may be classified. In Proceedings of the Forum on Specification and Design Languages (FDL ’07), September 2007.
7. L. Cai and D. Gajski. Transaction level modeling: an overview. In Proceedings of the 1st
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS ’03), October 2003.
8. M. Caldari, M. Conti, M. Coppola, S. Curaba, L. Pieralisi, and C. Turchetti. Transaction-level
models for AMBA bus architecture using SystemC 2.0. In Proceedings of the Conference on
Design, Automation and Test in Europe (DATE ’03), March 2003.
9. R. Dömer, A. Gerstlauer, and D. Gajski. The SpecC Language Reference Manual, Version 2.0.
SpecC Technology Open Consortium (www.specc.org), December 2002.
10. F. Ghenassia. Transaction-Level Modeling with SystemC: TLM Concepts and Applications for
Embedded Systems. Springer, New York, 2006.
11. K. Hines and G. Borriello. Dynamic communication models in embedded system cosimulation. In Proceedings of the Design Automation Conference (DAC ’97), June 1997.
12. IEEE Computer Society. Standard SystemC Language Reference Manual. Standard 1666-2005, March 2006.
13. Y. Jin, N. Satish, K. Ravindran, and K. Keutzer. An automated exploration framework for
FPGA-based soft multiprocessor systems. In CODES+ISSS ’05: Proceedings of the 3rd
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 273–278. Assoc. Comput. Mach., New York, 2005.
14. W. Klingauf, R. Günzel, O. Bringmann, P. Parfuntseu, and M. Burton. GreenBus: a generic
interconnect fabric for transaction level modelling. In Proceedings of the 43rd Annual Conference on Design Automation (DAC ’06), July 2006.
15. B. Niemann and C. Haubelt. Towards a unified execution model for transactions in TLMs.
In Proceedings of the Fifth ACM–IEEE International Conference on Formal Methods and
Models for Codesign (MEMOCODE ’07), June 2007.
16. K. Niyogi and D. Marculescu. System level power and performance modeling of GALS point-to-point communication interfaces. In Proceedings of the International Symposium on Low
Power Electronics and Design (ISLPED ’05), pages 381–386, August 2005.
17. Open SystemC Initiative (OSCI) TLM Working Group (www.systemc.org). Transaction Level
Modeling Standard 2 (OSCI TLM 2), June 2008.
18. S. Pasricha, N. Dutt, and M. Ben-Romdhane. Fast exploration of bus-based on-chip communication architectures. In Proceedings of the International Conference on Hardware/Software
Codesign and System Synthesis (CODES+ISSS ’04), September 2004.
19. M. Radetzki. SystemC TLM transaction modelling and dispatch for active objects. In Proceedings of the Forum on Specification and Design Languages (FDL ’06), September 2006.
20. M. Radetzki and R. Salimi-Khaligh. Accuracy-adaptive simulation of transaction level models. In Proceedings of Design, Automation and Test in Europe 2008 (DATE 08), March 2008.
21. M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation:
the SimOS approach. IEEE Parallel and Distributed Technology: Systems and Applications,
3(4):34–43, 1995.
22. R. Salimi-Khaligh and M. Radetzki. Efficient and extensible transaction level modeling based
on an object oriented model of bus transactions. In Proceedings of the International Embedded
Systems Symposium (IESS ’07), May 2007.
23. G. Schirner and R. Dömer. Fast and accurate transaction level models using result oriented
modeling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design (ICCAD ’06), November 2006.
24. G. Schirner and R. Dömer. Quantitative analysis of transaction level models for the AMBA
bus. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE ’06),
March 2006.
25. Xilinx, Inc. Fast Simplex Link (FSL) Specification, Version 2.11a, June 2007.
26. S. Xu and H. Pollitt-Smith. A multi-microblaze based SOC system: from SystemC modeling
to FPGA prototyping. In IEEE International Workshop on Rapid System Prototyping, pages
121–127, 2008.
Chapter 11
Efficient Architecture Evaluation
Using Functional Mapping
C. Kerstan, N. Bannow and W. Rosenstiel
Abstract For an efficient design flow the substantiation of early design decisions is
obligatory. This implies a simple and fast architecture evaluation using simulative
approaches. This paper introduces an approach which enables a powerful hardware/
software partitioning and the reuse of already existing functional code by applying
minimal code modifications only. The primary objective is to provide a solution to
enable an automated application. In this novel approach, code readability and transformation effort are improved significantly by using the powerful operator overloading mechanism of C++. The implementation can easily be customized and combined
with other approaches concerning simulative design evaluations. For example, it is
possible to realize implicit timing behavior, transparent communication over module boundaries, tracing of simulation data or collecting debugging information.
Keywords SystemC · TLM-2.0 · Functional mapping · Communication and
timing behavior
11.1 Introduction
Since the introduction of SystemC [9, 12], the improvement of the system level design flow has been a main intention. Besides different abstraction levels, several tools and methods facilitate a more efficient system development compared to classical
approaches. High level synthesis tools like Mentor's Catapult [6, 7] support the direct conversion of C/C++ or SystemC to VHDL, assuming hardware/software partitioning is already done. Object-oriented approaches like OSSS [2, 8]
allow a more general refinement using SystemC.
However, despite using the SystemC approach, there is still the need for a fast evaluation of several possible architectures before the final mapping decision and the hardware/software implementation. Several approaches exist to substantiate early design decisions and support a more or less flexible design flow. An approach to efficient system modeling is presented in [10], which partly uses high level synthesis to obtain well-founded design decisions. With it, the whole system can be synthesized in hardware, but the power of C/C++ and SystemC is reduced to a synthesizable subset.
C. Kerstan ()
Robert Bosch GmbH, Corporate Sector Research and Advance Engineering,
71701 Schwieberdingen, Germany
e-mail: Christian.Kerstan@de.bosch.com
11.1.1 Functional Mapping
A more generic approach is presented in [3] by introducing Module Adapters (MAs)
which are responsible for intermodule communication, tracing and data-dependent
timing behavior (see Fig. 11.1). This adapter concept allows an implementation of
the functional behavior in C/C++ that is independent from the architecture. The
architecture itself can be realized in SystemC without regarding the concrete functional behavior. Therefore, the simulation model can be generated by using empty
module shells together with already existing components, which can easily be combined thanks to the TLM-2.0 standard [13]. The module shells provide simple communication and tracing interfaces without realizing any functional behavior. The functional code has to be mapped into these shells to achieve the desired behavior (see
Fig. 11.2).
However, particularly the code modification necessary for a functional mapping to an architecture is laborious. The problem of parsing SystemC or C/C++ code
for further analysis and processing is addressed by several works. Similar to this
paper, [16] also uses the power of the C++ compiler instead of implementing its own parser.
Fig. 11.1 The approach introduced in [3] allows architecture independent functional implementation
Fig. 11.2 Functional mapping of C/C++ code to a SystemC architecture
In contrast to [16], where a complete representation of a SystemC design is generated, the main focus of this approach is to smoothly enrich the C++ code with the required information to gain early but expressive architecture estimations. Since simulations can evaluate several system aspects, this work focuses on the timing behavior based on a correct functional execution.
11.1.2 Timing Behavior
The timing behavior of a module is introduced in different places. First of all timing
is caused by data transfers while communicating with other modules due to the
timing of the connected bus. The transferred data is generated by the functional code
that is mapped into the different modules. The module’s internal timing behavior
for computation still has to be annotated in addition to the data transfer. This timing
behavior can be gained by experiences or is based on assumptions. If the module is
realized in hardware it can be extracted for instance with high level synthesis like
done in [10].
If the functional code is supposed to run on a software processor, accurate timing
behavior can be achieved by using a corresponding instruction set simulator (ISS).
Such processor models are usually provided by commercial solutions from tool vendors like VaST [18], CoWare [5] or Synopsys [17]. To run functional code on such
an ISS the code has to be compiled for the specific processor model which implies
additional demands on the code. Due to the fact that the compiled binary code does
not directly run on the simulation host machine the usage of an ISS can slow down
the entire simulation.
To increase the simulation speed [15] and [19] introduce approaches which use
code instrumentation to achieve accurate timing results without a detailed simulation of every single instruction. [19] uses debug information and static code analysis
to annotate timing information for each code line. In contrast to this the approach
described in [15] divides the code into blocks. This hybrid approach annotates static
timing information for the blocks and combines them with dynamic system conditions during runtime.
11.2 Conventional Code Transformation
Mapping existing functional C/C++ code into a SystemC design as described in
[3] enables a fast and efficient analysis and evaluation of different hardware architecture realizations. This work uses TLM-2.0 compatible module shells which
are very similar to the MAs providing a simple interface for external communication. Therefore, the required communication and some other hardware aspects
can be abstracted. This increases the readability of the code and allows architectural changes without modifying the functional code. If the functional code and all
its variables are mapped to the same module as shown in Fig. 11.3, no external
Listing 11.1 Simple functional C++ example
Fig. 11.3 Listing 11.1 mapped on a simple SystemC architecture
Listing 11.2 Mapping Listing 11.1 to the hardware architecture as shown in Fig. 11.4 requires several code modifications
Fig. 11.4 Functional independent mapping of variables is more appropriate than the mapping shown in Fig. 11.3
communication and in consequence no code modifications are necessary (see Listing 11.1).
Certainly, variables which are located in other modules (see Fig. 11.4) require
complex modifications. Using the conventional approach the code has to be enriched
with read and write instructions for every data access which is not located in the
same module (see Listing 11.2). Besides the lower readability, an automation is very complex due to the expressive syntax of C/C++.
Listing 11.2 shows a possible code structure for external communication. Some
optimizations are possible but there are a few issues which have to be taken into
account. For instance, there is only one generic interface for each module. That is
the reason why each access needs more parameters than only the value. Alternatively
it would also be possible to have several method declarations (e.g. int reada();
void writea(int); int readb(); etc.) for data transfers to every variable that
is mapped into another module.
11.3 Optimization Approach
To decrease the effort for code enrichment and to keep the readability of the code a
new approach is presented that uses the power of C++. The idea is to substitute the
variable types by a meta class instance, which from the perspective of the functional
code acts like the original one. The template mechanism allows to substitute most
kinds of variables. The expected behavior is realized by operator overloading. Thus
the access over a MA is transparent to the programmer. The only required modification is the replacement of all concerned original objects or variable declarations,
each by its corresponding meta class instance.
A basic requirement is a light, fast and extensible solution which covers most use
cases. Only a few operators need to be overloaded to allow the compiler to handle all possible operations, using implicit transformations if necessary. To avoid incorrect
read() or write() operations most overloaded operators have the original type
either as parameter or as result.
11.3.1 Class Unitized
For type independence the proposed meta class unitized uses the template mechanism. As depicted in Listing 11.3 it is designed for various usages by providing
two methods which have to be customized by inheritance.
First of all there have to be some constructors which are necessary for several
implicit operations used by the compiler (see Listing 11.4). An initialization during
the definition of a unitized object by a value of the substituted type is enabled by
a corresponding constructor. Furthermore, there also has to be a copy constructor
to allow initialization by already existing unitized objects. Hence the definition
of these constructors also causes the compiler to ask for the declaration of a default
constructor.
The very important cast operator (see Listing 11.5) is responsible for implicit
casting to the original type which is represented by this object. It is especially
needed for calculations and assignments to other variables.
Listing 11.3 Declaration of unitized and its virtual methods which have to be overwritten
172
C. Kerstan et al.
Listing 11.4 It is essential to allow the object initialization by appropriate constructors
Listing 11.5 This operator is responsible for im- and explicit type casts of unitized objects
Listing 11.6 Supporting the assignment of unitized or originally typed variables and values
Another special operator which has to be modified is the assignment operator.
Due to the possible combinations of different parameter and result types, overwriting this operator can cause undesirable side effects in combination with other overloaded operators and declared constructors. Listing 11.6 shows an implementation
avoiding unnecessary operations during an assignment.
The described implementation avoids side effects by overwriting only operators
which cannot implicitly be substituted by the compiler. Similar to the equal operators every other operator modification has to be verified carefully. Nevertheless,
the modifications of the remaining operators are very similar to Listing 11.7. After implementing every other operator matching the pattern <operator>= and the
increment/decrement operators the class unitized can be adapted by inheritance.
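A condensed sketch of the idea, with our own simplifications rather than the original implementation, is shown below; every access to the wrapped value is funneled through read() and write() hooks that subclasses customize.

// Condensed sketch of a unitized-like meta class (our simplification):
// a template wrapper that behaves like the substituted variable and routes
// every access through read()/write() hooks customized by inheritance.
template <typename T>
class unitized {
public:
    unitized() {}
    unitized(const T&) {}            // derived classes store or forward the initial value
    unitized(const unitized&) {}     // copy construction
    virtual ~unitized() {}

    // Implicit cast to the original type, used in calculations and
    // assignments to ordinary variables.
    operator T() const { return read(); }

    // Assignments from the original type and from other unitized objects.
    unitized& operator=(const T& v)            { write(v); return *this; }
    unitized& operator=(const unitized& other) { write(other.read()); return *this; }

    // The remaining <op>= operators and ++/-- follow the same pattern.
    unitized& operator+=(const T& v) { write(read() + v); return *this; }

protected:
    virtual T    read() const      = 0;   // customized by inheritance
    virtual void write(const T& v) = 0;
};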
Listing 11.7 Remaining equal (-=, *=, /=, %=, |=, &=, ^=) and the increment/decrement operators (++/--) have to be implemented analogously to the exemplary += operator realization
Listing 11.8 Customizing unitized to realize automated logging using log4cxx [1]
11.4 Customize and Apply Unitized
To demonstrate how to customize and use the introduced unitized class, it is adapted to tracing purposes as an example. Instead of realizing the communication between modules, the first inherited class (see Listing 11.8) saves the value internally
and logs every access in order to trace simulation data. The tracing itself is based
on the open source framework log4cxx [1] which provides a flexible and efficient
logging.
By overwriting the read and write methods, the proper value is stored in value
and an adequate logging is done using the log4cxx logger referenced by logger.
Of course the constructors also have to be implemented to initialize the internal
variables. For completeness the extract of logging.h (see Listing 11.9) points out
how log4cxx is used in this case.
Listing 11.9 LOG realizes simple applicability similar to cout and avoids spare invocations
Listing 11.10 Tracing a from Listing 11.1 requires only minimal modifications
Fig. 11.5 Generated output of Listing 11.10 respecting every access of a
Listing 11.11 The additional tracing of b requires an extra similar modification
Fig. 11.6 Accesses on a or b of Listing 11.11 are traced by the default logger
11.4.1 Application of u_trace
The variables of Listing 11.1 can now be traced by replacing variables with an
u_trace object. The only essential modification to trace a is changing the type
from int to u_trace<int> (see Listing 11.10).
Because variable a is constructed without specifying an explicit logger the root
logger is taken by default (see Fig. 11.5). By modifying the type of b in the same
way, the modified code looks like presented in Listing 11.11. During execution every
access to a or b is traced as shown in Fig. 11.6.
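A sketch of such a specialization and its application is shown below; it builds on the unitized sketch from Sect. 11.3.1 and, unlike the original, simply prints to std::cout instead of logging through log4cxx.

#include <iostream>
// assumes the unitized<T> sketch from Sect. 11.3.1 is visible here

template <typename T>
class u_trace : public unitized<T> {
public:
    u_trace() : value_() {}
    u_trace(const T& v) : value_(v) {}

    using unitized<T>::operator=;   // re-expose the value assignment operators

protected:
    // The original implementation logs through log4cxx; std::cout is used
    // here only to keep the sketch self-contained.
    T read() const {
        std::cout << "read:  " << value_ << std::endl;
        return value_;
    }
    void write(const T& v) {
        std::cout << "write: " << v << std::endl;
        value_ = v;
    }

private:
    T value_;
};

int main() {
    u_trace<int> a = 5;   // tracing a only requires changing its declared type
    int b = 2;
    a = a + b;            // the read of a and the write of the result are traced
    return 0;
}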
Listing 11.12 Customizing unitized for TLM-2.0 based communication purposes
Only a minor change is needed for tracing variable accesses, so an automated modification of the code can be realized without parsing and processing the whole code.
11.5 Using the Approach in the Design Flow
To use the new approach for timing evaluation a new unitized successor has
to implement the communication demands. In contrast to u_trace it has to pass
requests directly to the corresponding SystemC module. Therefore, it does not store
any values but has to keep a reference to the initiator port of the parent module
and due to the memory mapped system the address to access the correct module
and its content (see Listing 11.12). The included master module inherits form the
initiator_socket of TLM-2.0 and provides a simple read/write interface.
u_map enables the modification of Listing 11.1 to achieve the same result as
Listing 11.2 by almost keeping the original code and thus maintaining readability
(see Listing 11.13). Figure 11.7 outlines the system configuration from the point of
view of the u_map objects.
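The following sketch outlines the idea; the master_if interface is a hypothetical stand-in for the read/write interface of the master module, and the sketch again builds on the unitized class sketched in Sect. 11.3.1.

#include <cstdint>
// assumes the unitized<T> sketch from Sect. 11.3.1 is visible here

// Hypothetical stand-in for the master module described in the text: it wraps
// a TLM-2.0 initiator socket and offers a simple read/write interface.
struct master_if {
    virtual void write_mem(std::uint64_t addr, const void* data, unsigned len) = 0;
    virtual void read_mem (std::uint64_t addr, void* data, unsigned len)       = 0;
    virtual ~master_if() {}
};

// Sketch of a u_map-like specialization: instead of storing a value, every
// access is forwarded through the initiator port to the mapped address.
template <typename T>
class u_map : public unitized<T> {
public:
    u_map(master_if& port, std::uint64_t address) : port_(port), address_(address) {}

    using unitized<T>::operator=;   // re-expose the value assignment operators

protected:
    T read() const {
        T v;
        port_.read_mem(address_, &v, sizeof(T));
        return v;
    }
    void write(const T& v) {
        port_.write_mem(address_, &v, sizeof(T));
    }

private:
    master_if&    port_;
    std::uint64_t address_;
};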
11.5.1 Handling Arrays
A small enhancement of u_map enables the treatment of arrays as easily as normal variables. Listing 11.14 implements such an extension for one-dimensional arrays.
Listing 11.13 In contrast to Listing 11.2 the same mapping of Listing 11.1 is realized with almost no modifications
Fig. 11.7 Outline of the mapped Listing 11.13 and the intent of the u_map constructor parameters
Listing 11.14 Required operator overloading to empower u_map to handle common 1D arrays
Fig. 11.8 Depiction of Listing 11.14 and its meaning in the mapped architecture
The overloaded [] operator creates a new u_map object calculating the correct
offset using the index parameter and the size of the content type. Certainly, the
complexity of the calculation increases with each dimension.
The implicit temporary creation of a u_map object (line 25 in Listing 11.14)
avoids an interface extension of the read and write methods with an offset parameter. The new u_map object directly points to the correct address without needing
an extra offset (see Fig. 11.8). Due to this, less overhead is generated which should
not impact the performance dramatically. However, one deficit occurs at declaration
because there is no difference anymore between a u_map and an array of the same
type.
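Building on the u_map and master_if sketches above (both our own simplifications), the one-dimensional case could be handled roughly as follows.

#include <cstdint>
// builds on the u_map / master_if sketches above

template <typename T>
class u_map_array : public u_map<T> {          // our own name for the extension
public:
    u_map_array(master_if& port, std::uint64_t base_address)
        : u_map<T>(port, base_address), port_(port), base_(base_address) {}

    // Indexing creates a temporary u_map that points directly at the element
    // address, so read() and write() need no extra offset parameter.
    u_map<T> operator[](unsigned index) const {
        return u_map<T>(port_, base_ + std::uint64_t(index) * sizeof(T));
    }

private:
    master_if&    port_;
    std::uint64_t base_;
};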
11.5.2 Design Example
The applicability of u_map will be demonstrated with a more application-relevant
example which was introduced in [11]. A discrete Laplace operator has to be applied
to image data (see Fig. 11.9).
An implementation of such an operator is shown in Listing 11.15. While the
image dimensions are defined as macros, the images themselves are represented
by ordinary two-dimensional arrays. For the reason of simplicity, the algorithm is
implemented assuming source_picture containing proper image data.
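A basic implementation in this spirit is sketched below; the array and macro names follow the text, while the concrete 3 × 3 kernel is our own choice.

#define WIDTH  512
#define HEIGHT 512
typedef int TYPE;   // globally defined pixel type, cf. Sect. 11.5.2

TYPE source_picture[HEIGHT][WIDTH];
TYPE target_picture[HEIGHT][WIDTH];

// Discrete 3x3 Laplace filter: border pixels are not processed; for every
// inner pixel, 9 pixels are read and one result pixel is written.
void laplace()
{
    for (int y = 1; y < HEIGHT - 1; ++y) {
        for (int x = 1; x < WIDTH - 1; ++x) {
            TYPE sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += source_picture[y + dy][x + dx];
            // 8-connected Laplacian: neighbors minus 8 times the center pixel
            target_picture[y][x] = sum - 9 * source_picture[y][x];
        }
    }
}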
The first hardware architecture shown in Fig. 11.10 consists of a processing
unit executing the algorithm as well as a standard router component distributing
Fig. 11.9 2D Laplace operator realized in the design example, exemplarily applied on Lenna [14]
Fig. 11.10 Target hardware architecture of the design example
Listing 11.15 Common basic implementation of a discrete two-dimensional Laplace operator
requests to one of the two standard memory components, which are responsible for
the source_picture and target_picture image data.
By using only standard components and generic module shells for communication, the architecture can be generated automatically. The necessary address mapping can be achieved through approaches as described in [4]. The remaining task is
to map the functional code into the processing module.
Simply copying the code (see Listing 11.15) into the run() method is not sufficient; it has to be modified. Especially the accesses to the arrays
source_picture and target_picture which are mapped to different components in the architecture (see Fig. 11.10) have to be considered. Using the described u_map the only required modification affects the declaration of both arrays.
Besides TYPE that is globally defined for the pixel size, the image address offsets
are defined by the address mapping. i_port is the initiator port of the specific
Listing 11.16 The two-dimensional arrays in Listing 11.15 require some extra modifications
Fig. 11.11 The simulated time advancement can be extracted from the simulation output of Listing 11.16
component, and setElementSize() is necessary for the correct address decomposition
using two-dimensional arrays.
11.5.3 Simulation Results
Excluding the filter module, which initially has no timing, the delay of each component in this example is set to 1 ns. For each pixel, 9 pixels have to be read and the
result has to be written. Due to the fact that the border pixels are not processed and
one access needs 2 ns the whole processing time for a 512 × 512 image can easily
be determined.
(510 · 510) · (9 + 1) · 2 ns = 5202 µs
Therefore, it is easy to analyze the simulated time advancement (see Fig. 11.11). But the complexity already increases significantly by inserting a buffer
Fig. 11.12 Extending the hardware architecture of Fig. 11.10 by an extra buffer between the processing unit and the router
Fig. 11.13 Simulation output of the architecture shown in Fig. 11.12
between the processing unit and the router (see Fig. 11.12) with a size of 9 entries, each the size of one pixel. With the buffer delay also set to 1 ns, accessing already buffered data only needs 1 ns, but one memory access now takes 3 ns. Apart from the first
pixel of each line there are only 3 read memory accesses necessary since the other
6 pixel values are already available in the buffer (see Fig. 11.9). This results in an
increase of the overall performance (see Fig. 11.13). Considering the later substitution of the router by a more complex bus and a more realistic timing annotation it is
an even more reasonable optimization.
Using an approach as described in [15] and [19], the missing internal timing behavior can be obtained. In combination with unitized and TLM-2.0, the synchronization effort, i.e. the number of wait statements, can be reduced to a minimum.
11.6 Limitations and Experiences
Since unitized overloads only the necessary operators, the compiler has to apply implicit casts and operator substitution. Except for the address-of (&) operator, even operators that are not explicitly overloaded can be used without causing undesirable side effects. To avoid direct access to the address of the content of the unitized object, the & operator was not overloaded, although it cannot be substituted. Therefore, its usage yields the address of the unitized object itself.
The occurrence of the address operator in the original source code may lead
to compiler errors which have to be fixed manually. Such an error indicates the
assignment of a unitized object address instead of the expected address of the
original variable or vice versa. However, the direct access to the unitized content
Listing 11.17 Exemplary implementation of a parameter driven sawtooth shaped signal
Listing 11.18 Partial variable mapping can cause compiler errors
is not allowed, because changes and requests cannot be tracked using references or
pointers.
The following fictitious example depicts the problem and its solutions. The
exemplary application outlined in Listing 11.17 generates a sawtooth shaped signal
using parameters for its amplitude (posamp and negamp) and its edge slope (inc
and dec). For this purpose, the value of the variable which is referenced by op
is added to signal. If signal reaches or crosses the maximum amplitude, op
switches to the other variable.
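A plausible shape of this example, reconstructed from the description above rather than taken from the original listing, is:

int posamp = 100,  negamp = -100;   // positive and negative amplitude
int inc    = 5,    dec    = -7;     // rising and falling edge slope
int signal = 0;

void sawtooth_step()
{
    static int* op = &inc;             // local pointer referencing inc or dec
    signal += *op;                     // advance the signal by the selected slope
    if (signal >= posamp) op = &dec;   // reached the upper amplitude: fall
    if (signal <= negamp) op = &inc;   // reached the lower amplitude: rise
}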
Except for the local variable op, all other variables are located in other modules, e.g. sensors or memories. That is why they are substituted by adequate unitized objects (see Listing 11.18). In practice the introduced u_map would be used, but for
generality unitized is used here.
In Listing 11.18, compiler errors occur in every line where op has to be set. All errors¹ are related to the incompatibility of int * and unitized<TYPE> *. A simple remedy is to replace the declaration of the local pointer by unitized<int>* op = &inc;
An alternative solution is to customize unitized similarly to u_map, which allows the implicit modification of the internal address and therefore acts like a pointer, e.g.
unitized<unitized<int>*> op = &inc;
The unitized approach was evaluated through various applications. In this
process even parameter passing was considered. However, due to the power of the
C++ language and the variety of compilers, an overall correctness cannot be guaranteed. In particular, experience with complex objects which may themselves overload operators is very limited. Nonetheless, unitized is constructed to induce compiler errors instead of unwanted behavior.
¹ C2440, C2446 (compiler errors using Microsoft Visual C++).
11.7 Summary
The introduced approach simplifies the mapping of C/C++ code to SystemC architectures and thereby enables the reuse of existing functional models. Due to the minimal changes, the readability of the original code is preserved, and no dedicated code parser or massive manual modifications are necessary. The demonstrated examples depict the applicability and give some suggestions for realistic use cases, e.g. tracing purposes.
Separating the functional behavior and the system architecture reduces the complexity of obtaining a flexible simulation model. Such an architecture can be changed in structure and granularity without touching the functional code. Only if formerly internal variables have to be mapped to external modules do appropriate modifications in the code have to be applied. In combination with available approaches concerning timing behavior like [3, 15, 19], a fast and accurate architecture evaluation becomes possible.
11.7.1 Outlook
The proposed combination with [15] and [19] has to be evaluated in more detail. Furthermore, the substitution of more complex objects has to be investigated. In addition to the presented use cases and specific implementations, there are several other possibilities where this approach could be valuable. For instance, using the System Analysis Toolkit (SAT) introduced in [11], an enhanced analysis and diagnosis can easily be adapted to an existing design. In this way it is also possible to realize a direct online analysis of simulation data without the use of files.
References
1. Apache. log4cxx 0.10.0. http://logging.apache.org/log4cxx, August 2007.
2. N. Bannow and K. Haug. Evaluation of an object-oriented hardware design methodology for
automotive applications. In DATE, February 2004.
3. N. Bannow, K. Haug, and W. Rosenstiel. Performance analysis and automated C++ modularization using module-adapters for SystemC. In FDL, September 2004.
4. N. Bannow, K. Haug, and W. Rosenstiel. Automatic SystemC design configuration for a faster
evaluation of different partitioning alternatives. In DATE, March 2006.
5. CoWare Inc. Processor designer. http://www.coware.com/products/processordesigner.
6. FORTE Design Systems. Cynthesizer. http://www.forteds.com/products.
7. Mentor Graphics. Catapult Synthesis. http://www.mentor.com/products/esl.
8. E. Grimpe, B. Timmermann, T. Fandrey, R. Biniasch, and F. Oppenheimer. SystemC objectoriented extensions and synthesis features. In FDL, September 2002.
9. T. Groetker. System Design with SystemC. Kluwer Academic, Dordrecht, 2002.
10. C. Haubelt, J. Falk, J. Keinert, T. Schlichter, M. Streubühr, A. Deyhle, A. Hadert, and J.
Teich. A SystemC-based design methodology for digital signal processing systems. EURASIP
Journal on Embedded Systems, 2007(1):15, 2007. doi:http://dx.doi.org/10.1155/2007/47580.
11. C. Kerstan, N. Bannow, and W. Rosenstiel. Closing the gap in the analysis and visualization
of simulation data for automotive video applications. In edaWorkshop’08, 2008.
12. OSCi. SystemC, Version 2.2. http://www.systemc.org, 2001.
13. OSCi. TLM-2.0. http://www.systemc.org, June 2008.
14. A. Sawchuk. Lenna Sjööblom. Signal & Image Processing Institute, University of Southern California. http://sipi.usc.edu/database, July 1973.
15. J. Schnerr, O. Bringmann, A. Viehl, and W. Rosenstiel. High-performance timing simulation
of embedded software. In DAC, 45th ACM/IEEE, pages 290–295. June 2008.
16. T. Schubert and W. Nebel. The Quiny SystemC front end: self-synthesising designs. In FDL,
September 2006.
17. Synopsys, Inc. Virtual Platforms. http://www.synopsys.com/products/sls.
18. VaST Systems Technology. CoMET. http://vastsystems.com/docs/CoMET_sep2008.pdf.
19. Z. Wang, A. Sanchez, A. Herkersdorf, and W. Stechele. Fast and accurate software performance estimation during high-level embedded system design. In edaWorkshop’08, 2008.
Chapter 12
Symbolic Scheduling of SystemC Dataflow
Designs
Jens Gladigau, Christian Haubelt and
Jürgen Teich
Abstract In this chapter, we propose a quasi-static scheduling (QSS) method applicable to SystemC dataflow designs. QSS determines a schedule in which several static schedules are combined into a dynamic schedule. This, among other things, reduces runtime overhead. QSS is done by performing as much static scheduling as possible at compile time and treating only data-dependent control flow as a runtime decision. Our approach improves on known quasi-static approaches in that it is automatically applicable to real-world designs and has fewer restrictions on the underlying model. The effectiveness of the approach, which is based on symbolic computation, is demonstrated by scheduling a SystemC design of a network packet filter.
Keywords Software scheduling · Quasi-static · SystemC · Symbolic methods
12.1 Introduction
SystemC is a system description language convenient for implementing dataflow models and widely used when developing embedded systems, typically starting with an abstract model. Further down the design flow, the abstract model is partitioned, and individual modules or clusters of modules are mapped to resources of a target architecture. These architectures may consist of processors capable of executing software, hardware accelerators, memories, etc. When generating software for an embedded processor from high-level SystemC designs, a transformation of concurrently executed modules into sequential code is needed. It is possible to manually recode the software part, but this is costly and may introduce errors. We concentrate on approaches which automatically generate C-based software from a high-level SystemC design; see [6, 7, 14] for examples of such design flows. Scheduling this software could be done, e.g., by using a real-time operating system or by rebuilding the SystemC scheduler. Another approach is to use a generic dynamic scheduling policy such as round-robin. All approaches, however, are affected by potential scheduling overhead. While generating C-based software from SystemC designs is out of scope
for this work, we concentrate on a substitution for the generic and dynamic software scheduler. For some dataflow designs it is possible to generate static schedules
(e.g., [2]). However, as only limited models such as synchronous or cyclo-static
data flow models can be scheduled statically [11, 15], and dynamic approaches (like
round-robin) introduce enormous overhead, quasi-static scheduling (QSS) seems to
be a possible remedy.
Static schedules can be repeated infinitely often without the need for runtime decisions, while dynamic schedules imply runtime decisions before each execution step. In QSS, compile-time analyses are applied to reduce runtime decisions to the minimum of data-dependent decisions, which naturally are only decidable at runtime. A quasi-static schedule consists of static sequences and unavoidable runtime decisions, so-called conflicts. At runtime, these conflicts select from alternative static sequences. I.e., only inevitable data-dependent scheduling decisions are made at runtime, and static decisions are made a priori (at compile time) where possible. While this naturally results in more complex scheduling code, the runtime overhead is minimized. To further reduce runtime overhead, checking for availability of input data and buffer space can be omitted, since in QSS this is already considered at compile time. Besides, guarantees on memory requirements and better predictions (like worst-case execution time and dead code) can be given, which is difficult (or impossible) in the presence of dynamic schedules. Even deadlock situations resulting from bounded buffers are avoided.
To summarize, important properties of quasi-static schedules are: (1) minimized runtime overhead, (2) avoidance of deadlocks, and (3) knowledge of exact memory requirements. In this chapter, we propose a quasi-static scheduling approach for SystemC dataflow designs. More precisely, we will focus on generating a scheduler for those parts of the model implemented as software on a single embedded processing unit.
This chapter is organized as follows: To extract the information necessary for scheduling from the SystemC model, a well-defined model description is mandatory; the underlying principles are described in Sect. 12.2. To utilize symbolic methods, a symbolic representation of SystemC models is needed; how to obtain such a representation is explained in Sect. 12.3. For the proposed model of computation, a novel quasi-static scheduling approach is introduced in Sect. 12.4. In Sect. 12.5, related work is reviewed and we differentiate our proposal from it. Finally, we show results from applying our symbolic quasi-static scheduling to a SystemC design, a network packet filter, in Sect. 12.6.
12.2 Model of Computation
Our goal is a quasi-static scheduling method to find schedules for dataflow SystemC
models. For this purpose, a well defined model of computation (MoC) represented
in SystemC is needed. A common way to model dataflow designs is actor-oriented,
which is also used in modern design of embedded systems [12]. In actor-oriented
modeling, a system is described by concurrently executed entities (called actors)
Fig. 12.1 Two actors A1 and A2 communicate via two channels c1 and c2 among each other. The
communication behavior is defined by finite state machines. Predicates (demands on buffers and
guard functions) and an action function are annotated on each transition
which communicate with each other only via dedicated channels. In order to allow as much automation as possible, we require the SystemC model to be transformed into SysteMoC [5], a SystemC library capable of extracting model information for analyses. Other SystemC-related approaches for modeling different MoCs exist, e.g., [8, 16]. In contrast to [16], SysteMoC does not demand modifications of SystemC's simulation kernel and is built upon standard SystemC, yet avoids context switching to gain simulation speed. Additionally, SysteMoC focuses on model extraction and analyses at system level—the designer does not have to choose a certain MoC a priori; instead, the most restricted MoC (ranging from SDF over KPN to non-deterministic dataflow) can be determined automatically based on principles presented in [24].
In a SysteMoC description, each SystemC module implements an actor, which is defined by a finite state machine specifying the communication behavior and by functions controlled by this finite state machine. See Fig. 12.1 for a graphical representation of a simple SysteMoC model. There, two modules A1 and A2, called actors, communicate with each other via two FIFO channels c1 and c2. Their communication behavior is defined by the depicted state machines. On each transition, data consumption and production rates are annotated according to the associated function funci. For example, see the transition in actor A2 from state ‘0’ to state ‘1’, annotated with i1(1)/func3. After this transition is taken, a single input datum, called a token, is consumed from the connected channel c1 (via port i1). Functions (like func3) are executed atomically, and data consumption and production take place after computation. Constant functions (e.g., g in Fig. 12.1), called guards, are used to test values of internal variables and data in the input channels. Hence, SysteMoC resembles FunState (Functions driven by State machines) [22] and also realizes a rule-based model of computation [17]. If the predicates (guards and demands on available tokens/free space) annotated to a transition evaluate to true, this transition is enabled. If more than one transition is enabled, one is chosen non-deterministically to be taken, and the annotated method funci, called action, is performed atomically.
If an application is already available as a SystemC specification, it can be transformed into a SysteMoC model, provided some restrictions are met, e.g., communication is done via SystemC FIFOs and a sole SC_THREAD is used. We sketch the transformation for the SystemC module Ct_merge from Fig. 12.2. The figure shows the design of a rule-based network packet filter with connection tracking capabilities. Network packets arrive from connected interfaces or from the host
Fig. 12.2 A SystemC model of a network packet filter. The Ct_merge module is zoomed in,
and its implementation as SysteMoC actor with communication controlling finite state machine is
shown. Actions req and res are member functions of the module
class Ct_merge : public sc_module {
  // ... ports & constructor including SC_THREAD(process);
  void process() {
    while (1) {
      if (o3.num_free() && i1.nb_read(request)) {
        o3.write(request);
        i3.read(response);
        o1.write(response);
      }
      if (o3.num_free() && i2.nb_read(request)) {
        o3.write(request);
        i3.read(response);
        o2.write(response);
      }
    }
  }
};
Listing 12.1 Simplified Ct_merge SystemC module
the firewall is running on, and packets are forwarded (or discarded) based on a firewall rule set. More explanation of the model follows in Sect. 12.6. The simplified
SystemC source code of Ct_merge is given in Listing 12.1. The SystemC module
Ct_merge dispatches incoming requests for connection tracking entry lookups to
the module Ct_entries and forwards the corresponding response.
Transformed into an equivalent SysteMoC actor (right side of Fig. 12.2), the finite state machine controlling the communication behavior checks for available input data and for available space on the output channels to store results. If this is fulfilled, the transition is enabled, and when it is taken the corresponding action function (e.g., req) is executed. Actions can access the values of the FIFOs, similar to SystemC, through ports. Additionally, a read or write offset is allowed (but not needed in this example). This is implemented with the help of bracket operators indicating a read or write offset, as shown in the following listing, which gives the implementation of both actions of the SysteMoC actor Ct_merge.
void req(in_port &in)   { o3[0] = in[0]; }
void res(out_port &out) { out[0] = i3[0]; }
12.3 Symbolic Representation
To reduce the complexity of the data structures and computations, an abstract view of SysteMoC models is used, which for many models is sufficient for the scheduling task. This abstract view only considers the communication behavior of the model—concrete data values, data transformations performed by actions, and guard functions are neglected. The abstract model state is then given by the current fill size of each FIFO channel and the current state of each actor's state machine. For example, the abstract state of the model shown in Fig. 12.1 is defined by the tuple (q1, q2, s) of integers, with q1/q2 being the fill size of channel c1/c2, and s ∈ S, with S = {0, 1, 2}, being the state of actor A2. (A variable for the sole state in actor A1 can be omitted.) The model's state space is spanned by these integers, with the domain sets Qi = [0, max(ci)], where max(ci) is the maximum fill size of channel ci. For the example, the state space is Q1 × Q2 × S. The start state of the example is (q1, q2, s) = (0, 0, 0), and taking, e.g., transition o1(1)/func1 followed by transition i1(1)/func3 results in the abstract state (1, 0, 0) followed by state (0, 0, 1).
Scheduling such models is done by traversing their state space, as we show later.
Traversing could be done by enumeration, but due to huge state spaces, this is, in
general, prohibitive. We need a way to traverse the state space implicitly to avoid
enumeration. Symbolic techniques consider the state transition system implicitly,
rather than manipulating state sets directly, so we want to use symbolic techniques.
In particular, a symbolic representation of SysteMoC models as described above
is needed first. Similar to symbolic model checking [13] this is done by means of
characteristic functions for model states and the state transition relation. Given a
finite state machine m = (X, T , x0 ) with state set X, a set of transitions T , and an
initial state x0 ∈ X. A state set S ⊆ X is represented by its characteristic function
χS(s), with

    χS(s) = 1 if s ∈ S, and χS(s) = 0 otherwise.

The characteristic function χT(s, s′) of the transition relation T is defined as

    χT(s, s′) = 1 if (s, s′) ∈ T, and χT(s, s′) = 0 otherwise.
Fig. 12.3 An interval decision diagram (IDD—left) and an interval mapping diagram (IMD) of characteristic functions for the SysteMoC example in Fig. 12.1. The IDD represents the state set S = {(0, 0, 0), (1, 0, 0)}, and the IMD encodes the transition relation
These characteristic functions allow the use of symbolic operations, called image and preimage, which are the basis for the symbolic scheduling. We define them similarly to [10].
Definition 12.1 (Image) The image(S, T) of a set of states S and a set of transitions T is the set of all states which can be reached from any state s ∈ S by taking a transition t ∈ T: image(S, T) = {s′ | ∃s : χS(s) ∧ χT(s, s′)}.
Definition 12.2 (Preimage) The preimage(S, T) of a set of states S and a set of transitions T is the set of all states which can reach any state s′ ∈ S by taking a transition t ∈ T: preimage(S, T) = {s | ∃s′ : χS(s′) ∧ χT(s, s′)}.
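The following explicit-state C++ sketch mirrors these two definitions for the abstract states of Fig. 12.1; it only serves to make the set operations concrete—the actual implementation evaluates them symbolically on interval diagrams, as described next.

#include <set>
#include <tuple>
#include <utility>

// Abstract state (q1, q2, s) as introduced above.
using State    = std::tuple<int, int, int>;
// Transition relation T given explicitly as a set of (s, s') pairs.
using Relation = std::set<std::pair<State, State>>;

// image(S, T) = { s' | exists s in S : (s, s') in T }
std::set<State> image(const std::set<State>& S, const Relation& T) {
  std::set<State> next;
  for (const auto& [s, s2] : T)
    if (S.count(s)) next.insert(s2);
  return next;
}

// preimage(S, T) = { s | exists s' in S : (s, s') in T }
std::set<State> preimage(const std::set<State>& S, const Relation& T) {
  std::set<State> prev;
  for (const auto& [s, s2] : T)
    if (S.count(s2)) prev.insert(s);
  return prev;
}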
In our implementation, we use interval diagrams [20] for symbolic encoding, which are similar to binary decision diagrams (BDDs) [3]. Instead of Boolean values, a value range is associated with each variable, and intervals are annotated on outgoing arcs. The approach is not limited to interval diagrams; other symbolic representations (like BDDs) could be used. We use interval diagrams because they are well suited for representing transition systems with FIFO communication. A detailed description of these diagrams can be found in [20, 21].
We give an informal introduction to interval diagrams as used here. There are two
types of interval diagrams: (1) interval decision diagrams (IDDs), used to represent
characteristic functions of state sets, and (2) interval mapping diagrams (IMDs),
used to represent characteristic functions of transition relations. For the SysteMoC
example from Fig. 12.1, the symbolic representation as interval diagrams is shown in Fig. 12.3, assuming a maximum fill size of eight for both channels. The left diagram encodes the characteristic function of the state set S = {(0, 0, 0), (1, 0, 0)}.
Similar to BDDs, IDDs have a single root node, terminal nodes, and a variable order.
When encoding Boolean functions, there are only two terminal nodes, ‘0’ and ‘1’,
representing the function’s result. Depending on the actual value of the variables,
paths from the root node lead either to the ‘1’ or ‘0’ terminal node. As the IDD
encodes the characteristic function for the state set S = {(0, 0, 0), (1, 0, 0)}, only
(q1 , q2 , s) = (0, 0, 0) and (q1 , q2 , s) = (1, 0, 0) lead to the ‘1’ terminal node.
All transitions of a SysteMoC model are encoded in a single interval mapping
diagram, regardless of the number of different state machines (actors). Using this
transition relation for the image operation results in the set of states which can be
reached by a single transition of any actor. The right side of Fig. 12.3 depicts the
IMD for the characteristic function of the SysteMoC example’s transition relation.
IMDs contain only a single terminal node. Paths from the root to this terminal node
represent transformations of states due to transitions. Therefore, a mapping function
and two intervals are annotated on edges: a predicate interval determines the value
a variable must have to apply the mapping function, and an action interval (together
with the mapping function) describes the value transformation. Edges without annotations are neutral, i.e., the variable may have any value and is not altered. We give
an example using the dashed path in the IMD in Fig. 12.3. This path represents the
transformation of variables due to the transition i1 (1)/func3 in actor A2 . To apply
this transition to a single state (q1 , q2 , s), the value of q1 must be in the range [1, 8],
and s must equal zero (q2 is ignored). Then, for the following state, (q1′, q2′, s′) is determined by q1′ = q1 − 1, q2′ = q2, and s′ = 1. This is a simplification for illustration purposes. For the actual transformations, interval arithmetic and a mapping function similar to the apply operation for BDDs are used. We refer to the cited literature for
further details.
12.4 QSS of SysteMoC Models
After introducing the model of computation for SystemC designs and their symbolic
representation in the previous sections, we now introduce the quasi-static scheduling
procedure for those models. We need some technical terms to describe the proposed
quasi-static scheduling. Abstract model state and the model's state space were introduced in Sect. 12.3—the state space is spanned by variables for buffer fill sizes and the states of the actors' finite state automata, and one particular point in this state space
is an abstract state. Note that this abstraction neglects data values. The basic task
while scheduling a model is searching paths through the model's state space. A path x0 −t0→ x1 −t1→ · · · xj −tj→ xj+1 · · · −tn−1→ xn through the model's state space is a chain of model states xi and transitions tj. Taking transition tj transforms the model state from state xj to state xj+1.
state machine can be in conflict with each other. E.g., due to annotated guards, either one or the other must be taken at runtime.
Then, the scheduling is basically performed as follows: a path x0 −t0→ · · · xi −ti→ · · · x0 from the model's start state x0 back to x0 is searched, taking whatever transitions into account (conflicting and non-conflicting ones). If a transition tc1 on this path is in conflict with other transitions tci, additional paths are searched to cope with every possible outcome of the conflict at runtime. We call these paths alternative paths for tc1, as they are alternatives for the conflicting transition tc1 encountered in the first place. At runtime, one of these paths is chosen, dependent on a runtime decision. These alternative paths are paths from the model state in which the conflict transition tc1 originates to any known model state, i.e., xc −tci→ · · · xk. A known state xk is any
state on any path found so far. This way, a path is present for every possible runtime
decision concerning tc1 . The described conflict handling procedure is recursively
applied to every conflict transition encountered on any path. As the result, a clew of
paths is obtained, including a path for every possible runtime decision. This clew is
a preliminary stage of a scheduler controller automaton.
We now show that the scheduling algorithm terminates, either with a schedule as the result or indicating that no quasi-static schedule exists. Because queues have a limited maximum size and the number of state machine states in the actors is limited too, the model's state space is finite. If at some point no alternative path for a conflict transition can be found within these state space boundaries, a different path is chosen for the path containing the conflict. This backtracking strategy is applied recursively if needed. Therewith, finding a valid quasi-static schedule is guaranteed if one exists within the state space. Otherwise, the algorithm terminates unsuccessfully. Note that this worst case leads to an enumeration of possible paths and therewith to vast runtimes.
The above outline of the scheduling procedure is presented in more detail in a following subsection. Before that, we introduce so-called transition graphs, which we use to determine conflicts, and explain path searching, the workhorse of the scheduling procedure.
12.4.1 Transition Graphs
During scheduling it is vital to know when alternative paths are needed at runtime—
our goal is a scheduler that can cope with every possible runtime condition. We
need alternative paths when a transition is in conflict with other transitions. Such a
conflict can only occur between transitions leaving the same state of an actor’s state
machine. This subsection explains a concept called transition graphs, which is used
later to exactly determine a conflict.
There is a transition graph for each actor’s state machine state. This transition
graph reflects the relation between the outgoing transitions. We subdivide transitions into two classes: safe and unsafe transitions. Transitions are called safe if no guards are annotated; otherwise they are called unsafe. Unsafe transitions can lead to runtime decisions, while safe transitions only depend on available data and buffer space. Thus, static scheduling sequences can be constructed from safe transitions, while for unsafe transitions alternative paths may be needed. Only unsafe transitions are included in the transition graph of an actor's state machine state. As an example, Fig. 12.4 shows a single state with five outgoing transitions. This is a part of an actor's state machine, simplified to include only the guards a, b, and c, as only these Boolean functions lead to runtime decisions and are therefore relevant for the transition graph. The resulting transition graph for this state is depicted in Fig. 12.5. Transitions with exactly the same guards are put in the same equivalence class (node). A directed edge is inserted in the graph for a logical implication of the (Boolean) guards. The result is a possibly unconnected graph (as in the example). All nodes without outgoing edges are said to be leaf nodes. Such a transition graph is the basis for conflict handling.
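A minimal sketch of this construction is given below. It assumes that every guard is a conjunction of atomic predicates, so that two transitions fall into the same node exactly when their predicate sets are identical and implication can be approximated by the superset relation; the concrete guard assignment for t1–t5 is only a hypothetical reading of Fig. 12.4.

#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

using Guard   = std::set<std::string>;          // e.g. {"a"} or {"a","b","c"}
using TransId = std::string;

struct TransitionGraph {
  std::map<Guard, std::set<TransId>> nodes;     // equivalence classes
  std::set<std::pair<Guard, Guard>>  edges;     // (stronger, weaker): implication
};

TransitionGraph build_transition_graph(const std::map<TransId, Guard>& unsafe) {
  TransitionGraph g;
  for (const auto& [t, guard] : unsafe)
    g.nodes[guard].insert(t);                   // identical guards -> same node
  for (const auto& [ga, ta] : g.nodes)
    for (const auto& [gb, tb] : g.nodes)
      if (ga != gb &&
          std::includes(ga.begin(), ga.end(), gb.begin(), gb.end()))
        g.edges.insert({ga, gb});               // ga implies gb (superset of literals)
  return g;
}

// Leaf nodes: equivalence classes without outgoing implication edges.
std::vector<std::set<TransId>> leaf_nodes(const TransitionGraph& g) {
  std::vector<std::set<TransId>> leaves;
  for (const auto& [guard, ts] : g.nodes) {
    bool has_outgoing = false;
    for (const auto& e : g.edges)
      if (e.first == guard) { has_outgoing = true; break; }
    if (!has_outgoing) leaves.push_back(ts);
  }
  return leaves;
}

// Hypothetical guard assignment for the five transitions of Fig. 12.4:
// {t1, t4} share guard a, t3 has guard b, t5 has guard c, and t2's guard
// (a AND b AND c) implies all of them -> four nodes, three leaf nodes.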
Fig. 12.4 An actor’s finite state machine state
with five outgoing transitions. Only guards (a,
b, and c) are annotated at the transitions, as
only they are relevant to construct the transition graph
Fig. 12.5 The transition graph for the state
machine state shown in Fig. 12.4. There are
four equivalence classes. Guards for transition t2 logically imply guards of other transitions, as depicted by edges
Fig. 12.6 A sample path search for the example in Fig. 12.1 by consecutive image computation.
The dashed arrows indicate the found path, the ellipses depict the state sets resulting from image-computations
12.4.2 Path Searching
The crucial part of the scheduling algorithm is path searching through the model’s
state space. The challenge is to find a single path, starting in state x0 and ending in an arbitrary state e ∈ E, with E being a set of states. Path searching is done symbolically by consecutively computing the images Si of x0 until Si ∩ E ≠ ∅ holds.
The first image S1 of x0 is the set of all states reachable in a single step. The image
S2 of the image S1 is the set of states reachable from x0 in two steps, and so on.
If a state e ∈ E is included in Si (i.e., can be reached), at least one path from x0
to e (with i steps) exists. By this symbolic breadth-first search, a shortest path from
a single state x0 to some desired state is found. For the example in Fig. 12.1, the
search for the initial path (from start state (q1 , q2 , s) = (0, 0, 0) back to this state)
is shown in Fig. 12.6. Note that, in contrast to the figure, the exact relation between
states, as depicted by arrows between the states, is not known when computing the
images symbolically. By symbolic image computation all transitions are considered
implicitly. Hence, when a state is reached, the only thing we know is that at least one
path to this state must exist. Backtracking is used to find a concrete path, based on consecutively calculating the preimage of concrete states [21].
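The breadth-first search with subsequent backtracking can be summarized by the following explicit-state sketch, reusing the image/preimage helpers sketched in Sect. 12.3; the symbolic implementation performs exactly these steps on interval diagrams instead of enumerated sets.

#include <algorithm>
#include <iterator>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

using State    = std::tuple<int, int, int>;
using Relation = std::set<std::pair<State, State>>;

static std::set<State> image(const std::set<State>& S, const Relation& T) {
  std::set<State> r;
  for (const auto& [s, s2] : T) if (S.count(s)) r.insert(s2);
  return r;
}
static std::set<State> preimage(const std::set<State>& S, const Relation& T) {
  std::set<State> r;
  for (const auto& [s, s2] : T) if (S.count(s2)) r.insert(s);
  return r;
}

// Shortest path from x0 to some state in E, found by layered image
// computation (S1, S2, ...) and recovered backwards via preimages.
std::vector<State> find_path(const State& x0, const std::set<State>& E,
                             const Relation& T) {
  std::vector<std::set<State>> layers = {{x0}};
  std::set<State> reached = {x0};
  while (true) {
    std::set<State> next = image(layers.back(), T);
    layers.push_back(next);
    std::vector<State> hits;
    std::set_intersection(next.begin(), next.end(), E.begin(), E.end(),
                          std::back_inserter(hits));
    if (!hits.empty()) {                         // target layer found: backtrack
      std::vector<State> path = {hits.front()};
      for (int i = (int)layers.size() - 2; i >= 0; --i) {
        std::set<State> pre = preimage({path.back()}, T);
        for (const State& s : layers[i])
          if (pre.count(s)) { path.push_back(s); break; }
      }
      std::reverse(path.begin(), path.end());
      return path;
    }
    bool grew = false;
    for (const State& s : next)
      if (reached.insert(s).second) grew = true;
    if (!grew) return {};                        // E is not reachable from x0
  }
}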
while (1) {
  t1(); t3();
  if (g) {
    t1(); t5(); t6(); t2();
  } else
    t4();
  t2(); }
Fig. 12.7 The two paths found for the example in Fig. 12.1. First, starting in state
(0, 0, 0), the path with solid arcs was found.
The path with dashed arcs is an alternative
path for transition t4
Fig. 12.8 The quasi-static software scheduler gained from the paths depicted in
Fig. 12.7. Only one dynamic decision remains at runtime
The example in Fig. 12.1 also introduces a class of conflicts between transitions we call multi-rate conflicts. For easier explanation, we assume an enumeration of the transitions in the example, corresponding to the index of the action. I.e.,
t1 : o1 (1)/func1 , t2 : i2 (1)/func2 , and so on. Due to the guard g, the two transitions
t4 (o2 (1)&¬g/func4 ) and t5 (i1 (1)&g/func5 ) leaving state ‘1’ of actor A2 are in
conflict. Additionally, t4 and t5 have different rates in terms of dataflow: t4 demands one free space in channel c2, while t5 demands one token on channel c1. To cope
with such multi-rate conflicts and effectively search for runtime dependent alternative paths, an advanced symbolic path searching algorithm is needed. In principle,
we added three new features to path searching:
• A desired transition may be anywhere on the path.
• Some transitions may be disabled during the path search.
• Disabled transitions may be unlocked.
The need to cover conflicting transitions anywhere on an alternative path, not
just at the beginning, is demanded by possible different rates of the conflicting transitions; otherwise it may be impossible to take the conflicting transition because
of absence of tokens or insufficient free space to produce tokens. This is already
motivated by the example. See the sole encountered conflict (transition t4 ) on the
path depicted with dashed arcs in Fig. 12.6. To resolve this conflict, an alternative path has to cover the transition t5 . But transition t5 is not enabled in model
state (q1 , q2 , s) = (0, 0, 1) due to the absence of tokens on channel c1 . By searching paths including the conflicting transition anywhere on the path, other actors
are allowed to produce (or consume) tokens, which may finally enable the conflicting transition. In the running example, transition t1 can produce a token on c1 and
therewith enables t5 . So, when searching for an alternative path for the conflicting
transition t4, i.e., a path originating in state (0, 0, 1) and using t5 anywhere, a path like (0, 0, 1) −t1→ (1, 0, 1) −t5→ (0, 0, 2) −t6→ (0, 2, 0) −t2→ (0, 1, 0) is found. State (0, 1, 0)
is known as part of the first path and is therefore a valid end state. Both paths together are shown in Fig. 12.7. For this example, these two paths already represent
a valid quasi-static schedule and a software scheduler can be derived, see Fig. 12.8.
There is only one runtime decision needed (the if-clause, testing the guard function g). Depending on the outcome of this clause, one of the two static sequences is
chosen. This directly reflects the two paths searched for the model (Fig. 12.7).
Note that transition t4 must initially be forbidden when searching the alternative path—we search a runtime alternative for the use of t4 in the first place. This is why path searching needs the capability to ignore transitions. Another improvement we implemented, the transition unlocking feature, is not motivated by the simple example. It may be necessary to re-allow taking a disabled transition for which an alternative path is being searched. I.e., if we search an alternative path for the use of tc, transition tc is initially disabled, but when one of its conflicting transitions is used on a path, tc may afterwards be necessary on this path to reach a known state, so it is unlocked. Models can be constructed where no alternative paths are found otherwise.
To allow for two of the features—using desired transitions anywhere on the path and unlocking—we must track whether some transition (or one out of a set of transitions) was used on a path. E.g., only after some desired transition has been covered on a path may the path end in a known state and thus be a valid alternative path. Therefore, state sets have to be split into groups during the path search: those that have already covered a desired transition, and those that have not. These groups have to be treated independently. To not lose the advantages of symbolic path searching and to avoid enumerating possible paths, we extended the image operation in two ways to implement the advanced path searching symbolically: (1) image computation respects a given set of transitions which are disabled while computing the image, and (2) the image function returns, in addition to the set of states, the set of transitions involved in the image computation. Therewith, the advanced path searching as sketched above—with covering transitions anywhere, disabled transitions, and unlocking—can be implemented efficiently and symbolically. Because unlocking is rarely needed in real-world examples, we omit a detailed description of this feature. However, it further increases the model complexity the approach can handle.
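A sketch of the extended image operation is shown below; the edge/identifier representation is illustrative only, but it captures the two extensions: a set of disabled transitions is respected, and the transitions actually used in the step are reported back to the path search.

#include <set>
#include <tuple>
#include <vector>

using State   = std::tuple<int, int, int>;
using TransId = int;
struct Edge { State from; State to; TransId id; };  // one element of the relation

struct ImageResult {
  std::set<State>   states;       // successor states reached in one step
  std::set<TransId> transitions;  // transitions involved in computing them
};

ImageResult image_ex(const std::set<State>& S, const std::vector<Edge>& T,
                     const std::set<TransId>& disabled) {
  ImageResult r;
  for (const Edge& e : T) {
    if (disabled.count(e.id)) continue;         // (1) disabled transitions are ignored
    if (S.count(e.from)) {
      r.states.insert(e.to);                    // (2) report states ...
      r.transitions.insert(e.id);               //     ... and the transitions used
    }
  }
  return r;
}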
12.4.3 Scheduling Algorithm
We now explain the quasi-static scheduling algorithm for actor-oriented models in
a more detailed manner. Conflicts that demand runtime decisions are based on the
relation of transitions—two or more transitions leaving the same state of an actor’s
finite state machine can be in conflict. A conflict handling strategy may be: whenever using a conflicting transition on a path, search paths starting with one of the
remaining transitions in conflict, until one path for every transition is found. As
shown above, this common conflict handling cannot cope with multi-rate conflicts.
Furthermore, relations between transitions are ignored. But these relations could be
exploited by the scheduling procedure, which leads to better results. In the proposed
scheduling algorithm, we introduce a novel conflict handling strategy which exploits
transition relations and handles multi-rate conflicts.
As a preliminary step before scheduling, all outgoing transitions, including their guards, are analyzed for each state in each actor's state machine. As a result, for each
 1  function find_schedule(start, Tall)
 2    known = {start};                          // known states
 3    disabled = {};                            // disabled transitions
 4    (conflicts, x0 −t0→ x1 −t1→ · · · xn−1 −tn−1→ x0) =
 5        find_path(start, known, disabled, Tall);
 6    known = known ∪ {x1, . . . , xn−1};
 7    initialize schedule with path x0 −t0→ x1 −t1→ · · · −tn−1→ x0;
 8    while conflicts ≠ ∅ do
 9      (x, tchosen, disabled) = conflicts.pop();
10      leaf_nodes = get_leaf_nodes(tchosen);
11      leaf_nodes = leaf_nodes \ disabled;
12      while leaf_nodes ≠ ∅ do
13        tmp_disabled = disabled;
14        add transitions ≥ tchosen to tmp_disabled;
15        (tmp_conflicts, x0 −t0→ x1 −t1→ · · · −tn−1→ xk) =
16            find_path(x, known, tmp_disabled, Tleaf_nodes);
17        known = known ∪ {x1, . . . , xn−1};
18        add x0 −t0→ x1 −t1→ · · · −tn−1→ xk to schedule;
19        remove covered leaf from leaf_nodes;
20        conflicts = conflicts ∪ tmp_conflicts;
21      endwhile
22    endwhile
23    return schedule;
Listing 12.2 find_schedule algorithm in pseudocode
state the aforementioned transition graph is gained, which represents a relation of
the transitions leaving this state. If on a path an unsafe transition (including a guard)
is taken, the corresponding transition graph is examined to determine the minimum
number of additionally needed paths to guarantee proper execution at runtime. The important observation is that only one path for each leaf node must be searched. Therewith,
alternatives within conflicting transitions and conflict implications are respected.
We give two examples using Fig. 12.5. Assume a path · · · xi −t1→ · · · xk is found first, covering transition t1 with guard a. According to the transition graph, only two alternative paths, with transitions t3 and t5, have to be searched, starting from state xi. Therewith, for every leaf node in the transition graph a path exists, i.e., for all possible runtime conditions (including all combinations of runtime conditions) paths exist.
Now assume the path · · · xi −t2→ · · · xk. Transition t2 is not included in a leaf node, so we need to find three additional paths, starting from xi and covering t1 or t4, t3, and t5. By this procedure, we find the minimum number of required paths to guarantee proper execution at runtime; alternatives and implications are exploited. Together with the
advanced path searching (search paths that cover desired transitions anywhere on
the path), the algorithm also handles multi-rate conflicts. Listing 12.2 shows the
scheduling algorithm as pseudocode. Parameters of find_schedule are the start
state of the model and the set of all transitions Tall . The function returns the found
paths representing the quasi-static schedule. The algorithm is simplified: it is assumed that a schedule is always found, without backtracking. First, a path from the model's
start state start back to start is searched (lines 4 and 5) as an initial path.
find_path expects four parameters: (1) the originating state of the path, (2) a set
of known states (possible end states), (3) a set of disabled transitions, and (4) a set of
transitions from which this path has to cover one. As the result, find_path returns
the set conflicts of conflicting positions on the found path and the path itself.
Transitions on the path that are in conflict with other transitions need further investigation to assure proper operation at runtime. In line 5, Tall denotes all transitions of
the model; thus, path searching is not restricted. The set of known states is updated by adding the new states of the path (line 6), and schedule, which will contain
all found paths and eventually the found schedule, is initialized (line 7). If there are
no runtime decisions to make, i.e., the set conflicts is empty after the first path
search, a static schedule is returned.
While the set conflicts contains elements, additional paths for these conflicting transitions are searched (lines 8 to 22). Elements of conflicts are tuples consisting of (1) the state from which some (2) conflicting transition originates, and (3) the transitions that were disabled when this conflict was encountered. The function get_leaf_nodes returns the set of leaf nodes from the transition graph (line 10). Because disabled transitions
cannot be covered they are removed (line 11). As described earlier, one alternative
path has to be searched for each leaf node (lines 12 to 21). The set disabled
includes all disabled transitions at the moment the currently treated conflict was
encountered. We need to add additional transitions for the currently treated conflict (line 14). These transitions are also obtained from the transition graph which includes transition tchosen: all transitions in tchosen's node, and all transitions in nodes with implications to tchosen's node, must be disabled. These disabled transitions tmp_disabled are respected in the alternative path search (lines 15 and 16).
The path must cover one of leaf_nodes’s transitions (Tleaf_nodes ) and eventually
end in a known state (known) to be a proper alternative path. If a path is found,
the set of known states is updated, the found path is added to schedule, the covered leaf is removed from leaf_nodes, and any newly encountered conflicting transitions are added to conflicts (lines 17 to 20). If the set conflicts is
empty, for all conflicts alternative paths have been searched, and the clew of paths
is returned, which is the basis for the quasi-static schedule. See Fig. 12.7 for an
example schedule, and Fig. 12.8 for the resulting software scheduler.
12.5 Related Work
In recent years, techniques related to quasi-static scheduling using dataflow specifications [1], restricted Petri nets [4, 9, 18, 19], or FunState [23] were presented. The key difference between these approaches and our proposal is the underlying abstract model. Our model can be extracted automatically from an actor-oriented SystemC design and has fewer restrictions. E.g., whenever an output transition in free-choice
Petri nets [18] or equal conflict nets [19] is enabled, all output transitions of the
originating place are enabled. A similar approach is taken by Strehl et al. [23] by
defining conflict states in FunState.
The symbolic QSS approach presented in [23] seems to be the most promising
one for actor-oriented SystemC designs and we used it as a basis for our proposed
scheduling procedure. But there are two major drawbacks with conflict states as
used in the FunState approach: (1) each transition leaving such a conflict state results in a path search while scheduling, and (2) all transitions must have exactly
the same demands on input tokens. Due to the first point, conflicting alternatives (i.e., two transitions leaving a conflict state with the same runtime condition in conflict with other transitions) are not respected as equivalent alternatives, and conflict implications are ignored. The second point also implies that no multi-rate conflicts are allowed. Our approach has neither drawback. We extended [23] in several ways and adapted it for dataflow designs as described above, which enables the scheduling of model properties not considered before (e.g., multi-rate conflicts). Additionally, we
eliminated a manual classification step, and search only for the minimum required
alternative paths.
To avoid constructing a monolithic scheduler specification automaton, as needed in [23], we discarded the state-classification-based approach, defined the term conflict based on relations of transitions, and introduced a novel conflict handling strategy using transition graphs. This also enables a more natural design of state machines without the need to outsource conflicts and alternatives into separate states (if this is possible at all), which is demanded by the monolithic scheduler specification approach. Summarizing, we gain several improvements over the FunState method: (1) our methodology starts with abstract SystemC models instead of FunState, which is intended for internal representation of systems only [22]; (2) the construction of a monolithic automaton with classified states, which represents the full dynamic schedule, is avoided; (3) conflicts with different rates are considered and can be scheduled in particular cases; (4) transitions instead of states determine conflicts; and (5) by introducing transition graphs, we search only for the minimum required runtime-dependent alternative paths by respecting conflict alternatives and conflict implications, which results in lean schedules.
12.6 Example
So far, the proposed symbolic quasi-static scheduling procedure has been introduced using small examples. Next, we apply our scheduling to a real-world SystemC example, the connection tracking network packet filter shown in Fig. 12.2. The packet filter model consists of 14 modules (with 39 transitions) and 18 channels. This includes source and sink modules. Filtering is done in the actors Input, Output, and Forward. Based on a rule set, a decision is made to accept or to drop a packet. Connection tracking is done by the Conntrack actors. Ct_merge merges requests for connection tracking entry lookups to Ct_entries. Even with channel sizes kept to the minimum needed (maximum depth of two), the model's state space consists of about 44 × 10⁶ reachable abstract states.
Fig. 12.9 The found clew of paths for the packet filter model (Fig. 12.2). Only the forks represent runtime decisions; between them, chains of statically scheduled transitions are depicted
We automatically extract the finite state machine for each actor as well as the
network graph (interconnection of the modules) from the SystemC program. From
this information, the symbolic representations of the state transition relation and the
model's start state are obtained. Scheduling this model takes about a second, including
the symbolic encoding of the model. During the scheduling, 39 conflicts were encountered, for which 30 alternative paths had to be searched. The resulting clew of
paths represents the quasi-static schedule and is shown in Fig. 12.9. As an example,
one path from the start state (on the left side, indicated by a small circle) back to
the start state is shown with dashed arrows. This path reflects the scenario “a packet
from the host is received and accepted”. The number of runtime decisions needed to
process such packets is reduced from ten (fully dynamic schedule) to three (the found
quasi-static schedule). Additionally, runtime checks on buffers can be omitted. Besides this significant reduction in runtime overhead, answers to questions like “How
much processing is needed for a packet from the host which is accepted?” can easily
be given in the presence of the quasi-static schedule.
12.7 Conclusions and Further Work
In this work, we introduced a symbolic quasi-static scheduling approach, which is
directly and automatically applicable to actor-oriented SystemC dataflow designs.
A significant improvement over prior approaches was achieved by identifying transitions as the origin of conflicts. The key contributions are (1) the conflict handling
mechanism based on transition graphs, which is an efficient instrument to determine
the minimum required alternative paths for conflicts, and (2) handling multi-rate
conflicts, which are not considered in prior approaches. Using a real-world SystemC design, we demonstrated the applicability of the approach. The resulting scheduler significantly
reduced the number of runtime decisions and, hence, the scheduling overhead for
its software implementation. All other aforementioned benefits apply: avoidance of
deadlocks, knowledge of exact memory requirements, dead code detection, more
predictable behavior, etc.
In further work, we will use the found schedules not only for generating schedulers for embedded software, but also to reduce processing time in the SystemC
simulation. First experiments with generated SystemC transaction level models gave
very promising results.
References
1. B. Bhattacharya and S.S. Bhattacharyya. Quasi-static scheduling of reconfigurable dataflow
graphs for DSP systems. In Proc. of RSP, pages 84–89, 2000.
2. S.S. Bhattacharyya, E.A. Lee, and P.K. Murthy. Software Synthesis from Dataflow Graphs.
Kluwer Academic, Norwell, 1996.
3. R.E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Trans. Comput., 35(8):677–691, 1986.
4. J. Cortadella, A. Kondratyev, L. Lavagno, C. Passerone, and Y. Watanabe. Quasi-static
scheduling of independent tasks for reactive systems. IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., 24(10):1492–1514, 2005.
5. J. Falk, C. Haubelt, and J. Teich. Efficient representation and simulation of model-based designs in SystemC. In Proc. of FDL, pages 129–134, Darmstadt, Germany, September 2006.
6. C. Haubelt, J. Falk, J. Keinert, T. Schlichter, M. Streubühr, A. Deyhle, A. Hadert, and J. Teich.
A SystemC-based design methodology for digital signal processing systems. EURASIP Journal on Embedded Systems, 2007. doi:10.1155/2007/47580.
7. F. Herrera, H. Posadas, P. Sanchez, and E. Villar. Systematic embedded software generation
from SystemC. In Proc. of DATE, pages 142–147, 2003.
8. F. Herrera and E. Villar. A framework for embedded system specification under different models of computation in SystemC. In Proc. of DAC, pages 911–914. Assoc. Comput. Mach., New
York, 2006.
9. P.-A. Hsiung and F.-S. Su. Synthesis of real-time embedded software by timed quasi-static
scheduling. In Proc. of VLSID, pages 579–584, 2003.
10. A.J. Hu and D.L. Dill. Efficient verification with BDDs using implicitly conjoined invariants.
In Proc. of CAV, pages 3–14. Springer, London, 1993.
11. E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow programs for
digital signal processing. IEEE Trans. Comput., 36(1):24–35, 1987.
12. E.A. Lee, S. Neuendorffer, and M.J. Wirthlin. Actor-oriented design of embedded hardware
and software systems. Journal of Circuits, Systems, and Computers, 12(3):231–260, 2003.
13. K.L. McMillan. Symbolic Model Checking. Kluwer Academic, Norwell, 1993.
14. B. Niemann, F. Mayer, F. Javier, R. Rubio, and M. Speitel. Refining a high level SystemC
model. In SystemC: Methodologies and Applications, pages 65–95. Kluwer Academic, Norwell, 2003.
15. T.M. Parks, J.L. Pino, and E.A. Lee. A comparison of synchronous and cycle-static dataflow.
In Proc. of ASILOMAR, pages 204–210. IEEE Comput. Soc., Washington, 1995.
16. H.D. Patel and S.K. Shukla. SystemC Kernel Extensions for Heterogeneous System Modeling:
A Framework for Multi-MoC Modeling & Simulation. Kluwer Academic, Norwell, 2004.
17. H.D. Patel, S.K. Shukla, E. Mednick, and R.S. Nikhil. A rule-based model of computation for
SystemC: integrating SystemC and Bluespec for co-design. In Proc. of MEMOCODE, pages
39–48, 2006.
18. M. Sgroi and L. Lavagno. Synthesis of embedded software using free-choice Petri nets. In
Proc. of DAC, pages 805–810, 1999.
19. M. Sgroi, L. Lavagno, Y. Watanabe, and A.L. Sangiovanni-Vincentelli. Quasi-static scheduling of embedded software using equal conflict nets. In Proc. of ICATPN, pages 208–227,
1999.
20. K. Strehl. Interval diagrams: Increasing efficiency of symbolic real-time verification. In Proc.
of RTCSA, pages 488–491, Hong Kong, 1999.
21. K. Strehl. Symbolic Methods Applied to Formal Verification and Synthesis in Embedded Systems Design. PhD thesis, Swiss Federal Institute of Technology Zurich, February 2000.
22. K. Strehl, L. Thiele, M. Gries, D. Ziegenbein, R. Ernst, and J. Teich. FunState—an internal
design representation for codesign. IEEE Trans. VLSI Syst., 9(4):524–544, 2001.
23. K. Strehl, L. Thiele, D. Ziegenbein, R. Ernst, and J. Teich. Scheduling hardware/software
systems using symbolic techniques. In Proc. of CODES, pages 173–177, Rome, Italy, 1999.
24. C. Zebelein, J. Falk, C. Haubelt, and J. Teich. Classification of general data flow actors into
known models of computation. In Proc. of MEMOCODE, pages 119–128, Anaheim, CA,
USA, 2008.
Chapter 13
SystemC Simulation of Networked Embedded
Systems
Francesco Stefanni, Davide Quaglia and
Franco Fummi
Abstract The design and simulation of next-generation networked embedded systems is a challenging task, since System design choices may affect the network behavior and Network design choices may impact the System design. For this reason, it is important—at the early stages of the design flow—to model and simulate not only the system under design, but also the heterogeneous networked environment in which it operates. However, System designers are more focused on System design issues and tools, while Network aspects are dealt with implicitly by choosing traditional protocols, even if, in this case, the chance of joint optimization is lost. To solve this issue, we have exploited a modeling language traditionally used for System design—SystemC—to build a System/Network simulator named SystemC Network Simulation Library (SCNSL). This library makes it possible to model network scenarios in which different kinds of nodes, or nodes described at different abstraction levels, interact. The use of SystemC as a single tool has the advantage that HW, SW, and network can be jointly designed, validated, and refined. As a case study, the proposed tool has been used to simulate a sensor network application, and it has been compared with NS-2, a well-known network simulator; SCNSL shows a nearly two-order-of-magnitude speed-up with TLM modeling and about the same performance as NS-2 with a mixed TLM/RTL scenario. The simulator is partially available to the community at http://sourceforge.net/projects/scnsl/.
Keywords Networked embedded systems · Network simulation · IEEE 802.15.4 ·
SystemC
13.1 Introduction
The widespread use of Networked Embedded Systems (i.e., embedded systems with
communication capabilities like PDAs, cell-phones, routers, wireless sensors and
actuators) has generated significant research for their efficient design and integration
into heterogeneous networks [1, 5–8, 11, 13].
Fig. 13.1 Two-dimensional design space of networked embedded systems
The design of such networked embedded systems is strictly connected with network functionality as suggested in the literature [4]. In fact, choices taken during
the System design exploration may influence the network configuration and vice
versa. The result is a two-dimensional design space as depicted in Fig. 13.1. System and Network design can be seen as two different aspects of the same design
problem, but they are generally addressed by different people belonging to different
knowledge domains and using different tools. In the context of System design, tools
aim at providing languages to describe models and engines for their simulation.
The focus is the functional description of computational blocks and the structural
interconnection among them. Particular attention is given to the description of concurrent processes with the specification of the synchronization among them through
wait/notify mechanisms. The most popular languages in this context are VHDL,
Verilog, SystemVerilog, and SystemC [9]; in particular, the last language is gaining
acceptance for its flexibility in describing both HW and SW components and for
the presence of add-on libraries for Transaction-Level Modeling (TLM) and verification. In the context of network simulation, current tools reproduce the functional
behavior of protocols, manage time information about transmission and reception of
events, and simulate packet losses and bit errors. The network can be modeled at different levels of detail, from packet transmission down to signal propagation [2, 12, 14].
The use of a single tool for System and Network modeling would be an advantage
for the design of networked embedded systems. Network tools cannot be used for
System design since they do not model concurrency within each network node and
do not provide a direct path to hardware/software synthesis. Instead, System tools
might be the right candidate for this purpose since they already model communications, at least at system level. However, the use of a system description language
for network modeling requires the creation of a basic set of primitives and protocols to support the asynchronous transmission of variable-length packets. To fill this
gap, in this work we have evaluated the potential of SystemC in the simulation of
packet-switched networks. In the past, SystemC was successfully used to describe
network-on-chip architectures [3] and to simulate the lowest network layers of the
Bluetooth communication standard [5]. In the proposed simulator, devices are modeled in SystemC and their instances are connected to a module that reproduces the
behavior of the communication channel; propagation delay, interference, collisions
and path loss are taken into account by considering the spatial position of nodes
and their on-going transmissions. The design of the node can be dealt with at different
abstraction levels: from system/behavioral level (e.g., Transaction Level Modeling)
down to register transfer level (RTL) and gate level. After each refinement step,
nodes can be tested in their network environment to verify that communication constraints are met. Nodes with different functionality or described at different abstraction levels can be mixed in the simulation, thus allowing the exploration of network
scenarios made of heterogeneous devices. Synthesis can be directly performed on
those models provided that they are described by using a suitable subset of the SystemC syntax.
The chapter is organized as follows. Section 13.2 describes the SystemC Network Simulation Library. Section 13.3 outlines the main solutions brought by this
work. Section 13.4 reports experimental results and, finally, conclusions are drawn
in Sect. 13.5.
13.2 The Architecture of SCNSL
The driving motivation behind SCNSL is to have a single simulation tool with which to
model both the embedded system under design and the surrounding network environment. SystemC has been chosen for its great flexibility, but a lot of work has
been done to introduce some important elements for network simulation.
Figure 13.2 shows the relationship among the system under design, SCNSL and
the SystemC standard library. In traditional scenarios, the system under design is
modeled by using the primitives provided by the SystemC standard library, i.e.,
modules, processes, ports, and events. The resulting model is then simulated by
a simulation engine, either the one provided in the SystemC free distribution or
a third-party tool. To perform network simulations new primitives are required as
described in Sect. 13.2.1. Starting from SystemC primitives, SCNSL provides such
elements so that they can be used together with System models to create network
scenarios.
Fig. 13.2 SCNSL in the context of SystemC modeling
Another point regards the description of the simulation scenario. In SystemC,
such description is usually provided in the sc_main() function which creates
module instances and connects them before starting simulation; in this phase, it is
not possible to specify simulation events as in a story board (e.g., “at time X the
module Y is activated”). Instead, in many network simulators such functionality
is available and the designer not only specifies the network topology, but also can
plan events, e.g., node movements, link failures, activation/de-activation of traffic
sources, and packet drops. For this reason, SCNSL also supports the registration of
such network-oriented events during the initial instantiation of SystemC modules.
As depicted in Fig. 13.2, the model of the system under design uses both traditional SystemC primitives for the specification of its internal behavior, and SCNSL
primitives to send and receive packets on the network channel and to test if the
channel is busy. SCNSL takes charge of translating network primitives (e.g., packet events) into SystemC primitives.
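As a rough illustration of this division of labor, the following sketch shows how a node model could combine standard SystemC constructs with network-oriented primitives. The interface, class, and method names (network_if, carrier_busy(), send()) are hypothetical and merely stand in for the corresponding SCNSL facilities.

    #include <systemc.h>
    #include <vector>

    // Hypothetical network interface standing in for the SCNSL primitives;
    // in the real library the node would be bound to a NodeProxy instance.
    struct Packet { std::vector<unsigned char> payload; };

    struct network_if : virtual sc_core::sc_interface {
        virtual bool carrier_busy() const = 0;      // test whether the channel is busy
        virtual void send(const Packet& p) = 0;     // packet-level transmission
    };

    // The node mixes ordinary SystemC behavior (threads, wait) with the
    // network primitives used to access the channel.
    SC_MODULE(SensorNode) {
        sc_core::sc_port<network_if> net;

        SC_CTOR(SensorNode) { SC_THREAD(run); }

        void run() {
            for (;;) {
                wait(100, sc_core::SC_MS);          // local computation period
                while (net->carrier_busy())         // carrier sense before sending
                    wait(1, sc_core::SC_MS);
                Packet p;
                p.payload.assign(8, 0xAB);          // application-specific payload
                net->send(p);
            }
        }
    };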
13.2.1 Main Components of SCNSL
To support network modeling and simulation, a tool has to provide the following
elements:
• Kernel: the kernel is responsible for the correct simulation, i.e., its adherence to
the behavior of an actual communication channel; the kernel has to execute events
in the correct temporal order and it has to take into account the physical features
of the channel such as, for example, propagation delay, signal loss and so forth;
• Node: nodes are the active elements of the network; they produce, transform and
consume transmitted data;
• Packet: in packet-switched networks the packet is the unit of data exchanged
among nodes; it consists of a header and a payload;
• Channel: the channel is an abstraction of the transmitting medium which connects two or more nodes; it can be either a point-to-point link or a shared medium;
• Port: nodes use ports to send and receive packets.
Figure 13.3 shows the main components of SCNSL; they can be easily related
to the previous list as explained below. The simulation kernel is implemented by
the Network_if_t class. This class is the most complex object of SCNSL, because it manages transmissions and, for this reason, it must be highly optimized. For
instance, in the wireless model, the network transmits packets and simulates their
transmission delay; it can delete ongoing transmissions, change node position, check
which nodes are able to receive a packet, and verify if a received packet has been
corrupted due to collisions. The standard SystemC kernel does not address these aspects directly, but it provides important primitives such as concurrency models and
events. The network class uses these SystemC primitives to reproduce transmission
behavior. In particular, it is worth noting that SCNSL does not have its own scheduler, since it exploits the SystemC scheduler by mapping network events onto standard SystemC events.
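The following fragment illustrates (purely as an example, not as SCNSL source code) how such a mapping can look: the completion of a packet transmission is represented by a plain sc_event that is notified with the computed transmission delay, so the timing is handled entirely by the SystemC kernel.

    #include <systemc.h>

    // Sketch: a "network event" (end of a transmission) mapped onto an sc_event.
    struct TransmissionTracker {
        sc_core::sc_event tx_end;

        void start_transmission(unsigned packet_bits, double rate_bps) {
            // transmission delay = packet length / transmission rate
            sc_core::sc_time delay(packet_bits / rate_bps, sc_core::SC_SEC);
            tx_end.notify(delay);   // the SystemC scheduler dispatches the event
        }
    };

A process interested in the end of the transmission then simply waits on tx_end.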
Fig. 13.3 Main components of SCNSL
The Node is one critical point of our library which supports both System and
Network design. From the point of view of a network simulator the node is just the
producer or consumer of packets and therefore its implementation is not important.
However, for the system designer, node implementation is crucial and many operations are connected to its modeling, e.g., change of abstraction level, validation, fault
injection, HW/SW partitioning, mapping to an available platform, synthesis, and so
forth. For this reason we introduced the class NodeProxy_if_t which decouples
node implementation from network simulation. Each Node instance is connected to
a NodeProxy instance and, from the perspective of the network simulation kernel,
the NodeProxy instance is the alter ego of the node. This solution makes it possible to keep a stable and well-defined interface between the NodeProxy and the simulation kernel and, at the same time, to leave complete freedom in the modeling choices for the node;
as depicted in Fig. 13.3 the box named Node is separated from the simulation kernel
by the box named NodeProxy, and different strategies can be adopted for the modeling of the node, e.g., interconnection of basic blocks or a finite-state machine. It is worth noting that other SystemC libraries can also be used in node implementation,
e.g., re-used IP blocks and testing components such as the well-known SystemC
Verification Library. For example, the figure also shows an optional package above
the node; this package is provided by SCNSL and it contains some additional SystemC modules, i.e., an RTL description of a timer and a source of stimuli. These
components may simplify the designer's work even if they are outside the scope of network simulation.
Another critical point in the design of the tool has been the concept of packet.
Generally, the packet format depends on the corresponding protocol, even though some features are always present, e.g., the length and the source/destination addresses. System design requires a bit-accurate description of packet content to test parsing functionality, while from the point of view of the network simulator the strictly required fields
are the packet length for bit-rate computation and some flags to mark collisions (if
routing is performed by the simulator, source/destination addresses are used too).
Fig. 13.4 Communicator: class hierarchy
Furthermore, the smaller the number of different packet formats, the more efficient the simulator implementation. To meet these opposing requirements, in SCNSL an internal packet format is used by the simulator, while the system designer can use different packet formats according to the protocol design. The conversion between the user packet format and the internal packet format is performed in the
NodeProxy.
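To make the distinction concrete, the simulator-internal format only needs the handful of fields used by the kernel. The following struct is an illustrative sketch, not the actual SCNSL definition; the field names are assumptions.

    #include <vector>

    // Minimal information the internal packet has to carry, independently of the
    // protocol-specific user packet format (illustrative only).
    struct InternalPacket {
        unsigned length_bits;                  // used for bit-rate/delay computation
        bool     collided;                     // set by the kernel on collisions
        int      src_id, dst_id;               // used only if the simulator routes
        std::vector<unsigned char> user_data;  // opaque copy of the user packet
    };

The NodeProxy would then provide the two conversion routines between the user format and this internal representation.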
Channels are very important components, because they are an abstraction of the
transmission media. Standard SystemC channels are generally used to model interconnections between HW components and, therefore, they can be used to model
the network at the physical level [5]. However, many general-purpose network simulators
reproduce transmissions at packet level to speed up simulations. SCNSL follows
this approach and provides a flexible channel abstraction named Communicator_if_t. The communicator is the most abstract transmission component and,
in fact, both NodeProxy and Network classes derive from it. New capabilities and
behavior can be easily added by extending this class. Communicators can be interconnected with each other to create chains. Each valid chain shall have a NodeProxy instance on one end and the Network on the other; hence transmitted packets
will move from the source NodeProxy to the Network traversing zero or more intermediate communicators and then they will eventually traverse the communicators
placed between the Network and the destination NodeProxy. In this way, it is possible to modify the simulation behavior by just creating a new communicator and
placing its instance between the network and the desired NodeProxy.
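As an illustration of this extension mechanism (again a sketch with invented names, reusing the InternalPacket structure sketched above), a communicator could be derived that randomly drops packets before forwarding them along the chain:

    #include <cstdlib>

    // Base class of a chain element: by default it forwards packets transparently.
    struct Communicator {
        Communicator* next;                       // next element towards the network
        Communicator() : next(0) {}
        virtual ~Communicator() {}
        virtual void forward(InternalPacket& p) {
            if (next) next->forward(p);
        }
    };

    // Example extension: a lossy link placed between a NodeProxy and the Network.
    struct LossyLink : Communicator {
        double drop_probability;
        explicit LossyLink(double prob) : drop_probability(prob) {}
        virtual void forward(InternalPacket& p) {
            double u = static_cast<double>(std::rand()) / RAND_MAX;
            if (u >= drop_probability)
                Communicator::forward(p);         // otherwise the packet is lost
        }
    };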
Figure 13.4 shows the class hierarchy of the Communicator; as said before, both
Network and NodeProxy inherit from the Communicator. A wireless network is a
specific kind of Network with its own behavior and thus it derives from the abstract
Network. A possible implementation of a wireless network model is described in
Sect. 13.3.4. NodeProxies depend both on the type of network and on the abstraction
level used in node implementation; for example, Fig. 13.4 reports a TLM and an
RTL version of a wireless NodeProxy.
13.3 Main Problems Solved by SCNSL
This section describes some issues encountered during the development of SCNSL
and the adopted solutions. The first problem regards the co-existence of RTL system
models with the packet-level simulation. The second one regards the assessment of
packet validity with reference to collision and out-of-range transmissions. The third
problem regards the planning of the network simulation, i.e., source activations,
link failures, and so forth. Finally, an application of SCNSL to a wireless network
is described.
13.3.1 Simulation of RTL Models
As said before, SCNSL supports the modeling of nodes at different abstraction levels. In the case of RTL models, the co-existence of RTL events with network events has
to be addressed. RTL events model the setup of logical values on ports and signals
and they have an instantaneous propagation, i.e., they are triggered at the same simulation time in which the corresponding values are changing. Furthermore, except
for tri-state logic, ports and signals always have a value associated with them, i.e., statements like “nothing is on the port” are meaningless. Instead, network events are mainly related to the transmission of packets: each transmission is not instantaneous because of the transmission delay; during idle periods the channel is empty; and the repeated transmission of the same packet is possible and leads to distinct network
events.
In SCNSL, RTL node models handle packet-level events by using three ports
signaling the start and the end of each packet transmission, and the reception of
a new packet, respectively. Also in this case, the NodeProxy instance associated with
each node translates network events into RTL events and vice versa. In particular,
each RTL node has to write on a specific port when it starts the transmission of
a packet while another port is written by the corresponding NodeProxy when the
transmission of the packet is completed. A third port is used to notify the node
about the arrival of a new packet. With this approach, each transmission or reception of a packet is detected even if consecutive packets are identical.
The last issue regards the handling of packets of different sizes. Real-world protocols use packets of different sizes, while RTL ports must have a constant width
decided at compile-time. SCNSL solves this problem by creating packet ports with
the maximum packet size allowed in the network scenario and by using an integer
port to communicate the actual packet size. In this way a NodeProxy or a receiver node can read only the actually used bytes, thus obtaining a correct simulation.
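The port set described above can be pictured as follows; this is a schematic sketch only (the port names and the maximum packet width are assumptions, not the SCNSL declarations):

    #include <systemc.h>

    static const int MAX_PACKET_BITS = 1024;   // assumed network-wide maximum size

    SC_MODULE(RtlNodeSketch) {
        sc_core::sc_in<bool> clk;
        // outgoing packet, padded to the maximum width, plus its actual length
        sc_core::sc_out<sc_dt::sc_bv<MAX_PACKET_BITS> > tx_packet;
        sc_core::sc_out<int>  tx_length;
        sc_core::sc_out<bool> tx_start;    // node signals the start of a transmission
        sc_core::sc_in<bool>  tx_done;     // written by the NodeProxy at completion
        // incoming packet and its actual length
        sc_core::sc_in<sc_dt::sc_bv<MAX_PACKET_BITS> > rx_packet;
        sc_core::sc_in<int>   rx_length;
        sc_core::sc_in<bool>  rx_new;      // notifies the node of a new packet

        SC_CTOR(RtlNodeSketch) {}
    };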
13.3.2 Assessment of Transmission Validity
In wireless scenarios an important task of the network simulator kernel is the assessment of transmission validity which could be compromised by collisions and
out-of-range distances. The validity check has been implemented by using two flags
and a counter. The first flag is associated with each node pair and it is used to check
the validity of the transmission as far as the distance is concerned; if the sender or
the receiver of an ongoing transmission has been moved outside the maximum transmission range, this flag is set to false. The second flag and the counter are associated with each node and they are used to check the validity with respect to collisions. The counter is used to register the number of active transmitters which are interfering at a given receiver; if the value of this counter is greater than one, then the on-going transmissions to the given receiver are not valid since they are compromised by collisions.
However, even if, at a given time, the counter holds one, the transmission could be
invalid due to previous collisions; the purpose of the flag is to track this case. When
a packet transmission is completed, if the value of the counter is greater than one,
the flag is set to false. The combined use of the flag and the counter makes it possible to cover all transmission cases in which packet validity is compromised by collisions.
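The bookkeeping can be summarized by the following sketch. It is only an illustration of the counter-plus-flag idea described above (the data layout and function names are not taken from SCNSL), but it shows how both overlapping and previously collided transmissions are caught:

    // Per-receiver state: number of interfering transmitters plus a collision flag.
    struct ReceiverState {
        int  interfering_tx;
        bool collided;
        ReceiverState() : interfering_tx(0), collided(false) {}
    };

    inline void on_transmission_start(ReceiverState& rx) {
        ++rx.interfering_tx;
        if (rx.interfering_tx > 1)
            rx.collided = true;           // overlapping transmissions collide
    }

    // Returns whether the packet that just ended is valid at this receiver.
    inline bool on_transmission_end(ReceiverState& rx, bool in_range) {
        bool valid = in_range && !rx.collided && rx.interfering_tx == 1;
        --rx.interfering_tx;
        if (rx.interfering_tx == 0)
            rx.collided = false;          // channel idle again, clear the flag
        return valid;
    }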
13.3.3 Simulation Planning
In several network simulators, special events can be scheduled during the setup of
the scenario; such special events regard node movements, link status changes, traffic activation and packet drops. This feature is important because it allows simulating the model in a dynamic network context. In SCNSL the simulation kernel
does not have its own event dispatcher, hence this feature has been implemented in an optional class called EventsQueue_t. Even if SystemC allows writing in each node the code which triggers such events, the choice of managing them in a specific
class of the simulator leads to the following advantages:
• Standard API: the event queue provides a clear interface to schedule a network
event without directly interacting with the Network class or altering node implementation.
• Simplified user code: Network events are more complex than System ones; the
event queue hides such complexity thus simplifying user code and avoiding setup
errors.
• Higher performance: the management of all the events inside a single class improves performance; in fact the event queue is built around a single SystemC
thread, minimizing memory usage and context switching.
This class can also be used to trigger new events defined through user-defined functions. The only constraint is that such functions shall not block the caller, i.e., the event queue, in order to allow the correct scheduling of the following events.
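To illustrate how such a queue can be built around a single SystemC thread, the following self-contained sketch stores the planned actions sorted by time and executes them in order. It is only an illustration of the idea (the real EventsQueue_t API may differ) and assumes that all events are registered before simulation starts.

    #include <systemc.h>
    #include <map>

    SC_MODULE(SimpleEventsQueue) {
        typedef void (*Action)();                        // non-blocking callbacks
        std::multimap<sc_core::sc_time, Action> plan;    // time -> scheduled action

        SC_CTOR(SimpleEventsQueue) { SC_THREAD(run); }

        // called during elaboration, e.g. "at time X activate module Y"
        void schedule(const sc_core::sc_time& t, Action a) {
            plan.insert(std::make_pair(t, a));
        }

        void run() {
            std::multimap<sc_core::sc_time, Action>::iterator it;
            for (it = plan.begin(); it != plan.end(); ++it) {
                wait(it->first - sc_core::sc_time_stamp());  // advance to event time
                it->second();       // must return without blocking the queue
            }
        }
    };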
13.3.4 Application to a Wireless Scenario
In this section the behavior of the library is described with reference to a wireless
scenario; in particular, an RTL node model is reported to clarify the concepts described
in Sect. 13.3.1. Module Rtl::Node_t represents an abstract network node. It
has a set of properties which are used by the simulation framework to reproduce
network behavior. Transmission rate represents the number of bits per unit of time
which the interface can handle; it is used to compute the transmission delay and the
network load. Transmission power is used to evaluate the transmission range and
the signal-to-noise ratio. Transmission rate and transmission power can be changed
during simulation to accurately simulate and evaluate power saving algorithms. The
module has the following ports:
• packet ports to send and receive packets, respectively;
• carrier port to perform carrier sense;
• packet length ports to report actual packet size (one for each direction);
• packet event management ports to report the presence of a new packet (one for each direction) and the completion of packet transmission;
• rate port to communicate the transmission rate to the simulation kernel through
the NodeProxy;
• power port to communicate the transmission power to the simulation kernel
through the NodeProxy;
• sensor port to model a data input whose meaning is application-specific (e.g.,
a temperature sensor).
The sensor port of each node is bound to an instance of the module Stimulus_t which reproduces a generic environmental data source. It takes as input a
clock signal as timing reference to synchronize the generation of data values. Different kinds of stimuli can be generated by sub-classes of this module; the intensity
of the stimuli and their localization in time can follow a given statistical distribution
or be derived from a trace file.
The class Rtl::NodeProxy_t interfaces the node with the network and it
manages two node properties, i.e., node position and receiver sensitivity. Node position is used to compute the path loss and to reproduce mobile scenarios. The receiver sensitivity is the minimum signal power below which the packet cannot be
received. Even if these properties are related to the node, they are frequently used
by the simulation kernel and thus we decided to model them in the NodeProxy to
simplify their access.
When a node starts a transmission, its position relative to all other nodes in the same network is computed, and the signal level at all those nodes is derived according to the path-loss formula 1/d^α. For each node, if the signal level is higher than its
receiver sensitivity, then it can be detected and it may interfere with other on-going
transmissions. If there are already on-going transmissions reaching the receiving
node, then all those messages are marked as collided (i.e., they are not valid). Also,
if there are other on-going transmissions which the currently sending node reaches
with its transmission, then those messages are marked as collided as well. Since
wireless nodes cannot detect collisions, a collided message is not interrupted and
the channel remains busy. The transmission time depends on the packet length, the
transmission rate, and the propagation delay.
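The reception test sketched below condenses this rule into code; the constants, units and function name are illustrative only and not taken from the library:

    #include <cmath>

    // Received power follows a 1/d^alpha path-loss law and is compared against
    // the receiver sensitivity managed by the NodeProxy.
    inline bool can_receive(double tx_power, double distance,
                            double alpha, double rx_sensitivity) {
        double rx_power = tx_power / std::pow(distance, alpha);
        return rx_power >= rx_sensitivity;   // detectable (and possibly interfering)
    }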
Fig. 13.5 CPU time as a function of the number of simulated nodes for the different tested tools and node abstraction levels
13.4 Experimental Results
The SystemC Network Simulation Library has been used to model a wireless sensor network application consisting of a master node which repeatedly polls sensor
nodes to obtain data. Node communications reproduce a subset of the well-known
IEEE 802.15.4 standard, i.e., peer unslotted transmissions with acknowledgment [10].
Different scenarios have been simulated with SCNSL by using nodes at different abstraction levels: (1) all nodes at TLM-PVT level, (2) all nodes at RTL, and
(3) the master node at RTL and the sensor nodes at TLM-PVT. The designer wrote 172 lines of code for the sc_main(), 688 lines for the RTL node and 633 lines for the TLM-PVT node.
Figure 13.5 shows the CPU time as a function of the number of nodes for the
three scenarios and for a simulation provided by NS-2 representing the behavior of
a pure network simulator. A logarithmic scale has been used to better show results.
Simulations have been performed on an Intel Xeon at 2.8 GHz with 8 GB of RAM and Linux kernel 2.6.23; CPU time has been computed with the time command by
summing up user and system time.
The speed of SCNSL simulations at TLM-PVT level is about two orders of magnitude higher than in the case of the NS-2 simulation, showing the validity of SCNSL as a tool for efficient network simulation. Simulations at RT level are clearly slower because each node is implemented as a clocked finite state machine, as commonly done to increase model accuracy in System design. However, a good trade-off between simulation speed and accuracy can be achieved by mixing nodes at different abstraction levels; in this case, experimental results report about the same performance as NS-2, with the advantage that at least one node is described at RT level.
13.5 Conclusions
We have presented a SystemC-based approach to model and simulate networked
embedded systems. As a result, a single tool has been created to model both the em-
bedded system under design and the surrounding network environment. Different
issues have been solved to reconcile System design and Network simulation requirements while preserving execution efficiency. In particular, the combined simulation of RTL system models and packet-based networks has been addressed. Experimental results for a large network scenario show a speed-up of nearly two orders of magnitude with respect to NS-2 with TLM modeling, and about the same performance as NS-2 with
a mixed TLM/RTL scenario.
References
1. V. Aue et al. Matlab based codesign framework for wireless broadband communication DSPs.
In Proc. IEEE ICASSP, pages 1253–1256, 2001.
2. AWE Communications. WinProp: Software-Tool for the Planning of Mobile Communication
Networks. http://www.awe-communications.com.
3. D. Bertozzi et al. NoC synthesis flow for customized domain specific multiprocessor systems-on-chip. IEEE Trans. Parallel Distrib. Syst., 16(2):113–129, 2005.
4. N. Bombieri, F. Fummi, and D. Quaglia. TLM/network design space exploration for networked embedded systems. In Proc. IEEE/ACM/IFIP CODES+ISSS, pages 58–63, 2006.
5. M. Conti and D. Moretti. System level analysis of the bluetooth standard. In Proc. IEEE DATE,
pages 118–123, March 2005.
6. D. Desmet et al. Timed executable system specification of an ADSL modem using a C++
based design environment: a case study. In Proc. IEEE CODES, pages 38–42, 1999.
7. D. Dietterle, J. Ebert, G. Wagenknecht, and R. Kraemer. A wireless communication platform
for long-term health monitoring. In Proc. IEEE International Conference on Pervasive Computing and Communications Workshop, March 2006.
8. J. Fleischmann and K. Buchenrieder. Prototyping networked embedded systems. IEEE Computer, 32(2):116–119, 1999.
9. IEEE Std 1666—2005 IEEE Standard SystemC Language Reference Manual. IEEE Std
1666—2005, pages 1–423, 2006.
10. LAN/MAN Standards Committee of the IEEE Computer Society. IEEE Standard for Information Technology—Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer
(PHY) Specifications for Low Rate Wireless Personal Area Networks (LR-WPANs), September 2006.
11. N. Lugil and L. Philips. A W-CDMA transceiver core and a simulation environment for 3GPP
terminals. In Proc. IEEE Symp. on Spread Spectrum Techniques and Applications, pages 491–
495, September 2000.
12. S. McCanne and S. Floyd. NS Network Simulator—version 2. http://www.isi.edu/nsnam/ns.
13. R. Pasko et al. Functional verification of an embedded network component by co-simulation
with a real network. In Proc. IEEE HLDVT, pages 64–67, 2000.
14. C. Zhu et al. A comparison of active queue management algorithms using the OPNET modeler.
IEEE Communications Magazine, 40(6):158–167, 2002.
Chapter 14
Modeling of Embedded Software Multitasking
in SystemC/OSSS
Philipp A. Hartmann, Philipp Reinkemeier,
Henning Kleen and Wolfgang Nebel
Abstract Since the software part in today’s designs is increasingly important, the
impact of platform decisions with respect to the hardware and the software infrastructure (OS, scheduler, priorities, mapping) has to be explored in early design
phases.
In this work, we present an extension of the existing SystemC™-based OSSS
design flow regarding software multi-tasking in system models. The simulation of
the OSSS software run-time model supports different scheduling policies, as well
as efficient timing annotations, and deadlines. Inter-task communication is modeled
via user-defined Shared Objects. The impact of timing annotation granularity on the
achievable simulation performance and preemption accuracy is studied. As a result,
a lazy synchronization scheme is proposed that is based on omitting SystemC time synchronizations that do not have observable effects on the application model.
Keywords Embedded software · Preemptive multi-tasking · SystemC · Abstract
RTOS modeling · Simulation performance · Design-space exploration
14.1 Introduction
The increasing pressure on time-to-market and cost of today’s embedded systems requires ever increasing design productivity. As a result, embedded software becomes
more and more important, since the required design effort for a software function
is usually expected to be lower than for the respective hardware parts. Additionally,
an increased flexibility and the possibility of late changes in the design process are
an advantage. On the other hand, the choice of the “correct” software architecture, such as the chosen RTOS, task priorities, scheduling policies, and the mapping of tasks to processors, is by no means a simple task.
To help developers during this phase of the design space exploration, efficient
and fast modeling of different architecture alternatives has to be supported by the
chosen design flow. Apart from considering the underlying hardware platform, this
P.A. Hartmann ()
OFFIS Institute for Information Technology, Oldenburg, Germany
e-mail: philipp.hartmann@offis.de
includes the early analysis of software/RTOS effects on the system’s overall performance, which is important especially if multiple tasks are sharing a single processor.
Real-time capabilities have to be explored by choosing, e.g., the scheduler and task priorities to fulfill a given set of requirements like deadlines or other application-specific constraints.
Not only since its IEEE standardization [13] has SystemC™ [17] been a very popular language for system-level modeling of complex hardware/software systems. Since the modeling of real-time-software-specific aspects is not directly supported by SystemC itself, several extensions to SystemC have been developed to enable the early exploration of real-time software properties, some of which will be compared briefly in Sect. 14.2.
The approach to software multi-tasking presented in this chapter is based on
OSSS—the Oldenburg System Synthesis Subset, an extension of the SystemC synthesisable subset [21] with object-oriented features. An introduction to the accompanying OSSS design flow is given in Sect. 14.3. The OSSS methodology is characterized by a layered approach and its partly automated path to an implementation.
Up to now, the support for software modeling in OSSS has been limited to a single task per processor. In this work, we extend the modeling capabilities of OSSS
with support for simulating multiple tasks on a processor running a given RTOS. In
Sect. 14.5, some observable effects of different software architecture decisions are
shown with an instructive example. This demonstrates the feasibility of our method
during (software) design space exploration.
Due to the object-oriented approach, communication between different components is modeled in OSSS by abstract method calls to so-called Shared Objects. This
concept is reused for the modeling of inter-task communication in the new software
parts. The main advantage of this approach is the abstraction of error-prone RTOS
synchronization primitives and therefore a more robust modeling environment. In
Sect. 14.4, we present the new software multi-tasking features of OSSS, like tasks
and their properties, timing abstraction and Software Shared Objects.
An important property of abstract software models is their simulation performance. The synchronization overhead between several tasks (i.e. SystemC
processes) and the underlying simulation kernel is one of the limiting factors for
this. As shown in Sect. 14.6.1, the granularity of the timing annotations within
the Software Tasks has a direct impact on the overall simulation performance. In
Sect. 14.6.2, we show that this impact can be significantly reduced by applying our
lazy synchronization scheme, which reduces the SystemC overhead without changing the observable system behavior and correctness. In Sect. 14.7 a summary and
directions of future work are given.
14.2 Related Work
Many different approaches to modeling embedded software in the context of SystemC have been proposed. Some of them, like the SPACE framework [3], rely on
the co-simulation with an external RTOS simulator or even an instruction set simulator. Although these approaches may provide higher simulation accuracy, they
usually lack the required simulation performance for early platform exploration and
are therefore out of the scope here.
Abstract RTOS models, like the one presented for SpecC in [6], are better suited
for early comparison of different scheduling and priority alternatives. The timing accuracy and therefore the simulation performance of this approach is limited by the
fixed minimal resolution of discrete time advances. Just recently, an extension has been presented in [19] that deploys techniques for preemptive scheduling models very similar to the ones presented in this work. Its “Result Oriented Modeling” collects and consumes consecutive timing annotations while still handling preemptions accurately, similar to the “lazy synchronization” scheme presented in
Sect. 14.6.2.
Several approaches based on abstract task graphs [12, 15, 20] have been proposed
as well. In this case, a pure functional SystemC model is mapped onto an architecture model including an abstract RTOS. The mapping requires an abstract task graph
of the model, where estimated execution times can be annotated on a per-task basis
only, ignoring control-flow dependent durations. This reduces the achievable accuracy.
A single-source approach for the generation of embedded software from
SystemC-based descriptions has been proposed in [4, 11, 18]. Starting from untimed, heterogeneous models in HetSC, a POSIX conformant description can automatically be generated by the SWgen methodology. The performance analysis of
the resulting model with respect to an underlying RTOS model can be evaluated
with the PERFidiX library, that augments the generated source via operator overloading with estimated execution times. Due to the fine-grained timing annotations,
the model achieves a good accuracy but relatively weak simulation performance.
This interesting approach might significantly benefit from our proposed lazy synchronization, as presented in Sect. 14.6.
An early proposal of a generic RTOS model based on SystemC has been published in [14]. The presented abstract RTOS model achieves time-accurate task
preemption via SystemC events and models time consumption via a delay()
method. Additionally, the RTOS overhead can be modeled as well. Two different
task scheduling schemes are studied: The first one uses a dedicated thread for the
scheduler, while the second one is based on cooperative procedure calls, avoiding this overhead. Although in this approach explicit inter-task communication resources are required (message queues, etc.), the simulation time advances simultaneously as the tasks consume their delays.
In [10], an RTOS modeling tool is presented. Its main purpose is to accurately
model an existing RTOS on top of SystemC. It cannot be directly used by a system designer. In this approach, the next RTOS “event” (like an interrupt, a scheduling event, etc.) is predicted at run-time. This improves simulation speed, but requires deeper knowledge of the underlying system.
Just recently, two other approaches [16, 22] have been published in German.
In [16], tasks are derived from a special base class, which augments the regular
SystemC wait() method with synchronization calls to an abstract RTOS model.
An additional FIFO class, with a similar synchronization scheme is included as
a HW/SW communication primitive. Regular SystemC events can be used by the
application for inter-task communication as well. In [22], the main focus lies on
precise interrupt scheduling. For this purpose, a separate scheduler is introduced
to handle incoming interrupt requests. This is similar to our ring-based scheduling
approach (see Sect. 14.4.1). Timing annotations and synchronization within user
tasks are handled by a replacement of the SystemC wait(), similar to [16].
In this work, we present an abstract model for software multi-tasking based on
OSSS, which includes some properties of the above-mentioned approaches, especially concerning a simple RTOS model. The flexible integration of user-defined communication mechanisms via Shared Objects (Sect. 14.4.3) and the accurate and efficient handling of timing annotations (Sect. 14.6.2), even in preemptive scheduling scenarios, are the main contributions of our approach.
14.3 The OSSS Design Flow
Based on an object-oriented hardware design approach [8], one of the main objectives of OSSS is to enable the usage of object-oriented features known from languages such as C++ in a synthesisable SystemC model. This includes concepts such
as classes, inheritance, polymorphism and abstract communication based on method
calls. OSSS extends the synthesisable subset of SystemC [21] by defining synthesis
semantics for many of these features. Furthermore, new concepts specifically targeted to the modeling and design of embedded systems are introduced to raise the
level of abstraction and increase the expressiveness of the language.
OSSS defines separate layers of abstraction for improving refinement support
during the design process. The design entry point in OSSS is called the Application
Layer. By manually applying a mapping of the system’s components, the design can
be refined from Application Layer to the Virtual Target Architecture Layer, which
can be synthesized to a specific target platform in a separate step by the synthesis
tool Fossy [5].
14.3.1 Application Layer
On this layer the hardware/software system is modeled as a set of communicating processes, representing hardware modules and software tasks. The Application
Layer model abstracts from the details of the communication between the components of a model, such as the actual implementation of the communication channel,
even across HW/SW boundaries.
One concept introduced by OSSS without a direct equivalent in C++ is the so-called Shared Object, which equips user-defined classes with specific synchronization facilities. Due to their well-defined synthesis semantics, Shared Objects can act
as a replacement for some non-synthesisable features of SystemC such as hierarchical channels, mutexes and semaphores.
Synchronization is performed by arbitrating concurrent accesses and by a special feature called Guarded Methods, which can be used to block the execution of a method
according to a user-defined condition. As a result, they are especially useful for modeling inter-process communication, both between hardware and software processes.
Communication between modules or tasks and Shared Objects is performed by
method calls through abstract communication links (binding).
A comprehensive description of the Shared Object concept, including several
design examples, is part of the publicly available simulation library [7].
14.3.2 Virtual Target Architecture Layer
In a refinement step, the Application Layer model is transformed to a so-called Virtual Target Architecture. This involves mapping software tasks to processor(s) and
hardware modules to certain hardware blocks as shown in Fig. 14.1. Moreover, the
abstract communication links of the Application Layer model are then to be mapped
to specific communication infrastructure, like buses or point-to-point channels.
To enable an easy mapping of the method-based communication on the Application Layer to a signal-based communication, OSSS uses a concept known as Remote Method Invocation (RMI). A detailed description of this scheme is beyond the
scope of this contribution, but can be found in [9]. Basically, the implementation of
the RMI concept allows the transport of method calls including their parameters and
return values over arbitrary communication infrastructures and is used for HW/HW
as well as HW/SW communication.
Fig. 14.1 Mapping of Application Layer to Virtual Target Architecture Layer
The Virtual Target Architecture Layer model of the design is then translated automatically by the synthesis tool Fossy [5], producing RTL SystemC or VHDL output
which is used for further synthesis by vendor-specific implementation tools.
14.4 Modeling Software in OSSS
The approach to modeling multi-tasking software in OSSS is not meant to directly
model existing real-time operating system (RTOS) primitives. The Software Tasks
in OSSS are meant to run on top of a generic but lightweight run-time system, as depicted in Fig. 14.1. Task synchronization and inter-task communication are modeled
with (Software) Shared Objects, similar to the modeling with OSSS in the hardware domain. This enables a seamless specification environment, where the same
concepts are used for both hardware and software on the Application Layer (see
Sect. 14.3). The flexible timing annotation mechanism enables high simulation performance, since the synchronization overhead with the SystemC kernel can be minimized.
14.4.1 Abstraction of Run-time System
The basis of the OSSS software run-time simulation model is an OSSS software
runtime abstraction class. This pre-defined library element handles the time-sharing
of a single processor by several Software Tasks (see Sect. 14.4.2), which are bound
to a particular RTOS instance.
A specific scheduling policy can be bound to each set of tasks, grouped by the
same ring, as depicted in Fig. 14.2. Several frequently used scheduling policies are
already provided by the modeling library, most notably
• static priorities (preemptive and cooperative),
• time-slice based round-robin,
• earliest-deadline first,
• and rate monotonic.
Additionally, user-defined schedulers can be implemented through an abstract
interface class. The RTOS overhead of context switches and execution times of
scheduling decisions can be annotated as well.
Several rings can be specified by the designer. The rings are an additional priority
layer, where every ring receives its own scheduling policy. The processor is assigned
to a task from lower-priority rings only if there is no task in the ready state (see Sect. 14.4.2) in any higher-priority ring. An example use case for this feature is to model (prioritized, non-preemptive) interrupt service routines alongside otherwise time-sliced, round-robin scheduled user tasks.
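A configuration of this kind could look roughly as follows. The class and method names (osss_runtime, set_scheduler(), and the scheduler classes) are assumptions made for illustration; the actual OSSS library API may differ.

    // Hypothetical configuration sketch: one run-time instance per processor,
    // two rings with different scheduling policies.
    osss_runtime cpu0_os("cpu0_os");

    // ring 0: non-preemptive, statically prioritized "interrupt service" tasks
    cpu0_os.set_scheduler(0, new osss_static_priority_scheduler(false /*preemptive*/));

    // ring 1: ordinary user tasks, time-sliced round robin (1 ms slices)
    cpu0_os.set_scheduler(1,
        new osss_round_robin_scheduler(sc_core::sc_time(1, sc_core::SC_MS)));

    // Tasks bound to ring 1 only get the processor while no ring-0 task is ready.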
Fig. 14.2 Ring-based task scheduling
With this set of basic elements, the behavior of the real RTOS on the target
platform can be modeled. Task synchronization is not part of the modeling elements, since the inter-task communication is meant to be modeled by using Software Shared Objects. On the target platform, an implementation of this primitive
will be provided by means of the scheduling primitives of the architecture (see
Sect. 14.4.3).
14.4.2 Software Tasks
SystemC processes that are meant to be implemented in software are modeled as Software Tasks in OSSS. These tasks are derived from a common base class
and define their behavior within a main() method. In the multi-tasking OSSS
implementation, tasks are equipped with additional properties, like a priority, an
initial startup time, optional periods and deadlines, and an optional task ring (see
Sect. 14.4.1).
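Such a task could be sketched as follows. This is an illustrative fragment only: the base class name osss_software_task, the attribute setters, and the Shared Object access are assumptions standing in for the actual OSSS interfaces; OSSS_EET() is the timing annotation macro introduced in Sect. 14.4.4.

    // Hypothetical Software Task: behavior goes into main(), task attributes
    // (priority, period, deadline) are set at construction time.
    class SamplingTask : public osss_software_task {
    public:
        SamplingTask() {
            set_priority(2);
            set_period(sc_core::sc_time(5, sc_core::SC_MS));
            set_deadline(sc_core::sc_time(5, sc_core::SC_MS));
        }

        virtual void main() {
            while (true) {
                int sample = 0;
                OSSS_EET(sc_core::sc_time(200, sc_core::SC_US)) {
                    sample = read_sensor();        // annotated computation
                }
                buffer->put(sample);               // Shared Object call, see Sect. 14.4.3
            }
        }

    private:
        int read_sensor() { return 42; }           // placeholder for real behavior
        SampleBuffer* buffer;                      // user-defined Shared Object
    };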
During simulation, the tasks can be in different states as shown in Fig. 14.3. The
distinction between blocked and waiting has been introduced to ease the run-time detection of deadlocks. A task in the waiting state will enter the ready state after a given amount of time, whereas a blocked task can only be unblocked once
a certain logical condition is fulfilled (usually a guard, see below).
Technically, the Software Tasks are implemented with SC_THREADs, which use internal synchronization mechanisms with the RTOS abstraction during the simulation to achieve the effect of only a single active task at a time on a given processor.
Task preemption is supported at arbitrary times, independently of the granularity of
the annotated execution times, as discussed in Sect. 14.6. Regular SystemC wait()
calls are disabled within Software Tasks.
Fig. 14.3 Task states and transitions (terminate edges omitted)
// guard definition ( <name>, <condition> )
OSSS_GUARD( not_empty, cnt_items_ > 0 );
// guarded method declaration of int get()
// – blocking, until not_empty guard holds
OSSS_GUARDED_METHOD
( int, get, OSSS_PARAMS(0), not_empty );
Listing 14.1 Example of a guarded method
14.4.3 Software Shared Objects
The inter-task communication in an OSSS software model is specified, as in an OSSS hardware model, in terms of user-defined Shared Objects (see also Sect. 14.3),
which are inspired by the protected objects known from Ada [1]. On the final platform, the software implementation of a shared object has to be integrated with the
OSSS software run-time, since it usually requires RTOS primitives for the synchronization.
As outlined in Sect. 14.3, Shared Objects provide mutually exclusive access and
Guarded Methods to ensure deterministic behavior across several concurrent tasks
(see Listing 14.1). The guard mechanism resembles the well-known monitor concept and can directly be implemented e.g. on an underlying POSIX-compatible OS
using condition variables (pthread_cond_t). The required locks for mutually exclusive access can be implemented using the existing locking mechanism of the
underlying RTOS (e.g. mutexes).
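As an illustration of this mapping (a sketch of the classic monitor pattern, not generated OSSS code), the guarded get() of Listing 14.1 could be realized on a POSIX target roughly like this:

    #include <pthread.h>

    // Monitor-style implementation of a bounded buffer with a guarded get():
    // mutual exclusion via a mutex, the not_empty guard via a condition variable.
    class FifoImpl {
        pthread_mutex_t mtx;
        pthread_cond_t  not_empty;
        int             data[16];
        int             cnt_items_;
    public:
        FifoImpl() : cnt_items_(0) {
            pthread_mutex_init(&mtx, NULL);
            pthread_cond_init(&not_empty, NULL);
        }
        int get() {                               // guarded by (cnt_items_ > 0)
            pthread_mutex_lock(&mtx);
            while (cnt_items_ == 0)               // re-check the guard in a loop
                pthread_cond_wait(&not_empty, &mtx);
            int value = data[--cnt_items_];
            pthread_mutex_unlock(&mtx);
            return value;
        }
        void put(int value) {                     // bounds checking omitted
            pthread_mutex_lock(&mtx);
            data[cnt_items_++] = value;
            pthread_cond_signal(&not_empty);      // may unblock a waiting get()
            pthread_mutex_unlock(&mtx);
        }
    };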
Therefore, in OSSS the complexity of inter-task synchronization primitives is
hidden from the designer, which improves design productivity. It is planned to automatically generate the required run-time system from an OSSS software model and
use cross-compilation techniques to translate Software Tasks and Shared Objects
directly to the target machine code.
14.4.4 Software Execution Times
A proper modeling of software multi-tasking requires the consideration of time consumption of the modeled tasks. In OSSS, the Estimated Execution Time (EET) of
a code block can be annotated within Software Tasks and inside methods of Shared
Objects with the help of the OSSS_EET() macro (see also [7]).
The macro receives a duration as an sc_time argument that estimates the execution time of the following code block. This enables a flexible and accurate annotation, depending on the required accuracy. Control structures can be efficiently annotated and, during the simulation, the resulting execution time respects the (potentially data-dependent) control flow. Listing 14.2 exemplifies the syntax of these annotations. The simulation semantics of these annotations are discussed in Sect. 14.6.1.
OSSS_EET() blocks cannot be nested and must not contain inter-task communication calls. As of today, these times are meant to be determined by profiling the
cross-compiled code on the target processor. On the other hand, it is envisioned to
extract and back-annotate these times automatically in the future.
// ...
while( some_condition )
  // the following block has to be finished within 1 ms
  OSSS_RET( sc_time( 1000, SC_US ) )
  {
    OSSS_EET( sc_time( 20, SC_US ) ) {
      // computation that consumes 20 µs
      max_i = compute_number_of_iterations();
    }
    // estimate a data-dependent loop
    for( int i = 0; i < max_i; ++i )
      OSSS_EET( sc_time( 100, SC_US ) ) {
        // loop body
      }
    if( my_condition ) {
      // communication only outside of EET blocks
      result = my_shared->get(); // see Listing 14.1
    }
  } // end of RET block and loop
Listing 14.2 Example of estimated and required execution time annotations
Fig. 14.4 Impact of scheduling policy on simulation
In addition to the EETs, OSSS enables the designer to specify local deadlines
for a specific code block. This is especially useful in combination with inter-task
communication calls or preemptive scheduling policies.
The syntax follows the one for the EETs, meaning that a certain Required
Execution Time (RET) is specified by the OSSS_RET() macro, which guards
the duration of the following code block. If required, RETs can be nested at arbitrary depth. The consistency of nested RETs is checked during the simulation, like any other violation of a RET or of the optional, globally annotated deadline of a
given task. If such a timing constraint cannot be fulfilled during the simulation, it
is reported by the library. Unmet RETs can arise from the choice of the scheduling
policy, (additional) delays caused by blocking guard conditions, or simply unexpectedly long estimated execution times (e.g. max_i ≥ 10 in Listing 14.2).
14.5 Exploration of Platform Effects
The main purpose of the modeling of abstract software multi-tasking in early design phases is the exploration of the impact of platform choices on the system’s correctness and performance. Therefore, we tested our approach with several example
scenarios. In Fig. 14.4, the simulation results of one example specified by G.C. Buttazzo [2, p. 95] are shown. Two tasks T1, T2 share the same processor with periods τ1 = 5 ms, τ2 = 7 ms and estimated execution times t1 = 2 ms, t2 = 4 ms. This minimalistic system is scheduled (a) by a rate-monotonic scheduler (RMS) and (b) according to earliest deadline first (EDF).
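This outcome can be anticipated with a standard utilization check (added here for illustration): the total processor utilization is U = t1/τ1 + t2/τ2 = 2/5 + 4/7 ≈ 0.97. Since this exceeds the Liu–Layland bound for two tasks under rate-monotonic scheduling, 2(√2 − 1) ≈ 0.83, RMS schedulability is not guaranteed, whereas EDF can feasibly schedule any task set with U ≤ 1 on a single processor.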
In the case of the rate-monotonic schedule, the generated traces expose several deadline violations (denoted by circles in Fig. 14.4). But even in the case of more complex systems, especially with inter-task communication and varying run-times, such violations would be caught by our simulation model as well.
14.6 Simulation Results
An important factor for the feasibility of abstract software models is the simulation accuracy they can achieve. Nonetheless, simulation performance is a critical factor
during design space exploration as well. These goals are contradictory, as we will
discuss in this section.
14.6.1 Accuracy and Performance
In order to ensure that only one Software Task is active at any time during the simulation, the different tasks have to be synchronized with the central RTOS abstraction, which then dispatches the tasks according to its scheduling policy. Since the
simulation time is usually handled by the (cooperative) SystemC kernel, this synchronization requires an (at least implicit) call of wait().
In order to support a preemptive scheduling policy, this implicit synchronization
is performed by the abstract run-time system. For every EET block, the run-time
advances the SystemC time for the current task by the annotated time and the additional delay due to preemptions of the current task by other (e.g. higher-priority)
tasks during this period.
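A simplified sketch of this “strict” consumption (not the actual OSSS implementation; the helper preemption_delay_during_last_wait() is hypothetical) looks as follows and would be called from within the task's SC_THREAD:

    #include <systemc.h>

    // Strict interpretation of an EET annotation: the annotated time is consumed
    // immediately, including any extra delay caused by preempting tasks.
    void consume_eet_strict(const sc_core::sc_time& annotated) {
        sc_core::sc_time remaining = annotated;
        while (remaining > sc_core::SC_ZERO_TIME) {
            sc_core::wait(remaining);                        // kernel synchronization
            // hypothetical helper: time during which other tasks held the processor
            remaining = preemption_delay_during_last_wait();
        }
    }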
Since every SystemC time advance comes at the cost of a host system context
switch (see Fig. 14.5), the number of these synchronizations has to be reduced. This
leads to a trade-off between the granularity of the annotations and their accuracy.
If the implementation of timing annotations immediately consumes the annotated
delays, the granularity of these annotations must be kept quite coarse-grained to
ensure good performance. As a result, control structures like data-dependent loops
as shown in Listing 14.2 have to be estimated by their WCET, instead of taking the
real number of iterations into account.
Another difficult situation arises from sporadic or conditional inter-task communication, as shown in Listing 14.2 as well. To annotate the surrounding basic blocks
Fig. 14.5 Impact of EET resolution on simulation time
Fig. 14.6 Producer/Consumer benchmark
and to keep the number of annotations low, the annotations might have to be moved
entirely before or after the communication primitive to keep the processor utilization accurate. This results in a loss of accuracy with respect to access traces on the
shared resources. Therefore, the trade-off between simulation time and modeling
accuracy limits the observable effects, which might lead to wrong design decisions.
14.6.2 Lazy Synchronization
If, on the other hand, the to-be-consumed processor time can be accumulated until a time synchronization is explicitly required, a considerable speed-up is to be expected even in the case of fine-grained annotations. Synchronization between the abstract OS and the currently active task is required whenever interaction with the OS
or other components is requested. This especially includes inter-task communication and deadline validation.
Fortunately, communication in OSSS is expressed via Shared Objects, which
easily enables the integration of such an execution time accumulation. We have implemented this alternative lazy synchronization scheme as an optional feature of the
current OSSS multi-tasking library model. Task synchronization is delayed until
a Shared Object call or the border of an OSSS_RET() block is encountered. By
this, a task might logically pass multiple EET blocks without actually noticing any
SystemC time advance. At the above-mentioned synchronization points, the accumulated delay is consumed at once. This accumulation of consecutive EET blocks
without intermediate synchronization is possible, since the observable behavior—
which only depends on the order and time-stamp of inter-task communication—is
not changed at all.
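The core of the lazy scheme can be summarized by the following simplified sketch (illustrative only, not the library code; handling of preemption delays is omitted). EET annotations merely accumulate time, and the accumulated amount is consumed in a single wait() at the next synchronization point:

    #include <systemc.h>

    struct LazyTimeKeeper {
        sc_core::sc_time pending;          // accumulated, not yet consumed time

        void annotate(const sc_core::sc_time& eet) {
            pending += eet;                // no kernel interaction here
        }

        // called at Shared Object calls and at OSSS_RET() block borders,
        // from within the task's SC_THREAD
        void synchronize() {
            if (pending != sc_core::SC_ZERO_TIME) {
                sc_core::wait(pending);    // single SystemC time advance
                pending = sc_core::SC_ZERO_TIME;
            }
        }
    };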
In Fig. 14.5, these two scenarios are compared. The benchmark consists of a simple producer/consumer setup, where the producer pushes random numbers to the consumer through a FIFO Shared Object (see Fig. 14.6). The
tasks are scheduled with static priorities and overall constant estimated execution
times, such that the FIFO channel is nearly empty (and thus blocks the consumer
task) during the simulation. The OSSS_EET() blocks in the producer are then increasingly split into small chunks, resulting in an increasing number of synchronizations in the “strict” scenario. The simulation consists of ten thousand FIFO calls (on both sides) and has been run on a Pentium D workstation at 2.8 GHz. Figure 14.5 shows that accumulated synchronization leads to significantly faster execution for a higher number of consecutive EETs.
14.7 Conclusion
In this work, we presented an approach to modeling embedded software in OSSS.
In comparison to the previously existing software modeling capabilities in OSSS,
the extended implementation introduces an abstract run-time system with support
for multi-tasking and SW/SW inter-task communication. The modeling primitives
like Software Tasks and Shared Objects are similar to the elements on the OSSS Application Layer (see Sect. 14.3) and abstract from the error-prone synchronization primitives that the underlying RTOSs would provide.
The integrated RTOS abstraction includes different scheduling policies (preemptive and cooperative), periodic and continuous tasks, priorities, absolute and relative deadlines, without being targeted to a specific RTOS directly. Some simulation
results of different decisions on priorities and schedulers have been presented in
Sect. 14.5. As long as some locking primitive is available on the software target
architecture, the OSSS software run-time can be mapped onto this platform. A prototypical implementation on an existing RTOS is currently under development.
The HW/SW and SW/HW communication capabilities of OSSS (Sect. 14.3) are
not yet fully integrated with the new software multi-tasking implementation. The
communication refinement will follow the OSSS Channel approach, see [7, 9].
As we have shown in Sect. 14.6, the granularity of timing annotations and the resulting synchronization overhead are important factors for simulation speed. Therefore, the presented approach offers a flexible and simulation-efficient way to specify estimated execution times. Moreover, the lazy synchronization scheme further
reduces the required SystemC kernel invocations by merging consecutive EETs between required synchronizations, which has been demonstrated with a benchmark
in Sect. 14.6.2. This further improves the early exploration of software platform
effects for systems modeled in OSSS.
References
1. A. Burns and A. Wellings. Concurrency in Ada. Cambridge University Press, Cambridge,
1997.
2. G.C. Buttazzo. Hard Real-time Computing Systems. Kluwer Academic, Dordrecht, 2002.
3. J. Chevalier, M. de Nanclas, L. Filion, O. Benny, M. Rondonneau, G. Bois, and E.M. Aboulhamid. A SystemC refinement methodology for embedded software. IEEE Design & Test of
Computers, 23(2):148–158, 2006.
4. V. Fernandez, F. Herrera, P. Sanchez, and E. Villar. Embedded software generation from SystemC for platform based design. In SystemC: Methodologies and Applications, pages 247–
272. Springer, Berlin, 2003.
5. Fossy—Functional Oldenburg System Synthesizer. http://fossy.offis.de.
6. A. Gerstlauer, H. Yu, and D. Gajski. RTOS modeling for system level design. In Proceedings
of Design, Automation and Test in Europe, pages 47–58, 2003.
7. C. Grabbe, K. Grüttner, H. Kleen, and T. Schubert. OSSS—A Library for Synthesisable System
Level Models in SystemC, 2007. http://system-synthesis.org.
8. E. Grimpe, W. Nebel, F. Oppenheimer, and T. Schubert. Object-oriented hardware design and
synthesis based on SystemC 2.0. In SystemC: Methodologies and Applications, pages 217–
246. Springer, Berlin, 2003.
9. K. Grüttner, C. Grabbe, F. Oppenheimer, and W. Nebel. Object oriented design and synthesis
of communication in hardware-/software systems with OSSS. In Proceedings of the SASIMI
2007, October 2007.
10. Z. He, A. Mok, and C. Peng. Timed RTOS modeling for embedded system design. In 11th
IEEE, Real Time and Embedded Technology and Applications Symposium, RTAS 2005, pages
448–457, March 2005.
11. F. Herrera and E. Villar. A framework for embedded system specification under different models of computation in SystemC. In Proceedings of the Design Automation Conference, 2006.
12. S.A. Huss and S. Klaus. Assessment of real-time operating systems characteristics in embedded systems design by SystemC models of RTOS services. In Proceedings of Design &
Verification Conference and Exhibition (DVCon’07), San Jose, USA, 2007.
13. IEEE Standards Association (“IEEE-SA”) Standards Board. IEEE Std 1666-2005 Open SystemC Language Reference Manual. IEEE Press, New York, 2005.
14. R. Le Moigne, O. Pasquier, and J.-P. Calvez. A generic RTOS model for real-time systems
simulation with SystemC. In Proceedings of Design, Automation and Test in Europe Conference, 2004, volume 3, pages 82–87, 16–20 February 2004.
15. S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk. ARTS: a system-level framework for
modeling MPSoC components and analysis of their causality. In 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems,
pages 480–483, September 2005.
16. M. Müller, J. Gerlach, and W. Rosenstiel. Abstrakte Modellierung von Hardware/SoftwareSystemen unter Berücksichtigung von RTOS-Funktionalität. In 11th Workshop Methoden und
Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen
(MBMV’08), pages 21–30, March 2008.
17. Open SystemC Initiative. SystemC™. http://www.systemc.org.
18. H. Posadas, F. Herrera, V. Fernandez, P. Sanchez, and E. Villar. Single source design environment for embedded systems based on SystemC. Transactions on Design Automation of
Electronic Embedded Systems, 9(4):293–312, 2004.
19. G. Schirner and R. Dömer. Introducing preemptive scheduling in abstract RTOS models using
result oriented modeling. In Proceedings of Design, Automation and Test in Europe (DATE
2008), pages 122–127. Munich, Germany, March 2008.
20. M. Streubühr, J. Falk, C. Haubelt, J. Teich, R. Dorsch, and Th. Schlipf. Task-accurate performance modeling in SystemC for real-time multi-processor architectures. In Proceedings of
the Design, Automation and Test in Europe Conference, pages 480–481. European Design and
Automation Association, Leuven, 2006.
21. Synthesis Working Group Members of Open SystemC Initiative. SystemC synthesizable subset, Draft 1.1.18. Whitepaper, Open SystemC Initiative (OSCI), December 2004.
22. H. Zabel and W. Müller. Präzises Interrupt Scheduling in abstrakten RTOS Modellen in SystemC. In 11th Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV’08), pages 31–39, March 2008.
Chapter 15
High-Level Reconfiguration Modeling
in SystemC
Andreas Raabe and Armin Felke
Abstract The ongoing trend towards development of parallel software and the increased flexibility of state-of-the-art programmable logic devices are currently converging in the field of reconfigurable hardware. On the other hand there is the traditional hardware market, with its increasingly short development cycles, which is
mainly driven by high-level prototyping of products. This paper presents a library
for modeling reconfiguration in the leading high-level system description language SystemC, combining IP reuse and high-level modeling with reconfiguration. Details on the underlying simulation engine are given, which allows safe disabling and re-enabling of all process types without altering the kernel. Novel control statements and internal techniques that allow safe usage of process control in conjunction with standard SystemC language constructs are presented. A real-world case study
using the presented library proves its applicability.
Keywords SystemC · Dynamic reconfiguration · FPGA · Simulation
15.1 Introduction
Due to increasing micro-miniaturization in chip production, hardware development has evolved from plain circuit design into the development of complex heterogeneous systems with an increasing number of increasingly complex processing elements [12]. Not only does the productivity of hardware designers grow more slowly than the number of available transistors per chip (the productivity gap), but the time to bring products to market is shrinking as well. To fill the gap, one obvious approach is the reuse of in-house and externally produced components (IP-cores). The latter are usually provided closed source and remain the intellectual property of the vendor. IP reuse is widely regarded as one of the major drivers of productivity in contemporary chip design.
To cope with complexity higher levels of abstraction were introduced in system design and simulation. They enable early estimation of time and hardware consumption. Especially the introduction of Transaction Level Design to evaluate the
A. Raabe ()
International Computer Science Institute, Architecture Group, 1947 Center St., Suite 600,
Berkeley, CA 94704, USA
e-mail: raabe@icsi.berkeley.edu
Especially the introduction of Transaction Level Design to evaluate the impact of bus models has proven to be highly efficient and productive in practice [7]. A large number of system description languages, mostly based on C/C++, have since been proposed [3]. Among them, SystemC became the most prominent one.
Configurable logic devices evolved considerably as well. State-of-the-art devices
provide tremendous processing power and dynamic reconfiguration abilities, qualifying them as highly parallel co-processing units. This led to increased research
in run-time reconfigurable systems, which are now close to commercial breakthrough [13].
To enable the design community to conveniently develop reconfigurable architectures with a short time-to-market, this chapter presents the library ReChannel, which extends SystemC with advanced language constructs for high-level reconfiguration modeling. Special focus lies on the functional and transactional levels of modeling. Details of the underlying simulation library are given, which allows safe disabling and re-enabling of processes. The standard OSCI SystemC kernel is used without any changes.
15.2 Related Work
OSSS+R [11] is certainly one of the most prominent approaches towards a high-level design methodology for reconfigurable hardware. It is based on OSSS [4], which extends the synthesizable subset of SystemC by adding constructs that enable modeling, simulation and synthesis of object-oriented features. Basically, three different such constructs are introduced: Polymorphic Objects, Shared Objects and Sockets. The basic assumption of OSSS+R is that reconfiguration can be interpreted as an exchange of two objects sharing a common base type and can therefore be modeled with polymorphism. Due to the fundamental change of the programming paradigm, from component-based hardware design to object orientation, reuse of existing SystemC IP-cores is impossible. Additionally, only passive objects are featured as reconfigurable components.
An approach towards a SystemC reconfiguration extension closer to the original
methodology was developed in the ADRIATIC project [14]. It targets transaction
level (TL) description and simulation of reconfiguration aspects, focusing on early
evaluation of the performance impact of reconfiguration. Therefore an automated
tool is introduced that analyzes static TL models and identifies components that
could be made reconfigurable. The designer’s task is now reduced to exploring the
design space with respect to reconfiguration. The main limitation of the ADRIATIC
approach is the restriction of the underlying interfaces. The target architecture needs
a generic bus layout that fits into the supported interface scheme.
In [1] a modified SystemC kernel that features advanced process control statements is presented. In [2] a modified SystemC kernel is presented that allows enabling and disabling of processes. It is even used for modeling and simulation of
a reconfigurable system. Both alter the kernel and hence are not standard
compliant.
To provide full coverage of SystemC 2.2 and to enable some of the more advanced features presented in this chapter, the ReChannel library was completely reimplemented. Since the user interface changed, a brief recap and update on the basic features presented in [8] is given in the next section. Afterwards, Sects. 15.4 and 15.5 present novel language constructs and features.
15.3 Basic Reconfiguration Modeling
ReChannel was designed with respect to a number of objectives. Any SystemC extension should comply with the language standard IEEE 1666 [5]. IP reuse should be possible to enable designers to exploit the vast number of commercially available components. The library should blend as seamlessly as possible with the SystemC methodology and feature reconfiguration on all levels of abstraction without a change in programming paradigm. Hand-crafted refinement from functional to register transfer level should still be possible to allow maximum efficiency in resource utilization and to avoid imposing any design limitations.
15.3.1 Interpreting Reconfiguration as Circuit Switch
Hardware designers often interpret reconfiguration as switching between two or
more modules [6]. This very common approach is usually modeled using a bus
that is either described as a module or a channel. The modules to be exchanged
are then connected to the bus and can now be addressed only one at a time. The
reconfiguration control can thus be described as an arbiter.
This approach has three main limitations: the development effort is tremendous, delays caused by reconfiguration are usually not respected, and the system's topology as well as its timing are altered.
ReChannel uses a technique that resembles reconfiguration buses, but that
does not have these grave limitations. Portals are introduced to connect the original channel to the reconfigurable modules’ port.1 The portal’s function is twofold:
Firstly, accesses of the active module to the channel need to be executed, while inactive modules should not be allowed to access the channel. Additionally, it is necessary to provide the module’s port with an interface it is able to bind. This is both
done by binding a so-called accessor object, which is part of the portal, to the port. It
needs to be derived from the interface the port can connect to and forward interface
accesses to the channel. Secondly, channel events need to be passed to the currently
active module. Since the portal’s accessors implement the interface the modules’
port needs, they also possess the events provided by the interface. These events are now registered with event forwarders inside the portal. These components listen to the channel's events and notify the corresponding events inside the accessor associated with the currently active module. Figure 15.1 illustrates this.
1 Hardware designers familiar with the Xilinx Virtex modular design flow may think of portals as "bus macros".
Fig. 15.1 A portal connecting two reconfigurable modules to a standard SystemC channel. A simulation sequence is shown, where a channel event is triggered by some outside source and is then forwarded to the accessor associated with the currently active module. A channel access within this module is triggered and executed via the accessor
As will be described in Sect. 15.3.3, a portal’s state is controlled by the modules
connected to it. On the other hand the portals’ state influences the modules’ state,
since only one reconfigurable module per portal is allowed to be active at a time.
The portal’s type depends on the interface of the channel that it is bound to.
Portals for all SystemC channel interfaces are provided by the ReChannel library along with a toolkit that allows easy construction of portals for custom-built
channels. This process could be automated as well, since it does not demand any
creative coding, but is merely a repetition of information known to the compiler, but
not available to the designer via C++ language constructs.
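To make the accessor idea concrete, the following is a rough conceptual sketch in plain C++ (the interface, class and member names are invented for illustration and are not part of the ReChannel API): an accessor implements the interface that a port expects and forwards calls to the real channel only while its module is the currently active one.

  struct read_if {                       // hypothetical channel interface
    virtual int read() = 0;
    virtual ~read_if() {}
  };

  class accessor : public read_if {      // bound to the reconfigurable module's port
  public:
    explicit accessor(read_if& channel) : m_channel(channel), m_active(false) {}
    void set_active(bool a) { m_active = a; }   // switched by the portal
    virtual int read() {
      // forward the access only for the currently active module;
      // an inactive module must not touch the channel
      return m_active ? m_channel.read() : 0;
    }
  private:
    read_if& m_channel;
    bool     m_active;
  };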
15.3.2 Creating Reconfigurable Modules from Static Ones
To allow IP reuse as demanded in Sect. 15.3, it is necessary to provide a mechanism to equip standard SystemC modules with additional reconfiguration-related features (e.g., configuration timings, bit-file size, etc.). In the ReChannel environment they also have to be able to efficiently interact with the portals they are connected to. Both can be achieved by deriving from the original static module and from rc_reconfigurable, which is a base class providing the necessary capabilities. To provide some of the more advanced mechanisms that will be described in Sect. 15.5, it is of some benefit to additionally wrap the original type into a template, which also automates this derivation. A reconfigurable version M_rc of a static module type M can now be derived.
class M_rc : public rc_reconfigurable_module<M> { ... };
The resulting type M_rc is also of type M and rc_reconfigurable. To provide more convenience, ReChannel automates this process by providing a macro
RC_RECONFIGURABLE_MODULE_DERIVED, which accepts the static module
type as parameter and a macro RC_RECONFIGURABLE_CTOR_DERIVED, where
the user can define reconfiguration-related properties (e.g., the module's loading delay).
RC_RECONFIGURABLE_MODULE_DERIVED(M_rc, M) {
  RC_RECONFIGURABLE_CTOR_DERIVED(M_rc, M) {
    rc_set_delay(RC_LOAD, sc_time(1, SC_MS));
  }
}
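For illustration, the wrapped type M can be any ordinary SystemC module; a minimal hypothetical example (not taken from the library or the case study) could look like this, and M_rc, as derived above, adds the reconfiguration capabilities without touching it:

  // Hypothetical static IP module, unaware of reconfiguration.
  SC_MODULE(M) {
    sc_fifo_in<int>  in;    // data input
    sc_fifo_out<int> out;   // data output

    SC_CTOR(M) { SC_THREAD(run); }

    void run() {
      while (true)
        out.write(in.read() + 1);   // arbitrary placeholder behavior
    }
  };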
15.3.3 Control
To control reconfiguration it would be tedious for designers to switch all portals
manually. Hence a reconfigurable module can be requested to activate itself. This
request is passed down to the portals, which allow switching only if no other module
is registered as active. A module is only activated if all its portals can be switched.
But this implicit control of the portal’s state via rc_module is still not convenient
enough. Therefore a simulation control object rc_control provides registration
and reconfiguration control functions for modules. E.g., instantiating a control object, registering three reconfigurable modules, and activating one of them via
rc_control looks like this:
rc_control ctrl;
ctrl.add(m1 + m2 + m3);
ctrl.activate(m1);
Figure 15.2 shows an example of a reconfigurable design with two alternatively
present modules. Their reconfiguration state is controlled by a custom configuration
controller using rc_control.
Additionally, in more complex designs it might be necessary to simulate usage of
multiple reconfigurable platforms with different reconfiguration behavior. By deriving from rc_control and overloading a callback function called takes_time
a simulation control object can be implemented that calculates reconfiguration timings from module attributes (e.g., bit-file size). This way, properties of the reconfigurable hardware used for the design can be modeled, or different alternatives can be
evaluated with respect to the impact on system performance.
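As a minimal sketch of this idea, a platform-specific controller might be derived as follows; note that the exact name and signature of the takes_time callback and the bit-file-size lookup are assumed here for illustration and may differ in the actual ReChannel API:

  // Hypothetical platform controller: reconfiguration delay derived from the
  // module's bit-file size (assumed callback signature and helper function).
  class my_platform_control : public rc_control {
  protected:
    virtual sc_time takes_time(rc_reconfigurable& module)
    {
      double bits = bitfile_size_in_bits(module);   // user-supplied attribute lookup
      return sc_time(bits / 100.0e6, SC_SEC);       // assume a 100 Mbit/s config port
    }
  };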
15.4 Advanced ReChannel Features
15.4.1 Exportals
A portal is a specially designed component to connect a channel of a design’s static part to ports of reconfigurable modules. To allow reconfigurable modules to use
Fig. 15.2 Design with two
alternatively present modules
exports as well as ports, the portal concept needs to be generalized. Therefore, its
control interface was encapsulated into the rc_switch base class. rc_portal
and the novel rc_exportal are derived from it. This way implicit control from
module to portal and from module to exportal is implemented using the same mechanism. Additionally, an exportal needs to forward channel events in the opposite
direction to a portal, from the reconfigurable to the static end. Nevertheless, the
same techniques that are implemented in a portal can be used. Interface accesses on
the other hand are more difficult. Since they need to be forwarded from static parts
of the design to reconfigurable ones, it can occur that no reconfigurable module is
currently active to answer the request. Here two cases can be distinguished: If the
access is blocking, the exportal can simply wait until a module is activated that can
execute the request. But in case of a non-blocking access it must be executed immediately. If this occurs the design will in general be erroneous, but still the access
has to be executed on some interface to allow the simulation to continue and issue a
warning.2 This is done by supplying a fall-back channel that reacts accordingly.
2 Note that some other options are available as well. E.g., in the case of resolved signals, a return value of 'X' (undefined) may be more appropriate, so any warnings can be omitted. Generally speaking, it is up to the implementer of the fall-back channel to specify its favored behavior.
15.4.2 Synchronization
Blocking accesses can cause problems as well. Since portals are plugged between port and channel, they can only gain control when an access is initiated, when it is finished, or when a channel event is notified. In timed environments this does not cause
any problems. But for untimed (and timed-untimed hybrid) modules it makes it difficult for the designer to control when the module is to be deactivated without input
and output data becoming asynchronous. E.g. a module reading from an input port
A and writing to an output port B must in general not be deactivated if it has read
the input, but no output was written yet. Therefore it must be possible to either
define blocks of code within the module’s processes as atomic transactions, or to
externally define synchronization conditions depending on the module’s communication behavior. The former is the far more elegant solution and will be described in
Sect. 15.5.1. Still, it rules out IP reuse and hence the latter approach is supported by
ReChannel as well. Therefore synchronization filters are provided, which allow bookkeeping of channel accesses by manipulating transaction counters. Only if all counters equal zero can the module be deactivated. Reconfigurable modules may
now equip portals with these filters and define synchronization conditions using the
information supplied by the filters. E.g. let M be an IP core with the behavior described above, then reading from input port A should increase and writing to output
port B should decrease a transaction counter.
RC_RECONFIGURABLE_MODULE_DERIVED(M_rc, M) {
  RC_RECONFIGURABLE_CTOR_DERIVED(M_rc, M)
    filterA(tc, 1),    // if data is read begin transaction
    filterB(tc, -1)    // if data is written end transaction
  {
    set_interface_filter(A, &filterA);
    set_interface_filter(B, &filterB);
  }
  rc_transaction_counter tc;         // initially equals 0
  rc_fifo_in_filter<int>  filterA;
  rc_fifo_out_filter<int> filterB;
}
15.5 Explicit Description of Reconfiguration
ReChannel also comes with a set of language extensions intended for the explicit description of reconfiguration. This is preferentially applied if a reconfigurable module is built from scratch, or if it is augmented with additional dynamic behavior.
In contrast to native SystemC, ReChannel language constructs possess an implicit reset mechanism that is triggered on reconfiguration. These language constructs primarily comprise classes, functions and macros corresponding to particular functionality already known from SystemC (e.g. rc_signal, rc_fifo, rc_event, rc_semaphore, etc.). With the availability of resettable
components and processes, both structure and behavior of reconfigurable modules
can be modeled in an intuitive way without the need to care about additional logic
that deals with reconfigurable behavior itself.
Resetting a module amounts to resetting its processes and its sub-components (variables, channels and sub-modules). Hence all sub-components and contained processes depend on a particular reconfigurable module up the SystemC object hierarchy tree, i.e., the first object among a component's parents which is derived from class rc_reconfigurable. This object is denoted as the context module of its sub-components. If no such parent exists, a component is said to be used in a non-reconfigurable context. As a general rule, resettable components and processes are optimized for utilization within context modules. But they may also be employed in a non-reconfigurable context. All components provided by ReChannel are already equipped with the ability to implicitly reset themselves. A designer can use ReChannel's predefined components without further knowledge of the underlying mechanism. How to create a custom resettable component
will be outlined in Sect. 15.5.2.
To enable process reset, ReChannel provides its own process registry and process control API for internal management of resettable processes. The API directly builds upon standard SystemC functionality and can therefore be seen as an additional layer on top of SystemC's process infrastructure. Hence it does not need to alter the SystemC kernel and is therefore compliant with the IEEE 1666
standard [5].
15.5.1 Resettable Processes
In order to enable process reset, a process control layer is introduced (Fig. 15.3). It
integrates with the SystemC infrastructure by registering itself with SystemC's process registry, and thus it is not necessary to alter the kernel implementation. Any process that is registered with this process control layer instead of SystemC's native infrastructure can then be disabled, (re-)enabled and reset. Processes registered with SystemC will still remain fully functional and will not suffer from any performance loss. Processes with process control do not differ from standard SystemC processes and can thus coexist and interoperate with these even within a
single module.
Fig. 15.3 ReChannel's process control is layered on top of the SystemC infrastructure
// synchronously resettable thread process
RC_THREAD(proc);
sensitive << clk.pos();
reset_signal_is(reset);
rc_set_sync_reset();
Listing 15.1 Implementation of a synchronously resettable thread using ReChannel primitives
void proc() {
  [...]  // do something
  while (true) {
    wait(new_input_available);
    [...]  // do something
  }
}
Listing 15.2 Process using an overloaded wait()
Macros available for process declaration are RC_METHOD, RC_THREAD and
RC_CTHREAD. They correspond semantically to the respective SystemC process types, but are registered with ReChannel's process control layer.
The primary reset condition is the deactivation of the context module. Additional reset conditions may be assigned by the user declaring the process. The invocation of
reset_signal_is() will result in the process being reset on the occurrence of
an edge of a signal. The reset behavior of a process may be either set to be asynchronous (default for RC_THREAD) or synchronous (default for RC_CTHREAD).
Consider the example given in Listing 15.1. It shows a thread that is made synchronously resettable using the presented primitives.
Classes of type rc_reconfigurable_module and rc_prim_channel,
amongst others, possess overloaded wait() and next_trigger() methods
which are specially prepared for checking the reset conditions of a process. Consider the example given in Listing 15.2 of a process function inside such a module.
If, e.g., the process is blocked in the wait and a reset is triggered due to deactivation
of the context module, the execution of the process is canceled. It will be restarted
as soon as the context module is activated again.
The reset of thread processes is implemented using the C++ exception handling
mechanism. Therefore, exception safety plays a major role in simulation reliability.
Additionally, it is required that the executed code can be canceled safely with respect to
algorithmic correctness and data consistency. Cancellation of transactions or blocking operations, not specially designed to be cancelable, would render a design highly
unreliable with respect to simulation stability and correctness. This implies that a reset mechanism requires fine-grained control of where and when a process may be
reset. For this purpose ReChannel provides the macro RC_NO_RESET by which
all reset signals can be temporarily disabled for the current process. The macro
RC_TRANSACTION enables the designer to enclose blocks of code that must be
finished before the reconfigurable context can deactivate. Listing 15.3 provides an
x = input_fifo.read();    // read input (blocking)
RC_TRANSACTION {          // after data has been read,
  y = calc(x);            // calculation must not be interrupted
  output_fifo.write(y);   // write output
}
// point of deactivation (if requested)
Listing 15.3 An example of a transaction used to define a block that must not be interrupted by
reconfiguration
example using a transaction to provide the necessary synchronization for the problem discussed in Sect. 15.4.2.
This is more elegant than using filters, but obviously cannot be applied to IP
components.
Calls to standard SystemC functionality are always considered atomic. If a resettable process calls external code that is not intended to be cancelable at any time, a technical restriction proves beneficial in this regard: for the reset of thread processes to work, it is required that these processes have previously been suspended in one of ReChannel's prepared wait() methods. If a thread process instead calls a function or interface method that uses SystemC's native wait() functionality, the reset mechanism will be temporarily unavailable. Due to this characteristic, a reset condition is considered to be locally bound, e.g. within the borders of a module. Thus, only code within these boundaries needs to be exception
safe and interruptible.
ReChannel also supports resettable spawned thread processes. In contrast to
non-spawned processes these are considered to be temporary, i.e., they will be physically terminated if their context module is deactivated.
15.5.2 Resettable Components
Resettable components have the property that they can be automatically reset
by ReChannel in case of activation or deactivation of their context module. ReChannel already provides resettable versions of all basic SystemC channels (e.g. rc_fifo, rc_signal, rc_signal_rv, rc_semaphore, etc.) and the
event class (rc_event). Additionally, the macro rc_var() allows declaration
of resettable variables of arbitrary type.
rc_var(int, i); // declaration of resettable variable i
[...]
i = 0; // initialization of i within the constructor
A user-defined component can be easily made resettable by deriving it from
rc_resettable and implementing its abstract base interface.
class myComponent : public rc_resettable
{
  [...]  // implementation of myComponent

  // preservation of initial state
  virtual void rc_on_init_resettable()
    { p_reset_value = p_curr_value; }

  // definition of reset functionality
  virtual void rc_on_reset()
    { p_curr_value = p_reset_value; }

  int p_curr_value, p_reset_value;
};
Listing 15.4 Implementation of a resettable component
The particular state such a component is reset to can be assigned beforehand during the construction phase. At start of simulation the callback method
rc_on_init_resettable() is invoked once on all resettables to give them the
opportunity to store their initial state after construction has finished. The request
for an immediate reset is propagated by a call to rc_on_reset(). Listing 15.4
illustrates this.
For the reset mechanism to work, resettable components automatically register
themselves with the current context module during construction. Hence the designer
does not have to care about any further details.
15.5.3 Binding Groups of Switches
If a channel or export is connected to a module, it is bound to a port by a single
binding statement. If reconfiguration is used, switches are bound to multiple modules' ports at the dynamic end. In practice modules provide a vast number of ports, especially in RTL descriptions. Together, this results in long, nearly identical sequences of binding statements, which make the code difficult to read. Even worse, implementing these binding blocks is error-prone, and they are difficult to maintain.
To enable convenient use of ReChannel, port maps are provided, which group ports, channels and exports. As a counterpart, switch connectors can be used to group switches. Switch connectors and port maps can now be bound using a single statement.
Moreover, port maps can be used to equip a module with multiple binding
schemes. This allows, e.g., providing bit vectors in little-endian or big-endian bit order. While the standard SystemC check for type compliance of bound objects is still provided, it is extended with a check of port map compliance. E.g., a port map for little-endian order cannot be bound to a switch connector which is defined for big-endian order. Last but not least, it is still possible to bind the module's ports
(etc.) directly without using its port maps.
Fig. 15.4 (a) Using port maps and switch connectors enables binding of complete modules to
switches with a single binding statement. (b) Topmodule models the reconfigurable area explicitly and thus has the same ports and exports as the reconfigurable modules. Here only a single type
of port map needs to be defined, to enable port-to-port and export-to-export binding
Figure 15.4(a) illustrates the use of port maps and switch connectors as it was
previously discussed. In Fig. 15.4(b) a more practical type of application is depicted.
Topmodule models the reconfigurable area explicitly and thus has the same ports
and exports as the reconfigurable modules. Here only a single type of port map
needs to be defined, to enable port-to-port and export-to-export binding.
15.6 Case Study
The ReChannel library was tested within the CollisionChip [9] project. Figure 15.5 shows the overall hardware/software project implementing a hierarchical
collision detection with reconfigurable primitive test. It tests two objects for intersection. Depending on the primitive type the objects are constructed of, a primitive
test is loaded into the FPGA. The main design preselects pairs of primitives that
are very close to each other and hence might intersect. These preselected pairs are
checked for intersection by the currently active primitive test.
Parts to be realized in hardware were implemented and tested on timed functional
as well as on RT level. Overall, the RTL project consists of over 17,000 lines of SystemC code. Introducing reconfiguration into the simulation using ReChannel took a single developer only two days. The reconfigured modules were not altered, but
treated as closed source components.
A more detailed description of this case study can be found in [10].
Fig. 15.5 Hierarchical
collision detection
architecture with
reconfigurable primitive test
15.7 Conclusion and Future Work
This chapter presented a library for modeling reconfiguration in the leading high-level system description language SystemC. ReChannel combines IP reuse and high-level modeling with reconfiguration. Advanced synchronization techniques for high-level reconfiguration modeling were presented along with an internal process control. The latter allows safe usage within the SystemC framework by providing the necessary synchronization statements. ReChannel does not alter the SystemC kernel and complies with the language standard [5].
To cover all SystemC language constructs and to provide all of the above, a full reimplementation of the library was provided and the basic mechanisms were presented. Some extended techniques (e.g. resolution of driver conflicts, the make-up of the variable reset functionality, etc.) and minor details were left out due to space restrictions. A case study was presented that proves the ReChannel library to be effective and productive. ReChannel was released as an open-source library under a BSD license and is thus available free of charge.
We are currently working on support for modeling mobility using ReChannel primitives, and on providing a synthesis case study. Among our next steps will be the discussion of functionality that should be incorporated into the SystemC standard to ease the implementation of language extension libraries. We are also planning to provide portals and other components necessary to cover the TLM library.
References
1. B. Bhattacharyya, J. Rose, and S. Swan. Language extensions to SystemC: process control
constructs. In DAC ’07: Proceedings of the 44th Annual Conference on Design Automation,
pages 35–38. ACM Press, New York, 2007.
2. A.V. Brito, M. Kuhnle, M. Hubner, J. Becker, and E.U.K. Melcher. Modelling and simulation
of dynamic and partially reconfigurable systems using SystemC. In ISVLSI’07, pages 35–40.
IEEE Comput. Soc., Los Alamitos, 2007.
3. S.A. Edwards. The challenges of hardware synthesis from C-like languages. In DATE ’05:
Proceedings of the Conference on Design, Automation and Test in Europe, pages 66–67. IEEE
Comput. Soc., Washington, 2005.
4. E. Grimpe and F. Oppenheimer. Aspects of object oriented hardware modelling with SystemC-Plus. In System on Chip Design Languages. Extended Papers: Best of FDL'01 and HDLCon'01. Kluwer Academic, Dordrecht, 2002.
5. IEEE Standards Association Standards Board. IEEE Std 1666-2005 Open SystemC Language
Reference Manual.
6. P. Lysaght and J. Stockwood. A simulation tool for dynamically reconfigurable field programmable gate arrays. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
4(3):381–390, 1996.
7. S. Pasricha, N. Dutt, and M. Ben-Romdhane. Extending the transaction level modeling approach for fast communication architecture exploration. In DAC ’04: Proceedings of the 41st
Annual Conference on Design Automation, pages 113–118. ACM Press, New York, 2004.
8. A. Raabe, P.A. Hartmann, and J.K. Anlauf. ReChannel: describing and simulating reconfigurable hardware in SystemC. ACM Transactions on Design Automation of Electronic Systems,
13(1):1–18, 2008.
9. A. Raabe, S. Hochgürtel, G. Zachmann, and J.K. Anlauf. Space-efficient FPGA-accelerated
collision detection for virtual prototyping. In Design Automation and Test (DATE), pages 206–
211. Munich, Germany, 2006.
10. A. Raabe, A. Nett, and A. Niers. A refinement case-study of a dynamically reconfigurable
intersection test hardware. In ReCoSoc’08, July 2008.
11. A. Schallenberg, F. Oppenheimer, and W. Nebel. Designing for dynamic partially reconfigurable FPGAs with SystemC and OSSS. In Forum on Specification and Design Languages,
Lille, France, September 2004.
12. Y. Tanurhan. Processors and FPGAs quo vadis? IEEE Computer, 39(11):108–110, 2006.
13. N. Tredennick and B. Shimamoto. The rise of reconfigurable systems. In Engineering of Reconfigurable Systems and Algorithms, 2003.
14. N.S. Voros and K. Masselos. System Level Design of Reconfigurable Systems-on-Chip.
Springer, New York, 2005.
Chapter 16
Stream Programming for FPGAs
Franjo Plavec, Zvonko Vranesic and
Stephen Brown
Abstract There is an increasing need for automated conversion of high-level design descriptions into hardware. We present a flow that converts a software application written in the Brook streaming language into a hardware description targeting
FPGAs. We use a combination of our source-to-source compiler and a commercial
C2H behavioral synthesis compiler to implement our flow. Our approach results in
a significant throughput increase compared to software and ordinary C2H results
(up to 8.9× and 4.3×, respectively). The throughput can be further increased by
using more hardware resources to exploit data parallelism available in streaming
applications.
Keywords Streaming · FPGA · Data-level parallelism · Task-level parallelism ·
Behavioral synthesis · SOPC
16.1 Introduction
A complete system on a programmable chip (SOPC) can often fit into an FPGA
device. Such a system usually contains one or more processing nodes, either soft- or
hard-core, and a number of peripherals. As the complexity of SOPCs grows, there is
a need for tools that allow users to design their systems at a high level. Major FPGA
vendors already provide such tools for building systems based on soft processors,
which help software developers who wish to target FPGAs. However, such systems
do not exploit the full potential of FPGAs, because they fail to generate circuits that
fully exploit the parallel nature of an application. To take advantage of parallelism
available in FPGAs has required mastering one of the Hardware Design Languages
(HDLs). Recently, there has been a push towards supporting automatic compilation
of software programs into hardware.
There are three basic approaches to automatic compilation of software into hardware. Behavioral synthesis compilers [1] analyze programs written in a high-level
sequential language, such as C, and attempt to extract instruction-level parallelism
F. Plavec ()
University of Toronto, Department of Electrical and Computer Engineering,
10 King’s College Road, Toronto, Ontario, Canada
e-mail: plavec@eecg.toronto.edu
Fig. 16.1 Sample streaming application
by analyzing dependencies among instructions, and mapping independent instructions to parallel hardware units. Several such compilers have been released, including C2H from Altera [2], which is fully integrated into their SOPC design flow.
The main problem with the behavioral synthesis approach is that the amount of
instruction-level parallelism in a typical software program is limited. In the case
of C2H, programmers often have to restructure their code and explicitly manage
hardware resources, such as mapping of data to memory modules.
Another approach is to take an existing parallel programming model and map
a program written in it onto hardware [3]. This approach allows programmers to
express parallelism, but they have to deal with issues such as synchronization, deadlocks and starvation.
The third approach is to use a language that allows the programmers to express
parallelism without having to worry about synchronization and related issues. One
class of languages that is attracting a lot of attention lately is based on the streaming
paradigm. In streaming, data is organized into streams, which are collections of
data, similar to arrays, but with elements that are guaranteed to be mutually
independent [4]. Computation on the streams is performed by kernels, which are
functions that implicitly operate on all elements of their input streams. Since the
stream elements are independent, a kernel can operate on individual stream elements
in parallel, thus exploiting data-level parallelism. If there are multiple kernels in a
program, they can operate in parallel in a pipelined fashion, thus exploiting task-level parallelism. Figure 16.1 depicts an application consisting of 5 kernels (depicted
as circles) and memory buffers that pass stream data between the kernels. Streaming
programming languages do not specify the nature of the memory buffers. We use
FIFO buffers because of their small size, which allows us to implement them in an
on-chip memory in the FPGA.
In this contribution we show that the streaming paradigm is suitable for implementation in FPGAs. Our compiler converts kernels into hardware blocks that operate on incoming streams of data. We chose the Brook streaming language [4] as our source language because it is based on the C programming language, so it is more likely to be accepted by the programmer community. The Brook language has been used
in a number of research projects [5–7]. We show that a program expressed in Brook
can be automatically converted into parallel hardware, which executes significantly
faster than software. Our methodology uses our modified version of the Brook compiler, and also leverages the C2H commercial tool.
The rest of the chapter is organized as follows. In Sect. 16.2 we describe related
work in the field of stream computing. Our design flow and tools are described in
Sect. 16.3. Section 16.4 presents experimental results. We make some concluding
remarks in Sect. 16.5.
16.2 Stream Computing
The term stream computing (a.k.a. stream processing) has been used to describe a
variety of systems. Examples of stream computing include dataflow systems, reactive systems, signal processing and other systems [8]. We focus on streaming applications defined through a set of kernels, which define the computation, and a set of
data streams, which define communication. This organization allows the compiler
to easily analyze communication patterns in the application, so that parallelism can
be exploited. When a programmer specifies that certain data belongs to a stream,
this provides a guarantee that the elements of the stream are independent from one
another. The computation specified by the kernel can then be applied to stream elements in any order. In fact, all computation can be performed in parallel, limited
only by the available computing resources.
Stream processing is based on the Single Instruction Multiple Data (SIMD) paradigm, and is similar to vector processing. The major difference between stream and
vector processing is in computation granularity. While vector processing involves
only simple operations on (typically) two operands, a kernel may take an arbitrary
number of input streams and produce one or more output streams. In addition, the kernel
computation can be arbitrarily complex, and the execution time may vary significantly from one element to the next. Finally, elements of a stream can be complex
(e.g. custom data structures), compared to vector processing, which can only operate
on primitive data types.
The recent interest in stream processing has been driven by two trends: the
emergence of multi-core architectures and general-purpose computation on graphics
processing units. Multi-core architectures have sparked interest in novel programming paradigms that allow easy programming of such systems. Stream processing
is a paradigm that is suitable for both building and programming these systems. For
example, the Merrimac supercomputer [7] consists of many stream processors, each
containing a number of floating-point units and a hierarchy of register files. A program expressed in a stream programming language is mapped to this architecture
in a way that captures data locality through the register file hierarchy. Programs for
Merrimac are written in the Brook streaming language [4]. Another similar project
is the RAW microprocessor and the accompanying StreamIt streaming language [9].
The processing power of graphics processing units (GPUs) has led to their use
for general-purpose computing [10]. GPUs are suitable for stream processing because they typically have a large number of 4-way vector execution units as well
as a relatively large memory. Streaming languages can be used to program these
systems, because kernel computation can be easily mapped to the processing units
in a GPU. Various compilers that convert code written in a streaming language into
a code suitable for execution on a GPU have been proposed [10, 11]. GPU Brook
[11] is a variant of the Brook streaming language specifically targeting GPUs. The
GPU Brook compiler hides many details of the underlying GPU hardware from the
programmer, thus making programming easier.
16.2.1 Streaming on FPGAs
Several research projects have investigated stream processing on FPGAs. Howes et
al. [12] compared performance of GPUs, PlayStation 2 (PS2) vector units, and FPGAs used as coprocessors. The applications were written in ASC (A Stream Compiler) for FPGAs, which has been extended to target GPUs and PS2. ASC is a C++
library that can be used to target hardware. The authors found that for most applications GPUs outperformed both FPGAs and PS2 vector units. However, they used
the FPGA as a coprocessor card attached to a PC, which resulted in a large communication overhead between the host processor and the FPGA. They showed that
removing this overhead improves performance of the FPGA significantly. Their approach does not clearly define kernels and streams of data, so the burden is on the
programmer to explicitly define which parts of the application will be implemented
in hardware.
Bellas et al. [13] developed a system based on “streaming accelerators”. The
computation in an application is expressed as a streaming data flow graph (sDFG)
and data streams are specified through stream descriptors. sDFG and the stream descriptors are then mapped to customizable units performing computation and communication. The disadvantage of this approach is that the application has to be described in a somewhat obscure format (sDFG).
Our approach to streaming on FPGAs is closest to the work described in [14],
which also converts computation into hardware IP blocks interconnected in a
pipelined fashion. Their approach targets ordinary C programs augmented with directives that allow the programmer to specify how an application maps to hardware.
Our approach is based on a streaming language, which provides a higher abstraction
level, so parallelism can be expressed without any knowledge of the target hardware.
16.3 Compiling Brook to Hardware
We propose using a streaming language as a natural choice for software programmers wishing to target their applications to FPGAs. We believe that the stream
processing paradigm is suitable for implementation in FPGAs, because programmable logic blocks in FPGAs are suitable for implementation of parallel compu-
tation. Also, FPGAs are easily reprogrammable, so the generated hardware can be
tailored to the needs of a specific application.
The design space for FPGA implementation of streaming applications is large.
For instance, a kernel could be implemented as custom hardware, a soft-core processor, or a streaming processor. In either case, several parallel instances of hardware
implementing the kernel may be necessary to meet the throughput requirement. The
choice of types and numbers of hardware units will affect the topology of the interconnection network. Finally, stream elements can be communicated through on-chip
or off-chip memories, organized as regular memories or FIFO buffers.
We generate custom hardware for each kernel in the application. An ordinary
soft processor would be a poor choice for implementing kernels, because it can only
receive and send data through its data bus, which may quickly become a bottleneck.
Custom hardware units can have as many I/O ports as needed by an application
and are likely to provide the best performance. However, if a kernel is complex,
the amount of circuitry needed for its implementation as custom hardware may be
excessive, in which case a streaming processor may be a better choice. In this contribution we focus on implementing kernels as hardware units.
We base our work on GPU Brook [11] because it is open-source, used in many
projects and supported through a community forum. To implement a program written in GPU Brook in an FPGA, the kernel code should be converted into an HDL.
Instead of performing this conversion directly, we generate C code for each kernel
and then use the C2H behavioral compiler [2] to convert the C code into hardware.
All of the C code is generated automatically by our compiler, so the programmer
only has to write the Brook source code and pass it through our flow. The first part
of the flow is a source-to-source compiler. We reused the original GPU Brook parser
and wrote a code generator that emits C code for each kernel. C2H allows functions
in the C code to be implemented as hardware blocks. Altera documentation refers
to the generated hardware block as a “hardware accelerator” [2]. Hardware accelerators act as coprocessors to the main soft processor (Nios II), which controls the
accelerators and executes the code that was not selected for acceleration. The current version of C2H does not support floating-point data types and operations.
Depending on the desired functionality, an accelerator can have one or more ports
for accessing other (memory) modules in the system; for each pointer dereference
in the original C code, a new port is created. Special pragma statements can be used
to define which memories in the system a port connects to. We use this functionality
to define how streams are passed between kernels through FIFOs. FIFOs are small
so they can be placed on-chip, and they fit naturally into the streaming paradigm,
because they act as registers in the pipeline. FIFOs are used instead of simple registers because they provide buffering for cases when execution time of a kernel varies
between the elements. For example, consider the system in Fig. 16.1. If the kernel
mul takes a long time to process one element, the next kernel downstream (sum)
could become idle if there was just one register between them. Using a FIFO, the
sum kernel can process data from the FIFO. As long as the mul kernel delivers the
next stream element before the FIFO buffer becomes empty, the sum kernel will not
have to stall.
16.3.1 Example Brook Program
In Brook, streams are declared similarly to arrays, except that characters “<” and “>”
are used instead of square brackets. Kernels are denoted using the kernel keyword.
We illustrate the work done by our compiler using the following Brook code:
kernel void mul (int a<>, int b<>, out int c<>) {
  c = a*b;
}
reduce void sum (int a<>, reduce int r<>) {
  r = r+a;
}
void main () {
  int output[REDUCE_LENGTH];
  int stream1<IN_LENGTH>, stream2<IN_LENGTH>;
  int mul_result<IN_LENGTH>;
  int reduce_result<REDUCE_LENGTH>;
  create1 (stream1);
  create2 (stream2);
  mul (stream1, stream2, mul_result);
  sum (mul_result, reduce_result);
  write (reduce_result, output);
}
The code is incomplete and contains only two kernels: mul and sum. Kernel code
refers to individual streams, not stream elements. This prevents programmers from
introducing data dependencies between stream elements. It is assumed that the operation is to be performed over all stream elements. A special kind of kernel, the so-called
reduction kernel, uses several elements of the input stream to produce one element
of the output stream. These kernels are used to perform reduction operations, and
are denoted by the reduce keyword.
To convert Brook code into C, our compiler generates an explicit for loop around
the statements inside the kernel function to specify that the kernel operation should
be performed over all stream elements. For the Brook code above, our compiler
produces the code similar to this:
void mul () {
  volatile int *a, *b, *c;
  int _iter;
  int _temp_a, _temp_b, _temp_c;
  for (_iter=0; _iter<IN_LENGTH; _iter++) {
    _temp_a = *a;
    _temp_b = *b;
    _temp_c = _temp_a * _temp_b;
    *c = _temp_c;
  }
}

void sum() {
  volatile int *a, *r;
  int _temp_r, _iter, _mod_iter=0;
  for (_iter=0; _iter<IN_LENGTH; _iter++) {
    if ((_mod_iter == 0) && (_iter != 0))
      *r = _temp_r;
    if (_mod_iter == 0)
      _temp_r = *a;
    else
      _temp_r = _temp_r + (*a);
    if (_mod_iter == (IN_LENGTH/REDUCE_LENGTH-1))
      _mod_iter = 0;
    else
      _mod_iter = _mod_iter + 1;
  }
  *r = _temp_r;
}

#pragma altera_accelerate connect_variable mul/c to mul_result/in
#pragma altera_accelerate connect_variable sum/a to mul_result/out
In this code, all compiler-generated variable names start with the “_” character.
For the mul kernel, the code is a straightforward for loop that reads elements from
input streams (FIFOs) a and b, multiplies them and writes the result to the output
stream (FIFO) c. The IN_LENGTH limit for the for loop was automatically inserted
by the compiler, based on the sizes of the streams passed to the mul kernel from the
main program.
Pointers a, b and c are connected to FIFOs, which is specified by C2H pragma
statements, which are also automatically generated by our compiler. For brevity, we
only show two pragma statements in the above code. The first pragma statement
specifies that the pointer c defined in the kernel mul connects to the in port of the
FIFO named mul_result. The second pragma statement specifies that the pointer
a defined in the kernel sum connects to the out port of the same FIFO. Together,
these pragmas define a connection between the mul and sum kernels, as indicated in
Fig. 16.1, based on dataflow analysis of the main program. FIFOs are implemented
in hardware, so kernel code does not have to manage FIFO read and write pointers.
Temporary variables _temp_a, _temp_b and _temp_c are used to preserve semantics of the original Brook program. Consider the statement c = a + a; in
Brook. According to the Brook specification, this statement is equivalent to c = 2*a;
which is true if temporary variables are used. However, if the temporary variables
were not used and stream references were directly converted to pointers, the original statement would get translated into *c = *a + *a; which would produce an
incorrect result. This is because each pointer dereference performs a read from the
FIFO, so two consecutive stream elements would be read, instead of the same element being read twice. Temporary variables ensure that only one read is performed
per input stream element, and that only one write is performed per output stream
element.
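For instance, a Brook statement c = a + a; would be translated along these lines (illustrative fragment only):

  /* with a temporary: one FIFO read per input element (correct) */
  _temp_a = *a;
  _temp_c = _temp_a + _temp_a;
  *c = _temp_c;

  /* without temporaries: two FIFO reads per element (incorrect) */
  *c = *a + *a;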
The code generated for the sum kernel is more complicated because sum is
a reduction kernel. The reduction operation can result in more than one element
in the output stream. In our example, the input stream with IN_LENGTH elements is reduced to a stream of REDUCE_LENGTH elements. This means that
IN_LENGTH/REDUCE_LENGTH consecutive elements of the input are summed
to produce one element of the output. A straightforward implementation of this operation would use a modulo operation to determine when an appropriate number of
elements have been added and a new addition should be started. Since the modulo
operation is not efficiently implemented in FPGAs, our compiler avoids using it, and
instead generates the _mod_iter variable, which emulates the modulo operation
using a simple counter. Our compiler performs a similar optimization for division.
Although the code above shows the use of a division operation, this is only for illustrative purposes. Since all stream sizes are known at compile time, our compiler
performs the division and inserts the result in its place.
The code produced by our compiler is larger than what is shown in the previous example. According to the Brook specification, a kernel can accept streams
of any dimensionality, but GPU Brook only supports 1-dimensional (1-D) and
2-dimensional (2-D) streams. To support both 1-D and 2-D streams, we have to
be able to emit code for both versions of the kernel. Supporting both 1-D and 2-D
streams for reduction kernels is more challenging, because one kernel in the original
source code can handle reductions of both 1-D and 2-D streams, and a 2-D stream
can be reduced into either a 2-D stream, a 1-D stream or a scalar. To handle all
of these cases, we generate several different versions of kernels, as needed for the
application.
Once the C code is generated, it is passed through Altera’s C2H compiler, which
generates a Verilog description of hardware accelerators for the kernels. The Verilog
code is then passed through the Quartus II flow to generate a programming file for the
target FPGA.
To evaluate the system’s performance, we generate the streams using simple
loops inside the create1 and create2 kernels. This approach, as compared to reading the input data from memory, ensures that the runtime of our benchmarks is not
dominated by communication with the shared off-chip memory, which would be the
case if three different kernels were using the same memory. The results are written
to the main memory using a specialized kernel write, so that they can be checked
for correctness. The system also includes a Nios II processor, which verifies the
correctness of the results, and measures the execution time. For some applications,
such a processor may not be necessary, because the input data may be coming from
the outside and the results may be passed back to the outside world.
16.3.2 Exploiting Data Parallelism
The system in Fig. 16.1 contains hardware accelerators, which can operate in parallel, thus exploiting task-level parallelism. However, each accelerator processes
Fig. 16.2 Replication example
stream elements one at a time, meaning that data parallelism is not exploited. One
way to exploit data parallelism is to replicate the functionality of each kernel. This is
possible because stream elements are independent, so they can be processed in parallel. In theory we could have as many kernel replicas as the number of elements in
the streams being processed. In practice, the kernel that is a bottleneck for the application will be replicated as many times as necessary to achieve required throughput.
Replication will usually be limited by the available hardware resources.
Figure 16.2 shows the application from Fig. 16.1 with kernels mul and sum replicated. In this example, sum and mul have comparable throughput so both of these
kernels have to be replicated to increase the application throughput. Kernels create
and write were not replicated because it is assumed that they already operate at a
maximum throughput, limited by either the application or the I/O communication
interface. If that is not the case, replicating one of these kernels may be beneficial. In
the configuration shown, the create kernels send elements to the mul1/2 and mul2/2 kernels alternately, in a round-robin fashion. Each parallel branch only processes half the elements, thus
effectively doubling the throughput.
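A hypothetical sketch of such a distributing kernel, assuming two replicas fed through the pointers _out1 and _out2 (names invented for illustration, not actual compiler output), could be generated roughly as follows:

  /* Round-robin distribution of generated elements to two mul replicas. */
  void create1() {
    volatile int *_out1, *_out2;    /* connected to the FIFOs of mul1/2 and mul2/2 */
    int _iter, _toggle = 0, _value;
    for (_iter = 0; _iter < IN_LENGTH; _iter++) {
      _value = _iter;               /* placeholder for the generated stream element */
      if (_toggle == 0)
        *_out1 = _value;            /* even elements go to mul1/2 */
      else
        *_out2 = _value;            /* odd elements go to mul2/2 */
      _toggle = 1 - _toggle;        /* alternate between the replicas */
    }
  }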
One of the main goals of our research is to bring the benefits of FPGA hardware
to the software programmers, while hiding the details of the underlying hardware.
For example, we plan to hide the details of kernel replication. The programmer only
has to identify which kernels are bottlenecks in an application, and then specify how
much the throughput of the kernel should be increased. The compiler can then automatically create the necessary replicas of the kernel, or report an error if the kernel
cannot be sufficiently replicated due to limited hardware availability. In the current
version of our compiler we do not yet support automatic kernel replication. However, we have performed experiments with kernel replication, where we replicated
kernels manually, in a manner that we can envision a compiler could easily perform
automatically. We present the results of these and other experiments next.
16.4 Experimental Evaluation
To validate the correctness of our design flow and estimate the performance benefits of our approach, we implemented two small applications using our flow. We
then compared the throughput of our implementation in hardware with the throughput of the best software implementation running on a Nios II soft processor on the
same FPGA device, and the same software function accelerated using C2H. This
comparison is fair, because our design flow presents the programmer with a compiler interface that is similar to the traditional software development flow, much
like C2H. Comparing our approach to a hard-core processor or a GPU would not
be fair, because of significantly differing technologies used for their implementation. Research in [5] and [11] has shown that streaming programs can be efficiently
compiled and executed on general-purpose processors and GPUs, respectively.
We chose two applications that are often used to demonstrate computation acceleration, because they are simple and their characteristics are well understood. These
applications are Autocorrelation and the Finite Impulse Response (FIR) filter. Autocorrelation is an operation that computes the cross-correlation of a signal and a time-shifted version of that same signal. In our experiments we perform autocorrelation
of a signal consisting of 100,000 samples for 8 different shift distances.
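In its plain sequential form (shown only to clarify the operation, not the Brook code used in the experiments), the computation is:

  /* Generic autocorrelation of n samples for SHIFTS shift distances. */
  #define SHIFTS 8
  void autocorrelation(const int *x, long long r[SHIFTS], int n) {
    int k, i;
    for (k = 0; k < SHIFTS; k++) {
      long long acc = 0;
      for (i = 0; i + k < n; i++)
        acc += (long long)x[i] * x[i + k];   /* signal times its shifted copy */
      r[k] = acc;
    }
  }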
The FIR filter is commonly used in digital signal processing as a digital filter.
The filter works by storing a certain number of samples in a pipeline and then multiplying each sample by a constant factor and summing those products. The depth of
the pipeline is often referred to as the number of taps in the filter. In our experiments
we use a filter with 8 taps on an input signal consisting of 100,000 samples. In both
applications samples were represented as 32-bit integers. In all our experiments we
use FIFOs with depth 4. We found that increasing the FIFO sizes beyond 4 was not
beneficial for our benchmarks.
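For reference, the two benchmark computations can be summarized in plain C++. The function interfaces and the shift-register handling below are our own illustrative choices; only the 8-tap / 8-shift-distance structure and the 32-bit integer samples follow the description above.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // 8-tap FIR: keep the last 8 samples, multiply each by a constant
    // coefficient and sum the products.
    int32_t fir8(const int32_t (&taps)[8], int32_t history[8], int32_t sample) {
        for (int i = 7; i > 0; --i) history[i] = history[i - 1];  // shift register
        history[0] = sample;
        int32_t acc = 0;
        for (int i = 0; i < 8; ++i) acc += taps[i] * history[i];
        return acc;
    }

    // Autocorrelation of a signal with a time-shifted copy of itself,
    // evaluated for 8 different shift distances.
    std::vector<int64_t> autocorr8(const std::vector<int32_t>& x) {
        std::vector<int64_t> r(8, 0);
        for (int shift = 0; shift < 8; ++shift)
            for (std::size_t i = shift; i < x.size(); ++i)
                r[shift] += static_cast<int64_t>(x[i]) * x[i - shift];
        return r;
    }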
Our experimental system is based on the fast (Nios II/f) version of the Nios II processor, with
instruction and data caches (4 KBytes each), and a hardware multiplier unit. The
processor is connected to an off-chip 8-MB SDRAM module, and the timer and
UART peripherals which enable measuring and reporting the program execution
times. Software implementations of both applications were first run on this system
and their throughput was recorded, along with the area and the maximum operating frequency (Fmax ) of the system. Next, we implemented each application in the
Brook streaming language, and compiled it using our basic flow, with each kernel mapped to one hardware accelerator. We measured the area and throughput of
each application, and then replicated the kernels in each application 2 and 4 times.
Finally, we accelerated the original software function using C2H. All experiments
were run on Altera’s DE2 development board, with a Cyclone II EP2C35F672C6
FPGA device. Software was run from an off-chip SDRAM memory, because the
large dataset of 100,000 elements could not fit into the available on-chip memory.
To make the comparison fair, the input dataset for software was generated inside a
loop and was not stored in the off-chip memory.
Table 16.1 Throughput and area results for different application implementations

Application     Throughput (KB/s)   Area (LEs)   Fmax (MHz)   Relative throughput   Relative area
autocor_soft    6,255               2,981        142          1                     1
autocor         5,878               +2,353       132          0.94                  +0.79
autocor_x2      12,240              +3,844       138          1.96                  +1.29
autocor_x4      23,833              +6,706       134          3.81                  +2.25
autocor_c2h     5,509               +1,359       124          0.88                  +0.46
fir_soft        2,163               2,981        142          1                     1
fir             7,948               +3,599       130          3.68                  +1.21
fir_x2          15,515              +6,236       127          7.17                  +2.09
fir_x4          19,141              +11,505      127          8.85                  +3.86
fir_c2h         5,849               +1,775       113          2.70                  +0.60
16.4.1 Results
Results of our experiments are summarized in Table 16.1. The first column indicates
the application, where autocor_soft and fir_soft correspond to software implementations, while autocor_c2h and fir_c2h correspond to C2H implementations. autocor
and fir correspond to the basic streaming implementations, whereas the applications
with x2 and x4 in their name correspond to the applications whose kernels were
replicated 2 and 4 times, respectively. The second column shows absolute throughput, while the third column presents area results for the logic performing the computation. We do not include peripherals and units that are in the system just for
the measurement and debugging purposes in this area, because they are not necessary once a real system is deployed. For software implementation, this means that
we report only the area for the Nios II processor and the SDRAM controller. For
streaming and C2H implementations we report only the area for the accelerators,
FIFOs (where applicable) and the SDRAM controller. We do not include the area
for the Nios II processor, because its role of starting the accelerators could easily
be replaced by a simple state machine. We indicate this with the “+” character in
the table, to emphasize that this area is in addition to area for the processor. The
fourth column in the table presents the system’s Fmax , and the final two columns
show throughput and area relative to the software implementation.
There are several interesting observations that can be made. First, it is interesting
to note that the streaming implementation and the software implementation of autocorrelation achieve similar throughputs. This is because the operations are simple
and the input data is generated on-chip, so the processor can fit the loop code into its
instruction cache and perform computation without accessing the off-chip memory.
The off-chip memory is accessed only when a result is written, which is the same
behavior as that of the streaming implementation. As a consequence, the streaming
version and the software version exhibit similar performance; the small difference
is due to the difference in Fmax of the two systems. The C2H implementation exhibits
similar behavior, but achieves lower throughput due to lower Fmax . This is because
C2H implements the complete algorithm in one accelerator, while the streaming
approach distributes it across several accelerators.
One significant difference between the streaming implementation and the other
two approaches is that the throughput achieved by the streaming implementation can
be improved by replicating its kernels. As our results show, replicating the kernels
two or four times results in nearly double and quadruple throughput, respectively.
The situation is slightly different for the FIR filter application. In this application
the incoming samples have to be inserted into a shift register. In both implementations, this shifting is implemented as a circular buffer in memory. This operation is
more efficiently implemented in hardware because independent operations (e.g. updating the loop counters and writing to the buffer) can be performed in parallel.
As a result, the C2H implementation achieves 2.7 times, and the streaming implementation
achieves 3.68 times higher throughput than software. Throughput of the streaming
application can be further increased by replicating the kernels. Doubling the number
of kernels results in double throughput, as expected. However, when the kernels are
replicated four times, we fail to achieve four times higher throughput, because the
implementation of the shift register cannot be easily replicated. The kernel implementing the shift register requires all 8 elements of the input to be available to assign
them to outputs in a round-robin fashion. Therefore, replicating the node does not
reduce the amount of work each node has to perform. Although it is conceivable
that this could be improved manually, there does not seem to be an easy way to perform it automatically. Therefore, once the shift register implementation becomes a
bottleneck, performance cannot be automatically improved any more.
Comparing the area results, the C2H implementations require less area than the equivalent streaming implementations, but they also provide lower Fmax and throughput. In addition, streaming kernels can be replicated to further
increase the throughput, while C2H does not provide such an option.
16.5 Concluding Remarks
In this chapter we presented a novel approach for allowing software programmers
to target FPGAs. The streaming paradigm allows programmers to effectively express parallelism, and it maps well to the FPGA logic. We presented a design flow
that converts a Brook streaming program into hardware using our source-to-source
compiler and Altera’s C2H compiler. Many FPGA systems currently use software
running on a soft processor to implement a part of functionality, while critical portions of the application are described in an HDL and implemented in hardware. Our
system allows the application to be fully described in software and still exploit the
capabilities of FPGA hardware, thus reducing design time and cost. Our experiments
show that this approach results in up to 8.9 times better throughput than a soft-core
processor, and up to 4.3 times better throughput than a C2H accelerated implementation running on the same FPGA. Moreover, the performance can be improved by
employing more hardware to perform the computation.
Our future work will focus on automating the replication of kernels to increase
throughput automatically. We also want to add support for infinite streams to our
flow. As shown in [14], infinite streams are important for real applications, such
as multimedia. In these applications the data to be processed is constantly streaming into the chip. Although neither Brook nor GPU Brook support infinite streams,
we believe such a feature can be implemented without significant changes to the
language. For instance, a stream with length 0 could be interpreted as an infinite
stream.
We also plan to add debug support to our design flow. A debugger for the streaming
approach described in this chapter can be implemented using Nios II as a control
processor. It can observe values that pass through each of the FIFOs in the system,
and stop the flow of data through FIFOs to implement breakpoints. Finally, we plan
to build several large applications to demonstrate usability of our approach for realworld applications.
References
1. G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, New York,
1994.
2. Altera. Nios II C-to-hardware acceleration compiler, November 2007. http://www.altera.com/
products/ip/processors/nios2/tools/c2h/ni2-c2h.html.
3. Y.Y. Leow, C.Y. Ng, and W.F. Wong. Generating hardware from OpenMP programs. In Proc. of IEEE Int. Conf.
on Field Programmable Technology, pages 73–80, 2006.
4. I. Buck. Brook Spec v0.2. Tech. Report CSTR 2003-04, Stanford University, October 2003.
5. J. Gummaraju and M. Rosenblum. Stream programming on general-purpose processors. In
Proc. of 38th Int. Symp. on Microarchitecture, pages 343–354, 2005.
6. S.-W. Liao et al. Data and computation transformations for Brook streaming applications on
multiprocessors. In Proc. of Int. Symp. on Code Generation and Optimization, pages 196–207,
2006.
7. W.J. Dally et al. Merrimac: supercomputing with streams. In 2003 Conference on Supercomputing, pages 35–35, 2003.
8. R. Stephens. A survey of stream processing. Acta Informatica, 34(7):491–541, 1997.
9. M.I. Gordon et al. A stream compiler for communication-exposed architectures. ACM
SIGOPS Operating Systems Review, 36(5):291–303, 2002.
10. D. Tarditi et al. Accelerator: using data parallelism to program GPUs for general-purpose uses.
In Proc. 12th Int. Conf. on Architectural Support for Programming Languages and Operating
Systems, pages 325–335, 2006.
11. I. Buck et al. Brook for GPUs: stream computing on graphics hardware. Trans. on Graphics,
23(3):777–786, 2004.
12. L.W. Howes et al. Comparing FPGAs to graphics accelerators and the PlayStation 2 using a
unified source description. In Int. Conf. on Field-Programmable Logic, 2006.
13. N. Bellas et al. Template-based generation of streaming accelerators from a high level presentation. In IEEE Symp. on Field-Programmable Custom Computing Machines, 2006.
14. A.B.P. Mukherjee, R. Jones. Handling data streams while compiling C programs onto hardware. In Proc. IEEE Computer Society Annual Symp. on VLSI, pages 271–272, 2004.
Part IV
Verification and Requirements Evaluation
Chapter 17
A New Verification Technique
for Custom-Designed Components
at the Arithmetic Bit Level
Evgeny Pavlenko, Markus Wedler,
Dominik Stoffel, Wolfgang Kunz,
Oliver Wienand and Evgeny Karibaev
Abstract Arithmetic Bit-Level (ABL) normalization has been proven a viable approach to formal property checking of datapath designs. It is applicable where arithmetic bit level components and sub-components can be identified at the register-transfer (RT) level of the design and the property. This chapter extends the applicability of ABL normalization to cases where some of the arithmetic components are
custom-designed entities, e.g., specified using Boolean equations or gates. We transform these entities into ABL building blocks using Reed–Muller expressions as an
intermediate representation. We show how Boolean logic expressed in Reed–Muller
form can be automatically transformed into ABL components so that such logic
blocks can be treated together with the remaining ABL components in a subsequent
normalization run. The approach is evaluated on a number of industrial designs generated by a commercial arithmetic module generator.
Keywords Custom-designed component · Reed–Muller expansion · ABL
normalization
17.1 Introduction
Formal property checking has become mainstream in many SoC design flows. In
particular, it is common practice to verify control-intensive design blocks formally
in order to guarantee high quality for these blocks as well as to relieve system-level
verification from local debugging tasks.
For arithmetic circuits, however, there is still a notable lack of robustness when applying formal methodologies. Simulation, therefore, still prevails in
industry when verifying arithmetic datapaths. This imposes the risk of overlooking
bugs in arithmetic operations.
E. Pavlenko ()
Department of Electrical and Computer Engineering, University of Kaiserslautern,
Kaiserslautern, Germany
e-mail: pavlenko@eit.uni-kl.de
During the last two decades, datapath verification has been an intensive field
of research within the formal methods community. Meanwhile, there exists a large
variety of techniques tackling arithmetic circuit verification in different ways.
Word-level decision diagrams like *BMDs [3] that promise a compact canonical
representation for arithmetic functions have been investigated. Due to the lack of robust
synthesis routines to derive these diagrams from bit-level implementations *BMDs
are, however, hardly used in RTL property checking. For example, Hamaguchi’s
method for *BMD synthesis [8] causes the diagram size to grow exponentially in
case of faulty circuits as noted by Wefel and Molitor [18]. Generating word-level
diagrams from bit-level specifications has remained an unresolved issue also for
more recent developments such as Taylor expansion diagrams [5].
As SAT-based techniques have become the predominant proof methods in formal
verification significant efforts were made to integrate SAT with solvers for other
domains. The integration of SAT and ILP techniques leads to hybrid solvers like
[1, 4]. However, ILP turns out to be unsuitable for RTL property checking because
non-linear arithmetic functions have to be handled. Unfortunately, even simple multiplication falls into this category.
More recently, SAT-modulo-theory (SMT) solvers have gained significant attention [7]. For the purpose of this chapter we consider the quantifier-free logic of
fixed-sized bit vectors (QF-BV) to be a natural candidate for representation of the
decision problems encountered in property checking. The solvers Spear [13] and
Boolector [2] showed best performance within this category in the competitions on
SMT solvers 2007 and 2008, respectively. However, our experience is that these
solvers show the same bottlenecks as solvers based on “bit blasting” (i.e., the problem is directly converted into CNF) when being applied to instances derived from
the verification of arithmetic datapaths.
For equivalence checking of arithmetic RTL circuits, especially multipliers,
a technique based on rewriting was proposed in [15]. A database of rewrite rules
is provided to support a large number of widely used multiplier implementation
schemes. However, for non-standard implementations the approach requires updating the database manually and is thus not fully automatic.
Computer algebra techniques have shown promising results at higher levels of
abstraction [12]. When applied to RTL designs with bit-level details they require a
large amount of intermediate result specifications [16]. However, if an arithmetic bit
level (ABL) description of the datapath under verification is available [19] they are
also applicable in RTL property checking.
In [14] an extraction technique is presented that automatically extracts ABL information from optimized gate netlists of arithmetic circuits after synthesis. This
approach is mainly designed for application in equivalence checking, and its arithmetic reasoning on the ABL is restricted to a single addition network. For RTL
property checking, however, global reasoning over several arithmetic components
is required. In [17] a generalization of the ABL description was introduced and a
normalization calculus for property checking was presented. All ABL information
required for property checking can usually be obtained directly from the RTL description. Unfortunately, this may change in full-custom design flows where ABL
information may no longer be available at the RT level. In order to apply ABL techniques in full-custom design flows a description language for arithmetic circuits was
introduced in [10] that is used by designers to manually capture the necessary ABL
information. The methods used for the arithmetic proofs are similar to the normalization approach of [17].
In this chapter, we increase the level of automation for RTL property checking
of arithmetic circuits in those cases where ABL information is missing for certain
custom-made components of a design such as Booth encoders or sophisticated addition components that are typically implemented below the ABL abstraction level.
This reflects a large number of industrial applications where certain arithmetic components are custom-designed and others (especially array structures such as addition
networks) are not. We provide an extraction technique to generate ABL descriptions for property checking that can be applied to those parts of the design where
hand-crafted optimizations and specialized architectures have made it impossible
to translate the RTL description into an ABL description immediately. The proposed approach, thus, fills an important gap and ensures that the manual approach
of [10] can be reserved for high-end applications designed globally by a full-custom
methodology.
The chapter is organized as follows: Sect. 17.2 presents a brief review of the
ABL normalization technique, Sect. 17.3 describes our extensions to this technique,
Sect. 17.4 presents experimental results and Sect. 17.5 concludes this chapter.
17.2 Normalization Method
This section presents a brief review of the ABL normalization technique presented
in [17] as far as it is required to describe the proposed extensions. Furthermore,
we motivate the need for a generic property checking technique which can handle
mixed gate-level and ABL descriptions.
17.2.1 ABL Normalization
Sophisticated arithmetic circuit designs compose arithmetic functions using bit-level arithmetic circuitry for addition and multiplication. This composition can be
modeled at the arithmetic bit level (ABL). An ABL description is a directed acyclic
graph where the vertices can be of type “partial product generator”, “addition network” or “comparator”. Partial product generators model bit-wise multiplication
and comparators model comparison of bit vectors. Addition networks model addition at different levels of abstraction, including bit-level units like half adders (HA)
or full adders (FA) as well as word-level additions such as bit-vector adders or addition schemes of multipliers. Our notion of an addition network can also capture
models for addition at intermediate levels of abstraction such as carry-save adders
(CSA) where bit-wise addition is performed on bit vectors.
Throughout this chapter we use the following notations:
• For a ∈ Z, b > 0 the remainder, a mod b, of the integer division a/b denotes the smallest k ≥ 0 with k = a − mb for some m ∈ Z.
• The unsigned integer represented by a bit vector a = (a_{n−1}, . . . , a_0) is denoted by $Z(a) = \sum_{i=0}^{n-1} 2^i a_i$.
• Conversely, ⟨x, n⟩ for n > 0 and x ∈ Z denotes the uniquely determined bit vector a = (a_{n−1}, . . . , a_0) with $x \bmod 2^n = \sum_{i=0}^{n-1} 2^i a_i$, i.e., a is the n-bit binary unsigned integer representation of x.

Addition networks represent weighted additions $r = \langle c + \sum_{a \in A} w(a) \cdot a,\; n \rangle$ where c ∈ Z is a constant offset, A is a set of bit-level variables called addends, w : A → Z is a weight function and the bit vector r = (r_{n−1}, . . . , r_0) is called the result.

It has been shown in [17] that the weights w(a) can be considered to be non-negative for all addends a ∈ A. We say that a ∈ A is an addend of column k ≥ 0 if ⟨w(a), k + 1⟩ has a leading 1. We use the notation A_i := {a ∈ A | a is addend of column i} for the set of addends in column i. Note that we have $A = \bigcup_{i=0}^{n-1} A_i$. With this notation, we can reformulate the defining equation for r as

$$ r = \left\langle \sum_{i=0}^{n-1} 2^i \Bigl( c_i + \sum_{a \in A_i} a \Bigr),\; n \right\rangle, $$

with ⟨c, n⟩ = (c_{n−1}, . . . , c_0).
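To make these definitions concrete, the following small C++ sketch models an addition network as a constant offset plus weighted bit-level addends and evaluates r = ⟨c + Σ w(a)·a, n⟩ for a given assignment. It is only an illustration of the notation above, not code taken from the normalization tool described in this chapter.

    #include <vector>

    // One bit-level addend together with its (non-negative) weight w(a).
    struct Addend { int var; long long weight; };

    // Addition network: constant offset c, addends A with weights, n result bits.
    struct AdditionNetwork {
        long long c = 0;
        std::vector<Addend> addends;
        int n = 0;

        // Evaluate r = <c + sum_{a in A} w(a)*a, n> for a 0/1 variable assignment.
        std::vector<int> evaluate(const std::vector<int>& assignment) const {
            long long sum = c;
            for (const Addend& a : addends) sum += a.weight * assignment[a.var];
            std::vector<int> r(n);
            for (int i = 0; i < n; ++i) r[i] = (sum >> i) & 1;  // bit i of sum mod 2^n
            return r;
        }
    };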
Based on this characterization we can easily see that the result bit r_k is influenced by addends $a \in A_{\leq k} := \bigcup_{i=0}^{k} A_i$ only. Whenever addends from columns i ≤ k influence the result r_{k+1} we say that the addition network generates carries at column k. This is the case when there is an assignment to the addends such that $\bigl( \sum_{i=0}^{k} 2^i ( c_i + \sum_{a \in A_i} a ) \bigr) \geq 2^{k+1}$. The main task of ABL normalization is to simplify the comparison of structurally dissimilar ABL representations for arithmetic
functions. It is well known that this task is generally beyond the capacity of SAT-based methods. The main cause for structural differences between implementation
and specification can be found in the application of commutative and distributive
laws. These laws are applied during implementation of arithmetic circuits in order
to optimize hardware costs and performance. In principle, it may be possible to enhance the specification by detailed structural information that allows the verification
tool to keep track of such optimizations. However, this results in overwhelming effort to keep the specification up to date throughout the design process. In order to
match structurally different ABL descriptions the normalization algorithm performs
a sequence of local equivalence transformations on these descriptions. For example, Fig. 17.1 shows the transformation of an ABL description into a normalized
instance where this structural difference between design and property is eliminated.
The main transformation applied during this normalization will be discussed in the
remainder of this subsection.
Fig. 17.1 ABL normalization (CMP—comparator, N_i—addition network, P_i—partial products generator)
17.2.1.1 Merging of Addition Networks
Merging of addition networks corresponds to the application of the commutative
law. We consider two addition networks N, N ′ for the weighted additions
$$ r = \left\langle c + \sum_{a \in A} w(a) \cdot a,\; n \right\rangle, \qquad r' = \left\langle c' + \sum_{a' \in A'} w'(a') \cdot a',\; n' \right\rangle. $$

Furthermore, we consider all the result bits r_i' of N' to be addends in consecutive columns i + k of N with k ≥ 0. k indicates the column with r_0' ∈ A_k. It is easy to see that w(r_i') = 2^i · w(r_0') holds for all i ∈ {0, . . . , n'−1} in this case. Finally we consider r_{n'−1}' ∈ A_{n−1} or that N' does not generate carries in the uppermost column n'−1. Under these conditions we can merge the addition networks by replacing the results r_i' by the addends A_i' and the constant offsets c_i' in the appropriate columns of N. The resulting network merge(N, N') can be described by the following formula

$$ r = \left\langle c + \sum_{a \in A} w(a) \cdot a + 2^k \Bigl( c' + \sum_{a' \in A'} w'(a') \cdot a' \Bigr) - \sum_{i=0}^{n'-1} 2^{i+k} r_i',\; n \right\rangle. $$

If all the results r_i' are only used as addends in a single column i + k of N, this equation simplifies to

$$ r = \left\langle c + 2^k c' + \sum_{a \in A \setminus \{r_0', \ldots, r_{n'-1}'\}} w(a) \cdot a + \sum_{a' \in A'} 2^k w'(a') \cdot a',\; n \right\rangle. $$
Fig. 17.2 Partial products distribution
17.2.1.2 Distribution of Partial Products
Distribution of partial products across addition networks corresponds to the application of the distributive law. Figure 17.2 shows an example of how the partial products e
and f with a new variable d can be expressed as the outputs of an addition network
(FA) for the partial products ad, bd and cd.
This example will be generalized in the following. Let N be an addition network
with a set of addends A, a weight function w, constant offset c and result vector
r = (rn−1 , . . . , r0 ). Furthermore, let P be a partial product generator for the product
of r with a second bit vector q = (q_{m−1}, . . . , q_0). Distribution of P over N results in
m addition networks N^i and a partial product generator P' for the addends (a ∈ A)
with q. Here, the N^i are defined by the following properties:
• A^i := {a · q_i | a ∈ A} ∪ {q_i} is the set of addends.
• The weight function w^i is defined by w^i(a · q_i) = w(a) and w^i(q_i) = c.
• The constant offset c^i = 0.
Distribution of partial products is sufficient to create a normal form for an ABL
description, i.e., as defined in [17], an equivalent ABL where partial products are
never generated from the results of an addition network. However, normal forms still
allow for different topologies of cascaded addition networks in the fanout of the partial products. In order to decide whether or not two implementations of an addition
network are equivalent, as in the left part of Fig. 17.1, we iteratively merge the addition networks in both implementations. If this process ends with a single addition
network for each of the implementations, as in the right part of Fig. 17.1, the equivalence check reduces to checking whether the added partial products are pairwise
equivalent and whether the equivalent partial products have the same weight modulo bit width in both implementations. Finally, it also needs to be checked whether
the constant offsets agree modulo bit width (see example in Sect. 17.3.1 for more
details).
17.2.2 Mixed ABL/Gate-Level Problems
In order to improve performance or area, designers sometimes describe certain parts
of an arithmetic circuit at the gate level. Therefore full ABL information is not
always available in industrial RTL descriptions, hence the normalization technique
cannot be applied in these cases.
For instance, the performance of a multiplier is dominated by the additions of
partial products. A widely adopted technique to reduce the number of partial products is (radix 4) Booth encoding. At the ABL the Booth-encoded partial products
can be described by the following equation:
$$ p_i = 2^{2i} \cdot b \cdot A = 2^{2i} \cdot b \cdot \bigl( -2 a_{(2i+1)} + a_{(2i)} + a_{(2i-1)} \bigr), \qquad (17.1) $$

where A ∈ {−2, −1, 0, 1, 2} is a so-called Booth digit.

However, for implementation a designer will not consider instantiation of a signed 3 × n-bit multiplier and a (3 + n) × 2i-bit multiplier to calculate a partial product corresponding to Eq. (17.1). By contrast, designers implement multiplication by 2^j, with j ∈ Z^+, by shifting the corresponding bit vector, and in case A < 0 the two's-complement transformation $A \cdot b = \overline{-A \cdot b} + 1$ is used.

The conditions for the extra bit shift and the negation are as follows:

$$ \mathit{shift}_i = (\overline{a_{(2i+1)}} \wedge a_{(2i)} \wedge a_{(2i-1)}) \vee (a_{(2i+1)} \wedge \overline{a_{(2i)}} \wedge \overline{a_{(2i-1)}}), \qquad (17.2) $$

$$ \mathit{cpl}_i = a_{(2i+1)} \wedge (\overline{a_{(2i)}} \vee \overline{a_{(2i-1)}}). \qquad (17.3) $$

Therefore, the implemented partial products $p_i' = p_i - 2^{2i} \cdot \mathit{cpl}_i$ will be:

$$ p_i' = \begin{cases} 0, & \text{if } a_{(2i+1)} = a_{(2i)} = a_{(2i-1)} = 0 \\ 0, & \text{if } a_{(2i+1)} = a_{(2i)} = a_{(2i-1)} = 1 \\ B, & \text{otherwise} \end{cases} \qquad (17.4) $$

where $B = ((b_{n-1} \oplus \mathit{cpl}_i, \ldots, b_0 \oplus \mathit{cpl}_i) \ll \mathit{shift}_i) \ll 2i$. In the above equations and
further throughout this chapter, the symbols ¯, ∧, ∨, ⊕ and ≪ denote Boolean negation, conjunction, disjunction, exclusive-or and the shift-left operations, whereas +,
− and · denote arithmetic addition, subtraction and multiplication, respectively.
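As a behavioral illustration of Eqs. (17.2)–(17.4), the following C++ fragment computes the shift and complement conditions and the implemented partial product p_i' for one radix-4 Booth digit. It was written for this chapter (the function name and operand widths are our own choices) and is not the gate-level code of the evaluated designs; the missing correction term 2^{2i}·cpl_i is added separately in the adder tree, as described below.

    #include <cstdint>

    // One radix-4 Booth digit: bits a_(2i+1), a_(2i), a_(2i-1) of the multiplier,
    // n-bit multiplicand b (n <= 32 here); returns the implemented partial
    // product p_i' of Eq. (17.4).
    uint64_t booth_partial_product(bool a2ip1, bool a2i, bool a2im1,
                                   uint32_t b, unsigned n, unsigned i) {
        bool shift = (!a2ip1 && a2i && a2im1) || (a2ip1 && !a2i && !a2im1); // Eq. (17.2), |A| = 2
        bool cpl   = a2ip1 && (!a2i || !a2im1);                             // Eq. (17.3), A < 0
        if (a2ip1 == a2i && a2i == a2im1)        // Booth digit A = 0
            return 0;                            // first two cases of Eq. (17.4)
        uint64_t mask = (1ULL << n) - 1;         // keep the n multiplicand bits
        uint64_t bits = (cpl ? ~static_cast<uint64_t>(b) : b) & mask; // b_j XOR cpl_i
        return (bits << (shift ? 1 : 0)) << (2 * i);                  // B of Eq. (17.4)
    }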
Suppose the addition of the partial products pi′ and the complement bits cpli is
performed using a tree of carry-save adders (CSA tree). In this case we generate
an ABL description of the adder tree and of the standard multiplier implementation used in the property. However, the Booth-encoded partial products will not be
part of this ABL. As a result the normalization approach fails to identify equivalent
partial products after merging the addition networks of the implementation and the
specification, respectively. This problem is illustrated in Fig. 17.3.
In the next section we provide a technique to synthesize an ABL description for
small local gate-level descriptions of arithmetic circuits such as the Booth encoder
of a multiplier discussed above.
17.3 Synthesis of ABL Descriptions from Gate-Level Models
In principle, every Boolean function can be synthesized into an equivalent ABL description. To see this suppose the Boolean function is given in positive Reed–Muller
Fig. 17.3 Incompletely normalized instance
form. The positive Reed–Muller (positive Davio) decomposition for a Boolean function f : {0, 1}^n → {0, 1} with respect to a variable x_i is given by the following equation:

$$ f(x_0, \ldots, x_n) = f|_{x_i=0} \oplus x_i \cdot (f|_{x_i=0} \oplus f|_{x_i=1}), \qquad (17.5) $$

where $f|_{x_i=1} = f(x_0, \ldots, x_i = 1, \ldots, x_n)$ and $f|_{x_i=0} = f(x_0, \ldots, x_i = 0, \ldots, x_n)$ denote the positive and negative cofactors of f with respect to x_i. Recursive application of this decomposition results in the Reed–Muller form for a given Boolean
function. There are efficient data structures such as OFDDs [9] and OKFDDs [6] to
represent and manipulate Boolean functions in Reed–Muller form. As we generate
only local Reed–Muller forms for small portions of a circuit, we currently do not
resort to these data structures and represent Reed–Muller expressions explicitly.
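For the small cones considered here, an explicit Reed–Muller representation can be obtained directly from a truth table with the standard XOR "butterfly" (repeated positive Davio) transform. The sketch below is a generic illustration of this explicit representation, not the data structure of our implementation.

    #include <cstddef>
    #include <vector>

    // Compute the positive Reed-Muller (ANF) coefficients of an n-variable Boolean
    // function given as a truth table tt of size 2^n, where tt[x] = f(x).
    // Afterwards coeff[m] = 1 iff the monomial whose variable set is encoded by the
    // bits of m occurs in the Reed-Muller form of f.
    std::vector<int> reed_muller_coefficients(std::vector<int> tt) {
        const std::size_t size = tt.size();              // must be a power of two
        for (std::size_t step = 1; step < size; step <<= 1)
            for (std::size_t block = 0; block < size; block += 2 * step)
                for (std::size_t j = block; j < block + step; ++j)
                    tt[j + step] ^= tt[j];               // one positive Davio step
        return tt;                                       // tt now holds the coefficients
    }

    // Example: f(a, b) = a OR b has truth table {0, 1, 1, 1} and yields the
    // coefficients {0, 1, 1, 1}, i.e. the Reed-Muller form a xor b xor ab.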
In the following subsection we study how to transform the Reed–Muller form of
a Boolean function into an ABL description that is suitable for the normalization
approach.
17.3.1 Generation of the Equivalent ABL Descriptions for Boolean
Functions in Reed–Muller Form
The product terms of the Reed–Muller form can be transformed into equivalent
cascades of partial product generators and the XOR can be implemented by a single-column addition network. Such an addition network will always generate a carry in
its single column, unless it consists of a single product term only. If the result of
such a single-column addition network N with more than a single addend is used
as addend in some other network N ′ , merging of N and N ′ is only possible if the
result is added to the uppermost column of N ′ . This, however, cannot be expected
in general since it would require that only addends to the uppermost column of an
addition network implementation are specified at the gate level. In order to overcome
this restriction we extend a single-column addition network to an equivalent multi-column addition network as follows:
Fig. 17.4 Example of a Reed–Muller synthesis into addition network
Theorem 1 Let N be a single-column addition network with result r. For every n ≥ 1 there is an n-column addition network N' with results (r_{n−1}', . . . , r_0') such that r_0' = r and r_i' = 0 for all i > 0.

Proof Without loss of generality we can suppose N to have a set of addends A = {a_1, . . . , a_m} with w(a_i) = 1 for all a_i ∈ A and a constant offset c ∈ {0, 1}. We need to consider the case c = 0 only. Note that c = 1 can be handled by inserting a dummy addend a. In the resulting network N' we eliminate a by updating the constant offset to c + w(a). In order to transform N, consider the n-column addition networks N̂ with $\hat r = (\hat r_{n-1}, \ldots, \hat r_0) = \langle \sum_{i=1}^{m} a_i,\; n \rangle$ and Ñ with $\tilde r = (\tilde r_{n-1}, \ldots, \tilde r_0) = \langle \sum_{i=1}^{m} a_i - \sum_{i=1}^{n-1} 2^i \hat r_i,\; n \rangle$.

Obviously, Ñ has the results r̃_0 = r and r̃_i = 0 for all i > 0. Moreover, the r̂_i can be expressed as Boolean functions (in Reed–Muller form) in terms of the addends a_k. Therefore, we obtain a single-column addition network N̂_i (i > 0) for each of the r̂_i. By induction hypothesis we can extend N̂_i to an (n − i)-column addition network N̂_i'. By construction we can merge the addition networks N̂_i' with Ñ and obtain an equivalent network N'.
By means of the above theorem we can generate ABL descriptions for Boolean
functions that are suitable for ABL normalization. We illustrate this by the example
depicted in Fig. 17.4. Suppose we want to implement r = a ⊕ b ⊕ c using a three-column addition network N' with results r' = (r_2', r_1', r_0'). According to the above
proof we have to consider the intermediate network N̂ with three columns and the
above variables as addends in column 0. The addition network Ñ subtracts the results of N̂ from columns 1 and 2. We obtain the Reed–Muller forms r̂_0 = a ⊕ b ⊕ c,
r̂_1 = ab ⊕ ac ⊕ bc and r̂_2 = 0 for the results of N̂ and recursively determine a two-column addition network N̂_1' with the results r̂_1 and 0. This network has the addends
ab, ac and bc in column 0 and the addend abc in column 1. By construction the addition network N̂1′ can be merged with the addition network Ñ . This results in the
Fig. 17.5 Synthesized ABL model for the Reed–Muller form

equation

$$ r' = \langle (a + b + c - 2(ab + ac + bc) + 4abc) \bmod 8,\; 3 \rangle = \langle (a + b + c + 2(ab + ac + bc) + 4(abc + ab + ac + bc)) \bmod 8,\; 3 \rangle $$
for the result r ′ of the final addition network N ′ .
This addition network can be implemented by the half-adder/full-adder netlist
depicted in Fig. 17.5.
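The derived equation can also be checked exhaustively: for every assignment to a, b and c the network value equals a + b + c, so the result bit r_0' is a ⊕ b ⊕ c and the remaining result bits are 0. The following snippet is an independent sanity check written for this chapter, not part of the verification flow.

    #include <cassert>

    int main() {
        for (int a = 0; a <= 1; ++a)
            for (int b = 0; b <= 1; ++b)
                for (int c = 0; c <= 1; ++c) {
                    // value of the merged three-column network N' from the equation above
                    int v = (a + b + c - 2 * (a*b + a*c + b*c) + 4 * (a*b*c)) & 7;
                    assert((v & 1) == (a ^ b ^ c));   // r0' equals a xor b xor c
                    assert((v >> 1) == 0);            // r1' and r2' are always 0
                }
        return 0;
    }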
Our overall procedure for synthesis of ABL descriptions corresponding to local
parts of the circuit has two phases:
• Transform all local gate-level descriptions into Reed–Muller forms.
• Transform Reed–Muller forms into equivalent multi-column addition networks
as needed during normalization.
These algorithms are invoked on demand whenever conventional ABL normalization terminates with remaining addends in the compared addition networks for
which gate-level representations exist. After converting these representations into
ABL we re-run normalization.
We conclude this section by illustrating the overall flow by means of a small example. Suppose it is required to verify the design of a 2 × 2 unsigned multiplier with
(radix-4) Booth-encoded partial products. We further assume that the partial products of the design are implemented at the gate level. The partial products provided
for the addition tree of the design are listed in the left part of Table 17.1. Furthermore, we annotate in braces the corresponding Reed–Muller form in terms of the
multiplier inputs ak and bi . The table also corresponds to the addition network NImpl
obtained in the normalization algorithm after merging the adders in the addition tree
Table 17.1 Partial products for unsigned multipliers

Column   Result bit   Booth-encoded (radix 4) partial products        Standard multiplier partial products
0        r0           cpl0 = {b1}, p0'[0] = {a0 b0 ⊕ b1}              a0 b0
1        r1           p0'[1] = {b1 ⊕ a1 b0 ⊕ a0 b1 ⊕ a0 b0 b1}       a1 b0, a0 b1
2        r2           p1'[2] = {a0 b1}, cpl1 = {0},                   a1 b1
                      p0'[2] = {b1 ⊕ a1 b1 ⊕ a1 b0 b1}
3        r3           signext = {1}, p1'[3] = {a1 b1},
                      p0'[3] = ¬cpl0 = {b1 ⊕ 1}
Table 17.2 Multi-column addition network for implementation of the product

Partial product   Reed–Muller form                      Addition network
cpl0              b1                                    (0, 0, 0, cpl0) = ⟨b1, 4⟩
p0'[0]            a0 b0 ⊕ b1                            (0, 0, 0, p0'[0]) = ⟨a0 b0 + b1 − 2 a0 b0 b1, 4⟩
p0'[1]            b1 ⊕ a1 b0 ⊕ a0 b1 ⊕ a0 b0 b1        (0, 0, p0'[1]) = ⟨b1 + a1 b0 + a0 b1 + a0 b0 b1 − 2(a1 b0 b1 + a0 b1), 3⟩
p1'[2]            a0 b1                                 (0, p1'[2]) = ⟨a0 b1, 2⟩
p0'[2]            b1 ⊕ a1 b1 ⊕ a1 b0 b1                (0, p0'[2]) = ⟨b1 + a1 b1 + a1 b0 b1 − 2 a1 b1, 2⟩
cpl1              0                                     (0, cpl1) = ⟨0, 2⟩
signext           1                                     (signext) = ⟨1, 1⟩
p1'[3]            a1 b1                                 (p1'[3]) = ⟨a1 b1, 1⟩
p0'[3]            b1 ⊕ 1                                (p0'[3]) = ⟨1 + b1, 1⟩
of the implementation, i.e., the result of N_Impl is defined as follows:

$$ (r_3, r_2, r_1, r_0) = \bigl\langle \mathit{cpl}_0 + p_0'[0] + 2\,p_0'[1] + 4\,\bigl(p_1'[2] + p_0'[2] + \mathit{cpl}_1\bigr) + 8\,\bigl(\mathit{signext} + p_1'[3] + 1 - p_0'[3]\bigr),\; 4 \bigr\rangle. $$
The partial products for a standard implementation of a multiplier are depicted
in the right part of Table 17.1. It is obvious that normalization cannot establish equivalence between these addition networks, as the partial products a0 b0 and a1 b0 do
not have an equivalent counterpart in the implementation network NImpl . In order to
complete normalization we convert the Reed–Muller forms of the partial products
of the implementation into the corresponding multi-column addition networks. The
result of this computation step is summarized in Table 17.2.
Merging the addition networks of Table 17.2 for the partial products with the implementation network N_Impl results in the addition network N'_Impl with

$$ \begin{aligned} (r_3, r_2, r_1, r_0) &= \bigl\langle b_1 + (a_0 b_0 + b_1 - 2 a_0 b_0 b_1) \\ &\quad + 2\,(b_1 + a_1 b_0 + a_0 b_1 + a_0 b_0 b_1 - 2(a_1 b_0 b_1 + a_0 b_1)) \\ &\quad + 4\,(a_0 b_1 + (b_1 + a_1 b_1 + a_1 b_0 b_1 - 2 a_1 b_1) + 0) \\ &\quad + 8\,(1 + a_1 b_1 + 1 + b_1),\; 4 \bigr\rangle \\ &= \langle 16 + 16 b_1 + a_0 b_0 + 2 a_1 b_0 + 2 a_0 b_1 + 4 a_1 b_1,\; 4 \rangle \\ &= \langle a_0 b_0 + 2 a_1 b_0 + 2 a_0 b_1 + 4 a_1 b_1,\; 4 \rangle. \end{aligned} $$

Obviously, the resulting addition network N'_Impl and the standard addition network for unsigned multiplication are identical. Therefore our implementation is proven to be correct.
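As a quick independent cross-check (ours, not part of the tool), the normalized network can be compared against the product of the operand values: for all 2-bit operands, a0 b0 + 2 a1 b0 + 2 a0 b1 + 4 a1 b1 indeed equals Z(a) · Z(b).

    #include <cassert>

    int main() {
        for (int a = 0; a < 4; ++a)
            for (int b = 0; b < 4; ++b) {
                int a0 = a & 1, a1 = (a >> 1) & 1;
                int b0 = b & 1, b1 = (b >> 1) & 1;
                // standard addition network of an unsigned 2x2 multiplier
                int net = a0*b0 + 2*a1*b0 + 2*a0*b1 + 4*a1*b1;
                assert(net == a * b);   // equals Z(a) * Z(b) for every operand pair
            }
        return 0;
    }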
17.4 Experimental Results
In this section we summarize the experimental evaluation for the proposed techniques. We implemented these techniques in a property checking environment utilizing SAT and ABL normalization. The overall flow of the integrated verification
engine is shown in Fig. 17.6.
As input to the verification engine we consider a combinational netlist of bit-vector functions representing the SAT instance that needs to be checked in order to
prove a given property. To derive such a netlist from an HDL design and a property
we use the industrial property checker OneSpin 360 MV [11].
Fig. 17.6 Property checking flow
Table 17.3 Industrial multipliers / TO—1000 sec

Booth-encoded      Signed               Unsigned
multiplier         SMTs      ABL        SMTs      ABL
16 × 16            TO        0.38       TO        0.42
23 × 23            TO        1.21       TO        1.1
32 × 32            TO        3          TO        3.1
64 × 64            TO        82         TO        63.8
In order to simplify the corresponding SAT instance we normalize an ABL description generated from the arithmetic bit-vector functions in this netlist. However,
if certain parts of the arithmetic circuit design are implemented at the gate level
rather than the arithmetic bit level, normalization will not succeed. In this case, we
determine non-arithmetic bit-vector functions in the fan-in of design signals. For
these bit-vector functions we determine equivalent ABL representations and include
them into the ABL normalization problem.
The process of extending the ABL description followed by normalization is iterated until either all comparisons for arithmetic signals are proven or no suitable
extension of the normalized ABL description can be generated. In both cases a SAT
solver is called to prove the remaining parts of the problem to be unsatisfiable.
The efficiency of our prototype implementation was compared against two SMT
solvers, namely Spear v. 2.0 [13] and Boolector [2]. It should be noted that the
earlier versions of both Spear and Boolector demonstrated the best results at the
SMT competitions 2007 and 2008 respectively. All experiments were carried out
on an Intel Core 2 Duo E6400 (8 GB RAM) running Linux. All CPU times are
specified in seconds with a timeout limit (TO) as denoted above each table.
As a first step of the evaluation we collected experimental data in an industrial
setting. A number of multiplier implementations were generated by the module
generator of an industrial synthesis tool. The RTL code of the synthesized circuits
contains Booth-encoded multipliers with partial product generators described at the
gate level and CSA trees implementing the addition network at the ABL. Table 17.3
summarizes our experiments for signed and unsigned Booth-encoded multipliers
of various bit widths. Both SMT solvers are only adequate to solve the arithmetic
proof for small size instances and abort due to limited computing resources in case
of larger ones. However, when the missing ABL blocks are added to the arithmetic
proof problem our normalization-based tool can perform the proof in reasonable
time. For reasons of space we omit similar results obtained for non-Booth-encoded
and self-generated multipliers.
The second step of our evaluation explores the capacity limits of the Reed–Muller
extractor. We generated unsigned multipliers for different input bit-widths and different encodings for partial products. Furthermore, we step-wise increased the size
of the design portion specified at the gate level and at the same time reduced the ABL
part of the design. Starting with an initial design where only the partial products are
specified at the gate level, we iteratively implemented CSAs within the addition network by gate-level components. Figure 17.7 visualizes the first two iterations of this
Fig. 17.7 Increasing the gate-level portion of circuits by one CSA per iteration
Table 17.4 Multipliers of Fig. 17.7 / TO—1000 sec

Unsigned     Standard scheme                                  Booth-encoded
mults.       i=1      i=2      i=3      i=4      i=5          i=1      i=2
16 × 16      0.38     0.5      1.8      177      TO           42       TO
23 × 23      1.58     1.61     4.4      392      TO           75       TO
32 × 32      5.5      5.52     9.22     575      TO           103      TO
64 × 64      96       98       105      TO       TO           435      TO
experiment. This experimental setup will end up with instances where the complete
multiplier is specified below the ABL abstraction and, of course, we do not expect
a technique based on the Reed–Muller form to be efficient in this extreme corner
case.
The results of Table 17.4 show that the proposed techniques can transform fairly
large circuit regions specified at the logic level into ABL descriptions. In particular,
if the borderline between the (optimized) partial product generators and the addition
network is blurred—as is often the case in custom-designed circuits—our approach
is powerful enough to transform the complete partial product generator as well as at
least the first level of addition logic into ABL.
We conclude the experimental evaluation with an example of a pipelined multiplier. In three subsequent clock cycles the test circuit performs the multiplication
of four words using a single multiplier. A finite state machine controls the assignment of the operands to the inputs of the multiplier as depicted in Fig. 17.8. The
figure also illustrates that the instantiated multiplier consists of a Booth encoder for
the partial products implemented at the gate level and an addition network implemented at the ABL. The investigated property proves that every four clock cycles
a correct arithmetic result is generated provided that the reset and enable signals
are triggered by the environment accordingly. The resulting decision problem after
unrolling the design and assumption propagation is depicted in Fig. 17.9. Note that
assumption propagation is used to eliminate control logic from the unrolled problem instance such that the remaining problem can be solved with the techniques
Fig. 17.8 Shared multiplier

Fig. 17.9 Property checking instance
Table 17.5 Shared multipliers / TO—3600 sec

Bit-width   10    12    14     16     18      20      22
SMTs        TO    TO    TO     TO     TO      TO      TO
ABL         31    91    162    629    1248    2961    TO
described in this chapter. The experimental results for different operand bit widths
are summarized in Table 17.5. Again, our tool demonstrates good performance and
solves the verification task within reasonable time for large instances.
17.5 Conclusion and Future Work
In this chapter, we propose a method to transform design components described at
the gate level into equivalent ABL building blocks. The method utilizes the Reed–
Muller form of Boolean functions. It is applicable in cases where small portions
of the arithmetic circuit implementation are specified at the gate-level while ABL
information is available for the remaining parts within the RTL design. The algorithms can be tightly integrated into a verification framework based on ABL normalization. The experimental results indicate applicability for typical industrial verification problems. Future work will explore how this technique can be leveraged for
equivalence checking by integrating it into the approach of [14]. Moreover, we are
currently integrating the work described in this chapter into advanced implementations of ABL normalization techniques based on computer algebra techniques,
namely Gröbner Basis techniques over Rings [19].
References
1. G. Audemard, P. Bertoli, A. Cimatti, A. Kornilowicz, and R. Sebastiani. A SAT-based approach for solving formulas over boolean and linear mathematical propositions. In Proc. International Conference on Automated Deduction (CADE), pages 195–210, 2002.
2. Boolector. http://fmv.jku.at/boolector.
3. R.E. Bryant and Y.-A. Chen. Verification of arithmetic circuits with binary moment diagrams.
In DAC ’95: Proceedings of the 32nd ACM/IEEE Conference on Design Automation, pages
535–541. Assoc. Comput. Mach., New York, 1995.
4. D. Chai and A. Kuehlmann. A fast pseudo-boolean constraint solver. In Proc. International
Design Automation Conference (DAC), pages 830–835, 2003.
5. M. Ciesielski, Z. Zeng, P. Kalla, and B. Rouzeyre. Taylor expansion diagrams: A compact,
canonical representation with applications to symbolic verification. In Proc. International
Conference on Design, Automation and Test in Europe (DATE), pages 285–291, 2002.
6. R. Drechsler, B. Becker, A. Sarabi, M. Theobald, and M. Perkowski. Efficient representation
and manipulation of switching functions based on ordered Kronecker functional decision diagrams. In Proc. International Design Automation Conference (DAC), pages 415–419, 1994.
7. H. Ganzinger, G. Hagen, R. Nieuwenhuis, A. Oliveras, and C. Tinelli. DPLL(T): Fast decision
procedures. In Proc. International Conference on Computer Aided Verification (CAV), pages
26–37, July 2004.
8. K. Hamaguchi, A. Morita, and S. Yajima. Efficient construction of binary moment diagrams
for verifying arithmetic circuits. In Proc. International Conference on Computer-Aided Design
(ICCAD), pages 78–82, November 1995.
9. U. Kebschull, E. Schubert, and W. Rosenstiel. Multi-level logic based on functional decision
diagrams. In Proc. European Design Automation Conference (EDAC), pages 43–47, 1992.
10. U. Krautz, C. Jacobi, K. Weber, M. Pflanz, W. Kunz, and M. Wedler. Verifying full-custom
multipliers by boolean equivalence checking and an arithmetic bit level proof. In ASP-DAC
’08: Proceedings of the 2008 Conference on Asia and South Pacific Design Automation, pages
398–403. IEEE Comput. Soc., Los Alamitos, 2008.
11. Onespin Solutions GmbH, Germany. OneSpin 360MV. www.onespin-solutions.com.
12. N. Shekhar, P. Kalla, and F. Enescu. Equivalence verification of polynomial datapaths using
ideal membership testing. IEEE Transactions on Computer-Aided Design, 26(7):1320–1330,
2007.
13. Spear. http://www.domagoj-babic.com.
14. D. Stoffel and W. Kunz. Equivalence checking of arithmetic circuits on the arithmetic bit level.
IEEE Transactions on Computer-Aided Design, 23(5):586–597, 2004.
15. S. Vasudevan, V. Viswanath, R.W. Sumners, and J.A. Abraham. Automatic verification of
arithmetic circuits in RTL using stepwise refinement of term rewriting systems. IEEE Transactions on Computers, 56(10):1401–1414, 2007.
16. Y. Watanabe, N. Homma, T. Aoki, and T. Higuchi. Application of symbolic computer algebra to arithmetic circuit verification. In Proc. International Conference on Computer Design
(ICCD), pages 25–32, October 2007.
17. M. Wedler, D. Stoffel, R. Brinkmann, and W. Kunz. A normalization method for arithmetic
data-path verification. IEEE Transactions on Computer-Aided Design, 26(11):1909–1922,
2007.
18. S. Wefel and P. Molitor. Prove that a faulty multiplier is faulty!? In GLSVLSI ’00: Proceedings
of the 10th Great Lakes symposium on VLSI, pages 43–46. Assoc. Comput. Mach., New York,
2000.
19. O. Wienand, M. Wedler, G.-M. Greuel, D. Stoffel, and W. Kunz. An algebraic approach for
proving data correctness in arithmetic data paths. In Proc. International Conference Computer
Aided Verification (CAV), pages 473–486. Princeton, NJ, USA, July 2008.
Chapter 18
Debugging Contradictory Constraints
in Constraint-Based Random Simulation
Daniel Große, Robert Wille,
Robert Siegmund and Rolf Drechsler
Abstract Constraint-based random simulation is state-of-the-art in verification of
multi-million gate industrial designs. This method is based on stimulus generation
by constraint solving. The resulting stimuli will particularly cover corner case test
scenarios which are usually hard to identify manually by the verification engineer.
Consequently, constraint-based random simulation will catch corner case bugs that
would remain undetected otherwise. Therefore, the quality of design verification
is increased significantly. However, in the process of constraint specification for a
specific test scenario, the verification engineer is faced with the problem of over-constraining, i.e. the overall constraint specified for a test scenario has no solution.
In this case the root cause of the contradiction has to be identified and resolved.
Given the complexity of constraints used to describe test scenarios, this can be a
very time-consuming process.
In this chapter we propose a fully automated contradiction analysis method. Our
method determines all “nonrelevant” constraints and computes all reasons that lead
to the over-constraining. Thus, we pinpoint the verification engineer to exactly the
sets of constraints that have to be considered to resolve the over-constraining. Experiments have been conducted in a real-life SystemC-based verification environment
at AMD Dresden Design Center. They demonstrate a significant reduction of the
constraint contradiction debug time.
Keywords Constraint-based random simulation · Contradiction · Debugging ·
SystemC verification library
18.1 Introduction
The continued advance of circuit fabrication technology that persisted over the last
30 years now allows the integration of more than 1 billion transistors in System-on-Chip (SoC) designs. The development of SoCs of such complexity leads to enormous challenges in Computer-Aided Design (CAD), especially in the area of design
verification, which needs to ensure the functional correctness of a design. Because
D. Große ()
Institute of Computer Science, University of Bremen, 28359 Bremen, Germany
e-mail: grosse@informatik.uni-bremen.de
the capacity of formal verification is limited, simulation is still the most frequently
used verification technique [22].
In directed simulation explicitly specified stimulus patterns (e.g. written by verification engineers) are applied to the design. Each of those patterns stimulates a
very specific design functionality (called a verification scenario) and the response
of the design is compared thereafter with the expected result. Due to project time
constraints, it is inherent for directed simulation that only a limited number of such
scenarios will be verified.
With random simulation these limitations are compensated. Random stimuli are
generated as inputs for the design. For example, to verify the communication over a
bus, random addresses, and random data are computed.
A substantial time reduction for the creation of simulation scenarios is achieved
by constraint-based random simulation (see e.g. [2, 22]). Here, the stimuli are generated directly from specified constraints by means of a constraint solver, i.e. the solver selects stimulus patterns which satisfy the constraints. The resulting
stimuli will also cover test scenarios for corner cases that may be difficult to generate manually. As a consequence, design bugs will be found that might otherwise
remain undetected, and the quality of design verification increases substantially.
For constraint-based random simulation several approaches have been proposed
(see e.g. [4, 11, 12, 21, 23]). However, a major problem that arises when stimuli
are specified in form of constraints is over-constraining, i.e. the constraint solver is
not able to find a valid solution for the given set of constraints. Whenever such a
contradiction occurs in a constraint-based random simulation run, this run has to be
terminated as no valid stimulus patterns can be applied. Note that over-constraining
may not necessarily happen at the very beginning of the simulation run, as modern
test-bench languages such as SystemVerilog [9] allow the addition of constraints
dynamically during simulation. In any case of over-constraining the verification engineer has to identify the root cause of the constraint contradiction. As this is usually
done manually by either code inspection or trial-and-error debug, it is a tedious and
time-consuming process.
To the best of our knowledge in this work we propose the first non-trivial algorithm for contradiction analysis in constraint-based random simulation. In the area
of constraint satisfaction problems methods for diagnosing over-constrained problems have been introduced (see e.g. [1, 16]). These methods aim to find a solution
for the over-constrained problem by relaxing constraints according to a given weight
for each constraint. In the considered problem no weights are available. Also, the
approaches do not determine all minimal reasons that cause the overall contradiction. In contrast, Yuan et al. proposed an approach to locate the source of a conflict
using a kind of exhaustive enumeration [22]. However, since this method is expected to have a very large runtime (neither an implementation nor experiments are provided), they recommend building an approximation instead. In the domain of Boolean Satisfiability
(SAT) a somewhat similar problem can be found: computing an unsat core of an
unsatisfiable formula, i.e. to identify an unsatisfiable sub-formula of the overall formula [5, 24]. However, to obtain a minimal reason the much more complex problem
of a minimal unsat core has to be considered [7, 14, 15]. Furthermore, all minimal
unsat cores are required to determine all contradictions. In general this is very time
consuming (see e.g. [13]).
In this chapter we propose a fully automatic technique for analyzing contradictions in constraint-based random simulation. The basic idea is as follows: The
overall constraint is reformulated such that (contradicting) constraints can be disabled by introducing new free variables. Next, an abstraction is computed that forms
the basis for the following steps. First, the self-contradicting constraints are identified. Then, all “nonrelevant” constraints are determined. Finally, for the remaining
constraints—typically only a very small set—a detailed analysis is performed. In
total our approach identifies all reasons of the over-constraining, i.e. all minimal
constraint combinations that lead to a contradiction of the overall constraint. As
shown by experiments in a verification environment of AMD Dresden Design Center (DDC), the debugging time is reduced significantly. The verification engineer
completely understands what causes the over-constraining and can resolve the contradictions in one single step.
The rest of this chapter is structured as follows. Section 18.2 briefly reviews
the SystemC Verification (SCV) library that is used for constraint-based random
simulation in this work. In Sect. 18.3 the considered problem is formalized and the
concepts of the contradiction analysis approach are given. The implementation of
the approach is described in detail in Sect. 18.4. Section 18.5 provides experimental
results. First, the different types of contradictions are illustrated as examples. Then,
we show the efficiency of our approach using several test cases and the application of
the analysis technique for a real-life industrial example, a verification environment
used at AMD DDC. Finally, the chapter is summarized in Sect. 18.6.
18.2 SystemC Verification Library
This section briefly reviews the SystemC Verification (SCV) library that is used for
constraint-based random simulation in this work. The SCV library was introduced in
2002 as an open source C++ class library [10, 17, 20] on top of SystemC [8, 19]. In
the following we focus only on the basic features of the SCV library for constraint-based random simulation.
Using the SCV library, constraints are modeled in terms of C++ classes. That
way constraints can be hierarchically layered using C++ class inheritance. In detail
a constraint is derived from the scv_constraint_base class. The data to be
randomized is specified as scv_smart_ptr variables.
An example of an SCV constraint is shown in Fig. 18.1. The name of the constraint is cstr. Here, the three 64 bit unsigned integer variables a, b, and addr are
randomized. The conditions on the variables a, b, and addr are defined by expressions in the respective SCV_CONSTRAINT() macros.
Internally, a constraint in the SCV library is represented by the corresponding
characteristic function, i.e. the function is true for all solutions of the constraint.
This characteristic function of a constraint is represented as a Binary Decision Diagram (BDD), a canonical and compact data structure for Boolean functions [3]. For
struct cstr : public scv_constraint_base {
  scv_smart_ptr< sc_uint<64> > a, b, addr;
  SCV_CONSTRAINT_CTOR(cstr) {
    SCV_CONSTRAINT( a() > 100 );
    SCV_CONSTRAINT( b() == 0 );
    SCV_CONSTRAINT( addr() >= 0 && addr() <= 0x400 );
  }
};
Fig. 18.1 Example constraint
stimuli generation, a weighting algorithm is applied to the constraint BDD to guarantee a uniform distribution of all constraint solutions, thereby maximizing the
chance of entering unexplored regions of the design state space. The SCV library
uses CUDD [18] as its BDD package.
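For orientation, a small usage sketch is given below. It assumes the constraint class cstr of Fig. 18.1; the instance name "c" and the loop are purely illustrative. The next() call invokes the constraint solver and assigns new random values to the scv_smart_ptr members, which can then be read by dereferencing the smart pointers and used to drive stimuli.

#include <scv.h>
#include <iostream>
// The constraint class cstr of Fig. 18.1 is assumed to be declared above.

int sc_main(int argc, char* argv[]) {
  cstr c("c");                                  // create a named constraint instance
  for (int i = 0; i < 5; ++i) {
    c.next();                                   // solve the constraint, draw one random solution
    std::cout << "a=" << *c.a << " b=" << *c.b
              << " addr=" << *c.addr << std::endl;
  }
  return 0;
}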
18.3 Contradiction Analysis
In this section first the considered problem, that is the contradiction of constraints,
is formalized. Then, we present concepts for the contradiction analysis approach.
18.3.1 Problem Formulation
Before the problem is formulated we define the type of constraints that are considered in this chapter.
Definition 18.1 A constraint is a Boolean function over variables from the set of
variables V . For the specification of a constraint, the typical HDL operators, such
as logic AND, logic OR, arithmetic operators, and relational operators, can be
used.
Usually a constraint consists of a conjunction of other constraints. We formalize
the resulting overall constraint in the following definition.
Definition 18.2 An overall constraint is defined as

C = ⋀_{i=0}^{n−1} Ci ,

where the Ci are constraints according to Definition 18.1.
In practice, the conjunction is built by the explicit use of several SCV_CONSTRAINT() macros or by applying inheritance, i.e. parts of the constraints are defined in a base class and inherited in the actual constraint. Note that this is not
specific to constraint-based random simulation using the SCV library. In fact, the
same principles are found, for example, in the random constraints of SystemVerilog
[9].
During the specification of complex, non-trivial constraints, the problem of over-constraining arises:
Definition 18.3 An overall constraint C is over-constrained or contradictory iff C
is not satisfiable, i.e. C evaluates to 0 for all assignments to the constraint variables.
Typically, if C is contradictory the verification engineer has to manually identify
the reason for the over-constraining. This process can be very time-consuming because several cases are possible. For example, one of the constraints Ci may have
no solution. Another reason for a contradiction may be that the conjunction of some
of the constraints Ci leads to 0. In the following the term reason as used in the rest
of this chapter is defined.
Definition 18.4 A reason for a contradictory overall constraint C is a set R =
{Ci1 , Ci2 , . . . , Cik } ⊆ {C0 , C1 , . . . , Cn−1 } with the following two properties:
1. The constraints in R form a contradiction, i.e. the conjunction Ci1 ∧ Ci2 ∧ · · · ∧
Cik always evaluates to 0. Therefore the overall constraint C is contradictory.
2. Removing an arbitrary constraint from R resolves the contradiction, i.e. minimality of R is required.
Often the root of the over-constraining results from more than one contradiction,
i.e. there is more than one reason. If in this case only one reason is identified and fixed by the
verification engineer, the constraint solver has to solve the fixed constraint again,
but there is still no solution.
Based on these observations, the following problem is considered in this chapter:
How can we efficiently compute all minimal reasons for an over-constraining
and thereby support the verification engineer in constraint debugging?
Our approach facilitates analyzing the contradictions in the overall constraint C and presents all reasons. In particular, excluding all constraints which are
not part of a contradiction reduces the debugging time significantly.
18.3.2 Concepts for Contradiction Analysis
The general idea of the contradiction analysis approach is as follows: The overall
constraint C is reformulated such that the conflicting constraints can be disabled by
the constraint solver and C becomes satisfiable. By analyzing the logical dependencies of the disabled constraints, we can identify all reasons for the over-constraining.
Fig. 18.2 Contradictory constraint
Definition 18.5 Let C be over-constrained. Then the reformulated constraint C ′ is
built by introducing a new free variable si for each constraint Ci and substituting
each constraint Ci with an implication from si to Ci . That is,
C′ = ⋀_{i=0}^{n−1} (si → Ci).
For the reformulated constraint C ′ the following holds:
1. If si is set to 1, then the constraint Ci is enabled.
2. If si is set to 0, then the constraint Ci is disabled because Ci can evaluate to 0
or 1.
Note that the usage of an implication is crucial. If an equivalence is used instead of
an implication, si = 0 would imply the negation of Ci .
Example 18.6 Figure 18.2(a) shows a constraint C which is over-constrained. Reformulating C to C ′ avoids the over-constraining because a constraint Ci may be
disabled by assigning si to 0. The table in Fig. 18.2(b) gives all assignments to si
such that the reformulated overall constraint C′ evaluates to 1.¹ That is, the table
shows which constraints have to be disabled to get a valid solution. For example,
from the first row it can be seen that disabling C0 , C2 , C3 , and C5 avoids the contradiction.
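To make the reformulation tangible, the following self-contained sketch (plain C++ with made-up toy constraints, not the BDD-based machinery used in the actual approach) builds C′ for three small constraints over a single integer variable and prints every assignment to the si for which C′ is satisfiable, i.e. the rows of an assignment table like the one in Fig. 18.2(b).

#include <cstdio>
#include <functional>
#include <vector>

using Constraint = std::function<bool(int)>;      // a constraint over one variable x

int main() {
  // Toy constraints: C0 is self-contradictory, C1 and C2 contradict each other,
  // so the overall constraint C = C0 AND C1 AND C2 has no solution.
  std::vector<Constraint> C = {
    [](int x) { return x > 0 && x < 0; },          // C0
    [](int x) { return x > 10; },                  // C1
    [](int x) { return x < 5;  }                   // C2
  };
  const int n = static_cast<int>(C.size());

  // Enumerate all assignments to the selector variables s0..s(n-1) and keep those
  // for which C' = (s0 -> C0) AND (s1 -> C1) AND (s2 -> C2) is satisfiable on a
  // small finite domain; each kept assignment is one row of the assignment table.
  for (unsigned s = 0; s < (1u << n); ++s) {
    bool satisfiable = false;
    for (int x = -100; x <= 100 && !satisfiable; ++x) {
      bool ok = true;
      for (int i = 0; i < n; ++i)
        if ((s >> i) & 1u) ok = ok && C[i](x);      // si -> Ci(x)
      if (ok) satisfiable = true;
    }
    if (satisfiable) {
      for (int i = 0; i < n; ++i)
        std::printf("s%d=%u ", i, (s >> i) & 1u);
      std::printf("\n");
    }
  }
  return 0;
}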
Based on the reformulation the verification engineer is able to avoid the over-constraining. But to understand what causes the over-constraining, i.e. to identify
the reason of each contradiction, a more detailed analysis is required. Here, two
properties of the assignment table obtained from the reformulated overall constraint
can be exploited.
¹ Here '−' denotes a don't care, i.e. the value of si can be either 0 or 1. The table is derived from a symbolic BDD representation of all solutions for the si variables after abstraction of all other variables.
Note that for simplicity we always refer to the assignment table in the presentation. As shown later in the implementation, the assignment table need not be
built explicitly.
Property 1 The value of variable si is 0 for all solutions (i.e. in each row of the table) iff the respective constraint Ci is self-contradictory (that is Ci has no solution).
Proof ⇒: We show this by contraposition: If Ci has at least one solution, then there
is a row where si is 1. Obviously this solution (row) can be constructed by assigning
1 to si and 0 to sj for j ≠ i, because (si → Ci) = ¬si ∨ Ci = 0 ∨ Ci = Ci = 1 and
(sj → Cj) = ¬sj ∨ Cj = 1 ∨ Cj = 1 for j ≠ i.
⇐: To satisfy C′, each element of the conjunction must evaluate to 1, in particular
(si → Ci) = ¬si ∨ Ci. Since Ci has no solution (Ci is always 0), si must be 0.
Thus, each constraint Ci whose si variable is always assigned 0 is a reason
for the contradictory overall constraint C.
Property 2 The value of variable si is don’t care for all solutions (i.e. for all rows
of the table) iff the constraint Ci is never part of a contradiction of C.
Proof ⇒: This property is shown by contradiction. Assume that si is don't care
for all solutions and Ci is part of a contradiction. Then, without loss of generality,
there has to be another satisfiable constraint Cj such that Ci ∧ Cj = 0.² If sj is set
to 1 and all other constraints Ck with k ≠ j are disabled by sk = 0, then C′ is 1.
However, switching si to 1 is not possible due to the conflict of Ci and Cj. But this
contradicts the assumption that the value of si is don't care for all solutions.
⇐: Because the constraint Ci is never part of a contradiction, Ci can be enabled
or disabled. In other words, si can be set to 0 and also to 1 for each solution
of the overall constraint, which is equivalent to si being don't care.
Thus, each constraint Ci whose si variable is always don't care is not part of a
reason for the contradictory overall constraint. Therefore these constraints are not
presented to the verification engineer and can be left out in the next steps.
Example 18.7 Consider again Example 18.6. Because the value of s0 is 0 for all
solutions, C0 is self-contradictory. Thus, R0 = {C0 } is a reason for C. Since the
value of s1 is always don’t care, C1 is never part of a contradiction. As a result the
first two constraints can be ignored in the further analysis.
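The two properties can also be stated directly in terms of the enumerated solution rows. The small helper below (hypothetical names, explicit enumeration instead of the symbolic BDD checks used in the implementation) classifies each constraint: si always 0 corresponds to Property 1, and si don't care corresponds to Property 2, which here means that flipping si in any solution row yields a solution row again.

#include <cstdio>
#include <set>

// Each solution row is encoded as a bit mask over s0..s(n-1) (bit i holds the value of si).
void classifyConstraints(const std::set<unsigned>& rows, int n) {
  for (int i = 0; i < n; ++i) {
    bool alwaysZero = true;                        // Property 1 candidate
    bool dontCare   = true;                        // Property 2 candidate
    for (unsigned r : rows) {
      if ((r >> i) & 1u) alwaysZero = false;
      if (rows.count(r ^ (1u << i)) == 0) dontCare = false;
    }
    if (alwaysZero)     std::printf("C%d is self-contradictory (Property 1)\n", i);
    else if (dontCare)  std::printf("C%d is never part of a contradiction (Property 2)\n", i);
    else                std::printf("C%d remains for the detailed analysis\n", i);
  }
}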
Note that the overall constraint of the example in Fig. 18.2(a) has been specified
to demonstrate the two properties. In practice, the number of constraints that are
² According to Property 1, both constraints Ci and Cj have at least one solution.
never part of a contradiction is considerably larger. Thus, applying Property 2 reduces the debugging effort significantly because each "nonrelevant" constraint does
not have to be considered anymore by the verification engineer.
In fact, all remaining constraints (if there are any) are part of at least one contradiction. Furthermore, since self-contradictory constraints have been filtered out
by Property 1, only a conjunction of two or more constraints causes a contradiction.
Now the question is, how can we identify the minimal contradicting conjunctions of
the remaining constraints, i.e. the reasons?
Example 18.8 Again Example 18.6 is considered. The constraints C0 and C1 have
been handled already according to Property 1 and Property 2. Now, the conjunction
of two or more of the remaining constraints, C2 , C3 , C4 , C5 , and C6 , causes a
contradiction. Only identifying the product of all these constraints certainly does
not help to resolve the conflict easily. In contrast, the over-constraining can only be
fixed if the different contradictions are understood. But this requires the computation
of all minimal reasons according to Definition 18.4. In the example, three reasons
can be found in total: R1 = {C2 , C4 } and R2 = {C3 , C4 }, which overlap, as well as
R3 = {C5 , C6 }, which is independent of the two before.
To find the minimal reason for each contradiction, all constraint combinations
are tested for a contradiction starting with the smallest conjunction. For each
tested combination the respective si variables are set to 1. Thus, if the conjunction Ci1 ∧ · · · ∧ Cik leads to a contradiction ((si1 = 1) ∧ · · · ∧ (sik = 1) ∧ C ′ ≡ 0),
then this combination is a reason for C. The minimality is ensured by building the
constraint combinations in ascending order with respect to their size and skipping
each superset of a previously found reason. Since the overall problem has already
been simplified by exploiting Property 1 and Property 2, the combination-based procedure has to be applied only to a small set of constraints, i.e. the remaining ones.
This is the key to the efficiency of the overall contradiction analysis procedure.
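A possible realization of this combination procedure is sketched below (hypothetical names). The actual contradiction test is abstracted into a callback; in our setting it conjoins the si variables of the selected constraints with the abstraction of C′ and compares the result against the zero function.

#include <algorithm>
#include <functional>
#include <vector>

// Tests subsets of the remaining constraints (given by their indices) in ascending
// cardinality and returns all minimal contradicting combinations, i.e. the reasons.
std::vector<std::vector<int>> minimalReasons(
    const std::vector<int>& remaining,
    const std::function<bool(const std::vector<int>&)>& contradicts) {
  const int r = static_cast<int>(remaining.size());
  std::vector<unsigned> masks;
  for (unsigned m = 1; m < (1u << r); ++m)
    if (__builtin_popcount(m) >= 2) masks.push_back(m);     // skip empty and singleton sets
  std::sort(masks.begin(), masks.end(), [](unsigned a, unsigned b) {
    return __builtin_popcount(a) < __builtin_popcount(b);   // ascending cardinality
  });

  std::vector<unsigned> found;                               // reasons as bit masks
  std::vector<std::vector<int>> reasons;
  for (unsigned m : masks) {
    bool superset = std::any_of(found.begin(), found.end(),
                                [m](unsigned f) { return (f & m) == f; });
    if (superset) continue;                                  // ensures minimality
    std::vector<int> subset;
    for (int j = 0; j < r; ++j)
      if (m & (1u << j)) subset.push_back(remaining[j]);
    if (contradicts(subset)) {                               // enable exactly these constraints
      found.push_back(m);
      reasons.push_back(subset);
    }
  }
  return reasons;
}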
The next section presents the details on the implementation of the overall contradiction analysis approach.
18.4 Implementation
As mentioned earlier, the SCV library uses BDDs for the representation of
constraints. More precisely, the characteristic function of the overall constraint is
represented as a BDD. This characteristic function is true for all solutions of the
constraint, false otherwise. We implemented the contradiction analysis approach
using the SCV library; therefore, our implementation is "BDD-driven".
The pseudo-code of the contradiction analysis approach is shown in Fig. 18.3.
As input the approach starts with the BDD representation of the reformulated constraint C ′ and the set of all constraint variables V . At first, all constraint variables
are existentially quantified from the reformulated constraint (line 3). Thus, the resulting function C ′′ only depends on the si variables. In other words, this function
Fig. 18.3 Overall algorithm
is the symbolic representation of the assignment table described in the previous section. In general the quantified BDD is much more compact than the BDD for the
reformulated constraint. Thus, the following BDD operations can be executed very
fast.
After quantification the two sets R and S are initialized to the empty set. R stores
all reasons that are found. Note that for simplicity R contains the sets of the corresponding si variables of a reason, not the constraints themselves. The set S is used to
save all si variables that are passed to the detailed analysis later. So this set corresponds to the remaining constraints. Then, for each constraint Ci it is checked if
Ci is either self-contradictory (line 9) or never part of a contradiction (line 12) according to Property 1 and Property 2. In the former case the respective si variable is
added to the set of reasons R (line 11). Both checks are conducted on the quantified
representation C ′′ of the reformulated constraint, that is:
• To check whether si is 0 for all solutions (see Property 1), the conjunction C′′ ∧ (si = 1) is
computed. If the result is the constant zero-function, si is never 1 in any solution,
i.e. si is always zero. Thus, Ci becomes a reason.
• The check if si is don’t care in all solutions (see Property 2) is carried out by
(C ′′ ∧ si = 0) ≡ (C ′′ ∧ si = 1). If the respective BDDs are equal, it has been
shown that si is don’t care, since regardless of the value of si the solutions are
identical. Therefore, the constraint Ci is not relevant for a contradiction and thus
neither added to the set R nor to the set S.
If neither property applies (line 14), then the respective constraint Ci is
part of a contradiction caused by the conjunction of Ci with one or more other
constraints. Thus, Ci is passed to the detailed analysis by inserting the respective si
into S (line 16).
Finally, the detailed analysis for all elements in S—the remaining constraints—
is performed (lines 18 to 25). First, the power set P(S) of S is created, resulting in
all subsets (i.e. combinations) of constraints considered for detailed analysis. Note
that we exclude from the power set the empty set as well as all sets which contain only one element
(these are already covered by Property 1). Furthermore, during the
construction the elements of the power set are ordered according to their cardinality.
Then, for each subset X (i.e. for each combination) the conjunction of the respective
constraints is tested for a contradiction. Therefore, the conjunction of the current
combination X—represented as a cube of all variables si ∈ X—and C ′′ is created,
i.e. all respective constraints Ci are enabled (line 23). If the conjunction leads to
a contradiction, then X is a reason and thus X is added to R (line 25). To ensure
minimality, each contradiction test of a subset X is carried out only if no reason
X′ ∈ R exists such that X′ ⊂ X (lines 20–22), i.e. no subset of X has already been
identified as a reason for a contradiction (see also Definition 18.4).
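For completeness, the quantification and the two property checks can be written down using CUDD's C++ interface (cuddObj.hh), which the SCV library builds upon. The sketch below is a simplified reconstruction under that assumption, not the code behind Fig. 18.3; in particular, the don't-care test is expressed as a comparison of the two cofactors of C′′ with respect to si, which is equivalent to the check described above. The constraints classified as "remaining" are then handed to the combination procedure sketched in Sect. 18.3.2.

#include "cuddObj.hh"
#include <cstddef>
#include <vector>

// cPrime   : BDD of the reformulated constraint C'
// varsCube : cube of all constraint variables in V
// s[i]     : BDD variable introduced for constraint Ci
// Returns 0 = self-contradictory, 1 = never part of a contradiction, 2 = remaining.
std::vector<int> applyProperties(Cudd& mgr, const BDD& cPrime,
                                 const BDD& varsCube,
                                 const std::vector<BDD>& s) {
  BDD cq = cPrime.ExistAbstract(varsCube);        // C'': abstraction over the si only
  std::vector<int> kind(s.size(), 2);
  for (std::size_t i = 0; i < s.size(); ++i) {
    if ((cq & s[i]) == mgr.bddZero())             // Property 1: si is 0 in every solution
      kind[i] = 0;
    else if (cq.Cofactor(s[i]) == cq.Cofactor(!s[i]))
      kind[i] = 1;                                // Property 2: si is don't care everywhere
  }
  return kind;
}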
In summary, the presented contradiction analysis procedure computes all minimal reasons R of a contradictory overall constraint C. First, the proposed reformulation of the overall constraint allows a representation where all contradictory
constraints can be disabled. From this representation a much more compact one is
computed by quantification. All following operations have to be carried out on this
representation only. Then, the two properties are applied, which significantly reduces
the problem size, since only 2^(n−|Z|−|DC|) instead of all 2^n subsets have to be considered in the detailed analysis (Z denotes the set of self-contradictory constraints, and
DC denotes the set of constraints which are not part of a contradiction). In practice,
especially the number of "nonrelevant" constraints that belong to the set DC is very
large, so the input for the detailed analysis shrinks considerably.
18.5 Experimental Evaluation
This section provides experimental results for the contradiction analysis. First, different types of contradictions that have been observed in practice are discussed.
Then, we show the efficiency of our method using several test cases. Finally, we
demonstrate the advantages of our approach in an industrial setting. We briefly discuss a constraint-based simulation environment used at AMD DDC for verification
of SoC designs. By means of a concrete example we will show how time spent on
debugging constraint contradictions is significantly reduced by our approach.
Fig. 18.4 Types of contradictions
In all examples the partitioning of the constraints is given according to the specification in the constraint classes, i.e. each Ci in the following corresponds to a
separate SCV_CONSTRAINT() macro (see also Sect. 18.3.1). The contradiction
analysis is started by an additional command line switch and runs fully automatically in
the SCV library environment.
18.5.1 Types of Contradictions
We have identified different types of contradictions. In the following the general
structure is shown by means of examples. We assume that self-contradictory constraints as well as "nonrelevant" constraints have been removed. Assume k constraints are left. Then, one of the following cases, which are automatically identified by our approach, is possible:
1. There is exactly one contradiction that is caused by all k constraints. Here, no
other subset of the constraints forms a contradiction and thus all constraints are
the only reason for the over-constraining. A simple and a more complex example
are shown in Fig. 18.4(a).³
2. There are at least two contradictions. This case can be refined further:
a. Our approach determines m disjoint partitions from the constraint set. This
means our approach has identified m independent contradictions. An example
is given in Fig. 18.4(b). In this example for the constraint set {C0 , C1 , C2 , C3 }
the two reasons R0 = {C0 , C1 } and R1 = {C2 , C3 } are determined.
b. There is at least one overlap, i.e. at least one constraint Ci is part of at
least two reasons. An example is also given in Fig. 18.4(c). This example
shows the two reasons R0 = {C0 , C2 , C4 } and R1 = {C1 , C3 , C4 }. Obviously
C4 is part of both reasons.
³ The reasons are marked by brackets.
Our proposed approach is able to identify the minimal reasons for all these types
of contradictions.
18.5.2 Effect of Property 1 and Property 2
Applying the two properties introduced in Sect. 18.3.2 significantly reduces the
complexity of the contradiction analysis since each matched constraint can be excluded from further consideration. To show the gain in efficiency, we tested our
approach on several examples which contain typical over-constraining errors
(e.g. typos, contradicting implications, hierarchical contradictions, etc.).
For the considered constraints we give some statistics in Table 18.1. In the first
column a number identifying the test case is given. Then, in the next columns, information on the number
of constraint variables and their respective sizes (i.e. number of bits) is provided. Finally, the total number of constraints is given. The results
after application of our contradiction analysis are shown in Table 18.2. The first
four columns give some information about the test case, i.e. the test case number (#), the number of constraints in total (n), the number of contradictions/reasons (|R|), and the runtime
in CPU seconds needed to construct the BDD in the SCV library (BDD TIME).
The next columns provide the results for the trivial analysis approach without (W/O
PROPERTIES) and with the application of the properties (WITH PROPERTIES), respectively. Here the number of checks in the worst case (2^n or 2^(n′), respectively), the
number of checks actually executed by the approach (# EXEC.), and the runtime for the
detailed analysis (TIME) are given. Additionally, the number of self-contradictory constraints (|Z|) and "nonrelevant" constraints (|DC|) obtained by the two properties are provided.
Table 18.1 Constraint characteristics

#   BOOL   INT   LONG   BITS    CONSTR. (n)
1   10     8     –      328     15
2   3      3     6      483     16
3   10     10    –      330     26
4   8      40    –      1,288   50
5   5      30    15     1,925   53
Table 18.2 Effect of using properties

                           W/O PROPERTIES                      WITH PROPERTIES
#   n    |R|   BDD TIME   2^n             # EXEC.   TIME      |Z|   |DC|   2^(n′)   # EXEC.   TIME
1   15   1     5.48       32,768          24,577    4.12      0     13     4        4         0.06
2   16   3     14.90      65,536          26,883    11.25     1     8      128      107       0.04
3   26   1     22.30      67,108,864      –         TO        0     21     32       32        0.30
4   50   3     35.96      > 1.1 · 10^15   –         TO        0     42     256      190       2.10
5   53   2     238.07     > 9.0 · 10^15   –         TO        0     47     64       55        9.77

(All runtimes are given in CPU seconds; TO denotes a timeout.)
The results clearly show that identifying all reasons without applying the properties leads to a large number of checks in the worst case (e.g. 2^53 > 9.0 · 10^15 in
example #5). In contrast, when the properties are applied, most of the constraints
can be excluded from the analysis since they are "nonrelevant". This significantly reduces the number of checks to be performed in the detailed analysis. Instead of all 2^n,
only 2^(n−|Z|−|DC|) checks are needed in the worst case (only 64 in example #5). As a
result, the runtime of the detailed analysis is orders of magnitude faster when the properties
are applied. Moreover, for the last three test cases the reasons can be determined
within the timeout of 7200 CPU seconds only when the properties are applied.
18.5.3 Real-Life Example
The constraint contradiction analysis algorithm has been evaluated using a real-life design example. The corresponding verification environment is depicted in
Fig. 18.5.
The Design Under Verification (DUV) is a PCIe root complex design with an
AMD-proprietary host bus interface which is employed in a SoC recently developed
by AMD. The root complex supports a number of PCIe links. The verification tasks
are to show (1) that transactions are routed correctly from the host bus to one of the
PCIe links and vice versa, (2) that the PCIe protocol is not violated, and (3) that no
deadlocks occur when multiple PCIe links communicate to the host bus at the same
time.
Host bus and PCIe links (only one depicted in Fig. 18.5) are driven by Bus Functional Models (BFMs) which convert abstract bus transactions into the detailed signal wigglings on those buses. The abstract bus transactions are generated by means
of random generators (denoted by G) which are in turn controlled by constraints.
Bus monitors observe the transactions sent into or from either interface and send
them to checkers which perform the end-to-end transaction checking of the DUV.
The verification environment is implemented in SystemC 2.1, the SCV library, and
SystemVerilog, with a special co-simulation interface synchronizing the SystemVerilog and SystemC simulation kernels. The constrained-random verification methodology was chosen both to reduce the effort in stimulus pattern development and to
achieve high coverage of stimulation corner cases. The PCIe and host bus protocol rules
were captured in SCV constraint descriptions and are used to generate the contents
of the abstract bus transactions driving the BFMs.
The PCIe constraint used to control stimulus generation within the PCIe transaction generator is a layered constraint. The lower level layer describes generic PCIe
protocol rules and comprises 16 constraint terms. They are shown
Fig. 18.5 Architecture for verification
in Fig. 18.6(a) (denoted C0 to C15).⁴ The meaning of the constraint variables
is given in Table 18.3.
The upper level layer imposes user-specific constraints on the generic PCIe constraints (denoted by CUi ) in order to generate specific stimulus scenarios. Generic
PCIe constraints and user-defined constraints are usually developed by different verification engineers; the former by the designer of the test environment and the latter
by the engineer who implements and runs the tests.
⁴ Bit operators are used as introduced in [6].
Fig. 18.6 PCIe transaction generator constraint with examples
The engineer writing the tests, and hence the user-specific constraints which are
layered on top of the generic PCIe constraints, is faced with the problem of resolving
contradictions which arise when the user-defined constraints are imposed on the
generic PCIe constraints. Given the complexity of the constraints, this is usually a
non-trivial task. Two real-life examples of contradictions that are not easy to resolve
by manual constraint inspection are depicted in Fig. 18.6(b).
In the first example the user sets the maximum transaction length to a value
greater than 128 bytes (CU1), thereby causing a contradiction with constraint C13,
which states that the total transaction length must not exceed 128 bytes.
Table 18.3 Definition of random variables used in the PCIe constraint

VARIABLE NAME                  DESCRIPTION
addr                           transaction address (64 bits)
addr_space                     transaction address space (memory, io, config)
tkind                          transaction kind (request, response)
cmd                            transaction command (read, write)
msr                            transaction is targeted at MSR space
posted                         transaction is posted (yes/no)
length                         transaction size in dwords
be[]                           array of byte enables (one per each dword data)
data[]                         array of dword (32 bit) data
be[].len                       length of byte enable array
data[].len                     length of data array
[io|mem|cfg]_addr_base0,1      io, memory and config space window base addresses
[io|mem|cfg]_size0,1           io, memory and config space window sizes
In the second example, the user independently constrains the transaction address to byte address 4000 (CU2) and the transaction length to 100 bytes (CU3). While both values,
viewed independently, are perfectly legal (the address is in the 32-bit range
and the transaction length is less than 128 bytes), an over-constraining occurs. The reason
identified by our approach is R1 = {C12, CU2, CU3}. By manual constraint inspection it is not immediately obvious that a PCIe protocol rule is violated when combining constraints CU2 and CU3. However, reason R1, found for the contradiction
by our algorithm, shows that when constraints CU2 and CU3 are combined, PCIe
protocol rule C12 is violated: "A transaction must not cross a 4 K page boundary".
A transaction with start address 4000 and a length of 100 bytes would result in addresses that cross a 4 K page boundary (4000 + 100 = 4100 > 4096) and therefore
violates this constraint.
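For illustration only, a strongly simplified SCV rendering of the three constraints involved in reason R1 could look as follows. The class and variable names are hypothetical, the transaction length is expressed in bytes for brevity, and the page rule is reduced to a single inequality; the real constraints used at AMD DDC are more elaborate. With addr = 4000 and a length of 100 bytes, the left-hand side of the page rule evaluates to 4100 > 4096, so the three SCV_CONSTRAINT() expressions cannot be satisfied together.

#include <scv.h>

struct pcie_page_rule_example : public scv_constraint_base {
  scv_smart_ptr< sc_uint<64> > addr;         // transaction address
  scv_smart_ptr< sc_uint<8> >  len_bytes;    // transaction length in bytes (simplification)
  SCV_CONSTRAINT_CTOR(pcie_page_rule_example) {
    // C12 (simplified): a transaction must not cross a 4 K page boundary
    SCV_CONSTRAINT( (addr() & 0xFFF) + len_bytes() <= 0x1000 );
    // CU2 and CU3: user-specific constraints layered on top of the generic rules
    SCV_CONSTRAINT( addr() == 4000 );
    SCV_CONSTRAINT( len_bytes() == 100 );
  }
};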
The algorithm described in this chapter is able to identify exactly the violating
constraint expressions for both examples in about 30 seconds. The PCIe constraint
to be analyzed contained a total of 21 random variables to be solved which are constrained by 17 and 18 constraint expressions for the respective examples. The total
bit count for the random variables amounted to 781 bits. Without such an analysis capability, we would have had to spend several hours on manual constraint inspection in order to identify the root cause for the constraint contradiction. Thus,
a significant speed up of the contradiction debug cycle was achieved.
18.6 Conclusions
In this chapter we have presented a fully automatic approach to analyze contradictory constraints that occur in constraint-based random simulation. After reformulating the overall constraint and building an abstraction, the self-contradictory
constraints and all “nonrelevant” constraints are determined in an initial step. Then
for the small set of remaining constraints, all minimal reasons for a contradiction are
computed efficiently and presented to the verification engineer. The minimality and
completeness of the reasons allow the verification engineer to fully understand the over-constraining. Thus,
the conflict can be resolved in a single step. In total, as
shown by industrial experiments, the debugging time is reduced significantly.
References
1. R.R. Bakker, F. Dikker, F. Tempelman, and P.M. Wognum. Diagnosing and solving overdetermined constraint satisfaction problems. In International Joint Conference on Artificial
Intelligence, pages 276–281, 1993.
2. J. Bergeron. Writing Testbenches Using SystemVerilog. Springer, Berlin, 2006.
3. R.E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Trans. on
Comp., 35(8):677–691, 1986.
4. R. Dechter, K. Kask, E. Bin, and R. Emek. Generating random solutions for constraint satisfaction problems. In Eighteenth National Conference on Artificial Intelligence, pages 15–21,
2002.
5. E. Goldberg and Y. Novikov. Verification of proofs of unsatisfiability for CNF formulas. In
Design, Automation and Test in Europe, pages 10886–10891, 2003.
6. D. Große, R. Ebendt, and R. Drechsler. Improvements for constraint solving in the SystemC
verification library. In ACM Great Lakes Symposium on VLSI, pages 493–496, 2007.
7. J. Huang. Mup: a minimal unsatisfiability prover. In ASP Design Automation Conf., pages
432–437, 2005.
8. IEEE Std. 1666. IEEE Standard SystemC Language Reference Manual, 2005.
9. IEEE Std. 1800. IEEE SystemVerilog, 2005.
10. C.N. Ip and S. Swan. A tutorial introduction on the new SystemC verification standard. White
paper, 2003.
11. M.A. Iyer. Race: A word-level ATPG-based constraints solver system for smart random simulation. In Int’l Test Conf., pages 299–308, 2003.
12. N. Kitchen and A. Kuehlmann. Stimulus generation for constrained random simulation. In
Int’l Conf. on CAD, pages 258–265, 2007.
13. M.H. Liffiton and K.A. Sakallah. On finding all minimally unsatisfiable subformulas. In Theory and Applications of Satisfiability Testing, pages 173–186, 2005.
14. M.N. Mneimneh, I. Lynce, Z.S. Andraus, J.P. Marques-Silva, and K.A. Sakallah. A branch
and bound algorithm for extracting smallest minimal unsatisfiable formulas. In Theory and
Applications of Satisfiability Testing, pages 467–474, 2005.
15. Y. Oh, M. Mneimneh, Z. Andraus, K. Sakallah, and I. Markov. Amuse: A minimally unsatisfiable subformula extractor. In Design Automation Conf., pages 518–523, 2004.
16. T. Petit, J.-C. Régin, and C. Bessière. Specific filtering algorithms for over-constrained problems. In International Conference on Principles and Practice of Constraint Programming,
pages 451–463, 2001.
17. J. Rose and S. Swan. SCV randomization version 1.0. 2003.
18. F. Somenzi. CUDD: CU Decision Diagram Package Release 2.3.0. University of Colorado at
Boulder, Boulder, 1998.
19. Synopsys Inc., CoWare Inc., and Frontier Design Inc. Functional Specification for SystemC
2.0. http://www.systemc.org.
20. SystemC Verification Working Group. SystemC Verification Standard Specification Version
1.0e. http://www.systemc.org.
21. J. Yuan, A. Aziz, C. Pixley, and K. Albin. Simplifying boolean constraint solving for random simulation-vector generation. IEEE Trans. on CAD of Integrated Circuits and Systems,
23(3):412–420, 2004.
22. J. Yuan, C. Pixley, and A. Aziz. Constraint-Based Verification. Springer, Berlin, 2006.
23. J. Yuan, K. Shultz, C. Pixley, H. Miller, and A. Aziz. Modeling design constraints and biasing
in simulation using BDDs. In Int’l Conf. on CAD, pages 584–590, 1999.
24. L. Zhang and S. Malik. Validating SAT solvers using an independent resolution-based checker:
Practical implementations and other applications. In Design, Automation and Test in Europe,
pages 10880–10885, 2003.
Chapter 19
Design of Communication Infrastructures
for Reconfigurable Systems
Alessandro Meroni, Vincenzo Rana,
Marco D. Santambrogio and Francesco Bruschi
Abstract Dynamic reconfiguration capabilities of FPGA devices are commonly exploited in order to perform changes in a system with respect to its computational elements. This work proposes a framework able to exploit different levels of
simulation in order to perform a requirements-driven design of the communication
infrastructure of a reconfigurable system, so that the overall performance can be
improved.
To accomplish this requirements-driven design it is necessary to perform a design
space exploration of applications and scenarios in which a particular system can be
used. A new scenario-centric approach is proposed in order to identify metrics and
requirements needed to apply a communication infrastructure reconfiguration.
Keywords Communication infrastructures · Dynamic reconfiguration · FPGA
19.1 Introduction
With the increasing diffusion of partially and dynamically reconfigurable FPGAs
[7, 19], designers have new possibilities to design at system level. One of the main
possibilities offered by FPGAs is to change the functionality implemented on the
physical device as many times as desired; exploiting this feature allows designers to follow
different approaches in order to fit different scenarios. For instance, thanks to partial dynamic reconfiguration, it is possible to change the behavior of given areas of
the chip while the system is up and running, without the need to stop its execution.
In general, an architecture is characterized by its computational components and
communication infrastructure. With a run-time partial reconfiguration of the device
it is possible to dynamically change both these components, increasing the flexibility and the performance of the developed system. Nowadays, design space
exploration has mainly focused on the computational aspect of the system-on-chip
(SoC) rather than on the communication one, but as the number of components increases, a communication-based analysis is important to evaluate the scalability and
the performance of the overall system; for these reasons, the proposed work is focused on both the design and the dynamic reconfiguration of the communication
layer.
As described in [18], partial reconfiguration can lead to a novel view of designs
and applications, giving designers new degrees of freedom. With this ability, several advantages become available at design time, such as the possibility
to change the communication infrastructure implementation, adapt hardware algorithms, increase resource utilization, upgrade hardware remotely, change the communication infrastructure at run-time, and provide continuous hardware servicing.
With respect to the different communication architectures available nowadays,
it is important to state that, since the introduction of the SoC model, ad hoc mixes
of buses and point-to-point solutions have characterized the communication structures.
However, due to their intrinsic limitations, a common concept for segmented SoC
communication structures based on networks arose: the network-on-chip (NoC) [3].
This new network-based approach has recently characterized the design of many
embedded systems, which proves that having a scalable and reliable system is a real
need. A communication-based analysis can be made by evaluating parameters such as
the area usage or the power consumption. In order to perform a good analysis of the
communication infrastructure layer, it is also necessary to take into account other
intrinsic communication metrics, such as throughput and latency [12].
The novelty introduced by the proposed work lies in:
• the introduction of a new point of view for the design of an embedded system,
applying a scenario-centric approach that simplifies the architecture definition
and that allows the designer to focus more, during system design, on the application to be
implemented and on its requirements;
• the use of the metrics related to the particular application scenario, in order to
help the designer in the definition of the best fitting communication infrastructure
for the system that has to be developed;
• the possibility to apply partial dynamic reconfiguration in order to change, in
several ways, the current system communication architecture, for instance changing only bottleneck links or completely replacing the infrastructure with a more
suitable one with respect to the user-specified constraints;
• the implementation of a simulation framework that starts with a High Level Description (HLD) of the target architecture, and performs three other important
phases before the actual VHDL implementation: a High Level Network Simulation (HLNS), a solutions Evaluation and Selection (E&S) and a SystemC Verification and Validation (V&V) process.
This chapter is organized as follows: related work in the communication infrastructures research field is presented in Sect. 19.2; a real-world application
analysis is given in Sect. 19.3; the proposed solution is discussed in Sect. 19.4;
finally, results and conclusions are drawn in Sects. 19.5 and 19.6.
19.2 Related Works
This section aims at presenting recent works that deal with the development of tools
and frameworks for the automatic generation and verification of NoCs.
Recently, Ost et al. [17] presented MAIA, a framework to generate and
evaluate NoCs with varying architectural parameters. It is composed of three main
steps: NoC specification and generation, traffic generation, and traffic analysis. The
supported topologies are mesh, torus and ring. Wolkotte et al. [22] also performed a
study on the possibility of simulating networks-on-chip. They considered three different methods: VHDL, SystemC and FPGA simulations.
Their work provides a good amount of simulation results, showing that their approach is
very useful for parallel systems.
Instead of performing simulations on FPGAs, Fen et al. [8] performed accurate
simulations using the OPNET network simulator. They analyzed only the latency and
throughput of different NoCs, such as 2D mesh, Fat-Tree and Butterfly Fat-Tree, using two switching techniques (wormhole and virtual cut-through), and found that
the Fat-Tree topology with the wormhole switching technique can be considered a good
solution for NoC designs.
Although other works, such as NoCSim [15] and NoCGEN [4], have been evaluated, none of them introduces two very important concepts (with the exception of
our recently presented flow, on which this work is based [13]): (i) the analysis of
communication infrastructures and not just of different network-on-chip topologies, so
that bus-based and point-to-point solutions can also be considered; (ii) the combination of communication infrastructure exploration/generation/validation with the
partial dynamic reconfiguration paradigm, which can be seen as the main novelty introduced by the proposed framework.
19.3 Real World Applications Analysis
The aim of this section is to introduce a detailed analysis of the real-world scenarios and applications in which the proposed methodology for the design of reconfigurable communication infrastructures can be successfully applied. In order to
show how the metrics involved in the selection of the best fitting communication
infrastructure have been chosen, the layered approach of Fig. 19.1, used to perform
this real-world application analysis, is presented. First, a set of common applications has been studied; this set is called the Applications layer. Next, some
scenarios that can be contextualized in the previously identified applications have
been identified, defining the Scenarios layer. Each application is connected, using a unique arrow,
with all the meaningful scenarios that can be identified in that particular instance, so that each single case-study can be contextualized. After this linking,
for each scenario in the Scenarios layer, a set of characteristics has been
profiled that best describes the peculiarities of that scenario, identifying the Characteristics layer. This leads to the identification of the Metrics layer, where
Fig. 19.1 Real world applications analysis. Layer classification: A Applications layer, B Scenarios
layer, C Characteristics layer and D Metrics layer
the main valuable information that classifies each scenario, based on its characteristics, is listed. All these layers compose the general classification reported in
Fig. 19.1. This is a qualitative classification that does not cover all possible applications or scenarios; moreover, not all metrics have been listed. The figure shows
the scenario-centric approach stated in the Introduction (Sect. 19.1), which can lead
to a better identification of the system requirements in order to obtain a requirements-driven
reconfigurable SoC.
In the following, each identified layer is described in order to better understand each step of this approach.
19.3.1 Applications Layer
This first layer, Fig. 19.1A, groups different real applications that may exploit the
reconfiguration capabilities of FPGA devices. This first classification groups several
applications that may differ from one another, but that share the common need for
hardware reconfiguration capability; these applications range from automotive
to financial analysis. Such a classification is also available in other
works, e.g. the Xilinx Market Solutions [23]. Some of the identified applications are:
Automotive [2], Robotics [9, 11, 24], Biomedical [20], Financial Analysis [10] and
Sensor Networks [1]; these are just examples proposed to show which applications
can be considered within this analysis. Once the Applications layer has been defined, an exploration
of the possible scenarios that are meaningful for each identified application has
been made.
19.3.2 Scenarios Layer
As stated before, each scenario can be contextualized in one or more applications,
giving a real case-study to analyze. The connection between an application and a
scenario represents a real instance in which the hardware reconfiguration of the system can be applied, possibly increasing the performance of the system. A classification of this kind aims at better identifying the real scenarios in which our proposed
methodology can be applied, in order to develop a set of metrics and specific aids oriented
towards the automatic realization of the communication infrastructure that best fits
the application and scenario requirements.
In Fig. 19.1B, different situations that can be found in the embedded systems
industry are shown, such as the possibility to have more than one scenario connected to more than one application. A good example is given by the Automotive application
field and the Robotics one: some of the possible scenarios identified for the
first one, such as adaptive cruise control, sensors acquisition, image processing and
edge detection, are also connected with the robotics field. This can lead to the identification of shared scenarios that can be contextualized in different applications,
yielding different requirements and constraints to cope with. The identification of
shared scenarios can therefore lead to the definition of common rules (i.e. specific
metrics applicable to the selection of communication infrastructures) and can also increase and valorize the reusability of our methodology, highlighting its flexibility in
the arrangement and adaptation to different applications and scenarios.
The combination of a particular scenario mapped on a specific application generates a precise system case-study and leads to the identification of some interesting
peculiarities that better characterize the needs of this system. This set of characteristics defines the next layer of our classification.
19.3.3 Characteristics Layer
As previously stated, this new layer is filled with a list of possible characteristics representing each scenario contextualized in each application. Basically, a uniquely
identified set (we adopted both a colored and a numbered notation, the former for
simpler and faster comprehension and the latter to cope with black-and-white prints)
is created containing all the valuable features for each application-scenario mapping,
chosen from a pool composed of the common characteristics required by a real
embedded system.
An example is shown in Fig. 19.1C, where the sensors acquisition scenario of an
automotive application is required to be precise and immediate; these two qualitative
features mean that the values acquired through sensors, in an automotive application, must be obtained very accurately and as soon as possible, in order to cope with the
application requirements.
Another real case-study is the edge detection module of a
robot, which must accomplish its functionality under different requirements such
as high throughput, low area usage and low power consumption, in order to keep
the overall system performance high without affecting the other functional modules
that may be plugged into the robot. In this particular situation the reconfiguration
of the communication infrastructure can lead to an increase in performance,
acting on the basis of the objectives and the jobs performed by the robot, for instance in order to keep
the overall power consumption under specific thresholds while still providing
a high throughput. These qualitative characteristics are then translated into
metrics so that a requirements-driven analysis can be performed based on their
evaluation.
19.3.4 Metrics Layer
The Metrics layer is the most interesting one, as it provides as output the actual
metrics characterization of a system. In this layer a qualitative evaluation of the
different metrics, based on the requirements of the considered system, is
performed.
Figure 19.1D shows a table representing, for each evaluated metric, its
relative estimation with respect to the considered case-study (graphically represented by the colored crosses). To give a clear idea of how a metric of a particular system is evaluated, in the metrics table of Fig. 19.1D we propose a
very simple classification characterized by only three levels of estimation for each metric: high, medium and low. The evaluated metrics compose the
core of the methodology on which our previous framework [13] is based; in
fact, it performs a requirements-driven reconfiguration based on these metric values. Using this flow, the designer only has to provide as input the scenario context and
the application requirements of the system; the framework then produces as output
a classification of the best fitting communication infrastructures. Moreover, it is possible to provide other system information that can support the infrastructure selection, such as the constraints and the specific structural
or topological characteristics of the system, for instance the number of
nodes or the number of masters and slaves.
Fig. 19.2 Schema of the proposed simulation framework
19.4 The Proposed Solution
We propose a simulation framework that aims at identifying the most suitable communication infrastructure, or a set of candidates, starting from the designer's specifications of
the system.
The ideal goal of this work is to let the designer focus his/her attention on the functional characteristics of the target system rather than on the implementation details of the communication infrastructure. The system specifications can vary along with
the scenario, but the information provided should nevertheless be enough to produce a high level description of the system. The designer can provide more than one
specification, each one describing a different functionality of the same scenario or
belonging to another scenario in which the designer is interested. For instance, it is
possible to provide two different system specifications of a robotic scenario, e.g. for
the edge detection and image processing functionalities, or the same two specifications applied in two different scenarios, such as robotics and automotive.
It is possible to provide two different kinds of high level specifications. The first
one is a completely custom one, in which the designer provides all the information describing the target system. The other one is a scenario-based specification, in
which the designer provides only the main system characteristics, such as the number of system elements and their classification (distinction between master/slave
cores), the system constraints (which can be area or timing constraints) and the communication schema of all the system elements (interconnections among elements). Once
all the necessary specifications have been identified, the designer can start using
the proposed framework, which leads to the automatic identification of a single
static communication architecture that implements all the functionalities provided
with all the specifications (and that represents the best trade-off among all the identified infrastructures). If the static communication infrastructure does not
respect all the specified constraints, or the result is not a suitable solution, it is also
possible to obtain a set of differential communication infrastructures. This last
output represents the evolutionary feature of the target system, in which partial dynamic reconfiguration can lead to very good results in terms of reconfiguration
time. This concept will be explained further with the description of the flow phases.
The proposed framework consists of four phases, as shown in Fig. 19.2:
• HLD: the High Level Description represents the first phase of our framework.
This phase permits the designer to create a high level visual representation of the
system and to set all the specifications previously identified. The output provided
is one or more XML files describing the system with all its characteristics;
• HLNS: the High Level Network Simulation phase is performed with a well-known
Network Simulator [21]. Here the simulation of communication infrastructures
that differ in topology and/or parameters is performed. This simulator has been
configured to read XML files provided by the HLD phase, and in the end provides
simulation results that will be evaluated in the next phase;
• E&S: in the Evaluation and Selection phase the best fitting communication infrastructure is automatically selected considering its metrics results. Furthermore
the designer has the possibility to inspect the simulation results and then to manually choose a different solution, based on his experience;
• V&V: the Verification and Validation phase takes as input the XML system
description files of the selected communication infrastructures and performs a SystemC [16] simulation that validates the system consistency over the adopted communication infrastructure. Once this phase returns positive results, it is simple to create the actual VHDL implementation of the communication infrastructure, as described in the Verification and Validation paragraph.
Next, each phase is discussed in more detail.
19.4.1 High Level Description
The high level description phase is represented by a GUI that makes it possible
for a designer to create a high level design of the target system, giving him the
possibility to add master/slave elements and interconnections to the model and also
to set constraints (such as throughput and latency between elements).
Given the specifications, the designer starts from an empty model that must
be filled with the system elements, which can be of two different types (master
or slave), and with the interconnections among the elements; this is done simply by
connecting a master element with a slave one through a line. Finally, all the constraints provided with the specifications must be set. For instance, the specifications
can provide information about the maximum throughput supported by a connection
between two elements; the GUI currently permits setting a set of constraints, composed of throughput, area and latency (consistent with those evaluated in the
HLNS phase), for all the elements and the interconnections of the system.
After the construction of the design model, the GUI generates one or more XML
files that represent the description of the target system. More XML files will be generated if the designer
has created more than one specification. There is also the possibility
to write these XML files manually and, thanks to the common XML standard, other
programs can generate these description files. During this high level phase, the designer should perform a scenario-based classification that aims at better identifying
the set of metrics and specific aids oriented towards the automatic realization of
the communication infrastructure that best fits the target system. This classification can lead to the analysis of different situations; for instance, it is possible to have
more than one scenario that can be contextualized for a single system. An example
can be found in an application that has both Automotive and Robotics peculiarities; different functionalities must be implemented (e.g. adaptive
cruise control, sensors acquisition, image processing, edge detection or others), each
one corresponding to a different scenario or to the same one. Depending on this analysis and on the specifications, the designer should build the model of the target system,
using, if necessary, more than one XML description file.
19.4.2 High Level Network Simulation
The second phase of the framework is characterized by a well-known Network Simulator [21] that has been used to perform high level network simulations. This tool
reads the XML files generated in the previous phase and then builds different communication infrastructures based on the obtained information. Some of the most
common kinds of communication infrastructures used in the embedded systems research field have been selected for this simulation phase: point-to-point fully connected, bus-based, square-mesh, star and spidergon [5, 14, 18]. Moreover, a custom
network-on-chip, based on the composition of mesh structures with some direct connections among switches, is also created. The generation of this custom NoC has as
its primary objective the minimization of the distance (measured as number of hops)
among the master/slave elements, and as its secondary objective the maximization of the throughput (given the XML specifications it is possible to identify a number of
potential bottlenecks introduced by shared links, which can be avoided by inserting direct dedicated connections between switches). Simulation models have
been used to collect information regarding the following metrics: delivery rate, loss
rate, throughput, area usage and latency.
We adopted the common wormhole switching paradigm [6] with packet-based,
acknowledged communications among the system elements. We defined different intensity levels of packet generation for each master of the system (using
the Network Simulator primitives), so that the communication infrastructure is simulated under different traffic patterns. To better understand
the analyzed metrics, the following packet type definitions are necessary:
• completed: a completed packet is one that has arrived at its destination and whose
acknowledgment has returned to the source master;
• processed: a packet is said to be processed when it has been sent from the source master to its destination, but its acknowledgment has not yet arrived at the source master;
• dropped: a packet is classified as dropped when it cannot be handled by the
communication architecture (for instance, if the bus is busy and cannot handle
another message request).
The metrics evaluated have been defined as follows, using the packet classification
described above:
Table 19.1 Links Complexity Computation Formulae

Communication infrastructure   Total number of elements   Links complexity
PtP (fully connected)          n                          m · s
Bus-based                      n + 1                      n
Square-mesh                    n + x                      n + 2 · (x − √x)
Star                           n + 1                      n
Custom NoC                     n + x                      [n + 2 · (x − √x)] + (D − B)
• Delivery rate has been defined as the number of completed packets over the
processed ones, for each IP-core (except the switch nodes in the network-on-chip architectures).
• Loss rate represents the ratio between the dropped packets and the processed
ones, for each switch node, bus, or central node (for the star topology).
• Throughput has been defined in packets per second. In particular, a conservative formula has been used: the total number of completely routed packets (of each
IP-core) over the total simulation time (50,000 simulated seconds, performed in
about 2 to 5 actual minutes, depending on the packet generation intensity level). A computation sketch for these three rates is given after this list.
• Latency has been evaluated in two different ways: the first evaluation considers
the overall system with almost no load (the time required by the
complete transmission of a single packet has been considered), while the second
one considers a system under full load, i.e. a system with more
than one packet traveling through the communication architecture:
– with no load: the latency introduced by the point-to-point and the bus architectures is modeled by two constant values, 1 and 22 clock cycles,
respectively. The network-on-chip infrastructures consider a packing/unpacking delay of 8 cycles plus 1 cycle for each
switch;
– with full load: this second evaluation considers the ratio between the average
latency caused by one completed packet and the total amount of completed
packets.
• Area usage has, up to now, been evaluated using the formulae in Table 19.1, which compute the links complexity of the system. The links complexity corresponds to the number of links in the system, using the following conventions:
– m: number of masters in the system;
– s: number of slaves in the system;
– n: number of core elements in the system (m + s);
– x: number of switches in the system;
– D: number of direct dedicated links added to the custom NoC;
– B: number of bottleneck links removed from the custom NoC.
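Restating the definitions above in compact form (this is only a summary of the list, not an additional metric), the three rate metrics correspond to the following ratios:

delivery rate = completed packets / processed packets   (per IP-core)
loss rate = dropped packets / processed packets   (per switch node, bus, or central node)
throughput = completed packets / total simulation time   (with 50,000 simulated seconds as simulation time)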
The point-to-point links complexity has been computed considering only connections between masters and slaves, excluding inter-master
and inter-slave connections. In the bus-based architecture, the links have been approximated by the number of interconnections between each core and the bus, without considering the connections inside the bus; the same has been done for the switches. For the custom NoC, we considered both the links added to a system with a mesh infrastructure and those removed because they were considered bottleneck connections. The latency metric has been defined to give more detailed information about the trend of the communication delay across the analyzed infrastructures. Moreover, it is important to note that the simulation models are characterized by parameters, such as transmission and packing/unpacking delays, that have been calibrated against implemented solutions exploiting both the bus and the network-on-chip architectures. Furthermore, some of the simulation results have been compared with actual implementations in order to obtain a more realistic and accurate trend description.
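As an illustration of the formulae in Table 19.1, the following C++ fragment computes the links complexity of each infrastructure; it is only a sketch written for this text, and the function and variable names are not taken from the framework.

#include <cmath>

// Links complexity according to Table 19.1.
// m, s: masters and slaves; n = m + s core elements; x: switches (assumed a
// perfect square for the square-mesh); D, B: direct links added to / bottleneck
// links removed from the custom NoC.
unsigned ptp_links(unsigned m, unsigned s)  { return m * s; }
unsigned bus_links(unsigned n)              { return n; }
unsigned mesh_links(unsigned n, unsigned x) {
  return n + 2 * (x - static_cast<unsigned>(std::sqrt(static_cast<double>(x))));
}
unsigned star_links(unsigned n)             { return n; }
unsigned custom_noc_links(unsigned n, unsigned x, unsigned D, unsigned B) {
  return mesh_links(n, x) + D - B;          // mesh complexity plus added minus removed links
}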
As further analyzed in the results section, several runs of the simulation models have been performed; in a scenario in which the designer has to deal with strict constraints on latency and area usage (a very common set of constraints), we found that the custom NoC architecture represents a good trade-off among all the evaluated metrics. At the end of the high level network simulation phase, a large set of simulation results is generated for each of the analyzed communication infrastructures. The next phase takes these results as input and presents them graphically to the designer.
19.4.3 Evaluation and Selection
During the evaluation and selection phase, the best fitting communication infrastructure is automatically selected among all the evaluated architectures. As stated before, the designer can also manually choose which of the proposed solutions will be selected. This is done by evaluating the waveform graphics generated after the HLNS phase, and then by selecting the desired implementation mode (static or dynamic). The generated graphics show the trends of each evaluated communication infrastructure for each analyzed metric, as can be seen in Sect. 19.5.
In manual mode, the main objective of the designer is to pick the communication infrastructure that offers the best trade-off compared with the others, or that is most suitable for the target system. Different criteria can drive this selection, such as giving priority to area optimization over latency minimization. The designer can also return to this phase after the validation and verification one, if the output of that last phase shows that the choices made are wrong, not feasible or simply not efficient. After this evaluation, the designer can further choose between two different implementation approaches, a static one or a dynamic one, depending on the available devices (e.g. FPGAs) and the target applications. The two approaches have different characteristics:
• Static: this solution does not exploit partial dynamic reconfiguration to change the communication infrastructure at run-time. The designer can use this implementation mode to create a single infrastructure that implements all the functionalities needed by the system, i.e. an architecture obtained by composing the individual infrastructures of each functionality. The main advantage is that all the functionalities needed by the system are plugged onto the device, and there are no reconfiguration delays. The drawbacks of such a fully plugged architecture are device area usage, high power consumption and a lack of design modularity.
• Dynamic: with a dynamic approach, the framework generates different partial communication infrastructures, based on the functionalities requested by the specifications. These architectures are generated trying to minimize the difference between one infrastructure and the next one to be plugged in, thereby minimizing the reconfiguration time. This approach introduces modularity and is fully supported by partial dynamic reconfiguration paradigms [19].
19.4.4 Verification and Validation
The last phase is the validation and verification one. Here, a SystemC [16] model of the previously selected communication infrastructures is generated automatically, and consistency and feasibility checks are performed. If a communication infrastructure turns out not to be applicable, for instance because of implementation problems, the designer can always step back to the evaluation and selection phase and choose a different solution. After this phase it is possible to generate the VHDL code implementing the selected infrastructures; this is done by exploiting VHDL templates of the single system components (both computational and communication ones), which are mapped one-to-one as the communication infrastructure model requires. Some of the results obtained after the evaluation and selection phase are reported in the next section.
19.5 Results
The aim of this section is to present the simulation results obtained by analyzing five different communication architectures: point-to-point fully connected, bus-based, star, square-mesh and custom NoC. These results are essential for the evaluation and selection phase, and hence for the validation and verification one; the simulation results obtained in the high level network simulation phase are therefore the foundation on which our simulation framework is built.
Fig. 19.3 reports the trends of delivery rate, loss rate and throughput for three different traffic levels: high, medium and low.
Fig. 19.3 Metrics trends for the five analyzed architectures with three different traffic levels: high (a), medium (b) and low (c)
Fig. 19.4 Area usage estimation (defined as links complexity) for a 16-element architecture
Fig. 19.5 Latency trend w.r.t. the number of masters in mesh architectures
We performed this analysis for these three traffic levels so as to have a simple and general model from which to obtain meaningful simulation results.
With low traffic (Fig. 19.3c), more or less all the communication architectures perform in the same way. Therefore, if the developed system is characterized by very low traffic, a simple communication infrastructure (such as a bus-based one) can be chosen. On the other hand, in the charts with medium and high traffic the five architectures obtain very different results. For instance, Fig. 19.3b shows that with medium traffic the bus-based architecture has a very high loss rate compared with a mesh or a custom NoC; the latter also provides a very good delivery rate and throughput (very similar to that of a point-to-point architecture, but with a smaller area usage, as shown in Fig. 19.4).
Fig. 19.5 presents the latency trends of different square meshes; it can be seen that the latency trend changes markedly with the number of masters in the system.
Another interesting result obtained from the simulations is the latency trend of each analyzed communication infrastructure for an architecture composed of 8 masters and 8 slaves, shown in Fig. 19.6: the latency of the bus architecture increases when moving from a simulation with low traffic (Fig. 19.6b) to one with high traffic (Fig. 19.6a). The star architecture performed considerably worse than the other infrastructures at all traffic levels; this is probably due to congestion at the central node, which could not sustain the offered load.
Fig. 19.6 Latency trend on an 8 masters, 8 slaves architecture
Fig. 19.6c reports the latency trends for the system with no load, i.e. considering one single completed packet. As can be seen, the custom NoC performs better than the mesh; considering the scalability
of the network-on-chip architectures and the previous results on area usage, the custom NoC can be regarded as the best solution among the presented NoC topologies.
With the presented values, it is straightforward to evaluate a cost function that detects the best fitting communication infrastructure for a system, given its characteristics and constraints. For instance, the charts in Fig. 19.3b and Fig. 19.6 show that the custom network-on-chip proposed in this work represents the best trade-off with respect to the overall performance and to the constraints on latency and area usage. Moreover, Fig. 19.4 shows that the number of links of the point-to-point architecture explodes, while the mesh and the custom NoC require fewer resources.
Simulation results such as those just presented are used in the evaluation and selection phase to choose the right communication infrastructure and then, in the validation and verification phase, to validate the solution with a SystemC simulation. If this last simulation is found to be invalid, the designer goes back to evaluate and select a new solution.
19.6 Concluding Remarks
The aim of this work is to present an innovative simulation framework based on a scenario-centric approach that can be used to study a system design according to the applications for which it will be used. This approach leads to a requirements-driven reconfiguration of the communication infrastructure of the system, as also described in the methodology presented in previous work [13], which aims at selecting the best fitting communication infrastructure to increase the performance of a system while taking into account both its characteristics and its constraints.
The proposed simulation framework guides the designer in the automatic identification of the best fitting communication infrastructure, given the specifications of the target system. Moreover, the framework generates, as final output, an implementable solution that can exploit partial dynamic reconfiguration, depending on the system and designer requirements. This is a novel concept for a communication infrastructure exploration framework that combines high level network simulations with a SystemC verification and validation phase.
References
1. I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A survey on sensor networks.
IEEE Communications Magazine, 40(8):102–114, 2002.
2. AutomotiveDesignLine. http://automotivedesignline.com/.
3. L. Benini and G. De Micheli. Networks on chips: A new SoC paradigm. Computer, 35(1):70–
78, 2002.
4. J. Chan and S. Parameswaran. Nocgen: a template based reuse methodology for networks on
chip architecture. In Proceedings of 17th International Conference on VLSI Design, 717–720,
2004.
5. M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi, and A. Scandurra. Spidergon: a novel
on-chip communication network. In: SOC 2004: Proceedings of International Symposium on
System-on-Chip, Tampere, Finland, page 15, November 2004.
6. W.J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In
DAC ’01: Proceedings of the 38th Conference on Design Automation, pages 684–689. Assoc.
Comput. Mach., New York, 2001.
7. EAPR Xilinx. Early Access Partial Reconfiguration Guide. Xilinx Inc., San Jose, 2006.
8. G. Fen, W. Ning, and W. Qi. Simulation and performance evaluation for network on chip design using opnet. In: TENCON 2007—2007 IEEE Region 10th Conference, pages 1–4, 2007.
9. J. González Gómez, E. Aguayo, and E. Boemo. Locomotion of a modular worm-like robot
using a FPGA-based embedded microblaze soft-processor. In CLAWAR, CSIC, pages 3397–
3402, September 2004.
10. HighPerformanceComputingArchitectures. http://www.hpcwire.com/hpc/1578042.html.
11. H. Jung, M. Tambe, and S. Kulkarni. A dynamic distributed constraint satisfaction approach to
resource allocation. In Principles and Practice of Constraint Programming, pages 324–331,
2001.
12. H.G. Lee, N. Chang, U.Y. Ogras, and R. Marculescu. On-chip communication architecture
exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches.
ACM Transactions on Design Automation of Electronic Systems, 12(3):1–20, 2007.
13. A. Meroni, V. Rana, M. Santambrogio, and D. Sciuto. A requirements-driven reconfigurable
SoC communication infrastructure design flow. In 4th IEEE International Symposium on Electronic Design, Test & Applications, DELTA08, 2008.
14. M. Moadeli, A. Shahrabi, W. Vanderbauwhede, and M. Ould-Khaoua. An analytical performance model for the spidergon NoC. In 21st International Conference on Advanced Information Networking and Applications, AINA ’07, pages 1014–1021, 21–23 May 2007.
15. F. Moraes. Hermes: an infrastructure for low area overhead packet-switching networks on
chip. 2004.
16. OSCI. SystemC documentation (last checked March 2008). http://www.systemc.org. Open SystemC Initiative (OSCI), 2007.
17. L. Ost, A. Mello, J. Palma, F. Moraes, and N. Calazans. Maia—a framework for networks
on chip generation and verification. In Proceedings of the Design Automation Conference,
ASP-DAC 2005, volume 1, pages 49–52, 18–21 January 2005.
18. P.P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance evaluation and design
trade-offs for network-on-chip interconnect architectures. IEEE Transactions on Computers,
54(8):1025–1040, 2005.
19. V. Rana, M.D. Santambrogio, and D. Sciuto. Dynamic reconfigurability in embedded system
design. In IEEE International Symposium on Circuits and Systems, pages 2734–2737, May
2007.
20. F. Su and K. Chakrabarty. Yield enhancement of reconfigurable microfluidics-based biochips
using interstitial redundancy. ACM Journal on Emerging Technologies in Computing Systems,
2(2):104–128, 2006.
21. A. Vargas. Omnet++. http://www.omnetpp.org, 2007.
22. P.T. Wolkotte, P.K.F. Holzenspies, and G.J.M. Smit. Fast, accurate and detailed NoC simulations. In First International Symposium on Networks-on-Chip, NOCS 2007, pages 323–332,
7–9 May 2007.
23. Xilinx. Xilinx market solutions. http://www.xilinx.com/esp/.
24. M. Yokoo, E.H. Durfee, T. Ishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In International Conference on Distributed Computing
Systems, pages 614–621, 1992.
Chapter 20
Analysis of Non-functional Properties of MPSoC
Designs
Alexander Viehl, Björn Sander,
Oliver Bringmann and Wolfgang Rosenstiel
Abstract In this chapter, a novel design and analysis methodology for the simulation-based determination of non-functional properties of a system design, such as performance, power consumption, and temperature, is proposed. For simulation acceleration and for handling complexity issues, the design flow includes automated abstraction of component functionality. Specified platform attributes, such as dynamic power management, and formally modeled temporal input stimuli are automatically transformed into non-functional SystemC models. The framework provides automated online and offline analysis of non-functional System-on-Chip properties.
Keywords Non-functional properties · Electronic system level · Performance
analysis · Power estimation · Temperature estimation · SystemC
20.1 Introduction
The ongoing shrinking of semiconductor technology allows the production of chips consisting of millions of transistors and integrating dozens of highly complex components such as microprocessors and DSP units. This progress boosts product value by enabling the incorporation of an increasing number of utility functions [9].
Whereas functional requirements can be checked at early design stages, the validation of non-functional requirements is more complicated and is not sufficiently supported by current approaches. This may lead to violations of non-functional requirements (e.g. on stand-by energy consumption) and, finally, to costly design iterations or to unsaleable products.
Besides complexity, shrinking feature sizes pose additional challenges to non-functional property (NFP) verification: properties such as on-chip temperatures, as well as the impact of the system environment, which were ignored or only considered in rare application areas in the past, are coming more and more into focus during system design because of nanoelectronic effects. Temperatures in particular can have a major impact on reliability and run-time errors of the system [2, 16, 28].
This work was partially supported by the BMBF project AIS under grant 01M3083G and the DFG project ASOC under grant RO 1030/14-1 and 2.
A. Viehl, FZI Forschungszentrum Informatik, Haid-und-Neu-Str. 10-14, 76131 Karlsruhe, Germany; e-mail: viehl@fzi.de
Fig. 20.1 NFP dependencies
In this chapter, an approach for early system level evaluation of NFPs is presented. The methodology uses causal dependencies between the non-functional system properties performance, power and temperature. These dependencies are depicted in Fig. 20.1. The NFP analysis process is based on the determination of
execution time profiles of the system functionality on platform components. The
proposed methodology for automated mapping is presented in Sect. 20.5. Based on
these derived timing properties of single components, global activities within the
entire system resulting from component communication and interaction are determined. This can be used for analyzing the performance of the entire system. The
automated creation of abstract simulation models allowing global analysis of activities is introduced in Sect. 20.6. Furthermore, the incorporation of formally specified and constrained temporal environment models in simulation models is briefly
explained. Additionally, if an activity-based description of the power characteristics of each component is given, the energy consumption over time can be determined. Moreover, if an activity-state-based power management policy is applied, its potentially negative impact on performance can be measured. The inclusion of dynamic power management during abstract system simulation is also described in
Sect. 20.6.
If information on geometry and heat capacities of the system components is
given, the local temperature distribution over time can be determined. Necessary
parameters and their incorporation in the design flow are described. To determine requirement violations, an on-line property verification approach is proposed
by creating and integrating assertions that check non-functional system properties.
The presented approach further allows the model-based exploration of a design by modifying parameters like mapping, geometries or input stimuli at specification model level. The developed design flow was tested using different examples. Section 20.7 illustrates experimental results from NFP analysis of a JPEG decoder.
20.2 Related Work
Approaches for determining global performance characteristics can be distinguished by whether they are simulation-based or analytic, and by whether they consider the real execution semantics of the functionalities on the platform components.
Purely analytic black-box approaches [6, 21, 26], which are based on best-case and worst-case temporal properties of explicitly modeled systems, are not considered further here because they do not capture the real execution semantics on the platform. Furthermore, they do not incorporate the synchronization impact of blocking communication instances or of complex communication protocols.
Related work on performance analysis must further be evaluated with respect to its capability of going beyond worst-case/best-case assumptions. Different models incorporate stochastic extensions such as generalized stochastic Petri nets (GSPN) [15], Stochastic Automata Networks (SAN) [14], stochastic task graphs [13], or stochastic automata [5]. Although these models are able to overcome the pessimism of best-case and worst-case analysis methods, no integrated design flow for deriving the analysis model from real implementations is given. Furthermore, the same methodologies (e.g. [14, 22]), with the same restrictions, are applied for determining the system power characteristics.
Transaction Level Modeling (TLM) [7] uses models with components on different levels of abstraction for speeding up simulation with a potential loss of accuracy [23, 27]. Simulating a complete system of parallel processes consumes a lot of
computational resources. Moreover, synchronization overhead for coupling multiple simulators is an issue. The incorporation of power-models [3, 4] in functional
simulation of an entire system has to deal with growing simulation times as well.
The approach presented in this chapter measures the execution time characteristics of functional components on platform elements and uses the resulting probabilistic model, which does not contain functional aspects of the system. As a result, simulation performance is increased and parameter exploration can be performed at model level.
In [19], a method for the temperature estimation of low-power Multiprocessor Systems-on-Chip (MPSoCs) was presented. To determine the activity of memory and computing elements, a cycle-accurate simulation platform based on the MPARM environment [11] was used. For temperature estimation, a model was applied that, similar to the HotSpot tool [8], employs the well-known thermal-electrical duality [17]. To be able to adjust the granularity of the estimation, the silicon die and the heat spreader can be decomposed into elementary cubic cells of arbitrary size.
Methods based on instruction set simulators allow determining the activity of the
system parts on a fine-grained level, but the simulation speed is potentially insufficient.
20.3 Preliminaries
20.3.1 Activity Model
To represent the temporal behavior of a system, a model called communication dependency graph (CDG) is used. It was originally developed for formal synchronization and performance analysis by determining communication instances that synchronize the control flow of communication partners [29]. This model is used for the system-level representation of the timing properties of a design and considers probabilistic temporal properties of communication and computation [20, 30].
A communication dependency graph denotes a consolidated representation of a system consisting of communicating processes. Only the temporal and causal behavior between communication endpoints is considered in each communicating process. The control flow is represented by edges connecting communication endpoints. An edge exists if a path in the control-flow graph connects the communication endpoints without passing any other communication endpoint. The communication endpoints are characterized by their synchronization behavior and by whether they represent sending or receiving events. A CDG example consisting of three processes is depicted in Fig. 20.2. The processes communicate using asynchronous communication instances with non-blocking sender and blocking receiver nodes. Annotated edge weights represent the best-case and worst-case (CET) execution times of all basic block paths between subsequent communication endpoints. Dotted edges represent communication instances, and blocking communication endpoints are encircled by two concentric circles.
Fig. 20.2 Communication
dependency graph
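To make the structure of this model concrete, the following C++ sketch shows one possible in-memory representation of a CDG process; the type and field names are invented here for illustration and are not taken from the tools described later.

#include <vector>

enum class EndpointKind { Send, Receive };

struct Endpoint {                        // a communication endpoint (CDG node)
  EndpointKind kind;
  bool         blocking;                 // e.g. blocking receiver, non-blocking sender
  int          channel_id;               // dotted edge: pairs sender and receiver nodes
};

struct ControlFlowEdge {                 // control flow between subsequent endpoints
  int    from, to;                       // indices into the endpoint list of the process
  double bcet_ns, wcet_ns;               // annotated best-case / worst-case execution times
};

struct CdgProcess {
  std::vector<Endpoint>        endpoints;
  std::vector<ControlFlowEdge> edges;
};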
20 Analysis of Non-functional Properties of MPSoC Designs
313
Fig. 20.3 PSM of
StrongARM SA-1100
20.3.2 Power Management Model
In order to estimate the temperatures of semiconductor devices during operation
and the lifetime of the battery, the power consumptions of the system components
have to be known. For their determination, an appropriate component modeling is
necessary.
Power State Machines (PSMs) [4] are often used to characterize the behavior
of the dynamic power management of the platform components. The model is flexible enough to describe arbitrary power management schemes. Nodes of a PSM express different operational states such as IDLE or RUN. Edges between nodes characterize state transitions triggered by events. Both nodes and edges can be annotated with timing and power consumption expressions. This allows the description of
different dynamic power management policies (which differ e.g. with respect to
wake-up penalties).
The PSM of a StrongARM SA-1100 is depicted in Fig. 20.3 [3]. It consists of
three different power states. State transitions are triggered by inactivity or activation
of resources due to external events.
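A compact way to capture such a PSM in code is sketched below; the state names follow Fig. 20.3, while the numeric power values and transition delays are purely illustrative placeholders and are not figures taken from this chapter.

#include <vector>

enum class PowerState { RUN, IDLE, SLEEP };

struct Transition {
  PowerState  from, to;
  const char* trigger;                   // event causing the transition
  double      delay_us;                  // transition (e.g. wake-up) penalty
};

struct Psm {
  double power_mw[3];                    // power consumption annotated per state
  std::vector<Transition> transitions;
};

// Illustrative instantiation in the spirit of the StrongARM SA-1100 PSM.
inline Psm make_sa1100_like_psm() {
  Psm p;
  p.power_mw[static_cast<int>(PowerState::RUN)]   = 400.0;  // placeholder value
  p.power_mw[static_cast<int>(PowerState::IDLE)]  =  50.0;  // placeholder value
  p.power_mw[static_cast<int>(PowerState::SLEEP)] =   0.2;  // placeholder value
  p.transitions = {
    { PowerState::RUN,   PowerState::IDLE,  "inactivity",        10.0 },
    { PowerState::IDLE,  PowerState::RUN,   "external event",    10.0 },
    { PowerState::IDLE,  PowerState::SLEEP, "long inactivity",   90.0 },
    { PowerState::SLEEP, PowerState::RUN,   "external event", 160000.0 }  // 160 ms wake-up, as mentioned in Sect. 20.7
  };
  return p;
}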
20.4 Design Flow
The general design and analysis flow for NFP requirement evaluation presented in
this chapter is depicted in Fig. 20.4. The two entry points are on the one hand a
system specification and on the other hand the functional and still untimed implementation of a system in C/C++ or SystemC. The system specification used in this
approach consists of a SysML model describing the requirements of the system by requirement diagrams. Furthermore, environmental aspects of the system (e.g. the data rates the system has to be able to handle, or the environment temperatures under which the system needs to work correctly) are also modeled as requirement diagrams, because UML/SysML lacks support for environmental descriptions. The different environmental aspects are transformed into property-specific (performance, power, temperature) XML formats. These XML files are
used as input for code generation (temporal environment models, Sect. 20.6) and for
parametrization of code templates (battery model, thermal environment).
Fig. 20.4 Integrated design flow
For determining the temporal behavior of computational components of the system, execution time profiling as described in Sect. 20.5 is applied. The result is a
non-functional model of the system (i.e. CDG) that represents a probabilistic quantification of temporal aspects of computation as well as an abstract representation of
the control-flow of each component.
Using the information on temporal aspects of computation as well as on the environment and system requirements, the generation of timed, non-functional simulation models and assertions is performed (Sect. 20.6).
Based on the automatically generated SystemC/SCV simulation models, a simulation based verification of non-functional properties takes place. The included
assertions report requirement violations during simulation. The generation of VCD
traces during simulation provides an interface for offline analysis of non-functional
properties. Furthermore, an offline conflict evaluation concerning NFPs can be made based on these traces. This means that a probabilistic quantification of requirement violations (i.e. how often a requirement is violated), as well as a characterization of the typical behavior of a design (e.g. temperature curves), can be generated to give the designer an initial overview of the system properties.
20.5 Abstraction of System Functionality
In this section, we briefly explain the mapping and abstraction of system functionality for timing determination. The starting point is a functional SystemC [18] model
of a design, a platform description, mapping information, and the environment of
the system. Although the focus is on mapping of software processes onto microcontrollers in this chapter, the approach is applicable for a hardware realization of
20 Analysis of Non-functional Properties of MPSoC Designs
315
the functionality as well. The SystemC model is simulated together with a specified environment. The environment defines the interaction of the world with the
model (e.g. audio packets for an MP3 decoder) and the period, with jitter, between the packets. Introspection is used to gain access to the communication data sent between SystemC processes. As a result, the temporally ordered communication calls of each process and the data sent through each communication channel are determined. The ordered communication calls are used for elaborating a consolidated
control-flow as it is represented by CDG processes. The communication data are
used as input for determining the execution time of code fragments on the target
platform with regard to the data computed by the design originating from the environment. For performing execution time profiling, each process from the original SystemC model has to be mapped to the target platform. Therefore, the tool
PORTOS [10] is used to generate a set of C/C++ files, each including the functionality of one SystemC process. The access to SystemC ports is mapped to special read(...) and write(...) calls. The code is instrumented by linking
these calls against communication stubs containing special system calls and assembler routines. After compilation, each single process is executed on an Instruction
Set Simulator (ISS). Therefore, the SimpleScalar [1] ISS was extended to handle
the special syscalls contained in the communication stubs. When a communication
function is called, the special syscall is recognized by the extension to the ISS.
Two timestamps are taken that enclose the computation time of the communication routine. Using the first timestamp of a communication instance and the last
timestamp of the previous communication instance, the computation time between
these two subsequent communication instances of one process is calculated. In the
next step, the collected execution times are combined with the elaborated control flow into the activity model, the CDG.
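The derivation of an edge execution time from these timestamps can be sketched as follows; the structure and function names are hypothetical and only restate the computation described above.

#include <cstdint>

// Timestamps taken by the instrumented communication stubs on the ISS.
struct CommInstance {
  std::uint64_t ts_first;   // timestamp at entry into the communication routine
  std::uint64_t ts_last;    // timestamp at return from the communication routine
};

// Computation time of the CDG edge between two subsequent communication
// instances of one process: from the end of the previous communication to
// the start of the current one.
std::uint64_t edge_execution_time(const CommInstance& prev, const CommInstance& curr) {
  return curr.ts_first - prev.ts_last;
}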
20.6 Simulation Model Generation
Because of the structured representation, the ability to consider aspects of hardware and software, to model at different abstraction levels and the availability of a
free simulator, SystemC [18] is used as the simulation language. On this basis we present a flexible and fast framework for the high level, simulation-based NFP analysis of a system design.
20.6.1 Communication Dependency Graphs
The architectural model provides the structure of the generated SystemC model and
the information about the mapping of processes to modules and of communication
to shared or dedicated media. Furthermore, communication latencies are extracted
from the architectural information. The internal control flow of the processes is
adopted from the CDG. An edge in a CDG process is transformed to a wait(sc_time)
Fig. 20.5 Structure of delay channel
statement in SystemC. This means that the execution of a SystemC process waits for the given time before it continues to send or to wait for subsequent communication instances. The SystemC Verification library (SCV) provides methods for weighted and constrained randomization. The weighted randomization of the probabilistic execution time profiles of CDG edges is realized using scv_bag (a sketch of this mechanism is given at the end of this subsection). From the functional point of view, execution times and communication latencies are selected locally within the given intervals during simulation; the previously performed execution time profiling provides the probabilistic time distributions used by simulation model generation. Globally, decisions about the number of loop executions involving multiple processes have to be made, and branches are selected globally using the relations annotated at the edges. This way, data interdependencies between the parallel processes can be preserved. For monitoring the behavior of the system, traceable
signals (VCD) are inserted for the arrival and departure of the control flow at communication nodes. The events of sending and receiving delayed communication instances are traced as well. Asynchronous communication behavior is enabled by
buffering data to prevent data loss. For this purpose, a delay channel was implemented that allows delayed message transfer and buffered communication with parametrizable bounded buffer sizes. A brief overview of the structure of the delay channel is shown in Fig. 20.5. The channel provides TLM access methods for write and read access. Furthermore, parametrizable VCD signals that allow offline analysis of writes to the channel buffers, reads from the buffers, and the buffer sizes are
generated. The delay() method provides a simple way to delay the transmission of
communication data across the channel.
Furthermore, the generated TLM interfaces can be used to exchange the communication infrastructure in order to incorporate the impact of communication on global timing. An accurate timing of computation has already been included using the methodology presented in Sect. 20.5.
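The following fragment sketches how a CDG edge delay might be drawn from such a weighted execution time profile and turned into a wait(sc_time) call. It is only an illustration written for this text: it assumes the standard scv_bag/scv_smart_ptr interface of the SCV library, and the delay values and weights are hypothetical.

#include <systemc.h>
#include <scv.h>

SC_MODULE(cdg_edge_demo) {
  scv_bag<unsigned>       et_profile;   // execution times in ns, with weights
  scv_smart_ptr<unsigned> et;           // randomized execution time

  SC_CTOR(cdg_edge_demo) {
    et_profile.push(120, 70);           // 120 ns with weight 70 (hypothetical profile)
    et_profile.push(250, 25);
    et_profile.push(900,  5);
    et->set_mode(et_profile);           // weighted randomization over the profile
    SC_THREAD(run);
  }

  void run() {
    while (true) {
      et->next();                               // draw a weighted sample
      wait(sc_time(et->read(), SC_NS));         // CDG edge -> wait(sc_time)
      // ... send or wait for the next communication instance here ...
    }
  }
};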
20.6.2 Temporal Environment Models
Embedded systems are electronic systems that are embedded in an interacting environment. The environment can send input events to the system, which, depending on its characteristics, reacts or acts on these input stimuli. In this model, the environment generates communication instances that are sent to the system.
class event_gen_constr_burst : public scv_constraint_base {
public:
  scv_smart_ptr<uint> a[5];
  scv_smart_ptr<uint> t, T, J_div_2, max;
  SCV_CONSTRAINT_CTOR(event_gen_constr_burst) {
    uint b = 5;
    *t = 4;        t->disable_randomization();
    *T = 100;      T->disable_randomization();
    *J_div_2 = 5;  J_div_2->disable_randomization();
    *max = *J_div_2 + *T;
    max->disable_randomization();
    for (uint n = 1; n <= b; n++) {
      if (n == 1) SCV_CONSTRAINT(a[0]() >= 0);
      else        SCV_CONSTRAINT(a[n - 1]() >= t());
    }
    SCV_CONSTRAINT(a[0]() + a[1]() + a[2]() + a[3]() + a[4]() < max());
  }
};
Listing 20.1 Burst model constraints using SystemC
The environment has a major impact on NFPs; for example, the input data rate has an impact on component utilization, and the behavior of dynamic power management is triggered by the inter-arrival times of input events. It is therefore important to consider formally specified temporal system environments during NFP analysis.
The atomic event models initially proposed in [25] were extended by event model streams [12]. The difference lies in the composition of an event stream model from different atomic event models and in complex patterns within these models. Five different atomic event models are described: periodic events, periodic events with jitter, sporadic events, burst events with jitter, and delay events. The most complex event model is the burst event model, which creates events in bursts. The parameters that describe the burst event model are the outer period T, the burst length b, the minimal inter-arrival time t and the jitter of the outer period J. The parameter b denotes the number of events in a period, the minimum temporal distance between two events is described by t, and the jitter describes the variation of the outer period.
In [12], the following formulas are given for the arrival time Z_n of the n-th event a_n and for the arrival of all events within the outer period; in particular, the overall sum of all inter-event arrival times is less than or equal to the period T with jitter J:
(n − 1) · t ≤ Z_n ≤ T + J/2 − (b − n) · t,   n ∈ {1, . . . , b}
and
Σ_{i=1}^{n−1} (a_{i+1} − a_i) ≤ T ± J
To guarantee the semantically correct creation of these formally specified temporal environment models, the ability to specify constraints in the SystemC Verification library (SCV) is used. Listing 20.1 shows the formulation of the constraints for the burst model in SystemC. The example describes a burst event model with an outer period T = 100, a minimal inter-arrival time t = 4, a jitter of the outer period J = 10 and b = 5 events per burst. The declaration of all parameters as scv_smart_ptr allows further exploration of constrained parameter variations.
Fig. 20.6 Simulation code generation
20.6.3 Integration of Power Consumption and Power Management
The generation of the SystemC simulation code incorporating power management
strategies is schematically shown in Fig. 20.6.
As inputs, besides the CDG of the considered application, power state machines, which are used for power consumption modeling, and an architecture file, which contains, among other things, the mapping between the power state machines and the system components, are utilized.
For each process in the CDG, a module with two processes, a functional and a PSM process, is generated. Every edge that connects the process under consideration with another process leads to the declaration of a port inside the module and to the generation of a channel that connects the corresponding modules. One additional port and one additional channel per module are necessary to realize the communication with the PowerFileGenerator module. This module collects the observed
power values during simulation and stores them in a database.
A functional process essentially covers the information contained in the associated CDG process. As the nodes are represented by read() and write() method calls
respectively, the determined edge latencies correspond to wait() statements. By the
execution of these statements, the corresponding processes are suspended during
simulation as described in Sect. 20.6.
The major task of a PSM process consists in modeling the associated power state
machine and the transfer of the power consumptions received during simulation
to the PowerFileGenerator module. The PowerFileGenerator in turn records the power consumptions and saves them to a file so that they can be used by HotSpot to calculate the temperatures.
Fig. 20.7 Communication between a PSM and a functional process
The power state machine of a component leaves the operational mode if and only if the associated functional process becomes inactive. This may happen when the process reads a module port that blocks because the required data is not available immediately. When the idle time lasts long enough, the PSM transitions into states with lower power consumption.
To emulate this behavior in SystemC, three events, idle, request2run and run
are introduced for every module which contains a PSM process. In Fig. 20.7 the
communication between the PSM and the functional process for the node R2 of
Fig. 20.6 is schematically shown.
The functional process uses the idle event to inform the PSM process that an idle period has begun. The request2run event is used by the functional process to indicate that work can be continued because the required data is now available, and the PSM process uses the run event to grant permission to resume.
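A minimal SystemC sketch of this three-event handshake is given below; the event names follow the text, while the module structure, the delays and the way data arrival is modeled are simplifying assumptions made only for illustration.

#include <systemc.h>

SC_MODULE(mapped_component) {
  sc_event idle, request2run, run;               // handshake events from the text

  SC_CTOR(mapped_component) {
    SC_THREAD(functional_process);
    SC_THREAD(psm_process);
  }

  void functional_process() {
    while (true) {
      idle.notify(SC_ZERO_TIME);                 // blocking read: component becomes idle
      wait(sc_time(500, SC_US));                 // stand-in for "required data arrives later"
      request2run.notify(SC_ZERO_TIME);          // data available, ask to resume
      wait(run);                                 // PSM grants the RUN state
      wait(sc_time(120, SC_US));                 // CDG edge latency (computation)
    }
  }

  void psm_process() {
    while (true) {
      wait(idle);                                // functional process became inactive
      // here the PSM would enter IDLE and, after long inactivity, SLEEP
      wait(request2run);
      wait(sc_time(160, SC_MS));                 // wake-up penalty (illustrative)
      run.notify(SC_ZERO_TIME);                  // allow the functional process to run
    }
  }
};

int sc_main(int, char*[]) {
  mapped_component top("top");
  sc_start(sc_time(1, SC_SEC));
  return 0;
}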
20.6.4 Battery Models, Placement and Chip Environment
Since portable, battery-powered devices are becoming widespread, the incorporation of battery lifetime into simulation is becoming important. We therefore offer the opportunity to investigate the estimated battery lifetime based on the battery capacity and the energy consumed by the platform during execution of the target application.
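Under this simple model the estimate reduces to a ratio of capacity to average consumption; this is only a restatement of the computation implied above, with symbols introduced here for illustration:

estimated battery lifetime ≈ E_battery / P_avg

where E_battery is the usable battery capacity (e.g. in joules) and P_avg is the average power drawn by the platform while executing the target application.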
So far, we do not cover complex phenomena that govern battery behavior, such as rate dependence, temperature effects and capacity fading. However, it is possible to integrate more powerful battery models, like those presented in [24], into our framework.
For temperature estimation, the HotSpot tool is applied. To properly forecast the temperatures during system operation, HotSpot needs information about the power consumed, the placement of the chip components, and the environment, such as the employed packaging and cooling. While the first input is determined during simulation, the latter two have to be specified by the designer. For this reason, the proposed framework provides a simple interface by extending SysML composite diagrams for annotating the necessary geometric and physical properties, so that the needed information can easily be fed in. Another possibility would
have been the application of the UML MARTE profile, since it already contains annotation mechanisms for layout information. However, the information supplied by MARTE was not sufficient to describe everything needed, so an extension would have been necessary as well. Since SysML is already used for requirement specification, the extensions were made to SysML. A subsequent automatic data transformation enables the immediate use of HotSpot.
20.7 Experimental Results
To apply the proposed methodology to a synthetic yet representative example, it was
used for the analysis of an MPSoC consisting of four PowerPC 604 cores executing
a JPEG decoder application. Experimental results are presented in this section. All
analysis steps were implemented and integrated into the SysXplorer tool.
The starting point was a SystemC model of a JPEG decoder. According to the flow presented in Sect. 20.5, each of its four processes implementing the utility functionality was mapped to a PowerPC 604 running at 100 MHz. The resulting elaborated CDG, representing the communication structure of the communicating processes including execution time information, is depicted in Fig. 20.8.
The first observed property is buffer utilization. For this purpose, buffer utilization numbers were captured during simulation. The buffer utilization of the communication channel (S8, R8) between iquant and idct is depicted in Fig. 20.9. Due to the bursty output behavior of iquant, the buffer is quickly filled with packets. The maximal observed buffer utilization is 63 packets. Since each packet has a size of 64 bytes, this corresponds to roughly 4 KB (63 · 64 = 4032 bytes), and buffer dimensioning of the target system can thus be based on the simulation of the abstract prototype.
Fig. 20.8 JPEG decoder CDG
Fig. 20.9 Buffer utilization of channel (S8 , R8 )
Fig. 20.10 JPEG decoder
placement and power
consumptions
Next, the placement/process-platform mapping and the power state machines were defined. As placement, the configuration shown in Fig. 20.10 was assumed. Because the power management policy of the PowerPC 604 was not known, the strategy of the StrongARM SA-1100 was adopted (see Fig. 20.3), but the power consumptions were adapted (see also Fig. 20.10).
Fig. 20.11 shows the temperatures calculated by the HotSpot tool, based on the power consumptions obtained through execution of the generated SystemC platform model, for the time interval from 0 s to 0.95 s. After initialization, all cores change into the SLEEP state. Core 1, which executes process irleh, returns to RUN after approximately 160 ms. Because the data produced by irleh is to be consumed by izigzag, core 2 is triggered and awakes, after a wake-up phase of 160 ms (see Fig. 20.3), at about 320 ms. Similar behavior can be observed for the two remaining cores.
Altogether, it can be recognized that the temperatures of cores 0 and 1 are noticeably higher than those of the two other cores. With the approach presented above, it is possible to check whether or not the system under investigation, in which the hottest components share a long border, nevertheless fulfills its temperature requirements. The temperature can be considered one instance of a non-functional property.
Fig. 20.11 Temperature curves of JPEG decoder
The outcome of applying the proposed assertion-based power verification is shown graphically in Fig. 20.12. It was assumed that the power consumption of the entire system should not exceed 2.5 W averaged over 0.1 ms. It can be seen that this requirement is violated twice during the depicted time interval.
Fig. 20.12 Assertion-based power verification
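As an illustration of how such an assertion could be realized during simulation, the following SystemC sketch checks the averaged power requirement online; the module, port and sampling scheme are assumptions made for this text and do not reproduce the actual implementation.

#include <systemc.h>

SC_MODULE(power_assertion) {
  sc_in<double> power_in;                        // instantaneous power in W (bound to the traced power value)

  SC_CTOR(power_assertion) { SC_THREAD(check); }

  void check() {
    const sc_time step(1, SC_US);                // sampling period
    const int samples_per_window = 100;          // 100 x 1 us = 0.1 ms averaging window
    while (true) {
      double sum = 0.0;
      for (int i = 0; i < samples_per_window; ++i) {
        wait(step);
        sum += power_in.read();
      }
      if (sum / samples_per_window > 2.5)        // requirement: <= 2.5 W averaged over 0.1 ms
        SC_REPORT_WARNING("NFP", "average power requirement (2.5 W over 0.1 ms) violated");
    }
  }
};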
20.8 Conclusions
The verification of non-functional system requirements plays an increasingly important role in the design of current and future embedded systems. In this chapter, a methodology and an integrated design flow, implemented as a toolset, for the determination of NFPs and for the evaluation of specified non-functional requirements at system level was presented. The evaluation process is based on dependencies between non-functional system properties. The creation of simulatable SystemC models that represent the different system aspects, such as temporal behavior and dynamic power management, exploits these interdependencies. The
creation process further incorporates environmental specifications and system requirements that were automatically derived from specification documents. The applicability of the methodology has been demonstrated by experimental results starting from a functional SystemC model of a JPEG decoder.
References
1. T. Austin, E. Larson, and D. Ernst. Simplescalar: an infrastructure for computer system modeling. Computer, 35(2):59–67, 2002.
2. R.I. Bahar, D. Hammerstrom, J. Harlow, W.H. Joiner Jr., C. Lau, D. Marculescu, A. Orailoglu,
and M. Pedram. Architectures for silicon nanoelectronics and beyond. Computer, 40(1):25–
33, 2007.
3. L. Benini, A. Bogliolo, and G.D. Micheli. A survey of design techniques for system-level
dynamic power management. In Readings in Hardware/Software Co-design, pages 231–248,
2002.
4. L. Benini, R. Hodgson, and P. Siegel. System-level power estimation and optimization. In
ISLPED ’98: Proceedings of the 1998 International Symposium on Low Power Electronics
and Design, pages 173–178. Assoc. Comput. Mach., New York, 1998.
5. J. Bryans, H. Bowman, and J. Derrick. Model checking stochastic automata. ACM Trans.
Comput. Logic, 4(4):452–492, 2003.
6. S. Chakraborty, S. Künzli, and L. Thiele. A general framework for analysing system properties
in platform-based embedded system designs. In Proceedings of DATE, Munich, 2003.
7. A. Donlin. Transaction level modeling: flows and use models. In CODES+ISSS ’04. Assoc.
Comput. Mach., New York, 2004.
8. W. Huang, K. Sankaranarayanan, R. Ribando, M. Stan, and K. Skadron. An improved blockbased thermal model in HotSpot 4.0 with granularity considerations. Technical report, University of Virginia, Dept. of Computer Science, April 2007.
9. International Technology Roadmap for Semiconductors. 2007.
10. M. Krause, O. Bringmann, and W. Rosenstiel. Target software generation: an approach for
automatic mapping of SystemC specifications onto real-time operating systems. Des. Autom.
Embed. Syst., 10(4):229–251, 2007.
11. M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon. Analyzing on-chip communication in a MPSoC environment. In DATE ’04: Proceedings of the Conference on Design,
Automation and Test in Europe, 2004.
12. A. Löffler. Modeling and Transformation of Temporal Environment Models. Study thesis, University of Karlsruhe, 2008.
13. S. Manolache, P. Eles, and Z. Peng. Schedulability analysis of applications with stochastic
task execution times. ACM Trans. Embed. Comput. Syst., 3(4):706–735, 2004.
14. R. Marculescu and A. Nandi. Probabilistic application modeling for system-level performance
analysis. In DATE ’01: Proceedings of the Conference on Design, Automation and Test in
Europe, pages 572–579. IEEE Press, Piscataway, 2001.
15. M.A. Marsan, G. Conte, and G. Balbo. A class of generalized stochastic petri nets for the
performance evaluation of multiprocessor systems. ACM Trans. Comput. Syst., 2(2):93–122,
1984.
16. J.W. McPherson. Reliability challenges for 45 nm and beyond. In DAC, 2006.
17. M.J. Moran, H.N. Shapiro, B.R. Munson, and D.P. DeWitt. Introduction to Thermal Systems
Engineering: Thermodynamics, Fluid Mechanics, and Heat Transfer. Wiley, New York, 2002.
18. W. Müller, W. Rosenstiel, and J. Ruf, editors. SystemC: Methodologies and Applications.
Kluwer Academic, Norwell, 2003.
19. G. Paci, P. Marchal, F. Poletti, and L. Benini. Exploring temperature-aware design in lowpower MPSoCs. In DATE ’06: Proceedings of the Conference on Design, Automation and
Test in Europe, 2006.
20. A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel. Probabilistic performance risk analysis at system-level. In CODES+ISSS ’07, 2007.
21. P. Pop, P. Eles, Z. Peng, and T. Pop. Analysis and optimization of distributed real-time embedded systems. In DAC ’04: Proceedings of the 41st Annual Conference on Design Automation,
pages 593–625. Assoc. Comput. Mach., New York, 2004.
22. Q. Qiu, Q. Wu, and M. Pedram. Dynamic power management of complex systems using generalized stochastic petri nets. In DAC ’00: Proceedings of the 37th Conference on Design
Automation, pages 352–356. Assoc. Comput. Mach., New York, 2000.
23. M. Radetzki and R.S. Khaligh. Accuracy-adaptive simulation of transaction level models. In
Proceedings of DATE, Munich, 2008.
24. R. Rao, S. Vrudhula, and D.N. Rakhmatov. Battery modeling for energy-aware system design.
Computer, 36(12):77–87, 2003.
25. K. Richter, D. Ziegenbein, M. Jersak, and R. Ernst. Model composition for scheduling analysis
in platform design. In Proceedings 39th Design Automation Conference DAC, 2002.
26. S. Schliecker, M. Ivers, and R. Ernst. Integrated analysis of communicating tasks in MPSoCs.
In CODES+ISSS ’06. Assoc. Comput. Mach., New York, 2006.
27. J. Schnerr, O. Bringmann, A. Viehl, and W. Rosenstiel. High-performance timing simulation
of embedded software. In DAC ’08: Proceedings of the 45th Annual Conference on Design
Automation, pages 290–295. Assoc. Comput. Mach., New York, 2008.
28. L. Shang and R.P. Dick. Thermal crisis: challenges and potential solutions. In IEEE Potentials,
2006.
29. A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel. Control-flow aware communication and conflict analysis of parallel processes. In Proceedings of the 12th Asia and South
Pacific Design Automation Conference ASP-DAC 2007, Yokohama, Japan, 2007.
30. A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel. A hybrid approach for system-level
design evaluation. In IESS, pages 165–178, 2007.