Editor's Note: Because of some of the questions I've received about
  Synopsys's unannounced "Protocol Compiler" (Dali), this entire ESNUG
  post consists of a paper I found on the topic.  PLEASE! -- Don't
  ask where or who I got it from.  :^)
                                          - John Cooley
                                            the ESNUG guy

( ESNUG 260 Item 1 ) -------------------------------------------- [4/96]

From: [ Don't Ask ]
Subject: Protocol Compiler (Dali), Behavioral Compiler & ATM Interface Design

Dali is an environment for ASIC designers to create designs with complex
interfaces and structured data streams.  The increasing complexity of
protocols has made existing design methodologies difficult and cumbersome.
Debugging an initial design to meet performance requirements, and later
re-engineering the design to accommodate modifications in protocol
specifications or other changes, is time-consuming if not outright
difficult.  Dali was developed in response to these and other challenges.

Dali offers designers the following features:

  * A semantics-driven design entry environment
  * Design analysis and debugging early in the design process
  * Back-annotation of simulation results to the graphical representation
  * Automatic generation of sequential FSMs from the concurrent
    representation
  * Partitioning of a complex FSM into a distributed set of FSMs
  * Generation of optimized RTL code for synthesis


Protocol specifications in standardization documents are always presented
in a hierarchical fashion.  The example used here is taken from the ATM
User-Network Interface Specification: it is the protocol of an ATM cell
with the F5 signaling flow embedded in the payload.  Some of the
definitions we use:

  Level 1 of Hierarchy is an F5 Flow (Segment Flow from Private to
  Public Switch)

  Level 2 of Hierarchy is a Function Type Specific Field for Loopback Mode

  Level 3 of hierarchy is a Loopback-Specific Field, the Loopback Indication
  field

  VPI - Virtual Path Identifier
  VCI - Virtual Channel Identifier
  PTI - Payload Type Identifier
  CLP - Cell Loss Priority
  HEC - Header Error Control
  OAM - Operation and Maintenance

  Function Type options:
   AIS:   Alarm Indication Signal
   FERF:  Far End Receive Failure
   Loopback


Dali defines the "ATM Cell Header" as a frame, and the "ATM Cell Payload"
is also a frame.  Every element inside the header and the payload is a
frame as well.  Dali recognizes the header and the payload, then specifies
the action to be executed.

Frames:

Frames are structured data formats or cycle-level behaviors that
graphically illustrate data exchange standards. They map well to the
structured data formats (such as packets and cells) that appear in data
books, communications standards, and instruction sets.

A frame can be hierarchical in structure. It can be made of additional
frames, which can themselves be broken into frames. At the lowest level,
terminal frames represent the bit patterns, simple expressions, or value
fields of a protocol. This hierarchical presentation of data insulates
designers from having to deal with too much detail at one time.
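
The frame hierarchy can be pictured with a small software model.  The
Python sketch below is only an illustration, not Dali's internal
representation: the Frame class and the decode() helper are hypothetical
names, and the field widths (including the 4-bit GFC field, which is not
in the glossary above) are those of the standard 5-byte UNI cell header
followed by a 48-byte payload.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Frame:
    """A frame is either terminal (a bit field) or built from sub-frames."""
    name: str
    bits: int = 0                                   # width of a terminal frame
    children: List["Frame"] = field(default_factory=list)

    def width(self) -> int:
        # A hierarchical frame is as wide as the sum of its sub-frames.
        if self.children:
            return sum(c.width() for c in self.children)
        return self.bits

    def terminals(self) -> List["Frame"]:
        # Flatten the hierarchy into its terminal frames, in order.
        if not self.children:
            return [self]
        out: List["Frame"] = []
        for c in self.children:
            out.extend(c.terminals())
        return out


# UNI header layout (assumption: standard 40-bit header, GFC first).
HEADER = Frame("ATM_Cell_Header", children=[
    Frame("GFC", 4), Frame("VPI", 8), Frame("VCI", 16),
    Frame("PTI", 3), Frame("CLP", 1), Frame("HEC", 8),
])
PAYLOAD = Frame("ATM_Cell_Payload", bits=48 * 8)
CELL = Frame("ATM_Cell", children=[HEADER, PAYLOAD])


def decode(frame: Frame, data: bytes) -> Dict[str, int]:
    """Slice `data` (most significant bit first) into terminal fields."""
    value = int.from_bytes(data, "big")
    pos = frame.width()
    fields: Dict[str, int] = {}
    for t in frame.terminals():
        pos -= t.bits
        fields[t.name] = (value >> pos) & ((1 << t.bits) - 1)
    return fields


header_fields = decode(HEADER, bytes.fromhex("0010005855"))
```

Here header_fields maps each terminal frame name to its value, so a
designer only looks at one level of the hierarchy at a time.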

Actions:

Actions are data-handling functions written in HDL that can be attached to
frames.  They are executed when the associated frames are accepted.  An
action must execute within one sample cycle.  Actions take ports and
variables as arguments.


The definition of some Dali notations:

A "terminal" frame is the lowest level in the frame hierarchy and
represents a single-cycle behavior. The terminal frame is active for
exactly a clock cycle during which an associated expression is evaluated.
If the expression evaluates to true for that clock cycle, any attached
actions are executed.

A "reference" frame is a hierarchical frame that refers to another defined
frame. The time required (in cycles) for execution of a reference frame
depends on the behavior it describes.

The "sequential" frame operator creates a sequence of one or more frames
that are accepted one after another in successive clock cycles. If any of
the frames is rejected, the sequence is rejected.

An "alternative" frame operator contains one or more frames that are
executed concurrently. Alternative frame operators are useful to define
several different possible behaviors or to perform concurrent processing.

An "optional" or "repeat" frame operator executes a frame a variable
number of times. An optional frame operator is enclosed in square brackets.

A "qualifier" frame operator has a Boolean expression that must evaluate as
true for the entire execution of a frame.

An "If" frame operator has a Boolean expression that must be true for the
first cycle of the execution of a frame.
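
One way to get a feel for these operators is to model them in software.
The Python sketch below is an assumption-laden illustration, not Dali
notation: each frame is a function that takes a list of per-cycle samples
and a start cycle, and returns the set of cycles at which it could be
accepted.  All function names are hypothetical, "repeat" is taken to mean
zero or more times, and "optional" zero or one time.

```python
from typing import Callable, Dict, List, Set

Sample = Dict[str, int]                      # port values in one clock cycle
Frame = Callable[[List[Sample], int], Set[int]]


def terminal(expr) -> Frame:
    """Single-cycle frame: accepted if expr(sample) is true in that cycle."""
    def run(s, i):
        return {i + 1} if i < len(s) and expr(s[i]) else set()
    return run


def sequence(*frames) -> Frame:
    """Frames accepted one after another; any rejection rejects the whole."""
    def run(s, i):
        ends = {i}
        for f in frames:
            ends = {e2 for e in ends for e2 in f(s, e)}
        return ends
    return run


def alternative(*frames) -> Frame:
    """All branches are tried concurrently from the same start cycle."""
    def run(s, i):
        return set().union(*(f(s, i) for f in frames))
    return run


def repeat(frame) -> Frame:
    """Zero or more repetitions of a frame."""
    def run(s, i):
        ends, frontier = {i}, {i}
        while frontier:
            frontier = {e2 for e in frontier for e2 in frame(s, e)} - ends
            ends |= frontier
        return ends
    return run


def optional(frame) -> Frame:
    """Zero or one occurrence of a frame."""
    def run(s, i):
        return {i} | frame(s, i)
    return run


def qualifier(cond, frame) -> Frame:
    """cond must hold in every cycle the frame spans."""
    def run(s, i):
        return {e for e in frame(s, i) if all(cond(x) for x in s[i:e])}
    return run


def if_frame(cond, frame) -> Frame:
    """cond must hold in the first cycle of the frame."""
    def run(s, i):
        return frame(s, i) if i < len(s) and cond(s[i]) else set()
    return run


# Hypothetical packet: a start nibble, any number of data nibbles below 8,
# then a stop nibble, all while the enable port "en" stays high.
start = terminal(lambda x: x["d"] == 0xA)
body = repeat(terminal(lambda x: x["d"] < 8))
stop = terminal(lambda x: x["d"] == 0xF)
packet = qualifier(lambda x: x["en"] == 1, sequence(start, body, stop))

stream = [{"d": 0xA, "en": 1}, {"d": 3, "en": 1},
          {"d": 5, "en": 1}, {"d": 0xF, "en": 1}]
accepted = len(stream) in packet(stream, 0)
```

Tracking sets of end cycles, rather than a single position, is what lets
the alternative operator follow several branches at once in this model.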


When using Dali, designers work in the protocol design environment which
has the following functional areas:

   1) protocol entry,
   2) protocol analysis,
   3) protocol synthesis,
   4) HDL generation, and
   5) integrated design simulation.

Dali allows designers to perform formal state analysis on their protocol
FSM. It can determine: 1) reachable states, 2) concurrent actions, and
3) state coverage.  After analysis, the design is ready for protocol
synthesis. Dali automatically creates the protocol logic from the graphical
representation and provides a user-directed partitioning scheme targeted
specifically at performance driven designs. Dali then converts the
synthesized FSM representation to an HDL description. If the output is
targeted for simulation, extra debugging information will be included. This
allows the simulation results to be back-annotated to the original
graphical representation.
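
The reachable-state and state-coverage checks mentioned above are
standard graph questions.  The Python sketch below shows the idea with a
breadth-first search; it is a generic illustration rather than Dali's
algorithm, and the state names and the transition table are hypothetical.

```python
from collections import deque


def reachable(transitions, start):
    """Return every state that can be reached from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def state_coverage(visited, transitions, start):
    """Fraction of the reachable states that a simulation run visited."""
    reach = reachable(transitions, start)
    return len(set(visited) & reach) / len(reach)


# Hypothetical cell-receiver controller; DEBUG has no incoming edge.
FSM = {
    "IDLE": ["HEADER"],
    "HEADER": ["PAYLOAD", "IDLE"],
    "PAYLOAD": ["IDLE"],
    "DEBUG": ["IDLE"],
}

reach = reachable(FSM, "IDLE")
unreachable = set(FSM) - reach
coverage = state_coverage(["IDLE", "HEADER"], FSM, "IDLE")
```

States that end up in `unreachable` are candidates for removal before
synthesis; a coverage value below 1.0 points at states the test stimulus
never exercised.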


Before committing the whole design to the Dali environment, we wanted to
verify Dali's ability to generate good-quality RTL code for synthesis.
To do so, we implemented a small portion of the Cell Receiver with both
the Dali and the RTL design approaches. We present the findings here.


When writing RTL code, the designer has more freedom and more options at
the gate-level implementation (a lower abstraction level). The designer
can choose:

- The style of FSM (Moore/Mealy)
- The style of HDL code (good reusable code)
- The states and transitions (including inventing names for the states,
      which is a requirement for efficient debugging)

This freedom can improve the results, but it also carries the risk of
building in errors. Once a decision is made and the code is written,
changing it is difficult and takes a lot of work. The designer has to
consider many low-level details while writing the code. Design experience
counts here.


Dali offers many time-saving advantages over the traditional RTL approach.
Instead of focusing on states and transitions, we can concentrate on
entering the protocol graphically at a high level. For the
ATM_Cell_Decoding above, the concurrent semantics of the frame
representation allow the design to be captured very easily, and the result
is very close to the original Cell Receiver specification.

Static design analysis and debugging can be performed while we are entering
the design. Simulation results are reflected back onto the original
protocol representation which makes understanding and debugging the design
much easier. Finally, Dali generates an optimized finite state machine for
the controller and automatically outputs the RTL code for synthesis.
Design or functionality changes and protocol modifications can be
incorporated easily into the protocol representation. This lets us modify
the design quickly, especially before the top-level specification has been
frozen. In contrast, designers using a traditional RTL approach must patch
or painstakingly re-engineer at the register-transfer level.

The circuit generated from our first RTL description was far from optimal
at 833 gates, although the code was easy to understand.  We went back and
re-coded the design, this time paying special attention to coding style
for synthesis. The synthesized result was better. After a few iterations,
we were able to get the design down to 710 gates.

When we looked at the synthesis result generated by Dali, the first result
was 757 gates. It was not far from our best result, and that was before any
special optimization techniques had been applied. During protocol
synthesis, Dali was able to convert and collapse concurrent behavior into
sequential behavior much more efficiently than a human being. At the same
time, it can eliminate unused frames and actions embedded in the original
representation.  We were convinced.
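
To put the three gate counts side by side, the short Python snippet below
works out the relative differences.  It uses only the numbers quoted
above; the variable names are ours.

```python
first_rtl, best_rtl, dali_first = 833, 710, 757   # gate counts from the text


def pct(delta, base):
    """Relative difference in percent."""
    return 100.0 * delta / base


rtl_gain = pct(first_rtl - best_rtl, first_rtl)        # saved by hand-tuning
dali_gap = pct(dali_first - best_rtl, best_rtl)        # Dali vs best RTL
dali_vs_first = pct(first_rtl - dali_first, first_rtl) # Dali vs first RTL

print(f"hand-tuned RTL vs first RTL : {rtl_gain:.1f}% smaller")
print(f"Dali first pass vs best RTL : {dali_gap:.1f}% larger")
print(f"Dali first pass vs first RTL: {dali_vs_first:.1f}% smaller")
```

In other words, Dali's first pass landed within about 7% of the RTL that
took several hand-tuning iterations, and about 9% below the first RTL
attempt.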


Dali performs interactive analysis that uses the frame representation to
display the analysis results. Designers select a point in the protocol and
Dali graphically displays the analysis results relative to the selected
protocol element. For example, Dali can detect the following:

 - Actions that could execute concurrently with a selected frame
    or action
 - Portions of the protocol that are reachable from a particular frame
    or action
 - Portions of the protocol which may be active at the same time

Designers can correct errors within the frame representation, then check
and analyze the design again - until they are satisfied with the analysis
result.


After generating the HDL output, designers can launch the simulator from
Dali. During simulation, Dali helps designers interpret the results by
back-annotating them onto the protocol representation, which is already
familiar to designers.

The state of the simulation is displayed in Dali through highlighting
frames and actions; as the simulation steps through the protocol, designers
can see which frames are active, which frames are accepting, and which
actions are executing. This allows designers to think in terms of the
frames in the protocol rather than states of the controlling FSM. For the
ATM_Cell Simulation example above, the marker in the waveform viewer is at
580 ns. It correlates to the fourth frame in the original representation.
It means that at that particular clock cycle the fourth frame is being
executed.

Dali also provides a simulation control panel which interacts with the
simulation engine. It allows designers to control the simulation run
directly.


Debugging an RTL FSM with an event-driven simulator is difficult. The
link between the source (with state names) and the simulation result is
hard to draw, and the designer cannot step back in time. In the protocol
simulation we can step forward and backward with a marker and get the
back-annotated information for the active and accepted frames. The
disadvantage is that the simulation has to be quit before changes to the
frames are allowed, and then restarted.

In RTL simulation, the question often asked is: why is the FSM in state x
instead of state y, and which is the correct one? This question is
difficult to answer, especially if the FSM has many transitions, and it
requires stepping through HDL source code that has little correlation with
the specification. The same question is easy to answer within the Dali
simulation environment. The designer can step with a marker through the
simulation waveform display and see which frames are active.

      States:  number of states, number of state registers.
      Area:    combinational + sequential + interconnect area.

The protocol synthesis process performs the following:

   * Automatically creates the protocol control logic from the protocol
     specification
   * Provides a user-directed optimization targeted specifically for
     frame-based protocol designs
   * Provides a user-directed partitioning scheme  for complex protocol
     control logic.

Designers can specify synthesis options, including implementation
directives and the optimization effort level. Based on the compilation
directives, implicit optimization algorithms are applied to analyze and
optimize the representation. State graph algorithms are provided for state
minimization and state encoding. Various encoding schemes are available for
different applications. Low optimization effort levels are suitable for
quick turnaround and debugging, as the synthesizer attempts to preserve all
back-annotation information for later simulation. The highest optimization
effort level produces the best design implementation. After protocol
synthesis, reports are generated to show the number of states and registers
in the protocol state machine.
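
The link between the state count and the register count in such a report
depends on the encoding.  The Python sketch below compares two textbook
encodings, binary and one-hot; these are our assumption for illustration,
since the text above does not name the schemes Dali offers, and the
function name is hypothetical.

```python
def state_registers(num_states: int, encoding: str) -> int:
    """Flip-flops needed to hold the state under a given encoding."""
    if num_states < 1:
        raise ValueError("an FSM needs at least one state")
    if encoding == "binary":
        # ceil(log2(n)) computed exactly with integer arithmetic
        return max(1, (num_states - 1).bit_length())
    if encoding == "one_hot":
        return num_states
    raise ValueError(f"unknown encoding: {encoding}")


for n in (5, 8, 53):
    print(n, state_registers(n, "binary"), state_registers(n, "one_hot"))
```

Binary encoding keeps the register count small at the cost of more
next-state decoding logic, while one-hot spends one register per state to
simplify that logic, which is why the best choice differs by application.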


Instead of designing at the RTL, behavioral synthesis begins at the
algorithmic level. It enables designers to write behavioral specifications
similar to software algorithms. The description does not imply any specific
architecture; therefore, designers can easily change the synthesized
architecture by changing the high-level performance constraints. In
contrast to the RTL approach, designers are no longer locked into the
single architectural implementation implied by an RTL description.

Behavioral synthesis is ideal for algorithm-based design.  The
specification time is shorter, and the behavioral description is shorter
and more intuitive to develop. The design can also be simulated at the
behavioral level.  Since the description is more abstract than RTL,
simulation can be orders of magnitude faster than RTL simulation.
Architectural exploration from the behavioral level offers one big
advantage: the designer can quickly create and evaluate a number of
implementations before selecting one that fits the performance
requirement. As a result, the quality of the implementation is much better.


Two important tasks are performed by behavioral synthesis: scheduling and
hardware allocation.

Scheduling extracts the control and data flow operations from the design
specification and assigns them to clock cycles. A state machine
controller is synthesized to sequence the operations and execute them in
their assigned cycles. The typical goal of this process is to implement the
design with the smallest amount of resources (registers, multiplexers, and
operators) in the minimum number of clock cycles.

Hardware allocation maps the operations and data from the specification
onto the datapath. The target architecture of behavioral synthesis is a
general CPU model that contains a datapath, memory, and a control machine.
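
The trade-off between resources and clock cycles can be seen in a toy
scheduler.  The Python sketch below is a generic resource-constrained
list scheduler from the textbook literature, not the algorithm used by
Behavioral Compiler; every operation is assumed to take one cycle, and
the operation names and resource limits are hypothetical.

```python
def list_schedule(ops, deps, limits):
    """Assign each operation to a clock cycle.

    ops:    {name: resource kind}, in priority (source) order
    deps:   {name: [operations whose results it needs]}
    limits: {resource kind: units available per cycle}
    """
    cycle_of = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        used = {}
        placed = []
        for op in remaining:
            # Ready only if every input was produced in an earlier cycle.
            ready = all(d in cycle_of and cycle_of[d] < cycle
                        for d in deps.get(op, ()))
            kind = ops[op]
            if ready and used.get(kind, 0) < limits.get(kind, 0):
                used[kind] = used.get(kind, 0) + 1
                cycle_of[op] = cycle
                placed.append(op)
        if not placed:
            raise ValueError("no progress: cyclic dependency or no resource")
        remaining = [op for op in remaining if op not in placed]
        cycle += 1
    return cycle_of


# y = (a * b) + (c * d) + e
ops = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add"}
deps = {"a1": ["m1", "m2"], "a2": ["a1"]}

one_mul = list_schedule(ops, deps, {"mul": 1, "add": 1})
two_mul = list_schedule(ops, deps, {"mul": 2, "add": 1})
```

With one multiplier the expression needs four cycles; allowing a second
multiplier brings it down to three, which is the kind of exploration a
designer drives through constraints rather than by rewriting RTL.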

Type definitions in VHDL package:

    subtype     fifo_word_t  is std_logic_vector (31 downto 0);
    subtype     tlb_index_t  is integer range 0 to 31;
    subtype     tlb_segm_t   is std_logic_vector ( 4 downto 0);
    subtype     tlb_addr_t   is std_logic_vector (15 downto 0);
    type        opcd is (store,find);
    type        rout_req_t is record
                  address : tlb_addr_t;
                  segment : tlb_segm_t;
                  command : opcd;
                end record;
    type        tlb_entry_t is record
                  address : tlb_addr_t;
                  segment : tlb_segm_t;
                end record;
    type        tlb_t  is array (0 to 31) of tlb_entry_t;

The entity definition:

        entity router is
           port (clk       : in  std_logic;
                 reset     : in  std_logic;
                 inp_strb  : in  std_logic;
                 inp       : in  rout_req_t;
                 busy      : out std_logic;
                 outp_strb : out std_logic;
                 outp      : out tlb_entry_t);
        end router;


The algorithm that the translation lookaside buffer (TLB) executes is built
around a table which is implemented as a memory. The table, type tlb_t, has
32 entries. Each entry, type tlb_entry_t, has two fields: an address, which
is 16 bits wide (tlb_addr_t); and a segment number, which is 5 bits wide
(tlb_segm_t). A request received by the TLB, type rout_req_t, consists of
three fields: a recipient address, a segment number, and a one-bit opcode
that tells the TLB whether to search for or save the recipient/segment
mapping.

Regardless of the opcode, the TLB's first action on receiving a request is
to find the recipient in the table, using a binary search. As shown above,
apart from the wait and reset statements, the code is extremely similar to
the way it would be written in C. When writing this description, the
designer does not need to worry about the cycle-to-cycle behavior of the
circuit and can instead focus on making sure the functionality of the
design is correct. The description is high-level, easy to understand, and
good for simulation performance.

By the time the "search" loop has finished, whether or not the recipient
is found has been determined. Then, based on the opcode value, different
actions are performed.

 1. If the opcode is "find" and the recipient was found, transmit the
    segment number in the table entry found.
 2. If the opcode is "find" and the recipient was not found, transmit the
    special segment number.
 3. If the opcode is "store" and the recipient was found, update the
    recipient's segment value.
 4. If the opcode is "store" and the recipient was not found, insert the
    recipient's address and segment value into TLB. This requires that the
    entries currently at that location and above be "bumped" upwards in the
    memory. This is a very expensive operation.
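
The four cases above can be sketched as a plain software model.  The
Python code below is our reading of the description, not the original
behavioral VHDL: the class and method names are hypothetical, the
"special" segment number is an assumed value, and the behavior when the
table is full is also an assumption, since the text does not cover it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TLB_SIZE = 32              # matches tlb_t: array (0 to 31)
NOT_FOUND_SEGMENT = 0x1F   # assumed value of the "special" segment number


@dataclass
class Entry:
    address: int   # recipient address (tlb_addr_t)
    segment: int   # segment number (tlb_segm_t)


class Tlb:
    def __init__(self) -> None:
        self.table: List[Entry] = []   # kept sorted by address

    def _search(self, address: int) -> Tuple[bool, int]:
        """Binary search; returns (found, index or insertion point)."""
        lo, hi = 0, len(self.table)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.table[mid].address < address:
                lo = mid + 1
            else:
                hi = mid
        found = lo < len(self.table) and self.table[lo].address == address
        return found, lo

    def request(self, address: int, segment: int,
                command: str) -> Optional[int]:
        found, idx = self._search(address)   # done regardless of the opcode
        if command == "find":
            return self.table[idx].segment if found else NOT_FOUND_SEGMENT
        if command != "store":
            raise ValueError(f"unknown opcode: {command}")
        if found:
            self.table[idx].segment = segment
            return None
        if len(self.table) >= TLB_SIZE:
            raise OverflowError("TLB full")   # assumption: not in the text
        # "Bump" the entries at idx and above upward by one slot.
        self.table.append(Entry(0, 0))
        for k in range(len(self.table) - 1, idx, -1):
            self.table[k] = self.table[k - 1]
        self.table[idx] = Entry(address, segment)
        return None
```

The search costs at most a handful of table reads for 32 entries, while
an insertion near the bottom of the table has to move almost every entry,
which is where the unpredictable run time comes from.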

Notice that the run time of this algorithm is not predictable, so
handshaking is essential to communicate with this module. The signal
"inp_strb" tells the TLB that the input is stable and can be read. During
the computation, the "busy" signal is set. After the computation has
finished, the TLB sets the "outp_strb" signal to inform other modules that
the output is ready.
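
The handshake can be illustrated with a cycle-level toy model.  The
Python sketch below keeps the port names of the router entity (inp_strb,
busy, outp_strb) but everything else is hypothetical: the class, the
latency function, and the fact that the output simply echoes the request
are stand-ins for the real computation.

```python
class HandshakeModel:
    """Cycle-level stand-in for a block whose latency varies per request."""

    def __init__(self, latency_of):
        self.latency_of = latency_of   # hypothetical: request -> work cycles
        self.remaining = 0
        self.pending = None

    def clock(self, inp_strb, inp=None):
        """Advance one clock cycle; returns (busy, outp_strb, outp)."""
        if self.remaining == 0:
            if inp_strb:                        # accept only when idle
                self.pending = inp
                self.remaining = max(1, self.latency_of(inp))
                return (1, 0, None)
            return (0, 0, None)
        self.remaining -= 1
        if self.remaining == 0:                 # computation just finished
            result, self.pending = self.pending, None
            return (0, 1, result)
        return (1, 0, None)


def run(model, requests):
    """Issue requests one at a time, waiting for outp_strb after each."""
    trace = []
    for req in requests:
        busy, strb, out = model.clock(1, req)
        cycles = 1
        while not strb:
            busy, strb, out = model.clock(0)
            cycles += 1
        trace.append((out, cycles))
    return trace


trace = run(HandshakeModel(lambda req: req), [3, 1])
```

The caller never needs to know how long a request takes; it only watches
the strobe, so a slow insertion and a fast lookup use the same interface.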

More detail on this design can be found in: "Behavioral Synthesis: Digital
System Design Using the Synopsys Behavioral Compiler", David Knapp,
Prentice Hall PTR, 1996.


 Step 1. Designer specifies algorithm in VHDL/Verilog, desired clock rate,
         latency, and technology library
 Step 2. BC extracts the data and control flow from the source code
         dataflow : operators such as multiplies, adds, comparisons, etc.
         controlflow : if/case statements, for/while loops, memory accesses
 Step 3. BC schedules the operations into clock cycles
         order must be honored
         operations must be performed within the desired clock period
         (links to DesignWare)
 Step 4. BC allocates the scheduled operations to hardware resources
 Step 5. An RTL design specification is created to implement the scheduled
         design. It includes the following elements: datapath, memory,
         finite state machine.


Scheduling Modes:

Free-floating Mode
  In this mode, computations and read/write operations may flow freely past
  wait statements.

Superstate-Fixed Mode
  In this mode, the relative times of read and write operations are
  preserved, but superstates (delimited by wait statements) may be
  stretched.  Operations on variables are free to move around.

Cycle-Fixed Mode
  In this mode, the exact cycle times of read and write operations on
  signals are preserved. Operations on variables are free to move around.

With Behavioral Compiler, an array of bit vectors can be mapped to RAM
easily with the use of attributes. It allows designers to specify memory
reads and writes as array accesses in the behavioral HDL description.
During scheduling, the control sequence for the memory I/O will be built.
In the TLB description above, to use a different memory device, the
designer needs to change only the line that specifies the particular memory
component to use. In this case, we evaluated the latency of the inner loop
by switching between single-port and dual-port RAM.

From our observation, with the behavioral-level approach the design effort
is proportional to the functional complexity of the design. With the RTL
approach, however, the design effort is more or less proportional to the
architectural complexity, which increases more than linearly with the
functionality.


References:

"ATM User-Network Interface Specification version 3.0/4.0", The ATM
  Forum.
"ATM Networks", Othmar Kyas, Thomson Publishing.
"Asynchronous Transfer Mode", 2nd Edition, Martin De Prycker,
  Prentice Hall.
"Getting Applications for ATM", Network World, Mar. 11, 1996,
  http://www.nwfusion.com.
"ATM Chip Web",
  http://www.infotech.tu-chemnitz.de/www-public/prof-it/dako/atm.
"ATM Standards Documents and Implementation Agreements",
  http://cell-relay.indiana.edu/cell-relay/docs/standards.html.
"High-Level Synthesis: Introduction to Chip and System Design",
  Daniel D. Gajski, Nikil D. Dutt, Allen C.-H. Wu and Steve Y.-L. Lin,
  Kluwer Academic, 1992.
"Behavioral Synthesis: Digital System Design Using the Synopsys
  Behavioral Compiler", David Knapp, Prentice Hall, 1996.
"Behavioral Compiler User Guide version 3.5a", Synopsys, Inc, 1996.
"Dali Technology Backgrounder, version 1.0", Synopsys, Inc, 1996.
"Synopsys High-Level Design Tools",
  http://www.synopsys.com/products/products.html.
"Behavioral Synthesis Methodology for HDL-based Specification and
  Validation", David Knapp, Tai Ly, Ron Miller, Don MacMillen, DAC95.
"A System for Compiling and Debugging Structured Data Processing
  Controllers", Andrew Seawright, Wolfgang Meyer, Ulrich Holtmann, Barry
  Pangrle, Rob Verbrugghe, Joe Buck, EuroDAC96.


Copyright 1991-2024 John Cooley.  All Rights Reserved.