
( ESNUG 387 Item 16 ) -------------------------------------------- [01/23/02]

From: "Mark Warren" <mwarren@synopsys.com>
Subject: How You Code Latches & Flip-Flops *Greatly* Impacts VCS Runtimes

Hi, John,

With the last few versions of VCS, the coding style used for *synchronous*
devices has become critical.  If VCS clearly understands all your flip-
flops and latches, by default it will use its aggressive cycle-based
algorithms (which can result in large speedups and even larger memory
reductions).  But if VCS does not like your coding style for flops or your
clock-gen circuitry, then VCS's old event-driven algorithms dominate.
You'll then be stuck simulating at those much slower Cadence NC-Verilog
speeds.  This holds true for both RTL and gate-level (UDP) designs.

This letter is to show how minor changes in coding styles can have dramatic
effects on VCS simulation performance.

VCS Optimizations
=================

The algorithms VCS uses are split into two camps.  The first is Language
optimizations.  (These are also known as Front-End optimizations.)  They
optimize the logic to be evaluated and minimize the number of events in
the VCS event queue.  All event-driven simulators use an event queue to
schedule and propagate events.  Event queues work well on asynchronous
behavior (like timing through random logic), but they are inherently slow
due to the overhead of this flexibility.

The second type of VCS optimization is called RoadRunner, which can be
thought of as a set of cycle-based algorithms that trigger on synchronous
logic.  Many synchronous devices only need a clock edge to propagate
events, so they do not need the flexibility of the event queue.  This is
the concept behind the VCS RoadRunner algorithms: minimize the flexibility
used for these types of constructs in order to get a bigger runtime
speedup.

Language and RoadRunner algorithms are on by default in VCS.  No special
compile-time flag is needed.  A much more aggressive family of
optimizations within VCS (called Radiant optimizations) also exists, but I
don't have the time to discuss them here.  Radiant optimizations are
enabled with the VCS compile switch +rad.

Flip-Flop Coding Styles
=======================

It is very important that the coding style of synchronous elements in your
design is *race*-free and coded in a way that VCS can understand.  If VCS
recognizes the latches/FFs in your design, then it will automatically turn
on a RoadRunner Cycle optimization and give a good speedup.  VCS will
accelerate most flip-flop coding styles accepted by Design Compiler.  (These
are documented in the on-line VCS User Guide.)

Here are the general rules for coding synchronous devices in VCS:

  * Standard flip-flop coding looks like:

                         always @(posedge clk)
                                a <= b;

    This is a perfect coding style for VCS.  Non-blocking assignments (NBAs)
    ensure that no race conditions exist between flip-flops, and since there
    is no delay, VCS can simulate this flop using cycle-based algorithms.

  * Adding a delay slows down evaluations:

                         always @(posedge clk)
                                a <= #1 b;

    This example adds an unnecessary delay on the right-hand side of the NBA.
    Some users like to use this delay so waveforms are staggered and you can
    see which outputs are caused by which flops.  Others believe the delay
    helps avoid race conditions.  This is not true.  The nature of how NBAs
    propagate data ensures that ordering is correct, thus the delay is not
    needed.  The main problem with this delay is that it inhibits VCS from
    using cycle-based optimizations.  When VCS sees the delay, it assumes
    that the designer used it because some other logic relies on the delay
    to work properly (such as asynchronous feedback.)

    For regressions, you can force VCS to ignore any delay on the RHS of
    NBAs by using the compile-time "vcs +nbaopt" flag.  This will allow more
    RoadRunner optimizations (and a resulting speedup), but it may expose
    race conditions if NBAs were used in inappropriate places (such as
    inside "initial" blocks).

  * Blocking assigns without delays are prone to races:

                         always @(posedge clk)
                                a = b;

    This coding style is prone to race conditions.  The IEEE LRM states that
    all "always" blocks act in parallel, and it does not define the order in
    which blocks triggered by the same clock edge execute.  So if b is itself
    the output of another flop on the same clock, you cannot guarantee
    whether a receives the old or the new value of b.

  * Blocking assigns with delays are not recommended:

                         always @(posedge clk)
                                a = #1 b;

    Blocking assignments do not ensure proper ordering of events in
    daisy-chained flip-flops, so they require a #1 on the RHS to avoid race
    conditions.  Since this delay inhibits VCS cycle-based optimizations,
    this coding style is also not recommended.

  * Adding asynchronous resets:

                    always @(posedge clk or negedge rst)
                       begin
                         if (rst == 0)
                            a <= 0;
                         else
                            a <= b;
                       end

    Adding resets to flops generally will not adversely affect VCS
    optimizations.
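
To make the recommended style concrete, here is a minimal sketch of two
daisy-chained flops.  The module name shift2 and its ports are hypothetical
and chosen only for illustration.  Both stages use NBAs with no delay and an
active-low asynchronous reset.  As a result, stage 2 samples the value stage 1
held before the clock edge, whichever "always" block the simulator runs first.

```verilog
// Hypothetical two-stage shift register in the NBA, no-delay style.
module shift2 (
  input  wire clk,
  input  wire rst_n,   // active-low asynchronous reset
  input  wire d,
  output reg  q1,
  output reg  q2
);
  // Stage 1: captures d on each rising clock edge.
  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      q1 <= 1'b0;
    else
      q1 <= d;

  // Stage 2: captures the value q1 had before this edge, because NBA
  // updates are deferred until both blocks have been evaluated.
  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      q2 <= 1'b0;
    else
      q2 <= q1;
endmodule
```

Changing both "<=" to "=" would bring back the race described above.  If the
stage 1 block happened to run first, q2 would pick up the new value of d on
the same edge.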

Latch Coding Styles
===================

There are many ways to code a latch in Verilog.  These examples all work well
and will be properly accelerated in VCS:

  always @(clk or d)     always @(clk or d)    always @(clk or d)
    if (clk)               if (~clk)             q <= clk ? d : q;
       q <= d;                q <= d;

  always @(clk or scan_clk or d or scan_d)     always @(clk or enable or d) 
    if (clk)                                     if (clk)         
       q <= d;                                      if (enable) q <= d;
    else if (scan_clk)  
            q <= scan_d;     
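
As a minimal sketch, the scan-latch fragment above can be wrapped in a
complete module.  The module and port names are hypothetical.  The
sensitivity list names every signal the block reads.  When neither enable is
high, q is simply not assigned, so it holds its value.

```verilog
// Hypothetical scan latch built from the coding style shown above.
module scan_latch (
  input  wire clk,       // functional enable (transparent when high)
  input  wire scan_clk,  // scan enable (transparent when high)
  input  wire d,
  input  wire scan_d,
  output reg  q
);
  // Every signal read inside the block appears in the sensitivity list.
  always @(clk or scan_clk or d or scan_d)
    if (clk)
      q <= d;
    else if (scan_clk)
      q <= scan_d;
    // No else branch: with both enables low, q keeps its previous value.
endmodule
```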

Clock Drivers
==============

There are many ways to drive clocks in the Verilog language.  VCS is
usually happy with any clock coding style used.  Since VCS cycle-based
optimizations occur at the block level, there is no performance penalty for
having multiple clocks (even if they share paths or are asynchronous).  An
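important rule to follow is that clock signals should never be driven by a
non-blocking assignment (NBA) -- it nullifies the benefits of using NBAs
inside of flops and is very prone to race conditions.  The clock edge then
happens during the NBA update phase, when data values assigned with NBAs in
the same time step may already have changed.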
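
A minimal testbench-style sketch of this rule, assuming hypothetical signal
names clk and clk_b.  Both clocks are toggled with blocking assignments only.
Their periods are unrelated, which is the kind of multiple, asynchronous
clocking the text above says VCS handles without penalty.

```verilog
// Hypothetical testbench clock generators.  Clocks are driven with
// blocking assignments only.
module clk_gen;
  reg clk;
  reg clk_b;

  initial begin
    clk   = 1'b0;   // give both clocks a known value at time 0
    clk_b = 1'b0;
  end

  always #5 clk   = ~clk;    // period of 10 time units
  always #7 clk_b = ~clk_b;  // unrelated period of 14 time units

  // Avoid: always #5 clk <= ~clk;
  // An NBA-driven clock edge lands in the NBA update phase, alongside
  // the flop outputs it is supposed to sample.
endmodule
```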

Coding Styles To Avoid 
======================

A few inefficient constructs usually don't impact the performance of VCS.
Use whatever you need on a local level to achieve the needed functionality.
But blocks that get instantiated thousands of times or "always" blocks
that get triggered thousands of times have a MUCH larger impact on overall
performance.  Here are some other coding styles which should be avoided:

  1) Use of delays inside "always" blocks

  2) Use of signal display/monitor inside "always" blocks that is not
     really necessary (e.g. a display to flag an ERROR message is OK, but
     displaying signal values at every edge can seriously hurt performance)

  3) Use of fork/join constructs

  4) Use of force/release and assign/deassign

  5) Multiple event controls (more than one "@" or other trigger inside a
     single "always" block)

  6) Use of Verilog "task" calls that have event/delay control in them

  7) Use of memories inside "always" blocks

  8) Driving a clock with an NBA

  9) Modeling flops with transistors

  10) Any strength-based modeling

  11) Named blocks
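
Here is a sketch of items 1, 2, and 5 applied together.  The module name,
ports, and parity check are all hypothetical.  The register uses a single
event control with no delays.  It calls $display only when an error is
detected, rather than printing values on every edge.

```verilog
// Hypothetical parity-checked register.
module checked_reg (
  input  wire       clk,
  input  wire [7:0] d,
  input  wire       parity_in,  // expected XOR of d's bits (hypothetical)
  output reg  [7:0] q,
  output reg        err
);
  // One "@" event control, no delays, NBAs for the stored values.
  always @(posedge clk) begin
    q   <= d;
    err <= (^d) != parity_in;
    // Display only on an error, never on every clock edge.
    if ((^d) != parity_in)
      $display("ERROR: parity mismatch at time %0t, d=%h", $time, d);
  end
endmodule
```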

UDPs Coding Style
==================

Many vendor-supplied UDPs are well written, so VCS can generally infer the
latches and flops properly.  Trouble comes when a user includes table
entries for only some of the input combinations in a UDP.  The Verilog LRM
defines any input combination not covered by the table to output X, which
can result in non-optimized UDPs.  This could become a VCS performance
problem if this ambiguity results in VCS not recognizing the clock input to
the UDP.  For illegal input combinations that are not expected to happen, a
good practice is to define "no change" explicitly.
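
A minimal sketch of that practice, using a hypothetical positive-edge D
flip-flop UDP.  Clock edges that should not change the output are listed
explicitly as "no change" (-) rather than left to the LRM default of X.  The
same applies to data changes while the clock is steady.  A possible rising
edge, (0x) or (x1), is covered only when d already equals q.  The d != q case
is deliberately left uncovered so that it still produces X.

```verilog
// Hypothetical positive-edge D flip-flop UDP with explicit "no change"
// entries for clock edges and data changes that must not affect q.
primitive udp_dff (q, clk, d);
  output q;
  reg    q;
  input  clk, d;
  table
  //  clk   d  :  q  :  q+
     (01)   0  :  ?  :  0 ;  // rising edge captures 0
     (01)   1  :  ?  :  1 ;  // rising edge captures 1
     (0x)   0  :  0  :  0 ;  // possible rising edge, d already equals q
     (0x)   1  :  1  :  1 ;
     (x1)   0  :  0  :  0 ;
     (x1)   1  :  1  :  1 ;
     (?0)   ?  :  ?  :  - ;  // falling edge: no change
     (1x)   ?  :  ?  :  - ;  // 1 -> x is not a rising edge: no change
      ?   (??) :  ?  :  - ;  // d changes while clk is steady: no change
  endtable
endprimitive
```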

Use The VCS Built-In Profiler
=============================

Since synchronous devices are often instantiated thousands of times in a
typical design, they can have a major impact on performance.  If a flop
model does not get accelerated, it will show up high in the VCS profile
list.  It's wise to occasionally compile and run VCS with the "vcs +prof"
flag to see which constructs are taking the most CPU time in your
simulation.  This is often the best way to find an offending module.

    - Mark Warren
      Synopsys, Inc.                             Cupertino, CA

============================================================================
 Trying to figure out a Synopsys bug?  Want to hear how 12,000+ other users
    dealt with it?  Then join the E-Mail Synopsys Users Group (ESNUG)!

       !!!     "It's not a BUG,               jcooley@TheWorld.com
      /o o\  /  it's a FEATURE!"                 (508) 429-4357
     (  >  )
      \ - /     - John Cooley, EDA & ASIC Design Consultant in Synopsys,
      _] [_         Verilog, VHDL and numerous Design Methodologies.

      Holliston Poor Farm, P.O. Box 6222, Holliston, MA  01746-6222
    Legal Disclaimer: "As always, anything said here is only opinion."
 The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com

Copyright 1991-2024 John Cooley. All Rights Reserved.