

( ESNUG 352 Item 1 ) --------------------------------------------- [5/17/00]

Subject: An Ultima ClockWise Tape-Out Story Plus A User Skew/Power Warning

> And yesterday, after some snooping, I discovered that LSI Logic, Matrox,
> HP, Texas Instruments, and Philips were all doing evals of ClockWise.
> I wonder if there are any ClockWise tape-outs yet.


From: Jon Stahl 

Hi John,

I recently taped out a design on which I inserted clock trees using the
Ultima ClockWise tool and really liked it.  In the past I have used
intentional skew in special situations to help improve timing, but it was
difficult, time-consuming hand work.

I have always thought that having an automated way of using skew to improve
timing would be great, but it seemed like a difficult problem.  In the past
I have seen a few academic papers on the subject, but until recently I have
never heard of a commercial tool.

Ultima has pulled off a very nice implementation of a "useful" skew tool.
The tool has several modes: zero-skew, useful2, and useful3 (and a promised
useful4).

In a few words, useful2 mode kind of preserves the look and feel of a
zero-skew methodology.  This mode separates registers into two groups:

  1) BSG (Bounded Skew Group), which are either
        - IO flops, that receive data from primary inputs/drive
          primary outputs
        - flops on another clock tree

  2) Internal - everything else

Once these two groups are identified, the tool creates a "schedule" for the
insertion delays of all the clock leaf points in the design.  Flops in the
BSG are scheduled with equal insertion delays, i.e. zero skew with respect
to each other.  All other flops are fair game for its useful skew algorithm.
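
The grouping described above can be sketched roughly as follows.  This is
only a minimal illustration, not ClockWise's actual implementation: the
Flop record, the classify() function, and the example netlist are all
hypothetical names invented here, and "another clock tree" is read as "has
a timing arc to a flop on a different clock".

```python
from dataclasses import dataclass, field

@dataclass
class Flop:
    name: str
    clock: str                                   # clock tree the flop sits on
    neighbors: set = field(default_factory=set)  # names reached through combinational logic

def classify(flops, primary_ios):
    """Split flops into (bsg, internal) sets of names (hypothetical helper)."""
    by_name = {f.name: f for f in flops}
    bsg, internal = set(), set()
    for f in flops:
        # I/O flop: receives data from a primary input or drives a primary output.
        touches_io = bool(f.neighbors & primary_ios)
        # Cross-tree flop: has an arc to a flop clocked by a different tree.
        crosses_tree = any(
            n in by_name and by_name[n].clock != f.clock for n in f.neighbors
        )
        (bsg if touches_io or crosses_tree else internal).add(f.name)
    return bsg, internal

# Made-up example: one I/O flop, two core flops, one JTAG-clocked flop.
flops = [
    Flop("in_reg", "clk", {"PAD_IN", "core_a"}),
    Flop("core_a", "clk", {"in_reg", "core_b"}),
    Flop("core_b", "clk", {"core_a", "tap_reg"}),
    Flop("tap_reg", "tck", {"core_b"}),
]
bsg, internal = classify(flops, primary_ios={"PAD_IN"})
```

In this toy netlist only core_a ends up in the internal group; core_b lands
in the BSG purely because of its arc to the tck-clocked flop.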

The current useful skew scheduling algorithm is fairly straightforward.
A "permissible range" is derived, which is a window where the clock can
transition on a flop and have its setup and hold times met with respect to
"adjacent" flops (flops connected to the flop through combinational logic).
The tool attempts to schedule the transition early enough to meet the setup
time without violating hold, hold being the dominant constraint.  You can
also specify "safety margins", i.e. extra slack for both setup and/or hold
that the tool will attempt to meet.  (Personal opinion: I believe it would
be better to place the transition right in the center of the range to
provide as much margin on either side as possible...)
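
To make the "permissible range" idea concrete, here is a minimal sketch for
a single launch/capture flop pair, using the textbook setup and hold
inequalities.  It is an assumption about how such a window can be computed,
not a description of what ClockWise does internally; the function names and
the example delays are made up.  With these numbers the window on the
capture clock's lateness is roughly 1.2 ns to 2.9 ns, and the centered
choice suggested above falls in the middle of it.

```python
def permissible_range(d_max, d_min, period, t_setup, t_hold,
                      setup_margin=0.0, hold_margin=0.0):
    """Bounds (lo, hi) on skew = t_capture - t_launch, all values in ns.

    Setup: t_launch + d_max + t_setup + setup_margin <= t_capture + period
    Hold:  t_launch + d_min >= t_capture + t_hold + hold_margin
    """
    lo = d_max + t_setup + setup_margin - period  # smallest skew that meets setup
    hi = d_min - t_hold - hold_margin             # largest skew that meets hold
    return lo, hi

def centered_skew(lo, hi):
    """Midpoint of the window, or None if the window is empty."""
    if lo > hi:
        return None
    return (lo + hi) / 2.0

# Hypothetical path at 100 MHz: 11 ns longest delay, 3 ns shortest delay.
lo, hi = permissible_range(d_max=11.0, d_min=3.0, period=10.0,
                           t_setup=0.2, t_hold=0.1)
skew = centered_skew(lo, hi)
```

A flop with several adjacent flops would need the intersection of all its
pairwise windows; that step is left out here to keep the sketch short.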

I used useful2 mode on my design because of one particular property.  As far
as the board-level I/O timing requirements were concerned, the design looked
as if it were zero-skew.  I could schedule the PLL reference to match the BSG
insertion delay, and my external interfaces were clean.  Doing this, of
course, ruled out any useful skew optimization on the I/O flops.

Useful3 mode allows ClockWise to schedule the IO flops in order to optimize
the internal setup and hold constraints, but does not take I/O constraints
into account.  This is something that would have just caused havoc on my
design.  Ultima has talked about developing a useful4 mode that would allow
I/O constraints to be factored into the schedule.  This is something I am
eagerly anticipating as I/O timing is an area I have had to struggle with
in the past.

Now, to give you an idea of the design:

    850K gates
    49 RAM instances
    13 mm die in 0.25 um 4-layer technology (LSI Logic LCBG11P)
    9 clock domains
    ~39K flops on main clock tree @ 100MHz
    Post-placement/IPO worst case timing slack was -1.2ns.

Normally this is the kind of result that would send you into a long
re-synthesis, placement, and IPO loop.  However, on this design ClockWise
was able to help
us get very close to placed timing closure *just* by inserting the clock
tree.

Here is a partial summary from the tool run ...

  IO/BSG skews (between 454 leaves)
      Max skew bound        :  0.300 ns
      Skew (min-min)        :  0.226 ns
      Skew (max-max)        :  0.226 ns

                                 without intra-BSG         with intra-BSG
  Safety margins              ---hold--  --setup--    ---hold--  --setup--
      Zero skew             : -0.013 ns  -1.664 ns    -0.013 ns  -1.664 ns
      Scheduled             : -0.601 ns   0.821 ns    -0.601 ns   0.821 ns
      Realized (min)        : -0.992 ns   0.623 ns    -0.992 ns   0.623 ns
      Realized (max)        : -0.992 ns   0.623 ns    -0.992 ns   0.623 ns

... and it requires some explanation.

I constrained the tool for a 300 ps max skew in the BSG group (of 454 I/O
flops).  It claimed to have achieved 226 ps (more on this later).

The safety margins section is split up into two main columns of numbers.
The "without intra-BSG" numbers show the timing for all paths excluding arcs
between flops in the BSG.  Since ClockWise can't use skew to improve timing
in the "intra-BSG" paths, they try to give you a breakout of the timing
that can't be improved and that which can.  On this design my most critical
paths were not among the "intra-BSG" set, and so the two sets are the same.

The numbers show the big advantage that using a local skew methodology
can provide.  With a non-existent tool that could have provided a perfect
zero-skew tree this design would have failed timing by 1.664 ns (according
to ClockWise).  Using skew, the tool scheduled for 0.821ns of positive slack
and claims to have built a tree achieving 0.623 ns.
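
For readers who have not seen skew used this way, here is a toy sketch of
the effect on a three-flop pipeline A -> B -> C.  The delays are invented
for illustration and have nothing to do with the design above; setup_slack()
and worst_slack() are hypothetical helpers, and hold checks are ignored.

```python
def setup_slack(period, t_launch, t_capture, d_max, t_setup=0.0):
    """Setup slack in ns for one launch/capture pair."""
    return period + t_capture - t_launch - d_max - t_setup

PERIOD = 10.0            # 100 MHz clock
D_AB, D_BC = 11.0, 8.0   # made-up worst-case stage delays in ns

def worst_slack(t_a, t_b, t_c):
    """Worst setup slack over both stages for the given clock arrival times."""
    return min(setup_slack(PERIOD, t_a, t_b, D_AB),
               setup_slack(PERIOD, t_b, t_c, D_BC))

zero_skew = worst_slack(0.0, 0.0, 0.0)  # A->B fails, B->C has spare time
skewed = worst_slack(0.0, 1.5, 0.0)     # clock B late: B->C lends time to A->B
```

With zero skew the worst slack is -1.0 ns; delaying B's clock by 1.5 ns
leaves both stages at +0.5 ns.  The gain only exists because the stage
after the critical one has slack to give up.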

A few caveats and notes are needed here:

  1) I was lucky enough on this design that my worst slacks happened to
     fall in pipeline stages where moving the clock around could improve
     them.  This is of course not always the case.

  2) One of the problems I ran into was that the timing numbers from
     ClockWise didn't completely correlate with the numbers I got using my
     vendor's global router, delay calculator, and PrimeTime.  There were a
     few reasons, including a bug in the version (1.0.12) that I was using
     that prevented ClockWise from legalizing the cells correctly.  (I had
     to legalize using other tools).  I'm told that has been fixed.

     Anyway, that's why I said that the worst case slack for the design was
     -1.2ns, even though the ClockWise report shows -1.6 ns.  The good news
     was that it was the same path.  Taking a look at 100 paths from a
     correlation viewpoint (what was scheduled vs. what my vendor tools plus
     PrimeTime said), I noticed ~200 ps average difference, with outliers in
     the 500 ps range.  Considering the bug and difference in tools used to
     reach the answers I would consider this pretty good.

  3) The hold numbers appear as if ClockWise failed to meet the constraint
     badly, but that isn't the case.  This design contained some
     "interesting" synchronous RAMs with a write-through feature that had
     a greater than 3 ns write-to-read clock constraint.  ClockWise
     appropriately scheduled the write clock earlier than the read clock,
     but couldn't (tool issue) schedule the write clock early enough to
     completely meet the constraint.  Thus the -0.992 ns hold violation.  On
     the whole ClockWise did a good job of scheduling so as to not violate
     hold, but at the expense of some setup violations it could have fixed.
     In the case of this design, it would have been much easier to add a few
     delay cells to fix hold problems rather than to fix the uncorrected
     setups.  But there is currently no way to drive the tool to do that.

  4) An annoying but not difficult problem that I had to work around was
     the fact that ClockWise puts flops on different trees with *any* arc
     between them into the BSG.  This is usually but not always what you
     would want.  For instance, I attempted to insert two trees in one run,
     a JTAG zero-skew tree, and the main clock tree useful2.  ClockWise put
     any flops with arcs between the JTAG and main tree into the BSG.  Not
     good, since this eliminates hundreds of flops that could potentially
     benefit from useful skew just because of test mode timing arcs.  There
     is currently no way to constrain the tool to avoid this, and so I had
     to do two runs, one for each tree.

There were some ancillary benefits from using ClockWise that should be
mentioned:

  1) It fuzzes out your clock, reducing peak power!

  2) It builds a real buffer tree (as opposed to trunk-based methodologies),
     so clock timing is less dependent on wire (not unique to ClockWise...)

  3) You can edit the schedule for special purpose stuff, e.g. create
     early taps for forwarded clocks, RAM write clock/read clock
     constraints, etc., and do it in nanoseconds instead of buffer levels.

Looking towards the future, I see a lot of very cool stuff that could be
done with a tool like this. Perhaps downsizing cells for power with the
additional margin provided post-clock insertion.  Or planning ahead for
certain pipeline stages to have greater than the average cycle time with the
knowledge that pre/post-ceding stages will have less.  RTL could be
partitioned differently, and/or synthesis constraints could be relaxed.

In any case it's a real working and helpful tool today.  And the support
Ultima provided on tool usage and issues was excellent.  I know of only one
other vendor that has said they are developing similar technology, claiming
they will have it released late this year.

    - Jon Stahl, Principal Engineer
      Avici Systems                             N. Billerica, MA

         ----    ----    ----    ----    ----    ----   ----

From: Mehmet Cirit 

Hi John,

I don't want to spoil your fun watching a point tool grow right in front of
you, but permit me to bring to your attention a few points:

  1. Adding more random skew to clocks increases noise on the supplies,
     through inductive and substrate coupling.  This is a nastier problem
     you don't want to mess with, especially with low supply voltages.
     Reduction in the peak current may be due to increased average latency.
     dI/dt is more critical than peak current.  A low skew clock
     network is better from a dI/dt point of view.

  2. Adding extra drivers, and the extra loads they may entail, exacerbates
     the so-called "too much power" problem.  20% more buffers most likely
     results in a 20% increase in capacitive load, possibly a 10% increase
     in overall power usage.

  3. I wonder if a smart router trying to balance loads may take the
     usefulness out of "useful" skew.
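
The arithmetic behind point 2 can be sketched as follows, under the
simplifying assumption that power scales linearly with switched capacitance
at fixed voltage and frequency (P ~ C * V^2 * f), ignoring leakage and
short-circuit current.  The function names are hypothetical.  On that
assumption, a 20% rise in clock capacitance producing a 10% rise in overall
power implies the clock network accounts for about half of total power.

```python
def overall_power_increase(clock_fraction, extra_clock_cap):
    """Fractional rise in total power when clock capacitance grows.

    clock_fraction:  share of total power burned in the clock network (0..1)
    extra_clock_cap: fractional increase in clock capacitance (0.20 = 20%)
    """
    return clock_fraction * extra_clock_cap

def implied_clock_fraction(overall_increase, extra_clock_cap):
    """Clock share of total power implied by the two percentages."""
    return overall_increase / extra_clock_cap

fraction = implied_clock_fraction(overall_increase=0.10, extra_clock_cap=0.20)
check = overall_power_increase(fraction, 0.20)
```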

I hope these insights help your readers better understand skew/power issues.

    - Dr. Mehmet A. Cirit, President
      Library Technologies, Inc.                 Saratoga, CA

"Relax. This is a discussion. Anything said here is just one engineer's opinion. Email in your dissenting letter and it'll be published, too."
Copyright 1991-2024 John Cooley. All Rights Reserved.


   !!!     "It's not a BUG,
  /o o\  /  it's a FEATURE!"
 (  >  )
  \ - / 
  _] [_     (jcooley 1991)