( ESNUG 361 Item 1 ) --------------------------------------------- [11/16/00]

From: Thad McCracken <thad@geocast.com>
Subject: Technical Details Of The Geocast TSMC 0.18 PKS/Ambit-RTL Tape-out

Hi, John,

I know your readers are design engineers, so I thought I'd send you my
detailed experiences using PKS on a tape-out.

I'm a chip designer, but my sidebar task here at Geocast is to be the
methodology guy, too.  I set up our PKS flow here and we recently did two
200 Kgate tape-outs to test the PKS flow with our in-house custom standard
cell library.  We used MOSIS TSMC 0.18 and TSMC's CyberShuttle service. 

We used PKS v3.0.19.  The design was 200K gates of synthesized logic.  Most
of its RTL was taken from a prior 0.35u chip I worked on that is now a part
of our Geobox product line.  We reworked our RTL slightly to accommodate
these additional 8 custom macros (also designed here at Geocast):

           - a clock multiplier
           - a DLL
           - 3 instances of 32x32 register file
           - 3 instances of 64x64 register file
           - 1 instance of a CAM

We also implemented some additional testability around the CAM.

These two test chips are our first chips on a 0.18u process.   With the TSMC
CyberShuttle test tape-out (our second test tape-out) we tweaked the clock
multiplier on the chip.

The original 0.35u implementation of this design ran at 100 MHz.  These 0.18u
tape-outs were implemented to a timing target of 160 MHz -- a goal largely
set by the speed at which the I/Os would run.  We ran the core design in
PKS at 200 MHz and the runtime went from 7 hours to 10 hours (as was to be
expected).

A few months ago there was some heated discussion in ESNUG about designing
chips hierarchically versus flat.  We're strongly in the hierarchical camp.
Our chips are typically too big to consider doing flat, and we like being
able to spin portions of the design w/o unnecessarily impacting unrelated
parts of the chip.  Since this 200 Kgate design was basically a single P&R
region, though, we ran it flat (in a physical sense, that is) through the
flow I'll describe below.

Given this as a starting point, for PKS we started with a chip floorplan,
in .def format, that had the following information:

   - pad cell placements
   - standard cell rows
   - custom macros pre-placed (with appropriate adjustments to
     standard cell rows)
   - endcap cells
   - all power gridding


Our Initial PKS Flow
--------------------

The details of the PKS portion of our flow look like:

  1.)  Set up all the system variables like floorplanning parameters,
       special net pin names, cell halos, default RAM halos, etc.

       Read in Verilog, Ambit compiled version of the .libs, .LEF files,
       PKS layer utilization tables.

         read_alf, read_library_update # (to read in compiled .lib files)
         read_lef, read_lef_update     # (to read in .LEF files)
         set_logic_0_net vss           # (set logic 0 and 1 net names)
         set_logic_1_net vdd
         read_layers_usages   # (reads layer utilization table, used by PKS
                              #  to estimate which layers a route will 
                              #  hit, and how many times it will change
                              #  layers, along its length.  This plays a
                              #  role in parasitic estimation.)

         read_ver                      # read in Verilog files
         do_build_generic              # Maps Verilog to "generic"
                                       # technology library

  2.)  Initialize floorplan (i.e. read in .def file).  Apply any manual
       overrides to the floorplan that PKS creates.

       For "early" PKS runs we generally let PKS figure out the block size,
       and initialize the floorplan only in terms of:

          - standard cell row spacing
          - standard cell row orientation (flip or not?)
          - I/O to core distance
          - aspect ratio
          - desired standard cell row utilization
          - halo to put around embedded macros

       These are all set with "set_floorplan_parameters."

       Later in the flow the floorplan for the block becomes more defined,
       containing the information listed above, and you want PKS to work 
       within that floorplan.  At this point floorplan initialization is
       done by just reading in the floorplan .def file, using "read_def".


  3.)  Apply timing constraints -- i.e. constrain all the normal stuff you'd
       constrain before doing timing-based synthesis:

           - define clock(s) and assign them to pins (set_clock,
             set_clock_arrival_time)
           - define input signal arrival times (set_data_arrival_time)
           - define output signal required times (set_data_required_time)
           - define output max slew (set_port_capacitance_limit)
           - define output capacitive load (set_port_capacitance)
           - define input driver characteristics (either set_drive_cell or
             set_slew_time + set_port_capacitance_limit).
 

  4.)  Pre-Placement Optimization

       a) Generic design optimization (do_xform_optimize_generic) - i.e. do
          any optimization that is possible prior to mapping the design to
          our 0.18 custom in-house library.

       b) Map the design to our custom 0.18 Geocast library (do_xform_map)

       c) Pre-placement optimization (do_xform_pre_placement_optimize_slack)

          This amounts to whatever optimization of the design that can be
          done prior to placing it.  My understanding of how this works is
          that PKS derives a WLM from process layer information (from the
          LEF file(s) you read in before), and other information, and uses
          that to do initial optimization of the design.  The goal here is
          to come up with a starting point for placement.


  5.)  Timing-driven placement (do_place -timing_driven true)

       Just the placement of the results of step #4, driven by the timing 
       constraints of the design.

  6.)  Post-placement optimization (do_xform_optimize_slack -pks)

       The goal here is to get the design to timing closure by iterating on
       the placement, netlist structure, etc.  PKS also does periodic
       congestion analysis on the design during this phase, and makes
       adjustments to the placement along those lines as well.
  
  7.)  ECO placement to legalize cell placements, fix any timing issues that
       result (do_xform_tcorr_eco)

       I understand placement at this point has lumped cells into "bins",
       and this step basically spreads the cells in any given bin out, snaps
       them to standard cell rows, etc.  Other commands allow access to each
       individual part of this process (do_placement_spread, do_place -eco).
       do_xform_tcorr_eco does both of these and a short round of timing
       correction to resolve any timing issues (usually small) that result
       from the placement spread/legalization.


  8.)  Export .def, .gcf, and Verilog netlists for use in the downstream
       tools (CTGEN, SE).  The .def represents the placement of the design
       (and additionally has all the information presented in the floorplan
       you started with).  The GCF file represents all block timing
       constraints to SE, and is used by QPOPT and Pearl for timing
       analysis.


Steps 4-6 can be performed with one "do_optimize -pks", or done separately
with the following commands:

    do_xform_optimize_generic
    do_xform_map -hierarchical
    do_xform_pre_placement_optimize_slack
    do_place -timing_driven true
    do_xform_optimize_slack -pks

I prefer this latter approach as it allows an .adb to be written out between
these steps, making experimentation with various switches and pre- and post-
placement optimization strategies faster.


Through CTGEN and Silicon Ensemble
----------------------------------

We then took our design through CTGEN/SE as follows:

  1.)  Clock tree insertion using CTGEN

       CTGEN pretty much uses the output DEF from PKS, and constraints for
       the clock tree(s) you're inserting (skew, max_transition, min/max
       insertion delay) to come up with a clock tree structure that
       [hopefully] meets those constraints, and then places the structure
       in the design.  In general I found that CTGEN did a good job with
       clock tree insertion.  Some things I found that gave it a hard time
       were:

         - small slivers of standard cells between, for instance, 
           a hard macro and block boundary.  CTGEN sometimes had 
           difficulty balancing skew to synchronous elements placed
           in those slivers.  In general we try to avoid this by
           not creating this situation in the first place.  If that
           is not possible then some manual definition of the clock 
           tree structure (using define_structure in the CTGEN 
           constraint file) can help get better results.

         - For high-placement density designs or timing-critical 
           designs that end up having localized areas w/very high
           placement density, CTGEN can sometimes have a hard time
           meeting constraints w/o moving standard cells a long way
           (to make room for the clock tree).  I talk about this
           issue and how we worked around it more below.
       
  2.)  Import design (with clock tree) into Silicon Ensemble
 
  3.)  Detailed scan chain hookup done with SE TROUTE
 
  4.)  Timing constraints applied (GCF from PKS)
 
  5.)  Do a QPOPT pass to fix PKS timing violations that result from clock
       tree insertion and scan chain hookup.

       We basically try to rely on QPOPT for as little as possible - 
       i.e. don't expect it to fix really significant timing problems.
       We did use it to fix up transition violations on our scan chains.
       (PKS never sees these since the chain ordering is done in SE, and
       it doesn't see the small transition or timing violations that
       occur as a result of clock tree insertion either.  Cells get moved
       by some amount and some problems are bound to pop up.)  On the tapeouts
       QPOPT turned out to not be critically necessary, and we generally
       shoot for this on all our P&R blocks.  Most post-route timing
       correlation problems can usually be tracked back to poorly placed
       macros or pins or some other thing that is better fixed at the
       source (as opposed to relying on QPOPT to fix them).

  6.)  Timing-driven WROUTE

  7.)  Hyperextract parasitic extraction, final timing analysis using Pearl.

       Of course everything was wrapped up from this point with LVS, DRC,
       etc.  No significant problems were encountered here.
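
The localized placement density issue noted under CTGEN step 1 above can
be checked for before clock tree insertion.  The following is only a rough
sketch of the idea: it bins placed cells on a coarse grid and flags bins
with little free area.  The cell tuples, the 100 um bin size, and the 90%
limit are hypothetical, and the function names are made up for this
example; none of this reflects how PKS or CTGEN measure density internally.

```python
# Sketch: bin placed cells on a coarse grid and flag bins whose local
# utilization leaves little room for clock tree buffers.
# Cell coordinates, areas, bin size, and the 90% limit are hypothetical.
from collections import defaultdict

def bin_utilization(cells, bin_um, row_fraction=1.0):
    """Map (bin_x, bin_y) -> placed cell area / usable bin area.

    cells is an iterable of (x_um, y_um, area_um2) tuples; each cell is
    charged entirely to the bin that contains its origin.
    """
    usable = bin_um * bin_um * row_fraction
    area = defaultdict(float)
    for x, y, a in cells:
        area[(int(x // bin_um), int(y // bin_um))] += a
    return {b: total / usable for b, total in area.items()}

def hot_bins(util, limit=0.90):
    """Bins whose utilization exceeds the limit, in sorted order."""
    return sorted(b for b, u in util.items() if u > limit)

placed = [
    (10.0, 10.0, 5000.0),    # lands in bin (0, 0)
    (50.0, 60.0, 4500.0),    # also bin (0, 0)
    (150.0, 20.0, 6000.0),   # lands in bin (1, 0)
]
util = bin_utilization(placed, bin_um=100.0)
print(util)
print("bins over limit:", hot_bins(util))
```

A real check would read cell locations and sizes from the placed .def and
would account for cells that straddle bin boundaries.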


Post-Route Timing Results
-------------------------

Our post-route timing (based on Hyperextract parasitics) was within 160 ps
of the PKS-estimated timing, so routed timing was within 3% (in the noise in
my opinion).  Overconstraining synthesis relative to the "real" timing
targets can get this back if it's really important.  The really cool thing
about all this is that we got this result with one pass through synthesis
and P&R.  I've not ever experienced that with a WLM based synthesis/timing-
driven P&R flow (has anybody?).  This saved us a ton of time implementing
this chip.
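
As a quick sanity check on the 3% figure, the arithmetic can be sketched
as below.  Which clock period the percentage is measured against is an
assumption: the post mentions both the 160 MHz timing target and the
200 MHz PKS run, so both are computed.  The function names are just for
this example.

```python
# Express a timing miscorrelation as a fraction of the clock period.
# The 160 ps figure and the two frequencies come from the post; which
# period the "3%" refers to is an assumption, so both are computed.

def period_ps(freq_mhz: float) -> float:
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1.0e6 / freq_mhz

def error_pct(error_ps: float, freq_mhz: float) -> float:
    """Timing error as a percentage of the clock period."""
    return 100.0 * error_ps / period_ps(freq_mhz)

for f in (160.0, 200.0):
    print(f"{f:.0f} MHz: period {period_ps(f):.0f} ps, "
          f"160 ps is {error_pct(160.0, f):.1f}% of the period")
```

Either way the miscorrelation lands at roughly 3% of the cycle.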

Another very cool thing to note about this PKS flow was that it was not
necessary to do any detailed floorplanning to get to timing closure and a
routable design.  This block used to require quite a bit of "micro-
floorplanning" to get timing closure, and it had significant routing
congestion problems.  I did no such floorplanning on this chip, other than
preplacing all of its custom blocks.

I feel it important to note some of the things that contributed to the tight
timing correlation we got between PKS and post-route timings.  PKS basically
uses a routing layer utilization table and characterization of how often a
route typically changes layers, along with the area and fringe cap
coefficients from the LEF file for each layer to estimate the R and C of a
wire.  (If you're using an external vendor's LEF then presumably their
coefficients are based on some real data, and possibly don't need to be
mucked with.)  We ran several testcases of various sizes & routing densities
through PKS early on and used SE routing reports to derive the layer
utilization tables.  We then used Hyperextract runs on the same blocks to
come up with LEF coefficients for each routing layer.  I found this really
helpful in getting tight timing correlation.
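
The kind of "simple math" involved can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual Geocast
scripts: the per-layer routed lengths, wire widths, and coefficients are
made-up placeholders, the dictionary layout is not the PKS layer usage
table format, and via (layer change) contributions are ignored.  The cap
model is a plain area term plus a two-edge fringe term.

```python
# Sketch: derive layer utilization fractions from routed wire length per
# layer, then estimate a wire's capacitance as a utilization-weighted mix
# of per-layer area and fringe terms.
# All numbers below are hypothetical placeholders, not real process data.

def layer_utilization(routed_um: dict) -> dict:
    """Fraction of the total routed length found on each layer."""
    total = sum(routed_um.values())
    if total <= 0:
        raise ValueError("no routed length reported")
    return {layer: length / total for layer, length in routed_um.items()}

def wire_cap_ff(length_um, util, width_um, area_cap, fringe_cap):
    """First-order capacitance estimate in fF for a wire of given length.

    area_cap is in fF per square micron, fringe_cap in fF per micron of
    edge; each layer contributes in proportion to its utilization.
    """
    cap = 0.0
    for layer, frac in util.items():
        seg = length_um * frac
        cap += seg * width_um[layer] * area_cap[layer]   # plate term
        cap += 2.0 * seg * fringe_cap[layer]             # two long edges
    return cap

routed = {"metal1": 1000.0, "metal2": 4000.0, "metal3": 5000.0}
util = layer_utilization(routed)
width = {"metal1": 0.25, "metal2": 0.25, "metal3": 0.25}
area = {"metal1": 0.04, "metal2": 0.03, "metal3": 0.02}
fringe = {"metal1": 0.05, "metal2": 0.04, "metal3": 0.04}
print(util)
print(wire_cap_ff(500.0, util, width, area, fringe), "fF")
```

Tuning then amounts to adjusting the coefficients until estimates like
this line up with the extracted values for a representative testcase.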

Once things were tuned, I found that ~95% of the nets had PKS estimated
capacitance within 10% of Hyperextract cap, and nearly all of the remaining
nets were within 20%.  Note that I didn't make any adjustments here for very
short nets, which tended to have much larger error (as expected).   While
this correlation effort might sound involved, it's really as simple as
taking the coefficients from a log file from a Hyperextract run, and doing
some simple math on numbers easily gotten from SE routing reports.  The real
key is to pick a testcase that has a reasonable routing density, as this
will obviously affect the fringe cap coefficients you end up with in the
LEF.  I've done this correlation effort once and not had to change anything
since, and we've pushed quite a few blocks through the tools at this point.
This is a big improvement over having to do old fashioned custom WLM's on a
block-per-block basis.
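
A correlation check of the sort described above (what fraction of nets
have estimated cap within 10% or 20% of extracted cap) might be sketched
like this.  The net names and capacitance values are invented for
illustration, and how the two sets of numbers get pulled out of the PKS
and Hyperextract reports is left out entirely.

```python
# Sketch: compare estimated vs. extracted net capacitance and report what
# fraction of nets fall within given error bands.
# The net names and values are hypothetical, for illustration only.

def pct_error(estimated: float, extracted: float) -> float:
    """Absolute error of the estimate as a percentage of extracted cap."""
    return abs(estimated - extracted) / extracted * 100.0

def fraction_within(pairs: dict, limit_pct: float) -> float:
    """Fraction of nets whose estimate is within limit_pct of extraction."""
    hits = sum(1 for est, ext in pairs.values()
               if pct_error(est, ext) <= limit_pct)
    return hits / len(pairs)

# net name -> (estimated cap, extracted cap), both in fF
caps = {
    "net_a": (10.5, 10.0),   # estimate is high by about 5%
    "net_b": (19.0, 20.0),   # estimate is low by about 5%
    "net_c": (34.5, 30.0),   # estimate is high by about 15%
    "net_d": (2.0, 4.0),     # very short net, estimate is low by 50%
}
for limit in (10.0, 20.0):
    print(f"within {limit:.0f}%: {fraction_within(caps, limit):.0%} of nets")
```

Very short nets, like net_d here, are the ones to expect in the tail.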


Lessons Learned
---------------

It's probably also good to note some of the problems we ran into with PKS
and SE, and how we were able to work around them. 

   1.)  Ambit/PKS does not directly support inference of master-slave
        flops, on which our library is based.  What this basically meant
        was that one of our clocks was left unconnected after initial
        mapping to our standard cell library.  This turned out to be
        quite easy to work around, by using a TCL script within the
        Ambit shell to hook up the second clock tree after mapping,
        and before timing-driven optimization of the design.

   2.)  Our scan methodology was also not supported by PKS.  We don't use
        the popular MUX-scan methodology that a lot of people use,
        and instead use an LSSD method in which each synchronous element
        has separate scan clocks, allowing for synchronous elements
        of all types (falling and rising edge flops, latches) to be
        put on the same scan chain, and be fully scannable.  PKS couldn't
        deal with this, so we were left with using the script mentioned
        in #1 to hook up the scan clocks post-map, and then use SE's
        TROUTE to do the placement-based scan ordering.  Overall I was
        pretty impressed by the flexibility allowed in the Ambit shell
        for doing these sorts of things.  Being able to do this allowed
        us to basically integrate some of the less-standard parts of
        our flow into PKS in a very acceptable way, at the right point
        in the design flow.

        I should note that PKS v4.x appears to have some additional
        ability w/regard to scan support, and for placement-based scan
        ordering.  I haven't had time yet to evaluate whether or not
        it eliminates the need for some of the workarounds we've implemented
        to deal with these shortcomings in v3.x.

   3.)  The PKS v3.x timing engine did not degrade slew over the RC of a
        net, which caused some problems for long and/or large fanout nets.
        The biggest place I've seen this pop up is on long routes to P&R
        block output ports that have a fairly high lumped capacitive
        load.  We worked around this in two ways:

          - QPOPT has been able to fix some of these types of problems (if
            they can be fixed w/ a simple upsize)
          - The placer in PKS can take net-weights, and application of a
            high weight to nets that connect to block output ports limits
            the amount by which the slew can degrade.  One can set weights
            on these nets by defining a TCL list of nets and using the
            "-net_weight" switch of do_place:

              do_place -timing_driven true -net_weight "$output_nets 10"
 
   
        PKS v4.0 appears to account for this slew degradation, which 
        eliminates this problem altogether.  I haven't fully investigated
        this but initial trials on 4.0 do appear not to have this problem,
        and timing reports do show degraded slew over longer nets.

   4.)  For high-utilization designs or designs that are *really* tight
        on timing, we had problems with local placement utilization being
        too high, leaving no room for the clock tree!  The end result
        was that cells got moved larger distances to make room for clock
        tree buffers, which degraded post-route timing results, and in
        some cases routing congestion.  For placement utilizations up
        to 90%, we don't generally see this problem (the chip we taped out
        was ~90% placement utilization).  We've been able to work up to
        the mid-90's by passing some switch options to the placer that are
        supposed to leave room for clock tree buffers.  PKS 4.0 appears to
        have even more knobs that may help in this area, but I haven't had
        a chance to try them yet.

   5.)  Some iteration was required to get a floorplan that placed all
        the custom macros in a way that did not result in routing congestion
        around those macros -- especially if the macros account for a large
        percentage of the area of the chip.  I guess I don't really see this
        as a shortcoming in any tool as much as a fact of life, though.
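
To see why the slew issue in lesson 3 shows up on long routes with a high
lumped load, and why pulling those nets shorter with net weights helps, a
first-order model is sketched below.  It is one common textbook-style
approximation (Elmore delay plus a root-sum-square slew combination), not
a description of what the PKS v4.0 timing engine does.  The per-micron
resistance and capacitance, the 200 fF load, and the 100 ps input slew are
all hypothetical values.

```python
# Sketch: first-order slew degradation along a wire driving a lumped load.
# The Elmore delay of wire plus load sets a time constant; a single-pole
# step response has a 10-90% transition of ln(9) * tau, which is combined
# root-sum-square with the input slew.
# Per-micron R and C, the load, and the input slew are hypothetical values.
import math

R_OHM_PER_UM = 0.1   # assumed wire resistance per micron
C_FF_PER_UM = 0.2    # assumed wire capacitance per micron

def elmore_ps(length_um: float, load_ff: float) -> float:
    """Elmore delay (ps) at the far end of the wire."""
    r_wire = R_OHM_PER_UM * length_um
    c_wire = C_FF_PER_UM * length_um
    # ohm * fF = 1e-15 s = 1e-3 ps
    return r_wire * (c_wire / 2.0 + load_ff) * 1.0e-3

def degraded_slew_ps(slew_in_ps, length_um, load_ff):
    """Estimated slew (ps) at the far end for a given input slew."""
    step_slew = math.log(9.0) * elmore_ps(length_um, load_ff)
    return math.hypot(slew_in_ps, step_slew)

for length in (500.0, 3000.0):
    print(f"{length:.0f} um: {degraded_slew_ps(100.0, length, 200.0):.0f} ps")
```

Because the wire term grows with the square of length, a timing engine
that ignores it will look fine on short nets and badly optimistic on the
long ones.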


Overall I was really impressed with PKS and how it fit into our flow.
Formerly we'd been using Ambit/SE, and PKS fit right into this.  In general
database handoff between PKS, CTGEN, and SE was pretty seamless.  Especially
helpful was the support I received from Cadence as we were putting our PKS
flow together.  I can honestly say it's the best support I've received
from any EDA company to date, in terms of both quality and timeliness.

    - Thad McCracken
      Geocast Networks Systems, Inc.             Beaverton, OR

