( ESNUG 361 Item 1 ) --------------------------------------------- [11/16/00]
From: Thad McCracken <thad@geocast.com>
Subject: Technical Details Of The Geocast TSMC 0.18 PKS/Ambit-RTL Tape-out
Hi, John,
I know your readers are design engineers, so I thought I'd send you my
detailed experiences using PKS on a tape-out.
I'm a chip designer, but my sidebar task here at Geocast is also to be the
methodology guy. I set up our PKS flow here and we recently did two
200 Kgate tape-outs to test the PKS flow with our in-house custom standard
cell library. We used MOSIS TSMC 0.18 and TSMC's CyberShuttle service.
We used PKS v3.0.19. The design was 200K gates of synthesized logic. Most
of its RTL was taken from a prior 0.35u chip I worked on that is now part
of our Geobox product line. We reworked the RTL slightly to accommodate
these 8 additional custom macros (also designed here at Geocast):
- a clock multiplier
- a DLL
- 3 instances of 32x32 register file
- 3 instances of 64x64 register file
- 1 instance of a CAM
We also implemented some additional testability around the CAM.
These two test chips are our first chips on a 0.18u process. With the TSMC
CyberShuttle test tape-out (our second test tape-out) we tweaked the clock
multiplier on the chip.
The original 0.35u implementation of this design ran at 100 MHz. These 0.18u
tape-outs were implemented to a timing target of 160 MHz -- a goal largely
set by the speed at which the I/Os would run. We ran the core design in
PKS at 200 MHz, and the runtime went from 7 hours to 10 hours (as was to be
expected).
A few months ago there was some heated discussion in ESNUG about designing
chips hierarchically versus flat. We're strongly in the hierarchical camp.
Our chips are typically too big to consider doing flat, and we like being
able to spin portions of the design w/o unnecessarily impacting unrelated
parts of the chip. Since this 200 Kgate design was basically a single P&R
region, though, we ran it through the flow I'll describe below flat (in a
physical sense, that is).
Given this as a starting point, we fed PKS a chip floorplan, in .def
format, with the following information:
- pad cell placements
- standard cell rows
- custom macros pre-placed (with appropriate adjustments to
standard cell rows)
- endcap cells
- all power gridding
Our Initial PKS Flow
--------------------
The PKS portion of our flow looks like this:
1.) Set up all the system variables like floorplanning parameters,
special net pin names, cell halos, default RAM halos, etc.
Read in Verilog, Ambit compiled version of the .libs, .LEF files,
PKS layer utilization tables.
read_alf, read_library_update # (to read in compiled .lib files)
read_lef, read_lef_update # (to read in .LEF files)
set_logic_0_net vss # (set logic 0 and 1 net names)
set_logic_1_net vdd
read_layers_usages # (reads layer utilization table, used by PKS
# to estimate which layers a route will
# hit, and how many times it will change
layers, along its length. This plays a
# role in parasitic estimation.)
read_ver # read in Verilog files
do_build_generic # Maps Verilog to "generic"
# technology library
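In script form the read-in portion looks roughly like this -- the file
names are just placeholders, and the exact argument syntax may differ a
bit from what I show here:

    read_alf            geocast_018.alf      ;# compiled .lib data
    read_library_update geocast_018_eco.alf  ;# library updates
    read_lef            geocast_018.lef      ;# standard cell LEF
    read_lef_update     macros.lef           ;# custom macro LEF
    set_logic_0_net     vss
    set_logic_1_net     vdd
    read_layers_usages  layer_usage.tbl      ;# layer utilization table
    read_ver            chip_top.v           ;# the RTL
    do_build_generic                         ;# map to generic library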
2.) Initialize the floorplan (i.e. read in the .def file), or do manual
over-rides of the floorplan PKS comes up with.
For "early" PKS runs we generally let PKS figure out the block size,
and initialize the floorplan only in terms of:
- standard cell row spacing
- standard cell row orientation (flip or not?)
- I/O to core distance
- aspect ratio
- desired standard cell row utilization
- halo to put around embedded macros
These are all set with "set_floorplan_parameters."
Later in the flow the floorplan for the block becomes more defined,
containing the information listed above, and you want PKS to work
within that floorplan. At this point floorplan initialization is
done by just reading in the floorplan .def file, using "read_def".
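For the "early" runs, then, step 2 is a single set_floorplan_parameters
call along these lines (the switch names below are illustrative, from
memory -- check the set_floorplan_parameters documentation for the real
ones -- and the numbers are made up):

    set_floorplan_parameters -row_utilization 0.85 \
                             -aspect_ratio    1.0  \
                             -io_to_core      100  \
                             -macro_halo      25

and for the later runs it collapses to just reading the floorplan back in:

    read_def chip_floorplan.def   ;# pads, rows, macros, power grid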
3.) Apply timing constraints -- i.e. constrain all the normal stuff you'd
constrain before doing timing-based synthesis:
- define clock(s) and assign them to pins (set_clock,
set_clock_arrival_time)
- define input signal arrival times (set_data_arrival_time)
- define output signal required times (set_data_required_time)
- define output max slew (set_port_capacitance_limit)
- define output capacitive load (set_port_capacitance)
- define input driver characteristics (either set_drive_cell or
set_slew_time + set_port_capacitance_limit).
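Put together, the constraint section of the script looks something like
the following. The numbers are made up, the port names and the BUFX4
drive cell are hypothetical, and the argument order is approximate --
the command names themselves are the ones listed above:

    set_clock              core_clk -period 5.0 -waveform {0 2.5}
    set_clock_arrival_time -clock core_clk 0.0 clk
    set_data_arrival_time  -clock core_clk 1.0 din
    set_data_required_time -clock core_clk 4.0 dout
    set_port_capacitance   0.5 dout
    set_drive_cell         -cell BUFX4 din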
4.) Pre-Placement Optimization
a) Generic design optimization (do_xform_optimize_generic) - i.e. do
any optimization that is possible prior to mapping the design to
our 0.18 custom in-house library.
b) Map the design to our custom 0.18 Geocast library (do_xform_map)
c) Pre-placement optimization (do_xform_pre_placement_optimize_slack)
This amounts to whatever optimization of the design can be
done prior to placing it. My understanding of how this works is
that PKS derives a WLM from process layer information (from the
LEF file(s) you read in before), and other information, and uses
that to do initial optimization of the design. The goal here is
to come up with a starting point for placement.
5.) Timing-driven placement (do_place -timing_driven true)
Just the placement of the results of step #4, driven by the timing
constraints of the design.
6.) Post-placement optimization (do_xform_optimize_slack -pks)
The goal here is to get the design to timing closure by iterating on
the placement, netlist structure, etc. PKS also does periodic
congestion analysis on the design during this phase, and makes
adjustments to the placement along those lines as well.
7.) ECO placement to legalize cell placements, fix any timing issues that
result (do_xform_tcorr_eco)
I understand placement at this point has lumped cells into "bins",
and this step basically spreads the cells in any given bin out, snaps
them to standard cell rows, etc. Other commands allow access to each
individual part of this process (do_placement_spread, do_place -eco).
do_xform_tcorr_eco does both of these and a short round of timing
correction to resolve any timing issues (usually small) that result
from the placement spread/legalization.
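The decomposed version of this step is roughly the two commands below;
the short round of timing correction is what do_xform_tcorr_eco adds on
top of them:

    do_placement_spread   ;# spread the cells in each bin out and snap
                          ;# them to the standard cell rows
    do_place -eco         ;# ECO placement to legalize the result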
8.) Export .def, .gcf, and Verilog netlists for use in the downstream
tools (CTGEN, SE). The .def represents the placement of the design
(and additionally has all the information presented in the floorplan
you started with). The GCF file represents all block timing
constraints to SE, and is used by QPOPT and Pearl for timing
analysis.
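The export itself is a handful of write commands, something like the
following -- write_def and write_verilog should be close to right, but
the exact command for dumping the GCF constraints may be named
differently in your version:

    write_def     block_placed.def   ;# placement + original floorplan
    write_verilog block_placed.v     ;# netlist for CTGEN/SE
    write_gcf     block.gcf          ;# timing constraints for SE/QPOPT/Pearl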
Steps 4-6 can be performed with one "do_optimize -pks", or done separately
with the following commands:
do_xform_optimize_generic
do_xform_map -hierarchical
do_xform_pre_placement_optimize_slack
do_place -timing_driven true
do_xform_optimize_slack -pks
I prefer this latter approach as it allows an .adb to be written out between
these steps, making experimentation with various switches and pre- and post-
placement optimization strategies faster.
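With the steps broken out like that, I just drop an .adb checkpoint
between them so an experiment can restart from any point -- I believe
the command is write_adb, and the file names are whatever you like:

    do_xform_optimize_generic
    do_xform_map -hierarchical
    do_xform_pre_placement_optimize_slack
    write_adb block_preplace.adb   ;# restart point for placement trials
    do_place -timing_driven true
    write_adb block_placed.adb     ;# restart point for post-placement
                                   ;# optimization experiments
    do_xform_optimize_slack -pks
    write_adb block_pks.adb        ;# final PKS database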
Through CTGEN and Silicon Ensemble
----------------------------------
We then took our design through CTGEN/SE as follows:
1.) Clock tree insertion using CTGEN
CTGEN pretty much uses the output DEF from PKS, and constraints for
the clock tree(s) you're inserting (skew, max_transition, min/max
insertion delay) to come up with a clock tree structure that
[hopefully] meets those constraints, and then places the structure
in the design. In general I found that CTGEN did a good job with
clock tree insertion. Some things I found that gave it a hard time
were:
- small slivers of standard cells between, for instance,
a hard macro and the block boundary. CTGEN sometimes had
difficulty balancing skew to synchronous elements placed
in those slivers. In general we try to avoid creating
this situation in the first place. If that is not
possible, then some manual definition of the clock tree
structure (using define_structure in the CTGEN
constraint file) can help get better results.
- For high-placement density designs or timing-critical
designs that end up having localized areas w/very high
placement density, CTGEN can sometimes have a hard time
meeting constraints w/o moving standard cells a long way
(to make room for the clock tree). I talk more about this
issue, and how we worked around it, below.
2.) Import design (with clock tree) into Silicon Ensemble
3.) Detailed scan chain hookup done with SE TROUTE
4.) Timing constraints applied (GCF from PKS)
5.) Do a QPOPT pass to fix PKS timing violations that result from clock
tree insertion and scan chain hookup.
We basically try to rely on QPOPT for as little as possible -
i.e. don't expect it to fix really significant timing problems.
We did use it to fix up transition violations on our scan chains.
(PKS never sees these since the chain ordering is done in SE, and it
also doesn't see the small transition or timing violations that occur
as a result of clock tree insertion -- cells get moved by some amount,
and some problems are bound to pop up.) On these tape-outs QPOPT
turned out not to be critically necessary, and we generally shoot for
that on all our P&R blocks. Most post-route timing
correlation problems can usually be tracked back to poorly placed
macros or pins or some other thing that is better fixed at the
source (as opposed to relying on QPOPT to fix them).
6.) Timing-driven WROUTE
7.) Hyperextract parasitic extraction, final timing analysis using Pearl.
Of course everything was wrapped up from this point with LVS, DRC,
etc. No significant problems were encountered here.
Post-Route Timing Results
-------------------------
Our post-route timing (based on Hyperextract parasitics) was within 160 ps
of the PKS-estimated timing, so routed timing was within 3% (in the noise,
in my opinion). Overconstraining synthesis relative to the "real" timing
targets can get this back if it's really important. The really cool thing
about all this is that we got this result with one pass through synthesis
and P&R. I've never experienced that with a WLM-based synthesis/timing-
driven P&R flow (has anybody?). This saved us a ton of time implementing
this chip.
Another very cool thing to note about this PKS flow was that it was not
necessary to do any detailed floorplanning to get to timing closure and a
routable design. This block used to require quite a bit of "micro-
floorplanning" to get timing closure, and it had significant routing
congestion problems. I did no such floorplanning on this chip, other than
preplacing all of its custom blocks.
I feel it important to note some of the things that contributed to the tight
timing correlation we got between PKS and post-route timings. PKS basically
uses a routing layer utilization table and characterization of how often a
route typically changes layers, along with the per-layer area and fringe
cap coefficients from the LEF file, to estimate the R and C of a
wire. (If you're using an external vendor's LEF then presumably their
coefficients are based on some real data, and possibly don't need to be
mucked with.) We ran several testcases of various sizes & routing densities
through PKS early on and used SE routing reports to derive the layer
utilization tables. We then used Hyperextract runs on the same blocks to
come up with LEF coefficients for each routing layer. I found this really
helpful in getting tight timing correlation.
Once things were tuned, I found that ~95% of the nets had PKS estimated
capacitance within 10% of Hyperextract cap, and nearly all of the remaining
nets were within 20%. Note that I didn't make any adjustments here for very
short nets, which tended to have much larger error (as expected). While
this correlation effort might sound involved, it's really as simple as
taking the coefficients from a log file from a Hyperextract run, and doing
some simple math on numbers easily gotten from SE routing reports. The real
key is to pick a testcase that has a reasonable routing density, as this
will obviously affect the fringe cap coefficients you end up with in the
LEF. I've done this correlation effort once and not had to change anything
since, and we've pushed quite a few blocks through the tools at this point.
This is a big improvement over having to do old-fashioned custom WLMs on a
block-by-block basis.
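To give a feel for how little math is involved, here is a sketch of the
kind of calculation I mean. The report numbers are made up (you would
pull the real ones out of the SE routing report and the Hyperextract
log for your testcase), and it flattens area and fringe into one
cap-per-micron number just to show the idea:

    # Routed wirelength per layer (um), from the SE routing report
    array set route_len { metal1 120000  metal2 910000
                          metal3 870000  metal4 310000 }

    set total 0.0
    foreach layer [array names route_len] {
        set total [expr {$total + $route_len($layer)}]
    }

    # Fraction of routing on each layer -> PKS layer utilization table
    foreach layer [array names route_len] {
        puts [format "%-7s usage = %.2f" $layer \
             [expr {$route_len($layer) / $total}]]
    }

    # Effective cap per micron on metal2: total extracted cap on that
    # layer (from the Hyperextract log, in pF) over routed length (um)
    set hx_cap(metal2) 182.0
    puts [format "metal2 cap = %.3f fF/um" \
         [expr {$hx_cap(metal2) * 1000.0 / $route_len(metal2)}]]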
Lessons Learned
---------------
It's probably also good to note some of the problems we ran into with PKS
and SE, and how we were able to work around them.
1.) Ambit/PKS does not directly support inference of master-slave
flops, on which our library is based. What this basically meant
was that one of our clocks was left unconnected after initial
mapping to our standard cell library. This turned out to be
quite easy to work around by using a TCL script within the
Ambit shell to hook up the second clock tree after mapping
and before timing-driven optimization of the design. (A
sketch of this hookup script appears after this list.)
2.) Our scan methodology was also not supported by PKS. We don't use
the popular MUX-scan methodology; instead we use an LSSD method
in which each synchronous element has separate scan clocks,
allowing synchronous elements of all types (falling and rising
edge flops, latches) to be put on the same scan chain and be
fully scannable. PKS couldn't deal with this, so we were left
with using the script mentioned in #1 to hook up the scan clocks
post-map, and then using SE's TROUTE to do the placement-based
scan ordering. Overall I was
pretty impressed by the flexibility allowed in the Ambit shell
for doing these sorts of things. Being able to do this allowed
us to basically integrate some of the less-standard parts of
our flow into PKS in a very acceptable way, at the right point
in the design flow.
I should note that PKS v4.x appears to have some additional
ability w/regard to scan support, and for placement-based scan
ordering. I haven't had time yet to evaluate whether or not
it eliminates the need for some of the workarounds we've implemented
to deal with these shortcomings in v3.x.
3.) The PKS v3.x timing engine did not degrade slew over the RC of a
net, which caused some problems for long and/or large fanout nets.
The biggest place I've seen this pop up is on long routes to P&R
block output ports that have a fairly high lumped capacitive
load. We worked around this in two ways:
- QPOPT has been able to fix some of these types of problems (if
they can be fixed w/ a simple upsize)
- The placer in PKS can take net-weights, and application of a
high weight to nets that connect to block output ports limits
the amount by which the slew can degrade. One can set weights
on these nets by defining a TCL list of nets and using the
"-net_weight" switch of do_place:
do_place -timing_driven true -net_weight "$output_nets 10"
PKS v4.0 appears to account for this slew degradation, which
eliminates this problem altogether. I haven't fully investigated
this but initial trials on 4.0 do appear not to have this problem,
and timing reports do show degraded slew over longer nets.
4.) For high-utilization designs or designs that are *really* tight
on timing, we had problems with local placement utilization being
too high, leaving no room for the clock tree! The end result
was that cells got moved larger distances to make room for clock
tree buffers, which degraded post-route timing results and in
some cases caused routing congestion. For placement utilizations up
to 90%, we don't generally see this problem (the chip we taped out
was ~90% placement utilization). We've been able to work up to
the mid-90's by passing some switch options to the placer that are
supposed to leave room for clock tree buffers. PKS 4.0 appears to
have even more knobs that may help in this area, but I haven't had
a chance to try them yet.
5.) Some iteration was required to get a floorplan that placed all
the custom macros in a way that did not result in routing congestion
around those macros -- especially if the macros account for a large
percentage of the area of the chip. I guess I don't really see this
as a shortcoming in any tool as much as a fact of life, though.
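Finally, here is the shape of the post-map hookup script referred to in
#1 and #2 above. The find command is real ac_shell (though I'm
paraphrasing its switches), hook_up_pin is a hypothetical stand-in for
whatever netlist-editing command your PKS version provides for tying a
pin to a net, and the pin and net names are made up:

    # After do_xform_map, before any timing-driven optimization:
    # connect the slave-phase clock and the LSSD scan clocks that the
    # mapper left unconnected on each flop.
    foreach flop [find -cells *dff*] {
        hook_up_pin $flop CKB  clk_b        ;# slave-phase clock
        hook_up_pin $flop SCKA scan_clk_a   ;# LSSD scan clock A
        hook_up_pin $flop SCKB scan_clk_b   ;# LSSD scan clock B
    }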
Overall I was really impressed with PKS and how it fit into our flow.
Formerly we'd been using Ambit/SE, and PKS fit right into this. In general
database handoff between PKS, CTGEN, and SE was pretty seamless. Especially
helpful was the support I received from Cadence as we were putting our PKS
flow together. I can honestly say it's the best support I've received
from any EDA company to date, in terms of both quality and timeliness.
- Thad McCracken
Geocast Networks Systems, Inc. Beaverton, OR