( ESNUG 398 Item 3 ) --------------------------------------------- [07/31/02]
Subject: ( ESNUG 395 #5 ) Should Layout Or DC Buffer Up 40 Fanout Nets?
> I'm looking for a guideline and/or experiences for large blocks or small
> chips on when to allow Design Compiler to buffer high fanout nets, versus
> having your physical tool insert a buffer tree later in the design flow.
>
> In other words, if a net has 40 loads, should the Design Compiler be
> allowed to buffer the net, or should it be classed as an ideal_net? What
> about 100 loads? 200 loads? Obviously these aren't clocks, nor a master
> scan test enable that goes to every flop. These are nets in the grey area
> in between. We've had some congestion problems with an internal reset
> that had a large fanout. It was an easy fix to have layout insert a
> buffer tree, but the question came back as to what threshold should
> be used.
>
> - Wayne Miller
> Standard Microsystems Corporation
From: John Phillips <jphillip@matrox.com>
Hi, John,
If you do not allow DC to buffer, you will get a large delay through the
driver. If you declare the net an ideal net in DC, you will get an
unrealistically small delay. Either way, you will not get a good estimate
of your post-route timing. I'm not sure if you are using PhysOpt, but it
should be free to manipulate the buffer trees as long as the constraints
are met.
- John Phillips
Matrox Tech, Inc. Boca Raton, FL
---- ---- ---- ---- ---- ---- ----
From: Emre Tuncer <emre@mondes.com>
Hi John,
Placement engines find it difficult to deal with high fanout nets and
usually they ignore nets with fanouts larger than a certain threshold.
If you buffer high fanout nets too early, during synthesis, these nets will
be divided into smaller nets. The resultant clustering of leaf cells may
not be optimal in terms of placement and cause unnecessary congestion. On
the other hand, if you wait until after placement, and then buffer the high
fanout nets, adding all the buffers may not be feasible since the placement
engine ignored them in the first place, and may cause congestion hot spots.
Considering these nets as clocks is overkill and might make things worse.
What we have found is that the best time to address this problem is during
the physical prototyping stage, where the placement is coarse enough that
added buffers can be eased into the design, and the locations of leaf cells
are known to a sufficient level of precision. Our Sonar tool is based on
this approach.
To answer Wayne's question, probably the best bet would be to find out the
fanout threshold above which his placer ignores nets and use it as a
guideline.
- Emre Tuncer
Monterey Design Systems Sunnyvale, CA
---- ---- ---- ---- ---- ---- ----
From: John McGehee <johnm@voomtown.com>
Hi, John,
Please tell Wayne that Design Compiler (DC) will not do a good job of
inserting buffer trees because it can only randomly connect the buffer trees
based on wireloads and fanout, without regard for where the cells are
placed. PhysOpt is aware of the placement, so it can be trusted to insert
buffer trees.
If you are using Apollo/Saturn, Saturn can take care of your ~40 fanout
nets. Big nets like reset and scan enable should be done with clock
tree synthesis (CTS). Do CTS on the clocks last.
Astro Pre-Placement Optimization will automatically take care of all
your high-fanout nets. It does a good job, but it needs to run faster.
Use version 2001.2.3.5.0.2.2 or above. Earlier versions cannot handle
large nets like reset and scan enable.
- John McGehee
Voom, Inc. Los Altos, CA
---- ---- ---- ---- ---- ---- ----
From: Jon Harris <jharris@siroyan.com>
Hi, John,
With the introduction of automatic buffer tree insertion as part of the
basic "compile" command nowadays, DC does a good job of handling high
fanout nets automatically. For all of Siroyan's test chips I have used
DC to buffer every high fanout net except scan enable and obviously the
clock nets.
Having said that, you need to consider that when using DC for high fanout
net buffering it has no concept of where the cells in the high-fanout
cone shall be placed and therefore how much wire load shall be present.
DC can only estimate how much load will be present based upon the
wireload model being used. Taking the example of a high fanout signal
of 100 loads, the pin capacitances of the loads and the wireload model
might only mandate the use of 2 or 3 buffers. However, you will find
that after placement the cells on your high fanout net have been placed
over a wide area and these 2 or 3 buffers that DC inserted are now
completely inadequate. The extra loading will degrade the transition
times and delays through the buffers massively. Your post-layout
optimisation tool will attempt to fix this, but quite possibly there may
not be appropriate placement area for additional buffers and timing
shall not be met.
So, in order to provide protection against this effect you can make use
of the DC command "set_default_fanout_load." This can be used to set
the maximum "fanout_load" that DC can tolerate on any net.
e.g.  dc_shell-t> set_default_fanout_load 15
Your target technology library will have a "fanout_load" attribute
applied to every input pin of each standard cell. Normally this shall
default to 1, and in this case the effect of setting the default
fanout_load to 15 means that DC shall ensure that the maximum fanout for
any net in the design shall be no more than 15. So with the maximum
fanout reduced, for a net that fans out to 200-300 destinations, DC
shall be forced to put in a buffer tree of 2 or 3 levels, comprised of
several dozen buffer cells. These buffers are then available during
placement to span the area occupied by the destination cells and there
is less likelihood that significant timing degradation will be seen in
layout.
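As a rough sanity check on those numbers, here is a minimal Python sketch of
the arithmetic. It assumes every load pin has a fanout_load of 1 and that
each net is packed right up to the limit, so it gives only a lower bound on
the buffer count; DC also weighs capacitance, transition and timing, and will
typically build a larger tree. The name buffer_tree is made up for
illustration and is not a DC command.

```python
import math

def buffer_tree(loads, max_fanout):
    """Return the buffer count at each level of the tree, leaf level first.

    Assumption: fanout_load is 1 on every pin and every net is filled
    to max_fanout, so this is a best-case (minimum) buffer count.
    """
    levels = []
    sinks = loads
    # Keep adding a level of buffers until the root driver sees few enough sinks.
    while sinks > max_fanout:
        sinks = math.ceil(sinks / max_fanout)
        levels.append(sinks)
    return levels

for loads in (100, 200, 300):
    levels = buffer_tree(loads, 15)
    print(loads, "loads:", sum(levels), "buffers,", len(levels), "buffer level(s)", levels)
```

With a limit of 15, a 300-load net needs at least 20 leaf buffers plus 2 more
above them, which is the sort of tree described above.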
Obviously, whatever post-layout optimisation tool you use to do in-place
optimisations and buffer re-sizing will still have a little work to do
to correct for wireload inaccuracies. However, the work already done by
DC on the high fanout nets shall ensure that these nets will require
only the same attention as the rest of the nets in the design.
As an aside, while I mention fanout_load, something else that Wayne
might like to consider is to increase the fanout_load on the input pins
to any memory macrocells to match the set_default_fanout_load value.
Consider a case where you have an address bus that is driving 8 RAMs,
which physically may be far apart. By default DC will synthesize one
buffer per address bit, with a fanout of 8. Once the netlist has been
placed, this one buffer can easily find itself driving 8 tracks halfway
across the chip, potentially in different directions and badly in need
of post-layout optimisation. However, if the fanout_load setting on the
RAM inputs is set to match the "set_default_fanout_load" value, DC will
synthesize one buffer per address bit for *EACH* RAM, which the
placement engine can then place as required to minimise delay.
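To see why matching the two values forces one buffer per RAM, here is a small
Python sketch that assumes DC simply sums the fanout_load of the pins on a net
and compares the total against the limit. The greedy grouping and the name
nets_needed are illustrative assumptions only, not how DC is implemented.

```python
def nets_needed(pin_fanout_loads, max_fanout_load):
    """Group load pins into nets so that each net's summed fanout_load
    stays within max_fanout_load (simple greedy packing)."""
    nets, current, total = [], [], 0
    for load in pin_fanout_loads:
        # Start a new net (i.e. a new buffer) once this pin would exceed the limit.
        if current and total + load > max_fanout_load:
            nets.append(current)
            current, total = [], 0
        current.append(load)
        total += load
    if current:
        nets.append(current)
    return nets

LIMIT = 15
print(len(nets_needed([1] * 8, LIMIT)))   # RAM pins left at a fanout_load of 1
print(len(nets_needed([15] * 8, LIMIT)))  # RAM pins raised to a fanout_load of 15
```

With the default fanout_load of 1, all 8 RAM pins fit on a single net. With
fanout_load set to 15, each RAM pin fills a net by itself, so each address bit
ends up with 8 separately buffered nets.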
An example segment of a RAM .lib file showing the fanout_load field is
given below :
    bus(A) {
      bus_type    : R64X22BNTBM4B1_ADDRESS;
      fanout_load : 15;     /* <- new fanout_load value */
      direction   : input;
      capacitance : 0.052;
      timing() {
        related_pin : "CLK";
        timing_type : setup_rising;
        ...
The .lib file can be edited as shown and recompiled using just 2 commands:
dc_shell-t> read_lib my_ram.lib
dc_shell-t> write_lib my_ram -format db -output my_ram.db
DC may complain that there is no library compiler license, but as we haven't
changed the functionality of the RAM this error can be ignored and the new
my_ram.db file is successfully written out anyway.
As a final note, I have not yet tried leaving the scan enable buffering to
DC as this could fan out to tens or hundreds of thousands of flops -- and I
just don't feel that lucky!
- Jon Harris
Siroyan Reading, Berkshire, UK
---- ---- ---- ---- ---- ---- ----
From: Tom Tessier <tomt@hdl-design.com>
Hi John,
My answer to Wayne's question is "it depends." If the 40+ loads are in the
same hierarchical block in a floorplan, then most often let DC do it. If
the 40+ loads are a control signal which has to go all over the die in one
clock cycle then experiment with both DC and the P&R tool.
Experience depends upon the tools (about to start a flame war, please don't
shoot the messenger). I had a design that I took into Avanti Apollo that
I knew had large loads (both fanout and large capacitance due to wire
length). The design was done in the physical domain in a hierarchical
fashion because we couldn't get Avanti Apollo to converge when doing the
design flat (lots of RAM based blockages caused problems). We let Apollo do
the buffer insertion and placement. Most often it decided correctly that it
needed 2-3 buffers to get the signal across the die in the timeframe we gave
it. Most often it placed them very poorly. For example it would place one
buffer right at the pin of the sending hierarchical block, then the next two
close to the receiving pin of the hierarchical block. This didn't meet the
timing as I still had 6 mm+ of wire between the buffers. We ended up
putting 3000+ buffers in by hand, and generating PDEF placement to force the
tool to put them where we wanted (check out my San Jose SNUG 2001 Papers for
the gory details).
Same basic design 1 year later with a new foundry. Took out all the 3000+
buffers. This foundry used Cadence and decided they could get it done flat.
They solved the large load problem and large fanout problem without our
intervention. Provided them an SDC file and they solved the rest. Did the
flat layout help them? I think it did but it also complicated the issue as
the tool had a lot more data to work with. Was the designer on the second
go-around more adept at using the tool? Possibly. ;-) Was the client I
was working for very happy with the second encounter? You bet, as they saw
very minor timing closure issues. In fact, the final timing closure issue
had to do with a problem that Paul Zimmer mentioned in ESNUG 393 #2: this
foundry wanted up to 16% on-chip variance for signoff. That is a healthy
chunk of the clock cycle to give up. One source synchronous interface had
problems with this effect, but the team worked around it until they got
timing that was acceptable.
So unfortunately it really depends upon experience.
- Tom Tessier
t2design, Inc. Louisville, CO
---- ---- ---- ---- ---- ---- ----
From: Srinivas Kakumanu <kakumanu@time2mkt.com>
Hi John,
I've had an experience on one of my recent hierarchical chips in 0.13 um
where all the blocks in the chip were taken to layout even with nets having
fanout more than 100. These nets are neither clock nets nor reset nets.
To give you an example, some of the nets are like a write enable to a set of
64-bit registers, where this write_en goes to all 64 flops. The reason
behind doing this was that the layout tools have a very intensive algorithm
for buffering high fanout nets (hfns). This algorithm works to build a
balanced buffer tree, and the buffers are inserted considering the ACTUAL
loads, which we lack at the DC synthesis stage.
We have experimented with this approach on bigger blocks, and allowing the
layout tools to do the buffering on hfns resulted in fewer congestion and
timing problems than starting with a block already buffered by DC at the
synthesis stage.
- Srinivas Kakumanu
time2mkt.com
---- ---- ---- ---- ---- ---- ----
From: Lars Rzymianowicz <larsrzy@ti.uni-mannheim.de>
Hi John,
There's a Boston SNUG paper about this issue by Rick Furtner of Tensilica:
"High Fanout Without High Stress: Synthesis & Optimization of High-fanout
Nets Using Design Compiler 2000.11"
Basically, I'd recommend using a threshold of 100, maybe 50. Layout tools
are much better at building balanced trees than logic synthesis tools like
DC or Ambit. One might also limit the fanout (set_max_fanout) of nets below
this threshold to 8. I had some good results with it, since it forced DC to
buffer nets with 8-100 fanout. That eased the later P&R job a lot.
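That two-threshold rule can be written down as a short Python sketch. The
function name buffering_stage and the treatment of a net with exactly 100
loads (handed to layout here) are my assumptions; the 100 and 8 are just the
numbers suggested above and should be tuned per design.

```python
def buffering_stage(fanout, layout_threshold=100, dc_max_fanout=8):
    """Say which stage should buffer a net with the given fanout."""
    if fanout >= layout_threshold:
        return "layout"   # leave unbuffered in DC, let P&R build the tree
    if fanout > dc_max_fanout:
        return "dc"       # set_max_fanout makes DC buffer it during synthesis
    return "none"         # already within the max fanout limit

for fanout in (4, 8, 40, 100, 200):
    print(fanout, buffering_stage(fanout))
```

Wayne's 40-load net would land in the "dc" bucket under these settings, while
his 100- and 200-load nets would be left to layout.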
- Lars Rzymianowicz
University of Mannheim Germany