( ESNUG 395 Item 8 ) --------------------------------------------- [06/26/02]
Subject: ( ESNUG 393 #2 ) 12% On-Chip Timing Variations & IBM's EinsTimer
> Our current ASIC vendor, IBM, is having us do timing analysis with an
> on-chip variation of 12%, applied to both cell delays and
> interconnect delays.
>
> This makes certain kinds of timing very difficult to pass. For example,
> using a PLL to zero out a 5 ns insertion delay requires a 5ns feedback
> path. But 12% variation between the two is 600 ps, which is a big chunk
> of time these days. Source-synchronous interfaces have similar problems.
>
> How realistic is this sort of thing? Has anyone done a paper on this?
>
> - Paul Zimmer
> Cisco Systems
 ---- ---- ---- ---- ---- ---- ----
From: Hank Walker <walker@cs.tamu.edu>
Hi, John,
There have been academic and industry papers published on intra-die
process variation. Most of it is deterministic, due to stepper field
gradients, pattern density sensitivity, and OPC imperfections, but
there is also a significant random component. The deterministic
variation in ILD thickness can be +/- 30% in 180 nm aluminum. I have
seen industrial data showing variation in flush delays, ring oscillator
delays, Fmax, etc., of neighboring chips that can easily be 10%; since
neighboring chips see nearly the same deterministic effects, much of
this must be random intra-die variation. The experts at this are
microprocessor clock-tree designers. You might look for
papers by Sani Nassif (of IBM Austin) in IEEE conference proceedings,
or papers on "statistical design" or "parametric yield optimization".
So the vendor wants this type of timing analysis to guarantee high
parametric yield for a single-bin part pushing the limits. They could
choose not to require it, but then they would take a yield hit and
simply make it up with higher wafer costs to you.
In summary, the highest performance designs have been using such
approaches since 250 nm, and it is now starting to show up in ASICs.
Unfortunately the tools to support statistical design are complete
losers, except for analog circuits. But there is a fair amount of
university research, so hopefully the situation will improve.
- Hank Walker
Texas A&M University College Station, TX
---- ---- ---- ---- ---- ---- ----
From: Srinivas Kakumanu <kakumanu@time2mkt.com>
Hi, John,
We always use four-corner timing closure to tape out our chips.
We get two SDF files: one contains cell delays (cell.sdf) and the other
contains interconnect delays (connect.sdfRC). We run STA on four
corner cases:
    cell.sdf(max), connect.sdfRC(max)
    cell.sdf(max), connect.sdfRC(min)
    cell.sdf(min), connect.sdfRC(max)
    cell.sdf(min), connect.sdfRC(min)
This is an alternative to on-chip variation in PrimeTime, if not exactly
the same. I did see 300-500 ps of variation on clock nets between the
min-max and max-min cases. Our methodology requires fixing all
setup and hold violations in all FOUR corners before the design is
qualified for tape-out.
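The four pairings can be enumerated mechanically; here is a minimal
Python sketch (the SDF file names are the ones quoted above; how each
pairing gets fed to the STA tool is left out):

```python
from itertools import product

# Enumerate the four corner pairings from the methodology above.
# Each SDF file (cell.sdf, connect.sdfRC) comes in a min and max flavor.
corners = [(f"cell.sdf({cell})", f"connect.sdfRC({wire})")
           for cell, wire in product(("max", "min"), repeat=2)]

for cell_sdf, wire_sdf in corners:
    # Each pairing would drive one STA run; the design is qualified
    # for tape-out only when all four runs are free of violations.
    print(cell_sdf, wire_sdf)
```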
- Srinivas Kakumanu
Time2mkt.com
---- ---- ---- ---- ---- ---- ----
From: Del Cecchi <dcecchi@vnet.ibm.com>
Yes, the tracking can be that bad. I don't know if you are talking about
IBM as the vendor, but our tracking is about that. The main contributor is
probably effective channel length (Leff) variation due to etch variation.
- Del Cecchi
IBM Rochester, MN
---- ---- ---- ---- ---- ---- ----
From: Paul Zimmer <pzimmer@cisco.com>
Hi, John,
One explanation I got from someone else at IBM is that the variations
happen over a fairly small area. We were picturing a slow, gradual change
across the die, but this doesn't seem to be the case. Instead, it is more
like random variation.
There are probably both effects: random device-to-device variation is
probably the biggest, but there is likely also a longer-range trend. Some
of the "random" stuff now seems to be geometry dependent.
Which raises the question: if your path goes through 20 or so elements,
and the variation is sort of random, isn't all-min vs. all-max a little
extreme? But statistical analysis of this would be very complex for the
tool...
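To put a rough number on that question: if the 12% is treated as an
independent per-stage random variation (an assumption for illustration,
not how the sign-off model works), the relative variation of the whole
path shrinks roughly as 1/sqrt(N). A back-of-the-envelope Python sketch:

```python
import math

n_stages = 20
per_stage = 0.12   # treat the 12% as a 1-sigma per-stage variation (assumption)

# Fully correlated stages (the all-min vs. all-max picture):
# the whole path still varies by the full 12%.
correlated = per_stage

# Fully independent stages: sigmas add in quadrature, so the path's
# relative variation shrinks by sqrt(n_stages).
independent = per_stage / math.sqrt(n_stages)

print(f"correlated: {correlated:.1%}  independent: {independent:.1%}")
```

With 20 stages this drops the 12% to roughly 2.7%, which is why treating
partly random variation as fully correlated is conservative on long paths.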
This die variation problem hasn't been well known in the ASIC world. We
have used multiple vendors, and only once were we asked to turn on
"on-chip variation" in PrimeTime (equivalent to linear_comb_delay in
EinsTimer). Even then, the SDF files they supplied us with showed almost
NO variation between early and late, so it didn't do much.
- Paul Zimmer
Cisco Systems
---- ---- ---- ---- ---- ---- ----
From: Matt Weber <matt@siliconlogic.com>
Hi John,
IBM is the only ASIC vendor I know of that requires this analysis, so the
analysis is typically done in EinsTimer rather than PrimeTime.
How realistic is this sort of thing? Obviously, two gates on the same chip
will not turn out exactly the same, due to small variations in the mask and
lithography, adjacency effects from neighboring circuits, and other
factors.
Twelve percent sounds like a lot to me (especially for the wires), but not
unreasonable. Using one number isn't exactly accurate anyway. Some cells
will vary only a few percent and others may vary far more than 12%. I think
a single number is used to reduce the modeling complexity, and it generally
gives close enough results for the effects that we are trying to model.
Cross chip variation doesn't affect most of the data signals in your chip.
You've already verified that you meet setup requirements with worst case
delays. If some gates in the data path are a little bit faster due to cross
chip variations, no problem. Also, you've already verified that you meet
hold time requirements with best case delays. If some gates in the data
path are a little bit slower, no problem. Where cross chip variation really
bites you is in the clock trees.
Consider this circuit with a three-level clock tree:

                              flop1
            clk1a    clk2a   |D   Q|--[logic]--+
         +--|>o--+--|>o------|>    |           |    flop2
  clk0   |  clk1b    clk2b                     +---|D   Q|
 X----|>-+--|>o--+--|>o----------------------------|>    |
We know from worst case analysis that the logic is fast enough to make
setup at flop2. This includes some clock skew which is calculated based
on the clock tree parasitics. However, it does not include any clock skew
caused by process differences between the clock drivers. If, on the real
chip, clock drivers clk1a and clk2a end up at a slower process point than
clk1b and clk2b, you may still end up with a setup failure, even though
static timing said the path was okay. Similar problems can happen with
hold checks and clock gating checks.
The differences that can occur on a chip between two instances of the same
cell can be modeled using on-chip variation analysis. In EinsTimer, this is
done by turning on LCD (linear combination of delays) analysis. One way
to enable the analysis in PrimeTime is with commands such as:

    set_operating_conditions -analysis_type on_chip_variation $OPCON
    set_timing_derate -min 0.88 -max 1.0
If each stage of the clock tree above has 1 ns of latency and on-chip
variation is 12%, the tool will initially say that the arrival times at
the clock pins of flop1 and flop2 can differ by 360 ps (3 ns * 12%), in
addition to any skew caused by placement and loading differences. As
Paul mentioned,
hundreds of picoseconds are not easy to find these days. Fortunately, the
situation usually isn't quite this bad. The clock tree goes through driver
clk0 for both flop1 and flop2. Clk0 can't be fast for one of the flops and
slow for another. It can be fast or it can be slow, but it is the same for
both flop1 and flop2. So the 120 ps of on-chip variation that was
originally calculated for clk0 gets credited back.
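The arithmetic can be spelled out in a few lines (numbers taken from the
example above; Python used just as a calculator):

```python
derate = 0.12
stage_latency = 1.0e-9   # 1 ns per clock-tree stage, as in the example

tree_depth = 3           # clk0 -> clk1x -> clk2x
common_depth = 1         # only clk0 is shared by flop1 and flop2

# Before CPPR: the tool lets every stage of both branches diverge.
raw_variation = tree_depth * stage_latency * derate        # 360 ps

# CPPR credits back the variation on the shared portion of the tree:
# clk0 cannot be fast for one flop and slow for the other.
credit = common_depth * stage_latency * derate             # 120 ps
after_cppr = raw_variation - credit                        # 240 ps

print(f"{raw_variation * 1e12:.0f} ps before CPPR, "
      f"{after_cppr * 1e12:.0f} ps after")
```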
In EinsTimer this process is called Common Path Pessimism Removal (CPPR).
I was hoping that Synopsys would have come up with a better name, but I see
that it is called Clock Reconvergence Pessimism Removal. The variable to
enable it is timing_remove_clock_reconvergence_pessimism. That rolls off
my tongue about as easily as "Peter Piper picked a peck of pickled
Primetime."
Anyway, if your clock insertion is 5 ns, most of your paths will still not
see 600 ps of on-chip variation. It all depends on how far back in the clock
tree you need to go to find the common point between the startpoint and
endpoint registers.
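How much of the derated latency survives CPPR depends only on the
uncommon depth, which is easy to parameterize (a sketch with illustrative
numbers: a uniform 1 ns per stage and the 12% derate; real trees have
non-uniform stage delays):

```python
def ocv_skew_ps(tree_depth, common_depth, stage_ps=1000, derate=0.12):
    """Modeled clock skew (ps) between two endpoints whose clock paths
    share the first common_depth stages of a tree_depth-deep tree."""
    uncommon = tree_depth - common_depth
    return round(uncommon * stage_ps * derate)

# A 5 ns insertion delay (five 1 ns stages here) only sees the full
# 600 ps when the two paths share no stages at all.
for shared in range(6):
    print(shared, ocv_skew_ps(5, shared))
```

With no common stages this reproduces the 600 ps from the original
question; each shared stage claws back 120 ps.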
Although at first it appears that using on-chip variation analysis steals
a bunch of performance from you, I don't think this is necessarily the case.
Other ASIC vendors still must account for on-chip variation. Without
running on-chip variation analysis, they must cover it some other way. I
assume it gets covered through padding the setup and hold margins of the
flops, requiring some set_clock_uncertainty even after the clocks are placed
and routed, or requiring XX ps of positive slack for timing signoff. Any of
these methods effectively penalize ALL of the paths in your design. By
doing on-chip variation analysis instead, you are able to stuff an extra
hundred picoseconds of logic into paths which are contained in a common
branch of the clock tree. By modeling the effect more accurately, we are
able to get more performance out of the design.
- Matt Weber
Silicon Logic Engineering, Inc. Eau Claire, WI
---- ---- ---- ---- ---- ---- ----
From: Paul Zimmer <pzimmer@cisco.com>
Hi, John,
The specific place we had the problem was with the PLL feedback. The
PLL is there to zero the clock insertion delay. How much the CPPR can
help depends on how much the feedback and clock paths have in common.
They SHOULD have a lot in common - the feedback normally comes from the
end of the clock tree, with a few extra gates to zero out the pad delay.
Getting the STA tool to recognize this common path is perhaps trickier.
But you have given me an idea. I'm not sure that the way I'm constraining
it in EinsTimer is the best. This mode is the only place where I'm using
RATs instead of UDTs. Perhaps if I can recode this as UDTs somehow,
I can improve the analysis...
No, that won't work, because the PLL early/late times are calculated
once and then hung on the PLL outputs. EinsTimer doesn't understand that
the PLL early arrival time is based on the clock tree max delay, and
therefore should get CPPR credit against the clock tree min delay.
Messy stuff.
- Paul Zimmer
Cisco Systems
---- ---- ---- ---- ---- ---- ----
From: Adam Shiel <adam@siliconlogic.com>
Hi John,
I work with Matt here at Silicon Logic Engineering, and I do quite a bit
of work with EinsTimer. Without knowing the system Paul is working with,
I can't tell exactly what problem he's seeing from the description alone.
For normal latch to latch paths, or to I/O, CPPR should find the common
clock point and compute the proper credit. There is some extra pessimism
introduced by IBM's PLL adjust script in LCD mode that they've recently
added a command to eliminate. This sort of sounds like what you're
describing.
When computing the feedback adjust, IBM's PLL adjust script makes
assumptions about whether the gates in the feedback path are running fast
or slow. The script looks either at the late arrival time at the PLL
feedback for early PLL, or at the early arrival time for late PLL. Since
it's looking at only the early or the late arrival time, it assumes that
the gates on the feedback path are at that process point. It's then
inconsistent to allow the other process point's delays through those
gates, since you just assumed they're at a fixed process point. However,
that's what EinsTimer would do by default when computing setups and holds
in LCD mode.
It looks like IBM's introduced a new command in the last couple of months
that fixes the delay of the cells on the feedback path to a given PVT
point, even in LCD. Ask your IBM AE for the methodology alert about
et::remove_pll_pessimism.
We just found out about the command this week; maybe your AE is better
about keeping you informed about methodology alerts than ours is. I'm not
sure this is what your problem is but it sounds like it's in the ballpark.
Of course this advice may be worth exactly what you paid for it. :)
- Adam Shiel
Silicon Logic Engineering Eau Claire, WI
---- ---- ---- ---- ---- ---- ----
From: Paul Zimmer <pzimmer@cisco.com>
Hi, John,
That was dead on.
Please thank Adam for jumping in. His timing is perfect. I *just* sent IBM
an email describing this phenomenon. Basically, early RAT and early PLL are
inconsistent in their treatment of the clktree delays (or anything else
that's shared in the feedback path).
It'll be interesting to see if my AE responds with et::remove_pll_pessimism!
I'm going to go read up on it in the meantime.
- Paul Zimmer
Cisco Systems
---- ---- ---- ---- ---- ---- ----
From: Matt Weber <matt@siliconlogic.com>
Hi, John,
I had a Jimmy Buffett CD in my car the same day I was thinking about
Paul's problem. My new version of "Margaritaville" has been
intermittently stuck in my head ever since:
Nibblin' on static timing,
Has me crazy and rhyming,
All my critical paths seem to have me foiled.
Slipping the tapeout,
Watching my boss pout,
See those managers--they're beginnin' to boil.
Wasting away again in static-timing-ville,
Searchin' for my lost multiplexed clock,
Some people claim that there are wireloads to blame,
So I guess....I'll take a shot with PhysOpt.
I think Paul has gotten me in trouble. People are wondering why I was
laughing out loud in my office at 8:00 this morning.
- Matt Weber
Silicon Logic Engineering, Inc. Eau Claire, WI
---- ---- ---- ---- ---- ---- ----
From: Paul Zimmer <pzimmer@cisco.com>
Hi, John,
By the way, I got an answer from my IBM AE. She says that
remove_pll_pessimism should be executed BEFORE the pll_adjust command
is invoked. Is that what Matt was told as well?
- Paul Zimmer
Cisco Systems
---- ---- ---- ---- ---- ---- ----
From: Matt Weber <matt@siliconlogic.com>
Yes, remove_pll_pessimism gets executed before pll_adjust. I actually
haven't tried it yet myself, but a couple of other people here have, and
those were the instructions.
- Matt Weber
Silicon Logic Engineering, Inc. Eau Claire, WI