( ESNUG 445 Item 14 ) ------------------------------------------- [05/24/05]
From: [ Mr. Brightside ]
Subject: Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (III)
Hi, John,
#8 Results
As I mentioned, we split our experiments into "block" and "chip" modes.
One thing I'll mention in advance: I apologize for some obvious holes in
my data. Our effort wasn't focused on a one-on-one battle between Calibre
and Mojave; instead, we were trying to assess the feasibility of Mojave's
technology, to see if it merited a further look. The apples-to-apples
comparison data of Mojave vs. Calibre was fit in as best we could with our
limited time and compute resources.
#9 Full Chip Runs
The focus for the full-chip runs was scalability and total run time. We set
a limit of 20 CPUs (what we felt was "reasonable"... with dual-core/dual-CPU
machines, this would be 5 physical machines). We used our standard compute
environment (3 GHz Xeons, 32-bit) for these runs.
To start, for Quartz DRC, we took 3 chips and ran them through processor
configurations ranging from 1 to 25 CPUs:
Chip 1

             Total            Total
 CPUs        CPU Time (sec)   Elapsed Time (sec)   CPU Utilization   Speedup
    1            50,358.3          52,085.3            96.68%         1.00x
    2            50,206.5          26,697.5            94.03          1.95
    4            53,829.3          14,732.6            91.34          3.54
    6            54,030.0           9,890.6            91.05          5.27
    8            52,950.4           7,649.6            86.52          6.81
   10            52,239.5           5,823.6            89.70          8.94
   12            50,234.0           5,111.7            81.89         10.19
   16            51,297.2           3,960.2            80.96         13.15
   20            52,406.5           3,364.8            77.87         15.48
   25            52,893.3           2,949.9            71.72         17.66

Chip 2
    1           107,686.4         112,363.9            95.84%         1.00x
    2           120,911.4          64,146.5            94.25          1.75
    4           112,912.6          30,540.9            92.43          3.68
    6           112,968.0          21,460.5            87.73          5.24
    8           114,956.4          16,274.8            88.29          6.90
   10           114,338.3          13,870.7            82.43          8.10
   12           113,707.0          11,510.7            82.32          9.76
   16           114,257.0           8,587.4            83.16         13.08
   20           112,621.8           6,885.0            81.79         16.32
   25           114,493.1           6,663.8            68.73         16.86

Chip 3
    1            44,499.7          46,992.7            94.69%         1.00x
    2            47,509.2          24,782.2            95.85          1.90
    4            46,434.7          12,525.1            92.68          3.75
    6            47,300.4           8,765.5            89.94          5.36
    8            48,295.6           7,191.7            83.94          6.53
   12            46,031.0           5,033.9            76.20          9.34
   16            47,264.4           4,229.2            69.85         11.11
   20            48,531.1           3,896.4            62.28         12.06
   25            48,394.3           3,362.8            57.56         13.97
The thing to note here is that these runs show pretty reasonable scaling
even at 20 CPUs, with a tail-off appearing at 25 CPUs. This is, of course,
design dependent, and if our designs get more complex at 90 nm, as I assume
they will, it wouldn't surprise me to see this tail-off point move out
further. (We did do a few "just for fun" runs at 50 CPUs on a few blocks,
and the tail-off wasn't that bad... I could see where some people might
consider this a good operating point. We just felt going from 25 to 50 CPUs
was a very costly decision and, given our current LSF farm size, probably
not worth it.)
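For reference, the derived columns in these tables follow directly from the
raw times. A minimal sketch (assuming the times are reported in seconds),
using a few of the Chip 1 rows:

```python
# Derive CPU utilization and speedup from the raw Chip 1 times above.
# utilization = total CPU time / (CPUs * elapsed time)
# speedup     = 1-CPU elapsed time / n-CPU elapsed time

# (CPUs, total CPU time, total elapsed time) -- rows from the Chip 1 table
chip1 = [
    (1, 50_358.3, 52_085.3),
    (2, 50_206.5, 26_697.5),
    (4, 53_829.3, 14_732.6),
    (20, 52_406.5, 3_364.8),
]

base_elapsed = chip1[0][2]  # single-CPU elapsed time
for cpus, cpu_time, elapsed in chip1:
    utilization = 100.0 * cpu_time / (cpus * elapsed)
    speedup = base_elapsed / elapsed
    print(f"{cpus:>3} CPUs: {utilization:6.2f}% utilized, {speedup:5.2f}x speedup")
# 2 CPUs -> 94.03% utilized, 1.95x speedup, matching the table
```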
Next, we ran these same 3 chips through Calibre with the same tests.
However, as mentioned before, we had to go to a more powerful machine
(2.2 GHz 64-bit Opterons) to get the 4-CPU data points, so this table is
NOT provided to compare final elapsed runtimes against the Quartz numbers.
Our focus was on scalability.
             Total            Total
 CPUs        CPU Time (sec)   Elapsed Time (sec)   CPU Utilization   Speedup
    1            22,371            23,103              96.83%         1.00x
    2            22,602            13,239              85.36          1.75
    4            22,839             6,765              84.40          3.42

    1            64,047            64,731              98.94%         1.00x
    2            64,752            34,134              94.85          1.90
    4            65,237            23,375              69.77          2.77

    1            22,012            22,082              99.68%         1.00x
    2            22,357            11,605              96.32          1.90
    4            22,516             6,815              82.60          3.24
If we look across our entire suite of chips, we see the following Calibre
scalability, computed as (elapsed time at 1 CPU)/(elapsed time at 2 or
4 CPUs):

 CPUs    Min     Median   Max
   2     1.75x   1.90x    1.90x
   4     2.77    3.24     3.42
Looking at the median numbers, Calibre seems fairly scalable up to 4 CPUs.
#10 Calibre vs. Mojave
So now let's equalize hardware and run at each tool's CPU sweet spot, using
only the 2-CPU 3 GHz Xeons. For the multi-threaded, machine-bound Calibre,
the sweet spot is a single 2-CPU machine. For the distributed Mojave, it's
10 machines networked to give 20 CPUs. We took 11 chips and ran Calibre
vs. Mojave head-to-head.
 Chip   Calibre Time (sec)   Quartz DRC Time (sec)   Speedup (Calibre/Mojave)
   1          10,029                 1,824                     5.5x
   2          10,553                 3,202                     3.3
   3          15,317                 2,107                     7.3
   4           9,538                 1,548                     6.2
   5          16,724                 2,631                     6.3
   6          15,241                 2,091                     7.2
   7          18,605                 2,164                     8.6
   8          16,974                 2,261                     7.5
   9          16,153                 2,113                     7.6
  10          22,087                 2,688                     8.2
  11          10,803                 1,646                     6.6
This naturally did introduce one bias: we couldn't include our larger
designs (Designs 12 - 17), as they could not fit in Calibre on a 3 GHz Xeon.
One other interesting thing to point out: over all the chips we've run, our
longest runtime using 20 3 GHz Xeons was 1 hr 49 min. Best-case speedup
was 8.6x, worst case was 3.3x, and the median was 7.2x. This is what we
would consider an expected speedup, given the nature of our current compute
farm.
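Those summary figures can be reproduced directly from the speedup column of
the head-to-head table:

```python
import statistics

# Speedup column (Calibre elapsed / Quartz DRC elapsed) from the 11-chip table
speedups = [5.5, 3.3, 7.3, 6.2, 6.3, 7.2, 8.6, 7.5, 7.6, 8.2, 6.6]

print(f"best {max(speedups)}x, worst {min(speedups)}x, "
      f"median {statistics.median(speedups)}x")
# -> best 8.6x, worst 3.3x, median 7.2x
```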
#11 Block level runs
Our criteria for block-level runs were a bit tighter. Since we might
have 10-20 blocks on a chip, we need to be able to run on a small number
of CPUs. (The option of running 20-CPU runs is considered a luxury.)
So we decided to find the Mojave sweet spot, as before:
             Total            Total
 CPUs        CPU Time (sec)   Elapsed Time (sec)   CPU Utilization   Speedup

Block 1
    1            6,820.8           6,955.7             98.06%         1.00x
    2            7,031.9           3,836.3             91.65          1.83
    4            7,206.6           2,139.8             84.20          3.37
    6            6,824.7           1,493.1             76.18          4.57
   10            7,098.8           1,305.4             54.38          5.44
   20            6,978.4           1,279.6             27.27          5.44

Block 2
    1            3,240.3           3,321.5             97.56%         1.00x
    2            2,967.1           1,655.3             89.62          1.79
    4            2,881.8             886.0             81.31          3.25
    6            2,863.7             655.7             72.79          4.37
   10            2,963.6             602.8             49.16          4.92
   20            2,883.7             610.7             23.61          4.73

Block 3
    1            2,143.2           2,217.7             96.64%         1.00x
    2            2,217.6           1,223.1             90.65          1.81
    4            2,195.9             694.2             79.08          3.16
    6            2,159.7             515.4             69.84          4.19
   10            2,233.3             463.1             48.23          4.82
   20            2,229.0             507.5             21.96          4.40

Block 4
    1            2,010.9           2,084.7             96.46%         1.00x
    2            1,890.0           1,068.3             88.46          1.77
    4            1,964.2             665.9             73.74          2.95
    6            1,906.0             453.4             70.06          4.20
   10            1,947.9             432.9             45.00          4.51
   20            1,891.5             434.8             21.75          4.36
Looking at the above data, the first block is what we would consider a
"large block" (at the point where we are pushing the capacity of Fusion to
turn a block in 24 hours through our flow). For this size of block, the
point of diminishing returns is probably around 6-8 CPUs. The next 3 blocks
are a bit smaller and show tail-off around 4-6 CPUs.
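One rough way to quantify where the returns diminish (my own
back-of-the-envelope, not part of the benchmark) is to invert Amdahl's law,
S(n) = 1/(s + (1-s)/n), and back out the serial fraction s implied by each
measured speedup of the first, large block. If s stayed constant across CPU
counts, a fixed serial bottleneck would explain the curve; if it grows
with n, distribution overhead is the more likely culprit:

```python
# Back out the Amdahl serial fraction implied by each measured speedup
# of the first ("large") block: s = (1/S - 1/n) / (1 - 1/n).
# Illustrative analysis only, not something done in the benchmark itself.
block1_speedups = {2: 1.83, 4: 3.37, 6: 4.57, 10: 5.44, 20: 5.44}

for n, speedup in block1_speedups.items():
    serial = (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)
    print(f"{n:>2} CPUs: implied serial fraction {serial:.1%}")
```

The implied fraction roughly doubles between 4 and 20 CPUs (about 6% up to
14%), which points at per-CPU distribution overhead growing with machine
count rather than a single fixed serial phase.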
Again, our criterion at block level is that Quartz DRC should be competitive
with Calibre at 1-2 CPUs. So we swept our block suite over 1/2/4/6 CPUs in
Quartz DRC and 1/2 CPUs in Calibre, again using 3 GHz Xeons. We ran 19
blocks through, using Calibre at 1 and 2 CPUs as our reference. (Since
these runtimes were much faster, we were able to compile more meaningful
and equivalent data.) The table below shows min/max/median speedup
(Quartz/Calibre) against both 1 and 2 Calibre CPUs:
                              Calibre
                Med      Med      Min      Min      Max      Max
 CPUs   Tool    1 CPU    2 CPU    1 CPU    2 CPU    1 CPU    2 CPU
   1    Quartz   1.16     0.61     0.43     0.27     2.44     1.81
   2    Quartz   2.26     1.14     0.78     0.49     4.37     3.25
   4    Quartz   3.84     1.98     1.42     0.88     6.27     4.65
   6    Quartz   5.16     2.61     1.79     1.12     7.17     5.32
  10    Quartz   5.74     2.95     2.19     1.36     7.31     5.44
  20    Quartz   5.44     2.85     1.77     1.10     6.90     3.58
This table is a ratio of Quartz over Calibre performance. For example,
looking at median values, Quartz DRC at 6 CPUs is 5.16x faster than Calibre
on 1 CPU, and Quartz DRC on 2 CPUs is 1.14x faster than Calibre on 2 CPUs.
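For what it's worth, compiling a table like this is mechanical once the
per-block elapsed times are collected. A sketch, using hypothetical
placeholder block names and times (the real 19-block data isn't reproduced
here):

```python
import statistics

# Hypothetical per-block elapsed times (sec); block names and values are
# placeholders, not the real 19-block suite.
quartz = {   # block -> {CPUs: elapsed}
    "blk_a": {1: 900.0, 2: 480.0},
    "blk_b": {1: 300.0, 2: 170.0},
}
calibre = {
    "blk_a": {1: 1100.0, 2: 620.0},
    "blk_b": {1: 250.0, 2: 140.0},
}

def ratios(q_cpus, c_cpus):
    """Quartz-over-Calibre speedup per block (Calibre elapsed / Quartz elapsed)."""
    return [calibre[b][c_cpus] / quartz[b][q_cpus] for b in quartz]

for q in (1, 2):
    for c in (1, 2):
        r = ratios(q, c)
        print(f"Quartz {q} CPU vs Calibre {c} CPU: min {min(r):.2f}  "
              f"med {statistics.median(r):.2f}  max {max(r):.2f}")
```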
Looking at median numbers, Quartz DRC is slightly faster (16% and 14%) at
1 and 2 CPUs on similar machines. The worst-case (min) numbers are below
what we originally would have liked to see; however, we went through the
data looking for a pattern. What we found is that the blocks where
Quartz DRC lost were fairly small (under 10 min of DRC runtime for Calibre
on a single CPU) and had large embedded analog/custom content. On digital
blocks, Quartz DRC typically had a slight edge at equivalent CPU counts.
Our chips are typically big D, little A, so while it would be nice to see
this addressed long term, 10 min runtimes are not a killer.
#12 Peak memory usage
The next thing we looked at was peak memory usage for these runs. Here we
expected a bit of a difference, due to the multi-threaded vs. distributed
architectures, and what we saw confirmed it: all of our Quartz DRC jobs fit
on a 32-bit machine, typically using about 1/3rd the peak memory of a
Calibre job. For our chip-level runs:
 Tool       Median     Min        Max
 Calibre    2,085 MB   1,233 MB   5,322 MB
 Quartz       883 MB     504 MB   1,367 MB
#13 Accuracy of results
This is the current state of our eval of the Mojave technology. Up until
this point, we had mostly looked for obvious outliers. We also did some
minor "error injection" just to make sure Quartz was in fact checking
things. But we hadn't spent much time, until the last week or so, making
sure the tools are checking the same things at a micro level.
At this point it's too early to make any conclusions on the accuracy of
results, other than there isn't anything blatantly obvious. We have been
told Mojave has passed our primary fab's qualification at 90 nm (even
though we were benching at 130 nm), so there is a belief things are at
least in the general ballpark. But there is still work to do here on
our end.
One interesting thing that came up in our initial error injection testing
was that Quartz DRC is doing a better job than Calibre of reporting errors
at the proper level of hierarchy. These are initial results only.
#14 Future work
As you can see, we've still got a lot of work to go, but so far things look
interesting. The Mojave tool appears to be stabilizing and becoming more
like a product; it sounds like we'll have access to LVS and XOR
functionality very soon, and we've already seen prototypes of their
debugging tool.
Some interesting things to consider for the future. The first obvious one
is multi-core CPUs. Right now, our standard machine is a 2-CPU machine
(with some 4-CPU machines available). When we go dual-core, the standard
machine becomes 4-CPU, with 8-CPU machines available. This will pose an
interesting question for Calibre: how well does it scale at larger CPU
counts? In our results, Calibre showed a more noticeable tail-off at
4 CPUs than Quartz DRC did (median speedup 3.24 vs. 3.68).
Another interesting event is the industry transition from 130 nm to 90 nm
as mainstream, with 65 nm becoming available as well. If history repeats
itself and, for integration purposes, die size remains constant, the
number of devices (and thus polygons) on our chips should grow. I don't
have a good feel for how Calibre scales and what stalls it, but as far as
Quartz DRC goes, Mojave seems to have gotten its best scalability results
on our larger designs.
- [ Mr. Brightside ]
Editor's Note: This benchmark is continued from the links below. - John
---- ---- ---- ---- ---- ---- ----
Related Articles:
Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (I)
Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (II)
Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (III)