( ESNUG 445 Item 14 ) ------------------------------------------- [05/24/05]

From: [ Mr. Brightside ]
Subject: Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (III)

Hi, John,

#8 Results

As I mentioned, we split our experiments into "block" and "chip" modes.

One thing I'll mention in advance: I apologize for some obvious holes in
my data.  Our effort wasn't focused on having a one-on-one battle between
Calibre and Mojave.  Instead, what we were trying to do was assess the
feasibility of Mojave's technology, to see if it merited a further look.
Our apples-to-apples comparison data of Mojave vs. Calibre was fitted in
as best we could, given our limited time and compute resources.


#9 Full Chip Runs

The focus for full chip runs was scalability and total run time.  We set
a limit of 20 CPUs (what we felt was "reasonable"... with dual-core/dual-
CPU machines, this would be 5 physical machines).  We used our standard
compute environment (3 GHz Xeons, 32-bit) for these runs.

To start, for Quartz DRC, we took 3 chips and ran them through 10 processor
configurations of up to 25 processors:

  Chip 1
           Total           Total               Total
  CPUS    CPU Time      Elapsed Time      CPU Utilization      Speedup

   1      50,358.3         52,085.3           96.68%             1.00x
   2      50,206.5         26,697.5           94.03              1.95
   4      53,829.3         14,732.6           91.34              3.54
   6      54,030.0          9,890.6           91.05              5.27
   8      52,950.4          7,649.6           86.52              6.81
  10      52,239.5          5,823.6           89.70              8.94
  12      50,234.0          5,111.7           81.89             10.19
  16      51,297.2          3,960.2           80.96             13.15
  20      52,406.5          3,364.8           77.87             15.48
  25      52,893.3          2,949.9           71.72             17.66

  Chip 2

   1     107,686.4        112,363.9           95.84%             1.00x
   2     120,911.4         64,146.5           94.25              1.75
   4     112,912.6         30,540.9           92.43              3.68
   6     112,968.0         21,460.5           87.73              5.24
   8     114,956.4         16,274.8           88.29              6.90
  10     114,338.3         13,870.7           82.43              8.10
  12     113,707.0         11,510.7           82.32              9.76
  16     114,257.0          8,587.4           83.16             13.08
  20     112,621.8          6,885.0           81.79             16.32
  25     114,493.1          6,663.8           68.73             16.86

  Chip 3

   1      44,499.7         46,992.7           94.69%             1.00x
   2      47,509.2         24,782.2           95.85              1.90
   4      46,434.7         12,525.1           92.68              3.75
   6      47,300.4          8,765.5           89.94              5.36
   8      48,295.6          7,191.7           83.94              6.53
  12      46,031.0          5,033.9           76.20              9.34
  16      47,264.4          4,229.2           69.85             11.11
  20      48,531.1          3,896.4           62.28             12.06
  25      48,394.3          3,362.8           57.56             13.97
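
For reference, here's how I'm reading the Speedup and CPU Utilization
columns in these tables.  A minimal sketch, assuming the times above are
in seconds (the helper functions are mine, purely for illustration, and
not part of either tool):

    # How the derived columns are computed from the raw times,
    # assuming "CPU Time" is total CPU seconds and "Elapsed Time"
    # is wall-clock seconds for the run.

    def utilization(cpu_time, elapsed, n_cpus):
        """Fraction of the N CPUs kept busy over the whole run."""
        return cpu_time / (n_cpus * elapsed)

    def speedup(elapsed_1cpu, elapsed_ncpu):
        """Elapsed-time speedup relative to the 1-CPU run."""
        return elapsed_1cpu / elapsed_ncpu

    # Chip 1, 2-CPU row from the table above:
    print(utilization(50206.5, 26697.5, 2))   # ~0.9403 -> 94.03%
    print(speedup(52085.3, 26697.5))          # ~1.95x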


The thing to point out here is that it looks like, in these runs, we're
seeing pretty reasonable performance even at 20 CPUs.  There appears to be
a tail-off at 25 CPUs.  This is, of course, design dependent, and if our
designs get more complex at 90 nm like I assume they will, it wouldn't
surprise me to see this tail-off point go out further.  (We did do a few
"just for fun" runs at 50 CPUs on a few blocks, and the tail-off wasn't
that bad... I could see where some people might consider this a good
operating point.  We just felt going from 25 to 50 CPUs was a very costly
decision, and given our current LSF farm size, probably not worth it
right now.)

Now let's run these same 3 chips through Calibre.  We ran the same tests;
however, as mentioned before, we had to go to a more powerful machine
(2.2 GHz 64-bit Opterons) to get the 4-CPU data points.  So this table is
NOT provided to compare final elapsed runtimes against the Quartz numbers
above.  Our focus was on scalability.

           Total           Total               Total
  CPUS    CPU Time      Elapsed Time      CPU Utilization      Speedup

   1       22,371          23,103             96.83%             1.00x
   2       22,602          13,239             85.36              1.75
   4       22,839           6,765             84.40              3.42

   1       64,047          64,731             98.94%             1.00x
   2       64,752          34,134             94.85              1.90
   4       65,237          23,375             69.77              2.77

   1       22,012          22,082             99.68%             1.00x
   2       22,357          11,605             96.32              1.90
   4       22,516           6,815             82.60              3.24

If we look across our entire suite of chips, we see the following
scalability for Calibre (speedup of the 2- and 4-CPU runs relative to the
1-CPU run):

                 CPUs      Min       Median    Max

                  2        1.75x     1.90x     1.90x
                  4        2.77      3.24      3.42

Looking at the median numbers, Calibre seems fairly scalable up to 4 CPUs.


#10 Calibre vs. Mojave

So, now let's equalize hardware and run at the CPU sweet spot of each tool.
That means using only the 2-CPU 3 GHz Xeons.  The sweet spot for the multi-
threaded, machine-bound Calibre is one 2-CPU machine.  For distributed
Mojave it's 10 machines networked together to get 20 CPUs.  We took 11
chips and ran Calibre vs. Mojave head-to-head.

    Chip    Calibre Time   Quartz DRC Time     Speedup (Calibre/Mojave)

      1        10,029          1,824                   5.5x
      2        10,553          3,202                   3.3
      3        15,317          2,107                   7.3
      4         9,538          1,548                   6.2
      5        16,724          2,631                   6.3
      6        15,241          2,091                   7.2
      7        18,605          2,164                   8.6
      8        16,974          2,261                   7.5
      9        16,153          2,113                   7.6
     10        22,087          2,688                   8.2
     11        10,803          1,646                   6.6

This naturally did place one bias on us -- we couldn't include our larger
designs (Designs 12-17), as they could not fit in Calibre on a 3 GHz Xeon.

One other interesting thing to point out: over all the chips we've run, our
longest runtime using 20 3 GHz Xeon CPUs was 1 hr 49 min.  Best case speedup
was 8.6x, worst case was 3.3x, and the median case was 7.2x.  This is what
we would consider an expected speedup, given the nature of our current
compute farm.
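
Those best/worst/median numbers come straight from the Speedup column of
the head-to-head table above; a quick sanity check (values copied from the
table):

    # Speedup (Calibre elapsed / Quartz DRC elapsed) for chips 1-11,
    # taken from the head-to-head table above.
    from statistics import median

    speedups = [5.5, 3.3, 7.3, 6.2, 6.3, 7.2, 8.6, 7.5, 7.6, 8.2, 6.6]

    print(min(speedups), median(speedups), max(speedups))   # 3.3 7.2 8.6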


#11 Block level runs

Our criteria for block level runs were a bit tighter.  Since we might
have 10-20 blocks on a chip, we need to be able to run on a small number
of CPUs.  (The option of doing 20-CPU runs is considered just a luxury.)
So we decided to figure out the Mojave sweet spot as before:

            Total           Total             Total
  CPUS     CPU Time      Elapsed Time     CPU Utilization      Speedup

  Block 1
   1       6,820.8          6,955.7           98.06%             1.00x
   2       7,031.9          3,836.3           91.65              1.83
   4       7,206.6          2,139.8           84.20              3.37
   6       6,824.7          1,493.1           76.18              4.57
  10       7,098.8          1,305.4           54.38              5.44
  20       6,978.4          1,279.6           27.27              5.44

  Block 2
   1       3,240.3          3,321.5           97.56%             1.00x
   2       2,967.1          1,655.3           89.62              1.79
   4       2,881.8            886.0           81.31              3.25
   6       2,863.7            655.7           72.79              4.37
  10       2,963.6            602.8           49.16              4.92
  20       2,883.7            610.7           23.61              4.73

  Block 3
   1       2,143.2          2,217.7           96.64%             1.00x
   2       2,217.6          1,223.1           90.65              1.81
   4       2,195.9            694.2           79.08              3.16
   6       2,159.7            515.4           69.84              4.19
  10       2,233.3            463.1           48.23              4.82
  20       2,229.0            507.5           21.96              4.40

  Block 4
   1       2,010.9          2,084.7           96.46%             1.00x
   2       1,890.0          1,068.3           88.46              1.77
   4       1,964.2            665.9           73.74              2.95
   6       1,906.0            453.4           70.06              4.20
  10       1,947.9            432.9           45.00              4.51
  20       1,891.5            434.8           21.75              4.36

Looking at the above data, the first block is what we would consider a
"large block" (at the point where we are pushing the capacity of Fusion to
turn a block through our flow in 24 hours).  For this size block, it looks
like the point of diminishing returns is probably around 6-8 CPUs.  The
next 3 blocks are a bit smaller, and show tail-off around 4-6 CPUs.

Again, our criterion at the block level is that Quartz should be
competitive with Calibre at 1-2 CPUs.  So we swept our block suite over
1/2/4/6/10/20 CPUs in Quartz DRC and 1/2 CPUs in Calibre, again using
3 GHz Xeons.  We ran 19 blocks through, and used Calibre at 1 and 2 CPUs
as our reference.  (Since these runtimes were much faster, we were able to
compile more meaningful and equivalent data.)  The table below shows
min/max/median speedup (Quartz/Calibre) against both the 1- and 2-CPU
Calibre runs:

                                    (vs. Calibre at)
                   Med      Med      Min      Min      Max      Max 
  CPUs  Tool       1 CPU    2 CPU    1 CPU    2 CPU    1 CPU    2 CPU

    1   Quartz     1.16     0.61     0.43     0.27     2.44     1.81
    2   Quartz     2.26     1.14     0.78     0.49     4.37     3.25
    4   Quartz     3.84     1.98     1.42     0.88     6.27     4.65
    6   Quartz     5.16     2.61     1.79     1.12     7.17     5.32
   10   Quartz     5.74     2.95     2.19     1.36     7.31     5.44
   20   Quartz     5.44     2.85     1.77     1.10     6.90     3.58

The way this table works is that each entry is the ratio of Quartz
performance to Calibre performance -- i.e., Calibre runtime divided by
Quartz runtime.  For example, looking at median values, Quartz at 6 CPUs
is 5.16x faster than Calibre with 1 CPU.  Quartz on 2 CPUs is 1.14x faster
than Calibre on 2 CPUs.
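
Put another way: for each block we take (Calibre elapsed time) divided by
(Quartz elapsed time) at a given pair of CPU counts, then report the
min/median/max of that ratio across the 19 blocks.  A minimal sketch of
how such a table gets built -- the block data below is made up purely for
illustration, not our actual per-block numbers:

    # Hypothetical per-block elapsed times (seconds); the real suite
    # had 19 blocks.  Keys are the CPU counts used for each tool.
    from statistics import median

    blocks = [
        {"calibre": {1: 610.0, 2: 340.0},
         "quartz":  {1: 520.0, 2: 270.0, 4: 160.0, 6: 115.0}},
        {"calibre": {1: 980.0, 2: 530.0},
         "quartz":  {1: 840.0, 2: 430.0, 4: 250.0, 6: 180.0}},
        # ... one entry per block ...
    ]

    def ratio_row(quartz_cpus, calibre_cpus):
        """Quartz-vs-Calibre speedup ratios (Calibre time / Quartz time)."""
        ratios = [b["calibre"][calibre_cpus] / b["quartz"][quartz_cpus]
                  for b in blocks]
        return min(ratios), median(ratios), max(ratios)

    # e.g. the "Quartz at 4 CPUs vs. Calibre at 1 CPU" cells:
    print(ratio_row(quartz_cpus=4, calibre_cpus=1))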

Looking at median numbers, Quartz DRC is slightly faster (16% and 14%) at
1 and 2 CPUs on similar machines.  The worst case (min) numbers fall
outside what we originally would have liked to see; however, we went
through and examined the data to look for a pattern.  What we were able to
see is that the blocks where Quartz DRC lost were fairly small (under
10 min of DRC runtime for Calibre on a single CPU) and had a large amount
of embedded analog/custom content.  On digital blocks, Quartz DRC typically
appeared to have a slight edge at equivalent CPU counts.  Our chips are
typically big D, little A, so while long term it would be nice to see this
addressed, 10 min runtimes are not a killer.


#12 Peak memory usage

The next thing we looked at was peak memory usage for these runs.  Here,
there is expected to be a bit of a difference, due to the multi-threaded
vs. distributed architectures.  What we saw confirmed this: all of our
Quartz DRC jobs fit on a 32-bit machine, and were typically using about
one-third the peak memory of a Calibre job.  For our chip level runs:

           Tool            Median           Min             Max
        Calibre           2,085 MB        1,233 MB        5,322 MB
         Quartz             883 MB          504 MB        1,367 MB


#13 Accuracy of results

This is the current state of our eval of the Mojave technology.  Up until
this point, we have mostly looked for obvious outliers.  We also did some
minor "error injection" just to make sure Quartz was in fact checking
things.  But we haven't spent much time, until the last week or so, making
sure the tools are checking the same things at a micro level.

At this point it's too early to draw any conclusions on the accuracy of
results, other than that nothing blatantly wrong has turned up.  We have
been told Mojave has passed our primary fab's qualification at 90 nm (even
though we were benchmarking at 130 nm), so there is a belief things are at
least in the general ballpark.  But there is still work to do here on
our end.

One interesting thing that came up in our initial error injection testing
was that Quartz DRC is doing a better job than Calibre of reporting errors
at the proper level of hierarchy.  These are initial results only.


#14 Future work

As you can see, we've still got a lot of work to do, but so far things
look interesting.  The Mojave tool appears to be stabilizing and becoming
more like a product.  It sounds like we'll have access to LVS and XOR
functionality very soon, and we've already seen prototypes of their
debugging tool.

Some interesting things to consider for the future.  The first obvious one
is multi-core CPUs.  Right now, our standard machine is a 2-CPU machine
(with some 4-CPU machines available).  When we go dual-core, the standard
machine becomes 4-CPU, with 8-CPU machines available.  This will pose an
interesting question for Calibre: how well does its scalability hold up at
larger numbers of CPUs?  In our results, Calibre appeared to have a more
noticeable tail-off at 4 CPUs than Quartz DRC did (median 3.24x vs. 3.68x).

Another interesting event is the industry transition from 130 nm to 90 nm
as mainstream, with 65 nm becoming available as well.  If history repeats
itself and, for integration purposes, die size remains constant, the number
of devices (and thus polygons) on our chips should grow.  I don't have a
good feel for how Calibre scales and what stalls it, but as far as
Quartz DRC goes, it seems like Mojave got its better scalability results on
our larger designs.

    - [ Mr. Brightside ]

  Editor's Note: This benchmark is continued from the links below.  - John

        ----    ----    ----    ----    ----    ----    ----

Related Articles:

    Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (I)
    Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (II)
    Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (III)
