( ESNUG 445 Item 13 ) ------------------------------------------- [05/24/05]
From: [ Mr. Brightside ]
Subject: Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (II)
Hi, John,
#5 Running Quartz DRC
OK, when we saw in their slides "any chip in 2 hours", it made us want to
tell the Magma marketing guy to shut up and get real. Our current
Calibre jobs take 8 to 12 hours on multi-CPU machines.
One interesting twist... while those 50-CPU slides Magma showed were
impressive, we have a hierarchical design style. What this means is that a
lot of our Calibre runs are not full-chip, 12-hour runs, but rather runs
that take a few hours. While scalability is good, you can't rely on it to
make up for a bad engine. If we have 20 blocks on our chip and we need
20 CPUs per block for those runs... well... I can just see our frontend
team yelling at us for killing our LSF farm. :)
Anyways, we split our runtime eval into 2 levels, chip level & block level.
Our criteria were that Mojave should be competitive with Calibre at the
1/2/4 CPU block level, and that we should see Mojave win hands down at
large CPU counts at the chip level.
#6 Background
Our typical Calibre jobs are run on 2-CPU Xeon machines. Since Calibre
only runs multi-threaded (except for MTFlex, which we did not look at in
much depth during our timeframe), we are limited with Calibre to the number
of CPUs on 1 machine. I did have access to 4-CPU Opterons; unfortunately,
they came along late in the game, and it would have taken a month or so to
collect full data for both tools over the entire test suite.
One thing to point out here is that the Opterons did seem to favor Calibre
(~45% speedup vs. ~30% speedup for Quartz DRC) with our current binary.
This seemed to be due to the dataprep stage of Quartz DRC being IO bound
(which we've been told will be addressed in future releases). Since our
compute environment is primarily Xeons, optimizing Opteron performance is
not a primary concern for us right now, and there is nothing we know of in
Mojave's technology that would lead us to believe this gap couldn't be
closed in the future.
On the flip side, Quartz DRC's runs could all be done on 32-bit Xeons,
whereas a third of our benchmark required a 64-bit Opteron for Calibre...
in the short term this benefits Quartz DRC, but our belief is that over
the next year 64-bit machines will become the norm, and this advantage
will go away.
Our compute farm environment is LSF. Quartz DRC makes it easy to manage
LSF jobs; you can control the LSF options via:
config dp queue QUEUE_NAME
config dp resource_request modelPIV_3GHZ
config dp exec_path "/tools/lsf/bin"
config dp num_processors 20
Mojave will farm the jobs off for you, and things will start up even if not
all CPUs are ready. The "master machine" doesn't use much CPU power; in my
case I ran it on a shared "login machine" that I shared with 20 other
users (i.e. all the computing is done on the slaves, and the master appears
to only handle task management.)
If you notice, the LSF options are global options, which brought up a few
issues with our current LSF configurations. We have some machines that
have multiple CPUs available but only one LSF job slot; this is not
currently supported. We also have multiple queues available to us, but
Quartz DRC only supports a single queue & a single resource. Also, our
queue allows us to access "free resources" from other groups' queues on an
unstable, preemptible basis (i.e. if that resource is needed, our job is
terminated.) Support for such a feature would be very nice, since
typically during tapeout my group's queue is full, but other groups might
not be taping chips out that week, so there might be excess capacity in
theirs which I could steal.
#7 How Quartz DRC runs
Above I mentioned that I invoke the actual "mojave" command on a machine
which is a shared resource. The way Quartz DRC works is different from
Calibre. Calibre appears to "multithread on the fly": on each command it
splits the data and sends it off to all CPUs. Mojave works the opposite
way. The first thing it does is read in the data and create all the
jobs... what they call "dataprep". Then, when Mojave runs the checks, the
main CPU appears to act only as a "taskmaster"... typical CPU load is
under 10%. Mojave also distributes tasks via disk, so the master machine
does not get throttled by communication traffic.
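
To make the pattern concrete, here is a rough Python sketch of what a
disk-based master/worker dispatch loop looks like in general. This is
purely my own toy illustration of the idea; the directory layout, file
names, and claim-by-rename trick are my assumptions, not Magma's actual
implementation.

    import os, json, time, glob

    SHARED_DIR = "/proj/scratch/drc_tasks"   # hypothetical shared NFS area

    def run_checks(task):
        pass                                 # placeholder for real DRC work

    def master_dispatch(tasks):
        # The master just writes task files to shared disk, then polls for
        # completion -- very little CPU work, which matches the <10% load
        # we saw on the "taskmaster" machine.
        os.makedirs(SHARED_DIR, exist_ok=True)
        for i, task in enumerate(tasks):
            path = os.path.join(SHARED_DIR, "task_%04d.todo" % i)
            with open(path, "w") as f:
                json.dump(task, f)
        while glob.glob(os.path.join(SHARED_DIR, "*.todo")) or \
              glob.glob(os.path.join(SHARED_DIR, "*.running")):
            time.sleep(10)

    def worker_loop():
        # Each slave claims a task by renaming it, runs it, and marks it done.
        while True:
            todo = glob.glob(os.path.join(SHARED_DIR, "*.todo"))
            if not todo:
                break
            claimed = todo[0].replace(".todo", ".running")
            try:
                os.rename(todo[0], claimed)  # atomic claim; loser just retries
            except OSError:
                continue
            with open(claimed) as f:
                run_checks(json.load(f))
            os.rename(claimed, claimed.replace(".running", ".done"))

The nice side effect of a scheme like this is that slaves can come and go
(e.g. as LSF slots free up) without the master caring, which is consistent
with the behavior I described above.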
Another difference is that these Mojave jobs (or 'pipes') are not a single
command. Mojave appears to have taken the approach that if the data is
already on a machine, execute as much as you can on it, so that data IO is
not the bottleneck. For someone used to Calibre/Hercules/Dracula, from a
benchmarking standpoint, this made it a bit difficult to do in-depth
performance comparisons between commands. It also made it a bit difficult,
when Quartz died, to figure out what had caused the death. But this is
probably something that will go away with experience.
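
As a rough way to see why pipes help, here is an oversimplified sketch (my
own caricature, not how either tool is actually coded) of the IO
difference between running each command against the file server versus
chaining commands on a local copy:

    def load_from_fileserver(layout):        # placeholder for NFS read
        return layout

    def save_to_fileserver(data):            # placeholder for NFS write
        pass

    # Per-command style: every command pays the file-server IO cost.
    def run_per_command(layout, commands):
        for cmd in commands:
            data = load_from_fileserver(layout)   # IO on every command
            save_to_fileserver(cmd(data))

    # Pipe style: load once, run a whole chain of commands on the local
    # copy, so the IO cost is paid once per pipe instead of per command.
    def run_pipe(layout, commands):
        data = load_from_fileserver(layout)
        for cmd in commands:
            data = cmd(data)
        save_to_fileserver(data)

The trade-off is exactly the one above: the bundled pipe is cheaper on IO,
but when it dies you get less visibility into which command inside it was
responsible.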
The next question is, does figuring the "threads" out beforehand really
help speed up the overall runtime? Or is it just moving the choke point
from the middle to the beginning? (i.e. you aren't eliminating the
task-splitting, you are just doing it upfront.) For Quartz DRC, dataprep
itself (even the GDS read-in) is also distributed. Again, the master
machine appears to do very little, so the bandwidth of any single machine
doesn't appear to be a choke point. This also lets us get around stalls
where one task doesn't multithread well (which for some reason also seem
to be the tasks that take forever!) and the other CPUs sit idle. One place
where this "stalling" did appear to happen, though, was when you have a
large GDS cell in your file (for example a metal fill overlay or something
like that.) Due to our design methodology, this definitely did occur on
most of the chips we ran. We've been told it is being addressed in our
next Mojave release. But in current releases, it was not uncommon to see
part of the dataprep stage with a large number of CPUs sitting idle at the
end, waiting for the last few processes to complete.
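
To put some (made-up) numbers on why one oversized cell hurts, here's a
quick back-of-the-envelope sketch; the task counts and times are
hypothetical, just to show the shape of the problem:

    # 99 small tasks of 1 minute each plus one 60-minute task (say, a huge
    # metal-fill cell), spread over 20 CPUs.  All numbers are made up.
    tasks = [1] * 99 + [60]           # per-task runtimes in minutes
    cpus  = 20

    ideal = sum(tasks) / cpus         # ~8 min if work split perfectly
    floor = max(max(tasks), ideal)    # can't beat the longest single task

    print("perfect split: %.1f min, best achievable: %.0f min"
          % (ideal, floor))
    # The other 19 CPUs finish their share early and then sit idle while
    # the one big task grinds on -- exactly the tail we saw in dataprep.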
The other question we had, since Quartz DRC uses disk for IO, was whether
this would hit our file server hard. In initial Mojave binaries this
definitely was the case, where 60 CPUs could push our file server to more
than 70% IO loading. This has improved in later binaries, where they
started to compress data on disk, and now if we monitor our file server,
Mojave jobs are noticeable but at an acceptable level (~20%), with room to
grow. It's yet to be seen what a reasonable load will look like during
tapeout time... but for now, it's not as troublesome as we first thought.
Note that Calibre does not suffer from this issue.
The other issue with disk IO is that Quartz is fairly disk hungry... of
course this is understandable for a beta-level tool, which is probably
dumping out a lot of stuff that will go away in the release version. It is
also something they probably haven't given a priority to. A typical block
took roughly half a gig of drive space to run DRC. A typical chip might
take 15 gigs of drive space. Drive space is relatively cheap... however at
least for us, it seems like our drives always sit at 90% full -- which
appears to be the threshold when people start deleting stuff. ;) So disk
use has been a bit of an issue for us. Switching to compressed data
definitely helped Quartz DRC here, but it's still disk hungry. And
unfortunately, if you run out of disk space, your job dies (as does
everyone else's on the file server). Since this task is run late in the
game, Mojave might want to consider some form of graceful death: let the
user set a "space left" threshold (so you don't kill other jobs), and when
free space drops below it, stop firing off new pipes until drive space can
be freed and things can be fired off again. As a comparison, Calibre is
considered very "disk lean", and typically doesn't use much space.
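
The kind of graceful behavior I have in mind is something like the sketch
below; this is just a hypothetical illustration of a free-space guard, not
anything Quartz DRC actually provides today:

    import shutil, time

    MIN_FREE_GB = 20                    # hypothetical user-set threshold

    def launch(pipe):
        pass                            # placeholder for starting the job

    def ok_to_launch(scratch_dir):
        # Only fire off a new pipe if the scratch filesystem has headroom.
        free_gb = shutil.disk_usage(scratch_dir).free / 1e9
        return free_gb > MIN_FREE_GB

    def dispatch(pipes, scratch_dir):
        for pipe in pipes:
            # Instead of dying (and taking everyone else on the file
            # server down too), hold new work until space is freed up.
            while not ok_to_launch(scratch_dir):
                print("below %d GB free -- holding new pipes" % MIN_FREE_GB)
                time.sleep(300)
            launch(pipe)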
One thing which isn't fully captured in my results is that Mojave runs in
2 phases, "dataprep" (roughly 30% of the elapsed time) and "check"
(roughly 70% of the elapsed time). We've been told by Magma R&D that most
of the dataprep results for DRC, LVS, and ERC are the same, meaning their
usage model is a concurrent DRC/LVS/ERC run with a single dataprep stage.
So the incremental cost for the other checks should be less than what we
currently have with Calibre. This claim has not yet been tested, but from
what we've seen, it seems logical.
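
As a back-of-the-envelope on what that could be worth (my numbers, not
Magma's, and assuming -- purely for illustration -- that the LVS and ERC
check phases cost about as much as DRC's):

    # Assume a 10-hour standalone DRC run split per the ~30/70 ratio above.
    drc_total = 10.0
    dataprep  = 0.3 * drc_total         # 3 hrs, shared across flows
    check     = 0.7 * drc_total         # 7 hrs per flow (assumed equal)

    separate  = 3 * (dataprep + check)  # 30 hrs if each flow redid dataprep
    shared    = dataprep + 3 * check    # 24 hrs with one dataprep stage

    print("separate: %.0f hrs, shared dataprep: %.0f hrs"
          % (separate, shared))

Whether the real savings look anything like that is exactly the untested
claim above.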
- [ Mr. Brightside ]
Editor's Note: This benchmark is continued in the next item. - John
---- ---- ---- ---- ---- ---- ----
Related Articles:
Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (I)
Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (II)
Magma Mojave Quartz DRC Pummels Calibre in User Benchmark (III)