( ESNUG 553 Item 1 ) -------------------------------------------- [11/13/15]
Subject: Pt 1 - Lauro missed Veloce2 and Zebu have lame gate utilization
[ phone rings ]
Cooley: "Hello"
Frank: "Hey, Cooley, I don't like this post that Lauro did for Veloce.
It has too many errors and things that are missing in it about
Palladium and ZeBu. Can I write a rebuttal?"
Cooley: "Yea, sure... If you keep it factual and not marketing crap."
Frank: "Any word limit?"
Cooley: "No limit, if it's detailed tech talk. No marketing crap."
What follows is how I learned to never ask a German to give detailed tech
talk on why something is wrong. And always give a word limit. - John
---- ---- ---- ---- ---- ---- ----
From: [ Frank Schirrmeister of Cadence ]
Hi John,
I was highly entertained to read the so-called emulation "update" that
"Lauro Rizzatti, emulation consultant" wrote on behalf of Mentor. From
my count his update had 9 errors plus Lauro missed 22 related issues.
Here are the details.
- Frank Schirrmeister
Cadence Design Systems, Inc. San Jose, CA
---- ---- ---- ---- ---- ---- ----
LAURO ERRS ON NUMBER OF PARALLEL JOBS & HE MISSED SCALING CAPACITIES
Lauro's table in ESNUG 547 #1 deliberately misleads the reader into thinking
Veloce 2 supports the most users. Lauro failed to scale everything in his
table to be a fair apples-to-apples comparison.
Also, a proper set of common references would include:
- Typical emulation capacity
- Minimum emulator cabinet size
- Specified max emulation capacity
In addition, his max capacity is wrong for both SNPS and CDNS. CDNS said
2.3 billion gates in September 2013. SNPS said 3 billion in February 2014.
Here is Lauro's original table:
                      Cadence               Mentor             Synopsys EVE
                      Palladium-XP2 (GXL)   Veloce 2           Zebu Server 3
  ------------------  --------------------  -----------------  ---------------
  Single Cabinet      72 million            1 billion          300 million
  Capacity            ASIC-gates            ASIC-gates         ASIC-gates

  Total Max           1.1 billion           2.0 billion        2.0 billion
  Capacity            ASIC-gates in         ASIC-gates in      ASIC-gates in
                      16 cabinets           2 cabinets         7 cabinets

  # of Users          16 users              64 users           5 users
  per Cabinet
And here is how the table should actually look if he were objective:
                      Cadence               Mentor             Synopsys EVE
                      Palladium-XP2 (GXL)   Veloce 2           Zebu Server 3
  ------------------  --------------------  -----------------  ---------------
  Single Cabinet      72 million            1 billion          300 million
  Capacity            ASIC-gates            ASIC-gates         ASIC-gates

  Total Max           1.1 billion           2.0 billion        2.0 billion
  Capacity            ASIC-gates in         ASIC-gates in      ASIC-gates in
                      16 cabinets           2 cabinets         7 cabinets

  # of Users          16 users              64 users           5 users
  per Cabinet

  Minimum cabinet     72 million            256 million        300 million
  capacity (full)     ASIC-gates            ASIC-gates         ASIC-gates
                      Palladium P16         Veloce 2 Quattro   ZeBu 3

  # of users/jobs     16 users              16 users           5 users
  per minimum
  cabinet

  Total specified     2.3 billion           2.0 billion        3.0 billion
  max capacity        ASIC-gates in         ASIC-gates in      ASIC-gates in
                      32 cabinets           2 cabinets         10 cabinets

  # of users at       512 users             128 users          30 users
  total specified
  max capacity

  # of parallel       64 jobs               16 jobs            5 jobs
  Users/Jobs for      4 cabinets            1 cabinet          1 cabinet
  typical 256MG       256MG                 288MG              300MG
What Lauro missed is that Palladium is designed to scale to the highest
number of both parallel users (64 users) and jobs (64 jobs).
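The 64-job figure follows from the "typical 256MG" row of the table. Here is a minimal sketch of that arithmetic, assuming parallel jobs scale linearly with the number of cabinets. The cabinet counts and per-cabinet job figures are copied from the table; the names `ROWS` and `parallel_jobs` are illustrative only.

```python
# (cabinets needed for a typical 256MG load, jobs per minimum cabinet),
# both copied from the comparison table above.
ROWS = {
    "Palladium XP2": (4, 16),
    "Veloce 2": (1, 16),
    "Zebu Server 3": (1, 5),
}


def parallel_jobs(cabinets, jobs_per_cabinet):
    """Assumption: every cabinet contributes its full job count."""
    return cabinets * jobs_per_cabinet


JOBS_AT_256MG = {name: parallel_jobs(*row) for name, row in ROWS.items()}

if __name__ == "__main__":
    for name, jobs in JOBS_AT_256MG.items():
        print(f"{name:15s} {jobs:3d} parallel jobs")
```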
Error Count: 3 Total Error Count: 3
Miss Count: 6 Total Miss Count: 6
---- ---- ---- ---- ---- ---- ----
LAURO MISSED THAT "VELOCE 2" GATES ARE UP TO 1.4X MORE EXPENSIVE
FPGA-based emulators (like Veloce and Zebu) lose capacity because of FPGA
gate utilization and routing issues. Processor-based systems (Palladium)
get 100% utilization and do not lose gate capacity compared to their specs.
Veloce 2's utilization is 70% for 512+ million gate designs -- and 80% for
designs under 512 million gates. (Their marketing materials have changed
over time as they started by claiming a 1 billion gate system Maximus;
but customers who evaluated Maximus tell me it's really only holding 800MG
or 700MG.)
Zebu utilization can be 60% out of the box per 60 million gate board, but
it can easily drop to 40% due to the chained nature of the boards.
Veloce 2 is better here than Zebu because they use custom FPGAs that can
get better utilization than the standard Xilinx FPGAs that Zebu uses.
For Palladium there is a fixed number of gates that each processor executes
and we test with reference designs to get 100% capacity utilization.
Fig 1. What capacity users actually get vs. what datasheets say
The bricked area is not usable due to wasted FPGA gate utilization. Note
this is for the biggest design possible. Your utilization will be somewhat
better for smaller designs.
SHOW ME THE MONEY
These reduced gate utilizations make the price-per-gate discussion very
entertaining. Price-per-gate needs to be normalized to what one gate
actually means.
                 Cadence               Mentor              Synopsys EVE
                 Palladium-XP2 (GXL)   Veloce 2            Zebu Server 3
  -------------  --------------------  ------------------  ---------------------
  Utilization    100%                  70% for >512MG      60% within a 60MG
                                       designs. 80% for    board down to 40% for
                                       <512MG designs      very large designs
Depending on your design size, for the same capacity purchased, a Zebu user
gets 2.5x fewer usable gates than in Palladium. Veloce2 Quattro users get
1.42x fewer usable gates than in Palladium. Lauro neglected to mention this.
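The normalization behind these factors is just the reciprocal of utilization. Below is a minimal sketch, assuming usable capacity equals datasheet capacity times the utilization figures quoted above. The function and dictionary names are illustrative, not anything from a vendor tool. No prices are given here, so the sketch only computes the gate factor. If the price per datasheet gate were equal across vendors (an assumption), the same factor would apply to price per usable gate.

```python
# Utilization figures as quoted in the text (fraction of datasheet gates usable).
UTILIZATION = {
    "Palladium XP2": 1.00,
    "Veloce 2, designs >512MG": 0.70,
    "Veloce 2, designs <512MG": 0.80,
    "Zebu Server 3, single 60MG board": 0.60,
    "Zebu Server 3, very large designs": 0.40,
}


def usable_gates(datasheet_mg, utilization):
    """Gates (in MG) that can actually be mapped for a given datasheet capacity."""
    return datasheet_mg * utilization


def normalization_factor(utilization):
    """Datasheet gates that must be bought per usable gate."""
    return 1.0 / utilization


if __name__ == "__main__":
    for name, u in UTILIZATION.items():
        print(f"{name:36s} {u:4.0%}  x{normalization_factor(u):.2f}")
```

At 70% utilization the factor is 1/0.70, which is the roughly 1.4x in the section title. At 40% it is the 2.5x quoted for Zebu.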
Error Count: 0 Total Error Count: 3
Miss Count: 2 Total Miss Count: 8
---- ---- ---- ---- ---- ---- ----
LAURO MISSED HOW SMALL JOB GRANULARITY MAKES EMULATION MORE EFFICIENT
Emulation job granularity is nominally 60 million gates per job for Zebu,
nominally 16 million gates/job for Veloce 2 -- and an actual 4.5 million
gates/job for Palladium XP II and 4 million gates/job for Palladium XP.
This is especially crucial as emulation is now used as a computing resource
in corporate data centers. So utilization is really important.
Let me be crystal clear on this:
- For a Zebu, a 40MG job will waste 20MG per job. That capacity
cannot be used for anything else.
- In Veloce 2, a 14MG job will waste 2MG per job. That capacity
cannot be used for anything else.
- For Palladium XP II, if your job size drops below 4.5MG the
difference is wasted, so that is 1.5MG for a 3MG design.
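The three bullets above amount to rounding each job up to a whole number of domains. Here is a minimal sketch of that calculation, ignoring gate utilization. It assumes a job always occupies whole domains; `granularity_waste` is an illustrative name.

```python
import math


def granularity_waste(job_mg, domain_mg):
    """Capacity (MG) left idle when a job is rounded up to whole domains."""
    domains = math.ceil(job_mg / domain_mg)
    return domains * domain_mg - job_mg


# (job size in MG, nominal domain size in MG) from the examples in the text.
EXAMPLES = {
    "Zebu Server 3": (40, 60),
    "Veloce 2": (14, 16),
    "Palladium XP II": (3, 4.5),
}

if __name__ == "__main__":
    for name, (job, domain) in EXAMPLES.items():
        print(f"{name:16s} {job}MG job wastes {granularity_waste(job, domain)}MG")
```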
With the gate utilization added in, the picture changes:
- For a Zebu, a 40MG job will not fit into one board because 60%
utilization of 60MG is only 36MG. Users will waste 32MG of the
36MG usable on the 2nd board that is used to map the remaining 4MG
of their 40MG job. That capacity cannot be used for anything else.
- In Veloce 2, a 14MG job will not fit into one board because 80%
utilization of 16MG is only 12.8MG. Users will waste 14.8MG of
the 16MG 2nd board that is used to map the remaining 1.2MG of their
14MG job. That capacity cannot be used for anything else.
- For Palladium, if your design drops below 4.5MG, the difference
is wasted, so that still is 1.5MG for a 3MG design.
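The same rounding can be sketched with utilization folded in. This assumes each board holds only its nominal size times utilization, and that a job spills onto as many boards as it needs. The bullets quote the Zebu leftover against the second board's usable 36MG, and the Veloce 2 leftover against the board's raw 16MG. The sketch therefore reports both figures for the last board. `map_job` and its field names are illustrative.

```python
import math


def map_job(job_mg, domain_mg, utilization):
    """Map a job onto boards whose usable size is domain_mg * utilization."""
    usable = domain_mg * utilization
    boards = math.ceil(job_mg / usable)
    # Gates that land on the last (partially filled) board.
    spill = job_mg - (boards - 1) * usable
    return {
        "boards": boards,
        "spill_mg": spill,
        "idle_usable_mg": usable - spill,   # leftover vs. usable capacity
        "idle_raw_mg": domain_mg - spill,   # leftover vs. nominal capacity
    }


if __name__ == "__main__":
    cases = {
        "Zebu Server 3": (40, 60, 0.60),
        "Veloce 2": (14, 16, 0.80),
        "Palladium": (3, 4.5, 1.00),
    }
    for name, args in cases.items():
        r = map_job(*args)
        print(f"{name:14s} boards={r['boards']} "
              f"idle usable={r['idle_usable_mg']:.1f}MG "
              f"idle raw={r['idle_raw_mg']:.1f}MG")
```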
Graphically adding the impact of gate utilization looks like:
Fig 3. The impact of Zebu-Veloce-Palladium gate utilization and
job granularity for 40MG-14MG-3MG job sizes
Note that I mapped the three 40MG-14MG-3MG example job sizes here. It is
intuitively obvious that Palladium's small granularity leads to the least
amount of "wasted" emulation space (the bricked area).
                         Cadence               Mentor      Synopsys EVE
                         Palladium-XP2 (GXL)   Veloce 2    Zebu Server 3
  ---------------------  --------------------  ----------  --------------
  Smallest Granularity   4.5MG                 16MG        60MG
  per job ("Domain")
I will run more analysis on this later, but needless to say, Palladium maps
most efficiently. Lauro neglected to mention this.
Error Count: 0 Total Error Count: 3
Miss Count: 3 Total Miss Count: 11
---- ---- ---- ---- ---- ---- ----
LAURO MISSED PALLADIUM'S MEMORY-TO-GATE RATIO IS UP TO 8X BETTER
Your chip has memories, right? This is the Semico prediction on how a chip
is actually partitioned:
Fig 4. Semico predicts SoCs will be 90% memory by 2019
For 2014, Semico finds that chips are 85% memory. This is on-chip memory
that your emulator has to model. It does not count the external memory
your SoC is connected to, which needs to be modeled as well. Memory is
also used for the trace data collected during emulation. Here is how the
different emulators support memory:
- Veloce 2 has 2GB for memory per 16MG job (or whatever is left after
users find out about utilization)
- Zebu Server 3 has 24GB for memory per 60MG job (or whatever is left
after users find out about utilization)
- Palladium XP has 64GB for memory per 64MG job, and Palladium XPII
goes up to 128GB for memory per 72 MG job.
So Palladium XP's memory-to-gate ratio is up to 8x better than Veloce 2 and
2.5x better than Zebu. Add utilization effects and the ratio is higher.
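The 8x and 2.5x figures can be checked by dividing memory per job by gates per job. Below is a minimal sketch using only the numbers quoted in the list above. It does not apply the utilization effects mentioned in the text, and the names `SYSTEMS`, `gb_per_mg`, and `advantage` are illustrative.

```python
# (memory in GB, job size in MG) per emulation job, as quoted in the text.
SYSTEMS = {
    "Veloce 2": (2, 16),
    "Zebu Server 3": (24, 60),
    "Palladium XP": (64, 64),
    "Palladium XPII": (128, 72),
}


def gb_per_mg(mem_gb, gates_mg):
    """Memory-to-gate ratio in GB per million gates."""
    return mem_gb / gates_mg


DENSITY = {name: gb_per_mg(*spec) for name, spec in SYSTEMS.items()}


def advantage(system, baseline):
    """How many times more memory per gate `system` offers than `baseline`."""
    return DENSITY[system] / DENSITY[baseline]


if __name__ == "__main__":
    for baseline in ("Veloce 2", "Zebu Server 3"):
        print(f"Palladium XP vs {baseline}: "
              f"{advantage('Palladium XP', baseline):.1f}x")
```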
We've seen evals where customers struggled to map their memories into Veloce
and Zebu Server. When a design requires more memory, the usable Veloce/Zebu
emulation capacity goes down further, limited by the overall memory the
design requires. Lauro missed this.
Error Count: 0 Total Error Count: 3
Miss Count: 1 Total Miss Count: 12
---- ---- ---- ---- ---- ---- ----
CONTINUED IN Pt 2 BELOW...
---- ---- ---- ---- ---- ---- ----
Pt 1 - Lauro missed Veloce2 and Zebu have lame gate utilization
Pt 2 - Lauro missed Palladium job throughput is 3X faster vs. Zebu
Pt 3 - Lauro missed energy costs is intrinsic power use over time
Pt 4 - Lauro errs on channel latency, sim acceleration, and ICE
---- ---- ---- ---- ---- ---- ----
RELATED ARTICLES
Hogan follows up on emulation user replies plus market share data
Hogan warns Lauro missed emulation's TOTAL power use footprint
The 14 metrics - plus their gotchas - used to select an emulator