Pt 1 - Lauro missed Veloce2 and Zebu have lame gate utilization

( ESNUG 553 Item 1 ) -------------------------------------------- [11/13/15]

Subject: Pt 1 - Lauro missed Veloce2 and Zebu have lame gate utilization

    [ phone rings ]

  Cooley: "Hello"

   Frank: "Hey, Cooley, I don't like this post that Lauro did for Veloce.
           It has too many errors and things that are missing in it about
           Palladium and ZeBu.  Can I write a rebuttal?"

  Cooley: "Yea, sure... If you keep it factual and not marketing crap."

   Frank: "Any word limit?"

  Cooley: "No limit, if it's detailed tech talk.  No marketing crap."

What follows is how I learned to never ask a German to give detailed tech
talk on why something is wrong.  And always give a word limit.  - John

         ----    ----    ----    ----    ----    ----    ----

Today, all 3 emulators choices can do the job, some better than others.

- Lauro Rizzatti, emulation consultant http://www.deepchip.com/items/0547-09.html


From: [ Frank Schirrmeister of Cadence ]

Hi John,

I was highly entertained to read the so-called emulation "update" that
"Lauro Rizzatti, emulation consultant" wrote on behalf of Mentor.  From
By my count his update had 9 errors, plus Lauro missed 22 related issues.
Here are the details.

    - Frank Schirrmeister
      Cadence Design Systems, Inc.               San Jose, CA

         ----    ----    ----    ----    ----    ----    ----

LAURO ERRS ON NUMBER OF PARALLEL JOBS & HE MISSED SCALING CAPACITIES

Lauro's table in ESNUG 547 #1 deliberately misleads the reader into thinking
Veloce 2 supports the most users.  Lauro failed to scale everything in his
table to be a fair apples-to-apples comparison.

Also, a fair comparison would use proper common references:

    - Typical emulation capacity
    - Minimum emulator cabinet size
    - Specified max emulation capacity

In addition, his max capacity is wrong for both SNPS and CDNS.  CDNS said
2.3 billion gates in September 2013.  SNPS said 3 billion in February 2014.

Here is Lauro's original table:

                          Cadence            Mentor             Synopsys EVE
                          Palladium-XP2      Veloce 2           Zebu Server 3
                          (GXL)
  ----------------------  -----------------  -----------------  -----------------
  Single Cabinet          72 million         1 billion          300 million
  Capacity                ASIC-gates         ASIC-gates         ASIC-gates

  Total Max Capacity      1.1 billion        2.0 billion        2.0 billion
                          ASIC-gates in      ASIC-gates in      ASIC-gates in
                          16 cabinets        2 cabinets         7 cabinets

  # of Users per Cabinet  16 users           64 users           5 users


And here is how the table should actually look if he were objective:

                          Cadence            Mentor             Synopsys EVE
                          Palladium-XP2      Veloce 2           Zebu Server 3
                          (GXL)
  ----------------------  -----------------  -----------------  -----------------
  Single Cabinet          72 million         1 billion          300 million
  Capacity                ASIC-gates         ASIC-gates         ASIC-gates

  Total Max Capacity      1.1 billion        2.0 billion        2.0 billion
                          ASIC-gates in      ASIC-gates in      ASIC-gates in
                          16 cabinets        2 cabinets         7 cabinets

  # of Users per Cabinet  16 users           64 users           5 users

  Minimum cabinet         72 million         256 million        300 million
  capacity (full)         ASIC-gates         ASIC-gates         ASIC-gates
                          Palladium P16      Veloce 2 Quattro   ZeBu 3

  # of users/jobs per     16 users           16 users           5 users
  minimum cabinet

  Total specified max     2.3 billion        2.0 billion        3.0 billion
  capacity                ASIC-gates in      ASIC-gates in      ASIC-gates in
                          32 cabinets        2 cabinets         10 cabinets

  Total users at          512 users          128 users          30 users
  specified max capacity

  # of parallel           64 jobs            16 jobs            5 jobs
  Users/Jobs for          4 cabinets         1 cabinet          1 cabinet
  typical 256MG           256MG              288MG              300MG

What Lauro missed is that Palladium is designed to scale to the highest
number of both parallel users (64 users) and jobs (64 jobs).

                Error Count: 3    Total Error Count: 3
                 Miss Count: 6     Total Miss Count: 6

         ----    ----    ----    ----    ----    ----    ----

LAURO MISSED THAT "VELOCE 2" GATES ARE UP TO 1.4X MORE EXPENSIVE

FPGA-based emulators (like Veloce and Zebu) lose capacity because of FPGA
gate utilization and routing issues.  Processor-based systems (Palladium)
get 100% utilization and do not lose gate capacity compared to their specs.

Veloce 2's utilization is 70% for 512+ million gate designs -- and 80% for
designs under 512 million gates.  (Their marketing materials have changed
over time; they started by claiming a 1 billion gate system, Maximus,
but customers who evaluated Maximus tell me it's really only holding 800MG
or 700MG.)

Zebu utilization can be 60% out of the box per 60 million gate board, but
it can easily drop to 40% due to the chained nature of the boards.

Veloce 2 is better here than Zebu because they use custom FPGAs that can
get better utilization than the standard Xilinx FPGAs that Zebu is using.

For Palladium there is a fixed number of gates that each processor executes
and we test with reference designs to get 100% capacity utilization.

       Fig 1. What capacity users actually get vs. what datasheets say

The bricked area is not usable due to wasted FPGA gate utilization.  Note
this is for the biggest design possible.  Your utilization will be somewhat
better for smaller designs.

SHOW ME THE MONEY

These reduced gate utilizations make the price-per-gate discussion very
entertaining.  Price-per-gate needs to be normalized on what one gate
actually means.

                          Cadence            Mentor             Synopsys EVE
                          Palladium-XP2      Veloce 2           Zebu Server 3
                          (GXL)
  ----------------------  -----------------  -----------------  -----------------
  Utilization             100%               70% for >512MG     60% within a 60MG
                                             designs; 80% for   board, down to
                                             <512MG designs     40% for very
                                                                large designs

Depending on your design size, for the same capacity purchased, a Zebu user
gets 2.5x fewer usable gates than with Palladium.  Veloce2 Quattro users get
1.42x fewer usable gates than with Palladium.  Lauro neglected to mention
this.
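
The ratios above follow directly from the utilization figures.  A minimal
Python sketch of that arithmetic, assuming utilization is simply the usable
fraction of datasheet gates (the function and dictionary names are
illustrative, not from any vendor tool; 1/0.70 is about 1.43, which the text
quotes truncated as 1.42x):

```python
# Utilization figures as quoted in the text: the fraction of datasheet
# gates that is actually usable. Names here are illustrative only.
UTILIZATION = {
    "Palladium XP2": 1.00,
    "Veloce 2 (>512MG designs)": 0.70,
    "Veloce 2 (<512MG designs)": 0.80,
    "Zebu Server 3 (within one 60MG board)": 0.60,
    "Zebu Server 3 (very large designs)": 0.40,
}


def usable_gates(datasheet_mg, utilization):
    """Usable capacity in million gates for a given datasheet capacity."""
    return datasheet_mg * utilization


def datasheet_gates_per_usable_gate(utilization):
    """How many datasheet gates must be bought to get one usable gate."""
    return 1.0 / utilization


if __name__ == "__main__":
    for name, util in UTILIZATION.items():
        print(f"{name:40s} "
              f"{usable_gates(1000, util):7.1f} MG usable per 1000 MG bought, "
              f"{datasheet_gates_per_usable_gate(util):.2f}x price per usable gate")
```

The same multiplier applies to price: a quoted price-per-gate has to be
scaled by this factor before two emulators can be compared.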

                Error Count: 0    Total Error Count: 3
                 Miss Count: 2     Total Miss Count: 8

         ----    ----    ----    ----    ----    ----    ----

LAURO MISSED HOW SMALL JOB GRANULARITY MAKES EMULATION MORE EFFICIENT

Emulation job granularity is nominally 60 million gates per job for Zebu,
nominally 16 million gates/job for Veloce 2 -- and an actual 4.5 million
gates/job for Palladium XP II and 4 million gates/job for Palladium XP.

This is especially crucial as emulation is now used as a computing resource
in corporate data centers.  So utilization is really important.

Let me be crystal clear on this:

   - For a Zebu, a 40MG job will waste 20MG per job.  That capacity
     cannot be used for anything else.
   - In Veloce 2, a 14MG job will waste 2MG per job.  That capacity
     cannot be used for anything else.
   - For Palladium XP II, if your job size drops below 4.5MG the
     difference is wasted, so that is 1.5MG for a 3MG design. 
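
The three cases above are the same calculation: round the job up to a whole
number of domains and subtract the job size.  A small Python sketch, assuming
a job always occupies whole domains and nothing else can share them (names
are illustrative):

```python
import math

# Nominal job granularity ("domain" size) in million gates, as quoted above.
GRANULARITY_MG = {
    "Zebu": 60.0,
    "Veloce 2": 16.0,
    "Palladium XP II": 4.5,
}


def wasted_capacity(job_mg, granularity_mg):
    """Capacity (in MG) allocated to a job but not used by it."""
    domains = math.ceil(job_mg / granularity_mg)
    return domains * granularity_mg - job_mg


if __name__ == "__main__":
    examples = [("Zebu", 40.0), ("Veloce 2", 14.0), ("Palladium XP II", 3.0)]
    for emulator, job in examples:
        waste = wasted_capacity(job, GRANULARITY_MG[emulator])
        print(f"{emulator}: a {job}MG job wastes {waste}MG")
```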

With the gate utilization added in, the picture changes:

   - For a Zebu, a 40MG job will not fit into one board because 60%
     utilization of 60MG is only 36MG.  Users will waste 32MG of the
     2nd board that is used to map the remaining 4MG of their 40MG job.
     That capacity cannot be used for anything else.
   - In Veloce 2, a 14MG job will not fit into one board because 80%
     utilization of 16MG is only 12.8MG.  Users will waste 14.8MG of
     the 2nd board that is used to map the remaining 1.2MG of their
     14MG job.  That capacity cannot be used for anything else.
   - For Palladium, if your design drops below 4.5MG, the difference
     is wasted, so that still is 1.5MG for a 3MG design.
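
The same mapping can be sketched in Python with utilization folded in.  This
is only an illustration under the assumptions stated in the list: each board
holds its nominal size times the utilization, and a job spills onto a second
board once the first is full.  The sketch reports the leftover on the last
board two ways, against usable capacity and against nominal board size; the
32MG Zebu figure matches the first convention and the 14.8MG Veloce 2 figure
matches the second.

```python
import math


def map_job(job_mg, board_mg, utilization):
    """Map a job onto boards whose usable size is board_mg * utilization."""
    usable = board_mg * utilization
    boards = math.ceil(job_mg / usable)
    # Gates that land on the last (partially filled) board.
    on_last_board = job_mg - (boards - 1) * usable
    return {
        "boards": boards,
        "usable_per_board": usable,
        "on_last_board": on_last_board,
        "left_usable": usable - on_last_board,   # vs. usable capacity
        "left_nominal": board_mg - on_last_board,  # vs. nominal board size
    }


if __name__ == "__main__":
    cases = [
        ("Zebu", 40.0, 60.0, 0.60),
        ("Veloce 2", 14.0, 16.0, 0.80),
        ("Palladium XP II", 3.0, 4.5, 1.00),
    ]
    for name, job, board, util in cases:
        r = map_job(job, board, util)
        print(f"{name}: {r['boards']} board(s), "
              f"{r['on_last_board']:.1f}MG on the last one, "
              f"{r['left_usable']:.1f}MG usable / "
              f"{r['left_nominal']:.1f}MG nominal left over")
```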

Graphically adding the impact of gate utilization looks like:

    Fig 3. The impact of Zebu-Veloce-Palladium gate utilization and
           job granularity for 40MG-14MG-3MG job sizes

Note that I mapped the three 40MG-14MG-3MG example job sizes here.  It is
intuitively obvious that Palladium's small granularity leads to the least
amount of "wasted" emulation space (the bricked area).

                          Cadence            Mentor             Synopsys EVE
                          Palladium-XP2      Veloce 2           Zebu Server 3
                          (GXL)
  ----------------------  -----------------  -----------------  -----------------
  Smallest Granularity    4.5MG              16MG               60MG
  per job ("Domain")

I will run more analysis on this later, but needless to say, Palladium maps
most efficiently.  Lauro neglected to mention this.

               Error Count:  0    Total Error Count:  3
                Miss Count:  3     Total Miss Count: 11

         ----    ----    ----    ----    ----    ----    ----

LAURO MISSED PALLADIUM'S MEMORY-TO-GATE RATIO IS UP TO 8X BETTER

Your chip has memories, right?  This is the Semico prediction on how a chip
is actually partitioned:

         Fig 4. Semico predicts SoCs will be 90% memory by 2019

For 2014, Semico finds that chips are 85% memory.  This is on-chip memory
that your emulator has to model.  It does not count the external memory
your SoC is connected to, which needs to be modeled as well.  Memory is
also used for the trace data collected during emulation.  Here is how the
different emulators support memory:

    - Veloce 2 has 2GB for memory per 16MG job (or whatever is left after
      users find out about utilization) 
    - Zebu Server 3 has 24GB for memory per 60MG job (or whatever is left
      after users find out about utilization)
    - Palladium XP has 64GB for memory per 64MG job, and Palladium XPII
      goes up to 128GB for memory per 72MG job.

So Palladium XP's memory-to-gate ratio is up to 8x better than Veloce 2 and
2.5x better than Zebu.  Add utilization effects and the ratio is higher.
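
The 8x and 2.5x figures come from dividing memory by gates per job.  A short
Python sketch of that calculation, using only the numbers listed above and
ignoring utilization (names are illustrative):

```python
# (memory in GB, job size in million gates), as listed in the text.
CONFIGS = {
    "Veloce 2": (2, 16),
    "Zebu Server 3": (24, 60),
    "Palladium XP": (64, 64),
    "Palladium XP II": (128, 72),
}


def gb_per_mg(memory_gb, gates_mg):
    """Memory available per million gates of job capacity."""
    return memory_gb / gates_mg


RATIOS = {name: gb_per_mg(*cfg) for name, cfg in CONFIGS.items()}


def advantage(emulator, baseline):
    """How many times more memory per gate `emulator` has than `baseline`."""
    return RATIOS[emulator] / RATIOS[baseline]


if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name:16s} {ratio:.3f} GB per MG")
    print(f"Palladium XP vs Veloce 2:      {advantage('Palladium XP', 'Veloce 2'):.1f}x")
    print(f"Palladium XP vs Zebu Server 3: {advantage('Palladium XP', 'Zebu Server 3'):.1f}x")
```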

We've seen evals where customers struggled to map their memories into Veloce
and Zebu Server.  When a design needs more memory than a job's allotment,
the usable Veloce/Zebu emulation capacity drops further, limited by the
overall memory the design requires.  Lauro missed this.

               Error Count:  0    Total Error Count:  3
                Miss Count:  1     Total Miss Count: 12

         ----    ----    ----    ----    ----    ----    ----

CONTINUED IN Pt 2 BELOW...

         ----    ----    ----    ----    ----    ----    ----

  Pt 1 - Lauro missed Veloce2 and Zebu have lame gate utilization
  Pt 2 - Lauro missed Palladium job throughput is 3X faster vs. Zebu
  Pt 3 - Lauro missed energy costs is intrinsic power use over time
  Pt 4 - Lauro errs on channel latency, sim acceleration, and ICE

         ----    ----    ----    ----    ----    ----    ----

RELATED ARTICLES

  Hogan follows up on emulation user replies plus market share data
  Hogan warns Lauro missed emulation's TOTAL power use footprint
  The 14 metrics - plus their gotchas - used to select an emulator
