( ESNUG 532 Item 1 ) -------------------------------------------- [09/06/13]
Subject: CDNS says Hogan missed granularity, user access, speed, capacity
> Watch out. When an emulator sales guy says his partitioning is
> automated, he's only making that claim very loosely.
>
> And don't get me started about FPGA memory mapping limits.
>
> - One of 37 users responding to Jim Hogan's emulation post
> http://www.deepchip.com/items/0530-01.html
From: [ Frank Schirrmeister of Cadence ]
Hi, John,
After seeing so many users respond to Jim Hogan's emulation post in
ESNUG 522, I felt I should give the Cadence reply to what Hogan said.
---- ---- ---- ---- ---- ---- ----
Jim got the emulation challenges in ESNUG 522 #1 right on. We fully agree
that:
1. hardware assisted verification is becoming ubiquitous,
2. we do live in a software dominated world in which software more
and more defines the constraints of the hardware architecture and
3. of course IP use and re-use are exploding and leading to the
average number of IP blocks and their cost-to-design growing.
Jim was spot on with this!
Two aspects from this section are worth re-emphasizing:
1. The need for flexible design granularity and
2. the role of software requiring multiple users.
---- ---- ---- ---- ---- ---- ----
PALLADIUM VS. VELOCE 2 GRANULARITY:
> Because of compute-speed and capacity gains, emulation is becoming an
> attractive option for more mainstream verification tasks; such as
> verifying individual IP blocks in the low millions of gates. Emulation
> is even now needed for smaller, under 1 million gate designs -- if
> there's lots of control complexity with a large number of cycles -- such
> as with an H.265 encoder/decoder. In fact, simulating high density
> videos on an H.265 decoder would be impractical (because it would take
> weeks to do) if it couldn't be simulated in seconds in an emulator.
>
> - from http://www.deepchip.com/items/0522-01.html
As Jim points out, to be effective, emulation needs to support various
levels of granularity from smaller IP blocks in the million gate range to
full blown SoCs with hundreds of millions of gates.
Let's look at the different boxes based on the datasheets for Palladium and
Veloce/Veloce 2, comparing a 256 MG configuration.
That's a Palladium XP P64 with 64 domains vs. a Veloce 2 Quattro holding
16 boards. (Of course there is Synopsys/EVE/Zebu as well, but in his big
summary table Jim Hogan kept EVE separate while combining Veloce and
Palladium in the same category, so I focus on those two.)
The following table sums up the basics on granularity, multi-user access,
speed and capacity.
                  | Palladium P64                  | Veloce 2 Quattro
 -----------------+--------------------------------+----------------------------
 Granularity      | 4 MG to 256 MG, resulting in   | 16 MG to 256 MG
                  | much better utilization        |
 -----------------+--------------------------------+----------------------------
 Number of users  | 64 users                       | 16 users
 -----------------+--------------------------------+----------------------------
 Speed            | 2 MHz (as per datasheet),      | 1-1.5 MHz (as per datasheet),
                  | scaling with design size       | degrading with design size
                  |                                | due to architecture
 -----------------+--------------------------------+----------------------------
 Capacity         | 256 MG nominal, 90% to 100%    | 256 MG nominal, 60% to 75%
                  | utilization => 256 MG actual   | utilization => 200 MG actual
                  | capacity                       | capacity
Table 1: Difference in hardware utilization for smaller granularity
Palladium's flexible granularity allows its users to choose design sizes in
increments of 4 million gates (MG) covering the full spectrum from small
designs of 4 MG to 256 MG, for up to 64 users. Compare that granularity to
a minimum of 16 MG in a Quattro allowing 16 users. Palladium supports 4x
more users at 4x better granularity.
Fig 1: Difference in hardware utilization for smaller granularity
Why does that matter? See the picture above. Imagine that you are running
64 regression runs of three 4 MG designs. For Palladium that is one run.
To get to the same throughput on a Quattro, you need to run 4 times with
each of the 16 MG boards holding a 4 MG design. And don't forget re-booting
your Quattro in-between runs to re-configure (no kidding, that's what users
tell us, and it makes perfect sense given that Veloce is custom-FPGA-based).
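The pass arithmetic in that scenario works out as a simple ceiling division
(a back-of-the-envelope sketch; the slot counts -- 64 domains of 4 MG on the
P64, 16 boards of 16 MG minimum on the Quattro -- are the ones assumed in the
scenario above, not datasheet quotes):

```python
import math

def regression_passes(num_jobs, parallel_slots):
    """Sequential emulator passes needed to run all jobs, given
    how many jobs the box can hold at once."""
    return math.ceil(num_jobs / parallel_slots)

jobs = 64  # 64 regression jobs, each a 4 MG design

# Palladium P64: 64 domains of 4 MG each -> all 64 jobs fit in one pass
print(regression_passes(jobs, 64))   # 1

# Veloce 2 Quattro: 16 boards, one 4 MG job per 16 MG board -> 4 passes
print(regression_passes(jobs, 16))   # 4
```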
Your readers can check out a great user example for this scenario in ARM's
presentation from CDNLive India 2012 on how they verified big.LITTLE [Ref 1].
On slide 15 they show how they ran
one trillion Palladium XP cycles per week of processor clusters of dual core
Cortex A7 MP2, Cortex A7/A15 MP4 and dual core Cortex A15 MP4 with gate
counts of 13 MG, 28.5 MG and 41.4 MG at frequencies of 1.33 MHz, 1.13 MHz
and 1.02 MHz in 4, 8 and 11 clock domains, respectively. They have a cool
graph on slide 11 that shows utilization per domain -- this flexibility is
unique to Palladium XP.
Another example, focused on software, is one you may have seen yourself last week
at CDNLive Boston 2013. Medtronic users presented their paper [Ref 2] on
"Early Hardware and Firmware Co-Development Using Hardware Acceleration"
and how they attached a physical debugger unit in dynamic target mode.
The resulting set-up allowed Medtronic to debug their design with firmware
running on it. Hardware and tests stop for them when the software runs into
a breakpoint. This is where the number of users becomes critical. In
Palladium XP, 4x more users can be connected to the emulator at the same
time for interactive debug. That makes it much more cost efficient compared
to Veloce or Zebu.
---- ---- ---- ---- ---- ---- ----
PALLADIUM SW BRING UP:
There are numerous other examples of software bring-up on Palladium out there.
LSI Logic at DAC 2012 [Ref 3] showed how they brought up software on a
combination of Palladium, RPP and Virtual Prototypes ("A Deterministic Flow",
showing the need for different engines as described in Item 5 below).
Nufront at CDNLive China 2012 [Ref 4] showed how they boot Linux and Android
on a processor ("NS115 System Emulation Based on Cadence Palladium XP").
---- ---- ---- ---- ---- ---- ----
SPEED VS. GATE COUNT
Hogan's data on speed and capacity:
> Emulator speed is measured in cycles-per-second. Processor-based speeds
> range from 100 K to 4 M cycles/sec, while FPGA-based range from 500 K to
> 50 M cycles/sec, depending on the number of devices.
>
> - from http://www.deepchip.com/items/0522-03.html
> Speed range for Palladium and Veloce is 100K to 2M cycles-per-second
>
> - from http://www.deepchip.com/items/0522-04.html
Jim's data is right on. If one optimizes FPGAs by hand, then one gets high
speed. That's why we at Cadence offer Palladium XP for emulation and RPP
for FPGA-based prototyping. Let's stay, for the time being, with emulation.
According to the datasheets, both Cadence Palladium and Mentor Veloce 2
support up to 2 billion gates. Datasheet speed is specified at 2 MHz for
Palladium XP and 1 to 1.5 MHz for Veloce (their latest datasheet no longer
mentions the speed specification, but the older version from last year did).
Now, if you take into account the difference between processor-based and
FPGA-based emulation, Palladium pulls quite far ahead.
For speed, processor-based emulation scales just fine with design size.
There's some speed degradation (we've seen 0.1 MHz per 10 M gates) for larger
designs, but by far not as much as for FPGA-based systems. The reason is that
Palladium XP is hierarchical and shares memory -- lots of it -- and compiles
the design into the bit-level processors of its architecture. That's why
we know exactly how fast the design will run once compile is done; we
eliminated the need for the backplanes and cross bars that are used to connect
FPGAs and route signals across them.
Mentor Veloce 2 has a switching backplane and virtual-wires while Synopsys
Zebu server uses a cross bar and Time Division Multiplexing (TDM) to connect
signals. Hence, for FPGA-based emulation like Veloce and the Zebu Server, the
speed situation is quite different.
As designs grow larger in gate count, Veloce/Zebu execution speed decreases
quite dramatically due to routing and interconnect issues within and between
its FPGAs. And that's not even accounting for slow down caused by probes for
debug. Palladium speeds barely degrade (10%) as your design gets larger!
The situation for capacity is similar to that for speed.
---- ---- ---- ---- ---- ---- ----
CAPACITY AND UTILIZATION:
> There are subtle issues around an emulator's capacity; a given capacity
> can be delivered in multiple ways. First, there is the straightforward
> capacity measurement in terms of total number of gates: that range for
> emulators currently ranges from 2 million to 2 billion gates.
>
> Second, there is the granularity in terms of the number of devices (ASICs
> or FPGAs), boards, or boxes that are used to reach a given capacity.
> Processor-based emulators are architected to provide capability in a
> seamless way, where it looks monolithic to users. FPGA-based emulators,
> if it is a vertically integrated box, it should also appear monolithic
> to a user.
>
> - from http://www.deepchip.com/items/0522-03.html
In processor-based emulation like Palladium, we mean it when we say 4 MG per
clock domain. Palladium does not have the concept of gates. It has a huge
number of processors with associated memory that execute the desired RTL
functionality after a compiler has mapped it to our processor array. To
calculate gate count, we took a suite of designs, including approved customer
designs, compiled them into the Palladium processor array, and statistically
found how many gates a processor and its memory represent. Four million
gates divided by that number gives the number of processors for a 4 MG
domain. The resulting variance in how a customer design actually maps is
pretty small, so we can get up to about 110% utilization.
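The calibration described above boils down to one division. In this sketch the
gates-per-processor figure is a purely hypothetical placeholder -- the real
value comes from Cadence's internal design-suite statistics:

```python
def processors_per_domain(domain_gates, gates_per_processor):
    """Processors needed to cover one clock domain, given the
    statistically derived average gates each processor represents."""
    return domain_gates / gates_per_processor

# 4 MG domain; gates_per_processor = 1000 is an illustrative guess,
# NOT a published Palladium figure.
print(processors_per_domain(4_000_000, 1_000))  # 4000.0
```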
FPGA-based emulators simply assume how many gates a Look Up Table (LUT)
represents, and that is their published number. The reality is FPGA-based
systems (Veloce/EVE) never fully utilize the nominal gates per FPGA. Users
typically get to 60-75% utilization.
There was a time when a Veloce Quattro with a nominal 256 MG was actually
marketed as a 200 MG system due to FPGA utilization mark-downs. Their latest
datasheet, however, has reverted back to 256 MG. Utilization principles for
FPGAs have not changed just because a bigger number is printed -- users see
about 200 M actual usable gates for a Veloce 2 Quattro vs. 256 MG in
Palladium. Palladium's capacity is about 30% higher.
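The effective-capacity arithmetic is straightforward (a simple illustration
using the nominal capacity and utilization ranges quoted above):

```python
def effective_capacity(nominal_mg, utilization):
    """Usable gate capacity = nominal capacity (MG) x achievable utilization."""
    return nominal_mg * utilization

# Veloce 2 Quattro: 256 MG nominal at 60-75% utilization
print(effective_capacity(256, 0.60))  # 153.6
print(effective_capacity(256, 0.75))  # 192.0, roughly the ~200 MG figure

# Palladium: 256 MG nominal at up to 100% utilization
print(effective_capacity(256, 1.00))  # 256.0
```

256 vs. ~192-200 usable MG is where the "about 30% higher" capacity claim
comes from.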
---- ---- ---- ---- ---- ---- ----
EMULATOR NODE TALK IS MISLEADING
> Palladium was caught asleep at the wheel a full node behind the new
> Veloce 2 and has been woefully losing sales to MENT since then. ...
>
> As I told you before, Palladium's losing share cause it's a full node
> behind Wally's Veloce 2.
>
> - from http://www.deepchip.com/items/0531-04.html
There has been commentary by our friendly competitors in your last ESNUG
post that Palladium is "one node behind".
Well, parameters like granularity, speed, capacity and multi-user access
are determined by three things:
 1. the system fabric itself (FPGA vs. CPU),
 2. the system architecture (interconnect, memory, etc.), and
 3. the actual silicon.
As I mentioned above, it is actually the architecture of interconnect and
memory that determines most of the parameters for FPGA-based emulation.
Faster, lower-power chips will not result in higher performance or lower
power systems, as Amdahl's law kicks in. Given the fundamental architecture
differences, the process node is really not that relevant, and everyone's
datasheets nicely show that.
The reality is that Veloce 2 at its introduction last year actually never
caught up with Palladium's 4x better granularity, 4x more users in parallel,
faster execution speed that scales with design size and 30% more capacity.
Simply put, their node FUD is a red herring.
- Frank Schirrmeister
Cadence Design Systems, Inc. San Jose, CA
---- ---- ---- ---- ---- ---- ----
Related Articles
CDNS says Hogan missed FPGA compile time, Rent's Rule, probing
CDNS says Hogan missed close to 10 emulation customer use models
CDNS says Hogan missed 5 metrics/gotchas for picking an emulator
CDNS says Hogan missed 47 Palladium user papers on Cadence.com