Jolly's quick-and-dirty cheat sheet for those exploring Big Data

( ESNUG 554 Item 2 ) -------------------------------------------- [12/10/15]

Subject: Jolly's quick-and-dirty cheat sheet for those exploring Big Data


All Big Data systems share these common traits:

  - Data is broken into many small pieces called "shards".

  - Shards are stored and distributed across many smaller cheap disks.

  - These cheap disks exist on cheap Linux machines.
    Cheap == low memory, consumer-grade disks and CPUs.

  - Shards can be stored redundantly across multiple disks, to
    build resiliency.  (Cheap disks and cheap computers have
    higher failure rates).

Big Data software (like Hadoop) uses simple, powerful techniques so the
data and compute are massively parallel.

      - from http://www.deepchip.com/items/0554-01.html
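
Here's a rough Python sketch of the sharding traits listed above: cut the
data into shards, spread the shards over several cheap disks, and keep
more than one copy of each.  The disk names, the shard size, and the
round-robin placement are all made up for illustration.  Real systems such
as Hadoop use their own, more sophisticated placement policies.

```python
def split_into_shards(data: bytes, shard_size: int) -> list[bytes]:
    """Break the data into pieces of at most shard_size bytes."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def place_shards(num_shards: int, disks: list[str], replicas: int) -> dict[int, list[str]]:
    """Assign each shard to `replicas` distinct disks, round-robin.

    Round-robin is an assumption for this sketch, not Hadoop's policy.
    """
    if replicas > len(disks):
        raise ValueError("need at least as many disks as replicas")
    placement = {}
    for shard_id in range(num_shards):
        start = shard_id % len(disks)
        # Consecutive disks (wrapping around) hold the copies of this shard.
        placement[shard_id] = [disks[(start + r) % len(disks)] for r in range(replicas)]
    return placement


def readable_after_failure(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """True if every shard still has at least one copy on a working disk."""
    return all(any(disk not in failed for disk in copies)
               for copies in placement.values())


shards = split_into_shards(b"x" * 10, shard_size=4)           # 3 shards
disks = ["disk0", "disk1", "disk2", "disk3"]                  # hypothetical names
placement = place_shards(len(shards), disks, replicas=2)
```

With two copies of each shard, losing any single disk leaves all the data
readable.  Losing two neighboring disks (say disk1 and disk2) can take out
both copies of one shard, which is why real deployments often keep more
replicas.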


From: [ John "Jolly" Lee of Apache Ansys ]

Hi, John,

P.S. Here are some cheat sheet links for those who want to explore Big Data more.

By far, the most popular Big Data system is Hadoop.  It's open source and
was originally developed by some Yahoo engineers

         http://en.wikipedia.org/wiki/Apache_Hadoop

who took the ideas from the seminal Google MapReduce research paper

         http://research.google.com/archive/mapreduce.html
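
For those who haven't seen MapReduce, here's a toy word-count sketch in
plain Python.  It is not Hadoop's API.  The function names are invented for
illustration, and everything runs in one process.  In a real system each
map call would run on the machine that holds the shard, and the grouping
step would move data across the network.

```python
from collections import defaultdict


def map_phase(shard: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one shard of text."""
    return [(word.lower(), 1) for word in shard.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(shards: list[str]) -> dict[str, int]:
    # Each shard is mapped independently, so this loop could be parallel.
    mapped = [pair for shard in shards for pair in map_phase(shard)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big shards", "data data"])
```

The point is that map_phase only ever looks at one shard and reduce_phase
only ever looks at one key, so both steps can be spread over many cheap
machines.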

It's not hard for a SW engineer to start playing with Big Data systems.
Hadoop is available for download

         http://hadoop.apache.org/releases.html

but it usually requires root access to install.  Or you can use Amazon's
Elastic MapReduce to start playing with pre-installed systems

         http://aws.amazon.com/elasticmapreduce

Most companies will want to go with a commercially supported solution.


         Cloudera.com       MapR.com       Hortonworks.com


Just as many companies go with Red Hat for a commercially supported Linux
distro, many companies are also going with Cloudera, MapR, and Hortonworks
for Hadoop-based Big Data systems.

It's interesting to note that Intel invested roughly $740M into Cloudera,
Google Capital led a $110M funding round for MapR, and Hortonworks is a
spin-out from Yahoo.   Other
Big Data related companies that your readers may want to read up on are:
Databricks, Splunk, Pivotal, Palantir, MongoDB, and Tableau.

    - John "Jolly" Lee
      Ansys-Apache, Inc.                         San Jose, CA

        ----    ----    ----    ----    ----    ----   ----

Related Articles

    ANSS "Jolly" on why Big Data is a bad fit for EDA and chip design


Copyright 1991-2024 John Cooley. All Rights Reserved.


   !!!     "It's not a BUG,
  /o o\  /  it's a FEATURE!"
 (  >  )
  \ - / 
  _] [_     (jcooley 1991)