Some days ago I was going through an article on Big Data. I couldn’t make head or tail of it. I immediately asked IBM and my colleague working at IBM to help me get going on big data. Both of them pointed me to Big Data University and to free e-books by IBMers.

I was going through the book, actively trying to link the different pieces like HDFS, MapReduce, Hadoop, Pig, Hive, Jaql, ZooKeeper, Flume etc. Then I realized:

Hadoop and its different components make up a specialized computing system for big data.

The different components of Hadoop and the components of an OS are strikingly similar.

What’s a Computing System?

A Computing System (CS) (a term coined by me, so don’t google it) comprises many components:

  1. Storage system to store the data submitted via input (e.g. hard disks)
  2. Input devices which produce data streams (e.g. keyboard, sensors)
  3. Output devices (e.g. screen)
  4. Operating system that manages the show for users and hardware (e.g. Windows, Mac)
  5. Machine language, aka the machine instruction set (e.g. x86 and extensions such as Intel SSE, MMX and VT-x)
  6. High-level languages for writing apps and scripts (e.g. C, C++, Java, Python)
  7. Application and system software that perform user-defined tasks as well as manage high-level system activities (e.g. MS Word, Photoshop, CCleaner, antivirus tools, disk defragmenters)

The storage system is one of the most important abstractions in a computing system. To users it gives the illusion of a folder-and-file tree structure, but files are actually stored as fixed-size blocks on the hard disk platters. The primary function of the storage system is to give users an easy-to-manage view of storage while handling the messy physical details all by itself.
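To make the block idea concrete, here is a minimal Java sketch (the file name report.txt and the 4 KB block size are assumptions for illustration only) that reads one file the way a storage layer sees it: as a sequence of fixed-size blocks rather than as a single named document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BlockView {
  private static final int BLOCK_SIZE = 4096; // a typical file-system block size

  public static void main(String[] args) throws IOException {
    // The user sees one file at one path; the storage layer sees fixed-size blocks.
    try (InputStream in = Files.newInputStream(Paths.get("report.txt"))) {
      byte[] block = new byte[BLOCK_SIZE];
      int blockNo = 0;
      int read;
      while ((read = in.read(block)) != -1) {
        System.out.printf("block %d: %d bytes%n", blockNo++, read);
      }
    }
  }
}
```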

Input devices produce streams of data. A streaming input can be split into three parts. The first is the source, which produces the data; the second is the filter, which processes the source stream; the final component is the sink, the destination of the stream. In a computer, the source would be keyboard key-press data, the filter would be the controller that performs operations like input validation, and the file in which the data is stored would be the sink.
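Here is a minimal Java sketch of that source → filter → sink pipeline, with standard input standing in for the keyboard, a toy validation step as the filter, and a file (input.log, a name I made up) as the sink:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;

public class StreamPipeline {
  public static void main(String[] args) throws IOException {
    // Source: lines typed on the keyboard (standard input).
    BufferedReader source = new BufferedReader(new InputStreamReader(System.in));
    // Sink: a plain file that collects the accepted input.
    try (PrintWriter sink = new PrintWriter("input.log")) {
      String line;
      while ((line = source.readLine()) != null) {
        // Filter: trim stray whitespace and drop blank lines (toy validation).
        String cleaned = line.trim();
        if (!cleaned.isEmpty()) {
          sink.println(cleaned);
        }
      }
    }
  }
}
```

Type a few lines and end the stream with Ctrl+D (Ctrl+Z on Windows); only the lines that pass the filter land in the sink.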

The operating system is the interface between the computer and its users. It performs two functions: managing the hardware resources and providing an interface through which users do their tasks. The UI is a layer over the kernel and doesn’t carry much of the complexity of resource management. System management is the tough nut to crack, so the kernel has process management and memory management modules, plus loads of technologies and algorithms working behind the scenes to make the system usable.

Every processor comes with its own set of instructions that it can understand. These instructions form the basis of assembly language, and the set as a whole is called the machine instruction set.

High-level languages were created to make the programmer’s job easier. When a high-level program is compiled or interpreted, it produces a sequence of machine instructions. HLLs also provide a higher level of abstraction, so programmers can focus on the complex problem at hand instead of optimizing code for the machine.
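The JVM makes this easy to see. The sketch below is a trivial Java method; the comment shows the stack-machine instructions that `javap -c` reports after compilation. JVM bytecode is not Intel machine code, of course, but it plays exactly the same role: a fixed instruction set that the high-level program is reduced to.

```java
public class Add {
  // A one-line high-level program...
  int add(int a, int b) {
    return a + b;
  }
  // ...and the instruction sequence it compiles down to.
  // Compile with `javac Add.java`, then run `javap -c Add`:
  //
  //   iload_1   // push argument a onto the operand stack
  //   iload_2   // push argument b
  //   iadd      // add the two ints
  //   ireturn   // return the result
}
```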

Application and system software are built with high-level languages and solve specific user problems: Adobe PageMaker solved the publishing problem, and MS Excel solved spreadsheet computation problems.

Hadoop Ecosystem and the Computing System

The CS and the different Hadoop ecosystem components have a lot in common:

  1. Storage in the CS is similar to the Hadoop Distributed File System (HDFS). HDFS is a distributed storage system, and just as in the CS, the way data is actually stored in HDFS and the way we view it are totally different (see the first sketch after this list).
  2. Apache Flume is the equivalent of a CS input device. Flume routes data into HDFS; think of log data continuously being stored in a file without any user intervention.
  3. Hadoop is like the operating system, managing the show for the user as well as managing the resources. The way an OS has many components like resource managers, the kernel and file systems, Hadoop has components like Hadoop Core, HDFS, Hadoop YARN and Hadoop MapReduce.
  4. The MapReduce framework is like the machine instruction set (see the word-count sketch after this list).
  5. Pig, Hive and Jaql are the high-level languages, the way we have C, Java and Python in the CS. Commands in these languages are converted into corresponding MapReduce jobs.
  6. Mahout, HBase, Cassandra, Ambari and ZooKeeper are the equivalents of the various application and system software running atop Hadoop.
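For point 1, here is a minimal sketch of the illusion HDFS maintains, using the standard Hadoop FileSystem Java API (the path /tmp/hello.txt is just an example): the program sees one named file, while under the hood HDFS chops it into large replicated blocks spread across data nodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster address from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // To the program this is one file at one path...
    Path file = new Path("/tmp/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello hdfs\n");
    }

    // ...even though HDFS stores it as replicated blocks behind the scenes.
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}
```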
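For points 4 and 5, the canonical word-count job shows what a MapReduce program looks like at the "instruction set" level; this is the classic example from the Hadoop documentation, lightly commented. A one-line Pig or Hive query doing the same aggregation gets compiled down to a job of essentially this shape.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in the input split, emit the pair (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the 1s that arrive grouped under each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```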

You are reading an article by Harsha Ankola, originally posted on Harsha’s Tech Space. If you enjoyed this post, be sure to follow Harsha on Twitter, Facebook and Google+.