We are living in the era of the data-driven world, and that data is mostly found in digital form. From a single piece of data, a wide variety of knowledge can be extracted that can enhance your business to a large extent. Initially, digital data was not stored at this scale over the internet, so it could easily be handled by a SQL engine or simple database queries. Now data is available not only at a far larger scale but also in far greater variety. Deriving knowledge or useful information from this data is not possible with a simple SQL engine or database processing engine, so we need something that can meet the need of processing data at that scale. A wide variety of frameworks and tools, such as Apache Storm, Cassandra, MongoDB, and Google BigQuery, are available in the software industry for dealing with big data. The most popular of these is Apache Hadoop.
What is Hadoop?
Hadoop is an open-source framework that is used to deal with big data. It provides a way to access data that is distributed among multiple clustered computers, process that data, and manage resources across the computing and network resources that are involved.
Components of Hadoop
Hadoop is a distributed framework that makes it easier to process large data sets that reside in clusters of computers. Because it is a framework, Hadoop is not a single technology or product. Instead, Hadoop is made up of four core modules that are supported by a large ecosystem of supporting technologies and products. The modules are:
• Hadoop Distributed File System (HDFS) – Provides access to application data. Hadoop can also work with other file systems, including FTP, Amazon S3 and Windows Azure Storage Blobs (WASB), among others.
• Hadoop YARN – Provides the framework to schedule jobs and manage resources across the cluster that holds the data.
• Hadoop MapReduce – A YARN-based parallel processing system for large data sets.
• Hadoop Common – A set of utilities that supports the three other core modules.
Some of the well-known Hadoop ecosystem components include Oozie, Spark, Sqoop, Hive and Pig.
These four basic elements are discussed in detail below:
Hadoop Distributed File System (HDFS)
Hadoop works across clusters of commodity servers, so there needs to be a way to coordinate activity across the hardware. Hadoop can work with other distributed file systems; however, the Hadoop Distributed File System is the primary means for doing so and is the heart of Hadoop technology. HDFS manages how data files are divided and stored across the cluster. Data is divided into blocks, and each server in the cluster holds data from different blocks. There is also built-in redundancy: each block is replicated on multiple servers.
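To make the block-and-replica idea concrete, here is a toy Python sketch of HDFS-style splitting and placement. The block size and replication factor are the HDFS defaults, but the round-robin placement policy and all names here are simplified assumptions, not the real HDFS algorithm (which is rack-aware).

```python
# Toy illustration of HDFS-style block splitting and replica placement.
# The placement policy is a simplification, not real HDFS behavior.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    block_id = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, length))
        offset += length
        block_id += 1
    return blocks

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Round-robin placement: each block is copied to `replication` nodes."""
    placement = {}
    for block_id, _ in blocks:
        placement[block_id] = [nodes[(block_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(blocks, nodes)[0])         # ['node1', 'node2', 'node3']
```

A 300 MB file becomes three blocks, and each block lives on three different servers, so losing one machine loses no data.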
Hadoop YARN
It would be nice if YARN could be thought of as the string that holds everything together, but in an environment where terms like Oozie, tuple and Sqoop are common, of course it’s not that simple. YARN is an acronym for Yet Another Resource Negotiator. As the full name implies, YARN helps manage resources across the cluster environment. It breaks up resource management, job scheduling, and job management tasks into separate daemons. Key elements include the ResourceManager (RM), the NodeManager (NM), and the ApplicationMaster (AM).
Think of the ResourceManager as the final authority for assigning resources to all the applications in the system. The NodeManagers are agents that manage resources (CPU, memory, network, etc.) on each machine and report back to the ResourceManager. The ApplicationMaster is a per-application library that sits between the two: it negotiates resources with the ResourceManager and works with one or more NodeManagers to execute the tasks for which resources were allocated.
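The division of labor above can be sketched in a few lines of Python. The class and method names are made up for illustration; real YARN grants "containers" through a much richer protocol, but the flow (AM asks RM, RM finds a NodeManager with capacity) is the same.

```python
# Toy sketch of YARN's division of labor, with made-up class names.
# Real YARN negotiates containers over RPC; this only mirrors the flow.

class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb  # resources this node can still offer

class ResourceManager:
    def __init__(self, node_managers):
        self.nodes = node_managers

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for nm in self.nodes:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return nm.name  # container granted on this node
        return None             # no capacity: the request must wait

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, tasks_mb):
        """Negotiate one container per task with the ResourceManager."""
        return [self.rm.allocate(mb) for mb in tasks_mb]

rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
am = ApplicationMaster(rm)
print(am.run_job([1024, 1024, 1024]))  # ['nm1', 'nm1', 'nm2']
```

Note that the ResourceManager never runs the tasks itself; it only decides where they may run, which is exactly the separation of concerns YARN introduced.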
Hadoop MapReduce
MapReduce provides a method for parallel processing on distributed servers. Before processing, MapReduce breaks large blocks of data into smaller sets of key-value pairs, called tuples, which can then be organized and processed by key. When MapReduce processing is complete, HDFS takes over and manages storage and distribution of the output. The shorthand version of MapReduce is that it breaks big data blocks into smaller chunks that are easier to work with.
The “Map” in MapReduce refers to map tasks, which format data into key-value pairs and assign them to nodes. The “Reduce” function is executed by reduce tasks, which aggregate those pairs into the final result. Both map tasks and reduce tasks run on worker nodes.
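The classic example is a word count. The real Hadoop API is Java, but the map → shuffle → reduce flow described above can be mirrored in plain Python (the function names here are illustrative, not Hadoop's):

```python
# A word count expressed in MapReduce style, in plain Python.
# In real Hadoop the framework runs these phases across many nodes.

from collections import defaultdict

def map_task(text):
    """Map: emit a (word, 1) key-value pair for every word."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

splits = ["big data big", "data big"]          # two input splits
pairs = [p for s in splits for p in map_task(s)]
result = dict(reduce_task(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2}
```

Each input split can be mapped on a different node, and each key can be reduced on a different node, which is where the parallelism comes from.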
The JobTracker is a component of the classic (pre-YARN) MapReduce engine that manages how client applications submit MapReduce jobs. It distributes work to TaskTracker nodes, attempting to assign each task as close as possible to where its data resides.
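That "move the computation to the data" preference, often called data locality, can be sketched like this (a simplified assumption of the idea, not the actual JobTracker scheduling code):

```python
# Toy sketch of data-local task assignment: prefer a free node that
# already stores the block, so no data moves over the network.

def assign_task(block_locations, free_nodes):
    """Pick a free node that holds the block, else fall back to any free node."""
    for node in free_nodes:
        if node in block_locations:
            return node          # data-local: no network transfer needed
    return free_nodes[0]         # remote read as a last resort

print(assign_task({"node2", "node3"}, ["node1", "node2"]))  # node2 (data-local)
print(assign_task({"node4"}, ["node1", "node2"]))           # node1 (remote)
```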
Note that MapReduce is not the only way to manage parallel processing in the Hadoop environment.
Hadoop Common
Common, which is also known as Hadoop Core, is a set of utilities that supports the other Hadoop components. Common is intended to give the Hadoop framework ways to handle typical (common) hardware failures.
This was a guest post from one of our readers.