What is Hadoop?
Hadoop is an open-source framework for storing and processing big data in a distributed, scalable way. It is maintained by the Apache Software Foundation and is built to handle massive datasets across clusters of commodity computers.
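Hadoop's processing model, MapReduce, splits a job into a map phase, a shuffle/sort step, and a reduce phase. The following standalone Python sketch mimics that flow locally with a classic word count; it is an illustration of the programming model, not actual Hadoop code, and the sample input is made up.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group pairs by key, as Hadoop does between the
    # map and reduce phases
    return groupby(sorted(pairs), key=itemgetter(0))

def reduce_phase(grouped):
    # Reducer: sum the counts for each word
    for word, pairs in grouped:
        yield (word, sum(count for _, count in pairs))

lines = ["big data big cluster", "data data everywhere"]
counts = dict(reduce_phase(shuffle(map_phase(lines))))
print(counts)  # {'big': 2, 'cluster': 1, 'data': 3, 'everywhere': 1}
```

In a real cluster, the mapper and reducer would run in parallel on many machines, with HDFS holding the input blocks and the framework handling the shuffle over the network.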
The Hadoop ecosystem comprises:
- HDFS: The Hadoop Distributed File System stores data files as close to their original format as possible. It can store very large files by distributing them across multiple machines in a Hadoop cluster.
- HBase: An open-source, non-relational, distributed database that runs on top of Hadoop and stores data in a column-oriented manner. It is designed to handle large, sparse datasets and provides fast, real-time access to data, making it suitable for various big data use cases.
- Hive: Apache Hive is a data warehouse system built on top of Hadoop for analyzing big data. Instead of writing complex MapReduce code, users query data with a SQL-like language called HiveQL, and Hive automatically converts the queries into MapReduce jobs behind the scenes.
- Pig: Apache Pig provides an easy-to-understand data flow language for analyzing Hadoop-based data. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, enabling SQL-style processing of Hadoop data.
- ZooKeeper: Apache ZooKeeper is a centralized service used to manage configuration, coordination, and synchronization for distributed systems. It acts as the “brains” or “traffic controller” of a cluster by ensuring consistency and coordination across nodes in a fault-tolerant way.
- Oozie: Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It chains together multiple big data tasks such as MapReduce, Hive, Pig, or shell scripts and runs them in a defined sequence or on a schedule.
- Mahout: Apache Mahout is an open-source project designed to build scalable machine learning algorithms. It was originally developed to run on top of Hadoop MapReduce, but has since evolved to support Apache Spark and Scala DSL for faster computation.
- Chukwa: Apache Chukwa is a data collection and monitoring system built on top of Hadoop. It helps you collect logs and system metrics from various sources and then stores them in HDFS or other Hadoop-compatible storage for analysis.
- Sqoop: Apache Sqoop is a tool designed to efficiently transfer bulk data between relational databases (like MySQL, Oracle, SQL Server, PostgreSQL) and Hadoop ecosystems (like HDFS, Hive, HBase).
- Ambari: Apache Ambari is a web-based tool for provisioning, managing, monitoring, and securing Hadoop clusters. It is like the “cPanel” of Hadoop — instead of doing everything via the terminal or multiple config files, Ambari lets you monitor Hadoop components (such as HDFS, YARN, and Hive), install new services, restart failed nodes, track metrics and alerts, and manage user permissions and configurations.
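Hive's translation of SQL into MapReduce, mentioned above, can be pictured with a small local sketch. The HiveQL query in the comment and the `employees` table are hypothetical; the Python below reproduces the same GROUP BY aggregation using the map, shuffle, and reduce steps a generated job would perform.

```python
from collections import defaultdict

# Hypothetical HiveQL query (illustrative only):
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept;
employees = [
    ("engineering", 100), ("sales", 80),
    ("engineering", 120), ("sales", 90),
]

# Map: emit (dept, salary) key/value pairs
mapped = [(dept, salary) for dept, salary in employees]

# Shuffle: group values by key
groups = defaultdict(list)
for dept, salary in mapped:
    groups[dept].append(salary)

# Reduce: aggregate each group, as the generated MapReduce job would
totals = {dept: sum(salaries) for dept, salaries in groups.items()}
print(totals)  # {'engineering': 220, 'sales': 170}
```

This is why HiveQL is attractive: the three steps above are generated and distributed automatically, so the analyst only writes the one-line query.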
