Azad Parinda

Where freedom comes true

How to Type Invisible Commands in Terminal?


MapReduce in a Simplified Way
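MapReduce reduces a distributed computation to two user-supplied functions: a mapper that emits key/value pairs, and a reducer that folds all the values grouped under one key. Here is a minimal pure-Python sketch of the map, shuffle, and reduce phases for word counting; the function names are illustrative, not any framework's API:

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
```

In a real cluster the mapper and reducer run on many machines in parallel and the shuffle moves data over the network; the logic, though, is exactly this three-step pipeline.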



The History of Linux



Lessons Learned from The Life of Steve Jobs



Programs are like WOMEN



Projects – other than Hadoop!


Mostly compatible with Hadoop/HDFS

  • Apache Drill – provides low latency ad-hoc queries to many different data sources, including nested data. Inspired by Google’s Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds.
  • Apache Hama – is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph and network algorithms.
  • Akka – a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
  • ML-Hadoop – Hadoop implementations of machine learning algorithms.
  • Shark – is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
  • Apache Crunch – a Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
  • Azkaban – a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
  • Apache Mesos – is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
  • Druid – is open source infrastructure for real-time exploratory analytics on large datasets. The system uses an always-on, distributed, shared-nothing architecture designed for real-time querying and data ingestion. It leverages column orientation and advanced indexing structures to allow cost-effective, arbitrary exploration of multi-billion-row tables with sub-second latencies.
  • Apache MRUnit – a Java library that helps developers unit test Apache Hadoop map reduce jobs.
  • hiho – Hadoop Data Integration with various databases, ftp servers, salesforce. Incremental update, dedup, append, merge your data on Hadoop
  • white-elephant – a Hadoop log aggregator and dashboard which enables visualization of Hadoop cluster utilization across users.
  • Tachyon – a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
  • HIPI – is a library for Hadoop’s MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment
  • Cassovary – a simple big graph processing library for the JVM.
  • Apache Helix – is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes
  • Summingbird – streaming MapReduce with Scalding and Storm.
  • MongoDB – an open-source document database, and the leading NoSQL database. Written in C++
  • Katta – is scalable, fault-tolerant, distributed data storage for real-time access.
  • Kiji – for building Real-time Big Data Applications on Apache HBase
  • MLBase – a platform addressing the needs of ML developers and end users, consisting of three components: MLlib, MLI, and ML Optimizer.
  • cloud9 – is a collection of Hadoop tools that tries to make working with big data a bit easier.
  • elasticsearch – a flexible and powerful open source, distributed, real-time search and analytics engine for the cloud.
  • Apache Curator – is a set of Java libraries that make using Apache ZooKeeper much easier.
  • Parquet – is a columnar storage format for Hadoop.
  • OpenTSDB – is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.
  • Giraph – is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010 paper
  • CouchDB – is a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API.
  • DataFu – is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.
  • Norbert – is a cluster manager and networking layer built on top of Zookeeper
  • Apache Samza – is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
  • Apache Kafka – is publish-subscribe messaging rethought as a distributed commit log.
  • Apache Whirr – is a set of libraries for running cloud services.
  • HUE – a File Browser for HDFS, a Job Browser for MapReduce/YARN, an HBase Browser, query editors for Hive, Pig, Cloudera Impala and Sqoop2.
  • Nagios – offers complete monitoring and alerting for servers, switches, applications, and services.
  • Ganglia – is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids
  • Apache Thrift – is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.
  • – is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery
  • CloudMapReduce – a MapReduce implementation on the Amazon Cloud OS.
  • Titan – is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.


Hadoop Alternatives

  • Apache Spark – an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
  • GraphLab – a redesigned fully distributed API, HDFS integration and a wide range of new machine learning toolkits.
  • HPCC Systems – (High Performance Computing Cluster) a massively parallel processing computing platform that solves Big Data problems.
  • Dryad – a Microsoft Research project investigating programming models for writing parallel and distributed programs that scale from a small cluster to a large data center.
  • Stratosphere – “above the cloud”: a general-purpose data analytics platform; the project later evolved into Apache Flink.
  • Storm – is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
  • R3 – is a MapReduce engine written in Python using a Redis backend.
  • Disco – is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm.
  • Phoenix – is a shared-memory implementation of Google’s MapReduce model for data-intensive processing tasks.
  • Plasma – PlasmaFS is a distributed filesystem for large files, implemented in user space. Plasma Map/Reduce runs the famous algorithm scheme for mapping and rearranging large files. Plasma KV is a key/value database on top of PlasmaFS.
  • Peregrine – is a map reduce framework designed for running iterative jobs across partitions of data.
  • httpmr – A scalable data processing framework for people with web clusters.
  • sector/sphere – sector is a high performance, scalable, and secure distributed file system. Sphere is a high performance parallel data processing engine that can process Sector data files on the storage nodes with very simple programming interfaces.
  • Filemap – is a lightweight system for applying Unix-style file processing tools to large amounts of data stored in files.
  • misco – is a distributed computing framework designed for mobile devices.
  • MR-MPI – is an open-source MapReduce library for distributed-memory parallel machines, built on top of standard MPI message passing.
  • GridGain – an in-memory computing platform.


MapReduce Alternatives

  • Octopy – is a fast-n-easy MapReduce implementation for Python.
  • Cascalog – is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing “Big Data” on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.
  • Cascading – is an application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.
  • MySpace Qizmt – is a mapreduce framework for executing and developing distributed computation applications on large clusters of Windows servers.
  • bashreduce – MapReduce in bash.
  • Meguro – a simple JavaScript Map/Reduce framework.
  • mincemeatpy – lightweight MapReduce in Python.
  • skynet – a Ruby MapReduce framework.
  • mapredus – a simple MapReduce framework using Redis and Resque.
  • starfish – is a utility to make distributed programming ridiculously easy.
  • GPMR – is a MapReduce library that leverages the power of GPU clusters for large-scale computing.
  • Elastic Phoenix – An elastic MapReduce framework based on Phoenix


Hadoop Eco-System

  • ZooKeeper – a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
  • Avro – a data serialization system.
  • HBase – is the Hadoop database, a distributed, scalable, big data store.
  • Sqoop – is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Hive – a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
  • Pig – is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  • Chukwa – is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results to make the best use of the collected data.
  • Flume – is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Oozie – is a workflow scheduler system to manage Apache Hadoop jobs.
  • Mahout – scalable machine learning libraries.
  • Ambari – is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
  • HCatalog – is a set of interfaces that open up access to Hive’s metastore for tools inside and outside of the Hadoop grid.
  • Cassandra – is the database of choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data.
  • Hadoop – is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
  • Bigtop – is a project for the development of packaging and tests of the Apache Hadoop ecosystem.
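Most of the frameworks above ultimately run user code as mapper and reducer processes over data in HDFS. With Hadoop's Streaming interface, for example, those can be plain scripts that read lines on stdin and write tab-separated key/value pairs; Hadoop sorts the mapper output by key before the reducer sees it. The sketch below mimics that contract in ordinary Python functions (the function names and the job invocation shown in the comment are illustrative, not exact):

```python
from itertools import groupby

def streaming_mapper(lines):
    """Mapper: emit one 'word<TAB>1' line per word, as a Streaming script would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Reducer: input arrives sorted by key, so counts can be summed group by group."""
    def key_of(line):
        return line.split("\t", 1)[0]
    for word, group in groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    # In a real job, Hadoop pipes HDFS splits through the scripts, roughly:
    #   hadoop jar hadoop-streaming*.jar -mapper mapper.py -reducer reducer.py \
    #       -input /data/in -output /data/out
    demo = ["the quick brown fox", "the fox"]
    for line in streaming_reducer(sorted(streaming_mapper(demo))):
        print(line)
```

The `sorted()` call stands in for Hadoop's distributed shuffle-and-sort; everything else is the same mapper/reducer contract the cluster enforces.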

Apache Hadoop & Hive Installation Examples on Linux




Programmer Vs. Hacker




How Does a Hadoop Cluster Work?


NameNode [Master] => Slaves (DataNodes)
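As a rough sketch of that master/slave split: the NameNode keeps only metadata, i.e. which DataNodes hold each file block, while the DataNodes store the blocks themselves and clients read from them directly. The toy model below uses naive round-robin placement purely for illustration; real HDFS placement is rack-aware, so treat the names and the policy as assumptions, not HDFS internals:

```python
import itertools

class NameNode:
    """Toy master: tracks which DataNodes (slaves) hold each file block."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}                    # block id -> list of DataNode names
        self._rr = itertools.cycle(datanodes)  # naive round-robin placement

    def allocate_block(self, block_id):
        """Pick `replication` DataNodes for a new block and record the mapping."""
        targets = [next(self._rr) for _ in range(self.replication)]
        self.block_map[block_id] = targets
        return targets

    def locate_block(self, block_id):
        """Clients ask the master where a block lives, then read from the slaves directly."""
        return self.block_map[block_id]

nn = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.allocate_block("file1-blk0")
print(nn.locate_block("file1-blk0"))  # ['dn1', 'dn2', 'dn3']
```

The key design point survives even in this toy: the master never touches block data, so it stays small and fast, while storage and read bandwidth scale with the number of slaves.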


Hadoop / MapReduce books

Hadoop in Action


Hadoop in Practice


Hadoop MapReduce Cookbook


Hadoop Operations


Hadoop: The Definitive Guide, 2nd ed.


Hadoop: The Definitive Guide, 3rd ed.


MapReduce Design Patterns


Pro Hadoop


Hadoop Real-World Solutions Cookbook


