Most Hadoop optimization tools out there focus on simplifying the deployment and management of Hadoop clusters.
Very few tools are designed to help Hadoop users optimize their flows.
Dr.Elephant supports Hadoop with a variety of frameworks and can be easily extended to newer frameworks.
You can plug in and configure as many custom heuristics as you like.
It is designed to help the users of Hadoop and Spark understand the internals of their flow and to help them tune their jobs easily.
Key Features
Pluggable and configurable rule-based heuristics that diagnose a job;
Out-of-the-box integration with Azkaban scheduler and support for adding any other Hadoop scheduler, such as Oozie;
Representation of historic performance of jobs and flows;
Job-level comparison of flows;
Diagnostic heuristics for MapReduce and Spark;
Easily extensible to newer job types, applications, and schedulers;
REST API to fetch all the information.
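To make the "pluggable heuristics" idea concrete, here is a minimal sketch of a rule registry. Dr. Elephant's real heuristics are Java classes configured through XML; the names below (`JobData`, `register`, `run_all`) and the spill-ratio rule are invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class JobData:
    """Simplified stand-in for one application's counters and configuration."""
    counters: Dict[str, float] = field(default_factory=dict)

# Registry mapping a heuristic name to a diagnostic function.
HEURISTICS: Dict[str, Callable[[JobData], str]] = {}

def register(name: str):
    """Decorator that plugs a custom heuristic into the registry."""
    def wrap(fn):
        HEURISTICS[name] = fn
        return fn
    return wrap

@register("Mapper Spill")
def mapper_spill(job: JobData) -> str:
    # Ratio of records spilled to disk vs. records produced by the mappers.
    spilled = job.counters.get("spilled_records", 0.0)
    output = max(job.counters.get("map_output_records", 1.0), 1.0)
    return "spill ratio %.2f" % (spilled / output)

def run_all(job: JobData) -> Dict[str, str]:
    """Apply every registered heuristic to one job, one finding each."""
    return {name: fn(job) for name, fn in HEURISTICS.items()}

job = JobData(counters={"spilled_records": 300.0, "map_output_records": 100.0})
print(run_all(job))  # {'Mapper Spill': 'spill ratio 3.00'}
```

Adding a new rule is just another decorated function, which is the same extension pattern the configurable-heuristics feature describes.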
How does it work?
Dr. Elephant fetches a list of all recently succeeded and failed applications from the YARN Resource Manager at regular intervals.
The metadata for each application, namely the job counters, configurations, and task data, is fetched from the Job History Server.
Dr. Elephant runs a set of heuristics on them and generates a diagnostic report on how the individual heuristics and the job as a whole performed.
Each result is then tagged with one of five severity levels to indicate potential performance problems.
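The severity-tagging step can be sketched as mapping a measured value against a set of ascending thresholds, one cut-off between each pair of adjacent levels. The five level names mirror Dr. Elephant's severities, but the threshold values below are invented for illustration.

```python
# Five severity levels, from harmless to worst.
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]

def severity(value: float, thresholds) -> str:
    """Return the severity for a metric, given four ascending cut-offs
    that separate the five levels."""
    level = 0
    for i, cut in enumerate(thresholds, start=1):
        if value >= cut:
            level = i
    return SEVERITIES[level]

# Hypothetical thresholds for a mapper spill-ratio heuristic.
SPILL_THRESHOLDS = (0.5, 1.0, 2.0, 3.0)
print(severity(0.3, SPILL_THRESHOLDS))  # NONE
print(severity(2.5, SPILL_THRESHOLDS))  # SEVERE
```

A red (critical) job on the dashboard corresponds to at least one heuristic landing in the top band.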
Sample Usage
Once a job completes, it can be found in the Dashboard.
Red means the job is in a critical state and requires tuning, while green means the job is running efficiently.
You can click through to the application to see the complete report, including details on each individual heuristic and an [Explain] link, which provides suggestions on how to tune the job to improve that heuristic.