From http://projects.apache.org/indexes/quick.html
[Now, Future], 2015-02-06 update.
The Apache Accumulo sorted, distributed key/value store is based on Google's BigTable design. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Categories: database |
|
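To make the idea of cell-level access labels in the Accumulo entry concrete, here is a minimal sketch in plain Python. It does not use the Accumulo API; `Cell`, `scan`, and the sample labels are hypothetical names chosen for illustration, and the rule is simplified to "the reader must hold every label on the cell", whereas real visibility expressions are richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    # Simplifying assumption: every label listed here must be held by the reader.
    required_labels: frozenset


def scan(cells, authorizations):
    """Return the cells the reader may see, sorted by (row, column)."""
    auths = frozenset(authorizations)
    visible = [c for c in cells if c.required_labels <= auths]
    return sorted(visible, key=lambda c: (c.row, c.column))


# Hypothetical sample data: one row whose cells carry different labels.
CELLS = [
    Cell("patient1", "name", "Alice", frozenset()),
    Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient1", "billing", "42.00", frozenset({"finance", "admin"})),
]

if __name__ == "__main__":
    for cell in scan(CELLS, {"medical"}):
        print(cell.row, cell.column, cell.value)
```

The point of the sketch is that filtering happens per cell rather than per table or per row, so two readers scanning the same row can receive different columns.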
Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple. Categories: big-data |
|
Apache Avro is a data serialization system. Categories: library, big-data |
|
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data. Categories: hadoop |
|
Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel. Categories: big-data |
|
Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Categories: big-data |
|
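The iterative, vertex-centric style of graph processing that the Giraph entry refers to can be sketched in a few lines of plain Python. This is not the Giraph API; `propagate_max` and its arguments are hypothetical. It only shows the general pattern: in each superstep, every vertex reads its incoming messages, possibly updates its value, and sends messages to its neighbours, until no messages remain.

```python
def propagate_max(graph, values):
    """Spread the largest vertex value through a graph, superstep by superstep.

    graph:  vertex -> list of neighbours (undirected edges listed both ways)
    values: vertex -> initial numeric value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])

    # Later supersteps: a vertex only sends again if its value grew.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for n in graph[vertex]:
                    outbox[n].append(values[vertex])
        inbox = outbox

    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(propagate_max(graph, {"a": 3, "b": 6, "c": 2}))
```

The loop terminates because values only ever increase and are bounded by the largest initial value.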
Hadoop is a distributed computing platform. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce. Categories: database |
|
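As a rough illustration of the MapReduce model named in the Hadoop entry, the following sketch runs the three conceptual phases (map, shuffle, reduce) in a single Python process. It does not use any Hadoop API; the function names are hypothetical, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can in principle run in parallel, which is what makes the model scale.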
Apache Hama is an efficient and scalable general-purpose Bulk Synchronous Parallel (BSP) computing engine which can be used to speed up a large variety of compute-intensive analytics applications. Categories: big-data |
|
Use Apache HBase software when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Categories: database |
|
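The phrase "versioned, column-oriented store" in the HBase entry can be pictured as a nested map: row key, then column, then timestamped versions. The sketch below is a toy in-memory model in plain Python, not the HBase client API; `VersionedTable`, `put`, `get`, and `max_versions` are hypothetical names used only for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Keep only the newest `max_versions` entries.
        del versions[self.max_versions:]

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest if None)."""
        if row not in self._rows or column not in self._rows[row]:
            return None
        for ts, value in self._rows[row][column]:
            if timestamp is None or ts <= timestamp:
                return value
        return None


if __name__ == "__main__":
    table = VersionedTable(max_versions=2)
    table.put("row1", "info:city", "Paris", timestamp=1)
    table.put("row1", "info:city", "Berlin", timestamp=2)
    print(table.get("row1", "info:city"))
    print(table.get("row1", "info:city", timestamp=1))
```

Rows in this model are sparse: a row stores only the columns that were actually written, which is what makes "millions of columns" practical.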
The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop (TM), it provides:
* tools to enable easy data extract/transform/load (ETL)
* a mechanism to impose structure on a variety of data formats
* access to files stored either directly in Apache HDFS (TM) or in other data storage systems such as Apache HBase (TM)
* query execution via MapReduce
Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs). Categories: database |
|
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Categories: database |
|
Apache Mahout is a scalable machine learning library. Categories: library |
|
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases:
* Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.
* Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.
Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. Categories: web-framework |
|
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Categories: big-data |
|
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer consists of a textual language called Pig Latin, which has the following key properties:
* Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
* Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
* Extensibility. Users can create their own functions to do special-purpose processing.
Categories: database |
|
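The "data flow sequence" idea in the Pig entry, where a task is written as a chain of named transformations, can be imitated in plain Python. This is not Pig Latin and uses no Pig API; the step functions, field names, and sample data are all hypothetical. Each step takes the output of the previous one, roughly mirroring a load, filter, group, aggregate pipeline.

```python
from collections import defaultdict


def load(rows):
    """Turn raw tuples into records with named fields."""
    for user, url, seconds in rows:
        yield {"user": user, "url": url, "seconds": seconds}


def filter_rows(records, predicate):
    """Keep only the records for which the predicate holds."""
    return (r for r in records if predicate(r))


def group_by(records, key):
    """Collect records into lists that share the same value for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return groups


def total_seconds(groups):
    """Aggregate each group into the sum of its `seconds` field."""
    return {k: sum(r["seconds"] for r in rs) for k, rs in groups.items()}


# Hypothetical sample input: (user, url, seconds spent on page).
VISITS = [
    ("amy", "/home", 12),
    ("bob", "/home", 3),
    ("amy", "/docs", 40),
    ("bob", "/docs", 25),
]


def long_visit_totals(rows, min_seconds=10):
    records = load(rows)
    long_visits = filter_rows(records, lambda r: r["seconds"] >= min_seconds)
    grouped = group_by(long_visits, "user")
    return total_seconds(grouped)


if __name__ == "__main__":
    print(long_visit_totals(VISITS))
```

Writing the task as separate named steps is what gives a system room to reorder or merge them before execution, which is the optimization opportunity the entry describes.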
Apache Spark is a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics. Categories: big-data |
|
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Categories: big-data |
|
Apache Storm is a distributed real-time computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing real-time computation. Categories: big-data |
|
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination. Categories: database |
Original article: http://www.cnblogs.com/vscn/p/4277031.html