
Hadoop: The Definitive Guide (4th Edition), Key Points Translated (2): Chapter 1. Meet Hadoop

Date: 2015-08-03 14:44:35


a) The trend is for every individual’s data footprint to grow, but perhaps more significantly, the amount of data generated by machines as a part of the Internet of Things will be even greater than that generated by people.
b) Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.
c) Mashups between different information sources make for unexpected and hitherto unimaginable applications.
d) It has been said that “more data usually beats better algorithms.”
e) This is a long time to read all data on a single drive — and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
f) We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.
g) The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high.
h) The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.
i) In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, because it runs on commodity hardware and is open source, Hadoop is affordable.
j) Queries typically take minutes or more, so it’s best for offline use, where there isn’t a human sitting in the processing loop waiting for results.
k) The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access of individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.
l) The real enabler for new processing models in Hadoop was the introduction of YARN (which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
m) Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces).
n) Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
o) In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
p) Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate.
q) Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.
r) Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
s) Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.
t) Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don’t know whether or not a remote process has failed — and still making progress with the overall computation.
u) MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects.
v) Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general-purpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way.
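
The arithmetic behind point e) can be checked with a short sketch. The excerpt does not give the drive size or speed, so the figures below (a 1 TB drive read at 100 MB/s) are assumptions chosen for illustration; the function name is likewise hypothetical.

```python
# Assumed figures (not stated in the excerpt): a 1 TB drive read at 100 MB/s.
DRIVE_CAPACITY_MB = 1_000_000   # 1 TB expressed in MB (assumption)
TRANSFER_RATE_MB_S = 100        # sustained transfer rate in MB/s (assumption)


def read_time_seconds(capacity_mb: float, rate_mb_s: float, drives: int = 1) -> float:
    """Time to read the whole dataset when it is split evenly across
    `drives` disks that are all read in parallel at the same rate."""
    return capacity_mb / drives / rate_mb_s


single = read_time_seconds(DRIVE_CAPACITY_MB, TRANSFER_RATE_MB_S)
parallel = read_time_seconds(DRIVE_CAPACITY_MB, TRANSFER_RATE_MB_S, drives=100)

print(f"1 drive:    {single / 3600:.2f} hours")
print(f"100 drives: {parallel / 60:.2f} minutes")
```

Under these assumptions the single-drive read takes hours, while the 100-drive read finishes in under two minutes, which matches the claim in point e).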
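
Point g) says the chance of a failure grows quickly with the number of devices. A minimal sketch of why, assuming each device fails independently with the same probability over some period (both the independence assumption and the 1% figure are illustrative, not from the source):

```python
def prob_at_least_one_failure(p_single: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices fails,
    given that each fails with probability p_single over the same period."""
    return 1.0 - (1.0 - p_single) ** n_devices


# Hypothetical per-device failure probability of 1% for the period considered.
for n in (1, 10, 100):
    print(n, round(prob_at_least_one_failure(0.01, n), 3))
```

Even with a small per-device probability, the chance that something in a 100-device cluster fails is well over half, which is why a system built on many disks has to plan for failure rather than treat it as an exception.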
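
One way to see why the seek-time trend in point n) matters is to compare touching records one seek at a time with streaming through the whole dataset. The numbers below (10 ms seek, 100 MB/s transfer, 1 TB dataset, 100-byte records) are assumptions for illustration only, and the cost model is deliberately simplified.

```python
SEEK_TIME_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_RATE_B_S = 100e6    # assumed transfer rate: 100 MB/s
DATASET_BYTES = 1e12         # assumed dataset size: 1 TB
RECORD_BYTES = 100           # assumed record size


def seek_based_time(fraction: float) -> float:
    """Seconds to touch `fraction` of the records, paying one seek per record."""
    records = DATASET_BYTES / RECORD_BYTES
    per_record = SEEK_TIME_S + RECORD_BYTES / TRANSFER_RATE_B_S
    return fraction * records * per_record


def streaming_time() -> float:
    """Seconds to read the entire dataset sequentially at the transfer rate."""
    return DATASET_BYTES / TRANSFER_RATE_B_S


for fraction in (1e-6, 1e-4, 1e-2):
    print(f"touch {fraction:.0e} of records by seeking: {seek_based_time(fraction):,.0f} s "
          f"(full streaming pass: {streaming_time():,.0f} s)")
```

In this simplified model, seeking wins only when a tiny fraction of the records is touched; once the access pattern is dominated by seeks, a full streaming pass is faster, and the gap widens as transfer rates improve faster than seek times.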
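
Point s) describes a model where the programmer writes functions over key-value pairs and the data flow stays implicit. The sketch below imitates that idea in plain Python. It is not Hadoop's API: `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names, and the in-memory grouping stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Toy driver: applies the mapper, groups values by key, applies the reducer.
    The caller never sees the grouping step, so the data flow stays implicit."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key: str, values: List[int]) -> int:
    """Sum the counts collected for one word."""
    return sum(values)


counts = run_mapreduce(
    ["Hadoop stores data", "Hadoop analyzes data"], word_mapper, count_reducer
)
print(counts)
```

The user-supplied code is only the mapper and the reducer; how pairs move between them is the framework's concern, which is the division of labor point s) describes.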



