Some important ideas in big data processing:
Implicit Sequences
A sequence can be represented without each element being stored explicitly in the memory of the computer. That is, we can construct an object that provides access to all of the elements of some sequential dataset without computing all of those elements in advance and storing them. Instead, we compute elements on demand.
Example: The built-in range class represents consecutive integers
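As a rough illustration of computing elements on demand, the sketch below uses only the built-in range and the standard sys module. The variable name big is just for this example, and the size check assumes CPython, where a range object stores only its start, stop and step.

```python
import sys

# A range covering a trillion integers; no list of elements is ever built.
big = range(10**12)

size = len(big)            # derived from start, stop and step
element = big[10**11]      # computed from the index when requested
found = 10**11 in big      # membership for ints is computed arithmetically

# The object itself stays tiny regardless of how many integers it represents.
small = sys.getsizeof(big) < 1000
print(size, element, found, small)
```

Building list(big) instead would try to store every element explicitly, which is exactly what an implicit sequence avoids.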
A hand-written Range object:
class Range:
    """An implicit sequence of consecutive integers.

    >>> r = Range(3, 12)
    >>> len(r)
    9
    >>> r[3]
    6
    """
    def __init__(self, start, end=None):
        if end is None:
            start, end = 0, start
        self.start = start
        self.end = end

    def __repr__(self):
        return 'Range({0}, {1})'.format(self.start, self.end)

    def __len__(self):
        return max(0, self.end - self.start)

    def __getitem__(self, k):
        if k < 0:
            k = len(self) + k  # negative indices count from the end
        if k < 0 or k >= len(self):
            raise IndexError('index out of range')
        return self.start + k
1. Iterables and Iterators
Iterator: A mutable object that tracks a position in a sequence, advancing on each call to __next__.
Iterable: An object that represents a sequence and returns a new iterator (an object with a __next__ method) each time __iter__ is called.
Note: an iterator may also implement __iter__, but its __iter__ does not need to return a new object; it returns the iterator itself, which carries the state of its current position.
class LetterIter:
    """An iterator over letters.

    >>> a_to_c = LetterIter('a', 'c')
    >>> next(a_to_c)
    'a'
    >>> next(a_to_c)
    'b'
    >>> next(a_to_c)
    Traceback (most recent call last):
        ...
    StopIteration
    """
    def __init__(self, start='a', end='e'):
        self.next_letter = start
        self.end = end

    def __next__(self):
        if self.next_letter >= self.end:
            raise StopIteration
        result = self.next_letter
        self.next_letter = chr(ord(result)+1)
        return result
class Letters:
    """An implicit sequence of letters.

    >>> b_to_k = Letters('b', 'k')
    >>> first_iterator = b_to_k.__iter__()
    >>> next(first_iterator)
    'b'
    >>> next(first_iterator)
    'c'
    >>> second_iterator = iter(b_to_k)
    >>> second_iterator.__next__()
    'b'
    >>> first_iterator.__next__()
    'd'
    >>> first_iterator.__next__()
    'e'
    >>> second_iterator.__next__()
    'c'
    >>> second_iterator.__next__()
    'd'
    """
    def __init__(self, start='a', end='e'):
        self.start = start
        self.end = end

    def __iter__(self):
        return LetterIter(self.start, self.end)
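The note about iterators that implement __iter__ by returning themselves can be sketched as follows. SelfIteratingLetters is a hypothetical name introduced only for this illustration; it repeats the logic of LetterIter and adds an __iter__ method, which is what lets the iterator be passed directly to list() or used in a for statement.

```python
class SelfIteratingLetters:
    """Hypothetical variant of LetterIter that is also iterable."""

    def __init__(self, start='a', end='e'):
        self.next_letter = start
        self.end = end

    def __next__(self):
        if self.next_letter >= self.end:
            raise StopIteration
        result = self.next_letter
        self.next_letter = chr(ord(result) + 1)
        return result

    def __iter__(self):
        # Return the same object, so the current position is preserved.
        return self


it = SelfIteratingLetters('a', 'e')
first = next(it)        # consumes 'a'
remaining = list(it)    # list() calls iter(it), then next() until StopIteration
again = list(it)        # the iterator is exhausted, so nothing is left
print(first, remaining, again)
```

Unlike Letters, which hands out a fresh iterator on every __iter__ call, this object can only be traversed once.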
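1.1 Many built-in Python sequence operations return iterators that compute results lazily:
map(func, iterable): iterate over func(x) for x in iterable
filter(func, iterable): iterate over x in iterable if func(x)
zip(first_iter, second_iter): iterate over co-indexed (x, y) pairs
reversed(sequence): iterate over x in sequence in reverse order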
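A small sketch of what "lazily" means for these built-ins: the helper square and the list calls are names made up for this example, used to record when the mapped function actually runs.

```python
calls = []

def square(x):
    calls.append(x)  # record every time the function is really invoked
    return x * x

squares = map(square, [1, 2, 3])
before = list(calls)        # map() has not called square yet

first = next(squares)       # computes only the first result
after_one = list(calls)

rest = list(squares)        # forces the remaining results

# zip and reversed also return iterators; list() forces them here.
pairs = list(zip('ab', reversed([1, 2])))
print(before, first, after_one, rest, pairs)
```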
1.2 For statement
When a for statement executes, __iter__ is called on the iterable to obtain an iterator, and __next__ provides each item:
counts = [1, 2, 3]
for item in counts:
    print(item)
-------------
1
2
3
The for statement above is equivalent to:
counts = [1, 2, 3]
items = counts.__iter__()
try:
    while True:
        item = items.__next__()
        print(item)
except StopIteration:
    pass
-----------------
1
2
3
2. Generators and Generator Functions
A generator function is a function that yields values instead of returning them.
A generator is an iterator, created by a generator function.
When a generator function is called, it returns a generator that iterates over the yielded values.
def letter_generator(next_letter, end):
    while next_letter < end:
        yield next_letter
        next_letter = chr(ord(next_letter)+1)

>>> s = letter_generator('a', 'z')
>>> next(s)
'a'
>>> next(s)
'b'
The earlier Letters class can be rewritten as:
class Letters:
    def __init__(self, start='a', end='e'):
        self.start = start
        self.end = end

    def __iter__(self):
        return letter_generator(self.start, self.end)
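To make the difference between a generator (an iterator) and the generator-backed Letters class (an iterable) concrete, here is a short sketch. It repeats the two definitions above so that it runs on its own; the variable names are illustrative only.

```python
def letter_generator(next_letter, end):
    while next_letter < end:
        yield next_letter
        next_letter = chr(ord(next_letter) + 1)


class Letters:
    def __init__(self, start='a', end='e'):
        self.start = start
        self.end = end

    def __iter__(self):
        # Each call creates a fresh generator starting from self.start.
        return letter_generator(self.start, self.end)


g = letter_generator('a', 'd')
is_own_iterator = iter(g) is g   # a generator returns itself from __iter__
once = list(g)
twice = list(g)                  # already exhausted, so this is empty

letters = Letters('a', 'd')
first_pass = list(letters)
second_pass = list(letters)      # a new generator, so the letters appear again
print(once, twice, first_pass, second_pass)
```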
3. Stream
A stream is a linked list with an explicit first element and a rest-of-the-list that is computed lazily.
(The second argument to Stream is a zero-argument function that returns a Stream or Stream.empty.)
class Stream:
    """A lazily computed linked list.

    >>> s = Stream(1, lambda: Stream(6-2, lambda: Stream(9)))
    >>> s.first
    1
    >>> s.rest.first
    4
    >>> s.rest
    Stream(4, <...>)
    >>> s.rest.rest.first
    9
    """
    class empty:
        def __repr__(self):
            return 'Stream.empty'
    empty = empty()

    def __init__(self, first, compute_rest=lambda: Stream.empty):
        assert callable(compute_rest), 'compute_rest must be callable.'
        self.first = first
        self._compute_rest = compute_rest

    @property
    def rest(self):
        """Return the rest of the stream, computing it if necessary."""
        if self._compute_rest is not None:
            self._rest = self._compute_rest()
            self._compute_rest = None
        return self._rest

    def __repr__(self):
        return 'Stream({0}, <...>)'.format(repr(self.first))
def first_k(s, k):
    """Return up to k elements of stream s as a list.

    >>> s = Stream(1, lambda: Stream(4, lambda: Stream(9)))
    >>> first_k(s, 2)
    [1, 4]
    >>> first_k(s, 5)
    [1, 4, 9]
    """
    elements = []
    while s is not Stream.empty and k > 0:
        elements.append(s.first)
        s, k = s.rest, k-1
    return elements
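The rest property computes the remainder of the stream at most once and then caches it. The sketch below tries to make that visible; it includes a condensed copy of the Stream class (docstring and __repr__ omitted) so that it runs on its own, and compute_tail and calls are hypothetical names used only to count evaluations.

```python
class Stream:
    """Condensed copy of the Stream class above, for a self-contained demo."""

    class empty:
        def __repr__(self):
            return 'Stream.empty'
    empty = empty()

    def __init__(self, first, compute_rest=lambda: Stream.empty):
        assert callable(compute_rest), 'compute_rest must be callable.'
        self.first = first
        self._compute_rest = compute_rest

    @property
    def rest(self):
        if self._compute_rest is not None:
            self._rest = self._compute_rest()
            self._compute_rest = None   # drop the function once it has run
        return self._rest


calls = []

def compute_tail():
    calls.append('called')   # record each evaluation of the rest
    return Stream(2)

s = Stream(1, compute_tail)
calls_before = len(calls)    # nothing is computed at construction time
r1 = s.rest                  # first access runs compute_tail
r2 = s.rest                  # second access reuses the cached stream
print(calls_before, len(calls), r1 is r2)
```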
e.g. integer_stream
def integer_stream(first=1):
    """Return a stream of consecutive integers, starting with first.

    >>> s = integer_stream(3)
    >>> s
    Stream(3, <...>)
    >>> m = map_stream(lambda x: x*x, s)
    >>> first_k(m, 5)
    [9, 16, 25, 36, 49]
    """
    def compute_rest():
        return integer_stream(first+1)
    return Stream(first, compute_rest)
3.1 Higher-Order Functions on Streams
def map_stream(fn, s):
    """Map a function fn over the elements of a stream s.

    >>> s = integer_stream(3)
    >>> s
    Stream(3, <...>)
    >>> m = map_stream(lambda x: x*x, s)
    >>> first_k(m, 5)
    [9, 16, 25, 36, 49]
    """
    if s is Stream.empty:
        return s
    def compute_rest():
        return map_stream(fn, s.rest)
    return Stream(fn(s.first), compute_rest)

def filter_stream(fn, s):
    """Filter stream s with predicate function fn."""
    if s is Stream.empty:
        return s
    def compute_rest():
        return filter_stream(fn, s.rest)
    if fn(s.first):
        return Stream(s.first, compute_rest)
    else:
        return compute_rest()
e.g. primes stream
def primes(positives):
    """Return a stream of primes, given a stream of positive integers.

    >>> positives = integer_stream(2)
    >>> first_k(primes(positives), 8)
    [2, 3, 5, 7, 11, 13, 17, 19]
    """
    def not_divisible(x):
        return x % positives.first != 0
    def compute_rest():
        return primes(filter_stream(not_divisible, positives.rest))
    return Stream(positives.first, compute_rest)
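For comparison with the generators of section 2, the same sieve idea can be sketched with a generator and the built-in filter. This is not part of the stream code above: primes_gen is a hypothetical name, and the sketch assumes its input is an endless iterator of integers starting at 2, such as itertools.count(2).

```python
from itertools import count, islice

def primes_gen(numbers):
    """Generator analogue of primes(): sieve an endless iterator of ints >= 2."""
    while True:
        p = next(numbers)   # the first surviving number is prime
        yield p
        # Bind p as a default argument so each filter keeps its own prime;
        # filter() is lazy, so no candidates are tested until requested.
        numbers = filter(lambda x, p=p: x % p != 0, numbers)

first_eight = list(islice(primes_gen(count(2)), 8))
print(first_eight)
```

As with the stream version, each prime adds one more layer of filtering, so later primes become progressively more expensive to produce.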
2015-07-17
Original article: http://www.cnblogs.com/whuyt/p/4654135.html