Reposted from: https://blog.minio.io/stream-processing-with-apache-flink-and-minio-10da85590787
Modern technology trends like Machine Learning, Deep Learning, Artificial Intelligence, and IoT have pushed the need for a reliable, scalable storage platform that is versatile enough to cater to the high-volume data streams that these applications generate.
In this post, we’ll introduce Apache Flink, one of the most popular stream processing engines today, and try to understand what makes it widely adopted by enterprises across the world. Later we’ll also explore how Minio works with Flink to build a private cloud data pipeline for a variety of use cases.
Stream processing enables analysis of continuous data streams. In this approach, data is seen as a continuous stream that processing engines ingest, analyze, and respond to within a small time frame, anywhere from a few milliseconds to a few minutes.
The acceptable response time generally depends on the use case and how critical a timely response is. For example, you’d expect IoT sensor data from a nuclear reactor to be processed in a much smaller time frame than data from a user’s website visit.
There are several situations where the streaming approach to data analysis is better suited than batch analysis:
As we discussed, stream processing is beneficial in situations where a quick (sometimes approximate) answer is best suited. Let us now take a look at common real-world applications of the stream processing approach:
Anomaly detection: Streaming analysis can be applied to continuous streams of data and detect anomalies in near real time. For example, in a stream of financial transaction data, fraudulent transactions can be thought of as anomalies — stream processing can detect these, protecting banks and customers from financial damage.
Business process monitoring: A business process involves several events within a specific domain. For example, in an e-commerce business, all the events from CHECK_OUT_FROM_CART to ITEM_RECEIVED_BY_CUSTOMER may be thought of as one business process, and a critical one at that. Stream processing can be used to monitor such processes for anomalies such as not completing within a time frame, items mishandled by delivery partners, etc.
Rule-based alerting: Stream processing can be used to trigger alerts based on certain rules. This means that as soon as a certain criterion is met, alerts can be sent out to different targets.
Read more about stream processing use cases on the Apache Flink website.
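To make the anomaly detection and rule-based alerting use cases more concrete, here is a minimal Python sketch. It is not Flink code. It only illustrates the idea of evaluating a rule on each event as it arrives while keeping a small running state. The function name, the event fields (account, amount), and the rule itself (an amount more than three times the running mean) are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator


def alert_on_large_transactions(
    events: Iterable[Dict], factor: float = 3.0, warmup: int = 3
) -> Iterator[Dict]:
    """Yield an alert for each amount above `factor` times the running mean.

    Hypothetical rule for illustration only; real fraud detection is far
    more involved.
    """
    count = 0
    total = 0.0
    for event in events:
        amount = event["amount"]
        # Evaluate the rule against the state built from earlier events only.
        if count >= warmup and amount > factor * (total / count):
            yield {
                "account": event["account"],
                "amount": amount,
                "rule": "amount_above_running_mean",
            }
        # Update the running state after the rule has been evaluated.
        count += 1
        total += amount


# Illustrative, made-up transactions.
transactions = [
    {"account": "a1", "amount": 20.0},
    {"account": "a2", "amount": 25.0},
    {"account": "a1", "amount": 15.0},
    {"account": "a3", "amount": 500.0},
    {"account": "a2", "amount": 30.0},
]

alerts = list(alert_on_large_transactions(transactions))
for alert in alerts:
    print(alert)
```

Because the function is a generator, each alert is produced as soon as the offending event is seen, rather than after the whole input has been read. That is the property that matters for near real-time alerting.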
Apache Flink is a distributed processing engine for stateful computations over data streams. Flink excels at processing unbounded and bounded data sets.
Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
While Apache Spark is well known for providing stream processing support as one of its features, stream processing is an afterthought in Spark, and under the hood Spark is known to use micro-batches to emulate stream processing.
Apache Flink, on the other hand, has been designed from the ground up as a stream processing engine. This means Flink
Apache Flink supports three different data targets in its typical processing flow: data source, sink, and checkpoint target. While data source and sink are fairly obvious, the checkpoint target is used to persist state at certain intervals during processing, to guard against data loss and to recover consistently from node failures.
With AWS S3 API support a first-class citizen in Apache Flink, all three data targets can be configured to work with any AWS S3 API compatible object store, including, of course, Minio.
Minio can be configured with Flink in four broad ways. Let’s take a look at all four below:
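The role of the checkpoint target can be illustrated with a small single-process Python sketch. This is not how Flink implements checkpointing internally, since Flink takes distributed snapshots across operators. The sketch only shows the general idea: persist the processing position together with the state at intervals, and resume from the last persisted pair after a failure. The function names, the JSON file layout, and the fail_at parameter are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


def load_checkpoint(path: str) -> Tuple[int, Dict[str, int]]:
    """Return (offset, counts) from the checkpoint, or a fresh state."""
    if not os.path.exists(path):
        return 0, {}
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["offset"], data["counts"]


def save_checkpoint(path: str, offset: int, counts: Dict[str, int]) -> None:
    """Write offset and state together, then swap the file into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"offset": offset, "counts": counts}, f)
    os.replace(tmp_path, path)


def run(events: List[str], path: str, interval: int = 2, fail_at: int = -1) -> Dict[str, int]:
    """Count events, checkpointing every `interval` events.

    `fail_at` simulates a crash just before the event at that index.
    """
    offset, counts = load_checkpoint(path)
    for i in range(offset, len(events)):
        if i == fail_at:
            raise RuntimeError("simulated node failure")
        counts[events[i]] = counts.get(events[i], 0) + 1
        if (i + 1) % interval == 0:
            save_checkpoint(path, i + 1, counts)
    return counts


events = ["a", "b", "a", "c", "a"]
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run(events, checkpoint_path, fail_at=3)  # crashes after the first checkpoint
except RuntimeError:
    pass

result = run(events, checkpoint_path)  # resumes from the stored offset
print(result)
```

The event processed after the last checkpoint but before the crash is replayed on restart. Because the state is rolled back to the checkpoint along with the offset, it is still counted only once in the final result.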
2. Minio object data: The response to a Minio S3 SELECT command is streaming data, which can be fed directly to Flink for further analysis and processing.
3. Minio as the checkpoint for Flink: Flink supports checkpointing to ensure it can recover from node failures and start from right where it left off. Flink can be configured to store these checkpoints on a Minio server.
4. Minio as the sink for Flink: As Flink can output data to S3 targets, Minio can be used as the sink for data output from Flink.
Why is it a good idea to use Minio with Flink:
Let us now take a look at how to configure Apache Flink with Minio as the remote storage backend. In this example, we’ll use Minio as both the source and sink.
To start with, you’ll need a Minio server deployed; refer to this document for details. Next, download the Flink binary as explained in the quick start document.
Then update $FLINK_DIR/conf/flink-conf.yaml and add the sections below:
state.backend: filesystem
s3.endpoint: http://127.0.0.1:9000
s3.path.style.access: true
s3.access-key: minio
s3.secret-key: minio123
Here, $FLINK_DIR is the directory where you extracted the Flink tar file. Also, don’t forget to update the s3.* fields to match the actual values from your Minio server deployment.
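Since a typo in these settings usually only shows up when a job fails, it can help to sanity-check the values first. The Python sketch below is a hypothetical helper, not part of Flink or Minio. It assumes the settings are simple one-line "key: value" pairs, and the list of required keys is an assumption based on the configuration shown above.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Assumed to be required, based on the settings shown in this post.
REQUIRED_KEYS = ["s3.endpoint", "s3.access-key", "s3.secret-key"]


def parse_flat_config(text: str) -> Dict[str, str]:
    """Parse simple 'key: value' lines; blank lines and # comments are skipped."""
    config: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(": ")
        if sep:
            config[key.strip()] = value.strip()
    return config


def find_problems(config: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not config.get(key)]
    endpoint = config.get("s3.endpoint", "")
    if endpoint:
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(f"s3.endpoint is not an http(s) URL: {endpoint}")
    return problems


sample = """
state.backend: filesystem
s3.endpoint: http://127.0.0.1:9000
s3.access-key: minio
s3.secret-key: minio123
"""

problems = find_problems(parse_flat_config(sample))
print(problems or "no problems found")
```

This only checks the shape of the values. It does not contact the Minio server or verify the credentials.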
Now, start Flink. The setup is now ready to use Minio as the default storage system. To test this, I used the WordCount example from the Flink documentation:
./bin/flink run examples/batch/WordCount.jar --input s3://input/test.txt --output s3://testbucket/output
Here, test.txt is a sample text file (use any file with lots of text data). Once the job finishes, you can see the word count in the testbucket/output file.
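To cross-check the job output on a small input, the counts can be approximated locally. The Python sketch below assumes the bundled WordCount example lowercases the text and splits it on non-word characters. Treat that tokenization as an assumption and compare against your own job output rather than relying on it.

```python
import re
from collections import Counter
from typing import Dict


def word_count(text: str) -> Dict[str, int]:
    """Count words after lowercasing and splitting on non-word characters."""
    tokens = [token for token in re.split(r"\W+", text.lower()) if token]
    return dict(Counter(tokens))


sample = "To be, or not to be"
counts = word_count(sample)
for word in sorted(counts):
    print(word, counts[word])
```

For a real check, read the same test.txt you uploaded to the input bucket and compare a few of the most frequent words with the contents of the output object.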
In this post, we learnt about stream processing and how it has the potential to help enterprises speed up their data processing. We learnt why stream processing is gaining popularity and saw some of the popular use cases. Finally, we saw how Minio combined with Flink can help create a private-cloud-based streaming data infrastructure.
As streaming data becomes one of the most popular ways to consume and process events, we hope this post helped you understand how Flink is well suited to handle such an approach and why it makes sense to use Minio as the storage engine for such a streaming data infrastructure.
Stream processing with Apache Flink and Minio
Original article: https://www.cnblogs.com/rongfengliang/p/10030112.html