Found in 2 comments on Hacker News
mr_tristan · 2021-02-01 · Original thread
Yeah, I found the article to be a little annoying in that it's mostly attacking WeWork.

My recommendation for anyone new to the area is the Streaming Systems book: http://streamingbook.net/, which followed two excellent blog posts: https://www.oreilly.com/radar/the-world-beyond-batch-streami... and https://www.oreilly.com/radar/the-world-beyond-batch-streami...

Basically, if you have an unbounded data problem (and you might have something that is _practically_ unbounded), that book really helped me gain the insight to describe what we used Kinesis for (which is really similar to Kafka in a lot of ways).

Thinking about event time vs. processing time, ordering and windowing requirements, etc., was really helpful when reasoning about my own data problems.
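
To make that concrete, here is roughly what event-time windowing looks like in the Beam/Dataflow Java SDK that the book builds on. This is only a sketch: ClickEvent, its getters, and the input collection are made up for illustration.

    import org.joda.time.Duration;
    import org.joda.time.Instant;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // "clicks" is a hypothetical unbounded PCollection<ClickEvent> read from Kinesis/Kafka.
    static PCollection<KV<String, Long>> countClicksPerUser(PCollection<ClickEvent> clicks) {
      return clicks
          // Event time, not processing time: use the timestamp carried by each record.
          .apply(WithTimestamps.of((ClickEvent e) -> new Instant(e.getEventTimeMillis())))
          // Slice the unbounded stream into 5-minute tumbling event-time windows.
          .apply(Window.<ClickEvent>into(FixedWindows.of(Duration.standardMinutes(5))))
          // Downstream aggregations now run per window instead of "over all data ever".
          .apply(MapElements.into(TypeDescriptors.strings()).via((ClickEvent e) -> e.getUserId()))
          .apply(Count.perElement());
    }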

In the end, the author of this article claimed that WeWork simply didn't have much of an unbounded data problem, which I generally agree with. But just saying "you can do this in PostgreSQL" isn't really a great takeaway, and it feels like that was the message being reinforced here.

james_woods · 2021-01-22 · Original thread
Where and how in dataflow is late data handled? How can I configure how refinements relate? These are the standard "What, Where, When, How" questions I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

https://www.oreilly.com/radar/the-world-beyond-batch-streami...
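
For reference, the Beam/Dataflow model answers those four questions directly in code. A rough sketch, assuming a hypothetical keyed PCollection of pageview counts: the aggregation is the What, the window the Where, the trigger the When (including firings for late data), and the accumulation mode is how refinements relate; withAllowedLateness() controls how long late data is still accepted.

    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    static PCollection<KV<String, Long>> windowedSums(PCollection<KV<String, Long>> pageviews) {
      return pageviews
          .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))  // Where: event-time windows
              .triggering(AfterWatermark.pastEndOfWindow()                                     // When: at the watermark...
                  .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                      .plusDelayOf(Duration.standardSeconds(30)))                              // ...plus speculative early results
                  .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))              // ...plus updates for late data
              .withAllowedLateness(Duration.standardMinutes(10))                               // how late is still accepted
              .accumulatingFiredPanes())                                                       // How: refinements accumulate
          .apply(Sum.longsPerKey());                                                           // What: the aggregation itself
    }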

Also "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally "Materialize" states in their doc: State is all in totally volatile memory; if materialized dies, so too does all of the data. - this is not true for example for Apache Flink which stores its state in systems like RocksDB.

Having SideInputs or seeds is pretty neat; imagine you have two tables of several TiB or larger. This is also something that "Materialize" currently lacks: "Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data."
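
To illustrate what I mean by seeding: in Beam (Java), a bounded lookup table can be turned into a side input that every streaming element joins against. A sketch with made-up Order, Customer, and EnrichedOrder types:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    static PCollection<EnrichedOrder> enrich(PCollection<Order> orders,                 // unbounded stream
                                             PCollection<KV<String, Customer>> seed) {  // bounded "seed" table
      // Materialize the seed data as a map that workers can consult per element.
      final PCollectionView<Map<String, Customer>> customersById = seed.apply(View.asMap());

      return orders.apply(ParDo.of(new DoFn<Order, EnrichedOrder>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          Map<String, Customer> lookup = c.sideInput(customersById);
          c.output(new EnrichedOrder(c.element(), lookup.get(c.element().getCustomerId())));
        }
      }).withSideInputs(customersById));
    }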