Found in 2 comments on Hacker News
mr_tristan · 2021-02-01 · Original thread
Yeah, I found the article to be a little annoying in that it's mostly attacking WeWork.

My recommendation for anyone new to the area is the Streaming Systems book: http://streamingbook.net/, which followed two excellent blog posts: https://www.oreilly.com/radar/the-world-beyond-batch-streami... and https://www.oreilly.com/radar/the-world-beyond-batch-streami...

Basically, if you have an unbounded data problem (and you might have something that is _practically_ unbounded), that book really helped me gain the insight to describe what we used Kinesis for (which is really similar to Kafka in a lot of ways).

Thinking about event time vs. processing time, ordering and windowing requirements, etc., was really helpful when reasoning about my own data problems.
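
To make that concrete, here is roughly what event-time windowing looks like in the Beam/Dataflow Java SDK that the book builds on. This is only a sketch: ClickEvent, its getters, and the input collection are made up for illustration.

    import org.joda.time.Duration;
    import org.joda.time.Instant;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // "clicks" is a hypothetical unbounded PCollection<ClickEvent> read from Kinesis/Kafka.
    static PCollection<KV<String, Long>> countClicksPerUser(PCollection<ClickEvent> clicks) {
      return clicks
          // Event time, not processing time: use the timestamp carried by each record.
          .apply(WithTimestamps.of((ClickEvent e) -> new Instant(e.getEventTimeMillis())))
          // Slice the unbounded stream into 5-minute tumbling event-time windows.
          .apply(Window.<ClickEvent>into(FixedWindows.of(Duration.standardMinutes(5))))
          // Downstream aggregations now run per window instead of "over all data ever".
          .apply(MapElements.into(TypeDescriptors.strings()).via((ClickEvent e) -> e.getUserId()))
          .apply(Count.perElement());
    }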

In the end, the author of this article claimed that WeWork simply didn't have much of an unbounded data problem, which I generally agree with. But just saying "you can do this in PostgreSQL" isn't really a great takeaway, and it feels like that was the message being reinforced here.

james_woods · 2021-01-22 · Original thread
Where and how in dataflow is late data handled? How can I configure how refinements relate? These are the standard "What, Where, When, How" questions I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

https://www.oreilly.com/radar/the-world-beyond-batch-streami...
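
For reference, the Beam/Dataflow model answers those four questions directly in code. A rough sketch, assuming a hypothetical keyed PCollection of pageview counts: the aggregation is the What, the window the Where, the trigger the When (including firings for late data), and the accumulation mode is how refinements relate; withAllowedLateness() controls how long late data is still accepted.

    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    static PCollection<KV<String, Long>> windowedSums(PCollection<KV<String, Long>> pageviews) {
      return pageviews
          .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))  // Where: event-time windows
              .triggering(AfterWatermark.pastEndOfWindow()                                     // When: at the watermark...
                  .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                      .plusDelayOf(Duration.standardSeconds(30)))                              // ...plus speculative early results
                  .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))              // ...plus updates for late data
              .withAllowedLateness(Duration.standardMinutes(10))                               // how late is still accepted
              .accumulatingFiredPanes())                                                       // How: refinements accumulate
          .apply(Sum.longsPerKey());                                                           // What: the aggregation itself
    }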

Also "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally "Materialize" states in their doc: State is all in totally volatile memory; if materialized dies, so too does all of the data. - this is not true for example for Apache Flink which stores its state in systems like RocksDB.

Having SideInputs or seeds is pretty neat; imagine you have two tables of several TiB or larger. This is also something that "Materialize" currently lacks: "Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data."
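
To illustrate what I mean by seeding: in Beam (Java), a bounded lookup table can be turned into a side input that every streaming element joins against. A sketch with made-up Order, Customer, and EnrichedOrder types:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    static PCollection<EnrichedOrder> enrich(PCollection<Order> orders,                 // unbounded stream
                                             PCollection<KV<String, Customer>> seed) {  // bounded "seed" table
      // Materialize the seed data as a map that workers can consult per element.
      final PCollectionView<Map<String, Customer>> customersById = seed.apply(View.asMap());

      return orders.apply(ParDo.of(new DoFn<Order, EnrichedOrder>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          Map<String, Customer> lookup = c.sideInput(customersById);
          c.output(new EnrichedOrder(c.element(), lookup.get(c.element().getCustomerId())));
        }
      }).withSideInputs(customersById));
    }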