Found in 23 comments on Hacker News
Coding and building teach you more than taking a course or watching a video. If you don't have any programming background, you can enroll in some coursera or udacity courses to start with. Then go through this course http://web.stanford.edu/class/cs106x/, the course reader is really good. After that for data engineering read this book https://www.amazon.com/Designing-Data-Intensive-Applications.... Also learn some sql. Take some data, feed into sql light db, and ask question and convert question into query. Becoming good at this takes some time. Be patience. The learning curve is like hokey stick, initial phase might have a dip but it accelerate in the later phase. BY ANY CHANCE DO NOT JOIN A BOOTCAMP.
deepakkarki · 2020-11-19 · Original thread
Source and more info : https://martin.kleppmann.com/2020/11/18/distributed-systems-...

This recorded series is from Kleppmann's Concurrent and Distributed Systems course which he teaches at University of Cambridge. In case the name seems familiar, Kleppmann is the author of perhaps HN's favourite book "Designing Data-Intensive Applications" https://www.amazon.com/dp/1449373321

snicky · 2020-09-25 · Original thread
If you are interested in the computer science in general I highly recommend:

1. Structure and Interpretation of Computer Programs (available for free, e.g. here http://sarabander.github.io/sicp/html/index.xhtml

2. https://computationbook.com/

Also, I haven't read it yet, but this book has been praised here a lot recently: https://www.amazon.com/Designing-Data-Intensive-Applications...

pqb · 2020-08-25 · Original thread
I would say exactly the opposite. I regret of buying a book from Amazon [0] dedicated to Kindle-use, because it is DRM protected and I am forced to use "Amazon Kindle" application, otherwise I cannot open it. I am usually okay with DRMs but I miss a fact I haven't bought it elsewhere with less annoying protection.

[0]: https://www.amazon.com/Designing-Data-Intensive-Applications...

Psst, "Designing Data Intensive Applications" was very good read. Do you know similar books that focus on distributed systems?

mesaframe · 2020-08-12 · Original thread
I read this book. https://www.amazon.com/Designing-Data-Intensive-Applications...

And it was really enlightening. I would heavily recommend it. It starts off by teaching different types of implementations of different parts of DBMS. Then goes on to teaching about how distributed systems deal with various problems.

If you are interested in distributed systems , I found the book "Designing data intensive application by Martin Kleppmann" to be a good starting point. Its not about only about distributed systems but also covers quite a bit of ground on overall data systems. https://www.amazon.com/Designing-Data-Intensive-Applications...
avremel · 2019-12-31 · Original thread
How would you compare Database Internals to Designing Data Intensive Applications?

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...

malisper · 2019-08-22 · Original thread
Referencing my copy of designing-data intensive applications[0], here are some approaches mentioned:

1) The naive approach is to assign all writes to a chunk randomly. This makes reads a lot more expensive as now a read for a particular key (e.g. device) will have to touch every chunk.

2) If you know a particular key is hot, you can spread writes for that particular key to random chunks. You need some extra bookeeping to keep track of which keys you are doing this for.

3) Splitting hot chunks into smaller chunks. You will wind up with varying sized chunks, but each chunk will now have a roughly equal write volume.

One more approach I would like to add is rate-limiting. If the reads or writes for a particular key crosses some threshold, you can drop any additional operations. Of course this is only fine if you are ok with having operations to hot keys often fail.

[0] https://www.amazon.com/Designing-Data-Intensive-Applications...

meritt · 2019-07-29 · Original thread
For anyone eager to read something now, Designing Data-Intensive Applications [1] is an excellent and completed book that covers nearly all of the same material with significant depth.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...

nw__dataeng · 2019-07-12 · Original thread
I'd highly recommend reading [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications...). The book gives you a great overview of designing data systems - foundational knowledge you'll need in any DE role.

The reason you can't find data engineering materials online is because real data engineering really only happens at a handful of companies - and those companies maintain this knowledge base internally and do not share it.

I noticed that you listed tools / frameworks to learn, as well as languages. Another piece of advice would be to not focus on those because they come and go (for example, Hadoop is pretty much deprecated in any DE-heavy company). What lasts is an understanding of distributed systems, distributed query engines, storage technologies, and algorithms & data structures. If you have a firm grasp on those, you won't have to start from scratch every time a new framework is introduced. You'll immediately recognize what problems the tech is solving and how they're solving it, and based on your knowledge you can connect the dots and know if that solution is what you need.

Another thing to do is watch CS186 from Berkeley in its entirety. This course is about relational databases, but will give you the foundation you need to speak the DE language.

Source: I work as a data engineer at what some would call a big company :)

sambroner · 2019-02-22 · Original thread
I haven't read Designing Distributed Systems, but I have read Designing Data-Intensive Applications [0] and it was fantastic.

An overview of databases (what and why, but also a lot of how) plus distributed concepts and modern architectures.

[0] https://www.amazon.com/Designing-Data-Intensive-Applications...

karolist · 2019-02-12 · Original thread
I'll structure this in "current/future/recent_past" format if I may.

Currently:

* The Go Programming Language

https://www.amazon.com/Programming-Language-Addison-Wesley-P...

* Building Microservices

https://www.amazon.com/Building-Microservices-Designing-Fine...

Plan to do next:

* Designing Data-Intensive Applications

https://www.amazon.com/Designing-Data-Intensive-Applications...

* Designing Distributed Systems

https://www.amazon.com/Designing-Distributed-Systems-Pattern...

* Unix and Linux System Administration 5th ed, but probably just gonna skip/read chapters of interest, i.e. I wanna get a better understanding of SystemD.

https://www.amazon.com/UNIX-Linux-System-Administration-Hand...

Read last month:

* Learning React

Good for a quick intro but I probably wouldn't read cover-to-cover again, some sections are old, but overall an OK book.

https://www.amazon.com/Learning-React-Functional-Development...

* React Design Patterns and Best Practices

Really liked this one, picked a tonne of new ideas and approaches that are hard to find otherwise for a newbie in JS scene. These two books, some time spent reading up on webpack and lots of github/practice code made me not scared of JS anymore and not feeling the fatigue. I mean, I was one of the people who dismissed everything frontend related, big node_modules, electron, complicated build systems etc. But now I sort of understand why and am on the different side of the fence.

https://www.amazon.com/React-Design-Patterns-Best-Practices/...

* Flexbox in CSS

Wanted to understand what's the new flexbox layout is about since it's been a while when I've done some serious CSS work. Long story short I made it about half of this and dropped it - not any more useful than MDN docs and actually playing with someone's codepen gave me better understanding in 5 minutes than 3 hours spent with this book.

https://www.amazon.com/Flexbox-CSS-Estelle-Weyl-ebook/dp/B07...

tracer4201 · 2019-02-10 · Original thread
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

https://www.amazon.com/Designing-Data-Intensive-Applications...

I read through this book last year when I saw it recommended on HN. I recommended it to engineers on my team at work.

I’m reading it for a second time now, and just finished chapter 2 today. It’s dense but an amazingly detailed and thorough text.

otras · 2018-11-04 · Original thread
I'd recommend the following:

Clean Code: A Handbook of Agile Software Craftsmanship [0] is a great book on writing and reading code.

Similarly, Clean Architecture: A Craftsman's Guide to Software Structure and Design [1] is, no surprise, a book on organizing and architecting software.

Designing Data-Intensive Applications [2] may be overkill for your situation, but it's a good read to get an idea about how large scale applications function.

The Architecture of Open Source Applications [3] is a fantastic free resource that walks through how many applications are built. As another comment mentioned, reading code and understanding how other programs are built are great ways to build your "how to do things" repertoire.

Finally, I'd also recommend taking some classes. I started as a self-taught developer, but I've since taken classes both in-person and online that have been a tremendous help. There are many available for free online, and if in-person classes work better for you (motivation, support, resources, etc), definitely go that route. They're a fantastic way to grow.

[0]: https://www.amazon.com/Clean-Code-Handbook-Software-Craftsma...

[1]: https://www.amazon.com/Clean-Architecture-Craftsmans-Softwar...

[2]: https://www.amazon.com/Designing-Data-Intensive-Applications...

[3]: http://aosabook.org/en/index.html

sbmthakur · 2018-08-24 · Original thread
I second Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Rather than covering theoretical aspects in detail, it focuses on real-life problems that can be solved using various paradigms.

https://www.amazon.com/Designing-Data-Intensive-Applications...

chw9e · 2018-07-22 · Original thread
As a self-taught developer, I used to think that some of the theoretical elements were overhyped. I can build iOS apps that work, and I did just that for the last 2-3 years. However, many of the programs that I wrote have not been as easy to maintain as I would like and some difficult to fix bugs have popped up overtime, both of which are due to a lack of deeper understanding of CS fundamentals. Last year I started interviewing and was ridiculed at one company in particular for a lack of CS knowledge. Afterwords I started exploring a lot of the CS concepts listed in this link and I have since found numerous ways to improve my code quality and have a better understanding of how CS best practices came to be. I also used to think that algorithms and data structures were relatively useless for an iOS developer, and I was able to do the job without them, thus proving my point. However, after gaining a better understanding, it quickly becomes clear that things like view hierarchies are simply trees and understanding ways to traverse these hierarchies can lead to much cleaner code. With the open sourcing of Swift, I also became more interested in understanding the language, but a lot of the language design decisions didn't make sense to me until I gained a better understanding of CS fundamentals. I have found the programming languages course on Coursera [1] to be particularly useful, and have also greatly enjoyed the book Designing Data Intensive Applications [2]. There's also a great video from this year's WWDC that really inspires algorithm study and use in everyday applications [3].

[1] https://www.coursera.org/learn/programming-languages

[2] https://www.amazon.com/Designing-Data-Intensive-Applications...

[3] https://developer.apple.com/videos/play/wwdc2018/223/

dustingetz · 2018-07-14 · Original thread
"CP/AP: a false dichotomy" https://martin.kleppmann.com/2015/05/11/please-stop-calling-... . Martin Kleppman is the author of "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" https://www.amazon.com/Designing-Data-Intensive-Applications...
Another good resource is Designing Data-Intensive Applications [1]. Chapter 2 does a really good job explaining how different categories of databases relate to different data models, including examples of querying graph-like data models using `WITH RECURSIVE` compared to a query language for graph databases.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...

throwawaypls · 2018-05-16 · Original thread
I read this book titled "Designing Data Intensive Applications", which covers this and a lot of other stuff about designing applications in general. https://www.amazon.com/Designing-Data-Intensive-Applications...
jpamata · 2018-05-10 · Original thread
Designing Data-Intensive Applications[0] by Martin Kleppmann. There's a previous HN thread about it[1]. Helped me understand a bit more about databases and systems. The book is also very approachable and has the perfect blend of application and theory at a high level that anyone approaching the industry for the first time stands to gain a lot from reading it.

The Architecture of Open Source Applications[2] series is a good one for leaning how to build production applications and you can read it online. The chapter on Scalable Web Architecture[3] is a must-read.

[0] https://www.amazon.com/Designing-Data-Intensive-Applications...

[1] https://news.ycombinator.com/item?id=15428526

[2] http://aosabook.org/en/index.html

[3] http://aosabook.org/en/distsys.html

adamnemecek · 2017-01-17 · Original thread
You should try to understand how databases in general work, it will help you with your query writing.

One thing you have to realize is that once you get a little advanced, you have to get to the details of the single SQL implementations, it's not about SQL but about Postgres.

I've found these books really valuable

# SQL Performance Explained Everything Developers Need to Know about SQL Performance

https://www.amazon.com/Performance-Explained-Everything-Deve...

This book fundamentally talks about how to effectively use and leverage the SQL indices. Talks about all the important implementations (Postgres, MySQL, Oracle, SQL Server).

# Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

https://www.amazon.com/Designing-Data-Intensive-Applications...

This book gets mentioned a bunch around here and for a good reason. There aren't too many concrete resources on making your systems "webscale" and this one is really good.

# PostgreSQL 9.0 High Performance

https://www.amazon.com/PostgreSQL-High-Performance-Gregory-S...

Discusses all the different settings and tweaks you can do in Postgres. It's crazy how much of a perf gain you can get just by twiddling the parameters of the database, i.e. all the tricks you can do when the single instances are bottle necks.

There's a similar book for MySQL https://www.amazon.com/High-Performance-MySQL-Optimization-R...

# PostgreSQL 9 High Availability Cookbook

https://www.amazon.com/PostgreSQL-9-High-Availability-Cookbo...

Discusses how do you go from 1 Postgres instance to 1+ instance. Talks about replication, monitoring, cluster management, avoiding downtime etc i.e. all the tricks you can do to manage multiple instances. Again there's a similar book for MySQL https://www.amazon.com/MySQL-High-Availability-Building-Cent...

Last but not least check out the postgres documentation, people consider it a standard of what good documentation looks like https://www.postgresql.org/docs/9.6/static/index.html

Also last but not least, read up on relational algebra (the foundation of SQL) https://en.wikipedia.org/wiki/Relational_algebra. I've always found SQL to be extremely verbose (the syntax reminds me of idk COBOL or smth) but there's another query language called Datalog, that's for our purposes similar to SQL but the syntax is much more legible.

E.g. check out these snippets from these slides (page 29) (and check out the whole class too)

https://pages.iai.uni-bonn.de/manthey_rainer/IIS_1617/IIS201...

Datalog:

s(X) <- p(X,Y).

s(X) <- r(Y,X).

t(X,Y,Z) <- p(X,Y), r(Y,Z).

w(X) <- s(X), not q(X).

SQL:

CREATE VIEW s AS (SELECT a FROM p)

UNION

(SELECT b FROM r);

CREATE VIEW t AS

SELECT a, b, c

FROM p, r

WHERE p.b = r.a,

CREATE VIEW w AS (TABLE s)

MINUS (TABLE q);

mindcrash · 2016-02-25 · Original thread
You probably might want to read this (for free): http://book.mixu.net/distsys/single-page.html

And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...

And this one: http://www.amazon.com/Big-Data-Principles-practices-scalable...

Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.

Both are very good books on building scalable data processing systems.

Fresh book recommendations delivered straight to your inbox every Thursday.