Dan Luu's site [1] goes into a fair bit of detail on the disk side. I see no reason why you can't emulate a superset of the worst-case behaviour and so gain a great deal of confidence that you're using file access in a way that won't result in corruption.
Networks will have a similar long tail, e.g. asymmetric net splits.
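To make the disk-side idea concrete, here is a minimal sketch of a simulated disk, assuming a deliberately pessimistic model where any write that has not been fsynced may be dropped, reordered or torn on a crash. The class name `SimulatedDisk`, the whole-disk `fsync()` and the 0.5 and 0.3 probabilities are all hypothetical choices for illustration, not anything from the article in [1].

```python
import random


class SimulatedDisk:
    """Toy model: writes sit in a volatile cache until fsync() makes them durable."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # seeded so a failing run can be replayed
        self.durable = {}               # path -> bytes that survive a crash
        self.pending = []               # (path, data) writes not yet fsynced

    def write(self, path, data):
        self.pending.append((path, data))

    def fsync(self):
        # Simplification (assumption): one fsync flushes every pending write.
        for path, data in self.pending:
            self.durable[path] = data
        self.pending.clear()

    def crash(self):
        """Drop, reorder or tear un-fsynced writes, then return what survived."""
        survivors = [w for w in self.pending if self.rng.random() < 0.5]
        self.rng.shuffle(survivors)
        for path, data in survivors:
            if self.rng.random() < 0.3:
                # Torn write: only a prefix of the data reaches the disk.
                data = data[: self.rng.randrange(len(data) + 1)]
            self.durable[path] = data
        self.pending.clear()
        return dict(self.durable)
```

You would run your storage code against this across many seeds and check your invariants after every `crash()`. Note that an un-fsynced overwrite of an existing path can replace good data with a torn prefix, which is exactly the kind of case you want the simulation to hit.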
The SQL version is a bit trickier, as the API is much wider. The abstraction I was working on was essentially that you get select, insert and update, and write anything more complicated yourself.
This works for replicating the skews and other phenomena described in DDIA [2], but it runs into the same core problem that you're simulating a model of your code, not your code itself. The best pathway for temporal fuzzing databases with production loads is probably at the network layer.
[1] https://danluu.com/file-consistency/
[2] https://www.amazon.com.au/Designing-Data-Intensive-Applicati...
https://www.amazon.com/Designing-Data-Intensive-Applications...
0. Total Compensation (TC)
Salary comparison site: https://www.levels.fyi/
Anonymous posting with verified employees: https://www.teamblind.com/
These are the best tools for finding out what compensation actually is at these places. I know enough people in these companies to know these numbers are accurate. Keep in mind these numbers often include stock appreciation. You can filter to new offers to get numbers that exclude stock appreciation.
1. Leetcode (LC)
FAANG+ interviews always involve solving programming problems in real time. The best place to practice is Leetcode.
Buy a yearlong Leetcode premium subscription and do all the modules listed here, in no particular order, but skip decision trees and machine learning: https://leetcode.com/explore/learn/
When you are done with that, do all the problems on this list: https://www.teamblind.com/post/New-Year-Gift---Curated-List-...
A lot of these problems are in the modules linked previously, so you will only have 30-40 new problems here.
Next, do random problems until you "see through the matrix."
Focus on medium level problems. Try to do something like 35% easy, 50% medium, 15% hard.
If you can't find the optimal solution to a problem, "upsolve" by reading a bit of the solution and trying again. If you still can't get it, copy the code of the solution and study it. Then erase it and try to solve it from memory.
Periodically go back over solved problems and re-solve them while taking notes.
Your goal should be to solve two random LC mediums in ~35 minutes. Solve problems out loud to simulate communicating your thoughts to an interviewer.
Consider using Python as your interview language if you are comfortable enough with it. It's faster than Java for writing. Some places will have you run the code, others it will be a glorified whiteboard, so don't use the run button as a crutch. Around two weeks before your interview, start doing company tagged problems like: https://leetcode.com/company/doordash/
Start doing this part first and grind it hard. It might take 3 months, it might take a year. It takes as long as it takes until you think you can crush it.
2. System Design
The system design interview tests your ability to piece together components to build an entire product or feature. A typical question is something like "design a URL shortener that serves 1B requests per day." You will need to choose database/pubsub/caching technologies appropriate to the problem, describe DB schemas, caching strategies, partitioning/replication schemes, design APIs, etc.
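A quick back-of-envelope calculation is usually the first step for a question like the URL shortener one. The sketch below turns "1B requests per day" into requests per second; the `peak_factor` of 3 is an assumed multiplier for illustration, not a figure from any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day):
    """Average requests per second if traffic were perfectly even."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day, peak_factor=3.0):
    """Rough peak estimate; peak_factor is an assumption, tune it per product."""
    return average_rps(requests_per_day) * peak_factor


print(round(average_rps(1_000_000_000)))  # → 11574
```

So 1B requests per day is roughly 11.6K requests per second on average, which is the number you then use to reason about caching and partitioning.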
For senior level roles, this will be the most important part of your interview as far as leveling. If you are shaky, they will downlevel. Buy DDIA: https://www.amazon.com/Designing-Data-Intensive-Applications...
Read it more than once.
These courses on educative.io are useful: https://www.educative.io/courses/grokking-the-system-design-...
These videos are also really good: https://www.codekarle.com/
Also read FAANG level engineering blogs (Uber/Doordash/Netflix/Facebook) and watch tech talks on Cassandra/Kafka and stuff like that.
Videos are the best last minute prep before interviews for design.
3. Applying
Get referrals wherever you can. Most places will ignore you unless you have them. I applied to probably 25+ companies and got rejects or ignored for all but Uber, AirBnB and LinkedIn. Places I had referrals to I scored onsites for 100% of the time, including places that rejected me before a referral. You can get referrals off of Blind, but you probably also have people in your network in FANG and top tier companies. People will be motivated to refer since referral bonuses are usually large.
4. Interviewing
The process is recruiter call -> "phone screen" (do an LC problem on Hackerrank on a zoom call) -> "onsite" which is 5 hours of zoom...usually 2 coding, 1 behavioral (maybe a small coding question as well), 1 design.
Do mock interviews with friends/colleagues for LC problems. I would totally be willing to do mocks with you when you are ready. I had 3 different people give me a total of 6 mock interviews. You can also pay for this with different companies like interviewing.io or randoms off Blind. I can give you the contact info of the guy from Uber who did the system design mock with me as well. He is super super good. It's much harder to find mock interviewers for system design.
Also for interviews you can interview over 2-3 days after 3pm PST to avoid taking time off work. Recruiters will let you push back interviews for any reason multiple times, especially if it's for more interview prep, so if you aren't where you want to be before one, it's totally fine to ask for more time.
5. Negotiating
You should try to get all your interviews lined up very close together to get competing offers, which can increase your offer by a lot.
and
https://www.amazon.co.uk/Designing-Data-Intensive-Applicatio...
are great
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
https://www.amazon.com/Designing-Data-Intensive-Applications...
Took about four months of studying ~2 hours daily.
0. Total Compensation (TC)
Compensation data: https://www.levels.fyi/
Get the app Blind and start browsing it daily. People regularly post their offers, and it is the most up to date info on the market. It’s an anonymous forum where your company email is verified. You can DM employees of target companies for referrals or information about roles.
1. Leetcode (LC)
Buy a yearlong Leetcode premium subscription and do all the modules listed here, in no particular order, but skip decision trees and machine learning: https://leetcode.com/explore/learn/
When you are done with that, do all the problems on this list: https://www.teamblind.com/post/New-Year-Gift---Curated-List-...
A lot of these problems are in the modules linked previously, so you will only have 30-40 new problems here.
Next, do random problems until you "see through the matrix."
Focus on medium level problems. Try to do something like 35% easy, 50% medium, 15% hard.
If you can't find the optimal solution to a problem, "upsolve" by reading a bit of the solution and trying again. If you still can't get it, copy the code of the solution and study it. Then erase it and try to solve it from memory.
Periodically go back over solved problems and re-solve them while taking notes.
Your goal should be to solve two random LC mediums in ~35 minutes. Consider using Python as your interview language if you are comfortable enough with it. It's faster than Java for writing.
Some places will have you run the code, others it will be a glorified whiteboard, so don't use the run button as a crutch.
Around two weeks before your interview, start doing company tagged problems like: https://leetcode.com/company/doordash/
Start doing this part first and grind it hard. It might take 3 months, it might take a year. It takes as long as it takes until you think you can crush it. I spent around 2 hrs each day in the morning on LC.
2. System Design
If you are being considered for senior level roles, this will be by far the most important part of your interview as far as leveling. If you are shaky, they will downlevel.
Buy DDIA: https://www.amazon.com/Designing-Data-Intensive-Applications...
Read it more than once.
These courses on educative.io are useful: https://www.educative.io/courses/grokking-the-system-design-... https://www.educative.io/courses/grokking-adv-system-design-...
These videos are also really good: https://www.codekarle.com/
Tech talks on Cassandra/Kafka and stuff like that are good.
Videos are the best last minute prep before interviews for design.
3. Companies
Amazon tends to be easier in terms of LC problems but asks more behavioral questions. Amazon also has a reputation for being stressful, and pay is not at the level of Meta/Google, though that might be changing. I would do this interview first since it’s good practice for getting behavioral stories real sharp.
Google is way slower than these other companies, so if you wanna consider them, get the process started as early as you can.
If you are interested in remote, also consider Zoom, Square, Twitter, and Coinbase.
4. Applying
Get referrals wherever you can. Most places will ignore you unless you have them. I applied to probably 25+ companies and got rejects or ignored for all but Uber and AirBnB. Places I had referrals to I scored onsites for 100% of the time, including places that rejected me before a referral.
You can get referrals off Blind. I didn’t do this, but I guess it happens! You probably also have people somewhere in your network in FANG and top tier companies if you look. If people think you have a chance of passing they’ll be happy to refer. Referral bonuses are several thousand dollars. Ask them for mock interviews as well.
5. Interviewing
The process is recruiter call -> "phone screen" (do an LC problem on Hackerrank while on a zoom call) -> "onsite" which is 5 hours of zoom...usually 2 coding, 1 behavioral (maybe a small coding question as well), 1 design.
Do mock interviews with friends/colleagues for LC problems. I had 3 different people give me a total of 6 mock interviews. You can also pay for this with different companies like interviewing.io or randoms off Blind.
Getting mock interviews for system design is harder, and you might have to pay for it. I did and it was the best money I spent that year.
Also for interviews you can interview over 2-3 days after 3pm PST to avoid taking time off work if you’re not in PST.
Recruiters will let you push back interviews for any reason multiple times, especially if it's for more interview prep, so if you aren't where you want to be before one, it's totally fine to ask for more time.
6. Negotiating
You should try to get all your interviews lined up very close together to get competing offers, especially if you want Google, who tends to lowball candidates that do not have competing offers.
I surprisingly really enjoyed it. Well written and it pulled back the veil on a lot of concepts that I thought were too complex for me to understand/enjoy.
You can learn a lot of algorithms. It's useless unless you start to create architecture and use them in practice.
This recorded series is from Kleppmann's Concurrent and Distributed Systems course, which he teaches at the University of Cambridge. In case the name seems familiar, Kleppmann is the author of perhaps HN's favourite book, "Designing Data-Intensive Applications": https://www.amazon.com/dp/1449373321
1. Structure and Interpretation of Computer Programs (available for free, e.g. here: http://sarabander.github.io/sicp/html/index.xhtml)
2. https://computationbook.com/
Also, I haven't read it yet, but this book has been praised here a lot recently: https://www.amazon.com/Designing-Data-Intensive-Applications...
[0]: https://www.amazon.com/Designing-Data-Intensive-Applications...
Psst, "Designing Data Intensive Applications" was a very good read. Do you know similar books that focus on distributed systems?
And it was really enlightening. I would heavily recommend it. It starts off by teaching different types of implementations of different parts of DBMS. Then goes on to teaching about how distributed systems deal with various problems.
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
1) The naive approach is to assign all writes to a chunk randomly. This makes reads a lot more expensive as now a read for a particular key (e.g. device) will have to touch every chunk.
2) If you know a particular key is hot, you can spread writes for that particular key to random chunks. You need some extra bookkeeping to keep track of which keys you are doing this for.
3) Splitting hot chunks into smaller chunks. You will wind up with varying sized chunks, but each chunk will now have a roughly equal write volume.
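Approach 2 can be sketched roughly as below, assuming an in-memory list of chunks and a fixed, known set of hot keys. `SaltedStore`, `hot_keys` and the CRC-based placement are hypothetical names and choices for illustration; a real system would have to discover hot keys dynamically.

```python
import random
import zlib
from collections import defaultdict


class SaltedStore:
    """Toy key-value store that scatters writes for known-hot keys."""

    def __init__(self, num_chunks, hot_keys, seed=0):
        self.num_chunks = num_chunks
        self.hot_keys = set(hot_keys)  # the extra bookkeeping from approach 2
        self.chunks = [defaultdict(list) for _ in range(num_chunks)]
        self.rng = random.Random(seed)

    def _home_chunk(self, key):
        # Stable placement for normal keys.
        return zlib.crc32(key.encode()) % self.num_chunks

    def write(self, key, value):
        if key in self.hot_keys:
            idx = self.rng.randrange(self.num_chunks)  # spread the write load
        else:
            idx = self._home_chunk(key)
        self.chunks[idx][key].append(value)

    def read(self, key):
        # Hot keys pay for cheap writes with a scatter-gather read.
        if key in self.hot_keys:
            targets = range(self.num_chunks)
        else:
            targets = [self._home_chunk(key)]
        values = []
        for idx in targets:
            values.extend(self.chunks[idx].get(key, []))
        return values
```

Only the keys listed in `hot_keys` pay the every-chunk read cost from approach 1; everything else still reads from a single chunk. Values for a hot key come back grouped by chunk, not in write order.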
One more approach I would like to add is rate-limiting. If the reads or writes for a particular key cross some threshold, you can drop any additional operations. Of course this is only fine if you are ok with having operations to hot keys often fail.
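A minimal sketch of the rate-limiting idea, assuming a fixed-window counter per key. `KeyRateLimiter`, `limit` and `window_seconds` are hypothetical names, and a fixed window is just one of several possible policies (token buckets and sliding windows are common alternatives).

```python
import time


class KeyRateLimiter:
    """Allow at most `limit` operations per key in each fixed time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock   # injectable so tests can control time
        self.counts = {}     # key -> (window_start, count)

    def allow(self, key):
        now = self.clock()
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired, start a fresh one
        if count >= self.limit:
            return False           # over the threshold: drop the operation
        self.counts[key] = (start, count + 1)
        return True
```

The caller checks `allow(key)` before each read or write and fails the operation when it returns False, so a hot key degrades on its own without dragging down the chunk it lives on.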
[0] https://www.amazon.com/Designing-Data-Intensive-Applications...
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
The reason you can't find data engineering materials online is that real data engineering really only happens at a handful of companies, and those companies maintain this knowledge base internally and do not share it.
I noticed that you listed tools / frameworks to learn, as well as languages. Another piece of advice would be to not focus on those because they come and go (for example, Hadoop is pretty much deprecated in any DE-heavy company). What lasts is an understanding of distributed systems, distributed query engines, storage technologies, and algorithms & data structures. If you have a firm grasp on those, you won't have to start from scratch every time a new framework is introduced. You'll immediately recognize what problems the tech is solving and how they're solving it, and based on your knowledge you can connect the dots and know if that solution is what you need.
Another thing to do is watch CS186 from Berkeley in its entirety. This course is about relational databases, but will give you the foundation you need to speak the DE language.
Source: I work as a data engineer at what some would call a big company :)
An overview of databases (what and why, but also a lot of how) plus distributed concepts and modern architectures.
[0] https://www.amazon.com/Designing-Data-Intensive-Applications...
Currently:
* The Go Programming Language
https://www.amazon.com/Programming-Language-Addison-Wesley-P...
* Building Microservices
https://www.amazon.com/Building-Microservices-Designing-Fine...
Plan to do next:
* Designing Data-Intensive Applications
https://www.amazon.com/Designing-Data-Intensive-Applications...
* Designing Distributed Systems
https://www.amazon.com/Designing-Distributed-Systems-Pattern...
* Unix and Linux System Administration 5th ed, but probably just gonna skip/read chapters of interest, i.e. I wanna get a better understanding of systemd.
https://www.amazon.com/UNIX-Linux-System-Administration-Hand...
Read last month:
* Learning React
Good for a quick intro but I probably wouldn't read cover-to-cover again, some sections are old, but overall an OK book.
https://www.amazon.com/Learning-React-Functional-Development...
* React Design Patterns and Best Practices
Really liked this one, picked up a tonne of new ideas and approaches that are hard to find otherwise for a newbie in the JS scene. These two books, some time spent reading up on webpack and lots of github/practice code made me not scared of JS anymore and not feeling the fatigue. I mean, I was one of the people who dismissed everything frontend related, big node_modules, electron, complicated build systems etc. But now I sort of understand why and am on the other side of the fence.
https://www.amazon.com/React-Design-Patterns-Best-Practices/...
* Flexbox in CSS
Wanted to understand what the new flexbox layout is about, since it's been a while since I've done some serious CSS work. Long story short, I made it about halfway through and dropped it. It's not any more useful than the MDN docs, and actually playing with someone's codepen gave me a better understanding in 5 minutes than 3 hours spent with this book.
https://www.amazon.com/Flexbox-CSS-Estelle-Weyl-ebook/dp/B07...
https://www.amazon.com/Designing-Data-Intensive-Applications...
I read through this book last year when I saw it recommended on HN. I recommended it to engineers on my team at work.
I’m reading it for a second time now, and just finished chapter 2 today. It’s dense but an amazingly detailed and thorough text.
Clean Code: A Handbook of Agile Software Craftsmanship [0] is a great book on writing and reading code.
Similarly, Clean Architecture: A Craftsman's Guide to Software Structure and Design [1] is, no surprise, a book on organizing and architecting software.
Designing Data-Intensive Applications [2] may be overkill for your situation, but it's a good read to get an idea about how large scale applications function.
The Architecture of Open Source Applications [3] is a fantastic free resource that walks through how many applications are built. As another comment mentioned, reading code and understanding how other programs are built are great ways to build your "how to do things" repertoire.
Finally, I'd also recommend taking some classes. I started as a self-taught developer, but I've since taken classes both in-person and online that have been a tremendous help. There are many available for free online, and if in-person classes work better for you (motivation, support, resources, etc), definitely go that route. They're a fantastic way to grow.
[0]: https://www.amazon.com/Clean-Code-Handbook-Software-Craftsma...
[1]: https://www.amazon.com/Clean-Architecture-Craftsmans-Softwar...
[2]: https://www.amazon.com/Designing-Data-Intensive-Applications...
https://www.amazon.com/Designing-Data-Intensive-Applications...
[1] https://www.coursera.org/learn/programming-languages
[2] https://www.amazon.com/Designing-Data-Intensive-Applications...
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
The Architecture of Open Source Applications[2] series is a good one for learning how to build production applications and you can read it online. The chapter on Scalable Web Architecture[3] is a must-read.
[0] https://www.amazon.com/Designing-Data-Intensive-Applications...
[1] https://news.ycombinator.com/item?id=15428526
One thing you have to realize is that once you get a little advanced, you have to get into the details of the individual SQL implementations. At that point it's not about SQL but about Postgres.
I've found these books really valuable
# SQL Performance Explained Everything Developers Need to Know about SQL Performance
https://www.amazon.com/Performance-Explained-Everything-Deve...
This book fundamentally talks about how to effectively use and leverage the SQL indices. Talks about all the important implementations (Postgres, MySQL, Oracle, SQL Server).
# Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
https://www.amazon.com/Designing-Data-Intensive-Applications...
This book gets mentioned a bunch around here and for a good reason. There aren't too many concrete resources on making your systems "webscale" and this one is really good.
# PostgreSQL 9.0 High Performance
https://www.amazon.com/PostgreSQL-High-Performance-Gregory-S...
Discusses all the different settings and tweaks you can do in Postgres. It's crazy how much of a perf gain you can get just by twiddling the parameters of the database, i.e. all the tricks you can do when the single instances are bottlenecks.
There's a similar book for MySQL https://www.amazon.com/High-Performance-MySQL-Optimization-R...
# PostgreSQL 9 High Availability Cookbook
https://www.amazon.com/PostgreSQL-9-High-Availability-Cookbo...
Discusses how to go from 1 Postgres instance to more than one. Talks about replication, monitoring, cluster management, avoiding downtime etc, i.e. all the tricks you can do to manage multiple instances. Again there's a similar book for MySQL https://www.amazon.com/MySQL-High-Availability-Building-Cent...
Last but not least check out the postgres documentation, people consider it a standard of what good documentation looks like https://www.postgresql.org/docs/9.6/static/index.html
Also last but not least, read up on relational algebra (the foundation of SQL) https://en.wikipedia.org/wiki/Relational_algebra. I've always found SQL to be extremely verbose (the syntax reminds me of idk COBOL or smth) but there's another query language called Datalog, that's for our purposes similar to SQL but the syntax is much more legible.
E.g. check out these snippets from these slides (page 29) (and check out the whole class too)
https://pages.iai.uni-bonn.de/manthey_rainer/IIS_1617/IIS201...
Datalog:
s(X) <- p(X,Y).
s(X) <- r(Y,X).
t(X,Y,Z) <- p(X,Y), r(Y,Z).
w(X) <- s(X), not q(X).
SQL:
CREATE VIEW s AS (SELECT a FROM p)
UNION
(SELECT b FROM r);
CREATE VIEW t AS
SELECT p.a, p.b, r.b AS c
FROM p, r
WHERE p.b = r.a;
CREATE VIEW w AS (TABLE s)
MINUS (TABLE q);
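If it helps to see the relational algebra underneath both notations, here is a rough sketch of the same three rules using plain Python sets. The sample tuples in `p`, `r` and `q` are made up for illustration and are not from the slides.

```python
# Relations as sets of tuples; the data is invented for this example.
p = {(1, 2), (2, 3)}
r = {(2, 5), (3, 4)}
q = {1}  # unary relation, stored as bare values

# s(X) <- p(X,Y).  s(X) <- r(Y,X).   (projection + union)
s = {x for (x, _) in p} | {x for (_, x) in r}

# t(X,Y,Z) <- p(X,Y), r(Y,Z).        (join on the shared variable Y)
t = {(x, y, z) for (x, y) in p for (y2, z) in r if y == y2}

# w(X) <- s(X), not q(X).            (set difference)
w = s - q

print(sorted(s))  # → [1, 2, 4, 5]
print(sorted(t))  # → [(1, 2, 5), (2, 3, 4)]
print(sorted(w))  # → [2, 4, 5]
```

Each Datalog rule maps onto one comprehension, which is a big part of why the Datalog version reads so much more directly than the SQL.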
And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...
And this one: http://www.amazon.com/Big-Data-Principles-practices-scalable...
Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.
Both are very good books on building scalable data processing systems.
Taking a look at the Kafka docs [2] is also enlightening.
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
[2] https://kafka.apache.org/documentation/#gettingStarted