Found in 4 comments on Hacker News
There's a lot of stuff under the "Big Data" umbrella: I focused on Hadoop below because that's what I'm working with right now. I'm sure I'm missing some roles here, but the specialties I can think of are:

- getting data out of production systems and transforming it (infrastructure or ETL)
- analytical querying and reporting
- system administration
- machine learning

There's also the wide world of NoSQL data stores, which people lump in with big data, but which require vastly different skills.

The Hadoop VM I linked to above is good for working through exercises for all of the above.

As a starting point, this book[1] walks through the motivation behind Hadoop, then gets a little into internals and use cases. It's somewhat out of date, but working through it will put you in the right frame of mind, help you understand HDFS, and so on.

AMP Camp (which I linked to above) is an introduction to Spark for people with a little Hadoop experience. Spark is getting a lot of attention, and you could run into it in a number of roles.

If you're going to be planning the whole pipeline, or doing any sort of infrastructure role, I recommend Hadoop Application Architecture[2] for more modern tools and design patterns. This blog post[3] is a pretty good overview of distributed logs, which are essential for horizontal scale. Understanding Kafka and ZooKeeper is really useful for infrastructure roles, maybe less so for admins.
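The core idea behind a distributed log like Kafka is simpler than the systems built on it: an append-only sequence of records, where each record has an offset and each consumer tracks its own read position. Here's a toy, single-process sketch of that abstraction (not Kafka's actual API, and with none of the partitioning or replication that make it scale horizontally):

```python
# Toy illustration of the append-only log abstraction behind systems
# like Kafka. Single-process sketch only; real distributed logs add
# partitioning, replication, and durable storage.

class Log:
    def __init__(self):
        self.records = []  # append-only list of messages

    def append(self, message):
        """Producers append; each record gets a monotonically increasing offset."""
        self.records.append(message)
        return len(self.records) - 1  # the new record's offset

    def read(self, offset):
        """Consumers read from an offset they track themselves, so many
        consumers can replay the same log independently."""
        return self.records[offset:]

log = Log()
log.append("user_signup:alice")
log.append("page_view:/home")

# Two independent consumers at different positions:
print(log.read(0))  # full history
print(log.read(1))  # only the second record onward
```

Because consumers own their offsets, the same log can feed a real-time dashboard and a nightly batch job without coordination between them; that decoupling is why logs show up everywhere in pipelines built for horizontal scale.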

If you're planning to be in the reporting layer, having a deep understanding of SQL and data warehousing is useful. This book[4] is old hat, but I would say it's expected knowledge for anyone planning a warehouse, and it's interesting to understand best practices. Most places will also expect knowledge of Tableau or a similar BI tool, but that's tougher to learn on your own since licenses are brutal. Visualization with D3 is nice to have in this space, especially if you're coming from a web background - Scott Murray's tutorials[5] are a good starting place.
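The dimensional-modeling idea at the heart of that warehousing book - a central fact table joined to descriptive dimension tables (a "star schema") - can be sketched with nothing but stdlib sqlite3. The table and column names below are invented for illustration:

```python
import sqlite3

# Minimal star-schema sketch: one fact table (fact_sales) joined to a
# dimension table (dim_date). All names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER, amount REAL);
    INSERT INTO dim_date VALUES (1, 2015, 1), (2, 2015, 2);
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# The canonical warehouse query shape: aggregate facts, grouped and
# filtered by dimension attributes.
rows = conn.execute("""
    SELECT d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(rows)  # [(1, 150.0), (2, 75.0)]
```

The same join-the-fact-table-to-its-dimensions pattern is what BI tools like Tableau generate under the hood, which is why the modeling knowledge transfers even if you never hand-write the SQL.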

It's harder to point to resources for sysadmins - if you weren't a sysadmin before, you need to understand a lot of other concepts before you worry about Hadoop stuff. ML is similar - you need to understand the principles and be able to work on a single node first. There are lots of good resources out there about getting started in data science.

mariusz331 · 2012-08-27 · Original thread
My neighbor at work looks through this book all the time-

mindcrime · 2010-07-13 · Original thread
I'm still not the world's foremost expert, but what I do know I've learned through a combination of trial and error, reading books (I'll edit this later and put in a couple of specific titles), reading stuff on the 'Net, and classes I took in school (I did a degree in "High Performance Computing", which had some useful aspects to it).

A good place to start, if you're not already familiar with it, is High Scalability:

Edit: book recommendations:

Scalable Internet Architectures -

Linux Clustering - Building and Maintaining Linux Clusters -

High Performance Linux Clusters -

Linux Enterprise Cluster -

Java Message Service -

Java Message Service API Tutorial and Reference -

Enterprise JMS Programming -

Hadoop: The Definitive Guide -

Pro Hadoop -


It's important to understand the difference between vertical scaling and horizontal scaling. Horizontal is very en vogue these days, especially with commodity hardware. Why? Because you can add capacity incrementally without spending tons of money upfront, and without requiring a "forklift upgrade" (so called because you once needed a forklift to bring in a new mainframe or minicomputer). This is a pretty good article on the topic:

As popular as horizontal scaling is, don't ignore the possibilities of going to bigger hardware, though. It has its own advantages, especially when you start talking about the physical floor space needed to house servers.

Of course "cloud computing" changes some of this, both by making it cheap and easy to add VPS's to scale horizontally, or by making it possible (sometimes) to easily add more processing power, RAM, etc. to your "server." Read up on Xen, KVM, EC2, etc. for more on that whole deal.

voberoi · 2009-04-07 · Original thread
I suggest checking out the first video here: It's a little more than 20 minutes long, and the lecturer talks about the problems that MapReduce and HDFS (Hadoop Distributed File System) are designed to solve.

If you're super curious about MapReduce and are down to spend a few hours learning about and playing around with Hadoop, most certainly check out the rest of the videos (you can leave out the ones about Hive) and work through the first two exercises -- the virtual machine they provide makes it very easy to implement and run your first MapReduce job. Doing this will answer your question better than anyone explaining it to you can. If you'd like to go even further and learn about how Hadoop works under the hood, buy the rough cuts version of Hadoop: The Definitive Guide here:
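Before (or alongside) the videos, the map/shuffle/reduce flow is worth seeing in miniature. This is a pure-Python sketch of the classic word-count example; Hadoop runs the same three phases, but spread across machines and with the shuffle happening over the network:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in one line of input.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key. Hadoop does this
    # between the map and reduce phases, across the cluster.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: combine each key's grouped values into a final result.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # 2
```

Writing your first real job against the provided virtual machine is mostly a matter of expressing your mapper and reducer in this shape and letting Hadoop handle the distribution.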
