A lot falls under the "Big Data" umbrella. I've focused on Hadoop below because that's what I'm working with right now. I'm sure I'm missing some roles, but the specialities I can think of are:
- getting data out of production systems and transforming it (infrastructure or ETL)
- analytical querying and reporting
- system administration
- machine learning
There's also the wide world of NoSQL data stores, which people lump in with big data, but which require vastly different skills.
The Hadoop VM I linked to above is good for working through exercises for all of the above.
As a starting point, this book[1] walks through the motivation behind Hadoop, then gets a little into internals and use cases. It's out of date, but working through it will get you into the right frame of mind and help you understand HDFS, etc.
AMP Camp (which I linked to above) is an introduction to Spark for people with a little Hadoop experience. Spark is getting a lot of attention, and you could run into it in a number of roles.
If you're going to be planning the whole pipeline, or doing any sort of infrastructure role, I recommend Hadoop Application Architecture[2] for more modern tools and design patterns. This blog post[3] is a pretty good overview of distributed logs, which are essential for horizontal scale. Understanding Kafka and ZooKeeper is really useful for infrastructure roles, maybe less so for admins.
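To make the distributed log idea a bit more concrete, here's a toy sketch of the core abstraction: an append-only sequence where each reader keeps its own offset. This is an in-memory illustration only, not how Kafka is implemented; the Log and Consumer classes and the event strings are made up for the example, and a real system adds partitioning, replication and durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only, ordered sequence of records (hypothetical in-memory toy)."""
    records: list = field(default_factory=list)

    def append(self, record):
        # Records only ever go on the end; the returned value is the offset.
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        # Reading never mutates the log, so any number of consumers can read.
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each consumer tracks its own position, independent of the others."""
    log: Log
    offset: int = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["signup:alice", "signup:bob", "login:alice"]:
    log.append(event)

# Two independent readers of the same log (names are illustrative).
etl = Consumer(log)
reporting = Consumer(log)

first = etl.poll(max_records=2)   # etl reads the first two records
rest = etl.poll()                 # then picks up where it left off
everything = reporting.poll()     # reporting starts from offset 0 on its own
```

Because readers only hold an offset, adding another consumer doesn't touch the producer or the existing consumers, which is roughly why the pattern helps with horizontal scale.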
If you're planning to be in the reporting layer, having a deep understanding of SQL and data warehousing is useful. This book[4] is old hat, but I would say it's expected knowledge for anyone planning a warehouse, and it's interesting to understand best practices. Most places will also expect knowledge of Tableau or a similar BI tool, but that's tougher to learn on your own since licenses are brutal. Visualization with D3 is nice to have in this space, especially if you're coming from a web background - Scott Murray's tutorials[5] are a good starting place.
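As a rough idea of the kind of SQL that comes up in the reporting layer, here's a small sketch of a fact table joined to a dimension table and aggregated, which is one common warehouse layout (a star schema). The table names, columns and rows are invented for illustration, and it uses Python's built-in sqlite3 module only so it can run anywhere; a real warehouse would be a different engine at a very different scale.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one dimension table and one fact table.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, amount REAL);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES
    (1, '2015-01-01', 10.0),
    (1, '2015-01-02', 15.0),
    (2, '2015-01-01', 40.0);
""")

# Typical reporting query: join facts to a dimension, then aggregate.
rows = conn.execute("""
SELECT d.category, SUM(f.amount) AS total
FROM fact_sales AS f
JOIN dim_product AS d ON d.product_id = f.product_id
GROUP BY d.category
ORDER BY d.category
""").fetchall()

for category, total in rows:
    print(category, total)
```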
It's harder to point to resources for sysadmins - if you weren't a sysadmin before, you need to understand a lot of other concepts before you worry about Hadoop stuff. ML is similar - you need to understand the principles and be able to work on a single node. There are lots of good resources out there about getting started in data science.
1. http://shop.oreilly.com/product/0636920021773.do
2. http://shop.oreilly.com/product/0636920033196.do
3. http://engineering.linkedin.com/distributed-systems/log-what...
4. http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0471200247...
5. http://alignedleft.com/tutorials