The field of computer science is experiencing a transition from computation-intensive to data-intensive problems, wherein data is produced in massive amounts by large sensor networks, simulations, and social networks. Efficiently extracting, interpreting, and learning from very large datasets requires a new generation of data management technologies. This course gives an introduction to the Hadoop ecosystem as the de facto big data management system, with special attention to the Apache Spark data analysis framework. The fundamental concepts on which emerging big data management systems are based are discussed first. Once this foundation is established, technologies and algorithms used to work with big data sets are studied. Tentative topics include: distributed file systems, the MapReduce programming paradigm, Apache Spark basics, Spark SQL, Pig, Hive, Impala, and Sqoop. The course is programming intensive and includes several programming projects using the Hadoop ecosystem.
- Learns the definition, sources, and challenges of big data.
- Understands the similarities and differences between the emerging big data computing platforms and the traditional computing systems.
- Understands the key issues in big data management including distributed and parallel processing, data modeling, query languages, and transaction processing.
- Understands the principles of distributed file systems.
- Understands and practices developing applications using the MapReduce programming paradigm.
- Has a good knowledge of the Hadoop ecosystem as one of the most widely used platforms for working with big data.
- Is introduced to some of the Hadoop ecosystem's components, including Hive, Pig, Sqoop, and HBase.
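The MapReduce paradigm listed among the topics can be previewed with a minimal, framework-free word-count sketch in plain Python; the sample documents and function names here are illustrative only and do not come from the course materials or the Hadoop API:

```python
from collections import defaultdict

# Hypothetical documents standing in for an input split stored in HDFS.
documents = [
    "big data needs big systems",
    "spark builds on the map reduce idea",
]

def map_phase(doc):
    """Map: emit (word, 1) pairs, as a Hadoop mapper would."""
    for word in doc.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # "big" occurs twice in the first document
```

In a real Hadoop or Spark job, the map and reduce functions run in parallel across cluster nodes, and the shuffle moves data over the network; the logical structure, however, is exactly this three-stage pipeline.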