ICS 613 Introduction to Big Data Computing Systems

The field of computer science is experiencing a transition from computation-intensive to data-intensive problems, in which data is produced in massive amounts by large sensor networks, simulations, and social networks. Efficiently extracting, interpreting, and learning from very large datasets requires a new generation of data management technologies. This course introduces the Hadoop ecosystem as the de facto big data management system, with special attention to the Apache Spark data analysis framework. The fundamental concepts on which emerging big data management systems are based are discussed first. Once that foundation is in place, the technologies and algorithms used to work with big data sets are studied. Tentative topics include: distributed file systems, the MapReduce programming paradigm, Apache Spark basics, Spark SQL, Pig, Hive, Impala, and Sqoop. The course is programming intensive and includes several programming projects using the Hadoop ecosystem.

Prerequisites

ICS 311: Database Management Systems and ICS 141: Problem Solving with Programming

Special information

First day attendance is mandatory.
Prerequisites: Graduate standing. Note: Students are responsible for knowing and abiding by the prerequisites for the ICS courses in which they enroll, and will be administratively dropped from a course if they have not met its prerequisites.

4 Graduate credits

Effective January 10, 2016 to present

Learning outcomes

General

Learns the definition, sources, and challenges of big data.
Understands the similarities and differences between the emerging big data computing platforms and the traditional computing systems.
Understands the key issues in big data management including distributed and parallel processing, data modeling, query languages, and transaction processing.
Understands the principles of distributed file systems.
Understands and practices developing applications using the MapReduce programming paradigm.
Has a good knowledge of the Hadoop ecosystem as one of the most commonly used platforms for working with big data.
Is introduced to some of the Hadoop ecosystem's components, including Hive, Pig, Sqoop, and HBase.