ICS 411

Big Data Storage and Processing

4 Undergraduate credits
Effective May 3, 2017 – Present

The field of computer science is experiencing a transition from processing-intensive to data-intensive problems, in which data is produced in massive amounts by large sensor networks, simulations, and social networks. Efficiently extracting, interpreting, and learning from these very large data sets imposes storage and processing requirements that differ from those of traditional business applications, which mostly depend on relational database management systems. These emerging data-intensive applications involve heavy read/write workloads and do not need some of the stringent schema and ACID properties that are central to relational databases. To cope with these requirements, a new genre of large-scale systems, called NoSQL databases, has been introduced. The main characteristics of NoSQL databases are that they are typically open source, are not schema-oriented, have weak consistency properties, and are heavily distributed over large clusters of commodity hardware. In this course, we will cover the basic concepts and approaches used by such big-data systems. Students will gain hands-on experience by solving relevant problems through projects that use publicly available systems. Topics covered include the fundamentals of big data storage and processing using Hadoop, distributed file systems, and map-reduce, and the fundamentals of the four categories of NoSQL systems, namely key-value stores, document stores, column stores, and graph stores. Students will implement applications using the following systems: Apache HBase, Amazon DynamoDB, Apache Cassandra, MongoDB, and Neo4j.

Learning outcomes

General

  • Identify and justify the storage and processing requirements of data-intensive applications.
  • Explain the similarities and differences between the requirements of big-data applications and the ACID requirements of traditional database applications.
  • Analyze and solve data-intensive problems using Hadoop and the distributed file system.
  • Design and develop algorithms using the map-reduce programming paradigm.
  • Classify and describe NoSQL systems.
  • Assess the suitability of a particular type of NoSQL database for an application.
  • Experiment with, contrast, and evaluate the following NoSQL systems: MongoDB, DynamoDB, Cassandra, HBase, and Neo4j.
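
To give a sense of the map-reduce programming paradigm named in the outcomes above, the following is a minimal single-process sketch in Python of the classic word-count example. It is an illustration only, not Hadoop code and not course material: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real framework would run the map and reduce steps in parallel across a cluster and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key.

    In a real framework this grouping is done by the runtime, not by user code.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data storage", "big data processing"]))
```

The point of the split is that map_phase looks at one record at a time and reduce_phase looks at one key at a time, which is what allows a distributed system to spread both steps over many machines.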