Big data processing
We have covered a lot of practical/engineering topics with this
Most of the work we have done was about programming big data
systems, but we also spent a lot of time understanding how those systems are
To successfully finish this course, you must be able to answer the
questions in the following sections without
1. Big data
- Why is big data important?
- What do the 3Vs of big data mean?
- What is the ETL cycle?
- What is the difference between stream and batch processing?
2. Functional programming
- What is the essence of FP?
- What does \(f(x: A, y: [B]) \rightarrow
- Why is laziness a virtue in BDP?
- What is a monad and what is it used for?
- How can we exploit immutability?
3. Data processing with FP
- What is the difference between element-wise and aggregation
- What is the function signature for
- What is the difference between
- How can we implement
zip etc with
- How can we implement a join between KV pairs?
- (How) Can we re-write an SQL query with FP primitives?
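
As a starting point for the last two questions, here is a minimal Python sketch, not a reference answer. The helper `join`, the sample `employees` rows and the SQL query in the comment are all hypothetical and chosen only for illustration. The join builds a hash index on one side and probes it with the other; the query is rewritten as a filter, a map and a fold.

```python
from collections import defaultdict
from functools import reduce


def join(left, right):
    """Inner join of two lists of (key, value) pairs.

    Builds a hash index over `left`, then probes it with every pair in `right`.
    Returns (key, (left_value, right_value)) for every matching combination.
    """
    index = defaultdict(list)
    for k, v in left:
        index[k].append(v)
    return [(k, (lv, rv)) for k, rv in right for lv in index.get(k, [])]


# Hypothetical rows: (name, dept, salary)
employees = [("ann", "eng", 120), ("bob", "eng", 90), ("cid", "ops", 110)]

# SELECT dept, SUM(salary) FROM employees WHERE salary > 100 GROUP BY dept
filtered = filter(lambda e: e[2] > 100, employees)   # WHERE
pairs = map(lambda e: (e[1], e[2]), filtered)        # SELECT dept, salary


def add_to_group(acc, kv):
    """Fold step: add one (dept, salary) pair to the running per-dept totals."""
    k, v = kv
    return {**acc, k: acc.get(k, 0) + v}


totals = reduce(add_to_group, pairs, {})             # GROUP BY + SUM
```

The fold returns a new dictionary at every step instead of mutating the accumulator, which keeps the example in line with the immutability theme of section 2 at the cost of extra copying.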
4. Unix
- What is a pipe(-line)?
- What map-like operations does Unix support?
- What reduce-like operations does Unix support?
- How can we:
- Find all files that contain a pattern?
- Process data as they come?
- Compare file contents?
- Run commands in parallel?
5. Distributed systems
- What is the key difference between distributed and parallel
- What does Amdahl’s law tell us?
- What are the key problems with distributed systems?
- How do we deal with time being unreliable?
- How do we make decisions in distributed settings?
- How many nodes do we need?
- What is the CAP theorem?
- What are the different consistency models?
- What is causal consistency?
- What is sequential consistency and what is linearisability?
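
For the question on Amdahl's law, a small Python sketch may help build intuition. It uses the usual form of the law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelised and n is the number of processors. The function name `amdahl_speedup` is an assumption made for this example.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical speedup when a fraction `p` of the work runs on `n` processors.

    The serial fraction (1 - p) is unaffected by adding processors,
    so the speedup can never exceed 1 / (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# With 95% parallelisable work, the speedup approaches but never reaches 20.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```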
6. Distributed databases and filesystems
- Why do we need to replicate data?
- What are the most common replication architectures?
- Why do we need to partition datasets?
- What are the most common transaction isolation levels?
- How does HDFS store a file?
7. Spark
- What are Spark RDDs? Why was Spark so revolutionary?
- What is the difference between RDDs and Pair RDDs? Why do we need
- What are the key Spark API calls?
- What are wide and narrow dependencies?
- How does Spark deal with faults?
- What types of partitioning can we employ for distributed systems like
- How does Catalyst optimize queries?
8. Stream processing
- When is a problem a data streaming problem?
- Why do we need streaming windows?
- What types of windows do we get with stream processing?
- What is the difference between event, processing and ingestion
- What is the difference between microbatching and stream
- What is the problem with state in streaming systems?
- How can we disseminate events from producers to consumers?
- How do we take consistent snapshots?
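
To make the window questions concrete, the following Python sketch shows how events might be assigned to tumbling and sliding windows by event time. It assumes integer, non-negative timestamps and windows aligned at multiples of the window size or slide. The function names are hypothetical and do not come from any streaming framework.

```python
from collections import defaultdict


def tumbling_window_counts(events, size):
    """Count events per tumbling window.

    `events` is an iterable of (event_time, value) pairs.
    Each event falls into exactly one window [start, start + size).
    Returns {window_start: count}.
    """
    counts = defaultdict(int)
    for t, _ in events:
        start = (t // size) * size
        counts[start] += 1
    return dict(counts)


def sliding_window_starts(t, size, slide):
    """Start times of all sliding windows [s, s + size) that contain time `t`.

    Windows start at multiples of `slide`; when slide < size they overlap,
    so one event can belong to several windows.
    """
    starts = []
    s = (t // slide) * slide          # latest window starting at or before t
    while s > t - size:               # window still covers t
        starts.append(s)
        s -= slide
    return starts
```

With `slide == size` the sliding assignment degenerates to the tumbling one, which is a quick sanity check on the definitions.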
9. Graph processing
- What is the best way of representing graphs in memory and in a
- How can we traverse a graph stored in an SQL database?
- What is the bulk synchronous parallel model?
- How does Pregel implement the BSP model?
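
As an aid for the last two questions, here is a single-machine Python sketch of the bulk synchronous parallel idea in a vertex-centric style: every vertex learns the maximum value in its connected component. It is a toy illustration and not Pregel's actual API. The function `pregel_max` and the graph encoding (a dict from each vertex to its neighbour list, with every neighbour also present as a key) are assumptions made for the example.

```python
def pregel_max(graph, values):
    """Propagate the maximum value through the graph in supersteps.

    graph:  {vertex: [neighbours]}; every neighbour must also be a key
    values: {vertex: number}
    Returns (final_values, number_of_supersteps).
    """
    values = dict(values)  # work on a copy, leave the input untouched

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, nbrs in graph.items():
        for n in nbrs:
            inbox[n].append(values[v])
    supersteps = 1

    # Keep running supersteps while any message is in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if not msgs:
                continue  # no messages: the vertex stays halted
            best = max(msgs)
            if best > values[v]:
                # The value changed, so notify the neighbours.
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        # Barrier: messages sent in this superstep are only visible in the next.
        inbox = outbox
        supersteps += 1

    return values, supersteps
```

The swap of `inbox` and `outbox` at the end of each iteration plays the role of the synchronisation barrier; in a real distributed system this is where workers would wait for each other.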