Algorithm for Big Data
Sketching and Streaming
Problem: Approximate Counting by Robert Morris
If I want you to count N events passing by, the minimum space you need to keep track of the N count is log N bits. But if it is approximation and you want to use minimum space possible, you may just need to remember which power of 2 it is. So, the size to store this number that drops to log(log N)) bits.
Morris:
- initialize x = 0
- for each event, increment X with probability 1/(2^x)
- output 2^x-1
Dimensionality Reduction
Large Scale Machine Learning Problems
For example, large scale regression problems. If an variable x is a function f of another variable y and y is measured/ observed with noise, we want to recover f' that is close to f.
- PCA
- matrix completion (Netflix problem)
Compressed Sensing
- compress/ cheaply acquire high dimensional structured signal.(using linear measurements)
- JPEG (image compression)
- MRI (cheap acquisition)
External Memory Model
- measure disk I/O instead of # instructions (random seeks are expensive).
- infinite disk divided into blocks
- B-Tree
Other models
- eg. Map Reduce