Algorithm for Big Data

Sketching and Streaming

Problem: Approximate Counting by Robert Morris

If I want you to count N events passing by, the minimum space you need to keep track of the N count is log N bits. But if it is approximation and you want to use minimum space possible, you may just need to remember which power of 2 it is. So, the size to store this number that drops to log(log N)) bits.

Morris:

  1. initialize x = 0
  2. for each event, increment X with probability 1/(2^x)
  3. output 2^x-1

Dimensionality Reduction

Large Scale Machine Learning Problems

For example, large scale regression problems. If an variable x is a function f of another variable y and y is measured/ observed with noise, we want to recover f' that is close to f.

  • PCA
  • matrix completion (Netflix problem)

Compressed Sensing

  • compress/ cheaply acquire high dimensional structured signal.(using linear measurements)
  • JPEG (image compression)
  • MRI (cheap acquisition)

External Memory Model

  • measure disk I/O instead of # instructions (random seeks are expensive).
  • infinite disk divided into blocks
  • B-Tree

Other models

  • eg. Map Reduce