Shangyu Luo, Spark -- A Fast and General Engine for Large-scale Data Processing

Slides

Spark is an open-source cluster computing system. It can run some programs up to 100x faster than Hadoop MapReduce, particularly when the working data fits in memory. A growing number of companies and research groups use Spark to develop their own applications. In my talk, I will give a brief introduction to Spark, covering what Spark is, why I use it to conduct my experiments, and one of its core design ideas, the Resilient Distributed Dataset (RDD). Next, I will talk about several experiments I ran with Spark, describing the models, machines, and datasets I used. The final part of my talk discusses the results of these experiments and presents my conclusions.