Showing posts from July, 2013

Spark Tutorial and Cheatsheet

Main resources:
Scala Cheat Sheet
Reactive Cheat Sheet
Spark Cheat sheet
Spark Quick start
Spark programming guide
Spark Streaming: processing real-time data streams
Spark SQL and DataFrames: support for structured data and relational queries
MLlib: built-in machine learning library
GraphX: Spark’s new API for graph processing

Scala programming examples:

Define an object with a main function (HelloWorld):
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

Execute the main function:
scala> HelloWorld.main(null)
Hello, world!

Creating RDDs

Parallelized Collections:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

External Datasets:
val distFile = sc.textFile("data.txt")

textFile returns an RDD with one element per line of the file; calling collect() on it returns those lines as an array:
scala> distFile.collect()
res16: Array[String] = Array(1,2,3, 4,5,6)

SparkContext.wholeTextFiles reads a directory and returns (filename, content) pairs, one per file:
val distFile = sc.wholeTextFiles("/tmp/tmpdir")
scala> distFile.collect…
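
The RDD snippets above assume a spark-shell session, where sc is already defined. As a rough sketch of the same three calls in a standalone program, the example below creates its own local SparkContext. It assumes spark-core is on the classpath; the object name RddCreationSketch, the temporary directory, and the file contents are hypothetical and only there so the example can run by itself.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone wrapper around the spark-shell snippets.
object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, so no cluster is needed.
    val conf = new SparkConf().setAppName("RddCreationSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Parallelized collection: distribute a local array as an RDD.
      val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
      println(distData.reduce(_ + _)) // 1 + 2 + 3 + 4 + 5 = 15

      // External dataset: write a small made-up file, then read it back.
      val dir = Files.createTempDirectory("rdd-sketch")
      val file = dir.resolve("data.txt")
      Files.write(file, "1,2,3\n4,5,6\n".getBytes("UTF-8"))
      val distFile = sc.textFile(file.toString)
      distFile.collect().foreach(println) // one RDD element per line of the file

      // wholeTextFiles yields one (path, content) pair per file in the directory.
      sc.wholeTextFiles(dir.toString).collect().foreach { case (path, content) =>
        println(s"$path has ${content.length} characters")
      }
    } finally {
      sc.stop() // release the local Spark resources
    }
  }
}
```

Note that collect() copies the whole RDD to the driver, which is fine for small files like this one but can exhaust driver memory on large datasets.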