Showing posts from July, 2013

Spark Tutorial and Cheatsheet

Main resources:

Scala Cheat Sheet
Reactive Cheat Sheet
Spark Cheat sheet
Spark Quick start
Spark programming guide
Spark Streaming: processing real-time data streams
Spark SQL and DataFrames: support for structured data and relational queries
MLlib: built-in machine learning library
GraphX: Spark's new API for graph processing

Scala programming examples:

Define an object with a main function (Hello World):

    object HelloWorld {
      def main(args: Array[String]): Unit = {
        println("Hello, world!")
      }
    }

Execute the main function from the Scala REPL:

    scala> HelloWorld.main(null)
    Hello, world!

Creating RDDs

Parallelized Collections:

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)

External Datasets:

    val distFile = sc.textFile("data.txt")

textFile returns an RDD with one element per line of the file. Calling collect() on it brings those lines back to the driver as an array:

    scala> distFile.collect()
    res16: Array[String] = Array(1,2,3, 4,5,6)

SparkContext.wholeTextFiles reads a directory of files and returns (filename, content) pairs, one per file:

    val distFile = sc.wholeTextFiles("/tmp/t
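The snippets above assume a spark-shell session, where `sc` already exists. As a rough sketch of the same three ways to create an RDD outside the shell, the program below builds its own SparkContext. The `local[*]` master, the application name, the object name `RddCreationSketch`, and the temporary directory with its two-line `data.txt` are all assumptions made for illustration and do not come from the original post. It needs the Spark core library on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; spark-shell would provide `sc` for you.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // 1. Parallelized collection: distribute an in-memory array.
      val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
      println(s"sum = ${distData.reduce(_ + _)}")

      // Hypothetical input: a temp directory holding a two-line data.txt.
      val dir = Files.createTempDirectory("rdd-sketch")
      val file = dir.resolve("data.txt")
      Files.write(file, "1,2,3\n4,5,6\n".getBytes("UTF-8"))

      // 2. External dataset: one RDD element per line of the file.
      val distFile = sc.textFile(file.toString)
      distFile.collect().foreach(println)

      // 3. wholeTextFiles: one (path, content) pair per file in the directory.
      val wholeFiles = sc.wholeTextFiles(dir.toString)
      wholeFiles.collect().foreach { case (path, content) =>
        println(s"$path has ${content.length} characters")
      }
    } finally {
      // Always release the context, even if a step above fails.
      sc.stop()
    }
  }
}
```

The difference between the two file readers is the element granularity: `textFile` yields lines, so it suits large line-oriented data, while `wholeTextFiles` keeps each file's content together and is usually a better fit for many small files.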