Monday, August 15, 2016

Install iPython (Jupyter) Notebook on Amazon EMR


  1. Use the bootstrap script at this link to install iPython Notebook: https://github.com/awslabs/emr-bootstrap-actions/tree/master/ipython-notebook
  2. The iPython server will be running at this point, but it is not yet integrated with Spark. Follow the instructions in this blog post to set that up: https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
  3. Create the initial SparkContext and SQLContext as follows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# 'local' runs Spark in a single local process; 'pyspark' is the application name
sc = SparkContext('local', 'pyspark')
sqlContext = SQLContext(sc)
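
If "from pyspark import SparkContext" fails in the notebook with an ImportError, the Spark Python libraries are probably not on the Python path. Here is a minimal sketch of one way to add them. It assumes the usual Spark layout (PySpark sources in python/ and a bundled py4j zip in python/lib/) and falls back to /usr/lib/spark when SPARK_HOME is not set, which is my guess for EMR and may differ on your cluster. The helper names spark_python_paths and add_spark_to_path are made up for this example.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the paths that must be on sys.path to import pyspark.

    Hypothetical helper: assumes the PySpark sources live in
    <spark_home>/python and a py4j zip is bundled in python/lib.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_path(spark_home=None):
    """Prepend the Spark Python paths to sys.path, skipping ones already there."""
    # /usr/lib/spark is an assumed default; SPARK_HOME overrides it.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/usr/lib/spark")
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

Calling add_spark_to_path() in the first notebook cell, before the pyspark imports, should be enough if the layout assumption holds.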

Friday, August 12, 2016

MySQL Driver Error in Apache Spark

I was following the Spark example that loads data from a MySQL database. See "http://spark.apache.org/examples.html"

Running it failed with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 233, ip-172-22-11-249.ap-southeast-1.compute.internal): java.lang.IllegalStateException: Did not find registered driver with class com.mysql.jdbc.Driver

To force Spark to load "com.mysql.jdbc.Driver", add the "driver" option, as in the last .option call below:
val df = sqlContext
  .read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people") 
  .option("driver", "com.mysql.jdbc.Driver") // name the JDBC driver class explicitly
  .load()
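
The same fix can be applied from PySpark. The sketch below is only an illustration: the helper names mysql_jdbc_options and load_mysql_table are made up, and the URL in the usage note is a placeholder. It relies on the JDBC reader's options(**kwargs) call to pass the driver class along with the URL and table name.

```python
def mysql_jdbc_options(url, table, driver="com.mysql.jdbc.Driver"):
    """Build the option dict for Spark's JDBC reader, always naming the driver."""
    return {"url": url, "dbtable": table, "driver": driver}


def load_mysql_table(sql_context, url, table):
    """Load a MySQL table as a DataFrame through an existing SQLContext."""
    return (sql_context.read
            .format("jdbc")
            .options(**mysql_jdbc_options(url, table))
            .load())
```

With the sqlContext from a notebook session, this would be called as load_mysql_table(sqlContext, "jdbc:mysql://<host>:3306/<database>", "people"), substituting your own host and database.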

Wednesday, August 10, 2016

Install New Interpreter in Zeppelin 0.6.x

In Zeppelin 0.6.x, you can install additional interpreters as follows:


  • List all available interpreters:
    1. /usr/lib/zeppelin/bin/install-interpreter.sh --list
  • Install specific interpreters by name:
    1. /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc,hbase,postgresql

Friday, August 5, 2016

IAM Errors when Creating Amazon EMR

Whenever I launched an Amazon EMR cluster, I got errors about missing permissions in the EMR_EC2_DefaultRole. After some searching on the support forum, I learned that the default EMR roles may not be created automatically for you. Hence, I removed the old default role and created a new one as follows:
  1. Create the default roles:
    • aws emr create-default-roles
  2. Create the instance profile:
    • aws iam create-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  3. Verify that the instance profile exists but doesn't have any roles yet:
    • aws iam get-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  4. Add the role to the instance profile:
    • aws iam add-role-to-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
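
If you need to repeat these steps, they can be wrapped in a small script. This is just a sketch: the function names emr_role_commands and run_commands are made up, and it assumes the aws CLI is installed and configured. By default it only prints the commands (a dry run), so you can check them before executing anything.

```python
import subprocess

PROFILE = "EMR_EC2_DefaultRole"  # the instance profile and the role share this name


def emr_role_commands(name=PROFILE):
    """Return the four AWS CLI invocations from the steps above, in order."""
    return [
        ["aws", "emr", "create-default-roles"],
        ["aws", "iam", "create-instance-profile", "--instance-profile-name", name],
        ["aws", "iam", "get-instance-profile", "--instance-profile-name", name],
        ["aws", "iam", "add-role-to-instance-profile",
         "--instance-profile-name", name, "--role-name", name],
    ]


def run_commands(commands, dry_run=True):
    """Print each command, and also execute it unless dry_run is True."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises at the first failing step instead of continuing
            subprocess.run(cmd, check=True)
```

Calling run_commands(emr_role_commands()) prints the four commands; passing dry_run=False actually runs them.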