Spark access S3 files locally

In this post, I share how we can access S3 files locally using Spark.

My settings are as follows:

  1. spark-2.4.7
  2. macOS 10.14.6

After a lot of trial and error, I found that the newer Spark versions do not come with the AWS jars by default. We need to copy the following jars from the Hadoop share/hadoop/tools/lib/ folder into the SPARK_HOME/jars folder (see the example copy commands right after the list):
  1. hadoop-aws-2.7.x.jar
  2. aws-java-sdk-1.7.4.jar
  3. joda-time-2.9.3.jar
  4. jackson-*-2.6.5.jar (the matching Jackson jars)
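
For example, the copy can look like the following. This is only a sketch: it assumes a Hadoop 2.7.7 download unpacked at $HADOOP_HOME, so adjust the paths and version numbers to whatever your Hadoop folder actually contains.

# copy the AWS-related jars from the Hadoop tools folder into Spark's jars folder
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-2.7.7.jar   $SPARK_HOME/jars/
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar $SPARK_HOME/jars/
cp $HADOOP_HOME/share/hadoop/tools/lib/joda-time-2.9.3.jar    $SPARK_HOME/jars/
cp $HADOOP_HOME/share/hadoop/tools/lib/jackson-*-2.6.5.jar    $SPARK_HOME/jars/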

Also, set your AWS access key and secret key in SPARK_HOME/conf/spark-env.sh:

AWS_ACCESS_KEY_ID="YOUR KEY ID"
AWS_SECRET_ACCESS_KEY="YOUR SECRET ACCESS KEY"
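
If you prefer not to touch spark-env.sh, an alternative (not part of the original setup, so treat it as an optional variation) is to set the s3a keys on the Hadoop configuration in your script, once the SparkSession from the test script below exists:

# alternative: set the s3a credentials in code instead of spark-env.sh
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR KEY ID")
hadoop_conf.set("fs.s3a.secret.key", "YOUR SECRET ACCESS KEY")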
 
Make sure $SPARK_HOME/bin is on your $PATH. You can put these settings in ~/.bash_profile:

export SPARK_HOME="path to your spark"
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
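
With the two PYSPARK_DRIVER_PYTHON variables above, launching PySpark opens a Jupyter notebook instead of the plain shell (assuming jupyter is installed):

# reload the profile and start PySpark through Jupyter
source ~/.bash_profile
pyspark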

Here is a simple script to check whether your Spark setup can read files from S3:

# Spark-related imports
import findspark
findspark.init()
from pyspark import SparkConf
from pyspark.sql import SparkSession

# initialize a local Spark session with 4 worker threads
conf = SparkConf().setMaster("local[4]").setAppName("Spark-Test")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# point to your file on S3 via the s3a connector
my_s3_file = "s3a://my_folder/my_file.json"

# a successful count() confirms that Spark can read from S3
my_data = spark.sparkContext.textFile(my_s3_file)
my_data.count()
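
If the count works, you can also load the same file as a DataFrame. A minimal sketch (assuming the JSON file has one record per line, which is what spark.read.json expects by default):

# read the JSON file into a DataFrame and inspect it
my_df = spark.read.json(my_s3_file)
my_df.printSchema()
my_df.show(5)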

NOTE: Check that your Hadoop and Spark versions match to avoid incompatibilities; in particular, the hadoop-aws jar version should match the Hadoop version your Spark build ships with.
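
A quick way to check this on the pre-built package (a sketch, assuming hadoop is also on your PATH):

# see which Hadoop client jars your Spark build ships with
ls $SPARK_HOME/jars/hadoop-common-*.jar
# and the version of your local Hadoop installation
hadoop version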

 
