Accessing S3 files locally with Spark
In this post, I share how we can access S3 files locally using Spark.
My settings are as follows:
- spark-2.4.7
- macOS 10.14.6
After a lot of trial and error, I found that recent Spark versions do not ship with the AWS jars by default. We need to copy the following jars from the Hadoop share/hadoop/tools/lib/ folder into the SPARK_HOME/jars folder (a small copy script is sketched after the list):
- hadoop-aws-2.7.x.jar
- aws-java-sdk-1.7.4.jar
- joda-time-2.9.3.jar
- jackson-*-2.6.5.jar
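If you prefer to script the copy step, here is a minimal Python sketch. The Hadoop path and the jar version numbers are assumptions; adjust them to match your own installation.

import glob
import os
import shutil

# Hypothetical locations -- point these at your own Hadoop and Spark installs.
hadoop_tools_lib = "/path/to/hadoop/share/hadoop/tools/lib"
spark_jars = os.path.join(os.environ["SPARK_HOME"], "jars")

# Jar patterns from the list above; the exact versions may differ on your machine.
patterns = ["hadoop-aws-2.7.*.jar", "aws-java-sdk-1.7.4.jar",
            "joda-time-2.9.3.jar", "jackson-*-2.6.5.jar"]

for pattern in patterns:
    for jar in glob.glob(os.path.join(hadoop_tools_lib, pattern)):
        shutil.copy(jar, spark_jars)  # drop each matching jar into SPARK_HOME/jars
        print("copied", os.path.basename(jar))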
Also, set your AWS access key and secret key in SPARK_HOME/conf/spark-env.sh:
AWS_ACCESS_KEY_ID="YOUR KEY ID"
AWS_SECRET_ACCESS_KEY="YOUR SECRET ACCESS KEY"
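If you would rather not touch spark-env.sh, another common option (not what I used here, just a sketch) is to hand the s3a credentials to the Hadoop configuration when the session is built; the key names fs.s3a.access.key and fs.s3a.secret.key come from the hadoop-aws module:

from pyspark.sql import SparkSession

# Pass the s3a credentials as Spark config instead of environment variables.
spark = (SparkSession.builder
         .master("local[4]")
         .appName("Spark-S3-Test")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR KEY ID")
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR SECRET ACCESS KEY")
         .getOrCreate())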
Make sure to add $SPARK_HOME to your $PATH. You can find the path settings in ~/.bash_profile:
export SPARK_HOME="path to your spark"
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Here is a simple script to check that your Spark setup is working:
# Spark-related imports
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import *
from pyspark.sql.functions import *

# Init Spark
conf = SparkConf().setMaster("local[4]").setAppName("Spark-Test")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Get your file from S3
my_s3_file = "s3a://my_folder/my_file.json"
my_data = spark.sparkContext.textFile(my_s3_file)
my_data.count()
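Since the example file is JSON, you can also load it straight into a DataFrame instead of a text RDD. This is just a sketch that reuses the spark session and the my_s3_file path from the script above:

# Read the same S3 object as a DataFrame rather than a raw text RDD.
my_df = spark.read.json(my_s3_file)
my_df.printSchema()
my_df.show(5)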
NOTE: Check that your Hadoop and Spark versions match, to make sure there is no incompatibility.
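A quick way to see which versions your session is actually running (the Hadoop call goes through PySpark's private _jvm gateway, so treat it as a convenience sketch rather than a stable API):

# Print the Spark and Hadoop versions the running session uses.
print("Spark version:", spark.version)
print("Hadoop version:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())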