Wednesday, September 13, 2017

Getting the files passed with spark-submit --files


If you add your external files using "spark-submit --files", your files will be uploaded to this HDFS folder: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508

application_1449220589084_0508 is an example of a YARN application ID.

1. Find the Spark staging directory with the code below (but you need to know the HDFS URI and your username):

System.getenv("SPARK_YARN_STAGING_DIR"); --> .sparkStaging/application_1449220589084_0508

2. Find the complete comma-separated file paths using:

System.getenv("SPARK_YARN_CACHE_FILES"); --> hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt
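
A minimal sketch of reading these two values from inside the driver (this assumes a Spark-on-YARN deployment where the environment variables are set as shown above; newer Spark versions may no longer expose SPARK_YARN_CACHE_FILES):

public class StagingInfo {
    public static void main(String[] args) {
        // Relative staging path, e.g. ".sparkStaging/application_1449220589084_0508"
        String stagingDir = System.getenv("SPARK_YARN_STAGING_DIR");
        System.out.println("staging dir: " + stagingDir);

        // Each entry looks like "hdfs://path/to/file#linkName"; the part after '#'
        // is the name the file is exposed under inside the YARN container.
        String cacheFiles = System.getenv("SPARK_YARN_CACHE_FILES");
        if (cacheFiles != null) {
            for (String entry : cacheFiles.split(",")) {
                String[] parts = entry.split("#", 2);
                String hdfsPath = parts[0];
                String linkName = parts.length > 1 ? parts[1] : hdfsPath;
                System.out.println(linkName + " -> " + hdfsPath);
            }
        }
    }
}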


My summary (using --files README.md as an example):
Method 1: As described above, --files uploads the file to the .sparkStaging/<applicationId> directory on HDFS. Use the approach above to get that HDFS directory first, and then read the file from HDFS.
spark.read().textFile(System.getenv("SPARK_YARN_STAGING_DIR") + "/README.md") solves it. If the path passed to textFile has no hdfs://, file:// or other scheme prefix, it is by default resolved relative to hdfs://yourcluster/user/your_username; I am not sure whether that is simply how my cluster is configured.
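
A minimal sketch of method 1 in Java (assuming the job was submitted on YARN with --files README.md, and that the relative staging path resolves under the HDFS home directory as it does on my cluster; the class and app names are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class ReadStagedFile {
    public static void main(String[] args) {
        // Assumes the job was submitted on YARN with: --files README.md
        SparkSession spark = SparkSession.builder().appName("ReadStagedFile").getOrCreate();

        // SPARK_YARN_STAGING_DIR is relative, e.g. ".sparkStaging/application_...";
        // on my cluster it resolves under hdfs://yourcluster/user/<username>.
        String stagingDir = System.getenv("SPARK_YARN_STAGING_DIR");
        Dataset<String> readme = spark.read().textFile(stagingDir + "/README.md");
        readme.show(5);

        spark.stop();
    }
}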

Method 2:
SparkFiles.get(filePath) returned /hadoop/yarn/local/usercache/research/appcache/application_1504461219213_9796/spark-c39002ee-01a4-435f-8682-2ba5950de230/userFiles-e82a7f84-51b1-441a-a5e3-78bf3f4a8828/README.md, but for some reason the file could not be found at that path, whether I treated it as local or as HDFS. On closer inspection, the local directory /hadoop/yarn/local/usercache/research/... does contain README.md; however, the worker and the driver have different local paths for README.md.
Reason:
https://stackoverflow.com/questions/35865320/apache-spark-filenotfoundexception
https://stackoverflow.com/questions/41677897/how-to-get-path-to-the-uploaded-file
The path returned by SparkFiles.get() is a local directory on the driver node, so sc.textFile cannot access that file from the worker nodes. It cannot be used this way.
"""I think that the main issue is that you are trying to read the file via the textFile method. What is inside the brackets of the textFile method is executed in the driver program. In the worker node only the code tobe run against an RDD is performed. When you type textFile what happens is that in your driver program it is created a RDD object with a trivial associated DAG.But nothing happens in the worker node."""


For the difference between --files and addFile, see this question: https://stackoverflow.com/questions/38879478/sparkcontext-addfile-vs-spark-submit-files

In cluster mode, addFile cannot find a local file, because the file exists only on the submitting machine; it has to be uploaded with --files instead.


Conclusion: do not use textFile to read files distributed via --files or addFile.
