CHEN Xiaoyu's blog: September 2017

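Saturday, September 30, 2017

bazel build ... does not build the experimental folder

Running bazel build ... did not build the targets under experimental. My guess is that bazel automatically filters out the experimental folder when it searches recursively.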

cd experimental
bazel build ...
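Building from inside the experimental directory works.

Monday, September 25, 2017

The 200e unicode character in vim

Reference: https://unix.stackexchange.com/questions/59447/replace-unicode-chars-in-vim

In insert mode, press ctrl-v and then type u200e to enter the <200e> character.

:help i_CTRL-V_digit shows the relevant help.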
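
Outside vim, the same character can be stripped from the shell. This is a minimal sketch, assuming the input is UTF-8 (where U+200E is the byte sequence E2 80 8E); the function name strip_lrm is made up for illustration.

```shell
# U+200E (LEFT-TO-RIGHT MARK) as UTF-8 bytes, written as octal escapes
# so that it works in any POSIX shell.
LRM=$(printf '\342\200\216')

# strip_lrm: hypothetical helper; copies stdin to stdout without U+200E.
strip_lrm() {
  sed "s/${LRM}//g"
}
```

Usage: strip_lrm < input.txt > cleaned.txt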

Thursday, September 21, 2017

module.js:341
throw err;
^

Error: Cannot find module 'npmlog'
at Function.Module._resolveFilename (module.js:339:15)
at Function.Module._load (module.js:290:25)
at Module.require (module.js:367:17)
at require (internal/module.js:16:19)
at /usr/local/lib/node_modules/npm/bin/npm-cli.js:20:13
at Object.<anonymous> (/usr/local/lib/node_modules/npm/bin/npm-cli.js:76:3)
at Module._compile (module.js:413:34)
at Object.Module._extensions..js (module.js:422:10)
at Module.load (module.js:357:32)
at Function.Module._load (module.js:314:12)

This happened because an earlier uninstall of node did not remove everything. Fix:
brew uninstall --force node
rm -rf /usr/local/lib/node_modules
rm /usr/local/bin/npm
brew install node

https://stackoverflow.com/a/39504056/5685754

pip install pyfst fails

Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-etZdYk/pyfst/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-9_TCLO-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-etZdYk/pyfst/

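Fix:
Some people suggest export CFLAGS="-std=c++11"; I tried it and it did not help.

In the end it turned out to be a version problem: I had installed openfst 1.6.3, while pyfst only supports up to 1.3.3 and is incompatible with later versions.

Appendix, installing openfst: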
./configure
make
sudo make install

Runtime error after installing openfst

fstinfo: error while loading shared libraries: libfstscript.so.8: cannot open shared object file: No such file or directory

Fix:
Add the following to ~/.bashrc:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib
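
When LD_LIBRARY_PATH starts out empty, the line above produces a leading colon, and the dynamic loader treats an empty entry as the current directory. A minimal sketch that avoids this and also skips duplicates; add_lib_dir is a made-up helper name, not part of any tool.

```shell
# add_lib_dir: hypothetical helper; appends DIR to LD_LIBRARY_PATH only once
# and never leaves an empty entry in the list.
add_lib_dir() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;  # already present, nothing to do
    *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}$1" ;;
  esac
  export LD_LIBRARY_PATH
}

add_lib_dir /usr/local/lib
```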

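Wednesday, September 20, 2017

Recursively change fileencoding and fileformat for all files under a folder

The following script changes the file encoding and file format: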
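#!/usr/bin/env bash

# Recursively rewrite every file under TOP_DIR as UTF-8 with unix line endings.
function walk()
{
  local path
  # Glob instead of parsing ls, and quote paths so names with spaces work.
  for path in "$1"/*
  do
    if [ -d "$path" ]
    then
      echo "DIR $path"
      walk "$path"
    elif [ -f "$path" ]
    then
      echo "FILE $path"
      vi +":set fileencoding=utf-8" +":set fileformat=unix" +":wq" "$path"
    fi
  done
}

if [ $# -ne 1 ]
then
  echo "USAGE: $0 TOP_DIR"
else
  walk "$1"
fi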
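
Starting vi once per file is slow on large trees. For the line-ending half of the job only, a non-interactive sketch with find and tr is possible. It does not touch the encoding (that needs the source encoding to be known), it removes every CR byte rather than only those before LF, and it should not be pointed at binary files. The name to_unix_eol is made up.

```shell
# to_unix_eol: hypothetical helper; strips carriage returns from every
# regular file under the directory given as $1.
to_unix_eol() {
  tmp=$(mktemp) || return 1
  find "$1" -type f | while IFS= read -r path
  do
    echo "FILE $path"
    # Write back with cat so the original file keeps its permissions.
    tr -d '\r' < "$path" > "$tmp" && cat "$tmp" > "$path"
  done
  rm -f "$tmp"
}
```

Usage: to_unix_eol TOP_DIR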

Some vim settings

Edit ~/.vimrc (vim ~/.vimrc).
Automatically delete trailing whitespace on save:
autocmd BufWritePre * :%s/\s\+$//e

Highlight trailing whitespace:
highlight WhitespaceEOL ctermbg=red guibg=red
match WhitespaceEOL /\s\+$/

Show tab characters:
set list
set listchars=tab:>-,trail:-

Set indentation:
set autoindent
set shiftwidth=2
set cindent

Convert tabs to spaces:
set expandtab
set tabstop=2

Enable syntax highlighting:
syntax on

A few yarn commands

Kill a job:
yarn application -list
yarn application -kill <Application ID>

View logs:
yarn logs -applicationId application_1451022530184_0001

Tuesday, September 19, 2017

linux shell: save cpu and memory usage to a file at regular intervals

#!/usr/bin/env bash

while true
do
  top -n 1 -b | grep -E "PID|java" > "cpu_mem$(date +"%Y%m%d_%H%M%S").txt"
sleep 300
done

The script saves the cpu and memory usage of all java processes to a new timestamped file every 300 seconds.
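
A variant that takes the process name as a parameter can be sketched as below. It swaps top for ps, which is an assumption on my part and changes the meaning of the cpu column: ps reports cpu time averaged over the process lifetime, not the current load. The names snapshot_name and snapshot are made up.

```shell
# snapshot_name: hypothetical helper; prints a timestamped file name.
snapshot_name() {
  printf 'cpu_mem%s.txt\n' "$(date +%Y%m%d_%H%M%S)"
}

# snapshot: hypothetical helper; writes one filtered sample of ps output.
# $1 is an extended regex for the command name, $2 the output directory.
snapshot() {
  ps -eo pid,pcpu,pmem,comm | grep -E "PID|$1" > "$2/$(snapshot_name)"
}
```

Usage: while true; do snapshot java .; sleep 300; done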

Monday, September 18, 2017

Archiving and compressing files on linux

tar
tar chf bazel-genfiles/data/uploader/uploader.tar -C bazel-out/host/bin/data/uploader uploader uploader.runfiles
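Here -C changes the working directory before the listed entries are added; c creates an archive, h follows symlinks, and f names the archive file.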

zip has no equivalent of tar's -C, but the following trick achieves the same result:
(cd bazel-out/host/bin/data/uploader && zip -qr - uploader uploader.runfiles) > bazel-out/local-fastbuild/genfiles/data/uploader/uploader.zip
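
The tar form can be wrapped in a small function so the archive never contains the leading source path. A minimal sketch; pack_from and its argument order are made up here.

```shell
# pack_from: hypothetical helper; archives the given entries relative to
# SRC_DIR, following symlinks (h), so archive paths start at the entry names.
# usage: pack_from OUT.tar SRC_DIR ENTRY...
pack_from() {
  out=$1
  src=$2
  shift 2
  tar chf "$out" -C "$src" "$@"
}
```

Usage: pack_from uploader.tar bazel-out/host/bin/data/uploader uploader uploader.runfiles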

Thursday, September 14, 2017

Changing the default editor for crontab

Run select-editor and choose 3 to use vim.

Wednesday, September 13, 2017

Some problems using spring boot with hibernate

org.hibernate.MappingException: composite-id class must implement Serializable
If a class declares more than one @Id, the class must implement Serializable.

org.xml.sax.SAXException: null:11: Element <defaultCache> does not allow attribute "maxEntriesLocalHeap".
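The ehcache version is too old. It has to be added to maven as a separate dependency; the version that hibernate pulls in by default does not work.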

org.hibernate.HibernateException: Could not obtain transaction-synchronized Session for current thread
Reference: https://stackoverflow.com/questions/26203446/spring-hibernate-could-not-obtain-transaction-synchronized-session-for-current
@EnableTransactionManagement must be specified.

Accessing files passed with spark-submit --files

Reference: https://community.hortonworks.com/questions/9265/how-can-i-add-configuration-files-to-a-spark-job-r.html

If you add your external files using "spark-submit --files" your files will be uploaded to this HDFS folder: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508

application_1449220589084_0508 is an example of yarn application ID!

1. find the spark staging directory by below code: (but you need to have the hdfs uri and your username)

System.getenv("SPARK_YARN_STAGING_DIR"); --> .sparkStaging/application_1449220589084_0508

2. find the complete comma separated file paths by using:

System.getenv("SPARK_YARN_CACHE_FILES"); --> hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt

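My summary (using --files README.md as the example):
Method 1: As described above, --files uploads the file to the .sparkStaging/applicationId directory on hdfs. Use the method above to get that hdfs directory first, then read the file from hdfs.
spark.read().textFile(System.getenv("SPARK_YARN_STAGING_DIR") + "/README.md") does the job. Without an hdfs, file, or other prefix, textFile resolves the path relative to hdfs://yourcluster/user/your_username by default. I am not sure whether this is specific to how my cluster is configured.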

Method 2:
SparkFiles.get(filePath) returned the following for me: /hadoop/yarn/local/usercache/research/appcache/application_1504461219213_9796/spark-c39002ee-01a4-435f-8682-2ba5950de230/userFiles-e82a7f84-51b1-441a-a5e3-78bf3f4a8828/README.md. At first I could not find this file either locally or on hdfs. On closer inspection, README.md does exist locally under /hadoop/yarn/local/usercache/research/.... The local README.md path differs between the worker and the driver.
Reason:
https://stackoverflow.com/questions/35865320/apache-spark-filenotfoundexception
https://stackoverflow.com/questions/41677897/how-to-get-path-to-the-uploaded-file
SparkFiles.get() returns a local directory on the driver node, so sc.textFile cannot access the file under that directory from the worker nodes. It cannot be used this way.
"""I think that the main issue is that you are trying to read the file via the textFile method. What is inside the brackets of the textFile method is executed in the driver program. In the worker node only the code to be run against an RDD is performed. When you type textFile what happens is that in your driver program it is created a RDD object with a trivial associated DAG. But nothing happens in the worker node."""

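On --files versus addFile, see this question: https://stackoverflow.com/questions/38879478/sparkcontext-addfile-vs-spark-submit-files

In cluster mode, addFile cannot find a local file, because the file exists only on the local machine, so it must be uploaded with --files.

Conclusion: do not use textFile with a path from SparkFiles.get to read files passed via --files or addFile.

SparkFiles.get throws NullPointerException

Failing code: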
val serFile = SparkFiles.get("myobject.ser")

Reason: SparkFiles.get can only be used inside a spark operator:
sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect()

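Tuesday, September 12, 2017

spring boot cannot inject beans when built and run with bazel

I use beans from azure-storage-spring-boot-starter, so I added @ComponentScan({"com.xxx", "com.microsoft.azure"}), but the beans still could not be injected.

The problem was only solved after I added @PropertySource("classpath:application.properties") to explicitly specify the application.properties file (which contains the azure storage connection settings). I do not know the cause; I stumbled on the fix by trial and error.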

Tuesday, September 5, 2017

sampled softmax

It comes from the paper On Using Very Large Target Vocabulary for Neural Machine Translation, and mainly addresses the long training time caused by a very large vocabulary.
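
As a rough sketch of the idea (the notation is mine and simplified, not taken from the paper): with a score $s_w(h)$ for word $w$ given context $h$, the full softmax normalizes over the whole vocabulary $V$, while the sampled version normalizes over a small subset $V' \subset V$ that contains the target word.

```latex
p(w \mid h) = \frac{\exp\big(s_w(h)\big)}{\sum_{w' \in V} \exp\big(s_{w'}(h)\big)}
\quad\longrightarrow\quad
\tilde{p}(w \mid h) = \frac{\exp\big(s_w(h)\big)}{\sum_{w' \in V'} \exp\big(s_{w'}(h)\big)},
\qquad w \in V' \subset V
```

The cost of the normalizer per training step drops from $O(|V|)$ to $O(|V'|)$. This simplified form leaves out the correction for the sampling distribution that importance sampling variants apply.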

The following two blog posts explain it fairly clearly:
On word embeddings - Part 2: Approximating the Softmax: http://ruder.io/word-embeddings-softmax/index.html#whichapproachtochoose
Chinese translation: http://geek.csdn.net/news/detail/135736

Reading notes on a paper about importance sampling in sampled softmax: http://blog.csdn.net/wangpeng138375/article/details/75151064
