Friday, January 29, 2016

Counting a field in log files with grep, sed, awk, and sort

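I ran into the following requirement: count the values of one field in log files stored in HDFS. Each line of the log has the following format: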
{"msgid":"478721ef-ff6e-42b1-a2fb-97bba74c7bac","request_json":"{\"address\":\"中国,山东,临沂,罗庄,,,35.01737326529004,118.29531628357033\",\"appkey\":\"com.google\",\"channel\":\"aw_sdk\",\"device_id\":\"b61f0da076b9a88547c3b790014bd21d\",\"msg_id\":\"478721ef-ff6e-42b1-a2fb-97bba74c7bac\",\"output\":\"\",\"query\":\"on_peer_connected\",\"query_type\":\"event\",\"remote_ip\":\"\",\"task\":\"\",\"user_id\":\"b61f0da076b9a88547c3b790014bd21d\",\"version\":\"\",\"watch_build\":\"LHA72H\",\"watch_device_id\":\"7c0f71560710fe730e9b8d08ab7b7b00\"}","status":"success","msg_type":"event","lng":"118.295","wechat_user_id":"3790728","deviceid":"b61f0da076b9a88547c3b790014bd21d","qa_classification":"","qa_result":"{}","id":"56741496","content":"on_peer_connected","speech_id":"NULL","updated_at":"2015-10-08 15:27:43","wechat_app_id":"62","address":"中国,山东,临沂,罗庄,,,35.01737326529004,118.29531628357033","created_at":"2015-10-08 15:27:43","historical":"NULL","lat":"35.0174"}

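The goal is to count the occurrences of each distinct watch_build value. One workable approach is to use grep to keep only the log lines that contain the watch_build field, then sed to extract the watch_build value, and finally awk to tally the counts: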
hdfs dfs -cat /data/log/offline/process/weiyuyi/2016/01/26/* | grep watch_build | sed 's/\(.*\)\(watch_build\)\\":\\"\([^\]*\)\(.*\)/\2:\3/g' | awk 'BEGIN{FS=":"} {a[$2]++;} END {for (i in a) print i ", " a[i];}'
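
To try the pipeline without an HDFS cluster, the sketch below runs the same grep, sed, and awk stages on a few made-up log lines. The sample records are shortened and hypothetical (only LHA72H comes from the log line above; LMY47V is invented for illustration), and the trailing sort is added only because awk's for-in loop visits keys in an unspecified order.

```shell
# Hypothetical, shortened log lines in the same escaped-JSON shape as above.
sample_logs='{"msgid":"1","request_json":"{\"query\":\"a\",\"watch_build\":\"LHA72H\",\"watch_device_id\":\"x\"}","status":"success"}
{"msgid":"2","request_json":"{\"query\":\"b\",\"watch_build\":\"LMY47V\",\"watch_device_id\":\"y\"}","status":"success"}
{"msgid":"3","request_json":"{\"query\":\"c\",\"watch_build\":\"LHA72H\",\"watch_device_id\":\"z\"}","status":"success"}
{"msgid":"4","status":"success"}'

# grep keeps lines that mention watch_build, sed rewrites each one to
# "watch_build:<value>", and awk counts how often each value appears.
build_counts=$(printf '%s\n' "$sample_logs" |
  grep watch_build |
  sed 's/\(.*\)\(watch_build\)\\":\\"\([^\]*\)\(.*\)/\2:\3/g' |
  awk 'BEGIN{FS=":"} {a[$2]++} END {for (i in a) print i ", " a[i]}' |
  sort)

printf '%s\n' "$build_counts"
```

Inside the bracket expression `[^\]` the backslash is an ordinary character, so the third group stops at the backslash that precedes the closing escaped quote.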

Alternatively, use sort and uniq, which removes the need for awk:
hdfs dfs -cat /data/log/offline/process/weiyuyi/2016/01/26/* | grep watch_build | sed 's/\(.*\)\(watch_build\)\\":\\"\([^\]*\)\(.*\)/\2:\3/g' | sort | uniq -c | sort -n -r
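
The first sort matters because uniq -c only merges adjacent duplicate lines; the second sort ranks the counts from highest to lowest. Here is a minimal sketch on made-up build values (LMY47V is invented for illustration). The final sed stage is not part of the command above; it only trims the padding that uniq -c prints before each count.

```shell
# Hypothetical extracted values, one per line, deliberately unsorted.
builds='LHA72H
LMY47V
LHA72H'

# sort groups identical lines so uniq -c can count them;
# sort -n -r then orders the result by count, largest first.
ranked=$(printf '%s\n' "$builds" | sort | uniq -c | sort -n -r | sed 's/^ *//')

printf '%s\n' "$ranked"
```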

A concise sed tutorial: link
A concise awk tutorial: link

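A simple sed example that extracts substrings:
  Given testchenxiaoyuage20xxx, extract chenxiaoyu and 20 and print them as chenxiaoyu:20
  echo "testchenxiaoyuage20xxx" | sed 's/.*\(chenxiaoyu\)age\([0-9]*\).*/\1:\2/g'
  where:
    .* matches any text (zero or more of any character);
    text enclosed between \( and \) is captured and numbered in order, and can be referenced later as \1, \2, and so on;
    /g replaces every match on a line rather than only the first. Here the pattern already spans the whole line, so the flag has no effect.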
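
The same extraction can also be written with extended regular expressions, where grouping parentheses need no backslashes. This is a sketch that assumes a sed supporting the -E option (GNU sed and BSD sed both do); the variable names are chosen only for illustration.

```shell
input="testchenxiaoyuage20xxx"

# Basic regular expressions: groups are written \( ... \).
bre_result=$(echo "$input" | sed 's/.*\(chenxiaoyu\)age\([0-9]*\).*/\1:\2/')

# Extended regular expressions (-E): groups are written ( ... ).
ere_result=$(echo "$input" | sed -E 's/.*(chenxiaoyu)age([0-9]*).*/\1:\2/')

echo "$bre_result"
echo "$ere_result"
```

Both forms produce the same result; which one to use is mostly a matter of readability.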
