CHEN Xiaoyu's blog: 十一月 2015

2015年11月30日星期一

使用Hive统计应用商店数据

Download log:
{
"device_id":"5d9e295b8705895f8c6de5420eb1973b",
"download_log":
{
"apk_url":"http://res.appstore.ticwear.com/com.broadlink.rmt_3.5.9-1.1.9.apk",
"appid":1000
},
"message_type":3,
"remote_ip":"220.200.5.247",
"time":1441571407,
"uid":"29716727"
}

Count the number of devices:
hive> select count(distinct device_id) from download_log;

Count the download times of different users:
hive> select device_id, count(device_id) rank from download_log group by device_id order by rank desc;
or: hive> select uid, device_id, count(device_id) rank from download_log group by uid, device_id order by rank desc;

Count the download times of different apps:
hive> select apk_url, appid, count(appid) rank from download_log group by apk_url, appid order by rank desc;

Detail log:
{
"detail_log":
{
"from":"",
"package_name":"com.stmp.minimalface",
"query":""
},
"device_id":"a5147789062ba29b06562bca6cbb3cc1",
"message_type":0,
"remote_ip":"61.239.223.125",
"time":1441331302,
"uid":"29728861"
}

Count the detail times of different users:
hive> select device_id, count(device_id) rank from detail_log group by device_id order by rank desc;

Count the detail times of different apps:
hive> select package_name, count(package_name) rank from detail_log group by package_name order by rank desc;

Save the result into local file system:
hive> insert overwrite local directory '$your_path'
select device_id, count(device_id) rank from download_log group by device_id order by rank desc;

Save the result into HDFS:

hive> insert overwrite directory '$your_path'
select device_id, count(device_id) rank from download_log group by device_id order by rank desc;

Save the result into other table:
hive> insert into table test

select device_id, count(device_id) rank from download_log group by device_id order by rank desc;

2015年11月25日星期三

在Hive中根据Json格式的日志创建用户数据表

PART 1
If your JSON file is not in nested structures, please see this link.Or you can also use the method in part 2.

PART 2
As this method does not support nested structures, there is another way to parse json files: link.

We need to parse the json file in the following format:
{
"detail_log":
{
"from":"",
"package_name":"com.aggro.wearappmanager",
"query":""
},
"device_id":"2fc35dde39093c78302ed45b287ae81e",
"message_type":0,
"remote_ip":"39.169.9.107",
"time":1446268416,
"uid":"29737982"
}

As listed in the official document, we can use get_json_object or json_tuple to parse json string.

At first, I wanted to create a app_log table with struct but I cannot keep going to extract json object, keep it unfinished:
hive> create table app_log (detail_log struct<`from`:string, package_name:string, query:string>, device_id string, message_type int, remote_ip string, time bigint, uid string);

Attention: Never load data from HDFS! See link1 and link2.

Below are my steps:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table app_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json;
3. hive> alter table app_log change `_c0` `from` string;
hive> alter table app_log change `_c1` package_name string;
hive> alter table app_log change `_c2` query string;
hive> alter table app_log change `_c3` device_id string;
hive> alter table app_log change `_c4` message_type int;
hive> alter table app_log change `_c5` remote_ip string;
hive> alter table app_log change `_c6` time bigint;
hive> alter table app_log change `_c7` uid string;
done.

Till now the work has been finished, but for step 3, I have something to say.
I wanted to use json_tuple, but it seems that "select json_tuple(src_json.json, 'detail_log.package_name') from src_json;" doesn't work. Only "select json_tuple(src_json.json, 'detail_log') from src_json" works.
So I tried to create another table to save detail_log:
hive> create table src_json2 as select json_tuple(src_json.json, 'detail_log') from src_json；
hive> alter table src_json2 change c0 detail string;
But then "select json_tuple(src_json2.detail, 'from', 'package_name', 'query'), json_tuple(src_json.json, 'device_id', 'message_type', 'remmote_ip', 'time', 'uid') from src_json, src_json2" failed, this error occurs:

Only a single expression in the SELECT clause is supported with UDTF's.

So I don't know how to do with that, keep this unsolved.

PART 3
We encounter this problem: In the same file, we have two different types of JSON as shown below:
{
"detail_log":
{
"from":"",
"package_name":"com.mapelf",
"query":""
},
"device_id":"897ee392eaec9eeac0e9de069ab3aac5",
"message_type":0,
"remote_ip":"120.207.151.129",
"time":1446268421,
"uid":"29752894"
}

{
"device_id":"42f89c8deb5decd55380205cd6decaf6",
"download_log":
{
"apk_url":"http://res.appstore.ticwear.com/net.ddroid.aw.watchface.mv08_5.4.3.apk",
"appid":911
},
"message_type":3,
"remote_ip":"106.39.192.55",
"time":1446268426,
"uid":"29754909"
}

We can use a trick to deal with this case.
As the first JSON has the 'package_name' field and is not NULL and the second JSON has the 'apk_url' field and is not NULL, so we can distinguish between the two of them.
We save the two JSON logs into two tables: detail_log and download_log.

My steps:
If you want to use data in HDFS, try this step 1:
1. hive> create external table src_json(json string) stored as textfile location '<hdfs_location>';
Otherwise you can follow this step 1:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table detail_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.detail_log.package_name') is not NULL;
3. hive> alter table detail_log change `_c0` `from` string;
hive> alter table detail_log change `_c1` package_name string;
hive> alter table detail_log change `_c2` query string;
hive> alter table detail_log change `_c3` device_id string;
hive> alter table detail_log change `_c4` message_type int;
hive> alter table detail_log change `_c5` remote_ip string;
hive> alter table detail_log change `_c6` time bigint;
hive> alter table detail_log change `_c7` uid string;
4. hive> create table download_log as select get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.download_log.apk_url'), get_json_object(src_json.json, '$.download_log.appid'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.download_log.apk_url') is not NULL;
5. hive> alter table download_log change `_c0` device_id string;
hive> alter table download_log change `_c1` apk_url string;
hive> alter table download_log change `_c2` appid int;
hive> alter table download_log change `_c3` message_type int;
hive> alter table download_log change `_c4` remote_ip string;
hive> alter table download_log change `_c5` time bigint;
hive> alter table download_log change `_c6` uid string;
done.

Matlab 2015a and Visual Studio 2015

When entering this command: mex -setup, this error occurs:
Error using mex. No supported compiler or SDK was found.

Solution: see this link. It may help you.

2015年11月21日星期六

Hive

Install and configure Hive:
Please see this doc.

Hive operations:
Please see this doc.

A video tutorial of Hive: link.

Windows下Sublime Text搭建LaTeX编写环境

注意：
tex路径中千万不要有中文！否则ctrl+b编译会一片空白。

准备软件：
1、Sublime Text：链接.
2、TeX Live: 链接.
3、SumatraPDF：链接.

安装步骤：
1、安装Sublime Text。在安装时勾选Add to explorer context menu，这样在右键单击文件时就可以直接使用Sublime Text打开。
2、安装TeX Live。不要在任意的步骤中使用含有中文或空格的路径，以免触发一些潜在的bug。可以参考：链接。
3、安装SumatraPDF，安装路径中还是不要出现空格。

设置步骤：
1、Sublime Text
安装Package Control，网上有很多教程，自行搜索，参考链接。
装好Package Control后，安装LaTeXTools插件：在sublime text界面按下ctrl+shift+p，输入install package，回车，再输入latextools，回车，完成。

安装好LaTeXTools后，需要更改配置：
先点击Preferences - Package Settings -> LaTeXTools -> Reconfigure LaTeXTools and migrate settings
然后点击菜单栏：Preferences - Package Settings -> LaTeXTools -> Setting-User，
修改配置如下图：

保存。

2、编辑你的tex文件，ctrl+b编译，pdf文件会自动打开。

3、设置从SumatraPDF中反向搜索
打开SumatraPDF，设置 - 选项...，设置反向搜索命令行：
"C:\Program Files\Sublime Text 2\sublime_text.exe" "%f:%l"

确定。

设置好了之后双击pdf文件中的位置可以跳转到代码位置，done。

如果你发现ctrl+b编译时候是空白，试试这个方法：
卸载sublime text，然后删除C:\Users\your-username\AppData\Roaming中的sublime text文件夹，重新安装设置。

P.S. 推荐安装BracketHighlighter插件，可以高亮匹配括号。

2015年11月18日星期三

The difference between "hadoop fs" and "hdfs dfs"

Please see this link.

2015年11月17日星期二

用Maven构建Hadoop项目

参见用Maven构建Hadoop项目。

遇到的问题及解决方法：
1、安装Eclipse的Maven插件出错
错误提示：Cannot complete the install because one or more required items could not be found.
requires 'bundle org.slf4j.api 1.6.2' but it could not be found
解决方法：首先安装

Name: slf4j
Url: http://www.fuin.org/p2-repository/

参见Stack Overflow的帖子。

2015年11月12日星期四

解决virtualbox复制ubuntu后改变mac地址不能识别网卡问题

参见链接方法一中的（2）
vi /etc/udev/rules.d/70-persistent-net.rules

echo manual | tee /etc/init/samba-ad-dc.override

参考：链接1，链接2

2、ubuntu屏幕分辨率太低
解决方法: http://blog.csdn.net/weilanxing/article/details/7664324

CHEN Xiaoyu's blog