2015年11月30日星期一

使用Hive统计应用商店数据

Download log:
{
    "device_id":"5d9e295b8705895f8c6de5420eb1973b",
    "download_log":
    {
        "apk_url":"http://res.appstore.ticwear.com/com.broadlink.rmt_3.5.9-1.1.9.apk",
        "appid":1000
    },
    "message_type":3,
    "remote_ip":"220.200.5.247",
    "time":1441571407,
    "uid":"29716727"
}

Count the number of devices:
hive> select count(distinct device_id) from download_log;

Count the download times of different users:
hive> select device_id, count(device_id) rank from download_log group by device_id order by rank desc;
or: hive> select uid, device_id, count(device_id) rank from download_log group by uid, device_id order by rank desc;

Count the download times of different apps:
hive> select apk_url, appid, count(appid) rank from download_log group by apk_url, appid order by rank desc;

Detail log:
{
    "detail_log":
    {
        "from":"",
        "package_name":"com.stmp.minimalface",
        "query":""
    },
    "device_id":"a5147789062ba29b06562bca6cbb3cc1",
    "message_type":0,
    "remote_ip":"61.239.223.125",
    "time":1441331302,
    "uid":"29728861"
}

Count the detail times of different users:
hive> select device_id, count(device_id) rank from detail_log group by device_id order by rank desc;

Count the detail times of different apps:
hive> select package_name, count(package_name) rank from detail_log group by package_name order by rank desc;

Save the result into local file system:
hive> insert overwrite local directory '$your_path'
select device_id, count(device_id) rank from download_log group by device_id order by rank desc;

Save the result into HDFS:
hive> insert overwrite directory '$your_path'
select device_id, count(device_id) rank from download_log group by device_id order by rank desc;

Save the result into other table:
hive> insert into table test
select device_id, count(device_id) rank from download_log group by device_id order by rank desc;

2015年11月25日星期三

在Hive中根据Json格式的日志创建用户数据表

PART 1
If your JSON file is not in nested structures, please see this link.Or you can also use the method in part 2.

PART 2
As this method does not support nested structures, there is another way to parse json files: link.

We need to parse the json file in the following format:
{
 "detail_log":
 {
 "from":"",
 "package_name":"com.aggro.wearappmanager",
 "query":""
 },
 "device_id":"2fc35dde39093c78302ed45b287ae81e",
 "message_type":0,
 "remote_ip":"39.169.9.107",
 "time":1446268416,
 "uid":"29737982"
}

As listed in the official document, we can use get_json_object or json_tuple to parse json string.

At first, I wanted to create a app_log table with struct but I cannot keep going to extract json object, keep it unfinished:
hive> create table app_log (detail_log struct<`from`:string, package_name:string, query:string>, device_id string, message_type int, remote_ip string, time bigint, uid string);

Attention: Never load data from HDFS! See link1 and link2.

Below are my steps:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table app_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json;
3. hive> alter table app_log change `_c0` `from` string;
hive> alter table app_log change `_c1` package_name string;
hive> alter table app_log change `_c2` query string;
hive> alter table app_log change `_c3` device_id string;
hive> alter table app_log change `_c4` message_type int;
hive> alter table app_log change `_c5` remote_ip string;
hive> alter table app_log change `_c6` time bigint;
hive> alter table app_log change `_c7` uid string;
done.

Till now the work has been finished, but for step 3, I have something to say.
I wanted to use json_tuple, but it seems that "select json_tuple(src_json.json, 'detail_log.package_name') from src_json;" doesn't work. Only "select json_tuple(src_json.json, 'detail_log') from src_json" works.
So I tried to create another table to save detail_log:
hive> create table src_json2 as select json_tuple(src_json.json, 'detail_log') from src_json;
hive> alter table src_json2 change c0 detail string;
But then "select json_tuple(src_json2.detail, 'from', 'package_name', 'query'), json_tuple(src_json.json, 'device_id', 'message_type', 'remmote_ip', 'time', 'uid') from src_json, src_json2" failed, this error occurs:
Only a single expression in the SELECT clause is supported with UDTF's.
So I don't know how to do with that, keep this unsolved.

PART 3
We encounter this problem: In the same file, we have two different types of JSON as shown below:
{
    "detail_log":
    {
        "from":"",
        "package_name":"com.mapelf",
        "query":""
    },
    "device_id":"897ee392eaec9eeac0e9de069ab3aac5",
    "message_type":0,
    "remote_ip":"120.207.151.129",
    "time":1446268421,
    "uid":"29752894"
}

{
    "device_id":"42f89c8deb5decd55380205cd6decaf6",
    "download_log":
    {
        "apk_url":"http://res.appstore.ticwear.com/net.ddroid.aw.watchface.mv08_5.4.3.apk",
        "appid":911
    },
    "message_type":3,
    "remote_ip":"106.39.192.55",
    "time":1446268426,
    "uid":"29754909"
}

We can use a trick to deal with this case.
As the first JSON has the 'package_name' field and is not NULL and the second JSON has the 'apk_url' field and is not NULL, so we can distinguish between the two of them.
We save the two JSON logs into two tables: detail_log and download_log.

My steps:
If you want to use data in HDFS, try this step 1:
1. hive> create external table src_json(json string) stored as textfile location '<hdfs_location>';
Otherwise you can follow this step 1:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table detail_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.detail_log.package_name') is not NULL;
3. hive> alter table detail_log change `_c0` `from` string;
hive> alter table detail_log change `_c1` package_name string;
hive> alter table detail_log change `_c2` query string;
hive> alter table detail_log change `_c3` device_id string;
hive> alter table detail_log change `_c4` message_type int;
hive> alter table detail_log change `_c5` remote_ip string;
hive> alter table detail_log change `_c6` time bigint;
hive> alter table detail_log change `_c7` uid string;
4. hive> create table download_log as select get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.download_log.apk_url'), get_json_object(src_json.json, '$.download_log.appid'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.download_log.apk_url') is not NULL;
5. hive> alter table download_log change `_c0` device_id string;
hive> alter table download_log change `_c1` apk_url string;
hive> alter table download_log change `_c2` appid int;
hive> alter table download_log change `_c3` message_type int;
hive> alter table download_log change `_c4` remote_ip string;
hive> alter table download_log change `_c5` time bigint;
hive> alter table download_log change `_c6` uid string;
done.

Matlab 2015a and Visual Studio 2015

When entering this command: mex -setup, this error occurs:
Error using mex. No supported compiler or SDK was found.

Solution: see this link. It may help you.

2015年11月21日星期六

Hive

Install and configure Hive:
Please see this doc.

Hive operations:
Please see this doc.

A video tutorial of Hive: link.

Windows下Sublime Text搭建LaTeX编写环境

注意:
tex路径中千万不要有中文!否则ctrl+b编译会一片空白。

准备软件:
1、Sublime Text:链接.
2、TeX Live: 链接.
3、SumatraPDF:链接.

安装步骤:
1、安装Sublime Text。在安装时勾选Add to explorer context menu,这样在右键单击文件时就可以直接使用Sublime Text打开。
2、安装TeX Live。不要在任意的步骤中使用含有中文或空格的路径,以免触发一些潜在的bug。可以参考:链接
3、安装SumatraPDF,安装路径中还是不要出现空格。

设置步骤:
1、Sublime Text
安装Package Control,网上有很多教程,自行搜索,参考链接
装好Package Control后,安装LaTeXTools插件:在sublime text界面按下ctrl+shift+p,输入install package,回车,再输入latextools,回车,完成。

安装好LaTeXTools后,需要更改配置:
先点击Preferences - Package Settings -> LaTeXTools -> Reconfigure LaTeXTools and migrate settings
然后点击菜单栏:Preferences - Package Settings -> LaTeXTools -> Setting-User,
修改配置如下图:

保存。

2、编辑你的tex文件,ctrl+b编译,pdf文件会自动打开。

3、设置从SumatraPDF中反向搜索
打开SumatraPDF,设置 - 选项...,设置反向搜索命令行:
"C:\Program Files\Sublime Text 2\sublime_text.exe" "%f:%l"
确定。

设置好了之后双击pdf文件中的位置可以跳转到代码位置,done。

如果你发现ctrl+b编译时候是空白,试试这个方法:
卸载sublime text,然后删除C:\Users\your-username\AppData\Roaming中的sublime text文件夹,重新安装设置。

P.S. 推荐安装BracketHighlighter插件,可以高亮匹配括号。

2015年11月17日星期二

用Maven构建Hadoop项目

参见用Maven构建Hadoop项目

遇到的问题及解决方法:
1、安装Eclipse的Maven插件出错
错误提示:Cannot complete the install because one or more required items could not be found.
requires 'bundle org.slf4j.api 1.6.2' but it could not be found
解决方法:首先安装
参见Stack Overflow的帖子

2015年11月12日星期四

2015年11月11日星期三

VirtualBox中设置ubuntu虚拟机和主机通信且可以连接外网

网卡1设置为网络地址转换(NAT),网卡2设置为host-only。
具体设置参考:为VirtualBox中的Ubuntu配置双网卡
我的/etc/network/interfaces文件如下:

有时候网连不上,需要先ifdown再ifup起来,原因不明。为了避免不必要的问题,主机最好连接有线网。

Hadoop安装

Ubuntu14.04:环境搭建
CentOS6.4:环境搭建
Hadoop2.7.1:安装

See also: install-hadoop or this doc

安装ubuntu server过程中出现的两个问题和解决方法

在VirtualBox中安装了ubuntu-14.04-server,遇到以下两个问题:

1、我安装了samba服务,开机时候遇到遇到一个fail:
Starting SMB/CIFS File and Active Directory Server. [FAIL]
查了一下,找到了一个解决方法:
echo manual | tee /etc/init/samba-ad-dc.override
参考:链接1链接2

2、ubuntu屏幕分辨率太低
解决方法: http://blog.csdn.net/weilanxing/article/details/7664324