2015年11月25日星期三

在Hive中根据Json格式的日志创建用户数据表

PART 1
If your JSON file is not in nested structures, please see this link.Or you can also use the method in part 2.

PART 2
As this method does not support nested structures, there is another way to parse json files: link.

We need to parse the json file in the following format:
{
 "detail_log":
 {
 "from":"",
 "package_name":"com.aggro.wearappmanager",
 "query":""
 },
 "device_id":"2fc35dde39093c78302ed45b287ae81e",
 "message_type":0,
 "remote_ip":"39.169.9.107",
 "time":1446268416,
 "uid":"29737982"
}

As listed in the official document, we can use get_json_object or json_tuple to parse json string.

At first, I wanted to create a app_log table with struct but I cannot keep going to extract json object, keep it unfinished:
hive> create table app_log (detail_log struct<`from`:string, package_name:string, query:string>, device_id string, message_type int, remote_ip string, time bigint, uid string);

Attention: Never load data from HDFS! See link1 and link2.

Below are my steps:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table app_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json;
3. hive> alter table app_log change `_c0` `from` string;
hive> alter table app_log change `_c1` package_name string;
hive> alter table app_log change `_c2` query string;
hive> alter table app_log change `_c3` device_id string;
hive> alter table app_log change `_c4` message_type int;
hive> alter table app_log change `_c5` remote_ip string;
hive> alter table app_log change `_c6` time bigint;
hive> alter table app_log change `_c7` uid string;
done.

Till now the work has been finished, but for step 3, I have something to say.
I wanted to use json_tuple, but it seems that "select json_tuple(src_json.json, 'detail_log.package_name') from src_json;" doesn't work. Only "select json_tuple(src_json.json, 'detail_log') from src_json" works.
So I tried to create another table to save detail_log:
hive> create table src_json2 as select json_tuple(src_json.json, 'detail_log') from src_json;
hive> alter table src_json2 change c0 detail string;
But then "select json_tuple(src_json2.detail, 'from', 'package_name', 'query'), json_tuple(src_json.json, 'device_id', 'message_type', 'remmote_ip', 'time', 'uid') from src_json, src_json2" failed, this error occurs:
Only a single expression in the SELECT clause is supported with UDTF's.
So I don't know how to do with that, keep this unsolved.

PART 3
We encounter this problem: In the same file, we have two different types of JSON as shown below:
{
    "detail_log":
    {
        "from":"",
        "package_name":"com.mapelf",
        "query":""
    },
    "device_id":"897ee392eaec9eeac0e9de069ab3aac5",
    "message_type":0,
    "remote_ip":"120.207.151.129",
    "time":1446268421,
    "uid":"29752894"
}

{
    "device_id":"42f89c8deb5decd55380205cd6decaf6",
    "download_log":
    {
        "apk_url":"http://res.appstore.ticwear.com/net.ddroid.aw.watchface.mv08_5.4.3.apk",
        "appid":911
    },
    "message_type":3,
    "remote_ip":"106.39.192.55",
    "time":1446268426,
    "uid":"29754909"
}

We can use a trick to deal with this case.
As the first JSON has the 'package_name' field and is not NULL and the second JSON has the 'apk_url' field and is not NULL, so we can distinguish between the two of them.
We save the two JSON logs into two tables: detail_log and download_log.

My steps:
If you want to use data in HDFS, try this step 1:
1. hive> create external table src_json(json string) stored as textfile location '<hdfs_location>';
Otherwise you can follow this step 1:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table detail_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.detail_log.package_name') is not NULL;
3. hive> alter table detail_log change `_c0` `from` string;
hive> alter table detail_log change `_c1` package_name string;
hive> alter table detail_log change `_c2` query string;
hive> alter table detail_log change `_c3` device_id string;
hive> alter table detail_log change `_c4` message_type int;
hive> alter table detail_log change `_c5` remote_ip string;
hive> alter table detail_log change `_c6` time bigint;
hive> alter table detail_log change `_c7` uid string;
4. hive> create table download_log as select get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.download_log.apk_url'), get_json_object(src_json.json, '$.download_log.appid'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.download_log.apk_url') is not NULL;
5. hive> alter table download_log change `_c0` device_id string;
hive> alter table download_log change `_c1` apk_url string;
hive> alter table download_log change `_c2` appid int;
hive> alter table download_log change `_c3` message_type int;
hive> alter table download_log change `_c4` remote_ip string;
hive> alter table download_log change `_c5` time bigint;
hive> alter table download_log change `_c6` uid string;
done.

没有评论:

发表评论