PART 1
If your JSON file is not in nested structures, please see this
link.Or you can also use the method in part 2.
PART 2
As this method does not support nested structures, there is another way to parse json files:
link.
We need to parse the json file in the following format:
{
"detail_log":
{
"from":"",
"package_name":"com.aggro.wearappmanager",
"query":""
},
"device_id":"2fc35dde39093c78302ed45b287ae81e",
"message_type":0,
"remote_ip":"39.169.9.107",
"time":1446268416,
"uid":"29737982"
}
As listed in the official document, we can use
get_json_object or
json_tuple to parse json string.
At first, I wanted to create a app_log table with struct but I cannot keep going to extract json object, keep it unfinished:
hive> create table app_log (detail_log struct<`from`:string, package_name:string, query:string>, device_id string, message_type int, remote_ip string, time bigint, uid string);
Attention: Never load data from HDFS! See
link1 and
link2.
Below are my steps:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table app_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json;
3. hive> alter table app_log change `_c0` `from` string;
hive> alter table app_log change `_c1` package_name string;
hive> alter table app_log change `_c2` query string;
hive> alter table app_log change `_c3` device_id string;
hive> alter table app_log change `_c4` message_type int;
hive> alter table app_log change `_c5` remote_ip string;
hive> alter table app_log change `_c6` time bigint;
hive> alter table app_log change `_c7` uid string;
done.
Till now the work has been finished, but for step 3, I have something to say.
I wanted to use
json_tuple, but it seems that "
select json_tuple(src_json.json, 'detail_log.package_name') from src_json;" doesn't work. Only "
select json_tuple(src_json.json, 'detail_log') from src_json" works.
So I tried to create another table to save detail_log:
hive> create table src_json2 as select json_tuple(src_json.json, 'detail_log') from src_json;
hive> alter table src_json2 change c0 detail string;
But then "
select json_tuple(src_json2.detail, 'from', 'package_name', 'query'), json_tuple(src_json.json, 'device_id', 'message_type', 'remmote_ip', 'time', 'uid') from src_json, src_json2" failed, this error occurs:
Only a single expression in the SELECT clause is supported with UDTF's.
So I don't know how to do with that, keep this unsolved.
PART 3
We encounter this problem: In the same file, we have two different types of JSON as shown below:
{
"detail_log":
{
"from":"",
"package_name":"com.mapelf",
"query":""
},
"device_id":"897ee392eaec9eeac0e9de069ab3aac5",
"message_type":0,
"remote_ip":"120.207.151.129",
"time":1446268421,
"uid":"29752894"
}
{
"device_id":"42f89c8deb5decd55380205cd6decaf6",
"download_log":
{
"apk_url":"http://res.appstore.ticwear.com/net.ddroid.aw.watchface.mv08_5.4.3.apk",
"appid":911
},
"message_type":3,
"remote_ip":"106.39.192.55",
"time":1446268426,
"uid":"29754909"
}
We can use a trick to deal with this case.
As the first JSON has the 'package_name' field and is not NULL and the second JSON has the 'apk_url' field and is not NULL, so we can distinguish between the two of them.
We save the two JSON logs into two tables: detail_log and download_log.
My steps:
If you want to use data in HDFS, try this step 1:
1. hive> create external table src_json(json string) stored as textfile location '<hdfs_location>';
Otherwise you can follow this step 1:
1. hive> create table src_json(json string);
hive> load data local inpath '/home/chenxiaoyu/data/app.log.2015-11-01.1446328474758' into table src_json;
2. hive> create table detail_log as select get_json_object(src_json.json, '$.detail_log.from'), get_json_object(src_json.json, '$.detail_log.package_name'), get_json_object(src_json.json, '$.detail_log.query'), get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.detail_log.package_name') is not NULL;
3. hive> alter table detail_log change `_c0` `from` string;
hive> alter table detail_log change `_c1` package_name string;
hive> alter table detail_log change `_c2` query string;
hive> alter table detail_log change `_c3` device_id string;
hive> alter table detail_log change `_c4` message_type int;
hive> alter table detail_log change `_c5` remote_ip string;
hive> alter table detail_log change `_c6` time bigint;
hive> alter table detail_log change `_c7` uid string;
4. hive> create table download_log as select get_json_object(src_json.json, '$.device_id'), get_json_object(src_json.json, '$.download_log.apk_url'), get_json_object(src_json.json, '$.download_log.appid'), get_json_object(src_json.json, '$.message_type'), get_json_object(src_json.json, '$.remote_ip'), get_json_object(src_json.json, '$.time'), get_json_object(src_json.json, '$.uid') from src_json where get_json_object(src_json.json, '$.download_log.apk_url') is not NULL;
5. hive> alter table download_log change `_c0` device_id string;
hive> alter table download_log change `_c1` apk_url string;
hive> alter table download_log change `_c2` appid int;
hive> alter table download_log change `_c3` message_type int;
hive> alter table download_log change `_c4` remote_ip string;
hive> alter table download_log change `_c5` time bigint;
hive> alter table download_log change `_c6` uid string;
done.