Problem
Given several partitioned AVRO formatted files, together with the AVSC schema, we want to create a table in Hive.
We have hundreds of files in a directory partitioned by year and mont in the HDFS folder /data/mytable.db/mytable.
The folder structure is:
/data/mytable.db/mytable/Year=2018/month=11
/data/mytable.db/mytable/Year=2018/month=12
/data/mytable.db/mytable/Year=2019/month=1
/data/mytable.db/mytable/Year=2019/month=2
/data/mytable.db/mytable/Year=2019/month=3
/data/mytable.db/mytable/Year=2019/month=4
Solution
Assuming that the AVSC file is: /data/schemas/mytable_schema.avsc
and that the AVRO files are in: /data/mytable.db/mytable
We can create a Hive table in the mydb schema with reference to the former files:
CREATE EXTERNAL TABLE `mydb.mytable`( `name` string COMMENT '', `surname` string COMMENT '', `phone` string COMMENT '', `score` int COMMENT '' ) PARTITIONED BY ( `year` int, `month` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ( 'avro.schema.url'='hdfs:///data/schemas/mytable_schema.avsc') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs:///data/mytable.db/mytable' TBLPROPERTIES ( 'transient_lastDdlTime'='1566997300');
After that you need to execute MSCK REPAIR TABLE:
Use this statement on Hadoop partitioned tables to identify partitions that were manually added to the distributed file system (DFS). This statement (a Hive command) adds metadata about the partitions to the Hive catalogs.
MSCK REPAIR TABLE mydb.mytable
If you are goint to use Impala you need to repair the metastore with:
invalidate metadata mydb.mytable;
Now you can execute queries:
select * from mydb.mytable limit 10