Hive Data Definition

In this chapter

This chapter explains how to define a schema so that you can perform ad hoc analysis of the files stored on the cluster.

What is Hive?

Hive is a data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL itself.
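As an illustration of plugging a custom mapper into a query, the following is a minimal sketch that uses HiveQL's TRANSFORM clause. The table app_logs, its columns, the script my_mapper.py, and its path are hypothetical names chosen for this example; they are not part of QueryIO or of this chapter.

```sql
-- Assumption: app_logs is an existing Hive table with string columns
-- class and message. my_mapper.py is a hypothetical script that reads
-- tab-separated rows on stdin and writes tab-separated rows to stdout.
ADD FILE /tmp/my_mapper.py;

-- Hive streams the selected columns through the script and exposes the
-- script's output under the column names listed after AS.
SELECT TRANSFORM (class, message)
       USING 'python my_mapper.py'
       AS (class_name, flagged)
FROM app_logs;
```

A custom script is usually worth the extra overhead only when the logic cannot reasonably be written with built-in HiveQL functions.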

The supported file types are:

What is Hive Data Definition?

A Hive data definition assigns a relational structure to the files stored on the HDFS cluster, so that you can query the structured data to extract specific information. For example, a data definition for log files would contain columns such as CLASS, FILENAME, MESSAGE, and LINENUMBER. To find the classes in which an exception occurred, you can then search for the term 'Exception' in the MESSAGE column, just as you would in a relational table. In this way, you can run SQL-like queries against the files on the cluster to find the data you need.
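The log file example above can be sketched in plain HiveQL as follows. The table name app_logs, the column types, the tab delimiter, and the HDFS location are assumptions made for illustration; a definition created through QueryIO may be named and stored differently.

```sql
-- Hypothetical data definition for delimited log files.
-- Column names follow the example in the text; everything else is assumed.
CREATE EXTERNAL TABLE IF NOT EXISTS app_logs (
  class      STRING,
  filename   STRING,
  message    STRING,
  linenumber INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs';

-- List the classes whose log messages mention an exception.
SELECT DISTINCT class
FROM app_logs
WHERE message LIKE '%Exception%';
```

Because the table is declared EXTERNAL, Hive only records the schema and the location. The log files stay where they are, and dropping the table does not delete them.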

Data Definition for CSV Files

Follow these steps to create a data definition for CSV files so that you can perform ad hoc analysis on them:

You can use Query Designer to query the CSV data registered through this Hive data definition.
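For readers who want to see what a CSV data definition amounts to in HiveQL, here is a minimal sketch. The table name sales_csv, its columns, and the HDFS path are hypothetical, and the statement that QueryIO generates from its own steps may differ.

```sql
-- Hypothetical data definition for comma-separated files with a header row.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (
  order_id INT,
  customer STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/csv/sales'
-- Skip the first line of each file, assuming it holds the column names.
TBLPROPERTIES ('skip.header.line.count' = '1');

-- Ad hoc analysis: total amount per customer.
SELECT customer, SUM(amount) AS total_amount
FROM sales_csv
GROUP BY customer;
```

Note that a plain delimited row format splits on every comma, so it does not handle quoted fields that contain commas. For such files, Hive's org.apache.hadoop.hive.serde2.OpenCSVSerde is a common alternative, with the caveat that it reads every column as a string.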


Copyright 2017 QueryIO Corporation. All Rights Reserved.

QueryIO, "Big Data Intelligence" and the QueryIO Logo are trademarks of QueryIO Corporation. Apache, Hadoop and HDFS are trademarks of The Apache Software Foundation.