• SQL vs Hive

    Apache Hive, like QueryIO, supports ad hoc querying and analysis of large sets of structured and unstructured data stored on the Hadoop Distributed File System (HDFS). QueryIO is designed to address many of the limitations of Hive, Apache Pig, and similar Big Data analysis and Extract, Transform, Load (ETL) tools currently available.


    Standard SQL Syntax

    Hive lets you query data using a SQL-like language called HiveQL. QueryIO, by contrast, lets you leverage the vast and mature infrastructure built around Structured Query Language (SQL) and relational databases for your Hadoop analytics needs. Because QueryIO uses standard SQL and stores the processed, structured data in relational databases, it integrates easily with existing Business Intelligence (BI) tools. Hive, on the other hand, uses its own HiveQL dialect rather than standard SQL, which makes integration with BI tools a significant challenge.


    Optimized Storage

    Hive creates duplicate copies of source files on HDFS for processing and also creates temporary files on HDFS to hold intermediate results. The cost is compounded because HDFS replicates every block for data redundancy, so each extra copy is itself stored several times. This duplication is unnecessary and adds unwanted overhead on the cluster. QueryIO, on the other hand, processes the source files on HDFS directly without any data duplication, fully leveraging the MapReduce framework, and writes the results directly to a relational database.


    High Processing Speed

    QueryIO understands different file formats and processes only the files that are relevant when working through a large set of files. For example, if an input directory given to QueryIO for processing CSV files also contains some image files, QueryIO skips the image files. Hive would still process every file at the input location, regardless of whether it conforms to the given data model. QueryIO's format-aware processing therefore avoids wasted work and yields more accurate analysis of unstructured data. In addition, QueryIO is optimized to process multiple files in parallel per mapper, which gives a large speed advantage when processing large sets of files on a cluster.
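
    The format filtering described above can be illustrated with a minimal Python sketch. It is not QueryIO's implementation: the FORMAT_EXTENSIONS mapping, the select_input_files helper, and the sample paths are hypothetical, and matching on file extension is only an assumed stand-in for whatever format detection QueryIO actually performs.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a logical data format to accepted file extensions.
FORMAT_EXTENSIONS = {
    "csv": {".csv"},
    "log": {".log", ".txt"},
}


def select_input_files(paths, data_format):
    """Return only the paths whose extension matches the requested format."""
    wanted = FORMAT_EXTENSIONS[data_format]
    return [p for p in paths if PurePosixPath(p).suffix.lower() in wanted]


# Illustrative directory listing that mixes CSV and image files.
listing = [
    "/data/in/sales_2013.csv",
    "/data/in/logo.png",
    "/data/in/photo.JPG",
    "/data/in/returns.CSV",
]

# Only the two CSV files are selected; the images are skipped.
print(select_input_files(listing, "csv"))
```

    Filtering before any records are parsed means files that cannot match the data model never reach a mapper.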


    Iterative Data Analysis

    Iterative data analysis is another significant challenge with Hive. With QueryIO, the processed data is stored directly in a relational database, so it is readily available for further analysis and filtering. With Hive, you need to process the source data every time you query, which makes data analysis very time-consuming. For example, suppose you are processing machine logs and initially query for rows where CPU > 50%. If you then need to narrow the results to rows where CPU > 70%, Hive has to process all the data again, while with QueryIO you can run SQL queries directly against the already processed results in the database without going back to the HDFS cluster.
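
    The CPU example can be sketched in Python using the standard sqlite3 module as a stand-in for the relational database. The machine_logs table, its columns, the sample rows, and the hosts_above helper are all assumptions made for illustration; QueryIO's actual result schema may differ.

```python
import sqlite3

# In-memory database standing in for the store that holds processed results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machine_logs (host TEXT, cpu REAL)")
conn.executemany(
    "INSERT INTO machine_logs (host, cpu) VALUES (?, ?)",
    [("node1", 45.0), ("node2", 62.5), ("node3", 71.0), ("node4", 88.2)],
)


def hosts_above(threshold):
    """Return hosts whose CPU usage exceeds the threshold, using plain SQL."""
    rows = conn.execute(
        "SELECT host FROM machine_logs WHERE cpu > ? ORDER BY host",
        (threshold,),
    )
    return [host for (host,) in rows]


# First pass: CPU > 50%. Second pass: CPU > 70%, against the same stored rows.
print(hosts_above(50))
print(hosts_above(70))
```

    The second query only changes the threshold in the WHERE clause; the source logs are not read again.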


    On-Ingest Parsing

    QueryIO also offers data tagging and on-ingest metadata parsing, which let you tag your data and query it by user-defined tags or extended metadata. Hive, Pig, and other Hadoop analytics or ETL tools lack these features: they have no concept of data tagging or on-ingest parsing, which is what makes unstructured data searchable on a Hadoop cluster.
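
    As a rough illustration of the idea, the Python sketch below builds a metadata record when a file is ingested and then searches by tag. Everything in it is hypothetical (the parse_on_ingest and find_by_tag helpers, the record fields, the "department" tag, and the sample paths); it does not reflect QueryIO's actual tagging API.

```python
import os


def parse_on_ingest(path, size_bytes, user_tags=None):
    """Build a metadata record for a file at the moment it is ingested."""
    _, ext = os.path.splitext(path)
    record = {
        "path": path,
        "extension": ext.lower().lstrip("."),
        "size_bytes": size_bytes,
    }
    # User-defined tags are stored next to the parsed metadata.
    record.update(user_tags or {})
    return record


def find_by_tag(records, key, value):
    """Return the paths of all records whose tag `key` equals `value`."""
    return [r["path"] for r in records if r.get(key) == value]


# Illustrative catalog built up as files arrive.
catalog = [
    parse_on_ingest("/data/logs/web01.log", 2048, {"department": "ops"}),
    parse_on_ingest("/data/img/scan.png", 51200, {"department": "finance"}),
    parse_on_ingest("/data/logs/db01.log", 4096, {"department": "ops"}),
]

print(find_by_tag(catalog, "department", "ops"))
```

    Because the tags are captured at ingest time, later searches only consult the catalog and do not need to open the files themselves.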


    Usability

    With QueryIO you can process files recursively in any directory and its subdirectories. Hive is limited here: it only accepts the files directly in its input path, which makes it difficult to process files nested within subfolders. Overall, QueryIO is more user-friendly and easier to use than Hive.
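
    The difference between flat and recursive input selection can be shown with a short Python sketch on a local temporary directory. The helper names and the directory layout are hypothetical, and the local file system is only a stand-in for HDFS.

```python
import os
import tempfile


def list_files_flat(root):
    """Return only the files that sit directly in root."""
    return sorted(e.name for e in os.scandir(root) if e.is_file())


def list_files_recursive(root):
    """Return every file under root, including nested subdirectories."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


# Build a small tree: one file at the top level, one nested two levels down.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2013", "jan"))
    for rel in ("top.csv", os.path.join("2013", "jan", "nested.csv")):
        open(os.path.join(root, rel), "w").close()
    flat = list_files_flat(root)
    nested = list_files_recursive(root)

print(flat)    # top-level file only
print(nested)  # top-level file plus the nested one
```

    A flat listing misses the nested file, while the recursive walk picks up both.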