QueryIO: Big Data Analytics - Architecture

QueryIO Architecture

The above figure shows the architecture of QueryIO: Big Data Analytics. QueryIO can be referred to as a software stack of different layers where each layer is a group of several program components. Each layer in the architecture provides different services to the layer just above it. We will examine the features of each layer in detail.

Distributed Storage (HDFS)

The basic layer is HDFS (Hadoop Distributed File System). QueryIO is built on top of HDFS layer. HDFS is a distributed file system designed to run on commodity hardware. It provides high throughput to application data and is suitable for applications having large data sets.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. There are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

QueryIO MetaStore Repository - Data Intelligence

QueryIO key differentiator is the DB-based MetaStore framework. Depending on data and usage, perhaps 80% of the effort goes in to data preparation. It provides an extensive Data Intelligence framework which enables you to build structure/MetaData around your unstructured data as data is getting ingested into HDFS. This has two key advantages:

  1. It enables you to reduce having to repeatedly process raw data for commonly needed attributes. You can progressively build intelligence about your data and store that in MetaStore. This avoids repeated processing of same data. This is similar to Google building indexes for your search data instead of searching the entire Internet when a search is performed.
  2. It enables you to use PostgreSQL to perform data analysis. It reduces the learning curve involved today in learning and implementing Hive/Pig/HBase and also leverages existing SQL skills and BI tools investment.

Distributed Computing Map Reduce v2(YARN)

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers. Map step of MapReduce framework involves master node taking the input, dividing it into smaller sub-problems, and distributing them to worker nodes. Reduce step of MapReduce involves master node then collecting the solutions to all the sub-problems and combining them in some way to form the output.

QueryIO allows traditional programmers to submit their own MapReduce applications. It also provides the status of the application that you submit.

In cases where your existing structured data in the DB is not sufficient, QueryIO runs Hive from your UI to process and analyze data. Hive converts HiveQL (mostly SQL) queries, and generates full-fledged MapReduce job that are run to process raw data. Most of this is done via extensive Web-based UI. Being able to drive MapReduce jobs via PostgreSQL and UI is a great value-add to customer since Hadoop/MR skill sets are hard to come by.

QueryIO Big Data Analytics Services

On top of MapReduce, QueryIO provides a framework which allows you to perform standard SQL queries on your big data. It also provides an easy to use interface through which you can generate SQL queries and design reports to present your processed data.

QueryIO Data Integration Services

Using data integration services of QueryIO you can easily import/export data from/to various data sources. The data sources supported are:

QueryIO Management and Monitoring Services

QueryIO management features make it easy to configure a cluster having a large number of nodes in just a matter of minutes. You can easily perform operations like start, stop, balance, health check etc. on the nodes running in your cluster. It allows you to easily add new machines to your cluster. Using QueryIO, you can easily change the configuration parameters of all the nodes from a very friendly user interface.

QueryIO monitoring services constantly monitor all the nodes and machines in your cluster to check for any failures or alerts. QueryIO performs system monitoring to monitor various machine statistics like disk read/write operations per second, network received/sent bytes per second, etc. These statistics provide you information about the performance of your hardware. QueryIO monitors the status of nodes via JMX. Node related parameters provide you information about the operations performed on the cluster like total file reads/writes etc. QueryIO server automatically notifies the administrators via email in case of any hardware failure or alerts. This ensures that your data is always safe.

QueryIO Client Interfaces

QueryIO exposes various client interfaces which allow user to access HDFS. To see how you can use these client interfaces, refer to the developer documentation.

WebHDFS REST API

Apache Hadoop provides HTTP REST API which allows you to access HDFS using HTTP requests. Using WebHDFS, you can also write non-Java applications to interact with HDFS. The FileSystem scheme of WebHDFS is "webhdfs://". A WebHDFS FileSystem URI has the following format.

webhdfs://<HOST>:<HTTP_PORT>/<PATH>

JAVA API

Apache Hadoop provides DFS client APIs through which you can access HDFS using Java code.

Amazon S3 REST API

QueryIO software stack includes Amazon S3 compatibility API server which allows you to access HDFS using Amazon S3 REST interface.

PostgreSQL

QueryIO provides advanced data tagging feature which allows you to write procedures to process files while they are being imported into HDFS. By default, QueryIO extracts basic metadata for uploaded files from the associated fsImage files. The extracted metadata is stored in a relational database. You can execute standard SQL queries on the extracted metadata.

FTP

QueryIO software stack includes an FTP server which works on top of HDFS. It allows you to connect to HDFS using any FTP client. It also allows to use secure connection over SSL.

Databases in QueryIO

QueryIO uses databases for following modules:

Management and Monitoring

Replicate MetaData

Manage Tags

Data Parsing

MapReduce jobs

Analytics

In a typical Hadoop cluster, total number of files grows to the order of millions over a period of time. Thus with multiple NameNodes having millions of files each, HDFS cluster storage scales horizontally but the namespace does not. In order to scale the name service horizontally, NameNode federation uses multiple independent namespaces. The Namenodes are federated, that is, the Namenodes are independent and don't require coordination with each other. The datanodes are used as common storage for blocks by all the federated Namenodes. Each datanode registers with all the Namenodes in the cluster.
QueryIO supports configuration of one database instance per namespace to support NameNode Federation. User can define a database configuration and link it to a namespace. All the metadata / tags associated with the data in given namespace is stored in this link in database.



Copyright © 2018 QueryIO Corporation. All Rights Reserved.

QueryIO, "Big Data Intelligence" and the QueryIO Logo are trademarks of QueryIO Corporation. Apache, Hadoop and HDFS are trademarks of The Apache Software Foundation.