The above figure shows the architecture of QueryIO: Big Data Analytics. QueryIO can be viewed as a software stack in which each layer is a group of program components, and each layer provides services to the layer just above it. We will examine the features of each layer in detail.
The base layer is HDFS (Hadoop Distributed File System), on top of which QueryIO is built. HDFS is a distributed file system designed to run on commodity hardware. It provides high-throughput access to application data and is suitable for applications with large data sets.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. There are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
QueryIO's key differentiator is its DB-based MetaStore framework. Depending on the data and usage, perhaps 80% of the effort in analytics goes into data preparation. QueryIO provides an extensive Data Intelligence framework that lets you build structure/metadata around your unstructured data as it is ingested into HDFS. This has two key advantages: most of the data preparation happens up front, at ingest time, and the resulting metadata can be analyzed with standard SQL rather than hand-written MapReduce jobs.
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers. In the map step, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. In the reduce step, the master node collects the solutions to all the sub-problems and combines them to form the output.
QueryIO allows programmers to submit their own MapReduce applications, and it reports the status of each application you submit.
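As a minimal sketch of such an application, the classic word count below uses the standard org.apache.hadoop.mapreduce API; the input and output HDFS paths are supplied as arguments, and the class names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: split each input line into words, emit (word, 1).
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce step: sum the counts collected for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }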
In cases where your existing structured data in the DB is not sufficient, QueryIO lets you run Hive from its UI to process and analyze data. Hive translates HiveQL (largely SQL) queries into full-fledged MapReduce jobs that are run to process the raw data. Most of this is done through an extensive Web-based UI. Being able to drive MapReduce jobs via PostgreSQL and the UI is a great value-add for customers, since Hadoop/MapReduce skill sets are hard to come by.
On top of MapReduce, QueryIO provides a framework that lets you run standard SQL queries on your big data. It also provides an easy-to-use interface through which you can build SQL queries and design reports to present the processed data.
Using QueryIO's data integration services, you can easily import and export data from and to a variety of supported data sources.
QueryIO's management features make it easy to configure a cluster with a large number of nodes in a matter of minutes. You can easily perform operations such as start, stop, balance, and health check on the nodes in your cluster, add new machines to the cluster, and change the configuration parameters of all nodes from a friendly user interface.
QueryIO's monitoring services constantly monitor all the nodes and machines in your cluster, checking for failures and alerts. QueryIO performs system monitoring of machine statistics such as disk read/write operations per second and network bytes received/sent per second; these statistics tell you how your hardware is performing. Node status is monitored via JMX, and node-related parameters report operations performed on the cluster, such as total file reads/writes. The QueryIO server automatically notifies administrators by email of any hardware failure or alert, helping to keep your data safe.
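As an illustration of the JMX-based approach, the sketch below polls a NameNode metric over a remote JMX connection using the standard javax.management API. The host, port, and enabled JMX remoting are assumptions about how the node's JVM is configured; the FSNamesystemState MBean and its NumLiveDataNodes attribute are standard Hadoop NameNode metrics.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class NameNodeJmxProbe {
      public static void main(String[] args) throws Exception {
        // Assumed JMX remoting endpoint of the NameNode JVM (host and port are placeholders).
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
          MBeanServerConnection mbsc = connector.getMBeanServerConnection();
          // Hadoop publishes NameNode metrics under the "Hadoop:service=NameNode,..." domain.
          ObjectName fsState = new ObjectName("Hadoop:service=NameNode,name=FSNamesystemState");
          Object liveNodes = mbsc.getAttribute(fsState, "NumLiveDataNodes");
          System.out.println("Live DataNodes: " + liveNodes);
        }
      }
    }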
QueryIO exposes various client interfaces that allow users to access HDFS. To see how you can use these client interfaces, refer to the developer documentation.
Apache Hadoop provides an HTTP REST API, WebHDFS, which allows you to access HDFS using HTTP requests. Using WebHDFS, you can also write non-Java applications that interact with HDFS. The FileSystem scheme of WebHDFS is "webhdfs://". A WebHDFS FileSystem URI has the following format:
webhdfs://<HOST>:<HTTP_PORT>/<PATH>
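The operations themselves are plain HTTP calls against the /webhdfs/v1 path; for example, a directory listing is a GET with op=LISTSTATUS. The sketch below uses Java's built-in HTTP client, with the NameNode host, HTTP port, and path as placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHdfsList {
      public static void main(String[] args) throws Exception {
        // WebHDFS REST call: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS
        URI uri = URI.create("http://namenode-host:50070/webhdfs/v1/user/demo?op=LISTSTATUS");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response body is a JSON FileStatuses object describing the directory contents,
        // which is why non-Java applications can consume it just as easily.
        System.out.println(response.body());
      }
    }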
Apache Hadoop provides DFS client APIs through which you can access HDFS using Java code.
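A minimal sketch of writing and reading a file through the standard org.apache.hadoop.fs.FileSystem API; the NameNode URI and file path are assumptions for your cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // assumed NameNode URI
        try (FileSystem fs = FileSystem.get(conf)) {
          Path file = new Path("/user/demo/hello.txt");
          // Write a file; the NameNode assigns blocks and DataNodes store them.
          try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
          }
          // Read the file back through the same client API.
          try (BufferedReader in = new BufferedReader(
              new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
          }
        }
      }
    }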
The QueryIO software stack includes an Amazon S3-compatible API server, which allows you to access HDFS using the Amazon S3 REST interface.
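Since the server speaks the S3 REST protocol, a standard S3 client should work once pointed at it. The sketch below uses the AWS SDK for Java v1; the endpoint URL, credentials, and bucket name are assumptions that depend on your QueryIO configuration.

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class QueryIoS3Client {
      public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            // Point the client at the S3-compatible server instead of AWS (assumed address).
            .withEndpointConfiguration(
                new EndpointConfiguration("http://queryio-host:9090", "us-east-1"))
            .withCredentials(new AWSStaticCredentialsProvider(
                new BasicAWSCredentials("accessKey", "secretKey"))) // assumed credentials
            .withPathStyleAccessEnabled(true) // bucket name in the path, not the hostname
            .build();
        // Objects map to files in HDFS behind the S3 interface.
        s3.putObject("demo-bucket", "hello.txt", "hello via S3 API");
        System.out.println(s3.getObjectAsString("demo-bucket", "hello.txt"));
      }
    }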
QueryIO provides an advanced data tagging feature that allows you to write procedures to process files as they are imported into HDFS. By default, QueryIO extracts basic metadata for uploaded files from the associated fsImage files. The extracted metadata is stored in a relational database, where you can run standard SQL queries on it.
The QueryIO software stack includes an FTP server that works on top of HDFS. It allows you to connect to HDFS using any FTP client, and it also supports secure connections over SSL.
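As an illustration, the sketch below lists an HDFS directory through the FTP interface using Apache Commons Net. The host, port, and credentials are placeholders, and FTPSClient could be substituted for a connection secured over SSL.

    import org.apache.commons.net.ftp.FTPClient;
    import org.apache.commons.net.ftp.FTPFile;

    public class HdfsOverFtp {
      public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        try {
          ftp.connect("queryio-host", 21);  // assumed FTP server address
          ftp.login("admin", "password");   // assumed credentials
          ftp.enterLocalPassiveMode();
          // FTP paths map directly onto the HDFS namespace.
          for (FTPFile f : ftp.listFiles("/user/demo")) {
            System.out.println(f.getName());
          }
          ftp.logout();
        } finally {
          if (ftp.isConnected()) {
            ftp.disconnect();
          }
        }
      }
    }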
QueryIO uses databases for the following modules:
Node Configuration and Management
This module maintains the cluster configuration information.
Monitoring
QueryIO constantly monitors all the configured nodes and systems, checking parameters such as file read/write operations, disk status, etc. All of this monitoring data is saved in the database, and QueryIO charts it in the node details and dashboard views. Using this feature, you can make sure that your cluster stays up and running efficiently at all times.
HDFS Metadata
HDFS metadata refers to the file system metadata in HDFS. QueryIO automatically extracts the namespace metadata for files that are imported into HDFS and saves it in the database, so that you can query it for information such as file permissions, compression type, etc.
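Because the extracted metadata lives in a relational database, plain JDBC is enough to query it. In the sketch below, the JDBC URL and credentials are assumptions, and the hdfs_metadata table and its columns are a hypothetical schema for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MetadataQuery {
      public static void main(String[] args) throws Exception {
        // Assumed JDBC URL and credentials for the MetaStore database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://queryio-host:5432/metastore", "admin", "password");
             Statement stmt = conn.createStatement();
             // Hypothetical table and columns holding extracted HDFS namespace metadata.
             ResultSet rs = stmt.executeQuery(
                 "SELECT file_path, permissions, compression_type FROM hdfs_metadata "
                 + "WHERE file_size > 1048576")) {
          while (rs.next()) {
            System.out.println(rs.getString("file_path") + " " + rs.getString("permissions"));
          }
        }
      }
    }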
Tags are either user-specified or extracted from file content, and they can be set on-ingest or post-ingest. Tags are used to create extended metadata.
Extended Metadata
Extended metadata refers to file metadata that is not interpreted by the file system; it is used to associate files with extra information. Typical uses include storing the author of a document, the character encoding of a plain-text document, a checksum, or a digital signature.
Data Parsing
You must define content parsers before QueryIO can parse and extract data. If any parsers are configured, QueryIO executes them to extract the required information from imported files. The extracted information is then saved in the database.
MapReduce Jobs
You can use databases in QueryIO as an output for your MapReduce jobs.
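One way to do this with stock Hadoop is org.apache.hadoop.mapreduce.lib.db.DBOutputFormat. In the sketch below, the JDBC driver, URL, credentials, and the word_counts table are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    public class DbOutputJobSetup {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Assumed JDBC driver, URL, and credentials for the target database.
        DBConfiguration.configureDB(conf,
            "org.postgresql.Driver",
            "jdbc:postgresql://queryio-host:5432/results",
            "admin", "password");
        Job job = Job.getInstance(conf, "mapreduce with DB output");
        job.setOutputFormatClass(DBOutputFormat.class);
        // Reduce output rows go into the hypothetical word_counts(word, total) table.
        DBOutputFormat.setOutput(job, "word_counts", "word", "total");
        return job;
      }
    }

The job's reduce output key class would also need to implement DBWritable so that each record can be written as a table row.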
Analytics
Using the Analytics feature in QueryIO, you can easily construct SQL queries and execute them on the database.
In a typical Hadoop cluster, the total number of files grows to the order of millions over a period of time. With a single NameNode holding the namespace for millions of files in memory, HDFS cluster storage scales horizontally but the namespace does not.
In order to scale the name service horizontally, NameNode federation uses multiple independent NameNodes/namespaces. The NameNodes are federated; that is, they are independent and do not require coordination with each other.
The DataNodes are used as common storage for blocks by all the federated NameNodes. Each DataNode registers with all the NameNodes in the cluster.
To support NameNode federation, QueryIO allows one database instance to be configured per namespace. You can define a database configuration and link it to a namespace; all the metadata and tags associated with the data in that namespace are stored in its linked database.