Introduction

QueryIO lets you associate Data Tags with files. You can configure a database instance for each Namespace, and the tags associated with each file are stored in that database. This lets you search for files by the tags they carry: you run standard SQL queries on the database, specify the filters you need, and retrieve the list of files that match.

QueryIO stores two types of tags for a file:

HDFS Core Metadata
Data Tags

Typically, the number of files stored in HDFS grows into the millions over time. It is recommended that you configure multiple database instances so that database instances and Namenodes have a one-to-one mapping with each other.

HDFS Core Metadata

Metadata (metacontent) is data that provides information about one or more aspects of other data. The namespace of the entire filesystem, including the mapping of blocks to files and the file system properties, is stored in HDFS. Core Metadata refers to these properties. When you upload files to QueryIO, it automatically extracts this HDFS metadata for those files and stores it in the associated table in the database. Using the extracted metadata you can also reconstruct the file system namespace, so that you do not lose access to your data even if your Namenode crashes.
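As a rough illustration of what reconstructing a namespace from stored metadata can mean, the sketch below rebuilds a directory listing from a set of absolute file paths. It is not QueryIO's recovery mechanism, which this page does not describe; the function name `build_namespace` and the sample paths are assumptions made for the example, and only the directory structure is recovered, not block locations.

```python
from collections import defaultdict


def build_namespace(file_paths):
    """Map every directory to the sorted names of its direct children."""
    tree = defaultdict(set)
    for path in file_paths:
        parent = "/"
        # Walk the path components, registering each one under its parent.
        for name in (part for part in path.split("/") if part):
            tree[parent].add(name)
            parent = parent.rstrip("/") + "/" + name
    return {directory: sorted(children) for directory, children in tree.items()}


# Hypothetical paths standing in for file paths read back from the database.
paths = ["/logs/2017/app.log", "/logs/2017/db.log", "/reports/q1.pdf"]
namespace = build_namespace(paths)

for directory in sorted(namespace):
    print(directory, "->", namespace[directory])
```

A flat list of absolute paths is enough to recover the directory tree, which is why keeping the file path alongside the other properties is useful.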

The database schema for storing HDFS metadata is as follows:

Parameter | Description | Data Type | Column Name
File path | Absolute path of the file on the cluster | String | FILEPATH
Access time | The time when the file was last accessed | Timestamp | ACCESSTIME
Modification time | The time when the file was last modified | Timestamp | MODIFICATIONTIME
Owner | Name of the file owner | String | OWNER
User group | User group to which the file belongs | String | USERGROUP
Permission | Permissions of the file | String | PERMISSION
Block size | Size of the blocks of the file | Integer | BLOCKSIZE
Replication | Replication count for the file | Integer | REPLICATION
Length | Length of the file in bytes | Integer | LEN
Compression Type | Compression algorithm used for the file. Supported algorithms are SNAPPY, GZ, LZ4. | String | COMPRESSION_TYPE
Encryption Type | Encryption algorithm used for the file. The supported algorithm is AES256. | String | ENCRYPTION_TYPE
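
The columns listed above can be filtered with ordinary SQL. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the database configured in QueryIO. The table name `hdfs_metadata`, the sample rows, and the format of the permission string are assumptions made for illustration; only the column names come from the schema above.

```python
import sqlite3

# In-memory SQLite stands in for the database instance configured in QueryIO.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hdfs_metadata (  -- hypothetical table name
        FILEPATH TEXT, ACCESSTIME TIMESTAMP, MODIFICATIONTIME TIMESTAMP,
        OWNER TEXT, USERGROUP TEXT, PERMISSION TEXT,
        BLOCKSIZE INTEGER, REPLICATION INTEGER, LEN INTEGER,
        COMPRESSION_TYPE TEXT, ENCRYPTION_TYPE TEXT
    )
    """
)

# Made-up sample rows; real rows would be written by QueryIO on upload.
rows = [
    ("/data/a.csv", "2017-01-02 10:00:00", "2017-01-01 09:00:00",
     "alice", "analysts", "rw-r--r--", 134217728, 3, 5000000, "SNAPPY", "AES256"),
    ("/data/b.csv", "2017-01-03 11:00:00", "2017-01-02 08:00:00",
     "alice", "analysts", "rw-r--r--", 134217728, 3, 200000, "GZ", None),
    ("/data/c.csv", "2017-01-04 12:00:00", "2017-01-03 07:00:00",
     "bob", "ops", "rw-------", 134217728, 2, 9000000, "LZ4", None),
]
conn.executemany(
    "INSERT INTO hdfs_metadata VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
)

# Find files owned by a given user that are larger than 1 MB, largest first.
large_files = conn.execute(
    "SELECT FILEPATH, LEN FROM hdfs_metadata "
    "WHERE OWNER = ? AND LEN > ? ORDER BY LEN DESC",
    ("alice", 1000000),
).fetchall()

print(large_files)
```

Parameter placeholders (`?`) keep the filter values separate from the SQL text, which is a sensible habit whichever database backs the Namespace.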

Data Tags

Every type of file has associated properties that can help make it searchable. These properties include things such as the name of the author or the date the file was last modified. They can also be specific to the file type, such as the aspect ratio or dimensions of an image file. QueryIO lets users associate files in HDFS with these Data Tags, which are not interpreted by HDFS.

QueryIO uses Apache Tika to parse and extract Data Tags from various files. The following are some of the supported document formats whose parsers are pre-configured in QueryIO:

You can also register your own metadata parser to support other file formats.

If you do not want QueryIO to extract metadata from files during ingestion, you can disable the registered parser in the "Data Tag Parsers" view.

When you upload any file to HDFS using QueryIO, it automatically parses the file and stores the extracted Data Tags in the associated table (specific to file type) in the database.

For instance, if you upload a PDF file to QueryIO, information such as the author, subject and number of pages is extracted from that file. All of this extracted information is saved in the respective columns of the 'datatags_pdf' table in the database. You can view the schema of such a table from the "Manage Datasource" view. The following is an example showing some of the columns / fields of the 'datatags_pdf' table.
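
To make the idea of searching by extracted tags concrete, here is a minimal sketch that queries a table shaped like 'datatags_pdf'. It uses an in-memory SQLite database as a stand-in for the configured database. The column names `FILEPATH`, `AUTHOR`, `SUBJECT` and `PAGE_COUNT` and the sample rows are assumptions for illustration; check the actual schema in the "Manage Datasource" view.

```python
import sqlite3

# In-memory SQLite stands in for the database that holds the Data Tags.
conn = sqlite3.connect(":memory:")

# Hypothetical subset of columns; the real table may name them differently.
conn.execute(
    "CREATE TABLE datatags_pdf ("
    "FILEPATH TEXT, AUTHOR TEXT, SUBJECT TEXT, PAGE_COUNT INTEGER)"
)
conn.executemany(
    "INSERT INTO datatags_pdf VALUES (?, ?, ?, ?)",
    [
        ("/docs/design.pdf", "J. Smith", "Architecture", 42),
        ("/docs/memo.pdf", "J. Smith", "Status", 2),
        ("/docs/manual.pdf", "A. Jones", "User guide", 120),
    ],
)

# Find PDFs by a given author that have at least ten pages.
pdf_matches = conn.execute(
    "SELECT FILEPATH FROM datatags_pdf "
    "WHERE AUTHOR = ? AND PAGE_COUNT >= ? ORDER BY FILEPATH",
    ("J. Smith", 10),
).fetchall()

print(pdf_matches)
```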


QueryIO supports custom data tags: meta-information attached to a file that the user can define. These tags are stored in the datatags table for the file's particular type. They make searching easier because you can use words or even phrases that make sense to you. You can think of these tags as keywords.

The value of a user-defined tag can be a constant, a global function, or an operator applied to any table column. You can also define conditions that determine which data is tagged. Data tagging can be scheduled on-ingest or post-ingest.
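
The kinds of tag value described above (a constant, or an expression over a column, optionally guarded by a condition) can be modelled roughly as follows. This is a conceptual sketch only, not QueryIO's tagging engine or its configuration format; `TagRule`, `tag_row` and the tag names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


@dataclass
class TagRule:
    """A hypothetical tag definition: a name, a value expression, a condition."""
    name: str
    value: Callable[[Row], Any]
    condition: Optional[Callable[[Row], bool]] = None

    def apply(self, row: Row) -> Optional[Any]:
        # Skip the tag when a condition is defined and the row does not meet it.
        if self.condition is not None and not self.condition(row):
            return None
        return self.value(row)


def tag_row(row: Row, rules: List[TagRule]) -> Dict[str, Any]:
    """Evaluate every rule against one row and collect the tags that apply."""
    tags = {}
    for rule in rules:
        result = rule.apply(row)
        if result is not None:
            tags[rule.name] = result
    return tags


rules = [
    # Constant value, applied unconditionally.
    TagRule("SOURCE", value=lambda row: "sensor-feed"),
    # Operator on a column: file length converted to whole kilobytes.
    TagRule("SIZE_KB", value=lambda row: row["LEN"] // 1024),
    # Conditional tag: only set for files larger than 1 MB.
    TagRule("IS_LARGE", value=lambda row: True,
            condition=lambda row: row["LEN"] > 1000000),
]

tags = tag_row({"FILEPATH": "/data/a.csv", "LEN": 2048}, rules)
print(tags)
```

Keeping the condition separate from the value expression mirrors the description above, where the condition decides whether data is tagged and the value decides what the tag holds.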

You can define data tags using a schema you have already defined with Hive DDL, or you can choose one of the system-defined schemas for different file formats.

To add a data tag, navigate to the Data > Data Tagging view.

The following example guides you through adding a data tag on the Hive table hivecsvtable1 that finds the average CPU reading in a particular file for host 192.168.0.4.
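
The computation behind such a tag amounts to an average over the rows for one host, grouped by file. The sketch below shows that calculation with an in-memory SQLite table standing in for the Hive table. The column names `FILEPATH`, `IP` and `CPU` and the sample readings are assumptions; the actual columns of hivecsvtable1 are not given here.

```python
import sqlite3

# In-memory SQLite stands in for the Hive table; the column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hivecsvtable1 (FILEPATH TEXT, IP TEXT, CPU REAL)")
conn.executemany(
    "INSERT INTO hivecsvtable1 VALUES (?, ?, ?)",
    [
        ("/data/m1.csv", "192.168.0.4", 40.0),
        ("/data/m1.csv", "192.168.0.4", 60.0),
        ("/data/m1.csv", "192.168.0.5", 90.0),  # other host, excluded by the filter
        ("/data/m2.csv", "192.168.0.4", 10.0),
    ],
)

# Average CPU reading per file, restricted to the host of interest.
avg_cpu = dict(
    conn.execute(
        "SELECT FILEPATH, AVG(CPU) FROM hivecsvtable1 "
        "WHERE IP = ? GROUP BY FILEPATH",
        ("192.168.0.4",),
    ).fetchall()
)

print(avg_cpu)
```

Each resulting value is what would be stored as the tag for the corresponding file.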

You can use Data Import to import data into the cluster, and then check the tags added to the metadata table using Query Designer.


Copyright © 2017 QueryIO Corporation. All Rights Reserved.

QueryIO, "Big Data Intelligence" and the QueryIO Logo are trademarks of QueryIO Corporation. Apache, Hadoop and HDFS are trademarks of The Apache Software Foundation.