Hadoop (HIVE and/or HDFS) Masking - What is required

Article ID: 214226


Updated On:

Products

CA Test Data Manager (Data Finder / Grid Tools)

Issue/Introduction

We would like to do a PoC that will involve the following within Hadoop (HIVE and/or HDFS):

  1. Create connections to Hadoop environments
  2. Create data models for data in HDFS/HIVE
  3. Tag/mark PHI/PII data
  4. Assign masking functions 
  5. Execute Masking jobs

Please provide us with the information needed to get started and the latest set of JAR files that we must install. What level of access does TDM require in the environment/Edge Node/server, etc.?

Also, a step-by-step process a user must go through, including the interface details, to mask data within Hadoop environments would be helpful.

Environment

Release: All supported releases of TDM

Component: Hadoop Integration

Resolution

 

Looking at the Supported Data Sources - Non-Relational Data Sources documentation, support for Hadoop (Hive) is very limited:

  • Dynamic Test Data Reservation - TDM Portal = Not Supported
  • Data Generation - TDM Portal = Not Supported
  • Data Generation - Datamaker = Not Supported
  • Data Masking = Certified
  • Data Subsetting = Not Supported
  • Test Match = Not Supported
  • Virtual Test Data Management = Not Supported
  • Data Modelling and PII Audit = Not Supported

 

To help set expectations for your PoC, and to answer your specific questions:

  1. Create connections to Hadoop environments

    Answer - TDM components do not interface directly with the Hadoop environment, so there are no connection profiles in TDM that need to be configured. For more information about how to set up the Hadoop environment for masking, see Mask Data Stored in Hadoop.


  2. Create data models for data in HDFS/HIVE

    Answer - Not Supported. You do not need to create a data model in TDM Datamaker or TDM Portal.

  3. Tag/mark PHI/PII data

    Answer - Not Supported.

  4. Assign masking functions

    Answer - Hadoop masking is supported through the provided JAR files, which need to be deployed to the Hadoop environment. Steps for deploying the JAR files used for the supported masking functions are outlined in the TDM documentation; see Mask Data Stored in Hadoop.
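
    For illustration only, deploying and registering a masking UDF in a Hive session follows the standard Hive pattern sketched below. The JAR path, class name, and function name are placeholders, not the actual MaskingSDK values; use the values documented in Mask Data Stored in Hadoop for your version.

        -- Illustrative sketch only: the JAR path, class name, and function name
        -- below are placeholders, not the actual MaskingSDK values.
        ADD JAR /opt/tdm/masking-udf.jar;
        CREATE TEMPORARY FUNCTION mask_string AS 'com.example.masking.MaskStringUDF';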

  5. Execute Masking jobs

    Answer - Masking is executed through the Hive UDFs included in the deployed JAR files, which perform the supported masking functions. See Mask Data Stored in Hadoop.
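
    As a rough sketch (the table, column, and UDF names here are hypothetical), a masking job is simply a Hive query that rewrites the sensitive columns through the registered UDF:

        -- Hypothetical example: patients, ssn, and mask_string are placeholder names.
        INSERT OVERWRITE TABLE patients
        SELECT patient_id,
               mask_string(ssn),   -- PII column rewritten by the registered masking UDF
               admit_date
        FROM   patients;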

  6. Can you point us to the latest set of JAR files that we must install?

    Answer - The required JAR files are included in the MaskingSDK-<version>.zip, which is located in the root directory of your CA TDM Installation media. 

  7. What level of access does the TDM user require in the environment/Edge Node/server, etc.? Also, what is the step-by-step process a user must go through, including the interface details, to mask data within Hadoop environments? Would it be possible for someone to actively guide us in setting up the environment and ensure the setup is correct (steps to be performed)?

    Answer - The JAR files are executed by the Hive UDFs (User Defined Functions) to perform the masking. The stored Hadoop data must be structured and must have a defined schema. The user executes the Hive UDF (provided JAR file) through the Hive query language and accesses the structured data stored in Hadoop. The Hive UDF executes the FDM masking function, which is provided in the masking library, and the structured data is updated as a result. The documentation covers how to deploy the JAR files and documents which FDM masking function corresponds to which Hive UDF for a given masking operation. Therefore, everything is executed from Hive and does not rely on a TDM connection profile for Portal or FDM.
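
    Putting the above together, a minimal end-to-end sketch in HiveQL looks like the following. All object names, paths, and the UDF name are placeholders used only to illustrate the flow; the actual UDF names and their corresponding FDM masking functions are listed in Mask Data Stored in Hadoop.

        -- 1. The HDFS data must be structured and have a schema defined in Hive.
        CREATE EXTERNAL TABLE IF NOT EXISTS customers (
            customer_id INT,
            email       STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/customers';                     -- placeholder HDFS path

        -- 2. Register the masking function shipped in the MaskingSDK JAR.
        ADD JAR /opt/tdm/masking-udf.jar;               -- placeholder path
        CREATE TEMPORARY FUNCTION mask_email AS 'com.example.masking.MaskEmailUDF';

        -- 3. Execute the masking job: the UDF rewrites the PII column in place.
        INSERT OVERWRITE TABLE customers
        SELECT customer_id,
               mask_email(email)
        FROM   customers;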

    A step-by-step guide to setting up masking for your Hadoop environment, and help with the Hadoop masking itself, is something our Solutions Engineering team can assist you with. Reach out to your Broadcom account team and ask them to arrange for a Solutions Engineer to meet with you and review your environment and PoC requirements. The Solutions Engineering team can provide steps for setting this up in your environment to meet your PoC requirements. Another option is to reach out to the TDM User Community to see if anyone has suggestions they can share for masking Hadoop (Hive) data.