FDM masking of Parquet fails with "java.lang.IllegalArgumentException: INT96 is deprecated" error
Article ID: 404499
Products
CA Test Data Manager (Data Finder / Grid Tools)
Issue/Introduction
While attempting to mask multiple Parquet files, the masking job fails with the following error:
java.lang.RuntimeException: java.lang.RuntimeException: Failed to process file: <PATH to Parquet_File.parquet>
Caused by: java.lang.RuntimeException: Failed to process file: <PATH to Parquet_File.parquet>
Caused by: java.lang.IllegalArgumentException: INT96 is deprecated. As interim enable READ_INT96_AS_FIXED flag to read as byte array.
Environment
FDM 4.11.x
Cause
INT96 is a data type that some systems use to represent timestamps in Parquet. It has been deprecated, and writers of Parquet files are encouraged to stop using it. When the masking job encounters an INT96 column, the read fails with the error shown above.
Resolution
TDM Engineering is working to provide support for the INT96 data type in a future release of FDM.
In the meantime, we recommend checking whether DuckDB can read the file that contains the INT96 data type. If it can, try setting the option PARQUET_USE_DUCKDB=Y as a possible workaround.
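One way to run the DuckDB check is sketched below. It is a minimal sketch, not part of FDM: it assumes the third-party duckdb Python package is installed, and the function name can_duckdb_read and its sample_rows parameter are hypothetical names chosen for illustration. It opens the file with duckdb.read_parquet, collects the column names and types, and fetches a few rows to confirm that the data pages (including any INT96 columns) can actually be decoded.

```python
def can_duckdb_read(path, sample_rows=5):
    """Return (True, schema) if DuckDB can read the Parquet file,
    otherwise (False, reason). Hypothetical helper for illustration."""
    try:
        import duckdb  # third-party package: pip install duckdb
    except ImportError:
        return False, "duckdb Python package is not installed"
    try:
        rel = duckdb.read_parquet(path)
        # Column names paired with the types DuckDB assigned to them.
        schema = list(zip(rel.columns, (str(t) for t in rel.types)))
        # Fetch a few rows so that column data is decoded, not just metadata.
        rel.limit(sample_rows).fetchall()
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, schema


# Example usage (replace the placeholder with the path to your file):
# ok, detail = can_duckdb_read("<PATH to Parquet_File.parquet>")
# print(ok, detail)
```

If the function returns True, the timestamp columns in the reported schema should be the ones that were stored as INT96, and the PARQUET_USE_DUCKDB=Y option is worth trying.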
Additional Information
For more about the INT96 data type, see the following Google AI Overview results:
"Understanding INT96 in Parquet files
Parquet files support various data types, and INT96 is one that's specifically used for representing timestamps.
Here's a breakdown of what INT96 means in the context of Parquet:
1. INT96 as a physical type
INT96 is a physical type in Parquet, meaning it describes the underlying storage format for data.
2. INT96 for timestamp representation (legacy/compatibility)
While Parquet has a dedicated TIMESTAMP logical type that's typically implemented over INT64 (representing milliseconds or microseconds since the Unix epoch), some systems, particularly older versions of Apache Spark, Impala, and Hive, use INT96 to store timestamps for historical reasons and compatibility.
These INT96 timestamps are structured as follows:
First 8 bytes: Represent nanoseconds since midnight, stored as a little-endian 64-bit integer.
Last 4 bytes: Represent the Julian day number (a continuous count of days, where 2440588 corresponds to 1970-01-01), stored as a little-endian 32-bit integer.
3. Important considerations and limitations
Deprecated: INT96 is considered a deprecated type in the Parquet specification. New Parquet files should ideally use TIMESTAMP logical types with INT64 for storing timestamps with milliseconds or microseconds precision.
Reading/Writing Challenges:
Interpretation Issues: The interpretation of the first 8 bytes (nanoseconds) within the INT96 structure can vary between different systems, potentially leading to incorrect timestamp values if not handled carefully. For instance, some systems might treat them as unsigned INT64, while others (like Snowflake) treat them as signed INT64.
Compatibility: While some systems like Spark and Hive might default to writing INT96 timestamps, users might need to configure their applications to use the standard TIMESTAMP logical types over INT64 to avoid compatibility issues with other tools.
Time Zones: INT96 timestamps do not include timezone information, according to ASF JIRA.
Performance: Parquet is optimized for read performance, not write operations, and dealing with INT96 timestamps might introduce some overhead in this process.
Debugging: Parquet files are in binary format and not human-readable, which can make debugging INT96 timestamps challenging, according to Edge Delta.
In summary, while INT96 has been used for timestamps in Parquet files by some systems, it's a deprecated and less-preferred method for modern implementations. It's generally recommended to leverage the standard TIMESTAMP logical types over INT64 for better compatibility and fewer potential interpretation challenges."
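To make the layout concrete, the following sketch decodes a single raw 12-byte INT96 value with the Python standard library. It assumes the convention used by Spark, Impala, and Hive: 8 little-endian bytes of nanoseconds since midnight followed by 4 little-endian bytes of Julian day number, with Julian day 2440588 corresponding to 1970-01-01. The function names are hypothetical. The result is a naive datetime because the value carries no time zone, and the second helper shows how the signed and unsigned readings of the nanosecond field can diverge.

```python
import struct
from datetime import datetime, timedelta

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01


def decode_int96_timestamp(raw):
    """Decode a 12-byte INT96 timestamp into a naive datetime."""
    if len(raw) != 12:
        raise ValueError("an INT96 value is exactly 12 bytes")
    # "<qI": little-endian signed 64-bit nanoseconds of day,
    # then little-endian unsigned 32-bit Julian day number.
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return datetime(1970, 1, 1) + timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )


def nanos_signed_and_unsigned(raw):
    """Return the nanosecond field read as signed and as unsigned 64-bit."""
    (signed,) = struct.unpack("<q", raw[:8])
    (unsigned,) = struct.unpack("<Q", raw[:8])
    return signed, unsigned


# One hour past midnight on the day after the Unix epoch.
sample = struct.pack("<qI", 3_600_000_000_000, 2440589)
print(decode_int96_timestamp(sample))

# A nanosecond field with every bit set reads very differently per convention.
odd = b"\xff" * 8 + struct.pack("<I", 2440588)
print(nanos_signed_and_unsigned(odd))
```

For well-formed values the nanosecond field is always smaller than one day, so the signed and unsigned readings agree; they differ only for out-of-range values such as the second sample.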