Cannot read decimal values whose physical types are INT32 and INT64 #3149

Avcu · 2025-02-08T08:04:05Z

Describe the bug, including details regarding any error messages, version, and platform.

Issue

I am writing a Parquet file with Spark in which one of the columns is a decimal. The physical type of this column is INT32 or INT64, depending on its precision. When I read the file with AvroParquetReader, the column comes back as a plain long holding the unscaled value instead of a decimal. For example, if the original value is 23.4, the value read is 234.
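To make the symptom concrete, here is a minimal sketch of how a decimal relates to the unscaled integer stored in an INT32/INT64 column. The helper names `to_unscaled` and `from_unscaled` are made up for illustration and are not part of any Parquet or Avro API. The sketch assumes the value already has at most `scale` fractional digits.

```python
from decimal import Decimal


def to_unscaled(value: Decimal, scale: int) -> int:
    """Encode a decimal as the unscaled integer kept in the column.

    Assumes `value` has at most `scale` fractional digits; any extra
    digits would be truncated by int().
    """
    return int(value.scaleb(scale))


def from_unscaled(unscaled: int, scale: int) -> Decimal:
    """Decode by applying the scale from the column's DECIMAL annotation."""
    return Decimal(unscaled).scaleb(-scale)


stored = to_unscaled(Decimal("23.4"), 1)
print(stored)                    # the raw integer a scale-unaware reader returns
print(from_unscaled(stored, 1))  # the value a decimal-aware reader should return
```

A reader that ignores the annotation returns the first number; applying the scale recovers the second.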

Spark side

If I enable spark.sql.parquet.writeLegacyFormat in Spark (see Jira SPARK-20297), Spark no longer uses INT32/INT64 as the physical type, and I can read the Parquet file successfully. However, this is not the default setting, and according to the decimal documentation in this repo, INT32 and INT64 should be valid physical types for decimals.
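As a rough sketch of the behaviour described here, the function below models which physical type Spark picks for a decimal column. The function name is hypothetical. The thresholds of 9 and 18 digits and the maximum precision of 38 reflect my understanding of Spark's Parquet writer rather than anything stated in this issue, so treat them as assumptions.

```python
def spark_decimal_physical_type(precision: int, write_legacy_format: bool = False) -> str:
    """Model of the Parquet physical type Spark chooses for DECIMAL(precision, _).

    Assumption: with the default (non-legacy) writer, up to 9 digits fit in
    INT32 and up to 18 digits fit in INT64; everything else, and every decimal
    in legacy mode, is written as FIXED_LEN_BYTE_ARRAY.
    """
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if write_legacy_format:
        return "FIXED_LEN_BYTE_ARRAY"
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


# DECIMAL(10,1), as used in the reproduction, with and without the legacy flag
print(spark_decimal_physical_type(10))
print(spark_decimal_physical_type(10, write_legacy_format=True))
```

Under these assumptions, the DECIMAL(10,1) column from the reproduction is written as INT64 by default, which is consistent with the parquet-tools output in this report.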

How to reproduce

Writing with Spark (version: 3.3.0)

df_temp = spark.createDataFrame([
    (120.321, "Alex"), (24.45, "John")],
    schema=["salary", "name"]
)

df_temp.createOrReplaceTempView("companyTable")
df = spark.sql("SELECT *, CAST(salary as DECIMAL(10,1)) as decimal_salary FROM companyTable")
df.show()
df.write.parquet("my_path")

+-------+----+--------------+
| salary|name|decimal_salary|
+-------+----+--------------+
|120.321|Alex|         120.3|
|  24.45|John|          24.5|
+-------+----+--------------+

Confirming the schema

Running parquet-tools on the written file:
parquet-tools inspect github_example.parquet

############ file meta data ############
created_by: parquet-mr version 1.12.2 (build ${buildNumber})
num_columns: 3
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 757


############ Columns ############
salary
name
decimal_salary

############ Column(salary) ############
name: salary
path: salary
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

############ Column(name) ############
name: name
path: name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -5%)

############ Column(decimal_salary) ############
name: decimal_salary
path: decimal_salary
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Decimal(precision=10, scale=1)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: -5%)

Reading with AvroParquetReader

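    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Conversions;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    // The class name is arbitrary; only main() matters for the reproduction.
    public class ReadDecimalExample {
        public static void main(String[] args) {
            String filePath = "my_path";

            // Check that the path points to an existing regular file
            File file = new File(filePath);
            if (!file.exists() || file.isDirectory()) {
                System.err.println("Invalid file path");
                return;
            }

            // Register the decimal conversion so decimal logical types can map to BigDecimal
            GenericData genericData = new GenericData();
            genericData.addLogicalTypeConversion(new Conversions.DecimalConversion());

            Path path = new Path(filePath);
            // try-with-resources closes the reader when done
            try (ParquetReader<GenericRecord> reader = AvroParquetReader
                    .<GenericRecord>builder(HadoopInputFile.fromPath(path, new Configuration()))
                    .withDataModel(genericData)
                    .build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    // Print each record and the Avro schema it was read with
                    System.out.println(record.toString());
                    System.out.println(record.getSchema());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }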

{"salary": 120.321, "name": "Alex", "decimal_salary": 1203}
{"type":"record","name":"spark_schema","fields":[{"name":"salary","type":["null","double"],"default":null},{"name":"name","type":["null","string"],"default":null},{"name":"decimal_salary","type":["null","long"],"default":null}]}
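Until the reader applies the annotation itself, the raw value can be rescaled by hand. The sketch below does this for the record printed above; it is an illustration, not a fix in parquet-avro. `DECIMAL_SCALES` and `restore_decimals` are hypothetical names, and the scale of 1 is copied manually from the parquet-tools output for `decimal_salary`.

```python
import json
from decimal import Decimal

# Hypothetical mapping of column name -> scale, taken by hand from the
# Decimal(precision=10, scale=1) annotation reported by parquet-tools.
DECIMAL_SCALES = {"decimal_salary": 1}


def restore_decimals(record: dict, scales: dict) -> dict:
    """Return a copy of `record` with unscaled integers converted to Decimal."""
    fixed = dict(record)
    for column, scale in scales.items():
        if fixed.get(column) is not None:  # leave nulls untouched
            fixed[column] = Decimal(fixed[column]).scaleb(-scale)
    return fixed


# The record exactly as printed by the reader in this report
printed = '{"salary": 120.321, "name": "Alex", "decimal_salary": 1203}'
record = json.loads(printed)
print(restore_decimals(record, DECIMAL_SCALES)["decimal_salary"])
```

Rescaling 1203 with scale 1 gives 120.3, which matches the value shown by `df.show()` on the Spark side.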

Dependencies

    <dependencies>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-common</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-encoding</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-column</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.4.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.4.1</version>
        </dependency>
    </dependencies>

Artifacts

github_example.parquet.zip

Component(s)

Avro


ConeyLiu · 2025-02-08T09:41:31Z

It looks like AvroParquetReader doesn't handle decimal logical types.

wgtmac · 2025-02-09T07:52:08Z

@ConeyLiu IIUC, parquet-cli (which uses AvroParquetReader) might also hit this issue?

ConeyLiu · 2025-02-10T07:03:11Z

Yes, it should have the same problem. I searched the AvroParquetReader code, and there is no handling for decimals.

Avcu added the Type: bug label Feb 8, 2025
