Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot read decimal values whose physical types are INT32 and INT64 #3149

Open
Avcu opened this issue Feb 8, 2025 · 3 comments
Open

Cannot read decimal values whose physical types are INT32 and INT64 #3149

Avcu opened this issue Feb 8, 2025 · 3 comments

Comments

@Avcu
Copy link

Avcu commented Feb 8, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Issue

I am saving a parquet file with spark where one of the columns is decimal. Physical type of this column becomes INT32 and INT64 based on its precision. Then, when I read the parquet file with AvroParquetReader, I see logical type being long with the wrong value. For example, if original value is 23.4 then read value is 234.

Spark side

If I enable spark.sql.parquet.writeLegacyFormat for the Spark (ex Jira: SPARK-20297), I see that Spark does not use INT32/INT64 as physical type and then I can successfully read the parquet file. However, this is not the default option and also based on the decimal documentation of this repo, INT32/INT64 should be viable options.

How to reproduce

  1. Writing with Spark (version: 3.3.0)

df_temp = spark.createDataFrame([
    (120.321, "Alex"), (24.45, "John")],
    schema=["salary", "name"]
)

df_temp.createOrReplaceTempView("companyTable")
df = spark.sql("SELECT *, CAST(salary as DECIMAL(10,1)) as decimal_salary FROM companyTable")
df.show()
df.write.parquet("my_path")

+-------+----+--------------+
| salary|name|decimal_salary|
+-------+----+--------------+
|120.321|Alex|         120.3|
|  24.45|John|          24.5|
+-------+----+--------------+
  1. Confirming the schema

Running the parquet-tools:
parquet-tools inspect github_example.parquet

############ file meta data ############
created_by: parquet-mr version 1.12.2 (build ${buildNumber})
num_columns: 3
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 757


############ Columns ############
salary
name
decimal_salary

############ Column(salary) ############
name: salary
path: salary
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

############ Column(name) ############
name: name
path: name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -5%)

############ Column(decimal_salary) ############
name: decimal_salary
path: decimal_salary
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Decimal(precision=10, scale=1)
converted_type (legacy): DECIMAL
compression: SNAPPY (space_saved: -5%)
  1. Reading with AvroParquetReader

    public static void main(String[] args) {
        String filePath = "my_path";

        // Check if file exists
        File file = new File(filePath);
        if(!file.exists() || file.isDirectory()) {
            System.err.println("Invalid file path");
            return;
        }

        GenericData genericData = new GenericData();
        genericData.addLogicalTypeConversion(new Conversions.DecimalConversion());

        try {
            Path path = new Path(filePath);
            ParquetReader<GenericRecord> reader = AvroParquetReader
                    .<GenericRecord>builder(HadoopInputFile.fromPath(path, new Configuration()))
                    .withDataModel(genericData)
                    .build();

            GenericRecord record;

            while ((record = reader.read()) != null) {
                // Process the record
                System.out.println(record.toString());
                System.out.println(record.getSchema());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
{"salary": 120.321, "name": "Alex", "decimal_salary": 1203}
{"type":"record","name":"spark_schema","fields":[{"name":"salary","type":["null","double"],"default":null},{"name":"name","type":["null","string"],"default":null},{"name":"decimal_salary","type":["null","long"],"default":null}]}
Dependencies
    <dependencies>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-common</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-encoding</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-column</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.4.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.4.1</version>
        </dependency>
    </dependencies>

Artifacts

github_example.parquet.zip

Component(s)

Avro

@Avcu Avcu added the Type: bug label Feb 8, 2025
@ConeyLiu
Copy link
Contributor

ConeyLiu commented Feb 8, 2025

It looks like ParquetAvroReader doesn't handle decimal logical types.

@wgtmac
Copy link
Member

wgtmac commented Feb 9, 2025

@ConeyLiu IIUC, parquet-cli (which uses ParquetAvroReader) might also hit this issue?

@ConeyLiu
Copy link
Contributor

Yes, it should have the same problem. I searched the code in ParquetAvroReader, and there is not any process for decimal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants