Realism or variety of generated data? #1292

treylav · 2022-11-07T22:16:06Z


Hello.

I wanted to refactor the Russian locale module and ran into an issue that has more to do with the concept of the library than anything else. Let me explain with an example.

Problem

Many Russian tax-related IDs (INN, KPP, OGRN) share the same code, which looks like 5300, where 53 is the region number and 00 is the number of the specific tax office in that region that assigned the ID. A list of all existing codes is available online on the website of the Russian tax authority.
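To make the structure concrete, here is a minimal sketch that splits such a code into its two parts. The helper name `split_office_code` is hypothetical and is not part of the library; it only assumes the four-digit layout described above.

```python
def split_office_code(code: str) -> tuple[str, str]:
    """Split a four-digit code into (region, office) parts.

    Hypothetical helper for illustration; assumes the layout
    'RRNN' (two-digit region, two-digit tax office).
    """
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected four ASCII digits, got {code!r}")
    return code[:2], code[2:]


region, office = split_office_code("5300")
print(region, office)  # prints: 53 00
```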

If I want to implement the generation of such code in a library, I see several ways.

Options

Ignore format, use completely random codes

For example, the author of the current OGRN generation took this approach.

Advantages of this approach:

maximum number of possible codes;
simplicity of the algorithm.

Disadvantages of this approach:

you may generate a code with a non-existent region: for example, there is no region with number 90 in Russia;
you may generate a code with a non-existent tax office.
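A minimal sketch of this option, assuming the code is simply four random digits. The function name `random_office_code` is hypothetical and this is not the library's current implementation.

```python
import random


def random_office_code(rng: random.Random) -> str:
    """Return four random digits, with no check that the region or office exists."""
    # 0..9999 inclusive gives 10,000 possible codes, zero-padded to four digits.
    return f"{rng.randint(0, 9999):04d}"


rng = random.Random(0)  # seeded only to make the example repeatable
code = random_office_code(rng)
print(code)
```

Nothing here prevents a result that starts with 90, which is exactly the drawback listed above.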

Use a set of predefined codes

For example, the author of the current KPP generation took this approach.

The advantage of this approach: the source is 100% real data, so the generated codes are more realistic.
The disadvantage of this approach: the number of possible generated codes is significantly reduced.

If you think this is the best approach, we can improve generation using the official list mentioned above. I dumped the codes from the database as a simple Python tuple. Unfortunately, I cannot suggest how best to store such an array (3409 entries at the time of writing), but together we can find a way to store this data for the library's purposes.
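A minimal sketch of this option, assuming the dumped codes live in a module-level tuple. The names `SAMPLE_CODES` and `predefined_office_code` are hypothetical, and the three values are placeholders that have not been checked against the registry.

```python
import random

# Placeholder values for illustration only; a real table would hold
# the entries dumped from the tax authority's list.
SAMPLE_CODES: tuple[str, ...] = ("5300", "5301", "5302")


def predefined_office_code(
    rng: random.Random, codes: tuple[str, ...] = SAMPLE_CODES
) -> str:
    """Pick one code from a fixed table of known codes."""
    return rng.choice(codes)


rng = random.Random(0)  # seeded only to make the example repeatable
code = predefined_office_code(rng)
print(code)
```

Every result is guaranteed to come from the table, so variety is bounded by the table's size.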

Use real region and random tax office number

This is the approach I took when I started refactoring the module.

The advantages of this approach:

realistic looking codes;
still quite a large number of generated codes.

The disadvantage of this approach: it is still easy to recognize fake data by tax office numbers that are uncharacteristic of small regions.
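A minimal sketch of this option, assuming a table of valid region numbers and a uniformly random two-digit office number. The names `SAMPLE_REGIONS` and `region_based_office_code` are hypothetical, and the region subset is illustrative (53 comes from the example above; 77 and 78 are added here as assumptions).

```python
import random

# Illustrative subset of region numbers, not the full list.
SAMPLE_REGIONS: tuple[str, ...] = ("53", "77", "78")


def region_based_office_code(
    rng: random.Random, regions: tuple[str, ...] = SAMPLE_REGIONS
) -> str:
    """Combine a known region number with a random two-digit office number."""
    region = rng.choice(regions)
    office = rng.randint(0, 99)  # may not correspond to a real tax office
    return f"{region}{office:02d}"


rng = random.Random(0)  # seeded only to make the example repeatable
code = region_based_office_code(rng)
print(code)
```

The region part is always valid, but the office part is drawn from the full 00 to 99 range regardless of how many offices a region really has.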

Conclusion

Personally, I'll be glad if we find a way to store an array of real codes from the database of the Russian tax authority.

The problem from the example may seem insignificant, but I would still like to understand, before contributing, how to balance the realism and variety of the generated data.

Thanks in advance.

lk-geimfari · 2022-11-08T05:59:43Z

Maintainer

I see no problem in creating the list of real codes. Feel free to create a PR.
