Realism or variety of generated data? #1292
Hello. I wanted to refactor the Russian locale module and ran into an issue that has more to do with the concept of the library than with anything else. Let me explain with an example.

Problem

Many Russian tax-related IDs (INN, KPP, OGRN) use the same code, which combines a region with a tax office number. If I want to implement the generation of such a code in the library, I see several ways.

Options

Ignore format, use completely random codes

For example, the author of the current OGRN generation implementation chose this way. The advantage of this approach:
Disadvantages of this approach:
Use a set of predefined codes

For example, the author of the current KPP generation implementation chose this way.
If you think this is the best approach, we can improve the generation process using the official database mentioned above. I dumped the codes from the database as a simple Python tuple. Unfortunately, I cannot suggest how best to store such an array (3409 entries at the time of writing), but together we can find a way to store this data for the purposes of the library.

Use real region and random tax office number

This is the way I chose when I started refactoring the module. The advantages of this approach:
The disadvantage of this approach: it is still easy to recognize fake data by tax office numbers that are uncharacteristic of small regions.

Conclusion

Personally, I'll be glad if we find a way to store an array of real codes from the database of the Russian tax authority. The problem in this example may seem insignificant, but before contributing I would still like to understand how to balance the realism and variety of the generated data. Thanks in advance.
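To make the comparison concrete, here is a rough Python sketch of the three options. It is not code from the library: it assumes the code is four digits (two for the region, two for the tax office), the names `REAL_CODES`, `REGION_CODES`, and all three functions are hypothetical, and the sample values are placeholders rather than entries dumped from the official database.

```python
import random

# Placeholder data for illustration only, NOT a dump of the official database.
REAL_CODES = ("7701", "7702", "5001")  # hypothetical full codes
REGION_CODES = ("77", "50", "78")      # hypothetical region prefixes


def random_code(rng: random.Random) -> str:
    """Option 1: ignore the format and emit any four digits."""
    return f"{rng.randrange(10_000):04d}"


def predefined_code(rng: random.Random) -> str:
    """Option 2: pick one entry from a fixed set of known codes."""
    return rng.choice(REAL_CODES)


def region_plus_random_office(rng: random.Random) -> str:
    """Option 3: real region prefix plus a random two-digit office number."""
    region = rng.choice(REGION_CODES)
    office = rng.randrange(1, 100)  # 01..99, assumed range
    return f"{region}{office:02d}"


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    print(random_code(rng))
    print(predefined_code(rng))
    print(region_plus_random_office(rng))
```

Option 2 can only ever return values from the stored set, so its variety is bounded by the size of that set, while option 3 can still produce an office number that does not exist in a given region.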
Replies: 1 comment
I see no problem in creating the list of real codes. Feel free to open a PR.