- Unreleased
- 1.6.1 (2024-01-12)
    - Bug Fix: Prevent amnesia that causes multiple type mismatch warnings.
        - If a data set contains multiple records in which the types of a column do not
          match each other, then the old code would remove the corresponding
          internal `schema_entry` for that column, and print a warning message.
        - This meant that subsequent records would recreate the `schema_entry`, and a subsequent mismatch would print another warning message.
        - This also meant that if there was a second record after the most recent mismatch, the script would output a schema entry for the mismatching column, corresponding to the type of the last record which was not marked as a mismatch.
        - The fix is to use a tombstone entry for the offending column, instead
          of deleting the `schema_entry` completely. Only a single warning message is printed, and the column is ignored for all subsequent records in the input data set.
        - See [Issue#98](https://github.com/bxparks/bigquery-schema-generator/issues/98), which identified this problem. It seems to have existed from the very beginning.
- 1.6.0 (2023-04-01)
    - Allow `null` fields to convert to `REPEATED` because `bq load` seems to interpret null fields to be equivalent to an empty array `[]`. See #90.
    - Add `input_format='csvdictreader'` option. Similar to `'dict'` but intended to be used with the `csv.DictReader` class to read CSV and TSV files with various options. More documentation and discussions at:
- 1.5.1 (2022-12-04)
    - Add `examples/*.py` to demonstrate how to use `SchemaGenerator` as a library.
    - Update README.md to state that `bq load --autodetect` uses the first 500 records. Previously, it scanned only the first 100 records.
    - This is a maintenance release with no new features or bug fixes.
- 1.5 (2021-11-14)
    - Make the column order in the BQ schema file match the order of appearance
      in the JSON data file using the `--preserve_input_sort_order` flag. Thanks to kdeggelman@ in PR#75.
- 1.4.1 (2021-08-23)
    - Add documentation for the `input_format='dict'` option.
    - Add additional input format 'json' and 'dict' test cases.
    - Maintenance release, no functional change in core code.
- 1.4 (2020-12-09)
    - Add 'dict' as a third `input_format` when `SchemaGenerator` is used as a library. This can be useful when the data has already been transformed into a list of native Python `dict` objects (see #58, thanks to ZiggerZZ@).
    - Expand the pattern matchers for quoted integers and quoted floating point
      numbers to be more compatible with the patterns recognized by `bq load --autodetect`.
    - Add Table of Contents to README.md. Add usage info for the `schema_map=existing_schema_map` and the `input_format='dict'` parameters in the `SchemaGenerator()` constructor.
- 1.3 (2020-12-05)
    - Allow an existing schema file to be specified using the `--existing_schema_path` flag, so that new data can be merged into it. See #40, #57, and #61. (Thanks to abroglesc@ and bozzzzo@).
- 1.2 (2020-10-27)
    - Print full path of nested JSON elements in error messages (See #52; thanks abroglesc@).
- 1.1 (2020-07-10)
    - Add `--ignore_invalid_lines` to ignore parsing errors on invalid lines and continue processing. Fixes #49.
    - Add GitHub actions for automated tests and flake8 validation.
    - Add package `__version__` string.
    - Update setup.py, no longer need to convert README.md markdown to RST.
- 1.0 (2020-04-04)
    - Fix `--sanitize_names` for recursive RECORD fields (Thanks riccardomc@, see #43).
    - Clean up how unit tests are run, trying my best to figure out Python's convoluted package importing mechanism.
    - Add GitHub Actions continuous integration pipelines with flake8 checks and automated unit testing.
- 0.5.1 (2019-06-17)
    - Add `--sanitize_names` to convert invalid characters in column names and to shorten them if too long. (See #33; thanks jonwarghed@).
- 0.5 (2019-06-06)
    - Add input and output parameters to `run()` to allow the client code using `SchemaGenerator` to redirect the input and output files. (See #30).
    - Remove fields with incompatible types (or other errors) from the generated schema, instead of picking the type of the first encounter. (See #31).
    - Improve internal data validation handling, reserving exceptions for programming errors only.
- 0.4 (2019-03-06)
    - Support CSV input files using the `--input_format` flag. Preserve the ordering of fields in the schema file for CSV.
    - Implement `--infer_mode` flag for CSV files so that fields that are present in all input records are marked as `REQUIRED` in the schema (Thanks korotkevics@, see #28).
- 0.3.2 (2019-02-24)
    - Add `--quoted_values_are_strings` flag to force quoted values (integers, floats, booleans) to be interpreted as a `STRING`. (Thanks de-code@, see #22).
- 0.3.1 (2019-01-18)
    - Infer integers that overflow signed 64-bits to be `FLOAT` for consistency with `bq load`. (Fixes #18)
    - Support 'UTC' suffix in TIMESTAMP fields, in addition to 'Z'. (Fixes #19)
- 0.3 (2018-12-17)
    - Tighten TIMESTAMP and DATE validation (thanks jtschichold@).
    - Inspect the internals of STRING values to infer BOOLEAN, INTEGER or FLOAT types (thanks jtschichold@).
    - Handle conversion of these string types when mixed with their non-quoted equivalents, matching the conversion logic followed by 'bq load'.
- 0.2.1 (2018-07-18)
    - Add `anonymizer.py` script to create anonymized data files for benchmarking.
    - Add benchmark numbers to README.md.
    - Add `DEVELOPER.md` file to record how to upload to PyPI.
    - Fix some minor warnings from pylint3.
- 0.2.0 (2018-02-10)
    - Add support for `DATE` and `TIME` types.
    - Update type conversion rules to be more compatible with bq load.
        - Allow `DATE`, `TIME` and `TIMESTAMP` to gracefully degrade to `STRING`.
        - Allow type conversions of elements within arrays
          (e.g. array of `INTEGER` and `FLOAT`, or array of mixed `DATE`, `TIME`, or `TIMESTAMP` elements).
        - Better detection of invalid values (e.g. arrays of arrays).
- 0.1.6 (2018-01-26)
    - Pass along command line arguments to `generate-schema`.
- 0.1.5 (2018-01-25)
    - Updated installation instructions for MacOS.
- 0.1.4 (2018-01-23)
    - Attempt #3 to fix exception during pip3 install.
- 0.1.3 (2018-01-23)
    - Attempt #2 to fix exception during pip3 install.
- 0.1.2 (2018-01-23)
    - Attempt to fix exception during pip3 install. Didn't work. Pulled.
- 0.1.1 (2018-01-03)
    - Install `generate-schema` script in `/usr/local/bin`.
- 0.1 (2018-01-02)
    - Initial release to PyPI.
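
The tombstone fix described under 1.6.1 can be illustrated with a minimal, self-contained sketch. This is not the code in bigquery-schema-generator: `TOMBSTONE`, `infer_type()`, and `deduce_types()` are hypothetical names chosen for illustration, and the type inference is deliberately simplified so that any difference in type counts as a mismatch.

```python
# Hypothetical sentinel: marks a column as permanently ignored.
TOMBSTONE = object()


def infer_type(value):
    """Map a Python value to a simplified BigQuery-like type name."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def deduce_types(records):
    """Return ({column: type}, warnings) for an iterable of dict records."""
    schema = {}
    warnings = []
    for line, record in enumerate(records, start=1):
        for column, value in record.items():
            new_type = infer_type(value)
            old_type = schema.get(column)
            if old_type is TOMBSTONE:
                # Already reported once; ignore this column from now on.
                continue
            if old_type is None:
                schema[column] = new_type
            elif old_type != new_type:
                warnings.append(
                    f"line {line}: type mismatch for column '{column}': "
                    f"{old_type} vs {new_type}"
                )
                # Leave a marker instead of deleting the entry, so that
                # later records cannot silently recreate it.
                schema[column] = TOMBSTONE
    # Tombstoned columns are omitted from the final result.
    result = {c: t for c, t in schema.items() if t is not TOMBSTONE}
    return result, warnings


records = [
    {"a": 1, "b": "x"},
    {"a": "hello", "b": "y"},  # first mismatch for "a"
    {"a": 2},                  # would recreate "a" if the entry had been deleted
    {"a": "world"},            # would trigger a second warning without the tombstone
]
schema, warnings = deduce_types(records)
```

In this sketch, deleting the entry on a mismatch (the behavior the changelog attributes to the old code) would let the third record recreate column `a`, and the fourth record would then produce a second warning. With the sentinel in place, only one warning is recorded and `a` stays out of the result.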