All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.5.0).
- Upgrade to support only Python 3.13 (#203)
This is a major release with a lot of breaking changes but most changes are easy to fix.
In addition to the items below, see the Release Candidates entries for changes since 4.x.
- Add support for urllib3 2.3.x #243
- Mark library as typed and fix sdist content (#241)
- Upgrade wombat to 3.8.7 (#239)
- Fix wombatSetup.js location in wheel (#236)
This is a major release with a lot of breaking changes but most changes are easy to fix.
It focuses on type safety with the introduction of runtime checks: any call to zimscraperlib API must match the type definition or an exception will be raised.
Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io
Main changes include:
- ZIM metadata handling has completely changed with new types for each kind of metadata.
- `i18n` module has been redesigned around a single main class `Language`
- New `rewriting` module for HTML/CSS/JS (that one being done at runtime via Wombat)
- Now supporting only Python 3.12
- Documentation using `mkdocs`, published on readthedocs.com (#92)

- `rewriting` module to rewrite URLs in content for generic scrapers
  - `rewriting.css` to rewrite URLs in CSS files
  - `rewriting.html` to rewrite URLs in HTML files
  - `rewriting.js` to rewrite URLs in JS files (at runtime, using `wombat`)
  - `wombat-setup` javascript module in `javascript/`
- `typing` module with custom types:
  - `Callback` to use where we expect callbacks
  - `SupportsWrite`, `SupportsRead`, `SupportsSeeking`, `SupportsSeekableRead` and `SupportsSeekableWrite`: protocols for IO type annotations
- `zim.metadata` module with a type-based approach for each kind of metadata and helpers for custom ones
  - [`zim.metadata`] `APPLY_RECOMMENDATIONS`: general flag to toggle openZIM-recommended constraints
  - [`zim.metadata`] Type-based classes: `Metadata`, `TextBasedMetadata`, `TextListBasedMetadata`, `DateBasedMetadata`, `IllustrationBasedMetadata`
  - [`zim.metadata`] Usage-based classes: `NameMetadata`, `LanguageMetadata`, `DefaultIllustrationMetadata`, etc.
  - [`zim.metadata`] `StandardMetadataList` to package the standard metadata
  - See details for additional API endpoints and variables
- [`constants`] `DEFAULT_WEB_REQUESTS_TIMEOUT` exposed for `download` module
- [`download`] `stream_file()` now accepts `timeout: int` param (defaults to constant timeout) (#222)
- [`filesystem`] `path_from` context manager to acquire a pathlib `Path` from `Path` or `TemporaryDirectory`
- [`i18n`] `Language`, `get_language()` and `get_language_or_none()`. See breaking changes
- [`image.optimization`] `OptimizePngOptions` dataclass to store PNG options
- [`image.optimization`] `OptimizeJpgOptions` dataclass to store JPEG options
- [`image.optimization`] `OptimizeGifOptions` dataclass to store GIF options
- [`image.optimization`] `OptimizeOptions` dataclass to store cross-formats options
- [`inputs`] `unique_values()` to deduplicate a list while preserving order
- [`logging`] `DEFAULT_FORMAT_WITH_THREADS` as many scrapers use threads
- [`video.encoding`] `reencode()`'s `existing_tmp_path` param
- [`zim.filesystem`] `validate_folder_writable()` to ensure one can write into a folder (#200)
- [`zim.creator`] `Creator._get_first_language_metadata_value()` to retrieve first language from metadata
- [`zim.items`] `no_indexing_indexdata()` to get an IndexData that disables indexing
- [`zim.items`] `URLItem.get_mimetype()` now only returning `str`
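The `unique_values()` entry above describes deduplicating a list while preserving order. A minimal standalone sketch of that behaviour follows. It is not the library's implementation, and the helper name `dedupe_preserving_order` is hypothetical.

```python
from collections.abc import Hashable, Iterable
from typing import TypeVar

T = TypeVar("T", bound=Hashable)


def dedupe_preserving_order(items: Iterable[T]) -> list[T]:
    """Hypothetical helper: drop duplicates, keep the first occurrence of each item."""
    # dict keys are unique and keep insertion order, so the keys are
    # exactly the distinct items in first-seen order.
    return list(dict.fromkeys(items))


print(dedupe_preserving_order(["eng", "fra", "eng", "deu", "fra"]))
# prints ['eng', 'fra', 'deu']
```

This approach only works for hashable items, which is an assumption of the sketch and not something the entry above states.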
- Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
- [`constants`] `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` moved to `zim/metadata`
- [`download`] `YoutubeDownloader.download`'s `options` parameter now expects a `dict[str, Any]` instead of `dict`
- [`download`] `YoutubeConfig` options now limited to `str | bool | int | None`
- [`download`] `_get_retry_adapter()` now exposed as `get_retry_adapter()`
- [`download`] `stream_file`'s `byte_stream` param now more flexible, accepting `SupportsWrite[bytes] | SupportsSeekableWrite[bytes]`
- [`download`] `stream_file`'s `proxies` param now accepting `dict[str, str]` instead of `dict`
- [`filesystem`] `delete_callback()` is now a simple callback accepting an `fpath` and deleting it (doesn't chain other callback anymore).
- [`filesystem`] `delete_callback()` doesn't fail on missing file (#192)
- [`i18n`] Redesigned API around a single object: `Language`, which is inited with any acceptable code. Raises `NotFoundError` on 639-3 matching failure
  - `find_language_names()` is retained but only accepts a `query: str`
  - added `get_language()` and `get_language_or_none()` as shortcuts around `Language`
  - `is_valid_iso_639_3()` is retained
- [`image.conversion`] `convert_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.conversion`] `convert_svg2png()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.optimization`] `optimize_png()` now accepts `options: OptimizePngOptions` instead of individual params.
- [`image.optimization`] `optimize_jpeg()` now accepts `options: OptimizeJpgOptions` instead of individual params.
- [`image.optimization`] `optimize_webp()` now accepts `options: OptimizeWebpOptions` instead of individual params.
- [`image.optimization`] `optimize_gif()` now accepts `options: OptimizeGifOptions` instead of individual params.
- [`image.presets`] All presets now use the new options dataclass instead of ClassVar dict
- [`image.probing`] `format_for()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src`.
- [`image.probing`] `is_valid_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `image`.
- [`image.utils`] `save_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `dst`.
- [`video.config`] `Config` was mostly not using type annotations.
- [`video.config`] `Config` options only expecting `str | None`
- [`video.presets`] All options only expecting `str | None`
- [`video.encoding`] `reencode()` now always returning a `tuple[bool, CompletedProcess]`
- [`zim._libkiwix`] `MimetypeAndCounter` now expects specific types for `mimetype: str` and `value: int`
- [`zim.filesystem`] `make_zim_file()` `publisher` param now properly expects an `str`
- [`zim.filesystem`] `IncorrectZIMPathError` renamed to `IncorrectPathError`
- [`zim.filesystem`] `MissingZIMFolderError` renamed to `MissingFolderError`
- [`zim.filesystem`] `NotADirectoryZIMFolderError` renamed to `NotADirectoryFolderError`
- [`zim.filesystem`] `NotWritableZIMFolderError` renamed to `NotWritableFolderError`
- [`zim.filesystem`] `IncorrectZIMFilenameError` renamed to `IncorrectFilenameError`
- [`zim.filesystem`] `validate_zimfile_creatable()` renamed to `validate_file_creatable()`
- [`zim.items`] `Item` and `StaticItem` now expecting `hints` as `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `Item.get_hints()` now returning `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `URLItem.download_for_size()` now specifying type annotations and reordered params
- [`zim.providers`] `FileLikeProvider.gen_blob()` and `URLProvider.gen_blob()` now properly annotate return type (`Generator[libzim.writer.Blob, None, None]`)
- [`zim.providers`] `URLProvider.get_size_of()` param `url` now explicitly expects an `str`
- [`zim.creator`] `Creator.config_metadata()` signature changed, now mainly accepting a `StandardMetadataList`
- [`zim.creator`] `Creator.config_dev_metadata()` signature changed to accept new metadata types
- [`zim.creator`] `Creator.add_item_for()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [`zim.creator`] `Creator.add_item()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [deps] `iso639-lang` now requires at least v2.4.0
- [`download`] `stream_file()` now returns `tuple[int, requests.structures.CaseInsensitiveDict[str]]` instead of `tuple[int, requests.structures.CaseInsensitiveDict]`
- [`download`] `stream_file()` now accepts both `fpath` and `byte_stream` params (writes to both)
- [`image.utils`] `save_image()` now accepts `Any` `**params`.
- [`zim.archive`] `Archive.counters` now returning `CounterMap` (compatible with previous `dict[str, int]`)
- Direct dependencies now properly referenced: pillow, urllib3, piexif, idna (#226)
- [`download`] `YoutubeDownloader.download` now respects its return type (`bool | Future[Any]`)
- [`image.conversion`] `convert_image()` `**params` properly declared as accepting `None`.
- [`logging`] `getLogger()`'s `console` now properly accepting `TextIO | io.StringIO | None`
- [`video.probing`] `get_media_info()` type annotation for `src_path`
- [`zim.archive`] `Archive.get_item()` return type (`libzim.reader.Item`)
- Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
- [`i18n`] `Lang` (See breaking changes)
- [`i18n`] `get_iso_lang_data()` (See breaking changes)
- [`i18n`] `update_with_macro()` (See breaking changes)
- [`i18n`] `get_language_details()` (See breaking changes)
- [`uri`] `rebuild_uri` `failsafe` param (was only handling incorrect types)
- [`video.encoding`] `reencode()`'s `with_process` param
- [`zim.creator`] `Creator.validate_metadata()`
- [`zim.creator`] `Creator.convert_and_check_metadata()`
- Add utility function to compute ZIM Tags #164, including deduplication #156
- Metadata does not automatically drop control characters #159
- New `indexing.IndexData` class to hold title, content and keywords to pass to libzim to index an item
- Automatically index PDF documents content #167
- Automatically set proper title on PDF documents #168
- Expose new `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- Add `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- New `creator.Creator.convert_and_check_metadata` to convert metadata to bytes or str for known use cases and check proper type is passed to libzim
- Add svg2png image conversion function #113
- Add `conversion.convert_svg2png` image conversion function + support for SVG in `probing.format_for` #113
- Add `i18n.Lang` class used as typed result of i18n operations #151
- BREAKING Renamed `zimscraperlib.image.convertion` to `zimscraperlib.image.conversion` to fix typo
- BREAKING Many changes in type hints to match the real underlying code
- BREAKING Force all boolean arguments (and some other non-obvious parameters) to be keyword-only in function calls for clarity / disambiguation (see ruff rule FBT002)
- Prefer `IO[bytes]` over `io.BytesIO` when possible since it is more generic
- BREAKING `i18n.NotFound` renamed `i18n.NotFoundError`
- BREAKING `types.get_mime_for_name` now returns `str | None`
- BREAKING `creator.Creator.add_metadata` and `creator.Creator.validate_metadata` now only accept `bytes | str` as value (it must have been converted before call)
- BREAKING second argument of `creator.Creator.add_metadata` has been renamed to `value` instead of `content` to align with other methods
- When a type issue arises in metadata checks, wrong value type is displayed in exception
- BREAKING `i18n.get_language_details()`, `i18n.get_iso_lang_data()`, `i18n.find_language_names()` and `i18n.update_with_macro` now process / return a new typed `Lang` class #151
- BREAKING Rename `i18n.NotFound` to `i18n.NotFoundError`
- BREAKING Remove translation features in `i18n`: `Locale` class + `_` and `setlocale` functions #134
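The keyword-only entry above (ruff rule FBT002) means boolean flags can no longer be passed positionally. A minimal sketch of the pattern follows, using a hypothetical function `plan_resize` that is not part of scraperlib.

```python
def plan_resize(width: int, height: int, *, allow_upscaling: bool = False) -> str:
    """Hypothetical function: every parameter after the bare `*` is keyword-only."""
    mode = "upscaling allowed" if allow_upscaling else "no upscaling"
    return f"{width}x{height} ({mode})"


# With an explicit keyword, the meaning of the boolean is visible at the call site.
print(plan_resize(640, 480, allow_upscaling=True))

# Passing the boolean positionally is rejected by Python itself.
try:
    plan_resize(640, 480, True)  # type: ignore[misc]
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Callers that previously passed such flags positionally would need to switch to the keyword form.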
- Metadata length validation is buggy for unicode strings #158
- Pillow 10.4.0 reveals improper type hints for image probing functions #177
- Enhance error when locale fails to setup #157
- `zim.creator.Creator._log_metadata()` to log (DEBUG) all metadata set on `_metadata` (prior to `start()`) #155
- New utility function to confirm ZIM can be created at given location / name #163
- Migrate the VideoWebmLow and VideoWebmHigh presets to VP9 for smaller file size #79
- New preset versions are v3 and v2 respectively
- Simplify type annotations by replacing Union and Optional with pipe character ("|") for improved readability and clarity #150
- Calling `Creator._log_metadata()` on `Creator.start()` if running in DEBUG #155
- Add back the `--runinstalled` flag for test execution to allow smooth testing on other build chains #139
- Add support for `disable_metadata_checks` and `ignore_duplicates` arguments in `make_zim_file` function ("zimwriterfs-mode")
- Relaxed constraints on Python dependencies
- Upgraded optional dependencies used for test and QA
- Set a user-agent for `handle_user_provided_file` #103
- Migrate to generic syntax in all std collections #140
- Do not modify the ffmpeg_args in reencode function #144
- New `disable_metadata_checks` parameter in `zimscraperlib.zim.creator.Creator` initializer, allowing metadata checks to be disabled at startup (assuming the user will validate them on their own) #119
- Rework the VideoWebmLow preset for faster encoding and smaller file size #122
- preset has been bumped to version 2
- when using an S3 cache, all videos using this preset will be reencoded and uploaded to cache again (it will replace the same file encoded with preset version 1)
- When reencoding a video, ffmpeg now uses only 1 CPU thread by default (a new arg to `reencode` allows overriding this default value)
- Using openZIM Python bootstrap conventions (including hatch-openzim plugin) #120
- Add support for Python 3.12, drop Python 3.7 support #118
- Replace "iso-639" by "iso639-lang" library
- Replace "file-magic" by "python-magic" library for Alpine Linux support and better maintenance
- Fixed type hints of `zimscraperlib.zim.Item` and subclasses, and `zimscraperlib.image.optimization:convert_image`
- Add utility function to compute/check ZIM descriptions #110
- Using pylibzim `3.4.0`
- Support for Python 3.7 (EOL)
- Fixed declared (hint) return type of `download.stream_file` #104
- Fixed declared (hint) type of `content` param for `Creator.add_item_for` #107
- Using pylibzim `3.1.0`
- ZIM metadata check now allows multiple values (comma-separated) for `Language`
- Using `yt_dlp` instead of `youtube_dl`
- Dropped support for Python 3.6
`zim.creator.Creator` and `zim.filesystem.make_zim_file`

- `zim.creator.Creator.config_metadata` method (returning Self) exposing all mandatory Metadata, all standard ones and allowing extra text metadata.
- `zim.creator.Creator.config_dev_metadata` method setting stub metadata for all mandatory ones (allowing overrides)
- `zim.metadata` module with a list of per-metadata validation functions
- `zim.creator.Creator.validate_metadata` (called on `start`) to verify metadata respects the spec (and its recommendations)
- `zim.filesystem.make_zim_file` accepts a new optional `long_description` param.
- `i18n.is_valid_iso_639_3` to check ISO-639-3 codes
- `image.probing.is_valid_image` to check Image format and size

- `zim.creator.Creator` `main_path` argument now mandatory
- `zim.creator.Creator.start` now fails on missing required or invalid metadata
- `zim.creator.Creator.add_metadata` now enforces validation checks
- `zim.filesystem.make_zim_file` renamed its `favicon_path` param to `illustration_path`
- `zim.creator.Creator.config_indexing` `language` argument now optional when `indexing=False`
- `zim.creator.Creator.config_indexing` now validates `language` is ISO-639-3 when `indexing=True`

- `zim.creator.Creator.update_metadata`. See `.config_metadata()` instead
- `zim.creator.Creator` `language` argument. See `.config_metadata()` instead
- `zim.creator.Creator` keyword arguments. See `.config_metadata()` instead
- `zim.creator.Creator.add_default_illustration`. See `.config_metadata()` instead
- `zim.archive.Archive.media_counter` (deprecated in `2.0.0`)

- `zim.creator.Creator(language=)` can be specified as `List[str]`. `["eng", "fra"]`, `["eng"]`, `"eng,fra"`, `"eng"` are all valid values.
- Fixed `zim.providers.URLProvider` returning incomplete streams under certain circumstances (from openzim/kolibri#40)
- Fixed `zim.creator.Creator` not supporting multiple values for Language metadata, as required by the spec
- Using pylibzim v2.1.0 (using libzim 8.1.0)
- [libzim] `Entry.get_redirect_entry()`
- [libzim] `Item.get_indexdata()` to implement custom IndexData per entry (writer)
- [libzim] `Archive.media_count`
- [libzim] `Archive.article_count` updated to match scraperlib's version
- `Archive.article_counter` now deprecated. Now returns `Archive.article_count`
- `Archive.media_counter` now deprecated. Now returns `Archive.media_count`
- [libzim] `lzma` compression algorithm

- `download.get_session()` to build a new requests Session

- `download.stream_file()` accepts a `session` param to use instead of creating one

- `zim.Creator` now supports `ignore_duplicates: bool` parameter to prevent duplicates from raising exceptions
- `zim.Creator.add_item`, `zim.Creator.add_redirect` and `zim.Creator.add_item_for` now support a `duplicate_ok: bool` parameter to prevent an exception should this item/redirect be a duplicate

- `download.stream_file()` supports passing `headers` (scrapers were already using it)
- Fixed `filesystem.get_content_mimetype()` crashing on non-guessable byte stream
- Wider range of accepted lxml dependency version as 4.9.1 fixes a security issue
- `Archive.get_metadata_item()` to retrieve full item instead of just value
- Using pylibzim v1.1.0 (using libzim 7.2.1)
- Adding duplicate entries now raises RuntimeError
- filesize is fixed for larger ZIMs

- `zim.Archive.tags` and `zim.Archive.get_tags()` to retrieve parsed Tags with optional `libkiwix` param to include libkiwix's hints
- [tests] Counter tests now also use a libzim6 file.

- `zim.Archive.article_counter` follows libkiwix's new behavior of returning libzim's `article_count` for libzim 7+ ZIMs and returning previously returned (parsed) value for older ZIMs.
- Unreachable code removed in `imaging` module.
- [tests] “Sanskrit” removed from tests as output is not predictable depending on platform.

- `zim.Archive.counters` won't fail on missing `Counter` metadata
- Fixed leak in `zim.Archive`'s `.counters`
- New `.get_text_metadata()` method on `zim.Archive` to save UTF-8 decoding
- New `Counter` metadata based properties for Archive:
  - `.counters`: parsed dict of the Counter metadata
  - `.article_counter`: libkiwix's calculation for nb of article
  - `.media_counter`: libkiwix's calculation for nb of media
- Fixed `i18n.find_language_names()` failing on some languages
- Added `uri` module with `rebuild_uri()`
- Using new python-libzim based on libzim v7
- New Creator API
- Removed all namespace references
- Renamed
url
mentions topath
- Removed all links rewriting
- Removed Article/CSS/Binary segregation
- Kept zimwriterfs mode (except it doesn't rewrite for namespaces)
- New `html` module for HTML document manipulations
- New callback system on `add_item_for()` and `add_item()`
- New Archive API with easier search/suggestions and content access
- Changed download log level to DEBUG (was INFO)
- `filesystem.get_file_mimetype` now passes bytes to libmagic instead of filename due to release issue in libmagic
- safer `inputs.handle_user_provided_file` regarding input as str instead of Path
- `image.presets` and `video.presets` now all include `ext` and `mimetype` properties
- Video convert log now DEBUG instead of INFO
- Fixed `image.save_image()` saving to disk even when using a bytes stream
- Fixed `image.transformation.resize_image()` when resizing a byte stream without a dst
Intermediate release using unreleased libzim to support development of libzim7. Don't use it.
- requesting newer libzim version (not released ATM)
- New ZIM API for non-namespace libzim (v7)
- updated all requirements
- Fixed download test inconsistency
- fix_ogvjs mostly useless: only allows webm types
- exposing retry_adapter for refactoring
- Changed download log level to DEBUG (was INFO)
- guess more-defined mime from filename if magic says it's text
- get_file_mimetype now passes bytes to libmagic
- safer regarding input as str instead of Path
- fixed static item for empty content
- ext and mimetype properties for all presets
- Video convert log now DEBUG instead of INFO
- Added delete_fpath to add_item_for() and fixed StaticItem's auto remove
- Updated badges for new repo name
- add `stream_file()` to stream content from a URL into a file or a `BytesIO` object
- deprecated `save_file()`
- fixed `add_binary` when used without an fpath (#69)
- deprecated `make_grayscale` option in image optimization
- Added support for in-memory optimization for PNG, JPEG, and WebP images
- allows enabling debug logs via ZIMSCRAPERLIB_DEBUG environ
- added `wait` option in `YoutubeDownloader` to allow parallelism while using context manager
- do not use extension for finding format in `ensure_matches()` in `image.optimization` module
- added `VideoWebmHigh` and `VideoMp4High` presets for high quality WebM and Mp4 conversion respectively
- updated presets `WebpHigh`, `JpegMedium`, `JpegLow` and `PngMedium` in `image.presets`
- `save_image` moved from `image` to `image.utils`
- added `convert_image`, `optimize_image` and `resize_image` functions to `image` module
- added `YoutubeDownloader` to `download` to download YT videos using a capped nb of threads
- fixed rewriting of links with empty target
- added support for image optimization using `zimscraperlib.image.optimization` for webp, gif, jpeg and png formats
- added `format_for()` in `zimscraperlib.image.probing` to get PIL image format from the suffix
- replaced BeautifulSoup parser in rewriting (`html.parser` -> `lxml`)
- detect mimetypes from filenames for all text files
- fixed non-filename based StaticArticle
- enable rewriting of links in poster attribute of audio element
- added find_language_in() and find_language_in_file() to get language from HTML content and HTML file respectively
- add a mime mapping to deal with inconsistencies in mimetypes detected by magic on different platforms
- convert_image signature changed:
  - `target_format` positional argument removed. Replaced with optional `fmt` key of keyword arguments.
  - `colorspace` optional positional argument removed. Replaced with optional `colorspace` key of keyword arguments.
- prevent rewriting of links with special schemes `mailto`, `tel`, etc. in HTML links rewriting
- replaced `imaging` module with exploded `image` module (`convertion`, `probing`, `transformation`)
- changed `create_favicon()` param names (`source_image` -> `src`, `dest_ico` -> `dst`)
- changed `save_image()` param names (`image` -> `src`)
- changed `get_colors()` param names (`image_path` -> `src`)
- changed `resize_image()` param names (`fpath` -> `src`)
- fixed URL rewriting when running from /
- added support for link rewriting in `<object>` element
- prevent from raising error if element doesn't have the attribute with url
- use non greedy match for CSS URL links (shortest string matching `url()` format)
- fix namespace of target only if link doesn't have a netloc
- added UTF8 to constants
- added mime_type discovery via magic (filesystem)
- Added types: mime types guessing from file names
- Revamped zim API
- Removed ZimInfo, whose role was to hold metadata for zimwriterfs call
- Removed calling zimwriterfs binary but kept function name
- Added zim.filesystem: zimwriterfs-like creation from a build folder
- Added zim.creator: create files by manually adding each article
- Added zim.rewriting: tools to rewrite links/urls in HTML/CSS
- add timeout and retries to save_file() and make it return headers
- fixed `convert_image()` which tried to use a closed file
- exposed reencode, Config and get_media_info in zimscraperlib.video
- added save_image() and convert_image() in zimscraperlib.imaging
- added support for upscaling in resize_image() via allow_upscaling
- resize_image() now supports params given by user and preserves image colorspace
- fixed tests for zimscraperlib.imaging
- added video module with reencode, presets, config builder and video file probing
- `make_zim_file()` accepts extra kwargs for zimwriterfs
- added translation support to i18n
- added s3transfer to verbose dependencies list
- changed default log format to include module name
- verbose dependencies (urllib3, boto3) now logged at WARNING level by default
- ability to set verbose dependencies log level and add modules to the list
- zimscraperlib's logging level now aligned with scraper's requested one
- fix_ogvjs_dist script more generic (#1)
- updated zim to support other zimwriterfs params (#10)
- more flexible requirements for requests dependency
- fixed return value of `get_language_details` on non-existent language
- fixed crash on `resize_image` with method `height`
- fixed root logger level (now DEBUG)
- removed useless `console=True` `getLogger` param
- completed tests (100% coverage)
- added `./test` script for quick local testing
- improved tox.ini
- added `create_favicon` to generate a squared favicon
- added `handle_user_provided_file` to handle user file/URL from param
- fixed fix_ogvjs_dist
- initial version providing
- download: save_file, save_large_file
- fix_ogvjs_dist
- i18n: setlocale, get_language_details
- imaging: get_colors, resize_image, is_hex_color
- zim: ZimInfo, make_zim_file