Audio codec, priming samples and padding, without container or information from the encoder #626
Comments
Agreed! Chrome doesn't do anything today for this. @tguilbert-google
You're right. Chrome pads the end of Opus encodes with silence. Would the idea be to spec how much priming/padding should be used, or just to expose the information and let users trim accordingly?
As briefly discussed with Dale today, this could probably be an addition to EncodedAudioChunkMetadata.
A design that would be fairly clean would be to have this information on the EncodedAudioChunk objects themselves, via their timestamp and duration. For the encode side, this would mean outputting chunks with negative timestamps for the priming packets. Then when decoding, it's clear that if the pts is negative, the output frames are trimmed, up to a start time of zero, and if the duration is different from the number of frames, the frames are also trimmed, so that the number of frames matches the duration. This can be done by the Web Codecs implementation, so there's no confusion.

This means that encoding / decoding round-trips properly without having to do anything, and that there is no extradata to carry at any point; each packet carries the information. Additionally, with this design, it's now impossible to ignore the fact that some audio isn't to be rendered, generally yielding better output. Moreover, this means that comparing timestamps for audio and video makes sense, because they're on the same timeline and not offset by the encoder delay. This also makes looping of audio assets seamless by default when looping the PCM output (while keeping in mind that decoders need to be reset and that all packets, including the ones with negative timestamps, are to be decoded again for the output to be correct on the second loop iteration).

On the other side of this issue, authors that write demuxers for use with Web Codecs can decide to create chunks with negative timestamps from the container's priming/padding information, and authors that write muxers will be able to look at the timestamps of the chunks to recover that same information.
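To illustrate the muxer-side consequence mentioned above, here is a minimal sketch (not code from the thread) of deriving the priming count from the first chunk's negative timestamp; the function name and inputs are assumptions:

```js
// Sketch: derive the priming (encoder delay) in sample-frames from the first
// EncodedAudioChunk, assuming the negative-timestamp scheme described above.
// `chunks` and `sampleRate` are hypothetical inputs from the app's encode loop.
function primingFrames(chunks, sampleRate) {
  const first = chunks[0];
  if (!first || first.timestamp >= 0) return 0; // nothing signalled as priming
  // EncodedAudioChunk timestamps are in microseconds.
  return Math.round((-first.timestamp * sampleRate) / 1e6);
}
```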
This is kind of the scheme ffmpeg uses, and while it works it does have some warts. Having an explicit sample count to trim resolves these issues. In general I haven't loved this scheme over the years, but it is mostly tried and true -- so long as there is a precise set of WPT tests for each codec, whatever scheme we choose is fine.
In general, we set the first packet timestamp to be negative (when that indicates the preroll), not really something in between 0 and the padding (maybe you meant preroll/encoder delay?). It's also frequently 0, and it can then be anything greater than zero, but that doesn't have a relation with the encoder delay either; that can happen with any file that's been demuxed / remuxed without re-encoding or any other processing (e.g. part of a live stream). That's a bit broken, because quite a few codecs need this delay, so rendering might be wrong, unless the author actually splits the file a bit before the intended start point and then discards the first few decoded output buffers correctly. But here Web Codecs will just decode the packets, and it is assumed folks know what they are doing.

A demuxer or post-demuxing/pre-decoding pass can also rebase the clock if it knows the intended start point, and offset everything, so it's back to signaling the encoder delay, discarding, and everything is fine again. If we find…
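For illustration, a sketch of such a rebasing pass (not from the thread); it assumes the app knows the intended start point in microseconds, and re-creates chunks since EncodedAudioChunk timestamps are read-only:

```js
// Sketch: shift a chunk's timestamp so the intended start point lands at 0,
// turning preroll packets into negative-timestamp packets as proposed above.
// `startUs` is an assumed input (the intended start point, in microseconds).
function rebaseChunk(chunk, startUs) {
  const data = new Uint8Array(chunk.byteLength);
  chunk.copyTo(data); // EncodedAudioChunk exposes byteLength and copyTo()
  return new EncodedAudioChunk({
    type: chunk.type,
    timestamp: chunk.timestamp - startUs, // may become negative: preroll
    duration: chunk.duration ?? undefined,
    data,
  });
}
```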
This is technically true, but I'm not sure that it matters, because 768kHz audio won't be in AAC (max 192kHz), MP3 (max 48kHz), Opus (always 48kHz), Vorbis (looks like 192kHz), or whatever else but PCM; not even FLAC (max 655350 Hz).
If negative timestamps are considered to be pre-roll / padding, then I'm not sure when they'd like to have them, but maybe I'm missing something. Having pre-roll/padding separately and in sample-frames does indeed resolve some issues, at the expense of requiring all authors to write the same code for most codecs, which I don't find satisfactory, because that won't happen, and web apps will generally be worse. We can also stick two more properties on the chunks themselves.

Also, to answer your last point: I've got a rather extensive test suite for this precise problem (built to test Firefox's implementation). It's a collection of WAV files generated exactly for this purpose, of known length (sometimes exactly 0.5s, sometimes exactly 1.0s), at various sample-rates and channel counts, that precisely loop seamlessly without discontinuity, converted with all the encoders I have access to for all the codecs we care about, and then muxed in all containers that make sense for those codecs, sometimes with various ways of indicating preroll/padding (depending on the container). It's fairly probable that I'll end up extending those tests and porting them to WPT while implementing the audio side of Web Codecs.
I meant we've seen streams where the first packet has discard padding and pts > 0. If negative timestamps are the only way to express discard, that will require client-side manipulation.
This will definitely happen :) We still get a couple weird timestamp bugs each quarter.
Correct, I mention it only as future-proofing.
Maybe this is what you meant by padding, but MP4 edit lists may shift by arbitrary amounts before zero for discard -- so it's not always just decoder delay.
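To make the edit-list case concrete, a hypothetical demuxer-side computation (the names are illustrative and not WebCodecs API):

```js
// Sketch: compute the leading discard implied by an MP4 elst entry. mediaTime
// is in media-timescale units and may be far larger than the codec's nominal
// decoder delay. `entry` and `mediaTimescale` come from a hypothetical parser.
function leadingDiscardFrames(entry, mediaTimescale, sampleRate) {
  if (entry.mediaTime <= 0) return 0; // -1 means an empty edit; 0 means no shift
  return Math.round((entry.mediaTime / mediaTimescale) * sampleRate);
}
```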
Sorry, I wasn't arguing for both separately and in-frame, but rather either in-frame or metadata, my preference being metadata. I like this approach since container discard metadata is often specified as a duration / frame count / time range without regard to how many frames each individual packet has. By putting it on the chunk metadata / decoder config, we can avoid clients manually mapping this data onto input chunks (which may require codec-level understanding) or worrying about shifting timestamps across the entire stream. I think the…
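A sketch of what that metadata shape could look like; the field names here are hypothetical (not in the WebCodecs spec), while the output(chunk, metadata) callback shape is the real AudioEncoder signature:

```js
// Hypothetical: the encoder reports how many sample-frames of this chunk's
// decoded result are priming/padding, via EncodedAudioChunkMetadata.
const written = [];
const encoder = new AudioEncoder({
  output(chunk, metadata) {
    const pre = metadata?.preDiscardFrames ?? 0;   // hypothetical field
    const post = metadata?.postDiscardFrames ?? 0; // hypothetical field
    written.push({ chunk, pre, post }); // a muxer would map these to the container
  },
  error: (e) => console.error(e),
});
```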
Great! That will be really helpful.
Sorry, I just realized I'm also worried about reusing timestamp and duration for this. So my vote is still for explicit fields; either both on the chunk (accepting that…
I didn't follow this point. Why would we need something on…
One new proposal: …
I'd like this fixed before going to CR, because it makes it impossible to render audio in a way that's 100% correct. I'll propose things based on the previous discussion.
What does this mean? How does your decoder know how to trim the last packet? I've implemented the following: …
One way was to have these fields on the decoder config, which requires a flush() for them to apply. If instead we put those same fields on the chunks, the semantics would be similar but flush wouldn't be required. We'd discard the next N frames indicated by the chunk's field. The second idea solves your description issue by just stating that these are always additional frames discarded. If we allow carryover between chunks, we need to specify how this field interacts with flush().
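A sketch of the carryover bookkeeping the second idea implies; the per-chunk discardFrames value is hypothetical, and the flush() behaviour shown is one possible choice, which is exactly the open question above:

```js
// Sketch: accumulate per-chunk discard counts and consume them from decoder
// outputs, allowing a discard to span several output buffers.
class DiscardTracker {
  constructor() { this.pending = 0; }
  onChunk(discardFrames) { this.pending += discardFrames; } // hypothetical field
  // Returns how many frames to trim from the front of this output.
  onOutput(numberOfFrames) {
    const drop = Math.min(this.pending, numberOfFrames);
    this.pending -= drop;
    return drop;
  }
  onFlush() { this.pending = 0; } // one possible rule; the spec would have to pick
}
```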
I now understand, thanks. While it certainly works, I find this re-queueing a bit heavy-handed: it requires copies, and potentially adds latency or requires calling flush(). I think I'd prefer to add something to trim the end padding on the packet itself, and leave most packets untouched except the first and the last, as usually happens.
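For illustration, a sketch of what per-packet end trimming could look like under the earlier proposal, where an AudioData's duration may cover fewer frames than it contains; in the current spec the duration is derived from the frame count, so this is an assumption:

```js
// Sketch: number of renderable frames if duration could signal tail padding.
function validFrames(audioData) {
  const byDuration = Math.round((audioData.duration * audioData.sampleRate) / 1e6);
  return Math.min(audioData.numberOfFrames, byDuration);
}
```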
How are implementations supposed to handle encoder delay / priming samples / packet padding?
If PCM is encoded with a particular codec and then decoded, does it round-trip properly? I don't think it can, because the number of priming samples is not returned by the encoder, so there's no way to mux the file properly to signal this (e.g., using an edit list in an MP4, or any other codec/container-pair-specific way).
It's possible to make a guess for some codecs (e.g. AAC is frequently 2112 frames), but this isn't ideal (and sometimes it depends on the implementation of a codec).
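A sketch of that guess-based workaround (the fragile approach this issue is about); the 2112 figure is the common AAC value mentioned above, not something the API reports:

```js
// Sketch: drop an assumed number of priming frames from the decoded output.
const ASSUMED_AAC_PRIMING = 2112; // a guess; implementation-dependent
let framesToSkip = ASSUMED_AAC_PRIMING;
const decoder = new AudioDecoder({
  output(audioData) {
    const skip = Math.min(framesToSkip, audioData.numberOfFrames);
    framesToSkip -= skip;
    const keep = audioData.numberOfFrames - skip;
    if (keep > 0) {
      // Copy the non-priming tail, channel by channel (conversion to
      // f32-planar is the one copyTo format implementations must support).
      for (let ch = 0; ch < audioData.numberOfChannels; ch++) {
        const plane = new Float32Array(keep);
        audioData.copyTo(plane, {
          planeIndex: ch, format: 'f32-planar', frameOffset: skip, frameCount: keep,
        });
        // ...hand `plane` to rendering or storage...
      }
    }
    audioData.close();
  },
  error: (e) => console.error(e),
});
```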
This is not just important for accurate synchronization, but a hard requirement for musical applications (because otherwise it breaks sample-accurate looping).