-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ID-1570 - Fix dice-where and update dce-id to use the latest version #112
Conversation
...er-lib/src/main/java/technology/dice/dicewhere/downloader/stream/StreamWithMD5Decorator.java
Outdated
Show resolved
Hide resolved
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #112 +/- ##
============================================
+ Coverage 58.75% 59.02% +0.26%
- Complexity 370 374 +4
============================================
Files 97 97
Lines 1913 1923 +10
Branches 167 167
============================================
+ Hits 1124 1135 +11
+ Misses 730 729 -1
Partials 59 59 ☔ View full report in Codecov by Sentry. |
public MD5Checksum md5() { | ||
String hex = (new HexBinaryAdapter()).marshal(this.md5.digest()); | ||
return MD5Checksum.of(hex); | ||
if (duplicatedStream.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't we consuming stream on the constructor? is this possible to happen?
if not, instead of throwing an exception, wouldn't be possible to make sure we consume the stream here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not sure what's going on in StreamWithMD5Decorator
- either it needs documentation to explain it, or else we are performing operations in an odd order and/or unnecessarily.
byte[] buffer = new byte[BUFFER]; | ||
while ((md5Is.read(buffer)) != -1) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, what was this ever for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julianhowarth the digest is incrementally added per each byte we read.
To be able to generate and compare the hash we need to consume every byte for the stream and then perform the MD5. Because we cannot re-use the same InputStream, what we're doing here is copying the stream ( therefore updating the digest for the original stream ) and provide a copied InputStream to be streamed elsewhere - that contains the exact md5 checksum of the original one.
|
||
private StreamWithMD5Decorator(DigestInputStream inputStream, MessageDigest md5) { | ||
this.inputStream = inputStream; | ||
private StreamWithMD5Decorator(DigestInputStream originalStream, MessageDigest md5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is lying and likely to cause issues. One of the major benefits of InputStream
s is that they do not require the whole contents of the file (or other source) to be read into memory at any time. But this class subverts this as the passed stream is consumed and written to a temporary array - this is not what clients of this class would expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we're indeed copying an InputStream for origin -
But if the stream is not read, I cannot generate an hash from it ( gives me d41d8cd98f00b204e9800998ecf8427e
checksum for an empty input )
and if the stream is read it can't be re-used.
So this implementation tries to perform the md5 whilst letting you a InputStream to be used elsewhere
...er-lib/src/main/java/technology/dice/dicewhere/downloader/stream/StreamWithMD5Decorator.java
Outdated
Show resolved
Hide resolved
private Optional<InputStream> duplicatedStream = Optional.empty(); | ||
private Optional<MD5Checksum> md5Checksum = Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this class is stateful but without any documentation to that effect. Would be much cleaner if the only state was within the inputstream itself.
...er-lib/src/main/java/technology/dice/dicewhere/downloader/stream/StreamWithMD5Decorator.java
Outdated
Show resolved
Hide resolved
...er-lib/src/main/java/technology/dice/dicewhere/downloader/stream/StreamWithMD5Decorator.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Override | ||
public void close() throws IOException { | ||
this.inputStream.close(); | ||
inputStream().close(); | ||
originalStream.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arguably, this is not ours to close, caller should be closing it
WHAT
Fix dice-where and update dce-id to use the newest dice-where version.
We call .md5() twice on the same StreamWithMD5Decorator object. This causes the second invocation to return the incorrect hash.
First call is newly added to the LocalFileAcceptor and S3FileAcceptor classes
Second call is already existing in S3Source.produce()
This causes dice-where to download a file successfully, but not return the correct DownloadExecutionResult when it’s finished processing. The error is triggered in the ID test DiceWhereDownloaderScheduleJobTest .
WHY
Recently dice-where was updated to add extra md5 checks prevent faulty downloads. We need to update id to use the newest version.
Sadly the above change also introduced a bug for services using dice-where as a library. This also needs to be fixed so that we can use the updated version in id.
AC
Dice-where is fixed so it no longer calls md5() twice on the same object.
dce-id testDiceWhereDownloaderScheduleJobTest passes with the updated version of dice-where
Updated dce-id can be deployed and successfully pulls in files from dce-shared.
Implementation Details
Usage
Creation:
of
to instantiate passing the original InputStream.Reading: Because
StreamWithMD5Decorator
extends from InputStream, read is the same as it would on any InputStream.Checksum:
md5()
method will return the md5 hash at any time as long as the original stream has been fully consumed.Purpose
The current implementation allows for:
Drawbacks
StreamWithMD5Decorator
buffers the entire stream in memory to be able to calculate the MD5 - This is not acceptable for large files.Alternative PR ( WIP )
#113