NUTCH-1011 Remove duplicate slashes from URLs
git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/branch-1.4@1143467 13f79535-47bb-0310-9956-ffa450edef68
Markus Jelsma committed Jul 6, 2011
1 parent 32f7caf commit 60f0dbf
Showing 3 changed files with 17 additions and 0 deletions.
2 changes: 2 additions & 0 deletions CHANGES.txt
@@ -2,6 +2,8 @@ Nutch Change Log

Release 1.4 - Current development

* NUTCH-1011 Normalize duplicate slashes in URL's (markus)

* NUTCH-993 NullPointerException at FetcherOutputFormat.checkOutputSpecs (Christian Guegi via jnioche)

* NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex (markus)
6 changes: 6 additions & 0 deletions conf/regex-normalize.xml.template
@@ -63,4 +63,10 @@
  <substitution></substitution>
</regex>

<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

</regex-normalize>
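
The rule added above collapses any run of two or more slashes into one, unless the run directly follows a colon, which protects the `//` after the scheme. A minimal standalone sketch of that behaviour with `java.util.regex` follows. The class name `DuplicateSlashDemo` and the method `collapseSlashes` are hypothetical and exist only for illustration. The sketch does not go through Nutch's own normalizer classes, so it shows what the pattern does, not how Nutch applies it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class DuplicateSlashDemo {

    // Same expression as in regex-normalize.xml.template, with the XML
    // entity &lt; written as a literal '<'. The negative lookbehind (?<!:)
    // rejects a match that starts immediately after ':'.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Replaces every matched run of slashes with a single slash.
    static String collapseSlashes(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // Input taken from the unit test added in this commit.
        System.out.println(collapseSlashes("http://www.example.org//path/to//somewhere.html"));
    }
}
```

Because the pattern is applied to the whole URL string, it is not limited to the path component: duplicate slashes inside a query string would be collapsed as well.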
9 changes: 9 additions & 0 deletions src/test/org/apache/nutch/net/TestURLNormalizers.java
@@ -39,6 +39,15 @@ public void testURLNormalizers() {
    } catch (MalformedURLException mue) {
      fail(mue.toString());
    }

    // NUTCH-1011 - Get rid of superfluous slashes
    try {
      String normalizedSlashes = normalizers.normalize("http://www.example.org//path/to//somewhere.html", URLNormalizers.SCOPE_DEFAULT);
      assertEquals(normalizedSlashes, "http://www.example.org/path/to/somewhere.html");
    } catch (MalformedURLException mue) {
      fail(mue.toString());
    }

    // check the order
    int pos1 = -1, pos2 = -1;
    URLNormalizer[] impls = normalizers.getURLNormalizers(URLNormalizers.SCOPE_DEFAULT);