Skip to content

Latest commit

 

History

History
277 lines (206 loc) · 9.1 KB

regular-expressions.md

File metadata and controls

277 lines (206 loc) · 9.1 KB

Regular expressions in Clojure

Regular expressions, or regexes for short, allow you to specify patterns that you can search for within strings, e.g. the regex:

foo [0-9a-fA-F]+ (bar|gloop)

matches "foo ", followed by one or more hexadecimal digits, followed by " ", followed by either "bar" or "gloop".

Regexes are not general enough to match any set of strings, e.g. you cannot write one to match all strings of balanced parentheses, nor all legal C or Clojure programs, but they are useful for many tasks.

This document does not attempt to give a complete description of everything that regexes are capable of -- only some tips on using them, with examples, plus a few references if you wish to read further.

Regexes in Clojure are Java regexes, implemented by the Java classes java.util.regex.Pattern and java.util.regex.Matcher. Regexes in other variants of Clojure (e.g. Clojure/CLR or ClojureScript) have much in common with Java regexes, but they are not identical. Do not expect 100% compatibility of regex syntax and matching behavior between different regex implementations.

Examples

(require '[clojure.string :as s])

re-find returns nil if there is no match.

user> (re-find #"Time:" "No time like the present")
nil

If the regex contains no parenthesized capturing groups, then on a match re-find returns the part of the string that matches the regex.

user> (re-find #"Time: ..:..:.."
               "At the chime it is Time: 09:58:10.")
"Time: 09:58:10"

If there are capturing groups, then on a match re-find returns a vector of strings. The first is the string that matches the entire regex. Each following string matches one capturing group.

user> (re-find #"Time: (..):(..):(..)"
               "At the chime it is Time: 09:58:10.")
["Time: 09:58:10" "09" "58" "10"]

The order of matching strings is the order that the left parens appear in the regex.

user> (re-find #"(Time: ((..):(..))):(..)"
               "At the chime it is Time: 09:58:10.")
["Time: 09:58:10" "Time: 09:58" "09:58" "09" "58" "10"]

Destructuring can be a handy way to give names to the strings of these capturing groups. Unlike Perl and some other regex libraries, Java regexes do not provide a way to embed these names inside the regex itself.

user> (let [matches (re-find #"Time: (..):(..):(..)"
                             "At the chime it is Time: 09:58:10.")
            [hours mins secs] (map #(bigint %) (rest matches))
            t (+ (* 3600 hours) (* 60 mins) secs)]
        (println (str t) "seconds since midnight"))
35890 seconds since midnight
nil

You can call clojure.string/replace and clojure.string/replace-first with a string in which to search, a regex, and a function to calculate the replacement strings. This function is called once for each match. Its one argument is the return value from re-find as described above: a string if there are no capturing groups in the regex, or a vector of strings if there are.

user> (s/replace "0x4e out of 0x64 dentists agree they prefer decimal"
                 #"0x([0-9a-fA-F]+)"
                 (fn [[_ s]] (str (BigInteger. s 16))))
"78 out of 100 dentists agree they prefer decimal"

A parenthesized subexpression is normally a capturing group. If you never want to get back a separate return value for the part of the string matched by that subexpression, you can make it a non-capturing group. The simplest way is to put ?: immediately after the left paren.

TBD: Simple examples

A capturing group, just like most other regexes, can be followed by a ? to indicate that the part of the regex just before the ? may appear in the matched string once, but need not appear at all. If such a capturing group is not matched by anything in the string, nil is returned for that capturing group.

TBD: Examples

Capturing groups with a * after them also return nil if they match 0 times in the string.

Capturing groups followed by *, +, or {m,n} that match more than once in the string only capture the last occurrence, not all of them. If you want all matches, use another set of parens to make the expression, including the *, +, or {m,n} into a capture group. The inner one can be made a non-capturing group if you do not want the result of its last match.

TBD: Examples

subexpression within the parentheses is optional,

user> (re-find #"([a-zA-Z]+)?(\d)+" "word 587")
["587" nil "7"]
user> (re-find #"([a-zA-Z]+)?(\d)+" "word587")
["word587" "word" "7"]

user> (re-find #"([a-zA-Z]+)?(\d+)" "word 587")
["587" nil "587"]
user> (re-find #"([a-zA-Z]+)?(\d+)" "word587")
["word587" "word" "587"]

user> (re-find #"term: ([ab])*" "term: ")
["term: " nil]
user> (re-find #"term: ([ab])*" "term: a")
["term: a" "a"]
user> (re-find #"term: ([ab])*" "term: ab")
["term: ab" "b"]
user> (re-find #"term: ([ab])*" "term: bbbbbbba")
["term: bbbbbbba" "a"]

user> (re-find #"term: ([ab]*)" "term: ")
["term: " ""]
user> (re-find #"term: ([ab]*)" "term: a")
["term: a" "a"]
user> (re-find #"term: ([ab]*)" "term: ab")
["term: ab" "ab"]
user> (re-find #"term: ([ab]*)" "term: bbbbbbba")
["term: bbbbbbba" "bbbbbbba"]

Code

See also

  • Quick start section on regular-expressions.info site. Note: While many of the regex features described on this site can be used in Java regexes, not all of them can. The site usually does a good job distinguishing which features exist in Java vs. those that do not.

  • Documentation for Java regular expressions.

  • The Regex Coach: A Windows program for interactively building up regular expressions and showing which parts of a regular expression match which parts of a string, by colorizing those parts with the same colors.

  • Wikipedia article on regular expressions.

See also: re-pattern, re-matcher, re-matches, re-seq, re-groups, clojure.string/replace, clojure.string/replace-first

TBD: Some explanation on the difference between specifying regex using #"" syntax vs. as a string using re-pattern. Basically #"" lets you type fewer backslash characters.

TBD: How to take a string and use it as part of a regex, matching its exact contents literally, with any special regex characters not treated specially. See Java method java.util.regex.Pattern/quote

TBD: Writing and reading back regexes. Regex syntax is not part of edn, so one must either use the potentially unsafe clojure.core/read or read-string to read them in, or find a different way to write them out, such as creating a data reader tag for regexes.

TBD: Any way with github Markdown to include an auto-generated table of contents, i.e. of headers?

Perl Cookbook examples that would be good to include similar examples for:

TBD 6.4 Commenting regular expressions

TBD 6.6 Matching within multiple lines

TBD 6.7 Reading records with a separator

TBD 6.14 Matching from where the last pattern left off

TBD 6.15 Greedy and non-greedy matches

There is a dead simple way to code this by matching "^(.?)(pattern I'm looking for)(.)$", but that incurs the overhead of explicitly scanning the rest of the string after the pattern is matched to match the final (.*) (TBD: measure this effect to verify, with very long strings, and 'normal line length' strings).

It is trickier to write the code for it, but if you use Java start() and end() methods of the underlying Matcher object, you can get the position of the match without.

Misc code

(defn re-group-posns
  [^java.util.regex.Matcher m]
  (let [gc (. m (groupCount))]
    (if (zero? gc)
      [(. m (start)) (. m (end))]
      (loop [ret [] c 0]
        (if (<= c gc)
          (recur (conj ret (if (neg? (. m (start c)))
                             nil
                             [(. m (start c)) (. m (end c))]))
                 (inc c))
          ret)))))

(defn re-find-pos
  ([^java.util.regex.Matcher m]
   (when (. m (find))
     (re-group-posns m)))
  ([^java.util.regex.Pattern re s]
   (let [m (re-matcher re s)]
     (re-find-pos m))))

Web app that takes a regex and parses it, explaining its pieces in English: http://www.myezapp.com/apps/dev/regexp/show.ws

TBD: I know the Common Lisp library CL-PPCRE can let you parse a regex and turn it into a kind of parse tree, and also lets you create parse trees and then use them to match. Is there a Java library for this?

Regular expression test page: http://www.regexplanet.com/advanced/java/index.html

Commercial program RegexBuddy for learning regular expressions, and analyzing and explaining them: http://www.regexbuddy.com/

Benchmarks of different Java regex libraries: http://tusker.org/regex/regex_benchmark.html