Regular expressions in Clojure

Regular expressions, or regexes for short, let you specify patterns to search for within strings. For example, the regex

foo [0-9a-fA-F]+ (bar|gloop)

matches "foo ", followed by one or more hexadecimal digits, followed by " ", followed by either "bar" or "gloop".
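As a quick check of that description, the pattern can be tried with re-find, which is covered in more detail under Examples. This is only a sketch: the name hex-re and the input strings are made up for illustration.

```clojure
;; The example regex, written as a Clojure regex literal.
;; hex-re is a hypothetical name used only in this sketch.
(def hex-re #"foo [0-9a-fA-F]+ (bar|gloop)")

;; One capturing group, so a match is [whole-match group-1].
(re-find hex-re "say foo 1aF bar now")   ; => ["foo 1aF bar" "bar"]

;; "xyz" contains no hexadecimal digits, so nothing matches.
(re-find hex-re "foo xyz gloop")         ; => nil
```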

Regexes cannot describe every set of strings. For example, you cannot write one that matches exactly the strings of balanced parentheses, or exactly the legal C or Clojure programs. They are nonetheless useful for many tasks.

This document does not attempt to give a complete description of everything that regexes are capable of -- only some tips on using them, with examples, plus a few references if you wish to read further.

Regexes in Clojure are Java regexes, implemented by the Java classes java.util.regex.Pattern and java.util.regex.Matcher. Regexes in other variants of Clojure (e.g. Clojure/CLR or ClojureScript) have much in common with Java regexes, but they are not identical. Do not expect 100% compatibility of regex syntax and matching behavior between different regex implementations.
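Because a regex literal is an ordinary Pattern object, the Java classes can be used directly when the Clojure wrappers are not enough. A small sketch, using only clojure.core and java.util.regex, with a made-up input string:

```clojure
;; A #"..." literal is compiled by the reader into a java.util.regex.Pattern.
(class #"\d+")                      ; => java.util.regex.Pattern

;; re-pattern builds the same kind of object from a string at run time.
;; In a string the backslash must be doubled; in a regex literal it is not.
(str (re-pattern "\\d+"))           ; => "\\d+"

;; re-matcher returns a java.util.regex.Matcher, so Java methods work on it.
(let [m (re-matcher #"\d+" "abc 123 def")]
  (when (.find m)
    [(.group m) (.start m) (.end m)]))   ; => ["123" 4 7]
```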

Examples

(require '[clojure.string :as s])

re-find returns nil if there is no match.

user> (re-find #"Time:" "No time like the present")
nil

If the regex contains no parenthesized capturing groups, then on a match re-find returns the part of the string that matches the regex.

user> (re-find #"Time: ..:..:.."
               "At the chime it is Time: 09:58:10.")
"Time: 09:58:10"

If there are capturing groups, then on a match re-find returns a vector of strings. The first is the string that matches the entire regex. Each following string matches one capturing group.

user> (re-find #"Time: (..):(..):(..)"
               "At the chime it is Time: 09:58:10.")
["Time: 09:58:10" "09" "58" "10"]

The strings for the capturing groups are returned in the order in which their left parens appear in the regex.

user> (re-find #"(Time: ((..):(..))):(..)"
               "At the chime it is Time: 09:58:10.")
["Time: 09:58:10" "Time: 09:58" "09:58" "09" "58" "10"]

Destructuring can be a handy way to give names to the strings of these capturing groups. Java regexes have supported named groups, written (?<name>...), since Java 7, but re-find returns groups only by position, so the names are not visible in its result.

user> (let [matches (re-find #"Time: (..):(..):(..)"
                             "At the chime it is Time: 09:58:10.")
            [hours mins secs] (map #(bigint %) (rest matches))
            t (+ (* 3600 hours) (* 60 mins) secs)]
        (println (str t) "seconds since midnight"))
35890 seconds since midnight
nil
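On Java 7 and later, java.util.regex also accepts named groups of the form (?<name>...). re-find does not expose the names, so reading a group by name means calling the Matcher directly. The following is a minimal sketch under that assumption; time-parts is a hypothetical helper, not part of Clojure.

```clojure
;; time-parts is a hypothetical helper written for this example.
;; It assumes a JVM with named-group support (Java 7 or later).
(defn time-parts
  "Returns a map of the named groups for the first time found in s, or nil."
  [s]
  (let [m (re-matcher #"Time: (?<hours>..):(?<mins>..):(?<secs>..)" s)]
    (when (.find m)
      ;; Matcher.group(String) looks a group up by its name.
      {:hours (.group m "hours")
       :mins  (.group m "mins")
       :secs  (.group m "secs")})))

(time-parts "At the chime it is Time: 09:58:10.")
;; => {:hours "09", :mins "58", :secs "10"}
```

For a handful of groups, positional destructuring is usually shorter. Names tend to pay off when a regex has many groups or is edited often.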

You can call clojure.string/replace and clojure.string/replace-first with a string in which to search, a regex, and a function to calculate the replacement strings. This function is called once for each match. Its one argument is the return value from re-find as described above: a string if there are no capturing groups in the regex, or a vector of strings if there are.

user> (s/replace "0x4e out of 0x64 dentists agree they prefer decimal"
                 #"0x([0-9a-fA-F]+)"
                 (fn [[_ s]] (str (BigInteger. s 16))))
"78 out of 100 dentists agree they prefer decimal"

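To make the two argument shapes concrete, here is a short sketch with made-up input strings. It shows the replacement function receiving a plain string when the regex has no capturing groups and a vector when it has some, along with replace-first.

```clojure
(require '[clojure.string :as s])

;; No capturing groups: the function receives the matched string.
(s/replace "a1b22c333" #"\d+" (fn [m] (str "<" m ">")))
;; => "a<1>b<22>c<333>"

;; replace-first calls the function for the first match only.
(s/replace-first "a1b22c333" #"\d+" (fn [m] (str "<" m ">")))
;; => "a<1>b22c333"

;; With capturing groups, the function receives a vector, so destructure it.
(s/replace "10:30 and 9:05" #"(\d+):(\d+)"
           (fn [[_ h m]] (str h "h" m "m")))
;; => "10h30m and 9h05m"
```
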
A parenthesized subexpression is normally a capturing group. If you never want to get back a separate return value for the part of the string matched by that subexpression, you can make it a non-capturing group. The simplest way is to put ?: immediately after the left paren.
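A brief illustration of the difference, using a made-up pattern and input:

```clojure
;; With a capturing group around the alternation, re-find returns it.
(re-find #"(foo|bar)-(\d+)" "see bar-42")     ; => ["bar-42" "bar" "42"]

;; With (?: ... ), the alternation still groups but is not returned.
(re-find #"(?:foo|bar)-(\d+)" "see bar-42")   ; => ["bar-42" "42"]

;; If no capturing groups remain, re-find returns a plain string again.
(re-find #"(?:foo|bar)-\d+" "see bar-42")     ; => "bar-42"
```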

TBD: Simple examples

A capturing group, like most other regex constructs, can be followed by a ? to indicate that the part of the regex just before the ? may appear in the matched string once, but need not appear at all. If such a capturing group is not matched by anything in the string, nil is returned for that capturing group.
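A small sketch of an optional group, with a made-up pattern and inputs, including one way to supply a fallback when the group comes back as nil:

```clojure
;; The optional group (px)? matches, so its string is returned.
(re-find #"(\d+)(px)?" "width: 12px")   ; => ["12px" "12" "px"]

;; The optional group matches nothing here, so its slot is nil.
(re-find #"(\d+)(px)?" "width: 12em")   ; => ["12" "12" nil]

;; Destructuring with a fallback for the nil case.
(let [[_ n unit] (re-find #"(\d+)(px)?" "width: 12em")]
  [n (or unit "unitless")])             ; => ["12" "unitless"]
```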

TBD: Examples

Capturing groups with a * after them also return nil if they match 0 times in the string.

Capturing groups followed by *, +, or {m,n} that match more than once in the string capture only the last occurrence, not all of them. If you want all matches, use another set of parens to make the expression, including the *, +, or {m,n}, into a capturing group. The inner one can be made a non-capturing group if you do not want the result of its last match.
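The nesting described here can be sketched as follows, using a made-up input string:

```clojure
;; The group itself is repeated, so only its last iteration is captured.
(re-find #"term: ([ab])*" "term: abab")       ; => ["term: abab" "b"]

;; An outer group around the repetition captures all of it; the inner
;; group still reports only its last iteration.
(re-find #"term: (([ab])*)" "term: abab")     ; => ["term: abab" "abab" "b"]

;; Making the inner group non-capturing leaves just the full run.
(re-find #"term: ((?:[ab])*)" "term: abab")   ; => ["term: abab" "abab"]
```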

TBD: Examples

In the following examples, the subexpression within the first pair of parentheses is optional, so nil is returned for it when it matches nothing:

user> (re-find #"([a-zA-Z]+)?(\d)+" "word 587")
["587" nil "7"]
user> (re-find #"([a-zA-Z]+)?(\d)+" "word587")
["word587" "word" "7"]

user> (re-find #"([a-zA-Z]+)?(\d+)" "word 587")
["587" nil "587"]
user> (re-find #"([a-zA-Z]+)?(\d+)" "word587")
["word587" "word" "587"]

user> (re-find #"term: ([ab])*" "term: ")
["term: " nil]
user> (re-find #"term: ([ab])*" "term: a")
["term: a" "a"]
user> (re-find #"term: ([ab])*" "term: ab")
["term: ab" "b"]
user> (re-find #"term: ([ab])*" "term: bbbbbbba")
["term: bbbbbbba" "a"]

user> (re-find #"term: ([ab]*)" "term: ")
["term: " ""]
user> (re-find #"term: ([ab]*)" "term: a")
["term: a" "a"]
user> (re-find #"term: ([ab]*)" "term: ab")
["term: ab" "ab"]
user> (re-find #"term: ([ab]*)" "term: bbbbbbba")
["term: bbbbbbba" "bbbbbbba"]

Code

Misc code

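(defn re-group-posns
  "Returns the positions of the most recent match of Matcher m.
  With no capturing groups, returns a single [start end] pair.
  Otherwise returns a vector with one entry per group (group 0 is the
  whole match), using nil for groups that did not take part in the match."
  [^java.util.regex.Matcher m]
  (let [gc (.groupCount m)]
    (if (zero? gc)
      [(.start m) (.end m)]
      (loop [ret [] c 0]
        (if (<= c gc)
          ;; Matcher.start returns -1 for a group that matched nothing.
          (recur (conj ret (if (neg? (.start m c))
                             nil
                             [(.start m c) (.end m c)]))
                 (inc c))
          ret)))))

(defn re-find-pos
  "Like re-find, but returns match positions instead of matched strings.
  Accepts either a Matcher, or a Pattern and a string to search."
  ([^java.util.regex.Matcher m]
   (when (.find m)
     (re-group-posns m)))
  ([^java.util.regex.Pattern re s]
   (let [m (re-matcher re s)]
     (re-find-pos m))))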
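The positions collected above come from Matcher's start and end methods, which return character offsets: start is the index of the first matched character and end is one past the last. A short self-contained sketch of how such a pair relates to the matched text via subs:

```clojure
(let [s "At the chime it is Time: 09:58:10."
      m (re-matcher #"Time: (..):(..):(..)" s)]
  (when (.find m)
    (let [start (.start m 1)    ; offset of the first char of group 1
          end   (.end m 1)]     ; offset just past the last char of group 1
      [start end (subs s start end)])))
;; => [25 27 "09"]
```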

See also

Web app that takes a regex and parses it, explaining its pieces in English: http://www.myezapp.com/apps/dev/regexp/show.ws

TBD: I know the Common Lisp library CL-PPCRE can let you parse a regex and turn it into a kind of parse tree, and also lets you create parse trees and then use them to match. Is there a Java library for this?

Regular expression test page: http://www.regexplanet.com/advanced/java/index.html

Commercial program RegexBuddy for learning regular expressions, and analyzing and explaining them: http://www.regexbuddy.com/

Benchmarks of different Java regex libraries: http://tusker.org/regex/regex_benchmark.html
