Thursday, April 18, 2013

Extracting URLs from text using Java and Regular Brainexplosions

Mankind can fly to the moon and jump from the stratosphere down to the earth, but we can't produce a regular expression that matches all URLs. Oh well...

While searching for a "perfect" regular expression, I came across this site, which shows the results of testing tons of URL-recognizing regular expressions. The result? There's no such thing as a perfect expression, but there are short ones (around 30 characters) and ridiculously long ones (1000+ characters!). Of course I want to include a 1000-character regular expression in my code, right? Escaping it for Java must be a real pleasure...

Eventually, I found one that is only 14 characters long (not counting the delimiters and the g flag): /\bhttps?://\S+/g
Of course, this one doesn't match every possible URL, but I think it gets pretty close to a perfect balance between effort and effectiveness.
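
For what it's worth, here is roughly how that short expression could be used from Java. This is only a minimal sketch: the class name UrlExtractor and the method extractUrls are made up for illustration, and the sample text is invented. The only real work is doubling the backslashes for the Java string literal and looping over Matcher.find(), which takes over the job of the g flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is made up for this sketch.
class UrlExtractor {

    // The short pattern from above. In a Java string literal every backslash is doubled.
    private static final Pattern URL_PATTERN = Pattern.compile("\\bhttps?://\\S+");

    // Returns every substring of the text that the pattern matches, in order of appearance.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        // Each call to find() moves on to the next match, like the /g flag does.
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Invented sample text, just to have something to scan.
        String text = "Docs at http://example.com/docs and https://example.org/a?b=1 are handy.";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

One thing to keep in mind: since \S+ grabs everything up to the next whitespace, a URL that is directly followed by a comma or a full stop will include that punctuation, so some trimming may be needed afterwards.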

If you want to know more about regular expressions for URLs, I recommend reading this article.

Thanks to refiddle.com for making the testing of regular expressions so much easier. :)