Regex to match words or phrases in string but NOT match if part of a URL or inside tags. (php)

I am aware that regex is not ideal for use with HTML strings and I have looked at the PHP Simple HTML DOM Parser but still believe this is the way to go. All the HTML tags will be generated by my forum software so they will be consistent and valid HTML.

What I am trying to do is make a plugin that will find a list of keywords (or phrases) in a string of HTML and replace them with a link I specify. For example if someone types:

I use Amazon for that.

it would replace it with:

I use <a href="http://www.amazon.com">Amazon</a> for that.

The problem, of course, is that if "amazon" appears in the URL it gets replaced as well. I solved that issue with a callback function found on this site, slightly modified.

But I still have an issue: it still replaces words between opening and closing tags.

<a href="http://www.amazon.com">My Amazon Link</a>

It will match the "Amazon" in "My Amazon Link"

What I really need is a regex that matches, say, "amazon" anywhere except between <a href and </a>.

Any ideas?

--------------Solutions-------------

Using the DOM would certainly be preferable.

However, you might get away with this:

$result = preg_replace('%Amazon(?![^<]*</a>)%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);

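It matches Amazon only if no closing </a> tag follows it before the next < character. In practice this means:

  1. it's not matched inside link text, where it is followed by a closing </a> tag,
  2. it's not matched inside the opening <a> tag itself (for example in the href), as long as the link text contains no further tags,
  3. there must be no intervening tags, i.e. it will be thrown off if tags can be nested inside <a> tags.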

It will therefore change this:

I use Amazon for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "Amazon" in "My Amazon Link"

into this:

I use <a href="http://www.amazon.com">Amazon</a> for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "<a href="http://www.amazon.com">Amazon</a>" in "My <a href="http://www.amazon.com">Amazon</a> Link"
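To make limitation 3 concrete, here is a minimal sketch that runs the same pattern over three inputs. The wrapper name linkAmazon and the nested <b> example are assumptions added for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical wrapper around the one-line preg_replace shown above.
function linkAmazon(string $html): string
{
    return preg_replace(
        '%Amazon(?![^<]*</a>)%i',
        '<a href="http://www.amazon.com">Amazon</a>',
        $html
    );
}

// Plain text: the keyword is linked.
$plain = linkAmazon('I use Amazon for that.');

// Existing link without nested tags: left untouched.
$linked = linkAmazon('<a href="http://www.amazon.com">My Amazon Link</a>');

// Existing link WITH a nested tag: the lookahead stops at "<b>" or "</b>"
// before it can see "</a>", so both the href and the link text get rewritten.
$nested = linkAmazon('<a href="http://www.amazon.com">My <b>Amazon</b> Link</a>');
```

If the forum software can emit formatting tags inside links, the third case is the one to watch for.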

Don't do this. You cannot reliably do this with Regex, no matter how consistent your HTML is.

Something like this should work, however:

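<?php
$dom = new DOMDocument;
$dom->load('test.xml'); // the markup to process, loaded from a file
$x = new DOMXPath($dom);

// Text nodes that contain "Amazon" and are not inside an <a> element
$nodes = $x->query("//text()[contains(., 'Amazon')][not(ancestor::a)]");

foreach ($nodes as $node) {
    while (false !== strpos($node->nodeValue, 'Amazon')) {
        // Split the text node into: before | "Amazon" | after
        $word = $node->splitText(strpos($node->nodeValue, 'Amazon'));
        $after = $word->splitText(6); // 6 = strlen('Amazon')

        $link = $dom->createElement('a');
        $link->setAttribute('href', 'http://www.amazon.com');

        // Put the link where the word was, then move the word inside it
        $word->parentNode->replaceChild($link, $word);
        $link->appendChild($word);

        // Continue with the remaining text, which may contain more matches
        $node = $after;
    }
}

$html = $dom->saveHTML();
echo $html;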

It's verbose, but it will actually work.
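Since the question deals with an HTML string rather than a file, the same idea might be adapted roughly as follows. This is only a sketch under stated assumptions: the function name linkKeyword, its parameters, and the wrapping <div> are illustrative and not part of the answer above, and the text is assumed to be plain ASCII because stripos() returns byte offsets while splitText() counts characters.

```php
<?php
// Hypothetical helper; name and signature are illustrative assumptions.
// Assumes ASCII text (byte offsets from stripos() are passed to splitText()).
function linkKeyword(string $html, string $keyword, string $url): string
{
    $dom = new DOMDocument();

    // Wrap the fragment in a <div> so its contents can be read back later.
    // Parser warnings are collected internally instead of being printed.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $xpath   = new DOMXPath($dom);

    // Every text node under the wrapper that is not already inside a link.
    foreach ($xpath->query('.//text()[not(ancestor::a)]', $wrapper) as $node) {
        // stripos() is case-insensitive, unlike XPath contains().
        while (($pos = stripos($node->nodeValue, $keyword)) !== false) {
            $word  = $node->splitText($pos);              // keyword + rest
            $after = $word->splitText(strlen($keyword));  // rest only

            $link = $dom->createElement('a');
            $link->setAttribute('href', $url);
            $word->parentNode->replaceChild($link, $word);
            $link->appendChild($word);

            $node = $after; // keep scanning the remaining text
        }
    }

    // Serialize only the wrapper's children, dropping the added <div>.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = linkKeyword(
    'I use Amazon. <a href="http://www.amazon.com">My Amazon Link</a>',
    'Amazon',
    'http://www.amazon.com'
);
```

Because the ancestor::a test looks at the whole ancestor chain, tags nested inside a link do not confuse it the way they confuse a lookahead.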

Unfortunately I think the logic you need is still more complex than text pattern matching :-/

I know it's not the answer you want to hear, but you'll probably get better results with a DOM model.

Here's a discussion of this topic elsewhere: http://coderzone.org/forum/index.php?topic=84.0

Is it possible to just run the filter once, so you don't end up with dupes? Or could the original corpus also include links?

Try this:

Amazon(?![^<]*</a>)

This searches for Amazon, and the negative lookahead ensures that no closing </a> tag follows it. The lookahead only skips over characters that are not <, so it cannot accidentally read past an opening tag.

http://regexr.com
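One way to check which occurrences survive the lookahead is to list the match offsets. A small sketch follows; the sample string is an assumption chosen so that the keyword appears once in plain text, once in a URL and once in link text.

```php
<?php
// Sample input (an assumption for illustration): three occurrences of the keyword.
$subject = 'Amazon and <a href="http://www.amazon.com">My Amazon Link</a>';

// PREG_OFFSET_CAPTURE returns each match together with its byte offset.
preg_match_all('~Amazon(?![^<]*</a>)~i', $subject, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets. Just the first occurrence (offset 0) is reported:
// for the other two, [^<]* runs straight into "</a>", so the lookahead rejects them.
$offsets = array_column($matches[0], 1);
```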

Use this code:

$p = '~((<a\s)(?(2)[^>]*?>))?(amazon)~smi';

$str = '<a href="http://www.amazon.com">Amazon</a>';

$s = preg_replace($p, "$1My $3 Link", $str);
var_dump($s);

OUTPUT

string(50) "<a href="http://www.amazon.com">My Amazon Link</a>"

Joe, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)

With all the disclaimers about using regex to parse html, here is a simple way to do it.

Here's our simple regex:

<a.*?</a>(*SKIP)(*F)|amazon

The left side of the alternation matches complete <a... </a> tags, then deliberately fails. The right side matches amazon, and we know this is the right amazon because it was not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

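<?php
$target = "word1 <a stuff amazon> </a> word2 amazon";
// (?i) makes the whole pattern case-insensitive
$regex = "~(?i)<a.*?</a>(*SKIP)(*F)|amazon~";
$repl = '<a href="http://www.amazon.com">Amazon</a>';
$new = preg_replace($regex, $repl, $target);
// htmlentities() escapes the result so the markup is visible in a browser
echo htmlentities($new);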

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...
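The question asks for a list of keywords, each with its own link, so the (*SKIP)(*F) idea could be extended along these lines. This is a sketch, not the answer's own code: the function name linkKeywords, the keyword-to-URL map, the eBay entry and the extra branch that skips all other tags are assumptions added for illustration.

```php
<?php
// Hypothetical helper; the name and the $links map are illustrative assumptions.
function linkKeywords(string $html, array $links): string
{
    // Lower-case the keys so the lookup below can ignore case.
    $links = array_change_key_case($links, CASE_LOWER);

    // Quote each keyword so characters such as "." are matched literally.
    $words = array_map(
        function ($w) { return preg_quote($w, '~'); },
        array_keys($links)
    );

    // Branch 1: match a whole <a ...>...</a> element, then skip it.
    // Branch 2: match any other tag, then skip it (protects attribute values).
    // Branch 3: one of the keywords as a whole word.
    $regex = '~<a\b.*?</a>(*SKIP)(*F)|<[^>]*>(*SKIP)(*F)|\b(?:'
           . implode('|', $words) . ')\b~is';

    return preg_replace_callback($regex, function ($m) use ($links) {
        $url = $links[strtolower($m[0])];
        // Keep the matched text as typed; only wrap it in a link.
        return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
    }, $html);
}

$out = linkKeywords(
    'word1 <a href="http://www.amazon.com">My Amazon Link</a> word2 amazon, or try eBay',
    ['Amazon' => 'http://www.amazon.com', 'eBay' => 'http://www.ebay.com']
);
```

Using a callback keeps the keyword as the user typed it, while the map decides which URL each keyword points to. The usual disclaimers about regex and HTML still apply.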

Category:php Time:2011-05-15 Views:0
