Using PowerShell Select-String to Match Regular Expressions

This post provides a simple example of using PowerShell Select-String to match regular expressions.

Select-String and regular expressions are really useful when we want to extract specific data that matches a particular pattern from a string.  Consider this URL:

https://www.alkanesolutions.co.uk/2023/05/05/another-great-blog/

We can see that it contains a date in year/month/day format.  But how can we extract it easily?  We can use regular expressions.  And the expression we’ll use in this example is:

\d{4}/\d{2}/\d{2}

Splitting this up, we are simply searching for four digits (\d{4}) followed by a forward slash, then two digits (\d{2}) followed by a forward slash, then another two digits (\d{2}).
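As a quick sanity check, here is a minimal sketch that runs this ungrouped pattern against the URL above.  It assumes Select-String receives a single string from the pipeline, and the variable name $match is just for illustration.

```powershell
$text = 'https://www.alkanesolutions.co.uk/2023/05/05/another-great-blog/'

# Select-String returns a MatchInfo object; its Matches property holds the regex matches
$match = $text | Select-String -Pattern '\d{4}/\d{2}/\d{2}'

# The first (and only) match is the whole date
$match.Matches[0].Value
```

This gives us the date as one string, but not its individual parts.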

We need to expand on this though, because when we search our string we want to be able to extract the day, the month or the year separately.  And to do this we need to make each part a capture group.  We can do this by simply enclosing each part in parentheses (round brackets) like so:

(\d{4})/(\d{2})/(\d{2})

So if we find a match in our string, group 0 will be the whole match for (\d{4})/(\d{2})/(\d{2}), group 1 will be the year (\d{4}), group 2 the month (\d{2}) and finally group 3 the day (\d{2}).

Here’s a quick example:

$text = "https://www.alkanesolutions.co.uk/2023/05/05/another-great-blog/"

$text | Select-String -Pattern "(\d{4})/(\d{2})/(\d{2})" | Select -Expand Matches | Select -Expand Groups

and the result is:

Groups   : {0, 1, 2, 3}
Success  : True
Name     : 0
Captures : {0}
Index    : 34
Length   : 10
Value    : 2023/05/05

Success  : True
Name     : 1
Captures : {1}
Index    : 34
Length   : 4
Value    : 2023

Success  : True
Name     : 2
Captures : {2}
Index    : 39
Length   : 2
Value    : 05

Success  : True
Name     : 3
Captures : {3}
Index    : 42
Length   : 2
Value    : 05

We can clearly see that there are 4 groups captured in total as mentioned above, and you can see the value of each matching group.  This is handy because it makes it really easy to extract the data we want like so:

$text = "https://www.alkanesolutions.co.uk/2023/05/05/another-great-blog/"

$matchingGroups = $text | Select-String -Pattern "(\d{4})/(\d{2})/(\d{2})" | Select -Expand Matches | Select -Expand Groups

$wholeDate = $matchingGroups | Where Name -eq 0 | Select -Expand Value
$justYear = $matchingGroups | Where Name -eq 1 | Select -Expand Value
$justMonth = $matchingGroups | Where Name -eq 2 | Select -Expand Value
$justDay = $matchingGroups | Where Name -eq 3 | Select -Expand Value

Write-Host $wholeDate
Write-Host $justYear
Write-Host $justMonth
Write-Host $justDay
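An alternative worth considering is named capture groups, which let us ask for each part by name rather than by number.  This is only a sketch: the group names year, month and day are my own choice for illustration, and it assumes the string contains at least one match.

```powershell
$text = 'https://www.alkanesolutions.co.uk/2023/05/05/another-great-blog/'

# (?<name>...) is the .NET syntax for a named capture group
$pattern = '(?<year>\d{4})/(?<month>\d{2})/(?<day>\d{2})'

$result = $text | Select-String -Pattern $pattern

if ($result) {
    # Groups can be indexed by name as well as by number
    $groups    = $result.Matches[0].Groups
    $justYear  = $groups['year'].Value
    $justMonth = $groups['month'].Value
    $justDay   = $groups['day'].Value

    Write-Host ('{0}-{1}-{2}' -f $justYear, $justMonth, $justDay)
}
```

The numbered groups still exist, so $groups[0].Value remains the whole date.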

AppSense Regular Expression for Microsoft Office

I needed to add a new rule to AppSense recently on process start.  I wanted the rule to only run when a Microsoft Office application was run.  We required an AppSense regular expression for Microsoft Office since there are multiple applications within the suite (Word, Excel etc).

Now I usually eat the basic regular expressions for breakfast (with some ketchup on top for good measure).  However, I noticed that my regular expression wasn’t working in AppSense, and it turned out to be down to the flavour of regular expression that AppSense uses!

You see, I tend to use JavaScript regular expressions or .NET regular expressions in my web development projects.  But AppSense was presumably written in C++ and uses the CAtlRegExp class from ATL, which is… lame.  The grouping syntax is different, and so is the character matching syntax.

To test my regular expressions, rather than update the AppSense policy and wait for it to deploy to the machine, I just downloaded the regular expression tester from here.

So this was my first attempt – the MfcRegex tool said it was a successful match!  So I plonked it into AppSense:

.*\\Microsoft Office\\Office\d\d?\\((WINWORD)|(EXCEL)|(POWERPNT)|(MSACCESS)|(OUTLOOK)|(VISIO)|(WINPROJ))\.EXE$

But wait!  AppSense tries to be clever and escapes the brackets with preceding backslashes (I noticed this in the client debug logs), so this RegEx was failing because AppSense was evaluating it to this:

.*\\Microsoft Office\\Office\d\d?\\\(\(WINWORD\)|\(EXCEL\)|\(POWERPNT\)|\(MSACCESS\)|\(OUTLOOK\)|\(VISIO\)|\(WINPROJ\)\)\.EXE$

So by this point I was close to throwing my computer out of the window, until finally I used this syntax which works like a charm:

.*\\Microsoft Office\\Office\d\d?\\{WINWORD}|{EXCEL}|{POWERPNT}|{MSACCESS}|{OUTLOOK}|{VISIO}|{WINPROJ}\.EXE$

Notice that I have swapped the round brackets for curly braces (which CAtlRegExp uses for match groups) and dropped the outer group.  If you wanted to limit it to a specific version of Office (2010 in my case) you can use a regular expression similar to this:

.*\\Microsoft Office\\Office14\\{WINWORD}|{EXCEL}|{POWERPNT}|{MSACCESS}|{OUTLOOK}|{VISIO}|{WINPROJ}\.EXE$
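
For comparison, here is a rough sketch of how the round-bracket first attempt, and the escaped version AppSense turned it into, behave under the .NET flavour in PowerShell.  It says nothing about how CAtlRegExp evaluates them, and the process paths are made up for illustration.

```powershell
# .NET flavour only; AppSense itself uses ATL's CAtlRegExp, which behaves differently
$firstAttempt = '.*\\Microsoft Office\\Office\d\d?\\((WINWORD)|(EXCEL)|(POWERPNT)|(MSACCESS)|(OUTLOOK)|(VISIO)|(WINPROJ))\.EXE$'

# The same expression with every bracket escaped, as seen in the client debug logs
$escaped = '.*\\Microsoft Office\\Office\d\d?\\\(\(WINWORD\)|\(EXCEL\)|\(POWERPNT\)|\(MSACCESS\)|\(OUTLOOK\)|\(VISIO\)|\(WINPROJ\)\)\.EXE$'

# Hypothetical process paths
$paths = @(
    'C:\Program Files\Microsoft Office\Office14\WINWORD.EXE',
    'C:\Program Files\Microsoft Office\Office14\EXCEL.EXE',
    'C:\Program Files\Microsoft Office\Office14\ONENOTE.EXE',
    'C:\Windows\System32\notepad.exe'
)

$results = foreach ($path in $paths) {
    [pscustomobject]@{
        Path         = $path
        FirstAttempt = ($path -match $firstAttempt)   # brackets act as groups
        Escaped      = ($path -match $escaped)        # brackets are literal characters
    }
}

$results | Format-Table -AutoSize
```

Once the brackets are escaped they only match literal ( and ) characters, so no ordinary path can match.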

Strip out style attributes in HTML

This post describes the process I use to strip out style attributes in HTML code using a regular expression.

My website presents data from a field in SharePoint.  This field uses HTML and CSS style attributes to construct the note.  A user enters this data via a SharePoint website, and my .NET website presents it elsewhere.  The trouble is, when my site presents this data the message can look like a right mess: different fonts, different sizes and different colours (you’ve met those idiots before who like to use Comic Sans font in a professional environment, right?).  So before I present the data in a Literal control I decided to use regular expressions to strip out any style/class attributes etc.  And here is the .NET function (which I have in a class):

    // Requires: using System.Text.RegularExpressions;
    // Function to strip CSS styles etc. from SharePoint notes
    public static string stripStyles(string message)
    {
        // Replace non-ASCII characters with an empty string
        message = Regex.Replace(message, @"[^\u0000-\u007F]", string.Empty);

        // Replace 3 or more consecutive BR tags with a single BR
        message = Regex.Replace(message, "(?:\\s*<br[/\\s]*>\\s*){3,}", "<br />");

        // Remove any style attributes (value in double or single quotes).
        // The value may contain brackets, e.g. rgb(255, 0, 0).
        message = Regex.Replace(message, "\\s+style\\s*=\\s*(\"[^\"]*\"|'[^']*')", "");

        // Remove any class attributes
        message = Regex.Replace(message, "\\s+class\\s*=\\s*(\"[^\"]*\"|'[^']*')", "");

        // Remove empty p tags, including ones that hold only a stray '?'
        message = Regex.Replace(message, "<p>\\s*\\??</p>", "");

        // Remove opening and closing font tags, keeping the text inside them
        message = Regex.Replace(message, "</?(font)[^>]*>", "");

        return message;
    }

It won’t produce perfect results.  Note that the last step also strips the <font> tags scattered about in these messages.  I suspect the WYSIWYG editor auto-generates <font> tags to highlight (bold/colour) certain words, so leave that line out if you want to keep the highlighting.
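
One thing worth knowing when matching attribute values: inside a character class the characters (, | and ) are literal, so a class like [^("|')] excludes brackets and pipes as well as quotes.  Here is a small sketch in PowerShell (which uses the same .NET regular expression engine) with a made-up sample string, where $loose and $strict are just illustrative names.

```powershell
# Hypothetical sample markup
$html = '<p style="color: rgb(255, 0, 0)">Hello</p>'

# This character class also excludes ( | and ), so the rgb(...) value stops the match
$loose = 'style=("|'')[^("|'')]*("|'')'
$looseResult = $html -replace $loose, ''

# Matching a double-quoted or single-quoted value explicitly avoids the problem
$strict = '\s+style\s*=\s*("[^"]*"|''[^'']*'')'
$strictResult = $html -replace $strict, ''

Write-Host $looseResult    # the style attribute is still there
Write-Host $strictResult   # <p>Hello</p>
```

Including the leading whitespace in the match also stops a stray space being left behind inside the tag.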