This post describes the process I use to strip out style attributes in HTML code using a regular expression.
My website presents data from a field in SharePoint. This field uses HTML and CSS style attributes to construct the note. A user enters the data via a SharePoint website, and my .NET website presents it elsewhere. The trouble is, when my site presents this data the message can look like a right mess: different fonts, different sizes and different colours (you’ve met those idiots before who like to use Comic Sans in a professional environment, right?). So before I present the data in a Literal control, I decided to write a few regular expressions to strip out any style/class attributes etc. Here is the .NET function (which I have in a class):
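using System.Text.RegularExpressions;

// Function to strip CSS styles etc. from SharePoint notes
public static string stripStyles(string message)
{
    // Replace non-ASCII characters with an empty string
    message = Regex.Replace(message, @"[^\u0000-\u007F]", string.Empty);
    // Replace 3 or more consecutive BR tags with a single BR
    message = Regex.Replace(message, @"(?:\s*<br[/\s]*>\s*){3,}", "<br />", RegexOptions.IgnoreCase);
    // Remove any style attributes; \1 requires the closing quote to match the opening one,
    // so values containing parentheses or the other quote type are still removed
    message = Regex.Replace(message, @"\s+style\s*=\s*([""']).*?\1", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Remove any class attributes
    message = Regex.Replace(message, @"\s+class\s*=\s*([""']).*?\1", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Remove empty p tags, including ones that contain only a stray "?"
    message = Regex.Replace(message, @"<p>\s*\??</p>", "", RegexOptions.IgnoreCase);
    // Remove opening and closing font tags (their inner text is kept)
    message = Regex.Replace(message, @"</?font[^>]*>", "", RegexOptions.IgnoreCase);
    return message;
}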
It won’t produce perfect results. Note that the last replacement strips the <font> tags that are scattered about in these messages. I suspect <font> tags may be used to highlight (bold/colour) certain words (auto-generated from the WYSIWYG editor), so remove that line if you would rather leave them alone.
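To sanity-check the attribute patterns, it can help to run them against a sample note. The following is only a minimal sketch: the class name StyleStripDemo, the helper StripAttributes and the sample markup are made up for illustration, and the pattern is a backreference-based variant that handles style and class in one pass, not necessarily the exact expression used in the function above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleStripDemo
{
    // Hypothetical helper: removes style and class attributes only.
    // ([""']) captures the opening quote and \1 requires the same quote to close,
    // so a value such as "font-family: 'Arial'" is matched as a whole.
    public static string StripAttributes(string html)
    {
        return Regex.Replace(
            html,
            @"\s+(?:style|class)\s*=\s*([""']).*?\1",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Made-up sample resembling WYSIWYG editor output
        string note = "<p class=\"ms-rteElement-P\" style=\"font-family: 'Comic Sans MS'; color: rgb(255, 0, 0)\">Hello</p>";

        // Prints: <p>Hello</p>
        Console.WriteLine(StripAttributes(note));
    }
}
```

The sample deliberately includes parentheses and single quotes inside a double-quoted value, since those are the cases where a negated character class such as [^(\"|')] stops matching early.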