Clean up pasted text in WordPress
I have a love/hate relationship with the paste plugin in TinyMCE, the JavaScript WYSIWYG editor that ships with WordPress. In recent releases TinyMCE has gotten very good at sanitizing text pasted from MS Word, but it is still much more permissive than I would like.
A long-standing gripe is that classes are added to every paragraph and span when I paste text into the editor:
<p class="p1">This is <span class="s1">pasted</span> text</p>
This odd behaviour is consistent across many applications – InDesign, TextEdit, Mail – so I suspect it may be occurring at an OS level. In any case, I don’t want class attributes finding their way into markup unless I put them there!
A more significant problem is that TinyMCE doesn’t filter out HTML tags from pasted content. If you copy and paste text from a website you might unknowingly bring its markup along for the ride, and end up with something like this:
<div class="row">
    <div class="col-8">
        <table>
            <tr>
                <td><p>This is pasted text</p></td>
            </tr>
        </table>
    </div>
</div>
I’m sure you can imagine how disastrous that extra markup could be on your website’s front end.
Unfortunately, TinyMCE doesn’t have a configuration option to specify which tags are permitted in pasted markup. The valid_elements option allows us to define which elements will remain in the edited text when TinyMCE saves, but that is an overzealous solution. On occasion you might need to add markup via TinyMCE, and valid_elements would strip it out again when you save your post. What is needed is a way to clean up text when it is pasted into TinyMCE, but still allow the user to type markup into the editor if they choose.
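For comparison, a valid_elements setting might look something like the sketch below. The element list is only an illustration and the settings variable name is hypothetical; the point is that the rule applies to everything in the editor, not just to pasted content.

```javascript
// Illustrative TinyMCE settings fragment. The element list is an assumption,
// not a recommendation. valid_elements is enforced on all editor content,
// so anything outside this list is removed even if it was typed by hand.
var settings = {
  // 'strong/b' keeps <strong> and converts <b> to it; likewise 'em/i'.
  valid_elements: 'p,strong/b,em/i,h3,h4,h5,h6,ul,ol,li'
};
```

With a setting like this, a table typed deliberately into the editor would be lost on save, which is exactly the behaviour we want to avoid.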
The solution comes in the form of the paste_preprocess option, which allows us to specify a callback that will be executed when content is inserted into the editor. Within that callback we can strip out any tags that we don’t want, and remove class, id and any other undesirable attributes from our content.
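The shape of such a callback can be sketched in isolation. In the snippet below the args object is mocked by hand (fakeArgs is a hypothetical stand-in) so that it runs outside the editor; inside TinyMCE the paste plugin supplies it, and whatever is left in args.content when the callback returns is what gets inserted. The trim() call is just a placeholder for real cleanup.

```javascript
// Minimal sketch of a paste_preprocess callback.
// The callback does not return anything: it modifies args.content in place.
function pastePreprocess(plugin, args) {
  args.content = args.content.trim();
}

// Hand-made stand-in for the object the paste plugin passes in.
var fakeArgs = { content: '  <p class="p1">This is pasted text</p>\n' };
pastePreprocess(null, fakeArgs);

console.log(fakeArgs.content);
```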
In WordPress you can hook into TinyMCE’s configuration using the tiny_mce_before_init filter. In your theme’s functions.php file:
add_filter('tiny_mce_before_init', 'configure_tinymce');

/**
 * Customize TinyMCE's configuration
 *
 * @param array
 * @return array
 */
function configure_tinymce($in) {
    $in['paste_preprocess'] = "function(plugin, args){
        // Strip all HTML tags except those we have whitelisted
        var whitelist = 'p,span,b,strong,i,em,h3,h4,h5,h6,ul,li,ol';
        var stripped = jQuery('<div>' + args.content + '</div>');
        var els = stripped.find('*').not(whitelist);
        for (var i = els.length - 1; i >= 0; i--) {
            var e = els[i];
            jQuery(e).replaceWith(e.innerHTML);
        }
        // Strip all class and id attributes
        stripped.find('*').removeAttr('id').removeAttr('class');
        // Return the clean HTML
        args.content = stripped.html();
    }";
    return $in;
}
We use jQuery to populate the DOM with the pasted content, then traverse the DOM and remove tags and classes we don’t want. Finally we extract the cleaned HTML markup and pass it back to the editor. This approach may seem convoluted, but it is far less error-prone than using regular expressions to sanitize our content.
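To illustrate why regular expressions are fragile here, consider a naive tag-stripping pattern. This is a deliberately simple sketch, not something taken from TinyMCE or WordPress; the naiveStripTags name, the pattern and the sample markup are all made up for the example. An attribute value containing > is enough to break it:

```javascript
// Naive approach: treat everything between '<' and the next '>' as a tag.
var naiveStripTags = function (html) {
  return html.replace(/<[^>]+>/g, '');
};

// The '>' inside the title attribute ends the match too early,
// so part of the attribute leaks into the "clean" text.
var pasted = '<p title="1 > 0">This is pasted text</p>';
var result = naiveStripTags(pasted);

console.log(result); // prints ' 0">This is pasted text'
```

A DOM-based approach avoids this class of problem because the browser's HTML parser, rather than a hand-written pattern, decides where each tag begins and ends.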
To adapt the filter to your own site, all you need to do is modify the whitelist variable, a comma-separated list of the tags that are allowed past the filter.
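For example, to also allow links and blockquotes, the list could be extended as sketched below. Building the string from an array is just one way to keep a longer list readable; the allowedTags name and the two extra tags are assumptions about what a site might want.

```javascript
// Hypothetical extended whitelist: the original tags plus 'a' and 'blockquote'.
var allowedTags = [
  'p', 'span', 'b', 'strong', 'i', 'em',
  'h3', 'h4', 'h5', 'h6',
  'ul', 'li', 'ol',
  'a', 'blockquote'
];

// jQuery's .not() takes a comma-separated selector, so join the tag names.
var whitelist = allowedTags.join(',');

console.log(whitelist);
```

Bear in mind that the filter only removes id and class attributes, so an allowed a element keeps its href, along with any other attributes it was pasted with.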
One last thing: you might notice that our entire JavaScript function is contained in a string. This might seem odd, but it is necessary, since WordPress stores TinyMCE’s configuration options as an array of strings.