Clean up pasted text in WordPress

I have a love/hate relationship with the paste plugin in TinyMCE, the JavaScript WYSIWYG editor that ships with WordPress. In recent releases TinyMCE has gotten very good at sanitizing text pasted from MS Word, but it is still much more permissive than I would like.

A long-standing gripe is that classes are added to every paragraph and span when I paste text into the editor:

<p class="p1">This is <span class="s1">pasted</span> text</p>

This odd behaviour is consistent across many applications – InDesign, TextEdit, Mail – so I suspect it may be occurring at an OS level. In any case, I don’t want class attributes finding their way into markup unless I put them there!

A more significant problem is that TinyMCE doesn’t filter out HTML tags from pasted content. If you copy and paste text from a website you might unknowingly bring its markup along for the ride, and end up with something like this:

<div class="row">
  <div class="col-8">
    <table>
      <tr>
        <td><p>This is pasted text</p></td>
      </tr>
    </table>
  </div>
</div>

I’m sure you can imagine how disastrous that extra markup could be on your website’s front end.

Unfortunately TinyMCE doesn’t have a configuration option to specify which tags are permitted in pasted markup. The valid_elements option allows us to define which elements will remain in the edited text when TinyMCE saves, but that is an overzealous solution. On occasion you might need to add markup via TinyMCE, and valid_elements would strip it out again when you save your post. What is needed is a way to clean up text when it is pasted into TinyMCE, but still allow the user to type markup into the editor if they choose.

The solution comes in the form of the paste_preprocess option, which allows us to specify a callback that will be executed when content is inserted into the editor. Within that callback we can strip out any tags that we don’t want, and remove class, id and any other undesirable attributes from our content.

In WordPress you can hook into TinyMCE’s configuration using the tiny_mce_before_init filter. In your theme’s functions.php file:

add_filter('tiny_mce_before_init','configure_tinymce');

/**
 * Customize TinyMCE's configuration
 *
 * @param   array  $in  TinyMCE init settings
 * @return  array       Settings with a paste_preprocess callback added
 */
function configure_tinymce($in) {
  $in['paste_preprocess'] = "function(plugin, args){
    // Strip all HTML tags except those we have whitelisted
    var whitelist = 'p,span,b,strong,i,em,h3,h4,h5,h6,ul,li,ol';
    var stripped = jQuery('<div>' + args.content + '</div>');
    var els = stripped.find('*').not(whitelist);
    for (var i = els.length - 1; i >= 0; i--) {
      var e = els[i];
      jQuery(e).replaceWith(e.innerHTML);
    }
    // Strip all class and id attributes
    stripped.find('*').removeAttr('id').removeAttr('class');
    // Return the clean HTML
    args.content = stripped.html();
  }";
  return $in;
}

We use jQuery to populate the DOM with the pasted content, then traverse the DOM and remove tags and classes we don’t want. Finally we extract the cleaned HTML markup and pass it back to the editor. This approach may seem convoluted, but it is far less error-prone than using regular expressions to sanitize our content.
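
The "remove the tag but keep its content" step can also be done by moving child nodes instead of re-parsing innerHTML. What follows is only a minimal sketch of that idea using plain DOM methods rather than jQuery, not the code used above. The function names unwrapElement and stripDisallowedTags are hypothetical, and the sketch assumes a browser that supports Element.matches.

```javascript
// Move an element's children up into its parent, then remove the
// now-empty element. Child nodes are moved, not re-parsed.
function unwrapElement(el) {
  var parent = el.parentNode;
  while (el.firstChild) {
    parent.insertBefore(el.firstChild, el);
  }
  parent.removeChild(el);
}

// Unwrap every descendant of `root` that does not match `whitelist`,
// a comma-separated selector list like the one in the article.
function stripDisallowedTags(root, whitelist) {
  // querySelectorAll returns a static list, so the tree can be
  // modified while we loop over it.
  var els = root.querySelectorAll('*');
  for (var i = 0; i < els.length; i++) {
    if (!els[i].matches(whitelist)) {
      unwrapElement(els[i]);
    }
  }
}
```

Inside a paste_preprocess callback this could be wired up by creating a container with document.createElement('div'), assigning args.content to its innerHTML, calling stripDisallowedTags(container, whitelist), and writing container.innerHTML back to args.content.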

To adapt the filter, modify the whitelist variable, a comma-separated list of the tags that are allowed through.
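
Attributes can be handled more aggressively too. A question that comes up in the comments below is how to strip every attribute (align, start and so on) without listing each one. The following is a rough sketch under the assumption that plain DOM methods are acceptable; removeAllAttributes and its optional keep parameter are hypothetical names, not part of jQuery or TinyMCE.

```javascript
// Remove every attribute from `el`, except names listed in `keep`
// (a hypothetical optional parameter, e.g. ['href']).
function removeAllAttributes(el, keep) {
  keep = keep || [];
  // Copy the names first: el.attributes is a live collection that
  // shrinks as attributes are removed.
  var names = Array.prototype.map.call(el.attributes, function (attr) {
    return attr.name;
  });
  names.forEach(function (name) {
    if (keep.indexOf(name) === -1) {
      el.removeAttribute(name);
    }
  });
}
```

With the jQuery code above, the line that strips class and id could then become stripped.find('*').each(function () { removeAllAttributes(this); }); provided the helper is defined inside the callback string.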

One last thing: You might notice that our entire JavaScript function is contained in a string. This might seem odd, but is necessary since WordPress stores TinyMCE’s configuration options as an array of strings.

13 thoughts on “Clean up pasted text in WordPress”

  1. Josh Visick says:

    Thanks for this. Just what I needed to figure out how to do!

    Just a note for others that might want to use this with bbpress forums that it needs to be hooked onto teeny_mce_before_init.

  2. Rocco says:

    Amazing!
    Thank you very much, you saved me!

  3. Joseph says:

    Nice!

    Only issue with the code above is that if you have nested non-whitelisted elements, the inner element will not get removed because innerHTML takes the string as is, so the reference to the inner element gets lost. I would suggest using e.childNodes instead, or jQuery’s unwrap method, which does the same thing:

    function(plugin, args) {
      var whitelist = 'p,span,b,strong,i,em,h3,h4,h5,h6,ul,li,ol';
      var $fragment = jQuery('<div>' + args.content + '</div>');

      $fragment.find('*').not(whitelist).each(function(i, element) {
        jQuery(element).unwrap();
      });

      $fragment.find('*').removeAttr('id').removeAttr('class');

      args.content = $fragment.html();
    }

    I wish I knew how to properly format a code block here :(

  4. Jonathan says:

    @Joseph Good suggestion. Thanks.

  5. Ismael Latorre says:

    Out of curiosity, of the two versions proposed, yours and Joseph’s via comments, which one would you say works best? If Joseph’s, could you please post the entire code so that we can make it work? This would make an ideal, must-use wp plugin. Thanks a lot.

  6. Nate311 says:

    Thanks man! Was getting SO annoyed with this…

  7. Nick says:

    I had to update the code to use this rather than element.

    function configure_tinymce($in) {
      $in['paste_preprocess'] = "function(plugin, args) {
        var whitelist = 'p,span,b,strong,i,em,ul,li,ol';
        var stripped = jQuery('<div>' + args.content + '</div>');
        stripped.find('*').not(whitelist).each(function(i, element) {
          jQuery(this).unwrap();
        });
        stripped.find('*').removeAttr('id').removeAttr('class');
        args.content = stripped.html();
      }";
      return $in;
    }

    add_filter('tiny_mce_before_init','configure_tinymce');

  8. Jurgen says:

    How about the solution that is mentioned here: http://www.wizzud.com/2014/02/14/force-paste-as-text-on-in-wordpress/

    Seems to be a lot simpler. Does that one have any downsides, or does your solution have any advantages?

    And what about the plugin that seems to do the same: https://wordpress.org/plugins/paste-as-plain-text/

    I’m not a coder, so I cannot decide what solution would be the best to use.

  9. Sofian says:

    So great I found someone that is working on that issue. My great vision is as well to get pasted content come out in excellent shape, so I don’t have to bother clients and friends with these copy-paste issues, they will never understand such behaviour and it leaves big questions.

    The original code works fine so far, but I wonder about the nested issue that Joseph mentioned, and I don’t understand what Nick means. His code is giving a parsing error by the way, something with an ‘{‘ I couldn’t figure out.

    Jonathan, would you consider opening that box once more and bringing it to a result that could even become a superb plugin? I am very glad you shared this, and I wonder why such a great necessity hasn’t been fixed long ago, but there is much more I wonder about in this world :)

    I am not a coder, I “speak” HTML and CSS and tiny bit of PHP, so I can not do it myself, but I would like to bring some ideas hoping you want to continue on this subject, or to find another developer here who has time to extend the code a bit.

    So one more nice feature in some situations would be if div’s would not be stripped but replaced with p. This would bring often a problem of double and nested p of course which would need to be removed, which would anyhow be a nice additional feature getting closer to a perfect paste solution, so to also strip out double paragraphs and linebreaks as ( br /)
    ( br /)
    or ( p )
    ( p )
    ( /p )
    ( /p )
    How can this be achieved? (I replaced the < with () so they are not troubling the comment area)

    Also I have cases of (p align="justify") that should also be stripped out, basically everything inside of a tag which is not the pure tag itself, if this can be achieved in general.

    Thanks a lot for this initiative
    Sofian

  10. Sofian says:

    Wow I solved the first one myself. In the line
    stripped.find('*').removeAttr('id').removeAttr('class');

    I added it
    stripped.find('*').removeAttr('id').removeAttr('class').removeAttr('align');

    So I learned a bit more of JS today I guess :)
    But is there a way to remove all of them without listing all of them?
    I also discovered an (ol start="3") and I would like to catch everything with a wildcard.
    I tried .removeAttr('*') but then it is not pasting text at all any longer :(

  11. Jonathan says:

    A few updates on this. I have looked into Joseph’s code, but I can’t actually reproduce the problem it is meant to fix. When I use my original code and paste in content that contains multiple non-whitelisted elements (a table element which contains nested tr and td elements for instance) the blacklisted elements are all stripped, but their inner content remains. Perhaps I am misunderstanding the problem?

    Also when I tried to use unwrap() on certain content the entire content was stripped, which makes me nervous about this approach. This happened when I tried copying tables from a website and pasting them into TinyMCE.

    @Jurgen The other options you have mentioned strip all formatting from the pasted content. My solution can be fine-tuned to strip only certain tags and attributes from the pasted content, which might be useful if, for example, you want to strip tables and class attributes but allow bolding and italicisation.

    @Sofian I don’t think that jQuery offers a simple way to remove all attributes from an element. But there are ways of doing it. Do a Google search for “jquery remove all attributes” and you will find a few methods.

  12. Dona says:

    Thanks! Saved a lot of time.

  13. Filipe says:

    I’d buy you a beer, thanks!
