htmlArea

A directory of browser-based WYSIWYG editors


Character encoding and quote marks


The htmlArea 2 & 3 editors have been discontinued.

We've made these forums available as a read-only reference and knowledge-base for people using or developing editors based on htmlArea 2 or 3.

Anyone who is interested in taking over version 2 or 3 is free to do so. All we ask is that you choose a new name that doesn't have "htmlarea" in it to avoid confusion with this site. We'll even give you a link in the directory to make it easier for people to find you. If you are developing or hosting an htmlArea-based editor under a new name, please submit it to our directory.

 


fweisser
New User

Feb 13, 2005, 12:12 PM

Post #1 of 6 (35118 views)
Character encoding and quote marks

First off, I'm happy to see htmlArea is getting back on track.

Problem: Using RC3, when I copy/paste a block of Word text that uses smart/curly quote marks (on both single and double quotes), the output renders weird characters instead of quote marks:

‘You had best answer or I will tell Mother that you were out of the house!’

In RC2, the exact same line of text is correctly rendered:

You had best answer or I will tell Mother that you were out of the house!

htmlArea v.2.0 also handles this incoming text correctly, I might add.

I am using the same browser (Mozilla 1.7.3) in both cases. The display pages use the same character set definition tag:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

Is there some clean-up code from RC2 I can insert?

fweisser


rick_deckard
New User

Feb 13, 2005, 1:48 PM

Post #2 of 6 (35100 views)
Re: [fweisser] Character encoding and quote marks

Have you tried it with UTF-8? ISO-8859-1 does not allow for curly quotes. In MS apps they come either from the Windows-1252 charset (in which case you need to set the charset to Windows-1252) or from Unicode (in which case UTF-8 would work).
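To make the charset difference concrete, here is a minimal sketch in Python (used purely for illustration; nothing in this thread runs Python, and the sample string is made up). It shows that the curly quotes exist in Windows-1252 and in UTF-8 but not in ISO-8859-1, and how UTF-8 bytes read as Windows-1252 turn into "weird characters".

```python
# Curly single quotes are the Unicode code points U+2018 and U+2019.
text = "\u2018Mother\u2019"

# Windows-1252 has single-byte slots for them (0x91 and 0x92).
cp1252_bytes = text.encode("cp1252")

# UTF-8 encodes each curly quote as three bytes.
utf8_bytes = text.encode("utf-8")

# ISO-8859-1 has no curly quotes at all, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Reading UTF-8 bytes as if they were Windows-1252 produces mojibake.
mojibake = utf8_bytes.decode("cp1252")

print(cp1252_bytes)
print(utf8_bytes)
print(latin1_ok)
print(mojibake)
```

If the garbage on a page looks like the last printed line, the bytes are most likely UTF-8 being displayed under a single-byte charset, which is one possible explanation for the symptom, not a confirmed diagnosis of RC3.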

Another option is to capture it on the back end and convert from Win-1252 to entities or UTF-8, as you wish, using something like the PHP multibyte functions (mb_detect_order, mb_detect_encoding, mb_convert_encoding).
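As a rough illustration of that back-end conversion, here is a Python sketch rather than the PHP functions named above. The function names and the sample bytes are hypothetical, and it assumes the submitted bytes really are Windows-1252.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def cp1252_to_entities(raw: bytes) -> str:
    """Return ASCII text, with every non-ASCII character written as a
    numeric character reference such as &#8220;."""
    decoded = raw.decode("cp1252")
    return decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical form submission containing curly quotes and an en dash.
submitted = b"\x93Hello\x94 \x96 it\x92s fine"

print(cp1252_to_utf8(submitted))
print(cp1252_to_entities(submitted))
```

The entity form is safe to store and serve under any charset, at the cost of longer text. Note that a strict decode raises UnicodeDecodeError on the five byte values Windows-1252 leaves unassigned, so real input may need an error-handling policy.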

You might also be able to catch this on the client side: check for code points that are allowed in Win-1252 but not in ISO-8859-1 and convert them to entities. I'm not sure how to do that with JavaScript, though.


fweisser
New User

Feb 13, 2005, 2:36 PM

Post #3 of 6 (35090 views)
Re: [rick_deckard] Character encoding and quote marks

I imagine I'm going to have to, but mostly I am curious as to why there is this large change from RC2 to RC3, as the *only* difference in either the input form or the display page on my dev site is the version of htmlArea I'm using. It looks like RC2 may replace the MS characters (a *good* thing), but RC3 leaves them in.

Also, I notice RC3 has the same infuriating insert-line-break behavior as tinyMCE, which is not apparent in RC2. In this case, I copy/paste a block of text from a word processor into the editor window and submit. The editor inserts line breaks that are not present in the original text. Before clicking "Submit", the text looks like a single, unbroken paragraph. Looking at the source code before submission shows a single unbroken line.

I have processing code on the other end (legacy) that looks for line breaks in the text and converts them to break tags. This feature allows code without any HTML markup to be read with simple line breaks intact, and preserves paragraphs and lines for people who have no HTML formatting skills and do not use a browser that can support the newer editors - like all Mac OS9 users.

This code (ColdFusion) is finding line break characters [ chr(13) ] scattered throughout single paragraphs in the RC3 processed code. There is also a profusion of empty paragraph tags that are not generated by RC2.

I'm a little dismayed by changes that are substantially messing up very simple text blocks - plain paragraphs from Word that have little except bold and italics. Since I am seeing the exact same problems in a rival in-page editor, I wonder if this has to do with modifications to make the editor more stable in Gecko.

Probably I'll stay with RC2, warts and all, since it does a better job handling simple text than RC3 or tinyMCE.

Thanks for the suggestions, in any event. I am going to do a little CF research and see about removing Win-1252 characters prior to storing in the database.

fweisser


rick_deckard
New User

Feb 14, 2005, 11:59 AM

Post #4 of 6 (35067 views)
Re: [fweisser] Character encoding and quote marks

Since you can use CF for line breaks, you can use it to clean the Windows-1252 codes and then just serve the pages as UTF-8.

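Since ASCII is a subset of ISO-8859-1, which is a subset of Unicode, the code points are the same. For plain ASCII there is no ASCII >> UTF-8 conversion to worry about, because the bytes are identical. ISO-8859-1 characters above 127 keep their code points but take two bytes in UTF-8, so text stored as raw ISO-8859-1 bytes does need an ISO-8859-1 >> UTF-8 re-encoding (or conversion to entities). With that done, "modern" versions of Word, which use Unicode, will render fine, as will any old legacy text that is in ASCII or ISO-8859-1.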
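A quick check of the byte-level picture, as a hedged Python sketch (the sample strings are made up): the code points agree across ASCII, ISO-8859-1 and Unicode, but only ASCII text shares its bytes with UTF-8.

```python
ascii_text = "plain text"
latin1_text = "caf\u00e9"  # U+00E9 (e with acute accent), present in ISO-8859-1

# ASCII text yields the same bytes whether encoded as ASCII or as UTF-8.
ascii_same = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# ISO-8859-1 stores U+00E9 as the single byte 0xE9 ...
latin1_bytes = latin1_text.encode("iso-8859-1")

# ... while UTF-8 stores the same code point as two bytes, 0xC3 0xA9.
utf8_bytes = latin1_text.encode("utf-8")

# The ISO-8859-1 byte value equals the Unicode code point number.
code_point_same = latin1_bytes[-1] == ord("\u00e9")

print(ascii_same, code_point_same)
print(latin1_bytes, utf8_bytes)
```

So accented legacy text served with a UTF-8 header but stored as ISO-8859-1 bytes would still need re-encoding, while pure ASCII text would not.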

Then check out Jukka Korpela's page for all the necessary conversions. On submit, search for [chr(159)] or however you do it in CF and replace with the appropriate entity.

Jukka's page is:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
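The search-and-replace step might look something like the following Python sketch. It is not ColdFusion and not taken from Jukka's page; the function name is hypothetical. It assumes the submitted text was decoded as ISO-8859-1, so the Windows-1252 specials show up as code points 128 to 159, and it borrows the mapping from Python's built-in cp1252 codec instead of a hand-typed table.

```python
def win1252_specials_to_entities(text: str) -> str:
    """Replace code points 128-159 with numeric entities for the
    characters that Windows-1252 assigns to those byte values."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                real = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # Five of these slots are unassigned in Windows-1252;
                # this sketch leaves such characters untouched.
                out.append(ch)
            else:
                out.append(f"&#{ord(real)};")
        else:
            out.append(ch)
    return "".join(out)


# chr(145) and chr(146) are the curly single quotes in Windows-1252.
sample = chr(145) + "You had best answer" + chr(146)
print(win1252_specials_to_entities(sample))
```

Running the replacement before the text is stored keeps the database free of charset-specific bytes, which fits the "fix it on submit" approach described above.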

It's really not that hard to fix server side. (I'm not a JS guy, so I don't know how hard it is to fix client side.)

Tom


fweisser
New User

Feb 14, 2005, 8:00 PM

Post #5 of 6 (35047 views)
Re: [rick_deckard] Character encoding and quote marks

This is great info! Thanks for taking the time to write it up. I appreciate the help.

Fixing the data before it hits the database is my preference, which makes display easier on everyone, especially those with slow connections and old browsers.

Hmm, now you're giving me some ideas for cleaning out stray break tags in the incoming code. And maybe a regular expression to rip out a few empty paragraph tags. If I find a fix for the CF coders out there, I'll post it.
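For anyone wondering what such a regular expression might look like, here is a rough sketch in Python rather than CF. The patterns and the function name are assumptions, it treats a "stray" break tag as one sitting right before a closing paragraph tag, and regex-based HTML cleanup is fragile, so treat it as a starting point only.

```python
import re

# A paragraph holding nothing but whitespace, &nbsp; or break tags.
EMPTY_P = re.compile(
    r"<p(?:\s[^>]*)?>(?:\s|&nbsp;|<br\s*/?>)*</p>", re.IGNORECASE
)

# One or more break tags sitting directly before a closing </p>.
TRAILING_BR = re.compile(r"(?:<br\s*/?>\s*)+(?=</p>)", re.IGNORECASE)


def tidy_markup(html: str) -> str:
    """Drop trailing break tags inside paragraphs, then empty paragraphs."""
    html = TRAILING_BR.sub("", html)
    return EMPTY_P.sub("", html)


messy = "<p>One</p><p>&nbsp;</p><p> </p><P><br /></P><p>Two<br></p>"
print(tidy_markup(messy))
```

The same two patterns should port to other regex engines with little change, though that has not been tested here.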

fweisser


rick_deckard
New User

Feb 16, 2005, 5:36 PM

Post #6 of 6 (35007 views)
Re: [fweisser] Character encoding and quote marks

No problem. If you want to see about the PHP side in case that would help, you can read my lengthier write-up on the issue at
http://www.webmasterworld.com/forum88/3543.htm

See message 8, as well as the links in the last message, which lead to another long post on the problem.

 
 
 

