At the end of last year, I used my free time [1] to fix Firefox's implementation of the Segment Break Transformation Rules (Bug 1081858). Starting with Firefox 52, a line break between two Chinese characters in HTML source is no longer rendered as a space in the browser.
When composing a document in a plain text editor, we often wrap lines at 80 characters for comfortable reading. That width fits forty Chinese characters, which is also comfortable for reading Chinese. As software and hardware have advanced, displayed content is no longer limited to a fixed width, and much typesetting software can reflow the content to suit a different display environment.
When turning the original document into what appears on screen, such software merges the text of one paragraph into a single text stream, removes the original line breaks, and inserts new ones to fit the width of the new container. The problem is that many Latin-script languages separate words with white space, and typesetting algorithms assume the same. As a result, line breaks removed from the original document are replaced with white space by default.
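As a rough illustration of that default behavior, here is a minimal Python sketch. The function name `naive_reflow` is hypothetical and does not come from any real typesetting tool; it only shows how joining hard-wrapped lines with a space works for English but leaves a stray space inside Chinese text.

```python
def naive_reflow(paragraph: str) -> str:
    # Join the hard-wrapped lines of one paragraph with a single space,
    # the way reflow code written with Latin scripts in mind typically does.
    return " ".join(line.strip() for line in paragraph.splitlines())


# Fine for English: the space is the expected word separator.
print(naive_reflow("line breaks\nbecome spaces"))

# Wrong for Chinese: a visible gap appears in the middle of the sentence.
print(naive_reflow("這是一段\n中文"))
```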
Nowadays, much content on the Internet is written in Markdown, and most Markdown converters retain the line breaks of the original text [2], so a lot of Chinese content on the Internet shows this problem. If you don’t read carefully you may not notice it, but once you do, this typographic flaw is hard to ignore, especially if the web page uses justified layout.
This problem appears not only on web pages but also in many applications that automatically reflow a text stream. For example, XeCJK, the Chinese package for XeLaTeX, deals with a similar problem. In the past I worked around it by modifying the back end of the program that generated the pages, but I had been looking for a way to fix it once and for all, directly in the browser.
Later I found that CSS Text 3 specifies how this case should be handled:
4.1.2. Segment Break Transformation Rules
When white-space is pre, pre-wrap, or pre-line, segment breaks are not collapsible and are instead transformed into a preserved line feed (U+000A).
For other values of white-space, segment breaks are collapsible. As with spaces, any collapsible segment break immediately following another collapsible segment break is removed. Then the remaining segment breaks are either transformed into a space (U+0020) or removed depending on the context before and after the break:
- If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.
- Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.
- Otherwise, the segment break is converted to a space (U+0020).
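The rules quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gecko's actual code: the names `transform_segment_breaks`, `_replacement`, `_is_wide`, and `_is_hangul` are hypothetical, the Hangul test is approximated by looking for "HANGUL" in the Unicode character name, and spaces and tabs around a segment break are simply dropped together with it.

```python
import re
import unicodedata

ZWSP = "\u200b"

# One or more segment breaks, together with the spaces/tabs around them.
_BREAK_RUN = re.compile(r"[ \t]*(?:(?:\r\n|\r|\n)[ \t]*)+")


def _is_wide(ch: str) -> bool:
    # East Asian Width F (fullwidth), W (wide) or H (halfwidth), not A.
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")


def _is_hangul(ch: str) -> bool:
    # Assumption: approximate the script check via the character name.
    return "HANGUL" in unicodedata.name(ch, "")


def _replacement(before: str, after: str) -> str:
    # Rule 1: a zero-width space on either side removes the break.
    if before == ZWSP or after == ZWSP:
        return ""
    # Rule 2: wide, non-Hangul characters on both sides remove the break.
    if before and after and all(
        _is_wide(c) and not _is_hangul(c) for c in (before, after)
    ):
        return ""
    # Rule 3: otherwise the break becomes a single space.
    return " "


def transform_segment_breaks(text: str) -> str:
    out, pos = [], 0
    for m in _BREAK_RUN.finditer(text):
        out.append(text[pos:m.start()])
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        out.append(_replacement(before, after))
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)


print(transform_segment_breaks("中文\n換行"))     # no space inserted
print(transform_segment_breaks("hello\nworld"))  # joined with a space
```

Treating a whole run of breaks as one match covers the "immediately following another collapsible segment break is removed" step; a real layout engine works on text runs and style data rather than a flat string.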
Once I actually started implementing it, I found that Firefox had added some handling for Chinese early on, but that code misjudged the boundaries of Chinese characters, so it had done nothing ever since it was written. The original code was rather hard to understand (it used negative array indices), and because there were no relevant tests, nobody had caught the error [3]. Fixing this code even exposed some other bugs in the original layout engine.
My colleague Jeremy later modified Gecko’s algorithm further so that it conforms to the behavior in the current spec.
As for other browsers, IE already has partial support. If the same spec is implemented in Blink/WebKit, most web pages will no longer have this problem. Servo currently has only a simple algorithm for this part, so it is also an opportunity to contribute!
[1] This bug had long been on my “If I have time” list, and usually I don’t have time for it. 😆
[2] The original Markdown is just simple text substitution written in Perl.
[3] That’s why test coverage is very important!