Working with embedded CDATA in XML documents

Recently while working on the SME Toolkit, a project sponsored by the International Finance Corporation (a member of the World Bank Group), I encountered a problem with CDATA sections in XML documents.

CDATA sections are used in markup languages to identify general character data -- data that should only be interpreted as characters, and not as specialized markup or commands. In XML, CDATA sections allow XML markup to be embedded, but not interpreted as part of the XML document itself.

For example, CDATA would allow XHTML to be embedded inside a larger XML document without treating the XHTML as part of the parent document:

CDATA in an XML document.

*Note: This XML document uses a contrived DTD for the purposes of example - these tags aren't really part of the standard XML schema.

Unfortunately, there seems to be a great deal of confusion about the proper usage of CDATA sections. This is probably because they are rarely used, and the CDATA markers behave differently from ordinary XML tags. A CDATA section begins with the character sequence <![CDATA[ and ends at the first occurrence of the character sequence ]]>. This means that CDATA sections cannot be nested hierarchically like XML tags, because any occurrence of the ending CDATA marker terminates the open CDATA section.

This means that the following XML document is not well-formed: the first occurrence of "]]>" within the style section of the embedded XHTML document terminates the first CDATA section, leaving the rest of the embedded XHTML document to be parsed as part of the larger XML document.

Broken nested CDATA in an XML document.

The preferred solution to this problem is to break up the CDATA end markers when nesting them in a new XML document by inserting markers to close and re-open a CDATA section. Then, when the combined CDATA sections are interpreted, the original CDATA markers will be restored. This is accomplished by replacing each embedded ]]> with the following character sequence: ]]]]><![CDATA[>

Solution to hiding CDATA in an XML document.

Essentially, while CDATA sections cannot be nested, it is possible to escape ending CDATA markers to prevent a CDATA section from being prematurely terminated during parsing. In the example above, parsing of the parent or container XML document will combine the two separate, yet adjacent, CDATA sections into a single set of general character data as intended, preserving the embedded CDATA markers. The nature of the embedded data will be preserved without having it mistakenly treated as part of the XML markup.
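
The escaping step can be sketched in a few lines of Python. This is only an illustration of the technique described above, not code from the SME Toolkit: the helper name wrap_in_cdata and the page/content element names are made up for the example, and the standard library's ElementTree parser stands in for whatever parser you actually use.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text):
    """Wrap text in a CDATA section, splitting any embedded end marker.

    Each "]]>" becomes "]]" + "]]>" + "<![CDATA[" + ">", so the first
    section closes right after "]]" and a new one opens before ">".
    """
    escaped = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + escaped + "]]>"


# Hypothetical embedded fragment that already contains a CDATA section.
inner = "<style><![CDATA[ body { color: red; } ]]></style>"

# Hypothetical container document; the element names are illustrative only.
document = "<page><content>" + wrap_in_cdata(inner) + "</content></page>"

root = ET.fromstring(document)
recovered = root.find("content").text  # adjacent CDATA sections are merged
print(recovered == inner)
```

The parser sees two adjacent CDATA sections and hands back their concatenation, which is the original fragment with its own CDATA markers intact.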

Fixed getAttribute method in kSOAP2

Lately I have been using the kSOAP2 library for SOAP communications on an Android device. Today I ran into some real trouble: I could not getAttribute on a SoapObject that I knew had attributes.

Manfred Moser ran into the same trouble and created a fork of kSOAP2 that fixes the issue. Thanks a lot, Manfred!

Innovation comes from people who care

Innovation is the magic pixie dust of our age. Amazon lists more than 40,000 books on the subject (up 230% since I ran the same search four years ago). Companies that master it are more profitable. Globalization demands it of all countries. With just a little innovation, a company or a team works better. Who wouldn’t want that stuff sprinkled all over their company?

We work in a highly competitive, rapidly globalizing industry. It’s an industry with a horrible reputation for missing deadlines, blowing budgets, and creating unreliable products. If you haven’t figured it out yet, we develop software. More specifically, we develop software for other companies’ products. Our software is found all over the world thanks to our globally competitive partners in the automotive, aerospace, and color measurement industries.

Our company is highly innovative. We’ve pioneered agile development practices in West Michigan. We publish papers and present at conferences and workshops. And yet we don’t have any products of our own. Our form of innovation isn’t about better mousetraps, it’s in the process of building better mousetraps. Process innovation is all about answers to questions like: how do we work together effectively? how do we know what to build? how do we know it works reliably? how do we do the most for the least? how do we do it better next time? In Thomas Friedman’s flat world, process innovation in high-tech services is a key competitive advantage.

I think we’ve figured it out. Like those funny little creatures that occasionally float across the lens of your eyeball when you’re staring absently into the distance, the pixie dust of process innovation isn’t something you can see directly. And you can’t just apply it or make it happen. Innovation at our company happens because people care. (Around our office this attitude goes by the more colloquial “give a shit”, and is one of our core value mantras.) Our developers and designers care personally whether a client’s project succeeds. We insist that the customer define success. We want the customer to know what we’re doing, who’s doing it, and why. We hate spending customer money on pointless, ineffective formality and documentation. We’re offended by the idea of building software that can’t be tested. We hate the unpredictability of non-automated, trivial work. We’re tired of old, inefficient computer languages. We live to see our work deployed and in use. In short, we care deeply about our projects and our process.

People who don’t care are satisfied with the status quo. They don’t feel any pressure, they don’t seek ways of doing things better, and they don’t innovate. Innovation is a natural response to the combination of attitude and circumstance. It doesn’t require a team of savants. It happens in a context in which people want to improve what they do. People who care, innovate.

Rendering UTF-8 characters in Rich Text Format with PHP

One of the requirements for a project that I’ve been working on was to dynamically generate a document using information in a database. RTF was chosen for its versatility and compatibility across platforms.

While implementing this feature, I discovered that some characters would not render properly. These were UTF-8 characters, which cannot legally be embedded directly into RTF output.

After some research, I learned that it was possible to render a specific character by specifying its 16-bit code point. The RTF sequence looks as follows: \u####?. The #’s represent the decimal code point value. The question mark acts as a replacement character for legacy RTF viewers that do not support rendering by code point.

Let’s use the left double quotation mark “ as an example:

  • UTF-8: \xE2\x80\x9C
  • UTF-16 hexadecimal: 201C
  • UTF-16 decimal: 8220
  • RTF: \u8220?

I needed to implement a mechanism in PHP for locating, isolating, and converting sequences of UTF-8 characters.

The PHP function mb_convert_encoding appeared to satisfy what I required:


mb_convert_encoding("\xE2\x80\x9C", 'UTF-16', 'UTF-8') == "\x20\x1C" // True

Unfortunately applying this to an entire block of text converts all of the text to UTF-16, which was not the desired result.


mb_convert_encoding("a\xE2\x80\x9Cb", 'UTF-16', 'UTF-8') == "a\x20\x1Cb" // False: the result is "\x00a\x20\x1C\x00b"

I needed to isolate the multibyte UTF-8 sequences and convert them individually.

The specification for UTF-8 indicates that a byte sequence is one to four bytes long. Single byte UTF-8 “sequences” map directly to US-ASCII and therefore do not need to be converted. The first byte in the multibyte sequence is used to determine how many bytes the sequence has. For example, if the first byte falls within the range \xE0 to \xEF, two additional bytes follow in the range \x80 to \xBF.

These patterns could be easily represented in a regular expression array:

$patterns = array(
"[\xC2-\xDF][\x80-\xBF]",    // Two byte sequence
"[\xE0-\xEF][\x80-\xBF]{2}", // Three bytes
"[\xF0-\xF4][\x80-\xBF]{3}", // Four bytes
);
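
To see what these byte ranges match, here is a small Python sketch of the same three patterns joined into one regular expression over raw bytes. The sample string is made up for the example, and the names MULTIBYTE, data, and found are illustrative.

```python
import re

# Two-, three-, and four-byte UTF-8 sequences, using the same byte ranges
# as the PHP patterns above.
MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"
    rb"|[\xF0-\xF4][\x80-\xBF]{3}"
)

# Hypothetical input: ASCII mixed with a left double quote and an e-acute.
data = "a\u201cb \u00e9".encode("utf-8")

# Each match is one complete multibyte sequence; ASCII bytes are skipped.
found = [match.group().decode("utf-8") for match in MULTIBYTE.finditer(data)]
print(found)
```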

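PHP has a plethora of regular expression functions, and the solution ultimately came from preg_replace. What made preg_replace especially convenient was the ability to pass PHP code directly to the replacement parameter using the /e modifier. (The /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0; on newer versions preg_replace_callback fills the same role.)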

function utf8_to_rtf($utf8_text) {
    // Byte patterns for two-, three-, and four-byte UTF-8 sequences
    $utf8_patterns = array(
        "[\xC2-\xDF][\x80-\xBF]",
        "[\xE0-\xEF][\x80-\xBF]{2}",
        "[\xF0-\xF4][\x80-\xBF]{3}",
    );
    $new_str = $utf8_text;
    foreach ($utf8_patterns as $pattern) {
        // Replace each match with \uN?, where N is its decimal UTF-16 value
        $new_str = preg_replace("/($pattern)/e",
            "'\u'.hexdec(bin2hex(mb_convert_encoding('$1', 'UTF-16', 'UTF-8'))).'?'",
            $new_str);
    }
    return $new_str;
}

The key bit of code from above is the replacement string sent to preg_replace:


'\u' . hexdec( bin2hex( mb_convert_encoding('$1', 'UTF-16', 'UTF-8'))) . '?'

The first matched grouping (denoted by $1) is converted from UTF-8 to UTF-16 using mb_convert_encoding. The binary result is converted to hexadecimal, and then to decimal. Finally, a \u is prepended and a ? is appended.
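
For comparison, here is a minimal Python sketch of the same conversion. It is not a port of the PHP function: it walks the decoded string instead of using a regular expression, and it makes two assumptions that go beyond the code above. First, that \u takes a signed 16-bit number, so values above 32767 are written as negatives. Second, that characters outside the Basic Multilingual Plane are written as two \u escapes, one per UTF-16 surrogate. Check both against the RTF specification and your target viewer before relying on them. The function name is made up for the example.

```python
import struct


def text_to_rtf_escapes(text):
    """Replace every non-ASCII character with RTF \\uN? escapes."""
    pieces = []
    for char in text:
        if ord(char) < 128:
            pieces.append(char)  # ASCII passes through unchanged
            continue
        # One 16-bit unit for BMP characters, two (a surrogate pair) otherwise.
        for (unit,) in struct.iter_unpack(">H", char.encode("utf-16-be")):
            if unit > 32767:
                unit -= 65536  # assumption: \u expects a signed 16-bit value
            pieces.append("\\u%d?" % unit)
    return "".join(pieces)


print(text_to_rtf_escapes("a\u201cb"))  # a\u8220?b
```

The sketch leaves RTF's own special characters (backslash and braces) alone, so those would still need escaping separately.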

Native App Vs. Mobile Friendly Web Application

There are two main ways to create mobile applications. The following post lays out advantages and disadvantages of both approaches.

Native Apps – applications that are installed directly on smart phone devices (iPhone, Android, Blackberry, etc.).

Advantages:
  1. Cool Factor – “There’s an app for that.” Being in the Apple App Store or Google Android App Market is great marketing for an organization.
  2. Application Icon – When an app is installed its icon is placed onto the user’s smart phone desktop.
  3. Experience – Native apps are generally faster and more fun to use.
  4. Hardware Access – Native apps can easily take advantage of a smart phone’s GPS or camera.
  5. Offline Mode – App features can be developed that do not require an internet connection.
Disadvantages:
  1. Many Platforms – Each application is unique to its platform. For example, an iPhone app will only work on an iPhone. If you want it to also work on a Blackberry, you will need to create another application tailored to Blackberry. New frameworks are being developed to help ease this pain.
  2. Many Versions – When a new version of an existing native app is released, the users of the native app will need to download and install the update. People are not forced to update; therefore there will be multiple versions of the application in production.

Mobile Friendly Web Application – a web application that is easily viewable and usable on a smart phone.

Advantages:
  1. Reaches everyone – Anyone that has an internet enabled phone can view the application.
  2. Web application already exists – If a web application already exists it can be updated to accommodate mobile phones.
  3. One Version – Updating the website updates all the application users.
Disadvantages:
  1. Not as cool – There is no app store or app icon. You need to access that application through your mobile browser.
  2. Must be online – The application only works when you have internet connectivity.
  3. Limited hardware – Although it is technically possible to access some smart phone hardware from a website, it is not as seamless.
  4. Optimized look and feel – Each smart phone has its own look and feel and screen dimensions. Optimizing the web experience for specific smart phones requires implementing various mobile stylesheets.

The correct approach in any given situation depends upon the experience you are trying to achieve. It is important to note that this decision is not mutually exclusive. In some cases it makes sense to do both.

Working at Home at BarcampGR

During this year’s BarcampGR, Andy Keller of Traction Software gave a talk named Working at Home. Andy’s been telecommuting for over ten years, so in the talk he outlined some of the positives and negatives of his experience.

I particularly enjoyed his discussion of the ritual he performs every morning. Instead of rolling out of bed and going to work in his sweatpants, Andy prefers to take a shower, put on real clothes, and get out of the house for a little bit. He said this helps him shift from the “I’m at home” mentality to “I’m at work.”

Andy says that “Looking over the slides, I noticed that I described most of my points in words and the slides were really just an outline for the talk….” Regardless, here’s a link to the slideshow.

On the Importance of Character Sets and Character Encodings in MySQL

When transmitting and storing digital data, one of the most important considerations should be the character encoding. Unfortunately, this rarely seems to be on anyone's mind when setting up a database or making a database connection. For the most part, the defaults are just expected to work and provide the best set of options. With regard to character encodings (in any context), this is a dangerous approach.

In MySQL, the default character set is Latin-1 (this held until MySQL 8.0, which switched the default to utf8mb4). As a reminder, Latin-1 is an 8-bit, single byte, character encoding capable of representing 256 values. This would be awesome if you only ever had to represent characters from the Latin alphabet, and would never store or retrieve characters outside of the Latin-1 character set. Unfortunately, in a world driven by the Internet, this is almost never the case, and it causes problems.

Why? Well, because the default MySQL character set is Latin-1, any characters not within that character set may not be properly stored (or retrieved). This often doesn't occur to developers in the U.S. because nearly everything is represented in characters from the Latin alphabet anyway. However, should you try to store (or retrieve) something not in the standard Latin-1 character set, there are often problems.

For instance, let's create a sample database on a new MySQL server installation from a UTF-8 client:

mysql> SET NAMES utf8;
mysql> CREATE DATABASE mydatabase;
mysql> USE mydatabase;
mysql> CREATE TABLE `mytable` (`id` int(11) NOT NULL AUTO_INCREMENT, 
     `name` text, PRIMARY KEY (`id`));

Our database has been created, and is using the Latin-1 character set, as expected:

mysql> SHOW CREATE DATABASE mydatabase;
+------------+-----------------------------------------------------------------------+
| Database   | Create Database     |
+------------+-----------------------------------------------------------------------+
| mydatabase | CREATE DATABASE `mydatabase` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+------------+-----------------------------------------------------------------------+

mysql> SHOW CREATE TABLE mytable;
+---------+--------------------------------------------------------------------------+
| Table   | Create Table     |
+---------+--------------------------------------------------------------------------+
| mytable | CREATE TABLE `mytable` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1     |
+---------+--------------------------------------------------------------------------+

Now let's try to insert some data:

  • Standard Latin-1:
    mysql> INSERT mytable (name) VALUES ("abc");
  • UTF-8 (Greek):
    mysql> INSERT mytable (name) VALUES ("αβγ");

Now let's try to retrieve our data:

mysql> SELECT * FROM mytable;
+----+------+
| id | name |
+----+------+
|  1 | abc  |
|  2 | ???  |
+----+------+

Well, our UTF-8 data doesn't look right, does it? No, not at all. In fact, our data is gone. Permanently. Because the database is Latin-1, and our UTF-8 characters don't exist in Latin-1, MySQL simply replaced all of our UTF-8 characters with the "replacement character" -- which is supposed to signify that the character was not understood, and not properly converted. However, this is little help when you are trying to retrieve data from the database at a later time.

In practice, rarely will developers set up a MySQL database and send all non-Latin-1 characters to it. Usually most of the characters will be Latin-1, with an odd UTF-8 character thrown in. These UTF-8 characters may be forever lost or corrupted, but the Latin-1 characters are just fine. Because these UTF-8 characters may be rarely used, their loss or corruption may not be noticed for some time (if at all). This varied behavior contributes to the lack of awareness and understanding about properly configuring character encodings in general. It should be noted that this is only one specific example of the data corruption and loss that can occur due to improperly configured character encodings -- many different variants can and do occur.
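
Both failure modes can be reproduced without a database. The Python sketch below only imitates the conversions involved, it does not talk to MySQL: the first part mimics a lossy conversion into Latin-1, and the second shows one of the other variants mentioned above, where UTF-8 bytes are read back as if they were Latin-1. The variable names are illustrative.

```python
greek = "αβγ"

# Lossy conversion: Latin-1 has no Greek letters, so each one is replaced
# by a question mark and the original text cannot be recovered.
stored = greek.encode("latin-1", errors="replace")
lossy = stored.decode("latin-1")

# Misinterpretation: the UTF-8 bytes of an e-acute read back as Latin-1
# turn into two unrelated characters.
mojibake = "é".encode("utf-8").decode("latin-1")

# With UTF-8 on both sides, the round trip is exact.
round_trip = greek.encode("utf-8").decode("utf-8")

print(lossy, mojibake, round_trip)
```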

So, what is one to do about this problem? The answer is really very simple: always use the correct character set all the time. From a practical perspective, this should mean always using UTF-8 for everything. Why? Because that is the way the world is trending -- the Internet is international, and nearly all locales except the U.S. and Western Europe rely upon UTF-8 (or some other form of Unicode) to represent characters all the time. If anyone hopes to serve an international or Internet audience, the character encoding of choice is UTF-8.

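So, how is this accomplished in MySQL? Generally, the MySQL server itself should be configured to use UTF-8 as the default character set. Note that MySQL's utf8 character set stores at most three bytes per character; utf8mb4, available from MySQL 5.5.3, is needed to cover four-byte UTF-8 sequences as well.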

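  • This can be done by inserting the following line into the MySQL configuration file, usually /etc/my.cnf (servers older than MySQL 5.5 also accepted default-character-set here, but that server option has since been removed):
    [mysqld]
    ...
    character-set-server = utf8
    ...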
    
  • If the server configuration file isn't accessible, you must specify the correct character set at database creation:
    mysql> CREATE DATABASE mytest2 DEFAULT CHARACTER SET utf8;
  • You can also specify the correct character set at table creation:
    mysql> CREATE TABLE `mytable2` (`id` int(11) NOT NULL AUTO_INCREMENT, `name` text, PRIMARY KEY (`id`)) CHARACTER SET utf8;
  • Alternatively, you can specify the correct character set on a per column basis at table creation:
    mysql> CREATE TABLE `mytable2` (`id` int(11) NOT NULL AUTO_INCREMENT, `name` text CHARACTER SET utf8, PRIMARY KEY (`id`));

Existing character sets for servers, databases, tables, and columns can be altered, but this poses a risk for further corrupting or damaging existing data.

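Our database and table have both been created, and the name column uses the UTF-8 character set as specified (the table output below comes from the per-column variant, so the table's own default still reads latin1):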

mysql> SHOW CREATE DATABASE mytest2;
+----------+------------------------------------------------------------------+
| Database | Create Database     |
+----------+------------------------------------------------------------------+
| mytest2  | CREATE DATABASE `mytest2` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+------------------------------------------------------------------+

mysql> SHOW CREATE TABLE mytable2;
+----------+------------------------------------------------------------------+
| Table    | Create Table     |
+----------+------------------------------------------------------------------+
| mytable2 | CREATE TABLE `mytable2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text CHARACTER SET utf8,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+----------+------------------------------------------------------------------+

We can now insert UTF-8 data into the database without a problem:

mysql> INSERT mytable2 (name) VALUES ("αβγ");
mysql> SELECT * FROM mytable2;
+----+--------+
| id | name   |
+----+--------+
|  1 | αβγ   |
+----+--------+

It is important to note that clients and their connections to MySQL server also have their own character sets. These should also always be the same as the server, database, and table: UTF-8. The MySQL client will often try to establish a connection to the MySQL server using the default character set (Latin-1), so it must sometimes be specifically set to UTF-8.

  • On the MySQL client command line, this can be accomplished by issuing the following statement:
    mysql> SET NAMES utf8;
  • Certain other MySQL clients must also specifically be told to use UTF-8. For example, in Ruby on Rails, the database.yml file should specify UTF-8:
    production:
      adapter: mysql
      database: mydatabase
      username: myuser
      password: mypass
      host:  mydb
      encoding: utf8
    

The SQL samples shown here are really intended to illustrate the importance of using the proper character set on MySQL server, and on MySQL clients. These are just examples, and not total solutions. You should do proper research on the character encoding of the server and clients that you utilize. Always back up data and use caution when trying to change the character encoding used in a production database.

Professional Development through Performance Art

For the past few years I have been involved with the Tesla Orchestra, a performance art and engineering group that uses Tesla coils both as musical instruments and to create impressive, lightning-like electric arcs on stage.

Although the Orchestra's Tesla coils rely on modern solid-state electronics to rapidly vary their power output, causing the electric arc itself to vibrate like a loudspeaker and emit musical tones, the basic design of the coils can be traced back to Nikola Tesla's experimental wireless power transmission equipment of the late 19th century. Therefore it was a special honor for our group to be invited to perform in Rijeka, Croatia, not far from Tesla's birthplace in the town of Smiljan. I was fortunate to have Atomic Object provide me the flexibility to travel with the Orchestra to Croatia in order to assist with the technical aspects of the performance.

The Tesla Orchestra's first international concert, held August 11, was a great success, judging by both the enthusiastic reactions of the crowd and the attention from Croatian national news outlets. The engineering team was also pleased with the flawless operation of the Tesla coils themselves.

For me, the project has provided a unique opportunity to exercise and develop my electrical and firmware engineering skills. Also, it has been interesting and educational to observe how a large, diverse, and sometimes geographically dispersed team of volunteers has organizational and project management needs that are very parallel to, but not quite the same as, those of professional project teams.

Undelete!

I was working on a server this morning and accidentally deleted an important configuration file. Like many Linux users, I lamented the absence of an “undelete” command. The file was no longer held open by any process, wasn’t present in the backups, and would be painful to recreate.

Fortunately, not all hope was lost. When a file is deleted from a hard drive, the blocks are freed, but not actually cleared. The data remains on disk, but it cannot be directly accessed and is in danger of being overwritten. Recovery is a matter of search and rescue.

Since the file I was hoping to recover was a text file, and I knew a fair amount about it (such as approximate file size and some text that was definitely going to be included), finding it actually turned out to be a fairly simple task using grep:

grep -a -B 25 -A 100 'some string in the file' /dev/sda1 > results.txt

Here’s what the command does:

grep searches through a file and prints out all the lines that match some pattern. Here, the pattern is some string that is known to be in the deleted file. The more specific this string can be, the better. The file being searched by grep (/dev/sda1) is the partition of the hard drive the deleted file used to reside in. The “-a” flag tells grep to treat the hard drive partition, which is actually a binary file, as text. Since recovering the entire file would be nice instead of just the lines that are already known, context control is used. The flags “-B 25 -A 100” tell grep to print out 25 lines before a match and 100 lines after a match. Err on the high side with these numbers to ensure the entire file is included. Excess data is easy to trim out of the results, but if you find yourself with a truncated or incomplete file, you need to do this all over again. Finally, the “> results.txt” instructs the shell to store the output of grep in a file called results.txt. Write that file to a different partition than the one being searched, since new data written there could overwrite the very blocks you are trying to recover.

Once the command is done, results.txt will probably contain lots of gibberish, but if you’re lucky, the contents of the deleted file will be intact and recoverable.
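
If the deleted file has very long lines or none at all, line-based context is hard to size. The same search can be done by byte count instead. The Python function below is a rough sketch of that idea, not a replacement for a real recovery tool: its name, parameters, and default sizes are all made up for the example. It reads the source in chunks and keeps a short overlap so a match that straddles two chunks is still found.

```python
def find_with_context(path, needle, before=2048, after=8192, chunk_size=1 << 20):
    """Return the bytes surrounding every occurrence of needle in path."""
    offsets = []
    keep = len(needle) - 1  # overlap needed to catch matches across chunks
    with open(path, "rb") as source:
        tail = b""
        consumed = 0  # bytes read before the current chunk
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            base = consumed - len(tail)  # file offset of window[0]
            index = window.find(needle)
            while index != -1:
                offsets.append(base + index)
                index = window.find(needle, index + 1)
            tail = window[-keep:] if keep else b""
            consumed += len(chunk)
        snippets = []
        for offset in offsets:
            start = max(0, offset - before)
            source.seek(start)
            snippets.append(source.read(offset - start + len(needle) + after))
    return snippets
```

Calling it with a partition path such as /dev/sda1 and a byte string known to be in the file returns one snippet per match. Reading a partition normally requires root, and the snippets should be saved somewhere other than the partition being searched.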

To help prevent this problem from happening in the first place, many people elect to alias the rm command to a script which will move files to a temporary location, like a trash bin, instead of actually deleting them.

Capistrano: Deploying Only Subversion Modified

My recent project work has resulted in the development of a nifty piece of deployment code. While local deployment for this project is possible, I have preferred to deploy to a remote development system. Doing a full deploy takes several minutes, which can feel like ages if all you want to do is fix a syntax error on line 157.

With that said, I created a Capistrano deployment task that only deploys files which subversion has determined have been modified.

1
2
3
4
5
6
7
8
9
10
11
12
13
desc "Deploy SVN modified files"
task :deploy_svn_modified do
  php_files = []
  other_files = []
  `svn st`.split("\n").select{|i|i.match(/^(A|M)/)}.map{|i|i[8..-1].strip}.select{|i|i.match(/^some_root/)}.each do |file|
    puts file
    db = other_files
    db = php_files if file[/\.php$/] 
    db << file
  end
  deploy_some_php_file *php_files unless php_files.empty?
  upload_files other_files unless other_files.empty?
end

Let’s see if we can break down what is happening in this code. Most of the magic is happening on Line 5.

`svn st`.split("\n").select{|i|i.match(/^(A|M)/)}.map{|i|i[8..-1].strip}.select{|i|i.match(/^some_root/)}
  • Capture output from subversion’s status command (abbreviated st)
  • Split each line
  • Keep only lines that start with A or M, discard the rest
  • The subversion manual for status indicates that the file path starts on column 8, so we extract a substring from column 8 until the end of the line
  • Keep only those file paths that fall under some root. Depending on your project layout, this last select may not be needed

Finally, the code does additional selection based on file extension. For example, PHP files can be run through a syntax checker before being uploaded.
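
The same filtering steps can be written out one at a time. The Python sketch below mirrors the Ruby chain rather than replacing it: it takes the text of svn status as a string instead of running the command, the function names are made up for the example, and the sample status lines are hypothetical.

```python
def modified_paths(status_output, root="some_root"):
    """Return paths of added or modified files that live under root."""
    paths = []
    for line in status_output.split("\n"):
        if not line.startswith(("A", "M")):
            continue  # keep only added (A) and modified (M) entries
        path = line[8:].strip()  # same slice as i[8..-1] in the Ruby code
        if path.startswith(root):
            paths.append(path)
    return paths


def split_by_extension(paths, extension=".php"):
    """Separate paths ending in extension from everything else."""
    matching = [path for path in paths if path.endswith(extension)]
    others = [path for path in paths if not path.endswith(extension)]
    return matching, others


# Hypothetical status output: one status letter, padding, then the path.
sample = "\n".join([
    "M" + " " * 7 + "some_root/index.php",
    "?" + " " * 7 + "some_root/scratch.txt",
    "A" + " " * 7 + "some_root/css/site.css",
    "M" + " " * 7 + "docs/readme.txt",
])

php_files, other_files = split_by_extension(modified_paths(sample))
print(php_files, other_files)
```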

Hopefully this brings as much joy to your deployments as it has to mine. Happy deploying!
