
Templating PDFs for maximum reusability

FPDI in Detail

Importing existing documents with Free PDF Import

2005 Look Back Reflecting on last year’s events in the PHP world with PHP guru Derick Rethans

i18n Internationalize your web application with less PHP code


Secure your applications against Email Injection Tips on Output Buffering KOMODO - reviewed and much more...




Columns

EDITORIAL

php|news

Features


Why is it Taking so Long?

Lead times and the rationale behind them

2005 Look Back

Reflecting on last year’s events in the PHP world





Email Injection by CHRIS SHIFLETT

PDFlib’s Block Tool

Templating PDFs for Maximum Reusability



Output Buffering by BEN RAMSEY


FPDI in Detail

Importing existing documents with Free PDF Import



The Web Development IDE for All Platforms?





Internationalize Your Web applications with less PHP code


exit(0);

2006: A Look Forward by MARCO TABINI

Download this month’s code at:


If you want to bring a php-related topic to the attention of the professional php community, whether it is personal research, company software, or anything else, why not write an article for php|architect? If you would like to contribute, contact us and one of our editors will be happy to help you hone your idea and turn it into a beautiful article for our magazine. Visit or contact our editorial team at and get started!




In the past five (or so) years especially, the desktop landscape has changed dramatically. Desktops have traditionally been dominated by Windows, but alternatives are making their way into both the office and the home. Apple’s hit operating systems in the OS X series, and other chic products (like the iPod), have not only fueled the sales of Macintosh computers, but have opened consumers’ minds to the reality that there are alternatives to Windows. The market is still firmly in Microsoft’s grip, but more and more users are making the “switch” to Mac (and, to a much lesser extent, to alternatives like Linux).

This diversity, while good, can cause portability problems. As I’ve touched on in past issues, developers can no longer target a single browser, but must become more and more aware of standards and of cross-browser/cross-platform compatibility issues. For the most part, developers seem to have the browser issue under control. I personally never use Internet Explorer for anything but testing (I’m a Firefox fanboy), and it’s very rare that I still run into sites that simply won’t work with Firefox. Even in cases where it seems I’m out of luck, I can often spoof the User-Agent header and get a working site. Since Firefox is available on many platforms, it seems that the HTML issue is (mostly) behind us. I say “mostly” because standards compliance and portability are things that we always need to strive for.

If you’ve tried to distribute a printable, offline-viewable, well-laid-out document in the past, you know that HTML doesn’t cut it. There’s little provision for the features that are necessary to build a professional document (there is hope with CSS, though). This often leaves websites delivering “richer” documents, such as MS Word documents or RTF files. The distribution of documents in proprietary formats leads to its own set of problems, primarily document creation and portability. Have you tried to build a Word document from your non-Windows Web server? It’s not fun. Equally tedious is trying to get that document to render properly in different versions of Word, on different platforms. Worse still is the rendering in non-Microsoft applications, such as OpenOffice.

Enter PDF. Now, PDF is certainly not new technology. It does, however, seem to be becoming more and more the de facto standard for document distribution. PDF is no stranger to php|architect readers: if you’re not reading this on paper, you’re reading a PDF, and we’ve brought you much PDF-centric content in the past. We’ve certainly not drained the PDF knowledge pool, though. This month, we’re happy to focus on PDF once again, but this time with a twist: using PHP to modify existing PDFs, through various means.

It’s also our pleasure to be running Derick Rethans’ PHP Lookback, 2005. Marco will touch more on this in exit(0). On that note, we at php|architect wish you and your business a happy and successful 2006. Here’s to another great year of PHP!

Volume 5 - Issue 1

Publisher Marco Tabini

Editor-in-Chief Sean Coates

Editorial Team Arbi Arzoumani, Peter MacIntyre, Eddie Peloke

Graphics & Layout Aleksandar Ilievski

Managing Editor Emanuela Corso

News Editor Leslie Hill

Authors Marcus Baker, Ron Goff, Peter B. MacIntyre, Carl McDade, Ben Ramsey, Derick Rethans, Chris Shiflett, Jan Slabon

php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini & Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada.

Although all possible care has been placed in assuring the accuracy of the contents of this magazine, including all associated source code, listings and figures, the publisher assumes no responsibility with regard to the use of the information contained herein or in all associated material.

php|architect, php|a, the php|architect logo, Marco Tabini & Associates, Inc. and the Mta Logo are trademarks of Marco Tabini & Associates, Inc.

Contact Information:
General mailbox:
Editorial:
Sales & advertising:

Printed in Canada

Copyright © 2003-2006 Marco Tabini & Associates, Inc. All Rights Reserved

6 • php|architect • Volume 5 Issue 1

news

PHP 5.1.2 RC1

Ilia Alshanetsky announces the release of PHP 5.1.2 RC1. “I’ve just packaged PHP 5.1.2RC1, the first release candidate for the next 5.1 version. A small holiday present for all PHP users, from the PHP developers. This is primarily a bug fixing release with its major points being:
• Many fixes to the strtotime() function, over 10 bugs have been resolved.
• A fair number of fixes to PDO and its drivers
• New OCI8 that fixes large number of bugs backported from head.
• A final fix for Apache 2 crash when SSI includes are being used.
• A number of crash fixes in extensions and core components.
• XMLwriter & Hash extensions were added and enabled by default.”

Get all the info at 97-PHP-5.1.2RC1-Released!.html

FUDforum 2.7.4RC1 Released

The FUDforum team has announced the latest release of their open source forum package, version 2.7.4 RC1. Some of the new features include:
• Added subscribed forum filter to message navigator
• Added handling for in-lined attachments in mailing list import
• Added the ability to supply custom signature to message synchronized from the forum back to mailing list or a news group
• Added support for allowing the user to select how many threads they want to see per page
• Much more…

Visit for all the latest info.

eZ components 1.0 beta2

is proud to announce the release of eZ components. announces: “eZ components is an enterprise ready, general purpose PHP platform. As a collection of high quality independent building blocks for PHP application development, eZ components will both speed up development and reduce risks. An application can use one or more components effortlessly, as they all adhere to the same naming conventions and follow the same structure. All components are based on PHP 5.1, except for the ones that require the new Unicode support that will be available from PHP 6 on.” Need to speed up your development? Check out for more info.

xajax 0.2

announces the release of version 0.2. What is it? The site describes it as: “an open source PHP class library that allows you to easily create powerful, web-based, Ajax applications using HTML, CSS, JavaScript, and PHP. Applications developed with xajax can asynchronously call server-side PHP functions and update content without reloading the page.” To start working with xajax, visit

SQLiteManager 1.2.0RC2

If SQLite is the db of choice for your PHP application, you may be interested in the latest release of SQLiteManager. SQLiteManager.org lists the features as:
• Management of several databases (creation, access or upload)
• Management of the attached databases
• Create, edit and delete tables and indexes
• Insert, edit, delete records in these tables
• Management of views; create views from SELECTs
• Management of triggers
• Management of user defined functions
• Manual request and from file, it is possible to define the format of the requests, sqlite or MySQL; a conversion is done in order to directly import a MySQL database in SQLite
• Importing of records from a formatted text file
• Export of structure and the data
• Choice of several display skins

Check out to start managing your SQLite DB, today.

php|architect Releases New PDFlib Book

We are proud to announce the release of our latest book in the “Nanobooks” series, called Beginning PDF Programming with PHP and PDFlib. Authored by Ron Goff, this book provides a thorough introduction to the great capabilities provided by the PDFlib library for the creation and manipulation of PDF files. The book features a foreword by Thomas Merz, the original author of PDFlib and founder of PDFlib GmbH, and tackles topics like PDF file creation, fonts, text, shapes and much more, including PDFlib’s Block Tool, which allows for the manipulation of existing PDF documents. For more information,

Check out the hottest new releases from PEAR.

Image_Color2 0.1.4

PHP 5 color conversion and basic mixing. Currently supported color models:
• CMYK - Used in printing
• Grayscale - Perceptively weighted grayscale
• Hex - Hex RGB colors i.e. #abcdef
• HSL - Used in CSS3 to define colors
• HSV - Used by Photoshop and other graphics packages
• Named - RGB value for named colors like black, khaki, etc.
• WebsafeHex - Just like Hex but rounds to websafe colors

Config 1.10.5

The Config package provides methods for configuration manipulation.
• Creates configurations from scratch
• Parses and outputs different formats (XML, PHP, INI, Apache...)
• Edits existing configurations
• Converts configurations to other formats
• Allows manipulation of sections, comments, directives...
• Parses configurations into a tree structure
• Provides XPath like access to directives

MDB2_Drivers

MDB2 drivers were released for:
• SQLite
• postgreSQL
• mysqli
• mysql
• Oracle

MDB2 2.0.0RC3

PEAR MDB2 is a merge of the PEAR DB and Metabase PHP database abstraction layers. Note that the API will be adapted to better fit with the new PHP 5-only PDO before the first stable release. It provides a common API for all supported RDBMS. The main difference to most other DB abstraction packages is that MDB2 goes much further to ensure portability. Among other things MDB2 features:
• An OO-style query API
• A DSN (data source name) or array format for specifying database servers
• Datatype abstraction and on demand datatype conversion
• Various optional fetch modes to fix portability issues
• Portable error codes
• Sequential and non sequential row fetching as well as bulk fetching
• Ability to make buffered and unbuffered queries
• Ordered array and associative array for the fetched rows
• Prepare/execute (bind) emulation
• Sequence emulation
• Replace emulation
• Limited sub select emulation
• Row limit support
• Transactions support
• Large Object support
• Index/Unique Key/Primary Key support
• Autoincrement emulation
• Module framework to load advanced functionality on demand
• Ability to read the information schema
• RDBMS management methods (creating, dropping, altering)
• Reverse engineering schemas from an existing DB
• SQL function call abstraction
• Full integration into the PEAR Framework
• PHPDoc API documentation

Fileinfo 1.0.3

This extension allows retrieval of information regarding the vast majority of files. This information may include dimensions, quality, length etc... Additionally, it can also be used to retrieve the mime type for a particular file and, for text files, the proper language encoding.

GDChart 0.2.0

The GDChart extension provides an interface to the bundled gdchart library. This library uses the (bundled) GD library to generate 20 different types of graphs, based on supplied parameters. The extension provides an OO interface to gdchart, exposing the majority of options via properties and complex (array) options via a series of methods. To use the current version of the extension, PHP 5.0.0 is required; an older PHP 4-only version can be downloaded from CVS, by checking out the extension with the PECL_4_3 tag.

yaz 1.0.6

This extension implements a Z39.50 client for PHP using the YAZ toolkit.

pecl_http 0.21.0

It eases handling of HTTP URLs, dates, redirects, headers and messages, provides means for negotiation of the client’s preferred language and charset, as well as a convenient way to send any arbitrary data with caching and resuming capabilities. It provides powerful request functionality, if built with CURL support. Parallel requests are available for PHP-5 and greater.
PHP-5 classes: HttpUtil, HttpMessage, HttpRequest, HttpRequestPool, HttpDeflateStream, HttpInflateStream
PHP-5.1 classes: HttpResponse

MDB2_Schema 0.4.1

PEAR::MDB2_Schema enables users to maintain RDBMS independent schema files in XML that can be used to create, alter and drop database entities and insert data into a database. Reverse engineering database schemas from existing databases is also supported. The format is compatible with both PEAR::MDB and Metabase.

Validate_ptBR 0.5.2

Package contains locale validation for ptBR such as:
• Postal Code
• CNPJ
• CPF
• Region (Brazilian states)
• Phone Number
• Vehicle plates

Xdebug 2.0.0beta5

The Xdebug extension helps you debug your script by providing a lot of valuable debug information. The debug information that Xdebug can provide includes the following:
• stack and function traces in error messages with:
  • full parameter display for user defined functions
  • function name, file name and line indications
  • support for member functions
• memory allocation
• protection for infinite recursions

Xdebug also provides:
• profiling information for PHP scripts
• script execution analysis
• capabilities to debug your scripts interactively with a debug client



2005 PHP

A new year is upon us, and as is customary in the PHP world, it is time to reflect on the events of the past year. Derick Rethans, a PHP internals developer, has been publishing a PHP Look Back for a few years now, and this year we saw it fitting to publish it here. Happy 2006!



Welcome to the fourth installment of the PHP Look Back. Just as in previous years, we’ll look back on PHP development discussions, bloopers and accomplishments of the last year. This is not supposed to be a fully objective review of last year: note that the opinions in this article are those of the author, and not of the PHP development team (nor of php|architect).

January

January was a quiet month, with not much going on. After about 8 months [001], we finally added [002] a PIC/non-PIC detection mechanism to the configure script that selects non-PIC object generation on supported platforms (Linux and FreeBSD). Non-PIC code is about 30% faster, as measured in earlier benchmarks.

A week later, Leonardo [003] was wondering whether we planned on adding type hints for scalar types to PHP. As PHP is a weakly-typed language, this is not something we wanted to add, although we did add support for an “array” type hint later in the year.

With PHP 5.1’s new GOTO execution method (added last August), variable name lookups are cached internally. This caused some problems for Xdebug [004], as it needs some information to find out which variables are used in a specific scope. Andi committed [005] a patch that made Xdebug work properly again.

Michael started working on his HTTP extension (which generates way too many commit mails ;-) and encountered a problem with a naming clash [006] between PEAR’s HTTP class and his PECL extension. Greg responded [007], and said that this problem would be solved when PEAR 1.4 came out, with its channel support.

February

Andi started discussions in February by pointing out a date for the first beta of PHP 5.1: March 1st. He declared that “both PDO and Date should be included in the default distribution” [008] and others suggested that XML Reader [009] should be included by default, as well. In reply to Andi, Rasmus mentioned [010] that he would like to see the filter extension included, as well. The discussion about this extension quickly transitioned to data mangling of input request variables, and how they could not be influenced by the script authors, but only by the system administrator. In the end, this discussion gave way to the topic of operator overloading [011], where certain people kept reiterating that operator overloading is a “good thing. [012]” Andrei tried to stop this discussion by being funny [013], but it didn’t work very well [014].

Around the same time, Wez announced [015] the first beta of PDO (PHP Data Objects). Wez wanted people to test [016] PDO and, of course, over the next couple of months various PDO-related concerns [017] and issues were raised.

Another discussion in February was about auto boxing [018] in PHP. Auto boxing is the encapsulation of all primitive types as objects. Naturally, people asked why [019] we would want to have this, and no sound reason was given. In the end, this discussion suggested that phpDocumentor [020] should handle type determination instead. Having a doc block [021] parsing extension to the reflection API would be nice, although a bit hard.

We also had an often-recurring discussion [022] on why the GPL [023] is a bad idea for PECL [024] extensions. John added the first version [025] of XMLRPCi to CVS; why he chose this silly name is still unknown. Jani wrote about a problem with overwriting globals [026], an issue that, later in the year, warranted a new PHP release, and Greg introduced [027] PEAR 1.4, with channel support.

Halfway through the month, Marcus [028] mentioned a few things that should go into PHP 5.1, most notably the __toString() fix, which unfortunately did not actually make it into the release. Type hinting with “= NULL” did make it in [029], though. Martin Sarsale reported [030] an issue with references and segfaults, something which had been annoying us at eZ systems [031] for quite some time, too. This issue got fixed in PHP 4.4, albeit not without a little bickering (more about that later).

March

In March, Ilia proposed [032] a patch that adds a special token that tells PHP’s parser to abort parsing when the token is encountered. This allows us to attach binary data to the end of a PHP script, which is highly useful for one-script installers, such as the one that FUDForum [033] uses. On the 14th of the month, Zeev released the first RCs [034] of both 5.0.4 and 4.3.11. We also encountered further reference issues [035].

The same guy that mailed tons of “fixes” to the internals list last June [036] was back with more [037] patches. Andrei, once again, pointed out [038] that it is a good idea to check with an extension’s maintainer before applying patches, and Greg published [039] the package2.xml documentation. Lukas, once more, pointed out [040] the weird naming scheme that new extensions seem to be getting, and luckily Debian’s PHP packages got rid [041] of some of the insanity that was present in previous [042] releases by not always building in ZTS mode. Unfortunately, their packages still force PIC mode for the libraries.

A user brought up the idea of an upload meter patch [043] again, and although we all seemed to remember [044] that the original patch was rejected [044], no one could find the original thread [046] where this was discussed. Last year’s Look Back discussed this too, and there, the reason was mentioned [047]. In the last week of the month, we had some fuss [048] about “FreeBSD doing stupid things [049]” regarding their naming of auto tools executables [050].
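The parser-halting token proposed in March shipped in PHP 5.1 as __halt_compiler(), together with the constant __COMPILER_HALT_OFFSET__, which holds the byte offset of whatever follows the token. The following is a minimal sketch of the one-script installer idea, not FUDforum’s actual installer: the temporary file, the payload bytes and the variable names are assumptions made for illustration, and the code targets current PHP.

```php
<?php
// Build a tiny stand-in for a one-script installer: PHP code first,
// then raw bytes that the parser never looks at. (Hypothetical payload.)
$attached = "BINARY\x00\x01DATA";
$script = '<?php' . "\n"
    . '$fh = fopen(__FILE__, "rb");' . "\n"
    . 'fseek($fh, __COMPILER_HALT_OFFSET__); // first byte after the token' . "\n"
    . '$payload = stream_get_contents($fh);' . "\n"
    . 'fclose($fh);' . "\n"
    . 'return $payload;' . "\n"
    . '__halt_compiler();' . $attached;

$path = tempnam(sys_get_temp_dir(), 'halt');
file_put_contents($path, $script);

// Including the file runs the code above the token and returns the
// bytes stored below it; parsing stops at __halt_compiler().
$payload = include $path;
unlink($path);

echo strlen($payload), " bytes recovered\n";
```

Because the parser stops at the token, the attached data can contain any bytes, including sequences that would otherwise be syntax errors.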

April

April started with a suggestion [051] by Zeev to change the way that __autoload() works, by allowing multiple instances of this magic function. In the end, we didn’t implement this, and as Lukas described [052], “Frameworks should provide __autoload() helper methods, but should never implement the function itself. It’s up to the end user to do this.” (This is exactly how we implemented it for the eZ components [053]). Andi wanted to release PHP 5.1 Beta 1 [054] really soon, but, as Jani mentioned [055], there were quite a few things that were still not fully ready, and thus the suggestion to call it “Alpha” [056] was made, instead. During this thread, some pet-features [058] were brought up [059].

Kamesh, from the Netware porting team, found another reference issue [060]. Marcus added the File [061] class to his SPL extension, causing a small stir: the new class clashed with any application that already defined its own File class. Although this is a valid point, projects defining a “File” class should know better, and would be wise to prefix their class names. This same issue will pop up later in the year.

A last, somewhat larger, discussion erupted when a question [062] about whether APC could be used as a content cache was posted to the list. Rasmus found it an interesting idea [063], although this functionality can also be accomplished in user space. In the last post of the thread, Rasmus mentioned [064] that APC would soon support PHP 5.
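The division of labour Lukas describes (the framework ships a loading helper, the end user decides whether to hook it in) can be sketched as follows. This is an illustration only, not the eZ components implementation: the class ExampleAutoloadHelper, the Demo_Greeter class and the underscore-to-directory mapping are all hypothetical, and the sketch uses spl_autoload_register() and current PHP syntax rather than a bare __autoload() function.

```php
<?php
// Hypothetical framework-side helper: it knows how to map a class name
// to a file, but it does not install itself as the autoloader.
final class ExampleAutoloadHelper
{
    public static function load(string $baseDir, string $class): bool
    {
        $file = $baseDir . '/' . str_replace('_', '/', $class) . '.php';
        if (is_file($file)) {
            require_once $file;
            return true;
        }
        return false;
    }
}

// Demo setup: write one class file into a scratch directory.
$baseDir = sys_get_temp_dir() . '/autoload_demo_' . uniqid();
mkdir($baseDir . '/Demo', 0777, true);
file_put_contents(
    $baseDir . '/Demo/Greeter.php',
    '<?php class Demo_Greeter { public function hello() { return "hello"; } }'
);

// End-user code: the application, not the framework, registers the hook.
spl_autoload_register(function (string $class) use ($baseDir): void {
    ExampleAutoloadHelper::load($baseDir, $class);
});

$greeter = new Demo_Greeter();   // triggers the autoloader
echo $greeter->hello(), "\n";
```

Keeping registration in user code means two libraries can each offer a helper without fighting over a single global function.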

May

May had a slow start, and things only got interesting at the end of the month. The first discussion that came up was Ilia’s removal of dangling commas from enums, something that “was in c language from the first day [065].” Apparently, GCC 4 is “becoming worse and worse [066],” but luckily, we can still just ignore the warnings [067].

After a small private discussion with Dmitry about Marcus’ and my reference fix patch [068], he came to the conclusion that this patch breaks binary compatibility and that this problem warrants a PHP 4.4 release. As this reference problem had been affecting many users, and definitely eZ, over the past months, I wrote an email [069] to the list stating that it is “totally irresponsible” not to release a fix for such a grave bug. Zeev [070], however, said that “we should probably not fix this at all in the 4.x tree” because of the hassles that accompany “breaking module binary compatibility.” He also seemed to think that the bug can easily be worked around.

Other users were a bit happier [071] that we finally nailed this bug, and Jani replied to Zeev that the magnitude [072] of this bug is pretty high. Rasmus added that he “will be deploying the patch and happily breaking binary compatibility [073]” as soon as the patch is ready. Breaking binary compatibility is only a “burden on the maintainers of these packages” (of the various distributions). Wez thought that “the only logical move forward is a 4.4 branch and release [074].” In the end, the Zeev almighty was “tired of going through the reasons again and again [075]” and noted that “everyone appears to prefer the upsides to the downsides.” This resulted in the creation of the PHP_4_4 branch [076] in the first week of June.

June

Wez added a new patch to our CVS server that allows us to block access [077] to specific branches; with this, we closed the PHP_4_3 branch for good. A week later, I announced 4.4.0RC1 [078], which features the reference bug fix.

Andi wrote another PHP 5.1 mail [079], which spawned a nice long discussion on adding goto [080] to PHP, and comparing goto to exceptions. Magnus smartly added [081] that “people are talking about hypothetical messy code because of goto” and that they forget that you don’t have to use a language construct simply because it is available. The same thread also went into a branch that discussed [082] the ifsetor() language construct. After Andi returned, he decided not to do anything with goto or ifsetor() [083], and that it was now time to branch, so that we could merge the Unicode support that was developed in parallel, mostly by Andrei and Dmitry, although Rasmus was “pretty sure the current discussions will pale in comparison to the chaos that will be created when the Unicode stuff goes into HEAD! [084]”

Johannes wondered when the new date stuff [085] was going in; it was added a week later, just before PHP 5.1 beta 2. Lukas suggested that we add [086] the public keyword to PHP 4.4 for forward compatibility. Rasmus again wondered about “the reasoning ... for not having var be a synonym for public in PHP 5 [087].” Andi mentioned [088] that this “was meant to help people find vars so that they can be explicit about the access modifiers” when moving to PHP 5. A few days later, Andi read a blog posting [089] which described how PHP 4.4 is breaking backwards compatibility by issuing an E_STRICT in cases where developers abuse return-by-reference. This, however, was not actually the case [090].

Yasuo started a long thread [091] on allow_url_fopen and claimed it was dangerous [092]. The main result of this thread seemed to be that we wanted to split the setting into two different privileges: one that allows remote opening of URLs and one that allows include() on remote URLs. However, this is something we could not yet change. The last thread of the month was by Andi, writing about the PHP 5.1 release process [093].

July

In July, Jessie suggested [094] a String extension that declares only one class: String. This class is meant to prevent copying of the string’s data for most operations (which is currently done with PHP’s string functions). Most of the other developers were against it, for different reasons: “String is such a generic name for a non-core class [095]” and “the savings gained by this will be more than offset by OO overhead [096],” so we will not let “this get anywhere near the core [097].”

In the same week, I made more changes to the date extension [098] that allow users to more easily select the timezone that they want, instead of having to rely on the TZ environment variable. This is also needed because the TZ environment variable [099] can most likely not be used in a thread safe way, and it is certainly not portable [100]. Also in the same week, I proposed an API for new Date and Timezone functionality [101]. After some pressure [102], I added [103] an OO API, too. Near the end of the month, I committed the implementation of the new date functionality [104]. It was, however, #ifdef-ed out to facilitate discussions at a later date.

Jessie came up with Yet Another Namespace Proposal [105], and tried to come up with a solution for all the previous problems we had with the implementation. He also made several patches [106] that added namespaces to PHP. We had some more fuss [107] about PHP 4.4 breaking BC, where some people didn’t see [108] why we had to implement this fix. Unfortunately, there were some quirks [109] that we still had to sort out. In this same month, Rasmus released APC 3.0.0 [110], which came with PHP 5.1 support and numerous fixes.

August

August started with a discussion on instanceof [111] being “broken,” as it raises a fatal error in the case where the class that is being checked for doesn’t exist. Andi declared “if you’re referencing classes/exceptions in your code that don’t exist, then something is very bogus with your code [112]” and “the only problem is if the class does not exist in your code base, in which case, your application should blow up! [113]”

I raised a question about whether the new PHP with Unicode should be called PHP 5.5 or PHP 6.0 [114]. Andi (and the majority) wanted to go “with PHP 6 and aim to release it before Perl 6 [115].” After PHP_5_1 was branched, Andrei merged the Unicode branch and gave us some instructions on how to get started with it [116]. He also introduced the general ideas behind the implementation [117]. PHP 5.1 RC1 was finally rolled about halfway through the month, followed by PHP 5.0.5 RC2 [118] a week later.

During the development of the eZ components [119], we discovered various things in PHP’s OO model that we wanted to see changed. One of those issues was described in the Property Overloading RFC [120]. Unfortunately, not everybody could be convinced [121], and no changes were made. I will try again though :). The other issue that we raised was that failed typehints throw a fatal error [122], while that is not strictly necessary. Instead of throwing exceptions [123] in this case, the discussion turned towards adding a new error mode [124] (E_RECOVERABLE [125]) that will be used for non-engine-corrupting fatal errors at the language level, which is exactly the case with failed typehints.
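The July date work (selecting a timezone explicitly rather than relying on the TZ environment variable, plus an OO API) can be illustrated with the date functions as they exist in current PHP. This is a sketch under that assumption; the API that was proposed and #ifdef-ed out in 2005 may have differed, and the timezone identifiers below are merely examples.

```php
<?php
// Pick the default timezone explicitly instead of relying on the TZ
// environment variable. The date.timezone ini setting does the same
// thing from configuration.
date_default_timezone_set('UTC');
echo date_default_timezone_get(), "\n";

// Procedural API: format Unix timestamp 0 in the default timezone.
$epoch = date('Y-m-d H:i:s', 0);
echo $epoch, "\n";

// OO API: a DateTime bound to its own DateTimeZone, independent of
// the default chosen above.
$oslo = new DateTime('2006-01-01 12:00:00', new DateTimeZone('Europe/Oslo'));
echo $oslo->format('c'), "\n";
```

Binding the timezone to the object, or setting it per script, avoids the thread-safety and portability concerns raised about a process-wide environment variable.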

The longest thread of the month was started by Rasmus when he posted his PHP 6 [126] wish list, which featured controversial changes such as “removing magic_quotes” and “making identifiers case-sensitive,” to which most developers quickly agreed [127]. Following his initial wish list, the crowd went wild and started suggesting all kinds of weird changes, such as “Radically change all of the operator syntaxes [128],” adding <?php6 [129] as a BC breaking mode, and “Named parameters [130].” Marcus made a list of his own [131] which would later become the first draft of the meeting agenda for a PHP Developers Meeting.

The filter extension, which I’ve been developing for quite some time, did not make it into PHP 5.1...

September

In September, Antony committed [132] an upgraded OCI8 extension which fixes a lot of bugs [133]. We also decided to play a bit nicer with version_compare(), regarding naming [134] release candidates. Zeev wanted to roll [135] PHP 5.0.5 but there was an issue [136] with the shutdown order.

The reference issues returned, too. The first one [137] turned out to be an incorrect merge to the PHP 5.0 branch, where suddenly some of the notices turned into errors [138]. The second one [139] is simply a small change in behaviour for code that previously caused memory corruption. Rasmus explained the issue a bit more [140], once again. Ilia tried to implement a clever fix [141] which turned out to be a problem later on.

Pierre started a discussion on supporting Unicode in identifiers, something he didn’t want to see. PHP already supports using UTF-8 encoded characters [142] in identifiers, so removing this feature would break BC unnecessarily. Besides breaking BC, many people simply want to use their own language for writing code, as Tex [143] writes. Zeev made another attempt at PHP 5.1.0 RC2 [144], with the latest PEAR being the only thing missing. Marcus brought up the issue of __toString() again, and finally managed to get it into CVS, but unfortunately not in time for PHP 5.1.

Stanislav [146] noticed some problems with detecting time zones, as the new date/time code did not attempt detection, in favour of the new date.timezone setting [147]. After some discussion, we came up with a solution [147], which was then implemented. It should guess the timezone correctly in most cases, even on Windows. I also added support for an external timezone database [149].

October
In October, I noticed some weird notices [150] with “make install-pear,” without a clue as to why they were showing up. This discussion turned into a “why does PEAR not support PHP 5.1” thread [151]. In the end, Greg managed to nail down the weird notices, though. I also noticed a commit by Dmitry [152] that ignores “&” when $this is passed. I pointed out that this should not be supported (in PHP 5), as it doesn’t really make sense that people won’t see a warning/notice/error when they’re doing something silly. Dmitry explained [153] that disallowing it would break code, but he also wrote that by “using ‘=& $this’, a user can break the $this value,” which is something we definitely should prevent. He suggested [154] we make this an E_STRICT warning, and Andi suggested [155] we escalate this to an E_ERROR in PHP 6, but neither of those things happened. A week later, Piotr [156] asked for a tarball of our CVS to make it “possible to convert it to Subversion repository ... so browsing the repositories would be much easier.” We wondered [157] why he needed that, as we already offer our own browser [158]. Matthias [159] said that we “do not want to set off yet another discussion about the changes 4.4 brought,” but that is exactly what he did. Again, there was something wrong with his code, and thus the warning is legitimate. After resolving the timezone issues last month, we were surprised by a message from Zeev. He had simply missed [161] the conclusion in the “lengthly thread.” As a result of the negative comments on the PHP 4.4.0 release, Lukas, Ilia and I set up a routine [162] for involving some of the better-known projects in the PHP 4 [163] and PHP 5 [164] release processes. As part of this effort, we send out [165] a mail to all participating projects whenever we have a release candidate to test. I raised [166] some concern regarding our current Unicode implementation because of maintenance issues. In part of my mail, I also indicated that I wanted “to clean up PHP 6 for real” [167], after private discussions with Marcus and Ilia. Behind the scenes, we prepared some material to organize a PHP Developers Meeting to discuss the Unicode implementation and the extended “PHP 6 Wishlist.” I also committed [168] a patch that allows typehints for classes to work with = NULL [169]. Another guy raised the issue of “that new isset()-like language construct” [170], but this ended up going nowhere, as people were suggesting very Perl-like [171] operators. Jani replied to this thread with “How about a good ol’ beating with a large trout?” [172] On the last day of the month, we released PHP 4.4.1 [173], which addresses some of the reference issues we’ve seen in PHP 4.4.0.

November
In November, we prepared to finally release PHP 5.1, and one of the efforts was to make an upgrade guide [174] for people switching to PHP 5.1. Sean noticed [175] a problem with the parameter parsing API’s automatic type conversion. Like Andrei [176], many people think that “passing ‘123abc’ and having it interpreted as 123” is still wrong. Dmitry implemented [177] support for “= null” as default to array type hinting, something that I did not do [178] on purpose because “= array()” is the logically correct way of doing this. Andi agreed [179] with me on this. Ilia implemented, in PHP 5.1RC5 [180], one of the items that was on the outcome list of the PHP Developers Meeting: adding a notice that warns people that curly braces [181] for addressing a character in a string are now deprecated in favour of the [] operator, contrary to the current explanation in our manual. {} and [] are exactly the same thing [182] and “having two constructs for the same behaviour is silly and leads to confusing, hard to read code.” The outcome of this discussion was the removal of the notice in PHP 5.1, and the likely conclusion is that the curly-brace syntax is not going to be removed. Another change that made it into PHP 5.1RC6 was the creation of the “Date” class, which caused quite a stir after the release of PHP 5.1 [183]. The reason to introduce it in 5.1 was simply to make sure that no applications were going to break if we introduced the Date class later in the 5.1.x series. Unfortunately a lot of projects, including PEAR, never heard of “prefixing” class names, causing class name clashes. Marcus described the problem as “PEAR ignores coding standards” [184], but others suggested that we rename the internal class [185] to something silly like php_date. Andrei [186] asked “what does renaming really buy us? The only purpose of introducing this class in RC6, as far as I can tell, was to reserve the ‘Date’ name for future use.” Now that we know about this issue, it’s time for PEAR to start prefixing its classes, so that we can finally do the right thing and add our Date (and Timezone) classes. That code has been around for months now, and I’m quite tired of waiting for it to be in a release where I can use it. We ended up reverting the change that claimed the Date and Timezone classes, and released 5.1.1 with this change. After the PDM I posted [187] the meeting notes [188] to the list. Most of the outcome was well appreciated, except the curly braces idea which has already been discussed. With these notes, we hope to make PHP 6 a success. The notes also spawned numerous [189] polls [190] on the symbol to use for separating namespaces from class names/function names. We also discussed our version of a goto: labeled [191] breaks [192]. The filter extension [193], which I’ve been developing for quite some time, did not make it into PHP 5.1, although it is a good idea [194] to add it now with an “experimental” status, so that this wanted extension gets more testing. Perhaps for PHP 5.1.2…

December
December was a quiet month with little action. Ilia proposed [195] a plan for PHP 5.1.2 and released PHP 5.1.2RC1 [196], Zeev committed [197] Dmitry’s re-implementation of the FastCGI API and some user [198] was whining about our “official” IRC channel (which doesn’t exist). That was it for 2005 (as far as PHP internal development is concerned)! I hope you enjoyed reading this, and have a happy new year. Extra thanks go to Ilia, for being the release master, Dmitry for maintaining the engine, Jani for hunting down bug reports, Andrei for his work on Unicode, Mike for his enormous stream of useless commit messages ;-), and to all others who made PHP happen this year.

DERICK RETHANS provides solutions for Internet related problems. He has contributed in a number of ways to the PHP project, including the mcrypt, date and input-filter extensions, bug fixes, additions and leading the QA team. He now works as project leader for the eZ components project for eZ systems A.S. In his spare time he likes to work on xdebug, watch movies, travel and practice photography.


PDFlib’s Block Tool

If you’ve been developing for any length of time, you’ve probably been tasked with generating PDFs at some point. In this article, we’ll discuss the process of combining data from many sources into a single PDF—from installation of the block tool, to creating the blocks in Adobe Acrobat, and then finally working with the blocks via PDFlib.

by Ron Goff


The PDFlib Block Tool, available for use only with PDFlib Personalization Server (PPS), helps create PDF documents derived from large amounts of variable data. Before the block tool was added, it was difficult to place variable data, images, and even other PDFs into precise areas of a previously designed PDF. Now, adding variable data is very simple and helps create great dynamic pieces for just about any application.

CODE DIRECTORY: pdflib

Installing the Block Tool
Currently, the block tool plug-in for Adobe Acrobat is only available on the Windows and Macintosh (both Mac OS 9 and Mac OS X) platforms. On either platform, you must also have Version 6 or 7 of Adobe Acrobat Professional or Adobe Acrobat Standard, or the full version of Adobe Acrobat 5. Other versions of Adobe Acrobat (Acrobat Reader and Acrobat Elements) and all other PDF creation tools do not work with the block tool plug-in. (Check the PDFlib web site for an up-to-date list of supported PDF authoring tools.)

Windows OS Installation
If you’re using Windows, you can use the block tool installer provided by PDFlib to get the plug-in installed correctly into your version of Adobe Acrobat 5, 6, or 7. The installer places the correct files into the Acrobat plug-ins folder, which is typically found at C:\Program Files\Adobe\Acrobat 6.0\Acrobat\plug_ins\PDFlib. The Windows version of the block tool is compatible only with PPS version 6.0.1.



Mac OS Installation
You can install the block tool in either Mac OS 9 or OS X. If you own Adobe Acrobat 5, place the files that comprise the block tool into the Acrobat plug-in directory, typically located at /Applications/Adobe Acrobat 5.0/Plug-Ins/. If you’re using Adobe Acrobat version 6 or version 7, save the files that comprise the block tool into a new directory and then locate the Acrobat program, which is usually found at /Applications/Adobe Acrobat 6.0 Professional. Using the Finder, click once on the Acrobat application to select it and then choose “File > Get Info” from the menu bar. Locate the triangle next to the word “Plugins.” Expand the triangle, select “Add,” and then locate the folder that contains the block tool plug-in files.

The New and Improved Block Tool

If you’ve used previous versions of the block tool, you’ll notice that the new version is much more user friendly. The export and import features have also been updated, making it much quicker to apply blocks from previously formatted PDFs.


Creating Blocks
After you install the block tool, you should see a new menu called “PDFlib Blocks” in Acrobat’s main menubar. You should also see a new icon that resembles [=]; this is the block tool. (See the top of Figure 1.) You use the block tool icon to create regions that you can fill with variable data. When you click the block tool icon and hover over the PDF, your cursor turns into a crosshair. To create a block, click the mouse and hold it while dragging your cursor. As you drag your cursor, a lightly-outlined box should appear. (See Figure 1.) When you’re satisfied with the size of the box, release the mouse button. A menu like the one shown in Figure 3 appears. The menu controls all of the properties of the block, including the formatting of the data that will be contained in the block (data that you will add via PDFlib).

There are three types of blocks that can be created:
• The first and default type of block is text. It handles any type of text, whether it’s a single line of text or many lines of text.
• The second type of block is image. As its name implies, an image block is a container for the dynamic placement of images within the PDF.
• The third and last type is PDF, which is able to contain other PDFs.

Each block has general properties (see Figure 2) and type-specific properties. General properties set attributes such as the placement of the block, its background and border colours, and its orientation, to name just a few. Some of the sections that follow describe the type-specific properties.

So what do you do with blocks? As you might have inferred already, you use blocks to mix dynamic content amid static content. A designer can create a PDF, include static text and images, and then place blocks wherever dynamic content should appear. Your application “fills in the blanks,” so to speak, and because blocks retain properties such as typeface, font size, color, kerning, and other settings, the block, once filled, looks exactly like the rest of the document, just as the designer intended. Using blocks, the application that generates each PDF document need not format anything. However, if you want to customize a block on-the-fly, you can. Pre-defined block attributes can be overwritten by your code.

Editing Block Settings


To change a block property, select the block you want to configure and then navigate to find the property you want to change. For example, Figure 3 shows how to edit the textflow property, which can be either true or false (hence, the dropdown menu). The purpose of most properties is obvious, but be careful with attributes that specify font names. Unless you’re running Acrobat on the same machine as your PDFlib application, it’s likely that the set of fonts on the two machines (say, your desktop and the server, respectively) will differ. Be sure to use the names of fonts that are installed on your server.

Text Flow Settings
If you want a block to flow (automatically wrap and justify) arbitrary amounts of text, set the textflow property to true. Once set to true, an additional button named TextFlow appears next to the existing button labeled Text. Click on TextFlow to examine and set specific variables (such as leading and indents) that control how text flows in the block. All other text attributes, whether for one line of text or a flow of text, remain in the same pane as the textflow property.

Mac OS X “Tiger”

If you’re using a very recent version of Mac OS X, you can find Acrobat’s plug-ins folder by control-clicking the Acrobat application and selecting “Show Package Contents”.

Image Settings
By changing the block option to image, you can use PDFlib to place images dynamically in a PDF. There are far fewer options for an image block than for a text block. The options screen for an image block is shown in Figure 5. The defaultimage attribute names a default image to place if the image specified by PDFlib is unavailable. The dpi setting, or the number of dots per inch, is used to override the dpi of an image. If you don’t set this option, PDFlib uses the image’s own dpi value if one is available, or 72 dpi otherwise. If necessary, you can set the horizontal and vertical dpi independently by supplying two values instead of one, first horizontal dpi and then vertical dpi. The scale property controls the scaling of the image. You can supply one value to scale horizontally and vertically equally, or supply two values, one for the horizontal and another for the vertical scale factor.
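To see what the dpi setting means for the size of a placed image, recall that PDF measures in points, with 72 points to the inch: pixels divided by dpi gives inches, and inches times 72 gives points. The following is only a sketch of that arithmetic. The helper placedSizeInPoints() is a hypothetical name and not a PDFlib function; PDFlib does this work for you when it fills the block.

```php
<?php
// Illustrative arithmetic only; placedSizeInPoints() is a hypothetical helper,
// not part of PDFlib.
// $dpi may be a single number (used for both axes) or array(horizontal, vertical),
// mirroring the one-or-two-value form of the block's dpi property.
function placedSizeInPoints($pixelWidth, $pixelHeight, $dpi = 72)
{
    if (is_array($dpi)) {
        list($dpiX, $dpiY) = $dpi;
    } else {
        $dpiX = $dpiY = $dpi;
    }
    // pixels / dpi = inches; inches * 72 = points
    return array(
        $pixelWidth / $dpiX * 72,
        $pixelHeight / $dpiY * 72,
    );
}

// A 600 x 300 pixel image at 300 dpi is 2 x 1 inches
list($w, $h) = placedSizeInPoints(600, 300, 300);
printf("%.1f x %.1f pt\n", $w, $h); // prints "144.0 x 72.0 pt"
```

Doubling the dpi halves the placed size, which is why a print-resolution image looks far too large when it falls back to 72 dpi.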


PDF Settings
The settings for a PDF block are very similar to the settings for an image block, as shown in Figure 6. defaultpdf specifies a default PDF to place if the PDF document that PDFlib names cannot be found. defaultpdfpage specifies which page of the default PDF to place if the default PDF must be used. scale controls the scaling of the PDF. As with an image, you can specify one value to apply to both axes or you can provide two values, one for horizontal scaling and another for vertical scaling.


Custom Settings
When using any type of block, you can specify custom attributes. Custom attributes do not affect the output when using PDFlib, but can be retrieved by PDFlib for interpretation by your code. Custom attributes are good for passing information to the PDFlib program, or even for just better record keeping. As an example, say that you want to create a text block that’s limited to ten characters or less. Create the text block, add a custom property named length, set it to 10, and then retrieve the value via PDFlib at runtime. Your code can verify the length of a string before filling the block and react accordingly, perhaps truncating the string or asking the user to provide a new value.
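As a minimal sketch of the length check just described, assume the value of the custom length property has already been read from the block at runtime (the retrieval call itself is left out here). The helper name fitToBlockLimit() is hypothetical and not part of PDFlib; it takes the "truncate" route, and it counts bytes, which matches characters only for single-byte encodings such as winansi.

```php
<?php
// Hypothetical helper: $maxLength stands for the value of the custom
// "length" property, assumed to have been read from the block beforehand.
function fitToBlockLimit($text, $maxLength)
{
    $maxLength = (int) $maxLength;
    if ($maxLength <= 0 || strlen($text) <= $maxLength) {
        return $text; // no usable limit, or the text already fits
    }
    return substr($text, 0, $maxLength); // cut the text down to the limit
}

echo fitToBlockLimit("Conveyor Group", 10), "\n"; // prints "Conveyor G"
```

An application that prefers to ask the user for a new value could compare strlen($text) with the limit and report an error instead of calling substr().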


The PDFlib Blocks Menu
To make setting up blocks easier, the “PDFlib Blocks” menu has a few handy tools. You can export and import blocks to re-use complex blocks, you can align elements, and more.

Exporting
The “Export” feature is a huge timesaver when dealing with multiple PDFs that require the same types of blocks. Once you’ve finished setting up blocks in a single “master” PDF, you can export those blocks and then import them over and over again into other PDFs. There are several different settings in the “Export” dialog (see Figure 7):
• You can export blocks from all pages of the PDF or from a subset of them.
• You can export blocks to a new PDF or to an existing PDF. Selecting “New File on Disk” creates a blank PDF with the blocks set in the new file. If you want to export blocks to a document that you already have opened in Adobe Acrobat, select “Open Document” and click “Choose” to see a list of all open documents. If you choose “Replace Existing Files”, the block tool will overwrite the target file with blank pages with the blocks in the proper place.
• The next option is “Export Which Blocks?” This section allows you to control which blocks are exported. You can export all blocks (depending on the number of pages you choose in the first section) or just the blocks that you highlight before exporting. You can also choose to delete the blocks that exist on the target PDF.

Importing
You can import blocks from another PDF using the import option in the “PDFlib Blocks” menu. When you choose “Import,” you will be presented with a screen to choose the file that contains the blocks you want to import (Figure 8). After you choose the appropriate file, you can determine which pages the blocks should be applied to.

Alignment Options
The alignment option in the “PDFlib Blocks” menu allows you to align two blocks. To align, choose a block. It should turn pink, reflecting that it’s your primary choice. Then choose another block; it should turn blue, indicating that it’s your secondary choice. When you select “Align,” the blue block should align with the pink block. Figure 9 shows two blocks: Block_1, the secondary block, is left-aligned to the primary block, Block_0. The “Size” alignment option only works when more than one block is selected. You can change all secondary blocks (blue) to be either the same width or height as the primary block (pink). The “Center” alignment option aligns all selected blocks horizontally, vertically, or both.

Defining Blocks and Detecting Settings
Two other time savers are available in the “PDFlib Blocks” menu: one creates a block from a placed object like an image, and another creates blocks that automatically detect the font settings and font color of the text that the block is being created over. Click on “Click Object to Define Block” and then click on an object such as an image to create a block of the same dimension in the exact same position. Or, if you click on “Detect Underlying Font and Color” before you create a block, the block’s font settings are automatically set to match the style and size of the text below the new block. This feature is especially useful when dealing with a lot of text and specific colors. (You may have to adjust the font name to match a font located on the server running PDFlib.)

Using Blocks
As you might imagine, working with blocks from within your code makes placing text, images, and PDFs into a dynamic PDF far simpler than writing code to control the pointer, stroke text line-by-line, and so on. With blocks, formatting is separated from your code, leaving all of the aesthetics to the designer creating the PDF. Better yet, a change to the design of the page doesn’t (necessarily) necessitate tweaking your code.

Setting up the dynamic PDF document is similar to what’s been shown in prior chapters, except you need to pull in the PDF that contains the blocks. First, specify the basic information:

    if (!extension_loaded('pdf')) {
        dl(''); // extension file name is missing from the source text
    }
    $p = PDF_new();
    PDF_begin_document($p, "", "");
    PDF_set_info($p, "Creator", "block_tool.php");
    PDF_set_info($p, "Author", "Ron Goff");
    PDF_set_info($p, "Title", "Block Tool");

Next, pull in the PDF page that contains the blocks, place it into memory, and create a new blank page:

    $block_file = "block_file.pdf";
    $blockcontainer = PDF_open_pdi($p, $block_file, "", 0);
    // Page standard 8.5 x 11
    PDF_begin_page_ext($p, 612, 792, "");

Continuing, call up the actual page that you want to use. In the line of code below, the 1 (numeral one) refers to page one of the PDF that contains the blocks.

    $page = PDF_open_pdi_page($p, $blockcontainer, 1, "");

If you want to use another page from the “template” PDF, just specify that page number instead of 1. Finally, the page with blocks is “copied” to the new page in the new PDF.

    PDF_fit_pdi_page($p, $page, 0.0, 0.0, "adjustpage");

The adjustpage option adjusts the size of the new page to match the page size of the template PDF. adjustpage overrides any page settings that have been set previously. From here, you are ready to use the blocks.
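The values 612 and 792 passed to PDF_begin_page_ext() are the 8.5 x 11 inch page expressed in points, at 72 points per inch. As a small sketch of that conversion (the helper name inchesToPoints() is made up for this illustration and is not a PDFlib function):

```php
<?php
// PDF's default unit is the point: 72 points per inch.
// inchesToPoints() is a hypothetical helper, not part of PDFlib.
function inchesToPoints($inches)
{
    return $inches * 72;
}

$width  = inchesToPoints(8.5); // 612
$height = inchesToPoints(11);  // 792
echo $width, " x ", $height, "\n";
```

Because adjustpage resizes the new page to match the template, these values matter mostly when you place the template without that option.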

Form Conversion

You may be familiar with the Adobe Acrobat “Form Tool,” a great way to create fillable areas of your PDF. So, why not just use forms to define variable data placement? Because the form tool is limited: it cannot specify advanced font settings, whereas the block tool has been designed specifically to customize all aspects of your text. However, if you have a PDF that used the form tool to define areas for text, there is an option within the “PDFlib Blocks” menu to convert your pre-made forms into blocks (Figure 5.4).

Text Blocks
Whether working with a line of text or a text flow, text is easy to fill in: just specify the name of the block and the text to render and call PDF_fill_textblock().

    $block = "Block_1";
    $text = "All the pie in the sky wasn't enough to fill my plate";
    PDF_fill_textblock($p, $page, $block, $text, "encoding=winansi");

The block name, here Block_1, is the name that was assigned to the block when it was created in the template PDF. (Block names are unique and the default name is Block_#, but a block name can be any string of alphanumeric characters.) Notice that there are no extra formatting options. Whatever text you “insert” assumes the formatting of the block.

If you want to override a block’s formatting, you can. Where encoding=winansi appears, add the options that you want to override. For example, to override the font size, specify encoding=winansi fontsize=12. You should also enable embedding as needed. You can enable embedding by adding embedding=true as in encoding=winansi embedding=true.
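When several overrides are combined, it can be convenient to assemble the option string from an array rather than by hand. The sketch below shows one way to do that. buildOptionList() is a hypothetical helper and not part of PDFlib, and it only handles simple values without spaces, such as the encoding, fontsize, and embedding options mentioned in the text.

```php
<?php
// Hypothetical helper (not part of PDFlib): joins key/value pairs into the
// space-separated "key=value" form used for the last argument of
// PDF_fill_textblock() in the article. Values containing spaces are not handled.
function buildOptionList($options)
{
    $parts = array();
    foreach ($options as $key => $value) {
        if (is_bool($value)) {
            $value = $value ? 'true' : 'false'; // booleans become true/false keywords
        }
        $parts[] = $key . '=' . $value;
    }
    return implode(' ', $parts);
}

$optlist = buildOptionList(array(
    'encoding'  => 'winansi',
    'fontsize'  => 12,
    'embedding' => true,
));
echo $optlist, "\n"; // prints "encoding=winansi fontsize=12 embedding=true"

// The string would then be passed where "encoding=winansi" appears:
// PDF_fill_textblock($p, $page, $block, $text, $optlist);
```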

Image Blocks
The process of placing an image in an image block resembles that of placing the image “manually”: the image is loaded and then placed.

    $block4 = "Block_4";
    $image_load = "image.jpg";
    $image = PDF_load_image($p, "auto", $image_load, "");
    PDF_fill_imageblock($p, $page, $block4, $image, "");
    PDF_close_image($p, $image);

In this example, the image image.jpg is placed in Block_4 using the function PDF_fill_imageblock().

PDF Blocks
The steps to place a PDF document within the dynamically generated PDF are similar to the steps required to set up a page to work with blocks. You identify which block you want to “fill,” identify the PDF and the page you want to extract from, and then fill the named block with that content.

    $block5 = "Block_5";
    $pdf_load = "basic_pdf.pdf";
    $pdf = PDF_open_pdi($p, $pdf_load, "", 0);
    $pdf_fill = PDF_open_pdi_page($p, $pdf, 1, "");
    PDF_fill_pdfblock($p, $page, $block5, $pdf_fill, "");
    PDF_close_pdi($p, $pdf);

PDF_open_pdi() opens the PDF, while PDF_open_pdi_page() loads the correct page. The function PDF_fill_pdfblock() puts it all together, placing the actual PDF onto the page. Finally, close the open PDF by calling PDF_close_pdi(), which frees the resources consumed by the open PDF.

Closing the Page
After you’ve filled all of the appropriate blocks on the open page, you must close that page.

    PDF_close_pdi_page($p, $page);

This line closes the imported template page. After this is called, you can finish the output page with PDF_end_page_ext() and then start a new page or end the entire document.

Putting It All Together
A complete example using the PDF_fill_textblock() function can be seen in Listing 1. The PDFlib block tool is easy to use and provides for complex layouts without extensive programming. Using blocks, a designer can assign where dynamic text, images, and even PDFs are to be placed, yielding a much more professional result.

RON GOFF is the technical director/senior programmer for Conveyor Group, a Southern California-based web development firm. He is the author of several articles for php|architect magazine and other online publications. Ron lives in California with his wife Nadia and 2 children.

LISTING 1

    <?php
    if (!extension_loaded('pdf')) {
        dl(''); // extension file name is missing from the source text
    }

    $p = PDF_new();
    PDF_begin_document($p, "", "");
    PDF_set_info($p, "Creator", "block_tool.php");
    PDF_set_info($p, "Author", "Ron Goff");
    PDF_set_info($p, "Title", "Block Tool");

    // Open the template PDF that contains the blocks
    $block_file = "block_file.pdf";
    $blockcontainer = PDF_open_pdi($p, $block_file, "", 0);

    // Start a new page and copy page 1 of the template onto it
    PDF_begin_page_ext($p, 612, 792, "");
    $page = PDF_open_pdi_page($p, $blockcontainer, 1, "");
    PDF_fit_pdi_page($p, $page, 0.0, 0.0, "adjustpage");

    // Fill the text block; the text takes on the block's formatting
    $block = "Block_1";
    $text = "All the pie in the sky wasn't enough to "
          . "fill my plate";
    PDF_fill_textblock($p, $page, $block, $text, "encoding=winansi");

    // Close the imported page before closing the document it came from
    PDF_close_pdi_page($p, $page);
    PDF_close_pdi($p, $blockcontainer);

    PDF_end_page_ext($p, "");
    PDF_end_document($p, "");

    // Send the finished PDF to the browser
    $buf = PDF_get_buffer($p);
    $len = strlen($buf);
    header("Content-type: application/pdf");
    header("Content-Length: $len");
    header("Content-Disposition: inline; "
         . "filename=block_pdf.pdf");
    print $buf;

    PDF_delete($p);
    ?>

FPDI in Detail

Most PHP developers know about the ability to create PDF documents on the fly. Looking at the wide range of PHP classes and APIs, every product has its own advantages and disadvantages: some of them are very expensive, and others are free but don’t offer the same functionality as the expensive ones. The main difference between the free and commercial libraries is the ability to use external documents. PDFlib has supported this through its PDI interface, but the free classes didn’t support external documents until I released FPDI for FPDF, which gives you the same muscle, but for free!



PHP: 4.2+
OTHER SOFTWARE: FPDF 1.53 and FPDI 1.1
CODE DIRECTORY: fpdi

PDF documents, or better stated, the PDF format, have reached widespread popularity over the past few years, and this momentum continues. A very strong example of this is a recent ISO standard, which is based on PDF 1.4 and defines a PDF derivative for the long-term preservation of electronic documents. PDF has become a real standard! In fact, the dynamic generation of PDF documents is an important issue today, and will continue to be so in the future. While it’s quite simple to build PDF documents on desktop PCs, their dynamic generation on a webserver, especially when using a language like PHP, can prove very difficult. On the Internet, you’ll find several PDF APIs that will allow you to create PDF documents with PHP. Some are delivered as PHP extensions, and some are “simple” PHP classes. Years ago, I came across a PHP class going by the name of FPDF, written by Olivier Plathey. I was absolutely amazed by its capabilities, its ease of use, and the fact that the “F” in “FPDF” stands for “Free.”

When I was working with FPDF, I was often challenged with a situation where I had to rebuild a whole document programmatically. As you can imagine, this part was very frustrating, tedious, and time consuming. A digital version of your document is sitting right in front of you, and you just cannot use it. Similarly, I ran into additional problems when dealing with vector based graphics and FPDF. There was no real way to import such things, except by converting them to bitmaps and using the Image() method of FPDF. I’m sure I don’t have to explain the drawbacks to this workaround.

When I found an article in php|architect (Vol. 3, Issue 5) where Marco Tabini described how to parse a PDF and update it with some simple content, I got the idea to implement this technique in FPDF. The result was a library whose name also has four simple characters: FPDI (Free PDF Import). I released my new library under the Apache Software License 2.0, which allows you to use it in your commercial or non-commercial projects. The article by Marco is freely available as a monthly sample.

In this article, I’ll introduce you to FPDI, explain how it was born, and cover its internal workings. I will assume that you have some knowledge of FPDF, and have a bit of experience with the Portable Document Format itself. If not, just download FPDF, and run the tutorials that Olivier provided in the package. This article will not tell you how to use FPDF, but will delve deeper into the details of the PDF structure and how FPDI extends FPDF, adding the ability to import single pages of existing PDF documents rather than modifying existing documents. This distinction is not clear to many people. At this point I could tell you much about the structure of a PDF document, but as I already mentioned, the whole idea is based on another article, where everything you need to know about parsing a PDF is already described. I will cover some details about that issue later in this article.

I want to make it clear why I chose the “import single pages” method, instead of “really modifying/updating” a PDF. To put it simply: “It is much easier.” You can look at a PDF document as a collection of single objects which are linked to each other. Pages, images, font descriptions, and document information are all single objects and can be identified by a unique ID. The PDF format is more flexible than just assigning objects by simple IDs, though: it allows one to define named relations. For example, these relations can be used to put an image into a content stream of a PDF page. You have to set up a resource dictionary, where you define the name of the image and its real object relation. After this, you can simply refer to the image by using the name you provided in the content stream.

Because FPDF, like any other PDF generator, uses named relations, which lead to naming conventions, you have to pay attention when updating a PDF. If you’ve read Marco’s article, you’ll remember that there’s a part in it where he searches for the next available font name. This check would have to be built into FPDF before every piece of code where FPDF creates a named relation. Another disadvantage of updating documents is that you cannot remove single pages, or reuse an existing page in an easy way. Importing single pages, however, allows us to reuse, resize, crop, or rotate a page. We can also avoid naming conflicts, because every imported page has its own kind of namespace in the new document, as you’ll see below.
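To make the idea of a named relation more concrete, here is a small sketch that builds a resource dictionary and a content stream as plain strings, and that searches for the next free name the way an updating tool would have to. Everything specific in it is made up for illustration: the names I1 to I3, the object numbers, and the helper nextFreeName() are assumptions, and this is not code taken from FPDF or FPDI.

```php
<?php
// Illustrative only: the names (I1, I2), the object numbers (12, 15, 21)
// and the helper nextFreeName() are invented for this sketch.

// Names already present in a page's resource dictionary => object numbers
$xobjects = array('I1' => 12, 'I2' => 15);

// When updating a document, a new name must not collide with an existing one.
function nextFreeName($used, $prefix)
{
    $n = 1;
    while (isset($used[$prefix . $n])) {
        $n++;
    }
    return $prefix . $n;
}

$name = nextFreeName($xobjects, 'I'); // "I3"
$xobjects[$name] = 21;

// Resource dictionary: each name points to an indirect object ("21 0 R")
$entries = '';
foreach ($xobjects as $key => $objNum) {
    $entries .= '/' . $key . ' ' . $objNum . ' 0 R ';
}
$resources = '<< /XObject << ' . $entries . '>> >>';

// Content stream: paint the image by name in a 200 x 100 box at (50, 600)
$content = 'q 200 0 0 100 50 600 cm /' . $name . ' Do Q';

echo $resources, "\n", $content, "\n";
```

The content stream never mentions object 21 directly. It only knows the name I3, which is why two sources that both use the name I1 would clash if their resources were merged into a single dictionary.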

The Basics
While I was studying the PDF reference to find a good solution for importing pages, I came across a technique with the spooky name of “form XObjects”. I’m sure that everyone who stumbles upon this term thinks about conventional “forms” like those that we use in HTML, or on paper. In this case, “form” has another meaning: it corresponds to the notion of forms in the PostScript language. A form XObject can be compared with a kind of layer. It is a self-contained description of any sequence of graphics objects, and its whole structure is very similar to the structure of a single page in a PDF document. The form XObject has its own resource dictionary, where named relations are defined. So, it seemed to be the perfect solution for my problem: if I could create form XObjects, I most certainly would be able to convert pages into them.

But form XObjects have more advantages than simply preparing FPDF for PDF import. For example, they can be reused at any time in a PDF document, where the viewer application can cache the rendered results to optimize the execution. It sounded like a kind of template to me, so I began extending FPDF with this feature, which resulted in a PHP class called fpdf_tpl. This class redirects all output made by FPDF into containers which will be used as form XObjects, so one can reuse any output created with FPDF at any time. As already stated, this class has more to offer than merely preparing FPDF for FPDI. You can reuse a template multiple times in a document, while it only needs to be written once to the resulting document, which leads to less memory usage and processing time in your script.
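The redirection idea can be illustrated with a toy model. The class below is not the fpdf_tpl implementation; its name, properties, and the TPL naming are invented for this sketch, which assumes PHP 5 or later. It only shows the principle: while a template is open, output goes into the template's buffer, and using the template writes just a short reference to the page.

```php
<?php
// A toy model of the idea, NOT the fpdf_tpl implementation: the class and
// its members are invented for illustration.
class TinyWriter
{
    public $pages = array();    // page number => content buffer
    public $tpls  = array();    // template id => content buffer
    private $page = 0;
    private $currentTpl = null; // id of the template being recorded, if any

    public function addPage()
    {
        $this->page++;
        $this->pages[$this->page] = '';
    }

    // All output goes through here; while a template is open it is
    // redirected into the template's buffer instead of the page.
    public function out($s)
    {
        if ($this->currentTpl !== null) {
            $this->tpls[$this->currentTpl] .= $s . "\n";
        } else {
            $this->pages[$this->page] .= $s . "\n";
        }
    }

    public function beginTemplate()
    {
        $id = count($this->tpls) + 1;
        $this->tpls[$id] = '';
        $this->currentTpl = $id;
        return $id;
    }

    public function endTemplate()
    {
        $this->currentTpl = null;
    }

    // Using a template only writes a short named reference to the page.
    public function useTemplate($id)
    {
        $this->out('q /TPL' . $id . ' Do Q');
    }
}

$w = new TinyWriter();
$tpl = $w->beginTemplate();
$w->out(str_repeat('0 0 m 100 100 l S ', 50)); // stand-in for a costly background
$w->endTemplate();

for ($i = 0; $i < 3; $i++) {
    $w->addPage();
    $w->useTemplate($tpl);
}

echo strlen($w->tpls[$tpl]), " bytes stored once; each page holds ",
     strlen($w->pages[1]), " bytes\n";
```

However many pages use the template, the background operators are stored a single time, which is the saving in memory and processing time described above.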

LISTING 1

<?php
define('FPDF_FONTPATH', 'classes/font/');
require_once('classes/fpdf_tpl.php');

class pdf extends fpdf_tpl {
    var $useTPLs = true;
    var $_startTime;
    var $_endTime;
    var $_writingTime = false;

    function Header() {
        static $tplidx = null;

        // the statistics page written by Close() gets no background
        if ($this->_writingTime)
            return;

        if ($this->useTPLs) {
            // record the background once, then only reference it
            if (is_null($tplidx)) {
                $tplidx = $this->beginTemplate();
                $this->writeBackground();
                $this->endTemplate();
            }
            $this->useTemplate($tplidx);
        } else {
            $this->writeBackground();
        }
    }

    function writeBackground() {
        static $content = null;

        $this->SetFont('Arial', 'B', 10);
        $this->SetFillColor(255, 153, 0);
        $this->Rect($this->lMargin, 28,
                    $width = $this->w-$this->rMargin-$this->lMargin,
                    3, 'F');
        $this->Rect($this->lMargin, $this->h-10, $width, 3, 'F');

        $this->Image('images/php-a.png', 100, 5, 100);

        $this->SetDrawColor(0);
        $this->SetLineWidth(0.3);
        $this->Rect($this->lMargin+.15, 31, $width-0.3,
                    $this->h-31-10, 'D');
        $this->SetXY($this->lMargin+.15, 31+.15);

        // print this script's own source as page content
        if (is_null($content))
            $content = file_get_contents(__FILE__);
        $this->SetFont('Courier', '', 6);
        $this->MultiCell($width-.3, 2.5, $content);
    }

    // For debugging purpose
    function pdf($orientation='P', $unit='mm', $format='A4') {
        $this->_startTime = microtime();
        parent::fpdf_tpl($orientation, $unit, $format);
    }

    // For debugging purpose
    function Close() {
        $this->_endTime = microtime();
        $this->_writingTime = true;
        $this->AddPage();

        list($usec, $sec) = explode(" ", $this->_startTime);
        $start = ((float)$usec + (float)$sec);
        list($usec, $sec) = explode(" ", $this->_endTime);
        $end = ((float)$usec + (float)$sec);
        $time = $end - $start;
        $this->Cell(0, 4, 'Time: '.$time, 0, 1);

        // get the size of the buffer; foreach visits every entry
        // regardless of the index the arrays start at
        $buffersize = 0;
        foreach ($this->pages as $page)
            $buffersize += strlen($page);
        foreach ($this->tpls as $tpl)
            $buffersize += strlen($tpl['buffer']);
        $this->Cell(0, 4, 'Total buffersize: '.$buffersize.
                          ' bytes (uncompressed)');

        parent::Close();
    }
}

$pdf =& new pdf();
#$pdf->useTPLs = false;
for ($n = 0; $n < 200; $n++)
    $pdf->AddPage();
$pdf->Output('test.pdf', 'I');
?>

Examples of its use are the generation of headers and/or footers, table headers which are repeated on every page, a background grid for large tables, text in front of or behind a template, and so on. If you take a look at Listing 1 and Figure 1, you'll see a sample script which demonstrates the use of templates. You turn templates on and off by setting the $pdf->useTPLs property to true or false; the visual result is the same. This demo has no real meaning, but it shows how much the file size and processing time decrease if you're using templates. My tests gave me a processing time of only 0.0766 seconds when using templates, and 3.649 seconds without them! The same was true for the buffer size: with templates it takes up only 14.5 KB, without templates approximately 1.2 MB. I hope that the main advantage of fpdf_tpl is now clear.

Let's skip ahead and take a deeper look at this class. The class holds all created templates in an array named $this->tpls, where each entry describes a single template as an array with special keys. The main entries in each template array are x, y, w, h and buffer. All other entries are just used to save other information, and are prefixed with o_. A new property named $this->res is used to assign resources like fonts, images, or other templates to the template or the page. The assignment of resources to single pages is left in for testing purposes, and will be removed in the next release of fpdf_tpl.
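To make the "write once, reference many times" idea more tangible, here is a minimal sketch in plain PHP. It is not the fpdf_tpl implementation: the class TemplateSketch, its methods, the /TPL name and the stand-in drawing operators are all hypothetical. Only the template keys x, y, w, h and buffer are taken from the description above.

```php
<?php
// Hypothetical mini-recorder, NOT the fpdf_tpl code: it only mimics the
// idea of redirecting output into a template buffer.
class TemplateSketch
{
    public $pages = [];     // page number => content string
    public $tpls = [];      // template index => x, y, w, h, buffer
    private $page = 0;
    private $inTpl = null;  // index of the template being recorded

    public function addPage()
    {
        $this->pages[++$this->page] = '';
    }

    public function beginTemplate($x, $y, $w, $h)
    {
        $idx = count($this->tpls) + 1;
        $this->tpls[$idx] = ['x' => $x, 'y' => $y, 'w' => $w,
                             'h' => $h, 'buffer' => ''];
        $this->inTpl = $idx;
        return $idx;
    }

    public function endTemplate()
    {
        $this->inTpl = null;
    }

    // Output goes to the template buffer while recording, else to the page.
    public function out($s)
    {
        if ($this->inTpl !== null) {
            $this->tpls[$this->inTpl]['buffer'] .= $s . "\n";
        } else {
            $this->pages[$this->page] .= $s . "\n";
        }
    }

    // Using a template costs one short reference on the page.
    public function useTemplate($idx)
    {
        $this->out('/TPL' . $idx . ' Do');
    }

    public function bufferSize()
    {
        $size = 0;
        foreach ($this->pages as $p) { $size += strlen($p); }
        foreach ($this->tpls as $t) { $size += strlen($t['buffer']); }
        return $size;
    }
}

function buildDocument($useTemplates, $pageCount)
{
    // stand-in for the drawing operators of a page background
    $background = str_repeat('0 0 m 100 100 l S ', 50);
    $doc = new TemplateSketch();
    $tpl = null;
    for ($n = 0; $n < $pageCount; $n++) {
        $doc->addPage();
        if (!$useTemplates) {
            $doc->out($background);   // repeated on every page
            continue;
        }
        if ($tpl === null) {          // recorded only once
            $tpl = $doc->beginTemplate(0, 0, 210, 297);
            $doc->out($background);
            $doc->endTemplate();
        }
        $doc->useTemplate($tpl);
    }
    return $doc->bufferSize();
}

$with = buildDocument(true, 200);
$without = buildDocument(false, 200);
echo "with templates: $with bytes, without: $without bytes\n";
```

The absolute numbers mean nothing here; the point is only that the buffer grows with the number of pages when the background is repeated, and stays almost flat when it is recorded once and referenced.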


So, we'll only take a look at the tpl key in $this->res. This array is needed to rebuild the form XObject's resource dictionary with the named relations that are used in the template. To redirect the output made by FPDF, I used a simple flag, $this->intpl, and extended the _out() method. I had to take special care because a form XObject cannot include internal or external links or, more precisely, any kind of annotation. FPDF uses a single, global resource dictionary for all pages and creates this within the _putresources() method. I extended this method to make it call _puttemplates(), which will create all necessary template objects. After the objects are created and written, the named relations to them are written to the main resource dictionary. All created templates are usable on every page! Unfortunately, using the global resource dictionary isn't the best solution because it'll introduce problems when interpreting or extracting pages of a document, as you will see later. With the fpdf_tpl class, I've built the basis for FPDI. Now we have to convert the pages of an already existing PDF document, but we have to parse it first to get the desired information.

LISTING 2

Array
(
    [0] => 9
    [obj] => 11
    [gen] => 0
    [1] => Array
        (
            [0] => 5
            [1] => Array
                (
                    [/Type] => Array
                        (
                            [0] => 2
                            [1] => /Page
                        )
                    [/Parent] => Array
                        (
                            [0] => 8
                            [1] => 10
                            [2] => 0
                        )
                    [/MediaBox] => Array
                        (
                            [0] => 6
                            [1] => Array
                                (
                                    [0] => Array
                                        (
                                            [0] => 1
                                            [1] => 0
                                        )
                                    [1] => Array
                                        (
                                            [0] => 1
                                            [1] => 0
                                        )
                                    [2] => Array
                                        (
                                            [0] => 1
                                            [1] => 595
                                        )
                                    [3] => Array
                                        (
                                            [0] => 1
                                            [1] => 842
                                        )
                                )
                        )
                    [/Contents] => Array
                        (
                            [0] => 8
                            [1] => 1
                            [2] => 0
                        )
                )
        )
)

Parsing the Original Document

I owe a lot of credit to Marco's article, because the parsing of an existing document was nearly completely covered in it. I adapted all parsing functions into a single class, pdf_parser, and added support for reading streams. Let's take a quick look at the structure and how the parsing has to be done.

The first task of the parser is to read the xref-table of the PDF document. This is done by the pdf_parser::pdf_read_xref() method. The xref-table is similar to a table of contents. It gives us information about the objects used in the document, and their byte-offset positions in the file. At the end of the xref-table, we'll find the file trailer dictionary; the entries in this dictionary lead us to the catalogue dictionary of the file. The catalogue dictionary is the root of the document's object hierarchy. In it, we'll find the reference to the root node of the document's page tree, which leads to exactly what we're searching for: all the single pages used in the existing document. The parser has to follow the whole page tree to get the exact page count and to collect other information on the pages. This is done by read_pages() in the extended class, fpdi_pdf_parser, and results in an array stored in the $this->pages property. The keys of $this->pages are the page numbers, starting at zero, and each entry holds the related page object. After this task is done, we have enough information about the source document for now.

While I was implementing this code, I got stuck on some problems, and it took me several days (and nights) to fix them. A big problem for me was determining the line ending used in a file. Normally, this task is handled by the PHP configuration directive auto_detect_line_endings, but as a PDF file can have multiple updates by different programs (on different operating systems), the line endings can be mixed. To overcome this issue, I've written a wrapper for fgets() which is used as a fallback function if fgets() returns incorrect data. This wrapper function also enables the class to be used with PHP versions lower than 4.3, where auto_detect_line_endings was introduced. For the same reason, I also created wrapper functions for strspn() and strcspn(), so that FPDI should run with PHP 4.2+. During my testing (with hundreds of PDF files), I found several minor bugs in the parsing process. Some are fixed, and some are so rare that they can be ignored for now.
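To illustrate the line-ending problem, here is a minimal sketch of a reader that copes with mixed line endings. It is not FPDI's actual fgets() wrapper: the function name readLineMixed() is made up, and it works on a string buffer instead of a file handle to keep the example self-contained.

```php
<?php
// Hypothetical helper, not FPDI's wrapper: returns the next line from
// $data starting at $pos, accepting "\r\n", "\n" or a lone "\r" as the
// line ending. Returns false once the end of the buffer is reached.
function readLineMixed($data, &$pos)
{
    $len = strlen($data);
    if ($pos >= $len) {
        return false;
    }
    // length of the run that contains neither CR nor LF
    $span = strcspn($data, "\r\n", $pos);
    $line = substr($data, $pos, $span);
    $pos += $span;
    if ($pos < $len) {
        // treat "\r\n" as one line ending, anything else as a single byte
        if ($data[$pos] === "\r" && $pos + 1 < $len && $data[$pos + 1] === "\n") {
            $pos += 2;
        } else {
            $pos += 1;
        }
    }
    return $line;
}

// One buffer, three different line endings.
$data = "%PDF-1.4\r\n1 0 obj\r<< /Type /Catalog >>\nendobj";
$pos = 0;
$lines = [];
while (($line = readLineMixed($data, $pos)) !== false) {
    $lines[] = $line;
}
print_r($lines);
```

Scanning for either character on every read means the code never has to decide up front which convention the file uses, which is exactly what breaks down when several programs have appended updates to the same PDF.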

Let's Convert a Page to a Form XObject

First, we'll take a deeper look at a page object found in $this->pages of a parser object. A PDF object is represented internally as an array with a specified structure, as Marco defined in his article. For demonstration purposes, we use the demonstration PDF shipped with FPDI:

$pdf =& new fpdi();
$pdf->setSourceFile('classes/pdfdoc.pdf');
echo "<pre>";
print_r($pdf->current_parser->pages[0]);

You can see the output in Listing 2. At first glance it seems very odd, but everything makes sense! Every entry at any level is built as an array with at least the keys 0 and 1, where 0 describes the type of the value in key 1. All other keys are used to define special attributes of that value. The types are defined as constants in pdf_parser.php. For example, the 0 key at the outermost level is 9, which is defined as a PDF object. This object's value is a dictionary (5), in this case a page dictionary, with tokens that each have their own value types.

To import a page, FPDI offers a method called ImportPage(), which is close to the BeginTemplate() method of fpdf_tpl. As we've seen, a template entry in $this->tpls contains main entries like x, y, w, h and buffer. If we take a closer look at Listing 2, we can see a relationship to these entries. /MediaBox is an array (6) of exactly 4 entries, whose value types are numeric (1). FPDI takes the first entry's value as x, the second as y, the third as w and, not surprisingly, the last one as h. This is actually a bug in the current release of FPDI: the last 2 values are coordinates, too. The real width and height have to be calculated as the distance from the first to the third value and from the second to the fourth value. This bug has been overlooked for a long time, because it only manifests itself if the MediaBox's x- or y-value is something other than 0. It'll be fixed in the future!

To resolve the MediaBox's data, the extended parser for FPDI is shipped with a getPageBox() method. This method is needed because the MediaBox (or any other box) can also be a reference to another PDF object, or its value can be inherited from a parent node in the page tree. This method makes sure that the correct values will be resolved. Currently, FPDI supports only PDFs that contain a MediaBox, but there are other boxes in the PDF specification, e.g. a CropBox or a TrimBox. If your PDF uses other boxes instead of a MediaBox, the results of FPDI might not be as expected. Also, if another box is used, you can ignore the bug described in the paragraph above.

The next task is to fill the buffer of our template with the content stream of the imported page. There's one important difference between a PDF page and a form XObject: a page can have multiple content streams, while a form XObject can only have one. Because of this, we have to concatenate all content streams of a page into one single stream. To do this, there's a method called getPageContent() in the extended parser (fpdi_pdf_parser). Each of the resolved streams can be encoded with different filters. The most commonly used filter is FlateDecode, which can be decoded with the zlib functions if they are enabled in the PHP installation. I've also written 2 more decoders for the LZWDecode and ASCII85Decode filters. With these 3 filters, FPDI should handle nearly all documents which have encoded page content streams; so far there have been no bug reports related to a missing filter. The decoding of the content streams is done by the rebuildContentStream() method in the extended parser class.
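The corrected calculation described above is simple enough to sketch. The function below is only an illustration with a made-up name, boxToTemplateSize(); it assumes the box has already been resolved to four plain numbers (lower-left x and y, upper-right x and y) and is not part of FPDI.

```php
<?php
// Illustrative helper (hypothetical name): derive x, y, w and h from the
// two corner points of a page box instead of reading the last two values
// as width and height.
function boxToTemplateSize(array $box)
{
    list($llx, $lly, $urx, $ury) = $box;
    return [
        'x' => min($llx, $urx),
        'y' => min($lly, $ury),
        'w' => abs($urx - $llx),
        'h' => abs($ury - $lly),
    ];
}

// The MediaBox from Listing 2: origin at 0/0, so the shortcut happens to work.
$plain = boxToTemplateSize([0, 0, 595, 842]);

// The same page size with a shifted origin: the shortcut would report
// 605 x 862 here, while the distances still give 595 x 842.
$shifted = boxToTemplateSize([10, 20, 605, 862]);

print_r($plain);
print_r($shifted);
```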
After decoding all streams, they can simply be concatenated into a single one and assigned to the buffer key in the desired template array.

The next step is to resolve the resources which are used in the content streams we want to import. These can be relations to images, fonts or other form XObjects. The resources are normally defined as named relations in the page dictionary, or in a parent node in the page tree. To resolve them, the extended parser offers a _getPageResources() method, which returns the desired resource data of the page. The method will not resolve the resource's own data, but only information like its name and the objects it references in the original document. The real import of these resources into the new document will be done automatically in the extended _puttemplates() method. Because these resources have their own unique identifiers in their source document, FPDI has to assign new identifiers to the objects at runtime.

All of the data which is copied from the original document to the new document is written by the pdf_write_value() method, which accepts an array in the same structure that you see in Listing 2. If pdf_write_value() reaches an object reference (8), it'll assign a new unique id (if one does not exist yet), and push the original object identifier onto a stack. This stack is processed recursively in the _putOobjects() method. If _putOobjects() sends data to pdf_write_value() which also includes object references, the stack will be filled again. FPDI will not write duplicates of referenced objects: it "remembers" previously written objects of a specific file. FPDI will, however, follow every object reference it finds.

This behaviour is particularly important to keep in mind, even if you want to import only a single page of a very large file. As I've already stated, the PDF structure allows the creator to define a single, global resource dictionary, as FPDF does, where all resources used in the document are defined. FPDI will not recognize which of these resources are really in use on the imported page. Just think about the following example: we create a 100-page PDF with FPDF, where each page shows one unique image. Now, we want to import page number 40 into a new document with FPDI. Because FPDF uses such a global resource dictionary, FPDI will resolve that dictionary as the resource dictionary of the single page, and will copy all of the images into the new document, even though the page only shows one image! So, don't be surprised by the file size if you re-import pages of PDFs made by FPDF.

LISTING 3

<?php
define('FPDF_FONTPATH', 'classes/font/');
require_once('classes/fpdi.php');

$pdf =& new fpdi('L', 'pt');

// load the origin document
$pagecount = $pdf->setSourceFile('pdfs/article_110.pdf');
#$pagecount = $pdf->setSourceFile('pdfs/thumbnails.pdf');

$pdf->AddPage();
$x = $pdf->lMargin;
$y = $pdf->tMargin;

for ($i = 1; $i <= $pagecount; $i++) {
    // import page no. $i
    $tplidx = $pdf->ImportPage($i);
    // use the imported page
    $size = $pdf->useTemplate($tplidx, $x, $y, 250);
    // draw a border around the used page
    $pdf->Rect($x, $y, $size['w'], $size['h'], 'D');

    // if it's the third page in a row do a
    // pagebreak and reset the x- and y-values.
    if ($i % 3 == 0) {
        $pdf->AddPage();
        $x = $pdf->lMargin;
        $y = $pdf->tMargin;
        continue;
    }
    $x += 270;
    $y += 100;
}

$pdf->Output('thumbnails.pdf', 'D');
$pdf->closeParsers();
?>

LISTING 4

class pdf extends fpdi {
    [...]
    var $_logoIdx = null;
    [...]

    function Header() {
        [...]
        if ($this->_writingTime)
            return;

        if (is_null($this->_logoIdx)) {
            $this->setSourceFile('pdfs/php-a.pdf');
            $this->_logoIdx = $this->ImportPage(1);
        }

        if ($this->useTPLs) {
            [...]
        } else {
            [...]
        }
    }

    function writeBackground() {
        [...]
        $this->Rect($this->lMargin, $this->h-10,
                    $width, 3, 'F');

        $this->useTemplate($this->_logoIdx, 100, 5, 100);

        $this->SetDrawColor(0);
        [...]
    }

    [...]
}
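The renumbering and "follow every reference" behaviour described for pdf_write_value() can be pictured with a toy model. This is a sketch under strong simplifications and not FPDI's code: each source object is reduced to the list of object ids it references, and all names (copyReferencedObjects(), $map, $stack) are hypothetical.

```php
<?php
// Toy model, not FPDI's implementation: copy every object reachable from
// the start ids, handing out new ids on first sight.
function copyReferencedObjects(array $sourceObjects, array $startIds, $nextId)
{
    $map = [];      // old id => new id; also remembers what was seen
    $stack = [];    // old ids that still have to be written
    $written = [];  // new ids in the order they were written

    // Assign a new id the first time an object is referenced and queue it.
    $reference = function ($oldId) use (&$map, &$stack, &$nextId) {
        if (!isset($map[$oldId])) {
            $map[$oldId] = $nextId++;
            $stack[] = $oldId;
        }
        return $map[$oldId];
    };

    foreach ($startIds as $id) {
        $reference($id);
    }
    while ($stack) {
        $oldId = array_pop($stack);
        foreach ($sourceObjects[$oldId] as $refId) {
            $reference($refId);   // may push more work onto the stack
        }
        $written[] = $map[$oldId];
    }
    return ['map' => $map, 'written' => $written];
}

// Object 5 plays the role of a shared, global resource dictionary that
// references three images, although the page might show only one of them.
$source = [
    3  => [5],        // page -> resource dictionary
    5  => [7, 8, 9],  // resource dictionary -> images
    7  => [],
    8  => [],
    9  => [],
    12 => [],         // not reachable from the page, so never copied
];
$result = copyReferencedObjects($source, [3], 1);
print_r($result['map']);
```

Everything reachable from the page is copied, whether the content stream uses it or not, which is why a shared resource dictionary drags all of its images along. Objects that nothing references, like 12 here, are left behind.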

Using FPDI

Now we should know how FPDI and fpdf_tpl work internally. It's time to take a look at some examples. Listing 3 shows code which creates a thumbnail overview, similar to the one in Marco's original article. As you can see, the usage is very simple. The first step is to call setSourceFile() with the desired PDF file, which will return the page count of the document. Next, we simply use a for loop to import each page. The useTemplate() method conveniently returns the dimensions of the imported page, so we can use this data to draw a border around it. You can see the results in Figure 2. To demonstrate FPDI's flexibility, you can try to re-import this generated document by changing the filename to thumbnails.pdf, and then take a look at Figure 3.

I already suggested that FPDF normally cannot work with vector-based graphics, like a logo. But, as a PDF document can contain vector-based information, we can use FPDI to do the job. Let's go back to the first example of fpdf_tpl. I used a PNG image as the php|architect logo. If we zoom in, we'll see that the image gets a bit distorted (see Figure 4): it isn't a vector image, so it doesn't scale. To use an imported page in a template, it is necessary to import it before the call to beginTemplate(), as you can see in Listing 4. This results in a much better quality page, as you can see in Figure 5.

If you're currently reading a PDF issue of this magazine, you'll see that the document is personalized with your name and email. With FPDI and FPDF, you can get similar results. Just import a pre-existing page, and render personalized information on top of the imported data. In Listing 5, you'll find an example of how you can personalize a PDF with FPDI; the result can be viewed in Figure 6.

There's something you need to know about creating such personalized documents: you should always keep in mind that FPDI does not, and never will, manipulate an existing document, but will create a completely new one with its own structure. I should also mention that all dynamic content like links, PDF form elements, or any other annotation will get lost during the import process, because they are not part of the content stream of a page. So, this personalization will only work with simple PDF files. Another point to mention is the size of the original document. Because FPDI has to rebuild the whole document, it must decode every content stream and hold them all in memory. It will need a lot of computing power and memory for this task, which results in a long processing time for the script. The limits of a standard PHP installation can be reached much faster than you think!

If you take a closer look at the PDF version of php|a, you'll see that it is also protected with your personal password (the same as your account). PDF allows this, but it cannot be implemented with FPDI alone.
Some time ago, the protection extension for FPDF was written by Klemen Vodopivec, and I was involved as a beta tester and bug hunter, which was a long time before I thought about FPDI. Protection is an essential extension for FPDF, and I think it's the most commonly used one. It gives users and programmers a feeling of security. I've received several emails from users who wanted to mix both extensions to create protected PDFs with FPDF and FPDI, which in the end resulted in the FPDI_Protection extension, which you can also download from the FPDI project homepage.

FPDI_Protection's task is simple: it must encrypt the output made by FPDF's _putstream() and _textstring() methods, and also by FPDI's pdf_write_value() method. There is only one particularly tricky part that you must pay attention to: strings which are HEX-encoded instead of plain strings. These values have to be converted to plain text first, then encrypted, and then converted back to HEX values.

To use FPDI_Protection in our example, we simply have to extend our pdf class from FPDI_Protection instead of FPDI. Now we can use the SetProtection() method to add the protection/encryption features to our resulting PDFs.
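The decode, encrypt, re-encode round trip for HEX strings might look roughly like the sketch below. It is not taken from FPDI_Protection: the function name encryptHexString() is made up, and the cipher is passed in as a callable so the example stays self-contained. The XOR closure used here is only a stand-in for demonstration and offers no security at all.

```php
<?php
// Hypothetical helper: decode a PDF hex string such as <48656C6C6F>,
// run the bytes through $cipher, and write the result as a hex string again.
function encryptHexString($hex, callable $cipher)
{
    // strip the angle brackets and any whitespace inside the string
    $hex = preg_replace('/\s+/', '', trim($hex, '<>'));
    if (strlen($hex) % 2 === 1) {
        $hex .= '0';   // an odd final digit is read as if followed by 0
    }
    $plain = pack('H*', $hex);        // hex digits -> raw bytes
    $encrypted = $cipher($plain);
    return '<' . strtoupper(bin2hex($encrypted)) . '>';
}

// Stand-in cipher for the demo only: XOR every byte with a fixed value.
$xor = function ($s) {
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $out .= chr(ord($s[$i]) ^ 0x5A);
    }
    return $out;
};

$once  = encryptHexString('<48656C6C6F>', $xor);  // hex for "Hello"
$twice = encryptHexString($once, $xor);           // XOR twice restores it
echo $once, "\n", $twice, "\n";
```

Encrypting the hex digits themselves instead of the bytes they stand for would produce a string the viewer cannot decode, which is why the conversion to plain bytes has to come first.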

LISTING 5

<?php
define('FPDF_FONTPATH', 'classes/font/');
require_once('classes/fpdi.php');

class pdf extends fpdi {
    var $text = '';

    function SetText($text) {
        $this->text = $text;
    }

    function Footer() {
        static $w = null;

        $this->SetFont('Arial', 'B', 6);
        $this->SetFillColor(255, 0, 0);
        // the text never changes, so its width is measured only once
        if (is_null($w))
            $w = $this->GetStringWidth($this->text)
                 +$this->cMargin*2;
        $this->SetXY($this->w-$this->rMargin-$w, -3);
        $this->Cell($w, 2.8, $this->text, 0, 0, 'R', 1);
    }
}

$pdf =& new pdf('P', 'mm', array(215.9, 279.4));
$pdf->SetText('This document is personalized '
            . 'for php|arch readers.');
$pagecount = $pdf->setSourceFile('pdfs/article_110.pdf');

for ($i = 1; $i <= $pagecount; $i++) {
    $pdf->AddPage();
    $tplidx = $pdf->ImportPage($i);
    $pdf->useTemplate($tplidx);
}

$pdf->Output();
?>

Future and Dreams

I've already mentioned some problems and bugs in FPDI, but have you ever found software without bugs? Probably not... I have some plans for the coming releases, which are not mere bug fixes, but also improvements. At the top of my list is the handling of PDFs that contain boxes other than the aforementioned MediaBox. This missing feature is sadly FPDI's most reported problem. If you've run into the same problem, you can work around it by simply reprinting the PDF through the Adobe PDF printer, which is shipped with Adobe Acrobat, or (maybe) some other PDF printer; I haven't tested the others.

Another missing feature that I have not yet mentioned in this article is the handling of rotated pages. A PDF page can be defined as rotated, whereas the coordinate system isn't. FPDI does not currently care about the rotation, and will import such a page as it is: rotated. This means that it will be shown rotated in the resulting document, whereas it is displayed correctly in the original document. For now, you can use the rotation script at to correct this behaviour, but FPDI will automatically fix this for you in the next release. Another problem that I already described was the copying of unused resources. Maybe, in the future, FPDI will remove the unneeded resources automatically, too.

As you can see, there are several things on my to-do list, but I want to take the opportunity to write a little about the question I have been asked most often since releasing FPDI: "Can I replace placeholders in an existing PDF with new text with FPDI?" No, you can't. Not with FPDI, nor with any other program, unless the original documents have been prepared for it. A PDF cannot be compared with a file in a structural language like HTML, even though a PDF can be a simple text file without any binary data. There is a way that will work with very plain PDF files, but it cannot be generalized. Such files have to meet several requirements: the content stream of every object that outputs a text string has to be decoded; the text string has to be plain text (not encoded); and the font that is used has to be either a) one of the 14 standard fonts, or b) completely embedded in the original document.

Now, these requirements aren't too strict, but a PDF can be created in various ways, and you usually don't have much of a say in how a particular PDF is built. For example, the text string can be split into various small pieces, because the program that created the PDF used kerning pairs for layout purposes. These individual pieces, or even the whole text string, can be written as HEX-encoded strings. Generally, only a subset of the font is embedded (only the characters that are actually used in the document are included). In this case, even the full version of Acrobat itself cannot change text strings in the document. The only program I know of that will produce suitable PDFs is FPDF, but it would not make sense to build your templates in FPDF and replace strings in them afterwards. This idea is a dream, and it looks like it will remain so forever. Don't waste your time on finding a solution for this. If it was technically possible, someone would have already implemented the solution.
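A small experiment shows why a naive search-and-replace breaks down as soon as kerning comes into play. Both content-stream snippets below are made up for illustration (they are not taken from a real file), and replacePlaceholder() is a hypothetical helper, not an FPDI function.

```php
<?php
// The same visible text, "Dear {NAME},", written in two different ways.
// $simple uses one plain string; $kerned uses an array of string pieces
// with position adjustments between them, as a kerning layout would.
$simple = "BT /F1 12 Tf (Dear {NAME},) Tj ET";
$kerned = "BT /F1 12 Tf [(Dear {N) 20 (AM) -15 (E},)] TJ ET";

// Hypothetical helper: naive replacement on a decoded content stream.
// A real value would also need its parentheses and backslashes escaped.
function replacePlaceholder($stream, $placeholder, $value)
{
    $result = str_replace($placeholder, $value, $stream, $count);
    return ['stream' => $result, 'replaced' => $count];
}

$a = replacePlaceholder($simple, '{NAME}', 'Jan');
$b = replacePlaceholder($kerned, '{NAME}', 'Jan');

echo $a['replaced'], " replacement(s): ", $a['stream'], "\n";
echo $b['replaced'], " replacement(s): ", $b['stream'], "\n";
```

In the second stream the placeholder never appears as one contiguous run of bytes, so there is nothing for str_replace() to find. HEX-encoded strings and subset fonts add further obstacles on top of this.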

JAN SLABON is the author of FPDI and lives in Helmstedt, Germany. He focuses mainly on the development of custom PHP solutions for end customers and for web development companies all over the world. You can contact him at




Internationalize Your Web Application with Less PHP Code

If you are looking to internationalize a web application, then you should try this simple technique, which uses less PHP code and consists mainly of easy-to-maintain HTML.

by Carl McDade


Making a web application support multiple languages can be a large job. It is a job that many do not like, and one that a lot of open source projects have avoided until now. It seems everyone is jumping on the multi-lingual train and using all sorts of PHP gadgetry to make it happen. Check a few open source projects' to-do lists and you will likely find something to do with internationalization listed. In this article, I will show you one of the easier methods of internationalizing your code, using very little PHP and ordinary HTML files. This method is fast, easy to maintain and as cross-platform as you can get. Before we get started on that, though, we need to go over some points that will make it easier to understand why globalization is necessary.

Globalization explained

Globalization, abbreviated by the little-used g11n, is the application of business practices and processes to take a business or a software product to a global market. If you want to know why globalization is important, you only have to take a look at the following statistics. As you can see in Figure 1, the internet is outgrowing its American roots, and the default language is not necessarily English. Language is only part of the picture. You also have to take into account that the countries that make up a large percentage of internet users do not all use the same currency, and possibly not the same date format. If your software is going to grow with the growing internet market, then globalization is the key to its success. Now that you are excited about the prospect of people all over the world using your program, let's take a look at the steps involved in making it useful on a global scale.

PHP: 4.3+
OTHER SOFTWARE: Macromedia Dreamweaver MX 2004
CODE DIRECTORY: i18n

Internationalization Explained

Internationalization is abbreviated i18n because there are 18 letters between the first "i" and the last "n" in the word. Internationalization is the process of designing software or a web application to handle different linguistic and cultural conventions without rewriting the codebase. Internationalization is only important if you are going to be distributing your software or web application. If you are not doing so, or you are borrowing code from somewhere, then localization might be more important.

There are several reasons why the i18n process should be done at the beginning of the development cycle. Doing so significantly decreases the amount of necessary code, and it removes the need to extend the product or make compromises later on in development. In many cases, a little forethought will ensure that the developer does not have to rewrite all of the code. Instead, he will simply need to write a few files to make the existing software adaptable to a different market.

When there is less code to write, fewer programmers will be needed to work on internationalization. Good internationalization support means that your programming resources can be used to improve the software in other areas. The size of the end user market increases, and the software becomes more popular globally because it is usable by a more diverse customer base. Using simple text, an end user can easily localize the product to a specific region.

Localization Explained

Localization (also known as L10n) is the process of adapting your software to the requirements of a target locale. A locale is another word for the countries and languages of a particular region. In software development, a locale is mostly used in its abbreviated form. Examples of abbreviations used in software are en_US, which stands for "United States English", and en_GB, which stands for "British English". Making sure the locale can be easily changed is the most important part of internationalizing software. When you build or change an application so that it can be localized to multiple languages and countries, this process is called internationalization. Remember, a web application can be localized without being internationalized. You just have to translate all of the interfaces and content into the language of choice.

There are two phases to the localization process of a web application. The first phase is the translation of the user interface, the part that controls the events and presentation of the resources. The second phase is the translation of the text, media files or documents: the so-called content being delivered by the presentation layer of the application. I will be talking mostly about the first phase of the process, and will touch on the second when necessary. Internationalization of a program includes a few tasks that should be planned out ahead of time. If careful attention is paid to these items at the beginning of development, then there is less to debug later on.

Encodings & Code Pages

When building web-enabled applications, you need to encode the page using either UTF-8 or UTF-16, and send it with the appropriate HTTP headers. It is very important to have some test content on hand, and to test the HTML page in the web browsers of choice, to make sure that they react to the headers and encoding. The localized text should appear properly with very little (or no) user configuration. The single most important element in internationalizing a web application is the page content-type:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Plural Text

The plural format of text is the nemesis of a software developer. Plurals, added to the gender characteristics and social hierarchy of a language, all add up to a real challenge. The best thing to do is to minimize the usage of text and to design flows so that the same phrase or text can be used multiple times.

• There are 0 Comments
• There is 1 Comment
• There are 3 posts







FIGURE 1: Internet usage and world population statistics

REGION                      POPULATION   % OF WORLD  INTERNET USERS  % OF POPULATION  USAGE % OF WORLD  USAGE GROWTH 2000-2005
AFRICA                     896,721,874     14.0 %       23,917,500        2.7 %            2.5 %             429.8 %
ASIA                     3,622,994,130     56.4 %      332,590,713        9.2 %           34.2 %             191.0 %
EUROPE                     804,574,696     12.5 %      285,408,118       35.5 %           29.3 %             171.6 %
MIDDLE EAST                187,258,006      2.9 %       16,163,500        8.6 %            1.7 %             392.1 %
NORTH AMERICA              328,387,059      5.1 %      224,103,811       68.2 %           23.0 %             107.3 %
LATIN AMERICA/CARIBBEAN    546,723,509      8.5 %       72,953,597       13.3 %            7.5 %             303.8 %
OCEANIA / AUSTRALIA         33,443,448      0.5 %       17,690,762       52.9 %            1.8 %             132.2 %
WORLD TOTAL              6,420,102,722    100.0 %      972,828,001       15.2 %          100.0 %             169.5 %

NOTES: (1) Internet Usage and World Population Statistics were updated on November 21, 2005

Plurals like these can be avoided by using different wording, which makes localization simpler by removing grammatical differences.

• New Comments 0
• Comments to date 1
• Number of posts 3


Usually, this is where a coder has to show some talent for business logic, or get help from a group. The internationalization of dates is an area where software companies start guarding their secrets. The date format problem can be compounded by the location of the client using the software and the location of the webserver that the software is being run on.
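As a rough illustration of the client/server location problem, the sketch below stores a moment in UTC and renders it in the client's time zone with a per-locale pattern. It assumes PHP 5.2 or later for the DateTime and DateTimeZone classes; the $dateFormats table and the formatForClient() helper are hypothetical, and the patterns are only examples of what a translator might supply.

```php
<?php
// Hypothetical per-locale patterns, written in date() syntax.
$dateFormats = array(
    'en_US' => 'm/d/Y H:i',
    'sv_SE' => 'Y-m-d H:i',
);

// Hypothetical helper: converts a UTC timestamp string to the client's
// time zone and formats it with the pattern registered for the locale.
function formatForClient($utcString, $locale, $clientZone, $formats)
{
    // Fall back to an unambiguous pattern for unknown locales.
    $pattern = isset($formats[$locale]) ? $formats[$locale] : 'Y-m-d H:i';

    $moment = new DateTime($utcString, new DateTimeZone('UTC'));
    $moment->setTimezone(new DateTimeZone($clientZone));

    return $moment->format($pattern);
}

// The same stored moment, shown to a Swedish and an American client.
echo formatForClient('2006-01-15 23:30:00', 'sv_SE', 'Europe/Stockholm', $dateFormats) . "\n";
echo formatForClient('2006-01-15 23:30:00', 'en_US', 'America/New_York', $dateFormats) . "\n";
```

Storing every timestamp in UTC keeps the web server's own location out of the calculation; only the client's zone and locale decide what is displayed.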

Database Encoding

Database encoding and Unicode support are musts. Without them, a coder can never tell when the database is going to store or return incorrectly encoded text, seemingly at random. The fact that MySQL, the most popular web database, now supports Unicode will make things much easier. You now only have to make sure that Unicode support is enabled and ready to go.


If you build search functionality into your application, then how the data is stored is critical, since all searches will likely be based on SQL statements influenced by the language and calendar system being used. The sorting and ordering of database information must also be internationalized; otherwise the search data returned may be invalid or irrelevant. Do not forget to code your PHP to allow for Unicode strings. It does not do any good to go through all the trouble of preparing a Unicode-enabled database and flexible SQL statements when the PHP code cannot insert or retrieve resources in Unicode format.

The PHP Language

It is important to remember that PHP, unlike Java, does not yet have native multi-byte character (or more simply put: Unicode) support. In PHP, a character is the same as a byte, so there are exactly 256 different possible characters. Since a string is a series of characters, this means there is a limitation on how a string is interpreted. As long as the string contains a combination of the 256 characters allowed, then things are okay. But the internet is a very large place, and 256 characters are not enough to cover all of its languages. Japanese, where the number of characters is in the thousands, is a good example.

There is, however, a way to encode and decode strings to and from UTF-8, a Unicode encoding that allows a much larger set of characters by storing a character in more than one byte. The PHP utf8_encode() and utf8_decode() functions convert strings between ISO-8859-1 and UTF-8. There are also a number of other conversion routines for working with multibyte characters. When using routines like utf8_encode() on their own, however, the manipulation of strings cannot be trusted to the default single-byte string handlers in PHP. This is where the mbstring extension comes into play. mbstring contains functions that are sensitive to multibyte encodings and allow splitting, splicing, searching and other kinds of string handling.

As of this writing the mbstring extension is not enabled in a default installation of PHP. This means that developers and end users who want to run software that requires the mb_* functions should check their PHP configuration. There are still many shared hosting companies and server administrators that are unaware of the importance of the mbstring extension.
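The difference between the single-byte handlers and mbstring can be seen in a short sketch. It assumes the text is UTF-8 and checks for the extension with function_exists() before calling mb_strlen() and mb_substr(); the variable names are only illustrative.

```php
<?php
// "åäö" written as explicit UTF-8 bytes, so the example does not depend
// on the encoding this source file happens to be saved in.
$swedish = "\xC3\xA5\xC3\xA4\xC3\xB6";

// strlen() counts bytes, so each two-byte character is counted twice.
$byteCount = strlen($swedish);

// substr() cuts on byte boundaries and returns half of the first character.
$brokenFirst = substr($swedish, 0, 1);

// The mb_* functions count and cut by character, but they only exist
// when the mbstring extension is loaded.
if (function_exists('mb_strlen')) {
    $charCount = mb_strlen($swedish, 'UTF-8');
    $firstChar = mb_substr($swedish, 0, 1, 'UTF-8');
} else {
    $charCount = null;
    $firstChar = null;
}

printf(
    "bytes: %d, characters: %s\n",
    $byteCount,
    $charCount === null ? 'unknown (mbstring not loaded)' : $charCount
);
```

The function_exists() check doubles as a quick way to find out whether a given host has mbstring enabled.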

Using Open Source to Get a Jumpstart

If you are not creating a new PHP application from scratch, using an open source application may take care of most of the internationalization steps involved in building a website. The popular content management systems all use one of the three listed techniques for internationalization. Though using a content management system’s i18n support may be transparent, knowing the underlying techniques used by it can be a deciding factor in choosing a pre-made application as a base for your own projects. Knowledge of what is used in a CMS to internationalize will also influence your choice of shared hosting or what should be installed on your own server to support the software.

Internationalization Techniques

There are very few techniques used in internationalizing a PHP web application. Listed here are the three most popular:

• Text definition files written in PHP, using constants
• Using PHP gettext to extract and do string substitution
• Using a database to store and retrieve translated text

The above techniques all have their place and are useful. They also have many things in common. They are not simple enough for lazy web developers, like me. The storage method for the localized text or resources is not always readily accessible. How the resources are stored determines if they are difficult to read and manipulate. Two of the techniques in the list do not allow for easy visual formatting of the HTML code within resources while they are being translated. Being able to see the visual formatting is important, as it influences the words and choices made when translating text for a web application. Frequently, when doing a translation, it is necessary to see the wording in context with a list, line break, paragraph or the direction the text is read.

Let’s take a look at each one of these techniques so that we get a baseline for comparison to the new technique I will be showing later in this article. I will also use some of the more popular open source software as reference examples of the techniques. The reason that I go through these alternate methods is that I feel you have to be familiar with the other, more difficult techniques in order to see how easy it can be.

Text Definition Files

This is my personal favorite because of its simplicity and the fact that it works in the widest range of server environments. There is usually no need to do any pre-investigation of the server or shared host before installing software using this method. This is probably the most popular technique used. The reason for this is the reliability and ease of implementation. Distribution and sharing of both the original text and the translated resources is easy and fast. Some of the more popular open source content management systems that use this technique are Xoops, Joomla and PHPnuke.

Disadvantages of Text Definition Files

Duplication of defined variables can easily occur, and these files can be hard to read at times. Like gettext(), this technique does not allow for easy formatting of HTML code. Using a visual editor to edit, copy and paste helps with this, but there is still room for improvement, as I will explain later.

This technique also exposes the translator to the PHP source code and the temptation to “fix” things as they translate. If there is a duplicate constant, do they delete it or change the name? It might seem a minor thing, but what if the constant contains an entire page of help text that suddenly does not show?

When the application is updated and additional strings are added, there is no built-in way to determine which new strings were added and if they are present in every language. What happens if a newly added string is not yet translated into a specific language? You have to write a script that checks for the instance and location of a variable.

Text definition files suffer from a lack of readability if not formatted properly. Formatting is critical as there are no readers or other tools to help with the maintenance of the files. The use of double or single quotes becomes a factor. Choosing one or the other means that some of the text will have to be escaped to prevent PHP parsing errors. So, while this method is very simple in itself, it does require a bit more code to implement properly. Typically, a file will contain text as shown here.

define('_ERROR_1', 'You cannot use single quotes (\' \') '
    . 'in the text you are sending.');
define("_ERROR_2", "You cannot use double quotes (\" \") "
    . "in the text you are sending.");
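The kind of checking script mentioned above, one that finds strings missing from a translation, might look like the following sketch. It assumes the strings of each language have been loaded into an array keyed by name; the helper findMissingKeys() and the sample keys are hypothetical. Files built on define() would need a different approach, since the constants of two languages cannot be loaded side by side.

```php
<?php
// Hypothetical helper: lists the keys that exist in the reference language
// but are absent from, or empty in, a translation.
function findMissingKeys($reference, $translation)
{
    $missing = array();
    foreach ($reference as $key => $text) {
        if (!isset($translation[$key]) || $translation[$key] === '') {
            $missing[] = $key;
        }
    }
    return $missing;
}

// Sample data: the Swedish file lacks one string and has one left empty.
$english = array(
    '_ERROR_1' => 'Invalid text.',
    '_HELP_1'  => 'Help page.',
    '_SEND'    => 'Send',
);
$swedish = array(
    '_ERROR_1' => 'Ogiltig text.',
    '_SEND'    => '',
);

print_r(findMissingKeys($english, $swedish));
```

Running such a check after every release gives the translator a short to-do list instead of a full file to reread.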

Choices about the type of variable to be used need to be made when writing a definition file. The PHP define() function has the advantage of being slightly more readable, while array elements have the plus of performance and the ability to use the array index to create groups, which increases the amount of text that is reusable over the entire application.

$language['the_index'] = 'This is some translatable text';

A bit of advice: leave grammatical logic to the translator. Creating or finding a localization scheme that properly covers plurals is a difficult task, and many times the coder comes to a point where they will try to use PHP to create some translation logic. Plurals can turn an elegant and simple solution into a coding nightmare. This usually happens when the coder decides to introduce grammar and plurals to the application to make it “easier” to translate. Take a look at the following code.

<?php
$messages = array(
    'en_US' => array(
        'I am X years and Y months old.' => 'I am %d years and %d months old.'
    ),
    'es_US' => array(
        'I am X years and Y months old.' => 'Tengo %2$d meses y %1$d años.'
    )
);
?>

This was a simple array of strings before the coder decided to allow for word plurals and grammar. By doing this, the translator is forced to know PHP. The legibility of the text and the context become lost in the code. When doing this, the coder may also introduce errors into the text. The coder should save their energy for the internationalization of business logic and date formats, and try to keep program logic separated from language-specific terms. Text definition files are not really meant to deal with complicated language structure. In situations like this, the better option is to allow for variances in text by using multi-dimensional arrays to group plurals:

$language['the_index'][0] = 'This is some translatable text';
$language['the_index'][1] = 'This is some translatable text '
    . 'in the standard plural form';
$language['the_index'][2] = 'This is some translatable text '
    . 'in the gender specific plural form';
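One way to read such a grouped array is sketched below, assuming that index 0 always holds the base form. The helper lookupText() and the sample strings are hypothetical; the point is only that a missing variant falls back to the base text rather than producing an empty page element.

```php
<?php
$language = array();
$language['comments'][0] = 'Comment';
$language['comments'][1] = 'Comments';

// Hypothetical helper: returns the requested variant of a text, falling
// back to the base form (index 0) and finally to the index name itself.
function lookupText($language, $index, $variant = 0)
{
    if (isset($language[$index][$variant])) {
        return $language[$index][$variant];
    }
    if (isset($language[$index][0])) {
        return $language[$index][0];
    }
    return $index;
}

// The application only decides which variant it wants; the wording of
// each variant stays entirely in the translator's hands.
$count   = 3;
$variant = ($count == 1) ? 0 : 1;
echo lookupText($language, 'comments', $variant) . ': ' . $count . "\n";
```

The fallback to the index name makes untranslated strings visible on the page during testing.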



Directory Structure

The directory structure for this type of system does not have to be elaborate, but it should have some standard and memorable path mapping to make coding and troubleshooting easier. A slightly modified version of the typical gettext() hierarchy works nicely. Whatever the choice, it should include separate subdirectories for each language. The reason for this is that I have found that, frequently, a specialty file or extension may be needed in the localization of a web application. I also recommend that the directory and file names be similar or follow some type of naming scheme that eases the dynamic writing of paths and SQL statements.

/languages
    /en_En
        en_En.php
    /sv_SV
        sv_Sv.php
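A naming scheme pays off when paths are written dynamically. The sketch below assumes that the directory and the file share exactly the same language code; the helper languageFilePath() is hypothetical, and the pattern check is there so that a code taken from user input cannot point outside the language directory.

```php
<?php
// Hypothetical helper: maps a language code such as "sv_SV" to
// <base>/sv_SV/sv_SV.php, or returns false for anything that does not
// look like a plain language code.
function languageFilePath($baseDir, $code)
{
    // Accept only forms like "en" or "en_US"; this blocks "../" tricks.
    if (!preg_match('/^[a-z]{2}(_[A-Za-z]{2})?$/', $code)) {
        return false;
    }
    return rtrim($baseDir, '/') . '/' . $code . '/' . $code . '.php';
}

$path = languageFilePath('/languages', 'sv_SV');
if ($path !== false && is_file($path)) {
    include $path;   // load the definition file for the chosen language
}
```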

Setting up definition files

Below are examples of typical definition files. As you can see, creating one of these leaves a lot of room for error on the part of the coder. This particular code does something which I consider to be an internationalization mistake: the authors have used placeholders in the strings. This is not a developer- or translator-friendly mechanism, because it hard-codes the context and removes any possibility of reusing the phrase. It also makes it necessary to hunt down the string that will be used in the placeholder. When creating translations, a non-coder may be forced to remove or adjust what is considered to be PHP code. As mentioned earlier, text should be as generic and simple as possible to make this type of thing unnecessary. Doing this is a form of string concatenation, something that should be avoided when globalizing software.

// %s is your site name
define('_US_NEWPWDREQ', 'New Password Request at %s');
define('_US_YOURACCOUNT', 'Your account at %s');
define('_US_MAILPWDNG', 'mail_password: could not update '
    . 'user entry. Contact the Administrator');

Some other PHP software uses this format. Take note of the use of numbered indexing, which makes matching the strings to their location in the program easier.

$txt[342] = 'Una palabra por línea';
$txt[343] = 'Coincidir todas las palabras';
$txt[344] = 'Coincidir con cualquier palabra';
$txt[345] = 'Coincidir como frase';
$txt[346] = 'Buscar -Todo- Sólo miembros';

Advantages of Text Definition Files

Defining variables to hold text strings is the simplest and most developer-friendly method of internationalizing a web application. It requires no special tools for creation and maintenance. The technique does not demand a great amount of server resources, such as hard drive space or memory.

PHP gettext

The PHP gettext() method of localizing a web application is a blessing for those who have finished a web application and want to internationalize it afterwards. Many open source PHP applications like Drupal and Gallery2 rely on the gettext extension.

Disadvantages of gettext

There are several problems with the use of this function, though:

• gettext() isn’t thread-safe, so it is not advisable in a multi-threaded environment
• gettext() relies on setlocale(), which depends on which languages are installed on the system, and in this case UTF-8 is a very tricky setting to use

I personally dislike gettext() because once you change the default language template you have to review and re-compile all the secondary languages. It is very difficult to design and program around gettext because of this factor. The addition or modification of PHP code pages that contain text which needs localization requires going through a multi-step process over and over again. This redundant process can lead to mistakes, which can waste even more time.

In open source web applications, where things are being changed due to security, bug fixes or regular version upgrading, you run the risk of losing your translation in part or entirely. There just may be no translation files for the code that you are using, which may force you into learning about the systems involved and trying to find a translator on short notice.

Finally, it is difficult, if not impossible, to reuse translated text when using gettext. The text extraction process works on a per-file and per-occurrence basis. So, when creating a translation, you may find yourself writing several instances of the same text, or writing a similar translation with only minor differences for many files. This is costly if you are paying for a translation. “Time is money,” as they say.

In a large application, where the text is stored in a PO file and there are similar occurrences of the same text, it is difficult to find the text string for just that element on the page that you are looking to change. Message IDs are no indicator of the location of the string being swapped via gettext().

PO files themselves are strange things that require some programming knowledge and careful usage. Although they can be altered manually in a text editor, using a program like POEdit is the preferred method. This is a limitation for many, because POEdit is not a cross-platform program. POEdit has no Macintosh version, which leaves those users out. This is saddening, since many Mac users are writers or work in the news media. They are the ones most likely to need, or to provide, translation services.

Computer assisted translation (CAT) is also very difficult to set up and use with PO files. The CAT programs that do this well are very expensive. These shortcomings are probably the reason that Word files are the standard file format for translators. After the translation texts are completed, a PO file must be compiled into an MO file for use by PHP.

Directory Structure

gettext requires that the resource files have a specific structure and that the information about this structure be set in the PHP code.

/locale
    /en
        /LC_MESSAGES
            messages.po

Multiple languages are set up in an identical hierarchy.

/locale
    /en
        /LC_MESSAGES
            messages.po
    /sv_SV
        /LC_MESSAGES
            messages.po

Setting a Locale (and Other Requirements)

Setting a locale is a requirement for gettext(). These are the main instructions that PHP needs if it is to find the resources for translated text.

<?php
// I18N support information here
$language = 'en';
putenv("LANG=$language");
setlocale(LC_ALL, $language);

// Set the text domain as 'messages'
$domain = 'messages';
bindtextdomain($domain, "/www/htdocs/");
textdomain($domain);

echo gettext("A string to be translated would go here");
?>

Designating and Extracting Strings

The PHP code needs to be set up to accommodate the extraction of strings, so that PHP can find the strings that are to be translated. This is done by using the gettext() function on strings:

<?php $var = gettext('This is a translatable string'); ?>

The text string in the above code can be extracted and set into a PO file using a command line tool that will hunt for instances of gettext() and set the strings into indexed messages for each occurrence:

$ xgettext -n *.php

After extraction, the po file to be translated should look like Listing 1.

Creating the MO Files

In any case, either you or the volunteers will translate the PO file, and then you will need to convert the file into a binary file that gettext actually understands. For that, you would use the following command:

$ msgfmt messages.po

The line above will create a file (messages.mo by default), which you should save in the appropriate directory:

locale/<LANG_CODE>/LC_MESSAGES/

Plurals and ngettext()

Plural form is the toughest part of text translation, especially if you have lots of text where plurals are needed. In this case, you will need ngettext() and not the simpler gettext().

<?php
$n = 3;
printf(ngettext("%d comment", "%d comments", $n), $n);
?>
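Because the gettext extension is not always installed, code that calls ngettext() directly can fail on some hosts. The wrapper below is a sketch of one way around that, not part of gettext itself: pluralText() is a hypothetical name, and the fallback applies the simple two-form rule used by English, which will be wrong for languages with more plural forms.

```php
<?php
// Hypothetical wrapper: uses ngettext() when the gettext extension is
// loaded, otherwise falls back to the two-form singular/plural rule.
function pluralText($singular, $plural, $n)
{
    if (function_exists('ngettext')) {
        return ngettext($singular, $plural, $n);
    }
    return ($n == 1) ? $singular : $plural;
}

$n = 3;
printf(pluralText("%d comment", "%d comments", $n), $n);
echo "\n";
```

With a translation catalog bound, the wrapper returns whatever plural form the catalog defines for the active locale; without one, it returns the original text.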

Advantages of gettext

The gettext method of internationalization is not as popular as the other two methods. The reason for this is that it places a heavy burden on the developers and the end users. In most OSS projects, the developers are responsible for providing the original translation files. After this is done via extraction scripts, the files need to be translated once again and possibly merged with previous translation versions by the translator. The translator can be the end user, a volunteer, or even another development team member. The bottom line is that gettext requires a lot of resources to maintain and support. In a large project with lots of volunteers, or a medium-sized company, this is not really a hindrance. But for the lone developer or small group, the burden is large. Nor does gettext spare the developer the job of hunting down text strings and wrapping them in the gettext() function, just as you would have to do if definition files were used.

The best thing about the gettext method of internationalization is that the developer does not have to think up unique names for variables. In a large application this can be a tremendous advantage over other techniques.

Database Storage

At first look, working with the database method of storing translated text seems like a joy. I admit I had fun using the Mambelfish component for the Mambo CMS when doing a translation of a website. A database gives what the other techniques seem to lack: order. Relational database systems were built to give power to how information is related, and use these relationships to organize the information into an easily accessible source.

Disadvantages of Using a Database

When internationalizing a web application, distribution of the resources to be translated is very important. Getting the work to the translators is necessary, and there must be a system in place for getting the finished translations to the end users of the product. So far, I have not found one commercial or open source product that offers localization resources in the form of SQL scripts or native database files. As a result, translations are done repeatedly by each end user of that product. Frequently, internationalization using a database is mixed with the other techniques to make up for this shortcoming.

I first came across this problem when I found that I wanted to reuse my translation for several different website installations, or borrow one for a language I did not know. Even though the exportation and importation of database tables was not difficult, I found the need to maintain an archive of translations because I am the lazy type of coder. I don’t like doing repetitive tasks that are not part of innovation. Even if I were not so lazy, there are no repositories of MySQL translation tables for Mambelfish, which is used in Joomla, or for any other open source CMS project. Asking for exports from someone else’s database on the Mambo and Joomla CMS forums proved to be less than successful. If a repository for database tables did exist, there is also the problem of not being able to browse the translation beforehand to check its quality.

There is always a bit of uncertainty associated with storing information in a database, which is why backups are so important. When you start moving information from one database or database server to another, things can rapidly start to fail or acquire bugs. In my experience, you just never know if the encoding is going to be correct after the move. Even when the server configurations are identical, there may be some things that just do not work. You just never know until you determine which part of the chain is responsible for an incorrect encoding bug. Was it the PHP code, the HTML, or the data source? You are just very happy if everything works.

I use a lot of web hosting located in the United States, but frequently my clients are in Sweden or another European country. There have been times where the web host has not installed a UTF-8 character set. The Swedish alphabet only has three characters more than its English counterpart, so fixing any problems was easy. But I do not envy any web developer who has to solve this with any of the Cyrillic alphabet languages. This technique of using a database as a resource for translation strings works well when it works.

Using computer assisted translation tools is obviously difficult, if not impossible, with the database method. You are reduced to using cut and paste operations within a web-based interface or a database front end program like MySQL Administrator or Microsoft Enterprise Manager. Caution must be taken when doing this, as inputting text this way may work fine and produce the right results at first glance, but when the actual web application is used to retrieve the text the encoding may be different from what you expect.

MySQL 4.1

MySQL 3.x and MySQL 4.0.x do not have Unicode support. The default character encoding is called latin1 and is single-byte. This may not seem like much of a problem at first glance because, while the database itself is not aware of the actual encoding, a varchar field still manages to output the strings in much the same way that they were put into the database. But in some cases, you may see incorrect characters when directly accessing the database with code that does not take this into account. Searching or ordering will sometimes not work correctly. These inconsistencies are due to the fact that even though two, three or four bytes should actually represent one character, MySQL interprets them as one character per byte. I have personally had experiences with the Swedish characters äåö being stored as varchar but being seen differently by different versions of phpMyAdmin, the PHP database administration tool, when exploring a database with these characters.

Many people wondered why I got so excited that MySQL was finally going to support Unicode with version 4.1. This is because with Unicode support (UTF-8), a more elegant internationalization plan can be implemented. Different character sets can also be set per column, table or database, which means data from many languages can be stored without using elaborate coding routines to encode and decode strings. It also means that ordering, searching, indexing and similar string-related functions in MySQL work correctly.
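To make the per-column and per-table settings concrete, the sketch below builds the SQL statements as PHP strings. The table and column names are hypothetical; the character set and collation names (latin1, utf8, utf8_swedish_ci) are ones that ship with MySQL 4.1. The statements are only printed here and would be sent with whatever database API the application already uses.

```php
<?php
// Hypothetical table: one column kept in latin1, one in UTF-8 with a
// Swedish collation so that sorting follows the Swedish alphabet, and
// UTF-8 as the default for any column that does not say otherwise.
$createTable = "CREATE TABLE translations (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    text_en VARCHAR(255) CHARACTER SET latin1,
    text_sv VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_swedish_ci
) DEFAULT CHARACTER SET utf8";

// Tell the server which encoding the client connection sends and expects.
$setNames = "SET NAMES 'utf8'";

foreach (array($setNames, $createTable) as $sql) {
    echo $sql . ";\n";
}
```

Issuing SET NAMES right after connecting keeps PHP and the server in agreement about the encoding of every string that crosses the connection.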

Advantages of Using a Database

The greatest advantage of using a database to store the resources for localization is the convenience. An interface can be built to group the translation tasks into a single area. You don’t have to dig into the file system to find the proper resource file that holds the text strings that you want to translate. If done right, usually within a few clicks you are presented with a user interface and only have to make a few simple choices before entering the translation. This method also allows for making small changes quickly. The translator is kept totally separate from the underlying PHP code.

Though database resource storage is suitable to content translation for the most part, there are situations where it shines when used on the user interface. Dynamic menu systems are a good example of where this technique is a must. In the Joomla content management system, database translation tables are used in coordination with text definition files to make localization easier. The database tables feed the more dynamic presentation layer, while the definition files deal with the administration areas, which are not changed often (or at all).

Searching resources stored in a database results in much more relevant information being returned, because more relevant information can be stored. Dates, titles, categories and strings can be searched in the localized language. This is very hard to do when translations are stored in other formats.

LISTING 1

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR Free Software Foundation, Inc.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2002-04-06 21:44-0500\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: gettext_example.php:12
msgid "A string to be translated would go here"
msgstr ""

LISTING 2

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Untitled Document</title>
</head>

<body>
<p>my own</p>

<!--my own-->
<p dir="LTR">Najib said "السلام عليكم"
(as-salaam alaykum] to me.</p>This is the
help text for my own idea module possible
to see the line ends in Dreamweaver because
of the syntax editor.
<!--end-->

<p>my own button</p>

<!--my own button-->
send
<!--end-->

<p>my own idea text</p>

<!--my own idea text-->
1This is some other text in the module this to
check for paragraph and line breaks***
<p>en styke till</p>
<ul>
<li>add some more HTML here l&auml;gga till mer
HTML text h&auml;r </li>
<li>this works nicely a&nbsp;in both design and
code mode of Dreamweaver.</li>
<li>var bra med &ouml;&auml;&aring; ocks&aring;</li>
<li>possibly want to have the headers in so that
unicode can be used in the editors. They are easy
enough to remove</li>
</ul>
<!--end-->

<p>my own translation scheme</p>

<!--my own translation scheme-->
<p>This is the translation scheme for my own idea
module which does not make room for HTML yet. but
the best thing about it is that translations can
be done in a simple HTML editor or a visual editor
like Dreamweaver.</p>
<!--end-->

<p>my own reset button</p>

<!--my own reset button-->
reset
<!--end-->

<p>my own more text</p>

<!--my own more text-->
Detta ar nagra text p&aring; svenska
<!--end-->

<p>a new button</p>

<!--a new button-->
button text
<!--end-->

</body>
</html>

HTML Definition Files

Here is the technique you have been waiting for. It is simple, user friendly and editable without using a database, special tools or exposing the translator to PHP source code. The code is short, easily modified to suit various needs, and PHP makes using this technique easy.

Yes, there are disadvantages to using this method, but they are the same as those of the typical definition files described earlier. Some of the problems in using text definition files are solved by changing the storage method and avoiding the use of PHP within the resource files.

Creating Resources

First, you need to create a simply formatted HTML document using <p> tags to show the names of the variables to be created as separate text blocks while in a visual HTML editor. Comment tags are used to designate text blocks to be translated and loaded into PHP variables. When finished and formatted, your HTML file should look like Listing 2.

The PHP Code

Let’s look at two examples. The first uses an array to store the translated text; the second, a set of defined constants. Both of these methods have some minor drawbacks. When using an array, if the array is large, with the number of elements in the millions, and it is accessed multiple times, then a performance problem may occur. But, in most situations, a dynamic web page will access a single array element no more than twice to get the needed texts. Calling an array into your code may require you to set it as a global. Arrays have the benefit of being organized, and duplicates can be weeded out easily. When using constants, you always run the risk of name collisions. The plus side to using constants is that they are easily written to a cache table and do not need to be set as global to be called within your PHP code. Rather than getting into benchmarking and other aspects of performance, I will just say that you should weigh the pluses and minuses and choose the method that seems best for your application. The code to process the HTML into PHP data can be seen in Listing 3.

Editing the Text

The best thing about this technique is that any text or HTML editor can be used. These are available on most popular desktop operating systems. The translator is not bound by the restrictions of a program like POEdit. The text is also seen in a familiar format. When using Dreamweaver in code mode, editing the translation file is easy and straightforward. After setting up the translation file in the code view, using Dreamweaver in design mode makes translating and editing the text even easier. You can also see and edit the comment tags in design mode as shown in Figure 2.

Computer Assisted Translation

Although much CAT software does not like HTML, this is not really a big problem when using the technique described here. You can easily use a WYSIWYG editor and then cut and paste the translation text into the CAT program.
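Since the HTML definition file only changes when a translation is edited, the array parsed from it can be cached as a plain PHP file and included on later requests. The following is a minimal sketch of that idea; the helper names and the cache file location are assumptions, and a real application would also compare file modification times to decide when the cache is stale.

```php
<?php
// Hypothetical helper: writes the language array to a PHP file that
// returns it. var_export() emits valid PHP source for the array.
function saveLanguageCache($cacheFile, $lang)
{
    $source = '<?php return ' . var_export($lang, true) . ';' . "\n";
    return file_put_contents($cacheFile, $source) !== false;
}

// Hypothetical helper: loads the cached array, or returns false when no
// cache exists and the HTML file has to be parsed again.
function loadLanguageCache($cacheFile)
{
    if (!is_file($cacheFile)) {
        return false;
    }
    return include $cacheFile;
}

// Assumed cache location; any directory writable by PHP will do.
$cacheFile = sys_get_temp_dir() . '/sv_language.cache.php';

$_lang = array('my_own_button' => 'send', 'my_own_reset_button' => 'reset');
saveLanguageCache($cacheFile, $_lang);

$cached = loadLanguageCache($cacheFile);
echo $cached['my_own_button'] . "\n";
```

Because the cache is generated with var_export(), text supplied by translators is stored as string data and is never executed as PHP code.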


Why use the technique I’ve described here? There are many reasons but here are a few of the strong ones. HTML is universally used and accepted with a very shallow learning curve. Translators, developers, programmers, webmasters and designers can easily see the HTML text and know what is going on, thus making it easier to maintain a good translation and share it. HTML pages can be checked in a web browser for proper encoding. CSS can be used to create a more visually pleasing text at the time of translation. In cases of right-to-left or top-to-bottom languages the technique can show the text in the proper read direction while editing. There is less PHP code, fewer server resources and reduced maintenance to worry about. I hope that many PHP developers will start using this simple technique in the future, as it makes everyone’s job easier.

Both commercial and open source projects can benefit from this type of internationalization technique. It goes very quickly, and resources previously spent on internationalization maintenance can be used to make improvements elsewhere in the project. The time to get the software to market becomes shorter and more defined.

You might think that delivering an English version of an application is good enough, but is it really? The software market may carry your work across national borders—if it does, an English version is only the beginning. 



<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Untitled Document</title>
</head>

<body>
<?php

// This is where the PHP code goes that translates the
// HTML into PHP variables when the file is scanned and parsed.
// Avoid calling the regular expression engine by using
// other, faster PHP functions.
// If a cached file exists, then skip all steps, or clear the
// cached file and create a new one.

// Get the file contents based on language
$text = file_get_contents("language/sv_language.html");

// Split the file content string into an array
$preVar = explode('<!--end-->', $text);

$_lang = array();

foreach ($preVar as $preVar_1) {

    // Start separation of array item key name and array item value
    $preVar_2 = explode('-->', $preVar_1);

    // Separate out the names for the array keys
    $preVar_3 = explode('<!--', $preVar_2[0]);

    // Skip the trailing markup after the last text block
    if (!isset($preVar_2[1], $preVar_3[1])) {
        continue;
    }

    // Clean up the names to be used as array keys
    $newVar = str_replace(' ', '_', $preVar_3[1]);

    // Load text into array item $_lang["my_own_idea_text"]. The
    // thought of using the faster variable variables
    // should be passed over due to possible security breaches
    // if users are creating or helping with translations.
    $_lang[$newVar] = $preVar_2[1];

    // Optionally you can use define() and go with setting
    // constants, but be wary of name collisions.
    define("_lang_" . $newVar, $preVar_2[1]);

}

// Check for duplicate array keys and throw an error if found.
// Speed things up after first use by caching the result
// into a PHP file for include.

// Test print the language array elements
print $_lang["my_own"] . '<br>' . "\n\n";
print $_lang["my_own_translation_scheme"] . '<br>' . "\n\n";
print $_lang["my_own_idea_text"] . "\n\n";
print $_lang["a_new_button"] . "\n\n";

print 'Test print constants';
print _lang_my_own . '<br>' . "\n\n";
print _lang_my_own_translation_scheme . '<br>' . "\n\n";
print _lang_my_own_idea_text . "\n\n";
print _lang_a_new_button . "\n\n";
?>
</body>
</html>
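The comments in the listing suggest caching the parsed result in a PHP file and checking for duplicate keys, but leave both as an exercise. The sketch below shows one way that might look. The load_language() helper, its arguments and the cache layout are assumptions made for this example, not part of the original technique.

```php
<?php
// Hypothetical helper: parse a language file in the comment-tag format and
// cache the result as a PHP file that simply returns the array.
function load_language($htmlFile, $cacheFile)
{
    // Reuse the cache when it is at least as new as the source file.
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($htmlFile)) {
        return include $cacheFile;
    }

    $lang = array();
    foreach (explode('<!--end-->', file_get_contents($htmlFile)) as $block) {
        // Split "...<!--key name" from the translated text that follows it.
        $parts = explode('-->', $block, 2);
        $start = strrpos($parts[0], '<!--');
        if (count($parts) < 2 || $start === false) {
            continue; // trailing markup without a named text block
        }
        $key = str_replace(' ', '_', trim(substr($parts[0], $start + 4)));
        if (isset($lang[$key])) {
            trigger_error("Duplicate language key: $key", E_USER_WARNING);
            continue;
        }
        $lang[$key] = $parts[1];
    }

    // var_export() emits valid PHP source for the array.
    file_put_contents($cacheFile, '<?php return ' . var_export($lang, true) . ';');
    return $lang;
}
```

A call such as load_language('language/sv_language.html', 'cache/sv_language.php') would then parse the HTML only when the translation file has changed; the cache path here is again only a placeholder.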

Carl McDade is a freelance web developer and programmer living in Sweden. He is a certified Microsoft database administrator and has been doing web development since 1997. Carl spends most of his development time working with documentation, code and studying PHP content management systems. You can find him at his website


Why is it Taking so Long?


How long does it take to push out a new feature? Not the man-hours spent coding, but the actual dates. From the point of deciding to implement something, by what date are the users actually using it? This is called “lead time.” It’s not just an academic measure; it’s actually the most important measurement a development team can make, even if it’s a little simplistic.

by MARCUS BAKER


Working out the time interval, from customer request to actual feature delivery, is a very powerful measure. Often, it’s quite a shock just how long things take to deliver. Fortunately, it can generate action as well as dismay, so it’s a measurement worth making. It’s not a great diagnostic: it won’t tell you what has gone wrong, just that something is going off the rails.

To analyze long lead times, we need to look a little deeper. Lead time is a bit simplistic for what I want to look at next, so I’m going to choose a variant called “Value Stream Mapping.” This was popularized by Mary and Tom Poppendieck in their book “Lean Software Development.” Here’s how it works. You plot, on a timeline, all of the periods when a feature is being worked on, and all of the times when it’s idle. There are some genuine reasons for something being idle: waiting for an external resource or user feedback are good examples. Most times, though, it’s because something more urgent has to finish first.

I’ve plotted a real life example for a persistence library written at my current main client, although it’s also available publicly (Figure 1). It catalogues quite a few mistakes on our part, but also a few lessons learned. We’ll look at it in detail.
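As a rough illustration of the bookkeeping behind a value stream map, the sketch below adds up the active and idle periods for a single feature. The timeline figures and the summarize_value_stream() helper are invented for this example; they are not the data behind Figure 1.

```php
<?php
// Hypothetical timeline for one feature: each entry is a period in days,
// flagged as active work or idle waiting. The figures are invented.
$timeline = array(
    array('days' => 7,  'active' => true),   // design and core coding
    array('days' => 21, 'active' => false),  // waiting behind urgent work
    array('days' => 3,  'active' => true),   // finishing and testing
    array('days' => 14, 'active' => false),  // waiting for release
);

// Total the lead time and the part of it spent actually working.
function summarize_value_stream(array $timeline)
{
    $lead = 0;
    $active = 0;
    foreach ($timeline as $period) {
        $lead += $period['days'];
        if ($period['active']) {
            $active += $period['days'];
        }
    }
    return array(
        'lead_time'   => $lead,
        'active_time' => $active,
        'efficiency'  => $lead > 0 ? $active / $lead : 0.0,
    );
}

$summary = summarize_value_stream($timeline);
printf(
    "Lead time: %d days, worked on: %d days (%.0f%% of the lead time)\n",
    $summary['lead_time'],
    $summary['active_time'],
    $summary['efficiency'] * 100
);
```

The gap between the two totals is the idle time that a plain lead time figure hides.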



Rapid Progress at First

First, you need a quick bit of history. Our previous library was a bit untidy when it came to handling transactions, and you had to call save() on every object you wanted to persist. Something like this...

    $person = &new Person();
    $person->setName('Marcus');
    $person->save();

That second requirement was a real pain. Often, a method would get some dependent objects (shipping cost on an order, for example), and change some of them. The top level script would no longer have references to these objects, so it couldn’t persist the change. This meant lots of save() calls were spread around the accounting code in order to ensure that the calculations would not get lost (yuck!).

    function addShippingCost(&$order) {
        $shipping = &$order->getShipping();
        $shipping->setTotal(0.15 * $order->getTotal());
        $shipping->save();
    }

The development team was small at that time, only two full time enterprise developers. Luckily, we were self managing, and had the freedom to do what we knew was right. We could move quickly. We decided to write a much sharper tool. Thanks to the magic of CVS, I can pinpoint that day as Monday July 19, 2004. That was the day of the first design session, and we progressed rapidly. By July 26th, we had the core working, but were starting to run into interruptions. We hadn’t tested the transaction side, but we could perform CRUD operations on objects. Knocking out a crude persistence layer in a week with only two developers isn’t bad, but we were able to do it because we were doing nothing else. Then, my co-developer went on holiday. By August 14th, working mainly at home, I had got the whole design up—collections, transactions and UnitOfWork. UnitOfWork is a pattern for bulk saving—see “Patterns of Enterprise Application Architecture” by Martin Fowler. For us, it looked like this...

    $ship_it = &new Change();
    $order = $ship_it->create('Order');
    ...
    addShippingCost($order);
    $ship_it->commit();

The call to addShippingCost() will now work without the save() call, making that function much cleaner. The Change object (the UnitOfWork) knows it has to save persistent objects and their dependents. It treats the entire business transaction as a whole. Much nicer.

The new code was not yet battle-tested, but it was definitely up and running. All we needed to do was add cascading deletes and back-up and restore capabilities, and it was ready to go into our system.

The rot was starting to set in. A direct competitor to the company had appeared and this altered some of the priorities towards front end stuff. Persistence refactoring had to wait for a bit in favour of more urgent tasks. It wasn’t until a month later that cascading deletes were fixed. The only other changes until December were either minor, or forced externally by other code changes. Without a nice sized time slot, it was difficult to sink the mind back into tricky transaction problems. This meant exclusive time had to be scheduled, but there always seemed more pressing concerns. It seems that under time pressure, it’s the clever stuff that gets left by the wayside, no matter the importance.

[Figure 1]

Multitasking is Evil

Although multitasking is bad for the lower priority tasks, it’s actually bad for the high priority ones too. Suppose you have two tasks of one week. If you release the work to the team sequentially, then it takes two weeks, and one of those tasks is done after the first week and can be leveraged. After all, it was high priority for a reason. If you release all the work to the developers at once, and allow for task switching cost, things will probably take three weeks. That is, you get both tasks completed after three weeks. Even without the extra switching cost, it means you are denying yourself the benefit of having one task completed after the first week.

Now there are some things that you cannot put the whole team on (you won’t have a baby in less than nine months no matter how many mothers-to-be are deployed), so you have to subdivide a little. If you can though, choose one task and do it to the exclusion of all others. If you have already started something, then it’s best to complete it first, rather than come back to it later.
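The arithmetic behind the two-task example can be written out in a few lines. This is only a sketch under simplifying assumptions: the function names are made up, the one-week switching cost is the rough figure implied by “probably take three weeks,” and interleaved tasks are treated as all finishing together at the end.

```php
<?php
// Completion week of each task when the team works on them one at a time.
function sequential_finish(array $durations)
{
    $finish = array();
    $elapsed = 0;
    foreach ($durations as $name => $weeks) {
        $elapsed += $weeks;
        $finish[$name] = $elapsed;
    }
    return $finish;
}

// Completion week when all tasks are interleaved. Simplification: nothing
// is finished until everything is, and switching adds an assumed overhead.
function interleaved_finish(array $durations, $switchingCost)
{
    $total = array_sum($durations) + $switchingCost;
    return array_fill_keys(array_keys($durations), $total);
}

$tasks = array('A' => 1, 'B' => 1);       // two one-week tasks

print_r(sequential_finish($tasks));       // A is usable after week 1
print_r(interleaved_finish($tasks, 1));   // neither is usable before week 3
```

Setting the switching cost to zero still leaves task A finishing at week two instead of week one, which is the point made above about denying yourself the early result.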

The Public Release

December 2004 saw a short flurry of activity prior to a public release. We often release code into the wild that is not business-critical, and that we would like to see tested more widely. Besides, we liked the code. It even appeared on Sitepoint. Some minor changes resulted from this public appearance, and usage in the wild began. Our plan was progressing, albeit a bit slowly.

Then we moved offices. We also hired more staff and reorganized our schedule. The persistence layer was declared our third most important thing. That sounds good, but that’s actually a long way down the stack. All through 2005, the library has languished.

Forget Your Stack

Third on the stack is actually pretty useless. It means that at least two other things must be completely and utterly stuck before it gets a time slot. You must also have a time slot that allows you to do serious work on it, which means two days at least if you allow a day to pick up the mental pieces.

This is such an important experimental result that it’s worth repeating. If you have three items in your work queue, the third item will never progress until another item is done. You may as well wait until you have fewer than two items on your list, and only then even entertain the idea of working on it. Until then, it’s just clutter. It’s even dangerous clutter, because if you do get those two days free, you will do a day’s real work on a task that someone has to spend a day understanding. In other words, you have made absolutely no progress in the best case. More likely, conditions have changed since then and some of the code is wrong and now has to be fixed. The contribution of an odd couple of days is almost certainly negative. You’d do better to just tidy your desk, or better yet, to spend your time clearing the bottleneck in your main task.

Developer stacks are a bad thing. Let the project manager build a simple queue of work. The project manager then delivers these parcels to the development team, not individuals, just below the rate that will produce stacking. If you think of the developers as a road, you want a steady flow of traffic, not gridlock.

Constant Drain

Some work was forced upon us. The library had to have some minor edits when PHP 4.4 spewed reference warnings, costing another day. It also got accidentally deleted for a while in July, when someone was clearing out unused code. That was an hour spent removing it, and several putting it back in. That’s not progress; that’s more cost. Meanwhile this unused library contributes nothing to our bottom line.

What’s really annoying is the maintenance of the very system that this new code is due to replace. We have to apply fixes to our old persistence layer too. We also want to make changes to other parts of the application, changes that would be a lot easier if we had the new code in place, as it was designed to facilitate just these alterations. We know we have a better system waiting in the wings, so maintaining the old code is doubly depressing.

The good news? Persistence is the next task queued. Its possible inclusion in the next iteration has been discussed in every iteration meeting for the past two months. We really, really want this work finished now.

Hopefully this story will soon have a happy ending.

The Theory of Constraints

This is a one sentence theory that has great implications. It says you cannot change one of throughput, inventory or cost without affecting the others. What does that mean?

Throughput is the amount of money software makes once delivered and the rate you deliver it. Think of it as delivered value in each iteration. You subtract fixed costs, but not daily labour costs. For software, the fixed costs are usually zero: there are no materials to consume.

Inventory is the cost of all of the stuff you have lying around that is incomplete or unused. Note that you count this as money spent, not an asset. It’s only an asset once you have delivered and it’s contributing to throughput. Inventory costs money even when it’s just lying around, as it has to be read occasionally, and maintained. Even if neither of these things happen, it’s more files to scroll past every day you look at the code base and it’s more cognitive load on everybody just knowing that it’s there. Our persistence library is currently inventory.

Cost is daily costs such as staffing, training, software licenses and equipment. Costs are investments to improve throughput. This can be a big change of emphasis for some companies, because the focus is on making money, not saving money.

For example, you often end up multitasking so as to raise your efficiency by reducing your idle time, but this is misguided. By creating inventory you will slow down the rate things get done overall, which hurts the company. Counterintuitively, it can be best to do nothing. Throughput is king in this model, not efficiency.

It’s the dominance of throughput that gives this theory its name. If you look at a manufacturing process, you will find that throughput is limited by one element in the workflow. Perhaps it’s some machine that has a maximum rate it can work. In a software shop, it’s likely to be a programmer with a key area of knowledge. This limit is called the constraint, and if other parts of the workflow go faster than the constraint, you just pile up inventory. A chain is only as strong as its weakest link.

You can attack constraints by increasing cost. You could train another developer in a vital skillset, or use pair programming to spread around knowledge of the system or simply hire more developers in key areas. All of these raise costs, but to the greater good of making more money. You also need a minimal amount of inventory in front of bottlenecks so that they don’t get starved of work, whereas non-bottlenecks can always catch up, so their inventory is just clutter. As you attack one bottleneck, another will appear. Removing constraints is a process of continual improvement.

Light at the End of the Tunnel?

Right now, the constraint is me. I’m the one that knows the persistence layer, having written most of it. Unfortunately, I am still finishing off a previous task and will be heavily involved in training new staff. That’s not good, and it leads to some stark choices. Likely, the persistence layer gets placed on hold yet again and our story lasts a little longer. Alternately, I am protected from staff training for a few weeks. This will delay the positive contribution of the new team members and may be a greater evil, but it only has to happen long enough for another developer to take over the persistence code. If we were to do all of this again, we would, of course, not start the persistence layer at all unless we were certain of finishing it. Once started, we would either finish it regardless, or delete the parts already done and wait until we really had enough time. Deleting half finished work sounds brutal, but I am more convinced than ever that it would have been the right thing to do. We now understand scheduling problems a lot better and can finally make rational decisions. I didn’t even mention the option of stopping my current task, for instance, as it’s now unthinkable. We can just work out the real throughput costs and make our choice by the numbers. I think we will make the right decision.
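Working out throughput “by the numbers” can start from something as small as the sketch below, which models a workflow as a chain of stages and shows where inventory piles up. The stage names and rates are hypothetical, and the model assumes the first stage never runs out of work and every stage runs at its own maximum rate.

```php
<?php
// Hypothetical workflow: features each stage can complete per iteration.
$stages = array('design' => 5, 'persistence' => 2, 'testing' => 4);

// The throughput of the whole chain is set by its slowest stage.
function find_constraint(array $stages)
{
    $rate = min($stages);
    return array('stage' => array_search($rate, $stages), 'rate' => $rate);
}

// Run the chain for a number of iterations and report what was delivered
// and how much unfinished work is queued in front of each stage.
function simulate(array $stages, $iterations)
{
    $queues = array_fill_keys(array_keys($stages), 0);
    $delivered = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $incoming = null; // null marks the first stage, which never starves
        foreach ($stages as $name => $rate) {
            if ($incoming === null) {
                $done = $rate;
            } else {
                $queues[$name] += $incoming;
                $done = min($rate, $queues[$name]);
                $queues[$name] -= $done;
            }
            $incoming = $done;
        }
        $delivered += $incoming;
    }
    return array('delivered' => $delivered, 'inventory' => $queues);
}

print_r(find_constraint($stages));
print_r(simulate($stages, 3));
```

In this toy model the fast design stage only adds to the queue in front of the constraint; raising the rate of the slowest stage is the only change that increases what gets delivered.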

Things to do While Waiting

While you are waiting for your bottleneck to clear, you could do worse than read these resources... The Goal by Eliyahu Goldratt: it’s actually a novel and I recommended it last month. It explains the theory of constraints. has many papers on applying lean manufacturing to software. Out of the Crisis by W. Edwards Deming: it was he who started it all when he visited Japan in 1950. His 14 principles of management seem as modern as ever, such as “foster all chances for pride of workmanship and sharing in the improvement process.” He had vision.

MARCUS BAKER works at Wordtracker as a Technical Consultant, where his responsibilities include the development of applications for mining Internet search engine data. Based in London, he is a regular contributor to Sitepoint forums. His previous work includes telephony and robotics. Marcus is the lead developer of the SimpleTest project, which is available on Sourceforge. He’s also a big fan of eXtreme programming, which he has been practising for about two years.

Email Injection


This edition marks a milestone: two full years of Security Corner. Thanks to everyone at php|architect for the opportunity to write about PHP security and especially to my loyal readers. I hope this column continues to help the PHP community develop more secure applications. Thanks for reading!

by CHRIS SHIFLETT


I must admit that when I first heard about email injection a few years ago, I wasn’t very impressed. After all, it’s just another case of developers making the mistake of blindly trusting user input. If you let users manipulate the arguments passed to the mail() function, they can send email from your server. No big surprise there.

There are an alarming number of email injection vulnerabilities in PHP applications, and this has prompted me to focus on email injection in this month’s column. This type of vulnerability has become a treasure trove for spammers around the world, but why is it so common? I think the root cause of email injection’s popularity is that developers don’t understand the attack or the necessity of filtering input prior to passing it to the mail() function. It is as if there is an assumption being made about how PHP sends email.

Sending Email

Most veteran PHP developers know all about the mail() function, and they realize that it provides a rather raw interface for sending email. Like many PHP functions, it is very flexible: mail() provides enough functionality to send almost any type of email you can imagine, provided you know the proper format.

As with most security vulnerabilities, it isn’t the experts who are making the most mistakes. Using mail() is very simple, and the possibilities aren’t immediately obvious to a novice developer. For example, the following demonstrates a basic use:

    <?php
    mail('', 'My Subject', 'My Message');
    ?>

A simple email injection vulnerability is to let a user provide the first argument:

    <?php
    mail($_POST['email'], 'My Subject', 'My Message');
    ?>

This is similar to any other injection vulnerability, but the context is different. If you try this yourself and provide your own email address, you’ll receive an email similar to the following (some headers removed for clarity):

    To:
    Subject: My Subject
    From: nobody@localhost

    My Message

The value of the To header is provided by the user, so someone wanting to exploit this situation might try to send spam from your server by simply providing a comma-separated list of addresses. You can mimic this situation with a simple test:

    <?php
    mail(',,', 'My Subject', 'My Message');
    ?>

Each of these addresses will receive the email, because the To header can handle multiple addresses in this way. Many developers mistakenly assume that this problem represents the extent of the concern, so it’s not a concern if the recipient is static. This isn’t true. The larger problem is when the fourth argument to the mail() function is provided in part by the user.

Injecting Headers

A common use of the mail() function is to provide a contact or feedback form. An example of such a form is as follows:

    <form action="sendmail.php" method="POST">
    <p>Your Email:<br />
    <input type="text" name="email" /></p>
    <p>Your Subject:<br />
    <input type="text" name="subject" /></p>
    <p>Your Message:<br />
    <textarea name="message"></textarea></p>
    <p><input type="submit" value="Send Message" /></p>
    </form>

The trouble with using the mail() function as demonstrated earlier is that the email appears to be sent from your web server: the From header is generated by PHP, automatically. If you want to let customers send you email through a page on your web site, you need to be able to set this, so that the email appears to be from them. This is where using the mail() function’s fourth argument helps:

    <?php
    mail('', 'My Subject', 'My Message', 'From:');
    ?>

By letting you specify additional headers, PHP gives you the flexibility you need to specify the From header. This is great, but it is also where the real danger of email injection lurks. Consider the following example for the sendmail.php script referenced in the previous contact form:

    <?php
    mail('', $_POST['subject'], $_POST['message'], "From: {$_POST['email']}");
    ?>

If you test this yourself, you’ll see that it works as expected. You receive an email at your contact address just as if the user had emailed you directly. Unfortunately, users now have almost absolute control over the email. The most common tactic used by spammers is to provide the spam message in the contact form and attempt to provide additional headers in the email field. For example, they can provide an email such as the following to send the message to an additional recipient:

    To:

The trouble with this approach, from a spammer’s perspective, is that you’re more likely to notice the vulnerability. The address will receive an email similar to the following:

    To:
    Subject: My Subject
    From:
    To:

    My Message

Although the email is sent to both addresses as desired, the exploit is quite obvious. Spammers don’t succeed by exploiting a script once; they want to find a vulnerable script and exploit it for a long time. This makes the Bcc header a favorite injection:

    Bcc:

As many Bcc headers as desired can be provided, and the resulting email will be much less conspicuous:

    To:
    Subject: My Subject
    From:

    My Message

Because the Bcc header is not present in the message, you’re more likely to think that a spammer has simply tried to spam you personally using your online form. After all, you’re allowing anyone to send you a message, and this doesn’t appear to break the rules in any way. However, an unknown number of other people might have received the same spam message, and your script’s URL will become a favorite spammer destination.
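To see why a line break in the email field is the whole problem, it can help to build the header string without sending anything. The sketch below is only an illustration: build_from_header() and has_line_break() are made-up helper names, and the addresses are placeholders on the reserved example.org domain.

```php
<?php
// Build the additional-headers string the way the vulnerable script does.
function build_from_header($email)
{
    return "From: {$email}";
}

// A single header value should never contain a carriage return or newline.
function has_line_break($value)
{
    return strpbrk($value, "\r\n") !== false;
}

// Placeholder addresses on the reserved example.org domain.
$honest  = 'sender@example.org';
$hostile = "sender@example.org\r\nBcc: victim@example.org";

foreach (array($honest, $hostile) as $email) {
    $headers = build_from_header($email);
    printf(
        "%d header line(s), line break in input: %s\n",
        count(explode("\r\n", $headers)),
        has_line_break($email) ? 'yes' : 'no'
    );
}
```

The hostile value turns one header into two, which is exactly what the mail() function would then pass along.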

Exploiting Vulnerable Scripts

In order to test your own scripts, you want to be able to provide more than one line for the email, as demonstrated in a previous example:

    Bcc:

There are a few ways to accomplish this. The simplest is to type it out in a text editor, copy it, and paste it into the form. Of course, spammers opt for something a bit more automatic. The following PHP script exploits a hypothetical contact form:

    <?php
    $fp = fsockopen('', 80);
    fputs($fp, "POST /sendmail.php HTTP/1.1\r\n");
    fputs($fp, "Host:\r\n");
    fputs($fp, "Content-Type: application/" . "x-www-form-urlencoded\r\n");
    fputs($fp, "Content-Length: 95\r\n\r\n");
    fputs($fp, '' . '' . '&subject=My+Subject&message=My+Message');
    fclose($fp);
    ?>

The format of POST data is exactly the same as the format of GET data, and the URL-encoded CRLFs appear as %0D%0A.

Note: Although I do not wish to detract from the focus of the article, keep in mind that more sophisticated attacks can be used to send HTML email, attachments, and the like. An attacker, given complete control over the arguments to the mail() function, can send anything that mail() is capable of sending.
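When testing your own form, the encoded request body and its length do not have to be worked out by hand. The following sketch uses http_build_query() to show how a multi-line email value is encoded; the field values are placeholders on the reserved example.org domain, and nothing is sent over the network.

```php
<?php
// Placeholder form values for testing a form you own.
$fields = array(
    'email'   => "sender@example.org\r\nBcc: victim@example.org",
    'subject' => 'My Subject',
    'message' => 'My Message',
);

// http_build_query() URL-encodes each value, so CR and LF become %0D%0A
// and spaces become plus signs.
$body = http_build_query($fields, '', '&');

// The Content-Length header must match the encoded body exactly.
$length = strlen($body);

echo $body, "\n";
echo "Content-Length: ", $length, "\n";
```

Computing the length from the encoded string avoids a mismatch between the Content-Length header and the body that actually gets written to the socket.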

Preventing Email Injection

As I hope is already clear, preventing email injection is a simple matter of filtering input. In this case, filtering with a whitelist approach isn’t easy. You can probably restrict the subject to a whitelist of valid characters, but you might need to be more lenient in the message. Email addresses have proven difficult to filter for a number of reasons, including the fact that the specification isn’t very restrictive. (Did you know an email address can have comments in it?) My advice is to do the best you can and consider some Defense in Depth approaches to strengthen your filtering.

There are numerous regular expressions that can help you filter an email address, and even the more lenient examples prevent common email injection attacks:

    <?php
    $clean = array();

    // The D modifier makes $ match only at the very end of the string,
    // not just before a trailing newline.
    $email_pattern = '/^[^@\s<&>]+@([-a-z0-9]+\.)+[a-z]{2,}$/iD';

    if (preg_match($email_pattern, $_POST['email'])) {
        $clean['email'] = $_POST['email'];
    }
    ?>

A good Defense in Depth approach is to inspect the data specifically for newlines and carriage returns, and the ctype_print() function can help:

    <?php
    if (isset($clean['email']) && ctype_print($clean['email'])) {
        /* The email contains no newlines or carriage returns. */
    }
    ?>

This technique can save the day in the event that your filtering logic has a flaw.

Until Next Time...

I hope this article helps you appreciate the need to consider security in every aspect of your PHP development, even those simple contact and feedback forms. By inspecting input to be sure that it’s the format and size that you expect, you can prevent many types of vulnerabilities, email injection included. Defense in Depth measures such as checking for carriage returns and newlines are very useful, but try to resist the urge to rely on these techniques as primary safeguards. Until next month, be safe.

CHRIS SHIFLETT is an internationally recognized expert in the field of PHP security and the founder and President of Brain Bulb, a PHP consultancy that offers a variety of services to clients around the world. Chris is a leader in the PHP community, and his involvement includes being the founder of the PHP Security Consortium, the founder of, a member of the Zend PHP Advisory Board, and an author of the Zend PHP Certification. A prolific writer, Chris has regular columns in both PHP Magazine and php|architect. He is also the author of HTTP Developer’s Handbook (Sams) as well as the highly acclaimed Essential PHP Security (O’Reilly). You can contact him at or visit his web site and blog at


Output Buffering


Output is generally sent from calls to echo or print, or from outside PHP code blocks, and once it’s sent, it’s gone. However, using PHP’s output buffering functionality, it is possible to capture this output and further manipulate it before sending to the client. In this month’s Tips & Tricks, I’ll show you why and how to control output with output buffering.



Portable Document Markup Language (PDML) is a language used for creating PDF documents. What’s best: it’s implemented entirely in PHP, and it’s extremely simple to use. All a user must do is create a document with markup similar to HTML, include one line of PHP at the top of the file, and then the file will magically render a PDF document when called from a Web browser.

PDML is a remarkably lightweight package. It only requires that the user create a PDF using a simple markup language. After glancing at PDML, other PDF-creation packages written in PHP seem to introduce needless complexity to the process of creating a PDF on the fly. For example, Listing 1 shows a very simple “Hello, World” document using PDML. So, what makes it work? The magic behind PDML: output buffering.

CODE DIRECTORY: output

What is Output Buffering?

Normally, when output is echoed or printed, it is sent immediately to PHP’s output buffer. It cannot be retrieved or changed once this occurs, and all document headers must be set before echoing or printing output. This is not the case when using output buffering.

Output buffering, put simply, is the process of delaying the transmission of output to the client. During this delay, the script may access or modify the contents of the buffer before it is sent. What’s more, the script can send the buffer all at once or in chunks, which I’ll explain later on.

In the PDML example, the markup is never sent to the client. Instead, PDML uses ob_start() to start buffering output. Meanwhile, it passes a callback function to ob_start(): the custom function ob_pdml(). Now, when the output is flushed to the client (in this case, when the script is finished processing), it will first pass through ob_pdml(). What comes out is a PDF document. I hope it is evident how this technique can be useful for any number of applications.

Start Buffering

As mentioned, to start buffering content, one must place a call to ob_start(). Any output echoed or printed previous to the ob_start() call will not be stored in the internal buffer. That is, it has already been sent to the client, even though the sending of output will actually be delayed until the script has finished running (or the buffer is full). All output after the ob_start() call will be in the script’s local output buffer.

Aside from starting the output buffer, ob_start() also accepts a callback function parameter, also mentioned earlier. Using a custom callback function, one can use output buffering to create one’s own markup language (as is the case with PDML), perform customized content rewriting before sending the output to a client (e.g. URL rewriting, output escaping), or implement a custom templating engine.
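A custom callback can be very small. The sketch below is not how PDML works internally; it is a made-up example in which a hypothetical render_markup() callback turns an invented [b]...[/b] tag into HTML before the output leaves the buffer.

```php
<?php
// Hypothetical callback: rewrite a made-up [b]...[/b] markup into HTML.
// PHP passes the buffered output in and sends whatever is returned.
function render_markup($buffer)
{
    return str_replace(
        array('[b]', '[/b]'),
        array('<strong>', '</strong>'),
        $buffer
    );
}

ob_start('render_markup');

echo 'Hello, [b]World[/b]!';

// The callback runs when the buffer is flushed, so the client receives
// the rewritten text rather than what was echoed.
ob_end_flush();
```

Because the callback sees the complete buffer, the rest of the script can echo its own markup freely and leave the translation to a single function.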

Sending Compressed Content PHP comes with a built-in output buffering callback function that can be used along with ob_start() to send gzip-compressed data to browsers. In fact, it

Accessing the Buffer All data stored in the buffer may be easily accessed, provided the buffer has not yet been flushed. To get the contents of the buffer at any given time, simply use ob_get_contents() or ob_get_flush(). Both of these functions return a string representing all current output in the buffer. However, ob_get_flush() returns the buffer string and then flushes the buffer, while ob_get_contents() leaves the buffer unchanged. Take, for example, the following: <?php ob_start(); echo ‘Hello, World!’; $output = ob_get_contents(); ob_end_clean(); ?>

This code, when run, will not output anything. Since I have turned on output buffering with ob_start() and cleared the buffer, turning off buffering, with ob_end_clean(), the echo doesn’t send anything to the

Output buffering, put simply, is the process of delaying the transmission of output to the client. will even detect whether the browser requests gzipped content, and if so, how to send the data—compressed or uncompressed. For example, my browser (Mozilla Firefox) sends an Accept-Encoding header with most requests, the value of which is “gzip,deflate”. This tells the Web server that it can compress content before sending it to the browser, which saves on bandwidth and cuts down load times. Placing the following at the top of a script will force PHP to handle the compression, which can be helpful, especially if your Web server doesn’t compress responses: <?php ob_start(‘ob_gzhandler’); ?>

Now, the response will include a Content-Encoding header with a value of “gzip”. Please note, however, that this works only for browsers that request (and can read) compressed content. All other browsers will receive uncompressed content.

client. Instead, the variable $output contains the value “Hello, World!” I simply captured the contents of the buffer with ob_get_contents(). Had I used ob_get_flush(), the contents of the buffer would also have been sent to the client. While “Hello, World!” would have displayed in the client output, the script would still have a chance to take action on all of the data stored in $output, which, in this case, is only “Hello, World!”

Using this technique, it is possible to control all output from an application, running it through any number of functions and processing routines. At the top of the script, call ob_start(); at the bottom, get the contents with ob_get_contents(), then clear and close the buffer with ob_end_clean(). Now, we can modify everything the script intended to output. For example, regular expression matching with the Perl-Compatible Regular Expression (PCRE) library is often used on buffered data to replace certain content, such as HTML or JavaScript in output. For that matter, the full content of $output may be passed through htmlentities() or htmlspecialchars().

It is also important to note that, when using this technique, document headers may be sent at any time until the buffer is flushed, which, depending on the methods used, may not be until the very end of the script. In the example above, it is possible to place a call to header() after the echo. However, it is not possible to buffer headers sent with header(). Headers are not stored in the output buffer, so functions such as ob_get_contents() and ob_end_clean() neither capture nor discard them. As in all cases, headers must be sent before any output. With output buffering, though, output is being delayed. This is why it is possible to set headers after calls to echo or print.
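The capture-and-process pattern described above can be sketched as a small helper. The name capture_escaped() is hypothetical, and escaping the whole output is only one of many possible processing steps.

```php
<?php
// Hypothetical helper: run $body with buffering on, then return
// everything it tried to output, HTML-escaped.
function capture_escaped($body)
{
    ob_start();                       // start collecting output
    $body();                          // echo/print calls land in the buffer
    $output = ob_get_contents();      // copy the buffer into a string
    ob_end_clean();                   // discard the buffer and stop buffering
    return htmlspecialchars($output); // process the text before sending it
}

$safe = capture_escaped(function () {
    echo '<b>Hello, World!</b>';
});

// Nothing has reached the client yet, so in a Web request a call to
// header() would still be allowed at this point.
echo $safe;
```

The same structure works for any other filter: swap htmlspecialchars() for a preg_replace() call or a templating step.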

Sending Chunked Responses

A chunked response is one that is broken up into smaller pieces and sent separately rather than all at once. In a typical process, all output is sent to PHP’s output buffer, which usually waits to send the data to the client until the script finishes. Then, when it is sent to the client, it includes a Content-Length header specifying the exact length of the content. Sometimes, however, it is necessary to send data to the client before the script finishes. This is especially the case when processing large amounts of data could lead to very long page load times.

Output buffering can solve this problem by providing the means to immediately flush the contents of the buffer to the Web server itself, encouraging it to send the contents immediately. I say that it “encourages” the Web server because the Web server may not always do this, as is the case when Apache is using mod_gzip or when using certain Web servers on the Microsoft Windows platform. Nevertheless, when using a standard Apache installation without mod_gzip, the ability to send chunked responses can greatly improve usability and decrease load times.

Listing 2 shows an example that might be used in a real-world scenario. For the sake of argument, let’s say the fictional table “foo” contains 20,000 records. Iterating over these records may take some time. Meanwhile, without a chunked response, the user waits on this data with no real feedback that the request is being processed. However, the example in Listing 2 uses output buffering to send a chunked response using flush(). According to the PHP manual, flush() flushes “the output buffers of PHP and whatever backend PHP is using (CGI, a web server, etc).” Thus, it “effectively tries to push all the output so far to the user’s browser.” As mentioned earlier, this is not always the case, however. Note also that flush() does not empty a buffer started with ob_start(); while such a buffer is active, call ob_flush() first so that its contents reach the layer that flush() pushes out. So, in Listing 2, the buffer is being explicitly flushed to the client after every 100 records. Thus, the user receives some feedback that the request is being processed and can begin viewing records while the remainder of the script continues to process and send more data to the client. Note that the response now contains a Transfer-Encoding header with a value of “chunked” in lieu of the Content-Length header.
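For readers without a database at hand, the flushing pattern can be tried on any list of values. This is a sketch only: send_in_chunks() is a made-up helper, and whether the client really receives each chunk early still depends on the Web server, as noted above.

```php
<?php
// Hypothetical helper: echo each row and push output toward the client
// every $chunkSize rows. Returns how many times it flushed.
function send_in_chunks($rows, $chunkSize)
{
    $flushes = 0;
    $count = 0;
    ob_start();
    foreach ($rows as $row) {
        echo $row, "\n";
        $count++;
        if ($count % $chunkSize == 0) {
            ob_flush(); // move this buffer's contents up one level
            flush();    // ask PHP and the Web server to send what they hold
            $flushes++;
        }
    }
    ob_end_flush();     // pass on whatever is left and stop buffering
    flush();
    return $flushes + 1;
}

$flushes = send_in_chunks(range(1, 5), 2);
```

Calling ob_flush() before flush() matters here because the helper opens its own buffer with ob_start(); flush() alone would leave the rows sitting in that buffer.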

URL Rewriting

Not to be confused with Apache’s mod_rewrite, PHP’s output buffering functionality allows users to “rewrite” URLs by dynamically appending querystring values to URLs and adding hidden form fields in output. This works in much the same way as the session ID with session.use_trans_sid set in php.ini. For example, consider the following HTML:

    <a href="foo.php">Link</a>
    <form action="bar.php" method="POST">
    <input type="text" name="baz" />
    </form>

Now, consider that a persistent variable of some sort—perhaps an authentication token—needs to exist throughout the script in all links and forms. Simply add the following at the top of the script (or above the content where the variable should be appended):

    <?php output_add_rewrite_var('token', 'abc123'); ?>

Now, the link and form will be rewritten as such:

    <a href="foo.php?token=abc123">Link</a>
    <form action="bar.php" method="POST">
    <input type="hidden" name="token" value="abc123" />
    <input type="text" name="baz" />
    </form>

To clear the variable(s) set with output_add_rewrite_var() from being appended in later parts of the script, use output_reset_rewrite_vars(). The behavior of this functionality is controlled by url_rewriter.tags in php.ini.

Content Length and Fin

Finally, it is possible to get the length of the content in the buffer with ob_get_length() for times when it is necessary to explicitly set the Content-Length header at the script level, among other things.

Output buffering is a surefire way to take control of your output. Implementing these techniques in your scripts can help improve the performance and, in some cases, usability of your applications. Still, there are myriad ways to use output buffering; I’d like to hear yours. If you have a tip or trick that you’d like to see published here, send it to, and, if I use it, you’ll receive a free digital subscription to php|architect. Until next time, happy coding!

LISTING 1

    <?php require_once 'pdml.php'; ?>
    <pdml>
    <body>
    <font face="Arial" size="16pt">Hello, World!</font>
    </body>
    </pdml>

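LISTING 2

    <?php
    // $user and $pass are assumed to hold the database credentials.
    ob_start();
    try {
        $i = 0;
        $dbh = new PDO('mysql:host=localhost;dbname=test', $user, $pass);
        foreach ($dbh->query('SELECT * FROM foo') as $row) {
            print_r($row);
            $i++;
            // Every 100 records, empty the ob_start() buffer,
            // then push the output on toward the client.
            if ($i % 100 == 0) {
                ob_flush();
                flush();
            }
        }
        ob_end_flush(); // pass on the remaining records and close the buffer
        flush();
        $dbh = null;
    } catch (PDOException $e) {
        print 'Error: ' . $e->getMessage();
    }
    ?>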

BEN RAMSEY is a Technology Manager for Hands On Network in Atlanta, Georgia. He is an author, Principal member of the PHP Security Consortium, and Zend Certified Engineer. Ben lives just north of Atlanta with his wife Liz and dog Ashley. You may contact him at or read his blog at

Volume 5 Issue 1 • php|architect • 59

Product Review: Komodo



The Web Development IDE for all platforms? by PETER B. MacINTYRE


I first heard of the Komodo development environment by seeing one of their advertisements in our magazine. This got me interested in the tool, as I had been using a competing IDE for some time. What interested me most was that Komodo claimed to be a development environment for many other languages. The focus of this review, however, will be on the PHP portion. I usually give the product the first say in summarizing its claims, so let’s see what they have to say about themselves:

PHP: 4+, 5+
PRODUCT VERSION: 3.5.2
O/S: Linux/Unix, Windows, MacOS, Solaris
PRICING: Personal - $29.95 US; Professional - $295.00 US
LINK:

ActiveState Komodo is the award-winning, professional integrated development environment (IDE) for dynamic languages, providing a powerful workspace for editing, debugging, and testing your programs.

Getting Started

So, let’s take a look at what this tool is supposed to do. The installation process on Windows was quite uneventful. I simply downloaded the license key and the installation file and ran the install wizard. Once this was completed I started up the IDE and started to poke around. The layout itself takes a little getting used to, especially if you are quite accustomed to using another interface, but once I got oriented, I generally liked what I saw.

Figure 1 shows the Komodo IDE at first start-up. As is evident here, there are two major sections to the IDE: the project management area on the left side, and the main editing window on the right. Of course, there are also toolbars at the ready along the top and debugging panes along the bottom. One nice thing about how Komodo starts up is that there is a page full of your most recent work (projects and files) and helpful tutorials. This certainly aids the developer in getting right back to the work that they were doing the last time they were using this IDE. Much time is often lost in reopening the number of files in a related project just to get back to what you were most recently doing.

Digging Deeper

Let’s look at some of the features that make Komodo stand out from the crowd. Apart from the toolbars along the top of the IDE that appear on startup, there are also a few other toolbars that can be added to the mix. One thing that I have been looking for in a PHP toolbar (and haven’t really found elsewhere) is the ability to customize and size the toolbars. I am one developer who likes to set up their own environment and have the ability to add commonly used menu items (common for me) to the toolbar.

The code editor is also a place where I have certain things that I am looking for. As is shown in Figure 2, this editor is color sensitive to the differences between raw HTML and PHP code (or the web development language that you are currently using). I did find, though, that the default colors did not contrast as starkly as I would like, but this is a personal matter, and the colors can be adjusted in the preferences section. Figure 3 shows the “Dark” code editing setting as a contrast to the default settings. As well, this editor has a code folding feature, which is quite useful for getting code that you know is fine out of view while looking for a coding problem (as one example).

What I Liked

What I really liked about this product was that the built-in browser would automatically refresh itself if the file being served was changed within Komodo. This only saves a mouse press or two, generally, but it is also something that is very often repeated, so just this little feature alone is quite a time-saver.

Another feature that I liked was in the project pane on the left of the display. There is a little toolbar there as well, with 3 top-level options all related to the project at hand. This is another time-saving feature that puts the more-often-used menu items at the ready.

I also really liked the fact that a developer could have a few projects open at the same time. This is valuable in that you can access code that is similar in another project and bring it into play in a different project without having to close one and open the other, which is yet another time saver. But it’s not all about saving development time on menu items and keystrokes; there is also the overall functionality of the product. This seems to be where Komodo didn’t quite live up to my expectations.

FIGURE 1

What I Didn’t Like

In other IDEs that I have looked at, there were some great features that, once a person gets comfortable in using them, are definitely missed when they are absent. Code completion (AKA code intelligence) and syntax checking are the first example. Komodo has support for these features, but I was completely unable to get them to function in the PHP context (they did, however, seem to work fine with some of the other supported languages), even after an email exchange with Komodo’s tech support team.

The connection to database servers is another feature that is common to a few other PHP IDEs; this is not present in Komodo. Although it is not necessarily directly related to a language’s development environment, it is almost a must-have these days, since there is so much use of database connectivity in web sites.

The last item that I wanted to touch on as frustrating was my inability to get the PHP debugger operational. Now, to be fair, this may be a basic PEBCAK (problem exists between chair and keyboard) error, but I did try a number of times to get it operational, and failed. I expected this feature to be quite a bit simpler to set up.

FIGURE 2

Summary

The Komodo IDE project has won awards; you can see them listed on their web site. In my opinion, however, there are better PHP IDEs on the market. I think that part of the issue that I have with Komodo is that they are trying to be all editors to many languages and that seems to be too large of a task for them. There was lots of supporting information, and I did get some good pointers from some of the staff members at ActiveState, so it’s not all bad. Since I only tested this product in the context of PHP development, I cannot speak of its abilities with the other languages that it claims to support, like Perl, Python, and Ruby.

I give this product 3.5 stars out of a possible 5.


PETER MACINTYRE lives and works in Prince Edward Island, Canada. He has been an editor with php|architect since September 2003. Peter is a Zend Certified Engineer. Peter’s web site is at

exit(0);

2006: A Look Forward
by MARCO TABINI


I’ve got to hand it to Derick Rethans: he’s launched a new fashion. Derick’s Look Back, which has been running for three years now and is featured for the first time in this issue of php|architect, has spawned a veritable industry of blog posts and articles that provide some interesting insight into what happened to PHP in the past year. At the risk of alienating some of my other friends, however, I must say that I still like Derick’s Look Back best, because its only goal is really that of fondly reminiscing about how his life (and, by extension, the many lives that PHP touches every day) has been affected by the events surrounding the internals mailing list.

Since so much has been said and done about what has happened in 2005, I figured I’d take a look at what I think will happen in 2006. Predictions are nothing new, of course, and there’s nothing quite as potentially devastating to one’s reputation as playing prophet only to be proven completely wrong in the end. Oh well, it’s worked for Nostradamus… so, as long as I can manage to be vague enough to give the idea that I know what I’m talking about without actually saying anything, I should be perfectly fine.

In a recent blog post ( mt/?p=106), I claimed that 2006 would be the “year of confusion” for PHP. The short version of my thesis there was that PHP has reached such a level of maturity that any further innovation comes at a steep price for everyone involved: those who develop the language, who must carry an ever-growing baggage of backwards-compatibility needs, and those who develop with the language, who will be faced with the non-trivial task of migrating the next cycle of their applications to PHP 5. I still think that confusion will be the defining factor of 2006.

The problem with PHP is best described by an aphorism that I picked up from one of my very first business partners way back when I got started: “turning a hundred thousand dollars into a million is a heck of a lot easier than turning a million into ten.” I remember laughing at the time, probably because a hundred thousand dollars seemed such a ridiculous amount of money, but I have found out just how right he was. With monetary growth also comes a growth in the complexity of a company, which, in turn, increases your overhead.

The same is true of a mature language like PHP, and in more ways than meet the eye. As I mentioned, changes to the language itself are getting increasingly difficult to implement because of compatibility considerations. The real challenge, however, is going to be in the hands of the thousands (or hundreds of thousands, if you believe some research firms) who in 2006 are likely to find themselves faced with an end-of-lifetime decision regarding their current applications.

On one hand, porting software to PHP 5 seems to be the logical thing to do: after all, what’s the point of rewriting your applications if you don’t take advantage of a version of the language that provides you with the best possible facilities and the highest longevity? Maintaining PHP 4 is going to make less and less sense to developers for a number of reasons, primarily the fact that it provides very limited support for some of the technologies that are emerging as the must-haves of web development, like good XML handling, SOAP, and so on.

On the flip side, the average PHP developer is, in my opinion, thoroughly confused by PHP 5, and several well-publicized recent “discussions,” such as the reference hoopla, have done nothing to make things better. While the PHP development team is busy scoping out and developing PHP 6, the community will have a lot of catching up to do trying to educate itself into proper PHP 5 development.


PHP Architect  

The magazine for PHP Professionals
