
TRIFLE An efficient way to serialize tree structured information. by Daniil Moerman

Document History

Release ID   Release Date   Modifications
1.00         19.03.2013     This is the first version of the document.

All rights reserved. Copyright by Daniil Moerman, Germany, 2013


Table of contents

1 Abstract
2 Terminology
  2.1 Overview
  2.2 Tree
  2.3 Root Branch
  2.4 Branch
  2.5 Leaf
  2.6 Association
  2.7 Association Key
  2.8 Tree Projection
3 Format Specification
  3.1 Overview
  3.2 VINT
  3.3 VINT32
  3.4 VINT64
  3.5 TRIFLE-Tree
  3.6 Format Byte
  3.7 Header Section
  3.8 Branch Descriptor
  3.9 Leaf-Entry
  3.10 Branch-Entry
  3.11 Payload Section
  3.12 Checksum
  3.13 Well known Type-Identifier
4 License
5 About the Author
6 Example of a TRIFLE-Tree
7 Where can TRIFLE help me?
8 Visual Overview of TRIFLE



1 Abstract

This document specifies a method for serializing tree-structured data very efficiently. In this document, efficiency covers the following aspects:

• Reduce the amount of data that has to be stored or transferred to obtain a complete tree serialization.

• Reduce the complexity of the parsing process, so that serialization and deserialization can be done very fast.

• Reduce the time an implementer needs to implement this method.

• Allow access to the received information without waiting until every single byte of the serialization has reached the target system.

I have tried my best to illustrate everything that could sound complicated or allow multiple interpretations with diagrams and tables, so that implementing the required methods becomes a fairly easy task that is not reserved for computer scientists.

2 Terminology

This section provides all the terminology you need to understand this document.

2.1 Overview

A TRIFLE consists of a small number of substructures, each of which can itself consist of further substructures. The format is both hierarchical and recursively built, which makes its definition and processing very simple. But before we can map the elements of a real tree to a TRIFLE, we have to understand the basic structure of a tree. The following diagram shows all the required elements.

[Diagram: the basic elements of a tree]


2.2 Tree

By a tree we mean a structure that stores information about the branches and leafs it contains and about the connections between them. A connection associates one branch with another branch or with exactly one leaf. To identify a particular association we use an association key.

2.3 Root Branch

The root branch is the topmost branch of a tree. It is also the only branch on this level, which means that a single tree cannot have multiple root branches. Every non-trivial tree has exactly one root branch.

2.4 Branch

A branch is a node within a tree structure with which you can associate further branches or leafs. The number of associated elements can vary from 0 to n, where n is a non-negative integer. A non-trivial branch has at least one element associated with it.

2.5 Leaf

A leaf is a node within a tree structure that is always associated with exactly one branch of the tree. It stores the “real” data that you want to transfer. That data can be any kind of digital information, such as integers, arrays, documents and so on. To be able to read the content of a leaf properly, a type identifier must be stored with the leaf.

2.6 Association

An association is a connection between two elements of a tree. It can exist either between a branch and a leaf or between a branch and another branch. Every association must have an association key, so that it can later be addressed unambiguously.

2.7 Association Key

An association key is a Byte-Array that makes an association between two elements of a tree unique. A given association key does not have to be unique within the whole tree, but all association keys that identify associations within one branch have to be unique. That means, for example, that you are not allowed to use the same association key for two associations within the same branch.

2.8 Tree Projection

A tree projection is a vector of all leafs that can be found within a single tree. After sorting all branches and leafs of a tree using a defined order, the resulting projection vector becomes unique. From then on you only have to provide the associations and the branches to have all the data needed to reconstruct the tree in its original shape.
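A tree projection can be sketched in a few lines. The following is only an illustration: the `Leaf` and `Branch` classes are hypothetical, and the chosen order (the leafs of a branch first, then its sub-branches depth-first, each in stored order) is an assumption that happens to match the payload order of the example in section 6.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    key: bytes       # association key of the leaf
    payload: bytes   # the "real" data


@dataclass
class Branch:
    leafs: list = field(default_factory=list)     # list of Leaf
    branches: list = field(default_factory=list)  # list of (key, Branch) pairs


def projection(branch: Branch) -> list:
    """Flatten a tree into the vector of its leafs.

    Assumed order: leafs of the current branch, then every sub-branch
    recursively, in the order in which they are stored.
    """
    leafs = list(branch.leafs)
    for _key, child in branch.branches:
        leafs.extend(projection(child))
    return leafs
```

Because the order is fixed, the payloads can be written one after another and later be matched to their leafs again using only the branch and association information.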

3 Format Specification

The following section describes the structure of all the elements a TRIFLE consists of. Knowing this is essential for anyone who has to write a parser that can handle a TRIFLE-Tree.

3.1 Overview

This chapter is divided into two parts. The first part shows how every format-specific integer value is encoded using the so-called “VINT” representation. Afterwards every format element of TRIFLE is shown byte by byte, together with the definition of every section and element.

3.2 VINT

As already mentioned, VINT, which stands for “Variable Integer”, is used within TRIFLE as an efficient way to represent non-negative integer values. Such values are used, for example, to store the length of a leaf's content. Another use is as a type identifier, where particular integer values represent particular types. The two situations call for different ranges of integer values. With a normal fixed-size integer format we would mostly waste memory on leading zeros (remember that a plain integer value has a size of 4 bytes and a long integer value a size of 8 bytes). VINT lets us adapt the size of the representation to the value it actually has to store. The VINT format comes in two versions: VINT32 and VINT64. They differ in the interval of values they can represent. Let's look at the numerical boundaries of VINT:

Type                            Boundaries (MIN … MAX)
unsigned integer (int32)        0 … 4.294.967.295
unsigned long integer (int64)   0 … 18.446.744.073.709.551.615
VINT32                          0 … 1.077.952.575
VINT64                          0 … 2.314.885.530.818.453.536

Every VINT number consists of 2 range bits (VINT32) or 3 range bits (VINT64), followed by the rest, which stores an unsigned integer in Big-Endian style. To calculate the value stored in a VINT you only need to know the range in use, which you get by reading the range bits. You then add the rest value to the minimal value that this range can represent. The rest value can thus be seen as an offset into the chosen range. Both processes, encoding and decoding VINTs, can be implemented using bit-shift operations and basic arithmetic.

3.3 VINT32

VINT32 has 4 ranges. The boundaries of these ranges are:

Range Name   Boundaries
RANGE 0      0 … 63
RANGE 1      64 … 16.447
RANGE 2      16.448 … 4.210.751
RANGE 3      4.210.752 … 1.077.952.575

The exact structure of a VINT32 value in each range is shown in the following diagram:

[Diagram: structure of a VINT32 value in each range]
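The VINT32 rules can be sketched as follows. The exact bit layout is an assumption here: the range number is taken to occupy the two most significant bits of the first byte, RANGE r is taken to be r + 1 bytes long, and the remaining 6, 14, 22 or 30 bits hold the offset in Big-Endian order. This layout reproduces the range boundaries of the table and the single-byte values used in the example of section 6. The function names are hypothetical.

```python
# Offset bits available in RANGE 0..3 (assumed: 2 range bits + offset).
OFFSET_BITS = (6, 14, 22, 30)

# Smallest value of each range: 0, 64, 16448, 4210752.
RANGE_MIN = []
_next_min = 0
for _bits in OFFSET_BITS:
    RANGE_MIN.append(_next_min)
    _next_min += 1 << _bits
VINT32_MAX = _next_min - 1  # 1077952575


def encode_vint32(value: int) -> bytes:
    """Encode a non-negative integer as VINT32 (assumed layout)."""
    if not 0 <= value <= VINT32_MAX:
        raise ValueError("value out of VINT32 range")
    # Highest range whose minimum does not exceed the value.
    rng = max(r for r in range(4) if RANGE_MIN[r] <= value)
    offset = value - RANGE_MIN[rng]
    word = (rng << OFFSET_BITS[rng]) | offset
    return word.to_bytes(rng + 1, "big")


def decode_vint32(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a VINT32 at data[pos]; return (value, position after it)."""
    rng = data[pos] >> 6                      # two range bits
    size = rng + 1
    chunk = data[pos:pos + size]
    if len(chunk) < size:
        raise ValueError("truncated VINT32")
    word = int.from_bytes(chunk, "big")
    offset = word & ((1 << OFFSET_BITS[rng]) - 1)
    return RANGE_MIN[rng] + offset, pos + size
```

With this layout the value 12 is stored as the single byte 0x0C, while 64, the first value of RANGE 1, needs the two bytes 0x40 0x00.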


3.4 VINT64

VINT64 has 8 ranges. The boundaries of these ranges are:

Range Name   Boundaries
RANGE 0      0 … 31
RANGE 1      32 … 8.223
RANGE 2      8.224 … 2.105.375
RANGE 3      2.105.376 … 538.976.287
RANGE 4      538.976.288 … 137.977.929.759
RANGE 5      137.977.929.760 … 35.322.350.018.591
RANGE 6      35.322.350.018.592 … 9.042.521.604.759.583
RANGE 7      9.042.521.604.759.584 … 2.314.885.530.818.453.536

The exact structure of a VINT64 value in each range is shown in the following diagram:

[Diagram: structure of a VINT64 value in each range]
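The range boundaries of both VINT versions follow from one rule, which the sketch below spells out. It assumes that RANGE r occupies r + 1 bytes and that all bits except the range bits hold the offset; `vint_ranges` is a hypothetical helper name. Under this assumption the sketch reproduces every boundary of the VINT32 and VINT64 tables except the very last one: with 61 offset bits the upper bound of VINT64 RANGE 7 comes out as 2.314.885.530.818.453.535, one less than the tabulated value.

```python
def vint_ranges(range_bits: int) -> list:
    """Return (range index, min, max) for every range of a VINT type.

    range_bits is 2 for VINT32 and 3 for VINT64. Each range starts
    right after the previous one ends, so no value has two encodings.
    """
    ranges = []
    low = 0
    for rng in range(1 << range_bits):
        offset_bits = 8 * (rng + 1) - range_bits  # bits left for the offset
        high = low + (1 << offset_bits) - 1
        ranges.append((rng, low, high))
        low = high + 1
    return ranges


for rng, low, high in vint_ranges(3):
    print(f"RANGE {rng}: {low} … {high}")
```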


3.5 TRIFLE-Tree

The whole TRIFLE-Tree consists of at most 4 logical sections.

Section Name   Content
FB             Format-Byte
HS             Header-Section
PS             Payload-Section
CS             Checksum

Whether the last section, the checksum, is present is decided by a flag inside the Format-Byte. That is why a TRIFLE-Tree can also consist of only 3 sections instead of 4.

3.6 Format Byte

The Format-Byte defines specific properties of the serialization, which a parser has to read carefully in order to parse the TRIFLE-Tree correctly. The Format-Byte has the following structure:

Bits of the Format-Byte
Bit     1     2     3    4     5     6     7    8
Field   TOB   ALI   WORDSZ     BYO   BIO   CHA

The fields have the following meaning:

TOB (TRIFLE Original Flag)
This flag defines whether the current TRIFLE-Tree is formed in the way defined here. If a modified version of this specification has been used for the serialization, this flag should be set to “0”. Otherwise it should be set to “1”.

ALI (Aligned Flag)
This flag defines whether the payload inside the Payload-Section is aligned. If the Aligned Flag is set, all written payload sizes must be interpreted as values that have to be rounded up to a multiple of the word size defined in WORDSZ.
Example: Payload-Size = 42 Byte [WORDSZ = 32 Bit]. The transferred payload has a size of 44 bytes, where the last 2 bytes are filled with zeros and do not store any information.

WORDSZ (Word Size)
These two bits define the word size that has been used to generate this serialization. This is important for the alignment calculation mentioned above as well as for interpreting some payload types correctly (e.g. a payload of type ARRAY_INT stores values that have the size of 1 word, so with 32-bit words each value has a size of 4 bytes and with 64-bit words a size of 8 bytes). The possible combinations have the following meaning:

Bit 3   Bit 4   Meaning
0       0       Word Size equals 8 Bit
0       1       Word Size equals 16 Bit
1       0       Word Size equals 32 Bit
1       1       Word Size equals 64 Bit

BYO (Byte-Order)
This flag is important for some types of payload. In the case of ARRAY_INT it defines the order of the bytes of a single integer value. If this flag is set, the Big-Endian style is used. If this flag is not set, the Little-Endian style is used.
ATTENTION: This flag has no effect on VINT values, as they are always coded in Big-Endian style.

BIO (Bit-Endianness)
This flag defines the bit endianness within the whole message. Normally this flag should be set, which signals that the most significant bit is the first and the least significant bit the last within a byte. In the inverted case this flag should be unset.

CHA (Checksum Algorithm)
These two bits decide which checksum is appended to the end of the serialization. The possible combinations have the following meaning:

Bit 7   Bit 8   Meaning
0       0       No checksum at all
0       1       CRC
1       0       MD5
1       1       User defined algorithm
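Decoding the Format-Byte takes only a few masks and shifts. The sketch below assumes that bit 1 of the table is the most significant bit of the byte, which is consistent with the value 11111110 = 0xFE derived in the example of section 6. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

WORD_SIZE_BITS = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}
CHECKSUM_NAMES = {0b00: None, 0b01: "CRC", 0b10: "MD5", 0b11: "USER"}


@dataclass(frozen=True)
class FormatByte:
    original: bool            # TOB, bit 1
    aligned: bool             # ALI, bit 2
    word_size_bits: int       # WORDSZ, bits 3-4
    big_endian_bytes: bool    # BYO, bit 5
    msb_first_bits: bool      # BIO, bit 6
    checksum: Optional[str]   # CHA, bits 7-8 (None = no checksum)


def parse_format_byte(value: int) -> FormatByte:
    """Split a Format-Byte into its fields (bit 1 assumed to be the MSB)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("a Format-Byte must fit into one byte")
    return FormatByte(
        original=bool(value & 0b1000_0000),
        aligned=bool(value & 0b0100_0000),
        word_size_bits=WORD_SIZE_BITS[(value >> 4) & 0b11],
        big_endian_bytes=bool(value & 0b0000_1000),
        msb_first_bits=bool(value & 0b0000_0100),
        checksum=CHECKSUM_NAMES[value & 0b11],
    )
```

For 0xFE this yields an original, aligned serialization with 64-bit words, Big-Endian bytes, most-significant-bit-first bit order and an MD5 checksum.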



3.7 Header Section

The Header-Section does not really exist as a structure of its own; the term is meant to help you picture where all the metadata needed for a complete serialization is located. Syntactically, the Header-Section consists of exactly one Branch Descriptor (the Branch Descriptor of the Root Branch). As this descriptor is defined recursively, we end up with a complete description of the whole tree.

3.8 Branch Descriptor

A Branch Descriptor describes the content of a single branch. It lists the associated branches with their association keys, and the associated leafs with their association keys and the sizes of their payloads in bytes. The Branch Descriptor consists of 4 sections:

Section Name   Content
LC             Leaf-Counter
BC             Branch-Counter
LDA            Leaf-Entry-Array
BDA            Branch-Entry-Array

The LC (Leaf-Counter) is a VINT32 value and stores the number of leafs that are associated with this branch.

The BC (Branch-Counter) is a VINT32 value and stores the number of branches that are associated with this branch.

The LDA (Leaf-Entry-Array) is an array of Leaf-Entries, which are written one after another without gaps in between. The number of entries is defined by the LC value.

The BDA (Branch-Entry-Array) is an array of Branch-Entries, which are written one after another without gaps in between. The number of entries is defined by the BC value.



3.9 Leaf-Entry

A Leaf-Entry describes one single leaf and consists of 4 sections:

Section Name   Content
KL             Key Length
KEY            Association Key
PS             Payload-Size
TID            Type-Identifier

The KL (Key Length) is a VINT32 value and stores the size of the following KEY section in bytes.

The KEY (Association-Key) is a Byte-Array that stores the Association-Key.

The PS (Payload-Size) is a VINT64 value that stores the size of the payload in bytes. ATTENTION: This size has to be recalculated if the Aligned Flag within the Format-Byte is set. In that case the PS value stores the size of the payload, but not the size of the region that this payload occupies within the Payload-Section. The difference between these two values is padded with zeros at the end of the region.

The TID (Type-Identifier) is a VINT32 value that stores the type number of the current payload's type (for the well-known Type-Identifiers see section 3.13 of this document).
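Writing a Leaf-Entry then amounts to concatenating its four sections. The following sketch uses a small hypothetical helper, `encode_vint`, whose bit layout is an assumption (range number in the top 2 or 3 bits of the first byte, RANGE r being r + 1 bytes long, offset in Big-Endian order).

```python
def encode_vint(value: int, range_bits: int) -> bytes:
    """Encode value as VINT32 (range_bits=2) or VINT64 (range_bits=3)."""
    if value < 0:
        raise ValueError("VINT values are non-negative")
    low = 0  # smallest value of the range currently being tried
    for rng in range(1 << range_bits):
        offset_bits = 8 * (rng + 1) - range_bits
        if value < low + (1 << offset_bits):
            word = (rng << offset_bits) | (value - low)
            return word.to_bytes(rng + 1, "big")
        low += 1 << offset_bits
    raise ValueError("value too large for this VINT type")


def encode_leaf_entry(key: bytes, payload_size: int, type_id: int) -> bytes:
    """Build KL + KEY + PS + TID for one leaf.

    payload_size is the unpadded size; padding only affects the
    Payload-Section, not the value stored in PS.
    """
    return (
        encode_vint(len(key), 2)        # KL, VINT32
        + key                           # KEY
        + encode_vint(payload_size, 3)  # PS, VINT64
        + encode_vint(type_id, 2)       # TID, VINT32
    )


# Leaf "XYZ" from section 6: 12 bytes of ARRAY_FLOAT (0x06).
print(encode_leaf_entry(b"XYZ", 12, 0x06).hex(" "))
```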

3.10 Branch-Entry

A Branch-Entry describes the content of a single branch. In contrast to a plain Branch-Descriptor it is used only on levels deeper than the Root Branch. In fact it is a combination of an Association Key, as seen in the Leaf-Entry definition, and the Branch-Descriptor itself. This makes the definition of the Branch-Descriptor recursive. The recursion terminates when the deepest branch has been reached, as the Branch-Counter is zero there.

Section Name   Content
KL             Key Length
KEY            Association Key
BD             Branch-Descriptor

The KL (Key Length) and the KEY (Association-Key) have the same type and meaning as in the Leaf-Entry (see section 3.9).

The BD (Branch-Descriptor) is a Branch-Descriptor (see section 3.8) that describes the associated branch.
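The recursive definition maps directly onto a recursive reader. The following is a minimal sketch, not a complete parser: it does no bounds checking, the function names and the dictionary-based result are hypothetical, and the VINT bit layout (range number in the top bits of the first byte, RANGE r being r + 1 bytes long) is an assumption.

```python
def read_vint(data: bytes, pos: int, range_bits: int):
    """Read a VINT32 (range_bits=2) or VINT64 (range_bits=3) at data[pos]."""
    rng = data[pos] >> (8 - range_bits)
    size = rng + 1
    offset_bits = 8 * size - range_bits
    word = int.from_bytes(data[pos:pos + size], "big")
    # Minimal value of this range = number of values in all lower ranges.
    low = sum(1 << (8 * (r + 1) - range_bits) for r in range(rng))
    return low + (word & ((1 << offset_bits) - 1)), pos + size


def read_key(data: bytes, pos: int):
    """Read KL + KEY and return (key, new position)."""
    length, pos = read_vint(data, pos, 2)
    return data[pos:pos + length], pos + length


def read_branch_descriptor(data: bytes, pos: int = 0):
    """Read LC, BC, the Leaf-Entries and, recursively, the Branch-Entries."""
    leaf_count, pos = read_vint(data, pos, 2)
    branch_count, pos = read_vint(data, pos, 2)
    leafs = []
    for _ in range(leaf_count):
        key, pos = read_key(data, pos)
        size, pos = read_vint(data, pos, 3)     # PS is a VINT64
        type_id, pos = read_vint(data, pos, 2)  # TID is a VINT32
        leafs.append((key, size, type_id))
    branches = []
    for _ in range(branch_count):
        key, pos = read_key(data, pos)
        child, pos = read_branch_descriptor(data, pos)  # recursion
        branches.append((key, child))
    return {"leafs": leafs, "branches": branches}, pos


# A root branch without leafs and with one empty sub-branch "A".
tree, end = read_branch_descriptor(b"\x00\x01\x01A\x00\x00")
print(tree, end)
```

The recursion ends on its own, because a branch with a Branch-Counter of zero triggers no further calls.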

3.11 Payload Section

The Payload Section is a Byte-Array of arbitrary size in which the stored payloads are written one after another without gaps in between. The information in the Header Section allows you to reconstruct the boundaries of the payload entries and their position in the hierarchy of the serialized tree. ATTENTION: Do not forget about the alignment, and recalculate the given payload size of every single payload element if necessary. If you forget this, your leafs will end up with the wrong content.
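The alignment calculation can be sketched as follows. `padded_size` and `payload_slices` are hypothetical names, and the sizes are expected in the order in which the leafs appear in the Header Section.

```python
def padded_size(size: int, word_bits: int, aligned: bool = True) -> int:
    """Size of the region a payload occupies in the Payload-Section."""
    if not aligned:
        return size
    word_bytes = word_bits // 8
    # Round up to the next multiple of the word size.
    return -(-size // word_bytes) * word_bytes


def payload_slices(sizes, word_bits: int, aligned: bool = True):
    """Return ([(start, end), ...], total length of the Payload-Section).

    start/end delimit the real payload; the padding zeros that may
    follow each payload are skipped.
    """
    slices, pos = [], 0
    for size in sizes:
        slices.append((pos, pos + size))
        pos += padded_size(size, word_bits, aligned)
    return slices, pos


# The example from section 3.6: 42 bytes with 32-bit words occupy 44 bytes.
print(padded_size(42, 32))
```

For the four payloads of section 6 (12, 24, 10 and 3 bytes with 64-bit words) this gives the regions 0-12, 16-40, 40-50 and 56-59 within a Payload-Section of 64 bytes.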

3.12 Checksum

The Checksum is a value of a fixed size that allows you to determine whether any transmission errors occurred on the way between the source of your TRIFLE-Tree and you. Whether a checksum is really needed on this level depends on the application and the environment you work in. Normally there is no need for a checksum, as the layer on which TRIFLE structures are transferred is so high that the information has already passed 3 previous checksums. Depending on the value of the Checksum-Flag within the Format-Byte (see section 3.6) the checksum has the following format:

Bit 7   Bit 8   Meaning
0       0       No checksum at all.
0       1       A CRC32 checksum is appended. It has a size of 32 bits, and the polynomial used is 0x04C11DB7, or more mathematically: $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^{8} + x^{7} + x^{5} + x^{4} + x^{2} + x + 1$
1       0       An MD5 checksum is appended (128 bits).
1       1       A user defined algorithm can be used. In this case you have to define your own format and validation algorithm. A library that implements TRIFLE should provide an API that allows its users to plug in a self-written checksum validator.
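Appending the checksum could look like the sketch below. Several details are assumptions, because the text does not fix them: the checksum is computed over all bytes that precede it, the CRC32 is the common variant provided by Python's `zlib.crc32` (which is based on the polynomial 0x04C11DB7), and the 32-bit CRC value is stored in Big-Endian order. `append_checksum` and `user_checksum` are hypothetical names.

```python
import hashlib
import zlib


def append_checksum(message: bytes, cha_bits: int, user_checksum=None) -> bytes:
    """Append the checksum selected by the CHA bits (bit 7 = high bit)."""
    if cha_bits == 0b00:
        return message                                   # no checksum
    if cha_bits == 0b01:
        crc = zlib.crc32(message)                        # unsigned 32-bit value
        return message + crc.to_bytes(4, "big")          # byte order assumed
    if cha_bits == 0b10:
        return message + hashlib.md5(message).digest()   # 16 bytes
    if cha_bits == 0b11:
        if user_checksum is None:
            raise ValueError("a user defined checksum function is required")
        return message + user_checksum(message)
    raise ValueError("CHA consists of exactly two bits")
```

Passing the user defined algorithm as a callable is one way to offer the plug-in API that a TRIFLE library should provide.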



3.13 Well known Type-Identifier

The following Type-Identifiers are considered well-known and should only be used with the meaning given here. The rest of the VINT32 number space can be used freely. If you are looking for a primitive data type rather than an array type, treat a primitive value as an array of that type with a length of 1.

Type-Identifier   Type-Label
0x01              ARRAY_BOOLEAN (8 bits per element; true = 0xAA, false = 0x55)
0x02              ARRAY_BYTE (8 bits per element)
0x03              ARRAY_SHORT (size of one element = 0.5 x WORD)
0x04              ARRAY_INT (size of one element = WORD)
0x05              ARRAY_LONG (size of one element = 2 x WORD)
0x06              ARRAY_FLOAT (IEEE 754, single precision; 32 bits per element)
0x07              ARRAY_DOUBLE (IEEE 754, double precision; 64 bits per element)
0x08              ARRAY_CHAR (UTF16 Big Endian)
0x09              ZIP (DEFLATE)
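A parser will typically keep the well-known identifiers in a lookup table. The sketch below also derives the size of one array element from the word size. Two points are assumptions: one ARRAY_CHAR element is taken to be one 2-byte UTF-16 code unit, and ARRAY_SHORT is rejected for 8-bit words, because half a word would then be smaller than a byte. The names are hypothetical.

```python
WELL_KNOWN_TYPES = {
    0x01: "ARRAY_BOOLEAN",
    0x02: "ARRAY_BYTE",
    0x03: "ARRAY_SHORT",
    0x04: "ARRAY_INT",
    0x05: "ARRAY_LONG",
    0x06: "ARRAY_FLOAT",
    0x07: "ARRAY_DOUBLE",
    0x08: "ARRAY_CHAR",
    0x09: "ZIP",
}


def element_size_bytes(type_id: int, word_bits: int):
    """Size of one element in bytes, or None if the type has no fixed size."""
    word_bytes = word_bits // 8
    if type_id in (0x01, 0x02):
        return 1
    if type_id == 0x03:
        if word_bytes < 2:
            raise ValueError("ARRAY_SHORT needs a word size of at least 16 bits")
        return word_bytes // 2
    if type_id == 0x04:
        return word_bytes
    if type_id == 0x05:
        return 2 * word_bytes
    if type_id == 0x06:
        return 4
    if type_id == 0x07:
        return 8
    if type_id == 0x08:
        return 2  # assumed: one UTF-16 code unit
    return None   # ZIP and user defined types


# Three ARRAY_INT elements with 64-bit words occupy 24 bytes.
print(3 * element_size_bytes(0x04, 64))
```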

4 License

You are free to use the technology presented here for any kind of application (as long as the application is not intended to harm any living creature directly or indirectly). Using this technology is absolutely free of charge. The only restriction is that you have to mention the name of this technology (TRIFLE) in every technical specification of a product that uses TRIFLE, in internal as well as in published specifications.



5 About the Author

My name is Daniil Moerman. I am currently (2013) studying Computer Science in Germany and am interested in many scientific areas. If you want to contact me for further information about TRIFLE, I recommend using the following e-mail address: daniil.moerman@googlemail.com (I accept e-mails written in German, English or Russian).

6 Example of a TRIFLE-Tree

Let's assume you have the following tree structure, which you want to serialize to a byte stream that meets the requirements of the TRIFLE definition in this document.

┌ Root Branch
├--- Branch : ABC
├------ Leaf : XYZ [type:ARRAY_FLOAT] {0.678,0.445,9.6556}
├------ Leaf : SyMbOL [type:ARRAY_INT] {1,566,87654}
├------ Leaf : Param01 [type:ARRAY_CHAR] {a,b,c,d,e}
├--- Branch : Next_Branch
├------ Leaf : Truth [type:ARRAY_BOOLEAN] {true,false,false}

Word Size            = 64 Bit
Byte-Endianness      = Big-Endian
Bit-Endianness       = Big-Endian
Checksum-Algorithm   = MD5
Checksum Wished      = YES
Aligned              = YES

Now I want to calculate the resulting byte sequence step by step. Let us first look at the settings listed below the tree, which specify the format of the TRIFLE structure, and form an appropriate Format-Byte. According to section 3.6, which defines the structure of the Format-Byte, I get the following value:

11111110 = 0xFE

The second step is to calculate the Header-Section, which is a little more difficult. That is why I will first write the binary string in human readable form before encoding it as a sequence of bytes:



{ROOT [LeafNr=0] [BranchNr=2]
  {Branch:ABC [LeafNr=3] [BranchNr=0]
    [Leaf:XYZ Type:ARRAY_FLOAT Size:12]
    [Leaf:SyMbOL Type:ARRAY_INT Size:24]
    [Leaf:Param01 Type:ARRAY_CHAR Size:10]}
  {Branch:Next_Branch [LeafNr=1] [BranchNr=0]
    [Leaf:Truth Type:ARRAY_BOOLEAN Size:3]}}

Now we replace all the noted elements with the appropriate TRIFLE elements:

Root LC, BC            0x00 0x02
Key "ABC"              0x03 0x41 0x42 0x43
ABC LC, BC             0x03 0x00
Leaf "XYZ"             0x03 0x58 0x59 0x5A 0x0C 0x06
Leaf "SyMbOL"          0x06 0x53 0x79 0x4D 0x62 0x4F 0x4C 0x18 0x04
Leaf "Param01"         0x07 0x50 0x61 0x72 0x61 0x6D 0x30 0x31 0x0A 0x08
Key "Next_Branch"      0x0B 0x4E 0x65 0x78 0x74 0x5F 0x42 0x72 0x61 0x6E 0x63 0x68
Next_Branch LC, BC     0x01 0x00
Leaf "Truth"           0x05 0x54 0x72 0x75 0x74 0x68 0x03 0x01

Each leaf line consists of the key length, the key, the payload size and the type identifier.

This was the most complicated part. Writing a recursive algorithm for this task is very easy in both directions, serialization and deserialization. Now we will write the Payload-Section. First the human readable form:

{Payload : XYZ} {Payload : SyMbOL} {Payload : Param01} {Payload : Truth}

And now the binary form (each payload is followed by its padding zeros):

XYZ       0x3F 0x2D 0x91 0x68 0x3E 0xE3 0xD7 0x0A 0x41 0x1A 0x7D 0x56
          0x00 0x00 0x00 0x00
SyMbOL    0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01
          0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x36
          0x00 0x00 0x00 0x00 0x00 0x01 0x56 0x66
Param01   0x00 0x61 0x00 0x62 0x00 0x63 0x00 0x64 0x00 0x65
          0x00 0x00 0x00 0x00 0x00 0x00
Truth     0xAA 0x55 0x55
          0x00 0x00 0x00 0x00 0x00

Now we only have to calculate the MD5 hash value of the whole binary message we just created. This can be done with an existing library. For the content of this example the MD5 value is:

0xB6 0x97 0xF5 0x2F 0x27 0x05 0x3E 0xC0 0xC2 0x78 0x48 0xF1 0xA3 0x82 0x5B 0x9B


After concatenating all four regions we get a complete TRIFLE-Tree serialization. I hope this document and the example were sufficient to convey the requirements that a TRIFLE implementation has to meet, and I would really like to see at least as many implementations as there are programming languages.
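The Payload-Section of this example can be reproduced with Python's standard library. This is a sketch for these four leafs only: it hard-codes the example's settings (64-bit words, Big-Endian byte order, alignment on), and the signedness of ARRAY_INT is left open because the example only contains non-negative values.

```python
import struct

WORD_BYTES = 8  # 64-bit words, as chosen in the example


def pad(payload: bytes) -> bytes:
    """Append zeros until the length is a multiple of the word size."""
    return payload + bytes(-len(payload) % WORD_BYTES)


# ARRAY_FLOAT: IEEE 754 single precision, Big-Endian.
xyz = struct.pack(">3f", 0.678, 0.445, 9.6556)

# ARRAY_INT: one word (8 bytes) per element, Big-Endian.
symbol = b"".join(v.to_bytes(WORD_BYTES, "big") for v in (1, 566, 87654))

# ARRAY_CHAR: UTF-16 Big-Endian.
param01 = "abcde".encode("utf-16-be")

# ARRAY_BOOLEAN: true = 0xAA, false = 0x55.
truth = bytes(0xAA if flag else 0x55 for flag in (True, False, False))

payload_section = b"".join(pad(p) for p in (xyz, symbol, param01, truth))
print(len(payload_section), payload_section.hex(" "))
```

The unpadded lengths 12, 24, 10 and 3 are the values written into the PS fields of the header, while the padded regions add up to 64 bytes.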

7 Where can TRIFLE help me?

I see the following applications as possible domains for the TRIFLE specification:

• RPC

• Storing tree structures

• Interprocess communication



8 Visual Overview of TRIFLE

[Diagram: visual overview of TRIFLE]