On Data Flow and Homogeneous Data

by moodyharsh 2014-10-26
homogenous-data

Data is not like electricity.
Electricity doesn’t have a list of formats (compare http://en.wikipedia.org/wiki/List_of_file_formats).
Like water or fuel, it is Homogeneous.

Electricity ~ Array # Homogeneous

With duck typing, all data is treated as Homogeneous in a way.
FBP (Flow-Based Programming) / Data Flow in statically compiled languages will need type conversions.

The term Homogeneous Data is rarely used in the literature. When it is used, it means Contiguous Data like Arrays.

I feel that Software Engineering borrows a lot from other Engineering disciplines.

If you look at

  • Message Passing
  • Event Driven Programming

They try to model Software upon

  • Routers
  • Signals

which are from electronics again.

How do Chemical Engineers do it ?
They too probably deal with a lot of permutations and combinations.

We can treat Data Structures like containers for chemicals.

http://en.wikipedia.org/wiki/Unit_operation

Chemical engineering unit operations also fall in the following
categories which involve elements from more than one class:

Combination (mixing)
Separation (distillation, crystallization, chromatography)
Reaction (chemical reaction)

Separation to what ?
To Homogeneous Data of course !

A bit like parsing XML / JSON / CSV to native data formats.
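
Here is a rough sketch of what such a Separation could look like for JSON. The document layout (customers with nested sales) and the `separate` function are made up for illustration; the point is only that one nested input comes out as two flat lists of uniform tuples.

```python
import json

def separate(document: str) -> tuple[list[tuple], list[tuple]]:
    """Split a nested JSON document into two flat lists of uniform tuples."""
    customers, sales = [], []
    for customer in json.loads(document):
        # One flat row per customer: (id, name)
        customers.append((customer["id"], customer["name"]))
        # Nested sales are pulled out into their own flat rows,
        # keyed back to the customer by id.
        for sale in customer.get("sales", []):
            sales.append((customer["id"], sale["item"], sale["amount"]))
    return customers, sales

# Hypothetical input: heterogeneous, because only some customers carry sales.
doc = (
    '[{"id": 1, "name": "Ann", "sales": [{"item": "tea", "amount": 3}]},'
    ' {"id": 2, "name": "Bob"}]'
)
customers, sales = separate(doc)
print(customers)  # [(1, 'Ann'), (2, 'Bob')]
print(sales)      # [(1, 'tea', 3)]
```

After the separation, every element of `customers` has the same shape, and so does every element of `sales`.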

Dictionary meanings are,

Homogeneous: of the same kind; alike.
Heterogeneous: diverse in character or content.

FBP already does a bit of this !
Split / Collate / Merge ….

Cooking is applied Chemical Engineering.
This explains why O’Reilly cookbooks are popular :)

What is Homogeneous Data ?

I believe it has the following properties

  • Flat
  • Non-Nested
  • Non-Recursive
  • Uniform Type

Some examples

  • Arrays (assembly)
  • Stacks (assembly, forth)
  • Queues (actor model)
  • Matrices (apl)
  • Tuples (normalized relational algebra , linda)
  • Streams <—– FBP
  • CSV
  • Fixed Strings

Heterogeneous Data

  • Cons Cells (lisp)
  • Trees (objects)
  • Map (everyone) <—–– Datron
  • Messages (event based)
  • XML
  • Files <—— Pipes

** This also made me realize that everywhere there is Homogeneous Data
you will find transformation-based approaches,
if not outright flow-based ones. **

A stronger test for Homogeneous Data can be that any two random elements should be similar in structure.

Array vs Array for example are more Homogeneous than Hash<String, Array>
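
One possible reading of these tests, as a small sketch. The function names and the choice of which types count as "scalar" are my own assumptions, not a standard definition.

```python
def is_homogeneous(elements) -> bool:
    """Flat, non-nested, uniform type: every element is a scalar of one type."""
    scalars = (int, float, str, bytes)  # assumed set of "flat" types
    types = {type(e) for e in elements}
    return len(types) <= 1 and all(issubclass(t, scalars) for t in types)

def row_signature(row: tuple) -> tuple:
    """The structure of a row: the type of each field, in order."""
    return tuple(type(field) for field in row)

def rows_are_uniform(rows) -> bool:
    """The stronger test applied to tuples: any two rows share one signature."""
    return len({row_signature(r) for r in rows}) <= 1

print(is_homogeneous([1, 2, 3]))                    # True
print(is_homogeneous([1, "a"]))                     # False
print(is_homogeneous([[1], [2]]))                   # False, elements are nested
print(rows_are_uniform([(1, "Ann"), (2, "Bob")]))   # True
print(rows_are_uniform([(1, "Ann"), ("x",)]))       # False
```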

Flow = Transformation(Homogeneous Data)
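
A minimal sketch of that equation, assuming a "flow" is just a chain of element-wise transformations applied to a stream. The `flow` helper is hypothetical.

```python
from functools import reduce

def flow(*transformations):
    """Compose element-wise transformations into one function over a stream."""
    def run(stream):
        for item in stream:
            # Feed each element through every step, in order.
            yield reduce(lambda value, step: step(value), transformations, item)
    return run

# Because every element has the same type, the same steps apply to all of them.
celsius_to_fahrenheit = flow(lambda c: c * 9 / 5, lambda f: f + 32)
print(list(celsius_to_fahrenheit([0, 100])))  # [32.0, 212.0]
```

The steps never have to inspect an element to decide what it is, which is what makes the chain so simple.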

What is the Separation Process for Data ?

I think ER modelling / normalization can help here.

In FBP a Node can act like a parser and normalizer to give simple streams
for others to parse from multiple outlets.

Normalisation creates Homogeneous rows / tuples.
This means a table can be

  • Stored efficiently
    ** varchars are cheating
  • Joined ( Collated ) cleanly
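
To see why uniform rows store well, here is a sketch using Python's `struct` module. The three-field row layout (customer id, item id, amount) is invented for the example. Every row takes exactly the same number of bytes, so row number i lives at a computable offset.

```python
import struct

# Assumed row layout: two 32-bit ints and one 64-bit float, no padding.
ROW = struct.Struct("<iid")

rows = [(1, 10, 2.5), (2, 11, 4.0), (3, 12, 1.25)]
table = b"".join(ROW.pack(*r) for r in rows)

def read_row(table: bytes, index: int) -> tuple:
    """Jump straight to row `index`; earlier rows are never scanned or parsed."""
    return ROW.unpack_from(table, index * ROW.size)

print(ROW.size)            # 16
print(len(table))          # 48
print(read_row(table, 2))  # (3, 12, 1.25)
```

A variable-length field would break the offset arithmetic, which is one way to read "varchars are cheating".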

Heterogeneous Data like XML can be very unpredictable.
You need an additional schema enforcer.

CSV, on the other hand, is very fixed and much easier to process.
I wonder if CSV is popular in chemistry data sets.
The Human Genome Project uses CSV.

All in all, Heterogeneous Data can be recognised by its parsing overhead.

Array
Array
Array<Hash<String, String>>

Also from the perspective of the x86 processor, Arrays fit linearly into L1 and L2 caches.

See - http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

Let me list Data Sets in increasing order of Heterogeneity.

  1. Customer Records
  2. Customer Records + Sales Records
  3. Customer + Sales Records + Last Login
  4. An XML dump of 3
  5. An XML dump of 3 with Sales Records in CSV format
  6. An SQL dump of 3 where Customer Records are in XML and Sales are in CSV rows
  7. A base64 dump of 6
  8. A Gzip of 7

As you go down, both Processing and Storage go haywire !
Also, 6-7-8 support far fewer Generic operations.
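
A small sketch of what the last two items cost, using a tiny made-up CSV instead of the full dump from item 6. Every wrapping layer has to be peeled off, in reverse order, before any generic operation can run on the data.

```python
import base64
import csv
import gzip
import io

rows = [["id", "name"], ["1", "Ann"], ["2", "Bob"]]  # hypothetical data

# Innermost layer: plain CSV text.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
plain = buffer.getvalue().encode("utf-8")

# Like items 7 and 8: base64 first, then gzip on top.
wrapped = gzip.compress(base64.b64encode(plain))

def unwrap(blob: bytes) -> list[list[str]]:
    """Undo the layers in reverse order: gunzip, base64-decode, parse CSV."""
    text = base64.b64decode(gzip.decompress(blob)).decode("utf-8")
    return list(csv.reader(io.StringIO(text)))

print(b"Ann" in plain)   # True: a plain substring search works on the CSV
print(unwrap(wrapped))   # [['id', 'name'], ['1', 'Ann'], ['2', 'Bob']]
```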

Homogeneous Node

m input ports
n output ports

The computation time for the Homogeneous Node has a constant upper bound for
1 or 2 or 3 or .. m input IPs (Information Packets), whether they arrive simultaneously or individually.

The key is – the Node performs a transformation.

The simplest case for such a Node is
single input port
single output port
and numeric transformations over an IP.
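
A minimal sketch of that simplest case. The `Node` class, the `DONE` end-of-stream marker and the `drain` helper are all hypothetical names, and the ports are plain `queue.Queue` objects rather than anything from a real FBP runtime.

```python
import queue

DONE = object()  # hypothetical end-of-stream marker

class Node:
    """Single input port, single output port, one numeric transformation per IP."""

    def __init__(self, transform):
        self.transform = transform
        self.inport = queue.Queue()
        self.outport = queue.Queue()

    def run(self):
        while True:
            ip = self.inport.get()
            if ip is DONE:
                self.outport.put(DONE)  # pass the marker downstream
                break
            # Same amount of work for every IP, since every IP is a number.
            self.outport.put(self.transform(ip))

def drain(port):
    """Collect IPs from a port until the end-of-stream marker shows up."""
    return list(iter(port.get, DONE))

double = Node(lambda x: x * 2)
for ip in (1, 2, 3, DONE):
    double.inport.put(ip)
double.run()
print(drain(double.outport))  # [2, 4, 6]
```

Everything here runs in one thread for simplicity; in a real network each Node would run concurrently and block on its input port.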