Data Conversion Engines

What is data conversion?
Legacy systems
Client/server systems
The need for data conversion engines
The Conversion Engine

What is data conversion?

Data conversion is the movement of data from disparate sources into a single target database. In many cases, source data resides on various hardware platforms and in a variety of database types, and is sometimes highly decentralized and chaotically organized. Data conversion is only one aspect of the entire process. Because the process converges multiple data sources into a target warehouse, we also like to call it data convergence.

To better understand data conversion, we must first understand the types of systems and data we are dealing with. Without this background, it is impossible to appreciate the importance of automating the process of data convergence. Most, if not all, large organizations have several data sources based on legacy, proprietary and client/server technology.

Legacy systems:

Legacy data can be defined in many different ways; for simplicity, we will call it older mainframe-based data. Mainframes include the obvious IBM 3090/390 systems, but also systems such as the HP3000 and older DEC VAX hardware. In the past, these systems locked away data that could be accessed only in a proprietary fashion, using the manufacturer's own products or expensive and inefficient COBOL programs. Recently, however, vendors have pumped new life into these systems by supporting open systems concepts and functions, including the ability to access the data from Microsoft applications and from applications developed with Sybase's PowerBuilder application development tool.

Despite attempts to "open" legacy data, major stumbling blocks remain, chief among them the high cost of maintaining these systems. It is not unusual for a large organization to pay several million dollars a year just to keep its legacy systems running. The word "open" also has a different meaning for legacy systems: what it really means is retrofitting a system for an environment it was never intended for. IBM mainframes were designed for SNA (Systems Network Architecture, for those who remember) networks and 3270 devices, not PCs with Windows 95/NT applications accessing their data over TCP/IP networks. In addition, most of the newer packaged applications from vendors such as SAP, PeopleSoft and Baan run mainly on popular client/server systems, though some of these vendors started on the mainframe. Despite these facts, many high-level managers still swear by legacy systems, primarily because of their maturity, reliability and proven battlefield experience. IBM has also done a decent job of updating its software to meld into the client/server look and feel, has lowered price points, and has introduced new RISC-based versions of the AS/400 supporting both OS/400 and AIX.

Legacy databases include VSAM, IMS, IDMS, SUPRA, ADABAS, DB2 and many more. Unfortunately, the requirement for expensive gateways and a lack of throughput have limited the ability to include legacy data in open systems applications. This is a serious and expensive pitfall: it is not unusual for third-party legacy products to cost two to three times as much as similar open systems products. In addition, legacy databases possess a skeleton of features compared to their client/server rivals, which offer powerful capabilities such as data replication, stored procedures, parallel query, full SQL functionality, and many more.


Client/Server Systems:

This category includes hardware supporting UNIX operating systems such as HP-UX, IBM AIX, Sun Solaris and DEC UNIX. These operating systems are powerful and can scale to tens of processors on Symmetric Multi-Processor (SMP) machines. With wider support for Microsoft's NT operating system on traditional UNIX hardware, NT is fast becoming the operating system of choice for smaller workgroup applications, though it continues to suffer from scalability issues. Hardware for UNIX and NT systems usually costs a fraction of what mainframe hardware does. UNIX was also designed and built for open TCP/IP networks, with no need for additional gateways or protocol converters. Thus the name "open systems" evolved.

UNIX systems work with all the popular client/server databases, such as Oracle, Sybase, Informix and Microsoft SQL Server, in addition to databases designed specifically for data warehousing and decision support (DSS) from Arbor Software and Red Brick. Most of today's popular applications run on several of these databases, so customers have a wider choice of hardware and software. Fierce competition has driven down prices while vastly improving performance. Many UNIX servers also come packaged with Web capabilities, and some include a limited version of a database (limited in the number of users and features).

Unfortunately, despite the term "open systems", UNIX is still a vendor-based operating system, which means that each hardware manufacturer supports its own flavor of UNIX. The flavors are very similar, however, and conform to the POSIX standards. The real problem is that databases from different vendors cannot seamlessly exchange data with one another. This can have a serious impact on corporations attempting to "link" disparate data sources (Oracle, Sybase, DB2, etc.) into a data mart or warehouse.
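
To make the problem concrete, below is a minimal sketch (in Python, using the pyodbc library's standard connect/cursor calls) of the kind of hand-coded glue a heterogeneous shop ends up writing without an engine. The DSN names, tables and columns are hypothetical.

    # Hand-coded "glue": pull customer rows from two vendor databases
    # into one warehouse table. DSNs, tables and columns are hypothetical.
    import pyodbc

    oracle = pyodbc.connect("DSN=oracle_src")    # an Oracle ODBC data source
    sybase = pyodbc.connect("DSN=sybase_src")    # a Sybase ODBC data source
    warehouse = pyodbc.connect("DSN=warehouse")  # the target data mart/warehouse

    rows = []
    for conn, query in [(oracle, "SELECT cust_id, cust_name FROM customers"),
                        (sybase, "SELECT id, name FROM customer_master")]:
        cursor = conn.cursor()
        cursor.execute(query)        # each source speaks its own SQL dialect
        rows.extend(cursor.fetchall())

    target = warehouse.cursor()
    target.executemany(
        "INSERT INTO dim_customer (cust_id, cust_name) VALUES (?, ?)", rows)
    warehouse.commit()

Every new source means another block of code like this, which is exactly the maintenance burden conversion engines set out to eliminate.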


The need for data conversion engines:

So what we really have is proprietary mainframe data and "open systems" data that is proprietary in disguise. Oracle data cannot simply be exchanged with Sybase data, and IMS will never be mistaken for Microsoft SQL Server. A company can set a direction to standardize on a specific vendor's database, but it is virtually impossible to dictate database standards to the user community. In addition, mergers and acquisitions are commonplace. IT professionals must expect the unexpected.

Conversion engines make data conversion a far easier task than ever before. First-generation tools, such as Prism's Warehouse Solution, ETI's Extract and Carleton's Passport, somewhat automated the process but remain single-threaded, code-oriented solutions; the term "code generation" tools is commonly associated with these vendors' products. First-generation tools can be very expensive in terms of both purchase cost and training: the learning curve can be a monstrous hurdle, and experienced COBOL programmers are required. These products also rarely reduce the time to production for data warehouses, and managing COBOL code is never a desired task.


The Conversion Engine:

Finally! The conversion engine is a new breed of emerging tool. The conversion engine is not based on generated code; it performs most of its work in memory, running several "processes" at once. These products vary in their support for relational and non-relational databases, so make sure your environment is fully supported. Conversion engines are written in C or C++, run on UNIX and NT platforms, and are designed for distributed enterprise data convergence.
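
As a rough illustration of this in-memory, multi-threaded style, here is a minimal sketch in Python of rows flowing through concurrent transform workers. The row layout and transformation rules are invented for the example; a real engine implements this in compiled C/C++.

    # In-memory pipeline: raw rows flow through several concurrent
    # transform workers to a "clean" queue; no intermediate files,
    # no generated COBOL. Row layout and rules are illustrative.
    import queue
    import threading

    raw = queue.Queue()
    clean = queue.Queue()

    def transform_worker():
        while True:
            row = raw.get()
            if row is None:           # sentinel: no more work
                break
            clean.put({"cust_id": int(row["id"]),
                       "cust_name": row["name"].strip().upper()})

    workers = [threading.Thread(target=transform_worker) for _ in range(4)]
    for w in workers:
        w.start()

    for row in [{"id": "1", "name": " acme "}, {"id": "2", "name": "globex"}]:
        raw.put(row)                  # the extract step feeds the pipeline
    for _ in workers:
        raw.put(None)
    for w in workers:
        w.join()

    while not clean.empty():
        print(clean.get())            # the load step would write these out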

I will use the term "toolset" to describe data conversion engines because most are integrated toolsets. Typically there is a design tool, a meta data repository tool, a server manager and a transformation designer; how tightly these pieces are integrated varies among the vendors. Regardless, data conversion engines streamline the process of moving data from production systems into a decision support system. With these second-generation toolsets, users can focus on source-to-target mappings and on building a truly robust, end-user-oriented data mart/warehouse. Meta data is automatically captured and stored in the meta data repository, and the toolsets consult that repository for all design, transformation, management and load-scheduling functions.
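
To show what a source-to-target mapping and its meta data might look like, here is a minimal sketch in Python. The field names, tables and schedule are hypothetical; actual repository schemas are vendor-specific.

    # A source-to-target mapping as a toolset might record it. The mapping
    # itself is the meta data: one structure drives the design view, the
    # transformation run and the load schedule. All names are hypothetical.
    mapping = {
        "name": "customers_to_dim_customer",
        "source": {"database": "ORDERS_PROD", "table": "CUSTOMERS"},
        "target": {"database": "WAREHOUSE", "table": "DIM_CUSTOMER"},
        "columns": [
            # (source column, target column, transformation rule)
            ("CUST_ID",   "CUST_KEY",  "cast to integer"),
            ("CUST_NAME", "CUST_NAME", "trim and uppercase"),
            ("REGION_CD", "REGION",    "look up region name"),
        ],
        "schedule": "nightly at 02:00",
    }

    # The repository is simply the collection of such mappings,
    # queryable by the design, management and scheduling tools alike.
    repository = {mapping["name"]: mapping}
    for name, m in repository.items():
        print(name, "->", m["target"]["table"], "(", m["schedule"], ")")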

Data conversion engines are multi-threaded and can therefore load multiple transformation mappings concurrently. This is a major benefit over the code generators, because these mappings can come from one or several sources at once. Most data conversion engines can perform on-the-fly table joins across like data sources (e.g. all Oracle data or all Informix data), and some can even perform on-the-fly joins across different data sources, a big benefit for heterogeneous shops.
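
An on-the-fly heterogeneous join amounts to matching rows from two independently extracted sources in memory. The minimal Python sketch below uses a simple hash join; the row data and key names are illustrative only.

    # On-the-fly join of two different sources: rows from each are read
    # into memory and matched on a shared key (a simple hash join),
    # rather than staged and joined inside one database.
    orders = [  # e.g., fetched from an Oracle source
        {"cust_id": 1, "amount": 250.0},
        {"cust_id": 2, "amount": 75.5},
    ]
    customers = [  # e.g., fetched from an Informix source
        {"cust_id": 1, "name": "ACME"},
        {"cust_id": 2, "name": "GLOBEX"},
    ]

    by_id = {c["cust_id"]: c for c in customers}   # build side of the join
    joined = [{"cust_name": by_id[o["cust_id"]]["name"], "amount": o["amount"]}
              for o in orders if o["cust_id"] in by_id]  # probe side
    print(joined)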

The bottom line for these toolsets is that they should save time and money in building industrial-strength data marts and warehouses.