Tuesday, May 23, 2017

SAP BODS Notes-Part1

Architecture:
SAP Business Objects Data Services is a data warehousing product that delivers a single enterprise-class software solution for data integration (ETL), data management (data quality) and text data processing (unstructured text such as emails, paragraphs, etc.).

DS Components
Ø  Designer: This is the front-end GUI (Graphical User Interface) tool where developers log in and build jobs in SAP Data Services to move data from one system to another (or within the same system) and define the transformation logic. It is used to create datastores and data mappings, to manage metadata and to manually execute jobs; it is the place where we develop, validate and execute ETL jobs. We can connect to the central repository only through the Designer, and we can also import and export jobs to a file or to a repository from the Designer.
Ø  Management Console: This is a web-based console for managing SAP Data Services, for example scheduling jobs and viewing system statistics such as memory usage, job runtimes and CPU utilization.
Ø  Repository: SAP Data Services is a software application and needs a backend database to store the objects created by individual developers as well as system-generated metadata. Repositories are used to store all of this data and can be created in databases such as Oracle, HANA, MS SQL Server and MySQL.
Note: We cannot open the Data Services Designer without selecting a repository at login.
We have 3 different types of repositories in Data Services, namely Local, Central and Profiler.
Local Repository: This is the mandatory repository in Data Services; at least one local repository must be set up before starting to use the system. Usually an individual local repository is created for each developer to store all of his/her developments. The connections to the database servers are also created in the local repository; we call such a connection a datastore.
Central Repository: This is used for better security and integration between different developments in a multi-user environment. When multiple developers work on the same development, this type of repository is recommended because of its robust version control: users access objects through check-out and check-in functionality, which works like locking (while one user has an object checked out, no other user can modify it). This repository is optional.
Profiler Repository: This stores the data profiling information produced by developers, along with the profiling task history, and is used to determine the quality of the data. This repository is optional.
Ø  Job Server: Each repository is associated with one or more Job Servers. This is one of the main server components in Data Services and executes all the batch jobs created by developers. A repository must be attached to at least one Job Server, otherwise its jobs cannot be executed. The Job Server's work is to start the job engine, which actually runs the jobs (the generated .bat files contain the job launch commands). The SAP Data Services Job Server can move data in either batch or real-time mode and uses distributed query optimization, multi-threading, in-memory caching, in-memory data transformations, and parallel processing to deliver high data throughput and scalability.
Ø  Job Engine: It executes the requested jobs.
Ø  Access Server: This server is used to execute the real-time jobs created by developers in the repositories; it serves DI real-time service jobs. It controls the XML message passing between the source and target in real-time jobs.
Ø  Address Server: This server performs data quality and data cleansing activities in Data Services. It works with the address directories and dictionaries that are required for data cleansing.
Ø  Web Server: This is used for administration activities: scheduling jobs, executing jobs, creating users and similar tasks can be done through this server.
Ø  CMC (Central Management Console): This is a web-based tool for managing Data Services users, assigning repository configurations to Job Servers, security management, etc.
Ø  Data Services service names to start: al_Designer, al_jobserver, al_jobservice.

DS Objects Hierarchy:

Ø  Projects: A project is the highest level of the hierarchy, in which we create jobs. It is used to group the jobs belonging to an individual process or application; having the jobs grouped under projects makes the objects easier to maintain. We can create multiple projects in the same repository, but only one project can be open at a time in the project area of the Designer.
Ø  Jobs: A job is the only executable object in SAP Data Services. It can contain any number of workflows, data flows, scripts and conditionals. Developers can execute jobs manually in the Designer, or schedule them using the Management Console.
Ø  Workflow: A workflow is used to define the execution sequence of data flows, with the help of conditional operations where needed. It is an optional object in which we place data flows and scripts. For example, if a job has 5 data flows of which 2 must run daily and 3 only at month end, we can define two workflows to group the data flows and then write a condition based on the calendar date to control execution. Workflows are mainly used to organize the data flows within a job; it is not mandatory to have workflows in a job.
Ø  Conditionals: Conditionals are single-use objects that implement if-then-else logic in a workflow to control the execution process.
Ø  Scripts: Scripts are single-use objects used to call functions, assign values to variables and manipulate tables at the database level within a workflow; scripts cannot be created inside data flows. Any required logic that is not built into BODS can be implemented by writing scripts. One example is job control, which requires scripts that capture the job start and end times and insert them into a database table (see the SQL sketch after this list).
Ø  Dataflow: A data flow contains the actual source and target objects along with the transformations defined between them. This is the most important object in the hierarchy, where we create the mappings and define everything about the data movement: source tables, target tables and the transformation logic. Once a data flow is created it can be included either directly in a job or in a workflow (which is then placed in a job) to execute the ETL process.
Ø  Datastore: Datastores are defined to access other systems and use them as either sources or targets. We can create datastores for most of the applications, databases and software available in the industry.
Ø  File Formats: If the source or target is an external file such as a flat file, Excel workbook or XML file, we configure it using the File Formats option in Data Services. Once a file format is defined, it can be used as either a source or a target.
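As an illustration of the job-control pattern mentioned under Scripts above, here is a minimal SQL sketch of the statements such a script typically issues (for example through the built-in sql() function). The JOB_CONTROL table, its columns and the job name are hypothetical; adapt them to your own audit design.

-- Hypothetical job-control table for recording job start/end times.
CREATE TABLE JOB_CONTROL (
    JOB_NAME   VARCHAR(100),
    START_TIME TIMESTAMP,
    END_TIME   TIMESTAMP,
    STATUS     VARCHAR(20)
);

-- Issued by a script at the start of the job.
INSERT INTO JOB_CONTROL (JOB_NAME, START_TIME, STATUS)
VALUES ('JOB_SALES_LOAD', CURRENT_TIMESTAMP, 'STARTED');

-- Issued by a script at the end of the job.
UPDATE JOB_CONTROL
SET END_TIME = CURRENT_TIMESTAMP,
    STATUS   = 'COMPLETED'
WHERE JOB_NAME = 'JOB_SALES_LOAD'
  AND END_TIME IS NULL;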
Transforms:
 A transform enables you to control how data sets change in a data flow. SAP Data Services provides many transforms for defining the transformation logic in the system. These transforms are grouped into four categories: 1. Data Integrator, 2. Data Quality, 3. Platform, 4. Text Data Processing.


Data integrator transforms
Ø  1. Data Transfer transform:
•      This is used to transfer data from the source to the target system through an intermediate transfer table or file.
•      It is mainly used when millions of records are being processed.
•      It splits the task into different sub data flows and executes them serially.
•      This keeps memory usage down and lets each sub data flow use multiple threads, so performance increases.
Ø  2. Date Generation transform:
•      It produces dates in sequence, from a start date to an end date that we specify.
•      The increment can be daily, weekly or monthly, or it can be controlled through global variables.
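To make the idea concrete, here is a hedged SQL sketch of the equivalent logic, using a recursive common table expression to produce one row per day for a hypothetical date range (exact interval syntax varies by database):

-- Generate one row per day from 2017-01-01 to 2017-01-31 (hypothetical range).
WITH RECURSIVE date_series (dt) AS (
    SELECT DATE '2017-01-01'
    UNION ALL
    SELECT dt + INTERVAL '1' DAY
    FROM   date_series
    WHERE  dt < DATE '2017-01-31'
)
SELECT dt AS generated_date
FROM   date_series;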
Ø  3. Effective Date Transform:
•      This transform calculates an “effective to” date for records that contain an “effective date” column of date type.
•      The “effective to” column is added to the output alongside the source schema and gives the date until which each record is valid.
•      The effective date and effective to columns work like valid-from and valid-to respectively.
•      This lets us identify each record’s validity range.
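In SQL terms the same result can be obtained with a window function: each record's effective-to date is the next record's effective date for the same business key. The table and column names below are hypothetical:

-- Derive EFFECTIVE_TO as the next EFFECTIVE_DATE per business key.
SELECT CUST_ID,
       EFFECTIVE_DATE,
       LEAD(EFFECTIVE_DATE) OVER (
           PARTITION BY CUST_ID
           ORDER BY EFFECTIVE_DATE
       ) AS EFFECTIVE_TO   -- NULL for the latest (still open) record
FROM   CUSTOMER_HISTORY;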
Ø  4. Hierarchy Flattening Transform:
•      This transform builds the complete hierarchy for the source data set, based on the given parent and child columns, and writes it to the target table.
•      It can describe the hierarchy in either horizontal or vertical format, as required.
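The vertical flattening output can be approximated in SQL with a recursive query that walks the parent/child columns and records every ancestor/descendant pair with its depth. Table and column names are hypothetical and the hierarchy is assumed to be cycle-free:

-- Vertical flattening: one row per ancestor/descendant pair with its depth.
WITH RECURSIVE hier (ANCESTOR_ID, DESCENDANT_ID, DEPTH) AS (
    SELECT PARENT_ID, CHILD_ID, 1
    FROM   EMPLOYEE_HIERARCHY
    UNION ALL
    SELECT h.ANCESTOR_ID, e.CHILD_ID, h.DEPTH + 1
    FROM   hier h
    JOIN   EMPLOYEE_HIERARCHY e ON e.PARENT_ID = h.DESCENDANT_ID
)
SELECT ANCESTOR_ID, DESCENDANT_ID, DEPTH
FROM   hier;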
Ø  5. Table Comparison transform:
•      It is used to identify the data manipulation operations needed on the target by comparing the incoming data with a comparison table.
•      It detects the changes and sends the rows to the target flagged as insert, update or delete according to the change.
•      We choose which columns are checked for changes by dragging them into the compare columns tab of the transform editor.
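Conceptually the comparison is a full outer join of the incoming data with the comparison table on the key plus the compare columns. A hedged sketch of the flagging logic, with hypothetical tables, keys and compare columns (NULL handling omitted for brevity):

-- Flag each difference between source and comparison table as INSERT, UPDATE or DELETE.
SELECT COALESCE(s.CUST_ID, t.CUST_ID) AS CUST_ID,
       CASE
           WHEN t.CUST_ID IS NULL                    THEN 'INSERT'  -- new in source
           WHEN s.CUST_ID IS NULL                    THEN 'DELETE'  -- missing from source
           WHEN s.NAME <> t.NAME OR s.CITY <> t.CITY THEN 'UPDATE'  -- compare columns differ
       END AS OP_CODE
FROM   SRC_CUSTOMER s
FULL OUTER JOIN TGT_CUSTOMER t ON s.CUST_ID = t.CUST_ID
WHERE  t.CUST_ID IS NULL
   OR  s.CUST_ID IS NULL
   OR  s.NAME <> t.NAME
   OR  s.CITY <> t.CITY;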
Ø  6. Key Generation Transform:
•      It is used to generate surrogate keys for the newly inserted rows.
•      This transform reads the maximum existing key value from the table specified in it and generates new keys starting from that value.
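The same idea expressed in SQL: read the current maximum key from the specified table and offset a row number for the incoming rows (table and column names are hypothetical):

-- Assign new surrogate keys starting after the current maximum in the dimension.
SELECT (SELECT COALESCE(MAX(CUST_KEY), 0) FROM DIM_CUSTOMER)
       + ROW_NUMBER() OVER (ORDER BY CUST_ID) AS CUST_KEY,
       CUST_ID,
       NAME
FROM   STG_NEW_CUSTOMERS;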
Ø  7. History Preserving Transform:
•      As the name indicates, it is used to preserve the history of records.
•      When a record changes in the source, instead of updating the row in the target, this transform inserts the updated record as a new row while keeping the previous version.
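This is essentially Slowly Changing Dimension Type 2 loading. A hedged SQL sketch of what happens when a changed record (here customer 1001) arrives; the table, columns and values are hypothetical:

-- Close the currently valid version of the record...
UPDATE DIM_CUSTOMER
SET    VALID_TO     = CURRENT_DATE,
       CURRENT_FLAG = 'N'
WHERE  CUST_ID = 1001
  AND  CURRENT_FLAG = 'Y';

-- ...and insert the changed record as a new version.
INSERT INTO DIM_CUSTOMER (CUST_ID, CITY, VALID_FROM, VALID_TO, CURRENT_FLAG)
VALUES (1001, 'Berlin', CURRENT_DATE, DATE '9999-12-31', 'Y');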
Ø  8. Pivot Transform:
•      It is used to convert selected columns into rows.
•      That is, for each value in a selected column, the Pivot transform produces a new row in the target table.
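In SQL terms, pivoting columns to rows is a UNION ALL of one SELECT per pivoted column. The wide table and its quarter columns below are hypothetical:

-- Turn the quarter columns Q1..Q4 into one row per quarter.
SELECT PRODUCT_ID, 'Q1' AS QUARTER, Q1_SALES AS SALES FROM SALES_WIDE
UNION ALL
SELECT PRODUCT_ID, 'Q2', Q2_SALES FROM SALES_WIDE
UNION ALL
SELECT PRODUCT_ID, 'Q3', Q3_SALES FROM SALES_WIDE
UNION ALL
SELECT PRODUCT_ID, 'Q4', Q4_SALES FROM SALES_WIDE;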
Ø  9. Reverse Pivot Transform:
•      It is used to convert rows into columns.
•      It combines data from multiple rows into a single row.
•      It creates new columns to hold the combined values.
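The reverse operation in SQL is a conditional aggregation that collapses the rows back into columns (hypothetical names, matching the pivot example above):

-- Collapse one row per quarter back into a single row per product.
SELECT PRODUCT_ID,
       MAX(CASE WHEN QUARTER = 'Q1' THEN SALES END) AS Q1_SALES,
       MAX(CASE WHEN QUARTER = 'Q2' THEN SALES END) AS Q2_SALES,
       MAX(CASE WHEN QUARTER = 'Q3' THEN SALES END) AS Q3_SALES,
       MAX(CASE WHEN QUARTER = 'Q4' THEN SALES END) AS Q4_SALES
FROM   SALES_LONG
GROUP BY PRODUCT_ID;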
Ø  10. XML Pipe Line Transform:
•      It processes large XML files. Instead of loading all instances into memory at once, it takes one instance at a time, processes it and releases the memory.
•      The next instance is then loaded into the same memory location, so memory consumption is reduced.
Platform transforms
Ø  1. Case Transform:
•      It is used to split the input rows based on the given conditions and route them to different targets accordingly.
•      Rows that do not satisfy any of the conditions go to the output labelled “default”.
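Each Case branch behaves like a separate filtered copy of the input feeding its own target. A hedged SQL sketch with hypothetical region conditions and target tables:

-- Branch 1: rows satisfying the first condition.
INSERT INTO SALES_EUROPE
SELECT * FROM SALES_ALL WHERE REGION = 'EU';

-- Branch 2: rows satisfying the second condition.
INSERT INTO SALES_AMERICAS
SELECT * FROM SALES_ALL WHERE REGION = 'AM';

-- Default branch: rows that matched no condition.
INSERT INTO SALES_OTHER
SELECT * FROM SALES_ALL WHERE REGION NOT IN ('EU', 'AM') OR REGION IS NULL;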
Ø  2. Map Operation Transform:
      This transform is used to alter the data manipulation operation codes.
•      It takes rows with their operation codes as input and changes them to a specified operation code, for example update to insert or update to update.
Ø  3. Merge Transform:
      This transform acts like UNION ALL in SQL.
•      It combines input data sets coming from different sources and sends the combined set to the target.
•      It allows duplicates.
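As noted above, this is the same as UNION ALL; a minimal sketch with two hypothetical, identically structured inputs:

-- Combine two inputs with matching schemas, keeping duplicates.
SELECT CUST_ID, NAME, CITY FROM CUSTOMERS_EUROPE
UNION ALL
SELECT CUST_ID, NAME, CITY FROM CUSTOMERS_AMERICAS;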
Ø  4. Query Transform:
It is a very important transform and is frequently used in jobs. Typical operations performed in a Query transform include:
•      Mapping columns from the input to the output schema
•      Assigning primary keys in the output schema
•      Adding new columns to the output schema
•      Adding function calls
•      Performing join operations
•      Applying GROUP BY and ORDER BY clauses
•      Removing duplicate rows
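A Query transform corresponds closely to a single SELECT statement. The hedged sketch below shows a column mapping with a function call, a join, an aggregation and an ORDER BY in one query; all table and column names are hypothetical:

-- Roughly what one Query transform configuration expresses.
SELECT c.CUST_ID,
       UPPER(c.NAME) AS CUST_NAME,     -- column mapping with a function call
       SUM(o.AMOUNT) AS TOTAL_AMOUNT   -- aggregation
FROM   CUSTOMERS c
JOIN   ORDERS    o ON o.CUST_ID = c.CUST_ID   -- join defined in the transform
GROUP BY c.CUST_ID, UPPER(c.NAME)
ORDER BY TOTAL_AMOUNT DESC;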
Ø  5. Row Generation Transform:
•     Produces a data set with a single column. The column values start at the number set in the “Row number starts at” option and increment by one up to the specified number of rows.
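Equivalent logic in SQL, using a recursive query to generate a fixed number of sequential values (the start value 1 and row count 10 are arbitrary):

-- Generate 10 rows numbered 1..10.
WITH RECURSIVE numbers (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 10
)
SELECT n AS ROW_NUM
FROM   numbers;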
Ø  6. SQL Transform:
•      This transform does not take any input.
•      It is useful when none of the built-in transforms can perform the required operation.
•      We write the SQL query manually inside this transform.
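The query written inside the transform is ordinary SQL executed against the datastore's database. A trivial hedged example with a hypothetical table and columns:

-- Hand-written query used when built-in transforms are not enough.
SELECT REGION,
       COUNT(*)    AS ORDER_COUNT,
       SUM(AMOUNT) AS TOTAL_AMOUNT
FROM   ORDERS
WHERE  ORDER_DATE >= DATE '2017-01-01'
GROUP BY REGION;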
Ø  7. Validation Transform:
      This is used to validate the input rows against specified criteria.
•      Rows that satisfy the criteria are routed to the output labelled “Pass”.
•      Rows that do not satisfy the criteria are routed to the output labelled “Fail”.
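In SQL terms the Pass and Fail outputs are two complementary filters over the same input. The validation rule and names below are hypothetical:

-- Pass output: rows meeting the validation rule.
INSERT INTO CUSTOMERS_PASS
SELECT * FROM CUSTOMERS_IN
WHERE  EMAIL IS NOT NULL AND EMAIL LIKE '%@%';

-- Fail output: everything else, kept for correction or reporting.
INSERT INTO CUSTOMERS_FAIL
SELECT * FROM CUSTOMERS_IN
WHERE  EMAIL IS NULL OR EMAIL NOT LIKE '%@%';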
Data quality transforms
These are used for parsing, standardizing and correcting data. Data Services provides transforms such as address cleansing, match, country code, geocode, etc.

Naming conventions

Project names start with --> "PRJ"
Job names start with --> "JOB_"
Work flow names start with --> "WF_"
Data flow names start with --> "DF_"
Script names start with --> "SC_"
Query names start with --> "QRY"
Function names start with --> "FN_"
Operational data store names start with --> "ODS_"
Global variable names start with --> "GV_"
If a DF extracts data from source table Person, its name would be DF_EXT_PERSON, etc.
All target tables should include an "ETL_DATE" column.
Delete and truncate operations are executed in scripts (the "Delete data from table before loading" option is not used).
Dimension table names start with --> "DIM_"
Fact table names start with --> "FACT_"
Staging table names start with --> "ST_"
Temp table names start with --> "TEMP_"
 
