Architecture:
SAP BusinessObjects Data Services is a data warehousing product that delivers a single enterprise-class software solution for data integration (ETL), data management (data quality) and text data processing (unstructured text such as emails, paragraphs, etc.).
DS Components
Ø Designer: Designer is used to create datastores and data mappings, to manage metadata and to manually execute jobs. It is the front-end GUI (Graphical User Interface) tool where developers log in and build jobs in SAP Data Services to move data from one system to another, or within the same system, and define the logic for transformations. It is the place where we develop, validate and execute the ETL jobs. We can connect to the central repository only through the Designer, and we can also import and export jobs to a file or repository from the Designer.
Ø Management Console: It is a web-based console for managing SAP Data Services, for example scheduling jobs and looking at system statistics such as memory usage, job runtimes, CPU utilization etc.
Ø Repository: SAP Data Services is a software application and needs a backend database to store the objects created by individual developers as well as system-generated metadata. Repositories are used to store all this data, and they can be created in any supported database such as Oracle, HANA, MS SQL Server and MySQL.
Note: We cannot open the Data Services Designer without selecting a repository at login.
We have 3 different types of repositories in Data Services, namely Local, Central and Profiler.
Local Repository: This is a mandatory repository in Data Services; at least one local repository must be set up before we start using the Data Services system. Usually an individual local repository is created for each developer to store all of his/her developments. The connections to the database servers are also created in the local repository; we call these connections datastores.
Central Repository: This is required for better security and integration between different developments in a multi-user environment. When multiple developers are working on the same development, this type of repository is recommended because of its robust version control: users access objects through check-out and check-in functionality, which works like locking (when one user has checked out an object, it cannot be changed by any other user). This is an optional repository in the system.
Profiler Repository: This is used to store the data profiling information generated by developers, along with the profiling task history, and is used to determine the quality of data. This is an optional repository in the system.
Ø Job Server: Each repository is associated with one or more Job Servers. This is one of the main server components in Data Services and is used to execute all the batch jobs created by developers in the system. A repository should be attached to at least one Job Server to execute the jobs in that repository; otherwise developers cannot execute their jobs. The Job Server works with the generated .bat files (which contain the jobs), and its role is to start the job engine, which executes the jobs. The SAP Data Services Job Server can move data in either batch or real-time mode and uses distributed query optimization, multi-threading, in-memory caching, in-memory data transformations, and parallel processing to deliver high data throughput and scalability.
Ø Job Engine: It
executes the requested jobs.
Ø Access Server: This server is used to execute the real-time jobs created by developers in the repositories, i.e. it is used for real-time service jobs. It controls the XML message passing between the source and target nodes in real-time jobs.
Ø Address Server: This server is used to perform data quality and data cleansing activities in Data Services. It is associated with the address directories and dictionaries that are required for data cleansing.
Ø Web Server: This is used for administration activities. Scheduling jobs, executing jobs, creating users, etc. are tasks that can be done through this server.
Ø CMC (Central Management Console): This is a web-based tool for managing Data Services users, assigning repository configurations to Job Servers, security management, etc.
Ø Data Services service names: al_Designer, al_jobserver, al_jobservice
DS Objects Hierarchy:
Ø Projects: A project is the highest level of the hierarchy, in which we create jobs. It is used to group the jobs relevant to an individual process or application. Having the jobs grouped under projects makes it easy to maintain the objects in the system. We can have multiple projects created in the same repository, but only one project can be opened at a time in the project area in the Designer.
Ø Jobs: A job is the only executable object in SAP Data Services. It can contain any number of workflows, data flows, scripts and conditionals. Developers can execute jobs manually in the Designer, or jobs can be scheduled using the Management Console.
Ø Workflow: A workflow is used to define the data flow execution sequence, with the help of conditional operations where needed. It is an optional object in which we place data flows and scripts. For example, if we have 5 data flows in a job, of which 2 have to run daily and 3 have to run at month end, we can define two workflows to group the data flows and then write a condition based on the calendar date to control the execution, as sketched below. Workflows are mainly used to organize the data flows within a job, and it is not mandatory to have workflows in a job.
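In Data Services this grouping and calendar check are built graphically with two workflows and a conditional, not with code; the following is only a minimal Python sketch of the same logic, with made-up data flow names used purely for illustration.

    import calendar
    from datetime import date

    def run(df_name):
        # Stand-in for executing a data flow (in DS the data flow simply sits
        # inside the workflow and runs as part of the job).
        print("running", df_name)

    daily_flows = ["DF_EXT_ORDERS", "DF_EXT_CUSTOMERS"]               # run every day
    month_end_flows = ["DF_AGG_SALES", "DF_AGG_STOCK", "DF_AGG_FIN"]  # month end only

    today = date.today()
    for df in daily_flows:
        run(df)

    # Run the month-end group only on the last day of the month.
    if today.day == calendar.monthrange(today.year, today.month)[1]:
        for df in month_end_flows:
            run(df)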
Ø Conditionals: Conditionals are single-use objects used to implement if-then-else logic in a workflow to control the execution process.
Ø Scripts: Scripts are single-use objects used to call functions, assign values to variables and manipulate tables at the database level within a workflow. Scripts cannot be created inside data flows. Any required logic that is not available through the built-in BODS objects can be implemented by writing scripts. One example is maintaining job control, which requires a script to pick up the start time and end time of a job and insert them into a database table; a conceptual sketch of this follows below.
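In Data Services the job-control logic would be written in the DS script language inside a script object; below is only a rough Python equivalent of the idea, using sqlite3 and an assumed JOB_CONTROL table with JOB_NAME, START_TIME and END_TIME columns (all names are illustrative, not part of the product).

    import sqlite3
    from datetime import datetime

    conn = sqlite3.connect("etl_control.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS JOB_CONTROL
                    (JOB_NAME TEXT, START_TIME TEXT, END_TIME TEXT)""")

    job_name = "JOB_SALES_LOAD"                 # hypothetical job name
    start_time = datetime.now().isoformat()     # captured before the data flows run

    # ... the job's data flows would execute here ...

    end_time = datetime.now().isoformat()       # captured after the data flows finish
    conn.execute("INSERT INTO JOB_CONTROL VALUES (?, ?, ?)",
                 (job_name, start_time, end_time))
    conn.commit()
    conn.close()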
Ø Dataflow: A data flow contains the actual source and target objects along with the transformations defined between them. This is the most important object of the hierarchy, where we create our mappings: it is where we define everything about the data movement, such as the source table, the target table and the transformation logic. Once a data flow is created, we can include it either directly in a job or in a workflow (which is then placed in a job) to execute the ETL process.
Ø Datastore: Datastores are defined to access other systems and use them as either sources or targets. We can create datastores for most of the applications, databases and software available in the industry.
Ø File Formats: If the source or target is an external file such as a flat file, Excel workbook or XML file, we can configure it using the File Formats option in Data Services. Once a file format is defined, we can use it as either a source or a target.
Transforms:
A transform enables you to control how data sets change in a data flow. SAP Data Services provides a lot of transforms to define the transformation logic in the system. These transforms are grouped into four different categories: 1. Data Integrator, 2. Data Quality, 3. Platform, 4. Text Data Processing.
Data Integrator transforms
Ø 1. Data Transfer transform:
• This is used to transfer the data from the source to the target system.
• It is mainly used when millions of records are being processed.
• It splits the task into different sub data flows and executes them serially.
• It is mainly useful when multiple threads are available, so performance increases.
Ø 2. Date Generation transform:
• It produces dates serially from a start date to an end date specified by us.
• The increment may be daily, weekly or monthly, or we can assign it manually using a global variable.
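A minimal Python sketch of what this transform produces (a single date column from a given start date to a given end date, here with a daily increment):

    from datetime import date, timedelta

    def generate_dates(start, end, step_days=1):
        # Yield one date per output row, from start to end inclusive.
        current = start
        while current <= end:
            yield current
            current += timedelta(days=step_days)

    rows = list(generate_dates(date(2024, 1, 1), date(2024, 1, 31)))
    print(rows[0], rows[-1], len(rows))   # 2024-01-01 2024-01-31 31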
Ø 3. Effective Date transform:
• This transform calculates an "effective to" date for records that contain an "effective date" field of date type.
• The "effective to" column is added to the target along with the source schema, and it gives the date until which each record is valid.
• The effective date and effective to columns act like valid-from and valid-to respectively, so we can identify the validity range of each record.
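A minimal Python sketch of the idea, assuming illustrative data of (employee_id, salary, effective_date) and an assumed default high date for the latest record:

    from datetime import date

    HIGH_DATE = date(9000, 12, 31)   # assumed default "effective to" for the latest record

    rows = [(1, 1000, date(2023, 1, 1)),
            (1, 1200, date(2023, 7, 1)),
            (1, 1500, date(2024, 1, 1))]
    rows.sort(key=lambda r: (r[0], r[2]))      # order by key, then effective date

    out = []
    for i, (key, value, eff_date) in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        # "effective to" is the next record's effective date for the same key.
        eff_to = nxt[2] if nxt and nxt[0] == key else HIGH_DATE
        out.append((key, value, eff_date, eff_to))

    for r in out:
        print(r)   # each record now carries its validity range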
Ø 4. Hierarchy Flattening transform:
• This transform builds the complete hierarchy in the target table for the source data set, describing the hierarchy based on the given parent and child columns.
• It can describe the hierarchy in either horizontal or vertical format, based on our request.
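A minimal Python sketch of the vertical flattening idea: from (parent, child) rows, produce one row per ancestor/descendant pair with its depth (the horizontal format would instead spread the levels across columns). The node names are illustrative.

    edges = [("CEO", "VP1"), ("CEO", "VP2"), ("VP1", "MGR1"), ("MGR1", "EMP1")]

    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    rows = []
    def walk(ancestor, node, depth):
        # Record every descendant of `ancestor`, however deep it sits.
        for child in children.get(node, []):
            rows.append((ancestor, child, depth + 1))
            walk(ancestor, child, depth + 1)

    for parent in children:
        walk(parent, parent, 0)

    for row in rows:
        print(row)   # e.g. ('CEO', 'EMP1', 3), ('VP1', 'EMP1', 2), ...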
Ø 5. Table Comparison transform:
• It is used for identifying data manipulation operations for the source rows.
• It identifies changes between the source and the comparison table and transfers the data to the target with rows flagged as insert, update or delete according to the change.
• We can select which columns' changes have to be detected by dragging those columns into the compare columns tab in the transform editor.
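A minimal Python sketch of the comparison idea, using an assumed key column "id" and one compare column "name":

    target = {1: {"name": "Alice"}, 2: {"name": "Bob"}}          # existing comparison table
    incoming = [{"id": 1, "name": "Alice"},                      # unchanged
                {"id": 2, "name": "Robert"},                     # changed
                {"id": 3, "name": "Carol"}]                      # new

    flagged = []
    for row in incoming:
        existing = target.get(row["id"])
        if existing is None:
            flagged.append(("insert", row))
        elif existing["name"] != row["name"]:                    # compare-columns check
            flagged.append(("update", row))

    # Rows present in the comparison table but missing from the input may be flagged as delete.
    missing = set(target) - {r["id"] for r in incoming}
    flagged += [("delete", {"id": k, **target[k]}) for k in missing]

    print(flagged)   # [('update', {'id': 2, ...}), ('insert', {'id': 3, ...})]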
Ø 6. Key Generation transform:
• It is used to generate surrogate keys for newly inserted rows.
• This transform identifies the maximum existing key value in the specified key source table and generates new keys continuing from that maximum value.
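A minimal Python sketch of that numbering logic (the column name SK_ID is just an illustrative surrogate key name):

    existing_keys = [101, 102, 103]          # keys already present in the key source table
    new_rows = [{"name": "Alice"}, {"name": "Bob"}]

    next_key = max(existing_keys, default=0) + 1
    for row in new_rows:
        row["SK_ID"] = next_key              # continue numbering from the current maximum
        next_key += 1

    print(new_rows)   # [{'name': 'Alice', 'SK_ID': 104}, {'name': 'Bob', 'SK_ID': 105}]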
Ø 7. History Preserving transform:
• As the name indicates, it is used to preserve the history of records.
• When a record is updated in the source, instead of updating the record in the target, this transform inserts a new row with the updated values while keeping the previous version of the record.
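A minimal Python sketch of this slowly-changing-dimension style behaviour; the "current" flag column is an assumption used here to mark the old version, not a fixed part of the transform.

    target = [{"id": 1, "city": "Berlin", "current": "Y"}]
    update = {"id": 1, "city": "Munich"}

    for row in target:
        if row["id"] == update["id"] and row["city"] != update["city"]:
            row["current"] = "N"                                   # keep but close the old version
    target.append({"id": update["id"], "city": update["city"], "current": "Y"})

    print(target)   # both the old and the new version of the record remain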
Ø 8. Pivot transform:
• It is used for converting the selected columns to rows.
• That is, for each value in the selected columns, the Pivot transform inserts a new row in the target table.
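A minimal Python sketch with illustrative column names: each pivoted column becomes its own output row carrying the column name and value.

    row = {"emp_id": 7, "jan_sales": 100, "feb_sales": 120, "mar_sales": 90}
    pivot_columns = ["jan_sales", "feb_sales", "mar_sales"]

    pivoted = [{"emp_id": row["emp_id"], "month": col, "sales": row[col]}
               for col in pivot_columns]
    print(pivoted)   # one output row per pivoted column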
Ø 9. Reverse Pivot transform:
• It is used for converting the selected rows to columns.
• It combines data from multiple rows into a single row, creating new columns to hold the pivoted values.
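A minimal Python sketch of the reverse operation, again with illustrative column names: rows sharing a key are combined into one row, with a new column per distinct pivot value.

    rows = [{"emp_id": 7, "month": "jan", "sales": 100},
            {"emp_id": 7, "month": "feb", "sales": 120}]

    combined = {}
    for r in rows:
        rec = combined.setdefault(r["emp_id"], {"emp_id": r["emp_id"]})
        rec[f"{r['month']}_sales"] = r["sales"]   # new column per month value

    print(list(combined.values()))  # [{'emp_id': 7, 'jan_sales': 100, 'feb_sales': 120}]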
Ø 10. XML Pipeline transform:
• It processes large XML files. Instead of loading all the instances into memory at once, it takes one instance at a time, processes it and releases the memory.
• The next instance is then loaded into the same memory, so that memory consumption is reduced.
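A minimal Python sketch of the same streaming idea using xml.etree.ElementTree.iterparse; the document structure and tag names are illustrative only.

    import io
    import xml.etree.ElementTree as ET

    xml_data = io.StringIO(
        "<orders><order><order_id>1</order_id></order>"
        "<order><order_id>2</order_id></order></orders>")

    # Handle one <order> instance at a time instead of loading the whole document.
    for event, elem in ET.iterparse(xml_data, events=("end",)):
        if elem.tag == "order":
            print(elem.findtext("order_id"))   # process this instance
            elem.clear()                       # release its memory before the next one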
Platform transforms
Ø 1. Case transform:
• It is used to split the input rows based on given conditions and route them to different outputs respectively.
• Rows that do not satisfy any given condition go to a separate output labeled 'default'.
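A minimal Python sketch of this routing, with made-up conditions and output labels:

    rows = [{"country": "US"}, {"country": "DE"}, {"country": "JP"}]
    outputs = {"case_us": [], "case_eu": [], "default": []}

    for row in rows:
        if row["country"] == "US":
            outputs["case_us"].append(row)
        elif row["country"] in ("DE", "FR", "IT"):
            outputs["case_eu"].append(row)
        else:
            outputs["default"].append(row)    # rows matching no condition

    print(outputs)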
Ø 2. Map Operation transform:
• This transform is used to alter the data manipulation operation codes.
• It takes data manipulation operation codes as input and changes them to the specified operation code, for example update to insert, update to update, etc.
Ø 3. Merge transform:
• This transform acts like UNION ALL in SQL.
• It combines all the input data sets, which may come from different sources, and sends the result to the target.
• It allows duplicates.
Ø 4. Query transform:
It is a very important transform which is frequently used in jobs. It is used for:
• Mapping columns from the input to the output schema
• Assigning primary keys to the output schema
• Adding new columns to the output schema
• Performing join operations
• Applying GROUP BY and ORDER BY clauses
• Removing duplicate rows
A conceptual sketch of a few of these operations follows below.
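The following minimal Python sketch illustrates a typical combination: joining two inputs on a key, mapping/renaming columns into an output schema, and removing duplicates. The data and column names are invented for illustration.

    orders = [{"cust_id": 1, "amount": 50}, {"cust_id": 1, "amount": 50},
              {"cust_id": 2, "amount": 75}]
    customers = {1: "Alice", 2: "Bob"}

    # Inner join on cust_id plus column mapping into the output schema.
    joined = [{"customer": customers[o["cust_id"]], "amount": o["amount"]}
              for o in orders if o["cust_id"] in customers]

    # Remove duplicate rows (DISTINCT-like behaviour).
    deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in joined}]
    print(deduped)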
Ø 5. Row Generation transform:
• It produces a data set with a single column. The column values start with the number that you set in the "Row number starts at" option and then increment by one, up to the specified number of rows.
Ø 6. SQL transform:
• This transform does not take any input.
• It is useful when none of the built-in transforms can perform the required operation.
• We write the SQL query manually in this transform.
Ø 7. Validation transform:
• This is used to validate the input rows based on specific criteria.
• The rows that satisfy the required criteria are routed to the target labeled "pass".
• The rows that do not satisfy the required criteria are routed to the target labeled "fail".
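A minimal Python sketch of the pass/fail split, with an invented validation rule (the email must contain "@"):

    rows = [{"email": "a@example.com"}, {"email": ""}]
    passed, failed = [], []

    for row in rows:
        # Route each row to "pass" or "fail" depending on the rule.
        (passed if "@" in row["email"] else failed).append(row)

    print(len(passed), "passed,", len(failed), "failed")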
Data Quality transforms
These are used for parsing, standardizing and correcting the data. There are different transforms such as address cleanse, match, country ID, geocoder, etc.
Naming conventions
Project names start with --> "PRJ"
Job names start with --> "JOB_"
Workflow names start with --> "WF_"
Data flow names start with --> "DF_"
Script names start with --> "SC_"
Query names start with --> "QRY"
Function names start with --> "FN_"
Operational data store names start with --> "ODS_"
Global variable names start with --> "GV_"
If a DF extracts data from source table Person, the DF's name is like DF_EXT_PERSON, etc.
All target tables should include an "ETL_DATE" column.
Delete and truncate operations are executed in scripts (the "Delete data from table before loading" option is not used).
Dimension table names start with --> "DIM_"
Fact table names start with --> "FACT_"
Staging table names start with --> "ST_"
Temp table names start with --> "TEMP_"