The DataStage tool is the product used for ETL (Extract, Transform and Load) from various source servers, data files, applications, etc., to the destination systems, where the data is stored and used to implement Business Intelligence. The tool was first developed by VMark in the mid-90s and acquired by IBM in 2005, after which it was renamed IBM InfoSphere DataStage. In this article, we cover DataStage interview questions with descriptive answers for all levels, from beginner to intermediate and experienced. The topics covered include the ETL process, DataStage architecture, capabilities, components, operators, and DataStage job configuration and scheduling. With these DataStage interview questions and answers, you can be confident and well-versed with the concepts for the upcoming interview. Here you can access the best resources, which will lead you to crack the interview easily.
DataStage is a tool provided by IBM that is used to design, develop, and run applications that populate large volumes of data into data warehouses and datamarts by extracting data from diverse databases on Windows servers. It includes graphical visualizations for data integration and can also extract data from multiple sources and run scheduled jobs. As such, it is considered one of the most powerful ETL tools. DataStage has different versions that companies can use depending on their needs: Server Edition, MVS Edition, and Enterprise Edition. The IBM DataStage tool was the first ETL tool to introduce the concept of parallelism.
Some of the notable capabilities of the DataStage tool are -
InfoSphere DataStage and QualityStage can access data in enterprise applications and data sources such as:
The aspects of IBM InfoSphere DataStage are -
ETL tool is required by the Data Engineers for configuring the data from one source to another. DataStage is a tool in the IBM InfoSphere Information Server Suite where we can create data pipelines from a source system to the destination system by designing the internal data transformations such as merging, joining, lookups, and clustering data to create meaningful data.
The data is required to populate the different data tables in a data warehouse, datamarts, and staging tables, from which other systems can extract the data for data analysis, data science, etc.
The system allows the creation of scheduled jobs and features parallel processing, which gives it a performance edge over other such ETL tools.
The DataStage tool, being an ETL (Extract, Transform, Load) tool, extracts data from source systems (such as mainframe databases, OLAP databases, file systems, database tables, or an SQL query in Oracle). There is at least one such source system from which we get the data, and after the transformation, the prepared data is populated into one or more destination systems (database tables, file output, analytics tools, etc.), which are configured as destinations in the tool.
The DataStage tool also supports big data on Hadoop through components such as the HBase connector, JSON support, file system stages, etc.
A DataStage job is a code flow prepared by the developer, arranged as a process of stages that defines the final data and the requirement being asked for.
In each stage, the task is broken into specific functionalities; these stages are clubbed together in the tool and scheduled, and this is what is called a DataStage job.
For example –
Suppose we have an objective to sum the sales amount. So, breaking it into stages, firstly, we would be performing the operation of the ‘group by’ function on the Sales data.
In the next step, we load that data into the destination table, which is designed as a separate stage. Then, using DataStage links, we make the flow move from the ‘group by’ stage to the target loading stage.
The dsjob command is used to get and display information about a specific project, job, stage, or link, and to access log files. You can use the dsjob command to add entries to a job's log file or to retrieve and display specific log entries, as illustrated in the short example after the table below.
Command | Functionality |
---|---|
$dsjob -run | Used to run the DataStage job |
$dsjob -stop | Used to stop the job that is currently running |
$dsjob -jobid | Used for providing the job information |
$dsjob -report | Used for displaying the complete job report |
$dsjob -lprojects | Used for listing all the projects that are present |
$dsjob -ljobs | Used for listing all the jobs that are present in the project |
$dsjob -lstages | Used for listing all the stages of the current job |
$dsjob -llinks | Used for listing all the links |
$dsjob -lparams | Used for listing all the parameters of the job |
$dsjob -projectinfo | Used for retrieving information about the project |
$dsjob -log | Used for adding a text message to the log |
$dsjob -jobinfo | Used for retrieving information about the job |
$dsjob -stageinfo | Used for retrieving information about a stage of the job |
$dsjob -loginfo | Used for getting information about the log |
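As a quick illustration, a few of these commands can be strung together as below; the project name dstage_proj and the job name load_sales are hypothetical placeholders, and depending on how the engine is configured you may also need to pass connection options such as -domain, -user, -password, and -server.
dsjob -lprojects
dsjob -ljobs dstage_proj
dsjob -jobinfo dstage_proj load_sales
dsjob -logsum dstage_proj load_sales
The first two commands list the projects and the jobs in a project, while the last two report the current status of a job and a summary of its log entries.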
A must-know for anyone looking for DataStage advanced interview questions, this is one of the frequent questions asked of managers as well.
IBM DataStage Flow Designer is a client application where we can create, edit, schedule, and trigger jobs. It provides a dashboard palette from which we can drag and drop components to create jobs. A few important features of the DataStage Flow Designer are listed below -
We can trigger the code through designer tasks and link the different stages using DataStage links. The tool provides options such as using containers; creating, compiling, and running jobs; editing and scheduling existing jobs; importing/exporting jobs; loading columns; maintaining job metadata; and updating configuration files and parameters, which makes job execution dynamic.
A tier is a logical grouping of components within the tool and the computers on which those components are installed.
Every tier contains a subset of the components that make up InfoSphere Information Server.
Tiers provide services, job execution, and storage of metadata and other data for the product modules.
The main tiers are -
Client Tier
It consists of client programs and consoles used for development, administration, and other tasks and the computers on which those programs and consoles are installed.
Services Tier
The services tier consists of the application server, common services, and product services for the suite and product modules. It provides common services (such as security) and services that are specific to a particular product module. On the services tier, IBM WebSphere Application Server hosts the services, as well as the web-based InfoSphere Information Server applications.
Engine Tier
The engine tier is a logical grouping of the engine components and communication agents, which execute jobs and other tasks for the product modules.
Repository Tier
The repository tier consists of the metadata repository and, if installed, other data stores to support other product modules. The metadata repository contains shared metadata, data, and configuration information for InfoSphere Information Server product modules. Other data stores contain extension data for use by supporting product modules. The operational database is the data store used by the engine operations console.
Microservices Tier
The microservices tier consists of many different microservices, each with a different role.
Tier Relationships
The tiers work together to provide the services, job execution, and storage of metadata and other data for the product modules that we install.
The common services in DataStage are mainly –
For complex architectures, jobs can be created to run multiple other jobs in combination, and these are called DataStage sequences. Sequence jobs allow programming controls such as branching and looping to be incorporated into the flow.
In DataStage, this special type of job, called a sequence job, specifies a series of parallel or server jobs to run.
Different actions can be taken depending on whether the jobs in the sequence succeed or fail. After creating a sequence job, we schedule it to run using the InfoSphere DataStage Director client, just like any parallel or server job.
The Merge stage joins tables based on the values of key columns as specified in the table definition.
The Merge stage can have any number of input links and only a single output link, and there must be the same number of reject links as there are update input links.
The Merge stage combines a master data set with the update data sets provided to it. It is one of three processing stages that join tables based on the values of key columns; the other two are the Join stage and the Lookup stage.
The data files contain the raw or exported data that is to be imported into the system, i.e., the information required by the user, whereas the descriptor file contains the metadata of the data files, i.e., the information and description about the data in the files provided.
Often, we only need the descriptor file, where the information is available in brief, as dealing with data at large scale does not give us the flexibility to go through every record delivered by the source system.
Expect to come across this important DataStage question in your next interviews.
DataStage has three processing stages that allow you to join tables based on key column values: Lookup, Join, and Merge. This section explains which stage to choose when, the differences between these stages, and development guidelines for using them.
We use the Lookup stage when:
We use the Join stage when:
We use the Merge stage when:
A data pipeline is the process of pulling records from a source system and moving them through a set of processing functions (transformations) defined in a data flow (job). Records can be processed as they pass through the pipeline without having to be written to intermediate tables.
The requirement of data Pipelining is to -
Data partitioning is an approach to parallelism that divides a data set into partitions or subsets of the data set. In the absence of resource constraints or other data skew issues, data partitioning linearly improves application performance.
DataStage provides a capability where it automatically partitions data based on the partition type required by the phase.
Traditional ETL tools do not have this capability, so developers had to manually create data partitions and rewrite the application or the partitioning scripts whenever administrators wanted to use more hardware capacity, which was costly and time-consuming.
The sort function can be used for removing duplicates in DataStage. When configuring the sort, the user should set the option that allows duplicates to false.
Steps to remove the duplicates are - add a Sort stage (or a Remove Duplicates stage) to the job, specify the key columns on which duplicate rows are identified, and set the Allow Duplicates option to false.
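As a quick sanity check outside DataStage, the same idea can be tried on a delimited extract with standard Unix commands (the file name below is only an illustrative placeholder):
$> sort -u input_extract.txt > deduped.txt
$> sort input_extract.txt | uniq -d
The first command writes the de-duplicated rows to a new file, and the second lists only the rows that occur more than once.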
To access DataStage, we need to download and install the latest version of the IBM InfoSphere Server present on the web. The server supports AIX, Linux, and Windows operating systems and could be chosen as per the requirement.
We could migrate jobs and other assets from an older version of InfoSphere to a newer version by using the asset interchange tool.
Required Installation Files
For installing and configuring Infosphere DataStage, we need the following files in the local system.
For a Windows server:
EtlDeploymentPackage-windows-oracle.pkg
EtlDeploymentPackage-windows-db2.pkg
For a Linux server:
EtlDeploymentPackage-linux-db2.pkg
EtlDeploymentPackage-linux-oracle.pkg
IBM DataStage has 4 components – Administrator for managing Admin tasks, Manager for storing reusable Metadata, Designer for managing applications/ jobs, and Director for validating, scheduling, executing and monitoring DataStage server jobs and parallel jobs.
DataStage follows a client-server model as its architecture, and the main components of the client-server architecture are:
For setting up the projects, we start by creating the DataStage job on the DataStage server, and then all objects that we create in the meantime are stored in the DataStage project. The DataStage project serves as a separate environment on the server for the storage and maintenance of jobs, tables, definitions, and routines.
A DataStage project represents a separate environment on the installed server containing the jobs, tables, definitions, and routines created for the task.
The “Palette” window contains all the stages which are available in the DataStage tool. Based on the functionalities, multiple stages are provided from which the developer chooses.
The various categories of stages present in the Palette are General, Data Quality, Database, Development, File, Processing, etc.
Different DataStage Processing Stages in the tool are the functionalities that help in the following:
A few examples of processing Stages are –
Transformer, Aggregator, FTP, Link Collector, Interprocess, Pivot, Sort, etc., which help in handling data transformation and processing effectively.
A staple in DataStage interview questions and answers for experienced, be prepared to answer this one using your hands-on experience.
DataStage offers two sorting methods for parallel jobs: sorting on a link and the standalone Sort stage.
By default, both methods use the same sort package (the DataStage sort operator). The standalone Sort stage offers more options.
Link Sort
Link sort is specified in the Input/Partitioning options of a stage; the sort option becomes available when a keyed partitioning method is specified and cannot be used when the Auto partitioning method is selected.
Key column usage:
Sort Stage
The standalone Sort stage has additional options that are not available with link sorting.
The following properties are not available when sorting on a link:
We could use command-line functions to import/export DataStage jobs.
Additionally, the XML format serves as the basis for moving jobs/data flows between the Design Studio and DataStage environments. We could export a DataStage job to an XML file and then import it into the Design Studio to migrate the job from one server to another. The imported job is stored as a subflow in the designer.
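A hedged command-line sketch using the istool client, which exports DataStage assets to an .isx archive, is shown below; the host, credentials, project, and job names are hypothetical, and the exact option names and asset-path syntax vary by Information Server version, so confirm them with istool export -help before relying on this.
istool export -domain services_host:9443 -username dsadm -password secret -archive /tmp/MyJob.isx -datastage "engine_host/MyProject/Jobs/MyJob.pjb"
A matching istool import against the target domain then brings the archive into the destination project.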
A routine is a set of functions that are defined in the DataStage Manager and invoked through a Transformer stage.
There are three types of routines:
Routines are stored in the Routines branch of the DataStage repository, where they can be created, viewed, or edited, which could be done through the Routine Dialog Box.
The system variables are read-only variables that can be used in a Transformer stage or a routine. They supply system information, and their names start with ‘@.’
A few system variables and their uses –
System Variable | Use |
---|---|
@FALSE | Replaces the value with 0. |
@TRUE | Replaces the value with 1. |
@INROWNUM | Input row number. |
@OUTROWNUM | Output row number per link. |
@NUMPARTITIONS | Total number of partitions for the stage. |
@PARTITIONNUM | Partition number for the particular instance. |
@ITERATION | Iteration number of the current loop. |
A container is a group of stages and links. Containers allow you to simplify and modularize your job design by replacing complex areas of your diagram with single container stages.
These containers are available for both - parallel and server jobs.
IBM DataStage tool offers two types of containers.
Local containers. They are created within a job and can only be accessed within that job. Local containers are edited on tabs in the job's diagram window. Local containers can be used in server jobs or parallel jobs. Their primary use is to "clean up" the job design.
Shared containers. They are created separately and stored in the repository, like jobs. They can be inserted into any job design.
The DataStage director tool is used for:
The DataStage Director window is split into two panes.
The left pane, the Job Categories pane, shows the job categories present in the repository.
The right pane shows one of three views: the Status view, the Schedule view, or the Log view.
DataStage Director, therefore, has three view options.
InfoSphere DataStage, Java, and Hive support different data types, so the connector needs to map data types depending on the scenario in which the connector is used. You can configure the Hive Connector stage to automatically add columns to the output link at runtime.
The Hive Connector supports partitioned reads when reading data. There are two ways to do this.
The DataStage tool provides support to connect to the HBase dataset and access their database components and tables in the DataStage. It is required to perform the following tasks:
Don't be surprised if this question pops up as one of the top DataStage technical interview questions in your next interview.
There are two basic types of parallelism: pipelining and partitioning. The DataStage tool provides both methods. For example, consider a simple parallel job that extracts data from one data source, transforms it in some way, and then writes it to another data source. The job looks the same on the Designer canvas in either case, but it can be configured to work in different ways depending upon the parallelism selected.
Pipeline parallelism
If you run this sample job on a system with three or more processors, the read stage will start on one processor and begin filling the pipeline with the data it reads.
In pipeline parallelism, the DataStage tool works like a continuous process where the transformation, cleaning, and load processes keep working in parallel. An upstream process does not have to finish before the downstream process can start working, and vice versa.
This approach helps in minimizing disk usage for the staging area. Additionally, it also reduces the idle time of the processors involved.
Partition Parallelism
Suppose you have the same simple pipelined parallelism job, processing a very large amount of data.
In partition parallelism, the input data stream is divided into subsets referred to as partitions, each of which is processed by an individual processor running the same operation.
There are 4 tiers – the client tier, the services tier, the engine tier, and the repository tier.
The splitsubrec restructure operator splits the input subrecords into a set of top-level output vector fields.
The splitvect restructure operator promotes the elements of a fixed-length vector to a set of top-level fields with the same name.
One of the most frequently posed scenario based DataStage interview questions and answers, be ready for this conceptual question.
First, we look at the file and count how many records it has using wc -l filename.
Then we check whether the file has a header and a footer; to inspect or remove them, we can use the following Unix commands.
Suppose, for example, wc -l filename reports 1000 records.
We could use the below command using [head] to display the first line of a file.
$> head -1 file.txt
When we specify [head -2] then it will print only first 2 records of the file.
Otherwise, we could use the [sed] command. [sed] is a very powerful stream editor that is used for various text-manipulation purposes such as this.
$> sed '2,$ d' file.txt
Note that in [sed], the switch '$' always refers to the last line. So we could use the below command to delete the last line (the trailer) from the file in place.
$> sed -i '$ d' file.txt
Firstly, we use the following script using the [tail] command to get output.
$> tail -1 file.txt
If we use the [sed] command, then we use the following command. Remember, ‘$’ reads data from the last line.
$> sed -n '$ p' test
In [sed], by using the 'd' switch, we can delete a certain line from the output. So if we want to delete the first line, we use the following command -
$> sed '1 d' file.txt
This just prints all the lines except the first line of the file; hence, if we want to delete the first line from the file itself, we need to use one of two options.
Either we redirect the output to another file and then rename it back to the original file, as below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or else, we could use the [sed] switch '-i', which changes the file in place, as per the below command.
$> sed -i '1 d' file.txt
If we want to delete a range of lines from a file, we can do it in a similar way. For example:
$> sed -i '5,7 d' file.txt
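Putting these together, if the file has both a one-line header and a one-line trailer, both can be stripped in a single pass (file.txt is the same sample file used above; with GNU sed, adding -i edits the file in place instead of redirecting):
$> sed '1d;$d' file.txt > data_only.txt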
This is one of the most frequently asked DataStage interview questions for freshers in recent times.
This answer is specifically a part of DataStage performance tuning interview questions. For performance tuning of a DataStage job, the following practices must be used -
Additionally, select the correct configuration files, partitions, and buffers; handle the sorting of data and null/zero-time values carefully; use Copy, Modify, or Filter stages instead of Transformers where possible; and reduce unnecessary propagation of metadata between stages.
A repository table is used to access and process the output of the ad-hoc, historical, analytical or complex query results. It could be centralized or distributed.
Prior to DataStage version 8, repository information was stored in UniVerse (UV) files. Starting with version 8, repository information is stored in a database, which makes it easy to run queries and find details. During installation, you have the option to install the repository on either DB2 or Oracle.
DataStage repositories are typically created on DB2 databases because DB2 comes bundled with the Information Server software.
To set up DataStage, we require the following on the server -
This, along with other interview questions on DataStage, is a regular feature in DataStage interviews, be ready to tackle it with an approach mentioned below.
Following are the differences -
If an unknown error occurs during Job Sequencer execution, all stages after the exception activity will be executed. Exception activity in DataStage is, therefore, very important in case of such a situation.
For running the job in DataStage using the command line, we use the following command where we replace the specified project Name and job name –
dsjob -run -jobstatus <projectname> <jobname>
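A slightly fuller, hedged invocation is sketched below; the project name, job name, and parameter are hypothetical placeholders, and the exact set of options depends on your DataStage version (running dsjob with no arguments lists them). With -jobstatus, dsjob waits for the job to finish and returns an exit code reflecting the final job status, which a shell script or external scheduler can test.
dsjob -run -mode NORMAL -param RUN_DATE=2024-01-01 -warn 50 -jobstatus dstage_proj load_sales
echo $?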
The main difference between hashed and sequential files is that hashed files store data using a hashing algorithm and hash key values, while sequential files do not use key values to store data.
Based on this hash key function, searching hash files is faster than sequential files.
There are two types of hashed files: static hashed files and dynamic hashed files.
NLS stands for National Language Support. This means that the DataStage tools support, and can be used in, multiple languages, including multi-byte character languages such as Chinese and Japanese. You can read and write in any supported language.
The tagbatch restructure operator transforms tagged fields into an output record whose schema supports all possible fields of the tag instances.
The tagswitch operator converts the contents of tagged aggregates into InfoSphere DataStage-compatible records.
A collection library is a set of operators used to collect partitioned data.
The following collector types are available in the Collection Library.
A common yet one of the most important DataStage interview questions for experienced, don't miss this one.
InfoSphere DataStage automatically buffers links for certain stages. Its main purpose is to prevent deadlock conditions (where a stage cannot read its input because a previous stage in the job cannot write to its output).
Jobs with fork joins can lead to deadlock conditions. Here, a stage has two output links whose data paths are joined together later in the job. A situation can arise where all stages in the flow are waiting for other stages to read or write, so none of them can continue. No error or warning message is issued if a deadlock occurs; the job simply waits forever for input.
InfoSphere DataStage automatically inserts buffering into job flows containing fork joins that could lead to deadlock conditions. In most cases, you do not need to change the default buffering implemented by InfoSphere DataStage.
The DataStage players are the workhorse processes. They carry out the parallel processing and are assigned to operators on each node.
Player: One or more logical groupings of processes used to execute data flow logic. All players are created as a group on the same server as the section leader process.
Job Design :
To achieve the solution, we design the job as follows. Firstly, we read the sequential file as input; the data is then passed through a Sort stage and then a Transformer stage to produce the required output file.
b) Sort Stage Properties
We need to configure the Sort stage to sort the data based on the column “Sq” in descending order.
c) Transformer Stage Properties
Here, we define 2 stage variables, StageVar and StageVar1, with the following derivations -
d) Configuring Output File
Before capturing the data into a Sequential file, we need to sort the data again in ascending order to get the output as needed.
How do we convert a single row into multiple rows?
Sample Input file-
Col1 | Col2 | NameA | NameB | NameC |
---|---|---|---|---|
Abc | Def | Ram | Ben | Sed |
Sample O/P file data -
Col1 | Col2 | Name |
---|---|---|
Abc | Def | Ram |
Abc | Def | Ben |
Abc | Def | Sed |
For designing this process, we will use the Transformer stage to flatten the input data and create multiple output rows for each input row.
In the transformer stage –
Configuring the Loop condition –
@ITERATION <= 3
As each input row has three columns of names, we need to process each input row three times so that it will create three separate output rows.
Set the following loop variable –
We set LoopVar1 as follows:
IF (@ITERATION = 1) THEN inlink.Name1
ELSE IF (@ITERATION = 2) THEN inlink.Name2
ELSE inlink.Name3
Defining the Output link metadata and derivations
Col1 - inlink.col1
Col2 - inlink.col2
Name - LoopVar1
We need to design the following job to achieve the task. Here we read a sequential file as input, then the data is passed through a Sort and a Transformer stage to produce the requested output.
In sort Stage –
We need to sort the data based on column “Char” in ascending order.
In transformer stage –
We define the stage variables StageVar, StageVar1, and StageVar2
With the following derivations -
StageVar = If StageVar1 = DSLink6.Char Then StageVar + 1 Else 1
StageVar1 = DSLink6.Char
Then, we create a new column in the output which holds the occurrence count of each character and assign StageVar to it.
Occurrence = StageVar
Then, using the in-line sorting, we sort the data on "Occurrence" column in ascending order and populate the output as expected.
Usually, we take the max value from a column of a database table, but here we are asked to get the max value from a sequential file.
To design such a job, we first read the data from the flat file, then generate a dummy column in a Column Generator stage, then aggregate on that column, and finally send the output to a sequential file.
In the column generator stage, we generate the column DUMMY with the value 'X' for all the rows which we are reading from the input seq file. Additionally, we will set the generator for the Cycle algorithm with value 'X' and map the Salary and DUMMY columns in flow to the output of the Column Generator stage.
At the Aggregator stage, we then group on the DUMMY column and calculate the MAX of Salary. In the next step, we map the Max Salary to the output.
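For a quick cross-check outside DataStage, the same maximum can be computed on Unix, assuming a comma-delimited file with the salary in the third column (both the file name and the column position are illustrative assumptions):
$> cut -d',' -f3 salary_file.txt | sort -n | tail -1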
We read the sequential file, process the data in a Transformer stage, and then write the output to a file as asked. The job design would be like –
Input data is read from the sequential file stage:
Next, we pass the input data to a Transformer stage, where we separate the alphabets and the numerics using 2 stage variables.
We need to define the stage variables like below:
svAlpha: Convert("0123456789","",lnk_alphanumerics.MixData)
svNumerics: Convert(svAlpha,"",lnk_alphanumerics.MixData)
Next, we assign these stage variables to the output columns as follows.
Alpha = svAlpha
Number = svNumerics
In the last stage, we write the data into the output file.
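The same separation can be illustrated outside DataStage with the Unix tr command, which mirrors what the two Convert() derivations do (the sample value is just an assumption):
$> echo "abc123def45" | tr -d '0-9'
$> echo "abc123def45" | tr -cd '0-9'
The first command prints only the alphabets (abcdef) and the second prints only the numerics (12345).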
The JSP Expression Language (EL) is a simple language that enables you to access object methods with arguments (this is supported in EL 2.2 and later). To access a method with arguments in a JSP page, you specify the name of the object, followed by the method name and the arguments in parentheses.
For example, consider the following object with a format method that takes two arguments:
public class Formatter {
    public String format(String s, int i) {
        return s + ": " + i;
    }
}
To access the format method and pass it two arguments in a JSP page, you can use the following expression:
${formatter.format("Number", 42)}
The expression will return the string "Number: 42", which is the result of the format method.
You can also pass EL expressions as arguments to object methods. For example, consider the following object with a format method that takes a variable number of arguments:
public class Formatter {
    public String format(Object... args) {
        return Arrays.toString(args);
    }
}
To access the format method and pass it a variable number of arguments in a JSP page, you can use the following expression:
${formatter.format(1, 2, 3, 4, 5)}
The expression will return the string "[1, 2, 3, 4, 5]", which is the result of the format method.
You can also pass EL expressions directly as arguments to object methods:
${formatter.format(1 + 1, 2 + 2, 3 + 3)}
In this example, the expression will pass the values 2, 4, and 6 as arguments to the format method.
You can also use the fn:length function to pass the length of an array or a collection as an argument to a method:
${formatter.format(fn:length(numbers))}
In this example, the expression will pass the length of the numbers array as an argument to the format method.
To summarize, the JSP EL makes it easy to access and invoke object methods with arguments in a JSP page. You can use it to pass arguments to methods, as well as to pass EL expressions and the length of arrays and collections as arguments.
DataStage is a tool provided by IBM that is used to design, develop, and run applications that populate large volumes of data into data warehouses and datamarts by extracting data from diverse databases on Windows servers. It includes graphical visualizations for data integration and can also extract data from multiple sources and run scheduled jobs. As such, it is considered one of the most powerful ETL tools. DataStage has different versions that companies can use depending on their needs. The versions are Server Edition, MVS Edition, and Enterprise Edition. IBM DataStage tool was the first tool that introduced the parallelism concept.
Some of the notable capabilities of the DataStage tool are -
InfoSphere DataStage and QualityStage can access data in enterprise applications and data sources such as:
The aspects of IBM InfoSphere DataStage are -
ETL tool is required by the Data Engineers for configuring the data from one source to another. DataStage is a tool in the IBM InfoSphere Information Server Suite where we can create data pipelines from a source system to the destination system by designing the internal data transformations such as merging, joining, lookups, and clustering data to create meaningful data.
The data is required to populate the different data tables in a data warehouse, datamarts Staging tables where other systems could extract the data for data analysis, data science etc.
The system allows the creation of the scheduled job along with featuring parallel processing, where it shows performance exceptions from other such ETL tools.
The DataStage tool being the ETL (Extract, Transform, Load), suggests that there would be a place where the tool would extract the data from the source system (such as mainframe databases, OLAP databases, file system, Database tables, SQL Query in Oracle). There could be a minimum of one such source system where we would be getting the data, and after the transformation, the data prepared is populated into the multiple destination systems (database tables, file output, analytics tools, etc.), which is named as a destination in the tool.
The DataStage tool also supports Big data Hadoop by using tools such as an H-Base connector, JSON support, file system etc.
DataStage job is a code flow prepared by the developer, which is arranged in a process and stages defining the final data and requirement as asked.
In each stage, the task is broken into specific functionalities and is clubbed together in the tool and scheduled, which is named as DataStage job.
For example –
Suppose we have an objective to sum the sales amount. So, breaking it into stages, firstly, we would be performing the operation of the ‘group by’ function on the Sales data.
Nextly, in the next step, we would load that data into the destination table, so it would be designed in the next stage. Then using the DataStage links, we make the flow move from the ‘group by’ stage to the target loading stage.
dsjob command is used to get and display information about a specific project, job, phase, or link. Access to log files. You can use the dsjob command to add entries to a job's log file or to retrieve and display specific log entries.
Command | Functionality |
---|---|
$dsjob -run | Used to run the DataStage job |
$dsjob -stop | used to stop the job that is currently present in the process |
$dsjob -jobid | used for providing the job information |
$dsjob -report | used for displaying the complete job report |
$dsjob -lprojects | used for listing all the projects that are present |
$dsjob -ljobs | used for listing all the jobs that are present in the project |
$dsjob -lstages | used for listing all the stages of the current job |
$dsjob -llinks | used for listing all the links |
$dsjobs -lparams | used for listing all the parameters of the job |
$dsjob -projectinfo | used for retrieving information about the project |
$dsjob -log | used for adding a text message to the log |
$dsjob -jobinfo | used for the information retrieval of the job |
$dsjob -stageinfo | used for information retrieval at that stage of the job |
$dsjob –loginfo | used for getting information about the log |
A must-know for anyone looking for DataStage advanced interview questions, this is one of the frequent questions asked of managers as well.
IBM DataStage flow designer is a window client application where we can create, edit, schedule, trigger and perform operations. It provides the dashboard palette where we can drag/ drop components to create jobs and go to a few important features of the DataStage flow designer listed below -
We could trigger the code part through the designer tasks and link the different stages through DataStage links. The tool provides options such as using containers, creating, compiling, running jobs, editing, scheduling the existing jobs, Importing/ exporting jobs, loading columns, maintaining metadata of job info, updating configuration files, parameters etc. functionality which makes the job execution dynamic.
A tier is a logical grouping of components within the tool and the computers on which those components are installed.
Every tier contains a subset of the components that make up InfoSphere Information Server.
Tiers provide product module services, order fulfillment, and storage of metadata and other data.
There are mainly 6 service tiers-
Client Tier
It consists of client programs and consoles used for development, administration, and other tasks and the computers on which those programs and consoles are installed.
Services Tier
The Services Tier consists of Application Servers, General Services, and product services for suites and product modules. The services layer provides general services (such as security) and services specific to a particular product module. At the service level, IBM WebSphere Application Server hosts the service. It also hosts web-based InfoSphere Information Server applications.
Engine Level
The engine level is a logical grouping of engine components and communication agents, which helps in executing jobs and other tasks for product modules.
Repository Tier
The repository tier consists of the metadata repository and, if installed, other data stores to support other product modules. The metadata repository contains shared metadata, data, and configuration information for InfoSphere Information Server product modules. Other data stores contain extension data for use by supporting product modules. The operational database is the data store used by the engine operations console.
Microservices Layer
The Microservices Layer consists of many different microservices, each with a different role.
Layer Relationships
This layer provides storage for metadata and other data for services, job executions, and product modules that we install.
The common services in DataStage are mainly –
For the complex architecture, subsequent jobs could be created to run multiple jobs in combination with other jobs, which are called DataStage sequences. Sequence jobs allow for incorporating programming controls such as branching and looping into the flow.
In DataStage, a special type of job called a sequence job is present where a series of parallel or server jobs could be specified to run.
Different actions could be taken depending on whether the jobs in the sequence succeed or fail. After creating a successor job, we schedule it to run using the InfoSphere DataStage Director client, just like any parallel or server job.
The Merge stage joins tables based on the values of key columns as specified in the table definition.
The merge could have any number of input links and only a single output link, and there must be the same number of reject links as there are updated input links.
The Merge stage combines a master data set with the other data sets provided in its processing stage. It is one of three processes that join tables based on the values of the primary key of the table. The other two stages are:
The data files are the raw or exported data that is to be imported into the system and contain the information which is required by the user, whereas the descriptor file contains all the metadata of the data files, i.e., it contains the information and description about the data into the file provided.
Often, we require the descriptor file where we need the information at a brief as dealing with the large scale of data doesn’t provide us the flexibility to go through each data delivered by the source system.
Expect to come across this important DataStage question in your next interviews.
DataStage has three stages of processing that allow you to join tables based on key column values: lookup, join and merge. This post explains which phase to choose when, the differences between these phases, and development references for using these phases.
We use lookup levels when:
We use the join phase when:
We use merge levels when:
A data pipeline is a process of importing records from a source system and moving them through a set of processing functions (transformations) defined in a data flow (job). After the records pass through the pipeline, they can be processed without being written to a table
The requirement of data Pipelining is to -
Data partitioning is an approach to parallelism that divides a data set into partitions or subsets of the data set. In the absence of resource constraints or other data skew issues, data partitioning linearly improves application performance.
DataStage provides a capability where it automatically partitions data based on the partition type required by the phase.
Traditional packaging tools do not have this capability, so developers used to manually create data partitions and rewrite applications or data partitions scripts, which was costly and time-consuming when administrators wanted to use more hardware capacity.
The sort function can be used for removing duplicates in DataStage. While running the sort function, the user should specify the option that allows for duplicates and set it to false.
Steps to remove the duplicates are -
To access DataStage, we need to download and install the latest version of the IBM InfoSphere Server present on the web. The server supports AIX, Linux, and Windows operating systems and could be chosen as per the requirement.
We could migrate the data from an older version of infosphere to a new version by the tool - asset interchange tool,
Required Installation Files
For installing and configuring Infosphere DataStage, we need the following files in the local system.
Windows server,
EtlDeploymentPackage-windows-oracle.pkg EtlDeploymentPackage-windows-db2.pkg
Linux server,
EtlDeploymentPackage-linux-db2.pkg EtlDeploymentPackage-linux-oracle.pkg
IBM DataStage has 4 components – Administrator for managing Admin tasks, Manager for storing reusable Metadata, Designer for managing applications/ jobs, and Director for validating, scheduling, executing and monitoring DataStage server jobs and parallel jobs.
DataStage follows a client-server model as its architecture, and the main components of the client-server architecture are:
For setting up the projects, we start by creating the DataStage job on the DataStage server, and then all objects that we create in the meantime are stored in the DataStage project. The DataStage project serves as a separate environment on the server for the storage and maintenance of jobs, tables, definitions, and routines.
A DataStage project represents a separate environment on the installed server containing the jobs, tables, definitions, and routines created for the task.
The “Palette” window contains all the stages which are available in the DataStage tool. Based on the functionalities, multiple stages are provided from which the developer chooses.
Listing down the various categories of stages present in the Palette are - General, Data Quality, Database, Development, File, Processing, etc.
Different DataStage Processing Stages in the tool are the functionalities that help in the following:
A few examples of processing Stages are –
Transformer, Aggregator, FTP, Link Collector, Interprocess, Pivot, Sort etc., which helps in dealing with the data transformation and processing effectively.
A staple in DataStage interview questions and answers for experienced, be prepared to answer this one using your hands-on experience.
DataStage offers two sorting methods for parallel jobs.
By default, both methods use the same sort package (DataStage sort operator). Independent sorting offers more possibilities.
Link Sort
When the key splitting method is specified, this option cannot be used when sorting by the link option is specified in the input/split level option, and the automatic splitting method is used.
Key column usage:
Sorting phase
The standalone sorting phase has additional options that are not available for link sorting.
The following properties are not available when sorting by link:
We could use command-line functions to import/ export the DataStage job
Additionally, The XML format serves as the basis for moving jobs/dataflows between the Design Studio and DataStage environments. We could export a DataStage job to an XML file and then import it into the Design Studio to migrate the job from one server to another. The imported job is stored as a subflow in the designer.
A routine is a set of functions defined by DataStage Manager, which is routed through a transformer stage.
There are three types of routines:
Routines are stored in the Routines branch of the DataStage repository, where they can be created, viewed, or edited, which could be done through the Routine Dialog Box.
The system variables are the read-only variables that are used in the transformer stage or routine. They supply the system information, and these variables start with ‘@.’
Few system variables and their uses –
@FALSE | It replaces the value with 0. |
@TRUE | It replaces the value with 1. |
@INROWNUM | Number of input rows |
@OUTROWNUM | Number of output rows per link |
@NUMPARTITIONS | Total number of partitions for the stage. |
@PARTITIONNUM | Partition number for the particular instance. |
@ITERATION | Iteration number of the current loop |
A container is a group of stages and links. Containers allow you to simplify and modularize your job design by replacing complex areas of your diagram with single container stages.
These containers are available for both - parallel and server jobs.
IBM DataStage tool offers two types of containers.
Local container. They are created within a job and can only be accessed within that job. Local containers are edited on tabs in the job's graph window. Local containers can be used in server jobs or parallel jobs. Their primary use is to "clean up" the job design.
Shared containers. They are created separately and stored in repository-like jobs. These can be inserted into your job design.
The DataStage director tool is used for:
The DataStage Director window is split into two panes.
The Job Categories pane shows the ones on the repository.
The right pane shows one of his three views: Status View, Schedule View, or Log View.
DataStage Director has three view options.
InfoSphere DataStage supports different data types from Java and Hive data types. To complete the operation, the connector needs to map data types depending on the scenario in which the connector will be used. You can configure the Hive Connector stage to automatically add columns to the output link at runtime.
Hive Connector is a tool used to support partition mode when reading data. There are two ways to do this.
The DataStage tool provides support to connect to the HBase dataset and access their database components and tables in the DataStage. It is required to perform the following tasks:
Don't be surprised if this question pops up as one of the top DataStage technical interview questions in your next interview.
There are two basic types of parallelism. Pipelining and partitioning. The DataStage tool provides these two methods. For example, for a simple parallel job that extracts data from one data source, transforms it in some way, and then writes it to another data source. In any case, this job looks the same on the designer's canvas but can be configured to work in different ways depending upon the parallelism selected.
Pipeline parallelism
If you run a sample of his job on a system with 3 or more processors, the phase read he will start on one processor and start filling the pipeline with read data.
In parallel parallelism, the Datastage tool works like a continuous process where the process of transformation, cleaning and load process keeps working in parallel. It is not required that one process of upstream will stop while the downstream process is working and vice-versa.
This process helps in minimizing the risk usage, which is for the staging area. Additionally, it also reduces the idle time held on the processors working.
Partition Parallelism
Suppose you have the same simple pipelined parallelism job, processing a very large amount of data.
In partition parallelism, the input data stream is divided into various subsets referred to as partitions which are further processed using the individual processors by the same operation process.
There are 4 tiers –
splitsubrec restructure operator splits the input subrecords into a set of top-level output vector fields
splitvect restructure operator promotes the elements of a fixed-length vector to a set of top-level fields with the same name.
One of the most frequently posed scenario based DataStage interview questions and answers, be ready for this conceptual question.
First we look at the file name and how many Is there a record with wc - l filename is counted
Then first check if the file name has a header and footer, If unavailable, then use following Unix commands
when we use wc -l filename, it will give count, suppose for example there are 1000 records
We could use the below command using [head] to display the first line of a file.
$> head -1 file.txt
When we specify [head -2] then it will print only first 2 records of the file.
Otherwise, we could use [sed] command. [Sed] is a very powerful text editor which is used for various text manipulation purposes as asked here.
$> sed '2,$ d' file.txt
Note that in [sed], switch '$' refers to the last line always. So we could use the below command as it will read data from the last line and give output as expected.
$> sed –i '$ d' file.txt
Firstly, we use the following script using the [tail] command to get output.
$> tail -1 file.txt
If we use the [sed] command, then we use the following command. Remember, ‘$’ reads data from the last line.
$> sed -n '$ p' test
In [sed], by using the 'd' switch, we could delete a certain line from the output –. So if we want to delete the first line, the following command we use-
$> sed '1 d' file.txt
Here it will just print out all the lines except the first line in the file, and hence if we want to delete the first line from the file itself, we need to perform either of two options.
Either we can redirect the output of the file to another file and then renaming it back to original file as below:
$> sed '1 d' file.txt > new_file.txt $> mv new_file.txt file.txt
Or else, we could use [sed] switch '–i', which changes the file in place as per the below command.
$> sed –i '1 d' file.txt
If you want to delete line by line from a specific file, you can perform your task in a similar way as above. For example:
$> sed –i '5,7 d' file.txt
This is one of the most frequently asked DataStage interview questions for freshers in recent times.
This answer is specifically a part of DataStage performance tuning interview questions. For performance tuning of the DataStage job, the following practice must be used -
Additionally, selecting the correct configuration files, partitions and buffers. Handling sorting the data and handling zero-time values. You should try to copy, modify, or filter instead of using transformers. Need to reduce unnecessary metadata proliferation between phases.
A repository table is used to access and process the output of the ad-hoc, historical, analytical or complex query results. It could be centralized or distributed.
Prior to DataStage version 8, DataStage repository information was stored in files(uv). Starting with version 8, repository information is also stored in the database. The information is available in a database, making it easy to create queries and find details. During installation, you have the option to install the repository on either DB2 or Oracle.
DataStage repositories are typically created on DB2 databases. This is because Db2 comes standard with Information Server software.
For DataStage, we require the following in the server to set up.
This, along with other interview questions on DataStage, is a regular feature in DataStage interviews, be ready to tackle it with an approach mentioned below.
Following are the differences -
If an unknown error occurs during Job Sequencer execution, all stages after the exception activity will be executed. Exception activity in DataStage is, therefore, very important in case of such a situation.
For running the job in DataStage using the command line, we use the following command where we replace the specified project Name and job name –
dsjob -run -jobstatus <projectname> <jobname>
The only difference between hashed and sequential files is that hashed files store data about the hashing algorithm and hash key values, while sequential files do not have key values to store data.
Based on this hash key function, searching hash files is faster than sequential files.
There are two types of hash files:
NLS stands for National Language Support. This means that the DataStage tools support and can be used in multiple languages, including multi-byte character languages (Chinese, Japanese, Spanish). You can read and write in any language.
tagbatch restructure operator transforms the tagged fields into an output record whose schema supports all possible fields of a tag instance.
Tagswitch is used for the contents of tagged aggregates to be converted to InfoSphere DataStage-compatible records.-compatible records.
A collection library is a set of operators used to collect partitioned data.
The following collector types are available in the Collection Library.
A common yet one of the most important DataStage interview questions for experienced, don't miss this one.
InfoSphere DataStage automatically buffers links for certain stages. Its main purpose is to prevent deadlock conditions (where a stage cannot read its input because a previous stage in the job cannot write to its output).
Jobs with fork joins can lead to deadlock conditions. Here, the stage has two output connections whose data paths are merged later in the job. A situation can arise where all stages in the flow are waiting for other stages to read or write, so none of them can continue. No error or warning message is issued if a deadlock occurs. Your work will wait forever for input.
InfoSphere DataStage automatically inserts buffering into job flows containing branch joins that can lead to deadlock conditions. In most cases, you do not need to change the default buffering implemented by InfoSphere DataStage.
The DataStage player is the workhorse process. They are useful for parallel processing and are assigned to operators on each node.
Player: One or more logical groupings of processes used to execute data flow logic. All players are created as a group on the same server as the section leader process.
Job Design :
To achieve the solution, we design the job as shown in the below image. Firstly, we read the seq file as input, then data is passed through a Sort and then Transformer stage to achieve the output as needed in the file.
b) Sort Stage Properties
We would require to configure the sort stage where we sort the data based on column “Sq” in descending order.
c) Transformer Stage Properties
Here, we define 2 stage variables: StageVar, StageVar1 and following derivations -
d) Configuring Output File
Before capturing the data into a Sequential file, we need to sort the data again in ascending order to get the output as needed.
For designing this process, we will use the Transformer stage to flatten the input data and create multiple output rows for each input row.
In the transformer stage –
Configuring the Loop condition –
@ITERATION <= 3
As each input row has three columns of names, we need to process each input row three times so that it will create three separate output rows.
Set the following loop variable –
LoopVar1 we set by the following:
IF (@ITERATION = 1) THEN inlink.Name1
ELSE IF (@ITERATION = 2) THEN inlink.Name2
ELSE inlink.Name3
Defining the Output link metadata and derivations
Col1 - inlink.col1
Col2 - inlink.col2
Name - LoopVar1
We need to design the following job for achieving the task. Here we are reading sequential file as a input, then data is passing through a Sort and Transformer stage to achieve the asked.
In sort Stage –
We need to sort the data based on column “Char” in ascending order.
In transformer stage –
We define 2 stage variable : StageVar, StageVar1, StageVar2
With following derivations -
StageVar = If StageVar1=DSLink6.Char Then StageVar+1 Else1
StageVar1 = DSLink6.Char
Then, create a new column in output which will hold the value of Occurrence of characters and assigned the StageVar.
Occurrence = StageVar
Then, using the in-line sorting, we sort the data on "Occurrence" column in ascending order and populate the output as expected.
Usually, we use max value in a column of a table, but here asked is to get the max value from seq file.
For designing such job, we firstly read the data from flat file, following it we generate a dummy column in column generator stage, next, doing an aggregate on that and sending the output to seq file.
In the column generator stage, we generate the column DUMMY with the value 'X' for all the rows which we are reading from the input seq file. Additionally, we will set the generator for the Cycle algorithm with value 'X' and map the Salary and DUMMY columns in flow to the output of the Column Generator stage.
At the aggregator stage, we will then aggregate on the DUMMY column and then calculate the MAX of Salary. In the nextstep, we map the Max Salary to Output.
We read the sequential file, then do processing in transformer stage and then write the output to file as asked. The job design would be like –
Input data is read from the sequential file stage:
Next, we pass the input data to the Transformer stage, where we separate the alphabetic and numeric characters using two stage variables.
We define the stage variables as below:
svAlpha: Convert("0123456789","",lnk_alphanumerics.MixData)
svNumerics: Convert(svAlpha,"",lnk_alphanumerics.MixData)
Next, we assign these stage variables to the output columns as follows:
Alpha = svAlpha
Number = svNumerics
In the last stage, we write the data to the output file.
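To show what the two Convert() derivations do, here is an illustrative Python equivalent (not DataStage syntax): the first expression strips the digits out of the mixed string, and the second strips out everything that survived the first step, leaving only the digits.
# Illustrative Python equivalent of the two Convert() stage variables above.
def split_alpha_numeric(mix_data):
    digits = "0123456789"
    # svAlpha: Convert("0123456789", "", MixData) -> remove digits, keep alphabets
    sv_alpha = "".join(ch for ch in mix_data if ch not in digits)
    # svNumerics: Convert(svAlpha, "", MixData) -> remove those characters, keep digits
    sv_numerics = "".join(ch for ch in mix_data if ch not in sv_alpha)
    return sv_alpha, sv_numerics

print(split_alpha_numeric("AB12CD34"))  # ('ABCD', '1234')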
A DataStage interview generally focuses on the ETL process and on the intermediate transformations you could add to achieve the desired result. Apart from the DataStage interview questions with answers listed above, a top tip for the interview is to go through the different tiers and tools in IBM InfoSphere and revise your knowledge of each. Visit our online Database courses to learn DataStage and other database tools with hands-on projects to make you job ready.
According to Credly, the top companies hiring DataStage developers in India are:
Data Engineers looking for a high-end ETL tool and working with the IBM InfoSphere Information Server Suite product can apply to these roles -
While reading the important questions, we need to keep the focus on the core concepts and on the transformation tools and capabilities that the DataStage tool provides.
For Data Engineers role -
For Admin roles -
You need to learn the directory system, job tools, logs, repository, etc., where they are stored, and how they should be configured.
A few common DataStage questions one should expect and always be prepared for in such kinds of interviews are –
Advice for beginners would be to understand the process, be clear about your previous role, and answer questions about the process and structure as in the questions above rather than mentioning the tool everywhere. You can always go through the above-mentioned DataStage scenario-based questions with answers.
The article provides the latest DataStage interview questions and answers, which will be a sufficient resource for clearing the interview. The DataStage developer interview questions mentioned in this article are among the most frequently asked, so it is recommended to get a good hold of these as well.
Also, you could visit the other online Database courses on KnowledgeHut for more hands-on learning.
The IBM InfoSphere DataStage tool is an ETL tool used in data engineering applications. This article focuses on providing a gist and overview of the DataStage interview questions that could be asked during an interview.
The article covers aspects such as the architecture, capabilities, functions, and applications provided by the tool, and it is advisable to go through the structure of the tool if you are a fresher appearing for an interview, so that you properly understand what you are answering about. Working engineers should go through the DataStage partitioning interview questions and the DataStage production support interview questions, as these cover major roles in the IT sector.
ETL tools generally work on the common principle of extracting data from one or more sources, applying internal transformations to join, filter, update, delete, assemble, merge, and save the data in between, and finally storing the data in the destination system as tables, reports, or any other application feed.
IBM provides a suite comparable to those of other vendors; for example, Microsoft provides the MSBI tools (SQL Server Integration Services, Reporting Services, and Analysis Services), with which we could build similar kinds of applications.
The IBM suite can store and manage data and ease the client's work, and it includes separate applications for managing jobs, access, reports, metadata, etc., whose details can be found in the sections above.
To summarize, IBM DataStage is a powerful tool for designing, developing, and running applications that extract data from databases and populate data warehouses. Below are the four main components of DataStage. The Administrator is used for administrative tasks, including setting up DataStage users and purging criteria and creating and deleting projects. The Designer is the design interface used to develop DataStage applications or jobs, which are scheduled by the Director and executed by the server. The Manager, as the name suggests, manages the repository and allows users to view and edit the metadata stored in it. The Director performs functions such as validating, scheduling, executing, and monitoring jobs. DataStage also supports big data, which can be accessed in several ways, including JDBC integration, JSON support, and the Distributed File System.
The interview questions are prepared and classified based on the level of expertise one holds, so the reader can gain sufficient knowledge before facing the interviewer. There are many such articles about the best Database courses to take up, as data jobs are plentiful in the market.