Monday, 2 June 2014

SAS_Combining_Data_Vertically_Horizontally

Combining Data Vertically

1)
Using a FILENAME Statement
We can use a FILENAME statement to concatenate raw data files by assigning a single fileref to the raw data files that need to be combined.

filename qtr1 (’c:\sasuser\month1.dat’ ‘c:\sasuser\month2.dat’ ‘c:\sasuser\month3.dat’);
data work.firstqtr;
infile qtr1;
input Flight $ Origin $ Dest $
Date : date9. RevCargo : comma15.2;
run;

2)
Using an INFILE Statement with FILEVAR= option
The FILEVAR= option enables you to dynamically change the currently opened input file to a new input file.When you use an INFILE statement with the FILEVAR= option, the file specification is a placeholder, not an actual filename or a fileref that had been assigned previously to a file. SAS uses this placeholder for reporting processing information to the SAS log.
When the INFILE statement executes, it reads from the file that the FILEVAR=variable specifies. Like automatic variables, this variable is not written to the data set

data work.quarter;
do i = 9, 10, 11;
nextfile="c:\sasuser\month"!!put(i,2.)!!".dat";
infile temp filevar=nextfile;
input Flight $ Origin $ Dest $
Date : date9. RevCargo : comma15.2;
end;

3.1)
Using Data Step SET statement(concatenation) / Vertical SET stmt

data one_two;
set one two;
run;

All the observations from data set One are followed by all the observations from data setTwo. SAS refers to this process as concatenating data sets.
what happens if you use the SET statement on two data sets that don’t contain all the same variables
At compile time, SAS looks at every data set listed in the SET statement. First data set One variables is put into into the PDV (and all their attributes).
Next SAS looks at data set two,If there are any variables in data set two that are not already in the PDV-then those variable are added to PDV.
At execution time-First, all the variables in the PDV are set to missing. Then, each observation from data set One is read and then written to the output data set. The PDV is not set back to missing as each observation in data set One is processed, but this really doesn’t matter because each new observation replaces the values in the PDV. (This would be true even if one of the values in the currentobservation was missing.)After all the observations in data set One are read and written out to data set One_Two and SAS prepares to read observations from data set Two, all the variables in the PDV are again set to missing. SAS continues reading observations from data set Two until it reaches the end of the data set.

·         If the length of a variable is different in any of the input data sets, the length of the variable in the output data set is equal to the length that variable has in the first data set encountered in DATA step.

·         If necessary, place a LENGTH statement before the SET statement to be sure the resulting length is adequate to hold all of your values.

·         Finally, if you have a variable in two data sets, one character and the other numeric,SAS prints an error message in the log and the program terminates.

3.2)
Using Data Step SET statement(Interleaving Data Sets)

There is another way to add observations from several data sets. If each of the data sets to be combined is already sorted, you can take advantage of that fact and wind up with a sorted output data set. All you need to do is to follow the SET statement with a BY statement. SAS selects observations from each of the input data sets in order, with the resulting data set already sorted. The advantage of this method is that you don’t have to sort the resulting data set. There are times when the resulting data set would be too large to sort conveniently or at all.

proc sort data=one;
by ID;
run;
proc sort data=two;
by ID;
run;
data interleave;
set one two;
by ID;
run;

3.3)
difference between interleaving and concatenating the data sets
When you use PROC SORT to sort a SAS data set, a sort flag is set (you can see this on the first page of output from PROC CONTENTS) and SAS does not resort this data set if you attempt to sort it again by the same BY variables. When you interleave data sets, this sort flag is not set (which should not cause you any problems).

4.1)
Appending SAS Data Sets

PROC APPEND BASE=SAS-data-set
DATA=SAS-data-set;
RUN;

Difference between PROC APPEND and DATA step SET statement to combine SAS data vertically.
Unlike the APPEND procedure,the SET statement in the DATA step reads all observations in both input data sets in order to concatenate them.
Therefore, the APPEND procedure is more efficient than the SET statement in the DATA step for concatenating data sets because it reads only the data in the DATA= data set.

DATA= data set contains variables

that...

FORCE

required?

Consequences of using the

FORCE option

 

 

 

are in the BASE=data set, but the

BASE= data set has more variables

no

Missing values are assigned to the

extra BASE= data set variables.

are not in the BASE= data set

yes

Extra DATA= data set variables are

dropped.

are longer than the variables in the

BASE= data set

yes

DATA= data set variable values are

truncated.

do not have the same type as the

variables in the BASE= data set

yes

Missing values are assigned to the

DATA= data set variables with the

data type mismatch.

 

4.2)
Difference between Vertical SET stmt and Proc Append

Vertical SET

Proc Append

 

 

 

 

 

 

 

·         The Vertical SET stmt is the best means (better than the APPEND procedure) to concatenate datasets if the data subsequently require more manipulation that can be accomplished within the same DATA STEP.

·         The APPEND procedure can improve processing time substantially because SAS only reads in the dataset being appended

 

 

·         PROC APPEND may be the best choice to concatenate the input datasets if (1) processing time is a consideration and (2) no further data manipulation is required

 

 

·         Strive to ensure that both datasets have the same exact variables and variable attributes to avoid a ‘messy’ log (i.e. avoid using the FORCE option if at all possible).

 

 

 

5) Proc SQL Union operator

 

 

 

Combining Data Horizontally

1)
Merging Two Data Sets

SAS uses the term merge to describe the process of combining variables (columns) from two or more data sets.You first sort each data set by the variable or variables that link the two data sets. Next,you name each of the data sets in a MERGE statement. Be sure to follow the MERGE statement with a BY statement, naming the variable or variables that tell SAS which observations to place in the same observation.

proc sort data=employee;
by ID;
run;

proc sort data=hours;
by ID;
run;

data combine;
merge employee hours;
by ID;
run;

2)
Using PROC SQL to Join Data

Match Merge

PROC SQL JOIN (default Inner Join)

there is no limit to the number of input

data sets, other than memory.

The maximum number of tables that can

be joined at one time is 32.

Allows for complex business logic to be

incorporated into the new data set by using DATA step processing, such as arrays and DO loops, in addition to MERGE features.

Complex business logic is difficult to

incorporate into the join.

 

 

PROC SQL might require more resources

than the DATA step with the MERGE

statement for simple joins.

Data sets must be sorted by or indexed on

the BY variable(s) prior to merging.

Data sets do not have to be sorted or indexed, but an index can be used to improve performance.

The BY variable(s) must be present in all data sets, and names of key variable(s) must match exactly.

Multiple data sets can be joined in one step without having common variables in all data sets.

In case of non match data-it returns both matches and nonmatches

In case of non match data-it returns only matches

 

Matching Data and Inner Join

One-to-one matches produce identical results whether the data sets are merged in a DATA step or joined in a PROC SQL step.

One-to-many matches produce identical results whether the data sets are merged in a DATA step or joined in a PROC SQL step.

Many-to-many matches produce different results depending on whether the data sets are merged in a DATA step or joined in a PROC SQL step.

Non Matching Data and Inner Join

Many-to-Many Nonmatching data between the data sets produces different results depending on whether the data sets are merged in a DATA step or combined by using a PROC SQL inner join or outer join.

One-to-One and One-to-many Nonmatching data between the data sets produces same results depending on whether the data sets are merged in a DATA step or combined by using a PROC SQL outer join.

3)
Using Multiple SET Statements

You can use multiple SET statements to combine observations from several SAS data sets.

For example, the following DATA step creates a new data set named Combine. Each observation in Combine contains data from one observation in Dataset1 and data from one observation inDataset2.

data combine;

set dataset1;

set dataset2;

run;

When you use multiple SET statements,

·         processing stops when SAS encounters the end-of-file (EOF) marker on either data set (even if there is more data in the other data set). Therefore, the output data set contains the same number of observations as the smallest input data set.

·         the variables in the program data vector (PDV) are not reinitialized when a second SET statement is executed. 

·         for any variables that are common to both input data sets, the value or values from the data set in the second SET statement will overwrite the value or values from the data set in the first SET statement in the PDV.

Keep in mind that using multiple SET statements to combine data from multiple input sources that do not have a one-to-one match can be complicated. By default, the first observation from each data set is combined, the second observation from each data set is combined, and so on until the first EOF marker is reached in one of the data sets. Therefore, if you are working with data sources that do not have a one-to-one match, or that contain nonmatching data, you will need to add additional DATA step syntax in order to produce the results that you want.

4)
Difference between Multiple set stmt(Horizontal concat) Vs Merge stmt

Multiple SET stmt

Merge stmt

 

 

processing stops when SAS encounters the end-of-file (EOF) marker on either data set
So only 1:1 matching produce desired result.

No such EOF problem.
1:1 and 1:M matching produce desired result.

 

 

 

 


 

Q)How can you combine two datasets based on the relative position of rows in each data set; that is, the first observation in one data set is joined with the first observation in the other, and so on?
A)One to One reading (Merge)

 Q)Describe 5 ways to do a "table lookup" in SAS.
A) Match Merging, Direct Access, Format Tables, Arrays, PROC SQL

Q)
data concat;
set a b;
run;
format of variable Revenue in dataset a is dollar10.2 and format  of variable Revenue in dataset b is dollar12.2  What would be the format of Revenue in resulting dataset (concat)?
A) dollar10.2

Q)If you have two datasets you want to combine them in the manner such that observations in each BY group in each data set in the SET statement are read sequentially, in the order in which the data sets and BY variables are listed then which method of combining datasets will work for this?
A)Interleaving (In Concatenation - we read all observations from one dataset at once then from another dataset)

Q)While match merging two data sets, you cannot use the __________option with indexed data sets because indexes are always stored in ascending order.
A)Descending

 Q)I have a dataset concat having variable a b & c. How to rename a b to e & f?
A) data concat(rename=(a=e b=f));
     set concat;
     run;

Q)What is the difference between One to One Merge and Match Merge? Give example also..
A)If both data sets in the merge  statement are sorted by id(as shown below) and each observation in one data set has a corresponding observation in the other data set, a one-to-one merge is suitable.

data mydata1;
input id class $;
cards;
1 Sa
2 Sd
3 Rd
4 Uj;

data mydata2;
input id class1 $;
cards;
1 Sac
2 Sdf
3 Rdd
4 Lks;

data mymerge;
merge mydata1 mydata2;
run;

 

If the observations do not match, then match merging is suitable

data mydata1;
input id class $;
cards;
1 Sa
2 Sd
2 Sp
3 Rd
4 Uj;

data mydata2;
input id class1 $;
cards;
1 Sac
2 Sdf
3 Rdd
3 Lks
5 Ujf;

data mymerge;
merge mydata1 mydata2;
by id
run;

 


Q)How
would you code a merge that will keep only the observation that have matches form both sets? A)By using the IN internal variable in the merge statement.
DATA NEW;
MERGE ONE_TEMP (IN=ONE) TWO_TEMP (IN=TWO);
BY NAME;
IF ONE=1 AND TWO=1;
RUN;

Q)How do the IN= variable improve the capability of a MERGE?
A) IN is a implicit variable in SAS which helps in controlling which dataset needs to contribute to the new dataset

Q)What is the problem with merging two data set that have variable with the same name but different data?
A)The second data set value will overwrite the value of the first data set.

Q)When you use SET statement with multiple data sets, suppose variable salary defined in one data set contain type character and salary defined in another data set contain type numeric. Did my SET statement work without error?
A)Error in SAS log


Q)
When would you choose to MERGE two data sets together and when would you SET two data sets? A)If we want to create a dataset as an exact copy of The old dataset without any bothering about which Dataset is going to contribute to the new dataset Then we will use set statement. If we want to control the contribution of the old Datasets to the new dataset then we will use the Merge statement

Q)Which data set is the controlling data set in the MERGE statement?
A) Dataset having the less number of observations control the data set in the merge statement.

Q)Can merge statement give output without BY statement? Give at least one example on it.If I have following two data sets.I am submitting following code.( Assume that NEW and TEST Data sets will be sorted by A and B variable.)

Data nete;

  Merge new test;

  By a b;

Run;

What will be my output?

If we merge two datasets with all their variable in by clause then SET and Merge produce same result.

New

a

b

1

2

3

4

5

6

Test

a

b

3

5

1

6

5

7

Output

a

b

1

2

1

6

3

4

3

5

5

6

5

7

 

Q)Assume that you have 50 observation in one data set and 3000 observation in another data set.If you use DATA step Merge to combine the data sets, how many observations are read from the larger data set (i.e. second data set)?

A)3000

Q)What is a problem with merging two data sets that have variables with the same name but different data?
A)Error
Understanding the basic algorithm of MERGE will help you understand how the stepProcesses. There are still a few common scenarios whose results sometimes catch users off guard. Here are a few of the most frequent 'gotchas':
1- BY variables has different lengthsIt is possible to perform a MERGE when the lengths of the BY variables are different,But if the data set with the shorter version is listed first on the MERGE statement, theShorter length will be used for the length of the BY variable during the merge. Due to this shorter length, truncation occurs and unintended combinations could result.In Version 8, a warning is issued to point out this data integrity risk. The warning will be issued regardless of which data set is listed first:WARNING: Multiple lengths were specified for the BY variable name by input data sets.This may cause unexpected results. Truncation can be avoided by naming the data set with the longest length for the BY variable first on the MERGE statement, but the warning message is still issued.
To prevent the warning, ensure the BY variables have the same length prior to combining them in the MERGE step with PROC CONTENTS. You can change the variable length with either a LENGTH statement in the merge DATA step prior to the MERGE statement, or by recreating the data sets to have identical lengths for the BY variables.

No comments:

Post a Comment