Combining Data Vertically
1)
Using a FILENAME Statement
We can use a FILENAME statement to concatenate raw data files by assigning
a single fileref to the raw data files that need to
be combined.
filename qtr1
(’c:\sasuser\month1.dat’ ‘c:\sasuser\month2.dat’ ‘c:\sasuser\month3.dat’);
data work.firstqtr;
infile qtr1;
input Flight $ Origin $ Dest $
Date : date9. RevCargo : comma15.2;
run;
2)
Using an INFILE Statement with FILEVAR= option
The FILEVAR= option enables you to dynamically change the currently opened
input file to a new input file.When you use an INFILE
statement with the FILEVAR= option, the file specification is a placeholder,
not an actual filename or a fileref that had been
assigned previously to a file. SAS uses this placeholder for reporting
processing information to the SAS log.
When the INFILE statement executes, it reads from the file that the
FILEVAR=variable specifies. Like automatic variables, this variable is not
written to the data set
data work.quarter;
do i = 9, 10, 11;
nextfile="c:\sasuser\month"!!put(i,2.)!!".dat";
infile temp filevar=nextfile;
input Flight $ Origin $ Dest $
Date : date9. RevCargo : comma15.2;
end;
3.1)
Using Data Step SET statement(concatenation) / Vertical SET stmt
data one_two;
set one two;
run;
All the observations from data set One
are followed by all the observations from data setTwo.
SAS refers to this process as concatenating data sets.
what happens if you use the SET statement on two data
sets that don’t contain all the same variables
At compile time, SAS looks at every data set listed in the SET statement. First
data set One variables is put into into
the PDV (and all their attributes).
Next SAS looks at data set two,If
there are any variables in data set two that are not already in the PDV-then
those variable are added to PDV.
At execution time-First, all the variables in the PDV are set to missing. Then,
each observation from data set One is read and then
written to the output data set. The PDV is not set back to missing as each
observation in data set One is processed, but this
really doesn’t matter because each new observation replaces the values in the
PDV. (This would be true even if one of the values in the currentobservation
was missing.)After all the observations in data set One are read and written
out to data set One_Two and SAS prepares to read
observations from data set Two, all the variables in the PDV are again set to
missing. SAS continues reading observations from data set Two
until it reaches the end of the data set.
· If the length of a variable is different in any of the input data sets, the length of the variable in the output data set is equal to the length that variable has in the first data set encountered in DATA step.
· If necessary, place a LENGTH statement before the SET statement to be sure the resulting length is adequate to hold all of your values.
· Finally, if you have a variable in two data sets, one character and the other numeric,SAS prints an error message in the log and the program terminates.
3.2)
Using Data Step SET statement(Interleaving Data Sets)
There is another way to add observations from several data sets. If each of the data sets to be combined is already sorted, you can take advantage of that fact and wind up with a sorted output data set. All you need to do is to follow the SET statement with a BY statement. SAS selects observations from each of the input data sets in order, with the resulting data set already sorted. The advantage of this method is that you don’t have to sort the resulting data set. There are times when the resulting data set would be too large to sort conveniently or at all.
proc sort data=one;
by ID;
run;
proc sort data=two;
by ID;
run;
data interleave;
set one two;
by ID;
run;
3.3)
difference between interleaving and
concatenating the data sets
When you use PROC SORT to sort a SAS data set, a
sort flag is set (you can see this on the first page of output from PROC
CONTENTS) and SAS does not resort this data set if you attempt to sort it again
by the same BY variables. When you interleave data sets, this sort flag is not
set (which should not cause you any problems).
4.1)
Appending SAS Data Sets
PROC APPEND BASE=SAS-data-set
DATA=SAS-data-set;
RUN;
Difference between PROC APPEND and DATA step SET statement to combine SAS data vertically.
Unlike the APPEND procedure,the SET statement in the DATA step reads all
observations in both input data sets in order to concatenate them.
Therefore, the APPEND procedure is more
efficient than the SET statement in the DATA step for concatenating data sets
because it reads only the data in the DATA= data set.
DATA= data set contains
variables that... |
FORCE required? |
Consequences of using the FORCE option |
|
|
|
are in
the BASE=data set, but the BASE=
data set has more variables |
no |
Missing
values are assigned to the extra BASE= data set variables. |
are
not in the BASE= data set |
yes |
Extra
DATA= data set variables are dropped. |
are
longer than the variables in the BASE=
data set |
yes |
DATA=
data set variable values are truncated. |
do not
have the same type as the variables
in the BASE= data set |
yes |
Missing
values are assigned to the DATA=
data set variables with the data type mismatch. |
4.2)
Difference between Vertical SET stmt and Proc Append
Vertical SET |
Proc Append |
|
|
· The Vertical SET stmt is the best means (better than the APPEND procedure) to concatenate datasets if the data subsequently require more manipulation that can be accomplished within the same DATA STEP. |
· The APPEND procedure can improve processing time substantially because SAS only reads in the dataset being appended · PROC APPEND may be the best choice to concatenate the input datasets if (1) processing time is a consideration and (2) no further data manipulation is required · Strive to ensure that both datasets have the same exact variables and variable attributes to avoid a ‘messy’ log (i.e. avoid using the FORCE option if at all possible). |
|
|
5) Proc SQL Union operator
Combining Data Horizontally
1)
Merging Two Data Sets
SAS uses the term merge to describe the process of combining variables
(columns) from two or more data sets.You first sort
each data set by the variable or variables that link the two data sets. Next,you name each of the data
sets in a MERGE statement. Be sure to follow the MERGE statement with a BY
statement, naming the variable or variables that tell SAS which observations to
place in the same observation.
proc sort data=employee;
by ID;
run;
proc sort data=hours;
by ID;
run;
data combine;
merge employee hours;
by ID;
run;
2)
Using PROC SQL to Join Data
Match Merge |
PROC SQL JOIN (default Inner Join) |
there is no limit to the number of input data sets, other than memory. |
The maximum number of tables that can be joined at one time is 32. |
Allows for complex business logic to be incorporated into the new data set by using DATA step processing, such as arrays and DO loops, in addition to MERGE features. |
Complex business logic is difficult to incorporate into the join. |
|
PROC SQL might require more resources than the DATA step with the MERGE statement for simple joins. |
Data sets must be sorted by or indexed on the BY variable(s) prior to merging. |
Data sets do not have to be sorted or indexed, but an index can be used to improve performance. |
The BY variable(s) must be present in all data sets, and names of key variable(s) must match exactly. |
Multiple data sets can be joined in one step without having common variables in all data sets. |
In case of non match data-it returns both matches and nonmatches |
In case of non match data-it returns only matches |
Matching Data and Inner Join
One-to-one matches produce identical results whether the data sets are merged in a DATA step or joined in a PROC SQL step.
One-to-many matches produce identical results whether the data sets are merged in a DATA step or joined in a PROC SQL step.
Many-to-many matches produce different results depending on whether the data sets are merged in a DATA step or joined in a PROC SQL step.
Non Matching Data and Inner Join
Many-to-Many Nonmatching data
between the data sets produces different results depending on whether the data
sets are merged in a DATA step or combined by using a PROC SQL inner join or
outer join.
One-to-One and One-to-many Nonmatching data between
the data sets produces same results depending on whether the data sets are
merged in a DATA step or combined by using a PROC SQL outer join.
3)
Using Multiple SET Statements
You can use multiple SET statements to combine observations from several
SAS data sets.
For example, the following DATA step creates a new
data set named Combine. Each observation in Combine contains data from one observation in Dataset1 and data from one observation inDataset2.
data
combine;
set
dataset1;
set
dataset2;
run;
When you use multiple SET statements,
·
processing stops
when SAS encounters the end-of-file (EOF) marker on either data
set (even if there is more data in the other data set). Therefore, the output
data set contains the same number of observations as the smallest input data
set.
·
the
variables in the program data vector (PDV) are not reinitialized
when a second SET statement is executed.
·
for any
variables that are common to both input data sets, the value or values from the
data set in the second SET statement will overwrite the value or values from
the data set in the first SET statement in the PDV.
Keep in mind that using multiple SET statements to combine data from
multiple input sources that do not have a one-to-one match can be complicated.
By default, the first observation from each data set is combined,
the second observation from each data set is combined, and so on until the
first EOF marker is reached in one of the data sets. Therefore, if you are
working with data sources that do not have a one-to-one match, or that contain nonmatching data, you will need to add additional DATA step
syntax in order to produce the results that you want.
4)
Difference between Multiple set stmt(Horizontal concat)
Vs Merge stmt
Multiple SET stmt |
Merge stmt |
|
|
processing stops when
SAS encounters the end-of-file (EOF) marker on either data
set |
No such EOF problem. |
Q)How can you combine
two datasets based on the relative position of rows in each
data set; that is, the first observation in one data set is joined with the
first observation in the other, and so on?
A)One to One reading (Merge)
Q)Describe 5 ways to do a "table lookup" in SAS.
A) Match Merging, Direct Access, Format Tables,
Arrays, PROC SQL
Q)data concat;
set a b;
run;
format of variable Revenue in dataset a is dollar10.2 and format of
variable Revenue in dataset b is dollar12.2 What would be the format of
Revenue in resulting dataset (concat)?
A) dollar10.2
Q)If you have two datasets you want
to combine them in the manner such that observations in each BY group in each
data set in the SET statement are read sequentially, in the order in which the
data sets and BY variables are listed then which method of combining datasets
will work for this?
A)Interleaving
(In Concatenation - we read all observations from one dataset at once then from
another dataset)
Q)While match merging two data sets, you cannot use the
__________option with indexed data sets because indexes are always stored in
ascending order.
A)Descending
Q)I have a dataset concat having variable a b & c. How to rename a b to e
& f?
A) data concat(rename=(a=e b=f));
set concat;
run;
Q)What is the difference
between One to One Merge and Match Merge? Give example also..
A)If
both data sets in the merge statement are sorted by id(as shown below)
and each observation in one data set has a corresponding observation in the
other data set, a one-to-one merge is suitable.
data mydata1; |
data
mydata2; |
data mymerge; |
If the observations do not match,
then match merging is
suitable
data
mydata1; |
data
mydata2; |
data mymerge; |
Q)How
would you code a merge that will keep only the observation that have matches
form both sets? A)By using the IN internal
variable in the merge statement.
DATA NEW;
MERGE
ONE_TEMP (IN=ONE) TWO_TEMP (IN=TWO);
BY NAME;
IF
ONE=1 AND TWO=1;
RUN;
Q)How
do the IN= variable improve the capability of a MERGE?
A) IN is a
implicit variable in SAS which helps in controlling which dataset needs to
contribute to the new dataset
Q)What
is the problem with merging two data set that have variable with the same name
but different data?
A)The
second data set value will overwrite the value of the first data set.
Q)When you use SET statement with
multiple data sets, suppose variable salary defined in one data set contain
type character and salary defined in another data set contain type numeric. Did
my SET statement work without error?
A)Error in SAS log
Q)When would you choose to MERGE two data sets together and when would you
SET two data sets? A)If we want to create a dataset as an exact copy of The old
dataset without any bothering about which Dataset is going to contribute to the
new dataset Then we will use set statement. If we want to control the
contribution of the old Datasets to the new dataset then we will use the Merge
statement
Q)Which data set is
the controlling data set in the MERGE statement?
A) Dataset having the less number of observations control
the data set in the merge statement.
Q)Can merge statement give output
without BY statement? Give at least one example on it.If
I have following two data sets.I am submitting
following code.( Assume that NEW and TEST Data sets
will be sorted by A and B variable.)
Data nete;
Merge new
test;
By a b;
Run;
What will be my output?
If we merge two datasets with all their
variable in by clause then SET and Merge produce same result.
New
|
Test
|
Output
|
Q)Assume that you have 50 observation in
one data set and 3000 observation in another data set.If
you use DATA step Merge to combine the data sets, how many observations are
read from the larger data set (i.e. second data set)?
A)3000
Q)What is a problem with merging two
data sets that have variables with the same name but different data?
A)Error
Understanding
the basic algorithm of MERGE will help you understand how the stepProcesses. There are still a few common scenarios whose
results sometimes catch users off guard. Here are a few of the most frequent 'gotchas':
1- BY variables has different lengthsIt is possible
to perform a MERGE when the lengths of the BY variables are different,But if the data set with the shorter version is
listed first on the MERGE statement, theShorter
length will be used for the length of the BY variable during the merge. Due to
this shorter length, truncation occurs and unintended combinations could result.In Version 8, a warning is issued to point out this
data integrity risk. The warning will be issued regardless of which data set is
listed first:WARNING:
Multiple lengths were specified for the BY variable name by input data sets.This may cause unexpected results. Truncation can be
avoided by naming the data set with the longest length for the BY variable
first on the MERGE statement, but the warning message is still issued.
To prevent the warning, ensure the BY variables have the same length prior to
combining them in the MERGE step with PROC CONTENTS. You can change the
variable length with either a LENGTH statement in the merge DATA step prior to
the MERGE statement, or by recreating the data sets to have identical lengths
for the BY variables.
No comments:
Post a Comment