Industrial Data Systems: Relational Algebra

Relational Algebra

What is an algebra ?

A formal system of manipulation of symbols to deal with general statements of relations.

Algebraic systems are abstract: in the sense that a system only provides a set of symbols, and set of rules to construct valid expressions, and a set of rules by which expressions and symbols can be manipulated.

The algebra we learn in high school is the algebra of real numbers. All symbols for variables represented real numbers, and all manipulations of symbols gave relations between some real numbers.

Later we learnt the Algebra of Complex numbers.. which worked pretty much the same way.

Now we learn the Algebra of relations, where every relation is represented by a relational table as defined earlier: An array of atomic values, with no ordering of tuples (rows), and strictly ordered tuple-attributes (columns).

Just as real algebra has operations (+, -, x, /), Relational Algebra also has operations.

The main RA operations: Select, Project, Join, Divide

Notice that the result of any operation in real algebra is also a real number. Thus, for real numbers x, y: (x + y) is real. So is ( x - y). So is ( x / y) for all values where ‘/’ is defined.

[Give an example where it is not defined !]

Similarly, in RA, whenever an operation is defined the result of an RA operation is a relational table.

Why is this important ?

[Because this allows us to combine a sequence of RA operations in arbitrary order !]

Why is it important to combine arbitrary sequences of RA operations ?

We have already seen why any given amount of data can be efficiently and intelligently stored in a set of relational tables. The relational schema, if properly designed, guarantees minimum repetition, no update anomalies, and efficient data retrieval.

Now we shall see that extraction/modification of ANY information in a relational schema can be done by some combination of RA operations !

We shall demonstrate with the following tables:

*EMPLOYEE*
Name	ID	SupervisorID	DeptNo
John	111	222	5
Frankie	222	777	4
Alice	333	444	4
Jennifer	444	777	4
William	555	222	5
Joyce	666	222	5
James	777	222	5
John	888	null	1

WORKS_ON

IDno ProjNo Hours

111 1 2.5

111 2 1.5

555 3 2.5

666 1 3.5

666 2 3.5

222 2 2.5

222 3 5.5

222 1 5.5

*WORKS_ON*
IDno	ProjNo	Hours
111	1	2.5
111	2	1.5
555	3	2.5
666	1	3.5
666	2	3.5
222	2	2.5
222	3	5.5
222	1	5.5

DEPENDENT

EmpID DepName Relationship

222 Jack Son

222 Jill Wife

222 John Son

444 Ted Son

111 Mike Son

111 Anita Daughter

*DEPENDENT*
EmpID	DepName	Relationship
222	Jack	Son
222	Jill	Wife
222	John	Son
444	Ted	Son
111	Mike	Son
111	Anita	Daughter

Select Operation:

A relational table is composed of an unordered set of tuples (rows).

The SELECT operation allows us select a SUBSET of the tuples of a relational table, which satisfy some specified conditions.

SYNTAX:

SELECT [conditions] ( TABLE )

Example:

SELECT [DeptNo = 5] ( EMPLOYEE )

OUTPUT
Name	ID	SupervisorID	DeptNo
John	111	222	5
William	555	222	5
Joyce	666	222	5
James	777	222	5

All tuples in relational table EMPLOYEE, for which the condition [DeptNo = 5] was TRUE, were placed in the OUTPUT.

NOTES:

1. How does SELECT work ?

It looks at each tuple of the table, and evaluates the specified conditions. If the conditions are true, then that tuple is placed in OUTPUT. Otherwise it is rejected.

2. The result of the SELECT operation is also a Relational table !

In fact, it is a subset of all the tuples of the table EMPLOYEE.

3. Selection conditions MUST always evaluate to LOGICAL values: TRUE, or FALSE. Hence all conditions are EXPRESSIONS, connected by LOGICAL Operators (AND, OR, NOT).

Example:

SELECT [ (DeptNo != 4) AND ( (ID = 222) OR (ID = 111)) ] ( EMPLOYEE)

OUTPUT
Name	ID	SupervisorID	DeptNo
John	111	222	5

Project Operation:

The SELECT operation outputs a subset of the rows of a Relational Table. In contrast, the PROJECT operation outputs a subset of the columns of the Relational Table.

SYNTAX:

PROJECT [attribute-list] ( TABLE )

Example:

PROJECT [ Name, ID] ( EMPLOYEE )

OUTPUT
Name	ID
John	111
Frankie	222
Alice	333
Jennifer	444
William	555
Joyce	666
James	777
John	888

NOTES:

1. Once again, the result of the PROJECT operation is another Relational Table.

2. Since Relational tables are a set of tuples, if PROJECT will not output identical tuples twice !

Example:

PROJECT [EmpID, Relationship] ( DEPENDENT )

OUTPUT
EmpID	Relationship
222	Son
222	Wife
444	Son
111	Son
111	Daughter

Of course, since you can have any combination of RA Operations, the following example is totally valid:

PROJECT [Name, ID] ( SELECT [ SupervisorID = 222] ( EMPLOYEE) )

Which gets evaluated in two steps: First, the SELECT returns OUTPUT1, and then, the PROJECT returns the final OUTPUT.

Step 1:

*OUTPUT1*
Name	ID	SupervisorID	DeptNo
John	111	222	5
William	555	222	5
Joyce	666	222	5
James	777	222	5

and then Step 2:

OUTPUT2
Name	ID
John	111
William	555
Joyce	666
James	777

Join Operations:

As we have seen, the entire process of Normalization broke down all our data into many small tables. Of course, very often we need to get information from not one, but a combination of two or more tables. RA allows us to do so by providing an operator called JOIN. This operator combines the information in two Relational Tables. In particular, we look at the operation THETA-JOIN.

SYNTAX:

THETA-JOIN [conditions] ( TABLE1, TABLE2)

In the above, TABLE1 and TABLE2 can possibly be the same table.

THETA-JOIN forms combinations of the tables TABLE1 and TABLE2. The output is another table, with all the attributes of TABLE1 and all attributes of TABLE2.

How it works:

Since Relational Tables are sets of tuples, first compute the CARTESIAN PRODUCT of the tables.
If TABLE1 has A rows, and B columns; TABLE2 has N rows and M columns, then:
Each tuple in the Cartesian product will have ( B + M) attributes.
The Cartesian product will have A*N rows.
For each row in the cartesian product, check if [conditions] evaluetes to TRUE or FALSE. If [conditions] are TRUE, add the tuple to OUTPUT.

Example:

THETA-JOIN [ID = IDno] ( EMPLOYEE, WORKS_ON )

OUTPUT
Name	ID	SupervisorID	DeptNo	IDno	ProjNo	Hours
John	111	222	5	111	1	2.5
John	111	222	5	111	2	1.5
Frankie	222	777	4	222	2	2.5
Frankie	222	777	4	222	3	5.5
Frankie	222	777	4	222	1	5.5
William	555	222	5	555	3	2.5
Joyce	666	222	5	666	1	3.5
Joyce	666	222	5	666	2	3.5

NOTES:

The result of THETA-JOIN is also a Relational Table.
Sometimes, the names of attributes in TABLE1 and TABLE2 can be the same. To make sure there is no confusion, the conditions are often written using the dot-notation, referring to an attribute, a, of table A as: A.a
Thus, the OUTPUT should actually have attributes called: EMPLOYEE.Name, EMPLOYEE.ID, ..., WORKS_ON.IDno, ..., WORKS_ON.Hours.
Another JOIN Operation, called NATURAL-JOIN is sometimes used. Here, ALL attributes of TABLE1 and TABLE2 which have the SAME NAME, must be equal in value.

An algorithm to perform the JOIN operation can be written as: For each tuple, t1_i, of TABLE1 { For each tuple, t2_j, in TABLE2 { if [conditions] are TRUE, add the combined tuple t1_it2_j to OUTPUT ; }}

Both, THETA-Join and NATURAL-JOIN operations are called INNER-JOIN operations. These have one common property:

If there is no tuple from TABLE2 matching the conditions with a tuple of TABLE1, then that tuple of TABLE1 does not occur in the OUTPUT.

Sometimes, when performing JOIN operations, we may specifically require that all tuples of TABLE1 must occur in the OUTPUT. If no matching tuples are found in TABLE2, then just enter NULL values for the attributes related to TABLE2. This operation is called a LEFT-OUTER-JOIN.

Example:

LEFT-OUTER-JOIN [ID = EmpID] ( EMPLOYEE, DEPENDENT )

OUTPUT
Name	ID	SupervisorID	DeptNo	EmpID	DepName	Relationship
John	111	222	5	111	Mike	Son
John	111	222	5	111	Anita	Daughter
Frankie	222	777	4	222	Jack	Son
Frankie	222	777	4	222	Jill	Wife
Frankie	222	777	4	222	John	Son
Alice	333	444	4	null	null	null
Jennifer	444	777	4	444	Ted	Son
William	555	222	5	null	null	null
Joyce	666	222	5	null	null	null
James	777	222	5	null	null	null
John	888	null	1	null	null	null

Similarly, we can define a RIGHT-OUTER-JOIN, where all tuples of TABLE2 appear at least once, with null values for tuples of TABLE1 when there is no match.

Set theoretic Operations: Union, Intersection, Difference, Division.

Since relational tables are sets of tuples, you are allowed to do any set-theoretic operation on them.

Union:

If two tables have the same attributes, you can perform a union of the two.

SYNTAX:

UNION ( TABLE1, TABLE2)

Example:

UNION ( (SELECT [EmpID = 222] (DEPENDENTS)), ( SELECT [EmpID = 444] ( DEPENDENTS) ) )

OUTPUT

EmpID DepName Relationship

222 Jack Son

222 Jill Wife

222 John Son

444 Ted Son

OUTPUT
EmpID	DepName	Relationship
222	Jack	Son
222	Jill	Wife
222	John	Son
444	Ted	Son

Intersection:

Performs set intersection on the tables.

SYNTAX:

INTERSECTION ( TABLE1, TABLE2)

Example:

INTERSECTION (( (SELECT [EmpID = 222] (DEPENDENTS)), ( SELECT [Relationship = SON] ( DEPENDENTS) ) )

OUTPUT

EmpID DepName Relationship

222 Jack Son

222 John Son

OUTPUT
EmpID	DepName	Relationship
222	Jack	Son
222	John	Son

Difference:

Performs set difference (every element of first set which is NOT a member of the second set is output).

SYNTAX:

DIFFERENCE ( TABLE1, TABLE2)

Example:

DIFFERENCE (( (SELECT [EmpID = 222] (DEPENDENTS)), ( SELECT [Relationship = SON] ( DEPENDENTS) ) )

OUTPUT

EmpID DepName Relationship

222 Jill Wife

OUTPUT
EmpID	DepName	Relationship
222	Jill	Wife

NOTE:

DIFFERENCE ( TABLE1, TABLE2) is NOT the SAME as

DIFFERENCE( TABLE2, TABLE1).

DivideBy:

You may also perform a set division operation on two tables. The operation is described using the simple tables and example that follow.

SYNTAX:

DIVIDEBY ( TABLE1, TABLE2)

The result is defined as follows:

Attributes of TABLE2 must be a proper subset of attributes of TABLE1. If not, the result is NULL.
We divide the attributes of TABLE1 into two subsets: T1unique, T1commonT2. T1unique has those attributes not shared by TABLE2, and T1commonT2 are all attributes which are in TABLE2.
The attributes in the DIVIDEBY will be exactly the same as T1unique.
For each unique value of T1unique in TABLE1, check if there is a tuple corrsponding to each tuple of TABLE2, in T1commonT2. If so, this value of T1unique will appear in the result.
Repeat this check for each unique value of T1unique in TABLE1.

This operation is best explained with a simple example:

Problem: Find the ID of Employees who work on ALL the projects.

You can try to use the other RA functions, and will soon discover that this query is not easy to solve. The best way to answer such For-ALL queries is the use of the DIVIDEBY operation, as follows.

We first find a table listing 'ALL the Projects':

PROJECTS = PROJECT[ ProjNo] ( WORKS_ON)

PROJECTS

ProjNo

1

2

3

Next, we find which employee is working on what projects:

E_WORKS_ON = PROJECT[ ID, ProjNo]( WORKS_ON)

E_WORKS_ON

ID ProjNo

111 1

111 2

222 2

222 3

222 1

555 3

666 1

666 2

E_WORKS_ON
ID	ProjNo
111	1
111	2
222	2
222	3
222	1
555	3
666	1
666	2

Now, we can find the answer we want by using the following query:

DIVIDEBY ( E_WORKS_ON, PROJECTS); The result is computed in the following way:

1. We note that T1unique = {ID}, T1commonT2 = {ProjNo}

2. The result therefore must be a single column table, with one attribute (ID).

3. Take the first unique value of T1unique: ID = 111.

4. For each value of ProjNo in PROJECTS, do we have a tuple in E_WORKS_ON corresponding to ID = 111 ?

NO: because there is no tuple in E_WORKS_ON with ID = 111, and ProjNo = 3.

Therefore, ID = 111 does not appear in the result.

5. Repeat step 4 for next unique value of T1unique, that is, Id = 222.

In this case, the test is true, and therefore ID = 222 appears in the result.

6. Repeat step 4 noxt unique value of T1unique, ID = 555.

The test again fails, and therefore ID = 555 is not in the result.

7. Similarly, the test fails for ID = 666; thus, ID = 666 will not appear in the output.

8. Since there are no more unique values of T1unique left, we are done, and the result is the following table:

RESULT

ID

222

This concludes our coverage of all important Relational Algebra operations. It can be shown (though we shall not do so) that RA is complete, in the sense that any modification you wold like to do on any subset of a set of relational tables can be done using some combination of RA commands. It is, as we have seen, an elegant mathematical tool. Nortice that RA is procedural -- an RA expression indicates the sequence of performing the operations, and therefore imposes some constraint on the use of smart algorithms to perhaps improve the performance of some queries. It is easy to construct two queries that will yield the same result, but require different amount of effort to compute.

One language that can logically describe the output of a query, without imposing any restrictions on the procedure that must be used to compute the output is Relational Calculus (RC). RC is equivalent to RA, in the sense that any output that can be expressed in RA can also be expressed in RC, and vice versa. In the following lectures, we shall study the most popular of all DML's for relational DBMS's: Structutred Query Language (SQL). SQL is based on relational calculus.