BA371 — Assignment 2
Note: this is a big assignment; give yourself some time for it
In this assignment you must use SQL queries to answer a number of questions concerning data which were collected as part of an IS usability experiment.
The experiment asked participants to search a website for information using one of two types of user interfaces. One interface was list based;
i.e., search results were shown in the form of a list. The other interface was map based, meaning that search results were displayed on a (2D) map
(see examples of both interfaces below).
[Figures: List interface; Map interface]
The data collected from this experiment are stored in a SQL Server database.
SQL Server is Microsoft's industrial-strength relational database management
system (RDBMS). It runs as a server application that manages individual
databases. Users connect to these databases through a client program.
To connect to your database from a COB machine, run Microsoft SQL Server Management Studio
(Start -> Microsoft -> SQL Server xxxx Management Studio).
For the server name use cob-bissql and select Connect (you are connecting through
Windows authentication).
(If you want to interact with this database from off-campus, first connect to the campus network
through VPN and then connect to the database with any DB querying tool that
supports MS SQL Server. You may need a password to connect to your database. See the Start Here ->
BA371 Database Resources Canvas Files page for connection/password information.)
On the left-hand side (Databases) you will see all the databases stored on the server cob-bissql; your database is
named StDB_<your_onid_id>.
Open a new query window by selecting the New Query button below the File menu. Make sure that the database
selected is indeed your database. You can always make sure that you are using your database by running
the following SQL command:
use <database_name>;
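To double-check which database the query window is currently pointed at, something like the following sketch may help. The database name shown is a made-up placeholder that follows the StDB_<your_onid_id> pattern; substitute your own.

```sql
-- Hypothetical database name; replace it with your own StDB_... database.
use StDB_your_onid_id;

-- db_name() returns the name of the database the current session is using.
select db_name() as current_database;
```

If the value returned is not your own database, rerun the use command before doing anything else.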
Your database is currently empty and you need to fill it first. To do this, copy the content
of this SQL script into the query window and run the query.
The script will take a few seconds to finish.
To see if the script did what it was supposed to do, run the following SQL query:
select count(*) from experiment_data;
The result should be 5204, indicating that you now have an experiment_data table which holds 5204 records.
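Beyond counting records, it can be useful to look at a handful of actual rows to get a feel for the data. A minimal sketch, assuming the script ran successfully and the id column is populated:

```sql
-- Show the five records with the lowest id values.
-- Without the order by clause, top returns an arbitrary set of rows.
select top 5 *
from experiment_data
order by id;
```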
The data for this assignment are stored in four tables:
participants table:
+-----------------+--------------+
| Field           | Type         |
+-----------------+--------------+
| username        | varchar(255) |
| user_type       | varchar(10)  |
| years           | int          |
| low_grade       | int          |
| high_grade      | int          |
| on_line         | varchar(10)  |
| on_line_sources | varchar(255) |
| location        | varchar(5)   |
| exp_condition   | int          |
+-----------------+--------------+
This table contains the information for each of the experiment's participants. The meaning of the columns is as follows:
- username: id of the participant.
- user_type: was the participant a student or a licensed teacher.
- years: years of teaching experience.
- low_grade: lowest grade the participant teaches.
- high_grade: highest grade the participant teaches.
- on_line: does the participant use on-line lesson materials?
- on_line_sources: if so, which ones?
- location: participant's US state.
- exp_condition: 1=list, 2=map.
tasks table:
+------------+--------------+
| Field      | Type         |
+------------+--------------+
| username   | varchar(255) |
| task       | varchar(5)   |
| confidence | int          |
| sim_helpd  | int          |
+------------+--------------+
This table lists the various tasks participants conducted during the experiment:
- username: id of the participant.
- task: id of the task the participant completed.
- confidence: confidence that the participant had in having done the task well (1-6).
- sim_helpd: participant's opinion that the map or list (depending on which one they used) was helpful in conducting the task (1-6).
documents table:
+--------------+--------------+
| Field        | Type         |
+--------------+--------------+
| username     | varchar(255) |
| task         | varchar(5)   |
| doc_type     | varchar(10)  |
| used_tool    | int          |
| relevant     | int          |
| motivational | int          |
| concepts     | int          |
| background   | int          |
| grade_level  | int          |
| hands_on     | int          |
| attachments  | int          |
+--------------+--------------+
Participants were asked to search for documents on the web site as part of their tasks. Each row in the table indicates a document they found:
- username: id of the participant.
- task: id of the task the participant completed.
- doc_type: the type of the found document.
- used_tool through attachments: opinions about the usefulness of the found document (1-6).
experiment_data table:
+---------------+--------------+
| Field         | Type         |
+---------------+--------------+
| id            | int(11)      |
| host_ip       | varchar(20)  |
| referer       | varchar(255) |
| file          | varchar(255) |
| querystring   | varchar(255) |
| timestamp     | datetime     |
| username      | varchar(255) |
| source_doc_id | int(11)      |
| dest_doc_id   | int(11)      |
| exp_condition | int(11)      |
| map_action    | int(11)      |
+---------------+--------------+
All of the participants' actions associated with the 'list' and 'map' interfaces are captured in this table:
- id: record id.
- host_ip: IP of the machine from which the request came.
- referer: referring web page.
- file: requested web page.
- querystring: any parameters passed in the request.
- timestamp: time of the request (one-second accuracy).
- username: id of the participant.
- source_doc_id: the id of the document from which the request/action is launched.
- dest_doc_id: the id of the document resulting from the action.
- exp_condition: 1=list, 2=map.
- map_action: type of map action executed.
Note that this database violates 3NF. For instance, even though each participant participated in only one experimental
condition (a participant was exposed to either the list-based interface or the map-based interface),
the combination (username, exp_condition) is stored for each user interaction in the experiment_data table.
We are not asking you to change this. Just leave things as they are.
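If you are curious whether the redundant data are at least internally consistent, a query along the following lines could be used. It is only a sketch: it looks for usernames that appear with more than one exp_condition value in experiment_data. If every participant really was exposed to a single condition, it would be expected to return no rows.

```sql
-- List any username recorded under more than one experimental condition.
-- count(distinct ...) ignores NULL values.
select username,
       count(distinct exp_condition) as n_conditions
from experiment_data
group by username
having count(distinct exp_condition) > 1;
```

This kind of check is exactly what a 3NF design makes unnecessary, because the condition would then be stored only once per participant.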
To see the structure of a table in SQL Server, use the SQL Server sp_help command: e.g.,
sp_help experiment_data;
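As an alternative to sp_help, the standard INFORMATION_SCHEMA views can be queried with ordinary SQL. The sketch below lists the columns of experiment_data; the choice of which metadata columns to show is just an illustration.

```sql
-- Column names, types, and lengths for one table, in definition order.
select COLUMN_NAME,
       DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH,
       IS_NULLABLE
from INFORMATION_SCHEMA.COLUMNS
where TABLE_NAME = 'experiment_data'
order by ORDINAL_POSITION;
```

Because this is a regular select statement, its output can be filtered and sorted like any other query result.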
This exercise consists of a series of SQL operations that you must formulate and execute to retrieve
information about the experiment. It gives you the opportunity to learn a few things:
- Query and manipulate relational database tables with SQL.
- Further familiarize yourself with text-driven SQL command interpreters.
- Learn some of the limits of SQL and how to work around them.
IMPORTANT NOTE!! Since this is a learning exercise, it is quite possible that at some point you will
break things. You might delete the wrong records or perhaps even wipe out an entire database table.
DON'T PANIC!! If you find yourself in this situation, simply do the following:
- Drop the various tables:
- drop table participants;
- drop table tasks;
- drop table documents;
- drop table experiment_data;
- Rerun the above-mentioned script; it will regenerate the tables and their contents for you.
- If all of this fails, see your instructor for help.
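A plain drop table statement raises an error when the table is already gone, which can happen if you have dropped some tables but not others. One possible way around this, sketched below, is to guard each drop with a check on object_id, which returns NULL when no user table ('U') of that name exists in the current database.

```sql
-- Each drop only runs if the table currently exists.
if object_id('participants', 'U') is not null drop table participants;
if object_id('tasks', 'U') is not null drop table tasks;
if object_id('documents', 'U') is not null drop table documents;
if object_id('experiment_data', 'U') is not null drop table experiment_data;
```

After these statements have run, rerunning the setup script should start from a clean slate.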
Below are 18 questions to answer and some directives to follow.
Please note:
- You MUST write your SQL from scratch.
Do NOT use any of the graphics-based facilities in Access or SQL Server to put a
query together and then 'back out' the corresponding SQL as it will be terribly bloated and you will not receive any credit for it (not to
mention that you do not learn much that way).
- Unless mentioned otherwise, only SQL code should be used for this exercise.
For each of the 18 questions below turn in the question (!!! the whole question, not just the question number !!!), the
SQL you used to find its answer (some questions do not require any SQL), and the answer you found.
For each of the questions, write one or more queries which return(s) the exact answer and !!!ONLY!!! the exact answer to the question.
For instance, question 1 below asks for SQL code which returns only a single number ("How many"). Thus, your query should return just a single number; i.e., a single record with a single column containing a count.
- What is the total number of documents retrieved by participants during the experiment? (Hint: your query should return only a single record, containing only a count).
- What is the average number of documents retrieved per participant? Note: write a single SQL query which gives you the answer.
Also, when the result of the query is '5,' you should realize that that is not exactly right.
- Which document types were involved in these retrievals? (Hint: your query should return four records, one of which is NULL)
- What does the NULL result from the previous query signify? (No SQL needed here; just think about this and provide your answer)
- Using only the documents table, how many different participants (usernames) participated in the experiment?
(Remember: the query should return only a single(!) number.)
- Check this result against the participants table and explain the difference. (Hint: the count from the previous question is different
from the number of different usernames in the participants table. How can that be?)
- What is the 'extra' username in the documents table which is missing from the participants table? (Hint: SQL query should return only a single record).
- Update the documents table to set the username of the 'extra' username in the documents table to Bang_Bang_Johnson and
check the result of question 5 again.
- Which participant has the longest username and how many characters does that username have? (use a single query)
(Hint: the longest username has 18 characters)
- Which two participants retrieved the most documents and how many documents did each of those participants retrieve? Note: your query should return only
two rows. (Hint: each of these two users retrieved 11 documents)
- How many tasks were completed for each of the two experimental conditions? (do this with a single query!!).
- Hint 1: The experimental condition is not stored in the tasks table; it is stored in the participants table. You will need a join
between the participants and the tasks tables to do this.
- Hint 2: there is a difference of 23 between the two numbers.
- Complete the table below; show the queries which gave you these results:
| Experimental condition | List (1) | Map (2) |
| number of participants |          |         |
| number of documents retrieved |          |         |
| smallest number of documents retrieved by any participant (you may use several queries for this) |          |         |
| largest number of documents retrieved by any participant (you may use several queries for this) |          |         |
| average number of documents retrieved per participant (do not use SQL for this; just compute from the numbers you already have) |          |         |
| sigma (std. dev.) of the number of documents retrieved per participant (use the SQL given below) |          |         |
The SQL to compute sigma is as follows (shown for condition 1; change the 1 to a 2 for the Map condition):
select stdev(my_table.my_count) from
(select count(*) as my_count
from documents, participants
where participants.username = documents.username
and participants.exp_condition = 1
group by participants.username) as my_table;
Hint: check the numbers in the table for consistency with some of the other numbers you have found. For instance,
the number of documents retrieved across the two conditions must equal the result of question 1.
- This is not a SQL question, but it is a natural follow-up to what you computed in the previous question (plus it explains why we take stats classes!).
From the results in the previous question we may conclude that participants in the 'Map' condition, on average, retrieve more documents than participants in
the 'List' condition. Before we draw that conclusion, however, we should ask ourselves if the apparent difference is likely the result of random
effects; i.e., we must check for the statistical significance of the difference. The test that applies here is the t-test for testing
equality of means in two samples.
One way to run this test is to plug in the values from the table above in an on-line t-test utility such as
https://www.graphpad.com/quickcalcs/ttest1/?Format=SD
Run this test and provide the resulting two-tailed p-value and your conclusion on whether or not the difference in means is
statistically significant at α=0.05.
- If we consider the two left-most bytes of an IP address to indicate the organization hosting the address, how many different
host_ips in our experiment are associated with the 128.138.* network? (Hint: the count should be 33)
- Not a SQL question: Which organization is associated with the 128.138 addresses? How about the 129.123 ones? Hint: find a few host_ips in the experiment_data
table which are associated with these organizations and then do a reverse DNS lookup (on Windows: nslookup command).
If a reverse lookup does not work, use the Internet to answer this question.
- List, in order from earliest to latest, the different dates (your SQL should return dates only, not times of day!) on which experimental data were collected and
the number of experiment_data records collected on each of those dates. Again, use a single query! (Hint: your list of dates should have 13 dates)
- What is the daily minimum, maximum and average number of experiment_data records collected? Again, do not read or compute these numbers
manually from the results in the previous question. Use SQL to do it. Hint: perhaps the easiest way to do this is in two steps:
first make a new table from the results of your previous query; one which contains the totals for each day, and then query that new table for the minimum,
maximum and average.
You can make a new table from a query using the select ... into new-table-name from ... syntax.
(Also: if the result for your average is 400, you should realize that that is not exactly correct!)
- How many days have passed between the first record being collected and the last (in the experiment_data table)? Note: you must write a SQL query which
computes this number. Do not compute it yourself from the min and max timestamps you computed in the previous question.
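Two of the hints above warn that a whole-number average (such as '5' or '400') is "not exactly right". The likely cause is that SQL Server performs integer arithmetic when all operands are integers, so the fractional part is discarded. The sketch below illustrates this on made-up values only; the temporary table #demo_values and its column v are hypothetical and have nothing to do with the assignment data. It also shows the select ... into form mentioned above, and assumes a SQL Server version that supports the values table constructor (2008 or later).

```sql
-- Integer division truncates the fractional part.
select 7 / 2 as int_result;           -- 3
select 7 * 1.0 / 2 as decimal_result; -- 3.5 (displayed with trailing zeros)

-- Build a small temporary table from a query with select ... into.
select v
into #demo_values
from (values (1), (2), (4)) as t(v);

-- avg over an int column returns an int; casting first keeps the fraction.
select avg(v) as int_avg,                  -- 2
       avg(cast(v as float)) as float_avg  -- roughly 2.33
from #demo_values;

-- Clean up the temporary table.
drop table #demo_values;
```

The same idea applies wherever a count is divided by another count: converting one operand to a decimal or float type before dividing preserves the fraction.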