Problem Set #1: ECON 216
Professor Derrick Robinson
All datasets used in this problem set can be found in the Blackboard Problem Set 1 post.
1.1) For each of the following variables, determine whether the variable is categorical or numerical, and state its scale of measurement. If the variable is numerical, is it discrete or continuous?
Variable | Categorical (Y=1/N=0) | Numerical (Y=1/N=0) | Scale/Measure | Discrete (Y=1/N=0) | Continuous (Y=1/N=0) |
Household Cellphone Count | | | | | |
Monthly Data Usage (Megabytes) | | | | | |
Text Message Count | | | | | |
Time Spent Shopping in Bookstore | | | | | |
Number of Purchased Textbooks | | | | | |
Academic Major | | | | | |
Gender | | | | | |
Amount of Money Spent on Clothing | | | | | |
Favorite Department Store | | | | | |
Number of Shoes Owned | | | | | |
Monthly Mortgage Payments | | | | | |
Number of Jobs Worked over 20 years | | | | | |
Annual Household Income | | | | | |
Marital Status | | | | | |
1.11) The director of market research at a Trader Joe's wanted to conduct a survey in San Diego to understand how much time working women spend shopping for clothing in a typical month.
a) What type of data will the director need to collect?
b) Develop three categorical questions for a potential survey meant to collect this information.
c) Develop three numerical questions for a potential survey meant to collect this information.
1.26) Clean the following list of data, which indicates cellphone brands owned by a sample of 20 respondents.
a) Place data in a table and identify any irregularities. Clean the data by resolving irregularities.
b) Identify any missing values.
Apple, Samsung, Appel, Nokia, Blackberry, HTC, Apple, Samsung, HTC, LG, Blueberry, Samsung, Samsong, APPLE, Motorola, Apple, Samsun, Appl, Samsung.
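One way to approach the cleaning step is sketched below, using only the Python standard library. The correction map is an assumption made for illustration: in particular, treating "Blueberry" as a misspelling of "Blackberry" is a judgment call that the analyst should justify, not a fact given in the problem.

```python
from collections import Counter

# Responses exactly as listed in the problem statement.
raw = (
    "Apple, Samsung, Appel, Nokia, Blackberry, HTC, Apple, Samsung, HTC, LG, "
    "Blueberry, Samsung, Samsong, APPLE, Motorola, Apple, Samsun, Appl, Samsung"
)
responses = [item.strip() for item in raw.split(",")]

# Hypothetical correction map (an assumption, not part of the problem set).
corrections = {
    "Appel": "Apple",
    "APPLE": "Apple",
    "Appl": "Apple",
    "Samsong": "Samsung",
    "Samsun": "Samsung",
    "Blueberry": "Blackberry",
}

# Replace each irregular spelling with its assumed canonical brand name.
cleaned = [corrections.get(item, item) for item in responses]
counts = Counter(cleaned)

# The problem states a sample of 20 respondents; compare with what was recorded.
EXPECTED_RESPONDENTS = 20
missing = EXPECTED_RESPONDENTS - len(responses)

for brand, count in counts.most_common():
    print(f"{brand:<12}{count}")
print(f"Missing values: {missing}")
```

Comparing the number of recorded responses against the stated sample size is a simple way to detect missing values when the list itself contains no blank entries.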
1.46) The Pew Research Center releases reports based on surveys at its website, www.pewresearch.org. Visit this site and read an article of interest.
a) Provide a link to the article.
b) Describe the population of interest.
c) Describe the sample that was collected.
d) Describe a parameter of interest.
e) Describe the statistic used to estimate the parameter in (d).
1.48) The American Community Survey (www.census.gov/acs) provides data every year about communities in the United States. Addresses are randomly selected, and respondents are required to supply answers to a series of questions.
a) Describe a variable for which data is collected.
b) Is the variable categorical or numerical?
c) If the variable is numerical, is it discrete or continuous?
2.6) The following table represents world oil production in millions of barrels a day in 2013:
Region | Oil Production (millions of barrels a day) |
Iran | 2.69 |
Saudi Arabia | 9.58 |
Other OPEC countries | 17.93 |
Non-OPEC countries | 51.99 |
a) Compute the percentage of values in each category.
b) What conclusions can you reach concerning the production of oil in 2013?
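The percentage computation in (a) divides each category's value by the total across all categories. A minimal sketch, using the figures from the table above (the variable names are illustrative):

```python
# Oil production in millions of barrels a day, copied from the table above.
production = {
    "Iran": 2.69,
    "Saudi Arabia": 9.58,
    "Other OPEC countries": 17.93,
    "Non-OPEC countries": 51.99,
}

total = sum(production.values())

# Percentage share of each category: 100 * value / total.
percentages = {region: 100 * value / total for region, value in production.items()}

for region, pct in percentages.items():
    print(f"{region:<22}{pct:6.2f}%")
```

The shares should sum to 100% apart from rounding, which is a quick check on the arithmetic.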
2.7 & 2.29) Visier’s 2014 Survey of Employers explores current workforce analytics and planning practices, investments, and future plans. U.S.-based employers were asked what they see as the most common technical barrier to workforce analytics. The responses, stored in the Barriers dataset, were as follows:
Barriers | Frequency |
Data must be integrated from multiple sources | 68 |
Lack of automation/repeatable process | 51 |
Metrics need to be identified or defined | 45 |
Production is cumbersome | 42 |
Data quality is not reliable | 36 |
Sharing findings is challenging | 21 |
Analytic tools are too complex | 17 |
Ensuring security and integrity of workforce data | 17 |
Other | 3 |
a) Compute the percentage of values in each response category.
b) What conclusions can you reach concerning technical barriers to workforce analytics?
c) Construct a bar chart, a pie chart, and a Pareto chart.
d) What conclusions can you reach concerning technical barriers to workforce analytics?
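A Pareto chart orders categories from most to least frequent and overlays the cumulative percentage. The sketch below builds the underlying table from the frequencies above using only the standard library; keeping "Other" as the last category is a common convention that is assumed here, and drawing the chart itself would require a plotting tool of your choice.

```python
# Frequencies copied from the Barriers table above.
barriers = [
    ("Data must be integrated from multiple sources", 68),
    ("Lack of automation/repeatable process", 51),
    ("Metrics need to be identified or defined", 45),
    ("Production is cumbersome", 42),
    ("Data quality is not reliable", 36),
    ("Sharing findings is challenging", 21),
    ("Analytic tools are too complex", 17),
    ("Ensuring security and integrity of workforce data", 17),
    ("Other", 3),
]

total = sum(freq for _, freq in barriers)

# Pareto order: descending frequency, with "Other" kept last (assumed convention).
ordered = sorted(
    (b for b in barriers if b[0] != "Other"), key=lambda b: b[1], reverse=True
)
ordered += [b for b in barriers if b[0] == "Other"]

# Each row holds (label, frequency, percentage, cumulative percentage).
rows = []
running = 0.0
for label, freq in ordered:
    pct = 100 * freq / total
    running += pct
    rows.append((label, freq, pct, running))

for label, freq, pct, cum in rows:
    print(f"{label:<52}{freq:>4}{pct:>8.2f}%{cum:>9.2f}%")
```

The same percentage column can feed the bar chart and pie chart, while the cumulative column supplies the line on the Pareto chart.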
2.15 & 2.36) The NBACost dataset contains the total cost ($) for four average-priced tickets, two beers, four soft drinks, four hot dogs, two game programs, two adult-sized caps, and one parking space at each of the 30 National Basketball Association (NBA) arenas during the 2014-2015 season. These costs were:
246.39 | 444.16 | 404.60 | 212.40 | 477.32 | 271.74 | 312.20 | 322.50 |
261.20 | 336.52 | 369.86 | 232.44 | 435.72 | 541.00 | 223.92 | 468.20 |
325.85 | 281.06 | 221.80 | 676.42 | 295.40 | 263.10 | 278.90 | 341.90 |
317.08 | 280.28 | 340.60 | 289.71 | 275.74 | 258.78 |
a) Organize these costs as an ordered array.
b) Construct a frequency distribution and a percentage distribution based on quartile classes and median position.
c) Construct a histogram and a percentage polygon.
d) Around what value, if any, are the costs of attending a basketball game concentrated? Explain.
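As a starting point for (a) and (b), the sketch below sorts the 30 costs into an ordered array and locates the median and quartiles with Python's `statistics` module. Note the assumption: `statistics.quantiles` with its default "exclusive" method interpolates between neighboring values, which may differ slightly from the quartile rounding rules taught in the course.

```python
import statistics

# The 30 NBA fan costs ($) listed above.
costs = [
    246.39, 444.16, 404.60, 212.40, 477.32, 271.74, 312.20, 322.50,
    261.20, 336.52, 369.86, 232.44, 435.72, 541.00, 223.92, 468.20,
    325.85, 281.06, 221.80, 676.42, 295.40, 263.10, 278.90, 341.90,
    317.08, 280.28, 340.60, 289.71, 275.74, 258.78,
]

# (a) Ordered array: the values sorted from smallest to largest.
ordered = sorted(costs)

# Median of an even-sized sample: the average of the two middle values.
median = statistics.median(ordered)

# Quartile cut points (interpolated; see the assumption noted above).
q1, q2, q3 = statistics.quantiles(ordered, n=4)

print(ordered)
print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}")
```

The quartile cut points can then serve as class boundaries for the frequency and percentage distributions requested in (b).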
2.16 & 2.38) The Utility dataset contains the following data about the cost of electricity (in $) during July 2015 for a random sample of 50 one-bedroom apartments in a large city.
96 | 171 | 202 | 178 | 147 | 102 | 153 | 197 | 127 | 82 |
157 | 185 | 90 | 116 | 172 | 111 | 148 | 213 | 130 | 165 |
141 | 149 | 206 | 175 | 123 | 128 | 144 | 168 | 109 | 167 |
95 | 163 | 150 | 154 | 130 | 143 | 187 | 166 | 139 | 149 |
108 | 119 | 183 | 151 | 114 | 135 | 191 | 137 | 129 | 158 |
a) Construct a frequency distribution and a percentage distribution that have class intervals with the upper-class boundaries $99, $119, and so on.
b) Construct a cumulative percentage distribution.
c) Around what amount does the monthly electricity cost seem to be concentrated?
d) Construct a histogram and a percentage polygon.
e) Construct a cumulative percentage polygon.
f) Around what amount does the monthly electricity cost seem to be concentrated? Explain.
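The frequency, percentage, and cumulative percentage distributions in (a) and (b) can be tabulated as sketched below. The class width of $20 is inferred from the stated upper boundaries ($99, $119, ...), and the lowest class starting at $80 is an assumption chosen so that the smallest observation ($82) falls in the first class.

```python
# Electricity costs ($) for the 50 apartments listed above.
costs = [
    96, 171, 202, 178, 147, 102, 153, 197, 127, 82,
    157, 185, 90, 116, 172, 111, 148, 213, 130, 165,
    141, 149, 206, 175, 123, 128, 144, 168, 109, 167,
    95, 163, 150, 154, 130, 143, 187, 166, 139, 149,
    108, 119, 183, 151, 114, 135, 191, 137, 129, 158,
]

WIDTH = 20   # inferred from upper boundaries $99, $119, ...
LOWER = 80   # assumed lower limit of the first class

# Count how many observations fall in each class ($80-$99, $100-$119, ...).
n_classes = (max(costs) - LOWER) // WIDTH + 1
counts = [0] * n_classes
for cost in costs:
    counts[(cost - LOWER) // WIDTH] += 1

# Percentage and cumulative percentage distributions.
percentages = [100 * count / len(costs) for count in counts]
cumulative = []
running = 0.0
for pct in percentages:
    running += pct
    cumulative.append(running)

for i, (count, pct, cum) in enumerate(zip(counts, percentages, cumulative)):
    low = LOWER + i * WIDTH
    print(f"${low}-${low + WIDTH - 1}: {count:>3} {pct:>6.1f}% {cum:>6.1f}%")
```

The class with the largest count gives a first indication of where the costs are concentrated, which the histogram and polygons in (d) and (e) should confirm visually.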
2.25) How much time do college students spend using their cell phones, and on what activities? A 2014 Baylor University study showed college students spend an average of 9 hours a day using their cell phones across a range of activities. Use the CellPhoneActivity dataset.
a) Construct a bar chart, a pie chart, and a Pareto chart.
b) Which graphical method do you think is best for portraying these data?
c) What conclusions can you reach concerning how college students spend their time using cellphones?
2.40) The PropertyTaxes dataset contains data about the property taxes per capita ($) for the 50 states and the District of Columbia. Use this dataset to:
a) Construct a histogram and a cumulative percentage polygon to visualize the data.
b) What conclusions can you reach concerning the property taxes per capita?
2.52) College football is big business, with coaches’ pay in millions of dollars. The CollegeFootball dataset contains the 2013 total athletic department revenue and 2014 football coaches’ total pay for 108 schools. Use this data set to answer:
a) Do you think schools with higher total revenues also have higher total coaches’ pay?
b) Construct a scatter plot with total revenue on the X axis and total coaches’ pay on the Y axis.
c) Does the scatter plot confirm or contradict your answer to (a)?
d) Compute the covariance.
e) Compute the coefficient of correlation.
f) Based on (d) & (e), what conclusions can you reach about the relationship between coaches’ total pay and athletic department revenues?
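For (d) and (e), the sample covariance and the coefficient of correlation can be computed from their definitions, as in the sketch below. The function names and the three-point toy data are hypothetical and serve only to show the mechanics; they are not values from the CollegeFootball dataset.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Coefficient of correlation: covariance scaled by both standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)


# Hypothetical toy data (NOT from the dataset): pay is exactly twice revenue.
revenue = [1.0, 2.0, 3.0]
pay = [2.0, 4.0, 6.0]

cov = sample_covariance(revenue, pay)
r = correlation(revenue, pay)
print(f"covariance = {cov:.2f}, correlation = {r:.2f}")
```

The covariance depends on the units of both variables, while the correlation is unit-free and bounded between -1 and +1, which is worth keeping in mind when answering (f).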
2.55) The data in NewHomeSales represent the number and median sales price of new single-family houses sold in the United States, recorded at the end of each month from January 2000 through December 2014.
a) Construct a time-series plot of new home sales prices.
b) What pattern, if any, is present in the data?
2.96) The DomesticBeer dataset contains the percentage alcohol, number of calories per 12 ounces, and number of carbohydrates (in grams) per 12 ounces for 158 of the best-selling domestic beers in the United States.
a) Construct a percentage histogram for percentage alcohol, number of calories per 12 ounces, and number of carbohydrates (in grams) per 12 ounces.
b) Construct three scatter plots: percentage alcohol versus calories, percentage alcohol versus carbohydrates, and calories versus carbohydrates.
c) Discuss inferences you have developed based on the visualization provided in (a) and (b).
2.97 & 3.41) The CigaretteTax dataset contains the state cigarette tax ($) for each state as of January 1, 2015.
a) Organize the data into an ordered array.
b) Plot a percentage histogram.
c) What conclusions can you reach about the differences in the state cigarette tax between the states?
d) Compute the population mean and population standard deviation for the state cigarette tax.
e) How do you interpret the parameters from (d)?
3.1) The following set of data is from a sample of n=5: 7 4 9 8 2
a) Compute the mean, median, and mode.
b) Compute the range, variance, standard deviation, and coefficient of variation.
c) Compute the Z scores. Are there any outliers?
d) Describe the shape of the data set.
Descriptive Statistics | ||
Descriptors | Values | |
Mean | ||
Median | ||
Mode | ||
Range | ||
Variance | ||
Standard Deviation | ||
Coefficient of Variation | ||
Z-Score | Z-Score | Outlier, Yes (1,0) |
Skewness Shape | ||
Apex Shape |
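The descriptors in the table above can be computed for the sample 7, 4, 9, 8, 2 as sketched below. Two assumptions are made: the variance and standard deviation use the sample (n - 1) formulas, and an observation is flagged as an outlier when its Z score is below -3 or above +3, a common rule of thumb that may differ from the rule used in class.

```python
import statistics
from collections import Counter

data = [7, 4, 9, 8, 2]

mean = statistics.mean(data)
median = statistics.median(data)

# Mode: values that occur most often; empty if every value occurs only once.
freq = Counter(data)
top = max(freq.values())
modes = [value for value, count in freq.items() if count == top] if top > 1 else []

data_range = max(data) - min(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # sample standard deviation
cv = 100 * std_dev / mean              # coefficient of variation, in percent

# Z score: distance from the mean measured in standard deviations.
z_scores = [(x - mean) / std_dev for x in data]

# Assumed outlier rule: |Z| greater than 3.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print(f"mean={mean}, median={median}, modes={modes}, range={data_range}")
print(f"variance={variance:.2f}, std dev={std_dev:.4f}, CV={cv:.2f}%")
print("Z scores:", [round(z, 3) for z in z_scores])
print("Outliers:", outliers)
```

Comparing the mean with the median gives a first hint about skewness for (d): a mean below the median suggests a longer left tail.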
3.16 & 3.33) The HotelAway dataset contains the average room price (in US$) paid by various nationalities while traveling abroad (away from their home country) in 2014.
a) Compute the mean, median, and mode.
b) Compute the range, variance, standard deviation, and coefficient of variation.
c) Compute the Z scores. Are there any outliers?
d) Construct a boxplot and describe the shape of the data set.
e) Based on the results of the table, what conclusions can you reach concerning the room price (in US$) in 2014?
Descriptive Statistics | ||
Descriptors | Values | |
Mean | ||
Median | ||
Mode | ||
Range | ||
Variance | ||
Standard Deviation | ||
Coefficient of Variation | ||
Z-Score | Z-Score | Outlier, Yes (1,0) |
Skewness Shape | ||
Apex Shape |
3.71) The Protein dataset contains the cost per meal and the ratings of 50 center city and 50 metro area restaurants on their food, décor, and service (and their summated ratings). Complete the following for the center city and metro area restaurants:
a) Construct a boxplot of the cost of a meal. What is the shape of the distribution?
b) Compute and interpret the correlation coefficient of the summated rating and the cost of a meal.
c) What conclusions can you reach about the cost of a meal at center city and metro area restaurants?
3.73) What was the mean price of a room at two-star, three-star, and four-star hotels in the major cities of the world during the first half of 2014? The HotelPrices dataset contains the mean prices in English pounds (one pound was approximately US$1.57 as of May 2015) for the three categories of hotels. Do the following for two-star, three-star, and four-star hotels:
a) Compute the mean, median, first quartile, and third quartile.
b) Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c) Interpret the measures of central tendency and variation within the context of this problem.
Descriptive Statistics | |
Descriptors | Values |
Mean | |
Median | |
Mode | |
First Quartile | |
Third Quartile | |
Range | |
Interquartile Range | |
Variance | |
Standard Deviation | |
Coefficient of Variation |
d) Construct a boxplot. Are the data skewed? If so, how?
e) Compute the covariance between the mean price at two-star and three-star hotels, between two-star and four-star hotels, and between three-star and four-star hotels.
f) Compute the coefficient of correlation between the mean price at two-star and three-star hotels, between two-star and four-star hotels, and between three-star and four-star hotels.
g) Which do you think is more valuable in expressing the relationship between the mean price of a room at two-star, three-star, and four-star hotels – the covariance or the coefficient of correlation? Explain.
h) Based on (f), what conclusions can you reach about the relationship between the mean price of a room at two-star, three-star, and four-star hotels?