Winter 2020


**Announcements**

· Please form groups of 5 students each by the end of
week 2; any groups not of size 5 will be subject to re-assignment of group
membership.

· Be sure to attend the week 1 TA sessions, which will cover
how to get started with R, a statistical programming language and environment.

· Student presentations are mostly on Mondays, but could
also be on other days if Monday is a holiday, or if we need more time slots.

· If you have added the course late, please email your
TA to obtain access to Gradescope.

· Suggestions for student presentations:

1) Bring your laptop and **adaptor**, or a USB drive to be plugged into my MacBook Air.

2) Show your data set (perhaps in Excel).

3) Show/talk through background, description, summary etc.

· Instructions for presentation on Wed 1/22:

1) Groups #1 and 2 will present;

2) Prepare a few slides to describe your data and research question; try not to just show a web link, which might have too much content for the audience to read through in a short amount of time;

3) Be prepared to have your data ready to show (e.g. in Excel);

4) Show your Table 1;

5) Show plots;

6) Each member of the group should present something; aim for a total of 20 minutes.

· **Important from the grader:**

1. Use your name as it appears in Gradescope.

2. For a group project, only one member of the group should submit the homework and add the other group members on the Gradescope submission page; do NOT make duplicate submissions.

3. On the first page of each group assignment, put down the names and email addresses of all members.

· Groups #3 and 4 will continue the presentations on Monday
1/27 (the membership of groups 5, 8, 17 and 19 has been updated; please check
Canvas).

· All are **required** to attend student presentations: it shows respect for your fellow students, and it
is where you learn to improve on your work. Attendance will be checked at random,
and each absence will be subject to a 2% deduction in the final grade.

· Groups #5 and 6 will present on Monday 2/3: group 5
will present their data set as the previous groups did; group 6 will present
the simulation assignment of week 4.

· TA sessions have started to introduce **R Markdown**. You are encouraged to use it
in the week 5 homework, and will be **required** to use it starting from the week 6 homework and for the rest of the quarter when
doing data analysis, including the final project.

· Groups #7 and 8 will present part 3) of the
week 5 assignment on Monday 2/10; be sure to give an overview of your data first.

· Groups #9 and 10 will present on Wednesday 2/19 part
3) of week 6 assignment; group 9 please focus on discussing the use of
regression models for prediction, and group 10 on the use for causal inference.

· Groups #11 and 12 will present on Monday 2/24 part 2)
of week 7 assignment (although the written version is due end of week 8); group
11 please focus on the stepwise approach, and group 12 on R-squared and AIC.

· Groups #13 and 14 will present on Monday 3/2 part 2)
of week 8 assignment; please have R on your laptop so we can generate random
group numbers to take attendance.

· Week 10 presentations: each group, please give a
comprehensive presentation of the analysis of your data set and pay some attention
to univariate screening; if you have time to try some tree-based methods,
that will be nice too.

**Overview:** This course will build upon previous
iterations of MATH 189, with emphasis on data sense, intuition and
data analysis skills. Speaking personally as the instructor, I have been doing and then overseeing real-world
data analysis for over 20 years since my Ph.D. The
course is planned to be taught in a more interactive fashion during lecture
hours, including discussions, group presentations and critiques every week. In
other words, this will be a partially ‘flipped’, or hybrid, classroom. The main
idea is to learn from mistakes. For this purpose, sometimes a preliminary
version of the homework needs to be done (and collected) ahead of the
presentation, and an improved version can then be handed in. Formal statistical
knowledge is necessary for good analysis, but it is not the sole target of this
course; it will be covered by a combination of prerequisites, lectures and TA
sessions.

After an initial introduction to all things data related, content-wise we will focus more on categorical outcome data (a gap in our undergraduate curriculum), which leads naturally into classification and related topics.

**Important Note:** You are strongly encouraged to attend
lectures, where interactions happen.

**Lecture:** MWF 4:00-4:50pm, CSB 001

**Instructor:** Ronghui (Lily) Xu

**Office:** APM 5856

**Phone:** 534-6380

**Email:** rxu@ucsd.edu

**Office Hours:** Wed 2-3pm, or by appointment.

**Teaching Assistants:**

Yuqian Zhang, yuz643@ucsd.edu, office hours F 2-4pm, APM 1210

Yuyao Wang, yuw079@ucsd.edu, office hours F 12:30-2:30pm, APM 6446


**Reference books:**

1. OpenIntro Statistics, https://www.openintro.org

2. James et al., An Introduction to Statistical Learning, Springer, 2013.

3. Li and Xu (eds), High-Dimensional Data Analysis in Cancer Research, Springer, 2009.

**Topics covered:** (future topics are subject to updates)

Week 1: active learning and why; importance of communication in data science; random variables, sample and population; data examples; unboxing the data.

Week 2: exploratory data analysis; “Table 1”; principles of visualization (better plots).

Week 3: visualization continued; concept of inference.

Week 4: confidence intervals; hypothesis testing.

Week 5: exact tests; 2x2 contingency table; odds ratio.

Week 6: logistic regression: inference.

Week 7: logistic regression: variable screening, model building (stepwise, generalized R-squared, information criteria).

Week 8: prediction error; cross-validation; classification trees (CART approach).

Week 9: tree-based ensembles; high-dimensional data methods (LASSO).


**Homework:** You may discuss the homework with others, but please write it up independently. Write your solutions, answers and
results **in your own words** (and in complete sentences). In general, clearly
lay out the context (including background and setup as applicable), the solution, and the
interpretation of the analysis results for a non-statistical audience in
the main part, and **append the R program code at the back**; all of it needs to be turned in. More instructions will be
given for specific assignments. Any two students/groups turning in
exactly the same solutions may be considered to have committed plagiarism, in which case 0 points
will be given to all parties involved, and any additional action will be
determined by the office for academic integrity.

Homework is due in Gradescope
by 11:59pm on Sunday of the same week (e.g. Week 1 is
due on 1/12) unless specified otherwise. No late HW will be accepted. **Assignments are individual unless
specified as group assignments**.

Week 1:

1) Write a paragraph with at
least 5 (and no more than 10) sentences on how you feel about 'comfort zone'.
Then write a 2nd paragraph with 2-3 sentences on how it relates to this course
(MATH 189).

2) Find your own data set and
write a description about: where it came from, what it was collected for, how
many observations and how many variables, what are some examples of research
questions it can answer. Prepare to discuss your data next Monday.

Week 2:

1) Submit your group membership in a PDF file with 5 names, and you will get a group number assigned in Gradescope.

2) **[due 11:59pm on Tuesday 1/21; you may do it as a group or individual assignment]** Continue with the data set and description from week 1 (be sure to include the description), or use a different data set, in which case you need to rewrite the description. Think of and state clearly a research question in which you will compare 2-4 groups (if the data do not come with this many groups, you can most likely categorize a variable into groups). Produce a Table 1 similar to the one in the Leflunomide paper, then plot histograms, densities, and boxplots for the continuous variables, and bar plots for the discrete variables. Do the plots by the same groups as in your Table 1, and try to place them side by side. Do these plots for up to 10 variables.

Week 3:

[group project] Submit an improved and final version of #2) from Week 2 above. As I said in class, you don’t need to provide p-values in Table 1.

Week 4:

1) The New York Times is well known for its data graphics. Find a favorite data presentation of theirs (Upshot is a good section to try), and submit it with the reasons why you like it.

2) A) For Y ~ Binomial (n, p), write down the formula for a 95% confidence interval (CI) of p. B) For n = 100 and p = 0.1, 0.2, 0.3, 0.4, 0.5, respectively, simulate 500 such Y’s. Tabulate over these 500 simulation runs: the average of the estimated p’s (call them p_hat’s), the empirical variance of the p_hat’s, the average of the estimated variances of the p_hat’s, the proportion of the 95% CI’s that contain the true value of p, and the average of the length of the 95% CI’s. Discuss the simulation results.
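As a rough illustration of the simulation described in part 2), the sketch below is written in Python using only the standard library (the course itself uses R, where the sampling would be done differently). It assumes the Wald interval, p_hat ± 1.96·sqrt(p_hat(1 − p_hat)/n), as the 95% CI; other intervals are possible. The function name `simulate_wald_ci` and the seed are illustrative choices, not part of the assignment.

```python
import math
import random
import statistics


def simulate_wald_ci(n, p, runs=500, z=1.96, seed=189):
    """Simulate `runs` draws of Y ~ Binomial(n, p) and summarise the Wald 95% CI.

    Assumption: the CI is the Wald interval p_hat +/- z * sqrt(p_hat(1 - p_hat)/n).
    """
    rng = random.Random(seed)
    p_hats, est_vars, lengths = [], [], []
    covered = 0
    for _ in range(runs):
        # Draw Y as the number of successes in n Bernoulli(p) trials.
        y = sum(rng.random() < p for _ in range(n))
        p_hat = y / n
        var_hat = p_hat * (1 - p_hat) / n  # estimated variance of p_hat
        half = z * math.sqrt(var_hat)      # half-width of the interval
        p_hats.append(p_hat)
        est_vars.append(var_hat)
        lengths.append(2 * half)
        covered += (p_hat - half <= p <= p_hat + half)
    return {
        "mean_p_hat": statistics.fmean(p_hats),
        "empirical_var": statistics.variance(p_hats),
        "mean_est_var": statistics.fmean(est_vars),
        "coverage": covered / runs,
        "mean_length": statistics.fmean(lengths),
    }


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(p, simulate_wald_ci(n=100, p=p))
```

Each row of the resulting table corresponds to one value of p; comparing `empirical_var` with `mean_est_var`, and `coverage` with the nominal 0.95, is the kind of discussion the assignment asks for.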

Week 5 [group project]:

1) Constructive criticism for student presentations.

2) Find out (you can search online) what reproducible research is and why it is important. Write a paragraph about it together with your thoughts on how it relates to data analysis. Seven sentences minimum.

3) Continue with your data set from week 3 and keep the description. Reduce to 2 groups if you had more, by either excluding the extra group(s) or combining groups so that 2 remain. Identify an outcome variable that you are interested in comparing between these 2 groups. Make sure that your outcome is binary, dichotomizing it if it is not binary initially. Then do the following:

a) State the research question of interest;

b) Discuss why it is reasonable to assume that the observations are i.i.d. (if they are not, for example because they were collected over several years, you might want to reduce your data to include just one year);

c) Set up the null and alternative hypotheses (introduce the random variables, distribution(s) and parameter(s) first), and use a two-sided significance level of 0.05;

d) Find the 95% confidence intervals for p1, p2, the risk difference, risk ratio, and odds ratio;

e) Carry out both the Chi-squared test and Fisher’s exact test, and discuss their suitability for your data.
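For part d), one common large-sample (Wald-type) construction is sketched below. The notation is assumed here for illustration only: $x_i$ events out of $n_i$ observations in group $i$, with $\hat p_i = x_i / n_i$. The lectures may have used other intervals (for example exact or score-based ones), so this is a reference sketch rather than the required method.

$$
\begin{aligned}
p_i &: \quad \hat p_i \pm 1.96\sqrt{\frac{\hat p_i(1-\hat p_i)}{n_i}}, \qquad i = 1, 2 \\
\text{risk difference} &: \quad (\hat p_1 - \hat p_2) \pm 1.96\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}} \\
\text{risk ratio} &: \quad \exp\left\{\log\frac{\hat p_1}{\hat p_2} \pm 1.96\sqrt{\frac{1}{x_1} - \frac{1}{n_1} + \frac{1}{x_2} - \frac{1}{n_2}}\right\} \\
\text{odds ratio} &: \quad \exp\left\{\log\frac{x_1 (n_2 - x_2)}{x_2 (n_1 - x_1)} \pm 1.96\sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}\right\}
\end{aligned}
$$

The intervals for the risk ratio and the odds ratio are computed on the log scale and then exponentiated; they are undefined when any of the cell counts involved is zero.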

Week 6 [group project, due 11:59pm on Monday 2/17]:

1) Constructive criticism for student presentations.

2) Research and write about the use of regression models in the context of a) prediction, b) causal inference on the effect of a variable on the outcome.

3) [preliminary version] With your data set (same as before or a different one that consists of i.i.d. observations suitable for multiple logistic regression with a binary outcome), keep the overall description as before. Then do the following:

a) Describe the distribution of the outcome variable, and identify a main predictor whose effect on the outcome you are interested in studying (this can be the group variable from week 5 or a different one);

b) Identify other variables (i.e. predictors, often called covariates) that might be related to the outcome or the main predictor, and discuss these variables in the context of part 2) above of this assignment;

c) Carry out univariate logistic regression of the outcome on each of the predictors, including the main predictor, and interpret the results in terms of odds ratio etc.;

d) Fit a multiple logistic regression model by including more than one predictor, and interpret the results in terms of conditional odds ratio etc.
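As a reminder of what is being interpreted in parts c) and d), the multiple logistic regression model can be written as follows. The symbols ($Y$ for the binary outcome, $X_1, \dots, X_k$ for the predictors, $\beta_j$ for the coefficients) are notation assumed here, not taken from the assignment.

$$
\log\frac{P(Y = 1 \mid X_1, \dots, X_k)}{1 - P(Y = 1 \mid X_1, \dots, X_k)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k
$$

Under this model, $\exp(\beta_j)$ is the odds ratio for a one-unit increase in $X_j$ with the other predictors held fixed, which is the conditional odds ratio in part d). With a single predictor ($k = 1$), $\exp(\beta_1)$ is the unadjusted odds ratio from the univariate fit in part c).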

Week 7 [group project, due 11:59pm on Sunday 3/1]:

1) Constructive criticism for student presentations.

2) [improved version] Continue with and polish your work in part 3) of week 6. Also do the following:

a) Instead of 3d) from week 6, use one of the stepwise procedures we talked about, together with computing the generalized R-squared and AIC for each model considered during the process, to arrive at a ‘final’ multiple logistic regression model. Consider interaction terms also. Interpret the results from your final model in the context of the research question that you are trying to answer.

b) Towards the end of your report, write a paragraph discussing limitations of your data source, assumptions, approaches etc. as applicable. If the grader left comments about the i.i.d. assumption on your week 5 homework, be sure to include a discussion of those.
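For part a) above, the two model-comparison quantities have the following standard forms. The notation is assumed here: $\ell(\hat\beta)$ is the maximized log-likelihood of the fitted model, $\ell_0$ that of the intercept-only model, $k$ the number of estimated parameters, and $n$ the sample size. The generalized R-squared shown is the likelihood-based (Cox and Snell) version; whether the lectures used this form or a rescaled one is not stated here.

$$
\mathrm{AIC} = -2\,\ell(\hat\beta) + 2k, \qquad
R^2_{\text{gen}} = 1 - \exp\left\{-\frac{2}{n}\left(\ell(\hat\beta) - \ell_0\right)\right\}
$$

A smaller AIC indicates a better trade-off between fit and model size. For a binary outcome $R^2_{\text{gen}}$ cannot reach 1; its maximum is $1 - \exp(2\ell_0/n)$, and dividing by this maximum gives the rescaled (Nagelkerke) version.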

Week 8 [group project]:

1) Constructive criticism for student presentations.

2) [final version] Continue with your work from the previous two weeks. Take your final model from week 7 and do the following:

a) Perform prediction on the whole data set, plot the ROC curve and compute the AUC;

b) Use a randomly chosen 90% of your observations as the training sample to fit the final model (if your data set is too small, you may reduce your final model to a smaller one this week), and use the remaining 10% as the test sample to compute the out-of-sample AUC;

c) Now, instead of a single training/test split, carry out 10-fold CV with 10 repetitions to estimate the out-of-sample AUC;

d) Comment on your results.
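The two building blocks of parts a) to c) can be sketched as follows, in Python with only the standard library (the course itself uses R). The sketch assumes the AUC is computed as the proportion of (positive, negative) pairs in which the positive observation gets the higher predicted score, with ties counted as one half. The function names `auc` and `repeated_kfold` are hypothetical, and model fitting is deliberately left out.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 1/2.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n, k=10, repeats=10, seed=189):
    """Yield (train_idx, test_idx) for `repeats` random k-fold partitions of range(n)."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # Deal the shuffled indices into k folds of (nearly) equal size.
        folds = [idx[i::k] for i in range(k)]
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(sum(1 for _ in repeated_kfold(50)))
```

For each split one would fit the model on the training indices, predict on the test indices, and compute `auc` there. Whether to average the per-fold AUCs or to pool the held-out predictions before computing a single AUC is a design choice worth stating in the report.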

Week 10:

Constructive criticism for each of the 4
group presentations: 1) what the data were about and what analyses were done;
2) strengths of the analyses and presentation; 3) room for improvement.

Final Group Project (20%, due 11:59pm on Sunday 3/15):

1) Compile a collection of tips for best data presentation, including the illustration and R script for each. (5%)

2) See Canvas. (15% + 2% bonus)

**Grading:** 70% Homework (20% preliminary + 50% improved) + 10% Presentation + 20% Final Project

Note: we will drop at least one lowest HW score before computing the final grade. Otherwise, each week’s assignment carries the same weight within the 20% and the 50%, respectively.