
STAT0006 Autum



In-course assessment 3 (ICA 3), Autumn Term 2019 STAT0006
Overview of ICA 3
There are two tasks associated with this ICA. You must complete both. More details
are given below, but in brief:
1. Analyse the given dataset culminating in a linear model for life expectancy, writing
your findings as a short report.
2. Write a short supplement to the existing STAT0006 course notes on the implications
of heteroskedasticity of the error term in a linear regression model. This should
include a discussion of the methods available to reduce or eliminate heteroskedasticity,
and a demonstration of these methods in R. You should make sure that your
supplement is suitable for students on this course.
Task 1
The data
Data are available in the Excel file icadata.xlsx, and you should analyse them in any way you
see fit in order to address the task in hand.
Each row relates to a particular country. In the original data there were missing values
but these have been imputed for the purposes of this assessment. Your task is to investigate
the drivers of life expectancy in 2018 - that is, the mean number of years a newborn
would live if current mortality patterns were to stay the same (the variable lifeexp2018
in the dataset) - using the other variables in the dataset. Please see the associated data
dictionary for a description of each variable.
How do the factors in the given dataset collectively affect life expectancy? Your analysis
should include building a linear model for life expectancy. Please note that these are real
data. There is no ‘correct model’: it all depends on the assumptions you are willing to
make.
Write a report on your findings, which should include the following things:
• An initial exploratory analysis of the dataset. The aim of this is to give someone
who doesn’t have access to the data an overview of what the data are and a feel for
the variables in the dataset (e.g. summaries of each variable or simple relationships).
This should be non-technical.
• A description of how you approached the model-building phase. Don’t just show your
chosen final model. How did you choose your particular model? What processes did
you go through?
• How well does your final model fit the data? Note that you don’t need to write
about the fit of all models; you just need to convince me that your final model is a
reasonable fit for the data.
– Please note that these are real data.
– It can be the case that getting a linear model to fit well is a little tricky.
– This doesn’t mean that you shouldn’t care about how well the model fits your
data but you should be realistic in terms of how well a linear model will fit (i.e.
is the fit reasonable?).
• A brief description of the final model. What does it tell you about the drivers of life
expectancy?
• Conclusion, including a brief discussion of limitations of the data and model. Do
you think the model is reliable?
• Note that, unless you find the data to have an issue with heteroskedasticity, you don’t
need to apply the methods discussed in Task 2.
The maximum length for Task 1 is three sides of A4, which is to include plots/ tables/
figures. Make sure that any plots/ tables/ figures, if applicable, are legible (i.e. don’t
squeeze these in if they are not readable - you will be penalised for this). The minimum
font size is 11pt. You may choose your own margin size. Given the maximum length, you
are strongly advised to select plots/ tables/ figures with care.
Task 2
Heteroskedasticity (unequal variances) of the error term violates one of the assumptions
we make in fitting regression models. As a group, investigate:
(a) methods for detecting heteroskedasticity;
(b) how heteroskedasticity impacts model estimates (regression coefficients and their
standard errors, variance of the error term);
(c) methods to overcome this issue, including advice on when to use these methods and
how to implement them in R;
(d) a demonstration in R of the impact of heteroskedasticity on model estimates and
a commentary on how the methods you discuss in (c) help in dealing with heteroskedasticity.
The output should be a written document in the style of a supplement to the current
STAT0006 course notes. That is, your supplement should be accessible (understandable)
to STAT0006 students. Your supplement should not be overly mathematical. Instead,
your focus should be on explaining the practicalities of using the methods along
with their implications and interpretation.
This supplement should be no longer than four sides of A4. For this assignment you should
reference sources; the list of references is not included in the 4 page limit (i.e. you may
have additional pages with a list of references). However you should not quote directly
from any source, even if you put the text in quotation marks and reference it: everything
should be in your own words but acknowledging where you found the information. The
same goes for pictures or graphics: you may not copy and paste a picture/graphic that
you found elsewhere into your write-up. Minimum font size is 11pt, but you may choose
your margin size and font. Any graphics should be large enough to be easily readable,
adequately labelled, and captioned.
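As a hedged illustration of the graphical detection mentioned in (a), the sketch below plots residuals against fitted values for a simulated dataset. The data-generating choices here (error SD of 4 + 0.1x, the seed, and the variable names) are assumptions made for illustration only and are not part of the brief.

```r
# Illustrative sketch only: the error SD grows linearly with x (an assumed form).
set.seed(1)
n <- 1000
x <- runif(n, 0, 100)
y <- -10 + 0.4 * x + rnorm(n, mean = 0, sd = 4 + 0.1 * x)

fit <- lm(y ~ x)

# Residuals against fitted values: a funnel shape suggests non-constant variance.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# Rough numerical companion to the plot: residual SD in the lower and upper halves of x.
lower <- x <= median(x)
sd_by_half <- c(lower = sd(resid(fit)[lower]), upper = sd(resid(fit)[!lower]))
print(sd_by_half)
```

If the spread of the residuals is roughly constant across the fitted values, the two standard deviations should be similar; here the upper half is expected to be noticeably larger because of how the data were simulated.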
More details about what to include in your supplement for Task 2
I expect the supplement to contain the following for each of (a) through (d) listed above:
(a) An explanation of what heteroskedasticity is, and a discussion of how to detect such
problems. The emphasis of this section should be on the graphical methods as seen
in class though you are welcome to investigate other methods.
(b) A discussion of the implications of heteroskedasticity on the estimated regression
coefficients, their standard errors, and the variance of the error term (that is, how
is heteroskedasticity likely to affect these factors?).
(c) What methods exist to reduce or eliminate the effect of heteroskedasticity? When
should these methods be used? How can we implement these methods in R?
• You are free to choose any relevant methods, but you should discuss transformations
of variables, robust standard errors, and weighted least squares as
possible remedies. For the latter, you should include a discussion on how to
find appropriate ‘weights’. While you are welcome to mention other techniques
you will not get extra credit for them.
• You do not need to write your own R code to implement robust standard
errors or weighted least squares. I expect you to research existing packages
which contain relevant functions. You should explicitly state which packages
and functions can be used to implement these methods (there may be more
than one package for each of these methods, but you don’t need to write about
all packages that do the job - one will do!).
(d) Demonstrate the use of these methods in R on simulated data.
• You will need to simulate your own data in R, but please keep it simple! Having
just one numeric covariate will suffice and will enable you to plot results.
• Start with generating bivariate data which has no inherent heteroskedasticity
(see example code below, which you are welcome to use directly) and then
introduce heteroskedasticity.
– There are many ways of doing the latter and it’s worth spending time
thinking about what heteroskedasticity means in order to design some way
of introducing it into your data.
– I don’t mind how you do this as long as it’s sensible!
– Don’t over-complicate things: you can introduce heteroskedasticity in relatively
straightforward ways.
• I expect you to show the following:
– Explain how you simulated your heteroskedastic data (i.e. tell me the
format of the model that generated the data including the parameters used
- I give an example below, but this is for homoskedastic data).
– You can vary the ‘amount’ of heteroskedasticity in your data. This means
you can have one simulated dataset with a ‘mild’ or ‘moderate’ amount of
heteroskedasticity, and another where the data are even more heteroskedastic.
Keep it simple here - just two datasets with different amounts of heteroskedasticity
will do. Make sure that the data-generating mechanism is
the same for both, other than the amount of heteroskedasticity (otherwise
you will not be able to see what happens as you increase the amount of
heteroskedasticity, holding everything else in the data-generating mechanism
the same).
– For your simulated data with ‘mild’ or ‘moderate’ heteroskedasticity, compare
the model estimates from fitting a model using the usual methods
(ordinary least squares) and, where possible, using the methods you discussed
in part (c).
– Do the same for your dataset where the heteroskedasticity is more pronounced.
– How do the methods compare? How well do the methods estimate the
regression coefficients and variance of the error term? What happens to
the standard errors of the regression coefficients in each case? How does
the amount of heteroskedasticity affect the results?
∗ Note: the purpose of simulating the data is that you know what the
‘true’ values of the regression coefficients and variance of error term are,
which wouldn’t be available to you with real data.
∗ You can compare these ‘true’ values with those you get from the various
analyses requested here.
Note: heteroskedasticity is sometimes a consequence of violations of other assumptions.
In this assignment you may assume that all other assumptions have been met (that is, the
heteroskedasticity isn’t caused by another assumption being violated).
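To make the remedies named in (c) concrete, here is a minimal sketch comparing ordinary least squares, heteroskedasticity-consistent (HC0) standard errors computed by hand, and weighted least squares. Everything specific in it is an assumption for illustration: the error SD of 4 + 0.1x, the seed, and above all the weights, which use the true variance function and so are only available because the data are simulated.

```r
# Illustrative sketch only: simulated data with an assumed error SD of 4 + 0.1 * x.
set.seed(2)
n <- 1000
x <- runif(n, 0, 100)
y <- -10 + 0.4 * x + rnorm(n, mean = 0, sd = 4 + 0.1 * x)

# Ordinary least squares with the usual (constant-variance) standard errors.
ols <- lm(y ~ x)
se_ols <- sqrt(diag(vcov(ols)))

# HC0 ("White") robust standard errors computed by hand:
# (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
X <- model.matrix(ols)
e <- resid(ols)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)  # equals X' diag(e^2) X
se_hc0 <- sqrt(diag(bread %*% meat %*% bread))

# Weighted least squares: weights are the inverse of the (assumed known) error variance.
wls <- lm(y ~ x, weights = 1 / (4 + 0.1 * x)^2)
se_wls <- sqrt(diag(vcov(wls)))

print(rbind(ols = coef(ols), wls = coef(wls)))
print(rbind(ols = se_ols, hc0 = se_hc0, wls = se_wls))
```

The hand computation is shown only to expose what a robust standard error is. In practice a packaged function would normally be used: for example, `vcovHC()` in the `sandwich` package with `type = "HC0"` should give the same covariance matrix, and `coeftest()` in the `lmtest` package can report coefficients with it.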
Example code for generating homoskedastic data
Example code for generating homoskedastic data. Note the data are homoskedastic, not
heteroskedastic.
# Set sample size
n <- 1000
# Generate a single covariate.
# I'm assuming the covariate follows a uniform[0, 100] distribution.
x <- runif(n, 0, 100)
# Generate an error term, which is normally distributed with mean 0 and SD 4
error <- rnorm(n, 0, 4)
# Choose parameters (intercept and slope) and generate the outcome:
y <- -10 + 0.4 * x + error
I would explain the data generating process as follows.
One thousand bivariate observations were generated according to the following model. We
assumed that the covariate X was uniformly distributed between 0 and 100, and that the
error term, $\varepsilon$, was normally distributed with mean 0 and standard deviation of 4. The
outcome, $Y$, was then generated assuming:
$$Y_i = -10 + 0.4 x_i + \varepsilon_i.$$
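One possible way to extend the homoskedastic example to heteroskedastic data is sketched below. It is only one of many sensible choices: the helper `simulate_y()`, the form 4 + kx for the error SD, and the values k = 0.02 and k = 0.10 are all hypothetical. Setting k = 0 recovers the homoskedastic model described above.

```r
# Illustrative sketch only: one assumed way to introduce heteroskedasticity.
set.seed(3)
n <- 1000
x <- runif(n, 0, 100)

# Hypothetical helper: the error SD is 4 + k * x, so k controls how quickly
# the spread grows with x (k = 0 gives the homoskedastic example).
simulate_y <- function(x, k) {
  -10 + 0.4 * x + rnorm(length(x), mean = 0, sd = 4 + k * x)
}

# Same covariate and same mean structure; only the amount of heteroskedasticity differs.
y_mild <- simulate_y(x, k = 0.02)
y_strong <- simulate_y(x, k = 0.10)

fit_mild <- lm(y_mild ~ x)
fit_strong <- lm(y_strong ~ x)

# Compare estimated coefficients with the true values (-10 and 0.4),
# and the estimated residual SD in each case.
print(rbind(mild = coef(fit_mild), strong = coef(fit_strong)))
print(c(mild = sigma(fit_mild), strong = sigma(fit_strong)))
```

Because the true intercept, slope, and error SD function are known, the printed estimates can be set against them directly, which is the point of simulating the data.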
Administrative details
Basic details
• This assessment counts for 50% of your final mark for STAT0006.
• You should work in groups of no more than 3 students. You may work on the project
alone if you wish but note that this is not efficient. It is up to you to form your own
groups. You should have already registered your group on Moodle.
In addition to the outputs from Tasks 1 and 2, all groups must submit an additional page
where each group member briefly describes their contribution to the project.
• You will need to agree this in your groups before submitting the report.
• If all group members agree that everyone contributed equally then it is sufficient
to write a single sentence to that effect, or alternatively you are very welcome to
describe your own personal contribution to the project.
• Note that I will not mark this page, nor allocate different marks to different group
members based on this. The purpose is to encourage you all to be mindful about
contributing to this piece of group-work.
• If you feel that one or more of your peers is not contributing fairly, please contact
me by email in the first instance BEFORE SUBMISSION of the report and as early
as possible.
You should insert student ID numbers of all students in your group on the report, but do
not write your names. Your report will be marked anonymously. This also applies to
the page with descriptions of contributions.
Please note: it would be very helpful if you could adhere to the following format when
submitting your work:
• Please only submit ONE document, which should include everything: the first three
pages should be allocated to Task 1, the next four to Task 2, and the final page
should have your declaration of contribution to the project.
• Please start Task 2 and the declaration of contributions on a fresh page.
• Do not provide a cover sheet.
• All pages should have your GROUP NUMBER and STUDENT NUMBERS printed
somewhere along the top of the page.
• Save your document as a pdf file.
• Name your file using your group name, e.g. Group ICA3 200.pdf
• Please see associated document (layout.pdf) for suggested layout.
How do I get help with this assignment?
You can ask for help from me during office hours. Please note that I will not provide
comments on draft reports. Note that it may not be appropriate for me to answer all your
questions.
You may also post to the Moodle forum to ask questions. Please do not email me with
statistical questions - if you do, I will ask you to post them to the Moodle forum instead.
This being said, you should email me immediately if you have any technical difficulties
with Moodle (e.g. with submitting your report).
Submitting your work
The outputs from both tasks should be submitted to Moodle by 12 noon on Monday
27th January 2020. A submission button will appear on Moodle a few days prior to this
date. Under no circumstances should you email me your submission - if you do this, I
will immediately delete your email.
How will the report be marked?
Your report will be marked out of 50, with allocation as follows:
• 25 marks for Task 1, split as follows:
– 18 marks for the content of the report, including whether you have selected
appropriate information and supporting evidence (e.g. plots, tables), whether
your interpretation of the results is accurate, etc.
– 7 marks for the presentation and clarity of the report overall, including clarity
of expression and how easy it is to read and understand, whether you have
structured the report sensibly, good use of plots/tables where appropriate, adequately
sized graphics with suitably informative captions and labelling, and
so on.
• 25 marks for Task 2, split as follows:
– 15 marks for technical accuracy of your supplement;
– 10 marks for overall presentation and clarity of the supplement, including suitability
for the intended audience.
The mark you will receive is your group mark - everyone in the group will be awarded the
same mark, unless there are exceptional circumstances (e.g. a member of a group did not
contribute to the project).
Elinor M Jones
December 2019