2  Data

2.1 Technical Description

In this project we use data downloaded from the National Health and Nutrition Examination Survey (NHANES), which is conducted by the Centers for Disease Control and Prevention (CDC). NHANES is a major program of the National Center for Health Statistics (part of the CDC) and has run as a continuous survey since 1999. The survey is updated every year and contains records from 5000 people nationwide.

We focus on three questionnaires: Smoking - Cigarette Use, Smoking - Household Smokers, and Medical Conditions. Because of Covid-19, data collection for the 2019-2020 cycle was not completed, so the CDC combined the 2017-2018 data with the 2019-2020 data to form a 2017-2020 release, which is the latest version of the data. We use this cycle because of the large number of records it provides.
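As a minimal sketch of how three questionnaire tables can be combined into one record per respondent, the example below joins toy tables on SEQN. The table names (`smq`, `smqfam`, `mcq`) and all values are made up for illustration, and joining with a full outer join on SEQN is an assumption about the preparation step, not a description of the exact procedure used here. Base R is used so the sketch runs without extra packages.

```r
# Toy stand-ins for the three questionnaire tables (values are made up).
smq    <- data.frame(SEQN = c(1, 2, 3), SMQ020 = c(1, 2, NA), SMQ040 = c(1, NA, NA))
smqfam <- data.frame(SEQN = c(1, 2, 4), SMD470 = c(2, 0, 1))
mcq    <- data.frame(SEQN = c(1, 3, 4), MCQ010 = c(1, 2, 2))

# Full outer joins on SEQN keep every respondent that appears in any table;
# questions a respondent did not answer become NA.
merged <- Reduce(function(x, y) merge(x, y, by = "SEQN", all = TRUE),
                 list(smqfam, smq, mcq))
print(merged)
```

A full join keeps respondents who appear in only one questionnaire, which is one way to end up with many NA values like those in the table below.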

The first rows of the data look like this:

    SEQN SMD470 SMQ020 SMQ040 SMD650 SMD100FL SMQ621 MCQ010 MCQ092 MCQ160A
1 109264     NA     NA     NA     NA       NA      1      2      2      NA
2 109266     NA      2     NA     NA       NA     NA      2      9       2
3 109267     NA      2     NA     NA       NA     NA      2      2       2
4 109268     NA      2     NA     NA       NA     NA      2      2      NA
5 109271      2      1      1     20       NA     NA      1      2       1
6 109273      0      1      1     15        1     NA      2      2       2
  MCQ160B MCQ160C MCQ160D MCQ160E MCQ160F MCQ160M MCQ160P MCQ160L MCQ550 MCQ560
1      NA      NA      NA      NA      NA      NA      NA      NA     NA     NA
2       2       2       2       2       2       2       2       2      2      2
3       2       2       2       2       2       2       2       2      2      2
4      NA      NA      NA      NA      NA      NA      NA      NA     NA     NA
5       2       2       2       2       2       2       1       2      2      2
6       2       2       2       2       2       2       2       2      2      2
  MCQ220
1     NA
2      2
3      2
4     NA
5      2
6      2

The description of each variable is given below:

Index Variable Description

Variable  Description
SEQN      Respondent sequence number
SMD470    # of people who smoke inside this home
SMQ020    Smoked at least 100 cigarettes in life
SMQ040    Do you now smoke cigarettes?
SMD650    Avg # cigarettes/day during past 30 days
SMD100FL  Cigarette Filter type
SMQ621    Cigarettes smoked in entire life
MCQ010    Ever been told you have asthma
MCQ092    Ever receive blood transfusion
MCQ160A   Doctor ever said you had arthritis
MCQ160B   Ever told had heart failure
MCQ160C   Ever told had coronary heart disease
MCQ160D   Ever told had angina/angina pectoris
MCQ160E   Ever told had heart attack
MCQ160F   Ever told you had a stroke
MCQ160M   Ever told you had thyroid problem
MCQ160P   Ever told you had COPD, emphysema, ChB
MCQ160L   Ever told you had any liver condition
MCQ550    Has DR ever said you have gallstones
MCQ560    Ever had gallbladder surgery
MCQ220    Ever told you had cancer or malignancy

We use SMQ020 through SMQ621 to reflect personal smoking status and MCQ010 through MCQ220 to indicate personal health status; the latter contain doctors' diagnoses of a range of diseases. We also include SMD470 as a measure of secondhand smoke exposure, to see whether secondhand smoke exposure may affect health.
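The yes/no items are stored as numeric codes rather than labels. The sketch below shows one possible way to turn such a column into a logical flag, assuming the usual NHANES coding for yes/no questions (1 = yes, 2 = no, 7 = refused, 9 = don't know). The helper `recode_yes_no` is hypothetical and is not part of the analysis code in this chapter.

```r
# Hypothetical helper: 1 -> TRUE, 2 -> FALSE, everything else
# (refused, don't know, missing) -> NA.
recode_yes_no <- function(x) {
  out <- rep(NA, length(x))
  out[x %in% 1] <- TRUE
  out[x %in% 2] <- FALSE
  out
}

answers <- c(1, 2, 9, NA, 2, 7)  # made-up answers
flags <- recode_yes_no(answers)
print(flags)
```

Treating "refused" and "don't know" as missing is a design choice; keeping the raw codes, as this chapter does, leaves those answers in the data as their own values.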

2.2 Research Plan

We start with a classic question: the effect of smoking on health. This question has been studied for years and has several well-known results, for example that smoking increases the risk of lung cancer. In our study, we examine the influence of smoking on health more comprehensively, including its effect on the heart, the liver, and other organs. We are especially interested in relations between smoking and health problems that are not generally thought to be related to smoking.

To achieve this goal, we use several visualization methods, including mosaic plots and histograms by condition, to see whether there are any apparent relations. We also use the Pearson correlation, visualized as a heatmap, to measure the relation between smoking and the different health indicators.
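For two yes/no variables, the Pearson correlation is the phi coefficient of their 2x2 contingency table, which is the same table a mosaic plot draws. The sketch below checks this on made-up answers; the vectors `smoker` and `condition` are illustrative and are not taken from the NHANES data.

```r
# Made-up yes/no answers coded as in the data: 1 = yes, 2 = no.
smoker    <- c(1, 1, 1, 2, 2, 2, 2, 1, 2, 2)
condition <- c(1, 2, 1, 2, 2, 1, 2, 1, 2, 2)

# 2x2 contingency table (the counts a mosaic plot would display).
tab <- table(smoker, condition)
print(tab)

# Phi coefficient computed from the four cell counts.
n11 <- tab["1", "1"]; n12 <- tab["1", "2"]
n21 <- tab["2", "1"]; n22 <- tab["2", "2"]
phi <- (n11 * n22 - n12 * n21) /
  sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))

# Pearson correlation of the raw codes gives the same value.
r <- cor(smoker, condition)
print(c(phi = as.numeric(phi), pearson = r))
```

Because both variables use the same 1/2 coding, a positive value means that "yes" answers tend to occur together.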

After this, we plan to study the effect of secondhand smoke exposure on health conditions. Secondhand smoke exposure is common nowadays, but people tend to overlook it compared with the direct influence of cigarettes. We therefore repeat the same procedure, replacing the smoking variable with secondhand smoke exposure, to find its possible effect on one's health condition.
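One simple way to look at secondhand exposure is to compare how often a condition is reported in households with and without indoor smokers. The sketch below does this on made-up records; treating `SMD470 > 0` as "exposed" is an assumption made for illustration, and the values are not NHANES data.

```r
# Made-up records: SMD470 = number of household members who smoke indoors,
# MCQ010 = ever told you have asthma (1 = yes, 2 = no).
toy <- data.frame(
  SMD470 = c(0, 0, 0, 0, 1, 2, 1, 3),
  MCQ010 = c(2, 2, 1, 2, 1, 2, 1, 2)
)

# Assumed exposure indicator: at least one person smokes inside the home.
toy$exposed <- toy$SMD470 > 0

# Share of "yes" answers in each exposure group.
prevalence <- tapply(toy$MCQ010 == 1, toy$exposed, mean)
print(prevalence)
```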

2.3 Missing Value Analysis

The number of missing values for each variable is computed as follows:

Code
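library(tidyverse)  # dplyr, tidyr and ggplot2 are used throughout

count <- nrow(data)  # total number of records
# Count the missing values per variable, largest first
missing <- data %>%
    gather(key = "key", value = "val") %>%
    mutate(is.missing = is.na(val)) %>%
    group_by(key, is.missing) %>%
    summarise(num.missing = n(), .groups = "drop") %>%
    filter(is.missing) %>%
    select(-is.missing) %>%
    arrange(desc(num.missing)) %>%
    mutate(num.contain = count - num.missing)  # number of non-missing values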
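
As a cross-check on a pipeline like this, the same counts can be obtained in base R with `colSums(is.na(...))`. The sketch below runs on a made-up data frame `toy`, not on the NHANES data. Unlike the pipeline above, it also keeps variables that have no missing values.

```r
# Made-up data frame with a few missing answers.
toy <- data.frame(
  SEQN   = 1:5,
  SMQ020 = c(1, NA, 2, 2, NA),
  MCQ010 = c(2, 2, NA, 1, 2),
  MCQ220 = c(2, 2, 2, 2, 2)
)

# Missing values per column, then the complementary non-missing counts.
n_missing <- colSums(is.na(toy))
summary_tbl <- data.frame(
  key         = names(n_missing),
  num.missing = as.integer(n_missing),
  num.contain = nrow(toy) - as.integer(n_missing)
)

# Largest number of missing values first.
summary_tbl <- summary_tbl[order(-summary_tbl$num.missing), ]
print(summary_tbl)
```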

We use a bar plot and pie charts to visualize the missing values:

Code
ggplot(missing) + geom_bar(aes(x = key, y = num.missing), stat = 'identity') +
  labs(x = 'variables', y = 'count of missing values', title = 'Bar plot of missing values') +
  theme(plot.title=element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

Code
# Long format: one 'miss' row and one 'contain' row per variable
n_vars <- length(missing$num.missing)
pie <- data.frame('value' = c(missing$num.missing, missing$num.contain),
                  'category' = c(rep('miss', n_vars), rep('contain', n_vars)),
                  'variable' = c(rep(missing$key, 2)))
ggplot(pie, aes(x = '', y = value, fill = category)) + geom_col() +
  coord_polar(theta = 'y', start = 0) + facet_wrap(~variable) + 
  labs(x = '', y = '', title = 'Pie charts of missing values') +
  theme(plot.title=element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid  = element_blank())

We can see that the medical variables MCQ160A through MCQ220 all have the same number of missing values. Since we have about 10,000 records in total and 1905 of them is a small share, we delete these records.

Code
data <- data %>%
  drop_na(c('MCQ160A', 'MCQ160B', 'MCQ160C', 'MCQ160D', 'MCQ160E', 'MCQ160F',
            'MCQ160M', 'MCQ160P', 'MCQ160L', 'MCQ550', 'MCQ560', 'MCQ220'))
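
The key point of this step is that rows are dropped only when one of the listed medical columns is missing, while missing values in other columns are left alone. A base R sketch of the same idea with `complete.cases()`, on a made-up data frame `toy`:

```r
# Made-up data: SMD650 has many NAs, the medical columns only a few.
toy <- data.frame(
  SEQN    = 1:5,
  SMD650  = c(NA, 20, NA, 15, NA),
  MCQ160A = c(2, NA, 1, 2, 2),
  MCQ220  = c(2, NA, 2, 2, NA)
)

# Keep rows that are complete in the medical columns only.
medical <- c("MCQ160A", "MCQ220")
kept <- toy[complete.cases(toy[, medical]), ]
print(kept)
```

Rows with a missing SMD650 survive as long as their medical answers are present.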

Among the smoking status variables, we drop SMD650, SMD100FL, and SMQ621 because they have too many missing values. In the remaining data, we drop every row that contains an NA.

Code
data <- data %>%
  select(-c('SMD650', 'SMD100FL', 'SMQ621')) %>%
  drop_na()
    SEQN SMD470 SMQ020 SMQ040 MCQ010 MCQ092 MCQ160A MCQ160B MCQ160C MCQ160D
1 109271      2      1      1      1      2       1       2       2       2
2 109273      0      1      1      2      2       2       2       2       2
3 109307      0      1      3      2      2       2       2       2       1
4 109323      0      1      3      2      2       2       2       2       2
5 109334      0      1      1      2      2       2       2       2       2
6 109342      2      1      1      2      2       1       2       2       2
  MCQ160E MCQ160F MCQ160M MCQ160P MCQ160L MCQ550 MCQ560 MCQ220
1       2       2       2       1       2      2      2      2
2       2       2       2       2       2      2      2      2
3       2       2       2       2       2      2      2      2
4       2       2       2       2       2      2      2      2
5       2       2       2       2       2      2      2      2
6       2       2       1       2       2      2      1      2

The resulting data is now clean. We have 1829 records in total, which is sufficient for our analysis.

2.4 Feature Selection

After treating missing values, we examine the real-life meaning of the columns and their correlations, and drop the following columns to prevent overlapping information among columns:

  • SMQ020: Smoked at least 100 cigarettes in life. According to the data, the answer is ‘yes’ for everyone who smokes.
  • MCQ160B: Ever told had heart failure. The four heart-related diseases have the same possible levels and similar distributions. After checking the correlation between each of these features and the independent features (explained in Section 2.5), we drop MCQ160B, MCQ160D and MCQ160E and keep MCQ160C.
  • MCQ160D: Ever told had angina/angina pectoris. Angina pectoris is chest pain caused by insufficient blood and oxygen supply to the heart and is associated with coronary heart disease (MCQ160C).
  • MCQ160E: Ever told had heart attack. Same reason as MCQ160B.
  • MCQ560: Ever had gallbladder surgery. This feature overlaps with MCQ550, ‘Has DR ever said you have gallstones’, and the two features have almost the same distribution.

Index Variable Description After Preprocessing and Selection

Variable  Description
SEQN      Respondent sequence number
SMD470    # of people who smoke inside this home
SMQ040    Do you now smoke cigarettes?
MCQ010    Ever been told you have asthma
MCQ092    Ever receive blood transfusion
MCQ160A   Doctor ever said you had arthritis
MCQ160C   Ever told had coronary heart disease
MCQ160F   Ever told you had a stroke
MCQ160M   Ever told you had thyroid problem
MCQ160P   Ever told you had COPD, emphysema, ChB
MCQ160L   Ever told you had any liver condition
MCQ550    Has DR ever said you have gallstones
MCQ220    Ever told you had cancer or malignancy

Code
data <- data %>%
  select(-c('SMQ020', 'MCQ160B', 'MCQ160D', 'MCQ160E', 'MCQ560'))
write_csv(data, 'data/data.csv')
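
The two kinds of redundancy described in this section, a column that takes a single value and two columns that largely repeat each other, can be checked with a few lines of base R. The sketch below uses a made-up data frame `toy`; it illustrates the idea and is not the exact check that was run on the NHANES data.

```r
# Made-up data: SMQ020 is constant, MCQ560 mostly repeats MCQ550.
toy <- data.frame(
  SMQ020 = c(1, 1, 1, 1, 1, 1),
  SMQ040 = c(1, 3, 1, 2, 3, 1),
  MCQ550 = c(1, 2, 2, 1, 2, 2),
  MCQ560 = c(1, 2, 2, 2, 2, 2)
)

# Columns with a single distinct value carry no information.
n_levels <- sapply(toy, function(col) length(unique(col)))
constant_cols <- names(n_levels)[n_levels == 1]
print(constant_cols)

# Cross-tabulation of two related columns: counts concentrated on the
# diagonal mean the columns mostly agree.
overlap <- table(MCQ550 = toy$MCQ550, MCQ560 = toy$MCQ560)
print(overlap)
```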

As shown in the correlation heatmap below, after feature selection no pair of features is correlated strongly enough to cause multicollinearity or overlapping information.

Code
library(reshape2)  # provides melt()

# Pairwise Pearson correlations of all columns except SEQN
cormat <- round(cor(data[, 2:13]),2)
ggplot(data = melt(cormat), aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  theme(plot.title=element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = 'variables', y = 'variables', title = 'Correlation Heatmap')
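
Besides reading the heatmap by eye, the correlation matrix can be scanned for pairs above a cutoff. The sketch below does this on made-up columns; the cutoff of 0.8 and the data frame `toy` are assumptions chosen for illustration only.

```r
# Made-up columns: a and b are nearly proportional, c is unrelated.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 6, 8, 10, 13),
  c = c(2, 1, 2, 1, 2, 1)
)

cormat <- cor(toy)
threshold <- 0.8  # assumed cutoff, not a value from the analysis

# Look only at the upper triangle so each pair is reported once.
idx <- which(abs(cormat) > threshold & upper.tri(cormat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cormat)[idx[, "row"]],
  var2 = colnames(cormat)[idx[, "col"]],
  r    = cormat[idx]
)
print(high_pairs)
```

An empty `high_pairs` table would correspond to a heatmap with no strongly correlated pair.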

2.5 Identifying Dependency

In this study, we treat the smoking variables (SMD470, the number of people who smoke in the household, and SMQ040, whether the respondent currently smokes) as the independent features. SEQN, the respondent sequence number (ID), carries no statistical meaning. The remaining ten features, which describe health conditions, are treated as dependent features.
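
The split into roles can be written down explicitly so that later steps refer to the same column sets. In the sketch below, the column names come from the table in Section 2.4, while the variable names `id_col`, `independent` and `dependent` are illustrative.

```r
# Columns remaining after feature selection (see Section 2.4).
cols <- c("SEQN", "SMD470", "SMQ040", "MCQ010", "MCQ092", "MCQ160A",
          "MCQ160C", "MCQ160F", "MCQ160M", "MCQ160P", "MCQ160L",
          "MCQ550", "MCQ220")

id_col      <- "SEQN"                  # identifier, no statistical meaning
independent <- c("SMD470", "SMQ040")   # smoking conditions
dependent   <- setdiff(cols, c(id_col, independent))  # health conditions

print(dependent)
```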