<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://zianzhuang.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://zianzhuang.com/" rel="alternate" type="text/html" /><updated>2025-07-17T05:58:01+00:00</updated><id>https://zianzhuang.com/feed.xml</id><title type="html">Zian ZHUANG</title><subtitle>personal description</subtitle><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><entry><title type="html">Machine Learning practice 3</title><link href="https://zianzhuang.com/posts/2021/06/blog-post-7/" rel="alternate" type="text/html" title="Machine Learning practice 3" /><published>2021-06-01T00:00:00+00:00</published><updated>2021-06-01T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/06/blog-post-7</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/06/blog-post-7/"><![CDATA[<!--more-->

<h2 id="q1">Q1.</h2>
<p>(SVM, <em>20 pt</em>) In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the <code class="language-plaintext highlighter-rouge">Auto</code> data set.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">(</span><span class="n">Auto</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="a">(a)</h3>
<p>Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Auto</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">mpg01</span><span class="o">=</span><span class="n">ifelse</span><span class="p">(</span><span class="n">mpg</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">mpg</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
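
<p>For readers without <code class="language-plaintext highlighter-rouge">dplyr</code>, the same median split can be sketched in base R. The snippet below is only an illustration: it uses the built-in <code class="language-plaintext highlighter-rouge">mtcars</code> data as a stand-in for <code class="language-plaintext highlighter-rouge">Auto</code> so that it runs on its own, and the names <code class="language-plaintext highlighter-rouge">cars</code> and <code class="language-plaintext highlighter-rouge">mpg01_demo</code> are hypothetical. Note that a strict <code class="language-plaintext highlighter-rouge">&gt;</code> comparison assigns cars exactly at the median to class 0.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Stand-in data: mtcars ships with base R (assumption; the post itself uses Auto)
cars &lt;- mtcars

# 1 if mpg is strictly above the median, 0 otherwise (ties at the median go to 0)
mpg01_demo &lt;- factor(as.integer(cars$mpg &gt; median(cars$mpg)), levels = c(0, 1))

# Class counts; ties at the median can make the two classes slightly unequal
table(mpg01_demo)
</code></pre></div></div>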

<h3 id="b">(b)</h3>
<p>Fit a support vector classifier to the data with various values of <code class="language-plaintext highlighter-rouge">cost</code>, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">1996</span><span class="p">)</span><span class="w">
</span><span class="n">linear_tune</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tune</span><span class="p">(</span><span class="n">svm</span><span class="p">,</span><span class="w"> </span><span class="n">mpg01</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"linear"</span><span class="p">,</span><span class="w">
                 </span><span class="n">ranges</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">linear_tune</span><span class="p">)</span><span class="w">
</span><span class="n">linear_tune</span><span class="o">$</span><span class="n">best.parameters</span><span class="w">
</span><span class="n">linear_tune</span><span class="o">$</span><span class="n">best.performance</span><span class="w">
</span></code></pre></div></div>

<p>The results show that the cross-validation error is minimized at cost = 5.</p>

<h3 id="c">(c)</h3>
<p>Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of <code class="language-plaintext highlighter-rouge">gamma</code> and <code class="language-plaintext highlighter-rouge">degree</code> and <code class="language-plaintext highlighter-rouge">cost</code>. Comment on your results.</p>

<p><strong>Answer</strong>:</p>

<p>radial basis kernels:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">1996</span><span class="p">)</span><span class="w">
</span><span class="n">radial_tune</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tune</span><span class="p">(</span><span class="n">svm</span><span class="p">,</span><span class="w"> </span><span class="n">mpg01</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="o">=</span><span class="s1">'radial'</span><span class="p">,</span><span class="w">
                 </span><span class="n">ranges</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="m">1000</span><span class="p">),</span><span class="w">
                               </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">)))</span><span class="w">
</span><span class="n">radial_tune</span><span class="o">$</span><span class="n">best.parameters</span><span class="w">
</span><span class="n">radial_tune</span><span class="o">$</span><span class="n">best.performance</span><span class="w">
</span></code></pre></div></div>

<p>As the output shows, the cross-validation error of the radial model is minimized at cost=1 and gamma=1. This error is also slightly lower than that of the linear kernel model.</p>

<p>polynomial basis kernels:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">1996</span><span class="p">)</span><span class="w">
</span><span class="n">poly_tune</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tune</span><span class="p">(</span><span class="n">svm</span><span class="p">,</span><span class="w"> </span><span class="n">mpg01</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="o">=</span><span class="s1">'polynomial'</span><span class="p">,</span><span class="w">
                 </span><span class="n">ranges</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="m">1000</span><span class="p">),</span><span class="w">
                               </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">5</span><span class="p">)))</span><span class="w">
</span><span class="n">poly_tune</span><span class="o">$</span><span class="n">best.parameters</span><span class="w">
</span><span class="n">poly_tune</span><span class="o">$</span><span class="n">best.performance</span><span class="w">
</span></code></pre></div></div>

<p>As the output shows, the cross-validation error of the polynomial model is minimized at cost=5 and degree=3, which suggests that the true decision boundary is non-linear. This error is lower than that of the linear kernel model but higher than that of the radial kernel model.</p>

<h3 id="d">(d)</h3>
<p>Make some plots to back up your assertions in (b) and (c).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">svmfit_l</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">svm</span><span class="p">(</span><span class="n">mpg01</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="o">=</span><span class="s2">"linear"</span><span class="p">,</span><span class="w"> </span><span class="n">cost</span><span class="o">=</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">svmfit_r</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">svm</span><span class="p">(</span><span class="n">mpg01</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="o">=</span><span class="s2">"radial"</span><span class="p">,</span><span class="w"> </span><span class="n">cost</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">svmfit_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">svm</span><span class="p">(</span><span class="n">mpg01</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="o">=</span><span class="s2">"polynomial"</span><span class="p">,</span><span class="w"> </span><span class="n">cost</span><span class="o">=</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">degree</span><span class="o">=</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">names_list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">df</span><span class="p">)[</span><span class="m">-8</span><span class="p">]</span><span class="w">

</span><span class="c1"># some plots</span><span class="w">

</span><span class="c1">#linear</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">svmfit_l</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">displacement</span><span class="o">~</span><span class="n">weight</span><span class="p">)</span><span class="w">

</span><span class="c1">#radial</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">svmfit_r</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">displacement</span><span class="o">~</span><span class="n">weight</span><span class="p">)</span><span class="w">

</span><span class="c1">#polynomial</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">svmfit_p</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">displacement</span><span class="o">~</span><span class="n">weight</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="q2">Q2.</h2>
<p>($K$-Means Clustering, PCA and MDS, <em>40 pt</em>) The following code reads in gene expression data from the TCGA project, which contains the expression of a random sample of 2000 genes for 563 patients from three cancer subtypes: Basal (<code class="language-plaintext highlighter-rouge">Basal</code>), Luminal A (<code class="language-plaintext highlighter-rouge">LumA</code>), and Luminal B (<code class="language-plaintext highlighter-rouge">LumB</code>). Suppose we are only interested in distinguishing Luminal A samples from Luminal B - but alas, we also have Basal samples, and we don’t know which is which. Write a data analysis report to address the following problems.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TCGA &lt;- read.csv("TCGA_sample_2.txt", header = TRUE)

# Store the subtypes of tissue and the gene expression data
Subtypes &lt;- TCGA[ ,1]
Gene &lt;- as.matrix(TCGA[,-1])
</code></pre></div></div>

<h3 id="a-1">(a)</h3>
<p>Run $K$-means for $K$ from 1 to 20 and plot the associated within-cluster sums of squares (WSSs). Comment on the WSS at $K=3$.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wss</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">){</span><span class="w">
  </span><span class="n">kmeansfit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">Gene</span><span class="p">,</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="n">kmeansfit</span><span class="o">$</span><span class="n">betweenss</span><span class="o">/</span><span class="n">kmeansfit</span><span class="o">$</span><span class="n">totss</span><span class="p">,</span><span class="w">
                     </span><span class="n">clusters</span><span class="o">=</span><span class="n">i</span><span class="p">)</span><span class="w">
  </span><span class="n">wss</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">wss</span><span class="p">,</span><span class="w"> </span><span class="n">temp</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">wss</span><span class="o">$</span><span class="n">clusters</span><span class="p">,</span><span class="w"> </span><span class="n">wss</span><span class="o">$</span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"b"</span><span class="p">,</span><span class="w"> 
     </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"number of clusters"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"between SS / total SS"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">k3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">Gene</span><span class="p">,</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">k3</span><span class="o">$</span><span class="n">betweenss</span><span class="o">/</span><span class="n">k3</span><span class="o">$</span><span class="n">totss</span><span class="w">
</span></code></pre></div></div>

<p>Comment on the WSS at $K=3$: The ratio of between-cluster to total sum of squares is 14.8%, which measures the share of the total variance in our data set that is explained by the clustering. $K$-means minimizes the within-cluster dispersion, which is equivalent to maximizing the between-cluster dispersion. Assigning the samples to 3 clusters rather than to a single cluster therefore reduces the within-cluster sum of squares by 14.8%.</p>
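
<p>The ratio computed above relies on the decomposition of the total sum of squares into a within-cluster and a between-cluster part. The sketch below checks that identity on the built-in <code class="language-plaintext highlighter-rouge">USArrests</code> data, used here only as a stand-in for <code class="language-plaintext highlighter-rouge">Gene</code> so that the snippet runs on its own; <code class="language-plaintext highlighter-rouge">fit_demo</code> and <code class="language-plaintext highlighter-rouge">explained</code> are hypothetical names.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)  # k-means starts from random centers
fit_demo &lt;- kmeans(scale(USArrests), centers = 3, nstart = 20)

# Total SS splits into within-cluster SS plus between-cluster SS
all.equal(fit_demo$totss, fit_demo$tot.withinss + fit_demo$betweenss)

# So the explained share equals 1 minus the within-cluster share
explained &lt;- fit_demo$betweenss / fit_demo$totss
all.equal(explained, 1 - fit_demo$tot.withinss / fit_demo$totss)
</code></pre></div></div>

<p>If the raw WSS is wanted for an elbow plot, <code class="language-plaintext highlighter-rouge">tot.withinss</code> is the corresponding field of the <code class="language-plaintext highlighter-rouge">kmeans</code> result.</p>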

<h3 id="b-1">(b)</h3>
<p>Apply $K$-means with $K=3$ to the <code class="language-plaintext highlighter-rouge">Gene</code> dataset. What percentage of <code class="language-plaintext highlighter-rouge">Basal</code>, <code class="language-plaintext highlighter-rouge">LumA</code>, and <code class="language-plaintext highlighter-rouge">LumB</code> type samples are in each of the 3 resulting clusters? Did we do a good job distinguishing <code class="language-plaintext highlighter-rouge">LumA</code> from <code class="language-plaintext highlighter-rouge">LumB</code>? A confusion matrix of clusters versus subtypes might be helpful.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">pred</span><span class="o">=</span><span class="n">k3</span><span class="o">$</span><span class="n">cluster</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">Subtypes</span><span class="o">!=</span><span class="s2">"Basal"</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">results</span><span class="o">$</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">results</span><span class="o">$</span><span class="n">pred</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>According to the count table, LumA is best matched with cluster 2 and LumB with cluster 1 (cluster labels are arbitrary and may differ between runs). This gives the following confusion matrix of clusters versus subtypes:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tibble</span><span class="p">(</span><span class="n">LumA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">192</span><span class="p">,</span><span class="w"> </span><span class="m">117</span><span class="p">),</span><span class="w">
       </span><span class="n">LumB</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">27</span><span class="p">,</span><span class="w"> </span><span class="m">125</span><span class="p">),</span><span class="w">
       </span><span class="n">pred</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"LumA (pred)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"LumB (pred)"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">column_to_rownames</span><span class="p">(</span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pred"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="m">192+125</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="m">461</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The confusion matrix shows an overall classification accuracy of 68.8%, which is satisfactory.</p>
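
<p>The accuracy and the per-subtype percentages can also be derived from the matrix itself instead of being typed in by hand. The sketch below assumes the four counts reported above and the cluster-to-subtype matching described there; <code class="language-plaintext highlighter-rouge">cm</code>, <code class="language-plaintext highlighter-rouge">accuracy</code> and <code class="language-plaintext highlighter-rouge">per_subtype</code> are hypothetical names.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Counts copied from the confusion matrix above (columns: true subtype)
cm &lt;- matrix(c(192, 117, 27, 125), nrow = 2,
             dimnames = list(pred = c("LumA (pred)", "LumB (pred)"),
                             truth = c("LumA", "LumB")))

# Overall accuracy: correctly matched samples over all samples in the table
accuracy &lt;- sum(diag(cm)) / sum(cm)

# Share of each true subtype that lands in each cluster (columns sum to 1)
per_subtype &lt;- prop.table(cm, margin = 2)
round(per_subtype, 3)
</code></pre></div></div>

<p>With these counts, about 62% of the LumA samples and about 82% of the LumB samples fall in their matched cluster.</p>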

<h3 id="c-1">(c)</h3>
<p>Now apply PCA to the <code class="language-plaintext highlighter-rouge">Gene</code> dataset. Plot the data in the first two PCs colored by <code class="language-plaintext highlighter-rouge">Subtypes</code>. Does this plot appear to separate the cancer subtypes well?</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Gene_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Gene</span><span class="p">[,</span><span class="w"> </span><span class="n">colSums</span><span class="p">(</span><span class="n">Gene</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">Gene</span><span class="p">)]</span><span class="w">
</span><span class="c1"># Dimension reduction using PCA</span><span class="w">
</span><span class="n">res.pca</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">prcomp</span><span class="p">(</span><span class="n">Gene_df</span><span class="p">,</span><span class="w">  </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">#Visualize eigenvalues (scree plot).</span><span class="w">
</span><span class="n">fviz_eig</span><span class="p">(</span><span class="n">res.pca</span><span class="p">)</span><span class="w">
</span><span class="c1">#Graph of individuals.</span><span class="w">
</span><span class="n">fviz_pca_ind</span><span class="p">(</span><span class="n">res.pca</span><span class="p">,</span><span class="w">
             </span><span class="n">col.ind</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">),</span><span class="w"> </span><span class="c1"># color by groups</span><span class="w">
             </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"#00AFBB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"#E7B800"</span><span class="p">,</span><span class="w"> </span><span class="s2">"#FC4E07"</span><span class="p">),</span><span class="w">
             </span><span class="n">legend.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Groups"</span><span class="p">,</span><span class="w">
             </span><span class="n">repel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
             </span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/post/2021-06-15-machine-learning-practice-3/index.en-us_files/fig5.png" alt="" /></p>

<p><img src="/post/2021-06-15-machine-learning-practice-3/index.en-us_files/fig6.png" alt="" />
As we can tell from the plot, PCA appears to separate the cancer subtypes well.</p>
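
<p>How much a plot of the first two PCs can reveal depends on the share of variance those two components carry, which is what the scree plot displays. A minimal base-R sketch of that calculation follows, using the built-in <code class="language-plaintext highlighter-rouge">USArrests</code> data as a stand-in for <code class="language-plaintext highlighter-rouge">Gene_df</code> (an assumption made for self-containment); <code class="language-plaintext highlighter-rouge">pca_demo</code> and <code class="language-plaintext highlighter-rouge">var_share</code> are hypothetical names.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># PCA on standardized variables, as in the post
pca_demo &lt;- prcomp(USArrests, scale. = TRUE)

# Variance of each PC is its squared standard deviation
var_share &lt;- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# Cumulative share of variance visible in a PC1 versus PC2 scatterplot
cumsum(var_share)[2]
</code></pre></div></div>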

<h3 id="d-1">(d)</h3>
<p>Try plotting some more PC combinations. Can you find a pair of PCs that appear to separate all three subtypes well? Report the scatterplot of the data for pair of PCs that you think best separates all three types.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Coordinates of individuals</span><span class="w">
</span><span class="n">ind.coord</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">type</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">),</span><span class="w"> </span><span class="n">get_pca_ind</span><span class="p">(</span><span class="n">res.pca</span><span class="p">)</span><span class="o">$</span><span class="n">coord</span><span class="p">)</span><span class="w">

</span><span class="n">index</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">){</span><span class="w">
  </span><span class="k">for</span><span class="p">(</span><span class="n">k</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="m">+1</span><span class="p">)</span><span class="o">:</span><span class="m">6</span><span class="p">){</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">(</span><span class="n">ind.coord</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes_string</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="nf">names</span><span class="p">(</span><span class="n">ind.coord</span><span class="p">)[</span><span class="n">i</span><span class="m">+1</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="nf">names</span><span class="p">(</span><span class="n">ind.coord</span><span class="p">)[</span><span class="n">k</span><span class="m">+1</span><span class="p">],</span><span class="w"> 
                            </span><span class="n">color</span><span class="o">=</span><span class="s2">"type"</span><span class="p">))</span><span class="o">+</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">temp</span><span class="w">
    </span><span class="n">assign</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"p_"</span><span class="p">,</span><span class="n">index</span><span class="p">),</span><span class="w"> </span><span class="n">temp</span><span class="p">)</span><span class="w">
    </span><span class="n">index</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">index</span><span class="m">+1</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">ggarrange</span><span class="p">(</span><span class="n">plotlist</span><span class="o">=</span><span class="n">mget</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"p_"</span><span class="p">,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">))),</span><span class="w"> 
          </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">ggarrange</span><span class="p">(</span><span class="n">plotlist</span><span class="o">=</span><span class="n">mget</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"p_"</span><span class="p">,</span><span class="nf">c</span><span class="p">(</span><span class="m">7</span><span class="o">:</span><span class="m">12</span><span class="p">))),</span><span class="w"> 
          </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">ggarrange</span><span class="p">(</span><span class="n">plotlist</span><span class="o">=</span><span class="n">mget</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"p_"</span><span class="p">,</span><span class="nf">c</span><span class="p">(</span><span class="m">13</span><span class="o">:</span><span class="m">15</span><span class="p">))),</span><span class="w"> 
          </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>According to the plots, no pair of PCs cleanly separates all three subtypes. The pair that comes closest is Dim.1 and Dim.3.</p>

<h3 id="e">(e)</h3>
<p>Perform $K$-means with $K=3$ on the pair of PCs identified in (d). Report the confusion matrix and make some comments.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">k3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">ind.coord</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">4</span><span class="p">)],</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">pred</span><span class="o">=</span><span class="n">k3</span><span class="o">$</span><span class="n">cluster</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">results</span><span class="o">$</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">results</span><span class="o">$</span><span class="n">pred</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
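<p>From the count table, <code class="language-plaintext highlighter-rouge">LumA</code> corresponds most closely to cluster 2 and <code class="language-plaintext highlighter-rouge">LumB</code> to cluster 3. Relabeling the clusters accordingly gives the confusion matrix of clusters versus subtypes:</p>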

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tibble</span><span class="p">(</span><span class="n">Basal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">101</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
       </span><span class="n">LumA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">201</span><span class="p">,</span><span class="w"> </span><span class="m">108</span><span class="p">),</span><span class="w">
       </span><span class="n">LumB</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">41</span><span class="p">,</span><span class="w"> </span><span class="m">101</span><span class="p">),</span><span class="w">
       </span><span class="n">pred</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Basal (pred)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"LumA (pred)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"LumB (pred)"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">column_to_rownames</span><span class="p">(</span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pred"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="m">201+101</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="m">201+108+41+101</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Restricting the confusion matrix to <code class="language-plaintext highlighter-rouge">LumA</code> and <code class="language-plaintext highlighter-rouge">LumB</code>, the overall classification accuracy is 67.0%, which is largely consistent with the result of applying $K$-means with $K=3$ directly to the original data.</p>
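
<p>The same figure can be computed from the counts rather than typed by hand. A minimal sketch, using only the counts from the confusion matrix above; the names <code class="language-plaintext highlighter-rouge">cm</code>, <code class="language-plaintext highlighter-rouge">lum</code>, <code class="language-plaintext highlighter-rouge">acc_lum</code> and <code class="language-plaintext highlighter-rouge">acc_all</code> are illustrative and not part of the original analysis:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Counts copied from the confusion matrix above
# (rows = predicted cluster label, columns = true subtype).
# matrix() fills column by column, so each line below is one true subtype.
cm &lt;- matrix(c(101,   1,   0,    # true Basal
                 0, 201, 108,    # true LumA
                10,  41, 101),   # true LumB
             nrow = 3,
             dimnames = list(pred  = c("Basal", "LumA", "LumB"),
                             truth = c("Basal", "LumA", "LumB")))

# Accuracy restricted to the LumA/LumB block, as in the text.
lum &lt;- c("LumA", "LumB")
acc_lum &lt;- sum(diag(cm[lum, lum])) / sum(cm[lum, lum])
round(acc_lum, 3)

# Accuracy over all three subtypes, for comparison.
acc_all &lt;- sum(diag(cm)) / sum(cm)
round(acc_all, 3)
</code></pre></div></div>

<p>The restricted figure ignores every case that involves <code class="language-plaintext highlighter-rouge">Basal</code>, so the two numbers answer slightly different questions.</p>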

<h3 id="f">(f)</h3>
<p>Create two plots colored by the clusters found in (b) and in (e) respectively. Do they look similar or different? Explain why using PCA to reduce the number of dimensions from 2000 to 2 did not significantly change the results of $K$-means.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">k3_origin</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">Gene_df</span><span class="p">,</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="c1"># clusters found in (b)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">real</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">),</span><span class="w"> 
           </span><span class="n">pred</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">k3_origin</span><span class="o">$</span><span class="n">cluster</span><span class="p">),</span><span class="n">ind.coord</span><span class="p">)</span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggscatter</span><span class="p">(</span><span class="w">
  </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Dim.1"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Dim.3"</span><span class="p">,</span><span class="w"> 
  </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pred"</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"npg"</span><span class="p">,</span><span class="w"> </span><span class="n">ellipse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">ellipse.type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"convex"</span><span class="p">,</span><span class="w">
  </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"real"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">  </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">ggtheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">()</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-06-15-machine-learning-practice-3/index.en-us_files/fig10.png" alt="" /></p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># clusters found in (e)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">real</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">),</span><span class="w"> 
           </span><span class="n">pred</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">k3</span><span class="o">$</span><span class="n">cluster</span><span class="p">),</span><span class="n">ind.coord</span><span class="p">)</span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggscatter</span><span class="p">(</span><span class="w">
  </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Dim.1"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Dim.3"</span><span class="p">,</span><span class="w"> 
  </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pred"</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"npg"</span><span class="p">,</span><span class="w"> </span><span class="n">ellipse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">ellipse.type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"convex"</span><span class="p">,</span><span class="w">
  </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"real"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">  </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">ggtheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">()</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The plots show that the two models produce different clusters but similar classification accuracy. Yeung &amp; Ruzzo (2000) showed that clustering on the PCs rather than the original dimensions does not necessarily improve cluster quality. Here, none of the leading PCs (which carry most of the variation in the data) captures the cluster structure well, so reducing to two PCs cannot improve the classification accuracy.</p>

<h3 id="g">(g)</h3>
<p>Now apply MDS with various metrics and non-metric MDS to <code class="language-plaintext highlighter-rouge">Gene</code> to obtain 2-dimensional representations. Does any of them provide a better-separated scatterplot than the one from (d)? Notice that the Euclidean metric in MDS gives the same representation as PCA does.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute classic MDS</span><span class="w">
</span><span class="n">mds_c</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Gene</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">dist</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">          
  </span><span class="n">cmdscale</span><span class="p">(</span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">type</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">mds_c</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Dim.1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dim.2"</span><span class="p">)</span><span class="w">
</span><span class="c1"># Plot MDS</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">mds_c</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Dim.1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">Dim.2</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">type</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute non-metric MDS</span><span class="w">
</span><span class="n">mds_n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Gene</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">dist</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">          
  </span><span class="n">isoMDS</span><span class="p">(</span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">.</span><span class="o">$</span><span class="n">points</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">type</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">mds_n</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Dim.1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dim.2"</span><span class="p">)</span><span class="w">
</span><span class="c1"># Plot MDS</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">mds_n</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Dim.1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">Dim.2</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">type</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p>The plots show that both classical and non-metric MDS give scatterplots with separation similar to that in (d); neither is clearly better.</p>
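
<p>The remark in the question, that MDS with the Euclidean metric reproduces PCA, can be checked numerically. A minimal sketch on simulated data (the matrix <code class="language-plaintext highlighter-rouge">X</code> is a toy stand-in, not the <code class="language-plaintext highlighter-rouge">Gene</code> data, and all object names are illustrative):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy data: 50 observations, 5 features (an assumption for illustration only).
set.seed(1)
X &lt;- matrix(rnorm(50 * 5), nrow = 50)

# First two PC scores (prcomp centers the columns by default).
pca_scores &lt;- prcomp(X)$x[, 1:2]

# Classical MDS on Euclidean distances, reduced to two dimensions.
mds_coords &lt;- cmdscale(dist(X), k = 2)

# Each axis is only defined up to a sign flip, so compare absolute values.
max_diff &lt;- max(abs(abs(pca_scores) - abs(mds_coords)))
max_diff
</code></pre></div></div>

<p>A difference at the level of floating-point error means the two representations coincide up to axis orientation, which is why the classical MDS plot resembles the Dim.1 versus Dim.2 panel from PCA.</p>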

<h3 id="h">(h)</h3>
<p>Perform $K$-means with $K=3$ on the new representations from (g) and report the confusion matrices. Compare them with that from (e).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">k3_mdsc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">mds_c</span><span class="p">[,</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">pred</span><span class="o">=</span><span class="n">k3_mdsc</span><span class="o">$</span><span class="n">cluster</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">results</span><span class="o">$</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">results</span><span class="o">$</span><span class="n">pred</span><span class="p">)</span><span class="w">

</span><span class="n">k3_mdsn</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">mds_n</span><span class="p">[,</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">pred</span><span class="o">=</span><span class="n">k3_mdsn</span><span class="o">$</span><span class="n">cluster</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">results</span><span class="o">$</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">results</span><span class="o">$</span><span class="n">pred</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>According to the count tables, classical MDS and non-metric MDS give the same results. The resulting confusion matrix of clusters versus subtypes is:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tibble</span><span class="p">(</span><span class="n">LumA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">195</span><span class="p">,</span><span class="w"> </span><span class="m">114</span><span class="p">),</span><span class="w">
       </span><span class="n">LumB</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">33</span><span class="p">,</span><span class="w"> </span><span class="m">117</span><span class="p">),</span><span class="w">
       </span><span class="n">pred</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w"> </span><span class="s2">"LumA (pred)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"LumB (pred)"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">column_to_rownames</span><span class="p">(</span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pred"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="m">195+117</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="m">195+114+33+117</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>From the confusion matrix for <code class="language-plaintext highlighter-rouge">LumA</code> and <code class="language-plaintext highlighter-rouge">LumB</code>, the overall classification accuracy is 68.0%, which is very close to the result obtained in (e).</p>

<h3 id="i">(i)</h3>
<p>Suppose we might know that the first PC contains information we aren’t interested in.  Apply $K$-means with $K=3$ to <code class="language-plaintext highlighter-rouge">Gene</code> dataset <strong>subtracting the approximation from the first PC</strong>. Report the confusion matrix and make some comments.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">k3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">ind.coord</span><span class="p">[,</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)],</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">pred</span><span class="o">=</span><span class="n">k3</span><span class="o">$</span><span class="n">cluster</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">results</span><span class="o">$</span><span class="n">Subtypes</span><span class="p">,</span><span class="n">results</span><span class="o">$</span><span class="n">pred</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
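
<p>The code above drops the first PC score column before clustering. Another reading of "subtracting the approximation from the first PC" is to remove the rank-one reconstruction from the centered data itself. A minimal sketch on simulated data (the matrix <code class="language-plaintext highlighter-rouge">X</code> is a toy stand-in for <code class="language-plaintext highlighter-rouge">Gene</code>, and all object names are illustrative):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy data: 60 observations, 4 features (an assumption for illustration only).
set.seed(1)
X &lt;- matrix(rnorm(60 * 4), nrow = 60)

pc &lt;- prcomp(X, center = TRUE, scale. = FALSE)
Xc &lt;- scale(X, center = TRUE, scale = FALSE)   # centered data

# Rank-one approximation from the first PC: scores times loadings.
approx1 &lt;- pc$x[, 1, drop = FALSE] %*% t(pc$rotation[, 1, drop = FALSE])
X_resid &lt;- Xc - approx1

# The residual has no remaining component along the first PC direction.
leftover &lt;- max(abs(X_resid %*% pc$rotation[, 1]))

# Pairwise distances in the residual equal those among ALL remaining PC scores,
# so K-means sees the same geometry in either representation.
same_dist &lt;- all.equal(as.numeric(dist(X_resid)),
                       as.numeric(dist(pc$x[, -1])))

km &lt;- kmeans(X_resid, centers = 3, nstart = 20)
table(km$cluster)
</code></pre></div></div>

<p>The equivalence holds only when every remaining PC is kept. If <code class="language-plaintext highlighter-rouge">ind.coord</code> stores just the first few components, clustering on it without Dim.1 is an approximation to clustering on the full residual.</p>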

<p>From the count table, we can again match clusters to subtypes and build the confusion matrix of clusters versus subtypes:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tibble</span><span class="p">(</span><span class="n">LumA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">140</span><span class="p">,</span><span class="w"> </span><span class="m">61</span><span class="p">),</span><span class="w">
       </span><span class="n">LumB</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="m">94</span><span class="p">),</span><span class="w">
       </span><span class="n">pred</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w"> </span><span class="s2">"LumA (pred)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"LumB (pred)"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">column_to_rownames</span><span class="p">(</span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pred"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="m">140+94</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="m">140+61+19+94</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Among the cases assigned to the <code class="language-plaintext highlighter-rouge">LumA</code> and <code class="language-plaintext highlighter-rouge">LumB</code> clusters, the classification accuracy is 74.5%, which is better than the model including the first PC. However, 106 <code class="language-plaintext highlighter-rouge">LumA</code> and 39 <code class="language-plaintext highlighter-rouge">LumB</code> cases are wrongly assigned to the <code class="language-plaintext highlighter-rouge">Basal</code> cluster, which is much worse than the model including the first PC ($\leq10$ <code class="language-plaintext highlighter-rouge">LumA</code> and $\leq10$ <code class="language-plaintext highlighter-rouge">LumB</code>).</p>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Machine Learning" /><category term="practice" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Machine Learning practice 2</title><link href="https://zianzhuang.com/posts/2021/05/blog-post-6/" rel="alternate" type="text/html" title="Machine Learning practice 2" /><published>2021-05-28T00:00:00+00:00</published><updated>2021-05-28T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/05/blog-post-6</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/05/blog-post-6/"><![CDATA[<!--more-->

<h2 id="q1">Q1.</h2>
<p>([ISL] 4.11, <em>25 pt</em>) In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the <code class="language-plaintext highlighter-rouge">Auto</code> data set. Write a data analysis report addressing the following problems.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">(</span><span class="n">Auto</span><span class="p">)</span><span class="w">
</span><span class="c1">#help("Auto")</span><span class="w">
</span></code></pre></div></div>

<h3 id="a">(a)</h3>
<p>Create a binary variable, <code class="language-plaintext highlighter-rouge">mpg01</code>, that contains a 1 if <code class="language-plaintext highlighter-rouge">mpg</code> contains a value above its median, and a 0 if <code class="language-plaintext highlighter-rouge">mpg</code> contains a value below its median.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Auto</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">mpg01</span><span class="o">=</span><span class="n">ifelse</span><span class="p">(</span><span class="n">mpg</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">mpg</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="b">(b)</h3>
<p>Explore the data graphically in order to investigate the association between <code class="language-plaintext highlighter-rouge">mpg01</code> and the other features. Which of the other features seem most likely to be useful in predicting <code class="language-plaintext highlighter-rouge">mpg01</code>? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">name_unclass</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">unclass</span><span class="p">(</span><span class="n">name</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">df</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">)]),</span><span class="n">as.factor</span><span class="p">)</span><span class="o">-&gt;</span><span class="w"> </span><span class="n">df_p</span><span class="w">
</span><span class="n">vars_need</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dput</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">df_p</span><span class="p">))[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">10</span><span class="p">)]</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">8</span><span class="p">){</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">7</span><span class="p">)){</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">(</span><span class="n">df_p</span><span class="p">,</span><span class="w"> </span><span class="n">aes_string</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">vars_need</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"mpg"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="s2">"mpg01"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">geom_boxplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">temp</span><span class="w">
  </span><span class="p">}</span><span class="k">else</span><span class="p">{</span><span class="w">
    </span><span class="n">ggplot</span><span class="p">(</span><span class="n">df_p</span><span class="p">,</span><span class="w"> </span><span class="n">aes_string</span><span class="p">(</span><span class="n">vars_need</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="s2">"mpg"</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="o">=</span><span class="s2">"mpg01"</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s2">"mpg01"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
      </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> 
      </span><span class="n">xlab</span><span class="p">(</span><span class="n">vars_need</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w">
      </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'#999999'</span><span class="p">,</span><span class="s1">'#E69F00'</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
      </span><span class="c1">#geom_smooth(method=lm, se=T, fullrange=T)+</span><span class="w">
      </span><span class="n">theme_bw</span><span class="p">()</span><span class="o">+</span><span class="w">
      </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="o">-&gt;</span><span class="w"> </span><span class="n">temp</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="n">temp_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"fig_"</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w">
  </span><span class="n">assign</span><span class="p">(</span><span class="n">temp_name</span><span class="p">,</span><span class="w"> </span><span class="n">temp</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">ggarrange</span><span class="p">(</span><span class="n">plotlist</span><span class="o">=</span><span class="n">mget</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"fig_"</span><span class="p">,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">8</span><span class="p">))),</span><span class="w"> 
          </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">8</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/post/2021-06-14-machine-learning-practice-2/index.en-us_files/fig28.png" alt="" /></p>

<p>Based on the scatterplots and boxplots, the variables <code class="language-plaintext highlighter-rouge">displacement</code>, <code class="language-plaintext highlighter-rouge">horsepower</code>, <code class="language-plaintext highlighter-rouge">weight</code>, <code class="language-plaintext highlighter-rouge">acceleration</code> and <code class="language-plaintext highlighter-rouge">year</code> appear most likely to be useful in predicting <code class="language-plaintext highlighter-rouge">mpg01</code>.</p>

<h3 id="c">(c)</h3>
<p>Split the data into a training set and a test set with ratio 2:1.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">1996</span><span class="p">)</span><span class="w">
</span><span class="n">trainid</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">df</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">*</span><span class="m">2</span><span class="o">/</span><span class="m">3</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">round</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w"> 
</span><span class="n">train</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df</span><span class="p">[</span><span class="n">trainid</span><span class="p">,]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df</span><span class="p">[</span><span class="o">-</span><span class="n">trainid</span><span class="p">,]</span><span class="w">
</span></code></pre></div></div>

<h3 id="d">(d)</h3>
<p>Perform LDA on the training data in order to predict <code class="language-plaintext highlighter-rouge">mpg01</code> using the variables that seemed most associated with <code class="language-plaintext highlighter-rouge">mpg01</code> in (b). What is the test error of the model obtained?</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ldafit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lda</span><span class="p">(</span><span class="n">mpg01</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">displacement</span><span class="o">+</span><span class="n">horsepower</span><span class="o">+</span><span class="n">weight</span><span class="o">+</span><span class="n">acceleration</span><span class="o">+</span><span class="n">year</span><span class="p">,</span><span class="w"> 
               </span><span class="n">data</span><span class="o">=</span><span class="n">train</span><span class="p">)</span><span class="w">
</span><span class="n">ldafit_pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">ldafit</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="p">)</span><span class="o">$</span><span class="n">class</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">ldafit_pred</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">mpg01</span><span class="p">)</span><span class="w">

</span><span class="c1"># test error</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">ldafit_pred</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">mpg01</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>test error rate: 9.92%</p>

<h3 id="e">(e)</h3>
<p>Perform QDA on the training data in order to predict <code class="language-plaintext highlighter-rouge">mpg01</code> using the variables that seemed most associated with <code class="language-plaintext highlighter-rouge">mpg01</code> in (b). What is the test error of the model obtained?</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">qdafit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">qda</span><span class="p">(</span><span class="n">mpg01</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">displacement</span><span class="o">+</span><span class="n">horsepower</span><span class="o">+</span><span class="n">weight</span><span class="o">+</span><span class="n">acceleration</span><span class="o">+</span><span class="n">year</span><span class="p">,</span><span class="w"> 
               </span><span class="n">data</span><span class="o">=</span><span class="n">train</span><span class="p">)</span><span class="w">
</span><span class="n">qdafit_pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">qdafit</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="p">)</span><span class="o">$</span><span class="n">class</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">qdafit_pred</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">mpg01</span><span class="p">)</span><span class="w">

</span><span class="c1"># test error</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">qdafit_pred</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">mpg01</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>test error rate: 10.69%</p>

<h3 id="f">(f)</h3>
<p>Perform logistic regression on the training data in order to predict <code class="language-plaintext highlighter-rouge">mpg01</code> using the variables that seemed most associated with <code class="language-plaintext highlighter-rouge">mpg01</code> in (b). What is the test error of the model obtained?</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logitfit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">mpg01</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">displacement</span><span class="o">+</span><span class="n">horsepower</span><span class="o">+</span><span class="n">weight</span><span class="o">+</span><span class="n">acceleration</span><span class="o">+</span><span class="n">year</span><span class="p">,</span><span class="w"> 
                </span><span class="n">data</span><span class="o">=</span><span class="n">train</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="o">=</span><span class="n">binomial</span><span class="p">)</span><span class="w">
</span><span class="n">logitfit_prob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">logitfit</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"response"</span><span class="p">)</span><span class="w">
</span><span class="n">logitfit_pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">logitfit_prob</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">logitfit_pred</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">mpg01</span><span class="p">)</span><span class="w">

</span><span class="c1"># error rate</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">logitfit_pred</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">mpg01</span><span class="p">)</span><span class="w">  
</span></code></pre></div></div>

<p>The test error rate is 10.69%, the same as that of QDA.</p>
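
<p>As a rough consistency check, the three reported percentages can be mapped back to misclassification counts. This is only a sketch: it assumes that <code class="language-plaintext highlighter-rouge">Auto</code> has 392 rows (as in the ISLR package), so that the 2:1 split leaves 131 test cars. The counts 13 and 14 are inferred from the percentages, not read from the confusion tables.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumption: Auto has 392 rows; n, n_train and n_test are illustrative names
n &lt;- 392
n_train &lt;- round(n * 2 / 3)   # size of the training set
n_test &lt;- n - n_train         # size of the test set

# Error rates (in %) implied by 13 and 14 misclassified test cars
round(100 * c(13, 14) / n_test, 2)
</code></pre></div></div>

<p>Under that assumption, LDA differs from QDA and logistic regression by a single test car, so the ranking of the three classifiers should not be over-interpreted.</p>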

<h2 id="q2">Q2.</h2>
<p>The <code class="language-plaintext highlighter-rouge">Boston</code> dataset contains variables <code class="language-plaintext highlighter-rouge">dis</code> (the weighted mean of distances to five Boston employment centers) and <code class="language-plaintext highlighter-rouge">nox</code> (nitrogen oxides concentration in parts per 10 million).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">(</span><span class="s2">"Boston"</span><span class="p">)</span><span class="w">
</span><span class="c1">#help(Boston)</span><span class="w">
</span></code></pre></div></div>

<h3 id="a-1">(a)</h3>
<p>Use the <code class="language-plaintext highlighter-rouge">poly()</code> function to fit a cubic polynomial regression to predict <code class="language-plaintext highlighter-rouge">nox</code> using <code class="language-plaintext highlighter-rouge">dis</code>. Report the regression output, and plot the data and resulting polynomial fits.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lmfit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">nox</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Boston</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">lmfit</span><span class="p">)</span><span class="w">

</span><span class="n">dislims</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">Boston</span><span class="o">$</span><span class="n">dis</span><span class="p">)</span><span class="w">
</span><span class="n">dis.grid</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">dislims</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">dislims</span><span class="p">[</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">preds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">lmfit</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">dis</span><span class="o">=</span><span class="n">dis.grid</span><span class="p">),</span><span class="w"> </span><span class="n">se</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">se95</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">preds</span><span class="o">$</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="m">1.96</span><span class="o">*</span><span class="n">preds</span><span class="o">$</span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="m">-1.96</span><span class="o">*</span><span class="n">preds</span><span class="o">$</span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">Boston</span><span class="o">$</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">Boston</span><span class="o">$</span><span class="n">nox</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="o">=</span><span class="n">dislims</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">dis.grid</span><span class="p">,</span><span class="w"> </span><span class="n">preds</span><span class="o">$</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">2.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">)</span><span class="w">
</span><span class="n">matlines</span><span class="p">(</span><span class="n">dis.grid</span><span class="p">,</span><span class="w"> </span><span class="n">se95</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
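
<p>One detail about <code class="language-plaintext highlighter-rouge">poly()</code> matters when reading the regression output: by default it builds an orthogonal polynomial basis, so the reported coefficients are not the coefficients of the raw powers of the predictor. The fitted curve is the same either way. A minimal sketch on simulated data, where <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are hypothetical stand-ins for <code class="language-plaintext highlighter-rouge">dis</code> and <code class="language-plaintext highlighter-rouge">nox</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
x &lt;- runif(200, 1, 12)                            # hypothetical stand-in for dis
y &lt;- 0.9 - 0.3 * log(x) + rnorm(200, sd = 0.05)   # hypothetical stand-in for nox

fit_orth &lt;- lm(y ~ poly(x, 3))                    # orthogonal basis (default)
fit_raw  &lt;- lm(y ~ poly(x, 3, raw = TRUE))        # raw powers x, x^2, x^3

# The coefficients differ, but the fitted values agree up to numerical tolerance
all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))
</code></pre></div></div>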

<h3 id="b-1">(b)</h3>
<p>Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rss.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">lmfit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">nox</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">Boston</span><span class="p">)</span><span class="w">
  </span><span class="n">rss.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">rss.error</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">data.frame</span><span class="p">(</span><span class="n">rss.error</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">lmfit</span><span class="o">$</span><span class="n">residuals</span><span class="o">^</span><span class="m">2</span><span class="p">),</span><span class="w">
                                </span><span class="n">polynomial.degrees</span><span class="o">=</span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># report the associated residual sum of squares</span><span class="w">
</span><span class="n">rss.error</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">rss.error</span><span class="o">$</span><span class="n">polynomial.degrees</span><span class="p">,</span><span class="w"> </span><span class="n">rss.error</span><span class="o">$</span><span class="n">rss.error</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"b"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The plot shows that the RSS decreases monotonically as the polynomial degree increases.</p>
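
<p>This monotone decrease is a property of nested least-squares fits rather than of the Boston data: a degree-<em>d</em> polynomial is a special case of a degree-(<em>d</em>+1) polynomial, so the training RSS cannot go up when a term is added. The sketch below illustrates this on simulated data; <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">y</code> and <code class="language-plaintext highlighter-rouge">rss</code> are illustrative names, not objects from the analysis above.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
x &lt;- runif(200, 1, 12)                            # hypothetical predictor
y &lt;- 0.9 - 0.3 * log(x) + rnorm(200, sd = 0.05)   # hypothetical response

# Training RSS for polynomial degrees 1 to 10
rss &lt;- sapply(1:10, function(d) sum(resid(lm(y ~ poly(x, d)))^2))

# Successive differences: each entry is &lt;= 0 up to rounding error
diff(rss)
</code></pre></div></div>

<p>Because the training RSS can only improve with degree, it cannot be used to choose the degree.</p>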

<h3 id="c-1">(c)</h3>
<p>Perform cross-validation to select the optimal degree for the polynomial, and explain your results.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cv.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1996</span><span class="p">)</span><span class="w">
  </span><span class="n">glm.fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">nox</span><span class="o">~</span><span class="n">poly</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gaussian</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">Boston</span><span class="p">)</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cv.glm</span><span class="p">(</span><span class="n">Boston</span><span class="p">,</span><span class="w"> </span><span class="n">glm.fit</span><span class="p">,</span><span class="w"> </span><span class="n">K</span><span class="o">=</span><span class="m">10</span><span class="p">)</span><span class="o">$</span><span class="n">delta</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w">
  </span><span class="n">cv.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">cv.error</span><span class="p">,</span><span class="w"> 
                    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">rss.error</span><span class="o">=</span><span class="n">temp</span><span class="p">,</span><span class="w"> </span><span class="n">polynomial.degrees</span><span class="o">=</span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">cv.error</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">cv.error</span><span class="o">$</span><span class="n">polynomial.degrees</span><span class="p">,</span><span class="w"> </span><span class="n">cv.error</span><span class="o">$</span><span class="n">rss.error</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"b"</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>

<p>According to the plot, the CV error decreases as the polynomial degree increases from 1 to 3 and shows no clear improvement beyond degree 3. Thus, we choose 3 as the best polynomial degree.</p>
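
<p>To make explicit what the cross-validation above computes, the same selection can be sketched by hand in base R. This is an illustrative version, not the <code class="language-plaintext highlighter-rouge">cv.glm()</code> implementation: it assumes <code class="language-plaintext highlighter-rouge">Boston</code> comes from the MASS package, the names <code class="language-plaintext highlighter-rouge">fold</code> and <code class="language-plaintext highlighter-rouge">cv_mse</code> are made up here, and it takes a plain average of the fold errors instead of the adjusted estimate in <code class="language-plaintext highlighter-rouge">delta[2]</code>, so the numbers will differ slightly.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(MASS)   # assumed source of the Boston data

set.seed(1996)
K &lt;- 10
# Randomly assign each row to one of K folds of (almost) equal size
fold &lt;- sample(rep(1:K, length.out = nrow(Boston)))

cv_mse &lt;- sapply(1:10, function(d) {
  fold_mse &lt;- sapply(1:K, function(k) {
    # Fit on all folds except k, then predict the held-out fold k
    fit &lt;- lm(nox ~ poly(dis, d), data = Boston[fold != k, ])
    pred &lt;- predict(fit, newdata = Boston[fold == k, ])
    mean((Boston$nox[fold == k] - pred)^2)
  })
  mean(fold_mse)   # unweighted average over the K folds
})

cv_mse
</code></pre></div></div>

<p>Unlike the training RSS, this held-out error is not guaranteed to decrease with the degree, which is why it can be used for model selection.</p>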

<h3 id="d-1">(d)</h3>
<p>Use the <code class="language-plaintext highlighter-rouge">bs()</code> function to fit a regression spline to predict <code class="language-plaintext highlighter-rouge">nox</code> using <code class="language-plaintext highlighter-rouge">dis</code>. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.</p>

<p><strong>Answer</strong>:</p>

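<p>We place the knots at the 25%, 50% and 75% quantiles of the <code class="language-plaintext highlighter-rouge">dis</code> data. Note that when <code class="language-plaintext highlighter-rouge">knots</code> is supplied, <code class="language-plaintext highlighter-rouge">bs()</code> does not use <code class="language-plaintext highlighter-rouge">df</code>, so this cubic spline has three interior knots and six degrees of freedom. A fit with exactly four degrees of freedom would pass <code class="language-plaintext highlighter-rouge">df = 4</code> alone, in which case <code class="language-plaintext highlighter-rouge">bs()</code> places a single knot at the median.</p>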

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">knots_set</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">Boston</span><span class="o">$</span><span class="n">dis</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.numeric</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">.</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">)]</span><span class="w">
</span><span class="n">spfit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">nox</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">bs</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots_set</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Boston</span><span class="p">)</span><span class="w">

</span><span class="c1">#Report the model summary</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">spfit</span><span class="p">)</span><span class="w">

</span><span class="c1">#Resulting fit</span><span class="w">
</span><span class="n">sppred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">spfit</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">dis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dis.grid</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">nox</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Boston</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">dis.grid</span><span class="p">,</span><span class="w"> </span><span class="n">sppred</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The prediction line seems to fit the data well.</p>
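
<p>To see how <code class="language-plaintext highlighter-rouge">bs</code> relates degrees of freedom to knots, a minimal sketch follows. It assumes <code class="language-plaintext highlighter-rouge">Boston</code> comes from the <code class="language-plaintext highlighter-rouge">MASS</code> package, and the names <code class="language-plaintext highlighter-rouge">basis_df4</code> and <code class="language-plaintext highlighter-rouge">basis_knots</code> are illustrative only. When <code class="language-plaintext highlighter-rouge">knots</code> is supplied explicitly, <code class="language-plaintext highlighter-rouge">bs</code> takes the knots from that argument rather than deriving them from <code class="language-plaintext highlighter-rouge">df</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(splines)
library(MASS)  # assumed source of the Boston data

# With df = 4 and the default cubic degree, bs() places 4 - 3 = 1 interior knot
basis_df4 &lt;- bs(Boston$dis, df = 4)
attr(basis_df4, "knots")   # a single knot, at the median of dis

# Supplying knots directly: the basis has length(knots) + 3 columns
basis_knots &lt;- bs(Boston$dis, knots = quantile(Boston$dis, c(0.25, 0.5, 0.75)))
ncol(basis_knots)
</code></pre></div></div>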

<h3 id="e-1">(e)</h3>
<p>Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rss.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">4</span><span class="o">:</span><span class="m">16</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">spfit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">nox</span><span class="o">~</span><span class="n">bs</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="o">=</span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">Boston</span><span class="p">)</span><span class="w">
  </span><span class="n">rss.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">rss.error</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">data.frame</span><span class="p">(</span><span class="n">rss.error</span><span class="o">=</span><span class="nf">sum</span><span class="p">(</span><span class="n">spfit</span><span class="o">$</span><span class="n">residuals</span><span class="o">^</span><span class="m">2</span><span class="p">),</span><span class="w">
                                </span><span class="n">polynomial.degrees</span><span class="o">=</span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># report the associated residual sum of squares</span><span class="w">
</span><span class="n">rss.error</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">rss.error</span><span class="o">$</span><span class="n">polynomial.degrees</span><span class="p">,</span><span class="w"> </span><span class="n">rss.error</span><span class="o">$</span><span class="n">rss.error</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"b"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

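<p>The plot shows that the training RSS falls as the degrees of freedom increase from 4 to 14, with no clear further improvement beyond 14. Because RSS is computed on the training data, it rewards extra flexibility and cannot by itself identify the best degrees of freedom.</p>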
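
<p>The same RSS table can be built more compactly. The sketch below is only one possible variant, assuming <code class="language-plaintext highlighter-rouge">Boston</code> comes from <code class="language-plaintext highlighter-rouge">MASS</code>; <code class="language-plaintext highlighter-rouge">dfs</code> and <code class="language-plaintext highlighter-rouge">rss_by_df</code> are illustrative names. It relies on <code class="language-plaintext highlighter-rouge">deviance</code> returning the residual sum of squares for an <code class="language-plaintext highlighter-rouge">lm</code> fit.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(splines)
library(MASS)  # assumed source of the Boston data

dfs &lt;- 4:16
# Training RSS for each spline fit; deviance() on an lm returns the RSS
rss_by_df &lt;- vapply(dfs, function(d) {
  deviance(lm(nox ~ bs(dis, df = d), data = Boston))
}, numeric(1))

data.frame(df = dfs, rss = rss_by_df)
</code></pre></div></div>

<p>Naming the column <code class="language-plaintext highlighter-rouge">df</code> makes it explicit that the x-axis is the spline's degrees of freedom, not a polynomial degree.</p>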

<h3 id="f-1">(f)</h3>
<p>Perform cross-validation to select the best degrees of freedom for a regression spline on this data. Describe your results.</p>

<p><strong>Answer</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cv.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">4</span><span class="o">:</span><span class="m">16</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">set.seed</span><span class="p">(</span><span class="m">19969</span><span class="p">)</span><span class="w">
  </span><span class="n">glm.fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">nox</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">bs</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gaussian</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">Boston</span><span class="p">)</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cv.glm</span><span class="p">(</span><span class="n">Boston</span><span class="p">,</span><span class="w"> </span><span class="n">glm.fit</span><span class="p">,</span><span class="w"> </span><span class="n">K</span><span class="o">=</span><span class="m">10</span><span class="p">)</span><span class="o">$</span><span class="n">delta</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="c1"># adjusted cross-validation estimate</span><span class="w">
  </span><span class="n">cv.error</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">cv.error</span><span class="p">,</span><span class="w"> 
                    </span><span class="n">data.frame</span><span class="p">(</span><span class="n">rss.error</span><span class="o">=</span><span class="n">temp</span><span class="p">,</span><span class="w"> </span><span class="n">polynomial.degrees</span><span class="o">=</span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">cv.error</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">cv.error</span><span class="o">$</span><span class="n">polynomial.degrees</span><span class="p">,</span><span class="w"> </span><span class="n">cv.error</span><span class="o">$</span><span class="n">rss.error</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"b"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
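
<p>As a cross-check on the <code class="language-plaintext highlighter-rouge">cv.glm</code> call above, a hand-rolled K-fold loop might look like the sketch below. It is not the method used for the results discussed here. It assumes <code class="language-plaintext highlighter-rouge">Boston</code> comes from <code class="language-plaintext highlighter-rouge">MASS</code>, and <code class="language-plaintext highlighter-rouge">folds</code>, <code class="language-plaintext highlighter-rouge">dis_range</code> and <code class="language-plaintext highlighter-rouge">cv_mse</code> are hypothetical names. Fixing <code class="language-plaintext highlighter-rouge">Boundary.knots</code> to the full range of <code class="language-plaintext highlighter-rouge">dis</code> is a design choice meant to keep held-out points inside the spline's boundary knots.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(splines)
library(MASS)  # assumed source of the Boston data

set.seed(19969)
k &lt;- 10
# Randomly assign each row to one of k folds of near-equal size
folds &lt;- sample(rep(seq_len(k), length.out = nrow(Boston)))
dis_range &lt;- range(Boston$dis)

# Mean held-out squared error across the k folds for a given spline df
cv_mse &lt;- function(n_df) {
  fold_mse &lt;- vapply(seq_len(k), function(j) {
    train &lt;- Boston[folds != j, ]
    test  &lt;- Boston[folds == j, ]
    fit &lt;- lm(nox ~ bs(dis, df = n_df, Boundary.knots = dis_range), data = train)
    mean((test$nox - predict(fit, newdata = test))^2)
  }, numeric(1))
  mean(fold_mse)
}

cv_by_df &lt;- vapply(4:16, cv_mse, numeric(1))
names(cv_by_df) &lt;- 4:16
cv_by_df
</code></pre></div></div>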

<p>With 10-fold cross-validation, the plot shows that the CV error reaches its minimum at 10 degrees of freedom and shows no clear improvement beyond that. We therefore choose 10 as the best degrees of freedom.</p>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Machine Learning" /><category term="practice" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Machine Learning practice 1</title><link href="https://zianzhuang.com/posts/2021/05/blog-post-5/" rel="alternate" type="text/html" title="Machine Learning practice 1" /><published>2021-05-22T00:00:00+00:00</published><updated>2021-05-22T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/05/blog-post-5</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/05/blog-post-5/"><![CDATA[<!--more-->

<ol>
  <li>(Model Selection, [ISL] 6.8, <em>25 pt</em>) In this exercise, we will generate simulated data, and will then use this data to perform model selection.</li>
</ol>

<p>(a) Use the <code class="language-plaintext highlighter-rouge">rnorm</code> function to generate a predictor $X$ of length $n = 100$, as well as a noise vector $\epsilon$ of length $n = 100$.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">199609</span><span class="p">)</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">epsilon</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>

<p>(b) Generate a response vector $Y$ of length $n = 100$ according to the model \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon,\) where $\beta_0 = 3$, $\beta_1 = 2$, $\beta_2 = -3$, $\beta_3 = 0.3$.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b0</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">b1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">b2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">-3</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="n">b3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.3</span><span class="w">

</span><span class="n">Y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">b0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b1</span><span class="o">*</span><span class="n">X</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b2</span><span class="o">*</span><span class="n">X</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b3</span><span class="o">*</span><span class="n">X</span><span class="o">^</span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">epsilon</span><span class="w">
</span></code></pre></div></div>

<p>(c) Use the <code class="language-plaintext highlighter-rouge">regsubsets</code> function from <code class="language-plaintext highlighter-rouge">leaps</code> package to perform best subset selection in order to choose the best model from the set of predictors $(X, X^2, \cdots, X^{10})$. What are the best models obtained according to $C_p$, BIC, and adjusted $R^2$, respectively? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">3</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">4</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">5</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">6</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">7</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">8</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">9</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">10</span><span class="p">,</span><span class="n">Y</span><span class="p">)</span><span class="w">
</span><span class="n">best_model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">regsubsets</span><span class="p">(</span><span class="n">Y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">nvmax</span><span class="o">=</span><span class="m">10</span><span class="p">)</span><span class="w"> 
</span><span class="n">summary_model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">best_model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">

</span><span class="c1">#the best models according to C_p, BIC, and adjusted R^2</span><span class="w">
</span><span class="n">min.cp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">cp</span><span class="p">)</span><span class="w">  
</span><span class="n">min.cp</span><span class="w">
</span><span class="n">min.bic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">bic</span><span class="p">)</span><span class="w"> 
</span><span class="n">min.bic</span><span class="w">
</span><span class="n">min.adjr2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">adjr2</span><span class="p">)</span><span class="w">
</span><span class="n">min.adjr2</span><span class="w">

</span><span class="c1"># plots that show the best models</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">cp</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of Poly(X)"</span><span class="p">,</span><span class="w"> 
     </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="s2">"Cp"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"l"</span><span class="p">)</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">min.cp</span><span class="p">,</span><span class="w"> </span><span class="n">summary_model</span><span class="o">$</span><span class="n">cp</span><span class="p">[</span><span class="n">min.cp</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span><span class="n">plot</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">bic</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of Poly(X)"</span><span class="p">,</span><span class="w"> 
     </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"BIC"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"l"</span><span class="p">)</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">min.bic</span><span class="p">,</span><span class="w"> </span><span class="n">summary_model</span><span class="o">$</span><span class="n">bic</span><span class="p">[</span><span class="n">min.bic</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span><span class="n">plot</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">adjr2</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of Poly(X)"</span><span class="p">,</span><span class="w"> 
     </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Adjusted R^2"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"l"</span><span class="p">)</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">min.adjr2</span><span class="p">,</span><span class="w"> </span><span class="n">summary_model</span><span class="o">$</span><span class="n">adjr2</span><span class="p">[</span><span class="n">min.adjr2</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="w"> </span><span class="n">min.cp</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="w"> </span><span class="n">min.bic</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="w"> </span><span class="n">min.adjr2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>(d) Repeat (c), using forward stepwise selection and also using backward stepwise selection. How does your answer compare to the results in (c)?</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># forward stepwise selection.</span><span class="w">
</span><span class="n">best_model_f</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">regsubsets</span><span class="p">(</span><span class="n">Y</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">nvmax</span><span class="o">=</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="o">=</span><span class="s1">'forward'</span><span class="p">)</span><span class="w">
</span><span class="n">summary_model_f</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">best_model_f</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">

</span><span class="c1"># refer to C_p, BIC, and adjusted R^2</span><span class="w">
</span><span class="n">min.cp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model_f</span><span class="o">$</span><span class="n">cp</span><span class="p">)</span><span class="w">  
</span><span class="n">min.cp</span><span class="w">
</span><span class="n">min.bic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model_f</span><span class="o">$</span><span class="n">bic</span><span class="p">)</span><span class="w"> 
</span><span class="n">min.bic</span><span class="w">
</span><span class="n">min.adjr2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">summary_model_f</span><span class="o">$</span><span class="n">adjr2</span><span class="p">)</span><span class="w">
</span><span class="n">min.adjr2</span><span class="w">

</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model_f</span><span class="p">,</span><span class="w"> </span><span class="n">min.cp</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model_f</span><span class="p">,</span><span class="w"> </span><span class="n">min.bic</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model_f</span><span class="p">,</span><span class="w"> </span><span class="n">min.adjr2</span><span class="p">)</span><span class="w">

</span><span class="c1"># backward stepwise selection.</span><span class="w">
</span><span class="n">best_model_b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">regsubsets</span><span class="p">(</span><span class="n">Y</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">nvmax</span><span class="o">=</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="o">=</span><span class="s1">'backward'</span><span class="p">)</span><span class="w">
</span><span class="n">summary_model_b</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">best_model_b</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">

</span><span class="c1">#the best models according to C_p, BIC, and adjusted R^2</span><span class="w">
</span><span class="n">min.cp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model_b</span><span class="o">$</span><span class="n">cp</span><span class="p">)</span><span class="w">  
</span><span class="n">min.cp</span><span class="w">
</span><span class="n">min.bic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model_b</span><span class="o">$</span><span class="n">bic</span><span class="p">)</span><span class="w"> 
</span><span class="n">min.bic</span><span class="w">
</span><span class="n">min.adjr2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">summary_model_b</span><span class="o">$</span><span class="n">adjr2</span><span class="p">)</span><span class="w">
</span><span class="n">min.adjr2</span><span class="w">

</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model_b</span><span class="p">,</span><span class="w"> </span><span class="n">min.cp</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model_b</span><span class="p">,</span><span class="w"> </span><span class="n">min.bic</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model_b</span><span class="p">,</span><span class="w"> </span><span class="n">min.adjr2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Both forward and backward stepwise selection gave the same results as best subset selection in part (c).</p>
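
<p>One way to check such agreement programmatically is to compare the variables each search method keeps at its BIC-optimal size. The sketch below is illustrative only: it regenerates data in the same way as parts (a) and (b), and <code class="language-plaintext highlighter-rouge">selected_vars</code> and <code class="language-plaintext highlighter-rouge">picks</code> are hypothetical names, not part of the original answer.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(leaps)

set.seed(199609)
X &lt;- rnorm(100)
Y &lt;- 3 + 2*X - 3*X^2 + 0.3*X^3 + rnorm(100)
df &lt;- data.frame(X, X^2, X^3, X^4, X^5, X^6, X^7, X^8, X^9, X^10, Y)

# Names of the terms kept at the BIC-optimal model size for one search method
selected_vars &lt;- function(method) {
  fit &lt;- regsubsets(Y ~ ., data = df, nvmax = 10, method = method)
  s &lt;- summary(fit)
  best_size &lt;- which.min(s$bic)
  names(which(s$which[best_size, ]))
}

picks &lt;- lapply(c(exhaustive = "exhaustive", forward = "forward",
                  backward = "backward"), selected_vars)
picks
# TRUE only if all three methods select exactly the same terms
length(unique(picks)) == 1
</code></pre></div></div>

<p>The same comparison can be repeated with <code class="language-plaintext highlighter-rouge">s$cp</code> (minimized) or <code class="language-plaintext highlighter-rouge">s$adjr2</code> (maximized) in place of <code class="language-plaintext highlighter-rouge">s$bic</code>.</p>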

<p>(e) Now fit a LASSO model with <code class="language-plaintext highlighter-rouge">glmnet</code> function from <code class="language-plaintext highlighter-rouge">glmnet</code> package to the simulated data, again using $(X,X^2,\cdots,X^{10})$ as predictors. Use cross-validation to select the optimal value of $\lambda$. Create plots of the cross-validation error as a function of $\lambda$. Report the resulting coefficient estimates, and discuss the results obtained.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Lasso model</span><span class="w">
</span><span class="n">lasso</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glmnet</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">-11</span><span class="p">]</span><span class="o">%&gt;%</span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">11</span><span class="p">],</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="c1"># cross-validation</span><span class="w">
</span><span class="n">cv.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cv.glmnet</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">-11</span><span class="p">]</span><span class="o">%&gt;%</span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">11</span><span class="p">],</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="c1"># cross-validation error</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">cv.out</span><span class="p">)</span><span class="w">
</span><span class="c1"># the optimal value of lambda</span><span class="w">
</span><span class="n">best_lam</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cv.out</span><span class="o">$</span><span class="n">lambda.min</span><span class="w">
</span><span class="c1"># Coefficients from lasso model with best lambda.</span><span class="w">
</span><span class="n">predict</span><span class="p">(</span><span class="n">lasso</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">=</span><span class="n">best_lam</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"coefficients"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="/post/2021-06-14-machine-learning-practice-1/index.en-us_files/fig24.png" alt="" /></p>

<p>According to the results, the lasso estimates of $\beta_1$ through $\beta_3$ at the best $\lambda$ are largely consistent with the true values. Nevertheless, the lasso also retained $X^7$ and $X^9$, which are not in the true model, with small coefficients.</p>
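
<p>Spurious small coefficients like these are often examined by listing only the nonzero estimates and by comparing <code class="language-plaintext highlighter-rouge">lambda.min</code> with the more heavily penalized <code class="language-plaintext highlighter-rouge">lambda.1se</code> that <code class="language-plaintext highlighter-rouge">cv.glmnet</code> also reports. The sketch below is a hedged illustration on data regenerated as in parts (a) and (b); <code class="language-plaintext highlighter-rouge">x_mat</code>, <code class="language-plaintext highlighter-rouge">cv_fit</code> and <code class="language-plaintext highlighter-rouge">nonzero</code> are hypothetical names, and whether the extra terms actually drop out depends on the random folds.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(glmnet)

set.seed(199609)
X &lt;- rnorm(100)
Y &lt;- 3 + 2*X - 3*X^2 + 0.3*X^3 + rnorm(100)
# Predictor matrix with columns X^1, ..., X^10
x_mat &lt;- sapply(1:10, function(p) X^p)
colnames(x_mat) &lt;- paste0("X", 1:10)

cv_fit &lt;- cv.glmnet(x_mat, Y, alpha = 1)

# Nonzero coefficients under a given lambda rule
nonzero &lt;- function(rule) {
  b &lt;- as.matrix(coef(cv_fit, s = rule))
  b[b[, 1] != 0, , drop = FALSE]
}
nonzero("lambda.min")
nonzero("lambda.1se")  # larger penalty, usually a sparser model
</code></pre></div></div>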

<p>(f) Now generate a response vector $Y$ according to the model \(Y = \beta_0 + \beta_7 X^7 + \epsilon,\) where $\beta_7 = 7$, and perform best subset selection and the LASSO. Discuss the results obtained.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate y2</span><span class="w">
</span><span class="n">b7</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">7</span><span class="w">
</span><span class="n">Y2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">b0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b7</span><span class="o">*</span><span class="n">X</span><span class="o">^</span><span class="m">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">epsilon</span><span class="w">

</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">3</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">4</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">5</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">6</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">7</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">8</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">9</span><span class="p">,</span><span class="n">X</span><span class="o">^</span><span class="m">10</span><span class="p">,</span><span class="n">Y2</span><span class="p">)</span><span class="w">

</span><span class="c1"># best subset selection</span><span class="w">
</span><span class="n">best_model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">regsubsets</span><span class="p">(</span><span class="n">Y2</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">nvmax</span><span class="o">=</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">summary_model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">best_model</span><span class="p">)</span><span class="w">

</span><span class="c1"># refer to C_p, BIC, and adjusted R^2</span><span class="w">
</span><span class="n">min.cp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">cp</span><span class="p">)</span><span class="w">  
</span><span class="n">min.bic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">bic</span><span class="p">)</span><span class="w"> 
</span><span class="n">min.adjr2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">summary_model</span><span class="o">$</span><span class="n">adjr2</span><span class="p">)</span><span class="w">

</span><span class="c1"># check results</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="w"> </span><span class="n">min.cp</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="w"> </span><span class="n">min.bic</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="w"> </span><span class="n">min.adjr2</span><span class="p">)</span><span class="w">

</span><span class="c1"># LASSO model</span><span class="w">
</span><span class="n">lasso</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glmnet</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">-11</span><span class="p">]</span><span class="o">%&gt;%</span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">11</span><span class="p">],</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="c1"># cross-validation</span><span class="w">
</span><span class="n">cv.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cv.glmnet</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">-11</span><span class="p">]</span><span class="o">%&gt;%</span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="m">11</span><span class="p">],</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="c1"># cross-validation error</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">cv.out</span><span class="p">)</span><span class="w">
</span><span class="c1"># the optimal value of lambda</span><span class="w">
</span><span class="n">best_lam</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cv.out</span><span class="o">$</span><span class="n">lambda.min</span><span class="w">
</span><span class="c1"># Coefficients from lasso model with best lambda.</span><span class="w">
</span><span class="n">predict</span><span class="p">(</span><span class="n">lasso</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">=</span><span class="n">best_lam</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"coefficients"</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>

<p>According to the results, best subset selection guided by BIC and LASSO regression both give more accurate coefficient estimates than best subset selection guided by $C_p$ or adjusted $R^2$.</p>
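<p>To see where the criteria disagree, it can help to tabulate the model size each one selects. The sketch below is illustrative only: the seed, the sample size, and the values of <code class="language-plaintext highlighter-rouge">b0</code> and <code class="language-plaintext highlighter-rouge">b7</code> are assumptions rather than the settings used above, and the object names <code class="language-plaintext highlighter-rouge">sim</code>, <code class="language-plaintext highlighter-rouge">crit</code> and <code class="language-plaintext highlighter-rouge">sizes</code> are hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(leaps)

set.seed(1)                        # assumed seed
n &lt;- 100                           # assumed sample size
X &lt;- rnorm(n)
epsilon &lt;- rnorm(n)
b0 &lt;- 1; b7 &lt;- 2                   # assumed true coefficients
Y2 &lt;- b0 + b7 * X^7 + epsilon

# columns X1..X10 hold X, X^2, ..., X^10
powers &lt;- sapply(1:10, function(k) X^k)
sim &lt;- data.frame(powers, Y2)

fit &lt;- regsubsets(Y2 ~ ., data = sim, nvmax = 10)
crit &lt;- summary(fit)

# number of predictors chosen under each criterion
sizes &lt;- c(cp    = which.min(crit$cp),
           bic   = which.min(crit$bic),
           adjr2 = which.max(crit$adjr2))
print(sizes)
</code></pre></div></div>

<p>BIC charges log(n) per parameter, which is heavier than the penalty of 2 used by $C_p$ once n exceeds 7, so it tends to select smaller models.</p>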

<ol>
  <li>(Prediction, [ISL] 6.9, <em>20 pt</em>) In this exercise, we will predict the number of applications received (<code class="language-plaintext highlighter-rouge">Apps</code>) using the other variables in the <code class="language-plaintext highlighter-rouge">College</code> data set from <code class="language-plaintext highlighter-rouge">ISLR</code> package.</li>
</ol>

<p>(a) Randomly split the data set into equal-sized training and test sets (1:1).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">(</span><span class="n">College</span><span class="p">)</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">199609</span><span class="p">)</span><span class="w">
</span><span class="n">train_id</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nf">dim</span><span class="p">(</span><span class="n">College</span><span class="p">)[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">College</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w">

</span><span class="n">train_set</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">College</span><span class="p">[</span><span class="n">train_id</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">test_set</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">College</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
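<p>Note that <code class="language-plaintext highlighter-rouge">College</code> has 777 rows, so half the sample size is 388.5 and an exact 1:1 split is not possible. A minimal sketch with the rounding made explicit, working on row indices only so that it does not need the data itself (<code class="language-plaintext highlighter-rouge">test_id</code> is a name introduced here for illustration):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(199609)
n &lt;- 777                                   # number of rows in College
train_id &lt;- sample(n, floor(n / 2))        # 388 training rows
test_id &lt;- setdiff(seq_len(n), train_id)   # the remaining 389 rows

# sizes of the two sets; they share no row
c(train = length(train_id), test = length(test_id))
</code></pre></div></div>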

<p>(b) Fit a linear model using least squares on the training set, and report the test error obtained.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># linear model</span><span class="w">
</span><span class="n">modl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">Apps</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">train_set</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">modl</span><span class="p">)</span><span class="w">
</span><span class="n">pred1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">modl</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">test_set</span><span class="p">[,</span><span class="m">-2</span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"response"</span><span class="p">)</span><span class="w">

</span><span class="c1"># test error</span><span class="w">
</span><span class="n">MAE</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">MSE</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">RMSE</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">R2</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">],</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"traditional"</span><span class="p">)</span><span class="w">

</span><span class="n">err1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">MAE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MAE</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">MSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MSE</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">RMSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">RMSE</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">R2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">R2</span><span class="p">(</span><span class="n">pred1</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">],</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"traditional"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p>(c) Fit a ridge regression model on the training set, with $\lambda$ chosen by 5-fold cross-validation. Report the test error obtained.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prepare data</span><span class="w">
</span><span class="n">train_set</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model.matrix</span><span class="p">(</span><span class="n">Apps</span><span class="o">~</span><span class="n">.</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">College</span><span class="p">[</span><span class="n">train_id</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
</span><span class="n">test_set</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model.matrix</span><span class="p">(</span><span class="n">Apps</span><span class="o">~</span><span class="n">.</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">College</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">

</span><span class="c1"># ridge regression with cross-validation</span><span class="w">
</span><span class="n">modr_cv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cv.glmnet</span><span class="p">(</span><span class="n">train_set</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">,</span><span class="w">
                     </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="n">train_id</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">nfolds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">modr_cv</span><span class="p">)</span><span class="w">
</span><span class="n">lambda</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">modr_cv</span><span class="o">$</span><span class="n">lambda.min</span><span class="w">
</span><span class="n">pred2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">modr_cv</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lambda</span><span class="p">,</span><span class="w"> </span><span class="n">newx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">test_set</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">modr_cv</span><span class="p">)</span><span class="w">
</span><span class="c1"># test error</span><span class="w">
</span><span class="n">MAE</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">MSE</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">RMSE</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">R2</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">],</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"traditional"</span><span class="p">)</span><span class="w">

</span><span class="n">err2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">MAE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MAE</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">MSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MSE</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">RMSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">RMSE</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">R2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">R2</span><span class="p">(</span><span class="n">pred2</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">],</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"traditional"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p>(d) Fit a LASSO model on the training set, with $\lambda$ chosen by 5-fold cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># LASSO regression with cross-validation</span><span class="w">
</span><span class="n">modr_cv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cv.glmnet</span><span class="p">(</span><span class="n">train_set</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">,</span><span class="w">
                     </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="n">train_id</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">nfolds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">modr_cv</span><span class="p">)</span><span class="w">
</span><span class="n">lambda</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">modr_cv</span><span class="o">$</span><span class="n">lambda.min</span><span class="w">
</span><span class="n">pred3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">modr_cv</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lambda</span><span class="p">,</span><span class="w"> </span><span class="n">newx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">test_set</span><span class="p">)</span><span class="w">

</span><span class="c1"># test error</span><span class="w">
</span><span class="n">MAE</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">MSE</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">RMSE</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">])</span><span class="w">
</span><span class="n">R2</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">],</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"traditional"</span><span class="p">)</span><span class="w">

</span><span class="n">err3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">MAE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MAE</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">MSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MSE</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">RMSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">RMSE</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">]),</span><span class="w">
                  </span><span class="n">R2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">R2</span><span class="p">(</span><span class="n">pred3</span><span class="p">,</span><span class="w"> </span><span class="n">College</span><span class="o">$</span><span class="n">Apps</span><span class="p">[</span><span class="o">-</span><span class="n">train_id</span><span class="p">],</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"traditional"</span><span class="p">))</span><span class="w">

</span><span class="c1"># the number of non-zero coefficient estimates (intercept term included)</span><span class="w">
</span><span class="n">coef</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">modr_cv</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">=</span><span class="n">lambda</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"coefficients"</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">coef</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">length</span><span class="w">
</span></code></pre></div></div>
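<p>The count above includes the intercept. If only the predictors are of interest, the intercept row can be dropped before counting. A minimal sketch on simulated data (the data, the seed, and the names <code class="language-plaintext highlighter-rouge">cv_fit</code>, <code class="language-plaintext highlighter-rouge">beta</code> and <code class="language-plaintext highlighter-rouge">n_nonzero</code> are assumptions for illustration, not part of the analysis above):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(glmnet)

set.seed(1)                                    # assumed seed
x &lt;- matrix(rnorm(100 * 10), nrow = 100)       # 10 simulated predictors
y &lt;- 3 * x[, 1] - 2 * x[, 2] + rnorm(100)      # only two predictors matter

# LASSO with lambda chosen by 5-fold cross-validation
cv_fit &lt;- cv.glmnet(x, y, alpha = 1, nfolds = 5)

# coefficients at the CV-selected lambda; row 1 is the intercept
beta &lt;- as.matrix(coef(cv_fit, s = "lambda.min"))
n_nonzero &lt;- sum(beta[-1, 1] != 0)
n_nonzero
</code></pre></div></div>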

<p>(e) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these three approaches?</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data.frame</span><span class="p">(</span><span class="n">Model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Linear"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Ridge"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Lasso"</span><span class="p">),</span><span class="w">
           </span><span class="n">rbind</span><span class="p">(</span><span class="n">err1</span><span class="p">,</span><span class="w"> </span><span class="n">err2</span><span class="p">,</span><span class="w"> </span><span class="n">err3</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="m">2</span><span class="p">))</span><span class="w"> 
</span></code></pre></div></div>
<p>We used <code class="language-plaintext highlighter-rouge">MAE</code>, <code class="language-plaintext highlighter-rouge">MSE</code>, <code class="language-plaintext highlighter-rouge">RMSE</code> and <code class="language-plaintext highlighter-rouge">R-squared</code> to assess the test error of each model.</p>

<ul>
  <li>MAE (mean absolute error) is the average of the absolute differences between the observed and predicted values over the data set.</li>
  <li>MSE (mean squared error) is the average of the squared differences between the observed and predicted values over the data set.</li>
  <li>RMSE (root mean squared error) is the square root of MSE, so it is expressed in the same units as the response.</li>
  <li>R-squared (coefficient of determination) measures the proportion of the variance in the observed values that the predictions explain. It is at most 1 and usually falls between 0 and 1, although it can be negative on test data when the model predicts worse than the mean of the observed values. The higher the value, the better the model. <a href="https://www.datatechnotes.com/2019/02/regression-model-accuracy-mae-mse-rmse.html">source</a></li>
</ul>
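<p>The four metrics can be written in a few lines of base R. This is a sketch rather than the package implementation: the function names are hypothetical, the toy vectors are made up, and the R-squared uses the form 1 - SS_res / SS_tot, which is what the <code class="language-plaintext highlighter-rouge">form = "traditional"</code> argument above appears to request.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical helper functions; pred = predictions, obs = observed values
mae  &lt;- function(pred, obs) mean(abs(obs - pred))
mse  &lt;- function(pred, obs) mean((obs - pred)^2)
rmse &lt;- function(pred, obs) sqrt(mse(pred, obs))
r2   &lt;- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# toy example with made-up numbers
obs  &lt;- c(3, 5, 7, 9)
pred &lt;- c(2, 5, 8, 10)

mae(pred, obs)    # 0.75
mse(pred, obs)    # 0.75
rmse(pred, obs)   # about 0.866
r2(pred, obs)     # 0.85
</code></pre></div></div>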

<p>Generally speaking, all three models showed a high R-squared ($\geq0.9$), indicating accurate predictions. Across the four metrics, Ridge regression gave the lowest <code class="language-plaintext highlighter-rouge">MSE</code> and <code class="language-plaintext highlighter-rouge">RMSE</code> and the highest <code class="language-plaintext highlighter-rouge">R-squared</code>, which suggests that it performed best. Nevertheless, the differences in test error among the three models are very small.</p>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Machine Learning" /><category term="practice" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Data science practice 3</title><link href="https://zianzhuang.com/posts/2021/03/blog-post-3/" rel="alternate" type="text/html" title="Data science practice 3" /><published>2021-03-11T00:00:00+00:00</published><updated>2021-03-11T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/03/blog-post-3</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/03/blog-post-3/"><![CDATA[<p>Database practice &amp; shiny app</p>

<!--more-->

<h2 id="q1-compile-the-icu-cohort-in-practice-2-q8-from-the-postgresql-database-mimiciv">Q1. Compile the ICU cohort in Practice 2 Q8 from the PostgreSQL database <code class="language-plaintext highlighter-rouge">mimiciv</code>.</h2>

<p>Below is an outline of steps.</p>

<h3 id="q11">Q1.1</h3>
<p>Connect to database <code class="language-plaintext highlighter-rouge">mimiciv</code>. We are going to use username <code class="language-plaintext highlighter-rouge">postgres</code> with password <code class="language-plaintext highlighter-rouge">postgres</code> to access the <code class="language-plaintext highlighter-rouge">mimiciv</code> database.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load configuration settings first</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Connect to the database using the configuration settings</span><span class="w">
</span><span class="p">(</span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dbConnect</span><span class="p">(</span><span class="n">RPostgreSQL</span><span class="o">::</span><span class="n">PostgreSQL</span><span class="p">(),</span><span class="w"> 
                  </span><span class="n">dbname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dbname</span><span class="p">,</span><span class="w"> 
                  </span><span class="n">user</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">user</span><span class="p">,</span><span class="w"> 
                  </span><span class="n">password</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">password</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
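<p>The chunk above assumes that <code class="language-plaintext highlighter-rouge">dbname</code>, <code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">password</code> are already defined. One possible way to supply them without hard-coding credentials is to read environment variables. In the sketch below the variable names (<code class="language-plaintext highlighter-rouge">MIMIC_DBNAME</code> and so on) and the helper <code class="language-plaintext highlighter-rouge">connect_mimic()</code> are hypothetical, and the fallback values mirror the ones stated in the text.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># read connection settings from (hypothetical) environment variables,
# falling back to the values given in the text
dbname   &lt;- Sys.getenv("MIMIC_DBNAME",   unset = "mimiciv")
user     &lt;- Sys.getenv("MIMIC_USER",     unset = "postgres")
password &lt;- Sys.getenv("MIMIC_PASSWORD", unset = "postgres")

# hypothetical helper that opens the connection;
# calling it requires a running PostgreSQL server
connect_mimic &lt;- function() {
  DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                 dbname = dbname,
                 user = user,
                 password = password)
}
</code></pre></div></div>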

<h3 id="q12">Q1.2</h3>
<p>List all schemas in the <code class="language-plaintext highlighter-rouge">mimiciv</code> database.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbGetQuery</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="s2">"SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>List tables in the <code class="language-plaintext highlighter-rouge">mimiciv</code> database:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbListTables</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>List schemas and tables in the <code class="language-plaintext highlighter-rouge">mimiciv</code> database (bash command).</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>psql <span class="nt">-U</span> postgres <span class="nt">-d</span> mimiciv <span class="nt">-c</span> <span class="s2">"</span><span class="se">\d</span><span class="s2">t *."</span>
</code></pre></div></div>

<h3 id="q13">Q1.3</h3>
<p>Connect to the icustays table. Note how to use <code class="language-plaintext highlighter-rouge">Id()</code> to specify the schema containing the table.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays_tble</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="n">Id</span><span class="p">(</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mimic_icu"</span><span class="p">,</span><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"icustays"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="q14">Q1.4</h3>
<p>Connect to the patients table.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">patients_tble</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="n">Id</span><span class="p">(</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mimic_core"</span><span class="p">,</span><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"patients"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="q15">Q1.5</h3>
<p>Connect to the admissions table.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">admissions_tble</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="n">Id</span><span class="p">(</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mimic_core"</span><span class="p">,</span><span class="w"> </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"admissions"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="q16">Q1.6</h3>
<p>Connect to the <code class="language-plaintext highlighter-rouge">mimic_labevents_icu</code> table.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">labevents_tble</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="n">Id</span><span class="p">(</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"public"</span><span class="p">,</span><span class="w"> 
                              </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mimic_labevents_icu"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="q17">Q1.7</h3>
<p>Connect to the <code class="language-plaintext highlighter-rouge">mimic_chartevents_icu</code> table.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chartevents_tble</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbl</span><span class="p">(</span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="n">Id</span><span class="p">(</span><span class="n">schema</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"public"</span><span class="p">,</span><span class="w"> 
                              </span><span class="n">table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mimic_chartevents_icu"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="q18">Q1.8</h3>
<p>Put things together. Use one chain of pipes <code class="language-plaintext highlighter-rouge">%&gt;%</code> to perform the following data wrangling steps: (i) keep only the first ICU stay of each unique patient, (ii) merge in the admissions and patients tables, (iii) keep adults only (age at admission &gt;= 18), (iv) merge in the labevents and chartevents tables, (v) display the SQL query, (vi) collect the SQL query result into memory as a tibble, (vii) create an indicator for 30-day mortality, (viii) save the final tibble to an <code class="language-plaintext highlighter-rouge">icu_cohort.rds</code> R data file in the <code class="language-plaintext highlighter-rouge">mimiciv_shiny</code> folder.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># make a directory mimiciv_shiny</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">dir.exists</span><span class="p">(</span><span class="s2">"mimiciv_shiny"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">dir.create</span><span class="p">(</span><span class="s2">"mimiciv_shiny"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> 
</span></code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">which</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">labevents_tble</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
                   </span><span class="n">select</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">hadm_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
                   </span><span class="n">collect</span><span class="p">())</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">length</span><span class="w">
</span><span class="n">which</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">chartevents_tble</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
                   </span><span class="n">select</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">hadm_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
                   </span><span class="n">collect</span><span class="p">())</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">length</span><span class="w">
</span></code></pre></div></div>
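<p>A quick check shows that some patients have more than one record for the same admission (duplicated <code class="language-plaintext highlighter-rouge">subject_id</code>/<code class="language-plaintext highlighter-rouge">hadm_id</code> pairs). Thus, we keep only one record per patient after the merge.</p>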
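
<p>The same duplicate check can be tried on a small made-up table. This is only a sketch: <code class="language-plaintext highlighter-rouge">toy_events</code> and its values are invented for illustration and are not MIMIC-IV records.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up events table; rows 1 and 2 share the same (subject_id, hadm_id) pair
toy_events &lt;- data.frame(
  subject_id = c(1, 1, 2, 3, 3),
  hadm_id    = c(10, 10, 20, 30, 31),
  value      = c(7.1, 7.3, 5.0, 6.2, 6.4)
)

# Logical flag: TRUE for every repeat of an already-seen key pair
is_dup &lt;- duplicated(toy_events[, c("subject_id", "hadm_id")])

# Number of duplicated key pairs
n_dup &lt;- sum(is_dup)

# Keep only the first record of each key pair
first_only &lt;- toy_events[!is_dup, ]
</code></pre></div></div>

<p>Here <code class="language-plaintext highlighter-rouge">sum()</code> on the logical vector gives the same count as <code class="language-plaintext highlighter-rouge">which(...) %&gt;% length</code>.</p>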

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays_tble</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># keep only the first ICU stay of each unique patient</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">subject_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">rank</span><span class="p">(</span><span class="n">intime</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="c1"># merge in admissions and patients tables</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">patients_tble</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">admissions_tble</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># keep adults only (age at admission &gt;= 18)</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">age_at_adm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">admittime</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">anchor_year</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">anchor_age</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">age_at_adm</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="m">18</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># merge in the labevents and chartevents tables</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">labevents_tble</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">chartevents_tble</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># display the SQL query</span><span class="w">
  </span><span class="n">show_query</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># collect SQL query result into memory as a tibble</span><span class="w">
  </span><span class="n">collect</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="c1"># delete duplicate row</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">subject_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice_head</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="c1"># create an indicator for 30-day mortality</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">death_30</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">((</span><span class="n">deathtime</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">admittime</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="m">60</span><span class="o">*</span><span class="m">24</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
                           </span><span class="s2">"Yes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="n">death_30</span><span class="p">),</span><span class="w"> 
            </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)})</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># save the final tibble</span><span class="w">
  </span><span class="n">saveRDS</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./mimiciv_shiny/icu_cohort.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
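
<p>The 30-day mortality indicator can be sanity-checked on made-up timestamps. The sketch below assumes <code class="language-plaintext highlighter-rouge">POSIXct</code> columns named <code class="language-plaintext highlighter-rouge">admittime</code> and <code class="language-plaintext highlighter-rouge">deathtime</code> as in the pipeline above; the data frame <code class="language-plaintext highlighter-rouge">toy</code> is invented for illustration. Subtracting two <code class="language-plaintext highlighter-rouge">POSIXct</code> vectors lets R choose the time unit automatically, so fixing <code class="language-plaintext highlighter-rouge">units = "days"</code> in <code class="language-plaintext highlighter-rouge">difftime()</code> is the safer choice.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up admissions: death after 10 days, death after 59 days, no death
toy &lt;- data.frame(
  admittime = as.POSIXct(c("2021-01-01 08:00:00",
                           "2021-01-01 08:00:00",
                           "2021-01-01 08:00:00"), tz = "UTC"),
  deathtime = as.POSIXct(c("2021-01-11 08:00:00",
                           "2021-03-01 08:00:00",
                           NA), tz = "UTC")
)

# Time from admission to death, always expressed in days
days_to_death &lt;- as.numeric(difftime(toy$deathtime, toy$admittime,
                                     units = "days"))

# Patients without a recorded death time count as "No"
toy$death_30 &lt;- ifelse(!is.na(days_to_death) &amp; days_to_death &lt;= 30,
                       "Yes", "No")
toy$death_30
# "Yes" "No" "No"
</code></pre></div></div>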

<p>Close database connection and clear workspace.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dbDisconnect</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ls</span><span class="p">())</span><span class="w">
</span></code></pre></div></div>

<h2 id="q2-shiny-app">Q2. Shiny app</h2>

<p>Develop a Shiny app for exploring the ICU cohort data created in Q1. The app should reside in the <code class="language-plaintext highlighter-rouge">mimiciv_shiny</code> folder. The app should provide easy access to the graphical and numerical summaries of variables (demographics, lab measurements, vitals) in the ICU cohort.</p>

<blockquote>
  <p><strong>solution</strong>: 
Please refer to: https://zianzhuang.shinyapps.io/mimiciv_shiny/</p>
</blockquote>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Data science" /><category term="practice" /><summary type="html"><![CDATA[Database practice &amp; shiny app]]></summary></entry><entry><title type="html">Data science practice 4</title><link href="https://zianzhuang.com/posts/2021/03/blog-post-4/" rel="alternate" type="text/html" title="Data science practice 4" /><published>2021-03-11T00:00:00+00:00</published><updated>2021-03-11T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/03/blog-post-4</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/03/blog-post-4/"><![CDATA[<p>Multiple imputation &amp; modeling 
<!--more--></p>

<h2 id="q1-missing-data">Q1. Missing data</h2>

<p>Through the Shiny app developed in HW3, we observe abundant missing values in the MIMIC-IV ICU cohort we created. In this question, we use multiple imputation to obtain a data set without missing values.</p>

<h3 id="q10">Q1.0</h3>
<p>Read the following tutorials on the R package miceRanger for imputation: <a href="https://github.com/farrellday/miceRanger">https://github.com/farrellday/miceRanger</a>, <a href="https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html">https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A more thorough book treatment of the practical imputation strategies is the book [*_Flexible Imputation of Missing Data_*](https://stefvanbuuren.name/fimd/) by Stef van Buuren. 
</code></pre></div></div>

<h3 id="q11">Q1.1</h3>
<p>Explain the jargon MCAR, MAR, and MNAR.</p>

<blockquote>
  <p><strong>solution</strong>:</p>
</blockquote>

<blockquote>
  <ul>
    <li>MCAR: missing completely at random. The probability that a value is missing is independent of both the observed (complete) and the unobserved (incomplete) variables.</li>
    <li>MAR: missing at random. The probability that a value is missing depends only on the observed (complete) variables, not on the missing values themselves.</li>
    <li>MNAR: missing not at random. The probability that a value is missing depends on the unobserved values themselves, even after conditioning on the observed variables, so the missingness mechanism is non-ignorable.</li>
  </ul>
</blockquote>
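
<p>The three mechanisms can be illustrated with a small simulation. This is a sketch on synthetic data; the variable names, the 30% missingness rate, and the logistic missingness models are our own assumptions.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2021)
n &lt;- 1000
x &lt;- rnorm(n)        # fully observed covariate
y &lt;- x + rnorm(n)    # variable that will receive missing values

# MCAR: every value of y has the same 30% chance of being missing
y_mcar &lt;- ifelse(runif(n) &lt; 0.3, NA, y)

# MAR: the chance of missingness depends only on the observed x
y_mar &lt;- ifelse(runif(n) &lt; plogis(2 * x), NA, y)

# MNAR: the chance of missingness depends on the unobserved y itself
y_mnar &lt;- ifelse(runif(n) &lt; plogis(2 * y), NA, y)

# Compare the complete-case means with the mean of the full data
c(full = mean(y),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar, na.rm = TRUE),
  mnar = mean(y_mnar, na.rm = TRUE))
</code></pre></div></div>

<p>Under MCAR the complete-case mean stays close to the full-data mean. Under MAR it is shifted, but the shift can be corrected by conditioning on <code class="language-plaintext highlighter-rouge">x</code>. Under MNAR no observed variable fully explains the shift.</p>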

<h3 id="q12">Q1.2</h3>
<p>Explain in a couple of sentences how Multiple Imputation by Chained Equations (MICE) works.</p>

<blockquote>
  <p><strong>solution</strong>: 
MICE operates under the assumption that, given the variables used in the imputation procedure, the missing data are missing at random (MAR). In the MICE procedure, a series of regression models is run in which each variable with missing data is modeled conditional on the other variables in the data, and the cycle over the variables is repeated several times. This means that each variable can be modeled according to its own distribution. <a href="https://onlinelibrary.wiley.com/doi/pdf/10.1002/mpr.329">reference link</a></p>
</blockquote>
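
<p>The chained-equations idea can be sketched in a few lines of base R. This is a deliberately simplified illustration on synthetic data, using linear regression with added residual noise for a single imputed data set. It is not the algorithm used by <code class="language-plaintext highlighter-rouge">miceRanger</code>, which relies on random forests and mean matching, and all names below are our own.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
n  &lt;- 200
x1 &lt;- rnorm(n)
x2 &lt;- 0.8 * x1 + rnorm(n, sd = 0.5)
dat &lt;- data.frame(x1 = x1, x2 = x2)
dat$x1[sample(n, 20)] &lt;- NA
dat$x2[sample(n, 20)] &lt;- NA

# Remember which cells were missing in the original data
miss &lt;- lapply(dat, is.na)

# Step 0: fill every missing cell with the column mean
imp &lt;- dat
for (v in names(imp)) {
  imp[[v]][miss[[v]]] &lt;- mean(dat[[v]], na.rm = TRUE)
}

# Steps 1..5: cycle through the variables and re-impute each
# one from a regression on the other (currently filled) variables
for (iter in 1:5) {
  for (v in names(imp)) {
    obs  &lt;- !miss[[v]]
    fit  &lt;- lm(reformulate(setdiff(names(imp), v), response = v),
               data = imp[obs, ])
    pred &lt;- predict(fit, newdata = imp[!obs, ])
    # Add residual noise so that imputations reflect uncertainty
    imp[[v]][!obs] &lt;- pred + rnorm(sum(!obs), sd = sigma(fit))
  }
}
</code></pre></div></div>

<p>Repeating the whole procedure with different random seeds would give several imputed data sets, which is the "multiple" part of multiple imputation.</p>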

<h3 id="q13">Q1.3</h3>
<p>Perform a data quality check of the ICU stays data. Discard variables with substantial missingness, say &gt;5000 <code class="language-plaintext highlighter-rouge">NA</code>s. Replace apparent data entry errors by <code class="language-plaintext highlighter-rouge">NA</code>s.</p>

<blockquote>
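  <p>Please note that here we dropped variables with more than 7000 <code class="language-plaintext highlighter-rouge">NA</code>s, a looser cutoff than the 5000 suggested above. In addition, we set the outliers of each numeric variable (values at or beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR) to <code class="language-plaintext highlighter-rouge">NA</code>.</p>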
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#define functions</span><span class="w">
</span><span class="n">.quantile_cut</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
  </span><span class="n">lb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">]</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">ub</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">]</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="m">0.75</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">iqr</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ub</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">lb</span><span class="w">
  </span><span class="n">df</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">lb</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1.5</span><span class="o">*</span><span class="n">iqr</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">ub</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1.5</span><span class="o">*</span><span class="n">iqr</span><span class="p">),</span><span class="w"> </span><span class="n">x</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">df</span><span class="p">[,</span><span class="n">x</span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">.na_count</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">icu_cohort</span><span class="p">){</span><span class="w">
  </span><span class="n">na_num</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">])</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">length</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">


</span><span class="n">icu_cohort</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./icu_cohort.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">var_list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"first_careunit"</span><span class="p">,</span><span class="w">  
</span><span class="s2">"los"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gender"</span><span class="p">,</span><span class="w"> </span><span class="s2">"admission_type"</span><span class="p">,</span><span class="w"> </span><span class="s2">"admission_location"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"insurance"</span><span class="p">,</span><span class="w"> </span><span class="s2">"language"</span><span class="p">,</span><span class="w"> </span><span class="s2">"marital_status"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"ethnicity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"age_at_adm"</span><span class="p">,</span><span class="w"> </span><span class="s2">"bicarbonate"</span><span class="p">,</span><span class="w"> </span><span class="s2">"calcium"</span><span class="p">,</span><span class="w"> </span><span class="s2">"chloride"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"creatinine"</span><span class="p">,</span><span class="w"> </span><span class="s2">"glucose"</span><span class="p">,</span><span class="w"> </span><span class="s2">"magnesium"</span><span class="p">,</span><span class="w"> </span><span class="s2">"potassium"</span><span class="p">,</span><span class="w"> </span><span class="s2">"sodium"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"hematocrit"</span><span class="p">,</span><span class="w"> </span><span class="s2">"wbc"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lactate"</span><span class="p">,</span><span class="w"> </span><span class="s2">"heart_rate"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"non_invasive_blood_pressure_systolic"</span><span class="p">,</span><span class="w"> </span><span class="s2">"non_invasive_blood_pressure_mean"</span><span class="p">,</span><span class="w">
</span><span class="s2">"respiratory_rate"</span><span class="p">,</span><span class="w"> </span><span class="s2">"temperature_fahrenheit"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"arterial_blood_pressure_systolic"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"arterial_blood_pressure_mean"</span><span class="p">)</span><span class="w">


</span><span class="n">var_list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">var_list</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">apply</span><span class="p">(</span><span class="n">var_list</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
                                   </span><span class="n">as.matrix</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">.na_count</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">7000</span><span class="p">)]</span><span class="w">

</span><span class="n">icu_cohort</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">all_of</span><span class="p">(</span><span class="n">var_list</span><span class="p">))</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">df</span><span class="w">
</span><span class="n">name_list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select_if</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">))[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.list</span><span class="p">()</span><span class="w">
</span><span class="n">numeric_var</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">name_list</span><span class="p">,</span><span class="w"> </span><span class="n">.quantile_cut</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">Reduce</span><span class="p">(</span><span class="s2">"cbind"</span><span class="p">,</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">!</span><span class="n">is.numeric</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">cbind</span><span class="p">(</span><span class="n">numeric_var</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>

<h3 id="q14">Q1.4</h3>
<p>Impute missing values by <code class="language-plaintext highlighter-rouge">miceRanger</code> (request $m=3$ datasets). This step is very computationally intensive. Make sure to save the imputation results as a file.</p>

<blockquote>
  <p>Note that we did not include <code class="language-plaintext highlighter-rouge">age_at_adm</code> in the multiple imputation, because 1) it has no missing values in the original data set; 2) it does not converge in <code class="language-plaintext highlighter-rouge">miceRanger</code>; and 3) it may also affect the convergence of other variables. We also removed <code class="language-plaintext highlighter-rouge">death_30</code>, <code class="language-plaintext highlighter-rouge">discharge location</code>, and <code class="language-plaintext highlighter-rouge">hospital expire flag</code>, since they carry information about the response (survival/death), part of which is supposed to be unknown at prediction time.</p>
</blockquote>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Set up back ends.</span><span class="w">
</span><span class="n">cl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">makeCluster</span><span class="p">(</span><span class="n">detectCores</span><span class="p">())</span><span class="w">
</span><span class="n">registerDoParallel</span><span class="p">(</span><span class="n">cl</span><span class="p">)</span><span class="w">

</span><span class="n">miss_list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">df</span><span class="p">)[</span><span class="n">which</span><span class="p">(</span><span class="n">apply</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">,</span><span class="w">
                                   </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">.na_count</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">)]</span><span class="w">

</span><span class="c1"># Perform mice </span><span class="w">
</span><span class="n">parTime</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">system.time</span><span class="p">(</span><span class="w">
  </span><span class="n">miceObjPar</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">miceRanger</span><span class="p">(</span><span class="w">
      </span><span class="n">df</span><span class="w">
    </span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w">
    </span><span class="p">,</span><span class="w"> </span><span class="n">maxiter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="w">
    </span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">miss_list</span><span class="w">
    </span><span class="p">,</span><span class="w"> </span><span class="n">max.depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="w">
    </span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
    </span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">stopCluster</span><span class="p">(</span><span class="n">cl</span><span class="p">)</span><span class="w">
</span><span class="n">registerDoSEQ</span><span class="p">()</span><span class="w">

</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">miceObjPar</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./miceObjPar.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
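
<p>Because the imputation is expensive, one option is to run it only when the saved file is missing. The sketch below illustrates this caching pattern in base R. <code class="language-plaintext highlighter-rouge">cached_rds()</code> and <code class="language-plaintext highlighter-rouge">slow_step()</code> are hypothetical helpers that are not part of the original analysis; <code class="language-plaintext highlighter-rouge">slow_step()</code> merely stands in for the <code class="language-plaintext highlighter-rouge">miceRanger()</code> call.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: compute a result once and reuse the saved copy afterwards.
cached_rds &lt;- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))   # reuse the result from an earlier run
  }
  result &lt;- compute()       # expensive step runs only on a cache miss
  saveRDS(result, file = path)
  result
}

# Stand-in for the expensive imputation call (assumption for illustration).
slow_step &lt;- function() {
  Sys.sleep(0.1)
  data.frame(x = 1:3)
}

cache_file &lt;- file.path(tempdir(), "cached_demo.rds")
first  &lt;- cached_rds(cache_file, slow_step)   # computes and saves
second &lt;- cached_rds(cache_file, slow_step)   # loads from disk
identical(first, second)
</code></pre></div></div>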

<h3 id="q15">Q1.5</h3>
<p>Make imputation diagnostic plots and explain what they mean.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">miceObjPar</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./miceObjPar.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">plotDistributions</span><span class="p">(</span><span class="n">miceObjPar</span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'allNumeric'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-4-1.png" alt="" /></p>

<p>These are the Distribution of Imputed Values plots. The red line is the density of the original, nonmissing data; the smaller black lines are the densities of the imputed values in each of the datasets. For each variable, the imputed distributions are largely consistent with the original distribution. This is what we would expect if the data were Missing Completely at Random (MCAR), although matching distributions alone do not prove it.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plotCorrelations</span><span class="p">(</span><span class="n">miceObjPar</span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'allNumeric'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-5-1.png" alt="" /></p>

<p>The Convergence of Correlation plots show boxplots of the correlations between imputed values in every combination of datasets, at each iteration. The imputations for all variables appear to have converged after 5 iterations.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plotVarConvergence</span><span class="p">(</span><span class="n">miceObjPar</span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="o">=</span><span class="s1">'allNumeric'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-6-1.png" alt="" /></p>

<p>The Center and Dispersion Convergence plots show whether the center and spread of the imputed values stabilize across iterations; when the missing data locations are correlated with higher or lower values, several iterations may be needed. From the plots, most of the imputed data largely converged to the true theoretical mean, while <code class="language-plaintext highlighter-rouge">non_invasive_blood_pressure</code> seems to have a slight convergence issue. We ignore it at this stage.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plotModelError</span><span class="p">(</span><span class="n">miceObjPar</span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'allNumeric'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-7-1.png" alt="" /></p>

<p>The model error plots show the out-of-bag (OOB) accuracy of the random forest models at each iteration. The accuracy stabilizes as the iterations progress, and the variables appear to be imputed with a reasonable degree of accuracy after 5 iterations.</p>

<h3 id="q16">Q1.6</h3>
<p>Obtain a complete data set by averaging the 3 imputed data sets.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">.dummy_trans</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
  </span><span class="n">out</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dummy_cols</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w">
           </span><span class="n">remove_first_dummy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
           </span><span class="n">remove_selected_columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">out</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">dataList</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">completeData</span><span class="p">(</span><span class="n">miceObjPar</span><span class="p">)</span><span class="w">
</span><span class="n">Datasets_imputed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">dataList</span><span class="p">,</span><span class="w"> </span><span class="n">.dummy_trans</span><span class="p">)</span><span class="w">

</span><span class="n">Final_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">Datasets_imputed</span><span class="p">[[</span><span class="s2">"Dataset_1"</span><span class="p">]]</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">Datasets_imputed</span><span class="p">[[</span><span class="s2">"Dataset_2"</span><span class="p">]]</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">Datasets_imputed</span><span class="p">[[</span><span class="s2">"Dataset_3"</span><span class="p">]])</span><span class="o">/</span><span class="m">3</span><span class="w"> 

</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Final_data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="s1">'first_careunit_Coronary Care Unit (CCU)'</span><span class="w"> </span><span class="o">:</span><span class="w"> 
                   </span><span class="s1">'ethnicity_WHITE'</span><span class="p">),</span><span class="w"> </span><span class="n">round</span><span class="p">)</span><span class="w"> 
</span></code></pre></div></div>
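
<p>The averaging step generalizes to any number of imputed data sets. Below is a minimal sketch on toy data, assuming every data set is all-numeric with identical dimensions and column order (as is the case after dummy coding). The toy data frames and the name <code class="language-plaintext highlighter-rouge">average_imputations()</code> are illustrative assumptions.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: element-wise mean across a list of numeric data frames.
average_imputations &lt;- function(datasets) {
  Reduce(`+`, datasets) / length(datasets)
}

# Toy stand-ins for the imputed data sets (not the ICU data).
toy &lt;- list(
  Dataset_1 = data.frame(lab = c(1, 2), flag = c(0, 1)),
  Dataset_2 = data.frame(lab = c(2, 2), flag = c(1, 1)),
  Dataset_3 = data.frame(lab = c(3, 2), flag = c(1, 1))
)

avg &lt;- average_imputations(toy)
# Averaged dummy columns can be fractional, so round them back to 0/1.
avg$flag &lt;- round(avg$flag)
avg
</code></pre></div></div>

<p>With three data sets, the average of a 0/1 column is never exactly 0.5, so rounding acts as a majority vote across the imputations.</p>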

<h2 id="q2-predicting-30-day-mortality">Q2. Predicting 30-day mortality</h2>

<p>Develop at least two analytic approaches for predicting the 30-day mortality of patients admitted to ICU using demographic information (gender, age, marital status, ethnicity), first lab measurements during ICU stay, and first vital measurements during ICU stay. For example, you can use (1) logistic regression (<code class="language-plaintext highlighter-rouge">glm()</code> function), (2) logistic regression with lasso penalty (glmnet package), (3) random forest (randomForest package), or (4) neural network.</p>

<h3 id="q21-data-preparation">Q2.1 Data preparation</h3>
<p>Partition data into 80% training set and 20% test set. Stratify partitioning according to the 30-day mortality status.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">contains</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gender"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ethnicity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"marital"</span><span class="p">))</span><span class="w"> </span><span class="o">|</span><span class="w">
           </span><span class="n">bicarbonate</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">temperature_fahrenheit</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">icu_cohort</span><span class="o">$</span><span class="n">age_at_adm</span><span class="p">)</span><span class="w">

</span><span class="nf">names</span><span class="p">(</span><span class="n">df</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ethnicity_BLACK"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ethnicity_HISPANIC"</span><span class="p">,</span><span class="w">
                          </span><span class="s2">"ethnicity_UNABLE"</span><span class="p">)</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">19969</span><span class="p">)</span><span class="w">
</span><span class="n">folds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">createFolds</span><span class="p">(</span><span class="n">factor</span><span class="p">(</span><span class="n">icu_cohort</span><span class="o">$</span><span class="n">death_30</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1">#test data set</span><span class="w">
</span><span class="n">data.test.x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df</span><span class="p">[</span><span class="n">folds</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="n">bicarbonate</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
</span><span class="n">data.test.y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dummy_cols</span><span class="p">(</span><span class="n">icu_cohort</span><span class="o">$</span><span class="n">death_30</span><span class="p">[</span><span class="n">folds</span><span class="p">[[</span><span class="m">1</span><span class="p">]]],</span><span class="w">
           </span><span class="n">remove_first_dummy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
           </span><span class="n">remove_selected_columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
</span><span class="c1">#train data set</span><span class="w">
</span><span class="n">data.train.x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df</span><span class="p">[</span><span class="n">folds</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">Reduce</span><span class="p">(</span><span class="s2">"c"</span><span class="p">,</span><span class="w"> </span><span class="n">.</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="n">bicarbonate</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
</span><span class="n">data.train.y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dummy_cols</span><span class="p">(</span><span class="n">icu_cohort</span><span class="o">$</span><span class="n">death_30</span><span class="p">[</span><span class="n">folds</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">Reduce</span><span class="p">(</span><span class="s2">"c"</span><span class="p">,</span><span class="w"> </span><span class="n">.</span><span class="p">)],</span><span class="w">
           </span><span class="n">remove_first_dummy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
           </span><span class="n">remove_selected_columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
  
</span></code></pre></div></div>
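
<p>For reference, a stratified split can also be written in base R, and the test set can be standardized with the training-set means and standard deviations so that no test-set information enters the preprocessing. This is a sketch of an alternative to scaling each partition separately, shown on simulated data; <code class="language-plaintext highlighter-rouge">stratified_test_index()</code> and the toy variables are assumptions.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)

# Simulated outcome (about 10% events) and one numeric predictor.
y &lt;- rbinom(500, 1, 0.1)
x &lt;- data.frame(lab = rnorm(500, mean = 20, sd = 5))

# Hypothetical helper: sample a fixed fraction of row indices within each class.
stratified_test_index &lt;- function(y, test_frac = 0.2) {
  idx_by_class &lt;- split(seq_along(y), y)
  picked &lt;- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(test_frac * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

test_idx &lt;- stratified_test_index(y)
train_x &lt;- x[-test_idx, , drop = FALSE]
test_x  &lt;- x[test_idx, , drop = FALSE]

# Standardize both sets with statistics estimated on the training set only.
mu    &lt;- colMeans(train_x)
sigma &lt;- apply(train_x, 2, sd)
train_scaled &lt;- scale(train_x, center = mu, scale = sigma)
test_scaled  &lt;- scale(test_x,  center = mu, scale = sigma)

# Event rates should be close in the two partitions.
c(train = mean(y[-test_idx]), test = mean(y[test_idx]))
</code></pre></div></div>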

<h3 id="q22-modeling">Q2.2 Modeling</h3>
<p>Train the models using the training set.</p>

<blockquote>
  <p>We noticed that the data set is heavily imbalanced, which may lead to low prediction accuracy for death_30 cases. Since a major goal of the prediction model may be to send early warnings for patients at higher risk of death, we tried to increase the prediction accuracy for death_30 cases by applying weighting and resampling in two models.</p>
</blockquote>
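
<p>One common way to set such weights is to make them inversely proportional to class frequency, so that both classes contribute equally to the loss. The sketch below uses a simulated outcome; the weighting formula is a widely used heuristic and an assumption here, not necessarily the weights used in the models that follow.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simulated imbalanced outcome: 900 survivors (0) and 100 deaths (1).
y &lt;- c(rep(0, 900), rep(1, 100))

# Inverse-frequency weights: n / (number of classes * class count).
counts  &lt;- table(y)
weights &lt;- length(y) / (length(counts) * counts)
class_weight &lt;- as.list(setNames(as.numeric(weights), names(counts)))
class_weight
</code></pre></div></div>

<p>With these weights, each class carries the same total weight (500 here). A named list of this form should be usable as the <code class="language-plaintext highlighter-rouge">class_weight</code> argument of keras <code class="language-plaintext highlighter-rouge">fit()</code>, though that is worth checking against the installed version.</p>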

<ul>
  <li><strong>Method 1</strong></li>
</ul>

<p>neural network (MLP) [original weight]</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">keras_model_sequential</span><span class="p">()</span><span class="w"> 
</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'relu'</span><span class="p">,</span><span class="w">
              </span><span class="n">kernel_initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"uniform"</span><span class="p">,</span><span class="w"> </span><span class="n">input_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">27</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">layer_dropout</span><span class="p">(</span><span class="n">rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'relu'</span><span class="p">,</span><span class="w">
              </span><span class="n">kernel_initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"uniform"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">layer_dropout</span><span class="p">(</span><span class="n">rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="s2">"sigmoid"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">compile</span><span class="p">(</span><span class="w">
  </span><span class="n">loss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'binary_crossentropy'</span><span class="p">,</span><span class="w">
  </span><span class="n">optimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'adam'</span><span class="p">,</span><span class="w">
  </span><span class="n">metrics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'accuracy'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">19969</span><span class="p">)</span><span class="w">
</span><span class="n">system.time</span><span class="p">({</span><span class="w">
</span><span class="n">history1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">fit</span><span class="p">(</span><span class="w">
  </span><span class="n">data.train.x</span><span class="p">,</span><span class="w"> </span><span class="n">data.train.y</span><span class="p">,</span><span class="w"> 
  </span><span class="n">epochs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">batch_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">128</span><span class="p">,</span><span class="w"> 
  </span><span class="n">validation_split</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">evaluate</span><span class="p">(</span><span class="n">data.test.x</span><span class="p">,</span><span class="w"> </span><span class="n">data.test.y</span><span class="p">)</span><span class="w">
</span><span class="n">results1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">data.test.x</span><span class="p">)</span><span class="w"> 
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">results1</span><span class="p">,</span><span class="w"> </span><span class="s2">"./results1.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">history1</span><span class="p">,</span><span class="w"> </span><span class="s2">"./history1.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Check the confusion matrix and the training history plot for the predictions.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./results1.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">history1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./history1.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">confusionMatrix</span><span class="w"> </span><span class="p">(</span><span class="n">results1</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">round</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">data.test.y</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">history1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-11-1.png" alt="" /></p>

<p>We can see from the history plot that the original model (without weighting and bias_initializer) converges very quickly. It predicts the non-death_30 cases with high accuracy, while its accuracy for death_30 cases is very low.</p>
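
<p>Overall accuracy hides this gap on imbalanced data, so it helps to read sensitivity and specificity off the confusion table separately. The sketch below uses made-up labels and probabilities, not the results of the model above.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up labels and predicted probabilities for illustration only.
truth &lt;- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
prob  &lt;- c(0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.7)
pred  &lt;- as.integer(prob &gt; 0.5)

# Fix the factor levels so the table is always 2 x 2.
tab &lt;- table(predicted = factor(pred, levels = c(0, 1)),
             actual    = factor(truth, levels = c(0, 1)))

accuracy    &lt;- sum(diag(tab)) / sum(tab)
sensitivity &lt;- tab["1", "1"] / sum(tab[, "1"])  # share of deaths detected
specificity &lt;- tab["0", "0"] / sum(tab[, "0"])  # share of survivors detected
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
</code></pre></div></div>

<p>In this toy case the accuracy is 0.9 even though only half of the positive cases are detected, which is the same pattern described above.</p>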

<p>We then added class weighting to the model. We also set a bias initializer for the MLP to help the model converge.</p>

<ul>
  <li><strong>Method 1.1</strong></li>
</ul>

<p>neural network (MLP) [with weighting and bias_initializer]</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">keras_model_sequential</span><span class="p">()</span><span class="w"> 
</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">64</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'relu'</span><span class="p">,</span><span class="w">
              </span><span class="n">bias_initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">initializer_constant</span><span class="p">(</span><span class="m">0.01</span><span class="p">),</span><span class="w">
              </span><span class="n">kernel_initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"uniform"</span><span class="p">,</span><span class="w"> </span><span class="n">input_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">27</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">layer_dropout</span><span class="p">(</span><span class="n">rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">32</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'relu'</span><span class="p">,</span><span class="w">
              </span><span class="n">bias_initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">initializer_constant</span><span class="p">(</span><span class="m">0.01</span><span class="p">),</span><span class="w">
              </span><span class="n">kernel_initializer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"uniform"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">layer_dropout</span><span class="p">(</span><span class="n">rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">layer_dense</span><span class="p">(</span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">activation</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="s2">"sigmoid"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">compile</span><span class="p">(</span><span class="w">
  </span><span class="n">loss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'binary_crossentropy'</span><span class="p">,</span><span class="w">
  </span><span class="n">optimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'adam'</span><span class="p">,</span><span class="w">
  </span><span class="n">metrics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'accuracy'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">199609</span><span class="p">)</span><span class="w">
</span><span class="n">system.time</span><span class="p">({</span><span class="w">
</span><span class="n">history2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">fit</span><span class="p">(</span><span class="w">
  </span><span class="n">data.train.x</span><span class="p">,</span><span class="w"> </span><span class="n">data.train.y</span><span class="p">,</span><span class="w"> 
  </span><span class="n">epochs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">700</span><span class="p">,</span><span class="w"> </span><span class="n">batch_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">128</span><span class="p">,</span><span class="w"> 
  </span><span class="n">class_weight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"0"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"1"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">),</span><span class="w"> 
  </span><span class="n">validation_split</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">

</span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">evaluate</span><span class="p">(</span><span class="n">data.test.x</span><span class="p">,</span><span class="w"> </span><span class="n">data.test.y</span><span class="p">)</span><span class="w">
</span><span class="n">results2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">data.test.x</span><span class="p">)</span><span class="w"> 
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">results2</span><span class="p">,</span><span class="w"> </span><span class="s2">"./results2.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">history2</span><span class="p">,</span><span class="w"> </span><span class="s2">"./history2.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-13-1.png" alt="" /></p>

<p>Check the confusion matrix of the test-set predictions and the training-history plot.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./results2.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">history2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./history2.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">confusionMatrix</span><span class="w"> </span><span class="p">(</span><span class="n">results2</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">round</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">data.test.y</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">history2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The history plot suggests that the model converged after about 400 epochs, where the validation curve levels off.</p>
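
<p>As a rough way to make this reading of the plot reproducible, the sketch below scans a validation-loss vector for the last epoch that still improved the loss by more than a tolerance. The helper <code class="language-plaintext highlighter-rouge">plateau_epoch</code>, its <code class="language-plaintext highlighter-rouge">patience</code> and <code class="language-plaintext highlighter-rouge">min_delta</code> defaults, and the toy curve are assumptions for illustration and not part of the original analysis. With the real fit one would presumably pass <code class="language-plaintext highlighter-rouge">history2$metrics$val_loss</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: return the last epoch that improved the best
# validation loss by more than `min_delta`, once `patience` epochs have
# passed without a further improvement.
plateau_epoch &lt;- function(val_loss, patience = 30, min_delta = 1e-3) {
  best &lt;- val_loss[1]
  best_epoch &lt;- 1L
  for (epoch in seq_along(val_loss)[-1]) {
    if (val_loss[epoch] &lt; best - min_delta) {
      best &lt;- val_loss[epoch]
      best_epoch &lt;- epoch
    } else if (epoch - best_epoch &gt;= patience) {
      return(best_epoch)
    }
  }
  NA_integer_  # no plateau detected within the recorded epochs
}

# Toy curve (made-up numbers): steady decrease for 400 epochs, then flat.
toy_loss &lt;- c(seq(1, 0.2, length.out = 400), rep(0.2, 300))
plateau_epoch(toy_loss)
</code></pre></div></div>

<p>The thresholds are a judgment call: a smaller <code class="language-plaintext highlighter-rouge">min_delta</code> or a larger <code class="language-plaintext highlighter-rouge">patience</code> pushes the detected plateau later.</p>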

<ul>
  <li><strong>Method 2</strong></li>
</ul>

<p>XGBoost model [with variable selection and re-sampling]</p>

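<p>In this model, we created relatively balanced samples by random under-sampling of the majority class: with <code class="language-plaintext highlighter-rouge">p = 0.33</code>, the rarer class makes up about one third of each resampled data set (survival:dead of roughly 2:1).</p>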
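
<p>To illustrate the idea of random under-sampling, here is a minimal base-R sketch. The helper <code class="language-plaintext highlighter-rouge">undersample</code> and the toy data are hypothetical. The post itself uses <code class="language-plaintext highlighter-rouge">ovun.sample</code> from ROSE, whose exact sampling scheme may differ (this sketch fixes the class sizes deterministically). A minority share of <code class="language-plaintext highlighter-rouge">p = 0.25</code> corresponds to a 3:1 ratio, and <code class="language-plaintext highlighter-rouge">p = 1/3</code> to 2:1.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper (not part of ROSE): keep every minority row and draw
# majority rows at random so that the minority class makes up a share `p`.
undersample &lt;- function(df, label, p = 0.33, minority = 1) {
  is_min &lt;- df[[label]] == minority
  n_min &lt;- sum(is_min)
  # Number of majority rows needed for a minority share of p
  n_maj &lt;- min(sum(!is_min), round(n_min * (1 - p) / p))
  keep &lt;- c(which(is_min), sample(which(!is_min), n_maj))
  df[sample(keep), , drop = FALSE]  # shuffle the retained rows
}

# Toy data (made up): 900 majority rows (y = 0), 100 minority rows (y = 1).
set.seed(1)
toy &lt;- data.frame(y = rep(c(0, 1), times = c(900, 100)), x = rnorm(1000))
balanced &lt;- undersample(toy, "y", p = 0.25)
table(balanced$y)  # 300 majority rows and 100 minority rows, i.e. 3:1
</code></pre></div></div>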
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">###### model function ########</span><span class="w">
</span><span class="n">.importance_function</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">dtrain</span><span class="p">){</span><span class="w">
  </span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgboost</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dtrain</span><span class="p">,</span><span class="w">          
                   </span><span class="n">nrounds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w">
                   </span><span class="n">early_stopping_rounds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
                   </span><span class="n">objective</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"binary:logistic"</span><span class="p">,</span><span class="w">
                   </span><span class="n">eval_metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"logloss"</span><span class="p">,</span><span class="w">
                   </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
  </span><span class="n">importance_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgb.importance</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">)</span><span class="w"> 
  </span><span class="nf">return</span><span class="p">(</span><span class="n">importance_matrix</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">.auc_function_train</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">folds</span><span class="p">){</span><span class="w">
  </span><span class="n">dtrain</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgb.DMatrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b.train.x</span><span class="p">[</span><span class="o">-</span><span class="n">folds</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(),</span><span class="w">
                        </span><span class="n">label</span><span class="o">=</span><span class="w"> </span><span class="n">b.train.y</span><span class="p">[</span><span class="o">-</span><span class="n">folds</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">())</span><span class="w">
  </span><span class="n">dtest</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgb.DMatrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b.train.x</span><span class="p">[</span><span class="n">folds</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(),</span><span class="w">
                       </span><span class="n">label</span><span class="o">=</span><span class="w"> </span><span class="n">b.train.y</span><span class="p">[</span><span class="n">folds</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">())</span><span class="w">
  </span><span class="n">test_labels</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">b.train.y</span><span class="p">[</span><span class="n">folds</span><span class="p">]</span><span class="w">
  
  </span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgboost</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dtrain</span><span class="p">,</span><span class="w">          
                   </span><span class="n">nrounds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w">
                   </span><span class="n">early_stopping_rounds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> 
                   </span><span class="n">objective</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"binary:logistic"</span><span class="p">,</span><span class="w">
                   </span><span class="n">eval_metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"logloss"</span><span class="p">,</span><span class="w">
                   </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
  </span><span class="n">pred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">dtest</span><span class="p">)</span><span class="w"> 
  </span><span class="n">xgbpred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ifelse</span><span class="w"> </span><span class="p">(</span><span class="n">pred</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">0</span><span class="p">)</span><span class="w">
  </span><span class="n">roc_l</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">roc</span><span class="p">(</span><span class="n">test_labels</span><span class="p">,</span><span class="n">pred</span><span class="p">)</span><span class="w">
  </span><span class="n">auc_value</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">auc</span><span class="p">(</span><span class="n">roc_l</span><span class="p">)</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">auc_value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">.model_function_Total</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">importance</span><span class="o">=</span><span class="nb">F</span><span class="p">,</span><span class="n">auc</span><span class="o">=</span><span class="nb">T</span><span class="p">){</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">importance</span><span class="o">==</span><span class="nb">T</span><span class="p">){</span><span class="w">
    </span><span class="n">cv.group.x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">folds</span><span class="p">,</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">out</span><span class="o">&lt;-</span><span class="kc">NULL</span><span class="p">;</span><span class="w">
    </span><span class="n">out</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="n">data.frame</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b.train.y</span><span class="p">[</span><span class="o">-</span><span class="n">x</span><span class="p">],</span><span class="w"> </span><span class="n">b.train.x</span><span class="p">[</span><span class="o">-</span><span class="n">x</span><span class="p">,]))})</span><span class="w">
    </span><span class="n">cv.group.y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">folds</span><span class="p">,</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">out</span><span class="o">&lt;-</span><span class="kc">NULL</span><span class="p">;</span><span class="w">
    </span><span class="n">out</span><span class="o">&lt;-</span><span class="n">rbind</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="n">data.frame</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b.train.y</span><span class="p">[</span><span class="n">x</span><span class="p">],</span><span class="w"> </span><span class="n">b.train.x</span><span class="p">[</span><span class="n">x</span><span class="p">,]))})</span><span class="w">
    
    </span><span class="n">dtrain</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">cv.group.x</span><span class="p">,</span><span class="w"> 
                     </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">out</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgb.DMatrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w"> 
                                                    </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">.</span><span class="p">[,</span><span class="m">-1</span><span class="p">],</span><span class="w">
                                                    </span><span class="n">label</span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
                                                    </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">.</span><span class="p">[,</span><span class="m">1</span><span class="p">])})</span><span class="w">
    </span><span class="n">dtest</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">cv.group.y</span><span class="p">,</span><span class="w"> 
                    </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">out</span><span class="o">&lt;-</span><span class="n">xgb.DMatrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
                                                 </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">.</span><span class="p">[,</span><span class="m">-1</span><span class="p">],</span><span class="w">
                                                 </span><span class="n">label</span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w"> 
                                                 </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">.</span><span class="p">[,</span><span class="m">1</span><span class="p">])})</span><span class="w">
    </span><span class="n">importance_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">dtrain</span><span class="p">,</span><span class="w"> </span><span class="n">.importance_function</span><span class="p">)</span><span class="w"> 
    </span><span class="n">importance_combine</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Reduce</span><span class="p">(</span><span class="s2">"rbind"</span><span class="p">,</span><span class="w"> </span><span class="n">importance_matrix</span><span class="p">)</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="n">importance_combine</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">auc</span><span class="o">==</span><span class="nb">T</span><span class="p">){</span><span class="w">
    </span><span class="n">auc_value</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">folds</span><span class="p">,</span><span class="w"> </span><span class="n">.auc_function_train</span><span class="p">)</span><span class="w"> </span><span class="c1">#%&gt;% mean() </span><span class="w">
    </span><span class="n">auc_value_combine</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">auc_value_combine</span><span class="p">,</span><span class="w"> </span><span class="n">auc_value</span><span class="p">)</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="n">auc_value_combine</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
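
<p>The AUC returned by <code class="language-plaintext highlighter-rouge">.auc_function_train</code> can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. The sketch below computes this with the rank-sum (Mann-Whitney) formula. The helper <code class="language-plaintext highlighter-rouge">auc_rank</code> and the toy vectors are hypothetical, and <code class="language-plaintext highlighter-rouge">roc()</code> may choose the comparison direction automatically, so the two can disagree when a model ranks worse than chance.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: AUC from the rank-sum (Mann-Whitney U) statistic.
# `labels` holds 0/1 outcomes; `scores` holds predicted probabilities.
auc_rank &lt;- function(labels, scores) {
  pos &lt;- labels == 1
  n_pos &lt;- sum(pos)
  n_neg &lt;- sum(!pos)
  r &lt;- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 of the 4 positive-negative pairs are ordered correctly.
auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
</code></pre></div></div>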

<p>First, we ranked the importance of all variables using 10-fold cross-validation, repeated over 300 under-sampled data sets.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">importance_combine</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">data.train.y</span><span class="p">,</span><span class="n">data.train.x</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">()</span><span class="w">

</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">300</span><span class="p">){</span><span class="w">
  </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1e7</span><span class="o">-</span><span class="n">i</span><span class="p">)</span><span class="w">
  </span><span class="n">data.balanced</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ovun.sample</span><span class="p">(</span><span class="n">.data_Yes</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">temp</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.33</span><span class="p">,</span><span class="w">
                               </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"under"</span><span class="p">)</span><span class="o">$</span><span class="n">data</span><span class="w">
  </span><span class="n">b.train.x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.balanced</span><span class="p">[,</span><span class="w"> </span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
  </span><span class="n">b.train.y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.balanced</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
  </span><span class="n">folds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">createFolds</span><span class="p">(</span><span class="n">factor</span><span class="p">(</span><span class="n">b.train.y</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">
  </span><span class="n">importance_combine</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">importance_combine</span><span class="p">,</span><span class="w">
                              </span><span class="n">.model_function_Total</span><span class="p">(</span><span class="n">importance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">auc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">))</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">importance_combine</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">Feature</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Gain</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">mean</span><span class="p">))</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">importance_sum</span><span class="w">

</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">importance_sum</span><span class="p">,</span><span class="w"> </span><span class="s2">"importance_sum.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Plot the 10 most important variables.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">importance_sum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"importance_sum.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">importance_sum</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
       </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">Feature</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">mean</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="n">Feature</span><span class="p">))</span><span class="o">+</span><span class="w">
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"RdBu"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Relative Importance"</span><span class="p">)</span><span class="o">+</span><span class="w">
  </span><span class="n">theme_few</span><span class="p">()</span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.ticks.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
  
</span></code></pre></div></div>
<p><img src="/post/2021-03-20-data-science-practice-4/index.en-us_files/unnamed-chunk-16-1.png" alt="" /></p>

<p>Then we ran a forward stepwise variable selection, adding variables in order of importance and scoring each subset by its AUC (averaged over 10-fold cross-validation).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">feature_select</span><span class="o">&lt;-</span><span class="kc">NULL</span><span class="p">;</span><span class="n">auc.comb</span><span class="o">&lt;-</span><span class="kc">NULL</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">15</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">auc_value_combine</span><span class="o">&lt;-</span><span class="kc">NULL</span><span class="w">
  </span><span class="n">feature_select</span><span class="o">&lt;-</span><span class="n">importance_sum</span><span class="o">$</span><span class="n">Feature</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">f</span><span class="p">]</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">data.train.y</span><span class="p">,</span><span class="w"> 
                </span><span class="n">data.train.x</span><span class="p">[,</span><span class="n">feature_select</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">()</span><span class="w">
  
  </span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">){</span><span class="w">
    </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1e7</span><span class="o">-</span><span class="n">i</span><span class="p">)</span><span class="w">
    </span><span class="n">data.balanced</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ovun.sample</span><span class="p">(</span><span class="n">.data_Yes</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">temp</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.33</span><span class="p">,</span><span class="w">
                                 </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"under"</span><span class="p">)</span><span class="o">$</span><span class="n">data</span><span class="w">
    </span><span class="n">b.train.x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.balanced</span><span class="p">[,</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
    </span><span class="n">b.train.y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.balanced</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
    </span><span class="n">folds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">createFolds</span><span class="p">(</span><span class="n">factor</span><span class="p">(</span><span class="n">b.train.y</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">
    </span><span class="n">auc_value_combine</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">.model_function_Total</span><span class="p">(</span><span class="n">importance</span><span class="o">=</span><span class="nb">F</span><span class="p">,</span><span class="n">auc</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
    </span><span class="c1">#print(paste0(f," variables - ", i, "%"))</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">auc_new</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">auc_value_combine</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mean</span><span class="p">()</span><span class="w">
  </span><span class="n">auc_sd</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">auc_value_combine</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">sd</span><span class="p">()</span><span class="w">
  </span><span class="n">auc.comb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">auc.comb</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">auc_new</span><span class="p">,</span><span class="w"> </span><span class="n">auc_sd</span><span class="p">,</span><span class="n">model</span><span class="o">=</span><span class="n">f</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">feature_select</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">feature_select</span><span class="p">[</span><span class="m">1</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">auc.comb</span><span class="o">$</span><span class="n">auc_new</span><span class="p">)]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="nf">as.character</span><span class="p">()</span><span class="w">
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">feature_select</span><span class="p">,</span><span class="w"> </span><span class="s2">"./feature_select.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
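<p>The selection loop above relies on helper objects defined earlier in the post (<code class="language-plaintext highlighter-rouge">importance_sum</code>, <code class="language-plaintext highlighter-rouge">.model_function_Total</code>). As a minimal, self-contained sketch of the same idea, assuming simulated data, a plain logistic regression in place of XGBoost, and a hypothetical <code class="language-plaintext highlighter-rouge">auc_rank()</code> helper, the code below adds ranked features one at a time and keeps the prefix with the highest mean 10-fold AUC.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
n &lt;- 400
# Simulated predictors: x1 and x2 carry signal, x3 to x5 are noise
x &lt;- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(x) &lt;- paste0("x", 1:5)
y &lt;- rbinom(n, 1, plogis(1.5 * x$x1 + 1.0 * x$x2))

# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_rank &lt;- function(score, label) {
  r &lt;- rank(score)
  n1 &lt;- sum(label == 1)
  n0 &lt;- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

ranked &lt;- names(x)                      # assumed importance order
fold &lt;- sample(rep(1:10, length.out = n))
mean_auc &lt;- numeric(length(ranked))

for (f in seq_along(ranked)) {
  # Keep the outcome plus the first f ranked features
  dat &lt;- data.frame(y = y, x[, ranked[1:f], drop = FALSE])
  fold_auc &lt;- numeric(10)
  for (k in 1:10) {
    fit &lt;- glm(y ~ ., data = dat[fold != k, ], family = binomial)
    p &lt;- predict(fit, newdata = dat[fold == k, ], type = "response")
    fold_auc[k] &lt;- auc_rank(p, dat$y[fold == k])
  }
  mean_auc[f] &lt;- mean(fold_auc)
}

# Keep the feature prefix with the highest cross-validated AUC
selected &lt;- ranked[1:which.max(mean_auc)]
print(round(mean_auc, 3))
print(selected)
</code></pre></div></div>
<p>Unlike the loop above, this sketch does not repeat the under-sampling step; it only illustrates how the number of features is chosen.</p>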

<p>Check the variables in the final model.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">feature_select</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./feature_select.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">feature_select</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Finally, we trained the model on the re-sampled training set and predicted on the test set.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">data.train.y</span><span class="p">,</span><span class="w"> 
                </span><span class="n">data.train.x</span><span class="p">[,</span><span class="w"> </span><span class="n">feature_select</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">()</span><span class="w">
</span><span class="n">data.balanced</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ovun.sample</span><span class="p">(</span><span class="n">.data_Yes</span><span class="o">~</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">temp</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.33</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">199609</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"under"</span><span class="p">)</span><span class="o">$</span><span class="n">data</span><span class="w">
</span><span class="n">dtrain</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgb.DMatrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.balanced</span><span class="p">[,</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(),</span><span class="w">
                      </span><span class="n">label</span><span class="o">=</span><span class="w"> </span><span class="n">data.balanced</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">())</span><span class="w">
</span><span class="n">dtest</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgb.DMatrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.test.x</span><span class="p">[,</span><span class="w"> </span><span class="n">feature_select</span><span class="p">]</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(),</span><span class="w">
                      </span><span class="n">label</span><span class="o">=</span><span class="w"> </span><span class="n">data.test.y</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">())</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">xgboost</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dtrain</span><span class="p">,</span><span class="w">
                 </span><span class="n">nround</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w">
                 </span><span class="n">early_stopping_rounds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
                 </span><span class="n">objective</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"binary:logistic"</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">eval_metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"logloss"</span><span class="p">,</span><span class="w">
                 </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">xgbpred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">dtest</span><span class="p">)</span><span class="w">
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">xgbpred</span><span class="p">,</span><span class="w"> </span><span class="s2">"./xgbpred.rds"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Check the prediction results.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xgbpred</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"./xgbpred.rds"</span><span class="p">)</span><span class="w">
</span><span class="n">confusionMatrix</span><span class="p">(</span><span class="n">xgbpred</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">round</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(),</span><span class="w"> 
                 </span><span class="n">data.test.y</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">())</span><span class="w">
</span></code></pre></div></div>
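<p>The accuracy and balanced accuracy reported by <code class="language-plaintext highlighter-rouge">confusionMatrix()</code> can be reproduced by hand from the 2x2 table. The sketch below uses made-up probabilities and labels (not the MIMIC results) and an explicit 0.5 cut-off in place of rounding.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up predicted probabilities and true labels, for illustration only
prob  &lt;- c(0.91, 0.20, 0.65, 0.40, 0.08, 0.77, 0.55, 0.30, 0.12, 0.83)
truth &lt;- c(1,    0,    1,    1,    0,    0,    1,    0,    0,    1)

pred &lt;- as.integer(prob &gt; 0.5)
tab &lt;- table(Predicted = factor(pred, levels = c(0, 1)),
             Actual    = factor(truth, levels = c(0, 1)))

tp &lt;- tab["1", "1"]; tn &lt;- tab["0", "0"]
fp &lt;- tab["1", "0"]; fn &lt;- tab["0", "1"]

accuracy    &lt;- (tp + tn) / sum(tab)
sensitivity &lt;- tp / (tp + fn)   # recall for the positive class
specificity &lt;- tn / (tn + fp)   # recall for the negative class
balanced_accuracy &lt;- (sensitivity + specificity) / 2

print(tab)
print(c(accuracy = accuracy, balanced_accuracy = balanced_accuracy))
</code></pre></div></div>
<p>Balanced accuracy averages the two per-class recalls, so it is less flattering than plain accuracy when one class dominates the test set.</p>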

<h3 id="q23-comparation">Q2.3 Comparison</h3>
<p>Compare model prediction performance on the test set.</p>

<p>According to the confusion matrices presented above, the two models perform similarly. The XGBoost model has a slightly higher overall accuracy (79%), while the MLP model predicts death_30 cases better (69%). Both models have a balanced accuracy of around 70%. We then plotted receiver operating characteristic (ROC) curves, which also suggest that the two models have similar prediction patterns and AUC values.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rocmlp</span><span class="o">&lt;-</span><span class="n">roc</span><span class="p">(</span><span class="n">controls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">results2</span><span class="p">[</span><span class="n">data.test.y</span><span class="o">==</span><span class="m">0</span><span class="p">],</span><span class="w">
            </span><span class="n">cases</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">results2</span><span class="p">[</span><span class="n">data.test.y</span><span class="o">==</span><span class="m">1</span><span class="p">],</span><span class="w">
            </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">rocxgs</span><span class="o">&lt;-</span><span class="n">roc</span><span class="p">(</span><span class="n">controls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xgbpred</span><span class="p">[</span><span class="n">data.test.y</span><span class="o">==</span><span class="m">0</span><span class="p">],</span><span class="w">
            </span><span class="n">cases</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xgbpred</span><span class="p">[</span><span class="n">data.test.y</span><span class="o">==</span><span class="m">1</span><span class="p">],</span><span class="w">
            </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">plot</span><span class="p">(</span><span class="n">rocmlp</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dark blue"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">rocxgs</span><span class="p">,</span><span class="w"> </span><span class="n">add</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"bottomright"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"XGBoost (AUC:0.7821)"</span><span class="p">,</span><span class="w">
                              </span><span class="s2">"MLP (AUC:0.7715)"</span><span class="p">),</span><span class="w">
       </span><span class="n">col</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dark blue"</span><span class="p">),</span><span class="w">
       </span><span class="n">lty</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="o">=</span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Data science" /><category term="practice" /><summary type="html"><![CDATA[Multiple imputation &amp; modeling]]></summary></entry><entry><title type="html">Data science practice 2</title><link href="https://zianzhuang.com/posts/2021/03/blog-post-2/" rel="alternate" type="text/html" title="Data science practice 2" /><published>2021-03-10T00:00:00+00:00</published><updated>2021-03-10T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/03/blog-post-2</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/03/blog-post-2/"><![CDATA[<!--more-->

<h2 id="q1-physionet-credential">Q1. PhysioNet credential</h2>

<h3 id="q11">Q1.1</h3>
<p>By now, you should already be credentialed on PhysioNet. Please include a screenshot of your <code class="language-plaintext highlighter-rouge">Data Use Agreement for the MIMIC-IV (v0.4)</code>.</p>

<blockquote>
  <p><strong>solution</strong>: <img src="/post/2021-03-20-data-science-practice-2/index.en-us_files/agreement.png" alt="" /></p>
</blockquote>

<h2 id="q2-readcsv-base-r-vs-read_csv-tidyverse-vs-fread-datatable">Q2. <code class="language-plaintext highlighter-rouge">read.csv</code> (base R) vs <code class="language-plaintext highlighter-rouge">read_csv</code> (tidyverse) vs <code class="language-plaintext highlighter-rouge">fread</code> (data.table)</h2>

<p>There are quite a few utilities in R for reading data files. Let us test the speed of reading a moderately sized compressed csv file, <code class="language-plaintext highlighter-rouge">admissions.csv.gz</code>, with three functions: <code class="language-plaintext highlighter-rouge">read.csv</code> in base R, <code class="language-plaintext highlighter-rouge">read_csv</code> in tidyverse, and <code class="language-plaintext highlighter-rouge">fread</code> in the popular data.table package. Is there any speed difference?</p>

<p>In this homework, we stick to the tidyverse.</p>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">.timer</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fun_input</span><span class="p">){</span><span class="w">
  </span><span class="n">timestart</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
  </span><span class="n">adm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fun_input</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/core/admissions.csv.gz"</span><span class="p">))</span><span class="w"> 
  </span><span class="n">timeend</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">timeend</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">timestart</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">lapply</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="s2">"read_csv"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read_csv</span><span class="p">,</span><span class="w">
            </span><span class="s2">"read.csv"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.csv</span><span class="p">,</span><span class="w">
            </span><span class="s2">"fread"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fread</span><span class="p">),</span><span class="w"> </span><span class="n">.timer</span><span class="p">)</span><span class="w">
</span></code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">fread</code> proved to be the fastest function for reading the file, and <code class="language-plaintext highlighter-rouge">read.csv</code> the slowest.</p>
</blockquote>
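<p>A single <code class="language-plaintext highlighter-rouge">Sys.time()</code> difference can be noisy because of disk caching and background load. As a rough sketch, assuming a small simulated gzip-compressed csv in a temporary file rather than the MIMIC file, and a hypothetical <code class="language-plaintext highlighter-rouge">time_reader()</code> helper, each reader can be run several times and the median elapsed time compared. Only <code class="language-plaintext highlighter-rouge">read.csv</code> is timed here so that the example runs with base R alone.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Write a small simulated csv.gz to a temporary file (stand-in for admissions.csv.gz)
tmp &lt;- tempfile(fileext = ".csv.gz")
con &lt;- gzfile(tmp, "w")
write.csv(data.frame(id = 1:20000, value = rnorm(20000)), con, row.names = FALSE)
close(con)

# Hypothetical helper: median elapsed seconds over `reps` reads
time_reader &lt;- function(reader, path, reps = 5) {
  elapsed &lt;- vapply(seq_len(reps), function(i) {
    unname(system.time(reader(path))["elapsed"])
  }, numeric(1))
  median(elapsed)
}

readers &lt;- list("read.csv" = read.csv)
# With the packages installed, the other readers could be added, e.g.
# readers$read_csv &lt;- readr::read_csv
# readers$fread    &lt;- data.table::fread
timings &lt;- sapply(readers, time_reader, path = tmp)
print(timings)
unlink(tmp)
</code></pre></div></div>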

<h2 id="q3-icu-stays">Q3. ICU stays</h2>

<p><code class="language-plaintext highlighter-rouge">icustays.csv.gz</code> (<a href="https://mimic-iv.mit.edu/docs/datasets/icu/icustays/">https://mimic-iv.mit.edu/docs/datasets/icu/icustays/</a>) contains data about Intensive Care Unit (ICU) stays. Summarize the following variables using appropriate numerics or graphs:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/icu/icustays.csv.gz"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<h3 id="q31-how-many-unique-stay_id">Q3.1 how many unique <code class="language-plaintext highlighter-rouge">stay_id</code>?</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">stay_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">nrow</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">str_c</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="s2">" unique stay_id"</span><span class="p">)</span><span class="w">  
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q32-how-many-unique-subject_id">Q3.2 how many unique <code class="language-plaintext highlighter-rouge">subject_id</code>?</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">  
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">nrow</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">str_c</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="s2">" unique subject_id"</span><span class="p">)</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>
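<p>Both counts can also be computed in one call with <code class="language-plaintext highlighter-rouge">n_distinct()</code> from dplyr. The sketch below uses a small made-up table in place of <code class="language-plaintext highlighter-rouge">icustays</code> to show why the two numbers can differ: one patient (<code class="language-plaintext highlighter-rouge">subject_id</code>) may have several ICU stays (<code class="language-plaintext highlighter-rouge">stay_id</code>).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Made-up stand-in for icustays: subject 101 has two ICU stays
toy_stays &lt;- tibble(
  subject_id = c(101, 101, 102, 103),
  stay_id    = c(9001, 9002, 9003, 9004)
)

counts &lt;- toy_stays %&gt;%
  summarise(
    n_stays    = n_distinct(stay_id),
    n_subjects = n_distinct(subject_id)
  )
print(counts)
</code></pre></div></div>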

<h3 id="q33-length-of-icu-stay">Q3.3 length of ICU stay</h3>

<blockquote>
  <p><strong>solution</strong>:
<strong>Please note that we used a log scale on the x-axis to make the data more readable.</strong></p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays</span><span class="o">$</span><span class="n">los</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">icustays</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">los</span><span class="p">),</span><span class="w"> </span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">150</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_log10</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"length of ICU stay (days)"</span><span class="p">)</span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> 
</span></code></pre></div>  </div>
</blockquote>
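<p>A log scale is a common choice for right-skewed durations, where most values are small and a few are very large. The small sketch below uses simulated log-normal values (an assumption made for illustration, not the MIMIC data) to show how the mean is pulled above the median on the raw scale and how the log10 transform makes the distribution roughly symmetric.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2021)
# Simulated lengths of stay in days; log-normal is an assumption for illustration
los_sim &lt;- rlnorm(5000, meanlog = log(2), sdlog = 1)

summary(los_sim)            # raw scale: mean well above median
summary(log10(los_sim))     # log10 scale: mean close to median

# Side-by-side histograms on the raw and log10 scales
old_par &lt;- par(mfrow = c(1, 2))
hist(los_sim, breaks = 50, main = "Raw scale", xlab = "LOS (days)")
hist(log10(los_sim), breaks = 50, main = "log10 scale", xlab = "log10 LOS")
par(old_par)
</code></pre></div></div>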

<h3 id="q34-first-icu-unit">Q3.4 first ICU unit</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">first_careunit</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">icustays</span><span class="p">,</span><span class="w"> 
       </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first_careunit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ICUs"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_bar</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">coord_polar</span><span class="p">(</span><span class="s2">"y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q35-last-icu-unit">Q3.5 last ICU unit</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">last_careunit</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">icustays</span><span class="p">,</span><span class="w"> 
       </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">last_careunit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ICUs"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_bar</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">coord_polar</span><span class="p">(</span><span class="s2">"y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h2 id="q4-admission-data">Q4. <code class="language-plaintext highlighter-rouge">admission</code> data</h2>

<p>Information on patients admitted to the hospital is available in <code class="language-plaintext highlighter-rouge">admissions.csv.gz</code>. See <a href="https://mimic-iv.mit.edu/docs/datasets/core/admissions/">https://mimic-iv.mit.edu/docs/datasets/core/admissions/</a> for details of each field in this file. Summarize the following variables using appropriate graphs. Explain any patterns you observe.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/core/admissions.csv.gz"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p>Note that a patient (uniquely identified by the <code class="language-plaintext highlighter-rouge">subject_id</code>) may be admitted to the hospital multiple times. When summarizing demographic information, it makes sense to summarize over unique patients.</p>
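
<p>A minimal sketch of what summarizing by unique patient could look like, assuming <code class="language-plaintext highlighter-rouge">dplyr</code> is installed. The toy tibble <code class="language-plaintext highlighter-rouge">adm_toy</code> and its values are hypothetical stand-ins for <code class="language-plaintext highlighter-rouge">adm</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical toy data: patient 1 is admitted twice
adm_toy &lt;- tibble(
  subject_id = c(1, 1, 2, 3),
  insurance  = c("Medicare", "Medicare", "Other", "Medicaid")
)

# Keep one row per patient before summarizing demographic fields
patients &lt;- adm_toy %&gt;%
  distinct(subject_id, .keep_all = TRUE)

# Counts are now per patient rather than per admission
patients %&gt;% count(insurance)
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">distinct()</code> keeps the first row it sees for each patient, so a field that changes between admissions would need an explicit rule (for example, the most recent admission).</p>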

<h3 id="q41-admission-year">Q4.1 admission year</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">admission_year</span><span class="o">=</span><span class="n">year</span><span class="p">(</span><span class="n">adm</span><span class="o">$</span><span class="n">admittime</span><span class="p">))</span><span class="w">
</span><span class="n">adm</span><span class="o">$</span><span class="n">admission_year</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_year</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q42-admission-month">Q4.2 admission month</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">admission_month</span><span class="o">=</span><span class="n">month</span><span class="p">(</span><span class="n">adm</span><span class="o">$</span><span class="n">admittime</span><span class="p">))</span><span class="w">
</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">admission_month</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_month</span><span class="p">,</span><span class="w"> 
                         </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_month</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q43-admission-month-day">Q4.3 admission month day</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">admission_monthday</span><span class="o">=</span><span class="n">day</span><span class="p">(</span><span class="n">adm</span><span class="o">$</span><span class="n">admittime</span><span class="p">))</span><span class="w">
</span><span class="n">adm</span><span class="o">$</span><span class="n">admission_monthday</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_monthday</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q44-admission-week-day">Q4.4 admission week day</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">admission_weekday</span><span class="o">=</span><span class="n">wday</span><span class="p">(</span><span class="n">adm</span><span class="o">$</span><span class="n">admittime</span><span class="p">))</span><span class="w">
</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">admission_weekday</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_weekday</span><span class="p">,</span><span class="w"> 
                         </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_weekday</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.factor</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"weekday"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q45-admission-hour-anything-unusual">Q4.5 admission hour (anything unusual?)</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">admission_hour</span><span class="o">=</span><span class="n">hour</span><span class="p">(</span><span class="n">adm</span><span class="o">$</span><span class="n">admittime</span><span class="p">))</span><span class="w">
</span><span class="n">adm</span><span class="o">$</span><span class="n">admission_hour</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summary</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_hour</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> 
</span></code></pre></div>  </div>
  <p>According to the bar plot, there are more admissions during the night than during the day.</p>
</blockquote>
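
<p>One way to probe for anything unusual is to tabulate admissions by hour and minute, since a pile-up at a single exact time might point to a recording convention rather than true arrival times. This is only a sketch, assuming <code class="language-plaintext highlighter-rouge">dplyr</code> and <code class="language-plaintext highlighter-rouge">lubridate</code> are installed. The timestamps in <code class="language-plaintext highlighter-rouge">adm_toy</code> are hypothetical and do not come from the MIMIC data:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)
library(lubridate)

# Hypothetical timestamps: two admissions recorded at exactly midnight
adm_toy &lt;- tibble(
  admittime = ymd_hms(c("2150-01-01 00:00:00", "2150-01-02 00:00:00",
                        "2150-01-03 07:15:00", "2150-01-04 18:42:00"))
)

# Count admissions per (hour, minute) pair, most frequent first
hour_counts &lt;- adm_toy %&gt;%
  mutate(admission_hour   = hour(admittime),
         admission_minute = minute(admittime)) %&gt;%
  count(admission_hour, admission_minute, sort = TRUE)

hour_counts
</code></pre></div></div>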

<h3 id="q46-number-of-deaths-in-each-year">Q4.6 number of deaths in each year</h3>

<p>First, we need to check whether the two indicators of death (<code class="language-plaintext highlighter-rouge">deathtime</code>, <code class="language-plaintext highlighter-rouge">hospital_expire_flag</code>) are consistent.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">test</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">((</span><span class="nf">is.na</span><span class="p">(</span><span class="n">deathtime</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">FALSE</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> 
                          </span><span class="n">hospital_expire_flag</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="o">|</span><span class="w">
                         </span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">deathtime</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> 
                            </span><span class="n">hospital_expire_flag</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">test</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">deathtime</span><span class="p">,</span><span class="w"> 
         </span><span class="n">hospital_expire_flag</span><span class="p">,</span><span class="w"> </span><span class="n">discharge_location</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><strong>Note</strong>: We found some cases where <code class="language-plaintext highlighter-rouge">deathtime</code> is missing but <code class="language-plaintext highlighter-rouge">hospital_expire_flag</code> equals 1. Judging by the other fields (e.g. <code class="language-plaintext highlighter-rouge">discharge_location</code>), <code class="language-plaintext highlighter-rouge">hospital_expire_flag</code> turns out to be the more precise indicator. Thus, we chose <strong><code class="language-plaintext highlighter-rouge">hospital_expire_flag</code></strong> as the indicator for death cases.</p>
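
<p>The same consistency check can also be sketched as a cross-tabulation of the two indicators, assuming <code class="language-plaintext highlighter-rouge">dplyr</code> is installed. The toy tibble <code class="language-plaintext highlighter-rouge">adm_toy</code> is hypothetical, and <code class="language-plaintext highlighter-rouge">deathtime</code> is stored as a character column here purely for illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical toy data: the third row has the flag set but no deathtime
adm_toy &lt;- tibble(
  deathtime            = c(NA, "2150-01-02 03:04:05", NA),
  hospital_expire_flag = c(0, 1, 1)
)

# Cross-tabulate the two indicators. Rows where has_deathtime and
# hospital_expire_flag disagree are the inconsistent cases.
check &lt;- adm_toy %&gt;%
  count(has_deathtime = !is.na(deathtime), hospital_expire_flag)

check
</code></pre></div></div>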

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">hospital_expire_flag</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">admission_year</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">hospital_expire_flag</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> 
       </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_year</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q47-admission-type">Q4.7 admission type</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">admission_type</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_type</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_type</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q48-number-of-admissions-per-patient">Q4.8 number of admissions per patient</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">number_of_ad</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">subject_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">num</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">n</span><span class="o">&gt;=</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">number_of_ad</span><span class="o">$</span><span class="n">num</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">ordered</span><span class="p">(</span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"3"</span><span class="p">,</span><span class="w"> </span><span class="s2">"4"</span><span class="p">,</span><span class="w"> </span><span class="s2">"5"</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">number_of_ad</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">num</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"number of admissions"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"number of admissions"</span><span class="p">,</span><span class="w"> 
                      </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"3"</span><span class="p">,</span><span class="w"> </span><span class="s2">"4"</span><span class="p">,</span><span class="w"> </span><span class="s2">"&gt;=5"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>
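
<p>The capping step above can also be written with <code class="language-plaintext highlighter-rouge">pmin()</code> and a labelled ordered factor, so the <code class="language-plaintext highlighter-rouge">&gt;=5</code> label lives in the data rather than only in the legend. This is a sketch of an alternative, not the approach used in the solution. The names <code class="language-plaintext highlighter-rouge">adm_toy</code> and <code class="language-plaintext highlighter-rouge">number_of_ad_toy</code> are hypothetical:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical toy data: patient 3 has six admissions
adm_toy &lt;- tibble(subject_id = c(1, 1, 2, 3, 3, 3, 3, 3, 3))

number_of_ad_toy &lt;- adm_toy %&gt;%
  count(subject_id) %&gt;%                       # admissions per patient
  mutate(num = factor(pmin(n, 5),             # cap counts at 5
                      levels  = 1:5,
                      labels  = c("1", "2", "3", "4", "&gt;=5"),
                      ordered = TRUE))

number_of_ad_toy
</code></pre></div></div>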

<h3 id="q49-admission-location">Q4.9 admission location</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">admission_location</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_location</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">admission_location</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_sqrt</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>
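
<p>As a side note, the <code class="language-plaintext highlighter-rouge">group_by()</code> / <code class="language-plaintext highlighter-rouge">summarise()</code> / <code class="language-plaintext highlighter-rouge">arrange()</code> pattern used for these frequency tables can also be written with dplyr's <code class="language-plaintext highlighter-rouge">count()</code>. The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">toy_adm</code> and its values are made up and stand in for <code class="language-plaintext highlighter-rouge">adm</code>. Unlike the <code class="language-plaintext highlighter-rouge">.groups = "keep"</code> version, <code class="language-plaintext highlighter-rouge">count()</code> returns an ungrouped tibble here.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical toy data standing in for `adm`; values are illustrative only.
toy_adm &lt;- tibble(
  admission_location = c("EMERGENCY ROOM", "EMERGENCY ROOM",
                         "PHYSICIAN REFERRAL", "EMERGENCY ROOM",
                         "TRANSFER FROM HOSPITAL", "PHYSICIAN REFERRAL")
)

# count() groups, tallies, and (with sort = TRUE) orders by descending n
by_location &lt;- toy_adm %&gt;%
  count(admission_location, sort = TRUE)
by_location
</code></pre></div></div>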

<h3 id="q49-discharge-location">Q4.9 discharge location</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">discharge_location</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">discharge_location</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">discharge_location</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_sqrt</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q410--insurance">Q4.10 insurance</h3>

<blockquote>
  <p><strong>solution</strong>: (summarized based on unique patients)</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">insurance</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">insurance</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">insurance</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>
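
<p>Summarizing "based on unique patients" relies on <code class="language-plaintext highlighter-rouge">distinct(subject_id, .keep_all = TRUE)</code>, which keeps only the first row it encounters for each <code class="language-plaintext highlighter-rouge">subject_id</code>. A patient with several admissions is therefore counted once, under the value recorded on that first row. The sketch below illustrates this with made-up data; <code class="language-plaintext highlighter-rouge">toy_adm</code> and its values are hypothetical and only stand in for <code class="language-plaintext highlighter-rouge">adm</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical toy data: subject 1 has two admissions with different insurance.
toy_adm &lt;- tibble(
  subject_id = c(1, 1, 2, 3, 3),
  insurance  = c("Medicare", "Other", "Other", "Medicaid", "Medicaid")
)

# Keep the first row per subject_id, retaining all other columns
per_patient &lt;- toy_adm %&gt;%
  distinct(subject_id, .keep_all = TRUE)

# Subject 1 is counted under "Medicare" only; the later "Other" row is dropped
per_patient %&gt;%
  count(insurance, sort = TRUE)
</code></pre></div></div>

<p>If a variable can change between admissions, the result depends on the row order of the data, so it may be worth sorting (for example by admission time) before calling <code class="language-plaintext highlighter-rouge">distinct()</code>.</p>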

<h3 id="q411-language">Q4.11 language</h3>

<blockquote>
  <p><strong>solution</strong>: (summarized based on unique patients)</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">language</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">language</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">language</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"unknown"</span><span class="p">,</span><span class="s2">"English"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q412-marital-status">Q4.12 marital status</h3>

<blockquote>
  <p><strong>solution</strong>: (summarized based on unique patients)</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">marital_status</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">marital_status</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">marital_status</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"marital status"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q413-ethnicity">Q4.13 ethnicity</h3>

<blockquote>
  <p><strong>solution</strong>: (summarized based on unique patients)</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">ethnicity</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ethnicity</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ethnicity</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ethnicity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_sqrt</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q414-death">Q4.14 death</h3>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">adm</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">death</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">hospital_expire_flag</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"Yes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">))</span><span class="w">
</span><span class="n">adm</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">death</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">adm</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">death</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">death</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"death"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_sqrt</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h2 id="q5-patient-data">Q5. <code class="language-plaintext highlighter-rouge">patient</code> data</h2>

<p>Explore <code class="language-plaintext highlighter-rouge">patients.csv.gz</code> (<a href="https://mimic-iv.mit.edu/docs/datasets/core/patients/">https://mimic-iv.mit.edu/docs/datasets/core/patients/</a>) and summarize the following variables using appropriate numerics and graphs:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">patients</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/core/patients.csv.gz"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<h3 id="q51-gender">Q5.1 <code class="language-plaintext highlighter-rouge">gender</code></h3>

<blockquote>
  <p><strong>solution</strong>: (summarized based on unique patients)</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">patients</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">gender</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarise</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"keep"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">patients</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gender</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gender</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gender"</span><span class="p">,</span><span class="w"> 
                      </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Female"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Male"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> 
        </span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">panel.background</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
        </span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_line</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">))</span><span class="w">
</span></code></pre></div>  </div>
</blockquote>

<h3 id="q52-anchor_age">Q5.2 <code class="language-plaintext highlighter-rouge">anchor_age</code></h3>
<p>(explain the pattern you see)</p>

<blockquote>
  <p><strong>solution</strong>: (summarized based on unique patients)</p>
  <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">patients</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">anchor_age</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summary</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">patients</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_density</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">anchor_age</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"anchor_age"</span><span class="p">)</span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> 
</span></code></pre></div>  </div>
</blockquote>

<p>According to the summary statistics, the mean anchor age of patients is 41. The density plot shows two peaks in anchor age, at 0 and at about 25, with very few patients at an anchor age of around 10. We presume that the peak at 0 corresponds to missing or corrupted data rather than true ages, so we filter out rows where <code class="language-plaintext highlighter-rouge">anchor_age</code> equals 0 and plot again.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">patients</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">anchor_age</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_density</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">anchor_age</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"anchor_age (filter out 0)"</span><span class="p">)</span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> 
</span></code></pre></div></div>
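
<p>Before dropping the zeros, it can help to see how many patients they affect. The following is a minimal sketch on a made-up tibble (<code class="language-plaintext highlighter-rouge">toy_patients</code> and its values are assumptions for illustration, not MIMIC-IV data):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical stand-in for `patients`; the values are invented.
toy_patients &lt;- tibble(
  subject_id = c(1, 1, 2, 3, 4),
  anchor_age = c(0, 0, 25, 63, 0)
)

# Keep one row per patient, then count how many have anchor_age == 0.
zero_summary &lt;- toy_patients %&gt;%
  distinct(subject_id, .keep_all = TRUE) %&gt;%
  summarise(n_patients = n(),
            n_zero     = sum(anchor_age == 0),
            share_zero = mean(anchor_age == 0))
zero_summary
</code></pre></div></div>

<p>If the share of zeros is large, the filtering step removes a substantial part of the sample, which is worth reporting alongside the plot.</p>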

<h2 id="q6-lab-results">Q6. Lab results</h2>

<p><code class="language-plaintext highlighter-rouge">labevents.csv.gz</code> (<a href="https://mimic-iv.mit.edu/docs/datasets/hosp/labevents/">https://mimic-iv.mit.edu/docs/datasets/hosp/labevents/</a>) contains all laboratory measurements for patients.</p>

<p>We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), glucose (50931), magnesium (50960), calcium (50893), and lactate (50813). Find the <code class="language-plaintext highlighter-rouge">itemid</code>s of these lab measurements from <code class="language-plaintext highlighter-rouge">d_labitems.csv.gz</code> and retrieve a subset of <code class="language-plaintext highlighter-rouge">labevents.csv.gz</code> only containing these items.</p>

<p><strong>solution</strong>:</p>

<p><strong>Quick check of the data file</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">readLines</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/hosp/labevents.csv.gz"</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5L</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><strong>Begin data processing</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">item_list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"creatinine"</span><span class="p">,</span><span class="w"> </span><span class="s2">"potassium"</span><span class="p">,</span><span class="w"> </span><span class="s2">"sodium"</span><span class="p">,</span><span class="w"> 
               </span><span class="s2">"chloride"</span><span class="p">,</span><span class="w"> </span><span class="s2">"bicarbonate"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hematocrit"</span><span class="p">,</span><span class="w"> 
               </span><span class="s2">"white blood cell"</span><span class="p">,</span><span class="w"> </span><span class="s2">"glucose"</span><span class="p">,</span><span class="w"> 
               </span><span class="s2">"magnesium"</span><span class="p">,</span><span class="w"> </span><span class="s2">"calcium"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lactate"</span><span class="p">)</span><span class="w">
</span><span class="n">lab_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"itemid"</span><span class="p">,</span><span class="w">
              </span><span class="s2">"charttime"</span><span class="p">,</span><span class="w"> </span><span class="s2">"valuenum"</span><span class="p">)</span><span class="w">
</span><span class="c1">#              </span><span class="w">
</span><span class="c1">#read and save</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="s2">"labevents_icu.csv.gz"</span><span class="p">)){</span><span class="w">
  </span><span class="n">system.time</span><span class="p">(</span><span class="n">labevents</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/hosp/labevents.csv.gz"</span><span class="p">),</span><span class="w">
                     </span><span class="n">select</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lab_name</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">()))</span><span class="w">
  </span><span class="n">labevents</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">semi_join</span><span class="p">(</span><span class="n">icustays</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">fwrite</span><span class="p">(</span><span class="s2">"labevents_icu.csv.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">labitems</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/hosp/d_labitems.csv.gz"</span><span class="p">))</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">labevents_icu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="s2">"labevents_icu.csv.gz"</span><span class="p">,</span><span class="w">
                                   </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">()))</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#define lookup function</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="n">.look_up</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"look_up"</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">){</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"look_up"</span><span class="p">){</span><span class="w">
    </span><span class="n">lookup</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">filter</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">label</span><span class="p">,</span><span class="w"> </span><span class="n">regex</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">ignore_case</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">select</span><span class="p">(</span><span class="n">itemid</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> 
    </span><span class="nf">return</span><span class="p">(</span><span class="n">lookup</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"find_id"</span><span class="p">){</span><span class="w">
    </span><span class="n">count</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">filter</span><span class="p">(</span><span class="n">itemid</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">as_vector</span><span class="p">(</span><span class="n">x</span><span class="o">$</span><span class="n">itemid</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">count</span><span class="p">(</span><span class="n">itemid</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">.</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
    </span><span class="nf">return</span><span class="p">(</span><span class="n">count</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#find data matching item id and save</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="n">idkey_all</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">item_list</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.list</span><span class="p">,</span><span class="w"> 
                    </span><span class="n">.look_up</span><span class="p">,</span><span class="w"> 
                    </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">labitems</span><span class="p">)</span><span class="w">
</span><span class="n">idkey</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">idkey_all</span><span class="p">,</span><span class="w"> 
                </span><span class="n">.look_up</span><span class="p">,</span><span class="w"> 
                </span><span class="n">type</span><span class="o">=</span><span class="s2">"find_id"</span><span class="p">,</span><span class="w"> 
                </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">labevents_icu</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="s2">"labevents_icu_selected.csv.gz"</span><span class="p">)){</span><span class="w">
  </span><span class="n">labevents_icu</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">itemid</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">as_vector</span><span class="p">(</span><span class="n">idkey</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">Reduce</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="w"> </span><span class="n">idkey_all</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"itemid"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">fwrite</span><span class="p">(</span><span class="s2">"labevents_icu_selected.csv.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
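
<p>The two modes of <code class="language-plaintext highlighter-rouge">.look_up</code> can be illustrated on toy data: first collect every dictionary row whose label matches the search term, then keep the candidate <code class="language-plaintext highlighter-rouge">itemid</code> that occurs most often in the events. This is only a sketch; <code class="language-plaintext highlighter-rouge">toy_items</code>, <code class="language-plaintext highlighter-rouge">toy_events</code>, and their contents are hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)
library(stringr)

# Invented dictionary and event tables, for illustration only.
toy_items &lt;- tibble(itemid = c(1L, 2L, 3L),
                    label  = c("Sodium", "Sodium, Whole Blood", "Chloride"))
toy_events &lt;- tibble(itemid = c(1L, 1L, 1L, 2L, 3L))

# Step 1 ("look_up"): all dictionary rows whose label matches the term.
candidates &lt;- toy_items %&gt;%
  filter(str_detect(label, regex("sodium", ignore_case = TRUE)))

# Step 2 ("find_id"): among the candidates, keep the most frequent itemid.
best_id &lt;- toy_events %&gt;%
  filter(itemid %in% candidates$itemid) %&gt;%
  count(itemid, sort = TRUE) %&gt;%
  slice_head(n = 1) %&gt;%
  pull(itemid)
best_id
</code></pre></div></div>

<p>Picking the most frequent match is a heuristic: it assumes the commonly charted item is the one of interest, which is worth checking against the labels returned in step 1.</p>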
<p><strong>Check the content of the processed data</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">labevents_icu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="s2">"labevents_icu_selected.csv.gz"</span><span class="p">,</span><span class="w">
                         </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="n">labevents_icu</span><span class="w">
</span></code></pre></div></div>
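
<p>Since the question already lists an <code class="language-plaintext highlighter-rouge">itemid</code> for each measurement, a more direct alternative is to filter on those ids. A minimal sketch, assuming the ids in the question are correct and using an invented <code class="language-plaintext highlighter-rouge">toy_labevents</code> table (the short names such as <code class="language-plaintext highlighter-rouge">wbc</code> are also my own):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# itemids as given in the question text; the names are illustrative.
lab_ids &lt;- c(creatinine = 50912, potassium = 50971, sodium = 50983,
             chloride = 50902, bicarbonate = 50882, hematocrit = 51221,
             wbc = 51301, glucose = 50931, magnesium = 50960,
             calcium = 50893, lactate = 50813)

# Invented events; 99999 is an item that is not of interest.
toy_labevents &lt;- tibble(subject_id = c(1, 1, 2),
                        itemid     = c(50912, 99999, 50983),
                        valuenum   = c(1.1, 5, 140))

# Keep only the listed items and attach a readable label.
toy_selected &lt;- toy_labevents %&gt;%
  filter(itemid %in% lab_ids) %&gt;%
  mutate(label = names(lab_ids)[match(itemid, lab_ids)])
toy_selected
</code></pre></div></div>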

<h2 id="q7-vitals-from-chartered-events">Q7. Vitals from charted events</h2>

<p>We are interested in the vitals for ICU patients: heart rate, mean and systolic blood pressure (invasive and noninvasive measurements combined), body temperature, SpO2, and respiratory rate. Find the <code class="language-plaintext highlighter-rouge">itemid</code>s of these vitals from <code class="language-plaintext highlighter-rouge">d_items.csv.gz</code> and retrieve a subset of <code class="language-plaintext highlighter-rouge">chartevents.csv.gz</code> only containing these items.</p>

<p><code class="language-plaintext highlighter-rouge">chartevents.csv.gz</code> (<a href="https://mimic-iv.mit.edu/docs/datasets/icu/chartevents/">https://mimic-iv.mit.edu/docs/datasets/icu/chartevents/</a>) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The <code class="language-plaintext highlighter-rouge">itemid</code> variable indicates a single measurement type in the database. The <code class="language-plaintext highlighter-rouge">value</code> variable is the value measured for <code class="language-plaintext highlighter-rouge">itemid</code>.</p>

<p><code class="language-plaintext highlighter-rouge">d_items.csv.gz</code> (<a href="https://mimic-iv.mit.edu/docs/datasets/icu/d_items/">https://mimic-iv.mit.edu/docs/datasets/icu/d_items/</a>) is the dictionary for the <code class="language-plaintext highlighter-rouge">itemid</code> in <code class="language-plaintext highlighter-rouge">chartevents.csv.gz</code>.</p>

<p><strong>solution</strong>:</p>

<p><strong>Quick check of the data file</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">readLines</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/icu/chartevents.csv.gz"</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5L</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><strong>Begin data processing</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">item_list2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"heart rate"</span><span class="p">,</span><span class="w"> </span><span class="s2">"arterial blood pressure systolic"</span><span class="p">,</span><span class="w"> 
                </span><span class="s2">"arterial blood pressure mean"</span><span class="p">,</span><span class="w"> 
                </span><span class="s2">"invasive blood pressure mean"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"invasive blood pressure systolic"</span><span class="p">,</span><span class="w"> </span><span class="s2">"temperature"</span><span class="p">,</span><span class="w"> 
                </span><span class="s2">"SpO2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"respiratory rate"</span><span class="p">)</span><span class="w">
</span><span class="n">char_name</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"itemid"</span><span class="p">,</span><span class="w">
              </span><span class="s2">"charttime"</span><span class="p">,</span><span class="w"> </span><span class="s2">"valuenum"</span><span class="p">)</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#read and save</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="s2">"chartevents_icu.csv.gz"</span><span class="p">)){</span><span class="w">
  </span><span class="n">system.time</span><span class="p">(</span><span class="n">chartevents</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> 
                                         </span><span class="s2">"/icu/chartevents.csv.gz"</span><span class="p">),</span><span class="w">
                     </span><span class="n">select</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">char_name</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">()))</span><span class="w">
  </span><span class="n">chartevents</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">semi_join</span><span class="p">(</span><span class="n">icustays</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">fwrite</span><span class="p">(</span><span class="s2">"chartevents_icu.csv.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">d_items</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="n">str_c</span><span class="p">(</span><span class="n">mimic_path</span><span class="p">,</span><span class="w"> </span><span class="s2">"/icu/d_items.csv.gz"</span><span class="p">))</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">chartevents_icu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="s2">"chartevents_icu.csv.gz"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">()))</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#find data matching item id and save</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="n">idkey_all2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">item_list2</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.list</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">.look_up</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d_items</span><span class="p">)</span><span class="w">
</span><span class="n">idkey2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">idkey_all2</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">.look_up</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">type</span><span class="o">=</span><span class="s2">"find_id"</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">chartevents_icu</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="s2">"chartevents_icu_selected.csv.gz"</span><span class="p">)){</span><span class="w">
  </span><span class="n">chartevents_icu</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">itemid</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">as_vector</span><span class="p">(</span><span class="n">idkey2</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">Reduce</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="w"> </span><span class="n">idkey_all2</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"itemid"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">fwrite</span><span class="p">(</span><span class="s2">"chartevents_icu_selected.csv.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p><strong>Check the content of the processed data</strong></p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chartevents_icu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="s2">"chartevents_icu_selected.csv.gz"</span><span class="p">,</span><span class="w">
                         </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="n">chartevents_icu</span><span class="w">
</span></code></pre></div></div>
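
<p>The question asks for invasive and noninvasive blood pressure measurements to be combined. One possible way to do that, sketched here on invented rows, is to map the matching labels to a shared vital name. The labels, the short names <code class="language-plaintext highlighter-rouge">sbp</code> and <code class="language-plaintext highlighter-rouge">mbp</code>, and the table itself are assumptions for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)
library(stringr)

# Invented rows; real labels should be taken from d_items.csv.gz.
toy_vitals &lt;- tibble(
  label    = c("Arterial Blood Pressure systolic",
               "Non Invasive Blood Pressure systolic",
               "Arterial Blood Pressure mean",
               "Heart Rate"),
  valuenum = c(120, 118, 80, 72)
)

# Collapse invasive and noninvasive readings into one name per vital.
toy_combined &lt;- toy_vitals %&gt;%
  mutate(vital = case_when(
    str_detect(label, regex("blood pressure systolic", ignore_case = TRUE)) ~ "sbp",
    str_detect(label, regex("blood pressure mean", ignore_case = TRUE)) ~ "mbp",
    TRUE ~ str_to_lower(label)
  ))
toy_combined
</code></pre></div></div>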

<h2 id="q8-putting-things-together">Q8. Putting things together</h2>

<p>Let us create a tibble for all ICU stays, where rows are</p>

<ul>
  <li>first ICU stay of each unique patient</li>
  <li>adults (age at admission &gt; 18)</li>
</ul>

<p>and columns contain at least the following variables</p>

<ul>
  <li>all variables in <code class="language-plaintext highlighter-rouge">icustays.csv.gz</code></li>
  <li>all variables in <code class="language-plaintext highlighter-rouge">admissions.csv.gz</code></li>
  <li>all variables in <code class="language-plaintext highlighter-rouge">patients.csv.gz</code></li>
  <li>first lab measurements during ICU stay</li>
  <li>first vitals measurement during ICU stay</li>
  <li>an indicator variable whether the patient died within 30 days of hospital admission</li>
</ul>
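
<p>The last item, the 30-day mortality indicator, can be sketched as follows. This assumes the admission table provides an admission time and a death time; the column names <code class="language-plaintext highlighter-rouge">admittime</code> and <code class="language-plaintext highlighter-rouge">deathtime</code> and the toy rows are assumptions, so adjust them to the actual data.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Invented admissions; a missing deathtime means no recorded death.
toy_adm &lt;- tibble(
  subject_id = 1:3,
  admittime  = as.POSIXct(c("2150-01-01 08:00:00", "2150-03-10 12:00:00",
                            "2150-05-05 00:00:00"), tz = "UTC"),
  deathtime  = as.POSIXct(c("2150-01-15 09:00:00", NA,
                            "2150-07-01 00:00:00"), tz = "UTC")
)

# TRUE only when a death is recorded within 30 days of admission.
toy_flagged &lt;- toy_adm %&gt;%
  mutate(days_to_death = as.numeric(difftime(deathtime, admittime,
                                             units = "days")),
         death_30d = !is.na(days_to_death) &amp; days_to_death &lt;= 30)
toy_flagged
</code></pre></div></div>

<p>Handling the missing death time explicitly keeps the indicator a clean TRUE/FALSE rather than propagating <code class="language-plaintext highlighter-rouge">NA</code> for survivors.</p>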

<p><strong>solution</strong>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">which</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">labevents_icu</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">length</span><span class="w">
</span><span class="n">which</span><span class="p">(</span><span class="n">duplicated</span><span class="p">(</span><span class="n">chartevents_icu</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">length</span><span class="w">
</span></code></pre></div></div>
<p>A quick check shows that some patients have more than one record at a single time point (duplicated records), so we keep only one record for each.</p>
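
<p>The idea of keeping one first measurement per patient and item can be shown on a small invented table. This is a sketch of the technique using <code class="language-plaintext highlighter-rouge">slice_min()</code> and <code class="language-plaintext highlighter-rouge">pivot_wider()</code>, not the exact pipeline used in this post; <code class="language-plaintext highlighter-rouge">toy</code> and its values are hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)
library(tidyr)

# Invented measurements; rows 1 and 2 are exact duplicates.
toy &lt;- tibble(
  subject_id = c(1, 1, 1, 2, 2),
  label      = c("Sodium", "Sodium", "Sodium", "Sodium", "Glucose"),
  charttime  = as.POSIXct(c("2150-01-01 08:00:00", "2150-01-01 08:00:00",
                            "2150-01-01 10:00:00", "2150-02-01 09:00:00",
                            "2150-02-01 09:30:00"), tz = "UTC"),
  valuenum   = c(140, 140, 138, 135, 90)
)

toy_first &lt;- toy %&gt;%
  distinct() %&gt;%                                       # drop exact duplicates
  group_by(subject_id, label) %&gt;%
  slice_min(charttime, n = 1, with_ties = FALSE) %&gt;%   # earliest record only
  ungroup() %&gt;%
  select(-charttime) %&gt;%
  pivot_wider(names_from = label, values_from = valuenum)
toy_first
</code></pre></div></div>

<p>Setting <code class="language-plaintext highlighter-rouge">with_ties = FALSE</code> guarantees a single row per group even when two different values share the same timestamp.</p>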

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#keep the first lab/vitals measurements</span><span class="w">
</span><span class="n">labevents_icu</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="n">charttime</span><span class="p">),</span><span class="w"> </span><span class="n">ymd_hms</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
              </span><span class="n">select</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"intime"</span><span class="p">)),</span><span class="w"> 
            </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">charttime</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">intime</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">charttime</span><span class="p">,</span><span class="w"> </span><span class="n">.by_group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice_head</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">charttime</span><span class="p">,</span><span class="w"> </span><span class="n">itemid</span><span class="p">,</span><span class="w"> </span><span class="n">intime</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">valuenum</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">fwrite</span><span class="p">(</span><span class="s2">"labevents_icu_final.csv.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">

</span><span class="n">chartevents_icu</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="n">charttime</span><span class="p">),</span><span class="w"> </span><span class="n">ymd_hms</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
              </span><span class="n">select</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"intime"</span><span class="p">)),</span><span class="w"> 
            </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">charttime</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">intime</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">subject_id</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">charttime</span><span class="p">,</span><span class="w"> </span><span class="n">.by_group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice_head</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">charttime</span><span class="p">,</span><span class="w"> </span><span class="n">itemid</span><span class="p">,</span><span class="w"> </span><span class="n">intime</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">valuenum</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">fwrite</span><span class="p">(</span><span class="s2">"chartevents_icu_final.csv.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span></code></pre></div></div>

<p>Then we can prepare the final dataset:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#read in data</span><span class="w">
</span><span class="n">labevents_icu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="s2">"labevents_icu_final.csv.gz"</span><span class="p">,</span><span class="w">
                         </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="n">chartevents_icu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fread</span><span class="p">(</span><span class="s2">"chartevents_icu_final.csv.gz"</span><span class="p">,</span><span class="w">
                         </span><span class="n">nThread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">getDTthreads</span><span class="p">())</span><span class="w">
</span><span class="c1">#all variables in `icustays.csv.gz`</span><span class="w">
</span><span class="n">final_icu_dataset</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">icustays</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># first ICU stay of each unique patient </span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">subject_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice_min</span><span class="p">(</span><span class="n">intime</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># add all variables in `admission.csv.gz` </span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">adm</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># add all variables in `patients.csv.gz` </span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">patients</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"subject_id"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">age_at_adm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">admittime</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">anchor_year</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">anchor_age</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># adults (age at admission &gt;= 18)</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">age_at_adm</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="m">18</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># add first vitals measurement during ICU stay</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">chartevents_icu</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># add first lab measurements during ICU stay</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">labevents_icu</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"subject_id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hadm_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># flag in-hospital deaths [indicator: "hospital_expire_flag"]</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">death_binary</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">hospital_expire_flag</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"Yes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># indicator variable whether the patient died within 30 days of admission</span><span class="w">
  </span><span class="c1"># difftime() fixes the unit to days; a bare subtraction picks the unit automatically</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">death_30</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">difftime</span><span class="p">(</span><span class="n">deathtime</span><span class="p">,</span><span class="w"> </span><span class="n">admittime</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"days"</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="s2">"Yes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">))</span><span class="w">

</span><span class="n">print</span><span class="p">(</span><span class="n">final_icu_dataset</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>

<p>Set NA to “No” in the indicator variable and compare it with <code class="language-plaintext highlighter-rouge">death_binary</code>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">final_icu_dataset</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">mutate_at</span><span class="p">(</span><span class="n">vars</span><span class="p">(</span><span class="n">death_30</span><span class="p">),</span><span class="w"> 
            </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">)})</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">death_30</span><span class="p">,</span><span class="w"> </span><span class="n">death_binary</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Data science" /><category term="practice" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Data science practice 1</title><link href="https://zianzhuang.com/posts/2021/03/blog-post-1/" rel="alternate" type="text/html" title="Data science practice 1" /><published>2021-03-09T00:00:00+00:00</published><updated>2021-03-09T00:00:00+00:00</updated><id>https://zianzhuang.com/posts/2021/03/blog-post-1</id><content type="html" xml:base="https://zianzhuang.com/posts/2021/03/blog-post-1/"><![CDATA[<!--more-->

<h2 id="q1-linux-shell-commands">Q1. Linux Shell Commands</h2>

<h3 id="q11">Q1.1</h3>
<p>This exercise (and later in this course) uses the <a href="https://mimic-iv.mit.edu">MIMIC-IV data</a>, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at <a href="https://mimic-iv.mit.edu/docs/access/">https://mimic-iv.mit.edu/docs/access/</a> to (1) complete the CITI <code class="language-plaintext highlighter-rouge">Data or Specimens Only Research</code> course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)</p>

<blockquote>
  <p><strong>solution</strong>: The verification links to the <a href="https://www.citiprogram.org/verify/?k600f5f26-0a64-4f74-a92f-83c92ec84f0e-40386397">completion report</a> and <a href="https://www.citiprogram.org/verify/?w4f427623-e63f-402b-b4df-48fd01dd09a6-40386397">completion certificate</a>.</p>
</blockquote>

<h3 id="q12">Q1.2</h3>
<p>The <code class="language-plaintext highlighter-rouge">/usr/203b-data/mimic-iv/</code> folder on the teaching server contains data sets from MIMIC-IV. Refer to <a href="https://mimic-iv.mit.edu/docs/datasets/">https://mimic-iv.mit.edu/docs/datasets/</a> for details of the data files.</p>

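<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls -l /usr/203b-data/mimic-iv
</code></pre></div></div>

<p>Please do <strong>not</strong> put these data files into Git; they are big. Do <strong>not</strong> copy them into your directory. Do <strong>not</strong> decompress the gz data files. These create unnecessarily big files on storage and are not big-data-friendly practices. Just read from the data folder <code class="language-plaintext highlighter-rouge">/usr/203b-data/mimic-iv</code> directly in the following exercises.</p>

<p>Use Bash commands to answer the following questions.</p>

<blockquote>
  <p><strong>solution</strong>: Done.</p>
</blockquote>

<h3 id="q13">Q1.3</h3>
<p>Display the contents of the folders <code class="language-plaintext highlighter-rouge">core</code>, <code class="language-plaintext highlighter-rouge">hosp</code>, <code class="language-plaintext highlighter-rouge">icu</code>. What are the functionalities of the bash commands <code class="language-plaintext highlighter-rouge">zcat</code>, <code class="language-plaintext highlighter-rouge">zless</code>, <code class="language-plaintext highlighter-rouge">zmore</code>, and <code class="language-plaintext highlighter-rouge">zgrep</code>?</p>

<blockquote>
  <p><strong>solution</strong>:</p>
</blockquote>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls -l /usr/203b-data/mimic-iv/core
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls -l /usr/203b-data/mimic-iv/hosp
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls -l /usr/203b-data/mimic-iv/icu
</code></pre></div></div>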

<blockquote>
  <p>The functionalities of bash commands:</p>
</blockquote>

<blockquote>
  <ul>
    <li><code class="language-plaintext highlighter-rouge">zcat</code>: a command-line utility that prints the contents of a compressed file to standard output without decompressing the file on disk.</li>
    <li><code class="language-plaintext highlighter-rouge">zmore</code>: a filter that allows examination of compressed or plain text files one screenful at a time on a soft-copy terminal.</li>
    <li><code class="language-plaintext highlighter-rouge">zless</code>: works the same way as <code class="language-plaintext highlighter-rouge">zmore</code>, except the decompressed output is displayed by the <code class="language-plaintext highlighter-rouge">less</code> command for additional viewing flexibility.</li>
    <li><code class="language-plaintext highlighter-rouge">zgrep</code>: searches for expressions in a given file, even if it is compressed.</li>
  </ul>
</blockquote>
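
<p>As a minimal sketch of how these tools read compressed data without writing a decompressed copy, the snippet below compresses a tiny CSV in memory and pipes it straight into <code class="language-plaintext highlighter-rouge">zcat</code> and <code class="language-plaintext highlighter-rouge">zgrep</code>. The CSV contents and the variable names are made up for illustration, and the sketch assumes the GNU versions of both tools, which read standard input when no file is given.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical three-line CSV (header plus two rows); no file is written to disk.
csv='subject_id,insurance\n1,Medicare\n2,Other\n'
# zcat decompresses to standard output, so wc -l can count the lines.
n_lines=$(printf "$csv" | gzip -c | zcat | wc -l | tr -d ' ')
# zgrep -c counts the matching lines inside the compressed stream.
n_hits=$(printf "$csv" | gzip -c | zgrep -c 'Medicare')
echo "lines: $n_lines, lines matching Medicare: $n_hits"
</code></pre></div></div>

<p>The same pipelines work on the course data by replacing the <code class="language-plaintext highlighter-rouge">printf | gzip -c</code> part with a path to a <code class="language-plaintext highlighter-rouge">.gz</code> file.</p>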

<h3 id="q14">Q1.4</h3>
<p>What’s the output of the following bash script?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for datafile in /usr/203b-data/mimic-iv/core/*.gz
  do
    ls -l $datafile
  done
</code></pre></div></div>
<blockquote>
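  <p><strong>solution</strong>: The bash script prints the long listing (<code class="language-plaintext highlighter-rouge">ls -l</code>) of every <code class="language-plaintext highlighter-rouge">.gz</code> file in the folder <code class="language-plaintext highlighter-rouge">core</code>, one file per loop iteration.</p>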
</blockquote>

<p>Display the number of lines in each data file using a similar loop.</p>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for datafile in /usr/203b-data/mimic-iv/core/*.gz
  do
    ls -l "$datafile"
    echo "the number of lines:"
    zcat "$datafile" | awk 'END { print NR }'
  done
</code></pre></div>  </div>
</blockquote>

<h3 id="q15">Q1.5</h3>
<p>Display the first few lines of <code class="language-plaintext highlighter-rouge">admissions.csv.gz</code>. How many rows are in this data file? How many unique patients (identified by <code class="language-plaintext highlighter-rouge">subject_id</code>) are in this data file? What are the possible values taken by each of the variables <code class="language-plaintext highlighter-rouge">admission_type</code>, <code class="language-plaintext highlighter-rouge">admission_location</code>, <code class="language-plaintext highlighter-rouge">insurance</code>, <code class="language-plaintext highlighter-rouge">language</code>, <code class="language-plaintext highlighter-rouge">marital_status</code>, and <code class="language-plaintext highlighter-rouge">ethnicity</code>? Also report the count for each unique value of these variables. (Hint: combine Linux commands <code class="language-plaintext highlighter-rouge">zcat</code>, <code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">tail</code>, <code class="language-plaintext highlighter-rouge">awk</code>, <code class="language-plaintext highlighter-rouge">uniq</code>, <code class="language-plaintext highlighter-rouge">wc</code>, and so on.)</p>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk '(NR&lt;=5)'
</code></pre></div>  </div>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "the number of rows:"
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk 'END { print NR }' 
</code></pre></div>  </div>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "the number of unique patients: (colname row excluded)"
# drop the header row before sorting, so it is the header that gets removed
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
tail -n +2 | awk -F ',' '{ print $1 }' | sort | uniq |
awk 'END { print NR }'
</code></pre></div>  </div>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for i in 6 7 9 10 11 12; 
do
echo "---------------------------"
# print the column name taken from the header row
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk  -F ',' -v i="$i" '{ print $i }' | 
awk '(NR&lt;=1) {printf "%-19s~%-20s\n", $1,
"(count &amp; values (* NULL/NA included))"}' 
# count each unique value of the column, header row excluded
zcat /usr/203b-data/mimic-iv/core/admissions.csv.gz | 
awk  -F ',' -v i="$i" '{ print $i }' | tail -n +2 | sort | uniq -c 
done
</code></pre></div>  </div>
</blockquote>
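
<p>The counting pattern used above can be tried on a small input first. The sketch below is only an illustration: the rows are a made-up miniature of <code class="language-plaintext highlighter-rouge">admissions.csv</code> with two columns, and the variable names are hypothetical. It counts distinct patients with <code class="language-plaintext highlighter-rouge">sort -u</code> and tabulates a categorical column with <code class="language-plaintext highlighter-rouge">sort | uniq -c</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical miniature table: a header row plus four admissions.
rows='subject_id,admission_type\n10,EW EMER.\n10,ELECTIVE\n11,EW EMER.\n12,URGENT\n'
# Count distinct patients: drop the header first, then deduplicate column 1.
n_patients=$(printf "$rows" | tail -n +2 | awk -F ',' '{ print $1 }' | sort -u | wc -l | tr -d ' ')
# Tabulate the values of column 2, most frequent first.
printf "$rows" | tail -n +2 | awk -F ',' '{ print $2 }' | sort | uniq -c | sort -rn
echo "unique patients: $n_patients"
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">uniq</code> only collapses adjacent duplicates, which is why the values are sorted before being counted. Splitting on commas is also only safe as long as no field contains a quoted comma.</p>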

<h2 id="q2-whos-popular-in-price-and-prejudice">Q2. Who’s popular in Pride and Prejudice</h2>

<h3 id="q21">Q2.1</h3>
<p>You and your friend have just finished reading <em>Pride and Prejudice</em> by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from <a href="http://www.gutenberg.org/cache/epub/42671/pg42671.txt">http://www.gutenberg.org/cache/epub/42671/pg42671.txt</a> and save it to your local folder.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://www.gutenberg.org/cache/epub/42671/pg42671.txt &gt; pride_and_prejudice.txt
</code></pre></div></div>
<p>Do <strong>not</strong> put this text file <code class="language-plaintext highlighter-rouge">pride_and_prejudice.txt</code> in Git. Using a <code class="language-plaintext highlighter-rouge">for</code> loop, how would you tabulate the number of times each of the four characters is mentioned?</p>

<blockquote>
  <p><strong>solution</strong>: Use <code class="language-plaintext highlighter-rouge">grep -o</code> to print every string that matches a name, one per line, and then count those lines with <code class="language-plaintext highlighter-rouge">wc -l</code>.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>declare -a name_arry=("Elizabeth" "Jane" "Lydia" "Darcy")
for name_need in "${name_arry[@]}"
do
grep -o "$name_need" pride_and_prejudice.txt | wc -l | 
awk -v var="$name_need" '{print "---------------" 
printf "%-10s|%-5s\n", var, $1}'
done
</code></pre></div>  </div>
</blockquote>
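
<p>The reason for <code class="language-plaintext highlighter-rouge">grep -o | wc -l</code> rather than <code class="language-plaintext highlighter-rouge">grep -c</code> is that <code class="language-plaintext highlighter-rouge">grep -c</code> counts matching lines, not matches, so a line that mentions a name twice is counted once. A small sketch of the difference, using a made-up sentence and hypothetical variable names instead of the novel:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical one-line sample; the name appears three times on a single line.
sample='Darcy met Elizabeth. Elizabeth smiled at Darcy, and Darcy bowed.'
# grep -o prints each match on its own line, so wc -l counts matches.
per_match=$(printf '%s\n' "$sample" | grep -o 'Darcy' | wc -l | tr -d ' ')
# grep -c counts lines that contain at least one match.
per_line=$(printf '%s\n' "$sample" | grep -c 'Darcy')
echo "matches: $per_match, matching lines: $per_line"
</code></pre></div></div>

<p>Here the first count is 3 and the second is 1, which is the undercount the solution avoids.</p>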

<h3 id="q22">Q2.2</h3>
<p>What’s the difference between the following two commands?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 'hello, world' &gt; test1.txt
</code></pre></div></div>
<p>and</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 'hello, world' &gt;&gt; test2.txt
</code></pre></div></div>

<blockquote>
  <p><strong>solution</strong>: <code class="language-plaintext highlighter-rouge">'&gt; test1.txt'</code> redirects output to <code class="language-plaintext highlighter-rouge">test1.txt</code>, overwriting the file if it already exists. <code class="language-plaintext highlighter-rouge">'&gt;&gt; test2.txt'</code> redirects output to <code class="language-plaintext highlighter-rouge">test2.txt</code>, appending the redirected output at the end of the file. Both create the file if it does not exist.</p>
</blockquote>

<h3 id="q23">Q2.3</h3>
<p>Using your favorite text editor (e.g., <code class="language-plaintext highlighter-rouge">vi</code>), type the following and save the file as <code class="language-plaintext highlighter-rouge">middle.sh</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
</code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">chmod</code> make the file executable by the owner, and run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./middle.sh pride_and_prejudice.txt 20 5
</code></pre></div></div>
<p>Explain the output. Explain the meaning of <code class="language-plaintext highlighter-rouge">"$1"</code>, <code class="language-plaintext highlighter-rouge">"$2"</code>, and <code class="language-plaintext highlighter-rouge">"$3"</code> in this shell script. Why do we need the first line of the shell script?</p>

<blockquote>
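  <p><strong>solution</strong>:</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./middle.sh pride_and_prejudice.txt 20 5
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">head -n 20</code> keeps the first 20 lines of the file and <code class="language-plaintext highlighter-rouge">tail -n 5</code> keeps the last 5 of those, so the script prints lines 16 to 20 of <code class="language-plaintext highlighter-rouge">pride_and_prejudice.txt</code>. The meaning of:</p>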
</blockquote>

<blockquote>
  <ul>
    <li><code class="language-plaintext highlighter-rouge">"$1"</code>: the first positional argument passed to the script (the file name <code class="language-plaintext highlighter-rouge">pride_and_prejudice.txt</code> here)</li>
    <li><code class="language-plaintext highlighter-rouge">"$2"</code>: the second positional argument (the end line <code class="language-plaintext highlighter-rouge">20</code> here)</li>
    <li><code class="language-plaintext highlighter-rouge">"$3"</code>: the third positional argument (the number of lines <code class="language-plaintext highlighter-rouge">5</code> here)</li>
  </ul>
</blockquote>

<blockquote>
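  <p>The first line <code class="language-plaintext highlighter-rouge">#!/bin/sh</code> (the shebang) means that the script is run with <code class="language-plaintext highlighter-rouge">/bin/sh</code> when it is executed directly as <code class="language-plaintext highlighter-rouge">./middle.sh</code>, rather than with whatever shell the user happens to be using. It’s a convention that tells the operating system which interpreter to use for the script.</p>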
</blockquote>
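
<p>The <code class="language-plaintext highlighter-rouge">head | tail</code> idiom is easy to check on numbered input. In this sketch, <code class="language-plaintext highlighter-rouge">seq 1 30</code> is a stand-in for a 30-line file (it is not part of the exercise), and <code class="language-plaintext highlighter-rouge">selected</code> is a hypothetical variable name. Each output line is its own line number, so the selected range is visible directly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Stand-in for a text file: the numbers 1..30, one per line.
# head -n 20 keeps lines 1-20; tail -n 5 keeps the last five of those.
selected=$(seq 1 30 | head -n 20 | tail -n 5)
echo "$selected"
</code></pre></div></div>

<p>This prints 16 through 20, matching the <code class="language-plaintext highlighter-rouge">end_line</code> and <code class="language-plaintext highlighter-rouge">num_lines</code> arguments 20 and 5.</p>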

<h2 id="q3-more-fun-with-linux">Q3. More fun with Linux</h2>

<p>Try these commands in Bash and interpret the results: <code class="language-plaintext highlighter-rouge">cal</code>, <code class="language-plaintext highlighter-rouge">cal 2021</code>, <code class="language-plaintext highlighter-rouge">cal 9 1752</code> (anything unusual?), <code class="language-plaintext highlighter-rouge">date</code>, <code class="language-plaintext highlighter-rouge">hostname</code>, <code class="language-plaintext highlighter-rouge">arch</code>, <code class="language-plaintext highlighter-rouge">uname -a</code>, <code class="language-plaintext highlighter-rouge">uptime</code>, <code class="language-plaintext highlighter-rouge">who am i</code>, <code class="language-plaintext highlighter-rouge">who</code>, <code class="language-plaintext highlighter-rouge">w</code>, <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">last | head</code>, <code class="language-plaintext highlighter-rouge">echo {con,pre}{sent,fer}{s,ed}</code>, <code class="language-plaintext highlighter-rouge">time sleep 5</code>, <code class="language-plaintext highlighter-rouge">history | tail</code>.</p>

<blockquote>
  <p><strong>solution</strong>:</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cal
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">cal</code> displays the calendar of the current month.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cal 2021
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">cal 2021</code> displays the calendar of every month in 2021.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cal 9 1752
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">cal 9 1752</code> displays an incomplete calendar of September 1752: September 2 is followed directly by September 14. Reason: the Gregorian calendar reform was adopted by the Kingdom of Great Britain in September 1752, so the 11 skipped days are missing from the calendar. [<a href="https://en.wikipedia.org/wiki/Cal_(Unix)">wiki</a>]</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">date</code> returns the current date and time in the default system timezone.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hostname
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">hostname</code> provides the name of the server.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>arch
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">arch</code> provides the computer architecture.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uname -a
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">uname -a</code> prints the name, version and other details about the current machine and the operating system running on it.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uptime
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">uptime</code> returns information about how long your system has been running together with the current time, number of users with running sessions, and the system load averages for the past 1, 5, and 15 minutes.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>who am i
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">who am i</code> displays the login name, terminal, and login time of the session in which the command is invoked.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>who
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">who</code> displays account information for all logged-in users: user login name, user’s terminal, time of login as well as the host the user is logged in from.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>w
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">w</code> displays information about currently logged in users and what each user is doing.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">id</code> prints the real and effective user ID (UID) and group ID (GID).</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>last | head
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">last | head</code> displays the 10 most recent login records from the file /var/log/wtmp; <code class="language-plaintext highlighter-rouge">last</code> lists the newest records first.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo {con,pre}{sent,fer}{s,ed}
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">echo {con,pre}{sent,fer}{s,ed}</code> uses brace expansion to generate every combination that takes one element from each of the three groups, 2 × 2 × 2 = 8 words in total: consents consented confers confered presents presented prefers prefered.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time sleep 5
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">time sleep 5</code> pauses for 5 seconds; <code class="language-plaintext highlighter-rouge">time</code> then reports the real, user, and system time the command took.</p>
  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set -o history
echo "zza"
history | tail
</code></pre></div>  </div>
  <p><code class="language-plaintext highlighter-rouge">history | tail</code> shows the last 10 commands that were run. <code class="language-plaintext highlighter-rouge">set -o history</code> turns command history on, since it is off by default in a non-interactive shell.</p>
</blockquote>]]></content><author><name>Zian</name><email>zianzhuang@ucla.edu</email></author><category term="Data science" /><category term="practice" /><summary type="html"><![CDATA[]]></summary></entry></feed>