Basis Values From Censored Data (Stefan Kloppenborg, www.kloppenborg.ca, 2021-02-10, https://www.kloppenborg.ca/2021/02/basis-values-censored-data/)<p>Earlier, I wrote a post about using a <a href="https://www.kloppenborg.ca/2021/02/likelihood-basis-values/">likelihood-based approach to calculating Basis values</a>. In that post, I hinted that likelihood-based approaches can be useful when dealing with censored&nbsp;data.</p> <p>First of all, what does censoring mean? It means that the value reported is either artificially high or artificially low. There are a few reasons that this could happen. It happens often with lifetime data: with fatigue tests, you set a number of cycles at which the specimen &#8220;runs out&#8221; and you stop the test; with studies of mortality, some of the subjects will still be alive when you do the analysis. In these cases, the true value is greater than the observed result, but you don&#8217;t know by how much. These are examples of <em>right-censored</em>&nbsp;data.</p> <p>Data can also be <em>left-censored</em>, meaning that the true value is less than the observed value. This can happen if some of the values are too small to be measured. Perhaps the instrument that you&#8217;re using can&#8217;t detect values below a certain&nbsp;amount.</p> <p>There is also <em>interval-censored</em> data. This often occurs in survey data. 
For example, you might have data for individuals aged 40-44, but you don&#8217;t know where they fall within that&nbsp;range.</p> <p>In this post, we&#8217;re going to deal with <em>right-censored</em>&nbsp;data.</p> <p>At my day job, I often deal with data from testing of metallic inserts installed in honeycomb sandwich panels. These metallic inserts have a hole in their centers that will accept a screw. Their purpose is to allow a screw to be fastened to the panel, and the strength of this connection is one of the important&nbsp;considerations.</p> <p>We determine the strength of the insert through testing. The usual test coupon that we use has two of these inserts installed, and we pull them away from each other to measure the shear strength. This is a convenient way of applying the load, but I&#8217;ve long thought that it must give low results. The loading of the coupon looks like&nbsp;this:</p> <p><img alt="Coupon" src="https://www.kloppenborg.ca/2021/02/basis-values-censored-data/basis-values-censored-data_files/figure-markdown/coupon.png"></p> <p>The reason that I&#8217;ve thought that this test method will give artificially low results is the fact that there are two inserts. The test ends when either one of these two inserts fails: the other insert must be stronger than the one that failed&nbsp;first.</p> <p>To illustrate this, let&#8217;s do a slightly silly thought experiment. Let&#8217;s imagine that we&#8217;re making a set of these coupons. We decide that we&#8217;re going to install one insert in each coupon first, then come back tomorrow and install the other insert in each coupon. Tomorrow comes around, and we decide to let the brand new intern install the second insert. The intern hasn&#8217;t yet been fully trained, and they accidentally install the wrong type of insert in the second hole; unfortunately, the wrong inserts look identical to the correct type. 
The correct type of insert has a strength that is always <span class="math">$$1000\,\text{lbf}$$</span>, but we don&#8217;t know that yet. The wrong type of insert always has a strength of exactly <span class="math">$$500\,\text{lbf}$$</span>. When we do our tests, all of the coupons fail on the side that the intern installed (the wrong insert) and the strength of each coupon is <span class="math">$$500\,\text{lbf}$$</span>. We conclude that the mean strength of these inserts is <span class="math">$$500\,\text{lbf}$$</span> with a very low&nbsp;variance.</p> <p>But, we&#8217;d be&nbsp;wrong.</p> <p>In this thought experiment, the actual mean strength of the inserts (considering both the correct and incorrect types of inserts) is <span class="math">$$750\,\text{lbf}$$</span> and there&#8217;s actually a pretty high variance. We were simply unable to observe the strength of the stronger inserts because of <em>censoring</em>.</p> <p>In a more realistic case, the parts will have strengths drawn from the same continuous distribution. From here on, we&#8217;ll assume that the strength of each individual insert is a random variable drawn from that distribution (that is, the strengths are <span class="caps">IID</span>).</p> <p>Let&#8217;s create some simulated data. We&#8217;ll start by loading a few R packages that we&#8217;ll&nbsp;need.</p> <div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span> <span class="nf">library</span><span class="p">(</span><span class="n">cmstatr</span><span class="p">)</span> <span class="nf">library</span><span class="p">(</span><span class="n">stats4</span><span class="p">)</span> </code></pre></div> <p>Next, we&#8217;ll create a sample of <span class="math">$$40$$</span> simulated insert strengths. 
These will be drawn from a normal distribution with a mean of <span class="math">$$1000$$</span> and a standard deviation of <span class="math">$$100$$</span>.</p> <div class="highlight"><pre><span></span><code><span class="n">pop_mean</span> <span class="o">&lt;-</span> <span class="m">1000</span> <span class="n">pop_sd</span> <span class="o">&lt;-</span> <span class="m">100</span> <span class="nf">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span> <span class="c1"># make this example reproducible</span> <span class="n">strength</span> <span class="o">&lt;-</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">40</span><span class="p">,</span> <span class="n">pop_mean</span><span class="p">,</span> <span class="n">pop_sd</span><span class="p">)</span> </code></pre></div> <p>Now let&#8217;s calculate the mean of this sample. We expect it to be fairly close to 1000, and indeed it&nbsp;is.</p> <div class="highlight"><pre><span></span><code><span class="nf">mean</span><span class="p">(</span><span class="n">strength</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## [1] 1004.518</span> </code></pre></div> <p>And we can also calculate the standard&nbsp;deviation:</p> <div class="highlight"><pre><span></span><code><span class="nf">sd</span><span class="p">(</span><span class="n">strength</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## [1] 89.77847</span> </code></pre></div> <p>For the strength of most aircraft structures, we are concerned with a lower tolerance bound of the strength. 
For multiple load-path structure, we need to calculate the B-Basis strength, which is the <span class="math">$$95\%$$</span> lower confidence bound on the 10-th percentile of the&nbsp;strength.</p> <p>Since we know the actual strength of all 40 inserts, we can calculate the B-Basis based on these actual insert strengths. Ideally, the B-Basis value that we calculate later will be close to this&nbsp;value.</p> <div class="highlight"><pre><span></span><code><span class="nf">basis_normal</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">strength</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="c1">## outliers_within_batch not run because parameter batch not specified</span> <span class="c1">## between_batch_variability not run because parameter batch not specified</span> <span class="c1">## </span> <span class="c1">## Call:</span> <span class="c1">## basis_normal(x = strength)</span> <span class="c1">## </span> <span class="c1">## Distribution: Normal ( n = 40 )</span> <span class="c1">## B-Basis: ( p = 0.9 , conf = 0.95 )</span> <span class="c1">## 852.1482</span> </code></pre></div> <p>Now, we&#8217;ll take these <span class="math">$$40$$</span> insert strengths and put them into <span class="math">$$20$$</span> coupons: each with two inserts. 
The observed coupon strength will be set to the <em>lower</em> of the two inserts installed in that coupon, because the coupon will fail as soon as either one of the installed inserts&nbsp;fails.</p> <div class="highlight"><pre><span></span><code><span class="n">dat</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span> <span class="n">ID</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">,</span> <span class="n">strength1</span> <span class="o">=</span> <span class="n">strength</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">],</span> <span class="n">strength2</span> <span class="o">=</span> <span class="n">strength</span><span class="p">[</span><span class="m">21</span><span class="o">:</span><span class="m">40</span><span class="p">]</span> <span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">rowwise</span><span class="p">()</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">strength_observed</span> <span class="o">=</span> <span class="nf">min</span><span class="p">(</span><span class="n">strength1</span><span class="p">,</span> <span class="n">strength2</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">ungroup</span><span class="p">()</span> <span class="n">dat</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## # A tibble: 20 x 4</span> <span class="err">## ID strength1 strength2 strength_observed</span> <span class="err">## &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span> <span class="err">## 1 1 944. 893. 893.</span> <span class="err">## 2 2 977. 978. 977.</span> <span class="err">## 3 3 1156. 897. 897.</span> <span class="err">## 4 4 1007. 927. 927.</span> <span class="err">## 5 5 1013. 937. 
937.</span> <span class="err">## 6 6 1172. 831. 831.</span> <span class="err">## 7 7 1046. 1084. 1046.</span> <span class="err">## 8 8 873. 1015. 873.</span> <span class="err">## 9 9 931. 886. 886.</span> <span class="err">## 10 10 955. 1125. 955.</span> <span class="err">## 11 11 1122. 1043. 1043.</span> <span class="err">## 12 12 1036. 970. 970.</span> <span class="err">## 13 13 1040. 1090. 1040.</span> <span class="err">## 14 14 1011. 1088. 1011.</span> <span class="err">## 15 15 944. 1082. 944.</span> <span class="err">## 16 16 1179. 1069. 1069.</span> <span class="err">## 17 17 1050. 1055. 1050.</span> <span class="err">## 18 18 803. 994. 803.</span> <span class="err">## 19 19 1070. 969. 969.</span> <span class="err">## 20 20 953. 962. 953.</span> </code></pre></div> <p>Let&#8217;s look at the summary statistics for this&nbsp;data:</p> <div class="highlight"><pre><span></span><code><span class="n">dat</span> <span class="o">%&gt;%</span> <span class="nf">summarise</span><span class="p">(</span> <span class="n">mean</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">strength_observed</span><span class="p">),</span> <span class="n">sd</span> <span class="o">=</span> <span class="nf">sd</span><span class="p">(</span><span class="n">strength_observed</span><span class="p">),</span> <span class="n">cv</span> <span class="o">=</span> <span class="nf">cv</span><span class="p">(</span><span class="n">strength_observed</span><span class="p">)</span> <span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## # A tibble: 1 x 3</span> <span class="err">## mean sd cv</span> <span class="err">## &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span> <span class="err">## 1 954. 75.1 0.0788</span> </code></pre></div> <p>Hmmm. We see the mean is much lower than the mean of the individual insert strength. 
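This is exactly what we should expect when each observation is the minimum of two draws. As an aside (this calculation isn&#8217;t part of the original analysis), for two independent draws from a normal distribution the expected minimum is <span class="math">$$\mu - \sigma/\sqrt{\pi}$$</span>, which is about <span class="math">$$944$$</span> for our population parameters. A quick simulation confirms&nbsp;this:</p>

```r
# Aside (illustration only, not part of the original analysis):
# for X1, X2 iid N(mu, sigma), E[min(X1, X2)] = mu - sigma / sqrt(pi).
mu <- 1000
sigma <- 100
mu - sigma / sqrt(pi)  # theoretical expected minimum, about 943.6

set.seed(456)  # arbitrary seed, chosen for this aside
x1 <- rnorm(100000, mu, sigma)
x2 <- rnorm(100000, mu, sigma)
mean(pmin(x1, x2))  # simulated mean of the per-coupon minima, close to theory
```

<p>So the mean of the observed coupon strengths is pulled low by construction, not by&nbsp;chance.</p> <p>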
Remember that the mean insert strength was <span class="math">$$1005$$</span>, but the mean strength of the coupons is <span class="math">$$954$$</span>.</p> <p>Next, we&#8217;ll naively calculate a B-Basis value from the measured strength. We&#8217;ll assume a normal&nbsp;distribution.</p> <div class="highlight"><pre><span></span><code><span class="n">dat</span> <span class="o">%&gt;%</span> <span class="nf">basis_normal</span><span class="p">(</span><span class="n">strength_observed</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## </span> <span class="err">## Call:</span> <span class="err">## basis_normal(data = ., x = strength_observed)</span> <span class="err">## </span> <span class="err">## Distribution: Normal ( n = 20 )</span> <span class="err">## B-Basis: ( p = 0.9 , conf = 0.95 )</span> <span class="err">## 809.1911</span> </code></pre></div> <p>We&#8217;ll just keep this number in mind for now and we&#8217;ll move on to the idea of using a likelihood-based approach to calculate a B-Basis value, considering the fact that this data is&nbsp;censored.</p> <p>The way that this data is censored might not be immediately obvious. But, each time we test one of these coupons, which contain two inserts, we actually get two pieces of data. We get the strength of one of the inserts. This is an <em>exact</em> value. But we also get a second piece of data. We know that the strength of the other insert is at least as high as the one that failed first. This is a <em>right-censored</em>&nbsp;value.</p> <p>In the <a href="https://www.kloppenborg.ca/2021/02/likelihood-basis-values/">previous post</a>, I gave an expression for the likelihood function. However, that function only considers exact observations. The expression for the likelihood, considering censored data, is as follows (see&nbsp;<a href='#Meeker_Hahn_Escobar2017' id='ref-Meeker_Hahn_Escobar2017-1'> Meeker et al. 
(2017) </a>).</p> <div class="math">$$\mathcal{L}\left(\theta\right) = \prod_{i=1}^{n} \begin{cases} f\left(X_i;\,\theta\right) &amp; \mbox{if } X_i \mbox{ is exact} \\ F\left(X_i;\,\theta\right) &amp; \mbox{if } X_i \mbox{ is left censored} \\ 1 - F\left(X_i;\,\theta\right) &amp; \mbox{if } X_i \mbox{ is right censored} \end{cases}$$</div> <p>Here, <span class="math">$$f()$$</span> is the probability density function and <span class="math">$$F()$$</span> is the cumulative distribution&nbsp;function.</p> <p>We can implement a log-likelihood function based on this in R as&nbsp;follows:</p> <div class="highlight"><pre><span></span><code><span class="n">log_likelihood_normal</span> <span class="o">&lt;-</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">censored</span><span class="p">)</span> <span class="p">{</span> <span class="nf">suppressWarnings</span><span class="p">(</span> <span class="nf">sum</span><span class="p">(</span><span class="nf">map2_dbl</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">censored</span><span class="p">,</span> <span class="nf">function</span><span class="p">(</span><span class="n">xi</span><span class="p">,</span> <span class="n">ci</span><span class="p">)</span> <span class="p">{</span> <span class="nf">if </span><span class="p">(</span><span class="n">ci</span> <span class="o">==</span> <span class="s">&quot;exact&quot;</span><span class="p">)</span> <span class="p">{</span> <span class="nf">dnorm</span><span class="p">(</span><span class="n">xi</span><span class="p">,</span> <span class="n">mean</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sd</span> <span class="o">=</span> <span class="n">sig</span><span class="p">,</span> <span class="n">log</span> <span
class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="p">}</span> <span class="n">else</span> <span class="nf">if </span><span class="p">(</span><span class="n">ci</span> <span class="o">==</span> <span class="s">&quot;left&quot;</span><span class="p">)</span> <span class="p">{</span> <span class="nf">pnorm</span><span class="p">(</span><span class="n">xi</span><span class="p">,</span> <span class="n">mean</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sd</span> <span class="o">=</span> <span class="n">sig</span><span class="p">,</span> <span class="n">log.p</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="p">}</span> <span class="n">else</span> <span class="nf">if </span><span class="p">(</span><span class="n">ci</span> <span class="o">==</span> <span class="s">&quot;right&quot;</span><span class="p">)</span> <span class="p">{</span> <span class="nf">pnorm</span><span class="p">(</span><span class="n">xi</span><span class="p">,</span> <span class="n">mean</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sd</span> <span class="o">=</span> <span class="n">sig</span><span class="p">,</span> <span class="n">log.p</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">lower.tail</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="p">}</span> <span class="n">else</span> <span class="p">{</span> <span class="nf">stop</span><span class="p">(</span><span class="s">&quot;Invalid value of censored&quot;</span><span class="p">)</span> <span class="p">}</span> <span class="p">}))</span> <span class="p">)</span> <span class="p">}</span> </code></pre></div> <p>We can use this log-likelihood function to find the maximum-likelihood estimates (<span class="caps">MLE</span>) of the population parameters using the 
<code>stats4::mle()</code> function. First, we&#8217;ll find the <span class="caps">MLE</span> based only on the observed strength of each coupon, taken as a single exact&nbsp;value.</p> <div class="highlight"><pre><span></span><code><span class="nf">mle</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="o">-</span><span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">,</span> <span class="s">&quot;exact&quot;</span><span class="p">)</span> <span class="p">},</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1000</span><span class="p">,</span> <span class="m">100</span><span class="p">)</span> <span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## </span> <span class="err">## Call:</span> <span class="err">## mle(minuslogl = function(mu, sig) {</span> <span class="err">## -log_likelihood_normal(mu, sig, dat$strength_observed, &quot;exact&quot;)</span> <span class="err">## }, start = c(1000, 100))</span> <span class="err">## </span> <span class="err">## Coefficients:</span> <span class="err">## mu sig </span> <span class="err">## 953.70230 73.27344</span> </code></pre></div> <p>(Note that the value of <code>start</code> is just a starting point for the numerical&nbsp;optimization.)</p> <p>Here, we get the same mean that we previously&nbsp;calculated.</p> <p>Now, we&#8217;ll repeat the <span class="caps">MLE</span> procedure, but now give it two pieces of data for each coupon: one exact value, and one right-censored&nbsp;value.</p> <div
class="highlight"><pre><span></span><code><span class="nf">mle</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="o">-</span><span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="nf">c</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">,</span> <span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">),</span> <span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s">&quot;exact&quot;</span><span class="p">,</span> <span class="m">20</span><span class="p">),</span> <span class="nf">rep</span><span class="p">(</span><span class="s">&quot;right&quot;</span><span class="p">,</span> <span class="m">20</span><span class="p">)))</span> <span class="p">},</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1000</span><span class="p">,</span> <span class="m">100</span><span class="p">)</span> <span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## </span> <span class="err">## Call:</span> <span class="err">## mle(minuslogl = function(mu, sig) {</span> <span class="err">## -log_likelihood_normal(mu, sig, c(dat$strength_observed, </span> <span class="err">## dat$strength_observed), c(rep(&quot;exact&quot;, 20), rep(&quot;right&quot;, </span> <span class="err">## 20)))</span> <span class="err">## }, start = c(1000, 100))</span> <span class="err">## </span> <span class="err">## Coefficients:</span> <span class="err">## mu sig </span> <span class="err">## 
1003.90717 88.51774</span> </code></pre></div> <p>The mean estimated this way is remarkably close to the true&nbsp;value.</p> <p>As we did in the previous blog post, we&#8217;ll next create a function that returns the profile likelihood based on a value of <span class="math">$$t_p$$</span> (the value that the proportion <span class="math">$$p$$</span> of the population is&nbsp;below).</p> <div class="highlight"><pre><span></span><code><span class="n">profile_likelihood_normal</span> <span class="o">&lt;-</span> <span class="nf">function</span><span class="p">(</span><span class="n">tp</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">censored</span><span class="p">)</span> <span class="p">{</span> <span class="n">m</span> <span class="o">&lt;-</span> <span class="nf">mle</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="o">-</span><span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">censored</span><span class="p">)</span> <span class="p">},</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1000</span><span class="p">,</span> <span class="m">100</span><span class="p">)</span> <span class="c1"># A starting guess</span> <span class="p">)</span> <span class="n">mu_hat</span> <span class="o">&lt;-</span> <span class="n">m</span><span class="o">&#64;</span><span class="n">coef</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="n">sig_hat</span> <span class="o">&lt;-</span> <span class="n">m</span><span 
class="o">&#64;</span><span class="n">coef</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="n">ll_hat</span> <span class="o">&lt;-</span> <span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu_hat</span><span class="p">,</span> <span class="n">sig_hat</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">censored</span><span class="p">)</span> <span class="nf">optimise</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="nf">exp</span><span class="p">(</span> <span class="nf">log_likelihood_normal</span><span class="p">(</span> <span class="n">mu</span> <span class="o">=</span> <span class="n">tp</span> <span class="o">-</span> <span class="n">sig</span> <span class="o">*</span> <span class="nf">qnorm</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">sig</span> <span class="o">=</span> <span class="n">sig</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">censored</span> <span class="o">=</span> <span class="n">censored</span> <span class="p">)</span> <span class="o">-</span> <span class="n">ll_hat</span> <span class="p">)</span> <span class="p">},</span> <span class="n">interval</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">sig_hat</span> <span class="o">*</span> <span class="m">5</span><span class="p">),</span> <span class="n">maximum</span> <span class="o">=</span> <span class="kc">TRUE</span> <span class="p">)</span><span class="o">$</span><span class="n">objective</span> <span class="p">}</span> </code></pre></div> <p>The shape of this curve is as&nbsp;follows:</p> <div 
class="highlight"><pre><span></span><code><span class="nf">data.frame</span><span class="p">(</span> <span class="n">tp</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">700</span><span class="p">,</span> <span class="m">1000</span><span class="p">,</span> <span class="n">length.out</span> <span class="o">=</span> <span class="m">200</span><span class="p">)</span> <span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">rowwise</span><span class="p">()</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">R</span> <span class="o">=</span> <span class="nf">profile_likelihood_normal</span><span class="p">(</span> <span class="n">tp</span><span class="p">,</span> <span class="m">0.1</span><span class="p">,</span> <span class="nf">c</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">,</span> <span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">),</span> <span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s">&quot;exact&quot;</span><span class="p">,</span> <span class="m">20</span><span class="p">),</span> <span class="nf">rep</span><span class="p">(</span><span class="s">&quot;right&quot;</span><span class="p">,</span> <span class="m">20</span><span class="p">))</span> <span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">ggplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">tp</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">R</span><span class="p">))</span> <span class="o">+</span> <span class="nf">geom_line</span><span class="p">()</span> <span class="o">+</span> <span 
class="nf">ggtitle</span><span class="p">(</span><span class="s">&quot;Profile Likelihood for the 10th Percentile&quot;</span><span class="p">)</span> </code></pre></div> <p><img alt="unnamed-chunk-13-1" src="https://www.kloppenborg.ca/2021/02/basis-values-censored-data/basis-values-censored-data_files/figure-markdown/unnamed-chunk-13-1.png"></p> <p>Next, we&#8217;ll find the value of <span class="math">$$u$$</span> that satisfies this&nbsp;equation:</p> <div class="math">$$0.05 = \frac{ \int_{-\infty}^{u}R(t_p) d t_p }{ \int_{-\infty}^{\infty}R(t_p) d t_p }$$</div> <div class="highlight"><pre><span></span><code><span class="n">fn</span> <span class="o">&lt;-</span> <span class="nf">Vectorize</span><span class="p">(</span><span class="nf">function</span><span class="p">(</span><span class="n">tp</span><span class="p">)</span> <span class="p">{</span> <span class="nf">profile_likelihood_normal</span><span class="p">(</span> <span class="n">tp</span><span class="p">,</span> <span class="m">0.1</span><span class="p">,</span> <span class="nf">c</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">,</span> <span class="n">dat</span><span class="o">$</span><span class="n">strength_observed</span><span class="p">),</span> <span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s">&quot;exact&quot;</span><span class="p">,</span> <span class="m">20</span><span class="p">),</span> <span class="nf">rep</span><span class="p">(</span><span class="s">&quot;right&quot;</span><span class="p">,</span> <span class="m">20</span><span class="p">)))</span> <span class="p">})</span> <span class="n">denominator</span> <span class="o">&lt;-</span> <span class="nf">integrate</span><span class="p">(</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fn</span><span class="p">,</span> <span class="n">lower</span> <span 
class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">upper</span> <span class="o">=</span> <span class="m">1000</span> <span class="p">)</span> <span class="nf">uniroot</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">upper</span><span class="p">)</span> <span class="p">{</span> <span class="n">trial_area</span> <span class="o">&lt;-</span> <span class="nf">integrate</span><span class="p">(</span> <span class="n">fn</span><span class="p">,</span> <span class="n">lower</span> <span class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">upper</span> <span class="o">=</span> <span class="n">upper</span> <span class="p">)</span> <span class="nf">return</span><span class="p">(</span><span class="n">trial_area</span><span class="o">$</span><span class="n">value</span> <span class="o">/</span> <span class="n">denominator</span><span class="o">$</span><span class="n">value</span> <span class="o">-</span> <span class="m">0.05</span><span class="p">)</span> <span class="p">},</span> <span class="n">interval</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">700</span><span class="p">,</span> <span class="m">1000</span><span class="p">)</span> <span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">##$root</span> <span class="err">## [1] 845.7739</span> <span class="err">## </span> <span class="err">## $f.root</span> <span class="err">## [1] -1.8327e-08</span> <span class="err">## </span> <span class="err">##$iter</span> <span class="err">## [1] 10</span> <span class="err">## </span> <span class="err">## $init.it</span> <span class="err">## [1] NA</span> <span class="err">## </span> <span class="err">##$estim.prec</span> <span class="err">## [1] 6.103516e-05</span> </code></pre></div> <p>This value of <span class="math">$$846$$</span> is much higher than the value 
of <span class="math">$$809$$</span> that we found earlier based on the coupon strength. But this value of <span class="math">$$846$$</span> is a little lower than the B-Basis of <span class="math">$$852$$</span> that was based on the actual strength of all of the inserts&nbsp;installed.</p> <p>One way to view the differences between these three numbers is as follows. The B-Basis strength is related to the 10-th percentile of the strength: it is actually a lower confidence bound on that percentile. If we have only a little bit of information about the strength, there is a lot of uncertainty about the actual 10-th percentile, so the lower confidence bound is quite low. If we have a lot of information about the strength, the uncertainty is small, so the lower confidence bound is close to the actual 10-th&nbsp;percentile.</p> <p>When we calculated a B-Basis from the observed coupon strength, we had 20 pieces of information. When we calculated a B-Basis from the actual insert strength, we had 40 pieces of information. When we calculated the B-Basis value considering the censored data, we had 40 pieces of information, but half of that information wasn&#8217;t as informative as the other half: the exact values provide more information than the censored&nbsp;values.</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%&#64;#&#64;#')) { var align = "center", indent = "0em", linebreak = "false"; if (false) { align = (screen.width < 768) ? "left" : align; indent = (screen.width < 768) ? "0em" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%&#64;#&#64;#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ?
"innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: '"+ align +"'," + " displayIndent: '"+ indent +"'," + " showMathMenu: true," + " messageStyle: 'normal'," + " tex2jax: { " + " inlineMath: [ ['\\\$$','\\\$$'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," + " linebreaks: { automatic: "+ linebreak +", width: '90% container' }," + " }, " + "}); " + "if ('default' !== 'default') {" + "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" + "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" + "VARIANT['normal'].fonts.unshift('MathJax_default');" + "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" + "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" + "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" + "});" + "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" + "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" + "VARIANT['normal'].fonts.unshift('MathJax_default');" + "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" + "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" + "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" + "});" + "}"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Basis Values Using a Likelihood Approach2021-02-05T01:28:04-05:002021-02-10T02:56:48-05:00Stefan Kloppenborgtag:www.kloppenborg.ca,2021-02-05:/2021/02/likelihood-basis-values/<p>All materials have some variability in their strength: some pieces of a 
given material are stronger than others. The design standards for civil aircraft mandate that one must account for this material variability. This is done by setting appropriate material allowables such that either <span class="math">$$90\%$$</span> or <span class="math">$$99\%$$</span> of the material …</p><p>All materials have some variability in their strength: some pieces of a given material are stronger than others. The design standards for civil aircraft mandate that one must account for this material variability. This is done by setting appropriate material allowables such that either <span class="math">$$90\%$$</span> or <span class="math">$$99\%$$</span> of the material will have a strength greater than the allowable with <span class="math">$$95\%$$</span> confidence. These values are referred to as B-Basis and A-Basis values, respectively. In the language of statistics, they are lower tolerance bounds on the material&nbsp;strength.</p> <p>When you&#8217;re designing an aircraft part, one of the first steps is to determine the allowables to which you&#8217;ll compare the stress when determining the margin of safety. For many metals, A- or B-Basis values are published, and the designer will use those published values as the allowable. However, when it comes to composite materials, it is often up to the designer to determine the A- or B-Basis value&nbsp;themselves.</p> <p>The most common way of calculating Basis values is to use the statistical methods published in Volume 1 of <a href="https://www.cmh17.org/"><span class="caps">CMH</span>-17</a> and implemented in the R package <a href="https://www.cmstatr.net/"><code>cmstatr</code></a> (among other implementations). 
These methods are based on <em>frequentist inference</em>.</p> <p>For example, if the data is assumed to be normally distributed, with this frequentist approach, you would calculate the B-Basis value using the non-central <em>t-</em>distribution (see, for example,&nbsp;<a href='#Krishnamoorthy_Mathew_2008' id='ref-Krishnamoorthy_Mathew_2008-1'> Krishnamoorthy and Mathew (2008) </a>).</p> <p>However, the frequentist approach is not the only way to calculate Basis values: a <em>likelihood</em>-based approach can be used as well. The book <em>Statistical Intervals</em> by <a href='#Meeker_Hahn_Escobar2017' id='ref-Meeker_Hahn_Escobar2017-1'> Meeker et al. (2017) </a> discusses this approach, among other&nbsp;topics.</p> <p>The basic idea of likelihood-based inference is that you can observe some data (by doing mechanical tests, or whatever), but you don&#8217;t yet know the population parameters, such as the mean and the variance. But you can say that some possible values of the population parameters are more likely than others. For example, if you perform 18 tension tests of a material and the results are all around 100, the likelihood that the population mean is 100 is pretty high, but the likelihood that the population mean is 50 is really low. You can define a mathematical function to quantify this likelihood: this is called the <em>likelihood function</em>.</p> <p>If you just need a point-estimate of the population parameters, you can find the highest value of this likelihood function: this is called the maximum likelihood estimate.
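<p>To make this concrete, here is a short Python sketch (the numbers are made up for illustration, not from any real test) that evaluates a normal log-likelihood at two candidate means:</p>

```python
import math

def normal_log_likelihood(mu, sigma, data):
    # Sum of normal log-densities: log L(mu, sigma) for the sample
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

# Hypothetical test results, all around 100
data = [98.5, 101.2, 99.8, 100.7, 99.1, 100.3]

ll_at_100 = normal_log_likelihood(100, 2, data)
ll_at_50 = normal_log_likelihood(50, 2, data)
# ll_at_100 is far greater than ll_at_50: given these observations, a
# population mean of 100 is far more plausible than a mean of 50
```

<p>The log-likelihood at a mean of 100 is far higher than at a mean of 50, which is exactly the intuition described above.</p>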
If you need to find an interval or a bound (for example, the B-Basis, which is a lower tolerance bound), you can plot this likelihood function versus the population parameters and use this distribution of likelihood to determine a range of population parameters that are &#8220;sufficiently likely&#8221; to be within the&nbsp;interval.</p> <p>The likelihood-based approach to calculating Basis values is more computationally expensive, but it allows you to deal with data that is left- or right-censored, and you can use the same computational algorithm for a wide variety of location-scale distributions. I&#8217;m planning on writing about calculating Basis values for censored data&nbsp;soon.</p> <h1>Example&nbsp;Data</h1> <p>For the purpose of this blog post, we&#8217;ll look at some data that is included in the <code>cmstatr</code> package. We&#8217;ll use this data to calculate a B-Basis value using the more traditional frequentist approach, then using a likelihood-based&nbsp;approach.</p> <p>We&#8217;ll start by loading several R packages that we&#8217;ll&nbsp;need:</p> <div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">cmstatr</span><span class="p">)</span> <span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span> <span class="nf">library</span><span class="p">(</span><span class="n">stats4</span><span class="p">)</span> </code></pre></div> <p>Next, we&#8217;ll get the data that we&#8217;re going to use. We&#8217;ll use the &#8220;warp tension&#8221; data from the <code>carbon.fabric.2</code> data set that comes with <code>cmstatr</code>.
We&#8217;ll consider only the <code>RTD</code> environmental&nbsp;condition.</p> <div class="highlight"><pre><span></span><code><span class="n">carbon.fabric.2</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">test</span> <span class="o">==</span> <span class="s">&quot;WT&quot;</span> <span class="o">&amp;</span> <span class="n">condition</span> <span class="o">==</span> <span class="s">&quot;RTD&quot;</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## test condition batch thickness nplies strength modulus failure_mode</span> <span class="err">## 1 WT RTD A 0.113 14 129.224 8.733 LAB</span> <span class="err">## 2 WT RTD A 0.112 14 144.702 8.934 LAT,LWB</span> <span class="err">## 3 WT RTD A 0.113 14 137.194 8.896 LAB</span> <span class="err">## 4 WT RTD A 0.113 14 139.728 8.835 LAT,LWB</span> <span class="err">## 5 WT RTD A 0.113 14 127.286 9.220 LAB</span> <span class="err">## 6 WT RTD A 0.111 14 129.261 9.463 LAT</span> <span class="err">## 7 WT RTD A 0.112 14 130.031 9.348 LAB</span> <span class="err">## 8 WT RTD B 0.111 14 140.038 9.244 LAT,LGM</span> <span class="err">## 9 WT RTD B 0.111 14 132.880 9.267 LWT</span> <span class="err">## 10 WT RTD B 0.113 14 132.104 9.198 LAT</span> <span class="err">## 11 WT RTD B 0.114 14 137.618 9.179 LAT,LAB</span> <span class="err">## 12 WT RTD B 0.113 14 139.217 9.123 LAB</span> <span class="err">## 13 WT RTD B 0.113 14 134.912 9.116 LAT</span> <span class="err">## 14 WT RTD B 0.111 14 141.558 9.434 LAB / LAT</span> <span class="err">## 15 WT RTD C 0.108 14 150.242 9.451 LAB</span> <span class="err">## 16 WT RTD C 0.109 14 147.053 9.391 LGM</span> <span class="err">## 17 WT RTD C 0.111 14 145.001 9.318 LAT,LWB</span> <span class="err">## 18 WT RTD C 0.113 14 135.686 8.991 LAT / LAB</span> <span class="err">## 19 WT RTD C 0.112 14 136.075 9.221 LAB</span> <span class="err">## 20 WT RTD C 0.114 14 
143.738 8.803 LAT,LGM</span> <span class="err">## 21 WT RTD C 0.113 14 143.715 8.893 LAT,LAB</span> <span class="err">## 22 WT RTD C 0.113 14 147.981 8.974 LGM,LWB</span> <span class="err">## 23 WT RTD C 0.112 14 148.418 9.118 LAT,LWB</span> <span class="err">## 24 WT RTD C 0.113 14 135.435 9.217 LAT/LAB</span> <span class="err">## 25 WT RTD C 0.113 14 146.285 8.920 LWT/LWB</span> <span class="err">## 26 WT RTD C 0.111 14 139.078 9.015 LAT</span> <span class="err">## 27 WT RTD C 0.112 14 146.825 9.036 LAT/LWT</span> <span class="err">## 28 WT RTD C 0.110 14 148.235 9.336 LWB/LAB</span> </code></pre></div> <p>We really care only about the strength vector from this data, so we&#8217;ll save that vector by itself in a variable for easy access&nbsp;later.</p> <div class="highlight"><pre><span></span><code><span class="n">dat</span> <span class="o">&lt;-</span> <span class="p">(</span><span class="n">carbon.fabric.2</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">test</span> <span class="o">==</span> <span class="s">&quot;WT&quot;</span> <span class="o">&amp;</span> <span class="n">condition</span> <span class="o">==</span> <span class="s">&quot;RTD&quot;</span><span class="p">))[[</span><span class="s">&quot;strength&quot;</span><span class="p">]]</span> <span class="n">dat</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## [1] 129.224 144.702 137.194 139.728 127.286 129.261 130.031 140.038 132.880</span> <span class="err">## [10] 132.104 137.618 139.217 134.912 141.558 150.242 147.053 145.001 135.686</span> <span class="err">## [19] 136.075 143.738 143.715 147.981 148.418 135.435 146.285 139.078 146.825</span> <span class="err">## [28] 148.235</span> </code></pre></div> <h1>Frequentist&nbsp;B-Basis</h1> <p>We can use the <code>cmstatr</code> package to calculate the B-Basis value from this example data.
We&#8217;re going to assume that the data follows a normal distribution throughout this blog&nbsp;post.</p> <div class="highlight"><pre><span></span><code><span class="nf">basis_normal</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">dat</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="m">0.9</span><span class="p">,</span> <span class="n">conf</span> <span class="o">=</span> <span class="m">0.95</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## </span> <span class="err">## Call:</span> <span class="err">## basis_normal(x = dat, p = 0.9, conf = 0.95)</span> <span class="err">## </span> <span class="err">## Distribution: Normal ( n = 28 )</span> <span class="err">## B-Basis: ( p = 0.9 , conf = 0.95 )</span> <span class="err">## 127.5415</span> </code></pre></div> <p>So using this approach, we get a B-Basis value of <span class="math">$$127.54$$</span>.</p> <h1>Likelihood-Based&nbsp;B-Basis</h1> <p>The first step in implementing a likelihood-based approach is to define a likelihood function. This function is the product of the probability density function (<span class="caps">PDF</span>) at each observation (<span class="math">$$X_i$$</span>), given a set of population parameters (<span class="math">$$\theta$$</span>) (see&nbsp;<a href='#Wasserman_2004' id='ref-Wasserman_2004-1'> Wasserman (2004) </a>).</p> <div class="math">$$\mathcal{L}\left(\theta\right) = \prod_{i=1}^{n} f\left(X_i;\,\theta\right)$$</div> <p>We&#8217;ll actually implement a log-likelihood function in R because taking a log-transform avoids some numerical issues. 
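<p>To see the numerical issue that the log-transform avoids: the likelihood is a product of many density values, each much smaller than one, and in double-precision arithmetic that product quickly underflows to zero, while the sum of logs stays finite. A quick Python illustration (with made-up numbers):</p>

```python
import math

def normal_pdf(x, mu, sig):
    # Normal probability density function
    return math.exp(-((x - mu) / sig) ** 2 / 2) / (sig * math.sqrt(2 * math.pi))

# A few hundred made-up observations near 139
data = [139.0 + 0.01 * i for i in range(400)]

likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x, 139.6, 6.6)   # product of small densities

log_likelihood = sum(math.log(normal_pdf(x, 139.6, 6.6)) for x in data)

# The raw product underflows to exactly 0.0, but the log-likelihood
# is an ordinary finite (negative) number
```

<p>Any optimizer working on the raw product would see a flat zero; working on the log-likelihood, it sees a well-behaved surface.</p>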
This log-likelihood function will take three arguments: the two parameters of the distribution (<code>mu</code> and <code>sigma</code>) and a vector of the&nbsp;data.</p> <div class="highlight"><pre><span></span><code><span class="n">log_likelihood_normal</span> <span class="o">&lt;-</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="nf">suppressWarnings</span><span class="p">(</span> <span class="nf">sum</span><span class="p">(</span> <span class="nf">dnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mean</span> <span class="o">=</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sd</span> <span class="o">=</span> <span class="n">sig</span><span class="p">,</span> <span class="n">log</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="p">)</span> <span class="p">)</span> <span class="p">}</span> </code></pre></div> <p>We can use this log-likelihood function to find the maximum-likelihood estimates (<span class="caps">MLE</span>) of the population parameters using the <code>stats4::mle()</code> function. 
This function takes the negative log-likelihood function and a starting guess for the&nbsp;parameters.</p> <div class="highlight"><pre><span></span><code><span class="nf">mle</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="o">-</span><span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">dat</span><span class="p">)</span> <span class="p">},</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">130</span><span class="p">,</span> <span class="m">6.5</span><span class="p">)</span> <span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## </span> <span class="err">## Call:</span> <span class="err">## mle(minuslogl = function(mu, sig) {</span> <span class="err">## -log_likelihood_normal(mu, sig, dat)</span> <span class="err">## }, start = c(130, 6.5))</span> <span class="err">## </span> <span class="err">## Coefficients:</span> <span class="err">## mu sig </span> <span class="err">## 139.626036 6.594905</span> </code></pre></div> <p>We will be denoting these maximum likelihood estimates as <span class="math">$$\hat\mu$$</span> and <span class="math">$$\hat\sigma$$</span>. 
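<p>As a cross-check, for the normal distribution the maximum likelihood estimates have closed forms: the MLE of the mean is the sample mean, and the MLE of the standard deviation divides the sum of squared deviations by <span class="math">$$n$$</span> rather than by <span class="math">$$n - 1$$</span>. A Python sketch using the strength data shown earlier:</p>

```python
import math

# Warp-tension strength data (RTD condition) from above
dat = [129.224, 144.702, 137.194, 139.728, 127.286, 129.261, 130.031,
       140.038, 132.880, 132.104, 137.618, 139.217, 134.912, 141.558,
       150.242, 147.053, 145.001, 135.686, 136.075, 143.738, 143.715,
       147.981, 148.418, 135.435, 146.285, 139.078, 146.825, 148.235]

n = len(dat)
mu_hat = sum(dat) / n  # MLE of mu: the sample mean
ss = sum((x - mu_hat) ** 2 for x in dat)
sig_hat = math.sqrt(ss / n)           # MLE of sigma: divide by n
sig_sample = math.sqrt(ss / (n - 1))  # sample standard deviation: divide by n - 1

# mu_hat is about 139.63, sig_hat about 6.60 and sig_sample about 6.72,
# matching the numerical MLE results and sd(dat) reported in this post
```

<p>The divide-by-<span class="math">$$n$$</span> versus divide-by-<span class="math">$$n - 1$$</span> distinction is why the numerical MLE of <span class="math">$$\sigma$$</span> doesn&#8217;t exactly equal <code>sd(dat)</code>.</p>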
They match the sample mean and sample standard deviation within a reasonable tolerance, but are not exactly&nbsp;equal.</p> <div class="highlight"><pre><span></span><code><span class="nf">mean</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## [1] 139.6257</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="nf">sd</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## [1] 6.716047</span> </code></pre></div> <p>The relative likelihood is the ratio of the value of the likelihood function evaluated at a given set of parameters to the value of the likelihood function evaluated at the <span class="caps">MLE</span> of the parameters. The relative likelihood would then be a function with two arguments: one for each of the parameters <span class="math">$$\mu$$</span> and <span class="math">$$\sigma$$</span>. To reduce the number of arguments, <a href='#Meeker_Hahn_Escobar2017' id='ref-Meeker_Hahn_Escobar2017-2'> Meeker et al. (2017) </a> use a <em>profile likelihood</em> function instead. This is the same as the likelihood ratio, but it is maximized with respect to <span class="math">$$\sigma$$</span>, as defined&nbsp;below:</p> <div class="math">$$R\left(\mu\right) = \max_\sigma \left[\frac{\mathcal{L}\left(\mu, \sigma\right)}{\mathcal{L}\left(\hat\mu, \hat\sigma\right)}\right]$$</div> <p>When we&#8217;re trying to calculate a Basis value, we don&#8217;t really care about the mean as a population parameter. Instead, we care about a particular proportion of the population. Since a normal distribution (like other location-scale distributions) is uniquely defined by two parameters, <a href='#Meeker_Hahn_Escobar2017' id='ref-Meeker_Hahn_Escobar2017-3'> Meeker et al.
(2017) </a> note that you can use two alternate parameters instead. In our case, we&#8217;ll keep <span class="math">$$\sigma$$</span> as one of the parameters, but we&#8217;ll use <span class="math">$$t_p$$</span> as the other instead. Here, <span class="math">$$t_p$$</span> is the value that the proportion <span class="math">$$p$$</span> of the population falls below. For example, <span class="math">$$t_{0.1}$$</span> would represent the 10-th percentile of the&nbsp;population.</p> <p>We can convert between <span class="math">$$\mu$$</span> and <span class="math">$$t_p$$</span> as&nbsp;follows:</p> <div class="math">$$\mu = t_p - \sigma \Phi^{-1}\left(p\right)$$</div> <p>Given this re-parameterization, we can implement the profile likelihood function as&nbsp;follows:</p> <div class="highlight"><pre><span></span><code><span class="n">profile_likelihood_normal</span> <span class="o">&lt;-</span> <span class="nf">function</span><span class="p">(</span><span class="n">tp</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="n">m</span> <span class="o">&lt;-</span> <span class="nf">mle</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="o">-</span><span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sig</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="p">},</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">130</span><span class="p">,</span> <span class="m">6.5</span><span class="p">)</span> <span class="p">)</span> <span class="n">mu_hat</span> <span class="o">&lt;-</span> <span 
class="n">m</span><span class="o">&#64;</span><span class="n">coef</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="n">sig_hat</span> <span class="o">&lt;-</span> <span class="n">m</span><span class="o">&#64;</span><span class="n">coef</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="n">ll_hat</span> <span class="o">&lt;-</span> <span class="nf">log_likelihood_normal</span><span class="p">(</span><span class="n">mu_hat</span><span class="p">,</span> <span class="n">sig_hat</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="nf">optimise</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">sig</span><span class="p">)</span> <span class="p">{</span> <span class="nf">exp</span><span class="p">(</span> <span class="nf">log_likelihood_normal</span><span class="p">(</span> <span class="n">mu</span> <span class="o">=</span> <span class="n">tp</span> <span class="o">-</span> <span class="n">sig</span> <span class="o">*</span> <span class="nf">qnorm</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">sig</span> <span class="o">=</span> <span class="n">sig</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="p">)</span> <span class="o">-</span> <span class="n">ll_hat</span> <span class="p">)</span> <span class="p">},</span> <span class="n">interval</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">sig_hat</span> <span class="o">*</span> <span class="m">5</span><span class="p">),</span> <span class="n">maximum</span> <span class="o">=</span> <span class="kc">TRUE</span> <span class="p">)</span><span class="o">$</span><span class="n">objective</span> <span class="p">}</span> </code></pre></div> <p>We 
can visualize the profile likelihood&nbsp;function:</p> <div class="highlight"><pre><span></span><code><span class="nf">data.frame</span><span class="p">(</span> <span class="n">tp</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">120</span><span class="p">,</span> <span class="m">140</span><span class="p">,</span> <span class="n">length.out</span> <span class="o">=</span> <span class="m">200</span><span class="p">)</span> <span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">rowwise</span><span class="p">()</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">R</span> <span class="o">=</span> <span class="nf">profile_likelihood_normal</span><span class="p">(</span><span class="n">tp</span><span class="p">,</span> <span class="m">0.1</span><span class="p">,</span> <span class="n">dat</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">ggplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">tp</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">R</span><span class="p">))</span> <span class="o">+</span> <span class="nf">geom_line</span><span class="p">()</span> <span class="o">+</span> <span class="nf">ggtitle</span><span class="p">(</span><span class="s">&quot;Profile Likelihood for the 10th Percentile&quot;</span><span class="p">)</span> </code></pre></div> <p><img alt="unnamed-chunk-9-1" src="https://www.kloppenborg.ca/2021/02/likelihood-basis-values/likelihood-basis-values_files/figure-markdown/unnamed-chunk-9-1.png"></p> <p>The way to interpret this plot is that it&#8217;s quite unlikely that the true value of <span class="math">$$t_p$$</span> is 120, and it&#8217;s unlikely that it&#8217;s 140, but it&#8217;s pretty likely that it&#8217;s around&nbsp;131.</p> <p>However, when 
we&#8217;re calculating Basis values, we aren&#8217;t trying to find the most likely value of <span class="math">$$t_p$$</span>: we&#8217;re trying to find a lower bound of the value of <span class="math">$$t_p$$</span>.</p> <p>The asymptotic distribution of <span class="math">$$R$$</span> is the <span class="math">$$\chi^2$$</span> distribution. If you&#8217;re working with large samples, you can use this fact to determine the lower bound of <span class="math">$$t_p$$</span>. However, for the sample sizes that are typically used for composite material testing, the actual distribution of <span class="math">$$R$$</span> is far enough from a <span class="math">$$\chi^2$$</span> distribution that you can&#8217;t actually do&nbsp;this.</p> <p>Instead, we can use numerical integration to find the lower tolerance bound. We can find a value of <span class="math">$$t_p$$</span>, which we&#8217;ll call <span class="math">$$u$$</span>, where <span class="math">$$5\%$$</span> of the area under the <span class="math">$$R$$</span> curve is to its left. This will give the <span class="math">$$95\%$$</span> lower confidence bound on the population parameter. This can be written as follows. We&#8217;ll use numerical root finding to solve this expression for <span class="math">$$u$$</span>.</p> <div class="math">$$0.05 = \frac{ \int_{-\infty}^{u}R(t_p) d t_p }{ \int_{-\infty}^{\infty}R(t_p) d t_p }$$</div> <p>Since the value of <span class="math">$$R$$</span> vanishes as we move far from about 130, we won&#8217;t actually integrate from <span class="math">$$-\infty$$</span> to <span class="math">$$\infty$$</span>, but rather integrate between two values that are relatively far from the peak of the <span class="math">$$R$$</span>&nbsp;curve.</p> <p>We can implement this in the R language as follows.
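<p>(As an aside, the integrate-then-root-find recipe can be checked on its own in Python by applying it to a standard normal density in place of <span class="math">$$R(t_p)$$</span>; in that case the point with <span class="math">$$5\%$$</span> of the area to its left is known to be about <span class="math">$$-1.645$$</span>.)</p>

```python
import math

def curve(t):
    # Stand-in for the profile likelihood R(t_p): a standard normal density,
    # chosen because the answer is known in closed form
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def integrate(f, lo, hi, n=4000):
    # Simple trapezoidal quadrature
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h

# The curve is negligible outside [-8, 8], so use that in place of infinity
denominator = integrate(curve, -8.0, 8.0)

def area_fraction_minus_target(u):
    # Fraction of the total area to the left of u, minus the 5% target
    return integrate(curve, -8.0, u) / denominator - 0.05

# Bisection root finding for u
lo, hi = -8.0, 8.0
f_lo = area_fraction_minus_target(lo)
for _ in range(50):
    mid = 0.5 * (lo + hi)
    f_mid = area_fraction_minus_target(mid)
    if f_lo * f_mid <= 0.0:
        hi = mid
    else:
        lo, f_lo = mid, f_mid
u = 0.5 * (lo + hi)

# u lands near the 5th percentile of the standard normal, about -1.645
```

<p>The two steps, quadrature for the denominator and root finding for <span class="math">$$u$$</span>, mirror the <code>integrate()</code> and <code>uniroot()</code> calls in the R implementation.</p>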
First, we&#8217;ll find the value of the&nbsp;denominator.</p> <div class="highlight"><pre><span></span><code><span class="n">fn</span> <span class="o">&lt;-</span> <span class="nf">Vectorize</span><span class="p">(</span><span class="nf">function</span><span class="p">(</span><span class="n">tp</span><span class="p">)</span> <span class="p">{</span> <span class="nf">profile_likelihood_normal</span><span class="p">(</span><span class="n">tp</span><span class="p">,</span> <span class="m">0.1</span><span class="p">,</span> <span class="n">dat</span><span class="p">)</span> <span class="p">})</span> <span class="n">denominator</span> <span class="o">&lt;-</span> <span class="nf">integrate</span><span class="p">(</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fn</span><span class="p">,</span> <span class="n">lower</span> <span class="o">=</span> <span class="m">100</span><span class="p">,</span> <span class="n">upper</span> <span class="o">=</span> <span class="m">150</span> <span class="p">)</span> <span class="n">denominator</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">## 4.339919 with absolute error &lt; 8.9e-07</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="nf">uniroot</span><span class="p">(</span> <span class="nf">function</span><span class="p">(</span><span class="n">upper</span><span class="p">)</span> <span class="p">{</span> <span class="n">trial_area</span> <span class="o">&lt;-</span> <span class="nf">integrate</span><span class="p">(</span> <span class="n">fn</span><span class="p">,</span> <span class="n">lower</span> <span class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">upper</span> <span class="o">=</span> <span class="n">upper</span> <span class="p">)</span> <span class="nf">return</span><span class="p">(</span><span class="n">trial_area</span><span class="o">$</span><span class="n">value</span> <span 
class="o">/</span> <span class="n">denominator</span><span class="o">$</span><span class="n">value</span> <span class="o">-</span> <span class="m">0.05</span><span class="p">)</span> <span class="p">},</span> <span class="n">interval</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">100</span><span class="p">,</span> <span class="m">150</span><span class="p">)</span> <span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="err">##$root</span> <span class="err">## [1] 127.4914</span> <span class="err">## </span> <span class="err">## $f.root</span> <span class="err">## [1] -3.810654e-08</span> <span class="err">## </span> <span class="err">##$iter</span> <span class="err">## [1] 14</span> <span class="err">## </span> <span class="err">## $init.it</span> <span class="err">## [1] NA</span> <span class="err">## </span> <span class="err">##$estim.prec</span> <span class="err">## [1] 6.103516e-05</span> </code></pre></div> <p>The B-Basis value that we get using this approach is <span class="math">$$127.49$$</span>. This is quite close to <span class="math">$$127.54$$</span>, which was the value that we got using the frequentist&nbsp;approach.</p> <p>In a simple case like this data set, it wouldn&#8217;t be worth the extra effort of using a likelihood-based approach to calculating the Basis value, but we have demonstrated that this approach does&nbsp;work.</p> <p>In a later blog post, we&#8217;ll explore a case where it is worth the extra effort. (<em>Edit: that post is <a href="https://www.kloppenborg.ca/2021/02/basis-values-censored-data/">here</a></em>)</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%&#64;#&#64;#')) { var align = "center", indent = "0em", linebreak = "false"; if (false) { align = (screen.width < 768) ? "left" : align; indent = (screen.width < 768) ? "0em" : indent; linebreak = (screen.width < 768) ? 'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%&#64;#&#64;#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ?
'true' : linebreak; } var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%&#64;#&#64;#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: '"+ align +"'," + " displayIndent: '"+ indent +"'," + " showMathMenu: true," + " messageStyle: 'normal'," + " tex2jax: { " + " inlineMath: [ ['\\\$$','\\\$$'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! 
important'} }," + " linebreaks: { automatic: "+ linebreak +", width: '90% container' }," + " }, " + "}); " + "if ('default' !== 'default') {" + "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" + "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" + "VARIANT['normal'].fonts.unshift('MathJax_default');" + "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" + "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" + "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" + "});" + "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" + "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" + "VARIANT['normal'].fonts.unshift('MathJax_default');" + "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" + "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" + "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" + "});" + "}"; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>cmstatr: Composite Material Data Statistics in R2020-07-22T14:20:41-04:002020-07-22T14:20:41-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2020-07-22:/2020/07/cmstatr/<p>From what I&#8217;ve seen, a lot of the statistical analysis of data from composite material data is done in <span class="caps">MS</span> Excel. There are a number of very good tools for doing this analysis in <span class="caps">MS</span> Excel: <a href="https://www.niar.wichita.edu/agate/Documents/default.htm"><span class="caps">ASAP</span></a>, <a href="http://www.niar.wichita.edu/coe/NCAMP_Documents/Programs/HYTEQ%20Feb%207%20%202011.xls"><span class="caps">HYTEQ</span></a>, <span class="caps">STAT</span>-17, and more recently, <a href="https://www.cmh17.org/RESOURCES/StatisticsSoftware.aspx"><span class="caps">CMH17</span>-<span class="caps">STATS</span></a>. 
I expect that the …</p><p>From what I&#8217;ve seen, a lot of the statistical analysis of composite material data is done in <span class="caps">MS</span> Excel. There are a number of very good tools for doing this analysis in <span class="caps">MS</span> Excel: <a href="https://www.niar.wichita.edu/agate/Documents/default.htm"><span class="caps">ASAP</span></a>, <a href="http://www.niar.wichita.edu/coe/NCAMP_Documents/Programs/HYTEQ%20Feb%207%20%202011.xls"><span class="caps">HYTEQ</span></a>, <span class="caps">STAT</span>-17, and more recently, <a href="https://www.cmh17.org/RESOURCES/StatisticsSoftware.aspx"><span class="caps">CMH17</span>-<span class="caps">STATS</span></a>. I expect that the reason for the popularity of <span class="caps">MS</span> Excel for this application is that everyone in the industry has <span class="caps">MS</span> Excel installed on their computer and <span class="caps">MS</span> Excel is easy to&nbsp;use.</p> <p>If you&#8217;ve read my blog before, you&#8217;ll know that I think that <a href="https://www.kloppenborg.ca/2019/02/reproducibility/">reproducibility</a> is important for engineering calculations. In my view, this includes statistical analysis. If the analysis isn&#8217;t reproducible, how does a reviewer &#8212; either now or in the future &#8212; know if it&#8217;s&nbsp;right?</p> <p>The current <span class="caps">MS</span> Excel tools are typically password protected so that users can&#8217;t view the macros that perform the calculations. I suspect that this was done with the best of intentions in order to prevent users from changing the code. 
But it also means that users can&#8217;t verify that the code is correct, or check whether any unstated assumptions have been&nbsp;made.</p> <p>To allow statistical analysis of composite material data using open-source software, I&#8217;ve written a package for the <a href="https://www.r-project.org/">R programming language</a> that implements the statistical methods described in <a href="https://www.cmh17.org"><span class="caps">CMH</span>-17-1G</a>. This package, <a href="https://www.cmstatr.net"><code>cmstatr</code></a>, has been released on <a href="https://cran.r-project.org/package=cmstatr"><span class="caps">CRAN</span></a>. There is also a brief discussion of this package in a <a href="https://doi.org/10.21105/joss.02265">paper published in the Journal of Open Source Software</a>.</p> <p>This R package allows statistical analysis to be performed using open-source tools &#8212; which can be verified by the user &#8212; and makes it possible to write the statistical analysis report at the same time that the analysis is performed, by using <code>R-Notebooks</code> (see my <a href="https://www.kloppenborg.ca/2019/06/pandoc-report-templates/">earlier post</a>).</p> <p>I&#8217;ve tried to write the functions in a consistent manner so that it&#8217;s easier to learn how to use the package. I&#8217;ve also written the functions to work well with the <a href="https://www.tidyverse.org/"><code>tidyverse</code></a> set of&nbsp;packages.</p> <p>There are some examples of how to use the <code>cmstatr</code> package in <a href="https://www.cmstatr.net/articles/cmstatr_Tutorial.html">this vignette</a>.</p> <p>I hope that people find this package useful. 
If you use this package and find a bug, have feedback or would like a feature added, please raise an issue on <a href="https://github.com/ComtekAdvancedStructures/cmstatr/issues">GitHub</a>.</p>Tracking Issues using Jupyter Notebooks2020-04-13T01:38:34-04:002020-04-13T01:38:34-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2020-04-13:/2020/04/tracking-issues/<p>I&#8217;m currently collaborating on a paper. My collaborator and I are writing the paper using LaTeX and we&#8217;re using git to track and share changes to the manuscript. We currently have a shared repository on <a href="https://www.github.com">GitHub</a>.</p> <p>GitHub has a lot of great features for collaborating on software &#8212; after …</p><p>I&#8217;m currently collaborating on a paper. My collaborator and I are writing the paper using LaTeX and we&#8217;re using git to track and share changes to the manuscript. We currently have a shared repository on <a href="https://www.github.com">GitHub</a>.</p> <p>GitHub has a lot of great features for collaborating on software &#8212; after all, that&#8217;s why it was developed. The &#8220;Issues&#8221; feature in a repository is particularly useful. It allows you to discuss problems and track their resolution. Text formatting is supported in GitHub Issues using Markdown. In many flavors of markdown, you can also embed math using LaTeX syntax. Unfortunately, GitHub-flavored markdown <a href="https://github.com/github/markup/issues/274">does not support math</a>. This is probably fine for the vast majority of software projects. However, it is a problem when we&#8217;re trying to discuss a mathematical&nbsp;model.</p> <p>Several people on the internet have suggested various solutions to this shortcoming. Some have suggested using an external engine to render your math as an image, then embed that image in markdown. 
This works, but I think it&#8217;s&nbsp;cumbersome.</p> <p>Several others have suggested using a <a href="https://jupyter.org">Jupyter Notebook</a>, which GitHub does actually render. I think that this is a better solution, and this is the solution that I&#8217;m planning on using with my&nbsp;collaborator.</p> <h1>Implementation&nbsp;Summary</h1> <p>In our git repository, I&#8217;m creating a folder called <code>issues-open</code>. Inside this folder is a set of Jupyter Notebooks, one per issue. Each collaborator can review these Notebooks, which conveniently get rendered on the GitHub web interface. When a collaborator has something to add to the issue, they can fire up their Jupyter instance, make some changes &#8212; either by adding new cells to the bottom of the notebook, or making changes to the existing text &#8212; and then commit and push the changes. We&#8217;ve adopted the practice of starting each cell with a heading with the name of the author of that cell. This way, the Notebook looks a bit like a&nbsp;conversation.</p> <h1>Launching Jupyter&nbsp;Notebooks</h1> <p>We&#8217;re using a <a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html">conda environment</a> for Python so that we&#8217;re synced up on the versions of each package we&#8217;re using. So, the first step will be creating the conda environment from the environment <span class="caps">YAML</span> file. In our case, this would look like&nbsp;this:</p> <div class="highlight"><pre><span></span><code>conda env create -f environment.yml </code></pre></div> <p>This only needs to be done once on each computer. Once that&#8217;s been done, you just need to activate the environment. This is basically just telling your terminal that you want to use that version of Python. 
This can be accomplished like the following (obviously, replace the name of the environment with the correct&nbsp;name):</p> <div class="highlight"><pre><span></span><code>conda activate my-environment </code></pre></div> <p>Now, you can launch the Jupyter Notebook session using the following. Your web browser should pop up and allow you to create new notebooks and edit existing notebooks in the browser once you run this&nbsp;command.</p> <div class="highlight"><pre><span></span><code>jupyter notebook </code></pre></div> <h1>Collaborating on&nbsp;Issues</h1> <p>The Jupyter Notebook interface is relatively straightforward and doesn&#8217;t need much discussion here. Most of the important features are available through the menus. There are keyboard shortcuts that come in handy, which can be found <a href="https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330">here</a>.</p> <p>Jupyter notebooks comprise a set of cells. The basic types of cells are markdown, code and raw. We&#8217;ll ignore raw cells here. Markdown cells contain text styled using <a href="https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html">markdown syntax</a>. Code cells contain executable code. In our case, this will all be Python&nbsp;code.</p> <p>If there is any code in the notebook, it&#8217;s important to realize that it runs interactively. You execute one code cell at a time. You don&#8217;t have to execute them in order either. So, if the code has side effects &#8212; like changing a global variable &#8212; the order that you run the cells in makes a difference. I think it&#8217;s good practice to restart your Python interpreter and re-run all the cells before committing a notebook in git. To do this, just click <code>Kernel</code> / <code>Restart &amp; Run All</code>. 
This guarantees that the cells were run in order and have repeatable&nbsp;output.</p> <p>The other advantage to restarting the kernel and re-running all the cells before committing is to avoid extraneous changes being tracked by git. The notebook files include a counter indicating the order in which the cells were executed. The first cell to be executed will have a counter value of 1, the second will have a value of 2, etcetera. If you execute the first five cells, then execute the first one again, it will now have a counter value of 6. If you&#8217;ve been playing around with a notebook for a while, all those counters will be incremented even higher. Even if you make no real changes to the notebook, git will register these counter changes as changes that need to be committed and tracked. You really only want the real changes to be tracked, and the easiest way to do this is to ensure that the code cells are executed in order starting from an execution count of&nbsp;one.</p> <h1>Closing an&nbsp;Issue</h1> <p>When it&#8217;s time to close an issue, whoever closes the issue simply moves the Jupyter Notebook discussing the issue to a folder called <code>issues-closed</code>. This should be done with <code>git mv</code> so that the history is&nbsp;maintained.</p> <p>As an example, to close the issue discussed in the Notebook <code>reorder-model-development.ipynb</code>, the command would&nbsp;be:</p> <div class="highlight"><pre><span></span><code>git mv issues-open/reorder-model-development.ipynb issues-closed/ </code></pre></div>Package Adequacy for Engineering Calculations2019-06-29T17:11:00-04:002019-06-29T17:11:00-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2019-06-29:/2019/06/adequacy/<p>If you do engineering calculations or analysis using a language like <a href="https://www.r-project.org/">R</a> or <a href="https://www.python.org/">Python</a>, chances are that you&#8217;re going to use some packages. 
Packages are collections of code that someone else has written that you can use in your code. For example, if you need to solve a system …</p><p>If you do engineering calculations or analysis using a language like <a href="https://www.r-project.org/">R</a> or <a href="https://www.python.org/">Python</a>, chances are that you&#8217;re going to use some packages. Packages are collections of code that someone else has written that you can use in your code. For example, if you need to solve a system of linear equations by inverting a matrix and you&#8217;re using Python, you might use <a href="http://www.numpy.org/">numpy</a>. Or if you&#8217;re using R and you need to fit a linear model to some data, you would probably use the <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html">stats</a>&nbsp;package.</p> <p>If you&#8217;re involved in &#8220;engineering,&#8221; you need a high level of confidence that the results that you&#8217;re getting are correct. Note that in this post &#8212; and my blog in general &#8212; when I say &#8220;engineering,&#8221; I don&#8217;t mean <em>software engineering:</em> I mean design and analysis of structures or systems that have an effect on safety. I work in civil aeronautics, mainly dealing with composites, but also dealing with metallic structure regularly. Depending on the particular type of engineering that you&#8217;re engaged in and the particular problem at hand, the consequences of getting the wrong answer could be fairly severe. You had better be sure that both the interpreter and the packages are correct. Probably the best way to do this is to validate the results using another method: are there published results for a similar problem that you can use as a benchmark? Perhaps you can do some physical testing? 
But even if you&#8217;re doing your due diligence and validating the results somehow, you will still waste a lot of your time if there is a problem with either the interpreter or one of the&nbsp;packages.</p> <p>Compiled languages &#8212; like <code>C</code> or <code>FORTRAN</code> &#8212; are compiled into machine code that runs directly on the processor. Interpreted languages, like Python, R or JavaScript, are not compiled into machine code; instead, an interpreter (a piece of software) reads each line of code and figures out how to run it when you run the code (not ahead of time). As far as interpreters go, if you&#8217;re using <code>CPython</code> (the &#8220;standard&#8221; Python interpreter) or <code>GNU-R</code> (the &#8220;standard&#8221; R interpreter), I think there is a rather low risk that there are any errors in the interpreter. These interpreters are written by a bunch of smart people, and both are open source, so the code that makes up the interpreters themselves is read by a much larger group of smart people. Furthermore, both interpreters are widely used and have been around for a while, so it&#8217;s very likely that significant bugs that could change the result of an engineering calculation would have been found by users by now and would have been&nbsp;fixed.</p> <p>Packages are more of a risk than interpreters are. Again, if you&#8217;re using a very widely used package that has been around for a while, like <code>numpy</code> (in Python) or <code>stats</code> (in R), there&#8217;s a pretty good chance that any bugs that would affect your calculations would have been found by now &#8212; and packages like these are maintained by groups of dedicated&nbsp;people.</p> <p>If you&#8217;re using R, chances are that you&#8217;re getting your packages from <a href="https://cran.r-project.org/"><span class="caps">CRAN</span></a>. You should be reading the <span class="caps">CRAN</span> page for the package that you&#8217;re using. 
You can find an example of such a page <a href="https://cran.r-project.org/package=MASS">here</a>. There are a few things that you should look for to help you evaluate the reliability of the package (in addition to the reference manual and any vignettes that explain how to use the package). The first is the priority of the package. Not all packages have a priority, but if the priority is &#8220;base&#8221; or &#8220;recommended,&#8221; the package is maintained by the r-core team and is almost certainly used by a lot of people. You can be fairly comfortable with these&nbsp;packages.</p> <p>The second thing that you should look at on the <span class="caps">CRAN</span> page for a package is the <span class="caps">CRAN</span> Checks. <span class="caps">CRAN</span> will test all the packages every time a new version of R is released, and it tests all the packages routinely to determine if a change in one package causes errors in other packages. You can see an example <span class="caps">CRAN</span> Check for my package <code>rde</code> <a href="https://cran.r-project.org/web/checks/check_results_rde.html">here</a>.</p> <p>This practice is called <a href="https://en.m.wikipedia.org/wiki/Continuous_integration">continuous integration</a>. It does all of these checks on several different operating systems &#8212; Windows, <span class="caps">OSX</span>, and several Linux distributions. If you open the <span class="caps">CRAN</span> Checks results for a package, you&#8217;ll see a table of all the various combinations of R version and operating system that have been tested, along with the amount of time that it took to run the test and a status for each. If the Status is &#8220;<span class="caps">OK</span>,&#8221; then there were no errors identified. If the Status is &#8220;<span class="caps">NOTE</span>,&#8221; &#8220;<span class="caps">WARNING</span>,&#8221; or &#8220;<span class="caps">ERROR</span>,&#8221; there might be something wrong, and it may or may not be serious. 
If you click on the Status link, you&#8217;ll see details and can evaluate for&nbsp;yourself.</p> <p>I think that these <span class="caps">CRAN</span> checks are actually a very strong point for the R ecosystem. They ensure that package maintainers know when something outside of their package breaks their code. They also enforce a certain level of quality: package maintainers are given a certain amount of time to fix errors, and if they don&#8217;t, the package gets removed from <span class="caps">CRAN</span>.</p> <p>The <span class="caps">CRAN</span> checks do a few things. First, they check that the package can, in fact, be loaded (maybe there&#8217;s an error that prevents you from using it at all). There are a few other things that they do, but the most important in terms of reliability of the package is that the <span class="caps">CRAN</span> checks will run any tests created by the package maintainer. These tests are called <a href="https://en.m.wikipedia.org/wiki/Unit_testing">unit tests</a>. They are tests that determine if the code in the package actually has the expected behavior. Package maintainers don&#8217;t have to write unit tests, but the good ones do. You can look at what tests the package maintainer has written by downloading the code of the package (you can download it from <span class="caps">CRAN</span>). The tests are in a folder called <code>tests</code>. Tests basically work by providing some input to the package&#8217;s functions and checking that the result is correct. For R packages, <code>testthat</code> is a popular testing framework. For packages that use the <code>testthat</code> framework, you&#8217;ll see a number of statements that use the <code>expect_...</code> family of functions. Some of these tests will likely ensure that the package works at all &#8212; checking things like the return type for functions, or that a function actually does raise certain errors when invalid arguments are passed to it. 
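These basic checks look much the same in any testing framework. As a minimal sketch, here is the same idea expressed with plain Python <code>assert</code> statements (the <code>beam_stress</code> function is a hypothetical example invented for illustration, not part of any package mentioned here):

```python
# Minimal sketch of "does the package work at all" unit tests, using
# plain Python asserts. The beam_stress function is a hypothetical example.

def beam_stress(moment, section_modulus):
    """Bending stress sigma = M / S for a beam (hypothetical example)."""
    if section_modulus <= 0:
        raise ValueError("section modulus must be positive")
    return moment / section_modulus

# Check the return type of the function.
assert isinstance(beam_stress(1000.0, 2.0), float)

# Check that the function raises the expected error for invalid input.
try:
    beam_stress(1000.0, -1.0)
except ValueError:
    pass
else:
    raise AssertionError("expected a ValueError")
```

In <code>testthat</code>, the corresponding checks would use functions from the <code>expect_...</code> family, such as <code>expect_error()</code>.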
Some of the tests should also ensure that the package provides correct results. When I write tests for a package, I always write both types of tests. For the tests that ensure that the results are correct, I often either check cases that have closed-form solutions, or check that the code in the package produces results that are approximately equal to example results published in articles or books. You&#8217;ll need to read through the tests to decide if they provide enough assurance that the package is&nbsp;correct.</p> <p>If you decide that the tests for a package are not sufficient, you have three&nbsp;options.</p> <ul> <li>You could choose not to use that package: maybe there is another that does something&nbsp;similar.</li> <li>You can write tests yourself and contribute those tests back to the package maintainer. After all, R packages are open-source and users are encouraged to contribute back to the community. Most package maintainers would be happy to receive a patch that adds more tests: writing tests is not fun, and most people would be grateful if someone else offered to do&nbsp;it.</li> <li>You could also manually test the package. The difficulty here is ensuring that you re-test the package every time you update the version of this package on your&nbsp;system.</li> </ul> <p>In the Python world, continuous integration isn&#8217;t as well integrated into the ecosystem. Most packages that you install probably come from PyPI. As far as I know, PyPI doesn&#8217;t do any continuous integration: it&#8217;s up to the package maintainer to run their tests regularly. Package maintainers can do one of two things: they can run the tests on their own machine before releasing a new version to PyPI, or they can use a continuous integration service like Travis-<span class="caps">CI</span> or CircleCI. Many of the continuous integration services are free for open source projects, so many Python packages do use a continuous integration service. 
Packages that use a continuous integration service normally advertise it in their <span class="caps">README</span> file. You&#8217;ll still need to assess whether the tests are adequate, and if the package doesn&#8217;t use continuous integration, you&#8217;ll have to either run the tests yourself, or trust that the package maintainer&nbsp;did.</p> <p>If you have already written tests for your package, setting up continuous integration using Travis-<span class="caps">CI</span> is quite straightforward. I haven&#8217;t personally used CircleCI, but I would imagine that it&#8217;s similarly easy to use. You can see the continuous integration results from my package <code>rde</code> on Travis-<span class="caps">CI</span> <a href="https://travis-ci.com/kloppen/rde">here</a>.</p> <p>Whether you&#8217;re using Python or R, there are ways of ensuring that the packages you use for engineering calculations are adequate for your needs. Some people seem to be a little bit scared of open source packages and software for engineering calculations, but in a lot of ways, open source software is actually better for this since you have the ability to verify it yourself and make a decision about whether to use&nbsp;it.</p>Pandoc Report Templates2019-06-21T01:55:33-04:002019-10-29T16:09:56-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2019-06-21:/2019/06/pandoc-report-templates/<p>The main benefit of using Notebooks (R Notebooks or Jupyter Notebooks) is that the document is reproducible: the reader knows exactly how the results of the analysis were obtained. 
I wrote about the use of Notebooks in an <a href="https://www.kloppenborg.ca/2019/02/reproducibility/">earlier&nbsp;post.</a></p> <p>Most organizations have a certain report format: a certain cover …</p><p>The main benefit of using Notebooks (R Notebooks or Jupyter Notebooks) is that the document is reproducible: the reader knows exactly how the results of the analysis were obtained. I wrote about the use of Notebooks in an <a href="https://www.kloppenborg.ca/2019/02/reproducibility/">earlier&nbsp;post.</a></p> <p>Most organizations have a certain report format: a certain cover sheet layout, a certain font, a log of revisions, etcetera. For the most part, organizations have an <span class="caps">MS</span> Word template for this report format. If you want to use a Notebook for your analysis and to write your report, you have a few&nbsp;options:</p> <ul> <li>You could write front matter in <span class="caps">MS</span> Word using your company&#8217;s report template and then attach the Notebook as an&nbsp;appendix.</li> <li>You could also use <code>Pandoc</code> (more about what this is later) to convert the Notebook into a .docx file and then merge it into the report&nbsp;template.</li> <li>You could create your own <code>Pandoc</code> template to convert a Notebook directly into a <span class="caps">PDF</span> with the correct&nbsp;formatting.</li> </ul> <p>The first option of attaching a Notebook as an appendix to a report otherwise created in <span class="caps">MS</span> Word is effective, but it means that you need to maintain two different files: the <span class="caps">MS</span> Word report and the Notebook itself. The second option of exporting the Notebook to <span class="caps">MS</span> Word and merging it into the template is problematic when it comes to document revisions. If part of the analysis is revised, there is a temptation to change the affected part by either only re-exporting that section from the Notebook into docx, or worse, making the change directly in <span class="caps">MS</span> Word. In both cases, there is the possibility of breaking the reproducibility. 
For example, let&#8217;s say that in your report you define some constants at the beginning and do some math using these&nbsp;constants:</p> <div class="highlight"><pre><span></span><code><span class="n">P</span> <span class="o">=</span> <span class="mi">1000</span> <span class="n">A1</span> <span class="o">=</span> <span class="mi">2</span> <span class="n">A2</span> <span class="o">=</span> <span class="mi">4</span> <span class="n">sigma1</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="n">A1</span> <span class="nb">print</span><span class="p">(</span><span class="n">sigma1</span><span class="p">)</span> <span class="c1"># 500</span> <span class="n">sigma2</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="n">A2</span> <span class="nb">print</span><span class="p">(</span><span class="n">sigma2</span><span class="p">)</span> <span class="c1"># 250</span> </code></pre></div> <p>Now let&#8217;s say that you ask your new intern to revise the document so that <span class="math">$$P = 1200$$</span>. They just edit the <span class="caps">MS</span> Word version of the report thinking that they will save some time. They don&#8217;t notice that <span class="math">$$P$$</span> is used twice in the calculation and only update the result from the first time it&#8217;s used. 
Now the report&nbsp;reads:</p> <div class="highlight"><pre><span></span><code><span class="n">P</span> <span class="o">=</span> <span class="mi">1200</span> <span class="n">A1</span> <span class="o">=</span> <span class="mi">2</span> <span class="n">A2</span> <span class="o">=</span> <span class="mi">4</span> <span class="n">sigma1</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="n">A1</span> <span class="nb">print</span><span class="p">(</span><span class="n">sigma1</span><span class="p">)</span> <span class="c1"># 600</span> <span class="n">sigma2</span> <span class="o">=</span> <span class="n">P</span> <span class="o">/</span> <span class="n">A2</span> <span class="nb">print</span><span class="p">(</span><span class="n">sigma2</span><span class="p">)</span> <span class="c1"># 250</span> </code></pre></div> <p>The report is now wrong. In a simple case like this, you&#8217;ll probably notice the error when you review your intern&#8217;s work, but if the math was significantly more complex, there is probably a fairly good chance that you wouldn&#8217;t pick up on the newly introduced&nbsp;error.</p> <p>For this reason, I think that the best option is to create a <code>Pandoc</code> template for your company&#8217;s report template. This means that you&#8217;ll be creating a <span class="caps">PDF</span> directly from the Notebook. In order to revise the report, you have to re-run the Notebook &#8212; the whole&nbsp;Notebook.</p> <p>For those unfamiliar with <a href="https://pandoc.org/"><code>Pandoc</code></a>, it is a program for converting between various file formats. It&#8217;s also free and open-source software. 
Commonly, it&#8217;s used for converting from Markdown into <span class="caps">HTML</span> or <span class="caps">PDF</span> (actually, <code>Pandoc</code> converts to a <a href="https://www.latex-project.org/">LaTeX</a> format and LaTeX converts to <span class="caps">PDF</span>, but this happens transparently). <code>Pandoc</code> can also convert into <span class="caps">MS</span> Word (.docx) and several other&nbsp;formats.</p> <p>When I decided to create a corporate format for use with notebooks, I looked at the types of notebooks that we use. Generally, statistics are done in an <a href="https://bookdown.org/yihui/rmarkdown/notebook.html">R-Notebook</a> and other analysis is done in a <a href="https://jupyter.org/">Jupyter notebook</a>. Unfortunately, R-Notebooks and Jupyter Notebooks use different templates. R-Notebooks use <code>pandoc</code> templates, while Jupyter uses its own template. Fortunately, there is a workaround. Jupyter is able to export to markdown, which can be read by <code>pandoc</code> and translated to <span class="caps">PDF</span> using a pandoc template. Thus, I made the decision to write a <code>pandoc</code> template.</p> <p>When <code>pandoc</code> converts a markdown file to <span class="caps">PDF</span>, it actually uses LaTeX. The <code>pandoc</code> template is actually a template for converting markdown into LaTeX. <code>Pandoc</code> then calls <code>pdflatex</code> to turn this <code>.tex</code> file into a <span class="caps">PDF</span>. </p> <p>When I first started figuring out how to write a template for converting markdown to <span class="caps">PDF</span>, I thought I was going to have to write a LaTeX class or style. I got scared. LaTeX classes are not for the faint of heart. But, I soon realized that I didn&#8217;t actually have to do that. The <code>pandoc</code> template that I needed to write was just a regular LaTeX document that has some parameters that <code>pandoc</code> can fill in. 
I&#8217;m not sure that I could figure out how to write a LaTeX class in a reasonable amount of time, but I sure can write a document using LaTeX. This is something that I learned to do when I wrote my undergraduate thesis, and while I don&#8217;t write LaTeX often anymore, it&#8217;s really not that&nbsp;hard.</p> <p>A very basic LaTeX file would look something like&nbsp;this:</p> <div class="highlight"><pre><span></span><code><span class="k">\documentclass</span><span class="nb">{</span>article<span class="nb">}</span> <span class="k">\begin</span><span class="nb">{</span>document<span class="nb">}</span> <span class="k">\title</span><span class="nb">{</span>My Report Title<span class="nb">}</span> <span class="k">\author</span><span class="nb">{</span>A. Student<span class="nb">}</span> <span class="k">\maketitle</span> <span class="k">\section</span><span class="nb">{</span>Introduction<span class="nb">}</span> Some text <span class="k">\end</span><span class="nb">{</span>document<span class="nb">}</span> </code></pre></div> <p>A <code>pandoc</code> template is just a LaTeX file, but with placeholders for the content that <code>pandoc</code> will insert. These placeholders are just variables surrounded with dollar signs. For example, <code>pandoc</code> has a variable called <code>body</code>. This variable will contain the body of the report. We would simply put <code>$body$</code> in the part of the template where we want <code>pandoc</code> to insert the body of the&nbsp;report.</p> <p><code>Pandoc</code> also supports <code>for</code> and <code>if</code> statements. A common pattern is to check for the existence of a variable and use it if it does exist and use a default value if it does not. 
The syntax for this would look something&nbsp;like:</p> <div class="highlight"><pre><span></span><code><span class="err">$if(myvar)$</span> <span class="err"> $myvar$</span> <span class="err">$else$</span> <span class="err"> Default text</span> <span class="err">$endif$</span> </code></pre></div> <p>I&#8217;ve written the above code on multiple lines for readability, but it could be written on a single line&nbsp;too.</p> <p>Similarly, if a variable is a list, you&#8217;d use a <code>for</code> statement to iterate over the list. We&#8217;ll cover this later when we talk about adding logs of&nbsp;revisions.</p> <h1>Defining New Template&nbsp;Variables</h1> <p><code>Pandoc</code> defines a number of variables by default. However, you&#8217;ll likely need to define some variables of your own. First of all, you&#8217;ll likely need to define a variable for the report number and the&nbsp;revision.</p> <p>To create the variable, it&#8217;s just a matter of defining it in the <a href="http://yaml.org/"><code>YAML</code></a> header of the markdown file. Variables can either have a single value or they can be lists. 
Elements of a list start with a dash at the beginning of the&nbsp;line.</p> <p>Once we add the report number (which we&#8217;ll call <code>report-no</code>) and the revision (which we&#8217;ll call <code>rev</code>) to the <code>YAML</code> header, the <span class="caps">YAML</span> header will look like the&nbsp;following:</p> <div class="highlight"><pre><span></span><code><span class="nt">title</span><span class="p">:</span> <span class="s">&quot;Report</span><span class="nv"> </span><span class="s">Title&quot;</span> <span class="nt">author</span><span class="p">:</span> <span class="s">&quot;A.</span><span class="nv"> </span><span class="s">Student&quot;</span> <span class="nt">report-no</span><span class="p">:</span> <span class="s">&quot;RPT-001&quot;</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">B</span> </code></pre></div> <p>(Bonus points if you immediately thought of William Sealy Gosset when you read&nbsp;that).</p> <p>We&#8217;ll probably want to add a log of revisions to the report. The contents of this log of revisions will have to come from somewhere, and the <code>YAML</code> header is the most logical place. The log of revisions will be a list with one element of the list corresponding to each revision in the log. Lists can have nested members. In our case, an entry within the log of revisions will have a revision letter, a date and a description. 
Including the log of revisions, the <code>YAML</code> header will look like&nbsp;this:</p> <div class="highlight"><pre><span></span><code><span class="nt">title</span><span class="p">:</span> <span class="s">&quot;Report</span><span class="nv"> </span><span class="s">Title&quot;</span> <span class="nt">author</span><span class="p">:</span> <span class="s">&quot;A.</span><span class="nv"> </span><span class="s">Student&quot;</span> <span class="nt">report-no</span><span class="p">:</span> <span class="s">&quot;RPT-001&quot;</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">B</span> <span class="nt">rev-log</span><span class="p">:</span> <span class="p p-Indicator">-</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">A</span> <span class="nt">date</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">1-Jun-2019</span> <span class="nt">desc</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Initial release</span> <span class="p p-Indicator">-</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">B</span> <span class="nt">date</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">18-Jun-2019</span> <span class="nt">desc</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Updated loads based on flight test data</span> </code></pre></div> <p>We can now use these variables in our <code>pandoc</code> template. Using the variables <code>report-no</code> and <code>rev</code> is straightforward and will be just the same as using the default variables (like <code>title</code> and <code>author</code>).</p> <p>Using the list variables will require the use of a <code>for</code> statement. In the case of a log of revisions, each revision will get a row in a LaTeX table. 
Using the variable <code>rev-log</code>, this table will look like&nbsp;this:</p> <div class="highlight"><pre><span></span><code><span class="k">\begin</span><span class="nb">{</span>tabular<span class="nb">}{</span>| m<span class="nb">{</span>0.25in<span class="nb">}</span> | m<span class="nb">{</span>0.95in<span class="nb">}</span> | m<span class="nb">{</span>4.0in<span class="nb">}</span> |<span class="nb">}</span> <span class="k">\hline</span> Rev Ltr <span class="nb">&amp;</span> Date <span class="nb">&amp;</span> Description <span class="k">\\</span> <span class="s">$</span><span class="nb">for</span><span class="o">(</span><span class="nb">rev</span><span class="o">-</span><span class="nb">log</span><span class="o">)</span><span class="s">$</span> <span class="k">\hline</span> <span class="s">$</span><span class="nb">rev</span><span class="o">-</span><span class="nb">log.rev</span><span class="s">$</span> <span class="nb">&amp;</span> <span class="s">$</span><span class="nb">rev</span><span class="o">-</span><span class="nb">log.date</span><span class="s">$</span> <span class="nb">&amp;</span> <span class="s">$</span><span class="nb">rev</span><span class="o">-</span><span class="nb">log.desc</span><span class="s">$</span> <span class="k">\\</span> <span class="s">$</span><span class="nb">endfor</span><span class="s">$</span> <span class="k">\hline</span> <span class="k">\end</span><span class="nb">{</span>tabular<span class="nb">}</span> </code></pre></div> <p>In the above LaTeX code, everything between <code>$for(...)$</code> and <code>$endfor$</code> gets repeated for each item in the list <code>rev-log</code>. We can access the nested members using dot&nbsp;notation.</p> <h1>Using the Pandoc Template from an&nbsp;R-Notebook</h1> <p>RStudio handles a lot of the interface with <code>pandoc</code>. 
Adding the following to the <code>YAML</code> header of the R-Notebook should cause RStudio to use your new template when it compiles the R-Notebook to <span class="caps">PDF</span>. This <em>should</em> be all you need to&nbsp;do.</p> <div class="highlight"><pre><span></span><code><span class="nt">output</span><span class="p">:</span> <span class="nt">pdf_document</span><span class="p">:</span> <span class="nt">template</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">my_template_file.tex</span> <span class="nt">toc_depth</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">3</span> <span class="nt">fig_caption</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">true</span> <span class="nt">keep_tex</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">false</span> <span class="nt">df_print</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">kable</span> </code></pre></div> <h1>Using the Pandoc Template from a Jupyter&nbsp;Notebook</h1> <p>Using your new <code>pandoc</code> template from a Jupyter Notebook is a bit more complicated because Jupyter doesn&#8217;t work directly with <code>pandoc</code>. First of all, we need to tell <code>nbconvert</code> to convert to markdown. I think that it&#8217;s best to re-run the notebook at the same time (to make sure that it is, in fact, fully reproducible). You can do this using <code>nbconvert</code> as&nbsp;follows:</p> <div class="highlight"><pre><span></span><code>jupyter nbconvert --execute --to markdown my-notebook.ipynb </code></pre></div> <p>But, Jupyter notebooks don&#8217;t have <code>YAML</code> headers like R-Notebooks do, so we need a place to put all the variables that the template needs. 
The easiest way to do this is to create a cell at the beginning of the notebook with the cell type set as <code>raw</code>, then enter the <code>YAML</code> header into this cell, including the starting and ending fences (<code>---</code>). This cell would then have content similar to the following. Cells of type <code>raw</code> simply get copied to the output, so this becomes the <code>YAML</code> header in the resulting markdown&nbsp;file.</p> <div class="highlight"><pre><span></span><code><span class="nn">---</span> <span class="nt">title</span><span class="p">:</span> <span class="s">&quot;Report</span><span class="nv"> </span><span class="s">Title&quot;</span> <span class="nt">author</span><span class="p">:</span> <span class="s">&quot;A.</span><span class="nv"> </span><span class="s">Student&quot;</span> <span class="nt">report-no</span><span class="p">:</span> <span class="s">&quot;RPT-001&quot;</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">B</span> <span class="nt">rev-log</span><span class="p">:</span> <span class="p p-Indicator">-</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">A</span> <span class="nt">date</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">1-Jun-2019</span> <span class="nt">desc</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Initial release</span> <span class="p p-Indicator">-</span> <span class="nt">rev</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">B</span> <span class="nt">date</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">18-Jun-2019</span> <span class="nt">desc</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">Updated loads based on flight test data</span> <span class="nn">---</span> </code></pre></div> <p>Once you&#8217;ve used <code>nbconvert</code> to create the markdown file, you can call <code>pandoc</code>. 
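</p> <p>As an aside, these two steps (re-running the notebook with <code>nbconvert</code>, then rendering the resulting markdown with <code>pandoc</code>) can be chained in a small driver script. The sketch below is illustrative, not part of the original workflow: the notebook name and template path are placeholders, and it assumes both <code>jupyter</code> and <code>pandoc</code> are on your <code>PATH</code>. The exact <code>pandoc</code> options it passes are the ones discussed next.</p>

```python
# Sketch of a driver script for the nbconvert-then-pandoc pipeline described
# in the text.  Filenames and the template path are illustrative; "jupyter"
# and "pandoc" must both be on the PATH for render() to work.
import subprocess
from pathlib import Path


def build_commands(notebook, template, highlight="tango"):
    """Build the nbconvert and pandoc command lines for one notebook."""
    nb = Path(notebook)
    md = nb.with_suffix(".md")    # nbconvert writes markdown beside the notebook
    pdf = nb.with_suffix(".pdf")
    nbconvert_cmd = ["jupyter", "nbconvert", "--execute", "--to", "markdown", str(nb)]
    pandoc_cmd = [
        "pandoc", str(md), "-N",
        f"--template={template}",
        "-o", str(pdf),
        f"--highlight-style={highlight}",
    ]
    return nbconvert_cmd, pandoc_cmd


def render(notebook, template):
    """Re-run the notebook, then render it to PDF through the template."""
    for cmd in build_commands(notebook, template):
        subprocess.run(cmd, check=True)  # stop if either step fails
```

<p>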
You&#8217;ll have to provide the template as a command-line argument, specify the output filename (so that <code>pandoc</code> knows you want a <span class="caps">PDF</span>), and give the code highlighting style. The call to <code>pandoc</code> will look something like&nbsp;this:</p> <div class="highlight"><pre><span></span><code><span class="sb"></span>pandoc<span class="sb"></span> my-notebook.md -N --template<span class="o">=</span>my_template_file.tex -o my-notebook.pdf --highlight-style<span class="o">=</span>tango </code></pre></div> <h1>Documentation of Your&nbsp;Template</h1> <p>A &#8220;trick&#8221; that I&#8217;ve used is to add some documentation about how to use the template inside the template itself. It&#8217;s pretty unlikely that the user will actually open up the template, but it&#8217;s relatively likely that the user will forget one of the variables that the template expects. Since <code>pandoc</code> allows <code>if/else</code> statements, I&#8217;ve added the following to my&nbsp;template:</p> <div class="highlight"><pre><span></span><code><span class="s">$</span><span class="nb">if</span><span class="o">(</span><span class="nb">abstract</span><span class="o">)</span><span class="s">$</span> <span class="k">\abstract</span><span class="nb">{</span><span class="s">$</span><span class="nb">abstract</span><span class="s">$</span><span class="nb">}</span> <span class="s">$</span><span class="nb">else</span><span class="s">$</span> <span class="k">\abstract</span><span class="nb">{</span> The documentation for using the template goes here <span class="nb">}</span> <span class="s">$</span><span class="nb">endif</span><span class="s">$</span> </code></pre></div> <p>This means that if the user forgets to define the <code>abstract</code> variable, the cover page of the report (where the abstract normally goes in my case) will contain the documentation for the&nbsp;template.</p> <h1>Change Bars: Future&nbsp;Work</h1> <p>One of the things that I haven&#8217;t yet 
figured out is change bars. In my organization, we put vertical bars in the margin of reports to indicate what part of a report has been revised. There are LaTeX packages for (manually) inserting <a href="https://www.ctan.org/pkg/changebar">change bars into documents</a>. However, I haven&#8217;t yet figured out how to automatically insert these into a report generated using <code>pandoc</code>. I&#8217;m sure there&#8217;s a way,&nbsp;though.</p> <h1>Conclusion</h1> <p>I hope that this demystifies the process of writing a <code>pandoc</code> template to allow you to create reports directly from Jupyter Notebooks or R-Notebooks in your company&#8217;s report&nbsp;format.</p> <p><em>(Edited to fix a few&nbsp;typos)</em></p> Automating Software Validation Reports2019-06-07T04:39:42-04:002019-06-21T01:55:33-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2019-06-07:/2019/06/automating-software-validation-reports/<p>I&#8217;ve been working on a Python package to analyze
adhesively bonded joints recently. This package will be used to analyze adhesive joints in certain aircraft structure and will be used to substantiate the design of structural repairs, amongst other uses. Because of this, the output of this package needs …</p><p>I&#8217;ve been working on a Python package to analyze adhesively bonded joints recently. This package will be used to analyze adhesive joints in certain aircraft structure and will be used to substantiate the design of structural repairs, amongst other uses. Because of this, the output of this package needs to be validated against test data. This validation also needs to be documented in an engineering&nbsp;report.</p> <p>I&#8217;ve been thinking about how to do this. On one hand, I&#8217;ve been thinking about the types of (mechanical) tests that we&#8217;ll need to run to validate the model and the various test configurations that we&#8217;ll need to include in the validation test matrix. On the other hand, I&#8217;ve also been thinking about change management of the package and ensuring that the validation report stays up to&nbsp;date.</p> <p>I&#8217;m imagining the scenario where we run the validation testing and find that the model and the test results agree within, say, 10%. Maybe that&#8217;s good enough for the purpose (depending on the direction of the disagreement). We can then write our validation report and type out the sentence <em>&#8220;the test data and the model were found to agree within 10%.&#8221;</em> Then, I&#8217;m imagining that we make a refinement to the model formulation and release a new version of the package that now agrees with the test data within 5%. Now, we have a validation report for the old version of the package, but no report describing the validation of the new version. 
We&#8217;d need to go back through the validation report, re-run the model for all the validation cases and update the&nbsp;report.</p> <p>When we update the validation report manually, there&#8217;s probably a pretty good chance that some aspect of the update gets missed. Maybe it&#8217;s as simple as one of the model outputs not getting updated in the revised validation report. It&#8217;s also potentially rather time consuming to update this report. It would be faster to make this validation report a Jupyter Notebook (which I&#8217;ve <a href="https://www.kloppenborg.ca/2019/02/reproducibility/">previously talked about</a>). I haven&#8217;t yet written about it here, but it is possible to have a Jupyter Notebook render to a <span class="caps">PDF</span> using a corporate report format, so it&#8217;s even possible to make this validation report look like it should <em>(Edit: I&#8217;ve now written about this <a href="https://www.kloppenborg.ca/2019/06/pandoc-report-templates/">here</a>)</em>. We could also set up a test in the package to re-run the Jupyter Notebook, and perhaps integrate it into a continuous integration system so that the Notebook gets re-run every time a change is made to the package. This would mean that the validation report is always up to&nbsp;date.</p> <p>When you write a Jupyter Notebook, it usually has some code that produces a result &#8212; either a numeric result, or a graph &#8212; and then you have some text that you&#8217;ve written which explains the result. The problem is that this text that you&#8217;ve written doesn&#8217;t respond to changes in the result. Sure, there are ways of automatically updating individual numbers inside the text that you&#8217;ve written, but sometimes the way that the result of the code changes warrants a change in the sentiment of the text. 
Maybe the text needs to change from <em>&#8220;the model shows poor agreement with experimental results and shouldn&#8217;t be used in this case&#8221;</em> to <em>&#8220;the model shows excellent agreement with experimental results and has been validated.&#8221;</em> There&#8217;s no practical way that this type of update to the text could be automated. But if the update to the result of the code in the Notebook has been automated, there&#8217;s a good chance that the text and the results from the code will end up disagreeing &#8212; especially if the report is more than a few&nbsp;pages.</p> <h1>The&nbsp;Solution</h1> <p>So, what can be done to rectify this? We want to have the ease of having the results of the code automatically update, but we want to make sure that those results and the text of the report match. One approach to this problem &#8212; and the approach that I intend to use for the adhesive joint analysis package &#8212; is to add <code>assert</code> statements to the Notebook. 
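</p> <p>For instance, a check cell near the end of the notebook might combine a conservatism check with the level of agreement that the report&#8217;s text claims. The variable names and the 10% threshold below are illustrative, not values from a real analysis.</p>

```python
# Illustrative check cell for a validation notebook.  The strain values and
# the 10% agreement threshold are made-up numbers, not real test results.
model_strain = 0.00215         # strain predicted by the model
experimental_strain = 0.00205  # strain measured on the test article

# The report claims the model is conservative, so it must over-predict strain.
assert model_strain > experimental_strain

# The report also claims agreement within 10%, so enforce that too.
relative_error = (model_strain - experimental_strain) / experimental_strain
assert relative_error < 0.10
```

<p>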
This way, if the assertion fails, the Notebook won&#8217;t automatically rebuild and our attention will be drawn to the&nbsp;issue.</p> <p>As an example, if the text says that the model is conservative, meaning that the strain predicted by the model is higher than the strain measured by strain gauges installed on the test articles from the validation testing, we could write the following assert statement in the Jupyter&nbsp;Notebook:</p> <div class="highlight"><pre><span></span><code><span class="k">assert</span><span class="p">(</span><span class="n">model_strain</span> <span class="o">&gt;</span> <span class="n">experimental_strain</span><span class="p">)</span> </code></pre></div> <p>Now, if we later make a change to the model that causes it to under-predict strain, we&#8217;ll be alerted to this and prompted to update the validation&nbsp;report.</p> <h1>Implementing the&nbsp;Solution</h1> <p>To run a Jupyter Notebook from code (for example in a test suite), I&#8217;ve used the following code in the past. 
This code was based on code found on <a href="http://blog.thedataincubator.com/2016/06/testing-jupyter-notebooks/">The Data Incubator&nbsp;Blog</a>.</p> <div class="highlight"><pre><span></span><code>import os
import sys

import nbformat
from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor


def _notebook_run(path):
    # Execute the notebook with the same major version of Python as the tests
    kernel_name = "python{}".format(sys.version_info[0])
    file_dir = os.path.dirname(__file__)
    errors = []
    with open(path) as f:
        nb = nbformat.read(f, as_version=4)
        nb.metadata.get("kernelspec", {})["name"] = kernel_name
        ep = ExecutePreprocessor(kernel_name=kernel_name, timeout=3600)
        try:
            ep.preprocess(nb, {"metadata": {"path": file_dir}})
        except CellExecutionError as e:
            if "SKIP" in e.traceback:
                errors.append(str(e.traceback))
            else:
                raise e
    return nb, errors


_notebook_run("file-name-of-my-notebook.ipynb")
</code></pre></div> <p>This code will run the Notebook <code>file-name-of-my-notebook.ipynb</code> and will raise an error if an error is encountered. 
If this is inside a <code>unittest2</code> or <code>NoseTest</code> test suite, this will cause a test&nbsp;failure.</p> <h1>Conclusion</h1> <p>Validating software used in a way that affects an aircraft design is very important in ensuring the safety of that design. Keeping the validation report up to date can be tedious, but can be automated using Jupyter Notebooks. The conclusions drawn in the validation report need to match the results of the software being validated. One approach to ensuring that this is always true is to add <code>assert</code> statements to the Jupyter Notebook that forms the validation&nbsp;report.</p>Reproducibility of Engineering Calculations2019-02-08T11:21:57-05:002019-06-21T01:55:33-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2019-02-08:/2019/02/reproducibility/<p>Reproducibility in engineering work doesn&#8217;t seem to get the attention that it deserves. I can&#8217;t count the number of times that I&#8217;ve read an old engineering report in search of a particular result, only to find that the calculation that led to that result is only barely …</p><p>Reproducibility in engineering work doesn&#8217;t seem to get the attention that it deserves. I can&#8217;t count the number of times that I&#8217;ve read an old engineering report in search of a particular result, only to find that the calculation that led to that result is only barely described, or there is just a screenshot of an Excel workbook with a few input numbers and a final result. When I find things like this, it makes me a little nervous: did the original author use the correct formula when computing this result? What assumptions did the author make and neglect to document? What approximations were made? Was the original review of the report diligent enough to check this particular&nbsp;result?</p> <p>Let&#8217;s take a hypothetical example. For simplicity, let&#8217;s assume that we&#8217;re analyzing some sort of bracket. 
It&#8217;s 2 inches wide, 0.125 inches thick and 5 inches long. It&#8217;s cantilevered with a load applied 2 inches from the free edge. We care about both the deflection and the maximum stress. The formulae for deflection and stress are given by Roark<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. We&#8217;ll adapt those equations&nbsp;slightly:</p> <div class="math">$$\delta_a = \frac{-P}{6 E I} (2 L^3 - 3 L^2 a + a^3)$$</div> <div class="math">$$\sigma = \frac{M_B \frac{t}{2}}{I} = \frac{P (L - a) \frac{t}{2}}{I}$$</div> <p>Given these equations and the data above, we could quite easily do the calculation in a spreadsheet program like <span class="caps">MS</span>-Excel. But, if we want to include our calculation in a report (most likely as a screenshot of the spreadsheet), our report will probably just look like&nbsp;this:</p> <p><img alt="excel1" src="https://www.kloppenborg.ca/2019/02/reproducibility/reproducibility_excel.png"></p> <p>This shows the &#8220;right&#8221; answer, but if you&#8217;re reviewing the report, how do you know that the answer is right? If you&#8217;re reviewing the report before it&#8217;s released, you can probably get a copy of the Excel file and check the formulae in the cells. You&#8217;ll spend a few minutes deciphering the formula to figure out if it&#8217;s correct. But, if you&#8217;re reading the report later, especially if you&#8217;re outside the company that wrote it, good luck. You&#8217;re going to have to get out a pen, paper and your calculator to repeat the calculation and figure out if it&#8217;s right. This problem is even worse if the author of the report hard-coded a few of the input values (e.g. length, width, elastic modulus) into the&nbsp;formulae.</p> <p>There are a few ways to address this problem of reproducibility. We&#8217;ll explore two of these ways. 
The first is to use software like <a href="https://www.ptc.com/en/products/mathcad">MathCAD</a>, or its free alternative <a href="https://en.smath.info/view/SMathStudio/summary">SMath-Studio</a>. Both of these products are <a href="https://en.wikipedia.org/wiki/WYSIWYG"><span class="caps">WYSIWYG</span></a> math editors that are unit aware. With either of these, you could do your calculations in MathCAD or SMath-Studio and paste a screenshot of this into your&nbsp;report. </p> <p><img alt="smath" src="https://www.kloppenborg.ca/2019/02/reproducibility/reproducibility_smath.png"></p> <p>Now, the input data and the formula would be shown directly in the report. The added benefit is that, since these pieces of software are unit aware, you can&#8217;t make simple unit errors &#8212; if you forget an exponent, the units shown in the result won&#8217;t be what you expect, so you know that you&#8217;ve made a&nbsp;mistake.</p> <p>The other way to approach this problem is to use something called a notebook. If you&#8217;re comfortable enough to write simple code in <a href="https://www.python.org/">Python</a>, you could use a <a href="http://jupyter.org/">jupyter notebook</a>. If you&#8217;re doing some data analysis or statistics, you might prefer to write some code in <a href="https://www.r-project.org/">R</a> (though, you could use <a href="https://pandas.pydata.org/">pandas</a> if you prefer to use Python). While you can use R with jupyter notebooks (as well as several other languages), in my opinion R Studio&#8217;s <a href="https://rmarkdown.rstudio.com/r_notebooks.html">R Notebooks</a> are a little bit better to work with. If you were to do the same calculation with a notebook (in this case, we&#8217;ll use a jupyter notebook and Python), it would look like&nbsp;this:</p> <p><img alt="notebook" src="https://www.kloppenborg.ca/2019/02/reproducibility/reproducibility_notebook.png"></p> <p>There are a few advantages of using a notebook. 
First, you can use a programming language with a little bit more power than MathCAD or SMath-Studio &#8212; if you need to do an iterative calculation or find the root of a system of non-linear equations, you can do it with a language like Python or R &#8212; and do so in a way that&#8217;s not too difficult for the reader to understand. The other advantage of using a notebook is that notebooks are intended to mix code, results and text. You could actually write your whole report using a notebook! You could explain your approach to solving the problem, include the code used to solve the problem and then show the results all in the same document. No need to copy-and-paste anything and no need to store multiple files (like a Word document and a SMath-Studio&nbsp;file).</p> <p>Text written in a notebook (either a jupyter notebook or an R Notebook) is written using something called <a href="https://en.m.wikipedia.org/wiki/Markdown">markdown</a>. This is a &#8220;lightweight&#8221; way of formatting text. If you want a bullet list, you just type an asterisk at the beginning of each line; if you want a heading, you start the line with a hash symbol (or two for a sub-heading). And, most importantly for engineering reports, you can include formulae using <a href="https://en.m.wikibooks.org/wiki/LaTeX/Mathematics">LaTeX</a> from within markdown just by enclosing the formula with two dollar signs before and after it &#8212; no need to suffer through using the <span class="caps">MS</span>-Word Equation&nbsp;Editor.</p> <p>If you need a corporate format for your report, there are ways to create PDFs from either a jupyter notebook or an R Notebook using a custom format. I plan on writing about this in a later post. Stay tuned. <em>(Edit: I&#8217;ve written about this <a href="https://www.kloppenborg.ca/2019/06/pandoc-report-templates/">here</a>)</em></p> <p>We&#8217;ve explored a few ways of making an engineering report more reproducible. 
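</p> <p>For concreteness, the notebook cell shown in the screenshot earlier amounts to only a few lines of Python. In the sketch below, the geometry comes from the bracket example; the elastic modulus (10.0e6 psi, roughly aluminum) and the load <code>P</code> are assumed values for illustration.</p>

```python
# Bracket calculation from the hypothetical example, transcribed to Python.
# Geometry matches the text; E (aluminum) and P are assumed for illustration.
P = 10.0     # applied load, lbf (illustrative)
E = 10.0e6   # elastic modulus, psi (assumed: aluminum)
w = 2.0      # width, in
t = 0.125    # thickness, in
L = 5.0      # length, in
a = 2.0      # distance from the free edge to the load, in

I = w * t**3 / 12  # second moment of area, in^4

# Roark's formulas, as adapted in the text above
delta = -P / (6 * E * I) * (2 * L**3 - 3 * L**2 * a + a**3)  # tip deflection, in
sigma = P * (L - a) * (t / 2) / I                            # bending stress, psi

print(f"deflection = {delta:.4f} in, stress = {sigma:.0f} psi")
# prints: deflection = -0.0553 in, stress = 5760 psi
```

<p>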
Neither of the solutions explored is ideal for every scenario &#8212; some scenarios are better suited to one solution or the other &#8212; but both will improve many engineering&nbsp;reports.</p> <div class="footnote"> <hr> <ol> <li id="fn:1"> <p>W. Young and R. Budynas, Roark&#8217;s Formulas for Stress and Strain, Seventh Edition. New York: McGraw-Hill, 2002.&#160;<a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p> </li> </ol> </div>rde: Now on CRAN2018-07-10T02:46:34-04:002018-07-10T02:46:34-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2018-07-10:/2018/07/rde/<p>For the last couple of years, we&#8217;ve been using the statistical programming language <a href="https://www.r-project.org">R</a> when we do statistical analysis or data visualizations at work. We typically deal with <em>small data</em> &mdash; most of the time, our data sets are high-tens or low-hundreds of rows of&nbsp;data.</p> <p>A lot of the …</p><p>For the last couple of years, we&#8217;ve been using the statistical programming language <a href="https://www.r-project.org">R</a> when we do statistical analysis or data visualizations at work. We typically deal with <em>small data</em> &mdash; most of the time, our data sets are high-tens or low-hundreds of rows of&nbsp;data.</p> <p>A lot of the time, we create <a href="https://rmarkdown.rstudio.com/r_notebooks.html">R Notebooks</a> with our analysis and visualizations.
This works well for us: the R Notebook contains the code used to do the analysis, the results of the analysis and the visualizations, all in one place. This eliminates questions like: &#8220;did you remove outliers before making the graph?&#8221; Or, &#8220;did you check that the data are distributed normally before you did that test?&#8221; A reviewer of the R Notebook can see exactly what was&nbsp;done.</p> <p>By default, the R Notebook produces an html file that you can open in your browser. You can email this html file to a colleague, and they can see your results and graphs, as well as exactly how you obtained them. If you made a logical mistake, or an inappropriate assumption, your colleague has the opportunity to find&nbsp;it.</p> <p>The html file that the R Notebook exports to also contains a button that says &#8220;Download Rmd.&#8221; This allows your colleague to open the notebook in <a href="https://www.rstudio.com/">R Studio</a> and run your code. <em>If you sent your&nbsp;data.</em></p> <p>The one problem with just emailing R Notebooks to a colleague is that the R Notebook does not include the data. This might be okay if the data source is a file on a network, or a database that you both have access to, but in a lot of cases &mdash; at least in my work &mdash; the data is a <span class="caps">CSV</span> or Excel file. Now, if I want to send an R Notebook to a colleague to review, I need to remember to send the data file along with&nbsp;it.</p> <p>Enter <code>rde</code>.</p> <p>I wrote the package <a href="https://cran.r-project.org/web/packages/rde/"><code>rde</code></a> (which stands for Reproducible Data Embedding) to tackle this problem. This package allows you to embed data right in your R Notebook (or any other R code). It does so by compressing the data and then <a href="https://en.wikipedia.org/wiki/Base64">base-64 encoding</a> it into an <span class="caps">ASCII</span> string.
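The compress-then-encode round trip behind this idea can be sketched in a few lines of Python. Note that <code>rde</code> itself is an R package and its actual function names differ; <code>embed_data</code> and <code>restore_data</code> below are hypothetical names used only to illustrate the technique.

```python
# Sketch of the embed/restore idea behind rde, translated to Python.
# embed_data / restore_data are made-up names for illustration -- they
# are NOT the rde API.
import base64
import json
import zlib

def embed_data(rows):
    """Compress a small data set and base-64 encode it into an ASCII string."""
    raw = json.dumps(rows).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

def restore_data(blob):
    """Reverse the encoding: decode, decompress and parse the data back."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))

data = [{"specimen": 1, "strength": 41.2}, {"specimen": 2, "strength": 39.8}]
blob = embed_data(data)            # paste this string into the notebook
assert restore_data(blob) == data  # anyone re-running the notebook gets the data back
```

Because the encoded string is plain ASCII, it survives being pasted into a notebook, an email, or a code comment unchanged.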
This string can be pasted into the R Notebook and converted back into the original data when someone re-runs the&nbsp;Notebook.</p> <p>I won&#8217;t go into all the details of how to use the package. If you&#8217;d like to learn more, you can read the package <a href="https://cran.r-project.org/web/packages/rde/vignettes/rde_tutorial.html">vignette</a>.</p> <p>This isn&#8217;t the first R Package that I&#8217;ve written, but it is the first one that I&#8217;ve submitted to <a href="https://cran.r-project.org/"><span class="caps">CRAN</span></a>. When you install an R package using <code>install.packages()</code>, you&#8217;re installing it from <span class="caps">CRAN</span>. I think that <span class="caps">CRAN</span> is one of the best parts of the R ecosystem since it does <a href="https://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a> for all of the packages hosted there. This helps ensure that all the packages continue to work as R is updated and as other packages are updated. I&#8217;ll likely talk about this more in a future blog&nbsp;post.</p> <p>If you&#8217;re an R user and you think that the package <code>rde</code> would help you in your workflow, check it out. You can install it by typing <code>install.packages("rde")</code> in R. If you find a bug, please file an <a href="https://github.com/kloppen/rde/issues">issue on GitHub</a>. And, if you would like to add functionality or improve it in some way, feel free to send me a pull&nbsp;request.</p>Welcome to Kloppenborg.ca2018-06-30T19:29:19-04:002018-06-30T19:29:19-04:00Stefan Kloppenborgtag:www.kloppenborg.ca,2018-06-30:/2018/06/welcome/<p>Welcome to&nbsp;kloppenborg.ca</p> <p>I plan to use this website as a blog where I discuss topics related to engineering, technology and whatever else I&#8217;m thinking about at the&nbsp;time.</p> <p>If you find any of the posts here interesting, feel free to share them. 
If you don&#8217;t feel …</p><p>Welcome to&nbsp;kloppenborg.ca</p> <p>I plan to use this website as a blog where I discuss topics related to engineering, technology and whatever else I&#8217;m thinking about at the&nbsp;time.</p> <p>If you find any of the posts here interesting, feel free to share them. If you don&#8217;t, feel free to ignore&nbsp;them.</p>