Multivariate Statistical Process Control method comparison, Part 1: Simulating correlated data and optimizing data generation

Spencer R Hall
Jan 5, 2021 · 5 min read

Time to mix it up a bit from my usual game-probability simulations. This week I want to talk about a project I worked on last year doing simulations for Statistical Process Control.

For a brief primer: Statistical Process Control, or SPC, is a method for monitoring something like a manufacturing process. It's a form of time series analysis where you track a statistic computed from observations of the process and compare it to a set of limits on a control chart to determine whether the distribution of that statistic has likely changed. Ideally, by the time you see a data point out of bounds, you can investigate and fix the process before it actually harms the product.
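
To make that concrete, here's a minimal univariate sketch with made-up data and conventional three-sigma Shewhart limits (the multivariate charts discussed below generalize this same idea):

set.seed(42)
x <- rnorm(100, mean = 10, sd = 1)  # in-control observations
x[81:100] <- x[81:100] + 2          # simulate a mean shift late in the series

# Estimate the limits from an in-control baseline window...
center <- mean(x[1:50])
sigma  <- sd(x[1:50])
ucl <- center + 3 * sigma
lcl <- center - 3 * sigma

# ...then flag the points that fall outside them.
which(x > ucl | x < lcl)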

I was initially introduced to SPC during my internship, and I went on to discuss a project on the subject with a professor. The project looked at which type of control chart or statistic did the best job for different kinds of dependent data. In SPC, "best" means: for a fixed rate of falsely detecting a change, how quickly can you detect real changes of differing magnitudes?

Carrying this out in simulation required generating quite a bit of data and then calculating the appropriate control chart statistics from it. The plan was four different dependence structures, each at four different correlation strengths, and for each of those combinations a baseline simulation plus eight different mean shifts of the data, for a total of 144 different sets of simulations (4 × 4 × 9). For each set I was going to run 1000 simulations of 1000 pairs of data each, with statistics calculated for three different control charts: Hotelling T², MEWMA, and MCUSUM.

To generate the dependent pairs of data, I leveraged the copula package in R, and I used the MSQC package to calculate the statistics. When I started running this, I found that with the amount of data I was trying to produce and crunch, it was going to take nearly 24 hours of compute time. So I decided to bring in some help: the R parallelization packages foreach, doMC, and doRNG, plus a rewrite of the MSQC functions to cut out unnecessary plotting and speed things up.
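
Stripped of the simulation details, the pattern those parallelization packages give you looks roughly like this (a sketch; the seed and the loop body are stand-ins):

library(foreach)
library(doMC)
library(doRNG)

registerDoMC(4)       # four worker processes
registerDoRNG(12345)  # illustrative seed for reproducible parallel RNG streams

# %dorng% works like %dopar% but gives each iteration its own RNG stream,
# so the results don't depend on how the work is split across cores.
results <- foreach(k = 1:8) %dorng% {
  mean(rnorm(1000))   # stand-in for one simulation iteration
}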

In the end I got something a lot more concise and readable that could also run in under 3 hours. Given that I was only using 4 cores, that's more speedup than parallelization alone can explain, so the rewritten functions did real work too.

You can see the code in my CopulaControl repository on GitHub, or in summary below. For brevity I use MSQC directly here; on GitHub I use my modified MSQC functions instead.
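
If you'd rather not maintain modified package code at all, one lighter-weight alternative (a sketch, not what the repository does) is to send MSQC's plots to a null graphics device and discard them:

# Hypothetical wrapper: run mult.chart() with its plotting suppressed.
quiet_mult.chart <- function(...) {
  pdf(file = NULL)    # open a null graphics device
  on.exit(dev.off())  # close it even if mult.chart() errors
  mult.chart(...)
}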

library(copula)
library(MSQC)
library(tidyverse)
library(Hmisc)
library(foreach)
library(doMC)
library(doRNG)

# Wrap a copula with standard normal margins, optionally shifting the means.
copulaMvdGen <- function(myCopula, shift = c(0, 0)) {
  mvdc(copula = myCopula, margins = c("norm", "norm"),
       paramMargins = list(list(mean = shift[1], sd = 1),
                           list(mean = shift[2], sd = 1)))
}

# Build the list of distributions for one copula family at one correlation:
# element 1 is the baseline, followed by the single-variable shifts, then
# the same shifts applied to both variables.
mvdGen <- function(category = normalCopula, rho = 0.6, shiftList = c(0.5, 1, 2, 3)) {
  # iRho() inverts Spearman's rho to find the copula parameter.
  myCopula <- category(iRho(category(), rho))
  myMvd <- list()
  myMvd[[1]] <- copulaMvdGen(myCopula)
  for (i in seq_along(shiftList)) {
    myMvd[[1 + i]] <- copulaMvdGen(myCopula, c(0, shiftList[i]))
    myMvd[[1 + i + length(shiftList)]] <- copulaMvdGen(myCopula, rep(shiftList[i], 2))
  }
  return(myMvd)
}
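
As a quick sanity check on these generators, you can draw from one of the distributions and confirm the empirical rank correlation lands near the target (iRho() inverts Spearman's rho, so that's the scale to check on):

myMvd <- mvdGen(claytonCopula, rho = 0.6)
x <- rMvdc(1000, myMvd[[1]])              # baseline, unshifted distribution
cor(x[, 1], x[, 2], method = "spearman")  # should be close to 0.6
colMeans(x)                               # both margins centered near 0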


# Set the number of workers, i.e. how many threads to give to R.
# The iterations in generator.R are split across these workers.
registerDoMC(4)
registerDoRNG(20200218)

# Test parameters.
# copulaList: bivariate copula families used to generate simulation data.
#   Changing this isn't supported in run.R; edit the mvdGen() calls in generator.R.
# rhoList: correlations forced on the bivariate copulas.
#   Any value in the interval (-1, 1) is allowed.
# shiftList: shifts in the means of the bivariate data, listed as single-
#   and double-variable shifts. Changing this isn't supported in run.R;
#   edit the mvdGen() calls in generator.R.
# size: number of pairs of points generated per iteration of each experiment.
# iterations: number of repetitions of each experiment.
# cARL: in-control average run length (ARL) used to calibrate the UCL for
#   analysis; an ARL of 300 corresponds to a false-alarm rate of roughly
#   1/300 per point.
copulaList <- c("Normal", "Frank", "Clayton", "Gumbel")
rhoList <- c(0.6, 0.2, -0.2, -0.6)
shiftList <- c("0/0", "0/0.5", "0/1", "0/2", "0/3", "0.5/0.5", "1/1", "2/2", "3/3")
size <- 1000
iterations <- 1000
cARL <- 300

Nrho <- length(rhoList)
Ncopula <- length(copulaList)
Nshift <- length(shiftList)

I was then able to generate my data using a nested for loop. If I were rewriting this now, rather than coming back to it more than six months later, I would use something like expand.grid to build a master list of experiments and iterate over that list instead of nesting so many loops, as sketched below.
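
For example, a sketch of that alternative (not the code I actually ran):

# One row per experiment: 4 copulas x 4 correlations x 9 shifts = 144 cells.
experiments <- expand.grid(copula = copulaList,
                           rho    = rhoList,
                           shift  = shiftList,
                           stringsAsFactors = FALSE)
nrow(experiments)  # 144

# A single loop (or foreach) over the rows then replaces the nested loops:
# for (e in seq_len(nrow(experiments))) { ... experiments$copula[e] ... }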

simData <- vector("list", Nrho)
for (i in 1:Nrho) {
  # Build the full set of distributions (baseline plus shifts) for each
  # copula family at this correlation.
  myMvd <- list(mvdGen(normalCopula,  rho = rhoList[i]),
                mvdGen(frankCopula,   rho = rhoList[i]),
                mvdGen(claytonCopula, rho = rhoList[i]),
                mvdGen(gumbelCopula,  rho = rhoList[i]))
  simData[[i]] <- vector("list", Ncopula)
  for (j in 1:Ncopula) {
    # Parallelize over iterations; %dorng% keeps the random streams reproducible.
    simData[[i]][[j]] <- foreach(k = 1:iterations, .packages = "copula") %dorng% {
      tmpData <- vector("list", Nshift)
      for (l in 1:Nshift) {
        tmpVar <- as.data.frame(rMvdc(size, myMvd[[j]][[l]]))
        if (l == 1) {
          # Baseline run: estimate the in-control mean vector and covariance.
          t2chart <- mult.chart(tmpVar, type = "t2")
          mechart <- mult.chart(tmpVar, type = "mewma")
          mcchart <- mult.chart(tmpVar, type = "mcusum")
          t2data <- t2chart$t2
          medata <- mechart$t2
          mcdata <- mcchart$t2
        } else {
          # Shifted runs: reuse the baseline estimates as the in-control reference.
          t2data <- mult.chart(tmpVar, type = "t2",     Xmv = t2chart$Xmv, S = t2chart$covariance)$t2
          medata <- mult.chart(tmpVar, type = "mewma",  Xmv = mechart$Xmv, S = mechart$covariance)$t2
          mcdata <- mult.chart(tmpVar, type = "mcusum", Xmv = mcchart$Xmv, S = mcchart$covariance)$t2
        }
        # cbind() recycles the length-one labels to the full column length.
        tmpData[[l]] <- cbind(copulaList[j],
                              rhoList[i],
                              shiftList[l],
                              k,
                              tmpVar,
                              1:size,
                              t2data,
                              medata,
                              mcdata)
        colnames(tmpData[[l]]) <- c("copula", "rho", "shift", "iteration",
                                    "data1", "data2", "N", "t2", "mewma", "mcusum")
      }
      tmpData
    }
  }
}
gc()

# Flatten the nested lists into one data frame in stages to limit peak memory.
tmpData <- vector("list", Nrho)
for (i in 1:Nrho) {
  tmpData[[i]] <- vector("list", Ncopula)
  for (j in 1:Ncopula) {
    tmpData[[i]][[j]] <- bind_rows(simData[[i]][[j]])
  }
}
rm(simData)
gc()

simData <- bind_rows(tmpData)
rm(tmpData)
gc()

simData$copula <- factor(simData$copula, levels = copulaList)
simData$rho <- factor(simData$rho, levels = rhoList)
simData$shift <- factor(simData$shift, levels = shiftList)

Even with the loops, this is much easier to follow than my earlier drafts: I loop over the correlations, then the copulas, then the 1000 iterations, then the shifts, generating the data for each combination before binding it all together.

In part 2, I'll go into further detail about how I generated the visualizations and reached my conclusions from this simulation. For now, I'll just leave you with one of the plots I made.

Originally published at https://heril.github.io/.
