* Student project: *

Reduced-rank time series analysis for Pragmatic Subspace Identification

This project concerns how to use multivariate reduced-rank methods from chemometrics and computational statistics in control engineering, within the framework of Big Data Cybernetics.

In statistical ARMAX time series analysis of, say, a single observed variable y and a single control variable u, the observation at time t, y(t), is modelled as a linear function of previous observations of the same variable, y(t-1), y(t-2), …, plus a linear function of the control variable, u(t), u(t-1), u(t-2), …:

y(t) = y(t-1)*β(1) + y(t-2)*β(2) + … + u(t)*α(0) + u(t-1)*α(1) + u(t-2)*α(2) + … + ε(t)

This can be written in matrix form as equation (1), using the column vector

**y** (n x 1) = [y(t)] at times t = 1, 2, …, n,

and

**X**(n x p) =[ y(t-1), y(t-2),…, u(t), u(t-1), u(t-2),… ]

** y = Xb + ε (1) **

where the vector of unknown parameters, **b** = [β(1), β(2),…, α(0), α(1), α(2),… ] is to be identified (estimated) from the available y- and u data.
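As an illustration of how **X** and **y** in (1) can be assembled from recorded data, here is a minimal Python/NumPy sketch. The function name `build_arx_regressors` and the lag counts `n_y_lags` and `n_u_lags` are hypothetical choices made for this example; the project text does not prescribe them.

```python
import numpy as np

def build_arx_regressors(y, u, n_y_lags, n_u_lags):
    """Stack lagged copies of y and u into the regressor matrix X of model (1).

    Columns of X: y(t-1), ..., y(t-n_y_lags), u(t), u(t-1), ..., u(t-n_u_lags).
    Rows start at the first time index for which every lag is available.
    """
    y = np.asarray(y, dtype=float)
    u = np.asarray(u, dtype=float)
    n = len(y)
    start = max(n_y_lags, n_u_lags)  # first usable time index
    y_cols = [y[start - k : n - k] for k in range(1, n_y_lags + 1)]
    u_cols = [u[start - k : n - k] for k in range(0, n_u_lags + 1)]
    X = np.column_stack(y_cols + u_cols)
    return X, y[start:]

# Toy data, only to show the layout of the rows and columns.
X_demo, y_demo = build_arx_regressors(
    [1, 2, 3, 4, 5], [10, 20, 30, 40, 50], n_y_lags=2, n_u_lags=1
)
print(X_demo)  # each row: y(t-1), y(t-2), u(t), u(t-1)
print(y_demo)  # the matching targets y(t)
```

The first rows of the series are dropped because their lagged values are not available; other start-up conventions (for instance padding with zeros) are possible.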

In traditional ARMAX modelling, the estimation is often performed for each measured variable **y** separately, by ordinary least squares (OLS), minimizing **ε’ε**:

Estimated ** b = (X’X)^{-1}X’y **

However, this full-rank OLS analysis can lead to overfitting and variance inflation (noisy identification). For instance, if **y** or **u** have strong, smooth temporal dependencies, or if **y** and **u** are strongly intercorrelated, **X’X** is rank-deficient or nearly so (ill-conditioned), and hence cannot be inverted in a numerically stable way.
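The effect of intercorrelated columns on **X’X** can be seen in a small numerical experiment. The following Python/NumPy sketch uses synthetic data invented for this example (two regressors, one of which is almost a copy of the other) and compares the condition number of **X’X** with that of a well-behaved case. The helper name `ols` is hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares estimate of b in y = Xb + e.

    Equals (X'X)^{-1} X'y when X has full column rank; lstsq avoids
    forming the inverse explicitly.
    """
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2_indep = rng.standard_normal(n)              # unrelated to x1
x2_coll = x1 + 1e-6 * rng.standard_normal(n)   # almost a copy of x1

X_good = np.column_stack([x1, x2_indep])
X_bad = np.column_stack([x1, x2_coll])

# A large condition number means small data errors are strongly amplified.
cond_good = np.linalg.cond(X_good.T @ X_good)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)
print(cond_good, cond_bad)

# With well-conditioned data, noise-free parameters are recovered.
b_true = np.array([1.0, -2.0])
b_hat = ols(X_good, X_good @ b_true)
print(b_hat)
```

In the collinear case the condition number is many orders of magnitude larger, which is the numerical face of the variance inflation described above.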

With the advent of modern multichannel measuring devices such as spectrometers, cameras, spectrograms from microphones/accelerometers and combinations of simple sensors within the Internet of Things, there may be thousands of time series **Y** = [**y**(j), j=1,2,…,J] measured simultaneously.

Moreover, complex systems are often affected by many control variables **U**=[**u**(k),k=1,2,…,K ]. This calls for Big Data Cybernetics!

The model (1) may be extended

** Y = XB+ ε (2) **

where **X** now includes the time-shifted elements from all the J measured Y-variables and all the K known U-variables. But then the reduced-rank problem, sometimes called “the multicollinearity problem”, becomes even more serious when many Y-variables are measured and many U-variables are controlled.

A number of approaches have been developed to overcome this reduced-rank problem, such as ridge regression (RR), Total Least Squares/Latent Root Regression (TLS/LRR), elastic net regression (ENR), principal component regression (PCR) and partial least squares regression (PLSR). In chemometrics, PCR and PLSR are particularly popular due to their pragmatic, flexible nature, their ease of statistical validation and their ability to give graphical insight into the subspace controlling the variation patterns in **X** and their impact on **Y**. PCR and PLSR share the same subspace formulation:

Latent variables **Z** = [**z**(1), **z**(2), …, **z**(A)] are defined as linear combinations of the X-variables, based on weights **V**:

** Z=X*V **

These latent variables **Z** are then used for modelling both **X** and **Y**:

** X=Z*P’ + E **

** Y=Z*Q’ + F **

The number of latent variables, A, is determined by e.g. cross-validation, cross-model validation or other statistical or graphical tools. The (usually) low number of major latent variables gives the owner of the data good graphical overviews of what is going on in the data and facilitates the identification of problematic outliers, heterogeneities, strong nonlinearities etc.
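One way to make the choice of A concrete is to compute a cross-validated prediction error for each candidate rank. The Python/NumPy sketch below does this for PCR, taken here only as an example method. The function names, the interleaved fold layout, the PRESS criterion and the synthetic two-factor data are all assumptions made for this illustration.

```python
import numpy as np

def pcr_coefficients(X, Y, A):
    """Rank-A PCR coefficients for centred X and Y (weights V times loadings Q')."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:A].T                                    # first A right singular vectors
    Q = np.linalg.lstsq(X @ V, Y, rcond=None)[0].T  # regress Y on the scores
    return V @ Q.T

def cv_press(X, Y, max_A, n_folds=5):
    """Cross-validated prediction error sum of squares for A = 1..max_A."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Interleaved segments; contiguous blocks may suit time series better.
    folds = np.arange(X.shape[0]) % n_folds
    press = np.zeros(max_A)
    for k in range(n_folds):
        test = folds == k
        x_mean = X[~test].mean(axis=0)   # centre with training means only
        y_mean = Y[~test].mean(axis=0)
        for A in range(1, max_A + 1):
            B = pcr_coefficients(X[~test] - x_mean, Y[~test] - y_mean, A)
            Y_hat = (X[test] - x_mean) @ B + y_mean
            press[A - 1] += np.sum((Y[test] - Y_hat) ** 2)
    return press

# Synthetic data: six X-variables driven by two latent factors plus small noise.
rng = np.random.default_rng(1)
n = 60
T = rng.standard_normal((n, 2)) * np.array([3.0, 1.0])
L = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
X = T @ L + 0.01 * rng.standard_normal((n, 6))
Y = T[:, [0]] + T[:, [1]] + 0.01 * rng.standard_normal((n, 1))

press = cv_press(X, Y, max_A=4)
print(press)  # the error drops sharply once A reaches the number of factors
```

A common rule of thumb is to pick the smallest A after which the cross-validated error no longer decreases appreciably.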

The regression coefficient **B** at rank A is then defined from

** Y = ZQ’+F = (X*V)*Q’ + F = X*(V*Q’) + F =X*B + F **
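The chain Z = X*V, X = Z*P’ + E, Y = Z*Q’ + F and B = V*Q’ can be traced numerically. The Python/NumPy sketch below does so for PCR, where the weights **V** are the leading right singular vectors of **X**. It assumes **X** and **Y** have been mean-centred; the function name `pcr` and the synthetic rank-2 data are hypothetical.

```python
import numpy as np

def pcr(X, Y, A):
    """Rank-A principal component regression for mean-centred X and Y.

    Returns weights V, scores Z = XV, X-loadings P, Y-loadings Q and
    regression coefficients B = VQ'.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:A].T                                   # p x A, orthonormal columns
    Z = X @ V                                      # n x A latent variables
    P = V                                          # for this basis, X'Z(Z'Z)^{-1} = V
    Q = np.linalg.lstsq(Z, Y, rcond=None)[0].T     # J x A, least squares fit of Y on Z
    B = V @ Q.T                                    # p x J
    return V, Z, P, Q, B

# Synthetic data: five X-variables of exact rank 2, three Y-variables.
rng = np.random.default_rng(2)
X_demo = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 5))
X_demo -= X_demo.mean(axis=0)
Y_demo = X_demo @ rng.standard_normal((5, 3))

V, Z, P, Q, B = pcr(X_demo, Y_demo, A=2)
print(np.abs(X_demo - Z @ P.T).max())   # size of the residual E
print(np.abs(Y_demo - X_demo @ B).max())  # size of the residual F
```

Because the example data have exact rank 2, two latent variables reproduce both **X** and **Y** up to rounding error, even though **X’X** itself is singular and plain OLS would be ill-defined.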

PLSR estimates the regression coefficients **B** by using both X and Y information simultaneously, identifying the A subspace dimensions as the first few linear combinations of the X-variables that display meaningful covariance with corresponding linear combinations of the Y-variables. Thereby the “multicollinearity problem” is turned into a “multicollinearity advantage”. This is the most popular method in the Do-It-Yourself multivariate data modelling package The Unscrambler, used in subjects TTK19 and TK8116 at ITK.
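To illustrate how PLSR lets **Y** steer the choice of subspace, here is a minimal Python/NumPy sketch of one common formulation: each weight vector is the dominant left singular vector of the current X’Y, and **X** is deflated after each component. This is a sketch under stated assumptions (mean-centred data, hypothetical function name `plsr`, synthetic data), not the implementation used in The Unscrambler.

```python
import numpy as np

def plsr(X, Y, A):
    """Rank-A PLS regression for mean-centred X and Y.

    Returns V (weights acting on the original X), scores Z = XV,
    X-loadings P, Y-loadings Q and coefficients B = VQ'.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    p, J = X.shape[1], Y.shape[1]
    W = np.zeros((p, A))
    P = np.zeros((p, A))
    Q = np.zeros((J, A))
    Xa = X.copy()
    for a in range(A):
        # Direction in X-space with maximal covariance with Y.
        w = np.linalg.svd(Xa.T @ Y, full_matrices=False)[0][:, 0]
        t = Xa @ w                       # score vector for this component
        tt = t @ t
        P[:, a] = Xa.T @ t / tt          # X-loading
        Q[:, a] = Y.T @ t / tt           # Y-loading (scores are orthogonal)
        W[:, a] = w
        Xa = Xa - np.outer(t, P[:, a])   # remove what this component explains
    V = W @ np.linalg.inv(P.T @ W)       # converts deflated-X weights to original-X weights
    Z = X @ V
    B = V @ Q.T
    return V, Z, P, Q, B

# Synthetic data: five X-variables of exact rank 2, three Y-variables.
rng = np.random.default_rng(3)
X_demo = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 5))
X_demo -= X_demo.mean(axis=0)
Y_demo = X_demo @ rng.standard_normal((5, 3))

V, Z, P, Q, B = plsr(X_demo, Y_demo, A=2)
G = Z.T @ Z
print(np.abs(Y_demo - X_demo @ B).max())  # size of the residual F
print(G)                                  # off-diagonal entries are close to zero
```

The only difference from PCR in this sketch is how the weights are chosen: PCR looks at **X** alone, whereas here each weight vector is taken from the covariance between the deflated **X** and **Y**.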

In PCR, the subspace of **X** is identified independently of **Y**, in terms of the singular vectors of **X** that have the largest, statistically meaningful singular values; these are subsequently used for modelling both the X- and Y-spaces. This can be advantageous if the number of X-variables and/or Y-variables is overwhelming, or if **X** and **Y** are not sampled in the same way. This is the basis for the new OnTheFlyProcessing methodology for Quantitative Big Data, now developed in Idletechs AS.

The proposed project consists of a literature survey and some basic Matlab programming to reveal how PLSR, PCA and other pragmatic subspace identification methods that give statistically stable and graphically interpretable regression models can be used in a Big Data Cybernetics framework.

This can later be extended into an MSc topic.

The project is linked to NTNU's strategic initiative in Big Data Cybernetics. It is recommended to follow the specialization topic TTK19 in parallel with the project.

There may be restrictions on copyright, including publication.

Main supervisor: Prof. Harald Martens (NTNU ITK / Idletechs)

Co-supervisor: Dr. Aria Rahmati (Idletechs)