13
Wavelet and Principal Components Analyses: Nonstationary Signals and Data Compression

13.1 Problem: Spectral Analysis of Nonstationary Signals

Problem You have sampled the signal in Figure 13.1 that seems to contain an increasing number of frequencies as time increases. Your problem is to undertake a spectral analysis of this signal that tells you, in the most compact way possible, the amount of each frequency present at each instant of time. Hint: Although we want the method to be general enough to work with numerical data, for pedagogical purposes it is useful to know that the signal is

(13.1) $c13-math-0001$

13.2 Wavelet Basics

The Fourier analysis we used in Chapter 12 reveals the amount of the harmonic functions sin(ωt) and cos(ωt), and their overtones, that are present in a signal. An expansion in periodic functions is fine for stationary signals (those whose forms do not change in time) but has shortcomings for the variable form of our problem signal (13.1). One such problem is that the Fourier reconstruction has all constituent frequencies occurring simultaneously and so does not contain time resolution information indicating when each frequency occurs. Another shortcoming is that all the Fourier components are correlated, which results in more information being stored than may be needed to reconstruct the measured signal.

c13f001 — **Figure 13.1** The input time signal (13.1) we wish to analyze. The signal is seen to contain additional frequencies as time increases. The boxes are possible placements of windows for short-time Fourier transforms.

There are a number of techniques that extend simple Fourier analysis to non-stationary signals. The idea behind wavelet analysis is to expand a signal in a complete set of functions (wavelets), each of which oscillates for a finite period of time, and each of which is centered at a different time. To give you a preview before we get into the details, we show four sample wavelets in Figure 13.2. Because each wavelet is local in time, it is a wave packet,¹⁾ with its time localization leading to a spectrum with a range of frequencies. These wave packets are called “wavelets” because they exist for only short periods of time (Polikar, 2001).

Although wavelets are required to oscillate in time, they are not restricted to a particular functional form (Addison, 2002; Goswani and Chan, 1999; Graps, 1995). As a case in point, they may be oscillating Gaussians (Morlet: in Figure 13.2a),

(13.2) $c13-math-0002$

the second derivative of a Gaussian (Mexican hat, Figure 13.2b),

(13.3) $c13-math-0003$

an up-and-down step function (Figure 13.2c), or a fractal shape (Figure 13.2d). All of these wavelets are localized in both time and frequency, that is, they are large for just a limited time and contain a limited range of frequencies. As we shall see, translating and scaling a mother wavelets generates an entire set of daughter wavelet or basis functions, with the daughters covering different frequency ranges at different times.

c13f002 — **Figure 13.2** Four possible mother wavelets that can be used to generate entire sets of daughter wavelets. (a–d) Morlet (real part), Mexican hat, Daub4 e6 (explained later), and Haar. The daughter wavelets are generated by scaling and translating these mother wavelets.

13.3 Wave Packets and Uncertainty Principle (Theory)

A wave packet or wave train is a collection of waves of differing frequencies added together in such a way as to produce a pulse of width Δt. As we shall see, the Fourier transform of a wave packet is a pulse in the frequency domain of width Δω. We will first study such wave packets analytically, and then use others numerically. An example of a simple wave packet is just a sine wave that oscillates at frequency ω₀ for N periods (Figure 13.3a) (Arfken and Weber, 2001)

(13.4) $c13-math-0004$

where we relate the frequency to the period via the usual ω₀ = 2π/T. In terms of these parameters, the width of the wave packet is

(13.5) $c13-math-0005$

The Fourier transform of the wave packet (13.4) is calculated via a straightforward application of the transform formula (12.19):

(13.6) $c13-math-0006$

c13f003 — **Figure 13.3** (a) A wave packet in time corresponding to the functional form (13.4) with ω₀ = 5 and N = 6. (b) The Fourier transform in frequency of this same wave packet.

where we have dropped a factor of –i that affects only the phase. While at first glance (13.6) appears to be singular at ω = ω₀, it actually just peaks there (Figure 13.3b), reflecting the predominance of frequency ω₀. Note that although the signal y(t) appears to have only one frequency, it does drop off sharply in time (Figure 13.3a), and these corners give Y(ω) a width Δω.

There is a fundamental relation between the widths Δt and Δω of a wave packet. Although we use a specific example to determine that relation, it is true in general. While there may not be a precise definition of “width” for all functions, one can usually deduce a good measure of the width (say, within 25%). To illustrate, if we look Figure 13.3b, it makes sense to use the distance between the first zeros of the transform Y(ω) (13.6) as the frequency width Δω. The zeros occur at

(13.7) $c13-math-0008$

where N is the number of cycles in our original wave packet. Because the wave packet in time makes N oscillations each of period T, a reasonable measure of the time width Δt of the signal y(t) is

(13.8) $c13-math-0009$

When the products of the frequency width (13.7) and the time width (13.8) are combined, we obtain

(13.9) $c13-math-0010$

The greater than sign is used to indicate that this is a minimum, that is, that y(t) and Y(ω) extend beyond Δt and Δω, respectively. Nonetheless, most of the signal and transform should lie within the bound (13.9).

A relation of the form (13.9) also occurs in quantum mechanics, where it is known as the Heisenberg uncertainty principle, with Δt and Δω being called the uncertainties in t and ω. It is true for transforms in general and states that as a signal is made more localized in time (smaller Δt) the transform becomes less localized (larger Δω). Conversely, the sine wave y(t) = sin ω₀t is completely localized in frequency, and consequently has an infinite extent in time, $c13-math-0011$ .

13.3.1 Wave Packet Assessment

Consider the following wave packets:

(13.10) $c13-math-0012$

For each wave packet:

Estimate the width Δt. A good measure might be the full width at half-maxima (FWHM) of |y(t)|.
Use your DFT program to evaluate and plot the Fourier transform Y(ω) for each wave packet. Make both a linear and a semilog plot (small components are often important, yet not evident in linear plots). Make sure that your transform has a good number of closely spaced frequency values over a range that is large enough to show the periodicity of Y(ω).
What are the units for Y(ω) and ω in your DFT?
For each wave packet, estimate the width Δω. A good measure might be the full width at half-maxima of |Y(ω)|.
For each wave packet determine approximate value for the constant C of the uncertainty principle

(13.11) $c13-math-0013$

13.4 Short-Time Fourier Transforms (Math)

The constant amplitude of the functions sin nωt and cos nωt for all times can limit the usefulness of Fourier analysis for reproducing signals. Because these functions and their overtones extend over all times with a constant amplitude, there is considerable overlap among them, and consequently the information present in various Fourier components are correlated. This is undesirable for data storage and compression, where you want to store a minimum number of data information and also want to adjust the amount stored based on the desired quality of the reconstructed signal.²⁾ In lossless compression, which exactly reproduces the original signal, you save space by storing how many times each data element is repeated, and where each element is located. In lossy compression, in addition to removing repeated elements, you also eliminate some transform components consistent with the uncertainty relation (13.9) and with the level of resolution required in the reproduction. This leads to yet greater compression.

In Section 12.5, we defined the Fourier transform Y(ω) of signal y(t) as

(13.12) $c13-math-0014$

As is true for simple vectors, you can think of (13.12) as giving the overlap or scalar product of the basis function $c13-math-0015$ and the signal y(t) (notice that the complex conjugate of the exponential basis function appears in (13.12)). Another view of (13.12) is as the mapping or projection of the signal y into ω space. In this latter case the overlap projects out the amount of the periodic function $c13-math-0015$ in the signal. In other words, the Fourier component Y(ω) is also the correlation between the signal y(t) and the basis function $c13-math-0015$ , which is the same as what results from filtering the signal y(t) through a frequency-ω filter. If there is no exp(iωt) in the signal, then the integral vanishes and there is no output. If $c13-math-0016$ , the signal is at only one frequency, and the integral is accordingly singular.

The signal in Figure 13.1 for our problem clearly has different frequencies present at different times and for different lengths of time. In the past, this signal might have been analyzed with a precursor of wavelet analysis known as the short-time Fourier transform. With that technique, the signal y(t) is “chopped up” into different segments along the time axis, with successive segments centered about successive times τ₁, τ₂, …, τ_N. For instance, we show three such segments in the boxes of Figure 13.1. Once we have the dissected signal, a Fourier analysis is made for each segment. We are then left with a sequence of transforms $c13-math-0017$ , one for each short-time interval, where the superscript ^ST indicates short time.

Rather than chopping up a signal, we can express short-time Fourier transforming mathematically by imagining translating a window function w(t – τ), which is zero outside of some chosen interval, over the signal in Figure 13.1:

(13.13) $c13-math-0018$

Here the values of the translation time τ correspond to different locations of the window w over the signal, and the window function is essentially a transparent box of small size on an opaque background. Any signal within the width of the window is transformed, while the signal lying outside the window is not seen. Note that in (13.13), the extra variable τ in the Fourier transform indicates the location of the time around which the window was placed. Clearly, because the short-time transform is a function of two variables, a surface or 3D plot is needed to view the amplitude as a function of both ω and τ.

13.5 The Wavelet Transform

The wavelet transform of a time signal y(t) is defined as

(13.14) $c13-math-0019$

and is similar in concept and notation to a short-time Fourier transform. The difference is rather than using exp(iωt) as the basis functions, here we are using wave packets or wavelets ψ_S,τ(t) localized in time, such as the those shown in Figure 13.2. Because each wavelet is localized in time, each acts as its own window function. Because each wavelet is oscillatory, each contains its own small range of frequencies.

Equation 13.14 says that the wavelet transform Y(s, τ) is a measure of the amount of basis function ψ_{s, τ}(t) present in the signal y(t). The τ variable indicates the time portion of the signal being decomposed, while the s variable is equivalent to the frequency present during that time:

(13.15) $c13-math-0020$

Because it is key to much that follows, it is a good idea to think about (13.15) for a while. If we are interested in the time details of a signal, then this is another way of saying that we are interested in what is happening at small values of the scale s. Equation 13.15 indicates that small values of s correspond to high-frequency components of the signal. That being the case, the time details of the signal are in the high-frequency, or low-scale, components.

13.5.1 Generating Wavelet Basis Functions

The conceptual discussion of wavelets is over, and it is time to get down to work. We first need a technique for generating wavelet basis functions, and then we need to discretize this technique. As is often the case, the final formulation will turn out to be simple and short, but it will be a while before we get there.

Just as the expansion of an arbitrary function in a complete set of orthogonal functions is not restricted to any particular set, so too is the wavelet transform not restricted to any particular wavelet basis, although some might be better than others for a given signal. The standard way to generate a family of wavelet basis functions starts with Ψ(t), a mother or analyzing function of the real variable t, and then to use this to generate daughter wavelets. As a case in point, we start with the mother wavelet

(13.16) $c13-math-0021$

By scaling, translating, and normalizing this mother wavelet,

(13.17) $c13-math-0022$

we generate the four wavelet basis functions (daughters) displayed in Figure 13.4. We see that larger or smaller values of s, respectively, expand or contract the mother wavelet, while different values of τ shift the center of the wavelet. Because the wavelets are inherently oscillatory, the scaling leads to the same number of oscillations occurring in different time spans, which is equivalent to having basis states with differing frequencies. We see that s < 1 produces a higher frequency wavelet, while s > 1 produces a lower frequency one, both of the same shape. As we shall see, we do not need to store much information to outline the large-time-scale s behavior of a signal (its smooth envelope), but we do need more information to specify its short-time-scale s behavior (details). And if we want to resolve yet finer features in the signal, then we will need to have more information on yet finer details. Here the division by $c13-math-0022a$ is made to ensure that there is equal “power” (or energy or intensity) in each region of s, although other normalizations can also be found in the literature. After substituting in the definition of daughters, the wavelet transform (13.14) and its inverse (van den Berg, 1999) are

(13.18) $c13-math-0023$

(13.19) $c13-math-0024$

where the normalization constant C depends on the wavelet used. In summary, wavelet bases are functions of the time variable t, as well as of the two parameters s and τ. The t variable is integrated over to yield a transform that is a function of the time scale s (frequency 2π/s) and window location τ. You can think of scale as being like the scale on a map (also discussed in Section 16.5.2 in relation to fractal analysis) or in terms of resolution, as might occur in photographic images. Regardless of the words, as we see in Chapter 16, if we have a fractal, then we have a self-similar object that looks the same at all scales or resolutions. Similarly, each wavelet in a set of basis functions is self-similar to the others, but at a different scale or location. The general requirements for a mother wavelet Ψ are (Addison, 2002; van den Berg, 1999)

Ψ(t) is real.
Ψ(t) oscillates around zero such that its average is zero:
(13.20) $c13-math-0025$

Figure 13.4 Four wavelet basis functions (daughters) generated by scaling (s) and translating (τ) an oscillating Gaussian mother wavelet. (a–d) (s = 1, τ = 0), (s = 1/2, τ = 0), (s = 1, τ = 6), and (s = 2, τ = 60). Note how s < 1 is a wavelet with higher frequency, while s > 1 has a lower frequency than the s = 1 mother. Likewise, the τ = 6 wavelet is just a translated version of the τ = 0 one directly above it.
Ψ(t) is local, that is, a wave packet, and is square integrable:
(13.21) $c13-math-0026$
The transforms of low powers of t vanish, that is, the first p moments:
(13.22) $c13-math-0027$

This makes the transform more sensitive to details than to general shape.

As an example of how we use the s and τ degrees of freedom in a wavelet transform, consider the analysis of a chirp signal y(t) = sin(60t²) (Figure 13.5). We see that a slice at the beginning of the signal is compared to our first basis function. (The comparison is carried out via the convolution of the wavelet with the signal.) This first comparison is with a narrow version of the wavelet, that is, at low scale, and yields a single coefficient. The comparison at this scale continues with the next signal slice, and eventually ends when the entire signal has been covered (the top row in Figure 13.5). Then in the second row, the wavelet is expanded to larger s values, and comparisons are repeated. Eventually, the data are processed at all scales and at all time intervals. The narrow wavelets correspond to a high-resolution analysis, while the broad wavelets correspond to low resolution. As the scales get larger (lower frequencies, lower resolution), fewer details of the time signal remain visible, but the overall shape or gross features of the signal become clearer.

c13f005 — **Figure 13.5** A schematic representation of the steps followed in performing a wavelet transformation over all time displacements and scales. The upper signal is first analyzed by evaluating its overlap with a narrow wavelet at the signal’s beginning. This produces a coefficient that measures the similarity of the signal to the wavelet. The wavelet is successively shifted over the length of the signal and the overlaps are successively evaluated. After the entire signal is covered, the wavelet is expanded and the entire analysis is repeated.

13.5.2 Continuous Wavelet Transform Implementation

We want to develop some intuition as to what wavelet transforms look like before going on to apply them in unknown situations and to develop a discrete algorithm. Accordingly, modify the program you have been using for the Fourier transform so that it now computes the continuous wavelet transform.

You will want to see the effect of using different mother wavelets. Accordingly, write a method that calculates the mother wavelet for
1. a) a Morlet wavelet (13.2),
2. b) a Mexican hat wavelet (13.3),
3. c) a Haar wavelet (the square wave in Figure 13.2).
Try out your transform for the following input signals and see if the results make sense:
1. a) A pure sine wave y(t) = sin 2πt,
2. b) A sum of sine waves y(t) = 2.5 sin 2πt + 6 sin 4πt + 10 sin 6πt,
3. c) The nonstationary signal for our problem (13.1)
  (13.23) $c13-math-0028$
  
  Figure 13.6 Comparison of an input and reconstituted signal (13.23) using Morlet wavelets (the curves overlap nearly perfectly). As expected for Fourier transforms, the reconstruction is least accurate near the endpoints.
4. d) The half-wave function
  (13.24) $c13-math-0029$
$c13-math-0029a$ Use (13.19) to invert your wavelet transform and compare the reconstructed signal to the input signal (you can normalize the two to each other). In Figure 13.6 we show our reconstruction.

In Listing 11.1, we give our continuous wavelet transformation CWT.py (Lang and Forinash, 1998). Because wavelets, with their transforms in two variables, are somewhat hard to grasp at first, we suggest that you write your own code and include a portion that does the inverse transform as a check. In the next section we will describe the discrete wavelet transformation that makes optimal discrete choices for the scale and time translation parameters s and τ. Figure 13.7 shows a surface plot of the spectrum produced for the input signal (13.1) in Figure 13.1. In realization of our goal, we see predominantly one frequency at short times, two frequencies at intermediate times, and three frequencies at longer times.

13.6 Discrete Wavelet Transforms, Multiresolution Analysis $c13-math-0029a$

As was true for DFTs, if a time signal is measured at only N discrete times,

(13.25) $c13-math-0030$

then we can determine only N independent components of the transform Y. The trick with wavelets is to remain consistent with the uncertainty principle as we compute only the N independent components required to reproduce the signal. The discrete wavelet transform (DWT) evaluates the transforms with discrete values for the scaling parameter s and the time translation parameter τ:

c13f007 — **Figure 13.7** The continuous wavelet spectrum obtained by analyzing the input signal with Morelet wavelets. Observe how at small values of time τ there is predominantly one frequency present, how a second, higher frequency (smaller scale) component enters at intermediate times, and how at larger times a still higher frequency components enter (figure courtesy of `Z. Dimcovic`).

(13.26) $c13-math-0031$

(13.27) $c13-math-0032$

Here j and k are integers whose maximum values are yet to be determined, and we have assumed that the total time interval T = 1, so that time is always measured in fractions. This choice of s and τ based on powers of 2 is called a dyadic grid arrangement and will be seen to automatically perform the scalings and translations at the different time scales that are at the heart of wavelet analysis.³⁾ The DWT now becomes

(13.28) $c13-math-0033$

where the discreteness here refers to the wavelet basis set and not the time variable. For an orthonormal wavelet basis, the inverse discrete transform is then

(13.29) $c13-math-0034$

This inversion will exactly reproduce the input signal at the N input points, but only if we sum over an infinite number of terms (Addison, 2002). Practical calculations will be less exact.

c13f008 — **Listing 13.1** `CWT.py` computes a normalized continuous wavelet transform of the signal data in input (here assigned as a sum of sine functions) using Morlet wavelets (courtesy of `Z. Dimcovic`). The discrete wavelet transform (DWT) is faster and yields a compressed transform, but is less transparent.

c13f009 — **Listing 13.1** `CWT.py` computes a normalized continuous wavelet transform of the signal data in input (here assigned as a sum of sine functions) using Morlet wavelets (courtesy of `Z. Dimcovic`). The discrete wavelet transform (DWT) is faster and yields a compressed transform, but is less transparent.

Note in (13.26) and (13.28) that we have kept the time variable t in the wavelet basis functions continuous, despite the fact that s and τ have been made discrete. This is useful in establishing the orthonormality of the basis functions

(13.30) $c13-math-0035$

where δ_m,n is the Kronecker delta function. Being normalized to 1 means that each wavelet basis has “unit energy”; being orthogonal means that each basis function is independent of the others. And because wavelets are localized in time, the different transform components have low levels of correlation with each other. Altogether, this leads to efficient and flexible data storage.

The use of a discrete wavelet basis makes it clear that we sample the input signal at the discrete values of time determined by the integers j and k. In general, you want time steps that sample the signal at enough times in each interval to obtain the desired level of precision. A rule of thumb is to start with 100 steps to cover each major feature. Ideally, the needed times correspond to the times at which the signal was sampled, although this may require some forethought.

Consider Figure 13.8. We measure a signal at a number of discrete times within the intervals (k or τ values) corresponding to the vertical columns of fixed width along the time axis. For each time interval we want to sample the signal at a number of scales (frequencies or j values). However, as discussed in Section 13.3, the basic mathematics of Fourier transforms indicates that the width Δt of a wave packet ψ(t) and the width Δω of its Fourier transform Y(ω) are related by an uncertainty principle

$c13-math-0036$

This relation places a constraint on the intervals in which can measure times based on the intervals in which we deduce frequencies. So while we may want a high-resolution reproduction of our signal, we do not want to store more data than are needed to obtain that reproduction. If we sample the signal for times centered about some τ in an interval of width Δτ (Figure 13.8) and then compute the transform at a number of scales s or frequencies $c13-math-0037$ covering a range of height Δω, then the relation between the height and width is restricted by the uncertainty relation, which means that each of the rectangles in Figure 13.8 has the same area Δω Δt = 2π. The increasing heights of the rectangles at higher frequencies means that a larger range of frequencies should be sampled as the frequency increases. The premise here is that the low-frequency components provide the gross or smooth outline of the signal which, being smooth, does not require much detail, while the high-frequency components give the details of the signal over a short time interval and so require many components in order to record these details with high resolution.

c13f010 — **Figure 13.8** A graphical representation of the relation between time and frequency resolutions (the uncertainty relation). Each box represents an equal portion of the time-frequency plane but with different proportions of time and frequency.

c13f011 — **Figure 13.9** A multifrequency dyadic (power-of-2) filter tree used for discrete wavelet transformations. The L boxes represent low-pass filters and the H boxes represent high-pass filters. Each filter performs a convolution (transform). The circles containing “↓ 2” filter out half of the signal that enters them, which is called *subsampling* or *factor-of-2 decimation*. The signal on the left yields a transform with a single low and two high components (less information is needed about the low components for a faithful reproduction).

Industrial-strength wavelet analyses do not compute explicit integrals, but instead apply a technique known as multiresolution analysis (MRA) (Mallat, 1989). We give an example of this technique in Figure 13.9 and in the code DWT.py in Listing 13.2. It is based on a pyramid algorithm that samples the signal at a finite number of times, and then passes it successively through a number of filters, with each filter representing a digital version of a wavelet.

Filters were discussed in Section 12.8, where in (12.67) we defined the action of a linear filter as a convolution of the filter response function with the signal. A comparison of the definition of a filter to the definition of a wavelet transform (13.14) shows that the two are essentially the same. Such being the case, the result of the transform operation is a weighted sum over the input signal values, with each weight the product of the integration weight times the value of the wavelet function at the integration point. Therefore, rather than tabulate explicit wavelet functions, a set of filter coefficients is all that is needed for DWTs.

Because each filter in Figure 13.9 changes the relative strengths of different frequency components, passing the signal through a series of filters is equivalent, in wavelet language, to analyzing the signal at different scales. This is the origin of the name “multiresolution analysis.” Figure 13.9 shows how the pyramid algorithm passes the signal through a series of high-pass filters (H) and then through a series of low-pass filters (L). Each filter changes the scale to that of the level below. Also note that the circles containing ↓ 2 in Figure 13.9. This operation filters out half of the signal and so is called subsampling or factor-of-2 decimation. It is the way we keep the areas of each box in Figure 13.8 constant as we vary the scale and translation times. We consider subsampling further when we discuss the pyramid algorithm.

In summary, the DWT process decomposes the signal into smooth information stored in the low-frequency components and detailed information stored in the high-frequency components. Because high-resolution reproductions of signals require more information about details than about gross shape, the pyramid algorithm is an effective way to compress data while still maintaining high resolution. In addition, because components of different resolutions are independent of each other, it is possible to lower the number of data stored by systematically eliminating higher resolution components. And finally, the use of wavelet filters builds in progressive scaling, which is particularly appropriate for fractal-like reproductions.

13.6.1 Pyramid Scheme Implementation $c13-math-0029a$

We now implement the pyramid scheme outlined in Figure 13.9. The H and L filters will be represented by matrices, which is an approximate way to perform the integrations or convolutions. Then there is a decimation of the output by one-half, and finally an interleaving of the output for further filtering. This process simultaneously cuts down on the number of points in the data set and changes the scale and the resolution. The decimation reduces the number of values of the remaining signal by one half, with the low-frequency part discarded because the details are in the high-frequency parts. As indicated in Figure 13.10, the pyramid DWT algorithm follows five steps:

Successively applies the (soon-to-be-derived) c matrix (13.41) to the whole N-length vector
(13.31) $c13-math-0038$
Applies it to the N/2-length smooth vector.
Repeats the application until only two smooth components remain.
After each filtering, the elements are ordered, with the newest two smooth elements on top, the newest detailed elements below, and the older detailed elements below that.
The process continues until there are just two smooth elements left.

c13f012 — **Figure 13.10** An input signal at the top is processed by a tree of high- and low-band filters. The outputs from each filtering are down-sampled with half the data kept. The process continues until there are only two data of high-band filtering and two data of low-band filtering. When complete, the total number of output data equals the total number of signal data. The process of wavelet analysis is thus equivalent to a series of filterings.

To illustrate, here we filter and reorder an initial vector of length N = 8:

(13.32) $c13-math-0039$

The discrete inversion of a transform vector back to a signal vector is made using the transpose (inverse) of the transfer matrix at each stage. For instance,

(13.33) $c13-math-0040$

As a more realistic example, imagine that we have sampled the chirp signal y(t) = sin(60t²) for 1024 times. The filtering process through which we place this signal is illustrated as a passage from the top to the bottom in Figure 13.10. First the original 1024 samples are passed through a single low band and a single high band (which is mathematically equivalent to performing a series of convolutions). As indicated by the down arrows, the output of the first stage is then downsampled, that is, the number is reduced by a factor of 2. This results in 512 points from the high-band filter as well as 512 points from the low-band filter. This produces the first-level output. The output coefficients from the high-band filters are called $c13-math-0041$ to indicate that they show details, and $c13-math-0042$ that they show smooth features. The superscript indicates that this is the first level of processing. The detail coefficients $c13-math-0043$ are stored to become part of the final output.

In the next level down, the 512 smooth data $c13-math-0042$ are passed through new low-and high-band filters using a broader wavelet. The 512 outputs from each are downsampled to form a smooth sequence $c13-math-0044$ of size 256 and a detailed sequence $c13-math-0045$ of size 256. Again the detail coefficients $c13-math-0046$ are stored to become part of the final output. (Note that this is only half the size of the previously stored details.) The process continues until there are only two numbers left for the detail coefficients and two numbers left for the smooth coefficients. Because this last filtering is carried out with the broadest wavelet, it is of the lowest resolution and therefore requires the least information.

In Figure 13.11, we show the actual effects on the chirp signal of pyramid filtering for various levels in the processing. (The processing is carried out with four-coefficient Daub4 wavelets, which we will discuss soon.) At the uppermost level, the wavelet is narrow, and so convoluting this wavelet with successive sections of the signal results in smooth components that still contain many large high-frequency parts. The detail components, in contrast, are much smaller in magnitude. In the next stage, the wavelet is dilated to a lower frequency and the analysis is repeated on just the smooth (low-band) part. The resulting output is similar, but with coarser features for the smooth coefficients and larger values for the details. Note that in the upper graphs we have connected the points to make the output look continuous, while in the lower graphs, with fewer points, we have plotted the output as histograms to make the points more evident. Eventually the downsampling leads to just two coefficients output from each filter, at which point the filtering ends.

c13f013 — **Figure 13.11** In successive passes, the filtering of the original signal at the top goes through the pyramid algorithm and produces the outputs shown. The sampling is reduced by a factor of 2 in each step. Note that in the upper graphs, we have connected the points to emphasize their continuous nature while in the lower graphs we plot the individual output points as histograms.

To reconstruct the original signal (called synthesis or transformation) a reversed process is followed: Begin with the last sequence of four coefficients, upsample them, pass them through low- and high-band filters to obtain new levels of coefficients, and repeat until all the N values of the original signal are recovered. The inverse scheme is the same as the processing scheme (Figure 13.10), only now the directions of all the arrows are reversed.

13.6.2 Daubechies Wavelets via Filtering

We should now be able to understand that digital wavelet analysis has been standardized to the point where classes of wavelet basis functions are specified not by their analytic forms, but rather by their wavelet filter coefficients. In 1988, the Belgian mathematician Ingrid Daubechies discovered an important class of such filter coefficients (Daubechies, 1995; Rowe and Abbott, 1995). We will study just the Daub4 class containing the four coefficients c₀, c₁, c₂, and c₃.

Imagine that our input contains the four elements $c13-math-0047$ corresponding to measurements of a signal at four times. We represent a low-pass filter L and a high-pass filter H in terms of the four filter coefficients as

(13.34) $c13-math-0048$

(13.35) $c13-math-0049$

To see how this works, we form an input vector by placing the four signal elements in a column and then multiply the input by L and H:

$c13-math-0050$

We see that if we choose the values of the c_i’s carefully, the result of L acting on the signal vector is a single number that may be viewed as a weighted average of the four input signal elements. Because an averaging process tends to smooth out data, the low-pass filter may be thought of as a smoothing filter that outputs the general shape of the signal.

In turn, we see that if we choose the c_i values carefully, the result of H acting on the signal vector is a single number that may be viewed as the weighted differences of the input signal. Because a differencing process tends to emphasize the variation in the data, the high-pass filter may be thought of as a detail filter that produces a large output when the signal varies considerably, and a small output when the signal is smooth.

We have just seen how the individual L and H filters, each represented by a single row of the filter matrix, outputs one number when acting upon an input signal containing four elements in a column. If we want the output of the filtering process Y to contain the same number of elements as the input (four y’s in this case), we just stack the L and H filters together:

(13.36) $c13-math-0052$

Of course, the first and third rows of the Y vector will be identical, as will the second and fourth, but we will take care of that soon.

Now we go about determining the values of the filter coefficients c_i by placing specific demands upon the output of the filter. We start by recalling that in our discussion of discrete Fourier transforms we observed that a transform is equivalent to a rotation from the time domain to the frequency domain. Yet we know from our study of linear algebra that rotations are described by orthogonal matrices, that is, matrices whose inverses are equal to their transposes. In order for the inverse transform to return us to the input signal, the transfer matrix must be orthogonal. For our wavelet transformation to be orthogonal, we must have the 4×4 filter matrix times its transpose equal to the identity matrix:

(13.37) $c13-math-0053$

Two equations in four unknowns are not enough for a unique solution, so we now include the further requirement that the detail filter $c13-math-0054$ must output a zero if the input is smooth. We define “smooth” to mean that the input is constant or linearly increasing:

(13.38) $c13-math-0055$

This is equivalent to demanding that the moments up to order p are zero, that is, that we have an “approximation of order p.” Explicitly,

(13.39) $c13-math-0056$ $c13-math-0057$

(13.40) $c13-math-0058$

These are the basic Daub4 filter coefficients. They are used to create larger filter matrices by placing the row versions of L and H along the diagonal, with successive pairs displaced two columns to the right. For example, for eight elements

(13.41) $c13-math-0059$

Note that in order not to lose any information, the last pair on the bottom two rows is wrapped over to the left. If you perform the actual multiplications indicated in (13.41), you will note that the output has successive smooth and detailed information. The output is processed with the pyramid scheme.

The time dependences of two Daub4 wavelets is displayed in Figure 13.12. To obtain these from our filter coefficients, first imagine that an elementary wavelet $c13-math-0059a$ input into the filter. This should result in a transform Y_1,l = 1. Inversely, we obtain y_1,1(t) by applying the inverse transform to a Y vector with a 1 in the first position and zeros in all the other positions. Like-wise, the ith member of the Daubechies class is obtained by applying the inverse transform to a Y vector with a 1 in the ith position and zeros in all the other positions.

c13f014 — **Figure 13.12** (a) The Daub4 e6 wavelet constructed by inverse transformation of the wavelet coefficients. This wavelet has been found to be particularly effective in wavelet analyses. (b) The sum of Daub4 e10 and Daub4 1e58 wavelets of different scale and time displacements.

The wavelet for coefficient 6 (thus the e6 notation) is shown in Figure 13.12a and in Figure 13.12b the sum of two wavelets corresponding to the coefficients 10 and 58. We see that the two wavelets have different levels of scale as well as different time positions. So despite the fact that the time dependence of the wavelets is not evident when wavelet (filter) coefficients are used, it is there.

13.6.3 DWT Implementation and Exercise

Listing 13.2 gives our program for performing a DWT on the chirp signal y(t) = sin(60t²). The method pyram calls the daube4 method to perform the DWT or inverse DWT, depending upon the value of sign.

Modify the program so that you output to a file the values for the input signal that your code has read in. It is always important to check your input.
Try to reproduce Figure 13.11 by using various values for the variable nend that controls when the filtering ends. A value nend=1024 should produce just the first step in the downsampling (top row in Figure 13.11). Selecting nend=512 should produce the next row, while nend=4 should output just two smooth and detailed coefficients.
Reproduce the scale-time diagram shown on the right of Figure 13.11. This diagram shows the output at different scales and serves to interpret the main components of the signal and the time in which they appear. The time line at the bottom of the figure corresponds to a signal of length 1 over which 256 samples were recorded. The low-band (smooth) components are shown on the left, and the high-band components on the right.
1. a) The bottommost figure results when nend = 256.
2. b) The figure in the second row up results from nend = 128, and we have the output from two filterings. The output contains 256 coefficients but divides time into four intervals and shows the frequency components of the original signal in more detail.
3. c) Continue with the subdivisions for nend = 64, 32, 16, 8, and 4.
For each of these choices except the topmost, divide the time by 2 and separate the intervals by vertical lines.
The topmost spectrum is your final output. Can you see any relation between it and the chirp signal?
Change the sign of sign and check that the inverse DWT reproduces the original signal.
Use the code to visualize the time dependence of the Daubechies mother function at different scales.
1. a) Start by performing an inverse transformation on the eight-component signal [0,0,0,0,1,0,0,0]. This hould yield a function with a width of about 5 units.
2. b) Next perform an inverse transformation on a unit vector with N = 32 but with all components except the fifth equal to zero. The width should now be about 25 units, a larger scale but still covering the same time interval.
3. c) Continue this procedure until you obtain wavelets of 800 units.
4. d) Finally, with N = 1024, select a portion of the mother wavelet with data in the horizontal interval [590,800]. This should show some self similarity.

c13f015 — **Listing 13.2** `DWT.py` computes the DWT using the pyramid algorithm for the 2ⁿ signal values stored in `f[ ]` (here assigned as the chirp signal sin 60t²). The Daub4 digital wavelets are the basis functions, and `sign = ±1` for transform/inverse.

c13f016 — **Listing 13.2** `DWT.py` computes the DWT using the pyramid algorithm for the 2ⁿ signal values stored in `f[ ]` (here assigned as the chirp signal sin 60t²). The Daub4 digital wavelets are the basis functions, and `sign = ±1` for transform/inverse.

13.7 Principal Components Analysis

We have seen that Fourier analysis has a shortcoming of having all of its components correlated. This slows down the calculation of transforms and makes compression and reconstitution of data problematic. Wavelets, on the other hand, appear excellent at data compression, but not appropriate for high-dimensionality data sets or for all physical situations. Principal components analysis (PCA) is excellent for situations in which there are correlation among the variable in the data, and especially for the type of space-time correlations as might be found in brain waves, facial patterns and ocean currents. The same basic PCA approach is used in many fields with names such as the Karhunen–Loève transform, the Hotelling transform, the proper orthogonal decomposition, singular value decomposition, factor analysis, empirical orthogonal functions, empirical component analysis, empirical modal analysis, and so forth (Wikipedia, 2014).

c13f017 — **Figure 13.13** (a) The normalized data and eigenvectors of covariance matrix. (b) The normalized data using the PCA eigenvectors as basis.

A key element in PCA is viewing a collection of data as defining a multidimensional data space, and then finding a few basic components in those data that contain most of the “power”. For example, imagine the output of 32 detectors of magnetic brain waves recorded every tenth of a second for an hour. In this case there will be 33 space-time dimensions to the data and 10 × 60 × 60 × 32 = 1152 000 data elements. And so we are now motivated to transform from the detector-time interval basis to a new set of basis functions known as the principal components which are pretty much guaranteed to concentrate most of the signal strength into a few components (Jackson, 1991; Jolliffe, 1991; Smith, 2002). This is analogous to the principal axes theorem of mechanics, in which the description of the rotation of a solid object is greatly simplified if moments of inertia relative to the principal axes are used.

As follows from the example that we will work out soon, on the left of Figure 13.13, we see a bunch of data point along with the two principal component eigenvectors for these data. This figure shows how the first component accounts for most of the variability of the data (has largest possible variance), while the next component is orthogonal, and thus not correlated with the first component. The second component accounts for much less of the data, namely the most possible variance in the data within the constraint of being orthogonal to the first component.

Table 13.1 PCA data.

Data		Adjusted data		In PCA basis
x	y	x	y	x₁	x₂
2.5	2.4	0.69	0.49	−0.828	−0.175
0.5	0.7	−1.31	−1.21	1.78	0.143
2.2	2.9	0.39	0.99	−0.992	0.484
1.9	2.2	0.09	0.29	−0.274	0.130
3.1	3.0	1.29	1.09	−1.68	−0.209
2.3	2.7	0.49	0.79	0.913	0.175
2	1.6	0.19	−0.31	0.0991	−0.350
1.0	1.1	−0.81	−0.81	1.14	0.464
1.6	1.6	−0.31	−0.31	0.438	0.0178
1.1	0.9	−0.71	−1.01	1.22	−0.163

13.7.1 Demonstration of Principal Component Analysis

The derivation of PCA and proofs of its theorems can get quite involved. Instead, as our introduction to the subject we take a purely operational approach and work through a simple PCA analysis following the example of (Smith, 2002). We assume that the data have two dimension, and we call these x and y, but they need not be related to spatial positions.

1. Enter Data We start with a data set, in our case the first two columns of Table 13.1.

2. Subtract the Mean PCA analysis assumes that the data in each dimension has zero mean. Accordingly, as shown in columns two and three in Table 13.1, we calculate the mean for each column, $c13-math-0059b$ , and then subtract them from the data in each column. The resulting adjusted data are given in the third and fourth columns of the table, and placed here into a data matrix:

(13.42) $c13-math-0060$

3. Calculate Covariance Matrix Recall that for a data set with N members, the variance is a measure of the deviation of the data from their mean:

(13.43) $c13-math-0061$

When a data set contains multiple dimensions (variables), there may well be a dependence of the data in one-dimensional variable to that in another. The covariance gives a measure of how much the deviation of one variable from the mean varies with the deviation of another variable from the mean:

(13.44) $c13-math-0062$

so that a positive covariance indicates that the x and y variables tend to change together in the same direction. Also, we see that the variance (13.43) can be viewed as a special case of the covariance, var(x) = cov(x, x), and that there is a symmetry here with cov(x, y) = cov(y,x). All of these possible covariance values are combined into a symmetric covariance matrix, which for our 2×2 case is just

(13.45) $c13-math-0063$

The next step in a PSA is to compute the covariance matrix for all of the data, which in our case turns out to be

(13.46) $c13-math-0064$

4. Compute Unit Eigenvector and Eigenvalues of C Easy to do with NumPy:

(13.47) $c13-math-0065$

(13.48) $c13-math-0066$

where we have ordered the eigenvalues and eigenvectors so that the largest eigen-value is first. As we shall see, the eigenvector corresponding to this largest eigen-value is in fact the principal component in the data, typically with ∼ 80% of the power in it.

In Figure 13.13, we show the normalized data and the two unit eigenvectors of the covariance matrix (scaled to fill frame). Note that the e₁ eigenvector looks very much like a straight-line best fit to the data. This is the major trend in the data. The other eigenvector e₂ is clearly orthogonal to e₁ and contain much less of the signal strength than e₁ It is in the direction of the variation from the straight line fit, and cleaerly contains less information about the data. These are the essential ideas behind PSA.

5. Expressing Data in Terms of Principal Components Now that we have the principal components, we need to do something with them, namely, express the data in terms of them. Clearly, one choice is to ignore the eigenvectors corresponding to the smaller eigenvalues (only one in our simple example). This is useful for focusing attention on the key elements in the data, as well as for compressing the data. What one usually does now is to form a feature vector F made up of the eigenvectors that we wish to keep, for example,

(13.49) $c13-math-0067$

(13.50) $c13-math-0068$

where F₁ keeps just one principal component, while F₂ keeps two. The matrix gets its name because by deciding which eigenvectors we wish to keep, we are deciding which features of the data we wish to display.

Next, we form the transpose of feature matrix composed of the eigenvectors we wish to keep, and of the adjusted data matrix:

(13.51) $c13-math-0069$

(13.52) $c13-math-0070$

To express the data in terms of the principal components, we multiply the transposed feature matrix by the transposed adjusted data matrix:

(13.53) $c13-math-0071$

(13.54) $c13-math-0072$

(13.55) $c13-math-0073$

(13.56) $c13-math-0074$

In Table 13.1, we place the transform data elements back into standard form, along side the original data. On the right of Figure 13.13 we show the normalized data plotted using the eigenvectors e₁ and e₂ as basis. This plot shows just where each datum point sits relative to the trend in the data. If we use only the principal component, we would have all of the data on a straight line (we leave that as an exercise). Of course, our data are so simple that this example does not show the power of the technique. But if we have millions of data, it would most valuable to be able to categorize them in terms of a few components.

13.7.2 PCA Exercises

Use just the principal eigenvector to perform the PCA analysis just completed with two eigenvectors.
Store data from ten cycles of the chaotic pendulum studied in Chapter 15, but do not include transients. Perform a PCA of these data and plot the results using principal component axes.