Tall-and-Skinny SVDs and RSVDs in MapReduce

A1
A2
A3
A4

Tall-and-skinny !
QRs and SVDs in
MapReduce

David F. Gleich!
Computer Science!
Purdue University!
David Gleich · Purdue
Simons PDAIO

1

Yangyang Hou "
Joe Nichols – U. of Minn
Purdue, CS
James Demmel "
Austin Benson "
UC Berkeley
Stanford University
Joe Ruthruff "
Paul G. Constantine
Jeremy Templeton"
Col. School. Mines "
Sandia CA

Mainly funded by Sandia National Labs
CSAR project, recently by NSF CAREER,"
and Purdue Research Foundation.

Simons PDAIO

2

Big "
simulation
data

Nonlinear heat transfer model in
random media

Apply heat

We did 8192 runs (128 samples of
bubble locations, 64 bubble radii)
4.5 TB of data in Exodus II (NetCDF)

https://www.opensciencedatacloud.org/
publicdata/heat-transfer/
Simons PDAIO

3

Look at temperature

Each run takes 5 hours on 8 processors,
outputs 4M (node) by 9 (time-step) simulation

Non-insulator regime
1
True
ROM
RS

0.8
0.7
1

0.6
0.5

0.5
0.4

0
15

0.3

20

25

0.2
0.1
0
0

Insulator regime
10

20

30
40
Bubble radius

50

60

Simons PDAIO

4

Proportion of temp. > 475 K

0.9

S

VT

Extract 128 x 128
face to laptop
A

U

UF

S

VT

5B-by-64 matrix
2.2TB
Simons PDAIO

5

Each simulation is a column

SVD

Non-insulator regime
1
True
ROM
RS

0.8
0.7
1

0.6
0.5

0.5
0.4

0
15

0.3

20

25

0.2
0.1
0
0

Insulator regime
10

20

30
40
Bubble radius

50

60

Simons PDAIO

6


0.9

(a) Error, s = 0.39 cm
. GLEICH, Y. HOU, AND J. TEMPLETON

20

(b) Std, s = 0.39 cm

P. G. CONSTANTINE, D. F. GLEICH, Y. HOU, AND J. TEMPLETON

imately 8-15 to the We collected precise timing information, but we do not report
model compared hours. prediction standard deubbleit as these at the finalfrom afor two values of
locations times are time multi-tenant, unoptimized Hadoop cluster Simons other
where PDAIO

7

s

s
R(s, ⌧ )
¯
E(s, ⌧ )
¯
0.08
16
1.00e-04
0.23
15
2.00e-04
0.39
14
4.00e-04
59
error
std
std
202
error
0.55
13
206.00e-04
55
2
−1
10
51
1
1
10
0.70
13
8.00e-04
47
00
0
0
−1.5
0.86
12
1.10e-03
43
39
1.01
11
1.50e-03
(c) Error, s = 1.95 cm
(d) Std, s = 1.95 cm
(b) Std, s = 0.39 cm
−2
35
1.17
10
2.10e-03
31
Fig. 4.5: Error in the reduce order model compared to the prediction standard de27
1.33
9
3.10e-03
−2.5
viation for one realization of the bubble locations at the final time for two values of
23
1.48
8
4.50e-03
19
the bubble radius, s = 0.39 and s = 1.95−3
cm. (Colors are visible in the electronic
1.64
8
6.50e-03
15
version.)
11
1.79
7
8.20e-03
−3.5
7
1.95
7
1.07e-02
3
0
0.2
0.4
0.6
0.8
2.11
6
1.23e-02
τ
¯
error
std
20
the varying conductivity fields took approximately twenty minutes to construct using
20
2.26
6
1.39e-02
10 Cubit after substantial optimizations.
Fig. 4.4: The log of the relative10error in
0
Working with the simulation data involved Table 4.1: post-processing steps:
the mean prediction of the ROM0 as a func-a few pre- and The split and the correspond
interpret 4TB of Exodus II files from Aria, globally transpose the data, compute the
Constantine, Gleich,!
(d) the threshold ¯
tion of s and Std, s = 1.95 cm ⌧ . (Colors are ing ROM error for ⌧ = 0.55 and di↵eren
¯
TSSVD, and compute predictions and errors. The preprocessing steps took approxvisible in the electronic version.)
values of Hou & Templeton arXiv 2013.
s.

Dynamic Mode
Decomposition
One simulation, ~10TB of data, compute the
SVD of a space-by-time matrix.

Simons PDAIO

8

DMD video

Is this BIG Data?

BIG Data has two properties

- too big for one hard drive

A

- ‘skewed’ distribution

BIG Data = “Big Internet Giant” Data

BIG Data = “Big In’Gineering” Data

Simons PDAIO

9

“Engineering”

A matrix A : m × n, m ≥ n!
is tall and skinny when O(n2) !
work and storage is “cheap”
compared to m.

Simons PDAIO

10

-- Austin Benson

is given by
the solution of  

Quick review of QR
Let A : m × n, (   ≥ n, real
orthogonal m
)
A = QR
Q is m × n orthogonal (QT Q = I )
upper n upper triangular
R is n × triangular.

A

=

Q

QR is block normalization
“normalize” a vector
usually generalizes to
computing   in the QR

0
R

MapReduce 2011

Simons PDAIO

11

andia)

, real

Tall-and-skinny SVD and RSVD
R

 
SVD

Let A : m × n, m ≥ n, real
A = U𝞢VT

TSQR

U is m × n orthogonal
Q

𝞢 is m × n nonneg, diag.
V is n × n orthogonal

Simons PDAIO

12

A

V

There are good MPI
implementations.

Simons PDAIO

13

What’s left to do?

Simons PDAIO

14

Moving data to an MPI cluster
may be infeasible or costly

How to store tall-and-skinny
matrices in Hadoop
A1

A2

A3
A4

Simons PDAIO

15

A : m x n, m ≫ n

Key is an arbitrary row-id
Value is the 1 x n array "
for a row (or b x n block)

Each submatrix Ai is an "
the input to a map task.

Still, isn’t this easy to do?
Current MapReduce algs use the normal equations
T

A = QR

A A

A1

A2
A3
A4

Cholesky

T

!R R

Map!
Aii to AiTAi

Reduce!
RTR = Sum(AiTAi)

Map 2!
Aii to Ai R-1

Q = AR

1

Two problems!

R inaccurate if A illconditioned

Q not numerically
orthogonal (House-"
holder assures this)

Simons PDAIO

16

Numerical stability was a
problem for prior approaches
0

10

−5

AR-1

10

−10

10

−15

10

0

10

5

10

10

10

15

10

20

10

Condition number
Simons PDAIO

17

Previous methods
couldn’t ensure
that the matrix Q
was orthogonal

norm ( QTQ – I )

Prior work

Four things that are better
1.  A simple algorithm to compute R accurately.
(but doesn’t help get Q orthogonal).
2.  “Fast algorithm” to get Q numerically
orthogonal in most cases.
3.  Multi-pass algorithm to get Q numerically
orthogonal in virtually all cases.

Constantine & Gleich MapReduce 2011
Benson, Gleich & Demmel IEEE BigData 2013
Simons PDAIO

18

4.  A direct algorithm for a numerically
orthogonal Q in all cases.

5

1. Constantine & Gleich,
MapReduce 2011

0

Prior work

10

AR-1

−5

10

2. Benson, Gleich,
Demmel, BigData’13

-1
AR + "
nt
reﬁneme
iterative

−10

10

4. Direct TSQR
Benson, Gleich, "

−15

10

0

10

5

10

10

10

15

10

Condition number

20
3. Benson, Gleich,
10

Simons PDAIO

19

Previous methods
couldn’t ensure
that the matrix Q
was orthogonal

norm ( QTQ – I )

10

MapReduce is great for TSQR!!
You don’t need ATA
Data A tall and skinny (TS) matrix by rows

Input 500,000,000-by-50 matrix"
Each record 1-by-50 row"
HDFS Size 183.6 GB

Time to compute read A 253 sec. write A 848 sec.!
Time to compute R in qr(A) 526 sec. w/ Q=AR-1 1618 sec. "
Time to compute Q in qr(A) 3090 sec. (numerically stable)!

Simons PDAIO

20

Communication avoiding QR
Communication avoiding TSQR
(Demmel et al. 2008)

Second, compute
a QR factorization
of the new “R”

21

First, do QR
factorizations
of each local
matrix  

Demmel et al.David Communicating avoiding Simons PDAIO
QR.
2008. Gleich · Purdue
parallel and sequential

Serial QR factorizations!
Fully serialet al. 2008)
(Demmel TSQR

David Gleich · Purdue Simons PDAIO
Demmel et al. 2008. Communicating avoiding
parallel and sequential QR.

22

Compute QR of   ,
read   , update QR, …

Communication avoiding QR (Demmel et al. 2008) !
on MapReduce (Constantine and Gleich, 2011)
Algorithm
Data Rows of a matrix
A1
A2

A2

qr

Q2

R2
A3

A3

Map QR factorization of rows
Reduce QR factorization of rows
qr

Q3

A4

A4
A5

Serial TSQR

A6

qr

Q6

Serial TSQR

Q4

R4

emit

qr

Q8

R8

emit

R6
A7

A7

qr

Q7

R7

A8

A8

Reducer 1

qr

A5

A6
Mapper 2

R3

R4

R4

R8

R8

qr

Q

R

emit

Simons PDAIO

23

Mapper 1
Serial TSQR

A1

Too many maps cause too
much data to one reducer!
map

emit

R4

R2,2

emit

Reducer 1-2
Serial TSQR

reduce

S3
A2

S(2)
A2

R2,3

shuffle

R3

S
A2

reduce

emit

identity map

S(1)

R2,1

Reducer 1-1
Serial TSQR

reduce

Mapper 1-3
Serial TSQR

map

A3
4

emit

Mapper 1-2
Serial TSQR

map

A3

R2

shuffle

A

S1

Mapper 1-1
Serial TSQR

map

A2

reduce

emit

R

emit

Reducer 2-1
Serial TSQR

emit

Reducer 1-3
Serial TSQR

emit

Mapper 1-4
Serial TSQR

Iteration 1

Iteration 2

Simons PDAIO

24

A1

R1

Simons PDAIO

25

Getting Q

5

MapReduce 2011!

0

Prior work

10

AR-1

AR-1

−5

10

2. Benson, Gleich,

-1
AR + "
nt
reﬁneme
iterative

−10

10

4. Direct TSQR
Benson, Gleich, "

−15

10

0

10

5

10

10

10

15

10

Condition number

20
3. Benson, Gleich,
10

Simons PDAIO

26

Previous methods
couldn’t ensure
that the matrix Q
was orthogonal

norm ( QTQ – I )

10

Iterative refinement helps
Mapper 1
R-1
A1

A2

R
TSQR

R-1

Q2

Q3
Q4

T
TSQR

T

R

A4

R-1

Q1

te
ribu
Dist

te
ribu
Dist

A3

R-1

Mapper 2
T-1
Q1

Q2

Q3
Q4
A4

T-1

T-1

T-1

Q1

Q2

Q3
Q4

Simons PDAIO

27

Iterative refinement is like using Newton’s method to solve Ax = b. It’s folklore that
“two iterations of iterative refinement are enough”

What if iterative reﬁnement is
too slow?
Sample

Mapper 1
R-1
A1

S

A2

R-1

Q2

Q3
Q4

T
TSQR

TR

A4

R-1

Q1

te
ribu
Dist

"
QR,
pute

Com ibute R
distr

A3

R-1

Mapper 2
R-1
T-1
A1
Q1

A2

A3
A4

R-1
T-1

R-1
T-1

R-1
T-1

Q2

Q3
Q4

Based on recent work by “random matrix” community on approximating QR
with a random subset of rows. Also assumes that you can get a subset of
rows “cheaply” – possible, but nontrivial in Hadoop.
Simons PDAIO

28

Estimate the “norm” by S

5

MapReduce 2011

0

Prior work

10

AR-1

AR-1

−5

10

2. Benson, Gleich,
Demmel, BigData’13!

-1
AR + "
nt
reﬁneme
iterative

−10

10

4. Direct TSQR
Benson, Gleich, "

−15

10

0

10

5

10

10

10

15

10

Condition number

20
3. Benson, Gleich,
10
Demmel, BigData’13!

Simons PDAIO

29

Previous methods
couldn’t ensure
that the matrix Q
was orthogonal

norm ( QTQ – I )

10

Mapper 1

Q2

A3

Q3

A4

Q4

R1

R2

R3

R4

R2

Q21

R3

Q31

R4

Q41

Mapper 3
Q11
Q1

Task 2
Q11
R

2. Collect R on one
node, compute Qs
for each piece

Q output

A2

Q1

R output

A1

R1

3. Distribute the
pieces of Q*1 and
form the true Q

Q2

Q3
Q4

Q21

Q31

Q41

Q1

Q2

Q3
Q4

1. Output local Q and
R in separate ﬁles
Simons PDAIO

30

Recreate Q by storing the
history of the factorization

Theoretical lower bound on runtime
for a few cases on our small cluster
R-only R-only
+ no IR
+ PIR

R-only
+ IR

Direct
TSQR

4.0B

4

1803

1821

1821

2343

2525

2.5B

10

1645

1655

1655

2062

2464

0.6B

25

804

812

812

1000

1237

50

1240

1250

1250

1517

2103

Cols

Old

R-only R-only
+ no IR
+ PIR

R-only
+ IR

Direct
TSQR

4.0B

4

2931

3460

3620

4741

6128

2.5B

10

2508

2509

3354

4034

4035

0.6B

25

1098

1104

1476

2006

1910

0.5B

50

921

1618

1960

2655

3090

All values in
seconds

Only two params
needed – read and
write bandwidth for
the cluster – in
order to derive a
performance model
of the algorithm.
This simple model
is almost within a
factor of two of the
true runtime. "
(10-node cluster,
60 disks)

Simons PDAIO

31

Old

Rows

Actual

Cols

0.5B

Model

Rows

Papers

Constantine & Gleich, MapReduce 2011
Benson, Gleich & Demmel, BigData’13

Constantine & Gleich, ICASSP 2012
Constantine, Gleich, Hou & Templeton, "
arXiv 2013

Questions?

BIG

Bloody Imposing Graphs
Building Impressions of Groundtruth
Blockwise Independent Guesses

https://github.com/arbenson/mrtsqr

Best Implemented at Google

"

https://github.com/dgleich/simform
Simons PDAIO

32

Code

Tall-and-Skinny SVDs and RSVDs in MapReduce

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Tall-and-Skinny SVDs and RSVDs in MapReduce

Similar to Tall-and-Skinny SVDs and RSVDs in MapReduce (20)

More from David Gleich

More from David Gleich (11)

Recently uploaded

Recently uploaded (20)

Tall-and-Skinny SVDs and RSVDs in MapReduce