This document describes research on computing tall-and-skinny QR decompositions in MapReduce frameworks. It discusses past issues with numerical stability in prior approaches and presents four improved methods: 1) An algorithm to accurately compute R; 2) A fast algorithm to make Q numerically orthogonal; 3) A multi-pass algorithm to make Q orthogonal in most cases; 4) A direct algorithm to make Q orthogonal in all cases. It also covers techniques like iterative refinement and sampling rows to improve the orthogonality of Q when computing the QR decomposition of large matrices distributed across systems.
1. A1
A2
A3
A4
Tall-and-skinny !
QRs and SVDs in
MapReduce
David F. Gleich!
Computer Science!
Purdue University!
David Gleich · Purdue
Simons PDAIO
1
Yangyang Hou "
Joe Nichols – U. of Minn
Purdue, CS
James Demmel "
Austin Benson "
UC Berkeley
Stanford University
Joe Ruthruff "
Paul G. Constantine
Jeremy Templeton"
Col. School. Mines "
Sandia CA
Mainly funded by Sandia National Labs
CSAR project, recently by NSF CAREER,"
and Purdue Research Foundation.
2. David Gleich · Purdue
Simons PDAIO
2
Big "
simulation
data
3. Nonlinear heat transfer model in
random media
Apply heat
We did 8192 runs (128 samples of
bubble locations, 64 bubble radii)
4.5 TB of data in Exodus II (NetCDF)
https://www.opensciencedatacloud.org/
publicdata/heat-transfer/
David Gleich · Purdue
Simons PDAIO
3
Look at temperature
Each run takes 5 hours on 8 processors,
outputs 4M (node) by 9 (time-step) simulation
7. (a) Error, s = 0.39 cm
. GLEICH, Y. HOU, AND J. TEMPLETON
20
(b) Std, s = 0.39 cm
P. G. CONSTANTINE, D. F. GLEICH, Y. HOU, AND J. TEMPLETON
imately 8-15 to the We collected precise timing information, but we do not report
model compared hours. prediction standard deubbleit as these at the finalfrom afor two values of
locations times are time multi-tenant, unoptimized Hadoop cluster Simons other
David Gleich · Purdue
where PDAIO
7
s
s
R(s, ⌧ )
¯
E(s, ⌧ )
¯
0.08
16
1.00e-04
0.23
15
2.00e-04
0.39
14
4.00e-04
59
error
std
std
202
error
0.55
13
206.00e-04
55
2
−1
10
51
1
1
10
0.70
13
8.00e-04
47
00
0
0
−1.5
0.86
12
1.10e-03
43
39
1.01
11
1.50e-03
(c) Error, s = 1.95 cm
(d) Std, s = 1.95 cm
(b) Std, s = 0.39 cm
−2
35
1.17
10
2.10e-03
31
Fig. 4.5: Error in the reduce order model compared to the prediction standard de27
1.33
9
3.10e-03
−2.5
viation for one realization of the bubble locations at the final time for two values of
23
1.48
8
4.50e-03
19
the bubble radius, s = 0.39 and s = 1.95−3
cm. (Colors are visible in the electronic
1.64
8
6.50e-03
15
version.)
11
1.79
7
8.20e-03
−3.5
7
1.95
7
1.07e-02
3
0
0.2
0.4
0.6
0.8
2.11
6
1.23e-02
τ
¯
error
std
20
the varying conductivity fields took approximately twenty minutes to construct using
20
2.26
6
1.39e-02
10 Cubit after substantial optimizations.
Fig. 4.4: The log of the relative10error in
0
Working with the simulation data involved Table 4.1: post-processing steps:
the mean prediction of the ROM0 as a func-a few pre- and The split and the correspond
interpret 4TB of Exodus II files from Aria, globally transpose the data, compute the
Constantine, Gleich,!
(d) the threshold ¯
tion of s and Std, s = 1.95 cm ⌧ . (Colors are ing ROM error for ⌧ = 0.55 and di↵eren
¯
TSSVD, and compute predictions and errors. The preprocessing steps took approxvisible in the electronic version.)
values of Hou & Templeton arXiv 2013.
s.
9. Is this BIG Data?
BIG Data has two properties
- too big for one hard drive
A
- ‘skewed’ distribution
BIG Data = “Big Internet Giant” Data
BIG Data = “Big In’Gineering” Data
David Gleich · Purdue
Simons PDAIO
9
“Engineering”
10. A matrix A : m × n, m ≥ n!
is tall and skinny when O(n2) !
work and storage is “cheap”
compared to m.
David Gleich · Purdue
Simons PDAIO
10
-- Austin Benson
11.
is given by
the solution of
Quick review of QR
Let A : m × n, ( ≥ n, real
orthogonal m
)
A = QR
Q is m × n orthogonal (QT Q = I )
upper n upper triangular
R is n × triangular.
A
=
Q
QR is block normalization
“normalize” a vector
usually generalizes to
computing in the QR
0
R
MapReduce 2011
David Gleich · Purdue
Simons PDAIO
11
andia)
, real
12. Tall-and-skinny SVD and RSVD
R
SVD
Let A : m × n, m ≥ n, real
A = U𝞢VT
TSQR
U is m × n orthogonal
Q
𝞢 is m × n nonneg, diag.
V is n × n orthogonal
David Gleich · Purdue
Simons PDAIO
12
A
V
13. There are good MPI
implementations.
David Gleich · Purdue
Simons PDAIO
13
What’s left to do?
14. David Gleich · Purdue
Simons PDAIO
14
Moving data to an MPI cluster
may be infeasible or costly
15. How to store tall-and-skinny
matrices in Hadoop
A1
A2
A3
A4
David Gleich · Purdue
Simons PDAIO
15
A : m x n, m ≫ n
Key is an arbitrary row-id
Value is the 1 x n array "
for a row (or b x n block)
Each submatrix Ai is an "
the input to a map task.
16. Still, isn’t this easy to do?
Current MapReduce algs use the normal equations
T
A = QR
A A
A1
A2
A3
A4
Cholesky
T
!R R
Map!
Aii to AiTAi
Reduce!
RTR = Sum(AiTAi)
Map 2!
Aii to Ai R-1
Q = AR
1
Two problems!
R inaccurate if A illconditioned
Q not numerically
orthogonal (House-"
holder assures this)
David Gleich · Purdue
Simons PDAIO
16
17. Numerical stability was a
problem for prior approaches
0
10
−5
AR-1
10
−10
10
−15
10
0
10
5
10
10
10
15
10
20
10
Condition number
David Gleich · Purdue
Simons PDAIO
17
Previous methods
couldn’t ensure
that the matrix Q
was orthogonal
norm ( QTQ – I )
Prior work
18. Four things that are better
1. A simple algorithm to compute R accurately.
(but doesn’t help get Q orthogonal).
2. “Fast algorithm” to get Q numerically
orthogonal in most cases.
3. Multi-pass algorithm to get Q numerically
orthogonal in virtually all cases.
Constantine & Gleich MapReduce 2011
Benson, Gleich & Demmel IEEE BigData 2013
David Gleich · Purdue
Simons PDAIO
18
4. A direct algorithm for a numerically
orthogonal Q in all cases.
19. Numerical stability was a
problem for prior approaches
5
1. Constantine & Gleich,
MapReduce 2011
0
Prior work
10
AR-1
−5
10
2. Benson, Gleich,
Demmel, BigData’13
-1
AR + "
nt
refineme
iterative
−10
10
4. Direct TSQR
Benson, Gleich, "
Demmel, BigData’13
−15
10
0
10
5
10
10
10
15
10
Condition number
20
3. Benson, Gleich,
10
Demmel, BigData’13
David Gleich · Purdue
Simons PDAIO
19
Previous methods
couldn’t ensure
that the matrix Q
was orthogonal
norm ( QTQ – I )
10
20. MapReduce is great for TSQR!!
You don’t need ATA
Data A tall and skinny (TS) matrix by rows
Input 500,000,000-by-50 matrix"
Each record 1-by-50 row"
HDFS Size 183.6 GB
Time to compute read A 253 sec. write A 848 sec.!
Time to compute R in qr(A) 526 sec. w/ Q=AR-1 1618 sec. "
Time to compute Q in qr(A) 3090 sec. (numerically stable)!
David Gleich · Purdue
Simons PDAIO
20
21. Communication avoiding QR
Communication avoiding TSQR
(Demmel et al. 2008)
Second, compute
a QR factorization
of the new “R”
21
First, do QR
factorizations
of each local
matrix
Demmel et al.David Communicating avoiding Simons PDAIO
QR.
2008. Gleich · Purdue
parallel and sequential
22. Serial QR factorizations!
Fully serialet al. 2008)
(Demmel TSQR
David Gleich · Purdue Simons PDAIO
Demmel et al. 2008. Communicating avoiding
parallel and sequential QR.
22
Compute QR of ,
read , update QR, …
23. Communication avoiding QR (Demmel et al. 2008) !
on MapReduce (Constantine and Gleich, 2011)
Algorithm
Data Rows of a matrix
A1
A2
A2
qr
Q2
R2
A3
A3
Map QR factorization of rows
Reduce QR factorization of rows
qr
Q3
A4
A4
A5
Serial TSQR
A6
qr
Q6
Serial TSQR
Q4
R4
emit
qr
Q8
R8
emit
R6
A7
A7
qr
Q7
R7
A8
A8
Reducer 1
qr
A5
A6
Mapper 2
R3
R4
R4
R8
R8
qr
Q
R
emit
David Gleich · Purdue
Simons PDAIO
23
Mapper 1
Serial TSQR
A1
24. Too many maps cause too
much data to one reducer!
map
emit
R4
R2,2
emit
Reducer 1-2
Serial TSQR
reduce
S3
A2
S(2)
A2
R2,3
shuffle
R3
S
A2
reduce
emit
identity map
S(1)
R2,1
Reducer 1-1
Serial TSQR
reduce
Mapper 1-3
Serial TSQR
map
A3
4
emit
Mapper 1-2
Serial TSQR
map
A3
R2
shuffle
A
S1
Mapper 1-1
Serial TSQR
map
A2
reduce
emit
R
emit
Reducer 2-1
Serial TSQR
emit
Reducer 1-3
Serial TSQR
emit
Mapper 1-4
Serial TSQR
Iteration 1
Iteration 2
David Gleich · Purdue
Simons PDAIO
24
A1
R1
26. Numerical stability was a
problem for prior approaches
5
1. Constantine & Gleich,
MapReduce 2011!
0
Prior work
10
AR-1
AR-1
−5
10
2. Benson, Gleich,
Demmel, BigData’13
-1
AR + "
nt
refineme
iterative
−10
10
4. Direct TSQR
Benson, Gleich, "
Demmel, BigData’13
−15
10
0
10
5
10
10
10
15
10
Condition number
20
3. Benson, Gleich,
10
Demmel, BigData’13
David Gleich · Purdue
Simons PDAIO
26
Previous methods
couldn’t ensure
that the matrix Q
was orthogonal
norm ( QTQ – I )
10
27. Iterative refinement helps
Mapper 1
R-1
A1
A2
R
TSQR
R-1
Q2
Q3
Q4
T
TSQR
T
R
A4
R-1
Q1
te
ribu
Dist
te
ribu
Dist
A3
R-1
Mapper 2
T-1
Q1
Q2
Q3
Q4
A4
T-1
T-1
T-1
Q1
Q2
Q3
Q4
David Gleich · Purdue
Simons PDAIO
27
Iterative refinement is like using Newton’s method to solve Ax = b. It’s folklore that
“two iterations of iterative refinement are enough”
28. What if iterative refinement is
too slow?
Sample
Mapper 1
R-1
A1
S
A2
R-1
Q2
Q3
Q4
T
TSQR
TR
A4
R-1
Q1
te
ribu
Dist
"
QR,
pute
Com ibute R
distr
A3
R-1
Mapper 2
R-1
T-1
A1
Q1
A2
A3
A4
R-1
T-1
R-1
T-1
R-1
T-1
Q2
Q3
Q4
Based on recent work by “random matrix” community on approximating QR
with a random subset of rows. Also assumes that you can get a subset of
rows “cheaply” – possible, but nontrivial in Hadoop.
David Gleich · Purdue
Simons PDAIO
28
Estimate the “norm” by S
29. Numerical stability was a
problem for prior approaches
5
1. Constantine & Gleich,
MapReduce 2011
0
Prior work
10
AR-1
AR-1
−5
10
2. Benson, Gleich,
Demmel, BigData’13!
-1
AR + "
nt
refineme
iterative
−10
10
4. Direct TSQR
Benson, Gleich, "
Demmel, BigData’13
−15
10
0
10
5
10
10
10
15
10
Condition number
20
3. Benson, Gleich,
10
Demmel, BigData’13!
David Gleich · Purdue
Simons PDAIO
29
Previous methods
couldn’t ensure
that the matrix Q
was orthogonal
norm ( QTQ – I )
10
30. Mapper 1
Q2
A3
Q3
A4
Q4
R1
R2
R3
R4
R2
Q21
R3
Q31
R4
Q41
Mapper 3
Q11
Q1
Task 2
Q11
R
2. Collect R on one
node, compute Qs
for each piece
Q output
A2
Q1
R output
A1
R1
3. Distribute the
pieces of Q*1 and
form the true Q
Q2
Q3
Q4
Q21
Q31
Q41
Q1
Q2
Q3
Q4
1. Output local Q and
R in separate files
David Gleich · Purdue
Simons PDAIO
30
Recreate Q by storing the
history of the factorization
31. Theoretical lower bound on runtime
for a few cases on our small cluster
R-only R-only
+ no IR
+ PIR
R-only
+ IR
Direct
TSQR
4.0B
4
1803
1821
1821
2343
2525
2.5B
10
1645
1655
1655
2062
2464
0.6B
25
804
812
812
1000
1237
50
1240
1250
1250
1517
2103
Cols
Old
R-only R-only
+ no IR
+ PIR
R-only
+ IR
Direct
TSQR
4.0B
4
2931
3460
3620
4741
6128
2.5B
10
2508
2509
3354
4034
4035
0.6B
25
1098
1104
1476
2006
1910
0.5B
50
921
1618
1960
2655
3090
All values in
seconds
Only two params
needed – read and
write bandwidth for
the cluster – in
order to derive a
performance model
of the algorithm.
This simple model
is almost within a
factor of two of the
true runtime. "
(10-node cluster,
60 disks)
David Gleich · Purdue
Simons PDAIO
31
Old
Rows
Actual
Cols
0.5B
Model
Rows
32. Papers
Constantine & Gleich, MapReduce 2011
Benson, Gleich & Demmel, BigData’13
Constantine & Gleich, ICASSP 2012
Constantine, Gleich, Hou & Templeton, "
arXiv 2013
Questions?
BIG
Bloody Imposing Graphs
Building Impressions of Groundtruth
Blockwise Independent Guesses
https://github.com/arbenson/mrtsqr
Best Implemented at Google
"
https://github.com/dgleich/simform
David Gleich · Purdue
Simons PDAIO
32
Code