Big Data is changing abruptly, and where it is likely heading
1. Big Data is changing abruptly, and where it is likely heading
Big Data Spain
2014-11-17
bigdataspain.org
Paco Nathan
@pacoid
2. Some questions, and an emerging idea…
The economics of datacenters has shifted toward
warehouse scale, based on commodity hardware
that includes multicore and large memory spaces.
Incumbent Big Data technologies do not embrace those changes; however, newer OSS projects have emerged to leverage them.
3. Some questions, and an emerging idea…
Also, the data requirements are changing abruptly
as sensor data, microsatellites, and other IoT use
cases boost data rates by orders of magnitude.
We have the math available to address those industrial needs, but have software frameworks kept up?
5. If you’re not using them, you’re missing out…
See recent announcements by Google, Amazon, etc.

Containers are the tip of the iceberg: operating systems have evolved quietly, while practices close to customers have perhaps not kept pace
Containers
6. Key benefits based on Mesos case studies:
• mixed workloads
• higher utilization rates
• lower latency for data products
• significantly lower Ops overhead
Examples: Mesos @Twitter, eBay, Netflix, many
startups, etc.; Borg/Omega @Google,
Symphony @IBM, Autopilot @Microsoft, etc.
Major cloud users understand this shift; however, much of the world still thinks in terms of VMware
Containers
7. We need much more than Docker; we need better definitions for what is to be used outside of the container as well – e.g., for microservices
• Intro to Mesos: goo.gl/PbUfNe
• https://mesosphere.com/
• A project to watch: Weave
Containers
8. Containers
One problem: how does an institution, such
as a major bank, navigate this kind of change?
10. Speaking of clouds, forgive me for being biased…
For nearly a decade, I’ve been talking with people about whether cloud economics makes sense.
Consider: datacenter utilization rates in the single digits (~8%)
Clouds
11. Companies accustomed to cheap energy rates, with staff who think in terms of individual VMs…
these will not find clouds to be friendly territory.
It is not so much about software as it is about sophisticated engineering practice.
The change recalls the shift of BI ⇒ Data Science
That was not simple…
Clouds
12. The majors understand this new world: Google,
Amazon, Microsoft, IBM, Apple, etc.
Sub-majors can create their own clouds: Facebook,
Twitter, LinkedIn, etc.
Clouds
What about industries approaching Big Data now
which cannot afford to buy small armies of expert
Ops people? How they must proceed becomes a
key question.
13. Current research points to significant changes
ahead, where machine learning plays a key role
in the context of advanced cluster scheduling:
Clouds
Improving Resource Efficiency
with Apache Mesos
Christina Delimitrou
youtu.be/YpmElyi94AA
Experiences with Quasar+Mesos showed:
• 88% apps get >95% performance
• ~10% overprovisioning instead of 500%
• up to 70% cluster util at steady state
• 23% shorter scenario completion
15. A question on StackOverflow in 2013 challenged
Twitter’s use of Abstract Algebra libraries for their
revenue apps…
cs.stackexchange.com/questions/9648/what-use-are-groups-monoids-and-rings-in-database-computations
The answers are enlightening: think of a kind of
containerization for business logic – leading to
exponential increase in performance, particularly
in the context of streaming cases.
Abstract Algebra
16. Abstract Algebra
A cheat-sheet:
• Semigroup: a non-empty set with a binary associative operation, with closure
• Monoid: a semigroup that also has an identity element
• Group: a monoid in which each element has an inverse
• Ring: has two binary associative operations: addition and multiplication
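The cheat-sheet above can be made concrete in a few lines. A minimal sketch (the `Monoid` class and instance names here are illustrative, not from Algebird or any other library): an identity element plus a binary associative operation, and a fold that uses them.

```python
from functools import reduce

# A monoid sketch: an identity element plus a binary associative
# operation with closure. (Class and names are illustrative only.)
class Monoid:
    def __init__(self, identity, op):
        self.identity = identity
        self.op = op

    def fold(self, xs):
        # reduce an iterable down to a single value
        return reduce(self.op, xs, self.identity)

# two familiar monoids: (int, +, 0) and (str, +, "")
int_sum = Monoid(0, lambda a, b: a + b)
str_concat = Monoid("", lambda a, b: a + b)

data = [3, 1, 4, 1, 5, 9]

# associativity means partial folds over partitions can be combined:
# fold(whole) == fold([fold(part1), fold(part2)])
whole = int_sum.fold(data)
partial = int_sum.fold([int_sum.fold(data[:3]), int_sum.fold(data[3:])])
assert whole == partial == 23
```

Because the operation is associative and has an identity, partial folds can run on separate partitions and be combined in any order, which is exactly what a distributed reducer needs.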
17. Add ALL the Things:
Abstract Algebra Meets Analytics
infoq.com/presentations/abstract-algebra-analytics
Avi Bryant, Strange Loop (2013)
• grouping doesn’t matter (associativity)
• ordering doesn’t matter (commutativity)
• zeros get ignored
In other words, while partitioning data at scale is quite difficult, you can let the math keep your code flexible at scale
Avi Bryant
@avibryant
Abstract Algebra
18. Algebra for Analytics
speakerdeck.com/johnynek/algebra-for-analytics
Oscar Boykin, Strata SC (2014)
• “Associativity allows parallelism
in reducing” by letting you put
the () where you want
• “Lack of associativity increases
latency exponentially”
Oscar Boykin
@posco
Abstract Algebra
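Boykin’s point about putting the () where you want can be demonstrated directly. A small sketch (function names are made up for illustration): a pairwise “tree” reduction, as a parallel runtime might apply across partitions, agrees with a strictly sequential fold only when the operation is associative.

```python
from functools import reduce

def tree_reduce(op, xs):
    """Reduce pairwise, the way a parallel runtime might combine
    per-partition partial aggregates, level by level."""
    xs = list(xs)
    while len(xs) > 1:
        paired = []
        for i in range(0, len(xs) - 1, 2):
            paired.append(op(xs[i], xs[i + 1]))
        if len(xs) % 2 == 1:          # odd element carries to next level
            paired.append(xs[-1])
        xs = paired
    return xs[0]

data = list(range(1, 9))

# addition is associative: any placement of parentheses agrees
assert tree_reduce(lambda a, b: a + b, data) == reduce(lambda a, b: a + b, data)

# subtraction is not associative: the two strategies disagree
seq = reduce(lambda a, b: a - b, data)    # -34
tre = tree_reduce(lambda a, b: a - b, data)  # 0
assert seq != tre
```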
20. Functional Programming for Big Data
Theory, eight decades ago:
what can be computed?
Haskell Curry
haskell.org
Alonzo Church
wikipedia.org
Praxis, four decades ago:
algebra for applicative systems
John Backus
acm.org
David Turner
wikipedia.org
Reality, two decades ago:
machine data from web apps
Pattie Maes
MIT Media Lab
21. Functional Programming for Big Data
circa 2002:
mitigate risk of large distributed workloads lost
due to disk failures on commodity hardware…
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
22. Functional Programming for Big Data
DryadLINQ influenced a new class of workflow
abstractions based on functional programming:
Spark, Flink, Scalding, Scoobi, Scrunch, etc.
• needed for leveraging the newer hardware:
multicore, large memory spaces, etc.
• significant reduction in code volume ⇒ lower engineering costs
23. Functional Programming for Big Data
2002 MapReduce @ Google
2004 MapReduce paper
2006 Hadoop @ Yahoo!
2008 Hadoop Summit
2010 Spark paper
2014 Apache Spark becomes a top-level Apache project
24. Functional Programming for Big Data
general batch processing: MapReduce
specialized systems (iterative, interactive, streaming, graph, etc.): Pregel, Giraph, Dremel, Drill, Impala, Tez, S4, Storm, GraphLab, F1, MillWheel
MR doesn’t compose well for large applications, and so specialized systems emerged as workarounds
25. Functional Programming for Big Data
circa 2010:
a unified engine for enterprise data workflows,
based on commodity hardware a decade later…
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury,
Michael Franklin, Scott Shenker, Ion Stoica
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
26. Functional Programming for Big Data
One minor problem: where do we find
many graduating students who are good at
Scala, Clojure, F#, etc.?
28. More than one generation of computer scientists has grown up thinking that Data implies SQL…
Excellent as a kind of declarative, functional
language for describing data transformations
However, in the long run we should move away
from thinking in terms of SQL
Computational Thinking as a rubric in
general would be preferred
Databases
29. A step further is to move away from thinking of
data as an engineering issue – it is about business
So much of the US curricula for math is still stuck
in the Cold War: years of calculus to identify the
best people to build ICBMs
We need more business people thinking in terms
of math, beyond calculus, to contend with difficult
problems in supply chain, maintenance schedules,
energy needs, transportation routes, etc.
Databases
30. Databases
1. starting with real-world data ⇒
2. leverage graph queries for representation ⇒
3. convert to sparse matrix, use FP and abstract algebra ⇒
4. achieve high-ROI parallelism at scale, mostly about optimization
32. Approximations
19th-20th c. statistics emphasized defensibility in lieu of predictability, based on analytic variance and goodness-of-fit tests

That approach inherently led toward a manner of computational thinking based on batch windows

They missed a subtle point…
33. 21c. shift towards modeling based on probabilistic
approximations: trade bounded errors for greatly
reduced resource costs
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Approximations
34. 21c. shift towards modeling based on probabilistic approximations: trade bounded errors for greatly reduced resource costs
Twitter catch-phrase: “Hash, don’t sample”
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Approximations
35. a fascinating and relatively new area, pioneered by relatively few people – e.g., Philippe Flajolet
provides approximation, with error bounds – in general uses significantly fewer resources (RAM, CPU, etc.)
many algorithms can be constructed from combinations of read and write monoids
aggregate different ranges by composing hashes, instead of repeating full queries
Approximations
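To make the hash-composition idea concrete, here is a crude sketch of the Flajolet-Martin estimator that HyperLogLog refines (function names are illustrative): hash each item, track the longest run of trailing zero bits seen, and estimate cardinality as 2 to that power. A single register like this has large error; HyperLogLog averages many such registers to obtain tight bounds.

```python
import hashlib

def trailing_zeros(n, width=32):
    """Count trailing zero bits of a hash value."""
    if n == 0:
        return width
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(items):
    """Flajolet-Martin-style cardinality estimate from a single register."""
    max_run = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_run = max(max_run, trailing_zeros(h))
    return 2 ** max_run

# duplicates don't change the estimate, since hashing the same item
# yields the same value -- this is why "hash, don't sample" works
# for set cardinality
est_once = fm_estimate(range(1000))
est_dup = fm_estimate(list(range(1000)) * 10)
assert est_once == est_dup
```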
36. algorithm          use case                   example
Count-Min Sketch       frequency summaries        code
HyperLogLog            set cardinality            code
Bloom Filter           set membership
MinHash                set similarity
DSQ                    streaming quantiles
SkipList               ordered sequence search
Approximations
37. suggestion: consider the algorithms above as your most quintessential collections data types at scale
Approximations
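As one worked example from the table, a minimal Count-Min Sketch in plain Python (illustrative only, not a production implementation such as Algebird’s): a depth × width grid of counters where each item increments one counter per row, and a point query takes the minimum across rows, so frequencies are never under-counted, only over-counted within a bound set by the width.

```python
import hashlib

class CountMinSketch:
    """A toy Count-Min Sketch: depth independent hash rows over a
    fixed-width array of counters."""
    def __init__(self, depth=4, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # one bucket per row, chosen by an independent (salted) hash
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # the minimum across rows limits over-counting from collisions
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["spark", "spark", "mesos", "spark"]:
    cms.add(word)
assert cms.estimate("spark") >= 3   # never under-counts
```

Note the monoid connection: two sketches built over different data partitions merge by element-wise addition of their counter grids, so the structure aggregates across a cluster.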
38. • sketch algorithms: trade bounded errors for
orders of magnitude less required resources,
e.g., fit more complex apps in memory
• multicore + large memory spaces (off heap) are
increasing the resources per node in a cluster
• containers allow for finer-grain allocation of
cluster resources and multi-tenancy
• monoids, etc.: guarantees of associativity within
the code allow for more effective distributed
computing, e.g., partial aggregates
• fewer resources must be spent sorting/windowing data prior to working with a data set
• real-time apps, which don’t have the luxury of anticipating data partitions, can respond quickly
Approximations
39. Probabilistic Data Structures for Web Analytics and Data Mining – Ilya Katsov (2012-05-01)
A collection of links for streaming algorithms and data structures – Debasish Ghosh
Aggregate Knowledge blog (now Neustar) – Timon Karnezos, Matt Curcio, et al.
Probabilistic Data Structures and Breaking Down Big Sequence Data – C. Titus Brown, O’Reilly (2010-11-10)
Algebird – Avi Bryant, Oscar Boykin, et al., Twitter (2012)
Mining of Massive Datasets – Jure Leskovec, Anand Rajaraman, Jeff Ullman, Cambridge (2011)
Approximations
41. Spreadsheets represented a revolution in how
to think about computing
What Fernando Perez, et al., have done with IPython is subtle and powerful

Technology is not just about writing code; it is about people using that code

Enterprise data workflows are moving toward cloud-based notebooks
Notebooks
42. Early emphasis on tools in Big Data now gives
way to workflows: people, automation, process,
integration, test, maintenance, scale
Google had the brilliant approach of shared
documents…
Now we see cloud-based notebooks, which
begin to displace web apps
Notebooks
http://nbviewer.ipython.org/
http://jupyter.org/
45. What is Spark?
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
47. What is Spark?
Spark Core is the general execution engine for the
Spark platform that other functionality is built atop:
• in-memory computing capabilities deliver speed
• general execution model supports wide variety
of use cases
• ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)
48. What is Spark?
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
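The Spark version referenced above is typically written as three chained operations: flatMap to split lines, map to emit (word, 1) pairs, and reduceByKey with an associative +. This plain-Python emulation of those stages (no Spark required; names chosen only to mirror the stages) shows the shape of the computation:

```python
from collections import defaultdict

lines = ["to be or not to be"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word with an associative +
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because the final + is associative, the reduceByKey stage is exactly where a cluster can combine per-partition partial counts in any order.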
49. What is Spark?
Sustained exponential growth, as one of the most
active Apache projects ohloh.net/orgs/apache
51. • generalized patterns
⇒ unified engine for many use cases
• lazy evaluation of the lineage graph
⇒ reduces wait states, better pipelining
• generational differences in hardware
⇒ off-heap use of large memory spaces
• functional programming / ease of use
⇒ reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
What is Spark?
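The lazy-evaluation bullet can be illustrated loosely with Python generators (an analogy only, not Spark’s implementation): each stage merely records what to do, nothing executes until a terminal action pulls results, and elements pipeline through the stages without materializing intermediate collections.

```python
# a log to observe when work actually happens
log = []

def source():
    # pretend data source; records each read as a side effect
    for i in range(5):
        log.append(f"read {i}")
        yield i

def doubled(xs):
    # a "transformation" stage: lazy, does nothing until pulled
    for x in xs:
        yield x * 2

# building the "lineage" triggers no reads at all
pipeline = doubled(source())
assert log == []

# the action (sum) drives the whole pipeline, one element at a time
total = sum(pipeline)
assert total == 20
assert log == [f"read {i}" for i in range(5)]
```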
52. What is Spark?
Kafka + Spark + Cassandra
datastax.com/documentation/datastax_enterprise/4.5/
datastax_enterprise/spark/sparkIntro.html
http://helenaedelson.com/?p=991
github.com/datastax/spark-cassandra-connector
github.com/dibbhatt/kafka-spark-consumer
unified compute: Spark; data streams: Kafka; columnar key-value: Cassandra
56. certification:
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
57. community:
spark.apache.org/community.html
video+slide archives: spark-summit.org
events worldwide: goo.gl/2YqJZK
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
new: MOOCs via edX and University of California
58. books:
Fast Data Processing
with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/
9781782167068.do
Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/
Learning Spark
Holden Karau,
Andy Konwinski,
Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/
0636920028512.do
59. events:
Strata EU
Barcelona, Nov 19-21
strataconf.com/strataeu2014
Data Day Texas
Austin, Jan 10
datadaytexas.com
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
60. presenter:
monthly newsletter for updates,
events, conf summaries, etc.:
liber118.com/pxn/
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/
0636920028536.do