Proposal: JupyterLab Data Registry #5548

ellisonbg · 2018-10-28T01:13:18Z

This is a proposal for the creation of a data bus for JupyterLab. I have talked to a few folks about these ideas, but wanted to put them down in concrete form to open up the discussion and work.

Why?

Different JupyterLab extensions are being created to work with datasets of different types in different ways:

Notebooks that work with dataframes, which are visualized as tables and plots.
Extensions that allow a user to create data visualizations without code (https://github.com/altair-viz/jupyterlab_voyager and a similar one for plotly).
The datagrid extension, which visualizes csv files as a table.

Right now, these different extensions have to create a lot of custom glue to move data between each other. This is painful for developers and artificially limits the ways that data can be used. Once a dataset has been used in JupyterLab in any way, all other extensions that can work with that dataset should immediately become aware of it.

Additionally, datasets don't exist in a vacuum. They are surrounded by rich metadata. Such metadata includes associated notebooks, code snippets, links to publications and people that have used the dataset, etc. Furthermore, users collaborating with others will need collaboration facilities, including the ability to create new metadata through comments and annotations on the dataset. Right now, there is no place to store this metadata or associate it with a dataset in JupyterLab.

Finally, there is a wide range of datasets that cannot be fully downloaded to JupyterLab. This includes large files (CSV, HDF5, etc.), SQL databases, streaming datasets, video, and other API endpoints for working with data (such as S3).

What?

We propose the creation of a Data Bus that enables datasets to become a first class entity in JupyterLab. This Data Bus will have the following characteristics:

It will be based on an abstract notion of a dataset that includes a MIME type, a schema for metadata, some sort of URI that points to the actual data, and typescript interfaces for working with the data.
While it isn't strictly required, the use cases we have in mind initially are for immutable data. However, the metadata should be treated as mutable, to allow user-generated metadata to be added.
The Data Bus will be a set of APIs in the frontend that allow other extensions to register as providers or consumers of datasets of a given MIME type.
We will want to consider the scoping of the Data Bus in its design. What state should be per user, per notebook server, per workspace, etc.? This will be important to enable productive collaboration. The different scoping will require storing related state in appropriate places (frontend, real-time store, file system, etc.).
It will be up to those other extensions to determine the semantics around when a dataset is registered with the Data Bus. For example, the file browser could register all datasets in the current working directory, or it could wait until a user opens a dataset to register it.
Data consumers will be able to query datasets of a given MIME type (or all) as well as watch a signal that fires any time a new dataset of that type is provided.
In initial API brainstorming, the consumer/provider APIs end up looking a lot like a traditional PUB/SUB system. We may want to introduce the idea of topics as is common in PUB/SUB systems, but maybe the MIME types are sufficient.
When a data provider registers a dataset with the Data Bus, consumers will be notified of the dataset and be able to get a handle of the abstract dataset. Again this handle will include the mimetype, metadata, URI, and interfaces for accessing the data.
To enable collaborative commenting and annotation on the dataset metadata (imagine having a discussion with someone about a column of a SQL DB table), the metadata will likely need to be stored in the real-time datastore. We have begun to do separate work on a general commenting and annotation system, but the general idea is that comments/annotations will be another table in a real-time datastore schema.
Additional server side work will likely be required to make remote datasets of different types available to JupyterLab. @danielballan and @ian-r-rose have started to do work on this that integrates with the Jupyter server content API. It is intended for the Data Bus to be independent of the server side work, but integrate cleanly with it. The dataset notion should be sufficiently abstract that the details of where the data comes from are a separate concern.
We may want to introduce the notion of adaptors between different MIME types. The reason for this is that there may be multiple different MIME types that expose compatible data. An example would be that CSV files, SQL DB tables, etc. all may expose a tabular dataset. Some extensions (vega-lite) expect tabular data to be in-memory JSON, whereas others (datagrid) can handle dynamically loaded data. An adaptor pattern will make it easier to deal with the realities of different extensions that consume the data.
User interfaces for browsing datasets that the Data Bus is aware of would be an example of a data consumer.
User interfaces for rendering a dataset or allowing a user to work with metadata would be done as data consumers.
In some ways, this is similar to how we do MIME renderers for files and output, but a more general approach that separates registration/providing from the rendering step, and also includes metadata.
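
To make the provider/consumer idea above concrete, here is a minimal sketch of what such a registry could look like. It is only an illustration under stated assumptions: the names `Dataset`, `DataRegistry`, `Listener`, `register`, `query`, and `watch` are hypothetical and do not come from an existing JupyterLab API, and a plain callback list stands in for a real signal implementation.

```typescript
// Hypothetical sketch: Dataset, DataRegistry, and Listener are illustrative
// names, not an existing JupyterLab API.

/** Abstract dataset handle: MIME type, pointer to the data, mutable metadata. */
interface Dataset {
  readonly mimeType: string;
  readonly uri: string;
  metadata: Record<string, unknown>;
}

type Listener = (dataset: Dataset) => void;

class DataRegistry {
  private datasets: Dataset[] = [];
  private listeners: Record<string, Listener[]> = {};

  /** Provider side: publish a dataset and notify the consumers watching its type. */
  register(dataset: Dataset): void {
    this.datasets.push(dataset);
    const typed = this.listeners[dataset.mimeType] || [];
    const wildcard = this.listeners["*"] || [];
    typed.concat(wildcard).forEach((listener) => listener(dataset));
  }

  /** Consumer side: list known datasets of one MIME type, or all of them. */
  query(mimeType?: string): Dataset[] {
    return this.datasets.filter(
      (d) => mimeType === undefined || d.mimeType === mimeType
    );
  }

  /** Consumer side: be called for every future dataset of a type ("*" = any). */
  watch(mimeType: string, listener: Listener): void {
    this.listeners[mimeType] = (this.listeners[mimeType] || []).concat([listener]);
  }
}

// Usage: one extension provides datasets, another watches for CSV files.
const registry = new DataRegistry();
const opened: string[] = [];
registry.watch("text/csv", (d) => opened.push(d.uri));
registry.register({ mimeType: "text/csv", uri: "file:///data/iris.csv", metadata: {} });
registry.register({ mimeType: "application/json", uri: "file:///data/config.json", metadata: {} });
```

In this sketch the MIME type doubles as the PUB/SUB topic, which is one way to read the open question above about whether separate topics are needed.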

How?

@ian-r-rose has started some explorations here:

https://github.com/ian-r-rose/jupyterlab-remote-data

We also have funding from the Schmidt Foundation @ Cal Poly and are getting that set up to fund other core JupyterLab folks to help out.

@saulshanabrook @sccolbert @jasongrout @afshin @blink1073 @ian-r-rose

mabayona · 2018-10-28T14:59:32Z

Great Proposal. What about adding some ideas/code from arrow: https://arrow.apache.org/docs/python/
Would it be possible to incorporate arrow as the basis for the JupyterLab Data Bus? or at least as common memory format?

ellisonbg · 2018-10-29T03:14:27Z

I should have mentioned arrow. Some of the ideas for the Data Bus have come from discussions with various folks (including @wesm ) about Arrow. It probably makes sense for us to begin to detail the different data types we would want to support. However, I think we may want to distinguish between input formats and the actual data type formats. For example, a number of different input data formats may provide tabular data, but we may only want to have a single tabular data MIME type in the Data Bus.

Some of the input data formats that I know are relevant:
HDF5
CSV
JSON
JSON Table schema
Relational DBs
Arrow/Parquet
Text

We may also want to have streaming, in memory and dynamically loading variants of some of these (certainly for the tabular and text file based ones).

wesm · 2018-10-29T12:03:55Z

For structured data interchange (anything tabular or JSON-like) I strongly encourage you to consider using the Arrow columnar binary protocol as one of your main mediums. Having worked on this problem space for several years now, getting all the details on this right is devilishly difficult and creating libraries that faithfully implement the same protocol is very time consuming.

I would honestly put Arrow in a different category than the other things you listed. It is much more a protocol for interchange than a storage format, and so distinct from Parquet or other binary formats used for storage.

You might want to take a look at the Flight RPC framework we are developing, which uses gRPC under the hood.

Let us know if we can help!

ellisonbg · 2018-11-01T01:59:43Z

In the JupyterLab meeting today, a few things came up:

Probably makes sense for the data in the Data Bus to be an any type that consumers can cast appropriately for a given MIME type. This will allow the most flexibility for that to be a URI string, TypeScript interface, JSON object, etc.
We also talked about whether it makes sense to differentiate between input file formats and more abstract ones (such as a general table notion). We feel that it doesn't make sense to distinguish those, but instead, let consumers and providers decide.
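
As a rough illustration of the "any type that consumers cast" idea, the sketch below types the payload as `unknown` and lets each consumer narrow it based on the MIME type. The `Dataset` shape, the helper names `asRows` and `asCsvUrl`, and the convention that a `text/csv` payload is a URL string are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; not an existing JupyterLab API.
interface Dataset {
  mimeType: string;
  data: unknown; // a URI string, a JSON object, an object with methods, ...
}

type Row = Record<string, string | number>;

/** A consumer that wants in-memory rows only accepts the MIME type it knows. */
function asRows(dataset: Dataset): Row[] | undefined {
  if (dataset.mimeType === "application/json" && Array.isArray(dataset.data)) {
    return dataset.data as Row[];
  }
  return undefined;
}

/** A consumer that loads lazily expects the payload to be a URL string. */
function asCsvUrl(dataset: Dataset): string | undefined {
  if (dataset.mimeType === "text/csv" && typeof dataset.data === "string") {
    return dataset.data;
  }
  return undefined;
}

const inMemory: Dataset = {
  mimeType: "application/json",
  data: [{ species: "setosa", petalLength: 1.4 }],
};
const remote: Dataset = { mimeType: "text/csv", data: "https://example.org/iris.csv" };

const rows = asRows(inMemory); // narrowed to Row[]
const url = asCsvUrl(remote); // narrowed to string
```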

@wesm - thanks, and yes, fully agree with what you are saying here.

Links:

https://grpc.io/
https://github.com/apache/arrow/blob/master/format/Flight.proto

fperez · 2018-11-01T19:43:24Z

In terms of schemas, there are some open efforts to integrate scientific/social data description schemas at schema.org, and the companion DataCommons.org provides semantically integrated datasets.

Would be a good idea to follow these open standards where feasible.

bollwyvl · 2018-11-01T21:06:56Z

+100 to schema.org, hadn't seen datacommons! The heavyweight player in open source data catalogs is CKAN (AGPL, https://github.com/ckan/ckan), which runs data.gov, etc. Another very large scale data management system is iRODS (https://github.com/irods/irods). As with the on-going annotation and language server efforts, we gain very little in implementing our own new formats, and potentially a great deal in supporting open (or de facto, but openly implemented) standards.

wesm · 2018-11-02T11:03:20Z

From my perspective, as long as producers and consumers have the option to embed a streaming binary protocol (i.e. some people could embed Arrow's message protocol) and do minimal writes on the server side, and zero-copy receives on the client side, then that sounds great.

So the data payload could be treated opaquely in the Data Bus, and handled by code (e.g. IO handlers in various kernels/widgets) that does not necessarily know how to deserialize or access the data. Our goal with Arrow's Flight RPC system is to enable gRPC clients or servers that don't necessarily know about Arrow columnar data, only Protocol Buffers, to still be able to handle the opaque components of the data stream (the "FlightData" message https://github.com/apache/arrow/blob/master/format/Flight.proto#L275).

It would also be useful to have a serialization protocol-independent schema representation.

saulshanabrook · 2018-11-07T18:42:39Z

It might make sense to systematically analyze types of data/data transports/data stores that are outlined above and start outlining use cases, so we can get a sense of what the end goals could look like here.

From that we can begin to understand what an "adapter" looks like here and how we can start building a mental model of the structure of the problem.

If we take the Voyager plugin as an example, it takes in either a URL or some inline data (Vega Lite data docs). So now let's say we have a CSV file on disk and we want to use Voyager with that data. We could do this by getting the contents of the CSV file and parsing it in JupyterLab, and sending Voyager the inline data. Or we could send Voyager the URL itself and let it parse the CSV file.

It seems that both could have advantages. If you parse it first in JL and then send in the data, you could re-use that JSON structure if another extension wanted it, without re-parsing. However, if you send in the URL directly, then you can let Voyager handle the parsing, which could possibly be more efficient based on their implementation or do some type inference better for the use case.

If we expand this picture to look at taking some data from a notebook and visualizing it in Voyager, we have even more possible routes. We could save this to JSON file or save this to a CSV file, and pass in those URLs. Or we could export it to Arrow on the server and load this on the client, parsing to JSON, then feed this to Voyager in memory.

The goal of this approach is to start with use cases, then find what technology supports that use case efficiently, then figure out how to design a system that is flexible enough to support chaining the required technology together with the right UX.

wesm · 2018-11-07T18:58:27Z

The goal of this approach is to start with use cases, then find what technology supports that use case efficiently, then figure out how to design a system that is flexible enough to support chaining the required technology together with the right UX.

It seems to me that JDB ("Jupyter Data Bus") should be agnostic to the form of data serialization used. What you need is:

Data format (e.g. CSV, JSON, Avro, Arrow, etc.)
Data payload (it would be a good idea to think about how to avoid unnecessary copying of the payload in the protocol)
Pluggable additional metadata (e.g. this could contain a schema for the CSV file, or a schema to coerce the JSON to; with wire protocols like Avro or Arrow, the schema is transmitted as part of the protocol, though perhaps as a separate message frame)

This problem of a dataset spanning multiple message frames should be part of the JDB protocol. In Arrow, for example, obtaining the complete schema including dictionaries (for dictionary-encoded fields) may involve receiving multiple messages. In Avro, the schema (JSON) could be sent first, then sequences of records as follow-up payloads.

Dealing with un-schema'd data in production applications is painful enough / dangerous enough that Jupyter component developers will probably want to use data transport with strong schemas.
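
A small sketch may help picture a dataset that spans several message frames: a schema frame first, then any number of record frames, then an end marker. The `Frame` layout and the `assemble` function below are invented for illustration; they are not the Arrow or Avro wire format, and the payload is held as plain arrays rather than zero-copy buffers.

```typescript
// Hypothetical frame layout for illustration only; this is NOT the Arrow or
// Avro wire format.
type Frame =
  | { kind: "schema"; format: string; fields: string[] }
  | { kind: "records"; rows: unknown[][] }
  | { kind: "end" };

interface AssembledDataset {
  format: string;
  fields: string[];
  rows: unknown[][];
}

/** Rebuild one dataset from an ordered sequence of frames, checking the schema. */
function assemble(stream: Frame[]): AssembledDataset {
  let schema: { format: string; fields: string[] } | undefined;
  const rows: unknown[][] = [];
  let ended = false;

  for (const frame of stream) {
    if (ended) {
      throw new Error("frame received after end of stream");
    }
    switch (frame.kind) {
      case "schema":
        if (schema !== undefined) {
          throw new Error("duplicate schema frame");
        }
        schema = { format: frame.format, fields: frame.fields };
        break;
      case "records":
        if (schema === undefined) {
          throw new Error("records received before schema");
        }
        for (const row of frame.rows) {
          if (row.length !== schema.fields.length) {
            throw new Error("row does not match schema");
          }
          rows.push(row);
        }
        break;
      case "end":
        ended = true;
        break;
    }
  }

  if (schema === undefined || !ended) {
    throw new Error("incomplete stream");
  }
  return { format: schema.format, fields: schema.fields, rows };
}

// Usage: the schema arrives once, the records arrive in two batches.
const incoming: Frame[] = [
  { kind: "schema", format: "example/table", fields: ["species", "petalLength"] },
  { kind: "records", rows: [["setosa", 1.4], ["setosa", 1.3]] },
  { kind: "records", rows: [["virginica", 5.5]] },
  { kind: "end" },
];
const assembled = assemble(incoming);
```

Rejecting records that arrive before a schema is one simple way to enforce the "strong schemas" point made above.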

10Dev · 2018-11-08T09:47:39Z

FWIW, a reminder to keep #2815 in mind and how this would efficiently work inter-cell.

psychemedia · 2018-11-21T23:34:38Z

Way out of my depth here, and not sure if this is out of scope, but I just came across tributary, a package supporting Python data streams, offering reactive, asynchronous, functional and lazily evaluated data streams that perhaps complement static data MIME types with streaming data feeds? Not sure if they're the sort of thing that could offer data access onto and egress from a streaming data bus type? WebRTC would be another obvious streaming type.

The same developer also seems to have had a hand in this streaming chart package — https://github.com/jpmorganchase/perspective — which might provide a possible streaming data bus consumer use case?

10Dev · 2018-11-22T00:35:20Z

Adding a link to the above mentioned project:

https://github.com/timkpaine/tributary

Also, @BoPeng multiple kernel Notebook:

https://github.com/vatlab/sos-notebook

It should be considered vital not to take a dependency on a particular language and its implementation, which might make Apache Avro important: https://avro.apache.org

gRPC might also be worth consideration https://grpc.io as a building block

BoPeng · 2018-11-22T01:49:07Z

Many thanks to @10Dev for including me in the discussion. The proposed DataBus is at the JupyterLab level and is mostly designed for extensions that consume dataframe-like data, but I suppose language kernels could make use of DataBus later if an API is provided, and it can be expanded to support more datatypes. In that case any kernel could use some magics to read from and write to the bus and exchange data with the frontend and other kernels. This is brilliant!

Anyway, before DataBus becomes available, I would like to write a bit about how SoS does a similar thing to exchange data between multiple kernels in the same notebook. Basically, SoS is a super kernel that allows the use of multiple kernels in one notebook, and allows the exchange of variables among them. Using a %get magic in the format of

(in kernel_A)
%get var_name --from kernel_B

SoS creates an independent homonymous variable in kernel_A with a similar type to the variable in kernel_B. This currently works for kernels for 11 languages and for most native data types, and requires no modification to Jupyter or supported kernels.

Under the hood, SoS defines language modules for each language (e.g. sos-r, sos-python) that "understand" the data types of the language and assist the transfer of variables directly or by way of the SoS (python3) kernel. More specifically, when

%get mtcars --from R

is executed from a kernel, SoS would run a piece of code (hidden to users) to save mtcars to a temporary feather file (based on apache arrow), and run another piece of code in the destination kernel to load it. Simpler datatypes can be transferred directly via memory.

This design is non-centric and incremental in the sense that

There is no central data bus because kernel_A can transfer data directly to kernel_B.
There is no guarantee of lossless data transfer because, for example, Julia does not yet support row labels for data frames, so row labels will be lost if it gets mtcars from R.
It can in theory support the transfer of any data types in any language by expanding the language modules (e.g. types such as Series, slice, Range). This also means a language module can be added to support only a few major data types and expand as needs arise.

I can imagine that SoS can make use of DataBus to expand the data exchange capacity to frontends, and assist the data exchange among kernels, so I will be happy to assist/participate in the development of DataBus. Actually, we ourselves have tried to conceptualize a similar project for data exchange between languages (sos-dataexchange) outside of Jupyter, which could benefit from the DataBus project.

I presented the data exchange feature of SoS in my JupyterCon talk in August. You can check out the YouTube video (starting at the 7-minute mark) if you are interested.

BoPeng · 2018-11-22T14:07:53Z

Allow me to propose another idea we had during the brainstorm of the sos-dataexchange project.

How about implementing DataBus as a separate project?

Here is how it might work:

Implement DataBus as a data warehouse sort of project that is independent of JupyterLab.
DataBus can "consume" data or "interface" data. In the former case DataBus accepts and holds the content of the data, in the latter case DataBus knows how to access the data with the passed meta information. In the extreme case a DataBus can connect to other (public, remote, etc) DataBuses.
When a DataBus daemon is started, it exposes a (few) (zmq) communication channels. A protocol is defined to talk to the daemon to send and receive entire or pieces of "data" in certain ways.
Individual languages, Jupyter kernels, JupyterLab extensions would implement their own libraries to talk to the DataBus.
On the JupyterLab side, it can start a DataBus instance or connect to an existing DataBus instance and let the rest of the components talk to the DataBus by themselves.

The advantages?

It can be used without JupyterLab. Think of a scenario in which users start a databus instance and run a workflow that consists of steps in different scripting languages and use databus to exchange data. This is basically the motivation for the sos-dataexchange project.
It decentralizes the implementation because each language, each data source (e.g. hdf5) can define its own library to work with DataBus. The core of DataBus would be the protocol which can be implemented in different ways. I would also imagine more interest from the community if it has a broader scope.
The possibility of chaining DataBuses or connecting a DataBus to multiple DataBuses can revolutionize the way we work with distributed datasets. In the case of JupyterLab, an extension can handle/visualize data from arbitrary databuses, not necessarily the one provided by JupyterLab.

10Dev · 2018-11-23T03:41:01Z

To my understanding, you can't expect an effective #2815 without this #5548 and implementing that outside JupyterLab as suggested by @BoPeng might introduce serious performance issues.

Without #2815, JupyterLab becomes another dead end in the inherently polyglot field of data science and AI. Perhaps only masochists enjoy polyglot, but it is here to stay for a very long time and needs to be properly addressed as soon as it can for JupyterLab to have relevance and longevity. The longer that #2815 is pushed into future milestones, the more dependencies that build up and make the eventual implementation into a formidable undertaking that could exceed resources available.

The DataBus is something that can seriously "grease the wheels" to help #2815 or become an impediment if not designed right.

Also, semantics. What we mean by "data" and "databus" is going to drag up very different associations for different experiences and domains. A DataBus can be something like D-Bus ( https://en.wikipedia.org/wiki/D-Bus ) that was supposed to be a lightweight system like this proposal that turned into a legacy monster, or a DataBus can be a type of pipelining system like Nextflow ( https://www.nextflow.io ), or just in-memory organized RAM such as several OSS projects: Apache Arrow ( https://arrow.apache.org ) - Apache Ignite ( https://ignite.apache.org ) - Apache CarbonData ( https://carbondata.apache.org ) - Apache Gora ( https://gora.apache.org ) - Hazelcast ( https://hazelcast.org ) - Infinispan ( http://infinispan.org )

The @BoPeng sos-dataexchange appears to lean more toward a pipeline system than an inter-kernel data send/receive system.

FWIW, I think there is a critical need for a universal data system that has some aspects of a pipeline but is more flexible like a big data software implementation of a Crossbar switch ( https://en.wikipedia.org/wiki/Crossbar_switch ) or perhaps think of it as a local embeddable in-memory Data Grid ( https://en.wikipedia.org/wiki/Data_grid )

Pipelines are too limited and linear even if you add DAGs to them.

Which is why there is a giant discontinuity from Notebooks to production pipelines with a lot of hand crafting and often complete change of architecture to get any reasonable performance.

I might have a failure of imagination but I can't see how any of this can get shoe-horned into JupyterLab...

Returning consideration to an intra-kernel and inter-kernel ( #2815 ) DataBus, there is the inevitable surfacing of the Event System needs, perhaps illustrated by Vert.x ( https://vertx.io )

For Apache Arrow, people might find this article interesting:

http://wesmckinney.com/blog/apache-arrow-pandas-internals/

BoPeng · 2018-11-23T07:10:42Z

Which is why there is a giant discontinuity from Notebooks to production pipelines with a lot of hand crafting and often complete change of architecture to get any reasonable performance.

I believe that the concept of DataBus was conceived without consideration of workflow systems. However, the SoS suite of tools, namely the SoS polyglot notebook and SoS workflow engine were designed to narrow the gap between notebooks and production pipelines, and a lack of data exchange model for the SoS workflow system directly motivated the discussions around our sos-dataexchange project and my proposal in this thread, although we have not been able to write a single line of code for that project.

I disagree with @10Dev that implementing DataBus as a separate project would lead to serious performance issues, but I agree that expanding DataBus to a more comprehensive project can be unwise given the intrinsic complexity of the whole polyglot business. On the other hand, if DataBus is to be designed to allow kernel-level access, there will inevitably be some language-specific bindings to a DataBus protocol, and sos-dataexchange can be shamelessly implemented as a standalone version of JupyterLab DataBus protocol and its language bindings. 😄

10Dev · 2018-11-23T09:25:08Z

FWIW, my words were "might introduce serious performance issues" i.e. keep an eye on that when designing it...

But it does make me wonder. Perhaps the base platform for Notebook type systems should be Native C++ and GPU as a performant base to host everything else...

In a way that is what Apache Arrow has done with separate polyglot implementations of their platform.

In any case when designing the DataBus, it wouldn't hurt to imagine as a design exercise a future Notebook containing different Cells with Python, Java, Scala, Julia, R, C++, C#, Go, Rust, JavaScript, TypeScript, F# and Lua which covers 99% of ML platforms.

If one creates a matrix of that polyglot versus Arrow and gRPC and Protobuf, there are interesting gaps...

BoPeng · 2018-11-28T18:02:23Z

During a developer meeting today, it was clarified that this project will focus on the JLab frontend and visualization of data, not on data processing and data exchange among kernels. The data exchange project from the SoS camp will be a separate project (which might be renamed to DataBus :-).

https://github.com/jpmorganchase/perspective was mentioned during the discussion.

psychemedia · 2018-12-04T01:37:46Z

I was looking at a couple of extensions today that I could see subscribing to / consuming / producing data objects:

ipypivot, a wrapper for a pivottable widget that allows direct manipulation and reshaping of a pandas dataframe;
dual canvas, which allows manipulations on one canvas element to be saved as a snapshot onto a paired canvas element and be made available as an image from that snapshot.

What struck me about the ipypivot widget in particular was that it could be used to carry out transformations to the contents of a dataframe within the widget and then return an appropriately reshaped dataframe to the notebook kernel namespace. What concerns me about that is the loss of reproducibility. If I directly manipulate a data object, how do I replay that? (I think there is a workaround with ipypivot: set up the pivot table to perform the transformation you want then play the data transformation through that.)

What would be nicer would be if the ipypivot widget were to export a set of pandas statements that implement the transformation applied by the pivot table. A user could then visually and directly engage with a dataframe in a pivot table scratchpad and export the corresponding code capable of effecting the same transformation back into the notebook. (i.e. the widget would act as a code generator rather than an object transformer).

But how would that work in a databus sense? I can imagine how a pivot table could subscribe to a data frame object, and then return an updated dataframe object transformed through direct manipulation back onto the databus. But could/should it also (or instead?) be able to pass back a set of programme statements that effect the same change?

I.e. rather than subscribe to df and return df', could it subscribe to df and return a set of commands implementing f(df) such that f(df) == df' ?
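
One way to picture "return f rather than df'" is for the consumer to record declarative steps, which can then be both replayed on the data and rendered as code. The sketch below is only an illustration of that idea: the `Step` type and the `applySteps` and `toPandas` helpers are hypothetical, cover just two operations, and are not how ipypivot works.

```typescript
// Hypothetical sketch: a consumer records declarative steps instead of
// mutating the dataset, so the same steps can be replayed or exported as code.
type Row = Record<string, string | number>;

type Step =
  | { op: "filter"; column: string; equals: string | number }
  | { op: "select"; columns: string[] };

/** Apply one recorded step to in-memory rows. */
function applyStep(rows: Row[], step: Step): Row[] {
  if (step.op === "filter") {
    const { column, equals } = step;
    return rows.filter((row) => row[column] === equals);
  }
  const { columns } = step;
  return rows.map((row) => {
    const picked: Row = {};
    columns.forEach((c) => {
      picked[c] = row[c];
    });
    return picked;
  });
}

/** Replay all steps in order: this is f, so applySteps(df, steps) gives df'. */
function applySteps(rows: Row[], steps: Step[]): Row[] {
  let current = rows;
  for (const step of steps) {
    current = applyStep(current, step);
  }
  return current;
}

/** Render the same steps as pandas statements for a notebook cell. */
function toPandas(steps: Step[], variable = "df"): string {
  return steps
    .map((step) => {
      if (step.op === "filter") {
        const column = JSON.stringify(step.column);
        const value = JSON.stringify(step.equals);
        return `${variable} = ${variable}[${variable}[${column}] == ${value}]`;
      }
      return `${variable} = ${variable}[${JSON.stringify(step.columns)}]`;
    })
    .join("\n");
}

// Usage: the steps a user might have performed through direct manipulation.
const recorded: Step[] = [
  { op: "filter", column: "species", equals: "setosa" },
  { op: "select", columns: ["species", "petalLength"] },
];
const sample: Row[] = [
  { species: "setosa", petalLength: 1.4, petalWidth: 0.2 },
  { species: "virginica", petalLength: 5.5, petalWidth: 2.1 },
];
const replayed = applySteps(sample, recorded);
const generated = toPandas(recorded);
```

Under this approach the recorded steps, or the generated code, could travel on the bus alongside or instead of the transformed data.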

ellisonbg · 2018-12-04T01:45:56Z

There is strong interest from various people in having code-snippets attached to data sets in the data bus. The way we have been thinking about it is to put things like that into the metadata.

10Dev · 2018-12-04T14:58:34Z

If you step back and listen to what you are saying, "metadata" and "subscribe to data" you are talking about an Event System of some sort. Which might actually be a more appropriate starting point where whatever is in your mind when you think of "DataBus" becomes instead a transport negotiation in an Event Manager.

But, Apache Arrow appears to be the opposite of an abstraction layer since each language/runtime (#2815) accesses the identical data representation without deserialization/conversion/copy etc., and then there is data in GPU space to consider as well.

https://arrow.apache.org
https://github.com/apache/arrow

with Plasma!

"The Plasma store can assist with developing applications involving multiple processes that need to share data, which may reside in CPU or GPU memory. Computational processes live separately from the Plasma store, a third party daemon. The processes are able to access data managed by Plasma through zero-copy shared memory access, and so by employing the Arrow columnar format to encode structural information, can describe complex datasets and make them available with minimal serialization overhead. We wish to provide strong support for managing datasets used by multiple processes living on the CPU or GPU." https://ursalabs.org/tech/

https://incubator.apache.org/ip-clearance/arrow-plasma-object-store.html
(this process appears complete: https://github.com/apache/arrow/tree/master/cpp/src/plasma)

Feather: https://github.com/wesm/feather

In theory, "out of the box" Arrow would support a common data format for #2815 kernels in:
C++
C#
Go
Java
JavaScript
Matlab
Python
R
Ruby
Rust

Other projects can implement Arrow:

Julia: https://github.com/ExpandingMan/Arrow.jl
TypeScript: https://github.com/graphistry/arrow

Looking at a Pub/Sub model seems unwieldy the instant you take the Polyglot into account. Arrow is a new accomplishment in Xplat efficiency I think.

FWIW, some Pub/Sub thingies to assist design imagination:

https://pulsar.apache.org
https://github.com/apache/pulsar

https://nats.io
https://github.com/nats-io

ellisonbg · 2018-12-06T17:29:59Z

I have started to work on an initial implementation of the data bus. In that process I run starting to run into some challenging design questions around metadata. I will try to summarize those here:

First, many dataset providers in Jupiter lab will not have a persistent handle on the data sets. An example is a dataframe that comes from a notebook.

Second, in these situations is also very likely that the data set will not come with any metadata. In other words, a lot of primitive data set formats we are interested in do not have any built in metadata capabilities.

Third, if a mime type is tied to a dataset, metadata pair it makes it very difficult to have a given data set format that has different metadata schemas attached to it by different providers. For example I may be registering CSV files with the data bus with a very simple metadata schema, someone else may be registering CSV files with a much more complex metadata schema. In the current design those would be treated as two different mime types.

I see two ways out of this dilemma. One, we could attempt to design a universal metadata schema that would apply to all MIME types. With some of the work that other organizations have done on dataset schemas, this might be possible. At the same time, the promise of a “universal metadata schema” may end in failure. Two, we could have different MIME type identifiers for the data and the metadata, and then have a provider register a pair of those. Then consumers could work with the dataset even if they don't understand the metadata.
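
To make the second option concrete, here is a minimal TypeScript sketch of registering a (data MIME type, metadata MIME type) pair. Everything in it is an assumption made for illustration: the names `IDataSetEntry` and `DataBusSketch`, the method names, and the `application/x.simple-meta+json` / `application/x.rich-meta+json` identifiers are hypothetical and are not taken from the prototype.

```typescript
// Hypothetical sketch: none of these names come from the actual prototype.
interface IDataSetEntry<T = unknown, M = unknown> {
  dataMimeType: string;       // identifies the data format only
  data: T;
  metadataMimeType?: string;  // identifies the metadata schema, if any
  metadata?: M;
}

class DataBusSketch {
  private entries: IDataSetEntry[] = [];

  register(entry: IDataSetEntry): void {
    this.entries.push(entry);
  }

  // Consumers that only care about the data match on the data MIME type,
  // so they still see datasets whose metadata schema they don't understand.
  filterByData(dataMimeType: string): IDataSetEntry[] {
    return this.entries.filter(e => e.dataMimeType === dataMimeType);
  }

  // Consumers that understand a particular metadata schema can narrow further.
  filterByDataAndMetadata(
    dataMimeType: string,
    metadataMimeType: string
  ): IDataSetEntry[] {
    return this.filterByData(dataMimeType).filter(
      e => e.metadataMimeType === metadataMimeType
    );
  }
}

const sketchBus = new DataBusSketch();

// Two providers register CSV data with different (made-up) metadata schemas.
sketchBus.register({
  dataMimeType: "text/csv",
  data: "a,b\n1,2",
  metadataMimeType: "application/x.simple-meta+json",
  metadata: { title: "demo" }
});
sketchBus.register({
  dataMimeType: "text/csv",
  data: "x,y\n3,4",
  metadataMimeType: "application/x.rich-meta+json",
  metadata: { title: "other", tags: ["example"] }
});
// A provider with no metadata at all, e.g. a dataframe exported from a notebook.
sketchBus.register({ dataMimeType: "text/csv", data: "p,q\n5,6" });

const allCsv = sketchBus.filterByData("text/csv");
const simpleOnly = sketchBus.filterByDataAndMetadata(
  "text/csv",
  "application/x.simple-meta+json"
);
```

In this sketch all three CSV datasets stay visible to a plain CSV consumer, while only one is visible to a consumer that asks for the simple metadata schema. Whether the real API should keep metadata on the dataset entry at all is a separate question.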

ellisonbg · 2018-12-06T17:33:24Z

Here is the related issue on metadata schema: #5733

RandomFractals · 2018-12-13T13:04:02Z

Since Vega Voyager, Arrow data, and Perspective were mentioned in this thread, I thought I'd share a link to the perspective widget, which does some data slicing, dicing, and visualizing similar to Voyager:

https://github.com/timkpaine/perspective-python

ellisonbg · 2018-12-14T22:20:26Z

Some things that are surfacing in ongoing discussions:

It will likely make sense to use a single metadata schema, implemented as a separate extension/service, and remove the metadata from the IDataSet API.
The IDataSet API will likely need an optional URI field, to enable other extensions (such as the metadata one) to hold serializable pointers to the datasets.
We are also talking about a separate API to register data converters (similar to the odo Python library). Because some converters may require going from out-of-memory to concrete, in-memory data, we may want to prioritize conversion paths that preserve that characteristic (to avoid copying).
Saw some nice design mockups from NYU collaborators with a basic dataset/metadata explorer in JupyterLab's left panel. We will try to get a screenshot posted here.
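
The converter idea above can be sketched as a small registry that searches for a conversion path and penalizes steps that pull data into memory. This is only an illustration under stated assumptions: `IConverter`, `ConverterRegistrySketch`, the `materializes` flag, the penalty value, and the `application/x.*` MIME types are all hypothetical, and the optional `uri` on `IDataSet` simply mirrors the field discussed above.

```typescript
// Hypothetical sketch: not the prototype's API.
interface IDataSet<T = unknown> {
  mimeType: string;
  data: T;
  uri?: string; // optional serializable pointer other extensions can hold
}

interface IConverter {
  name: string;
  from: string;
  to: string;
  materializes: boolean; // true if the step copies data into memory
  convert(data: unknown): unknown;
}

interface IPathState {
  cost: number;
  path: IConverter[];
}

class ConverterRegistrySketch {
  private converters: IConverter[] = [];

  register(converter: IConverter): void {
    this.converters.push(converter);
  }

  // Dijkstra-style search: each step costs 1, and a materializing step costs
  // an extra penalty, so paths that avoid copying win whenever one exists.
  findPath(from: string, to: string): IConverter[] | null {
    const COPY_PENALTY = 1000; // assumes real paths are far shorter than this
    const best: { [mime: string]: IPathState } = {};
    const done: { [mime: string]: boolean } = {};
    best[from] = { cost: 0, path: [] };

    while (true) {
      // Pick the cheapest MIME type that has not been finalized yet.
      let current: string | null = null;
      let currentCost = Infinity;
      for (const mime of Object.keys(best)) {
        const state = best[mime];
        if (state !== undefined && !done[mime] && state.cost < currentCost) {
          current = mime;
          currentCost = state.cost;
        }
      }
      if (current === null) {
        return null; // target is unreachable
      }
      const entry = best[current];
      if (entry === undefined) {
        return null;
      }
      if (current === to) {
        return entry.path;
      }
      done[current] = true;

      for (const c of this.converters) {
        if (c.from !== current) {
          continue;
        }
        const cost = entry.cost + 1 + (c.materializes ? COPY_PENALTY : 0);
        const known = best[c.to];
        if (known === undefined || cost < known.cost) {
          best[c.to] = { cost: cost, path: entry.path.concat([c]) };
        }
      }
    }
  }

  // Apply the chosen path; the result is a derived dataset without a URI.
  convert(dataset: IDataSet, to: string): IDataSet | null {
    const path = this.findPath(dataset.mimeType, to);
    if (path === null) {
      return null;
    }
    let data = dataset.data;
    for (const step of path) {
      data = step.convert(data);
    }
    return { mimeType: to, data: data };
  }
}

const sketchRegistry = new ConverterRegistrySketch();
const passThrough = (d: unknown): unknown => d; // placeholder conversion

sketchRegistry.register({ name: "parse-all", from: "text/csv", to: "application/x.rows+json", materializes: true, convert: passThrough });
sketchRegistry.register({ name: "rows-json-to-view", from: "application/x.rows+json", to: "application/x.table-view", materializes: false, convert: passThrough });
sketchRegistry.register({ name: "lazy-rows", from: "text/csv", to: "application/x.row-iterator", materializes: false, convert: passThrough });
sketchRegistry.register({ name: "rows-to-view", from: "application/x.row-iterator", to: "application/x.table-view", materializes: false, convert: passThrough });

const chosenPath = sketchRegistry.findPath("text/csv", "application/x.table-view");
const chosenNames = chosenPath === null ? [] : chosenPath.map(c => c.name);
```

Both routes from `text/csv` to the table view are two steps long, but the search picks the `lazy-rows` route because it avoids the materializing `parse-all` step. A flat penalty is just one possible weighting; a real registry might want richer cost information per converter.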

ellisonbg · 2018-12-19T17:31:32Z

Recording a question from weekly meeting: should datasets have a notion of being trusted (similar to mime bundles)?

timkpaine · 2019-04-22T14:33:56Z

@psychemedia I'm hoping to add reflection support to https://github.com/timkpaine/perspective-python so that you can configure your pivots and stuff in the front end and get the corresponding Python code. We're also going to enable editing, so you can modify stuff and reflect it back on the underlying dataframe/arrow/list/dict, similar to ipysheet (but with the added benefits of pivoting and streaming).

psychemedia · 2019-04-23T10:28:12Z

@timkpaine Ah, that's interesting (I'd posted a similar issue some time ago on another pivottable widget here). It seems that patterns for communicating back from widgets into code are starting to surface, which is hugely useful I think. Supporting code generation from py to html made it easier to create HTML pages (eg using things like IPyleaflet) and being able to use browser interactions to send code back to notebook makes for a whole new class of UIs / interactions. See also things like this for getting data out of Altair widget and this for getting data out of ipyleaflet widgetised maps.

jasongrout added this to the Future milestone Nov 13, 2018

10Dev mentioned this issue Nov 23, 2018

This repository is empty. Care to check out the GitHub Channel on YouTube while you wait? vatlab/SoS-DataBus#5

Open

ellisonbg added the enhancement label Nov 27, 2018

ellisonbg added the pkg:databus label Dec 6, 2018

ellisonbg mentioned this issue Dec 6, 2018

Dataset metadata schema #5733

Open

ellisonbg mentioned this issue Dec 6, 2018

[WIP] Add prototype of new data registry package #5734

Closed

4 tasks

ellisonbg changed the title ~~Proposal: JupyterLab Data Bus~~ Proposal: JupyterLab Data Registry Dec 28, 2018

This was referenced Dec 28, 2018

New repository creation request #5813

Closed

Data converters in the data registry #5831

Closed

saulshanabrook mentioned this issue Jan 11, 2019

[WIP] Databus #5857

Closed

55 tasks

rolyp mentioned this issue Dec 13, 2019

Jupyter Community Workshops: Proposal to host event Jan-Aug 2020 | Deadline 15 Dec the-turing-way/the-turing-way#770

Closed

2 tasks

blink1073 removed the pkg:dataregistry label Jan 3, 2020
