Summary

With an exponentially increasing volume of data, both management and discovery had become increasingly complex. Not only was the data growing, but the same data needed to exist in multiple formats and data stores for different use cases.

Obstacle

The existing user experience could not scale with these new needs: it neither surfaced relevant metadata nor allowed easy management of the same data spread across different data stores.

This meant that data consumers might find 10 versions of the same data in a search experience across many platforms, with no affordances indicating that the data was related.

Data producers also had to manage metadata and schemas for each one of those 10 datasets individually, despite that information being the same across all of them.

Action

In order to solve these problems of scale, I designed a new user experience that allowed data users to discover the same data across data stores, prioritizing the most important metadata first.

Producers of the data could manage metadata and schemas across all instances of their data in a single unified experience.

Results

These UX changes gave users much simpler discovery methods, focusing their search on the right data and reducing noise and complexity.

Producers could now manage ONE set of schema and metadata across many platforms and easily version their data.

This experience can also be scaled: more data stores, additional metadata, data insights, and usage metrics can be added as needs grow.

My Role & Responsibilities

  • Conceptual Designs
  • Design Prototypes
  • UX Design
  • Collaboration with Tech & Product

Big data, big challenges

Big data is key to success in an age where real-time, AI-driven experiences rely on rich customer, system and vendor data. A large company like this has hundreds of thousands of datasets in various data stores.

As you can imagine, tracking, tagging and owning all this data is an enormous undertaking. Not to mention that data users need to sift through it all to find the data they need for any given project.

The company needed to track and tag data sets across many different platforms and formats, including:

Platforms

  • S3 Objects
  • Snowflake Tables and Views
  • Real-time Streaming Data
  • Relational Databases
  • NoSQL Databases

For each dataset on each platform, the company has to track information such as:

  • Description
  • Data Schema
  • Ownership
  • Platform and location
  • Data Sensitivity (privacy and security)
  • Data Quality metrics
  • And other details
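A minimal sketch of what such a metadata record could look like as a data structure. The field names here are illustrative assumptions, not the company's actual catalog model:

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- field names are assumptions,
# not the company's actual catalog model.
@dataclass
class DatasetRecord:
    name: str
    description: str
    schema: dict                # column name -> type
    owner: str
    platform: str               # e.g. "S3", "Snowflake"
    location: str               # path, table name, or stream topic
    sensitivity: str            # privacy/security classification
    quality_metrics: dict = field(default_factory=dict)

# Example registration for one dataset on one platform
record = DatasetRecord(
    name="orders",
    description="Customer order events",
    schema={"order_id": "string", "amount": "decimal"},
    owner="payments-team",
    platform="Snowflake",
    location="ANALYTICS.ORDERS",
    sensitivity="confidential",
)
```

Note that in the existing system a record like this had to be created and maintained separately for every platform the data lived on.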

In order to meet those goals, the company developed an internal data management, discovery and publishing platform where all data in the company has to be registered and published.

Prioritizing the work

This effort was instigated by both myself and tech partners, based on our knowledge of the data management ecosystem and its limits. We knew that critical changes needed to be made to how data was managed and grouped, and to the experience that supported it.

Challenges

However, despite its importance, this work wasn’t prioritized, and our efforts limped along on an experience and infrastructure that made it difficult to innovate.

Based on those needs and realities, I knew we needed a vision to show our stakeholders and leadership what was possible.

Selling It

I took it upon myself to design the concepts for the new experience for data discovery and management. I had a few tech partners and data SMEs as sounding boards. I shopped around the initial concepts until there was a groundswell of excitement and support. This elevated the work and got it prioritized.

Research

I had a lot of secondary research to pull in when working through this project. This included how we defined our user groups and their corresponding jobs.

The research included interviews with data users (e.g. data scientists) and producers (tech teams, data stewards), as well as data platform stakeholders.

Through those efforts we defined a high level user journey to help guide the work.

Data User Journey

Data user journey diagram

User Jobs

As a design team, through our user research we found the following primary jobs that our users have to undertake:

  • Register
  • Publish
  • Manage
  • Find
  • Evaluate
  • Use

User Personas

This data platform has been designed for two main user personas:

Data Producers

They need to register, publish, and manage the data they create.

Example Users: Data Engineers, Tech Teams, Data Stewards

Data Consumers

They need to find, understand and use the data that helps them answer a business question or solve for a business need.

Example Users: Business Analysts, Data Analysts, Data Scientists

Problem Space

Data owners have to manage a large number of datasets; often it is the same data spread across many platforms and formats.

“I have the same data in 5 different locations and formats, plus I have to create new versions of my data occasionally. I have to manage all of them individually, which is time consuming and complex.”

Schema many-to-many relationship diagram showing data spread across platforms

Example of data that shares a schema across many platforms. This creates a lot of complexity for data management and discovery.

Data Management Issues

Same Data, Different Places

Data that shares the same data schema, but lives on different platforms or in different formats, needed to be registered and managed separately, creating an unnecessary burden on Data Producers.

No Versioning

Because our system didn’t have methods for versioning data, each change to the data simply resulted in a brand new dataset. This caused complexity for management, discovery and usability as the data evolved.

Complex Publishing

Data producers had to go through manual processes to publish their data to new platforms or locations. This bespoke approach made it hard to scale and simplify the publishing experience for users.


Data Discovery Issues

Because variants of the same data live on different platforms and in different formats, users have to search for and find all the entries, then check the metadata to see whether it is the data they need.

Expansive Data

Because of the company’s large data ecosystem, data users have a very challenging time finding the data they need to do their job (data analysis, model development, program dashboards, etc.).

Duplicative Data

When searching, data users often find the same dataset multiple times, with the same description and metadata but on different platforms or as different versions of the data.

Disconnected Versioning

Because each version of the dataset was a new dataset in the system, there was no easy way for our users to know if more historical data existed for their needs.

All of this contributes to cognitive load, overhead and noise, which raises the questions:

Which dataset is the right one to use? Is there other historical data available? Which platform do I use? How can I trust what I’m seeing here?

Data Details UI

The existing data details experience did not surface enough information to satisfy users’ needs. It was also designed inflexibly, making it difficult to add new information.

Existing Data Details Experience

Problems with Data Details Experience

Is there more data?

It also didn’t account for the same data being populated on different platforms, or for different versions of that data.

Burying the lead

The most critical information for users, the data schema, was hidden in the navigation, while other, less important information was prioritized.

Ghost town

Much of the metadata displayed to users was not populated because it was not available. This created a “ghost town” of data information for each dataset.

Confusion and lack of information around the datasets contributed to eroding trust in the data ecosystem.


Solution

Create a new data details experience which meets the needs of the data producers and data consumers.

Goals for Data Consumer Experience

The main goals for the consumer experience were:

  • Simplify data discovery with one page for all platforms and formats
  • Prioritize the most important information
  • Give users information that helps them trust the data
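The "one page for all platforms" goal can be sketched in search-backend terms as grouping results that share a schema identifier, so each logical dataset appears once with all its platforms attached. Field names like `schema_id` are assumptions for illustration, not the platform's actual API:

```python
from collections import defaultdict

# Hypothetical search results: the same logical dataset appears
# once per platform. "schema_id" is an assumed field.
results = [
    {"name": "orders", "schema_id": "sch-42", "platform": "S3"},
    {"name": "orders", "schema_id": "sch-42", "platform": "Snowflake"},
    {"name": "customers", "schema_id": "sch-77", "platform": "Snowflake"},
]

def group_by_schema(results):
    """Collapse duplicate entries into one result per shared schema."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["schema_id"]].append(r)
    # One search result per logical dataset, listing all its platforms.
    return [
        {"name": rs[0]["name"], "platforms": [r["platform"] for r in rs]}
        for rs in grouped.values()
    ]

page = group_by_schema(results)
# Two logical datasets instead of three raw entries; "orders"
# now carries both of its platforms.
```

The same grouping key is what lets the details page offer a platform/version switcher instead of scattering duplicates through search results.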

Features for Data Consumers

  1. Basic Metadata
  2. Switching Platforms & Versions
  3. Data Views
  4. Data Access
  5. Data Trust Indicators
    • Privacy Tagging
    • Freshness (last load)
    • Queries and Top Users (Social Proof)
  6. Schema and Data Sample
  7. Usage and Insights
  8. About
  9. Related Data
Data Discovery Version 2 Consumer Interface

Goals for Data Producer Experience

  • Manage all schemas and metadata in one place; changes propagate to all attached datasets, improving efficiency dramatically.
  • Automate versioning when the producer makes a breaking change to the metadata or data schema.
  • Manage data quality rules for all instances as well.
  • Publish to new data platforms easily.
Schema one-to-many relationship diagram showing unified management

With the new design and supporting tech, producers manage one schema and one set of metadata, and changes propagate across all instances of that data.
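The automated-versioning goal above can be sketched as a simple rule: removing a column or changing its type is a breaking change and bumps the major version, while additive changes bump the minor version. This is a hypothetical illustration of the idea, not the platform's actual versioning rules:

```python
def is_breaking_change(old_schema, new_schema):
    """A change is breaking if a column was removed or its type changed.
    (Simplified rule for illustration; a real catalog has more cases.)"""
    for col, col_type in old_schema.items():
        if col not in new_schema or new_schema[col] != col_type:
            return True
    return False

def next_version(version, old_schema, new_schema):
    """Bump (major, minor) automatically based on the schema diff."""
    major, minor = version
    if is_breaking_change(old_schema, new_schema):
        return (major + 1, 0)   # breaking: new major version
    return (major, minor + 1)   # additive: minor bump only

old = {"order_id": "string", "amount": "decimal"}
new = {"order_id": "string", "amount": "float"}   # type changed: breaking
bumped = next_version((1, 3), old, new)           # -> (2, 0)
```

Because the schema is managed once, a single bump like this applies to every published instance of the data instead of producers versioning each platform's copy by hand.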

Features for Data Producers

Data producers get a new unified management experience, pulling together all platforms into one place.

  1. Embedded Data Management Section (only visible by producers)
  2. Easily publish your data to new platforms
  3. Schema, metadata (about) and data quality rules are shared by all the published instances of the data.
Data Discovery Version 2 Producer Interface

Constraints and dependencies

Engineering

I had to work closely with engineering teams because these UX changes depended on fundamental changes that needed to be made in how the data catalog and related systems operated.

Future-Proofing

This solution also needed to be extensible so that it could grow with the needs of the platform.

  • The IA allows for additional sections to be added
  • Ownership of specific areas allows different teams to be responsible for different parts of the experience

Outcomes

“I love this! When can we start using it?”

After showing this to users and stakeholders there was a lot of excitement about the new experience. It was clear that this was a big improvement over the existing experience.

Lessons Learned

Too many cooks

  • As this internal project took off and more and more product, design, and tech stakeholders became involved, it became difficult to meet all the technical and product requirements.

  • There were strong disagreements between a few tech teams that made the infrastructure difficult to deliver on. Ultimately we were able to do this, but it was a painful process.

In a complex ecosystem, don’t try to solve all the problems

  • Focus on the most critical issues; don’t let a couple of users or feedback requests sway the entire product direction.

  • Create a framework around moving to the target state. Many ideas had to be scaled back and cut in order to deliver, and the target state of these concepts is likely lost to time.

  • So much was being done that a giant tech backlog formed. Other teams wanting to build on this experience got frustrated waiting, and some went their own direction as opposed to building off what we created.

Policies and processes and systemic constraints can dictate what we can build

  • If only some parts of the data ecosystem can support a certain feature, should it be a feature?

Design with real data

  • Our data users got caught up on mockups that weren’t close to reality. While a design may look great as a mockup, the devil is in the details, and users focus on those details.