Data Governance at Criteo

<h1 id="23e3">It’s only the beginning</h1><p id="a377">This is a long journey and we have only started. We’ll share more about the progress and technical innovation. We're looking for more team members!</p><div id="c659" class="link-block"> <a href="https://www.criteo.com/careers/?team=Engineering%20-%20Software%20Development"> <div> <div> <h2>Careers | Criteo</h2> <div><h3>At Criteo, we're passionate about connecting more shoppers to the things they need and love. 
And the only way we can do…</h3></div> <div><p>www.criteo.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*ZIi9BD73JBQVsWl7)"></div> </div> </div> </a> </div><p id="9376">Thanks to Lucie Bailly, Muleine Lim, Gregory Letribot and Claire Moquet who were at the origin of the Data Governance in Criteo and for their help.</p></article></body>

Data Governance at Criteo

In an organisation that values innovation, flexibility and freedom, implementing a Data Governance culture sounds provocative, even countercultural. In the early phases of Criteo, the priority was growth and innovation: adding more services and technology. That is still the case, but now that we have reached a significant size, what was obvious at the beginning no longer is: getting the full picture of data transformations, knowing who owns what data, managing the growth of data… There is now a dedicated group at Criteo working on that challenge. Here’s how we understand Data Governance and what we do in that group.

Why do we need data governance?

Criteo gathers a lot of data from the Internet, typically more than 200 TB on a good day. That data goes through many different systems: Kafka, jobs based on Hive, Spark and Presto, and many different SQL and NoSQL databases for online and offline purposes. It serves both operational purposes (e.g. deciding whether to bid on a banner display) and reporting purposes. We run offline jobs on data hosted on HDFS to produce data for analytics, forecasting, billing, reporting and machine learning. That represents thousands of datasets and jobs, hundreds of internal users and thousands of external users… That complexity creates risks associated with data, as it does for companies of any size:

Risk on data operations: Criteo’s data lake is already a 230 PB infrastructure at the time of writing, for roughly 60 PB of useful (non-replicated) data. Without specific management, this volume grows exponentially, by more than 30% yearly, independently of the company’s turnover growth. Also, we can’t back up that much data, so we need to be picky about which data is required for business continuity.

Risk on innovation: developing new ways of doing business requires adding data, not as a necessary evil but as an essential part of that new business.

Risk on understanding: with that much data, it’s difficult to find the dataset you want without a data discovery tool and proper documentation.

Risk from quality: data might not represent what it’s supposed to represent, either because of data quality issues or because of improper data documentation. Also, we provide as much data as possible to our clients. What if they run their own reports and find different results? We need to ensure that the data is accurate and consistent.
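To put the data operations figures above in perspective, here is a minimal sketch of what "more than 30% yearly" means when compounded. It only uses the numbers quoted in this article (230 PB, 60 PB, 30%); the function names are hypothetical, and the 30% rate is treated as a lower bound rather than a measured value.

```python
import math

RAW_PB = 230.0        # total infrastructure size quoted in the article
USEFUL_PB = 60.0      # useful (non-replicated) data quoted in the article
YEARLY_GROWTH = 0.30  # "more than 30% yearly", taken here as a lower bound


def projected_size(start_pb: float, growth: float, years: int) -> float:
    """Size after `years` of compound growth at a constant yearly rate."""
    return start_pb * (1.0 + growth) ** years


def doubling_time(growth: float) -> float:
    """Years needed for the volume to double at a constant yearly rate."""
    return math.log(2.0) / math.log(1.0 + growth)


if __name__ == "__main__":
    # Ratio between the infrastructure size and the useful data it holds.
    print(f"raw-to-useful ratio: {RAW_PB / USEFUL_PB:.2f}x")
    print(f"doubling time: {doubling_time(YEARLY_GROWTH):.1f} years")
    for year in range(6):
        size = projected_size(RAW_PB, YEARLY_GROWTH, year)
        print(f"year {year}: {size:.0f} PB")
```

At a constant 30% a year, the volume doubles in roughly 2.6 years, which is why unmanaged growth quickly becomes an operational problem of its own.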

How do we handle such complexity? We thought it would be impossible to have one single person or group in charge of all data. That would create a bottleneck for new projects and businesses needing new data, and would require that group to know the whole data landscape of the company. That’s why we talk about governance: it means improving decision-making. We don’t manage the data, but we provide the tools, information and processes for people to make decisions.

It’s all about habits

In a perfect world, the data is clean, every file is properly named, in the right place, we know how the data flows and is transformed, old data is deleted… Yet, this is not a perfect world. Data Governance is about making the imperfection sustainable with a mix of processes, responsibilities and tools.

First of all, responsibilities and ownership: who is in charge of the data, and what does “being in charge” mean? That could cover specifying the data transformation, documenting the fields, approving requests for changes, answering questions about the business meaning of the data, supporting the computation jobs, etc. We’re talking about dozens of people with different domains and types of responsibilities. Organising that is one of our major challenges.

Second, we need to ensure that we use resources wisely, remove data that is no longer used, ensure the security of data and protection of privacy… That means processes, eew! Or does it? We prefer innovation over control, technology over paperwork, agility over conformism. There are some control points that we want to maintain, such as quota increases, but the main focus is on better monitoring tools that give insight into metadata and usage. That’s why the group develops tools to track all metadata: documentation, data lineage, ownership, availability, resource footprint, etc. All these tools help us manage the data lifecycle from creation to deletion.
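As an illustration of the kind of metadata listed above, here is a minimal sketch of a catalog record and two lifecycle checks. It is not Criteo's actual tooling: the `DatasetMetadata` class, its field names, the 180-day idle threshold and both helper functions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical catalog record covering the metadata named in the text."""
    name: str
    owner: Optional[str]             # ownership
    description: str                 # documentation
    upstream: List[str] = field(default_factory=list)  # lineage: input datasets
    size_tb: float = 0.0             # resource footprint
    last_read: Optional[date] = None  # usage, None if never read


def deletion_candidates(catalog: List[DatasetMetadata], today: date,
                        max_idle_days: int = 180) -> List[str]:
    """Names of datasets never read, or not read within `max_idle_days`."""
    cutoff = today - timedelta(days=max_idle_days)
    return [d.name for d in catalog
            if d.last_read is None or d.last_read < cutoff]


def missing_governance(catalog: List[DatasetMetadata]) -> List[str]:
    """Names of datasets lacking an owner or a non-empty description."""
    return [d.name for d in catalog
            if not d.owner or not d.description.strip()]
```

The point of the sketch is that once metadata is collected in one place, lifecycle questions such as "what can we delete?" or "who do we ask?" become simple queries rather than manual processes.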

And last but not least comes data quality. Each team is a link in the chain, but mostly focuses on how the data is transformed. It’s also the role of the data custodian to focus on data quality: again, to provide the tools and foster the effort.

Eating our own dog food

But above all, we produce data, focusing on two types of data: data that is used across the company and data that is exposed as the base for general reporting and BI purposes, or sent in customer reports. The rest is produced by other teams, and though we are ready to help them write the code when necessary, it’s their task to make things happen.

We are responsible for producing core data first to ensure that there is a consistent body of data. It also makes the group a user of its own tools, which helps both innovation and the quality of those tools. To track data production and foster good habits, we track the maturity of each dataset against a matrix.
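The article does not describe the matrix itself, so the following is only a sketch of how a maturity matrix could be scored. The criteria names, the three levels and the scoring rule are all hypothetical and chosen for illustration.

```python
from enum import IntEnum
from typing import Dict, List


class Level(IntEnum):
    """Hypothetical maturity levels for a single criterion."""
    MISSING = 0
    PARTIAL = 1
    DONE = 2


# Hypothetical criteria; the real matrix is not described in the article.
CRITERIA = ("documentation", "ownership", "lineage", "quality_checks")


def maturity_score(row: Dict[str, Level]) -> float:
    """Fraction of the maximum score, in [0, 1]; absent criteria count as MISSING."""
    total = sum(row.get(c, Level.MISSING) for c in CRITERIA)
    return total / (len(CRITERIA) * Level.DONE)


def weakest_criteria(row: Dict[str, Level]) -> List[str]:
    """Criteria sharing the lowest level, i.e. where to focus next."""
    lowest = min(row.get(c, Level.MISSING) for c in CRITERIA)
    return [c for c in CRITERIA if row.get(c, Level.MISSING) == lowest]
```

A single score per dataset makes progress easy to follow over time, while the list of weakest criteria tells the owning team which habit to work on next.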