An analytics system for the data of 30 million users
Yesterday I went to a seminar about AI and data, where I happened to meet a friend who was working on the same kind of things I had worked on before. After chatting, I realized that the analytics system I had built was a lot of work, not a joke. So I decided to write an article to save and share a little of my experience in this quite difficult field.
First Things First
Who am I, and who are you?
The writer of this article has only just gotten into data science. I have an excellent bachelor's degree in economics but no methodical training in information technology, so a lot of my approaches may seem a bit naive and non-standard.
This is not one of those stories from the technology giants about super-big data with billions of records, or about applying dominant technologies like Hadoop, Kafka… It is a share much closer to the beginning. You know, people rarely share much about this stage; they usually talk about success once things are already big, and that leaves us beginners just looking up and wishing.
So how do you, a reader standing on the outside, begin to picture an analytics system?
The First Failure
The initial requirement of the management team was simple: “Build a system to calculate sales for the products that the company is releasing”.
Very simple, right? Just access the payment system, pull the customer transaction log, and then run some simple sum and count operations on the DB to get the required result. Mission accomplished.
However, for me, a young man fresh out of economics school who had learned PHP for one month, even this was extremely challenging. Getting started with curl, batch processing, background jobs, webhooks… all sorts of things: building a MySQL database with the customers' transaction table and enough information about each transaction, such as who the payment was for, where it went, how much money, at what time… The dashboard was a sales chart by day, week and month.
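To make that first version concrete, here is a minimal sketch of what it could look like. The table and column names (transactions, amount, paid_at…) are illustrative, not my exact schema back then, and the PHP has been swapped for Python for readability:

```python
# Illustrative sketch of version 1: one transaction table in MySQL and a
# "dashboard" query that sums sales per day. Names are assumptions for the example.
import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS transactions (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    channel    VARCHAR(32),
    amount     DECIMAL(12, 2) NOT NULL,
    paid_at    DATETIME NOT NULL,
    KEY idx_paid_at (paid_at)
);
"""

DAILY_SALES = """
SELECT DATE(paid_at) AS day,
       COUNT(*)      AS orders,
       SUM(amount)   AS revenue
FROM transactions
WHERE paid_at >= %s AND paid_at < %s
GROUP BY DATE(paid_at)
ORDER BY day;
"""

conn = pymysql.connect(host="localhost", user="report", password="secret", database="payments")
with conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(DAILY_SALES, ("2018-01-01", "2018-02-01"))
    for day, orders, revenue in cur.fetchall():
        print(day, orders, revenue)
conn.close()
```

This kind of query is fine while the table is small; the trouble starts once a single range scan has to touch hundreds of thousands of rows.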
But the system only lasted 4 months before I had to design the next one to meet a new problem. Why?
MySQL is not a database well suited to analytics. Let's take a look at the comparison table of our requirements against MySQL's capabilities.
My initial MySQL-based solution worked fine until the data reached a few million rows. After that, queries over longer time ranges, touching tens or hundreds of thousands of rows or more, became extremely slow.
And that was only a very small part of the data. When I touched user-activity data (registrations, logins, etc.), it was a few dozen to a few hundred times bigger than the payment data. Well, IT WAS HOPELESS.
The Second Failure
This time I approached a NoSQL database: MongoDB. This Mongo database already held quite a large amount of user-activity data (a few hundred million records per collection) from the operating systems. So I decided to switch to it, first as a big data store, and then to approach the problem in a completely new direction: pre-calculation.
The design of this Mongo database is de-normalized, i.e. each document contains nearly all the information about a transaction or activity. The strengths of MongoDB are:
- High insert speed
- Optimized for batch processing (bulk writes)
- No need for a fixed schema when the data changes
- Easy map-reduce (with JavaScript syntax)
For these reasons, I chose Mongo as the next stop for my analytics system.
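A rough sketch of what the de-normalized documents and bulk inserts looked like; the collection and field names here are illustrative, not my real schema:

```python
# Sketch of de-normalized activity documents and a bulk write into MongoDB.
from datetime import datetime, timezone
from pymongo import MongoClient, InsertOne

client = MongoClient("mongodb://localhost:27017")
activities = client["analytics"]["user_activities"]

# Each document carries (nearly) everything about the event, so no joins are needed later.
events = [
    {
        "user_id": 1001,
        "type": "login",
        "platform": "android",
        "app_version": "2.3.1",
        "device": {"model": "SM-G960", "os": "8.0"},
        "occurred_at": datetime(2018, 5, 1, 9, 15, tzinfo=timezone.utc),
    },
    {
        "user_id": 1002,
        "type": "purchase",
        "platform": "ios",
        "product_id": 42,
        "amount": 4.99,
        "channel": "appstore",
        "occurred_at": datetime(2018, 5, 1, 9, 16, tzinfo=timezone.utc),
    },
]

# Unordered bulk writes let MongoDB apply the batch as fast as possible.
activities.bulk_write([InsertOne(doc) for doc in events], ordered=False)
```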
However, Mongo revealed a huge weakness for an analytics workload: aggregate and count operations are very slow on large data sets. A collection may hold hundreds of millions of records, and you can read individual records at high speed, but counting a subset of a few hundred thousand or a few million records from that collection is painful.
Therefore, I had to approach the calculation in a different way: pre-calculate (almost) all of the analytic data displayed on the dashboard. For example, every day a job runs to work out how many users were active and how much revenue came in, and saves those results into a separate collection. I will temporarily call each of these results a piece of data.
The parts in red are the big changes I made compared to the previous version.
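A simplified sketch of such a daily job; the collection names and the two metrics are illustrative:

```python
# Sketch of the daily pre-calculation job: aggregate yesterday's revenue and
# distinct buyers, then store the result as a "piece of data" in another collection.
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["analytics"]
activities = db["user_activities"]
daily_pieces = db["daily_metrics"]

day_start = datetime(2018, 5, 1, tzinfo=timezone.utc)
day_end = day_start + timedelta(days=1)

pipeline = [
    {"$match": {"occurred_at": {"$gte": day_start, "$lt": day_end}, "type": "purchase"}},
    {"$group": {
        "_id": None,
        "revenue": {"$sum": "$amount"},
        "buyers": {"$addToSet": "$user_id"},
    }},
]
result = next(activities.aggregate(pipeline), {"revenue": 0, "buyers": []})

daily_pieces.replace_one(
    {"date": day_start},                      # one piece of data per day
    {"date": day_start,
     "revenue": result["revenue"],
     "paying_users": len(result["buyers"])},
    upsert=True,
)
```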
This design lasted only about 2 months before it revealed many requirements the system could not meet:
- Scope of the data pieces: do you calculate values per minute, hour, day or week? Or all of them? Whichever granularity you choose affects how you query at larger scopes. For example, if you count by minute, then an hourly figure becomes the sum of 60 minute-level pieces, and so on.
- Pieces of data have many different dimensions, which makes them hard to calculate in advance. For example, you have to calculate hourly sales by channel, by product, by package, by payment method, and so on. Pre-calculating all of those combinations will drive you crazy.
- Many pieces of data are not additive. That is, you cannot add up small-scope pieces to form larger ones. For example, you cannot calculate DAU by summing hourly active users, because the same user can appear in several hours (see the sketch after this list). So those metrics have to be pre-calculated at every scope you need.
- No support for custom queries: pre-calculation removes most of the ability to build customized reports. If one day the marketing department wants statistics cut by a few too many dimensions, you will have to compute them by hand. It is impossible to anticipate all of these cases.
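A tiny example of why distinct counts are not additive, with made-up users:

```python
# Why pre-calculated pieces don't add up: distinct counts are not additive.
# Hourly active users for two hours (illustrative data).
hour_09 = {"alice", "bob", "carol"}
hour_10 = {"bob", "carol", "dave"}

naive_sum = len(hour_09) + len(hour_10)   # 6 -- double-counts bob and carol
true_dau = len(hour_09 | hour_10)         # 4 -- what DAU actually is

print(naive_sum, true_dau)
```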
So I kept growing up the hard way: I smashed the whole thing and started over with the 3rd architecture.
The 3rd, Not-Yet-Failed System
This time I had to learn, in a more detailed and methodical way, how to build the components of the system and what each component is for. Thanks to 6 months of pain from versions 1 and 2, I started over with quite a bit of experience.
Let me explain the model a little:
Data is collected by the Sync Worker through the ETL-1 process, where it is checked for correctness, data types, garbage fields… This pre-processing improves the accuracy of the data before it goes into centralized storage.
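A minimal sketch of what the ETL-1 step does, assuming a simple validator; the field whitelist and rules are illustrative, not my real ones:

```python
# Sketch of ETL-1 inside the Sync Worker: keep only known fields, coerce types,
# and reject records that fail basic checks.
from datetime import datetime
from typing import Optional

ALLOWED_FIELDS = {"user_id", "type", "amount", "platform", "occurred_at"}

def clean(raw: dict) -> Optional[dict]:
    record = {k: v for k, v in raw.items() if k in ALLOWED_FIELDS}  # drop garbage fields
    try:
        record["user_id"] = int(record["user_id"])
        record["amount"] = float(record.get("amount", 0))
        record["occurred_at"] = datetime.fromisoformat(record["occurred_at"])
    except (KeyError, ValueError, TypeError):
        return None                                                  # reject malformed rows
    return record

raw_events = [
    {"user_id": "1001", "type": "purchase", "amount": "4.99",
     "occurred_at": "2018-05-01T09:16:00", "debug_blob": "..."},
    {"user_id": "oops", "type": "login"},                            # fails validation
]
cleaned = [r for r in (clean(e) for e in raw_events) if r is not None]
print(cleaned)
```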
Data lake (MongoDB): the place where data from different sources is gathered. Since we need to update data frequently, insert large volumes at high speed, handle data with no clear structure or with complex structure, and store large quantities… MongoDB was a good choice and the best fit for my resources at the time.
Data in the data lake is enriched by the Data enrichment worker. This is the next processing step: it combines discrete pieces of information to produce richer, more multidimensional records. For example, a transaction gets supplemented with information about the device, the login session… taken from other sources.
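Roughly, an enrichment pass looks like this; the collections, fields and lookup logic here are all illustrative, not my exact implementation:

```python
# Sketch of the Data enrichment worker: attach device and session info to raw
# transactions by looking them up in other collections in the data lake.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["datalake"]

for tx in db["transactions"].find({"enriched": {"$ne": True}}).limit(1000):
    session = db["sessions"].find_one(
        {"user_id": tx["user_id"], "started_at": {"$lte": tx["occurred_at"]}},
        sort=[("started_at", -1)],           # latest session that started before the payment
    )
    device = db["devices"].find_one({"user_id": tx["user_id"]})
    db["transactions"].update_one(
        {"_id": tx["_id"]},
        {"$set": {
            "enriched": True,
            "session_id": session["_id"] if session else None,
            "device_model": device.get("model") if device else None,
        }},
    )
```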
Next, the Transform worker performs the ETL-2 process to push data into the data warehouse. In this step the data is shaped to fit the warehouse architecture: cleaning, data-type conversion, restructuring... This ETL-2 process relies heavily on PostgreSQL stored procedures for batch processing and implements the Slowly Changing Dimension pattern in the data warehouse.
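As an illustration of the Slowly Changing Dimension side of ETL-2, here is a sketch of a type-2 update for a customer dimension, written as plain SQL issued from Python. The table and column names are assumptions; in practice this logic lived inside stored procedures:

```python
# Sketch of an SCD type-2 step: when a customer's attributes change, close the
# current dimension row and open a new one.
import psycopg2

SCD2_SQL = """
-- Close current rows whose attributes no longer match the latest staging data.
UPDATE dim_customer d
SET    valid_to = now(), is_current = FALSE
FROM   staging_customer s
WHERE  d.customer_id = s.customer_id
  AND  d.is_current
  AND  (d.plan <> s.plan OR d.country <> s.country);

-- Open a new current row for customers with changed or brand-new attributes.
INSERT INTO dim_customer (customer_id, plan, country, valid_from, valid_to, is_current)
SELECT s.customer_id, s.plan, s.country, now(), NULL, TRUE
FROM   staging_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id AND d.is_current
WHERE  d.customer_id IS NULL;
"""

conn = psycopg2.connect("dbname=warehouse user=etl")
with conn, conn.cursor() as cur:
    cur.execute(SCD2_SQL)
```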
Data warehouse (PostgreSQL): the place for data with pre-defined needs. In other words, a lot of data stops at the data lake; only data for which we have a specific need or a clear structure gets pushed into the warehouse. PostgreSQL is a brilliant choice for an analytics database, supporting rich queries and complex data types. Materialized views, window functions and the hyperloglog extension are extremely valuable features for my analytics system.
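Two of those features in a quick sketch, using the illustrative fact table from earlier:

```python
# Sketch of two PostgreSQL features that carry a lot of weight in the warehouse:
# a materialized view for a heavy daily aggregate, and a window function for a
# running revenue total per channel.
import psycopg2

BUILD_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_daily_revenue AS
SELECT date_trunc('day', paid_at) AS day,
       channel,
       SUM(amount) AS revenue
FROM   fact_payment
GROUP BY 1, 2;

-- Refreshed by a scheduled job instead of recomputing on every dashboard hit.
REFRESH MATERIALIZED VIEW mv_daily_revenue;
"""

RUNNING_TOTAL = """
SELECT day,
       channel,
       revenue,
       SUM(revenue) OVER (PARTITION BY channel ORDER BY day) AS revenue_to_date
FROM   mv_daily_revenue
ORDER  BY channel, day;
"""

conn = psycopg2.connect("dbname=warehouse user=report")
with conn, conn.cursor() as cur:
    cur.execute(BUILD_VIEW)
    cur.execute(RUNNING_TOTAL)
    rows = cur.fetchall()
```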
Some data in the warehouse is pre-calculated by the Pre-calculate worker. This strategy is applied to complex calculations like retention, churn, etc., where querying and calculating on the fly would take too long and burden the system.
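For example, a rough sketch of how such a job could compute D7 retention per signup cohort and store it in a small summary table; the table names and columns are illustrative:

```python
# Sketch of the Pre-calculate worker: D7 retention per signup cohort, written
# into a summary table so the dashboard never recomputes it.
import psycopg2

D7_RETENTION = """
INSERT INTO agg_d7_retention (cohort_day, cohort_size, retained, rate)
SELECT u.signup_day,
       COUNT(DISTINCT u.user_id)                   AS cohort_size,
       COUNT(DISTINCT a.user_id)                   AS retained,
       COUNT(DISTINCT a.user_id)::numeric
         / NULLIF(COUNT(DISTINCT u.user_id), 0)    AS rate
FROM   dim_user u
LEFT JOIN fact_activity a
       ON a.user_id = u.user_id
      AND a.activity_day = u.signup_day + 7        -- signup_day is a DATE here
WHERE  u.signup_day = %s
GROUP BY u.signup_day
ON CONFLICT (cohort_day) DO UPDATE                 -- assumes a unique index on cohort_day
SET cohort_size = EXCLUDED.cohort_size,
    retained    = EXCLUDED.retained,
    rate        = EXCLUDED.rate;
"""

conn = psycopg2.connect("dbname=warehouse user=etl")
with conn, conn.cursor() as cur:
    cur.execute(D7_RETENTION, ("2018-05-01",))
```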
Some lessons learned
Duplicated data: data in the warehouse is organized into data marts, i.e. clusters of data structured for a specific problem, such as revenue calculation or customer care. So instead of a few tables that hold all the data about a given topic, you end up with a dozen tables holding the same data designed in different ways. This thinking is the opposite of building a normal application, where you minimize duplicated data. With a data warehouse, duplicated data is a must in order to optimize performance and fit the intended use.
Normalized or denormalized: a question you constantly face with a data warehouse, and there is no formula that fits every case. It all comes down to what your performance tests show. The same goes for star schema versus snowflake versus all-in-one.
Index or no index: the next thing to pay attention to in data analytics, because your queries often process millions of records, and there are cases where a full table scan is faster than using an index. At times like these I really love PostgreSQL's EXPLAIN: it details every step the database took to produce the result. You can refer to the visual explain tool here.
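For example, EXPLAIN ANALYZE shows directly whether the planner picked the index or decided a sequential scan is cheaper for a wide time range (the table name is the illustrative one from earlier):

```python
# Sketch: inspect the actual plan PostgreSQL chose for a heavy aggregate query.
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=report")
with conn, conn.cursor() as cur:
    cur.execute("""
        EXPLAIN ANALYZE
        SELECT channel, SUM(amount)
        FROM   fact_payment
        WHERE  paid_at >= '2018-01-01' AND paid_at < '2018-07-01'
        GROUP BY channel;
    """)
    for (line,) in cur.fetchall():   # each plan line comes back as a row of text
        print(line)
```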
The self-explained ID: a strategy for building fact and dimension tables around IDs that carry meaning by themselves. That is, instead of inventing a meaningless number that forces a join between the fact and dimension tables, you use something that can be read and computed on directly. A very concrete example is the ID of the date dimension: many operations on such an ID are much faster than joining to the date dimension, although sometimes the opposite holds. It all depends on how you optimize. Make friends with EXPLAIN as much as possible.
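For instance, with a YYYYMMDD date key on the fact table, many filters and groupings never need to touch dim_date at all (column and table names are illustrative):

```python
# Sketch of the "self-explained ID" idea: the fact table stores date_key as an
# integer like 20180501, so filtering and bucketing is plain arithmetic.
import psycopg2

MONTHLY_REVENUE = """
SELECT date_key / 100 AS year_month,            -- 20180501 / 100 -> 201805
       SUM(amount)    AS revenue
FROM   fact_payment
WHERE  date_key BETWEEN 20180101 AND 20181231   -- whole year, no join to dim_date
GROUP BY 1
ORDER BY 1;
"""

conn = psycopg2.connect("dbname=warehouse user=report")
with conn, conn.cursor() as cur:
    cur.execute(MONTHLY_REVENUE)
    print(cur.fetchall())
```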
Big jobs or small jobs: this is what you face when loading large data into a data lake or data warehouse. Some people set up one big job and run it every half hour, but I recommend tearing your jobs apart as much as possible to take advantage of parallel processing. Tearing here means splitting the job both vertically and horizontally. Vertical means breaking a 3-step job into three 1-step jobs; horizontal means breaking a job that processes 10,000 records into 10 jobs that each process 1,000. But remember to split only as far as still lets you use the DB's batch processing.
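A rough sketch of the horizontal split, assuming a simple chunked runner (the chunk size and executor are illustrative choices):

```python
# Sketch of splitting one big load into small parallel chunks: each worker takes
# a contiguous id range and processes it as one batch.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1000

def process_chunk(start_id: int, end_id: int) -> int:
    # In the real worker this would read rows in [start_id, end_id) from the
    # data lake, transform them, and bulk-insert them into the warehouse.
    return end_id - start_id

def run(first_id: int, last_id: int) -> int:
    ranges = [(s, min(s + CHUNK, last_id)) for s in range(first_id, last_id, CHUNK)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        done = pool.map(lambda r: process_chunk(*r), ranges)
    return sum(done)

print(run(0, 10_000))   # ten jobs of 1000 instead of one job of 10000
```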
Fast or correct: the trade-off between the correctness of a result and how quickly it comes back. Believe me, sometimes what the data user wants is not a perfectly accurate number but a fast one. Getting familiar with approximate algorithms such as hyperloglog is a stepping stone toward a system that is fast enough.
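A small sketch of the idea using the third-party datasketch package purely as an illustration; in the warehouse itself the PostgreSQL hyperloglog extension plays the same role:

```python
# Sketch of approximate distinct counting with HyperLogLog: a near-exact unique
# count from a tiny, fixed amount of memory.
from datasketch import HyperLogLog

hll = HyperLogLog(p=12)                  # roughly 1.6% typical error at p=12
for user_id in range(1_000_000):
    hll.update(str(user_id).encode("utf8"))

print(int(hll.count()))                  # close to 1,000,000
```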
Conclusion
So I have finished telling you about the analytics system architecture I built. So far the system handles the data of more than 30 million users, fed from nearly 10 different operating systems, with millions of records generated per day and a dashboard with roughly 30 types of reporting metrics for operations, marketing, accounting, the BOD… Data latency is under 10 minutes and most queries finish in under 3 seconds.
If you have any questions (that I can answer), feel free to comment. Thank you for reading and following along.