ElasticSearch as the primary database

Summary

Using ElasticSearch as a primary database is generally not recommended due to risks of data loss, the need for pre-determined index sizes, and potential performance issues, but it can be suitable in specific scenarios involving event sourcing, infrequent writes, or as part of a data pipeline with an upstream mastering system.

Abstract

The article advises against using ElasticSearch as the sole database due to several limitations. Data loss is a significant concern, particularly with large volumes of data, despite improvements in resiliency. ElasticSearch requires pre-determined index sizes and schema changes necessitate re-indexing, which can be cumbersome when data grows or evolves. Performance may also suffer if all queries are served from ElasticSearch, especially with large data volumes and without optimizing for query patterns. However, there are cases where ElasticSearch can be effectively used as a database: when paired with a message queue or event streaming system like Kafka for event sourcing, when dealing with relatively static content with infrequent writes, or when it serves as a sink in a data pipeline with data mastering handled by another system.

Opinions

The author believes that using ElasticSearch as a primary database without a backing database is not advisable due to the risk of data loss and the complexity of data migration as data evolves.
Performance issues are anticipated when using ElasticSearch for all data queries, especially without considering query patterns.
ElasticSearch is considered suitable for specific use cases, such as event sourcing with a buffer system, handling static content with infrequent updates, or as a component in a data pipeline where another system is the source of truth.
The author suggests that ElasticSearch's requirement for pre-determined index sizes and the need for re-indexing upon schema changes are significant drawbacks.

ElasticSearch as the primary database

The short answer: it is most likely not a good idea to use ElasticSearch as a primary store without some kind of backing database, for the following reasons:

The most critical reason is the risk of data loss when dealing with large volumes of data. Resiliency remains an active area of improvement in ElasticSearch. Read more: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

ElasticSearch index sizing has to be decided up front, since the number of primary shards is set when an index is created. Schema/mapping changes, such as changing the type of an existing field, require re-indexing. If the data grows or evolves beyond what the original sharding or mapping strategy can handle, it has to be migrated into new indexes. The application then has to serve incoming traffic and run migrations at the same time. By analogy, typical databases do not require you to estimate data sizes per table.

Performance becomes a problem if all data queries have to be served out of ElasticSearch, especially when the volume of data is huge and everything is indexed without specific attention to the query patterns in use.
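
To make the migration cost in the second point concrete, here is a minimal sketch of the requests involved when an index has to be recreated with a different shard count or mapping. The index names (orders_v1, orders_v2), the alias (orders), and the field names are hypothetical, and the code only builds the JSON bodies for the create-index, _reindex, and _aliases REST endpoints; it does not contact a cluster.

```python
import json


def create_index_body(primary_shards, properties):
    """Body for PUT /<index>: the primary shard count is chosen at creation."""
    return {
        "settings": {"number_of_shards": primary_shards},
        "mappings": {"properties": properties},
    }


def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copies documents from the old index to the new one."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: moves the alias to the new index in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def migration_plan(alias, old_index, new_index, primary_shards, properties):
    """Ordered (method, path, body) steps for a re-index migration."""
    return [
        ("PUT", f"/{new_index}", create_index_body(primary_shards, properties)),
        ("POST", "/_reindex", reindex_body(old_index, new_index)),
        ("POST", "/_aliases", alias_swap_body(alias, old_index, new_index)),
    ]


if __name__ == "__main__":
    # Hypothetical migration: more shards and a keyword mapping for "status".
    plan = migration_plan(
        alias="orders",
        old_index="orders_v1",
        new_index="orders_v2",
        primary_shards=6,
        properties={"status": {"type": "keyword"}},
    )
    for method, path, body in plan:
        print(method, path, json.dumps(body))
```

Even in this simplified form, writes that arrive while the copy is running still have to be handled by the application, which is the extra operational burden described above.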

Now, is it still possible to use ElasticSearch as a database?

Yes, in the following cases:

Event sourcing on the database end. That is, a message queue or event streaming system such as Kafka sits in front of ElasticSearch indexing. This approach buffers the requests while ElasticSearch is performing cluster updates or leader elections, periods during which writes might otherwise be lost.

The writes are controlled and infrequent. If you have relatively static content but would like the data to be searchable and amenable to analytics, that makes a good case.

The typical and most widely used scenario is ElasticSearch as a sink in a data pipeline, with another system or database mastering the data. In case of data loss, the data can be replayed from upstream.
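
The buffering and replay ideas in the first and last cases can be illustrated with a small, self-contained sketch. Everything here is a stand-in: EventLog plays the role of a Kafka topic partition, SearchIndex plays the role of an ElasticSearch index, and Indexer is a hypothetical consumer. None of it uses a real client library. The point is only to show that the consumer advances its offset after a successful write, so events stay in the log while the index is unavailable, and a fresh index can be rebuilt from offset 0.

```python
class EventLog:
    """In-memory stand-in for a Kafka topic partition: append-only, with offsets."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset):
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self._events))[offset:]


class SearchIndex:
    """In-memory stand-in for an ElasticSearch index, keyed by document id."""

    def __init__(self):
        self.docs = {}
        self.available = True  # flip to False to simulate an outage

    def index(self, doc_id, doc):
        if not self.available:
            raise ConnectionError("index unavailable")
        self.docs[doc_id] = doc  # same id overwrites, so replays are idempotent


class Indexer:
    """Consumes the log and advances its offset only after a successful write."""

    def __init__(self, log, index):
        self.log = log
        self.index = index
        self.committed = 0  # next offset to process

    def drain(self):
        for offset, event in self.log.read_from(self.committed):
            try:
                self.index.index(event["id"], event["doc"])
            except ConnectionError:
                return False  # stop here; unprocessed events remain in the log
            self.committed = offset + 1
        return True

    def replay(self, fresh_index):
        """Rebuild a new index from the start of the log."""
        self.index = fresh_index
        self.committed = 0
        return self.drain()


if __name__ == "__main__":
    log, index = EventLog(), SearchIndex()
    indexer = Indexer(log, index)

    log.append({"id": "a", "doc": {"title": "first"}})
    indexer.drain()

    index.available = False  # simulated outage
    log.append({"id": "b", "doc": {"title": "second"}})
    print("drained during outage:", indexer.drain())

    index.available = True
    print("drained after recovery:", indexer.drain())

    rebuilt = SearchIndex()
    indexer.replay(rebuilt)
    print("rebuilt matches:", rebuilt.docs == index.docs)
```

Keying documents by id is a design choice in this sketch: it makes re-processing the same event harmless, which is what allows a replay from upstream without producing duplicates.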