at Fastenal India
It has been more than six months since I joined the infrastructure team at Fastenal. I have had the chance to talk with many senior engineers, read lots of documentation, and look into many aspects of industry-level work. Along the way I've come across many of the tools and technologies the company uses to build and support its infrastructure. In this post I'd like to share my technical journey so far.
Talking about my day-to-day work, my team and I build data pipelines to stream data between endpoints seamlessly. It's okay if that didn't make sense yet; it is complicated. To put it in the simplest of words: we build the lines you see in system design diagrams.
For reference, the above represents a diagram of the most basic system. Our team is responsible for building, scaling, and maintaining each of the lines you see in the diagram. During my tenure at Fastenal India, I've contributed to building data pipelines that keep our databases in sync, i.e., the dotted lines in the picture above.
Demystifying real-time data streaming:
Real-time data streaming isn't as complex to understand as it may sound. Let me assist you. A couple of decades back, we used to get our news from newspapers. The news agency would collect news all day, compile it, and finally print it in the newspaper we all read over our morning coffee. Well, that scenario has completely changed. We now receive news on the go, through Google News, news outlets, and even Instagram, as and when it happens. We do not wait an entire day to get all the news at once; we get real-time news from around the world. This is an apt description of real-time data streaming.
There are ample mechanisms and great tools available to transport data between systems and processes, but we use the most talked-about tool for data streaming: Kafka.
A little glimpse of Kafka:
“Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications”
A few noteworthy points from the definition above:
- Open-source software
- Distributed system
- High-performance system
The Kafka documentation is remarkably comprehensive, and Kafka's use cases are genuinely fascinating. The introduction and evolution of this data streaming mechanism has significantly changed the way large companies ingest huge volumes of data and deliver it to data consumers.
What is Kafka?
Long answer short: it's a publish-subscribe model.
Publishers produce data, and subscribers consume data from the publishers they are interested in. Publishers produce data in all shapes and sizes, in every imaginable format; subscribers consume it in the specific format their needs require.
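To make the idea concrete, here is a toy publish-subscribe sketch in plain Python (no Kafka involved; the names are made up for illustration). Notice that subscribers register directly with the publisher, so the two sides know about each other — this is the tightly coupled version of the pattern:

```python
# Toy publish-subscribe: subscribers register callbacks directly with a
# publisher, so publisher and subscribers are coupled to each other.
class Publisher:
    def __init__(self):
        self.subscribers = []          # direct references to consumers

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)          # publisher pushes to each subscriber

received = []
news_feed = Publisher()
news_feed.subscribe(lambda msg: received.append(msg.upper()))
news_feed.publish("breaking news")
print(received)  # ['BREAKING NEWS']
```

Each subscriber here transforms the message into the shape it wants (upper-casing stands in for "format conversion"), but the publisher must hold a reference to every subscriber, which is exactly the coupling Kafka removes.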
How Kafka does it differently:
By decoupling data producers and consumers.
Publishers publish data to a centralized system, and subscribers consume data from that centralized store. Publishers push data in all shapes, sizes, and formats to the system, and subscribers consume it in whatever specific format they require; the format conversion is handled by the centralized layer. This is called decoupling of producers and consumers. In this architecture, the producers and consumers are totally unaware of each other, which has many advantages, as its use cases have proven.
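The decoupling can be sketched with a toy in-memory "broker" (again, an illustration of the idea, not Kafka itself — real Kafka adds partitioning, persistence, replication, and much more). Producers only know a topic name, and consumers only poll a topic; neither side ever references the other:

```python
from collections import defaultdict, deque

# Toy broker: a named queue per topic sits between producers and
# consumers, so the two sides never hold references to each other.
class Broker:
    def __init__(self):
        self.topics = defaultdict(deque)   # topic name -> queue of messages

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None

broker = Broker()
# Producer side: push to a named topic and move on.
broker.produce("orders", {"id": 1, "item": "bolt"})
# Consumer side: poll the same topic later, at its own pace.
print(broker.consume("orders"))  # {'id': 1, 'item': 'bolt'}
```

Because the producer returns as soon as the message is in the broker, either side can be scaled, restarted, or replaced without the other noticing — the core advantage of the decoupled design.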
How do we use it?
As I have briefly mentioned above, we use the Kafka framework to keep our databases in sync. More specifically, we work with sink connectors and Kafka Streams for this purpose.
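Conceptually, a sink connector reads change events from a topic and applies them to a target database, keeping it in sync with the source. Here is a rough sketch of that loop in Python against an in-memory SQLite table — the event shape, table, and names are all made up for illustration (in production this role is played by Kafka Connect sink connectors, not hand-written loops):

```python
import sqlite3

# Simulated stream of change events; in a real pipeline these would be
# consumed from a Kafka topic by a sink connector.
events = [
    {"op": "insert", "id": 1, "name": "hex bolt"},
    {"op": "insert", "id": 2, "name": "washer"},
    {"op": "delete", "id": 1},
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")

# The "sink" loop: apply each change event to the target table.
for event in events:
    if event["op"] == "insert":
        db.execute("INSERT OR REPLACE INTO products VALUES (?, ?)",
                   (event["id"], event["name"]))
    elif event["op"] == "delete":
        db.execute("DELETE FROM products WHERE id = ?", (event["id"],))

print(db.execute("SELECT id, name FROM products").fetchall())  # [(2, 'washer')]
```

After replaying the stream, the target table reflects the net effect of all changes — the same end state the source database is in, which is what "keeping databases in sync" means here.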
That's a lot to take in for a brief introduction to Kafka and my job profile. The Kafka docs would be your best friend for more in-depth technicalities. I would love to hear about your use cases for Kafka if you use it in your work, or feel free to reach out with any questions you have.