This article was first published on Ubex - Medium
The Ubex project continues to develop its platform, adding the new modules it needs to function properly.
As described in our previous posts, the JS counter hosted on the webmaster’s site sends data about visits and visitors to a fault-tolerant cluster of Pixel Collector servers hosted in the AWS cloud.
On the Pixel Collector servers, this data is then enriched with service headers and passed on to the Kinesis Data Streams streaming service.
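The enrichment step can be pictured with a minimal sketch. The article does not list the actual service headers, so the field names here (`event_id`, `received_at`, `client_ip`, `user_agent`), the partition field, and both helper functions are assumptions made purely for illustration.

```python
import json
import time
import uuid


def enrich_event(event, client_ip, user_agent, received_at=None):
    """Wrap a raw counter event with service headers (hypothetical field names)."""
    return {
        "payload": event,
        "headers": {
            "event_id": str(uuid.uuid4()),
            "received_at": time.time() if received_at is None else received_at,
            "client_ip": client_ip,
            "user_agent": user_agent,
        },
    }


def to_stream_record(enriched, partition_field="visitor_id"):
    """Serialize the enriched event into the bytes and partition key a stream expects."""
    return {
        "Data": json.dumps(enriched, sort_keys=True).encode("utf-8"),
        "PartitionKey": str(enriched["payload"][partition_field]),
    }


# Example: one visit event as the JS counter might report it (made-up values).
raw = {"visitor_id": "v-1", "page": "/pricing", "type": "visit"}
enriched = enrich_event(raw, "203.0.113.7", "Mozilla/5.0", received_at=1700000000.0)
record = to_stream_record(enriched)
```

With boto3, a record shaped like this could be handed to the Kinesis client's `put_record` call together with a stream name; whether the Pixel Collector does it this way is not stated in the article.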
One of our tasks is to keep advertising campaign statistics (remaining budget, impressions, clicks, etc.) as up to date as possible. To solve this problem, we have implemented the Lambda architecture using Apache Spark.
In fact, we have two data streams, the operational data and the statistical data.
The first half of our architecture uses Spark Streaming to read incoming data from Kinesis Data Streams and update our internal statistics in near real time. This is possible thanks to the extremely low latency between data arriving in Kinesis Data Streams and becoming readable, and because we read the data as a stream rather than processing it in batches.
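To illustrate the idea of updating statistics record by record rather than in batches, here is a small sketch in plain Python. It does not use Spark or Kinesis; the generator stands in for the stream, and the event fields (`campaign_id`, `type`) and the `CampaignStats` class are hypothetical.

```python
from collections import defaultdict

# Maps an event type to the counter it increments (assumed event types).
FIELD_BY_TYPE = {"impression": "impressions", "click": "clicks"}


class CampaignStats:
    """Running per-campaign counters, updated one event at a time."""

    def __init__(self):
        self.counters = defaultdict(lambda: {"impressions": 0, "clicks": 0})

    def apply(self, event):
        field = FIELD_BY_TYPE.get(event["type"])
        if field is not None:  # ignore event types we do not count
            self.counters[event["campaign_id"]][field] += 1


def incoming_events():
    """Stand-in for a stream consumer: yields events as they arrive."""
    yield {"campaign_id": "c1", "type": "impression"}
    yield {"campaign_id": "c1", "type": "click"}
    yield {"campaign_id": "c2", "type": "impression"}
    yield {"campaign_id": "c1", "type": "impression"}


stats = CampaignStats()
for event in incoming_events():
    # Each record updates the statistics immediately; nothing waits for a batch.
    stats.apply(event)
```

The counters are usable after every single event, which is the property the streaming half relies on.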
The second half launches scheduled Spark SQL jobs that re-process the accumulated raw data, perform additional cleaning and processing on it, and update the data in the database.
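A scheduled re-processing job of this kind might look roughly like the sketch below. To keep it self-contained, SQLite from the Python standard library stands in for both Spark SQL and the target database. The table names, columns, and the cleaning rules (dropping duplicate deliveries and records without a campaign) are assumptions, since the article does not describe them.

```python
import sqlite3

# Accumulated raw events: (event_id, campaign_id, type). Made-up sample data.
raw_events = [
    ("e1", "c1", "impression"),
    ("e1", "c1", "impression"),  # duplicate delivery of the same event
    ("e2", "c1", "click"),
    ("e3", None, "impression"),  # malformed record: no campaign
    ("e4", "c2", "impression"),
]


def rebuild_stats(conn, events):
    """Recompute campaign statistics from scratch out of the raw events."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events "
        "(event_id TEXT, campaign_id TEXT, type TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS campaign_stats "
        "(campaign_id TEXT PRIMARY KEY, impressions INTEGER, clicks INTEGER)"
    )
    conn.execute("DELETE FROM raw_events")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)
    # Replace the previous results: the batch output overrides earlier numbers.
    conn.execute("DELETE FROM campaign_stats")
    conn.execute(
        """
        INSERT INTO campaign_stats (campaign_id, impressions, clicks)
        SELECT campaign_id,
               SUM(CASE WHEN type = 'impression' THEN 1 ELSE 0 END),
               SUM(CASE WHEN type = 'click' THEN 1 ELSE 0 END)
        FROM (SELECT DISTINCT event_id, campaign_id, type
              FROM raw_events
              WHERE campaign_id IS NOT NULL)
        GROUP BY campaign_id
        """
    )
    conn.commit()
    rows = conn.execute(
        "SELECT campaign_id, impressions, clicks FROM campaign_stats"
    )
    return {campaign: (impressions, clicks) for campaign, impressions, clicks in rows}


conn = sqlite3.connect(":memory:")
batch_stats = rebuild_stats(conn, raw_events)
```

Because the job rebuilds the table from the raw data each time, any inaccuracy in previously stored numbers is corrected on the next run.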
A natural question is why we need this complexity in the first place. The first stream carries the operational data. We use it to count ad impressions and debit the advertiser’s account in real time. Otherwise, ad impressions would be artificially inflated and we would end up in debt to our RTB partners.
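The real-time debit can be sketched as follows. This is an illustration under assumptions, not the platform's actual billing logic: the `AdvertiserAccount` class, the fixed price per impression, and the rule of refusing an impression once the budget cannot cover it are all hypothetical.

```python
from decimal import Decimal


class AdvertiserAccount:
    """Tracks the remaining budget and the impressions actually paid for."""

    def __init__(self, budget):
        self.budget = Decimal(budget)
        self.impressions = 0

    def charge_impression(self, price):
        """Debit one impression; return False if the budget cannot cover it."""
        price = Decimal(price)
        if price > self.budget:
            return False  # do not count an impression nobody pays for
        self.budget -= price
        self.impressions += 1
        return True


# A budget of 0.05 covers two impressions at 0.02 each; later attempts are refused.
account = AdvertiserAccount("0.05")
served = [account.charge_impression("0.02") for _ in range(4)]
```

`Decimal` is used instead of floats so that repeated small debits do not accumulate rounding errors.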
The second stream ...