Comparison: twitter-crawler in Akka vs in RxJS

It's been two years since I wrote my first twitter-crawler, at that time using the Akka framework in the Scala programming language. Last month I wrote another (better, cleaner) twitter-crawler, using a microservice architecture (complete with message queuing and caching), this time using the RxJS library in JavaScript (see the blog post about Glazplatova).

... and I felt the need to articulate the difference between the two..., or maybe the reason why. Ok, spoiler alert: actually my motive wasn't purely technical. I can't really tell whether RxJS is better than Akka for this particular use case (or vice versa) -- more on that later. I was just having difficulty understanding the complex Scala/Akka code I had written more than two years ago.

It was becoming hard for me to add new features to that code (yeah, I broke the "don't code today what you can't debug tomorrow" principle). Besides, these days -- for about a year now -- I've been making heavy use of RxJS in NodeJS scripts, mostly for ETL purposes, where I have to do transformations over data that comes in as streams.

Precisely because of this "freshness of RxJS in my head", I now feel the need to refresh my knowledge of Scala/Akka. Making this video serves that purpose, and I hope the audience can benefit from watching it, somehow :) As usual, I made this video without a script, so my apologies for getting off track at several points. Without further ado, here's the video:

Now, as I got to the end of the recording, curiosity crawled in. Primarily because even after all that explaining and demo-ing, I don't think I managed to make a point. What exactly is the difference? Or maybe a better question would be: when to use Akka? When to use Rx? Are they equivalent? Are they alternatives to each other (competing)? I googled it, and found this: What are the main differences between Akka and RxJava?

Well..., that was it: if you care about distributing the crawling activities across several nodes, then use Akka, as it handles the distribution of actors across several nodes automatically & transparently.

In the light of that..., was writing the crawler in RxJS a bad decision (because there is no automatic replication & distribution)? Not really. I mean, in my case, there is no need to have multiple instances of the crawler. Why? Because of the rate limit in the Twitter API (we can only make a certain number of requests, during a certain period of time, to a certain endpoint). If you have more than one instance of the crawler, all of them using the same Twitter account, and each one running on a separate node..., it would be very difficult to control the (total amount of) requests-per-minute, and you would hit the ceiling very (or too) quickly.... But... if each one of the crawlers used a distinct Twitter account, then it would be a different story. In that case, though, I would simply run another instance of the crawler (another NodeJS process, possibly on a separate node), which would run independently from the first instance.
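To make that rate-limit constraint concrete, here is a minimal sketch (plain JavaScript; the class name and the 15-requests-per-15-minutes numbers are my own illustrative assumptions, not Twitter's exact quota) of the kind of sliding-window limiter a single crawler process can enforce locally. The point: this bookkeeping lives in one process's memory, which is exactly what breaks down once several nodes share one account.

```javascript
// Sketch of a sliding-window rate limiter (hypothetical limits:
// 15 requests per 15-minute window). A single crawler process can
// enforce this locally; several independent processes sharing one
// Twitter account would need a *shared* counter instead -- much harder.
class RateLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.timestamps = []; // times of recent requests
  }

  // Returns true if a request is allowed right now, and records it.
  tryAcquire(now = Date.now()) {
    // Drop timestamps that have fallen out of the window.
    this.timestamps = this.timestamps.filter(t => now - t < this.windowMs);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(now);
    return true;
  }
}

const limiter = new RateLimiter(15, 15 * 60 * 1000);
let allowed = 0;
for (let i = 0; i < 20; i++) {
  if (limiter.tryAcquire()) allowed++;
}
console.log(allowed); // 15 -- the other 5 attempts are rejected
```

With one process, this is trivial; with N processes on N nodes hammering the same account, each one only sees its own `timestamps`, so the total across the cluster can silently exceed the quota.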

Besides, the crawling (talking to the Twitter API) is only part of the bigger picture. There are other activities, as I showed in my video, such as: resolving the short URLs of the embedded articles, extracting metadata from articles, analyzing the content of articles, storing in different types of database (mongo, sqlserver, neo4j), plus whatever else you can come up with, such as sentiment analysis in "real-time". None of them suffers the same constraint as the crawler (that rate limiting). Besides, the tasks they carry out are stateless..., so any of them can easily be scaled, simply by spawning another container instance of the service and making it listen to the same work queue as the existing service instance. This sketch of the architecture of the crawler (RxJS version) can clarify what I just stated. It is arguably easier to understand and explain than my version of the twitter-crawler in Akka. I insist, that's not Akka's fault. It's mea culpa... probably, if I had known more about best practices in Akka, had better tooling, etc. etc.... probably.
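The "spawn another instance listening to the same work queue" idea is the classic competing-consumers pattern. Here's a toy in-process sketch of it (plain JavaScript; no real RabbitMQ involved, and all the names are made up for illustration) -- in a real deployment the queue would be a RabbitMQ queue and each worker a separate container, but the shape is the same: stateless workers draining one shared queue, and "scaling" is just attaching one more worker.

```javascript
// Toy sketch of the competing-consumers pattern (hypothetical,
// in-process; a real deployment would use RabbitMQ and separate
// containers). Stateless workers pull from one shared queue, and each
// message is processed exactly once, by whichever worker grabs it.
function makeQueue(items) {
  const q = [...items];
  return { pull: () => q.shift() }; // undefined when empty
}

const queue = makeQueue(['tweet-1', 'tweet-2', 'tweet-3', 'tweet-4']);
const workers = ['worker-a', 'worker-b']; // "scaling out" = add a name here
const results = [];

let i = 0, msg;
while ((msg = queue.pull()) !== undefined) {
  // Round-robin stands in for real concurrency across nodes.
  const worker = workers[i++ % workers.length];
  // Stateless work: e.g. resolve a short URL, extract metadata, ...
  results.push({ worker, processed: msg.toUpperCase() });
}

console.log(results.length); // 4 -- every message handled exactly once
```

Because the workers keep no state between messages, nothing special is needed to add or remove one; the queue itself is the only coordination point. That's why everything downstream of the crawler scales easily, while the crawler itself (pinned to one account's quota) does not.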

Now..., for the sake of discussion, let's assume I do care about replicating the crawler. The answer would be "use Akka". But now the question: why not Spark? It's also based on Scala, it has some list-transformation operators, and it also handles distributing the processing of items in the list (much like several actors in Akka, running on different nodes, working in parallel, emptying their respective mailboxes, which hold partitions of the complete stream, by means of a router).

Well..., this twitter crawler is not so much about transforming things (as opposed to a job in Spark, which is about a data-processing pipeline). It's more of a daemon process: a process that interacts with the outside world (pulling data from Twitter in this case), listens for external events (notifications from Redis in this case), and adjusts its interaction with the outside world accordingly, round the clock. Code in Spark is a __job__ (one-off); this twitter crawler is a __daemon__. The data transformation / processing itself resides in other microservice(s)..., the sentiment analysis for example..., which receives the stream of tweets emitted by the crawler, channeled through a message queue (I use RabbitMQ here). For that one, indeed, we can consider using Spark (spark-streaming), especially when you think of using Spark ML libraries for big data.
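The job-vs-daemon contrast can be sketched in a few lines (plain JavaScript; the event names and the state shape are hypothetical, just to illustrate the distinction, not my actual crawler code):

```javascript
// A Spark-style *job*: run a pipeline over a finite dataset, then exit.
const batch = ['a', 'b', 'c'].map(x => x.toUpperCase()); // done once

// A *daemon*: keeps a bit of state and keeps reacting to outside events
// (here, Redis-like notifications), never "finishing".
// All event names below are made up for illustration.
function makeDaemonState() {
  return { keywords: ['akka'], paused: false };
}

function onEvent(state, event) {
  if (event.type === 'add-keyword') state.keywords.push(event.keyword);
  if (event.type === 'rate-limited') state.paused = true;   // back off
  if (event.type === 'window-reset') state.paused = false;  // resume
  return state;
}

let state = makeDaemonState();
state = onEvent(state, { type: 'add-keyword', keyword: 'rxjs' });
state = onEvent(state, { type: 'rate-limited' });
console.log(state.keywords, state.paused); // [ 'akka', 'rxjs' ] true
```

The `batch` line terminates; the `onEvent` loop, in the real crawler, is driven forever by incoming notifications. That open-endedness is what makes it a daemon rather than a job, regardless of which framework runs it.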

Now the question: why not Apache Flink (instead of spark-streaming)? :D Ooo... kay..., it's getting late now, maybe that's a topic for another blogpost. See ya!

Oh..., and here are some nice links about RxJS. Might be handy for future reference: