Bienvenida & Raka

Comparison: a twitter-crawler in Akka vs. in RxJS

It's been two years since I wrote my first twitter-crawler, at that time using the Akka framework in the Scala programming language. Last month I wrote another (better, cleaner) twitter-crawler, using a microservice architecture (complete with message queuing and caching). This time I used the RxJS library in JavaScript (see the blogpost about Glazplatova).

... and I felt the need to articulate the difference between the two..., or maybe the reason why. Ok, spoiler alert: actually my motive wasn't purely technical. I can't really tell whether RxJS is better than Akka for this particular use case (or vice versa) -- more on that later. I was just having some difficulty understanding the complex Scala/Akka code I wrote more than two years ago.

It was becoming hard for me to add new features to that code (yeah, I broke the "don't code today what you can't debug tomorrow" principle). Besides, these days -- for about a year now -- I've been making heavy use of RxJS in NodeJS scripts, mostly for ETL purposes, where I have to do transformations over data that comes in as streams.
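To give a feel of the kind of streaming ETL I mean, here is a minimal, dependency-free sketch (plain Node generators instead of RxJS, so it stays self-contained; with RxJS it would be a `pipe` of `map`/`filter` operators over an Observable). The record shape and stages are illustrative, not the actual scripts:

```javascript
// Extract: records flow in one at a time (stand-in for a file or API stream),
// so we never hold the whole dataset in memory.
function* extract(rows) {
  for (const row of rows) yield row;
}

// Transform: drop malformed records, normalize the rest.
function* transform(records) {
  for (const r of records) {
    if (!r.text) continue;                      // malformed, skip it
    yield { id: r.id, text: r.text.trim().toLowerCase() };
  }
}

// Load: stand-in for a database insert.
function load(records, sink) {
  for (const r of records) sink.push(r);
}

// Usage: wire the stages together.
const sink = [];
load(transform(extract([
  { id: 1, text: '  Hello ' },
  { id: 2 },                                    // no text, will be dropped
  { id: 3, text: 'World' },
])), sink);
// sink: [{ id: 1, text: 'hello' }, { id: 3, text: 'world' }]
```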

Precisely because of this "freshness of RxJS in my head", I now feel the need to refresh my knowledge of Scala/Akka. Making this video serves that purpose for me, and I hope the audience can benefit from watching it, somehow :) As usual, I made this video without a script. So my apologies for getting off track at several points. Without further ado, here's the video:

Now, as I got to the end of the recording, curiosity crept in. Primarily because, even after all that explaining and demo-ing, I don't think I managed to make a point. What exactly is the difference? Or maybe a better question would be: when to use Akka? When to use Rx? Are they equivalent? Are they alternatives to each other (competing)? I googled it, and found this: What are the main differences between Akka and RxJava?

Well..., that was it: if you care about distributing the crawling activities across several nodes, then use Akka, as it handles the distribution of actors across several nodes automatically & transparently.

In the light of that..., was writing the crawler in RxJS a bad decision (because there is no automatic replication & distribution)? Not really. I mean, in my case, there is no need to have multiple instances of the crawler. Why? Because of the rate limit in the Twitter API (we can only make a certain number of requests, during a certain period of time, to a certain endpoint). If you have more than one instance of the crawler, all of them using the same twitter account, and each one of them running on a separate node..., it would be very difficult to control the (total amount of) requests-per-minute, and you would hit the roof very (or too) quickly.... But... if each crawler used a distinct twitter account, then it would be a different story. In that case, I would simply run another instance of the crawler (another NodeJS process, possibly on a separate node), which would run independently from the first instance.
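The rate-limit constraint above can be sketched as a sliding-window limiter in front of the API calls. This is a hypothetical illustration (the `RateLimiter` class and the numbers are mine; Twitter's actual limits are per endpoint, per 15-minute window). Note the limiter's state is local -- which is exactly why two independent crawler instances sharing one account would together blow past the limit:

```javascript
// Sliding-window rate limiter; the clock is injected so the
// logic is testable without real timers.
class RateLimiter {
  constructor(maxRequests, windowMs, now = () => Date.now()) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.now = now;
    this.timestamps = [];                       // times of recent requests
  }

  tryAcquire() {
    const t = this.now();
    // forget requests that have fallen out of the window
    this.timestamps = this.timestamps.filter(ts => t - ts < this.windowMs);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(t);
    return true;
  }
}

// Usage with a fake clock: at most 3 requests per 1000ms window.
let fakeTime = 0;
const limiter = new RateLimiter(3, 1000, () => fakeTime);
const results = [];
for (let i = 0; i < 5; i++) results.push(limiter.tryAcquire());
fakeTime = 1500;                                // the window has passed
results.push(limiter.tryAcquire());
// results: [true, true, true, false, false, true]
```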

Besides, the crawling (talking to the Twitter API) is only part of the bigger picture. There are other activities, as I showed in my video, such as: resolving the short URLs of the embedded articles, extracting metadata from articles, analyzing the content of articles, storing into different types of databases (mongo, sqlserver, neo4j), plus whatever you can come up with, such as sentiment analysis in "real-time". None of them suffers the same constraint as the crawler (that rate-limiting). Besides, the tasks they carry out are stateless..., so any of them can easily be scaled, simply by spawning another container instance of the service and making it listen to the same work-queue as the existing service instance. This sketch of the architecture of the crawler (RxJS version) may clarify what I just stated. It is arguably easier to understand and explain than my version of the twitter-crawler in Akka. I insist, it's not Akka's fault. It's mea culpa... probably if I had known better about best practices in Akka, better tooling, etc etc.... probably.
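The "just spawn another instance" point boils down to the competing-consumers pattern. In the real system the queue is RabbitMQ; here a plain in-memory queue stands in (a sketch, not the actual code), just to show why adding an instance is enough when the tasks are stateless:

```javascript
// Each item is pulled by exactly one consumer -- that is the whole trick.
class WorkQueue {
  constructor() { this.items = []; }
  push(item) { this.items.push(item); }
  pull() { return this.items.shift(); }
}

// A stateless worker: everything it needs is in the item itself
// (e.g. resolve a short URL, extract metadata from an article, ...).
function makeWorker(name, queue, results) {
  return () => {
    const item = queue.pull();
    if (item === undefined) return false;       // queue drained
    results.push({ worker: name, item });
    return true;
  };
}

// Usage: two instances drain the same queue; no coordination needed.
const queue = new WorkQueue();
['t1', 't2', 't3', 't4'].forEach(t => queue.push(t));
const results = [];
const workerA = makeWorker('A', queue, results);
const workerB = makeWorker('B', queue, results);
let more = true;
while (more) {
  const a = workerA();
  const b = workerB();
  more = a || b;
}
// every tweet is processed exactly once, split across A and B
```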

Now..., for the sake of discussion, let's assume I do care about replicating the crawler. The answer would be "use Akka". But now the question: why not Spark? It's also based on Scala, it has some list-transformation operators, and it also handles distributing the processing of items in the list (much like several actors in Akka, running on different nodes, working in parallel, emptying away their respective mailboxes, which are partitions of the complete stream, by means of a router).

Well..., this twitter crawler is not so much about transforming things (as opposed to a job in Spark, which is about a data-processing pipeline). It's more of a daemon process. A process that interacts with the outside world (pulling data from twitter in this case), listens to external events (notifications from Redis in this case), and adjusts its interaction with the outside world accordingly, round the clock. Code in Spark is a __job__ (one-off); this twitter crawler is a __daemon__. The data transformation / processing itself resides in other microservice(s)..., the sentiment analysis for example..., which receives the stream of tweets emitted by the crawler, channeled through a message queue (I use RabbitMQ here). For that one, indeed, we can consider using Spark, spark-streaming, especially when you think of using Spark ML libraries for big data.

Now the question: why not Apache Flink (instead of spark-streaming)? :D Ooo... kay..., it's getting late now, maybe that's a topic for another blogpost. See ya!

Oh..., and here are some nice links about RxJS. Might be handy for future reference:

Playing around with my sweet "Bienvenida"

She is Bienvenida, my sweet little Aztec horse (a mare), who just turned 5 years old.

I guess now I should start looking for a good horse so she can have a baby horse :)

Attn. Programmers: know some DevOps stuffs. It’s good for you

A somewhat lengthy video, one hour and a half. The original topic was DevOps for programmers; I wanted to make a point about why I think DevOps knowledge is important or beneficial to programmers (so yeah, learn container stuffs guys). It's like an expansion or continuation of the blogpost about my view on DevOps, posted here: "What should we expect from DevOps."

I used the Twitter crawler I made (Glazplatova) as a vehicle for this video. I also have a blogpost about that crawler, here: Introducing Glazplatova. Why that thing? Because it's quite a complex system, composed of several little microservices. Lots of different kinds of servers need to be brought up to bring this entire crawler system up. That's where knowledge of docker shines.
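To make "lots of different kinds of servers" concrete: this is roughly what a compose file for the crawler's backing services could look like. A hypothetical sketch only -- the image tags and ports are illustrative, not the actual project config -- but it shows why one `docker-compose up` beats installing each server by hand:

```yaml
# Backing services for the crawler system (illustrative versions/ports).
version: "3"
services:
  mongo:
    image: mongo:4
    ports: ["27017:27017"]
  rabbitmq:
    image: rabbitmq:3-management      # management UI on 15672
    ports: ["5672:5672", "15672:15672"]
  redis:
    image: redis:5
    ports: ["6379:6379"]
  neo4j:
    image: neo4j:3.5                  # browser on 7474, bolt on 7687
    ports: ["7474:7474", "7687:7687"]
```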

Attn. Programmers: know some DevOps stuffs. It's good for you. from Cokorda Raka Angga Jananuraga on Vimeo.

Along the way I also explain a little about the architecture of this crawler system; how I use messaging (rabbitmq) and caching (redis) for that. I also give a glimpse of reactive programming using RxJS (for processing the stream of tweets in a memory-efficient way, and elegantly..., while still being able to react to external events, like changes in the queries). I also happen to have a blogpost about RxJS, here: Introduction to reactive programming
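On the caching role: the pattern behind it is cache-aside, i.e. check the cache first and only do the expensive work on a miss. A minimal sketch, with a plain Map standing in for Redis and a hypothetical `loadFn` standing in for an expensive lookup (e.g. resolving a short URL):

```javascript
function makeCachedLookup(loadFn) {
  const cache = new Map();                      // stand-in for Redis
  let misses = 0;
  return {
    get(key) {
      if (cache.has(key)) return cache.get(key); // hit: skip the expensive call
      misses += 1;
      const value = loadFn(key);
      cache.set(key, value);                     // populate for next time
      return value;
    },
    misses: () => misses,
  };
}

// Usage: the second lookup of the same key never calls loadFn again.
const resolveUrl = makeCachedLookup(shortUrl => `https://example.com/${shortUrl}`);
const a = resolveUrl.get('abc');
const b = resolveUrl.get('abc');
// a === b, and loadFn ran only once
```

With Redis instead of a Map you also get expiry (TTL) and sharing of the cache across service instances, which a per-process Map can't give you.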

So yeah, without further ado, here's the video "Attn. Programmers: know some DevOps stuffs. It's good for you.". I hope you enjoy it and can take some value out of it. Please comment below or inbox me.

Oh..., and here's a little piece that I forgot to include in the above video, about the Neo4J (graph database) I use in the crawler:

Introducing Glazplatova 1.0

We delve into the architecture of Glazplatova in another video, here: Attn. Programmers: know some devops stuffs. It's good for you.

Introducing Glazplatova 1.0 (Глаз Платова 1.0), a twitter crawling-machine I wrote and dub an "adaptive-reactive crawler" (yea, jargon-ridden). But it's true: it's adaptive in the sense that its architecture allows plugged-in processors to change the queries in real-time (adapting to the latest findings in the data accumulated so far). Reactive in the sense that..., well, I use Reactive Programming techniques, which enable us to create some nifty scheduling, quite elegantly. I use RxJS. This video is just to give a feel of what it is (we are using it for our media-monitoring and social-network analysis). Later I'll try to find an opportunity to dive down into the architecture. Uses: Mongo, RabbitMQ, Redis, RxJS.

Introducing Glazplatova 1.0 (Глаз Платова 1.0) from Cokorda Raka Angga Jananuraga on Vimeo.

What should we expect from a DevOps?

I was wondering.

First of all, supposedly a DevOps is this magic person who resides in the Goldilocks zone :) I mean, an intersection between programmer (who knows the code of the system she's managing), QA (really necessary?), and system administrator (the biggest portion of her work would be dealing with shits like hadoop, microservices, databases, forward-proxies... shits like that).

I'm asking this question (or have been asking this question) because, until now, I'm basically all-the-things: UX expert, product designer, software architect, programmer, QA (barely, not much time left)..., and, yes, DevOps too. I decide all the mix of shits we have in the platform (hadoop, druid, mongo, nginx, postgres, spark, loopback), define their dockers, and am now even writing a web application to automate the creation of those dockers... you know, PaaS-style.

I also made the continuation of this blogpost, here: Attn. Programmers: know some devops stuffs. It's good for you.

Now, the troubleshooting. What happened today: the dashboard was not showing this day's numbers.

  1. Checked the front-end, all seems ok, no error in the browser.
  2. Checked the database, we have data arriving.
  3. Checked the batch process (that performs aggregation); it performs its job correctly: it pulls the data out of the database, transforms it, and sends it off to Hadoop (the file-system, not the map-reduce cluster).
  4. Checked Druid, it's up and running. But..., it turns out it fails when it tries to send cube-building tasks to Hadoop.
  5. Checked the connection between the Druid node and Hadoop; seems fine, I could ping both ways.
  6. Checked Druid's log again, in more detail, and it says "failed to communicate with Hadoop service on port 8032".
  7. Checked inside Hadoop's node whether there is a running service that binds to 8032 (netstat -plnt). Nope, it's down. 8032 is YARN.
  8. From that point it was clear what to do: identify why it went down (a space issue), and bring it back up. Problem solved.

Now..., I was able to do that without much difficulty because of two things:

  1. My familiarity with linux things, at least the basic system administration tasks. That's something we can / should expect from any DevOps -- the very least we should expect from them. So that shouldn't be an issue.
  2. My knowledge of the system; how the things connect and interact (hadoop, druid, mongo, batch-data-transformation, etc). Should we expect that knowledge from a DevOps (who doesn't write the code of all those things)? In this case, it was pretty simple (or maybe not) to see, from Druid's log, that it wasn't able to communicate with Hadoop's YARN. But what about less obvious failures, which require deeper knowledge of the logs, or even some knowledge of how the code works?

Maybe the answer is:

Sharing the code with the DevOps (so we should really expect someone with some experience as a programmer)?

And if the DevOps is in the team during development, maybe involve her in code review? Or at least design review?