First of all, supposedly a DevOps is this magic person who resides in the Goldilocks zone :) I mean, the intersection between programmer (who knows the code of the system she's managing), QA (really necessary?), and system administrator (the biggest portion of her work would be dealing with stuff like Hadoop, microservices, databases, forward proxies... stuff like that).
I'm asking this question (or have been asking this question) because, as far as I can tell, I'm basically all-the-things: UX expert, product designer, software architect, programmer, QA (barely; not much time left)... and DevOps too. I decide the whole mix of stuff we have in the platform (Hadoop, Druid, Mongo, nginx, Postgres, Spark, LoopBack), define their Docker containers, and am now even writing a web application to automate the creation of those containers... you know, PaaS-style.
NEWSFLASH
---------
I also wrote a continuation of this blog post, here: Attn. Programmers: know some devops stuffs. It's good for you.
---------
Now, the troubleshooting. What happened today: the dashboard was not showing today's numbers.
- Checked the front-end, all seems ok, no error in the browser.
- Checked the database, we have data arriving.
- Checked the batch process (that performs aggregation); it does its job correctly: it pulls the data out of the database, transforms it, and sends it off to Hadoop (the file system, not the MapReduce cluster).
- Checked Druid; it's up and running. But... it turns out it fails when it tries to send cube-building tasks to Hadoop.
- Checked the connection between the Druid node and Hadoop; seems fine, I could ping both ways.
- Checked Druid again, digging deeper into the log, and it says "failed to communicate with Hadoop service on port 8032".
- Checked inside Hadoop's node whether there is a running service bound to port 8032: netstat -plnt. Nope, it's down. Port 8032 is YARN (the ResourceManager).
- From that point it was clear what to do: identify why it went down (a disk-space issue) and bring it back up. Problem solved.
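The network and port checks above can be sketched as a few shell commands. The hostname "hadoop-node" is a placeholder; adjust it (and any paths) to your own setup:

```shell
# On the Druid node: is the Hadoop host reachable at the network level?
# ("hadoop-node" is a hypothetical hostname.)
ping -c 3 hadoop-node

# On the Hadoop node: is anything listening on port 8032, YARN's
# ResourceManager port? -p shows the owning process, -l listening
# sockets only, -n numeric addresses, -t TCP only.
netstat -plnt | grep 8032 || echo "nothing bound to 8032 -- YARN is down"

# If YARN is down, check disk space first; a full disk is a common
# reason for Hadoop daemons to die.
df -h
```

If netstat isn't installed, ss -plnt (from iproute2) gives the same listing. How you restart YARN depends on your installation; with a stock Hadoop distribution it's typically something along the lines of $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager.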
Now, I was able to do that without much difficulty because of two things:
- My familiarity with Linux, at least with basic system-administration tasks. That much we can (and should) expect from any DevOps; it's the very least. So that shouldn't be an issue.
- My knowledge of the system: how the pieces connect and interact (Hadoop, Druid, Mongo, the batch data transformation, etc.). Should we expect that knowledge from a DevOps (who doesn't write the code of all those things)? In this case it was pretty clear (or maybe not), from Druid's log, that it wasn't able to communicate with Hadoop's YARN. But what about less obvious failures, which require a deeper reading of the log, or even some knowledge of how the code works?
Maybe the answer is:
- Sharing the code with the DevOps (so we should really expect some experience as a programmer)?
- And if the DevOps is on the team during development, maybe involve her in code reviews? Or at least design reviews?