Notebook Driven Development

Tue, 28 Nov 2017 00:00:00 +0000

Fellow Spark developers, hearken to me! How fast is your Spark development cycle? Slow? Really slow? You could use this super awesome template to enable running your Spark jobs in IntelliJ, but sometimes you’re constrained by the size/locality of the data you’re working with, and you find that each re-run takes time (which is precious and finite and all that so yes, this stuff matters).

The craftier of you might turn to that most estimable of tools, the REPL (Read-Eval-Print Loop), for quick command-line iteration. And that’s a good start. I use the Scala REPL on a daily basis, mostly to verify date/time formats and test regexes. Using the REPL with Spark, you don’t have the overhead of starting up and shutting down the SparkContext on every run, and you can quickly test things out with immediate feedback (cool). And you can enter the REPL from SBT using the console command, giving you access to the classes/utilities you’ve built in that project and the project dependencies (very cool).
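Here’s a rough sketch of the kind of quick check I mean. The date pattern, the sample strings, and the LogLine regex are all made up for illustration (they aren’t from a real project), but you could paste something like this straight into the Scala REPL or the SBT console:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Does this pattern actually parse the dates I expect?
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val parsed = LocalDate.parse("2017-11-28", fmt)

// And does it reject the ones it shouldn't accept? Try captures the parse exception.
val bad = Try(LocalDate.parse("28/11/2017", fmt))

// Hypothetical regex for a "date level message" log line.
val LogLine = """(\d{4}-\d{2}-\d{2}) (\w+) (.*)""".r

// Pattern matching on a Regex only succeeds if the whole string matches.
val fields = "2017-11-28 WARN disk almost full" match {
  case LogLine(date, level, msg) => Some((date, level, msg))
  case _                         => None
}
```

The REPL echoes each value as you go, so a bad pattern or a regex that doesn’t match shows up right away instead of halfway through a job.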

A Better Way

So yes, the REPL is nice and all but you can go even FURTHER, FASTER with notebooks like Apache Zeppelin. Zeppelin (like Jupyter) allows you to write snippets of runnable code in notebooks and execute them from the browser. What separates Zeppelin from Jupyter is how well it works out of the box with Spark. Spark is the default interpreter for Zeppelin and provides the spark and sql contexts for you implicitly. You also get great visualizations of SQL queries for free.

Simple SQL query using Zeppelin's bank example

Simple SQL query with bar graph and form input

With Zeppelin, if you’re trying to query some dataset and want to understand its total size, the cardinality of a column, or simple descriptive statistics, you can do that immediately from the notebook itself with simple SQL queries. This sounds trivial but it ABSOLUTELY saves you time and effort: you get a tight feedback loop when asking questions of your data, and if you cache the dataset you don’t have to reload it every single time. In addition, you get documentation for free with Markdown, data visualization support with Angular, a growing set of modules for the Big Data ecosystem, and simple support for collaboration and sharing among your team.
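To make that concrete, here’s a minimal sketch of those three questions (size, cardinality, descriptive statistics) asked of a cached DataFrame. The tiny bank dataset and its column names are stand-ins I invented for the example. In Zeppelin the spark session already exists, so the builder line is only there to keep the snippet runnable outside a notebook:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

// getOrCreate() reuses an existing session if there is one (as in Zeppelin);
// the local master is only a fallback for running this elsewhere.
val spark = SparkSession.builder().appName("notebook-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in data; in practice this would be your real (large) dataset.
val bank = Seq(
  (30, "management", 1200),
  (45, "technician", 300),
  (52, "management", -50),
  (38, "services", 870)
).toDF("age", "job", "balance").cache() // mark for caching so later questions skip the reload

// Total size. This first action is what actually materializes the cache.
val totalRows = bank.count()

// Cardinality of a column.
val distinctJobs = bank.select(countDistinct(col("job"))).first().getLong(0)

// Simple descriptive statistics (count, mean, stddev, min, max).
bank.describe("age", "balance").show()

// The same data is available to plain SQL once it is registered as a view.
bank.createOrReplaceTempView("bank")
spark.sql("SELECT job, COUNT(*) AS n FROM bank GROUP BY job ORDER BY n DESC").show()
```

In a notebook you would typically split these into separate paragraphs, so the expensive load-and-cache step runs once and each follow-up question is its own cheap re-run.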

I also think Zeppelin helps you write more scalable Spark code. Writing code in paragraphs reinforces the idea of making methods as small and concise as possible. Once these chunks of code are worked out, building out your codebase is more or less a matter of composing these chunks into logical classes or methods.
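One way this composition could look, as a sketch rather than a prescription: each small function below is roughly the size of one notebook paragraph, and Dataset.transform chains them together. The function names, the columns, and the sample rows are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

// Reuses the notebook's session if present; local master is a fallback only.
val spark = SparkSession.builder().appName("compose-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Each of these started life as a single paragraph worked out interactively.
def positiveBalances(df: DataFrame): DataFrame =
  df.filter(col("balance") > 0)

def normalizeJob(df: DataFrame): DataFrame =
  df.withColumn("job", upper(col("job")))

// Building the "real" method is then just a matter of composing the chunks.
def cleanAccounts(df: DataFrame): DataFrame =
  df.transform(positiveBalances).transform(normalizeJob)

// Hypothetical sample rows to exercise the pipeline.
val accounts = Seq(("management", 1200), ("technician", -50)).toDF("job", "balance")
val cleaned = cleanAccounts(accounts)
```

Because every step takes a DataFrame and returns a DataFrame, each one can be tested on its own and reordered or dropped without touching the others.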

Zeppelin does have its drawbacks. Switching between your actual code and the notebook can be challenging, so you need to set up dedicated contexts for exploration (Zeppelin) and for crafting a solution (codebase), and stick to them. Also, dependency management is too manual. I would love for Zeppelin to know everything my Spark job knows through some Vulcan mind meld or something (did I use that term correctly? I’m not a Trekkie. I’m a whatever-you-call-Tolkien-book-lover-two-generations-removed. Ringer? Inkling? Istari?).

Big Idea Section

Ultimately, I think Zeppelin is a great tool if you’re a Spark developer trying to build scalable systems in a reasonable amount of time. I think notebooks are “what’s next”. I think speed of development can be a big bottleneck to the software engineering process, especially when working with large volumes of data. I also think, most importantly, that any company of reasonable size needs a certain level of useful, live documentation to understand just what the hell they’re doing.

Because knowledge is power, right? Isn’t all of this “coding”, “documentation”, and “testing” just different ways to represent knowledge? Ultimately knowledge is just a tool, a means to achieve some goal. It’s incumbent on us as engineers to use the best tools we can to accomplish our goals. I think Zeppelin is one such tool. I also think we could take this idea further and eventually get to the point where all of the code we write is just simple chunks, easily composable with minimal overhead (why do we spend so much time on packaging and deployment?). Or maybe we’re wasting our time and we should let AI do our dirty work for us? Who knows, but for now, I guess we keep on…

Shallow vs. Deep Research

Fri, 21 Jul 2017 00:00:00 +0000

Are you like me? Do you sometimes get stuck on the merry-go-round of googling for answers to technical questions?

Don’t get me wrong. I think sites like Google and StackOverflow are amazing tools and it’d be hard to be productive without them. I think they’re especially useful for trying to conjure up some obscure Linux commands or DDL syntax for one of the dozen databases I work with on a daily basis.

But sometimes, I notice that I rely on Google TOO much. Like when I have a problem to solve, I immediately go to Google to see how others have done it. It’s a tempting, albeit understandable, trap to fall into. I’m a consultant, and so I’m constantly focused on delivering value to my clients in a quick and effective manner. So it can be difficult to justify reading documentation, digging around in source code, or reading papers on the CAP theorem when there’s a good chance I can find the answer to my question in under 60 seconds via search.

In the long run, though, who am I helping by doing this? I’m essentially outsourcing part of my job to someone else. And what’s worse, I’m tricking myself into believing I’ve mastered a certain subject or capability, when in reality I’ve just copied what others have worked hard to figure out.

In “How Will You Measure Your Life?”, Clayton Christensen tells the story of Dell and Asus. When Dell first started out, they used Asus to manufacture their chips. As Dell grew, Asus offered to manufacture more and more of the computer until they began manufacturing the entire computer for Dell. In short order, Asus struck out on their own as a low-cost competitor to Dell. Though each step in the outsourcing process looked good from a balance sheet perspective, in the long term this strategy posed a serious threat to Dell.

What’s the lesson here? Don’t sacrifice long-term growth and learning for the quick hit of an answer on Google. If you’re stuck in a really, really time-sensitive situation where you need the quick answer, then leave a TODO for yourself to do a deep dive on the problem in your spare time. Once you have time, use the Feynman technique (explain the problem in plain language, as if teaching it, and go back to the source wherever your explanation breaks down) to deeply understand it, and try to question it from all angles. It’s ultimately up to you to decide how far down the rabbit hole to go.

It’s easy to get overwhelmed by the daily demands of your job. But it’s also important to keep a mind toward long-term investments in your career, and having a solid foundation of knowledge and the ability to think for yourself is one of the most important investments you can make.

P.S. It also helps to block social media and other non-essential distractions, at least during your working hours. The quick hits of social media and quick-answer seeking seem very similar to me, and I suspect one reinforces the other. I try to use a Chrome extension called BlockThisSite to that end (though I admit I’m not 100% there yet; new habits take time to form).

P.P.S. I think I wrote this more for myself than anyone else. I tend to have a monkey brain and need to get these thoughts down and persisted somewhere as a reminder to focus. Reading the work of Cal Newport has helped a lot though, and I would highly recommend it!

Joe Conley Tagged productivity
