Joe Conley Tagged sbt

Notebook Driven Development

Tue, 28 Nov 2017 00:00:00 +0000

Fellow Spark developers, hearken to me! How fast is your Spark development cycle? Slow? Really slow? You could use this super awesome template to enable running your Spark jobs in IntelliJ, but sometimes you’re constrained by the size/locality of the data you’re working with, and you find that each re-run takes time (which is precious and finite and all that so yes, this stuff matters).

The craftier of you might turn to that most estimable of tools, the REPL (Read-Evaluate-Print-Loop) for quick command-line iteration. And that’s a good start. I use the Scala REPL on a daily basis, mostly to verify date/time formats and test regexes. Using the REPL with Spark, you don’t pay the overhead of starting up/shutting down the SparkContext on every run, and you can quickly test things out with immediate feedback (cool). And you can enter the REPL from SBT using the console command, giving you access to the classes/utilities you’ve built in that project and the project dependencies (very cool).
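To make that concrete, here is the kind of thing I mean by checking a date format or a regex in the REPL. This is only a sketch: the pattern, the sample strings, and the LogLine regex are made up for illustration, and it assumes Java 8+ for java.time.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical pattern and inputs, chosen only for illustration.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// Does this string parse with the pattern I plan to use in the job?
val ok  = Try(LocalDate.parse("2017-11-28", fmt)).isSuccess
val bad = Try(LocalDate.parse("11/28/2017", fmt)).isSuccess

// Does my regex capture the groups I expect?
val LogLine = """(\d{4}-\d{2}-\d{2}) (\w+) (.*)""".r
val parsed = "2017-11-28 WARN disk almost full" match {
  case LogLine(date, level, msg) => Some((date, level, msg))
  case _                         => None
}
```

Paste it into the console and you know within seconds whether the format string and the regex behave, without ever launching a job.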

A Better Way

So yes, the REPL is nice and all but you can go even FURTHER, FASTER with notebooks like Apache Zeppelin. Zeppelin (like Jupyter) allows you to write snippets of runnable code in notebooks and execute them from the browser. What separates Zeppelin from Jupyter is how well it works out of the box with Spark. Spark is the default interpreter for Zeppelin and provides the spark and sql contexts for you implicitly. You also get great visualizations of SQL queries for free.

[Figure: Simple SQL query using Zeppelin's bank example, with bar graph and form input]

With Zeppelin, if you’re trying to query some dataset and want to understand its total size, the cardinality of a column, or simple descriptive statistics, you can do that immediately from the notebook itself with simple SQL queries. This sounds trivial but it ABSOLUTELY saves you time and effort by giving you a tight feedback loop when asking questions of data, and (as long as you cache the dataset) you don’t have to reload it every single time. In addition, you get documentation for free with Markdown, data visualization support with Angular, a growing set of modules from the Big Data ecosystem, and simple support for collaboration and sharing among your team.
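Here is roughly what those quick questions look like in a Spark paragraph. Treat it as a sketch under assumptions: it assumes Spark 2.x, the tiny in-memory bank dataset and its column names are invented stand-ins for a real (large) dataset, and the SparkSession is built by hand only so the snippet also runs outside a notebook, since Zeppelin already hands you spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

// In a Zeppelin Spark paragraph `spark` already exists; it is built here
// only so the sketch can also run outside a notebook.
val spark = SparkSession.builder().master("local[*]").appName("explore").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for a large dataset that is loaded once and cached.
val bank = Seq((30, "services", 1787), (33, "management", 4789), (35, "management", 1350))
  .toDF("age", "job", "balance")
  .cache()

// Total size and cardinality of one column.
val total = bank.count()
val jobs  = bank.select(countDistinct("job")).first().getLong(0)

// Simple descriptive statistics (count, mean, stddev, min, max).
bank.describe("balance").show()

// The same data is available to SQL once it is registered as a view.
bank.createOrReplaceTempView("bank")
spark.sql("SELECT job, avg(balance) AS avg_balance FROM bank GROUP BY job").show()
```

Because the DataFrame is cached, every question after the first count is answered from memory, which is where the tight feedback loop comes from.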

I also think Zeppelin helps you write more scalable Spark code. Writing code in paragraphs reinforces the idea of making methods as small and concise as possible. Once these chunks of code are worked out, building out your codebase is more or less a matter of composing these chunks into logical classes or methods.

Zeppelin does have its drawbacks. Switching between your actual code and the notebook can be challenging, so you need to keep separate contexts for exploration (Zeppelin) and for crafting a solution (codebase), and stick to them. Also, dependency management is too manual. I would love for Zeppelin to know everything my Spark job knows through some Vulcan mindmeld or something (did I use that term correctly? I’m not a Trekkie. I’m a whatever-you-call-Tolkien-book-lover-two-generations-removed. Ringer? Inkling? Istari?).

Big Idea Section

Ultimately, I think Zeppelin is a great tool if you’re a Spark developer trying to build scalable systems in a reasonable amount of time. I think notebooks are “what’s next”. I think speed of development can be a big bottleneck to the software engineering process, especially when working with large volumes of data. I also think, most importantly, that any company of reasonable size needs a certain level of useful, live documentation to understand just what the hell they’re doing.

Because knowledge is power, right? Isn’t all of this “coding”, “documentation”, and “testing” just different ways to represent knowledge? Ultimately knowledge is just a tool, a means to achieve some goal. It’s incumbent on us as engineers to use the best tools we can to accomplish our goals. I think Zeppelin is one such tool. I also think we could take this idea further and eventually get to the point where all of the code we write is just simple chunks, easily composable with minimal overhead (why do we spend so much time on packaging and deployment?). Or maybe we’re wasting our time and we should let AI do our dirty work for us? Who knows, but for now, I guess we keep on…

High-Leverage Development with Giter8 Templates

Thu, 12 Oct 2017 00:00:00 +0000

Edmond Lau talks a lot about leverage in his book The Effective Engineer, a term he borrowed from Andy Grove’s High Output Management. Both are excellent reads, especially for programmers looking to maximize the impact they have on their teams. The term leverage gets to the heart of this. It describes activities that create a disproportionate amount of value. This feels like a much more elegant description than “10x/rockstar/ninja developer” or whatever cliché stokes the egos of the programmer-gods. It places the focus on output, where it belongs!

Some examples of high-leverage activities Lau mentions include:

improving the onboarding processes for new hires via tutorials, documentation, and notebooks (i.e. labs)
creating tight feedback loops to quickly validate ideas (e.g. use a REPL or a notebook!)
writing tools to make you and other developers more efficient

In this spirit, I’ve created a Giter8 template to show how to create an SBT-based Spark project with the following accouterments:

utilities for logging and writing dataframes in common formats
configuration via Typesafe Config
building the fat jar via sbt-assembly
release support via sbt-release
support for running your Spark job in IntelliJ

This has saved me a significant amount of time in starting new Spark jobs or testing out quick proof-of-concepts. Simply call sbt new josephpconley/spark-seed.g8 and you’re all set! Enjoy!

Roll Your Own Notification Service

Mon, 27 Jan 2014 00:00:00 +0000

Have you ever wished you could receive customized updates whenever your favorite websites update their content? Most sites offer the means to get notified when a new blog post hits the wire or new products are added to their catalog (RSS, social media, e-mail, etc.). But what if the site doesn’t use any of these services? Or what if you only want specific updates (e.g. blog posts from author X, new products containing the name Y)? Then you’re left with only one course of action: build your own notification service!

Armed with the mighty powers of HTML scraping, the Scala programming language, and a recurring scheduling mechanism (in this case Play’s Akka scheduler), you have all the tools you need to set up your custom notification.

My New EBook Notification Service

Let’s create a notification service which lets us know when new ebooks are available at my local digital library, Delaware County Library System. At the time of this writing, no such notification service exists. As I’d prefer not to miss any notifications, I’d like to set up an RSS feed. Specifically, we’ll write a process which periodically checks the digital library site for new ebooks and updates an RSS feed accordingly.

Scala/SBT

We’ll start out by creating a basic Scala application using SBT (you can check out a skeleton project here). Let’s add the HTMLUnit and Scala IO libraries to our project. We’ll use HTMLUnit to parse the HTML code of the library’s website, and we’ll use Scala IO to write our XML to file. Your build file should now look like this (assuming you named your project “ebook”):
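A minimal build.sbt along these lines would pull in both libraries. The group and artifact coordinates are the ones published to Maven Central; the Scala and library version numbers are assumptions picked from releases available around the time of writing, not necessarily the ones this project pinned.

```scala
// build.sbt: a sketch. All version numbers below are assumptions.
name := "ebook"

version := "0.1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  // HTMLUnit: headless browser used for loading and querying the HTML
  "net.sourceforge.htmlunit" % "htmlunit" % "2.13",
  // Scala IO: file helpers used for writing the generated XML to disk
  "com.github.scala-incubator.io" %% "scala-io-file" % "0.4.2"
)
```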

Scala XML

Let’s start by building an abstraction for an RSS feed (you can read about the basics of RSS here). We’ll start with an Item case class which holds the basic properties of an RSS item and a method to generate XML. Similarly, we define the basic properties of a Feed using a trait. We’ll make this abstract in anticipation of re-using this abstraction for other feeds.
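One way the Item and Feed pair could look is sketched below. The field names and the toXml method name are my guesses at a reasonable shape rather than the exact code in the scrape library, and the XML literals assume scala.xml is on the classpath (built in on Scala 2.10, a separate scala-xml module on later versions).

```scala
import scala.xml.Elem

// Basic properties of a single RSS item. Field names are illustrative.
case class Item(title: String, link: String, description: String) {
  def toXml: Elem =
    <item>
      <title>{title}</title>
      <link>{link}</link>
      <description>{description}</description>
    </item>
}

// Channel-level properties. A concrete feed only has to supply these members.
trait Feed {
  def title: String
  def link: String
  def description: String
  def items: Seq[Item]

  // Render the whole feed as an RSS 2.0 document.
  def toXml: Elem =
    <rss version="2.0">
      <channel>
        <title>{title}</title>
        <link>{link}</link>
        <description>{description}</description>
        {items.map(_.toXml)}
      </channel>
    </rss>
}
```

Keeping items abstract is what makes the trait reusable: each new feed only decides where its items come from, and the XML rendering is shared.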

Screen Scraping with HTMLUnit

Let’s build a NewEBookFeed which implements Feed. When we implement the items method, we’ll use HTMLUnit to parse the HTML code from Delaware County Library System to find out the newest items. This requires digging around the source HTML a bit to understand the structure and find useful patterns. Basic knowledge of XPath is required to leverage those patterns. After inspecting the source code and following the appropriate links, we can view the New Ebook page source and parse out the new titles, authors, and image URLs.
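The scraping step might be sketched like this. Everything site-specific here is hypothetical: the XPath expression is a placeholder, since the real one depends on the library site’s markup, and the NewEBookScraper object and its method names are invented for illustration. It assumes an HTMLUnit 2.x release (the com.gargoylesoftware.htmlunit packages) and scala.collection.JavaConverters.

```scala
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.{HtmlElement, HtmlPage}
import scala.collection.JavaConverters._

object NewEBookScraper {
  // Hypothetical XPath: the real expression depends on the site's markup.
  val TitleXPath = "//div[@class='title']/a"

  // Fetch a page with JavaScript disabled, since only the static HTML is needed.
  def load(url: String): HtmlPage = {
    val client = new WebClient()
    client.getOptions.setJavaScriptEnabled(false)
    client.getPage[HtmlPage](url)
  }

  // Pull the trimmed text of every element matched by the XPath out of a loaded page.
  def titles(page: HtmlPage, xpath: String = TitleXPath): Seq[String] = {
    val nodes: java.util.List[_] = page.getByXPath(xpath)
    nodes.asScala.collect { case e: HtmlElement => e.getTextContent.trim }.toList
  }
}
```

Authors and image URLs would follow the same pattern with their own XPath expressions, reading e.getAttribute("src") instead of the text content for images.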

That’s it! You can find my complete code as part of my scrape library, specifically the com.josephpconley.books and com.josephpconley.rss packages. We can test the code by running the following:

Deploy using Play

Now that we have a way to generate an up-to-date RSS feed, we need a way to update our feed periodically and make it publicly available to an RSS reader like feedly (my personal favorite). We could handle this a few different ways (e.g. schedule a cron job to push a file to our Dropbox folder); however, I’d like to demonstrate how to handle both the scheduling and file writing/serving using the Play Framework.

Start a new Play Scala project, and either package our ebook project as a jar and copy it to the lib folder, or just copy and paste the source code into the new Play project (I’ve done the former).

Akka Scheduler

To hook into Play’s Akka scheduler, we create a Global object in the app folder and override the onStart method, which allows us to run code once the application starts. The Akka system scheduler allows you to schedule a recurring process for a given Duration. In our case, since the site doesn’t update that frequently and we want to be respectful by not overloading the site with requests, we’ll set the duration to 12 hours.
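A Global object along those lines could look like the sketch below. It assumes the Play 2.2-era API (GlobalSettings and play.api.libs.concurrent.Akka), and refreshFeed is a hypothetical placeholder for whatever regenerates and writes the RSS file.

```scala
import play.api.{Application, GlobalSettings, Logger}
import play.api.Play.current
import play.api.libs.concurrent.Akka
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import scala.concurrent.duration._

// Lives in the default package (app/Global.scala) so Play picks it up.
object Global extends GlobalSettings {

  // Hypothetical placeholder: regenerate the feed and write it to disk here.
  private def refreshFeed(): Unit = Logger.info("Refreshing ebook feed")

  override def onStart(app: Application): Unit = {
    // Run once at startup, then again every 12 hours.
    Akka.system.scheduler.schedule(0.seconds, 12.hours) {
      refreshFeed()
    }
  }
}
```

The zero initial delay means the feed is fresh as soon as the app boots, which matters on hosts that restart the app regularly.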

From there, it’s simply a matter of building out a controller with some routes to host the updated file (a straightforward exercise I’d leave to the reader). I personally included this code and hosted the RSS feeds in my own Play app running on Heroku.

Drawbacks

One drawback you might have noticed from this specific example is the possibility of the target site’s source code changing. We relied on very specific HTML tags, text and class attributes to query the information we needed, and should the site be re-written significantly, it’s possible that we would have to re-write our scraping code to accommodate.

Conclusion

Managing the daily flow of information can be a challenge. With a little bit of coding, however, we can gain finer control over the information we consume, helping us be more productive in our everyday life.