<h1>PHLAI Me to the Moon</h1>
<p><img src="/assets/phlai.jpg" /></p>
<p>Hello friends! I was lucky enough last week to attend <a href="https://phlai.comcast.com/">PHLAI</a>, a Comcast-sponsored conference on machine learning and artificial intelligence. The dreary weather did not dampen our spirits as practitioners and business stakeholders met to discuss one of the most important technology trends of our lifetime.</p>
<p>The talks ranged from high-level, entertaining overviews to deep-dive technical lectures. The discussions stayed focused on pragmatic approaches to solving business problems with machine learning and AI, and it’s amazing to see how much progress has been made in such a short amount of time.</p>
<p>Here are a few takeaways.</p>
<h2 id="the-importance-of-comprehension-of-models">The importance of understanding models</h2>
<p>This topic sprang up everywhere. The ability to understand why a model predicts something has great bearing on regulatory concerns, racial profiling, and security. We can’t make meaningful progress in AI without taking steps to make these models as explainable as possible. It doesn’t even have to be something as explicit as opening the black box and producing a deterministic formula; we just need some insight into why models predict the way they do.</p>
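<p>To make that concrete, here’s a minimal sketch of one cheap source of insight: tree ensembles in Spark MLlib expose feature importances out of the box. This is just an illustration, not something shown at the conference, and the <code class="language-plaintext highlighter-rouge">training</code> DataFrame and its churn-flavored columns are hypothetical.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical churn dataset: "training" is a DataFrame with a
// numeric "churned" label and a few raw feature columns
val featureCols = Array("tenure", "monthly_charges", "support_calls")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val model = new RandomForestClassifier()
  .setLabelCol("churned")
  .setFeaturesCol("features")
  .fit(assembler.transform(training))

// Gini-based importances: not a full explanation, but a quick
// answer to "which inputs is this model actually leaning on?"
featureCols.zip(model.featureImportances.toArray)
  .sortBy(-_._2)
  .foreach(println)
</code></pre></div></div>
<p>That’s a far cry from explaining an individual prediction, but even this level of visibility is often enough to catch a model leaning on a proxy for a protected attribute.</p>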
<h2 id="pragmatic-approach">Pragmatic approach</h2>
<p>I enjoyed the constant focus on simplicity and picking the right tool for the job. Why don’t you put down those neural nets and try a simple regression? Or use specific models for specific tasks and (gasp) imperative or brute-force techniques for others. I must have heard the old <a href="https://www.farnamstreetblog.com/2015/01/how-to-think-2/">hammer and nail adage</a> in at least three separate talks, which is great. I think most experienced software engineers have sat down their junior teammates and said the same thing. It’s important to be mindful of your own biases and focus on what delivers value to your client or business stakeholder, using the simplest tool for the job.</p>
<h2 id="spread-the-love">Spread the love</h2>
<p>The final trend I noticed was the focus on distributing ML/AI thinking among several teams rather than centralizing it in one silo. This idea was backed up by studies showing that companies that took a distributed approach posted better sales/ROI numbers than companies that confined their innovation efforts to isolated teams.</p>
<p>From an investment perspective, I also appreciated <a href="http://opim.wharton.upenn.edu/~kartikh/">Kartik Hosanagar’s</a> thoughts on a balanced AI portfolio. His studies showed that focusing mostly on quick, iterative wins alongside a few longer-term projects led to positive ROI. I love how practical this idea is. Speaking in terms of dollars and cents resonates much more strongly with business stakeholders and aligns these projects with the goals of the entire organization.</p>
<h1 id="reflection">Reflection</h1>
<p>I’ve been with Chariot Solutions for a few years now, and as such have had the opportunity to attend several conferences like this. Taking this time to think and reflect is essential in ALL fields, especially one as fast-moving and relevant as artificial intelligence. <a href="http://lifehacker.com/5670380/the-power-of-time-off">Bill Gates</a> famously takes an annual “think week” to explore and reflect on big ideas. Conferences are even better: they give you a chance to talk to other people in the field (conversation is still one of the most effective forms of information gathering).</p>
<p>But what’s the point of these conferences if we just go back to our day jobs and carry on with business as usual? We need to find a way to <strong>actively</strong> engage with these ideas. That engagement will look different for everyone. For some it could mean creating a small project using a new AI framework. Or reading a book about a specific trend or application. Or writing a blog post to organize your thoughts and make an argument. Whatever the form, I’d argue that what you do <strong>after</strong> the conference is just as important as what you do during it.</p>
<p><em>Tue, 22 Aug 2017 · jpc2.org/2017/08/22/phlai.html</em></p>
<h1>Real World Spark Lessons</h1>
<p>I’ve enjoyed learning the ins and outs of <a href="https://spark.apache.org/">Spark</a> at my current client. I’ve got a nice base SBT project going where I use Scala to write the Spark job, <a href="https://github.com/typesafehub/config">Typesafe Config</a> to handle configuration, <a href="https://github.com/sbt/sbt-assembly">sbt-assembly</a> to build out my artifacts, and <a href="https://github.com/sbt/sbt-release">sbt-release</a> to cut releases. Using this as my foundation, I recently built a Spark job that runs every morning to collect the previous day’s data from a few different datasources, join some reference data, perform a few aggregations, and write all of the results to Cassandra. All in roughly three minutes (not too shabby).</p>
<p>Here are some initial lessons learned:</p>
<ul>
  <li>Be mindful of when to use <code class="language-plaintext highlighter-rouge">cache()</code>. It materializes that point in your DAG so the same instructions don’t get re-computed by every downstream action. I ended up using it right before performing my multiple aggregations (see the sketch after this list).</li>
  <li><a href="https://avro.apache.org/">Apache Avro</a> is really, really good at data serialization. It should be the default choice for large-scale data writes into HDFS.</li>
  <li>When using <code class="language-plaintext highlighter-rouge">pivot(column, range)</code>, it REALLY helps if you can enumerate the entire range of the pivot column’s values. My job time was cut in half as a result of passing all possible values. More on <a href="https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html">the Databricks blog</a>.</li>
  <li>Cassandra upserts by default, so I didn’t even need to worry about primary key constraints when data needs to be re-run (idempotency is badass).</li>
</ul>
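<p>Here’s a minimal sketch of the first and third lessons together: cache once, aggregate many times, and hand <code class="language-plaintext highlighter-rouge">pivot</code> its values up front. The paths and column names are made up for illustration; the shape is what matters.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("daily-rollup").getOrCreate()

// Hypothetical day of events: (device_id, metric, value)
val events = spark.read.parquet("/data/events/2017-05-30")

// Materialize once, since several aggregations share this lineage
val enriched = events.filter(col("value").isNotNull).cache()

val totals = enriched.groupBy("device_id").sum("value")

// Enumerating the pivot values up front spares Spark an extra
// pass over the data just to discover them
val metrics = Seq("cpu", "memory", "disk", "network")
val byMetric = enriched
  .groupBy("device_id")
  .pivot("metric", metrics)
  .agg(avg("value"))

// ...both results then get written out to Cassandra
</code></pre></div></div>
<p>Without the <code class="language-plaintext highlighter-rouge">cache()</code>, each of those aggregations would re-read and re-filter the source data from scratch.</p>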
<p>Recently, I was asked to update my job to run every 15 minutes to grab the latest 15 minutes of data (people always want more of a good thing). So I somewhat mindlessly updated my cronjob and didn’t re-tune any of the configuration parameters (spoiler alert: bad idea). Everything looked good locally and on our test cluster, but when it came time for production, WHAM! My job was now taking 5-7 minutes while running on a fraction of the data the daily runs processed. Panic time!</p>
<p><img src="/assets/fry-panic.jpg" alt="Philip J. Fry Panicking" /><br /></p>
<p>After wading through my own logs and some cryptic YARN stacktraces, it dawned on me to check my configuration properties. One thing in particular jumped out at me:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark.sql.shuffle.partitions = 2048
</code></pre></div></div>
<p>I had been advised to set this value when running my job in production, and it worked well for the daily job (cutting processing time by 30s). However, now that I was working with data in a 15-minute window, this was WAY too many partitions. The additional runtime came from the overhead of scheduling so many partitions for so little data (my own theory, correct me if I’m wrong). So I disabled this property (falling back to the default of 200) and my job started running in ~2 minutes. Much better!</p>
<p><img src="/assets/futurama-happy.jpg" alt="Futurama gang happy" /><br /></p>
<p>(UPDATE: after some experimentation on the cluster, I settled on 64 partitions.)</p>
<p>More lessons learned:</p>
<ul>
  <li>ALWAYS test your Spark job on a production-like cluster as soon as you make any changes. Running your job locally vs. on a YARN/Mesos cluster is about as similar as running it on Earth vs. Mars, give or take.</li>
  <li>You REALLY should know the memory/CPU stats of your cluster to help inform your configuration choices. You should also be mindful of what other jobs run on the cluster, and when.</li>
  <li>Develop at least a basic ability to <a href="https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html">read and understand the Spark UI</a>. It’s got a lot of useful info, and with event logging you can see the impact of your incremental changes in real time.</li>
</ul>
<p>Let me give another shout-out to Typesafe Config for making my life easier. I have three different ways (env variables, properties file, command-line args) to pass configuration to my Spark job, and I was able to quickly tune parameters using all of these options. Interfaces are just as important to developers as they are to end users!</p>
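<p>For the curious, the layering looks roughly like this. It’s a minimal sketch with a hypothetical <code class="language-plaintext highlighter-rouge">job.shuffle-partitions</code> key rather than my actual production setup: <code class="language-plaintext highlighter-rouge">ConfigFactory.load()</code> already gives command-line <code class="language-plaintext highlighter-rouge">-D</code> flags priority over the packaged config file, and HOCON’s optional substitutions pull in environment variables.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

// In application.conf, a default plus an env var override:
//   job.shuffle-partitions = 200
//   job.shuffle-partitions = ${?JOB_SHUFFLE_PARTITIONS}
// Then -Djob.shuffle-partitions=64 on the command line trumps both.
val config = ConfigFactory.load()

val shufflePartitions = config.getInt("job.shuffle-partitions")

val spark = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", shufflePartitions.toString)
  .getOrCreate()
</code></pre></div></div>
<p>Being able to flip one value from three different directions is exactly what made the partition tuning above painless.</p>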
<p>All in all this was a fun learning experience. I try to keep up on various blogs about Spark, but you really don’t get a good feel for it until you’re actually working on a problem with production-scale infrastructure and data. I think this is a good lesson for any knowledge work. You need to <a href="https://www.farnamstreetblog.com/2013/04/the-work-required-to-have-an-opinion/">do the work</a> to acquire knowledge. That involves not just reading but challenging assumptions, proving out ideas, and <a href="http://www.nytimes.com/1997/07/27/sports/hogan-constant-focus-on-perfection.html?src=pm">digging knowledge out of the dirt</a>. Active engagement with quick feedback loops leads to much deeper, more usable knowledge, and that’ll make you, as Mick would say, <a href="https://www.youtube.com/watch?v=o0CXUv-xxtY">“a very dangerous person!”</a></p>
<p>Party on!</p>
<p><img src="https://media.giphy.com/media/vMnuZGHJfFSTe/giphy.gif" alt="Wayne and Garth" /><br /></p>
<p><em>Wed, 31 May 2017 · jpc2.org/2017/05/31/real-world-spark-lessons.html</em></p>
<h1>Graph-Based Documentation</h1>
<p>Has anyone ever met a documentation system they both <em>liked</em> and found <em>useful</em>? I love Evernote as much as the next guy, but the simple list view has its limitations. Most wikis present information in a tree view where pages are restricted to a parent-child relationship. Neither is very useful or intuitive for documenting complex systems!</p>
<p><img src="https://confluence.atlassian.com/download/attachments/218270144/Confluence%20Tree%20View%20Web%20Part.PNG?version=1&amp;modificationDate=1192642298936&amp;api=v2" /><br /></p>
<p>I’m a very visual thinker. I know from experience that when dealing with several layers of abstraction, a good visualization can be very helpful. And when I say <strong>good</strong>, I mean <strong>good</strong> in the sense that the visualization is <strong>as close to reality as possible</strong>. Shane Parrish and others remind us that <a href="https://www.farnamstreetblog.com/2015/11/map-and-territory/">the map is not the territory</a>, but we can get pretty damn close. And I think graphs can help (<a href="https://neo4j.com/blog/technical-documentation-graph/">so does neo4j, shockingly</a>). Because it’s 2017, and we deserve better ways to visualize ideas and systems.</p>
<p>Why graphs? Graphs are inherently simple. There are nodes and edges. That’s it. Nodes represent a “thing”; edges represent a “relationship between things”. There’s no parent-child restriction; any node can be related to any other node. Visually, the relationships can be shown compactly, and the information structure is more flexible. Using this as our foundation, we can start to build something useful.</p>
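<p>To show just how little structure that is, here’s a sketch of the entire data model in Scala. The names are hypothetical rather than Episteme’s actual schema, and I’ve included a <code class="language-plaintext highlighter-rouge">tags</code> field to anticipate the filtering idea I get to below.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Nodes are "things", edges are "relationships between things",
// and that's the whole structure
case class Node(
  id: String,
  label: String,                 // display name shown in the graph
  doc: String,                   // Markdown body, edited on click
  tags: Set[String] = Set.empty  // for context-driven filtering
)

case class Edge(from: String, to: String)

case class Graph(nodes: Seq[Node], edges: Seq[Edge]) {
  // Narrow the graph to one working context, e.g. byTag("swingstats"),
  // keeping only edges whose endpoints both survive the filter
  def byTag(tag: String): Graph = {
    val kept = nodes.filter(_.tags.contains(tag))
    val keptIds = kept.map(_.id).toSet
    def internal(edge: Edge): Boolean =
      Set(edge.from, edge.to).subsetOf(keptIds)
    Graph(kept, edges.filter(internal))
  }
}
</code></pre></div></div>
<p>Serialize that to JSON and you’ve got the whole file format; it’s already close to the nodes/edges arrays vis.js wants.</p>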
<p>So, here’s what I’ve got so far. I’m calling it <em>Episteme</em> (from the Greek for <a href="https://en.wikipedia.org/wiki/Episteme">“knowledge, science, or understanding”</a>). It’s a desktop app powered by <a href="https://electron.atom.io/">Electron</a>: a simple graph of nodes and edges, where each node is some entity we want to document. Here’s an example based on my current <a href="http://www.swingstats.com/about">SwingStats</a> architecture:</p>
<p><img src="/assets/episteme-graph.png" alt="Episteme Graph" /><br /></p>
<p>Here the nodes represent backend services, webapps, datasources, and APIs, while the edges connect the nodes that interact in some way. Clicking on a node brings up a Markdown-based document which autosaves on edit:</p>
<p><img src="/assets/episteme-node.png" alt="Episteme Node" /><br /></p>
<p>I’ve already been using it at my current client to help me navigate the dozens of systems and their interactions. I’ve found the most value in quickly accessing common commands (SQL, ssh, docker, etc.), environment information, and links. It’s definitely sped up development time, as I don’t need to constantly search Confluence or Google for rote-memory stuff like query syntax. It just feels like the information is much closer at hand.</p>
<p>I’m hoping to add some functionality to make it more context-driven. I think tagging nodes would serve well here: depending on what project or context I’m working on, I could filter the graph by tags to show only the relevant nodes. As the graph grows, a “Jump To” button for nodes would be nice. Full-text search is probably inevitable too.</p>
<p>Another interesting extension would be having teams share and collaborate on the graph. Maybe in a Git-based system with a fork/clone model, so you get version control for free and can see how the graph evolves over time? Throw in some live documentation a la <a href="http://swagger.io/">Swagger</a> and baby, you’ve got a stew going!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Sr2PlqXw03Y" frameborder="0" allowfullscreen=""></iframe>
<p>One cool thing to note: it took me five hours to get a useful prototype working, and most of that time was spent learning Electron. I’ve spent a few more hours on refinements, but <a href="http://visjs.org/">vis.js</a> and <a href="https://simplemde.com/">SimpleMDE</a> do all the heavy lifting, and the graph is persisted as a simple JSON file for now. I’m not a master front-end developer by any stretch of the imagination, so if you have an idea, find some good tools that get you most of the way there and kick the tires!</p>
<p>Interested in this stuff? Wanna see a hosted version so you can take it for a spin? Wanna help me finish building the damn thing? Let me know in the comments below or on Twitter <a href="https://www.twitter.com/josephpconley">@josephpconley</a>. And thanks to those five brave individuals who voted in my <a href="https://twitter.com/josephpconley/status/852576703419478016">Twitter poll</a>; your feedback is much appreciated!</p>
<p><em>Wed, 26 Apr 2017 · jpc2.org/2017/04/26/graph-based-documentation.html</em></p>
<h1>Scala By The Schuylkill Recap</h1>
<p>This past Tuesday I had the pleasure of attending the <a href="http://scala.comcast.com/">Scala by the Schuylkill conference</a> at Comcast headquarters in downtown Philadelphia. Initially an internal Scala conference, it was opened this year to external folks interested in Scala. I learned a lot from this event, gaining perspective on trends in the Scala community and sparking curiosity about several interesting applications of the Scala language.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Our <a href="https://twitter.com/hashtag/ScalaByTheSchuylkill?src=hash">#ScalaByTheSchuylkill</a> organizers with keynote speaker <a href="https://twitter.com/sreekotay">@sreekotay</a>! <a href="https://twitter.com/hashtag/onbreak?src=hash">#onbreak</a> <a href="https://twitter.com/hashtag/scala?src=hash">#scala</a> <a href="https://t.co/yyJoTfkljm">pic.twitter.com/yyJoTfkljm</a></p>&mdash; Comcast Careers (@comcastcareers) <a href="https://twitter.com/comcastcareers/status/823903924394610694">January 24, 2017</a></blockquote> <script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The keynote speeches were the highlight of the conference for me. Comcast’s CTO, <a href="https://twitter.com/sreekotay">Sree Kotay</a>, gave an engaging talk on the culture of innovation at Comcast and how it has evolved into a “technology first” company (as its CEO Brian Roberts recently put it). He also explained the rationale for using Scala on certain projects, citing interoperability with Java, modularity, and the ability to draw top talent as key factors in adoption. He even showed off his geek credentials by detailing his love/hate relationship with a certain Scala web service library.
It’s clear that Sree is an engineer at heart, and it was refreshing to see that the CTO of a multi-billion-dollar company still enjoys tinkering with code.</p>
<p><a href="https://twitter.com/mpilquist">Michael Pilquist</a> gave the other keynote, doing a masterful job of explaining the <a href="https://speakerdeck.com/mpilquist/realistic-functional-programming">value of functional programming</a>. He boiled the essence of FP down to managing the complexity of state and control flow by composing small expressions that can be reasoned about in isolation. He also demystified category theory, an area of mathematics I’ve always found interesting but never saw the practical use for until now. He stressed that category theory in programming is about precision: finding the appropriate level of abstraction for a given problem so you can focus on the essential. Michael put these ideas in an accessible and interesting context, and I also appreciated his book recommendation, <a href="https://www.goodreads.com/book/show/23360039-how-to-bake-pi"><em>How to Bake Pi</em></a> by Eugenia Cheng, which I’m currently devouring.</p>
<p>A great variety of talks followed, touching on topics like GIS, machine learning, microservices, and streaming, with a focus on tools like Akka and Spark. About half of the speakers were from Comcast, and it was interesting to see the problems they’ve had to solve and why they chose Scala to solve them (hint: they work with data, a LOT of it). I came away with at least a dozen TODOs to research new libraries and techniques. I also enjoyed meeting new people and catching up with some past colleagues. As an introvert, I don’t focus much on networking and relationship building, but a conference centered on a specific technology like Scala creates an environment that’s very conducive to meeting new people and learning about their work.</p>
<p>I’m happy to see an important tech company like Comcast invest so much time and energy into both the Scala ecosystem and the local Scala community here in Philadelphia. It’s clear that, regardless of what you may have heard, Scala is here to stay!</p>
<p>Special thanks to Chariot for sponsoring my attendance!</p>
<p><em>Fri, 27 Jan 2017 · jpc2.org/2017/01/27/scala-by-the-schuylkill.html</em></p>