Joe Conley Tagged csv Random thoughts on technology, business, books, and everything in between jpc2.org/name/csv Query JSON/XML/CSV using SQL <p>Ever wish you could use your favorite query language across different data formats? Or get query results in several formats (XML, JSON, and CSV/XLS)? Then check out <a href="http://www.datacombinator.com/query">DataCombinator’s new query engine</a>.</p> <h2 id="data-sources">Data sources</h2> <p>You can copy and paste structured data manually, point to a URL, or connect to a database directly (H2, MongoDB, MySQL, or PostgreSQL). The engine hasn’t yet been optimized to handle large documents or tables, so please be mindful of the size of your input.</p> <h2 id="query-languages">Query languages</h2> <p>The engine supports JSONPath (powered by <a href="http://www.josephpconley.com/2014/04/15/jsonpath-for-play.html">my open-source Play library</a>), XPath, and SQL. You can use any of these languages to query data in any of the JSON, XML, or CSV formats. Since JSONPath and XPath are fairly similar and straightforward, the more interesting use cases tend to involve SQL.</p> <h3 id="sql">SQL</h3> <p>The FROM clause isn’t necessary, since the query applies to only one “table”: the data being queried. 
For SQL to work against JSON, the JSON must be an array of objects, e.g.</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"Joe"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"Janine"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span></code></pre></div></div> <p>If the objects in the array have nested levels, each object will be flattened, and the keys concatenated with an “_”, e.g.</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Joe"</span><span class="p">,</span><span class="w"> </span><span class="nl">"address"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"street"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"123 Main St."</span><span class="p">,</span><span class="w"> </span><span class="nl">"city"</span><span class="p">:</span><span class="w"> 
</span><span class="s2">"Springfield"</span><span class="p">,</span><span class="w"> </span><span class="nl">"state"</span><span class="p">:</span><span class="w"> </span><span class="s2">"PA"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span></code></pre></div></div> <p>would be flattened to</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"id"</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"Joe"</span><span class="p">,</span><span class="w"> </span><span class="nl">"address_street"</span><span class="p">:</span><span class="s2">"123 Main St."</span><span class="p">,</span><span class="w"> </span><span class="nl">"address_city"</span><span class="p">:</span><span class="s2">"Springfield"</span><span class="p">,</span><span class="w"> </span><span class="nl">"address_state"</span><span class="p">:</span><span class="s2">"PA"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span></code></pre></div></div> <p>Similarly, an XML document must be in a “table format” for a SQL query to work against it, e.g.</p> <div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nt">&lt;table</span> <span class="na">class=</span><span class="s">"ui table"</span><span class="nt">&gt;</span> <span class="nt">&lt;row&gt;</span> <span class="nt">&lt;id&gt;</span>1<span class="nt">&lt;/id&gt;</span> <span class="nt">&lt;name&gt;</span>Joe<span class="nt">&lt;/name&gt;</span> <span class="nt">&lt;/row&gt;</span> <span class="nt">&lt;row&gt;</span> <span 
class="nt">&lt;id&gt;</span>2<span class="nt">&lt;/id&gt;</span> <span class="nt">&lt;name&gt;</span>Janine<span class="nt">&lt;/name&gt;</span> <span class="nt">&lt;/row&gt;</span> <span class="nt">&lt;/table&gt;</span> </code></pre></div></div> <h3 id="supported-sql-functions">Supported SQL functions</h3> <p>The engine supports basic single-table query functionality (no self-joins yet) with simple clauses (WHERE, GROUP BY, and ORDER BY) and a few basic aggregate functions (COUNT, MIN, MAX, SUM). I’ll be working to expand upon this, so if you have any requests, <a href="http://www.datacombinator.com/contact">let me know</a>.</p> <h2 id="query-results">Query results</h2> <p>The query engine outputs results in JSON and XML, and also in CSV, HTML table, or Excel format if the resulting structure can be converted to a table structure.</p> <h2 id="examples">Examples</h2> <p>Here are a few examples where I’ve found the query engine helpful.</p> <h3 id="espn-apis---json">ESPN APIs - JSON</h3> <p>ESPN has released a <a href="http://developer.espn.com/docs">variety of APIs</a> that allow developers to access headlines and basic team statistics. You’ll need to create a free account and register for a key, at which point you’ll have immediate access to the Public APIs.</p> <p>For example, if I wanted to find stats on my beloved Philadelphia Phillies, I would enter http://api.espn.com/v1/sports/baseball/mlb/teams?apikey=MY_API_KEY as the URL in DataCombinator. Using the JSON Raw tab, I can see the pretty-printed response and quickly search for “Phillies” to find their id of 22. Using this id, I can get the latest news on the Phightins with the URL http://api.espn.com/v1/sports/baseball/mlb/teams/22/news?apikey=MY_API_KEY. I can then use JSONPath to include only the part of the response I want. 
For example, if I just want all the latest headlines associated with the Phillies, I take a quick look at the structure and apply the <code class="language-plaintext highlighter-rouge">$..headline</code> JSONPath query to return an array of headlines:</p> <div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span> <span class="dl">"</span><span class="s2">Mets end 5-game skid, rally past Phils 5-4 in 11</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Howard, Rollins lead Phillies past slumping Mets</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Byrd's double lifts Phillies over Mets 3-2 in 11</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">The base: Approach at your own risk</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Phillies fall to hot-hitting Blue Jays in 20,000th game</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Adam Lind activated by Blue Jays</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Mark Buehrle posts MLB-best sixth win as Blue Jays rock Phillies</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Blue Jays edge Phillies on sac fly in 10th after blowing 5-run lead</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Happ stifles Phillies, Blue Jays win 3-0</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">Hernandez outduels Gonzalez, Phillies edge Nats</span><span class="dl">"</span> <span class="p">]</span> </code></pre></div></div> <h3 id="weather-data---xml">Weather Data - XML</h3> <p>OpenWeatherMap.org provides a <a 
href="http://openweathermap.org/API">free weather API</a> that can return data in XML format. For example, if I wanted to get the current weather in my hometown of Springfield, PA, I could use the URL</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://api.openweathermap.org/data/2.5/weather?q=Springfield&amp;mode=xml&amp;units=imperial </code></pre></div></div> <p>to get an XML document back. I could then query the document using XPath to get just the temperature via <code class="language-plaintext highlighter-rouge">//temperature</code>, which returns:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;temperature max="71.52" min="71.52" unit="fahrenheit" value="71.52"/&gt; </code></pre></div></div> <h3 id="opendata---csv">OpenData - CSV</h3> <p>Public institutions are starting to embrace open data practices, enabling civic-minded hackers to build useful applications that provide a public service. In this spirit, the city of Philadelphia has made <a href="https://github.com/CityOfPhiladelphia">various data sets</a> available for public consumption. Most of these data sets are in CSV format. We’ll take one such data set, <a href="https://github.com/CityOfPhiladelphia/phl-site-stats">phl-site-stats</a>, and use its raw URL from GitHub to query it (I picked this data set because it’s relatively small).</p> <p>We’ll take a look at the latest month’s stats found at <a href="https://raw.githubusercontent.com/CityOfPhiladelphia/phl-site-stats/master/SiteStats0514.csv">https://raw.githubusercontent.com/CityOfPhiladelphia/phl-site-stats/master/SiteStats0514.csv</a>. Without entering a query, we would get the entire data set in the results. One point to note is that the query engine will try to convert strings to numbers, which makes it easy to filter or sort on numeric values. 
If we wanted to view the most popular sites for phila.gov, we would simply enter the query</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="o">*</span> <span class="k">order</span> <span class="k">by</span> <span class="n">page_count</span> <span class="k">desc</span> </code></pre></div></div> <p>Or we could get the total number of unique hits for the month of May:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="k">sum</span><span class="p">(</span><span class="n">unique_page_count</span><span class="p">)</span> </code></pre></div></div> <h2 id="next-steps">Next steps</h2> <p>This query engine will be the foundation of DataCombinator’s platform of data collection and composition tools. Our next step is not only to host structured data via API endpoints, but also to combine multiple data sources into one document (which in turn would be hosted as well!). If you’re interested in learning more, <a href="http://www.datacombinator.com">sign up</a> for e-mail updates or <a href="https://www.twitter.com/DataCombinator">follow us on Twitter @DataCombinator</a>.</p> Tue, 13 May 2014 00:00:00 +0000 jpc2.org/2014/05/13/datacombinator-query-engine.html
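<p>As a rough illustration of the flattening rule described in the SQL section (nested keys joined with an “_”), here is a minimal Python sketch. It is not DataCombinator’s actual implementation: the function names <code class="language-plaintext highlighter-rouge">flatten</code> and <code class="language-plaintext highlighter-rouge">flatten_rows</code> are hypothetical, and the sketch assumes that only nested objects are flattened while arrays and scalar values are left as they are.</p>

```python
from typing import Any, Dict, List


def flatten(obj: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining parent and child keys with `sep`.

    Hypothetical helper for illustration only; lists and scalars are kept as-is.
    """
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        # Prepend the parent key (if any) to build the flattened column name.
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the accumulated name down.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def flatten_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten every object in a JSON array so each one can be treated as a table row."""
    return [flatten(row) for row in rows]


if __name__ == "__main__":
    rows = [
        {
            "id": 1,
            "name": "Joe",
            "address": {"street": "123 Main St.", "city": "Springfield", "state": "PA"},
        }
    ]
    print(flatten_rows(rows))
```

<p>Run against the nested example from the SQL section, this yields a single row with the keys <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">address_street</code>, <code class="language-plaintext highlighter-rouge">address_city</code>, and <code class="language-plaintext highlighter-rouge">address_state</code>, matching the flattened JSON shown there.</p>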