Hello, Haohui here. I'm one of the developers of AthenaX. I'm really glad that the project has attracted so much interest.
How do you see this compared to Prestodb, which also provides SQL for Kafka topics? Being able to join Kafka data with other tables seems like a useful benefit of Prestodb?
It would reduce the latency of the data, but we have not tried it in production.
Given that Kafka neither provides secondary indexes nor organizes data in columnar format, Presto essentially has to scan through the Kafka topics to execute the queries, resulting in a lot of disk I/O.
Our Kafka infrastructure handles more than one trillion messages per day and guarantees second-level latency SLAs. Reading aggressively could easily saturate all the I/O bandwidth of the nodes and lead to outages. We actually had several incidents in the past when we did backfills, so I'm more conservative on this.
Unfortunately the UI did not make it into the first version of the open-source release. It's low priority given that the UI itself is usually heavily tied to the business needs.
We internally have a React-based UI and we are in the process of cleaning it up and open-sourcing it. Please stay tuned!
Interesting project!
I only got to know about Apache Calcite a few weeks or months ago, but it seems to be becoming the de facto way of supporting SQL in analytics products. Dremio [1] is one example that comes to mind.
From an implementation perspective, I wonder how AthenaX compares to the Apache Beam SQL DSL [2] (I haven't tried it out; I'm still learning DoFn, ParDo, etc.). Beam also leverages Calcite, with the difference that it runs on Flink, amongst other runners.
I think it's safe to say that the streaming analytics ecosystem remains interesting and exciting! Hope I'll be able to try out AthenaX, congratulations on the release.
I think it'll only be released in the next minor version. We currently need to build from source to try it, so I'm waiting for the next version. I'm still learning the Java API, which is a lot to cover in a few hours here and there.
In the blog post, you talk about running AthenaX on YARN, so am I correct in saying that I might not be able to run it in single-machine mode? I'd like to try it out (SQL window functions are easier to understand than Beam's Java ones).
Maybe Gitter or a public Slack channel could help; more users will have questions that aren't suitable for HN or GitHub issues.
Although it's slightly frustrating how closely it is named to the Amazon Athena [1] service, which is also a SQL-based analytics service but quite a bit different from Uber's project.
This looks like an amazing piece of software! I have a couple of questions:
- Does it (out of the box) support mapping JSON messages on a Kafka topic to a table accessible from SQL?
- How would you support the following use case: run SQL queries on one or more Kafka topics and expose the results in (near) real time through a BI tool that supports JDBC? I assume you'd use AthenaX to run the queries and store the results in a standard RDBMS like PostgreSQL?
(1) Yes. AthenaX provides a thin wrapper around KafkaJsonTableSource [1] out of the box. As long as you specify the schema, it should be good to go.
(2) Your guess is correct. Internally we have AthenaX streaming data to MemSQL, Pinot, or MySQL to support the use case you described. Some of the pipelines go through the JDBCTableSink [2]; the others go through customized connectors. AthenaX is designed to allow plugging in connectors when required.
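For anyone curious what this looks like at the Flink level (AthenaX generates and submits this kind of job from your SQL), here is a minimal sketch. The topic, schema, table names, and connection details are all made up for illustration, the class names are from memory of Flink 1.3/1.4-era APIs and may differ by version, and a simple filter query is used to keep the JDBC sink append-only:

    import java.util.Properties;

    import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink;
    import org.apache.flink.api.java.typeutils.RowTypeInfo;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.Kafka010JsonTableSource;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.java.StreamTableEnvironment;

    public class KafkaToJdbcDemo {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // (1) Declare a schema to map JSON messages on a Kafka topic to a SQL table.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo");
        RowTypeInfo schema = new RowTypeInfo(
            new TypeInformation<?>[] {
                BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.DOUBLE_TYPE_INFO},
            new String[] {"city", "fare"});
        // Constructor signature varies across Flink versions.
        tEnv.registerTableSource(
            "trips", new Kafka010JsonTableSource("trips", props, schema));

        // (2) Run a SQL query and stream its results into an RDBMS over JDBC.
        Table expensive = tEnv.sqlQuery(
            "SELECT city, fare FROM trips WHERE fare > 50");
        JDBCAppendTableSink sink = JDBCAppendTableSink.builder()
            .setDrivername("org.postgresql.Driver")
            .setDBUrl("jdbc:postgresql://localhost:5432/analytics")
            .setQuery("INSERT INTO expensive_trips (city, fare) VALUES (?, ?)")
            .setParameterTypes(
                BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.DOUBLE_TYPE_INFO)
            .build();
        expensive.writeToSink(sink);
        env.execute("expensive-trips");
      }
    }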
Interesting project, although I can't say I am happy to see SQL being used in streaming systems like this. In my last two jobs I had to write frameworks and tools to enable "data scientists" and "analysts" to write production jobs, and the problem I have run into with exposing SQL to this class of user is that every job ends up being its own special snowflake, with deeply nested SQL and custom UDFs mixed in for good measure. The "unique" nature of each job significantly increases the support and maintenance cost. I have come to the conclusion that a typesafe API with map/filter/flatMap is a much better API to expose than stringly-typed SQL. I am curious to know whether Uber is running into similar support issues?
Our experience is that AthenaX actually lowers our support costs:
(1) There was a significant consultation load when users had to implement their own jobs in Java / Scala and run them in production. Sometimes it turned into co-development, as the users lacked expertise in the streaming analytics frameworks.
(2) We consciously encourage our users to write good SQL via:
(a) enforcing schemas on all analytical Kafka topics.
(b) setting up a team dedicated to helping them use SQL in big data systems (i.e., Hive, Presto, AthenaX, etc.)
For UDFs we provide general guidance and ask our users to be on call for the jobs that use them. The support costs are definitely not zero, but it is still much better than teaching users to write a Samza / Flink / Storm job from scratch.
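To give a sense of the surface area involved, here is roughly what a user-supplied scalar UDF looks like in Flink's Table/SQL layer; the function and all names here are invented for illustration:

    import org.apache.flink.table.functions.ScalarFunction;

    // Users package a function like this into a jar and hand it to the platform.
    public class NormalizeCity extends ScalarFunction {
      // Flink calls eval() once per row; null handling is the author's job.
      public String eval(String city) {
        return city == null ? "unknown" : city.trim().toLowerCase();
      }
    }

    // Registration, typically performed by the platform at job submission:
    //   tEnv.registerFunction("NORMALIZE_CITY", new NormalizeCity());
    //   tEnv.sqlQuery("SELECT NORMALIZE_CITY(city), COUNT(*) FROM trips " +
    //                 "GROUP BY NORMALIZE_CITY(city)");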
That matches my experience teaching some graduates in a BI shop: SQL is more common, and tools that support SQL tend to be used better.
I've "taught" them how to use Spark, but with a team of varying prior experience, the Scala API meant learning Scala; the Python one was a bit better, but they did much better with the SQL DSL.
Regarding your concern about maintainability: UDFs tend to be the problem. I'm also curious to hear about their support issues, and also: can anyone write their own UDF (the code requires registering a .jar), or is there a team that helps business users in that regard?
It is great to see that SQL is becoming one of the de facto standards for building streaming analytics applications.
Personally I'm not a big fan of KSQL, given that:
(1) KSQL is built on top of Kafka Streams. There are use cases for which we don't think Kafka Streams is a good fit. Please see the explanations above.
(2) Inventing yet another SQL dialect is a bad idea in practice. Not only does it incur an additional learning curve, but more importantly you have little hope of winning on development velocity. Calcite has been used by Flink, Storm, Beam, Dremio, etc. That community is simply way bigger than even the total number of engineers at Confluent.
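To make the velocity point concrete: this is roughly all it takes to get a battle-tested, standards-based SQL parser out of Calcite (a sketch; the query is made up), and validation, optimization rules, and dialect extensions are likewise reusable off the shelf:

    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.sql.parser.SqlParseException;
    import org.apache.calcite.sql.parser.SqlParser;

    public class CalciteParseDemo {
      public static void main(String[] args) throws SqlParseException {
        // Parse a query with Calcite's default (standard) SQL dialect.
        SqlNode ast = SqlParser.create(
            "SELECT city, COUNT(*) FROM trips GROUP BY city").parseQuery();
        System.out.println(ast);
      }
    }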
Again, this is just my personal take; it does not reflect the stance of Uber.
I share your sentiment here. Having a common base to work from also has the benefit that end users, the BI or data science practitioners (whom we empower as data engineers), can expect a stable SQL dialect that they'll be able to use everywhere.
I haven't followed Spark in recent months, but what I recall was that the SQL DSL had some caveats at first, because certain things weren't yet implemented.
My observation has been that projects that add SQL later typically don't deliver the whole thing in the initial release. I imagine it's often quite a lot of work.
The more projects use Calcite, the more upstream contributions there will be ...
Spark actually has probably the best standard SQL support among all open source big data frameworks. It can run all of the TPC-DS queries without modification.
TPC-DS evaluates batch processing, and it is great to hear that Spark supports all of its queries.
I think the streaming processing space is still evolving quickly. Many features like stream-table joins, CTEs, and streaming joins with late-arriving data are unimplemented or do not even have clear semantics yet. It would be great to see a benchmark like TPC-DS in this domain.
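As a point of contrast, group windows are one piece whose semantics are settled today. A tumbling-window aggregation like the sketch below already runs on Flink SQL (table, column, and time-attribute names are assumed, along with a StreamTableEnvironment tEnv and registered table as in the sketch further up the thread):

    // An event-time tumbling window; unlike e.g. stream-stream joins with
    // late-arriving data, the semantics of this query are well-defined.
    Table hourly = tEnv.sqlQuery(
        "SELECT city, TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS hour_end, " +
        "       COUNT(*) AS trips " +
        "FROM trips " +
        "GROUP BY city, TUMBLE(rowtime, INTERVAL '1' HOUR)");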
This uses Kafka as a core piece of infrastructure. I realize that Kafka Streams was not available when AthenaX was built, but would it be a good framework to move to?
Before AthenaX, most of our real-time analytics pipelines were on Samza, which to some extent can be seen as a predecessor of Kafka Streams.
The migration from Samza towards Kafka Streams might seem more natural, and we actually took a very close look at Kafka Streams, but we decided to move towards Flink, given that:
(1) Kafka Streams lacks important features like exactly-once delivery and distributed consistent snapshots. These are essential to support use cases that require high fidelity (see the checkpointing sketch after this list).
(2) SQL is a must-have feature in order to empower our users, many of whom are non-technical, to run large-scale streaming analytics in production. Kafka Streams provides no support for that. Arguably Kafka Streams provides simple APIs, but simple APIs by themselves are insufficient to bring analytics applications to production. You can't ignore continuous integration and deployment, monitoring, etc., especially since many of our users come from non-CS backgrounds.
(3) It seems that the Apache Flink community is more open and committed compared to the Kafka community, particularly on the SQL side. We have collaborated with Data Artisans, Alibaba, and Huawei. All of the parties above, ourselves included, are committed and equipped with adequate resources to bring SQL to their respective customers.
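For context on (1): in Flink, distributed consistent snapshots are essentially a one-line switch, which is a large part of why we went this way. A minimal sketch (the interval is arbitrary):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // Draw a consistent snapshot of all operator state every 30 seconds;
    // on failure Flink rolls back to the latest snapshot, giving
    // exactly-once semantics for state.
    env.enableCheckpointing(30_000L, CheckpointingMode.EXACTLY_ONCE);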