Show HN: HyperDX – open-source dev-friendly Datadog alternative

722 points
1/20/1970
8 months ago
by mikeshi42

Comments


addisonj

Wow, there is a lot here, and what is here has a pretty impressive level of polish for how far along this is.

The perspective of someone with a DX background comes through! I will be looking into this a lot more.

Here are a few comments, notes, and questions:

* I like the focus on DX (especially compared to other OSS solutions) in your messaging here, and I think your hero messaging tells that story, but it isn't reinforced as much through the features/benefits section

* It seems like clickhouse is obviously a big piece of the tech here, which is an obvious choice, but from my experience with high data rate ingest, especially logs, you can run into issues at larger scale. Is that something you expect to give options around in open source? Or is the cloud backend a bit different where you can offer that scale without making open source so complex?

* I saw what is in OSS vs cloud and I think it is a reasonable way to segment, especially multi-tenancy, but do you see the split always being more management/security features? Or are you considering functional things? Especially with recent HashiCorp "fun" I think more and more it is useful to be open about what you think the split will be. Obviously that will evolve, but I think that sort of transparency is useful if you really want to grow the OSS side

* on OSS, I was surprised to see MIT license. This is full featured enough and stand alone enough that AGPL (for server components) seems like a good middle ground. This also gives some options for potentially a license for an "enterprise" edition, as I am certain there is a market for a modern APM that can run all in a customer environment

* On that note, I am curious what your target persona and GTM plan are looking like? This space is a bit tricky IMHO, because small teams have so many options at okay price points, but the enterprise is such a difficult beast in switching costs. This looks pretty PLG focused atm, and I think for a first release it is impressive, but I am curious to know if you have more in mind to differentiate yourself in a pretty crowded space.

Once again, really impressive what you have here and I will be checking it out more. If you have any more questions, happy to answer in thread or my email is in profile.

8 months ago

mikeshi42

Thank you, really appreciate the feedback and encouragement!

> It seems like clickhouse is obviously a big piece of the tech here, which is an obvious choice, but from my experience with high data rate ingest, especially logs, you can run into issues at larger scale. Is that something you expect to give options around in open source?

Scaling any system can be challenging - luckily, our experience so far is that Clickhouse has a fraction of the overhead that systems like Elasticsearch have previously demanded. That being said, I think there's always going to be a combination of learnings we'd love to open source for operators that are self-hosting/managing Clickhouse, and tooling we use internally that is purpose-built for our specific setup and workloads.

> I saw what is in OSS vs cloud and I think it is a reasonable way to segment, especially multi-tenancy, but do you see the split always being more management/security features?

In our current release, we've open sourced the vast majority of our feature set, including some (I think) novel features like event patterns that are typically SaaS-only, and that'll definitely be the way we want to continue to operate. Given the nature of observability, we feel comfortable continuing to push a fully-featured OSS version while having a monetizable SaaS that focuses on the fact that it's completely managed, rather than needing to gate heavily based on features.

> on OSS, I was surprised to see MIT license

We want to make observability accessible and we think AGPL will accomplish the opposite of that. While we need to make money at the end of the day - we believe that a well-positioned enterprise + cloud offering is better suited to pull in those that are willing to pay, rather than forcing it via a license. I also love the MIT license and use it whenever I can :)

> On that note, I am curious what your target persona and GTM plan is looking like?

I think for small teams the available options are largely untantalizing: they range from narrow tools like Cloudwatch to enterprise-oriented tools like New Relic or Datadog. We're working hard to make it easier for those kinds of teams to adopt good monitoring and observability from day 1, without the traditional requirement of needing an observability expert or dedicated SRE to get it set up. (Admittedly, we still have a ways to go to improve today!) On the enterprise side, switching costs are definitely high, but most enterprises are highly decentralized in decision making, and I routinely hear of F500s having a handful of observability tools in production at a given time! I'd say it's not as locked-in as it seems :)

8 months ago

addisonj

Thanks for the answers Mike!

One more follow-up on the scale side (which I mentioned with sibling comment), it isn't so much about clickhouse itself, but about scaling up ingest. From my own experience and from talking with quite a few APM players (I previously worked in streaming space), a Kafka / durable log storage kind of becomes a requirement, so I was curious if you think at some point you need a log to further scale ingest.

For enterprise side, I was previously in data streaming space and had quite a few conversations with APM players and companies building their own observability platforms, happy to chat and share more if that would be useful!

8 months ago

debarshri

One piece of advice here: if you pitch yourself as a datadog competitor, then I would recommend replicating some of the GTM motions that datadog employed. For instance, you have an opportunity to go very upmarket, to super enterprise orgs. You can do PLG, but ultimately every tool becomes SLG. I would recommend fine-tuning that motion, as that would be the one bringing larger six-digit contracts and huge growth here.

I have seen orgs remove datadog because of unpredictable pricing. If you offer a flat-price self-hosted platform, you will get attention. I don't think orgs would mind hosting clickhouse. You can also bundle it with your helm charts so that an initial proof of concept has a lower barrier. I know some orgs have million dollar annual contracts with datadog; a cheaper, more predictably priced alternative will definitely get attention.

8 months ago

mx20

MIT License allows Amazon and other Cloud providers to offer Cloud Solutions as well. That's why most SaaS changed to AGPL or better versions that explicitly disallow Cloud offerings.

8 months ago

tmd83

https://www.hyperdx.io/docs/oss-vs-cloud

This page shows event patterns available for both OSS and cloud. The blog doesn't mention exactly how this is being done, which would be an interesting read, but I understand if it's secret sauce.

I recall, quite a few years ago, a standalone commercial & hosted tool for doing something like this just on logs for anomaly detection. Does anyone have a reference for similar tools for working with direct log data (say from log files) or in a similar capacity to hyperdx (oss or commercial)?

8 months ago

datadeft

> While we need to make money at the end of the day

Honest question: What makes you think that you are not turning into a Datadog (price wise) once you reach a certain scale?

The problem I see with software companies is that the pricing is dominated by investor requirements, and when a company reaches a certain milestone it changes up the licensing model and the pricing with it.

8 months ago

dangoodmanUT

For clickhouse, just batch insert. They probably have something batching every few s before inserting directly to their hosted version
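
To illustrate the idea (only a sketch, not how HyperDX or any hosted ClickHouse actually does it): buffer rows in memory and flush when the batch gets big enough or old enough. The flush callback, the 10,000-row limit and the 5 second window are all assumptions picked for the example.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class BatchBuffer:
    """Buffers rows and flushes them as one batch by size or by age."""

    def __init__(
        self,
        flush: Callable[[List[Row]], None],  # hypothetical: would run the real INSERT
        max_rows: int = 10_000,              # assumed size limit
        max_age_s: float = 5.0,              # assumed "every few seconds" window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._clock = clock
        self._rows: List[Row] = []
        self._oldest: Optional[float] = None  # arrival time of the first buffered row

    def add(self, row: Row) -> None:
        if not self._rows:
            self._oldest = self._clock()
        self._rows.append(row)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        """Flush when the buffer is full or its oldest row has waited too long."""
        if not self._rows or self._oldest is None:
            return
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_big or too_old:
            batch, self._rows = self._rows, []
            self._oldest = None
            self._flush(batch)
```

A real producer would also call flush_if_due() from a timer so a quiet stream still gets flushed, and would need to decide what happens to buffered rows if the process dies.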

8 months ago

vadman97

ClickHouse Async insert docs [1].

We ran into some challenges with async inserts at highlight.io [2]. Namely, ClickHouse Cloud has an async flush size configured (that can't be changed AFAIK) that isn't large enough for our scale. Once you async insert more than can be flushed, you get back pressure on your application waiting to write while Clickhouse flushes the queue. We found that implementing our own batched flushing via kafka [3] is far more performant, allowing us to insert 500k+ RPS on the smallest cloud instance type.

[1] https://clickhouse.com/docs/en/optimize/asynchronous-inserts [2] https://github.com/highlight/highlight/tree/main [3] https://github.com/highlight/highlight/blob/4d28451b1935796d...

8 months ago

addisonj

Generally, any sort of async/batch inserts will get you decently far, but they will still have limitations well before you get to a million rows a second, mostly because it is really difficult to get your batch size large enough from individual producers without some sort of aggregation, and that aggregation is a challenge if you care about durability.

So often that means you need something like a Kafka to get the bulk ingest to really perform to get batch sizes large enough.

That kind of gets into one of the challenges of OSS observability systems: you don't want to make the dependencies insane for someone who only has a few thousand logs a second, but generally at some point of scale you do need more.

8 months ago

dangoodmanUT

There's also async inserts
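
For reference, a minimal sketch of switching async inserts on per request through ClickHouse's HTTP interface, by passing the async_insert and wait_for_async_insert settings as URL parameters. The base URL and the logs table name are placeholders for illustration, not anything from this thread.

```python
from urllib.parse import urlencode


def async_insert_url(base_url: str, table: str, wait: bool = True) -> str:
    """Build an HTTP-interface URL for an INSERT with async inserts enabled.

    The rows themselves would be sent as the POST body in JSONEachRow format.
    `table` is a placeholder name and is not validated or escaped here.
    """
    params = {
        "query": f"INSERT INTO {table} FORMAT JSONEachRow",
        "async_insert": 1,
        # 1 = only acknowledge once the server has flushed the buffered data
        "wait_for_async_insert": 1 if wait else 0,
    }
    return f"{base_url}/?{urlencode(params)}"
```

Waiting for the flush is the safer choice for durability; turning it off trades that for lower insert latency.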

8 months ago

fnord77

Clickhouse is proprietary, though.

I wonder why not Apache Druid

8 months ago

zx8080

> Clickhouse is proprietary

No. Clickhouse is opensource with Apache License [0].

[0] - https://github.com/ClickHouse/ClickHouse/blob/master/LICENSE

8 months ago

[deleted]
8 months ago

prabhatsharma

A good one. A lot is being built on top of clickhouse. I can count at least 3 if not more (hyperdx, signoz and highlight) built on top of clickhouse now.

We at OpenObserve are solving the same problem but a bit differently: a much simpler solution that anyone can run using a single binary on their own laptop or in a cluster of hundreds of nodes backed by s3. It covers logs, metrics, and traces (session replay, RUM and error tracking are being released by end of the month) - https://github.com/openobserve/openobserve

8 months ago

francislavoie

8 months ago

hu3

https://github.com/uptrace also uses ClickHouse

8 months ago

t1mmen

This looks really cool, congrats on the launch!

I haven’t had time to dig in properly, but this seems like something that would fit perfectly for “local dev” logging as well. I struggled to find a good solution for this, ending up with Winston -> JSON, with a simpler “dump to terminal” script running.

(The app I’m building does a ton of “in the background” work, and I wanted to present both “user interactions” and “background worker” logs in context)

I don’t see Winston being supported as a transport, but presumably easy to add/contribute.

Good luck!

8 months ago

mikeshi42

Thank you! We do support Winston (docs: https://www.hyperdx.io/docs/install/javascript#winston-trans...) and use it a lot internally. Let me know if you run into any issues with it (or have suggestions on how to make it more clear)

In fact this is actually how we develop locally - because even our local stack is comparatively noisy, we enable self-logging in HyperDX so our local logs/traces go to our own dev instance, and we can quickly trace a 500 that way. (Literally was doing this last night for a PR I'm working on).

8 months ago

t1mmen

Oh sweet! I was in a bit of a hurry and must’ve missed it, thanks for clarifying. This will be super helpful for us, very excited to play with it!

8 months ago

silentguy

Have you tried lnav? It has a somewhat steeper learning curve but it'd fit the bill. One small binary and some log parsing config, and you are good to go.

8 months ago

tstack

I’d be interested in what you found difficult about using lnav, if you have a minute.

8 months ago

corytheboyd

Outside of the intended use-case of _replacing_ Datadog, I think this may actually serve as an excellent local development "Datadog Lite", which I have always wanted, and is something embarrassingly, sorely missing from local development environments.

In local development environments, I want to:

- Verify that tracing and metrics (if you use OpenTelemetry) actually work as intended (through an APM-like UI).

- Have some (rudimentary, even) data aggregation and visualization tools to test metrics with. You often discover missing/incorrect metrics by just exploring aggregations, visualizations, filters. Why do we accept that production (or rather, a remote deployment watched by Datadog etc.) is the correct place to do this? It's true that unknowns are... unknown, but what better time to discover them than before shipping anything at all?

- Build tabular views from structured logs (JSON). It is _mind blowing_ to me that most people seem to just not care about this. Good use of structured logging can help you figure out in seconds what would take someone else days.

I mean, that's it, the bar isn't too high lol. It looks like HyperDX may do... all of this... and very well, it seems?!

Before someone says "Grafana"-- no. Grafana is such a horrible, bloated, poorly documented solution for this (for THIS case. NOT IN GENERAL!). It needs to be simple to add to any local development stack. I want to add a service to my docker compose file, point this thing at some log files (bonus points for some docker.sock discoverability features, if possible), expose a port, open a UI in my browser, and immediately know what to do given my Datadog experience. I'm sure Grafana and friends are great when deployed, but they're terrible to throw into a project and have it just work and be intuitive.

8 months ago

mikeshi42

Yes! We definitely do - in fact this is how we develop locally, our local stack is pretty intricate and can fail in different areas, so it's pretty nice for us to be able to debug errors directly in HyperDX when we're developing HyperDX!

Otel tracing works and should be pretty bulletproof - metrics is still early so you might see some weirdness (we'll need to update the remaining work we've identified in GH issues)

You can 100% build tabular views based on JSON logs, we auto-parse JSON logs and you can customize the search table layout to include custom properties in the results table.
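
As a rough illustration of the general idea (not HyperDX's implementation), here is a sketch that parses JSON log lines and lays out chosen properties as table columns. The field names used in the usage note are made up.

```python
import json
from typing import Iterable, List


def rows_from_json_logs(lines: Iterable[str], columns: List[str]) -> List[List[str]]:
    """Parse JSON log lines and pick out the requested properties as table cells."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        rows.append([str(record.get(col, "")) for col in columns])
    return rows


def render_table(columns: List[str], rows: List[List[str]]) -> str:
    """Render a header plus rows as left-aligned, space-padded text columns."""
    widths = [
        max([len(col)] + [len(row[i]) for row in rows])
        for i, col in enumerate(columns)
    ]

    def fmt(cells: List[str]) -> str:
        return "  ".join(cell.ljust(w) for cell, w in zip(cells, widths)).rstrip()

    return "\n".join([fmt(columns)] + [fmt(row) for row in rows])
```

For example, render_table(cols, rows_from_json_logs(lines, cols)) with cols = ["level", "msg", "user_id"] gives one row per parseable log line, with an empty cell wherever a property is missing.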

Let us know if we fulfill this need - we at least do this ourselves so I feel pretty confident it should work in your use case! If there's anything missing - feel free to ping us on Discord or open an issue, we'd likely benefit from any improvement ideas ourselves while we're building HyperDX :)

Edit: Oh I also talk a bit about this in another comment below https://news.ycombinator.com/item?id=37561358

8 months ago

mikeshi42

Since my comment is too old to edit now - musing on this a bit more I think this would be pretty awesome to turn into a well-supported workflow to have a low-resource-usage/all-in-one version for just local development.

If anyone wants to chat more about this - I've kicked off an issue [1] to gather interest and everyone's feedback.

[1] https://github.com/hyperdxio/hyperdx/issues/7

8 months ago

carlio

I use InfluxDB for this, it comes with a frontend UI and you can configure Telegraf as a statsd listener, so it's pretty much the same metric ingestion as datadog. There are docker containers for these, which I have added to my docker-compose for local dev.

I think it does log ingestion too, I haven't ever used that, I mostly use it just for the metrics and graphing.
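
For anyone curious what the app side of that looks like, a minimal sketch of emitting StatsD-format metrics over UDP. It assumes a statsd listener on 127.0.0.1:8125 (the conventional StatsD port); the metric names are made up.

```python
import socket


def format_statsd(name: str, value, metric_type: str = "c") -> str:
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(name: str, value, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a single metric over UDP to an assumed local listener."""
    payload = format_statsd(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Usage would be something like send_metric("app.requests", 1) for a counter or send_metric("app.latency_ms", 12.5, "ms") for a timer.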

8 months ago

pighive

Do you mind sharing any publicly available examples of this set up on github or somewhere? TIA

8 months ago

corytheboyd

That sounds very promising indeed! It might be enough for what I’m after for my projects!

8 months ago

[deleted]
8 months ago

snowstormsun

8 months ago

Kiro

Not applicable when the base offering is free and open source. The SSO is in the base pricing in this case.

8 months ago

yaleman

It's literally a big red X on the OSS version, so no, it's not "in the base pricing".

8 months ago

jamesmcintyre

This looks really promising, will definitely look into using this for a project i'm working on! Btw I've used both datadog and newrelic in large-scale production apps and for the costs I still am not very impressed by the dx/ux. If hyperdx can undercut price and deliver parity features/dx (or above) i can easily see this doing well in the market. Good luck!

8 months ago

mikeshi42

Thank you! Absolutely agree on Datadog/New Relic DX, I think the funny thing we learned is that most customers of theirs mention how few developers on their team actually comfortably engage with either New Relic or Datadog, and most of the time end up relying on someone to help get the data they need!

Definitely striving to be the opposite of that - and would love to hear how it goes and any place we can improve!

8 months ago

Hamuko

Datadog feels like they've used a shotgun to shoot functionality all over the place. New Relic felt a bit more focused, but even then I had to go attend a New Relic seminar to properly learn how to use the bloody thing.

8 months ago

pighive

What does dx/ux mean in this context? Data Diagnostics?

8 months ago

Dockson

Just want to heap on with the praise here and say that this was definitely the best experience I've had with any tool trying to add monitoring for a Next.js full-stack application. The Client Sessions tab where I, out of the box, can correlate front-end actions and back-end operations for a particular user is especially nice.

Great job!

8 months ago

wrn14897

Thank you. This means a lot to us.

8 months ago

[deleted]
8 months ago

cheema33

I am new to this space and was considering a self hosted install of Sentry software. Sentry is also opensource and appears to be similar to datadog and HyperDX in some ways. Do you know Sentry and can you tell us how your product is different?

Thanks.

8 months ago

mikeshi42

Very familiar with Sentry! I think we have a bit of overlap in that we both do monitoring and help devs debug though here's where I think we differ:

HyperDX:

- Can collect all server logs (to help debug issues even if an exception isn't thrown)

- We can collect server metrics as well (CPU, memory, etc.)

- We accept OpenTelemetry for all your data (logs, metrics, traces) - meaning you only need to instrument once and choose to switch vendors at any time if you'd like without re-instrumenting.

- We can visualize arbitrary data (what's the response time of endpoint X, how many users did action Y, how many times do users hit endpoint X grouped by user id?) - Sentry is a lot more limited in what it can visualize (mainly because it collects more limited amounts of data).

Sentry:

- Great for exception capture, it tries to capture any exception and match them with sourcemap properly so you can get to the right line of code where the issue occurred. We don't have proper sourcemap support yet - so our stack traces point to minified file locations currently.

- Gives you an "inbox" view of all your exceptions so you can see which ones are firing currently. You can do something similar in HyperDX (error logs, log patterns, etc.), but theirs is more opinionated to be an email-style inbox, whereas ours is more about searching errors.

- Link your exceptions to your project tracker, so you can create Jira, Linear, etc. tickets directly from exceptions in Sentry.

I don't think it's an either/or kind of situation - we have many users that use both because we cover slightly different areas today. In the future we will be working towards accepting exception instrumentation as well, to cover some of our shortfalls when it comes to Sentry v HyperDX (since one common workflow is trying to correlate your Sentry exception to the HyperDX traces and logs).

Hope that gives you an idea! Happy to chat more on our Discord if you'd like as well.

8 months ago

e12e

Does that mean hyperdx doesn't (yet) support exception logs?

8 months ago

mdaniel

> Sentry is also opensource

Well, pedantically the 5 year old version of Sentry is open source, sure

8 months ago

the_mitsuhiko

The rollover of the license is 3 years, not 5 years.

8 months ago

vadman97

How do you think about the query syntax? Are you defining your own or are you following an existing specification? I particularly love the trace view you have, connecting a frontend HTTP request to server side function-level tracing.

8 months ago

mikeshi42

This one is a fun one that I've spent too many nights on - we're largely similar to Google-style search syntax (bare terms, "OR" "AND" logical operators, and property:value kind of search).

We include a "query explainer" - which translates the parsed query AST into something more human readable under the search bar, hopefully giving good feedback to the user on whether we're understanding their query or not. Though there's lots of room to improve here!
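
A toy sketch of what this style of syntax and an "explainer" could look like. This is not HyperDX's parser: the token kinds and the wording of the explanation are invented for illustration, and there is no operator precedence or grouping.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str   # "op", "prop", or "term"
    key: str    # property name for "prop" tokens, otherwise ""
    value: str


def tokenize(query: str) -> List[Token]:
    """Split a query into bare terms, AND/OR operators, and property:value pairs."""
    tokens = []
    for raw in shlex.split(query):  # shlex keeps "quoted phrases" together
        if raw in ("AND", "OR"):
            tokens.append(Token("op", "", raw))
        elif ":" in raw and not raw.startswith(":"):
            # Simplification: anything containing a colon counts as a property
            # filter, so a bare URL or a quoted "a:b" would be misread here.
            key, _, value = raw.partition(":")
            tokens.append(Token("prop", key, value))
        else:
            tokens.append(Token("term", "", raw))
    return tokens


def explain(query: str) -> str:
    """Turn the token list into a human-readable description of the query."""
    parts = []
    for tok in tokenize(query):
        if tok.kind == "op":
            parts.append(tok.value)
        elif tok.kind == "prop":
            parts.append(f'{tok.key} equals "{tok.value}"')
        else:
            parts.append(f'text contains "{tok.value}"')
    return " ".join(parts)
```

Echoing the interpretation back like this makes it obvious when a query was read differently from what the user meant.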

8 months ago

gajus

Potentially useful resource – https://github.com/gajus/liqe

8 months ago

boundlessdreamz

Looks great

1. Are you funded?

2. https://www.deploysentinel.com/ - Are you going to work on this further?

8 months ago

mikeshi42

Thank you - yes and yes as well w.r.t. the qs!

8 months ago

nodesocket

Congrats on the launch. Perhaps I missed it, but what are the system requirements to run the self-hosted version? Seems decently heavy (Clickhouse, MongoDB, Redis, HyperDX services)? Is there a Helm chart to install into k8s?

Looking forward to the syslog integration, which says coming soon. I have a hobby project which uses systemd services for each of my Python apps and the path of least resistance is to just ingest syslog (aware that I lose stack traces, session replay, etc).

8 months ago

mikeshi42

The absolute bare minimum I'd say is 2GB RAM, though in the README we do say 4GB and 2 cores for testing, obviously more if you're at scale and need performance.

For Syslog - it's something we're actually pretty close to because we already support Heroku's syslog based messages (though it's over HTTP), but we largely need to test that the otel Syslog receiver + parsing pipeline will translate as well as it should (PRs always welcome of course, but it shouldn't be too far out from now ourselves :)). I'm curious, are you using TLS/TCP syslog or plain TCP or UDP?

Here's my docker stats on a x64 linux VM where it's doing some minimal self-logging, I suspect the otel collector memory can be tuned down to bring the memory usage closer to 1GB, but this is the default out-of-the-box stats, and the miner can be turned off if log patterns isn't needed:

  CONTAINER ID  NAME                       CPU %  MEM USAGE / LIMIT    MEM %   NET I/O          BLOCK I/O        PIDS
  439e3f426ca6  hdx-oss-miner              0.89%  167.2MiB / 7.771GiB  2.10%   3.25MB / 6.06MB  8.85MB / 0B      21
  7dae9d72913d  hdx-oss-task-check-alerts  0.03%  83.65MiB / 7.771GiB  1.05%   6.79MB / 9.54MB  147kB / 0B       11
  5abd59211cd7  hdx-oss-app                0.00%  56.32MiB / 7.771GiB  0.71%   467kB / 551kB    6.23MB / 0B      11
  90c0ef1634c7  hdx-oss-api                0.02%  93.71MiB / 7.771GiB  1.18%   13.2MB / 7.87MB  57.3kB / 0B      11
  39737209c58f  hdx-oss-hostmetrics        0.03%  72.27MiB / 7.771GiB  0.91%   3.83GB / 173MB   3.84MB / 0B      11
  e13c9416c06e  hdx-oss-ingestor           0.04%  23.11MiB / 7.771GiB  0.29%   73.2MB / 89.4MB  77.8kB / 0B      5
  36d57eaac8b2  hdx-oss-otel-collector     0.33%  880MiB / 7.771GiB    11.06%  104MB / 68.9MB   1.24MB / 0B      11
  78ac89d8e28d  hdx-oss-aggregator         0.07%  88.08MiB / 7.771GiB  1.11%   141MB / 223MB    147kB / 0B       11
  8a2de809efed  hdx-oss-redis              0.19%  3.738MiB / 7.771GiB  0.05%   4.36MB / 76.5MB  8.19kB / 4.1kB   5
  2f2eac07bedf  hdx-oss-db                 1.34%  75.62MiB / 7.771GiB  0.95%   105MB / 3.79GB   1.32MB / 246MB   56
  032ae2b50b2f  hdx-oss-ch-server          0.54%  128.7MiB / 7.771GiB  1.62%   194MB / 45MB     88.4MB / 65.5kB  316

8 months ago

nodesocket

Thanks for the reply and providing detailed system requirements and docker stats. Seems I missed the note in the README. :-)

Actually I am not really using syslog per se, but systemd journalctl, whose default behaviour on Debian (rsyslog) also duplicates to /var/log/syslog.

  StandardOutput=journal
  StandardError=journal

Is there a better integration to pull logs from my systemd services and journalctl up to HyperDX?
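
One generic option while waiting on a native integration, sketched under assumptions: read the journal as JSON (journalctl -o json) and map each entry to a flat record that any log shipper could forward. The output record shape here is made up and is not a HyperDX format.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict

# syslog priority numbers 0..7 as used in the journal's PRIORITY field
PRIORITY_NAMES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def journal_entry_to_record(line: str) -> Dict[str, Any]:
    """Convert one line of `journalctl -o json` output into a flat log record."""
    entry = json.loads(line)
    micros = int(entry["__REALTIME_TIMESTAMP"])  # microseconds since the epoch
    priority = int(entry.get("PRIORITY", 6))     # assume "info" when missing
    return {
        "timestamp": datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc).isoformat(),
        "unit": entry.get("_SYSTEMD_UNIT", ""),
        "level": PRIORITY_NAMES[priority],
        "message": entry.get("MESSAGE", ""),
    }
```

It could be fed from something like `journalctl -o json -f -u myapp.service` (the unit name is a placeholder), one line at a time.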

8 months ago

Wulfheart

So do I understand the landing page correctly: It is possible to run Clickhouse using an Object Storage like S3? What are the performance implications?

8 months ago

mikeshi42

You can definitely run Clickhouse directly on S3 [1] - though we don't run _just_ on S3 for performance reasons but instead use a layered disk strategy.

A few of the weaknesses of S3 are:

1. API calls are expensive: while storage in S3 is cheap, writing/reading into it is expensive. Using only S3 for storage will incur lots of API calls as Clickhouse will work on merging objects together (which requires downloading the files again from S3 and uploading a merged part) continuously in the background. And searching on recent data on S3 can incur high costs as well, if you're constantly needing to do so (ex. alert rules)

2. Latency and bandwidth of S3 are limited, SSDs are an order of magnitude faster to respond to IO requests, and also on-device SSDs typically have higher bandwidth available. This typically is a bottleneck for reads, but typically not a concern for writes. This can be mitigated by scaling out network-optimized instances, but is just another thing to keep in mind.

3. We've seen some weird behavior on skip indices that can negatively impact performance in S3 specifically, but haven't been able to identify exactly why yet. I don't recall if that's the only weirdness we see happen in S3, but it's one that sticks out right now.

Depending on your scale and latency requirements - writing directly to S3 or a simple layered disk + S3 strategy might work well for your case. Though we've found that scaling S3 to work at the latencies/scales our customers typically ask for requires a bit of work (as with scaling any infra tool for production workloads).

[1] https://clickhouse.com/docs/en/integrations/s3
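
To make point 1 a bit more concrete, a back-of-envelope sketch of how request charges add up. The request rates and per-1,000-request prices in the usage note are placeholders, not measurements or quoted pricing; plug in your own provider's numbers.

```python
SECONDS_PER_30_DAYS = 30 * 24 * 3600


def monthly_request_cost(
    puts_per_s: float,
    gets_per_s: float,
    put_price_per_1k: float,
    get_price_per_1k: float,
    seconds: int = SECONDS_PER_30_DAYS,
) -> float:
    """Estimate object-storage request charges over a 30-day month.

    Prices are per 1,000 requests. Storage and bandwidth charges are ignored.
    """
    put_cost = puts_per_s * seconds / 1000 * put_price_per_1k
    get_cost = gets_per_s * seconds / 1000 * get_price_per_1k
    return put_cost + get_cost
```

For instance, with hypothetical inputs of 10 writes/s and 100 reads/s at 0.005 and 0.0004 per 1,000 requests, monthly_request_cost(10, 100, 0.005, 0.0004) comes to roughly 233 per month before any storage is billed, which is why background merges and frequent alert queries matter.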

8 months ago

mnahkies

One thing I appreciate about sentry compared to datadog is the ability to configure hard caps on ingestion to control cost. AFAIK the mechanism is basically that the server starts rate limiting/rejecting requests and the client SDKs are written to handle this and enter a back off state or start sampling events.

I think this could be a nice point of difference to explore that can help people avoid unexpected bills
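
A minimal sketch of that client-side behaviour as I understand it (not Sentry's actual algorithm): halve the sample rate whenever the server answers 429, and recover gradually on success. The halving/doubling steps and the 1% floor are assumptions.

```python
import random
from typing import Optional


class BackoffSampler:
    """Client-side guard that samples events more aggressively after rejections."""

    def __init__(self, min_rate: float = 0.01, rng: Optional[random.Random] = None) -> None:
        self.sample_rate = 1.0      # fraction of events currently being sent
        self.min_rate = min_rate    # assumed floor so some events always get through
        self._rng = rng or random.Random()

    def on_response(self, status_code: int) -> None:
        """Adjust the sample rate based on the ingest server's last response."""
        if status_code == 429:
            # rate limited: back off by halving what we send
            self.sample_rate = max(self.min_rate, self.sample_rate / 2)
        elif 200 <= status_code < 300:
            # accepted: recover by doubling, capped at sending everything
            self.sample_rate = min(1.0, self.sample_rate * 2)

    def should_send(self) -> bool:
        """Decide whether to send the next event under the current sample rate."""
        return self._rng.random() < self.sample_rate
```

The point is that the cap is enforced server-side while the SDK degrades gracefully instead of retrying in a tight loop.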

8 months ago

mikeshi42

Agreed on needing better tooling for surprise bills - definitely no stranger to that problem!

For now we're trying to make the base price cheap enough where those kinds of considerations don't need to be top of mind today and a policy that can be forgiving when it occasionally happens, but certainly as we continue to scale and grow, we'll need to put in proper controls to allow users to define what should happen if events are spiking unexpectedly (how to shed events via sampling, what needs to be explicitly preserved for compliance reasons, when to notify, etc.)

I do like Sentry's auto-sampling algorithm which is a really neat way to solve that issue.

8 months ago

mfkp

Looks very interesting, although a lot of the OpenTelemetry libraries are incomplete: https://opentelemetry.io/docs/instrumentation/

Especially Ruby, which is the one that I would be most interested in using.

8 months ago

mikeshi42

The OpenTelemetry ecosystem is definitely still young depending on the language, but we have Ruby users onboard (typically using OpenTelemetry for the tracing portion, and piping logs via Heroku or something else via the regular Ruby logger).

Feel free to pop in on the Discord if you'd like to chat more/share your thoughts!

8 months ago

[deleted]
8 months ago

jacobbank

Just wanted to say congrats on the launch! We recently adopted hyperdx at Relay.app and it's great.

8 months ago

mikeshi42

Thank you - it's been awesome working with you guys! :)

8 months ago

kcsavvy

The session playback looks useful - I find this is missing from many DD alternatives I have seen.

8 months ago

mikeshi42

Absolutely! It's pretty magical to go from a user report -> session replay -> exact API call being made and the backend error logs.

We dogfood a ton internally and (while obviously biased) we're always surprised how much faster we can pinpoint issues and connect alarms with bug reports.

Hope you give us a spin and feel free to hop on our discord or open an issue if you run into anything!

8 months ago

[deleted]
8 months ago

solardev

This is awesome! Datadog's one of my favorite providers, and their pricing is great for small businesses, but probably unaffordable for larger businesses (as pointed out in these threads).

This is slick and fast. Will have to check it out. Thanks for making it!

8 months ago

mikeshi42

Thank you - let me know how it goes when you're trying it out, would love to learn how you feel it compares to Datadog :)

8 months ago

robertlagrant

If you want my two favourite Datadog features, they were: 1) clicking on a field and making it a custom search dimension in another click, and 2) flame graphs. Delicious flame graphs.

8 months ago

mikeshi42

We should have both! If you hover over a property value, a magnify/plus icon comes up to allow you to search on that property value (no manual facets required) - and our traces all come with delicious flame graphs :) Let me know if you were thinking of something different.

One other thing I think you'd love if you're coming from Datadog is that you're able to full text search on structured logs as well, so even if the value you're looking for lives in a property, it's still full text searchable (this is a huge pain we hear from other Datadog users)

If there's anything you love/hate about Datadog - would love to learn more!

8 months ago

robertlagrant

Well - the worst thing about Datadog is the sales process :-) But I'll save that for my memoirs. I seem to remember at the time their K8s/Helm integration was a little buggy, but no other pain than that. Plugging our software in was very easy, I recall. We had Python in the backend and we just installed their software and wired it into our API services. I also remember they had a consumer for Auth0 via Auth0's log streaming feature, which we were using at the time.

Btw I haven't checked your product out yet; I was just reminiscing :-) I'll take a look soon.

8 months ago

[deleted]
8 months ago

technics256

Is there a guide for integrating this in local dev, either locally or if you want to view it on the hosted?

Ideally hosted, devs can bring up our app locally, and view their logs and traces etc when testing and building

8 months ago

mikeshi42

There shouldn't be any differences with how you want to set things up for local vs production telemetry (in fact all our users test locally typically before pushing it out to staging/prod).

Of course if your local/prod run completely differently and require different instrumentation, that might be trickier.

I'm wondering if you had a specific use case in mind? Happy to dive more into how it should be done (feel free to join on Discord too if you'd like to chat there)

8 months ago

choppaface

what is DX?

why not grafana / prometheus / loki?

8 months ago

mikeshi42

(Since DX is already explained...)

Grafana/Prom/Loki is an awesome stack - overall I'd say that we try to correlate more signals in one place (your logs <> traces <> session replay), and we also take an approach to go more dev-friendly to query instead of going the PromQL/LogQL route.

It's a stack I really wanted to love myself as well but I've personally run into a few issues when using it:

Loki is a handful to get right, you have to think about your labels, they can't be high-cardinality (ex. IDs), the search is really slow if it's not a label, and the syntax is complex because it's derived from PromQL which I don't think is a good fit for logs. This means an engineer on your team can't just jump in and start typing keywords to match on, nor can they just log out logs and know they can quickly find it again in prod. Engineers need to filter logs by a label first and then wait for a regex to run if they want to do full-text search.

Prometheus is pretty good, my only qualm is again the approachability of PromQL - it's rare for an engineer who isn't fluent with time-series/metric systems to pick up all the concepts very quickly. This means that metrics access is largely limited to premade dashboards or a certain set of engineers that know the Prometheus setup really well.

Grafana has definitely set the standard for OSS metrics, but I personally haven't had a lot of success using their tools outside of metrics, though ymmv and it's all about the tradeoffs you're looking for in an observability tool.

8 months ago

coel

DX is Developer eXperience

8 months ago

TheHiddenSun

why does the docs page force you to log in when trying to open/view it? https://www.hyperdx.io/docs/install/javascript

8 months ago

bg46z

For highly regulated workloads, would it be possible to have a self-hosted version that is supported?

8 months ago

mikeshi42

Absolutely! You can either self-host the OSS version today, or chat with us (mike@hyperdx.io) directly if you need a managed on-prem solution or any other custom requirements depending on your deployment.

8 months ago

drchaim

the idea of different features in oss vs cloud makes sense, but please, support email in oss, it's easy and makes the platform usable.

8 months ago

mikeshi42

Definitely, we want to make it easy to integrate arbitrary email providers instead of whatever vendor we happened to have integrated natively right now. It's not an intentional paid feature gate as much as it's just something we didn't get time to put in an OSS-ready workflow for the OSS launch.

We're thinking of being able to allow users to create a custom webhook alert so you can get full flexibility on what vendor you use and how the alert should be crafted, would love to hear your feedback there, though may need to ship some stopgap solutions depending on demand!

https://github.com/hyperdxio/hyperdx/issues/2

8 months ago

bmikaili

Do you guys have open roles for juniors?

8 months ago

mikeshi42

Unfortunately not at this time :(

8 months ago

agoldis

Very nice, congrats on the launch!

8 months ago

mikeshi42

thank you!

8 months ago

dangoodmanUT

S3-backed CH merge trees are notoriously expensive due to the high API call rates. We have a table doing over 11M API calls per day. What are you seeing?

8 months ago

mikeshi42

We use a mix of SSDs and S3 for storage depending on the workload - as you're right, merging on S3 is awful and we try to avoid it!

8 months ago

parhamn

Is anyone doing these on cloudflare r2 where the cost is significantly lower?

8 months ago

mikeshi42

I'd love to be using Cloudflare as our cloud provider, but it didn't seem to make a lot of sense for our use case.

We were concerned with some of the performance benchmarks we've seen with R2 in the past (though they've probably improved), not to mention our compute options become a bit more limited to bandwidth alliance clouds, otherwise we'll be eating network egress fees (which I do hate with a HUGE passion).

Though I can imagine if you're comfortable with one of the bandwidth alliance clouds already and can take a bit of a perf hit for search, R2 and Backblaze both can provide some cost savings depending on your workload.

8 months ago

cldellow

R2 is significantly cheaper for egress, but not for API calls. It's still cheaper for API calls, but only by 10%:

- 1M GETs $0.36 (R2) vs $0.40 (S3)

- 1M PUTs $4.50 (R2) vs $5.00 (S3)
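To put those prices next to the 11M API calls per day mentioned upthread, here is a rough back-of-the-envelope sketch. The per-million prices are the ones listed above; the share of calls that are PUT-class requests (`put_fraction`) is a made-up assumption for illustration, since the real GET/PUT mix isn't given anywhere in the thread.

```python
# Per-million request prices quoted in the comment above (USD).
PRICES = {
    "S3": {"get": 0.40, "put": 5.00},
    "R2": {"get": 0.36, "put": 4.50},
}


def monthly_api_cost(provider, calls_per_day, put_fraction, days=30):
    """Estimate the monthly request cost for a daily call volume.

    put_fraction is a hypothetical share of calls billed at the PUT price;
    the rest are billed at the GET price.
    """
    price = PRICES[provider]
    puts = calls_per_day * put_fraction
    gets = calls_per_day - puts
    daily = (gets * price["get"] + puts * price["put"]) / 1_000_000
    return daily * days


if __name__ == "__main__":
    calls_per_day = 11_000_000  # figure mentioned upthread
    for provider in PRICES:
        cost = monthly_api_cost(provider, calls_per_day, put_fraction=0.2)
        print(f"{provider}: ${cost:,.2f}/month")
```

Whatever mix you plug in, the R2 total comes out at 90% of the S3 total, because both request prices are discounted by the same 10%. Egress, not request pricing, is where the two differ a lot.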

8 months ago

[deleted]
8 months ago

fuddle

Congrats on the launch! Are you planning to release the cloud features as source available or are they closed source?

8 months ago

mikeshi42

Our cloud features are closed source in a downstream repo - I think repos that have a very clear separation between OSS and closed are best - this also enforces that our OSS is always a fully-featured product that we develop on the OSS-only version day to day, and our cloud features are only a minor addition on top.

I've historically hit issues with repos that do an `ee` folder and blur the line between what is truly open source and self-hostable, vs need a license/cloud-only. I understand why they do that, but I hope we don't replicate that confusion ourselves :)

8 months ago

[deleted]
8 months ago

drchaim

another Clickhouse wrapper :)

8 months ago

euph0ria

Is it possible to have retention longer than 30 days? What is the price for that?

Do you have DPA agreements for GDPR?

8 months ago

mikeshi42

Yes we do support longer retention - for custom retention/plans, it'd probably be best to chat over email (mike@hyperdx.io). Though if you _only_ need retention due to compliance reasons (just need them around somewhere) - we can forward your events to your own S3 bucket for cold storage as well.

As for DPAs - yes!

8 months ago

podoman

Looks very similar to what we're doing at https://highlight.io. Would love to trade notes at some point.

One thing to consider with your messaging is that when you start speaking to large companies, they won't see you as a datadog alternative. They'll see you as a mix of sentry + fullstory + honeycomb.

Datadog originally found its success with its metrics products, and the larger the buyer of datadog gets, the more metrics-esque use case a company finds. The session replay, logging and other things are simply products that datadog tacks on.

That being said, this is clearly a large market (which is why we're working on it). I particularly like the tracing UI that y'all have and I'd love to chat with your team at some point. Good luck.

8 months ago

presentation

It seems there are a lot of Datadog competitor upstarts - also saw Axiom.co recently, though that one doesn't appear to be open source. As a developer not well-versed in observability tooling I don't really have a basis for comparing all these.

8 months ago

distantsounds

You're charging for your product, this is MIT licensed. As the meme goes, "we are not the same."

8 months ago

paulgb

Highlight is Apache-2, which is for all intents and purposes equivalent to MIT if the work is not subject to patent. (this is my understanding, IANAL)

8 months ago

podoman

As other commenters mentioned, we are both comparable (pending your opinion on the MIT license).

We both charge a cloud saas fee as well:

https://www.hyperdx.io/pricing https://www.highlight.io/pricing

8 months ago

endisneigh

they both charge money and they're both some variant of open source.

8 months ago

[deleted]
8 months ago

Sytten

Does anyone have objective blogs/videos that tested/compared all those new platforms? I feel like I see a new one on HN every month. From my quick research: signoz, openobserve, uptrace, highlight.io, opstrace. I would like to recommend some alternatives to my clients, but I don't have time to test them all and keep up with their progress.

I am also worried about long term viability of those platforms. Consolidation is bound to happen, opstrace was in my bookmarks last year and they got acquired. Guessing others will follow, since I don't really think they are sustainable without on-going VC funding. Interested to get thoughts on that.

8 months ago

tmd83

I would love to read something like that too. I find such tools are fairly hard to evaluate since some of the challenges only come with scale and you often need a real/realistic scenario to actually figure out if the tool will be useful in a pinch.

8 months ago

[deleted]
8 months ago

btown

The union of session replay and OpenTelemetry is fascinating - because what is a browser session, really, other than a sequence of RPCs between backend (micro)services <-> API server(s) <-> browser <-> human at the keyboard?

Being able to see that a user bounced because they couldn't handle the input that they were seeing - is it all that different from a service erroring because it cannot handle a certain type of input?

Honeycomb is great for the OpenTelemetry part on the server side (and with https://docs.honeycomb.io/getting-data-in/opentelemetry/brow... is moving towards full-stack), and systems like Posthog and Heap are great for sending session replay + browser events -> Clickhouse. But I don't think I've seen a great DX that ties everything together.

To that point - I would love to see different font/color options for HyperDX: the monospaced font can become tiring to read when so dense. Will be following this project closely though - this is amazing work so far!

8 months ago

mikeshi42

Oh yeah browsers are really just another service (and that's what we try to treat it as, as well!) and it's really the same set of questions you'd ask of any service, but for some reason the tooling completely stops either at the frontend or at the backend.

As for monospace font - feedback received! Is there a particular section you think is too overwhelming? (search page, nav bar, etc.) We've been thinking of how can we balance between the ease of monospace for reading instead of having it literally the default on every UI surface :P

8 months ago

[deleted]
8 months ago

hernantz

There is also SigNoz [0] solving the same problem with a similar stack (OpenTelemetry and Clickhouse)

[0] https://github.com/SigNoz/signoz

8 months ago

[deleted]
8 months ago

pranay01

Congrats on the launch!

Do also check out SigNoz [1] We are working on a similar problem statement ;)

[1] https://github.com/signoz/signoz

8 months ago

[deleted]
8 months ago

candiddevmike

Since this is MIT, someone should fork it and add SSO to the OSS version/remove the SSO tax. Looks like they're just using Passport for auth, shouldn't take much to enable the OAuth bits of it.

That's why this is MIT right, so folks can contribute stuff like this?

8 months ago

mikeshi42

We're more than happy to have users self-host and deploy in a way that works with their SSO provider! Whether that's via SSO on Nginx or forking and adding SSO to Passport in their fork. Depending on the provider, it's likely very straight-forward to do.

We did explicitly choose MIT for the freedom of end users to deploy and modify the code how they want - and tried to open source pretty much everything that doesn't have a hard 3rd party dependency. We do touch a bit on how we think about the open core model as well in the README, and largely align with Gitlab's stewardship model [1] when it comes to paid vs OSS. In this case, a contribution to add SAML specifically to OSS will likely not be merged. It'd also introduce complexities with maintaining that alongside our cloud version that already includes a specific implementation of SAML.

[1] https://handbook.gitlab.com/handbook/company/stewardship/

8 months ago

fuddle

The "SSO tax" is used to fund development of the project.

8 months ago

[deleted]
8 months ago

user3939382

I'm interested. Datadog is cool but the price is ridiculously high for small orgs.

8 months ago

mikeshi42

Agreed! Its per-host pricing can obliterate budgets if you use a fleet of small instances (which is crazy to me, that their pricing dictates your infra...)

Would love to have you check us out! Let me know if you run into any issues - feel free to hop on our discord as well :)

8 months ago

thelastparadise

Is prometheus/grafana still the recommended FOSS solution?

8 months ago

[deleted]
8 months ago

jefc1111

Hey, cool product. I know that marketing success is not predicated on good grammar, nevertheless I felt moved to suggest a minor edit to your blurb:

"HyperDX helps engineers figure out why production is broken, faster. HyperDX centralises and correlates logs, metrics, traces, exceptions and session replays in one place."

Good luck!

8 months ago

joshxyz

Everyone says that.

How about: "9 out of 10 devs are now pushing to prod on fridays. Thanks to HyperDX. Hehe."

8 months ago

mikeshi42

Thank you! I'm assuming this is in reference to our README? (Sorry I'm a _tad_ lacking in sleep)

If so, would you like to open a PR? I'm also happy to edit it myself but of course don't want to be stealing credit if you'd like to be attributed that way.

8 months ago

[deleted]
8 months ago

codegeek

How are you different compared to similar tools like SigNoz?

8 months ago

mikeshi42

Overall we're highly focused on providing solid developer workflows, ex. with HyperDX users can correlate a log to a trace (and vice-versa) really easily in the same UI, we don't silo out features that are commonly needed in a single workflow. You can also search everything from a single panel, whether it's a log, trace, or client-side event, using the same syntax which means there's less to learn.

Feature-to-feature, I'd say the things we do better are browser-side monitoring (session replay), event patterns/clustering, and we have first-party SDKs built on OpenTelemetry to make the setup a lot easier than vanilla OpenTelemetry.

I think Signoz has built a nice one-stop platform for observability, whereas we go one step further and focus on the developer experience to ensure anyone can fully leverage that observability data!

8 months ago

[deleted]
8 months ago

vosper

We've seen a fair few "Datadog alternatives" on HN over the years. Does that mean that Datadog is the reference or gold-standard system to beat, or to compare your product to?

Kind of like how people mostly promote "Elasticsearch alternatives" and not "Solr alternatives".

8 months ago

mikeshi42

It's a pretty scattered landscape with everyone wanting something slightly different, but everyone has likely heard of Datadog at one point or another (whether they wanted to or not... but that's another story).

It becomes convenient short-hand for what they do (collect logs, metrics, traces, RUM, etc. for engineers to debug).

Though with more characters to write, I'd like to think we have a different take on both how our pricing model works and how easy it should be for an engineer to get started with us :)

8 months ago

viraptor

It's a relatively ok priced system which has almost everything: server and client performance, alerts, dashboards, logs, profiling, tracing, etc. It's not amazing and has some issues, but it's one place to get lots of things you want and it's good enough for many. I wouldn't say gold-standard, but rather a benchmark for "you have to be this tall to play the observability product game".

8 months ago

[deleted]
8 months ago

dgoncharov

This could be huge for healthcare companies like Metriport [1] - do you sign BAAs with customers for HIPAA compliance?

[1] https://github.com/metriport/metriport

8 months ago

mikeshi42

Definitely familiar with the compliance needs there - more than happy to chat further about BAAs and HIPAA compliance requirements with you guys. Always love partnering with others in the OSS space :)

8 months ago

[deleted]
8 months ago

lopkeny12ko

I remember when every SaaS landing page looked like Slack, then they all looked like Stripe, and I guess now they all look like Linear.

8 months ago

mikeshi42

I designed our landing page - and I definitely took heavy inspiration from Linear. As an engineer, creating novel beautiful designs isn't second nature to me, but I know how critical it can be to make a clean/impactful landing page so I try to take some elements from the best.

Some other landing pages I loved and had alongside while designing ours were Vercel, Resend, and WorkOS :)

8 months ago

fuddle

Linear seems to be the latest trend. https://www.linears.art/ - A collection of websites inspired by Linear

8 months ago

ilrwbwrkhv

Designers at startups are some of the most cargo culty groups in tech

8 months ago

VTimofeenko

If the project comes close to linear.app's platform UI responsiveness - wouldn't be a bad thing.

8 months ago

[deleted]
8 months ago

specialist

First paragraph https://github.com/hyperdxio/hyperdx

"HyperDX helps engineers figure out why production is broken faster by centralizing and correlating logs, metrics, traces, exceptions and session replays in one place. An open source and developer-friendly alternative to Datadog and New Relic."

Just perfect. Bravo.

--

As a merc, I never understood the why of Datadog (or equiv). The teams and projects I rotated thru each embraced the "LOG ALL THE THINGS!" strategy. No guiding purpose, no esthetics. General agreement about need to improve signal to noise ratio. But little courage or gumption to act. And any such efforts would be easily rebuffed by citing the parable of Chesterfordstorm's Fences of Doom and something something about velocity.

Late last century, IT projects, like CRMs and ERPs, were plagued by over collection of data. Opaque provenance, dubious (data) quality, unclear ownership, subtractive value propositions (where the whole is worth less than the parts). No, no, don't remove that field. We might need it some day.

Today's "analytics" projects are the same, right? Every drive-by stakeholder tosses in a few tags, some misc fields, a little extra meta. And before anyone can say "kanban", the stone soup accreted enough mass to become a gravity well threatening implosion dragging the entire org-chart into the gaping maw of our universe's newest black hole.

Am I wrong?

But logging is useful, right? Or at least has that potential.

The last time I designed a system end-to-end, that's kinda what we did. Listed all the kinds of things we wanted to log. Sorta settled on formats and content (never really ever done). Did regular log bashes to explain and clear anomalies. Scripts for grooming and archiving. (For one team I rotated thru, most of their spend was on just cloudwatch. Hysterical.)

But my stuff wasn't B2C, so wasn't tainted by the attention economy, manufactured outrage, or recommenders. No tags, referrers, campaigns, etc. It was just about keeping the system up and true. And resolving customer support incidents asap.

Does anyone talk or write about this? (Those SRE themed novels are now buried deep in my to read pile.)

I'd like some cookbooks or blue prints which show some idealized logging strategies, with depictions of common enough troubleshooting scenarios.

Having something authoritative to cite could reduce my semblance to an Eeyore. "Hey, team mates, you know what'd be really great?! Correlation IDs! So we can see how an action percolates thru our system!"

Just curious.

PS- Datadog's server hexagon map/chart thingie is something else. The kind of innovation that wins prizes.

8 months ago

mikeshi42

Yes! You should definitely be thoughtful about what you log and how you expect to use it. My biggest gripe with logs is that people writing them often never think about "how would I use this when things are on fire?" and tend to log useless information or fail to tag them in ways that are actually searchable.

Tagging the right IDs is a huge thing - customer X is saying their instance is really slow, but if none of your logs let you link service performance to customer X, the telemetry you're paying for is absolutely useless!
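As a rough illustration of the ID-tagging point, here is a minimal sketch using only Python's standard library. It is not how HyperDX or any particular SDK does it; the field names (`correlation_id`, `customer_id`) and the helpers `build_logger` and `handle_request` are hypothetical. The idea is to set the IDs once at the edge of a request and have every log line carry them as searchable fields.

```python
import contextvars
import json
import logging
import uuid

# Hypothetical per-request context; the names are illustrative only.
correlation_id = contextvars.ContextVar("correlation_id", default="-")
customer_id = contextvars.ContextVar("customer_id", default="-")


class ContextFilter(logging.Filter):
    """Copy the current request context onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        record.customer_id = customer_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the IDs become searchable fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": record.correlation_id,
            "customer_id": record.customer_id,
        })


def build_logger(stream=None):
    """Create a logger whose handler tags and JSON-encodes each record."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(ContextFilter())
    logger = logging.getLogger("app")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, customer):
    # Set the IDs once at the edge; every log line below inherits them.
    correlation_id.set(str(uuid.uuid4()))
    customer_id.set(customer)
    logger.info("request started")
    logger.info("query finished")


if __name__ == "__main__":
    handle_request(build_logger(), "customer-x")
```

With logs shaped like this, "customer X says their instance is slow" becomes a filter on `customer_id`, and a single `correlation_id` pulls up every line from one request.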

You have an ally in me on this one :) I'm hoping given a bit more time we get to write things like this - practical observability from the perspective of a dev, as opposed to the SRE angle that I think is well covered. Feel free to join us on discord btw if you want to chat more - I (for better/worse) love musing about these things :)

8 months ago

TheBengaluruGuy

> I'd like some cookbooks or blue prints which show some idealized logging strategies, with depictions of common enough troubleshooting scenarios.

> "Hey, team mates, you know what'd be really great?! Correlation IDs! So we can see how an action percolates thru our system!"

Hi, I'm building Doctor Droid (https://drdroid.io/), which enables you to join structured application logs via correlation IDs and then build multiple types of rules / frameworks on them -- some are at a granular level and some are at aggregate levels (like funnels).

We are early in the development lifecycle, would love to hear your feedback / connect with you.

8 months ago

[deleted]
8 months ago

voynich

[dead]

8 months ago