Mwmbl: Free, open-source and non-profit search engine

209 points

1/20/1970

10 months ago

by marcodiego

Comments

xenodium

If you're keen on some minor feedback (especially for mobile), you can likely cut down on the landing page text:

From:

    MWMBL

    [Search on mwmbl...]

    Welcome to mwmbl, the free, open-source and non-profit search engine.

    You can start searching by using the search bar above!

    Find more on

    [Github] [Wiki]

To:

    MWMBL

    [Search on mwmbl...]

    A free, open-source and non-profit search engine.

    [Github] [Wiki]

10 months ago

daoudc

Thanks, feel free to send a PR!

10 months ago

mdaniel

I wondered if this approach would be feasible for a distributed crawler: https://github.com/mwmbl/mwmbl#crawling

Also, your own posting appears to be missing from the index: https://mwmbl.org/?q=mwmbl+ycombinator

(and, yes, another vote for changing the domain name; you can have a quirky project name, but if I can't remember the cat-walking-on-keyboard domain, I'm not going to use it)

10 months ago

marc_abonce

> We now have a distributed crawler that runs on our volunteers' machines! If you have Firefox you can help out by installing our extension.

This is a very interesting idea that other search engines have tried before. Actually, the Brave search engine is built over Cliqz[6] that implemented this same idea but *without* the user's consent.

Copy pasting from an old comment I made about this "human web" crawler idea:

Both PeARS[1] and Cliqz[2] tried to do that. Both got direct support from Mozilla[3][4] but it looks like neither really kicked off.

PeARS was meant to be installed voluntarily by users who would then choose to share their indexes only to those they personally trusted, so the idea is very privacy conscious but also very hard to scale.

Cliqz, on the other hand, apparently tried to work around that issue by having their add-on bundled by default in some Firefox installations[5] which was obviously very controversial because of its privacy and user consent implications.

I still think the idea has potential, though, even if it's in a more limited scope.

[1] https://github.com/PeARSearch/PeARS-orchard

[2] https://cliqz.com/en/whycliqz/human-web

[3] https://blog.mozilla.org/press-uk/2016/06/22/mozilla-gives-3...

[4] https://blog.mozilla.org/press-uk/2016/08/23/mozilla-makes-s...

[5] https://www.zdnet.com/article/firefox-tests-cliqz-engine-whi...

[6] https://www.theregister.com/2021/03/03/brave_buys_a_search_e...

10 months ago

daoudc

Thanks, I didn't know this history! We don't use any user data when crawling, just bandwidth and compute. We tell the extension what to crawl.

10 months ago

robinduckett

I’m from Wales and it almost seems like a transliteration of the word “Mumble” - actual translation is “mwmial”

10 months ago

dmurray

Welsh has the unfortunate combination of being unfamiliar to most English speakers, and not exotic enough to score diversity points.

10 months ago

melx

On that topic I love the Welsh-English encounter of civil servants thinking they understood each other[0]

[0] http://news.bbc.co.uk/1/hi/7702913.stm

10 months ago

mdaniel

Heh, I look forward to the future version of "en tant que modèle linguistique, je ne peux pas traduire ce texte" or in this specific case "fel model iaith, ni allaf gyfieithu'r testun hwnnw"

10 months ago

zx8080

It surprisingly resembles the Swedish language a bit.

10 months ago

toastal

But Mumble brings back fond memories https://www.mumble.info/

10 months ago

remram

So not only is it based on an obscure word in Welsh, but it's not even spelled correctly?

10 months ago

daoudc

Yeah, pretty much, I named it after Mumbles or Mwmbwls where I live, but it's also a play on the word mumble.

10 months ago

eviks

You don't need to remember it, just bookmark and tag it however you like (it's a waste of keystrokes anyway to manually type the full domain for a site you use as frequently as a search engine)

10 months ago

DandyDev

That is not how a large part of the citizens of the internet work. Hell, a not insignificant number of people will still "search" for Google in their address bar before they get to the actual googling

10 months ago

eviks

Except I'm not talking to a large part of the citizens, but to a single one. Do you type 'google.com' in your address bar to search?

10 months ago

tux1968

You're being argumentative for no good reason. He was suggesting a name change to improve the likelihood of a large userbase, not a change for his own convenience.

10 months ago

Brian_K_White

This is 100% wrong.

One's normal phone and laptop is actually only a fraction of uses, and even one's normal device isn't just one thing that needs to be done one time in grade school and then set for life. It's a dozen different things, and they are all perpetually rotating, and most people are not highly optimized with profiles they actually export and import.

This idea is great but it's going absolutely nowhere without a better understanding of actual humans.

10 months ago

eviks

Name at least half of those perpetually rotating things and quantify "only a fraction"

I'm an actual human, I use alternative search engines, I don't memorize their full names, and the only thing perpetually rotating is the planet

10 months ago

Brian_K_White

A stack of old laptops since they are too good to throw away since I buy good stuff and am a Linux user, so even my 10 year old machines are actually still great. So I use them for vacation to take a clean wiped machine, for things like attaching to a 3d printer or being a part of my electronics workbench or out in the garage. Old android tablets and phones which get used about the same way, vacation, device interface. Not to mention, a smaller but similar collection belonging to my wife. These are just the things I might search in a web browser on.

There are actually browsers also built in to 4 TVs, also in the rokus and google TVs attached to those same TVs, also in the Xbox and ps3. But I won't even count any of those. I have actually used them, but I'll give you those for free since I don't actually use those browsers very much.

Also that just reminded me that all of the old devices are fairly regularly getting reinstalled with some new version of a linux or bsd distro fresh every time I pick one back up, so, no configured profiles.

The windows partition on my main machine is frequently reinstalled since I experiment with trying to use either a partition or Framework's custom usbc module or a regular usbc external drive, or just a partition on a bigger faster external drive. That's one physical device but a few different OS's, and most of those OS's besides my main daily driver get moved around and reinstalled a lot so they are always new and unconfigured, yet I still need to use them, and that means I use a browser to search from within them.

My kobo, and 3 or so other eink readers. Which, again, occasionally gets reinstalled, so even the one device needs to be set up more than once.

The only reason I don't have to set up a new phone every 6 months is because I value a headphone jack more than most everyone else. So if you would say my usage pattern is an outlier, I would say, 1 so what? Outliers exist and could even be argued to outnumber the center peak of the bell curve, and 2 some of my outlier usage pattern goes the opposite way, like using the same phone for 5 years.

And then of course I use many machines which are not mine. And this is not even counting that my work used to involve some amount of user IT support where I would use a user's desk or a hot desk at a customer site. I just mean my own personal normal activity is on many other machines besides my own, including relatives', friends', & public machines.

I had to type "google" (back when I used google primarily, and it wasn't already everyone's default) countless times, even though it was the home page on my own main machine.

This question didn't really even deserve the dignity of any answer it is so obtuse.

10 months ago

eviks

The only rotating thing in this story is the person actively wasting time erasing all the traces of history without any easily available sync that would prevent the need to type "google" countless times.

But even then compared to that effort remembering a new word is trivial

> So if you would say my usage pattern is an outlier, I would say, 1 so what?

I'd say it's not relevant to this conversation where you barge in with an uber-confident "100% wrong" when it's only "1 person" wrong

10 months ago

dang

Show HN: I'm building a non-profit search engine - https://news.ycombinator.com/item?id=29690877 - Dec 2021 (199 comments)

10 months ago

evolve2k

Love that you folks are working on this. We desperately need more diversity in search options.

Much is at stake in this arena.

10 months ago

marginalia_nu

I'd love to see more competition in search. Feels like everyone right now gets tripped up on trying to emulate Google, which is a trap even if you succeed. Nobody is going to out-Google Google.

ChatGPT's recent huge success in performing specific tasks previously within the domain of Google, by doing something other than what Google does, is a good example of this.

10 months ago

kristopolous

Chatgpt is so much more useful for things that are specific and complex than Google is.

Google used to be good at it but it's now utterly befuddled by specificity and returns such garbage that I had given up.

But the form of "I'm doing this, I'm seeing this and I'm wondering if X is possible" chatgpt is solid on that - basically a personal stack overflow

10 months ago

marginalia_nu

Question-answering is something Google pivoted toward with great enthusiasm but never quite nailed down. They'd sometimes get some questions right, but it was more of a broken clock sort of a deal.

10 months ago

kristopolous

Most implementations of this have a race towards generalities.

The biggest problem used to be when seemingly the whole internet was satisfied with an answer that is extremely wrong and broken when you do it.

Chatgpt can work through this without getting into a weird Markov cycle maybe half the time, which is great.

Patterns like "Hey I tried that. It still doesn't work, can you give me another option"

10 months ago

marginalia_nu

ChatGPT has other failure modes. When a question doesn't have an answer written down somewhere, it really struggles. A case is something like "how do I write a parquet file in Java without using Hadoop".

This is not at all trivial but quite possible[1], but ChatGPT will 100% of the time either hallucinate APIs, disregard the instructions to not use Hadoop, or give otherwise plausible-looking but incorrect answers.

The trick is that it isn't doable by simply finding the correct dependencies and API calls; you need to extract and override filesystem classes from the Hadoop project to cut those ties.

[1] https://github.com/strategicblue/parquet-floor

10 months ago

kristopolous

You can call it out "hey you just made that up. Think hard and give me a real answer"

I don't know if "think hard" does anything but it seems to work and if I was the one making chatgpt I'd certainly have configurable keywords like that to tweak the generation settings - mostly so I could skate by on cheaper queries 90+% of the time and then have a fix when they fail

10 months ago

Fnoord

Google got completely and utterly raped by world-wide SEO. We'll have to see how ChatGPT ages. Since the dataset is more controlled, I give it a fair chance.

10 months ago

marginalia_nu

Google's problem is with the conflict of interest inherent in their business model. It prevents them from doing what they need to do in order to decisively tackle the search engine spam problem they're struggling with.

It would be very easy to improve Google's search result quality by removing their promoted results, and then penalizing websites with ads and adtech.

10 months ago

kristopolous

That's only half of it. It's a question of fidelity. If I am genuinely interested in reading about the latest celebrity gossip or a local crime story, ad tech sites with videos and irritating things flying across the screen is actually what I'm looking for.

I strongly believe the smartest interfaces have the right fidelity to empower the user to effectively control the tool.

These parameters need to have the right dimensionality, faceting and perimeters to be expressive in this way.

I know you've got your own semi famous search engine and I express these ideas with that known

10 months ago

daoudc

Thank you! Join us if you like, there is plenty of work to do.

10 months ago

krishadi

This and the other engines seem to implement all the components of crawling, indexing, and searching strung together. Is there a reason for this? What about an option where, let's say, crawling + indexing is made available separately so others could build a search algorithm on top of it, or where just the crawling is made available as a service? Is there stuff like this already available? Or is it just not a viable option?

10 months ago

marginalia_nu

Crawling can be done collaboratively, but there's not a lot of point to doing this. Crawling is the cheap and easy part.

As for the rest, in order to perform well, the indexer needs to be built specifically tailored to what the search engine is doing. Often you're scrounging for places to cram in individual bits to encode some additional piece of information about the term.

If a DBMS tries to support every use case, a search engine index does the opposite, it supports a singular use case and cuts every corner imaginable and then some to make that happen with as much resource frugality as possible.

10 months ago

daoudc

Thanks, Marginalia search was a big inspiration for this project!

10 months ago

ddorian43

There is common crawl: https://github.com/commoncrawl

10 months ago

marginalia_nu

Kinda sucks that it's stuck in AWS with no easy way of exfiltrating the data from the Amazon ecosystem. Last I tried I got like 100 Kb/s on their HTTP mirror. At that rate, the download would take 12 years.

10 months ago

ddorian43

I just tried 2 http examples from https://commoncrawl.org/get-started for an old dataset and the most recent one and got 110Mb/s (my full download bandwidth):

  wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz

  wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00483.warc.gz

10 months ago

kwhitefoot

A lot of the terms I searched for returned no hits. The Firefox add-on crawls pages linked from Hacker News which is amusing perhaps but seems unlikely to crawl a representative selection of the web. Perhaps the user should be able to suggest pages to be crawled.

But when it does find something it is very quick! So I'll give it a go.

10 months ago

hk__2

Same experience: it’s quick at finding irrelevant links. For some reason, it seems to have indexed a lot of spammy websites: search for "Trastevere" on Google, and you get Wikipedia and pages about the district in Rome. Search it on Mwmbl and you only get links from a random *.it-romehotels.com website.

Other random examples: search for "2023" and the very first link is "2023 Pomeroy College Basketball Ratings". Search for "iphone", and the 5th link is a page about iPhone 6s that was last updated in 2021. Typos don't work: "haker news" has only one result, a hungarian press article.

10 months ago

marginalia_nu

Even Google is kinda not great for "Trastevere". I'd like to see results like these in favor of the sort of travel industry spam that's 90% of the search results page.

https://www.romeartlover.it/Vasi60.htm

https://www.maquettes-historiques.net/P19b.html

10 months ago

Black616Angel

Okay, name aside, because I instantly got that and English isn't my first language.

But the crawler seems to be lacking quite a bit. For my first search (current work problem) "rust json diff" it only found 6 links, only one of which was a rust crate. Unfortunate.

Second Search: "black sabbath sleeping village lyrics" only gave 2 results, only one of which was correct.

Also the repo is missing the SearXNG[1] search engine.

[1] https://github.com/searxng/searxng

10 months ago

marginalia_nu

SearXNG isn't really a search engine. It's just a unified front-end for other search engines, doesn't do any actual crawling or indexing as far as I'm aware.

10 months ago

carlsborg

Sub-100 ms search results, nicely typed python codebase, good project. How many 4096 byte pages do you currently store?

10 months ago

daoudc

10240000 - see https://github.com/mwmbl/mwmbl/blob/18dc760a3402c74803823a94...

10 months ago

worksonmine

Why 4096 bytes?

10 months ago

carlsborg

From the github: "Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items."
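For readers who want to see the quoted design concretely, here is a minimal sketch of a fixed-size-page hash map index. It is not the project's actual code: `zlib`, JSON, the MD5-based hash, `NUM_PAGES` and all function names are assumptions picked for illustration; only the 4096-byte page size comes from the quote above.

```python
import json
import zlib
from hashlib import md5

PAGE_SIZE = 4096  # bytes per page, as quoted from the README
NUM_PAGES = 1024  # hypothetical; a real index would use far more pages


def page_for_term(term: str, num_pages: int = NUM_PAGES) -> int:
    """Hash a term to the number of the page that stores its items."""
    digest = md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_pages


def pack_page(items: list[dict], page_size: int = PAGE_SIZE) -> bytes:
    """Compress the items into one page, dropping items from the tail until they fit."""
    kept = list(items)
    while True:
        blob = zlib.compress(json.dumps(kept).encode("utf-8"))
        if len(blob) <= page_size:
            # Pad with zero bytes so every page has exactly the same size.
            return blob.ljust(page_size, b"\0")
        kept.pop()  # the last item is assumed to be the least important


def unpack_page(page: bytes) -> list[dict]:
    """Decompress a page; the zero padding after the stream is ignored."""
    decompressor = zlib.decompressobj()
    return json.loads(decompressor.decompress(page).decode("utf-8"))
```

The fixed size is what makes lookup cheap: page `n` lives at byte offset `n * PAGE_SIZE`, so a query term costs one hash and one read. The price is that whatever does not fit in the page is dropped. At the 10,240,000 pages mentioned earlier in the thread, the store works out to 10,240,000 × 4096 bytes, about 41.9 GB.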

10 months ago

illegally

It's good for fun and learning, but I don't think it's practical... not even close to the functionality of search engines in the 90s

10 months ago

daoudc

The more people that join, and help us crawl, the better it gets.

10 months ago

TheExplorer

Typing 'Debian' gets some results; adding 'gdm' returns 0. lol

10 months ago

Fnoord

When I enter Kamil Galeev I get directed to a Nitter post by him (and only that), but when I enter Kamil Kazani (which was the mentioned nickname of said Nitter post) I get returned nothing at all.

10 months ago

daoudc

That seems to be because it's written kamilkazani, and automatically splitting such names is a hard problem

10 months ago

BlackLotus89

Gigablast (linked in the faq) has been dead for some time now. Had some sort of collab with freenode and then suddenly disappeared (not implying causality)

10 months ago

Reticularas

Don't know if the index isn't complete, but the results with this are quite poor

10 months ago

marginalia_nu

I was happy to notice these guys a while back, but the git repo seems very dormant. I wonder if they backpedaled on the open source side of things, or if the project is asleep.

Either would be sad, because the world needs more open source search engines.

10 months ago

daoudc

Yes, I took a break for a while, my fourth child was just born! Still committed to the project though and working on it when I get time.

Thanks for your encouragement. Would love to have a chat some time.

10 months ago

marginalia_nu

Feel free to shoot me an email, I love to talk search engines :D

10 months ago

nonrandomstring

> This website requires you to support/enable scripts.

Bye bye.

You do not need "scripts" to turn the text string I'll supply into a list of candidate links. How can you not understand this basic accessibility foundation?

10 months ago

daoudc

The API is open so feel free to write your own front end that doesn't need js, or send a PR to add support for no js.

10 months ago

daoudc

Hi, creator here, happy to see this posted! Feel free to ask any questions.

10 months ago

luc_rnz

Thank you for sharing this, this is very interesting. I will give it a try, although I don't think it can replace my current engine (DuckDuckGo/Searx), but rather complement it maybe (by having a smaller, more curated set of data).

Particularly I am having a great time reading the crawler extension source-code: https://github.com/mwmbl/crawler-extension

10 months ago

1vuio0pswjnm7

"Welcome to mwmbl, the free, open-source and non-profit search engine.

This website requires you to support/enable scripts."

JSON results, no Javascript

https://api.mwmbl.org/?search=search+the+web+without+javascr...

10 months ago

1vuio0pswjnm7

Correction: https://api.mwmbl.org/search?s=disable+javascript
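A minimal sketch of calling that endpoint from a script instead of the JavaScript front end. The URL and the `s` parameter are taken from the correction above; the function names are made up here, the shape of the JSON response is not documented in this thread, and the API may have changed since, so treat it as an assumption-laden example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.mwmbl.org/search"  # endpoint as given in the comment above


def build_search_url(query: str) -> str:
    """Build the search URL; urlencode turns spaces into '+'."""
    return API_URL + "?" + urlencode({"s": query})


def fetch_results(query: str, timeout: float = 10.0):
    """Fetch and parse the JSON response (its structure is not assumed here)."""
    with urlopen(build_search_url(query), timeout=timeout) as response:
        return json.load(response)
```

`build_search_url("disable javascript")` reproduces the URL in the correction; `fetch_results` needs network access and returns whatever JSON the server sends.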

10 months ago

rstreefland

I was intrigued by the name and was very pleasantly surprised to confirm the Welsh influence when I clicked through. The creator lives very close to home for me.

Dal ati! We really need open source alternatives to Google.

10 months ago

davidebaldini

If I understand correctly, having only 4096 bytes of data per term means that multiple terms in the same query intersect to few or no results. The purpose seems to be to cut cost at the expense of completeness.

10 months ago

marginalia_nu

Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references.

In general hash map table index designs don't tend to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.

10 months ago

daoudc

We actually just take the union and then re-rank. Because the lists are all small, this is cheap.
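A rough sketch of the "union, then re-rank" approach described here. The index layout, the field names and the toy scoring function are all assumptions made for illustration; the real ranking is certainly more involved.

```python
def toy_score(query: str, doc: dict) -> int:
    """Hypothetical scorer: count the query terms that appear in the title."""
    title_words = set(doc["title"].lower().split())
    return sum(term in title_words for term in query.lower().split())


def search(query: str, index: dict[str, list[dict]]) -> list[dict]:
    """Take the union of each term's (small) list, then sort by score."""
    candidates: dict[str, dict] = {}
    for term in query.lower().split():
        for doc in index.get(term, []):
            candidates.setdefault(doc["url"], doc)  # de-duplicate by URL
    return sorted(candidates.values(),
                  key=lambda doc: toy_score(query, doc),
                  reverse=True)
```

Because every per-term list is capped by the page size, the union stays small and sorting it is cheap. A document that matches only one of the terms can still be returned; it just ranks lower.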

10 months ago

marginalia_nu

Point is, with a skip list (or similar), the lists don't need to be small. You can intersect data sets that are enormous very quickly using this algo[1] where a single linear read of both lists is the worst case scenario.

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...
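A small sketch of intersecting two sorted posting lists with skip pointers. As a simplification it uses plain Python lists and places an implicit skip pointer every √n entries rather than building a real skip list; the function names are invented for this example.

```python
from math import isqrt


def _advance(xs: list[int], pos: int, skip: int, target: int) -> int:
    """Move past xs[pos], jumping `skip` entries when that cannot overshoot target."""
    jump = pos + skip
    if pos % skip == 0 and jump < len(xs) and xs[jump] <= target:
        return jump  # everything between pos and jump is smaller than target
    return pos + 1


def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two strictly increasing lists of document IDs."""
    skip_a = max(1, isqrt(len(a)))
    skip_b = max(1, isqrt(len(b)))
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = _advance(a, i, skip_a, b[j])
        else:
            j = _advance(b, j, skip_b, a[i])
    return out
```

When one list is much shorter than the other, most of the long list is jumped over rather than read. When no skip can be taken, the loop degrades to an ordinary linear merge of both lists, which is the worst case mentioned above.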

10 months ago

daoudc

Yes, you're correct on the purpose. We mitigate it a little by also indexing on bigrams.
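To illustrate what indexing on bigrams could look like, here is a hedged sketch that produces both single-word and two-word keys for a piece of text. The whitespace tokenisation and the function name are assumptions, not the project's code.

```python
def index_keys(text: str) -> list[str]:
    """Return the unigram and adjacent-bigram keys a document would be stored under."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams
```

A two-word query such as "black sabbath" can then be looked up as a single key whose page contains only documents where the words appear together, instead of relying on two separate small pages happening to overlap.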

10 months ago

IYasha

White screen, no content, 30 errors in the console. Firefox 50.1 @ Ubuntu. Too much JS for one page... :-|

10 months ago

starstripe

Searched "Littler Books" and nothing came up. Would be awesome if this worked as expected.

10 months ago

daoudc

It's easy to crawl specific sites using the command line crawler

10 months ago

evolve2k

UI feedback: On my iPhone the search box shows two magnifying glasses, make it just one.

10 months ago

retrofuturism

Google only has 1. You gotta 1-up the competition to win.

10 months ago

crtasm

It's much easier to read after changing --bold-font-weight to 500 in the CSS.

10 months ago

tamimio

My first test search was “best business banks in Canada” and it showed no results, saying “We could not find anything for your search..”. I can also see two redundant lens icons.

10 months ago

marginalia_nu

What sort of result would you expect for such a query?

10 months ago

tamimio

Primarily a list of options to choose from, preferably from a non-affiliated site. Asking the same in GPT-4 I get the following:

>Tangerine Business Savings Account: This account offers a high interest rate of 2.65% to 3.25% on your balance, no monthly fees, no minimum balance requirement, unlimited transactions, free e-transfers, and access to over 3,000 ATMs.

>Wise Business Account: This account offers low-cost international payments in over 50 currencies, no monthly fees, no minimum balance requirement, free local transfers, free debit card, and access to over 10 million ATMs.

>BMO eBusiness Plan: This account offers no monthly fees, no minimum balance requirement, unlimited transactions, free e-transfers, free cheque deposits, and access to over 3,500 ATMs.

>RBC Digital Choice Business Account: This account offers no monthly fees for the first three months ($5 per month thereafter), unlimited electronic transactions, 10 free debit transactions per month ($1.25 each thereafter), free e-transfers, free cheque deposits, and access to over 4,200 ATMs.

10 months ago

marginalia_nu

Well if we imagine a search engine as a document retrieval machine, who would publish such a document?

10 months ago

tamimio

> who would publish such a document?

The banks? In this case. Because if you do the manual searching, you will “manually” go to each bank site, go to accounts, business section and read, a good search engine will do that for me, no middle man (aka some 3rd party sites) and summarize it based on my query, a bad search engine however, will look into a 3rd party website that already created a list, recommended some based on affiliate links, boosted itself in the results by playing the SEO keywords game.

10 months ago

Brian_K_White

Before I kagi this (not even google, just kagi!) shall we wager on whether there is or is not at least one, likely several such documents? Come on.

10 months ago

joshxyz

man these names are making me dyslexic. love it though.

10 months ago

38

slashes get eaten by the page. not cool.

10 months ago

dotcoma

So do vowels ;)

10 months ago

imachine1980_

still don't understand ""

10 months ago

romwell

OK, the obvious question:

Why go with an unpronounceable name?

I mean, great that it was made, but I can't even tell people I'm using... mwumble? But it's spelled em-doubleyou-em-bee-el dot org.

10 months ago

BLKNSLVR

It's pronounced mumble. An explanation is at the very bottom of the github Readme, quoting:

> How do you pronounce "mwmbl"?

> Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"

10 months ago

xpe

Ok, I'll think of it like this. Since the name of "w" is "double u" that means "mwmbl" is also "muumbl"!

UUho knows, maybe the name can uuork after all!

10 months ago

9dev

Marketing 101: don’t try to be clever with your brand name :)

10 months ago

xpe

Philosophy 101: Think deeply. Use reasoning. Things depend on other things.

The marketing claim above is so far from universal truth. Choosing a "clever" versus a straightforward brand name often depends on the brand strategy, target audience, and market conditions.

But, sorry, for this example, I personally think the current brand name is atrocious.

10 months ago

romwell

>I live in Mumbles, which is spelt "Mwmbwls" in Welsh.

Ah Welsh, the golden standard of phonetic spelling and easy pronunciation!

In other words, one needs a manual just to learn how to pronounce the name of that thing.

Off to a great start, aren't we?

>An explanation is at the very bottom of the github Readme

AKA the first place anyone visiting the website would look at. NOT.

Did writing "pronounced Mumble" on the landing page hurt puppies or something?

>"don't search, just mumble!"

I think half of the country is already doing that when it comes to fact-checking, and they certainly don't need a search engine for that.

And here I was thinking GIMP was a horrible name.

10 months ago

sdf4j

really? like mumble? [0]

[0] https://mumble.info

10 months ago

BLKNSLVR

According to government records, the only names not yet trademarked are "Popplers" and "Zittzers"

10 months ago

lionkor

Not for long! Someone ought to make a Popplers fastfood chain.

10 months ago

Brian_K_White

It's pronounced "google mumble".

10 months ago

dabluecaboose

It's a tech startup, vowels are not chic

10 months ago

evolve2k

+1000 change the name

Make it so you can use it in a sentence to replace “Just google it”

10 months ago

Brian_K_White

They are begging to be either ignored or forked.

There are no other outcomes if they don't already understand why everyone is telling them this is unusable.

10 months ago

thelastparadise

I think it's mimblewimble.

10 months ago

kylecazar

That's worse than I thought, suspected it was a play on mumble.

10 months ago

bobse

What a terrible name! Into the trash.

10 months ago

jw_cook

While it doesn't refute your point, the Frequently Asked Question section does give an explanation for the consonant soup: it's Welsh.

> How do you pronounce "mwmbl"? Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"

10 months ago

seanthemon

I highly recommend grabbing something simpler to say and remember to redirect to your site. You're going to need a large amount of inertia to get people to comfortably use an odd domain name.

10 months ago

AlphaCerium

Arguably, Google was probably an odd name for a search engine to people in the 90s that weren't maths-savvy.

10 months ago

brettermeier

But it's a normal word, unlike "mwmbl" (I had to look it up, couldn't remember where the "b" and "w" go after some seconds).

10 months ago

arghwhat

No it's not, it is an intentional misspelling of "googol" and means nothing - not in English or any other language. "Spotify" is also not a "normal word" in any language. And for those not native in English (there's supposedly only some 400 million of those), it's just a random sound sequence like any other.

mwmbl is a shortening of the Welsh spelling of https://en.wikipedia.org/wiki/Mumbles. The only tricky part is knowing that the w is pronounced as a u. Maybe it would be slightly easier if one followed the fad of leaving out vowels, but guessing a vowel and having a tricky vowel does not seem much different.

10 months ago

marginalia_nu

> Only tricky part is knowing that the w is pronounced as a u

Is that really tricky? W is basically pronounced like U in English already[1]. It just looks funny when you exchange the two.

[1] e.g. say this sentence "uorld uar tuo uas the uorst"

10 months ago

arghwhat

In many other languages, w is pronounced like a v and in some cases even named "double vee".

> e.g. say this sentence "uorld uar tuo uas the uorst"

This doesn't work with the English pronunciations of the letter u from words like "uninteresting" or "mumble". It mostly seems to work with the pronunciation of "you", which does not naturally fit those letter placements.

Not knowing the proper linguistic terms, I'd consider "w" to be a modulation of another sounding vowel by closing your lips and pressing your tongue down a bit to make room. Without a vowel to modulate, there is no sound, and so "mwmbl" is a bit of a question mark.

But most words require prior knowledge to pronounce correctly, especially in as messy a language as English.

10 months ago

brettermeier

Oh thanks for clarification.

10 months ago

marginalia_nu

It's a normal word now. In 1998, it was pretty weird. How many o:s does it have, is it -el or -le? etc.

10 months ago

kiririn

It’s a fine name but at first glance conflicts with Mumble (VoIP) and Mimblewimble (Crypto)

10 months ago

mdtrooper

I remember https://yacy.net/ but the big problem with that project was that it was written in Java and had no implementations in other languages. Imagine if torrent had only existed in Perl.

10 months ago

marginalia_nu

YaCy's big problem is that distributed search is a bad idea that will never perform well. Search is as fast as the data is local.

10 months ago

kristopolous

There was an effort in the early 90s to have search as a protocol so you could have a query and then select the domains you want to run it on and return an aggregate result.

It was 100% abandoned and I think that's a mistake. It'd be nice to explore some of those ideas again

10 months ago

marginalia_nu

I think a big part of the problem is that domains in isolation don't provide the best search results. Out-of-band information like (global) anchor texts or click data makes search perform so much better.

If I want to learn how to do an INNER JOIN in MariaDB, this is the authoritative source: https://mariadb.com/kb/en/join-syntax/

The problem being that INNER JOIN isn't particularly important to that page using most IR measures of importance, it's also primarily in a <code>-block which is typically further de-prioritized. To learn that this is an important link, you need to look outside of mariadb.com.

10 months ago

kristopolous

There's more to it than that.

What if instead of crawling the php generation of database rows with a bunch of cruft, the administrator published some kind of schema with scraping and querying rules, and you could alternatively make a single call to capture all of the data in a semantic schema?

You can still do all the stuff you're talking about but it could make search more coherent.

An entry for the humans and an entry for the computers.

You can't trust everybody like this sure, but say imdb, discogs, wikipedia, all of which provide database dumps anyways (eg: https://datasets.imdbws.com/). That's what I'm advocating for revisiting. Lots of legit sites such as universities, newspapers, public records offices...

You could even have a search toggle "screened sources" or whatever for the ones that make the cut

10 months ago

teddyh

Wasn’t this what the Semantic Web was supposed to enable?

10 months ago

kristopolous

That was more an ontological web. That's a different project which I totally support but this time through ML

10 months ago

teddyh

Then, DBpedia might be more like what you’re after?

10 months ago

teddyh

You’re thinking of WAIS, I believe: <https://en.wikipedia.org/w/index.php?title=Wide_area_informa...>

10 months ago

kristopolous

Yes!

10 months ago