Regular Expressions which query an Oracle

shayanjm · on Nov 27, 2024

At what point do we drop the term "regular" expressions altogether for stuff like this? This is going to sound pedantic since I know that most popularly-used regex implementations are themselves non-regular, but I feel like we're just piling more and more stuff on top of good-old-regexes and trying to turn the concept into a catch-all for anything that does pattern matching on text.

I guess it just feels icky that "regular expressions" has inherent meaning (i.e. can be represented entirely by a finite automaton) which has become completely diluted at this point.

That rant aside, cool paper. The idea of bridging formal language theory with modern computational tooling feels timely. I think I would've liked to see more exploration of oracle-based costs, for instance:

* What happens when oracle outputs are inconsistent/uncertain?

* What happens as oracle interactions become more computationally expensive?

judofyr · on Nov 27, 2024

Your rant is a bit unfounded for this paper as they do actually take completely standard regular expressions (no backtracking or anything like that) and extend it with one more construct. Calling it «semantic regular expressions» seems perfectly reasonable to me. What else to call it?

As for outside of the computer science sphere (which I find is quite consistent in their terminology): I do agree that it seems like it’s a lost cause and «regex» is now synonymous for «pattern matching using this one specific syntax» :(

tgv · on Nov 27, 2024

Regular has a meaning, and it isn't this. It's had this meaning since the 1950s. OTOH, these expressions do not generate a regular language. That's a good reason to me.

They call it "semantic regular expression" because it apparently already is a lost cause. "Regular expressions with TMs embedded" doesn't quite have the same ring to it. Nobody would see it as a regexp.

judofyr · on Nov 27, 2024

> OTOH, these expressions do not generate a regular language.

Okay sure you're technically correct here, but only because these expressions generate a subset of a regular language. The LLM can only be invoked on a substring that can expressed as a regular expression, and then it's only used to remove strings from the language. Their results are based heavily on how regular expressions work. A "semantic context-free grammar" would have different type of characteristics and behavior.

Maybe throwing in the word "extended" or "augmented" would be a bit more clear, but as I reader I definitely would expect "regular expression" to be part of the name.

tgv · on Nov 28, 2024

Removing strings from the language is what makes it non-regular. E.g., a regular language cannot contain a^n b^n (that is: the string is only accepted when it has an identical amount of a's and b's), but it sure as hell can contain a^m b^n. Removing the strings where m != n is what makes a language context-free.

taeric · on Nov 27, 2024

To be fair, I could see this basically allowing a form of "back reference" lookup that lets you offload the back reference to other parts. For example, `/(.); \1 = (.)/.exec("foo; foo = anything")`, but instead of doing a back reference, you could have an oracle lookup.

I haven't looked at the examples in this paper, yet. But I'm having fun imagining ways this could be used.

shayanjm · on Nov 27, 2024

Nested querying is not something that is standard for regular grammars, amongst other aspects introduced in this paper that implicitly require things like memory (again, not standard for regular grammars)

krackers · on Nov 27, 2024

>which has become completely diluted at this point.

Isn't it common in TCS to consider the class X with oracle access to decide L though?

kg · on Nov 27, 2024

To me this feels kind of similar to the idea of verifying tokens in a compiler by querying an oracle. i.e. in the script compiler for my game engine, there's an additional pass at the end of compilation that checks things like filename literals or entity ID literals against a database of assets or a list of all the scripts that were compiled, so when I hit F5 in visual studio any typos in those literals are caught before I waste time booting the game up.

It would be interesting to be able to do this eagerly during parsing, so it's neat to see people thinking about doing this at a regex level, though I'm having trouble thinking of specific cases where I would both want to use a regex and want to query an oracle from inside of it...

nickpsecurity · on Nov 27, 2024

They said they want the cost of invoking the oracle, the DB, to be low. Might be a good use for in-memory, embedded DB’s with a good API.

ashvardanian · on Nov 27, 2024

There aren’t many papers on the subject, and I’m probably not great at spotting them. The linked paper also doesn’t mention much recent research, so it seems like an excellent opportunity to ask for the collective wisdom of the HN community - what are your favorite undervalued information retrieval papers after 2015?

agnishom · on Nov 28, 2024

I feel that the LLM community has been very interested in Retrieval-Augmented-Generation, recently. But as far as I have seen (not very far), they are not very formal methods oriented.

larodi · on Nov 27, 2024

This article lets an UTF8 symbol be something... so now we have all the existing math symbols + all UTF8 icons. Wonder what the papers be look like in 20 years.

agnishom · on Nov 28, 2024

you mean the usage of Crystall-Ball as notation for oracle?

larodi · on Nov 28, 2024

Yes, it is impossible to put it here, but yes, indeed.