Programming AI IBM

IBM's CodeNet Dataset Can Teach AI To Translate Computer Languages (engadget.com) 40

IBM announced during its Think 2021 conference on Monday that its researchers have crafted a Rosetta Stone for programming code. Engadget reports: In effect, we've taught computers how to speak human, so why not also teach computers to speak more computer? That's what IBM's Project CodeNet seeks to accomplish. "We need our ImageNet, which can snowball the innovation and can unleash this innovation in algorithms," [Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, said during his Think 2021 presentation]. CodeNet is essentially the ImageNet of computers. It's an expansive dataset designed to teach AI/ML systems how to translate code and consists of some 14 million snippets and 500 million lines spread across more than 55 legacy and active languages -- from COBOL and FORTRAN to Java, C++, and Python.

"Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations," Puri explained. "Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages." In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code -- which, terrifyingly, still constitutes a significant amount of this country's banking and federal government infrastructure -- and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.

CodeNet can be used for functions like code search and clone detection, in addition to its intended translational duties and serving as a benchmark dataset. Also, each sample is labeled with its CPU run time and memory footprint, allowing researchers to run regression studies and potentially develop automated code correction systems. Project CodeNet consists of more than 14 million code samples along with 4000-plus coding problems collected and curated from decades of programming challenges and competitions across the globe. "The way the data set actually came about," Puri said, "there are many kinds of programming competitions and all kinds of problems -- some of them more businesslike, some of them more academic. These are the languages that have been used over the last decade and a half in many of these competitions with 1000s of students or competitors submitting solutions." Additionally, users can run individual code samples "to extract metadata and verify outputs from generative AI models for correctness," according to an IBM press release. "This will enable researchers to program intent equivalence when translating one programming language into another." [...] IBM intends to release the CodeNet data to the public domain, allowing researchers worldwide equal and free access.

  • Code doesn't really translate from language to language. Sure, you can write equivalent code trivially in any language, but the point of having different languages to begin with is that there are many different ways to do any one thing, and depending on context and type of application some are more advantageous than others. Could you translate C to Python? Sure, but you wouldn't, because Python provides a fundamentally different ecosystem for problem-solving and if you use it as if it were C instead of pytho
    • by gweihir ( 88907 )

      Indeed. I agree with all of that. In addition, the results of such a translation tend not to be human-readable, which makes the whole exercise pretty pointless because the code becomes unmaintainable. If you have really old code, it is better to maintain the compiler for that than to do an automatic translation to some other language.

      That said, there is a case for translating Python to C, but that one is both solved and comes with some limitations.

    • Re: (Score:3, Interesting)

      by Anonymous Coward

      I am not sure I agree completely. A long time ago I worked in a place where we "uplifted" a lot of COBOL code to C. The machine translation of COBOL to C (I forget what the product was) essentially converted the DATA DIVISION into a giant C union with the records as structs and all the memory allocated at the start of main().

      Basically it was a very direct translation of the COBOL application, both in the high-level code layout and in how the machine was going to execute it. It was absolutely NOT the way a

      • I love this. It's a pretty complicated high-wire balancing act because you not only had to understand COBOL really well but obviously had to be a C master to "trace" the older program and its logic.

        I have no insight to offer, just thought your thing was rad
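For readers who have not seen this kind of layout, here is a minimal sketch of the idea in Python, using ctypes to mimic a C union whose members are record structs. The record names and field widths (CustomerRecord, OrderRecord, WorkingStorage) are hypothetical and invented for illustration; this is not the output of the translator described above.

```python
import ctypes

# Hypothetical fixed-width records, standing in for COBOL 01-level records.
class CustomerRecord(ctypes.Structure):
    _fields_ = [
        ("cust_id", ctypes.c_char * 6),     # e.g. PIC X(6)
        ("cust_name", ctypes.c_char * 20),  # e.g. PIC X(20)
    ]

class OrderRecord(ctypes.Structure):
    _fields_ = [
        ("order_id", ctypes.c_char * 8),
        ("amount", ctypes.c_char * 9),
    ]

# One union overlays every record on the same block of memory.
class WorkingStorage(ctypes.Union):
    _fields_ = [
        ("customer", CustomerRecord),
        ("order", OrderRecord),
    ]

# All storage is allocated once, up front, as the comment describes.
storage = WorkingStorage()
storage.customer.cust_id = b"000042"

# Because the records share memory, the same bytes show up through the
# other record's view.
overlaid = storage.order.order_id
print(ctypes.sizeof(WorkingStorage), overlaid)
```

The union is as large as its largest record, and a write through one record is visible through every other one. That is faithful to how the original program treats memory, and it is also why the result does not read like hand-written C.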

    • You normally wouldn't translate code into a language designed for a different purpose. So COBOL to Python wouldn't make much sense. The bigger problem is there isn't a modern language that works much like COBOL. It would translate to C or Java, but it would end up in a style that no modern programmer uses. I translated a very complicated COBOL program that used the same variables over and over for all different things. It was written to use the least amount of memory possible. So every variable had a generi
      • by DarkOx ( 621550 )

        The success of this really depends on the starting COBOL corpus. If things were done neatly with appropriate copybooks, and programs were kept small and run as a series of separate job steps from JCL, you get useful C modules out that can be treated as little black boxes until you are ready to replace them. If you have that giant single program that does all the things (full ledger reconciliation, payroll, billing, inventory, start to finish), then with a mud ball as input you will get a mud ball as output.

    • I hear what you're saying, but I think since it's all programmatic, translation seems like it would be a matter of mapping and functional replication. I don't see why it couldn't be done although the translated code might be ugly, messy, and top-heavy.

      Human coders translate code from one language to another (I've done it myself going from Perl to PHP), so I don't see why it couldn't be done programmatically. There's not a lot of nuance or intent to puzzle out, just replicating the flow of the logic.

      I'm not s

      • Because computers/compilers make pretty bad translators?

        IBM used to sell (and I guess they still support, kinda) something called EGL and VG (VisualGen?). You write code in yet a third language and set up generation options and it will generate either COBOL or Java+JSP code to run under Websphere. While the code it generates builds, runs, etc. and does what we expect it to do, the generated Java is nearly unreadable and has a very weird way of dealing with objects and passing them around.

        Fortunately the E

        • Because computers/compilers make pretty bad translators?

          For the most part they do pretty well with human languages, so I would think that it would be even more straightforward with programming languages. There would be no issues dealing with nuance, no double meanings, and so on.

          Fortunately the EGL/VG code is simple to read and understand so when I have to convert it to "plain" Java (e/t/l stuff) or Spring (rest-like stuff) it is much quicker for me to just do it from scratch and about all I take from it is the sql statements and business rules

          Sounds like you might want to write a Java/SQL simplifier program. :) I'm guessing that some (maybe a lot) of what's involved in your conversion is repetitive or determinate and could be done programmatically.

          • by jythie ( 914043 )
            It is a double-edged sword. On the one hand, formal languages are simpler with nice explicit rules, which with non-ML systems would be a huge advantage. On the other hand, formal languages are also a lot more sensitive to context and error, which is a huge disadvantage with ML systems.
  • by anonymous scaredycat ( 7362120 ) on Tuesday May 11, 2021 @07:02AM (#61372340)
    Why is it terrifying that COBOL code "still constitutes a significant amount of this country's banking and federal government infrastructure"? I consider it more terrifying that anyone would consider translating it into Java to be a good thing.
    • Why does this scare you? Just because it's old does not mean it's bad. That is your consumerist programming showing. Must have newest latest thing... no you do not. How many criminals are hacking away at Java? Lots! How many are hacking away at COBOL? Probably none!
      • by hrieke ( 126185 )

        Old code isn't a bad thing, and I have no issues with the choice of COBOL as a language for running the infrastructure of a bank.
        My worry is that the assumptions of the initial team are now invalid.
        Little things like the Y2K bug, a company being worth a trillion dollars (or individuals worth 100s of billions), or the methods for data writing and retrieval.

        As for finding people to hack away at COBOL, the answer is easy: pay them money to do so.
        Shocking how simple money makes problems go away.

        • by DarkOx ( 621550 )

          I would suggest to you the assumptions and implicit limitations behind 30+ year old COBOL code are better documented, or at least more obvious than the vast majority of what has found its way into production between then and now.

          I suspect 2038 will be a rude surprise for a lot of organizations. I also expect that, as we discovered recently, things like the price of BRK.A are going to do things like overflow unsigned ints. With the old COBOL programs it was very obvious how wide fields were and not in binary 2s com
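To make the overflow concern concrete, here is a small sketch in Python. It assumes a hypothetical price field stored as an unsigned 32-bit integer with four implied decimal places; that layout, and the names SCALE, max_price, and wrapped_price, are assumptions for illustration and are not taken from any particular system.

```python
from decimal import Decimal

UINT32_MAX = 2**32 - 1
SCALE = 10_000  # assumption: four implied decimal places

def max_price() -> Decimal:
    """Largest dollar amount the hypothetical field can hold."""
    return Decimal(UINT32_MAX) / SCALE

def wrapped_price(dollars: str) -> Decimal:
    """Value the field would hold if an oversized price silently wrapped."""
    units = int(Decimal(dollars) * SCALE)
    stored = units & 0xFFFFFFFF  # keep only the low 32 bits, as the field would
    return Decimal(stored) / SCALE

print(max_price())                 # ceiling of the field
print(wrapped_price("450000.00"))  # what a price above the ceiling turns into
```

Under these assumptions the ceiling is a little under 430,000 dollars, and a price above it comes back as a much smaller, plausible-looking number. The binary field gives no visible hint of that limit, which is the contrast the comment draws with COBOL's explicitly declared field widths.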

      • by jythie ( 914043 )
        On the other hand, language and library designers have learned a lot over the last 70 years. For that matter, software engineers have also learned a lot over that time. While 'latest and greatest' is usually not that great, don't underestimate the value of decades of institutional knowledge that has built up since COBOL was common. COBOL was not even the best tool at the time, and its limitations become even more stark as the years roll on.
    • The code has been running fine for decades, but sure, rewrite it all in javashit. What can possibly go wrong?

      • The code has been running fine

        For any values of "fine"

        • If it ain't broke, maintain it.
        • If it's broke, either repair it or discard it.
        • If it's beyond maintenance, it's definitely broke.
    • by Anonymous Coward

      Agreed.

      Banking/finance deals with little things like money. Money is inherently floating point. 2s complement math tends to do rather poorly with such work. Yes, there are workarounds, but in these situations you're better off using a language designed for such work (BCD under the covers) rather than crocking a fix that some poor slob programmer will forget/fail to use.

      Second area - COBOL is designed for record processing. C, Java, Python, etc. are byte oriented. The nature of the work tends to be... reco

      • Re: (Score:3, Insightful)

        by Eunuchswear ( 210685 )

        Banking/finance deals with little things like money. Money is inherently floating point

        You should never be let near any financial software.

        Money is inherently fixed point.
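The disagreement above is easy to demonstrate. The following is a minimal sketch in Python, not taken from any banking system, comparing binary floating point with two fixed-point styles: the standard library's decimal module and plain integer cents. The variable names are made up for illustration.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the sum is not exactly 0.30.
float_matches = (0.10 + 0.20 == 0.30)

# Decimal arithmetic on values built from strings keeps the cents exact.
decimal_matches = (Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))

# Integer cents is the simplest fixed-point representation of all.
cents_matches = (10 + 20 == 30)

print(float_matches, decimal_matches, cents_matches)
```

Only the floating-point comparison fails. COBOL's decimal fields behave like the second and third cases by default, whereas in a language like Python the programmer has to choose the decimal or integer representation deliberately.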

    • Not sure modern Java still deserves this amount of hate these days. The language and its ecosystem are quite fantastic. Usually, when people complain about Java being shit, they're actually talking about the development culture that most Java shops have adopted. That one is completely beyond repair.
    • Because the people who could translate it from human language to COBOL, despite its verbosity, are finding their code outliving them. If no one remembers how to maintain it, when it crashes it'll crash hard and fast and idiots will panic instead of picking up a reference manual.

  • by gweihir ( 88907 ) on Tuesday May 11, 2021 @07:09AM (#61372360)

    There are quite a few languages that get translated to others. That is a very, very old approach. There is also decompilation that reverses the process.

    That said, languages have different limitations and no amount of translation can get around that. Code generated in this fashion tends to be unreadable and hence is not maintainable. That limits the utility of this system rather dramatically.

    I would say this is another desperate attempt by the "AI" people at IBM to prove they can do something useful.

    • Yeah, surely the value / idea of going the AI route would be to have it translate the functionality appropriately for the target language domain, rather than do a code monkey literal translation line by line, which no-one wants. But I guess if the literal translation Verifiably Works(R) then this could see a whole world of automated code reuse that never darkens a programmer's door.
  • As if most of the code out there is even worth recycling.

    • Re:Crap In Crap Out (Score:4, Informative)

      by Canberra1 ( 3475749 ) on Tuesday May 11, 2021 @08:01AM (#61372454)
      Never so flat out wrong. Code contains the business rules. Code contains the data needed to support the business, and all the legacy legislation against the rules of doing business. Ask anyone to enumerate the business rules, and watch a major failure. Have kids fresh out of college try to work out all ER in an agile huddle. Code is easy (bar drivers, protocol work, and OS writing). The business rules within are the golden payload. So maybe you are one of those "code first, worry about business design and rules later" types. As for COBOL, roughly 80% will be SQL code, like any other modern language, and 10% data definition include statements.
    • Better known as the, "my shit doesn't stink, yours does" philosophy.

  • I would like to use this dataset, but it looks like the download server has already been DDOSed.
  • "you can take some legacy COBOL code ... and translate it into Java"

    So it can translate from one legacy language into another legacy language. I bet they were disappointed there was already an app called Rosetta Stone. :)

  • Well, it can't be any worse than the code my co-workers write

  • When it can translate the winning programs from the Obfuscated Perl Contest to Apple Basic I'll be impressed. Until then it's just the computer language equivalent of Google Translate.

  • Ever heard of that phrase? And now, apparently, so can the computer!

    • While I know this was meant in jest (as was my previous post), I've seen this sentiment quite often, though usually with writing Fortran in C/C++. Simply making something work is often straightforward, but making it work well within the expected style of the language often isn't. I don't expect the AI to translate language idioms well at this point, and I especially don't expect it to be able to translate ideas across different programming approaches. It's hard enough to refactor my own code from procedu

  • Start with code conversion. This is not difficult, far easier than written/spoken language translation given the small set of "words".

    The hard part is integration, local or not. Your Cobol has a bunch of hardware/storage specific code, good luck translating that in a useful manner.

    And the "databases" may just be flat files with a SQL overlay (IBM stuff certainly can be). Even if you can pull off data layer conversion, what's the database look like on the "other side" (and how is that integration "generat

  • That is, you can take some legacy COBOL code -- which, terrifyingly, still constitutes a significant amount of this country's banking and federal government infrastructure -- and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.

    Actually, what would be terrifying would be taking working COBOL code and turning it into Java.

  • by Anne Thwacks ( 531696 ) on Tuesday May 11, 2021 @02:15PM (#61373714)
    When it can translate Klingon into PDP8 assembler!

    Or even Snobol into Klingon.

  • I'll be a believer and a fan when a salad of regexes and hacks gets translated into something a sane person wrote for the same purpose. :)

  • already, then why do we need programming languages at all?
