DeepTour

Mohamed Mohey

View on GitHub →

Inspiration

I was looking for a project idea when I stumbled across codetour and once I had the idea for this project I had to do it.

Plus there was an extra limitation because the Codex api, while available in 135+ countries, isn't available in Egypt. (I emailed them and it's not their fault, it's a legal thing or something)

What it does

It basically auto-generates files that work with the codetour extension for code summarization, refinement, translation and any other use case you can come up with!

It allows you to create your own custom tours using your own models, like only using the code refinement model if the defect detection model found a defect. Or maybe you could train a model that links an issue to a line of code and generates a tour for that. The sky's the limit!

How we built it

All the AI models are from HuggingFace. This isn't a requirement but it's much easier than the painful endeavour of trying to make research code work as a simple api (thanks HuggingFace!)

There are parsers that get the Abstract Syntax Trees for each file to extract the function bodies and then unparse the AST back into strings. Each different language uses different libraries like ast for python, esprima for javascript, and javalang for java.

GitPython to get the git blame output for function declarations to get an estimate of when the function was first created.

Some Node.js boilerplate for the extension ¯_(ツ)_/¯

Challenges we ran into

Oh where to start, where to start:

  • While HuggingFace saved me, there doesn't exist nearly enough models fine-tuned on code. Most work on Java only. And a whole lot of models don't have model cards so I either didn't use them or kept scratching my head till I got them working.
  • Extracting functions from files is hard. Whether it's ast, esprima, or java. And unparsing Abstract Syntax Trees into strings is even harder. (I mean, yes, I'm calling APIs that do that for me, but even getting those APIs to work is hard.)
  • Generating git blame outputs was first done by calling a bash script that does that with the git cli. Then I realized that was stupid and directly used GitPython for a slightly less stupid approach
  • Getting the whole thing to work as a VSCode extension (even if it's a wrapper) wasn't easy. Had to figure out how to make node start a python program and act cool throughout.

Accomplishments that we're proud of

It works! It works! sort of...

What we learned

1- GitPython 2- A bit more about using HuggingFace models 3- ASTs are awesome 4- ASTs are hard 5- Deep Learning for software engineering still has many gaps to be filled both in research and industry, but maybe more so in industry

What's next for DeepTour

Maybe, maybe, maybe cleaner code, more and better models, more exciting tours, and more people working on it? maybe.

Try it out

https://github.com/mmi333/deeptour_lib

https://github.com/mmi333/deeptour_vscode