Insights Blog

Unlocking Efficiency: The Role of Coding Assistants in Transforming Data Engineering

In the field of data engineering, a significant change is happening thanks to the emergence of Generative AI. The rise of coding assistants such as OpenAI’s ChatGPT and Microsoft’s GitHub Copilot is influencing traditional methods, affecting platforms like Stack Overflow, and transforming how data engineers search for and apply coding solutions.

Software engineers embrace ChatGPT and GitHub Copilot

According to one article, software engineers are increasingly turning to AI chatbots and GitHub Copilot for coding assistance instead of traditional forums like Stack Overflow. ChatGPT in particular, powered by OpenAI’s language model, has become a game-changer and has gained popularity for its efficiency in producing detailed code examples, full functions and code explanations. Meanwhile, GitHub Copilot, which is also built on OpenAI language models, can similarly fast-track code development and reduces the need to search Stack Overflow for code to copy and paste. DB-GPT, an open-source solution, can be used for database code generation. These developments mark a significant shift in how code snippets are accessed and used, not only for software engineers but increasingly for data engineers too.

The benefits also extend beyond experienced data engineers, as Copilot is being touted as a training tool for newcomers to the field. For example, it can help with learning new languages: instead of searching Stack Overflow for the right way to do something in a given language, newcomers can simply ask Copilot to suggest it. Similarly, an attractive feature of GitHub Copilot is that it works with a broad set of frameworks and languages, with its technical preview covering languages such as Python and JavaScript. Moreover, because it is trained on publicly available source code and natural language, Copilot handles both programming and human languages, allowing trainee data engineers to describe a task in English and have Copilot provide the corresponding code.
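To make the "describe a task in English" workflow concrete, here is a minimal, hypothetical sketch. The comment stands in for a plain-English task description, and the function is the kind of suggestion an assistant might offer. It is not actual Copilot output, and the function name and sample records are invented for illustration.

```python
from collections import defaultdict

# Hypothetical task description, written in plain English:
# "Group a list of order records by customer and total the amount for each."


def total_by_customer(orders):
    """Return a dict mapping each customer to the sum of their order amounts."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer"]] += order["amount"]
    return dict(totals)


# Invented sample data, used only to exercise the function.
orders = [
    {"customer": "acme", "amount": 120.0},
    {"customer": "globex", "amount": 75.5},
    {"customer": "acme", "amount": 30.0},
]
print(total_by_customer(orders))  # {'acme': 150.0, 'globex': 75.5}
```

Even for a small snippet like this, a newcomer still needs to read the suggestion and confirm that it does what was asked.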

Stack Overflow’s bumpy ride

The huge impact of GenAI on developer communities is clear when comparing the declining popularity of Stack Overflow to the rising success of GitHub. Observers note that Stack Overflow has lost about 35% of its traffic over an 18-month period beginning in 2022. In contrast, ChatGPT, which was introduced in late 2022, recorded 1.6 billion visits in March 2023 alone. This shift suggests that developers are opting for AI-driven coding assistants over traditional platforms.

Even among developer-focused platforms, where GitHub stands as a peer to Stack Overflow, the integration of Copilot has given GitHub a competitive edge. The growing interest in GitHub Copilot underscores the changing preferences of data engineers. This shift is not just a matter of preference for coders; it represents a broader trend that is challenging traditional methods of solving coding problems. Stack Overflow is working to adapt by developing its own coding assistant, OverflowAI, and is considering charging tech companies that use its data for AI models, but it still faces challenges in this fast-moving landscape.

Proceed with caution

One of the main concerns with coding assistants is that they may generate code identical to code released under open-source licenses, some of which restrict or place conditions on derivative works, and that developers could use this code unknowingly.

Reviewers of the GitHub Copilot technical preview have previously said that its long-term scalability and quality assurance still need to be tested. While it can be helpful for simple projects and boilerplate code, it may struggle with complex and specialized data engineering tasks. It is therefore recommended that code suggestions be used with caution, especially since the underlying models are trained on large quantities of GitHub code, which may be of variable quality. Ensuring the quality, accuracy, and security of AI-generated code is particularly important in data engineering, where the integrity of data and data pipeline processes is critical.
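One way to apply this caution is to put any suggested code behind checks written by a human reviewer before it reaches a pipeline. The sketch below is illustrative only: `dedupe_latest` is a hypothetical stand-in for an assistant-suggested helper, and `review_checks` holds the reviewer's own expectations. The record fields and dates are invented.

```python
def dedupe_latest(records):
    """Hypothetical assistant-suggested helper: keep the newest record per id."""
    latest = {}
    for rec in records:
        current = latest.get(rec["id"])
        # ISO-8601 date strings compare correctly as plain strings.
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[rec["id"]] = rec
    return list(latest.values())


def review_checks():
    """Human-written checks encoding what the pipeline actually requires."""
    sample = [
        {"id": 1, "updated_at": "2023-01-01", "value": "old"},
        {"id": 1, "updated_at": "2023-03-01", "value": "new"},
        {"id": 2, "updated_at": "2023-02-01", "value": "only"},
    ]
    result = dedupe_latest(sample)
    by_id = {rec["id"]: rec for rec in result}

    assert len(result) == 2            # one row per id
    assert by_id[1]["value"] == "new"  # the newest version wins
    assert by_id[2]["value"] == "only" # untouched ids pass through
    assert dedupe_latest([]) == []     # empty input is handled
    return True


review_checks()
```

The checks are deliberately small. The point is that a person decides what correct means for the data, rather than trusting the suggestion as given.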

The additional choice provided by coding assistants comes with caveats. According to experts, tools like Copilot and ChatGPT can help generate code snippets and solve coding challenges, but complex data engineering solutions often require a human touch, and overreliance on AI-generated code can limit the scope for creative problem-solving. In data engineering, where data accuracy is paramount, human oversight remains crucial for identifying and rectifying issues in AI-generated code.

While coding assistants like GitHub Copilot and ChatGPT offer valuable assistance to data engineers, they come with their own set of challenges and limitations. Data engineering tasks often involve complex, unique, and sensitive data operations, which makes human oversight necessary. Integrating AI into the data engineering workflow calls for a balanced consideration of both its advantages and its limitations to ensure successful and compliant data engineering practices.