Improvements to PyArrow for Astronomy | GSoC'26 - OpenAstronomy

Results are out!!!

Om Biradar — Sun, 14 Jun 2026 12:08:19 GMT

One of my GSoC project's main goals is parallelizing the reading of structs in Parquet files. This is very beneficial to the community, as astronomical data is often stored in structs and other nested structures.

This figure shows the performance increase or decrease of the optimized Parquet reader (which I wrote) over the baseline main branch of apache/arrow. The benchmarks ran on a standard free GitHub runner, with the file loaded into RAM to remove I/O overhead, which is not relevant here.

For smaller files, the overhead of multithreading makes the optimized reader slower, but for real-life cases where files are large, it now provides 25%-30%+ faster reading times.

Compared to flat Parquet files, nested structs now offer similar read speeds with multithreading enabled!

Overall, the project is progressing at a great pace with verifiable results. I hope this gets merged into the upstream Apache Arrow library soon, so that this performance boost can help people working with PyArrow on astronomy datasets.

The link to the PR - https://github.com/apache/arrow/pull/50158

Link to the results colab notebook - https://colab.research.google.com/drive/1TsxFkBSI_Iq0hfXEwNDs_D24acr3yxdC?usp=sharing

Orchestrating and benchmarking repo - https://github.com/OmBiradar/pyarrow-lincc-fw-openastronomy-gsoc26

Community Bonding and week 1 & 2

Om Biradar — Tue, 09 Jun 2026 03:27:42 GMT

The start was amazing! The community bonding was really great. I got to meet the mentors and get to know the whole organizational structure of OpenAstronomy and LINCC Frameworks: the work they do, the people and facilities associated with it, the ways PyArrow and nested-pandas are being used in astronomy, and the expectations they had from the internship. They offered to help me through tasks if I could not do them on my own, and shared some material to look through to get a deeper understanding of the project. I attended the Apache Arrow community meeting with my mentor to introduce the project to them and get their views on it. They were really helpful and even suggested certain things to do to improve the final PR.

Weeks 1 and 2 were spent on improving the parallel reading of Parquet files, which I implemented successfully. Benchmarking these performance changes proved quite hard, as it requires the following, starting with the main Arrow branch:

1. Building Arrow C++ from source
2. Building PyArrow from source
3. Running benchmarking scripts
4. Switching from the main branch to the branch with my changes and repeating steps 1-3

Benchmarking the changes across files of every order of magnitude in size proved to take very long: approximately 140 hours, or about 6 days. I could speed this up by running parallel jobs on GitHub, which had to be configured separately using workflow config settings and a matrix of GitHub runners. This was necessary because GitHub sets a 6-hour timeout on each GitHub Actions job, allows at most 20 concurrent jobs, and auto-cancels jobs that stay queued for 24 hours. Based on these constraints, I had to figure out a way to orchestrate different runners and combine their results, for which I used sqlite3.
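To give an idea of what combining results with sqlite3 can look like, here is a minimal sketch. It is not the code from the benchmarking repo: the `results` table layout, the helper names `create_results_db` and `merge_shard`, and the file names are assumptions, and the timings are placeholder values, not measurements. Each runner is assumed to write its own small database file, which a final job then merges.

```python
import os
import sqlite3
import tempfile


def create_results_db(path: str) -> sqlite3.Connection:
    """Open (or create) a results database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               branch TEXT NOT NULL,
               num_rows INTEGER NOT NULL,
               seconds REAL NOT NULL
           )"""
    )
    return conn


def merge_shard(target: sqlite3.Connection, shard_path: str) -> None:
    """Copy every row from one runner's database into the combined one."""
    shard = sqlite3.connect(shard_path)
    try:
        rows = shard.execute(
            "SELECT branch, num_rows, seconds FROM results"
        ).fetchall()
    finally:
        shard.close()
    with target:  # commits on success, rolls back on error
        target.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)


with tempfile.TemporaryDirectory() as tmp:
    shard_paths = []
    # Placeholder timings standing in for what two runners might report.
    for i, (branch, seconds) in enumerate([("main", 2.0), ("optimized", 1.5)]):
        path = os.path.join(tmp, f"runner_{i}.db")
        shard = create_results_db(path)
        with shard:
            shard.execute(
                "INSERT INTO results VALUES (?, ?, ?)", (branch, 1_000_000, seconds)
            )
        shard.close()
        shard_paths.append(path)

    combined = create_results_db(os.path.join(tmp, "combined.db"))
    for path in shard_paths:
        merge_shard(combined, path)
    summary = combined.execute(
        "SELECT branch, AVG(seconds) FROM results GROUP BY branch ORDER BY branch"
    ).fetchall()
    combined.close()

print(summary)
```

Keeping one database file per runner means the jobs never write to the same file concurrently; only the final merge step touches all of them.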

Proposal for Improving PyArrow for Astronomy

Om Biradar — Tue, 31 Mar 2026 17:41:51 GMT

Motivation

I was casually interviewing at an HFT firm, answering questions about C++ and systems, when the interviewer asked whether I had ever used Arrow in C++ for data processing/handling. Having only used enigma, I had never used Arrow at that time. This made me curious about it. On a side note, I did eventually get placed in a similar role.

I saw that Arrow was coming up as a project in GSoC 2026, which piqued my interest, and on top of that I saw it was under OpenAstronomy. I was like "WOW!"; it's kind of a childhood dream come true of working like an astronomer. Not exactly an astronomer, I know, but quite close to it, in that I get to support the community.

Technical breakdown of proposal

Thus, I decided to apply to GSoC 2026 for the "Improvements to PyArrow for Astronomy" project. This project mainly deals with improving support for nested type structures in Parquet files and their processing in Arrow / PyArrow. In more depth, the issues are:

Parquet reading doesn't use parallel processing for structs: Structs are not read as fast as they could be with multithreading, even though parallel reading is possible and completely safe. Using multithreading here would make reading many astronomy datasets, such as light-curve data, much faster, and it would also help general data processing of any sort. The root cause is that StructReader::LoadBatch() and StructReader::BuildArray() currently iterate over child fields serially. However, since each child manages its own RecordReader, PageReader, and output buffers with no shared mutable state (so no race conditions), parallel execution is thread-safe. By leveraging the existing OptionalParallelFor multithreading support within the Arrow project, we can process each child of a struct in parallel on the CPU thread pool when set_use_threads(true) is enabled. This particularly speeds up processing of time-series astronomy data (like light curves and spectral arrays), where structs may contain large numeric child fields. This implementation requires no change to the API and is purely a performance improvement. I will also write appropriate benchmarks to performance-test this change.
Cannot select child columns inside LIST-STRUCT types: If we want to select A.b, where A is of type list<struct<...>> and b is a field/child of the struct inside A, PyArrow currently doesn't support this, whereas selecting child columns from regular struct types (like selecting P.a where P is a struct) is supported. The issue arises specifically when the STRUCT is nested inside a LIST. My proposal is to expand the nested field resolution logic to detect when a field is a LIST or LARGE_LIST type, unwrap it to access the list's element type, and then search for child fields within that element (which may be a STRUCT type). This would allow selections of the form A.b where A is of type list<struct<...>>.
Arrow compute kernels lack functionality for nested data: Kernels like replace_with_mask do not currently support lists, lists of structs, or other nested data types, so this support needs to be added. The main problem is that supporting nested types means supporting the unbounded number of combinations that are possible (lists of structs, structs containing lists, and so on to arbitrary depth). Thus, support for these data types must be added to the kernels dynamically. There is another problem: working out how to handle each nested type by inspecting the whole schema up front may be less efficient than an onion-peel method. Take list<struct<...>> as an example. We first unpack the top-level structure, the list, using a method that knows how to unpack lists. We then pass the result to a method that tries to unpack the data further. Here we are left with a struct, so a function that knows how to unpack structs works on it. Now we are left with non-nested data types, which the existing methods can handle. This is my proposal: to use this technique to handle nested data layer by layer, dynamically.