Proposal for Improving PyArrow for Astronomy

Motivation

I was casually interviewing at an HFT firm, answering questions about C++ and systems, when the interviewer asked whether I had ever used Arrow in C++ for data processing. Having only used enigma, I had never used Arrow at that point. This made me curious about it. On a side note, I did eventually get placed in a similar role.

I saw that Arrow was coming up as a project in GSoC 2026, which piqued my interest, and I also saw that it's under OpenAstronomy. I was like "WOW!", it's kind of like a childhood dream come true of working like an astronomer. Not exactly an astronomer, I know, but quite close to it, since I would be supporting the community.

Technical breakdown of proposal

Thus, I decided to apply to GSoC 2026 for the "Improvements to PyArrow for Astronomy" project. This project mainly deals with improving support for nested types in Parquet files and the processing of these in Arrow / PyArrow. In more depth, the issues are:

Parquet reading doesn't use parallel processing within structs: Structs are not read as fast as they could be with multi-threading, even though doing so is possible and safe. The root cause is that StructReader::LoadBatch() and StructReader::BuildArray() currently iterate over child fields serially. However, each child manages its own RecordReader, PageReader, and output buffers with no shared mutable state, so there are no race conditions and parallel execution is thread-safe. By leveraging the existing OptionalParallelFor support in the Arrow project, we can process each child of a struct in parallel on the CPU thread pool when set_use_threads(true) is enabled. This particularly speeds up time-series astronomy data (like light curves and spectral arrays), where structs may contain large numeric child fields, and it should help general data processing as well. The change requires no API changes and is purely a performance improvement. I will also write benchmarks to measure its effect.
Cannot select child columns inside LIST-STRUCT types: If we want to select A.b, where A is of type list<struct<b: int, c: int>> and b is a field of the struct inside the list, PyArrow currently doesn't support this. In contrast, selecting child columns from regular struct types (like selecting P.a where P is a struct) is supported. The issue arises specifically when the STRUCT is nested inside a LIST. My proposal is to extend the nested field resolution logic to detect when a field is a LIST or LARGE_LIST type, unwrap it to access the list's element type, and then search for child fields within that element (which may be a STRUCT type). This would allow selections like A.b where A is of type list<struct<b: int, c: int>>.
Arrow compute kernels lack functionality for nested data: Kernels like replace_with_mask do not currently support lists, lists of structs, or other nested data types, so this support needs to be added. The main problem is that supporting nested types means supporting an unbounded number of combinations, like list<list<struct>>, list<struct>, and so on. Thus the kernels must handle these data types dynamically rather than case by case. There is another problem: working out how to handle each nested value by inspecting the whole schema might be less efficient than an "onion peel" method. In that method we first unpack the outermost layer. Taking list<struct> as an example, we unpack the list first using a function that knows how to unpack a list, then pass the result to a function that tries to unpack it further. Here we are left with a struct, so a function that knows how to unpack a struct works on it. Now we are left with non-nested data types that the existing kernels already handle. This is my proposal: to handle nested data dynamically, layer by layer.
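To make the first issue concrete, here is a minimal Python sketch of the "one task per struct child" idea. It is not the Arrow C++ code: decode_child and read_struct are hypothetical names, and the doubling step is just a stand-in for real column decoding. It only illustrates the structure (independent children, same result serially or in parallel), not the actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_child(raw):
    """Hypothetical stand-in for loading one child column.

    It reads only its own input and builds its own output list,
    so there is no state shared with the other children.
    """
    return [value * 2 for value in raw]


def read_struct(children, use_threads):
    """Decode every child of a struct, serially or on a thread pool."""
    names = list(children)
    if not use_threads:
        decoded = [decode_child(children[name]) for name in names]
    else:
        with ThreadPoolExecutor() as pool:
            # map() returns results in input order, so fields keep schema order.
            decoded = list(pool.map(decode_child, [children[n] for n in names]))
    return dict(zip(names, decoded))


# A toy "light curve" struct with three numeric children.
light_curve = {"time": [1, 2, 3], "flux": [10, 20, 30], "flux_err": [1, 1, 2]}
serial = read_struct(light_curve, use_threads=False)
parallel = read_struct(light_curve, use_threads=True)
```

Because each child is handled independently, switching use_threads on or off changes only how the work is scheduled, which mirrors the "no change in the API" goal.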
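The LIST-STRUCT field resolution can be sketched on a toy type model. The classes Primitive, ListType, and StructType and the function resolve are hypothetical illustrations, not PyArrow's API; the sketch only shows the proposed "unwrap the list, then look inside the struct" step and returns the type of the resolved field.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    name: str


@dataclass
class ListType:
    element: object  # type of each list element


@dataclass
class StructType:
    fields: dict  # child name -> child type


def resolve(root_fields, dotted_path):
    """Resolve a dotted path such as "A.b", unwrapping list layers on the way."""
    first, *rest = dotted_path.split(".")
    current = root_fields[first]
    for name in rest:
        # Proposed step: peel off list layers to reach the element type.
        while isinstance(current, ListType):
            current = current.element
        if not isinstance(current, StructType):
            raise KeyError(f"{name!r}: parent is not a struct")
        current = current.fields[name]
    return current


schema = {
    "P": StructType({"a": Primitive("int")}),
    "A": ListType(StructType({"b": Primitive("int"), "c": Primitive("int")})),
}
plain_struct_field = resolve(schema, "P.a")  # struct child, no list involved
list_struct_field = resolve(schema, "A.b")   # child of a struct inside a list
```

Without the while loop, resolving "A.b" would stop at the list and fail, which is the toy-model equivalent of the current limitation.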
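The onion-peel idea can also be sketched in a few lines of plain Python. This is a toy model under stated assumptions: types are small dicts with a "kind" key, values are ordinary Python lists and dicts, and unpack_list, unpack_struct, UNPACKERS, and peel are hypothetical names. The leaf function stands in for an existing kernel that already handles non-nested data.

```python
def unpack_list(dtype, value, leaf_fn):
    """Knows only how to open a list layer; delegates each element."""
    return [peel(dtype["element"], item, leaf_fn) for item in value]


def unpack_struct(dtype, value, leaf_fn):
    """Knows only how to open a struct layer; delegates each child."""
    return {
        name: peel(child, value[name], leaf_fn)
        for name, child in dtype["fields"].items()
    }


# One unpacker per nested kind; anything else is treated as a leaf.
UNPACKERS = {"list": unpack_list, "struct": unpack_struct}


def peel(dtype, value, leaf_fn):
    """Peel one layer at a time until only non-nested data is left."""
    unpack = UNPACKERS.get(dtype["kind"])
    if unpack is None:
        return leaf_fn(value)  # stand-in for an existing flat kernel
    return unpack(dtype, value, leaf_fn)


INT = {"kind": "int"}
STRUCT = {"kind": "struct", "fields": {"b": INT, "c": INT}}
LIST_OF_STRUCT = {"kind": "list", "element": STRUCT}
LIST_OF_LIST = {"kind": "list", "element": LIST_OF_STRUCT}

flat = peel(LIST_OF_STRUCT, [{"b": 1, "c": 2}, {"b": 3, "c": 4}], lambda x: x + 1)
deep = peel(LIST_OF_LIST, [[{"b": 1, "c": 2}], []], lambda x: x * 10)
```

The same two unpackers cover list<struct> and list<list<struct>> without any extra code, which is the point of handling nesting one layer at a time instead of enumerating combinations.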
