“Realistic expectations are necessary when applying for an ERC grant but there's no point in being overly modest,” says Pavel Veselý from the Institute of Informatics, CU, as advice for future applicants. Veselý himself was awarded a prestigious ERC CZ grant in June. With this grant, he and fellow researchers will study the robustness of small data structures against potential attacks that could cause a major error.
You work on algorithms for big data analysis at the Faculty of Physics and Mathematics. How do they work?
These are so-called streaming algorithms, which only need a small memory and a single pass over the data. They are used to estimate simply defined statistics with as little error as possible in a given space. The resulting data structures are called sketches because they attempt to capture important features of the input data in a small space.
Traditional algorithms have virtually arbitrary access to the data and can ‘jump’ back and forth over it as if it were an array, and are not typically limited by memory, which they can increase in direct proportion to the size of the data. In contrast, stream algorithms look at the input data as a sequence that arrives gradually. Imagine running them on a network device that monitors traffic, or on a server that receives certain requests, and it measures how long it takes to process each one. This information is passed to a data structure or stream algorithm. But it has no way to go back in time to previous observations.
The stream algorithm can't store its observations?
It can't store everything, because it works with a small memory, and that limits the ability to store knowledge about the input. It can process every new observation that comes in very quickly, which is another typical feature. Because of the small memory, it can only successfully solve certain basic problems, and it must also learn to forget well and extract from the data the important features that we need to solve a particular problem. For example, figuring out how many requests took more than half a second or a second or even 10 seconds. There may already be relatively few requests that long, and at the same time we don't want to, and can't, store them all because of the small memory.
The problem is that because stream algorithms do not remember everything, except for absolutely trivial tasks (for example, the sum of all the numbers in the input), we always have to allow for some error with them.
What are stream algorithms used for in practice?
They are used in applications that work with large data, where we don't need to know the exact result, but only an approximate one will do. For example, in the field of online advertising, where the advertiser needs to know how many different users have seen it. However, we don't need to know the exact number (whether it's 101 or 102 thousand people), so a small error of a few percent usually doesn't matter.
How long have you been working with streaming algorithms?
I got into them during a postdoctoral research fellowship at the University of Warwick. I was a postdoc of Professor Graham Cormode, one of the leading experts on the subject, which he has been working on for over 20 years. It was a tremendous opportunity for me to learn something new.
But I've been interested in algorithms for much longer. In my PhD, I worked on online algorithms that also process data input element by element. With these, we are not limited by memory, but again by the fact that we have to make an immediate decision while processing a new element. For example, we have jobs coming to us to layout, and we have to immediately assign each one to a machine that will process it, without knowing in advance what other jobs will be coming to us. Thus, we may not always get the optimal solution.
Are you conducting basic research?
Yes, but at the same time I'm interested in how stream algorithms behave on real data. We have been testing this with Splunk, an American company that does big data processing, including estimating distributions and percentiles. I'm also working with the open source Apache DataSketches library, for which we've devised a brand new stream algorithm for estimating percentiles with a high accuracy for elements at the edge of the distribution.
So you're actually applying basic research findings into practice, which very few scientists do.
In the beginning, this is more or less mathematical research. I do an analysis of how the algorithm is supposed to work, and I fix it so that it has some guarantee of error. The smaller the error rate, the better. The result of theoretical research usually gives me a worst-case error guarantee. I then use experiments to see how the algorithm behaves on real data. It is thanks to Professor Cormode that I have reached the people who apply these algorithms in practice.
What is the most interesting thing about this collaboration?
I'm interested in the types of problems people encounter in practice, and I try to use their experience to improve the algorithms. For example, Splunk uses an algorithm with no guarantee of error that works very well in practice. We compared it to our stream algorithm, which we brought in from our theoretical research. The results are quite unsurprising. The Splunk heuristic behaves much better on real data that one encounters in practice, but it can have an arbitrarily large error, which can sometimes be a problem. In contrast, our algorithm behaved stably - it still had more or less the same error rate on all data inputs. The mathematical analysis also showed us that it is very difficult to "throw it off". One of my goals now is to overcome the aforementioned heuristics, while maintaining some guarantee of error.
When did you realise that your research had the potential to win an ERC CZ grant?
I like challenges and when the opportunity to apply came up, I took it. In a way it gets your priorities straight, it makes you think about what is really important in your research and how to highlight its potential.
What specifically will you be working on?
I want to move towards developing data structures robust to attacks designed to cause a big error. There are a number of fairly large open problems.
How much time do you have to solve them?
I've been given two years within the ERC CZ. The Czech Ministry of Education, Youth and Sports offers this grant to male and female scientists whose projects have received an excellent rating in the European Research Council (ERC) grant competition but have not been supported due to lack of funding. In practical terms, the grant will pay for a high quality postdoctoral researcher, which I hope I have already found, and two promising PhD students, one of whom, Tomáš Domes, is finishing a nice thesis with me and will start his PhD in October.
This is your first grant and probably the first time you are dealing with the research side of things as well as the management side.
That's right. Writing the ERC application itself was already new for me. You have to present a big vision, a challenge, which of course requires a great idea. To do this, it's good to learn not to be too modest and to be able to sell previous achievements that are relevant to your current project. You shouldn’t be afraid to take risks. That's the principle of ERC grants in my opinion. Coming up with an idea that can significantly advance the current scientific understanding of the field, while being realistic, albeit risky - may not be successful. That is why Plan B is part of the application for this grant.
Do you worry, ever, that things might not work out?
I love research, so rather than being afraid of anything, I rather look forward to it. Not only the collaboration with PhD students and postdocs, but also international cooperation. After all, I came up with the topic of stream algorithms versus their adaptive adversaries during a talk at a conference. I was intrigued and thinking about directions that would be worth taking. It's an interesting new problem - both mathematically and to some extent, as I mentioned, practically.
To what extent is your work solitary, and when do you need to consult others?
I believe that some colleagues can do longer do research fully their own. I like collaboration and discussions because they largely motivate me to come up with arguments and look for other possible solutions. But I also have periods, for example during the last year of my PhD, when I am on my, thinking, for a long time.
What advice would you give to high school students who enjoy mathematics and computer science? How can they get a clearer picture of these fields?
I would definitely recommend the correspondence seminars that our faculty students prepare for high school students. I was most interested in the programming one. It's a great opportunity for new friendships, too, because the correspondence seminars include camps, something like summer camps, where participants meet people they can talk to about things they don't get to talk about much at a regular high school. During high school, I helped with the development of larger software, so I planned to major in software engineering at the Faculty of Physics and Mathematics. In my sophomore year, however, I picked up combinatorial games as an enthusiastic student would. I became interested in mathematics, especially combinatorics, and went in a more theoretical direction. My point is that I think the most important condition for success is to find what you enjoy, and to do that you need to try as many things as possible at the beginning. In all fields, you will definitely find something interesting if you have a relationship to it, and a good mentor can help you establish that. I have been very lucky to have one and I would like to become one, maybe thanks to the ERC CZ grant.
Pavel Veselý, PhD |
Pavel Veselý graduated from the Faculty of Mathematics and Physics of Charles University. He finished his doctoral studies with a dissertation under the supervision of Professor Jiří Sgall, which dealt with the problem of online algorithms. He then worked as a postdoc at the University of Warwick in the research team of Professor Graham Cormode. He is now at the Institute of Computer Science, CU. His research interests are rooted in theoretical computer science and combinatorics, with a focus on the design of efficient algorithms, especially stream algorithms, and data structures, so-called sketches. This June, he was awarded a prestigious ERC CZ grant for the project Robust stream algorithms versus adaptive adversaries. |