I know what a compiler does but how does it work exactly? What is it doing, how is it doing it?



gnu plus magic

It's an FSA (finite state automaton)
That's all you need to know


I get the feeling I've asked a question with an answer that is extremely complicated so i'm just going to let it go

Jow Forums doesn’t know shit and never has that’s why you get no answer

split the written code into tokens
define what order tokens are allowed to appear in (the grammar)
convert tokens to the lower-level language of your choice
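The first of those steps can be sketched in a few lines; here's a toy tokenizer (the token names and the regex spec are invented for illustration):

```python
import re

# Toy token spec: order matters, first match wins (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Split source text into (kind, text) tokens, dropping whitespace."""
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("x = 2 + 40"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', '+'), ('NUMBER', '40')]
```

Real lexers (or generated ones from Flex) do the same thing with a state machine instead of a master regex, but the output is the same kind of token stream.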

It parses your source code to make sure it's syntactically correct, maps your source code to assembly/some other intermediate code, does some optimization of the intermediate code, then assembles the intermediate code into machine code and links it with any dependencies to create an executable.

... So I know what a compiler does. But how does it work exactly??? What is it doing, how is it doing it??

Look into GNU Bison, grammars, and the compiler book with a dragon on the cover (the Dragon Book)

That image is right.

Lexical analysis checks that everything in the source code is a valid token.
Like in Java, you can't have the string "c # d" because "#" is not a valid token in Java.
After you create all your tokens, it passes them to the syntax analyzer.

The syntax analyzer checks that all of the tokens appear in the correct order.
Like "if(a && b)" is a valid statement but "if a b(&&)" is not. Same tokens, but wrong order.
You create an abstract syntax tree as you validate groups of tokens, which you pass to the semantic analyzer.

Semantic analysis is where you check that everything makes sense and will actually work - type checking, verifying that functions are declared and called with the right number of arguments, and so on.

Code generation is where you spit out Java bytecode.
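The "same tokens, wrong order" point can be sketched with a toy rule: say an if-condition must be a name, optionally followed by `&&` and another name, repeated. Nothing here is Java-specific; the token shapes are invented:

```python
def valid_condition(tokens):
    """Accept NAME ('&&' NAME)* -- the same tokens in the wrong order fail."""
    if not tokens or tokens[0][0] != "NAME":
        return False
    i = 1
    while i < len(tokens):
        # every subsequent pair must be ('&&', NAME)
        if tokens[i] != ("OP", "&&") or i + 1 >= len(tokens) or tokens[i + 1][0] != "NAME":
            return False
        i += 2
    return True

ok  = [("NAME", "a"), ("OP", "&&"), ("NAME", "b")]   # a && b
bad = [("NAME", "a"), ("NAME", "b"), ("OP", "&&")]   # a b &&
print(valid_condition(ok), valid_condition(bad))  # True False
```

A real syntax analyzer does this for the whole grammar at once, building the AST as it matches rules.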

Which part? For example, when it is parsing the source code, it reads through the code and checks that the rules enforced by the language it is compiling are being upheld. So it will keep track of how many left/right brackets, quotation marks, etc. there are in the code, and if they don't balance out at the end of the file it will throw an error because you didn't close a bracket or something. It will also keep track of variable names and data types, making sure that variables are only used after they are defined and that the data type is valid for certain operations/function calls. It then creates a tree representing the code which shows all the simple operations and function calls. To generate the intermediate code it walks this tree and converts each operation into the corresponding intermediate code operation.
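The bracket-balancing part of that description is just a stack; a toy version (real compilers get this for free from the grammar, but the idea is the same):

```python
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(src):
    """Push open brackets on a stack; a mismatched or leftover bracket is an error."""
    stack = []
    for ch in src:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False  # closed something that wasn't open
    return not stack  # anything left open is unbalanced

print(brackets_balanced("int main() { return f(1); }"))  # True
print(brackets_balanced("if (x { y(); }"))               # False
```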

read "Compilers: Principles, Techniques, and Tools" (the Dragon Book)

Just look up any of the terms in your image and you can find explanations and algorithms.

There are a few free compiler courses on various platforms. Enroll for one if you're interested.
There's probably a few blogs with reasonable guides as well. And books. This board is a really bad place to ask about this.

A compiler consists of a few stages:

Lexer: Splits the input text into tokens and categorizes them, i.e. what is a variable name, what is an operator, what is a function call, etc.

Parser: Constructs an abstract syntax tree (AST) from the token list that describes the structure of the program.

Optimizer: Modifies the abstract tree to optimize the program's execution speed and code size. Examples are constant folding and hoisting. Google those for explanations.

Codegen/backend: Converts the optimized AST to assembly for your chosen CPU architecture. Some platform-specific optimizations like SIMD or built-in memcpy/memset functions are applied here.

This is where the compiler's role ends and it outputs a single assembly file per C or C++ source file (use gcc -S to see the assembly output). In the case of the Java compiler (non-AOT compilation), it would output Java bytecode instead of assembly.
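The constant folding the optimizer stage mentions can be sketched on a toy AST of nested tuples (an invented representation, not any real compiler's IR):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Recursively replace all-constant subtrees like ('+', 2, 3) with 5."""
    if not isinstance(node, tuple):
        return node  # leaf: a literal or a variable name
    op, lhs, rhs = node
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return OPS[op](lhs, rhs)  # both sides known at compile time
    return (op, lhs, rhs)

# x * (2 + 3)  ->  x * 5
print(fold(("*", "x", ("+", 2, 3))))  # ('*', 'x', 5)
```

Hoisting works on the same tree, moving loop-invariant subtrees out of loops instead of collapsing them.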

>read a 20 year old book
fuck off, faggot

>she doesn't want to read a 20 year old book on a topic that has stagnated for 60 years


you lost all credibility instantly


Languages are the oldest topic in computing. There have been basically no changes to the core principles in literal decades - computers just do it faster now.
In the 90s there were some papers published on exploiting concurrency in compilers, but no beginner needs to know anything about that.

The best way to learn might be making your own compiler using Flex and Bison.
Just make your own test language with only integers and simple additions.
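For a language that's only integers and additions, you don't even need Flex/Bison to see the shape of it. A hand-rolled sketch that "compiles" to a toy stack machine (the mnemonics are invented):

```python
def compile_additions(src):
    """Parse '1 + 2 + 3' and emit toy stack-machine code."""
    tokens = src.replace("+", " + ").split()
    code = []
    expect_number = True
    for tok in tokens:
        if expect_number:
            code.append(f"PUSH {int(tok)}")  # raises ValueError on a bad literal
            expect_number = False
        else:
            if tok != "+":
                raise SyntaxError(f"expected '+', got {tok!r}")
            expect_number = True
    if expect_number:
        raise SyntaxError("dangling '+'")
    code += ["ADD"] * src.count("+")  # one ADD per '+' folds the stack down
    return code

print(compile_additions("1 + 2 + 3"))
# ['PUSH 1', 'PUSH 2', 'PUSH 3', 'ADD', 'ADD']
```

With Flex/Bison you'd get the tokenizing and the grammar checking generated for you, and the actions in the grammar would emit the same kind of code.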

My faculty is the best in my country at computer science and there isn't a single course on compilers. I always wanted to learn this shit.

Usually there's an intermediate step: compilers will transform your AST into an IR, since it is easier to optimize, and they will generate assembly from the IR rather than from the AST.

OP, that diagram is essentially what every modern, optimizing compiler does.
No, this question is not that complicated for simple languages with non-optimizing compilers.
No, you don't have to read a book to have a basic understanding.
If you really want to know how a compiler works, just build one. I'm a complete brainlet and even I built one in high school.
To get started, just write a simple compiler for a language composed of the following: variables, arithmetic expressions, functions, and function calls. Functions can return void or int.
Use flex and bison for parsing. The point is you want to get to semantic analysis and code gen as fast as possible.
Flex will lex things for you and create tokens. Bison will parse the language and create an AST for you. The language I described can have a < 50 line grammar. Semantic analysis should expose you to the basics of type-checking (a call to a function returning void cannot be used in an integer expression), symbol tables (checking for re-declared variables), etc.
Then just generate assembly from the AST. Just keep everything in memory and load, perform an op, and store for every operation.
This should take you < 1 month even dicking around. Add stuff to your language at leisure and you can even go back and generate IR or hand-write the parser and lexer.
Semantic analysis and code gen are usually the hardest stuff to get right. Getting those right and moving backwards and filling in the holes later is much more interesting. Why do you think university courses don't make you hand-write a parser? Parsers and lexers are essentially "solved" problems for modern programming languages. Semantics (type systems) and codegen are much more interesting.
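The semantic-analysis checks described above (re-declared variables, void used in an integer expression) might look like this as a sketch. The data shapes and rules are invented for illustration, not from any real compiler:

```python
def check(decls, uses):
    """decls: list of (name, type). uses: list of (name, context).
    Flags re-declarations, undeclared names, and void values in int expressions."""
    symtab, errors = {}, []
    for name, typ in decls:
        if name in symtab:
            errors.append(f"re-declaration of {name}")
        else:
            symtab[name] = typ  # first declaration wins
    for name, context in uses:
        if name not in symtab:
            errors.append(f"undeclared {name}")
        elif context == "int_expr" and symtab[name] == "void":
            errors.append(f"void value {name} used in integer expression")
    return errors

errs = check([("f", "void"), ("x", "int"), ("x", "int")],
             [("f", "int_expr"), ("y", "stmt")])
print(errs)
# ['re-declaration of x', 'void value f used in integer expression', 'undeclared y']
```

A real symbol table is a stack of these dicts, one per scope, so inner scopes can shadow outer ones.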

Why is this thread still alive when the only correct answer has been given 11 hours ago?

Compilers was hands down my favorite class in college. The projects were to literally build a compiler from the ground up.

lagunita.stanford.edu/courses/Engineering/Compilers/Fall2014/about
Make sure you do the projects. That's the important part.

wtf is fsa

I remember taking this course some time ago. It was ok-ish. Wasted too much effort on parsing, did nearly nothing on intermediate generation, and the code generation part was a bit incomplete. But it's an OK start if you prefer audio/video materials.

The videos are also hosted on youtube, but not listed publicly. Apparently past me thought it would be a good idea to copy all the links into the notes. So there you go:

pastebin.com/yriJ1EY1

skip IR go right to code gen

>I know what a compiler does but how does it work exactly? What is it doing, how is it doing it?
Depends on the language.
Some are just interpreted, so they don't even have a compiler.
I'll assume C with the gcc compiler for the sake of not being an asshat.

So, in gcc, there's a "linker" and an "assembler" involved, but I don't entirely remember how they fit together.
Essentially, your C code is converted to assembly before it's converted to actually executable machine code.

it's usually designed as a pipeline of transformations from one form to another and passes that operate on these forms. I'll describe the commonly taught way; note that it's a dogma and a few languages do things differently (lisp, forth, ...) or some domain-specific ones don't do the full pipeline (regex engines, ...)

1) lexical analysis
lexing/scanning transforms text into tokens - lexeme string, type of token and position in source.
parsing - takes this stream of tokens, matches them to grammar rules, and constructs a tree out of them. the rules correspond to operations and figures in the language such as function, statement, expression, literals, call, member access, ...
the result of this pass is an abstract syntax tree (AST) which directly corresponds to the source file. lots of books and courses waste so much time on this and on parsing/grammar theory - ignore it. it's a waste of time; do the simple thing instead.
2) semantic analysis
if it's a valid program: type checking; name resolution; scoping of symbols and use after definition; returns and loops in the proper place; ...
this operates on the AST. the goal is to identify whether it's a valid program; the compiler should not identify the program as invalid in any later phase (exceptions might be linker errors), and all further operations operate on a valid program and produce a different valid program.
3) transformation to intermediate representation (IR)
AST isn't the best representation for all the things you might want to do during optimizations. Static single assignment form (SSA) is the popular thing, but there are others (CPS, thinned gated single assignment, value dependence graph, ... to be honest, idk their (dis)advantages at all).
SSA is pretty comfy, it uses 3-operand statements where each variable is only assigned once and used multiple times; for each definition you track all its uses and for each use you link its definition. if this sounds like a directed acyclic graph then you are right. you also track what's constant, ...
(cont...)

after this, the representation doesn't correspond to the source code, although you can track some origin (and ofc names).
4) optimization passes
probably the two most important are moving things out of loops and const propagation. just read the Optimizing compiler wikipedia page for a list. the point of IR is to provide an abstraction where the requirements for these optimizations are easier to find and verify. the transformations go from the IR to another valid IR.
various optimizations are architecture-agnostic, some might be architecture-dependent.
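A minimal sketch of const propagation over an SSA-ish three-address IR (the tuple format and value names here are invented for illustration):

```python
def propagate_constants(stmts):
    """stmts: list of (dest, op, a, b); each dest assigned once (SSA-style).
    Substitute operands known to be constant and fold 'add' where possible."""
    consts = {}  # dest -> known constant value
    out = []
    for dest, op, a, b in stmts:
        a = consts.get(a, a)  # replace operand by its constant if known
        b = consts.get(b, b)
        if op == "const":
            consts[dest] = a
        elif op == "add" and isinstance(a, int) and isinstance(b, int):
            consts[dest] = a + b          # fold: both operands known
            op, a, b = "const", a + b, None
        out.append((dest, op, a, b))
    return out, consts

ir = [("t0", "const", 2, None),
      ("t1", "const", 3, None),
      ("t2", "add", "t0", "t1"),   # folds to const 5
      ("t3", "add", "t2", "x")]    # x unknown, but operand t2 becomes 5
out, consts = propagate_constants(ir)
print(consts["t2"], out[3])  # 5 ('t3', 'add', 5, 'x')
```

Real passes run this to a fixed point and interleave it with dead-code elimination, since folded definitions often become unused.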
5) code generation
this is the architecture-specific part. since some IRs can't be directly emitted (e.g. SSA), there needs to be a transformation done; that can be merged with the arch-specific optimization passes, so let's count those into the codegen part.
arch-specific things are for example usable registers and ABI rules and call conventions - these are a big part of register allocation. but also different cost of some instructions or the list of instructions, which can result in different transformation of arithmetic expressions or such.
register allocation on SSA can be done with traditional graph coloring algorithms where colors correspond to registers, but there are some more sophisticated algos that you can use. the goal is to fit as many things into registers as possible and reduce memory access, reduce moving between registers, ... out-of-order superscalar architectures with deep pipelining and instruction-level parallelism are kinda difficult to handle optimally yet simply, idk what the serious compilers actually do here to decide on optimal instructions and their order.
6) for emitting the final machine code, you need to know addresses of various things such as functions or global variables. object files such as ELF allow you to specify these addresses later. this introduces a linker/loader part of the compiler toolchain.
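The graph-coloring idea for register allocation can be sketched greedily (a toy allocator; real ones use smarter elimination orderings and actually handle spilling):

```python
def color_registers(interference, k):
    """interference: var -> set of vars live at the same time (graph edges).
    Colors 0..k-1 are registers. Returns None if some variable can't be
    colored -- a real allocator would spill it to memory instead."""
    assignment = {}
    for var in sorted(interference):  # deterministic, naive ordering
        taken = {assignment[n] for n in interference[var] if n in assignment}
        reg = next((r for r in range(k) if r not in taken), None)
        if reg is None:
            return None  # spill needed
        assignment[var] = reg
    return assignment

# a, b, c are all simultaneously live -> a triangle, needs 3 registers
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(color_registers(graph, 3))  # {'a': 0, 'b': 1, 'c': 2}
print(color_registers(graph, 2))  # None -> would spill
```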

and there are the languages with a less complete pipeline
e.g. serialization languages only do parsing, regex strings are transformed into a runtime state machine or other source code, interpreters either execute on the AST or on IR, brainfuck can execute directly on tokens, ... the definition of what is a compiler is pretty wide

just search on youtube bro, it's not that complicated, also you might get the answer on why decompiling is sometimes harder than compiling

>idk what the serious compilers actually do here to decide on optimal instructions and their order.
Look up instruction scheduling. Compilers reorder instructions to avoid pipeline stalls, etc.

I know this term, but it isn't that simple.
To illustrate, see for example Demo/Lecture/Optimization.HC in TempleOS
>web.archive.org/web/20170305102929/https://templeos.org/Wb/Demo/Lectures/Optimization.html
I know where to find the latency/throughput numbers on individual instructions in Intel's manual, but it still feels like complete guesswork and the results differ only slightly. I just have not discovered any way to confidently predict the performance characteristics of various choices and was not confident to speak about it, so I skipped it.
To illustrate another good property of SSA IR: it's a DAG, so a topological sort can be done, and the parallel parts can be simply identified and safely reordered.
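That topological-sort point can be sketched with Kahn's algorithm over a def-use DAG (the value names and the dependency-map shape are invented for illustration):

```python
from collections import deque

def topo_sort(deps):
    """deps maps each value to the values it depends on (its operands).
    Kahn's algorithm: emit values whose dependencies are all satisfied."""
    indeg = {n: len(d) for n, d in deps.items()}
    users = {n: [] for n in deps}
    for n, d in deps.items():
        for dep in d:
            users[dep].append(n)  # reverse edges: definition -> its uses
    ready = deque(sorted(n for n, c in indeg.items() if c == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for u in users[n]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order

# t2 uses t0 and t1; t3 uses t2. t0 and t1 are independent (parallel parts).
print(topo_sort({"t0": [], "t1": [], "t2": ["t0", "t1"], "t3": ["t2"]}))
# ['t0', 't1', 't2', 't3']
```

Anything sitting in the ready queue at the same time has no dependency between it, which is exactly the freedom the scheduler exploits when reordering.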

Yeah, I agree with you. At least reading through some research papers, a lot of the optimizations as of now seem to be heuristic-based. The optimizations to avoid pipeline stalls aren't that hard to reason about, but more complex optimizations like reordering reads and writes to avoid data hazards are also impacted by cache dependencies, which compilers today notoriously suck at; i.e. it's not an exact science. I think most compilers use some sort of approximation, trying to put close reads and writes together, but it all seems kind of handwavey.
This is all at the assembly level though (or at the highest LIR level), since you need to consider the specific architecture. I think SSA IR reordering is more useful to enable other optimizations, since usually at the MIR level, you don't consider your exact architecture. That form of SSA reordering usually enables other things like CSE or copy prop.

Thinking about your response a bit more and looking at your link, I realize that you also need to consider the register renamer as well in conjunction with what micro-ops the processor generates. I guess modern compilers aren't too good at considering all of these factors together.